# Purpose

### 2021-12-10
The CLD3 language codes are not the same as the georelevant codes, so we need a lookup table that maps each CLD3 code to a language name. This makes it easier to investigate the detected language.

In this table I also:
- Flag whether a language is included in USE-multilingual (the model used for the initial round of topic models)
- Group languages that are not common on Reddit as `"Other_language"`
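
As a rough illustration of the table described above, here is a minimal sketch of how such a lookup could be assembled with pandas. The codes, the set of "top" languages, and the set of USE-multilingual languages are small hypothetical samples, not the actual lists defined in `subclu`; only the column names follow the table shown later in this notebook.

```python
import pandas as pd

# Illustrative sample only: a handful of CLD3 codes and their names
d_code_to_name = {
    "en": "English",
    "de": "German",
    "zh": "Chinese",
    "zh-Latn": "Chinese",
    "la": "Latin",
}
# Hypothetical subsets, chosen only for this example
top_languages = {"English", "German", "Chinese"}
use_multilingual_languages = {"English", "German", "Chinese"}

df_lookup = pd.DataFrame(
    {
        "language_code": list(d_code_to_name.keys()),
        "language_name": list(d_code_to_name.values()),
    }
)

# Keep the name for top languages, group everything else as "Other_language"
df_lookup["language_name_top_only"] = df_lookup["language_name"].where(
    df_lookup["language_name"].isin(top_languages), "Other_language"
)

# Flag languages covered by USE-multilingual
df_lookup["language_in_use_multilingual"] = df_lookup["language_name"].isin(
    use_multilingual_languages
)

print(df_lookup)
```

Flagging by language name rather than by code means that every code variant of a language (for example `zh` and `zh-Latn`) gets the same flag.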


# Imports & notebook setup

In [1]:
%load_ext google.colab.data_table
%load_ext autoreload
%autoreload 2

In [2]:
# colab auth for BigQuery & google drive
from google.colab import auth, files, drive
import sys  # need sys for mounting gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


### Attach my drive + install my code

In [4]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
g_drive_root = '/content/drive'

try:
    drive._mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')
    
except Exception as e:
    try:
        drive.mount(g_drive_root, force_remount=True)
        print('   Authenticated & mounted Google Drive')
    except Exception as e:
        print(e)
        raise Exception('You might need to manually mount google drive to colab')

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install libraries

In [6]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet


## General imports

In [10]:
# Regular Imports
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted


# Set env variable needed by some libraries to get data from BigQuery
# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

### `subclu` imports (custom module)

In [17]:
# subclu imports

# For reloading, need to force-delete some imported items
try:
    del LoadPosts, LoadSubreddits
    del (
        L_CLD3_CODES_FOR_TOP_LANGUAGES_USED_AT_REDDIT,
        L_CLD3_CODES_FOR_TOP_LANGUAGES_AND_USE_MULTILINGUAL,
        D_CLD3_CODE_TO_LANGUAGE_NAME,
    )
except Exception:
    pass

from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.data.data_loaders import LoadPosts, LoadSubreddits
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.utils.language_code_mapping import (
    DF_LANGUAGE_MAPPING,
    L_CLD3_CODES_FOR_TOP_LANGUAGES_USED_AT_REDDIT,
    L_CLD3_CODES_FOR_TOP_LANGUAGES_AND_USE_MULTILINGUAL,
    D_CLD3_CODE_TO_LANGUAGE_NAME,
)

setup_logging()
print_lib_versions([pd, np])

python		v 3.7.12
===
pandas		v: 1.1.5
numpy		v: 1.19.5


# Load df with codes & languages

- `language_name` = default name
- `language_name_top_only` = only show names of top ~40 languages and group other languages as `Other_language`.
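
To show how these two columns might be used downstream, here is a minimal sketch that joins a lookup onto a table of posts carrying detected CLD3 codes. Both dataframes are hypothetical stand-ins (including the `post_id` column); the real lookup is `DF_LANGUAGE_MAPPING`.

```python
import pandas as pd

# Hypothetical stand-in for the lookup table
df_lookup = pd.DataFrame(
    {
        "language_code": ["en", "de", "la"],
        "language_name": ["English", "German", "Latin"],
        "language_name_top_only": ["English", "German", "Other_language"],
    }
)

# Hypothetical posts with a detected CLD3 code per post
df_posts = pd.DataFrame(
    {
        "post_id": ["a1", "a2", "a3", "a4"],
        "language_code": ["en", "la", "de", "en"],
    }
)

# A left join keeps every post, even if its code is missing from the lookup
df_named = df_posts.merge(df_lookup, how="left", on="language_code")

# Counting on the grouped column keeps rare languages in a single bucket
counts = df_named["language_name_top_only"].value_counts()
print(counts)
```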

Reference article (external analysis) on comments for one week:
- https://towardsdatascience.com/the-most-popular-languages-on-reddit-analyzed-with-snowflake-and-a-java-udtf-4e58c8ba473c

In [18]:
DF_LANGUAGE_MAPPING.shape

(112, 4)

In [20]:
counts_describe(DF_LANGUAGE_MAPPING)

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
language_code,object,112,112,100.00%,0,0.00%
language_name,object,112,103,91.96%,0,0.00%
language_name_top_only,object,112,48,42.86%,0,0.00%
language_in_use_multilingual,bool,112,2,1.79%,0,0.00%


### There should only be 16* languages for USE-multilingual

*15 when we collapse Chinese + Chinese (Taiwan)

We see 20 rows because several codes can map to the same language (for example, `zh`, `zh-cn`, `zh-tw`, and `zh-Latn` all map to Chinese).
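
The difference between the number of rows and the number of languages can be checked by comparing the row count with the count of distinct names. This is a sketch on a small illustrative subset of the mapping (the four Chinese codes appear in the table in this notebook; the rest of the sample is arbitrary).

```python
import pandas as pd

# Illustrative subset: four codes share one language name
df_sample = pd.DataFrame(
    {
        "language_code": ["zh", "zh-cn", "zh-tw", "zh-Latn", "en", "fr"],
        "language_name": ["Chinese"] * 4 + ["English", "French"],
        "language_in_use_multilingual": [True] * 6,
    }
)

mask = df_sample["language_in_use_multilingual"]

# Rows count codes; nunique() counts distinct languages
n_rows = int(mask.sum())
n_languages = int(df_sample.loc[mask, "language_name"].nunique())

print(f"{n_rows} codes -> {n_languages} distinct languages")
```

The same comparison on the full `DF_LANGUAGE_MAPPING` should show whether the flagged rows collapse to the expected number of languages.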

In [21]:
value_counts_and_pcts(DF_LANGUAGE_MAPPING['language_in_use_multilingual'])

Unnamed: 0,language_in_use_multilingual-count,language_in_use_multilingual-percent,language_in_use_multilingual-pct_cumulative_sum
False,92,82.1%,82.1%
True,20,17.9%,100.0%


In [28]:
value_counts_and_pcts(
    DF_LANGUAGE_MAPPING[DF_LANGUAGE_MAPPING['language_in_use_multilingual'] == True],
    ['language_in_use_multilingual', 'language_name_top_only'],
    # sort_index=True,
    cumsum=False, top_n=None,
)

language_in_use_multilingual,language_name_top_only,count,percent
True,Chinese,4,20.0%
True,Japanese,2,10.0%
True,Russian,2,10.0%
True,Arabic,1,5.0%
True,Dutch,1,5.0%
True,English,1,5.0%
True,French,1,5.0%
True,German,1,5.0%
True,Italian,1,5.0%
True,Korean,1,5.0%


In [30]:
DF_LANGUAGE_MAPPING[DF_LANGUAGE_MAPPING['language_in_use_multilingual'] == True]

Unnamed: 0,language_code,language_name,language_name_top_only,language_in_use_multilingual
5,ar,Arabic,Arabic,True
107,zh,Chinese,Chinese,True
110,zh-tw,Chinese,Chinese,True
109,zh-Latn,Chinese,Chinese,True
108,zh-cn,Chinese,Chinese,True
72,nl,Dutch,Dutch,True
19,en,English,English,True
27,fr,French,French,True
16,de,German,German,True
45,it,Italian,Italian,True


# Save table to BigQuery

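NOTE: Row order is not guaranteed in the final BigQuery table.

Setting `chunksize` to a number smaller than the full df size uploads the rows in sequential batches, which seems to "force" the original order, but this is a workaround rather than a guarantee.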
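
An alternative that does not depend on upload behavior is to store the intended order in a column and sort on it at query time. The sketch below is an assumption-laden illustration, not what this notebook does: the `row_order` column and the table name in the query are hypothetical, and the small dataframe stands in for `DF_LANGUAGE_MAPPING`.

```python
import pandas as pd

# Hypothetical small frame standing in for the real mapping
df = pd.DataFrame(
    {
        "language_code": ["de", "ar", "en"],
        "language_name": ["German", "Arabic", "English"],
    }
)

# Sort as desired, then record each row's position in an explicit column
df_to_upload = (
    df.sort_values("language_name")
    .reset_index(drop=True)
    .assign(row_order=lambda d: list(range(len(d))))
)

# Readers of the table can then request a deterministic order.
# The table name below is a placeholder.
query = """
SELECT *
FROM `my_dataset.my_lookup_table`
ORDER BY row_order
"""

print(df_to_upload)
```

With an explicit sort key in the table, the result order no longer depends on how the rows were loaded.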

In [35]:
(
    DF_LANGUAGE_MAPPING
    .assign(table_creation_date=pd.to_datetime(datetime.utcnow().date()))
    .to_gbq(
        destination_table='david_bermejo.language_detection_code_to_name_lookup_cld3',
        project_id='reddit-employee-datasets',
        chunksize=10,
        if_exists='replace'
    )
)

10 out of 112 rows loaded.
20 out of 112 rows loaded.
30 out of 112 rows loaded.
40 out of 112 rows loaded.
50 out of 112 rows loaded.
60 out of 112 rows loaded.
70 out of 112 rows loaded.
80 out of 112 rows loaded.
90 out of 112 rows loaded.
100 out of 112 rows loaded.
110 out of 112 rows loaded.
112 out of 112 rows loaded.
12it [00:43,  3.65s/it]
