# Purpose

Find the best model (so far) and do some basic EDA to select what to upload to BigQuery.

From manual inspection in the MLflow GUI, the best candidate is:<br>
`134cefe13ae34621a69fcc48c4d5fb71`

Because:
- it has high scores in the 100-to-200 and 200-to-300 bins (ranges for the number of clusters, k)
- and it covers the most subreddits (fewer subreddits were filtered out for low post counts)

Other runs had slightly higher scores in the 200-to-300 bin, but they clustered fewer subreddits.
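
The selection above was made by eye in the MLflow GUI. A minimal sketch of the same trade-off in pandas, assuming a table with one row per run. The column names and values below are hypothetical stand-ins, not the real MLflow metric names or scores:

```python
import pandas as pd

# Hypothetical runs table; columns and values are illustrative only.
df_runs = pd.DataFrame({
    "run_id": ["run_a", "run_b", "run_c"],
    "score_100_to_200": [0.50, 0.49, 0.45],
    "score_200_to_300": [0.52, 0.55, 0.44],
    "n_subreddits": [1000, 800, 1000],
})

# Prefer runs that cover the most subreddits, then break ties
# by the mean score across the two bins of interest.
score_cols = ["score_100_to_200", "score_200_to_300"]
df_runs["mean_score"] = df_runs[score_cols].mean(axis=1)
df_ranked = df_runs.sort_values(
    by=["n_subreddits", "mean_score"], ascending=False
).reset_index(drop=True)

best_run_id = df_ranked.loc[0, "run_id"]
print(df_ranked)
```

Sorting by coverage first mirrors the reasoning above: a run with a slightly higher score in one bin still loses if it clusters fewer subreddits.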

# Imports & Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import logging
import os
from pathlib import Path

import numpy as np
import pandas as pd
import plotly
import seaborn as sns

import mlflow
import hydra

import subclu
from subclu.eda.aggregates import compare_raw_v_weighted_language
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl, 
    get_color_dict, base_colors_for_manual_labels,
    check_colors_used,
)
from subclu.data.data_loaders import LoadPosts, LoadSubreddits, create_sub_level_aggregates


# ===
# imports specific to this notebook
from collections import Counter
# import umap
# import openTSNE
# from openTSNE import TSNE

# import hdbscan

import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize  # if we normalize the data, euclidean distance is approx of cosine

from sklearn.cluster import KMeans, DBSCAN, OPTICS, AgglomerativeClustering

print_lib_versions([hydra, np, pd, plotly, sklearn, sns, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
sklearn		v: 0.24.1
seaborn		v: 0.11.1
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Get experiment IDs to use for clustering

Search all runs in experiment `18` and keep the ones that finished. The run selected in the Purpose section belongs to this experiment:<br>
`134cefe13ae34621a69fcc48c4d5fb71`

In [6]:
%%time

df_mlf = mlf.search_all_runs(experiment_ids=[18])
df_mlf.shape

CPU times: user 303 ms, sys: 15.5 ms, total: 318 ms
Wall time: 319 ms


(97, 81)

In [9]:
mask_finished = df_mlf['status'] == 'FINISHED'
# mask_df_similarity_complete = ~df_mlf['metrics.df_sub_level_agg_a_post_only_similarity-rows'].isnull()

df_mlf_clustering_candidates = df_mlf[mask_finished]
df_mlf_clustering_candidates.shape

(97, 81)

In [11]:
cols_with_multiple_vals = df_mlf_clustering_candidates.columns[df_mlf_clustering_candidates.nunique(dropna=False) > 1]

print(len(cols_with_multiple_vals))
# df_mlf_clustering_candidates[cols_with_multiple_vals]

55


# Select model & inspect artifacts

In [12]:
run_uuid = '134cefe13ae34621a69fcc48c4d5fb71'

In [14]:
style_df_numeric(
    df_mlf[df_mlf['run_id'] == run_uuid][cols_with_multiple_vals],
    rename_cols_for_display=True,
)

Unnamed: 0,run id,artifact uri,start time,end time,metrics.primary topic- 300 to 400- adjusted rand score,metrics.primary topic- 100 to 200- adjusted mutual info score,metrics.primary topic- 200 to 300- adjusted rand score,metrics.primary topic- 300 to 400- homogeneity score,metrics.optimal k- 300 to 400,metrics.primary topic- 200 to 300- adjusted mutual info score,metrics.primary topic- 020 to 050- homogeneity score,metrics.optimal k- 010 to 020,metrics.primary topic- 020 to 050- adjusted mutual info score,metrics.filtered embeddings- n rows,metrics.vectorizing time minutes,metrics.primary topic- 010 to 020- adjusted mutual info score,metrics.primary topic- 010 to 020- homogeneity score,metrics.primary topic- 020 to 050- adjusted rand score,metrics.primary topic- 200 to 300- homogeneity score,metrics.optimal k- 200 to 300,metrics.memory used percent,metrics.memory free,metrics.memory used,metrics.model fit time minutes,metrics.primary topic- 400 to 600- adjusted mutual info score,metrics.primary topic- 100 to 200- adjusted rand score,metrics.primary topic- 400 to 600- homogeneity score,metrics.primary topic- 050 to 100- adjusted rand score,metrics.primary topic- 100 to 200- homogeneity score,metrics.optimal k- 050 to 100,metrics.primary topic- 050 to 100- homogeneity score,metrics.primary topic- 050 to 100- adjusted mutual info score,metrics.optimal k- 020 to 050,metrics.optimal k- 100 to 200,metrics.primary topic- 010 to 020- adjusted rand score,metrics.primary topic- 300 to 400- adjusted mutual info score,metrics.primary topic- 400 to 600- adjusted rand score,metrics.optimal k- 400 to 600,params. pipe- reduce random state,params.pipe- reduce name,params.col model leaves order,params. pipe- reduce n components,params.mlflow run name,params. pipe- reduce tol,params. pipe- cluster affinity,params. pipe- reduce algorithm,params. pipe- cluster linkage,params.optimal ks,params. pipe- reduce n iter,params. pipe- normalize copy,params. pipe- normalize norm,params.pipe- normalize name,tags.mlflow.log- model.history,tags.mlflow.source.git.commit,tags.mlflow.runName
77,134cefe13ae34621a69fcc48c4d5fb71,gs://i18n-subreddit-clustering/mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts,2021-10-26 07:18:36.195000+00:00,2021-10-26 07:21:04.807000+00:00,0.6,0.57,0.57,0.6,351.0,0.58,0.49,14.0,0.53,19053.0,2.47,0.49,0.43,0.44,0.58,248.0,5.11%,594641.0,32144.0,1.73,0.6,0.55,0.6,0.53,0.57,52.0,0.55,0.56,30.0,100.0,0.36,0.6,0.6,405.0,,,model_leaves_list_order_left_to_right,,embedding_clustering-2021-10-26_071835,,euclidean,,ward,,,True,l2,Normalizer,"[{""run_id"": ""134cefe13ae34621a69fcc48c4d5fb71"", ""artifact_path"": ""clustering_model"", ""utc_time_created"": ""2021-10-26 07:20:38.503359"", ""flavors"": {""sklearn"": {""pickled_model"": ""model.pkl"", ""sklearn_version"": ""0.24.1"", ""serialization_format"": ""cloudpickle""}}}]",2bece03c03d8b9b68d3a566aad19c9a6b7deb564,embedding_clustering-2021-10-26_071835


In [22]:
%%time

mlf.list_run_artifacts(
    run_id=run_uuid,
    verbose=False,
    only_top_level=True,
)

15:22:16 | INFO | "    28 <- Artifacts clean count"
15:22:16 | INFO | "    11 <- Artifacts & folders at TOP LEVEL clean count"


CPU times: user 1.02 s, sys: 49.2 ms, total: 1.07 s
Wall time: 1.2 s


['X_linkage',
 'clustering.log',
 'clustering_model',
 'config',
 'df_accel',
 'df_labels',
 'df_supervised_metrics',
 'figures',
 'hydra',
 'optimal_ks',
 'pipeline_params']

In [20]:
%%time
l_all_artifacts = mlf.list_run_artifacts(
    run_id=run_uuid,
    verbose=False,
    only_top_level=False,
)

15:22:05 | INFO | "    28 <- Artifacts clean count"
15:22:05 | INFO | "    11 <- Artifacts & folders at TOP LEVEL clean count"


CPU times: user 1.17 s, sys: 16.2 ms, total: 1.19 s
Wall time: 1.33 s


## Get optimal k-values

In [27]:
[n_ for n_ in l_all_artifacts if 'optimal' in n_]

['mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/optimal_ks/optimal_ks.csv',
 'mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/optimal_ks/optimal_ks.parquet']

In [28]:
%%time

df_opt_ks = mlf.read_run_artifact(
    run_id=run_uuid,
    artifact_folder='optimal_ks',
    artifact_file='optimal_ks.parquet',
    read_function='pd_parquet',
    verbose=False,
)
df_opt_ks.shape

15:25:22 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/optimal_ks"
100%|#############################################| 2/2 [00:00<00:00, 15.91it/s]
15:25:22 | INFO | "  Parquet files found:     1"
15:25:22 | INFO | "  Parquet files to use:     1"


CPU times: user 869 ms, sys: 25.7 ms, total: 894 ms
Wall time: 1.14 s


(7, 2)

In [29]:
df_opt_ks

Unnamed: 0,k,col_prefix
010_to_020,14,014_k
020_to_050,30,030_k
050_to_100,52,052_k
100_to_200,100,100_k
200_to_300,248,248_k
300_to_400,351,351_k
400_to_600,405,405_k


## Get labels

In [21]:
[n_ for n_ in l_all_artifacts if 'labels' in n_]

['mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/df_labels/df_labels.csv',
 'mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/df_labels/df_labels.parquet',
 'mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/df_supervised_metrics/d_df_crosstab_labels.gzip',
 'mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/df_supervised_metrics/metrics_for_known_labels-primary_topic.png']

In [24]:
%%time

df_labels = mlf.read_run_artifact(
    run_id=run_uuid,
    artifact_folder='df_labels',
    artifact_file='df_labels.parquet',
    read_function='pd_parquet',
    verbose=False,
)
df_labels.shape

15:23:26 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/18/134cefe13ae34621a69fcc48c4d5fb71/artifacts/df_labels"
100%|##########################################| 2/2 [00:00<00:00, 13252.15it/s]
15:23:27 | INFO | "  Parquet files found:     1"
15:23:27 | INFO | "  Parquet files to use:     1"


CPU times: user 1.11 s, sys: 96.8 ms, total: 1.21 s
Wall time: 1.33 s


(19053, 65)

In [51]:
l_cols_label_core = [
    'model_leaves_list_order_left_to_right',
    'subreddit_name',
    'subreddit_id',
    'primary_topic',
    'posts_for_modeling_count',
]
# label & predicted-topic columns for each optimal k, matched by column prefix
cols_top_k = [c for c in df_labels.columns if any(c.startswith(k_) for k_ in df_opt_ks['col_prefix'].unique())]

counts_describe(df_labels[l_cols_label_core + cols_top_k])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
model_leaves_list_order_left_to_right,int64,19053,19053,100.00%,0,0.00%
subreddit_name,object,19053,19053,100.00%,0,0.00%
subreddit_id,object,19053,19053,100.00%,0,0.00%
primary_topic,object,15929,51,0.32%,3124,16.40%
posts_for_modeling_count,float64,19053,1175,6.17%,0,0.00%
014_k_labels,int32,19053,14,0.07%,0,0.00%
030_k_labels,int32,19053,30,0.16%,0,0.00%
052_k_labels,int32,19053,52,0.27%,0,0.00%
100_k_labels,int32,19053,100,0.52%,0,0.00%
248_k_labels,int32,19053,248,1.30%,0,0.00%
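
The `col_prefix` values from `df_opt_ks` appear to match the start of column names in `df_labels` (for example `014_k` and `014_k_labels`). A minimal, self-contained sketch of selecting columns by prefix, using made-up lists rather than the real dataframes:

```python
# Hypothetical prefixes and column names, modeled on the tables in this notebook.
col_prefixes = ["014_k", "030_k"]
all_columns = [
    "subreddit_name",
    "014_k_labels",
    "014_k-predicted-primary_topic",
    "030_k_labels",
    "030_k-predicted-primary_topic",
    "posts_for_modeling_count",
]

# str.startswith accepts a tuple, so one call checks every prefix.
cols_for_prefixes = [c for c in all_columns if c.startswith(tuple(col_prefixes))]
print(cols_for_prefixes)
```

Passing a tuple to `startswith` behaves the same as the nested `any(...)` check, with one call per column.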


In [52]:
style_df_numeric(
    df_labels[l_cols_label_core + cols_top_k]
    .sort_values(by=['model_leaves_list_order_left_to_right'])
    .head(20)
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,model leaves list order left to right,subreddit name,subreddit id,primary topic,posts for modeling count,014 k labels,030 k labels,052 k labels,100 k labels,248 k labels,351 k labels,405 k labels,014 k- predicted- primary topic,030 k- predicted- primary topic,052 k- predicted- primary topic,100 k- predicted- primary topic,248 k- predicted- primary topic,351 k- predicted- primary topic,405 k- predicted- primary topic
2227,0,blackmetal,t5_2rta0,Music,1169,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
10817,1,metal,t5_2qhud,Music,1038,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
4531,2,deathmetal,t5_2r5w5,Music,670,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
16920,3,thrashmetal,t5_2s66e,Music,285,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
16447,4,technicaldeathmetal,t5_2s8ge,Music,355,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
5046,5,doommetal,t5_2riaf,Music,715,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
13808,6,rabm,t5_2z5zk,Politics,167,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
13431,7,powermetal,t5_2qwe4,Music,607,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
13532,8,progmetal,t5_2s3pe,Music,812,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music
13551,9,progrockmusic,t5_2s6xc,Music,697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,Music,Music,Music,Music,Music,Music,Music


In [53]:
style_df_numeric(
    df_labels[l_cols_label_core + cols_top_k]
    .sort_values(by=['model_leaves_list_order_left_to_right'])
    .tail(20)
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,model leaves list order left to right,subreddit name,subreddit id,primary topic,posts for modeling count,014 k labels,030 k labels,052 k labels,100 k labels,248 k labels,351 k labels,405 k labels,014 k- predicted- primary topic,030 k- predicted- primary topic,052 k- predicted- primary topic,100 k- predicted- primary topic,248 k- predicted- primary topic,351 k- predicted- primary topic,405 k- predicted- primary topic
663,19033,amibeingdetained,t5_2yqn8,Funny/Humor,242,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
15597,19034,sovereigncitizen,t5_2x0c3,Law,85,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
9014,19035,jurastudium_ref,t5_3bepg2,,38,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
9596,19036,law,t5_2qh9k,Law,830,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
14784,19037,scotus,t5_2rfsw,Law,113,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
14923,19038,sexoffendersupport,t5_2veuz,Trauma Support,243,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
5769,19039,excons,t5_2xrpd,,74,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
13503,19040,prison,t5_2qvgf,Law,215,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
15968,19041,straightedge,t5_2qkzx,,60,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law
6871,19042,gangstalking,t5_2t8ya,Medical and Mental Health,893,14.0,30.0,52.0,100.0,248.0,351.0,405.0,Learning and Education,"Business, Economics, and Finance","Business, Economics, and Finance",Law,Law,Law,Law


# Load subreddit metadata to keep only German subs

In [36]:
test_experiment = 'v0.4.0_use_multi_clustering_test'

cfg_cluster_test_v040 = LoadHydraConfig(
    config_name='clustering_v0.4.0_base',
    config_path="../config",
    overrides=[
        f"mlflow_experiment_name={test_experiment}"
#         f"data_text_and_metadata=top_subreddits_2021_07_16",
#         f"data_embeddings_to_cluster=top_subs-2021_07_16-use_multi_lower_case_false_00",
    ],
)

print([k for k in cfg_cluster_test_v040.config_dict.keys()])

['data_text_and_metadata', 'data_embeddings_to_cluster', 'clustering_algo', 'embeddings_to_cluster', 'n_sample_embedding_rows', 'filter_embeddings', 'mlflow_tracking_uri', 'mlflow_experiment_name', 'pipeline']


In [37]:
%%time

d_config_text_and_meta = cfg_cluster_test_v040.config_dict['data_text_and_metadata']

df_subs = LoadSubreddits(
    bucket_name=d_config_text_and_meta['bucket_name'],
    folder_path=d_config_text_and_meta['folder_subreddits_text_and_meta'],
    folder_posts=d_config_text_and_meta['folder_posts_text_and_meta'],
    columns=None,
).read_apply_transformations_and_merge_post_aggs(
    # cols_post='post_count_only_',  # use default so that we can calculate primary language
    df_format='pandas',
    read_fxn='dask',
    unique_check=False,
)

16:25:19 | INFO | "Loading df_posts from: posts/top/2021-09-27"
16:25:19 | INFO | "Reading raw data..."
16:25:19 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/posts/top/2021-09-27"
100%|##############################| 27/27 [00:00<00:00, 49258.90it/s]
16:25:26 | INFO | "  Applying transformations..."
16:26:00 | INFO | "  reading sub-level data & merging with aggregates..."
16:26:00 | INFO | "Reading raw data..."
16:26:00 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/subreddits/top/2021-09-24"
100%|#################################| 1/1 [00:00<00:00, 5570.12it/s]
16:26:01 | INFO | "  Applying transformations..."


In [38]:
df_subs.head()

Unnamed: 0,pt_date,subreddit_name,subreddit_id,geo_relevant_country_codes,geo_relevant_countries,geo_relevant_country_count,geo_relevant_subreddit,ambassador_subreddit,combined_topic,combined_topic_and_rating,rating_short,rating_name,primary_topic,secondary_topics,mature_themes_list,over_18,allow_top,video_whitelisted,subreddit_language,whitelist_status,subscribers,first_screenview_date,last_screenview_date,users_l7,users_l28,posts_l7,posts_l28,comments_l7,comments_l28,pt,...,Spanish_posts_percent,Swahili_posts_percent,Swedish_posts_percent,Tagalog_posts_percent,Thai_posts_percent,Turkish_posts_percent,UNKNOWN_posts_percent,Vietnamese_posts_percent,Welsh_posts_percent,primary_post_language,primary_post_language_percent,primary_post_language_in_use_multilingual,secondary_post_language,secondary_post_language_percent,crosspost_post_type_percent,gallery_post_type_percent,gif_post_type_percent,image_post_type_percent,link_post_type_percent,liveaudio_post_type_percent,multi_media_post_type_percent,poll_post_type_percent,rpan_post_type_percent,text_post_type_percent,video_post_type_percent,primary_post_type,primary_post_type_percent,posts_for_modeling_count,post_median_word_count,post_median_text_len
0,2021-09-21,askreddit,t5_2qh1i,,,,False,False,uncategorized,uncategorized,E,Everyone,Learning and Education,,"profanity_occasional, profanity",f,t,,es,all_ads,33604689,2020-08-24,2021-09-21,12563532,31513185,71934,296017,1525489,6194629,2021-09-24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,English,0.9975,True,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,text,1.0,1200.0,11.0,58.0
1,2021-09-21,pics,t5_2qh0u,,,,False,False,art,art,E,Everyone,Art,,,f,t,,en,all_ads,28014622,2020-08-24,2021-09-21,6062041,12928114,6101,24428,163585,742511,2021-09-24,...,0.001667,0.000833,0.0,0.005,0.0,0.001667,0.0,0.0,0.001667,English,0.935,True,,,0.003333,0.0,0.0,0.915,0.081667,0.0,0.0,0.0,0.0,0.0,0.0,image,0.915,1200.0,10.0,57.0
2,2021-09-21,funny,t5_2qh33,,,,False,False,uncategorized,uncategorized,E,Everyone,,,,f,t,f,en,all_ads,37367466,2020-08-24,2021-09-21,5767977,12250775,6892,28839,114801,463485,2021-09-24,...,0.001667,0.0,0.0025,0.010833,0.0,0.0,0.000833,0.000833,0.0075,English,0.861667,True,German,0.015833,0.0,0.0,0.0225,0.625,0.071667,0.0,0.0,0.0,0.0,0.000833,0.28,image,0.625,1200.0,6.0,33.0
3,2021-09-21,memes,t5_2qjpg,,,,False,False,uncategorized,uncategorized,E,Everyone,Funny/Humor,,"profanity, profanity_occasional",f,t,f,en,all_ads,16335892,2020-08-24,2021-09-21,3969463,10101856,27518,118705,430622,1900286,2021-09-24,...,0.0025,0.003333,0.008333,0.009167,0.0,0.0,0.0,0.0025,0.006667,English,0.805833,True,Danish,0.015,0.0,0.0,0.085,0.8925,0.0225,0.0,0.0,0.0,0.0,0.0,0.0,image,0.8925,1200.0,4.0,23.0
4,2021-09-21,interestingasfuck,t5_2qhsa,,,,False,False,uncategorized,uncategorized,E,Everyone,,,"profanity, profanity_sr_name",f,t,f,en,all_ads,8638369,2020-08-24,2021-09-21,5197231,10071629,1955,7784,132845,522494,2021-09-24,...,0.001667,0.0,0.001667,0.003333,0.0,0.0,0.0,0.0,0.0025,English,0.9375,True,,,0.0,0.0,0.0,0.688333,0.311667,0.0,0.0,0.0,0.0,0.0,0.0,image,0.688333,1200.0,11.0,60.0


In [42]:
mask_germany_only = df_subs['geo_relevant_countries'].fillna('').str.contains('Germany')
mask_germany_only.sum()

810
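
`str.contains('Germany')` is a substring match, so it would also keep subreddits whose `geo_relevant_countries` value lists Germany alongside other countries. A small sketch of the difference between a substring match and an exact match. The comma-separated format of multi-country values is an assumption, since only single-country rows are shown in this notebook:

```python
import pandas as pd

# Toy column; the multi-country string format is assumed.
s_countries = pd.Series(["Germany", "Germany, Austria", None, "France"])

# Substring match: missing values become '' so the mask stays boolean.
mask_contains = s_countries.fillna("").str.contains("Germany", regex=False)

# Exact match: keeps only rows where Germany is the sole country.
mask_exact = s_countries == "Germany"

print(mask_contains.tolist())
print(mask_exact.tolist())
```

Which mask is right depends on whether subreddits relevant to several countries should count as German subs.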

In [46]:
df_subs[mask_germany_only].head()

Unnamed: 0,pt_date,subreddit_name,subreddit_id,geo_relevant_country_codes,geo_relevant_countries,geo_relevant_country_count,geo_relevant_subreddit,ambassador_subreddit,combined_topic,combined_topic_and_rating,rating_short,rating_name,primary_topic,secondary_topics,mature_themes_list,over_18,allow_top,video_whitelisted,subreddit_language,whitelist_status,subscribers,first_screenview_date,last_screenview_date,users_l7,users_l28,posts_l7,posts_l28,comments_l7,comments_l28,pt,...,Spanish_posts_percent,Swahili_posts_percent,Swedish_posts_percent,Tagalog_posts_percent,Thai_posts_percent,Turkish_posts_percent,UNKNOWN_posts_percent,Vietnamese_posts_percent,Welsh_posts_percent,primary_post_language,primary_post_language_percent,primary_post_language_in_use_multilingual,secondary_post_language,secondary_post_language_percent,crosspost_post_type_percent,gallery_post_type_percent,gif_post_type_percent,image_post_type_percent,link_post_type_percent,liveaudio_post_type_percent,multi_media_post_type_percent,poll_post_type_percent,rpan_post_type_percent,text_post_type_percent,video_post_type_percent,primary_post_type,primary_post_type_percent,posts_for_modeling_count,post_median_word_count,post_median_text_len
224,2021-09-21,de,t5_22i0,DE,Germany,1.0,True,False,uncategorized,uncategorized,E,Everyone,Place,,,f,t,f,de,all_ads,492356,2020-08-24,2021-09-21,570140,1515454,1827,6893,74253,288987,2021-09-24,...,0.0,0.000833,0.000833,0.0,0.0,0.000833,0.0,0.0,0.000833,German,0.969167,True,English,0.01,0.035,0.0,0.006667,0.3125,0.3775,0.0,0.0025,0.0,0.0,0.226667,0.039167,link,0.3775,1200.0,12.0,77.0
413,2021-09-21,germany,t5_2qi4z,DE,Germany,1.0,True,False,place,place,E,Everyone,Place,,,f,t,f,en,all_ads,324731,2020-08-24,2021-09-21,224573,846867,602,2386,8799,44734,2021-09-24,...,0.0,0.0,0.001667,0.0,0.0,0.0,0.000833,0.0,0.000833,English,0.9475,True,German,0.036667,0.0375,0.036667,0.0,0.1125,0.071667,0.0,0.008333,0.0,0.0,0.7225,0.010833,text,0.7225,1200.0,58.0,304.0
421,2021-09-21,ich_iel,t5_37k29,DE,Germany,1.0,True,False,internet culture and memes,internet culture and memes,E,Everyone,Internet Culture and Memes,,,,t,,de,all_ads,338859,2020-08-24,2021-09-21,323195,834736,1549,6365,22032,92750,2021-09-24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,German,1.0,True,,,0.005833,0.0,0.005,0.915,0.055,0.0,0.0,0.0,0.0,0.0,0.019167,image,0.915,1200.0,2.0,8.0
878,2021-09-21,wasletztepreis,t5_3ntp6,DE,Germany,1.0,True,False,funny/humor,funny/humor,E,Everyone,Funny/Humor,,,,t,,de,all_ads,171017,2020-08-24,2021-09-21,153786,399480,148,629,3327,14139,2021-09-24,...,0.003333,0.0,0.000833,0.001667,0.0,0.000833,0.003333,0.000833,0.001667,German,0.925833,True,English,0.028333,0.010833,0.155,0.0,0.773333,0.05,0.0,0.0025,0.0,0.0,0.008333,0.0,image,0.773333,1200.0,4.0,24.0
1062,2021-09-21,finanzen,t5_35m5e,DE,Germany,1.0,True,False,"business, economics, and finance","business, economics, and finance",E,Everyone,"Business, Economics, and Finance",,,,t,,de,all_ads,117682,2020-08-24,2021-09-21,128919,330611,306,1353,9475,42539,2021-09-24,...,0.0,0.0,0.000833,0.0,0.0,0.0,0.0,0.0,0.0,German,0.979167,True,English,0.0175,0.0,0.006667,0.0,0.079167,0.055833,0.0,0.020833,0.0,0.0,0.8375,0.0,text,0.8375,1200.0,114.0,691.5


In [45]:
l_germany_subs = df_subs[mask_germany_only]['subreddit_id'].to_list()
len(l_germany_subs)

810

## Keep labels only for German subs

In [49]:
df_labels_de = (
    df_labels[df_labels['subreddit_id'].isin(l_germany_subs)]
    [l_cols_label_core + cols_top_k]
    .copy()
)
df_labels_de.shape

(793, 19)
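
`l_germany_subs` has 810 IDs but `df_labels_de` has 793 rows, so some German subreddits are missing from the clustering labels (the Purpose section notes that subreddits with low post counts were filtered out). A sketch for listing the missing IDs, shown with made-up IDs:

```python
# Made-up IDs for illustration; in the notebook these would come from
# l_germany_subs and df_labels_de['subreddit_id'].
ids_wanted = ["t5_a", "t5_b", "t5_c", "t5_d"]
ids_with_labels = ["t5_a", "t5_c"]

# IDs requested but absent from the labels, in their original order.
labeled = set(ids_with_labels)
ids_missing = [id_ for id_ in ids_wanted if id_ not in labeled]
print(ids_missing)
```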

In [50]:
df_labels_de.head()

Unnamed: 0,subreddit_name,subreddit_id,model_leaves_list_order_left_to_right,primary_topic,posts_for_modeling_count,014_k_labels,030_k_labels,052_k_labels,100_k_labels,248_k_labels,351_k_labels,405_k_labels,014_k-predicted-primary_topic,030_k-predicted-primary_topic,052_k-predicted-primary_topic,100_k-predicted-primary_topic,248_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic
20,15cellynudes1,t5_4wmfjb,2176,,74.0,3,4,6,9,20,28,29,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content
30,1fcnuernberg,t5_30jst,9100,,13.0,7,12,21,35,91,130,149,Sports,Sports,Sports,Sports,Sports,Sports,Sports
121,600euro,t5_3caax,9663,Internet Culture and Memes,758.0,8,14,23,38,101,142,162,Politics,Politics,Politics,Politics,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes
133,88energyltd,t5_43q585,8702,"Business, Economics, and Finance",43.0,6,11,20,33,86,120,139,Crypto,Crypto,"Business, Economics, and Finance","Business, Economics, and Finance","Business, Economics, and Finance","Business, Economics, and Finance","Business, Economics, and Finance"
153,aachen,t5_2t4y2,17531,Place,96.0,13,28,49,91,227,324,373,Place,Place,Place,Place,Place,Place,Place
