# Purpose

### 2021-10-25
In this notebook I'll select the clusters for the One Feed experiment for DE to DE subreddits.

From manual inspection in the MLflow GUI, the best candidate run is:<br>
`134cefe13ae34621a69fcc48c4d5fb71`

Because:
- it has high scores at the 100-to-200 & 200-to-300 bins 
- AND has the most subreddits (filtered out fewer subreddits due to low post counts)

Other clusters had slightly higher values at the 200-to-300 bin, but they clustered fewer subreddits.

# Imports & notebook setup

In [1]:
%load_ext google.colab.data_table

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# colab auth for BigQuery
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [4]:
# Attach google drive & import my python utility functions
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

import sys
l_paths_to_append = [
    '/content/gdrive/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    '/content/gdrive/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ not in sys.path:
        print(f"Appending: {path_}")
        sys.path.append(path_)

Mounted at /content/gdrive
Appending: /content/gdrive/MyDrive/Colab Notebooks


In [5]:
## install subclu & libraries needed to read parquet files from GCS

!pip install -e "/content/gdrive/MyDrive/Colab Notebooks/subreddit_clustering_i18n/" --quiet

In [6]:
# Install needed to load data from GCS, for some reason not included in subclu?
!pip install gcsfs --quiet

In [7]:
# !pip list

In [11]:
# Regular Imports
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted


# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

In [9]:
# subclu imports

# For reloading, need to force-delete some imported items
try:
    del LoadPosts, LoadSubreddits
except Exception:
    pass

from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.data.data_loaders import LoadPosts, LoadSubreddits
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)


setup_logging()
print_lib_versions([pd, np])

python		v 3.7.12
===
pandas		v: 1.1.5
numpy		v: 1.19.5


# Load subreddit metadata

This data is already in BigQuery, so read it straight from there. We'll use it to keep only the geo-relevant (German) subs.

Also add the latest ratings so that we can filter based on those.

## SQL query

In [12]:
%%time

sql_geo_and_languages = """
-- select DE subreddits + get latest rating

SELECT 
    sl.subreddit_id
    , sl.subreddit_name
    , r.rating
    -- , r.subrating
    , r.version

    , slo.verdict
    , slo.quarantine

    , geo.country_name
    , geo.users_percent_in_country
    -- , sl.geo_relevant_countries
    , ambassador_subreddit
    , posts_for_modeling_count

    , primary_post_language
    , primary_post_language_percent
    , secondary_post_language
    , secondary_post_language_percent

    , geo_relevant_country_count
    , geo_relevant_country_codes
    , geo_relevant_subreddit

FROM `reddit-employee-datasets.david_bermejo.subclu_v0040_subreddit_languages` sl
LEFT JOIN (
    SELECT * FROM `data-prod-165221.ds_v2_postgres_tables.subreddit_lookup`
    # Look back 2 days because looking back 1-day could be an empty partition
    WHERE dt = (CURRENT_DATE() - 2)
) AS slo
    ON slo.subreddit_id = sl.subreddit_id
LEFT JOIN (
    SELECT * FROM `reddit-employee-datasets.david_bermejo.subclu_geo_subreddits_20210922`
    WHERE country_name = 'Germany'
) AS geo 
    ON sl.subreddit_id = geo.subreddit_id
LEFT JOIN (
    SELECT * FROM ds_v2_subreddit_tables.subreddit_ratings
    WHERE pt = '2021-10-24'
) AS r
    ON r.subreddit_id = sl.subreddit_id 
    
WHERE 1=1
    -- AND r.version = 'v2'
    -- AND COALESCE(r.rating, '') IN ('pg', 'pg13', 'g')
    AND COALESCE(slo.verdict, '') != 'admin-removed'
    AND COALESCE(slo.quarantine, false) != true
    AND (
        sl.geo_relevant_countries LIKE '%Germany%'
        OR sl.ambassador_subreddit = True
    )

ORDER BY users_percent_in_country ASC -- subreddit_name, ambassador_subreddit
;
"""

client = bigquery.Client()
df_geo_and_lang = client.query(sql_geo_and_languages).to_dataframe()
print(df_geo_and_lang.shape)

(830, 17)
CPU times: user 272 ms, sys: 24.7 ms, total: 296 ms
Wall time: 9.22 s


## Check data with geo + language information

In [13]:
df_geo_and_lang.head()

Unnamed: 0,subreddit_id,subreddit_name,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,posts_for_modeling_count,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
0,t5_4ckovw,buehne,,,,False,,,True,9.0,German,0.333333,Danish,0.111111,,,False
1,t5_4p0iav,de_events,,,,False,,,True,1.0,German,1.0,,,,,False
2,t5_2otu32,nikolacorporation,pg,v3,,False,Germany,0.160008,False,188.0,English,0.920213,Estonian,0.010638,1.0,DE,True
3,t5_vwvbb,vanmoofbicycle,pg,v1,,False,Germany,0.160199,False,305.0,English,0.963934,,,1.0,DE,True
4,t5_2rq3g,trackmania,pg,v1,,False,Germany,0.160615,False,958.0,English,0.927975,,,1.0,DE,True


In [14]:
df_geo_and_lang.tail()

Unnamed: 0,subreddit_id,subreddit_name,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,posts_for_modeling_count,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
825,t5_2tz7b,braunschweig,,,,False,Germany,0.95643,False,28.0,German,0.892857,English,0.071429,1.0,DE,True
826,t5_3255n,duschgedanken,,,,False,Germany,0.956434,True,90.0,German,1.0,,,1.0,DE,True
827,t5_2w4vt,bielefeld,,,,False,Germany,0.957314,False,27.0,German,0.925926,English,0.037037,1.0,DE,True
828,t5_2ty5z,bundeswehr,,,,False,Germany,0.959926,False,254.0,German,0.96063,English,0.031496,1.0,DE,True
829,t5_4o0ba2,nachthimmel,,,,False,Germany,1.0,True,18.0,German,0.833333,English,0.111111,,,False


# Load model labels

Ideally we could pull the configuration data from GitHub, but for now I'm manually copying the artifact locations.


In [15]:
config_clust_v040 = LoadHydraConfig(
    config_name='v0.4.0_2021_10_14-use_multi_lower_case_false_00',
    config_path="../config/data_embeddings_to_cluster",
)

In [16]:
run_uuid = '134cefe13ae34621a69fcc48c4d5fb71'
run_artifacts_uri = (
    f'gs://i18n-subreddit-clustering/mlflow/mlruns/18/{run_uuid}/artifacts'
)

gs_optimal_ks = f"{run_artifacts_uri}/optimal_ks/optimal_ks.parquet"
gs_model_labels = f"{run_artifacts_uri}/df_labels/df_labels.parquet"

### optimal values for K (cluster number)

In [17]:
%%time
df_opt_ks = pd.read_parquet(gs_optimal_ks)
print(df_opt_ks.shape)

(7, 2)
CPU times: user 219 ms, sys: 29.6 ms, total: 248 ms
Wall time: 688 ms


### Labels for many values of k

We'll use the optimal values of k to keep only the labels we'll use for One Feed.

In [18]:
%%time
df_labels = pd.read_parquet(gs_model_labels)
print(df_labels.shape)

(19053, 65)
CPU times: user 146 ms, sys: 40.9 ms, total: 187 ms
Wall time: 511 ms


In [19]:
df_labels.head()



Unnamed: 0,model_leaves_list_order_left_to_right,subreddit_name,subreddit_id,primary_topic,posts_for_modeling_count,010_k_labels,014_k_labels,020_k_labels,030_k_labels,040_k_labels,050_k_labels,052_k_labels,060_k_labels,070_k_labels,080_k_labels,090_k_labels,100_k_labels,110_k_labels,120_k_labels,130_k_labels,140_k_labels,150_k_labels,160_k_labels,170_k_labels,180_k_labels,190_k_labels,200_k_labels,210_k_labels,220_k_labels,230_k_labels,240_k_labels,248_k_labels,250_k_labels,351_k_labels,405_k_labels,010_k-predicted-primary_topic,014_k-predicted-primary_topic,020_k-predicted-primary_topic,030_k-predicted-primary_topic,040_k-predicted-primary_topic,050_k-predicted-primary_topic,052_k-predicted-primary_topic,060_k-predicted-primary_topic,070_k-predicted-primary_topic,080_k-predicted-primary_topic,090_k-predicted-primary_topic,100_k-predicted-primary_topic,110_k-predicted-primary_topic,120_k-predicted-primary_topic,130_k-predicted-primary_topic,140_k-predicted-primary_topic,150_k-predicted-primary_topic,160_k-predicted-primary_topic,170_k-predicted-primary_topic,180_k-predicted-primary_topic,190_k-predicted-primary_topic,200_k-predicted-primary_topic,210_k-predicted-primary_topic,220_k-predicted-primary_topic,230_k-predicted-primary_topic,240_k-predicted-primary_topic,248_k-predicted-primary_topic,250_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic
0,7400,0sanitymemes,t5_2qlzfy,Internet Culture and Memes,559.0,4,5,8,10,13,17,18,19,21,25,27,31,36,39,42,45,47,50,53,54,58,61,63,66,69,70,73,74,103,117,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming
1,8343,0xpolygon,t5_2qgijx,Crypto,1188.0,5,6,9,11,14,18,19,20,22,26,28,32,38,41,44,48,50,54,58,59,63,66,69,73,77,78,83,84,117,135,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto,Crypto
2,527,100gecs,t5_131dor,Music,275.0,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,5,5,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music
3,666,100kanojo,t5_2asd3o,Anime,286.0,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,6,6,6,6,7,7,Music,Television,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime
4,7949,100thieves,t5_3e98s,Gaming,443.0,4,5,8,10,13,17,18,19,21,25,27,31,37,40,43,47,49,52,55,56,60,63,65,69,72,73,78,79,109,125,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming


### Filter to only k-optimal columns

This makes it easier to cut noise and understand the cut-offs.

In [None]:
l_cols_label_core = [
    'model_leaves_list_order_left_to_right',
    'posts_for_modeling_count',
    'subreddit_id',
    'subreddit_name',
    'primary_topic',
]

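# Keep only the label columns for the optimal k values with k >= 50
l_prefixes_top_k = df_opt_ks[df_opt_ks['k'] >= 50]['col_prefix'].unique()
cols_top_k = [
    c for c in df_labels.columns
    if any(c.startswith(k_) for k_ in l_prefixes_top_k)
]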

counts_describe(df_labels[l_cols_label_core + cols_top_k])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
model_leaves_list_order_left_to_right,int64,19053,19053,100.00%,0,0.00%
posts_for_modeling_count,float64,19053,1175,6.17%,0,0.00%
subreddit_id,object,19053,19053,100.00%,0,0.00%
subreddit_name,object,19053,19053,100.00%,0,0.00%
primary_topic,object,15929,51,0.32%,3124,16.40%
052_k_labels,int32,19053,52,0.27%,0,0.00%
100_k_labels,int32,19053,100,0.52%,0,0.00%
248_k_labels,int32,19053,248,1.30%,0,0.00%
351_k_labels,int32,19053,351,1.84%,0,0.00%
405_k_labels,int32,19053,405,2.13%,0,0.00%
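
The column filter above keeps a column when its name starts with the prefix of one of the selected k values. Here is a minimal sketch of the same idea on toy data, assuming `col_prefix` holds strings such as `052_k` (the toy frames and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_opt_ks and df_labels; values are illustrative only
df_opt_ks_toy = pd.DataFrame({
    'k': [10, 52, 100],
    'col_prefix': ['010_k', '052_k', '100_k'],
})
df_labels_toy = pd.DataFrame({
    'subreddit_name': ['a', 'b'],
    '010_k_labels': [1, 2],
    '052_k_labels': [3, 4],
    '052_k-predicted-primary_topic': ['Music', 'Gaming'],
    '100_k_labels': [5, 6],
})

# str.startswith accepts a tuple, so one call checks every prefix
prefixes = tuple(
    df_opt_ks_toy.loc[df_opt_ks_toy['k'] >= 50, 'col_prefix'].unique()
)
cols_top_k_toy = [c for c in df_labels_toy.columns if c.startswith(prefixes)]
print(cols_top_k_toy)
```

Both the `_k_labels` and the `_k-predicted-primary_topic` columns share the prefix, so both are kept for each selected k.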


# Keep only labels for DE subreddits


In [None]:
l_ix_subs = ['subreddit_name', 'subreddit_id']

df_labels_de = (
    df_labels[l_cols_label_core + cols_top_k]
    .merge(
        df_geo_and_lang.drop(['posts_for_modeling_count'], axis=1),
        how='right',
        on=l_ix_subs,
    )
    .copy()
    .sort_values(by=['model_leaves_list_order_left_to_right'], ascending=True)
)

print(df_labels_de.shape)

(838, 29)
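
With unique keys on both sides, a right merge should return exactly as many rows as the right-hand frame. Here is a small sketch of how that could be checked, using the `validate` and `indicator` options of `DataFrame.merge` on toy frames (the frame names and values are illustrative, not the real data):

```python
import pandas as pd

# Toy frames; names and values are illustrative only
df_left = pd.DataFrame({'subreddit_id': ['t5_a', 't5_b'], 'label': [1, 2]})
df_right = pd.DataFrame({
    'subreddit_id': ['t5_a', 't5_b', 't5_c'],
    'country_name': ['Germany', 'Germany', 'Germany'],
})

df_merged = df_left.merge(
    df_right,
    how='right',
    on='subreddit_id',
    validate='one_to_one',  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a `_merge` column: 'both' or 'right_only'
)

# Rows tagged 'right_only' exist in the right frame but have no cluster labels
n_unlabeled = (df_merged['_merge'] == 'right_only').sum()
print(df_merged.shape, n_unlabeled)
```

If the row count differs from the right-hand frame, duplicated keys or a re-run of the upstream query are two things worth checking.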


### Drop subs with too few posts

In the modeling process I drop subreddits with too few posts. We don't have recommendations for them, so let's drop them.

It would also not be a great experience to recommend dead subs.

In [None]:
df_labels_de['model_leaves_list_order_left_to_right'].isnull().sum()

33

In [None]:
df_labels_de = df_labels_de[
    ~df_labels_de['model_leaves_list_order_left_to_right'].isnull()
]
df_labels_de.shape

(805, 29)

In [None]:
l_cols_label_de = [c for c in df_labels_de.columns if c.endswith('k_labels')]
df_labels_de[l_cols_label_de] = df_labels_de[l_cols_label_de].astype(int)
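
The cast back to `int` is presumably needed because the right merge leaves `NaN` in the label columns for subs without labels, which turns integer columns into floats. Here is a toy sketch of that behaviour (frame names and values are made up; `astype` with a dict is one alternative way to cast, not necessarily what the notebook should use):

```python
import pandas as pd

df_left = pd.DataFrame({'subreddit_id': ['t5_a', 't5_b'], '052_k_labels': [3, 7]})
df_right = pd.DataFrame({'subreddit_id': ['t5_a', 't5_b', 't5_c']})

df_toy = df_left.merge(df_right, how='right', on='subreddit_id')
dtype_after_merge = df_toy['052_k_labels'].dtype  # float64: t5_c has no label

# Drop the unlabeled rows, then cast the label columns back to integers
df_toy = df_toy[df_toy['052_k_labels'].notnull()]
l_cols = [c for c in df_toy.columns if c.endswith('k_labels')]
df_toy = df_toy.astype({c: int for c in l_cols})
print(dtype_after_merge, df_toy['052_k_labels'].tolist())
```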

In [None]:
style_df_numeric(
    df_labels_de.head(25),
    # rename_cols_for_display=True,
    l_bar_simple=[c for c in df_labels_de.columns if 'labels' in c]
)

Unnamed: 0,model_leaves_list_order_left_to_right,posts_for_modeling_count,subreddit_id,subreddit_name,primary_topic,052_k_labels,100_k_labels,248_k_labels,351_k_labels,405_k_labels,052_k-predicted-primary_topic,100_k-predicted-primary_topic,248_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
33,49.0,18,t5_2roop,hardtechno,,1,1,1,1,1,Music,Music,Music,Music,Music,,,,False,Germany,0,False,English,77.78%,Dutch,5.56%,1,DE,True
152,76.0,386,t5_2qziu,rappers,,1,1,1,1,1,Music,Music,Music,Music,Music,r,v2,admin-approved,False,Germany,0,False,English,81.87%,German,1.81%,1,DE,True
579,77.0,705,t5_2v7pv,germanrap,Music,1,1,1,1,1,Music,Music,Music,Music,Music,r,v2,,False,Germany,1,False,German,71.49%,English,13.05%,1,DE,True
231,85.0,23,t5_2smd3,musik,,1,1,1,1,1,Music,Music,Music,Music,Music,,,,False,Germany,1,False,German,82.61%,English,8.70%,1,DE,True
201,89.0,25,t5_39ea8,mgpmppjwfa,Music,1,1,1,1,1,Music,Music,Music,Music,Music,,,,False,Germany,0,False,German,52.00%,English,8.00%,1,DE,True
245,95.0,179,t5_2t6i4,germusic,Music,1,1,1,1,1,Music,Music,Music,Music,Music,pg13,,,False,Germany,1,False,German,58.10%,English,21.23%,1,DE,True
549,172.0,17,t5_31l12,kollegah,Music,1,2,2,2,2,Music,Music,Music,Music,Music,,,,False,Germany,1,False,German,94.12%,Indonesian,5.88%,1,DE,True
320,173.0,19,t5_2ylk3,moneyboy,Music,1,2,2,2,2,Music,Music,Music,Music,Music,,,,False,Germany,1,False,German,84.21%,Danish,5.26%,1,DE,True
14,283.0,101,t5_2t36v,billytalent,Music,1,2,4,4,4,Music,Music,Music,Music,Music,r,v1,,False,Germany,0,False,English,99.01%,Somali,0.99%,1,DE,True
15,370.0,37,t5_35q0o,lindemann,Music,1,2,4,4,4,Music,Music,Music,Music,Music,r,v2,,False,Germany,0,False,English,91.89%,Danish,2.70%,1,DE,True


In [None]:
style_df_numeric(
    df_labels_de.tail(25),
    # rename_cols_for_display=True,
    l_bar_simple=[c for c in df_labels_de.columns if 'labels' in c]
)

Unnamed: 0,model_leaves_list_order_left_to_right,posts_for_modeling_count,subreddit_id,subreddit_name,primary_topic,052_k_labels,100_k_labels,248_k_labels,351_k_labels,405_k_labels,052_k-predicted-primary_topic,100_k-predicted-primary_topic,248_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
241,18137.0,76,t5_2v6r0,tuberlin,Learning and Education,50,94,234,334,386,Learning and Education,Learning and Education,Learning and Education,Learning and Education,Learning and Education,pg,v3,,False,Germany,1,False,English,89.47%,German,10.53%,1,DE,True
92,18138.0,129,t5_2zhyp,tumunich,Learning and Education,50,94,234,334,386,Learning and Education,Learning and Education,Learning and Education,Learning and Education,Learning and Education,,,,False,Germany,0,False,English,98.45%,German,1.55%,1,DE,True
581,18273.0,16,t5_k4uk5,versicherung,,50,94,236,336,388,Learning and Education,Learning and Education,Careers,Careers,Careers,,,,False,Germany,1,False,German,100.00%,,-,1,DE,True
440,18282.0,13,t5_32ayp,medizin,Medical and Mental Health,50,94,236,336,388,Learning and Education,Learning and Education,Careers,Careers,Careers,,,,False,Germany,1,False,German,100.00%,,-,1,DE,True
796,18333.0,60,t5_3grce,lehrerzimmer,Learning and Education,50,94,237,337,389,Learning and Education,Learning and Education,Learning and Education,Learning and Education,Learning and Education,pg,,,False,Germany,1,False,German,100.00%,,-,1,DE,True
354,18375.0,11,t5_4lced8,geschlechtsneutral,Activism,50,94,237,338,390,Learning and Education,Learning and Education,Learning and Education,Science,Science,,,,False,Germany,1,False,German,100.00%,,-,1,DE,True
283,18393.0,937,t5_41u8uv,surveycircle_de,,50,94,237,338,390,Learning and Education,Learning and Education,Learning and Education,Science,Science,,,,False,Germany,1,False,German,80.90%,English,9.39%,1,DE,True
465,18395.0,58,t5_37u1s,samplesize_dach,,50,94,237,338,390,Learning and Education,Learning and Education,Learning and Education,Science,Science,,,,False,Germany,1,False,German,94.83%,English,5.17%,1,DE,True
357,18396.0,35,t5_2p1qpt,umfragen,,50,94,237,338,390,Learning and Education,Learning and Education,Learning and Education,Science,Science,,,,False,Germany,1,False,German,91.43%,English,5.71%,1,DE,True
214,18399.0,13,t5_39he0,luh,,50,94,237,338,390,Learning and Education,Learning and Education,Learning and Education,Science,Science,,,,False,Germany,0,False,German,84.62%,English,7.69%,1,DE,True


## Check whether any clusters contain a single subreddit when k = 52

We want to avoid clusters with only one subreddit because then we have nothing to recommend.

Looks like even at k = 52 some subreddits are orphans.

In [None]:
?value_counts_and_pcts

In [None]:
value_counts_and_pcts(
    df_labels_de['052_k_labels'], top_n=None,
    reset_index=True,
    add_col_prefix=False,
)

Unnamed: 0,052_k_labels,count,percent,cumulative_percent
0,9,84,10.4%,10.4%
1,7,70,8.7%,19.1%
2,49,54,6.7%,25.8%
3,12,50,6.2%,32.0%
4,21,41,5.1%,37.1%
5,47,37,4.6%,41.7%
6,23,37,4.6%,46.3%
7,18,34,4.2%,50.6%
8,6,33,4.1%,54.7%
9,20,33,4.1%,58.8%


In [None]:
df_lbl_counts = value_counts_and_pcts(
    df_labels_de['052_k_labels'], top_n=None,
    reset_index=True,
    add_col_prefix=False,
    return_df=True,
).sort_values(by=['count'], ascending=False)

In [None]:
df_lbl_counts.tail()

Unnamed: 0,052_k_labels,count,percent,cumulative_percent
43,22,3,0.003727,0.992547
44,44,2,0.002484,0.995031
45,15,2,0.002484,0.997516
46,48,1,0.001242,0.998758
47,43,1,0.001242,1.0


In [None]:
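# NOTE: tail() takes the 5 smallest clusters (1 to 3 subreddits each here),
# not only the single-subreddit orphans
l_orphan_cluster_ids = df_lbl_counts.tail()['052_k_labels'].values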

df_labels_de[df_labels_de['052_k_labels'].isin(l_orphan_cluster_ids)]

Unnamed: 0,model_leaves_list_order_left_to_right,posts_for_modeling_count,subreddit_id,subreddit_name,primary_topic,052_k_labels,100_k_labels,248_k_labels,351_k_labels,405_k_labels,052_k-predicted-primary_topic,100_k-predicted-primary_topic,248_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
530,5837.0,38.0,t5_3ezif,pokemongogermany,,15,23,55,77,88,Gaming,Gaming,Gaming,Gaming,Gaming,,,,False,Germany,0.839662,False,German,0.921053,English,0.052632,1.0,DE,True
731,5861.0,150.0,t5_3wnb53,pokemonde,Gaming,15,23,55,77,88,Gaming,Gaming,Gaming,Gaming,Gaming,,,,False,Germany,0.894578,True,German,0.806667,Swedish,0.033333,1.0,DE,True
252,9298.0,36.0,t5_2scyzx,damaghshow,Gaming,22,37,96,136,155,Place,Place,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,,,,False,Germany,0.550459,False,Other_language,0.75,Croatian,0.027778,1.0,DE,True
285,9319.0,53.0,t5_1bigl8,menchtv,,22,37,96,136,155,Place,Place,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,,,,False,Germany,0.622054,False,English,0.396226,Norwegian,0.09434,1.0,DE,True
39,9384.0,230.0,t5_50b08s,trendycow,,22,37,97,137,156,Place,Place,Movies,Movies,Movies,,,,False,Germany,0.172694,False,English,0.934783,Somali,0.008696,1.0,DE,True
608,15011.0,26.0,t5_35ipo,fitnessde,Fitness and Nutrition,43,74,188,269,312,Fitness and Nutrition,Fitness and Nutrition,Fitness and Nutrition,Fitness and Nutrition,Fitness and Nutrition,pg,,,False,Germany,0.861554,False,German,1.0,,,1.0,DE,True
328,15177.0,95.0,t5_49sepp,leande,Science,44,76,190,273,317,Medical and Mental Health,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,,,,False,Germany,0.703629,False,German,0.652632,English,0.115789,1.0,DE,True
517,15187.0,154.0,t5_2zynz,drogen,Mature Themes and Adult Content,44,76,190,273,317,Medical and Mental Health,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,,,,False,Germany,0.834762,False,German,0.961039,English,0.019481,1.0,DE,True
446,17047.0,62.0,t5_4ognal,zelten,Outdoors and Nature,48,89,222,318,367,Place,Music,Outdoors and Nature,Outdoors and Nature,Outdoors and Nature,,,,False,Germany,0.801587,True,German,0.951613,French,0.016129,1.0,DE,True
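
To isolate only the true orphans (clusters with exactly one subreddit), one option is to filter on the cluster size instead of taking the tail. This is a minimal sketch on toy data using plain `value_counts` rather than the `value_counts_and_pcts` helper; the toy frame is made up for illustration:

```python
import pandas as pd

# Toy labels: cluster 9 has three subs, cluster 15 has two, cluster 48 has one
df_toy = pd.DataFrame({
    'subreddit_name': ['a', 'b', 'c', 'd', 'e', 'f'],
    '052_k_labels': [9, 9, 9, 15, 15, 48],
})

# Count subreddits per cluster and keep the cluster IDs with exactly one
cluster_sizes = df_toy['052_k_labels'].value_counts()
orphan_cluster_ids = cluster_sizes[cluster_sizes == 1].index.tolist()

df_orphans = df_toy[df_toy['052_k_labels'].isin(orphan_cluster_ids)]
print(orphan_cluster_ids)
print(df_orphans['subreddit_name'].tolist())
```

Changing the threshold (for example `cluster_sizes <= 3`) would also capture the very small clusters that `tail()` picks up.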


# Reshape: flatten topics into 1 row = 1 cluster/topic

In [None]:
%%time

df_cluster_per_row = (
    df_labels_de
    .groupby(['052_k_labels', '052_k-predicted-primary_topic'])
    ['subreddit_name']
    .agg(
        [
            ('subreddit_count', 'count'),
            ('list_of_subs', list)
        ]
    )
    .reset_index()
)

# Convert the list of subs into a df & merge back with original sub (each sub should be in a new column)
df_cluster_per_row = (
    df_cluster_per_row
    .merge(
        pd.DataFrame(df_cluster_per_row['list_of_subs'].to_list()).fillna(''),
        how='left',
        left_index=True,
        right_index=True,
    )
    .drop(['list_of_subs'], axis=1)
)

print(df_cluster_per_row.shape)

(48, 87)
CPU times: user 30.2 ms, sys: 14 µs, total: 30.3 ms
Wall time: 31.6 ms
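
The reshape above can be hard to picture, so here is the same idea on a four-row toy frame: group by cluster, collect the subreddit names into a list, then spread each list across columns. The rows reuse a few names from the tables above; named aggregation and `join` are my choices for the sketch, not the notebook's exact calls:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'subreddit_name': ['germanrap', 'musik', 'kollegah', 'pokemonde'],
    '052_k_labels': [1, 1, 1, 15],
    '052_k-predicted-primary_topic': ['Music', 'Music', 'Music', 'Gaming'],
})

# One row per cluster, with a count and a list of member subreddits
df_wide = (
    df_toy
    .groupby(['052_k_labels', '052_k-predicted-primary_topic'])['subreddit_name']
    .agg(subreddit_count='count', list_of_subs=list)
    .reset_index()
)

# Expanding the lists gives one column per position; short lists are padded with ''
df_subs = pd.DataFrame(
    df_wide['list_of_subs'].to_list(), index=df_wide.index
).fillna('')
df_wide = df_wide.drop(columns=['list_of_subs']).join(df_subs)
print(df_wide)
```

The number of columns is driven by the largest cluster, which is why the real output is so wide (87 columns for 48 clusters).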


In [None]:
df_labels_de[df_labels_de['subreddit_name'] == 'sexmeets1']

Unnamed: 0,model_leaves_list_order_left_to_right,posts_for_modeling_count,subreddit_id,subreddit_name,primary_topic,052_k_labels,100_k_labels,248_k_labels,351_k_labels,405_k_labels,052_k-predicted-primary_topic,100_k-predicted-primary_topic,248_k-predicted-primary_topic,351_k-predicted-primary_topic,405_k-predicted-primary_topic,rating,version,verdict,quarantine,country_name,users_percent_in_country,ambassador_subreddit,primary_post_language,primary_post_language_percent,secondary_post_language,secondary_post_language_percent,geo_relevant_country_count,geo_relevant_country_codes,geo_relevant_subreddit
165,5129.0,12.0,t5_50c2v9,sexmeets1,,13,20,48,67,76,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,,,,False,Germany,0.321951,False,German,0.833333,English,0.166667,1.0,DE,True


In [None]:
df_cluster_per_row.to_csv(
    f"/content/gdrive/MyDrive/Colab Notebooks/data/{datetime.utcnow().strftime('%Y-%m-%d_%H%M')}_de_to_de_subreddits_raw.csv",
    index=False,
)

### by subreddit ID


In [None]:
%%time

col_to_list = 'subreddit_id'
df_cluster_per_row_id = (
    df_labels_de
    .groupby(['052_k_labels', '052_k-predicted-primary_topic'])
    [col_to_list]
    .agg(
        [
            ('subreddit_count', 'count'),
            ('list_of_subs', list)
        ]
    )
    .reset_index()
)

# Convert the list of subs into a df & merge back with original sub (each sub should be in a new column)
df_cluster_per_row_id = (
    df_cluster_per_row_id
    .merge(
        pd.DataFrame(df_cluster_per_row_id['list_of_subs'].to_list()).fillna(''),
        how='left',
        left_index=True,
        right_index=True,
    )
    .drop(['list_of_subs'], axis=1)
)

print(df_cluster_per_row_id.shape)

(48, 87)
CPU times: user 24.4 ms, sys: 0 ns, total: 24.4 ms
Wall time: 24.5 ms


In [None]:
df_cluster_per_row_id.head()

Unnamed: 0,052_k_labels,052_k-predicted-primary_topic,subreddit_count,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,...,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83
0,1,Music,13,t5_2roop,t5_2qziu,t5_2v7pv,t5_2smd3,t5_39ea8,t5_2t6i4,t5_31l12,t5_2ylk3,t5_2t36v,t5_35q0o,t5_30xre,t5_2tmlgt,t5_nilvp,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,Anime,5,t5_4thzyd,t5_4sfk6d,t5_50bxag,t5_4rxks9,t5_2w7ha,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,3,Television,5,t5_4r191u,t5_4qwonp,t5_4xjkpg,t5_4r2be4,t5_2sze6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,4,"Reading, Writing, and Literature",3,t5_4z4tos,t5_2sroz,t5_3jiqq,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,5,Movies,6,t5_2r3zh,t5_4cmjcc,t5_2ti1q,t5_33xyp,t5_1g8x6c,t5_4hb8ta,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [None]:
df_cluster_per_row_id.to_csv(
    f"/content/gdrive/MyDrive/Colab Notebooks/data/{datetime.utcnow().strftime('%Y-%m-%d_%H%M')}_de_to_de_subreddits_raw_ids.csv",
    index=False,
)

## Add partial list of subreddits to filter out

Most of the NSFW subreddits are in the clusters listed below, but some were misclassified, so a few subreddits are also removed by name.

In [None]:
l_clusters_to_remove = [
    # NSFW clusters (porn/celebs)
    6,
    7,
    8,
    9,
    46,  # Sexual orientation & NSFW

    # drinking & drugs
    39,  
    44,  # drugs and detoxing?
]
l_subs_manual_remove = [
    'sexmeets1',
    'fuck',
    'eastgermandreams',
    'BonnyLangOfficial',

    # potential misinformation
    'wuhan_virus',
]

# Subs that appear to be misclassified, check to see what we can learn to improve
l_subs_investigate = [
    'outdoor',  # classified in podcast group
    
    'satire_de_en', # satire is hard to classify...
]
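
A minimal sketch of how these lists could be applied to the labels, assuming the filter runs on the `052_k_labels` and `subreddit_name` columns. The toy frame is made up (the cluster for `bonnylangofficial` is invented), and names are lower-cased before matching because `subreddit_name` appears to be lower-case in the tables above:

```python
import pandas as pd

l_clusters_to_remove = [6, 7, 8, 9, 46, 39, 44]
l_subs_manual_remove = [
    'sexmeets1', 'fuck', 'eastgermandreams', 'BonnyLangOfficial', 'wuhan_virus',
]

# Toy stand-in for df_labels_de; values are illustrative only
df_toy = pd.DataFrame({
    'subreddit_name': ['germanrap', 'drogen', 'sexmeets1', 'bonnylangofficial'],
    '052_k_labels': [1, 44, 13, 20],
})

# Lower-case the manual names so mixed-case entries still match
subs_to_remove = {s.lower() for s in l_subs_manual_remove}

# Remove a row if its cluster is flagged OR its name is on the manual list
mask_remove = (
    df_toy['052_k_labels'].isin(l_clusters_to_remove)
    | df_toy['subreddit_name'].str.lower().isin(subs_to_remove)
)
df_keep = df_toy[~mask_remove]
print(df_keep['subreddit_name'].tolist())
```

Keeping the cluster-level and name-level filters in one mask makes it easy to count how many subs each rule removes before dropping them.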