# Purpose

### 2022-02-14
In this notebook I'll select the clusters for the new FPR experiments for Canada, UK, Australia, & India.

Note that this is supposed to be an SFW experiment, so we'll need to filter out subreddits that are `over_18` or rated as `X`.

In one sheet include BOTH subreddit names & subreddit IDs.

TODO: Haven't included place logic (e.g., add direction to: city, state, country subreddits.)


### Updates



# Imports & notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

# Register bigquery magic
%load_ext google.cloud.bigquery

In [2]:
# auth for google sheets
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

In [3]:
# Regular Imports
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm

# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

In [4]:
# subclu imports
import subclu
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr
)


setup_logging()
notebook_display_config()
print_lib_versions([gspread, pd, np])

python		v 3.7.11
===
gspread		v: 5.0.0
pandas		v: 1.2.4
numpy		v: 1.19.5


# Load subreddit metadata

This data is already in BigQuery, so read it straight from there. We'll use it to keep only the subs that are geo-relevant to the target country.

Also add the latest ratings so that we can filter based on those.

English-speaking countries don't have ambassador subs right now, so we should be able to create a standard template and replace the country name for these queries.
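A minimal sketch of that template idea, assuming a shortened stand-in query. The table name is a placeholder, `build_country_sql` is a hypothetical helper, and the country spellings are assumptions that would need to match `country_name` in the geo-score table (only `'Australia'` is confirmed in this notebook).

```python
from string import Template

# Hypothetical, shortened template: the real query in this notebook is longer,
# but only the DECLARE line would change between countries.
SQL_TEMPLATE = Template(
    """
DECLARE TARGET_COUNTRY STRING DEFAULT '$country';

SELECT subreddit_id, subreddit_name
FROM `my-project.my_dataset.geo_scores`  -- placeholder table name
WHERE country_name = TARGET_COUNTRY
;
"""
)

# Assumed spellings; they must match `country_name` in the geo-score table.
TARGET_COUNTRIES = ("Canada", "United Kingdom", "Australia", "India")


def build_country_sql(country: str) -> str:
    """Return the template SQL with the target country filled in."""
    if country not in TARGET_COUNTRIES:
        # Allow-listing avoids pasting arbitrary text into the SQL string.
        raise ValueError(f"Unexpected country: {country!r}")
    return SQL_TEMPLATE.substitute(country=country)


sql_by_country = {c: build_country_sql(c) for c in TARGET_COUNTRIES}
print(sql_by_country["Australia"])
```

Each generated string could then be passed to `client.query(...)` in a loop, one country at a time.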

## SQL query

In [5]:
%%time

sql_geo_and_languages = fr"""
-- Select geo+cultural subreddits for a target country
--  And add latest rating & over_18 flags to exclude X-rated & over_18
DECLARE TARGET_COUNTRY STRING DEFAULT 'Australia';


SELECT    
    s.* EXCEPT(over_18, pt, verdict) 
    , nt.rating_name
    , nt.primary_topic
    , nt.rating_short
    , slo.over_18
    , CASE 
        WHEN(COALESCE(slo.over_18, 'f') = 't') THEN 'over_18_or_X_M_D_V'
        WHEN(COALESCE(nt.rating_short, '') IN ('X', 'M', 'D', 'V')) THEN 'over_18_or_X_M_D_V'
        ELSE 'unrated_or_E'
    END AS grouped_rating

FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` AS t
    -- Inner join b/c we only want to keep subs that are geo-relevant AND in topic model
    INNER JOIN (
        SELECT *
        FROM `reddit-employee-datasets.david_bermejo.subclu_subreddit_geo_score_standardized_20220212`
        WHERE country_name = TARGET_COUNTRY
    ) AS s
        ON t.subreddit_id = s.subreddit_id

    -- Add rating so we can get an estimate for how many we can actually use for recommendation
    LEFT JOIN (
        SELECT *
        FROM `data-prod-165221.ds_v2_postgres_tables.subreddit_lookup`
        -- Get latest partition
        WHERE dt = DATE(CURRENT_DATE() - 2)
    ) AS slo
    ON s.subreddit_id = slo.subreddit_id
    LEFT JOIN (
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE pt = DATE(CURRENT_DATE() - 2)
    ) AS nt
        ON s.subreddit_id = nt.subreddit_id

    -- Exclude popular US subreddits
    -- Can't query this table from local notebook because of errors getting google drive permissions. smh, exclude for now
    -- LEFT JOIN `reddit-employee-datasets.david_bermejo.subclu_subreddits_top_us_to_exclude_from_relevance` tus
    --     ON s.subreddit_name = LOWER(tus.subreddit_name)

WHERE 1=1
    AND s.subreddit_name != 'profile'
    AND COALESCE(s.type, '') = 'public'
    AND COALESCE(s.verdict, 'f') <> 'admin_removed'
    AND COALESCE(slo.over_18, 'f') = 'f'
    AND COALESCE(nt.rating_short, '') NOT IN ('X', 'D')

    AND(
        s.geo_relevance_default = TRUE
        OR s.relevance_percent_by_subreddit = TRUE
        OR s.relevance_percent_by_country_standardized = TRUE
    )
    AND country_name IN (
            TARGET_COUNTRY
        )

    -- AND (
    --     -- Exclude subs that are top in US but we want to exclude as culturally relevant
    --     --  For simplicity, let's go with the English exclusion (more relaxed) than the non-English one
    --     COALESCE(tus.english_exclude_from_relevance, '') <> 'exclude'
    -- )

ORDER BY e_users_percent_by_country_standardized DESC, users_l7 DESC, subreddit_name
;
"""

client = bigquery.Client()
df_geo_and_lang = client.query(sql_geo_and_languages).to_dataframe()
print(df_geo_and_lang.shape)

(1433, 25)
CPU times: user 81.2 ms, sys: 45.5 ms, total: 127 ms
Wall time: 10 s
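The `grouped_rating` CASE expression in the query can be mirrored in plain Python as a sanity check. This is a sketch, not part of the original notebook: `grouped_rating` here is a hypothetical helper, and `None` stands in for SQL `NULL`. The sample values come from rows shown in this notebook's output.

```python
FLAGGED_RATINGS = ("X", "M", "D", "V")


def grouped_rating(over_18, rating_short):
    """Mirror the query's CASE expression for one subreddit."""
    # COALESCE(over_18, 'f') = 't'
    if (over_18 or "f") == "t":
        return "over_18_or_X_M_D_V"
    # COALESCE(rating_short, '') IN ('X', 'M', 'D', 'V')
    if (rating_short or "") in FLAGGED_RATINGS:
        return "over_18_or_X_M_D_V"
    return "unrated_or_E"


# (over_18, rating_short) pairs taken from rows shown in this notebook's output
rows = {
    "melbourne": ("f", "E"),
    "pppoker": ("f", "M"),
    "polygirlsgonewild": (None, None),
}
for name, (over_18, rating) in rows.items():
    print(name, grouped_rating(over_18, rating))
```

Note that the `WHERE` clause only excludes `over_18` subs and ratings `X` and `D`, so `M`- and `V`-rated subs (such as `pppoker`) can still be returned. For a strictly SFW list, filtering on `grouped_rating == 'unrated_or_E'` afterwards may be the safer option.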


## Check data with geo + language information

In [6]:
df_geo_and_lang.head()

Unnamed: 0,subreddit_id,subreddit_name,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,users_in_subreddit_from_country_l28,total_users_in_country_l28,total_users_in_subreddit_l28,geo_country_code,posts_not_removed_l28,users_l7,num_of_countries_with_visits_l28,users_percent_by_country_avg,users_percent_by_country_stdev,type,rating_name,primary_topic,rating_short,over_18,grouped_rating
0,t5_2qkhb,melbourne,Australia,True,0.669184,10.780123,0.029058,3,True,True,628586,21632112,939332,AU,2033,414195,119,0.000542,0.002645,public,Everyone,Place,E,f,unrated_or_E
1,t5_2qkob,sydney,Australia,True,0.821898,10.70648,0.019942,10,True,True,431393,21632112,524874,AU,1268,201481,117,0.00029,0.001836,public,Everyone,Place,E,f,unrated_or_E
2,t5_2uo3q,ausfinance,Australia,True,0.831555,10.663609,0.018428,14,True,True,398641,21632112,479392,AU,1346,204175,116,0.000261,0.001704,public,Everyone,"Business, Economics, and Finance",E,,unrated_or_E
3,t5_2qh8e,australia,Australia,True,0.414442,10.460037,0.046857,2,True,True,1013613,21632112,2445727,AU,2775,633423,119,0.001887,0.004299,public,Everyone,Place,E,f,unrated_or_E
4,t5_2g3blu,coronavirusdownunder,Australia,True,0.360947,10.406553,0.020909,8,True,True,452296,21632112,1253080,AU,3805,654697,119,0.000845,0.001928,public,Everyone,Place,E,,unrated_or_E


In [7]:
df_geo_and_lang.tail()

Unnamed: 0,subreddit_id,subreddit_name,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,users_in_subreddit_from_country_l28,total_users_in_country_l28,total_users_in_subreddit_l28,geo_country_code,posts_not_removed_l28,users_l7,num_of_countries_with_visits_l28,users_percent_by_country_avg,users_percent_by_country_stdev,type,rating_name,primary_topic,rating_short,over_18,grouped_rating
1428,t5_3p4uu,pppoker,Australia,True,0.181644,-0.013817,4.391619e-06,46810,True,False,95,21632112,523,AU,25,100,16,4.561865e-06,1.232122e-05,public,Mature,,M,f,over_18_or_X_M_D_V
1429,t5_5d99cb,skulduggerysubreddit,Australia,True,0.130597,-0.015117,1.617965e-06,66419,False,False,35,21632112,268,AU,14,77,10,1.662482e-06,2.944914e-06,public,Everyone,"Reading, Writing, and Literature",E,,unrated_or_E
1430,t5_4chanv,polygirlsgonewild,Australia,False,0.302839,-0.020905,8.875694e-05,8175,True,False,1920,21632112,6340,AU,10,1549,11,9.433286e-05,0.000266732,public,,,,,unrated_or_E
1431,t5_2vxyq,nrlwarriors,Australia,True,0.305019,-0.277847,3.651978e-06,50299,True,False,79,21632112,259,AU,19,80,5,7.874533e-06,1.519741e-05,public,Everyone,,E,,unrated_or_E
1432,t5_2umjs,thelettera,Australia,True,0.021127,-0.991383,4.160481e-07,91801,False,False,9,21632112,426,AU,12,157,20,9.029383e-07,4.911221e-07,public,Everyone,Internet Culture and Memes,E,,unrated_or_E


# Load model labels

The clusters now live in a BigQuery table and have standardized names, so pull the data from there.

## Pull data from BigQuery


In [8]:
%%time
%%bigquery df_labels --project data-science-prod-218515 

-- select subreddit clusters from bigQuery

SELECT
    sc.subreddit_id
    , sc.subreddit_name
    , nt.primary_topic

    , sc.* EXCEPT(subreddit_id, subreddit_name, primary_topic_1214)
FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` sc
    LEFT JOIN (
        -- New view should be visible to all, but still comes from cnc_taxonomy_cassandra_sync
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE DATE(pt) = (CURRENT_DATE() - 2)
    ) AS nt
        ON sc.subreddit_id = nt.subreddit_id
;

Query complete after 0.01s: 100%|███████████████████████████████████████████████████| 3/3 [00:00<00:00, 1419.07query/s]
Downloading: 100%|██████████████████████████████████████████████████████████| 49558/49558 [00:01<00:00, 38298.01rows/s]

CPU times: user 307 ms, sys: 232 ms, total: 539 ms
Wall time: 5.85 s





In [9]:
print(df_labels.shape)
df_labels.head()

(49558, 51)


Unnamed: 0,subreddit_id,subreddit_name,primary_topic,model_sort_order,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label,k_0063_label,k_0079_label,k_0085_label,k_0118_label,k_0320_label,k_0657_label,k_0958_label,k_1065_label,k_1560_label,k_1840_label,k_2207_label,k_2351_label,k_2830_label,k_3145_label,k_3411_label,k_3706_label,k_3864_label,k_3927_label,k_0013_majority_primary_topic,k_0023_majority_primary_topic,k_0041_majority_primary_topic,k_0059_majority_primary_topic,k_0063_majority_primary_topic,k_0079_majority_primary_topic,k_0085_majority_primary_topic,k_0118_majority_primary_topic,k_0320_majority_primary_topic,k_0657_majority_primary_topic,k_0958_majority_primary_topic,k_1065_majority_primary_topic,k_1560_majority_primary_topic,k_1840_majority_primary_topic,k_2207_majority_primary_topic,k_2351_majority_primary_topic,k_2830_majority_primary_topic,k_3145_majority_primary_topic,k_3411_majority_primary_topic,k_3706_majority_primary_topic,k_3864_majority_primary_topic,k_3927_majority_primary_topic,table_creation_date,mlflow_run_uuid
0,t5_5a9iie,progonlydj,,40079,1000,12,19,34,49,52,65,69,97,267,538,780,868,1261,1489,1783,1900,2280,2528,2732,2965,3090,3136,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1,t5_2x9c7,googleplaymusic,Music,40080,31,12,19,34,49,52,65,69,97,267,538,780,868,1261,1489,1783,1900,2280,2528,2732,2965,3090,3136,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
2,t5_3jzsk,ravedj,Music,40081,1000,12,19,34,49,52,65,69,97,267,538,780,868,1261,1489,1783,1900,2280,2528,2732,2965,3090,3136,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
3,t5_2rgie,happyhardcore,Music,40082,152,12,19,34,49,52,65,69,97,267,538,780,868,1261,1489,1783,1900,2280,2528,2732,2965,3090,3136,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
4,t5_2ruv2,ukhardcore,,40083,21,12,19,34,49,52,65,69,97,267,538,780,868,1261,1489,1783,1900,2280,2528,2732,2965,3090,3136,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,Music,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5


In [10]:
counts_describe(df_labels)

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
subreddit_id,object,49558,49558,100.00%,0,0.00%
subreddit_name,object,49558,49558,100.00%,0,0.00%
primary_topic,object,40700,52,0.13%,8858,17.87%
model_sort_order,int64,49558,49558,100.00%,0,0.00%
posts_for_modeling_count,int64,49558,999,2.02%,0,0.00%
k_0013_label,int64,49558,13,0.03%,0,0.00%
k_0023_label,int64,49558,23,0.05%,0,0.00%
k_0041_label,int64,49558,41,0.08%,0,0.00%
k_0059_label,int64,49558,59,0.12%,0,0.00%
k_0063_label,int64,49558,63,0.13%,0,0.00%


# Keep only labels for Target subreddits


In [11]:
l_ix_subs = ['subreddit_name', 'subreddit_id']
col_sort_order = 'model_sort_order'

df_labels_target = (
    df_labels
    .merge(
        df_geo_and_lang
        .drop(['primary_topic'], axis=1)
        ,
        how='right',
        on=l_ix_subs,
    )
    .copy()
    .sort_values(by=[col_sort_order], ascending=True)
)

# move some columns to the end of the dataframe
l_cols_to_end = ['table_creation_date', 'mlflow_run_uuid']

df_labels_target = df_labels_target[
    df_labels_target.drop(l_cols_to_end, axis=1).columns.to_list() +
    l_cols_to_end
]

# move cols to front
l_cols_to_front = [
    col_sort_order,
    'subreddit_id',
    'subreddit_name',
    'primary_topic',
    'rating_short',
    'rating_name',
    'over_18',
]
df_labels_target = df_labels_target[
    reorder_array(l_cols_to_front, df_labels_target.columns)
]
print(df_labels_target.shape)

(1433, 73)
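The column reordering in the cell above can be sketched with plain lists. This is not the `subclu` implementation: `reorder_array` is not shown in this notebook, so its behavior is assumed from how it is called, and `move_to_front` / `move_to_end` are hypothetical names.

```python
def move_to_front(front, columns):
    """Return `columns` with the names in `front` first, the rest in original order."""
    front_present = [c for c in front if c in columns]
    front_set = set(front_present)
    return front_present + [c for c in columns if c not in front_set]


def move_to_end(end, columns):
    """Return `columns` with the names in `end` last, the rest in original order."""
    end_present = [c for c in end if c in columns]
    end_set = set(end_present)
    return [c for c in columns if c not in end_set] + end_present


cols = ["subreddit_id", "table_creation_date", "k_0013_label", "model_sort_order"]
cols = move_to_end(["table_creation_date"], cols)
cols = move_to_front(["model_sort_order", "subreddit_id"], cols)
print(cols)
```

With a DataFrame, the resulting list would be used as `df[cols]`, the same way the cell indexes `df_labels_target`.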


In [12]:
df_labels_target.head()

Unnamed: 0,model_sort_order,subreddit_id,subreddit_name,primary_topic,rating_short,rating_name,over_18,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label,k_0063_label,k_0079_label,k_0085_label,k_0118_label,k_0320_label,k_0657_label,k_0958_label,k_1065_label,k_1560_label,k_1840_label,k_2207_label,k_2351_label,k_2830_label,k_3145_label,k_3411_label,k_3706_label,k_3864_label,k_3927_label,...,k_1840_majority_primary_topic,k_2207_majority_primary_topic,k_2351_majority_primary_topic,k_2830_majority_primary_topic,k_3145_majority_primary_topic,k_3411_majority_primary_topic,k_3706_majority_primary_topic,k_3864_majority_primary_topic,k_3927_majority_primary_topic,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,users_in_subreddit_from_country_l28,total_users_in_country_l28,total_users_in_subreddit_l28,geo_country_code,posts_not_removed_l28,users_l7,num_of_countries_with_visits_l28,users_percent_by_country_avg,users_percent_by_country_stdev,type,grouped_rating,table_creation_date,mlflow_run_uuid
424,334,t5_44faux,neighboursbabez,Mature Themes and Adult Content,,,,36,1,1,1,1,1,1,1,1,3,5,6,7,10,10,13,13,21,23,26,30,31,31,...,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,0.117099,2.921178,3.8e-05,14648,False,True,817,21632112,6977,AU,22,2140,64,1.2e-05,9e-06,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1247,489,t5_iqt8v,altladyboners,Celebrity,E,Everyone,f,108,1,1,1,1,1,2,2,2,4,8,11,13,18,19,24,25,35,40,45,52,54,55,...,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,0.056172,2.060629,5.6e-05,11348,False,True,1205,21632112,21452,AU,53,5521,102,2.7e-05,1.4e-05,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1238,509,t5_3g9c8,rtgirls,Podcasts and Streamers,E,Everyone,,36,1,1,1,1,1,2,2,2,4,8,11,13,19,20,25,26,36,41,46,53,55,56,...,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,0.05491,2.064154,2.6e-05,18574,False,True,562,21632112,10235,AU,16,3134,56,9e-06,8e-06,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
437,510,t5_2bfy1d,neighboursbabes,Mature Themes and Adult Content,E,Everyone,f,170,1,1,1,1,1,2,2,2,4,8,11,13,19,20,25,26,36,41,46,53,55,56,...,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,0.13045,2.901861,8.8e-05,8269,False,True,1894,21632112,14519,AU,49,3944,80,2.3e-05,2.2e-05,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
108,661,t5_39o6d,aussiebabes,Mature Themes and Adult Content,E,Everyone,,21,1,1,1,1,1,2,2,2,5,9,12,14,22,26,31,32,43,50,55,63,65,66,...,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,True,0.410733,5.667158,7.3e-05,9421,True,True,1569,21632112,3820,AU,16,945,44,6e-06,1.2e-05,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5


### Drop subs with too few posts

The modeling process drops subreddits with too few posts, so they have no cluster labels and we have no recommendations for them. After the right merge they show up with a null `model_sort_order`, so drop them here.

It would also not be a great experience to recommend dead subs.

In [13]:

print(f"{df_labels_target[col_sort_order].isnull().sum():,.0f} <- subs to drop")
df_labels_target = df_labels_target[
    ~df_labels_target[col_sort_order].isnull()
].copy()
df_labels_target[col_sort_order] = df_labels_target[col_sort_order].astype(int)

l_cols_label_de = [c for c in df_labels_target.columns if c.endswith('_label')]
df_labels_target[l_cols_label_de] = df_labels_target[l_cols_label_de].astype(int)

df_labels_target.shape

0 <- subs to drop


(1433, 73)
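An alternative way to see which target subs lack labels is the `indicator` option of `DataFrame.merge`. A minimal sketch with toy data follows; the frame names and values are made up for illustration and are not from the real tables.

```python
import pandas as pd

# Toy stand-ins for df_labels and df_geo_and_lang (values are made up)
df_labels_toy = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "subreddit_name": ["sub_a", "sub_b"],
    "model_sort_order": [10, 20],
})
df_geo_toy = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b", "t5_c"],
    "subreddit_name": ["sub_a", "sub_b", "sub_c"],
    "country_name": ["Australia"] * 3,
})

df_merged = df_labels_toy.merge(
    df_geo_toy,
    how="right",
    on=["subreddit_name", "subreddit_id"],
    indicator=True,  # adds a `_merge` column: both / left_only / right_only
)

# Rows that exist only on the right side have no cluster labels
n_unlabeled = (df_merged["_merge"] == "right_only").sum()
print(f"{n_unlabeled:,} <- subs without labels")

# Keep only rows that matched a label, then drop the helper column
df_kept = df_merged[df_merged["_merge"] == "both"].drop(columns="_merge")
```

In this notebook the SQL query already inner-joins against the cluster table, so every returned sub should have labels, which is consistent with the `0 <- subs to drop` output.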

In [14]:
style_df_numeric(
    df_labels_target.head(10),
    # rename_cols_for_display=True,
    pct_labels=['_percent_in_country', '_percent'],
    int_labels=None,
    pct_cols=['users_percent_in_country'],
    l_bar_simple=[c for c in df_labels_target.columns if '_label' in c]
)

Unnamed: 0,model_sort_order,subreddit_id,subreddit_name,primary_topic,rating_short,rating_name,over_18,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label,k_0063_label,k_0079_label,k_0085_label,k_0118_label,k_0320_label,k_0657_label,k_0958_label,k_1065_label,k_1560_label,k_1840_label,k_2207_label,k_2351_label,k_2830_label,k_3145_label,k_3411_label,k_3706_label,k_3864_label,k_3927_label,k_0013_majority_primary_topic,k_0023_majority_primary_topic,k_0041_majority_primary_topic,k_0059_majority_primary_topic,k_0063_majority_primary_topic,k_0079_majority_primary_topic,k_0085_majority_primary_topic,k_0118_majority_primary_topic,k_0320_majority_primary_topic,k_0657_majority_primary_topic,k_0958_majority_primary_topic,k_1065_majority_primary_topic,k_1560_majority_primary_topic,k_1840_majority_primary_topic,k_2207_majority_primary_topic,k_2351_majority_primary_topic,k_2830_majority_primary_topic,k_3145_majority_primary_topic,k_3411_majority_primary_topic,k_3706_majority_primary_topic,k_3864_majority_primary_topic,k_3927_majority_primary_topic,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,users_in_subreddit_from_country_l28,total_users_in_country_l28,total_users_in_subreddit_l28,geo_country_code,posts_not_removed_l28,users_l7,num_of_countries_with_visits_l28,users_percent_by_country_avg,users_percent_by_country_stdev,type,grouped_rating,table_creation_date,mlflow_run_uuid
424,334,t5_44faux,neighboursbabez,Mature Themes and Adult Content,,,,36,1,1,1,1,1,1,1,1,3,5,6,7,10,10,13,13,21,23,26,30,31,31,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Celebrity,Celebrity,Celebrity,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,11.71%,3,0,14648,False,True,817,21632112,6977,AU,22,2140,64,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1247,489,t5_iqt8v,altladyboners,Celebrity,E,Everyone,f,108,1,1,1,1,1,2,2,2,4,8,11,13,18,19,24,25,35,40,45,52,54,55,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,5.62%,2,0,11348,False,True,1205,21632112,21452,AU,53,5521,102,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1238,509,t5_3g9c8,rtgirls,Podcasts and Streamers,E,Everyone,,36,1,1,1,1,1,2,2,2,4,8,11,13,19,20,25,26,36,41,46,53,55,56,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,5.49%,2,0,18574,False,True,562,21632112,10235,AU,16,3134,56,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
437,510,t5_2bfy1d,neighboursbabes,Mature Themes and Adult Content,E,Everyone,f,170,1,1,1,1,1,2,2,2,4,8,11,13,19,20,25,26,36,41,46,53,55,56,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,13.04%,3,0,8269,False,True,1894,21632112,14519,AU,49,3944,80,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
108,661,t5_39o6d,aussiebabes,Mature Themes and Adult Content,E,Everyone,,21,1,1,1,1,1,2,2,2,5,9,12,14,22,26,31,32,43,50,55,63,65,66,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,True,41.07%,6,0,9421,True,True,1569,21632112,3820,AU,16,945,44,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
47,662,t5_28i70j,downundercelebs,Celebrity,E,Everyone,f,6,1,1,1,1,1,2,2,2,5,9,12,14,22,26,31,32,43,50,55,63,65,66,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,True,38.22%,7,0,5415,True,True,3261,21632112,8532,AU,45,1888,63,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
123,730,t5_3fruf,girlstennis,Celebrity,E,Everyone,,821,1,1,1,1,1,2,2,2,5,10,13,15,23,27,32,33,44,51,56,65,67,68,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Celebrity,Celebrity,Celebrity,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,14.20%,5,0,2332,True,True,8827,21632112,62174,AU,433,19969,116,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
365,1045,t5_4dgbcr,laetitiabrown,,,,f,39,1,1,2,2,2,3,3,3,6,12,15,17,28,34,40,41,57,65,71,81,87,88,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Celebrity,Australia,False,29.35%,3,0,36907,True,True,162,21632112,552,AU,16,194,16,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1410,1105,t5_5bz2ml,polyslartz,,E,Everyone,f,18,1,1,2,2,2,3,3,3,6,12,16,18,31,37,43,45,62,70,76,86,92,93,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,False,26.30%,1,0,1319,True,False,15475,21632112,58846,AU,85,7112,82,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
326,1199,t5_4317um,jadetunchy,,,,f,40,1,1,2,2,2,3,3,3,7,13,17,19,32,38,44,47,65,73,80,92,99,100,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Australia,True,58.54%,3,0,18097,True,True,586,21632112,1001,AU,51,354,13,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5


In [15]:
style_df_numeric(
    df_labels_target.tail(10),
    # rename_cols_for_display=True,
    pct_labels=['_percent_in_country', '_percent'],
    int_labels=None,
    pct_cols=['users_percent_in_country'],
    l_bar_simple=[c for c in df_labels_target.columns if '_label' in c]
)

Unnamed: 0,model_sort_order,subreddit_id,subreddit_name,primary_topic,rating_short,rating_name,over_18,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label,k_0063_label,k_0079_label,k_0085_label,k_0118_label,k_0320_label,k_0657_label,k_0958_label,k_1065_label,k_1560_label,k_1840_label,k_2207_label,k_2351_label,k_2830_label,k_3145_label,k_3411_label,k_3706_label,k_3864_label,k_3927_label,k_0013_majority_primary_topic,k_0023_majority_primary_topic,k_0041_majority_primary_topic,k_0059_majority_primary_topic,k_0063_majority_primary_topic,k_0079_majority_primary_topic,k_0085_majority_primary_topic,k_0118_majority_primary_topic,k_0320_majority_primary_topic,k_0657_majority_primary_topic,k_0958_majority_primary_topic,k_1065_majority_primary_topic,k_1560_majority_primary_topic,k_1840_majority_primary_topic,k_2207_majority_primary_topic,k_2351_majority_primary_topic,k_2830_majority_primary_topic,k_3145_majority_primary_topic,k_3411_majority_primary_topic,k_3706_majority_primary_topic,k_3864_majority_primary_topic,k_3927_majority_primary_topic,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,users_in_subreddit_from_country_l28,total_users_in_country_l28,total_users_in_subreddit_l28,geo_country_code,posts_not_removed_l28,users_l7,num_of_countries_with_visits_l28,users_percent_by_country_avg,users_percent_by_country_stdev,type,grouped_rating,table_creation_date,mlflow_run_uuid
1333,48736,t5_2qkt4,martialarts,Sports,E,Everyone,f,1000,13,23,41,59,63,79,85,116,311,636,927,1034,1511,1785,2142,2282,2747,3048,3309,3594,3748,3810,Animals and Pets,Anime,Music,Music,Music,Music,Music,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Australia,False,5.07%,2,0,1682,False,True,12297,21632112,242392,AU,1258,83883,119,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
442,48740,t5_2qn02,bjj,Sports,E,Everyone,f,1000,13,23,41,59,63,79,85,116,311,636,927,1034,1511,1785,2142,2282,2747,3048,3309,3594,3748,3810,Animals and Pets,Anime,Music,Music,Music,Music,Music,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Australia,False,5.66%,3,0,724,False,True,27124,21632112,479233,AU,2179,155916,119,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1239,48741,t5_2t06q,jiujitsu,Sports,E,Everyone,,152,13,23,41,59,63,79,85,116,311,636,927,1034,1511,1785,2142,2282,2747,3048,3309,3594,3748,3810,Animals and Pets,Anime,Music,Music,Music,Music,Music,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Australia,False,4.44%,2,0,11514,False,True,1181,21632112,26629,AU,145,7684,89,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
700,48757,t5_2qhj4,mma,Sports,E,Everyone,f,1000,13,23,41,59,63,79,85,116,311,637,928,1035,1512,1786,2143,2283,2748,3049,3310,3595,3749,3811,Animals and Pets,Anime,Music,Music,Music,Music,Music,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Australia,False,5.56%,2,0,199,False,True,90404,21632112,1624548,AU,1007,533579,119,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
649,48790,t5_2sljg,squaredcircle,Sports,E,Everyone,,1000,13,23,41,59,63,79,85,116,312,638,930,1037,1514,1788,2145,2285,2750,3051,3312,3598,3752,3814,Animals and Pets,Anime,Music,Music,Music,Music,Music,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Sports,Australia,False,4.17%,2,0,215,False,True,85268,21632112,2046254,AU,6789,769828,119,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
785,49002,t5_2qiu1,upvote,Internet Culture and Memes,E,Everyone,f,1000,13,23,41,59,63,79,85,118,315,643,935,1042,1523,1797,2154,2295,2762,3064,3325,3613,3768,3830,Animals and Pets,Anime,Music,Music,Music,Music,Music,Mature Themes and Adult Content,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Meta/Reddit,Australia,False,6.94%,2,0,10703,False,True,1311,21632112,18888,AU,809,4570,97,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
368,49010,t5_oizrm,namenerdcirclejerk,Internet Culture and Memes,E,Everyone,f,1000,13,23,41,59,63,79,85,118,316,644,936,1043,1524,1798,2155,2296,2763,3065,3326,3614,3769,3831,Animals and Pets,Anime,Music,Music,Music,Music,Music,Mature Themes and Adult Content,Internet Culture and Memes,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Australia,False,4.71%,3,0,5305,False,True,3343,21632112,71044,AU,627,28300,108,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
236,49012,t5_2xmrc,namenerds,,E,Everyone,,1000,13,23,41,59,63,79,85,118,316,644,936,1043,1524,1798,2155,2296,2763,3065,3326,3614,3769,3831,Animals and Pets,Anime,Music,Music,Music,Music,Music,Mature Themes and Adult Content,Internet Culture and Memes,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Animals and Pets,Australia,False,4.97%,4,0,705,False,True,27840,21632112,560720,AU,2977,194388,119,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
683,49240,t5_3dy72i,asx_banned,,E,Everyone,,118,13,23,41,59,63,79,85,118,316,649,947,1054,1541,1818,2179,2322,2797,3105,3368,3662,3818,3880,Animals and Pets,Anime,Music,Music,Music,Music,Music,Mature Themes and Adult Content,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Australia,True,82.70%,2,0,12023,True,True,1109,21632112,1341,AU,53,725,8,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5
1298,49241,t5_54a78i,asx_banned_purgatory,Trauma Support,E,Everyone,,277,13,23,41,59,63,79,85,118,316,649,947,1054,1541,1818,2179,2322,2797,3105,3368,3662,3818,3880,Animals and Pets,Anime,Music,Music,Music,Music,Music,Mature Themes and Adult Content,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Internet Culture and Memes,Australia,False,65.22%,2,0,36626,True,True,165,21632112,253,AU,20,81,6,0,0,public,unrated_or_E,2022-02-03 20:37:02.045516+00:00,e37b0a2c3af54c588818e7efdde15df5


In [16]:
# style_df_numeric(
#     df_labels_target.tail(10),
#     # rename_cols_for_display=True,
#     l_bar_simple=[c for c in df_labels_target.columns if '_label' in c]
# )

# Filter out subs [optional]

UPDATE: For now, let's include all the subreddits for QA: this list could help us rate or flag high-traffic subreddits that are unrated or mis-rated.

Now that we have even more clusters (over 3,000), it's harder to figure out where to set the threshold for clusters to exclude.

--- 

The main use case for now is SFW subs, so we could save some QA time by excluding these subs:
- Exclude NSFW clusters
- Exclude place subs


~We'll use the cluster labels to discard subreddits because~
- many of the DE subreddits don't have a `primary_topic`
- if the majority of subs in a cluster are NSFW, then we wouldn't want to recommend those anyway

In [17]:
# # we can see that the subreddit count changes as we go 
# #  from shallow to deeper cluster counts
# value_counts_and_pcts(
#     df_labels_target['k_0118_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [18]:
# value_counts_and_pcts(
#     df_labels_target['k_3145_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [19]:
# value_counts_and_pcts(
#     df_labels_target['k_3927_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [20]:
# # And the count is slightly different from the known primary topics
# #  We still have a large number of subs w/o a primary topic
# value_counts_and_pcts(
#     df_labels_target['primary_topic'],
#     count_type='subreddits',
#     reset_index=True,
#     add_col_prefix=False,
# )

In [21]:
print(f"{df_labels_target.shape} <- Shape before filtering")

l_manual_subs_to_remove = [
    'sexmeets1', 'fuck',
]
col_cluster_filter = 'k_3145_majority_primary_topic'
df_labels_target_clean = (
    df_labels_target[df_labels_target[col_cluster_filter] != 'Mature Themes and Adult Content']
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping NSFW clusters")

l_sensitive_topics = [
    'Military', 'Gender', 'Addiction Support',
    'Medical and Mental Health', 'Sexual Orientation',
    'Culture, Race, and Ethnicity',
]
df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean[col_cluster_filter].isin(l_sensitive_topics)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping Sensitive clusters")

df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean['primary_topic'].isin(l_sensitive_topics)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping SENSITIVE subreddits")


df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean['subreddit_name'].isin(l_manual_subs_to_remove)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping Manual list of subreddits")

print("  ** TODO: instead of excluding place subs, add logic to map hierarchy **")
# df_labels_target_clean = (
#     df_labels_target_clean[df_labels_target_clean['primary_topic'] != 'Place']
# )
# print(f"{df_labels_target_clean.shape} <- Shape after dropping Place subreddits")

(1433, 73) <- Shape before filtering
(1354, 73) <- Shape after dropping NSFW clusters
(1262, 73) <- Shape after dropping Sensitive clusters
(1246, 73) <- Shape after dropping SENSITIVE subreddits
(1246, 73) <- Shape after dropping Manual list of subreddits
  ** TODO: instead of excluding place subs, add logic to map hierarchy **


# Create new clustering logic to resize clusters

We want to balance two things:
- prevent orphan subreddits
- prevent clusters that are too large to be meaningful

To do this at a country level, we're better off starting with the smallest (deepest) clusters and rolling up until we have at least N subreddits in one cluster.
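
Here's a rough pandas sketch of the roll-up idea. It is NOT the `create_dynamic_clusters` implementation from `subclu`: the function name `roll_up_small_clusters`, the toy label columns, and the fallback rule are all assumptions made for illustration.

```python
import pandas as pd


def roll_up_small_clusters(df, label_cols, min_subs):
    """Hypothetical sketch: give each row the deepest label column whose
    cluster holds at least `min_subs` rows.

    `label_cols` must be ordered from shallow (few clusters) to deep.
    Rows that never reach `min_subs` fall back to the shallowest column.
    """
    out = pd.DataFrame(index=df.index)
    out['cluster_label_k'] = label_cols[0]
    out['cluster_label'] = df[label_cols[0]]
    assigned = pd.Series(False, index=df.index)

    # walk from deepest to shallowest; the first level that is big enough wins
    for col in reversed(label_cols):
        sizes = df[col].map(df[col].value_counts())
        take = (~assigned) & (sizes >= min_subs)
        out.loc[take, 'cluster_label_k'] = col
        out.loc[take, 'cluster_label'] = df.loc[take, col]
        assigned |= take
    return out


# toy data: 6 subreddits, 2 clustering depths (made-up labels)
df_toy = pd.DataFrame({
    'k_0002_label': [1, 1, 1, 2, 2, 2],
    'k_0004_label': [1, 1, 2, 3, 3, 3],
})
df_rolled = roll_up_small_clusters(
    df_toy, ['k_0002_label', 'k_0004_label'], min_subs=2
)
print(df_rolled)
```

In the toy output, the third row is the only one that rolls up to `k_0002_label`, so it ends up alone at that level. That's the same kind of orphan we look for below. Also note that label values from different `k` columns can collide, so in this sketch a cluster is only unique as the pair (`cluster_label_k`, `cluster_label`).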


In [22]:
# even if cluster at k < 20 is generic, keep it to avoid orphan subs
col_new_cluster_val = 'cluster_label'
col_new_cluster_name = 'cluster_label_k'
col_new_cluster_prim_topic = 'cluster_majority_primary_topic'

l_cols_labels = (
    [c for c in df_labels_target.columns 
     if all([c != col_new_cluster_val, c.endswith('_label')])
    ]
    # [1:]  # use all the columns! helps prevent a bunch of orphan subs
)

l_iteration_results = list()
col_num_orph_subs = 'num_orphan_subreddits'
col_num_subs_mean = 'num_subreddits_per_cluster_mean'
col_num_subs_median = 'num_subreddits_per_cluster_median'

# TODO(djb): move this loop to a function
for n_ in tqdm([2, 3, 4, 5, 6, 7, 8, 9, 10, 11]): 
    d_run_clean = dict()
    d_run_raw = dict()
    d_run_clean['df'] = 'clean'
    d_run_raw['df'] = 'raw'
    d_run_clean['min_subreddits_in_cluster'] = n_
    d_run_raw['min_subreddits_in_cluster'] = n_
    
    _clean = create_dynamic_clusters(
        df_labels_target_clean,
        agg_strategy='aggregate_small_clusters',
        min_subreddits_in_cluster=n_,
        l_cols_labels_input=l_cols_labels,
        col_new_cluster_val=col_new_cluster_val,
        col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    )
    d_run_clean['cluster_count'] = _clean[col_new_cluster_val].nunique()
    df_vc_clean = _clean[col_new_cluster_val].value_counts()
    dv_vc_below_threshold = df_vc_clean[df_vc_clean <= 1]
    d_run_clean[col_num_orph_subs] = len(dv_vc_below_threshold)
    d_run_clean[col_num_subs_mean] = df_vc_clean.mean()
    d_run_clean[col_num_subs_median] = df_vc_clean.median()
    d_run_clean['cluster_ids_with_orphans'] = sorted(list(dv_vc_below_threshold.index))
    
    _raw  = create_dynamic_clusters(
        df_labels_target,
        agg_strategy='aggregate_small_clusters',
        min_subreddits_in_cluster=n_,
        l_cols_labels_input=l_cols_labels,
        col_new_cluster_val=col_new_cluster_val,
        col_new_cluster_name=col_new_cluster_name,
        col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    )
    d_run_raw['cluster_count'] = _raw[col_new_cluster_val].nunique()
    df_vc_raw = _raw[col_new_cluster_val].value_counts()
    dv_vc_below_thresh_raw = df_vc_raw[df_vc_raw <= 1]
    d_run_raw[col_num_orph_subs] = len(dv_vc_below_thresh_raw)
    d_run_raw[col_num_subs_mean] = df_vc_raw.mean()
    d_run_raw[col_num_subs_median] = df_vc_raw.median()
    d_run_raw['cluster_ids_with_orphans'] = sorted(list(dv_vc_below_thresh_raw.index))
    
    l_iteration_results.append(d_run_clean)
    l_iteration_results.append(d_run_raw)
    del _clean, _raw, d_run_raw, d_run_clean, df_vc_clean, df_vc_raw

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  2.84it/s]


Find the optimal `min_subreddits_in_cluster` based on:
- orphan count
- number of clusters
- other info (e.g., mean & median subreddits per cluster)

The best value might be different for each of `df_labels_target_dynamic_clean` & `df_labels_target_dynamic_raw`.

In [23]:
def highlight_below_threshold(val, threshold=1):
    if val <= threshold:
        return "color:purple; font-weight: bold; background-color:yellow;"
    else:
        return ''
    

(
    style_df_numeric(
        pd.DataFrame(l_iteration_results)
        # .set_index(['min_subreddits_in_cluster'])
        ,
        rename_cols_for_display=True,
        l_bar_simple=[col_num_orph_subs,
                      col_num_subs_median,
                      ]
    )
    .applymap(highlight_below_threshold, subset=[col_num_orph_subs.replace('_', ' ')])
)

Unnamed: 0,df,min subreddits in cluster,cluster count,num orphan subreddits,num subreddits per cluster mean,num subreddits per cluster median,cluster ids with orphans
0,clean,2,308,3,4.05,3.0,"['0005', '0008', '0009']"
1,raw,2,359,5,3.99,3.0,"['0001', '0004', '0009', '0010', '0012']"
2,clean,3,244,3,5.11,4.0,"['0005', '0007', '0009']"
3,raw,3,277,3,5.17,5.0,"['0001', '0007', '0009']"
4,clean,4,197,1,6.32,6.0,['0005']
5,raw,4,229,3,6.26,6.0,"['0007', '0009', '0010']"
6,clean,5,170,3,7.33,7.0,"['0005', '0010', '0011']"
7,raw,5,193,1,7.42,7.0,['0011']
8,clean,6,147,2,8.48,8.0,"['0005', '0011']"
9,raw,6,166,1,8.63,8.0,['0011']
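
One way to turn the table above into a pick is sketched here. The rule (fewest orphans first, then the smallest threshold) is an assumption; in this notebook the values were chosen by eye. The numbers are copied from the rows shown above, so thresholds that aren't displayed are not considered.

```python
import pandas as pd

# rows copied by hand from the displayed results table (first 10 rows only)
df_results = pd.DataFrame({
    'df': ['clean', 'raw'] * 5,
    'min_subreddits_in_cluster': [2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    'num_orphan_subreddits': [3, 5, 3, 3, 1, 3, 3, 1, 2, 1],
})

# assumed rule: fewest orphans first, ties go to the smallest threshold
df_best = (
    df_results
    .sort_values(['df', 'num_orphan_subreddits', 'min_subreddits_in_cluster'])
    .groupby('df', as_index=False)
    .first()
)
print(df_best)
```

With these rows the rule picks 4 for `clean` and 5 for `raw`.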


In [24]:
df_labels_target_dynamic_clean = create_dynamic_clusters(
    df_labels_target_clean,
    agg_strategy='aggregate_small_clusters',
    min_subreddits_in_cluster=4,
    l_cols_labels_input=l_cols_labels,
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    verbose=False,
)

df_labels_target_dynamic_raw = create_dynamic_clusters(
    df_labels_target,
    agg_strategy='aggregate_small_clusters',
    min_subreddits_in_cluster=5,
    l_cols_labels_input=l_cols_labels,
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    col_new_cluster_prim_topic=col_new_cluster_prim_topic,
)

## Investigate orphans & manually reassign them

Focus on the raw output; we won't worry about "clean" for now.
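
A minimal sketch for listing the orphans themselves (subs that are the only member of their cluster). The toy frame and its label values are made up; on the real data the same two lines should work with `df_labels_target_dynamic_raw` and `col_new_cluster_val`.

```python
import pandas as pd

# toy data: labels are invented for illustration
df_toy = pd.DataFrame({
    'subreddit_name': ['bjj', 'jiujitsu', 'mma', 'upvote'],
    'cluster_label': ['0013-0023', '0013-0023', '0013-0023', '0013-0024'],
})

# size of the cluster each subreddit belongs to
cluster_sizes = df_toy['cluster_label'].map(df_toy['cluster_label'].value_counts())

# orphans = subreddits whose cluster has a single member
df_orphans = df_toy[cluster_sizes <= 1]
print(df_orphans)
```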

In [25]:
# (
#     df_labels_target_dynamic_clean
#     [df_labels_target_dynamic_clean['k_0013_label'] == 5]
#     .iloc[:20, :10]
# )

In [26]:
# (
#     df_labels_target_dynamic_raw
#     [df_labels_target_dynamic_raw['k_0013_label'] == 5]
#     .iloc[:20, :10]
# )

In [27]:
label_k_to_reassign_ = 'k_0320_label'
label_val_to_reassign_ = '0011-0018-0032-0043-0046-0058-0062-0087-0244'
subreddit_id_orphan_ = 't5_2tt7r'

mask_orphan_and_new_group = (
    (df_labels_target_dynamic_raw['subreddit_id'] == subreddit_id_orphan_) |
    (
        (df_labels_target_dynamic_raw[col_new_cluster_name] == label_k_to_reassign_) &
        (df_labels_target_dynamic_raw[col_new_cluster_val] == label_val_to_reassign_)
    )
)

label_k_new_ = 'k_0118_label'
label_val_new_col_ = f"{label_k_new_}_nested"
new_prim_topic_col_ = label_k_new_.replace('_label', '_majority_primary_topic')

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_name
] = label_k_new_

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_val
] = df_labels_target_dynamic_raw[mask_orphan_and_new_group][label_val_new_col_]

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_prim_topic
] = df_labels_target_dynamic_raw[mask_orphan_and_new_group][new_prim_topic_col_]

In [28]:
# style_df_numeric(
#     df_labels_target_dynamic_clean
#     [df_labels_target_dynamic_clean['k_0013_label'] == 11]
#     .iloc[:20, :80]
#     ,
#     l_bar_simple=[c for c in list(df_labels_target_dynamic_clean.columns)[5:30] if c.endswith('_label')]
# )

In [29]:
# style_df_numeric(
#     df_labels_target_dynamic_raw
#     [df_labels_target_dynamic_raw['k_0013_label'] == 11]
#     .iloc[:10, :30]
#     ,
#     l_bar_simple=[c for c in list(df_labels_target_dynamic_clean.columns)[5:30] if c.endswith('_label')]
# )

In [30]:
# value_counts_and_pcts(
#     df_labels_target_dynamic_clean,
#     ['cluster_label'],
#     top_n=10,
# )
value_counts_and_pcts(
    df_labels_target_dynamic_clean,
    ['cluster_label'],
    top_n=None,
    return_df=True
)['count'].describe()

count    197.000000
mean       6.324873
std        2.400264
min        1.000000
25%        5.000000
50%        6.000000
75%        7.000000
max       22.000000
Name: count, dtype: float64

In [31]:
print(df_labels_target_dynamic_raw['cluster_label'].nunique())
display(
    value_counts_and_pcts(
        df_labels_target_dynamic_raw,
        ['cluster_label'],
        top_n=10,
    )
)
value_counts_and_pcts(
    df_labels_target_dynamic_raw,
    ['cluster_label'],
    top_n=None,
    return_df=True
)['count'].describe()

192


Unnamed: 0_level_0,count,percent,cumulative_percent
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0007-0011-0019-0026-0027-0036-0037-0047-0136-0274-0404-0454-0659-0777-0934-0994-1201-1336-1439-1566-1639-1664,22,1.5%,1.5%
0007-0012-0020-0027-0028-0037-0038-0048-0138-0277-0409-0459-0669-0790-0950-1011-1219-1358-1462-1593-1668-1693,17,1.2%,2.7%
0010-0016-0029-0039-0041-0052-0055-0077-0217-0433-0625-0699-1010-1187-1430-1521-1815-2018-2181-2364-2464-2498,16,1.1%,3.8%
0006-0010-0018-0025-0025-0033-0034-0043,16,1.1%,5.0%
0008-0013-0023-0031-0033-0043-0045-0062,14,1.0%,5.9%
0010-0017-0030-0040-0043-0054-0057-0082-0227-0451-0651-0728-1054-1239-1489-1585-1895-2104-2272-2466-2572-2607,14,1.0%,6.9%
0006-0010-0017-0023-0023-0031-0032-0041,14,1.0%,7.9%
0013-0021-0037-0054-0057-0072-0076-0106-0287,13,0.9%,8.8%
0013-0021-0037-0054-0057-0072-0077-0107-0289-0587-0852-0948-1381-1634-1953-2081-2503-2771-3001,12,0.8%,9.6%
0011-0018-0033,12,0.8%,10.5%


count    192.000000
mean       7.463542
std        2.388613
min        2.000000
25%        6.000000
50%        7.000000
75%        8.000000
max       22.000000
Name: count, dtype: float64

### How deep are the clusters?

Looks like some peaks around 100, 300, 1k, and 4k clusters.

In [32]:
print(df_labels_target_dynamic_raw[col_new_cluster_name].nunique())
value_counts_and_pcts(
    df_labels_target_dynamic_raw,
    [col_new_cluster_name],
    top_n=None,
    sort_index=True,
)

22


Unnamed: 0_level_0,count,percent,cumulative_percent
cluster_label_k,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
k_0013_label,38,2.7%,2.7%
k_0023_label,29,2.0%,4.7%
k_0041_label,33,2.3%,7.0%
k_0059_label,16,1.1%,8.1%
k_0063_label,15,1.0%,9.1%
k_0079_label,12,0.8%,10.0%
k_0085_label,47,3.3%,13.3%
k_0118_label,258,18.0%,31.3%
k_0320_label,212,14.8%,46.1%
k_0657_label,80,5.6%,51.6%


# Export raw data: 1 row=1 subreddit

Make sure the rows are ordered by the sort-order column so that similar subs sit next to each other.

NOTE: I'll need to go back to colab because for some reason I can't get authenticated to create new sheets from my laptop **sigh**.

In [33]:
gspread.__version__

'5.0.0'

In [35]:
# # %%time
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

GSHEET_NAME = 'i18n Australia subreddits and clusters - model v0.4.1'
GSHEET_KEY = '1ujjoJ7uyTWz3P1aYgZTKyJGbETKAfk3vrom_ASaNUt0'
target_abbrev_ = 'au'

d_wsh_names = {
    'sub_raw': {
        'name': 'raw_data_per_subreddit',
    },
    'clusters_t2t_list_raw': {
        'name': f'raw_clusters_list_{target_abbrev_}_{target_abbrev_}',
    },
    'clusters_t2t_fpr_raw': {
        'name': f'raw_clusters_fpr_{target_abbrev_}_{target_abbrev_}',
    }
}
# SH_DE_2_DE_LISTING_BELOW = 'de_to_de_listing_below_raw_cluster_list_names_and_ids'

if GSHEET_KEY is not None:
    print(f"Opening google worksheet: {GSHEET_NAME} ...")
    sh = gc.open_by_key(GSHEET_KEY)
else:
    print(f"Creating google worksheet: {GSHEET_NAME} ...")
    sh = gc.create(GSHEET_NAME)

# create worksheets:
for d_ in d_wsh_names.values():
    sh_name = d_['name']
    try:
        d_['worksheet'] = sh.worksheet(sh_name)
        print(f"Opening tab/sheet: {sh_name} ...")
    except gspread.exceptions.WorksheetNotFound:
        # only create the tab when it's missing; let auth/API errors surface
        print(f"Creating tab/sheet: {sh_name} ...")
        d_['worksheet'] = sh.add_worksheet(sh_name, rows=5, cols=5)


APIError: {'code': 403, 'message': 'Request had insufficient authentication scopes.', 'status': 'PERMISSION_DENIED', 'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'ACCESS_TOKEN_SCOPE_INSUFFICIENT', 'domain': 'googleapis.com', 'metadata': {'service': 'sheets.googleapis.com', 'method': 'google.apps.sheets.v4.SpreadsheetsService.GetSpreadsheet'}}]}

## Save raw subreddit data - use for QA
Note that we have to use `fillna('')`.

Otherwise, we'll get errors because the gspread library doesn't know how to handle nulls (`np.nan` or `pd.NA`).
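
A small sketch of the payload shape, using a toy frame instead of `df_labels_target`. It only builds the list of rows (header first) and doesn't call the Sheets API; `json.dumps` is used here just as a stand-in check that nothing non-serializable is left in the rows.

```python
import json

import numpy as np
import pandas as pd

# toy data: one sub without a primary topic (null), one with
df_toy = pd.DataFrame({
    'subreddit_name': ['namenerds', 'bjj'],
    'primary_topic': [np.nan, 'Sports'],
})

# header row + one list per subreddit, with nulls replaced by empty strings
rows = [df_toy.columns.tolist()] + df_toy.fillna('').values.tolist()

# plain strings only, so the payload serializes cleanly
payload = json.dumps(rows)
print(rows)
```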

In [None]:
# %%time

# wsh_raw_sub_output.update([df_labels_target.columns.values.tolist()] + 
#                           df_labels_target.fillna('').values.tolist())

### We can read the data back to confirm it's as expected

In [None]:
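# Here's how to get the records as a dataframe
# NOTE: `wsh_raw_sub_output` isn't defined above, so take the handle from `d_wsh_names`
wsh_raw_sub_output = d_wsh_names['sub_raw']['worksheet']
pd.DataFrame(wsh_raw_sub_output.get_all_records())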

## Save target 2 target clusters - human readable
This one is mostly a quick way to visually inspect the clusters. It doesn't get used by other tasks.

In [None]:
# %%time

# wsh_raw_de2de_lbelow.update(
#     [df_target_to_target_list.columns.values.tolist()] + 
#     df_target_to_target_list.fillna('').values.tolist()
# )

## Save FPR (raw) format. 1 row = 1 subreddit with counterpart/cluster subs

The utility function `convert_distance_or_ab_to_list_for_fpr` does the reshaping in one call.
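
To make the target shape concrete, here's a toy version of the reshape: for each seed subreddit, collect the other subs in its cluster. This is only a sketch of the idea, not how `convert_distance_or_ab_to_list_for_fpr` is implemented. The column names are borrowed from this notebook, while the toy cluster labels and the comma-separated list format are assumptions.

```python
import pandas as pd

# toy data: cluster labels 'a' and 'b' are invented
df_toy = pd.DataFrame({
    'subreddit_name': ['bjj', 'jiujitsu', 'mma', 'namenerds', 'namenerdcirclejerk'],
    'cluster_label': ['a', 'a', 'a', 'b', 'b'],
})

# self-join on the cluster label, then drop pairs of a sub with itself
df_pairs = df_toy.merge(
    df_toy, on='cluster_label', suffixes=('_seed', '_counterpart')
)
df_pairs = df_pairs[
    df_pairs['subreddit_name_seed'] != df_pairs['subreddit_name_counterpart']
]

# 1 row = 1 seed subreddit with its counterparts
df_fpr = (
    df_pairs
    .groupby(['subreddit_name_seed', 'cluster_label'], as_index=False)
    .agg(
        counterpart_count=('subreddit_name_counterpart', 'nunique'),
        list_cluster_subreddit_names=(
            'subreddit_name_counterpart', lambda s: ', '.join(s)
        ),
    )
)
print(df_fpr)
```

An orphan sub would drop out of this sketch entirely (it has no counterparts after the self-pairs are removed), which is one more reason to reassign orphans before exporting.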

In [None]:
col_sort_order

In [None]:
%%time

df_target_to_target_list = convert_distance_or_ab_to_list_for_fpr(
    df_labels_target_dynamic_clean,
    convert_to_ab=True,
    col_counterpart_count='counterpart_count',
    col_list_cluster_names='list_cluster_subreddit_names',
    col_list_cluster_ids='list_cluster_subreddit_ids',
    l_cols_for_seeds=None,
    l_cols_for_clusters=None,
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    verbose=False,
)
df_target_to_target_list.shape

In [None]:
df_target_to_target_list.head(10)

In [None]:
df_target_to_target_list[
    df_target_to_target_list['subreddit_name_seed'].isin(l_subs_to_check_orphan)
]