# Purpose

### 2022-02-14
In this notebook I'll select the clusters for the new FPR experiments for Canada, UK, Australia, & India.

Note that this is supposed to be an SFW experiment, so we'll need to filter out subreddits that are `over_18` or rated as `X`.

In one sheet include BOTH subreddit names & subreddit IDs.

---

TODO: Haven't included place logic (e.g., add direction to: city, state, country subreddits.)


### Updates
2022-02-16: 
I [David] will update the QA sheet so that we have standardized columns/format. Otherwise it'll be more work for us to wait for country managers to format things and then standardize them after the fact.



# Imports & notebook setup

In [3]:
%load_ext autoreload
%autoreload 2

# Register bigquery magic
%load_ext google.cloud.bigquery

In [4]:
# colab auth for BigQuery
from google.colab import auth, files, drive
auth.authenticate_user()
print('Authenticated')

Authenticated


## Install custom library

### Append google drive path so we can install library from there

In [5]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
import sys


g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')
    
except Exception as e:
    try:
        drive._mount(g_drive_root, force_remount=True)
        print('   Authenticated & mounted Google Drive')
    except Exception as e:
        print(e)
        raise Exception('You might need to manually mount google drive to colab')

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install library

In [6]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet


## Regular Imports

In [7]:
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm


# auth for google sheets
import gspread
from oauth2client.client import GoogleCredentials


gc = gspread.authorize(GoogleCredentials.get_application_default())

# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

[autoreload of cloudpickle failed: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/IPython/extensions/autoreload.py", line 247, in check
    superreload(m, reload, self.old_objects)
ImportError: cannot import name '_should_pickle_by_reference' from 'cloudpickle.cloudpickle' (/usr/local/lib/python3.7/dist-packages/cloudpickle/cloudpickle.py)
]


## Custom imports

In [8]:
# subclu imports
import subclu
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr,
    reshape_df_to_get_1_cluster_per_row,
    get_primary_topic_mix_cols,
)

from subclu.models.reshape_clusters_v041 import (
    keep_only_target_labels,
    get_table_for_optimal_dynamic_cluster_params,
    get_dynamic_cluster_summary,
)


setup_logging()
notebook_display_config()
print_lib_versions([gspread, pd, np])

python		v 3.7.12
===
gspread		v: 5.1.1
pandas		v: 1.3.5
numpy		v: 1.21.5


# Load data from BigQuery

## Load subreddit geo-relevance & cultural relevance metadata

This data is already in BigQuery, so read it straight from there. We'll use it to keep only the subs that are geo-relevant to the target country (Australia in the query below).

Also add the latest ratings so that we can filter based on those.
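As a rough illustration of the rating filter, the sketch below drops rows that are `over_18` or carry a blocked rating. The column names (`over_18` stored as `'t'`/`'f'`, `rating_short`) follow the query in this section; the toy rows and the `keep_sfw` helper are made up for the example and are not part of `subclu`.

```python
import pandas as pd

# Toy data (made up); column names follow the geo + cultural query
df_toy = pd.DataFrame({
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c', 'sub_d'],
    'over_18': ['f', 't', None, 'f'],
    'rating_short': ['E', 'E', None, 'X'],
})


def keep_sfw(df, blocked_ratings=('X',)):
    """Hypothetical helper: drop over_18 subs and subs with a blocked rating.

    Missing values are treated as not over_18 / unrated, like the
    COALESCE() calls in the SQL.
    """
    is_over_18 = df['over_18'].fillna('f') == 't'
    is_blocked = df['rating_short'].fillna('').isin(blocked_ratings)
    return df[~(is_over_18 | is_blocked)]


df_sfw = keep_sfw(df_toy)
print(df_sfw['subreddit_name'].tolist())
```

Passing `blocked_ratings=('X', 'D')` would mirror the `NOT IN ('X', 'D')` condition used in the `WHERE` clause.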

English-speaking countries don't have ambassador subs right now, so we should be able to create a standard template and replace the country name for these queries.
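A minimal sketch of what such a template could look like, assuming only the `DECLARE TARGET_COUNTRY` line changes per country. `QUERY_TEMPLATE`, `build_geo_query`, and the country spellings are hypothetical; the exact `country_name` values should be checked against the geo-score table before use.

```python
# Hypothetical template: the body of the query stays the same for every country
QUERY_TEMPLATE = """
DECLARE TARGET_COUNTRY STRING DEFAULT '{country}';
-- ... rest of the geo + cultural relevance query, unchanged ...
"""

# Assumed spellings; verify against the `country_name` column in the table
COUNTRIES = ['Canada', 'United Kingdom', 'Australia', 'India']


def build_geo_query(country, allowed=COUNTRIES):
    """Fill in the country name, rejecting anything outside the allow-list."""
    if country not in allowed:
        raise ValueError(f"Unexpected country: {country!r}")
    return QUERY_TEMPLATE.format(country=country)


for country_ in COUNTRIES:
    query_ = build_geo_query(country_)
    print(query_.strip().splitlines()[0])
```

Checking against an allow-list keeps arbitrary strings out of the SQL text, which matters more once the country comes from a spreadsheet or a function argument instead of a hard-coded value.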

### SQL geo & cultural

In [9]:
%%time
%%bigquery df_geo_and_lang --project data-science-prod-218515 

-- Select geo+cultural subreddits for a target country
--  And add latest rating & over_18 flags to exclude X-rated & over_18
DECLARE TARGET_COUNTRY STRING DEFAULT 'Australia';


SELECT
    s.* EXCEPT(over_18, pt, verdict) 
    , nt.rating_name
    , nt.primary_topic
    , nt.rating_short
    , slo.over_18
    , CASE 
        WHEN(COALESCE(slo.over_18, 'f') = 't') THEN 'over_18_or_X_M_D_V'
        WHEN(COALESCE(nt.rating_short, '') IN ('X', 'M', 'D', 'V')) THEN 'over_18_or_X_M_D_V'
        ELSE 'unrated_or_E'
    END AS grouped_rating

FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` AS t
    -- Inner join b/c we only want to keep subs that are geo-relevant AND in topic model
    INNER JOIN (
        SELECT *
        FROM `reddit-employee-datasets.david_bermejo.subclu_subreddit_geo_score_standardized_20220212`
        WHERE country_name = TARGET_COUNTRY
    ) AS s
        ON t.subreddit_id = s.subreddit_id

    -- Add rating so we can get an estimate for how many we can actually use for recommendation
    LEFT JOIN (
        SELECT *
        FROM `data-prod-165221.ds_v2_postgres_tables.subreddit_lookup`
        -- Get latest partition
        WHERE dt = DATE(CURRENT_DATE() - 2)
    ) AS slo
    ON s.subreddit_id = slo.subreddit_id
    LEFT JOIN (
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE pt = DATE(CURRENT_DATE() - 2)
    ) AS nt
        ON s.subreddit_id = nt.subreddit_id

    -- Exclude popular US subreddits
    -- Can't query this table from local notebook because of errors getting google drive permissions. smh, exclude for now
    LEFT JOIN `reddit-employee-datasets.david_bermejo.subclu_subreddits_top_us_to_exclude_from_relevance` tus
        ON s.subreddit_name = LOWER(tus.subreddit_name)

WHERE 1=1
    AND s.subreddit_name != 'profile'
    AND COALESCE(s.type, '') = 'public'
    AND COALESCE(s.verdict, 'f') <> 'admin_removed'
    AND COALESCE(slo.over_18, 'f') = 'f'
    AND COALESCE(nt.rating_short, '') NOT IN ('X', 'D')

    AND(
        s.geo_relevance_default = TRUE
        OR s.relevance_percent_by_subreddit = TRUE
        OR s.relevance_percent_by_country_standardized = TRUE
    )
    AND country_name IN (
            TARGET_COUNTRY
        )

    AND (
         -- Exclude subs that are top in the US and that we don't want to count as culturally relevant
         --  For simplicity, use the English exclusion, which is more relaxed than the non-English one
         COALESCE(tus.english_exclude_from_relevance, '') <> 'exclude'
    )

ORDER BY e_users_percent_by_country_standardized DESC, users_l7 DESC, subreddit_name
;

CPU times: user 376 ms, sys: 23.1 ms, total: 399 ms
Wall time: 9.83 s


### Check df with geo + language information

In [10]:
print(df_geo_and_lang.shape)

(1419, 25)


In [11]:
df_geo_and_lang.iloc[:4, :9]

Unnamed: 0,subreddit_id,subreddit_name,country_name,geo_relevance_default,b_users_percent_by_subreddit,e_users_percent_by_country_standardized,c_users_percent_by_country,d_users_percent_by_country_rank,relevance_percent_by_subreddit
0,t5_2qkhb,melbourne,Australia,True,0.669184,10.780123,0.029058,3,True
1,t5_2qkob,sydney,Australia,True,0.821898,10.70648,0.019942,10,True
2,t5_2uo3q,ausfinance,Australia,True,0.831555,10.663609,0.018428,14,True
3,t5_2qh8e,australia,Australia,True,0.414442,10.460037,0.046857,2,True


## Load model labels (clusters)

The clusters now live in a BigQuery table and have standardized names, so pull the data from there.

### SQL labels


In [12]:
%%time
%%bigquery df_labels --project data-science-prod-218515 

-- select subreddit clusters from bigQuery

SELECT
    sc.subreddit_id
    , sc.subreddit_name
    , nt.primary_topic

    , sc.* EXCEPT(subreddit_id, subreddit_name, primary_topic_1214)
FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` sc
    LEFT JOIN (
        -- New view should be visible to all, but still comes from cnc_taxonomy_cassandra_sync
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE DATE(pt) = (CURRENT_DATE() - 2)
    ) AS nt
        ON sc.subreddit_id = nt.subreddit_id
;

CPU times: user 8.43 s, sys: 608 ms, total: 9.04 s
Wall time: 24.9 s


### Check label outputs

In [13]:
print(df_labels.shape)
df_labels.iloc[:4, :9]

(49558, 51)


Unnamed: 0,subreddit_id,subreddit_name,primary_topic,model_sort_order,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label
0,t5_5a9iie,progonlydj,,40079,1000,12,19,34,49
1,t5_2x9c7,googleplaymusic,Music,40080,31,12,19,34,49
2,t5_3jzsk,ravedj,Music,40081,1000,12,19,34,49
3,t5_2rgie,happyhardcore,Music,40082,152,12,19,34,49


In [14]:
counts_describe(df_labels.iloc[:, :9])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
subreddit_id,object,49558,49558,100.00%,0,0.00%
subreddit_name,object,49558,49558,100.00%,0,0.00%
primary_topic,object,40709,52,0.13%,8849,17.86%
model_sort_order,int64,49558,49558,100.00%,0,0.00%
posts_for_modeling_count,int64,49558,999,2.02%,0,0.00%
k_0013_label,int64,49558,13,0.03%,0,0.00%
k_0023_label,int64,49558,23,0.05%,0,0.00%
k_0041_label,int64,49558,41,0.08%,0,0.00%
k_0059_label,int64,49558,59,0.12%,0,0.00%


# Reshape data
Apply reshaping fxns so that we can export the data in a format that's good for QA.

## Keep only labels for Target subreddits


In [15]:
%%time
df_labels_target = keep_only_target_labels(
    df_labels=df_labels,
    df_geo=df_geo_and_lang,
    col_sort_order='model_sort_order',
    l_ix_subs=['subreddit_id', 'subreddit_name'],
    l_cols_to_front=None,
    geo_cols_to_drop=None,
)

0 <- subs to drop b/c they're not in model
(1419, 73) <- df_labels_target.shape
CPU times: user 183 ms, sys: 1.65 ms, total: 185 ms
Wall time: 186 ms


In [16]:
counts_describe(df_labels_target.iloc[:, :15])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
model_sort_order,int64,1419,1419,100.00%,0,0.00%
subreddit_id,object,1419,1419,100.00%,0,0.00%
subreddit_name,object,1419,1419,100.00%,0,0.00%
primary_topic,object,1226,51,4.16%,193,13.60%
rating_short,object,1378,3,0.22%,41,2.89%
over_18,object,326,1,0.31%,1093,77.03%
rating_name,object,1378,3,0.22%,41,2.89%
posts_for_modeling_count,int64,1419,512,36.08%,0,0.00%
k_0013_label,int64,1419,12,0.85%,0,0.00%
k_0023_label,int64,1419,22,1.55%,0,0.00%


## Run loop to find "optimal" min_num of subreddits for dynamic clusters


We want to balance two things:
- prevent orphan subreddits
- prevent clusters that are too large to be meaningful

To do this at a country level, we're better off starting with the smallest (deepest) clusters and rolling up until we have at least N subreddits in each cluster.

Find optimal `min_subreddits_in_cluster` based on:
- `orphan count`, 
- `number of clusters`,
- & other info

The number might be different for each country, and even within a country it might change depending on when we filter out NSFW subs.
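To make the roll-up idea concrete, here is a simplified sketch in plain Python. It is an assumption about the general approach, not the code in `create_dynamic_clusters`: each sub has a tuple of nested labels (coarsest to deepest), we try the deepest level first, and any sub whose group is still too small moves up one level. The `roll_up` and `count_orphans` helpers and the toy labels are hypothetical.

```python
from collections import Counter


def roll_up(sub_labels, min_subs):
    """Assign each sub to the deepest nested cluster with >= min_subs members.

    sub_labels: {sub_name: (coarsest_label, ..., deepest_label)}
    Simplification: group sizes only count subs that are still unassigned.
    """
    n_levels = len(next(iter(sub_labels.values())))
    assigned = {}
    for depth in range(n_levels, 0, -1):
        pending = {
            sub: labels[:depth]
            for sub, labels in sub_labels.items()
            if sub not in assigned
        }
        sizes = Counter(pending.values())
        for sub, prefix in pending.items():
            # At the coarsest level keep every group, even tiny ones (orphans)
            if sizes[prefix] >= min_subs or depth == 1:
                assigned[sub] = prefix
    return assigned


def count_orphans(assigned):
    """Number of clusters that ended up with a single subreddit."""
    sizes = Counter(assigned.values())
    return sum(1 for n in sizes.values() if n == 1)


# Toy nested labels (made up): two levels, coarse -> deep
toy = {'a': (1, 1), 'b': (1, 1), 'c': (1, 2), 'd': (1, 3), 'e': (2, 4)}

for min_subs_ in (2, 3):
    result_ = roll_up(toy, min_subs_)
    print(min_subs_, len(set(result_.values())), count_orphans(result_))
```

Looping over candidate minimums and comparing cluster count against orphan count is the same trade-off the table in this section is meant to show.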

### Loop

In [17]:
%%time

col_new_cluster_val = 'cluster_label'
col_new_cluster_name = 'cluster_label_k'
col_new_cluster_prim_topic = 'cluster_majority_primary_topic'
col_new_cluster_topic_mix = 'cluster_topic_mix'

df_optimal_min_check = get_table_for_optimal_dynamic_cluster_params(
        df_labels_target=df_labels_target,
        col_new_cluster_val=col_new_cluster_val,
        col_new_cluster_name=col_new_cluster_name,
        col_new_cluster_prim_topic=col_new_cluster_prim_topic,
        col_new_cluster_topic_mix=col_new_cluster_topic_mix,
        min_subs_in_cluster_list=np.arange(3, 11),
        verbose=False,
)


CPU times: user 11.5 s, sys: 253 ms, total: 11.8 s
Wall time: 12 s


### Display loop results

In [18]:
def highlight_below_threshold(val, threshold=1):
    if val <= threshold:
        return "color:purple; font-weight: bold; background-color:yellow;"
    else:
        return ''

col_num_orph_subs = 'num_orphan_subreddits'
# col_num_subs_mean = 'num_subreddits_per_cluster_mean'
col_num_subs_median = 'num_subreddits_per_cluster_median'

style_df_numeric(
    df_optimal_min_check,
    rename_cols_for_display=True,
    l_bar_simple=[col_num_orph_subs,
                  col_num_subs_median,]
).applymap(highlight_below_threshold, subset=[col_num_orph_subs.replace('_', ' ')])


Unnamed: 0,subs to cluster count,min subreddits in cluster,cluster count,num orphan subreddits,num subreddits per cluster mean,num subreddits per cluster median,num clusters with mature primary topic,cluster ids with orphans
0,1419,3,276,3,5.14,5.0,20,"0001, 0007, 0009"
1,1419,4,228,3,6.22,6.0,16,"0007, 0009, 0010"
2,1419,5,190,1,7.47,7.0,16,0011
3,1419,6,165,1,8.6,8.0,12,0011
4,1419,7,139,1,10.21,9.0,12,0009
5,1419,8,121,1,11.73,11.0,11,0010
6,1419,9,111,0,12.78,12.0,9,
7,1419,10,103,1,13.78,13.0,9,0007


In [19]:
del df_optimal_min_check

## Get dynamic clusters (apply optimal num from above)

Sidebar: about 57% of subreddits in Australia only had a single primary topic as their `topic_mix`, so combining `primary topic` might not give us as much info as we hoped.

At the same time, for 43% of subs we might get additional detail by combining the primary topics.
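One way to picture a topic mix, assuming it is the ordered, de-duplicated list of majority primary topics along a sub's nested cluster path. That is my reading of the outputs; the real logic lives in `create_dynamic_clusters` and may differ. `build_topic_mix` and the toy paths are hypothetical.

```python
def build_topic_mix(majority_topics, sep=' | '):
    """Join distinct topics in first-seen order, skipping missing values."""
    seen = []
    for topic in majority_topics:
        if topic and topic not in seen:
            seen.append(topic)
    return sep.join(seen)


# Hypothetical majority topic per depth (coarsest -> deepest) for three subs
toy_paths = {
    'sub_a': ['Gaming', 'Gaming', 'Gaming'],
    'sub_b': ['Cars and Motor Vehicles', 'Cars and Motor Vehicles', 'Hobbies'],
    'sub_c': ['Art', None, 'Gaming'],
}
toy_mix = {sub: build_topic_mix(path) for sub, path in toy_paths.items()}

# Share of subs whose mix is a single topic (no separator present)
n_single = sum(1 for mix in toy_mix.values() if ' | ' not in mix)
print(toy_mix)
print(f"{n_single / len(toy_mix):.0%} of toy subs have a single-topic mix")
```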


In [20]:
n_min_subs_in_cluster_optimal = 5
n_mix_start = 4
l_ix = ['subreddit_id', 'subreddit_name']
col_new_cluster_topic_mix = 'cluster_topic_mix'
col_subreddit_topic_mix = 'subreddit_full_topic_mix'
col_full_depth_mix_count = 'subreddit_full_topic_mix_count'
suffix_new_topic_mix = '_topic_mix_nested'

# even if cluster at k < 20 is generic, keep it to avoid orphan subs
l_cols_labels = (
    [c for c in df_labels_target.columns
        if all([c != col_new_cluster_val, c.endswith('_label')])
        ]
)

df_labels_target_dynamic_raw = create_dynamic_clusters(
    df_labels_target,
    agg_strategy='aggregate_small_clusters',
    min_subreddits_in_cluster=n_min_subs_in_cluster_optimal,
    l_cols_labels_input=l_cols_labels,
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    n_mix_start=n_mix_start,
    col_new_cluster_topic_mix=col_new_cluster_topic_mix,
    col_subreddit_topic_mix=col_subreddit_topic_mix,
    col_full_depth_mix_count=col_full_depth_mix_count,
    suffix_new_topic_mix=suffix_new_topic_mix,
    l_ix=l_ix,
    verbose=True,
)

12:26:57 | INFO | "Concat'ing nested cluster labels..."
12:26:57 | INFO | "Getting topic mix at different depths..."
12:26:57 | INFO | "  Assigning base topic mix cols"
12:26:57 | INFO | "  Creating deepest base topic mix col..."
12:26:57 | INFO | "  Iterating through additional subs with multiple topics..."
100%|██████████| 16/16 [00:00<00:00, 20.76it/s]
12:26:58 | INFO | "Initializing values for strategy: aggregate_small_clusters"
12:26:58 | INFO | "  Looping to roll-up clusters from smallest to largest..."
100%|██████████| 21/21 [00:00<00:00, 118.03it/s]
12:26:58 | INFO | "(1419, 123) <- output shape"


In [21]:
# [c for c in df_labels_target_dynamic_raw.columns if 'nested' in c]

In [22]:
style_df_numeric(
    get_dynamic_cluster_summary(
        df_labels_target_dynamic_raw,
        return_dict=False,
    ),
    rename_cols_for_display=True,
)

Unnamed: 0,cluster count,num orphan subreddits,num subreddits per cluster mean,num subreddits per cluster median,num clusters with mature primary topic,cluster ids with orphans
0,190,1,7.47,7.0,16,11


In [23]:
# # check column order
style_df_numeric(
    df_labels_target_dynamic_raw.iloc[104:109, :15],
    rename_cols_for_display=True,
    int_labels=['total_users_in', 'num_of_countries_', 'users_in_subreddit_from_country_l28',
                    'by_country_rank',
                    ],
    pct_cols=['b_users_percent_by_subreddit',
                  'c_users_percent_by_country',
                  'users_percent_by_country_avg',
                  ],
    pct_labels='',
)

Unnamed: 0,subreddit id,subreddit name,cluster topic mix,primary topic,rating short,subreddit full topic mix,rating name,over 18,geo relevance default,relevance percent by subreddit,relevance percent by country standardized,b users percent by subreddit,e users percent by country standardized,d users percent by country rank,model sort order
104,t5_38cev,melbournecycling,Cars and Motor Vehicles | Hobbies,-,E,Cars and Motor Vehicles | Hobbies,Everyone,-,False,True,True,84.24%,2.04,24356,18028
105,t5_2wy6u,ausbike,Cars and Motor Vehicles | Hobbies,Fitness and Nutrition,E,Cars and Motor Vehicles | Hobbies,Everyone,-,True,True,True,78.96%,5.01,5424,18029
106,t5_2tbmq,bikecommuting,Cars and Motor Vehicles | Hobbies,Travel,E,Cars and Motor Vehicles | Hobbies,Everyone,-,False,False,True,4.84%,2.33,4230,18032
107,t5_2qhyi,cycling,Cars and Motor Vehicles | Hobbies,Sports,E,Cars and Motor Vehicles | Hobbies,Everyone,f,False,False,True,5.94%,2.47,955,18033
108,t5_32hd6,electricskateboarding,Cars and Motor Vehicles,Hobbies,E,Cars and Motor Vehicles,Everyone,-,False,False,True,6.85%,2.28,5976,18037


### Minor QA checks

In [24]:
# # check column order
# style_df_numeric(
#     df_labels_target_dynamic_raw.iloc[70:74, -22:],
#     rename_cols_for_display=True,
#     int_labels=['total_users_in', 'num_of_countries_', 'users_in_subreddit_from_country_l28',
#                     'by_country_rank',
#                     ],
#     pct_cols=['b_users_percent_by_subreddit',
#                   'c_users_percent_by_country',
#                   'users_percent_by_country_avg',
#                   ],
#     pct_labels='',
# )

In [25]:
value_counts_and_pcts(
    df_labels_target_dynamic_raw[col_new_cluster_topic_mix],
    top_n=9,
)

Unnamed: 0,cluster_topic_mix-count,cluster_topic_mix-percent,cluster_topic_mix-pct_cumulative_sum
Mature Themes and Adult Content,92,6.5%,6.5%
Gaming,81,5.7%,12.2%
Television | Podcasts and Streamers,76,5.4%,17.5%
Animals and Pets,69,4.9%,22.4%
Medical and Mental Health,69,4.9%,27.3%
Music,59,4.2%,31.4%
Cars and Motor Vehicles,55,3.9%,35.3%
Television,55,3.9%,39.2%
Sports,49,3.5%,42.6%


In [26]:
# how many final clusters have multiple topics?
value_counts_and_pcts(
    df_labels_target_dynamic_raw[col_new_cluster_topic_mix].str.count(r'\|')
)

Unnamed: 0,cluster_topic_mix-count,cluster_topic_mix-percent,cluster_topic_mix-pct_cumulative_sum
0,873,61.5%,61.5%
1,385,27.1%,88.7%
2,135,9.5%,98.2%
3,26,1.8%,100.0%


In [27]:
# how many SUBREDDITS have multiple topics? (when we check the deepest clusters)
#  these two calls are equivalent

# value_counts_and_pcts(
#     df_labels_target_dynamic_raw[col_subreddit_topic_mix].str.count(r'\|')
# )

value_counts_and_pcts(
    df_labels_target_dynamic_raw[col_full_depth_mix_count]
)

Unnamed: 0,subreddit_full_topic_mix_count-count,subreddit_full_topic_mix_count-percent,subreddit_full_topic_mix_count-pct_cumulative_sum
1,652,45.9%,45.9%
2,433,30.5%,76.5%
3,228,16.1%,92.5%
4,84,5.9%,98.4%
5,21,1.5%,99.9%
6,1,0.1%,100.0%


In [28]:
style_df_numeric(
    df_labels_target_dynamic_raw
    [df_labels_target_dynamic_raw[col_full_depth_mix_count] >= 5]
    .iloc[-5:, :9]
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,subreddit id,subreddit name,cluster topic mix,primary topic,rating short,subreddit full topic mix,rating name,over 18,geo relevance default
884,t5_3ej5o,australiatravel,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity",-,-,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity | Internet Culture and Memes",-,-,True
885,t5_56fwff,australiacommercial,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity",-,-,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity | Internet Culture and Memes",-,-,True
886,t5_2x177,ameristralia,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity",Internet Culture and Memes,E,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity | Internet Culture and Memes",Everyone,-,True
887,t5_340dk,askanaustralian,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity",Learning and Education,E,"History | Place | Mature Themes and Adult Content | Culture, Race, and Ethnicity | Internet Culture and Memes",Everyone,-,True
1402,t5_2s5sk,lv426,Art,Movies,E,Art | Gaming | Television | Internet Culture and Memes | Movies,Everyone,-,False


## Re-assign orphan subreddits (optional)

If there are orphan subreddits (see summary above), check them out to see if we can re-assign them w/o too much work. If we can't, skip them and move on to the next country.
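A small sketch of how orphans can be spotted directly from the final labels, assuming an orphan is a sub that is alone in its `cluster_label`. The toy frame is made up; only the `cluster_label` column name comes from this notebook.

```python
import pandas as pd

# Toy data (made up): one row per subreddit with its final cluster label
df_toy = pd.DataFrame({
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c', 'sub_d', 'sub_e'],
    'cluster_label': ['0001', '0001', '0002', '0003-0004', '0003-0004'],
})

# Size of the cluster each sub belongs to, aligned to the original rows
cluster_sizes = (
    df_toy.groupby('cluster_label')['subreddit_name'].transform('size')
)
df_orphans = df_toy[cluster_sizes == 1]
print(df_orphans)
```

Because `transform` keeps the original index, the mask can be reused to look at the rows around each orphan when the frame is sorted by model order.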

In [29]:
# check subs around orphan sub
# n_plus_minus_ = 5
ix_orphan_ = (
    df_labels_target_dynamic_raw
    [df_labels_target_dynamic_raw[col_new_cluster_val] == '0011']
    .index
)
df_labels_target_dynamic_raw.iloc[ix_orphan_, :9]

Unnamed: 0,subreddit_id,subreddit_name,cluster_topic_mix,primary_topic,rating_short,subreddit_full_topic_mix,rating_name,over_18,geo_relevance_default
930,t5_2tt7r,realms,Gaming,Gaming,E,Gaming,Everyone,,False


In [30]:
# check other subs that are in the same cluster as orphan sub (at broadest level)
l_cols_orphan_check = (
    [
        'subreddit_id',
        col_new_cluster_topic_mix, 
        # col_new_cluster_val,  # this can be really long and makes comparing harder
        # col_subreddit_topic_mix,
        'subreddit_name', 
        col_new_cluster_name
    ] +
    l_cols_labels[:-5]
)

style_df_numeric(
    df_labels_target_dynamic_raw
    [df_labels_target_dynamic_raw['k_0013_label'] == 11]
    [l_cols_orphan_check]
    .iloc[3:14, :50]
    ,
    l_bar_simple=[c for c in l_cols_orphan_check[4:] if c.endswith('_label')],
    rename_cols_for_display=True,

)

Unnamed: 0,subreddit id,cluster topic mix,subreddit name,cluster label k,k 0013 label,k 0023 label,k 0041 label,k 0059 label,k 0063 label,k 0079 label,k 0085 label,k 0118 label,k 0320 label,k 0657 label,k 0958 label,k 1065 label,k 1560 label,k 1840 label,k 2207 label,k 2351 label,k 2830 label
926,t5_35qsx,Gaming,crowfall,k_0320_label,11,18,32,43,46,58,62,87,244,488,707,789,1147,1348,1616,1726,2066
927,t5_2ais35,Gaming,destiny2builds,k_0320_label,11,18,32,43,46,58,62,87,244,489,710,792,1151,1355,1624,1734,2074
928,t5_389nk,Gaming,sharditkeepit,k_0320_label,11,18,32,43,46,58,62,87,244,489,710,792,1151,1355,1624,1734,2074
929,t5_3os9l4,Gaming,crucibleguidebook,k_0320_label,11,18,32,43,46,58,62,87,244,489,710,792,1151,1355,1624,1734,2074
930,t5_2tt7r,Gaming,realms,k_0013_label,11,18,32,43,46,58,62,87,246,493,716,799,1159,1363,1633,1743,2086
931,t5_3nqdi,Gaming,swlegion,k_0085_label,11,18,32,43,46,59,63,88,247,494,718,801,1161,1365,1635,1745,2089
932,t5_293c3c,Gaming,printedwarhammer,k_0085_label,11,18,32,43,46,59,63,88,247,494,719,802,1162,1367,1638,1748,2092
933,t5_2y5lg,Gaming,tau40k,k_0085_label,11,18,32,43,46,59,63,88,247,494,719,802,1163,1368,1639,1749,2093
934,t5_2scss,Gaming,minipainting,k_0085_label,11,18,32,43,46,59,63,88,247,494,719,802,1163,1368,1639,1749,2093
935,t5_2qnwk,Gaming,battletech,k_0085_label,11,18,32,43,46,59,63,88,247,494,719,802,1163,1368,1639,1749,2094


In [31]:
label_k_to_reassign_ = 'k_0320_label'
label_val_to_reassign_ = '0011-0018-0032-0043-0046-0058-0062-0087-0244'
subreddit_id_orphan_ = 't5_2tt7r'

mask_orphan_and_new_group = (
    (df_labels_target_dynamic_raw['subreddit_id'] == subreddit_id_orphan_) |
    (
        (df_labels_target_dynamic_raw[col_new_cluster_name] == label_k_to_reassign_) &
        (df_labels_target_dynamic_raw[col_new_cluster_val] == label_val_to_reassign_)
    )
)

# assign is similar to what we do in the dynamic function
label_k_new_ = 'k_0118_label'
label_val_new_col_ = f"{label_k_new_}_nested"
new_prim_topic_col_ = label_k_new_.replace('_label', '_majority_primary_topic')
c_update_topic_mix_ = label_k_new_.replace('_label', suffix_new_topic_mix)

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_name
] = label_k_new_

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_val
] = df_labels_target_dynamic_raw[mask_orphan_and_new_group][label_val_new_col_]

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_prim_topic
] = df_labels_target_dynamic_raw[mask_orphan_and_new_group][new_prim_topic_col_]

df_labels_target_dynamic_raw.loc[
    mask_orphan_and_new_group,
    col_new_cluster_topic_mix
] = df_labels_target_dynamic_raw[mask_orphan_and_new_group][c_update_topic_mix_]

del mask_orphan_and_new_group, label_k_to_reassign_, label_val_to_reassign_
del label_k_new_, label_val_new_col_, new_prim_topic_col_

In [32]:
# check again, num of orphans should be lower than before
style_df_numeric(
    get_dynamic_cluster_summary(
        df_labels_target_dynamic_raw,
        return_dict=False,
    ),
    rename_cols_for_display=True,
)

Unnamed: 0,cluster count,num orphan subreddits,num subreddits per cluster mean,num subreddits per cluster median,num clusters with mature primary topic,cluster ids with orphans
0,189,0,7.51,7.0,16,


In [33]:
value_counts_and_pcts(
    df_labels_target_dynamic_raw,
    ['cluster_label'],
    top_n=None,
    return_df=True
)['count'].describe()

count    189.000000
mean       7.507937
std        2.402612
min        3.000000
25%        6.000000
50%        7.000000
75%        8.000000
max       22.000000
Name: count, dtype: float64

## Get cluster for humans (list of subs in a cluster in a cell)
Here we get 1 cluster per row. 
Use cases:
- It makes it easier to quickly check NSFW clusters that we'll filter out
- We'll append the list of subreddit names from here to the final table for QA (makes it easier to evaluate whether the cluster makes sense).
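For intuition, the reshape is roughly a group-by that joins names and IDs into one cell per cluster. This is a sketch with plain pandas on made-up data, not the implementation of `reshape_df_to_get_1_cluster_per_row`; the output column names are borrowed from the call in this section.

```python
import pandas as pd

# Toy data (made up): one row per subreddit
df_toy = pd.DataFrame({
    'subreddit_id': ['t5_a1', 't5_a2', 't5_b1'],
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c'],
    'cluster_label': ['0001', '0001', '0002'],
})

# One row per cluster, with comma-separated IDs and names
df_one_row = (
    df_toy
    .groupby('cluster_label', as_index=False)
    .agg(
        subs_in_cluster_count=('subreddit_id', 'nunique'),
        list_cluster_subreddit_ids=('subreddit_id', ', '.join),
        list_cluster_subreddit_names=('subreddit_name', ', '.join),
    )
)
print(df_one_row)
```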


In [34]:
df_cluster_for_humans = reshape_df_to_get_1_cluster_per_row(
    df_labels_target_dynamic_raw,
    col_counterpart_count='subs_in_cluster_count',
    col_list_cluster_names='list_cluster_subreddit_names',
    col_list_cluster_ids='list_cluster_subreddit_ids',
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    col_new_cluster_topic=col_new_cluster_topic_mix,
    verbose=False,
    get_one_column_per_sub_id=False,
)

(189, 6)


In [35]:
df_cluster_for_humans.iloc[40:48, 1:]

Unnamed: 0,cluster_label_k,cluster_topic_mix,subs_in_cluster_count,list_cluster_subreddit_ids,list_cluster_subreddit_names
40,k_1065_label,Hobbies | Crafts and DIY,8,"t5_2rpor, t5_2rjcg, t5_2sczp, t5_hu6et, t5_315b8, t5_2srbo, t5_ie6pz, t5_2r57p","crossstitch, quilting, sewing, craftsnark, vintagesewing, sewhelp, punchneedle, knots"
41,k_3927_label,Hobbies | Crafts and DIY | Art,6,"t5_2re38, t5_2sk72, t5_2sc8d, t5_2yqob, t5_2um9k, t5_2s48h","pottery, ceramics, resin, resincasting, clay, polymerclay"
42,k_2351_label,Hobbies | Crafts and DIY | Fashion,7,"t5_2qkpi, t5_2s8y8, t5_2yhx0, t5_2rq1q, t5_2rv093, t5_2s2hj6, t5_35368","jewelry, jewelers, opals, diamonds, labdiamond, moissanitebst, moissanite"
43,k_0085_label,Hobbies | Beauty and Makeup,8,"t5_2qlac, t5_310nr, t5_2w8pb, t5_4yrxf5, t5_sb9wh, t5_2qqe7, t5_3iftm, t5_3a04j","beauty, ausskincare, indiemakeupandmore, fragranceaustralia, aussiemakeuptrade, mullets, eyelashextensions, equalattraction"
44,k_1560_label,Hobbies | Beauty and Makeup,7,"t5_2rww2, t5_vuqjc, t5_2qrwc, t5_3ca3n, t5_32g1x, t5_342em, t5_2ys2j","makeupaddiction, makeuplounge, makeup, olivemua, makeuprehab, muacjdiscussion, australianmakeup"
45,k_0013_label,Place,6,"t5_2w28u, t5_2spz6, t5_4ndtlb, t5_310i6, t5_2t1yp, t5_3ceuj","papuanewguinea, fijian, ausvisa, waggansw, newtown, tafe"
46,k_1065_label,Place,6,"t5_2rjoj, t5_2s7ae, t5_2rpf3, t5_2tixg, t5_512307, t5_2sksz","newcastle, ipswich, albury, northernbeaches, midnorthcoastnsw, footscray"
47,k_3927_label,Place,22,"t5_2sbmn, t5_2r1ca, t5_2r584, t5_2sdsu, t5_2s98u, t5_2roro, t5_2uim0, t5_2t24j, t5_2rjvn, t5_2qnjg, t5_2rhaq, t5_2sts2, t5_2s9kx, t5_2shzr, t5_2qutz, t5_2qkhb, t5_2qkob, t5_2su0b, t5_2qh8r, t5_2r2rv, t5_2r78m, t5_2qx4q","ballarat, adelaide, canberra, bendigo, geelong, wollongong, toowoomba, townsville, hobart, tasmania, cairns, sunshinecoast, centralcoastnsw, goldcoast, brisbane, melbourne, sydney, vic, darwin, southaustralia, perth, westernaustralia"


In [36]:
# df_cluster_for_humans.tail(9)

### Check subs that have mature topics

Make a list of sub IDs to exclude for clean df
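A sketch of how that list could be built from the one-cluster-per-row table, assuming the IDs are stored as a single `', '`-separated string per cluster (as in the output above). The toy rows are made up.

```python
import pandas as pd

# Toy data (made up), shaped like the one-cluster-per-row table
df_toy = pd.DataFrame({
    'cluster_topic_mix': [
        'Gaming',
        'Mature Themes and Adult Content',
        'Mature Themes and Adult Content | Gender',
    ],
    'list_cluster_subreddit_ids': ['t5_a1, t5_a2', 't5_b1, t5_b2', 't5_c1'],
})

# Flag clusters whose topic mix mentions "mature" anywhere
mask_mature = (
    df_toy['cluster_topic_mix'].str.lower().str.contains('mature', na=False)
)

# Split each ID string into a list, then flatten to one ID per element
l_ids_to_exclude = (
    df_toy.loc[mask_mature, 'list_cluster_subreddit_ids']
    .str.split(', ')
    .explode()
    .tolist()
)
print(l_ids_to_exclude)
```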

In [37]:
mask_mature_clusters_ = (
    df_cluster_for_humans[col_new_cluster_topic_mix].str.lower()
    .str.contains('mature')
)
print(mask_mature_clusters_.sum())

22


In [47]:
(
    df_cluster_for_humans
    [mask_mature_clusters_]
    .iloc[:11, :]
)

Unnamed: 0,cluster_label,cluster_label_k,cluster_topic_mix,subs_in_cluster_count,list_cluster_subreddit_ids,list_cluster_subreddit_names,exclude_from_qa
0,0001,k_0013_label,Mature Themes and Adult Content,5,"t5_44faux, t5_3agvg, t5_24v66b, t5_5fbvny, t5_2zui0","neighboursbabez, stephclairesmith, isabelleclarke, sophadophaa, eugeniebouchard",exclude from QA
1,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,Mature Themes and Adult Content,6,"t5_iqt8v, t5_3g9c8, t5_2bfy1d, t5_39o6d, t5_28i70j, t5_3fruf","altladyboners, rtgirls, neighboursbabes, aussiebabes, downundercelebs, girlstennis",exclude from QA
2,0001-0001-0002-0002-0002-0003-0003-0003,k_0118_label,Mature Themes and Adult Content,9,"t5_4dgbcr, t5_5bz2ml, t5_4317um, t5_2v07n, t5_l7fuf, t5_547c6n, t5_3g9m96, t5_5clik0, t5_2ixk05","laetitiabrown, polyslartz, jadetunchy, realbikinis, indithew, nelphelanfan, sinoritaee, coffeysisters69, hannahorval",exclude from QA
3,0001-0001-0003-0003-0003-0004-0005-0005,k_0118_label,Mature Themes and Adult Content,7,"t5_25ejbo, t5_4wpiey, t5_4ote3s, t5_4xu6ij, t5_5dhfkc, t5_5bxvct, t5_514f58","hailiedeegan19, oliviadeeble_, forthaboys, saraboyd, australianbabes, alannapow, chanelphelan",exclude from QA
4,0002,k_0013_label,Mature Themes and Adult Content,4,"t5_4qcdkx, t5_48hmss, t5_4nnaei, t5_3zxen1","jacksonnmaddyof, goldcoastsugar, ellactrical, heidihudson",exclude from QA
5,0002-0002,k_0023_label,Mature Themes and Adult Content,6,"t5_4zl04q, t5_4zn255, t5_3xnod0, t5_4056vz, t5_2nl0s9, t5_4chanv","sydney_may2, polyleaked, siarramay, siamayl, ilsawatkins, polygirlsgonewild",exclude from QA
6,0002-0002-0005-0005-0005-0007-0008-0009,k_0118_label,Mature Themes and Adult Content,8,"t5_4cjiib, t5_3kwdmq, t5_4jrkfm, t5_39qqu, t5_3aewyu, t5_4z75gq, t5_2w83vq, t5_4jgdp9","vanessasierra_, jessicathoday, fijiporn, rosievan, luciejaid2, viralvideoleaks, deminovak, zimaanderson",exclude from QA
7,0004,k_0013_label,Mature Themes and Adult Content,4,"t5_55cv32, t5_24s9xa, t5_2wip6, t5_38ezr","bdsmaustralia, brisbanesocial, brisbanegaybros, amwfdating",exclude from QA
8,0005-0007-0012,k_0041_label,Mature Themes and Adult Content,8,"t5_296554, t5_333yu, t5_3jd26q, t5_53z0jy, t5_2858fg, t5_378b57, t5_2n2cey, t5_2qtjz","aussiefeet, trollingforababy, honeybirdette, hookupsperth, communalshowers, aussienaturism, blueycirclejerk, images",exclude from QA
59,0008-0013,k_0023_label,Mature Themes and Adult Content,6,"t5_2qlq6, t5_2qspe, t5_34cyw, t5_2qpe9, t5_2vfzu, t5_2qu5n","audible, etymology, datingoverthirty, onlinedating, thegirlsurvivalguide, polyamory",exclude from QA


In [48]:
(
    df_cluster_for_humans
    [mask_mature_clusters_]
    .iloc[-12:, :]
)

Unnamed: 0,cluster_label,cluster_label_k,cluster_topic_mix,subs_in_cluster_count,list_cluster_subreddit_ids,list_cluster_subreddit_names,exclude_from_qa
70,0008-0013-0023-0032-0034-0044-0046-0064,k_0118_label,Mature Themes and Adult Content | Gender,9,"t5_mwfyw, t5_2tnmd, t5_2r1c3, t5_2rfyw, t5_33rcf, t5_30c2m, t5_2rvxp, t5_xaiot, t5_2qkeh","latebloomerlesbians, lgbtaustralia, genderqueer, asianamerican, hapas, asianmasculinity, niceguys, femaledatingstrategy, answers",exclude from QA
71,0008-0013-0023-0032-0034-0044-0046-0064-0181-0356-0518-0580,k_1065_label,Mature Themes and Adult Content | Gender,6,"t5_2t187, t5_2ub9j, t5_2r4b9, t5_3ijj6, t5_31hoq, t5_32viw","mypartneristrans, mtf, asktransgender, transytalk, ask_transgender, transgenderau",keep
72,0008-0013-0023-0032-0034-0044-0046-0064-0184,k_0320_label,Mature Themes and Adult Content | Gender,7,"t5_3531l, t5_35hao, t5_2r4eo, t5_3hcwf, t5_2xinb, t5_296zi1, t5_39xf0","bumble, hingeapp, ama, aftertheloop, outoftheloop, smolbeansnark, blogsnark",keep
73,0008-0013-0023-0032-0034-0044-0046-0064-0184-0362-0525-0588,k_1065_label,Mature Themes and Adult Content | Gender | Family and Relationships,8,"t5_2yr9y, t5_nhaha, t5_2vplf, t5_2qhtr, t5_2rv3t, t5_2ug26, t5_11080t, t5_2ra25","bridezillas, weddingshaming, weddingsunder10k, wedding, weddingplanning, engaged, waiting_to_wed, engagementrings",keep
81,0008-0014-0025-0034-0036-0046-0048-0066,k_0118_label,Mature Themes and Adult Content,9,"t5_31mz6, t5_vt3kk, t5_3jth5, t5_2tkn9, t5_lx01k, t5_3c7rbf, t5_2wgs2u, t5_513pq7, t5_2r9bh","lexapro, mirtazapine_remeron, pristiq, modafinil, sarmssourcetalk, aulean, benzosaus, chemicalmagicau, acid",exclude from QA
82,0008-0014-0025-0034-0036-0046-0048-0066-0190-0378,k_0657_label,Mature Themes and Adult Content,6,"t5_2uggx, t5_2xfv5, t5_2qlxl, t5_2te5i, t5_2scuv, t5_2vu9j","druggardening, poppyseed, mescaline, kava, nitrous, nitrousoxide",exclude from QA
83,0008-0014-0025-0034-0036-0047-0049-0067,k_0118_label,Mature Themes and Adult Content,6,"t5_2t0if, t5_2t7u5, t5_2qpco, t5_3lftlc, t5_43or0d, t5_2ojf76","ausbeer, showerbeer, cocktails, auscann, medicalcannabisoz, medicalcannabisaus",exclude from QA
84,0008-0014-0025-0034-0036-0047-0049-0067-0193,k_0320_label,Mature Themes and Adult Content,8,"t5_2slm7, t5_2tdnb, t5_2tmn1, t5_2tt9m, t5_j3ovh, t5_2u7ap, t5_2xo3j, t5_3a4n7","stonerengineering, ausents, adelaidetrees, melbents, gamersupps, incense, aussievapers, craftymighty",exclude from QA
109,0010-0017-0030-0040-0042-0053-0056-0078-0219-0437-0629-0704-1016-1194-1437-1529-1824-2028-2191-2375-2476-2510,k_3927_label,Place | Mature Themes and Adult Content | Law,6,"t5_2qmsc, t5_2wvvc, t5_2s5e8, t5_38ve9, t5_2gmg9b, t5_4r6wtg","unsolvedmysteries, unresolvedmysteries, truecrime, truecrimediscussion, crimeplus, truecrimemystery",keep
110,0010-0017-0030-0040-0042-0053-0056-0078-0219-0437-0630-0705,k_1065_label,Place | Mature Themes and Adult Content | Law,7,"t5_3cfj2, t5_iioxt, t5_navml, t5_n9g9j, t5_2cfkv9, t5_1lruuu, t5_fl4tn","earons, mrcruel, chriswatts, shannanwatts, epsteinandfriends, epstein, duggarssnark",exclude from QA


- The model did a good job identifying these drug clusters, but we definitely don't want to increase their reach:
    - `0008-0014-0025-0034-0036-0046-0048-0066`
    - `0008-0014-0025-0034-0036-0046-0048-0066-0190-0378`

- The drinking subs might be SFW, but probably not the 420 ones:
    - `0008-0014-0025-0034-0036-0047-0049-0067`
    - `0008-0014-0025-0034-0036-0047-0049-0067-0193`

In [46]:
l_mature_clusters_to_exclude_from_qa = [
    '0001',
    '0001-0001-0001-0001-0001-0002-0002-0002',
    '0001-0001-0002-0002-0002-0003-0003-0003',
    '0001-0001-0003-0003-0003-0004-0005-0005',
    '0002',
    '0002-0002',
    '0002-0002-0005-0005-0005-0007-0008-0009',
    '0004',
    '0005-0007-0012',
    '0008-0013',  # thegirlsurvivalguide could be good to show, but prob not for ppl looking at r/onlinedating...?
    '0008-0013-0023-0032-0034-0044-0046-0064',  # can of worms...
    '0008-0014-0025-0034-0036-0046-0048-0066',
    '0008-0014-0025-0034-0036-0046-0048-0066-0190-0378',
    '0008-0014-0025-0034-0036-0047-0049-0067',
    '0008-0014-0025-0034-0036-0047-0049-0067-0193',
    '0010-0017-0030-0040-0042-0053-0056-0078-0219-0437-0630-0705',
    '0010-0017-0030-0040-0042-0053-0056-0078-0219-0437-0630-0705-1017-1195-1439-1532-1828-2032-2196-2380-2482-2516',

]
val_exclude_from_qa_ = 'exclude from QA'
# Flag clusters on the exclusion list; every other cluster is kept for QA
df_cluster_for_humans['exclude_from_qa'] = np.where(
    df_cluster_for_humans[col_new_cluster_val].isin(l_mature_clusters_to_exclude_from_qa),
    val_exclude_from_qa_,
    'keep',
)

### Add the flag to exclude from QA & the list of sub names to df-raw

In [1]:
col_new 

NameError: ignored
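
The cell above never got finished, so here is a minimal sketch of one way the cluster-level flag and the list of sub names could be copied onto the per-subreddit table. It assumes both tables share a `cluster_label` column; the toy dataframes `df_clusters` and `df_subs` and their values are made up for illustration and are not the notebook's real data.

```python
import pandas as pd

# Toy stand-ins for the cluster-level and subreddit-level tables (illustrative values)
df_clusters = pd.DataFrame({
    'cluster_label': ['0001', '0002-0002', '0008-0013-0023'],
    'list_cluster_subreddit_names': ['sub_a, sub_b', 'sub_c, sub_d', 'sub_e, sub_f'],
    'exclude_from_qa': ['exclude from QA', 'exclude from QA', 'keep'],
})
df_subs = pd.DataFrame({
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c', 'sub_d', 'sub_e', 'sub_f'],
    'cluster_label': ['0001', '0001', '0002-0002', '0002-0002',
                      '0008-0013-0023', '0008-0013-0023'],
})

# Left-merge so every subreddit row is kept and picks up the cluster-level columns.
# validate= raises if a cluster label appears more than once in df_clusters.
df_subs_flagged = df_subs.merge(
    df_clusters[['cluster_label', 'list_cluster_subreddit_names', 'exclude_from_qa']],
    how='left',
    on='cluster_label',
    validate='many_to_one',
)
print(df_subs_flagged['exclude_from_qa'].value_counts())
```

A left merge keeps the row count of the subreddit table unchanged, which makes it easy to sanity-check the result against the original shape.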

## Create new df_clean 

- Add the list of subreddits to target-CLEAN, b/c we'll need it for the final rating
- Add new columns & update order
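
No code survived for this step, so the following is only a rough sketch of what building a clean df might look like: keep rows flagged `keep`, drop subs that are `over_18` or rated `X`, and move the key columns to the front. The toy data, the `'t'` value for `over_18`, and the chosen column order are assumptions for illustration.

```python
import pandas as pd

# Toy rows; column names follow the raw table, values are illustrative only
df_raw = pd.DataFrame({
    'subreddit_id': ['t5_a', 't5_b', 't5_c', 't5_d'],
    'rating_short': ['E', 'X', None, 'E'],
    'over_18': ['f', None, 't', 'f'],
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c', 'sub_d'],
    'cluster_label': ['0003', '0003', '0003', '0001'],
    'list_cluster_subreddit_names': ['sub_b, sub_c', 'sub_a, sub_c', 'sub_a, sub_b', ''],
    'exclude_from_qa': ['keep', 'keep', 'keep', 'exclude from QA'],
})

mask_keep = df_raw['exclude_from_qa'] == 'keep'
mask_over_18 = df_raw['over_18'] == 't'       # ASSUMPTION: 't' marks over_18 subs
mask_rated_x = df_raw['rating_short'] == 'X'  # nulls compare as False, so unrated subs stay

# Hypothetical column order: identifiers and cluster info first, everything else after
l_cols_front = [
    'subreddit_name', 'subreddit_id', 'cluster_label', 'list_cluster_subreddit_names',
]
l_cols_rest = [c for c in df_raw.columns if c not in l_cols_front]

df_clean = (
    df_raw[mask_keep & ~mask_over_18 & ~mask_rated_x]
    [l_cols_front + l_cols_rest]
    .reset_index(drop=True)
)
print(df_clean.shape)
```

Note that unrated subs pass this filter; whether they should is a judgment call for the QA step.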


# Export data

## Define variables to create/access google sheet doc & worksheets

In [None]:
gspread.__version__

'5.1.1'

In [None]:
# # %%time
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

GSHEET_NAME = 'i18n Australia subreddits and clusters - model v0.4.1'
GSHEET_KEY = '1ujjoJ7uyTWz3P1aYgZTKyJGbETKAfk3vrom_ASaNUt0'
target_abbrev_ = 'au'

d_wsh_names = {
    'sub_raw': {
        'name': 'raw_data_per_subreddit',
    },
    'clusters_t2t_list_raw': {
        'name': f'raw_clusters_list_{target_abbrev_}_{target_abbrev_}',
    },
    'clusters_t2t_fpr_raw': {
        'name': f'raw_clusters_fpr_{target_abbrev_}_{target_abbrev_}',
    }
}
# SH_DE_2_DE_LISTING_BELOW = 'de_to_de_listing_below_raw_cluster_list_names_and_ids'

if GSHEET_KEY is not None:
    sh = gc.open_by_key(GSHEET_KEY)
    print(f"Opening google worksheet: {GSHEET_NAME} ...")
else:
    print(f"Creating google worksheet: {GSHEET_NAME} ...")
    sh = gc.create(GSHEET_NAME)

# create worksheets:
for _, d_ in d_wsh_names.items():
    sh_name = d_['name']
    try:
        d_['worksheet'] = sh.worksheet(sh_name)
        print(f"Opening tab/sheet: {sh_name} ...")
    except Exception as e:
        print(f"Creating tab/sheet: {sh_name} ...")
        d_['worksheet'] = sh.add_worksheet(sh_name, rows=5, cols=5)


Opening google worksheet: i18n Australia subreddits and clusters - model v0.4.1 ...
Opening tab/sheet: raw_data_per_subreddit ...
Opening tab/sheet: raw_clusters_list_au_au ...
Opening tab/sheet: raw_clusters_fpr_au_au ...


## Save: Clean sheet to rate

## Save: df cluster for humans

## Save: target raw dynamic


## Save: FPR target-2-target as list

Even though the data isn't fully ready, I want the output in place to confirm it's in the format we need.

# Export raw data: 1 row=1 subreddit

Make sure it's ordered by the sort column (`model_sort_order`) so that similar subs sit next to each other.

NOTE: I'll need to go back to colab because for some reason I can't get authenticated to create new sheets from my laptop **sigh**.

## Save raw subreddit data - use for QA
Note that we have to use `fillna('')` before converting the dataframe to a list of lists.

Otherwise, we'll get errors because the gspread library doesn't know how to handle nulls (`np.nan` or `pd.NA`).
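
As a minimal sketch of that conversion, the helper below (a hypothetical name, `df_to_sheet_values`, not part of gspread or this notebook's library) builds the header row plus data rows with nulls replaced by empty strings. My understanding is that the error comes from NaN not being valid JSON, so the example uses `json.dumps(..., allow_nan=False)` as a local stand-in check; that link to gspread's behavior is an assumption.

```python
import json

import numpy as np
import pandas as pd


def df_to_sheet_values(df: pd.DataFrame) -> list:
    """Return [header_row, *data_rows] with nulls replaced by empty strings."""
    # Cast to object first so numeric columns can hold '' next to numbers
    df_filled = df.astype(object).fillna('')
    return [df_filled.columns.tolist()] + df_filled.values.tolist()


df_demo = pd.DataFrame({
    'subreddit_name': ['sub_a', 'sub_b'],
    'rating_short': ['E', np.nan],
    'posts_for_modeling_count': [36.0, np.nan],
})
values = df_to_sheet_values(df_demo)

# Raises ValueError if any NaN is left in the payload
payload = json.dumps(values, allow_nan=False)
print(values[2])
```

The same list of lists is what the `worksheet.update(...)` calls below receive.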

In [None]:
df_labels_target_dynamic_raw.iloc[:5, :22]

Unnamed: 0,subreddit_id,subreddit_name,cluster_majority_primary_topic,primary_topic,rating_short,rating_name,over_18,cluster_label,cluster_label_k,model_sort_order,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label,k_0063_label,k_0079_label,k_0085_label,k_0118_label,k_0320_label,k_0657_label,k_0958_label
424,t5_44faux,neighboursbabez,Mature Themes and Adult Content,Mature Themes and Adult Content,,,,0001,k_0013_label,334,36,1,1,1,1,1,1,1,1,3,5,6
1247,t5_iqt8v,altladyboners,Mature Themes and Adult Content,Celebrity,E,Everyone,f,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,489,108,1,1,1,1,1,2,2,2,4,8,11
1238,t5_3g9c8,rtgirls,Mature Themes and Adult Content,Podcasts and Streamers,E,Everyone,,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,509,36,1,1,1,1,1,2,2,2,4,8,11
437,t5_2bfy1d,neighboursbabes,Mature Themes and Adult Content,Mature Themes and Adult Content,E,Everyone,f,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,510,170,1,1,1,1,1,2,2,2,4,8,11
108,t5_39o6d,aussiebabes,Mature Themes and Adult Content,Mature Themes and Adult Content,E,Everyone,,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,661,21,1,1,1,1,1,2,2,2,5,9,12


In [None]:
l_cols_to_drop = (
    ['table_creation_date'] +
    [c for c in df_labels_target_dynamic_raw.columns if c.endswith('_nested')]
)
print(len(l_cols_to_drop))
# df_labels_target_dynamic_raw.columns.to_list()

23


In [None]:
%%time
# Drop the helper columns once, then send the header row + data rows in one update
df_sub_raw_to_save_ = df_labels_target_dynamic_raw.drop(l_cols_to_drop, axis=1)
(
    d_wsh_names['sub_raw']['worksheet']
    .update([df_sub_raw_to_save_.columns.values.tolist()] +
            df_sub_raw_to_save_.fillna('').values.tolist())
)

CPU times: user 85.3 ms, sys: 3.28 ms, total: 88.6 ms
Wall time: 3.08 s


{'spreadsheetId': '1ujjoJ7uyTWz3P1aYgZTKyJGbETKAfk3vrom_ASaNUt0',
 'updatedCells': 107550,
 'updatedColumns': 75,
 'updatedRange': 'raw_data_per_subreddit!A1:BW1434',
 'updatedRows': 1434}

### We can read the data back to confirm it's as expected

In [None]:
# Here's how to get the records as a dataframe
pd.DataFrame(
    d_wsh_names['sub_raw']['worksheet'].get_all_records()
).iloc[:5, :15]

Unnamed: 0,subreddit_id,subreddit_name,cluster_majority_primary_topic,primary_topic,rating_short,rating_name,over_18,cluster_label,cluster_label_k,model_sort_order,posts_for_modeling_count,k_0013_label,k_0023_label,k_0041_label,k_0059_label
0,t5_44faux,neighboursbabez,Mature Themes and Adult Content,Mature Themes and Adult Content,,,,1,k_0013_label,334,36,1,1,1,1
1,t5_iqt8v,altladyboners,Mature Themes and Adult Content,Celebrity,E,Everyone,f,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,489,108,1,1,1,1
2,t5_3g9c8,rtgirls,Mature Themes and Adult Content,Podcasts and Streamers,E,Everyone,,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,509,36,1,1,1,1
3,t5_2bfy1d,neighboursbabes,Mature Themes and Adult Content,Mature Themes and Adult Content,E,Everyone,f,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,510,170,1,1,1,1
4,t5_39o6d,aussiebabes,Mature Themes and Adult Content,Mature Themes and Adult Content,E,Everyone,,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,661,21,1,1,1,1
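
One thing to watch in the read-back above: the first row's `cluster_label` comes back as `1` instead of `0001`, so single-level labels appear to lose their zero padding on the round trip. A minimal sketch of restoring it is below; `restore_cluster_label` is a hypothetical helper, and the 4-digit width is an assumption based on the labels shown in this notebook. gspread's `get_all_records` also appears to accept a `numericise_ignore` argument that would skip the numeric conversion, which is worth checking against the installed version.

```python
import pandas as pd


def restore_cluster_label(val, width: int = 4) -> str:
    """Re-pad a label that was read back as a number (e.g. 1 -> '0001')."""
    s = str(val)
    # Multi-level labels such as '0002-0002' contain hyphens and stay strings,
    # so only hyphen-free values need padding
    return s if '-' in s else s.zfill(width)


df_read_back = pd.DataFrame({'cluster_label': [1, '0001-0001-0001', 4]})
df_read_back['cluster_label'] = (
    df_read_back['cluster_label'].apply(restore_cluster_label)
)
print(df_read_back['cluster_label'].tolist())
# → ['0001', '0001-0001-0001', '0004']
```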


## Save target 2 target clusters - human readable
This one is mostly a quick way to visually inspect the clusters. It doesn't get used by other tasks.

In [None]:
d_wsh_names.keys()

dict_keys(['sub_raw', 'clusters_t2t_list_raw', 'clusters_t2t_fpr_raw'])

In [None]:
%%time

(
    d_wsh_names['clusters_t2t_list_raw']['worksheet']
    .update(
        [df_cluster_for_humans.columns.values.tolist()] + 
        df_cluster_for_humans.fillna('').values.tolist()
    )
)

CPU times: user 22 ms, sys: 2.21 ms, total: 24.2 ms
Wall time: 1.05 s


{'spreadsheetId': '1ujjoJ7uyTWz3P1aYgZTKyJGbETKAfk3vrom_ASaNUt0',
 'updatedCells': 1158,
 'updatedColumns': 6,
 'updatedRange': 'raw_clusters_list_au_au!A1:F193',
 'updatedRows': 193}

## Save FPR (raw) format. 1 row = 1 subreddit with counterpart/cluster subs

The utility function `convert_distance_or_ab_to_list_for_fpr` does the reshaping in one call.
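
For intuition, here is a rough pandas sketch of the kind of reshaping involved: pair each subreddit (seed) with the other subs in its cluster, drop matches to self, and collapse to one row per seed. This is not the utility function's actual implementation; the toy data is made up and the output column names are borrowed from the sheet format for readability.

```python
import pandas as pd

df_subs = pd.DataFrame({
    'subreddit_id': ['t5_a', 't5_b', 't5_c', 't5_d', 't5_e'],
    'subreddit_name': ['sub_a', 'sub_b', 'sub_c', 'sub_d', 'sub_e'],
    'cluster_label': ['0001', '0001', '0001', '0002', '0002'],
})

# 1) Pair every seed subreddit with every subreddit in the same cluster
df_ab = df_subs.merge(df_subs, on='cluster_label', suffixes=('_seed', '_b'))

# 2) Remove matches to self
df_ab = df_ab[df_ab['subreddit_id_seed'] != df_ab['subreddit_id_b']]

# 3) Collapse to one row per seed with comma-separated counterparts
df_a_to_b = (
    df_ab
    .groupby(['subreddit_name_seed', 'subreddit_id_seed', 'cluster_label'], sort=False)
    .agg(
        counterpart_count=('subreddit_id_b', 'count'),
        list_cluster_subreddit_names=('subreddit_name_b', ', '.join),
        list_cluster_subreddit_ids=('subreddit_id_b', ', '.join),
    )
    .reset_index()
)
print(df_a_to_b.shape)
```

A subreddit that is alone in its cluster has no counterparts and drops out at step 2, so the output can have fewer rows than the input.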

In [None]:
col_sort_order

'model_sort_order'

In [None]:
%%time

df_target_to_target_list = convert_distance_or_ab_to_list_for_fpr(
    df_labels_target_dynamic_raw,
    convert_to_ab=True,
    col_counterpart_count='counterpart_count',
    col_list_cluster_names='list_cluster_subreddit_names',
    col_list_cluster_ids='list_cluster_subreddit_ids',
    l_cols_for_seeds=None,
    l_cols_for_clusters=None,
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    verbose=False,
)
df_target_to_target_list.shape

  (10352, 9) <- df_ab.shape after removing matches to self
  (1433, 7) <- df_a_to_b.shape
CPU times: user 136 ms, sys: 1.59 ms, total: 137 ms
Wall time: 144 ms


In [None]:
df_target_to_target_list.iloc[:5, :11]

Unnamed: 0,subreddit_name_seed,subreddit_id_seed,cluster_label,cluster_label_k,counterpart_count,list_cluster_subreddit_names,list_cluster_subreddit_ids
0,neighboursbabez,t5_44faux,0001,k_0013_label,4,"stephclairesmith, isabelleclarke, sophadophaa, eugeniebouchard","t5_3agvg, t5_24v66b, t5_5fbvny, t5_2zui0"
1,altladyboners,t5_iqt8v,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,5,"rtgirls, neighboursbabes, aussiebabes, downundercelebs, girlstennis","t5_3g9c8, t5_2bfy1d, t5_39o6d, t5_28i70j, t5_3fruf"
2,rtgirls,t5_3g9c8,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,5,"altladyboners, neighboursbabes, aussiebabes, downundercelebs, girlstennis","t5_iqt8v, t5_2bfy1d, t5_39o6d, t5_28i70j, t5_3fruf"
3,neighboursbabes,t5_2bfy1d,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,5,"altladyboners, rtgirls, aussiebabes, downundercelebs, girlstennis","t5_iqt8v, t5_3g9c8, t5_39o6d, t5_28i70j, t5_3fruf"
4,aussiebabes,t5_39o6d,0001-0001-0001-0001-0001-0002-0002-0002,k_0118_label,5,"altladyboners, rtgirls, neighboursbabes, downundercelebs, girlstennis","t5_iqt8v, t5_3g9c8, t5_2bfy1d, t5_28i70j, t5_3fruf"


In [None]:
%%time

(
    d_wsh_names['clusters_t2t_fpr_raw']['worksheet']
    .update(
        [df_target_to_target_list.columns.values.tolist()] + 
        df_target_to_target_list.fillna('').values.tolist()
    )
)

CPU times: user 29.1 ms, sys: 1.98 ms, total: 31.1 ms
Wall time: 1.06 s


{'spreadsheetId': '1ujjoJ7uyTWz3P1aYgZTKyJGbETKAfk3vrom_ASaNUt0',
 'updatedCells': 10038,
 'updatedColumns': 7,
 'updatedRange': 'raw_clusters_fpr_au_au!A1:G1434',
 'updatedRows': 1434}

# Appendix


## Additional checks on cluster depth

In [None]:
print(df_labels_target_dynamic_raw['cluster_label'].nunique())
display(
    value_counts_and_pcts(
        df_labels_target_dynamic_raw,
        ['cluster_label'],
        top_n=10,
    )
)
value_counts_and_pcts(
    df_labels_target_dynamic_raw,
    ['cluster_label'],
    top_n=None,
    return_df=True
)['count'].describe()

189


Unnamed: 0_level_0,count,percent,cumulative_percent
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0007-0011-0019-0026-0027-0036-0037-0047-0136-0274-0404-0454-0659-0777-0934-0994-1201-1336-1439-1566-1639-1664,22,1.5%,1.5%
0008-0013-0023-0031-0033-0043-0045-0062,18,1.3%,2.8%
0007-0012-0020-0027-0028-0037-0038-0048-0138-0277-0409-0459-0669-0790-0950-1011-1219-1358-1462-1593-1668-1693,17,1.2%,4.0%
0010-0016-0029-0039-0041-0052-0055-0077-0217-0433-0625-0699-1010-1187-1430-1521-1815-2018-2181-2364-2464-2498,16,1.1%,5.1%
0006-0010-0018-0025-0025-0033-0034-0043,16,1.1%,6.3%
0010-0017-0030-0040-0043-0054-0057-0082-0227-0451-0651-0728-1054-1239-1489-1585-1895-2104-2272-2466-2572-2607,14,1.0%,7.3%
0006-0010-0017-0023-0023-0031-0032-0041,14,1.0%,8.2%
0013-0021-0037-0054-0057-0072-0076-0106-0287,13,0.9%,9.2%
0013-0021-0037-0054-0057-0072-0077-0107-0289-0587-0852-0948-1381-1634-1953-2081-2503-2771-3001,12,0.8%,10.0%
0006-0008-0013-0017-0017-0021-0022-0026,12,0.8%,10.8%


count    189.000000
mean       7.513228
std        2.400373
min        3.000000
25%        6.000000
50%        7.000000
75%        8.000000
max       22.000000
Name: count, dtype: float64

### How deep are the clusters?

Looks like the subreddit counts peak at cluster depths (k) of around 100, 300, 1k, and 4k.
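
In the outputs earlier in this notebook, the number of hyphen-separated segments in `cluster_label` lines up with the depth column: one segment pairs with `k_0013_label`, two with `k_0023_label`, three with `k_0041_label`, and eight with `k_0118_label`. Assuming that pattern holds across the table, a quick depth check can be sketched as follows (`cluster_depth` is a hypothetical helper):

```python
import pandas as pd


def cluster_depth(label: str) -> int:
    """Count the hyphen-separated levels in a nested cluster label."""
    return len(label.split('-'))


s_labels = pd.Series([
    '0001',
    '0002-0002',
    '0005-0007-0012',
    '0001-0001-0001-0001-0001-0002-0002-0002',
])
s_depth = s_labels.apply(cluster_depth)
print(s_depth.tolist())
# → [1, 2, 3, 8]
```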

In [None]:
print(df_labels_target_dynamic_raw[col_new_cluster_name].nunique())
value_counts_and_pcts(
    df_labels_target_dynamic_raw,
    [col_new_cluster_name],
    top_n=None,
    sort_index=True,
)

22


Unnamed: 0_level_0,count,percent,cumulative_percent
cluster_label_k,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
k_0013_label,37,2.6%,2.6%
k_0023_label,35,2.5%,5.1%
k_0041_label,30,2.1%,7.2%
k_0059_label,16,1.1%,8.3%
k_0063_label,15,1.1%,9.4%
k_0079_label,12,0.8%,10.2%
k_0085_label,38,2.7%,12.9%
k_0118_label,276,19.4%,32.3%
k_0320_label,194,13.7%,46.0%
k_0657_label,88,6.2%,52.2%


In [None]:
# style_df_numeric(
#     df_labels_target.tail(10),
#     # rename_cols_for_display=True,
#     l_bar_simple=[c for c in df_labels_target.columns if '_label' in c]
# )

# Filter out subs [extra]

UPDATE: For now let's include all the subreddits for QA because this list could help us rate/flag high-traffic subreddits that are unrated or mis-rated.

Now that we have even more clusters (over 3,000), it's harder to figure out where to set the threshold for clusters to exclude.

--- 

The main use case for now is SFW subs, so we could save some QA time by excluding these subs:
- Exclude NSFW clusters
- Exclude place subs


~We'll use the cluster labels to discard subreddits because~
- many of the DE subreddits don't have a `primary_topic`
- if the majority of subs in a cluster are NSFW, then we wouldn't want to recommend those anyway

In [None]:
# # we can see that the subreddit count changes as we go 
# #  from shallow to deeper cluster counts
# value_counts_and_pcts(
#     df_labels_target['k_0118_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [None]:
# value_counts_and_pcts(
#     df_labels_target['k_3145_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [None]:
# value_counts_and_pcts(
#     df_labels_target['k_3927_majority_primary_topic'],
#     top_n=15,
#     reset_index=True,
#     add_col_prefix=False,
#     count_type='subreddits',
#     return_df=False,
# )

In [None]:
# # And the count is slightly different from the known primary topics
# #  We still have a large number of subs w/o a primary topic
# value_counts_and_pcts(
#     df_labels_target['primary_topic'],
#     count_type='subreddits',
#     reset_index=True,
#     add_col_prefix=False,
# )

In [None]:
print(f"{df_labels_target.shape} <- Shape before filtering")

l_manual_subs_to_remove = [
    'sexmeets1', 'fuck',
]
col_cluster_filter = 'k_3145_majority_primary_topic'
df_labels_target_clean = (
    df_labels_target[df_labels_target[col_cluster_filter] != 'Mature Themes and Adult Content']
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping NSFW clusters")

l_sensitive_topics = [
    'Military', 'Gender', 'Addiction Support',
    'Medical and Mental Health', 'Sexual Orientation',
    'Culture, Race, and Ethnicity',
]
df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean[col_cluster_filter].isin(l_sensitive_topics)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping Sensitive clusters")

df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean['primary_topic'].isin(l_sensitive_topics)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping SENSITIVE subreddits")


df_labels_target_clean = (
    df_labels_target_clean[
        ~df_labels_target_clean['subreddit_name'].isin(l_manual_subs_to_remove)
    ]
)
print(f"{df_labels_target_clean.shape} <- Shape after dropping Manual list of subreddits")

print(f"  ** TODO: instead of excluding place subs, add logic to map hierarchy **")
# df_labels_target_clean = (
#     df_labels_target_clean[df_labels_target_clean['primary_topic'] != 'Place']
# )
# print(f"{df_labels_target_clean.shape} <- Shape after dropping Place subreddits")

(1420, 73) <- Shape before filtering
(1344, 73) <- Shape after dropping NSFW clusters
(1253, 73) <- Shape after dropping Sensitive clusters
(1238, 73) <- Shape after dropping SENSITIVE subreddits
(1238, 73) <- Shape after dropping Manual list of subreddits
  ** TODO: instead of excluding place subs, add logic to map hierarchy **
