# Purpose

### 2022-07-18
In this notebook we'll run a batch of countries through a new FPR output process.
Instead of saving data to Google Sheets, we'll save:
- FPR outputs to a GCS bucket (JSON)
- FPR summary to BigQuery
- FPR details to BigQuery

We can then use the summary & details in a Mode dashboard for inspection (if needed).

See this dashboard for more information about the model coverage & filters.
https://app.mode.com/reddit/reports/b99c94984018
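
As a rough illustration of the GCS destination listed above, the sketch below builds an object URI for one country's JSON output. The bucket and base path match the values in the test config shown later in this notebook. The `run_id` timestamp format and the `<run_id>/<country>.json` layout are assumptions made for illustration; they are not confirmed to be what `CreateFPRs` uses, and both helper functions are hypothetical.

```python
from datetime import datetime


def make_run_id(now: datetime) -> str:
    """Format a timestamp like the run_id values seen in the outputs (assumed format)."""
    return now.strftime("%Y-%m-%d_%H%M%S")


def fpr_json_uri(bucket: str, base_path: str, run_id: str, country_code: str) -> str:
    """Hypothetical layout: gs://<bucket>/<base_path>/<run_id>/<country>.json"""
    return f"gs://{bucket}/{base_path.strip('/')}/{run_id}/{country_code}.json"


run_id = make_run_id(datetime(2022, 7, 26, 3, 2, 17))
uri = fpr_json_uri(
    bucket="i18n-subreddit-clustering",
    base_path="i18n_topic_model_batch/fpr/runs_test",
    run_id=run_id,
    country_code="MX",
)
print(uri)
```

Keeping the run ID in the path would make it possible to keep several runs side by side without overwriting earlier outputs.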


# Imports & notebook setup

In [None]:
%load_ext autoreload
%autoreload 2

# Register bigquery magic (only needed for laptop/local, not colab)
# %load_ext google.cloud.bigquery

In [None]:
# colab auth for BigQuery, google drive, & google sheets (gspread)
from google.colab import auth, files, drive
from google.auth import default
import sys  # need sys for mounting gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


## Install custom library

### Append google drive path so we can install library from there

In [None]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
import sys


g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')
    
except Exception:
    # Fall back to the private mount helper before giving up
    try:
        drive._mount(g_drive_root, force_remount=True)
        print('   Authenticated & mounted Google Drive')
    except Exception as e:
        print(e)
        raise Exception('You might need to manually mount google drive to colab') from e

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install library

In [None]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet

## Regular Imports

In [None]:
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm


# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

## Custom imports

In [None]:
# subclu imports
import subclu
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr,
    get_primary_topic_mix_cols,
)
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.models.reshape_clusters_v050 import (
    get_table_for_optimal_dynamic_cluster_params,
    CreateFPRs,
    get_dynamic_cluster_summary,
    get_geo_relevant_subreddits_and_cluster_labels,
    get_fpr_cluster_per_row_summary,
    reshape_df_1_cluster_per_row,
)

setup_logging()
notebook_display_config()
print_lib_versions([pd, np])

python		v 3.7.13
===
pandas		v: 1.3.5
numpy		v: 1.21.6


# Checklist to re-run FPRs

- Update list of countries to run
- Update path to save outputs (in GCS)


With this new process we should only need a list of country names to get an FPR output. Everything else should be automated as long as we load from the default config.
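
To make the "only a list of country names" idea concrete, here is a small sketch of how the override strings used in the next cell are rendered. It only shows what the f-strings produce; how `LoadHydraConfig` parses them is not shown here and is left to the library.

```python
# Country codes to run; the same kind of list used in the cells below.
l_target_countries = ["ES", "MX", "IT"]

# Each override is plain `key=value` text. Formatting a Python list with an
# f-string yields its repr, e.g. "['ES', 'MX', 'IT']".
overrides = [
    f"target_countries={l_target_countries}",
    "partition_dt=2022-07-24",
]

for override in overrides:
    key, _, value = override.partition("=")
    print(f"{key:>18} -> {value}")
```

Everything else (tables, thresholds, output paths) would then come from the default config, so a re-run only needs a new country list and, if needed, a new partition date.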

## Load test configuration

In [None]:
# load test config
l_target_countries = [
    'ES',
    'MX',
    'IT',
    'BR',
    'TR',
    'PH',
    'DE',
    'NL',
    'RO',
]

cfg_fpr_test = LoadHydraConfig(
    config_name='fpr_v050_test.yaml',
    config_path="../config",
    overrides=[
        f"target_countries={l_target_countries}",
        f"partition_dt=2022-07-24",
    ],
)

print([k for k in cfg_fpr_test.config_dict.keys()])
cfg_fpr_test.config_dict

['description', 'target_countries', 'output_bucket', 'gcs_output_path', 'cluster_labels_table', 'partition_dt', 'geo_relevance_table', 'geo_min_users_percent_by_subreddit_l28', 'geo_min_country_standardized_relevance', 'qa_table', 'qa_pt']


{'cluster_labels_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_full',
 'description': 'Base config to test FPR creation',
 'gcs_output_path': 'i18n_topic_model_batch/fpr/runs_test',
 'geo_min_country_standardized_relevance': 2.4,
 'geo_min_users_percent_by_subreddit_l28': 0.14,
 'geo_relevance_table': 'reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725',
 'output_bucket': 'i18n-subreddit-clustering',
 'partition_dt': '2022-07-24',
 'qa_pt': '2022-07-24',
 'qa_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags',
 'target_countries': ['ES', 'MX', 'IT', 'BR', 'TR', 'PH', 'DE', 'NL', 'RO']}

## Run `create_fprs()` method

This method should do everything needed to create FPRs.
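
The `df_fpr` preview further down suggests the shape of the final output: one row per seed subreddit, listing the other recommendable subreddits in the same cluster. Below is a minimal sketch of that reshaping, assuming a seed is never recommended to itself. The function name and input structure are hypothetical; this is not the `subclu` implementation.

```python
from typing import Dict, List


def recs_per_seed(seeds: List[str], recommendable: List[str]) -> Dict[str, List[str]]:
    """For each seed, list the cluster's recommendable subs excluding the seed itself."""
    return {seed: [sub for sub in recommendable if sub != seed] for seed in seeds}


# Toy cluster based on the music cluster shown in the MX preview:
# `corridos` is a seed but is not in the recommendable list.
seeds = ["indiemx", "musicaenespanol", "corridos", "cumbia"]
recommendable = ["indiemx", "musicaenespanol", "cumbia"]

fpr = recs_per_seed(seeds, recommendable)
for seed, recs in fpr.items():
    print(f"{seed}: {recs}")
```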

In [None]:
%%time
fprs_test = CreateFPRs(
    **cfg_fpr_test.config_dict
)

# run all countries
fprs_test.create_fprs()

  0%|          | 0/9 [00:00<?, ?it/s]
07:48:54 | INFO | "== Country: ES =="
07:48:54 | INFO | "Getting geo-relevant subreddits in model for ES..."
07:48:57 | INFO | " (213, 160)  <- df_shape"
07:48:57 | INFO | " (213, 161) <- Shape AFTER dropping subreddits with covid in title"
07:48:57 | INFO | "Finding optimal N (target # of subs per cluster)..."

100%|██████████| 6/6 [00:13<00:00,  2.30s/it]
07:49:11 | INFO | "  5 <-- Optimal N"
07:49:11 | INFO | "Assigning clusters based on optimal N..."
07:49:13 | INFO | "Getting QA and summary at cluster_level..."
07:49:14 | INFO | "(44, 23)  <- df.shape full summary"
07:49:14 | INFO | "Adding metadata to df_top_level_summary..."
07:49:14 | INFO | "Creating FPR outp

CPU times: user 2min 55s, sys: 2.1 s, total: 2min 57s
Wall time: 3min 59s






# Tests on a single country

In [None]:
# load test config
l_target_countries = [
    'ES',
    'MX',
    'BR',
    'TR',
    'PH',
]

cfg_fpr_test = LoadHydraConfig(
    config_name='fpr_v050_test.yaml',
    config_path="../config",
    overrides=[
        f"target_countries={l_target_countries}",
        f"partition_dt=2022-07-24",
    ],
)

print([k for k in cfg_fpr_test.config_dict.keys()])
cfg_fpr_test.config_dict

['description', 'target_countries', 'output_bucket', 'gcs_output_path', 'cluster_labels_table', 'partition_dt', 'geo_relevance_table', 'geo_min_users_percent_by_subreddit_l28', 'geo_min_country_standardized_relevance', 'qa_table', 'qa_pt']


{'cluster_labels_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_full',
 'description': 'Base config to test FPR creation',
 'gcs_output_path': 'i18n_topic_model_batch/fpr/runs_test',
 'geo_min_country_standardized_relevance': 2.4,
 'geo_min_users_percent_by_subreddit_l28': 0.14,
 'geo_relevance_table': 'reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725',
 'output_bucket': 'i18n-subreddit-clustering',
 'partition_dt': '2022-07-24',
 'qa_pt': '2022-07-24',
 'qa_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags',
 'target_countries': ['ES', 'MX', 'BR', 'TR', 'PH']}

In [None]:
%%time
fprs_test = CreateFPRs(
    **cfg_fpr_test.config_dict
)

# run a single country
# NOTE: the result is stored in `d_mx`, but the country code passed here is 'ES'
d_mx = fprs_test.create_fpr_(
    'ES', optimal_k_search=np.arange(5, 7),
    verbose=False, verbose_summary=True, fpr_verbose=False,
)

03:50:59 | INFO | "Getting geo-relevant subreddits in model for ES..."
03:51:02 | INFO | " (213, 160)  <- df_shape"
03:51:02 | INFO | " (213, 161) <- Shape AFTER dropping subreddits with covid in title"
03:51:02 | INFO | "Finding optimal N (target # of subs per cluster)..."
100%|██████████| 2/2 [00:04<00:00,  2.32s/it]
03:51:07 | INFO | "  5 <-- Optimal N"
03:51:07 | INFO | "Assigning clusters based on optimal N..."
03:51:09 | INFO | "Getting QA and summary at cluster_level..."
03:51:09 | INFO | "   213 <- SEED subreddits"
03:51:09 | INFO | "   182 <- RECOMMEND subs (includes orphans)"
03:51:09 | INFO | "    23 <- missingTopic subreddits"
03:51:09 | INFO | "     6 <- discover=f subs"
03:51:09 | INFO | "     0 <- private subs"
03:51:09 | INFO | "   207 <- seed_subreddit_ids_count"
03:51:09 | INFO | "   178 <- recommend_subreddit_ids_count"
03:51:09 | INFO | "     6 <- orphan_or_exclude_seed_subreddit_ids_count"
03:51:09 | INFO | "     5 <- orphan_seed_subreddit_ids_count"
03:51:09 | INF

CPU times: user 7.63 s, sys: 96.1 ms, total: 7.73 s
Wall time: 14.3 s


## Check QA `combined` filter

Some subreddits with a `remove` filter may still be used as seeds, but in that case their reason should be `allow_discovery=f`.
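
The same expectation can be written as an explicit check. The sketch below uses a toy DataFrame with the same two column names; the rows are invented for illustration, and the reason value `allow_discovery_f` follows the spelling that appears in the output table.

```python
import pandas as pd

# Toy data (invented); only the column names mirror the notebook's output.
df = pd.DataFrame(
    {
        "subreddit_name": ["a", "b", "c", "d"],
        "combined_filter": ["recommend", "remove", "review", "remove"],
        "combined_filter_reason": [
            "predictions_clean",
            "allow_discovery_f",
            "missing_topic",
            "allow_discovery_f",
        ],
    }
)

# Reasons attached to rows flagged as `remove`.
remove_reasons = set(df.loc[df["combined_filter"] == "remove", "combined_filter_reason"])

# Anything other than `allow_discovery_f` would be unexpected among seeds.
unexpected = remove_reasons - {"allow_discovery_f"}
print(f"Unexpected remove reasons: {sorted(unexpected)}")
```

An empty `unexpected` set means every `remove` row is explained by the discovery flag.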

In [None]:
value_counts_and_pcts(
    d_mx['df_labels_target_dynamic'],
    ['combined_filter', 'combined_filter_reason'],
    sort_index=True,
    cumsum=False
)

| combined_filter | combined_filter_reason | count | percent |
|---|---|---:|---:|
| recommend | missing_topic | 15 | 6.0% |
| recommend | predictions_clean | 78 | 31.0% |
| recommend | predictions_missing | 97 | 38.5% |
| remove | allow_discovery_f | 3 | 1.2% |
| review | missing_topic | 51 | 20.2% |
| review | review_topic | 8 | 3.2% |


In [None]:
(
    d_mx['df_labels_target']
    [d_mx['df_labels_target']['combined_filter'] == 'remove']
    .iloc[:, :29]
)

In [None]:
(
    d_mx['df_labels_target']
    [d_mx['df_labels_target']['combined_filter_detail'] == 'review-review_topic']
    .iloc[:, :29]
)

## Preview output dfs

In [None]:
%%time
for name_, df_ in d_mx.items():
    print(f"\n{name_}")
    try:
        print(df_.shape)
        l_cols_labels_drop = [c for c in df_.columns if all([c.startswith('k_'), c.endswith('_label')])]
        l_cols_topic_drop =  [c for c in df_.columns if all([c.startswith('k_'), c.endswith('_primary_topic')])]
        l_cols_nested_drop =  [c for c in df_.columns if all([c.startswith('k_'), c.endswith('_nested')])]
        n_label_cols = len(l_cols_labels_drop)
        n_topic_cols = len(l_cols_topic_drop)
        n_nested_cols = len(l_cols_nested_drop)
        print(f"Label cols to drop: {n_label_cols}")
        print(f"Topic cols to drop: {n_topic_cols}")
        print(f"Nested cols to drop: {n_nested_cols}")
        if n_label_cols > 0:
            print(f"  Label cols sample: {l_cols_labels_drop[:5]}")
        if n_topic_cols > 0:
            print(f"  Topic cols sample: {l_cols_topic_drop[:5]}")
        if n_nested_cols > 0:
            print(f"  Nested cols sample: {l_cols_nested_drop[:5]}")
        l_all_cols_to_drop = l_cols_labels_drop + l_cols_topic_drop + l_cols_nested_drop
        if len(l_cols_topic_drop) > 0:
            print(f"{df_.drop(l_all_cols_to_drop, axis=1).shape} <-- Shape after dropping cols")
            display(df_.drop(l_all_cols_to_drop, axis=1).iloc[:5, :29])
        else:
            display(df_.iloc[:5, :59])

    except Exception:
        # dict outputs (e.g. d_fpr_qa, dict_fpr) don't have `.shape`, so show their keys instead
        print(f"  {df_.keys()}")


df_labels_target
(252, 161)
Label cols to drop: 66
Topic cols to drop: 66
Nested cols to drop: 0
  Label cols sample: ['k_0050_label', 'k_0052_label', 'k_0060_label', 'k_0066_label', 'k_0070_label']
  Topic cols sample: ['k_0050_majority_primary_topic', 'k_0052_majority_primary_topic', 'k_0060_majority_primary_topic', 'k_0066_majority_primary_topic', 'k_0070_majority_primary_topic']
(252, 29) <-- Shape after dropping cols


Unnamed: 0,pt,qa_pt,qa_table,geo_relevance_table,subreddit_id,users_l7,geo_country_code,country_name,subreddit_name,geo_relevance_default,relevance_combined_score,users_percent_by_subreddit_l28,users_percent_by_country_standardized,primary_topic,rating_short,predicted_rating,predicted_topic,allow_discovery,over_18,type,combined_filter_detail,combined_filter,combined_filter_reason,taxonomy_action,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,model_sort_order,posts_for_modeling_count,run_id
0,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,t5_28x930,7293,MX,Mexico,mexicowave,True,0.931074,0.913843,8.772322,Art,E,E,,,,public,recommend-predictions_missing,recommend,predictions_missing,,True,True,57944,118.0,2022-07-26_030217
1,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,t5_3oy6i,110,MX,Mexico,tepic,True,0.755967,0.732203,3.594327,"Culture, Race, and Ethnicity",E,E,,t,f,public,review-review_topic,review,review_topic,review_topic,True,True,57946,8.0,2022-07-26_030217
2,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,t5_43jrmn,58,MX,Mexico,monterreywave,True,0.839095,0.868132,2.652269,Internet Culture and Memes,E,E,,,f,public,recommend-predictions_missing,recommend,predictions_missing,,True,True,57947,5.0,2022-07-26_030217
3,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,t5_2721oo,86,MX,Mexico,mexicancomicbooks,True,0.49674,0.384858,4.429773,,E,E,,,,public,review-missing_topic,review,missing_topic,missing_topic,True,True,39893,31.0,2022-07-26_030217
4,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,t5_2vrad,75,MX,Mexico,webcomicsenespanol,True,0.37155,0.3,2.249591,,E,,,,,restricted,review-missing_topic,review,missing_topic,missing_topic,True,False,39894,264.0,2022-07-26_030217



df_labels_target_dynamic
(252, 300)
Label cols to drop: 66
Topic cols to drop: 66
Nested cols to drop: 132
  Label cols sample: ['k_0050_label', 'k_0052_label', 'k_0060_label', 'k_0066_label', 'k_0070_label']
  Topic cols sample: ['k_0050_majority_primary_topic', 'k_0052_majority_primary_topic', 'k_0060_majority_primary_topic', 'k_0066_majority_primary_topic', 'k_0070_majority_primary_topic']
  Nested cols sample: ['k_0050_label_nested', 'k_0052_label_nested', 'k_0060_label_nested', 'k_0066_label_nested', 'k_0070_label_nested']
(252, 36) <-- Shape after dropping cols


Unnamed: 0,subreddit_id,subreddit_name,cluster_label_int,cluster_topic_mix,primary_topic,rating_short,subreddit_full_topic_mix,over_18,geo_relevance_default,relevance_percent_by_subreddit,relevance_percent_by_country_standardized,model_sort_order,posts_for_modeling_count,cluster_label,cluster_label_k,cluster_majority_primary_topic,pt,qa_pt,qa_table,geo_relevance_table,users_l7,geo_country_code,country_name,relevance_combined_score,users_percent_by_subreddit_l28,users_percent_by_country_standardized,predicted_rating,predicted_topic,allow_discovery
0,t5_28x930,mexicowave,410,Place,Art,E,Place | Art,,True,True,True,57944,118.0,0031-0032-0038-0041-0044-0047-0049-0055-0057-0060-0071-0076-0089-0100-0113-0154-0177-0238-0291-0350-0352-0410,k_0700_label,Place,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,7293,MX,Mexico,0.931074,0.913843,8.772322,E,,
1,t5_3oy6i,tepic,410,Place,"Culture, Race, and Ethnicity",E,Place | Art,f,True,True,True,57946,8.0,0031-0032-0038-0041-0044-0047-0049-0055-0057-0060-0071-0076-0089-0100-0113-0154-0177-0238-0291-0350-0352-0410,k_0700_label,Place,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,110,MX,Mexico,0.755967,0.732203,3.594327,E,,t
2,t5_43jrmn,monterreywave,410,Place,Internet Culture and Memes,E,Place | Art,f,True,True,True,57947,5.0,0031-0032-0038-0041-0044-0047-0049-0055-0057-0060-0071-0076-0089-0100-0113-0154-0177-0238-0291-0350-0352-0410,k_0700_label,Place,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,58,MX,Mexico,0.839095,0.868132,2.652269,E,,
3,t5_2721oo,mexicancomicbooks,32,Gaming,,E,Gaming | Technology | Art,,True,True,True,39893,31.0,0019-0019-0023-0024-0025-0027-0028-0031-0032,k_0094_label,Gaming,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,86,MX,Mexico,0.49674,0.384858,4.429773,E,,
4,t5_2vrad,webcomicsenespanol,32,Gaming,,E,Gaming | Technology | Art,,True,True,False,39894,264.0,0019-0019-0023-0024-0025-0027-0028-0031-0032,k_0094_label,Gaming,2022-07-24,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,75,MX,Mexico,0.37155,0.3,2.249591,,,



df_summary_cluster
(48, 23)
Label cols to drop: 0
Topic cols to drop: 0
Nested cols to drop: 0


Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds,missingTopic_subreddit_count,missingTopic_subreddit_names_list,discoveryF_subreddit_count,discoveryF_subreddit_names_list,private_subreddit_count,private_subreddit_names_list
0,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,0013,k_0050_label,13,Music,1,anamanaguchi,t5_2sedm,1,anamanaguchi,t5_2sedm,True,False,0,,0,,0,
1,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,0014-0014-0017,k_0060_label,17,Music,7,"indiemx, musicaenespanol, corridos, cumbia, latinpopheads, cuartetodenos, hypetracks","t5_24z6p6, t5_2v8rv, t5_2vneg, t5_2x1po, t5_wfesz, t5_2csckx, t5_2jh5ny",6,"indiemx, musicaenespanol, cumbia, latinpopheads, cuartetodenos, hypetracks","t5_24z6p6, t5_2v8rv, t5_2x1po, t5_wfesz, t5_2csckx, t5_2jh5ny",False,False,0,,0,,0,
2,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,0014-0014-0017-0018-0019-0021-0022-0024-0025-0026-0033-0035-0040-0045-0049-0068,k_0266_label,68,Music,6,"withintemptation, dannyelfman, homeshake, marsargo, thewarning, themarsvolta","t5_2wzac, t5_31291, t5_33h9w, t5_3brig, t5_2vbdi, t5_2sdzc",6,"withintemptation, dannyelfman, homeshake, marsargo, thewarning, themarsvolta","t5_2wzac, t5_31291, t5_33h9w, t5_3brig, t5_2vbdi, t5_2sdzc",False,False,0,,0,,0,
3,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,0015,k_0050_label,15,Animals and Pets,2,"packadaykitties, torties","t5_255vuq, t5_2tt9j",2,"packadaykitties, torties","t5_255vuq, t5_2tt9j",False,False,0,,0,,0,
4,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,0016,k_0050_label,16,Internet Culture and Memes,3,"wishwtf, watchthingsfly, fursedimages","t5_bwpub, t5_2bzypc, t5_nz4bq",3,"wishwtf, watchthingsfly, fursedimages","t5_bwpub, t5_2bzypc, t5_nz4bq",False,False,0,,0,,0,



d_fpr_qa
  dict_keys(['seed_subreddit_ids', 'seed_subreddit_ids_count', 'recommend_subreddit_ids', 'recommend_subreddit_ids_count', 'orphan_or_exclude_seed_subreddit_ids', 'orphan_or_exclude_seed_subreddit_ids_count', 'orphan_seed_subreddit_ids', 'orphan_seed_subreddit_ids_count', 'orphan_recommend_subreddit_ids', 'orphan_recommend_subreddit_ids_count'])

df_top_level_summary
(1, 18)
Label cols to drop: 0
Topic cols to drop: 0
Nested cols to drop: 0


Unnamed: 0,pt,geo_relevance_table,qa_pt,qa_table,run_id,geo_country_code,country_name,relevant_subreddit_id_count,seed_subreddit_ids,seed_subreddit_ids_count,recommend_subreddit_ids,recommend_subreddit_ids_count,orphan_or_exclude_seed_subreddit_ids,orphan_or_exclude_seed_subreddit_ids_count,orphan_seed_subreddit_ids,orphan_seed_subreddit_ids_count,orphan_recommend_subreddit_ids,orphan_recommend_subreddit_ids_count
0,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,2022-07-24,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,2022-07-26_030217,MX,Mexico,252,"[t5_408ts1, t5_2vbdi, t5_2lxxle, t5_6kmxa7, t5_2roc3, t5_2qk9t, t5_3gx1d, t5_vbgz2, t5_5xioyj, t5_2nv19k, t5_2h0cuw, t5_3913r, t5_2glnx3, t5_6fsmt3, t5_2toht, t5_3f0mg, t5_nuenm, t5_g38jm, t5_26div5, t5_4z9yto, t5_2n0rjr, t5_2r6hv, t5_5...",244,"[t5_24z6p6, t5_2v8rv, t5_2x1po, t5_wfesz, t5_2csckx, t5_2jh5ny, t5_2wzac, t5_31291, t5_33h9w, t5_3brig, t5_2vbdi, t5_2sdzc, t5_255vuq, t5_2tt9j, t5_bwpub, t5_2bzypc, t5_nz4bq, t5_2re0i3, t5_11oz5w, t5_36llrb, t5_2ujoy, t5_3la4d, t5_2kyy...",185,"[t5_4b8yrf, t5_x33ns, t5_3pf0f, t5_4qyhtu, t5_2wm8o, t5_1157ax, t5_2sedm, t5_35inf]",8,"[t5_2sedm, t5_4qyhtu, t5_2wm8o, t5_1157ax, t5_x33ns]",5,"[t5_2sedm, t5_4qyhtu, t5_2wm8o, t5_1157ax, t5_x33ns]",5



df_fpr
(244, 15)
Label cols to drop: 0
Topic cols to drop: 0
Nested cols to drop: 0


Unnamed: 0,subreddit_id_seed,subreddit_name_seed,cluster_label,cluster_label_k,pt,qa_pt,run_id,geo_country_code,country_name,qa_table,geo_relevance_table,cluster_label_int,subs_to_rec_in_cluster_count,list_cluster_subreddit_names,list_cluster_subreddit_ids
121,t5_2x1po,cumbia,0014-0014-0017,k_0060_label,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,17,5,"indiemx, musicaenespanol, latinpopheads, cuartetodenos, hypetracks","t5_24z6p6, t5_2v8rv, t5_wfesz, t5_2csckx, t5_2jh5ny"
32,t5_2jh5ny,hypetracks,0014-0014-0017,k_0060_label,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,17,5,"indiemx, musicaenespanol, cumbia, latinpopheads, cuartetodenos","t5_24z6p6, t5_2v8rv, t5_2x1po, t5_wfesz, t5_2csckx"
22,t5_2csckx,cuartetodenos,0014-0014-0017,k_0060_label,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,17,5,"indiemx, musicaenespanol, cumbia, latinpopheads, hypetracks","t5_24z6p6, t5_2v8rv, t5_2x1po, t5_wfesz, t5_2jh5ny"
240,t5_wfesz,latinpopheads,0014-0014-0017,k_0060_label,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,17,5,"indiemx, musicaenespanol, cumbia, cuartetodenos, hypetracks","t5_24z6p6, t5_2v8rv, t5_2x1po, t5_2csckx, t5_2jh5ny"
110,t5_2v8rv,musicaenespanol,0014-0014-0017,k_0060_label,2022-07-24,2022-07-24,2022-07-26_030217,MX,Mexico,reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags,reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725,17,5,"indiemx, cumbia, latinpopheads, cuartetodenos, hypetracks","t5_24z6p6, t5_2x1po, t5_wfesz, t5_2csckx, t5_2jh5ny"



dict_fpr
  dict_keys(['MX'])
CPU times: user 133 ms, sys: 6.3 ms, total: 140 ms
Wall time: 150 ms


## Get columns in df_dynamic that are not in df_target

Which of these are worth saving?

Looks like we have 3 types:
- k labels
- primary topic
- nested: labels & primary topic
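
The three column types above can be told apart by name alone. A small sketch, assuming every such column follows the `k_<NNNN>_<suffix>` pattern with a four-digit `k` as in the printed samples (the helper name is hypothetical):

```python
import re
from collections import Counter

# Column names of the form k_<NNNN>_<suffix>; the suffixes are taken from the
# samples printed in this notebook.
K_COL = re.compile(r"^k_\d{4}_(?P<suffix>.+)$")


def classify_k_column(name: str) -> str:
    """Return 'nested', 'primary_topic', 'label', or 'other' for a column name."""
    match = K_COL.match(name)
    if match is None:
        return "other"
    suffix = match.group("suffix")
    if suffix.endswith("_nested"):
        return "nested"
    if suffix.endswith("primary_topic"):
        return "primary_topic"
    if suffix == "label":
        return "label"
    return "other"


cols = [
    "k_0050_label",
    "k_0050_majority_primary_topic",
    "k_0050_label_nested",
    "k_4000_topic_mix_nested",
    "subreddit_name",
]
print(Counter(classify_k_column(c) for c in cols))
```

Checking `_nested` first matters because `k_0050_label_nested` contains `label` but belongs to the nested group.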

In [None]:
l_cols_labels_drop = [c for c in d_mx['df_labels_target_dynamic'].columns if all([c.startswith('k_'), c.endswith('_label')])]
l_cols_topic_drop =  [c for c in d_mx['df_labels_target_dynamic'].columns if all([c.startswith('k_'), c.endswith('_primary_topic')])]
n_label_cols = len(l_cols_labels_drop)
n_topic_cols = len(l_cols_topic_drop)
print(f"Label cols to drop: {n_label_cols}")
print(f"Topic cols to drop: {n_topic_cols}")
if n_label_cols > 0:
    print(f"  Label cols sample: {l_cols_labels_drop[:5]}")
if n_topic_cols > 0:
    print(f"  Topic cols sample: {l_cols_topic_drop[:5]}")
l_all_cols_to_drop = l_cols_labels_drop + l_cols_topic_drop
print(len(l_all_cols_to_drop))

set_cols_dynamic = set(d_mx['df_labels_target_dynamic'].drop(l_all_cols_to_drop, axis=1))
set_cols_label = set(d_mx['df_labels_target'])

print(len(set_cols_dynamic - set_cols_label))
list(set_cols_dynamic - set_cols_label)[:10]

Label cols to drop: 66
Topic cols to drop: 66
  Label cols sample: ['k_0050_label', 'k_0052_label', 'k_0060_label', 'k_0066_label', 'k_0070_label']
  Topic cols sample: ['k_0050_majority_primary_topic', 'k_0052_majority_primary_topic', 'k_0060_majority_primary_topic', 'k_0066_majority_primary_topic', 'k_0070_majority_primary_topic']
132
139


['k_4000_topic_mix_nested',
 'k_4070_topic_mix_nested',
 'k_1500_label_nested',
 'k_4500_label_nested',
 'k_0900_topic_mix_nested',
 'k_0500_label_nested',
 'k_0200_label_nested',
 'k_0090_label_nested',
 'k_3760_topic_mix_nested',
 'k_4070_label_nested']

## Check clusters for some big subreddits

Just curious to see how different they are from the previous model.
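
One thing to keep in mind when searching the comma-separated name lists: a substring match on `mexico` also matches `askmexico` and `memexico`. Below is a sketch of an exact-name match, assuming names are separated by `, ` as in the previews. The helper name is hypothetical.

```python
from typing import List


def contains_exact_name(names_list: str, name: str, sep: str = ", ") -> bool:
    """True only if `name` is a whole entry of the separated list, not a substring."""
    entries: List[str] = names_list.split(sep)
    return name in entries


seed_names = "mexico, tijuana, cdmx, askmexico"

print("mexico" in "askmexico, memexico")                      # substring match: True
print(contains_exact_name("askmexico, memexico", "mexico"))   # exact match: False
print(contains_exact_name(seed_names, "askmexico"))           # exact match: True
```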

In [None]:
l_sub_names_to_check_ = [
    # 'mexico', 
    'askmexico', 'mejico', 'memexico'
]
for s_ in l_sub_names_to_check_:
    print(s_)
    display(
        d_mx['df_summary_cluster']
        [d_mx['df_summary_cluster']['seed_subreddit_names_list'].str.contains(s_, regex=False)]
        .iloc[:, :17]
    )

askmexico


Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds
35,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,0038-0039-0046-0050-0053-0057-0059-0068-0070-0074-0093-0098-0114-0130-0150-0200-0228-0305-0375-0451-0453-0529-0598-0625-0664-0744-0748-0937-1083-1125-1315-1447-1514-1530-1702-1819-1906-2098-2153-2296-2381-2499-2627-2707-2907-2915-3016-3...,k_8000_label,6270,Place,19,"mexico, tijuana, monterrey, guadalajara, oaxaca, juarez, veracruz, ensenada, bajacalifornia, mexicali, puebla, sonora, guanajuato, mexicocity, yucatan, michoacan, puertoescondido, cdmx, askmexico","t5_2qhv7, t5_2qk9t, t5_2qm06, t5_2qp8n, t5_2qz0w, t5_2qzle, t5_2rja3, t5_2s4jm, t5_2s521, t5_2sbh1, t5_2sexv, t5_2sn91, t5_2tqn4, t5_2tw1p, t5_2tw8f, t5_2tw8p, t5_39wjm, t5_4sbz8m, t5_4ywzju",13,"tijuana, guadalajara, oaxaca, juarez, ensenada, bajacalifornia, mexicali, guanajuato, mexicocity, yucatan, michoacan, cdmx, askmexico","t5_2qk9t, t5_2qp8n, t5_2qz0w, t5_2qzle, t5_2s4jm, t5_2s521, t5_2sbh1, t5_2tqn4, t5_2tw1p, t5_2tw8f, t5_2tw8p, t5_4sbz8m, t5_4ywzju",False,False


mejico


Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds
5,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,0017-0017-0021-0022-0023-0025-0026-0029-0030-0031-0038-0040-0047-0053-0058-0080-0090-0122-0149-0181-0182-0210-0233-0247-0256-0288-0289,k_1004_label,289,Internet Culture and Memes | Funny/Humor,10,"dedreviil, leyendaslegendarias, caliebre, shitpostesp, moaigreddit, serpias, lospotiers, mejico, mujicomajiconius, mexicanmemes","t5_1nq4ah, t5_254vgb, t5_2crwdl, t5_2e54fb, t5_2rks8q, t5_4nptcc, t5_6fsmt3, t5_2r8eh, t5_691jaj, t5_w864s",7,"dedreviil, leyendaslegendarias, caliebre, shitpostesp, moaigreddit, serpias, mexicanmemes","t5_1nq4ah, t5_254vgb, t5_2crwdl, t5_2e54fb, t5_2rks8q, t5_4nptcc, t5_w864s",False,False


memexico


Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds
2,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,0016-0016-0020-0021-0022-0024-0025-0028-0029-0030-0037-0039-0045,k_0150_label,45,Internet Culture and Memes,8,"unexpectedsimpsons, wishwtf, kalaredditftpch, maau, cafe_infinito, memexico, mujico, copypasta_es","t5_3i0cj, t5_bwpub, t5_2re0i3, t5_11oz5w, t5_36llrb, t5_2ujoy, t5_3la4d, t5_3o47b",6,"wishwtf, kalaredditftpch, maau, cafe_infinito, memexico, mujico","t5_bwpub, t5_2re0i3, t5_11oz5w, t5_36llrb, t5_2ujoy, t5_3la4d",False,False


## Check that outputs have cols to ID fpr provenance

Columns such as `pt`, `qa_pt`, and `run_id` make it easier to trace which partition and run produced a given FPR.
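
A check like this can also report which columns are missing instead of relying on an exception. A minimal sketch in plain Python, using invented column listings (the helper name and the toy outputs are hypothetical):

```python
from typing import Dict, Iterable, List

REQUIRED_ID_COLS = ("pt", "qa_pt", "run_id")


def missing_id_cols(
    columns: Iterable[str], required: Iterable[str] = REQUIRED_ID_COLS
) -> List[str]:
    """Return the required provenance columns that are absent, in required order."""
    present = set(columns)
    return [col for col in required if col not in present]


# Toy column listings (invented) keyed by output name.
outputs: Dict[str, List[str]] = {
    "df_fpr": ["subreddit_id_seed", "pt", "qa_pt", "run_id"],
    "df_incomplete": ["subreddit_id", "pt"],
}

for name, cols in outputs.items():
    print(f"{name}: missing {missing_id_cols(cols)}")
```

With a real DataFrame, the same helper could be called as `missing_id_cols(df.columns)`.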


In [None]:
l_cols_id_to_check_ = [
    'pt', 'qa_pt', 'run_id'
]
for name_, df_ in d_mx.items():
    print(f"\n{name_}")
    try:
        print(df_.shape)
        display(df_[l_cols_id_to_check_].iloc[:5, :17])
    except Exception as e:
        display(df_.keys())


df_labels_target
(261, 159)


Unnamed: 0,pt,qa_pt,run_id
0,2022-07-24,2022-07-24,2022-07-25_155327
1,2022-07-24,2022-07-24,2022-07-25_155327
2,2022-07-24,2022-07-24,2022-07-25_155327
3,2022-07-24,2022-07-24,2022-07-25_155327
4,2022-07-24,2022-07-24,2022-07-25_155327



df_labels_target_dynamic
(261, 298)


Unnamed: 0,pt,qa_pt,run_id
0,2022-07-24,2022-07-24,2022-07-25_155327
1,2022-07-24,2022-07-24,2022-07-25_155327
2,2022-07-24,2022-07-24,2022-07-25_155327
3,2022-07-24,2022-07-24,2022-07-25_155327
4,2022-07-24,2022-07-24,2022-07-25_155327



df_summary_cluster
(46, 23)


Unnamed: 0,pt,qa_pt,run_id
0,2022-07-24,2022-07-24,2022-07-25_155327
1,2022-07-24,2022-07-24,2022-07-25_155327
2,2022-07-24,2022-07-24,2022-07-25_155327
3,2022-07-24,2022-07-24,2022-07-25_155327
4,2022-07-24,2022-07-24,2022-07-25_155327



d_fpr_qa


dict_keys(['seed_subreddit_ids', 'recommend_subreddit_ids'])


df_fpr
(251, 13)


Unnamed: 0,pt,qa_pt,run_id
24,2022-07-24,2022-07-24,2022-07-25_155327
162,2022-07-24,2022-07-24,2022-07-25_155327
71,2022-07-24,2022-07-24,2022-07-25_155327
114,2022-07-24,2022-07-24,2022-07-25_155327
115,2022-07-24,2022-07-24,2022-07-25_155327



dict_fpr


dict_keys(['MX'])

## Check subs & clusters that should be reviewed for missing topic

Note: some subs are flagged as recommend + missing topic because the predicted topic is not sensitive.

In [None]:
mask_subs_review_ = (
    d_mx['df_labels_target_dynamic']['combined_filter'] == 'review'
)
mask_subs_review_missing_topic_ = (
    d_mx['df_labels_target_dynamic']['combined_filter_reason'] == 'missing_topic'
)
print(f"{mask_subs_review_.sum()} <- review")
print(f"{mask_subs_review_missing_topic_.sum()} <- missing topic")

# Check subs that are missing a topic AND are flagged for review
df_subs_review_seeds = (
    d_mx['df_labels_target_dynamic']
    [mask_subs_review_missing_topic_ & mask_subs_review_]
    [['cluster_label', 'cluster_label_k', 'cluster_label_int', 'subreddit_name', 'users_l7', 'combined_filter_detail',]]
    .sort_values(by=['cluster_label', 'users_l7'], ascending=[True, False])
)
df_subs_review_seeds.iloc[:7, :]

59 <- review
69 <- missing topic


Unnamed: 0,cluster_label,cluster_label_k,cluster_label_int,subreddit_name,users_l7,combined_filter_detail
222,0016-0016-0020-0021-0022-0024-0025-0028-0029-0030-0037-0039-0045,k_0150_label,45,copypasta_es,42872,review-missing_topic
143,0016-0016-0020-0021-0022-0024-0025-0028-0029-0030-0037-0039-0045,k_0150_label,45,unexpectedsimpsons,970,review-missing_topic
240,0017,k_0050_label,17,elcalifato,336,review-missing_topic
236,0017-0017-0021-0022-0023-0025-0026-0029-0030-0031-0038-0040-0047-0053-0058-0080-0090-0122-0149-0181-0182-0209-0232-0246-0255-0287-0288-0355-0417-0433-0500-0553-0579-0584-0645-0687-0714-0798-0819-0871-0899-0945-1000-1042-1121-1122-1160-1...,k_8000_label,2446,memesespanol,365,review-missing_topic
238,0017-0017-0021-0022-0023-0025-0026-0029-0030-0031-0038-0040-0047-0053-0058-0080-0090-0122-0149-0181-0182-0210-0233-0247-0256-0288-0289,k_1004_label,289,mujicomajiconius,2853,review-missing_topic
168,0017-0017-0021-0022-0023-0025-0026-0029-0030-0031-0038-0040-0047-0053-0058-0080-0090-0122-0149-0181-0182-0210-0233-0247-0256-0288-0289,k_1004_label,289,lospotiers,155,review-missing_topic
169,0017-0017-0021-0022-0023-0025-0026-0029-0030-0031-0038-0040-0047-0053-0058-0080-0090-0122-0149-0181-0182-0210-0233-0247-0256-0288-0289-0358-0420-0436-0503-0556-0582-0587-0648-0692-0719-0803-0824-0876-0904-0950-1006-1050-1129-1130-1169-1...,k_8000_label,2464,rubius,5077,review-missing_topic
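
The two counts differ because some subs are missing a topic without being flagged for review. A cross-tab of the two columns is one way to see that overlap at a glance. This is a minimal sketch on toy data: the column names come from the cell above, but the rows and the `'other'` reason value are placeholders.

```python
import pandas as pd

# Toy data: column names match the notebook, values are placeholders
df_toy = pd.DataFrame({
    'subreddit_name': ['a', 'b', 'c', 'd'],
    'combined_filter': ['review', 'recommend', 'review', 'recommend'],
    'combined_filter_reason': ['missing_topic', 'missing_topic', 'other', 'other'],
})

# Rows = filter decision, columns = reason; each cell counts subreddits
ct = pd.crosstab(df_toy['combined_filter'], df_toy['combined_filter_reason'])
print(ct)
```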


In [None]:
# subs with most users
# df_subs_review_seeds.sort_values(by='users_l7', ascending=False).head(10)

In [None]:
# df_subs_review_seeds.iloc[10:22, :]

In [None]:
# df_subs_review_seeds.iloc[-10:, :]

## Check clusters with more seeds than recommendations

These should include subs with `allow_discovery=f` & `type=private`

In [None]:
mask_more_seeds_than_recs_ = d_mx['df_summary_cluster']['seed_subreddit_count'] > d_mx['df_summary_cluster']['recommend_subreddit_count']
d_mx['df_summary_cluster'][mask_more_seeds_than_recs_].drop(['cluster_label'], axis=1).iloc[:9, :]

Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds,missingTopic_subreddit_count,missingTopic_subreddit_names_list,discoveryF_subreddit_count,discoveryF_subreddit_names_list,private_subreddit_count,private_subreddit_names_list
0,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_0060_label,17,Music,8,"indiemx, musicaenespanol, corridos, cuartetodenos, kenyanart, withintemptation, thewarning, interpol","t5_24z6p6, t5_2v8rv, t5_2vneg, t5_2csckx, t5_3gzgzl, t5_2wzac, t5_2vbdi, t5_2rwkm",7,"indiemx, musicaenespanol, cuartetodenos, kenyanart, withintemptation, thewarning, interpol","t5_24z6p6, t5_2v8rv, t5_2csckx, t5_3gzgzl, t5_2wzac, t5_2vbdi, t5_2rwkm",False,False,0,,0,,0,
2,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_0150_label,45,Internet Culture and Memes,8,"unexpectedsimpsons, wishwtf, kalaredditftpch, maau, cafe_infinito, memexico, mujico, copypasta_es","t5_3i0cj, t5_bwpub, t5_2re0i3, t5_11oz5w, t5_36llrb, t5_2ujoy, t5_3la4d, t5_3o47b",6,"wishwtf, kalaredditftpch, maau, cafe_infinito, memexico, mujico","t5_bwpub, t5_2re0i3, t5_11oz5w, t5_36llrb, t5_2ujoy, t5_3la4d",False,False,2,"unexpectedsimpsons, copypasta_es",0,,0,
3,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_0050_label,17,Internet Culture and Memes,6,"ibai, spreen, tutisvalentine, nimuvt, redditmint, elcalifato","t5_2kyy11, t5_44gund, t5_4oqex2, t5_4ovvzs, t5_4vjzqe, t5_5ieqvq",4,"ibai, spreen, nimuvt, redditmint","t5_2kyy11, t5_44gund, t5_4ovvzs, t5_4vjzqe",False,False,1,elcalifato,1,tutisvalentine,0,
4,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_8000_label,2446,Internet Culture and Memes | Funny/Humor,9,"memesesp, ilutv, dylanteroyt, riczer, guibelreviews, radiopirata, clownvt, kendomurft, memesespanol","t5_10wycq, t5_2hbnwo, t5_2q2ypv, t5_2rkelh, t5_3s1qy6, t5_4e3ldm, t5_5umaai, t5_6chcvt, t5_hc3xv",8,"memesesp, ilutv, dylanteroyt, riczer, guibelreviews, radiopirata, clownvt, kendomurft","t5_10wycq, t5_2hbnwo, t5_2q2ypv, t5_2rkelh, t5_3s1qy6, t5_4e3ldm, t5_5umaai, t5_6chcvt",False,False,1,memesespanol,0,,0,
5,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_1004_label,289,Internet Culture and Memes | Funny/Humor,10,"dedreviil, leyendaslegendarias, caliebre, shitpostesp, moaigreddit, serpias, lospotiers, mejico, mujicomajiconius, mexicanmemes","t5_1nq4ah, t5_254vgb, t5_2crwdl, t5_2e54fb, t5_2rks8q, t5_4nptcc, t5_6fsmt3, t5_2r8eh, t5_691jaj, t5_w864s",7,"dedreviil, leyendaslegendarias, caliebre, shitpostesp, moaigreddit, serpias, mexicanmemes","t5_1nq4ah, t5_254vgb, t5_2crwdl, t5_2e54fb, t5_2rks8q, t5_4nptcc, t5_w864s",False,False,2,"lospotiers, mujicomajiconius",0,,0,
6,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_8000_label,2464,Internet Culture and Memes | Funny/Humor,13,"memesenespanol, ubius, doomergang, retheys, skyshocksub, elvirgocascarudo, aradiroff, darkraiposting, iamcristinini, emikuki, locochon, rubius, auronplay","t5_1009a3, t5_24tlbv, t5_2gwio9, t5_384jxm, t5_3a4xqa, t5_4dd0a3, t5_4t6fim, t5_5cy8ez, t5_5loj4v, t5_5spn0e, t5_5sxuai, t5_jgwhn, t5_xk6dj",10,"memesenespanol, ubius, doomergang, retheys, skyshocksub, elvirgocascarudo, aradiroff, iamcristinini, emikuki, auronplay","t5_1009a3, t5_24tlbv, t5_2gwio9, t5_384jxm, t5_3a4xqa, t5_4dd0a3, t5_4t6fim, t5_5loj4v, t5_5spn0e, t5_xk6dj",False,False,3,"darkraiposting, locochon, rubius",0,,0,
7,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_7500_label,2306,Internet Culture and Memes | Funny/Humor,7,"vicio, renezz, webadas, beelcitosmemes, latesitoo, mujicocity, anzutops777oficial","t5_2nv19k, t5_2oxnbx, t5_366df1, t5_3qq2qy, t5_3wam26, t5_5ie0sc, t5_69coi0",6,"vicio, renezz, webadas, beelcitosmemes, latesitoo, mujicocity","t5_2nv19k, t5_2oxnbx, t5_366df1, t5_3qq2qy, t5_3wam26, t5_5ie0sc",False,False,1,anzutops777oficial,0,,0,
8,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_0300_label,91,Internet Culture and Memes,7,"dreamcoreimages, rangugamer, folagoro, stickersarehard, okbuddywawanakwa, memesbuenaonda, badaboom","t5_3pifpp, t5_26div5, t5_2uiajk, t5_4iasdt, t5_5v2kpo, t5_4z9yto, t5_58nf2x",6,"dreamcoreimages, rangugamer, stickersarehard, okbuddywawanakwa, memesbuenaonda, badaboom","t5_3pifpp, t5_26div5, t5_4iasdt, t5_5v2kpo, t5_4z9yto, t5_58nf2x",False,False,0,,1,folagoro,0,
9,2022-07-24,2022-07-24,2022-07-25_155327,MX,Mexico,k_0050_label,18,Podcasts and Streamers,1,planetaludico,t5_3f0mg,0,,,True,False,1,planetaludico,0,,0,
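
To confirm that the extra seeds are explained by the tracked reasons, we can compare the seed/recommend gap with the sum of the reason counts. The sketch below copies the counts from the first four rows above into a toy frame. It assumes the three reason columns don't overlap, which may not hold in general.

```python
import pandas as pd

# Counts copied from the first four rows of the output above
df_toy = pd.DataFrame({
    'cluster_label_k': ['k_0060_label', 'k_0150_label', 'k_0050_label', 'k_8000_label'],
    'cluster_label_int': [17, 45, 17, 2446],
    'seed_subreddit_count': [8, 8, 6, 9],
    'recommend_subreddit_count': [7, 6, 4, 8],
    'missingTopic_subreddit_count': [0, 2, 1, 1],
    'discoveryF_subreddit_count': [0, 0, 1, 0],
    'private_subreddit_count': [0, 0, 0, 0],
})
l_cols_reasons = [
    'missingTopic_subreddit_count',
    'discoveryF_subreddit_count',
    'private_subreddit_count',
]

df_toy['seed_minus_rec'] = (
    df_toy['seed_subreddit_count'] - df_toy['recommend_subreddit_count']
)
# Assumption: a sub is counted under at most one reason
df_toy['explained'] = df_toy[l_cols_reasons].sum(axis=1)
df_toy['unexplained'] = df_toy['seed_minus_rec'] - df_toy['explained']

# Clusters where the gap is not fully covered by the tracked reasons
print(df_toy[df_toy['unexplained'] != 0])
```

Rows with a non-zero `unexplained` value are seeds dropped for a reason that these three columns don't capture.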


## Check format output needed for FPR format

Should we use `to_json` or `to_dict`?

In [None]:
pd_to_json_fmts = [
    'columns',
    'split',
    'records',
    'index',
    'values',
    'table',
]
for fmt_ in pd_to_json_fmts:
    print(f"\n{fmt_}")
    print(
        d_mx['df_summary_cluster'][['cluster_topic_mix', 'seed_subreddit_names_list']]
        .head(3)
        .to_json(orient=fmt_)
    )


columns

split

records

index

values

table


In [None]:
pd_to_json_fmts = [
    'split',
    'records',
    'index',
    'table',
]
for fmt_ in pd_to_json_fmts:
    print(f"\n{fmt_}")
    print(
        d_mx['df_summary_cluster'][['cluster_topic_mix', 'seed_subreddit_names_list']]
        .set_index('cluster_topic_mix')
        .head(3)
        ['seed_subreddit_names_list']
        .to_json(orient=fmt_)
    )


split

records

index

table


In [None]:
pd_to_dict_fmts = [
    'dict',
    'list',
    'series',
    'split',
    'records',
    'index',
]
for fmt_ in pd_to_dict_fmts:
    print(f"\n{fmt_}")
    try:
        print(
            d_mx['df_summary_cluster'][['cluster_topic_mix', 'seed_subreddit_names_list']]
            .set_index('cluster_topic_mix')
            .head(3)
            .to_dict(orient=fmt_)
            ['seed_subreddit_names_list']
        )
    except Exception:
        # Some orients don't return a dict keyed by column name, so the lookup fails; skip them
        pass


dict

list

series
cluster_topic_mix
Animals and Pets                                                                                                packadaykitties
Internet Culture and Memes    unexpectedsimpsons, wishwtf, kalaredditftpch, maau, cafe_infinito, memexico, mujico, copypasta_es
Name: seed_subreddit_names_list, dtype: object

split

records

index
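
One way to frame the `to_json` vs `to_dict` question: `to_dict` returns Python objects, while `to_json` returns a string that has to be parsed again. The sketch below uses a toy Series (the values are placeholders) to show that, for a Series with a unique index, the two can carry the same mapping.

```python
import json

import pandas as pd

# Toy data; the column names match the notebook, the rows are placeholders
df_toy = pd.DataFrame({
    'cluster_topic_mix': ['Music', 'Gaming'],
    'seed_subreddit_names_list': ['indiemx, corridos', 'minecraftespanol'],
})
s_toy = df_toy.set_index('cluster_topic_mix')['seed_subreddit_names_list']

# to_dict -> a Python dict of {index: value}
d_from_dict = s_toy.to_dict()

# to_json -> a JSON string; orient='index' also maps index -> value
str_from_json = s_toy.to_json(orient='index')
d_from_json = json.loads(str_from_json)

print(type(d_from_dict), type(str_from_json))
print(d_from_dict == d_from_json)
```

If the output is going straight to a JSON file, `to_json` saves a step. If it needs more processing in Python first, `to_dict` is likely the more convenient choice.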


## Check orphan clusters

In [None]:
mask_agg_clusters_orphan = d_mx['df_summary_cluster']['orphan_clusters']
d_mx['df_summary_cluster'][mask_agg_clusters_orphan]

Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds,missingTopic_subreddit_count,missingTopic_subreddit_names_list,discoveryF_subreddit_count,discoveryF_subreddit_names_list,private_subreddit_count,private_subreddit_names_list
0,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,13,k_0050_label,13,Music,1,anamanaguchi,t5_2sedm,1,anamanaguchi,t5_2sedm,True,False,0,,0,,0,
21,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,25,k_0050_label,25,Gaming,1,avakinlifela,t5_4qyhtu,1,avakinlifela,t5_4qyhtu,True,False,0,,0,,0,
22,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,26,k_0050_label,26,Gaming,1,minecraftespanol,t5_2wm8o,1,minecraftespanol,t5_2wm8o,True,False,0,,0,,0,
25,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,29,k_0050_label,29,Sports,1,sjsuspartans,t5_1157ax,1,sjsuspartans,t5_1157ax,True,False,0,,0,,0,
46,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,49,k_0050_label,49,Hobbies,1,theprepared,t5_x33ns,1,theprepared,t5_x33ns,True,False,0,,0,,0,


In [None]:
mask_zero_rec = d_mx['df_summary_cluster']['recommend_subreddit_count'].fillna(0) < 1
d_mx['df_summary_cluster'][mask_zero_rec]

Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds,missingTopic_subreddit_count,missingTopic_subreddit_names_list,discoveryF_subreddit_count,discoveryF_subreddit_names_list,private_subreddit_count,private_subreddit_names_list


In [None]:

d_mx['df_summary_cluster'][d_mx['df_summary_cluster']['exclude_recs_from_seeds']]

Unnamed: 0,pt,qa_pt,run_id,geo_country_code,country_name,cluster_label,cluster_label_k,cluster_label_int,cluster_topic_mix,seed_subreddit_count,seed_subreddit_names_list,seed_subreddit_ids_list,recommend_subreddit_count,recommend_subreddit_names_list,recommend_subreddit_ids_list,orphan_clusters,exclude_recs_from_seeds,missingTopic_subreddit_count,missingTopic_subreddit_names_list,discoveryF_subreddit_count,discoveryF_subreddit_names_list,private_subreddit_count,private_subreddit_names_list
39,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,39,k_0050_label,39,Learning and Education,3,"unam, tecdemonterrey, prepa","t5_2uiut, t5_2uu7p, t5_35inf",1,prepa,t5_35inf,False,True,2,"unam, tecdemonterrey",0,,0,
41,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,41,k_0050_label,41,Technology,2,"soportepc, onlyfans","t5_47ph2n, t5_3pf0f",1,onlyfans,t5_3pf0f,False,True,1,soportepc,0,,0,
44,2022-07-24,2022-07-24,2022-07-25_225603,MX,Mexico,47,k_0050_label,47,Marketplace and Deals,2,"cigarpromos, monterreyventas","t5_28xhyi, t5_4b8yrf",1,monterreyventas,t5_4b8yrf,False,True,1,cigarpromos,0,,0,


## Create metadata df for top-level df_fpr_summary

Goal: count the subreddits & outputs without having to UNNEST the FPR output.

In [None]:
l_cols_top_level_summary = [
    'pt',
    'geo_relevance_table',
    'qa_pt',
    'qa_table',
    'run_id',

    'geo_country_code',
    'country_name',
]
df_top_summary_ = (
    d_mx['df_labels_target']
    .groupby(l_cols_top_level_summary, as_index=False)
    .agg(
        **{'relevant_subreddit_id_count': ('subreddit_id', 'count')}    
    )
)
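# Placeholder seed IDs, used only to test storing a list in a single cell
seed_ids_ = ['a', 'b', 'c']

try:
    df_top_summary_.at[0, 'subreddit_seeds'] = seed_ids_
except ValueError:
    # The column needs to exist with `object` dtype before a cell can hold a list
    df_top_summary_['subreddit_seeds'] = np.nan
    df_top_summary_['subreddit_seeds'] = df_top_summary_['subreddit_seeds'].astype('object')
    df_top_summary_.at[0, 'subreddit_seeds'] = seed_ids_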
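
A possible way to avoid the `try`/`except` is to create the column with `object` dtype up front. The sketch below uses a toy one-row frame (`df_toy` is a placeholder for the grouped summary) to show the idea.

```python
import pandas as pd

# Toy one-row summary; a placeholder for the grouped frame above
df_toy = pd.DataFrame({
    'geo_country_code': ['MX'],
    'relevant_subreddit_id_count': [261],
})
seed_ids_ = ['a', 'b', 'c']

# Create the column as `object` first so that a single cell can hold a list
df_toy['subreddit_seeds'] = pd.Series(
    [None] * len(df_toy), index=df_toy.index, dtype='object'
)
df_toy.at[0, 'subreddit_seeds'] = seed_ids_

print(df_toy.dtypes)
print(df_toy.at[0, 'subreddit_seeds'])
```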

## Check seed subs that are missing from df-FPR

Reason: 
<br>Some clusters have multiple seeds (e.g., subs with `allow_discovery=f`), but only 1 sub to recommend. 
<br>In these cases, we need to remove that single "sub to recommend" from the list of seeds because it can't recommend itself.

I already added much better logic in the QA function for this. No need to replicate it here.
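
For reference, the rule itself can be sketched in a few lines. This is only an illustration of the idea and not the QA function: `get_effective_seeds` is a hypothetical name, and the real logic may handle more cases.

```python
def get_effective_seeds(l_seed_ids, l_rec_ids):
    """Drop a seed when it is the only sub available to recommend.

    If a cluster has exactly one recommendable sub, that sub can't be a seed
    because the only thing it could recommend is itself.
    """
    if len(l_rec_ids) == 1:
        return [id_ for id_ in l_seed_ids if id_ != l_rec_ids[0]]
    return list(l_seed_ids)


# IDs from the 'Learning and Education' cluster: 3 seeds, 1 recommendable sub
l_seeds = ['t5_2uiut', 't5_2uu7p', 't5_35inf']
l_recs = ['t5_35inf']
print(get_effective_seeds(l_seeds, l_recs))
# → ['t5_2uiut', 't5_2uu7p']
```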

In [None]:
l_seed_subs_fpr = d_mx['dict_fpr']['MX'].keys()
len(l_seed_subs_fpr)

251

In [None]:
# d_mx['d_fpr_qa']['seed_subreddit_ids']

In [None]:
l_seed_subs_wo_orphans_summary = list()
for l_ in (
    d_mx['df_summary_cluster'][~d_mx['df_summary_cluster']['orphan_clusters']]
    ['seed_subreddit_ids_list']
    .to_list()
):
    # kinda hate that we have to split a string to recreate a list,
    #  but it's the best worst thing here
    if isinstance(l_, str):
        inner_list = l_.split(', ')
        for i_ in inner_list:
            l_seed_subs_wo_orphans_summary.append(i_)
    else:
        for i_ in l_:
            l_seed_subs_wo_orphans_summary.append(i_)
len(l_seed_subs_wo_orphans_summary)

254
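
When every value in the column is a comma-separated string, the loop above can be written more compactly with `.str.split()` and `.explode()`. A minimal sketch on toy data (IDs copied from the summary rows). Note that, unlike the loop, `.str.split()` returns NaN for values that are already lists, so this only works for string-only columns.

```python
import pandas as pd

# Toy summary; IDs copied from the cluster summary output
df_toy = pd.DataFrame({
    'orphan_clusters': [False, False, True],
    'seed_subreddit_ids_list': [
        't5_2uiut, t5_2uu7p, t5_35inf',
        't5_47ph2n, t5_3pf0f',
        't5_2sedm',
    ],
})

l_seed_ids = (
    df_toy.loc[~df_toy['orphan_clusters'], 'seed_subreddit_ids_list']
    .str.split(', ')   # string -> list of IDs
    .explode()         # one row per ID
    .to_list()
)
print(len(l_seed_ids))
```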

In [None]:
# The logic here is flawed -- some of these subs shouldn't be seeds because all the
#  other subs in the cluster are not meant to be recommended
set_missing_subs_ = set(l_seed_subs_wo_orphans_summary) - set(l_seed_subs_fpr)
print(set_missing_subs_)

# for id_ in set_missing_subs_:
#     try:
#         display(d_mx['df_summary_cluster'][d_mx['df_summary_cluster']['seed_subreddit_ids_list'].str.contains(id_)])
#     except KeyError:
#         display(d_mx['df_summary_cluster'][d_mx['df_summary_cluster']['seed_subreddit_ids_list'].astype(str).str.contains(id_)])

{'t5_35inf', 't5_3pf0f', 't5_4b8yrf'}
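
These three IDs match the `recommend_subreddit_ids_list` values of the clusters flagged with `exclude_recs_from_seeds=True`. The sketch below checks that on a toy frame; the first three rows copy the flagged clusters and the last row is an unflagged cluster added for contrast.

```python
import pandas as pd

# First three rows: the clusters flagged with exclude_recs_from_seeds=True
df_toy = pd.DataFrame({
    'exclude_recs_from_seeds': [True, True, True, False],
    'recommend_subreddit_ids_list': [
        't5_35inf',
        't5_3pf0f',
        't5_4b8yrf',
        't5_bwpub, t5_2re0i3',
    ],
})
set_missing_subs_ = {'t5_35inf', 't5_3pf0f', 't5_4b8yrf'}

# Subs we expect to be missing from the FPR seeds: recs from flagged clusters
set_expected = set(
    df_toy.loc[df_toy['exclude_recs_from_seeds'], 'recommend_subreddit_ids_list']
    .str.split(', ')
    .explode()
)
print(set_missing_subs_ == set_expected)
```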


In [None]:
set2_ = set(l_seed_subs_fpr) - set(l_seed_subs_wo_orphans_summary)
for id_ in set2_:
    display(d_mx['df_summary_cluster'][d_mx['df_summary_cluster']['seed_subreddit_ids_list'].str.contains(id_)])
print(set2_)
# d_mx['dict_fpr']['MX']['t5_3f0mg']

set()


In [None]:
# (
#     d_mx['df_labels_target_dynamic']
#     [d_mx['df_labels_target_dynamic']['subreddit_id'].isin(set(l_seed_subs_fpr) - set(l_seed_subs_wo_orphans_summary))]
#     [l_cols_sub_to_check_]
# )