# Purpose

### 2022-07-18
In this notebook we'll run a batch of countries through a new FPR output process.
Instead of saving data to Google Sheets, we'll save:
- FPR outputs to a GCS bucket (JSON)
- FPR summary to BigQuery
- FPR details to BigQuery

We can then use the summary & details in a Mode dashboard for inspection (if needed).

See this dashboard for more information about the model coverage & filters.
https://app.mode.com/reddit/reports/b99c94984018


# Imports & notebook setup

In [3]:
%load_ext autoreload
%autoreload 2

# Register bigquery magic (only needed for laptop/local, not colab)
# %load_ext google.cloud.bigquery

In [4]:
# colab auth for BigQuery, google drive, & google sheets (gspread)
from google.colab import auth, files, drive
from google.auth import default
import sys  # need sys for mounting gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


## Install custom library

### Append google drive path so we can install library from there

In [5]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
import sys


g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')

except Exception:
    # Fall back to the private mount helper if the public call fails
    try:
        drive._mount(g_drive_root, force_remount=True)
        print('   Authenticated & mounted Google Drive')
    except Exception as e:
        print(e)
        # Chain the original error so the root cause stays in the traceback
        raise Exception('You might need to manually mount google drive to colab') from e

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    # Remove an existing entry first so re-running this cell does not add duplicates
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n
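
The remove-then-append pattern in the cell above can be wrapped in a small helper. This is a minimal sketch, not part of `subclu`: the name `append_paths_once` is hypothetical, and the demo works on a copy of `sys.path` so the real import path is left untouched.

```python
import sys
from typing import Iterable, List


def append_paths_once(paths: Iterable[str], target: List[str]) -> List[str]:
    """Move each path to the end of `target`, dropping any earlier copies."""
    for path_ in paths:
        # Remove every existing occurrence, not only the first one
        while path_ in target:
            target.remove(path_)
        target.append(path_)
    return target


# Demo on a copy of sys.path (hypothetical usage)
demo_path = append_paths_once(
    ['/content/drive/MyDrive/Colab Notebooks'],
    list(sys.path),
)
print(demo_path[-1])
```

Passing `sys.path` itself as `target` would modify the import path in place, which is what the original loop does.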


### Install library

In [6]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet


## Regular Imports

In [7]:
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm


# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

## Custom imports

In [8]:
# subclu imports
import subclu
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr,
    get_primary_topic_mix_cols,
)
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.models.reshape_clusters_v050 import (
    get_table_for_optimal_dynamic_cluster_params,
    CreateFPRs,
    get_dynamic_cluster_summary,
    get_geo_relevant_subreddits_and_cluster_labels,
    get_fpr_cluster_per_row_summary,
    reshape_df_1_cluster_per_row,
)

setup_logging()
notebook_display_config()
print_lib_versions([pd, np])

python		v 3.7.13
===
pandas		v: 1.3.5
numpy		v: 1.21.6


# Checklist to re-run FPRs

- Update list of countries to run
- Update path to save outputs (in GCS)


With this new process we should only need a list of country names to get an FPR output. Everything else should be automated as long as we load from the default config.
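
As a rough illustration of "only a list of country names", the sketch below swaps the country list in a config dictionary while leaving every other default untouched. It assumes the config is a plain `dict` with a `target_countries` key, as in the configs loaded later in this notebook; the helper name `with_target_countries` is hypothetical and is not a `subclu` function.

```python
from copy import deepcopy
from typing import Any, Dict, List


def with_target_countries(base_config: Dict[str, Any], countries: List[str]) -> Dict[str, Any]:
    """Return a copy of `base_config` with a new list of country codes."""
    cfg = deepcopy(base_config)  # leave the defaults untouched
    # Normalize codes so ' ie ' and 'IE' are treated the same
    cfg['target_countries'] = [c.strip().upper() for c in countries]
    return cfg


# Hypothetical defaults; only the country list changes
base = {'partition_dt': '2022-07-25', 'target_countries': ['AU']}
new_cfg = with_target_countries(base, ['ie', ' de '])
print(new_cfg['target_countries'])
```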

## Load non-English and English country configurations

In [11]:
cfg_fpr_not_en = LoadHydraConfig(
    config_name='fpr_v050-2022-07-26-non_english.yaml',
    config_path="../config",
    # overrides=[
    #     f"partition_dt=2022-07-24",
    # ],
)

print([k for k in cfg_fpr_not_en.config_dict.keys()])
cfg_fpr_not_en.config_dict

['description', 'output_bucket', 'gcs_output_path', 'cluster_labels_table', 'partition_dt', 'geo_relevance_table', 'geo_min_users_percent_by_subreddit_l28', 'geo_min_country_standardized_relevance', 'qa_table', 'qa_pt', 'target_countries']


{'cluster_labels_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_full',
 'description': "Base config to test FPR creation. IE (Ireland) is here because it's a small country",
 'gcs_output_path': 'i18n_topic_model_batch/fpr/runs',
 'geo_min_country_standardized_relevance': 2.4,
 'geo_min_users_percent_by_subreddit_l28': 0.14,
 'geo_relevance_table': 'reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725',
 'output_bucket': 'i18n-subreddit-clustering',
 'partition_dt': '2022-07-25',
 'qa_pt': '2022-07-25',
 'qa_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags',
 'target_countries': ['IN',
  'IE',
  'DE',
  'AT',
  'CH',
  'PT',
  'BR',
  'FR',
  'IT',
  'MX',
  'ES',
  'AR',
  'CO',
  'CR',
  'PA',
  'SE',
  'RO',
  'NL',
  'GR',
  'BE',
  'PL',
  'TR',
  'SA',
  'PH',
  'GT',
  'CL',
  'FI']}
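
The `output_bucket` and `gcs_output_path` values above indicate where run outputs go in GCS. As a rough sketch only, the helper below joins them into a `gs://` URI. The function `build_gcs_uri` and the per-country sub-folder are hypothetical; the actual folder layout used by `CreateFPRs` is not shown in this notebook.

```python
def build_gcs_uri(output_bucket: str, gcs_output_path: str, *parts: str) -> str:
    """Join a bucket, a base path, and optional sub-paths into a gs:// URI."""
    segments = [gcs_output_path, *parts]
    # Strip stray slashes and skip empty segments so parts join with one '/'
    cleaned = [s.strip('/') for s in segments if s.strip('/')]
    return f"gs://{output_bucket}/" + "/".join(cleaned)


# Bucket and base path copied from the config above; 'IE' is an illustrative sub-folder
example_uri = build_gcs_uri(
    'i18n-subreddit-clustering',
    'i18n_topic_model_batch/fpr/runs',
    'IE',
)
print(example_uri)
```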

In [12]:
cfg_fpr_en = LoadHydraConfig(
    config_name='fpr_v050-2022-07-26-english_countries.yaml',
    config_path="../config",
    # overrides=[
    #     f"target_countries={l_target_countries}",
    #     f"partition_dt=2022-07-24",
    # ],
)

print([k for k in cfg_fpr_en.config_dict.keys()])
cfg_fpr_en.config_dict

['description', 'target_countries', 'output_bucket', 'gcs_output_path', 'cluster_labels_table', 'partition_dt', 'geo_relevance_table', 'geo_min_users_percent_by_subreddit_l28', 'geo_min_country_standardized_relevance', 'qa_table', 'qa_pt']


{'cluster_labels_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_full',
 'description': 'Base config to test FPR creation',
 'gcs_output_path': 'i18n_topic_model_batch/fpr/runs',
 'geo_min_country_standardized_relevance': 3.0,
 'geo_min_users_percent_by_subreddit_l28': 0.15,
 'geo_relevance_table': 'reddit-employee-datasets.david_bermejo.subclu_subreddit_relevance_beta_20220725',
 'output_bucket': 'i18n-subreddit-clustering',
 'partition_dt': '2022-07-25',
 'qa_pt': '2022-07-25',
 'qa_table': 'reddit-employee-datasets.david_bermejo.subclu_v0050_subreddit_clusters_c_qa_flags',
 'target_countries': ['AU', 'GB', 'CA']}
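
The two configs share most keys but use different geo-relevance thresholds. A quick way to see what differs is to compare the dictionaries key by key. The sketch below uses plain dicts holding a subset of the values printed above; `diff_configs` is a hypothetical helper, not part of `subclu`.

```python
from typing import Any, Dict, Tuple


def diff_configs(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Subset of the values printed by the two config cells above
cfg_non_en = {
    'geo_min_country_standardized_relevance': 2.4,
    'geo_min_users_percent_by_subreddit_l28': 0.14,
    'partition_dt': '2022-07-25',
}
cfg_en = {
    'geo_min_country_standardized_relevance': 3.0,
    'geo_min_users_percent_by_subreddit_l28': 0.15,
    'partition_dt': '2022-07-25',
}

config_diff = diff_configs(cfg_non_en, cfg_en)
print(config_diff)
```

With the real objects, the same call could be made on `cfg_fpr_not_en.config_dict` and `cfg_fpr_en.config_dict`, which would also report the differing `description` and `target_countries` entries.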

# Run `create_fprs()` method

This method should do everything needed to create FPRs.

### Mainly English-speaking countries

In [None]:
%%time
fprs_en = CreateFPRs(
    **cfg_fpr_en.config_dict
)

# run all countries
fprs_en.create_fprs()

  0%|          | 0/3 [00:00<?, ?it/s]
17:30:22 | INFO | "== Country: AU =="
17:30:22 | INFO | "Getting geo-relevant subreddits in model for AU..."
17:30:31 | INFO | " (607, 160)  <- df_shape"
17:30:31 | INFO | " (606, 161) <- Shape AFTER dropping subreddits with covid in title"
17:30:31 | INFO | "Finding optimal N (target # of subs per cluster)..."

  0%|          | 0/5 [00:00<?, ?it/s]
 20%|██        | 1/5 [00:03<00:14,  3.61s/it]
 40%|████      | 2/5 [00:07<00:10,  3.65s/it]
 60%|██████    | 3/5 [00:10<00:07,  3.66s/it]
 80%|████████  | 4/5 [00:14<00:03,  3.67s/it]
100%|██████████| 5/5 [00:18<00:00,  3.69s/it]
17:30:50 | INFO | "  6 <-- Optimal N"
17:30:50 | INFO | "Assigning clusters based on optimal N..."
17:30:54 | INFO | "Getting QA and summary at cluster_level..."
17:30:54 | INFO | "(83, 23)  <- df.shape full summary"
17:30:54 | INFO | "Adding metadata to df_top_level_summary..."
17:30:54 | INFO | "Creating FPR output..."
17:30:54 | INFO | "  (559, 15) <- df_fpr.s

CPU times: user 2min 50s, sys: 2.37 s, total: 2min 53s
Wall time: 3min 36s





### Mainly non-English-speaking countries

In [13]:
%%time
fprs_non_en = CreateFPRs(
    **cfg_fpr_not_en.config_dict
)

# run all countries
fprs_non_en.create_fprs()

  0%|          | 0/27 [00:00<?, ?it/s]
23:34:58 | INFO | "== Country: IN =="
23:34:58 | INFO | "Getting geo-relevant subreddits in model for IN..."
23:35:02 | INFO | " (798, 160)  <- df_shape"
23:35:02 | INFO | " (798, 161) <- Shape AFTER dropping subreddits with covid in title"
23:35:02 | INFO | "Finding optimal N (target # of subs per cluster)..."

  0%|          | 0/5 [00:00<?, ?it/s]
 20%|██        | 1/5 [00:04<00:18,  4.60s/it]
 40%|████      | 2/5 [00:09<00:14,  4.99s/it]
 60%|██████    | 3/5 [00:15<00:10,  5.31s/it]
 80%|████████  | 4/5 [00:20<00:05,  5.23s/it]
100%|██████████| 5/5 [00:25<00:00,  5.16s/it]
23:35:28 | INFO | "  6 <-- Optimal N"
23:35:28 | INFO | "Assigning clusters based on optimal N..."
23:35:34 | INFO | "Getting QA and summary at cluster_level..."
23:35:34 | INFO | "(108, 23)  <- df.shape full summary"
23:35:34 | INFO | "Adding metadata to df_top_level_summary..."
23:35:34 | INFO | "Creating FPR output..."
23:35:34 | INFO | "  (690, 15) <- df_fpr

CPU times: user 7min 46s, sys: 5.28 s, total: 7min 51s
Wall time: 10min 20s




