# Purpose

### 2022-11-22
Need a hot fix because the format for the `ndjson` file was wrong and it ended up copying the same subreddit name to the similar subreddits :upside_down_smile:

On the bright side, we can re-load the existing df with the nearest neighbors and focus only on reshaping the data (instead of rebuilding the index from scratch).


### 2022-11-08
Optimize the function to get ANN faster. Now that we'll run ANN for 250k+ subreddits, running in a single thread could take a loooong time.

New ETA for ~250k subreddits: ~50 minutes.


### 2022-08-01
Calculating precise nearest neighbors has become too expensive as we go over 40k subreddits. So instead let's calculate approx nearest neighbors (ANN). 

In this notebook we use [ANNOY](https://github.com/spotify/annoy).  Main reason for using annoy over FAISS is that annoy has official wheels in pypi, but FAISS only officially supports installation from conda. For now we don't want to depend on third-party wheels for FAISS b/c that can be messy to install & replicate in a VM. Maybe when we switch to kubeflow we can try FAISS.
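
As a rough illustration of why exact search stops scaling, the sketch below counts the pairwise comparisons an exact all-pairs search needs and runs a brute-force cosine search on toy vectors. The function names and the toy embeddings are made up for illustration; this is not the code that earlier versions of this project used.

```python
import math
from typing import Dict, List, Tuple


def n_pairwise_comparisons(n_items: int) -> int:
    """Number of unique pairs an exact all-pairs search has to score."""
    return n_items * (n_items - 1) // 2


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def exact_top_k(
    embeddings: Dict[str, List[float]], seed: str, k: int
) -> List[Tuple[str, float]]:
    """Brute force: score the seed against every other item, then sort."""
    scores = [
        (name, cosine_similarity(embeddings[seed], vec))
        for name, vec in embeddings.items()
        if name != seed
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings (hypothetical values, 3 dimensions only)
toy = {
    "france": [1.0, 0.1, 0.0],
    "askfrance": [0.9, 0.2, 0.0],
    "formula1": [0.0, 0.1, 1.0],
}
print(exact_top_k(toy, "france", k=1))
print(f"{n_pairwise_comparisons(40_000):,} pairs at 40k subreddits")
print(f"{n_pairwise_comparisons(240_000):,} pairs at 240k subreddits")
```

The pair count grows quadratically, which is the cost an approximate index avoids by searching only part of the space for each query.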


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import json
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm import tqdm

import mlflow
import hydra
import annoy


import subclu
from subclu.models.nn_annoy import AnnoyIndex
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.data.data_loaders import LoadSubreddits
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks

from subclu.utils.big_query_utils import load_data_to_bq_table
from subclu.models.bq_embedding_schemas import embeddings_schema, similar_sub_schema


# General utils to display & set working directories
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)


print_lib_versions([annoy, dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
annoy		v: 1.17.0
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
35,35,v0.6.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/35,active
36,36,v0.6.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/36,active
37,37,v0.6.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/37,active
38,38,v0.6.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/38,active
39,39,v0.6.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/39,active
40,40,v0.6.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/40,active
41,41,v0.6.1_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/41,active
42,42,v0.6.1_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/42,active
43,43,v0.6.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/43,active


## Get runs from embeddings aggregation jobs

Want to make sure we can load these artifacts for other jobs

In [6]:
%%time

df_mlf_runs =  mlf.search_all_runs(experiment_ids=[40])
df_mlf_runs.shape

CPU times: user 64 ms, sys: 0 ns, total: 64 ms
Wall time: 63.3 ms


(4, 43)

In [7]:
df_mlf_runs[df_mlf_runs['status'] == 'FINISHED'].iloc[:5, :10]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_v_post_comments-rows,metrics.memory_total,metrics.df_v_subs-rows,metrics.memory_used_percent
2,91ac7ca171024c779c0992f59470c81b,40,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts,2022-11-07 21:38:57.662000+00:00,2022-11-22 23:21:25.710000+00:00,53597817.0,1444961.0,781653.0,0.552656


### Check run artifacts for selected run

In [8]:
run_uuid_ = '91ac7ca171024c779c0992f59470c81b'
l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=run_uuid_,
    only_top_level=True,
    verbose=True,
)
l_artifacts_all = mlf.list_run_artifacts(
    run_id=run_uuid_,
    only_top_level=False,
    verbose=False,
)

00:04:39 | INFO | "   341 <- Artifacts to check count"
00:04:39 | INFO | "   341 <- Artifacts clean count"
00:04:39 | INFO | "     8 <- Artifacts & folders at TOP LEVEL clean count"
00:04:45 | INFO | "   341 <- Artifacts clean count"
00:04:45 | INFO | "     8 <- Artifacts & folders at TOP LEVEL clean count"


In [9]:
for t_ in l_artifacts_top_level:
    l_ = [i for i in l_artifacts_all if t_ in i]
    print(f"=== Items in folder: {len(l_):,.0f} | {t_}  ===")
    for _ in l_[:3]:
        print(' ', '/'.join(_.split('/')[5:]))
    print('')

=== Items in folder: 63 | ann_df-2022-11-22_185903  ===
  ann_df-2022-11-22_185903/_common_metadata
  ann_df-2022-11-22_185903/_metadata
  ann_df-2022-11-22_185903/part.0.parquet

=== Items in folder: 52 | ann_df-239774-2022-11-22_230904  ===
  ann_df-239774-2022-11-22_230904/_common_metadata
  ann_df-239774-2022-11-22_230904/_metadata
  ann_df-239774-2022-11-22_230904/part.0.parquet

=== Items in folder: 1 | ann_ndjson-239774-2022-11-22_230904  ===
  ann_ndjson-239774-2022-11-22_230904/ann_ndjson-239774_subreddits.json

=== Items in folder: 211 | df_posts_agg_c1  ===
  df_posts_agg_c1/_common_metadata
  df_posts_agg_c1/_metadata
  df_posts_agg_c1/part.0.parquet

=== Items in folder: 14 | df_subs_agg_c1  ===
  df_subs_agg_c1/_common_metadata
  df_subs_agg_c1/_metadata
  df_subs_agg_c1/part.0.parquet

=== Items in folder: 1 | df_subs_agg_c1_ndjson  ===
  df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json

=== Items in folder: 7 | df_subs_agg_c1_unweighted  ===
  df_subs_a

# Set run parameters to log for mlflow

This dictionary is equivalent to a config file for now. Use it as a basis for the kubeflow re-write.

How to get active run:
```python
mlflow.active_run().info.run_id
```

In [10]:
d_mlf_params = {
    'run_name': f"ann_subreddit_test-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    'mlflow_experiment_name': 'v0.6.1_nearest_neighbors',
    'embeddings_run_uuid': '91ac7ca171024c779c0992f59470c81b',
    'subreddit_embeddings_folder': 'df_subs_agg_c1',
    'post_embeddings_folder': 'df_posts_agg_c1',
    'n_min_post_per_sub': 6,

    # index columns for ANN df, JSON, & BQ table
    'index_cols': ['subreddit_id', 'subreddit_name'],
    'model_version': 'v0.6.1',
    'model_name': 'cau-text-mUSE',
    
    # number of subreddits to sample.
    #  Set to None to run on full data
    'n_sample_embedding_rows': None,
    
    # flag & params to upload to bigquery
    'upload_to_bq': False,
    'bq_project': 'reddit-employee-datasets',
    'bq_dataset': 'david_bermejo',
    'bq_table_name': 'cau_similar_subreddits_by_text',
}
d_ann_params = {
    'n_trees': 200,
    'metric': 'angular',
}
run_uuid = d_mlf_params['embeddings_run_uuid']
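
If these dictionaries become a config file in the kubeflow re-write, one possible shape is a pair of frozen dataclasses with light validation. The class names `AnnParams` and `RunParams` and the validation rule are assumptions for illustration; only the field names and default values come from the dictionaries above, and only a subset of the fields is shown.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnnParams:
    """Hypothetical typed version of d_ann_params."""
    n_trees: int = 200
    metric: str = "angular"

    def __post_init__(self) -> None:
        if self.n_trees < 1:
            raise ValueError("n_trees must be a positive integer")


@dataclass(frozen=True)
class RunParams:
    """Hypothetical typed version of (part of) d_mlf_params."""
    embeddings_run_uuid: str
    mlflow_experiment_name: str = "v0.6.1_nearest_neighbors"
    subreddit_embeddings_folder: str = "df_subs_agg_c1"
    index_cols: Tuple[str, ...] = ("subreddit_id", "subreddit_name")
    n_sample_embedding_rows: Optional[int] = None  # None -> run on full data
    upload_to_bq: bool = False


run_params = RunParams(embeddings_run_uuid="91ac7ca171024c779c0992f59470c81b")
ann_params = AnnParams()

# asdict() converts back to plain dicts, e.g. to log them as run parameters
params_to_log = {**asdict(run_params), **asdict(ann_params)}
print(params_to_log["metric"], params_to_log["n_trees"])
```

Frozen dataclasses fail early on a bad value and cannot be mutated halfway through a run, which a plain dict does not guarantee.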

# Load df with precomputed distances


**HOTFIX specific**: In this case we want to load an existing df with the desired outputs so we don't need to filter & create an index
- We will want to write to a new folder with a `fix` suffix so we know it references the same data but has fixed the sub name

In [12]:
%%time

df_nn_top = mlf.read_run_artifact(
    run_id=run_uuid,
    artifact_folder='ann_df-239774-2022-11-22_230904',
    read_function='pd_parquet',
    verbose=False,
)


info(df_nn_top.shape)

00:12:18 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/ann_df-239774-2022-11-22_230904"
100%|###########################################| 52/52 [00:24<00:00,  2.16it/s]
00:12:42 | INFO | "  Parquet files found:    50"
00:12:42 | INFO | "  Parquet files to use:    50"
00:12:55 | INFO | "(47954800, 13)"


CPU times: user 52.8 s, sys: 10.1 s, total: 1min 2s
Wall time: 43.7 s


In [13]:
df_nn_top.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47954800 entries, 0 to 47954799
Data columns (total 13 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   subreddit_id            object 
 1   subreddit_name          object 
 2   seed_ix                 int64  
 3   nn_ix                   int64  
 4   distance                float64
 5   distance_rank           int64  
 6   similar_subreddit_id    object 
 7   similar_subreddit_name  object 
 8   cosine_similarity       float64
 9   pt                      object 
 10  mlflow_run_id           object 
 11  model_name              object 
 12  model_version           object 
dtypes: float64(2), int64(3), object(8)
memory usage: 4.6+ GB


## Quick checks

In [14]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'france']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
22496720,t5_2qhjz,france,16574,6584,0.489396,1,t5_29145x,francedigeste,0.880246,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496721,t5_2qhjz,france,16574,92786,0.500917,2,t5_4c3l03,france6,0.874541,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496722,t5_2qhjz,france,16574,21992,0.513589,3,t5_2rj8v,francais,0.868113,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496723,t5_2qhjz,france,16574,50277,0.536449,4,t5_2zkfk,askfrance,0.856111,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496724,t5_2qhjz,france,16574,90336,0.541022,5,t5_47quxa,yahooqr,0.853648,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496725,t5_2qhjz,france,16574,45568,0.558391,6,t5_2xe8t,paslegorafi,0.8441,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496726,t5_2qhjz,france,16574,130621,0.567863,7,t5_5yjd6o,france_actu_debats,0.838766,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496727,t5_2qhjz,france,16574,3015,0.608544,8,t5_22i0,de,0.814837,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496728,t5_2qhjz,france,16574,16544,0.608873,9,t5_2qhh9,quebec,0.814637,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22496729,t5_2qhjz,france,16574,118628,0.612728,10,t5_5i39cu,lbaqr,0.812282,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
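
The `distance` and `cosine_similarity` columns in the table above are consistent with Annoy's `angular` metric, which the Annoy README defines as `sqrt(2 * (1 - cos(u, v)))`. Below is a minimal sketch of the conversion, checked against the first `france` row. The helper names are made up here; the notebook's own conversion code is not shown in this section.

```python
import math


def angular_distance_to_cosine(distance: float) -> float:
    """Invert d = sqrt(2 * (1 - cos)) to get cos = 1 - d**2 / 2."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular_distance(cosine: float) -> float:
    # max() guards against tiny negative values from floating point error
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


# First row for `france` above: distance=0.489396, cosine_similarity=0.880246
cos_sim = angular_distance_to_cosine(0.489396)
print(round(cos_sim, 6))
assert math.isclose(cos_sim, 0.880246, abs_tol=1e-5)
```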


In [15]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'finanzen']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
4192032,t5_35m5e,finanzen,59324,126925,0.392432,1,t5_5txdoj,finanzenat,0.922998,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192033,t5_35m5e,finanzen,59324,76249,0.470963,2,t5_3isqn,italiapersonalfinance,0.889097,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192034,t5_35m5e,finanzen,59324,61226,0.481969,3,t5_37aoh,vosfinances,0.883853,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192035,t5_35m5e,finanzen,59324,669,0.484936,4,t5_11cinh,befire,0.882418,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192036,t5_35m5e,finanzen,59324,42405,0.522513,5,t5_2w5jv,eupersonalfinance,0.86349,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192037,t5_35m5e,finanzen,59324,8816,0.530557,6,t5_2clhc5,literaciafinanceira,0.859254,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192038,t5_35m5e,finanzen,59324,233763,0.536577,7,t5_oe819,personalfinanceza,0.856042,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192039,t5_35m5e,finanzen,59324,37864,0.540692,8,t5_2uo3q,ausfinance,0.853826,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192040,t5_35m5e,finanzen,59324,63855,0.54347,9,t5_38zrx,personalfinancenz,0.85232,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
4192041,t5_35m5e,finanzen,59324,65493,0.543982,10,t5_39zkf,fiaustralia,0.852042,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1


In [16]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'de']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
603000,t5_22i0,de,3015,93929,0.389254,1,t5_4egnbw,dezwo,0.924241,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603001,t5_22i0,de,3015,68278,0.477014,2,t5_3caax,600euro,0.886229,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603002,t5_22i0,de,3015,77935,0.481225,3,t5_3jxvk,tja,0.884211,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603003,t5_22i0,de,3015,231106,0.493754,4,t5_irnzx,dachschaden,0.878103,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603004,t5_22i0,de,3015,18131,0.509679,5,t5_2qo9i,austria,0.870114,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603005,t5_22i0,de,3015,64009,0.517639,6,t5_392ha,asozialesnetzwerk,0.866025,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603006,t5_22i0,de,3015,40631,0.571133,7,t5_2vk0m,nachrichten,0.836903,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603007,t5_22i0,de,3015,96508,0.598626,8,t5_4juf8o,politpro,0.820823,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603008,t5_22i0,de,3015,233371,0.607671,9,t5_nls07,belgium2,0.815368,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
603009,t5_22i0,de,3015,231620,0.608017,10,t5_jsyzh,poldersocialisme,0.815158,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1


In [17]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'mexico']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
22534320,t5_2qhv7,mexico,16762,106045,0.43765,1,t5_4ywzju,askmexico,0.904231,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534321,t5_2qhv7,mexico,16762,17695,0.520065,2,t5_2qm06,monterrey,0.864766,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534322,t5_2qhv7,mexico,16762,26092,0.541566,3,t5_2sbh1,mexicali,0.853353,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534323,t5_2qhv7,mexico,16762,37938,0.546211,4,t5_2up3k,ticos,0.850827,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534324,t5_2qhv7,mexico,16762,79115,0.55256,5,t5_3la4d,mujico,0.847338,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534325,t5_2qhv7,mexico,16762,34838,0.555084,6,t5_2tw1p,mexicocity,0.845941,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534326,t5_2qhv7,mexico,16762,37443,0.576658,7,t5_2ujoy,memexico,0.833733,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534327,t5_2qhv7,mexico,16762,13792,0.579503,8,t5_2lxxle,mexicow,0.832088,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534328,t5_2qhv7,mexico,16762,25891,0.588112,9,t5_2samk,guatemala,0.827062,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22534329,t5_2qhv7,mexico,16762,101331,0.609711,10,t5_4sbz8m,cdmx,0.814126,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1


In [18]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'formula1']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
22583320,t5_2qimj,formula1,17007,7135,0.334023,1,t5_29o8ec,grandprixracing,0.944214,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583321,t5_2qimj,formula1,17007,54108,0.39302,2,t5_31vs7,scuderiaferrari,0.922768,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583322,t5_2qimj,formula1,17007,52816,0.409588,3,t5_316st,f1feederseries,0.916119,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583323,t5_2qimj,formula1,17007,1721,0.432823,4,t5_13t1oy,mclarenformula1,0.906332,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583324,t5_2qimj,formula1,17007,26493,0.459663,5,t5_2sdeq,indycar,0.894355,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583325,t5_2qimj,formula1,17007,40840,0.467478,6,t5_2vmby,lewishamilton,0.890732,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583326,t5_2qimj,formula1,17007,56939,0.467626,7,t5_33n2v1,astonmartinformula1,0.890663,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583327,t5_2qimj,formula1,17007,61317,0.471614,8,t5_37co3,haasf1team,0.88879,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583328,t5_2qimj,formula1,17007,41108,0.474223,9,t5_2vpfj,formulae,0.887556,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1
22583329,t5_2qimj,formula1,17007,80978,0.491059,10,t5_3ndbi,formuladank,0.879431,2022-11-22,91ac7ca171024c779c0992f59470c81b,cau-text-mUSE,v0.6.1


# Save metadata to a dictionary
This meta is the same for all subs, so it'll be faster if we create it once rather than re-creating for each subreddit

In [26]:
%%time

d_topk_meta = {
    'pt': None,
    'mlflow_run_id': None, 
    'model_name': None,
    'model_version': None,
}
info(f"Checking keys for ndjson...")
for k in tqdm(d_topk_meta.keys()):
    assert 1 == df_nn_top[k].nunique()
    d_topk_meta[k] = df_nn_top[k].values[0]
    print(f"  {k}: {d_topk_meta[k]}")

00:26:29 | INFO | "Checking keys for ndjson..."
 25%|██▌       | 1/4 [00:04<00:13,  4.56s/it]

  pt: 2022-11-22


 50%|█████     | 2/4 [00:10<00:10,  5.26s/it]

  mlflow_run_id: 91ac7ca171024c779c0992f59470c81b


 75%|███████▌  | 3/4 [00:14<00:04,  4.98s/it]

  model_name: cau-text-mUSE


100%|██████████| 4/4 [00:19<00:00,  4.84s/it]

  model_version: v0.6.1
CPU times: user 18.4 s, sys: 946 ms, total: 19.4 s
Wall time: 19.4 s





# Reshape to JSON so we can load it into BigQuery

Saving in a nested format can speed up queries by over 2x and matches the format that the ML team has in their embeddings.

Fixed (2022-11-22) to correct format:
- WANTED: the `similar_subreddit` field should be: 
    - a list of dictionaries
        - each dict is a subreddit
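
As a small illustration of the wanted shape, the sketch below groups flat (seed, neighbor) rows into one JSON line per seed subreddit. It uses only the standard library and two toy rows copied from the `france` table above; the notebook itself does this reshaping with pandas further down.

```python
import json
from itertools import groupby
from operator import itemgetter

# Flat rows: one row per (seed subreddit, similar subreddit) pair
flat_rows = [
    {"subreddit_id": "t5_2qhjz", "subreddit_name": "france",
     "similar_subreddit_id": "t5_29145x", "similar_subreddit_name": "francedigeste",
     "cosine_similarity": 0.880246, "distance_rank": 1},
    {"subreddit_id": "t5_2qhjz", "subreddit_name": "france",
     "similar_subreddit_id": "t5_4c3l03", "similar_subreddit_name": "france6",
     "cosine_similarity": 0.874541, "distance_rank": 2},
]

ndjson_lines = []
# groupby() only groups consecutive rows, so sort by the key first
seed_key = itemgetter("subreddit_id", "subreddit_name")
for (seed_id, seed_name), rows in groupby(sorted(flat_rows, key=seed_key), key=seed_key):
    record = {
        "subreddit_id": seed_id,
        "subreddit_name": seed_name,
        # Each similar subreddit is its own dict, WITHOUT the "similar_" prefix
        "similar_subreddit": [
            {
                "subreddit_id": r["similar_subreddit_id"],
                "subreddit_name": r["similar_subreddit_name"],
                "cosine_similarity": r["cosine_similarity"],
                "distance_rank": r["distance_rank"],
            }
            for r in rows
        ],
    }
    ndjson_lines.append(json.dumps(record))

print(ndjson_lines[0])
```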

See example of format we want here:
https://github.snooguts.net/reddit/gazette-models/blob/cf324c18d974d0b01bb40c71c7f6425d7ff16576/similar_subreddit/embeddings/local_write.py#L32

```python
def write_similar_subreddit_file(
    date_today: str,
    model_name: str,
    model_version: str,
    filename_path_top_k: Path,
    topk_dict: Dict,
    subreddit2id: Dict,
) -> List:
    with open(filename_path_top_k, "w") as f:
        for sr, sim_sr_pairs in topk_dict.items():
            line_dict: Dict[str, Any] = dict()
            if sim_sr_pairs:  # make sure subreddit list is not empty
                line_dict["pt"] = date_today
                line_dict["model_name"] = model_name
                line_dict["model_version"] = model_version
                line_dict["subreddit_name"] = sr
                line_dict["subreddit_id"] = subreddit2id[sr]

                if sr != sim_sr_pairs[0][0]:
                    raise ValueError(
                        f"Inconsistent subreddit name {sim_sr_pairs[0][0]} with searched name {sr}"
                    )

                # ** THIS is the nested list of dicts that was wrong in previous code ** 
                sim_srs = []
                for sim_sr, sim_score in sim_sr_pairs[1:]:
                    sim_sr_dict = {
                        "subreddit_name": sim_sr,
                        "subreddit_id": subreddit2id[sim_sr],
                        "score": sim_score.astype(float),
                    }
                    sim_srs.append(sim_sr_dict)

                line_dict["similar_subreddit"] = sim_srs

                line = json.dumps(line_dict)
                f.write(line + "\n")
```

In [29]:
manual_model_timestamp = '2022-11-22_230904'  # datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model = get_project_subfolder(
    f"data/models/ann/manual_v061_{manual_model_timestamp}"
)
Path.mkdir(path_this_model, parents=True, exist_ok=True)
path_this_model

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/ann/manual_v061_2022-11-22_230904')

In [35]:
%%time

# NOTE the "_fix" suffix for this path so we don't over-write the existing data
p_local_json = path_this_model / f"ann_ndjson-{df_nn_top['subreddit_id'].nunique()}-{manual_model_timestamp}_fix"
Path.mkdir(p_local_json, exist_ok=True, parents=True)
subfolder_json = p_local_json.name

f_local_json_name = f"ann_ndjson-{df_nn_top['subreddit_id'].nunique()}_subreddits.json"
f_local_json_full = p_local_json / f_local_json_name

# If we run this multiple times, make sure we don't append duplicated lines
try:
    info(f"Deleting existing file...")
    f_local_json_full.unlink()
except FileNotFoundError as e:
    info(f"NVM, file does not exist yet...\n {e}")

prefix_similar_sub = 'similar'

# These are the cols to nest for similar subreddits
# NOTE: we need to rename sub-ID & sub-name for the nested format!
cols_for_similar_sub_ = [
    f"{prefix_similar_sub}_subreddit_id",
    f"{prefix_similar_sub}_subreddit_name",
    'cosine_similarity',
    'distance_rank',
]
d_rename_for_similar = {
    f"{prefix_similar_sub}_subreddit_id": "subreddit_id",
    f"{prefix_similar_sub}_subreddit_name": "subreddit_name",
}


info(f"Start saving df as ndJSON...")
with open(f_local_json_full, 'w') as f:
    # Group by a single column name (not a one-item list) so the key stays a plain string
    for seed_sub_id_, df_seed_ in tqdm(df_nn_top.groupby('subreddit_id'), mininterval=2):
        d_seed = {
            **d_topk_meta,
            **{
                'subreddit_id': seed_sub_id_,
                'subreddit_name': str(df_seed_['subreddit_name'].values[0]),
                
                # 2022-11-22: fixed the logic for similar_subreddit 
                #   each subreddit should be its own dict
                #   We need to RENAME the "similar" columns! (remove the similar prefix)
                'similar_subreddit': (
                    df_seed_[cols_for_similar_sub_]
                    .rename(columns=d_rename_for_similar)
                    .to_dict(orient='records')
                )
            }
        }
        f.write(json.dumps(d_seed) + "\n")

info(f"Done saving as ndJSON")
print(f"Example subreddit:")
for k, v in d_seed.items():
    if isinstance(v, list):
        print(f"{k}:")
        for _ in v[:5]:
            print(f"    {_}")
    else:
        print(f"{k}:  {v}")

01:05:02 | INFO | "Deleting existing file..."
01:05:02 | INFO | "Start saving df as ndJSON..."
100%|██████████| 239774/239774 [10:18<00:00, 387.73it/s]
01:15:31 | INFO | "Done saving as ndJSON"


Example subreddit:
pt:  2022-11-22
mlflow_run_id:  91ac7ca171024c779c0992f59470c81b
model_name:  cau-text-mUSE
model_version:  v0.6.1
subreddit_id:  t5_zzw6f
subreddit_name:  missourisingles
similar_subreddit:
    {'subreddit_id': 't5_37j5o', 'subreddit_name': 'upstatenyr4r', 'cosine_similarity': 0.8861671273011491, 'distance_rank': 1}
    {'subreddit_id': 't5_2vcv5', 'subreddit_name': 'r4rportland', 'cosine_similarity': 0.8686086098671737, 'distance_rank': 2}
    {'subreddit_id': 't5_2ucfi', 'subreddit_name': 'houstonr4r', 'cosine_similarity': 0.8656136562130552, 'distance_rank': 3}
    {'subreddit_id': 't5_2qhlb', 'subreddit_name': 'singles', 'cosine_similarity': 0.864648776897802, 'distance_rank': 4}
    {'subreddit_id': 't5_3jkzq5', 'subreddit_name': 'zahookups', 'cosine_similarity': 0.8631454247168744, 'distance_rank': 5}
CPU times: user 10min 17s, sys: 15.1 s, total: 10min 32s
Wall time: 10min 38s
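
A quick sanity check on the output file can catch a regression before uploading. The sketch below is one way to do it, assuming the field names written above and assuming, as the 2022-11-22 note describes, that the bug shows up as the seed subreddit's name repeated in the nested entries. The function name `find_bad_ndjson_lines` and the toy file are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def find_bad_ndjson_lines(path: Path, nested_key: str = "similar_subreddit") -> List[int]:
    """Return 1-based line numbers where the nested field looks wrong."""
    bad_lines = []
    with open(path, "r") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)
            nested = record.get(nested_key)
            is_list_of_dicts = isinstance(nested, list) and all(
                isinstance(d, dict) for d in nested
            )
            # Assumed symptom of the old bug: a nested entry repeats the seed's name
            repeats_seed = is_list_of_dicts and any(
                d.get("subreddit_name") == record.get("subreddit_name") for d in nested
            )
            if not is_list_of_dicts or repeats_seed:
                bad_lines.append(line_number)
    return bad_lines


# Toy file: line 1 is fine, line 2 reproduces the assumed bug
good = {"subreddit_name": "france", "similar_subreddit": [{"subreddit_name": "askfrance"}]}
bad = {"subreddit_name": "de", "similar_subreddit": [{"subreddit_name": "de"}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy.json"
    toy_path.write_text(json.dumps(good) + "\n" + json.dumps(bad) + "\n")
    bad_line_numbers = find_bad_ndjson_lines(toy_path)

print(bad_line_numbers)
```

The check reads one line at a time, so it should also work on the full file without loading it into memory.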


In [39]:
%%time
# log to mlflow
try:
    print(d_mlflow_paths.keys())
except NameError:
    info(f"Initialize d_mlflow_paths (dict)")
    d_mlflow_paths = dict()

info(f"Start logging JSON file...")
with mlflow.start_run(run_id=run_uuid) as run:
    mlflow.log_artifacts(str(p_local_json), subfolder_json)
    # get path to JSON file so that we can create a table from it
    d_mlflow_paths['mlflow_artifact_json'] = mlflow.get_artifact_uri(
        artifact_path=f"{subfolder_json}/{f_local_json_name}"
    )
info(f"Logging artifact complete!")
info(f"JSON artifact location:\n{d_mlflow_paths['mlflow_artifact_json']}")

02:05:52 | INFO | "Initialize d_mlflow_paths"
02:05:52 | INFO | "Start logging JSON file..."
02:06:49 | INFO | "Logging artifact complete!"


CPU times: user 3.47 s, sys: 3.64 s, total: 7.11 s
Wall time: 56.8 s


# Upload JSON to BQ

Example `schema` here:
- https://github.snooguts.net/reddit/gazette-models/blob/cf324c18d974d0b01bb40c71c7f6425d7ff16576/similar_subreddit/embeddings/bq_write.py

Using `bq load` won't work with a JSON schema in BQ.

Instead, let's try using the python client. NOTE: we'll need to get the right authentication in the VM that has the correct read & write access, e.g.,:
```bash
# login
gcloud auth application-default login

# logout
gcloud auth application-default revoke
```
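
For reference, the BigQuery REST API describes a nested field like `similar_subreddit` as a `RECORD` with mode `REPEATED`. The sketch below writes such a schema as plain Python dicts in that layout. The field types are guesses based on the columns in this notebook; the actual output of `similar_sub_schema()` is not shown here and may differ.

```python
import json
from typing import Dict, List

# ASSUMED schema (not the real similar_sub_schema() output)
assumed_similar_sub_schema = [
    {"name": "pt", "type": "DATE", "mode": "NULLABLE"},
    {"name": "mlflow_run_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "model_name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "model_version", "type": "STRING", "mode": "NULLABLE"},
    {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE"},
    {
        # One row per seed subreddit, with a repeated record of its neighbors
        "name": "similar_subreddit",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE"},
            {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE"},
            {"name": "cosine_similarity", "type": "FLOAT", "mode": "NULLABLE"},
            {"name": "distance_rank", "type": "INTEGER", "mode": "NULLABLE"},
        ],
    },
]


def repeated_record_fields(schema: List[dict]) -> Dict[str, List[str]]:
    """Map each repeated RECORD field to the names of its nested fields."""
    return {
        field["name"]: [sub["name"] for sub in field["fields"]]
        for field in schema
        if field["type"] == "RECORD" and field.get("mode") == "REPEATED"
    }


print(json.dumps(repeated_record_fields(assumed_similar_sub_schema), indent=2))
```

The nested field names match the keys of each dict in the ndJSON lines written earlier, which is why the `similar_` prefix had to be removed before saving.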

---
example format for path:
```
d_mlflow_paths['mlflow_artifact_json'] = (
    'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson-2022-09-10_003611/ann_ndjson-250573_subreddits.json'
)
```

In [40]:
d_mlflow_paths['mlflow_artifact_json']

'gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/ann_ndjson-239774-2022-11-22_230904_fix/ann_ndjson-239774_subreddits.json'

In [None]:
BREAK

In [41]:
%%time

info(f"Updating table from file:\n{d_mlflow_paths['mlflow_artifact_json']}")

load_data_to_bq_table(
    uri=d_mlflow_paths['mlflow_artifact_json'],
    bq_project='reddit-employee-datasets',
    bq_dataset='david_bermejo',
    bq_table_name='cau_similar_subreddits_by_text',
    schema=similar_sub_schema(),
    partition_column='pt',
    table_description=(
        "Table with most similar subreddits by the text (posts & comments) in each sub."
        "  It works across 16 languages. So finance (English), Finanzen(German), & financia(Spanish) will be clustered together."
        "  See wiki: https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/CA+Embeddings+Topic+Model"
    ),
    update_table_description=True,
)

02:15:20 | INFO | "Updating table from file:
gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/ann_ndjson-239774-2022-11-22_230904_fix/ann_ndjson-239774_subreddits.json"
02:15:22 | INFO | "Loading data to table:
  reddit-employee-datasets.david_bermejo.cau_similar_subreddits_by_text"
02:15:22 | INFO | "Table reddit-employee-datasets.david_bermejo.cau_similar_subreddits_by_text already exist"
02:15:22 | INFO | "  0 rows in table BEFORE adding data"
02:16:40 | INFO | "Updating subreddit description from:
  Table with most similar subreddits by the text (posts & comments) in each sub.  It works across 16 languages. So finance (English), Finanzen(German), & financia(Spanish) will be clustered together.  See wiki: https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/CA+Embeddings+Topic+Model
to:
  Table with most similar subreddits by the text (posts & comments) in each sub.  It works across 16 languages. So finance (English), Finanz

CPU times: user 67.8 ms, sys: 104 ms, total: 172 ms
Wall time: 1min 19s
