# Purpose

### 2022-11-08
Two big changes:
- Use a new method to pull ANN results and convert them to dataframes faster
- Use a new class to run the job, which makes it easier to replicate and to port to kubeflow in the future

New ETA for ~250k subreddits: ~50 minutes.


### 2022-08-01
Calculating exact nearest neighbors has become too expensive as we go over 40k subreddits. So instead let's calculate approximate nearest neighbors (ANN).

In this notebook we use [ANNOY](https://github.com/spotify/annoy). The main reason for using annoy over FAISS is that annoy has official wheels on PyPI, but FAISS only officially supports installation from conda. For now we don't want to depend on third-party wheels for FAISS b/c they can be messy to install & replicate. Maybe when we switch to kubeflow we can try FAISS?
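
For contrast with ANN, here is a minimal sketch of the exact approach using only NumPy. It scores every row against every other row, so both compute and memory grow with the square of the number of subreddits. The function name `exact_top_k` and the toy data are illustrative and are not part of `subclu`.

```python
import numpy as np


def exact_top_k(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k most cosine-similar rows for every row (self excluded)."""
    # L2-normalize rows so that a dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T  # (n, n) matrix: this is the part that stops scaling
    np.fill_diagonal(sims, -np.inf)  # never return the row itself
    # sort each row by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]


rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 4))
toy[1] = toy[0] * 2.0  # row 1 points in the same direction as row 0
neighbors = exact_top_k(toy, k=2)
print(neighbors[0])

# Size of the full similarity matrix for the ~289k subreddits kept in this notebook
n_subs = 289_856
print(f"{n_subs ** 2 * 4 / 1e9:,.0f} GB as float32")
```

The full matrix can be avoided by processing rows in batches, but the number of comparisons stays quadratic, which is the cost ANN trades away for a small loss in recall.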


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [11]:
from datetime import datetime, timedelta
import gc
import os
import json
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd

from tqdm import tqdm

import mlflow
import hydra
import annoy


import subclu
from subclu.models.nn_annoy import AnnoyIndex

from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks

from subclu.models.get_ann import GetANN


print_lib_versions([annoy, hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
annoy		v: 1.17.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [5]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [6]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
35,35,v0.6.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/35,active
36,36,v0.6.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/36,active
37,37,v0.6.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/37,active
38,38,v0.6.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/38,active
39,39,v0.6.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/39,active
40,40,v0.6.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/40,active
41,41,v0.6.1_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/41,active
42,42,v0.6.1_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/42,active
43,43,v0.6.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/43,active


## Get runs from embeddings aggregation jobs

Want to make sure we can load these artifacts for other jobs

In [7]:
%%time

df_mlf_runs =  mlf.search_all_runs(experiment_ids=[40])
df_mlf_runs.shape

CPU times: user 60.4 ms, sys: 4.34 ms, total: 64.7 ms
Wall time: 64.1 ms


(4, 43)

In [8]:
df_mlf_runs[df_mlf_runs['status'] == 'FINISHED'].iloc[:5, :10]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.memory_total,metrics.cpu_count,metrics.df_v_subs-cols,metrics.df_v_post_comments-rows
2,91ac7ca171024c779c0992f59470c81b,40,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts,2022-11-07 21:38:57.662000+00:00,2022-11-08 08:52:44.944000+00:00,1444961.0,96.0,514.0,53597817.0


### Check run artifacts for selected run

In [9]:
run_uuid_ = '91ac7ca171024c779c0992f59470c81b'
l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=run_uuid_,
    only_top_level=True,
    verbose=True,
)
l_artifacts_all = mlf.list_run_artifacts(
    run_id=run_uuid_,
    only_top_level=False,
    verbose=False,
)

06:58:33 | INFO | "   223 <- Artifacts to check count"
06:58:33 | INFO | "   223 <- Artifacts clean count"
06:58:33 | INFO | "     3 <- Artifacts & folders at TOP LEVEL clean count"
06:58:39 | INFO | "   223 <- Artifacts clean count"
06:58:39 | INFO | "     3 <- Artifacts & folders at TOP LEVEL clean count"


In [10]:
for t_ in l_artifacts_top_level:
    l_ = [i for i in l_artifacts_all if t_ in i]
    print(f"=== Items in folder: {len(l_):,.0f} | {t_}  ===")
    for _ in l_[:3]:
        print(' ', '/'.join(_.split('/')[5:]))
    print('')

=== Items in folder: 211 | df_posts_agg_c1  ===
  df_posts_agg_c1/_common_metadata
  df_posts_agg_c1/_metadata
  df_posts_agg_c1/part.0.parquet

=== Items in folder: 12 | df_subs_agg_c1  ===
  df_subs_agg_c1/_common_metadata
  df_subs_agg_c1/_metadata
  df_subs_agg_c1/part.0.parquet

=== Items in folder: 6 | df_subs_agg_c1_unweighted  ===
  df_subs_agg_c1_unweighted/_common_metadata
  df_subs_agg_c1_unweighted/_metadata
  df_subs_agg_c1_unweighted/part.0.parquet



# Set run parameters to log for mlflow

This dictionary is equivalent to a config file for now. Use it as a basis for the kubeflow rewrite.

How to get active run:
```python
mlflow.active_run().info.run_id
```

In [35]:
d_mlf_params = {
    'mlflow_run_name': f"ann_subreddit_test-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    'mlflow_experiment_name': 'v0.6.1_nearest_neighbors',
    'embeddings_run_uuid': '91ac7ca171024c779c0992f59470c81b',
    'subreddit_embeddings_folder': 'df_subs_agg_c1',
    'post_embeddings_folder': 'df_posts_agg_c1',
    'n_min_post_per_sub': 4,

    # index columns for ANN df, JSON, & BQ table
    'index_cols': ['subreddit_id', 'subreddit_name'],
    'model_version': 'v0.6.1',
    'model_name': 'cau-text-mUSE',
    
    # number of subreddits to sample.
    #  Set to None to run on full data
    'n_sample_embedding_rows': 25000,
    
    # flag & params to upload to bigquery
    'upload_to_bq': False,
    'bq_project': 'reddit-employee-datasets',
    'bq_dataset': 'david_bermejo',
    'bq_table_name': 'cau_similar_subreddits_by_text',
}
d_ann_params = {
    'n_trees': 200,
    'metric': 'angular',
}
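
One possible way to turn the dictionary into something closer to a config file is a small dataclass. This is only a sketch: `AnnJobConfig` is a hypothetical name, it covers a subset of the keys above, and the validation rule is an assumption rather than something `GetANN` is known to enforce.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AnnJobConfig:
    """Hypothetical typed version of d_mlf_params (subset of keys)."""
    mlflow_experiment_name: str
    embeddings_run_uuid: str
    subreddit_embeddings_folder: str = "df_subs_agg_c1"
    n_min_post_per_sub: int = 4
    index_cols: List[str] = field(
        default_factory=lambda: ["subreddit_id", "subreddit_name"]
    )
    n_sample_embedding_rows: Optional[int] = None  # None -> run on full data
    upload_to_bq: bool = False

    def __post_init__(self) -> None:
        # Assumed sanity check: a negative minimum post count makes no sense
        if self.n_min_post_per_sub < 0:
            raise ValueError("n_min_post_per_sub must be >= 0")


cfg = AnnJobConfig(
    mlflow_experiment_name="v0.6.1_nearest_neighbors",
    embeddings_run_uuid="91ac7ca171024c779c0992f59470c81b",
    n_sample_embedding_rows=25_000,
)
params = asdict(cfg)  # plain dict that can be logged or unpacked like d_mlf_params
print(params["n_sample_embedding_rows"])
```

Typos in key names then fail at construction time instead of halfway through a long job.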

# Run job to get ANN 
Use the new `class` to make it easy to replicate & test multiple ANN runs.

In [36]:
ann = GetANN(
    **d_mlf_params,
)

In [37]:
# mlflow.end_run("FAILED")

In [38]:
# mlflow.active_run().info

In [39]:
_ = ann._load_sub_embeddings()

08:09:10 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1"
100%|########################################| 12/12 [00:00<00:00, 39260.26it/s]
08:09:10 | INFO | "  Parquet files found:     4"
08:09:10 | INFO | "  Parquet files to use:     4"
08:09:12 | INFO | "  781,653 |   515 <- RAW df_embeddings SHAPE"
08:09:12 | INFO | "  Keeping only subs with 4 >= posts"
08:09:12 | INFO | "  289,856 |   515 <- df_embeddings SHAPE, after min post filter"
08:09:12 | INFO | "  SAMPLING n_rows: 25,000"
08:09:13 | INFO | "   25,000 |   515 <- df_embeddings SHAPE"
08:09:13 | INFO | "  0:00:08.041057 <- Load embeddings time time elapsed"


In [None]:
LEGACY

# Load aggregated embeddings

For subreddit-level embeddings, my python code (serial) is fine. 

Try `gsutil` to download **posts-level embeddings** b/c that can take a LONG time to download sequentially. `gsutil` downloads files in parallel, which is much faster, and reports download speeds above 500 MB/s:

```bash
ents_sub_desc/part.67.parquet...
/ [2/197 files][ 61.7 GiB/ 75.4 GiB]  81% Done 632.0 MiB/s ETA 00:00:22
```

In [38]:
%%time

mlf.set_experiment(self.mlflow_experiment_name)
t_start_job = datetime.utcnow()
info(f"== Start ANN job ==")

t_start_read_embeddings_ = datetime.utcnow()
df_agg_sub_c_raw = mlf.read_run_artifact(
    run_id=run_uuid,
    artifact_folder='df_subs_agg_c1',
    read_function='pd_parquet',
    verbose=False,
)


info(df_agg_sub_c_raw.shape)

00:45:22 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1"
100%|###########################################| 12/12 [00:15<00:00,  1.33s/it]
00:45:38 | INFO | "  Parquet files found:     4"
00:45:38 | INFO | "  Parquet files to use:     4"


(781653, 515)
CPU times: user 20.8 s, sys: 6.82 s, total: 27.6 s
Wall time: 23.8 s


In [89]:
df_agg_sub_c_raw.iloc[:5, :7]

Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,embeddings_0,embeddings_1,embeddings_2,embeddings_3
0,t5_1001tl,jewel_xo,1,-0.028712,-0.027187,0.024826,0.046359
1,t5_1004au,tisbutafleshwound,3,0.010298,-0.000277,-0.004013,0.01762
2,t5_1006a0,sethigh,1,0.027356,0.032256,-0.022585,-0.004125
3,t5_1008xr,asiandiasporamusic,2,-0.011276,0.00072,-0.010621,0.021452
4,t5_1009a3,memesenespanol,299,-0.005113,-0.005898,-0.012267,0.006103


In [90]:
df_agg_sub_c_raw.iloc[-5:, :7]

Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,embeddings_0,embeddings_1,embeddings_2,embeddings_3
781648,t5_71dwdl,leagoldmining,0,0.027257,0.03655,-0.086116,-0.007389
781649,t5_6u3a0g,onlyfans_subscribers7,0,-0.05452,0.011804,-0.01482,0.02281
781650,t5_7a1p9b,xecauquan1,0,-0.013504,-0.092274,0.013367,0.032947
781651,t5_6xryrp,steroidsarmspeptide,0,-0.03319,0.058719,0.071802,0.065691
781652,t5_7a4axh,autonation_,0,-0.016285,-0.034998,-0.000827,0.058873


# Filter subreddits to use in ANN index

In a previous version we only kept subs that had embeddings AND clustering data. 
<br>Now that we cover 700k subreddits for v0.6.x, we need to be more thoughtful about how we'll select which subs to keep for ANN.

For v0.6.1 we'll keep only subs that have **4+ posts in L90 days**. From the Mode dashboard (linked below) we expect that number to be around 289k subreddits.

Mode Dashboards: 
- v0.6.0: https://app.mode.com/reddit/reports/e6cde33162c4 
- v0.6.1: https://app.mode.com/reddit/reports/87ce3abc9e37


## Apply filters

In v0.6.1, we already have the number of posts for embedding in the embedding file, so we don't need to load additional data (from mlflow or BQ) to apply post-count filters.

In [56]:
# v0.6.1: the embeddings df already has the post counts, so no separate counts df is needed
value_counts_and_pcts(
  pd.cut(
      df_agg_sub_c_raw['posts_for_embeddings_count'],
      bins=[-1, 0, 1, 2, 3, 4, 5, 49, np.inf],
      labels=[
        "00 posts", "01 post", '02 posts', '03 posts',
        '04 posts', '05 posts'
        , '06-49 posts', '50+ posts'
      ]
  ).rename('posts_with_len_3+'),
  sort_index=True,
  add_col_prefix=False,
  count_type='subreddits',
  sort_index_ascending=False,
  cumsum_count=True,
  reset_index=True,
).hide_index().set_caption(f"<h4 align='left'>Post distribution for subreddits with 1 view & 1 attempted post in L90-days</h4>")

posts_with_len_3+,subreddits_count,percent_of_subreddits,cumulative_sum_of_subreddits,cumulative_percent_of_subreddits
50+ posts,74004,9.5%,74004,9.5%
06-49 posts,158397,20.3%,232401,29.7%
05 posts,23732,3.0%,256133,32.8%
04 posts,33723,4.3%,289856,37.1%
03 posts,57946,7.4%,347802,44.5%
02 posts,128068,16.4%,475870,60.9%
01 post,235794,30.2%,711664,91.0%
00 posts,69989,9.0%,781653,100.0%
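
To make the binning easier to check, here is a small sketch of how `pd.cut` maps counts to the labels used above. By default the intervals are closed on the right, so the edges `[-1, 0, 1, ...]` give each small integer its own bin. The sample counts are made up for illustration.

```python
import numpy as np
import pandas as pd

counts = pd.Series(
    [0, 1, 3, 4, 5, 6, 49, 50, 299], name="posts_for_embeddings_count"
)
binned = pd.cut(
    counts,
    # right-closed intervals: (-1, 0], (0, 1], ..., (5, 49], (49, inf]
    bins=[-1, 0, 1, 2, 3, 4, 5, 49, np.inf],
    labels=[
        "00 posts", "01 post", "02 posts", "03 posts",
        "04 posts", "05 posts", "06-49 posts", "50+ posts",
    ],
)
print(pd.concat([counts, binned.rename("bin")], axis=1))
```

Reading the table from the top, the cumulative count through the "04 posts" row (289,856) is the number of subreddits that pass the 4+ posts filter.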


In [57]:
%%time

df_agg_sub_c = df_agg_sub_c_raw[df_agg_sub_c_raw['posts_for_embeddings_count'] >= 4]
df_agg_sub_c.shape

CPU times: user 268 ms, sys: 101 ms, total: 370 ms
Wall time: 367 ms


(289856, 515)

# Build annoy index

I created a custom `AnnoyIndex` class with some extra methods to create outputs (and calculate cosine similarity) for BigQuery.
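
Annoy defines its `angular` metric as `sqrt(2 * (1 - cos))`, so a cosine similarity can be recovered from the returned distance. The sketch below shows that conversion; whether the custom class computes it exactly this way is an assumption, but the `distance` and `cosine_similarity` values in the quick-check tables of this notebook are consistent with it. The function names are illustrative.

```python
import math


def angular_to_cosine(distance: float) -> float:
    """Invert Annoy's angular distance, d = sqrt(2 * (1 - cos))."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(cosine: float) -> float:
    """Forward direction; clamp at 0 to guard against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


# (distance, cosine_similarity) pairs copied from the quick-check tables in this notebook
pairs = [(0.350282, 0.938651), (0.402592, 0.91896)]
for d, expected in pairs:
    print(d, round(angular_to_cosine(d), 6), expected)
```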

In [28]:
%%time

index_cols = ['subreddit_id', 'subreddit_name']
l_embedding_cols = [c for c in df_agg_sub_c.columns if c.startswith('embeddings_')]
n_trees = 200

nn_index = AnnoyIndex(
    df_agg_sub_c[l_embedding_cols + index_cols],
    index_cols=index_cols,
    metric='angular',
    n_trees=n_trees,
)

nn_index.build()

CPU times: user 55min 11s, sys: 1min 34s, total: 56min 46s
Wall time: 1min 1s


## Get df with all items

With the old method, 80k subreddits took 1 hr & 17 minutes.

I had to create a new method b/c the old one would've taken over 18 hours to get ANN for 250k subreddits.

The new method should take ~40 minutes to get the top 250 ANN for each of 250k subreddits!


```bash
# old method:
100%|██████████| 81973/81973 [1:17:02<00:00, 17.73it/s]
17:07:23 | INFO | "(8115327, 7) <- df_top_items shape"


# new method:
  7%|6         | 17098/250573 [02:40<36:22, 106.97it/s]
```
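
A common cause of this kind of slowdown is building one small dataframe per item and concatenating them. The sketch below shows the alternative of collecting the neighbors in `(n_items, k)` arrays and flattening them into a single dataframe at the end. It is an assumption that `get_top_n_by_item_all_fast` works along these lines; `neighbors_to_frame` and the toy arrays are illustrative only.

```python
import numpy as np
import pandas as pd


def neighbors_to_frame(nn_ix: np.ndarray, distance: np.ndarray) -> pd.DataFrame:
    """Flatten (n_items, k) neighbor arrays into one long DataFrame in a single step."""
    n_items, k = nn_ix.shape
    return pd.DataFrame({
        "seed_ix": np.repeat(np.arange(n_items), k),  # each seed repeated k times
        "nn_ix": nn_ix.ravel(),
        "distance": distance.ravel(),
        "distance_rank": np.tile(np.arange(1, k + 1), n_items),  # 1..k per seed
    })


# Toy stand-in for per-item query results: 3 items, 2 neighbors each
nn_ix = np.array([[1, 2], [0, 2], [1, 0]])
distance = np.array([[0.1, 0.4], [0.1, 0.3], [0.3, 0.4]])
df = neighbors_to_frame(nn_ix, distance)
print(df)
```

With 250,573 seeds and `k=250` this layout yields 62,643,250 rows, which matches the row count logged for `df_nn_top`.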

In [32]:
%%time

df_nn_top = nn_index.get_top_n_by_item_all_fast(
    k=250,
    search_k=-1,
    include_distances=True,
    append_i=True,
    cosine_similarity=True,
    # n_sample=None,
)

100%|##########| 250573/250573 [39:35<00:00, 105.50it/s]
23:17:53 | INFO | "Start combining all ANNs into a df..."
23:18:50 | INFO | "(62643250, 4) <- df_nn_top shape"
23:18:50 | INFO | "Adding index labels (subreddit ID & Name)"
23:19:05 | INFO | "Done adding index names"
23:19:05 | INFO | "(62643250, 8) <- df_nn_top shape"
23:19:05 | INFO | "Calculating cosine similarity..."


CPU times: user 40min 29s, sys: 22.1 s, total: 40min 51s
Wall time: 40min 50s


### Quick Checks

In [42]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'france']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
4952750,t5_2qm9d,cfb,19811,40995,0.350282,1,t5_2urol,cfbmemes,0.938651
4952751,t5_2qm9d,cfb,19811,50234,0.357853,2,t5_2xys7,fsusports,0.935971
4952752,t5_2qm9d,cfb,19811,40002,0.407226,3,t5_2uhr8,notredamefootball,0.917084
4952753,t5_2qm9d,cfb,19811,22423,0.413432,4,t5_2r5u7,ohiostatefootball,0.914537
4952754,t5_2qm9d,cfb,19811,24202,0.42094,5,t5_2rj3j,collegebasketball,0.911405
4952755,t5_2qm9d,cfb,19811,27143,0.448822,6,t5_2s5kg,lsufootball,0.899279
4952756,t5_2qm9d,cfb,19811,27684,0.451051,7,t5_2s81c,wde,0.898277
4952757,t5_2qm9d,cfb,19811,43840,0.462864,8,t5_2vmjf,theb1g,0.892879
4952758,t5_2qm9d,cfb,19811,22673,0.467848,9,t5_2r7qs,huskers,0.890559
4952759,t5_2qm9d,cfb,19811,41916,0.474273,10,t5_2v1s4,utahfootball,0.887533


In [43]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'finanzen']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
15892000,t5_35m5e,finanzen,63568,143647,0.402592,1,t5_5txdoj,finanzenat,0.91896
15892001,t5_35m5e,finanzen,63568,65652,0.486613,2,t5_37aoh,vosfinances,0.881604
15892002,t5_35m5e,finanzen,63568,765,0.487687,3,t5_11cinh,befire,0.881081
15892003,t5_35m5e,finanzen,63568,81921,0.49109,4,t5_3isqn,italiapersonalfinance,0.879415
15892004,t5_35m5e,finanzen,63568,34682,0.50908,5,t5_2tasy,personalfinancecanada,0.870419
15892005,t5_35m5e,finanzen,63568,68503,0.523805,6,t5_38zrx,personalfinancenz,0.862814
15892006,t5_35m5e,finanzen,63568,244203,0.527583,7,t5_oe819,personalfinanceza,0.860828
15892007,t5_35m5e,finanzen,63568,45481,0.527723,8,t5_2w5jv,eupersonalfinance,0.860754
15892008,t5_35m5e,finanzen,63568,85195,0.532999,9,t5_3ljid,europefire,0.857956
15892009,t5_35m5e,finanzen,63568,40661,0.53488,10,t5_2uo3q,ausfinance,0.856952


# Add dt/pt column & metadata columns

In [44]:
d_topk_meta = {
    'pt': datetime.utcnow().strftime("%Y-%m-%d"),
    'mlflow_run_id': run_uuid, 
    'model_name': 'cau-text-mUSE',
    'model_version': 'v0.6.0',
}
for k, v in d_topk_meta.items():
    df_nn_top[k] = v

In [46]:
df_nn_top.tail()

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity,pt,mlflow_run_id,model_name,model_version
62643245,t5_zzzyw,rachelnicki,250572,169413,0.602402,246,t5_6d3o35,gladys_,0.818556,2022-09-09,badc44b0e5ac467da14f710da0b410c6,cau-text-mUSE,v0.6.0
62643246,t5_zzzyw,rachelnicki,250572,123722,0.602437,247,t5_56ynmh,incremental_game,0.818535,2022-09-09,badc44b0e5ac467da14f710da0b410c6,cau-text-mUSE,v0.6.0
62643247,t5_zzzyw,rachelnicki,250572,166877,0.602614,248,t5_6cimpb,allrandomxd,0.818428,2022-09-09,badc44b0e5ac467da14f710da0b410c6,cau-text-mUSE,v0.6.0
62643248,t5_zzzyw,rachelnicki,250572,145233,0.602756,249,t5_5vdb71,ups3040,0.818342,2022-09-09,badc44b0e5ac467da14f710da0b410c6,cau-text-mUSE,v0.6.0
62643249,t5_zzzyw,rachelnicki,250572,96088,0.603069,250,t5_44nt53,bcgyvxz421675,0.818154,2022-09-09,badc44b0e5ac467da14f710da0b410c6,cau-text-mUSE,v0.6.0


In [47]:
df_nn_top.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62643250 entries, 0 to 62643249
Data columns (total 13 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   subreddit_id            object 
 1   subreddit_name          object 
 2   seed_ix                 int64  
 3   nn_ix                   int64  
 4   distance                float64
 5   distance_rank           int64  
 6   similar_subreddit_id    object 
 7   similar_subreddit_name  object 
 8   cosine_similarity       float64
 9   pt                      object 
 10  mlflow_run_id           object 
 11  model_name              object 
 12  model_version           object 
dtypes: float64(2), int64(3), object(8)
memory usage: 6.1+ GB


# Save DF to local & log to Mlflow

Instead of saving it to a random location in GCS, save the artifact locally & then log it to the mlflow run as a new artifact.

Make sure to append a timestamp in case we try different ANN approaches.


In [48]:
manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model = get_project_subfolder(
    f"data/models/ann/manual_v060_{manual_model_timestamp}"
)
Path.mkdir(path_this_model, parents=True, exist_ok=True)
path_this_model

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/ann/manual_v060_2022-09-10_003611')

In [49]:
%%time

p_df_subfolder = path_this_model / f'ann_df-{manual_model_timestamp}'
subfolder_df = p_df_subfolder.name

save_pd_df_to_parquet_in_chunks(
    df_nn_top,
    p_df_subfolder,
    write_index=False
)

00:36:12 | INFO | "Converting pandas to dask..."
00:36:56 | INFO | "  35,795.5 MB <- Memory usage"
00:36:56 | INFO | "      66	<- target Dask partitions	  550.0 <- target MB partition size"


CPU times: user 2min 25s, sys: 17.6 s, total: 2min 42s
Wall time: 1min 56s
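
The partition count in the log is consistent with dividing the in-memory size by the target partition size and rounding up. The sketch below reproduces that arithmetic; the exact rounding used inside `save_pd_df_to_parquet_in_chunks` is an assumption, and `target_partitions` is an illustrative name.

```python
import math


def target_partitions(memory_mb: float, target_mb_per_partition: float = 550.0) -> int:
    """Number of partitions so that each holds roughly target_mb_per_partition."""
    return max(1, math.ceil(memory_mb / target_mb_per_partition))


# 35,795.5 MB / 550 MB = 65.08..., rounded up to 66 partitions as in the log above
print(target_partitions(35_795.5))
```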


### Log to mlflow

In [50]:
%%time

d_mlflow_paths = dict()
info(f"Start logging parquet to mlflow...")
with mlflow.start_run(run_id=run_uuid) as run:
    mlflow.log_artifacts(str(p_df_subfolder), subfolder_df)
    # get path to the parquet folder so that we can reference it later
    d_mlflow_paths['mlflow_artifact_df'] = mlflow.get_artifact_uri(
        artifact_path=f"{subfolder_df}"
    )
info(f"Logging artifact complete!")

00:39:01 | INFO | "Logging artifact complete!"


CPU times: user 2.3 s, sys: 2.66 s, total: 4.96 s
Wall time: 53.2 s


In [51]:
d_mlflow_paths

{'mlflow_artifact_df': 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_df-2022-09-10_003611'}

# Save to JSON for BigQuery

My code below is INCORRECT as of 2022-11-09 (sigh).
Need to fix:
- WANTED: the nested similar subreddits as a list of dictionaries
- Current/wrong: a single dictionary where each field is a list

```python
sim_srs = []
for sim_sr, sim_score in sim_sr_pairs[1:]:
    sim_sr_dict = {
        "subreddit_name": sim_sr,
        "subreddit_id": subreddit2id[sim_sr],
        "score": sim_score.astype(float),
    }
    sim_srs.append(sim_sr_dict)

line_dict["similar_subreddit"] = sim_srs

```


See example of format we want here:
https://github.snooguts.net/reddit/gazette-models/blob/cf324c18d974d0b01bb40c71c7f6425d7ff16576/similar_subreddit/embeddings/local_write.py#L32

```python
def write_similar_subreddit_file(
    date_today: str,
    model_name: str,
    model_version: str,
    filename_path_top_k: Path,
    topk_dict: Dict,
    subreddit2id: Dict,
) -> List:
    with open(filename_path_top_k, "w") as f:
        for sr, sim_sr_pairs in topk_dict.items():
            line_dict: Dict[str, Any] = dict()
            if sim_sr_pairs:  # make sure subreddit list is not empty
                line_dict["pt"] = date_today
                line_dict["model_name"] = model_name
                line_dict["model_version"] = model_version
                line_dict["subreddit_name"] = sr
                line_dict["subreddit_id"] = subreddit2id[sr]

                if sr != sim_sr_pairs[0][0]:
                    raise ValueError(
                        f"Inconsistent subreddit name {sim_sr_pairs[0][0]} with searched name {sr}"
                    )

                sim_srs = []
                for sim_sr, sim_score in sim_sr_pairs[1:]:
                    sim_sr_dict = {
                        "subreddit_name": sim_sr,
                        "subreddit_id": subreddit2id[sim_sr],
                        "score": sim_score.astype(float),
                    }
                    sim_srs.append(sim_sr_dict)

                line_dict["similar_subreddit"] = sim_srs

                line = json.dumps(line_dict)
                f.write(line + "\n")
```
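
A minimal sketch of how one group from `df_nn_top` could be serialized in the wanted list-of-dictionaries shape. The column names come from the dataframe in this notebook; `seed_to_ndjson_line` is a hypothetical helper, and keeping `cosine_similarity` and `distance_rank` as keys (instead of the single `score` key in the reference code) is an assumption about the target schema.

```python
import json

import pandas as pd


def seed_to_ndjson_line(df_seed: pd.DataFrame, meta: dict) -> str:
    """Serialize one seed subreddit and its neighbors as a single ndJSON line."""
    similar = [
        {
            "subreddit_id": sid,
            "subreddit_name": name,
            "cosine_similarity": float(score),
            "distance_rank": int(rank),
        }
        for sid, name, score, rank in zip(
            df_seed["similar_subreddit_id"].tolist(),
            df_seed["similar_subreddit_name"].tolist(),
            df_seed["cosine_similarity"].tolist(),
            df_seed["distance_rank"].tolist(),
        )
    ]
    line = {
        **meta,
        "subreddit_id": str(df_seed["subreddit_id"].iloc[0]),
        "subreddit_name": str(df_seed["subreddit_name"].iloc[0]),
        "similar_subreddit": similar,  # list of dicts, not a dict of lists
    }
    return json.dumps(line)


# Two rows copied from the 'finanzen' quick check
toy = pd.DataFrame({
    "subreddit_id": ["t5_35m5e"] * 2,
    "subreddit_name": ["finanzen"] * 2,
    "similar_subreddit_id": ["t5_5txdoj", "t5_37aoh"],
    "similar_subreddit_name": ["finanzenat", "vosfinances"],
    "cosine_similarity": [0.91896, 0.881604],
    "distance_rank": [1, 2],
})
ndjson_line = seed_to_ndjson_line(toy, {"pt": "2022-09-09", "model_version": "v0.6.0"})
print(ndjson_line)
```

In BigQuery terms this shape corresponds to a repeated record, one entry per similar subreddit, rather than a single record holding four parallel arrays.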

In [62]:
%%time

p_local_json = path_this_model / f'ann_ndjson-{manual_model_timestamp}'
Path.mkdir(p_local_json, exist_ok=True, parents=True)
subfolder_json = p_local_json.name

f_local_json_name = f"ann_ndjson-{df_nn_top['subreddit_id'].nunique()}_subreddits.json"
f_local_json_full = p_local_json / f_local_json_name
# If we run this multiple times, make sure we don't append duplicated lines
try:
    info(f"Deleting existing file...")
    f_local_json_full.unlink()
except FileNotFoundError as e:
    info(f"NVM, file does not exist yet...\n {e}")

prefix_similar_sub = 'similar'


info(f"Start saving df as ndJSON...")
with open(f_local_json_full, 'w') as f:
    for seed_sub_id_, df_seed_ in tqdm(df_nn_top.groupby('subreddit_id'), mininterval=2):

        d_seed = {
            **d_topk_meta,
            **{
                'subreddit_id': seed_sub_id_,
                'subreddit_name': str(df_seed_['subreddit_name'].values[0]),
                'similar_subreddit': {
                    'subreddit_id': list(df_seed_[f'{prefix_similar_sub}_subreddit_id']),
                    'subreddit_name': list(df_seed_[f'{prefix_similar_sub}_subreddit_name']),
                    'cosine_similarity': list(df_seed_['cosine_similarity']),
                    'distance_rank': list(df_seed_['distance_rank']),
                }
            }
        }
        f.write(json.dumps(d_seed) + "\n")
info(f"Done saving as ndJSON")

00:50:01 | INFO | "Deleting existing file..."
00:50:01 | INFO | "NVM, file does not exist yet...
 [Errno 2] No such file or directory: '/home/jupyter/subreddit_clustering_i18n/data/models/ann/manual_v060_2022-09-10_003611/ann_ndjson-2022-09-10_003611/ann_ndjson-250573_subreddits.json'"
00:50:01 | INFO | "Start saving df as ndJSON..."
100%|██████████| 250573/250573 [03:24<00:00, 1223.17it/s]
00:53:33 | INFO | "Done saving as ndJSON"


CPU times: user 3min 25s, sys: 9.43 s, total: 3min 34s
Wall time: 3min 38s


In [63]:
%%time
# log to mlflow

with mlflow.start_run(run_id=run_uuid) as run:
    mlflow.log_artifacts(str(p_local_json), subfolder_json)
    # get path to JSON file so that we can create a table from it
    d_mlflow_paths['mlflow_artifact_json'] = mlflow.get_artifact_uri(
        artifact_path=f"{subfolder_json}/{f_local_json_name}"
    )
info(f"Logging artifact complete!")

00:54:17 | INFO | "Logging artifact complete!"


CPU times: user 2.04 s, sys: 2.52 s, total: 4.57 s
Wall time: 43.9 s


In [64]:
d_mlflow_paths['mlflow_artifact_json']

'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson-2022-09-10_003611/ann_ndjson-250573_subreddits.json'

# Upload JSON to BQ

Using `bq load` won't work with a JSON schema in BQ.

Instead, let's try using the python client. NOTE: the VM needs to be authenticated with an account that has the correct read & write access, e.g.:
```bash
# login
gcloud auth application-default login

# logout
gcloud auth application-default revoke
```

In [65]:
info(f"Creating table from file:\n{d_mlflow_paths['mlflow_artifact_json']}")

load_data_to_bq_table(
    uri=d_mlflow_paths['mlflow_artifact_json'],
    bq_project='reddit-employee-datasets',
    bq_dataset='david_bermejo',
    bq_table_name='cau_similar_subreddits_by_text',
    schema=similar_sub_schema(),
    partition_column='pt',
    table_description=(
        "Table with most similar subreddits by the text (posts & comments) in each sub."
        "  It works across 16 languages. So finance (English), Finanzen(German), & financia(Spanish) will be clustered together."
        "  See wiki: https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/CA+Embeddings+Topic+Model"
    ),
    update_table_description=True,
)

00:54:17 | INFO | "Creating table from file:
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson-2022-09-10_003611/ann_ndjson-250573_subreddits.json"
00:54:18 | INFO | "Loading data to table:
  reddit-employee-datasets.david_bermejo.cau_similar_subreddits_by_text"
00:54:19 | INFO | "Table reddit-employee-datasets.david_bermejo.cau_similar_subreddits_by_text already exist"
00:54:19 | INFO | "  0 rows in table BEFORE adding data"
00:55:28 | INFO | "Updating subreddit description from:
  Table with most similar subreddits by the text (posts & comments) in each sub.  It works across 16 languages. So finance (English), Finanzen(German), & financia(Spanish) will be clustered together.  See wiki: https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/CA+Embeddings+Topic+Model
to:
  Table with most similar subreddits by the text (posts & comments) in each sub.  It works across 16 languages. So finance (English), Finanzen(German),

In [None]:
total_fxn_time = elapsed_time(start_time=t_start_job, log_label='Total fxn time', verbose=True)
mlflow.log_metric(
    'time_fxn-full_ann_fxn_minutes',
    total_fxn_time / timedelta(minutes=1)
)

mlflow.end_run()

# Check some example outputs
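
The checks in this section all repeat the same filter, so a small helper can cut the copy-paste and fail loudly when a name is missing from the table. This is a sketch: `top_similar` is a hypothetical helper and the toy dataframe only mimics the columns of `df_nn_top`.

```python
import pandas as pd


def top_similar(df_nn: pd.DataFrame, subreddit_name: str, n: int = 15) -> pd.DataFrame:
    """Return the n closest neighbors for one seed subreddit, best rank first."""
    mask = df_nn["subreddit_name"] == subreddit_name
    if not mask.any():
        raise KeyError(f"{subreddit_name!r} is not a seed subreddit in this table")
    return df_nn.loc[mask].sort_values("distance_rank").head(n)


# Toy rows with the same column names as df_nn_top
toy_nn = pd.DataFrame({
    "subreddit_name": ["finanzen", "finanzen", "mexico"],
    "distance_rank": [2, 1, 1],
    "similar_subreddit_name": ["vosfinances", "finanzenat", "askmexico"],
})
result = top_similar(toy_nn, "finanzen", n=5)
print(result)
```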

In [35]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'ich_iel']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
16526500,t5_37k29,ich_iel,66106,19931,0.526296,1,t5_2qmr6,aeiou,0.861506
16526501,t5_37k29,ich_iel,66106,43743,0.536011,2,t5_2vlhq,kreiswichs,0.856346
16526502,t5_37k29,ich_iel,66106,70082,0.565085,3,t5_39uv3,kopiernudeln,0.840339
16526503,t5_37k29,ich_iel,66106,248077,0.583634,4,t5_w2zxy,okoidawappler,0.829686
16526504,t5_37k29,ich_iel,66106,69147,0.589393,5,t5_39bxv,ik_ihe,0.826308
16526505,t5_37k29,ich_iel,66106,56687,0.614496,6,t5_318w4,cirkeltrek,0.811198
16526506,t5_37k29,ich_iel,66106,2179,0.614739,7,t5_17d5ey,ichbin40undlustig,0.811048
16526507,t5_37k29,ich_iel,66106,71817,0.617536,8,t5_3b2y1,einfach_posten,0.809324
16526508,t5_37k29,ich_iel,66106,244226,0.618018,9,t5_ofkj1,okbrudimongo,0.809027
16526509,t5_37k29,ich_iel,66106,80224,0.632797,10,t5_3hn0l,deutschememes,0.799784


In [85]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'ireland']
    .head(15)
)

NameError: name 'df_nn_top' is not defined

In [36]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'vegetarischde']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
25173750,t5_4c06em,vegetarischde,100695,66507,0.343746,1,t5_37ruc,vegande,0.940919
25173751,t5_4c06em,vegetarischde,100695,5347,0.471801,2,t5_25v3wn,kreisvegs,0.888702
25173752,t5_4c06em,vegetarischde,100695,18865,0.50797,3,t5_2qhzr,vegetarianism,0.870983
25173753,t5_4c06em,vegetarischde,100695,38058,0.52908,4,t5_2u0f5t,vegfr,0.860037
25173754,t5_4c06em,vegetarischde,100695,18699,0.544144,5,t5_2qhpm,vegan,0.851953
25173755,t5_4c06em,vegetarischde,100695,40908,0.55735,6,t5_2uquu,askvegans,0.84468
25173756,t5_4c06em,vegetarischde,100695,83645,0.558577,7,t5_3jwb3,veganita,0.843996
25173757,t5_4c06em,vegetarischde,100695,43112,0.565111,8,t5_2ven0,antivegan,0.840325
25173758,t5_4c06em,vegetarischde,100695,120,0.573935,9,t5_109235,exvegans,0.835299
25173759,t5_4c06em,vegetarischde,100695,61425,0.574836,10,t5_33xgk,veganuk,0.834782


In [37]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'antivegan']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
10778000,t5_2ven0,antivegan,43112,29526,0.289227,1,t5_2sgfh,vegancirclejerk,0.958174
10778001,t5_2ven0,antivegan,43112,157083,0.330142,2,t5_675dds,vegancirclejerkchat,0.945503
10778002,t5_2ven0,antivegan,43112,55981,0.33087,3,t5_30wk6,veganmemes,0.945262
10778003,t5_2ven0,antivegan,43112,18699,0.33245,4,t5_2qhpm,vegan,0.944738
10778004,t5_2ven0,antivegan,43112,242474,0.339799,5,t5_kycqf,veganforcirclejerkers,0.942268
10778005,t5_2ven0,antivegan,43112,40908,0.366645,6,t5_2uquu,askvegans,0.932786
10778006,t5_2ven0,antivegan,43112,120,0.371951,7,t5_109235,exvegans,0.930826
10778007,t5_2ven0,antivegan,43112,18865,0.372444,8,t5_2qhzr,vegetarianism,0.930643
10778008,t5_2ven0,antivegan,43112,24096,0.42114,9,t5_2ribr,veganism,0.91132
10778009,t5_2ven0,antivegan,43112,28190,0.429534,10,t5_2sa7z,debateavegan,0.90775


In [38]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'mexico']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
4697500,t5_2qhv7,mexico,18790,117015,0.42711,1,t5_4ywzju,askmexico,0.908789
4697501,t5_2qhv7,mexico,18790,19745,0.500914,2,t5_2qm06,monterrey,0.874542
4697502,t5_2qhv7,mexico,18790,28453,0.548083,3,t5_2sbh1,mexicali,0.849802
4697503,t5_2qhv7,mexico,18790,40741,0.57182,4,t5_2up3k,ticos,0.836511
4697504,t5_2qhv7,mexico,18790,84974,0.575123,5,t5_3la4d,mujico,0.834617
4697505,t5_2qhv7,mexico,18790,36542,0.582951,6,t5_2tocwj,tfwyouliveinmexico,0.830084
4697506,t5_2qhv7,mexico,18790,28259,0.594426,7,t5_2samk,guatemala,0.823329
4697507,t5_2qhv7,mexico,18790,15439,0.611576,8,t5_2lxxle,mexicow,0.812987
4697508,t5_2qhv7,mexico,18790,27163,0.617722,9,t5_2s5noh,mexico4t,0.80921
4697509,t5_2qhv7,mexico,18790,37505,0.626232,10,t5_2tw1p,mexicocity,0.803917


In [39]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'memesenespanol']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
0,t5_1009a3,memesenespanol,0,240461,0.531268,1,t5_hc3xv,memesespanol,0.858877
1,t5_1009a3,memesenespanol,0,483,0.561,2,t5_10wycq,memesesp,0.842639
2,t5_1009a3,memesenespanol,0,90209,0.572214,3,t5_3qq2qy,beelcitosmemes,0.836285
3,t5_1009a3,memesenespanol,0,91771,0.588615,4,t5_3wam26,latesitoo,0.826766
4,t5_1009a3,memesenespanol,0,160575,0.650379,5,t5_69coi0,anzutops777oficial,0.788504
5,t5_1009a3,memesenespanol,0,696,0.650841,6,t5_1178j8,orslokx,0.788203
6,t5_1009a3,memesenespanol,0,249358,0.658051,7,t5_xz9ed,yointerneto,0.783485
7,t5_1009a3,memesenespanol,0,10581,0.659339,8,t5_2e54fb,shitpostesp,0.782636
8,t5_1009a3,memesenespanol,0,111782,0.660614,9,t5_4t6fim,aradiroff,0.781795
9,t5_1009a3,memesenespanol,0,250496,0.662297,10,t5_zvcd0,shitpostbr,0.780681


In [40]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'de']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
845250,t5_22i0,de,3381,102150,0.449787,1,t5_4egnbw,dezwo,0.898846
845251,t5_22i0,de,3381,73325,0.506051,2,t5_3caax,600euro,0.871956
845252,t5_22i0,de,3381,68658,0.513088,3,t5_392ha,asozialesnetzwerk,0.86837
845253,t5_22i0,de,3381,241308,0.527212,4,t5_irnzx,dachschaden,0.861024
845254,t5_22i0,de,3381,83717,0.544782,5,t5_3jxvk,tja,0.851606
845255,t5_22i0,de,3381,20188,0.558626,6,t5_2qo9i,austria,0.843968
845256,t5_22i0,de,3381,43596,0.626461,7,t5_2vk0m,nachrichten,0.803773
845257,t5_22i0,de,3381,18599,0.634985,8,t5_2qhjz,france,0.798397
845258,t5_22i0,de,3381,74106,0.643901,9,t5_3czn3,tokkiefeesboek,0.792696
845259,t5_22i0,de,3381,216418,0.650331,10,t5_6ok6xa,tratschen,0.788535


In [41]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'askfrance']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
16526500,t5_37k29,ich_iel,66106,19931,0.526296,1,t5_2qmr6,aeiou,0.861506
16526501,t5_37k29,ich_iel,66106,43743,0.536011,2,t5_2vlhq,kreiswichs,0.856346
16526502,t5_37k29,ich_iel,66106,70082,0.565085,3,t5_39uv3,kopiernudeln,0.840339
16526503,t5_37k29,ich_iel,66106,248077,0.583634,4,t5_w2zxy,okoidawappler,0.829686
16526504,t5_37k29,ich_iel,66106,69147,0.589393,5,t5_39bxv,ik_ihe,0.826308
16526505,t5_37k29,ich_iel,66106,56687,0.614496,6,t5_318w4,cirkeltrek,0.811198
16526506,t5_37k29,ich_iel,66106,2179,0.614739,7,t5_17d5ey,ichbin40undlustig,0.811048
16526507,t5_37k29,ich_iel,66106,71817,0.617536,8,t5_3b2y1,einfach_posten,0.809324
16526508,t5_37k29,ich_iel,66106,244226,0.618018,9,t5_ofkj1,okbrudimongo,0.809027
16526509,t5_37k29,ich_iel,66106,80224,0.632797,10,t5_3hn0l,deutschememes,0.799784


In [42]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'cfb']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
4952750,t5_2qm9d,cfb,19811,40995,0.350282,1,t5_2urol,cfbmemes,0.938651
4952751,t5_2qm9d,cfb,19811,50234,0.357853,2,t5_2xys7,fsusports,0.935971
4952752,t5_2qm9d,cfb,19811,40002,0.407226,3,t5_2uhr8,notredamefootball,0.917084
4952753,t5_2qm9d,cfb,19811,22423,0.413432,4,t5_2r5u7,ohiostatefootball,0.914537
4952754,t5_2qm9d,cfb,19811,24202,0.42094,5,t5_2rj3j,collegebasketball,0.911405
4952755,t5_2qm9d,cfb,19811,27143,0.448822,6,t5_2s5kg,lsufootball,0.899279
4952756,t5_2qm9d,cfb,19811,27684,0.451051,7,t5_2s81c,wde,0.898277
4952757,t5_2qm9d,cfb,19811,43840,0.462864,8,t5_2vmjf,theb1g,0.892879
4952758,t5_2qm9d,cfb,19811,22673,0.467848,9,t5_2r7qs,huskers,0.890559
4952759,t5_2qm9d,cfb,19811,41916,0.474273,10,t5_2v1s4,utahfootball,0.887533


In [43]:
(
    df_nn_top[df_nn_top['subreddit_name'] == 'finanzen']
    .head(15)
)

Unnamed: 0,subreddit_id,subreddit_name,seed_ix,nn_ix,distance,distance_rank,similar_subreddit_id,similar_subreddit_name,cosine_similarity
15892000,t5_35m5e,finanzen,63568,143647,0.402592,1,t5_5txdoj,finanzenat,0.91896
15892001,t5_35m5e,finanzen,63568,65652,0.486613,2,t5_37aoh,vosfinances,0.881604
15892002,t5_35m5e,finanzen,63568,765,0.487687,3,t5_11cinh,befire,0.881081
15892003,t5_35m5e,finanzen,63568,81921,0.49109,4,t5_3isqn,italiapersonalfinance,0.879415
15892004,t5_35m5e,finanzen,63568,34682,0.50908,5,t5_2tasy,personalfinancecanada,0.870419
15892005,t5_35m5e,finanzen,63568,68503,0.523805,6,t5_38zrx,personalfinancenz,0.862814
15892006,t5_35m5e,finanzen,63568,244203,0.527583,7,t5_oe819,personalfinanceza,0.860828
15892007,t5_35m5e,finanzen,63568,45481,0.527723,8,t5_2w5jv,eupersonalfinance,0.860754
15892008,t5_35m5e,finanzen,63568,85195,0.532999,9,t5_3ljid,europefire,0.857956
15892009,t5_35m5e,finanzen,63568,40661,0.53488,10,t5_2uo3q,ausfinance,0.856952
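
The `distance` and `cosine_similarity` columns in the tables above are consistent with Annoy's angular metric, where distance is `sqrt(2 * (1 - cos))`. Below is a minimal sketch of the conversion, assuming the index was built with `metric='angular'`. The helper names are hypothetical and not part of `subclu`.

```python
import math


def angular_distance_to_cosine(distance: float) -> float:
    """Convert an Annoy angular distance to cosine similarity.

    Annoy defines angular distance as sqrt(2 * (1 - cos)),
    so cos = 1 - distance^2 / 2.
    """
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular_distance(cosine_similarity: float) -> float:
    """Inverse of the conversion above."""
    return math.sqrt(2.0 * (1.0 - cosine_similarity))


# First `finanzen` row above: distance=0.402592, cosine_similarity=0.91896
cos_sim = angular_distance_to_cosine(0.402592)
print(round(cos_sim, 5))
```

This is handy when a downstream consumer expects a similarity threshold while the index reports distances.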


# Test `search_k`
`search_k` trades accuracy for speed: larger values inspect more of the index and return more accurate neighbors, but take longer to compute. In Annoy, `search_k=-1` falls back to the default budget of `n * n_trees` nodes, where `n` is the number of neighbors requested (assuming our wrapper passes the value through unchanged).

Recommendation: use `search_k=-1`.

The examples below show a small time difference, and with a low `search_k` (e.g., `search_k<=3`) we can miss neighbors even within the top 10 results.
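
For context on the comparison that follows, Annoy's README describes `search_k` as the number of nodes to inspect, defaulting to `n * n_trees` when it is omitted or set to `-1`. Here is a rough sketch of how that budget resolves. The function name is hypothetical, and `n_trees = 200` and 20 neighbors are assumed values for illustration.

```python
def effective_search_k(search_k: int, n_neighbors: int, n_trees: int) -> int:
    """Return the node budget Annoy would use for a query.

    Per Annoy's README, search_k=-1 falls back to n_neighbors * n_trees;
    any other value is used as given.
    """
    if search_k == -1:
        return n_neighbors * n_trees
    return search_k


# Assumed values for illustration only.
n_trees = 200
for sk in (-1, 199, 20, 1):
    print(sk, "->", effective_search_k(sk, n_neighbors=20, n_trees=n_trees))
```

Under these assumptions the default budget is far larger than any of the explicit small values, which would explain why the default run finds neighbors that the others miss.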

In [29]:
%%time

n_test_i = 212
nn_index.get_top_n_by_item(n_test_i, k=9, search_k=-1, include_distances=True)

CPU times: user 277 ms, sys: 32.1 ms, total: 310 ms
Wall time: 308 ms


Unnamed: 0,subreddit_id_a,subreddit_name_a,distance_rank,subreddit_id_b,subreddit_name_b,distance
0,t5_10dzqu,godawfulmovies,0,t5_10dzqu,godawfulmovies,0.0
1,t5_10dzqu,godawfulmovies,1,t5_me7ba,podcastsharing,0.69861
2,t5_10dzqu,godawfulmovies,2,t5_2u29p,filmjunk,0.705137
3,t5_10dzqu,godawfulmovies,3,t5_2c7q0h,podcastpromoting,0.716215
4,t5_10dzqu,godawfulmovies,4,t5_n99oj,findthepathpodcast,0.716643
5,t5_10dzqu,godawfulmovies,5,t5_t6jv7,sinisterhood,0.716963
6,t5_10dzqu,godawfulmovies,6,t5_2zzeu,highersidechats,0.720665
7,t5_10dzqu,godawfulmovies,7,t5_np3is,letsgo2courtpodcast,0.721888
8,t5_10dzqu,godawfulmovies,8,t5_2t8p3,wehatemovies,0.723386


In [30]:
%%time
nn_index.get_top_n_by_item(n_test_i, k=9, search_k=1, include_distances=True)

CPU times: user 244 ms, sys: 0 ns, total: 244 ms
Wall time: 243 ms


Unnamed: 0,subreddit_id_a,subreddit_name_a,distance_rank,subreddit_id_b,subreddit_name_b,distance
0,t5_10dzqu,godawfulmovies,0,t5_10dzqu,godawfulmovies,0.0
1,t5_10dzqu,godawfulmovies,1,t5_2u29p,filmjunk,0.705137
2,t5_10dzqu,godawfulmovies,2,t5_2c7q0h,podcastpromoting,0.716215
3,t5_10dzqu,godawfulmovies,3,t5_n99oj,findthepathpodcast,0.716643
4,t5_10dzqu,godawfulmovies,4,t5_2zzeu,highersidechats,0.720665
5,t5_10dzqu,godawfulmovies,5,t5_np3is,letsgo2courtpodcast,0.721888
6,t5_10dzqu,godawfulmovies,6,t5_35xxi9,headgumpodcast,0.738195
7,t5_10dzqu,godawfulmovies,7,t5_2vo38,harmontown,0.741415
8,t5_10dzqu,godawfulmovies,8,t5_26gz8w,theteamhouse,0.746122
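
To put a number on the difference between the two outputs above, one option is to compute the share of default-search neighbors that the `search_k=1` run also returns. This is a small sketch with a hypothetical helper, `recall_at_k`. The neighbor names are copied from the two tables, with the rank-0 self-match dropped.

```python
def recall_at_k(reference: list, candidate: list) -> float:
    """Fraction of reference neighbors that also appear in candidate."""
    if not reference:
        return 0.0
    return len(set(reference) & set(candidate)) / len(reference)


# Neighbors from the search_k=-1 output (ranks 1-8).
default_search = [
    "podcastsharing", "filmjunk", "podcastpromoting", "findthepathpodcast",
    "sinisterhood", "highersidechats", "letsgo2courtpodcast", "wehatemovies",
]
# Neighbors from the search_k=1 output (ranks 1-8).
search_k_1 = [
    "filmjunk", "podcastpromoting", "findthepathpodcast", "highersidechats",
    "letsgo2courtpodcast", "headgumpodcast", "harmontown", "theteamhouse",
]

recall = recall_at_k(default_search, search_k_1)
missed = [name for name in default_search if name not in search_k_1]
print(f"recall@8: {recall:.3f}")
print("missed:", missed)
```

The same helper could be applied per column of a side-by-side comparison frame to summarize how recall changes with `search_k`.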


In [31]:
top_k_test_ = 20
cols_drop_ = ['subreddit_id_a', 'subreddit_id_b', 'distance']
cols_append_ = ['subreddit_name_b',]
df_compare_sk = nn_index.get_top_n_by_item(
    n_test_i, k=top_k_test_, search_k=-1, include_distances=True
).drop(cols_drop_, axis=1)

# Re-run the query with progressively smaller `search_k` values.
# Each run appends one column, suffixed with the `search_k` used.
for k_ in [int(0.998 * n_trees), int(0.85 * n_trees), 
           int(0.5 * n_trees), min([200, int(0.1 * n_trees)]),
           1]:
    df_compare_sk = pd.concat(
        [
            df_compare_sk,
            nn_index.get_top_n_by_item(
                n_test_i, k=top_k_test_, search_k=k_, include_distances=True
            )[cols_append_].rename(columns={c: f"{c}_{k_}" for c in cols_append_})
        ],
        axis=1,
    )
df_compare_sk

Unnamed: 0,subreddit_name_a,distance_rank,subreddit_name_b,subreddit_name_b_199,subreddit_name_b_170,subreddit_name_b_100,subreddit_name_b_20,subreddit_name_b_1
0,godawfulmovies,0,godawfulmovies,godawfulmovies,godawfulmovies,godawfulmovies,godawfulmovies,godawfulmovies
1,godawfulmovies,1,podcastsharing,filmjunk,filmjunk,filmjunk,filmjunk,filmjunk
2,godawfulmovies,2,filmjunk,podcastpromoting,podcastpromoting,podcastpromoting,podcastpromoting,podcastpromoting
3,godawfulmovies,3,podcastpromoting,findthepathpodcast,findthepathpodcast,findthepathpodcast,findthepathpodcast,findthepathpodcast
4,godawfulmovies,4,findthepathpodcast,highersidechats,highersidechats,highersidechats,highersidechats,highersidechats
5,godawfulmovies,5,sinisterhood,letsgo2courtpodcast,letsgo2courtpodcast,letsgo2courtpodcast,letsgo2courtpodcast,letsgo2courtpodcast
6,godawfulmovies,6,highersidechats,headgumpodcast,headgumpodcast,headgumpodcast,headgumpodcast,headgumpodcast
7,godawfulmovies,7,letsgo2courtpodcast,harmontown,harmontown,harmontown,harmontown,harmontown
8,godawfulmovies,8,wehatemovies,theteamhouse,theteamhouse,theteamhouse,theteamhouse,theteamhouse
9,godawfulmovies,9,weeklyplanetpodcast,headgum,headgum,headgum,headgum,headgum


In [None]:
LEGACY

## Load metadata to apply other filters [optional]

If we want to filter subreddits based on other data, we'll need to pull data from mlflow or BigQuery.


In [51]:
# # gcs_sub_embeddings = cfg_agg_embeddings.config_dict['data_embeddings_to_aggregate']['subreddit_desc_folder_embeddings']
# # print(gcs_sub_embeddings)
# gcs_post_comment_embeddings = "mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_posts_agg_c1"


# # gsutil is usually faster than the python library.
# remote_bucket_and_key = f"i18n-subreddit-clustering/{gcs_post_comment_embeddings}"
# remote_gs_path = f'gs://{remote_bucket_and_key}'

# # Need to remove the last part of the local path otherwise we'll get duplicate subfolders:
# #. top/2021-12-14/2021-12-14 instead of top/2021-12-14
# local_f = f"/home/jupyter/subreddit_clustering_i18n/data/local_cache/{'/'.join(remote_bucket_and_key.split('/')[:-1])}"
# Path(local_f).mkdir(parents=True, exist_ok=True)

# print(f"Remote path:\n  {remote_gs_path}")
# print(f"Local path:\n  {local_f}\n")

# # print(f"gsutil -o GSUtil:parallel_thread_count=15 -o GSUtil:parallel_process_count=15 -m cp -r -n {remote_gs_path} {local_f} \n")

# # !gsutil -o GSUtil:parallel_thread_count=11 -o GSUtil:parallel_process_count=11 -m cp -r -n $remote_gs_path $local_f

In [54]:
# %%time

# df_post_embeddings = mlf.read_run_artifact(
#     run_id=run_uuid,
#     artifact_folder='df_posts_agg_c1',
#     read_function='pd_parquet',
#     verbose=False,
#     columns=['subreddit_id', 'post_id']
# )
# print(df_post_embeddings.shape)

In [53]:
# df_post_embeddings.iloc[:5, :15]