# Purpose

**2022-11-18: v0.6.0** update
<br>For some reason the previous table with v0.6.0 embeddings was deleted. Use this notebook to re-upload the embeddings for v0.6.0 so that we can compare or backfill if needed.

---

**2022-08-30: v0.6.0**
<br>The default parquet format for my embeddings (1 column per embedding dimension) is not favored for BigQuery & other team standards.

The preferred format is: 1 column that has repeated records. For example:
- `data-prod-165221.ml_content.subreddit_embeddings_ft2` 
    - [console link](https://console.cloud.google.com/bigquery?project=data-science-prod-218515&ws=!1m10!1m4!4m3!1sdata-prod-165221!2sml_content!3ssimilar_subreddit_ft2!1m4!4m3!1sdata-prod-165221!2sml_content!3ssubreddit_embeddings_ft2)
    - github link
        - https://github.snooguts.net/reddit/gazette-models/blob/master/similar_subreddit/embeddings/__main__.py#L105-L112

In this notebook we convert a dataframe into a newline-delimited JSON file.
<br>With pandas we can vectorize this conversion instead of having to loop through each row (subreddit) individually.
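
As a rough illustration of the vectorized conversion described above, the sketch below collapses one-column-per-dimension embeddings into a single list-valued column and serializes the result as newline-delimited JSON. It is a minimal sketch on toy data, not the `reshape_embeddings_to_ndjson` implementation from `subclu`; the helper name `collapse_embedding_columns` and the demo values are assumptions made for illustration.

```python
import json

import pandas as pd


def collapse_embedding_columns(df, embedding_col_prefix="embeddings_", out_col="embeddings"):
    """Hypothetical helper: move all embedding columns into one list-valued column."""
    emb_cols = [c for c in df.columns if c.startswith(embedding_col_prefix)]
    df_out = df.drop(columns=emb_cols).copy()
    # Convert the whole 2D block at once instead of looping over rows in Python
    df_out[out_col] = pd.Series(df[emb_cols].to_numpy().tolist(), index=df.index)
    return df_out


# Toy input: 2 subreddits, 2 embedding dimensions (values are made up)
df_demo = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "subreddit_name": ["aaa", "bbb"],
    "embeddings_0": [0.1, 0.3],
    "embeddings_1": [0.2, 0.4],
})

df_repeated = collapse_embedding_columns(df_demo)
# orient="records" + lines=True writes one JSON object per line (NDJSON)
ndjson_text = df_repeated.to_json(orient="records", lines=True)

# Parse the text back to confirm each line is a standalone JSON record
rows = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
print(rows[0])
```

Each output line is one subreddit with its embedding as a JSON array, which is the shape a `REPEATED` BigQuery field expects.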

In this notebook we load 2 separate embedding flavors, selected by `embeddings_artifact_path`, and upload them to BQ in two separate partitions.

---

```python
sr_embedding_dict = {
    "pt": date_today,
    "model_name": MODEL_NAME,
    "model_version": MODEL_VERSION,
    "subreddit_id": sr_dict["subreddit_id"],
    "subreddit_name": subreddit_name_lowercase,
    "embedding": sr_embedding.tolist(),
}
```

In BQ:
```
Field name      Type    Mode      Description
model_name      STRING  NULLABLE  Model name
model_version   STRING  NULLABLE  Model version
subreddit_id    STRING  NULLABLE  Subreddit id
subreddit_name  STRING  NULLABLE  Lower case subreddit name
**embedding     FLOAT   REPEATED  Subreddit embeddings
```
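
To make the table above concrete, the following sketch writes the same fields in BigQuery's JSON schema layout (a list of `name`/`type`/`mode`/`description` objects) and checks one record against it. It is an assumption-laden illustration: it is not the `embeddings_schema()` function from `subclu`, and `check_record` and the demo record are hypothetical.

```python
# Fields copied from the schema table above, in BigQuery's JSON schema layout
EMBEDDINGS_SCHEMA = [
    {"name": "model_name", "type": "STRING", "mode": "NULLABLE", "description": "Model name"},
    {"name": "model_version", "type": "STRING", "mode": "NULLABLE", "description": "Model version"},
    {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE", "description": "Subreddit id"},
    {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE", "description": "Lower case subreddit name"},
    {"name": "embedding", "type": "FLOAT", "mode": "REPEATED", "description": "Subreddit embeddings"},
]

# Python types accepted for each BigQuery type used in this schema
PY_TYPES = {"STRING": str, "FLOAT": (int, float)}


def check_record(record, schema):
    """Hypothetical helper: return a list of problems found in one record."""
    problems = []
    for field in schema:
        name = field["name"]
        value = record.get(name)
        py_type = PY_TYPES[field["type"]]
        if field["mode"] == "REPEATED":
            # A REPEATED field must be a list whose items all have the field type
            if not isinstance(value, list) or not all(isinstance(v, py_type) for v in value):
                problems.append(f"{name}: expected a list of {field['type']}")
        elif value is not None and not isinstance(value, py_type):
            problems.append(f"{name}: expected {field['type']} or null")
    return problems


good = {
    "model_name": "cau-text-mUSE",
    "model_version": "v0.6.0",
    "subreddit_id": "t5_demo",      # made-up ID
    "subreddit_name": "demo",       # made-up name
    "embedding": [0.1, 0.2],        # made-up values
}
bad = dict(good, embedding="0.1,0.2")  # a string instead of a list

print(check_record(good, EMBEDDINGS_SCHEMA))
print(check_record(bad, EMBEDDINGS_SCHEMA))
```

A check like this is cheap to run on a few lines of the NDJSON file before starting a load job.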

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime, timedelta
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    elapsed_time,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set Local model path (for saving)

NOT NEEDED in this case because we're reading embeddings that have already been reshaped.

In [4]:
# manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
# path_this_model = get_project_subfolder(
#     f"data/models/aggregate_embeddings/manual_v060_{manual_model_timestamp}"
# )
# Path.mkdir(path_this_model, parents=True, exist_ok=True)
# path_this_model

# Load 1st config for embeddings to reshape and where to save them

The embedding aggregation should've been logged to `mlflow` so we should be able to
- make calls to mlflow to get the embeddings
- add the new embeddings format to the original job





In [5]:
cfg_reshape_embeddings = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0.yaml',
    config_path="../config",
)

In [6]:
for k_, v_ in cfg_reshape_embeddings.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
    dataset_name: v0.6.0 inputs. ~110k seed subreddits, ~340k with 3+ posts, ~700k total subreddits
    bucket_name: i18n-subreddit-clustering
    folder_subreddits_text_and_meta: i18n_topic_model_batch/runs/20220811/subreddits/text
    folder_posts_text_and_meta: i18n_topic_model_batch/runs/20220811/posts
    folder_comments_text_and_meta: i18n_topic_model_batch/runs/20220811/comments
    folder_post_and_comment_text_and_meta: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all
data_embeddings_to_aggregate:
    bucket_embeddings: i18n-subreddit-clustering
    post_and_comments_folder_embeddings: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all/embedding/2022-08-11_084218
    subreddit_desc_folder_embeddings: i18n_topic_model_batch/runs/20220811/subreddits/text/embedding/2022-08-11_082859
    col_subreddit_id: subreddit_id
aggregate_params:
    min_post_and_comment_text_len: 3
    agg_post_post_and_comment_wei

# Start MLflow
We need it to get the paths for artifacts and load subreddit embeddings (based on the mlflow run ID).

In [8]:
mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings.config_dict['mlflow_tracking_uri'])

In [30]:
# mlf.list_experiment_meta(output_format='pandas').tail(10)

In [21]:
%%time

df_mlf = mlf.search_all_runs(35)
print(df_mlf.shape)

(4, 43)
CPU times: user 53.3 ms, sys: 8.6 ms, total: 61.9 ms
Wall time: 61 ms


In [22]:
# only show finished runs
df_mlf[df_mlf['status'] == "FINISHED"].iloc[:5, :13]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_v_post_comments-cols,metrics.df_v_post_comments-rows,metrics.time_fxn-data_loading_time,metrics.time_fxn-full_aggregation_fxn_minutes,metrics.time_fxn-df_subs_agg_c1_uw,metrics.df_v_subs-cols,metrics.df_subs_agg_c1-cols
0,badc44b0e5ac467da14f710da0b410c6,35,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts,2022-08-16 08:41:53.006000+00:00,2022-09-10 00:54:17.303000+00:00,515.0,51906348.0,3.698822,820.674805,15.926672,514.0,515.0
1,ca79765b72c5428395b02926612d85fd,35,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/35/ca79765b72c5428395b02926612d85fd/artifacts,2022-08-16 08:41:31.162000+00:00,2022-08-31 03:13:27.187000+00:00,515.0,51906348.0,3.719202,,,514.0,


## Show artifacts for run in config

'badc44b0e5ac467da14f710da0b410c6'

In [27]:
l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
)
l_artifacts_all = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
)

17:56:03 | INFO | "   293 <- Artifacts to check count"
17:56:03 | INFO | "   293 <- Artifacts clean count"
17:56:03 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"
17:56:09 | INFO | "   293 <- Artifacts clean count"
17:56:09 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"


In [28]:
l_artifacts_top_level

['ann_df-2022-09-10_003611',
 'ann_df_test-2022-09-09_212342',
 'ann_ndjson-2022-09-10_003611',
 'ann_ndjson_test-2022-09-09_212342',
 'ann_ndjson_test-2022-09-10_003611',
 'df_posts_agg_c1',
 'df_subs_agg_c1',
 'df_subs_agg_c1_ndjson',
 'df_subs_agg_c1_unweighted',
 'df_subs_agg_c1_unweighted_ndjson']

In [35]:
# get path for latest ndjson output
path_ndjson_ = (
    cfg_reshape_embeddings.config_dict['embeddings_artifact_path'] + '_ndjson'
)
# keep the last matching artifact path
ndjson_path_ = [p_ for p_ in l_artifacts_all if path_ndjson_ in p_][-1]
print(ndjson_path_)

'mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_034148.json'
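
Picking the last list element works only if the artifact listing happens to be sorted. A more explicit variant, sketched below, selects the file with the most recent timestamp in its name and turns the result into a `gs://` URI. The helper names `latest_artifact` and `to_gcs_uri` are hypothetical, the older demo path is made up, and the bucket name is taken from the `artifact_uri` column shown earlier.

```python
import re

# Timestamps in the file names look like 2022-08-31_034148, which sorts correctly as text
TS_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2}_\d{6})")


def latest_artifact(paths, folder):
    """Hypothetical helper: newest file (by timestamp in its name) inside `folder`."""
    candidates = []
    for path in paths:
        if f"/{folder}/" not in path:
            continue
        match = TS_PATTERN.search(path.rsplit("/", 1)[-1])
        if match:
            candidates.append((match.group(1), path))
    if not candidates:
        raise ValueError(f"No timestamped files found in folder: {folder}")
    # max() compares the timestamp strings first
    return max(candidates)[1]


def to_gcs_uri(bucket, blob_path):
    """Hypothetical helper: join a bucket and a blob path into a gs:// URI."""
    return f"gs://{bucket}/{blob_path.lstrip('/')}"


demo_paths = [
    # real path from the output above
    "mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/"
    "df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_034148.json",
    # made-up older file, only here to exercise the selection
    "mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/"
    "df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-30_010203.json",
]

newest = latest_artifact(demo_paths, folder="df_subs_agg_c1_unweighted_ndjson")
uri = to_gcs_uri("i18n-subreddit-clustering", newest)
print(uri)
```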

In [36]:
from subclu.models.bq_embedding_schemas import embeddings_schema

In [37]:
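load_data_to_bq_table(
    # `d_paths` and `dict_reshape_config` were never defined in this notebook
    # (the original cell raised a NameError), so use the loaded config instead.
    # Assumption: the artifact lives in the `bucket_output` bucket, as the
    # run's artifact_uri suggests.
    uri=f"gs://{cfg_reshape_embeddings.config_dict['bucket_output']}/{ndjson_path_}",
    bq_project=cfg_reshape_embeddings.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings.config_dict['bq_dataset'],
    bq_table_name=cfg_reshape_embeddings.config_dict['bq_table'],
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings.config_dict['bq_table_description'],
    update_table_description=False,
)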

In [None]:
# LEGACY: the cells below are kept from the earlier run

In [10]:
%%time

df_agg_sub_c = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c.shape)

03:32:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_unweighted"
100%|##########################################| 6/6 [00:00<00:00, 17476.27it/s]
03:32:53 | INFO | "  Parquet files found:     4"
03:32:53 | INFO | "  Parquet files to use:     4"


(771760, 515)
CPU times: user 9.31 s, sys: 4.32 s, total: 13.6 s
Wall time: 7.86 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [11]:
df_agg_sub_c['posts_for_embeddings_count'].describe()

count    771760.000000
mean         67.257111
std         479.863049
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [14]:
# value_counts_and_pcts(
#     df_agg_sub_c['posts_for_embeddings_count'],
#     sort_index=True,
#     sort_index_ascending=True,
# )

In [13]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,227368,29.5%,227368,29.5%
5 posts,23205,3.0%,250573,32.5%
4 posts,33084,4.3%,283657,36.8%
3 posts,56898,7.4%,340555,44.1%
2 posts,125070,16.2%,465625,60.3%
1 post,240338,31.1%,705963,91.5%
0 posts,65797,8.5%,771760,100.0%
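
The "3+ posts" expectation can also be checked with a single boolean mask instead of reading it off the cumulative column. The sketch below uses made-up counts; on the real dataframe the same expression on `posts_for_embeddings_count` should give the 340,555 shown in the "3 posts" row above.

```python
import pandas as pd

# Made-up post counts for 6 subreddits
counts = pd.Series([0, 1, 2, 3, 5, 40], name="posts_for_embeddings_count")

# Boolean mask: True where a subreddit has 3 or more posts
n_3_plus = int((counts >= 3).sum())
pct_3_plus = n_3_plus / len(counts)

print(f"{n_3_plus:,} subreddits with 3+ posts ({pct_3_plus:.1%})")
```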


## Run new function that reshapes & uploads to BQ in a single call


In [15]:
%%time

reshape_embeddings_and_upload_to_bq(
    df_agg_sub_c,
    dict_reshape_config=cfg_reshape_embeddings.config_dict,
    save_path_local_root=path_this_model,
    f_name_prefix='subreddit_embeddings',
    embedding_col_prefix='embeddings_',
)

03:41:48 | INFO | "512 <- # embedding columns found"
03:41:48 | INFO | "(771760, 515) <- Shape of input df"
03:41:49 | INFO | "Metadata cols to add:
  {'mlflow_run_id': 'badc44b0e5ac467da14f710da0b410c6', 'pt': '2022-08-11', 'model_version': 'v0.6.0', 'model_name': 'cau-text-mUSE'}"
03:41:49 | INFO | "Converting embeddings to repeated format..."
03:42:34 | INFO | "(771760, 8) <- Shape of new df before converting to JSON"
03:42:34 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_name', 'model_version', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
03:42:36 | INFO | "Converting embeddings to JSON..."
03:43:52 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-31_033019/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_034148.json"
03:43:58 | INFO | "Logging to run ID: badc44b0e5ac467da14f710da0b410c6, artifact:
  df_subs_agg_c1_unweighted_ndjson"
03:45:13 | INFO

CPU times: user 2min 41s, sys: 38.8 s, total: 3min 19s
Wall time: 5min 56s
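
The log above shows metadata columns being added and the result being written to a timestamped `.json` file. Those two steps can be sketched as follows. This is an illustration only: `add_metadata_and_save` is a hypothetical helper rather than the `subclu` function, the toy dataframe is made up, and the upload and MLflow logging steps are left out.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def add_metadata_and_save(df, metadata, save_dir, f_name_prefix="subreddit_embeddings"):
    """Hypothetical helper: prepend constant metadata columns and write NDJSON."""
    # Each metadata value is broadcast to every row
    df_out = df.assign(**metadata)
    # Metadata columns first, then the original columns
    cols = list(metadata) + [c for c in df.columns if c not in metadata]
    df_out = df_out[cols]

    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    f_path = Path(save_dir) / f"{f_name_prefix}_{timestamp}.json"
    f_path.parent.mkdir(parents=True, exist_ok=True)
    df_out.to_json(f_path, orient="records", lines=True)
    return f_path


# Toy dataframe already in the repeated format (values are made up)
df_demo = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
})
# Metadata values copied from the log above
metadata = {"pt": "2022-08-11", "model_name": "cau-text-mUSE", "model_version": "v0.6.0"}

save_dir = Path(tempfile.mkdtemp())
f_path = add_metadata_and_save(df_demo, metadata, save_dir)

# Read the file back: one JSON record per line
records = [json.loads(line) for line in f_path.read_text().splitlines() if line.strip()]
print(f_path.name)
print(records[0])
```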


In [22]:
# delete variables for first config to prevent errors on the 2nd config
del df_agg_sub_c, cfg_reshape_embeddings, path_this_model

# 2nd config for embeddings

This one adds extra weight to the subreddit description for subreddits that have fewer than 3 posts.
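
One way to picture "extra weight for the description" is a weighted average whose description weight grows when a subreddit has few posts. The sketch below is purely illustrative: the actual weights and aggregation logic live in the aggregation config and code, which are not shown in this notebook, so `combine_embeddings` and every weight value here are hypothetical.

```python
import numpy as np


def combine_embeddings(posts_mean, desc, n_posts, min_posts=3,
                       w_desc=0.2, w_desc_low_posts=0.6):
    """Hypothetical weighting: lean on the description when posts are scarce.

    All weights are made-up defaults, not the values used for v0.6.0.
    """
    posts_mean = np.asarray(posts_mean, dtype=float)
    desc = np.asarray(desc, dtype=float)
    if n_posts == 0:
        # Nothing to average with, so fall back to the description alone
        return desc
    w = w_desc_low_posts if n_posts < min_posts else w_desc
    return (1.0 - w) * posts_mean + w * desc


posts_mean = np.array([1.0, 0.0])   # toy mean of post embeddings
desc = np.array([0.0, 1.0])         # toy description embedding

few_posts = combine_embeddings(posts_mean, desc, n_posts=1)    # description-heavy
many_posts = combine_embeddings(posts_mean, desc, n_posts=50)  # post-heavy
no_posts = combine_embeddings(posts_mean, desc, n_posts=0)     # description only

print(few_posts, many_posts, no_posts)
```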




In [17]:
manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model2 = get_project_subfolder(
    f"data/models/aggregate_embeddings/manual_v060_{manual_model_timestamp}"
)
Path.mkdir(path_this_model2, parents=True, exist_ok=True)
path_this_model2

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-31_034951')

In [25]:
cfg_reshape_embeddings_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0_desc_extra_weight.yaml',
    config_path="../config",
)

In [26]:
for k_, v_ in cfg_reshape_embeddings_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
            # print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: badc44b0e5ac467da14f710da0b410c6
embeddings_artifact_path: df_subs_agg_c1
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: False,
pt: 2022-08-10
model_version: v0.6.0
model_name: cau-text-mUSE extra weight for subreddit description
embeddings_config: aggregate_embeddings_v0.6.0


In [27]:
%%time

df_agg_sub_c2 = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c2.shape)

03:51:40 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1"
100%|###########################################| 13/13 [00:18<00:00,  1.46s/it]
03:51:59 | INFO | "  Parquet files found:     4"
03:51:59 | INFO | "  Parquet files to use:     4"


(771760, 515)
CPU times: user 20 s, sys: 7.46 s, total: 27.5 s
Wall time: 26.9 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [28]:
df_agg_sub_c2['posts_for_embeddings_count'].describe()

count    771760.000000
mean         67.257111
std         479.863049
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [30]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c2['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,227368,29.5%,227368,29.5%
5 posts,23205,3.0%,250573,32.5%
4 posts,33084,4.3%,283657,36.8%
3 posts,56898,7.4%,340555,44.1%
2 posts,125070,16.2%,465625,60.3%
1 post,240338,31.1%,705963,91.5%
0 posts,65797,8.5%,771760,100.0%


## Run new function that reshapes & uploads to BQ in a single call


In [32]:
%%time

reshape_embeddings_and_upload_to_bq(
    df_agg_sub_c2,
    dict_reshape_config=cfg_reshape_embeddings_wt.config_dict,
    save_path_local_root=path_this_model2,
    f_name_prefix='subreddit_embeddings',
    embedding_col_prefix='embeddings_',
)

03:58:24 | INFO | "512 <- # embedding columns found"
03:58:24 | INFO | "(771760, 515) <- Shape of input df"
03:58:24 | INFO | "Metadata cols to add:
  {'mlflow_run_id': 'badc44b0e5ac467da14f710da0b410c6', 'pt': '2022-08-10', 'model_version': 'v0.6.0', 'model_name': 'cau-text-mUSE extra weight for subreddit description'}"
03:58:24 | INFO | "Converting embeddings to repeated format..."
03:59:03 | INFO | "(771760, 8) <- Shape of new df before converting to JSON"
03:59:03 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_name', 'model_version', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
03:59:03 | INFO | "Converting embeddings to JSON..."
04:00:15 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-31_034951/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json"
04:00:20 | INFO | "Logging to run ID: badc44b0e5ac467da14f710da0b410c6, artifact:
  df_subs_agg_c1_ndjson

CPU times: user 1min 52s, sys: 33.1 s, total: 2min 25s
Wall time: 4min 15s
