# Purpose

**2023-02-21: v0.6.0** update
<br>For some reason the previous table with v0.6.0 embeddings was deleted. I tried to run this notebook to re-upload the embeddings for v0.6.0 so that we can compare or backfill if needed.

However, authentication in the Vertex AI notebook is broken, so I can't run the upload function from here.

**See the colab notebook that loads the data using SQL instead.**


---

**2022-08-30: v0.6.0**
<br>The default parquet format for my embeddings (one row per subreddit, one column per embedding dimension) is not the preferred format for BigQuery & other team standards.

The preferred format is: 1 column that has repeated records. For example:
- `data-prod-165221.ml_content.subreddit_embeddings_ft2` 
    - [console link](https://console.cloud.google.com/bigquery?project=data-science-prod-218515&ws=!1m10!1m4!4m3!1sdata-prod-165221!2sml_content!3ssimilar_subreddit_ft2!1m4!4m3!1sdata-prod-165221!2sml_content!3ssubreddit_embeddings_ft2)
    - github link
        - https://github.snooguts.net/reddit/gazette-models/blob/master/similar_subreddit/embeddings/__main__.py#L105-L112

In this notebook we convert a dataframe into a new-line delimited JSON file. 
<br>With pandas we can vectorize this function instead of having to loop through each row (subreddit) individually.
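
A minimal sketch of that vectorized conversion, assuming the wide dataframe has one column per embedding dimension. The function name, the `embeddings_0`/`embeddings_1` column names, and the sample rows are made up for illustration; the real `reshape_embeddings_to_ndjson` in `subclu` may work differently.

```python
import json

import pandas as pd


def wide_embeddings_to_ndjson(df, id_cols, embedding_cols, extra_fields=None):
    """Collapse one-column-per-dimension embeddings into a single list column
    and serialize the result as newline-delimited JSON (one record per line)."""
    df_out = df[id_cols].copy()
    # Constant metadata (e.g. model version) is broadcast to every row.
    for name, value in (extra_fields or {}).items():
        df_out[name] = value
    # One conversion for all rows: 2D array -> list of per-row lists.
    df_out["embedding"] = pd.Series(
        df[embedding_cols].to_numpy().tolist(), index=df.index
    )
    return df_out.to_json(orient="records", lines=True)


# Hypothetical sample data, only to show the shape of the output.
df_wide = pd.DataFrame({
    "subreddit_id": ["t5_aaa", "t5_bbb"],
    "subreddit_name": ["example_a", "example_b"],
    "embeddings_0": [0.5, -0.25],
    "embeddings_1": [0.125, 0.75],
})
ndjson_text = wide_embeddings_to_ndjson(
    df_wide,
    id_cols=["subreddit_id", "subreddit_name"],
    embedding_cols=["embeddings_0", "embeddings_1"],
    extra_fields={"model_version": "v0.6.0"},
)
# Each non-empty line is a standalone JSON record.
records = [json.loads(line) for line in ndjson_text.splitlines() if line]
```

Each line of `ndjson_text` is one subreddit, with the whole vector stored under a single `embedding` key, which is the shape a `REPEATED` column expects.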

In this notebook we load 2 embedding flavors, selected by `embeddings_artifact_path`, and upload them to BQ in two separate partitions.

---

```python
sr_embedding_dict = {
    "pt": date_today,
    "model_name": MODEL_NAME,
    "model_version": MODEL_VERSION,
    "subreddit_id": sr_dict["subreddit_id"],
    "subreddit_name": subreddit_name_lowercase,
    "embedding": sr_embedding.tolist(),
}
```

In BQ:
```
Field name      Type    Mode      Description
model_name      STRING  NULLABLE  Model name
model_version   STRING  NULLABLE  Model version
subreddit_id    STRING  NULLABLE  Subreddit id
subreddit_name  STRING  NULLABLE  Lower case subreddit name
embedding       FLOAT   REPEATED  Subreddit embeddings
```
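
To sanity-check a record before uploading, the schema above can be written in BigQuery's JSON schema layout (`name`/`type`/`mode`/`description`) and compared against a Python dict. This is a rough sketch: `EMBEDDINGS_SCHEMA` and `schema_problems` are hypothetical names, and the real `embeddings_schema()` imported later in this notebook may be defined differently. Only the fields listed above are included.

```python
# Hypothetical mirror of the table above; not the real `embeddings_schema()`.
EMBEDDINGS_SCHEMA = [
    {"name": "model_name", "type": "STRING", "mode": "NULLABLE", "description": "Model name"},
    {"name": "model_version", "type": "STRING", "mode": "NULLABLE", "description": "Model version"},
    {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE", "description": "Subreddit id"},
    {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE", "description": "Lower case subreddit name"},
    {"name": "embedding", "type": "FLOAT", "mode": "REPEATED", "description": "Subreddit embeddings"},
]

# Python types accepted for each BigQuery type used above.
_PY_TYPES = {"STRING": str, "FLOAT": (int, float)}


def schema_problems(record, schema=EMBEDDINGS_SCHEMA):
    """Return a list of human-readable mismatches between a record and the schema."""
    problems = []
    for field in schema:
        name, py_type = field["name"], _PY_TYPES[field["type"]]
        value = record.get(name)
        if field["mode"] == "REPEATED":
            # Stricter than BigQuery on purpose: an embedding must be a list of numbers.
            if not isinstance(value, list):
                problems.append(f"{name}: expected a list, got {type(value).__name__}")
            elif not all(isinstance(v, py_type) and not isinstance(v, bool) for v in value):
                problems.append(f"{name}: every element must be {field['type']}")
        elif value is not None and not isinstance(value, py_type):
            problems.append(f"{name}: expected {field['type']}, got {type(value).__name__}")
    return problems


good_record = {
    "model_name": "example-model",
    "model_version": "v0.6.0",
    "subreddit_id": "t5_aaa",
    "subreddit_name": "example_a",
    "embedding": [0.5, 0.125],
}
bad_record = dict(good_record, embedding="0.5,0.125")  # string instead of a list
```

`schema_problems(good_record)` returns an empty list, while `schema_problems(bad_record)` reports that `embedding` is not a list.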

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime, timedelta
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    elapsed_time,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Load 1st config for embeddings to reshape and where to save them

The embedding aggregation should've been logged to `mlflow` so we should be able to
- make calls to mlflow to get the embeddings
- add the new embeddings format to the original job





In [30]:
cfg_reshape_embeddings = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0.yaml',
    config_path="../config",
)

In [31]:
cfg_reshape_embeddings_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0_desc_extra_weight.yaml',
    config_path="../config",
)

In [32]:
for k_, v_ in cfg_reshape_embeddings.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
    dataset_name: v0.6.0 inputs. ~110k seed subreddits, ~340k with 3+ posts, ~700k total subreddits
    bucket_name: i18n-subreddit-clustering
    folder_subreddits_text_and_meta: i18n_topic_model_batch/runs/20220811/subreddits/text
    folder_posts_text_and_meta: i18n_topic_model_batch/runs/20220811/posts
    folder_comments_text_and_meta: i18n_topic_model_batch/runs/20220811/comments
    folder_post_and_comment_text_and_meta: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all
data_embeddings_to_aggregate:
    bucket_embeddings: i18n-subreddit-clustering
    post_and_comments_folder_embeddings: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all/embedding/2022-08-11_084218
    subreddit_desc_folder_embeddings: i18n_topic_model_batch/runs/20220811/subreddits/text/embedding/2022-08-11_082859
    col_subreddit_id: subreddit_id
aggregate_params:
    min_post_and_comment_text_len: 3
    agg_post_post_and_comment_wei

In [33]:
for k_, v_ in cfg_reshape_embeddings_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
            # print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: badc44b0e5ac467da14f710da0b410c6
embeddings_artifact_path: df_subs_agg_c1
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: False,
pt: 2022-08-10
model_version: v0.6.0
model_name: cau-text-mUSE extra weight for subreddit description
embeddings_config: aggregate_embeddings_v0.6.0


# Start MLflow & Get Artifact Paths

This mlflow server will help us get the artifacts & subreddit embeddings for the `mlflow_run_id` set in the configurations above.

In [13]:
mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings.config_dict['mlflow_tracking_uri'])

In [15]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

| | experiment_id | name | artifact_location | lifecycle_stage |
|---|---|---|---|---|
| 35 | 35 | v0.6.0_mUSE_aggregates | gs://i18n-subreddit-clustering/mlflow/mlruns/35 | active |
| 36 | 36 | v0.6.0_mUSE_clustering_test | gs://i18n-subreddit-clustering/mlflow/mlruns/36 | active |
| 37 | 37 | v0.6.0_mUSE_clustering | gs://i18n-subreddit-clustering/mlflow/mlruns/37 | active |
| 38 | 38 | v0.6.0_nearest_neighbors | gs://i18n-subreddit-clustering/mlflow/mlruns/38 | active |
| 39 | 39 | v0.6.1_mUSE_aggregates_test | gs://i18n-subreddit-clustering/mlflow/mlruns/39 | active |
| 40 | 40 | v0.6.1_mUSE_aggregates | gs://i18n-subreddit-clustering/mlflow/mlruns/40 | active |
| 41 | 41 | v0.6.1_mUSE_clustering_test | gs://i18n-subreddit-clustering/mlflow/mlruns/41 | active |
| 42 | 42 | v0.6.1_mUSE_clustering | gs://i18n-subreddit-clustering/mlflow/mlruns/42 | active |
| 43 | 43 | v0.6.1_nearest_neighbors | gs://i18n-subreddit-clustering/mlflow/mlruns/43 | active |


In [16]:
%%time

df_mlf = mlf.search_all_runs(35)
print(df_mlf.shape)

(4, 43)
CPU times: user 59.4 ms, sys: 7.3 ms, total: 66.7 ms
Wall time: 65.9 ms


In [17]:
# only show finished runs
df_mlf[df_mlf['status'] == "FINISHED"].iloc[:5, :13]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_v_subs-cols,metrics.df_subs_agg_c1-cols,metrics.df_v_subs-rows,metrics.cpu_count,metrics.df_v_post_comments-cols,metrics.memory_used,metrics.df_subs_agg_c1_uw-rows
0,badc44b0e5ac467da14f710da0b410c6,35,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts,2022-08-16 08:41:53.006000+00:00,2022-09-10 00:54:17.303000+00:00,514.0,515.0,771760.0,96.0,515.0,774858.0,771760.0
1,ca79765b72c5428395b02926612d85fd,35,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/35/ca79765b72c5428395b02926612d85fd/artifacts,2022-08-16 08:41:31.162000+00:00,2022-08-31 03:13:27.187000+00:00,514.0,,771760.0,96.0,515.0,442282.0,


## Get path to latest reshaped data
Since we already reshaped the data and it's in GCS, let's read that instead of reshaping & uploading to GCS again.

In [18]:
l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
    full_path=True,
)
l_artifacts_all = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
    full_path=True,
)

01:29:43 | INFO | "   293 <- Artifacts to check count"
01:29:43 | INFO | "   293 <- Artifacts clean count"
01:29:43 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"
01:29:48 | INFO | "   293 <- Artifacts clean count"
01:29:48 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"


In [19]:
l_artifacts_top_level

['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_df-2022-09-10_003611',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_df_test-2022-09-09_212342',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson-2022-09-10_003611',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson_test-2022-09-09_212342',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/ann_ndjson_test-2022-09-10_003611',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_posts_agg_c1',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson',
 'gs://i18n-subreddit-clustering/ml

In [22]:
# get path for parquet file. This should be a full folder with multiple parquet files
path_parquet_ = (
    cfg_reshape_embeddings.config_dict['embeddings_artifact_path']
)

folder_parquet_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path']
)
folder_parquet_full_ = [i_ for i_ in l_artifacts_top_level if i_.split('/')[-1] == folder_parquet_][0]

# print(f"Folder with parquet files:\n{path_parquet_}")
print(f"\nFolder with parquet files (full):\n{folder_parquet_full_}\n")

[_ for _ in l_artifacts_all if folder_parquet_ == _.split('/')[-2]]


Folder with parquet files (full):
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1



['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_common_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.0.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.1.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.2.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.3.parquet']

In [23]:
# get path for latest ndjson output FILE
#  NOTE: there could be multiple runs of this file
path_ndjson_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'] + '_ndjson'
)
l_ndjson_files_ = [_ for _ in l_artifacts_all if path_ndjson_ in _]
ndjson_path_ = l_ndjson_files_[-1]
print(f"ndjson Files:\n{l_ndjson_files_}")

print(f"\nFile to upload to BQ:\n{ndjson_path_}")

ndjson Files:
['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json']

File to upload to BQ:
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json
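
Taking `[-1]` only returns the latest file if the artifact list happens to be in chronological order. Since there could be multiple runs, a safer sketch is to parse the timestamp embedded in the filename and pick the maximum. `latest_ndjson_uri` is a hypothetical helper, and it assumes every file follows the `..._YYYY-MM-DD_HHMMSS.json` naming seen above; the example URIs use a placeholder bucket.

```python
import re
from datetime import datetime

# Matches the trailing timestamp, e.g. "_2022-08-31_035824.json".
_TS_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2}_\d{6})\.json$")


def latest_ndjson_uri(uris):
    """Return the URI whose filename carries the most recent timestamp."""
    dated = []
    for uri in uris:
        match = _TS_PATTERN.search(uri)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d_%H%M%S")
            dated.append((ts, uri))
    if not dated:
        raise ValueError("No timestamped .json files found")
    # Tuples compare by timestamp first, so max() picks the newest file.
    return max(dated)[1]


# Placeholder URIs, deliberately out of order.
example_uris = [
    "gs://example-bucket/run/df_ndjson/subreddit_embeddings_2022-09-10_003611.json",
    "gs://example-bucket/run/df_ndjson/subreddit_embeddings_2022-08-31_035824.json",
    "gs://example-bucket/run/df_ndjson/_metadata",
]
newest = latest_ndjson_uri(example_uris)
```

With the placeholder list, `newest` is the `2022-09-10_003611` file even though it is listed first.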


# Run new function to only upload existing reshaped data

In [34]:
%%time

load_data_to_bq_table(
    uri=ndjson_path_,
    bq_project=cfg_reshape_embeddings_wt.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings_wt.config_dict['bq_dataset'],
    bq_table_name=cfg_reshape_embeddings_wt.config_dict['bq_table'],
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings_wt.config_dict['bq_table_description'],
    update_table_description=False,
)

16:37:46 | INFO | "Loading this URI:
  gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json
Into this table:
  reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings"


Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/reddit-employee-datasets/datasets/david_bermejo/tables/cau_subreddit_embeddings?prettyPrint=false: Access Denied: Table reddit-employee-datasets:david_bermejo.cau_subreddit_embeddings: Permission bigquery.tables.get denied on table reddit-employee-datasets:david_bermejo.cau_subreddit_embeddings (or it may not exist).
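
The SQL used in the colab notebook isn't reproduced here. As a rough sketch of what a SQL-based load could look like, the helper below builds a BigQuery `LOAD DATA` statement for a single NDJSON file. `build_load_ndjson_sql` is a hypothetical name, the statement assumes the target table already exists with the right schema, and it would still need to be run from an environment with working credentials.

```python
def build_load_ndjson_sql(project, dataset, table, uri):
    """Build a BigQuery `LOAD DATA` statement for one newline-delimited JSON file."""
    return (
        # Backticks are needed because the project id contains hyphens.
        f"LOAD DATA INTO `{project}.{dataset}.{table}`\n"
        "FROM FILES (\n"
        "  format = 'JSON',\n"  # 'JSON' means newline-delimited JSON here
        f"  uris = ['{uri}']\n"
        ");"
    )


sql = build_load_ndjson_sql(
    project="reddit-employee-datasets",
    dataset="david_bermejo",
    table="cau_subreddit_embeddings",
    # Placeholder URI; swap in the ndjson path found above.
    uri="gs://example-bucket/path/subreddit_embeddings.json",
)
print(sql)
```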

# Explore data [optional]

## Load embeddings from parquet files

In [10]:
%%time

df_agg_sub_c = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c.shape)

03:32:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_unweighted"
100%|##########################################| 6/6 [00:00<00:00, 17476.27it/s]
03:32:53 | INFO | "  Parquet files found:     4"
03:32:53 | INFO | "  Parquet files to use:     4"


(771760, 515)
CPU times: user 9.31 s, sys: 4.32 s, total: 13.6 s
Wall time: 7.86 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [11]:
df_agg_sub_c['posts_for_embeddings_count'].describe()

count    771760.000000
mean         67.257111
std         479.863049
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [14]:
# value_counts_and_pcts(
#     df_agg_sub_c['posts_for_embeddings_count'],
#     sort_index=True,
#     sort_index_ascending=True,
# )

In [13]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

| | posts_for_embeddings_count-count | posts_for_embeddings_count-percent | posts_for_embeddings_count-cumulative_sum | posts_for_embeddings_count-pct_cumulative_sum |
|---|---|---|---|---|
| 6+ posts | 227368 | 29.5% | 227368 | 29.5% |
| 5 posts | 23205 | 3.0% | 250573 | 32.5% |
| 4 posts | 33084 | 4.3% | 283657 | 36.8% |
| 3 posts | 56898 | 7.4% | 340555 | 44.1% |
| 2 posts | 125070 | 16.2% | 465625 | 60.3% |
| 1 post | 240338 | 31.1% | 705963 | 91.5% |
| 0 posts | 65797 | 8.5% | 771760 | 100.0% |
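
The "3 posts" row of the cumulative column is the number to compare against the ~340k expectation. A quick check of that arithmetic, using only the counts copied from the table above:

```python
# Counts copied from the table above, ordered from most to fewest posts.
counts = {
    "6+ posts": 227368,
    "5 posts": 23205,
    "4 posts": 33084,
    "3 posts": 56898,
    "2 posts": 125070,
    "1 post": 240338,
    "0 posts": 65797,
}

total_subs = sum(counts.values())
subs_3_plus = sum(counts[k] for k in ["6+ posts", "5 posts", "4 posts", "3 posts"])
pct_3_plus = round(100 * subs_3_plus / total_subs, 1)

print(f"{subs_3_plus:,} of {total_subs:,} subreddits have 3+ posts ({pct_3_plus}%)")
# prints: 340,555 of 771,760 subreddits have 3+ posts (44.1%)
```

So the data matches the expectation: about 340k subreddits (44.1%) have 3 or more posts.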


# 2nd config for embeddings

This one adds extra weight to the subreddit description for subreddits that have fewer than 3 posts.
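
To illustrate the idea only: the sketch below blends the mean post embedding with the description embedding and gives the description a larger share when a subreddit has fewer than 3 posts. The function name and both weight values are hypothetical; the real weights come from the aggregation config, which is truncated in the output above, and the actual aggregation job may combine the vectors differently.

```python
def blend_subreddit_embedding(
    post_mean,
    description,
    n_posts,
    min_posts=3,
    desc_weight_default=0.2,    # hypothetical value, not taken from the config
    desc_weight_low_posts=0.5,  # hypothetical value, not taken from the config
):
    """Weighted average of post and description embeddings for one subreddit."""
    if n_posts == 0:
        # Nothing to blend: fall back to the description alone.
        return list(description)
    w = desc_weight_low_posts if n_posts < min_posts else desc_weight_default
    return [(1 - w) * p + w * d for p, d in zip(post_mean, description)]


# Toy 2-dimensional vectors to make the effect of the weight visible.
few_posts = blend_subreddit_embedding([1.0, 0.0], [0.0, 1.0], n_posts=1)
many_posts = blend_subreddit_embedding([1.0, 0.0], [0.0, 1.0], n_posts=10)
```

With these toy inputs the description contributes half of `few_posts` but only a fifth of `many_posts`.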




In [27]:
%%time

df_agg_sub_c2 = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c2.shape)

03:51:40 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1"
100%|###########################################| 13/13 [00:18<00:00,  1.46s/it]
03:51:59 | INFO | "  Parquet files found:     4"
03:51:59 | INFO | "  Parquet files to use:     4"


(771760, 515)
CPU times: user 20 s, sys: 7.46 s, total: 27.5 s
Wall time: 26.9 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [28]:
df_agg_sub_c2['posts_for_embeddings_count'].describe()

count    771760.000000
mean         67.257111
std         479.863049
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [30]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c2['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

| | posts_for_embeddings_count-count | posts_for_embeddings_count-percent | posts_for_embeddings_count-cumulative_sum | posts_for_embeddings_count-pct_cumulative_sum |
|---|---|---|---|---|
| 6+ posts | 227368 | 29.5% | 227368 | 29.5% |
| 5 posts | 23205 | 3.0% | 250573 | 32.5% |
| 4 posts | 33084 | 4.3% | 283657 | 36.8% |
| 3 posts | 56898 | 7.4% | 340555 | 44.1% |
| 2 posts | 125070 | 16.2% | 465625 | 60.3% |
| 1 post | 240338 | 31.1% | 705963 | 91.5% |
| 0 posts | 65797 | 8.5% | 771760 | 100.0% |
