# Purpose

**2022-08-30: v0.6.0**
The default parquet format for my embeddings (one column per embedding dimension) is not the format favored by BigQuery and other team standards.

The preferred format is a single column of repeated values (one array per row). For example:
- `data-prod-165221.ml_content.subreddit_embeddings_ft2` 
    - [console link](https://console.cloud.google.com/bigquery?project=data-science-prod-218515&ws=!1m10!1m4!4m3!1sdata-prod-165221!2sml_content!3ssimilar_subreddit_ft2!1m4!4m3!1sdata-prod-165221!2sml_content!3ssubreddit_embeddings_ft2)
    - github link
        - https://github.snooguts.net/reddit/gazette-models/blob/master/similar_subreddit/embeddings/__main__.py#L105-L112

In this notebook we convert a dataframe into a newline-delimited JSON file. 
<br>With pandas we can vectorize the conversion instead of looping through each row (subreddit) individually.

We load two embedding flavors (selected by the `embeddings_artifact_path` in each config) and upload them to BQ in two separate partitions.
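
As a rough illustration of the conversion described above, here is a minimal sketch on a toy dataframe. It assumes the wide format uses an `embeddings_` column prefix; the sample rows and the 3-dimension size are made up, and the actual `reshape_embeddings_to_ndjson` implementation in `subclu` may differ.

```python
import json

import pandas as pd

# Toy stand-in for the wide format: one column per embedding dimension.
# The real data has 512 embedding columns; 3 are used here for brevity.
df_wide = pd.DataFrame(
    {
        "subreddit_id": ["t5_aaa", "t5_bbb"],  # hypothetical IDs
        "subreddit_name": ["askreddit", "pics"],
        "embeddings_0": [0.1, 0.4],
        "embeddings_1": [0.2, 0.5],
        "embeddings_2": [0.3, 0.6],
    }
)

embedding_cols = [c for c in df_wide.columns if c.startswith("embeddings_")]

# Vectorized reshape: turn the 2-D block of embedding columns into one
# list per row, without a Python loop over subreddits.
embeddings_as_lists = pd.Series(
    df_wide[embedding_cols].to_numpy().tolist(),
    index=df_wide.index,
)
df_repeated = df_wide.drop(columns=embedding_cols).assign(
    embeddings=embeddings_as_lists
)

# Newline-delimited JSON: one JSON object per line.
ndjson_text = df_repeated.to_json(orient="records", lines=True)
print(ndjson_text)
```

Each output line is a standalone JSON object whose `embeddings` value is an array, which is the shape a `REPEATED` column expects.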

---

```python
sr_embedding_dict = {
    "pt": date_today,
    "model_name": MODEL_NAME,
    "model_version": MODEL_VERSION,
    "subreddit_id": sr_dict["subreddit_id"],
    "subreddit_name": subreddit_name_lowercase,
    "embedding": sr_embedding.tolist(),
}
```

In BQ:
```
Field name      Type    Mode      Description
model_name      STRING  NULLABLE  Model name
model_version   STRING  NULLABLE  Model version
subreddit_id    STRING  NULLABLE  Subreddit id
subreddit_name  STRING  NULLABLE  Lower case subreddit name
embedding       FLOAT   REPEATED  Subreddit embeddings
```
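
For reference, the same fields can be written in the JSON schema layout that BigQuery accepts (a list of objects with `name`, `type`, `mode`, `description`). This is only a sketch that mirrors the table above; it is not the `embeddings_schema` object imported later from `subclu`, and the helper `repeated_fields` is hypothetical.

```python
import json

# Field definitions copied from the table above.
embedding_table_schema = [
    {"name": "model_name", "type": "STRING", "mode": "NULLABLE",
     "description": "Model name"},
    {"name": "model_version", "type": "STRING", "mode": "NULLABLE",
     "description": "Model version"},
    {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE",
     "description": "Subreddit id"},
    {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE",
     "description": "Lower case subreddit name"},
    {"name": "embedding", "type": "FLOAT", "mode": "REPEATED",
     "description": "Subreddit embeddings"},
]


def repeated_fields(schema):
    """Return the names of fields that hold arrays (mode REPEATED)."""
    return [field["name"] for field in schema if field["mode"] == "REPEATED"]


print(json.dumps(embedding_table_schema, indent=2))
print(repeated_fields(embedding_table_schema))
```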

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime, timedelta
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    elapsed_time,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set local model path (for saving)

Why? We need to reshape the embeddings and save them locally before uploading them to mlflow.

In [4]:
manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model = get_project_subfolder(
    f"data/models/aggregate_embeddings/manual_reshape_v061_{manual_model_timestamp}"
)
path_this_model.mkdir(parents=True, exist_ok=True)
path_this_model

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_reshape_v061_2022-11-18_163428')
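
The timestamped-folder pattern used in this cell can be sketched with the standard library alone. The helper name `make_timestamped_folder` is hypothetical (the notebook uses `get_project_subfolder` from `subclu`), and a temporary directory stands in for the project root.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def make_timestamped_folder(root: Path, prefix: str) -> Path:
    """Create `<root>/<prefix>_<UTC timestamp>` and return its path."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H%M%S")
    folder = Path(root) / f"{prefix}_{timestamp}"
    # parents=True creates missing intermediate folders;
    # exist_ok=True makes the call safe to re-run.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


with tempfile.TemporaryDirectory() as tmp_root:
    path_example = make_timestamped_folder(Path(tmp_root), "manual_reshape_v061")
    folder_was_created = path_example.is_dir()
    print(path_example.name)
```

Putting the timestamp in the folder name keeps each manual run in its own folder, so a re-run does not overwrite earlier output.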

# Load 1st config for embeddings to reshape and where to save them

The embedding aggregation should already be logged to `mlflow`, so we should be able to:
- make calls to mlflow to get the embeddings
- add the new embeddings format to the original job





In [14]:
cfg_reshape_embeddings = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.1.yaml',
    config_path="../config",
)

In [16]:
for k_, v_ in cfg_reshape_embeddings.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
    dataset_name: v0.6.1 inputs. ~110k seed subreddits, ~340k with 3+ posts, ~700k total subreddits
    bucket_name: i18n-subreddit-clustering
    folder_subreddits_text_and_meta: i18n_topic_model_batch/runs/20221107/subreddits/text
    folder_posts_text_and_meta: i18n_topic_model_batch/runs/20221107/posts
    folder_comments_text_and_meta: i18n_topic_model_batch/runs/20221107/comments
    folder_post_and_comment_text_and_meta: i18n_topic_model_batch/runs/20221107/post_and_comment_text_combined/text_all
data_embeddings_to_aggregate:
    bucket_embeddings: i18n-subreddit-clustering
    post_and_comments_folder_embeddings: i18n_topic_model_batch/runs/20221107/post_and_comment_text_combined/text_all/embedding/2022-11-07_081017
    subreddit_desc_folder_embeddings: i18n_topic_model_batch/runs/20221107/subreddits/text/embedding/2022-11-07_074632
    col_subreddit_id: subreddit_id
aggregate_params:
    min_post_and_comment_text_len: 3
    agg_post_post_and_comment_wei
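
The printing loop above can also be wrapped in a small helper that returns the formatted lines, which makes it reusable for other configs. This is a sketch only: `format_config` is a hypothetical name, it handles a single level of nesting like the loop above, and the example dictionary reuses a few of the keys printed above.

```python
def format_config(config: dict, indent: str = "    ") -> list:
    """Format a config dict (one level of nesting) as printable lines."""
    lines = []
    for key, value in config.items():
        if isinstance(value, dict):
            # Nested section: print the section name, then its items indented.
            lines.append(f"{key}:")
            lines.extend(f"{indent}{k2}: {v2}" for k2, v2 in value.items())
        else:
            lines.append(f"{key}: {value}")
    return lines


example_config = {
    "data_embeddings_to_aggregate": {
        "bucket_embeddings": "i18n-subreddit-clustering",
        "col_subreddit_id": "subreddit_id",
    },
    "aggregate_params": {"min_post_and_comment_text_len": 3},
}
config_lines = format_config(example_config)
print("\n".join(config_lines))
```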

## Start MLflow
We need it to get the paths for artifacts and load subreddit embeddings (based on the mlflow run ID).

In [9]:
mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings.config_dict['mlflow_tracking_uri'])

In [10]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
35,35,v0.6.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/35,active
36,36,v0.6.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/36,active
37,37,v0.6.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/37,active
38,38,v0.6.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/38,active
39,39,v0.6.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/39,active
40,40,v0.6.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/40,active
41,41,v0.6.1_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/41,active
42,42,v0.6.1_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/42,active
43,43,v0.6.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/43,active


In [11]:
%%time

df_mlf = mlf.search_all_runs(experiment_ids=[40])
print(df_mlf.shape)

(4, 43)
CPU times: user 53.7 ms, sys: 9.22 ms, total: 62.9 ms
Wall time: 63.8 ms


In [13]:
# only show finished runs
df_mlf[df_mlf['status'] == "FINISHED"].iloc[:5, :13]

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_v_subs-cols,metrics.df_v_post_comments-cols,metrics.memory_used_percent,metrics.memory_used,metrics.df_v_subs-rows,metrics.memory_free,metrics.time_fxn-data_loading_time
2,91ac7ca171024c779c0992f59470c81b,40,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts,2022-11-07 21:38:57.662000+00:00,2022-11-08 08:52:44.944000+00:00,514.0,515.0,0.552656,798566.0,781653.0,1283359.0,5.742884


## Load embeddings

In [17]:
%%time

df_agg_sub_c = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c.shape)

16:52:34 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_unweighted"
100%|#############################################| 6/6 [00:18<00:00,  3.08s/it]
16:52:53 | INFO | "  Parquet files found:     4"
16:52:53 | INFO | "  Parquet files to use:     4"


(781653, 515)
CPU times: user 19.2 s, sys: 8.91 s, total: 28.1 s
Wall time: 26.5 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [18]:
df_agg_sub_c['posts_for_embeddings_count'].describe()

count    781653.000000
mean         68.569835
std         487.025214
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [19]:
# value_counts_and_pcts(
#     df_agg_sub_c['posts_for_embeddings_count'],
#     sort_index=True,
#     sort_index_ascending=True,
# )

In [20]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,232401,29.7%,232401,29.7%
5 posts,23732,3.0%,256133,32.8%
4 posts,33723,4.3%,289856,37.1%
3 posts,57946,7.4%,347802,44.5%
2 posts,128068,16.4%,475870,60.9%
1 post,235794,30.2%,711664,91.0%
0 posts,69989,9.0%,781653,100.0%
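
In the table above, the cumulative sum on the `3 posts` row (347,802) is the number of subreddits with 3+ posts, which is in line with the ~340k we expected. The sketch below shows how a table like this could be built with plain pandas on made-up counts. It is an approximation for illustration, not the `value_counts_and_pcts` implementation from `subclu`.

```python
import numpy as np
import pandas as pd

# Hypothetical post counts for 8 subreddits (not the notebook's data).
post_counts = pd.Series(
    [0, 1, 1, 2, 3, 5, 9, 40], name="posts_for_embeddings_count"
)

labels = ["0 posts", "1 post", "2 posts", "3 posts", "4 posts", "5 posts", "6+ posts"]
# Bins are right-inclusive: (-1, 0] -> "0 posts", ..., (5, inf] -> "6+ posts".
binned = pd.cut(post_counts, bins=[-1, 0, 1, 2, 3, 4, 5, np.inf], labels=labels)

# Count each bin (empty bins are kept) and order from most posts to fewest,
# so the cumulative sum reads as "subreddits with N or more posts".
counts = binned.value_counts(sort=False).sort_index(ascending=False)

df_counts = pd.DataFrame(
    {
        "count": counts,
        "percent": counts / counts.sum(),
        "cumulative_sum": counts.cumsum(),
        "pct_cumulative_sum": counts.cumsum() / counts.sum(),
    }
)
print(df_counts)
```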


## Run new function that reshapes & uploads to BQ in a single call

In [21]:
%%time

reshape_embeddings_and_upload_to_bq(
    df_agg_sub_c,
    dict_reshape_config=cfg_reshape_embeddings.config_dict,
    save_path_local_root=path_this_model,
    f_name_prefix='subreddit_embeddings',
    embedding_col_prefix='embeddings_',
)

16:53:07 | INFO | "512 <- # embedding columns found"
16:53:07 | INFO | "(781653, 515) <- Shape of input df"
16:53:07 | INFO | "Metadata cols to add:
  {'mlflow_run_id': '91ac7ca171024c779c0992f59470c81b', 'pt': '2022-11-07', 'model_version': 'v0.6.1', 'model_name': 'cau-text-mUSE'}"
16:53:07 | INFO | "Converting embeddings to repeated format..."
16:53:53 | INFO | "(781653, 8) <- Shape of new df before converting to JSON"
16:53:53 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_name', 'model_version', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
16:53:53 | INFO | "Converting embeddings to JSON..."
16:55:04 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_reshape_v061_2022-11-18_163428/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-11-18_165307.json"
16:55:10 | INFO | "Logging to run ID: 91ac7ca171024c779c0992f59470c81b, artifact:
  df_subs_agg_c1_unweighted_ndjson"
16:56:2

CPU times: user 1min 51s, sys: 31.7 s, total: 2min 22s
Wall time: 4min 36s


In [24]:
# delete variables for first config to prevent over-writing or re-writing errors on the 2nd config
del df_agg_sub_c, cfg_reshape_embeddings, path_this_model

# 2nd config for embeddings

This one adds extra weight to the subreddit description for subreddits that have fewer than 3 posts.
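
To make the idea concrete, here is a purely illustrative sketch of giving the description embedding more weight when a subreddit has few posts. Everything in it is an assumption: the function name, the weights (0.25 and 0.5), and the simple two-vector weighted average are made up for the example and do not come from the aggregation config or the `subclu` code.

```python
def blend_embeddings(
    post_vec,
    desc_vec,
    n_posts,
    min_posts=3,                 # threshold taken from the sentence above
    desc_weight_default=0.25,    # hypothetical weight
    desc_weight_few_posts=0.5,   # hypothetical weight
):
    """Weighted average of a post embedding and a description embedding.

    Subreddits with fewer than `min_posts` posts lean more on the
    description, since their post embedding is based on little text.
    """
    w_desc = desc_weight_few_posts if n_posts < min_posts else desc_weight_default
    return [(1 - w_desc) * p + w_desc * d for p, d in zip(post_vec, desc_vec)]


post_embedding = [1.0, 0.0]
desc_embedding = [0.0, 1.0]

blended_few_posts = blend_embeddings(post_embedding, desc_embedding, n_posts=1)
blended_many_posts = blend_embeddings(post_embedding, desc_embedding, n_posts=10)
print(blended_few_posts, blended_many_posts)
```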




In [25]:
manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model2 = get_project_subfolder(
    f"data/models/aggregate_embeddings/manual_v060_{manual_model_timestamp}"
)
path_this_model2.mkdir(parents=True, exist_ok=True)
path_this_model2

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-11-18_171028')

In [26]:
cfg_reshape_embeddings_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.1_desc_extra_weight.yaml',
    config_path="../config",
)

In [28]:
for k_, v_ in cfg_reshape_embeddings_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
    dataset_name: v0.6.1 inputs. ~110k seed subreddits, ~340k with 3+ posts, ~700k total subreddits
    bucket_name: i18n-subreddit-clustering
    folder_subreddits_text_and_meta: i18n_topic_model_batch/runs/20221107/subreddits/text
    folder_posts_text_and_meta: i18n_topic_model_batch/runs/20221107/posts
    folder_comments_text_and_meta: i18n_topic_model_batch/runs/20221107/comments
    folder_post_and_comment_text_and_meta: i18n_topic_model_batch/runs/20221107/post_and_comment_text_combined/text_all
data_embeddings_to_aggregate:
    bucket_embeddings: i18n-subreddit-clustering
    post_and_comments_folder_embeddings: i18n_topic_model_batch/runs/20221107/post_and_comment_text_combined/text_all/embedding/2022-11-07_081017
    subreddit_desc_folder_embeddings: i18n_topic_model_batch/runs/20221107/subreddits/text/embedding/2022-11-07_074632
    col_subreddit_id: subreddit_id
aggregate_params:
    min_post_and_comment_text_len: 3
    agg_post_post_and_comment_wei

In [29]:
%%time

df_agg_sub_c2 = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c2.shape)

17:10:58 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1"
100%|########################################| 13/13 [00:00<00:00, 53196.05it/s]
17:10:58 | INFO | "  Parquet files found:     4"
17:10:58 | INFO | "  Parquet files to use:     4"


(781653, 515)
CPU times: user 9.64 s, sys: 4.72 s, total: 14.4 s
Wall time: 27.8 s


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [30]:
df_agg_sub_c2['posts_for_embeddings_count'].describe()

count    781653.000000
mean         68.569835
std         487.025214
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [31]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c2['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,232401,29.7%,232401,29.7%
5 posts,23732,3.0%,256133,32.8%
4 posts,33723,4.3%,289856,37.1%
3 posts,57946,7.4%,347802,44.5%
2 posts,128068,16.4%,475870,60.9%
1 post,235794,30.2%,711664,91.0%
0 posts,69989,9.0%,781653,100.0%


## Run new function that reshapes & uploads to BQ in a single call


In [32]:
%%time

reshape_embeddings_and_upload_to_bq(
    df_agg_sub_c2,
    dict_reshape_config=cfg_reshape_embeddings_wt.config_dict,
    save_path_local_root=path_this_model2,
    f_name_prefix='subreddit_embeddings',
    embedding_col_prefix='embeddings_',
)

17:12:17 | INFO | "512 <- # embedding columns found"
17:12:17 | INFO | "(781653, 515) <- Shape of input df"
17:12:17 | INFO | "Metadata cols to add:
  {'mlflow_run_id': '91ac7ca171024c779c0992f59470c81b', 'pt': '2022-11-08', 'model_version': 'v0.6.1', 'model_name': 'cau-text-mUSE extra weight for subreddit description'}"
17:12:18 | INFO | "Converting embeddings to repeated format..."
17:13:03 | INFO | "(781653, 8) <- Shape of new df before converting to JSON"
17:13:03 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_name', 'model_version', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
17:13:03 | INFO | "Converting embeddings to JSON..."
17:14:14 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-11-18_171028/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json"
17:14:20 | INFO | "Logging to run ID: 91ac7ca171024c779c0992f59470c81b, artifact:
  df_subs_agg_c1_ndjson

CPU times: user 1min 51s, sys: 33.2 s, total: 2min 24s
Wall time: 5min 1s
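
Before relying on the uploaded tables, it can be useful to sanity-check the saved newline-delimited JSON files. The sketch below is a hypothetical helper (`check_ndjson` is not part of `subclu`). It assumes each line is one JSON object with an `embeddings` array, as in the output columns logged above, and uses a tiny temporary file with 3-dimensional vectors in place of the real 512-dimensional ones.

```python
import json
import tempfile
from pathlib import Path


def check_ndjson(path: Path, embedding_key: str, expected_dim: int) -> int:
    """Return the number of valid records; raise ValueError on the first bad row."""
    n_records = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            embedding = record.get(embedding_key)
            if not isinstance(embedding, list) or len(embedding) != expected_dim:
                raise ValueError(
                    f"line {line_number}: unexpected '{embedding_key}' value"
                )
            n_records += 1
    return n_records


with tempfile.TemporaryDirectory() as tmp_dir:
    f_example = Path(tmp_dir) / "subreddit_embeddings_example.json"
    rows = [
        {"subreddit_id": "t5_aaa", "embeddings": [0.1, 0.2, 0.3]},
        {"subreddit_id": "t5_bbb", "embeddings": [0.4, 0.5, 0.6]},
    ]
    f_example.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    n_ok = check_ndjson(f_example, embedding_key="embeddings", expected_dim=3)
    print(n_ok)
```

For the files saved in this notebook, the expected values would be `expected_dim=512` and a record count of 781,653, matching the shapes in the logs.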
