# Purpose

**2023-02-21: v0.6.0** update
<br>For some reason the previous table with CAU embeddings was deleted. With this notebook, we recreate the table AND upload subreddit embeddings from: 
- v0.6.0
- v0.6.1

To make it easier, we'll only add the embeddings that have additional weight for subreddits with 1 or 2 posts because those did slightly better in previous models. We can load and test the other embeddings later.

--- 

**2022-08-30: v0.6.0**
The default parquet format for my embeddings (one column per embedding dimension) is not favored for BigQuery & other team standards.

The preferred format is: 1 column that has repeated records. For example:
- `data-prod-165221.ml_content.subreddit_embeddings_ft2` 
    - [console link](https://console.cloud.google.com/bigquery?project=data-science-prod-218515&ws=!1m10!1m4!4m3!1sdata-prod-165221!2sml_content!3ssimilar_subreddit_ft2!1m4!4m3!1sdata-prod-165221!2sml_content!3ssubreddit_embeddings_ft2)
    - github link
        - https://github.snooguts.net/reddit/gazette-models/blob/master/similar_subreddit/embeddings/__main__.py#L105-L112

In this notebook we convert a dataframe into a newline-delimited JSON (ND-JSON) file.
<br>With pandas we can vectorize this conversion instead of looping through each row (subreddit) individually.

In this notebook we load 2 separate embedding flavors, selected by `embeddings_artifact_path`, and upload them to BQ as two separate partitions.
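As a minimal sketch of the reshaping described above (not the actual `reshape_embeddings_to_ndjson` implementation, which is not shown in this notebook), the wide embedding columns can be collapsed into one list-valued column and serialized as ND-JSON with pandas. The toy dataframe and the variable names are assumptions for illustration only.

```python
import json

import pandas as pd

# Hypothetical toy data: 3 subreddits with 4-dimensional embeddings,
# stored wide (one column per embedding dimension).
df_wide = pd.DataFrame(
    {
        "subreddit_id": ["t5_a", "t5_b", "t5_c"],
        "subreddit_name": ["alpha", "beta", "gamma"],
        "embeddings_0": [0.1, 0.2, 0.3],
        "embeddings_1": [0.4, 0.5, 0.6],
        "embeddings_2": [0.7, 0.8, 0.9],
        "embeddings_3": [1.0, 1.1, 1.2],
    }
)
emb_cols = [c for c in df_wide.columns if c.startswith("embeddings_")]

# Collapse all embedding columns into a single list-valued column with one
# array operation instead of looping over rows.
df_records = df_wide[["subreddit_id", "subreddit_name"]].copy()
df_records["embedding"] = df_wide[emb_cols].to_numpy().tolist()

# One JSON object per line (ND-JSON); a string is returned when no path is given.
ndjson_text = df_records.to_json(orient="records", lines=True)
first_record = json.loads(ndjson_text.splitlines()[0])
print(first_record)
```

Each output line is then one subreddit with its full embedding as a JSON array, which is the shape a `REPEATED` BigQuery column expects.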

---

```python
sr_embedding_dict = {
    "pt": date_today,
    "model_name": MODEL_NAME,
    "model_version": MODEL_VERSION,
    "subreddit_id": sr_dict["subreddit_id"],
    "subreddit_name": subreddit_name_lowercase,
    "embedding": sr_embedding.tolist(),
}
```

In BQ:
```
Field name      Type    Mode      Description
model_name      STRING  NULLABLE  Model name
model_version   STRING  NULLABLE  Model version
subreddit_id    STRING  NULLABLE  Subreddit id
subreddit_name  STRING  NULLABLE  Lower case subreddit name
embedding       FLOAT   REPEATED  Subreddit embeddings
```
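Before uploading, it can help to confirm that a record has the shape this field list implies. The following is a rough sketch using only the standard library. `SCHEMA` and `record_matches_schema` are hypothetical names, not part of `subclu`, and the check ignores extra keys such as `pt` because that field is not listed in the table above.

```python
# Hypothetical mirror of the field list above: (name, Python type, is_repeated)
SCHEMA = [
    ("model_name", str, False),
    ("model_version", str, False),
    ("subreddit_id", str, False),
    ("subreddit_name", str, False),
    ("embedding", float, True),
]


def record_matches_schema(record: dict) -> bool:
    """Return True if the record is compatible with SCHEMA."""
    for name, py_type, repeated in SCHEMA:
        value = record.get(name)
        if repeated:
            # REPEATED: must be a list whose items all have the expected type
            if not isinstance(value, list):
                return False
            if not all(isinstance(v, py_type) for v in value):
                return False
        elif value is not None and not isinstance(value, py_type):
            # NULLABLE: None is allowed, otherwise the type must match
            return False
    return True


good = {
    "model_name": "example-model",
    "model_version": "v0.0.0",
    "subreddit_id": "t5_a",
    "subreddit_name": "alpha",
    "embedding": [0.1, 0.2, 0.3],
}
bad = dict(good, embedding="0.1,0.2,0.3")  # a string instead of a list of floats

print(record_matches_schema(good), record_matches_schema(bad))
```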

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime, timedelta
import os
import logging
from logging import info
from pathlib import Path

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    elapsed_time,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.1


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Load configs for embeddings to reshape & upload to BQ

The embedding aggregation should've been logged to `mlflow` so we should be able to
- make calls to mlflow to get the embeddings
- add the new embeddings format to the original job

---
We have 2 strategies for getting subreddit embeddings. For now we'll only make available the `extra weight to subreddit meta` embeddings because they performed slightly better in previous taxonomy/rating models.




In [11]:
cfg_reshape_embeddings_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0_desc_extra_weight.yaml',
    config_path="../config",
)

In [12]:
cfg_reshape_embeddings_061_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.1_desc_extra_weight.yaml',
    config_path="../config",
)

In [57]:
cfg_reshape_embeddings_061 = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.1.yaml',
    config_path="../config",
)

In [13]:
for k_, v_ in cfg_reshape_embeddings_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
    dataset_name: v0.6.0 inputs. ~110k seed subreddits, ~340k with 3+ posts, ~700k total subreddits
    bucket_name: i18n-subreddit-clustering
    folder_subreddits_text_and_meta: i18n_topic_model_batch/runs/20220811/subreddits/text
    folder_posts_text_and_meta: i18n_topic_model_batch/runs/20220811/posts
    folder_comments_text_and_meta: i18n_topic_model_batch/runs/20220811/comments
    folder_post_and_comment_text_and_meta: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all
data_embeddings_to_aggregate:
    bucket_embeddings: i18n-subreddit-clustering
    post_and_comments_folder_embeddings: i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all/embedding/2022-08-11_084218
    subreddit_desc_folder_embeddings: i18n_topic_model_batch/runs/20220811/subreddits/text/embedding/2022-08-11_082859
    col_subreddit_id: subreddit_id
aggregate_params:
    min_post_and_comment_text_len: 3
    agg_post_post_and_comment_wei

In [14]:
for k_, v_ in cfg_reshape_embeddings_061_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
#             print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: 91ac7ca171024c779c0992f59470c81b
embeddings_artifact_path: df_subs_agg_c1
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: True,
pt: 2022-11-08
model_version: v0.6.1
model_name: cau-text-mUSE extra weight for subreddit description
embeddings_config: aggregate_embeddings_v0.6.1


# Start MLflow & Get Artifact Paths

This MLflow server will help us get the artifacts & subreddit embeddings for the `mlflow_run_id` values set in the configurations loaded above.

In [16]:
mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings_wt.config_dict['mlflow_tracking_uri'])

In [22]:
# get experiments that have aggregates in them
df_exp_all = mlf.list_experiment_meta(output_format='pandas')
print(df_exp_all.shape)

df_exp_aggs = (
    df_exp_all[df_exp_all['name'].str.contains('aggregate')]
)
print(df_exp_aggs.shape)
df_exp_aggs.tail(8)

(44, 4)
(14, 4)


Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
21,21,v0.4.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/21,active
22,22,v0.4.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/22,active
28,28,v0.5.0_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/28,active
29,29,v0.5.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/29,active
34,34,v0.6.0_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/34,active
35,35,v0.6.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/35,active
39,39,v0.6.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/39,active
40,40,v0.6.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/40,active


In [35]:
%%time
# limit scope to the latest experiments with aggregate info
df_mlf = (
    mlf.search_all_runs(experiment_ids=sorted(df_exp_aggs['experiment_id'].astype(int))[-6:])
    .assign(experiment_id=lambda x: x['experiment_id'].astype(int))
)
print(df_mlf.shape)

(17, 46)
CPU times: user 79.5 ms, sys: 0 ns, total: 79.5 ms
Wall time: 78.8 ms


In [38]:
l_cols_to_drop = list()
for init_ in ['metrics.memory_', 'params.memory_', 'metrics.cpu_c', 'params.cpu_c']:
    l_cols_to_drop = l_cols_to_drop + [c for c in df_mlf if c.startswith(init_)]

# only show finished runs
# (
#     df_mlf[df_mlf['status'] == "FINISHED"]
#     .drop(columns=['status'] + l_cols_to_drop)
#     .dropna(axis=1, how='all')
#     .iloc[:5, :13]
# )

# show only runs to upload
(
    df_mlf[
        df_mlf['run_id'].isin([cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'], cfg_reshape_embeddings_wt.config_dict['mlflow_run_id']])
    ]
    .drop(columns=['status'] + l_cols_to_drop)
    .dropna(axis=1, how='all')
    .iloc[:5, :13]
)

Unnamed: 0,run_id,experiment_id,artifact_uri,start_time,end_time,metrics.df_v_subs-cols,metrics.df_v_subs-rows,metrics.time_fxn-data_loading_time,metrics.df_v_post_comments-cols,metrics.df_v_post_comments-rows,metrics.df_posts_agg_c1-rows,metrics.df_subs_agg_c1_uw-rows,metrics.df_subs_agg_c1-cols
2,91ac7ca171024c779c0992f59470c81b,40,gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts,2022-11-07 21:38:57.662000+00:00,2022-11-23 02:06:49.677000+00:00,514.0,781653.0,5.742884,515.0,53597817.0,53597817.0,781653.0,515.0
4,badc44b0e5ac467da14f710da0b410c6,35,gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts,2022-08-16 08:41:53.006000+00:00,2022-09-10 00:54:17.303000+00:00,514.0,771760.0,3.698822,515.0,51906348.0,51906348.0,771760.0,515.0


## Get paths to latest reshaped data
Since we already reshaped the data and it's in GCS, let's read that instead of reshaping it again and creating another copy.

In [39]:
%%time

l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
    full_path=True,
)
l_artifacts_all = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
    full_path=True,
)
# l_artifacts_top_level

17:38:57 | INFO | "   293 <- Artifacts to check count"
17:38:57 | INFO | "   293 <- Artifacts clean count"
17:38:57 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"
17:39:03 | INFO | "   293 <- Artifacts clean count"
17:39:03 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"


In [42]:
%%time

l_artifacts_top_level_061 = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
    full_path=True,
)

l_artifacts_all_061 = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
    full_path=True,
)

17:39:50 | INFO | "   342 <- Artifacts to check count"
17:39:50 | INFO | "   342 <- Artifacts clean count"
17:39:50 | INFO | "     9 <- Artifacts & folders at TOP LEVEL clean count"
17:39:56 | INFO | "   342 <- Artifacts clean count"
17:39:56 | INFO | "     9 <- Artifacts & folders at TOP LEVEL clean count"


CPU times: user 9 s, sys: 673 ms, total: 9.67 s
Wall time: 12.7 s


In [41]:
# get path for parquet file. This should be a full folder with multiple parquet files
path_parquet_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path']
)

folder_parquet_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path']
)
folder_parquet_full_ = [i_ for i_ in l_artifacts_top_level if i_.split('/')[-1] == folder_parquet_][0]

print(f"Folder with parquet files:\n{path_parquet_}")
print(f"\nFolder with parquet files (full):\n{folder_parquet_full_}\n")

[_ for _ in l_artifacts_all if folder_parquet_ == _.split('/')[-2]]

Folder with parquet files:
df_subs_agg_c1

Folder with parquet files (full):
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1



['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_common_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.0.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.1.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.2.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.3.parquet']

In [43]:
# get path for latest ndjson output FILE
#  NOTE: there could be multiple runs of this file so we pull the latest one
path_ndjson_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'] + '_ndjson'
)
l_ndjson_files_ = [_ for _ in l_artifacts_all if path_ndjson_ in _]
ndjson_path_ = l_ndjson_files_[-1]
print(f"ndjson Files:\n{l_ndjson_files_}")

print(f"\nFile to upload to BQ:\n{ndjson_path_}")

ndjson Files:
['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json']

File to upload to BQ:
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json


In [44]:
# get path for latest ndjson output FILE
#  NOTE: there could be multiple runs of this file so we pull the latest one
path_ndjson_061 = (
    cfg_reshape_embeddings_061_wt.config_dict['embeddings_artifact_path'] + '_ndjson'
)
l_ndjson_files_061 = [_ for _ in l_artifacts_all_061 if path_ndjson_061 in _]
ndjson_path_061 = l_ndjson_files_061[-1]
# print(f"ndjson file list:\n{l_ndjson_files_061}")

print(f"\nFile to upload to BQ:\n{ndjson_path_061}")


File to upload to BQ:
gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json
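Picking the last list element works when the artifact listing happens to be in chronological order. A slightly more defensive sketch, assuming the file names always end in a `YYYY-MM-DD_HHMMSS` stamp like the ones printed above, parses that stamp and takes the maximum. The function name and the example paths below are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath


def ndjson_timestamp(path: str) -> datetime:
    """Parse the trailing `YYYY-MM-DD_HHMMSS` stamp from an ndjson file name."""
    stem = PurePosixPath(path).stem  # file name without the ".json" suffix
    date_part, time_part = stem.rsplit("_", 2)[-2:]
    return datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H%M%S")


# Hypothetical artifact paths, deliberately listed out of order
paths = [
    "gs://example-bucket/run/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json",
    "gs://example-bucket/run/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json",
]
latest_path = max(paths, key=ndjson_timestamp)
print(latest_path)
```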


# Run new function to only upload existing reshaped data

In [45]:
%%time

load_data_to_bq_table(
    uri=ndjson_path_,
    bq_project=cfg_reshape_embeddings_wt.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings_wt.config_dict['bq_dataset'],
    bq_table_name=cfg_reshape_embeddings_wt.config_dict['bq_table'],
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings_wt.config_dict['bq_table_description'],
    update_table_description=True,
)

17:54:08 | INFO | "Loading this URI:
  gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json
Into this table:
  reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings"
17:54:08 | INFO | "Created table reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings"
17:54:09 | INFO | "  0 rows in table BEFORE adding data"
17:54:54 | INFO | "Original Table Expiration: 2023-06-02 17:54:08.511000+00:00"
17:54:55 | INFO | "NEW Table Expiration: None"
17:54:55 | INFO | "Updating subreddit description from:
  Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
to:
  Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/"
17:54:55 | INFO | "  771,760 rows in table AFTER adding data"


CPU times: user 75.3 ms, sys: 31.9 ms, total: 107 ms
Wall time: 48.7 s


In [46]:
%%time

load_data_to_bq_table(
    uri=ndjson_path_061,
    bq_project=cfg_reshape_embeddings_061_wt.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings_061_wt.config_dict['bq_dataset'],
    bq_table_name=cfg_reshape_embeddings_061_wt.config_dict['bq_table'],
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings_061_wt.config_dict['bq_table_description'],
    update_table_description=False,
)

17:55:47 | INFO | "Loading this URI:
  gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json
Into this table:
  reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings"
17:55:48 | INFO | "Table reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings already exist"
17:55:48 | INFO | "  771,760 rows in table BEFORE adding data"
17:56:32 | INFO | "Original Table Expiration: None"
17:56:33 | INFO | "NEW Table Expiration: None"
17:56:33 | INFO | "  1,553,413 rows in table AFTER adding data"


CPU times: user 56.1 ms, sys: 24.1 ms, total: 80.2 ms
Wall time: 46.6 s
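As a quick sanity check on the two loads, the number of rows added by the second upload should equal the v0.6.1 row count reported in the run metrics above (781,653). The numbers below are copied from the log output and the variable names are only illustrative.

```python
rows_after_v060 = 771_760    # "rows in table AFTER adding data" for the first load
rows_after_v061 = 1_553_413  # "rows in table AFTER adding data" for the second load
rows_v061_expected = 781_653  # row count reported for the v0.6.1 run

rows_added = rows_after_v061 - rows_after_v060
assert rows_added == rows_v061_expected
print(f"{rows_added:,} rows added by the v0.6.1 load")
```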


## Alternative -- load data into BQ using SQL

If the Python job fails because of authentication issues, we can also add the data using SQL.

**Note** that the following call assumes the table schema was set with the functions above.

```SQL

LOAD DATA INTO `reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings`
FROM FILES (
  format = 'JSON',
  uris = [
      "gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json", 
      "gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json"
  ]
);
```
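To avoid copying the paths by hand, the statement can be assembled from the ndjson paths found earlier. This is only a string-building sketch: `build_load_statement` is a hypothetical helper, the table and URIs are placeholders, and nothing here sends the query to BigQuery.

```python
from typing import List


def build_load_statement(table: str, uris: List[str]) -> str:
    """Build a BigQuery `LOAD DATA INTO ... FROM FILES` statement for JSON files."""
    uri_lines = ",\n".join(f'      "{uri}"' for uri in uris)
    return (
        f"LOAD DATA INTO `{table}`\n"
        "FROM FILES (\n"
        "  format = 'JSON',\n"
        "  uris = [\n"
        f"{uri_lines}\n"
        "  ]\n"
        ");"
    )


# Placeholder values; in the notebook these would come from the config and
# from the ndjson paths located above.
statement = build_load_statement(
    table="example-project.example_dataset.example_table",
    uris=[
        "gs://example-bucket/run_a/embeddings.json",
        "gs://example-bucket/run_b/embeddings.json",
    ],
)
print(statement)
```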


# Load SUBREDDIT embeddings from parquet files [for display]

This helps us sample what the parquet embeddings look like (columns & shape).
ETA:
- ~2 minutes: download from GCS to local cache
- ~30 seconds: loading into pandas (after download to local)

In [69]:
%%time

df_agg_sub_c = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings_061.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings_061.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c.shape)

18:21:09 | INFO | "Remote artifact path to download:
  gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_unweighted"
18:21:09 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_unweighted"
100%|##########################################| 7/7 [00:00<00:00, 23009.50it/s]
18:21:09 | INFO | "  Parquet files found:     4"
18:21:09 | INFO | "  Parquet files to use:     4"


(781653, 515)
CPU times: user 10.9 s, sys: 4.26 s, total: 15.2 s
Wall time: 8.46 s


In [68]:
%%time

df_agg_sub_c2 = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings_061_wt.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c2.shape)

18:20:38 | INFO | "Remote artifact path to download:
  gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1"
18:20:38 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1"
100%|########################################| 14/14 [00:00<00:00, 29837.53it/s]
18:20:38 | INFO | "  Parquet files found:     4"
18:20:38 | INFO | "  Parquet files to use:     4"


(781653, 515)
CPU times: user 10.9 s, sys: 5.01 s, total: 15.9 s
Wall time: 8.56 s


In [70]:
df_agg_sub_c.iloc[:5, :10]

Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
0,t5_1001tl,jewel_xo,1,-0.041036,-0.00865,-0.013949,0.034232,-0.045981,0.021885,0.024
1,t5_1004au,tisbutafleshwound,3,0.010298,-0.000277,-0.004013,0.01762,-0.076757,0.031346,0.069366
2,t5_1006a0,sethigh,1,0.028022,0.007011,-0.005661,0.025936,-0.00777,0.031322,0.056113
3,t5_1008xr,asiandiasporamusic,2,0.016526,-0.006581,0.01715,0.007448,0.029106,0.04169,0.038882
4,t5_1009a3,memesenespanol,299,-0.005113,-0.005898,-0.012267,0.006103,-0.009704,0.039449,0.010752


In [71]:
df_agg_sub_c2.iloc[:5, :10]

Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
0,t5_1001tl,jewel_xo,1,-0.028712,-0.027187,0.024826,0.046359,0.006391,0.04674,-0.005454
1,t5_1004au,tisbutafleshwound,3,0.010298,-0.000277,-0.004013,0.01762,-0.076757,0.031346,0.069366
2,t5_1006a0,sethigh,1,0.027356,0.032256,-0.022585,-0.004125,0.013944,0.039986,-0.007854
3,t5_1008xr,asiandiasporamusic,2,-0.011276,0.00072,-0.010621,0.021452,0.035787,0.040039,0.034074
4,t5_1009a3,memesenespanol,299,-0.005113,-0.005898,-0.012267,0.006103,-0.009704,0.039449,0.010752


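Check whether embedding values are the same for subreddits with 3 or more posts (`posts_for_embeddings_count >= 3`). Only subreddits with 1 or 2 posts get the additional weight, so the rest should match.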

In [72]:
n_rows_ = 6
n_cols_ = 10

display(
    df_agg_sub_c2.iloc[:n_rows_, :3].merge(
        pd.DataFrame(
            np.isclose(
                df_agg_sub_c.iloc[:n_rows_, 3:n_cols_], df_agg_sub_c2.iloc[:n_rows_, 3:n_cols_]
            ),
            columns=df_agg_sub_c2.iloc[:n_rows_, 3:n_cols_].columns,
        ),
        how='left',
        left_index=True,
        right_index=True,
    )
)
del n_rows_, n_cols_

Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
0,t5_1001tl,jewel_xo,1,False,False,False,False,False,False,False
1,t5_1004au,tisbutafleshwound,3,True,True,True,True,True,True,True
2,t5_1006a0,sethigh,1,False,False,False,False,False,False,False
3,t5_1008xr,asiandiasporamusic,2,False,False,False,False,False,False,False
4,t5_1009a3,memesenespanol,299,True,True,True,True,True,True,True
5,t5_100a1y,karstcast,7,True,True,True,True,True,True,True
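The cell above only inspects the first few rows and columns. A sketch of the same comparison over every row and every embedding column, using small made-up frames in place of `df_agg_sub_c` and `df_agg_sub_c2`, could look like this. The toy values and the helper variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the unweighted and weighted embeddings
df_unweighted = pd.DataFrame(
    {
        "subreddit_id": ["t5_a", "t5_b", "t5_c", "t5_d"],
        "posts_for_embeddings_count": [1, 3, 2, 50],
        "embeddings_0": [0.10, 0.20, 0.30, 0.40],
        "embeddings_1": [0.50, 0.60, 0.70, 0.80],
    }
)
df_weighted = df_unweighted.copy()
# In this toy example only the subreddits with 1 or 2 posts differ
df_weighted.loc[[0, 2], ["embeddings_0", "embeddings_1"]] = [[0.11, 0.52], [0.33, 0.71]]

emb_cols = [c for c in df_unweighted.columns if c.startswith("embeddings_")]

# Align on subreddit_id instead of relying on identical row order
merged = df_unweighted.merge(
    df_weighted,
    on=["subreddit_id", "posts_for_embeddings_count"],
    suffixes=("_uw", "_wt"),
)

# True where ALL embedding dimensions of a subreddit are (numerically) equal
rows_match = np.isclose(
    merged[[f"{c}_uw" for c in emb_cols]].to_numpy(),
    merged[[f"{c}_wt" for c in emb_cols]].to_numpy(),
).all(axis=1)

has_3_plus = (merged["posts_for_embeddings_count"] >= 3).to_numpy()
print("match rate, 3+ posts: ", rows_match[has_3_plus].mean())
print("match rate, 1-2 posts:", rows_match[~has_3_plus].mean())
```

Merging on `subreddit_id` guards against the two frames being sorted differently, which a positional comparison would not catch.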


## Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [73]:
df_agg_sub_c2['posts_for_embeddings_count'].describe()

count    781653.000000
mean         68.569835
std         487.025214
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [74]:
assert np.allclose(
    df_agg_sub_c['posts_for_embeddings_count'].describe(), 
    df_agg_sub_c2['posts_for_embeddings_count'].describe()
)

df_agg_sub_c['posts_for_embeddings_count'].describe()

count    781653.000000
mean         68.569835
std         487.025214
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [75]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c2['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,232401,29.7%,232401,29.7%
5 posts,23732,3.0%,256133,32.8%
4 posts,33723,4.3%,289856,37.1%
3 posts,57946,7.4%,347802,44.5%
2 posts,128068,16.4%,475870,60.9%
1 post,235794,30.2%,711664,91.0%
0 posts,69989,9.0%,781653,100.0%


In [76]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4 posts', '5 posts', '6+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
6+ posts,232401,29.7%,232401,29.7%
5 posts,23732,3.0%,256133,32.8%
4 posts,33723,4.3%,289856,37.1%
3 posts,57946,7.4%,347802,44.5%
2 posts,128068,16.4%,475870,60.9%
1 post,235794,30.2%,711664,91.0%
0 posts,69989,9.0%,781653,100.0%
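In both tables the cumulative count at the `3 posts` row (347,802) is in line with the expected ~340k subreddits with 3+ posts. The small sketch below, on made-up counts, shows how the `pd.cut` bins used above assign labels and how the 3+ count can be computed directly instead of read off the cumulative column.

```python
import numpy as np
import pandas as pd

# Hypothetical post counts for 8 subreddits
posts = pd.Series([0, 1, 1, 2, 3, 4, 5, 120], name="posts_for_embeddings_count")

# Bins are right-closed by default: (-1, 0], (0, 1], ..., (5, inf]
binned = pd.cut(
    posts,
    bins=[-1, 0, 1, 2, 3, 4, 5, np.inf],
    labels=["0 posts", "1 post", "2 posts", "3 posts", "4 posts", "5 posts", "6+ posts"],
)
print(binned.value_counts(sort=False))

# Direct count of subreddits with 3 or more posts
n_3_plus = int((posts >= 3).sum())
print(f"{n_3_plus} of {len(posts)} subreddits have 3+ posts")
```

Starting the bins at `-1` is what lets a count of exactly 0 fall into the `0 posts` bucket, since the left edge of each bin is excluded.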
