# Purpose


2022-07-07:
The data we pulled to build the v0.5.0 model was in a temp table that has since been deleted. We still have the post & comment text in GCS, but the post metadata (vote counts, NSFW tag, raw OCR, etc.) is gone.

Here we'll check the subreddit metadata to find the date range of the original pull, so that we can create a new table with the metadata needed for the posts used in the v0.5.0 model.

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [58]:
from datetime import datetime
import gc
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu

from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    get_venn_sets2,
)
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.i18n_topic_model_batch.subclu2.utils.data_loaders_gcs import (
    LoadSubredditsGCS
)



print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.5.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Load config & set paths 

Let's use the config to make sure we're looking at the right data without hard-coding paths.

In [41]:
path_local_cache_ = "/home/jupyter/subreddit_clustering_i18n/data/local_cache/"

In [42]:
cfg_v050_text_and_meta = LoadHydraConfig(
    config_name='v0.5.0_model.yaml',
    config_path='../i18n_topic_model_batch/subclu2/config/data_text_and_metadata',
)
cfg_v050_text_and_meta.config_dict

{'dataset_name': 'v0.5.0 inputs. ~80k seed subreddits, ~190k total subreddits',
 'bucket_name': 'i18n-subreddit-clustering',
 'folder_subreddits_text_and_meta': 'i18n_topic_model_batch/runs/20220629/subreddits/text',
 'folder_subreddits_text_and_meta_alt': 'i18n_topic_model_batch/runs/20220707/subreddits/text',
 'folder_posts_text_and_meta': 'i18n_topic_model_batch/runs/20220707/posts',
 'folder_comments_text_and_meta': 'i18n_topic_model_batch/runs/20220707/comments',
 'folder_post_and_comment_text_and_meta': 'i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds',
 'folder_post_and_comment_text_and_meta_alt': 'i18n_topic_model_batch/runs/20220707/post_and_comment_text_combined/text_subreddit_seeds',
 'folder_post_and_comment_text_and_meta_non_seed': 'i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_non_subreddit_seeds'}
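
As a quick sanity check, the config values can be joined into full GCS URIs. This is a minimal sketch that assumes a plain dict with the same keys as `config_dict` above. The helper name `gcs_uri` and the `gs://` join are illustrative and not part of `subclu`.

```python
# Hypothetical helper (not part of subclu): build a gs:// URI from config values.
# The dict below copies three entries from the config output above.
config_dict = {
    'bucket_name': 'i18n-subreddit-clustering',
    'folder_subreddits_text_and_meta': 'i18n_topic_model_batch/runs/20220629/subreddits/text',
    'folder_posts_text_and_meta': 'i18n_topic_model_batch/runs/20220707/posts',
}


def gcs_uri(cfg: dict, folder_key: str) -> str:
    """Join the bucket name and one folder entry from the config into a gs:// URI."""
    return f"gs://{cfg['bucket_name']}/{cfg[folder_key].strip('/')}"


for key in ('folder_subreddits_text_and_meta', 'folder_posts_text_and_meta'):
    print(f"{key}: {gcs_uri(config_dict, key)}")
```

Reading folders by key keeps the old (20220629) and new (20220707) runs easy to tell apart in later cells.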

# Load metadata + embeddings

For the v0.5.0 embeddings I didn't use mlflow to track inference, so we'll need to get them from these folders in GCS:

- [Subreddit metadata](https://console.cloud.google.com/storage/browser/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text/embedding/2022-06-29_084555)
    - `i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text/embedding/2022-06-29_084555`
- [Post + Comment Text (already combined)](https://console.cloud.google.com/storage/browser/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds/embedding/2022-06-29_091925)
    - `i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds/embedding/2022-06-29_091925`



## Subreddit meta

The subreddit metadata should contain the date range we need to re-pull the lost post metadata.

In [46]:
%%time

df_meta_subs = LoadSubredditsGCS(
    bucket_name=cfg_v050_text_and_meta.config_dict['bucket_name'],
    gcs_path=cfg_v050_text_and_meta.config_dict['folder_subreddits_text_and_meta'],
    local_cache_path=path_local_cache_,
    df_format='pandas',
    unique_check=False,
).read_as_one_df()

print(df_meta_subs.shape)

21:22:39 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text"
21:22:39 | INFO | "  6 <- Files matching prefix"
21:22:39 | INFO | "  6 <- Files to check"
21:22:40 | INFO | "  Files already cached: 0"
21:22:40 | INFO | "0:00:02.697869  <- Downloading files elapsed time"


(196371, 46)
CPU times: user 1.62 s, sys: 1 s, total: 2.62 s
Wall time: 3.45 s


In [48]:
df_meta_subs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196371 entries, 0 to 196370
Data columns (total 46 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   i18n_type                                 163 non-null     object 
 1   subreddit_id                              196371 non-null  object 
 2   subreddit_name                            196371 non-null  object 
 3   subreddit_seed_for_clusters               196371 non-null  bool   
 4   geo_relevant_country_count                36710 non-null   float64
 5   geo_relevant_countries                    36710 non-null   object 
 6   geo_relevant_country_codes                36710 non-null   object 
 7   users_l7                                  196371 non-null  int64  
 8   posts_not_removed_l28                     196371 non-null  int64  
 9   subscribers                               196371 non-null  int64  
 10  over_18             

In [49]:
l_dt_cols = [
    'pt',
    'successful_post_start_date'
]
counts_describe(df_meta_subs[l_dt_cols])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
pt,object,196371,1,0.00%,0,0.00%
successful_post_start_date,object,196371,1,0.00%,0,0.00%


In [50]:
df_meta_subs[l_dt_cols].head()

Unnamed: 0,pt,successful_post_start_date
0,2022-06-27,2022-05-29
1,2022-06-27,2022-05-29
2,2022-06-27,2022-05-29
3,2022-06-27,2022-05-29
4,2022-06-27,2022-05-29
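
Both columns hold a single value, so the pull window can be read off directly. The sketch below assumes that `successful_post_start_date` is the first day of the window and `pt` is the last day, both inclusive. That reading of the column names is an assumption, not something the table confirms.

```python
from datetime import date

# Single unique values taken from the output above
pt = date.fromisoformat('2022-06-27')
successful_post_start_date = date.fromisoformat('2022-05-29')

# Assumption: both ends of the window are inclusive
window_days = (pt - successful_post_start_date).days + 1
print(f"Window: {successful_post_start_date} to {pt} ({window_days} days)")
```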


## Read post+comments combined text
This is the data that was used to create the embeddings.

In [51]:
%%time

df_pc_combined = LoadSubredditsGCS(
    bucket_name=cfg_v050_text_and_meta.config_dict['bucket_name'],
    gcs_path=cfg_v050_text_and_meta.config_dict['folder_post_and_comment_text_and_meta'],
    local_cache_path=path_local_cache_,
    df_format='pandas',
    unique_check=False,
).read_as_one_df()

print(df_pc_combined.shape)

21:24:24 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds"
21:24:24 | INFO | "  74 <- Files matching prefix"
21:24:24 | INFO | "  74 <- Files to check"
21:25:34 | INFO | "  Files already cached: 0"
21:25:34 | INFO | "0:01:11.680874  <- Downloading files elapsed time"


(16360314, 8)
CPU times: user 1min 52s, sys: 1min 13s, total: 3min 5s
Wall time: 1min 49s


In [53]:
df_pc_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16360314 entries, 0 to 643615
Data columns (total 8 columns):
 #   Column                           Dtype  
---  ------                           -----  
 0   subreddit_seed_for_clusters      bool   
 1   subreddit_id                     object 
 2   subreddit_name                   object 
 3   post_id                          object 
 4   net_upvotes_lookup               int64  
 5   comment_for_embedding_count      float64
 6   post_and_comment_text_clean_len  int64  
 7   post_and_comment_text_clean      object 
dtypes: bool(1), float64(1), int64(2), object(4)
memory usage: 1014.2+ MB


# Compare subreddit meta


In [54]:
%%time

df_meta_subs_alt = LoadSubredditsGCS(
    bucket_name=cfg_v050_text_and_meta.config_dict['bucket_name'],
    gcs_path=cfg_v050_text_and_meta.config_dict['folder_subreddits_text_and_meta_alt'],
    local_cache_path=path_local_cache_,
    df_format='pandas',
    unique_check=False,
).read_as_one_df()

print(df_meta_subs_alt.shape)

21:28:22 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220707/subreddits/text"
21:28:22 | INFO | "  1 <- Files matching prefix"
21:28:22 | INFO | "  1 <- Files to check"
21:28:24 | INFO | "  Files already cached: 0"
21:28:24 | INFO | "0:00:04.129538  <- Downloading files elapsed time"


(202889, 46)
CPU times: user 1.7 s, sys: 1.29 s, total: 2.99 s
Wall time: 4.88 s


## Compare subreddit meta for new run & old run

The new run contains more subreddits (202,889 vs. 196,371). Maybe there were more posts created around the datetime cut-off last time?

In [56]:
df_meta_subs_alt.shape

(202889, 46)

In [57]:
df_meta_subs.shape

(196371, 46)

In [55]:
df_meta_subs_alt.equals(df_meta_subs)

False

### What's overlap between new & old run?

Over 6k subreddits (6,518) are in the new run but not in the old one, and no subreddits from the old run are missing.
It's unclear whether this comes from a rating change or something else.

In [77]:
d_v_subs = get_venn_sets2(
    df_meta_subs['subreddit_id'].unique(),
    df_meta_subs_alt['subreddit_id'].unique(),
    a_name='old',
    b_name='new',
)
d_v_subs.keys()

  196,371 <- old
  202,889 <- new
  202,889 <- old + new
        0 <- old_only
  196,371 <- old_and_new
    6,518 <- new_only


dict_keys(['old_only', 'old_and_new', 'new_only'])
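
For readers without access to `subclu`, the comparison can be approximated with plain Python sets. This is a rough stand-in and not the actual `get_venn_sets2` implementation. The function name `venn_sets` and the toy IDs are made up for illustration.

```python
def venn_sets(a, b, a_name='a', b_name='b'):
    """Return the three regions of a two-set Venn diagram as a dict of sets."""
    set_a, set_b = set(a), set(b)
    regions = {
        f"{a_name}_only": set_a - set_b,
        f"{a_name}_and_{b_name}": set_a & set_b,
        f"{b_name}_only": set_b - set_a,
    }
    # Print sizes in a layout similar to the output above
    print(f"{len(set_a):>10,} <- {a_name}")
    print(f"{len(set_b):>10,} <- {b_name}")
    print(f"{len(set_a | set_b):>10,} <- {a_name} + {b_name}")
    for name, ids in regions.items():
        print(f"{len(ids):>10,} <- {name}")
    return regions


# Toy subreddit IDs: the "new" run is a superset of the "old" run
d_toy = venn_sets(['t5_a', 't5_b'], ['t5_a', 't5_b', 't5_c'], a_name='old', b_name='new')
```

An empty `old_only` set is what tells us the new run did not lose any subreddits.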

In [72]:
df_meta_subs_explore = df_meta_subs_alt[
    (df_meta_subs_alt['subreddit_id'].isin(d_v_subs['new_only'])) &
    (df_meta_subs_alt['posts_not_removed_l28'] >= 5) &
    (df_meta_subs_alt['subscribers'] >= 5)
]
print(df_meta_subs_explore.shape)
# df_meta_subs_explore.head(10)

(115, 46)


### Let's check diff for only the seed subreddits
These are the most important ones for the model.

The good news is that the new run includes more seed subreddits (82,950 vs. 81,974) and is not missing any of the old ones.

In [73]:
value_counts_and_pcts(
    df_meta_subs['subreddit_seed_for_clusters']
)

Unnamed: 0,subreddit_seed_for_clusters-count,subreddit_seed_for_clusters-percent,subreddit_seed_for_clusters-pct_cumulative_sum
False,114397,58.3%,58.3%
True,81974,41.7%,100.0%


In [74]:
value_counts_and_pcts(
    df_meta_subs_alt['subreddit_seed_for_clusters']
)

Unnamed: 0,subreddit_seed_for_clusters-count,subreddit_seed_for_clusters-percent,subreddit_seed_for_clusters-pct_cumulative_sum
False,119939,59.1%,59.1%
True,82950,40.9%,100.0%


In [76]:
mask_seed_subs_ = df_meta_subs['subreddit_seed_for_clusters'] == True 
mask_seed_subs_alt_ = df_meta_subs_alt['subreddit_seed_for_clusters'] == True

d_v_subs_seeds = get_venn_sets2(
    df_meta_subs[mask_seed_subs_]['subreddit_id'].unique(),
    df_meta_subs_alt[mask_seed_subs_alt_]['subreddit_id'].unique(),
    a_name='old',
    b_name='new',
)
d_v_subs_seeds.keys()

   81,974 <- old
   82,950 <- new
   82,950 <- old + new
        0 <- old_only
   81,974 <- old_and_new
      976 <- new_only


dict_keys(['old_only', 'old_and_new', 'new_only'])

# Compare post+comments files
The text could be slightly different (e.g., new comments), but we should see at least 99% of the same post IDs, because we need the metadata to filter out NSFW posts for indexing.

In [78]:
%%time

df_pc_combined_alt = LoadSubredditsGCS(
    bucket_name=cfg_v050_text_and_meta.config_dict['bucket_name'],
    gcs_path=cfg_v050_text_and_meta.config_dict['folder_post_and_comment_text_and_meta_alt'],
    local_cache_path=path_local_cache_,
    df_format='pandas',
    unique_check=False,
).read_as_one_df()

print(df_pc_combined_alt.shape)

21:58:35 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220707/post_and_comment_text_combined/text_subreddit_seeds"
21:58:35 | INFO | "  36 <- Files matching prefix"
21:58:35 | INFO | "  36 <- Files to check"
21:59:50 | INFO | "  Files already cached: 0"
21:59:50 | INFO | "0:01:16.524619  <- Downloading files elapsed time"


(16874763, 8)
CPU times: user 2min, sys: 1min 22s, total: 3min 23s
Wall time: 1min 56s


## Compare post+comments IDs

In [80]:
df_pc_combined_alt.shape

(16874763, 8)

In [79]:
df_pc_combined.shape

(16360314, 8)

### What's overlap between new & old run?

For our use cases, it's ok for the new run to include more posts.

However, ~207k posts from the old run (about 1.3% of the old post IDs) are not in the new post+comment text table.

In [82]:
%%time
d_v_pc = get_venn_sets2(
    df_pc_combined['post_id'].unique(),
    df_pc_combined_alt['post_id'].unique(),
    a_name='old',
    b_name='new',
)
d_v_pc.keys()

16,360,314 <- old
16,874,763 <- new
17,081,905 <- old + new
  207,142 <- old_only
16,153,172 <- old_and_new
  721,591 <- new_only
CPU times: user 1min 7s, sys: 1.86 s, total: 1min 9s
Wall time: 1min 9s


dict_keys(['old_only', 'old_and_new', 'new_only'])
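
To compare against the 99% target, the share of old post IDs found in the new run can be computed from the counts printed above. The numbers are copied from that output, and the variable names are illustrative.

```python
# Counts copied from the get_venn_sets2 output above
n_old = 16_360_314
n_old_and_new = 16_153_172
n_old_only = 207_142

# The two regions of the old set should add up to the old total
assert n_old_and_new + n_old_only == n_old

pct_found = n_old_and_new / n_old
print(f"{pct_found:.2%} of old post IDs are in the new run")
print(f"Meets the 99% target: {pct_found >= 0.99}")
```

The overlap comes out at roughly 98.7%, slightly under the 99% target.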

## Read posts meta



In [83]:
%%time

df_posts = LoadSubredditsGCS(
    bucket_name=cfg_v050_text_and_meta.config_dict['bucket_name'],
    gcs_path=cfg_v050_text_and_meta.config_dict['folder_posts_text_and_meta'],
    local_cache_path=path_local_cache_,
    df_format='pandas',
    unique_check=False,
).read_as_one_df()

print(df_posts.shape)

22:08:01 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220707/posts"
22:08:01 | INFO | "  75 <- Files matching prefix"
22:08:01 | INFO | "  75 <- Files to check"
22:10:10 | INFO | "  Files already cached: 0"
22:10:10 | INFO | "0:02:11.730905  <- Downloading files elapsed time"


(17735766, 46)
CPU times: user 4min 8s, sys: 2min 34s, total: 6min 42s
Wall time: 3min 51s


In [85]:
df_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17735766 entries, 0 to 280378
Data columns (total 46 columns):
 #   Column                                      Dtype         
---  ------                                      -----         
 0   rank_post_in_sub                            int64         
 1   subreddit_id                                object        
 2   subreddit_name                              object        
 3   post_id                                     object        
 4   user_id                                     object        
 5   submit_date                                 object        
 6   endpoint_timestamp                          datetime64[ns]
 7   geo_country_code                            object        
 8   is_deleted                                  int64         
 9   removed                                     int64         
 10  neutered                                    bool          
 11  content_category                            object

### Maybe some posts didn't make the cutoff based on text length
If so, we might still have their metadata in the posts-only table.

In practice, that barely helps: 207,137 old post IDs are still missing from the new posts table, only 5 fewer than in the post+comment comparison (207,142). Maybe these posts were removed, neutered, or deleted since the last run.

In [84]:
%%time
d_v_post_v_pc = get_venn_sets2(
    df_pc_combined['post_id'].unique(),
    df_posts['post_id'].unique(),
    a_name='pc_old',
    b_name='posts_new',
)
d_v_post_v_pc.keys()

 16,360,314 <- pc_old
 17,735,766 <- posts_new
 17,942,903 <- pc_old + posts_new
    207,137 <- pc_old_only
 16,153,177 <- pc_old_and_posts_new
  1,582,589 <- posts_new_only
CPU times: user 1min 8s, sys: 2.76 s, total: 1min 10s
Wall time: 1min 10s


dict_keys(['pc_old_only', 'pc_old_and_posts_new', 'posts_new_only'])
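
One way to rebuild the metadata table is a left join from the old post IDs onto the new posts table, flagging the posts that have no match. The sketch below uses toy data. The column names `post_id`, `removed`, and `neutered` come from the `df_posts.info()` output above, while the values, the `has_meta` flag, and the dataframe names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_pc_combined (old run) and df_posts (new run)
df_old_toy = pd.DataFrame({'post_id': ['t3_a', 't3_b', 't3_c']})
df_posts_toy = pd.DataFrame({
    'post_id': ['t3_a', 't3_b', 't3_d'],
    'removed': [0, 0, 1],
    'neutered': [False, False, True],
})

# Left join keeps every old post; indicator=True adds a `_merge` column
# that records whether a match was found in the new table
df_recovered = df_old_toy.merge(df_posts_toy, how='left', on='post_id', indicator=True)
df_recovered['has_meta'] = df_recovered['_merge'] == 'both'
df_recovered = df_recovered.drop(columns='_merge')

print(df_recovered)
print(df_recovered['has_meta'].value_counts())
```

Keeping the unmatched rows (`has_meta == False`) in the table makes it possible to exclude them explicitly later, for example when filtering NSFW posts for indexing.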


## Read comments meta


In [19]:
# TODO: load the comments metadata (config key: 'folder_comments_text_and_meta')