# Purpose


2021-10-06:
We're going back to pandas now that I have a VM with a ton of RAM.

There might be some tweaks needed to batch a few subreddits at a time, but at least we can get more consistent state/progress than with `dask`.

---
2021-10-06:
The job with dask failed silently - even with 3+ TB of RAM.  `Dask` was reporting that saving was complete - but it only saved one `parquet` file instead of hundreds of files.

New direction: now that I have access to a large VM, I might as well try to go back and do the calculations in memory (in pandas).


---
2021-10-05:
I ran into memory errors with 600GB of RAM, so here's a try with 1.4TB... If this doesn't work, then I don't know what will...

---

2021-08-10: Finally completed testing with sampling <= 10 files. Now ready to run process on full data!

Ended up doing it all in dask + pandas + numpy because of problems installing `cuDF`.

---
2021-08-02: Now that I'm processing millions of comments and posts, I need to re-write the functions to try to do some work in parallel and reduce the amount of data loaded in RAM.

- `Dask` seems like a great option to load data and only compute some of it as needed.
- `cuDF` could be a way to speed up some computation using GPUs
- `Dask-delayed` could be a way to create a task DAG lazily before computing all the aggregates.


---

In notebook 09 I combined embeddings from posts & subreddits (`djb_09.00-combine_post_and_comments_and_visualize_for_presentation.ipynb`).

In this notebook I'll be testing functions that include mlflow so that it's easier to try a lot of different weights to find better representations.

Take embeddings created by other models & combine them:
```
new post embeddings = post + comments + subreddit description

new subreddit embeddings = new posts (weighted by post length or upvotes?)
```
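
A minimal sketch of the weighting idea above, not the actual `subclu` implementation. The 70/20/10 split mirrors the `aggregate_params` printed later in this notebook. The function names and the choice to drop the comment weight (and re-normalize) when a post has no comments are assumptions made for illustration.

```python
import numpy as np

# Assumed weights: they mirror the 70/20/10 split shown in the config output,
# but the exact handling inside `subclu` may differ.
WEIGHTS = {"post": 0.7, "comments": 0.2, "subreddit_desc": 0.1}


def combine_post_embedding(post_vec, comment_vecs, sub_desc_vec, weights=WEIGHTS):
    """Weighted average of a post, the mean of its comments, and its subreddit description.

    Assumption: if the post has no comments, the comment weight is dropped
    and the remaining weights are re-normalized.
    """
    parts = [
        (weights["post"], np.asarray(post_vec, dtype=np.float64)),
        (weights["subreddit_desc"], np.asarray(sub_desc_vec, dtype=np.float64)),
    ]
    if len(comment_vecs) > 0:
        # Simple (unweighted) mean over all comments of the post
        parts.append((weights["comments"], np.mean(comment_vecs, axis=0)))
    total_weight = sum(w for w, _ in parts)
    return sum(w * v for w, v in parts) / total_weight


def combine_subreddit_embedding(post_vecs, post_weights=None):
    """Subreddit embedding as an (optionally weighted) mean of its post embeddings."""
    return np.average(np.asarray(post_vecs, dtype=np.float64), axis=0, weights=post_weights)
```

Passing `post_weights` (e.g. post length or upvotes) is one way to try the "weighted by post length or upvotes?" idea; leaving it as `None` gives a plain mean.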

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import logging
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.models.aggregate_embeddings import (
    AggregateEmbeddings, AggregateEmbeddingsConfig,
    load_config_agg_jupyter, get_dask_df_shape,
)
from subclu.models import aggregate_embeddings_pd

from subclu.utils import set_working_directory
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks
from subclu.eda.aggregates import (
    compare_raw_v_weighted_language
)
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl
)


print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Get runs that we can use for embeddings aggregation jobs

In [6]:
%%time

df_mlf_runs =  mlf.search_all_runs(experiment_ids=[13, 14, 15, 16])
df_mlf_runs.shape

CPU times: user 331 ms, sys: 6.46 ms, total: 337 ms
Wall time: 336 ms


(77, 128)

In [7]:
mask_finished = df_mlf_runs['status'] == 'FINISHED'
# NOTE: despite the `1M` in the name, the threshold below is 1e5 (100k) rows
mask_output_over_1M_rows = (
    (df_mlf_runs['metrics.df_vect_posts_rows'] >= 1e5) |
    (df_mlf_runs['metrics.df_vect_comments'] >= 1e5)
)
# df_mlf_runs[mask_finished].shape

df_mlf_use_for_agg = df_mlf_runs[mask_output_over_1M_rows]
df_mlf_use_for_agg.shape

(3, 128)

In [8]:
cols_with_multiple_vals = df_mlf_use_for_agg.columns[df_mlf_use_for_agg.nunique(dropna=False) > 1]
# len(cols_with_multiple_vals)

style_df_numeric(
    df_mlf_use_for_agg
    [cols_with_multiple_vals]
    .drop(['artifact_uri', 'end_time',
           # 'start_time',
           ], 
          axis=1)
    .dropna(axis='columns', how='all')
    .iloc[:, :30]
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,run id,experiment id,start time,metrics.total comment files processed,metrics.vectorizing time minutes comments,metrics.vectorizing time minutes full function,metrics.df vect comments,params.n comment files slice end,params.n comment files slice start,params.tf batch inference rows,params.n sample comment files,tags.mlflow.runName,tags.model version
51,deb3454ece2a4a8d8e4149c2d8494c0d,14,2021-10-05 01:44:32.386000+00:00,15.0,39.11,45.94,10121046.0,,,3200,15,comments_batch_01-2021-10-05_014431,
52,5f10cd75334142168a6ebb787e477c1f,14,2021-10-05 00:22:20.334000+00:00,20.0,47.64,57.33,13558304.0,,,4200,20,comments_batch_01-2021-10-05_002219,0.4.0
56,9a27f9a72cf348c98d50f486abf3b009,13,2021-10-04 22:21:46.401000+00:00,2.0,3.93,5.03,1286661.0,,,6000,2,posts_as_comments_full_text-2021-10-04_222146,


# Load configs for aggregation jobs

`n_sample_comments_files` and `n_sample_posts_files` allow us to only load a few files at a time (e.g., 2 instead of 50) to test the process end-to-end.

---
Note that by default `hydra` is a CLI tool. If we want to use it in Jupyter, we need to manually initialize configs & compose the configuration. See my custom function `load_config_agg_jupyter`. Also see:
- [Notebook with `Hydra` examples in a notebook](https://github.com/facebookresearch/hydra/blob/master/examples/jupyter_notebooks/compose_configs_in_notebook.ipynb).
- [Hydra docs, Hydra in Jupyter](https://hydra.cc/docs/next/advanced/jupyter_notebooks/).


In [9]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=4',     # 
               'n_sample_comments_files=4',  # 6 is limit for logging unique counts at comment level
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

config_full_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_full}",
               'n_sample_posts_files=null', 
               'n_sample_comments_files=null',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

# config_full_lc_true = AggregateEmbeddingsConfig(
#     config_path="../config",
#     config_name='aggregate_embeddings',
#     overrides=[f"mlflow_experiment={mlflow_experiment_full}",
#                'n_sample_posts_files=null', 
#                'n_sample_comments_files=null',
#                'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_true',
#               ]
# )
# pprint(config_test_sample_lc_false.config_dict, indent=2)

In [10]:
# config_test_sample_lc_false.config_flat,

In [11]:
df_configs = pd.DataFrame(
    [
        config_test_sample_lc_false.config_flat,
        # config_test_full_lc_false.config_flat,
        config_full_lc_false.config_flat,
        # config_full_lc_true.config_flat,
    ]
)

In [12]:
# We can't use (df_configs.nunique(dropna=False) > 1)
#  because when a col's content is a list or something unhashable, we get an error
#  so instead we'll check each column individually

# cols_with_diffs_config = df_configs.columns[df_configs.nunique(dropna=False) > 1]
cols_with_diffs_config = list()
for c_ in df_configs.columns:
    try:
        if df_configs[c_].nunique() > 1:
            cols_with_diffs_config.append(c_)
    except TypeError:
        cols_with_diffs_config.append(c_)
        

df_configs[cols_with_diffs_config]

Unnamed: 0,comments_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids_lowercase,subreddit_meta_vectorized_mlflow_uuids,subreddit_meta_vectorized_mlflow_uuids_lowercase,comments_uuid,mlflow_experiment
0,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates_test
1,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates
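
The loop above adds every unhashable (list-valued) column to the diff list, even when both configs hold the same list, which is why the identical UUID columns show up in the table. One possible alternative, sketched here on a toy frame, is to compare the string form of each cell. `df_configs_demo` and its column values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `df_configs`: one list-valued column and two scalar columns.
df_configs_demo = pd.DataFrame(
    [
        {"uuids": ["a1", "b2"], "mlflow_experiment": "agg_test", "n_files": 4},
        {"uuids": ["a1", "b2"], "mlflow_experiment": "agg_full", "n_files": 2},
    ]
)

# Casting each cell to `str` gives hashable values, so `nunique` no longer
# raises a TypeError on list-valued columns.
n_unique = df_configs_demo.astype(str).nunique(dropna=False)
cols_with_diffs_demo = n_unique[n_unique > 1].index.tolist()
print(cols_with_diffs_demo)
```

The trade-off is that two values with the same string form (e.g. `1` and `'1'`) count as equal, which seems acceptable for a quick config diff.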


In [13]:
# pprint(config_test_sample_lc_false.config_flat, indent=2)

# Run Full data with `lower_case=False`

The logic for sampling files and downloading/caching them locally lives in the custom `mlf` (`MlflowLogger`) object.

Caching can save 9+ minutes compared with downloading the files from GCS every time.
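
For reference, the caching idea boils down to "download only if the local copy is missing". The sketch below is not the `MlflowLogger` code; `get_cached_file` and `download_fn` are hypothetical names, and the real downloader would be whatever pulls the artifact from GCS.

```python
from pathlib import Path
from typing import Callable


def get_cached_file(
    remote_name: str,
    cache_dir: Path,
    download_fn: Callable[[str, Path], None],
) -> Path:
    """Return the local path for `remote_name`, downloading it only on a cache miss.

    `download_fn(remote_name, local_path)` is a hypothetical callable that
    writes the remote file to `local_path`.
    """
    local_path = Path(cache_dir) / remote_name
    if not local_path.exists():
        # Cache miss: create the folder structure and fetch the file once
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download_fn(remote_name, local_path)
    return local_path
```

Note that an existence check alone cannot detect a partially downloaded file, so a truncated download would be served from the cache until the file is deleted.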

In [14]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_full_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        for k2_, v2_ in v_.items():
            print(f"  {k2_}: \t{v2_}")
    else:
        print(f"{k_}: \t{v_}")

mlflow_experiment: 	v0.4.0_use_multi_aggregates
n_sample_posts_files: 	None
n_sample_comments_files: 	None

aggregate_params:
  min_comment_text_len: 	2
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
calculate_similarites: 	True


In [21]:
BREAK

In [None]:
%%time

try:
    job_agg1._send_log_file_to_mlflow()
    mlflow.end_run("FAILED")
    # run setup_logging() to remove logging to the file of a failed job
    setup_logging()
    
    del job_agg1
    del d_dfs1
except NameError:
    pass
gc.collect()

mlflow.end_run("FAILED")


job_agg1 = aggregate_embeddings_pd.AggregateEmbeddings(
    run_name=f"agg_full_lc_false_pd-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_full_lc_false.config_flat
)
job_agg1.run_aggregation()

gc.collect()

10:27:33 | INFO | "== Start run_aggregation() method =="
10:27:33 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
10:27:33 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
10:27:33 | INFO | "cpu_count: 160"
10:27:33 | INFO | "RAM stats:
{'memory_used_percent': '1.67%', 'memory_total': '3,874,634', 'memory_used': '64,759', 'memory_free': '3,681,161'}"
10:27:33 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-10-12_102733-agg_full_lc_false-2021-10-12_102732"
10:27:33 | INFO | "  Saving config to local path..."
10:27:33 | INFO | "  Logging config to mlflow..."
10:27:34 | INFO | "-- Start _load_raw_embeddings() method --"
10:27:34 | INFO | "Loading subreddit description embeddings..."
10:27:35 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14

In [19]:
job_agg1._send_log_file_to_mlflow()
gc.collect()

16:40:44 | INFO | "Could NOT log to MLFLow, there's no active run."


41

In [21]:
gc.collect()

174

In [22]:
150 * 4

600

# Run full data, `lower_case=True`

The corrupted-file problem I ran into was probably caused by a bad download of the file(s). Fix: delete the local cache and download the files again.
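
A small helper for that fix could look like the sketch below. `reset_local_cache` is a hypothetical name, and the directory to pass in is whichever `local_cache` subfolder holds the suspect files.

```python
import shutil
from pathlib import Path


def reset_local_cache(cache_dir) -> bool:
    """Delete `cache_dir` and recreate it empty so files are downloaded again.

    Returns True if a previous cache folder existed and was removed.
    """
    cache_dir = Path(cache_dir)
    existed = cache_dir.exists()
    if existed:
        # Remove the folder and everything in it (including partial downloads)
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return existed
```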

In [None]:
BREAK

In [None]:
# %%time

# mlflow.end_run("FAILED")
# gc.collect()
# try:
#     # run setup_logging() to remove logging to the file of a failed job
#     setup_logging()
    
#     del job_agg2
#     del d_dfs2
# except NameError:
#     pass
# gc.collect()

# job_agg2 = AggregateEmbeddings(
#     run_name=f"full_lc_true-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
#     **config_full_lc_true.config_flat
# )
# job_agg2.run_aggregation()

# gc.collect()

15:47:51 | INFO | "== Start run_aggregation() method =="
15:47:51 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
15:47:52 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-08-10_154752-full_lc_true-2021-08-10_154751"
15:47:52 | INFO | "  Saving config to local path..."
15:47:52 | INFO | "  Logging config to mlflow..."
15:47:52 | INFO | "-- Start _load_raw_embeddings() method --"
15:47:52 | INFO | "Loading subreddit description embeddings..."
15:47:53 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_subreddits_description"


  0%|          | 0/4 [00:00<?, ?it/s]

15:47:54 | INFO | "  Reading 1 files"
15:47:55 | INFO | "       3,767 |  513 <- Raw vectorized subreddit description shape"
15:47:56 | INFO | "Loading POSTS embeddings..."
15:47:57 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_posts"


  0%|          | 0/51 [00:00<?, ?it/s]

15:48:44 | INFO | "  Reading 48 files"
15:48:47 | INFO | "   1,649,929 |  514 <- Raw POSTS shape"
15:48:51 | INFO | "Loading COMMENTS embeddings..."
15:48:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_comments"


  0%|          | 0/38 [00:00<?, ?it/s]

15:54:48 | INFO | "  Reading 37 files"
15:54:49 | INFO | "  0:06:56.293258 <- Total raw embeddings load time elapsed"
15:54:49 | INFO | "-- Start _load_metadata() method --"
15:54:49 | INFO | "Loading POSTS metadata..."
15:54:49 | INFO | "Reading raw data..."
15:54:49 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/posts/top/2021-07-16"


  0%|          | 0/43 [00:00<?, ?it/s]

15:54:51 | INFO | "  Applying transformations..."
15:54:52 | INFO | "  (1649929, 14) <- Raw META POSTS shape"
15:54:52 | INFO | "Loading subs metadata..."
15:54:52 | INFO | "  reading sub-level data & merging with aggregates..."
15:54:52 | INFO | "Reading raw data..."
15:54:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/subreddits/top/2021-07-16"


  0%|          | 0/1 [00:00<?, ?it/s]

15:54:53 | INFO | "  Applying transformations..."
15:54:54 | INFO | "  (3767, 38) <- Raw META subreddit description shape"
15:54:54 | INFO | "Loading COMMENTS metadata..."
15:54:54 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-07-09"


  0%|          | 0/37 [00:00<?, ?it/s]

15:54:55 | INFO | "  (Delayed('int-11aa2518-d088-4702-bee1-c90e9c40927d'), 7) <- Raw META COMMENTS shape"
15:54:55 | INFO | "  0:00:05.773888 <- Total metadata loading time elapsed"
15:54:55 | INFO | "-- Start _agg_comments_to_post_level() method --"
15:54:55 | INFO | "Getting count of comments per post..."
'<=' not supported between instances of 'NoneType' and 'int'"
15:55:18 | INFO | "Filtering which comments need to be averaged..."
15:56:48 | INFO | "      126,642 <- Comments that DON'T need to be averaged"
15:56:48 | INFO | "   19,041,512 <- Comments that need to be averaged"
15:56:48 | INFO | "No column to weight comments, simple mean for comments at post level"
15:59:15 | INFO | "      979,701 |  514 <- df_v_com_agg SHAPE"
15:59:15 | INFO | "  0:04:20.021986 <- Total comments to post agg loading time elapsed"
15:59:15 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
15:59:17 | INFO | "DEFINE agg_posts_w_comments..."
15:59:17 | INFO | "  (Dela

  0%|          | 0/11 [00:00<?, ?it/s]

17:18:50 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
17:18:50 | INFO | "Saving locally..."
17:42:53 | INFO | "  Saving existing dask df as parquet..."
18:06:23 | INFO | "Logging artifact to mlflow..."
18:06:25 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
18:06:25 | INFO | "Saving locally..."
18:06:25 | INFO | "Keeping index intact..."
18:06:25 | INFO | "Converting pandas to dask..."
18:06:25 | INFO | "   108.6 MB <- Memory usage"
18:06:25 | INFO | "       3	<- target Dask partitions	   40.0 <- target MB partition size"
18:06:29 | INFO | "Logging artifact to mlflow..."
18:06:31 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
18:06:31 | INFO | "Saving locally..."
18:06:33 | INFO | "Converting pandas to dask..."
18:06:40 | INFO | "  6,002.0 MB <- Memory usage"
18:06:40 | INFO | "      81	<- target Dask partitions	   75.0 <- target MB partition size"
18:06:53 | INFO | "Logging artifact to mlflow..."
18:07:16 | I

In [23]:
mlflow.end_run("FAILED")

# Debugging

# Run test with `lower_case=False`

Sample only a few files in comments/posts to make sure that the job completes even when we're testing new code/logic.

Limit to only 2 files of each kind to get minimum test to run end to end.

### Load config

In [19]:
# run setup_logging() to remove logging to the file of a failed job
setup_logging()

In [20]:
logging.debug("debug test")
logging.info("info test")
logging.warning("warning message")
logging.error("error message")

16:20:03 | INFO | "info test"
16:20:03 | ERROR | "error message"


In [47]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=2',     # 
               'n_sample_comments_files=4',  # 6 is limit for logging unique counts at comment level
               'calculate_similarites=false',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

In [48]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_test_sample_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        for k2_, v2_ in v_.items():
            print(f"  {k2_}: \t{v2_}")
    else:
        print(f"{k_}: \t{v_}")

mlflow_experiment: 	v0.4.0_use_multi_aggregates_test
n_sample_posts_files: 	2
n_sample_comments_files: 	4

aggregate_params:
  min_comment_text_len: 	2
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
calculate_similarites: 	False


In [None]:
BREAK

## Run test job/config


TODO: how to vectorize or run this in parallel?


It took about 9 minutes 21 seconds (roughly 830 posts per second) to go through 465k posts.
```bash
100%|██████████| 464967/464967 [09:21<00:00, 827.91it/s]
22:14:59 | INFO | "  (464967, 512) <- df_agg_posts_w_comments.shape (only posts with comments)"

```
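
One possible answer to the TODO, assuming the per-post loop computes a weighted average of each post embedding and the mean of its comment embeddings: replace the loop with a single `groupby` plus array arithmetic. This is a sketch on toy data, not the `aggregate_embeddings_pd` code; the function name and the default weights (taken from `agg_post_post_weight` and `agg_post_comment_weight`) are illustrative.

```python
import numpy as np
import pandas as pd


def agg_posts_with_comments(df_posts, df_comments, emb_cols, w_post=70, w_comment=20):
    """Vectorized weighted average of post embeddings and per-post comment means."""
    # One groupby call averages all comments per post at once
    df_com_mean = df_comments.groupby("post_id")[emb_cols].mean()
    # Align comment means to the post order; posts without comments become NaN rows
    com = df_com_mean.reindex(df_posts["post_id"].to_numpy()).to_numpy(dtype=np.float64)
    post = df_posts[emb_cols].to_numpy(dtype=np.float64)
    has_comments = ~np.isnan(com).any(axis=1)

    # Weighted average where comments exist; otherwise keep the post embedding as-is
    combined = post.copy()
    combined[has_comments] = (
        w_post * post[has_comments] + w_comment * com[has_comments]
    ) / (w_post + w_comment)

    return pd.concat(
        [
            df_posts[["post_id"]].reset_index(drop=True),
            pd.DataFrame(combined, columns=emb_cols),
        ],
        axis=1,
    )


# Toy data: the real frames have 512 embedding columns
emb_cols = ["embeddings_0", "embeddings_1"]
df_posts_demo = pd.DataFrame(
    {"post_id": ["p1", "p2"], "embeddings_0": [1.0, 0.0], "embeddings_1": [0.0, 1.0]}
)
df_comments_demo = pd.DataFrame(
    {"post_id": ["p1", "p1"], "embeddings_0": [0.0, 0.0], "embeddings_1": [1.0, 3.0]}
)
df_agg_demo = agg_posts_with_comments(df_posts_demo, df_comments_demo, emb_cols)
```

Because the arithmetic runs on whole arrays, memory rather than iteration speed becomes the limit; with tens of millions of comments it may still be necessary to process a batch of subreddits at a time.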


In [23]:
%%time


gc.collect()
try:
    # run setup_logging() to remove logging to the file of a failed job
    job_agg_test._send_log_file_to_mlflow()
    setup_logging()
    del job_agg_test
except NameError:
    pass
gc.collect()
mlflow.end_run("FAILED")

job_agg_test = aggregate_embeddings_pd.AggregateEmbeddings(
    run_name=f"sample_test_lc_false_pd-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    
    # use pre-loaded dfs
    df_v_posts=df_v_posts_test_sample,
    df_v_sub=df_v_sub_test_sample,
    df_v_comments=df_v_comments_test_sample,
    df_posts_meta=df_posts_meta_,
    df_comments_meta=df_comments_meta_,
    
    **config_test_sample_lc_false.config_flat
)
job_agg_test.run_aggregation()

gc.collect()

In [52]:
job_agg_test._send_log_file_to_mlflow()
gc.collect()

17:49:55 | INFO | "Logging log-file to mlflow..."


4462

### Create pre-loaded dfs to save on loading time

After we know loading works, this could save 3-4 minutes per iteration.

In [None]:
BREAK

In [44]:
%%time
df_v_posts_test_sample = job_agg_test.df_v_posts.copy()
print(df_v_posts_test_sample.shape)

df_v_sub_test_sample = job_agg_test.df_v_sub.copy()
print(df_v_sub_test_sample.shape)

(8439672, 515)
(19262, 514)


In [60]:
%%time
df_v_comments_test_sample = job_agg_test.df_v_comments.copy()
print(df_v_comments_test_sample.shape)

(2649171, 516)
CPU times: user 1.15 s, sys: 1.42 s, total: 2.57 s
Wall time: 2.57 s


In [64]:
%%time

df_posts_meta_ = job_agg_test.df_posts_meta
df_comments_meta_ = job_agg_test.df_comments_meta

CPU times: user 17 µs, sys: 7 µs, total: 24 µs
Wall time: 50.3 µs


### Check computed dfs

In [24]:
# for k_, v_ in {k_: v_ for k_, v_ in vars(job_agg_test).items() if 'df_' in k_}.items():
for k_, v_ in {k_: v_ for k_, v_ in vars(job_agg1).items() if 'df_' in k_}.items():
    print(f"\n{k_}")
    try:
        print(f"  {v_.shape}")
        display(v_.iloc[:8, :10])
        if 'meta' not in k_:
            print(v_.info())
    except Exception:
        # Skip attributes that can't be displayed as a DataFrame
        pass


df_v_posts
  (8439672, 515)


Unnamed: 0,subreddit_name,subreddit_id,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
0,circumcisiongrief,t5_zzszh,t3_oy5757,-0.02456,0.010143,-0.030832,0.037089,-0.069964,0.058501,0.011466
1,circumcisiongrief,t5_zzszh,t3_p7959y,-0.024816,-0.002123,-0.028851,-0.034535,-0.101,0.031001,0.030201
2,circumcisiongrief,t5_zzszh,t3_p9qjt4,0.003813,0.053867,-0.044054,0.007976,-0.112127,-0.015414,0.066434
3,circumcisiongrief,t5_zzszh,t3_p6pby5,0.027264,-0.019307,0.031788,0.006627,0.034353,0.036209,-0.051651
4,circumcisiongrief,t5_zzszh,t3_p01h3v,-0.005145,0.039692,-0.043006,0.018923,-0.09223,-0.000506,0.005156
5,circumcisiongrief,t5_zzszh,t3_p6ww7c,-0.005935,0.007493,-0.021896,0.044703,-0.079627,0.014662,-0.04879
6,circumcisiongrief,t5_zzszh,t3_paf7xc,-0.020565,0.0597,-0.000603,0.023017,-0.084957,0.050607,0.064898
7,circumcisiongrief,t5_zzszh,t3_pprskd,0.017955,-0.018898,0.031398,0.035929,-0.072952,0.009319,0.025411


<class 'pandas.core.frame.DataFrame'>
Int64Index: 8439672 entries, 0 to 361953
Columns: 515 entries, subreddit_name to embeddings_511
dtypes: float32(512), object(3)
memory usage: 16.3+ GB
None

df_v_comments
  (39760856, 516)


Unnamed: 0,subreddit_name,subreddit_id,post_id,comment_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5
0,0sanitymemes,t5_2qlzfy,t3_p90j2e,t1_h9w0eth,0.012936,-0.024696,0.003495,0.004394,-0.09688,0.075883
1,0sanitymemes,t5_2qlzfy,t3_p9ierl,t1_h9y9a59,-0.046967,0.0337,-0.030745,-0.077184,0.084849,0.008081
2,0sanitymemes,t5_2qlzfy,t3_owhp69,t1_h7g32pv,0.022889,-0.061986,-0.083042,-0.022774,0.004993,0.028299
3,0sanitymemes,t5_2qlzfy,t3_pn8y4r,t1_hcnr1bw,-0.012224,-0.042725,-0.083593,-0.02774,0.02886,-0.016362
4,0sanitymemes,t5_2qlzfy,t3_ozito8,t1_h81169c,-0.071418,0.033446,0.007267,-0.013441,-0.030473,0.079096
5,0sanitymemes,t5_2qlzfy,t3_ozs874,t1_h82dc69,0.015912,0.019512,0.038823,-0.062955,-0.097027,0.051943
6,0sanitymemes,t5_2qlzfy,t3_pqgvlr,t1_hdaztwy,0.137612,-0.013515,0.071143,-0.021872,-0.124753,0.053046
7,0sanitymemes,t5_2qlzfy,t3_pqh6fd,t1_hdb4z1a,0.043433,0.007597,-0.043037,0.03352,-0.076607,0.072387


<class 'pandas.core.frame.DataFrame'>
Int64Index: 39760856 entries, 0 to 692418
Columns: 516 entries, subreddit_name to embeddings_511
dtypes: float32(512), object(4)
memory usage: 77.4+ GB
None

df_v_sub
  (19262, 514)


__null_dask_index__,subreddit_name,subreddit_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7
0,askreddit,t5_2qh1i,0.022812,-0.042123,-0.007514,0.049629,0.07665,0.037742,0.043356,-0.021046
1,pics,t5_2qh0u,-0.051188,0.001655,0.036857,0.010167,0.042715,0.034471,0.045897,-0.069887
2,funny,t5_2qh33,0.052985,-0.029943,-0.020383,-0.022284,0.076624,0.057212,0.022809,0.036846
3,memes,t5_2qjpg,-0.012688,0.007123,-0.046276,0.013266,0.039581,0.06643,-0.068151,0.028627
4,interestingasfuck,t5_2qhsa,-0.010259,0.077889,-0.066735,0.031045,0.072704,0.050287,0.033434,0.007021
5,holup,t5_qir9n,-0.048384,-0.075352,-0.021186,-0.018942,0.075612,0.072915,-0.002732,0.013651
6,publicfreakout,t5_2yrq6,-0.02956,0.051576,-0.032588,-0.019716,0.072152,0.047094,0.004342,-0.041286
7,facepalm,t5_2r5rp,-0.054565,-0.035727,0.053798,-0.045296,0.088955,0.029438,-0.026451,0.055427


<class 'pandas.core.frame.DataFrame'>
Int64Index: 19262 entries, 0 to 19261
Columns: 514 entries, subreddit_name to embeddings_511
dtypes: float32(512), object(2)
memory usage: 38.1+ MB
None

df_subs_meta
  (19262, 91)


Unnamed: 0,pt_date,subreddit_name,subreddit_id,geo_relevant_country_codes,geo_relevant_countries,geo_relevant_country_count,geo_relevant_subreddit,ambassador_subreddit,combined_topic,combined_topic_and_rating
0,2021-09-21,askreddit,t5_2qh1i,,,,False,False,uncategorized,uncategorized
1,2021-09-21,pics,t5_2qh0u,,,,False,False,art,art
2,2021-09-21,funny,t5_2qh33,,,,False,False,uncategorized,uncategorized
3,2021-09-21,memes,t5_2qjpg,,,,False,False,uncategorized,uncategorized
4,2021-09-21,interestingasfuck,t5_2qhsa,,,,False,False,uncategorized,uncategorized
5,2021-09-21,holup,t5_qir9n,,,,False,False,uncategorized,uncategorized
6,2021-09-21,publicfreakout,t5_2yrq6,,,,False,False,uncategorized,over18_nsfw
7,2021-09-21,facepalm,t5_2r5rp,,,,False,False,uncategorized,uncategorized



df_posts_meta
  (8439672, 15)


Unnamed: 0,subreddit_name,subreddit_id,post_id,submit_date,upvotes,combined_topic_and_rating,post_type,weighted_language,text_len,text_word_count
0,circumcisiongrief,t5_zzszh,t3_oy5757,2021-08-04,0,over18_nsfw,text,en,391,71
1,circumcisiongrief,t5_zzszh,t3_p7959y,2021-08-19,0,over18_nsfw,text,en,471,103
2,circumcisiongrief,t5_zzszh,t3_p9qjt4,2021-08-23,0,over18_nsfw,image,en,88,17
3,circumcisiongrief,t5_zzszh,t3_p6pby5,2021-08-18,0,over18_nsfw,text,en,23,3
4,circumcisiongrief,t5_zzszh,t3_p01h3v,2021-08-07,0,over18_nsfw,text,en,628,130
5,circumcisiongrief,t5_zzszh,t3_p6ww7c,2021-08-18,0,over18_nsfw,text,en,378,73
6,circumcisiongrief,t5_zzszh,t3_paf7xc,2021-08-24,0,over18_nsfw,text,en,136,24
7,circumcisiongrief,t5_zzszh,t3_pprskd,2021-09-17,0,over18_nsfw,text,en,3045,592



df_comments_meta
  (39901968, 8)


Unnamed: 0,subreddit_name,subreddit_id,post_id,comment_id,submit_date,upvotes,comment_text_len,comment_text_word_count
0,0sanitymemes,t5_2qlzfy,t3_p90j2e,t1_h9w0eth,2021-08-22,14,144,34
1,0sanitymemes,t5_2qlzfy,t3_p9ierl,t1_h9y9a59,2021-08-22,31,69,12
2,0sanitymemes,t5_2qlzfy,t3_owhp69,t1_h7g32pv,2021-08-02,95,102,20
3,0sanitymemes,t5_2qlzfy,t3_pn8y4r,t1_hcnr1bw,2021-09-13,0,948,135
4,0sanitymemes,t5_2qlzfy,t3_ozito8,t1_h81169c,2021-08-07,5,82,14
5,0sanitymemes,t5_2qlzfy,t3_ozs874,t1_h82dc69,2021-08-07,11,129,26
6,0sanitymemes,t5_2qlzfy,t3_pqgvlr,t1_hdaztwy,2021-09-18,7,9,3
7,0sanitymemes,t5_2qlzfy,t3_pqh6fd,t1_hdb4z1a,2021-09-18,8,90,22



df_comment_count_per_post
  (8439672, 3)


Unnamed: 0,post_id,comment_count,comment_count_
0,t3_ovhuuz,2.0,2.0
1,t3_ovhuvj,9.0,4+
2,t3_ovhuvk,1.0,1.0
3,t3_ovhuvp,2.0,2.0
4,t3_ovhuvr,9.0,4+
5,t3_ovhuvw,7.0,4+
6,t3_ovhuwk,1.0,1.0
7,t3_ovhuwp,1.0,1.0


<class 'pandas.core.frame.DataFrame'>
Int64Index: 8439672 entries, 0 to 8439671
Data columns (total 3 columns):
 #   Column          Dtype  
---  ------          -----  
 0   post_id         object 
 1   comment_count   float64
 2   comment_count_  object 
dtypes: float64(1), object(2)
memory usage: 257.6+ MB
None

df_posts_agg_b

df_posts_agg_c
  (8439672, 515)


Unnamed: 0,subreddit_name,subreddit_id,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
29777,0sanitymemes,t5_2qlzfy,t3_ovlum8,0.145335,0.032148,0.025021,0.038979,-0.049647,0.040922,0.012444
30433,0sanitymemes,t5_2qlzfy,t3_ovly4k,0.115917,0.022836,0.026377,0.031407,-0.047224,0.036824,0.011568
34390,0sanitymemes,t5_2qlzfy,t3_ovmkrd,-0.028301,-0.022054,0.017779,-0.004818,-0.054394,0.070195,-0.028871
43074,0sanitymemes,t5_2qlzfy,t3_ovnz2q,-0.001363,-0.051564,-0.012638,0.019144,-0.015151,0.053491,-0.045563
50065,0sanitymemes,t5_2qlzfy,t3_ovp369,0.011754,-0.038589,0.052715,0.039036,0.041119,0.045222,-0.073373
52162,0sanitymemes,t5_2qlzfy,t3_ovpexl,-0.010066,-0.025976,-0.01747,0.027131,-0.017439,0.02852,0.011335
59459,0sanitymemes,t5_2qlzfy,t3_ovqia6,0.012692,0.024433,0.028118,0.016716,-0.050641,0.058717,0.002631
86740,0sanitymemes,t5_2qlzfy,t3_ovubi1,0.040472,-0.003698,-0.003935,-0.012225,-0.063462,0.044867,-0.021906


<class 'pandas.core.frame.DataFrame'>
Int64Index: 8439672 entries, 29777 to 8378380
Columns: 515 entries, subreddit_name to embeddings_511
dtypes: float64(512), object(3)
memory usage: 32.4+ GB
None

df_subs_agg_a
  (19192, 514)


Unnamed: 0,subreddit_name,subreddit_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7
0,0sanitymemes,t5_2qlzfy,-0.015388,-0.002936,0.010141,-0.001565,-0.013422,0.036574,-0.007536,-0.000176
1,0xpolygon,t5_2qgijx,-0.031953,0.042221,-0.044101,-0.055411,-0.019055,0.019684,-0.003844,-0.004553
2,1000lbsisters,t5_2axvbl,-0.017252,-0.018655,-0.001956,0.007038,-0.010388,0.061051,0.014859,0.019972
3,100gecs,t5_131dor,-0.005886,-0.003683,0.019314,0.006257,-0.031561,0.041858,0.011672,-0.007873
4,100kanojo,t5_2asd3o,-0.001714,0.022761,0.013923,-0.024491,0.004383,0.042427,0.002715,0.006445
5,100thieves,t5_3e98s,-0.000314,-0.003383,0.012124,0.007236,0.004828,0.046164,0.001995,-0.0043
6,100yearsago,t5_2y3jq,0.009795,-0.014319,0.014024,-0.00923,0.027354,0.029391,-0.02142,-0.028558
7,1022,t5_2v7cn,-0.027633,0.008863,-0.004185,0.011031,-0.06321,0.049075,0.023571,-0.001067


<class 'pandas.core.frame.DataFrame'>
Int64Index: 19192 entries, 0 to 19191
Columns: 514 entries, subreddit_name to embeddings_511
dtypes: float32(512), object(2)
memory usage: 37.9+ MB
None

df_subs_agg_b

df_subs_agg_c
  (19192, 514)


Unnamed: 0,subreddit_name,subreddit_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7
0,0sanitymemes,t5_2qlzfy,-0.015309,-0.000338,0.004237,-0.003082,-0.00516,0.039126,-0.007306,-0.001087
1,0xpolygon,t5_2qgijx,-0.032465,0.040708,-0.03843,-0.053849,-0.01101,0.011037,0.002698,0.001709
2,1000lbsisters,t5_2axvbl,-0.015282,-0.010422,0.003329,0.007292,0.000404,0.053924,0.0184,0.021107
3,100gecs,t5_131dor,-0.004657,-0.008885,0.018894,0.00437,-0.031983,0.041073,0.014725,-0.007694
4,100kanojo,t5_2asd3o,0.004579,0.022865,0.01759,-0.02558,0.004692,0.040058,0.003381,0.01089
5,100thieves,t5_3e98s,0.000797,-0.006705,0.01381,0.002443,0.006766,0.039071,0.005408,-0.008182
6,100yearsago,t5_2y3jq,0.002633,-0.006108,0.014209,-0.005926,0.030038,0.031586,-0.011541,-0.027743
7,1022,t5_2v7cn,-0.024393,0.010842,0.001057,0.01296,-0.045104,0.046439,0.025671,0.000203


<class 'pandas.core.frame.DataFrame'>
Int64Index: 19192 entries, 0 to 19191
Columns: 514 entries, subreddit_name to embeddings_511
dtypes: float64(512), object(2)
memory usage: 75.4+ MB
None

df_subs_agg_a_similarity
  (19192, 19192)


Unnamed: 0_level_0,0sanitymemes,0xpolygon,1000lbsisters,100gecs,100kanojo,100thieves,100yearsago,1022,1050ti,10s
subreddit_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0sanitymemes,1.0,0.317122,0.407538,0.531972,0.56197,0.630557,0.24878,0.288425,0.230281,0.281101
0xpolygon,0.317122,1.0,0.134003,0.381688,0.278018,0.463511,0.174326,0.402996,0.291412,0.410954
1000lbsisters,0.407538,0.134003,1.0,0.411273,0.439234,0.338524,0.164762,0.200813,0.032907,0.158848
100gecs,0.531972,0.381688,0.411273,1.0,0.499382,0.583254,0.236132,0.41987,0.223884,0.397008
100kanojo,0.56197,0.278018,0.439234,0.499382,1.0,0.49707,0.230966,0.196443,0.177663,0.233683
100thieves,0.630557,0.463511,0.338524,0.583254,0.49707,1.0,0.297804,0.367563,0.338098,0.450703
100yearsago,0.24878,0.174326,0.164762,0.236132,0.230966,0.297804,1.0,0.169843,0.037582,0.078023
1022,0.288425,0.402996,0.200813,0.41987,0.196443,0.367563,0.169843,1.0,0.297787,0.548043


<class 'pandas.core.frame.DataFrame'>
Index: 19192 entries, 0sanitymemes to zyzz
Columns: 19192 entries, 0sanitymemes to zyzz
dtypes: float32(19192)
memory usage: 1.4+ GB
None

df_subs_agg_b_similarity

df_subs_agg_c_similarity
  (19192, 19192)


Unnamed: 0_level_0,0sanitymemes,0xpolygon,1000lbsisters,100gecs,100kanojo,100thieves,100yearsago,1022,1050ti,10s
subreddit_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0sanitymemes,1.0,0.32362,0.473484,0.541708,0.609227,0.641112,0.252647,0.283359,0.228379,0.324135
0xpolygon,0.32362,1.0,0.15721,0.350132,0.258499,0.446187,0.196747,0.360179,0.291814,0.374882
1000lbsisters,0.473484,0.15721,1.0,0.426692,0.493127,0.39505,0.19253,0.222239,0.046362,0.206581
100gecs,0.541708,0.350132,0.426692,1.0,0.490515,0.595395,0.258179,0.398795,0.268588,0.391884
100kanojo,0.609227,0.258499,0.493127,0.490515,1.0,0.522462,0.249595,0.191399,0.188991,0.271813
100thieves,0.641112,0.446187,0.39505,0.595395,0.522462,1.0,0.321409,0.357185,0.333538,0.462633
100yearsago,0.252647,0.196747,0.19253,0.258179,0.249595,0.321409,1.0,0.21316,0.074235,0.118345
1022,0.283359,0.360179,0.222239,0.398795,0.191399,0.357185,0.21316,1.0,0.293922,0.497022


<class 'pandas.core.frame.DataFrame'>
Index: 19192 entries, 0sanitymemes to zyzz
Columns: 19192 entries, 0sanitymemes to zyzz
dtypes: float64(19192)
memory usage: 2.7+ GB
None

df_subs_agg_a_similarity_pair
  (368313672, 9)


Unnamed: 0_level_0,Unnamed: 1_level_0,cosine_distance,subreddit_name_a,subreddit_name_b,German_posts_percent_a,German_posts_percent_b,manual_topic_and_rating_a,manual_topic_and_rating_b,post_median_word_count_a,post_median_word_count_b
subreddit_id_a,subreddit_id_b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
t5_2qlzfy,t5_3lacy,0.761624,0sanitymemes,shitpostxiv,0.012522,0.018333,internet culture and memes,uncategorized,6.0,7.0
t5_2qlzfy,t5_ujz0m,0.752945,0sanitymemes,fgomemes,0.012522,0.017182,internet culture and memes,over18_nsfw,6.0,6.0
t5_2qlzfy,t5_3oeyf,0.734468,0sanitymemes,fortnitebr,0.012522,0.008333,internet culture and memes,uncategorized,6.0,9.0
t5_2qlzfy,t5_37o2hz,0.730891,0sanitymemes,genshin_memepact,0.012522,0.014167,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_3h1lw,0.728006,0sanitymemes,shittyrainbow6,0.012522,0.0162,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_2df2ik,0.727538,0sanitymemes,shadowfightarena,0.012522,0.0125,internet culture and memes,uncategorized,6.0,9.0
t5_2qlzfy,t5_w4q80,0.724028,0sanitymemes,apexoutlands,0.012522,0.019167,internet culture and memes,uncategorized,6.0,6.0
t5_2qlzfy,t5_1j0ju4,0.723788,0sanitymemes,okbuddyguardian,0.012522,0.039841,internet culture and memes,internet culture and memes,6.0,4.0


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 368313672 entries, ('t5_2qlzfy', 't5_3lacy') to ('t5_2sosg', 't5_2z3jg')
Data columns (total 9 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   cosine_distance            float32
 1   subreddit_name_a           object 
 2   subreddit_name_b           object 
 3   German_posts_percent_a     float64
 4   German_posts_percent_b     float64
 5   manual_topic_and_rating_a  object 
 6   manual_topic_and_rating_b  object 
 7   post_median_word_count_a   float64
 8   post_median_word_count_b   float64
dtypes: float32(1), float64(4), object(4)
memory usage: 24.7+ GB
None

df_subs_agg_b_similarity_pair

df_subs_agg_c_similarity_pair
  (368313672, 9)


Unnamed: 0_level_0,Unnamed: 1_level_0,cosine_distance,subreddit_name_a,subreddit_name_b,German_posts_percent_a,German_posts_percent_b,manual_topic_and_rating_a,manual_topic_and_rating_b,post_median_word_count_a,post_median_word_count_b
subreddit_id_a,subreddit_id_b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
t5_2qlzfy,t5_3lacy,0.794118,0sanitymemes,shitpostxiv,0.012522,0.018333,internet culture and memes,uncategorized,6.0,7.0
t5_2qlzfy,t5_37o2hz,0.775581,0sanitymemes,genshin_memepact,0.012522,0.014167,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_ujz0m,0.767907,0sanitymemes,fgomemes,0.012522,0.017182,internet culture and memes,over18_nsfw,6.0,6.0
t5_2qlzfy,t5_n5scw,0.755079,0sanitymemes,girlsundshitposts,0.012522,0.05303,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_74is2,0.753731,0sanitymemes,okbuddyretard,0.012522,0.033333,internet culture and memes,uncategorized,6.0,4.0
t5_2qlzfy,t5_3h1lw,0.753024,0sanitymemes,shittyrainbow6,0.012522,0.0162,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_37q5a,0.744985,0sanitymemes,lostpause,0.012522,0.020833,internet culture and memes,podcasts and streamers,6.0,5.0
t5_2qlzfy,t5_3i3t6,0.741584,0sanitymemes,unexpectedtf2,0.012522,0.011299,internet culture and memes,uncategorized,6.0,4.0


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 368313672 entries, ('t5_2qlzfy', 't5_3lacy') to ('t5_2sosg', 't5_2z3jg')
Data columns (total 9 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   cosine_distance            float64
 1   subreddit_name_a           object 
 2   subreddit_name_b           object 
 3   German_posts_percent_a     float64
 4   German_posts_percent_b     float64
 5   manual_topic_and_rating_a  object 
 6   manual_topic_and_rating_b  object 
 7   post_median_word_count_a   float64
 8   post_median_word_count_b   float64
dtypes: float64(5), object(4)
memory usage: 26.1+ GB
None

df_subs_agg_a_similarity_top_pair
  (3838400, 9)


Unnamed: 0_level_0,Unnamed: 1_level_0,cosine_distance,subreddit_name_a,subreddit_name_b,German_posts_percent_a,German_posts_percent_b,manual_topic_and_rating_a,manual_topic_and_rating_b,post_median_word_count_a,post_median_word_count_b
subreddit_id_a,subreddit_id_b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
t5_2qlzfy,t5_3lacy,0.761624,0sanitymemes,shitpostxiv,0.012522,0.018333,internet culture and memes,uncategorized,6.0,7.0
t5_2qlzfy,t5_ujz0m,0.752945,0sanitymemes,fgomemes,0.012522,0.017182,internet culture and memes,over18_nsfw,6.0,6.0
t5_2qlzfy,t5_3oeyf,0.734468,0sanitymemes,fortnitebr,0.012522,0.008333,internet culture and memes,uncategorized,6.0,9.0
t5_2qlzfy,t5_37o2hz,0.730891,0sanitymemes,genshin_memepact,0.012522,0.014167,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_3h1lw,0.728006,0sanitymemes,shittyrainbow6,0.012522,0.0162,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_2df2ik,0.727538,0sanitymemes,shadowfightarena,0.012522,0.0125,internet culture and memes,uncategorized,6.0,9.0
t5_2qlzfy,t5_w4q80,0.724028,0sanitymemes,apexoutlands,0.012522,0.019167,internet culture and memes,uncategorized,6.0,6.0
t5_2qlzfy,t5_1j0ju4,0.723788,0sanitymemes,okbuddyguardian,0.012522,0.039841,internet culture and memes,internet culture and memes,6.0,4.0


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3838400 entries, ('t5_2qlzfy', 't5_3lacy') to ('t5_2sosg', 't5_iuzv7')
Data columns (total 9 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   cosine_distance            float32
 1   subreddit_name_a           object 
 2   subreddit_name_b           object 
 3   German_posts_percent_a     float64
 4   German_posts_percent_b     float64
 5   manual_topic_and_rating_a  object 
 6   manual_topic_and_rating_b  object 
 7   post_median_word_count_a   float64
 8   post_median_word_count_b   float64
dtypes: float32(1), float64(4), object(4)
memory usage: 264.9+ MB
None

df_subs_agg_b_similarity_top_pair

df_subs_agg_c_similarity_top_pair
  (3838400, 9)


Unnamed: 0_level_0,Unnamed: 1_level_0,cosine_distance,subreddit_name_a,subreddit_name_b,German_posts_percent_a,German_posts_percent_b,manual_topic_and_rating_a,manual_topic_and_rating_b,post_median_word_count_a,post_median_word_count_b
subreddit_id_a,subreddit_id_b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
t5_2qlzfy,t5_3lacy,0.794118,0sanitymemes,shitpostxiv,0.012522,0.018333,internet culture and memes,uncategorized,6.0,7.0
t5_2qlzfy,t5_37o2hz,0.775581,0sanitymemes,genshin_memepact,0.012522,0.014167,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_ujz0m,0.767907,0sanitymemes,fgomemes,0.012522,0.017182,internet culture and memes,over18_nsfw,6.0,6.0
t5_2qlzfy,t5_n5scw,0.755079,0sanitymemes,girlsundshitposts,0.012522,0.05303,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_74is2,0.753731,0sanitymemes,okbuddyretard,0.012522,0.033333,internet culture and memes,uncategorized,6.0,4.0
t5_2qlzfy,t5_3h1lw,0.753024,0sanitymemes,shittyrainbow6,0.012522,0.0162,internet culture and memes,uncategorized,6.0,5.0
t5_2qlzfy,t5_37q5a,0.744985,0sanitymemes,lostpause,0.012522,0.020833,internet culture and memes,podcasts and streamers,6.0,5.0
t5_2qlzfy,t5_3i3t6,0.741584,0sanitymemes,unexpectedtf2,0.012522,0.011299,internet culture and memes,uncategorized,6.0,4.0


<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3838400 entries, ('t5_2qlzfy', 't5_3lacy') to ('t5_2sosg', 't5_4myudf')
Data columns (total 9 columns):
 #   Column                     Dtype  
---  ------                     -----  
 0   cosine_distance            float64
 1   subreddit_name_a           object 
 2   subreddit_name_b           object 
 3   German_posts_percent_a     float64
 4   German_posts_percent_b     float64
 5   manual_topic_and_rating_a  object 
 6   manual_topic_and_rating_b  object 
 7   post_median_word_count_a   float64
 8   post_median_word_count_b   float64
dtypes: float64(5), object(4)
memory usage: 279.5+ MB
None

df_v_com_agg
  (7029301, 515)


Unnamed: 0,subreddit_name,subreddit_id,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6
0,0sanitymemes,t5_2qlzfy,t3_ovly4k,-0.001755,-0.014411,0.0318,0.00112,-0.037531,0.020433,0.008061
1,0sanitymemes,t5_2qlzfy,t3_ovmkrd,-0.015998,-0.035019,0.025448,0.008916,-0.032112,0.069218,-0.003567
2,0sanitymemes,t5_2qlzfy,t3_ovnz2q,0.01046,0.009205,0.015113,0.027652,-0.039074,0.039467,0.04825
3,0sanitymemes,t5_2qlzfy,t3_ovp369,0.054907,0.007499,0.069496,0.041878,0.001678,0.070194,-0.077451
4,0sanitymemes,t5_2qlzfy,t3_ovpexl,0.129815,-0.022494,0.036611,0.015707,-0.079794,0.056509,0.036529
5,0sanitymemes,t5_2qlzfy,t3_ovqia6,-0.002432,-0.025784,0.036367,0.001857,-0.011705,0.020703,0.006595
6,0sanitymemes,t5_2qlzfy,t3_ovubi1,0.004887,0.010785,-1.2e-05,-0.002698,-0.015722,0.029091,0.000425
7,0sanitymemes,t5_2qlzfy,t3_ovuy09,0.032474,-0.008697,0.004628,-0.007205,-0.050312,0.02408,0.002667


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7029301 entries, 0 to 7029300
Columns: 515 entries, subreddit_name to embeddings_511
dtypes: float32(512), object(3)
memory usage: 13.6+ GB
None


In [42]:
# job_agg_test._save_and_log_aggregate_and_similarity_dfs()

In [38]:
mlflow.end_run("FAILED")
gc.collect()

2794

# Check output dfs

In [51]:
type(vars(job_agg_test))

dict

In [31]:
# d_dfs2 = {k: v for k, v in vars(job_agg_test).items() if 'df_' in k}


# for k2, df_2 in tqdm(d_dfs2.items()):
#     print(f"\n{k2}")
#     try:
#         print(f"  {df_2.shape} <- df shape")
#         print(f"  {df_2.npartitions} <- dask partitions")
#         # print(f"{get_dask_df_shape(df_2)} <- df.shape")
#         # print(f"  {df_2.memory_usage(deep=True).sum() / 1048576:4,.1f} MB <- Memory usage")
#         if any(['meta' in k2, '_v_' in k2]):
#             pass
#         else:
#             pass
# #             display(df_2.iloc[:5, :15])

#     except (TypeError, AttributeError):
#         if isinstance(df_2, pd.DataFrame):
#             print(f"  {df_2.shape} <- df shape")

## VM size notes

`614 GB` of RAM is not enough for 40 million posts...
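
As a rough sanity check on RAM needs, the dense part of an embeddings frame takes about `rows x columns x bytes per value`. The sketch below is only back-of-the-envelope arithmetic: the helper name `dense_gib` is made up for this note, and it ignores the index and `object` columns, which is why `df.info()` reports a bit more. The inputs are the shapes printed by `df.info()` in the outputs above.

```python
# Bytes per value for the two float dtypes that show up in the outputs above.
BYTES_PER_VALUE = {"float32": 4, "float64": 8}


def dense_gib(n_rows: int, n_cols: int, dtype: str = "float32") -> float:
    """Approximate size in GiB (2**30 bytes) of a dense numeric block."""
    return n_rows * n_cols * BYTES_PER_VALUE[dtype] / 2**30


# 19,192 x 19,192 subreddit similarity matrix, in both dtypes
sim_f32 = dense_gib(19_192, 19_192, "float32")
sim_f64 = dense_gib(19_192, 19_192, "float64")

# 8,439,672 post rows x 512 embedding columns stored as float64
posts_f64 = dense_gib(8_439_672, 512, "float64")

print(f"{sim_f32:.1f} GiB | {sim_f64:.1f} GiB | {posts_f64:.1f} GiB")
```

The float64 frames cost exactly twice as much as their float32 counterparts, so casting embeddings to float32 before aggregating is one cheap lever when memory is the bottleneck.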

VM & cluster set up:
```bash
96 CPUS
640 GB RAM

8 workers
- 12 threads per worker
- 76 GB per worker
```
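
The worker numbers above are consistent with splitting the VM evenly across workers while leaving a little headroom. A minimal sketch of that arithmetic follows. The 5% headroom and the helper name `local_cluster_kwargs` are assumptions made for illustration, not something recorded in these notes. The returned keys match the `n_workers`, `threads_per_worker`, and `memory_limit` arguments of `dask.distributed.LocalCluster`.

```python
def local_cluster_kwargs(total_ram_gb, n_cpus, n_workers, headroom_pct=5):
    """Split a VM's CPUs and RAM evenly across dask workers.

    headroom_pct is an assumed safety margin (percent of RAM left unassigned).
    """
    usable_gb = total_ram_gb * (100 - headroom_pct) // 100
    return {
        "n_workers": n_workers,
        "threads_per_worker": n_cpus // n_workers,
        "memory_limit": f"{usable_gb // n_workers}GB",
    }


kwargs = local_cluster_kwargs(total_ram_gb=640, n_cpus=96, n_workers=8)
print(kwargs)

# To start the cluster (requires dask.distributed to be installed):
# from dask.distributed import Client, LocalCluster
# client = Client(LocalCluster(**kwargs))
```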

Traceback:

```bash
23:17:48 | INFO | "      775,092 <- Comments that DON'T need to be averaged"
23:17:48 | INFO | "   39,126,876 <- Comments that need to be averaged"
23:17:48 | INFO | "No column to weight comments, simple mean for comments at post level"
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<timed exec> in <module>

/home/david.bermejo/repos/subreddit_clustering_i18n/subclu/models/aggregate_embeddings.py in run_aggregation(self)
    249         # - up-votes
    250         # ---
--> 251         self._agg_comments_to_post_level()
...
KilledWorker: ("('dataframe-groupby-sum-agg-849e94fd54a49f8ed34330862f20cb9d', 0)", <WorkerState 'tcp://127.0.0.1:41351', name: 6, memory: 0, processing: 1>)

```
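
The worker was killed during the `groupby` step that averages comment embeddings per post. One general way to bound memory for a simple (unweighted) mean is to keep only a running sum and count per post while streaming over chunks of comments. The sketch below illustrates that idea in plain Python with toy 3-dimensional vectors. It is not the code in `aggregate_embeddings.py`, and `running_mean_by_post` and the toy post IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, List[float]]  # (post_id, comment embedding)


def running_mean_by_post(chunks: Iterable[List[Row]]) -> Dict[str, List[float]]:
    """Average comment embeddings per post, one chunk at a time."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for chunk in chunks:
        for post_id, vec in chunk:
            if post_id not in sums:
                # First comment seen for this post: start the running sum
                sums[post_id] = list(vec)
            else:
                acc = sums[post_id]
                for i, value in enumerate(vec):
                    acc[i] += value
            counts[post_id] += 1
    # Divide each running sum by the number of comments for that post
    return {pid: [s / counts[pid] for s in total] for pid, total in sums.items()}


# Two chunks; post "t3_a" has comments in both, "t3_b" has a single comment
chunks = [
    [("t3_a", [1.0, 2.0, 3.0]), ("t3_b", [0.5, 0.5, 0.5])],
    [("t3_a", [3.0, 4.0, 5.0])],
]
post_means = running_mean_by_post(chunks)
print(post_means)
```

Only one chunk of comments plus one vector per post needs to be in memory at any time, at the cost of doing the loop in Python instead of a vectorized `groupby`.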

### Time profiling

inputs:
``` python
mlflow_experiment: 	v0.4.0_use_multi_aggregates_test
n_sample_posts_files: 	5
n_sample_comments_files: 	10

aggregate_params:
  min_comment_text_len: 	10
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
```

VM & cluster set up:
```
96 CPUS
640 GB RAM

8 workers
- 12 threads per worker
- 76 GB per worker
```

### Filtered/selected logs

Overview:

| Time/ETA (h:mm:ss) | Step | Notes |
| --- | --- | --- |
| `0:11:43` | Load raw embeddings (w/o caching) |  |
| `0:03:15` | Load metadata (w/o caching) |  |
| `0:04:30` | Aggregation steps (all) | Note that this might only be the time to create the DAG, not necessarily the time to actually compute the data |
| `0:37:24` | Calculate similarities |  |
| `1:30:00` | Saving & logging files | Saving alone could take more than 1 hour... and I'd forgotten about this |


Note that the ETAs for saving each DF are very different: the first 2 are really large and take a long time, while the last few are smaller, so the time estimates from `tqdm` can vary a ton:
```bash
3/11 [40:25<1:14:34, 559.33s/it]   27%
9/11 [50:22<03:22, 101.02s/it]     82% 
11/11 [1:30:11<00:00, 750.74s/it] 100%
```
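
Each ETA in that output is just the number of remaining items multiplied by the seconds-per-item shown on the same line, so it swings whenever the items differ a lot in size. A small sketch of the arithmetic, using the numbers from the progress lines above; `eta_from_rate` is a made-up helper, not part of `tqdm`.

```python
def eta_from_rate(done, total, sec_per_item):
    """Format the remaining time implied by a seconds-per-item rate."""
    remaining = int((total - done) * sec_per_item)
    hours, rem = divmod(remaining, 3600)
    minutes, seconds = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes:02d}:{seconds:02d}"


# After the large DFs: 3 of 11 saved at 559.33 s/it
eta_early = eta_from_rate(3, 11, 559.33)
# After the small DFs: 9 of 11 saved at 101.02 s/it
eta_late = eta_from_rate(9, 11, 101.02)
print(eta_early, eta_late)
```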


Getting the shape of the `dask` df is taking almost half of the saving time!

**TODO: REMOVE** logging df shape for now to save a ton of time!

```bash
20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."                                   # get_df_shape() starts here...
21:13:56 | INFO | "  Saving existing dask df as parquet..."             # get_df_shape() ends here, ABOUT 20 MINUTES!
21:33:11 | INFO | "Logging artifact to mlflow..."                       # SAVING the dask df takes about another 20 MINUTES
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"    # And logging the dfs up to GCS only takes about 3 seconds?!
21:33:14 | INFO | "Saving locally..."
21:33:14 | INFO | "Keeping index intact..."
21:33:14 | INFO | "Converting pandas to dask..."
21:33:15 | INFO | "   185.4 MB <- Memory usage"
21:33:15 | INFO | "       5	<- target Dask partitions	   40.0 <- target MB partition size"
21:33:19 | INFO | "Logging artifact to mlflow..."
21:33:22 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
21:33:22 | INFO | "Saving locally..."
21:33:24 | INFO | "Converting pandas to dask..."
21:33:35 | INFO | "  10,391.8 MB <- Memory usage"
21:33:35 | INFO | "     139	<- target Dask partitions	   75.0 <- target MB partition size"
21:33:55 | INFO | "Logging artifact to mlflow..."
21:34:30 | INFO | "** df_sub_level_agg_b_post_and_comments **"

```
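
One way to act on the TODO is to log only what is already known without triggering a compute. In the logs, `dask` shapes show up as `(Delayed(...), 513)`: the column count is a plain integer, but the row count is lazy. The sketch below is an assumed helper (`cheap_shape` is not in the repo) that reports the row count only when it is already an integer. The two stand-in classes exist only so the example runs without `dask` or `pandas`.

```python
from numbers import Integral


def cheap_shape(df):
    """Describe df.shape without forcing a lazy row count to be computed."""
    n_rows, n_cols = df.shape
    if isinstance(n_rows, Integral):
        return f"({n_rows:,}, {n_cols:,})"
    # Lazy row count (e.g. a dask Delayed): skip it instead of computing it
    return f"(not computed, {n_cols:,})"


class FakeEagerFrame:
    """Stand-in for a pandas DataFrame: shape is a tuple of ints."""
    shape = (19192, 514)


class FakeLazyFrame:
    """Stand-in for a dask DataFrame: the row count is a lazy object."""
    shape = (object(), 513)


print(cheap_shape(FakeEagerFrame()))
print(cheap_shape(FakeLazyFrame()))
```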


More details in log file:

`logs/AggregateEmbeddings/2021-10-05_195710-sample_test_lc_false-2021-10-05_195710.log`
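
Step durations can be recomputed from the timestamp at the start of each log line. A minimal sketch, assuming the `HH:MM:SS | LEVEL | "message"` layout seen in these logs and a run that does not cross midnight; `log_time` and `elapsed_between` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def log_time(line: str) -> datetime:
    """Parse the leading HH:MM:SS timestamp of a log line."""
    return datetime.strptime(line.split(" | ", 1)[0].strip(), "%H:%M:%S")


def elapsed_between(start_line: str, end_line: str) -> timedelta:
    """Time elapsed between two log lines from the same day."""
    return log_time(end_line) - log_time(start_line)


# Lines copied from the save step of the first (largest) dask df
shape_start = '20:54:05 | INFO | "Saving locally..."'
save_start = '21:13:56 | INFO | "  Saving existing dask df as parquet..."'
save_end = '21:33:11 | INFO | "Logging artifact to mlflow..."'

shape_time = elapsed_between(shape_start, save_start)
save_time = elapsed_between(save_start, save_end)
print(shape_time, save_time)
```

For this df, the shape lookup and the parquet write each come out just under 20 minutes.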


```bash
# load raw embeddings (w/o caching): 11:43 minutes
# ---
20:01:19 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/2fcfefc3d5af43328168d3478b4fdeb6/artifacts/df_vect_comments"
40/40 [07:29<00:00, 8.17s/it]
20:08:49 | INFO | "  Parquet files found: 5"
20:08:49 | INFO | "  Keep only comments for posts with embeddings"
20:08:54 | INFO | "  0:11:43.326935 <- Total raw embeddings load time elapsed"


# Load metadata (w/o caching): 3:15 minutes
# ---
20:08:54 | INFO | "-- Start _load_metadata() method --"
20:08:54 | INFO | "Loading POSTS metadata..."

20:10:39 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-10-04"
100%
59/59 [01:30<00:00, 1.43s/it]
20:12:10 | INFO | "  (Delayed('int-e6188e6d-6319-487d-b054-bea8a30d912b'), 7) <- Raw META COMMENTS shape"
20:12:10 | INFO | "  0:03:15.218880 <- Total metadata loading time elapsed"


# Aggregation steps (all): 4:30 minutes
#   Note that this might only be the time to create the dag, not necessarily the time to actually compute the data
20:12:10 | INFO | "-- Start _agg_comments_to_post_level() method --"
20:12:10 | INFO | "Getting count of comments per post..."
20:12:39 | INFO | "Filtering which comments need to be averaged..."
20:13:23 | INFO | "       22,197 <- Comments that DON'T need to be averaged"
20:13:23 | INFO | "    1,087,458 <- Comments that need to be averaged"
20:13:28 | INFO | "No column to weight comments, simple mean for comments at post level"
20:13:57 | INFO | "      191,558 |  514 <- df_v_com_agg SHAPE"
20:13:57 | INFO | "  0:01:46.878385 <- Total comments to post agg loading time elapsed"
20:13:57 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
20:13:59 | INFO | "DEFINE agg_posts_w_comments..."
...
20:16:38 | INFO | "A - posts only"
20:16:39 | INFO | "  (Delayed('int-3ee084d4-434c-4a38-aeb1-185b50648908'), 513) <- df_subs_agg_a.shape (only posts)"
20:16:39 | INFO | "B - posts + comments"
20:16:39 | INFO | "  (Delayed('int-6bd21f72-fc1d-495c-987a-1da6e4a18683'), 513) <- df_subs_agg_b.shape (posts + comments)"
20:16:39 | INFO | "C - posts + comments + sub descriptions"
20:16:40 | INFO | "  (Delayed('int-59c54047-7ac8-49cb-81b3-478b4ae5b60e'), 513) <- df_subs_agg_c.shape (posts + comments + sub description)"
20:16:40 | INFO | "  0:00:01.507065 <- Total for ALL subreddit-level agg time elapsed"


# Calculate similarities 37:24 minutes
20:16:40 | INFO | "-- Start _calculate_subreddit_similarities() method --"
20:16:40 | INFO | "A..."
20:16:56 | INFO | "  (4924, 4924) <- df_subs_agg_a_similarity.shape"
20:17:21 | INFO | "Merge distance + metadata..."
20:17:59 | INFO | "Create new df to keep only top 20 subs by distance..."
20:18:10 | INFO | "  (24240852, 11) <- df_dist_pair_meta.shape (before setting index)"
20:18:10 | INFO | "  (98480, 11) <- df_dist_pair_meta_top_only.shape (before setting index)"
...
20:54:04 | INFO | "  0:37:24.347689 <- Total for _calculate_subreddit_similarities() time elapsed"


# *** Saving & logging files: WTF? Saving alone took about 1.5 hours (the tqdm ETA at 3/11 pointed to almost 2 hours)! ***
20:54:04 | INFO | "-- Start _save_and_log_aggregate_and_similarity_dfs() method --"
20:54:04 | INFO | "  Saving config to local path..."
20:54:04 | INFO | "  Logging config to mlflow..."
*** 3/11 [40:25<1:14:34, 559.33s/it]  27%   ***
20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."
...
21:13:56 | INFO | "  Saving existing dask df as parquet..."
21:33:11 | INFO | "Logging artifact to mlflow..."
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
21:33:14 | INFO | "Saving locally..."

```
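
The pair-level shapes in these logs follow from the number of subreddits: with `n` subreddits there are `n * (n - 1)` ordered pairs once self-pairs are dropped, and keeping the top `k` neighbors per subreddit leaves `n * k` rows. The sketch below checks that arithmetic against the logged shapes. The value `k = 200` for the full run is inferred from the row counts, not read from the config, and both function names are made up for this note.

```python
def n_ordered_pairs(n_subreddits: int) -> int:
    """Rows in the long pair table: every (a, b) with a != b."""
    return n_subreddits * (n_subreddits - 1)


def n_top_pairs(n_subreddits: int, top_k: int) -> int:
    """Rows left after keeping the top_k closest subreddits for each one."""
    return n_subreddits * top_k


# Sampled run in the log: 4,924 subreddits, top 20 kept
sample_pairs = n_ordered_pairs(4_924)
sample_top = n_top_pairs(4_924, 20)

# Full run in the outputs above: 19,192 subreddits, k = 200 assumed
full_pairs = n_ordered_pairs(19_192)
full_top = n_top_pairs(19_192, 200)

print(f"{sample_pairs:,} | {sample_top:,} | {full_pairs:,} | {full_top:,}")
```

Because the pair table grows quadratically with the number of subreddits while the top-`k` table grows linearly, the top-`k` table is the one that stays cheap to save.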