# Purpose

2021-08-10: Finally completed testing with sampling <= 10 files. Now ready to run process on full data!

Ended up doing it all in dask + pandas + numpy because of problems installing `cuDF`.

---
2021-08-02: Now that I'm processing millions of comments and posts, I need to re-write the functions to try to do some work in parallel and reduce the amount of data loaded in RAM.

- `Dask` seems like a great option to load data and only compute some of it as needed.
- `cuDF` could be a way to speed up some computation using GPUs
- `Dask-delayed` could be a way to create a task DAG lazily before computing all the aggregates.


---

In notebook 09 I combined embeddings from posts & subreddits (`djb_09.00-combine_post_and_comments_and_visualize_for_presentation.ipynb`).

In this notebook I'll be testing functions that include mlflow so that it's easier to try a lot of different weights to find better representations.

Take embeddings created by other models & combine them:
```
new post embeddings = post + comments + subreddit description

new subreddit embeddings = new posts (weighted by post length or upvotes?)
```
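A minimal sketch of the weighted combination above, assuming each input is a NumPy vector with the same number of dimensions. The helper name `combine_embeddings` and the toy vectors are hypothetical; the 70/20/10 weights mirror the `agg_post_*_weight` values in this notebook's config, but the real logic lives in `AggregateEmbeddings` and may differ.

```python
import numpy as np


def combine_embeddings(post, comments_mean, sub_desc, weights=(70, 20, 10)):
    """Weighted average of three embedding vectors (hypothetical helper)."""
    stacked = np.vstack([post, comments_mean, sub_desc])
    # np.average normalizes by the sum of the weights
    return np.average(stacked, axis=0, weights=weights)


# Toy 4-dimensional vectors; the embeddings in this notebook have 512 dimensions
post = np.array([1.0, 0.0, 0.0, 0.0])
comments_mean = np.array([0.0, 1.0, 0.0, 0.0])
sub_desc = np.array([0.0, 0.0, 1.0, 0.0])

new_post_embedding = combine_embeddings(post, comments_mean, sub_desc)
print(new_post_embedding)
```

Because the weights are normalized, only their ratios matter: `(7, 2, 1)` gives the same result as `(70, 20, 10)`.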

# Notebook setup

In [None]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import logging
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.models.aggregate_embeddings import (
    AggregateEmbeddings, AggregateEmbeddingsConfig,
    load_config_agg_jupyter, get_dask_df_shape,
)

from subclu.utils import set_working_directory
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks
from subclu.eda.aggregates import (
    compare_raw_v_weighted_language
)
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl
)


print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Get runs that we can use for embeddings aggregation jobs

In [6]:
%%time

df_mlf_runs =  mlf.search_all_runs(experiment_ids=[9, '10', 11, 12])
df_mlf_runs.shape

CPU times: user 271 ms, sys: 11.2 ms, total: 283 ms
Wall time: 284 ms


(118, 104)

In [7]:
mask_finished = df_mlf_runs['status'] == 'FINISHED'
mask_output_over_1M_rows = (
    (df_mlf_runs['metrics.df_vect_posts_rows'] >= 1e6) |
    (df_mlf_runs['metrics.df_vect_comments'] >= 1e6)
)
# df_mlf_runs[mask_finished].shape

df_mlf_use_for_agg = df_mlf_runs[mask_output_over_1M_rows]
df_mlf_use_for_agg.shape

(3, 104)

In [8]:
cols_with_multiple_vals = df_mlf_use_for_agg.columns[df_mlf_use_for_agg.nunique(dropna=False) > 1]
# len(cols_with_multiple_vals)

style_df_numeric(
    df_mlf_use_for_agg
    [cols_with_multiple_vals]
    .drop(['artifact_uri', 'end_time',
           # 'start_time',
           ], 
          axis=1)
    .dropna(axis='columns', how='all')
    .iloc[:, :30]
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,run id,experiment id,status,start time,metrics.vectorizing time minutes comments,metrics.df vect subreddits description cols,metrics.df vect posts rows,metrics.vectorizing time minutes subreddit meta,metrics.total comment files processed,metrics.df vect subreddits description rows,metrics.vectorizing time minutes posts,metrics.df vect posts cols,metrics.df vect comments,metrics.vectorizing time minutes full function,params.tf limit first n chars,params.tokenize lowercase,params.posts path,params.tf batch inference rows,params.subreddits path,params.batch comment files,tags.mlflow.source.git.commit,tags.mlflow.runName
95,a948e9fd651545f997430cddc6b529eb,10,FINISHED,2021-07-29 23:02:33.997000+00:00,145.73,514.00,1649929.00,0.08,37.00,3767.00,14.74,515.00,19168154.00,176.77,1000,True,posts/top/2021-07-16,2000,subreddits/top/2021-07-16,True,63f5f420fb6b48d8243749cba183071757dac531,new_batch_fxn-2021-07-29_230233
97,e66c5db26bd64f6da09c012eea700d0a,10,FINISHED,2021-07-29 18:59:48.715000+00:00,117.47,-,-,-,37.00,-,-,-,19200854.00,133.16,850,False,,6100,,True,64f49e85a8ef56a6795edf9da9a6f5964cb6830b,new_batch_fxn_2021-07-29_185948
106,614a38e6690c4f3ba08725b1585b2ee9,9,KILLED,2021-07-29 11:49:53.924000+00:00,-,514.00,1649929.00,0.07,-,3767.00,10.01,515.00,-,-,1000,False,posts/top/2021-07-16,2100,subreddits/top/2021-07-16,,64f49e85a8ef56a6795edf9da9a6f5964cb6830b,test_new_fxn2021-07-29_114953


# Load configs for aggregation jobs

`n_sample_comments_files` and `n_sample_posts_files` allow us to only load a few files at a time (e.g., 2 instead of 50) to test the process end-to-end.
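As a rough illustration of what file sampling could look like, the sketch below splits a total file budget evenly across vectorizing runs. The helper `sample_files_per_run`, the run names, and the "take the first files after sorting" rule are all assumptions for illustration; the actual sampling is implemented in the `subclu` package and may work differently.

```python
def sample_files_per_run(files_by_run, n_sample_files=None):
    """Keep at most `n_sample_files` in total, split evenly across runs.

    Hypothetical helper: `None` means "keep every file".
    """
    if n_sample_files is None:
        return {run: list(files) for run, files in files_by_run.items()}
    per_run = max(1, n_sample_files // len(files_by_run))
    # Sort so that the same files get picked (and cached) on every test run
    return {run: sorted(files)[:per_run] for run, files in files_by_run.items()}


files_by_run = {
    "run_a": [f"part-{i:03d}.parquet" for i in range(21)],
    "run_b": [f"part-{i:03d}.parquet" for i in range(40)],
}
sampled = sample_files_per_run(files_by_run, n_sample_files=10)
print({run: len(files) for run, files in sampled.items()})
```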

---
Note that by default `hydra` is a CLI tool. To use it in Jupyter, we need to manually initialize the configs & compose the configuration. See my custom function `load_config_agg_jupyter`. Also see:
- [Notebook with `Hydra` examples in a notebook](https://github.com/facebookresearch/hydra/blob/master/examples/jupyter_notebooks/compose_configs_in_notebook.ipynb).
- [Hydra docs, Hydra in Jupyter](https://hydra.cc/docs/next/advanced/jupyter_notebooks/).


In [48]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=4',     # 
               'n_sample_comments_files=4',  # 6 is limit for logging unique counts at comment level
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)
# config_test_full_lc_false = AggregateEmbeddingsConfig(
#     config_path="../config",
#     config_name=root_agg_config_name,
#     overrides=[f"mlflow_experiment={mlflow_experiment_test}",
#                'n_sample_posts_files=null', 
#                'n_sample_comments_files=null',
#                # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
#               ]
# )

config_full_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_full}",
               'n_sample_posts_files=null', 
               'n_sample_comments_files=null',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

# config_full_lc_true = AggregateEmbeddingsConfig(
#     config_path="../config",
#     config_name='aggregate_embeddings',
#     overrides=[f"mlflow_experiment={mlflow_experiment_full}",
#                'n_sample_posts_files=null', 
#                'n_sample_comments_files=null',
#                'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_true',
#               ]
# )
# pprint(config_test_sample_lc_false.config_dict, indent=2)

In [None]:
# config_test_sample_lc_false.config_flat,

In [23]:
df_configs = pd.DataFrame(
    [
        config_test_sample_lc_false.config_flat,
        # config_test_full_lc_false.config_flat,
        config_full_lc_false.config_flat,
        # config_full_lc_true.config_flat,
    ]
)

In [24]:
# We can't use (df_configs.nunique(dropna=False) > 1)
#  because when a col's content is a list or something unhashable, we get an error
#  so instead we'll check each column individually

# cols_with_diffs_config = df_configs.columns[df_configs.nunique(dropna=False) > 1]
cols_with_diffs_config = list()
for c_ in df_configs.columns:
    try:
        if df_configs[c_].nunique() > 1:
            cols_with_diffs_config.append(c_)
    except TypeError:
        cols_with_diffs_config.append(c_)
        

df_configs[cols_with_diffs_config]

Unnamed: 0,comments_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids_lowercase,subreddit_meta_vectorized_mlflow_uuids,subreddit_meta_vectorized_mlflow_uuids_lowercase,comments_uuid,mlflow_experiment
0,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates_test
1,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates


In [25]:
pprint(config_test_sample_lc_false.config_flat, indent=2)

{ 'agg_comments_to_post_weight_col': None,
  'agg_post_comment_weight': 20,
  'agg_post_post_weight': 70,
  'agg_post_subreddit_desc_weight': 10,
  'agg_post_to_subreddit_weight_col': None,
  'bucket_name': 'i18n-subreddit-clustering',
  'col_comment_id': 'comment_id',
  'col_post_id': 'post_id',
  'col_subreddit_id': 'subreddit_id',
  'col_text_comment_word_count': 'comment_text_word_count',
  'col_text_post_word_count': 'text_word_count',
  'comments_folder_embeddings': 'df_vect_comments',
  'comments_uuid': [ '5f10cd75334142168a6ebb787e477c1f',
                     '2fcfefc3d5af43328168d3478b4fdeb6'],
  'comments_vectorized_mlflow_uuids': [ '5f10cd75334142168a6ebb787e477c1f',
                                        '2fcfefc3d5af43328168d3478b4fdeb6'],
  'comments_vectorized_mlflow_uuids_lowercase': None,
  'dataset_name': 'v0.4.0 inputs - Top Subreddits (no Geo) + Geo-relevant '
                  'subs, comments: TBD',
  'folder_comments_text_and_meta': 'comments/top/2021-10-04',
  

In [None]:
BREAK

# Initialize a local dask client
so that we can see the progress/process for dask jobs.

**dask default**: 8 workers with 64 CPUs present<br>
I tested 16 and 12 workers, but both ran out of RAM and the job crashed.

I rolled back to 8 workers to complete the job and prevent OOM errors. Even if it feels like it's wasting workers, at least it finishes.
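One rough way to reason about the trade-off: if the local cluster splits the machine's RAM roughly evenly across workers, every extra worker shrinks the per-worker budget. The sketch below is plain arithmetic on a hypothetical 256 GB machine (the actual RAM of this VM isn't recorded in the notebook). `LocalCluster` also accepts a `memory_limit` argument to set the per-worker limit explicitly.

```python
def memory_per_worker_gb(total_ram_gb, n_workers):
    """Approximate RAM available to each worker under an even split."""
    return total_ram_gb / n_workers


TOTAL_RAM_GB = 256  # hypothetical; replace with the RAM of your machine

budgets = {n: memory_per_worker_gb(TOTAL_RAM_GB, n) for n in (8, 12, 16)}
for n_workers, gb in budgets.items():
    print(f"{n_workers:>2} workers -> ~{gb:.1f} GB per worker")
```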

In [26]:
%%time

from dask.distributed import Client, LocalCluster

# dask default: 8 workers with 64 CPUs present,
cluster = LocalCluster(n_workers=8)
client = Client(cluster)

CPU times: user 478 ms, sys: 220 ms, total: 698 ms
Wall time: 2.4 s


In [27]:
client.dashboard_link

'http://127.0.0.1:8787/status'

# Run Full data with `lower_case=False`

The logic for sampling files and for downloading/caching them locally lives in the custom `MlflowLogger` class (`mlf`).

Caching can save 9+ minutes compared with downloading the files from GCS every time.
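The caching idea can be sketched as "download only on a cache miss". This is a minimal sketch, not the `MlflowLogger` implementation: `get_cached_file` and `fake_download` are hypothetical names, and the stand-in download function writes dummy bytes instead of talking to GCS.

```python
from pathlib import Path
import tempfile


def get_cached_file(remote_name, cache_dir, download_fn):
    """Return the local path for `remote_name`, downloading only on a cache miss.

    `download_fn(remote_name, local_path)` is a hypothetical callable that
    fetches the file (e.g., from GCS) and writes it to `local_path`.
    """
    local_path = Path(cache_dir) / remote_name
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download_fn(remote_name, local_path)
    return local_path


# Stand-in download function that records how often it gets called
download_calls = []


def fake_download(remote_name, local_path):
    download_calls.append(remote_name)
    local_path.write_bytes(b"fake parquet bytes")


with tempfile.TemporaryDirectory() as tmp_dir:
    first = get_cached_file("df_vect_posts/part-000.parquet", tmp_dir, fake_download)
    second = get_cached_file("df_vect_posts/part-000.parquet", tmp_dir, fake_download)
    print(first == second, len(download_calls))
```

The second call finds the file on disk and skips the download, which is where the time savings come from on repeated runs.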

In [None]:
BREAK

In [None]:
%%time

mlflow.end_run("FAILED")
gc.collect()
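# Drop leftovers from a previous run; pop each name on its own so that
#  a missing `job_agg1` doesn't leave `d_dfs1` in memory
for name_ in ('job_agg1', 'd_dfs1'):
    globals().pop(name_, None)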
gc.collect()

job_agg1 = AggregateEmbeddings(
    run_name=f"full_lc_false-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_full_lc_false.config_flat
)
job_agg1.run_aggregation()

gc.collect()

07:02:28 | INFO | "== Start run_aggregation() method =="
07:02:28 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
07:02:29 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-08-10_070229-full_lc_false-2021-08-10_070228"
07:02:29 | INFO | "  Saving config to local path..."
07:02:29 | INFO | "  Logging config to mlflow..."
07:02:29 | INFO | "-- Start _load_raw_embeddings() method --"
07:02:29 | INFO | "Loading subreddit description embeddings..."
07:02:30 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/9/614a38e6690c4f3ba08725b1585b2ee9/artifacts/df_vect_subreddits_description"


  0%|          | 0/4 [00:00<?, ?it/s]

07:02:30 | INFO | "  Reading 1 files"
07:02:31 | INFO | "       3,767 |  513 <- Raw vectorized subreddit description shape"
07:02:32 | INFO | "Loading POSTS embeddings..."
07:02:33 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/9/614a38e6690c4f3ba08725b1585b2ee9/artifacts/df_vect_posts"


  0%|          | 0/51 [00:00<?, ?it/s]

07:02:33 | INFO | "  Reading 48 files"
07:02:35 | INFO | "   1,649,929 |  514 <- Raw POSTS shape"
07:02:38 | INFO | "Loading COMMENTS embeddings..."
07:02:39 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/e66c5db26bd64f6da09c012eea700d0a/artifacts/df_vect_comments"


  0%|          | 0/38 [00:00<?, ?it/s]

07:02:39 | INFO | "  Reading 37 files"
07:02:39 | INFO | "  0:00:09.844584 <- Total raw embeddings load time elapsed"
07:02:39 | INFO | "-- Start _agg_comments_to_post_level() method --"
07:02:40 | INFO | "Getting count of comments per post..."
'<=' not supported between instances of 'NoneType' and 'int'"
07:02:55 | INFO | "Filtering which comments need to be averaged..."
07:04:23 | INFO | "      128,716 <- Comments that DON'T need to be averaged"
07:04:24 | INFO | "   19,072,138 <- Comments that need to be averaged"
07:04:24 | INFO | "No column to weight comments, simple mean for comments at post level"
07:06:14 | INFO | "      985,894 |  514 <- df_v_com_agg SHAPE"
07:06:14 | INFO | "  0:03:34.315443 <- Total comments to post agg loading time elapsed"
07:06:14 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
07:06:15 | INFO | "DEFINE agg_posts_w_comments..."
07:06:16 | INFO | "  (Delayed('int-e8e786da-22e6-4893-a586-2b056bcc6e58'), 513) <- df_agg_

  0%|          | 0/11 [00:00<?, ?it/s]

08:36:46 | INFO | "** df_post_level_agg_b_post_and_comments **"
08:36:46 | INFO | "Saving locally..."
08:54:18 | INFO | "Logging artifact to mlflow..."
08:55:34 | INFO | "** df_post_level_agg_c_post_comments_sub_desc **"
08:55:34 | INFO | "Saving locally..."
09:16:16 | INFO | "     268	<- EXISTING Dask partitions"
09:38:02 | INFO | "Logging artifact to mlflow..."
09:39:58 | INFO | "** df_sub_level_agg_a_post_only **"
09:39:58 | INFO | "Saving locally..."
09:40:04 | INFO | "       1	<- EXISTING Dask partitions"
09:40:12 | INFO | "Logging artifact to mlflow..."
09:40:13 | INFO | "** df_sub_level_agg_a_post_only_similarity **"
09:40:13 | INFO | "Saving locally..."
09:40:13 | INFO | "Keeping index intact..."
09:40:13 | INFO | "Converting pandas to dask..."
09:40:13 | INFO | "   108.6 MB <- Memory usage"
09:40:13 | INFO | "       3	<- target Dask partitions	   40.0 <- target MB partition size"
09:40:16 | INFO | "Logging artifact to mlflow..."
09:40:18 | INFO | "** df_sub_level_agg_a_post_on

# Run full data, `lower_case=True`

Looks like the problem I ran into with the file being corrupted might've been a problem with downloading the file(s). Fix: delete the local cache and download the files again.
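Before wiping the whole cache, a cheap check can flag files that were probably truncated during download: a complete Parquet file starts and ends with the 4-byte magic `PAR1`. The helpers below (`looks_like_complete_parquet`, `remove_incomplete_files`) are hypothetical and not part of `subclu`. Passing the check doesn't guarantee that a file is valid; it only catches obviously incomplete ones.

```python
from pathlib import Path
import tempfile

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(path):
    """Cheap sanity check: Parquet files start and end with the bytes b'PAR1'."""
    path = Path(path)
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), 2)  # 2 == os.SEEK_END
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def remove_incomplete_files(cache_dir):
    """Delete cached *.parquet files that fail the check; return their names."""
    removed = []
    for path in sorted(Path(cache_dir).rglob("*.parquet")):
        if not looks_like_complete_parquet(path):
            path.unlink()
            removed.append(path.name)
    return removed


with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "good.parquet").write_bytes(b"PAR1" + b"\x00" * 16 + b"PAR1")
    (Path(tmp_dir) / "truncated.parquet").write_bytes(b"PAR1" + b"\x00" * 16)
    removed = remove_incomplete_files(tmp_dir)
    print(removed)
```

Only the removed files would then need to be downloaded again on the next run.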

In [None]:
BREAK

In [None]:
%%time

mlflow.end_run("FAILED")
gc.collect()
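# Drop leftovers from a previous run; pop each name on its own so that
#  a missing `job_agg2` doesn't leave `d_dfs2` in memory
for name_ in ('job_agg2', 'd_dfs2'):
    globals().pop(name_, None)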
gc.collect()

job_agg2 = AggregateEmbeddings(
    run_name=f"full_lc_true-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_full_lc_true.config_flat
)
job_agg2.run_aggregation()

gc.collect()

15:47:51 | INFO | "== Start run_aggregation() method =="
15:47:51 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
15:47:52 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-08-10_154752-full_lc_true-2021-08-10_154751"
15:47:52 | INFO | "  Saving config to local path..."
15:47:52 | INFO | "  Logging config to mlflow..."
15:47:52 | INFO | "-- Start _load_raw_embeddings() method --"
15:47:52 | INFO | "Loading subreddit description embeddings..."
15:47:53 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_subreddits_description"


  0%|          | 0/4 [00:00<?, ?it/s]

15:47:54 | INFO | "  Reading 1 files"
15:47:55 | INFO | "       3,767 |  513 <- Raw vectorized subreddit description shape"
15:47:56 | INFO | "Loading POSTS embeddings..."
15:47:57 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_posts"


  0%|          | 0/51 [00:00<?, ?it/s]

15:48:44 | INFO | "  Reading 48 files"
15:48:47 | INFO | "   1,649,929 |  514 <- Raw POSTS shape"
15:48:51 | INFO | "Loading COMMENTS embeddings..."
15:48:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_comments"


  0%|          | 0/38 [00:00<?, ?it/s]

15:54:48 | INFO | "  Reading 37 files"
15:54:49 | INFO | "  0:06:56.293258 <- Total raw embeddings load time elapsed"
15:54:49 | INFO | "-- Start _load_metadata() method --"
15:54:49 | INFO | "Loading POSTS metadata..."
15:54:49 | INFO | "Reading raw data..."
15:54:49 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/posts/top/2021-07-16"


  0%|          | 0/43 [00:00<?, ?it/s]

15:54:51 | INFO | "  Applying transformations..."
15:54:52 | INFO | "  (1649929, 14) <- Raw META POSTS shape"
15:54:52 | INFO | "Loading subs metadata..."
15:54:52 | INFO | "  reading sub-level data & merging with aggregates..."
15:54:52 | INFO | "Reading raw data..."
15:54:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/subreddits/top/2021-07-16"


  0%|          | 0/1 [00:00<?, ?it/s]

15:54:53 | INFO | "  Applying transformations..."
15:54:54 | INFO | "  (3767, 38) <- Raw META subreddit description shape"
15:54:54 | INFO | "Loading COMMENTS metadata..."
15:54:54 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-07-09"


  0%|          | 0/37 [00:00<?, ?it/s]

15:54:55 | INFO | "  (Delayed('int-11aa2518-d088-4702-bee1-c90e9c40927d'), 7) <- Raw META COMMENTS shape"
15:54:55 | INFO | "  0:00:05.773888 <- Total metadata loading time elapsed"
15:54:55 | INFO | "-- Start _agg_comments_to_post_level() method --"
15:54:55 | INFO | "Getting count of comments per post..."
'<=' not supported between instances of 'NoneType' and 'int'"
15:55:18 | INFO | "Filtering which comments need to be averaged..."
15:56:48 | INFO | "      126,642 <- Comments that DON'T need to be averaged"
15:56:48 | INFO | "   19,041,512 <- Comments that need to be averaged"
15:56:48 | INFO | "No column to weight comments, simple mean for comments at post level"
15:59:15 | INFO | "      979,701 |  514 <- df_v_com_agg SHAPE"
15:59:15 | INFO | "  0:04:20.021986 <- Total comments to post agg loading time elapsed"
15:59:15 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
15:59:17 | INFO | "DEFINE agg_posts_w_comments..."
15:59:17 | INFO | "  (Dela

  0%|          | 0/11 [00:00<?, ?it/s]

17:18:50 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
17:18:50 | INFO | "Saving locally..."
17:42:53 | INFO | "  Saving existing dask df as parquet..."
18:06:23 | INFO | "Logging artifact to mlflow..."
18:06:25 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
18:06:25 | INFO | "Saving locally..."
18:06:25 | INFO | "Keeping index intact..."
18:06:25 | INFO | "Converting pandas to dask..."
18:06:25 | INFO | "   108.6 MB <- Memory usage"
18:06:25 | INFO | "       3	<- target Dask partitions	   40.0 <- target MB partition size"
18:06:29 | INFO | "Logging artifact to mlflow..."
18:06:31 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
18:06:31 | INFO | "Saving locally..."
18:06:33 | INFO | "Converting pandas to dask..."
18:06:40 | INFO | "  6,002.0 MB <- Memory usage"
18:06:40 | INFO | "      81	<- target Dask partitions	   75.0 <- target MB partition size"
18:06:53 | INFO | "Logging artifact to mlflow..."
18:07:16 | I

In [None]:
mlflow.end_run("FAILED")

## Check output dfs

In [16]:
%%time

d_dfs2 = dict()
(
    d_dfs2['df_v_sub'], d_dfs2['df_v_posts'], d_dfs2['df_v_comments'],
#     d_dfs2['df_subs_meta'], d_dfs2['df_posts_meta'], d_dfs2['df_comments_meta'],
    
    # Aggs don't get computed until run_aggregation() method gets called
    d_dfs2['df_subs_agg_a'], d_dfs2['df_subs_agg_b'], d_dfs2['df_subs_agg_c'], 
    d_dfs2['df_posts_agg_b'], d_dfs2['df_posts_agg_c'], 
    # d_dfs2['df_posts_agg_d'],

) = (
    job_agg2.df_v_sub, job_agg2.df_v_posts, job_agg2.df_v_comments,
#     job_agg2.df_subs_meta, job_agg2.df_posts_meta, job_agg2.df_comments_meta,
    
    job_agg2.df_subs_agg_a, job_agg2.df_subs_agg_b, job_agg2.df_subs_agg_c, 
    job_agg2.df_posts_agg_b, job_agg2.df_posts_agg_c,
    # job_agg2.df_posts_agg_d,  # D doesn't exist yet
)

for k2, df_2 in tqdm(d_dfs2.items()):
    print(f"\n{k2}")
    try:
        print(f"{df_2.shape} <- df shape")
        print(f"{df_2.npartitions} <- dask partitions")
        # print(f"{get_dask_df_shape(df_2)} <- df.shape")
        # print(f"  {df_2.memory_usage(deep=True).sum() / 1048576:4,.1f} MB <- Memory usage")
        if any(['meta' in k2, '_v_' in k2]):
            pass
        else:
            pass
            # display(df_2.iloc[:5, :15])

    except (TypeError, AttributeError):
        if isinstance(df_2, pd.DataFrame):
            print(f"{df_2.shape} <- df shape")

  0%|          | 0/8 [00:00<?, ?it/s]


df_v_sub
(Delayed('int-185fda3f-7783-4eda-ab30-01e2e615a376'), 513) <- df shape
1 <- dask partitions

df_v_posts
(Delayed('int-e68a5168-e9cd-4015-909f-fe9bd1f44b45'), 514) <- df shape
48 <- dask partitions

df_v_comments
(Delayed('int-193d0b34-441d-47ed-a139-7810bc1b5d22'), 515) <- df shape
37 <- dask partitions

df_subs_agg_a

df_subs_agg_b

df_subs_agg_c

df_posts_agg_b

df_posts_agg_c
CPU times: user 67.1 ms, sys: 19.9 ms, total: 87 ms
Wall time: 73.5 ms


In [17]:
%%time
job_agg2.df_v_comments.tail()

CPU times: user 13.2 s, sys: 3.97 s, total: 17.2 s
Wall time: 5min 16s


Unnamed: 0,subreddit_name,post_id,comment_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9,embeddings_10,embeddings_11,embeddings_12,embeddings_13,embeddings_14,embeddings_15,embeddings_16,embeddings_17,embeddings_18,embeddings_19,embeddings_20,embeddings_21,embeddings_22,embeddings_23,embeddings_24,embeddings_25,embeddings_26,...,embeddings_482,embeddings_483,embeddings_484,embeddings_485,embeddings_486,embeddings_487,embeddings_488,embeddings_489,embeddings_490,embeddings_491,embeddings_492,embeddings_493,embeddings_494,embeddings_495,embeddings_496,embeddings_497,embeddings_498,embeddings_499,embeddings_500,embeddings_501,embeddings_502,embeddings_503,embeddings_504,embeddings_505,embeddings_506,embeddings_507,embeddings_508,embeddings_509,embeddings_510,embeddings_511
483398,19219,t3_nyqfmb,t1_h1lzfc6,-0.039235,0.037132,-0.033202,-0.009557,0.050591,0.035887,-0.075159,0.026383,-0.050233,0.015396,-0.038596,-0.044591,0.057975,-0.015356,0.055454,0.05732,0.018831,0.008324,0.011653,0.018594,-0.043833,0.011787,0.042242,-0.052835,0.024818,0.004836,0.049712,...,-0.000133,0.011356,-0.082734,-0.010753,0.048662,-0.024157,-0.024809,-0.055835,-0.003466,0.037034,-0.026417,0.017363,0.041431,0.012136,-0.068084,0.06084,-0.04969,-0.052496,-0.038352,0.043732,0.013604,0.044257,0.021174,0.014155,-0.009225,-0.033644,0.016431,-0.012967,0.034172,0.043068
483399,19219,t3_nyqfmb,t1_h1nvupw,0.133316,-0.078149,-0.041633,0.018155,-0.0543,0.0676,-0.048958,0.065234,0.043681,-0.023903,0.006257,0.00196,-0.027169,-0.032003,-0.073401,-0.054141,0.022288,-0.03241,-0.012456,0.081674,-0.066014,0.125445,0.048917,-0.001884,0.089766,-0.058626,-0.003665,...,0.008788,0.017257,-0.053767,-0.019011,0.075949,-0.114006,-0.055496,-0.027919,-0.029389,-0.003952,0.030141,-0.009108,-0.059239,-0.037925,-0.078233,0.025885,-0.042855,0.039699,-0.011792,-0.036198,-0.101189,0.005298,0.046411,-0.064717,-0.060539,-0.023111,-0.064473,0.050646,-0.025636,-0.001309
483400,19219,t3_nyqjvy,t1_h1nvw1v,0.126373,-0.060233,0.060693,-0.05058,-0.087892,0.031024,0.024518,0.037328,-0.121069,0.014265,0.004135,-0.042354,0.052518,-0.049579,-0.047506,-0.035735,0.050334,0.01555,-0.019624,0.054169,-0.037536,0.051272,-0.019931,-0.042147,0.038471,0.049895,-0.061262,...,-0.049496,-0.010919,-0.008698,0.055555,0.020364,-0.069475,0.026155,0.009584,0.056685,0.018035,0.043154,0.060423,0.002035,-0.032125,0.019131,0.001583,-0.023084,-0.010004,0.003967,0.029053,-0.11076,-0.039405,0.065578,0.058648,0.053029,0.002569,-0.01634,0.012294,0.02745,0.116622
483401,19219,t3_nyqjvy,t1_h1lpzcm,0.135668,-0.013888,0.027121,0.007115,-0.040884,0.06713,0.007366,0.049615,-0.111433,0.061169,0.003263,0.020159,0.060283,-0.035772,-0.02329,-0.019193,-0.002709,0.091329,0.057891,0.036463,-0.036746,0.075548,0.040508,0.011599,-0.033443,-0.019531,-0.037201,...,-0.032646,-0.064596,0.01203,0.022919,0.003585,-0.117777,-0.010177,-0.039885,0.04084,0.014177,0.07619,0.048344,0.018211,-0.021509,0.029074,-0.004563,0.017267,0.001694,0.029015,0.021855,-0.07453,-0.062627,0.083473,0.074024,0.053475,-0.008467,0.01031,-0.026929,0.015378,0.107798
483402,amarantavp,t3_nvktm4,t1_h1nrxvs,-0.043134,-0.053204,-0.095469,-0.064357,-0.064727,0.032114,0.01315,-0.052689,-0.06274,0.032209,0.048887,-0.026321,0.053665,-0.0238,-0.019007,0.018622,0.028733,0.003409,0.034294,-0.061813,-0.040847,0.042555,0.010091,0.086087,0.037003,-0.03778,0.082697,...,0.008532,-0.074757,0.063589,0.060451,0.053137,-0.071232,-0.0158,0.034549,0.036525,0.006025,-0.020449,0.023536,0.017202,-0.038329,0.021685,0.025175,-0.036519,-0.056744,-0.009225,-0.049298,-0.074865,0.03127,0.031479,-0.023028,0.047527,-0.019314,-0.058948,-0.020677,-4.9e-05,0.085667


In [None]:
# %%time
# s_post_id = job_agg2.df_v_comments['post_id'].compute()

In [19]:
%%time
s_post_post_id = job_agg2.df_v_posts['post_id'].compute()

CPU times: user 1.77 s, sys: 562 ms, total: 2.33 s
Wall time: 14.7 s


In [21]:
s_post_post_id.shape

(1649929,)

In [22]:
s_post_post_id.head()

0    t3_oa3lo5
1    t3_oae9rr
2    t3_oa1y50
3    t3_oa7uvd
4    t3_oa7p5v
Name: post_id, dtype: object

# Run test with `lower_case=False`

Sample only a few comment and post files to make sure that the job completes even when we're testing new code/logic.

In [None]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=2',     # 
               'n_sample_comments_files=2',  # 6 is limit for logging unique counts at comment level
               'calculate_similarites=false',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

In [None]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_test_sample_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        for k2_, v2_ in v_.items():
            print(f"  {k2_}: \t{v2_}")
    else:
        print(f"{k_}: \t{v_}")

In [47]:
%%time

mlflow.end_run("FAILED")
gc.collect()
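# Drop leftovers from a previous run; pop each name on its own so that
#  a missing `job_agg1` doesn't leave `d_dfs1` in memory
for name_ in ('job_agg1', 'd_dfs1'):
    globals().pop(name_, None)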
gc.collect()

job_agg_test = AggregateEmbeddings(
    run_name=f"sample_test_lc_false-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_test_sample_lc_false.config_flat
)
job_agg_test.run_aggregation()

gc.collect()

19:57:10 | INFO | "== Start run_aggregation() method =="
19:57:10 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
19:57:10 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-10-05_195710-sample_test_lc_false-2021-10-05_195710"
19:57:10 | INFO | "  Saving config to local path..."
19:57:10 | INFO | "  Logging config to mlflow..."
19:57:11 | INFO | "-- Start _load_raw_embeddings() method --"
19:57:11 | INFO | "Loading subreddit description embeddings..."
19:57:11 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/8eef951842a34a6e81d176b15ae74afd/artifacts/df_vect_subreddits_description"


  0%|          | 0/5 [00:00<?, ?it/s]

19:57:11 | INFO | "  Parquet files found: 2"
19:57:12 | INFO | "      19,262 |  513 <- Raw vectorized subreddit description shape"
19:57:12 | INFO | "Loading POSTS embeddings..."
19:57:12 | INFO | "  Sampling POSTS FILES down to: 5"
19:57:13 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/8eef951842a34a6e81d176b15ae74afd/artifacts/df_vect_posts"


  0%|          | 0/28 [00:00<?, ?it/s]

19:57:13 | INFO | "  Parquet files found: 5"
19:57:17 | INFO | "   1,560,976 |  514 <- Raw POSTS shape"
19:57:23 | INFO | "Loading COMMENTS embeddings..."
19:57:23 | INFO | "  Sampling COMMENTS FILES down to: 10"
19:57:23 | INFO | "  Sampling 5 FILES per run UUID"
19:57:23 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/5f10cd75334142168a6ebb787e477c1f/artifacts/df_vect_comments"


  0%|          | 0/21 [00:00<?, ?it/s]

20:01:18 | INFO | "  Parquet files found: 5"
20:01:19 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/2fcfefc3d5af43328168d3478b4fdeb6/artifacts/df_vect_comments"


  0%|          | 0/40 [00:00<?, ?it/s]

20:08:49 | INFO | "  Parquet files found: 5"
20:08:49 | INFO | "  Keep only comments for posts with embeddings"
20:08:54 | INFO | "  0:11:43.326935 <- Total raw embeddings load time elapsed"
20:08:54 | INFO | "-- Start _load_metadata() method --"
20:08:54 | INFO | "Loading POSTS metadata..."
20:08:54 | INFO | "Reading raw data..."
20:08:55 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/posts/top/2021-09-27"


  0%|          | 0/27 [00:00<?, ?it/s]

20:09:49 | INFO | "  Applying transformations..."
20:10:27 | INFO | "  (8439672, 15) <- Raw META POSTS shape"
20:10:27 | INFO | "Loading subs metadata..."
20:10:27 | INFO | "  reading sub-level data & merging with aggregates..."
20:10:27 | INFO | "Reading raw data..."
20:10:27 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/subreddits/top/2021-09-24"


  0%|          | 0/1 [00:00<?, ?it/s]

20:10:29 | INFO | "  Applying transformations..."
20:10:39 | INFO | "  (19262, 91) <- Raw META subreddit description shape"
20:10:39 | INFO | "Loading COMMENTS metadata..."
20:10:39 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-10-04"


  0%|          | 0/59 [00:00<?, ?it/s]

20:12:10 | INFO | "  (Delayed('int-e6188e6d-6319-487d-b054-bea8a30d912b'), 7) <- Raw META COMMENTS shape"
20:12:10 | INFO | "  0:03:15.218880 <- Total metadata loading time elapsed"
20:12:10 | INFO | "-- Start _agg_comments_to_post_level() method --"
20:12:10 | INFO | "Getting count of comments per post..."
20:12:39 | INFO | "Filtering which comments need to be averaged..."
20:13:23 | INFO | "       22,197 <- Comments that DON'T need to be averaged"
20:13:23 | INFO | "    1,087,458 <- Comments that need to be averaged"
20:13:28 | INFO | "No column to weight comments, simple mean for comments at post level"
20:13:57 | INFO | "      191,558 |  514 <- df_v_com_agg SHAPE"
20:13:57 | INFO | "  0:01:46.878385 <- Total comments to post agg loading time elapsed"
20:13:57 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
20:13:59 | INFO | "DEFINE agg_posts_w_comments..."
20:13:59 | INFO | "  (Delayed('int-6b8041bc-e42f-4c56-914c-7333a1959a2c'), 513) <- df_ag

  0%|          | 0/11 [00:00<?, ?it/s]

20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."
21:13:56 | INFO | "  Saving existing dask df as parquet..."
21:33:11 | INFO | "Logging artifact to mlflow..."
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
21:33:14 | INFO | "Saving locally..."
21:33:14 | INFO | "Keeping index intact..."
21:33:14 | INFO | "Converting pandas to dask..."
21:33:15 | INFO | "   185.4 MB <- Memory usage"
21:33:15 | INFO | "       5	<- target Dask partitions	   40.0 <- target MB partition size"
21:33:19 | INFO | "Logging artifact to mlflow..."
21:33:22 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
21:33:22 | INFO | "Saving locally..."
21:33:24 | INFO | "Converting pandas to dask..."
21:33:35 | INFO | "  10,391.8 MB <- Memory usage"
21:33:35 | INFO | "     139	<- target Dask partitions	   75.0 <- target MB partition size"
21:33:55 | INFO | "Logging artifact to mlflow..."
21:34:30 | 

CPU times: user 1h 2min 24s, sys: 7min 21s, total: 1h 9min 46s
Wall time: 2h 27min 8s


13532

### time profiling

inputs:
```yaml
mlflow_experiment: 	v0.4.0_use_multi_aggregates_test
n_sample_posts_files: 	5
n_sample_comments_files: 	10

aggregate_params:
  min_comment_text_len: 	10
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
```
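As a rough illustration of what the three `agg_post_*_weight` values could mean, here is a minimal sketch of a weighted average over the post, comment, and subreddit-description vectors. It is not taken from `AggregateEmbeddings`: the function name `combine_embeddings` is hypothetical, and re-normalizing the weights when a post has no comments or no description is an assumption.

```python
from typing import List, Optional


def combine_embeddings(
    post: List[float],
    comments: Optional[List[float]] = None,
    sub_desc: Optional[List[float]] = None,
    agg_post_post_weight: float = 70,
    agg_post_comment_weight: float = 20,
    agg_post_subreddit_desc_weight: float = 10,
) -> List[float]:
    """Weighted average of the vectors that are present (hypothetical helper)."""
    parts = [(agg_post_post_weight, post)]
    if comments is not None:
        parts.append((agg_post_comment_weight, comments))
    if sub_desc is not None:
        parts.append((agg_post_subreddit_desc_weight, sub_desc))

    # ASSUMPTION: weights are re-normalized over the components that exist,
    # so a post without comments is still a proper weighted mean.
    total_weight = sum(weight for weight, _ in parts)
    return [
        sum(weight * vec[i] for weight, vec in parts) / total_weight
        for i in range(len(post))
    ]


# Toy 2-dimensional vectors, only to show the arithmetic
full = combine_embeddings([1.0, 0.0], comments=[0.0, 1.0], sub_desc=[1.0, 1.0])
no_comments = combine_embeddings([1.0, 0.0], sub_desc=[1.0, 1.0])
print(full, no_comments)
```

On the real data this would be done column-wise on the dask/pandas frames rather than with Python lists; the list version only shows how the weights combine.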

cluster set up:
```
96 CPUS
640 GB RAM

8 workers
- 12 threads per worker
- 76 GB per worker
```
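The notes above don't record how the cluster was started. One way to get this layout is a `dask.distributed.LocalCluster`, sketched below. The helper names are hypothetical, and passing the 76 GB figure as the per-worker `memory_limit` is an assumption about how that number was applied.

```python
def local_cluster_kwargs(n_workers=8, threads_per_worker=12, memory_gb_per_worker=76):
    """Build keyword arguments for LocalCluster (hypothetical helper)."""
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        # memory_limit in LocalCluster applies to each worker, not the whole cluster
        "memory_limit": f"{memory_gb_per_worker}GB",
    }


def start_local_cluster(**kwargs):
    """Start the cluster and return a client connected to it.

    The import is kept inside the function so the settings above can be
    inspected without dask.distributed installed.
    """
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**kwargs)
    return Client(cluster)


cluster_kwargs = local_cluster_kwargs()
total_threads = cluster_kwargs["n_workers"] * cluster_kwargs["threads_per_worker"]
total_memory_gb = cluster_kwargs["n_workers"] * 76
print(cluster_kwargs, total_threads, total_memory_gb)
```

With these settings, 8 workers x 12 threads uses all 96 CPUs, and 8 x 76 GB = 608 GB leaves about 32 GB of the 640 GB for the notebook process and the OS. Calling `start_local_cluster(**cluster_kwargs)` would actually launch the workers.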

### Filtered/selected logs

Overview:

| Time/ETA | Step | Notes |
| --- | --- | --- |
| `0:11:43` minutes | Load raw embeddings (w/o caching) | |
| `0:03:15` minutes | Load metadata (w/o caching) |  |
| `0:04:30` minutes | Aggregation steps (all) | This might only be the time to create the DAG, not necessarily the time to actually compute the data |
| `0:37:24` minutes | Calculate similarities |  |
| `1:30:00` HOURS | Saving & logging files | Saving alone could take more than 1 hour... and I'd forgotten about this |


Note that the ETAs for saving each DF are very different: the first 2 are really large and take a long time, while the last few are smaller, so the time estimates from `tqdm` can vary a ton:
```bash
3/11 [40:25<1:14:34, 559.33s/it]   27%
9/11 [50:22<03:22, 101.02s/it]     82% 
11/11 [1:30:11<00:00, 750.74s/it] 100%
```
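The swings make sense once you see how the estimate is built. As far as I understand `tqdm`, the remaining time is just the number of remaining items multiplied by the current (smoothed) seconds-per-item rate, so a few slow files early on inflate the ETA and a few fast ones deflate it. A small sketch that reproduces the two intermediate ETAs above from the reported rates (the function name `eta_from_rate` is made up for illustration):

```python
from datetime import timedelta


def eta_from_rate(done: int, total: int, secs_per_item: float) -> timedelta:
    """Remaining time = items left x current seconds-per-item estimate."""
    remaining_items = total - done
    # Truncate to whole seconds, like the progress-bar display
    return timedelta(seconds=int(remaining_items * secs_per_item))


# (done, total, reported s/it) copied from the progress-bar snapshots above
snapshots = [(3, 11, 559.33), (9, 11, 101.02)]
etas = [eta_from_rate(done, total, rate) for done, total, rate in snapshots]

for (done, total, rate), eta in zip(snapshots, etas):
    print(f"{done}/{total} at {rate:.2f}s/it -> ETA {eta}")
```

Because the files differ so much in size, the rate at any moment says little about the files still to come, which is why the ETA at 3/11 overshot the final `1:30:11`.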



More details in log file:

`logs/AggregateEmbeddings/2021-10-05_195710-sample_test_lc_false-2021-10-05_195710.log`


```bash
# load raw embeddings (w/o caching): 11:43 minutes
# ---
20:01:19 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/2fcfefc3d5af43328168d3478b4fdeb6/artifacts/df_vect_comments"
40/40 [07:29<00:00, 8.17s/it]
20:08:49 | INFO | "  Parquet files found: 5"
20:08:49 | INFO | "  Keep only comments for posts with embeddings"
20:08:54 | INFO | "  0:11:43.326935 <- Total raw embeddings load time elapsed"


# Load metadata (w/o caching): 3:15 minutes
# ---
20:08:54 | INFO | "-- Start _load_metadata() method --"
20:08:54 | INFO | "Loading POSTS metadata..."

20:10:39 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-10-04"
100%
59/59 [01:30<00:00, 1.43s/it]
20:12:10 | INFO | "  (Delayed('int-e6188e6d-6319-487d-b054-bea8a30d912b'), 7) <- Raw META COMMENTS shape"
20:12:10 | INFO | "  0:03:15.218880 <- Total metadata loading time elapsed"


# Aggregation steps (all): 4:30 minutes
#   Note that this might only be the time to create the DAG, not necessarily the time to actually compute the data
20:12:10 | INFO | "-- Start _agg_comments_to_post_level() method --"
20:12:10 | INFO | "Getting count of comments per post..."
20:12:39 | INFO | "Filtering which comments need to be averaged..."
20:13:23 | INFO | "       22,197 <- Comments that DON'T need to be averaged"
20:13:23 | INFO | "    1,087,458 <- Comments that need to be averaged"
20:13:28 | INFO | "No column to weight comments, simple mean for comments at post level"
20:13:57 | INFO | "      191,558 |  514 <- df_v_com_agg SHAPE"
20:13:57 | INFO | "  0:01:46.878385 <- Total comments to post agg loading time elapsed"
20:13:57 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
20:13:59 | INFO | "DEFINE agg_posts_w_comments..."
...
20:16:38 | INFO | "A - posts only"
20:16:39 | INFO | "  (Delayed('int-3ee084d4-434c-4a38-aeb1-185b50648908'), 513) <- df_subs_agg_a.shape (only posts)"
20:16:39 | INFO | "B - posts + comments"
20:16:39 | INFO | "  (Delayed('int-6bd21f72-fc1d-495c-987a-1da6e4a18683'), 513) <- df_subs_agg_b.shape (posts + comments)"
20:16:39 | INFO | "C - posts + comments + sub descriptions"
20:16:40 | INFO | "  (Delayed('int-59c54047-7ac8-49cb-81b3-478b4ae5b60e'), 513) <- df_subs_agg_c.shape (posts + comments + sub description)"
20:16:40 | INFO | "  0:00:01.507065 <- Total for ALL subreddit-level agg time elapsed"


# Calculate similarities: 37:24 minutes
20:16:40 | INFO | "-- Start _calculate_subreddit_similarities() method --"
20:16:40 | INFO | "A..."
20:16:56 | INFO | "  (4924, 4924) <- df_subs_agg_a_similarity.shape"
20:17:21 | INFO | "Merge distance + metadata..."
20:17:59 | INFO | "Create new df to keep only top 20 subs by distance..."
20:18:10 | INFO | "  (24240852, 11) <- df_dist_pair_meta.shape (before setting index)"
20:18:10 | INFO | "  (98480, 11) <- df_dist_pair_meta_top_only.shape (before setting index)"
...
20:54:04 | INFO | "  0:37:24.347689 <- Total for _calculate_subreddit_similarities() time elapsed"


# *** Saving & logging files: ETA at 3/11 pointed to almost 2 hours (40:25 elapsed + 1:14:34 remaining); the final total was 1:30:11 ***
20:54:04 | INFO | "-- Start _save_and_log_aggregate_and_similarity_dfs() method --"
20:54:04 | INFO | "  Saving config to local path..."
20:54:04 | INFO | "  Logging config to mlflow..."
*** 3/11 [40:25<1:14:34, 559.33s/it]  27%   ***
20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."
...
21:13:56 | INFO | "  Saving existing dask df as parquet..."
21:33:11 | INFO | "Logging artifact to mlflow..."
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
21:33:14 | INFO | "Saving locally..."

```
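The step durations in the overview table can be recomputed from the timestamps in the filtered logs. A minimal sketch using only the standard library (the `elapsed` helper and the step labels are mine, not part of `subclu`):

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Difference between two HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        # The log has no dates, so assume the run crossed midnight once
        delta += timedelta(days=1)
    return delta


# (start, end) timestamps copied from the log lines above
steps = {
    "load raw embeddings": ("19:57:11", "20:08:54"),
    "aggregation steps": ("20:12:10", "20:16:40"),
    "calculate similarities": ("20:16:40", "20:54:04"),
}
durations = {name: elapsed(start, end) for name, (start, end) in steps.items()}

for name, duration in durations.items():
    print(f"{name:>24}: {duration}")
```

Timestamps are logged to the second, so values computed this way can differ by a second or so from the "time elapsed" lines that the methods log themselves.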