# Purpose

2021-10-05:
I ran into memory errors with 600 GB of RAM, so here's a try with 1.4 TB... If this doesn't work, then I don't know what will.

---

2021-08-10: Finally completed testing with sampling <= 10 files. Now ready to run process on full data!

Ended up doing it all in dask + pandas + numpy because of problems installing `cuDF`.

---
2021-08-02: Now that I'm processing millions of comments and posts, I need to re-write the functions to try to do some work in parallel and reduce the amount of data loaded in RAM.

- `Dask` seems like a great option to load data and only compute some of it as needed.
- `cuDF` could be a way to speed up some computation using GPUs
- `Dask-delayed` could be a way to create a task DAG lazily before computing all the aggregates.
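
As a rough illustration of the `Dask-delayed` idea above, the sketch below builds a small task graph lazily and only runs it when `.compute()` is called. The function names (`load_part`, `mean_of_part`) and the toy data are hypothetical stand-ins, not the functions used in this project.

```python
from dask import delayed


@delayed
def load_part(i):
    # Hypothetical loader: stands in for reading one parquet file
    return [i, i + 1, i + 2]


@delayed
def mean_of_part(part):
    # Per-file aggregate; only this small result is kept, not the raw rows
    return sum(part) / len(part)


# Nothing runs here: we only describe the task graph
part_means = [mean_of_part(load_part(i)) for i in range(4)]
overall = delayed(sum)(part_means)

# The work happens only when we ask for the result
result = overall.compute(scheduler="synchronous")
print(result)
```

Because each file is reduced to a small aggregate before anything is combined, the raw rows from all files never need to sit in RAM at the same time.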


---

In notebook 09 I combined embeddings from posts & subreddits (`djb_09.00-combine_post_and_comments_and_visualize_for_presentation.ipynb`).

In this notebook I'll be testing functions that include mlflow so that it's easier to try a lot of different weights to find better representations.

Take embeddings created by other models & combine them:
```
new post embeddings = post + comments + subreddit description

new subreddit embeddings = new posts (weighted by post length or upvotes?)
```
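
To make the first formula concrete, here is a minimal numpy sketch of combining three embeddings into one post-level vector. It assumes a normalized weighted average and uses the 70/20/10 weights that appear in `aggregate_params` further down; whether `AggregateEmbeddings` combines them exactly this way is an assumption, and `combine_embeddings` is a hypothetical name.

```python
import numpy as np

# Weights taken from `aggregate_params` (post=70, comments=20, sub description=10)
W_POST, W_COMMENTS, W_SUB_DESC = 70, 20, 10


def combine_embeddings(v_post, v_comments_mean, v_sub_desc):
    """Weighted average of three embedding vectors of the same length."""
    weights = np.array([W_POST, W_COMMENTS, W_SUB_DESC], dtype=float)
    stacked = np.vstack([v_post, v_comments_mean, v_sub_desc])
    # np.average divides by the sum of the weights, so they act as percentages
    return np.average(stacked, axis=0, weights=weights)


# Toy 3-dimensional vectors (real embeddings are much wider)
v_new_post = combine_embeddings(
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
)
print(v_new_post)
```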

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import logging
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.models.aggregate_embeddings import (
    AggregateEmbeddings, AggregateEmbeddingsConfig,
    load_config_agg_jupyter, get_dask_df_shape,
)

from subclu.utils import set_working_directory
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks
from subclu.eda.aggregates import (
    compare_raw_v_weighted_language
)
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl
)


print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Get runs that we can use for embeddings aggregation jobs

In [6]:
%%time

df_mlf_runs =  mlf.search_all_runs(experiment_ids=[13, 14, 15, 16])
df_mlf_runs.shape

CPU times: user 88.6 ms, sys: 633 µs, total: 89.2 ms
Wall time: 88.5 ms


(44, 120)

In [7]:
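mask_finished = df_mlf_runs['status'] == 'FINISHED'  # defined for reference; not applied below
# NOTE: despite the variable name, the threshold used here is 1e5 (100k rows), not 1M
mask_output_over_1M_rows = (
    (df_mlf_runs['metrics.df_vect_posts_rows'] >= 1e5) |
    (df_mlf_runs['metrics.df_vect_comments'] >= 1e5)
)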
# df_mlf_runs[mask_finished].shape

df_mlf_use_for_agg = df_mlf_runs[mask_output_over_1M_rows]
df_mlf_use_for_agg.shape

(3, 120)

In [8]:
cols_with_multiple_vals = df_mlf_use_for_agg.columns[df_mlf_use_for_agg.nunique(dropna=False) > 1]
# len(cols_with_multiple_vals)

style_df_numeric(
    df_mlf_use_for_agg
    [cols_with_multiple_vals]
    .drop(['artifact_uri', 'end_time',
           # 'start_time',
           ], 
          axis=1)
    .dropna(axis='columns', how='all')
    .iloc[:, :30]
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,run id,experiment id,start time,metrics.total comment files processed,metrics.df vect comments,metrics.vectorizing time minutes comments,metrics.vectorizing time minutes full function,params.n sample comment files,params.tf batch inference rows,params.n comment files slice start,params.n comment files slice end,tags.mlflow.runName
18,deb3454ece2a4a8d8e4149c2d8494c0d,14,2021-10-05 01:44:32.386000+00:00,15.0,10121046.0,39.11,45.94,15,3200,,,comments_batch_01-2021-10-05_014431
19,5f10cd75334142168a6ebb787e477c1f,14,2021-10-05 00:22:20.334000+00:00,20.0,13558304.0,47.64,57.33,20,4200,,,comments_batch_01-2021-10-05_002219
23,9a27f9a72cf348c98d50f486abf3b009,13,2021-10-04 22:21:46.401000+00:00,2.0,1286661.0,3.93,5.03,2,6000,,,posts_as_comments_full_text-2021-10-04_222146


# Load configs for aggregation jobs

`n_sample_comments_files` and `n_sample_posts_files` allow us to only load a few files at a time (e.g., 2 instead of 50) to test the process end-to-end.

---
Note that by default `hydra` is a CLI tool. If we want to use it in Jupyter, we need to manually initialize configs & compose the configuration. See my custom function `load_config_agg_jupyter`. Also see:
- [Notebook with `Hydra` examples in a notebook](https://github.com/facebookresearch/hydra/blob/master/examples/jupyter_notebooks/compose_configs_in_notebook.ipynb).
- [Hydra docs, Hydra in Jupyter](https://hydra.cc/docs/next/advanced/jupyter_notebooks/).


In [9]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=4',     # 
               'n_sample_comments_files=4',  # 6 is limit for logging unique counts at comment level
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

config_full_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_full}",
               'n_sample_posts_files=null', 
               'n_sample_comments_files=null',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

# config_full_lc_true = AggregateEmbeddingsConfig(
#     config_path="../config",
#     config_name='aggregate_embeddings',
#     overrides=[f"mlflow_experiment={mlflow_experiment_full}",
#                'n_sample_posts_files=null', 
#                'n_sample_comments_files=null',
#                'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_true',
#               ]
# )
# pprint(config_test_sample_lc_false.config_dict, indent=2)

In [10]:
# config_test_sample_lc_false.config_flat,

In [11]:
df_configs = pd.DataFrame(
    [
        config_test_sample_lc_false.config_flat,
        # config_test_full_lc_false.config_flat,
        config_full_lc_false.config_flat,
        # config_full_lc_true.config_flat,
    ]
)

In [12]:
# We can't use (df_configs.nunique(dropna=False) > 1)
#  because when a col's content is a list or something unhashable, we get an error
#  so instead we'll check each column individually

# cols_with_diffs_config = df_configs.columns[df_configs.nunique(dropna=False) > 1]
cols_with_diffs_config = list()
for c_ in df_configs.columns:
    try:
        if df_configs[c_].nunique() > 1:
            cols_with_diffs_config.append(c_)
    except TypeError:
        cols_with_diffs_config.append(c_)
        

df_configs[cols_with_diffs_config]

Unnamed: 0,comments_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids_lowercase,subreddit_meta_vectorized_mlflow_uuids,subreddit_meta_vectorized_mlflow_uuids_lowercase,comments_uuid,mlflow_experiment
0,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates_test
1,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates


In [13]:
# pprint(config_test_sample_lc_false.config_flat, indent=2)

In [27]:
BREAK

# Initialize a local dask client
This lets us see the progress/process for dask jobs.

2021-10-06: 

With 96 CPUs & 1.4 TB of RAM:
- By default, dask starts 12 workers with this config.
- The cluster kept getting stuck when I was only using 8 workers.
- I changed it to 10 workers and jobs were moving forward until it ran out of memory too many times.

Let's try 9 workers as a middle path?



---
**dask default**: 8 workers with 64 CPUs present<br>
I tested 16 and 12 workers, but both ran out of RAM and the job crashed.

I rolled back to 8 workers to complete the job and prevent OOM errors. Even if it feels like it's wasting workers, at least it finishes.
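
The trade-off in these notes is easier to see with the arithmetic written out: fewer workers means more RAM (and more threads) per worker. The helper below is a hypothetical sketch that assumes CPUs and RAM are split evenly across workers and treats 1.4 TB as roughly 1,400 GB.

```python
def plan_local_cluster(total_cpus, total_ram_gb, n_workers):
    """Split CPUs and RAM evenly across workers (hypothetical helper)."""
    return {
        "n_workers": n_workers,
        "threads_per_worker": total_cpus // n_workers,
        "memory_limit_gb": total_ram_gb / n_workers,
    }


for n in (8, 9, 10, 12):
    plan = plan_local_cluster(total_cpus=96, total_ram_gb=1400, n_workers=n)
    print(plan)

# The plan could then be passed explicitly instead of relying on defaults:
# from dask.distributed import LocalCluster
# cluster = LocalCluster(
#     n_workers=plan["n_workers"],
#     threads_per_worker=plan["threads_per_worker"],
#     memory_limit=f"{plan['memory_limit_gb']:.0f}GB",
# )
```

Setting `threads_per_worker` and `memory_limit` explicitly at least makes the per-worker budget visible when comparing runs.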

In [14]:
%%time

from dask.distributed import Client, LocalCluster

# dask default: 8 workers with 64 CPUs present,
cluster = LocalCluster(n_workers=9)  # 8 too few, 10 too many, what does it pick by default?
client = Client(cluster)

CPU times: user 526 ms, sys: 157 ms, total: 684 ms
Wall time: 1.4 s


In [15]:
client.dashboard_link

'http://127.0.0.1:8787/status'

In [16]:
# client.close()
# cluster.close()

In [16]:
mlf.log_param_hostname()
mlf.log_cpu_count()

11:00:54 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
11:00:54 | INFO | "cpu_count: 96"


96

In [17]:
mlf.log_ram_stats()
mlf.log_ram_stats(only_memory_used=True)

11:00:55 | INFO | "RAM stats:
{'memory_used_percent': '0.16%', 'memory_total': '1,444,963', 'memory_used': '2,305', 'memory_free': '1,319,944'}"
11:00:55 | INFO | "RAM stats:
{'memory_used_percent': '0.16%', 'memory_used': '2,309'}"


{'memory_used_percent': 0.0015979647921780696, 'memory_used': 2309}

# Run Full data with `lower_case=False`

The logic for sampling files and downloading/caching files locally lives in the custom `MlflowLogger` class (`mlf`).

Caching can save 9+ minutes compared with downloading the files from GCS every time.
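
The caching idea itself is simple: only download when the file isn't already on local disk. Below is a minimal sketch of that pattern. It is not the actual `MlflowLogger` code; `get_cached_file` and `fake_download` are hypothetical names, and the download is faked so the example runs offline.

```python
from pathlib import Path
import tempfile


def get_cached_file(remote_name, cache_dir, download_fn):
    """Return a local path, calling `download_fn` only on a cache miss."""
    local_path = Path(cache_dir) / remote_name
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download_fn(remote_name, local_path)
    return local_path


download_calls = []


def fake_download(remote_name, local_path):
    # Hypothetical stand-in for a GCS download
    download_calls.append(remote_name)
    local_path.write_text("placeholder contents")


with tempfile.TemporaryDirectory() as tmp_dir:
    # First call downloads; second call finds the file in the cache
    get_cached_file("df_vect_posts/part-0.parquet", tmp_dir, fake_download)
    get_cached_file("df_vect_posts/part-0.parquet", tmp_dir, fake_download)

print(len(download_calls))
```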

In [18]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_full_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        [print(f"  {k2_}: \t{v2_}") for k2_, v2_ in v_.items()]
    else:
        print(f"{k_}: \t{v_}")

mlflow_experiment: 	v0.4.0_use_multi_aggregates
n_sample_posts_files: 	None
n_sample_comments_files: 	None

aggregate_params:
  min_comment_text_len: 	2
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
calculate_similarites: 	True


In [19]:
BREAK

NameError: name 'BREAK' is not defined

In [20]:
%%time

try:
    job_agg1._send_log_file_to_mlflow()
    mlflow.end_run("FAILED")
    # run setup_logging() to remove logging to the file of a failed job
    setup_logging()
    
    del job_agg1
    del d_dfs1
except NameError:
    pass
gc.collect()

mlflow.end_run("FAILED")


job_agg1 = AggregateEmbeddings(
    run_name=f"agg_full_lc_false-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_full_lc_false.config_flat
)
job_agg1.run_aggregation()

gc.collect()

11:01:02 | INFO | "== Start run_aggregation() method =="
11:01:02 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
11:01:02 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
11:01:02 | INFO | "cpu_count: 96"
11:01:02 | INFO | "RAM stats:
{'memory_used_percent': '0.16%', 'memory_total': '1,444,963', 'memory_used': '2,316', 'memory_free': '1,319,933'}"
11:01:03 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-10-06_110103-agg_full_lc_false-2021-10-06_110102"
11:01:03 | INFO | "  Saving config to local path..."
11:01:03 | INFO | "  Logging config to mlflow..."
11:01:03 | INFO | "-- Start _load_raw_embeddings() method --"
11:01:03 | INFO | "Loading subreddit description embeddings..."
11:01:04 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/8

KilledWorker: ("('dataframe-groupby-count-agg-74b774925f5f2a72d7c6318b1e2d4dcd', 0)", <WorkerState 'tcp://127.0.0.1:43891', name: 6, memory: 0, processing: 1>)

In [23]:
job_agg1._send_log_file_to_mlflow()
gc.collect()

11:18:04 | INFO | "Logging log-file to mlflow..."


2313

In [22]:
gc.collect()

26

# Run full data, `lower_case=True`

Looks like the problem I ran into with the file being corrupted might've been a problem with downloading the file(s). Fix: delete the local cache and download the files again.
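
A minimal sketch of the "delete the local cache" step, assuming the cache is a plain folder on disk. `clear_local_cache` is a hypothetical helper, and the demo runs against a throwaway temp folder rather than the real cache path.

```python
from pathlib import Path
import shutil
import tempfile


def clear_local_cache(cache_dir):
    """Delete a cache folder so the next run downloads fresh copies."""
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    return not cache_dir.exists()


# Demo on a throwaway folder; the real cache path would come from the job logs
demo_root = Path(tempfile.mkdtemp())
demo_cache = demo_root / "local_cache" / "mlflow"
demo_cache.mkdir(parents=True)
(demo_cache / "possibly_corrupted.parquet").write_text("bad bytes")

cleared = clear_local_cache(demo_cache)
print(cleared)

# Clean up the demo folder itself
shutil.rmtree(demo_root)
```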

In [None]:
BREAK

In [None]:
# %%time

# mlflow.end_run("FAILED")
# gc.collect()
# try:
#     # run setup_logging() to remove logging to the file of a failed job
#     setup_logging()
    
#     del job_agg2
#     del d_dfs2
# except NameError:
#     pass
# gc.collect()

# job_agg2 = AggregateEmbeddings(
#     run_name=f"full_lc_true-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
#     **config_full_lc_true.config_flat
# )
# job_agg2.run_aggregation()

# gc.collect()

15:47:51 | INFO | "== Start run_aggregation() method =="
15:47:51 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
15:47:52 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-08-10_154752-full_lc_true-2021-08-10_154751"
15:47:52 | INFO | "  Saving config to local path..."
15:47:52 | INFO | "  Logging config to mlflow..."
15:47:52 | INFO | "-- Start _load_raw_embeddings() method --"
15:47:52 | INFO | "Loading subreddit description embeddings..."
15:47:53 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_subreddits_description"


  0%|          | 0/4 [00:00<?, ?it/s]

15:47:54 | INFO | "  Reading 1 files"
15:47:55 | INFO | "       3,767 |  513 <- Raw vectorized subreddit description shape"
15:47:56 | INFO | "Loading POSTS embeddings..."
15:47:57 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_posts"


  0%|          | 0/51 [00:00<?, ?it/s]

15:48:44 | INFO | "  Reading 48 files"
15:48:47 | INFO | "   1,649,929 |  514 <- Raw POSTS shape"
15:48:51 | INFO | "Loading COMMENTS embeddings..."
15:48:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/10/a948e9fd651545f997430cddc6b529eb/artifacts/df_vect_comments"


  0%|          | 0/38 [00:00<?, ?it/s]

15:54:48 | INFO | "  Reading 37 files"
15:54:49 | INFO | "  0:06:56.293258 <- Total raw embeddings load time elapsed"
15:54:49 | INFO | "-- Start _load_metadata() method --"
15:54:49 | INFO | "Loading POSTS metadata..."
15:54:49 | INFO | "Reading raw data..."
15:54:49 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/posts/top/2021-07-16"


  0%|          | 0/43 [00:00<?, ?it/s]

15:54:51 | INFO | "  Applying transformations..."
15:54:52 | INFO | "  (1649929, 14) <- Raw META POSTS shape"
15:54:52 | INFO | "Loading subs metadata..."
15:54:52 | INFO | "  reading sub-level data & merging with aggregates..."
15:54:52 | INFO | "Reading raw data..."
15:54:52 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/subreddits/top/2021-07-16"


  0%|          | 0/1 [00:00<?, ?it/s]

15:54:53 | INFO | "  Applying transformations..."
15:54:54 | INFO | "  (3767, 38) <- Raw META subreddit description shape"
15:54:54 | INFO | "Loading COMMENTS metadata..."
15:54:54 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-07-09"


  0%|          | 0/37 [00:00<?, ?it/s]

15:54:55 | INFO | "  (Delayed('int-11aa2518-d088-4702-bee1-c90e9c40927d'), 7) <- Raw META COMMENTS shape"
15:54:55 | INFO | "  0:00:05.773888 <- Total metadata loading time elapsed"
15:54:55 | INFO | "-- Start _agg_comments_to_post_level() method --"
15:54:55 | INFO | "Getting count of comments per post..."
'<=' not supported between instances of 'NoneType' and 'int'"
15:55:18 | INFO | "Filtering which comments need to be averaged..."
15:56:48 | INFO | "      126,642 <- Comments that DON'T need to be averaged"
15:56:48 | INFO | "   19,041,512 <- Comments that need to be averaged"
15:56:48 | INFO | "No column to weight comments, simple mean for comments at post level"
15:59:15 | INFO | "      979,701 |  514 <- df_v_com_agg SHAPE"
15:59:15 | INFO | "  0:04:20.021986 <- Total comments to post agg loading time elapsed"
15:59:15 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
15:59:17 | INFO | "DEFINE agg_posts_w_comments..."
15:59:17 | INFO | "  (Dela

  0%|          | 0/11 [00:00<?, ?it/s]

17:18:50 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
17:18:50 | INFO | "Saving locally..."
17:42:53 | INFO | "  Saving existing dask df as parquet..."
18:06:23 | INFO | "Logging artifact to mlflow..."
18:06:25 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
18:06:25 | INFO | "Saving locally..."
18:06:25 | INFO | "Keeping index intact..."
18:06:25 | INFO | "Converting pandas to dask..."
18:06:25 | INFO | "   108.6 MB <- Memory usage"
18:06:25 | INFO | "       3	<- target Dask partitions	   40.0 <- target MB partition size"
18:06:29 | INFO | "Logging artifact to mlflow..."
18:06:31 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
18:06:31 | INFO | "Saving locally..."
18:06:33 | INFO | "Converting pandas to dask..."
18:06:40 | INFO | "  6,002.0 MB <- Memory usage"
18:06:40 | INFO | "      81	<- target Dask partitions	   75.0 <- target MB partition size"
18:06:53 | INFO | "Logging artifact to mlflow..."
18:07:16 | I

In [23]:
mlflow.end_run("FAILED")

# Debugging

# Run test with `lower_case=False`

Sample only a few files in comments/posts to make sure that the job completes even when we're testing new code/logic.

Limit to only 2 files of each kind to get a minimal test to run end to end.

In [32]:
# run setup_logging() to remove logging to the file of a failed job
setup_logging()

In [33]:
logging.debug("debug test")
logging.info("info test")
logging.warning("warning message")
logging.error("error message")

08:02:56 | INFO | "info test"
08:02:56 | ERROR | "error message"


In [34]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=2',     # 
               'n_sample_comments_files=2',  # 6 is limit for logging unique counts at comment level
               'calculate_similarites=false',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

In [35]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_test_sample_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        [print(f"  {k2_}: \t{v2_}") for k2_, v2_ in v_.items()]
    else:
        print(f"{k_}: \t{v_}")

mlflow_experiment: 	v0.4.0_use_multi_aggregates_test
n_sample_posts_files: 	2
n_sample_comments_files: 	2

aggregate_params:
  min_comment_text_len: 	2
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
calculate_similarites: 	False


In [None]:
BREAK

In [36]:
%%time

mlflow.end_run("FAILED")
gc.collect()
try:
    # run setup_logging() to remove logging to the file of a failed job
    setup_logging()
    del job_agg1
    del d_dfs1
except NameError:
    pass
gc.collect()

job_agg_test = AggregateEmbeddings(
    run_name=f"sample_test_lc_false-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_test_sample_lc_false.config_flat
)
job_agg_test.run_aggregation()

gc.collect()

08:03:20 | INFO | "== Start run_aggregation() method =="
08:03:20 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
08:03:20 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
08:03:20 | INFO | "cpu_count: 96"
08:03:20 | INFO | "RAM stats:
{'memory_used_percent': '0.15%', 'memory_total': '1,444,963', 'memory_used': '2,221', 'memory_free': '1,441,429'}"
08:03:20 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-10-06_080320-sample_test_lc_false-2021-10-06_080319"
08:03:20 | INFO | "  Saving config to local path..."
08:03:20 | INFO | "  Logging config to mlflow..."
08:03:21 | INFO | "-- Start _load_raw_embeddings() method --"
08:03:21 | INFO | "Loading subreddit description embeddings..."
08:03:22 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/1

KeyboardInterrupt: 

In [40]:
job_agg_test._send_log_file_to_mlflow()
gc.collect()

08:15:13 | INFO | "Logging log-file to mlflow..."


In [42]:
# job_agg_test._save_and_log_aggregate_and_similarity_dfs()

In [43]:
mlflow.end_run("FAILED")
gc.collect()

2828

# Check output dfs

In [51]:
type(vars(job_agg_test))

dict

In [31]:
# d_dfs2 = {k: v for k, v in vars(job_agg_test).items() if 'df_' in k}


# for k2, df_2 in tqdm(d_dfs2.items()):
#     print(f"\n{k2}")
#     try:
#         print(f"  {df_2.shape} <- df shape")
#         print(f"  {df_2.npartitions} <- dask partitions")
#         # print(f"{get_dask_df_shape(df_2)} <- df.shape")
#         # print(f"  {df_2.memory_usage(deep=True).sum() / 1048576:4,.1f} MB <- Memory usage")
#         if any(['meta' in k2, '_v_' in k2]):
#             pass
#         else:
#             pass
# #             display(df_2.iloc[:5, :15])

#     except (TypeError, AttributeError):
#         if isinstance(df_2, pd.DataFrame):
#             print(f"  {df_2.shape} <- df shape")

## VM size notes

`614 GB` of RAM is not enough for 40 million posts...

VM & cluster setup:
```bash
96 CPUs
640 GB RAM

8 workers
- 12 threads per worker
- 76 GB per worker
```

Traceback:

```bash
23:17:48 | INFO | "      775,092 <- Comments that DON'T need to be averaged"
23:17:48 | INFO | "   39,126,876 <- Comments that need to be averaged"
23:17:48 | INFO | "No column to weight comments, simple mean for comments at post level"
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<timed exec> in <module>

/home/david.bermejo/repos/subreddit_clustering_i18n/subclu/models/aggregate_embeddings.py in run_aggregation(self)
    249         # - up-votes
    250         # ---
--> 251         self._agg_comments_to_post_level()
...
KilledWorker: ("('dataframe-groupby-sum-agg-849e94fd54a49f8ed34330862f20cb9d', 0)", <WorkerState 'tcp://127.0.0.1:41351', name: 6, memory: 0, processing: 1>)

```
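
The log messages above suggest that comments are split into two groups before the groupby: those on posts with a single comment (nothing to average) and those on posts with several comments. The sketch below illustrates that idea in plain pandas on toy data. Column names are hypothetical, and the real `_agg_comments_to_post_level()` may differ.

```python
import pandas as pd

df_comments = pd.DataFrame({
    "post_id": ["a", "a", "b", "c", "c", "c"],
    "emb_0": [1.0, 3.0, 5.0, 0.0, 3.0, 6.0],
    "emb_1": [0.0, 2.0, 4.0, 1.0, 1.0, 1.0],
})

# Count comments per post, then flag comments on posts with more than one comment
comment_counts = df_comments.groupby("post_id")["emb_0"].transform("count")
mask_needs_mean = comment_counts > 1

# Single-comment posts: the comment embedding already is the post-level value
df_single = df_comments[~mask_needs_mean].set_index("post_id")

# Multi-comment posts: simple (unweighted) mean per post
df_averaged = df_comments[mask_needs_mean].groupby("post_id").mean()

df_post_level = pd.concat([df_single, df_averaged]).sort_index()
print(df_post_level)
```

Skipping the groupby for single-comment posts trims the data going into the expensive step, although in the run above that was only about 2% of comments.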

### Time profiling

Inputs:
``` python
mlflow_experiment: 	v0.4.0_use_multi_aggregates_test
n_sample_posts_files: 	5
n_sample_comments_files: 	10

aggregate_params:
  min_comment_text_len: 	10
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
```

VM & cluster setup:
```
96 CPUs
640 GB RAM

8 workers
- 12 threads per worker
- 76 GB per worker
```

### Filtered/selected logs

Overview:

| Time/ETA (h:mm:ss) | Step | Notes |
| --- | --- | --- |
| `0:11:43` | Load raw embeddings (w/o caching) | |
| `0:03:15` | Load metadata (w/o caching) | |
| `0:04:30` | Aggregation steps (all) | Note that this might only be the time to create the DAG, not necessarily the time to actually compute the data |
| `0:37:24` | Calculate similarities | |
| `1:30:00` | Saving & logging files | Saving alone could take more than 1 hour... and I'd forgotten about this |


Note that there are very different ETAs for saving each DF: the first 2 are really large and take a long time. The last few are smaller, so the time estimates from `tqdm` can vary a ton:
```bash
3/11 [40:25<1:14:34, 559.33s/it]   27%
9/11 [50:22<03:22, 101.02s/it]     82% 
11/11 [1:30:11<00:00, 750.74s/it] 100%
```


Getting shape of `dask df` is taking almost half of the saving time!

**TODO: REMOVE** logging df shape for now to save a ton of time!
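
One cheap alternative is to log only what dask already knows from metadata. The number of columns and the number of partitions don't require computing the graph, while the row count does. A minimal sketch follows; `cheap_df_summary` is a hypothetical helper, and a stand-in object is used so the example runs without building a real dask DataFrame.

```python
from types import SimpleNamespace


def cheap_df_summary(ddf):
    """Summarize a dask DataFrame using metadata only (no row count)."""
    return {
        "n_columns": len(ddf.columns),
        "npartitions": ddf.npartitions,
    }


# Stand-in with the same two attributes a dask DataFrame exposes;
# the column names here are made up for illustration
fake_ddf = SimpleNamespace(
    columns=["subreddit_name", "emb_0", "emb_1"],
    npartitions=81,
)
summary = cheap_df_summary(fake_ddf)
print(summary)
```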

```bash
20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."                                   # get_df_shape() starts here...
21:13:56 | INFO | "  Saving existing dask df as parquet..."             # get_df_shape() ends here, ABOUT 20 MINUTES!
21:33:11 | INFO | "Logging artifact to mlflow..."                       # SAVING the dask df takes about another 20 MINUTES
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"    # And logging the dfs up to GCS only takes about 3 seconds?!
21:33:14 | INFO | "Saving locally..."
21:33:14 | INFO | "Keeping index intact..."
21:33:14 | INFO | "Converting pandas to dask..."
21:33:15 | INFO | "   185.4 MB <- Memory usage"
21:33:15 | INFO | "       5	<- target Dask partitions	   40.0 <- target MB partition size"
21:33:19 | INFO | "Logging artifact to mlflow..."
21:33:22 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity_pair **"
21:33:22 | INFO | "Saving locally..."
21:33:24 | INFO | "Converting pandas to dask..."
21:33:35 | INFO | "  10,391.8 MB <- Memory usage"
21:33:35 | INFO | "     139	<- target Dask partitions	   75.0 <- target MB partition size"
21:33:55 | INFO | "Logging artifact to mlflow..."
21:34:30 | INFO | "** df_sub_level_agg_b_post_and_comments **"

```


More details in log file:

`logs/AggregateEmbeddings/2021-10-05_195710-sample_test_lc_false-2021-10-05_195710.log`


```bash
# load raw embeddings (w/o caching): 11:43 minutes
# ---
20:01:19 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/14/2fcfefc3d5af43328168d3478b4fdeb6/artifacts/df_vect_comments"
40/40 [07:29<00:00, 8.17s/it]
20:08:49 | INFO | "  Parquet files found: 5"
20:08:49 | INFO | "  Keep only comments for posts with embeddings"
20:08:54 | INFO | "  0:11:43.326935 <- Total raw embeddings load time elapsed"


# Load metadata (w/o caching): 3:15 minutes
# ---
20:08:54 | INFO | "-- Start _load_metadata() method --"
20:08:54 | INFO | "Loading POSTS metadata..."

20:10:39 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/comments/top/2021-10-04"
100%
59/59 [01:30<00:00, 1.43s/it]
20:12:10 | INFO | "  (Delayed('int-e6188e6d-6319-487d-b054-bea8a30d912b'), 7) <- Raw META COMMENTS shape"
20:12:10 | INFO | "  0:03:15.218880 <- Total metadata loading time elapsed"


# Aggregation steps (all): 4:30 minutes
#   Note that this might only be the time to create the dag, not necessarily the time to actually compute the data
20:12:10 | INFO | "-- Start _agg_comments_to_post_level() method --"
20:12:10 | INFO | "Getting count of comments per post..."
20:12:39 | INFO | "Filtering which comments need to be averaged..."
20:13:23 | INFO | "       22,197 <- Comments that DON'T need to be averaged"
20:13:23 | INFO | "    1,087,458 <- Comments that need to be averaged"
20:13:28 | INFO | "No column to weight comments, simple mean for comments at post level"
20:13:57 | INFO | "      191,558 |  514 <- df_v_com_agg SHAPE"
20:13:57 | INFO | "  0:01:46.878385 <- Total comments to post agg loading time elapsed"
20:13:57 | INFO | "-- Start (df_posts_agg_b) _agg_posts_and_comments_to_post_level() method --"
20:13:59 | INFO | "DEFINE agg_posts_w_comments..."
...
20:16:38 | INFO | "A - posts only"
20:16:39 | INFO | "  (Delayed('int-3ee084d4-434c-4a38-aeb1-185b50648908'), 513) <- df_subs_agg_a.shape (only posts)"
20:16:39 | INFO | "B - posts + comments"
20:16:39 | INFO | "  (Delayed('int-6bd21f72-fc1d-495c-987a-1da6e4a18683'), 513) <- df_subs_agg_b.shape (posts + comments)"
20:16:39 | INFO | "C - posts + comments + sub descriptions"
20:16:40 | INFO | "  (Delayed('int-59c54047-7ac8-49cb-81b3-478b4ae5b60e'), 513) <- df_subs_agg_c.shape (posts + comments + sub description)"
20:16:40 | INFO | "  0:00:01.507065 <- Total for ALL subreddit-level agg time elapsed"


# Calculate similarities: 37:24 minutes
20:16:40 | INFO | "-- Start _calculate_subreddit_similarities() method --"
20:16:40 | INFO | "A..."
20:16:56 | INFO | "  (4924, 4924) <- df_subs_agg_a_similarity.shape"
20:17:21 | INFO | "Merge distance + metadata..."
20:17:59 | INFO | "Create new df to keep only top 20 subs by distance..."
20:18:10 | INFO | "  (24240852, 11) <- df_dist_pair_meta.shape (before setting index)"
20:18:10 | INFO | "  (98480, 11) <- df_dist_pair_meta_top_only.shape (before setting index)"
...
20:54:04 | INFO | "  0:37:24.347689 <- Total for _calculate_subreddit_similarities() time elapsed"


# *** Saving & logging file: WTF? Saving alone could take more than 2 hours!! WTF?!!  ***
20:54:04 | INFO | "-- Start _save_and_log_aggregate_and_similarity_dfs() method --"
20:54:04 | INFO | "  Saving config to local path..."
20:54:04 | INFO | "  Logging config to mlflow..."
*** 3/11 [40:25<1:14:34, 559.33s/it]  27%   ***
20:54:05 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc **"
20:54:05 | INFO | "Saving locally..."
...
21:13:56 | INFO | "  Saving existing dask df as parquet..."
21:33:11 | INFO | "Logging artifact to mlflow..."
21:33:14 | INFO | "** df_sub_level_agg_c_post_comments_and_sub_desc_similarity **"
21:33:14 | INFO | "Saving locally..."

```
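
The shapes in the similarity logs line up with a pairwise matrix followed by a top-k filter: 4,924 subreddits give 4,924 × 4,923 = 24,240,852 pairs once self-pairs are excluded, and keeping the top 20 per subreddit gives 4,924 × 20 = 98,480 rows. Below is a minimal numpy sketch of that pattern. It assumes cosine similarity, which these notes don't confirm, and `top_k_similar` is a hypothetical name.

```python
import numpy as np


def top_k_similar(embeddings, k):
    """Return (indices, scores) of the k most similar other rows per row."""
    # Normalize rows so that the dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T

    # Exclude self-pairs so a row is never its own nearest neighbor
    np.fill_diagonal(sim, -np.inf)

    # Sort each row by descending similarity and keep the first k columns
    top_idx = np.argsort(-sim, axis=1)[:, :k]
    top_scores = np.take_along_axis(sim, top_idx, axis=1)
    return top_idx, top_scores


# Toy example: 4 "subreddits" with 2-dimensional embeddings
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
idx, scores = top_k_similar(emb, k=2)
print(idx.shape)  # one row per subreddit, k columns
```

Keeping only the top-k pairs is what shrinks the 6 to 10 GB pair table in the logs down to something small enough to save quickly.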