# Purpose


2021-10-06:
We're going back to pandas now that I have a VM with a ton of RAM.

There might be some tweaks needed to batch a few subreddits at a time, but at least we can get more consistent state/progress than with `dask`.

---
2021-10-06:
The job with dask failed silently, even with 3+ TB of RAM. `Dask` reported that saving was complete, but it only saved one `parquet` file instead of hundreds of files.

New direction: now that I have access to a large VM, I might as well try to go back and do the calculations in memory (in pandas).


---
2021-10-05:
I ran into memory errors with 600GB of RAM, so here's a try with 1.4TB. If this doesn't work, then I don't know what will...

---

2021-08-10: Finally completed testing with sampling <= 10 files. Now ready to run process on full data!

Ended up doing it all in dask + pandas + numpy because of problems installing `cuDF`.

---
2021-08-02: Now that I'm processing millions of comments and posts, I need to re-write the functions to try to do some work in parallel and reduce the amount of data loaded in RAM.

- `Dask` seems like a great option to load data and only compute some of it as needed.
- `cuDF` could be a way to speed up some computation using GPUs.
- `Dask-delayed` could be a way to create a task DAG lazily before computing all the aggregates.


---

In notebook 09 I combined embeddings from posts & subreddits (`djb_09.00-combine_post_and_comments_and_visualize_for_presentation.ipynb`).

In this notebook I'll be testing functions that include mlflow so that it's easier to try a lot of different weights to find better representations.

Take embeddings created by other models & combine them:
```
new post embeddings = post + comments + subreddit description

new subreddit embeddings = new posts (weighted by post length or upvotes?)
```
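
As a rough illustration of the combination above, here is a minimal NumPy sketch of a weighted average of embedding vectors. The helper name `combine_embeddings`, the toy 4-dimensional vectors, and the specific weights are assumptions for illustration only; the real aggregation lives in `subclu.models.aggregate_embeddings_pd` and may differ (for example, in how posts without comments are handled).

```python
import numpy as np


def combine_embeddings(post, comments, subreddit_desc, weights=(70, 20, 10)):
    """Weighted average of three embedding vectors (hypothetical helper)."""
    vectors = np.vstack([post, comments, subreddit_desc])
    # np.average normalizes by the sum of the weights
    return np.average(vectors, axis=0, weights=weights)


# Toy embeddings, one per text source
post_emb = np.array([1.0, 0.0, 0.0, 0.0])
comments_emb = np.array([0.0, 1.0, 0.0, 0.0])
sub_desc_emb = np.array([0.0, 0.0, 1.0, 0.0])

new_post_emb = combine_embeddings(post_emb, comments_emb, sub_desc_emb)

# Subreddit-level embedding: weighted mean of its post embeddings,
# where the weights could come from post length or upvotes (assumed here)
post_matrix = np.vstack([new_post_emb, post_emb])
subreddit_emb = np.average(post_matrix, axis=0, weights=[3, 1])

print(new_post_emb)
print(subreddit_emb)
```

Because the weights are normalized, only their ratios matter: `(70, 20, 10)` and `(7, 2, 1)` give the same result.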

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import logging
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.models.aggregate_embeddings import (
    AggregateEmbeddings, AggregateEmbeddingsConfig,
    load_config_agg_jupyter, get_dask_df_shape,
)
from subclu.models import aggregate_embeddings_pd

from subclu.utils import set_working_directory
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks
from subclu.eda.aggregates import (
    compare_raw_v_weighted_language
)
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl
)


print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Get runs that we can use for embeddings aggregation jobs

In [6]:
%%time

df_mlf_runs = mlf.search_all_runs(experiment_ids=[13, 14, 15, 16])
df_mlf_runs.shape

CPU times: user 276 ms, sys: 6.64 ms, total: 283 ms
Wall time: 282 ms


(78, 132)

In [7]:
mask_finished = df_mlf_runs['status'] == 'FINISHED'
# Keep runs that vectorized at least 100k (1e5) posts or comments
mask_output_over_100k_rows = (
    (df_mlf_runs['metrics.df_vect_posts_rows'] >= 1e5) |
    (df_mlf_runs['metrics.df_vect_comments'] >= 1e5)
)
# df_mlf_runs[mask_finished].shape

df_mlf_use_for_agg = df_mlf_runs[mask_output_over_100k_rows]
df_mlf_use_for_agg.shape

(3, 132)

In [8]:
cols_with_multiple_vals = df_mlf_use_for_agg.columns[df_mlf_use_for_agg.nunique(dropna=False) > 1]
# len(cols_with_multiple_vals)

style_df_numeric(
    df_mlf_use_for_agg
    [cols_with_multiple_vals]
    .drop(['artifact_uri', 'end_time',
           # 'start_time',
           ], 
          axis=1)
    .dropna(axis='columns', how='all')
    .iloc[:, :30]
    ,
    rename_cols_for_display=True,
)

Unnamed: 0,run id,experiment id,start time,metrics.df vect comments,metrics.vectorizing time minutes full function,metrics.total comment files processed,metrics.vectorizing time minutes comments,params.tf batch inference rows,params.n sample comment files,params.n comment files slice start,params.n comment files slice end,tags.mlflow.runName,tags.model version
52,deb3454ece2a4a8d8e4149c2d8494c0d,14,2021-10-05 01:44:32.386000+00:00,10121046.0,45.94,15.0,39.11,3200,15,,,comments_batch_01-2021-10-05_014431,
53,5f10cd75334142168a6ebb787e477c1f,14,2021-10-05 00:22:20.334000+00:00,13558304.0,57.33,20.0,47.64,4200,20,,,comments_batch_01-2021-10-05_002219,0.4.0
57,9a27f9a72cf348c98d50f486abf3b009,13,2021-10-04 22:21:46.401000+00:00,1286661.0,5.03,2.0,3.93,6000,2,,,posts_as_comments_full_text-2021-10-04_222146,


# Load configs for aggregation jobs

`n_sample_comments_files` and `n_sample_posts_files` allow us to only load a few files at a time (e.g., 2 instead of 50) to test the process end-to-end.
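
A minimal sketch of how such a parameter could behave, assuming that `None` means "use every file" and that sampling takes the first N files in sorted order. The function `sample_files` and the file names are hypothetical; the real sampling logic in `subclu` may work differently (for example, it could sample randomly).

```python
from typing import List, Optional


def sample_files(files: List[str], n_sample_files: Optional[int] = None) -> List[str]:
    """Return the first `n_sample_files` files, or all files when it is None."""
    ordered = sorted(files)  # sort so the sample is reproducible
    if n_sample_files is None:
        return ordered
    return ordered[:n_sample_files]


all_files = [f"comments-{i:03d}.parquet" for i in range(50)]

test_files = sample_files(all_files, n_sample_files=2)     # quick end-to-end test
full_files = sample_files(all_files, n_sample_files=None)  # full run

print(len(test_files), len(full_files))
```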

---
Note that by default `hydra` is a CLI tool. If we want to use it in Jupyter, we need to manually initialize configs & compose the configuration. See my custom function `load_config_agg_jupyter`. Also see:
- [Notebook with `Hydra` examples in a notebook](https://github.com/facebookresearch/hydra/blob/master/examples/jupyter_notebooks/compose_configs_in_notebook.ipynb).
- [Hydra docs, Hydra in Jupyter](https://hydra.cc/docs/next/advanced/jupyter_notebooks/).


In [9]:
mlflow_experiment_test = 'v0.4.0_use_multi_aggregates_test'
mlflow_experiment_full = 'v0.4.0_use_multi_aggregates'

root_agg_config_name = 'aggregate_embeddings_v0.4.0'

config_test_sample_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_test}",
               'n_sample_posts_files=4',
               'n_sample_comments_files=4',  # 6 is limit for logging unique counts at comment level
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

config_full_lc_false = AggregateEmbeddingsConfig(
    config_path="../config",
    config_name=root_agg_config_name,
    overrides=[f"mlflow_experiment={mlflow_experiment_full}",
               'n_sample_posts_files=null', 
               'n_sample_comments_files=null',
               # 'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_false',
              ]
)

# config_full_lc_true = AggregateEmbeddingsConfig(
#     config_path="../config",
#     config_name='aggregate_embeddings',
#     overrides=[f"mlflow_experiment={mlflow_experiment_full}",
#                'n_sample_posts_files=null', 
#                'n_sample_comments_files=null',
#                'data_embeddings_to_aggregate=top_subs-2021_07_16-use_multi_lower_case_true',
#               ]
# )
# pprint(config_test_sample_lc_false.config_dict, indent=2)

In [10]:
# config_test_sample_lc_false.config_flat,

In [11]:
df_configs = pd.DataFrame(
    [
        config_test_sample_lc_false.config_flat,
        # config_test_full_lc_false.config_flat,
        config_full_lc_false.config_flat,
        # config_full_lc_true.config_flat,
    ]
)

In [12]:
# We can't use (df_configs.nunique(dropna=False) > 1)
#  because when a col's content is a list or something unhashable, we get an error
#  so instead we'll check each column individually

# cols_with_diffs_config = df_configs.columns[df_configs.nunique(dropna=False) > 1]
cols_with_diffs_config = list()
for c_ in df_configs.columns:
    try:
        if df_configs[c_].nunique() > 1:
            cols_with_diffs_config.append(c_)
    except TypeError:
        cols_with_diffs_config.append(c_)
        

df_configs[cols_with_diffs_config]

Unnamed: 0,comments_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids,posts_vectorized_mlflow_uuids_lowercase,subreddit_meta_vectorized_mlflow_uuids,subreddit_meta_vectorized_mlflow_uuids_lowercase,comments_uuid,mlflow_experiment
0,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates_test
1,"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],[8eef951842a34a6e81d176b15ae74afd],[537514ab3c724b10903000501802de0e],"[5f10cd75334142168a6ebb787e477c1f, 2fcfefc3d5af43328168d3478b4fdeb6]",v0.4.0_use_multi_aggregates
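
An alternative to the per-column `try`/`except` is to cast every value to a string before counting unique values, which makes list-valued cells hashable. This is only a sketch with a made-up helper name (`cols_with_differences`) and toy data. Note that it behaves differently from the loop above: the loop keeps every unhashable column, while this version keeps a list-valued column only if its contents actually differ between rows.

```python
import pandas as pd


def cols_with_differences(df: pd.DataFrame) -> list:
    """Return columns whose values differ across rows, including list-valued columns."""
    # Casting to str makes lists/dicts hashable for nunique()
    n_unique = df.astype(str).nunique(dropna=False)
    return n_unique[n_unique > 1].index.tolist()


df_demo = pd.DataFrame(
    {
        "mlflow_experiment": ["agg_test", "agg_full"],
        "uuids": [["a", "b"], ["a", "b"]],  # identical lists in both rows
        "n_sample_files": [4, None],
    }
)

print(cols_with_differences(df_demo))
```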


In [13]:
# pprint(config_test_sample_lc_false.config_flat, indent=2)

# Run full data with `lower_case=False`

The logic for sampling files and downloading/caching them locally lives in the custom `MlflowLogger` class (`mlf`).

Caching can save 9+ minutes compared with downloading the files from GCS every time.
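
The caching idea can be sketched as "download only on a cache miss". The helper `get_cached_file`, the `fake_download` stand-in, and the file name below are all hypothetical; the real download goes through MLflow/GCS inside `MlflowLogger` and is not reproduced here.

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_cached_file(remote_name: str, cache_dir: Path,
                    download_fn: Callable[[str, Path], None]) -> Path:
    """Return the local path for `remote_name`, downloading only on a cache miss."""
    local_path = cache_dir / remote_name
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        download_fn(remote_name, local_path)
    return local_path


download_calls = []


def fake_download(remote_name: str, local_path: Path) -> None:
    """Stand-in for a slow remote download; records each call."""
    download_calls.append(remote_name)
    local_path.write_text("placeholder")


cache_dir = Path(tempfile.mkdtemp())
first = get_cached_file("embeddings/part-000.parquet", cache_dir, fake_download)
second = get_cached_file("embeddings/part-000.parquet", cache_dir, fake_download)

# The second call is served from the local cache, so only one download happened
print(len(download_calls))
```

One caveat of a check based only on `exists()`: an interrupted download leaves a partial file that still counts as a cache hit.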

In [14]:
keys_to_check_in_config = ['mlflow_experiment', 'n_sample_posts_files', 'n_sample_comments_files', 'aggregate_params', 'calculate_similarites']

for k_ in keys_to_check_in_config:
    v_ = config_full_lc_false.config_dict.get(k_)
    if isinstance(v_, dict):
        print(f"\n{k_}:")
        # Print nested params one per line, indented under their parent key
        for k2_, v2_ in v_.items():
            print(f"  {k2_}: \t{v2_}")
    else:
        print(f"{k_}: \t{v_}")

mlflow_experiment: 	v0.4.0_use_multi_aggregates
n_sample_posts_files: 	None
n_sample_comments_files: 	None

aggregate_params:
  min_comment_text_len: 	2
  agg_comments_to_post_weight_col: 	None
  agg_post_to_subreddit_weight_col: 	None
  agg_post_post_weight: 	70
  agg_post_comment_weight: 	20
  agg_post_subreddit_desc_weight: 	10
calculate_similarites: 	True


In [21]:
BREAK

In [None]:
%%time

try:
    job_agg1._send_log_file_to_mlflow()
    mlflow.end_run("FAILED")
    # run setup_logging() to remove logging to the file of a failed job
    setup_logging()
    
    del job_agg1
    del d_dfs1
except NameError:
    pass
gc.collect()

mlflow.end_run("FAILED")


job_agg1 = aggregate_embeddings_pd.AggregateEmbeddings(
    run_name=f"agg_full_lc_false_pd-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    **config_full_lc_false.config_flat
)
job_agg1.run_aggregation()

gc.collect()

10:46:05 | INFO | "== Start run_aggregation() method =="
10:46:05 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
10:46:05 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
10:46:05 | INFO | "cpu_count: 160"
10:46:05 | INFO | "RAM stats:
{'memory_used_percent': '7.19%', 'memory_total': '3,874,634', 'memory_used': '278,514', 'memory_free': '3,465,918'}"
10:46:05 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/2021-10-12_104605-agg_full_lc_false_pd-2021-10-12_104604"
10:46:05 | INFO | "  Saving config to local path..."
10:46:05 | INFO | "  Logging config to mlflow..."
10:46:06 | INFO | "-- Start _load_raw_embeddings() method --"
10:46:06 | INFO | "Loading subreddit description embeddings..."
10:46:07 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlrun

In [22]:
job_agg1._send_log_file_to_mlflow()
gc.collect()

16:47:20 | INFO | "Could NOT log to MLFLow, there's no active run."


22

In [23]:
gc.collect()

0

# Run full data, `lower_case=True`

The corrupted-file problem I ran into might've been caused by a bad download of the file(s). Fix: delete the local cache and download the files again.
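
A minimal sketch of that fix, assuming the cache is a plain local folder. The helper `clear_local_cache` and the demo paths are hypothetical; point it at the real cache folder only after double-checking the path, because `shutil.rmtree` deletes everything under it.

```python
import shutil
import tempfile
from pathlib import Path


def clear_local_cache(cache_dir: Path) -> None:
    """Delete the cache folder (if present) and recreate it empty."""
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)


# Demo on a throwaway folder holding a fake partial download
demo_cache = Path(tempfile.mkdtemp()) / "local_cache"
demo_cache.mkdir()
(demo_cache / "corrupted.parquet").write_text("partial download")

clear_local_cache(demo_cache)
print(list(demo_cache.iterdir()))
```

After clearing, the next run downloads the files again because nothing is found locally.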

In [None]:
BREAK

In [None]:
# %%time

# mlflow.end_run("FAILED")
# gc.collect()
# try:
#     # run setup_logging() to remove logging to the file of a failed job
#     setup_logging()
    
#     del job_agg2
#     del d_dfs2
# except NameError:
#     pass
# gc.collect()

# job_agg2 = AggregateEmbeddings(
#     run_name=f"full_lc_true-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
#     **config_full_lc_true.config_flat
# )
# job_agg2.run_aggregation()

# gc.collect()

In [23]:
mlflow.end_run("FAILED")

# Debugging

In [None]:
BREAK

### Check computed dfs

In [None]:
150 * 4

In [None]:
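# Inspect every DataFrame attribute stored on the job object
for k_, v_ in vars(job_agg1).items():
    if 'df_' not in k_:
        continue
    print(f"\n{k_}")
    try:
        print(f"  {v_.shape}")
        display(v_.iloc[:8, :10])
        if 'meta' not in k_:
            # .info() prints directly and returns None, so don't wrap it in print()
            v_.info()
    except Exception as e:
        # e.g., the attribute is None because that df was never computed
        print(f"  Skipped: {e}")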

In [42]:
# job_agg_test._save_and_log_aggregate_and_similarity_dfs()

In [38]:
mlflow.end_run("FAILED")
gc.collect()

2794