# Purpose

**2021-10-04**: We have a new batch of COMMENTS for v0.4.0. Need to process them.

Diff from before: instead of processing ALL the comment files in one VM, I'm going to process only a few files at a time (e.g., 14) so that I can run some batches in parallel, either on multiple machines or on multiple GPUs within one machine.
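A minimal sketch of how the file list could be split into fixed-size slices for parallel runs. `make_file_slices` is a hypothetical helper (it is not part of `subclu`), and the 37-file total is only an illustrative value. The pairs are half-open `(start, end)` ranges, which is the shape the slice start/end parameters used later in this notebook appear to expect.

```python
def make_file_slices(n_files, files_per_batch):
    """Return half-open (start, end) index pairs that cover range(n_files) in order."""
    return [
        (start, min(start + files_per_batch, n_files))
        for start in range(0, n_files, files_per_batch)
    ]


# Illustrative values: 37 files, 14 files per batch
slices = make_file_slices(n_files=37, files_per_batch=14)
print(slices)  # [(0, 14), (14, 28), (28, 37)]
```

Each pair could then be handed to a separate VM or GPU process, so no file is processed twice.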

---
Provenance / previous<br>
~2021-09-28: Run inference on posts for v0.4.0 POSTS~

Diff from before: instead of only using `text` (post title + post body), we'll be using multiple columns to concat and get the embeddings.

---

This notebook runs the `vectorize_text_to_embeddings` function to:
- load the USE-multilingual model
- load post & comment text
- convert the text into embeddings (at post or comment level)


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
# from functools import partial
# import os
import logging
# from pathlib import Path
# from pprint import pprint

import mlflow

# from tqdm.auto import tqdm
from tqdm import tqdm
import numpy as np
import pandas as pd

from google.cloud import storage

# TF libraries... I've been getting errors when these aren't loaded
import tensorflow_text
import tensorflow as tf

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.models.vectorize_text import (
    vectorize_text_to_embeddings,
)
from subclu.models import vectorize_text_tf

from subclu.utils import set_working_directory
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)


print_lib_versions([mlflow, np, pd, tensorflow_text, tf, subclu])

python		v 3.7.10
===
mlflow		v: 1.16.0
numpy		v: 1.18.5
pandas		v: 1.2.5
tensorflow_text	v: 2.3.0
tensorflow	v: 2.3.3
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Initialize mlflow logging with sqlite database

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

| | experiment_id | name | artifact_location | lifecycle_stage |
|---|---|---|---|---|
| 0 | 0 | Default | ./mlruns/0 | active |
| 1 | 1 | fse_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/1 | active |
| 2 | 2 | fse_vectorize_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/2 | active |
| 3 | 3 | subreddit_description_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/3 | active |
| 4 | 4 | fse_vectorize_v1.1 | gs://i18n-subreddit-clustering/mlflow/mlruns/4 | active |
| 5 | 5 | use_multilingual_v0.1_test | gs://i18n-subreddit-clustering/mlflow/mlruns/5 | active |
| 6 | 6 | use_multilingual_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/6 | active |
| 7 | 7 | use_multilingual_v1_aggregates_test | gs://i18n-subreddit-clustering/mlflow/mlruns/7 | active |
| 8 | 8 | use_multilingual_v1_aggregates | gs://i18n-subreddit-clustering/mlflow/mlruns/8 | active |
| 9 | 9 | v0.3.2_use_multi_inference_test | gs://i18n-subreddit-clustering/mlflow/mlruns/9 | active |


# Check whether we have access to a GPU

In [6]:
l_phys_gpus = tf.config.list_physical_devices('GPU')
# from tensorflow.python.client import device_lib

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:\n{l_phys_gpus}"
#     f"\n\nAll devices:\n===\n"
#     f"{device_lib.list_local_devices()}"
)


Built with CUDA? True
GPUs
===
Num GPUs Available: 1
GPU details:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Load config with data to process

In [7]:
config_data_v040 = LoadHydraConfig(
    config_path="../config/data_text_and_metadata",
    config_name='v0.4.0_19k_top_subs_and_geo_relevant_2021_09_27',
#     config_name='top_subreddits_2021_07_16',
)
config_data_v040.config_dict

{'dataset_name': 'v0.4.0 inputs - Top Subreddits (no Geo) + Geo-relevant subs, comments: TBD',
 'bucket_name': 'i18n-subreddit-clustering',
 'folder_subreddits_text_and_meta': 'subreddits/top/2021-09-24',
 'folder_posts_text_and_meta': 'posts/top/2021-09-27',
 'folder_comments_text_and_meta': 'comments/top/2021-10-04'}

In [8]:
mlflow_experiment_test = 'v0.4.0_use_multi_inference_test'
mlflow_experiment_full = 'v0.4.0_use_multi_inference'

# Add or over-ride configs here
bucket_name = config_data_v040.config_dict['bucket_name']
subreddits_path = None  # config_data_v040.config_dict['folder_subreddits_text_and_meta']
posts_path = None  # config_data_v040.config_dict['folder_posts_text_and_meta']
comments_path = config_data_v040.config_dict['folder_comments_text_and_meta']

# Test with batching function


### New function (with batching)
Most inputs will be the same.
However, some things will change:
- Added new parameter to sample only first N files (we'll process each file individually)

For subreddits only, we can expand to more than 1,500 characters.

However, when scoring posts and/or comments, we're better off trimming to the first ~1,000 characters to speed things up. We can increase the character limit if results aren't great; this could be a hyperparameter to tune.

### Notes on batch & characters:

Comments tend to be shorter, so we can usually run larger batches. A batch of `6,000` can still result in `OOM` errors, so go lower than that.
```python
    # TF batches
    tf_batch_inference_rows=5000,
    tf_limit_first_n_chars=900,
```

Posts tend to be longer, so we're better off running smaller batches:
```python
    # TF batches
    tf_batch_inference_rows=2400,
    tf_limit_first_n_chars=900,
```
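To make the two knobs above concrete, here is a rough sketch of trimming each text to its first N characters and then grouping rows into fixed-size batches. `iter_text_batches` and its argument names are hypothetical and only mirror `tf_batch_inference_rows` and `tf_limit_first_n_chars`; this is not the `subclu` implementation.

```python
import math


def iter_text_batches(texts, batch_rows, limit_first_n_chars=None):
    """Yield lists of at most `batch_rows` texts, each trimmed to the first N characters."""
    for start in range(0, len(texts), batch_rows):
        batch = texts[start:start + batch_rows]
        if limit_first_n_chars is not None:
            # Trimming bounds the per-row cost, which is what lets batches be larger
            batch = [t[:limit_first_n_chars] for t in batch]
        yield batch


texts = ["a" * 2000, "short comment", "b" * 950]
batches = list(iter_text_batches(texts, batch_rows=2, limit_first_n_chars=900))
print([[len(t) for t in b] for b in batches])  # [[900, 13], [900]]

# The number of batches is the row count divided by the batch size, rounded up
print(math.ceil(len(texts) / 2))  # 2
```

The same rounding explains the progress-bar totals in the appendix logs: 19,168,154 rows in batches of 2,100 gives 9,128 batches.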


In [9]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"posts_as_comments_full_text-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,

    # TF batches
    tf_batch_inference_rows=6000,
    tf_limit_first_n_chars=900,
    
    # Sampling/batching files or rows
    n_sample_comment_files=2,
    # n_sample_comments=49100,
    # n_sample_posts=9500,
    get_embeddings_verbose=True,
    
)

22:21:46 | INFO | "Start vectorize function"
22:21:46 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-10-04_222146"
22:21:46 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
22:21:46 | INFO | "  Saving config to local path..."
22:21:46 | INFO | "  Logging config to mlflow..."
22:21:47 | INFO | "Loading model use_multilingual..."
22:21:56 | INFO | "  0:00:08.692843 <- Load TF HUB model time elapsed"
22:21:56 | INFO | "** Procesing Comments files one at a time ***"
22:21:56 | INFO | "-- Loading & vectorizing COMMENTS in files: 2 --
Expected batch size: 6000"
local variable 'df_posts' referenced before assignment"
  0%|          | 0/2 [00:00<?, ?it/s]22:21:56 | INFO | "Processing: comments/top/2021-10-04/000000000000.parquet"
22:22:00 | INFO | "cols_index: ['subreddit_name', 'subreddit_id', 'post_id', 'comment_id']"
22:22:00 | INFO

### Test on a slice of files

The previous `n_sample_comment_files` would always sample the first N files, but we didn't check whether the file list was sorted.

```
# # TODO(djb): blobs can't be sorted, so maybe I should save files to local cache first...
```
With the new refactoring:
- I sort the list to ensure consistency on each run
- I added slice start & end parameters to pick arbitrary files in the list
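A small sketch of the sort-then-slice idea, assuming the file names are zero-padded so that lexicographic order matches numeric order. `select_files` is a hypothetical helper; the actual function appears to iterate over the full list and skip files outside the slice, which selects the same files.

```python
def select_files(blob_names, slice_start=None, slice_end=None):
    """Sort names for run-to-run consistency, then keep the half-open slice."""
    ordered = sorted(blob_names)
    return ordered[slice_start:slice_end]


# Names arrive in arbitrary order, as a bucket listing might return them
names = [f"comments/top/2021-10-04/{i:012d}.parquet" for i in (3, 0, 2, 1, 4)]
selected = select_files(names, slice_start=2, slice_end=4)
print(selected)
# ['comments/top/2021-10-04/000000000002.parquet',
#  'comments/top/2021-10-04/000000000003.parquet']
```

Leaving both slice arguments as `None` returns every file in sorted order.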

In [11]:
# mlflow.end_run(status='KILLED')

# vectorize_text_tf.vectorize_text_to_embeddings(
#     model_name='use_multilingual',
#     run_name=f"posts_as_comments_full_text-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
#     mlflow_experiment=mlflow_experiment_test,
    
#     tokenize_lowercase=False,
    
#     bucket_name=bucket_name,
#     subreddits_path=None,
#     posts_path=None,
#     comments_path=comments_path,

#     # TF batches
#     tf_batch_inference_rows=7000,
#     tf_limit_first_n_chars=900,
    
#     # Sampling FILES
#     # n_sample_comment_files=2,
#     n_comment_files_slice_start=2,
#     n_comment_files_slice_end=4,
    
#     # Sampling ROWS
#     # n_sample_comments=49100,
#     # n_sample_posts=9500,
# )

In [13]:
mlflow.end_run(status='KILLED')

## Re-do with new batching logic
Trying to process all 19 million comments at once failed, so we need to batch one file at a time.

### Re-run comments and log to non-test mlflow experiment




In [9]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"comments_batch_01-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,

    # TF batches
    tf_batch_inference_rows=3200,
    tf_limit_first_n_chars=900,
    
    # Sampling FILES
    n_sample_comment_files=15,
)

01:44:31 | INFO | "Start vectorize function"
01:44:31 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-10-05_014431"
01:44:32 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
01:44:32 | INFO | "  Saving config to local path..."
01:44:32 | INFO | "  Logging config to mlflow..."
01:44:33 | INFO | "Loading model use_multilingual..."
01:44:36 | INFO | "  0:00:03.205714 <- Load TF HUB model time elapsed"
01:44:36 | INFO | "** Procesing Comments files one at a time ***"
01:44:36 | INFO | "-- Loading & vectorizing COMMENTS in files: 15 --
Expected batch size: 3200"
local variable 'df_posts' referenced before assignment"
  0%|          | 0/15 [00:00<?, ?it/s]01:44:36 | INFO | "Processing: comments/top/2021-10-04/000000000000.parquet"
01:44:40 | INFO | "Getting embeddings in batches of size: 3200"
100%|####################################

In [10]:
gc.collect()

55156

### Batch two

In [None]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"comments_batch_01-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,

    # TF batches
    tf_batch_inference_rows=3200,
    tf_limit_first_n_chars=900,
    
    # Sampling FILES
    # n_sample_comment_files=15,
    n_comment_files_slice_start=20,
    n_comment_files_slice_end=62,
)

04:29:14 | INFO | "Start vectorize function"
04:29:14 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-10-05_042914"
04:29:14 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
04:29:15 | INFO | "  Saving config to local path..."
04:29:15 | INFO | "  Logging config to mlflow..."
04:29:15 | INFO | "Loading model use_multilingual..."
04:29:17 | INFO | "  0:00:02.476168 <- Load TF HUB model time elapsed"
04:29:17 | INFO | "** Procesing Comments files one at a time ***"
04:29:17 | INFO | "-- Loading & vectorizing COMMENTS in files: 59 --
Expected batch size: 3200"
local variable 'df_posts' referenced before assignment"
  0%|          | 0/59 [00:00<?, ?it/s]04:29:17 | INFO | "    -- Skipping file: comments/top/2021-10-04/000000000000.parquet --"
04:29:17 | INFO | "    -- Skipping file: comments/top/2021-10-04/000000000001.parquet --"
04

In [15]:
slice_start = 20
slice_end = 24

for i, f_ in enumerate(np.arange(50)):
    if not(slice_start <= i < slice_end):
        # print(f"skip: {i}")
        continue
    print(f"do stuff {i}")
        

do stuff 20
do stuff 21
do stuff 22
do stuff 23


Summary for posts (logs say comments because we used the hack to batch comments):

**Params**:
```python
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=None,  # posts_path
    comments_path=posts_path,
    
    tf_batch_inference_rows=2450,
    tf_limit_first_n_chars=900,
```


**Timing**
```bash
23:19:04 | INFO | "  Saving to local: df_vect_posts/000000000026 | 361,954 Rows by 515 Cols"
100%|##########| 27/27 [1:19:38<00:00, 177.00s/it]
23:19:14 | INFO | "Logging COMMENT files as mlflow artifact (to GCS)..."
23:22:43 | INFO | "  1:23:50.042083 <- Total vectorize fxn time elapsed"
```

# Run full with `tokenize_lowercase=True`

This one is expected to be a little slower because it'll call `.str.lower()` on each batch of text.

---

TODO: unsure whether it's worth running this job in parallel while I work on a separate VM. It might be a big pain to manually sync the metrics & params rows that get written at the same time in two different VMs.



In [None]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"comments_lower_case-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,

    # TF batches
    tf_batch_inference_rows=3200,
    tf_limit_first_n_chars=900,
    
    # Sampling FILES
    # n_sample_comment_files=15,
    # n_comment_files_slice_start=20,
    # n_comment_files_slice_end=62,
)

# Appendix

### Notes on previous function (all in memory):
- 60 GB of RAM wasn't enough for 19 million comments _lol_ (we also might've run into memory leaks in the GPU)

```
...
12:02:14 | INFO | "  (19168154, 6) <- updated df_comments shape"
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]

<__array_function__ internals> in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 36.6 GiB for an array with shape (512, 19168154) and data type float32

```
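The size in the `MemoryError` above follows directly from the array shape: one 512-dimension float32 vector per comment. A quick check of the arithmetic, using only the numbers from the log:

```python
n_rows = 19_168_154      # comments, from the df_comments shape in the log
n_dims = 512             # embedding dimensions, from the array shape in the error
bytes_per_float32 = 4

total_bytes = n_rows * n_dims * bytes_per_float32
gib = total_bytes / 2**30
print(f"{total_bytes:,} bytes = {gib:.1f} GiB")  # 39,256,379,392 bytes = 36.6 GiB
```

That is the size of the single concatenated array alone; the per-batch arrays being concatenated also have to be in memory at the same time, which helps explain why 60 GB was not enough.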

### New batching fxn
Besides file-batching, this job increased the row-batches from 2,000 to 6,100; it's unclear whether this has a negative impact. Maybe smaller batches are somehow more efficient?
Now that I'm reading one file at a time, it looks like speed is taking a big hit.

Baseline: running it all in memory took `1:32:26`, but it ran out of memory (RAM).
The current ETA is around `2 hours`.

```
# single file, all in memory (results in OOM)
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]


# one file at a time... slower, but we get results one file at a time...
16%
6/37 [21:11<1:49:46, 212.45s/it]
```
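As a rough check on the comparison above, the progress-bar figures can be converted into a projected total. This sketch assumes the `212.45s/it` rate holds for all 37 files, which it may not.

```python
from datetime import timedelta

sec_per_file = 212.45              # from the progress bar
files_done, files_total = 6, 37

remaining = timedelta(seconds=round((files_total - files_done) * sec_per_file))
total = timedelta(seconds=round(files_total * sec_per_file))
baseline = timedelta(hours=1, minutes=32, seconds=26)  # all-in-memory run

print(remaining)                   # 1:49:46
print(total)                       # 2:11:01
print(f"{total / baseline:.2f}x")  # 1.42x
```

So at this rate the file-at-a-time run would take roughly 1.4 times as long as the in-memory vectorizing step, in exchange for actually finishing and saving results per file.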


Notes on the new fxn to batch posts as if they're comments (because batching logic is only implemented for comments):

```python
    # Hack: Rename cols so that I can process `posts` as a batch of comments
    bucket_name=bucket_name,
    subreddits_path=None,
    posts_path=None,  # posts_path
    comments_path=comments_path,
    
    col_post_id=None,
    col_comment_id='post_id',
    col_text_comment='text',
    col_text_comment_word_count='text_word_count',
    cols_index_comment=['subreddit_name', 'subreddit_id', 'post_id'],
    local_comms_subfolder_relative='df_vect_posts',
    mlflow_comments_folder='df_vect_posts_extra_text',
    cols_comment_text_to_concat=['flair_text', 'post_url_for_embeddings', 'text', 'ocr_inferred_text_agg_clean'],
```


In [None]:
LEGACY

# Run full with `tokenize_lowercase=False` (legacy fse/fasttext)

Time on CPU, only comments + subs:
```
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:08 | INFO | "  (629, 4) <- df_subs shape"

13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
```

In [18]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='full_data-lowercase_false',
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    subreddits_path='subreddits/de/2021-06-16',
    posts_path=None,  # 'posts/de/2021-06-16',
    comments_path='comments/de/2021-06-16',
    tf_batch_inference_rows=1500,
    tf_limit_first_n_chars=1100,
    n_sample_posts=None,
    n_sample_comments=None,
)

13:28:50 | INFO | "Start vectorize function"
13:28:50 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-01_1328"
13:28:50 | INFO | "Load comments df..."
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:07 | INFO | "Keep only comments that match posts IDs in df_posts..."
13:29:07 | INFO | "df_posts missing, so we can't filter comments..."
13:29:07 | INFO | "Load subreddits df..."
13:29:08 | INFO | "  (629, 4) <- df_subs shape"
13:29:08 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
13:29:09 | INFO | "Loading model use_multilingual...
  with kwargs: None"




13:29:11 | INFO | "  0:00:02.282361 <- Load TF HUB model time elapsed"
13:29:11 | INFO | "Vectorizing subreddit descriptions..."




13:29:13 | INFO | "  Saving to local... df_vect_subreddits_description..."
13:29:13 | INFO | "  Logging to mlflow..."




13:29:14 | INFO | "Vectorizing COMMENTS..."
13:29:14 | INFO | "Getting embeddings in batches of size: 1500"


  0%|          | 0/740 [00:00<?, ?it/s]

13:44:30 | INFO | "  Saving to local... df_vect_comments..."
13:44:49 | INFO | "  Logging to mlflow..."
13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
