# Purpose

2021-07-28: Run it on the top Subreddits + German subs. Ideally this should help us find counterpart subs in other languages.

---

This notebook runs the `vectorize_text_to_embeddings` function to:
- load the USE-multilingual model
- load post & comment text
- convert the text into embeddings (at post or comment level)


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
# from functools import partial
# import os
import logging
# from pathlib import Path
# from pprint import pprint

import mlflow

# from tqdm.auto import tqdm
from tqdm import tqdm
import numpy as np
import pandas as pd

from google.cloud import storage

# TF libraries... I've been getting errors when these aren't loaded
import tensorflow_text
import tensorflow as tf

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.models.vectorize_text import (
    vectorize_text_to_embeddings,
)
from subclu.models import vectorize_text_tf

from subclu.utils import set_working_directory
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)


print_lib_versions([mlflow, np, mlflow, pd, tensorflow_text, tf, subclu])

python		v 3.7.10
===
mlflow		v: 1.16.0
numpy		v: 1.18.5
mlflow		v: 1.16.0
pandas		v: 1.2.5
tensorflow_text	v: 2.3.0
tensorflow	v: 2.3.3
subclu		v: 0.4.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Initialize mlflow logging with sqlite database

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


# Check whether we have access to a GPU

In [6]:
l_phys_gpus = tf.config.list_physical_devices('GPU')
# from tensorflow.python.client import device_lib

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:\n{l_phys_gpus}"
#     f"\n\nAll devices:\n===\n"
#     f"{device_lib.list_local_devices()}"
)


Built with CUDA? True
GPUs
===
Num GPUs Available: 1
GPU details:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Load config with data to process

In [7]:
config_data_v040 = LoadHydraConfig(
    config_path="../config/data_text_and_metadata",
    config_name='v0.4.0_19k_top_subs_and_geo_relevant_2021_09_27',
#     config_name='top_subreddits_2021_07_16',
)
config_data_v040.config_dict

{'dataset_name': 'v0.4.0 inputs - Top Subreddits (no Geo) + Geo-relevant subs, comments: TBD',
 'bucket_name': 'i18n-subreddit-clustering',
 'folder_subreddits_text_and_meta': 'subreddits/top/2021-09-24',
 'folder_posts_text_and_meta': 'posts/top/2021-09-27',
 'folder_comments_text_and_meta': None}

In [8]:
mlflow_experiment_test = 'v0.4.0_use_multi_inference_test'
mlflow_experiment_full = 'v0.4.0_use_multi_inference'

bucket_name = config_data_v040.config_dict['bucket_name']
subreddits_path = config_data_v040.config_dict['folder_subreddits_text_and_meta']
posts_path = config_data_v040.config_dict['folder_posts_text_and_meta']
# comments_path = None
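The config values are folder prefixes inside the bucket, so the full locations are just `gs://{bucket}/{folder}`. A minimal sketch of building those URIs while skipping folders set to `None` (as the comments folder is in this config). The helper name `build_gcs_uris` is hypothetical and not part of `subclu`.

```python
from typing import Dict, Optional


def build_gcs_uris(config: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map each 'folder_*' key to a gs:// URI; skip folders set to None."""
    bucket = config["bucket_name"]
    return {
        key: f"gs://{bucket}/{folder}"
        for key, folder in config.items()
        if key.startswith("folder_") and folder is not None
    }


# Values copied from the config_dict output in this notebook
config_dict = {
    "bucket_name": "i18n-subreddit-clustering",
    "folder_subreddits_text_and_meta": "subreddits/top/2021-09-24",
    "folder_posts_text_and_meta": "posts/top/2021-09-27",
    "folder_comments_text_and_meta": None,
}
uris = build_gcs_uris(config_dict)
for key, uri in uris.items():
    print(f"{key}: {uri}")
```

Dropping the `None` entries up front means downstream code never tries to read `gs://.../None`.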

# Call function to vectorize text

- Batch of: 3000
- Limit characters to: 1000

This combination finally leaves enough headroom: it uses around 50% of RAM (of 60GB).

The problem is that each iteration takes around 3 minutes, which means the whole job for GERMAN only will take around 4:42 (hours:mins)...
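The 4:42 figure is plain arithmetic: number of batches times time per batch. A rough sketch of that estimate follows. The batch count of 94 is back-calculated from 4:42 at ~3 minutes per iteration; it is an assumption, not a number taken from the job logs, and the helper names are hypothetical.

```python
from datetime import timedelta
from math import ceil


def n_batches(n_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover n_rows (the last one may be partial)."""
    return ceil(n_rows / batch_size)


def estimate_total_time(num_batches: int, minutes_per_batch: float) -> timedelta:
    """Naive ETA: assumes every batch takes the same time."""
    return timedelta(minutes=num_batches * minutes_per_batch)


# Assumed: 94 iterations at ~3 minutes each
eta = estimate_total_time(num_batches=94, minutes_per_batch=3.0)
print(eta)  # 4:42:00
```

The estimate ignores loading and saving time, so treat it as a lower bound.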

```
When subreddit_id column was missing:
CPU times: user 75.8 ms, sys: 21.1 ms, total: 96.9 ms
Wall time: 884 ms
(3767, 28)

```

In [14]:
# check columns in subreddit meta...
# %%time

# df_subs = pd.read_parquet(
#     path=f"gs://{bucket_name}/{subreddits_path}",
#     # columns=l_cols_subreddits,
# )
# df_subs.shape

In [15]:
# df_subs.tail()

## Test on a `sample` of posts & comments to make sure entire process works first (before running long job)

For subreddit descriptions only, we can expand to more than 1,500 characters.

HOWEVER, when scoring posts and/or comments, we're better off trimming to the first ~1,000 characters to speed things up. We can increase the character limit if results aren't great... this could be a hyperparameter to tune.
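To make the two knobs concrete, here is a small sketch of trimming each text to its first N characters and yielding fixed-size batches. It only illustrates the idea behind `tf_batch_inference_rows` and `tf_limit_first_n_chars`; the function name is hypothetical and the real `subclu` internals may differ.

```python
from typing import Iterator, List, Sequence


def iter_trimmed_batches(
    texts: Sequence[str],
    batch_size: int,
    first_n_chars: int,
) -> Iterator[List[str]]:
    """Yield batches of at most batch_size texts, each cut to first_n_chars."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        yield [text[:first_n_chars] for text in batch]


# Toy inputs: some longer than the limit, some shorter
texts = ["a" * 1500, "short post", "b" * 1000, "c" * 2000, "last"]
batches = list(iter_trimmed_batches(texts, batch_size=2, first_n_chars=1000))

for i, batch in enumerate(batches):
    print(i, [len(text) for text in batch])
```

Because the trimming happens per batch, the character limit can be changed between runs without rewriting the input data.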

```
08:27:18 | INFO | "Start vectorize function"
08:27:18 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_0827"
08:27:18 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
08:27:26 | INFO | "  0:00:07.773679 <- df_post time elapsed"
08:27:26 | INFO | "  (1649929, 6) <- df_posts.shape"
08:27:27 | INFO | "  Sampling posts down to: 2,500"
08:27:27 | INFO | "  (2500, 6) <- df_posts.shape AFTER sampling"
08:27:27 | INFO | "Load comments df..."
08:27:57 | INFO | "  (19200854, 6) <- df_comments shape"
08:28:08 | INFO | "Keep only comments that match posts IDs in df_posts..."
08:28:11 | INFO | "  (31630, 6) <- updated df_comments shape"
08:28:11 | INFO | "  Sampling COMMENTS down to: 5,100"
08:28:11 | INFO | "  (5100, 6) <- df_comments.shape AFTER sampling"
08:28:11 | INFO | "Load subreddits df..."
08:28:12 | INFO | "  (3767, 4) <- df_subs shape"
...
08:28:15 | INFO | "Getting embeddings in batches of size: 2000"
100%
2/2 [00:03<00:00, 1.67s/it]
08:28:19 | INFO | "  Saving to local... df_vect_subreddits_description..."
08:28:19 | INFO | "  Logging to mlflow..."
08:28:20 | INFO | "Vectorizing POSTS..."
08:28:20 | INFO | "Getting embeddings in batches of size: 2000"
100%
2/2 [00:00<00:00, 2.42it/s]
08:28:21 | INFO | "  Saving to local... df_vect_posts..."
08:28:21 | INFO | "  Logging to mlflow..."
08:28:22 | INFO | "Vectorizing COMMENTS..."
08:28:22 | INFO | "Getting embeddings in batches of size: 2000"
100%
3/3 [00:01<00:00, 1.95it/s]
08:28:24 | INFO | "  Saving to local... df_vect_comments..."
08:28:24 | INFO | "  Logging to mlflow..."
08:28:25 | INFO | "  0:01:06.542544 <- Total vectorize fxn time elapsed"

```

In [None]:
BREAK

In [17]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='test_n_samples',
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=None,
    
    tf_batch_inference_rows=2000,
    tf_limit_first_n_chars=1000,
    n_sample_posts=2500,
    n_sample_comments=5100,
)

07:07:25 | INFO | "Start vectorize function"
07:07:25 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-09-28_0707"
07:07:25 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-09-27"
07:07:46 | INFO | "  0:00:21.124010 <- df_post time elapsed"
07:07:46 | INFO | "  (8439672, 6) <- df_posts.shape"
07:07:51 | INFO | "  Sampling posts down to: 2,500"
07:07:52 | INFO | "  (2500, 6) <- df_posts.shape AFTER sampling"
07:07:52 | INFO | "Load subreddits df..."
07:07:53 | INFO | "  (19262, 4) <- df_subs shape"
07:07:53 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
07:07:54 | INFO | "Loading model use_multilingual...
  with kwargs: None"
07:08:02 | INFO | "  0:00:08.667182 <- Load TF HUB model time elapsed"
07:08:02 | INFO | "Vectorizing subreddit descriptions..."
07:08:02 | INFO | "Getting embeddings in batches

  0%|          | 0/10 [00:00<?, ?it/s]

07:08:17 | INFO | "  Saving to local... df_vect_subreddits_description..."
07:08:18 | INFO | "  Logging to mlflow..."
07:08:19 | INFO | "Vectorizing POSTS..."
07:08:19 | INFO | "Getting embeddings in batches of size: 2000"


  0%|          | 0/2 [00:00<?, ?it/s]

07:08:20 | INFO | "  Saving to local... df_vect_posts..."
07:08:20 | INFO | "  Logging to mlflow..."
07:08:21 | INFO | "  0:00:55.876173 <- Total vectorize fxn time elapsed"


In [18]:
print(df_vect_subs.shape)
df_vect_subs.iloc[:5, :10]

(19262, 512)


subreddit_name,subreddit_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
askreddit,t5_2qh1i,-0.015748,-0.057748,-0.012613,0.034228,0.070311,0.045727,0.05243,-0.052749,0.053685,-0.021486
pics,t5_2qh0u,-0.056925,0.027936,-0.009723,-0.009849,0.0432,0.045963,0.049922,-0.061319,0.053243,-0.052809
funny,t5_2qh33,0.049139,-0.038282,-0.03178,-0.030479,0.0748,0.056662,0.006651,0.047494,-0.043262,0.012178
memes,t5_2qjpg,-0.022032,0.010012,-0.067775,-0.026031,0.061845,0.063659,-0.066072,0.048686,0.035556,0.02053
interestingasfuck,t5_2qhsa,-0.020677,0.061429,-0.029565,0.029978,0.066374,0.061271,0.069265,0.028228,0.004899,0.044498


In [19]:
print(df_vect.shape)
df_vect.iloc[:5, :10]

(2500, 512)


subreddit_name,subreddit_id,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
xboxone,t5_2xbci,t3_ozpl4x,-0.043297,0.020574,-0.067325,0.020786,-0.012332,0.005021,0.071557,-0.074539,-0.047362,0.030414
celeb_nylons,t5_wqh04,t3_pj7bnw,0.115459,0.076455,0.011925,0.045632,0.015686,0.003254,-0.045785,0.040692,-0.096398,-0.007461
animalsbeingmoms,t5_37h9r,t3_pdgq6w,-0.042089,-0.014872,-0.015677,-0.008563,-0.128294,0.019726,0.016793,0.051896,-0.060583,-0.04376
preppers,t5_2riow,t3_p5gmfl,-0.057952,0.077444,-0.008418,-0.015304,0.025608,0.073165,-0.061645,0.046887,0.042444,-0.02651
ancientcoins,t5_2wmz0,t3_oyojy8,-0.04423,0.026513,-0.041851,-0.017429,-0.079371,0.0018,-0.079442,-0.042409,-0.040717,-0.022141


In [21]:
# print(df_vect_comments.shape)
# df_vect_comments.iloc[10:15, -10:]

# Test new batching function

Most inputs will be the same.
However, some things will change:
- Add new parameter to sample only first N files (we'll process each file individually)
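A sketch of what "sample only the first N files" could look like, given a list of blob names under a prefix. With GCS the names would come from `bucket.list_blobs(prefix=posts_path)`; here they are passed in as plain strings so the logic is easy to check. The function name and all file names except `000000000000.parquet` are illustrative assumptions.

```python
from typing import List, Optional


def select_parquet_files(
    blob_names: List[str],
    n_sample_files: Optional[int] = None,
) -> List[str]:
    """Keep only .parquet blobs, sorted by name; optionally keep the first N."""
    parquet_files = sorted(name for name in blob_names if name.endswith(".parquet"))
    if n_sample_files is None:
        return parquet_files
    return parquet_files[:n_sample_files]


blob_names = [
    "posts/top/2021-09-27/000000000001.parquet",
    "posts/top/2021-09-27/000000000000.parquet",
    "posts/top/2021-09-27/",  # folder placeholder, not a data file
    "posts/top/2021-09-27/000000000002.parquet",
]
sampled = select_parquet_files(blob_names, n_sample_files=2)
print(sampled)
```

Sorting first keeps the sample deterministic, so a test run always sees the same files.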

In [22]:
# storage_client = storage.Client()
# bucket = storage_client.bucket(bucket_name)
# # folder = configs.gcp_storage_folder

# # print( str(configs.gcp_bucket) +"/"+ str(folder))
# for blob in tqdm(list(bucket.list_blobs(prefix=posts_path))[:5]):
# #     print(blob)
#     print(blob.name)
#     print(blob.name.split('/')[-1].split('.')[0])

In [9]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"batch_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=None,
    
    tf_batch_inference_rows=2100,
    tf_limit_first_n_chars=1000,  # Getting OOM errors with 1,000 chars
    n_sample_posts=4500,
    n_sample_comments=4100,
)

07:38:08 | INFO | "Start vectorize function"
07:38:08 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-09-28_073808"
07:38:08 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
07:38:08 | INFO | "  Saving config to local path..."
07:38:08 | INFO | "  Logging config to mlflow..."
07:38:09 | INFO | "Loading model use_multilingual..."
07:38:12 | INFO | "  0:00:03.218737 <- Load TF HUB model time elapsed"
07:38:12 | INFO | "Load subreddits df..."
07:38:13 | INFO | "  0:00:00.915581 <- df_subs loading time elapsed"
07:38:13 | INFO | "  (19262, 4) <- df_subs shape"
07:38:13 | INFO | "Vectorizing subreddit descriptions..."
07:38:13 | INFO | "Getting embeddings in batches of size: 2200"
 OOM when allocating tensor with shape[573337,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node Statef

### Timing is super fast, even with a bigger sample size

```
11:40:06 | INFO | "Start vectorize function"
11:40:06 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1140"
11:40:07 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
11:40:07 | INFO | "  Saving config to local path..."
11:40:07 | INFO | "  Logging config to mlflow..."
11:40:08 | INFO | "Loading model use_multilingual..."
11:40:10 | INFO | "  0:00:02.417308 <- Load TF HUB model time elapsed"
11:40:10 | WARNING | "For TF-HUB models, the only preprocessing applied is lowercase()"
11:40:10 | INFO | "Load subreddits df..."
11:40:11 | INFO | "  0:00:00.519934 <- df_subs loading time elapsed"
11:40:11 | INFO | "  (3767, 4) <- df_subs shape"
11:40:11 | INFO | "Vectorizing subreddit descriptions..."
100%
2/2 [00:03<00:00, 1.65s/it]
11:40:15 | INFO | "  0:00:04.080246 <- df_subs vectorizing time elapsed"
...
11:40:16 | INFO | "Loading df_posts...
11:40:23 | INFO | "  0:00:06.460565 <- df_post loading time elapsed"
11:40:23 | INFO | "  (1649929, 6) <- df_posts.shape"
11:40:24 | INFO | "  Sampling posts down to: 9,500"
11:40:24 | INFO | "  (9500, 6) <- df_posts.shape AFTER sampling"
11:40:24 | INFO | "Vectorizing POSTS..."
100%
5/5 [00:03<00:00, 1.59it/s]
11:40:28 | INFO | "  0:00:03.774021 <- df_posts vectorizing time elapsed"
...
11:40:30 | INFO | "Load comments df..."
11:40:58 | INFO | "  (19200854, 6) <- df_comments shape"
11:41:10 | INFO | "Keep only comments that match posts IDs in df_posts..."
11:41:14 | INFO | "  (95313, 6) <- updated df_comments shape"
11:41:14 | INFO | "  Sampling COMMENTS down to: 19,100"
11:41:14 | INFO | "  (19100, 6) <- df_comments.shape AFTER sampling"
11:41:14 | INFO | "Vectorizing COMMENTS..."
100%
10/10 [00:05<00:00, 1.60it/s]
11:41:20 | INFO | "  0:00:06.239953 <- df_posts vectorizing time elapsed"
11:41:20 | INFO | "  Saving to local... df_vect_comments..."
11:41:20 | INFO | "    42.1 MB <- Memory usage"
11:41:20 | INFO | "       2	<- target Dask partitions	   30.0 <- target MB partition size"
11:41:21 | INFO | "  Logging to mlflow..."
11:41:23 | INFO | "  0:01:16.130234 <- Total vectorize fxn time elapsed"

```

In [10]:
# mlflow.end_run(status='KILLED')

# vectorize_text_tf.vectorize_text_to_embeddings(
#     model_name='use_multilingual',
#     run_name=f"test_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
#     mlflow_experiment=mlflow_experiment_test,
    
#     tokenize_lowercase=False,
    
#     bucket_name=bucket_name,
#     subreddits_path=subreddits_path,
#     posts_path=posts_path,
#     comments_path=comments_path,
    
#     tf_batch_inference_rows=2100,
#     tf_limit_first_n_chars=1000,
#     n_sample_posts=9500,
#     n_sample_comments=19100,
# )

In [None]:
BREAK

# Run full with `tokenize_lowercase=False`
Let's see if the current refactor is good enough or if I really need to manually batch files...

**answer**: no, it wasn't good enough. 60GB of RAM wasn't enough for 19 million comments _lol_.

```
...
12:02:14 | INFO | "  (19168154, 6) <- updated df_comments shape"
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]

<__array_function__ internals> in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 36.6 GiB for an array with shape (512, 19168154) and data type float32
```
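The 36.6 GiB in the error is exactly what the full embedding matrix needs as one float32 array: 512 dimensions × 19,168,154 rows × 4 bytes. A quick check of that arithmetic (the helper name is hypothetical):

```python
def array_size_gib(shape, bytes_per_item=4):
    """Size in GiB of a dense array with the given shape (float32 = 4 bytes)."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * bytes_per_item / 2**30


full_matrix_gib = array_size_gib((512, 19_168_154))
print(f"{full_matrix_gib:.1f} GiB")  # 36.6 GiB
```

Note that `concatenate` has to allocate this array in one piece while the per-batch arrays it copies from are still in memory, so peak usage was likely close to double that figure, which would not fit in 60GB.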


In [11]:
comments_path = None

In [22]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"posts_as_comments_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=None,  # posts_path
    comments_path=posts_path,

    # Hack: Rename cols so that I can process posts as a batch of comments
    col_post_id=None,
    col_comment_id='post_id',
    col_text_comment='text',
    col_text_comment_word_count='text_word_count',
    cols_index_comment=['subreddit_name', 'subreddit_id', 'post_id'],
    local_comms_subfolder_relative='df_vect_posts',
    mlflow_comments_folder='df_vect_posts',
    
    tf_batch_inference_rows=2500,
    tf_limit_first_n_chars=850,
    
    n_sample_comment_files=2,
    
#     n_sample_posts=9500,
#     n_sample_comments=19100,
)

08:28:13 | INFO | "Start vectorize function"
08:28:13 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-09-28_082813"
08:28:13 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
08:28:13 | INFO | "  Saving config to local path..."
08:28:13 | INFO | "  Logging config to mlflow..."
08:28:14 | INFO | "Loading model use_multilingual..."




08:28:16 | INFO | "  0:00:02.249177 <- Load TF HUB model time elapsed"
08:28:16 | INFO | "Load subreddits df..."
08:28:17 | INFO | "  0:00:00.686549 <- df_subs loading time elapsed"
08:28:17 | INFO | "  (19262, 4) <- df_subs shape"
08:28:17 | INFO | "Vectorizing subreddit descriptions..."
08:28:17 | INFO | "Getting embeddings in batches of size: 2300"
  0%|                                                     | 0/9 [00:00<?, ?it/s]



100%|#############################################| 9/9 [00:17<00:00,  1.95s/it]
08:28:35 | INFO | "  0:00:18.179895 <- df_subs vectorizing time elapsed"
08:28:35 | INFO | "  Saving to local: df_vect_subreddits_description/df | 19,262 Rows by 514 Cols"
08:28:35 | INFO | "Converting pandas to dask..."
08:28:35 | INFO | "    40.1 MB <- Memory usage"
08:28:35 | INFO | "       2	<- target Dask partitions	   30.0 <- target MB partition size"
08:28:36 | INFO | "  Logging to mlflow..."
08:28:37 | INFO | "** Procesing Comments files one at a time ***"
08:28:37 | INFO | "-- Loading & vectorizing COMMENTS in files: 2 --
Expected batch size: 2300"
local variable 'df_posts' referenced before assignment"
  0%|          | 0/2 [00:00<?, ?it/s]08:28:37 | INFO | "Processing: posts/top/2021-09-27/000000000000.parquet"
ResourceExhausted, lowering character limit
 OOM when allocating tensor with shape[598746,1280] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{
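The `ResourceExhausted, lowering character limit` line suggests a retry loop: when a batch does not fit on the GPU, trim the texts further and try again. A minimal sketch of that pattern under stated assumptions: the log does not say by how much the limit is lowered, so the step of 50 characters and the floor of 100 are made up, and `OutOfMemory` is a stand-in for the real error (in TensorFlow that would be `tf.errors.ResourceExhaustedError`).

```python
from typing import Callable, List, Sequence


class OutOfMemory(Exception):
    """Stand-in for an accelerator OOM error; hypothetical, for illustration."""


def embed_with_backoff(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    first_n_chars: int,
    step: int = 50,       # assumed step size
    min_chars: int = 100,  # assumed floor; below this we give up
) -> List[List[float]]:
    """Retry embed_fn with a shorter character limit after each OOM."""
    limit = first_n_chars
    while True:
        try:
            return embed_fn([text[:limit] for text in texts])
        except OutOfMemory:
            if limit - step < min_chars:
                raise  # cannot trim further; surface the error
            limit -= step


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Toy embedder: 'runs out of memory' if any text exceeds 700 chars."""
    if max(len(text) for text in batch) > 700:
        raise OutOfMemory("batch too large")
    return [[float(len(text))] for text in batch]


vectors = embed_with_backoff(["x" * 2000, "short"], fake_embed, first_n_chars=850)
print(vectors)  # [[700.0], [5.0]]
```

Lowering the batch size instead of the character limit would be another way to back off; which one works better here is an open question.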

In [23]:
mlflow.end_run(status='KILLED')

## Test - Re-do comments with new batching logic
Trying to vectorize all 19 million comments at once broke (sigh), so we need to batch one file at a time.
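The idea of file-level batching, sketched with injected load/embed/save callables so it runs without GCS or TensorFlow. All names here are hypothetical; the point is only that each file's embeddings are saved before the next file is loaded, so peak memory is bounded by one file rather than by the full dataset.

```python
from typing import Callable, Dict, Iterable, List


def vectorize_files_one_at_a_time(
    file_names: Iterable[str],
    load_fn: Callable[[str], List[str]],
    embed_fn: Callable[[List[str]], List[List[float]]],
    save_fn: Callable[[str, List[List[float]]], None],
) -> int:
    """Load, embed, and save each file separately; return total rows processed."""
    n_rows_total = 0
    for name in file_names:
        texts = load_fn(name)
        vectors = embed_fn(texts)
        save_fn(name, vectors)  # results for this file are persisted right away
        n_rows_total += len(vectors)
        del texts, vectors  # drop references before loading the next file
    return n_rows_total


# Toy stand-ins for storage and the model
fake_storage: Dict[str, List[str]] = {
    "file_0.parquet": ["hallo welt", "guten tag"],
    "file_1.parquet": ["hello world"],
}
saved: Dict[str, List[List[float]]] = {}

n_rows = vectorize_files_one_at_a_time(
    file_names=sorted(fake_storage),
    load_fn=lambda name: fake_storage[name],
    embed_fn=lambda texts: [[float(len(t))] for t in texts],
    save_fn=lambda name, vectors: saved.update({name: vectors}),
)
print(n_rows, sorted(saved))
```

A side benefit is that a crash midway keeps the files that already finished.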

### Re-run comments and log to non-test mlflow experiment


Besides file-batching, this job increased the row-batches from 2,000 to 6,100. It's unclear whether this is having a negative impact; maybe smaller batches are somehow more efficient?
Now that I'm reading one file at a time, it looks like speed is taking a big hit.

Baseline: running everything in memory took `1:32:26`, but it then ran out of memory (RAM).
The current ETA is around `2 hours`.

```
# single file, all in memory (results in OOM)
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]


# one file at a time... slower, but we get results one file at a time...
16%
6/37 [21:11<1:49:46, 212.45s/it]
```
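Putting the two progress bars side by side: projecting the per-file rate over all 37 files and comparing it with the in-memory baseline. The inputs are read off the tqdm output; the projection assumes the remaining files take about as long as the first six, which may not hold.

```python
from datetime import timedelta

# In-memory run: finished vectorizing in 1:32:26 (then failed on concatenate)
baseline = timedelta(hours=1, minutes=32, seconds=26)

# File-at-a-time run: 37 files at ~212.45 seconds per file
n_files = 37
seconds_per_file = 212.45
projected = timedelta(seconds=n_files * seconds_per_file)

slowdown = projected / baseline  # timedelta / timedelta gives a float
print(f"projected: {projected}, slowdown: {slowdown:.2f}x")
```

So the file-at-a-time job is projected at a bit over 2 hours, roughly 1.4x the baseline, but it actually produces output.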


In [24]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"posts_as_comments_batch_fxn-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=None,  # posts_path
    comments_path=posts_path,

    # Hack: Rename cols so that I can process posts as a batch of comments
    col_post_id=None,
    col_comment_id='post_id',
    col_text_comment='text',
    col_text_comment_word_count='text_word_count',
    cols_index_comment=['subreddit_name', 'subreddit_id', 'post_id'],
    local_comms_subfolder_relative='df_vect_posts',
    mlflow_comments_folder='df_vect_posts',
    
    tf_batch_inference_rows=2600,
    tf_limit_first_n_chars=850,
    
    n_sample_comment_files=None,
    
#     n_sample_posts=9500,
#     n_sample_comments=19100,
)

08:42:39 | INFO | "Start vectorize function"
08:42:39 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-09-28_084239"
08:42:39 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
08:42:40 | INFO | "  Saving config to local path..."
08:42:40 | INFO | "  Logging config to mlflow..."
08:42:40 | INFO | "Loading model use_multilingual..."
08:42:42 | INFO | "  0:00:02.266991 <- Load TF HUB model time elapsed"
08:42:42 | INFO | "Load subreddits df..."
08:42:43 | INFO | "  0:00:00.692434 <- df_subs loading time elapsed"
08:42:43 | INFO | "  (19262, 4) <- df_subs shape"
08:42:43 | INFO | "Vectorizing subreddit descriptions..."
08:42:44 | INFO | "Getting embeddings in batches of size: 2600"
ResourceExhausted, lowering character limit
 OOM when allocating tensor with shape[577467,1280] and type float on /job:localhost/replica:0/task:0/device:GP

In [25]:
gc.collect()

55223

# Run full with `tokenize_lowercase=True`

This one is expected to be a little slower because it'll call `.str.lower()` on each batch of text.

---

TODO: unsure if it's worth running this job in parallel while I do work on a separate VM... might be a big pain to manually sync the rows from metrics & params happening at the same time in two different VMs.



In [26]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"posts_as_comments_batch_fxn-{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=None,  # posts_path
    comments_path=posts_path,

    # Hack: Rename cols so that I can process posts as a batch of comments
    col_post_id=None,
    col_comment_id='post_id',
    col_text_comment='text',
    col_text_comment_word_count='text_word_count',
    cols_index_comment=['subreddit_name', 'subreddit_id', 'post_id'],
    local_comms_subfolder_relative='df_vect_posts',
    mlflow_comments_folder='df_vect_posts',
    
    tf_batch_inference_rows=2600,
    tf_limit_first_n_chars=850,
    
    n_sample_comment_files=None,
    
#     n_sample_posts=9500,
#     n_sample_comments=19100,
)

10:05:02 | INFO | "Start vectorize function"
10:05:02 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-09-28_100502"
10:05:02 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
10:05:03 | INFO | "  Saving config to local path..."
10:05:03 | INFO | "  Logging config to mlflow..."
10:05:03 | INFO | "Loading model use_multilingual..."
10:05:05 | INFO | "  0:00:02.265257 <- Load TF HUB model time elapsed"
10:05:05 | INFO | "Load subreddits df..."
10:05:06 | INFO | "  0:00:00.683829 <- df_subs loading time elapsed"
10:05:06 | INFO | "  (19262, 4) <- df_subs shape"
10:05:06 | INFO | "Vectorizing subreddit descriptions..."
10:05:06 | INFO | "Getting embeddings in batches of size: 2600"
ResourceExhausted, lowering character limit
 OOM when allocating tensor with shape[568066,1280] and type float on /job:localhost/replica:0/task:0/device:GP

In [None]:
LEGACY

# Run full with `tokenize_lowercase=False`

Time on CPU, only comments + subs:
```
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:08 | INFO | "  (629, 4) <- df_subs shape"

13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
```

In [18]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='full_data-lowercase_false',
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    subreddits_path='subreddits/de/2021-06-16',
    posts_path=None,  # 'posts/de/2021-06-16',
    comments_path='comments/de/2021-06-16',
    tf_batch_inference_rows=1500,
    tf_limit_first_n_chars=1100,
    n_sample_posts=None,
    n_sample_comments=None,
)

13:28:50 | INFO | "Start vectorize function"
13:28:50 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-01_1328"
13:28:50 | INFO | "Load comments df..."
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:07 | INFO | "Keep only comments that match posts IDs in df_posts..."
13:29:07 | INFO | "df_posts missing, so we can't filter comments..."
13:29:07 | INFO | "Load subreddits df..."
13:29:08 | INFO | "  (629, 4) <- df_subs shape"
13:29:08 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
13:29:09 | INFO | "Loading model use_multilingual...
  with kwargs: None"




13:29:11 | INFO | "  0:00:02.282361 <- Load TF HUB model time elapsed"
13:29:11 | INFO | "Vectorizing subreddit descriptions..."




13:29:13 | INFO | "  Saving to local... df_vect_subreddits_description..."
13:29:13 | INFO | "  Logging to mlflow..."




13:29:14 | INFO | "Vectorizing COMMENTS..."
13:29:14 | INFO | "Getting embeddings in batches of size: 1500"


  0%|          | 0/740 [00:00<?, ?it/s]

13:44:30 | INFO | "  Saving to local... df_vect_comments..."
13:44:49 | INFO | "  Logging to mlflow..."
13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
