# Purpose

2021-07-28: Run it on the top Subreddits + German subs. Ideally this should help us find counterpart subs in other languages.

---

This notebook runs the `vectorize_text_to_embeddings` function to:
- load the USE-multilingual model
- load post & comment text
- convert the text into embeddings (at post or comment level)


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
# from functools import partial
# import os
import logging
# from pathlib import Path
# from pprint import pprint

import mlflow

from tqdm.auto import tqdm
import numpy as np
import pandas as pd

from google.cloud import storage

# TF libraries... I've been getting errors when these aren't loaded
import tensorflow_text
import tensorflow as tf

import subclu
from subclu.models.vectorize_text import (
    vectorize_text_to_embeddings,
)
from subclu.models import vectorize_text_tf

from subclu.utils import set_working_directory
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)


print_lib_versions([mlflow, np, mlflow, pd, tensorflow_text, tf, subclu])

python		v 3.7.10
===
mlflow		v: 1.16.0
numpy		v: 1.18.5
mlflow		v: 1.16.0
pandas		v: 1.2.5
tensorflow_text	v: 2.3.0
tensorflow	v: 2.3.3
subclu		v: 0.3.2


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Initialize mlflow logging with sqlite database

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


# Check whether we have access to a GPU

In [6]:
l_phys_gpus = tf.config.list_physical_devices('GPU')
from tensorflow.python.client import device_lib

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:\n{l_phys_gpus}"
    f"\n\nAll devices:\n===\n"
    f"{device_lib.list_local_devices()}"
)


Built with CUDA? True
GPUs
===
Num GPUs Available: 1
GPU details:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

All devices:
===
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6261311781606232107
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14842963934755766732
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 13515039862361315005
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14676252416
locality {
  bus_id: 1
  links {
  }
}
incarnation: 269525325704656488
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]


# Call function to vectorize text

- Batch of: 3000
- Limit characters to: 1000

These settings finally leave enough room: the job uses around 50% of RAM (of 60GB).

The problem is that each iteration takes around 3 minutes, which means the whole job for GERMAN only will take around 4:42 (hours:mins)...

In [8]:
mlflow_experiment_test = 'v0.3.2_use_multi_inference_test'
mlflow_experiment_full = 'v0.3.2_use_multi_inference'

bucket_name = 'i18n-subreddit-clustering'
subreddits_path = "subreddits/top/2021-07-16"
posts_path = 'posts/top/2021-07-16'
comments_path = 'comments/top/2021-07-09'

```
When subreddit_id column was missing:
CPU times: user 75.8 ms, sys: 21.1 ms, total: 96.9 ms
Wall time: 884 ms
(3767, 28)

```

In [8]:
# check columns in subreddit meta...
# %%time

# df_subs = pd.read_parquet(
#     path=f"gs://{bucket_name}/{subreddits_path}",
#     # columns=l_cols_subreddits,
# )
# df_subs.shape

In [9]:
# df_subs.tail()

## Test on a `sample` of posts & comments to make sure entire process works first (before running long job)

For subreddit descriptions only, we can expand to more than 1,500 characters.

HOWEVER: when scoring posts and/or comments, we're better off trimming to the first ~1,000 characters to speed things up. We can increase the character limit if results aren't great... this could be a hyperparameter to tune.
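
As a rough illustration of the trimming idea, the sketch below cuts each text to its first N characters and then yields fixed-size batches. `trim_and_batch` and its arguments are hypothetical names used only for illustration; this is not the internal logic of `vectorize_text_to_embeddings`.

```python
from typing import Iterable, Iterator, List


def trim_and_batch(
    texts: Iterable[str],
    limit_first_n_chars: int = 1000,
    batch_size: int = 2000,
) -> Iterator[List[str]]:
    """Yield lists of trimmed texts, at most `batch_size` items each."""
    batch: List[str] = []
    for text in texts:
        # Slicing is safe for short strings: it returns the whole string
        batch.append(text[:limit_first_n_chars])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch
        yield batch


# Toy input: 5 texts of 30 characters, trimmed to 10 chars, in batches of 2
toy_texts = ["x" * 30 for _ in range(5)]
toy_batches = list(trim_and_batch(toy_texts, limit_first_n_chars=10, batch_size=2))
print([len(b) for b in toy_batches])
```

Trimming before batching keeps the longest input in any batch bounded, which is presumably why a lower character limit speeds up inference.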

```
08:27:18 | INFO | "Start vectorize function"
08:27:18 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_0827"
08:27:18 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
08:27:26 | INFO | "  0:00:07.773679 <- df_post time elapsed"
08:27:26 | INFO | "  (1649929, 6) <- df_posts.shape"
08:27:27 | INFO | "  Sampling posts down to: 2,500"
08:27:27 | INFO | "  (2500, 6) <- df_posts.shape AFTER sampling"
08:27:27 | INFO | "Load comments df..."
08:27:57 | INFO | "  (19200854, 6) <- df_comments shape"
08:28:08 | INFO | "Keep only comments that match posts IDs in df_posts..."
08:28:11 | INFO | "  (31630, 6) <- updated df_comments shape"
08:28:11 | INFO | "  Sampling COMMENTS down to: 5,100"
08:28:11 | INFO | "  (5100, 6) <- df_comments.shape AFTER sampling"
08:28:11 | INFO | "Load subreddits df..."
08:28:12 | INFO | "  (3767, 4) <- df_subs shape"
...
08:28:15 | INFO | "Getting embeddings in batches of size: 2000"
100%
2/2 [00:03<00:00, 1.67s/it]
08:28:19 | INFO | "  Saving to local... df_vect_subreddits_description..."
08:28:19 | INFO | "  Logging to mlflow..."
08:28:20 | INFO | "Vectorizing POSTS..."
08:28:20 | INFO | "Getting embeddings in batches of size: 2000"
100%
2/2 [00:00<00:00, 2.42it/s]
08:28:21 | INFO | "  Saving to local... df_vect_posts..."
08:28:21 | INFO | "  Logging to mlflow..."
08:28:22 | INFO | "Vectorizing COMMENTS..."
08:28:22 | INFO | "Getting embeddings in batches of size: 2000"
100%
3/3 [00:01<00:00, 1.95it/s]
08:28:24 | INFO | "  Saving to local... df_vect_comments..."
08:28:24 | INFO | "  Logging to mlflow..."
08:28:25 | INFO | "  0:01:06.542544 <- Total vectorize fxn time elapsed"

```

In [None]:
BREAK

In [20]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='test_n_samples',
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2000,
    tf_limit_first_n_chars=1000,
    n_sample_posts=2500,
    n_sample_comments=5100,
)

08:27:18 | INFO | "Start vectorize function"
08:27:18 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_0827"
08:27:18 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
08:27:26 | INFO | "  0:00:07.773679 <- df_post time elapsed"
08:27:26 | INFO | "  (1649929, 6) <- df_posts.shape"
08:27:27 | INFO | "  Sampling posts down to: 2,500"
08:27:27 | INFO | "  (2500, 6) <- df_posts.shape AFTER sampling"
08:27:27 | INFO | "Load comments df..."
08:27:57 | INFO | "  (19200854, 6) <- df_comments shape"
08:28:08 | INFO | "Keep only comments that match posts IDs in df_posts..."
08:28:11 | INFO | "  (31630, 6) <- updated df_comments shape"
08:28:11 | INFO | "  Sampling COMMENTS down to: 5,100"
08:28:11 | INFO | "  (5100, 6) <- df_comments.shape AFTER sampling"
08:28:11 | INFO | "Load subreddits df..."
08:28:12 | INFO | "  (3767, 4) <- df_subs shape"
08:28:12 | INFO | "MLflow tracking URI: sqlit

  0%|          | 0/2 [00:00<?, ?it/s]

08:28:19 | INFO | "  Saving to local... df_vect_subreddits_description..."
08:28:19 | INFO | "  Logging to mlflow..."
08:28:20 | INFO | "Vectorizing POSTS..."
08:28:20 | INFO | "Getting embeddings in batches of size: 2000"


  0%|          | 0/2 [00:00<?, ?it/s]

08:28:21 | INFO | "  Saving to local... df_vect_posts..."
08:28:21 | INFO | "  Logging to mlflow..."
08:28:22 | INFO | "Vectorizing COMMENTS..."
08:28:22 | INFO | "Getting embeddings in batches of size: 2000"


  0%|          | 0/3 [00:00<?, ?it/s]

08:28:24 | INFO | "  Saving to local... df_vect_comments..."
08:28:24 | INFO | "  Logging to mlflow..."
08:28:25 | INFO | "  0:01:06.542544 <- Total vectorize fxn time elapsed"


In [24]:
print(df_vect_subs.shape)
df_vect_subs.iloc[:5, :10]

(3767, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
pics,t5_2qh0u,-0.056925,0.027936,-0.009723,-0.009849,0.0432,0.045963,0.049922,-0.061319,0.053243,-0.052809
funny,t5_2qh33,0.045474,-0.039333,-0.03179,-0.015574,0.074503,0.054003,0.00787,0.061827,-0.050316,0.023417
memes,t5_2qjpg,-0.014767,0.018347,-0.069566,-0.02242,0.063016,0.066394,-0.061886,0.04054,0.01935,0.027958
news,t5_2qh3l,-0.066339,0.056393,0.036245,-0.021127,0.076642,0.040693,0.019423,0.054693,-0.012191,0.065671
interestingasfuck,t5_2qhsa,-0.020677,0.061429,-0.029565,0.029978,0.066374,0.061271,0.069265,0.028228,0.004899,0.044498


In [25]:
print(df_vect.shape)
df_vect.iloc[:5, :10]

(2500, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,post_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
stocks,t5_2qjfk,t3_o9h0x8,-0.054105,0.0703,-0.077626,-0.032931,-0.022623,-0.004873,0.01813,-0.031575,-0.058212,0.04276
laptops,t5_2qoip,t3_o76rnd,-0.018935,-0.026588,0.071572,-0.015938,-0.083586,-0.082511,0.044506,-0.015417,-0.043235,0.068179
luftraum,t5_q02q4,t3_nuwjin,-0.072257,0.020968,-0.012867,-0.042401,0.024641,0.081959,-0.042031,0.01064,-0.005116,0.025049
adultery,t5_2sjkv,t3_oi8tno,0.046559,-0.042573,0.039657,-0.066817,-0.096201,0.001347,-0.074167,0.002814,-0.066348,-0.02913
poopshitters,t5_wgmeb,t3_o2es6a,-0.056548,-0.018536,0.015383,-0.018324,-0.007343,0.037289,0.082599,0.029512,-0.039021,0.031015


In [26]:
print(df_vect_comments.shape)
df_vect_comments.iloc[10:15, -10:]

(5100, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,embeddings_502,embeddings_503,embeddings_504,embeddings_505,embeddings_506,embeddings_507,embeddings_508,embeddings_509,embeddings_510,embeddings_511
subreddit_name,subreddit_id,post_id,comment_id,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
yoga,t5_2qhq6,t3_o9upy3,t1_h3djels,-0.049714,0.039176,0.031363,-0.025456,0.020385,-0.009454,0.032895,0.048567,0.006216,0.068287
worldnews,t5_2qh13,t3_o7yf4y,t1_h333pw7,-0.000592,-0.064295,-0.041862,-0.026075,0.028604,-0.028089,-0.017898,0.0315,-0.0451,0.064217
aww,t5_2qh1o,t3_nvidt5,t1_h13y1e4,0.022847,-0.084315,0.062082,-0.042768,0.049408,0.04187,0.021409,0.03514,0.034683,0.112163
formula1,t5_2qimj,t3_o8beb8,t1_h33xaio,-0.042507,-0.012686,0.035909,0.1043,0.034547,0.009831,-0.02029,0.04584,0.037209,0.074332
aaaaaaacccccccce,t5_3aa11,t3_o9igjh,t1_h3igwhh,0.045924,-0.043366,0.009658,0.00471,-0.062547,-0.043556,0.074638,0.029582,0.018934,0.036313


# Test new batching function

Most inputs will be the same.
However, some things will change:
- Add new parameter to sample only first N files (we'll process each file individually)

In [9]:
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
# folder = configs.gcp_storage_folder

# print( str(configs.gcp_bucket) +"/"+ str(folder))
for blob in tqdm(list(bucket.list_blobs(prefix=posts_path))[:5]):
#     print(blob)
    print(blob.name)
    print(blob.name.split('/')[-1].split('.')[0])

  0%|          | 0/5 [00:00<?, ?it/s]

posts/top/2021-07-16/000000000000.parquet
000000000000
posts/top/2021-07-16/000000000001.parquet
000000000001
posts/top/2021-07-16/000000000002.parquet
000000000002
posts/top/2021-07-16/000000000003.parquet
000000000003
posts/top/2021-07-16/000000000004.parquet
000000000004
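
The stem extraction in the loop above (`blob.name.split('/')[-1].split('.')[0]`) could also be written with `pathlib`. A small sketch using one of the blob names printed above; `blob_name_to_stem` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def blob_name_to_stem(blob_name: str) -> str:
    """Return the file name without its directory or final extension."""
    # GCS object names use forward slashes, so PurePosixPath parses them on any OS
    return PurePosixPath(blob_name).stem


example_name = "posts/top/2021-07-16/000000000000.parquet"
print(blob_name_to_stem(example_name))
```

One difference to keep in mind: `.stem` only strips the last suffix, while `split('.')[0]` drops everything after the first dot. For names like the ones above the two agree.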


In [49]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"test_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2100,
    tf_limit_first_n_chars=1000,
    n_sample_posts=3500,
    n_sample_comments=5100,
)

11:22:16 | INFO | "Start vectorize function"
11:22:16 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1122"
11:22:17 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
11:22:17 | INFO | "  Saving config to local path..."
11:22:17 | INFO | "  Logging config to mlflow..."
11:22:18 | INFO | "Loading model use_multilingual..."




11:22:20 | INFO | "  0:00:02.280150 <- Load TF HUB model time elapsed"
11:22:20 | INFO | "Load subreddits df..."
11:22:21 | INFO | "  0:00:00.619620 <- df_subs loading time elapsed"
11:22:21 | INFO | "  (3767, 4) <- df_subs shape"
11:22:21 | INFO | "Vectorizing subreddit descriptions..."
11:22:21 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/2 [00:00<?, ?it/s]



11:22:25 | INFO | "  0:00:04.172970 <- df_subs vectorizing time elapsed"
11:22:25 | INFO | "  Saving to local... df_vect_subreddits_description..."
11:22:25 | INFO | "     7.8 MB <- Memory usage"
11:22:25 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:22:25 | INFO | "  Logging to mlflow..."
11:22:30 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
11:22:37 | INFO | "  0:00:06.666186 <- df_post loading time elapsed"
11:22:37 | INFO | "  (1649929, 6) <- df_posts.shape"
11:22:38 | INFO | "  Sampling posts down to: 3,500"
11:22:38 | INFO | "  (3500, 6) <- df_posts.shape AFTER sampling"
11:22:38 | INFO | "Vectorizing POSTS..."
11:22:38 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/2 [00:00<?, ?it/s]

11:22:40 | INFO | "  0:00:01.568036 <- df_posts vectorizing time elapsed"
11:22:40 | INFO | "  Saving to local... df_vect_posts..."
11:22:40 | INFO | "     7.5 MB <- Memory usage"
11:22:40 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:22:40 | INFO | "  Logging to mlflow..."
11:22:41 | INFO | "Load comments df..."
11:23:11 | INFO | "  (19200854, 6) <- df_comments shape"
11:23:23 | INFO | "Keep only comments that match posts IDs in df_posts..."
11:23:26 | INFO | "  (34463, 6) <- updated df_comments shape"
11:23:26 | INFO | "  Sampling COMMENTS down to: 5,100"
11:23:26 | INFO | "  (5100, 6) <- df_comments.shape AFTER sampling"
11:23:26 | INFO | "Vectorizing COMMENTS..."
11:23:26 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/3 [00:00<?, ?it/s]

11:23:28 | INFO | "  Saving to local... df_vect_comments..."
11:23:28 | INFO | "    11.2 MB <- Memory usage"
11:23:28 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:23:28 | INFO | "  Logging to mlflow..."
11:23:30 | INFO | "  0:01:13.306714 <- Total vectorize fxn time elapsed"




### Timing is super fast, even with a bigger sample size

```
11:40:06 | INFO | "Start vectorize function"
11:40:06 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1140"
11:40:07 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
11:40:07 | INFO | "  Saving config to local path..."
11:40:07 | INFO | "  Logging config to mlflow..."
11:40:08 | INFO | "Loading model use_multilingual..."
11:40:10 | INFO | "  0:00:02.417308 <- Load TF HUB model time elapsed"
11:40:10 | WARNING | "For TF-HUB models, the only preprocessing applied is lowercase()"
11:40:10 | INFO | "Load subreddits df..."
11:40:11 | INFO | "  0:00:00.519934 <- df_subs loading time elapsed"
11:40:11 | INFO | "  (3767, 4) <- df_subs shape"
11:40:11 | INFO | "Vectorizing subreddit descriptions..."
100%
2/2 [00:03<00:00, 1.65s/it]
11:40:15 | INFO | "  0:00:04.080246 <- df_subs vectorizing time elapsed"
...
11:40:16 | INFO | "Loading df_posts...
11:40:23 | INFO | "  0:00:06.460565 <- df_post loading time elapsed"
11:40:23 | INFO | "  (1649929, 6) <- df_posts.shape"
11:40:24 | INFO | "  Sampling posts down to: 9,500"
11:40:24 | INFO | "  (9500, 6) <- df_posts.shape AFTER sampling"
11:40:24 | INFO | "Vectorizing POSTS..."
100%
5/5 [00:03<00:00, 1.59it/s]
11:40:28 | INFO | "  0:00:03.774021 <- df_posts vectorizing time elapsed"
...
11:40:30 | INFO | "Load comments df..."
11:40:58 | INFO | "  (19200854, 6) <- df_comments shape"
11:41:10 | INFO | "Keep only comments that match posts IDs in df_posts..."
11:41:14 | INFO | "  (95313, 6) <- updated df_comments shape"
11:41:14 | INFO | "  Sampling COMMENTS down to: 19,100"
11:41:14 | INFO | "  (19100, 6) <- df_comments.shape AFTER sampling"
11:41:14 | INFO | "Vectorizing COMMENTS..."
100%
10/10 [00:05<00:00, 1.60it/s]
11:41:20 | INFO | "  0:00:06.239953 <- df_posts vectorizing time elapsed"
11:41:20 | INFO | "  Saving to local... df_vect_comments..."
11:41:20 | INFO | "    42.1 MB <- Memory usage"
11:41:20 | INFO | "       2	<- target Dask partitions	   30.0 <- target MB partition size"
11:41:21 | INFO | "  Logging to mlflow..."
11:41:23 | INFO | "  0:01:16.130234 <- Total vectorize fxn time elapsed"

```

In [51]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"test_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2100,
    tf_limit_first_n_chars=1000,
    n_sample_posts=9500,
    n_sample_comments=19100,
)

11:40:06 | INFO | "Start vectorize function"
11:40:06 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1140"
11:40:07 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
11:40:07 | INFO | "  Saving config to local path..."
11:40:07 | INFO | "  Logging config to mlflow..."
11:40:08 | INFO | "Loading model use_multilingual..."
11:40:10 | INFO | "  0:00:02.417308 <- Load TF HUB model time elapsed"
11:40:10 | INFO | "Load subreddits df..."
11:40:11 | INFO | "  0:00:00.519934 <- df_subs loading time elapsed"
11:40:11 | INFO | "  (3767, 4) <- df_subs shape"
11:40:11 | INFO | "Vectorizing subreddit descriptions..."
11:40:11 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/2 [00:00<?, ?it/s]

11:40:15 | INFO | "  0:00:04.080246 <- df_subs vectorizing time elapsed"
11:40:15 | INFO | "  Saving to local... df_vect_subreddits_description..."
11:40:15 | INFO | "     7.8 MB <- Memory usage"
11:40:15 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:40:15 | INFO | "  Logging to mlflow..."
11:40:16 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
11:40:23 | INFO | "  0:00:06.460565 <- df_post loading time elapsed"
11:40:23 | INFO | "  (1649929, 6) <- df_posts.shape"
11:40:24 | INFO | "  Sampling posts down to: 9,500"
11:40:24 | INFO | "  (9500, 6) <- df_posts.shape AFTER sampling"
11:40:24 | INFO | "Vectorizing POSTS..."
11:40:24 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/5 [00:00<?, ?it/s]

11:40:28 | INFO | "  0:00:03.774021 <- df_posts vectorizing time elapsed"
11:40:28 | INFO | "  Saving to local... df_vect_posts..."
11:40:28 | INFO | "    20.3 MB <- Memory usage"
11:40:28 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:40:28 | INFO | "  Logging to mlflow..."
11:40:30 | INFO | "Load comments df..."
11:40:58 | INFO | "  (19200854, 6) <- df_comments shape"
11:41:10 | INFO | "Keep only comments that match posts IDs in df_posts..."
11:41:14 | INFO | "  (95313, 6) <- updated df_comments shape"
11:41:14 | INFO | "  Sampling COMMENTS down to: 19,100"
11:41:14 | INFO | "  (19100, 6) <- df_comments.shape AFTER sampling"
11:41:14 | INFO | "Vectorizing COMMENTS..."
11:41:14 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/10 [00:00<?, ?it/s]

11:41:20 | INFO | "  0:00:06.239953 <- df_posts vectorizing time elapsed"
11:41:20 | INFO | "  Saving to local... df_vect_comments..."
11:41:20 | INFO | "    42.1 MB <- Memory usage"
11:41:20 | INFO | "       2	<- target Dask partitions	   30.0 <- target MB partition size"
11:41:21 | INFO | "  Logging to mlflow..."
11:41:23 | INFO | "  0:01:16.130234 <- Total vectorize fxn time elapsed"


In [None]:
BREAK

# Run full with `tokenize_lowercase=False`
Let's see if the current refactor is good enough or if I really need to manually batch files...

**Answer**: no, it wasn't. 60GB of RAM wasn't enough for 19 million comments _lol_.

```
...
12:02:14 | INFO | "  (19168154, 6) <- updated df_comments shape"
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]

<__array_function__ internals> in concatenate(*args, **kwargs)

MemoryError: Unable to allocate 36.6 GiB for an array with shape (512, 19168154) and data type float32
```
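
The size in the error message can be checked by hand: a float32 array needs 4 bytes per element, so the concatenated embeddings alone would take about 36.6 GiB, before counting the text dataframes and any intermediate copies that are also in RAM. A quick sketch of that arithmetic:

```python
n_dims = 512             # embedding width, from the array shape in the error
n_comments = 19_168_154  # rows, from the array shape in the error
bytes_per_float32 = 4

total_bytes = n_dims * n_comments * bytes_per_float32
total_gib = total_bytes / 1024 ** 3
print(f"{total_gib:.1f} GiB")
```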


In [10]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"test_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2100,
    tf_limit_first_n_chars=1000,
    
#     n_sample_posts=9500,
#     n_sample_comments=19100,
)

11:49:53 | INFO | "Start vectorize function"
11:49:53 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1149"
11:49:53 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
11:49:54 | INFO | "  Saving config to local path..."
11:49:54 | INFO | "  Logging config to mlflow..."
11:49:54 | INFO | "Loading model use_multilingual..."
11:49:57 | INFO | "  0:00:02.616759 <- Load TF HUB model time elapsed"
11:49:57 | INFO | "Load subreddits df..."
11:49:58 | INFO | "  0:00:01.091284 <- df_subs loading time elapsed"
11:49:58 | INFO | "  (3767, 4) <- df_subs shape"
11:49:58 | INFO | "Vectorizing subreddit descriptions..."
11:49:58 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/2 [00:00<?, ?it/s]

11:50:02 | INFO | "  0:00:04.246728 <- df_subs vectorizing time elapsed"
11:50:02 | INFO | "  Saving to local... df_vect_subreddits_description..."
11:50:02 | INFO | "     7.8 MB <- Memory usage"
11:50:02 | INFO | "       1	<- target Dask partitions	   30.0 <- target MB partition size"
11:50:03 | INFO | "  Logging to mlflow..."
11:50:04 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
11:50:11 | INFO | "  0:00:07.403893 <- df_post loading time elapsed"
11:50:11 | INFO | "  (1649929, 6) <- df_posts.shape"
11:50:12 | INFO | "Vectorizing POSTS..."
11:50:12 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/786 [00:00<?, ?it/s]

12:00:12 | INFO | "  0:10:00.597031 <- df_posts vectorizing time elapsed"
12:00:14 | INFO | "  Saving to local... df_vect_posts..."
12:00:14 | INFO | "  3,532.8 MB <- Memory usage"
12:00:14 | INFO | "      48	<- target Dask partitions	   75.0 <- target MB partition size"
12:00:33 | INFO | "  Logging to mlflow..."
12:01:23 | INFO | "Load comments df..."
12:01:58 | INFO | "  (19200854, 6) <- df_comments shape"
12:02:10 | INFO | "Keep only comments that match posts IDs in df_posts..."
12:02:14 | INFO | "  (19168154, 6) <- updated df_comments shape"
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"


  0%|          | 0/9128 [00:00<?, ?it/s]

MemoryError: Unable to allocate 36.6 GiB for an array with shape (512, 19168154) and data type float32

In [10]:
mlflow.end_run(status='KILLED')

## Test - Re-do comments with new batching logic
Vectorizing all 19 million comments at once ran out of memory, sigh, so I need to batch one file at a time.
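
A minimal sketch of the one-file-at-a-time idea, assuming the loader, embedder and saver are passed in as callables. All names here (`vectorize_files_one_at_a_time`, `load_texts`, `embed_batch`, `save_embeddings`) are hypothetical and only illustrate the pattern; they are not the actual `vectorize_text_tf` implementation.

```python
import gc
from typing import Callable, Iterable, List, Sequence


def vectorize_files_one_at_a_time(
    file_names: Iterable[str],
    load_texts: Callable[[str], Sequence[str]],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    save_embeddings: Callable[[str, List[List[float]]], None],
    batch_size: int = 2100,
) -> int:
    """Embed each file separately so this loop holds one file's vectors at a time."""
    n_rows_total = 0
    for file_name in file_names:
        texts = load_texts(file_name)
        file_vectors: List[List[float]] = []
        for start in range(0, len(texts), batch_size):
            file_vectors.extend(embed_batch(texts[start:start + batch_size]))
        # Persist this file's output before moving on to the next file
        save_embeddings(file_name, file_vectors)
        n_rows_total += len(file_vectors)
        # Drop local references so memory can be reclaimed before the next file
        del texts, file_vectors
        gc.collect()
    return n_rows_total


# Toy stand-ins (hypothetical): two "files" and a fake 2-dimensional embedder
fake_files = {"000000000000": ["a", "bb", "ccc"], "000000000001": ["dddd"]}
saved = {}
n_rows = vectorize_files_one_at_a_time(
    file_names=fake_files,
    load_texts=lambda name: fake_files[name],
    embed_batch=lambda batch: [[float(len(t)), 0.0] for t in batch],
    save_embeddings=lambda name, vecs: saved.update({name: vecs}),
    batch_size=2,
)
print(n_rows, {name: len(vecs) for name, vecs in saved.items()})
```

The point of the design is that nothing is concatenated across files, so peak memory scales with the largest single file rather than with the full comment set. In a real run the saver would write to disk; the toy saver here keeps results in a dict only so the example can be checked.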

In [15]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"test_new_fxn{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
#     subreddits_path=subreddits_path,
#     posts_path=posts_path,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,
    
    tf_batch_inference_rows=6100,
    tf_limit_first_n_chars=750,
    
    n_sample_comment_files=5,
    n_sample_comments=49100,
#     n_sample_posts=9500,

)

18:46:09 | INFO | "Start vectorize function"
18:46:09 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1846"
18:46:10 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
18:46:10 | INFO | "  Saving config to local path..."
18:46:10 | INFO | "  Logging config to mlflow..."
18:46:10 | INFO | "Loading model use_multilingual..."
18:46:13 | INFO | "  0:00:02.357815 <- Load TF HUB model time elapsed"
18:46:13 | INFO | "** Procesing Comments files one at a time ***"
18:46:13 | INFO | "-- Loading & vectorizing COMMENTS in files: 5 --
Expected batch size: 6100"
local variable 'df_posts' referenced before assignment"


  0%|          | 0/5 [00:00<?, ?it/s]

18:46:14 | INFO | "  Sampling COMMENTS down to: 49,100     Samples PER FILE: 9,821"
18:46:14 | INFO | "  (9821, 6) <- df_comments.shape AFTER sampling"


  0%|          | 0/2 [00:00<?, ?it/s]

18:46:19 | INFO | "  Saving to local: df_vect_comments/000000000000 | 9,821 Rows by 516 Cols"
18:46:20 | INFO | "  Sampling COMMENTS down to: 49,100     Samples PER FILE: 9,821"
18:46:20 | INFO | "  (9821, 6) <- df_comments.shape AFTER sampling"


  0%|          | 0/2 [00:00<?, ?it/s]

18:46:24 | INFO | "  Saving to local: df_vect_comments/000000000001 | 9,821 Rows by 516 Cols"
18:46:26 | INFO | "  Sampling COMMENTS down to: 49,100     Samples PER FILE: 9,821"
18:46:26 | INFO | "  (9821, 6) <- df_comments.shape AFTER sampling"


  0%|          | 0/2 [00:00<?, ?it/s]

18:46:30 | INFO | "  Saving to local: df_vect_comments/000000000002 | 9,821 Rows by 516 Cols"
18:46:31 | INFO | "  Sampling COMMENTS down to: 49,100     Samples PER FILE: 9,821"
18:46:31 | INFO | "  (9821, 6) <- df_comments.shape AFTER sampling"


  0%|          | 0/2 [00:00<?, ?it/s]

18:46:35 | INFO | "  Saving to local: df_vect_comments/000000000003 | 9,821 Rows by 516 Cols"
18:46:37 | INFO | "  Sampling COMMENTS down to: 49,100     Samples PER FILE: 9,821"
18:46:37 | INFO | "  (9821, 6) <- df_comments.shape AFTER sampling"


  0%|          | 0/2 [00:00<?, ?it/s]

18:46:41 | INFO | "  Saving to local: df_vect_comments/000000000004 | 9,821 Rows by 516 Cols"
18:46:44 | INFO | "  0:00:34.354815 <- Total vectorize fxn time elapsed"


### Re-run comments and log to non-test mlflow experiment


Besides file-batching, this job increased the row batch size from 2,000 to 6,100... it's unclear whether this is having a negative impact. Maybe smaller batches are somehow more efficient?
Now that I'm reading one file at a time, it looks like speed is taking a big hit.

Baseline: vectorizing everything in memory took `1:32:26`, but then it ran out of memory (RAM).
The current ETA is around `2 hours`.

```
# single file, all in memory (results in OOM)
12:02:14 | INFO | "Vectorizing COMMENTS..."
12:02:14 | INFO | "Getting embeddings in batches of size: 2100"
100%
9128/9128 [1:32:26<00:00, 1.97it/s]


# one file at a time... slower, but we get results one file at a time...
16%
6/37 [21:11<1:49:46, 212.45s/it]
```
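
The `2 hours` estimate follows from the progress bar above: 37 files at about 212.45 seconds per file. A quick check of that arithmetic, with the in-memory vectorizing time alongside for comparison:

```python
from datetime import timedelta

n_files = 37
seconds_per_file = 212.45  # rate shown by the progress bar after 6 files

eta_per_file_run = timedelta(seconds=round(n_files * seconds_per_file))
in_memory_run = timedelta(hours=1, minutes=32, seconds=26)

print(eta_per_file_run)
# Dividing two timedeltas gives a plain float ratio
slowdown = eta_per_file_run / in_memory_run
print(round(slowdown, 2))
```

The comparison is only approximate: the two runs also differ in batch size and character limit, and the per-file rate includes loading and saving each file.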


In [None]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"new_batch_fxn_{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    
    bucket_name=bucket_name,
#     subreddits_path=subreddits_path,
#     posts_path=posts_path,
    subreddits_path=None,
    posts_path=None,
    comments_path=comments_path,
    
    tf_batch_inference_rows=6100,
    tf_limit_first_n_chars=850,
    
    n_sample_comment_files=None,
    n_sample_comments=None,
#     n_sample_posts=None,
)

18:59:48 | INFO | "Start vectorize function"
18:59:48 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_1859"
18:59:48 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db"
18:59:49 | INFO | "  Saving config to local path..."
18:59:49 | INFO | "  Logging config to mlflow..."
18:59:49 | INFO | "Loading model use_multilingual..."
18:59:51 | INFO | "  0:00:02.346687 <- Load TF HUB model time elapsed"
18:59:51 | INFO | "** Procesing Comments files one at a time ***"
18:59:51 | INFO | "-- Loading & vectorizing COMMENTS in files: 37 --
Expected batch size: 6100"
local variable 'df_posts' referenced before assignment"


  0%|          | 0/37 [00:00<?, ?it/s]

18:59:51 | INFO | "Processing: comments/top/2021-07-09/000000000000.parquet"


  0%|          | 0/90 [00:00<?, ?it/s]

19:03:19 | INFO | "  Saving to local: df_vect_comments/000000000000 | 546,915 Rows by 516 Cols"
19:03:29 | INFO | "Processing: comments/top/2021-07-09/000000000001.parquet"


  0%|          | 0/96 [00:00<?, ?it/s]

19:07:09 | INFO | "  Saving to local: df_vect_comments/000000000001 | 582,195 Rows by 516 Cols"
19:07:21 | INFO | "Processing: comments/top/2021-07-09/000000000002.parquet"


  0%|          | 0/79 [00:00<?, ?it/s]

19:10:25 | INFO | "  Saving to local: df_vect_comments/000000000002 | 478,463 Rows by 516 Cols"
19:10:36 | INFO | "Processing: comments/top/2021-07-09/000000000003.parquet"


  0%|          | 0/77 [00:00<?, ?it/s]

19:13:28 | INFO | "  Saving to local: df_vect_comments/000000000003 | 467,274 Rows by 516 Cols"
19:13:39 | INFO | "Processing: comments/top/2021-07-09/000000000004.parquet"


  0%|          | 0/96 [00:00<?, ?it/s]

19:17:15 | INFO | "  Saving to local: df_vect_comments/000000000004 | 584,222 Rows by 516 Cols"
19:17:26 | INFO | "Processing: comments/top/2021-07-09/000000000005.parquet"


  0%|          | 0/91 [00:00<?, ?it/s]

19:20:52 | INFO | "  Saving to local: df_vect_comments/000000000005 | 551,542 Rows by 516 Cols"
19:21:03 | INFO | "Processing: comments/top/2021-07-09/000000000006.parquet"


  0%|          | 0/97 [00:00<?, ?it/s]

19:24:43 | INFO | "  Saving to local: df_vect_comments/000000000006 | 588,569 Rows by 516 Cols"
19:24:54 | INFO | "Processing: comments/top/2021-07-09/000000000007.parquet"


  0%|          | 0/91 [00:00<?, ?it/s]

19:28:20 | INFO | "  Saving to local: df_vect_comments/000000000007 | 549,341 Rows by 516 Cols"
19:28:32 | INFO | "Processing: comments/top/2021-07-09/000000000008.parquet"


  0%|          | 0/74 [00:00<?, ?it/s]

19:31:19 | INFO | "  Saving to local: df_vect_comments/000000000008 | 447,740 Rows by 516 Cols"
19:31:31 | INFO | "Processing: comments/top/2021-07-09/000000000009.parquet"


  0%|          | 0/91 [00:00<?, ?it/s]

19:34:57 | INFO | "  Saving to local: df_vect_comments/000000000009 | 554,171 Rows by 516 Cols"
19:35:09 | INFO | "Processing: comments/top/2021-07-09/000000000010.parquet"


  0%|          | 0/79 [00:00<?, ?it/s]

19:38:08 | INFO | "  Saving to local: df_vect_comments/000000000010 | 478,748 Rows by 516 Cols"
19:38:19 | INFO | "Processing: comments/top/2021-07-09/000000000011.parquet"


  0%|          | 0/83 [00:00<?, ?it/s]

19:41:27 | INFO | "  Saving to local: df_vect_comments/000000000011 | 506,263 Rows by 516 Cols"
19:41:38 | INFO | "Processing: comments/top/2021-07-09/000000000012.parquet"


  0%|          | 0/89 [00:00<?, ?it/s]

19:44:57 | INFO | "  Saving to local: df_vect_comments/000000000012 | 538,231 Rows by 516 Cols"
19:45:09 | INFO | "Processing: comments/top/2021-07-09/000000000013.parquet"


  0%|          | 0/85 [00:00<?, ?it/s]

19:48:23 | INFO | "  Saving to local: df_vect_comments/000000000013 | 516,219 Rows by 516 Cols"
19:48:35 | INFO | "Processing: comments/top/2021-07-09/000000000014.parquet"


  0%|          | 0/90 [00:00<?, ?it/s]

19:51:58 | INFO | "  Saving to local: df_vect_comments/000000000014 | 543,914 Rows by 516 Cols"
19:52:10 | INFO | "Processing: comments/top/2021-07-09/000000000015.parquet"


  0%|          | 0/96 [00:00<?, ?it/s]

19:55:46 | INFO | "  Saving to local: df_vect_comments/000000000015 | 583,580 Rows by 516 Cols"
19:55:59 | INFO | "Processing: comments/top/2021-07-09/000000000016.parquet"


  0%|          | 0/88 [00:00<?, ?it/s]

19:59:15 | INFO | "  Saving to local: df_vect_comments/000000000016 | 533,066 Rows by 516 Cols"
19:59:27 | INFO | "Processing: comments/top/2021-07-09/000000000017.parquet"


  0%|          | 0/88 [00:00<?, ?it/s]

20:02:47 | INFO | "  Saving to local: df_vect_comments/000000000017 | 536,540 Rows by 516 Cols"
20:02:58 | INFO | "Processing: comments/top/2021-07-09/000000000018.parquet"


  0%|          | 0/75 [00:00<?, ?it/s]

20:05:50 | INFO | "  Saving to local: df_vect_comments/000000000018 | 455,375 Rows by 516 Cols"
20:06:02 | INFO | "Processing: comments/top/2021-07-09/000000000019.parquet"


  0%|          | 0/94 [00:00<?, ?it/s]

20:09:34 | INFO | "  Saving to local: df_vect_comments/000000000019 | 570,490 Rows by 516 Cols"
20:09:45 | INFO | "Processing: comments/top/2021-07-09/000000000020.parquet"


  0%|          | 0/99 [00:00<?, ?it/s]

20:13:28 | INFO | "  Saving to local: df_vect_comments/000000000020 | 600,022 Rows by 516 Cols"
20:13:40 | INFO | "Processing: comments/top/2021-07-09/000000000021.parquet"


  0%|          | 0/83 [00:00<?, ?it/s]

20:16:51 | INFO | "  Saving to local: df_vect_comments/000000000021 | 504,174 Rows by 516 Cols"
20:17:03 | INFO | "Processing: comments/top/2021-07-09/000000000022.parquet"


  0%|          | 0/80 [00:00<?, ?it/s]

20:20:02 | INFO | "  Saving to local: df_vect_comments/000000000022 | 485,072 Rows by 516 Cols"
20:20:14 | INFO | "Processing: comments/top/2021-07-09/000000000023.parquet"


  0%|          | 0/80 [00:00<?, ?it/s]

20:23:14 | INFO | "  Saving to local: df_vect_comments/000000000023 | 482,067 Rows by 516 Cols"
20:23:25 | INFO | "Processing: comments/top/2021-07-09/000000000024.parquet"


  0%|          | 0/78 [00:00<?, ?it/s]

20:26:21 | INFO | "  Saving to local: df_vect_comments/000000000024 | 473,130 Rows by 516 Cols"
20:26:32 | INFO | "Processing: comments/top/2021-07-09/000000000025.parquet"


  0%|          | 0/92 [00:00<?, ?it/s]

20:30:01 | INFO | "  Saving to local: df_vect_comments/000000000025 | 560,305 Rows by 516 Cols"
20:30:13 | INFO | "Processing: comments/top/2021-07-09/000000000026.parquet"


  0%|          | 0/84 [00:00<?, ?it/s]

20:33:21 | INFO | "  Saving to local: df_vect_comments/000000000026 | 507,379 Rows by 516 Cols"
20:33:32 | INFO | "Processing: comments/top/2021-07-09/000000000027.parquet"


  0%|          | 0/87 [00:00<?, ?it/s]

20:36:48 | INFO | "  Saving to local: df_vect_comments/000000000027 | 527,642 Rows by 516 Cols"
20:36:59 | INFO | "Processing: comments/top/2021-07-09/000000000028.parquet"


  0%|          | 0/83 [00:00<?, ?it/s]

20:40:07 | INFO | "  Saving to local: df_vect_comments/000000000028 | 505,776 Rows by 516 Cols"
20:40:19 | INFO | "Processing: comments/top/2021-07-09/000000000029.parquet"


  0%|          | 0/81 [00:00<?, ?it/s]

20:43:22 | INFO | "  Saving to local: df_vect_comments/000000000029 | 490,102 Rows by 516 Cols"
20:43:34 | INFO | "Processing: comments/top/2021-07-09/000000000030.parquet"


  0%|          | 0/94 [00:00<?, ?it/s]

20:47:08 | INFO | "  Saving to local: df_vect_comments/000000000030 | 572,696 Rows by 516 Cols"
20:47:20 | INFO | "Processing: comments/top/2021-07-09/000000000031.parquet"


  0%|          | 0/62 [00:00<?, ?it/s]

20:49:42 | INFO | "  Saving to local: df_vect_comments/000000000031 | 375,509 Rows by 516 Cols"
20:49:53 | INFO | "Processing: comments/top/2021-07-09/000000000032.parquet"


  0%|          | 0/89 [00:00<?, ?it/s]

20:53:17 | INFO | "  Saving to local: df_vect_comments/000000000032 | 539,335 Rows by 516 Cols"
20:53:29 | INFO | "Processing: comments/top/2021-07-09/000000000033.parquet"


  0%|          | 0/77 [00:00<?, ?it/s]

20:56:23 | INFO | "  Saving to local: df_vect_comments/000000000033 | 468,589 Rows by 516 Cols"
20:56:34 | INFO | "Processing: comments/top/2021-07-09/000000000034.parquet"


  0%|          | 0/84 [00:00<?, ?it/s]
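
The inner progress-bar totals in the log above are consistent with each file being split into batches of `tf_batch_inference_rows=6100` rows: the row count divided by the batch size, rounded up. A minimal sketch of that arithmetic, using row counts copied from the log and assuming the saved row count equals the number of rows vectorized. The helper name `expected_batches` is hypothetical and not part of `subclu`.

```python
import math


def expected_batches(n_rows: int, batch_size: int) -> int:
    """Number of inference batches needed to cover n_rows (hypothetical helper)."""
    return math.ceil(n_rows / batch_size)


# file name -> (rows saved, progress-bar total), copied from the log above
logged = {
    "000000000000": (546_915, 90),
    "000000000001": (582_195, 96),
    "000000000002": (478_463, 79),
    "000000000031": (375_509, 62),
}
batch_size = 6100

# Compare the computed batch count with the total shown by each progress bar
results = {}
for name, (n_rows, bar_total) in logged.items():
    n_batches = expected_batches(n_rows, batch_size)
    results[name] = n_batches == bar_total
    print(name, n_rows, n_batches, results[name])
```

If a progress-bar total ever disagrees with this count, that would suggest rows were dropped or filtered between loading and vectorizing.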

# Run full with `tokenize_lowercase=True`

In [None]:
mlflow.end_run(status='KILLED')

vectorize_text_tf.vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name=f"test_new_fxn_{datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')}",
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2100,
    tf_limit_first_n_chars=1000,
    
#     n_sample_posts=9500,
#     n_sample_comments=19100,
)

In [None]:
gc.collect()

In [None]:
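# LEGACY: the cells below call the older `vectorize_text_to_embeddings` from `subclu.models.vectorize_text`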

# Run full with `tokenize_lowercase=False`

Time on CPU, only comments + subs:
```
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:08 | INFO | "  (629, 4) <- df_subs shape"

13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
```

In [18]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='full_data-lowercase_false',
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    subreddits_path='subreddits/de/2021-06-16',
    posts_path=None,  # 'posts/de/2021-06-16',
    comments_path='comments/de/2021-06-16',
    tf_batch_inference_rows=1500,
    tf_limit_first_n_chars=1100,
    n_sample_posts=None,
    n_sample_comments=None,
)

13:28:50 | INFO | "Start vectorize function"
13:28:50 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-01_1328"
13:28:50 | INFO | "Load comments df..."
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:07 | INFO | "Keep only comments that match posts IDs in df_posts..."
13:29:07 | INFO | "df_posts missing, so we can't filter comments..."
13:29:07 | INFO | "Load subreddits df..."
13:29:08 | INFO | "  (629, 4) <- df_subs shape"
13:29:08 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
13:29:09 | INFO | "Loading model use_multilingual...
  with kwargs: None"




13:29:11 | INFO | "  0:00:02.282361 <- Load TF HUB model time elapsed"
13:29:11 | INFO | "Vectorizing subreddit descriptions..."




13:29:13 | INFO | "  Saving to local... df_vect_subreddits_description..."
13:29:13 | INFO | "  Logging to mlflow..."




13:29:14 | INFO | "Vectorizing COMMENTS..."
13:29:14 | INFO | "Getting embeddings in batches of size: 1500"


  0%|          | 0/740 [00:00<?, ?it/s]

13:44:30 | INFO | "  Saving to local... df_vect_comments..."
13:44:49 | INFO | "  Logging to mlflow..."
13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
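
The log timestamps above are enough for a rough throughput estimate for the comments step. A minimal sketch, assuming both timestamps fall on the same day and that the window between "Vectorizing COMMENTS..." and the first "Saving to local" line is dominated by inference. `rows_per_second` is a hypothetical helper, not part of `subclu`.

```python
from datetime import datetime


def rows_per_second(start: str, end: str, n_rows: int) -> float:
    """Rows processed per second between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return n_rows / elapsed.total_seconds()


# "Vectorizing COMMENTS..." -> "Saving to local... df_vect_comments..."
# Row count taken from the df_comments shape in the log above
legacy_rate = rows_per_second("13:29:14", "13:44:30", 1_108_757)
print(f"{legacy_rate:,.0f} comments/second")
```

The same helper can be pointed at other runs, but rates are only comparable when batch size, character limit, and hardware match.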


# Check mlflow experiment & Read artifact

In [27]:
df_mlf_exp = mlf.list_experiment_meta(output_format='pandas')
df_mlf_exp

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Check runs in experiment

In [28]:
exp_id = df_mlf_exp.loc[df_mlf_exp['name'] == mlflow_experiment_test, 
                        'experiment_id'].values[0]

mlf.search_all_runs(experiment_ids=[exp_id]).head(8)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_vect_subreddits_description_cols,metrics.df_vect_subreddits_description_rows,metrics.vectorizing_time_minutes,metrics.df_vect_comments_rows,metrics.df_vect_comments_cols,metrics.df_vect_posts_cols,metrics.df_vect_posts_rows,params.model_location,params.col_post_id,params.bucket_name,params.tokenize_lowercase,params.posts_path,params.col_comment_id,params.col_text_post,params.preprocess_text_folder,params.tf_limit_first_n_chars,params.subreddits_path,params.n_sample_comments,params.col_subreddit_id,params.col_text_subreddit_word_count,params.col_text_comment_word_count,params.comments_path,params.tokenize_function,params.col_text_post_word_count,params.col_text_subreddit_description,params.col_text_post_url,params.host_name,params.model_name,params.n_sample_posts,params.tf_batch_inference_rows,params.col_text_comment,tags.mlflow.source.git.commit,tags.mlflow.runName,tags.host_name,tags.mlflow.source.type,tags.mlflow.user,tags.mlflow.source.name
0,5e97065c83674451acca4eb66ea8b5f7,9,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/9/5e97065c83674451acca4eb66ea8b5f7/artifacts,2021-07-29 08:28:12.764000+00:00,2021-07-29 08:28:25.132000+00:00,512.0,3767.0,1.109042,5100.0,512.0,512.0,2500.0,https://tfhub.dev/google/universal-sentence-encoder-multilingual/3,post_id,i18n-subreddit-clustering,True,posts/top/2021-07-16,comment_id,text,,1000,subreddits/top/2021-07-16,5100,subreddit_id,subreddit_name_title_and_clean_descriptions_word_count,comment_text_word_count,comments/top/2021-07-09,sklearn,text_word_count,subreddit_name_title_and_clean_descriptions,post_url_for_embeddings,djb-subclu-inference-tf-2-3-20210630,use_multilingual,2500,2000,comment_body_text,636ffe8ca480035297dfc650c1c002676ceb5aa6,test_n_samples,djb-subclu-inference-tf-2-3-20210630,LOCAL,jupyter,/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py
1,4034ea26c5bf45f980a3132e9b28f677,9,KILLED,gs://i18n-subreddit-clustering/mlflow/mlruns/9/4034ea26c5bf45f980a3132e9b28f677/artifacts,2021-07-29 08:24:53.166000+00:00,2021-07-29 08:27:18.525000+00:00,,,,,,,,https://tfhub.dev/google/universal-sentence-encoder-multilingual/3,post_id,i18n-subreddit-clustering,True,posts/top/2021-07-16,comment_id,text,,1000,subreddits/top/2021-07-16,2100,subreddit_id,subreddit_name_title_and_clean_descriptions_word_count,comment_text_word_count,comments/top/2021-07-09,sklearn,text_word_count,subreddit_name_title_and_clean_descriptions,post_url_for_embeddings,djb-subclu-inference-tf-2-3-20210630,use_multilingual,1500,3000,comment_body_text,636ffe8ca480035297dfc650c1c002676ceb5aa6,test_n_samples,djb-subclu-inference-tf-2-3-20210630,LOCAL,jupyter,/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py
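
The `.values[0]` lookup above raises a bare `IndexError` when no experiment has the requested name. A minimal sketch of a more explicit lookup, assuming the experiment table has the `name` and `experiment_id` columns shown earlier. `get_experiment_id` is a hypothetical helper, not part of `MlflowLogger`, and `df_demo` is a small stand-in for `df_mlf_exp`.

```python
import pandas as pd


def get_experiment_id(df_experiments: pd.DataFrame, name: str):
    """Return the experiment_id for `name`, or raise a readable error."""
    matches = df_experiments.loc[df_experiments["name"] == name, "experiment_id"]
    if matches.empty:
        raise KeyError(f"No experiment named {name!r}")
    return matches.iloc[0]


# Stand-in for df_mlf_exp, with two rows copied from the experiment table above
df_demo = pd.DataFrame(
    {
        "experiment_id": [8, 9],
        "name": [
            "use_multilingual_v1_aggregates",
            "v0.3.2_use_multi_inference_test",
        ],
    }
)
exp_id_demo = get_experiment_id(df_demo, "v0.3.2_use_multi_inference_test")
print(exp_id_demo)
```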


In [15]:
%%time

run_id = '45201072143a4d7fbb86a2f2b7d85520'

df_v_subs = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_subreddits_description',
    read_function=pd.read_parquet,
)
print(df_v_subs.shape)

(629, 512)
CPU times: user 169 ms, sys: 0 ns, total: 169 ms
Wall time: 1.75 s


In [23]:
np.allclose(df_vect_subs, df_v_subs)

True
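
`np.allclose` compares only the underlying values, so a check like the one above would still pass if the artifact came back with different index labels. A minimal sketch of a stricter comparison with `pandas.testing.assert_frame_equal`, which also checks index and column labels. The two small frames are made-up stand-ins for `df_vect_subs` and `df_v_subs`, and their values are illustrative only.

```python
import numpy as np
import pandas as pd

cols = ["embeddings_0", "embeddings_1"]
df_in_memory = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]], index=["t5_22i0", "t5_37k29"], columns=cols
)
# Same values, but the second index label differs
df_from_artifact = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]], index=["t5_22i0", "t5_2qi4z"], columns=cols
)

# Value-only comparison: ignores index and column labels
values_match = np.allclose(df_in_memory, df_from_artifact)

# Label-aware comparison: tolerant on floats, strict on index and columns
try:
    pd.testing.assert_frame_equal(
        df_in_memory, df_from_artifact, check_exact=False
    )
    labels_match = True
except AssertionError:
    labels_match = False

print(values_match, labels_match)
```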

In [16]:
df_v_subs.iloc[:5, :10]

subreddit_name,subreddit_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
de,t5_22i0,-0.018191,-0.045794,0.035795,-0.036392,0.033076,0.013654,-0.013067,-0.031252,0.001283,0.0247
ich_iel,t5_37k29,-0.019543,-0.00259,-0.002255,0.00964,-0.080638,0.053921,0.068653,-0.051635,0.038154,0.016369
nicoledobrikov1,t5_3oioc0,0.00024,0.0437,-0.030162,-0.023365,0.051887,0.050446,0.013388,-0.049501,-0.059686,-0.068271
germany,t5_2qi4z,0.030575,-0.057457,0.007206,0.029543,-0.003699,0.064915,-0.033345,-0.066493,-0.01916,0.014145
germansgonewild,t5_37g5b,0.022604,-0.032705,-0.016022,0.06629,0.052799,0.029996,0.008364,-0.049809,-0.004913,-0.056319


In [17]:
%%time

df_v_posts = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_posts',
    read_function=pd.read_parquet,
)
print(df_v_posts.shape)

(1500, 512)
CPU times: user 99.4 ms, sys: 92 ms, total: 191 ms
Wall time: 1.77 s


In [24]:
np.allclose(df_vect, df_v_posts)

True

In [27]:
df_v_posts.iloc[14:20, :10]

subreddit_name,subreddit_id,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
pcbaumeister,t5_4c1x98,t3_nlj0xz,-0.028007,0.018425,0.025313,-0.084277,-0.044653,0.036714,-0.084001,-0.060739,0.022273,-0.055197
dagibeehot,t5_wv7c1,t3_mzpji1,0.079956,0.062191,0.042096,0.028683,-0.014719,0.002513,-0.016711,0.036436,-0.033169,-0.016168
germansgonewild,t5_37g5b,t3_nkhuwl,-0.04588,0.025566,0.004287,0.02,-0.086096,0.016448,-0.003725,0.049456,-0.073738,-0.021704
de,t5_22i0,t3_nkm3hr,-0.05499,0.009562,0.011608,0.017721,0.01471,0.058977,0.061449,0.020423,-0.010647,0.038405
de,t5_22i0,t3_mpc8ai,-0.056767,-0.07399,0.057309,0.051738,0.019686,0.081643,-0.010165,0.045042,-0.045683,-0.015345
huebi,t5_29zucx,t3_mubs9j,0.15478,0.00766,0.06628,-0.002162,-0.080812,0.075854,0.000574,0.07894,-0.122165,-0.002068


In [21]:
%%time

df_v_comments = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_comments',
    read_function=pd.read_parquet,
)
print(df_v_comments.shape)

(2100, 512)
CPU times: user 441 ms, sys: 64.8 ms, total: 506 ms
Wall time: 2.05 s


In [25]:
np.allclose(df_vect_comments, df_v_comments)

True

In [26]:
df_v_comments.iloc[:5, :10]

subreddit_name,subreddit_id,post_id,comment_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
de,t5_22i0,t3_n1db9m,t1_gwgumip,-0.050319,0.019366,0.008127,-0.035954,0.058837,0.018773,-0.077046,-0.080273,-0.007283,0.020304
ich_iel,t5_37k29,t3_muosjc,t1_gv7rlal,-0.020838,0.016757,-0.027872,0.005312,0.046645,0.074642,0.02286,-0.041156,0.009235,-0.068941
buenzli,t5_2xbtv,t3_ngs8bj,t1_gysjscj,-0.038592,-0.034569,-0.045555,0.006089,-0.044613,0.008128,0.023125,-0.062052,-0.024423,-0.032473
nicoledobrikovof,t5_3k1wb9,t3_noa9fo,t1_h0dfyag,0.020388,-0.063959,0.013214,-0.057574,0.054215,0.06014,-0.015974,-0.032665,-0.087324,0.022982
de,t5_22i0,t3_ngydq1,t1_gyu585z,-0.052664,0.04226,0.013913,0.053029,0.043332,0.046601,-0.062652,-0.046233,-0.016664,0.081627
