# Purpose

2021-07-28: Run the vectorization on the top subreddits + German subs. Ideally, this should help us find counterpart subs in other languages.

---

This notebook runs the `vectorize_text_to_embeddings` function to:
- load the USE-multilingual model
- load post & comment text
- convert the text into embeddings (at post or comment level)


# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# from datetime import datetime
import gc
# from functools import partial
# import os
import logging
# from pathlib import Path
# from pprint import pprint

import mlflow

import numpy as np
import pandas as pd

# TF libraries... I've been getting errors when these aren't loaded
import tensorflow_text
import tensorflow as tf

import subclu
from subclu.models.vectorize_text import (
    vectorize_text_to_embeddings,
)
from subclu.utils import set_working_directory
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)


print_lib_versions([mlflow, np, mlflow, pd, tensorflow_text, tf, subclu])

python		v 3.7.10
===
mlflow		v: 1.16.0
numpy		v: 1.18.5
mlflow		v: 1.16.0
pandas		v: 1.2.5
tensorflow_text	v: 2.3.0
tensorflow	v: 2.3.3
subclu		v: 0.3.2


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Initialize mlflow logging with sqlite database

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-subclu-inference-tf-2-3-20210630/mlruns.db'

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


# Check whether we have access to a GPU

In [6]:
l_phys_gpus = tf.config.list_physical_devices('GPU')
from tensorflow.python.client import device_lib

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:\n{l_phys_gpus}"
    f"\n\nAll devices:\n===\n"
    f"{device_lib.list_local_devices()}"
)


Built with CUDA? True
GPUs
===
Num GPUs Available: 1
GPU details:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

All devices:
===
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15323824425429830595
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9701032267388460439
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1515914311549280819
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14676252416
locality {
  bus_id: 1
  links {
  }
}
incarnation: 4718140592461502921
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]


# Call function to vectorize text

- Batch size: 3,000 rows
- Limit characters to: 1,000

These settings finally leave enough headroom: the job uses around 50% of RAM (of 60 GB).

The problem is that each iteration takes around 3 minutes, which means the whole job for German only will take around 4 hours 42 minutes.
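
The estimate above is just the number of batches multiplied by the time per batch. A minimal sketch of that arithmetic follows. The function name `estimate_runtime` is hypothetical, and the row count of 282,000 is an assumption chosen so the result lines up with the 4:42 figure; it is not taken from the data.

```python
import math
from datetime import timedelta


def estimate_runtime(n_rows: int, batch_size: int, minutes_per_batch: float) -> timedelta:
    """Estimate total inference time, assuming every batch takes the same time."""
    n_batches = math.ceil(n_rows / batch_size)
    return timedelta(minutes=n_batches * minutes_per_batch)


# Hypothetical row count; batch size and ~3 min per batch come from the notes above
est = estimate_runtime(n_rows=282_000, batch_size=3000, minutes_per_batch=3)
print(est)
```

Because the time per batch depends on text length and hardware, this is only a rough planning number.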

In [10]:
mlflow_experiment_test = 'v0.3.2_use_multi_inference_test'

bucket_name = 'i18n-subreddit-clustering'
subreddits_path = "subreddits/top/2021-07-16"
posts_path = 'posts/top/2021-07-16'
comments_path = 'comments/top/2021-07-09'

```
When subreddit_id column was missing:
CPU times: user 75.8 ms, sys: 21.1 ms, total: 96.9 ms
Wall time: 884 ms
(3767, 28)

```

In [16]:
%%time

df_subs = pd.read_parquet(
    path=f"gs://{bucket_name}/{subreddits_path}",
    # columns=l_cols_subreddits,
)
df_subs.shape

CPU times: user 75 ms, sys: 25.4 ms, total: 100 ms
Wall time: 1.04 s


(3767, 29)

In [18]:
df_subs.tail()

Unnamed: 0,subreddit_id,subreddit_name,combined_topic,combined_topic_and_rating,rating,rating_version,topic,topic_version,over_18,allow_top,video_whitelisted,subreddit_language,whitelist_status,subscribers,first_screenview_date,last_screenview_date,users_l7,users_l28,posts_l7,posts_l28,comments_l7,comments_l28,pt,subreddit_clean_description_word_count,subreddit_name_title_and_clean_descriptions_word_count,subreddit_title,subreddit_public_description,subreddit_description,subreddit_name_title_and_clean_descriptions
3762,t5_4nbamg,extremsport,uncategorized,uncategorized,,,,,,t,,en,,6,2021-06-23,2021-07-13,9,12,1,4,1,1,2021-07-16,112,128,"Extremsport – höher, schneller, weiter!","Willkommen auf r/extremsport, der deutschsprachigen Community für Extremsportarten!","Reddit Ambassador Program\n\nDiese Community wurde in Partnerschaft mit unserem deutschen Botschafterprogramm erstellt. Unser Ziel ist es, mehr deutschsprachige Räume auf Reddit zu schaffen und zu kultivieren, damit deutsche Nutzer mehr...","extremsport. \nExtremsport – höher, schneller, weiter!. \nWillkommen auf r extremsport, der deutschsprachigen Community für Extremsportarten!. \nReddit Ambassador Program\n\nDiese Community wurde in Partnerschaft mit unserem deutschen B..."
3763,t5_4o3o8u,sachgeschichten,uncategorized,uncategorized,,,,,,t,,en,,5,2021-06-26,2021-07-13,7,11,8,13,0,0,2021-07-16,139,175,sachgeschichten,"Wolltest du schon immer mal wissen, wie das eigentlich gemacht wird? Wie so etwas im Inneren abläuft? Oder findest du das Brummen einer Produktionsanlage einfach nur schön? Dann bist du hier richtig!","Wolltest du schon immer mal wissen, wie das eigentlich gemacht wird? Wie so etwas im Inneren abläuft? Oder findest du das Brummen einer Produktionsanlage einfach nur schön?\n\nr/sachgeschichten ist eine Sammlung von deutschsprachigen Vi...","sachgeschichten. \nsachgeschichten. \nWolltest du schon immer mal wissen, wie das eigentlich gemacht wird Wie so etwas im Inneren abläuft Oder findest du das Brummen einer Produktionsanlage einfach nur schön Dann bist du hier richtig..."
3764,t5_4oc1u4,softwareschrott,uncategorized,uncategorized,,,,,,t,,en,,4,2021-06-27,2021-07-12,4,11,8,10,0,0,2021-07-16,0,9,softwareschrott,Das deutsche Unterlases für grottige Software,,softwareschrott. \nsoftwareschrott. \nDas deutsche Unterlases für grottige Software. \n
3765,t5_4n4lf5,lacrosse_de,uncategorized,uncategorized,,,,,,t,,en,,7,2021-06-22,2021-07-13,5,9,2,11,0,5,2021-07-16,124,137,Lacrosse_de,"Willkommen auf r/Lacrosse_de, der deutschsprachigen Lacrosse Community auf Reddit.","Reddit Ambassador Program\n\nDiese Community wurde in Partnerschaft mit unserem deutschen Botschafterprogramm erstellt. Unser Ziel ist es, mehr deutschsprachige Räume auf Reddit zu schaffen und zu kultivieren, damit deutsche Nutzer mehr...","Lacrosse_de. \nLacrosse_de. \nWillkommen auf r Lacrosse de, der deutschsprachigen Lacrosse Community auf Reddit.. \nReddit Ambassador Program\n\nDiese Community wurde in Partnerschaft mit unserem deutschen Botschafterprogramm erstellt. ..."
3766,t5_4qwonp,diesimpsons,uncategorized,uncategorized,,,,,,t,,en,,3,2021-07-12,2021-07-13,8,8,8,8,1,1,2021-07-16,4,7,Die Simpsons,Die Simpsons auf Deutsch!,Die Simpsons auf Deutsch!,DieSimpsons. \nDie Simpsons. \nDie Simpsons auf Deutsch!


## Test on a `sample` of posts & comments to make sure entire process works first (before running long job)

For subreddit descriptions only, we can expand to more than 1,500 characters.

However, when scoring posts and/or comments, we're better off trimming to the first ~1,000 characters to speed things up. We can increase the character limit if results aren't great; this could be a hyperparameter to tune.
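
As a rough illustration of the two knobs discussed here, the sketch below trims each text to its first N characters and then embeds the texts in fixed-size batches. It is not the implementation inside `vectorize_text_to_embeddings`. `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for the USE-multilingual model so the example runs without TensorFlow.

```python
import numpy as np
import pandas as pd


def fake_embed(texts, dim=4):
    """Stand-in for the real model: returns one vector of length `dim` per text."""
    return np.array([[float(len(t))] * dim for t in texts])


def embed_in_batches(s_text, embed_fn, batch_rows=2000, limit_first_n_chars=1000):
    """Trim each text to its first N characters, then embed batch by batch."""
    s_trimmed = s_text.str[:limit_first_n_chars]
    l_batches = []
    for start in range(0, len(s_trimmed), batch_rows):
        batch = s_trimmed.iloc[start:start + batch_rows]
        # Keep the original index so embeddings can be joined back to the text
        l_batches.append(
            pd.DataFrame(embed_fn(batch.tolist()), index=batch.index)
        )
    return pd.concat(l_batches).add_prefix("embeddings_")


s_posts = pd.Series(["short text", "x" * 5000, "noch ein Beitrag"], name="text")
df_demo = embed_in_batches(s_posts, fake_embed, batch_rows=2, limit_first_n_chars=1000)
print(df_demo.shape)
```

Smaller batches and a lower character limit both reduce peak memory per model call, at the cost of more iterations or less context per text.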

In [None]:
BREAK

In [20]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='test_n_samples',
    mlflow_experiment=mlflow_experiment_test,
    
    tokenize_lowercase=True,
    
    bucket_name=bucket_name,
    subreddits_path=subreddits_path,
    posts_path=posts_path,
    comments_path=comments_path,
    
    tf_batch_inference_rows=2000,
    tf_limit_first_n_chars=1000,
    n_sample_posts=2500,
    n_sample_comments=5100,
)

08:27:18 | INFO | "Start vectorize function"
08:27:18 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-29_0827"
08:27:18 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/top/2021-07-16"
08:27:26 | INFO | "  0:00:07.773679 <- df_post time elapsed"
08:27:26 | INFO | "  (1649929, 6) <- df_posts.shape"
08:27:27 | INFO | "  Sampling posts down to: 2,500"
08:27:27 | INFO | "  (2500, 6) <- df_posts.shape AFTER sampling"
08:27:27 | INFO | "Load comments df..."
08:27:57 | INFO | "  (19200854, 6) <- df_comments shape"
08:28:08 | INFO | "Keep only comments that match posts IDs in df_posts..."
08:28:11 | INFO | "  (31630, 6) <- updated df_comments shape"
08:28:11 | INFO | "  Sampling COMMENTS down to: 5,100"
08:28:11 | INFO | "  (5100, 6) <- df_comments.shape AFTER sampling"
08:28:11 | INFO | "Load subreddits df..."
08:28:12 | INFO | "  (3767, 4) <- df_subs shape"
08:28:12 | INFO | "MLflow tracking URI: sqlit

  0%|          | 0/2 [00:00<?, ?it/s]

08:28:19 | INFO | "  Saving to local... df_vect_subreddits_description..."
08:28:19 | INFO | "  Logging to mlflow..."
08:28:20 | INFO | "Vectorizing POSTS..."
08:28:20 | INFO | "Getting embeddings in batches of size: 2000"


  0%|          | 0/2 [00:00<?, ?it/s]

08:28:21 | INFO | "  Saving to local... df_vect_posts..."
08:28:21 | INFO | "  Logging to mlflow..."
08:28:22 | INFO | "Vectorizing COMMENTS..."
08:28:22 | INFO | "Getting embeddings in batches of size: 2000"


  0%|          | 0/3 [00:00<?, ?it/s]

08:28:24 | INFO | "  Saving to local... df_vect_comments..."
08:28:24 | INFO | "  Logging to mlflow..."
08:28:25 | INFO | "  0:01:06.542544 <- Total vectorize fxn time elapsed"


In [24]:
print(df_vect_subs.shape)
df_vect_subs.iloc[:5, :10]

(3767, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
pics,t5_2qh0u,-0.056925,0.027936,-0.009723,-0.009849,0.0432,0.045963,0.049922,-0.061319,0.053243,-0.052809
funny,t5_2qh33,0.045474,-0.039333,-0.03179,-0.015574,0.074503,0.054003,0.00787,0.061827,-0.050316,0.023417
memes,t5_2qjpg,-0.014767,0.018347,-0.069566,-0.02242,0.063016,0.066394,-0.061886,0.04054,0.01935,0.027958
news,t5_2qh3l,-0.066339,0.056393,0.036245,-0.021127,0.076642,0.040693,0.019423,0.054693,-0.012191,0.065671
interestingasfuck,t5_2qhsa,-0.020677,0.061429,-0.029565,0.029978,0.066374,0.061271,0.069265,0.028228,0.004899,0.044498


In [25]:
print(df_vect.shape)
df_vect.iloc[:5, :10]

(2500, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,post_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
stocks,t5_2qjfk,t3_o9h0x8,-0.054105,0.0703,-0.077626,-0.032931,-0.022623,-0.004873,0.01813,-0.031575,-0.058212,0.04276
laptops,t5_2qoip,t3_o76rnd,-0.018935,-0.026588,0.071572,-0.015938,-0.083586,-0.082511,0.044506,-0.015417,-0.043235,0.068179
luftraum,t5_q02q4,t3_nuwjin,-0.072257,0.020968,-0.012867,-0.042401,0.024641,0.081959,-0.042031,0.01064,-0.005116,0.025049
adultery,t5_2sjkv,t3_oi8tno,0.046559,-0.042573,0.039657,-0.066817,-0.096201,0.001347,-0.074167,0.002814,-0.066348,-0.02913
poopshitters,t5_wgmeb,t3_o2es6a,-0.056548,-0.018536,0.015383,-0.018324,-0.007343,0.037289,0.082599,0.029512,-0.039021,0.031015


In [26]:
print(df_vect_comments.shape)
df_vect_comments.iloc[10:15, -10:]

(5100, 512)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,embeddings_502,embeddings_503,embeddings_504,embeddings_505,embeddings_506,embeddings_507,embeddings_508,embeddings_509,embeddings_510,embeddings_511
subreddit_name,subreddit_id,post_id,comment_id,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
yoga,t5_2qhq6,t3_o9upy3,t1_h3djels,-0.049714,0.039176,0.031363,-0.025456,0.020385,-0.009454,0.032895,0.048567,0.006216,0.068287
worldnews,t5_2qh13,t3_o7yf4y,t1_h333pw7,-0.000592,-0.064295,-0.041862,-0.026075,0.028604,-0.028089,-0.017898,0.0315,-0.0451,0.064217
aww,t5_2qh1o,t3_nvidt5,t1_h13y1e4,0.022847,-0.084315,0.062082,-0.042768,0.049408,0.04187,0.021409,0.03514,0.034683,0.112163
formula1,t5_2qimj,t3_o8beb8,t1_h33xaio,-0.042507,-0.012686,0.035909,0.1043,0.034547,0.009831,-0.02029,0.04584,0.037209,0.074332
aaaaaaacccccccce,t5_3aa11,t3_o9igjh,t1_h3igwhh,0.045924,-0.043366,0.009658,0.00471,-0.062547,-0.043556,0.074638,0.029582,0.018934,0.036313


# Check mlflow experiment & Read artifact

In [27]:
df_mlf_exp = mlf.list_experiment_meta(output_format='pandas')
df_mlf_exp

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
0,0,Default,./mlruns/0,active
1,1,fse_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/1,active
2,2,fse_vectorize_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/2,active
3,3,subreddit_description_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/3,active
4,4,fse_vectorize_v1.1,gs://i18n-subreddit-clustering/mlflow/mlruns/4,active
5,5,use_multilingual_v0.1_test,gs://i18n-subreddit-clustering/mlflow/mlruns/5,active
6,6,use_multilingual_v1,gs://i18n-subreddit-clustering/mlflow/mlruns/6,active
7,7,use_multilingual_v1_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/7,active
8,8,use_multilingual_v1_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/8,active
9,9,v0.3.2_use_multi_inference_test,gs://i18n-subreddit-clustering/mlflow/mlruns/9,active


## Check runs in experiment

In [28]:
exp_id = df_mlf_exp.loc[df_mlf_exp['name'] == mlflow_experiment_test, 
                        'experiment_id'].values[0]

mlf.search_all_runs(experiment_ids=[exp_id]).head(8)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.df_vect_subreddits_description_cols,metrics.df_vect_subreddits_description_rows,metrics.vectorizing_time_minutes,metrics.df_vect_comments_rows,metrics.df_vect_comments_cols,metrics.df_vect_posts_cols,metrics.df_vect_posts_rows,params.model_location,params.col_post_id,params.bucket_name,params.tokenize_lowercase,params.posts_path,params.col_comment_id,params.col_text_post,params.preprocess_text_folder,params.tf_limit_first_n_chars,params.subreddits_path,params.n_sample_comments,params.col_subreddit_id,params.col_text_subreddit_word_count,params.col_text_comment_word_count,params.comments_path,params.tokenize_function,params.col_text_post_word_count,params.col_text_subreddit_description,params.col_text_post_url,params.host_name,params.model_name,params.n_sample_posts,params.tf_batch_inference_rows,params.col_text_comment,tags.mlflow.source.git.commit,tags.mlflow.runName,tags.host_name,tags.mlflow.source.type,tags.mlflow.user,tags.mlflow.source.name
0,5e97065c83674451acca4eb66ea8b5f7,9,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/9/5e97065c83674451acca4eb66ea8b5f7/artifacts,2021-07-29 08:28:12.764000+00:00,2021-07-29 08:28:25.132000+00:00,512.0,3767.0,1.109042,5100.0,512.0,512.0,2500.0,https://tfhub.dev/google/universal-sentence-encoder-multilingual/3,post_id,i18n-subreddit-clustering,True,posts/top/2021-07-16,comment_id,text,,1000,subreddits/top/2021-07-16,5100,subreddit_id,subreddit_name_title_and_clean_descriptions_word_count,comment_text_word_count,comments/top/2021-07-09,sklearn,text_word_count,subreddit_name_title_and_clean_descriptions,post_url_for_embeddings,djb-subclu-inference-tf-2-3-20210630,use_multilingual,2500,2000,comment_body_text,636ffe8ca480035297dfc650c1c002676ceb5aa6,test_n_samples,djb-subclu-inference-tf-2-3-20210630,LOCAL,jupyter,/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py
1,4034ea26c5bf45f980a3132e9b28f677,9,KILLED,gs://i18n-subreddit-clustering/mlflow/mlruns/9/4034ea26c5bf45f980a3132e9b28f677/artifacts,2021-07-29 08:24:53.166000+00:00,2021-07-29 08:27:18.525000+00:00,,,,,,,,https://tfhub.dev/google/universal-sentence-encoder-multilingual/3,post_id,i18n-subreddit-clustering,True,posts/top/2021-07-16,comment_id,text,,1000,subreddits/top/2021-07-16,2100,subreddit_id,subreddit_name_title_and_clean_descriptions_word_count,comment_text_word_count,comments/top/2021-07-09,sklearn,text_word_count,subreddit_name_title_and_clean_descriptions,post_url_for_embeddings,djb-subclu-inference-tf-2-3-20210630,use_multilingual,1500,3000,comment_body_text,636ffe8ca480035297dfc650c1c002676ceb5aa6,test_n_samples,djb-subclu-inference-tf-2-3-20210630,LOCAL,jupyter,/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py


In [15]:
%%time

run_id = '45201072143a4d7fbb86a2f2b7d85520'

df_v_subs = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_subreddits_description',
    read_function=pd.read_parquet,
)
print(df_v_subs.shape)

(629, 512)
CPU times: user 169 ms, sys: 0 ns, total: 169 ms
Wall time: 1.75 s


In [23]:
np.allclose(df_vect_subs, df_v_subs)

True
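
Note that `np.allclose` compares values position by position and does not look at the index. A small sketch with made-up frames (the values and the reversed index are illustrative assumptions) shows how a check on the index can complement it:

```python
import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_tuples(
    [("pics", "t5_2qh0u"), ("funny", "t5_2qh33")],
    names=["subreddit_name", "subreddit_id"],
)
df_a = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=ix,
    columns=["embeddings_0", "embeddings_1"],
)

# Same values in the same positions, but the row labels are swapped
df_b = pd.DataFrame(df_a.to_numpy(), index=ix[::-1], columns=df_a.columns)

values_match = np.allclose(df_a.to_numpy(), df_b.to_numpy())
index_match = df_a.index.equals(df_b.index)
print(values_match, index_match)
```

Here the values match while the index does not, so checking both gives more confidence that a saved artifact lines up with the in-memory frame.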

In [16]:
df_v_subs.iloc[:5, :10]

Unnamed: 0_level_0,Unnamed: 1_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
de,t5_22i0,-0.018191,-0.045794,0.035795,-0.036392,0.033076,0.013654,-0.013067,-0.031252,0.001283,0.0247
ich_iel,t5_37k29,-0.019543,-0.00259,-0.002255,0.00964,-0.080638,0.053921,0.068653,-0.051635,0.038154,0.016369
nicoledobrikov1,t5_3oioc0,0.00024,0.0437,-0.030162,-0.023365,0.051887,0.050446,0.013388,-0.049501,-0.059686,-0.068271
germany,t5_2qi4z,0.030575,-0.057457,0.007206,0.029543,-0.003699,0.064915,-0.033345,-0.066493,-0.01916,0.014145
germansgonewild,t5_37g5b,0.022604,-0.032705,-0.016022,0.06629,0.052799,0.029996,0.008364,-0.049809,-0.004913,-0.056319


In [17]:
%%time

df_v_posts = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_posts',
    read_function=pd.read_parquet,
)
print(df_v_posts.shape)

(1500, 512)
CPU times: user 99.4 ms, sys: 92 ms, total: 191 ms
Wall time: 1.77 s


In [24]:
np.allclose(df_vect, df_v_posts)

True

In [27]:
df_v_posts.iloc[14:20, :10]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,post_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
pcbaumeister,t5_4c1x98,t3_nlj0xz,-0.028007,0.018425,0.025313,-0.084277,-0.044653,0.036714,-0.084001,-0.060739,0.022273,-0.055197
dagibeehot,t5_wv7c1,t3_mzpji1,0.079956,0.062191,0.042096,0.028683,-0.014719,0.002513,-0.016711,0.036436,-0.033169,-0.016168
germansgonewild,t5_37g5b,t3_nkhuwl,-0.04588,0.025566,0.004287,0.02,-0.086096,0.016448,-0.003725,0.049456,-0.073738,-0.021704
de,t5_22i0,t3_nkm3hr,-0.05499,0.009562,0.011608,0.017721,0.01471,0.058977,0.061449,0.020423,-0.010647,0.038405
de,t5_22i0,t3_mpc8ai,-0.056767,-0.07399,0.057309,0.051738,0.019686,0.081643,-0.010165,0.045042,-0.045683,-0.015345
huebi,t5_29zucx,t3_mubs9j,0.15478,0.00766,0.06628,-0.002162,-0.080812,0.075854,0.000574,0.07894,-0.122165,-0.002068


In [21]:
%%time

df_v_comments = mlf.read_run_artifact(
    run_id=run_id,
    artifact_folder='df_vect_comments',
    read_function=pd.read_parquet,
)
print(df_v_comments.shape)

(2100, 512)
CPU times: user 441 ms, sys: 64.8 ms, total: 506 ms
Wall time: 2.05 s


In [25]:
np.allclose(df_vect_comments, df_v_comments)

True

In [26]:
df_v_comments.iloc[:5, :10]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9
subreddit_name,subreddit_id,post_id,comment_id,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
de,t5_22i0,t3_n1db9m,t1_gwgumip,-0.050319,0.019366,0.008127,-0.035954,0.058837,0.018773,-0.077046,-0.080273,-0.007283,0.020304
ich_iel,t5_37k29,t3_muosjc,t1_gv7rlal,-0.020838,0.016757,-0.027872,0.005312,0.046645,0.074642,0.02286,-0.041156,0.009235,-0.068941
buenzli,t5_2xbtv,t3_ngs8bj,t1_gysjscj,-0.038592,-0.034569,-0.045555,0.006089,-0.044613,0.008128,0.023125,-0.062052,-0.024423,-0.032473
nicoledobrikovof,t5_3k1wb9,t3_noa9fo,t1_h0dfyag,0.020388,-0.063959,0.013214,-0.057574,0.054215,0.06014,-0.015974,-0.032665,-0.087324,0.022982
de,t5_22i0,t3_ngydq1,t1_gyu585z,-0.052664,0.04226,0.013913,0.053029,0.043332,0.046601,-0.062652,-0.046233,-0.016664,0.081627


# Run full job with `tokenize_lowercase=True`

In [None]:
mlflow_experiment_full = 'use_multilingual_v1'

In [None]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='full_data-lowercase_true',
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=True,
    subreddits_path='subreddits/de/2021-06-16',
    posts_path='posts/de/2021-06-16',
    comments_path='comments/de/2021-06-16',
    tf_batch_inference_rows=2200,
    tf_limit_first_n_chars=1300,
    n_sample_posts=None,
    n_sample_comments=None,
)

10:33:47 | INFO | "Start vectorize function"
10:33:47 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-01_1033"
10:33:47 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/de/2021-06-16"
10:34:11 | INFO | "  0:00:23.874188 <- df_post time elapsed"
10:34:11 | INFO | "  (262226, 6) <- df_posts.shape"
10:34:11 | INFO | "Load comments df..."
10:34:38 | INFO | "  (1108757, 6) <- df_comments shape"
10:34:38 | INFO | "Keep only comments that match posts IDs in df_posts..."
10:34:39 | INFO | "  (1108757, 6) <- updated df_comments shape"
10:34:39 | INFO | "Load subreddits df..."
10:34:40 | INFO | "  (629, 4) <- df_subs shape"
10:34:41 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
10:34:42 | INFO | "Loading model use_multilingual...
  with kwargs: None"
10:34:48 | INFO | "  0:00:06.458787 <- Load TF HUB model time elapsed"
10:34:48 | INFO | "Vectorizing subreddit descriptions..."
10

  0%|          | 0/120 [00:00<?, ?it/s]

10:39:14 | INFO | "  Saving to local... df_vect_posts..."
10:39:31 | INFO | "  Logging to mlflow..."
10:39:41 | INFO | "Vectorizing COMMENTS..."
10:39:42 | INFO | "Getting embeddings in batches of size: 2200"


  0%|          | 0/504 [00:00<?, ?it/s]

In [15]:
gc.collect()

49

# Run full job with `tokenize_lowercase=False`

Time on CPU, only comments + subs:
```
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:08 | INFO | "  (629, 4) <- df_subs shape"

13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
```

In [18]:
mlflow.end_run(status='KILLED')

model, df_vect, df_vect_comments, df_vect_subs = vectorize_text_to_embeddings(
    model_name='use_multilingual',
    run_name='full_data-lowercase_false',
    mlflow_experiment=mlflow_experiment_full,
    
    tokenize_lowercase=False,
    subreddits_path='subreddits/de/2021-06-16',
    posts_path=None,  # 'posts/de/2021-06-16',
    comments_path='comments/de/2021-06-16',
    tf_batch_inference_rows=1500,
    tf_limit_first_n_chars=1100,
    n_sample_posts=None,
    n_sample_comments=None,
)

13:28:50 | INFO | "Start vectorize function"
13:28:50 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/use_multilingual/2021-07-01_1328"
13:28:50 | INFO | "Load comments df..."
13:29:07 | INFO | "  (1108757, 6) <- df_comments shape"
13:29:07 | INFO | "Keep only comments that match posts IDs in df_posts..."
13:29:07 | INFO | "df_posts missing, so we can't filter comments..."
13:29:07 | INFO | "Load subreddits df..."
13:29:08 | INFO | "  (629, 4) <- df_subs shape"
13:29:08 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
13:29:09 | INFO | "Loading model use_multilingual...
  with kwargs: None"




13:29:11 | INFO | "  0:00:02.282361 <- Load TF HUB model time elapsed"
13:29:11 | INFO | "Vectorizing subreddit descriptions..."




13:29:13 | INFO | "  Saving to local... df_vect_subreddits_description..."
13:29:13 | INFO | "  Logging to mlflow..."




13:29:14 | INFO | "Vectorizing COMMENTS..."
13:29:14 | INFO | "Getting embeddings in batches of size: 1500"


  0%|          | 0/740 [00:00<?, ?it/s]

13:44:30 | INFO | "  Saving to local... df_vect_comments..."
13:44:49 | INFO | "  Logging to mlflow..."
13:45:11 | INFO | "  0:16:21.475036 <- Total vectorize fxn time elapsed"
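
The totals on the progress bars in these runs follow from the row counts and `tf_batch_inference_rows`: the number of batches is the row count divided by the batch size, rounded up. A quick check of that arithmetic using the shapes logged above (`n_batches` is a hypothetical helper name, not part of `subclu`):

```python
import math


def n_batches(n_rows: int, batch_rows: int) -> int:
    """Number of inference batches needed to cover all rows."""
    return math.ceil(n_rows / batch_rows)


print(n_batches(262_226, 2200))    # posts, tokenize_lowercase=True run
print(n_batches(1_108_757, 2200))  # comments, tokenize_lowercase=True run
print(n_batches(1_108_757, 1500))  # comments, tokenize_lowercase=False run
```

This makes it easy to see how a change in batch size affects the number of iterations before starting a long job.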


# Example from previous call using FSE/FastText/uSIF

In [17]:
gc.collect()

mlflow.end_run(status='KILLED')
model, df_posts, d_ix_to_id = vectorize_text_to_embeddings(
    mlflow_experiment=mlflow_experiment,
    
    tokenize_function='sklearn_acronyms_emoji',
    tokenize_lowercase=True,
    train_min_word_count=4,
    train_exclude_duplicated_docs=True,
    train_subreddits_to_exclude=['wixbros', 'katjakrasavicenudes',
                                 'deutschetributes', 'germannudes',
                                 'annitheduck', 'germanonlyfans',
                                 'loredana', 'nicoledobrikovof',
                                 'germansgonewild', 'elisaalinenudes',
                                 'marialoeffler', 'germanwomenandcouples',
                                ],
)

07:25:16 | INFO | "Start vectorize function"
07:25:16 | INFO | "  Local model saving directory: /home/jupyter/subreddit_clustering_i18n/data/models/fse/2021-06-02_0725"
07:25:16 | INFO | "Loading df_posts...
  gs://i18n-subreddit-clustering/posts/2021-05-19"
07:25:22 | INFO | "  0:00:05.708467 <- df_post time elapsed"
07:25:22 | INFO | "  (111669, 6) <- df_posts.shape"
07:25:22 | INFO | "Load comments df..."
07:25:29 | INFO | "  (757388, 6) <- df_comments shape"
07:25:29 | INFO | "Keep only comments that match posts IDs in df_posts..."
07:25:30 | INFO | "  (638052, 6) <- updated df_comments shape"
07:25:30 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/mlflow/mlruns.db"
07:25:30 | INFO | "Filtering posts for SIF training..."
07:25:30 | INFO | "59,366 <- Exclude posts because of: subreddits filter"
07:25:30 | INFO | "30,537 <- Exclude posts because of: duplicated posts"
07:25:30 | INFO | "25,328 <- Exclude posts because of: minimum word count"
07:25:30 | INFO | "31,790 <- df_pos