# Purpose

2022-07-07:
For this version of the model I'll keep using the old method, without nested values, for the sake of speed. When we move to Kubeflow, we'll want to create a partition and a schema that can handle a changing number of k values.
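One possible way to handle a changing number of k values is a long layout with one row per subreddit and k, instead of one column per k. The following is a minimal sketch, assuming the `k_XXXX_label` column naming that appears in `df_labels` later in this notebook. The helper name `labels_wide_to_long` and the toy rows are hypothetical and are not part of `subclu`.

```python
import pandas as pd


def labels_wide_to_long(df_wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape k_XXXX_label columns into one row per (subreddit, k).

    Hypothetical helper for illustration only.
    """
    id_cols = ["subreddit_id", "subreddit_name"]
    label_cols = [
        c for c in df_wide.columns
        if c.startswith("k_") and c.endswith("_label")
    ]
    df_long = df_wide.melt(
        id_vars=id_cols,
        value_vars=label_cols,
        var_name="k_col",
        value_name="cluster_label",
    )
    # Parse the numeric k out of the column name, e.g. "k_0010_label" -> 10
    df_long["k"] = (
        df_long["k_col"]
        .str.extract(r"^k_(\d+)_label$", expand=False)
        .astype(int)
    )
    return (
        df_long.drop(columns="k_col")
        .sort_values(["subreddit_id", "k"])
        .reset_index(drop=True)
    )


# Toy data (made up); real input would be df_labels
df_toy = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "subreddit_name": ["sub_a", "sub_b"],
    "k_0010_label": [5, 1],
    "k_0012_label": [6, 1],
})
df_long = labels_wide_to_long(df_toy)
print(df_long)
```

With this layout, adding or removing a k value changes the number of rows rather than the table schema.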

# Imports & Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import logging
import os
from pathlib import Path

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import mlflow
import hydra

import subclu
from subclu.eda.aggregates import compare_raw_v_weighted_language
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.hydra_config_loader import LoadHydraConfig


# ===
# imports specific to this notebook



print_lib_versions([hydra, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.5.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'
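The four slashes in the URI above come from the SQLAlchemy-style prefix `sqlite:///` followed by an absolute path that itself starts with `/`. A small sketch of how such a URI can be built; the helper name `sqlite_tracking_uri` is hypothetical, and `MlflowLogger` presumably does something similar internally.

```python
from pathlib import PurePosixPath


def sqlite_tracking_uri(db_path: str) -> str:
    """Build a SQLite URI for a database file (hypothetical helper)."""
    # "sqlite:///" + "/abs/path" yields four slashes for absolute paths
    return f"sqlite:///{PurePosixPath(db_path)}"


uri = sqlite_tracking_uri(
    "/home/jupyter/subreddit_clustering_i18n/mlflow_sync/"
    "djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
)
print(uri)
# The resulting string could then be passed to mlflow.set_tracking_uri(uri)
```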

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
25,25,v0.4.1_mUSE_clustering_new_metrics,gs://i18n-subreddit-clustering/mlflow/mlruns/25,active
26,26,v0.4.1_nearest_neighbors_test,gs://i18n-subreddit-clustering/mlflow/mlruns/26,active
27,27,v0.4.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/27,active
28,28,v0.5.0_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/28,active
29,29,v0.5.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/29,active
30,30,v0.5.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/30,active
31,31,v0.5.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/31,active
32,32,v0.5.0_nearest_neighbors_test,gs://i18n-subreddit-clustering/mlflow/mlruns/32,active
33,33,v0.5.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/33,active


## Show selected model metadata

Experiment ID 31 has the latest clustering runs.

We picked the model with UUID
```
8da1ce07d7214ea1a384b445f6d5db5d
```

In [6]:
mlflow_experiment_id = 31
run_id_target = '8da1ce07d7214ea1a384b445f6d5db5d'

gs_cluster_labels = (
    f"gs://i18n-subreddit-clustering/mlflow/mlruns/{mlflow_experiment_id}/{run_id_target}/artifacts/df_labels/df_labels.parquet"
)
gs_cluster_labels

'gs://i18n-subreddit-clustering/mlflow/mlruns/31/8da1ce07d7214ea1a384b445f6d5db5d/artifacts/df_labels/df_labels.parquet'

In [7]:
%%time

df_mlf = mlf.search_all_runs(experiment_ids=[mlflow_experiment_id])
df_mlf.shape

CPU times: user 432 ms, sys: 9.24 ms, total: 441 ms
Wall time: 440 ms


(36, 283)

In [8]:
mask_finished = df_mlf['status'] == 'FINISHED'
mask_score_complete = ~df_mlf['metrics.primary_topic-2350_to_2700-f1_score-weighted_avg'].isnull()

df_mlf_clustering_candidates = df_mlf[mask_finished & mask_score_complete]
df_mlf_clustering_candidates.shape

(36, 283)

In [9]:
cols_with_multiple_vals = df_mlf_clustering_candidates.columns[df_mlf_clustering_candidates.nunique(dropna=False) > 1]
print(f"interesting cols: {len(cols_with_multiple_vals)}")
# df_mlf_clustering_candidates[cols_with_multiple_vals].iloc[:5, :10]

interesting cols: 255
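The filter above keeps only columns whose values differ across runs. Passing `dropna=False` makes `nunique` count NaN as its own value, so a metric that is missing in some runs still counts as varying. A toy illustration with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up runs table; column names only mimic the MLflow naming style
df_runs = pd.DataFrame({
    "status": ["FINISHED", "FINISHED", "FINISHED"],   # constant
    "params.linkage": ["ward", "complete", "ward"],   # varies
    "metrics.score": [0.5, np.nan, 0.5],              # varies only via NaN
})

# Default behavior ignores NaN when counting distinct values
cols_default = df_runs.columns[df_runs.nunique() > 1]
# dropna=False counts NaN as a distinct value
cols_with_nan = df_runs.columns[df_runs.nunique(dropna=False) > 1]

print(list(cols_default))
print(list(cols_with_nan))
```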


In [10]:
df_mlf_clustering_candidates[df_mlf_clustering_candidates['run_id'] == run_id_target][cols_with_multiple_vals]

Unnamed: 0,run_id,artifact_uri,start_time,end_time,metrics.primary_topic-0040_to_0050-recall-macro_avg,metrics.optimal_k-0500_to_0750,metrics.primary_topic-2350_to_2700-precision-macro_avg,metrics.primary_topic-0080_to_0100-f1_score-macro_avg,metrics.primary_topic-1350_to_1700-homogeneity_score,metrics.primary_topic-0080_to_0100-f1_score-weighted_avg,metrics.primary_topic-0250_to_0500-f1_score-macro_avg,metrics.primary_topic-0060_to_0070-recall-macro_avg,metrics.primary_topic-0100_to_0250-recall-weighted_avg,metrics.primary_topic-4000_to_4200-homogeneity_score,metrics.primary_topic-0060_to_0070-f1_score-weighted_avg,metrics.primary_topic-2700_to_3000-adjusted_rand_score,metrics.primary_topic-2000_to_2350-adjusted_mutual_info_score,metrics.primary_topic-4000_to_4200-precision-macro_avg,metrics.primary_topic-0250_to_0500-adjusted_mutual_info_score,metrics.primary_topic-0050_to_0060-f1_score-weighted_avg,metrics.optimal_k-0250_to_0500,metrics.primary_topic-0500_to_0750-adjusted_rand_score,metrics.optimal_k-3800_to_3900,metrics.primary_topic-3600_to_3800-f1_score-macro_avg,metrics.primary_topic-0500_to_0750-precision-macro_avg,metrics.memory_used_percent,metrics.primary_topic-0250_to_0500-homogeneity_score,metrics.primary_topic-0040_to_0050-adjusted_rand_score,metrics.primary_topic-4000_to_4200-recall-macro_avg,metrics.primary_topic-0250_to_0500-recall-weighted_avg,...,metrics.primary_topic-1000_to_1350-homogeneity_score,metrics.primary_topic-2000_to_2350-f1_score-weighted_avg,metrics.primary_topic-3000_to_3200-f1_score-macro_avg,metrics.primary_topic-0020_to_0040-precision-weighted_avg,metrics.optimal_k-1000_to_1350,metrics.primary_topic-2350_to_2700-adjusted_rand_score,metrics.primary_topic-2350_to_2700-recall-macro_avg,metrics.primary_topic-0100_to_0250-precision-weighted_avg,metrics.primary_topic-3800_to_3900-adjusted_mutual_info_score,metrics.primary_topic-0250_to_0500-f1_score-weighted_avg,metrics.primary_topic-3400_to_3600-adjusted_rand_score,metrics.primary_topic-3000_to_3200-homogeneity_score,metrics.primary_topic-0020_to_0040-precision-macro_avg,metrics.primary_topic-1700_to_2000-precision-macro_avg,metrics.optimal_k-3000_to_3200,params._pipe-reduce__tol,params._pipe-reduce__random_state,params._pipe-reduce__n_iter,params._pipe-reduce__n_components,params.mlflow_run_name,params._pipe-cluster__linkage,params._pipe-reduce__algorithm,params._pipe-cluster__affinity,params.pipe-reduce_name,params._pipe-normalize__copy,params.pipe-normalize_name,params._pipe-normalize__norm,tags.mlflow.log-model.history,tags.mlflow.runName,tags.mlflow.source.git.commit
32,8da1ce07d7214ea1a384b445f6d5db5d,gs://i18n-subreddit-clustering/mlflow/mlruns/31/8da1ce07d7214ea1a384b445f6d5db5d/artifacts,2022-07-06 17:59:57.539000+00:00,2022-07-06 18:18:07.294000+00:00,0.313893,603.0,0.641622,0.351835,0.604471,0.623818,0.463907,0.346683,0.671971,0.636226,0.60832,0.758404,0.62337,0.668956,0.57938,0.591211,266.0,0.73274,3896.0,0.622741,0.560553,0.093905,0.563036,0.680856,0.611681,0.695749,...,0.593744,0.73396,0.610324,0.502137,1004.0,0.756196,0.582495,0.639902,0.642868,0.674399,0.762232,0.626398,0.185163,0.63432,3107.0,0.0,,5,128,embedding_clustering-2022-07-06_175956,ward,randomized,euclidean,TruncatedSVD,True,Normalizer,l2,"[{""run_id"": ""8da1ce07d7214ea1a384b445f6d5db5d"", ""artifact_path"": ""clustering_model"", ""utc_time_created"": ""2022-07-06 18:14:16.616548"", ""flavors"": {""sklearn"": {""pickled_model"": ""model.pkl"", ""sklearn_version"": ""0.24.1"", ""serialization_for...",embedding_clustering-2022-07-06_175956,5986076742221675a16d6ef53cc0cdb81e56de9d


## Show selected model data to upload

In [11]:
%%time
df_labels = pd.read_parquet(
    gs_cluster_labels
)
print(df_labels.shape)

(81970, 151)
CPU times: user 881 ms, sys: 441 ms, total: 1.32 s
Wall time: 2.39 s


In [12]:
df_labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81970 entries, 0 to 81969
Columns: 151 entries, model_sort_order to k_8000_majority_primary_topic
dtypes: float64(1), int32(73), int64(1), object(76)
memory usage: 72.2+ MB


In [13]:
df_labels.iloc[:5, :10]

Unnamed: 0,model_sort_order,subreddit_name,subreddit_id,primary_topic,posts_for_modeling_count,k_0010_label,k_0012_label,k_0020_label,k_0025_label,k_0030_label
0,37929,memesenespanol,t5_1009a3,Internet Culture and Memes,111.0,5,6,8,9,10
1,38436,karstcast,t5_100a1y,Podcasts and Streamers,6.0,5,6,8,9,10
2,3416,measuredpenis,t5_100eoi,Mature Themes and Adult Content,150.0,1,1,1,1,1
3,50474,dragonballart,t5_100i9c,Art,72.0,6,7,11,13,15
4,47735,thelongestgameever2,t5_100mht,Gaming,7.0,6,7,11,13,15


In [14]:
df_labels.iloc[:5, -10:]

Unnamed: 0,k_5750_majority_primary_topic,k_6000_majority_primary_topic,k_6250_majority_primary_topic,k_6500_majority_primary_topic,k_6750_majority_primary_topic,k_7000_majority_primary_topic,k_7250_majority_primary_topic,k_7500_majority_primary_topic,k_7750_majority_primary_topic,k_8000_majority_primary_topic
0,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor
1,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor,Funny/Humor
2,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content,Mature Themes and Adult Content
3,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime,Anime
4,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming,Gaming


In [15]:
(
    df_labels
    .sort_values(by=['model_sort_order'], ascending=True)
    .iloc[:15, :10]
)

Unnamed: 0,model_sort_order,subreddit_name,subreddit_id,primary_topic,posts_for_modeling_count,k_0010_label,k_0012_label,k_0020_label,k_0025_label,k_0030_label
6625,0,sagittariussoles,t5_2k9j8e,,18.0,1,1,1,1,1
66250,1,petitebootysmuggler,t5_5zhg6h,,12.0,1,1,1,1,1
64558,2,hotblondesgonewild,t5_5tjuao,,14.0,1,1,1,1,1
35310,3,fantime,t5_35k6zg,Mature Themes and Adult Content,35.0,1,1,1,1,1
69772,4,hotperfect,t5_6bb10m,,70.0,1,1,1,1,1
70270,5,roxierusalka,t5_6ceaw2,,16.0,1,1,1,1,1
71741,6,coolfiles,t5_6fd53e,,26.0,1,1,1,1,1
72367,7,newhotgirlsonlyfanss,t5_6ghkib,,174.0,1,1,1,1,1
53710,8,onlyfansvipfree,t5_4k6nde,,29.0,1,1,1,1,1
55025,9,promoteonlyfansnsfw,t5_4qpjty,,84.0,1,1,1,1,1


In [16]:
(
    df_labels
    .sort_values(by=['model_sort_order'], ascending=True)
    .iloc[-15:, :10]
)

Unnamed: 0,model_sort_order,subreddit_name,subreddit_id,primary_topic,posts_for_modeling_count,k_0010_label,k_0012_label,k_0020_label,k_0025_label,k_0030_label
25988,81955,muglife,t5_2w4zr,Hobbies,101.0,10,12,20,25,30
51730,81956,tikimugs,t5_4av2vd,Hobbies,16.0,10,12,20,25,30
31311,81957,moldavite,t5_30hy8,Hobbies,37.0,10,12,20,25,30
38175,81958,beerstein,t5_39ceyj,,5.0,10,12,20,25,30
10302,81959,vintage,t5_2qsbw,,704.0,10,12,20,25,30
18159,81960,whatsthisworth,t5_2sslp,,400.0,10,12,20,25,30
19493,81961,collectables,t5_2t61k,Hobbies,107.0,10,12,20,25,30
33386,81962,finechina,t5_32tf9,Hobbies,66.0,10,12,20,25,30
38698,81963,antiquefurniture,t5_39xkv,,43.0,10,12,20,25,30
10917,81964,antiques,t5_2qz3j,Art,1545.0,10,12,20,25,30


# Use `bq` to upload data to BigQuery

Note that you may need to authenticate for this upload to work. Otherwise you might get an error like this:

```bash
BigQuery error in load operation: Access Denied: Project reddit-employee-
datasets: User does not have bigquery.jobs.create permission in project reddit-
employee-datasets.
```

Command to login:
```bash
gcloud auth application-default login
```

Command to log out:
```bash
gcloud auth application-default revoke
```

In [21]:
# print command (so we can run it elsewhere if needed)
print(
fr"""
bq load \
    --source_format=PARQUET \
    --project_id=reddit-employee-datasets \
    david_bermejo.subclu_v0050_subreddit_clusters_c_full \
    {gs_cluster_labels}
"""
)


bq load \
    --source_format=PARQUET \
    --project_id=reddit-employee-datasets \
    david_bermejo.subclu_v0050_subreddit_clusters_c_full \
    gs://i18n-subreddit-clustering/mlflow/mlruns/31/8da1ce07d7214ea1a384b445f6d5db5d/artifacts/df_labels/df_labels.parquet



In [17]:
%%time
!bq load \
    --source_format=PARQUET \
    --project_id=reddit-employee-datasets \
    david_bermejo.subclu_v0050_subreddit_clusters_c_full \
    $gs_cluster_labels


BigQuery error in load operation: Access Denied: Project reddit-employee-
datasets: User does not have bigquery.jobs.create permission in project reddit-
employee-datasets.
CPU times: user 7.71 ms, sys: 112 ms, total: 119 ms
Wall time: 1.07 s
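Because a `!bq load` cell reports failures only in its printed output, it can be easy to miss an error like the one above. The following is a rough sketch of an alternative that runs the command through `subprocess` and checks the exit code. The helper names `build_bq_load_cmd` and `run_and_check` are hypothetical, and the login hint assumes the authentication command described earlier in this section is the right fix.

```python
import shlex
import subprocess


def build_bq_load_cmd(gcs_uri, table, project_id):
    """Return the `bq load` invocation as an argument list (hypothetical helper)."""
    return [
        "bq", "load",
        "--source_format=PARQUET",
        f"--project_id={project_id}",
        table,
        gcs_uri,
    ]


def run_and_check(cmd):
    """Run a command and return (ok, combined stdout and stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    if result.returncode != 0 and "Access Denied" in output:
        # Assumption: re-authenticating resolves the permission error
        output += "\nHint: run `gcloud auth application-default login` and retry."
    return result.returncode == 0, output


cmd = build_bq_load_cmd(
    gcs_uri=(
        "gs://i18n-subreddit-clustering/mlflow/mlruns/31/"
        "8da1ce07d7214ea1a384b445f6d5db5d/artifacts/df_labels/df_labels.parquet"
    ),
    table="david_bermejo.subclu_v0050_subreddit_clusters_c_full",
    project_id="reddit-employee-datasets",
)
# Shell-quoted form that can be pasted into a terminal
print(" ".join(shlex.quote(arg) for arg in cmd))

# Uncomment where `bq` is installed and authenticated:
# ok, output = run_and_check(cmd)
# print(ok, output)
```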
