# Purpose

2021-10-15: Use this notebook to test the new clustering script and check its logs. It also documents the different options for running the script with Hydra.

Focus:
- clustering algorithms at the **subreddit level**.


# Imports & Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import logging
import os
from pathlib import Path

import numpy as np
import pandas as pd
import plotly
import seaborn as sns

import mlflow
import hydra

import subclu
from subclu.eda.aggregates import compare_raw_v_weighted_language
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.utils.mlflow_logger import MlflowLogger
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl, 
    get_color_dict, base_colors_for_manual_labels,
    check_colors_used,
)
from subclu.data.data_loaders import LoadPosts, LoadSubreddits, create_sub_level_aggregates


# ===
# imports specific to this notebook
from collections import Counter
# import umap
# import openTSNE
# from openTSNE import TSNE

# import hdbscan

import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize  # after L2-normalizing rows, euclidean distance ranks pairs the same way as cosine distance

from sklearn.cluster import KMeans, DBSCAN, OPTICS, AgglomerativeClustering

print_lib_versions([hydra, np, pd, plotly, sklearn, sns, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
sklearn		v: 0.24.1
seaborn		v: 0.11.1
subclu		v: 0.4.0
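
The comment on the `normalize` import can be checked numerically. For unit-length vectors, squared Euclidean distance equals `2 * (1 - cosine similarity)`, so the two measures rank pairs identically. A minimal sketch in plain Python with made-up vectors; the helper names are illustrative and are not part of `subclu` or scikit-learn:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_sim(a, b):
    """Cosine similarity of two raw (not necessarily normalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up example vectors (not real embeddings)
a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos = cosine_sim(a, b)
sq_dist = squared_euclidean(a_unit, b_unit)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
assert math.isclose(sq_dist, 2 * (1 - cos))
```

This is the reason ward linkage, which scikit-learn restricts to Euclidean distance, can still approximate a cosine-based clustering when the embeddings are normalized first.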


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'
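
The four slashes in the URI are expected: `sqlite:///` is the scheme followed by an empty host, and the absolute path contributes its own leading slash. A small sketch of how such a URI can be composed; `build_sqlite_uri` is a hypothetical helper for illustration and is not the `MlflowLogger` implementation:

```python
from pathlib import PurePosixPath


def build_sqlite_uri(db_path):
    """Return a SQLAlchemy-style SQLite URI for an absolute POSIX file path."""
    path = PurePosixPath(db_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {db_path!r}")
    # Scheme + empty host ("sqlite://") + "/" + absolute path ("/home/...")
    return f"sqlite:///{path}"


uri = build_sqlite_uri(
    "/home/jupyter/subreddit_clustering_i18n/mlflow_sync/"
    "djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
)
print(uri)
```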

## Get list of experiments with new function

In [5]:
mlf.list_experiment_meta(output_format='pandas')

| | experiment_id | name | artifact_location | lifecycle_stage |
|---|---|---|---|---|
| 0 | 0 | Default | ./mlruns/0 | active |
| 1 | 1 | fse_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/1 | active |
| 2 | 2 | fse_vectorize_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/2 | active |
| 3 | 3 | subreddit_description_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/3 | active |
| 4 | 4 | fse_vectorize_v1.1 | gs://i18n-subreddit-clustering/mlflow/mlruns/4 | active |
| 5 | 5 | use_multilingual_v0.1_test | gs://i18n-subreddit-clustering/mlflow/mlruns/5 | active |
| 6 | 6 | use_multilingual_v1 | gs://i18n-subreddit-clustering/mlflow/mlruns/6 | active |
| 7 | 7 | use_multilingual_v1_aggregates_test | gs://i18n-subreddit-clustering/mlflow/mlruns/7 | active |
| 8 | 8 | use_multilingual_v1_aggregates | gs://i18n-subreddit-clustering/mlflow/mlruns/8 | active |
| 9 | 9 | v0.3.2_use_multi_inference_test | gs://i18n-subreddit-clustering/mlflow/mlruns/9 | active |


## Get run IDs to use for clustering

Two runs completed with the same parameters, so either one should work. For now, let's select:<br>
`0591fdae9b7d4da7ae3839767b8aab66`

In [6]:
%%time

df_mlf = mlf.search_all_runs(experiment_ids=[16])
df_mlf.shape

CPU times: user 49.6 ms, sys: 13.7 ms, total: 63.2 ms
Wall time: 62.4 ms


(13, 86)

In [7]:
mask_finished = df_mlf['status'] == 'FINISHED'
mask_df_similarity_complete = ~df_mlf['metrics.df_sub_level_agg_a_post_only_similarity-rows'].isnull()

df_mlf_clustering_candidates = df_mlf[mask_finished & mask_df_similarity_complete]
df_mlf_clustering_candidates.shape

(2, 86)

In [8]:
cols_with_multiple_vals = df_mlf_clustering_candidates.columns[df_mlf_clustering_candidates.nunique(dropna=False) > 1]

df_mlf_clustering_candidates[cols_with_multiple_vals]

| | run_id | artifact_uri | start_time | end_time | metrics.memory_used | metrics.vectorizing_time_minutes | metrics.memory_free | metrics.memory_used_percent | params.memory_used | params.f_log_file | params.memory_free | params.memory_used_percent | params.run_name | tags.mlflow.runName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cbb12818e82345dda96928bfdab8b16b | gs://i18n-subreddit-clustering/mlflow/mlruns/16/cbb12818e82345dda96928bfdab8b16b/artifacts | 2021-10-12 10:46:05.235000+00:00 | 2021-10-12 16:41:33.492000+00:00 | 702999.0 | 355.468028 | 3465918.0 | 0.181436 | 278514 | logs/AggregateEmbeddings/2021-10-12_10-46-05_agg_full_lc_false_pd-2021-10-12_104604.log | 3465918 | 0.0718813699564913 | agg_full_lc_false_pd-2021-10-12_104604 | agg_full_lc_false_pd-2021-10-12_104604 |
| 1 | 0591fdae9b7d4da7ae3839767b8aab66 | gs://i18n-subreddit-clustering/mlflow/mlruns/16/0591fdae9b7d4da7ae3839767b8aab66/artifacts | 2021-10-12 10:27:33.324000+00:00 | 2021-10-12 16:40:41.501000+00:00 | 703208.0 | 373.134208 | 3681161.0 | 0.18149 | 64759 | logs/AggregateEmbeddings/2021-10-12_10-27-33_agg_full_lc_false-2021-10-12_102732.log | 3681161 | 0.0167135786244584 | agg_full_lc_false-2021-10-12_102732 | agg_full_lc_false-2021-10-12_102732 |


# Inspect config for clustering job

This config should include:
- data to load for clustering
- parameters for clustering algo
- hydra overrides to run jobs in parallel

In [9]:
test_experiment = 'v0.4.0_use_multi_clustering_test'

cfg_cluster_test_v040 = LoadHydraConfig(
    config_name='clustering_v0.4.0_base',
    config_path="../config",
    overrides=[
        f"mlflow_experiment_name={test_experiment}",
        # f"data_text_and_metadata=top_subreddits_2021_07_16",
        # f"data_embeddings_to_cluster=top_subs-2021_07_16-use_multi_lower_case_false_00",
    ],
)

print([k for k in cfg_cluster_test_v040.config_dict.keys()])

['data_text_and_metadata', 'data_embeddings_to_cluster', 'clustering_algo', 'embeddings_to_cluster', 'mlflow_tracking_uri', 'mlflow_experiment_name', 'pipeline']


In [10]:
# data with embeddings
cfg_cluster_test_v040.config_dict['data_embeddings_to_cluster']

{'run_uuid': '0591fdae9b7d4da7ae3839767b8aab66',
 'l_ix_sub': ['subreddit_name', 'subreddit_id'],
 'l_ix_post': ['subreddit_name', 'post_id'],
 'df_post_level_agg_b_post_and_comments': None,
 'df_post_level_agg_c_post_comments_sub_desc': 'df_post_level_agg_c_post_comments_sub_desc',
 'df_sub_level_agg_a_post_only': 'df_sub_level_agg_a_post_only',
 'df_sub_level_agg_a_post_only_similarity': 'df_sub_level_agg_a_post_only_similarity',
 'df_sub_level_agg_a_post_only_similarity_pair': 'df_sub_level_agg_a_post_only_similarity_pair',
 'df_sub_level_agg_a_post_only_similarity_top_pair': 'df_sub_level_agg_a_post_only_similarity_top_pair',
 'df_sub_level_agg_b_post_and_comments': None,
 'df_sub_level_agg_b_post_and_comments_similarity': None,
 'df_sub_level_agg_b_post_and_comments_similarity_pair': None,
 'df_sub_level_agg_c_post_comments_and_sub_desc': 'df_sub_level_agg_c_post_comments_and_sub_desc',
 'df_sub_level_agg_c_post_comments_and_sub_desc_similarity': 'df_sub_level_agg_c_post_comments_

In [11]:
# clustering algo
cfg_cluster_test_v040.config_dict['clustering_algo']

{'model_name': 'AgglomerativeClustering',
 'model_kwargs': {'n_clusters': 100,
  'affinity': 'euclidean',
  'linkage': 'ward',
  'connectivity': False}}
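
scikit-learn only accepts `affinity='euclidean'` when `linkage='ward'`, which matters once these kwargs are swept from the command line. A hedged sketch of a pre-flight check; `validate_agglomerative_kwargs` is a hypothetical helper and is not part of `subclu`:

```python
def validate_agglomerative_kwargs(model_kwargs):
    """Raise ValueError for a linkage/affinity pair that scikit-learn rejects.

    Missing keys fall back to scikit-learn's defaults for
    AgglomerativeClustering: linkage='ward', affinity='euclidean'.
    """
    linkage = model_kwargs.get("linkage", "ward")
    affinity = model_kwargs.get("affinity", "euclidean")
    if linkage == "ward" and affinity != "euclidean":
        raise ValueError(
            f"linkage='ward' requires affinity='euclidean', got {affinity!r}"
        )
    return model_kwargs


# The kwargs from the config above pass the check unchanged
config_kwargs = {
    "n_clusters": 100,
    "affinity": "euclidean",
    "linkage": "ward",
    "connectivity": False,
}
validate_agglomerative_kwargs(config_kwargs)
```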

# Run command-line function

The clustering function is in `subclu/models/clustering.py`.

Notes:
- We need the `-m` flag to run the file as a module, which allows its relative imports to work.
- With the `-m` flag, pass the dotted module name and REMOVE the `.py` ending of the file!
- On the command line, Hydra lets us override keys that already exist in the config without the `+` prefix; `+` is only needed to add a new key.
    - https://hydra.cc/docs/tutorials/basic/your_first_app/config_file
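
The notes above can be sketched as a small command builder. This is an illustration only: `module_name_from_path` and `build_command` are hypothetical helpers, and the override keys are the ones used in the commands later in this notebook:

```python
import sys
from pathlib import PurePosixPath


def module_name_from_path(rel_path):
    """Turn 'subclu/models/clustering.py' into 'subclu.models.clustering'."""
    return ".".join(PurePosixPath(rel_path).with_suffix("").parts)


def build_command(module, overrides, multirun=False):
    """Build an argv list for `python -m <module> [--multirun] key=value ...`."""
    cmd = [sys.executable, "-m", module]
    if multirun:
        cmd.append("--multirun")
    # Existing config keys are overridden as plain key=value (no `+` prefix)
    cmd.extend(f"{key}={value}" for key, value in overrides.items())
    return cmd


module = module_name_from_path("subclu/models/clustering.py")
cmd = build_command(
    module,
    {
        "mlflow_experiment_name": "v0.4.0_use_multi_clustering_test",
        "n_sample_embedding_rows": 4000,
    },
)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, cwd=path_djb_repo)`, which takes the place of the `cd ... &&` prefix used in the notebook cells.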

## Check & set paths

In [12]:
test_experiment

'v0.4.0_use_multi_clustering_test'

In [13]:
os.getcwd()

'/home/jupyter/subreddit_clustering_i18n/notebooks/v0.4.0'

In [14]:
path_djb_repo = '/home/david.bermejo/repos/subreddit_clustering_i18n/' 
path_djb_models = '/home/david.bermejo/repos/subreddit_clustering_i18n/subclu/models' 
file_clustering_py = 'subclu.models.clustering'

In [15]:
# !ls

In [16]:
# !cd $path_djb_repo && ls

In [17]:
# !cd $path_djb_models && ls

## Run clustering from CLI

```bash
!cd $path_djb_repo && python -m $file_clustering_py mlflow_experiment_name=$test_experiment n_sample_embedding_rows=4000
```

In [54]:
file_clustering_py

'subclu.models.clustering'

In [66]:
# run on sample data, test experiment

!cd $path_djb_repo && python -m $file_clustering_py mlflow_experiment_name=$test_experiment n_sample_embedding_rows=4000 filter_embeddings.filter_subreddits.minimum_column_value=9

CFG keys: dict_keys(['data_text_and_metadata', 'data_embeddings_to_cluster', 'clustering_algo', 'embeddings_to_cluster', 'n_sample_embedding_rows', 'filter_embeddings', 'mlflow_tracking_uri', 'mlflow_experiment_name', 'pipeline'])
`2021-10-26 06:17:15,096` | `INFO` | `Define cluster class...`
`2021-10-26 06:17:15,795` | `INFO` | `== Start run_aggregation() method ==`
`2021-10-26 06:17:15,795` | `INFO` | `MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db`
`2021-10-26 06:17:15,948` | `INFO` | `=== START CLUSTERING - Process ID 45575`
`2021-10-26 06:17:16,047` | `INFO` | `host_name: djb-100-2021-04-28-djb-eda-german-subs`
`2021-10-26 06:17:16,047` | `INFO` | `cpu_count: 96`
`2021-10-26 06:17:16,116` | `INFO` | `RAM stats:
{'memory_used_percent': '0.93%', 'memory_total': '628,888', 'memory_used': '5,852', 'memory_free': '619,916'}`
`2021-10-26 06:17:16,209` | `INFO` | `Using hydra's path`
  Current working di

## Test multirun

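The `--multirun` flag launches one job per combination of the swept values. Jobs run in parallel only when a parallel launcher (for example, Hydra's Joblib launcher plugin) is configured; Hydra's default launcher runs them one after another.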

In [None]:
# run on sample data, multi-run

!cd $path_djb_repo && python -m $file_clustering_py --multirun \
    "filter_embeddings.filter_subreddits.minimum_column_value=range(2, 10, 1)" \
    mlflow_experiment_name=$test_experiment \
    n_sample_embedding_rows=4000
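
To see how many jobs a sweep launches, the overrides can be expanded by hand. This sketch assumes Hydra's default sweeper, which takes the Cartesian product of all swept values, and that `range(2, 10, 1)` excludes its stop value as Python's `range` does. `expand_sweep` is a hypothetical helper, and the last two dictionaries mirror the full-data sweeps later in this notebook:

```python
from itertools import product


def expand_sweep(sweep):
    """Return every override combination in a sweep as a list of dicts."""
    keys = list(sweep)
    return [dict(zip(keys, combo)) for combo in product(*(sweep[k] for k in keys))]


min_key = "filter_embeddings.filter_subreddits.minimum_column_value"
min_values = list(range(2, 10, 1))  # 2..9; the stop value 10 is excluded

sample_sweep = {
    min_key: min_values,
    "mlflow_experiment_name": ["v0.4.0_use_multi_clustering_test"],
    "n_sample_embedding_rows": [4000],
}
ward_sweep = {
    min_key: min_values,
    "pipeline.normalize.add_step": [False, True],
    "pipeline.reduce.add_step": [False, True],
}
non_ward_sweep = {
    "clustering_algo.model_kwargs.linkage": ["complete", "average"],
    "clustering_algo.model_kwargs.affinity": ["cosine", "euclidean"],
    min_key: min_values,
    "pipeline.reduce.add_step": [False, True],
    "pipeline.normalize.add_step": [False],
}

print(len(expand_sweep(sample_sweep)))    # 8 * 1 * 1 = 8
print(len(expand_sweep(ward_sweep)))      # 8 * 2 * 2 = 32
print(len(expand_sweep(non_ward_sweep)))  # 2 * 2 * 8 * 2 * 1 = 64
```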

## Run on full data (one job)

In [40]:
# # run on full data, still a test

# !cd $path_djb_repo && python -m $file_clustering_py mlflow_experiment_name=$test_experiment

In [56]:
# # # run on full data, no longer a test

# !cd $path_djb_repo && python -m $file_clustering_py

## Run on full data (multijob)

In [76]:
# ward-related jobs
!cd $path_djb_repo && python -m $file_clustering_py --multirun \
    "filter_embeddings.filter_subreddits.minimum_column_value=range(2, 10, 1)" \
    "pipeline.normalize.add_step=choice(false, true)" \
    "pipeline.reduce.add_step=choice(false, true)"

In [75]:
# non-ward linkages

!cd $path_djb_repo && python -m $file_clustering_py --multirun \
    "clustering_algo.model_kwargs.linkage=choice('complete', 'average')" \
    "clustering_algo.model_kwargs.affinity=choice('cosine', 'euclidean')" \
    "filter_embeddings.filter_subreddits.minimum_column_value=range(2, 10, 1)" \
    "pipeline.reduce.add_step=choice(false, true)" \
    "pipeline.normalize.add_step=false"