# Purpose

### 2023-02-21
Use this notebook to take subreddit embeddings that have already been reshaped (to newline-delimited JSON in GCS) and load them into BigQuery (BQ) tables.

---
### Colab & Installing new libraries

**As of 2023-02:** Steps to run the notebook:
- Click on `Runtime` > `Run All`: 
    - To install required libraries
- Click on `Runtime` > `Restart and Run All`
    - To be able to actually use the new libraries

Unfortunately, Colab can't import libraries installed with `pip install -e <...>` until the runtime restarts, so you need to `Restart and Run All` before you can use the newly installed libraries **sigh**. Stack Overflow sources:
- [How to 'restart runtime' using python code or command line interface?](https://stackoverflow.com/questions/53154369/google-colab-how-to-restart-runtime-using-python-code-or-command-line-interf)
- [How to stop Google Colab runtime without it automatically restarting](https://stackoverflow.com/questions/68640984/how-to-stop-google-colab-runtime-without-it-automatically-restarting)
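
The restart step can also be triggered from code. This is a minimal sketch, assuming the common workaround of killing the kernel process so that Colab starts a fresh runtime (it is not an official Colab API). `is_installed` and `restart_runtime` are hypothetical helper names, and the check is only a rough one for top-level module names.

```python
import importlib.util
import os


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


def restart_runtime() -> None:
    """Kill the current kernel process (assumption: Colab then starts a fresh runtime)."""
    os.kill(os.getpid(), 9)


# Possible usage right after the `pip install` cell (left commented out on purpose,
# because it stops the notebook and you still need to run the cells again):
# if not is_installed("subclu"):
#     restart_runtime()
```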

# Imports & notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

# Register bigquery magic (only needed for laptop/local, not colab)
# %load_ext google.cloud.bigquery

In [2]:
# colab auth for BigQuery, google drive, & google sheets (gspread)
from google.colab import auth, files, drive
from google.auth import default
import sys  # need sys to mount gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


## Install custom library

This has some custom utils that make plotting and displaying data nicer/easier.

### Append google drive path so we can install library from there

In [3]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')

except Exception:
    # fall back to the private mount helper before giving up
    try:
        drive._mount(g_drive_root, force_remount=True)
        print('   Authenticated & mounted Google Drive')
    except Exception as e:
        print(e)
        # chain the original error so the real cause stays in the traceback
        raise Exception('You might need to manually mount google drive to colab') from e

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install library

In [4]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e "$module_path" --quiet

  Preparing metadata (setup.py) ... done


## Imports

In [5]:
from datetime import datetime, timedelta
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.8.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.21.6
pandas		v: 1.3.5
subclu		v: 0.6.1


In [6]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Load config to Reshape & Load Subreddit Embeddings

The embedding aggregation should've been logged to `mlflow` so we should be able to
- make calls to mlflow to get the embeddings
- add the new embeddings format to the original job

---

We have 2 strategies for getting subreddit embeddings. For now we'll only publish the `extra weight to subreddit meta` variant because it performed slightly better in previous taxonomy/rating models.




In [7]:
cfg_reshape_embeddings_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0_desc_extra_weight.yaml',
    config_path="../config",
)

In [8]:
cfg_reshape_embeddings_061_wt = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.1_desc_extra_weight.yaml',
    config_path="../config",
)

In [9]:
for k_, v_ in cfg_reshape_embeddings_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
            # print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: badc44b0e5ac467da14f710da0b410c6
embeddings_artifact_path: df_subs_agg_c1
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: False,
pt: 2022-08-10
model_version: v0.6.0
model_name: cau-text-mUSE extra weight for subreddit description
embeddings_config: aggregate_embeddings_v0.6.0


In [10]:
for k_, v_ in cfg_reshape_embeddings_061_wt.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
            # print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: 91ac7ca171024c779c0992f59470c81b
embeddings_artifact_path: df_subs_agg_c1
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: True,
pt: 2022-11-08
model_version: v0.6.1
model_name: cau-text-mUSE extra weight for subreddit description
embeddings_config: aggregate_embeddings_v0.6.1
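
The two printouts above are nearly identical, so it can be easier to look only at the keys that differ. A small sketch, assuming both configs are plain dictionaries like `config_dict`; `diff_configs` is a hypothetical helper (not part of `subclu`) and the sample values are copied from the output above.

```python
from typing import Any, Dict, Tuple


def diff_configs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (old_value, new_value)} for keys whose values differ.

    A key missing from one side shows up with None on that side.
    """
    all_keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in all_keys if old.get(k) != new.get(k)}


# Sample values taken from the two config printouts
cfg_060 = {"bq_table": "cau_subreddit_embeddings", "pt": "2022-08-10", "model_version": "v0.6.0"}
cfg_061 = {"bq_table": "cau_subreddit_embeddings", "pt": "2022-11-08", "model_version": "v0.6.1"}

for key, (old_value, new_value) in diff_configs(cfg_060, cfg_061).items():
    print(f"{key}: {old_value} -> {new_value}")
```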


# Start MLflow & Get Artifact Paths

This mlflow server will help us get the artifacts & subreddit-embeddings for the `mlflow_run_id` set in each of the configurations above.
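
In Colab the tracking database is a SQLite file on Google Drive, so the tracking URI is built from a file path. A minimal sketch of that step, assuming the SQLAlchemy-style convention where `sqlite:///` is followed by the path (an absolute path therefore ends up with four slashes); `sqlite_tracking_uri` is a hypothetical helper and the example path is made up.

```python
from pathlib import PurePosixPath


def sqlite_tracking_uri(db_path: str) -> str:
    """Build a SQLite tracking URI from the path to a database file."""
    return f"sqlite:///{PurePosixPath(db_path)}"


# Absolute path -> four slashes; relative path -> three slashes
print(sqlite_tracking_uri("/content/drive/example/mlruns.db"))
print(sqlite_tracking_uri("mlruns.db"))
```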

In [11]:
# usual path fails. Instead use custom colab location
# mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings_wt.config_dict['mlflow_tracking_uri'])
path_mlruns_colab_ = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/mlflow_sync/colab"

mlf = MlflowLogger(
    tracking_uri=f"sqlite:///{path_mlruns_colab_}/mlruns.db",
)

In [12]:
mlf.list_experiment_meta(output_format='pandas').tail(9)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
35,35,v0.6.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/35,active
36,36,v0.6.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/36,active
37,37,v0.6.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/37,active
38,38,v0.6.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/38,active
39,39,v0.6.1_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/39,active
40,40,v0.6.1_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/40,active
41,41,v0.6.1_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/41,active
42,42,v0.6.1_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/42,active
43,43,v0.6.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/43,active


In [13]:
%%time

df_mlf = mlf.search_all_runs()
print(df_mlf.shape)

(997, 732)
CPU times: user 6.1 s, sys: 620 ms, total: 6.72 s
Wall time: 22.8 s


In [14]:
(
    df_mlf[
        df_mlf['run_id'].isin([cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'], cfg_reshape_embeddings_wt.config_dict['mlflow_run_id']])
    ]
    .drop(columns=[c for c in df_mlf if any([c.startswith('metrics.memory_'), c.startswith('params.memory_')])])
    .dropna(axis=1, how='all')
    .iloc[:, :25]
    # .T
)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.cpu_count,metrics.df_v_post_comments-cols,metrics.df_v_subs-cols,metrics.time_fxn-data_loading_time,metrics.df_v_post_comments-rows,metrics.df_v_subs-rows,metrics.df_subs_agg_c1-cols,metrics.df_subs_agg_c1_uw-rows,metrics.time_fxn-df_subs_agg_c1,metrics.time_fxn-df_subs_agg_c1_uw,metrics.df_posts_agg_c1-cols,metrics.df_subs_agg_c1_uw-cols,metrics.time_fxn-df_posts_agg_c1_no_delay,metrics.df_posts_agg_c1-rows,metrics.time_fxn-full_aggregation_fxn_minutes,metrics.df_subs_agg_c1-rows,params.mlflow_tracking_uri,params.cpu_count,params.host_name
57,91ac7ca171024c779c0992f59470c81b,40,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts,2022-11-07 21:38:57.662000+00:00,2022-11-23 02:06:49.677000+00:00,96.0,515.0,514.0,5.742884,53597817.0,781653.0,515.0,781653.0,14.350225,14.350225,515.0,515.0,555.547501,53597817.0,673.789062,781653.0,sqlite,96,djb-100-2021-04-28-djb-eda-german-subs
59,badc44b0e5ac467da14f710da0b410c6,35,FINISHED,gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts,2022-08-16 08:41:53.006000+00:00,2022-09-10 00:54:17.303000+00:00,96.0,515.0,514.0,3.698822,51906348.0,771760.0,515.0,771760.0,15.926672,15.926672,515.0,515.0,544.288655,51906348.0,820.674805,771760.0,sqlite,96,djb-100-2021-04-28-djb-eda-german-subs


# Get path to latest reshaped data
Since we already reshaped the data and it's in GCS, let's read that instead of reshaping & uploading to GCS again.

In [15]:
%%time

l_artifacts_top_level = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
    full_path=True,
)

l_artifacts_all = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_wt.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
    full_path=True,
)

17:28:39 | INFO | "   293 <- Artifacts to check count"
17:28:39 | INFO | "   293 <- Artifacts clean count"
17:28:39 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"
17:28:47 | INFO | "   293 <- Artifacts clean count"
17:28:47 | INFO | "    10 <- Artifacts & folders at TOP LEVEL clean count"


CPU times: user 14.2 s, sys: 351 ms, total: 14.6 s
Wall time: 17.4 s


In [16]:
%%time

l_artifacts_top_level_061 = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'],
    only_top_level=True,
    verbose=True,
    full_path=True,
)

l_artifacts_all_061 = mlf.list_run_artifacts(
    run_id=cfg_reshape_embeddings_061_wt.config_dict['mlflow_run_id'],
    only_top_level=False,
    verbose=False,
    full_path=True,
)

17:28:55 | INFO | "   342 <- Artifacts to check count"
17:28:55 | INFO | "   342 <- Artifacts clean count"
17:28:55 | INFO | "     9 <- Artifacts & folders at TOP LEVEL clean count"
17:29:05 | INFO | "   342 <- Artifacts clean count"
17:29:05 | INFO | "     9 <- Artifacts & folders at TOP LEVEL clean count"


CPU times: user 14.8 s, sys: 368 ms, total: 15.1 s
Wall time: 17.6 s


In [17]:
# get the folder with the parquet files. This should be a full folder with multiple parquet files
folder_parquet_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path']
)
# exact match on the last path segment so that `<folder>_ndjson` is not picked up
folder_parquet_full_ = [i_ for i_ in l_artifacts_top_level if i_.split('/')[-1] == folder_parquet_][0]

print(f"\nFolder with parquet files (full):\n{folder_parquet_full_}\n")

# files whose immediate parent folder is the parquet folder
[_ for _ in l_artifacts_all if folder_parquet_ == _.split('/')[-2]]


Folder with parquet files (full):
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1



['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_common_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/_metadata',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.0.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.1.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.2.parquet',
 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1/part.3.parquet']
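
The filter above keeps only files whose immediate parent folder matches the artifact name exactly, which keeps sibling folders such as `df_subs_agg_c1_ndjson` out of the result. The same idea as a reusable sketch; `files_in_artifact_folder` is a hypothetical helper and the `gs://example-bucket/...` URIs are made up for illustration.

```python
from typing import List


def files_in_artifact_folder(artifact_uris: List[str], folder_name: str) -> List[str]:
    """Keep URIs whose immediate parent folder is exactly `folder_name`."""
    kept = []
    for uri in artifact_uris:
        parts = uri.rstrip("/").split("/")
        if len(parts) >= 2 and parts[-2] == folder_name:
            kept.append(uri)
    return kept


sample_uris = [
    "gs://example-bucket/artifacts/df_subs_agg_c1",
    "gs://example-bucket/artifacts/df_subs_agg_c1/part.0.parquet",
    "gs://example-bucket/artifacts/df_subs_agg_c1/part.1.parquet",
    "gs://example-bucket/artifacts/df_subs_agg_c1_ndjson/embeddings.json",
]
print(files_in_artifact_folder(sample_uris, "df_subs_agg_c1"))
```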

In [18]:
# get path for latest ndjson output FILE
#  NOTE: there can be several files if the reshape job ran more than once.
#  File names end in a timestamp, so sorting puts the latest one last.
path_ndjson_ = (
    cfg_reshape_embeddings_wt.config_dict['embeddings_artifact_path'] + '_ndjson'
)
l_ndjson_files_ = sorted(_ for _ in l_artifacts_all if path_ndjson_ in _)
ndjson_path_ = l_ndjson_files_[-1]
print(f"ndjson file list:\n{l_ndjson_files_}")

print(f"\nFile to upload to BQ:\n{ndjson_path_}")

ndjson file list:
['gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json']

File to upload to BQ:
gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json


In [19]:
# get path for latest ndjson output FILE
#  NOTE: there can be several files if the reshape job ran more than once.
#  File names end in a timestamp, so sorting puts the latest one last.
path_ndjson_061 = (
    cfg_reshape_embeddings_061_wt.config_dict['embeddings_artifact_path'] + '_ndjson'
)
l_ndjson_files_061 = sorted(_ for _ in l_artifacts_all_061 if path_ndjson_061 in _)
ndjson_path_061 = l_ndjson_files_061[-1]
# print(f"ndjson file list:\n{l_ndjson_files_061}")

print(f"\nFile to upload to BQ:\n{ndjson_path_061}")


File to upload to BQ:
gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json
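
Taking the last item of the list only returns the newest file if the list happens to be in chronological order. A sketch that parses the timestamp explicitly, assuming file names end in `_YYYY-MM-DD_HHMMSS.json` as in the output above; `latest_ndjson` is a hypothetical helper and the example URIs are made up.

```python
import re
from datetime import datetime
from typing import List, Optional

# Timestamp suffix such as `_2022-08-31_035824.json`
_TS_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2}_\d{6})\.json$")


def latest_ndjson(uris: List[str]) -> Optional[str]:
    """Return the URI with the most recent timestamp suffix, or None if none match."""
    dated = []
    for uri in uris:
        match = _TS_PATTERN.search(uri)
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d_%H%M%S")
            dated.append((timestamp, uri))
    return max(dated)[1] if dated else None


example_uris = [
    "gs://example-bucket/ndjson/subreddit_embeddings_2022-11-18_171217.json",
    "gs://example-bucket/ndjson/subreddit_embeddings_2022-08-31_035824.json",
]
print(latest_ndjson(example_uris))
```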


# Run new function to only upload existing reshaped data

In [20]:
# Undefined name: the NameError stops `Run All` here, before the upload cells below
BREAK

NameError: ignored

In [None]:
%%time

load_data_to_bq_table(
    uri="",  # ndjson_path_
    bq_project=cfg_reshape_embeddings_wt.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings_wt.config_dict['bq_dataset'],
    bq_table_name=cfg_reshape_embeddings_wt.config_dict['bq_table'],
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings_wt.config_dict['bq_table_description'],
    update_table_description=False,
)

## Use SQL to insert data (in case loading from python breaks)

To avoid recreating the schema in SQL, run the python function above first, even if it fails to load the data.

In [None]:
print(f"\nFile to upload to BQ:\n{ndjson_path_}")

In [None]:
# %%time
# %%bigquery df_embeddings --project data-science-prod-218515

# LOAD DATA INTO `reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings`
# FROM FILES (
#   format = 'JSON',
#   uris = ["gs://i18n-subreddit-clustering/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-08-31_035824.json"]
# );

In [None]:
print(f"\nFile to upload to BQ:\n{ndjson_path_061}")

In [None]:
# %%time
# %%bigquery df_embeddings --project data-science-prod-218515

# LOAD DATA INTO `reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings`
# FROM FILES (
#   format = 'JSON',
#   uris = ["gs://i18n-subreddit-clustering/mlflow/mlruns/40/91ac7ca171024c779c0992f59470c81b/artifacts/df_subs_agg_c1_ndjson/subreddit_embeddings_2022-11-18_171217.json"]
# );
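
The two commented-out statements above differ only in the file URI, so the SQL can be generated from values that are already in the notebook. A minimal sketch, assuming the inputs come from the trusted config (there is no escaping or validation); `build_load_data_sql` is a hypothetical helper and the example file URI is made up.

```python
from typing import List


def build_load_data_sql(project: str, dataset: str, table: str, uris: List[str]) -> str:
    """Build a BigQuery `LOAD DATA INTO ... FROM FILES` statement for ndjson files."""
    uri_list = ", ".join(f'"{uri}"' for uri in uris)
    return (
        f"LOAD DATA INTO `{project}.{dataset}.{table}`\n"
        "FROM FILES (\n"
        "  format = 'JSON',\n"
        f"  uris = [{uri_list}]\n"
        ");"
    )


sql = build_load_data_sql(
    project="reddit-employee-datasets",
    dataset="david_bermejo",
    table="cau_subreddit_embeddings",
    uris=["gs://example-bucket/ndjson/subreddit_embeddings.json"],  # made-up URI
)
print(sql)
```

In the notebook, the `uris` argument would be `[ndjson_path_]` or `[ndjson_path_061]`, and the printed statement can be pasted into a `%%bigquery` cell or the BigQuery console.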