# Purpose

**2022-08-22: v0.6.0**
The default parquet format for my embeddings (one column per embedding dimension) is not the preferred format for BigQuery and other team standards.

The preferred format is a single column that holds repeated values (one array per row). For example:
- `data-prod-165221.ml_content.subreddit_embeddings_ft2` 
    - [console link](https://console.cloud.google.com/bigquery?project=data-science-prod-218515&ws=!1m10!1m4!4m3!1sdata-prod-165221!2sml_content!3ssimilar_subreddit_ft2!1m4!4m3!1sdata-prod-165221!2sml_content!3ssubreddit_embeddings_ft2)
    - github link
        - https://github.snooguts.net/reddit/gazette-models/blob/master/similar_subreddit/embeddings/__main__.py#L105-L112

In Python, each record is a dictionary that gets serialized to JSON:
```python
sr_embedding_dict = {
    "pt": date_today,
    "model_name": MODEL_NAME,
    "model_version": MODEL_VERSION,
    "subreddit_id": sr_dict["subreddit_id"],
    "subreddit_name": subreddit_name_lowercase,
    "embedding": sr_embedding.tolist(),
}
```
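As a minimal sketch of that serialization step: the values below are made up for illustration, and only the key names come from the record above. Each dictionary becomes one line of JSON, and the Python list is what ends up as the repeated column.

```python
import json

# Hypothetical values for illustration; only the keys mirror the record above.
sr_embedding_dict = {
    "pt": "2022-08-11",
    "model_name": "example-model",
    "model_version": "v0.6.0",
    "subreddit_id": "t5_example",
    "subreddit_name": "example_subreddit",
    "embedding": [0.1, -0.2, 0.3],  # a real embedding would be much longer
}

# One record -> one line of JSON text (no embedded newlines).
json_line = json.dumps(sr_embedding_dict)

# Reading the line back returns the same dictionary, list included.
round_trip = json.loads(json_line)
print(json_line)
```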

In BQ:
```
Field name      Type    Mode      Description
model_name      STRING  NULLABLE  Model name
model_version   STRING  NULLABLE  Model version
subreddit_id    STRING  NULLABLE  Subreddit id
subreddit_name  STRING  NULLABLE  Lower case subreddit name
embedding       FLOAT   REPEATED  Subreddit embeddings
```
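The same fields can be written in BigQuery's JSON schema layout (a list of objects with `name`, `type`, `mode`, and `description`). The sketch below copies the fields from the table above. `record_matches_schema` is a hypothetical helper added only for illustration: it handles `STRING` and `FLOAT` fields and is not part of `subclu` or of any Google library.

```python
import json

# Field definitions copied from the table above.
schema = [
    {"name": "model_name", "type": "STRING", "mode": "NULLABLE", "description": "Model name"},
    {"name": "model_version", "type": "STRING", "mode": "NULLABLE", "description": "Model version"},
    {"name": "subreddit_id", "type": "STRING", "mode": "NULLABLE", "description": "Subreddit id"},
    {"name": "subreddit_name", "type": "STRING", "mode": "NULLABLE", "description": "Lower case subreddit name"},
    {"name": "embedding", "type": "FLOAT", "mode": "REPEATED", "description": "Subreddit embeddings"},
]

# Assumption: only the two types used in this table need to be checked.
PYTHON_TYPES = {"STRING": str, "FLOAT": float}


def record_matches_schema(record, schema):
    """Hypothetical helper: check a dict against STRING/FLOAT schema fields."""
    for field in schema:
        value = record.get(field["name"])
        expected = PYTHON_TYPES[field["type"]]
        if field["mode"] == "REPEATED":
            # A repeated field must be a list whose items all have the field type.
            if not isinstance(value, list) or not all(isinstance(v, expected) for v in value):
                return False
        elif value is not None and not isinstance(value, expected):
            # NULLABLE fields may be missing, but not of the wrong type.
            return False
    return True


good_record = {
    "model_name": "example-model",
    "model_version": "v0.6.0",
    "subreddit_id": "t5_example",
    "subreddit_name": "example_subreddit",
    "embedding": [0.1, -0.2],
}
bad_record = dict(good_record, embedding=0.1)  # scalar instead of a list

schema_json = json.dumps(schema, indent=2)
print(record_matches_schema(good_record, schema), record_matches_schema(bad_record, schema))
```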

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from collections import defaultdict
from datetime import datetime, timedelta
import gc
import os
import json
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.utils.hydra_config_loader import LoadHydraConfig
from subclu.utils import set_working_directory, get_project_subfolder
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric,
    elapsed_time,
)

from subclu.utils.mlflow_logger import MlflowLogger
from subclu.models.bq_embedding_schemas import embeddings_schema
from subclu.models.reshape_embeddings_for_bq import reshape_embeddings_to_ndjson, reshape_embeddings_and_upload_to_bq
from subclu.utils.big_query_utils import load_data_to_bq_table

print_lib_versions([hydra, mlflow, np, pd, subclu])

python		v 3.7.10
===
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
subclu		v: 0.6.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set Local model path (for saving)

In [56]:
manual_model_timestamp = datetime.utcnow().strftime('%Y-%m-%d_%H%M%S')
path_this_model = get_project_subfolder(
    f"data/models/aggregate_embeddings/manual_v060_{manual_model_timestamp}"
)
path_this_model.mkdir(parents=True, exist_ok=True)
path_this_model

PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-31_030238')

# Load config for embeddings to reshape and where to save them

The embedding aggregation should already have been logged to `mlflow`, so we should be able to:
- call mlflow to get the embeddings
- add the new embeddings format to the original job



In [57]:
cfg_reshape_embeddings = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq-subreddit-v0.6.0.yaml',
    config_path="../config",
#     overrides=[
#         f"mlflow_experiment=v0.6.0_mUSE_aggregates_test",
#         f"n_sample_posts_files=2",
#         f"n_parallel_jobs=4",
#     ],
)
# print(cfg_reshape_embeddings.config_dict.keys())

In [58]:
for k_, v_ in cfg_reshape_embeddings.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        for k2_, v2_ in v_.items():
            pass
            # print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to reshape embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: badc44b0e5ac467da14f710da0b410c6
embeddings_artifact_path: df_subs_agg_c1_unweighted
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: True,
pt: 2022-08-11
model_version: v0.6.0
model_name: cau-text-mUSE-embeddings
embeddings_config: aggregate_embeddings_v0.6.0


# Start MLflow & Load subreddit embeddings

In [8]:
mlf = MlflowLogger(tracking_uri=cfg_reshape_embeddings.config_dict['mlflow_tracking_uri'])

In [9]:
%%time

df_agg_sub_c = mlf.read_run_artifact(
    run_id=cfg_reshape_embeddings.config_dict['mlflow_run_id'],
    artifact_folder=cfg_reshape_embeddings.config_dict['embeddings_artifact_path'],
    read_function='pd_parquet',
    verbose=False,
)
print(df_agg_sub_c.shape)

02:05:21 | INFO | "Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/mlflow/mlruns/35/badc44b0e5ac467da14f710da0b410c6/artifacts/df_subs_agg_c1_unweighted"
100%|##########################################| 6/6 [00:00<00:00, 15857.48it/s]
02:05:22 | INFO | "  Parquet files found:     4"
02:05:22 | INFO | "  Parquet files to use:     4"


(771760, 515)
CPU times: user 8.79 s, sys: 4.22 s, total: 13 s
Wall time: 7.38 s


# Check distribution of posts for embeddings
We'd expect ~340k subs with 3+ posts

In [10]:
df_agg_sub_c['posts_for_embeddings_count'].describe()

count    771760.000000
mean         67.257111
std         479.863049
min           0.000000
25%           1.000000
50%           2.000000
75%           8.000000
max        8400.000000
Name: posts_for_embeddings_count, dtype: float64

In [11]:
value_counts_and_pcts(
    df_agg_sub_c['posts_for_embeddings_count'],
    sort_index=True,
    sort_index_ascending=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-pct_cumulative_sum
0,65797,8.5%,8.5%
1,240338,31.1%,39.7%
2,125070,16.2%,55.9%
3,56898,7.4%,63.2%
4,33084,4.3%,67.5%
5,23205,3.0%,70.5%
6,17701,2.3%,72.8%
7,13792,1.8%,74.6%
8,11189,1.4%,76.1%
9,9664,1.3%,77.3%


In [31]:
value_counts_and_pcts(
    pd.cut(
        df_agg_sub_c['posts_for_embeddings_count'],
        bins=[-1, 0, 1, 2, 3, np.inf],
        labels=["0 posts", "1 post", '2 posts', '3 posts', '4+ posts']
    ),
    sort_index=True,
    sort_index_ascending=False,
    cumsum_count=True,
)

Unnamed: 0,posts_for_embeddings_count-count,posts_for_embeddings_count-percent,posts_for_embeddings_count-cumulative_sum,posts_for_embeddings_count-pct_cumulative_sum
4+ posts,283657,36.8%,283657,36.8%
3 posts,56898,7.4%,340555,44.1%
2 posts,125070,16.2%,465625,60.3%
1 post,240338,31.1%,705963,91.5%
0 posts,65797,8.5%,771760,100.0%
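
The ~340k expectation can be checked directly against the binned counts. This is a small arithmetic sketch that uses only the numbers from the table above:

```python
# Counts copied from the binned table above.
counts = {
    "4+ posts": 283657,
    "3 posts": 56898,
    "2 posts": 125070,
    "1 post": 240338,
    "0 posts": 65797,
}

total = sum(counts.values())
subs_3_plus = counts["4+ posts"] + counts["3 posts"]
pct_3_plus = 100 * subs_3_plus / total

print(f"{subs_3_plus:,} subreddits with 3+ posts ({pct_3_plus:.1f}% of {total:,})")
# prints: 340,555 subreddits with 3+ posts (44.1% of 771,760)
```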


In [59]:
# value_counts_and_pcts(
#     pd.cut(
#         df_agg_sub_c['posts_for_embeddings_count'],
#         bins=[-1, 0, 1, 2, np.inf],
#         labels=["0 posts", "1 post", '2 posts', '3+ posts']
#     ),
#     sort_index=True,
#     sort_index_ascending=True,
# )

# Test new function that reshapes & uploads to BQ in a single call

**NOTE**: For testing, load a new config that has a different `mlflow_run_id` and a different target BQ table.
This way we save data to a test run and a test table, so the final target MLflow run (artifacts) and target table are not affected.
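
The idea can be sketched with plain dictionaries. The notebook itself loads a separate YAML file, so this is only an illustration: the run IDs and table names are the ones printed by the two configs in this notebook, and the guard at the end is an assumption about what is worth checking.

```python
# Values taken from the production and test configs printed in this notebook.
prod_config = {
    "mlflow_run_id": "badc44b0e5ac467da14f710da0b410c6",
    "bq_dataset": "david_bermejo",
    "bq_table": "cau_subreddit_embeddings",
}
test_overrides = {
    "mlflow_run_id": "ca79765b72c5428395b02926612d85fd",
    "bq_table": "cau_subreddit_embeddings_test",
}

# Later keys win, so the overrides replace the production values.
test_config = {**prod_config, **test_overrides}

# Guard (assumption): a test config must never point at the production run or table.
keys_that_must_differ = ["mlflow_run_id", "bq_table"]
is_isolated = all(test_config[k] != prod_config[k] for k in keys_that_must_differ)
print(is_isolated)
```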

In [62]:
cfg_reshape_emb_test = LoadHydraConfig(
    config_name='reshape_embeddings_for_bq_test-subreddit-v0.6.0.yaml',
    config_path="../config",
)
for k_, v_ in cfg_reshape_emb_test.config_dict.items():
    if isinstance(v_, dict):
        print(f"{k_}:")
        # for k2_, v2_ in v_.items():
        #    print(f"    {k2_}: {v2_}")
    else:
        print(f"{k_}: {v_}")

data_text_and_metadata:
data_embeddings_to_aggregate:
aggregate_params:
description: Use this config to TEST reshaping embeddings and upload them to BigQuery
bucket_output: i18n-subreddit-clustering
mlflow_tracking_uri: sqlite
mlflow_run_id: ca79765b72c5428395b02926612d85fd
embeddings_artifact_path: df_subs_agg_c1_unweighted
bq_project: reddit-employee-datasets
bq_dataset: david_bermejo
bq_table: cau_subreddit_embeddings_test
bq_table_description: Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
update_table_description: True,
pt: 2022-08-10
model_version: v0.6.0
model_name: cau-text-mUSE
embeddings_config: aggregate_embeddings_v0.6.0


In [63]:
# (
#     df_agg_sub_c
#     .sort_values(by=['posts_for_embeddings_count', 'subreddit_name'], ascending=[False, True])
#     .iloc[:22, :10]
# )

In [64]:
%%time

reshape_embeddings_and_upload_to_bq(
    df_agg_sub_c.head(10000),
    dict_reshape_config=cfg_reshape_emb_test.config_dict,
    save_path_local_root=path_this_model,
    f_name_prefix='subreddit_embeddings',
    embedding_col_prefix='embeddings_',
)

03:13:22 | INFO | "512 <- # embedding columns found"
03:13:22 | INFO | "(10000, 515) <- Shape of input df"
03:13:22 | INFO | "Metadata cols to add:
  {'mlflow_run_id': 'ca79765b72c5428395b02926612d85fd', 'pt': '2022-08-10', 'model_version': 'v0.6.0', 'model_name': 'cau-text-mUSE'}"
03:13:22 | INFO | "Converting embeddings to repeated format..."
03:13:23 | INFO | "(10000, 8) <- Shape of new df before converting to JSON"
03:13:23 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_name', 'model_version', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
03:13:23 | INFO | "Converting embeddings to JSON..."
03:13:24 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-31_030238/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_031322.json"
03:13:24 | INFO | "Logging to run ID: ca79765b72c5428395b02926612d85fd, artifact:
  df_subs_agg_c1_unweighted_ndjson"
03:13:27 | INFO |

CPU times: user 1.58 s, sys: 611 ms, total: 2.19 s
Wall time: 21.3 s


# Test new function to convert to JSON format

With a single call, this function:
- converts the embeddings
- saves them to a JSON file
- logs the file to mlflow

Then we can add a function on top of that to go one step further and also create the BQ table (TODO).
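
A rough sketch of the conversion and save steps, using plain dictionaries instead of a DataFrame. `reshape_rows_to_ndjson` is a hypothetical stand-in, not the `subclu` function; it assumes the embedding columns are already in the right order and it leaves out the mlflow logging.

```python
import json
import tempfile
from pathlib import Path


def reshape_rows_to_ndjson(rows, embedding_col_prefix, columns_to_add, save_path):
    """Hypothetical stand-in: collapse wide embedding columns into one list per row."""
    with open(save_path, "w") as f:
        for row in rows:
            emb_cols = [c for c in row if c.startswith(embedding_col_prefix)]
            record = dict(columns_to_add)  # metadata such as pt or model_version
            record.update({k: v for k, v in row.items() if k not in emb_cols})
            record["embeddings"] = [row[c] for c in emb_cols]
            f.write(json.dumps(record) + "\n")  # one record per line
    return Path(save_path)


# Made-up rows with two embedding dimensions each.
rows = [
    {"subreddit_id": "t5_a", "subreddit_name": "a", "embeddings_0": 0.1, "embeddings_1": 0.2},
    {"subreddit_id": "t5_b", "subreddit_name": "b", "embeddings_0": 0.3, "embeddings_1": 0.4},
]
out_dir = Path(tempfile.mkdtemp())
out_file = reshape_rows_to_ndjson(
    rows,
    embedding_col_prefix="embeddings_",
    columns_to_add={"pt": "2022-08-11", "model_version": "v0.6.0"},
    save_path=out_dir / "subreddit_embeddings.json",
)
records = [json.loads(line) for line in out_file.read_text().splitlines()]
print(records[0])
```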

In [None]:
BREAK

In [11]:
l_embedding_cols = [c for c in df_agg_sub_c.columns if c.startswith('embeddings_')]
print(len(l_embedding_cols))

512


In [21]:
# this is the test ID to open & log artifacts to
mlflow_test_run_id = 'ca79765b72c5428395b02926612d85fd'

In [28]:
%%time

# set values needed for JSON function
bq_col_keys = [
    'pt', 
    'model_version',
    'model_name',
    'mlflow_run_id'
]
d_cols_to_add = {k: v for k, v in cfg_reshape_embeddings.config_dict.items() if k in bq_col_keys}


path_local_json = (
    path_this_model / f"{cfg_reshape_embeddings.config_dict['embeddings_artifact_path']}_ndjson"
)

d_paths = reshape_embeddings_to_ndjson(
    df_agg_sub_c.head(700),
    embedding_cols=l_embedding_cols,
    columns_to_add=d_cols_to_add,
    f_name_prefix='subreddit_embeddings',
    save_path_local=path_local_json,
    log_to_mlflow=True,
    mlflow_run_id=mlflow_test_run_id,
)
print(d_paths)

01:29:53 | INFO | "(700, 515) <- Shape of input df"
01:29:53 | INFO | "(700, 8) <- Shape of new df before converting to JSON"
01:29:53 | INFO | "df output cols:
  ['pt', 'mlflow_run_id', 'model_version', 'model_name', 'subreddit_id', 'subreddit_name', 'posts_for_embeddings_count', 'embeddings']"
01:29:54 | INFO | "Saving file to:
  /home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-30_223816/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_012953.json"
01:29:54 | INFO | "Logging to run ID: ca79765b72c5428395b02926612d85fd, artifact:
  df_subs_agg_c1_unweighted_ndjson"


{'f_local': PosixPath('/home/jupyter/subreddit_clustering_i18n/data/models/aggregate_embeddings/manual_v060_2022-08-30_223816/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_012953.json'), 'mlflow_path': None, 'mlflow_artifact_path': 'gs://i18n-subreddit-clustering/mlflow/mlruns/35/ca79765b72c5428395b02926612d85fd/artifacts/df_subs_agg_c1_unweighted_ndjson/subreddit_embeddings_2022-08-31_012953.json'}
CPU times: user 357 ms, sys: 117 ms, total: 474 ms
Wall time: 6.6 s


# Test JSON format on a subset of columns (manual)

In [None]:
BREAK

In [140]:
%%time

df_test = df_agg_sub_c.head(100).copy()

bq_col_keys = [
    'pt', 
#     'model_version', 'model_name'
]
for c_ in bq_col_keys:
    df_test[c_] = str(cfg_reshape_embeddings.config_dict[c_]).strip()

df_test['embeddings'] = df_test[l_embedding_cols[:90]].values.tolist()
df_test.drop(l_embedding_cols, axis=1).iloc[:5, :15]

CPU times: user 861 µs, sys: 4.56 ms, total: 5.43 ms
Wall time: 4.85 ms


Unnamed: 0,subreddit_id,subreddit_name,posts_for_embeddings_count,pt,embeddings
0,t5_1001tl,jewel_xo,1,2022-08-11,"[-0.011266076937317848, 0.0012464923784136772, 0.035281483083963394, 0.04045214131474495, -0.06590830534696579, 0.009003894403576851, -0.0008995584212243557, -0.01666421815752983, -0.07016290724277496, 0.04436667636036873, 0.06609217822..."
1,t5_10029e,milkyhentai,1,2022-08-11,"[-0.0394916906952858, 0.007736351806670427, 0.03830709680914879, 0.045457467436790466, -0.027101224288344383, 0.033553291112184525, 0.040518589317798615, 0.009468913078308105, -0.011264792643487453, 0.030157815665006638, 0.0654373615980..."
2,t5_1006k8,badwouldyourather,1,2022-08-11,"[-0.008159023709595203, 0.03525131568312645, -0.0009117084555327892, 0.036374252289533615, 0.046087976545095444, 0.031007394194602966, 0.0035368474200367928, -0.0012258067727088928, -0.0403834730386734, 0.0007534259930253029, 0.05662947..."
3,t5_100806,jojojosiah,2,2022-08-11,"[-0.0031079817563295364, 0.022202856838703156, 0.03846222162246704, 0.015268548391759396, 0.009758785367012024, 0.02192338928580284, 0.008807502686977386, -0.036922186613082886, -0.007005440071225166, 0.07199138402938843, -0.00258467532..."
4,t5_1009a3,memesenespanol,380,2022-08-11,"[0.0037306491285562515, -0.013876136392354965, -0.003986774943768978, 0.0026833100710064173, -0.010201744735240936, 0.038551658391952515, 0.012758726254105568, 0.016535189002752304, -0.056692514568567276, 0.0011827077250927687, 0.009329..."


In [141]:
# df_test.drop(l_embedding_cols, axis=1)

In [142]:
%%time

# Option A: pandas converts to JSON text in one step.
#  Then we just save the file without having to iterate
df_json_txt = df_test.drop(l_embedding_cols, axis=1).to_json(orient='records', lines=True)
print(type(df_json_txt))

f_json_test = 'json_test-json_text.json'
with open(f_json_test, 'w') as f:
    f.write(df_json_txt)

<class 'str'>
CPU times: user 4.18 ms, sys: 98 µs, total: 4.27 ms
Wall time: 3.63 ms


In [143]:
%%time
# Option B: convert to a list of dicts, then write each dict as one JSON line
#  This requires iterating over the records to save the data
df_dict = df_test.drop(l_embedding_cols, axis=1).to_dict(orient='records')

# print(len(df_dict))
type(df_dict)

f_json_dict_test = 'json_test-dict_per_line.json'
with open(f_json_dict_test, 'w') as f:
    for sub_ in df_dict:
        f.write(json.dumps(sub_) + "\n")

CPU times: user 8.43 ms, sys: 2.77 ms, total: 11.2 ms
Wall time: 11.5 ms


In [128]:
# print(df_dict)

In [129]:
# f_json_dict_test1 = 'json_test-dict_json_dump.json'
# with open(f_json_dict_test1, 'w') as f:
#     json.dump(df_dict, f)
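
The commented-out `json.dump` variant writes a single JSON array, which is a different layout from the one-record-per-line files produced by options A and B. BigQuery load jobs for JSON expect the newline-delimited layout. A small sketch of the difference, with made-up records:

```python
import json

# Made-up records for illustration.
records = [
    {"subreddit_id": "t5_a", "embeddings": [0.1, 0.2]},
    {"subreddit_id": "t5_b", "embeddings": [0.3, 0.4]},
]

# Newline-delimited JSON: one complete JSON object per line.
ndjson_text = "\n".join(json.dumps(r) for r in records) + "\n"

# Single JSON document: one array that wraps every record.
array_text = json.dumps(records)

# Each NDJSON line parses on its own; the array only parses as a whole.
parsed_lines = [json.loads(line) for line in ndjson_text.splitlines()]
print(len(ndjson_text.splitlines()), array_text[0])
```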

## Upload file to GCS

In [144]:
%%time

from google.cloud import storage


test_blob = f'test/embeddings/{f_json_test}'

client = storage.Client()
bucket = client.get_bucket(cfg_reshape_embeddings.config_dict['bucket_output'])
blob = bucket.blob(test_blob)
blob.upload_from_filename(f_json_test)

CPU times: user 15.8 ms, sys: 48.1 ms, total: 63.8 ms
Wall time: 1.41 s


In [145]:
%%time
# Use the dict-per-line file name so this upload does not overwrite the Option A blob
test_blob2 = f'test/embeddings/{f_json_dict_test}'

client = storage.Client()
bucket = client.get_bucket(cfg_reshape_embeddings.config_dict['bucket_output'])
blob = bucket.blob(test_blob2)
blob.upload_from_filename(f_json_dict_test)

CPU times: user 21.8 ms, sys: 42.8 ms, total: 64.6 ms
Wall time: 1.42 s


## Create table from GCS file

In [132]:
# cfg_reshape_embeddings.config_dict

In [155]:
%%time
# bq_table_test = (
#     f"{cfg_reshape_embeddings.config_dict['bq_project']}."
#     f"{cfg_reshape_embeddings.config_dict['bq_dataset']}."
#     f"{cfg_reshape_embeddings.config_dict['bq_table']}_test"
# )

load_data_to_bq_table(
    uri=f"gs://{cfg_reshape_embeddings.config_dict['bucket_output']}/{test_blob}",
    bq_project=cfg_reshape_embeddings.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings.config_dict['bq_dataset'],
    bq_table_name=f"{cfg_reshape_embeddings.config_dict['bq_table']}_test",
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings.config_dict['bq_table_description'],
    update_table_description=True,
)

21:48:06 | INFO | "Loading data to table:
  reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings_test"
21:48:06 | INFO | "Table reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings_test already exist"
21:48:06 | INFO | "  100 rows in table BEFORE addig data"
21:48:11 | INFO | "  200 rows in table AFTER adding data"


CPU times: user 38.6 ms, sys: 45.8 ms, total: 84.3 ms
Wall time: 6.33 s
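
The load call combines config values into a `gs://` URI and a `project.dataset.table` name. A sketch of that string assembly follows; `gcs_uri` and `bq_table_id` are hypothetical helpers, not part of `subclu`, and the example values are the ones shown in this notebook.

```python
def gcs_uri(bucket_name, blob_name):
    """Hypothetical helper: build a gs:// URI from a bucket and a blob path."""
    return f"gs://{bucket_name}/{blob_name.lstrip('/')}"


def bq_table_id(project, dataset, table):
    """Hypothetical helper: build a fully qualified BigQuery table name."""
    return f"{project}.{dataset}.{table}"


uri = gcs_uri("i18n-subreddit-clustering", "test/embeddings/json_test-json_text.json")
table_id = bq_table_id(
    "reddit-employee-datasets", "david_bermejo", "cau_subreddit_embeddings_test"
)
print(uri)
print(table_id)
```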


In [159]:
%%time

load_data_to_bq_table(
    uri=f"gs://{cfg_reshape_embeddings.config_dict['bucket_output']}/{test_blob2}",
    bq_project=cfg_reshape_embeddings.config_dict['bq_project'],
    bq_dataset=cfg_reshape_embeddings.config_dict['bq_dataset'],
    bq_table_name=f"{cfg_reshape_embeddings.config_dict['bq_table']}_test2",
    schema=embeddings_schema(),
    partition_column='pt',
    table_description=cfg_reshape_embeddings.config_dict['bq_table_description'],
    update_table_description=True,
)

21:51:38 | INFO | "Loading data to table:
  reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings_test2"
21:51:39 | INFO | "Table reddit-employee-datasets.david_bermejo.cau_subreddit_embeddings_test2 already exist"
21:51:39 | INFO | "  200 rows in table BEFORE addig data"
21:51:42 | INFO | "Updating subreddit description from:
  Content-based embeddings created from text in posts & comments in a subreddit. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/
to:
  Subreddit-level embeddings. See the wiki for more details. https://reddit.atlassian.net/wiki/spaces/DataScience/pages/2404220935/"
21:51:42 | INFO | "  300 rows in table AFTER adding data"


CPU times: user 39.4 ms, sys: 49.3 ms, total: 88.7 ms
Wall time: 5.02 s


In [None]:
LEGACY

In [74]:
mlflow.end_run("FAILED")
#TODO(djb): uncomment
mlflow_experiment = cfg_reshape_embeddings.config_dict['mlflow_experiment']
# 'v0.6.0_mUSE_aggregates', 'v0.6.0_mUSE_aggregates_test'


t_start_agg_embed = datetime.utcnow()
info(f"== Start run_aggregation() method ==")



info(f"MLflow tracking URI: {mlflow.get_tracking_uri()}")
mlf.set_experiment(mlflow_experiment)
mlflow.start_run()
mlf.add_git_hash_to_active_run()
mlf.set_tag_hostname(key='host_name')
mlf.log_param_hostname(key='host_name')
mlf.log_cpu_count()
mlf.log_ram_stats(param=True, only_memory_used=False)

04:32:42 | INFO | "== Start run_aggregation() method =="
04:32:42 | INFO | "MLflow tracking URI: sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db"
04:32:42 | INFO | "host_name: djb-100-2021-04-28-djb-eda-german-subs"
04:32:42 | INFO | "cpu_count: 96"
04:32:43 | INFO | "RAM stats:
{'memory_used_percent': '2.03%', 'memory_total': '1,444,961', 'memory_used': '29,348', 'memory_free': '1,407,217'}"


{'memory_total': 1444961,
 'memory_used_percent': 0.020310582776974603,
 'memory_used': 29348,
 'memory_free': 1407217}

In [75]:
# set weights
# Normalize them by dividing by 100
WEIGHT_POST_COMMENT = (
    cfg_reshape_embeddings.config_dict['aggregate_params']['agg_post_post_and_comment_weight'] / 100
)
WEIGHT_SUB_META = (
    cfg_reshape_embeddings.config_dict['aggregate_params']['agg_post_subreddit_desc_weight'] / 100
)
print(WEIGHT_POST_COMMENT + WEIGHT_SUB_META)
# Compare with a tolerance: float sums are not always exactly 1.0
assert np.isclose(WEIGHT_POST_COMMENT + WEIGHT_SUB_META, 1.0)


gcs_sub_embeddings = cfg_reshape_embeddings.config_dict['data_embeddings_to_aggregate']['subreddit_desc_folder_embeddings']
print(gcs_sub_embeddings)
gcs_post_comment_embeddings = cfg_reshape_embeddings.config_dict['data_embeddings_to_aggregate']['post_and_comments_folder_embeddings']
print(gcs_post_comment_embeddings)

#TODO(djb): uncomment
# mlflow.log_params(
#     {
#         'embeddings_bucket': BUCKET_NAME,
#         'embeddings_subreddit_path': gcs_sub_embeddings,
#         'embeddings_post_and_comments_path': gcs_post_comment_embeddings,
#         'weight_post_and_comments': WEIGHT_POST_COMMENT,
#         'weight_subreddit_meta': WEIGHT_SUB_META,
#     }
# )

1.0
i18n_topic_model_batch/runs/20220811/subreddits/text/embedding/2022-08-11_082859
i18n_topic_model_batch/runs/20220811/post_and_comment_text_combined/text_all/embedding/2022-08-11_084218
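
The weight check above relies on the two normalized weights summing to 1.0. Sums of floats are not always exact, so a tolerance-based comparison is the safer way to express that check. A short sketch with hypothetical percentages (the notebook reads its weights from the config):

```python
import math

# Hypothetical percentages; the real values come from aggregate_params in the config.
weights_pct = {"post_and_comment": 70, "subreddit_desc": 30}
weights = {k: v / 100 for k, v in weights_pct.items()}
total = sum(weights.values())

# Classic example of why exact equality is fragile for floats.
exact_small_case = (0.1 + 0.2 == 0.3)

# A tolerance-based check does not depend on exact rounding.
weights_ok = math.isclose(total, 1.0, rel_tol=1e-9)
print(exact_small_case, weights_ok)
```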


# Load data

In [76]:
%%time
t_start_data_load_ = datetime.utcnow()

subs_v = LoadSubredditsGCS(
    bucket_name=cfg_reshape_embeddings.config_dict['data_embeddings_to_aggregate']['bucket_embeddings'],
    gcs_path=gcs_sub_embeddings,
    local_cache_path="/home/jupyter/subreddit_clustering_i18n/data/local_cache/",
    columns=None,
    col_unique_check='subreddit_id',
    df_format='pandas',
    unique_check=True,
    verbose= True,
    
    n_sample_files=None,
    n_files_slice_start=None,
    n_files_slice_end=None,
)
subs_v.local_cache()

df_v_subs = subs_v.read_as_one_df()
r_subs, c_subs = df_v_subs.shape
mlflow.log_metrics(
    {
        f"df_v_subs-rows": r_subs,
        f"df_v_subs-cols": c_subs,
    }
)
print(f"{r_subs:,.0f} rows, {c_subs:,.0f} cols")

04:32:46 | INFO | "  Local folder to download artifact(s):
  /home/jupyter/subreddit_clustering_i18n/data/local_cache/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220811/subreddits/text/embedding/2022-08-11_082859"
04:32:47 | INFO | "  7 <- Files matching prefix"
04:32:47 | INFO | "  7 <- Files to check"
04:32:47 | INFO | "    000000000000-131971_by_514.parquet <- File already exists, not downloading"
04:32:47 | INFO | "    000000000001-198630_by_514.parquet <- File already exists, not downloading"
04:32:47 | INFO | "    000000000002-441159_by_514.parquet <- File already exists, not downloading"
04:32:47 | INFO | "    2022-08-11_08-28-59_vectorize_text.log <- File already exists, not downloading"
04:32:47 | INFO | "  Files already cached: 4"
04:32:47 | INFO | "  Files already downloaded."
04:32:47 | INFO | "  df format: pandas"
04:32:51 | INFO | "  Checking ID uniqueness..."


771,760 rows, 514 cols
CPU times: user 4.23 s, sys: 4.75 s, total: 8.98 s
Wall time: 6.68 s


In [132]:
mlflow.end_run("FAILED")