# Purpose


2022-06-29:
Use the updated pandas function to get embeddings on a VM with a lot of RAM.

There might be some tweaks needed to batch a few subreddits at a time, but at least we can get more consistent state/progress than with `dask`.

Test new methods to make aggregation faster.

Provenance:
* `v0.4.1 / djb_03.01-2021-12-aggregate_v041_posts_and_comments_pandas.ipynb`

# Notebook setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datetime import datetime
import gc
import os
import logging
from logging import info
from pathlib import Path
from pprint import pprint

import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import seaborn as sns

import dask
from dask import dataframe as dd
from tqdm.auto import tqdm

import mlflow
import hydra

import subclu
from subclu.models.aggregate_embeddings import (
    AggregateEmbeddings, AggregateEmbeddingsConfig,
    load_config_agg_jupyter, get_dask_df_shape,
)
from subclu.models import aggregate_embeddings_pd

from subclu.utils import set_working_directory
from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric
)
from subclu.utils.mlflow_logger import MlflowLogger, save_pd_df_to_parquet_in_chunks
from subclu.eda.aggregates import (
    compare_raw_v_weighted_language
)
from subclu.utils.data_irl_style import (
    get_colormap, theme_dirl
)


print_lib_versions([dask, hydra, mlflow, np, pd, plotly, sns, subclu])

python		v 3.7.10
===
dask		v: 2021.06.0
hydra		v: 1.1.0
mlflow		v: 1.16.0
numpy		v: 1.19.5
pandas		v: 1.2.4
plotly		v: 4.14.3
seaborn		v: 0.11.1
subclu		v: 0.5.0


In [3]:
# plotting
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as mdates
plt.style.use('default')

setup_logging()
notebook_display_config()

# Set sqlite database as MLflow URI

In [4]:
# use new class to initialize mlflow
mlf = MlflowLogger(tracking_uri='sqlite')
mlflow.get_tracking_uri()

'sqlite:////home/jupyter/subreddit_clustering_i18n/mlflow_sync/djb-100-2021-04-28-djb-eda-german-subs/mlruns.db'

## Get list of experiments with new function

In [5]:
df_mlf_exp = mlf.list_experiment_meta(output_format='pandas')
df_mlf_exp.tail(10)

Unnamed: 0,experiment_id,name,artifact_location,lifecycle_stage
24,24,v0.4.1_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/24,active
25,25,v0.4.1_mUSE_clustering_new_metrics,gs://i18n-subreddit-clustering/mlflow/mlruns/25,active
26,26,v0.4.1_nearest_neighbors_test,gs://i18n-subreddit-clustering/mlflow/mlruns/26,active
27,27,v0.4.1_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/27,active
28,28,v0.5.0_mUSE_aggregates_test,gs://i18n-subreddit-clustering/mlflow/mlruns/28,active
29,29,v0.5.0_mUSE_aggregates,gs://i18n-subreddit-clustering/mlflow/mlruns/29,active
30,30,v0.5.0_mUSE_clustering_test,gs://i18n-subreddit-clustering/mlflow/mlruns/30,active
31,31,v0.5.0_mUSE_clustering,gs://i18n-subreddit-clustering/mlflow/mlruns/31,active
32,32,v0.5.0_nearest_neighbors_test,gs://i18n-subreddit-clustering/mlflow/mlruns/32,active
33,33,v0.5.0_nearest_neighbors,gs://i18n-subreddit-clustering/mlflow/mlruns/33,active


In [6]:
# df_mlf_exp.iloc[9:15, :]

## ~Get runs that we can use for embeddings aggregation jobs~

For v0.5.0 embeddings I didn't use mlflow to track the embeddings inference. We'll need to get them from these folders in GCS:

- [Subreddit metadata](https://console.cloud.google.com/storage/browser/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text/embedding/2022-06-29_084555)
    - `i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text/embedding/2022-06-29_084555`
- [Post + Comment Text (already combined)](https://console.cloud.google.com/storage/browser/i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds/embedding/2022-06-29_091925)
    - `i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds/embedding/2022-06-29_091925`



# Load embeddings

## Posts + Comments

Only load 1 file for testing.

In [11]:
%%time

gcs_pc = (
    "gs://i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/post_and_comment_text_combined/text_subreddit_seeds/embedding/2022-06-29_091925/000000000003-274825_by_515.parquet"
)
df_v_pc = pd.read_parquet(
    gcs_pc
)
print(df_v_pc.shape)

(274825, 515)
CPU times: user 5.49 s, sys: 3.82 s, total: 9.32 s
Wall time: 34.7 s


In [13]:
df_v_pc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 274825 entries, 0 to 274824
Columns: 515 entries, subreddit_id to embeddings_511
dtypes: float32(512), object(3)
memory usage: 543.1+ MB


In [14]:
[c for c in df_v_pc.columns if 'embedding' not in c]

['subreddit_id', 'subreddit_name', 'post_id']

## Subreddit meta
All subreddits are in a single file, so there is no need to sample.

In [10]:
%%time

gcs_subs = (
    "gs://i18n-subreddit-clustering/i18n_topic_model_batch/runs/20220629/subreddits/text/embedding/2022-06-29_084555/000000000000-196371_by_514.parquet"
)
df_v_subs = pd.read_parquet(
    gcs_subs
)
print(df_v_subs.shape)

(196371, 514)
CPU times: user 3.14 s, sys: 2.39 s, total: 5.53 s
Wall time: 26.7 s


In [15]:
[c for c in df_v_subs.columns if 'embedding' not in c]

['subreddit_id', 'subreddit_name']

# Compare weighted average calculations

## Weighted avg new method -- 1 subreddit at a time

By calculating the weights one subreddit at a time, we have to do fewer loops (one iteration per subreddit instead of one per post). Also, by reducing the work to multiplying and adding, it might be faster than letting numpy compute the weighted average.

Algo:

For each subreddit:
- Calculate weighted value for subreddit (multiply by 0.15)
- Calculate weighted value for each post in subreddit (multiply by 0.85)
- Sum the weighted post + weighted subreddit
- Append multi-index cols from original post (do I even need to do this anymore?)
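
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual run: the frames `df_subs_toy` and `df_posts_toy`, their values, and the 2-dimensional embeddings are made up, and only the 0.85 / 0.15 weights come from this notebook.

```python
import numpy as np
import pandas as pd

WEIGHT_POST = 0.85  # same weights as used later in this notebook
WEIGHT_SUB = 0.15

emb_cols = ["embeddings_0", "embeddings_1"]  # toy: 2 dims instead of 512

# Hypothetical stand-ins for df_v_subs and df_v_pc (values are made up)
df_subs_toy = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "embeddings_0": [1.0, 0.0],
    "embeddings_1": [0.0, 1.0],
})
df_posts_toy = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_a", "t5_b"],
    "post_id": ["t3_1", "t3_2", "t3_3"],
    "embeddings_0": [0.0, 2.0, 4.0],
    "embeddings_1": [2.0, 0.0, 0.0],
})

df_out_toy = df_posts_toy.copy()
for s_id, df_sub_posts in df_posts_toy.groupby("subreddit_id"):
    # One (1, 2) row for the subreddit, broadcast against (n_posts, 2)
    sub_vec = df_subs_toy.loc[
        df_subs_toy["subreddit_id"] == s_id, emb_cols
    ].to_numpy()
    post_vecs = df_sub_posts[emb_cols].to_numpy()
    df_out_toy.loc[df_sub_posts.index, emb_cols] = (
        WEIGHT_SUB * sub_vec + WEIGHT_POST * post_vecs
    )

print(df_out_toy)
```

The loop runs once per subreddit, so the number of iterations depends on the subreddit count rather than the post count.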

In [32]:
%%time
df_v_pc_weighted = df_v_pc.copy()

df_v_subs_weighted = df_v_subs.copy()

CPU times: user 193 ms, sys: 210 ms, total: 403 ms
Wall time: 402 ms


In [22]:
np.allclose(df_v_pc_weighted.iloc[:1000,3:515], df_v_pc.iloc[:1000,3:515])

True

In [34]:
np.allclose(df_v_subs_weighted.iloc[:1000,2:514], df_v_subs.iloc[:1000,2:514])

True

In [38]:
l_embedding_cols = [c for c in df_v_pc_weighted if c.startswith('embeddings_')]
print(len(l_embedding_cols))
df_v_pc_weighted.iloc[-5:,:50]

512


Unnamed: 0,subreddit_id,subreddit_name,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9,embeddings_10,embeddings_11,embeddings_12,embeddings_13,embeddings_14,embeddings_15,embeddings_16,embeddings_17,embeddings_18,embeddings_19,embeddings_20,embeddings_21,embeddings_22,embeddings_23,embeddings_24,embeddings_25,embeddings_26,embeddings_27,embeddings_28,embeddings_29,embeddings_30,embeddings_31,embeddings_32,embeddings_33,embeddings_34,embeddings_35,embeddings_36,embeddings_37,embeddings_38,embeddings_39,embeddings_40,embeddings_41,embeddings_42,embeddings_43,embeddings_44,embeddings_45,embeddings_46
274820,t5_2qhsb,legal,t3_vav60h,-0.025484,-0.042297,-0.028066,0.053591,-0.052055,0.03133,0.04766,-0.015244,0.029506,-0.006542,-0.042632,-0.022448,0.056348,-0.02123,-0.020143,-0.032932,0.031171,-0.047692,0.040053,0.054666,0.044613,-0.048837,0.040751,0.029844,-0.051699,0.027687,-0.044435,-0.030889,0.053416,-0.006069,-0.00553,0.010514,0.022499,-0.055732,0.053022,-0.021726,0.046862,0.02254,0.055809,0.048669,-0.046346,-0.016853,0.008252,0.035973,0.01874,0.037203,0.056607
274821,t5_2qhsb,legal,t3_vaxwgj,-0.016911,-0.049202,0.031682,-0.024452,-0.067547,0.010203,-0.003813,0.035652,0.000114,-0.008946,-0.002668,0.045287,0.065987,-0.036945,0.012492,-0.039705,0.034107,-0.043527,-0.033865,0.029753,-0.056707,-0.006285,0.033569,0.045606,0.032561,0.041322,0.032581,0.006962,0.054103,-0.039368,-0.034983,0.017265,0.020664,-0.050606,-0.028953,-0.018941,0.004652,0.010902,0.032505,-0.05143,-0.011445,0.050677,-0.00236,-0.038099,0.004932,0.013224,0.058994
274822,t5_2qhsb,legal,t3_vayh6b,-0.051104,-0.041374,0.001943,-0.049956,-0.00659,0.034922,-0.044614,-0.037481,0.036902,0.0419,-0.045564,-0.036952,0.059462,-0.021652,-0.046495,0.033043,0.060548,-0.037271,0.054952,0.050859,0.052671,-0.053249,0.03545,0.001733,-0.020086,0.042897,-0.056326,-0.010254,-0.01618,-0.004392,0.045888,0.00464,0.050837,-0.040546,0.046204,-0.0604,0.034423,0.059203,0.051741,-0.02968,-0.005628,-0.016251,0.060978,0.045034,0.022429,-0.036358,0.057792
274823,t5_2qhsb,legal,t3_vazqka,-0.01109,-0.028554,-0.009037,-0.02008,-0.068724,-0.04288,0.008994,0.036494,0.00102,-0.042014,0.0599,0.009779,0.065419,-0.003364,-0.009137,0.021694,0.030215,-0.042665,0.019184,-0.009016,-0.024031,0.02888,0.030134,0.048822,-0.055032,0.022658,0.019199,-0.023534,0.037996,-0.032842,-0.053451,-0.014632,-0.00323,-0.034031,-0.027852,0.016376,0.044918,-0.001498,0.056727,0.040879,-0.000106,0.050608,0.023669,-0.012869,0.008975,-0.05933,0.054055
274824,t5_2qhsb,legal,t3_vb1mkk,0.018254,-0.01848,-0.025331,0.052942,-0.062068,-0.025189,0.01164,0.035173,-0.010215,-0.027693,0.04286,0.029097,0.06375,-0.010834,0.003707,0.006596,0.056101,-0.04043,0.021435,0.015901,0.036214,-0.004567,0.06221,0.054289,-0.036047,0.009227,0.03037,-0.017788,0.05112,-0.014712,-0.06169,0.011686,0.035823,-0.045688,-0.005661,-0.011399,-0.02959,-0.01673,0.046029,0.046278,-0.033987,-0.003701,-0.009151,-0.014255,0.036235,-0.044902,0.044714


In [35]:
# df_v_pc_weighted[l_embedding_cols].iloc[:5,:50]

In [30]:
WEIGHT_POST_COMMENT = 0.85
WEIGHT_SUB_META = 0.15
assert np.isclose(1.0, WEIGHT_POST_COMMENT + WEIGHT_SUB_META)

In [36]:
%%time
# apply weight to all posts at once (vectorized)
df_v_pc_weighted[l_embedding_cols] = df_v_pc_weighted[l_embedding_cols] * WEIGHT_POST_COMMENT

CPU times: user 367 ms, sys: 205 ms, total: 573 ms
Wall time: 571 ms


In [37]:
%%time
# apply weight to all subreddit meta at once (vectorized)
df_v_subs_weighted[l_embedding_cols] = df_v_subs_weighted[l_embedding_cols] * WEIGHT_SUB_META

CPU times: user 272 ms, sys: 143 ms, total: 415 ms
Wall time: 414 ms


```
np.repeat()
2.03 s ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

In [76]:
%%timeit
# add subreddit vector to multiple posts
n_test_add_ = 274825  # all rows in df_v_pc
np_repeat_test = (
    np.repeat(df_v_subs_weighted[l_embedding_cols].iloc[:1,:].to_numpy(), repeats=n_test_add_, axis=0) +
    df_v_pc_weighted[l_embedding_cols].iloc[:n_test_add_,:]
)

2.02 s ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [77]:
%%timeit
# add subreddit vector to multiple posts

np_add_test = np.add(
    df_v_subs_weighted[l_embedding_cols].iloc[:1,:].to_numpy(),
    df_v_pc_weighted[l_embedding_cols].iloc[:n_test_add_,:].to_numpy()
)

471 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


```
# np.add() is 4x faster!!
470 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
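
A likely reason for the difference is broadcasting: `np.add()` can stretch the single `(1, 512)` subreddit row across all post rows without first building a full-size copy, while `np.repeat()` materializes that copy. The sketch below shows that both give the same result on small made-up arrays (the shapes and random values are assumptions for illustration, not the notebook's data).

```python
import numpy as np

rng = np.random.default_rng(0)
sub_vec = rng.normal(size=(1, 4)).astype(np.float32)    # one subreddit vector
post_vecs = rng.normal(size=(5, 4)).astype(np.float32)  # five post vectors

# np.repeat builds a (5, 4) copy of the subreddit vector before adding
sum_repeat = np.repeat(sub_vec, repeats=post_vecs.shape[0], axis=0) + post_vecs

# np.add broadcasts the (1, 4) row across the (5, 4) array instead
sum_broadcast = np.add(sub_vec, post_vecs)

print(sum_broadcast.shape)
print(np.allclose(sum_repeat, sum_broadcast))
```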

In [75]:
%%timeit
# add subreddit vector to multiple posts

np_add_test_2 = np.add(
    df_v_subs_weighted[l_embedding_cols].iloc[:1,:].to_numpy(),
    df_v_pc_weighted[l_embedding_cols].iloc[:n_test_add_,:]
)

470 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [72]:
%%time
np.allclose(np_repeat_test, np_add_test)

CPU times: user 784 ms, sys: 294 ms, total: 1.08 s
Wall time: 1.08 s


True

In [74]:
%%time
np.allclose(np_repeat_test, np_add_test_2)

CPU times: user 830 ms, sys: 251 ms, total: 1.08 s
Wall time: 1.08 s


True

In [57]:
# np.repeat(df_v_subs_weighted[l_embedding_cols].iloc[:1,:].to_numpy(), repeats=10, axis=0)

In [78]:
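# add each subreddit's weighted vector to all of its weighted posts
for s_id in tqdm(df_v_pc_weighted['subreddit_id'].unique()):
    mask_sub_posts = df_v_pc_weighted['subreddit_id'] == s_id
    mask_sub_meta = df_v_subs_weighted['subreddit_id'] == s_id

    # the (1, 512) subreddit row is broadcast across the (n_posts, 512) rows
    df_v_pc_weighted.loc[mask_sub_posts, l_embedding_cols] = np.add(
        df_v_subs_weighted.loc[mask_sub_meta, l_embedding_cols].to_numpy(),
        df_v_pc_weighted.loc[mask_sub_posts, l_embedding_cols]
    )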

  0%|          | 0/327 [00:00<?, ?it/s]
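
A possible alternative to the per-subreddit loop, not benchmarked in this notebook, is to look up every post's subreddit row in one step and add the two arrays. The sketch below uses made-up, already-weighted toy frames (`df_subs_w`, `df_posts_w` are hypothetical names). Note that the lookup materializes one subreddit row per post, similar to `np.repeat()`, so it trades memory for the absence of a Python loop.

```python
import numpy as np
import pandas as pd

emb_cols = ["embeddings_0", "embeddings_1"]  # toy: 2 dims instead of 512

# Hypothetical frames that already have their weights applied
df_subs_w = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "embeddings_0": [0.15, 0.0],
    "embeddings_1": [0.0, 0.15],
})
df_posts_w = pd.DataFrame({
    "subreddit_id": ["t5_b", "t5_a", "t5_b"],
    "post_id": ["t3_1", "t3_2", "t3_3"],
    "embeddings_0": [1.0, 3.0, 5.0],
    "embeddings_1": [2.0, 4.0, 6.0],
})

# .loc with repeated labels returns one subreddit row per post, in post order
sub_rows = (
    df_subs_w.set_index("subreddit_id")
    .loc[df_posts_w["subreddit_id"], emb_cols]
    .to_numpy()
)
df_posts_w[emb_cols] = df_posts_w[emb_cols].to_numpy() + sub_rows

print(df_posts_w)
```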

## Calculate weighted average with old method

In [16]:
counts_describe(df_v_pc[['subreddit_id', 'subreddit_name', 'post_id']])

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
subreddit_id,object,274825,327,0.12%,0,0.00%
subreddit_name,object,274825,327,0.12%,0,0.00%
post_id,object,274825,274825,100.00%,0,0.00%


In [17]:
value_counts_and_pcts(df_v_pc['subreddit_name'])

Unnamed: 0,subreddit_name-count,subreddit_name-percent,subreddit_name-pct_cumulative_sum
christianity,4800,1.7%,1.7%
dating,4800,1.7%,3.5%
polls,4800,1.7%,5.2%
conservative,4800,1.7%,7.0%
sexy,4800,1.7%,8.7%
lego,4800,1.7%,10.5%
food,4800,1.7%,12.2%
golf,4800,1.7%,14.0%
weed,4800,1.7%,15.7%
art,4800,1.7%,17.5%


In [80]:
%%time

col_weights = '_col_method_weight_'

l_ix_sub_level = ['subreddit_id', 'subreddit_name']
l_ix_post_level = l_ix_sub_level + ['post_id']

# Create df with:
#  - all posts that already include weight from comments
#    - add new col with input weight
#  - subreddit descriptions
#    - create new df: one row per post, each row has the embeddings for the sub
#    - add new col with input weight
df_posts_for_weights = pd.concat(
    [
        # Posts already include the weighted comments, so assign their combined weight in a single column
        df_v_pc.assign(
            **{col_weights: WEIGHT_POST_COMMENT}
        ),
        (
            df_v_pc[l_ix_post_level]
            .merge(
                df_v_subs,
                how='left',
                left_on=l_ix_sub_level,
                right_on=l_ix_sub_level,
            )
        ).assign(
            **{col_weights: WEIGHT_SUB_META}
        ),
    ]
)

CPU times: user 904 ms, sys: 614 ms, total: 1.52 s
Wall time: 1.52 s


In [81]:
%%time
d_weighted_mean_agg = dict()
for id_, df in tqdm(
    df_posts_for_weights.groupby('post_id'),
    ascii=True, ncols=80, position=0, mininterval=8
):
    d_weighted_mean_agg[id_] = np.average(
        df[l_embedding_cols],
        weights=df[col_weights],
        axis=0,
    )

gc.collect()
# Convert dict to df so we can reshape to input multi-index
df_agg_posts_w_sub = pd.DataFrame(d_weighted_mean_agg).T
df_agg_posts_w_sub.columns = l_embedding_cols
df_agg_posts_w_sub.index.name = 'post_id'

  0%|                                                | 0/274825 [00:00<?, ?it/s]

CPU times: user 4min 22s, sys: 2.88 s, total: 4min 25s
Wall time: 4min 25s


In [82]:
df_agg_posts_w_sub.head()

post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9,embeddings_10,embeddings_11,embeddings_12,embeddings_13,embeddings_14,embeddings_15,embeddings_16,embeddings_17,embeddings_18,embeddings_19,embeddings_20,embeddings_21,embeddings_22,embeddings_23,embeddings_24,embeddings_25,embeddings_26,embeddings_27,embeddings_28,embeddings_29,...,embeddings_482,embeddings_483,embeddings_484,embeddings_485,embeddings_486,embeddings_487,embeddings_488,embeddings_489,embeddings_490,embeddings_491,embeddings_492,embeddings_493,embeddings_494,embeddings_495,embeddings_496,embeddings_497,embeddings_498,embeddings_499,embeddings_500,embeddings_501,embeddings_502,embeddings_503,embeddings_504,embeddings_505,embeddings_506,embeddings_507,embeddings_508,embeddings_509,embeddings_510,embeddings_511
t3_uzzcv1,-0.009148,-0.039507,0.021938,0.055331,-0.05464,0.024449,-0.002375,0.05562,-0.044555,-3.2e-05,0.032799,-0.053243,-0.01136,0.005441,-0.054313,-0.009622,0.040114,-0.03802,-0.017399,-0.033303,-0.036961,-0.064304,0.012384,-0.005154,0.029324,-0.001708,-0.035202,0.024156,-0.064851,-0.068842,...,0.024094,0.005559,-0.027843,-0.0601,-0.038439,-0.042445,-0.053084,0.014334,0.055739,0.050927,-0.012063,0.009465,-0.032515,-0.003552,-0.061799,0.005882,-0.02888,0.040546,0.027285,-0.056034,-0.013069,-0.013259,0.034959,0.039181,0.046509,0.016985,-0.034171,0.007017,-0.010695,-0.008249
t3_uzzcz9,-0.019084,-0.048237,-0.043216,-0.017629,-0.044857,0.055447,-0.030667,-0.057356,-0.060387,-0.060134,0.054658,-0.050397,0.058118,-0.060813,0.009434,0.0009,0.063898,-0.002631,0.062142,0.0323,-0.063295,-0.027986,0.048525,0.044668,-0.04471,-0.051113,-0.020938,-0.062194,0.014399,-0.060798,...,0.003619,-0.058374,-0.056328,-0.062827,0.013637,-0.059419,-0.058887,-0.033025,0.001513,-0.044439,0.028763,0.016377,0.050336,0.029066,-0.032324,-0.063519,0.019458,0.053063,-0.051965,-0.003078,-0.037659,0.053979,-0.04323,0.002078,-0.030616,-0.044428,0.045737,-0.05711,0.016374,-0.009797
t3_uzzd5a,-0.000194,0.020939,0.004795,-0.067557,0.071071,-0.04359,0.016589,-0.048031,-0.03963,0.046193,0.045031,-0.043043,-0.023307,-0.024153,0.021685,0.081953,0.028689,-0.043491,-0.056047,0.056852,0.036719,-0.005775,-0.00979,0.04125,-0.044283,-0.066452,-0.055494,0.039966,-0.091498,-0.086407,...,0.074405,0.021761,-0.047466,-0.070385,-0.025411,-0.085675,0.041542,0.018769,-0.026418,-0.004121,-0.019741,-0.001223,0.015245,0.006438,-0.029733,-0.049211,-0.028133,-0.011553,0.038644,0.025355,-0.043904,0.061952,-0.038521,0.019298,-0.080547,0.012301,0.013185,-0.005493,-0.0381,-0.025254
t3_uzzd6g,-0.022142,0.063462,0.006929,0.002485,0.017795,-0.026877,0.059541,0.011584,-0.001599,0.00104,0.018869,0.016335,0.008045,-0.031077,-0.039798,0.047882,0.02809,-0.040501,-0.030687,0.065156,0.007338,-0.020712,0.020069,0.045698,0.00593,0.02385,-0.067804,0.002407,-0.101957,-0.093681,...,0.081825,-0.032171,-0.026468,-0.012311,-0.01621,-0.073604,-0.069801,-0.016043,-0.037921,-0.000515,0.019921,0.020983,-0.033227,-0.054925,-0.011173,0.00829,-0.023475,-0.02159,0.008425,0.004249,-0.032743,0.023139,0.000206,-0.004462,-0.097186,0.008091,-0.009928,-0.057247,-0.000195,-0.019686
t3_uzzd83,-0.027423,0.046839,-0.012693,0.028406,0.043272,0.061343,0.005615,-0.050633,-0.009577,-0.015914,-0.014944,-0.011592,0.051924,-0.011203,0.057848,0.007312,-0.046407,-0.03074,-0.004458,0.034829,-0.00344,-0.045691,0.038077,-0.021319,-0.04834,-0.032525,-0.030801,0.059464,0.034828,-0.005344,...,0.04106,0.01646,0.05686,-0.005063,0.060571,-0.050961,-0.056728,0.028061,0.047922,-0.028718,0.056995,-0.053525,0.027443,0.00236,0.018768,0.059138,-0.015045,-0.04133,0.05168,0.012768,-0.00501,0.040599,-0.050547,0.00834,-0.050218,-0.007337,0.056396,-0.049191,0.027127,0.03544


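We need to sort by `post_id` before comparing the outputs, because the `groupby` in the old method returns rows ordered by `post_id` while the new method keeps the original row order.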

As expected, the outputs match!

In [85]:
%%time
np.allclose(
    (
        df_v_pc_weighted
        .sort_values(by='post_id')
        [l_embedding_cols]
    ),
    df_agg_posts_w_sub.sort_index()    
)

CPU times: user 4.3 s, sys: 965 ms, total: 5.26 s
Wall time: 5.26 s


True
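
The match is expected because the weights sum to 1: `np.average()` divides the weighted sum by the sum of the weights, so with weights 0.85 and 0.15 it reduces to scaling each vector and adding them. A minimal check on made-up 3-dimensional vectors (the values are illustrative, not taken from the data):

```python
import numpy as np

w_post, w_sub = 0.85, 0.15
post_vec = np.array([0.2, -0.4, 0.6])  # made-up post embedding
sub_vec = np.array([1.0, 0.0, -1.0])   # made-up subreddit embedding

# Old method: np.average over the two stacked rows
avg_old = np.average(
    np.vstack([post_vec, sub_vec]),
    weights=[w_post, w_sub],
    axis=0,
)

# New method: scale each vector, then add
avg_new = w_post * post_vec + w_sub * sub_vec

print(np.allclose(avg_old, avg_new))
```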

In [84]:
df_v_pc_weighted.head()

Unnamed: 0,subreddit_id,subreddit_name,post_id,embeddings_0,embeddings_1,embeddings_2,embeddings_3,embeddings_4,embeddings_5,embeddings_6,embeddings_7,embeddings_8,embeddings_9,embeddings_10,embeddings_11,embeddings_12,embeddings_13,embeddings_14,embeddings_15,embeddings_16,embeddings_17,embeddings_18,embeddings_19,embeddings_20,embeddings_21,embeddings_22,embeddings_23,embeddings_24,embeddings_25,embeddings_26,...,embeddings_482,embeddings_483,embeddings_484,embeddings_485,embeddings_486,embeddings_487,embeddings_488,embeddings_489,embeddings_490,embeddings_491,embeddings_492,embeddings_493,embeddings_494,embeddings_495,embeddings_496,embeddings_497,embeddings_498,embeddings_499,embeddings_500,embeddings_501,embeddings_502,embeddings_503,embeddings_504,embeddings_505,embeddings_506,embeddings_507,embeddings_508,embeddings_509,embeddings_510,embeddings_511
0,t5_2qh4w,4chan,t3_v09l0x,-0.025344,0.044852,-0.052632,-0.026053,0.060904,0.055519,-0.042853,-0.011524,0.016424,-0.004246,0.063406,0.064824,0.033947,-0.063879,0.013637,-0.038069,-0.030971,-0.030783,0.040182,0.017561,0.052267,0.042475,0.030988,-0.051888,-0.066758,-0.055583,0.016553,...,0.05313,-0.030324,-0.071508,-0.004869,0.016515,-0.070484,-0.012324,0.00586,0.054137,-0.069215,0.052305,-0.011671,0.038728,0.052904,0.002566,-0.019516,0.035654,0.015187,0.035085,0.023885,-0.022797,-0.021177,-0.063997,-0.004042,-0.032427,-0.029911,-0.053402,-0.003792,0.016906,0.063804
1,t5_2qh4w,4chan,t3_v09uio,-0.000732,0.030236,0.070387,0.020216,-0.034513,0.040741,0.040386,-0.025521,-0.049477,0.002374,0.034898,-0.004015,0.056157,-0.053467,-0.054341,0.029749,0.014569,0.024137,-0.057508,-0.0289,-0.006759,0.002075,-0.033822,-0.052768,0.011789,0.002936,-0.045712,...,-0.022454,0.02826,-0.068587,-0.052722,0.030246,-0.01294,-0.007049,0.004349,-0.03414,-0.004288,0.062963,-0.048408,-0.01088,0.011084,-0.036912,-0.002513,0.023946,-0.052061,0.005379,0.048422,-0.056612,0.01881,-0.043887,0.019949,0.019921,-0.02815,-0.034119,0.028215,0.004101,0.061613
2,t5_2qh4w,4chan,t3_v0ab8w,-0.00563,0.045755,0.046313,-0.065188,0.032805,0.040427,0.039458,-0.004727,-0.006517,0.059666,0.046653,-0.03958,0.049561,-0.062256,0.022804,-0.023691,0.028127,0.037128,-0.072688,-0.009074,-0.040053,-0.052761,0.045511,-0.023994,-0.061593,0.021083,-0.01197,...,0.003046,0.0207,-0.081724,0.014125,-0.035194,-0.072809,-0.047294,-0.030604,0.008737,-0.07995,0.05214,-0.041242,-0.044933,-0.006422,0.005643,-0.013681,-0.000126,0.026373,-0.001319,-0.000275,-0.019186,0.029785,-0.037317,0.037052,-0.040958,-0.029413,0.01403,-0.03384,0.055456,0.037141
3,t5_2qh4w,4chan,t3_v0aury,0.043843,0.050806,-0.059777,-0.079567,0.03857,0.043294,-0.044257,-0.016724,0.016344,-0.028772,0.04924,0.065774,0.047302,-0.053179,0.02704,0.031598,0.026866,0.009455,0.052996,0.023587,0.00114,-0.033306,0.034803,-0.020478,-0.039246,0.007488,0.022621,...,-0.013407,0.016951,-0.019589,-0.051987,0.013327,-0.018753,-0.013847,0.045517,0.038266,0.014553,0.019461,-0.001064,0.05556,0.040787,-0.013519,-0.031739,-0.008671,-0.018284,0.016138,0.011228,-0.011301,-0.064077,-0.048079,-0.055569,0.03418,-0.003101,-0.040758,-0.049324,0.055498,0.062042
4,t5_2qh4w,4chan,t3_v0c9k0,0.007006,0.033415,0.049523,-0.028163,0.006388,0.043576,-0.009795,0.049176,0.013264,0.010367,0.044427,0.059394,0.042845,-0.018269,0.036784,-0.002845,0.024472,-0.045489,0.048226,-0.009878,0.040419,-0.04631,0.003686,0.00792,-0.066298,0.000123,-0.010283,...,0.019983,-0.05448,-0.051939,-0.003669,0.023038,-0.065567,0.00044,-0.014776,0.024939,-0.01775,0.057477,-0.025525,-0.038102,0.049199,-0.050771,0.04724,-0.030346,-0.062441,0.015395,0.062345,0.031599,-0.05082,-0.0377,-0.020974,0.007366,-0.045776,-0.017955,-0.058923,0.0365,0.049851
