# Purpose

### 2022-03-22
- model version: 0.4.1

In this notebook we apply the filters from the QA process to get the final output for country-relevant subs (this run is configured for Germany, `DE`). The primary use case is geo-relevant FPRs; see [[i18n/ML] One Feed Experiment Spec: UK/CA/AU/IN](https://docs.google.com/document/d/10z0ZlZuYnPYzUjlKHLy5iKClCk1AvMsy6LSL1wRK7FE/edit#heading=h.uozl6p2gc9i3).


### Inputs:
- Values from the QA spreadsheet
    - Cluster IDs
        - Recommend subs with the same cluster ID
    - Subreddits that are not country-relevant
        - Remove them
    - Subreddits that are NSFW
        - Remove them
    - Subreddits that don't belong to a cluster
        - Remove them

### Outputs
- Write to a new tab in Google Sheets in the format needed for FPRs
    - Use `gspread` to write table outputs directly to Google Sheets


### Reference
[Notebook template for v0.4.0 model](https://colab.research.google.com/drive/19p3O5DGxiEXj57OeCFf2gVIagUgCLx9v?usp=sharing)

# Imports & notebook setup

In [None]:
%load_ext google.colab.data_table
%load_ext autoreload
%autoreload 2

The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# colab auth for BigQuery & google drive
from google.colab import auth, files, drive
import sys  # need sys for mounting gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


### Append google-drive

In [None]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')
except Exception as e:
    print(e)
    raise Exception('You might need to manually mount google drive to colab')

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install custom libraries

In [None]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet

### General Imports


In [None]:
# Regular Imports
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm

# auth for google sheets
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())


# Set env variable needed by some libraries to get data from BigQuery
# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

### `subclu` import (custom module)

In [None]:
# subclu imports

from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr
)
from subclu.models.reshape_clusters_v041 import (
    get_subs_to_filter_as_df
)


setup_logging()
print_lib_versions([gspread, pd, np])

python		v 3.7.13
===
gspread		v: 4.0.1
pandas		v: 1.3.5
numpy		v: 1.21.5


# Checklist to re-run for a country:

- Copy google sheet cell from 1st notebook so it matches:
    - Google sheet KEY
    - country name for google sheet name
    - country initial in google sheet
- Add new sheet to create: `clusters_t2t_fpr_after_qa`
    - Here's where we'll save the clusters after QA


# Google sheet with country QA

For now, copy the same cell from the QA notebook. 

In the future we might need to register QA sheets in a central location to reduce copy/pasta.

In [None]:
# # %%time
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

country_name_sheet_ = 'Germany'
target_abbrev_ = 'DE'
GSHEET_KEY = '1K0GPk-ud_UNPun_5EaODxY4CoywTEUbMyF8NNsk2q-8'
GSHEET_NAME = f'i18n {country_name_sheet_} subreddits and clusters - model v0.4.1'


d_wsh_names = {
    'qa_ready': {
        'name': 'subs_need_to_be_rated',
    },
    'clusters_t2t_list_raw': {
        'name': f'raw_clusters_list_{target_abbrev_}_{target_abbrev_}',
    },
    'sub_raw': {
        'name': 'raw_data_per_subreddit',
    },
    'clusters_t2t_fpr_raw': {
        'name': f'raw_clusters_fpr_{target_abbrev_}_{target_abbrev_}',
    },
    'clusters_t2t_fpr_after_qa': {
        'name': f'fpr_clusters_after_qa_{target_abbrev_}_{target_abbrev_}',
    },
}

if GSHEET_KEY is not None:
    sh = gc.open_by_key(GSHEET_KEY)
    print(f"Opening google worksheet: {GSHEET_NAME} ...")
else:
    print(f"** Creating google worksheet: {GSHEET_NAME} ...")
    sh = gc.create(GSHEET_NAME)

# create worksheets:
for _, d_ in d_wsh_names.items():
    sh_name = d_['name']
    try:
        d_['worksheet'] = sh.worksheet(sh_name)
        print(f"  Opening tab/sheet: {sh_name} ...")
    except gspread.exceptions.WorksheetNotFound:
        # only create the tab when it is missing; let auth/network errors surface
        print(f"  ** Creating tab/sheet: {sh_name} ...")
        d_['worksheet'] = sh.add_worksheet(sh_name, rows=5, cols=5)

if GSHEET_KEY is None:
    print(f"\n*** New sheet ID (assign it to GSHEET_KEY variable): ***\n{sh.id}\n")

Opening google worksheet: i18n Germany subreddits and clusters - model v0.4.1 ...
  Opening tab/sheet: subs_need_to_be_rated ...
  Opening tab/sheet: raw_clusters_list_DE_DE ...
  Opening tab/sheet: raw_data_per_subreddit ...
  Opening tab/sheet: raw_clusters_fpr_DE_DE ...
  Opening tab/sheet: fpr_clusters_after_qa_DE_DE ...


# Get latest ratings & flags (e.g., `allow_discovery`)

## SQL

In [None]:
%%time
%%bigquery df_latest_ratings --project data-science-prod-218515 

-- Get ratings & other flags for all subs in the model
--  we'll filter/match to country in python
DECLARE PARTITION_DATE DATE DEFAULT (CURRENT_DATE() - 1);

SELECT
    t.subreddit_id
    , t.subreddit_name
    , CASE WHEN nt.rating_short = 'E' THEN True
        ELSE False
    END AS rated_e_latest
    , slo.over_18
    , slo.allow_discovery
    , nt.rating_short
    , slo.type
    , nt.primary_topic
    , nt.rating_name

FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` AS t
    -- Add rating so we can filter out subs not rated as E
    LEFT JOIN (
        SELECT *
        FROM `data-prod-165221.ds_v2_postgres_tables.subreddit_lookup`
        -- Get latest partition
        WHERE dt = PARTITION_DATE
    ) AS slo
        ON t.subreddit_id = slo.subreddit_id
    LEFT JOIN (
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE pt = PARTITION_DATE
    ) AS nt
        ON t.subreddit_id = nt.subreddit_id

WHERE 1=1
    AND t.subreddit_name != 'profile'
    -- For this query, we want to keep these values to know
    --  if they have changed since last time
    -- AND COALESCE(slo.type, '') = 'public'
    -- AND COALESCE(slo.verdict, 'f') <> 'admin_removed'
    -- AND COALESCE(slo.over_18, 'f') = 'f'
    -- AND COALESCE(nt.rating_short, '') NOT IN ('X', 'D')

ORDER BY subreddit_name
;

CPU times: user 2.06 s, sys: 172 ms, total: 2.23 s
Wall time: 10.2 s


## Inspect latest ratings

In [None]:
print(df_latest_ratings.shape)
df_latest_ratings.head()

(49558, 9)


Unnamed: 0,subreddit_id,subreddit_name,rated_e_latest,over_18,allow_discovery,rating_short,type,primary_topic,rating_name
0,t5_46wt4h,0hthaatsjaay,False,,,,,Mature Themes and Adult Content,
1,t5_4byrct,0nlyfantastic0,False,,,,,,
2,t5_36f9u6,0nlyleaks,False,,,,,,
3,t5_2qlzfy,0sanitymemes,False,,,M,,Internet Culture and Memes,Mature
4,t5_2qgijx,0xpolygon,True,,,E,,Crypto,Everyone


# Google sheet with subreddits to exclude
Spiros created [this sheet](https://docs.google.com/spreadsheets/d/1JiDpiLa8RKRTC0ZxjLI0ISgtngAFWTEbsoYEoeeaVO8/edit#gid=733540374) for subs that are missing a rating or have ratings that look wrong.

To be safe, we'll be excluding all the subs in these sheets.


In [None]:
GSHEET_KEY_EXCLUDES = '1JiDpiLa8RKRTC0ZxjLI0ISgtngAFWTEbsoYEoeeaVO8'
sh_filter = gc.open_by_key(GSHEET_KEY_EXCLUDES)

df_subs_to_filter = get_subs_to_filter_as_df(sh_filter, cols_to_keep='core')

100%|██████████| 8/8 [00:04<00:00,  1.93it/s]


(1374, 3) <- df_subs to filter shape





In [None]:
counts_describe(df_subs_to_filter)

Unnamed: 0,dtype,count,unique,unique-percent,null-count,null-percent
subreddit_name,object,1373,1373,100.00%,1,0.07%
category,object,1374,88,6.40%,0,0.00%
request_type,object,1374,9,0.66%,0,0.00%


In [None]:
df_subs_to_filter.tail()

Unnamed: 0,subreddit_name,category,request_type
648,liraglutide,Medical and Mental Health,Recheck Rating
650,b12_deficiency,Medical and Mental Health,Recheck Rating
652,epilepsy,Medical and Mental Health,Recheck Rating
654,iih,Medical and Mental Health,Recheck Rating
659,tourettes,Medical and Mental Health,Recheck Rating


# Read QA results from sheet

In [None]:
%%time
df_qa_raw = pd.DataFrame(
    d_wsh_names['qa_ready']['worksheet'].get_all_records()
)
df_qa_raw = df_qa_raw.rename(columns={k: k.lower().strip().replace(' ', '_') for k in df_qa_raw.columns})

df_qa_raw = df_qa_raw.dropna(subset=['subreddit_id', 'subreddit_name'])

print(df_qa_raw.shape)

(730, 35)
CPU times: user 96.4 ms, sys: 249 µs, total: 96.7 ms
Wall time: 908 ms


# Reshape & filter data

## Add latest ratings to df with QA results

We need the latest ratings & flags to make sure the filters are up to date.
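The merge below relies on `suffixes=('_latest', '')` to tell the fresh BigQuery values apart from the values stored in the QA sheet. Here is a minimal sketch of that behavior on toy data (the frames and values are made up for illustration). Note that only columns present in BOTH frames get the suffix; a column that exists only in the latest-ratings frame keeps its original name.

```python
import pandas as pd

# Toy stand-ins for df_latest_ratings and df_qa_raw (values are made up)
df_latest = pd.DataFrame({
    "subreddit_id": ["t5_a", "t5_b"],
    "rating_short": ["E", "M"],      # rating as of the latest partition
    "rated_e_latest": [True, False],  # only in this frame, so it is not suffixed
})
df_qa = pd.DataFrame({
    "subreddit_id": ["t5_b", "t5_a", "t5_c"],
    "rating_short": ["E", "E", "E"],  # rating at the time of QA
    "cluster_label": ["0001", "0001", "0002"],
})

# how='right' keeps every QA row, even when there is no latest rating for it
df_merged = df_latest.merge(
    df_qa,
    how="right",
    on="subreddit_id",
    suffixes=("_latest", ""),
)
print(df_merged)
```

In this toy example `t5_b` was rated `E` during QA but is `M` in the latest data, which is exactly the kind of drift the `_latest` columns are meant to catch. `t5_c` has no match, so its `_latest` columns are null.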

In [None]:
latest_suffix = '_latest'

col_rating_latest = f'rating_short{latest_suffix}'
col_over_18_latest = f'over_18{latest_suffix}'
col_rated_e_latest = 'rated_e_latest'
col_allow_discovery_latest = 'allow_discovery_latest'
col_primary_topic_latest = f"primary_topic{latest_suffix}"
# NOTE: despite the variable name, this column is a negative flag (TRUE = NOT country-relevant)
col_country_relevant = 'not_country_relevant'
col_releveant_to_cluster = 'relevant_to_cluster/_other_subreddits_in_cluster'
col_safe_to_show_in_cluster = 'safe_to_show_in_relation_to_cluster'

col_new_cluster_val = 'cluster_label'
col_new_cluster_name = 'cluster_label_k'
col_model_sort_order = 'model_sort_order'

l_cols_to_front = [
    'subreddit_id',
    'subreddit_name',
    col_rated_e_latest,
    col_over_18_latest,
    col_allow_discovery_latest,
    col_country_relevant,
    col_releveant_to_cluster,
    col_safe_to_show_in_cluster,

    col_rating_latest,
    'type',
    col_primary_topic_latest,
    col_new_cluster_val,
    col_model_sort_order,
    col_new_cluster_name,
    'rating_short',
    'over_18',
]


# Set some dtypes to prevent errors downstream
df_qa_raw[col_model_sort_order] = df_qa_raw[col_model_sort_order].astype(int)



In [None]:
df_qa_latest = (
    df_latest_ratings.drop(['subreddit_name'], axis=1)
    .merge(
        df_qa_raw,
        how='right',
        on=['subreddit_id', ],
        suffixes=(latest_suffix, ''),
    )
    .sort_values(
        by=[col_model_sort_order], ascending=True,
    )
)

# make sure all objects in col are str so we can sort by it
df_qa_latest[col_new_cluster_val] = df_qa_latest[col_new_cluster_val].astype(str)

df_qa_latest = df_qa_latest[
    reorder_array(
        l_cols_to_front,
        df_qa_latest.columns
    )
]
print(df_qa_latest.shape)

(730, 42)


In [None]:
# style_df_numeric(
#     df_qa_latest.iloc[:5, :15],
#     rename_cols_for_display=True,
# )

## Drop sensitive subreddits

These have been flagged as too sensitive or risky to show up in recommendations, so we're excluding them even if the crowd-sourced rating is `E` (people troll... it's reddit, after all).
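As a rough illustration of the idea (not the actual `remove_sensitive_clusters_and_subs` implementation, which also drops whole clusters), excluding by name regardless of rating could look like the sketch below. The subreddit names and both exclusion lists are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical exclusion lists; the real ones live in subclu and in the excludes sheet
l_sensitive_subs = ["sensitive_sub_a", "sensitive_sub_b"]
s_flagged_in_sheet = pd.Series(["flagged_sub_c", None])  # sheets can contain blank rows

df_toy = pd.DataFrame({
    "subreddit_name": ["Sensitive_Sub_A", "flagged_sub_c", "ok_sub_1", "ok_sub_2"],
    "rated_e_latest": [True, True, True, False],
})

# Combine both sources into one lowercase set, ignoring nulls
set_exclude = set(l_sensitive_subs) | set(s_flagged_in_sheet.dropna().str.lower())

# Drop matches by name, even when the sub is rated E
mask_exclude = df_toy["subreddit_name"].str.lower().isin(set_exclude)
df_kept = df_toy[~mask_exclude]

print(f"{df_toy.shape} <- Initial shape")
print(f"{df_kept.shape} <- Shape AFTER dropping flagged subs")
```

Rating-based filters are applied in a later step, which is why `ok_sub_2` survives here even though it is not rated `E`.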

In [None]:
# might need to re-import in order for live updates to show up
from subclu.models.reshape_clusters_v041 import (
    _L_MATURE_CLUSTERS_TO_EXCLUDE_FROM_QA_,
    _L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_,
    remove_sensitive_clusters_and_subs,
    print_subreddit_name_qa_checks,
    apply_qa_filters_for_fpr,
)

print(len(_L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_))
_L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_[30:38]

88


['askthe_donald',
 'benshapiro',
 'tucker_carlson',
 'trueanon',
 'beholdthemasterrace',
 'globallockdown',
 'nurembergtwo',
 'covidiots']

In [None]:
df_qa_clean = remove_sensitive_clusters_and_subs(
    df_qa_latest,
    additional_subs_to_filter=df_subs_to_filter['subreddit_name'],
    col_new_cluster_val='cluster_label',
    print_qa_check=True,
)

(730, 42) <- Initial shape
(720, 42) <- Shape AFTER dropping place-clusters
(720, 42) <- Shape AFTER dropping covid-clusters
(711, 42) <- Shape AFTER dropping sensitive clusters
(672, 42) <- Shape AFTER dropping flagged subs A
(668, 42) <- Shape AFTER dropping flagged subs B
(666, 42) <- Shape AFTER dropping covid-related subs
64 <- Total subreddits removed

QA keyword subreddit checks:
  ['corona']
  ['nichtdietagespresse']
  ['tuebingen']



# Apply filters based on QA + latest ratings

Keep only subreddits that:
- Are rated as `E`
    - Double check: `over_18` should be `f` or `NULL`
- Are relevant to the country (`not_country_relevant` is `FALSE`)
- Are relevant to the cluster (`TRUE`)
- Are safe to show in the cluster (`TRUE`)
- Have the `allow_discovery` flag set to `t` or `NULL` (i.e., NOT `f`)
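The conditions above can be sketched as boolean masks. This is only an approximation of what `apply_qa_filters_for_fpr` might do, written under two assumptions: the QA columns hold text such as `TRUE`/`NO`/blank, and nulls in the flag columns fall back to the permissive default. The shortened column names and all values are made up.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "subreddit_name": ["sub_1", "sub_2", "sub_3", "sub_4", "sub_5"],
    "rated_e_latest": [True, True, False, True, True],
    "over_18_latest": [None, "f", "f", "t", None],
    "not_country_relevant": [False, False, False, False, True],
    "relevant_to_cluster": ["TRUE", "NO", "TRUE", "TRUE", "TRUE"],
    "safe_to_show": ["TRUE", "", "TRUE", "TRUE", "TRUE"],
    "allow_discovery_latest": [None, "t", "t", "t", "f"],
})


def is_true(s: pd.Series) -> pd.Series:
    """Treat only an explicit TRUE (bool or text) as True; blanks and NO are False."""
    return s.astype(str).str.strip().str.upper().eq("TRUE")


mask_keep = (
    df_toy["rated_e_latest"].fillna(False).astype(bool)
    & df_toy["over_18_latest"].fillna("f").eq("f")          # NULL counts as not over 18
    & ~is_true(df_toy["not_country_relevant"])              # negative flag
    & is_true(df_toy["relevant_to_cluster"])
    & is_true(df_toy["safe_to_show"])
    & df_toy["allow_discovery_latest"].fillna("t").ne("f")  # NULL counts as discoverable
)
df_keep = df_toy[mask_keep]

print(f"{len(df_toy)} <- Initial subreddit count")
print(f"{len(df_keep)} <- Clean subreddits to use")
```

Only `sub_1` passes every condition; each of the other rows fails at least one mask.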


In [None]:
l_cols_qa = [
    col_country_relevant,
    col_releveant_to_cluster,
    col_safe_to_show_in_cluster,
    col_rated_e_latest,
    col_allow_discovery_latest,
    col_over_18_latest,
]

### Check each column individually

In [None]:
for c_ in l_cols_qa:
    display(
        value_counts_and_pcts(
            df_qa_clean,
            c_,
            add_col_prefix=False,
            reset_index=True,
            sort_index=True, cumsum=False,
            count_type='subreddits',
            rename_cols_for_display=True,
        )
    )
    print('')

Unnamed: 0,not country relevant,subreddits count,percent of subreddits
0,False,665,99.8%
1,True,1,0.2%





Unnamed: 0,relevant to cluster/ other subreddits in cluster,subreddits count,percent of subreddits
0,,45,6.8%
1,NO,135,20.3%
2,TRUE,486,73.0%





Unnamed: 0,safe to show in relation to cluster,subreddits count,percent of subreddits
0,,180,27.0%
1,True,486,73.0%





Unnamed: 0,rated e latest,subreddits count,percent of subreddits
0,False,32,4.8%
1,True,634,95.2%





Unnamed: 0,allow discovery latest,subreddits count,percent of subreddits
0,,666,100.0%





Unnamed: 0,over 18 latest,subreddits count,percent of subreddits
0,,666,100.0%





### Matrix of all conditions in one table



In [None]:
value_counts_and_pcts(
    df_qa_clean
    .fillna(
        value={
            col_allow_discovery_latest: 't', 
            col_over_18_latest: 'f',
            col_country_relevant: 'FALSE',
            col_releveant_to_cluster: 'FALSE',
            col_safe_to_show_in_cluster: 'FALSE',
        }
    )
    # Germany QA had other values: blanks ('') instead of NULL in these columns
    .replace(
        {
            col_releveant_to_cluster: {'': 'NO'},
            col_safe_to_show_in_cluster: {'': 'FALSE'},
        }
    )
    ,
    l_cols_qa[:],
    add_col_prefix=False,
    # reset_index=True,
    sort_index=True, cumsum=False,
    count_type='subreddits',
    rename_cols_for_display=True,
)

not country relevant,relevant to cluster/ other subreddits in cluster,safe to show in relation to cluster,rated e latest,allow discovery latest,over 18 latest,subreddits count,percent of subreddits
False,NO,False,False,t,f,8,1.2%
False,NO,False,True,t,f,171,25.7%
False,TRUE,True,False,t,f,24,3.6%
False,TRUE,True,True,t,f,462,69.4%
True,NO,False,True,t,f,1,0.2%
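For readers without access to `subclu`, a matrix like the one above can be approximated with plain pandas: fill nulls and blanks with a default, then count rows per combination of flags. This is a sketch with made-up data and only two columns, not the `value_counts_and_pcts` implementation.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "relevant_to_cluster": ["TRUE", "NO", "", "TRUE", None],
    "rated_e_latest": [True, True, True, False, True],
})

df_matrix = (
    df_toy
    # nulls and blanks both mean "not confirmed", so map them to NO
    .fillna({"relevant_to_cluster": "NO"})
    .replace({"relevant_to_cluster": {"": "NO"}})
    # one row per combination of flag values
    .groupby(["relevant_to_cluster", "rated_e_latest"])
    .size()
    .rename("subreddits_count")
    .reset_index()
)
df_matrix["percent_of_subreddits"] = df_matrix["subreddits_count"] / len(df_toy)
print(df_matrix)
```

Filling defaults before grouping matters: by default `groupby` drops rows whose key is null, so unfilled rows would silently disappear from the counts.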


In [None]:
# df_qa_clean[df_qa_clean['rated_e_latest'].isnull()].iloc[:5, :20]

### Create & apply masks

In [None]:
df_qa_clean = apply_qa_filters_for_fpr(
    df_qa_clean,
    print_qa_check=True,
)


QA keyword subreddit checks:
  ['nichtdietagespresse']

666 <- Initial subreddit count
462 <- Clean subreddits to use
(462, 42) <- df subreddits to use for FPR


# Reshape data for Subreddit seed -> Cluster list 

In [None]:
%%time
# col_new_cluster_val = 'cluster_label'
# col_new_cluster_name = 'cluster_label_k'
# col_new_cluster_prim_topic = 'cluster_name'

# make sure all objects in col are str so we can sort by it
# df_qa_clean[col_new_cluster_val] = df_qa_clean[col_new_cluster_val].astype(str)

l_cols_for_seeds = [
    'subreddit_id', 'subreddit_name',
    col_model_sort_order, col_rating_latest,
    col_new_cluster_val, col_new_cluster_name,
]


df_target_to_target_list = convert_distance_or_ab_to_list_for_fpr(
    df_qa_clean,
    convert_to_ab=True,
    col_counterpart_count='subreddits_to_recommend_count',
    col_list_cluster_names='list_cluster_subreddit_names',
    col_list_cluster_ids='list_cluster_subreddit_ids',
    l_cols_for_seeds=l_cols_for_seeds,
    l_cols_for_clusters=None, 
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    # col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    col_sort_by=col_new_cluster_val,
    verbose=True,
)

['subreddit_id', 'subreddit_name', 'model_sort_order', 'rating_short_latest', 'cluster_label', 'cluster_label_k']
['subreddit_id', 'subreddit_name', 'cluster_label']
  (3448, 8) <- df_ab.shape raw
  (2986, 8) <- df_ab.shape after removing matches to self
  Groupby cols:
    ['model_sort_order', 'subreddit_id_seed', 'subreddit_name_seed', 'cluster_label', 'cluster_label_k', 'rating_short_latest']
  (459, 8) <- df_a_to_b.shape
CPU times: user 38 ms, sys: 0 ns, total: 38 ms
Wall time: 38.1 ms
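The log above suggests three steps: pair every subreddit with every subreddit in its cluster, remove matches to self, then aggregate to one row per seed. Below is a minimal pandas sketch of that pattern on toy data. It is an assumption about the general approach, not the actual `convert_distance_or_ab_to_list_for_fpr` code, and the `_rec` suffix is a name chosen only for this example.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "subreddit_id": ["t5_1", "t5_2", "t5_3", "t5_4"],
    "subreddit_name": ["sub_a", "sub_b", "sub_c", "sub_d"],
    "cluster_label": ["0001", "0001", "0001", "0002"],
})

# 1) Self-join on cluster label: every sub paired with every sub in its cluster
df_ab = df_toy.merge(df_toy, on="cluster_label", suffixes=("_seed", "_rec"))
print(f"  {df_ab.shape} <- df_ab.shape raw")

# 2) A subreddit should not recommend itself
df_ab = df_ab[df_ab["subreddit_id_seed"] != df_ab["subreddit_id_rec"]]
print(f"  {df_ab.shape} <- df_ab.shape after removing matches to self")

# 3) One row per seed with the list of subreddits to recommend
df_a_to_b = (
    df_ab
    .groupby(["subreddit_id_seed", "subreddit_name_seed", "cluster_label"], sort=False)
    .agg(
        subreddits_to_recommend_count=("subreddit_id_rec", "count"),
        list_cluster_subreddit_names=("subreddit_name_rec", ", ".join),
        list_cluster_subreddit_ids=("subreddit_id_rec", ", ".join),
    )
    .reset_index()
)
print(f"  {df_a_to_b.shape} <- df_a_to_b.shape")
```

In the sketch, `sub_d` is alone in its cluster, so it has no counterparts after step 2 and drops out of the result. The same effect would explain why the real output has fewer seed rows (459) than clean subreddits (462).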


In [None]:
# df_target_to_target_list.sort_values(by=['cluster_label', ], ascending=True).iloc[:9, :]

In [None]:
df_target_to_target_list.head()

Unnamed: 0,subreddit_id_seed,subreddit_name_seed,cluster_label,cluster_label_k,rating_short_latest,subreddits_to_recommend_count,list_cluster_subreddit_names,list_cluster_subreddit_ids
0,t5_2veux,wix,0006-0008-0013-0016-0016-0020-0021-0025,k_0118_label,E,6,"gitea, traefik, portainer, abap, scriptable, l...","t5_wx7rh, t5_fucxb, t5_3jabm, t5_2seit, t5_nti..."
1,t5_wx7rh,gitea,0006-0008-0013-0016-0016-0020-0021-0025,k_0118_label,E,6,"wix, traefik, portainer, abap, scriptable, latex","t5_2veux, t5_fucxb, t5_3jabm, t5_2seit, t5_nti..."
2,t5_fucxb,traefik,0006-0008-0013-0016-0016-0020-0021-0025,k_0118_label,E,6,"wix, gitea, portainer, abap, scriptable, latex","t5_2veux, t5_wx7rh, t5_3jabm, t5_2seit, t5_nti..."
3,t5_3jabm,portainer,0006-0008-0013-0016-0016-0020-0021-0025,k_0118_label,E,6,"wix, gitea, traefik, abap, scriptable, latex","t5_2veux, t5_wx7rh, t5_fucxb, t5_2seit, t5_nti..."
4,t5_2seit,abap,0006-0008-0013-0016-0016-0020-0021-0025,k_0118_label,E,6,"wix, gitea, traefik, portainer, scriptable, latex","t5_2veux, t5_wx7rh, t5_fucxb, t5_3jabm, t5_nti..."


In [None]:
df_target_to_target_list.tail(10)

Unnamed: 0,subreddit_id_seed,subreddit_name_seed,cluster_label,cluster_label_k,rating_short_latest,subreddits_to_recommend_count,list_cluster_subreddit_names,list_cluster_subreddit_ids
451,t5_2qi34,shqip,0013-0023,k_0023_label,E,5,"ueber8000, mangade, gametwo, ytpromo, selfpromote","t5_4sfk6d, t5_4thzyd, t5_yqrtm, t5_33gd5, t5_2..."
452,t5_yqrtm,gametwo,0013-0023,k_0023_label,E,5,"ueber8000, mangade, shqip, ytpromo, selfpromote","t5_4sfk6d, t5_4thzyd, t5_2qi34, t5_33gd5, t5_2..."
455,t5_33xyp,mediathek,0013-0023-0041-0059-0062-0078-0084-0115-0310-0...,k_3927_label,E,3,"streamen, mirellativegal, stoizismus","t5_4cmjcc, t5_30efup, t5_3pjos"
456,t5_4cmjcc,streamen,0013-0023-0041-0059-0062-0078-0084-0115-0310-0...,k_3927_label,E,3,"mediathek, mirellativegal, stoizismus","t5_33xyp, t5_30efup, t5_3pjos"
458,t5_3pjos,stoizismus,0013-0023-0041-0059-0062-0078-0084-0115-0310-0...,k_3927_label,E,3,"mediathek, streamen, mirellativegal","t5_33xyp, t5_4cmjcc, t5_30efup"
457,t5_30efup,mirellativegal,0013-0023-0041-0059-0062-0078-0084-0115-0310-0...,k_3927_label,E,3,"mediathek, streamen, stoizismus","t5_33xyp, t5_4cmjcc, t5_3pjos"
231,t5_31x7o,american_football,10,k_0013_label,E,3,"elf, almancis, valnevase","t5_2skuk, t5_2h02ye, t5_51lqnd"
232,t5_2skuk,elf,10,k_0013_label,E,3,"american_football, almancis, valnevase","t5_31x7o, t5_2h02ye, t5_51lqnd"
264,t5_2h02ye,almancis,10,k_0013_label,E,3,"american_football, elf, valnevase","t5_31x7o, t5_2skuk, t5_51lqnd"
265,t5_51lqnd,valnevase,10,k_0013_label,E,3,"american_football, elf, almancis","t5_31x7o, t5_2skuk, t5_2h02ye"


## Save new output to google sheet

In [None]:
# header row (underscores replaced with spaces) followed by the data rows
d_wsh_names['clusters_t2t_fpr_after_qa']['worksheet'].update(
    [[c.replace('_', ' ') for c in df_target_to_target_list.columns]] +
    df_target_to_target_list.fillna('').values.tolist()
)

{'spreadsheetId': '1K0GPk-ud_UNPun_5EaODxY4CoywTEUbMyF8NNsk2q-8',
 'updatedCells': 3680,
 'updatedColumns': 8,
 'updatedRange': 'fpr_clusters_after_qa_DE_DE!A1:H460',
 'updatedRows': 460}
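A quick sanity check on the response above: the write should cover one header row plus one row per seed subreddit, starting at `A1`. The helper below is a hypothetical sketch (both function names are made up for this example) that computes the expected values from a DataFrame so they can be compared with the response returned by `update()`.

```python
import pandas as pd


def column_letter(n_cols: int) -> str:
    """Convert a 1-based column number to an A1-notation letter (1 -> A, 27 -> AA)."""
    letters = ""
    while n_cols > 0:
        n_cols, rem = divmod(n_cols - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def expected_update_stats(df: pd.DataFrame, sheet_name: str) -> dict:
    """Expected update stats when writing a header row plus all rows of df at A1."""
    n_rows = len(df) + 1  # +1 for the header row
    n_cols = df.shape[1]
    return {
        "updatedRows": n_rows,
        "updatedColumns": n_cols,
        "updatedCells": n_rows * n_cols,
        "updatedRange": f"{sheet_name}!A1:{column_letter(n_cols)}{n_rows}",
    }


# Toy frame with the same shape as df_target_to_target_list in this run: (459, 8)
df_check = pd.DataFrame([[0] * 8] * 459)
stats = expected_update_stats(df_check, "fpr_clusters_after_qa_DE_DE")
print(stats)
```

With 459 rows and 8 columns this gives 460 rows, 3680 cells, and the range `A1:H460`, matching the response in this run.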