# Purpose

### 2022-04-06
- model version: 0.4.1

In this notebook we apply the filters from the QA process to get the final output for country-relevant subs (this run uses the India sheet). The primary use case is geo-relevant FPRs; see [[i18n/ML] One Feed Experiment Spec: UK/CA/AU/IN](https://docs.google.com/document/d/10z0ZlZuYnPYzUjlKHLy5iKClCk1AvMsy6LSL1wRK7FE/edit#heading=h.uozl6p2gc9i3).


### Inputs:
- Values from the QA spreadsheet
    - Cluster IDs
        - Recommend subs with the same cluster ID
    - Subreddits that are not country-relevant
        - Remove them
    - Subreddits that are NSFW
        - Remove them
    - Subreddits that don't belong to a cluster
        - Remove them

### Outputs
- Write to a new tab in google sheets with format needed for FPRs
    - Use `gspread` to write table outputs directly to google sheets


### Reference
[Notebook template for v0.4.0 model](https://colab.research.google.com/drive/19p3O5DGxiEXj57OeCFf2gVIagUgCLx9v?usp=sharing)

# Imports & notebook setup

In [1]:
%load_ext google.colab.data_table
%load_ext autoreload
%autoreload 2

In [2]:
# colab auth for BigQuery, google drive, & google sheets (gspread)
from google.colab import auth, files, drive
from google.auth import default
import sys  # need sys for mounting gdrive path

auth.authenticate_user()
print('Authenticated')

Authenticated


### Append google-drive

In [3]:
# Attach google drive & import my python utility functions
# if drive.mount() fails, you can also:
#   MANUALLY CLICK ON "Mount Drive"
g_drive_root = '/content/drive'

try:
    drive.mount(g_drive_root, force_remount=True)
    print('   Authenticated & mounted Google Drive')
except Exception as e:
    print(e)
    raise Exception('You might need to manually mount google drive to colab') from e

l_paths_to_append = [
    f'{g_drive_root}/MyDrive/Colab Notebooks',

    # need to append the path to subclu so that colab can import things properly
    f'{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n'
]
for path_ in l_paths_to_append:
    if path_ in sys.path:
        sys.path.remove(path_)
    print(f" Appending path: {path_}")
    sys.path.append(path_)

Mounted at /content/drive
   Authenticated & mounted Google Drive
 Appending path: /content/drive/MyDrive/Colab Notebooks
 Appending path: /content/drive/MyDrive/Colab Notebooks/subreddit_clustering_i18n


### Install custom libraries

In [4]:
# install subclu & libraries needed to read parquet files from GCS & spreadsheets
#  make sure to use the [colab] `extra` because it includes colab-specific libraries
module_path = f"{g_drive_root}/MyDrive/Colab Notebooks/subreddit_clustering_i18n/[colab]"

!pip install -e $"$module_path" --quiet


### General Imports


In [5]:
# Regular Imports
import os
from datetime import datetime

from google.cloud import bigquery

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib_venn import venn2_unweighted, venn3_unweighted
from tqdm import tqdm

# auth for google sheets
creds_, _ = default()
import gspread
# from oauth2client.client import GoogleCredentials
gc = gspread.authorize(creds_)


# Set env variable needed by some libraries to get data from BigQuery
# os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-science-prod-218515'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

### `subclu` import (custom module)

In [6]:
# subclu imports

from subclu.utils.eda import (
    setup_logging, counts_describe, value_counts_and_pcts,
    notebook_display_config, print_lib_versions,
    style_df_numeric, reorder_array,
)
from subclu.models.clustering_utils import (
    create_dynamic_clusters,
    convert_distance_or_ab_to_list_for_fpr
)
from subclu.models.reshape_clusters_v041 import (
    get_subs_to_filter_as_df
)


setup_logging()
print_lib_versions([gspread, pd, np])

python		v 3.7.13
===
gspread		v: 4.0.1
pandas		v: 1.3.5
numpy		v: 1.21.5


# Checklist to re-run for a country:

- Copy the google sheet cell from the 1st notebook so these match:
    - Google sheet KEY
    - country name for google sheet name
    - country abbreviation in google sheet tab names (`target_abbrev_`)
- Add the new tab to create: `clusters_t2t_fpr_after_qa`
    - Here's where we'll save the clusters after QA


# Google sheet with country QA

For now, copy the same cell from the QA notebook. 

In the future we might need to register QA sheets in a central location to reduce copy/pasta.
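
One possible shape for such a central registry is sketched below. Everything here is hypothetical (`QASheet`, `QA_SHEETS`, and the method names are not part of `subclu`); only the India key and the naming patterns come from this notebook.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class QASheet:
    """Hypothetical record describing one country's QA google sheet."""
    country_name: str
    target_abbrev: str
    gsheet_key: str

    @property
    def gsheet_name(self) -> str:
        # Same naming pattern as GSHEET_NAME in the cell below
        return f"i18n {self.country_name} subreddits and clusters - model v0.4.1"

    def after_qa_tab(self) -> str:
        # Same pattern as the 'clusters_t2t_fpr_after_qa' tab name
        return f"fpr_clusters_after_qa_{self.target_abbrev}_{self.target_abbrev}"


# Hypothetical registry keyed by country abbreviation
QA_SHEETS: Dict[str, QASheet] = {
    "in": QASheet("India", "in", "1LgcrVG-1vhMoC0JifrZYCCgsalIVigk13Yy7-ZKzw1U"),
}

sheet = QA_SHEETS["in"]
print(sheet.gsheet_name)
print(sheet.after_qa_tab())
```

With something like this, each notebook would only need the country abbreviation instead of a copied cell.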

In [7]:
# # %%time
os.environ['GOOGLE_CLOUD_PROJECT'] = 'data-prod-165221'

country_name_sheet_ = 'India'
target_abbrev_ = 'in'
GSHEET_KEY = '1LgcrVG-1vhMoC0JifrZYCCgsalIVigk13Yy7-ZKzw1U'
GSHEET_NAME = f'i18n {country_name_sheet_} subreddits and clusters - model v0.4.1'


d_wsh_names = {
    'qa_ready': {
        'name': 'subs_need_to_be_rated',
    },
    'clusters_t2t_list_raw': {
        'name': f'raw_clusters_list_{target_abbrev_}_{target_abbrev_}',
    },
    'sub_raw': {
        'name': 'raw_data_per_subreddit',
    },
    'clusters_t2t_fpr_raw': {
        'name': f'raw_clusters_fpr_{target_abbrev_}_{target_abbrev_}',
    },
    'clusters_t2t_fpr_after_qa': {
        'name': f'fpr_clusters_after_qa_{target_abbrev_}_{target_abbrev_}',
    },
}

if GSHEET_KEY is not None:
    sh = gc.open_by_key(GSHEET_KEY)
    print(f"Opening google worksheet: {GSHEET_NAME} ...")
else:
    print(f"** Creating google worksheet: {GSHEET_NAME} ...")
    sh = gc.create(GSHEET_NAME)

# create worksheets:
for _, d_ in d_wsh_names.items():
    sh_name = d_['name']
    try:
        d_['worksheet'] = sh.worksheet(sh_name)
        print(f"  Opening tab/sheet: {sh_name} ...")
    except gspread.exceptions.WorksheetNotFound:
        print(f"  ** Creating tab/sheet: {sh_name} ...")
        d_['worksheet'] = sh.add_worksheet(sh_name, rows=5, cols=5)

if GSHEET_KEY is None:
    print(f"\n*** New sheet ID (assign it to GSHEET_KEY variable): ***\n{sh.id}\n")

Opening google worksheet: i18n India subreddits and clusters - model v0.4.1 ...
  Opening tab/sheet: subs_need_to_be_rated ...
  Opening tab/sheet: raw_clusters_list_in_in ...
  Opening tab/sheet: raw_data_per_subreddit ...
  Opening tab/sheet: raw_clusters_fpr_in_in ...
  ** Creating tab/sheet: fpr_clusters_after_qa_in_in ...


# Get latest ratings & flags (e.g., `allow_discovery`)

## SQL

In [8]:
%%time
%%bigquery df_latest_ratings --project data-science-prod-218515 

-- Get ratings & other flags for all subs in the model
--  we'll filter/match to country in python
DECLARE PARTITION_DATE DATE DEFAULT (CURRENT_DATE() - 1);

SELECT
    t.subreddit_id
    , t.subreddit_name
    , CASE WHEN nt.rating_short = 'E' THEN True
        ELSE False
    END AS rated_e_latest
    , slo.over_18
    , slo.allow_discovery
    , nt.rating_short
    , slo.type
    , nt.primary_topic
    , nt.rating_name

FROM `reddit-employee-datasets.david_bermejo.subclu_v0041_subreddit_clusters_c_a` AS t
    -- Add rating so we can filter out subs not rated as E
    LEFT JOIN (
        SELECT *
        FROM `data-prod-165221.ds_v2_postgres_tables.subreddit_lookup`
        -- Get latest partition
        WHERE dt = PARTITION_DATE
    ) AS slo
        ON t.subreddit_id = slo.subreddit_id
    LEFT JOIN (
        SELECT * FROM `data-prod-165221.cnc.shredded_crowdsource_topic_and_rating`
        WHERE pt = PARTITION_DATE
    ) AS nt
        ON t.subreddit_id = nt.subreddit_id

WHERE 1=1
    AND t.subreddit_name != 'profile'
    -- For this query, we want to keep these values to know
    --  if they have changed since last time
    -- AND COALESCE(slo.type, '') = 'public'
    -- AND COALESCE(slo.verdict, 'f') <> 'admin_removed'
    -- AND COALESCE(slo.over_18, 'f') = 'f'
    -- AND COALESCE(nt.rating_short, '') NOT IN ('X', 'D')

ORDER BY subreddit_name
;

CPU times: user 1.8 s, sys: 145 ms, total: 1.95 s
Wall time: 6.41 s


## Inspect latest ratings

In [9]:
print(df_latest_ratings.shape)
df_latest_ratings.head()

(49558, 9)


Unnamed: 0,subreddit_id,subreddit_name,rated_e_latest,over_18,allow_discovery,rating_short,type,primary_topic,rating_name
0,t5_46wt4h,0hthaatsjaay,False,t,,,public,Mature Themes and Adult Content,
1,t5_4byrct,0nlyfantastic0,False,t,,,public,,
2,t5_36f9u6,0nlyleaks,False,t,,,public,,
3,t5_2qlzfy,0sanitymemes,False,,,M,public,Internet Culture and Memes,Mature
4,t5_2qgijx,0xpolygon,True,,,E,public,Crypto,Everyone


In [10]:
df_latest_ratings[df_latest_ratings['subreddit_name'].str.contains('667')]

Unnamed: 0,subreddit_id,subreddit_name,rated_e_latest,over_18,allow_discovery,rating_short,type,primary_topic,rating_name
308,t5_2ytof,667,True,,,E,public,Music,Everyone
25323,t5_2zcwdh,leakoeln50667,False,,,,public,,


# Google sheet with subreddits to exclude
Spiros created [this sheet](https://docs.google.com/spreadsheets/d/1JiDpiLa8RKRTC0ZxjLI0ISgtngAFWTEbsoYEoeeaVO8/edit#gid=733540374) for subs that are missing a rating or have ratings that look wrong.

To be safe, the plan is to exclude all the subs listed there. In this run the cells below are commented out, so the list is not applied (`additional_subs_to_filter=None` later on).


In [11]:
# GSHEET_KEY_EXCLUDES = '1JiDpiLa8RKRTC0ZxjLI0ISgtngAFWTEbsoYEoeeaVO8'
# sh_filter = gc.open_by_key(GSHEET_KEY_EXCLUDES)

# df_subs_to_filter = get_subs_to_filter_as_df(sh_filter, cols_to_keep='core')

# counts_describe(df_subs_to_filter)

In [12]:
# df_subs_to_filter.tail()

# Read QA results from sheet

In [13]:
%%time
df_qa_raw = pd.DataFrame(
    d_wsh_names['qa_ready']['worksheet'].get_all_records()
)
df_qa_raw = df_qa_raw.rename(columns={k: k.lower().strip().replace(' ', '_') for k in df_qa_raw.columns})

print(df_qa_raw.shape)

(1160, 35)
CPU times: user 146 ms, sys: 162 µs, total: 146 ms
Wall time: 559 ms


# Reshape & filter data

## Add latest ratings to df with QA results

We need the latest ratings & flags to make sure the filters are up to date.

In [14]:
latest_suffix = '_latest'

col_rating_latest = f'rating_short{latest_suffix}'
col_over_18_latest = f'over_18{latest_suffix}'
col_rated_e_latest = 'rated_e_latest'
col_allow_discovery_latest = 'allow_discovery_latest'
col_primary_topic_latest = f"primary_topic{latest_suffix}"
# NOTE: this QA column is negated: False means the sub IS country-relevant
col_country_relevant = 'not_country_relevant'
col_releveant_to_cluster = 'relevant_to_cluster/_other_subreddits_in_cluster'
col_safe_to_show_in_cluster = 'safe_to_show_in_relation_to_cluster'

col_new_cluster_val = 'cluster_label'
col_new_cluster_name = 'cluster_label_k'
col_model_sort_order = 'model_sort_order'

l_cols_to_front = [
    'subreddit_id',
    'subreddit_name',
    col_rated_e_latest,
    col_over_18_latest,
    col_allow_discovery_latest,
    col_country_relevant,
    col_releveant_to_cluster,
    col_safe_to_show_in_cluster,

    col_rating_latest,
    'type',
    col_primary_topic_latest,
    col_new_cluster_val,
    col_model_sort_order,
    col_new_cluster_name,
    'rating_short',
    'over_18',

]

In [15]:
df_qa_latest = (
    df_latest_ratings.drop(['subreddit_name'], axis=1)
    .merge(
        df_qa_raw,
        how='right',
        on=['subreddit_id', ],
        suffixes=(latest_suffix, ''),
    )
    .sort_values(
        by=[col_model_sort_order], ascending=True,
    )
)

# make sure all objects in col are str so we can sort by it
df_qa_latest[col_new_cluster_val] = df_qa_latest[col_new_cluster_val].astype(str)

df_qa_latest = df_qa_latest[
    reorder_array(
        l_cols_to_front,
        df_qa_latest.columns
    )
]
print(df_qa_latest.shape)

(1160, 42)
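
To make the effect of `how='right'` and `suffixes=(latest_suffix, '')` easier to see, here is a minimal sketch with made-up rows (the toy frames are assumptions, not real data). Columns that exist on both sides get `_latest` on the fresh-ratings side, while the QA sheet columns keep their original names.

```python
import pandas as pd

# Toy stand-ins for df_latest_ratings and df_qa_raw (values are made up)
df_latest = pd.DataFrame({
    'subreddit_id': ['t5_a', 't5_b'],
    'rating_short': ['E', 'M'],
    'over_18': [None, 't'],
})
df_qa = pd.DataFrame({
    'subreddit_id': ['t5_a', 't5_c'],
    'rating_short': ['E', 'E'],
    'over_18': ['f', 'f'],
    'cluster_label': ['0001', '0002'],
})

df_merged = df_latest.merge(
    df_qa,
    how='right',               # keep every QA row, even without a fresh rating
    on=['subreddit_id'],
    suffixes=('_latest', ''),  # left (fresh) columns get the suffix
)
print(df_merged.columns.tolist())
print(df_merged)
```

A QA row with no match in the latest ratings (`t5_c` here) is kept, with nulls in the `_latest` columns.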


In [16]:
# style_df_numeric(
#     df_qa_latest.iloc[:5, :15],
#     rename_cols_for_display=True,
# )

## Drop sensitive subreddits

These have been flagged as too sensitive or risky to show up in recommendations, so we're excluding them even if the crowd-sourced rating is `E` (people troll... it's reddit, after all).

In [17]:
# might need to re-import in order for live updates to show up
from subclu.models.reshape_clusters_v041 import (
    _L_MATURE_CLUSTERS_TO_EXCLUDE_FROM_QA_,
    _L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_,
    remove_sensitive_clusters_and_subs,
    print_subreddit_name_qa_checks,
    apply_qa_filters_for_fpr,
)

print(len(_L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_))
_L_SENSITIVE_SUBREDDITS_TO_EXCLUDE_FROM_FPRS_[30:38]

96


['askthe_donald',
 'benshapiro',
 'tucker_carlson',
 'trueanon',
 'beholdthemasterrace',
 'globallockdown',
 'nurembergtwo',
 'covidiots']

In [18]:
df_qa_clean = remove_sensitive_clusters_and_subs(
    df_qa_latest,
    additional_subs_to_filter=None,  # df_subs_to_filter['subreddit_name'],
    col_new_cluster_val='cluster_label',
    print_qa_check=True,
)

(1160, 42) <- Initial shape
(1160, 42) <- Shape AFTER dropping place-clusters
(1160, 42) <- Shape AFTER dropping covid-clusters
(1142, 42) <- Shape AFTER dropping sensitive clusters
(1142, 42) <- Shape AFTER dropping flagged subs B
(1141, 42) <- Shape AFTER dropping covid-related subs
19 <- Total subreddits removed

QA keyword subreddit checks:
  ['kritisanon', 'kritisanonlovers', 'indianonpolitical', 'anoncorporatechatind']
  ['indiannsfwmeme']
  ['onlyfans_hacked_x']
  ['hinakhanfap', 'shirleysetiafap', 'raveenatandonfap', 'desifappedtoher', 'payalfappers', 'fapshaggers', 'nofapteens', 'nofap']
  ['nofapteens', 'indianteenagers']
  ['adhdindia']
  ['radhikaseth', 'sadhguru', 'adhdindia']
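
For readers without access to `subclu`, the name-based part of this step can be approximated with a plain `isin` mask. This is only a sketch with a two-item stand-in list and toy rows; the real `remove_sensitive_clusters_and_subs` also drops whole clusters and may work differently.

```python
import pandas as pd

# Stand-in for the exclusion list (the real one has 96 names)
l_sensitive_subs = ['covidiots', 'benshapiro']

# Toy data: even an E-rated sub gets dropped if it is on the list
df_toy = pd.DataFrame({
    'subreddit_name': ['indiancinema', 'covidiots', 'shahrukhkhan'],
    'rated_e_latest': [True, True, True],
})

mask_sensitive = df_toy['subreddit_name'].str.lower().isin(l_sensitive_subs)
df_toy_clean = df_toy[~mask_sensitive].copy()

print(f"{df_toy.shape} <- Initial shape")
print(f"{df_toy_clean.shape} <- Shape AFTER dropping flagged subs")
```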



# Apply filters based on QA + latest ratings

Keep only subreddits that:
- Are rated as `E`
    - Double check: `over_18` should be `f` or `NULL`
- Are relevant to the country (`not_country_relevant` is `FALSE`)
- Are relevant to the cluster (`TRUE`)
- Are safe to show in the cluster (`TRUE`)
- Have the `allow_discovery` flag set to `t` or `NULL` (i.e., NOT `f`)
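
The conditions above can be written as one combined boolean mask. The sketch below uses made-up rows and assumes the QA columns are already real booleans (not the strings `'TRUE'`/`'FALSE'`); it is not the code inside `apply_qa_filters_for_fpr`.

```python
import pandas as pd

col_relevant = 'relevant_to_cluster/_other_subreddits_in_cluster'

# Toy rows (made up); column names follow the QA sheet
df_toy = pd.DataFrame({
    'subreddit_name': ['keep_me', 'not_local', 'hidden', 'off_topic', 'not_e'],
    'not_country_relevant': [False, True, False, False, False],
    col_relevant: [True, True, True, False, True],
    'safe_to_show_in_relation_to_cluster': [True, True, True, True, True],
    'rated_e_latest': [True, True, True, True, False],
    'allow_discovery_latest': [None, 't', 'f', 't', 't'],
    'over_18_latest': ['f', None, None, 'f', 'f'],
})

mask_keep = (
    df_toy['rated_e_latest']
    & df_toy['over_18_latest'].fillna('f').eq('f')          # NULL counts as not over-18
    & ~df_toy['not_country_relevant']                       # column is negated
    & df_toy[col_relevant]
    & df_toy['safe_to_show_in_relation_to_cluster']
    & df_toy['allow_discovery_latest'].fillna('t').ne('f')  # NULL counts as discoverable
)
df_toy_keep = df_toy[mask_keep]

print(f"{len(df_toy):,} <- Initial subreddit count")
print(f"{len(df_toy_keep):,} <- Clean subreddits to use")
```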


In [19]:
l_cols_qa = [
    col_country_relevant,
    col_releveant_to_cluster,
    col_safe_to_show_in_cluster,
    col_rated_e_latest,
    col_allow_discovery_latest,
    col_over_18_latest,
]

### Check each column individually

In [20]:
for c_ in l_cols_qa:
    display(
        value_counts_and_pcts(
            df_qa_clean,
            c_,
            add_col_prefix=False,
            reset_index=True,
            sort_index=True, cumsum=False,
            count_type='subreddits',
            rename_cols_for_display=True,
        )
    )
    print('')

Unnamed: 0,not country relevant,subreddits count,percent of subreddits
0,False,1129,98.9%
1,True,12,1.1%





Unnamed: 0,relevant to cluster/ other subreddits in cluster,subreddits count,percent of subreddits
0,False,291,25.5%
1,True,850,74.5%





Unnamed: 0,safe to show in relation to cluster,subreddits count,percent of subreddits
0,False,273,23.9%
1,True,868,76.1%





Unnamed: 0,rated e latest,subreddits count,percent of subreddits
0,False,119,10.4%
1,True,1022,89.6%





Unnamed: 0,allow discovery latest,subreddits count,percent of subreddits
0,f,8,0.7%
1,t,91,8.0%
2,,1042,91.3%





Unnamed: 0,over 18 latest,subreddits count,percent of subreddits
0,f,299,26.2%
1,t,4,0.4%
2,,838,73.4%





### Matrix of all conditions in one table



In [21]:
value_counts_and_pcts(
    df_qa_clean.fillna(value={col_allow_discovery_latest: 't', col_over_18_latest: 'f'}),
    l_cols_qa[:],
    add_col_prefix=False,
    # reset_index=True,
    sort_index=True, cumsum=False,
    count_type='subreddits',
    rename_cols_for_display=True,
)

not country relevant,relevant to cluster/ other subreddits in cluster,safe to show in relation to cluster,rated e latest,allow discovery latest,over 18 latest,subreddits count,percent of subreddits
False,False,False,False,t,f,58,5.1%
False,False,False,False,t,t,1,0.1%
False,False,False,True,t,f,188,16.5%
False,False,False,True,t,t,3,0.3%
False,False,True,False,t,f,1,0.1%
False,False,True,True,t,f,29,2.5%
False,True,False,True,t,f,11,1.0%
False,True,True,False,t,f,58,5.1%
False,True,True,True,f,f,8,0.7%
False,True,True,True,t,f,772,67.7%
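
A likely reason for the `fillna` before counting: in plain pandas, grouping by several columns drops rows where any key is null, so the combination counts would miss them. Whether `value_counts_and_pcts` behaves the same way is an assumption; the toy sketch below only shows the pandas default.

```python
import pandas as pd

df_toy = pd.DataFrame({
    'rated_e_latest': [True, True, True, False],
    'allow_discovery_latest': ['t', None, 'f', None],
})
cols = ['rated_e_latest', 'allow_discovery_latest']

# Default groupby drops rows whose key is null
counts_raw = df_toy.groupby(cols).size()

# Fill nulls with the permissive default first, so every row is counted
counts_filled = (
    df_toy.fillna(value={'allow_discovery_latest': 't'})
    .groupby(cols)
    .size()
)

print(counts_raw.sum(), '<- rows counted without fillna')
print(counts_filled.sum(), '<- rows counted with fillna')
```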


In [22]:
df_qa_clean[df_qa_clean['rated_e_latest'].isnull()].iloc[:5, :20]

Unnamed: 0,subreddit_id,subreddit_name,rated_e_latest,over_18_latest,allow_discovery_latest,not_country_relevant,relevant_to_cluster/_other_subreddits_in_cluster,safe_to_show_in_relation_to_cluster,rating_short_latest,type,primary_topic_latest,cluster_label,model_sort_order,cluster_label_k,rating_short,over_18,rating_name_latest,cluster_label_int,cluster_topic_mix,rated_e


### Create & apply masks


In [23]:
print(datetime.utcnow())
df_qa_clean = apply_qa_filters_for_fpr(
    df_qa_clean,
    print_qa_check=True,
)

2022-04-06 18:37:15.122576

QA keyword subreddit checks:
  ['indianonpolitical', 'anoncorporatechatind']
  ['indianteenagers']
  ['adhdindia']
  ['sadhguru', 'adhdindia']

1,141 <- Initial subreddit count
772 <- Clean subreddits to use
(772, 42) <- df subreddits to use for FPR


# Reshape data for Subreddit seed -> Cluster list 

In [24]:
%%time
# col_new_cluster_val = 'cluster_label'
# col_new_cluster_name = 'cluster_label_k'
# col_new_cluster_prim_topic = 'cluster_name'

# make sure all objects in col are str so we can sort by it
# df_qa_clean[col_new_cluster_val] = df_qa_clean[col_new_cluster_val].astype(str)

l_cols_for_seeds = [
    'subreddit_id', 'subreddit_name',
    col_model_sort_order, col_rating_latest,
    col_new_cluster_val, col_new_cluster_name,
]

print(datetime.utcnow())

df_target_to_target_list = convert_distance_or_ab_to_list_for_fpr(
    df_qa_clean,
    convert_to_ab=True,
    col_counterpart_count='subreddits_to_recommend_count',
    col_list_cluster_names='list_cluster_subreddit_names',
    col_list_cluster_ids='list_cluster_subreddit_ids',
    l_cols_for_seeds=l_cols_for_seeds,
    l_cols_for_clusters=None, 
    col_new_cluster_val=col_new_cluster_val,
    col_new_cluster_name=col_new_cluster_name,
    # col_new_cluster_prim_topic=col_new_cluster_prim_topic,
    col_sort_by=col_new_cluster_val,
    verbose=True,
)

2022-04-06 18:37:16.117681
['subreddit_id', 'subreddit_name', 'model_sort_order', 'rating_short_latest', 'cluster_label', 'cluster_label_k']
['subreddit_id', 'subreddit_name', 'cluster_label']
  (6278, 8) <- df_ab.shape raw
  (5506, 8) <- df_ab.shape after removing matches to self
  Groupby cols:
    ['model_sort_order', 'subreddit_id_seed', 'subreddit_name_seed', 'cluster_label', 'cluster_label_k', 'rating_short_latest']
  (768, 8) <- df_a_to_b.shape
CPU times: user 97.3 ms, sys: 516 µs, total: 97.8 ms
Wall time: 249 ms
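
The log above (`df_ab` raw, then without self-matches, then `df_a_to_b`) suggests a self-join on the cluster label followed by a group-by per seed. The sketch below reproduces that idea on toy rows; the cluster labels and `solo_sub` are made up, and this is not the `subclu` implementation.

```python
import pandas as pd

df_toy = pd.DataFrame({
    'subreddit_id': ['t5_1', 't5_2', 't5_3', 't5_4'],
    'subreddit_name': ['mcumemes', 'nowayhome', 'spidermannowayhome', 'solo_sub'],
    'cluster_label': ['0013', '0013', '0013', '0099'],
})


def join_as_str(values: pd.Series) -> str:
    """Collapse a group's values into one comma-separated string."""
    return ', '.join(values)


# 1) Self-join on cluster label: every (seed, candidate) pair within a cluster
df_ab = df_toy.merge(df_toy, on='cluster_label', suffixes=('_seed', ''))
print(f"  {df_ab.shape} <- df_ab.shape raw")

# 2) A sub should not recommend itself
df_ab = df_ab[df_ab['subreddit_id_seed'] != df_ab['subreddit_id']]
print(f"  {df_ab.shape} <- df_ab.shape after removing matches to self")

# 3) One row per seed, with its recommendations collapsed into strings
df_a_to_b = (
    df_ab
    .groupby(['subreddit_id_seed', 'subreddit_name_seed', 'cluster_label'])
    .agg(
        subreddits_to_recommend_count=('subreddit_id', 'nunique'),
        list_cluster_subreddit_names=('subreddit_name', join_as_str),
        list_cluster_subreddit_ids=('subreddit_id', join_as_str),
    )
    .reset_index()
)
print(f"  {df_a_to_b.shape} <- df_a_to_b.shape")
```

A sub that ends up alone in its cluster (`solo_sub` here) has no counterpart after step 2, so it drops out of the output. That would explain why 772 clean subs become 768 seed rows.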


In [25]:
# df_target_to_target_list.sort_values(by=['cluster_label', ], ascending=True).iloc[:9, :]

In [34]:
(
    df_target_to_target_list
    .drop([col_new_cluster_val, 'list_cluster_subreddit_ids'], axis=1)
    .head(10)
)

Unnamed: 0,subreddit_id_seed,subreddit_name_seed,cluster_label_k,rating_short_latest,subreddits_to_recommend_count,list_cluster_subreddit_names
0,t5_3dpt3l,bolly_actress_hd,k_3927_label,E,3,"classicdesibeauties, southindianangels, justsaree"
1,t5_4c9b7z,classicdesibeauties,k_3927_label,E,3,"bolly_actress_hd, southindianangels, justsaree"
2,t5_3mj6vk,southindianangels,k_3927_label,E,3,"bolly_actress_hd, classicdesibeauties, justsaree"
3,t5_3orvgc,justsaree,k_3927_label,E,3,"bolly_actress_hd, classicdesibeauties, southin..."
17,t5_12ppxo,bollywoodmemes,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."
16,t5_fknyy,bollyblindsngossip,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."
15,t5_3p026j,ranbirkapooruniverse,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."
14,t5_4j6pz3,classicdesicelebs,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."
12,t5_2v3t9,shahrukhkhan,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."
11,t5_2t24o,indiancinema,k_3927_label,E,13,"saipallavi, kollywood, internetrabbithole, mal..."


In [31]:
(
    df_target_to_target_list
    .drop([col_new_cluster_val, 'list_cluster_subreddit_ids'], axis=1)
    .tail(10)
)

Unnamed: 0,subreddit_id_seed,subreddit_name_seed,cluster_label_k,rating_short_latest,subreddits_to_recommend_count,list_cluster_subreddit_names
489,t5_30jzq,clashofclansrecruit,k_0013_label,E,3,"wattles, contestofchampionslfg, saiyanpeopletw..."
483,t5_o6oqi,wattles,k_0013_label,E,3,"clashofclansrecruit, contestofchampionslfg, sa..."
554,t5_3kc7e,mcumemes,k_0013_label,E,2,"nowayhome, spidermannowayhome"
553,t5_40jv5m,spidermannowayhome,k_0013_label,E,2,"nowayhome, mcumemes"
552,t5_2ck6uk,nowayhome,k_0013_label,E,2,"spidermannowayhome, mcumemes"
595,t5_2uhb6,hardcoreaww,k_0013_label,E,1,dailydoseofaww
596,t5_5f9130,dailydoseofaww,k_0013_label,E,1,hardcoreaww
98,t5_4gwjlz,acerpredatorhelios300,k_0013_label,E,2,"hpvictus, reeltoreel"
99,t5_4gr4gi,hpvictus,k_0013_label,E,2,"acerpredatorhelios300, reeltoreel"
105,t5_2ux78,reeltoreel,k_0013_label,E,2,"acerpredatorhelios300, hpvictus"


## Save new output to google sheet

In [35]:
# header row: column names with spaces instead of underscores
l_header_ = [c.replace('_', ' ') for c in df_target_to_target_list.columns]

d_wsh_names['clusters_t2t_fpr_after_qa']['worksheet'].update(
    [l_header_] +
    df_target_to_target_list.fillna('').values.tolist()
)

{'spreadsheetId': '1LgcrVG-1vhMoC0JifrZYCCgsalIVigk13Yy7-ZKzw1U',
 'updatedCells': 6152,
 'updatedColumns': 8,
 'updatedRange': 'fpr_clusters_after_qa_in_in!A1:H769',
 'updatedRows': 769}
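
The payload sent to the sheet is a list of lists: one header row plus one row per seed, which is why `updatedRows` (769) is one more than the 768 seed rows. The sketch below builds the same kind of payload from a toy frame, without calling google sheets.

```python
import pandas as pd

# Toy frame (values are made up) with one missing cell
df_toy = pd.DataFrame({
    'subreddit_name_seed': ['hardcoreaww', 'dailydoseofaww'],
    'list_cluster_subreddit_names': ['dailydoseofaww', None],
})

# Header: column names with spaces instead of underscores
header_row = [c.replace('_', ' ') for c in df_toy.columns]

# Replace nulls with empty strings so every cell is a plain value
data_rows = df_toy.fillna('').values.tolist()

payload = [header_row] + data_rows
for row in payload:
    print(row)
```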