# Clustering and Recommendations

We've engineered a lot of features, and we don't want to weight them all equally. This notebook builds a separate similarity measure for each of the following feature groups:

- metadata
- words and sentences
- repetition
- profanity
- parts of speech
- point of view
- sentiment
- bag of words

## import

In [1]:
import pickle
import numpy as np
import pandas as pd
from datetime import date
import json
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize
import gensim
from gensim.corpora.dictionary import Dictionary

import spacy
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler



In [2]:
sw = stopwords.words("english")

In [3]:
with open(f'../data/metascripts_df_sentiment.pickle', 'rb') as file:
    metascripts = pickle.load(file)
    
with open(f'../data/metascripts_repetition_df.pickle', 'rb') as file:
    metascripts_rep = pickle.load(file)
    
with open('../data/cosims_df.pickle', 'rb') as file:
    bow_cosims = pickle.load(file)
    
with open('../data/pos_props_df.pickle', 'rb') as file:
    pos_props_df = pickle.load(file)
    
with open('../data/pov_props_relative_df.pickle', 'rb') as file:
    pov_props_df = pickle.load(file)

## prepare the data for unsupervised learning

### bow cosims
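The wide bag-of-words cosine similarity dataframe is used as is in later steps. As a sanity check, count the near-1 similarities per show; each show should have exactly one (its similarity with itself).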

In [4]:
(bow_cosims > 0.99999).sum()

Jim Gaffigan: Comedy Monster (2021) | Transcript                     1
Louis C. K.: Sorry (2021) | Transcript                               1
Drew Michael: Drew Michael (2018) | Transcript                       1
Drew Michael: Red Blue Green (2021) | Transcript                     1
Mo Amer: Mohammed in Texas (2021) | Transcript                       1
                                                                    ..
GEORGE CARLIN: COMPLAINTS AND GRIEVANCES (2001) – FULL TRANSCRIPT    1
GEORGE CARLIN: LIFE IS WORTH LOSING (2006) – Transcript              1
George Carlin: It’s Bad For Ya (2008) Full transcript                1
Dave Chappelle: 8:46 – Transcript                                    1
JIM JEFFERIES ON GUN CONTROL [FULL TRANSCRIPT]                       1
Length: 310, dtype: int64
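The count above shows one near-1 similarity per show, presumably each show matched with itself. A slightly stricter check, sketched here on a made-up 3x3 matrix (`toy` and its labels are illustrative, not project data), looks at the diagonal and the symmetry directly:

```python
import numpy as np
import pandas as pd

# Toy stand-in for bow_cosims: a labelled, symmetric similarity matrix
labels = ["show a", "show b", "show c"]
toy = pd.DataFrame(
    [[1.0, 0.4, 0.2],
     [0.4, 1.0, 0.7],
     [0.2, 0.7, 1.0]],
    index=labels, columns=labels,
)

# Every show should be perfectly similar to itself ...
diagonal_is_one = np.allclose(np.diag(toy.to_numpy()), 1.0)
# ... and to no other show, so each column has exactly one value above the threshold
one_match_per_column = bool(((toy > 0.99999).sum() == 1).all())
# Cosine similarity is symmetric, so the matrix should equal its transpose
is_symmetric = np.allclose(toy.to_numpy(), toy.to_numpy().T)

print(diagonal_is_one, one_match_per_column, is_symmetric)
```

If two different transcripts were exact duplicates, the column count would exceed 1 for both of them, which is what the threshold check is meant to surface.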

### metadata
Dummify categorical variables and combine metadata variables into one dataframe.

In [5]:
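# one row per (show, company) pair, then sum back to one row per show;
# groupby(level=0).sum() replaces .sum(level=0), which was removed in pandas 2.0
companies_dummy = pd.get_dummies(metascripts['companies'].str.split(",").explode()).groupby(level=0).sum()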
companies_dummy.head(3)

Unnamed: 0,Unnamed: 1,3 Arts Entertainment,A24,Another Mulligan Entertainment,Art & Industry,Attic Bedroom,BBC Comedy,Black Gold Films,Blue Wolf Productions,Brillstein Entertainment Partners,...,The Nacelle Company,Three T's Entertainment,Tiger Aspect Productions,Triage Entertainment,Universal Pictures,Universal Pictures UK,Universal Studios,Weirdass Comedy,What's Wrong with People?,e&a Film Berlin
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
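A minimal sketch of the split/explode/dummify pattern used for the companies column, on made-up data. The company strings here are hypothetical; the point is only to show how the exploded rows collapse back to one row per show.

```python
import pandas as pd

# Hypothetical comma-separated company strings, one entry per show
companies = pd.Series(["A24,BBC Comedy", "A24", "Universal Pictures"])

dummies = (
    pd.get_dummies(companies.str.split(",").explode())  # one row per (show, company) pair
      .astype(int)       # newer pandas returns booleans; cast so the sum is a count
      .groupby(level=0)  # exploded rows keep the original show index
      .sum()             # collapse back to one row per show
)
print(dummies)
```

Note that splitting on `","` alone keeps any leading space in names written as `"A, B"`; stripping whitespace before dummifying may be worth considering if the raw strings are formatted that way.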


In [6]:
ratings_dummy = pd.get_dummies(metascripts['contentRating'])
ratings_dummy.head(3)

Unnamed: 0,Not Rated,PG,R,TV-14,TV-G,TV-MA,TV-PG,Unrated
0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0


In [7]:
meta_num_features = ['year', 'runtimeMins', 'imDbRating', 'imDbRatingVotes']

In [8]:
metadata = pd.concat([metascripts[meta_num_features], companies_dummy, ratings_dummy], axis = 1).fillna(0)

### part-of-speech proportions
These are ready for cosine similarity once the SYM (symbols) column is dropped.

In [9]:
pos_props_df = pos_props_df.drop(columns = 'SYM')

In [10]:
pos_props_df

Unnamed: 0,description,VERB,PRON,INTJ,NOUN,ADV,AUX,ADJ,PART,ADP,DET,SCONJ,CCONJ,PROPN,NUM
0,Jim Gaffigan: Comedy Monster (2021) | Transcript,0.146510,0.184963,0.017147,0.145698,0.062906,0.106230,0.053673,0.037845,0.081981,0.077922,0.024046,0.027902,0.025974,0.007204
1,Louis C. K.: Sorry (2021) | Transcript,0.140020,0.206552,0.016381,0.133737,0.057669,0.106361,0.057556,0.039942,0.076517,0.074274,0.030405,0.029507,0.021317,0.008639
2,Drew Michael: Drew Michael (2018) | Transcript,0.136785,0.216978,0.043468,0.119326,0.058519,0.116918,0.053341,0.040698,0.083925,0.060566,0.027694,0.026851,0.007586,0.007345
3,Drew Michael: Red Blue Green (2021) | Transcript,0.140336,0.199783,0.022306,0.133947,0.055658,0.107959,0.052518,0.045479,0.087168,0.069193,0.026096,0.029020,0.021332,0.009204
4,Mo Amer: Mohammed in Texas (2021) | Transcript,0.134704,0.207911,0.027945,0.133917,0.061694,0.104300,0.051855,0.031388,0.085309,0.065729,0.017810,0.026567,0.041622,0.009151
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305,GEORGE CARLIN: COMPLAINTS AND GRIEVANCES (2001...,0.139356,0.163515,0.011204,0.166433,0.056723,0.079132,0.063025,0.038982,0.092787,0.083217,0.018441,0.032096,0.044118,0.010854
306,GEORGE CARLIN: LIFE IS WORTH LOSING (2006) – T...,0.143898,0.150713,0.008538,0.181654,0.055616,0.074886,0.075748,0.035172,0.089456,0.092355,0.017233,0.037130,0.026007,0.011593
307,George Carlin: It’s Bad For Ya (2008) Full tra...,0.147432,0.165814,0.012886,0.154539,0.057703,0.090203,0.063578,0.039227,0.084707,0.083002,0.021508,0.032405,0.034395,0.012507
308,Dave Chappelle: 8:46 – Transcript,0.140195,0.181085,0.008345,0.139638,0.040890,0.091516,0.053686,0.035883,0.079555,0.088178,0.025591,0.037274,0.063421,0.014465


### point-of-view proportions
These are also ready for cosine similarity.

In [11]:
pov_props_df.head(3)

Unnamed: 0,description,third_person,first_person,second_person
0,Jim Gaffigan: Comedy Monster (2021) | Transcript,0.69504,0.202955,0.102005
1,Louis C. K.: Sorry (2021) | Transcript,0.718351,0.179794,0.101856
2,Drew Michael: Drew Michael (2018) | Transcript,0.614692,0.235545,0.149763


### word and sentence lengths
Select just the relevant word and sentence features from the metascripts dataframe.

In [12]:
ws_features = [
       'mean word length', 'std word length', 'max word length',
       'mean sentence length', 'std sentence length', 'Q1.0 sentence length',
       'Q2.0 sentence length', 'Q3.0 sentence length', 'max sentence length'
    ]

word_sentence_lengths = metascripts[ws_features]

### speed, uniqueness, and repetition

In [13]:
sur_features = [
       'unique words', 'total words', 'proportion unique words',
       'unique words per sentence', 'words per minute', 'sentences per minute'
]

sur = pd.concat([metascripts[sur_features], metascripts_rep['threepeat proportions']], axis = 1)

### profanity
Select just the relevant profanity features from the metascripts dataframe.

In [14]:
profanity_features = ['profane count', 'profane proportion', 'profanity per sentence', 'profanity per minute']
profanity = metascripts[profanity_features]

### sentiment: polarity and subjectivity
Select just the polarity and subjectivity measures from the metascripts dataframe.

In [15]:
sentiment_features = [
        'polarity', 'subjectivity',
       'mean sentence polarity', 'std sentence polarity',
        'Q1.0 sentence polarity', 'Q3.0 sentence polarity', 
        'mean sentence subjectivity', 'std sentence subjectivity',
        'Q2.0 sentence subjectivity', 'Q3.0 sentence subjectivity'
    ]

sentiment = metascripts[sentiment_features]

### collect dataframes into a dictionary

In [16]:
features_dict = {
    'metadata':metadata,
    'pos_props':pos_props_df.drop(columns = 'description'),
    'pov_props':pov_props_df.drop(columns = 'description'),
    'word_sentence_lengths':word_sentence_lengths,
    'profanity':profanity,
    'sentiment':sentiment,
    'sur':sur
}

In [17]:
with open('../data/features_dict.pickle', 'wb') as file:
    pickle.dump(features_dict, file)

## cosine similarity

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
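def get_scaled_similarities(features_df):
    """Standardize each feature, then return show-by-show cosine similarities.

    Relies on the global `metascripts` for row and column labels, so
    `features_df` must have one row per show, in the same order as `metascripts`.
    """
    scaled = StandardScaler().fit_transform(features_df)
    cosims = cosine_similarity(scaled)
    return pd.DataFrame(index = metascripts['description'],
                        columns = metascripts['description'],
                        data = cosims)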
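To see why the features are standardized before taking cosine similarity, here is a small sketch on made-up numbers (the two column names are borrowed from the feature lists above, the values are invented). Without scaling, the large-magnitude column dominates and every pair looks almost identical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales
toy = pd.DataFrame(
    {"total words": [9000, 9500, 12000], "profane proportion": [0.01, 0.08, 0.02]},
    index=["show a", "show b", "show c"],
)

raw = cosine_similarity(toy)                                     # dominated by 'total words'
scaled = cosine_similarity(StandardScaler().fit_transform(toy))  # each column: mean 0, std 1

print(np.round(raw, 4))
print(np.round(scaled, 4))
```

After standardizing, the vectors are centered on the average show, so similarities can be negative: two shows that sit on opposite sides of the average point in opposite directions.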

### run get_scaled_similarities on each dataframe

In [20]:
feature_cosims_dict = {key: get_scaled_similarities(df) for key, df in tqdm(features_dict.items())}
feature_cosims_dict['bow'] = bow_cosims

In [21]:
for key, value in feature_cosims_dict.items():
    print(key, value.shape, type(value))

metadata (310, 310) <class 'pandas.core.frame.DataFrame'>
pos_props (310, 310) <class 'pandas.core.frame.DataFrame'>
pov_props (310, 310) <class 'pandas.core.frame.DataFrame'>
word_sentence_lengths (310, 310) <class 'pandas.core.frame.DataFrame'>
profanity (310, 310) <class 'pandas.core.frame.DataFrame'>
sentiment (310, 310) <class 'pandas.core.frame.DataFrame'>
sur (310, 310) <class 'pandas.core.frame.DataFrame'>
bow (310, 310) <class 'pandas.core.frame.DataFrame'>


In [22]:
with open('../data/feature_cosims_dict.pickle', 'wb') as file:
    pickle.dump(feature_cosims_dict, file)

In [23]:
# checking the calculations below on (i,j) = (0,1)
np.mean([df.to_numpy()[0,1] for df in feature_cosims_dict.values()])

-0.1549427416275974

In [24]:
cosims_means = np.mean([df.to_numpy() for df in feature_cosims_dict.values()], axis = 0)
cosims_means_df = pd.DataFrame(index = metascripts['description'],
                         columns = metascripts['description'],
                         data = cosims_means)
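The mean above gives every feature group the same weight. If unequal weights are wanted, one possible approach is `np.average` with a `weights` argument. This is a sketch only: the two tiny matrices and the weight values are invented for illustration and are not the project's choices.

```python
import numpy as np

# Two made-up 2x2 similarity matrices standing in for entries of feature_cosims_dict
toy_cosims = {
    "bow": np.array([[1.0, 0.8], [0.8, 1.0]]),
    "profanity": np.array([[1.0, -0.4], [-0.4, 1.0]]),
}
# Hypothetical weights: count the bag-of-words similarity three times as much
weights = {"bow": 3.0, "profanity": 1.0}

keys = list(toy_cosims)
stacked = np.stack([toy_cosims[k] for k in keys])  # shape (n_groups, n_shows, n_shows)
unweighted = stacked.mean(axis=0)                  # same result as the np.mean call above
weighted = np.average(stacked, axis=0, weights=[weights[k] for k in keys])

print(unweighted[0, 1], weighted[0, 1])
```

Because `np.average` normalizes by the sum of the weights, the diagonal stays at 1 and the result remains on the same scale as the individual similarity matrices.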

### make cosims_means_df long

In [25]:
cosims_means_df_long = (cosims_means_df.melt(var_name = 'other description', 
                                            value_name = 'cosine similarity', 
                                            ignore_index = False)
                                      .reset_index()
                       )
#cosims_means_df_long[cosims_means_df_long['cosine similarity'] != 1]
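The wide-to-long reshape can be seen on a made-up 2x2 matrix. With `ignore_index = False`, `melt` keeps the row labels as the index, and `reset_index` then turns them into the `description` column.

```python
import pandas as pd

# Toy wide matrix: rows and columns share the same named labels, as in cosims_means_df
labels = pd.Index(["show a", "show b"], name="description")
wide = pd.DataFrame([[1.0, 0.3], [0.3, 1.0]], index=labels, columns=labels)

long = (
    wide.melt(var_name="other description",
              value_name="cosine similarity",
              ignore_index=False)  # keep the row labels instead of a fresh 0..n-1 index
        .reset_index()             # the index named 'description' becomes a column
)
print(long)
```

An n x n matrix becomes n² rows, which is why 310 shows produce the 96,100-row tables later in the notebook.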

In [26]:
widget_df = (cosims_means_df_long
                .merge(metascripts[['description', 'fullTitle', 'artist']])
                .merge(metascripts, 
                       left_on = 'other description', 
                       right_on = 'description')
                    .rename(columns = {'description_x': 'description',
                                       'fullTitle_x': 'fullTitle',
                                       'fullTitle_y': 'other fullTitle',
                                       'artist_x': 'artist',
                                       'artist_y': 'other artist'})
                    .drop(columns = 'description_y')
        )

In [27]:
widget_df['image html'] = widget_df['image'].apply(lambda x: f"<img src='{x}' width='100px'>")

In [28]:
from ipywidgets import interact
import ipywidgets
from IPython.display import Image, HTML
from random import randrange

def show_df(NewArtistsOnly, Show):
    # rows comparing the selected show against every show (including itself)
    filtered = widget_df[widget_df['fullTitle'] == Show]
    if NewArtistsOnly:
        # keep other artists' shows, plus the selection itself (cosine similarity of ~1)
        filtered = filtered[
            (filtered['artist'] != filtered['other artist'])
            | (filtered['cosine similarity'] >= 0.9999999)
        ]
    # the selection itself plus its five most similar shows, then the least similar show
    tops = filtered.nlargest(6, 'cosine similarity')
    bottom = filtered.nsmallest(1, 'cosine similarity')
    topbottom = pd.concat([tops, bottom]).reset_index()
    output_df = topbottom[['title', 'year', 'image html', 'cosine similarity']].T
    output_df.columns = ['Selection', 'Recommendation 1', 'Recommendation 2', 'Recommendation 3', 'Recommendation 4', 'Recommendation 5', 'Something Different']
    output_df.index = output_df.index.str.title()
    return HTML(output_df.to_html(escape=False))

def what_should_i_watch():
    unique_show_titles = widget_df['fullTitle'].sort_values().unique()
    random_index = randrange(len(unique_show_titles))
    starting_show = unique_show_titles[random_index]
    show_selection_combobox = ipywidgets.Combobox(
                                value=starting_show,
                                placeholder='Start typing a show or artist',
                                options=list(unique_show_titles),
                                description='Select a Show',
                                ensure_option=True,
                                disabled=False
                            )
#     show_selection_widget = ipywidgets.Dropdown(
#                                     options = unique_show_titles,
#                                     value = starting_show,
#                                     description = 'Select a Show',
#                                     disabled = False,
#                                 )
    new_artists_filter_widget = ipywidgets.Checkbox(
                                    value = False,
                                    description ='Show Only Other Artists?',
                                    disabled =False,
                                )
    return interact(show_df, NewArtistsOnly = new_artists_filter_widget, Show = show_selection_combobox)

In [29]:
what_should_i_watch()

interactive(children=(Checkbox(value=False, description='Show Only Other Artists?'), Combobox(value='Kathleen …

<function __main__.show_df(NewArtistsOnly, Show)>

In [30]:
# check widget output against dataframe
widget_df[widget_df['fullTitle'] == 'Nate Bargatze: The Tennessee Kid (2019)'].sort_values('cosine similarity', ascending = False).iloc[[0,1,2,-1], 0:3]

Unnamed: 0,description,other description,cosine similarity
32033,Nate Bargatze: The Tennessee Kid (2019) – Full...,Nate Bargatze: The Tennessee Kid (2019) – Full...,1.0
46913,Nate Bargatze: The Tennessee Kid (2019) – Full...,NIKKI GLASER: PERFECT (2016) – Full Transcript,0.804317
4753,Nate Bargatze: The Tennessee Kid (2019) – Full...,Nate Bargatze: The Greatest Average American (...,0.790999
80703,Nate Bargatze: The Tennessee Kid (2019) – Full...,BILL HICKS: REVELATIONS (1993) – FULL TRANSCRIPT,-0.561622
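The same lookup can be written as a small standalone function. This sketch uses an invented four-row long table and a hypothetical helper named `recommend`; unlike `show_df`, it drops the self-match by label rather than relying on a near-1 threshold.

```python
import pandas as pd

# Made-up long-format similarities for a single selected show
toy_long = pd.DataFrame({
    "description": ["show a"] * 4,
    "other description": ["show a", "show b", "show c", "show d"],
    "cosine similarity": [1.0, 0.8, 0.5, -0.6],
})

def recommend(long_df, selection, n=2):
    """Return the n most similar shows and the single least similar one."""
    rows = long_df[(long_df["description"] == selection)
                   & (long_df["other description"] != selection)]  # drop the self-match
    top = rows.nlargest(n, "cosine similarity")["other description"].tolist()
    least = rows.nsmallest(1, "cosine similarity")["other description"].iloc[0]
    return top, least

top, least = recommend(toy_long, "show a")
print(top, least)
```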


## make SQLite DB

In [31]:
metacols = ['description', 'link', 'script characters', 'id',
           'artist', 'title', 'fullTitle', 'year', 'image', 'releaseDate',
           'runtimeMins', 'runtimeStr', 'awards', 'genres',
           'companies', 'contentRating', 'imDbRating',
           'imDbRatingVotes', 'similars', 'languages']
cosims_means_df_long.columns

Index(['description', 'other description', 'cosine similarity'], dtype='object')

In [32]:
widget_df_export = (cosims_means_df_long.rename(columns = {'other description': 'description comparison'})
                .merge(metascripts[metacols].drop(columns = 'similars'))
                .merge(metascripts[metacols].drop(columns = 'similars'), 
                       left_on = 'description comparison', 
                       right_on = 'description',
                       suffixes = (" selection", " comparison"))
        )

widget_df_export.shape

(96100, 40)

In [33]:
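# the second merge leaves two columns named 'description comparison';
# drop the one at position 1 by position, since dropping by name would remove both
keepind = [ind for ind, col in enumerate(widget_df_export.columns) if ind != 1]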
widget_df_export = widget_df_export.iloc[:, keepind]
widget_df_export.shape

(96100, 39)

In [34]:
widget_df_export.columns

Index(['description selection', 'cosine similarity', 'link selection',
       'script characters selection', 'id selection', 'artist selection',
       'title selection', 'fullTitle selection', 'year selection',
       'image selection', 'releaseDate selection', 'runtimeMins selection',
       'runtimeStr selection', 'awards selection', 'genres selection',
       'companies selection', 'contentRating selection',
       'imDbRating selection', 'imDbRatingVotes selection',
       'languages selection', 'description comparison', 'link comparison',
       'script characters comparison', 'id comparison', 'artist comparison',
       'title comparison', 'fullTitle comparison', 'year comparison',
       'image comparison', 'releaseDate comparison', 'runtimeMins comparison',
       'runtimeStr comparison', 'awards comparison', 'genres comparison',
       'companies comparison', 'contentRating comparison',
       'imDbRating comparison', 'imDbRatingVotes comparison',
       'languages comparison'],
      dtype='object')

In [35]:
# import sqlite3

# with sqlite3.connect('../StandupRecommenderShiny/data/overall_recommender_df_2.sqlite') as db:
#     widget_df_export.to_sql('overallfull', db, if_exists = 'append', index = False)
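For reference, a round trip through SQLite can be tried without touching the project database. The sketch below uses an in-memory database and a made-up two-row frame; the table name `overallfull` is taken from the commented-out cell, everything else is illustrative.

```python
import sqlite3
import pandas as pd

# Made-up rows shaped like a slice of widget_df_export
toy_export = pd.DataFrame({
    "description selection": ["show a", "show a"],
    "description comparison": ["show a", "show b"],
    "cosine similarity": [1.0, 0.8],
})

# ':memory:' keeps the sketch self-contained; the notebook writes to a file instead
db = sqlite3.connect(":memory:")
try:
    # if_exists='replace' avoids duplicating rows when the cell is re-run
    toy_export.to_sql("overallfull", db, if_exists="replace", index=False)
    # column names with spaces need double quotes in SQL
    round_trip = pd.read_sql(
        'SELECT * FROM overallfull WHERE "cosine similarity" < 0.99999', db
    )
finally:
    db.close()

print(round_trip)
```

Two things worth knowing: `if_exists = 'append'` adds a second copy of every row each time the cell runs, and `with sqlite3.connect(...)` commits the transaction on exit but does not close the connection.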

# appendix
The cells below are not used in the project.

### original cosine similarity code

In [36]:
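from collections import Counter

# `scripts`, `descriptions`, and `nlp` (a loaded spaCy pipeline) are not defined in
# this notebook; this cell assumes they are already in scope.

def chunker(seq, size):
    # assumed implementation (the original helper is not in this notebook):
    # yield consecutive slices of `seq` with at most `size` items each
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def token_filter(token):
    # keep only tokens that are not punctuation, whitespace, or spaCy stop words
    return not (token.is_punct or token.is_space or token.is_stop)

filtered_tokens = []
for scripts_subset in tqdm(chunker(scripts, 2), total = int(np.ceil(len(scripts)/2))):
    for doc in nlp.pipe(scripts_subset):
        tokens = [token.lemma_.lower() for token in doc if token_filter(token)]
        filtered_tokens.append(tokens)
        
# second pass with the NLTK stop word list
tokens_no_sw = [[token for token in tokenized_script if token not in sw] for tokenized_script in filtered_tokens]

# one row per show, one column per token, values are token counts
scripts_counters = {description: Counter(tokenized_script) for description, tokenized_script in zip(descriptions, tokens_no_sw)}
scripts_df = pd.DataFrame.from_dict(scripts_counters, orient = 'index').fillna(0)

cosims = cosine_similarity(scripts_df)
cosims_df = (pd.DataFrame(index=scripts_df.index, 
                          columns=scripts_df.index, 
                          data = cosims)
                .melt(var_name='other_show', 
                      value_name='cosine_similarity', 
                      ignore_index=False)
                .reset_index()
                .rename(columns = {'index':'show'})
        )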

### render image or thumbnail

In [None]:
from IPython.display import Image, HTML

#HTML(metascripts.loc[:5,['image']].to_html(escape = False))
Image(url= f"{metascripts.loc[0,'image']}")

### kmeans clustering

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from jupyterthemes import jtplot
jtplot.style()

In [None]:
inertias = []
sil_scores = []
max_clusters = 60
for n_clusters in tqdm(range(2, max_clusters+1)):
    
    pipeline = Pipeline(
        steps = [
            ('scaler', StandardScaler()),
            ('cluster', KMeans(n_clusters = n_clusters))
        ]
    )

    pipeline.fit(metascripts[features])
    inertias.append(pipeline['cluster'].inertia_)

    
    features_scaled = pipeline['scaler'].transform(metascripts[features])
    features_clusters = pipeline.predict(metascripts[features])
    sil_score = silhouette_score(features_scaled, features_clusters)
    sil_scores.append(sil_score)
        
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20,6))

ax1.plot(range(2, max_clusters + 1), inertias)
ax1.scatter(range(2, max_clusters + 1), inertias, s = 100)
ax1.set_xlabel("number of clusters")
ax1.set_ylabel("inertia")
ax1.set_title("inertia by kmeans cluster size")

ax2.plot(range(2, max_clusters + 1), sil_scores)
ax2.scatter(range(2, max_clusters + 1), sil_scores, s = 100)
ax2.set_xlabel("number of clusters")
ax2.set_ylabel("silhouette score")
ax2.set_title("silhouette score by kmeans cluster size");
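Since `features` is not defined in this notebook, the loop above cannot be run as is. The sketch below shows the same idea, scoring several values of k by silhouette, on synthetic blobs generated with `make_blobs`; the data, the range of k, and the fixed `random_state` are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three tight, well-separated blobs
X, _ = make_blobs(n_samples=150,
                  centers=[(-5, -5), (0, 5), (5, -5)],
                  cluster_std=0.5,
                  random_state=0)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)  # higher means tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Inertia always decreases as k grows, so it can only suggest an elbow; the silhouette score has a maximum, which makes it easier to read off a candidate k when the clusters are real.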

In [None]:
n_clusters = 39
    
pipeline = Pipeline(
    steps = [
        ('scaler', StandardScaler()),
        ('cluster', KMeans(n_clusters = n_clusters))
    ]
)

pipeline.fit(metascripts[features])
inertias.append(pipeline['cluster'].inertia_)


features_scaled = pipeline['scaler'].transform(metascripts[features])
features_clusters = pipeline.predict(metascripts[features])
sil_score = silhouette_score(features_scaled, features_clusters)
sil_scores.append(sil_score)

i = 13
j = 19

plt.figure(figsize = (10,6))
sns.scatterplot(data = metascripts[features],
               x = features[i],
               y = features[j],
               hue = pipeline[1].labels_)

sns.scatterplot(x = pipeline['scaler'].inverse_transform(pipeline['cluster'].cluster_centers_)[:,i],
                y = pipeline['scaler'].inverse_transform(pipeline['cluster'].cluster_centers_)[:,j],
                s = 500, 
                hue = list(range(n_clusters)), 
                marker = 'D',
                legend = False);

## run Louvain algorithm on cosine similarities

In [None]:
import community as community_louvain
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import networkx as nx

In [None]:
graph_df = (widget_df_export.loc[widget_df_export['cosine similarity'] < 0.99999, ['description selection', 'description comparison', 'cosine similarity']]
                            .sort_values(['cosine similarity', 'description comparison'], ascending = False)
                            .drop_duplicates(subset = 'cosine similarity', keep = 'first')
        )

#G = nx.Graph()
G = nx.from_pandas_edgelist(graph_df, 
                        'description selection', 
                        'description comparison', 
                        edge_attr='cosine similarity')

len(G.nodes())

In [None]:
graph_df.to_csv('../data/graph_df.csv', index = False) # for Neo4J

Note that these weights, which include negative values, failed in Python's Louvain implementation but ran in Neo4j's. I'll filter out the negative weights below and then rerun Louvain in Python.
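The filtering idea can be tried on a toy graph first. This sketch swaps in networkx's built-in `louvain_communities` (available in networkx 2.8 and later) rather than the `python-louvain` package imported above, and the edge list is invented: two tight triangles, one weak bridge, and one negative edge that gets dropped.

```python
import networkx as nx
import pandas as pd

# Made-up edge list using the same column names as graph_df
edges = pd.DataFrame({
    "description selection":  ["a", "a", "b", "d", "d", "e", "c", "a"],
    "description comparison": ["b", "c", "c", "e", "f", "f", "d", "f"],
    "cosine similarity":      [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, -0.5],
})

# Keep only edges with positive weight before building the graph
positive = edges[edges["cosine similarity"] > 0]
G = nx.from_pandas_edgelist(positive,
                            "description selection",
                            "description comparison",
                            edge_attr="cosine similarity")

# seed fixes the node visiting order so the result is repeatable
communities = nx.community.louvain_communities(G, weight="cosine similarity", seed=0)
print(communities)
```

Dropping negative edges discards the information that two shows are dissimilar. An alternative worth considering is rescaling similarities from [-1, 1] to [0, 1], which keeps every pair but makes the graph fully connected.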

## run Louvain without negative weights

In [None]:
graph_df = (widget_df_export.loc[widget_df_export['cosine similarity'].between(0, 0.99999), ['description selection', 'description comparison', 'cosine similarity']]
                            .sort_values(['cosine similarity', 'description comparison'], ascending = False)
                            .drop_duplicates(subset = 'cosine similarity', keep = 'first')
        )

#G = nx.Graph()
G = nx.from_pandas_edgelist(graph_df, 
                        'description selection', 
                        'description comparison', 
                        edge_attr='cosine similarity')

len(G.nodes())

In [None]:
communities = community_louvain.best_partition(G, weight = 'cosine similarity')

In [None]:
communities_values = [communities.get(node) for node in tqdm(graph_df['description selection'])]
graph_df['community selection'] = communities_values

In [None]:
graph_df['community selection'].value_counts(normalize = True)

In [None]:
# draw the graph
pos = nx.spring_layout(G)
# color the nodes according to their partition
cmap = plt.get_cmap('viridis', max(communities.values()) + 1)  # cm.get_cmap was removed in matplotlib 3.9
nx.draw_networkx_nodes(G, pos, communities.keys(), node_size=40,
                       cmap=cmap, node_color=list(communities.values()))
nx.draw_networkx_edges(G, pos, alpha=0.5)
plt.show()