# Unit 4 Capstone

#### John A. Fonte


### Instructions

1. Find 100 different entries from at least 10 different authors (articles?)
2. Reserve 25% for test set
3. Cluster vectorized data (go through a few clustering methods)
4. Perform unsupervised feature generation and selection
5. Perform supervised modeling by classifying by author
6. Comment on your 25% holdout group. Did the clusters for the holdout group change dramatically, or were they consistent with the training groups? Is the performance of the model consistent? If not, why?
7. Conclude with which models (clustering or not) work best for classifying texts.


In [1]:
# Imports - combination of NLP and basic imports

from nltk.corpus import stopwords

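import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut link was removed in spaCy 3

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed by the clustering plots further down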

In [40]:
# importing data
data = pd.read_csv('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/seinfeld-chronicles/scripts.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,Character,Dialogue,EpisodeNo,SEID,Season
0,0,JERRY,Do you know what this is all about? Do you kno...,1.0,S01E01,1.0
1,1,JERRY,"(pointing at Georges shirt) See, to me, that b...",1.0,S01E01,1.0
2,2,GEORGE,Are you through?,1.0,S01E01,1.0
3,3,JERRY,"You do of course try on, when you buy?",1.0,S01E01,1.0
4,4,GEORGE,"Yes, it was purple, I liked it, I dont actuall...",1.0,S01E01,1.0


In [6]:
# cleaning up the DataFrame a little bit
data.drop(columns=['Unnamed: 0', 'SEID'], inplace=True)
data.rename(columns={'EpisodeNo': 'Episode'}, inplace=True)

In [7]:
data.Character.value_counts()

JERRY                                                                                                              14786
GEORGE                                                                                                              9708
ELAINE                                                                                                              7983
KRAMER                                                                                                              6664
NEWMAN                                                                                                               640
MORTY                                                                                                                505
HELEN                                                                                                                471
FRANK                                                                                                                436
SUSAN                           

In [14]:
# The counts above include variant labels for the main characters, so some cleaning is needed:
df = data.copy() # keeping original dataset

# assign the result back instead of calling replace(..., inplace=True) on df.Character
df['Character'] = df.Character.replace(r'.*JERRY.*$', 'JERRY', regex=True)
df['Character'] = df.Character.replace(r'.*GEORGE.*$', 'GEORGE', regex=True)
df['Character'] = df.Character.replace(r'.*ELAINE.*$', 'ELAINE', regex=True)
df['Character'] = df.Character.replace(r'.*KRAMER.*$', 'KRAMER', regex=True)

df.Character.value_counts()

JERRY                                                                                                                                                                              15008
GEORGE                                                                                                                                                                              9804
ELAINE                                                                                                                                                                              8092
KRAMER                                                                                                                                                                              6739
NEWMAN                                                                                                                                                                               640
MORTY                                                                      

In [17]:
# the first pass was case-sensitive, so repeat the replacements ignoring case

df['Character'] = df.Character.str.replace(r'.*JERRY.*$', 'JERRY', regex=True, flags=re.IGNORECASE)
df['Character'] = df.Character.str.replace(r'.*GEORGE.*$', 'GEORGE', regex=True, flags=re.IGNORECASE)
df['Character'] = df.Character.str.replace(r'.*ELAINE.*$', 'ELAINE', regex=True, flags=re.IGNORECASE)
df['Character'] = df.Character.str.replace(r'.*KRAMER.*$', 'KRAMER', regex=True, flags=re.IGNORECASE)

df.Character.value_counts()

JERRY                                                       15040
GEORGE                                                       9813
ELAINE                                                       8116
KRAMER                                                       6751
NEWMAN                                                        640
MORTY                                                         505
HELEN                                                         471
FRANK                                                         436
SUSAN                                                         379
[Setting                                                      293
ESTELLE                                                       286
PETERMAN                                                      191
PUDDY                                                         162
WOMAN                                                         157
MAN                                                           143
JACK      
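
The patterns above match a name anywhere in the label, so a label that mentions two characters is assigned to whichever pattern runs first. Below is a minimal sketch of a stricter alternative that only looks at the start of the label. The sample labels and the `normalize_speaker` helper are hypothetical; whether labels like these actually occur in scripts.csv has not been checked.

```python
import re
import pandas as pd

# Hypothetical raw labels imitating messy speaker names.
raw = pd.Series(['JERRY', 'Jerry', '[Jerry', 'GEORGE (to Jerry)', 'Elaine', 'NEWMAN'])

MAIN_CAST = ['JERRY', 'GEORGE', 'ELAINE', 'KRAMER']

def normalize_speaker(label):
    """Map a label to the main-cast name it starts with, ignoring case."""
    # Drop leading non-letters such as an opening bracket.
    stripped = re.sub(r'^[^A-Za-z]+', '', str(label))
    for name in MAIN_CAST:
        if stripped.upper().startswith(name):
            return name
    # Anything else is kept, just trimmed and upper-cased.
    return str(label).strip().upper()

clean = raw.map(normalize_speaker)
print(clean.tolist())
```

With this version 'GEORGE (to Jerry)' stays with GEORGE, whereas a case-insensitive `.*JERRY.*$` pattern applied first would relabel it as JERRY.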

In [41]:
# keep the ten speakers with the most lines, skipping the '[Setting' label
main_characters = ['JERRY', 'GEORGE', 'ELAINE', 'KRAMER', 'NEWMAN',
                   'MORTY', 'HELEN', 'FRANK', 'SUSAN', 'ESTELLE']
df = df[df.Character.isin(main_characters)].copy()

In [None]:
# onehotencoder for character classification
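
A possible way to fill in this placeholder, shown on a hypothetical five-row sample rather than the real DataFrame. scikit-learn classifiers accept string labels directly, so encoding the target is optional; `LabelEncoder` gives integer codes and `pd.get_dummies` gives a one-hot layout if a model needs one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Character column.
characters = pd.Series(['JERRY', 'GEORGE', 'ELAINE', 'JERRY', 'KRAMER'])

# Integer codes: classes are sorted alphabetically before numbering.
encoder = LabelEncoder()
codes = encoder.fit_transform(characters)

# One-hot layout: one indicator column per character.
one_hot = pd.get_dummies(characters)

print(list(encoder.classes_))
print(codes.tolist())
print(one_hot.shape)
```

`encoder.inverse_transform(codes)` recovers the original names when predictions need to be reported by character.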

In [61]:
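# df.Character holds names (strings), so `df.Character > 200` raised
# TypeError: '>' not supported between instances of 'str' and 'int'.
# Presumably the goal is to keep characters with more than 200 lines,
# so compare each character's line count and filter with a boolean mask.
line_counts = df.Character.value_counts()
df_test = df[df.Character.map(line_counts) > 200]
df_test.head()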

In [42]:
df.head()

Unnamed: 0,Character,Dialogue,Episode,Season
0,JERRY,Do you know what this is all about? Do you kno...,1.0,1.0
1,JERRY,"(pointing at Georges shirt) See, to me, that b...",1.0,1.0
2,GEORGE,Are you through?,1.0,1.0
3,JERRY,"You do of course try on, when you buy?",1.0,1.0
4,GEORGE,"Yes, it was purple, I liked it, I dont actuall...",1.0,1.0


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42437 entries, 0 to 54615
Data columns (total 4 columns):
Character    42437 non-null object
Dialogue     42434 non-null object
Episode      42437 non-null float64
Season       42437 non-null float64
dtypes: float64(2), object(2)
memory usage: 1.6+ MB


In [45]:
# resizing column widths
pd.set_option('display.max_colwidth', 150)
df.head()

Unnamed: 0,Character,Dialogue,Episode,Season
0,JERRY,"Do you know what this is all about? Do you know, why were here? To be out, this is out...and out is one of the single most enjoyable experiences o...",1.0,1.0
1,JERRY,"(pointing at Georges shirt) See, to me, that button is in the worst possible spot. The second button literally makes or breaks the shirt, look at ...",1.0,1.0
2,GEORGE,Are you through?,1.0,1.0
3,JERRY,"You do of course try on, when you buy?",1.0,1.0
4,GEORGE,"Yes, it was purple, I liked it, I dont actually recall considering the buttons.",1.0,1.0


In [76]:
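# creating text cleaner function because spaCy is finicky

def text_cleaner(text):
    # Visual inspection identifies a form of punctuation ('--') that spaCy does not recognize
    text = str(text)  # note: a missing value (NaN) becomes the string 'nan'
    text = re.sub(r'--', ' ', text)
    # drop anything inside square brackets
    text = re.sub(r'\[.*?\]', ' ', text)
    # collapse runs of whitespace into single spaces
    text = ' '.join(text.split())

    return text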
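
A quick check of what the cleaner does and does not change, using made-up strings. The function is repeated here so the snippet runs on its own. Two behaviours are worth noticing: parenthesised stage directions are left in place, and a missing value is turned into the literal text 'nan'.

```python
import re

def text_cleaner(text):
    text = str(text)
    text = re.sub(r'--', ' ', text)        # double hyphen used as a dash
    text = re.sub(r'\[.*?\]', ' ', text)   # bracketed text
    return ' '.join(text.split())          # collapse whitespace

sample = "Well--  [laughs] that's   a shame."
print(text_cleaner(sample))                 # brackets and extra spaces are gone
print(text_cleaner('(pointing) Hi'))        # parentheses are untouched
print(repr(text_cleaner(float('nan'))))     # NaN becomes the string 'nan'
```

Since `df.info()` reports a few null Dialogue rows, it may be worth dropping them with `dropna(subset=['Dialogue'])` before cleaning so that 'nan' does not end up as a vocabulary term.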

In [80]:
df.Character.value_counts()

JERRY      15040
GEORGE      9813
ELAINE      8116
KRAMER      6751
NEWMAN       640
MORTY        505
HELEN        471
FRANK        436
SUSAN        379
ESTELLE      286
Name: Character, dtype: int64

In [69]:
type(df['Dialogue'][1])

str

In [71]:
samplequote = df['Dialogue'][1]
text_cleaner(samplequote)

'(pointing at Georges shirt) See, to me, that button is in the worst possible spot. The second button literally makes or breaks the shirt, look at it. Its too high! Its in no-mans-land. You look like you live with your mother.'

In [77]:
df['Dialogue'] = df['Dialogue'].apply(text_cleaner)

In [78]:
df.describe()

Unnamed: 0,Episode,Season
count,42437.0,42437.0
mean,11.159531,5.604944
std,6.732658,2.260228
min,1.0,1.0
25%,5.0,4.0
50%,11.0,6.0
75%,17.0,8.0
max,24.0,9.0


In [None]:
# define X and y
X = df.Dialogue
y = df.Character

In [None]:
# vectorize data using tfidf
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.65,         # drop words that are too common (i.e., found in more than 65% of the lines)
                             min_df=2,            # only use words that appear in at least two lines
                             stop_words='english',# takes out stopwords - very helpful!
                             lowercase=True,      # done intentionally to ignore capitalization differences
                             use_idf=True,        # weight terms by inverse document frequency
                             norm='l2',           # scale each row to unit L2 length (normalization, not regularization)
                             smooth_idf=True      # adds 1 to document frequencies, which prevents divide-by-zero in the idf
                            )
# Applying the vectorizer
# Fit once on the whole Dialogue column so every line shares one vocabulary
# (one row per line of dialogue). Strings have no `.words` attribute, and
# refitting per row would give each line its own vocabulary.
X_tfidf = vectorizer.fit_transform(X)
print(X_tfidf.shape)
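
To make the `max_df`, `min_df` and `norm` settings concrete, here is a small sketch on four made-up lines. The documents and the name `toy_vectorizer` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Four hypothetical lines of dialogue, each treated as one document.
docs = [
    "the shirt button is too high",
    "the button makes the shirt",
    "the soup was cold",
    "the soup tasted great",
]

toy_vectorizer = TfidfVectorizer(max_df=0.65, min_df=2, stop_words='english')
tfidf = toy_vectorizer.fit_transform(docs)

# Only terms found in at least two documents survive min_df=2.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# norm='l2' (the default) rescales every row to unit length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

Words that occur in only one line are dropped by `min_df=2`, and "the" would be dropped by `max_df` even if it were not already a stop word, because it appears in every document.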

In [None]:
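# split the data with 25% holdout, then turn each split into numeric features

from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
text_train, text_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# The clustering cells below index X_train as a 2-D numeric array, so vectorize
# (fitting on the training text only) and reduce to two components here.
svd = TruncatedSVD(n_components=2, random_state=1)
X_train = svd.fit_transform(vectorizer.fit_transform(text_train))
X_test = svd.transform(vectorizer.transform(text_test))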
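
Because the characters are very unevenly represented (JERRY has about 15,000 lines, ESTELLE fewer than 300), a plain random split can leave the rare speakers under- or over-represented in the holdout. A hedged sketch of a stratified split on made-up data follows. The `toy` frame is hypothetical, and stratifying is a suggestion rather than something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced sample: 32 lines for one speaker, 8 for another.
toy = pd.DataFrame({
    'Dialogue': [f'line {i}' for i in range(40)],
    'Character': ['JERRY'] * 32 + ['ESTELLE'] * 8,
})

# stratify keeps each character's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.Dialogue, toy.Character,
    test_size=0.25, random_state=1, stratify=toy.Character)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```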

### Clustering

In [None]:
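from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

# X_norm was never defined; build it from the tf-idf matrix of every line
X_norm = normalize(vectorizer.fit_transform(X))
# cutting the large vectorized data down to two dimensions; TruncatedSVD is
# swapped in for PCA because it works directly on sparse tf-idf matrices
X_pca = TruncatedSVD(n_components=2, random_state=42).fit_transform(X_norm)
y_pred = KMeans(n_clusters=2, random_state=42).fit_predict(X_pca)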

# Plot the solution.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred)
plt.show()

# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred, y)) # pandas version of confusion_matrix
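
With speakers of very different sizes, raw counts in the crosstab are dominated by the largest character. One option is to normalize each column so every character's lines sum to 1. The sketch below uses made-up cluster ids and speakers.

```python
import pandas as pd

# Hypothetical cluster assignments and speakers for six lines.
clusters = pd.Series([0, 0, 0, 1, 0, 1], name='cluster')
speakers = pd.Series(['JERRY', 'JERRY', 'JERRY', 'JERRY', 'NEWMAN', 'NEWMAN'],
                     name='Character')

# normalize='columns' turns counts into per-character shares.
shares = pd.crosstab(clusters, speakers, normalize='columns')
print(shares)
```

If every column shows roughly the same shares, the clusters are not separating the characters, whatever the raw counts look like.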

In [None]:
minibatchkmeans = MiniBatchKMeans(
    init='random',
    n_clusters=2,
    batch_size=200)
minibatchkmeans.fit(X_pca)

# Add the new predicted cluster memberships to the data frame.
predict_mini = minibatchkmeans.predict(X_pca)

# Check the MiniBatch model against our earlier one.
print('Comparing k-means and mini batch k-means solutions:')
print(pd.crosstab(predict_mini, y_pred))
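
Cluster ids are arbitrary, so two solutions can agree perfectly while the crosstab shows all the mass off the diagonal. The adjusted Rand index from `sklearn.metrics` scores agreement regardless of how clusters are numbered. A small illustration on made-up label arrays:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_a = np.array([0, 0, 0, 1, 1, 1])
labels_b = np.array([1, 1, 1, 0, 0, 0])   # same grouping, ids swapped
labels_c = np.array([0, 1, 0, 1, 0, 1])   # a different grouping

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(ari_same)   # identical partitions score 1.0
print(ari_diff)   # unrelated partitions score near or below 0
```

Applied here, `adjusted_rand_score(y_pred, predict_mini)` would summarize the k-means versus mini-batch comparison in a single number.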

In [None]:
from sklearn.cluster import MeanShift, estimate_bandwidth

# Here we set the bandwidth. This function automatically derives a bandwidth
# number based on an inspection of the distances among points in the data.
bandwidth = estimate_bandwidth(X_train, quantile=0.2, n_samples=500)

# Declare and fit the model.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X_train)

# Extract cluster assignments for each data point.
labels = ms.labels_

# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters: {}".format(n_clusters_))

plt.scatter(X_train[:, 0], X_train[:, 1], c=labels)
plt.show()

print('Comparing the assigned categories to the ones in the data:')
print(pd.crosstab(y_train,labels))
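
The `quantile` passed to `estimate_bandwidth` largely decides how many clusters mean shift reports: a larger quantile gives a wider bandwidth and therefore fewer, broader clusters. A rough sketch on synthetic blobs, which stand in for the real features and are purely illustrative:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups in two dimensions.
points, _ = make_blobs(n_samples=300,
                       centers=[[-5, -5], [0, 0], [5, 5]],
                       cluster_std=0.6, random_state=0)

counts = {}
for q in (0.1, 0.3, 0.9):
    bw = estimate_bandwidth(points, quantile=q, random_state=0)
    ms_toy = MeanShift(bandwidth=bw, bin_seeding=True).fit(points)
    counts[q] = len(np.unique(ms_toy.labels_))

print(counts)   # number of clusters found for each quantile
```

Trying a few quantiles like this before committing to 0.2 shows how sensitive the cluster count is to that one setting.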

In [None]:
from sklearn.cluster import AffinityPropagation
from sklearn import metrics

# Declare the model and fit it in one statement.
# Note that you can provide arguments to the model, but we didn't.
af = AffinityPropagation().fit(X_train)
print('Done')

# Pull the number of clusters and cluster assignments for each data point.
cluster_centers_indices = af.cluster_centers_indices_
n_clusters_ = len(cluster_centers_indices)
labels = af.labels_

print('Estimated number of clusters: {}'.format(n_clusters_))
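
Affinity propagation builds a dense similarity matrix with one entry per pair of points, so a training set of roughly 32,000 rows means about a billion entries. One workaround, sketched here on synthetic data, is to fit on a random subsample and assign the remaining points with `predict`. The sample size of 300 is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic stand-in for the reduced feature matrix.
features, _ = make_blobs(n_samples=2000, centers=4, random_state=0)

# Fit on a random subsample to keep the pairwise matrix small.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(features), size=300, replace=False)
af_small = AffinityPropagation(random_state=0).fit(features[sample_idx])

# Assign every point to the nearest exemplar found on the subsample.
all_labels = af_small.predict(features)

print(len(af_small.cluster_centers_indices_), 'exemplars')
print(all_labels.shape)
```

If the fit does not converge, scikit-learn warns and labels every point -1, so the exemplar count is worth checking before using the labels.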

In [None]:
from itertools import cycle

plt.figure(1)
plt.clf()

# Cycle through each cluster and graph them with a center point for the
# exemplar and lines from the exemplar to each data point in the cluster.
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X_train[cluster_centers_indices[k]]
    plt.plot(X_train[class_members, 0], X_train[class_members, 1], col + '.')
    plt.plot(cluster_center[0],
             cluster_center[1],
             'o',
             markerfacecolor=col,
             markeredgecolor='k')
    for x in X_train[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: {}'.format(n_clusters_))
plt.show()

In [None]:
# In each of the above graphs, also include code for graphing X_test
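
Alongside plotting X_test, the holdout question can also be checked numerically: fit the clusterer on the training points only, assign the holdout points with `predict`, and compare the share of points per cluster. A minimal sketch on synthetic 2-D data, which stands in for the reduced dialogue features:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 2-D features (hypothetical data).
points, _ = make_blobs(n_samples=400, centers=2, random_state=0)
train_pts, test_pts = train_test_split(points, test_size=0.25, random_state=1)

# Fit on the training split only, then assign the holdout points.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train_pts)
test_labels = km.predict(test_pts)

# Share of points falling into each cluster, per split.
shares = pd.DataFrame({
    'train': pd.Series(km.labels_).value_counts(normalize=True),
    'test': pd.Series(test_labels).value_counts(normalize=True),
}).fillna(0.0).sort_index()

print(shares)
```

Similar shares in the two columns suggest the holdout falls into the same structure as the training data; a large gap would be worth investigating before trusting the clusters as features.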

In [4]:
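from nltk.corpus import gutenberg

# `paradise` is not defined in any cell shown here; load it from the NLTK Gutenberg corpus
paradise = gutenberg.raw('milton-paradise.txt')
paradise_doc = nlp(paradise)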
paradise_sents = [[sent, "Milton"] for sent in paradise_doc.sents]

In [5]:
paradise_sents[:5]

[[[Paradise Lost by John Milton 1667] 
   
   , 'Milton'], [Book I 
   
   
  Of Man's first disobedience, and the fruit 
  Of that forbidden tree whose mortal taste 
  Brought death into the World, and all our woe, 
  With loss of Eden, till one greater Man 
  Restore us, and regain the blissful seat, 
  Sing,, 'Milton'], [Heavenly Muse, that, on the secret top 
  Of Oreb, or of Sinai, didst inspire 
  That shepherd who first taught the chosen seed 
  In the beginning how the heavens and earth 
  Rose out of Chaos: or, if Sion hill ,
  'Milton'], [Delight thee more, and Siloa's brook that flowed 
  Fast by the oracle of God, 'Milton'], [, I thence 
  Invoke thy aid to my adventurous song, 
  That with no middle flight intends to soar 
  Above th' Aonian mount, while it pursues 
  Things unattempted yet in prose or rhyme. , 'Milton']]

In [12]:
parents = gutenberg.raw('edgeworth-parents.txt')

parents = text_cleaner(parents)

parents_doc = nlp(parents)
parents_sents = [[sent, "Edgeworth"] for sent in parents_doc.sents]

In [17]:
parents_sents = [sent for sent in parents_doc.sents]

In [18]:
parents_sents[:5]

[THE ORPHANS.,
 Near the ruins of the castle of Rossmore, in Ireland, is a small cabin, in which there once lived a widow and her four children.,
 As long as she was able to work, she was very industrious, and was accounted the best spinner in the parish; but she overworked herself at last, and fell ill, so that she could not sit to her wheel as she used to do, and was obliged to give it up to her eldest daughter, Mary.,
 Mary was at this time about twelve years old.,
 One evening she was sitting at the foot of her mother's bed spinning, and her little brothers and sisters were gathered round the fire eating their potatoes and milk for supper.]

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.65,
                             min_df=5,   
                             stop_words='english', 
                             lowercase=True,
                             use_idf=True,  
                             norm=u'l2',    
                             smooth_idf=True
                            )               
#Applying the vectorizer

#parents_sents_tfidf = []

#for sent in parents_sents:
 #   print(sent)
  #  vectorized_sent = vectorizer.fit_transform(sent)
   # parents_sents_tfidf.append(vectorized_sent)

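# TfidfVectorizer expects strings, so pass each sentence's text rather than the spaCy Span.
# (Passing the Spans raised: AttributeError: 'spacy.tokens.span.Span' object has no attribute 'lower')
parents_sents_tfidf = vectorizer.fit_transform([sent.text for sent in parents_doc.sents])
print(parents_sents_tfidf.shape)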
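
Once sentences are available as plain strings with an author label, step 5 of the instructions (classifying by author) could be sketched as a tf-idf plus logistic regression pipeline. The few sentences below are taken from the outputs above; the choice of classifier and its settings are assumptions, not something this notebook has run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of sentences with their authors (far too few for real use).
sentences = [
    "Of Man's first disobedience, and the fruit of that forbidden tree",
    "Brought death into the World, and all our woe",
    "Invoke thy aid to my adventurous song",
    "Things unattempted yet in prose or rhyme",
    "Mary was at this time about twelve years old.",
    "Near the ruins of the castle of Rossmore, in Ireland, is a small cabin",
    "One evening she was sitting at the foot of her mother's bed spinning",
    "She was accounted the best spinner in the parish",
]
authors = ['Milton'] * 4 + ['Edgeworth'] * 4

# The pipeline vectorizes the text and fits the classifier in one call.
model = make_pipeline(TfidfVectorizer(lowercase=True),
                      LogisticRegression(max_iter=1000))
model.fit(sentences, authors)

predicted = model.predict(["She was sitting by the fire with her mother"])
print(predicted[0])
```

Keeping the vectorizer inside the pipeline means that it is fitted only on whatever data `fit` receives, which avoids leaking holdout vocabulary into training.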

In [None]:
# RECURSION NOTES 4/17

# https://www.educative.io/d/data_structures
# https://www.educative.io/collection/5642554087309312/5634727314718720
# https://medium.com/educative/3-month-coding-interview-bootcamp-904422926ce8