# Unit 4 Capstone - News Article Analysis & Classification

## John A. Fonte

---
---

### Instructions

1. Find 100 different entries from at least 10 different authors (articles?)
2. Reserve 25% for test set
3. cluster vectorized data (go through a few clustering methods)
4. Perform unsupervised feature generation and selection
5. Perform supervised modeling by classifying by author
6. Comment on your 25% holdout group. Did the clusters for the holdout group change dramatically, or were they consistent with the training groups? Is the performance of the model consistent? If not, why?
7. Conclude with which models (clustering or not) work best for classifying texts.

---
---

### About the Dataset

__Source:__ https://archive.ics.uci.edu/ml/datasets/Reuter_50_50#

__Description:__ This is a subset of the [Reuters Corpus Volume 1 (RCV1)](https://scikit-learn.org/0.17/datasets/rcv1.html). Specifically, this subset consists of the top 50 authors by article proliferation, with a total of 100 articles per each author within the combined training and testing sets.

---
---
# 1. Data Load and Cleaning

In [1]:
# basic imports
# will be doing other imports ad hoc
##### i.e., models and related functions

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [29]:
'''
Loading Data from Local Computer
Each author is a subfolder, and within each folder is a series of .txt files
The goal of this cell is to load all the contents of every subfolder into the 
DataFrame, while retaining the author designation for those works.
'''

from os import listdir

def multiple_file_load(file_directory):
    
    # identifying all author subfolders - appending them into list 
    
    authorlist = []
    textlist = []
    
    for author in listdir(file_directory):
        authorname = str(author)
        author_sub_directory = (file_directory + '/' + author) #author file path
    
    # identifying all files within each subfolder - 
    
        for filename in listdir(author_sub_directory):
            text_file_path = (author_sub_directory + '/' + filename) # text file path
            
            if (filename.lower().endswith('txt')):
                authorlist.append(authorname)
                textfile = open(text_file_path,'r') # this is how you open files
                substantive_text = textfile.read()  # this is how to read a file
                textlist.append(substantive_text)   # this is how to do something with that file
                textfile.close()                    # this is how to close the file 
                                                             # (you must close one before opening another!)
  # pushing the two lists into a dataframe 

    df = pd.DataFrame({'Author':authorlist, 'Text':textlist})
    
    return df
                

In [30]:
# loading training data (note the file path)
df_train = multiple_file_load('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/C50/C50train')

In [31]:
df_train.head(3)

Unnamed: 0,Author,Text
0,AaronPressman,The Internet may be overflowing with new techn...
1,AaronPressman,The U.S. Postal Service announced Wednesday a ...
2,AaronPressman,Elementary school students with access to the ...


In [32]:
# Authors don't have space between the names
# adding the space in the authors...because I want it
import re
author_split = [re.findall('[A-Z][a-z]*', i) for i in df_train.Author]

In [33]:
#joining them back together
author_join = []

for couple in author_split:
    joined_string = couple[0] + ' ' + couple[1]
    author_join.append(joined_string)    

In [34]:
df_train['Author'] = pd.Series(author_join)
df_train.tail(3)

Unnamed: 0,Author,Text
2497,William Kazer,China issued tough new rules on the handling o...
2498,William Kazer,China will avoid bold moves in tackling its ai...
2499,William Kazer,Communist Party chief Jiang Zemin has put his ...


In [35]:
# Loading df_test, which is a separate csv file

df_test = multiple_file_load('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/C50/C50test')
df_test.head(3)

Unnamed: 0,Author,Text
0,AaronPressman,U.S. Senators on Tuesday sharply criticized a ...
1,AaronPressman,Two members of Congress criticised the Federal...
2,AaronPressman,Commuters stuck in traffic on the Leesburg Pik...


In [36]:
#another fix to Author column

author_split = [re.findall('[A-Z][a-z]*', i) for i in df_test.Author]

author_join = []

for couple in author_split:
    joined_string = couple[0] + ' ' + couple[1]
    author_join.append(joined_string)    
    
df_test['Author'] = pd.Series(author_join)

In [37]:
#Before I begin adding features, assignment asks for 25% data split, NOT 50/50
# See "GOAL" below for explanation as to how I am doing that.

'''GOAL:
Trying to get half of the datapoints OF EACH AUTHOR
in the testing set into a new DataFrame, which
will be concatenated onto the training set.
I will delete that from the testing set later.

Doing this instead of combining both and splitting 75/25 later 
ensures balanced data between the authors.
'''

def appendingdataframe(dataframe):
    appendabledataframe = pd.DataFrame(columns=['Author', 'Text'])
    
    for item in dataframe.Author.unique():
        df_testauthor = df_test[df_test['Author'] == item].copy() 
        appendabledataframe = appendabledataframe.append(df_testauthor[25:], 
                                                         ignore_index=True) # want half of df_testauthor!
    
    return appendabledataframe
    

In [38]:
# using appendabledataframe to avoid screwing up original data
# This is explicit inefficiency at the cost of being cautious

df_train2 = df_train.append(appendingdataframe(df_train), ignore_index=True)

# checking if the appending worked
len(df_train2)

Looping through  Aaron Pressman
Looping through  Alan Crosby
Looping through  Alexander Smith
Looping through  Benjamin Kang
Looping through  Bernard Hickey
Looping through  Brad Dorfman
Looping through  Darren Schuettler
Looping through  David Lawder
Looping through  Edna Fernandes
Looping through  Eric Auchard
Looping through  Fumiko Fujisaki
Looping through  Graham Earnshaw
Looping through  Heather Scoffield
Looping through  Jane Macartney
Looping through  Jan Lopatka
Looping through  Jim Gilchrist
Looping through  Joe Ortiz
Looping through  John Mastrini
Looping through  Jonathan Birt
Looping through  Jo Winterbottom
Looping through  Karl Penhaul
Looping through  Keith Weir
Looping through  Kevin Drawbaugh
Looping through  Kevin Morrison
Looping through  Kirstin Ridley
Looping through  Kourosh Karimkhany
Looping through  Lydia Zajc
Looping through  Lynne O
Looping through  Lynnley Browning
Looping through  Marcel Michelson
Looping through  Mark Bendeich
Looping through  Martin Wolk

3750

In [39]:
# It worked!
df_train = df_train2.copy()

In [40]:
# doing same for df_test

df_test2 = df_test.append(appendingdataframe(df_train), ignore_index=True)

# checking if the appending worked
len(df_test2)

Looping through  Aaron Pressman
Looping through  Alan Crosby
Looping through  Alexander Smith
Looping through  Benjamin Kang
Looping through  Bernard Hickey
Looping through  Brad Dorfman
Looping through  Darren Schuettler
Looping through  David Lawder
Looping through  Edna Fernandes
Looping through  Eric Auchard
Looping through  Fumiko Fujisaki
Looping through  Graham Earnshaw
Looping through  Heather Scoffield
Looping through  Jane Macartney
Looping through  Jan Lopatka
Looping through  Jim Gilchrist
Looping through  Joe Ortiz
Looping through  John Mastrini
Looping through  Jonathan Birt
Looping through  Jo Winterbottom
Looping through  Karl Penhaul
Looping through  Keith Weir
Looping through  Kevin Drawbaugh
Looping through  Kevin Morrison
Looping through  Kirstin Ridley
Looping through  Kourosh Karimkhany
Looping through  Lydia Zajc
Looping through  Lynne O
Looping through  Lynnley Browning
Looping through  Marcel Michelson
Looping through  Mark Bendeich
Looping through  Martin Wolk

3750

In [41]:
# and now to drop the rows added to df_train from df_test

df_test2.drop_duplicates(keep=False, inplace=True)
len(df_test2)

1250

In [42]:
df_test = df_test2.copy()

# Text Cleaning

Arguably the most important part about working with text data is how to refine it for processing. As simple as they are, string substitutions such as `pd.replace` and regex's `re.sub` are common. I am also partial to `pd.Series.apply(lambda x: x.replace('...',''))`.  Additional text processing such as the exclusion of stop_words and lemmatization will be done after the raw text is pre-processed.

In [178]:
# cleaning text before feature analysis/engineering 

#--------------------------------------------------------------------

# CLEANING FUNCTION 1 - WORD AND PUNCTUATION/CHARACTER CLEANING

# EDIT: During first run-through, this function was very basic
# I have since implemented new cleaning features

# regex already imported as re

def text_cleaner(text):
    text = text.lower() # avoiding capitalization problems.
    
    text = re.sub(r'.\s*\\n[a-z]', r'\. [a-z]', text)
    text = re.sub(r'\.\s?([a-z])', r'\. \1', text)
    text = re.sub(r' u\. s\.(\s?)', r' u\.s\.\1', text) # next three lines are my attempt to join 'u. s.' to 'u.s.'
    text = text.replace('u. s.', 'u.s.')
    text = text.replace(r'u. s.', r'u.s.')
    text = re.sub(r'-', '', text)
    text = re.sub(r'  ', ' ', text)
    text = re.sub('[\[].*?[\]]', '', text)
    text = re.sub('.=.', '. .', text)
    text = text.replace('\\', '')
    text = re.sub(',', '', text) # I don't want punct screwing up lemmatization
    text = re.sub('\\n', '', text)
    text = re.sub(r'\\n', '', text) # I don't know which one works
    text = text.replace('\"', '')
    
    # rest of punctuation will be handled via lemmatization
    
    return text

In [179]:
# application of text cleaning functions

df_train['Text1'] = df_train['Text'].apply(lambda x: text_cleaner(x))
df_test['Text1'] = df_test['Text'].apply(lambda x: text_cleaner(x))

In [180]:
df_train['Text1'][1]

"the u. s. postal service announced wednesday a plan to boost online commerce by enhancing the security and reliability of electronic mail traveling on the internet. under the plan businesses and consumers can verify that email has not been tampered with and use services now available for ordinary mail like sending a certified letter.the leap from trading messages to buying and selling goods has been blocked by the fear of security threats robert reisner vice president of stategic planning said. to expand from local area networks and bilateral secure communications to wide use of electronic commerce will require a new generation of security services reisner said. cylink corp is developing a system for the post office to use to verify the identity of email senders. the system will enable people to register a digital signature with the post office that can be compared against electronic mail they send. if any tampering is discovered the postal service would investigate just like it inves

In [215]:
dd = ['rr']
ddd = []

dddd = dd + ddd
dddd

['rr']

In [196]:
dddd = 'here is abelly'
print(dddd.remove('el'))

AttributeError: 'str' object has no attribute 'remove'

In [220]:
# CLEANING FUNCTION 2 - NUMBER CLEANING
'''
Whether it be phone numbers, page numbers, or just 
digits for no particular reason (which, yes, does happen),
numbers will become there own vectors and inevitably clog
up the vectorized feature space.

I am dropping all number columns because 
we are looking at words, not numbers!!!!!
'''

def phone_and_weird_num_deletion(text):
    final_text = text
    
    leading_zero_numbers = re.findall(r' 0\d+/g', text)
    phone_numbers = re.findall(r' ?\+?\d{1}?(\d{3}?|\d{4}?) \d{3} \d{4}/g', text)
    phone_num_no_space = re.findall(r' ?\+?(\d{10}?|\d{11}?)/g', text)
    total_deletions = leading_zero_numbers + phone_numbers + phone_num_no_space
    
    if len(total_deletions) != 0:
        final_text = total_deletions.apply(lambda x: final_text.replace(x, ''))
        return final_text
    
    else:
        return text
                               
#----------------------------------------------
        
# DON'T USE THE BELOW FUNCTION!
# It appears there are a lot of good numbers there

#-----------------------------------------------

#def string_num_deletion(text):
#    final_text = text
#    
#    dates = [str(x) for x in list(range(1900, 2026))]
#    plural_dates = re.findall(r' (\d{2}?|\d{4}?)s/g', text)
#    money_values = re.findall(r' \$\d+/g', text)
#    common_num = list(range(1000))
#    total_exceptions = dates + plural_dates + money_values + common_num
#                              
#    
#    num_token = re.findall(r'\d+[^snrt]?[^tdh]?\.*\d*/g', text) # finding all numbers that are not ordinals
#                                                                # also using "." for decimal findings
#    if len(num_token) != 0:
#        for x in num_token not in total_exceptions:
#            final_text = num_token.apply(lambda x: final_text.replace(x, ''))
#        
#        return final_text
#    
#    else:
#        return text

In [221]:
df_train['Text2'] = df_train['Text1'].apply(lambda x: phone_and_weird_num_deletion(x))
df_test['Text2'] = df_test['Text1'].apply(lambda x: phone_and_weird_num_deletion(x))

# The list comprehension inside a loop makes this run for a total of 10seconds

---

# Adding Features

Just some fun numerical features!

In [26]:
# adding some numerical features for text analysis

df_train['Raw Character Count'] = df_train['Text'].apply(lambda x: len(x))
df_train['Raw Word Count'] = df_train['Text'].apply(lambda x: len(x.split()))

TypeError: object of type 'NoneType' has no len()

In [27]:
# doing same for df_test

df_test['Raw Character Count'] = df_test['Text'].apply(lambda x: len(x))
df_test['Raw Word Count'] = df_test['Text'].apply(lambda x: len(x.split()))

TypeError: object of type 'NoneType' has no len()

In [None]:
# creating numerical classes for authors:
# I feel like one hot encoding would've screwed things up, so I did "factorize"

df_train['AuthorNum'] = pd.factorize(df_train.Author)[0]
df_train['AuthorNum'] = df_train['AuthorNum'].astype("category")

In [None]:
df_train.tail()

In [None]:
# and same for df_test...

df_test['AuthorNum'] = pd.factorize(df_test.Author)[0]
df_test['AuthorNum'] = df_test['AuthorNum'].astype("category")

In [None]:
df_test.tail()

# Vectorizing! Changing Text to Numbers

Once everything is numerical, then we can feed that data into the clusters for analysis.

In [None]:
# doing tokenization (unigram), lemmatization, and stop_word exclusion
# ***AS PART**** of the model, not a separate spacy thing

import spacy
nlp = spacy.load('en')

In [None]:
df_train['Spacy-ed Text'] = df_train['Text'].apply(lambda text: nlp(text))
df_test['Spacy-ed Text'] = df_test['Text'].apply(lambda text: nlp(text))

In [None]:
# setting up function to (1) lemmatize AND
# (2) exclude stop words from the count

from collections import Counter

def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_) # this is why we needed to spacy/nlp-ify the texts first
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)

df_train['Meaningful Word Count'] = df_train['Spacy-ed Text'].apply(lambda text: lemma_frequencies(text, 
                                                                                                   include_stop=False))
df_test['Meaningful Word Count'] = df_test['Spacy-ed Text'].apply(lambda text: lemma_frequencies(text,
                                                                                                include_stop=False))

In [None]:
# didn't want a dictionary. I wanted a total:

df_train['Meaningful Word Count Total'] = df_train['Meaningful Word Count'].apply(lambda x: sum(list(x.values())))
df_test['Meaningful Word Count Total'] = df_test['Meaningful Word Count'].apply(lambda x: sum(list(x.values())))

In [None]:
def lemmatize(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_) # this is why we needed to spacy/nlp-ify the texts first
            
    # Build and return a Counter object containing word counts.
    return lemmas

In [None]:
df_train['Lemmatized Text'] = df_train['Spacy-ed Text'].apply(lambda text: lemmatize(text, include_stop=False))
df_test['Lemmatized Text'] = df_test['Spacy-ed Text'].apply(lambda text: lemmatize(text, include_stop=False))

In [None]:
df_test.head()

In [None]:
# all processing is complete, saving it as a csv so I don't have to do it again


# commented out because this cell should only be run once

#df_train.to_csv('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/COMPLETE_NLP-train.csv',
#                index=False)
#
#df_test.to_csv('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/COMPLETE_NLP-test.csv',
#               index=False)

In [None]:
# creating output for vectors, i.e., 
# a NEW dataframe from a vectorized "word matrix"
def word_matrix_2_df(word_matrix, feat_names):
    
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(word_matrix)]
    
    df = pd.DataFrame(data=word_matrix.toarray(), index=doc_names,
                      columns=feat_names)
    return(df)

In [None]:
# now that tokens are created via lemmatization, we can vectorize those tokens

# FIRST VECTORIZER - BAG-OF-WORDS!!!!!!!!!!!

from sklearn.feature_extraction.text import CountVectorizer

def dummy(doc):
    return doc

cv = CountVectorizer(
    tokenizer=dummy, # putting lemmatizer function in here would've thrown an error
    preprocessor=dummy,
)  

In [None]:
cwm = cv.fit_transform(list(df_train['Lemmatized Text']))
tokens = cv.get_feature_names()

df_train_vectorized = word_matrix_2_df(cwm, tokens)

In [None]:
df_train_vectorized

In [None]:
df_train_vectorized.shape

In [None]:
#-------------------------------------------------------------------------------------------
# NEW VECTORIZER - Tfidf!!!!!!!!
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(analyzer=lambda x: x, ngram_range=(1,1))
tfidf_vectors = tfidf.fit_transform(df_train['Lemmatized Text'])

# save this commented-out thing for later?
# df_word_matrix_TRAIN = pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.vocabulary_)

# feature_names the same from previous vectorizer, so feature_names=tokens again
# (this is my way of saying that tfidf_vectors.get_feature_names throws an AttributeError)

df_train_tfidf_vectorized = word_matrix_2_df(tfidf_vectors, tokens)

In [None]:
df_train_tfidf_vectorized.iloc[:,100-250].value_counts() # highly sparse data again

# CLUSTERING ANALYSIS

Looking at which vectorized features are the biggest influencers.

Here, we will be using *x\_train\_tfidf\_vectorized* as the input data. Tfidf is a measure of frequency, which means the proportions of the raw counts (found in CountVectorized()) are the same among datapoints, but the frequency character of the dataset naturally allows scaling that accounts for any particular outliers, which would otherwise negatively affect clustering.

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

In [None]:
y = df_train['AuthorNum']
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
scaler = StandardScaler()

X_pca = PCA(2).fit_transform(df_train_tfidf_vectorized) # gigantic feature space

# PCA spreads data out (as a result of lower dimensionality, don't ask)
# Normalizing/standardscaling is just needed for this - otherwise, it will throw a ValueError otherwise.

kmeans = KMeans(n_clusters=5, random_state=42)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans).fit(X_pca)

# Calculate the cluster labels: THIS IS THE OUTPUT YOU ARE LOOKING AT!!!!!!!!!!!!********
labels = pipeline.predict(X_pca)

# Create a DataFrame with cluster labels and Classes as columns: df
dfkmeans = pd.DataFrame({'Cluster Label':labels, 'Author':y})

# Create crosstab: ct
ct = pd.crosstab(dfkmeans['Cluster Label'], dfkmeans['Author'])

# Plotting kMeans
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')

## Calculate predicted values.
#y_pred = KMeans(n_clusters=5, random_state=42).fit_predict(X_pca)

# Plot the solution.
#plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred)
plt.show()

# Check the solution against the crosstab data.
print('Comparing k-means clusters against the data:')
ct

In [None]:
# problem with above is that while cluster points scaled and transformed
# cluster centers refused to do so

# don't need scaling or normalization because all tfidf values are between 0 and 1 anyway

#------------------------------------------------------------------------------------

# K-Means, Round 2

X_pca = PCA(2).fit_transform(df_train_tfidf_vectorized) 

kmeans = KMeans(n_clusters=13, n_init=50, max_iter=500000, tol=0.000001)

# Create pipeline: pipeline
#from sklearn.preprocessing import Normalizer
#normalizer=Normalizer()

#pipeline = make_pipeline(normalizer, kmeans).fit_transform(X_pca)
kmeans.fit(X_pca)
labels = kmeans.predict(X_pca) # same as kmeans.labels_ --- I just did it this way


dfkmeans = pd.DataFrame({'Cluster Label':labels, 'Author':y})
ct = pd.crosstab(dfkmeans['Cluster Label'], dfkmeans['Author'])

cluster_centers = kmeans.cluster_centers_
xcenters = cluster_centers[:,0]
ycenters = cluster_centers[:,1]
xcenlist = list(xcenters)
ycenlist = list(ycenters)

legendlabels = range(0,13)

centerdf = pd.DataFrame({'Xcenter':xcenters, 'Ycenter':ycenters,
                        'Labels':[str(label) for label in legendlabels]})

# Plotting kMeans
plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], c=labels)
plt.legend(centerdf.Labels)

# Plotting kMeans centers
plt.scatter(x=xcenters, y=ycenters,
            c='red', marker='x')
plt.show()

for i in range(0, len(xcenlist)):
    print('Cluster {} has center coordinates of: ({}, {})'.format(str(legendlabels[i]), 
                                                                  round(xcenlist[i], 2), 
                                                                  round(ycenlist[i], 2)))

ct

__K-Means Analysis:__

The space in the tfidf is very compact, making separation highly difficult. This is expected when you try to reduce a high-dimensional space via PCA; flattening down thousands of dimensions down to `PCA(2)` overcrowds the data, exemplifying the __curse of dimensionality.__  Originally, a large number of clusters (here, K=13) was hypothesized to provide accurate classification information without overfitting and overcome this curse of dimensionality. However, as the crosstab results show, some of the authors' classifications are spread more or less evenly across multiple clusters. This means the model was unable to use the PCA-ed features to classify those particular authors effectively.

So how can we maximize the accuracy of the model? What _is_ the accuracy of a clustering model? This is where an _Elbow Visualizer_ and the _Silhouette score_ come in, respectively.

A silhouette score for the above model is shown below:

In [None]:
# importing silhouette score

from sklearn.metrics import silhouette_score 
silhouette_score(X_pca, labels) # silhouette score for k=13 clusters

### What does this Silhouette Score Mean?

A Silhouette value is a distance measure of a sample's distance to its cluster's center (a) and its distance to the center of the nearest cluster that the sameple is not a part of (b). A __Silhouette Coefficient__ is the average of all of these values for a given cluster. [As used in sklearn.metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html), a __Silhouette Score__ is the average of all Silhouette Coefficients across all clusters.

Silhouette Coefficients (and by nature of the metric, Silhouette Scores too) range from (-1,1), with "1" being the best score. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

In short, a silhouette score is a __measure of how accurate the clustering is.__

---

__Analysis:__ As you can see, our Silhouette Score for k=13 clusters of 36.5% is not particularly comforting. How do we determine the best number of clusters to obtain the highest accuracy for our dataset?  This is where the __K-Elbow Visualizer__ comes in.

The "Elbow Method" is a simple iterative method that measures the accuracies for a range of k-sized clustering.  The point of inflection (the "elbow" of the graph) is usually the indicator that the model fits best with that point (in this case, that the model fits best with that particular k-number of clusters).

In [None]:
# Not a good percentage! 

# Import the KElbowVisualizer method 
from yellowbrick.cluster import KElbowVisualizer

# Instantiate a scikit-learn K-Means model 
# really cranking on these iterations, 
# because I have yet to see any overfitting or significant computational overload...
model = KMeans(n_init=500, max_iter=500000, tol=0.000001, random_state=42)

# Instantiate the KElbowVisualizer with the number of clusters and the metric 
visualizer = KElbowVisualizer(model, k=(2,21), metric='silhouette', timings=True)

# Fit the data and visualize 

plt.figure(figsize=(20,5))
visualizer.fit(X_pca) 
visualizer.poof() 

In [None]:
# judging by the graph, silhouette score for K=2 is ~58%.
# welp, that settles it! going back and doing a final K-Means with K=2

#------------------------------------------------------------------------------------

# K-Means, Round 3

X_pca = PCA(2).fit_transform(df_train_tfidf_vectorized) 

kmeans = KMeans(n_clusters=2, n_init=500, max_iter=500000, tol=0.000001, random_state=42)
kmeans.fit(X_pca)
labels = kmeans.predict(X_pca)

dfkmeans = pd.DataFrame({'Cluster Label':labels, 'Author':y})
ct = pd.crosstab(dfkmeans['Cluster Label'], dfkmeans['Author'])

cluster_centers = kmeans.cluster_centers_
xcenters = cluster_centers[:,0]
ycenters = cluster_centers[:,1]
xcenlist = list(xcenters)
ycenlist = list(ycenters)

legendlabels = range(0,2)

centerdf = pd.DataFrame({'Xcenter':xcenters, 'Ycenter':ycenters,
                        'Labels':[str(label) for label in legendlabels]})

#------------------------------------------------------------------------

#matplotlib thought "black and white" were good plotting colors for no reason, so that's neat...

color_list = []
for value in labels:
    if value == 0:
        color_list.append('orange')
    else:
        color_list.append('blue')

# Plotting kMeans
plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], c=color_list) 

# Plotting kMeans centers
plt.scatter(x=xcenters, y=ycenters,
            c='red', marker='x')
plt.show()

for i in range(0, len(xcenlist)):
    print('Cluster {} has center coordinates of: ({}, {})'.format(str(legendlabels[i]), 
                                                                  round(xcenlist[i], 2), 
                                                                  round(ycenlist[i], 2)))

ct

In [None]:
df_crosstab_analysis = pd.DataFrame({'Author':df_train['Author'].unique()})

cluster_fraction_list = []
cluster_final_list = []

for i in range(0,50):
    cluster0_count = ct[i].iloc[0]
    cluster1_count = ct[i].iloc[1]
    cluster_total = 75
    
    if cluster0_count > cluster1_count:
        cluster0avg = cluster0_count / cluster_total
        cluster_fraction_list.append(cluster0avg)
        cluster_final_list.append(0)
        
    else:
        cluster1avg = cluster1_count / cluster_total
        cluster_fraction_list.append(cluster1avg)
        cluster_final_list.append(1)
        

df_crosstab_analysis['Cluster No. Classification'] = pd.Series(cluster_final_list)
df_crosstab_analysis['Percent of Author in the Cluster Classification'] = pd.Series(cluster_fraction_list)
df_crosstab_analysis = df_crosstab_analysis.sort_values(by=['Cluster No. Classification',
                                                           'Percent of Author in the Cluster Classification'],
                                                        ascending=False)

# just did this manually
print('The percentage of Authors found to be in Cluster 1 more than in Cluster 0 is 20% (10/50 total Authors).')

df_crosstab_analysis

__Analysis:__ 

Again, only 20% of the `df_train` dataset can be characterized as falling into Cluster 1. More so, out of that 20%, only two authors (37 - Peter Humphrey and 45 - Tan Ee) were classified in Cluster 1 by more than 75%. These results come from K=2, which has the highest silhouette score of ~58%...

---
---

# So what now?

Obtaining the highest silhouette score of ~58% leaves us at a point where there is an expected upper-limit to model accuracy moving forward. Did we do something wrong?

I propose yes. Yes we did.

Thinkful's Capstone Project instructions appear to place the cart before the horse. According to the instructions (which are listed in the beginning of this project): "The _first technique_ is to create a series of clusters. ... _Next_, perform some unsupervised feature generation and selection using techniques covered in this unit and elsewhere in the course."

It is a fact that dimensionality reduction and proper scaling are _required_ for clustering. While not synonymous, dimensionality reduction goes hand in hand with feature selection and generation; the dimensionality reduction algorithm is created "new" composite features as projections. From there, we can cluster and perform unsupervised learning.

---
---

# Table of Contents for Remainder of Project

1. Re-vectorize data using tfidf again, but with different (better) hyperparameters.


2. Use gensim's `word2vec` to perform unsupervised learning on the dataset (this model has dimensionality reduction built-in as a hyperparameter).


3. Reduce dimensionality using the more appropriate dimensionality reduction algorithm, _Latent Semantic Analysis (LSA)_, for the remainder of unsupervised learning analysis.
    - retain original `df_train` for aggregate number features. This will be separately clustered and analyzed.
    
    
4. Apply clustering models:
    - K-Means (again)
    - Spectral Clustering
    - t-SNE


5. Perform any feature selection, testing whether to include aggregate number features in the vectorized DataFrame.


6. Build and fit supervised classification models.


7. Test fitted models on vectorized and dimensionality-reduced testing data.


8. Apply test data to previously used clustering models.
    - Compare and contrast the results with those of the training data.

In [None]:
# STEP 0: Load the data
# I have since restarted the kernel
# luckily, I saved my text analysis df's

df_train = pd.read_csv('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/COMPLETE_NLP-train.csv')
df_test = pd.read_csv('D:/Github/Data-Science-Bootcamp/CAPSTONE - Unsupervised Learning/COMPLETE_NLP-test.csv')

df_train.head(3)

In [None]:
# not so lucky, it appears that python doesn't know what to do with lemmatized text as that is a spacy datatype
# gotta re-clean?...

# STEP 1: DATA CLEANING
# wanted to do something more comprehensive than the above text cleaner function

df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace('\'', ''))
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace('[', ''))
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace(']', ''))
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace(',', ''))
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace(r'\\n', ''))
df_train['Lemmatized Text'][1]

In [None]:
import re

df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace('.', '')) # tried to do re
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace('-', ' '))# but didn't work...
df_train['Lemmatized Text'] = df_train['Lemmatized Text'].apply(lambda x: x.replace('\\n', ''))
df_train['Lemmatized Text'][1]
                                                                

In [None]:
# it's...fine.
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace('\'', ''))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace('[', ''))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace(']', ''))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace(',', ''))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace('.', ''))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace('-', ' '))
df_test['Lemmatized Text'] = df_test['Lemmatized Text'].apply(lambda x: x.replace('\\n', ''))

In [None]:
# STEP 2 - RE-VECTORIZING TRAINING DATA

# Instantiate vectorizer model
# input here for demonstration purposes - already imported
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(analyzer='word',
                        ngram_range=(1,1),      # ensuring unigram nature
                        min_df=2,               # only use words that appear at least twice******
                        stop_words='english',   # stop words and lowercase were already done 
                        lowercase=True,         # but it doesn't hurt to do a second time
                        use_idf=True,           # um yeah, that's what we're doing...
                        norm=u'l2',             # Applies a correction factor for imbalanced sized tokens
                        smooth_idf=True         # smooth_idf adds 1 to all document frequencies, 
                       )                        # as if an extra document existed that used every word once.  
                                                # Prevents divide-by-zero errors
    
tfidf_vectors = tfidf.fit_transform(df_train['Lemmatized Text'])
#dense = tfidf_vectors.todense() ----- for a feature space of this size
#denselist = dense.tolist()     ------ fixing sparsity will crash/run out of memory

token_names = tfidf.get_feature_names()

# creating dataframe from vectors (a separate process)
vectorized_df_train = pd.DataFrame(tfidf_vectors.toarray(), columns=token_names)

In [None]:
vectorized_df_train.head(3)

In [None]:
# DATAFRAME CLEANING
'''
This vectorized dataframe is extremely sparse.
Trying to do tfidf_vectors.todense() threw a memory error. Not risking that.
Dropping all number columns because we are looking at words, not numbers.
'''
import re

numbertokens = []
dates = [str(x) for x in list(range(1900, 2026))]

for name in token_names:
    num_token = re.findall(r'\d+[^t][^h]', name)
    num_not_date = [x for x in num_token if x not in dates]
    if len(num_not_date) != 0:
        numbertokens.append(num_not_date)
        
print(len(numbertokens))
numbertokens

In [None]:
# numbertokens is list of lists. Need to undo that
from itertools import chain

numbertokens2 = list(chain.from_iterable(numbertokens))
numbertokens2[:5]

In [None]:
numbertokens2[25:100]

In [None]:
print(len(numbertokens2))
print(len(list(set(numbertokens2))))

In [None]:
# checking for duplicates, there are none
print(len(token_names))
print(len(set(token_names)))

In [None]:
# TRYING to drop all numerical tokens

for item in numbertokens2 and token_names:
    vectorized_df_train.drop(columns=[item], axis=1, inplace=True)


In [None]:
# I give up, just manually doing it:
vectorized_df_train = vectorized_df_train.iloc[:,1808:].copy()

In [None]:
vectorized_df_train.shape

# Feature Analysis and Dimensionality Reduction via LSA

Latent Semantic Analysis (LSA) is a dimensionality reduction method that measures the angles of the vectors via a cosine algorithm. This cosine algorithm essentially creates a normal range between 0 and 1.

For efficiency purposes, dimensionality reduction is crucial for unsupervised learning methods such as clustering.

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from nearly 20000 to 150
svd= TruncatedSVD(150)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(vectorized_df_train) ###### LSA IS THE NAME OF THE MODEL USED HERE!

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of authors our LSA model considers similar, for the first (not top) ***FIVE*** components
components=pd.DataFrame(X_train_lsa,index=df_train['Author'])
for i in range(5):
    print('\nComponent {}:\n'.format(i))
    print(components.loc[:,i].sort_values(ascending=False)[:5])

In [None]:
components.head(3)

### What does this tell us?

Tells us that no one writes like David Lawder!

It is disheartening, however, to see that the truncation resulted in the retention of only 37% variance. While this is somewhat expected given that we are flattening nearly 20000 features into 150, this is still a significant loss and could play a role in making our future models inaccurate.

In [None]:
# let's see if LSA can tell us 

# running LSA again with 50 components, like 50 clusters:
svd= TruncatedSVD(50)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(vectorized_df_train) ###### LSA IS THE NAME OF THE MODEL USED HERE!

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

components=pd.DataFrame(X_train_lsa,index=df_train['Author'])


In [None]:
columnsort = {}
for column in components.columns:
    columndict = {str(column):components[column].max()}
    columnsort.update(columndict)
    
columnsort

In [None]:
# components 0,1,3,7 had highest numbers:

for i in [0,1,3,7]:
    print('\nComponent {}:\n'.format(i))
    print(components.loc[:,i].sort_values(ascending=False)[:3])

We can use the above to figure out which components best describe a given author.

To do this, we could use `pd.idmax(axis=0)`

# Clustering

Using the above, we have an idea of what kinds of clusterings we expect from authors' writings. Unforntuately, we have to use reduce dimensionality down to 2 if we want to visualize the clusters. SVD is still preferable over PCA because PCA requires a central tendency, and there is no such thing in a sparse vector matrix. On the other hand, PCA retains variance, and we have seen the huge drops in variance explainability with SVD/LSA.

Let's try SVD first.

In [None]:
# re-doing SVD to get 2 dimensions
svd= TruncatedSVD(2)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(vectorized_df_train)

# K-Means clustering
y = df_train['Author']
kmeans = KMeans(n_clusters=5, n_init=500, max_iter=500000, tol=0.000001, random_state=42)
kmeans.fit(X_train_lsa)
labels = kmeans.predict(X_train_lsa)

dfkmeans = pd.DataFrame({'Cluster Label':labels, 'Author':y})
ct = pd.crosstab(dfkmeans['Cluster Label'], dfkmeans['Author'])

cluster_centers = kmeans.cluster_centers_
xcenters = cluster_centers[:,0]
ycenters = cluster_centers[:,1]
xcenlist = list(xcenters)
ycenlist = list(ycenters)

centerdf = pd.DataFrame({'Xcenter':xcenters, 'Ycenter':ycenters})

#------------------------------------------------------------------------


# Plotting kMeans
plt.scatter(x=X_train_lsa[:, 0], y=X_train_lsa[:, 1], c=labels) 

# Plotting kMeans centers
plt.scatter(x=xcenters, y=ycenters,
            c='red', marker='x')
plt.show()

for i in range(0, len(xcenlist)):
    print('Cluster {} has center coordinates of: ({}, {})'.format(i, 
                                                                  round(xcenlist[i], 2), 
                                                                  round(ycenlist[i], 2)))

ct

So as you can see, creating a central tendency for graphing purposes might be better.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
scaler = StandardScaler()

X_pca = PCA(2).fit_transform(vectorized_df_train) 

kmeans = KMeans(n_clusters=5, n_init=500, max_iter=500000, tol=0.000001, random_state=42)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans).fit(X_pca)

# Calculate the cluster labels
labels = pipeline.predict(X_pca)

dfkmeans = pd.DataFrame({'Cluster Label':labels, 'Author':y})
ct = pd.crosstab(dfkmeans['Cluster Label'], dfkmeans['Author'])

cluster_centers = kmeans.cluster_centers_
xcenters = cluster_centers[:,0]
ycenters = cluster_centers[:,1]
xcenlist = list(xcenters)
ycenlist = list(ycenters)

centerdf = pd.DataFrame({'Xcenter':xcenters, 'Ycenter':ycenters})

#------------------------------------------------------------------------


# Plotting kMeans
plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], c=labels) 

plt.show()

for i in range(5):
    print('Cluster {} has center coordinates of: ({}, {})'.format(i, 
                                                                  round(xcenlist[i], 2), 
                                                                  round(ycenlist[i], 2)))

In [None]:
# importing silhouette score

from sklearn.metrics import silhouette_score 
silhouette_score(X_pca, labels) # silhouette score for k=5 clusters, above

In [None]:
# doing one more KElbow analysis
from yellowbrick.cluster import KElbowVisualizer

# Instantiate a scikit-learn K-Means model 
# really cranking on these iterations, 
# because I have yet to see any overfitting or significant computational overload...
model = KMeans(n_init=500, max_iter=500000, tol=0.000001, random_state=42)

# Instantiate the KElbowVisualizer with the number of clusters and the metric 
visualizer = KElbowVisualizer(model, k=(2,21), metric='silhouette', timings=True)

# Fit the data and visualize 

plt.figure(figsize=(20,5))
visualizer.fit(X_pca) 
visualizer.poof() 

# To Do

run mean shift with PCA
run spectral OR affinity with PCA
run t-SNE
---
create feature spaces, incorporating top components from LSA and aggregate numbers from original
DO THE SAME FOR XTEST
run supervised classification models
- lasso/ridge
- knn
- tree model/boosting model
unsupervised modelling on testing dataset
explain classification analyis and compare/contrast models

In [None]:
# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt


# SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so we need to use the 
# normalize() function from sklearn.preprocessing instead of Normalizer.

# Import normalize
from sklearn.preprocessing import normalize
# Normalize the movements: normalized_movements
normalized_movements = normalize(sample_data)
# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')


# Plot the dendrogram, using varieties as labels
dendrogram(mergings,labels=varieties, leaf_rotation=90,leaf_font_size=6)
plt.show()

In [None]:
# playing around with the K ---- checking elbow for best number of clusters