# Parsing and assembling corpus data for the MSDS 453 final project

This notebook reads in text files, parses them into tokens, and stores the tokenized version of each text in a dataframe. We'll also include the vectorized version in the same dataframe, just to make things easier.

It should also check whether a text has already been added and, if so, skip it.
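
A rough sketch of that duplicate check, assuming documents are keyed by name in the dataframe index (as in `corpus_df` below). `add_document` is a hypothetical helper written for illustration, not the notebook's actual implementation.

```python
import pandas as pd

def add_document(corpus_df, name, full_text):
    """Return corpus_df with the document appended, unless `name` is already indexed.

    Hypothetical helper, for illustration only.
    """
    if name in corpus_df.index:
        # already in the corpus, so hand back the dataframe unchanged
        return corpus_df
    new_row = pd.DataFrame({'Full_Text': [full_text]}, index=[name])
    return pd.concat([corpus_df, new_row])

# start with one document, then try to add a new one and a repeat
corpus = pd.DataFrame({'Full_Text': ['some text']}, index=['doc_a'])
corpus = add_document(corpus, 'doc_b', 'other text')
corpus = add_document(corpus, 'doc_a', 'some text again')  # skipped: 'doc_a' exists
```

Keying on the document name keeps the check cheap, but it would miss the same text saved under two different filenames.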


While in the future we could probably build this to run through a web scraper using Selenium or Scrapy, we're just going to have folks select text documents for now.

### Start by importing everything

In [9]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


import pandas as pd # we'll put things into dataframes, then pickle them

from gensim.models import Word2Vec,LdaMulticore, TfidfModel
from gensim import corpora
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.cluster import KMeans

# allow GUI-based document imports
from tkinter import Tk, filedialog
import tkinter as tk

# os tools
from os import path, environ
environ['OMP_NUM_THREADS'] = '1'  # environment variable values must be strings

import utils

In [None]:
# # download stopwords if necessary
# import nltk
# nltk.download('stopwords')

In [2]:
import ipympl

### Choose files, import

In [6]:
corpus_df = utils.parse_corpus_dir('.')

corpus_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Announcement2014_CTA to Zotti2011_Reader
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Cleaned_List  9 non-null      object
 1   Full_Text     9 non-null      object
 2   Publication   9 non-null      object
dtypes: object(3)
memory usage: 288.0+ bytes


### TFIDF processing

Compute tf-idf with scikit-learn's `TfidfVectorizer`.

This is a little more pythonic than the version provided by Paul, but should do all of the same things. Most of the dataframes keep the same names, except where a different name seemed more explanatory.

In [10]:
# build the tfidf vectorizer. ngram_range=(1,3) adds two- and three-word phrases as features alongside single words
Tfidf = TfidfVectorizer(ngram_range=(1,3))

# fit with final processed documents
TFIDF_matrix = Tfidf.fit_transform(corpus_df['Cleaned_List'])
TFIDF_df = pd.DataFrame(TFIDF_matrix.toarray(), columns=Tfidf.get_feature_names_out(), index=corpus_df.index)

# get some summaries
tfidf_mean = np.array(TFIDF_df.mean(axis=0)) # mean tfidf value per term, across documents
tfidf_mean_quarts = np.percentile(tfidf_mean, [25, 75]) # quartiles
cross_doc_outlier = tfidf_mean_quarts[1] + 1.5*(tfidf_mean_quarts[1]-tfidf_mean_quarts[0])

# terms whose mean tfidf exceeds the upper outlier fence (Q3 + 1.5*IQR)
outlier_list = TFIDF_df.columns[tfidf_mean>cross_doc_outlier].to_list()
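
The outlier cutoff above is the usual upper fence, Q3 + 1.5 × IQR, applied to the per-term mean tf-idf values. A minimal sketch on invented numbers (the six values below are hypothetical) shows how it separates one unusually heavy term from the rest:

```python
import numpy as np

# hypothetical mean tf-idf values for six terms
toy_means = np.array([0.01, 0.02, 0.02, 0.03, 0.03, 0.20])

q1, q3 = np.percentile(toy_means, [25, 75])  # first and third quartiles
upper_fence = q3 + 1.5 * (q3 - q1)           # Q3 + 1.5 * IQR

# boolean mask: True where a term's mean is above the fence
is_outlier = toy_means > upper_fence
print(upper_fence, is_outlier)
```

Only the last value clears the fence here; in the notebook the same mask is used to pick column names out of `TFIDF_df`.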

### K-means clustering


In [14]:
titles = list(corpus_df.index)

# heuristic: number of clusters = floor(sqrt(number of documents))
k = int(np.floor(np.sqrt(len(titles))))
km = KMeans(n_clusters=k, random_state=89)
km.fit(TFIDF_df)
clusters_d2v = km.labels_.tolist()  # labels come from the tfidf matrix, despite the name

Dictionary = {'Doc Name': titles, 'Cluster': clusters_d2v, 'Text': corpus_df['Cleaned_List']}
frame = pd.DataFrame(Dictionary, columns=['Cluster', 'Doc Name', 'Text'])

# map each cluster number to a comma-separated string of its document names
cluster_title = frame.groupby('Cluster')['Doc Name'].agg(','.join).to_dict()
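
As an illustration of the two ideas in this cell, the sketch below applies the square-root heuristic for `k` and builds a cluster-to-titles mapping from made-up labels. The document names and labels are hypothetical; no model is fitted here.

```python
import numpy as np
import pandas as pd

# nine hypothetical document names
toy_titles = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e',
              'doc_f', 'doc_g', 'doc_h', 'doc_i']

# square-root heuristic: 9 documents give 3 clusters
k = int(np.floor(np.sqrt(len(toy_titles))))

# hypothetical cluster labels, standing in for km.labels_
toy_labels = [0, 2, 1, 0, 1, 2, 0, 1, 2]
toy_frame = pd.DataFrame({'Doc Name': toy_titles, 'Cluster': toy_labels})

# one comma-separated string of document names per cluster number
toy_cluster_title = toy_frame.groupby('Cluster')['Doc Name'].agg(','.join).to_dict()
print(k, toy_cluster_title)
```

Each value in the mapping can then serve as a legend label for the matching cluster.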




### Now plot the clusters


In [None]:
###############################################################################
### Plotting Doc2vec
###############################################################################
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# project onto two components, since we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)

# NOTE: doc2vec_df is not built anywhere in this notebook; it has to exist before this cell runs
dist = 1 - cosine_similarity(doc2vec_df)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

xs, ys = pos[:, 0], pos[:, 1]


#set up colors per cluster using a dict.  the dict needs at least k entries
cluster_colors = {0: 'black', 1: 'grey', 2: 'blue', 3: 'rosybrown', 4: 'firebrick', 
                  5:'red', 6:'darksalmon', 7:'sienna'}


#set up cluster names using a dict.  
cluster_dict=cluster_title         

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters_d2v, title=titles))

#group by cluster
groups = df.groupby('label')

fig, ax = plt.subplots() # figure size is set further down with set_size_inches
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling

#iterate through groups to layer the plot
#note that I use the cluster_dict and cluster_colors dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12,
            label=cluster_dict[name], color=cluster_colors[name], 
            mec='none')
    # ax.set_aspect('auto')
    # ax.tick_params(\
    #     axis= 'x',          # changes apply to the x-axis
    #     which='both',      # both major and minor ticks are affected
    #     bottom='off',      # ticks along the bottom edge are off
    #     top='off',         # ticks along the top edge are off
    #     labelbottom='on')
    # ax.tick_params(\
    #     axis= 'y',         # changes apply to the y-axis
    #     which='both',      # both major and minor ticks are affected
    #     left='off',      # ticks along the bottom edge are off
    #     right='off',         # ticks along the top edge are off
    #     labelleft='on')

for spine in ax.spines:
    ax.spines[spine].set_visible(False)


ax.legend(bbox_to_anchor=(1,0.5))      #show legend with only 1 point

fig.set_size_inches([9,9])

fig.savefig('Doc2Vec.png')
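
For reference, the distance-then-MDS step can be tried in isolation. This is a sketch under stated assumptions: the four toy vectors are invented, and the symmetrize/clip lines are an extra precaution against floating-point noise rather than something the cell above does.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# four made-up document vectors: two near the first axis, two near the second
toy_vectors = np.array([[1.0, 0.0, 0.0],
                        [0.9, 0.1, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.9, 0.1]])

# cosine distance = 1 - cosine similarity
toy_dist = 1 - cosine_similarity(toy_vectors)
toy_dist = (toy_dist + toy_dist.T) / 2   # enforce exact symmetry
toy_dist = np.clip(toy_dist, 0, None)    # remove tiny negative values
np.fill_diagonal(toy_dist, 0)            # a document is at distance 0 from itself

toy_mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
toy_pos = toy_mds.fit_transform(toy_dist)  # one (x, y) row per document

toy_xs, toy_ys = toy_pos[:, 0], toy_pos[:, 1]
print(toy_pos.shape)
```

The result has one row per document and one column per plotted dimension, which is why the columns are sliced out as x and y coordinates.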


In [None]:
path.pathsep

In [None]:
path.sep
