# Parsing and assembling corpus data for the MSDS 453 final project

This code takes care of bringing in text files, parsing them into tokens, and storing the tokenized version of each text together in a dataframe. We'll also include the vectorized version in the same dataframe, just to make things easier.

It should also check whether a text has already been added; if it has, the text is simply skipped.
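The skip-if-already-added check can be sketched as follows. This is a minimal illustration, not the code in `utils`: the `add_document` helper, the whitespace tokenizer, and the sample document names are all hypothetical, and only the `Cleaned_List` and `Full_Text` column names come from this notebook.

```python
import pandas as pd

def add_document(records: dict, name: str, text: str) -> bool:
    """Store one document under `name`; return False if it was already added.

    Hypothetical helper: the real parsing lives in utils and may differ.
    """
    if name in records:
        return False  # already in the corpus, so ignore it
    records[name] = {
        "Cleaned_List": text.lower().split(),  # placeholder tokenizer
        "Full_Text": text,
    }
    return True

records = {}
added = [
    add_document(records, "doc_a", "Transit riders want faster buses"),
    add_document(records, "doc_a", "A second copy that should be skipped"),
    add_document(records, "doc_b", "Rail service expands downtown"),
]

# Build the dataframe once, with document names as the index
corpus = pd.DataFrame.from_dict(records, orient="index")
print(added)
print(corpus)
```

Collecting rows in a dict keyed by document name makes the duplicate check a simple membership test, and the dataframe is built once at the end instead of growing row by row.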


While in the future we could probably build this to run through a web scraper using Selenium or Scrapy, we're just going to have folks select text documents for now.

### Start by importing everything

In [1]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


import pandas as pd # we'll put things into dataframes, then pickle them

from gensim.models import Word2Vec, LdaMulticore, TfidfModel
from gensim import corpora
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.cluster import KMeans

# allow GUI-based document imports
from tkinter import Tk, filedialog
import tkinter as tk

# os tools
from os import path, environ
environ['OMP_NUM_THREADS'] = "1"

import utils

In [None]:
# # download stopwords if necessary
# import nltk
# nltk.download('stopwords')

In [2]:
import ipympl

### Choose files, import

In [3]:
corpus_df = utils.parse_corpus_dir('.')

corpus_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Announcement2014_CTA to Zotti2011_Reader
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Cleaned_List  9 non-null      object
 1   Full_Text     9 non-null      object
 2   Publication   9 non-null      object
dtypes: object(3)
memory usage: 288.0+ bytes


### TFIDF processing

TF-IDF using sklearn.

This is a little more Pythonic than the version provided by Paul, but it should do all of the same things. Most of the dataframes keep the same names, except where I thought a different name would be more explanatory.

In [None]:
# create the tf-idf vectorizer. ngram_range=(1, 3) includes unigrams, bigrams, and trigrams as features
Tfidf = TfidfVectorizer(ngram_range=(1,3))

# fit with final processed documents
TFIDF_matrix = Tfidf.fit_transform(corpus_df['Cleaned_List'])
TFIDF_df = pd.DataFrame(TFIDF_matrix.toarray(), columns=Tfidf.get_feature_names_out(), index=corpus_df.index)

# get some summaries
tfidf_mean = np.array(TFIDF_df.mean(axis=0)) # mean tfidf value per term, across documents
tfidf_mean_quarts = np.percentile(tfidf_mean, [25, 75]) # quartiles
cross_doc_outlier = tfidf_mean_quarts[1] + 1.5*(tfidf_mean_quarts[1]-tfidf_mean_quarts[0])

# list of terms whose mean tf-idf is an upper outlier (above Q3 + 1.5*IQR)
outlier_list = TFIDF_df.columns[tfidf_mean>cross_doc_outlier].to_list()
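The outlier cutoff used above is the usual upper fence, Q3 + 1.5 × IQR, applied to the per-term mean tf-idf values. A small worked example with made-up scores (the numbers and variable names below are illustrative, not taken from the corpus):

```python
import numpy as np

# Made-up mean tf-idf values for seven terms
mean_scores = np.array([0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.30])

# First and third quartiles, as in the cell above
q1, q3 = np.percentile(mean_scores, [25, 75])

# Upper fence: anything above this counts as an outlier
upper_fence = q3 + 1.5 * (q3 - q1)

is_outlier = mean_scores > upper_fence
print(q1, q3, upper_fence)
print(is_outlier)
```

Here only the last term sits above the fence, so it is the only one that would land in `outlier_list`.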

### K-means clustering


In [None]:
titles = list(corpus_df.index)

k = int(np.floor(np.sqrt(len(titles))))
km = KMeans(n_clusters=k, random_state=89)
km.fit(TFIDF_df)
clusters_d2v = km.labels_.tolist()

Dictionary = {'Doc Name': titles, 'Cluster': clusters_d2v, 'Text': corpus_df['Cleaned_List']}
frame = pd.DataFrame(Dictionary, columns=['Cluster', 'Doc Name', 'Text'])

# map each cluster label to a comma-separated string of the document names in it
cluster_title = frame.groupby('Cluster')['Doc Name'].agg(','.join).to_dict()


k_means_plotting()
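To see the clustering step in isolation, here is a self-contained sketch on a toy corpus. The four documents and their names are invented for illustration, `n_init=10` is set explicitly so the behavior does not depend on the installed scikit-learn version's default, and the square-root rule for `k` mirrors the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: document name -> text
docs = {
    "bus_1": "bus route schedule delay bus stop",
    "bus_2": "bus fare schedule route change",
    "rail_1": "rail station platform train track",
    "rail_2": "train rail signal track upgrade",
}

# Vectorize the texts into a tf-idf matrix (one row per document)
tfidf_matrix = TfidfVectorizer().fit_transform(list(docs.values()))

# Same heuristic as above: k is the floor of sqrt(number of documents)
k = int(np.floor(np.sqrt(len(docs))))

km = KMeans(n_clusters=k, random_state=89, n_init=10)
km.fit(tfidf_matrix)

# Group document names by their assigned cluster label
cluster_docs = {}
for name, label in zip(docs, km.labels_):
    cluster_docs.setdefault(int(label), []).append(name)

print(k)
print(cluster_docs)
```

The square-root rule is only a rough starting point for `k`; with a corpus this small it is worth checking the resulting groups by eye.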


### Now plot the clusters
