## Global To Dos:
* read paper
* [steps document](https://docs.google.com/document/d/1du2fcmVzTqnW0FWGmUxvpR2t6QEfwxDIrHSZs9k3HZM/edit?usp=sharing)

Section 1: Load and Preprocess Data

In [1]:
import utils_preprocessing as up
import utils as ut
import importlib
importlib.reload(ut)
importlib.reload(up)
from itertools import groupby
import yaml
import os
import textdistance
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

## Pre-processing 

Read in YAML and make PATH

In [2]:
with open("config.yml", 'r') as ymlfile:
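    cfg = yaml.safe_load(ymlfile)  # yaml.load() requires an explicit Loader in recent PyYAML; safe_load suffices for plain config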

PATH = os.path.join(cfg['data']['Directory'] + ":" + os.sep, cfg['data']['Folder1'], cfg['data']['Folder2'], cfg['data']['Folder3']) # Alix Path

#PATH = 'SOTU/'
#PATH = '/Users/aleistermontfort/Desktop/speeches' # Aleister Path
filetype = '*.txt'  # glob pattern for the speech files (matches the pattern passed to up.reading_data below)
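
The cell above expects `config.yml` to contain a `data` section with `Directory`, `Folder1`, `Folder2`, and `Folder3` keys. A minimal sketch of how those pieces combine into `PATH`, using a plain dict in place of the parsed YAML. The values and the `build_path` helper are made-up placeholders for illustration, not the project's real paths or code:

```python
import os

# Hypothetical stand-in for the parsed config.yml (placeholder values).
cfg_example = {
    "data": {
        "Directory": "C",
        "Folder1": "Users",
        "Folder2": "example_user",
        "Folder3": "speeches",
    }
}

def build_path(cfg):
    """Join the drive letter and three folder names the way the cell above does."""
    data = cfg["data"]
    root = data["Directory"] + ":" + os.sep  # drive letter, colon, path separator
    return os.path.join(root, data["Folder1"], data["Folder2"], data["Folder3"])

example_path = build_path(cfg_example)
print(example_path)
```

Note that the drive-letter prefix only makes sense on Windows; the commented-out alternatives show plain paths for other machines.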

  


Read in Data

In [3]:
speeches, numpar = up.reading_data(PATH,'*.txt')

Create Noun Phrases

In [4]:
new_speeches = up.chunks(speeches, "regex")

Clean Words & Lemmatize Noun Phrases: Spacy - Skip if used Regex Above

In [None]:
#clean_speeches = up.clean_words(new_speeches)

In [None]:
#phrases_lemmed = up.lemmed_phrases(words_changed, clean_speeches)

Lemmatize Noun Phrases: Regex - Skip if Used Spacy Above

In [None]:
words_changed = up.word_changes(new_speeches, 0.5, 100)

In [None]:
phrases_lemmed = up.lemmed_phrases(words_changed, new_speeches)

Counting Occurrence of Terms

In [None]:
counted_words = up.count_words(phrases_lemmed)

Limit List to Top 1000

In [None]:
top_words = up.top_x(counted_words, 1000)
top_words
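
`up.count_words` and `up.top_x` live in `utils_preprocessing` and are not shown here. As a rough illustration of the idea only, assuming `phrases_lemmed` maps a `(paragraph_id, year)` key to a list of phrases (an assumption; the real structure may differ), counting and truncation could look like this. The helper names are hypothetical:

```python
from collections import Counter

def count_phrases(paragraphs):
    """Count phrase occurrences across all paragraphs (hypothetical helper)."""
    counts = Counter()
    for phrases in paragraphs.values():
        counts.update(phrases)
    return counts

def top_n(counts, n):
    """Return the n most frequent phrases as a list (hypothetical helper)."""
    return [phrase for phrase, _ in counts.most_common(n)]

# Toy data: keys are (paragraph_id, year) tuples, values are phrase lists.
toy_paragraphs = {
    (0, "1900"): ["tariff", "treaty", "tariff"],
    (1, "1900"): ["treaty", "navy"],
    (2, "1901"): ["tariff", "gold standard"],
}
toy_counts = count_phrases(toy_paragraphs)
toy_top = top_n(toy_counts, 2)
print(toy_top)
```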

Limit paragraph phrases to those in the top 1000

In [None]:
limited_paragraphs = up.limit(phrases_lemmed, top_words)
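
The filtering step can be pictured as a vocabulary lookup per paragraph. This is a sketch under the assumption that paragraphs are stored as a dict of phrase lists; `limit_to_vocabulary` is a hypothetical name and not the actual `up.limit` implementation:

```python
def limit_to_vocabulary(paragraphs, vocabulary):
    """Keep only phrases that appear in the vocabulary (hypothetical helper)."""
    vocab = set(vocabulary)  # set membership keeps each lookup cheap
    return {
        key: [phrase for phrase in phrases if phrase in vocab]
        for key, phrases in paragraphs.items()
    }

toy_paragraphs = {
    (0, "1900"): ["tariff", "treaty", "tariff"],
    (1, "1900"): ["treaty", "navy"],
    (2, "1901"): ["tariff", "gold standard"],
}
toy_limited = limit_to_vocabulary(toy_paragraphs, ["tariff", "treaty"])
print(toy_limited)
```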

## Periodization

In [None]:
tfidfs = up.corpus_tfidf(limited_paragraphs, counted_words, top_words)
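
How `up.corpus_tfidf` weights terms is not visible in this notebook. One common formulation, shown here purely as an assumption, pools the phrases of each year into one document, uses relative frequency as TF, and uses log(number of years / years containing the term) as IDF. The year is assumed to be the second element of each key, and `yearly_tfidf` is a hypothetical name:

```python
import math
from collections import Counter, defaultdict

def yearly_tfidf(paragraphs, vocabulary):
    """TF-IDF per year, treating each year as one document (illustrative only)."""
    vocab = set(vocabulary)
    # Pool phrase counts per year; the year is assumed to be key[1].
    year_counts = defaultdict(Counter)
    for (_, year), phrases in paragraphs.items():
        year_counts[year].update(p for p in phrases if p in vocab)

    n_years = len(year_counts)
    # Document frequency: number of years in which each term appears.
    doc_freq = Counter()
    for counts in year_counts.values():
        doc_freq.update(counts.keys())

    tfidf = {}
    for year, counts in year_counts.items():
        total = sum(counts.values())
        tfidf[year] = {
            term: (counts[term] / total) * math.log(n_years / doc_freq[term])
            for term in counts
        }
    return tfidf

toy_paragraphs = {
    (0, "1900"): ["tariff", "treaty"],
    (1, "1901"): ["tariff", "navy"],
}
toy_tfidf = yearly_tfidf(toy_paragraphs, ["tariff", "treaty", "navy"])
print(toy_tfidf)
```

With this weighting, a term that appears in every year (here `tariff`) scores zero, so only terms that distinguish years contribute to later comparisons.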

In [None]:
periods, dissimilarity = up.periodization(tfidfs)

In [None]:
import operator
sorted_x = sorted(periods.items(), key=operator.itemgetter(1))
years = dissimilarity.index
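
The cells that follow use `pd.to_numeric(sorted_x[0][0]) - pd.to_numeric(years[0])` as a positional offset into `years`. A small sketch with toy values shows what that expression picks out. The scores below are invented, and what the score returned by `up.periodization` actually measures is not shown in this notebook:

```python
import operator

# Toy stand-ins: a score per candidate break year and the ordered year labels.
toy_periods = {"1900": 0.9, "1901": 0.4, "1902": 0.7, "1903": 0.8}
toy_years = ["1900", "1901", "1902", "1903"]

# Same sort as the cell above: ascending by score, so the first entry has the lowest score.
toy_sorted = sorted(toy_periods.items(), key=operator.itemgetter(1))
break_year = toy_sorted[0][0]

# Year difference reused as a list position; valid only if the labels are consecutive years.
offset = int(break_year) - int(toy_years[0])
print(break_year, offset)
```

If any year is missing from the corpus, the year difference no longer equals the position in `years`; looking the position up directly (for example with `list(years).index(break_year)`) would avoid that assumption.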

In [None]:
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface; pylab is discouraged
import matplotlib.ticker as ticker

lists = sorted(periods.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples

ax = plt.axes()
ax.xaxis.set_major_locator(ticker.MultipleLocator(20))
plt.axvline(pd.to_numeric(sorted_x[0][0])-pd.to_numeric(years[0]), 0,1)
plt.plot(x, y)
plt.show()

In [None]:
import seaborn as sns
cmap = sns.cubehelix_palette(as_cmap=True, reverse=True)
ax = sns.heatmap(dissimilarity, cmap=cmap)
plt.axvline((pd.to_numeric(sorted_x[0][0])-pd.to_numeric(years[0])), 0,1)
plt.show()

### Before

Before Years

In [None]:

before_1914 = years[:pd.to_numeric(sorted_x[0][0])-pd.to_numeric(years[0])]
before_dict = {k: v for k, v in limited_paragraphs.items() if k[1] in before_1914}

Before TFIDF

In [None]:
before_tfidfs = up.corpus_tfidf(before_dict, counted_words, top_words)

In [None]:
before_periods, before_dissimilarity = up.periodization(before_tfidfs)

### After

After Years

In [None]:
after_1914 = years[pd.to_numeric(sorted_x[0][0])-pd.to_numeric(years[0]):]
after_dict = {k: v for k, v in limited_paragraphs.items() if k[1] in after_1914}
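
The before/after split can be checked on toy data. This sketch assumes, as the cells above do, that each key's second element is the year label; the offset value is arbitrary:

```python
toy_years = ["1900", "1901", "1902", "1903"]
toy_offset = 2  # position of the break year in toy_years (arbitrary for this example)

# Same slicing as the notebook: everything before the break, then the break onward.
toy_before_years = toy_years[:toy_offset]
toy_after_years = toy_years[toy_offset:]

toy_paragraphs = {
    (0, "1900"): ["tariff"],
    (1, "1902"): ["treaty"],
    (2, "1903"): ["navy"],
}
toy_before = {k: v for k, v in toy_paragraphs.items() if k[1] in toy_before_years}
toy_after = {k: v for k, v in toy_paragraphs.items() if k[1] in toy_after_years}
print(sorted(toy_before), sorted(toy_after))
```

Because the slices are `[:offset]` and `[offset:]`, the break year itself lands in the "after" half, and the two halves together cover every paragraph exactly once.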

After TFIDF

In [None]:
after_tfidfs = up.corpus_tfidf(after_dict, counted_words, top_words)

In [None]:
after_periods, after_dissimilarity = up.periodization(after_tfidfs)

In [None]:
new_periods = {**before_periods, **after_periods}

In [None]:
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface; pylab is discouraged
import matplotlib.ticker as ticker

lists = sorted(new_periods.items()) # sorted by key, return a list of tuples
x, y = zip(*lists) # unpack a list of pairs into two tuples

ax = plt.axes()
ax.xaxis.set_major_locator(ticker.MultipleLocator(20))

plt.plot(x, y)
plt.show()

## Co-Occurrence Matrix and Dissimilarity Topic Modeling

Build the co-occurrence matrix

In [None]:
pre_occur = ut.convert_dict_to_list(limited_paragraphs)

In [None]:
co_occur = ut.co_oc_matrix(pre_occur, True, True)
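
As an illustration of what a co-occurrence matrix holds, the sketch below counts how many paragraphs contain each pair of terms. It assumes `pre_occur` is a list of phrase lists and does not model the two boolean arguments passed to `ut.co_oc_matrix`, whose meaning is not documented here. Both helper names are hypothetical:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the paragraphs containing both."""
    pair_counts = Counter()
    for phrases in documents:
        # sorted(set(...)) counts each pair once per paragraph, in a stable order.
        for a, b in combinations(sorted(set(phrases)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def to_matrix(pair_counts, vocabulary):
    """Lay the pair counts out as a symmetric matrix ordered by vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    n = len(vocabulary)
    matrix = [[0] * n for _ in range(n)]
    for (a, b), count in pair_counts.items():
        i, j = index[a], index[b]
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix

toy_docs = [["tariff", "treaty"], ["tariff", "treaty", "navy"], ["navy"]]
toy_vocab = ["navy", "tariff", "treaty"]
toy_matrix = to_matrix(cooccurrence_counts(toy_docs), toy_vocab)
print(toy_matrix)
```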

Compute pairwise cosine similarity

In [None]:
co_matrix = ut.pairwise_similarity(co_occur, 'cosine')
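
For reference, cosine similarity between two rows is their dot product divided by the product of their Euclidean norms. A plain-Python sketch on a toy matrix follows; this is the textbook definition, not necessarily how `ut.pairwise_similarity` is implemented, and the function names are hypothetical:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero rows
    return dot / (norm_u * norm_v)

def pairwise_cosine(matrix):
    """Cosine similarity between every pair of rows."""
    return [[cosine_similarity(row, other) for other in matrix] for row in matrix]

toy_cooc = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
toy_sims = pairwise_cosine(toy_cooc)
print(toy_sims)
```

Terms with similar co-occurrence profiles score close to 1, and terms that never share neighbours score 0.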

Compute the paper's dissimilarity measure

In [None]:
co_matrix_paper = ut.pairwise_similarity(co_occur, 'paper')

### Global Corpus

Create network graph using community detection algorithms and paper's approach

In [None]:
global_cda_paper = ut.network_graph(co_matrix_paper, 'community')

Create network graph using community detection algorithms and cosine similarity

In [None]:
global_cda_cosine = ut.network_graph(co_matrix, 'community')

Create network graph using kmeans unsupervised clustering and paper's approach 

In [None]:
global_kmeans_paper = ut.network_graph(co_matrix_paper, 'kmeans')

Create network graph using kmeans unsupervised clustering and cosine similarity 

In [None]:
global_kmeans_cosine = ut.network_graph(co_matrix, 'kmeans')

### Before 1914 Corpus

First turn the before dictionary into a co-occurrence matrix, then build its cosine-similarity and paper-dissimilarity versions

In [None]:
before_occur = ut.convert_dict_to_list(before_dict)
co_before = ut.co_oc_matrix(before_occur, True, False)
before_cosine = ut.pairwise_similarity(co_before, 'cosine')
before_paper = ut.pairwise_similarity(co_before, 'paper')

Create network graph using community detection algorithm and paper dissimilarity

In [None]:
before_cda_paper = ut.network_graph(before_paper, 'community')

Create network graph using community detection algorithm and cosine similarity

In [None]:
before_cda_cosine = ut.network_graph(before_cosine, 'community')

Create network graph using kmeans and paper dissimilarity

In [None]:
before_kmeans_paper = ut.network_graph(before_paper, 'kmeans')

Create network graph using kmeans and cosine similarity

In [None]:
before_kmeans_cosine = ut.network_graph(before_cosine, 'kmeans')

### After 1914 Corpus

First turn the after dictionary into a co-occurrence matrix, then build its cosine-similarity and paper-dissimilarity versions

In [None]:
after_occur = ut.convert_dict_to_list(after_dict)
co_after = ut.co_oc_matrix(after_occur, True, False)
after_cosine = ut.pairwise_similarity(co_after, 'cosine')
after_paper = ut.pairwise_similarity(co_after, 'paper')

Create network graph using community detection algorithm and paper dissimilarity

In [None]:
after_cda_paper = ut.network_graph(after_paper, 'community')

Create network graph using community detection algorithm and cosine similarity

In [None]:
after_cda_cosine = ut.network_graph(after_cosine, 'community')

Create network graph using kmeans and paper dissimilarity

In [None]:
after_kmeans_paper = ut.network_graph(after_paper, 'kmeans')

Create network graph using kmeans and cosine similarity

In [None]:
after_kmeans_cosine = ut.network_graph(after_cosine, 'kmeans')

## LDA Topic Modeling 