In [None]:
from nltk import PorterStemmer
stemmer = PorterStemmer()
import random 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

## Problem

The original goal of this project was to categorize documents from the collegiate debate wiki by the country each affirmative case or plan focused on. However, that would have required hand-labeling a massive number of documents, and the 105 documents I had labeled were not enough. I did have enough to divide them into two categories: plan and planless affirmatives. In college debate there is a clear split between teams that read a plan affirming the topic (this year it was about reducing US alliance commitments) and teams that do not read a plan. These "planless" teams usually read philosophical or political criticisms of the topic, or of debate itself, and so draw from a very different literature base than the "plan" affirmatives. All of this is very contentious, but if an accurate classification algorithm could be developed, it would be possible to quantify the trends and differences without manual classification.

A simple approach, I thought, would be to search the texts for a few key words that indicate whether they have a plan. I did that using naive Bayes, although it could be done even more simply.
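
The "even simpler" option mentioned above could look something like the sketch below: label a document purely by whether a keyword appears. The function names, the keyword set, and the toy strings are hypothetical and chosen for illustration; they are not part of the original project.

```python
import re

# Hypothetical keyword set; "usfg" mirrors the feature used later in this notebook.
PLAN_KEYWORDS = {"usfg"}

def has_plan_keyword(text, keywords=PLAN_KEYWORDS):
    """Return True if any keyword appears as a whole token in text (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return not tokens.isdisjoint(keywords)

def label_by_keyword(text):
    """Label a document 'Plan' or 'No_Plan' from keyword presence alone."""
    return "Plan" if has_plan_keyword(text) else "No_Plan"

# Toy strings, invented for illustration only.
print(label_by_keyword("Plan: The USFG should reduce its alliance commitments."))
print(label_by_keyword("We affirm a politics of refusal."))
```

A rule like this has no training step, so it makes a useful baseline against which to compare the naive Bayes accuracy.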

### Creating the Corpus

In [None]:
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir6 = 'Plan'
corpusdir7 = 'No_Plan'



Plan_corpus = PlaintextCorpusReader(corpusdir6, '.*')
No_Plan_corpus = PlaintextCorpusReader(corpusdir7, '.*')

In [None]:
# Pair each document's token list with its label
documents = (
    [(list(Plan_corpus.words(fileid)), 'Plan')
     for fileid in Plan_corpus.fileids()]
    + [(list(No_Plan_corpus.words(fileid)), 'No_Plan')
       for fileid in No_Plan_corpus.fileids()]
)

### Cleaning the word features 

In [None]:
from nltk.corpus import stopwords
import re

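all_words = Plan_corpus.words() + No_Plan_corpus.words()

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_set = set(stopwords.words('english'))
filtered_words = [word for word in all_words if word not in stop_set]

fd = nltk.FreqDist(filtered_words)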

In [None]:
print(len(all_words))
word_features = [word for (word, count) in fd.most_common(4000)]


Together, the two corpora contain over a million words across roughly 100 documents.

In [None]:
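# Note: good_features (defined in the next cell) must exist before this cell is run
def features2(document):
    # Stem every token so that feature lookup is done against word stems
    document_words = set(stemmer.stem(word) for word in document)
    features = {}
    for word in good_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

doc_featuresets2 = [(features2(d), c) for (d, c) in documents]
random.shuffle(doc_featuresets2)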
    

In [None]:
good_features2 = ['nuclear', 'plan', '5', 'S', 'U', 'aircraft', 'revisionist', 'spiral',
                 'partnership', 'grand', 'alternative',
                 'one', 'honor', 'analyst', 'civillian', 'key']
good_features = ['USFG', 'usfg',]

In [None]:
train_set, test_set = doc_featuresets2[40:], doc_featuresets2[:40]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

In [None]:
classifier.show_most_informative_features(10)

Not bad results. I found the highest accuracy came from using only two features, the upper- and lower-case forms of "USFG", which stands for United States Federal Government and which many plan teams use when writing a plan. Also interesting: the word "nuclear" strongly predicted a plan affirmative, since many plan affirmatives discuss speculative war scenarios.
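
To make it clearer why a single boolean feature can carry the classifier, here is a minimal from-scratch sketch of naive Bayes over `contains(...)` features. It assumes add-one smoothing (`alpha=1.0`), which is not necessarily the estimator NLTK uses by default, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets, alpha=1.0):
    """Count label frequencies and per-label feature-value frequencies."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts, alpha

def classify_nb(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, alpha = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Each boolean feature has two possible values, hence 2 * alpha
            score += math.log((seen + alpha) / (n + 2 * alpha))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, invented for illustration: most plan documents contain "usfg".
toy = (
    [({"contains(usfg)": True}, "Plan")] * 3
    + [({"contains(usfg)": False}, "Plan")]
    + [({"contains(usfg)": False}, "No_Plan")] * 3
)
toy_model = train_nb(toy)
print(classify_nb(toy_model, {"contains(usfg)": True}))
print(classify_nb(toy_model, {"contains(usfg)": False}))
```

When one feature value is common in one class and rare in the other, its likelihood ratio dominates the prior, which is consistent with the accuracy reported above for the "USFG" feature.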

## TF-IDF and K-Means Clustering

I used tf-idf and the k-means algorithm to find the differences in language used by the planless and policy-focused affirmatives. In many ways this is more interesting, because it shows how the two categories cluster around different vocabularies.
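
As a rough illustration of what tf-idf weighting does before clustering, the sketch below computes it by hand on toy token lists. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches scikit-learn's documented defaults as far as I know. The `tfidf` helper and the toy documents are hypothetical, and the `max_df`/`min_df` filtering used in this notebook is not reproduced.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so that document length does not dominate distances
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy documents, invented for illustration only.
toy_docs = [
    ["alliance", "nuclear", "deterrence"],
    ["alliance", "colonial", "body"],
    ["alliance", "nuclear", "cyber"],
]
rows = tfidf(toy_docs)
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

Terms that appear in every document ("alliance" here) receive the lowest weight, so k-means distances are driven by the vocabulary that distinguishes documents.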

In [None]:
docu = [Plan_corpus.raw(fileid) for fileid in Plan_corpus.fileids()] + [No_Plan_corpus.raw(fileid) for fileid in No_Plan_corpus.fileids()]



In [None]:
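# stop_words='english' selects scikit-learn's built-in stopword list;
# the documents themselves are passed to fit_transform, not the constructor
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=0.1, lowercase=True)

X = vectorizer.fit_transform(docu)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :200]:
        print(' %s' % terms[ind], end='')
    print()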

print("\n")
print("Prediction")

The results show that the top terms in the planless affirmatives' cluster are words heavily emphasized in "critical theory" humanities fields; words like "body" and "colonial" show that connection. The top terms in the plan affirmatives' cluster, on the other hand, emphasize geopolitical terminology, such as "cyber", and vocabulary taken from the international relations literature.
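
One way to check how closely the two clusters track the hand labels is cluster purity: the share of documents whose label matches the majority label of their cluster. The sketch below is a minimal version on invented toy values; `cluster_purity` is a hypothetical helper, not something used in the original analysis.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Fraction of documents whose true label is the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # Sum the size of the majority label within each cluster
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(true_labels)

# Toy values, invented for illustration only.
toy_cluster_ids = [0, 0, 0, 1, 1, 1]
toy_true_labels = ["Plan", "Plan", "No_Plan", "No_Plan", "No_Plan", "No_Plan"]
print(cluster_purity(toy_cluster_ids, toy_true_labels))
```

In this notebook the cluster assignments would presumably come from `model.labels_`, and the true labels follow the order in which `docu` was built (all `Plan` files first, then all `No_Plan` files).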

In [None]:
len(doc_featuresets2)

In [None]:
len(docu)

## Visualization

In [None]:
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt

In [None]:
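# Word cloud of the top 200 terms for cluster 1
# (index made explicit instead of relying on the leftover loop variable i)
cloud = WordCloud(stopwords = STOPWORDS,
                  background_color = "white",
                  max_words = 200,
                  max_font_size = 40, 
                  scale=3,
                  random_state=1
                 ).generate(" ".join(terms[ind] for ind in order_centroids[1, :200]))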


plt.figure(figsize = (15,10))
plt.clf()
plt.imshow(cloud)
plt.axis("off")
plt.show()

In [None]:
# Word cloud of the top 200 terms for cluster 0
cloud = WordCloud(stopwords = STOPWORDS,
                  background_color = "white",
                  max_words = 200,
                  max_font_size = 40, 
                  scale=3,
                  random_state=1
                 ).generate(" ".join(terms[ind] for ind in order_centroids[0, :200]))


plt.figure(figsize = (15,10))
plt.clf()
plt.imshow(cloud)
plt.axis("off")
plt.show()