# Reuters-21578 Text Categorization
The goal of this section is to implement a machine learning model that classifies the documents in the Reuters-21578 corpus. Each document's "BODY" and "TITLE" are used to predict its overall category, or "TOPIC".

First, the frequency of the top words in each document's "BODY" and "TITLE" is calculated and used to create a sparse matrix of features. The most popular "TOPIC" for each document serves as its label.

K-Means clustering then groups the documents by their top-word counts into 135 clusters, reflecting the 135 potential "TOPICS".

The Naive Bayes classifier is evaluated with two train/test splits: SkLearn's randomized train_test_split, and the predetermined "LEWISSPLIT", which was used in the TOIS located in the Resources directory.
 - Using the SkLearn split, "BODY" accurately classifies 
 - Using the "LEWISSPLIT", "BODY" classifies about

## Frequency of Top Words
This section determines the frequencies of top words in the entire corpus. The words are taken from either the text "BODY" or "TITLE".

Stopwords, meaning words such as "and" and "the" that have no impact on the overall categorization of an article, are removed from the list of words. A sparse matrix is created for the features, and a separate list is created with a single topic for each document ID.
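As a small illustration of the stopword filtering and sparse count matrix described above, the sketch below runs `CountVectorizer` on two made-up sentences. The documents and the `toy_*` names are hypothetical and not part of the corpus. The stop-word set is passed as a list, which is the form recent scikit-learn releases accept.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Toy documents standing in for article bodies (hypothetical data).
toy_docs = [
    "The bank said profits rose and the dividend rose",
    "Oil prices fell, Reuter said",
]

# Built-in English stop words plus two corpus-specific ones.
toy_stopwords = list(ENGLISH_STOP_WORDS.union({"reuter", "said"}))

toy_vectorizer = CountVectorizer(stop_words=toy_stopwords)
# fit_transform returns a SciPy sparse matrix with one row per document.
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index in the matrix.
toy_vocab = toy_vectorizer.vocabulary_
toy_counts = toy_matrix.toarray()

print(sorted(toy_vocab))
print("count of 'rose' in document 0:", toy_counts[0, toy_vocab["rose"]])
```

Each row of the matrix holds the per-document counts of the terms that survive stopword removal, which is the feature representation the rest of this section relies on.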

In [None]:
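import numpy as np
import matplotlib.pyplot as plt
import sklearn
# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text; the
# sklearn.feature_extraction.stop_words module cannot be imported in recent releases.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from yellowbrick.text.freqdist import FreqDistVisualizer

In [None]:
# Corpus-specific stop words on top of the built-in English list
stopwords = frozenset({'reuter', 'said'})
english_stopwords = ENGLISH_STOP_WORDS
stopwords = stopwords.union(english_stopwords)

# Recent scikit-learn releases expect stop_words as a list rather than a frozenset
vectorizer = CountVectorizer(stop_words=list(stopwords))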

def graph_frequencies(matrix, features):
    fig, ax = plt.subplots(figsize=(20,10))
    visualizer = FreqDistVisualizer(features=features, ax=ax)
    visualizer.fit(matrix)
    visualizer.poof()
    plt.close(fig)

In [None]:
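# Create a sparse count matrix and the matching feature list for BODY, then for TITLE.
# Note: scikit-learn 1.0+ provides get_feature_names_out() as the replacement for get_feature_names().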
with open('body_no_null.csv') as f:
    bodyContent = f.readlines()
bodyData = [x.split(',')[2] for x in bodyContent]
bodyMatrix = vectorizer.fit_transform(bodyData)
bodyFeatures = vectorizer.get_feature_names()
graph_frequencies(bodyMatrix, bodyFeatures)

with open('title_no_null.csv') as f:
    titleContent = f.readlines()
titleData = [x.split(',')[2] for x in titleContent]
titleMatrix = vectorizer.fit_transform(titleData)
titleFeatures = vectorizer.get_feature_names()
graph_frequencies(titleMatrix, titleFeatures)

In [None]:
def get_labels(file):
    with open(file) as f:
        labelsContent = f.readlines()
    labels = [x.split(',')[2] for x in labelsContent]
    return labels

labels = get_labels('topics_popular.csv')
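The cells above take the third field of each line with `x.split(',')[2]`, which cuts a field short if the text itself contains a comma. The sketch below shows one possible alternative using the standard `csv` module, assuming the files follow ordinary CSV quoting. The sample rows, their column meanings, and the helper name `third_column` are hypothetical.

```python
import csv
import io

# Hypothetical three-column rows (id, tag, text); the real files may be laid out differently.
sample = io.StringIO(
    '1,TRAIN,"Oil prices rose, analysts said"\n'
    '2,TEST,Grain exports fell\n'
)

def third_column(handle):
    """Return the third field of every row that has at least three fields."""
    return [row[2] for row in csv.reader(handle) if len(row) > 2]

texts = third_column(sample)
print(texts)

# For comparison, a plain split stops at the comma inside the quoted text.
naive = '1,TRAIN,"Oil prices rose, analysts said"'.split(',')[2]
print(naive)
```

If the CSV files were written without quoting, this approach gives the same result as the plain split, so it is only a safeguard for quoted fields.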

The following code creates sparse matrices and populates the "BODY" and "TITLE" features separately for the documents tagged as training or test by the "LEWISSPLIT".

In [None]:
# Fit a separate vectorizer on each training split and reuse its vocabulary for the
# matching test split. Reusing the shared `vectorizer` here would apply the vocabulary
# of whichever text it was fitted on last (TITLE) to the BODY data.
bodyVectorizer = CountVectorizer(stop_words=list(stopwords))
titleVectorizer = CountVectorizer(stop_words=list(stopwords))

with open('training_files/body_train.csv') as f:
    bodyTrainContent = f.readlines()
bodyTrainData = [x.split(',')[2] for x in bodyTrainContent]
bodyTrainMatrix = bodyVectorizer.fit_transform(bodyTrainData)
bodyTrainFeatures = bodyVectorizer.get_feature_names()
#graph_frequencies(bodyTrainMatrix, bodyTrainFeatures)

with open('testing_files/body_test.csv') as f:
    bodyTestContent = f.readlines()
bodyTestData = [x.split(',')[2] for x in bodyTestContent]
bodyTestMatrix = bodyVectorizer.transform(bodyTestData)
bodyTestFeatures = bodyVectorizer.get_feature_names()
#graph_frequencies(bodyTestMatrix, bodyTestFeatures)

with open('training_files/title_train.csv') as f:
    titleTrainContent = f.readlines()
titleTrainData = [x.split(',')[2] for x in titleTrainContent]
titleTrainMatrix = titleVectorizer.fit_transform(titleTrainData)
titleTrainFeatures = titleVectorizer.get_feature_names()
#graph_frequencies(titleTrainMatrix, titleTrainFeatures)

with open('testing_files/title_test.csv') as f:
    titleTestContent = f.readlines()
titleTestData = [x.split(',')[2] for x in titleTestContent]
titleTestMatrix = titleVectorizer.transform(titleTestData)
titleTestFeatures = titleVectorizer.get_feature_names()
#graph_frequencies(titleTestMatrix, titleTestFeatures)

## K-Means Clustering
Since there are 135 potential "TOPICS", the documents are grouped into 135 clusters based on their top-word frequencies. The top words of the first 5 clusters are displayed below; each run produces a different subset of words for each cluster. This analysis is a useful sanity check that the words grouped together align with the predetermined topics located in topics_popular.csv.

Modified from: https://github.disney.com/JORDC054/twitter-friend-clusters/blob/master/twitter%20techvive%202018.ipynb.
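To make the "top words per cluster" idea concrete, here is a minimal sketch on four made-up documents with two clusters instead of 135. The documents, the `toy_*` names, and the choice of `random_state=0` are assumptions for illustration only. The term list is rebuilt from `vocabulary_` so the sketch does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import CountVectorizer

# Two obviously different themes (hypothetical data).
toy_docs = [
    "wheat corn harvest wheat",
    "corn wheat grain harvest",
    "oil crude barrel oil",
    "crude oil barrel price",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Terms ordered by column index, so toy_terms[j] names column j of the matrix.
vocab = toy_vectorizer.vocabulary_
toy_terms = np.array(sorted(vocab, key=vocab.get))

toy_mbk = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_mbk.fit_predict(toy_matrix)

# For each centroid, sort term indices from largest to smallest centroid weight.
order = toy_mbk.cluster_centers_.argsort()[:, ::-1]
top_terms = {c: toy_terms[order[c, :3]].tolist() for c in range(2)}

print("labels:", toy_labels.tolist())
print("top terms per cluster:", top_terms)
```

The rows of `cluster_centers_` live in the same feature space as the count matrix, which is why sorting a centroid's entries yields the words that characterize that cluster.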

In [3]:
from sklearn.cluster import MiniBatchKMeans

# The early-stopping parameter is named max_no_improvement (singular)
mbk = MiniBatchKMeans(init='k-means++', n_clusters=135, batch_size=2500, n_init=10, max_no_improvement=10, verbose=0)

mbk.fit(bodyMatrix) #titleMatrix

In [6]:
import pandas as pd

clusters = mbk.labels_.tolist()

frame = pd.DataFrame({'cluster' : clusters}, index = [clusters], columns = ['cluster'])
#frame['cluster'].value_counts()


In [8]:
print("Top words per cluster:")
print()

# for each centroid, order feature indices from largest to smallest centroid weight
order_centroids = mbk.cluster_centers_.argsort()[:, ::-1]

for i in range(5):
    print("Cluster %d had top words:" % i, end= ' ')
    for ind in order_centroids[i, :30]:
        print(bodyFeatures[ind], end = ', ') #change to titleFeatures to view "TITLE" top words
    print()
    print()


## Naive Bayes
The following code splits the data into train and test sections, with the features determined by the sparse matrix and the labels created above.
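As a minimal sketch of the split-then-classify step, the example below trains a Naive Bayes model on a handful of made-up documents. The data, the `toy_*` names, and the choice of `MultinomialNB` as the Naive Bayes variant are assumptions; the text above does not say which variant is used. Unlike the cells above, the sketch fits the vectorizer after splitting, so the vocabulary comes from the training portion only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents with one topic label each.
toy_docs = [
    "wheat harvest grain exports",
    "corn grain harvest forecast",
    "grain wheat shipment tonnes",
    "corn wheat crop estimate",
    "crude oil barrel prices",
    "oil output barrel opec",
    "crude prices opec meeting",
    "oil refinery crude supply",
]
toy_topics = ["grain"] * 4 + ["crude"] * 4

# Randomized, stratified split: 6 training documents and 2 test documents.
train_docs, test_docs, y_train, y_test = train_test_split(
    toy_docs, toy_topics, test_size=0.25, random_state=0, stratify=toy_topics
)

toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_docs)  # learn the vocabulary on training data
X_test = toy_vectorizer.transform(test_docs)        # reuse that vocabulary for test data

toy_clf = MultinomialNB()
toy_clf.fit(X_train, y_train)

toy_accuracy = accuracy_score(y_test, toy_clf.predict(X_test))
print("toy accuracy:", toy_accuracy)
```

For a predetermined split such as "LEWISSPLIT", the `train_test_split` call would be skipped and the pre-assigned training and test documents used directly.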

Using SKlearn train_test_split: "BODY" produces an accuracy of about 86% and "TITLE" produces an ac