<table align="left">
<tr>

<th style="background-color:white">
<img src="https://github.com/mlgill/ODSC_East_2017_PythonNLP/blob/master/assets/logo.png?raw=true" width="140" height="100">
</th>

<th style="background-color:white">
<div align="left">
<h1>Learning from Text: <br> Introduction to Natural Language Processing with Python</h1>  
<h2>Michelle L. Gill, Ph.D.</h2>     
Senior Data Scientist, Metis  
ODSC East  
May 3, 2017 
</div>
</th>

</tr>
</table>  

## Hierarchical Text Clustering Exercises

We will be using a subset of the Reuters data set, which is a collection of 9,603 newswire articles. The training set contains articles from April 7, 1987, and the test set contains articles from the following day (April 8, 1987).

This dataset is included with the NLTK corpora, so the initial code will handle loading the data.

In [None]:
import numpy as np
import pandas as pd
import re

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('talk')
sns.set_style('white')
sns.set_palette('dark')

import nltk
from accessory_functions import nltk_path
# Setup nltk corpora path
nltk.data.path.insert(0, nltk_path)

%matplotlib inline

Load the data.

In [None]:
from nltk.corpus import reuters

reuters.ensure_loaded()

A function to load the data and create data and category dataframes.

In [None]:
def load_data(data_obj):
    
    # Get the file ids for all documents (train and test)
    category_docs = data_obj.fileids()

    # Get the text for the train and test files
    text = [data_obj.raw(x) for x in category_docs]

    # Create dataframe
    data_df = pd.DataFrame({'fileid':category_docs, 
                             'text':text}).set_index('fileid')

    # Load the categories and create a dataframe
    category_list = data_obj.categories()

    # Build one (category, fileid) frame per category, then concatenate them
    category_frames = [(pd.DataFrame({x: data_obj.fileids(x)})
                        .stack()
                        .to_frame()
                        .reset_index(level=-1))
                       for x in category_list]
    category_df = pd.concat(category_frames, axis=0).reset_index(drop=True)
    category_df.columns = ['category', 'fileid']
    category_df = category_df.set_index('category')
    
    return data_df, category_df

In [None]:
data, category = load_data(reuters)

print(data.shape, category.shape)

data.head(5)

Select just the cocoa and coffee articles.

In [None]:
cocoa = category.loc['cocoa']
coffee = category.loc['coffee']

cocoa_data = data.loc[cocoa.fileid]
coffee_data = data.loc[coffee.fileid]

data_sm = pd.concat([cocoa_data, coffee_data])

print(data_sm.shape, data_sm.drop_duplicates().shape)

data_sm = data_sm.drop_duplicates().sample(frac=1, replace=False,
                                           random_state=42)

data = data_sm

If you prefer to work with a list rather than a pandas dataframe, copy and paste the following code into a cell to create one. Run it after preprocessing the data in Question 1 below.

```python
data = data.text.tolist()
```

## Question 1

* Preprocess the data using the convenience function in `accessory_functions`
* Use scikit-learn's `CountVectorizer` to create a document-term matrix of term counts
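
As a starting point, here is a minimal sketch of the vectorizing step only. It assumes the text has already been preprocessed; the three toy documents below are made up for illustration and stand in for the article text. The preprocessing function from `accessory_functions` is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents standing in for the preprocessed articles
docs = [
    "cocoa prices rose as cocoa exports fell",
    "coffee exports rose sharply",
    "coffee and cocoa prices fell",
]

# Learn the vocabulary and count how often each term occurs per document
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# vocabulary_ maps each term to its column index in the matrix
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(dtm.shape)
print(terms)
print(dtm.toarray())
```

With the real data, the list of documents would be replaced by the preprocessed text column (or list) created above.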

## Question 2

* Perform clustering on a few of the documents using either cosine distance (1 - cosine_similarity) or Euclidean distance.
* Plot the results in a dendrogram.
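
One possible way to approach this with SciPy is sketched below. The small count matrix is hypothetical and stands in for a few rows of the document-term matrix, and average linkage is just one reasonable choice of linkage method. Note that `pdist` with `metric='cosine'` already returns cosine distance (1 - cosine similarity).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Hypothetical toy counts: rows are documents, columns are terms
counts = np.array([
    [2, 0, 1, 0],
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [0, 3, 1, 1],
])
doc_labels = ['doc0', 'doc1', 'doc2', 'doc3']

# Condensed vector of pairwise cosine distances between documents
dist = pdist(counts, metric='cosine')

# Agglomerative clustering; each row of Z records one merge step
Z = linkage(dist, method='average')

# Draw the merge tree; the height of each join is the merge distance
fig, ax = plt.subplots(figsize=(6, 3))
dendrogram(Z, labels=doc_labels, ax=ax)
ax.set_ylabel('cosine distance')
```

For Euclidean distance, swapping in `metric='euclidean'` is enough. A sparse document-term matrix needs to be converted with `.toarray()` before being passed to `pdist`.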

## Question 3

* Get the flattened clusters for one of the linkage matrices using `fcluster`
* Using the cluster numbers, print the beginning of a few articles from each cluster
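
The sketch below shows how the flattening step might look. The texts and counts are made-up toy data, and asking for two clusters with `criterion='maxclust'` is an assumption chosen to match the two article categories; a distance threshold would work as well.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy articles and their term counts (rows align with texts)
texts = [
    "Cocoa prices rose on strong demand from grinders.",
    "Cocoa futures climbed again as cocoa stocks fell.",
    "Coffee exports slipped after quota talks stalled.",
    "Coffee producers met to discuss export quotas and prices.",
]
counts = np.array([
    [2, 0, 1, 0],
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [0, 3, 1, 1],
])

# Linkage matrix computed directly from the observations
Z = linkage(counts, method='average', metric='cosine')

# Cut the tree so that at most two flat clusters remain
cluster_ids = fcluster(Z, t=2, criterion='maxclust')

# Print the first characters of a couple of articles per cluster
for cid in np.unique(cluster_ids):
    members = [t for t, c in zip(texts, cluster_ids) if c == cid]
    print('Cluster {} ({} articles)'.format(cid, len(members)))
    for text in members[:2]:
        print('   ', text[:60])
```

`fcluster` returns one cluster number per document, in the same order as the rows passed to `linkage`, so the numbers can also be assigned as a new column on the dataframe.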