## Get the Data

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn import datasets

from nltk import ngrams

import pickle

We'll use **load_files** to read a directory of data: the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/).

The data lives in a directory that contains subdirectories. Each subdirectory holds the text documents of one category, and the subdirectory name is the category.

This needs to be transformed into a table. Let's see how it can be done!
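
To make the folder-per-category layout concrete, here is a rough standard-library sketch of the idea. It is not scikit-learn's implementation: the helper `read_labelled_folder`, the two category folders and the file contents are all made up for illustration.

```python
import tempfile
from pathlib import Path

def read_labelled_folder(root):
    """Return (texts, labels, label_names) from a folder-per-category layout."""
    root = Path(root)
    # Sorted subfolder names become the label names;
    # a name's position in that list is its integer label.
    label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    texts, labels = [], []
    for index, name in enumerate(label_names):
        for path in sorted((root / name).iterdir()):
            if path.is_file():
                texts.append(path.read_bytes())  # raw bytes, decoded later
                labels.append(index)
    return texts, labels, label_names

# Build a tiny made-up corpus so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for category, body in [("sci.space", "Rockets are fun."),
                           ("rec.autos", "Cars are fast.")]:
        folder = Path(tmp) / category
        folder.mkdir()
        (folder / "0001.txt").write_text(body, encoding="utf-8")
    texts, labels, label_names = read_labelled_folder(tmp)

print(label_names)
print(list(zip(labels, texts)))
```

Each document ends up paired with an integer label, and the label names tell us what each integer means. That is the same shape of information the `data`, `target` and `target_names` entries carry in the real Bunch object.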


In [None]:
# Load text files with categories as subfolder names.
# Please check the documentation of this function
# Do shift + tab in Jupyter
# Returns
#-------
#data : Bunch
#    Dictionary-like object, the interesting attributes are: either
#    data, the raw text data to learn, or 'filenames', the files
#    holding it, 'target', the classification labels (integer index),
#    'target_names', the meaning of the labels, and 'DESCR', the full
#    description of the dataset.
data = datasets.load_files('20news-bydate-test')

In [None]:
data.keys()

### Transform into a table
Here we read the data from the Bunch object and turn it into a pandas DataFrame.

In [None]:
data_tuples = []
# loop through data and target entries
for text,category in zip(data['data'], data['target']):
    # decode text to deal with special characters/symbols
    decoded = text.decode("cp1252")
    # turn text into one line
    one_line = " ".join(decoded.splitlines())
    # save each text and its category as a tuple in a list
    data_tuples.append((one_line, category))
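
The decode-and-flatten step is easier to follow on a single made-up byte string. The sketch below also shows one possible fallback, because Python's `cp1252` codec leaves a handful of byte values undefined and raises `UnicodeDecodeError` on them. The helper `decode_lenient` is hypothetical and not part of this notebook.

```python
# A made-up document: bytes 0xE9 and 0x80 are 'é' and '€' in cp1252.
raw = b"caf\xe9 costs \x80 5\r\nsecond line"

decoded = raw.decode("cp1252")
# splitlines() handles \n, \r\n and \r alike; joining gives one line.
one_line = " ".join(decoded.splitlines())
print(one_line)

def decode_lenient(raw_bytes):
    """Try cp1252 first; fall back to latin-1, which maps every byte to a character."""
    try:
        return raw_bytes.decode("cp1252")
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")

# Byte 0x81 has no character assigned in cp1252, so the fallback is used.
print(repr(decode_lenient(b"bad byte: \x81")))
```

Whether a fallback is needed depends on the files at hand. If the plain `decode("cp1252")` call runs without errors on the whole dataset, nothing needs to change.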

In [None]:
len(data_tuples)

In [None]:
#data_tuples[0]

In [None]:
# create a DF from the list of tuples
df = pd.DataFrame(data_tuples, columns=['Text','Category'])

## Exploratory Data Analysis

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df['Category'].value_counts()

In [None]:
df.describe()

Let's use **groupby** to run describe per label. This way we can begin to think about the features that separate the different categories.

In [None]:
df.groupby('Category').describe()

As we continue our analysis we want to start thinking about the features we are going to use. This goes along with the general idea of [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering). The better your domain knowledge of the data, the better your ability to engineer more features from it. Feature engineering is a very large part of text classification in general. I encourage you to read up on the topic!

Let's make a new column to detect how long each text entry is!

In [None]:
# length here is the number of chars
df['length'] = df['Text'].apply(len)
df.head()
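
Character length is only one possible hand-made feature. As a sketch of where feature engineering could go next, the hypothetical helper below computes a few other cheap features for a single text. The feature names and the sample sentence are illustrative choices, not something this notebook relies on.

```python
def simple_text_features(text):
    """Return a few cheap, hand-made features for one document."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        # Guard against empty documents to avoid dividing by zero.
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_upper": sum(ch.isupper() for ch in text),
    }

sample = "NASA launched the probe in 1977."
features = simple_text_features(sample)
print(features)
```

With the DataFrame from above, something like `pd.DataFrame(df['Text'].apply(simple_text_features).tolist())` should turn these dictionaries into one column per feature.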

In [None]:
df['Text'][200]

### Some Data Visualization

In [None]:
# Are the classes balanced?
count_target = df['Category'].value_counts()

plt.figure(figsize=(8,4))
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x=count_target.index, y=count_target.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Category', fontsize=12);

In [None]:
df['length'].plot(bins=5, kind='hist');

Play around with the bin size! Text length looks like it may be a good feature to think about. Notice that the x-axis goes all the way to more than 160000, which must mean that there are some really long texts!
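
One way to keep a few very long texts from stretching the axis is to cap the lengths at a high percentile before plotting. The sketch below uses made-up numbers (99 ordinary lengths plus one outlier) and the 95th percentile, which is an arbitrary choice.

```python
import statistics

# Made-up lengths: 99 ordinary documents plus one extremely long one.
lengths = list(range(100, 10000, 100)) + [160000]

# quantiles(n=20) returns the 19 cut points between 5% slices;
# the last one is the 95th percentile.
cap = statistics.quantiles(lengths, n=20)[-1]

# Replace anything above the cap by the cap itself.
clipped = [min(value, cap) for value in lengths]

print("max before:", max(lengths), "median:", statistics.median(lengths))
print("max after: ", max(clipped))
```

In pandas the same idea can be written as `df['length'].clip(upper=df['length'].quantile(0.95))`. The histogram of the clipped values then shows the bulk of the distribution instead of one nearly empty tail.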

In [None]:
df.length.describe()

In [None]:
df.hist(column='length', by='Category', bins=20, figsize=(12, 12));

## Text Pre-processing

The main issue with our data is that it is all in text format (strings). Classification algorithms need some sort of numerical feature vector to work with. There are many methods to convert a corpus to a vector format. The simplest is the [bag-of-words](http://en.wikipedia.org/wiki/Bag-of-words_model) approach, where each unique word in a text is represented by one number.
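
As a minimal sketch of the bag-of-words idea, the example below builds count vectors for two made-up sentences with the standard library only. Splitting on whitespace is a simplifying assumption, and the names `vocabulary`, `index_of` and `to_count_vector` are illustrative.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every unique word gets one column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})
index_of = {word: i for i, word in enumerate(vocabulary)}

def to_count_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[index_of[word]] = count
    return vector

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is lost in this representation: only the counts per word survive, which is where the name "bag" comes from.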


**In this section we'll convert the raw text (sequences of characters) into vectors (sequences of numbers).**

As a first step, let us write a function that splits a text line (e.g. a file or a tweet) into its individual words and returns a list. We will also remove very common words ('the', 'a', etc.). To do this we will take advantage of a text file that contains a list of very common words (i.e. stopwords).

In addition, we will perform two more steps, text stemming and n-gram tokenisation, which are common techniques in text preprocessing.
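
To illustrate n-gram tokenisation, here is a small standard-library sketch. The helper `make_ngrams` is hypothetical; the `ngrams` function imported from NLTK at the top of this notebook yields the same kind of tuples. Stemming needs a library such as NLTK and is left out of this sketch.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "text classification is fun".split()
print(make_ngrams(tokens, 2))
# [('text', 'classification'), ('classification', 'is'), ('is', 'fun')]
```

With `n=1` this returns single-word tuples, and with an `n` larger than the number of tokens it returns an empty list.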

Let's create a function that will process the string in the **Text** column, then we can just use **apply()** in pandas to process all the text in the DataFrame.

### To Remove Stopwords 
* Here we prepare a list of stopwords
* We import a list of English stopwords from a text file
* We later remove these words from the input text
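
The three steps above can be sketched as follows. The file name `stopwords.txt`, its one-word-per-line format and both helper functions are assumptions made for illustration, so adapt them to the stopword file you actually have.

```python
import tempfile
from pathlib import Path

def load_stopwords(path):
    """Read one stopword per line, lower-cased, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def remove_stopwords(text, stopwords):
    """Split on whitespace and drop any token found in the stopword set."""
    return [word for word in text.split() if word.lower() not in stopwords]

# A tiny stand-in for the real stopword file, so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    stop_file = Path(tmp) / "stopwords.txt"
    stop_file.write_text("the\na\nis\n\nof\n", encoding="utf-8")
    stopwords = load_stopwords(stop_file)

print(remove_stopwords("The launch of the shuttle is a success", stopwords))
# ['launch', 'shuttle', 'success']
```

Keeping the stopwords in a set makes each membership check fast, which matters once the function is applied to every text in the DataFrame.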