## Get the Data

In [None]:
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import re

from nltk import ngrams
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

In [None]:
def remove_emoji(text):
    # The parameter is called `text` so that it does not shadow the imported `string` module
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also matches CJK and other non-emoji characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)



In [None]:
def remove_url(text):
    return re.sub(r'http\S+', '', text)

### Read Data

We'll use pandas' **read_csv** to load an existing dataset.


In [None]:
df = pd.read_csv(r'text-data.csv', encoding="cp1252")

In [None]:
df.head()

## Exploratory Data Analysis

In [None]:
df['Category'].value_counts()

In [None]:
df.describe()

Let's use **groupby** to run describe per label. This way we can begin to think about the features that separate the different categories.

In [None]:
df.groupby('Category').describe()

As we continue our analysis we want to start thinking about the features we are going to be using. This goes along with the general idea of [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering). The better your domain knowledge on the data, the better your ability to engineer more features from it. Feature engineering is a very large part of text classification in general. I encourage you to read up on the topic!

Let's make a new column to detect how long each text entry is!

In [None]:
# length here is the number of chars
df['length'] = df['Text'].apply(len)
df.head()
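Other simple features can be derived in the same way as the length column. The sketch below is only an illustration: the toy DataFrame and the `word_count` and `exclamations` columns are assumptions, not part of the dataset used in this notebook.

```python
import pandas as pd

# Toy data standing in for the real 'Text' column (made up for illustration)
toy = pd.DataFrame({"Text": ["Short one", "A somewhat longer example text!"]})

# Number of characters, as in the cell above
toy["length"] = toy["Text"].str.len()

# Number of whitespace-separated words
toy["word_count"] = toy["Text"].str.split().str.len()

# Number of exclamation marks
toy["exclamations"] = toy["Text"].str.count("!")

print(toy)
```

Whether such features actually help depends on the data, so it is worth plotting them per category before using them.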

In [None]:
df['Text'][321]

### Some Data Visualization

In [None]:
# Are the classes balanced?
count_target = df['Category'].value_counts()

plt.figure(figsize=(8,4))
sns.barplot(x = count_target.index, y = count_target.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Category', fontsize=12);

In [None]:
df['length'].plot(bins=5, kind='hist');

Play around with the bin size! Looks like text length may be a good feature to think about!

In [None]:
df.length.describe()

In [None]:
df.hist(column='length', by='Category', bins=20,figsize=(12,5));

## Text Pre-processing

The main issue with our data is that it is all in text format (strings). To be able to use classification algorithms we will need some sort of numerical feature vector in order to perform the classification tasks. There are actually many methods to convert a corpus to a vector format. The simplest is the [bag-of-words](http://en.wikipedia.org/wiki/Bag-of-words_model) approach, where each unique word in a text will be represented by one number.
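To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The two toy documents and the helper name `bag_of_words` are assumptions for illustration; the notebook itself relies on scikit-learn for this step.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: one integer index per unique word, in alphabetical order
unique_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: index for index, word in enumerate(unique_words)}

def bag_of_words(doc, vocab):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Word order is lost in this representation: only the counts per vocabulary word remain.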


**In this section we'll convert the raw text (sequence of characters) into vectors (sequences of numbers)**

As a first step, let us write a function that splits a text line (e.g. a file or a tweet) into its individual words and returns a list. We will also remove very common words ('the', 'a', etc.). To do this we will take advantage of a text file that contains a list of very common words (i.e. stopwords).

In addition we will perform two steps: text stemming and ngram tokenisation which are common techniques in text preprocessing.
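As a quick illustration of what stemming does, the sketch below runs NLTK's `PorterStemmer` on a few sample words. The words are chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer_demo = PorterStemmer()

# Sample words (made up for illustration)
words = ["running", "cats", "connected"]

# Each word is reduced to its stem, so inflected forms share one token
stems = [stemmer_demo.stem(word) for word in words]
print(stems)
```

Stems are not always dictionary words; the goal is only that related word forms map to the same token.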

Let's create a function that will process the string in the **Text** column, then we can just use **apply()** in pandas to process all the text in the DataFrame.

### To Remove Stopwords 
* Here we prepare a list of stopwords
* We import a list of english stopwords from a text file
* We later remove these words from the input text

In [None]:
my_stopwords = []
with open(r'stopwords_en.txt') as f:
    my_stopwords = f.read().splitlines()

In [None]:
my_stopwords

### NGram Tokenisation
Now let's write a function to "tokenise" the text lines (e.g. files). Tokenisation is the process of converting normal text strings into a list of tokens (the words or word sequences that we actually want).

* Here we apply n-gram tokenisation
* The function receives a list of words, where each element is one word of the text
* e.g. ['take', 'a', 'string', 'text']
* It also takes the n-gram size n, which defaults to 1 (single words)
* It returns the n-grams as a list of strings, e.g. with n=3: ['take a string', 'a string text']
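The behaviour described in the list above can be sketched without NLTK. The helper name `simple_ngrams` is hypothetical and is only meant to show what an n-gram tokeniser returns for different values of n.

```python
def simple_ngrams(tokens, n):
    """Join every run of n consecutive tokens into one space-separated string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['take', 'a', 'string', 'text']

unigrams = simple_ngrams(tokens, 1)   # single words
trigrams = simple_ngrams(tokens, 3)   # runs of three words
too_long = simple_ngrams(tokens, 5)   # n larger than the text gives an empty list

print(unigrams)
print(trigrams)
print(too_long)
```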

In [None]:
def ngram_vectoriser(text_list, n=1):
    ngram_feature_vector = []
    for item in ngrams(text_list, n):
        ngram_feature_vector.append(' '.join(item))
    return ngram_feature_vector

### You can also use NLTK

* from nltk.tokenize import word_tokenize

* from nltk.corpus import stopwords 

### Negation Handling

In [None]:
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def handle_negation(text):
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], text)
    return neg_handled
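Why expand contractions before removing punctuation? The sketch below uses a reduced, illustrative mapping and hypothetical helper names (`expand`, `strip_punctuation`). It compares the two orders on one sentence.

```python
import re
import string

# A reduced mapping, for illustration only
contractions = {"isn't": "is not", "don't": "do not"}
pattern = re.compile(r'\b(' + '|'.join(contractions) + r')\b')

def expand(text):
    """Replace each known contraction with its expanded form."""
    return pattern.sub(lambda match: contractions[match.group()], text)

def strip_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

sentence = "this isn't good"
without_expansion = strip_punctuation(sentence)
with_expansion = strip_punctuation(expand(sentence))

print(without_expansion)
print(with_expansion)
```

Without the expansion the negation ends up fused into a token such as `isnt`. Note that whether the expanded `not` survives later steps depends on the stopword list in use.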

## Now let's write a Function to Prepare Text
We will apply it to our DataFrame later on
### This function receives a text string and performs the following:
* Convert text to lower case
* Remove emojis
* Remove URLs
* Handle negation
* Remove punctuation marks
* Remove stop words using the list we prepared previously
* Apply stemming using the popular Snowball or Porter Stemmer
* Apply NGram Tokenisation
* Return the tokenised text as a list

In [None]:
#stemmer = SnowballStemmer("english", ignore_stopwords=True)
stemmer = PorterStemmer()


def process_text(text):
    """
    Takes in a string of text, then performs the following:
    1. Converts text to lower case
    2. Removes emojis and then removes URLs
    3. Handles Negation
    4. Removes punctuation marks
    5. Removes all stopwords
    6. Applies Stemming
    7. Applies Ngram Tokenisation
    8. Returns the tokenised text as a list
    """
    text = text.lower()
    
    text = remove_emoji(text)

    text = remove_url(text)
    
    text = handle_negation(text)

    # Check characters to see if they are in punctuation
    nopunc = [char for char in text if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    no_stop_words = [word for word in nopunc.split() if word.lower() not in my_stopwords]
    
    # apply stemming
    stemmed = [stemmer.stem(word) for word in no_stop_words]
    
    #apply ngram tokenisation
    tokenised = ngram_vectoriser(stemmed,1)
    
    return tokenised

In [None]:
process_text('Now well convert each text, represented as a list of tokens (lemmas) above, into a vector that machine learning models can understand.')

In [None]:
remove_url('@switchfoot http://twitpic.com/2y1zl - Awww')

Here is the original DataFrame again:

In [None]:
df.head()

In [None]:
# Check to make sure it's working
df['Text'].head(5).apply(process_text)

## Vectorization

Currently, each text is represented as a list of tokens. Strictly speaking these are stems rather than lemmas, since we applied stemming (see [stemming and lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) for the difference). Now we need to convert each of those lists into a vector that scikit-learn's models can work with.

We'll do that in three steps using the bag-of-words model:

1. Count how many times a word occurs in each text (known as term frequency)

2. Weigh the counts, so that tokens that are frequent across the corpus get a lower weight (inverse document frequency)

3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
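The three steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration using the unsmoothed IDF formula log(N / df); scikit-learn applies a smoothed variant, so its numbers will differ. The toy documents and variable names are assumptions.

```python
import math
from collections import Counter

# Toy corpus, already tokenised (made up for illustration)
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["dog", "barked"]]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Step 1: count how often each vocabulary word occurs in each document
counts = [[Counter(doc)[token] for token in vocab] for doc in docs]

# Step 2: weigh the counts by inverse document frequency
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]
weighted = [[count * weight for count, weight in zip(row, idf)] for row in counts]

# Step 3: scale every document vector to unit (L2) length
def l2_normalise(row):
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else row

tfidf = [l2_normalise(row) for row in weighted]
print(vocab)
print(tfidf)
```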

Let's begin the first step:

Each vector will have as many dimensions as there are unique words in the text file corpus.  We will first use SciKit Learn's **CountVectorizer**. This model will convert a collection of text documents to a matrix of token counts.

We can imagine this as a 2-dimensional matrix, where one dimension is the entire vocabulary (one column per word) and the other dimension holds the actual documents, in this case one row per text.

For example:

<table border="1">
<tr>
<th></th> <th>Word 1 Count</th> <th>Word 2 Count</th> <th>...</th> <th>Word M Count</th> 
</tr>
<tr>
<td><b>File 1</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>File 2</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>File N</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>


Since there are so many texts and words, and each text contains only a small fraction of the vocabulary, we can expect most of the counts to be zero. Because of this, scikit-learn will output a [Sparse Matrix](https://en.wikipedia.org/wiki/Sparse_matrix).

There are a lot of arguments and parameters that can be passed to the CountVectorizer. In this case we will just specify the **analyzer** to be our own previously defined function:

In [None]:
# Might take a while...
# You can save this for future use .. e.g. to apply on new text
bow_transformer = CountVectorizer(analyzer=process_text).fit(df['Text'])

# Print total number of vocab words
print(len(bow_transformer.vocabulary_))
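To see what a callable analyzer does on a small scale, here is a sketch with two toy documents. The function `toy_analyzer` is a hypothetical stand-in for `process_text` that only lower-cases and splits on whitespace.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_analyzer(text):
    """Stand-in analyzer: lower-case the text and split on whitespace."""
    return text.lower().split()

# Toy documents (made up for illustration)
toy_docs = ["Red apple", "Green apple apple"]

toy_vectorizer = CountVectorizer(analyzer=toy_analyzer).fit(toy_docs)
toy_counts = toy_vectorizer.transform(toy_docs)

# Maps each token to its column index
print(toy_vectorizer.vocabulary_)
# One row per document, one column per token
print(toy_counts.toarray())
```

When `analyzer` is a callable, scikit-learn hands it each raw document and uses the returned tokens directly, so all preprocessing has to happen inside that function.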

Let's take one text and get its bag-of-words counts as a vector, putting to use our new `bow_transformer`:

In [None]:
data4 = df['Text'][3]
print(data4)

Now let's see its vector representation:

In [None]:
bow4 = bow_transformer.transform([data4])
print(bow4)
print(bow4.shape)

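This shows the unique tokens that remain in the fourth text (index 3) after removing common stop words, together with how many times each of them appears. Each printed entry has the form `(row, column) count`, where the column is the token's index in the vocabulary.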

In [None]:
#Test
print(bow_transformer.get_feature_names_out()[65])
print(bow_transformer.get_feature_names_out()[355])

In [None]:
# get the ID of a term
bow_transformer.vocabulary_['child']

Now we can use **.transform** on our fitted bag-of-words (bow) transformer to transform the entire DataFrame of texts. Let's check out the bag-of-words counts for the entire corpus, stored as one large, sparse matrix:

In [None]:
text_bow = bow_transformer.transform(df['Text'])

In [None]:
print('Shape of Sparse Matrix: ', text_bow.shape)
print('Amount of Non-Zero occurrences: ', text_bow.nnz)

In [None]:
# Percentage of cells that are non-zero (this is the density; sparsity is 100 minus this value)
density = (100.0 * text_bow.nnz / (text_bow.shape[0] * text_bow.shape[1]))
print('density: {}'.format(density))

After the counting, the term weighting and normalization can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`.

____
### So what is TF-IDF?
TF-IDF stands for *term frequency-inverse document frequency*, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF), computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.

**TF: Term Frequency**, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization: 

*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*

**IDF: Inverse Document Frequency**, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: 

*IDF(t) = log_e(Total number of documents / Number of documents that contain term t).*

See below for a simple example.

**Example:**

Consider a document containing 100 words wherein the word **cat** appears 3 times. 

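The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word **cat** appears in one thousand of these. Using a base-10 logarithm to keep the numbers round, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm, the idf would be about 9.21 and the weight about 0.28; the base only rescales all weights by a constant factor.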
____
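The arithmetic of the **cat** example can be checked in a few lines. The variable names below are for illustration only.

```python
import math

# 'cat' appears 3 times in a 100-word document
tf = 3 / 100

# 10 million documents, 1,000 of which contain 'cat'
idf_base10 = math.log10(10_000_000 / 1_000)
idf_natural = math.log(10_000_000 / 1_000)

print(tf * idf_base10)   # weight with a base-10 logarithm
print(tf * idf_natural)  # weight with the natural logarithm
```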

<table border="1">
<tr>
<th></th> <th>Word 1 Weight</th> <th>Word 2 Weight</th> <th>...</th> <th>Word M Weight</th> 
</tr>
<tr>
<td><b>File 1</b></td><td>0.033</td><td>1.092</td><td>...</td><td>1.301</td>
</tr>
<tr>
<td><b>File 2</b></td><td>2.98</td><td>1.106</td><td>...</td><td>0.093</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>2.102</td>
</tr>
<tr>
<td><b>File N</b></td> <td>0.173</td><td>0.618</td><td>...</td><td>0.602</td>
</tr>
</table>

_____________


Let's go ahead and see how we can do this in SciKit Learn:

In [None]:
# You can save this for future use .. e.g. to apply on new text
tfidf_transformer = TfidfTransformer().fit(text_bow)

tfidf4 = tfidf_transformer.transform(bow4)
print(tfidf4)

Let's check the IDF (inverse document frequency) of the word `"child"` and of the word `"love"`:

In [None]:
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['child']])
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['love']])
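The values in `idf_` will not match the plain log(N / df) formula given earlier, because `TfidfTransformer` smooths the IDF by default (`smooth_idf=True`): it computes ln((1 + N) / (1 + df)) + 1. The sketch below checks this on a toy corpus; the documents and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (made up for illustration)
toy_docs = ["apple banana", "apple cherry", "apple banana banana"]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_tfidf = TfidfTransformer().fit(toy_counts)

# Number of documents and, per term, the number of documents containing it
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts.toarray() > 0).sum(axis=0)

# Smoothed IDF as used by TfidfTransformer with its default settings
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(toy_tfidf.idf_)
print(expected_idf)
```

A term that occurs in every document, such as `apple` here, gets an IDF of exactly 1 rather than 0 under this scheme.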

To transform the entire bag-of-words corpus into TF-IDF corpus at once:

In [None]:
text_tfidf = tfidf_transformer.transform(text_bow)
print(text_tfidf.shape)

### If you have new unknown text

You must prepare your new input by applying the same fitted transformers, otherwise the features will not match what the classifier was trained on and you'll get errors.

1- Load the two transformers if you have saved them, OR refit them from scratch (not recommended)

2- Apply them to prepare your input

3- Feed the input into your trained classifier!

In [None]:
#test_bow = bow_transformer.transform([input_text])
#test_data = tfidf_transformer.transform(test_bow)
#model.predict(test_data)
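One possible way to save and reload the two fitted transformers is Python's `pickle` module. The sketch below is an assumption about how this could be done, not the notebook's own method: it uses toy documents and the default analyzer, and it serialises to bytes in memory instead of a file.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training texts (made up for illustration)
train_docs = ["good movie", "bad movie", "good plot"]

bow = CountVectorizer().fit(train_docs)
tfidf = TfidfTransformer().fit(bow.transform(train_docs))

# Serialise both fitted transformers; writing these bytes to a file works the same way
blob = pickle.dumps({"bow": bow, "tfidf": tfidf})

# Later, or in another session: restore them and prepare new, unseen text
restored = pickle.loads(blob)
new_vector = restored["tfidf"].transform(restored["bow"].transform(["good good unseen"]))

# Same number of columns as the training vocabulary; the unknown word is ignored
print(new_vector.shape)
```

If the vectorizer was built with a custom analyzer function such as `process_text`, pickle stores only a reference to that function, so it must be defined or importable in the session that loads the transformer.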

# Now the Data is Ready for Classifier Usage

One way to estimate how well a classifier will perform on unseen data is **cross-validation**.

In n-fold cross-validation, the data is divided into n non-overlapping subsets. We repeat the following n times:
* one of the n subsets is used as the test/validation set
* the other n-1 subsets are put together to form the training set

The scores from the n trials are then averaged to estimate the overall accuracy of the model.
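The splitting scheme described above can be sketched in plain Python. The generator `k_fold_indices` is a hypothetical helper for illustration; scikit-learn's `cross_val_score` does its own splitting and, for classifiers with an integer `cv`, uses stratified folds that preserve class proportions.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so that sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 4))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in exactly one test set, and no test sample is ever part of its own training set.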

## Training a RandomForest model

#### Cross Validation

In [None]:
clf = RandomForestClassifier(verbose = 3)
scores = cross_val_score(clf, text_tfidf, df['Category'],  cv=8, verbose = 3)
#scores
print("Overall Average Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))