**Before you start**
Make sure you have completed "Tutorial 1 - Exploring the data" before starting this one as it creates a data file that will be used below.

## **Preparing the Discovery data for Machine Learning**
Having performed our initial data analysis and produced a filtered dataset to work with, we will now move on to preparing the data for machine learning.

The first step is to create a numerical representation of the data. We will start with a simple approach and then move on to a slightly more sophisticated one which is very common in data science.

Then we will look at an approach called stemming, which deals with things like the pluralisation of words (you may want to treat 'machines' and 'machine' as the same thing in your model).

At the end of this tutorial we will have 3 datasets which we will experiment with in the next one.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
data_folder = "/content/gdrive/My Drive/MLC/Session 3/Data/"

This piece of code imports the libraries that we will be using in the first part of this notebook.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import pickle
import seaborn as sns
import sklearn      # The most common Python Machine Learning library - scikit learn
from sklearn.model_selection import train_test_split  # Used to create training and test data
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score, balanced_accuracy_score,
                             precision_score, recall_score, classification_report)
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # Turn text into vectors
from operator import itemgetter
from nltk.stem.porter import PorterStemmer  # Used in the stemming section
from nltk import toktok                     # Tokenizer used alongside the stemmer


Run the following code to load the reduced dataset you created in the previous tutorial.

In [0]:
descriptions = pd.read_csv(data_folder + 'topN_taxonomy.csv',
                           delimiter="|", header=None, lineterminator='\n')
descriptions = descriptions.drop(descriptions.columns[[0]], axis=1) 
descriptions.columns = ["IAID","TAXID","Description"]
descriptions.count()

This time there are no rows with blanks.

We will also load up the table of taxonomy category names again which will be useful for understanding the various categories.

In [0]:
taxonomy = pd.read_csv(data_folder + 'taxonomyids_and_names.txt',
                           delimiter="|", header=None, lineterminator='\n')
taxonomy.columns = ["TAXID","TaxonomyCategory"]

### **Bag-of-Words**

Do you remember the Bag-of-Words technique from the first session? This is a way of identifying the top words or phrases in a corpus.

We will now go into more detail of what was happening behind the scenes. There are a number of ways of turning text into numbers. One is simply to create a table with a column for every word that appears in the corpus and then for each record, count how many times each word occurs in the text for that row.

For example, imagine we have two sentences:

*   the dog laughed
*   the dog walked the dog

Our table would have a column for each unique word ('the', 'dog', 'laughed', 'walked') and a row for each sentence. **Run the code below to see what it looks like.**



In [0]:
pd.DataFrame({'text' : ['the dog laughed','the dog walked the dog'], 'the' : [1,2], 'dog' : [1,2], 'laughed' : [1,0], 'walked' : [0,1]})
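
The table above was typed in by hand. As a minimal sketch of how the same counts could be produced from the sentences themselves, the snippet below uses only the Python standard library. The helper name `bag_of_words` is made up for this illustration and is not part of any library, and splitting on spaces is a simplification of real tokenisation.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, rows); each row holds the word counts for one sentence."""
    # Split each sentence on whitespace (a simplification of real tokenisation)
    tokenised = [s.split() for s in sentences]
    # The vocabulary is every unique word, in order of first appearance
    vocabulary = []
    for tokens in tokenised:
        for word in tokens:
            if word not in vocabulary:
                vocabulary.append(word)
    # One row per sentence, one count per vocabulary word
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(['the dog laughed', 'the dog walked the dog'])
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row lines up with the column headings in `vocabulary`, just like a row in the table.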

The following code does the same thing by creating a 'vectorizer' which will convert text into a 'vector'. (There's no need to run the code; it is just here for discussion purposes.) A vector is just a list of numbers, and in this situation is analogous to a row in a spreadsheet. The column headings are set by the vocabulary of the text we used to fit the vectorizer (which was our table of descriptions). The MAX_FEATURES variable defines how many columns there will be, that is, the size of the vocabulary (in the example above our vocabulary was 4 words).
This is just what we used in the previous tutorial to get the top 10 words.

In [0]:
MAX_FEATURES = 500
count_vectorizer = CountVectorizer(max_features = MAX_FEATURES) # A library function to turn text into a vector of length 'MAX_FEATURES'
word_counts = count_vectorizer.fit_transform(descriptions.Description) # Use data to generate the column headings
count_vocab = count_vectorizer.get_feature_names_out() # The names of the columns (on scikit-learn older than 1.0, use get_feature_names())

Why not set MAX_FEATURES large enough to use all of the words in the corpus? In fact, why don't we use a giant dictionary to create all the column headings rather than restricting it to the words in our corpus?

In answer to the first question, we can do that, but the primary reason for restricting the vocabulary is the amount of memory and computing power you have available. If you remember back to session 1, we restricted this variable for the Amazon data because the Colab session could potentially crash. Another reason is that a lot of words may only appear once in the text and so they might not be particularly useful for machine learning. Often a machine learning algorithm will perform better with fewer features to work with.

In answer to the second question, there is no point in having columns for words outside of your corpus. When you fit an algorithm it will attach some weight to each word depending on the category of the row. A word that never appears in the corpus has a zero in every row, so the algorithm has nothing to learn from and that column can never earn a useful weight. In other words it's just a waste of a column.
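
Both answers can be seen in a small pure-Python sketch. It assumes the vocabulary is chosen by keeping the most frequent words in the corpus, which is how CountVectorizer's max_features is documented to behave. The three-document corpus and the helper names `build_vocabulary` and `vectorise` are invented for illustration.

```python
from collections import Counter

def build_vocabulary(corpus, max_features):
    """Keep only the max_features most frequent words (None keeps them all)."""
    totals = Counter(word for doc in corpus for word in doc.split())
    return [word for word, _ in totals.most_common(max_features)]

def vectorise(sentence, vocabulary):
    """Count the vocabulary words in a sentence; any other word is ignored."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

# A made-up corpus, used only for this illustration
corpus = ['records of the war office',
          'the war and the design council',
          'records of the war and food']
new_sentence = 'my records are about the war'

# A small vocabulary drops words that are in the corpus, such as 'records'
small_vocab = build_vocabulary(corpus, max_features=2)
print(small_vocab, vectorise(new_sentence, small_vocab))

# The full vocabulary picks up 'records', but 'my' is still not a column
full_vocab = build_vocabulary(corpus, max_features=None)
print(full_vocab, vectorise(new_sentence, full_vocab))
```

Words such as 'my' never get a column however large the vocabulary is, because they do not occur in the corpus used to build it.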

To illustrate the points above we will experiment with building a 'vectorizer' of different sizes and passing a new sentence into it to see what happens.

Run the code below. **Are all of the words in the sentence printed out?**

**What happens if you change MAX_FEATURES to 1000? Or 2000?**

**Try changing the sentence too**

In [0]:
MAX_FEATURES = 500
test_sentence = 'my records are about design and food and the war and design during the war'

count_vectorizer = CountVectorizer(max_features = MAX_FEATURES) # A library function to turn text into a vector of length 'MAX_FEATURES'
word_counts = count_vectorizer.fit_transform(descriptions.Description) # Use data to generate the column headings
count_vocab = count_vectorizer.get_feature_names_out() # The names of the columns (on scikit-learn older than 1.0, use get_feature_names())

sentence_counts = count_vectorizer.transform([test_sentence])
nz = sentence_counts.nonzero()  # (row, column) positions of the non-zero counts
# Show each word of the sentence that is in the vocabulary, with its count
pd.DataFrame([(count_vocab[i], sentence_counts[0, i]) for i in nz[1]])


While this approach is simple to implement it has some issues. Firstly, long documents can end up with disproportionately high scores simply because they contain more words. For example, an essay about dogs could have a 20 in the 'dog' column. Is it more about dogs than the sentences we had at the beginning?

Secondly, words that appear regularly in the corpus will have high values but may not be especially meaningful. Think of how often 'record' might appear in a set of documents about archiving. It is effectively a stop word!

The most common approach for turning text into numbers is to use a score called the TF-IDF score. The TF stands for Term Frequency, which is just the same as counting words. The IDF stands for Inverse Document Frequency, and this is what makes it a better system. IDF is based on the number of documents that a word appears in: the more documents a word appears in, the lower its IDF. The final score multiplies the two together. (If you're interested in how the calculation is made, see http://www.tfidf.com/)

As a quick intuition though, if a word appears in lots of documents it will end up with a low score. If it appears several times in one document but is rare across the corpus it will have a high score.
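
For readers who want to see the arithmetic, here is a minimal pure-Python sketch of one common TF-IDF variant. It assumes the smoothed formula idf = ln((1 + N) / (1 + df)) + 1, followed by scaling each document's scores to unit length, which is what scikit-learn documents as the TfidfVectorizer default. Other tools, and the page linked above, use slightly different formulas. The toy corpus and the function name `tfidf_scores` are invented for illustration.

```python
import math
from collections import Counter

def tfidf_scores(corpus):
    """Return one {word: score} dictionary per document in the corpus."""
    tokenised = [doc.split() for doc in corpus]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each word appears in
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    # Smoothed inverse document frequency (an assumed variant, see the text above)
    idf = {word: math.log((1 + n_docs) / (1 + df)) + 1
           for word, df in doc_freq.items()}
    results = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        raw = {word: count * idf[word] for word, count in term_freq.items()}
        # Scale so that the squared scores of each document sum to 1
        length = math.sqrt(sum(score ** 2 for score in raw.values()))
        results.append({word: score / length for word, score in raw.items()})
    return results

corpus = ['the dog laughed', 'the dog walked the dog', 'the cat slept']
scores = tfidf_scores(corpus)
# Print the scores for the second sentence, highest first
for word, score in sorted(scores[1].items(), key=lambda item: -item[1]):
    print(word, round(score, 3))
```

In the second sentence 'the' and 'dog' both occur twice, but 'the' appears in every document, so it ends up with a lower score than 'dog'.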

Let's run the previous example again but this time using the TF-IDF scorer.

**Compare the scores for 'the', 'and', and 'are' to the CountVectorizer results above**

**Which is the top scoring word?**

**Why do you think 'design' has scored so low, despite appearing twice?**

**Where is food in the scoring?**

Note: You may notice a new command in the code below - pickle.dump

This is a method that Python uses for saving complex objects that can't easily be written out to a CSV file. The TfidfVectorizer counts as a complex object. We are using it here to save the fitted TF-IDF vectorizer for the next tutorial. It is also a format for the more technically minded to consider when thinking about how to archive an ML model in the final session.
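
As a minimal sketch of the save-and-reload round trip, the snippet below pickles a plain dictionary standing in for a fitted vectorizer. The file name and temporary folder are hypothetical and used only for this illustration.

```python
import os
import pickle
import tempfile

# A stand-in for a fitted vectorizer: other Python objects are pickled the same way
fitted_object = {'vocabulary': ['design', 'food', 'war'], 'max_features': 3}

# Hypothetical file location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'example.pck')

# 'wb' means write in binary mode; the with-block closes the file afterwards
with open(path, 'wb') as f:
    pickle.dump(fitted_object, f)

# Later (for example in another notebook) the object can be loaded back
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == fitted_object)
```

Opening the file in a `with` block makes sure it is closed, and so fully written, before anything else tries to read it.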

In [0]:
MAX_FEATURES = 2000
test_sentence = 'my records are about design and food and the war and design during the war'

tfidf_vectorizer = TfidfVectorizer(max_features = MAX_FEATURES)
word_tfidf = tfidf_vectorizer.fit_transform(descriptions.Description)
with open(data_folder + 'word_tfidf.pck', 'wb') as f:  # Save the fitted vectorizer for the next tutorial
    pickle.dump(tfidf_vectorizer, f)
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()  # On scikit-learn older than 1.0, use get_feature_names()

sentence_counts = tfidf_vectorizer.transform([test_sentence])
nz = sentence_counts.nonzero()  # (row, column) positions of the non-zero scores
pd.DataFrame([(tfidf_vocab[i], sentence_counts[0, i]) for i in nz[1]])


If you think back to session 1, you may remember we used this approach to find the 2 and 3 word phrases (n-grams) in the Amazon reviews. We will build the same kind of features for this dataset. We're not going to analyze the phrases in our data now (although if you want to, you can go back to Tutorial 1 and add "ngram_range=(1,3)" into the code where we printed out the top 10 words), but looking at our test sentence again is informative.

First we rebuild the TfidfVectorizer with phrases between 1 and 3 words long, then pass in the test sentence again. As you can see, only one two-word phrase ("and the") is found in the feature set. If you **change the MAX_FEATURES to 25000**, you will get some more two-word phrases but still no three-word ones. This is a feature of language: there may be lots of common words, but they are mostly used in unique combinations unless you're working with a large corpus. This suggests that when we work with phrases rather than words we need far more features, so again we have to decide what is feasible given our computational resources.

**Note: using 20000 features made the machine crash in the 3rd tutorial, which demonstrates this point**
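
To make the idea of an n-gram concrete, here is a minimal pure-Python sketch that lists the 1, 2 and 3 word phrases in the test sentence. The helper name `ngrams` is invented for this illustration, and splitting on spaces is a simplification of what the vectorizer does.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

test_sentence = 'my records are about design and food and the war and design during the war'
tokens = test_sentence.split()

for n in (1, 2, 3):
    unique_phrases = set(ngrams(tokens, n))
    print(n, 'word phrases:', len(unique_phrases), 'unique')

# The first few two-word phrases
print(ngrams(tokens, 2)[:3])
```

Even in this one sentence there are more unique phrases than unique words, which is the same effect, on a small scale, that makes the n-gram vocabulary of a whole corpus so much larger.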


In [0]:
MAX_FEATURES = 5000
test_sentence = 'my records are about design and food and the war and design during the war'

tfidf_vectorizer = TfidfVectorizer(max_features = MAX_FEATURES, ngram_range=(1,3))  # Recent scikit-learn versions require a tuple here
tfidf_vectorizer.fit_transform(descriptions.Description)
with open(data_folder + 'ngram_tfidf.pck', 'wb') as f:  # Save the fitted vectorizer for the next tutorial
    pickle.dump(tfidf_vectorizer, f)
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()  # On scikit-learn older than 1.0, use get_feature_names()

sentence_counts = tfidf_vectorizer.transform([test_sentence])
nz = sentence_counts.nonzero()  # (row, column) positions of the non-zero scores
pd.DataFrame([(tfidf_vocab[i], sentence_counts[0, i]) for i in nz[1]])

So we'll start with 5000 features for the n-gram dataset that we're about to create, and if that works ok feel free to return here and bump up the value.

**Re-run the code above with MAX_FEATURES set to 5000 before continuing**

### **Stemming**

Stemming is another technique we can use to prepare our data. It is used to standardise words by removing suffixes, such as the 's' or 'es' at the end of a plural form. It is used, for example, in Discovery's search system. There are several stemming algorithms available but we're going to choose the popular Porter stemmer from the NLTK (Natural Language Tool Kit) library.

The best way to see what it does is to try it out. Change the words in the list to try different endings, and see if you can get a feel for what it is and isn't good at.

In [0]:
word_list = ['grows','grow','growing','leave','leaves','leaf','fairly']
ps = PorterStemmer()

for w in word_list:
    print(w,"becomes",ps.stem(w))
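
To get a feel for why stemming needs a carefully designed rule set, here is a deliberately naive suffix stripper. This is not the Porter algorithm, and its output will differ from the PorterStemmer results above. The function name `naive_stem` and its suffix list are invented purely for illustration.

```python
def naive_stem(word, suffixes=('ing', 'ly', 's'), min_stem_length=3):
    """Strip the first matching suffix, provided enough of the word is left."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

word_list = ['grows', 'grow', 'growing', 'leave', 'leaves', 'leaf', 'fairly']
for w in word_list:
    print(w, 'becomes', naive_stem(w))
```

With these toy rules 'grows', 'grow' and 'growing' all collapse to 'grow', but 'leaves' and 'leaf' stay separate because no simple suffix rule links them.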

We won't delve into the results of applying this to our corpus but we are going to create a third dataset so that we can see what (if any) difference stemming makes to the results. One observation is that stemming reduces the vocabulary size (because variations of words become one) so we should need fewer features. We will start with 4000 (this is the n-gram version) and work from there.

This bit of code is a little different to the earlier ones, as we have to create a small class which does the tokenising and stemming and then pass it to the TfidfVectorizer. This is because sklearn, the library we're using to do the vectorising, is a machine learning library rather than a Natural Language Processing (NLP) one, so it doesn't do stemming. To do the stemming we are using the NLTK library. (**Note:** this one might take a little while to run as the stemmer isn't super fast.)

**Compare the numbers output by this function to the ones in the other two versions**

**Why do you think they are different?**

In [0]:
MAX_FEATURES = 4000
test_sentence = 'my records are about design and food and the war and design during the war'

class PorterTokenizer:
    def __init__(self):
        self.porter = PorterStemmer()
    def __call__(self, doc):
        ttt = toktok.ToktokTokenizer()
        return [self.porter.stem(t) for t in ttt.tokenize(doc)]


tfidf_vectorizer = TfidfVectorizer(max_features = MAX_FEATURES, ngram_range=(1,3), tokenizer=PorterTokenizer())  # Recent scikit-learn versions require a tuple for ngram_range
tfidf_vectorizer.fit_transform(descriptions.Description)
with open(data_folder + 'stemmed_ngram_tfidf.pck', 'wb') as f:  # Save the fitted vectorizer for the next tutorial
    pickle.dump(tfidf_vectorizer, f)
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()  # On scikit-learn older than 1.0, use get_feature_names()

sentence_counts = tfidf_vectorizer.transform([test_sentence])
nz = sentence_counts.nonzero()  # (row, column) positions of the non-zero scores
pd.DataFrame([(tfidf_vocab[i], sentence_counts[0, i]) for i in nz[1]])

That's the end of this tutorial, in which we have:

1.   Learned about TF-IDF
2.   Learned why it is better than just counting words
3.   Learned about using stemming to standardise words
4.   Created 3 datasets for use in the next tutorial

We're now ready to start Machine Learning with Discovery data!



