## **Discovery Taxonomy prediction from descriptions**
For this notebook we will use data from Discovery to further explore some of the ideas from the previous session. In the advanced search of Discovery, it is possible to filter your search according to a taxonomy which is maintained by the cataloguing team. There are 136 categories in the dataset with names such as "Trade and commerce", "Population", "International", "Hunting", and "UFOs". Each record in the catalogue is categorised according to a set of rules which are applied to the description of the record in Discovery. For example, if a description includes the following words or phrases it will be classified as "Taxation":  "taxes", "taxation", "first fruits and tenths", "Domesday", "customs revenue", "scutages".

Some of these rules can be long and complex, and it is easy to find examples where the ambiguities of language mean they categorise documents incorrectly. We're not going to solve these problems in this tutorial but it is an opportunity to explore a machine learning approach to the problem and discuss the advantages and disadvantages of Rules vs. ML.
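
To make the rules approach concrete, here is a minimal sketch of a keyword rule built from the "Taxation" phrases quoted above. It is an illustration only: the real Discovery rules are not shown in this notebook and are likely more complex, and the function name `matches_taxation` and the sample descriptions are invented for this example.

```python
import re

# Phrases quoted in the text above for the "Taxation" category.
TAXATION_PHRASES = [
    "taxes", "taxation", "first fruits and tenths",
    "Domesday", "customs revenue", "scutages",
]

def matches_taxation(description):
    """Return True if any listed phrase occurs as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(phrase) + r"\b", description, flags=re.IGNORECASE)
        for phrase in TAXATION_PHRASES
    )

# Invented descriptions, for illustration only (not real catalogue records).
samples = [
    "Accounts of customs revenue collected at the port",
    "Poster advertising a Domesday exhibition",
    "Plan of a railway station",
]
for text in samples:
    print(matches_taxation(text), "-", text)
```

The second sample matches the rule even though a poster for an exhibition is arguably not about taxation, which is the kind of ambiguity described above.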

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
data_folder = "/content/gdrive/My Drive/MLC/Session 3/Data/"

['taxonomyids_and_names.txt', 'taxonomy_descriptions.txt']

This piece of code imports the libraries that we will be using in the first part of this notebook.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sklearn      # The most common Python Machine Learning library - scikit-learn
from sklearn.model_selection import train_test_split  # Used to create training and test data
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score, balanced_accuracy_score,
                             precision_score, recall_score, classification_report)
import seaborn as sns
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from operator import itemgetter

Here we create the dataframe (what this library, Pandas, calls a table) from the text file and load it into a variable called **descriptions**. After loading, we output a count of the non-empty values in each column. It should be obvious from the counts that this is a sample of Discovery, not the whole thing (with its 30+ million records). You will notice that some rows have a blank Description (the count for that column is lower than for the others). **What should we do with those?** Think back to session 1, when we discussed strategies for missing data.

In [0]:
descriptions = pd.read_csv(data_folder + 'taxonomy_descriptions.txt',
                           delimiter="|", header=None, lineterminator='\n')
descriptions.columns = ["IAID","TAXID","Description"]
descriptions.count()

Let's drop those rows. Without a description there are no features for the ML to work with. In reality we would want to write a description, although that is probably a lifetime's work for the whole catalogue!

In [0]:
descriptions.dropna(inplace=True)
print("Rows:", len(descriptions))
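
As a small aside, the sketch below shows on a toy table why `count()` reveals missing values, and how `dropna` can be limited to one column. The table contents are invented. Note that `dropna()` with no arguments, as used above, drops a row if any column is missing, whereas `subset=["Description"]` only looks at the Description column.

```python
import pandas as pd

# Toy stand-in for the descriptions table; the values are invented.
toy = pd.DataFrame({
    "IAID": ["A1", "A2", "A3", "A4"],
    "TAXID": ["C1", "C1", "C2", "C2"],
    "Description": ["Tax roll", None, "Ship's log", None],
})

print(toy.count())        # non-missing values per column
print(toy.isna().sum())   # missing values per column

# Drop only the rows where the Description itself is missing.
cleaned = toy.dropna(subset=["Description"])
print("Rows:", len(cleaned))
```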

### **Understanding our data**

We will start by looking at the first few rows. If you want to see more rows you can change the number inside of the brackets after the 'head' function. You can also adjust the 'max_colwidth' setting to see more of the Descriptions if you like.

In [0]:
pd.set_option('display.max_colwidth', 130) # increment 5 or 10 at a time if you want to see wider descriptions.
print("Number of rows:", len(descriptions))
descriptions.head(5)

Next we will load a table of taxonomy category names which will be useful for understanding the various categories.

In [0]:
taxonomy = pd.read_csv(data_folder + 'taxonomyids_and_names.txt',
                           delimiter="|", header=None, lineterminator='\n')
taxonomy.columns = ["TAXID","TaxonomyCategory"]
print("Number of rows:",len(taxonomy))
taxonomy.head(10)

With the list above to hand, you can view sample records for each category. Just change the value of the TAXID variable below to one from the list.

In [0]:
TAXID = 'C10004'  # <- Change this value
descriptions[descriptions.TAXID == TAXID].head(5)

The previous list was just from a lookup table. We can join this lookup to our Discovery data to get counts by category, labelled with the category names. The following code produces a summary table showing counts per category. **Are you surprised by the top category?**

Change the value of N if you wish to see more rows.



In [0]:
N = 10
topN = (pd.merge(descriptions, taxonomy, how='inner')
          .groupby(['TAXID', 'TaxonomyCategory']).size()
          .reset_index(name='Count')
          .sort_values('Count', ascending=False).head(N))
topN
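
If the chain of merge, group and sort is hard to follow, the sketch below runs the same steps on two tiny invented tables (`toy_descriptions` and `toy_taxonomy` are made-up names and data). It also shows one consequence of the inner join: records whose TAXID is missing from the lookup are silently left out of the counts.

```python
import pandas as pd

# Invented data standing in for the descriptions and taxonomy tables.
toy_descriptions = pd.DataFrame({
    "IAID": ["A1", "A2", "A3", "A4", "A5"],
    "TAXID": ["C1", "C2", "C2", "C2", "C9"],
    "Description": ["a", "b", "c", "d", "e"],
})
toy_taxonomy = pd.DataFrame({
    "TAXID": ["C1", "C2"],
    "TaxonomyCategory": ["Category one", "Category two"],
})

# With no 'on' argument, merge joins on the column the tables share (TAXID).
merged = pd.merge(toy_descriptions, toy_taxonomy, how="inner")

counts = (merged.groupby(["TAXID", "TaxonomyCategory"]).size()
                .reset_index(name="Count")
                .sort_values("Count", ascending=False))
print(counts)

# Records with a TAXID that is not in the lookup are dropped by the inner join.
unmatched = toy_descriptions[~toy_descriptions["TAXID"].isin(toy_taxonomy["TAXID"])]
print("Unmatched rows:", len(unmatched))
```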

It is probably worth looking at some sample rows for 'Food and drink' to check if they look reasonable...

In [0]:
TAXID = 'C10039'  # <- Change this value
descriptions[descriptions.TAXID == TAXID].head(10)

Something odd has happened there. We could ask for the top 20, or 50, or 100 rows but there are over 23,000 of them. A better way to summarise the records might be to count words. So let's do that for the food and drink category (you can also change the TAXID value to look at other categories if you want to, as well as varying the MAX_FEATURES variable to see more top words). 

In [0]:
TAXID = 'C10039'
MAX_FEATURES = 10

count_vectorizer = CountVectorizer(max_features = MAX_FEATURES)
word_counts = count_vectorizer.fit_transform(descriptions[descriptions.TAXID == TAXID].Description)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs version 1.0 or later
ft_names = count_vectorizer.get_feature_names_out()

# Pair each word with its total count across the category, then sort with the highest count first
counts = list(zip(ft_names, np.asarray(word_counts.sum(axis=0))[0]))
counts.sort(key=itemgetter(1), reverse=True)
counts
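
To see roughly what the vectorizer is doing, here is a plain Python approximation using `collections.Counter`. It mimics CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters. The helper name `top_words` and the sample sentences are invented, and this is a sketch for understanding rather than a replacement for the scikit-learn code above.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def top_words(texts, max_features=10):
    """Count lowercased tokens across all texts and return the most frequent ones."""
    counter = Counter()
    for text in texts:
        counter.update(TOKEN_PATTERN.findall(text.lower()))
    return counter.most_common(max_features)

# Invented sentences, for illustration only.
docs = [
    "Report on the supply of bread",
    "Bread and beer accounts",
    "Beer duty: a report",
]
print(top_words(docs, max_features=3))
```

Common words such as "the" and "of" are counted like any other. Whether to filter them out is a separate decision, which is why the top words for a category can be dominated by very general vocabulary.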

This suggests something is wrong with our data. It was sourced from a test system and so there may have been an issue when it was loaded, or somewhere else in the pipeline. For this exercise we will leave it out of our ML process, but hopefully it provides a good example of why digging into the data is so important.

**Check the word counts of some of the other categories in the top 10 to reassure yourself that they're ok**

**If there are any you don't like the look of, add them to the EXCLUDE list below.** To add to the EXCLUDE list, separate the values with commas and put each one in quotes, e.g. ['ABC','DEF','GHI']. We are going to generate the dataset for the remaining tutorials from the list produced below.

In [0]:
EXCLUDE = ['C10039']
N = 10
topN = (pd.merge(descriptions[~descriptions['TAXID'].isin(EXCLUDE)], taxonomy, how='inner')
          .groupby(['TAXID', 'TaxonomyCategory']).size()
          .reset_index(name='Count')
          .sort_values('Count', ascending=False).head(N))
topN

To end this tutorial, we will write a new dataset using only the top 10 categories. This will be used in the next tutorial. You're welcome to use more categories if you wish, but do think of what a confusion matrix with 135 columns and rows will look like before you go for too many!

In [0]:
descriptions[descriptions.TAXID.isin(topN.TAXID)].to_csv(data_folder + "topN_taxonomy.csv",sep = "|")
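
Because `to_csv` is called without `index=False`, the file also contains the dataframe's index as an unnamed first column. The sketch below shows one way such a file could be read back, using an in-memory buffer and invented data. How the next tutorial actually loads the file is not shown here, so treat the `read_csv` arguments as an assumption.

```python
import io
import pandas as pd

# Invented rows; the non-contiguous index mimics what filtering leaves behind.
toy = pd.DataFrame(
    {"IAID": ["A1", "A2"], "TAXID": ["C1", "C2"], "Description": ["first", "second"]},
    index=[3, 7],
)

buffer = io.StringIO()
toy.to_csv(buffer, sep="|")   # same separator as above, written to memory instead of Drive
buffer.seek(0)

# index_col=0 tells pandas that the first column is the saved index, not data.
restored = pd.read_csv(buffer, sep="|", index_col=0)
print(restored)
```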

That's the end of this tutorial. We have:

*   Loaded the Taxonomy data
*   Produced category summaries
*   Produced word counts by category
*   Identified a large data quality issue
*   Created a new data extract for the next tutorial







Possible additions:

*   Print rules out for a category.
*   Identify key words per class which are unique to that class.