## **Discovery Taxonomy prediction from descriptions**
For this notebook we will use data from Discovery to further explore some of the ideas from the previous session. In Discovery's advanced search, you can filter results according to a taxonomy maintained by the cataloguing team. There are 136 categories in the dataset, with names such as "Trade and commerce", "Population", "International", "Hunting", and "UFOs". Each record in the catalogue is categorised by a set of rules applied to its description in Discovery. For example, a description that includes any of the following words or phrases is classified as "Taxation": "taxes", "taxation", "first fruits and tenths", "Domesday", "customs revenue", "scutages".

Some of these rules can be long and complex, and it is easy to find examples where the ambiguities of language mean they categorise documents incorrectly. We're not going to solve these problems in this tutorial but it is an opportunity to explore a machine learning approach to the problem and discuss the advantages and disadvantages of Rules vs. ML.
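
To make the idea of a keyword rule concrete, here is a minimal sketch using only the "Taxation" phrases quoted above. It is not the cataloguing team's actual rule: the function name, the case-insensitive whole-word matching and the example descriptions are all assumptions made for illustration.

```python
import re

# Phrases taken from the "Taxation" example above. The real Discovery rules
# are longer and more complex, so this is only an illustration.
TAXATION_PHRASES = ["taxes", "taxation", "first fruits and tenths",
                    "Domesday", "customs revenue", "scutages"]

# Assumption: match any phrase as a whole word or phrase, ignoring case.
TAXATION_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in TAXATION_PHRASES) + r")\b",
    re.IGNORECASE,
)

def matches_taxation_rule(description):
    """Return True if the description contains any of the Taxation phrases."""
    return TAXATION_PATTERN.search(description) is not None

# Hypothetical descriptions, written for this example only.
examples = [
    "Accounts of customs revenue collected at the port of Hull",
    "Correspondence about the rebinding of the Domesday Book",
    "Photograph of a public library",
]
for text in examples:
    print(matches_taxation_rule(text), "-", text)
```

The second example shows the ambiguity problem: a record about rebinding a volume matches the rule simply because it mentions "Domesday", even though its subject is arguably conservation rather than taxation.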

In [9]:
import sys
import os

In [13]:
os.listdir('MachineLearningClub')

['.git',
 'Datasets.md',
 'Session 2',
 'Session 1',
 'LICENSE',
 'Session 3',
 'README.md',
 'Session 0']

In [2]:
# Where do you want to get the data from: Google Drive or Github?
# Set to "Github" or "Drive". If "Drive", first copy the data into a folder called "Data" inside your "My Drive" folder.
data_source = "Github"

In [15]:
if data_source == "Github":
    !git clone https://github.com/nationalarchives/MachineLearningClub.git
    sys.path.insert(0, 'MachineLearningClub')
    data_folder = "MachineLearningClub/Session 3/Data/"
    os.listdir(data_folder)
else:
    # Connect to gdrive
    from google.colab import drive
    drive.mount('/content/gdrive')
    data_folder = "/content/gdrive/My Drive/Data/"
    os.listdir(data_folder)

Cloning into 'MachineLearningClub'...
remote: Enumerating objects: 60, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 60 (delta 13), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (60/60), done.


This piece of code imports the libraries that we will be using in the first part of this notebook.

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn      # The most common Python Machine Learning library - scikit-learn
from operator import itemgetter
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split  # Used to create training and test data
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier

Here we create the dataframe (the name the Pandas library uses for a table) from the text file and load it into a variable called **descriptions**. After loading, we output a count of the non-empty values in each column. It should be obvious from the counts that this is a sample of Discovery, not the whole thing (with its 30+ million records). You will notice there are some rows with blanks in the Description column (its count is lower than that of the other columns). **What should we do with those?** Think back to session 1 when we discussed strategies for missing data.

In [17]:
descriptions = pd.read_csv(data_folder + 'taxonomy_descriptions.txt',
                           delimiter="|", header=None, lineterminator='\n')
descriptions.columns = ["IAID","TAXID","Description"]
descriptions.count()

IAID           145037
TAXID          145037
Description    145033
dtype: int64
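
Before deciding, it can help to look at exactly which rows are affected. The sketch below uses a small made-up table (the IAIDs and text are invented, not taken from Discovery) to show two standard Pandas calls: `isna().sum()` to count blanks per column, and a boolean filter to list the rows themselves.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the descriptions table; the IAIDs and text are made up.
toy = pd.DataFrame({
    "IAID": ["A1", "A2", "A3", "A4"],
    "TAXID": ["C10092", "C10005", "C10092", "C10072"],
    "Description": ["Statements of Service", np.nan, "Muster roll", None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The rows with no description, so we can see which categories they affect.
missing_rows = toy[toy["Description"].isna()]
print(missing_rows)
```

If the blanks were concentrated in one category, dropping them could skew that category's share of the data, so it is worth a quick look before removing anything.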

Let's drop those rows. Without a description there are no features for the ML to work with. In reality we would want to write a description, although that is probably a lifetime's work for the whole catalogue!

In [18]:
descriptions.dropna(inplace=True)
print("Rows:", len(descriptions))

Rows: 145033


### **Understanding our data**

We will start by looking at the first few rows. If you want to see more rows you can change the number inside of the brackets after the 'head' function. You can also adjust the 'max_colwidth' setting to see more of the Descriptions if you like.

In [19]:
pd.set_option('display.max_colwidth', 130) # increment 5 or 10 at a time if you want to see wider descriptions.
print("Number of rows:", len(descriptions))
descriptions.head(5)

Number of rows: 145033


| | IAID | TAXID | Description |
|---|---|---|---|
| 0 | C13122444 | C10005 | Registered design number: 381468. Proprietor: The Strines Printing Company. Address: 19 George Street, Manchester, Lancashi... |
| 1 | C13089234 | C10005 | Registered design number: 196086. Proprietor: James Black and Company. Address: 23 Royal Exchange Square, Glasgow, Scotland... |
| 2 | C13195333 | C10092 | Statements of Service, Royal Artillery 10 Battalion Numbers 1 to 477. This entry appears on opening 381; this number is impr... |
| 3 | C13175356 | C10014 | Registered design number: 275937. Proprietor: Benham and Froud. Address: Chandos Metal Works, 40, 41 and 42 Chandos Street,... |
| 4 | C12894455 | C10005 | Registered design number: 47236. Proprietor: R Dalglish, Falconer and Company. Address: Lennox Mill, Lennox Town, North Bri... |


Next we will load a table of taxonomy category names which will be useful for understanding the various categories.

In [20]:
taxonomy = pd.read_csv(data_folder + 'taxonomyids_and_names.txt',
                           delimiter="|", header=None, lineterminator='\n')
taxonomy.columns = ["TAXID","TaxonomyCategory"]
print("Number of rows:",len(taxonomy))
taxonomy.head(10)

Number of rows: 136


| | TAXID | TaxonomyCategory |
|---|---|---|
| 0 | C10004 | Archives and libraries |
| 1 | C10131 | Europe and Russia |
| 2 | C10134 | Local Government |
| 3 | C10114 | Weapons |
| 4 | C10021 | Construction industries |
| 5 | C10109 | Travel and tourism |
| 6 | C10105 | Sports |
| 7 | C10110 | Transportation |
| 8 | C10103 | Shipping |
| 9 | C10092 | Army |


With the list above to hand, you can view sample records for each category. Just change the value of the TAXID variable below with one from the list.

In [21]:
TAXID = 'C10004'  # <- Change this value
descriptions[descriptions.TAXID == TAXID].head(5)

| | IAID | TAXID | Description |
|---|---|---|---|
| 27242 | C7052 | C10004 | Scope and ContentAgendas, minutes and supporting papers of the British Library Organising Committee.The papers include feasibi... |
| 39556 | C6811365 | C10004 | Historical Manuscripts Commission: Senior Editor |
| 40680 | C10718872 | C10004 | Copy letter W E Hayward to E Atkinson, Archivist British Transport Historical Records, 10 December 1963, introduces Michael Wi... |
| 44575 | C12765640 | C10004 | 'Photograph of public library and Free South Church, Aberdeen, 10,828'. Copyright owner of work: G W Wilson & Company Limite... |
| 52247 | C16759363 | C10004 | Public Record Office: general correspondence; PRO 1 |


The previous list was just from a lookup table. We can join this lookup to our Discovery data to get counts by category, with descriptions. The following code produces a summary table showing counts per category. **Are you surprised by the top category?**

Change the value of N if you wish to see more rows.



In [22]:
N = 10
category_counts = pd.merge(descriptions, taxonomy, how='inner').groupby(['TAXID', 'TaxonomyCategory']).size()
topN = category_counts.reset_index(name='Count').sort_values('Count', ascending=False).head(N)
topN

| | TAXID | TaxonomyCategory | Count |
|---|---|---|---|
| 37 | C10039 | Food and drink | 23243 |
| 89 | C10092 | Army | 17113 |
| 4 | C10005 | Art, architecture and design | 14114 |
| 57 | C10060 | Medals | 12049 |
| 69 | C10072 | Navy | 10691 |
| 1 | C10002 | Air Force | 6682 |
| 55 | C10058 | Maps and plans | 5805 |
| 128 | C10131 | Europe and Russia | 4790 |
| 124 | C10127 | Indian Subcontinent | 3275 |
| 103 | C10106 | Taxation | 3144 |
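
One thing to keep in mind with an inner join is that any record whose TAXID is missing from the lookup table is silently dropped from the counts. The sketch below, on made-up toy tables, shows one way to check for that using a left join with Pandas' `indicator=True` option. The ID `C99999` is invented for the example.

```python
import pandas as pd

# Toy tables; the IAIDs and the unmatched ID are made up for illustration.
toy_descriptions = pd.DataFrame({
    "IAID": ["A1", "A2", "A3"],
    "TAXID": ["C10092", "C10005", "C99999"],   # C99999 has no name in the lookup
})
toy_taxonomy = pd.DataFrame({
    "TAXID": ["C10092", "C10005"],
    "TaxonomyCategory": ["Army", "Art, architecture and design"],
})

# A left join keeps every description; indicator=True adds a "_merge" column
# saying whether a matching lookup row was found.
checked = pd.merge(toy_descriptions, toy_taxonomy, on="TAXID",
                   how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["IAID", "TAXID"]])
```

If `unmatched` is empty for the real data, the inner join has not lost any records.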


It is probably worth looking at some sample rows for 'Food and drink' to check if they look reasonable...

In [23]:
TAXID = 'C10039'  # <- Change this value
descriptions[descriptions.TAXID == TAXID].head(10)

| | IAID | TAXID | Description |
|---|---|---|---|
| 26 | C16155353 | C10039 | Certificate Number: R1/120368. Date of Certificate: 30 July 1969. |
| 35 | C14187904 | C10039 | Robert Cutler, Purser of the Royal Oak. He delivered in July the remains of provisions on the Royal Oak but was refused a rec... |
| 97 | C15342276 | C10039 | Certificate Number: R1/95873. Date of Certificate: 18 January 1968. |
| 98 | C15308967 | C10039 | Certificate Number: R1/91873. Date of Certificate: 31 October 1967. |
| 104 | C16155858 | C10039 | Certificate Number: R1/120873. Date of Certificate: 22 August 1969. |
| 107 | C15732652 | C10039 | Certificate Number: R1/116373. Date of Certificate: 23 April 1969. |
| 117 | C16120481 | C10039 | Certificate Number: R1/86373. Date of Certificate: 27 July 1967. |
| 118 | C16082377 | C10039 | Certificate Number: R1/75873. Date of Certificate: 17 October 1966. |
| 119 | C15970908 | C10039 | Certificate Number: R1/36373. Date of Certificate: 29 May 1961. |
| 134 | C15096807 | C10039 | Certificate Number: R1/32373. Date of Certificate: 30 August 1960. |


Something odd has happened there: most of these rows are just a certificate number and date, with no obvious link to food and drink. We could ask for the top 20, or 50, or 100 rows but there are over 23,000 of them. A better way to summarise the records might be to count words. So let's do that for the food and drink category (you can also change the TAXID value to look at other categories if you want to, as well as varying the MAX_FEATURES variable to see more top words).

In [24]:
TAXID = 'C10039'
MAX_FEATURES = 10

# Keep only the MAX_FEATURES most frequent words in this category
count_vectorizer = CountVectorizer(max_features=MAX_FEATURES)
word_counts = count_vectorizer.fit_transform(descriptions[descriptions.TAXID == TAXID].Description)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
# (on versions older than 1.0, switch back to get_feature_names())
ft_names = count_vectorizer.get_feature_names_out()
totals = np.asarray(word_counts.sum(axis=0))[0]   # total occurrences of each word
counts = [(word, int(total)) for word, total in zip(ft_names, totals)]
counts.sort(key=itemgetter(1), reverse=True)
counts

[('certificate', 44219),
 ('of', 23406),
 ('date', 22432),
 ('number', 22307),
 ('r1', 5921),
 ('july', 2125),
 ('october', 2026),
 ('august', 2008),
 ('r3', 1970),
 ('november', 1952)]

This suggests something is wrong with our data. It was sourced from a test system and so there may have been an issue when it was loaded, or somewhere else in the pipeline. For this exercise we will leave this category out of our ML process, but hopefully it provides a good example of why digging into the data is so important.
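
A complementary check is to measure what share of each category's descriptions follow the "Certificate Number:" template. The sketch below does this on a tiny toy table whose rows are copied (and in one case shortened) from the samples shown earlier. The choice of `str.startswith` as the test is an assumption and may miss variants of the template.

```python
import pandas as pd

# Toy table: descriptions copied or shortened from the samples above.
toy = pd.DataFrame({
    "TAXID": ["C10039", "C10039", "C10039", "C10092"],
    "Description": [
        "Certificate Number: R1/120368. Date of Certificate: 30 July 1969.",
        "Certificate Number: R1/95873. Date of Certificate: 18 January 1968.",
        "Robert Cutler, Purser of the Royal Oak.",
        "Statements of Service, Royal Artillery",
    ],
})

# True where the description follows the certificate template.
is_certificate = toy["Description"].str.startswith("Certificate Number:")

# The mean of a True/False column is the share of True values per category.
share_by_category = is_certificate.groupby(toy["TAXID"]).mean()
print(share_by_category)
```

Run against the full `descriptions` table, a share close to 1 for a category would point to the same kind of loading problem seen here.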

**Check the word counts of some of the other categories in the top 10 to reassure yourself that they're ok**

**If there are any you don't like the look of, add them to the EXCLUDE list below.** (to add to the EXCLUDE list you will need to separate values with a comma and remember to put them in quotes e.g. ['ABC','DEF','GHI']). We are going to generate the dataset for the remaining tutorials from the list produced below.

In [25]:
EXCLUDE = ['C10039']
N = 10
kept = descriptions[~descriptions['TAXID'].isin(EXCLUDE)]   # drop the excluded categories
category_counts = pd.merge(kept, taxonomy, how='inner').groupby(['TAXID', 'TaxonomyCategory']).size()
topN = category_counts.reset_index(name='Count').sort_values('Count', ascending=False).head(N)
topN

| | TAXID | TaxonomyCategory | Count |
|---|---|---|---|
| 88 | C10092 | Army | 17113 |
| 4 | C10005 | Art, architecture and design | 14114 |
| 56 | C10060 | Medals | 12049 |
| 68 | C10072 | Navy | 10691 |
| 1 | C10002 | Air Force | 6682 |
| 54 | C10058 | Maps and plans | 5805 |
| 127 | C10131 | Europe and Russia | 4790 |
| 123 | C10127 | Indian Subcontinent | 3275 |
| 102 | C10106 | Taxation | 3144 |
| 104 | C10108 | Trade and commerce | 2891 |


To end this tutorial, we will write a new dataset using only the top 10 categories. This will be used in the next tutorial. You're welcome to use more categories if you wish, but do think of what a confusion matrix with 135 columns and rows will look like before you go for too many!

In [26]:
descriptions[descriptions.TAXID.isin(topN.TAXID)].to_csv(data_folder + "topN_taxonomy.csv",sep = "|")
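
Because `to_csv` also writes the dataframe's index as an unnamed first column, whoever reads the file back needs to use the same separator and tell Pandas that the first column is the index. The sketch below shows a round trip on a toy table written to a temporary folder. The rows and the temporary path are made up for the example.

```python
import os
import tempfile
import pandas as pd

# Toy table with a non-consecutive index, like the filtered descriptions.
toy = pd.DataFrame(
    {"IAID": ["A1", "A2"],
     "TAXID": ["C10092", "C10005"],
     "Description": ["Statements of Service", "Registered design number: 1"]},
    index=[2, 7],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "topN_taxonomy.csv")
    toy.to_csv(path, sep="|")                           # same call as above
    restored = pd.read_csv(path, sep="|", index_col=0)  # first column is the index

print(restored)
```

Without `index_col=0` the index would come back as an extra column called "Unnamed: 0".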

That's the end of this tutorial. We have:

*   Loaded the Taxonomy data
*   Produced category summaries
*   Produced word counts by category
*   Identified a large data quality issue
*   Created a new data extract for the next tutorial

Possible additions:

*   Print out the rules for a category.
*   Identify key words per class which are unique to that class.
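
As a starting point for the second idea, here is a rough sketch of finding words that occur in only one class. It uses plain Python sets on a made-up two-class example. The tokenising regex is an assumption chosen to mimic CountVectorizer's default (lowercased words of two or more characters), and the class names and texts are toy data.

```python
import re

# Toy documents per class, loosely based on phrases seen in this notebook.
toy_docs = {
    "Army": ["Statements of Service, Royal Artillery", "Royal Artillery muster roll"],
    "Navy": ["Purser of the Royal Oak", "Ship muster roll"],
}

def vocabulary(texts):
    """Set of lowercased words (two or more characters) found in the texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"\b\w\w+\b", text.lower()))
    return words

vocab_by_class = {name: vocabulary(texts) for name, texts in toy_docs.items()}

# A word is unique to a class if no other class's vocabulary contains it.
unique_words = {}
for name, vocab in vocab_by_class.items():
    others = set().union(*(v for n, v in vocab_by_class.items() if n != name))
    unique_words[name] = sorted(vocab - others)

print(unique_words)
```

On real data this would surface many rare words and typos, so a minimum count per word would probably be needed before the results are useful.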