Long story short, in this tutorial we're going to use ML to try to predict whether a given constitution belongs to a former UK colony. The idea is loosely inspired by the paper ["Constitutional Islamization and Human Rights: The Surprising Origin and Spread of Islamic Supremacy in Constitutions"](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2438983) (Ahmed and Ginsburg 2014). For example, that paper codes countries as 1=former UK colony and 0=otherwise, as we'll do here.

The data on colonization is from http://www.cepii.fr/PDF_PUB/wp/2011/wp2011-25.pdf, and the corpus of constitutions is from https://www.poltext.org/en/constitutional-texts. Let's get to it!

# Step 1: Data Cleaning

I'm basically starting from scratch here, just so that it's completely transparent (I'm not sweeping anything under the rug)

In [166]:
# Using os to get list of files in a directory
import os
# Using pandas for its DataFrame structure, which lets us maintain a "spreadsheet" of the data
import pandas as pd

Get a list of all files in the "constitutions" directory

In [167]:
const_files = sorted(os.listdir('constitutions'))

Load the colonization dataset as a Pandas DataFrame. Note that Pandas lets you load Stata .dta files via the `pd.read_stata()` function, and figures out all the variable names, type conversions, etc.

In [168]:
full_colonial_df = pd.read_stata('geo_cepii.dta')

The `.head(n)` function lets you look at the first `n` rows of a DataFrame, with 5 rows as the default, so we'll use it throughout to examine what our DataFrames look like

In [169]:
full_colonial_df.head()

Unnamed: 0,iso2,iso3,cnum,country,pays,area,dis_int,landlocked,continent,city_en,...,lang9_2,lang9_3,lang9_4,colonizer1,colonizer2,colonizer3,colonizer4,short_colonizer1,short_colonizer2,short_colonizer3
0,AD,AND,20,Andorra,Andorre,453,8.005398,0.0,Europe,Andorra la Vella,...,,,,,,,,,,
1,AE,ARE,784,United Arab Emirates,Emirats arabes unis,83657,108.788994,0.0,Asia,Abu Dhabi,...,,,,GBR,,,,,,
2,AF,AFG,4,Afghanistan,Afghanistan,652225,303.761353,1.0,Asia,Kabul,...,Uzbek,,,,,,,GBR,,
3,AG,ATG,28,Antigua and Barbuda,Antigua-et-Barbuda,442,7.907605,0.0,America,Saint John's,...,,,,GBR,,,,,,
4,AI,AIA,660,Anguilla,Anguilla,102,3.79869,0.0,America,The Valley,...,,,,GBR,,,,,,


Now, instead of the 34 variables in the full DataFrame, we'll work with a reduced DataFrame with just the three variables we need: `country` (English-language name of country), `pays` (French-language name of country, for merging), and `colonizer1` (the "primary" colonizer of the country, if any)

In [170]:
colonial_df = full_colonial_df[["country","pays","colonizer1"]]

In [171]:
colonial_df.head()

Unnamed: 0,country,pays,colonizer1
0,Andorra,Andorre,
1,United Arab Emirates,Emirats arabes unis,GBR
2,Afghanistan,Afghanistan,
3,Antigua and Barbuda,Antigua-et-Barbuda,GBR
4,Anguilla,Anguilla,GBR


# Step 2: Merging

Because we have two separate pieces of information here, the first being the colonization dataset and the second being the list of constitution .txt files, we need to figure out which row in the colonization data corresponds to which constitution .txt file, via merging.

Right now we just have a (Python) list of filenames, so we convert that to a Pandas DataFrame here as a first step towards merging with the Pandas DataFrame containing the colonization info

In [172]:
file_df = pd.DataFrame(const_files, columns=["filename"])

In [173]:
file_df.head()

Unnamed: 0,filename
0,afghanistan2004.txt
1,albanie1998-2008.txt
2,algerie1989-2008.txt
3,allemagne1949-2010.txt
4,andorre1993.txt


Now we process the raw filenames to extract just a string with the (lowercased) country names

In [174]:
# Remove the ".txt" (regex=False so that "." is treated as a literal dot)
file_df["file_country"] = file_df["filename"].str.replace(".txt", "", regex=False)
# Replace "-" with " "
file_df["file_country"] = file_df["file_country"].str.replace("-", " ", regex=False)
# Replace "_" with " "
file_df["file_country"] = file_df["file_country"].str.replace("_", " ", regex=False)
# Remove digits (the years); this pattern is a regular expression, so regex=True
file_df["file_country"] = file_df["file_country"].str.replace(r"\d", "", regex=True)
# Remove leading and trailing whitespace
file_df["file_country"] = file_df["file_country"].str.strip()

In [175]:
file_df.head()

Unnamed: 0,filename,file_country
0,afghanistan2004.txt,afghanistan
1,albanie1998-2008.txt,albanie
2,algerie1989-2008.txt,algerie
3,allemagne1949-2010.txt,allemagne
4,andorre1993.txt,andorre
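
As a quick sanity check on the cleaning steps above, here is a minimal sketch of the same logic as a plain-Python function using the standard `re` module. The function name `filename_to_country` is my own (it is not part of the notebook), and the filenames are ones that appear in the outputs of this tutorial.

```python
import re

def filename_to_country(filename):
    """Turn a constitution filename into a lowercase country string."""
    # Drop a literal ".txt" suffix (the dot is escaped so it only matches a dot)
    name = re.sub(r"\.txt$", "", filename)
    # Hyphens and underscores separate words in the filenames
    name = re.sub(r"[-_]", " ", name)
    # Remove digits (the years)
    name = re.sub(r"\d", "", name)
    # Remove whitespace left over at either end
    return name.strip()

for example in ["albanie1998-2008.txt", "emirats-arabes-unis1971-1972.txt", "botswana_1996.txt"]:
    print(example, "->", filename_to_country(example))
```

Keeping the logic in one small function like this makes it easy to test on tricky filenames before applying it to the whole column.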


So we've successfully extracted the (lowercased) country names from the filenames. Now we lowercase the country names in the *colonization* dataset (`colonial_df`), so that Pandas will be able to match them successfully

In [176]:
# .assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that Pandas
# raises if we assign directly into colonial_df (itself a slice of full_colonial_df)
colonial_df = colonial_df.assign(pays=colonial_df["pays"].str.lower())


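The syntax here is like: `merged_df = left_dataset.merge(right_dataset, left_on=<left key variable>, right_on=<right key variable>)`. By default this is an inner join, so only rows whose key appears in *both* DataFrames are kept; countries without a matching filename (and vice versa) are silently dropped.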

In [177]:
merged_df = colonial_df.merge(file_df, left_on="pays", right_on="file_country")

We get some duplicates because some countries have more than one constitution in the dataset. For the sake of this tutorial we just pick the first constitution and move on

In [178]:
# Remove duplicates
merged_df = merged_df.drop_duplicates(subset="country")

In [179]:
merged_df.head(10)

Unnamed: 0,country,pays,colonizer1,filename,file_country
0,Andorra,andorre,,andorre1993.txt,andorre
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie
4,Angola,angola,PRT,angola2010.txt,angola
5,Argentina,argentine,ESP,argentine1853-1994.txt,argentine
6,Austria,autriche,,autriche1920.txt,autriche
7,Australia,australie,GBR,australie1900-1977.txt,australie
9,Barbados,barbade,GBR,barbade1966-2007.txt,barbade
10,Bangladesh,bangladesh,GBR,bangladesh1972-2011.txt,bangladesh
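
Since unmatched rows simply vanish in a merge like the one above, it can be worth checking what failed to match. The sketch below uses Pandas' `how="outer"` and `indicator=True` options on a small made-up example; the toy DataFrames and their contents are assumptions for illustration, not the real data. It shows one plausible failure: a hyphenated name in `pays` will not match a filename whose hyphens were replaced by spaces.

```python
import pandas as pd

# Toy stand-ins for colonial_df and file_df (hypothetical values)
toy_colonial = pd.DataFrame({"pays": ["andorre", "albanie", "antigua-et-barbuda"]})
toy_files = pd.DataFrame({"file_country": ["andorre", "albanie", "antigua et barbuda"]})

# An outer join keeps every row from both sides; indicator=True adds a "_merge"
# column saying whether each row came from the left side, the right side, or both
check = toy_colonial.merge(toy_files, left_on="pays", right_on="file_country",
                           how="outer", indicator=True)

left_only = check.loc[check["_merge"] == "left_only", "pays"].tolist()
right_only = check.loc[check["_merge"] == "right_only", "file_country"].tolist()
print("In colonization data only:", left_only)
print("In filenames only:", right_only)
```

If the two lists contain near-identical names, that is a hint that the keys need a bit more cleaning before merging.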


Right now `colonizer1` is a country code. Since the machine learning algorithm requires numeric values, we make a 0/1 binary variable `uk_col` which is 1 if `colonizer1 == "GBR"` and 0 otherwise

In [180]:
merged_df["uk_col"] = (merged_df["colonizer1"] == "GBR").astype(int)

In [181]:
merged_df.head()

Unnamed: 0,country,pays,colonizer1,filename,file_country,uk_col
0,Andorra,andorre,,andorre1993.txt,andorre,0
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis,1
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan,0
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie,0
4,Angola,angola,PRT,angola2010.txt,angola,0
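
One detail of the comparison-then-`astype(int)` trick is worth seeing on its own: anything that is not exactly `"GBR"`, including missing values, ends up as 0. A minimal sketch with made-up values (the Series below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical colonizer codes, including a missing value and an empty string
toy_colonizers = pd.Series(["GBR", None, "TUR", "", "GBR"])

# The comparison gives True/False per row; astype(int) turns that into 1/0
toy_uk_col = (toy_colonizers == "GBR").astype(int)
print(toy_uk_col.tolist())
```

So "never colonized" and "colonizer unknown" are both coded 0 here, just like "colonized by someone else".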


# Step 3: Machine Learn!

The first step towards the actual ML is to load the text of each constitution. To this end, we import Python's `codecs` library which just lets us specify the encoding of a text file (which is important when your text files might have "non-standard", for example non-Latin, characters)

In [182]:
import codecs

Here we just "pull out" the `filename` column of our dataset and put it into a standard Python list, for easy looping

In [183]:
# Construct the list of texts
file_list = merged_df["filename"].tolist()

The main text-loading loop. See comments therein.

In [184]:
# text_list will be filled with the contents of each constitution
text_list = []
# This loops over all filenames in file_list,
# storing them into cur_filename for use inside the loop
for cur_filename in file_list:
    print("Loading " + cur_filename)
    # os.path.join() is just a handy function which uses the correct type of slash in a
    # file path, since Windows uses backslashes like C:\cool.txt, whereas OSX and Linux
    # use forward slashes like /home/cool.txt
    cur_filepath = os.path.join("constitutions",cur_filename)
    # Here's where we use codecs.open(). The arguments are: filename, mode ("r" means
    # "read the file"), encoding ("utf-8" is a standard Unicode text format), and
    # errors, which tells it to just skip any reading errors (for example, characters
    # it doesn't recognize) rather than crashing the code
    with codecs.open(cur_filepath, "r", "utf-8", errors="ignore") as f:
        # We replace the \n and \r which are just line breaks, so that we just get
        # the file contents as a single, unbroken string
        cur_text = f.read().replace("\n"," ").replace("\r"," ")
        text_list.append(cur_text)

Loading andorre1993.txt
Loading emirats-arabes-unis1971-1972.txt
Loading afghanistan2004.txt
Loading albanie1998-2008.txt
Loading angola2010.txt
Loading argentine1853-1994.txt
Loading autriche1920.txt
Loading australie1900-1977.txt
Loading barbade1966-2007.txt
Loading bangladesh1972-2011.txt
Loading bulgarie1991-2007.txt
Loading burundi2005.txt
Loading bolivie2009.txt
Loading bahamas1973.txt
Loading bhoutan2008.txt
Loading botswana_1996.txt
Loading belize1981-2010.txt
Loading canada1867-1982.txt
Loading suisse1999-2011.txt
Loading chili1980-2012.txt
Loading cameroun1972-1996.txt
Loading chine1982-2004.txt
Loading colombie1991-2011.txt
Loading cuba1976-2003.txt
Loading chypre1960-1996.txt
Loading allemagne1949-2010.txt
Loading djibouti1992-2010.txt
Loading danemark1953.txt
Loading dominique1978-1984.txt
Loading equateur2008.txt
Loading estonie1992-2007.txt
Loading egypte2011.txt
Loading espagne1978-2011.txt
Loading ethiopie1995.txt
Loading finlande1999-2011.txt
Loading fidji1990-1997.tx
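
To see what `errors="ignore"` actually does, here is a small self-contained sketch. It uses Python's built-in `open()` (which accepts the same `encoding` and `errors` arguments) instead of `codecs.open()`, and writes a made-up temporary file containing one byte that is not valid UTF-8. The file name and contents are assumptions for illustration.

```python
import os
import tempfile

# b"\xe9" is "é" in Latin-1, but on its own it is not valid UTF-8
raw_bytes = b"Libert\xe9 and justice\r\nfor all"

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.txt")
    with open(toy_path, "wb") as f:
        f.write(raw_bytes)
    # errors="ignore" silently drops bytes that cannot be decoded
    with open(toy_path, "r", encoding="utf-8", errors="ignore") as f:
        toy_text = f.read().replace("\n", " ").replace("\r", " ")

print(toy_text)
```

The undecodable character is dropped without any warning, which keeps the loop from crashing but can quietly damage words in files that are not really UTF-8.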

To make sure it worked, we look at the first 500 characters of the 4th constitution in the list (remember that computers count starting with zero, so `text_list[3]` gives the 4th element in `text_list`), which happens to be Albania's

In [185]:
text_list[3][:500]

'CONSTITUTION OF ALBANIA     We, the people of Albania, proud and aware of our history, with responsibility for   the future, and with faith in God and/or other universal values, with determination   to build a social and democratic state based on the rule of law, and to guarantee   the fundamental human rights and freedoms, with a spirit of religious coexistence   and tolerance, with a pledge to protect human dignity and personhood, as well as   for the prosperity of the whole nation, for peace,'

Now that we have this Python `text_list` variable, we can insert it into our Pandas DataFrame as a new column in `merged_df`

In [186]:
# Now make each text a cell within the DataFrame
merged_df["const_text"] = text_list

In [187]:
merged_df.head()

Unnamed: 0,country,pays,colonizer1,filename,file_country,uk_col,const_text
0,Andorra,andorre,,andorre1993.txt,andorre,0,Constitution of the Principality of Andorra ...
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis,1,United Arab Emirates THE PROVISIONAL CONSTITU...
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan,0,"The Constitution of Afghanistan January 3,..."
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie,0,"CONSTITUTION OF ALBANIA We, the people of ..."
4,Angola,angola,PRT,angola2010.txt,angola,0,REPUBLIC OF ANGOLA NATIONAL ASSEMBLY CON...


Now, I could do a whole tutorial on Gensim, since imo it's the best text-processing library in Python. But really we don't need to do much fancy text processing here, so I'll just use it to auto-preprocess each constitution via the `preprocess_string()` function, which does things like lowercasing, removing overly-common words like "the", stemming, and other important cleaning operations. The full list of stuff it does is [here](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string), and you should def check out Gensim more generally. My favorite list of tutorials for it is [here](https://github.com/RaRe-Technologies/gensim/blob/develop/tutorials.md).

Syntax-wise, we make a new variable in our DataFrame called `const_preproc` by applying (via the `.apply()` function) `preprocess_string` to each cell in the `const_text` column.

In [188]:
from gensim.parsing.preprocessing import preprocess_string
merged_df["const_preproc"] = merged_df["const_text"].apply(preprocess_string)

The only issue with `preprocess_string()` is that it spits out the processed string as a list of words. So to re-format it as a string, we just put a space in between each spat-out word using Python's `join()` function. Here instead of calling `apply()` with an entire separate function, we just make a "mini-function" called a *Lambda function*, which here just says take `x` (the list of words) and join each element together with a space. (In general, `lambda x: <code involving x>` just tells python "take x and perform this code on it", which is convenient in the sense that we don't have to explicitly define a whole new function for simple operations like this)

In [189]:
merged_df["const_preproc"] = merged_df["const_preproc"].apply(lambda x: " ".join(x))
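
If lambdas are new to you, the following sketch shows that the lambda used above does the same thing as an ordinary named function. The token list and both function names are made up for illustration.

```python
# A made-up list of stemmed tokens, like preprocess_string() would return
toy_tokens = ["constitut", "albania", "peopl"]

# The lambda version, as used with .apply()
join_with_lambda = lambda x: " ".join(x)

# The equivalent named function
def join_with_def(words):
    return " ".join(words)

print(join_with_lambda(toy_tokens))
print(join_with_def(toy_tokens))
```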

In [190]:
merged_df.head()

Unnamed: 0,country,pays,colonizer1,filename,file_country,uk_col,const_text,const_preproc
0,Andorra,andorre,,andorre1993.txt,andorre,0,Constitution of the Principality of Andorra ...,constitut princip andorra consel gener princip...
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis,1,United Arab Emirates THE PROVISIONAL CONSTITU...,unit arab emir provision constitut unit arab e...
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan,0,"The Constitution of Afghanistan January 3,...",constitut afghanistan januari god graciou merc...
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie,0,"CONSTITUTION OF ALBANIA We, the people of ...",constitut albania peopl albania proud awar his...
4,Angola,angola,PRT,angola2010.txt,angola,0,REPUBLIC OF ANGOLA NATIONAL ASSEMBLY CON...,republ angola nation assembl constitu assembl ...


Now as a first foray into machine learning, we're just going to construct four extremely simple "features" which we think can help the algorithm predict whether or not the constitution is that of a former UK colony. Our first feature is simply the length of the constitution (in terms of number of preprocessed words). To construct this feature, we use `.apply()` again, calling Python's `len()` function on a version of the `const_preproc` cell split into separate words via `split()`, and store the length in a new `const_len` column in our DataFrame.

In [191]:
merged_df["const_len"] = merged_df["const_preproc"].apply(lambda x: len(x.split()))

In [192]:
merged_df.head(10)

Unnamed: 0,country,pays,colonizer1,filename,file_country,uk_col,const_text,const_preproc,const_len
0,Andorra,andorre,,andorre1993.txt,andorre,0,Constitution of the Principality of Andorra ...,constitut princip andorra consel gener princip...,4492
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis,1,United Arab Emirates THE PROVISIONAL CONSTITU...,unit arab emir provision constitut unit arab e...,5329
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan,0,"The Constitution of Afghanistan January 3,...",constitut afghanistan januari god graciou merc...,5390
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie,0,"CONSTITUTION OF ALBANIA We, the people of ...",constitut albania peopl albania proud awar his...,6197
4,Angola,angola,PRT,angola2010.txt,angola,0,REPUBLIC OF ANGOLA NATIONAL ASSEMBLY CON...,republ angola nation assembl constitu assembl ...,13634
5,Argentina,argentine,ESP,argentine1853-1994.txt,argentine,0,﻿ CONSTITUTION OF THE...,constitut argentin nation preambl repres peopl...,6072
6,Austria,autriche,,autriche1920.txt,autriche,0,Erste...,erst hauptstiick chapter allgemein bestimmunge...,54203
7,Australia,australie,GBR,australie1900-1977.txt,australie,1,﻿ AUSTRALIA Commonwealth of ...,australia commonwealth australia constitut act...,7247
9,Barbados,barbade,GBR,barbade1966-2007.txt,barbade,1,The Constitution of Barbados ARRANGEIV1E...,constitut barbado arrangeiv section section ch...,13690
10,Bangladesh,bangladesh,GBR,bangladesh1972-2011.txt,bangladesh,1,"(In the name of Allah, the Beneficient, th...",allah benefici merci creator merci preambl peo...,9822


The last three features are the proportions of words in each constitution related to "freedom", "justice", and "liberty". Since the `preprocess_string()` function we called from Gensim stems every word, we search for the short prefixes "free", "just", and "liber" rather than the full words, so that stemmed variants still get counted. Note that `.str.count()` counts substring matches, so "just" is also counted inside a token like "adjust"; treat these as rough counts. Then we just take the counts and divide by `const_len` to get the proportions for each country.

In [193]:
merged_df["num_free"] = merged_df["const_preproc"].str.count("free")
merged_df["num_just"] = merged_df["const_preproc"].str.count("just")
merged_df["num_lib"] = merged_df["const_preproc"].str.count("liber")
merged_df["prop_free"] = merged_df["num_free"] / merged_df["const_len"]
merged_df["prop_just"] = merged_df["num_just"] / merged_df["const_len"]
merged_df["prop_lib"] = merged_df["num_lib"] / merged_df["const_len"]

In [194]:
merged_df.head()

Unnamed: 0,country,pays,colonizer1,filename,file_country,uk_col,const_text,const_preproc,const_len,num_free,num_just,num_lib,prop_free,prop_just,prop_lib
0,Andorra,andorre,,andorre1993.txt,andorre,0,Constitution of the Principality of Andorra ...,constitut princip andorra consel gener princip...,4492,31,31,5,0.006901,0.006901,0.001113
1,United Arab Emirates,emirats arabes unis,GBR,emirats-arabes-unis1971-1972.txt,emirats arabes unis,1,United Arab Emirates THE PROVISIONAL CONSTITU...,unit arab emir provision constitut unit arab e...,5329,15,8,7,0.002815,0.001501,0.001314
2,Afghanistan,afghanistan,,afghanistan2004.txt,afghanistan,0,"The Constitution of Afghanistan January 3,...",constitut afghanistan januari god graciou merc...,5390,16,16,5,0.002968,0.002968,0.000928
3,Albania,albanie,TUR,albanie1998-2008.txt,albanie,0,"CONSTITUTION OF ALBANIA We, the people of ...",constitut albania peopl albania proud awar his...,6197,49,20,9,0.007907,0.003227,0.001452
4,Angola,angola,PRT,angola2010.txt,angola,0,REPUBLIC OF ANGOLA NATIONAL ASSEMBLY CON...,republ angola nation assembl constitu assembl ...,13634,100,32,8,0.007335,0.002347,0.000587
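
Because string counting works on substrings, the counts above can include words we did not intend. The sketch below contrasts a substring count with a count of tokens that *start with* the prefix, on a made-up string of stemmed tokens (the string is an assumption for illustration).

```python
# Made-up preprocessed text
toy_preproc = "free freedom adjust just justic liberti deliber"

# Substring count: also matches "just" inside "adjust"
substring_count = toy_preproc.count("just")

# Prefix count: only tokens that begin with "just"
prefix_count = sum(token.startswith("just") for token in toy_preproc.split())

print(substring_count, prefix_count)
```

Whether the difference matters depends on the corpus, but the prefix version is closer to "number of words related to justice".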


Finally we can import our machine learning library, [Scikit-Learn](http://scikit-learn.org/stable/). First we import its `train_test_split()` function, which allows us to divide the data up into training and test sets in a consistent manner (by seeding it using the `random_state` argument). The training set is what the machine learning algorithm will use to try and *learn* a relationship between the features and the outcome variable (`uk_col`), and then its success in this learning will be measured by how well it can predict the outcome variable for the texts in the *test* set. Here we randomly select 80% of the observations to be training data and 20% to be test data, using the `test_size` argument.

In [195]:
# Split the data into training and test data [by splitting the indices]
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(merged_df, test_size=0.2, random_state=42)

In [196]:
print("Training size: " + str(len(train_df)))
print("Test size: " + str(len(test_df)))

Training size: 96
Test size: 24


Now that we've split our data, we can "pull out" just the relevant columns of our DataFrame, to produce the training features, training labels ("label" is just an ML term for the outcome variable we're trying to predict), test features, and test labels.

In [197]:
# The variables we want to "pull out" of the DataFrame to use as features
feature_vars = ["const_len","prop_free","prop_just","prop_lib"]
# The variable we want to "pull out" of the DataFrame to use as the outcome variable
# (which the ML algorithm will try to predict)
label_var = "uk_col"

train_features = train_df[feature_vars]
train_labels = train_df[label_var]
test_features = test_df[feature_vars]
# Since the ML algorithm will never look at the test labels anyways,
# I convert them to a Python list for easier use later on
test_labels = test_df[label_var].tolist()

Here is the moment we've all been waiting for, the actual machine learning. I'm using one of the simplest algorithms available, [Multinomial Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). (The "multinomial" describes how the features are modeled, namely as counts; it has nothing to do with the number of classes, of which we have two.) It works ridiculously fast because it makes the (usually incorrect) assumption that all features are independent of each other once you know the class. In our case, for example, it's probably not true that the number of times "freedom" appears is independent of the number of times "liberty" appears. But it turns out that Naive Bayes often does surprisingly well despite this simplifying assumption, and for large datasets it's sometimes the case that it's the only algorithm that will run in a reasonable amount of time. Keep in mind that it is designed for count-like features, so our length and proportion features are not a natural fit for it.
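
For reference, the decision rule behind multinomial Naive Bayes can be written out as follows. This is the standard textbook form with additive smoothing, given here as a sketch rather than a description of Scikit-Learn's exact internals; the symbols are my own notation.

$$
\hat{y} = \arg\max_{c \in \{0,1\}} \left[ \log P(c) + \sum_{j=1}^{V} x_j \log \theta_{cj} \right],
\qquad
\theta_{cj} = \frac{N_{cj} + \alpha}{N_c + \alpha V}
$$

Here $x_j$ is the value of feature $j$ for the document being classified, $N_{cj}$ is the total of feature $j$ over all training documents in class $c$, $N_c = \sum_j N_{cj}$, $V$ is the number of features, and $\alpha$ is the smoothing parameter (`MultinomialNB` uses `alpha=1.0` by default). The sum over $j$ is exactly where the independence assumption enters: each feature contributes separately to the score.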

In [198]:
from sklearn.naive_bayes import MultinomialNB
# I'm writing these as functions so that we can re-run the procedure with different
# sets of features, which we'll do below.
def evaluateAccuracy(predicted_labels, actual_labels):
    num_test_obs = len(actual_labels)
    num_mislabeled = (predicted_labels != actual_labels).sum()
    accuracy = 1 - (num_mislabeled/num_test_obs)
    print(str(num_mislabeled) + " mislabeled obs out of " + 
          str(num_test_obs) + " total test observations")
    print("Accuracy = " + str(accuracy))
    
def naiveBayesClassify(train_features, train_labels, test_features, test_labels):
    classifier = MultinomialNB()
    # This is where all the magic happens. Calling .fit() makes the ML algorithm
    # look at the training data to try and learn a relationship, and then calling
    # .predict() asks it to make predictions using that learned relationship, which
    # get stored into y_pred
    y_pred = classifier.fit(train_features, train_labels).predict(test_features)
    print("Predicted labels: " + str(y_pred))
    print("Actual test labels: " + str(test_labels))
    evaluateAccuracy(y_pred, test_labels)
    return classifier
    
hand_engineered = naiveBayesClassify(train_features, train_labels, test_features, test_labels)

Predicted labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual test labels: [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
6 mislabeled obs out of 24 total test observations
Accuracy = 0.75


So we get a score of 75%. Is this good? We'll find out below. For now, just keep it in mind and we'll see if we can do better. To this end, we're going to try massively expanding the feature set by using $n$-gram features. An [$n$-gram](https://en.wikipedia.org/wiki/N-gram) is just an ordered sequence of $n$ words. For example, within the phrase "machine learning is fun" we have the $1$-grams "machine", "learning", "is", "fun", the $2$-grams "machine learning", "learning is", and "is fun", and the $3$-grams "machine learning is" and "learning is fun". So, instead of trying to "guess" words that will be predictive of UK colony status, let's just throw in counts of ALL words across ALL the constitutions, and let the ML algorithm *figure out* which ones are important.
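
To make the definition concrete, here is a minimal sketch of $n$-gram extraction in plain Python, applied to the example phrase above. The helper name `ngrams` is my own.

```python
def ngrams(words, n):
    """Return all ordered runs of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "machine learning is fun".split()
for n in (1, 2, 3):
    print(n, ngrams(phrase, n))
```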

Scikit-Learn provides a super easy-to-use class called `CountVectorizer`, which just takes in a list of strings and spits out $n$-gram counts which are immediately ready to be used as input features for an ML algorithm.

In [199]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

Important note here: the `.fit_transform()` function first scans over all the strings and figures out how big the feature vector of word counts needs to be, and then scans a second time to "fill in" the vector for each string. We *don't*, however, want to call `.fit_transform()` a second time for the test data, if you think about it, because we need the training and test vectors to be the same size, but there are probably different sets of words in the training and test data. So when generating the vectors for the *test* data, we just call `.transform()`, which skips the step of figuring out the "indices" of the vector and just produces word counts for the set of words we learned from the *training* data via `fit_transform()`.

In [200]:
train_counts = count_vect.fit_transform(train_df["const_preproc"]).toarray()
test_counts = count_vect.transform(test_df["const_preproc"]).toarray()

In [201]:
print(train_counts.shape)
print(test_counts.shape)

(96, 17177)
(24, 17177)


The above numbers mean that the training data consists of 96 rows with word counts for 17,177 separate words, while the test data consists of 24 rows with word counts for the same words.
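
The difference between `.fit_transform()` and `.transform()` is easier to see on a tiny example. In the sketch below (the toy sentences are made up), the word "monarch" appears only in the test text, so it has no column in the learned vocabulary and is simply not counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["parliament shall act", "presid shall decre"]
toy_test = ["parliament shall monarch"]

toy_vect = CountVectorizer()
# Learn the vocabulary from the training texts and count words in them
toy_train_counts = toy_vect.fit_transform(toy_train).toarray()
# Re-use that same vocabulary for the test text; unseen words are ignored
toy_test_counts = toy_vect.transform(toy_test).toarray()

# vocabulary_ maps each learned word to its column index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)
print(toy_train_counts.shape, toy_test_counts.shape)
print(toy_test_counts)
```

Both matrices end up with the same number of columns, which is what the classifier needs.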

In [202]:
ngram_classifier = naiveBayesClassify(train_counts, train_labels, test_counts, test_labels)

Predicted labels: [0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual test labels: [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
3 mislabeled obs out of 24 total test observations
Accuracy = 0.875


So now with this additional information, the algorithm is able to get 87.5% of its predictions correct! Again we should ask: is this actually good? Now we'll find out.

To find out, we basically want to compare the accuracy from the ML output with the accuracy we'd get by doing "dumb" things, like randomly guessing or always guessing yes/no. If the ML algorithm gets higher accuracy than these approaches, we can say that in a sense the algorithm is actually learning something, that it's actually doing something smart.

Here I introduce the Python library `numpy` (which Pandas and Scikit-Learn have been using "under the hood" all along), whose array type (`np.ndarray`, created via `np.array()`) "works nicely" with Pandas/Scikit-Learn data, and which also has random number generation functions like `np.random.choice()`, which we use below.

In [203]:
import numpy as np
# Just "seeding" the random number generator so my results are comparable with yours.
# (Otherwise the generator is seeded unpredictably, so you get different results on every run)
np.random.seed(42)
# Baseline 1: random guessing
random_test_labels = np.random.choice([0,1], size=(len(test_df)))

In [204]:
evaluateAccuracy(random_test_labels, test_labels)

8 mislabeled obs out of 24 total test observations
Accuracy = 0.666666666667


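So now we know: the four "hand-engineered" features we made only do slightly better than this particular run of random guessing, whereas the $n$-gram features do noticeably better. (On average, 50/50 random guessing would only get about half of the labels right, so this draw was a lucky one.)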
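
A single run of random guessing is noisy, especially with only 24 test observations. The sketch below repeats the coin-flip baseline many times using Python's standard `random` module and the test labels printed earlier; the number of repetitions and the seed are arbitrary choices of mine.

```python
import random

# The actual test labels printed above
test_labels_copy = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
                    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

rng = random.Random(0)

def random_guess_accuracy(labels, rng):
    # Guess 0 or 1 with equal probability for every observation
    guesses = [rng.randint(0, 1) for _ in labels]
    correct = sum(guess == label for guess, label in zip(guesses, labels))
    return correct / len(labels)

accuracies = [random_guess_accuracy(test_labels_copy, rng) for _ in range(2000)]
mean_accuracy = sum(accuracies) / len(accuracies)
print("Mean accuracy over 2000 runs:", round(mean_accuracy, 3))
print("Best and worst single run:", max(accuracies), min(accuracies))
```

The average sits near 50%, while individual runs can land well above or below it, which is why one random draw is a shaky yardstick.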

Next we examine the accuracy of always-guess-yes and always-guess-no

In [205]:
# Make a list of 1s the same size as the test data
always_yes = np.array([1]*len(test_df))
# And same with a list of 0s
always_no = np.array([0]*len(test_df))

In [206]:
evaluateAccuracy(always_yes, test_labels)

18 mislabeled obs out of 24 total test observations
Accuracy = 0.25


In [207]:
evaluateAccuracy(always_no, test_labels)

6 mislabeled obs out of 24 total test observations
Accuracy = 0.75


Finally, we see that the hand-engineered features performed no better than a "dumb" approach, namely always guessing "no, not a UK colony". In fact, the classifier trained on them predicted 0 for every single test observation, so it *is* the always-no classifier in disguise. The run using the $n$-gram features, therefore, is the only one for which we can say that the ML algorithm is doing something smart.

In conclusion, it's important to know what your *baseline* performance is -- i.e., how well you can do with dumb approaches like random guessing or always guessing yes/no -- before drawing conclusions about how well your method is doing.
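
The always-guess-the-most-common-label baseline can be computed in a couple of lines. The sketch below does it with the standard library on the test labels printed earlier. Strictly speaking the majority label should be picked from the *training* labels; I use the test labels here only because they are the ones shown in this tutorial.

```python
from collections import Counter

# The actual test labels printed above
observed_test_labels = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
                        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# most_common(1) returns [(label, count)] for the most frequent label
majority_label, majority_count = Counter(observed_test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(observed_test_labels)

print("Majority label:", majority_label)
print("Baseline accuracy:", baseline_accuracy)
```

Any classifier worth keeping should beat this number.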

A more informative thing to report when the classes are as imbalanced as ours is actually not accuracy, but rather the [F1 score](https://en.wikipedia.org/wiki/F1_score). It is the harmonic mean of precision (the share of predicted "yes" labels that are correct) and recall (the share of actual "yes" labels that the classifier found), so a classifier that always guesses "no" cannot score well on it.
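
Here is a minimal sketch of the F1 computation in plain Python, applied to the $n$-gram predictions and test labels printed earlier. The function `f1_binary` is my own helper (Scikit-Learn offers `sklearn.metrics.f1_score` for the same purpose), and returning 0 when there are no true positives is a convention I chose.

```python
def f1_binary(predicted, actual):
    """F1 score for the positive class (label 1)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    if tp == 0:
        # No correct positive predictions: treat F1 as 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

actual_labels = [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
                 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
ngram_predictions = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_no_predictions = [0] * len(actual_labels)

ngram_f1 = f1_binary(ngram_predictions, actual_labels)
always_no_f1 = f1_binary(always_no_predictions, actual_labels)
print("n-gram F1:", ngram_f1)
print("always-no F1:", always_no_f1)
```

The $n$-gram classifier finds three of the six former colonies without any false alarms (precision 1.0, recall 0.5), whereas always-no drops from 75% accuracy to an F1 of zero.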

We could probably get even better performance by using not only 1-grams (aka words) but also 2-grams and 3-grams here. I'll leave it up to you to figure out how to expand the feature set to include these higher-order $n$-grams, but as a hint it should only require adding a single argument to the `CountVectorizer()` call...

As a last thing (which really would be a central thing for social science, but I'm trying to make this quick and just show the basic things you can do), let's look at which features were most "informative" for the (better than random/"dumb" guessing) $n$-gram classifier:

In [208]:
# Shamelessly stolen from
# https://stackoverflow.com/questions/26976362/how-to-get-most-informative-features-for-scikit-learn-classifier-for-different-c
def mostInformative(vectorizer, classifier, n=10):
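    # Row of the fitted model that belongs to class 1 (former UK colony)
    labelid = list(classifier.classes_).index(1)
    # get_feature_names_out() replaces get_feature_names() in newer scikit-learn releases
    feature_names = vectorizer.get_feature_names_out()
    # feature_log_prob_[labelid] holds log P(word | class). For a two-class model the
    # older coef_[0] attribute exposed this same class-1 row.
    topn = sorted(zip(classifier.feature_log_prob_[labelid], feature_names))[-n:]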
    topn.reverse()
    for coef, feat in topn:
        print(feat, coef)

In [209]:
mostInformative(count_vect, ngram_classifier)

shall -3.21323717541
person -3.92419641785
offic -4.04265700621
law -4.25126659155
constitut -4.30042631298
court -4.30795358523
member -4.36271328139
presid -4.36621984225
parliament -4.43770482669
act -4.6574208148


Interpretations abound! Try it on your own text corpora! The end.