<br><br><font color="gray">INTEG 440 / 640<br>MODULE 11 of *Doing Computational Social Science*</font>


# <font color="green" size=40>CLASSIFYING TEXT <br>WITH SUPERVISED LEARNING</font><br>

Dr. [John McLevey](http://www.johnmclevey.com)    
Department of Knowledge Integration   
Department of Sociology & Legal Studies     
University of Waterloo         

<hr>

* INTEG 440 (Undergraduate): This module is worth <font color='#437AB2'>**8%**</font> of your final grade. The questions in this module add up to 10 points. 
* INTEG 640 (Graduate): This module is worth <font color='#437AB2'>**5%**</font> of your final grade. The questions in this module add up to 10 points.  

<hr>

# Table of Contents 

* [Overview](#overview)
* [Learning Outcomes](#lo) 
* [Prerequisite Knowledge](#pk) 
* [Assigned Reading](#ar) 
* [Question Links](#ql)
* [Packages Used in this Module](#packs)
* [Data Used in this Module](#data)
* **[Key Concepts and Overall Process](#concepts)**
* **[A Simple (and Unrealistic) Example](#unrealistic)**
* **[A More Realistic Application](#realistic)**
* [References](#refs)

<hr>  


# Overview <a id='overview'></a>

The unsupervised methods of discovery introduced in the previous module offer the researcher very little control. For example, you can modify the parameters of a topic model to identify topics at various levels of abstraction (i.e. more or less specific), but you can't ask a topic model to estimate the prevalence of some specific topic -- say anti-immigration discourses -- in a corpus. 

By contrast, we can use supervised learning to scale up more traditional models of content analysis. This involves conceptualization, operationalization, and measurement (all covered at the start of the course) on the front end of a workflow. We then train a model to learn the complex relationships between the codes we have developed and the text. We can assess how well this model learned, and then -- if it learned well enough -- we can use it to classify text that we have not yet seen. 

As you now know, unsupervised learning is about discovery. Supervised learning for text analysis is largely about classification and confirmation. 

<hr>

# Learning Outcomes  <a id='lo'></a>

Upon successful completion of this module, you will be able to: 

1. Compare the goals and logic of supervised learning and unsupervised learning in text analysis 
2. Explain the role of conceptualization, operationalization, and measurement in supervised learning 
3. Conduct and interpret a supervised learning analysis 

<hr>

# Prerequisite Knowledge  <a id='pk'></a>

This module assumes comfort with the fundamentals of Python, and with the vectorization processes introduced in the module "Exploratory Text Analysis."  

<hr>

# Assigned Readings  <a id='ar'></a>

This module assumes you have completed the assigned readings, which are listed immediately below. The readings provide a detailed explanation of the core concepts covered in this module. 

* <font color="green">Chapter 19 "Supervised Learning and Scaling Up what Humans Do Best" from *Doing Computational Social Science*.</font> 

As always, I recommend that you (1) complete the assigned readings, (2) attempt to complete this module without consulting the readings, making notes to indicate where you are uncertain, (3) go back to the readings to fill in the gaps in your knowledge, and finally (4) attempt to complete the parts of this module that you were unable to complete the first time around.

This module notebook includes highly condensed overviews of *some* of the key material from the assigned reading. This is intended as a *supplement* to the assigned reading, *not as a replacement for it*. These high-level summaries do not contain enough information for you to successfully complete the exercises that are part of this module, and they do not cover every relevant topic.

<hr>

# Question Links <a id='ql'></a>

Make sure you have answered all of the following questions before submitting this notebook on LEARN. 

1. [Question 1](#yt1) 
2. [Question 2](#yt2) 
3. [Question 3](#yt3) 
4. [Question 4](#yt4) 
5. [Question 5](#yt5) 
6. [Question 6](#yt6) 
7. [Question 7](#yt7) 

<hr>

<a id='packs'></a>
# Packages Used in this Module 

The cell below imports the packages that are necessary to complete this module. If there are any additional packages you wish to import, you may add them to this import cell. 

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import metaknowledge as mk
import pandas as pd
import numpy as np
import random
import pickle

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Data Used in this Module <a id='data'></a>

In the second part of this module, we will use a dataset on 13 years of article publications in five peer reviewed journals: *Social Studies of Science*; *Science, Technology, & Human Values*; *Scientometrics*; *Journal of Informetrics*; and *Research Policy*. Our goal will be to classify articles based on whether they use a qualitative or a quantitative methodology. The publication metadata is stored in a `csv` file called `supervised_science.csv` in the `data` directory. We will read it into memory later in this module. 

# Key Concepts and Overall Process <a id='concepts'></a>

To properly understand supervised learning, it is *essential* that you have a firm grasp of (1) foundational concepts and (2) how those concepts relate to one another. To help ensure that is the case, take a moment to review the following concepts. Then we will walk through a small toy example before turning to a more realistic application of supervised learning.  

* **supervised learning**
  * In the context of text analysis, supervised learning is when you have some unstructured text and a set of annotations / labels. You want to learn the relationship between the language used (and not used) in the text of any given document and the label it has been assigned (often by a human). If the model you fit has high accuracy, you want to use it to predict the labels for other unseen / out of sample data that does not have a label. This effectively extends the abilities of human coders to classify text to collections that are far too large to read.
* **labels**  
  * The thing you want to predict with your model.
* **training data** and **test data**
  * To know how good your model is at classifying text, you need to split your data into training data and test data. You know the labels for both. You train the model on the training data (i.e. get it to learn the relationship between the features and the label), then ask it to predict the labels for the test data. Since you know the true labels for the test data (maybe you assigned them yourself!) you can compare the predictions of the model with the true labels. This lets you validate your model, for example by computing an accuracy rate.
* **vocabulary**
  * The full set of unique words used across all documents in our corpus.
* **document-term matrix**, or **feature matrix**
  * Machine learning algorithms need quantitative features in the form of a matrix. You can't just give them unstructured text. Using functions such as `CountVectorizer`, scikit-learn will learn the vocabulary (i.e. every unique word) used in your corpus. These words become "features" in a document-term matrix created using the `transform` function. In this matrix, the rows are documents and the columns ("features") are words. If there are 2.5 million unique words used across all of the documents in your corpus, then scikit-learn will learn them when it learns the vocabulary, and then when you transform it to a document-term matrix, it will have 2.5 million features (i.e. columns). Again, the number of features depends on the number of unique words in the whole vocabulary across all documents in your corpus. The values (cells) in this matrix are usually counts of the number of times a given feature (word) appears in a given document. However, the values may also be weights, such as TF-IDF.
* **`CountVectorizer`**
  * The scikit-learn function that processes your raw text data and learns the vocabulary in your corpus. It is a "feature extraction" method because it is extracting features (unique words) from your raw data and computing values (counts of occurrences) for documents in the corpus.
* **`TfidfVectorizer`**
  * Like `CountVectorizer` except it computes TF-IDF weights for each word. If you do not remember what TF-IDF is, please review your notes from previous modules or consult the readings.
* **fit**
  * When we fit a vectorizer, it learns the vocabulary used in our corpus. When we fit a classification model (e.g. Naïve Bayes) it learns the relationship between the features and the labels in order to make predictions about the labels given new data that is not yet labelled / annotated.
* **transform** <!-- a sparse representation -->
  * When you convert unstructured text into a document-term matrix of features.
* **fit and transform** vs. just **transform**
  * When you fit and transform text at the same time using `CountVectorizer` or `TfidfVectorizer`, scikit-learn learns the vocabulary used in the corpus and then transforms the text into a document-term matrix where the features are derived from the vocabulary. Once the vocabulary is learned and the model is trained, any new data will have to be transformed into a document-term matrix that includes *exactly* the same feature set that the original data was trained on. The model does not know about the relationship between any new features and the labels. So, when you are going to do prediction (see below), you need to convert the new data into a document-term matrix with the same feature set as the original data. You do **not** learn new features. So, you transform without fitting / learning.
* **classification models**
  * Once you have the document-term feature matrix, you can use classification models for supervised learning to learn the relationship between features and labels. There are *many* models to choose from, and scikit-learn makes it easy to apply any of them to the document-term matrices. Some choices include Multinomial Naïve Bayes, $k$-nearest neighbors, random forests, support vector machines, etc. The course readings describe these algorithms in detail. Make sure to consult them when you use these classifiers.
* **out of sample data**
  * The whole purpose of supervised learning is to be able to predict label values on new unlabelled data. That new data without labels is called "out of sample data."
* **predict**
  * When you use the model trained on your training data to predict the value of labels that are unknown for the out of sample data.
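
The difference between **fit and transform** and just **transform** is easier to see in code. The sketch below uses a made-up pair of training documents and one made-up new document (all hypothetical, not part of the module data). It reads the learned words from the vectorizer's `vocabulary_` attribute, which should be available regardless of which scikit-learn version you have installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for illustration
train_docs = ["the cat sat", "the dog barked"]
new_docs = ["the cat meowed loudly"]

vectorizer = CountVectorizer()

# fit_transform: learn the vocabulary AND build the document-term matrix
train_dtm = vectorizer.fit_transform(train_docs)
learned_vocab = sorted(vectorizer.vocabulary_)

# transform only: reuse the learned vocabulary, ignore unseen words
new_dtm = vectorizer.transform(new_docs)

print(learned_vocab)       # the features learned from the training documents
print(train_dtm.shape)     # (number of training documents, number of features)
print(new_dtm.toarray())   # same columns; 'meowed' and 'loudly' are dropped
```

The new document ends up with the same five columns as the training matrix, and the two words the vectorizer never saw simply do not appear.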

You may have noticed that it's hard to isolate some of these specific concepts from other concepts. That's because these concepts fit together in a coherent general process. If you understand the process, it becomes much easier to remember each specific concept.

I have mapped out the general process for doing supervised machine learning with text in the figure below. I did it with scikit-learn in mind, and so it references specific scikit-learn functions like `CountVectorizer`. If you are using some other package (e.g. gensim), the general process is the same.

![](img/supervised_process.png)

As shown in the figure above, the process starts -- as always -- with acquiring and cleaning text data that has been classified using some set of labels (e.g. each text is labelled as being produced by a politician from one of several political parties). We then split our text into a training set and a testing set. We then "learn" the vocabulary in the dataset, create a document-term matrix with numerical features, fit a supervised learning model, and validate it against the testing data.

You will learn how accurate your model is by seeing how it performs on the test data, which also contains the true labels. If your model is reasonably accurate, then you can use it to predict the labels using new out of sample data, which does *not* include information about the true labels.
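
Here is a minimal sketch of the whole process in one place. The eight short "abstracts" and their labels are invented for illustration, and the choice of `MultinomialNB` is just one of many possible classifiers. With so little data the accuracy value itself means nothing; the point is the order of the steps.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical labelled corpus (illustration only)
texts = [
    "we surveyed respondents and ran a regression",
    "regression models of citation counts",
    "a statistical analysis of survey data",
    "citation counts and regression coefficients",
    "we interviewed scientists about lab culture",
    "ethnographic fieldwork in a physics lab",
    "interviews and fieldwork with lab scientists",
    "a discourse analysis of interviews",
]
labels = ["quant"] * 4 + ["qual"] * 4

# 1. Split first, so the test set behaves like unseen data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# 2. Learn the vocabulary from the training texts only
vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)

# 3. Transform the test texts with the same (already learned) vocabulary
X_test_dtm = vectorizer.transform(X_test)

# 4. Fit a classifier, then 5. validate it against the held-out labels
model = MultinomialNB()
model.fit(X_train_dtm, y_train)
accuracy = metrics.accuracy_score(y_test, model.predict(X_test_dtm))

# 6. Only if validation looks good: predict labels for out of sample text
new_dtm = vectorizer.transform(["a regression analysis of survey respondents"])
prediction = model.predict(new_dtm)
```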

Let's see this whole thing in action. We will start simple and unrealistic, and then scale up to a more realistic analysis.

# A Simple (and Unrealistic) Example <a id='unrealistic'></a>

## Create Training and Test Data

In this initial example, we are going to skip the step where we create training and test data. The only reason why we are doing that is because we want to use a really simple and small example that will help make the rest of the process as clear as possible. Once this example is done, we will go through a more realistic example, which will include splitting our data into training and testing sets, and then validating our classifier by computing accuracy rates and looking at a confusion matrix.

## Vectorization: Transform Text into DTM

We will start with the following data, which come from a workshop that Kevin Markham (2016) ran at PyData DC 2016: [Kevin Markham | Machine Learning with Text in scikit learn (1:24:19, Youtube)](https://www.youtube.com/watch?v=8QmkFAthuPU). **Our goal is to predict whether or not a text message is "desperate."** We will label "desperate" as 1 and non-desperate as 0.

In [3]:
training_data = ['call you tonight', 'call me a cab', 'please call me... PLEASE!']
is_desperate = [0, 0, 1]

As covered in the last class on processing and bag of words models, the first step in machine learning with text is *always* to go from raw text to a matrix of numerical features. Previously, we broke that down and did each task (e.g. tokenize, drop non-alpha, lower case, stem, etc.) separately and then created a bag of words. Bags of words can easily be turned into document-term matrices.

To process and vectorize our text, we will use the `CountVectorizer` function we imported above. First, we will initialize it with the default settings (you can tune them later if you like), and then fit it to our training data.

In [4]:
vect = CountVectorizer()
vect.fit(training_data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Remember that `CountVectorizer` learns the vocabulary across all documents in our corpus. Let's see the words it learned from this toy example.

In [5]:
vect.get_feature_names()

['cab', 'call', 'me', 'please', 'tonight', 'you']

Note that, in addition to tokenization and learning the vocabulary, `CountVectorizer` also pre-processed our text a bit. Punctuation is gone, words are lowercased, words less than 2 characters long are removed, every word is unique, and the words are rearranged to be in alphabetical order. We can change these settings if we want, and we can also add our own custom pre-processing to make this workflow *exactly* what we want.
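
As a rough illustration of changing these settings, the sketch below fits a second vectorizer with two non-default options: a token pattern that keeps one-character words, and `lowercase=False`. Both are real `CountVectorizer` parameters, but the variable names and the particular choices are mine, picked only to make the differences visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

training_data = ['call you tonight', 'call me a cab', 'please call me... PLEASE!']

# Default settings: lowercases text and keeps only tokens of 2+ word characters
default_vect = CountVectorizer().fit(training_data)

# Alternative settings: a token pattern that also keeps one-character words,
# and no lowercasing, so 'PLEASE' and 'please' become separate features
custom_vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False).fit(training_data)

default_vocab = sorted(default_vect.vocabulary_)
custom_vocab = sorted(custom_vect.vocabulary_)

print(default_vocab)
print(custom_vocab)
```

With the alternative settings the vocabulary grows from six to eight features, because `a` and `PLEASE` are now kept as their own words.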

Now that we have learned the vocabulary, we can *transform* our text into a document-term matrix where the number of rows = the number of documents and the number of features (i.e. columns) = the number of words in the vocabulary. The values returned by `CountVectorizer` are, unsurprisingly, counts of the number of times the feature (i.e. word) appears in the document.

In [6]:
training_data_dtm = vect.transform(training_data)
training_data_dtm.shape

(3, 6)

Our document-term matrix is 3 rows by 6 columns. Perfect. Let's take a look at the matrix via a Pandas dataframe. Note that I am not going to assign the Pandas dataframe to a variable; we are just going to take a peek at the matrix.

In [7]:
pd.DataFrame(training_data_dtm.toarray(), columns = vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |


If this were a real example, there would be a lot more 0s because most words in your corpus will not be in most documents in your corpus. To save storage space and make computation more efficient, scikit-learn will store our data as a sparse matrix, meaning it only records information about non-0 values.
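
To get a feel for what "sparse" means, here is a small sketch that counts how many cells of the toy matrix are actually stored. The variable names are my own; `nnz` is the attribute SciPy sparse matrices use for the number of stored (non-zero) values.

```python
from sklearn.feature_extraction.text import CountVectorizer

training_data = ['call you tonight', 'call me a cab', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(training_data)

n_cells = dtm.shape[0] * dtm.shape[1]   # total cells in the matrix
n_stored = dtm.nnz                      # cells actually stored (non-zero)
sparsity = 1 - n_stored / n_cells       # share of cells that are zero

print(n_cells, n_stored, round(sparsity, 2))
```

Even in this tiny example half of the cells are zero. In a realistic corpus with thousands of features, the share of zeros is usually far higher.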

In this example, I have separated the tasks of learning the vocabulary and transforming our data into a document-term matrix of numerical features. However, you can actually do all of these computations at the same time as follows:

In [8]:
training_data_dtm = vect.fit_transform(training_data)
training_data_dtm.shape

(3, 6)

A 3 by 6 sparse matrix. We got the same results.

## Train a Multinomial Naïve Bayes Classifier on the Document-Term Matrix

Multinomial Naïve Bayes is a very popular classifier for text analysis. It's simple, effective, and has been around since the 1960s. The multinomial variant has been shown to be especially good for count vectors in text analysis.

The Multinomial Naïve Bayes classifier is based on [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem). Maybe you remember it from your statistics classes?

Basically, this Naïve Bayes model will estimate the probability that a text is desperate or not desperate, conditional on whether and how frequently these features (i.e. words) show up in our documents.

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [10]:
nb_classifier = MultinomialNB()
nb_classifier.fit(training_data_dtm, is_desperate)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Done. That's it!

## Validate Model

Normally at this point we would want to get a sense of how accurate our model is. We would do that by getting it to predict the values for our test data. We know the true values of the test data, so we can compare the estimates from our model with the truth.

Again, we are not going to do real model evaluation on this toy example, so we will move onto the next step. However, we will do model evaluation in our more realistic example below.

## Make Predictions on Out of Sample Data

We now have a model that has learned, from our training dataset, the relationship between the features of our documents and the labels. If we were to show it some new out of sample data that is missing a label, we can predict the label. In order to do that, we have to transform the new data into a DTM (document-term matrix) with *identical* features to the one we used to train our model, and then make the prediction.

Our new data will contain words that our trained model doesn't know. Those words will not be taken into consideration in our prediction because the model can't make predictions using information it doesn't know about. The DTM we use here must have *exactly* the same features as the DTM that we used to train our model. Therefore, we will transform our new data -- `new_text` -- into a DTM based on the vocabulary learned from our training data.

In [11]:
new_text = ["please don't call me again"]

We will use the `transform` method of the `CountVectorizer`, but not the `fit` method (which learns new words).

In [12]:
new_text_dtm = vect.transform(new_text)
new_text_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]])

If you look at the dataframe you will see that the features are in fact the same. New words ('again') are not included.

In [13]:
pd.DataFrame(new_text_dtm.toarray(), columns = vect.get_feature_names())

| | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


Now we can use the Multinomial Naïve Bayes model we trained to predict the class (desperate or not desperate) for this new data. We will do so using the `.predict` method. Earlier in the notebook, we called our classifier `nb_classifier`, so we put them together as `nb_classifier.predict()`.

In [14]:
nb_classifier.predict(new_text_dtm)

array([1])

Our classifier predicts that this is a desperate message, likely because of the word "please." We humans know this is not the best choice. But in a more realistic application, we would have a lot more training data and if the model is tuned well it will make mistakes like this infrequently. Of course, we need to have good model validation to know how accurately it is classifying the out of sample data.
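
If you are curious why the model leans towards "desperate," the sketch below redoes the Multinomial Naïve Bayes calculation by hand for this toy example and compares it with scikit-learn's `predict_proba`. It assumes the default Laplace smoothing (`alpha=1.0`); the loop and variable names are my own simplification, not how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

training_data = ['call you tonight', 'call me a cab', 'please call me... PLEASE!']
is_desperate = np.array([0, 0, 1])
new_text = ["please don't call me again"]

vect = CountVectorizer()
X = vect.fit_transform(training_data).toarray()
x_new = vect.transform(new_text).toarray()[0]

alpha = 1.0  # Laplace smoothing, the MultinomialNB default
scores = []
for c in (0, 1):
    prior = np.mean(is_desperate == c)               # share of documents in class c
    word_counts = X[is_desperate == c].sum(axis=0)   # word totals within class c
    word_probs = (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])
    scores.append(prior * np.prod(word_probs ** x_new))

by_hand = np.array(scores) / np.sum(scores)          # normalize to probabilities

nb = MultinomialNB(alpha=alpha).fit(X, is_desperate)
from_sklearn = nb.predict_proba(x_new.reshape(1, -1))[0]

print(by_hand.round(3), from_sklearn.round(3))
```

Both approaches give roughly a 63% probability of "desperate." The single desperate training message used "please" twice, which makes that word much more likely under class 1 than under class 0, and that outweighs the higher prior for class 0.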

## Using TF-IDF Weights Instead of Counts

Before getting into a more realistic example, let's consider an alternative path we could have taken. The `CountVectorizer` transforms our data into a DTM where the values are counts of the number of times any given feature / word appears in any given document. Alternatively, we could have used `TfidfVectorizer`, which -- you guessed it -- transforms our data into a DTM where the values are TF-IDF weights instead of raw counts.

Recall from previous modules that TF-IDF measures the importance of a given word in a given document in a larger corpus. Formally, it looks like this:

$$w_{i,j} = tf_{i,j} \times \log\Big(\frac{N}{df_i}\Big)$$

Where the TF-IDF weight $w$ for word $i$ in document $j$ is equal to the term frequency multiplied by the inverse document frequency.

The term frequency $tf_{i,j}$ is just the number of times the word $i$ appears in document $j$. The inverse document frequency is the log of the total number of documents in the corpus, $N$, divided by the number of documents in which the word $i$ appears, $df_i$. TF-IDF is just the product of those two numbers.

So, broadly speaking, the weight of a word in a document increases the more frequently it appears in that document, but it decreases if it also appears across many other documents. Rare but not too rare words are weighted more than words that show up across many documents.
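
To make the formula concrete, here is a small by-hand calculation on the count matrix from our toy example, using only NumPy. It follows the textbook formula above. Note that scikit-learn's `TfidfVectorizer` uses a smoothed version of the inverse document frequency and normalizes each row by default, so its values will not match this calculation exactly.

```python
import numpy as np

# Count-based document-term matrix from the toy example above
features = ['cab', 'call', 'me', 'please', 'tonight', 'you']
counts = np.array([
    [0, 1, 0, 0, 1, 1],   # 'call you tonight'
    [1, 1, 1, 0, 0, 0],   # 'call me a cab'
    [0, 1, 1, 2, 0, 0],   # 'please call me... PLEASE!'
])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each word
idf = np.log(n_docs / doc_freq)       # inverse document frequency
tfidf = counts * idf                  # term frequency times idf

print(dict(zip(features, idf.round(3))))
print(tfidf.round(3))
```

The word "call" appears in all three documents, so its inverse document frequency is $\log(3/3) = 0$ and it gets a weight of 0 everywhere. The word "please" appears twice in one document and nowhere else, so it gets the largest weight in the matrix.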

Our toy example stays mostly the same as what we have already done, except that we use `TfidfVectorizer` instead of `CountVectorizer`. This time we will use the `fit_transform` method rather than separating the computations for `fit` and `transform`.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
t_vect = TfidfVectorizer()
tfidf_dtm = t_vect.fit_transform(training_data)
tfidf_dtm

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

## Train a Logistic Regression Classifier

Multinomial Naïve Bayes works well when the values in a DTM are integers, but its performance is not quite as good when the values are floats. So, even though it will probably still work OK, we might want to try a different model. As mentioned previously, this is really easy to do in scikit-learn. Here we will use a logistic regression model.

In [16]:
logreg = LogisticRegression()
logreg.fit(tfidf_dtm, is_desperate)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

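Again, nothing to test here... Let's just make a prediction for the new data introduced above: `new_text_dtm`. Note that `new_text_dtm` holds raw counts from `CountVectorizer`. That works here because both vectorizers learned the same six-word vocabulary, but strictly speaking new data should be transformed with the same vectorizer that was used for training (here, `t_vect.transform(new_text)`).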

In [17]:
logreg.predict(new_text_dtm)

array([0])

Unlike our Multinomial Naïve Bayes model, our logistic regression model predicted that "please don't call me again" is not a desperate message.
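
For comparison, here is a sketch of the same TF-IDF workflow where the new text is transformed with the fitted `TfidfVectorizer` itself, so the training data and the new data are on the same scale. It also calls `predict_proba`, which returns one probability per class. With only three training messages the numbers should not be taken seriously, so I have left the outputs for you to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

training_data = ['call you tonight', 'call me a cab', 'please call me... PLEASE!']
is_desperate = [0, 0, 1]
new_text = ["please don't call me again"]

t_vect = TfidfVectorizer()
tfidf_dtm = t_vect.fit_transform(training_data)

logreg = LogisticRegression()
logreg.fit(tfidf_dtm, is_desperate)

# Transform the new text with the SAME fitted TF-IDF vectorizer
new_text_tfidf = t_vect.transform(new_text)

predicted_label = logreg.predict(new_text_tfidf)
predicted_probs = logreg.predict_proba(new_text_tfidf)   # one column per class

print(predicted_label, predicted_probs.round(3))
```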

## Other Classifiers

Before we move on to a more realistic example, take a few minutes to try out another classification method of your choice. You read about many in the readings.

### <font color="green">YOUR TURN! (Question 1)</font> <a id='yt1'></a>

Question is Worth: <font color="green">2 points</font>
    
The assigned reading introduced a number of additional classifiers that we could also use here, such as support vector machines (SVMs) or random forests. 

In the cell below, select an additional classifier discussed in the readings and use it in the example we just walked through. Then compare the results to the Naïve Bayes and logistic regression classifiers. 

In [18]:
# Your Answer Here 
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(training_data_dtm, is_desperate)
clf.predict(new_text_dtm)

# The random forest predicts 0 (not desperate), which matches the logistic regression model rather than the Naïve Bayes classifier.

array([0])

# A More Realistic Application <a id='realistic'></a>



Let's turn to a more realistic application.

In this example, we have metadata on 12000 journal publications that are, broadly speaking, social scientific research on the structure, evolution, and content of science, and scientific careers and policy. The articles come from 5 journals: 

* *Scientometrics*
* *Journal of Informetrics* 
* *Research Policy*
* *Social Studies of Science*
* *Science, Technology, & Human Values*

The first two journals are exclusively quantitative. The last two are mixed but are *almost* entirely qualitative (with very few exceptions). The middle one? Well... let's find out... 

We will train a classifier to learn about differences in the language used in quantitative and qualitative research on science, and then predict the types of articles that get published in the journal *Research Policy*.

Our **training data** is stored in a file called `supervised_science.csv` in the `data` directory. We can load it with `Pandas`. This dataset includes 300 articles each from all of our journals except *Research Policy*, which we will use for our out of sample data. 

In [19]:
training_data = pd.read_csv('data/supervised_science.csv')
training_data.sample(20)

| | text | label |
|---|---|---|
| 1030 | In this paper we deal with the problem of aggr... | quant |
| 1100 | Dyads of journals related by citations can agg... | quant |
| 156 | Ontology, and in particular, the so-called ont... | qual |
| 174 | This paper builds on the growing literature in... | qual |
| 1077 | A new method of assessment of scientific paper... | quant |
| 75 | A number of science and technology studies (ST... | qual |
| 514 | Adaptation to the impacts of climate change is... | qual |
| 409 | While in the beginning of the environmental de... | qual |
| 28 | Collins and Evans have proposed a normative th... | qual |
| 127 | This paper develops a framework for understand... | qual |


Our 1,200 observations include 600 quantitative articles and 600 qualitative articles. 

In [20]:
training_data['label'].value_counts()

quant    600
qual     600
Name: label, dtype: int64

## Split into Training Data and Test Data

We have to split the data into training and test sets *before* vectorization because we don't want to learn vocabulary from the test data. Basically, we are going to "simulate the future," treating the test set as if it were new data coming in. New data will include words that are not in the learned vocabulary, so splitting first and vectorizing second gives us a more faithful simulation of how the model handles unknown words.

In [21]:
from sklearn.model_selection import train_test_split

# ignore the warning for now...
X = training_data['text']
y = training_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Vectorization

Let's learn the vocabulary for our analysis and then create a DTM.

In [22]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(900, 11135)

There are 900 documents (rows) and 11135 features (columns) in our matrix.

In [23]:
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(300, 11135)

Same number of features (columns), but this matrix contains the 300 documents held out for testing. To be clear, there are 1200 observations in total: 900 for training and 300 for testing. If the tests come back with good results, then we will use this model to predict labels (quant/qual) for the out of sample data from *Research Policy*. 

## Train a Multinomial Naïve Bayes Classifier

In [24]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Make Predictions on Test Data

In [25]:
y_pred_class = nb.predict(X_test_dtm)

How accurate is our classifier?

In [26]:
metrics.accuracy_score(y_test, y_pred_class)

0.9833333333333333

Pretty accurate! 😎

In [27]:
pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_class))

| | 0 | 1 |
|---|---|---|
| 0 | 147 | 4 |
| 1 | 1 | 148 |


Recall from the assigned reading how to read a confusion matrix like the one presented above. 
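
As a reminder of how the pieces fit together, the sketch below rebuilds the matrix above with explicit labels and derives a few summary numbers from it. The counts are copied from the output above. scikit-learn sorts string labels alphabetically, so index 0 is `qual` and index 1 is `quant`, with true labels in the rows and predicted labels in the columns. Treating `quant` as the "positive" class is my choice for this illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the confusion matrix above
cm = np.array([[147, 4],
               [1, 148]])
labels = ['qual', 'quant']

cm_df = pd.DataFrame(
    cm,
    index=[f'true {label}' for label in labels],
    columns=[f'predicted {label}' for label in labels],
)

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
precision_quant = cm[1, 1] / cm[:, 1].sum()   # of predicted quant, share truly quant
recall_quant = cm[1, 1] / cm[1, :].sum()      # of true quant, share predicted quant

print(cm_df)
print(round(accuracy, 4), round(precision_quant, 4), round(recall_quant, 4))
```

The accuracy computed from the diagonal (295 correct out of 300) matches the value returned by `metrics.accuracy_score` above.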

In [51]:
from sklearn.model_selection import cross_val_score

np.mean(cross_val_score(nb, X_train_dtm, y_train, cv=5, scoring ="accuracy"))

0.9822222222222223
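
One caveat about cross-validating on `X_train_dtm`: the vocabulary was learned from all 900 training documents before the folds were created, so each validation fold has already contributed words to the feature set. A common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that the vocabulary is re-learned inside each fold. The sketch below shows the idea on an invented mini-corpus (the texts and labels are hypothetical); on our real data you would pass the raw `X_train` text instead.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 10 'quant' and 10 'qual' documents
quant_docs = [f"regression analysis of citation counts sample {i}" for i in range(10)]
qual_docs = [f"ethnographic interviews about laboratory culture case {i}" for i in range(10)]
texts = quant_docs + qual_docs
labels = ["quant"] * 10 + ["qual"] * 10

# The pipeline takes raw text, so the vectorizer is fit on each training fold only
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")
mean_accuracy = np.mean(scores)

print(scores, mean_accuracy)
```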

### <font color="green">YOUR TURN! (Question 2)</font> <a id='yt2'></a>

Question is Worth: <font color="green">1 point</font>

What does the confusion matrix above tell us about the accuracy of our classifications? Provide your answer in the cell below. 

A confusion matrix tells us how accurately the classifier labelled each test observation, using the test labels as the ground truth. Each row represents an actual class and each column represents a predicted class. Because scikit-learn sorts the labels alphabetically, row and column 0 are `qual` and row and column 1 are `quant`. The diagonal running from top left to bottom right contains the correctly classified articles. Treating `quant` as the positive class, the top left cell holds the 147 true negatives (qualitative articles classified as qualitative) and the bottom right cell holds the 148 true positives (quantitative articles classified as quantitative). The remaining cells are misclassifications: 4 qualitative articles were predicted to be quantitative, and 1 quantitative article was predicted to be qualitative. A perfect classifier would have non-zero values only on the diagonal. Here, 295 of the 300 test articles are classified correctly, which matches the accuracy of about 98%. With the confusion matrix, we can also calculate precision and recall to get a fuller picture of the classifier and of the trade-off between the two; which one matters more depends on the classification task. 

In [28]:
X_test[y_test < y_pred_class]

567    The New York Times (NYT) receives more citatio...
27     This paper examines the social capital that ev...
386    This essay presents the first analysis of gend...
401    In 1942, Katherine Frost Bruner published an a...
Name: text, dtype: object

### <font color="green">YOUR TURN! (Question 3)</font> <a id='yt3'></a>

Question is Worth: <font color="green">1 point</font>

What does the code above (`X_test[y_test < y_pred_class]`) tell us about our classification? Provide your answer in the cell below. 

The code above selects the test observations where the true label is "less than" the predicted label. Because the labels are strings, the comparison is alphabetical: `"qual" < "quant"` is `True`, so the condition holds only when the true label is `qual` and the predicted label is `quant`. The result is therefore the four qualitative articles that were wrongly predicted to be quantitative. If we treat `quant` as the positive class, these are the false positives (the top right cell of the confusion matrix). Reading these abstracts helps us identify the kinds of errors the Naïve Bayes classifier makes and judge its performance.  

In [1]:
[X_test, y_test, y_test < y_pred_class]


In [30]:
"qual" < "quant"

True
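
The string comparison is compact, but it hides what is being selected. The sketch below, using a few invented labels, shows that it is equivalent to an explicit condition, and how you could select the mirror-image error (quantitative articles predicted to be qualitative) in the same way.

```python
import pandas as pd

# Hypothetical true and predicted labels
y_true = pd.Series(['qual', 'qual', 'quant', 'quant'])
y_pred = pd.Series(['qual', 'quant', 'qual', 'quant'])

# String comparison trick: 'qual' sorts before 'quant' alphabetically
trick_mask = y_true < y_pred

# Explicit version of the same condition
explicit_mask = (y_true == 'qual') & (y_pred == 'quant')

# The mirror-image error: quantitative articles predicted as qualitative
reverse_mask = (y_true == 'quant') & (y_pred == 'qual')

print(trick_mask.tolist(), explicit_mask.tolist(), reverse_mask.tolist())
```

The explicit version is longer, but it keeps working if the labels are renamed or if a third class is added later.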

## Compare Naïve Bayes with Logistic Regression

Let's now train and fit a logistic regression classifier. 

In [43]:
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

y_pred_class = logreg.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

0.9666666666666667

In [44]:
pd.DataFrame(metrics.confusion_matrix(y_test, y_pred_class))

| | 0 | 1 |
|---|---|---|
| 0 | 148 | 3 |
| 1 | 7 | 142 |


In [52]:

np.mean(cross_val_score(logreg, X_train_dtm, y_train, cv=5, scoring ="accuracy"))

0.9800000000000001

### <font color="green">YOUR TURN! (Question 4)</font> <a id='yt4'></a>

Question is Worth: <font color="green">1 point</font>

In the cell below, translate the code from above (training and fitting the logistic regression classifier) into plain English. In terms of accuracy, how does it compare to the Naïve Bayes classifier? 

First, a logistic regression classifier is initialized and assigned to a variable called `logreg`. The classifier is then fit to the training DTM, whose columns are the unique words across the training articles, together with the corresponding training labels. Once fitted, the classifier can return either predicted probabilities or discrete class labels for new data, depending on the method called. Next, `logreg` predicts labels for the test DTM, and the predictions are stored in a variable called `y_pred_class`. To assess the accuracy of the classification, we use the `metrics` module in scikit-learn, which computes accuracy as the number of correctly identified labels divided by the total number of labels in the predicted set. The logistic regression classifier achieved an accuracy of about 97% on the test data, compared with about 98% for the Naïve Bayes classifier, so the Naïve Bayes classifier performed slightly better than logistic regression. 

## Predict the Types of STS Published in Research Policy

Let's load our 'out of sample' data from *Research Policy* into a list called `RP`. 

In [32]:
# Load the out-of-sample articles; the context manager closes the file for us
with open("data/research_policy.pickle", "rb") as pickle_in:
    RP = pickle.load(pickle_in)
print("There are {} articles in our out of sample data from Research Policy.".format(len(RP)))

There are 1269 articles in our out of sample data from Research Policy.


Let's use both of our classifiers to predict the labels for our out of sample data. 

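In [33]:
# Reuse the vocabulary learned from the training data (transform, not fit_transform)
RP_dtm = vect.transform(RP)
# Predict a label for every Research Policy article with each classifier
lp = pd.DataFrame(logreg.predict(RP_dtm))
nbp = pd.DataFrame(nb.predict(RP_dtm))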

In [34]:
lp.columns = ['prediction']
nbp.columns = ['prediction']

lp['prediction'].value_counts()

quant    964
qual     305
Name: prediction, dtype: int64

In [35]:
nbp['prediction'].value_counts()

quant    1023
qual      246
Name: prediction, dtype: int64

### <font color="green">YOUR TURN! (Question 5)</font> <a id='yt5'></a>

Question is Worth: <font color="green">1.5 points</font>

What conclusions can we draw from these two classifiers about the content of *Research Policy*? How confident are you in those conclusions? Write your answer in the cell below. 

Based on the `value_counts()` output for each classifier, both predict that *Research Policy* contains many more quantitative than qualitative articles: logistic regression labels 964 of the 1,269 articles as "quant" and Naïve Bayes labels 1,023. We saw earlier that Naïve Bayes was slightly more accurate than logistic regression on the test split of our training set. The confusion matrices show that the two classifiers have similar precision for the "quant" class, but Naïve Bayes has higher recall, so it misses fewer quantitative articles; this may be why it assigns "quant" to more *Research Policy* articles. Naïve Bayes also scored slightly higher than logistic regression in 5-fold cross-validation. Based on these observations, I can say with relatively strong confidence that the general conclusion holds: judging by the words used across the documents in the corpus, *Research Policy* publishes more quantitative than qualitative articles, and of the two classifiers I would lean on the Naïve Bayes estimate. 

* LogReg: recall 95.3%, precision 97.9%
* Naïve Bayes (NBP): recall 99.3%, precision 97.4%
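
To make the two sets of predictions easier to compare, the counts reported above can be converted into shares. This is a small sketch using only the numbers printed by `value_counts()`. It describes the predictions themselves, not their correctness, since no hand-coded labels for *Research Policy* appear in this section.

```python
# Prediction counts reported above for the 1,269 Research Policy articles
counts = {
    "logreg": {"quant": 964, "qual": 305},
    "nb": {"quant": 1023, "qual": 246},
}

for name, c in counts.items():
    total = c["quant"] + c["qual"]
    share_quant = c["quant"] / total
    print(f"{name}: {share_quant:.1%} predicted quant ({c['quant']} of {total})")

# Net difference in predicted "quant" counts between the two classifiers.
# This is only a lower bound on how many articles they label differently.
net_difference = counts["nb"]["quant"] - counts["logreg"]["quant"]
print("net difference:", net_difference)
```

The two classifiers agree on the overall direction (roughly three quarters or more of the articles predicted as quantitative) while differing on at least a few dozen individual articles.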

### <font color="green">YOUR TURN! (Question 6)</font> <a id='yt6'></a>

Question is Worth: <font color="green">2.5 points</font>
    
In the cell below, train and test a Support Vector Machine (SVM) classifier and use it to make predictions on the out of sample data. 

In [58]:
# Your Answer Here

# Import the SVM module
from sklearn import svm

# Create an SVM classifier with a linear kernel
clf_svm = svm.SVC(kernel='linear')

# Train the model on the training set
clf_svm.fit(X_train_dtm, y_train)

# Predict the labels for the test set
y_pred = clf_svm.predict(X_test_dtm)

# Mean accuracy across 5-fold cross-validation on the training set
np.mean(cross_val_score(clf_svm, X_train_dtm, y_train, cv=5, scoring="accuracy"))

0.9788888888888889

In [57]:
sv = pd.DataFrame(clf_svm.predict(RP_dtm))
sv.columns = ['prediction']
sv['prediction'].value_counts()

quant    946
qual     323
Name: prediction, dtype: int64

### <font color="green">YOUR TURN! (Question 7)</font> <a id='yt7'></a>

Question is Worth: <font color="green">1 point</font>

Compare the accuracy of your SVM classifier with the Naïve Bayes and Logistic Regression classifiers. Which do you prefer and why? Provide your response in the cell below. 

The SVM classifier is slightly less accurate than the other two: its mean 5-fold cross-validation accuracy is 97.9%, compared with 98.0% for logistic regression and a slightly higher score for Naïve Bayes. It agrees with both of them that *Research Policy* contains more quantitative than qualitative articles (946 "quant" versus 323 "qual"). Because the differences in accuracy are so small, I would choose the Naïve Bayes classifier for other reasons as well. It assumes that all features (here, word counts) are independent of one another given the class, which makes it relatively interpretable: its decisions are driven by the conditional probability of each word given each class. That same assumption lets it handle a large number of features while remaining efficient to train. 
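
The independence assumption mentioned above can be written out explicitly. As a sketch in standard notation (the symbols are introduced here for illustration and are not taken from the module): for a document containing the words $w_1, \dots, w_n$ and a class $c \in \{\text{qual}, \text{quant}\}$, Naïve Bayes scores each class as

$$
P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

and predicts the class with the highest score. The product is where the assumption enters: each word contributes its own conditional probability $P(w_i \mid c)$, independently of the other words in the document.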

<hr>

# <font color="green">Do You See Something That Could be Better?</font>

I am committed to collecting student feedback to continuously improve this course for future students. I would like to invite you to help me make those improvements. 

As you worked on this module, did you notice anything that could be improved? For example, did you find a typo in the module notebook **or in the assigned reading**? Did you find the explanation of a particular concept or block of code confusing? Is there something that just isn’t clicking for you? 

If you have any feedback for the content in this module, please enter it into the text block below. I will review feedback each week and make a list of things that should be changed before the next offering. 

Please know that *nothing you say here, however critical, will impact how I evaluate your work in this course*. There is no risk that I will assign a lower grade to you if you provide critical feedback. In fact, if the feedback you provide is thoughtful and constructive, I will assign up to 3% bonus marks on your final course grade. 

Thanks for your help improving the course! 

# Your Feedback Here :-)

<hr>

# REFERENCES <a id='refs'></a>

* Markham, Kevin. 2016. "Kevin Markham | Machine Learning with Text in scikit-learn (1:24:19, YouTube)." PyData DC 2016. [Available online](https://www.youtube.com/watch?v=8QmkFAthuPU). 
* McLevey, John. 2020. *Doing Computational Social Science*. Sage. London, UK. 