<h3>UofT Social Science Methods Week</h3>
<h1>Introduction to Machine Learning for Textual Analysis</h1>

<p>
This notebook illustrates the key steps to fit a machine learning classifier using textual data.  We will use the sklearn library for Python and focus mostly on support vector machines (SVMs).
</p>

<p>
We start by importing libraries for the current session.
</p>

In [None]:
# These classes provide "vectorizers" to encode texts into a term-document matrix: 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This class provides a support vector classifier with a linear kernel:
from sklearn.svm import LinearSVC

# These utilities will be useful later on to split the data and assess the model using cross-validation:
from sklearn.model_selection import StratifiedKFold, train_test_split

# This module provides evaluation metrics such as accuracy and the F1-score.
from sklearn import metrics

# These libraries are useful to deal with maths and data frames.
import numpy as np
import pandas as pd

<h2>Toy Example</h2>

Let us start by looking at a simple example to make sure that we understand each step involved in the estimation and interpretation of SVMs.

In [None]:
# Four text documents stored in a Python list.
documents = ["Chinese Beijing Chinese", 
            "Chinese Chinese Shanghai",
            "Chinese Macao",
            "Tokyo Japan Chinese"]

# The corresponding classes, annotated.
y = ["China", "China", "China", "Not China"]

# A new document for which we want to predict the class.
new_document = ["Chinese Chinese Japan"]

<h3>1. Transforming the texts into a term-document matrix</h3>

We create an instance of the CountVectorizer class, and call it, for instance, "vectorizer".

In [None]:
vectorizer = CountVectorizer()

We can now transform the text documents into a term-document matrix, with the fit_transform function of the CountVectorizer class.

In [None]:
X = vectorizer.fit_transform(documents)

If we inspect the result object X, we can see that it is stored as a sparse matrix to save space.

In [None]:
X

We can convert it to a dense matrix to visualize the content. 

In [None]:
X.todense()

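The vocabulary_ attribute is a dictionary that maps each word in the vocabulary to its column index in the matrix X.  Notice that the words were converted to lowercase (the default behaviour of CountVectorizer) and that Python indexing is zero-based.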

In [None]:
vectorizer.vocabulary_

The get_feature_names_out() method returns an ordered array of the words in the vocabulary.  (It replaces get_feature_names(), which was removed in scikit-learn 1.2.)

In [None]:
vectorizer.get_feature_names_out()
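
To tie these views together, we can label the rows and columns of the term-document matrix.  The sketch below rebuilds the toy example and uses pandas for display; the variable names "columns" and "tdm" are introduced here for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Sort the words by their column index to recover the column order of X.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per document, one column per (lowercased) word.
tdm = pd.DataFrame(X.toarray(), index=documents, columns=columns)
print(tdm)
```

Each cell counts how many times the word in that column occurs in the document in that row.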

<h3>2. Fitting the model</h3>

We can now create an instance of the LinearSVC class, here simply named "model".  This will be our classifier.  Without arguments, the model will rely on default options.

In [None]:
model = LinearSVC()

We fit the model with the fit function, and enter X (the design matrix/TDM) and y as arguments. 

In [None]:
model.fit(X, y)

We can retrieve the coefficients of the hyperplane with the coef_ attribute.

In [None]:
model.coef_

The model has an intercept, which we can also retrieve.

In [None]:
model.intercept_

To be clearer, let's look at the hyperplane estimates for each word.

In [None]:
for i, beta in enumerate(model.coef_[0]):
    # Iterate through the coefficients.
    print("%s: %0.3f" %(vectorizer.get_feature_names_out()[i], beta))
    # Print the name of the column and the estimate.

It is important to understand how the classes have been encoded (as -1 and 1).  They are stored in alphabetical order by default, such that the positive class is "Not China". 

In [None]:
model.classes_

We can verify on which side of the hyperplane each of the four documents falls, by computing $\hat{\alpha} + \mathbf{x}\hat{\boldsymbol\beta}$. Documents with negative values, below the hyperplane, are in the class "China", whereas the document above is in the class "Not China".  This is what we expected. 

In [None]:
for i, row in enumerate(X.toarray()):
    # Iterate the rows of the X matrix (converted to a dense NumPy array).
    yhat = model.intercept_[0] + np.dot(row, model.coef_[0])
    # For each row, compute alpha + the dot product of x and the coefficients.
    print("%s: %0.3f" %(documents[i], yhat))
    # Print the document with the resulting "yhat" value.
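
As a cross-check, sklearn exposes the same quantity through the decision_function method of the fitted model.  The sketch below refits the toy example and compares the manual computation with the built-in one; the names "manual" and "builtin" are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

documents = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
y = ["China", "China", "China", "Not China"]

X = CountVectorizer().fit_transform(documents)
model = LinearSVC().fit(X, y)

# Manual computation: intercept + X times the coefficient vector.
manual = model.intercept_[0] + X.toarray() @ model.coef_[0]

# Built-in equivalent: signed score of each document relative to the hyperplane.
builtin = model.decision_function(X)

print(np.allclose(manual, builtin))
```

A positive score corresponds to the second entry of model.classes_, a negative score to the first.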

<h3>3. Predicting the class of new documents</h3>

Now that the model is trained, we can use it to predict the class of a new document.  We have set aside one such example.

In [None]:
new_document

We need to transform the document into the same vector space.  We invoke the vectorizer, this time with the transform method, which reuses the vocabulary learned during fitting.  This implies that words never observed when fitting the vectorizer are ignored and cannot be used to predict new documents.

In [None]:
Xprime = vectorizer.transform(new_document)
Xprime.todense()
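
To see that unseen words are dropped, we can compare two versions of the new document.  In this sketch the word "Osaka" is a hypothetical addition that does not appear in the training documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]

vectorizer = CountVectorizer()
vectorizer.fit(documents)

# The second document adds a word ("Osaka") absent from the fitted vocabulary.
seen = vectorizer.transform(["Chinese Chinese Japan"]).toarray()
unseen = vectorizer.transform(["Chinese Chinese Japan Osaka"]).toarray()

print(seen)
print(unseen)

# Both documents map to exactly the same vector.
identical = bool((seen == unseen).all())
print(identical)
```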

Predicting the class is simple.  We invoke our model with the predict function and pass as an argument the transformed data.

In [None]:
model.predict(Xprime)

The model predicted the category "China", which seems reasonable.  The example was ambiguous, as it contained a word from the other category.  We can confirm where it falls relative to the hyperplane, as we did before.

In [None]:
model.intercept_[0] + np.dot(Xprime.toarray()[0], model.coef_[0])

<h2>Example 2. Sentiment Analysis with Social Media Data</h2>

Here's a full example using a real-world dataset of social media data.  This is a sample from the Stanford Twitter dataset, with tweets annotated for sentiment.  The example relies on pandas for Python to handle the dataset.

In [None]:
df = pd.read_csv("socialmedia.csv", encoding="utf-8", header=0)
df.head()

The dataset contains 10,000 annotated examples.

In [None]:
df.shape

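Awareness of the class distribution is essential, because it determines how well a naive classifier that always predicts the most frequent class would perform.  Here, we have a near-balanced dataset.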

In [None]:
df.sentiment.value_counts()
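
The class distribution also gives a useful benchmark: the accuracy of always predicting the most frequent class.  The sketch below uses a small set of hypothetical labels (not the actual dataset) to show the computation.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
sentiment = pd.Series(["Positive"] * 6 + ["Negative"] * 4)

# Share of each class in the sample.
shares = sentiment.value_counts(normalize=True)
print(shares)

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = shares.max()
print("Majority-class baseline: %0.3f" % baseline_accuracy)
```

A classifier is only informative if it beats this baseline on unseen data.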

<h3>1. Transforming the texts into a term-document matrix</h3>

Let us create Python lists containing the documents and the sentiment class.

In [None]:
text = df.text.tolist()
y = df.sentiment.tolist()

As before, we create our vectorizer.  This example relies on a TF-IDF weighted term-document matrix restricted to the 5000 most frequent terms, and we remove English stop words.  

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

We transform the text into a vector space, as we did before.  We can confirm that the matrix has shape 10000 x 5000.

In [None]:
X = vectorizer.fit_transform(text)
X.shape
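
To make the TF-IDF weighting concrete, the sketch below reproduces it by hand on the toy documents, assuming the default settings of TfidfVectorizer (smoothed inverse document frequencies, followed by L2 normalization of each row).  The names "manual" and "reference" are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Chinese Beijing Chinese",
        "Chinese Chinese Shanghai",
        "Chinese Macao",
        "Tokyo Japan Chinese"]

# Raw term counts (the term-frequency part).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency, as in sklearn's default.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Words that occur in many documents, such as "chinese" here, receive a smaller weight than rare words.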

<h3>2. Fitting and assessing the model using cross-validation</h3>

The simplest way to assess the model out of sample is to split the data at random into a training set and a testing set (a hold-out split, the building block of cross-validation).  This can be done easily with the train_test_split() function from sklearn.

In [None]:
model = LinearSVC()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)

We can now compute the accuracy and F1-Score.

In [None]:
metrics.accuracy_score(y_test, y_predicted)

In [None]:
metrics.f1_score(y_test, y_predicted, pos_label='Positive')

In [None]:
metrics.confusion_matrix(y_test, y_predicted)
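
A single hold-out split depends on which observations happen to fall in the test set.  The StratifiedKFold class imported earlier repeats the exercise over several folds while preserving the class proportions.  The sketch below is a minimal illustration on a hypothetical toy corpus (not the Twitter data); it wraps the vectorizer and the classifier in a pipeline so that the vocabulary is learned from the training folds only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

# Hypothetical toy corpus: 16 positive and 16 negative documents.
pos_words = ["good", "great", "love", "happy"]
neg_words = ["bad", "awful", "hate", "sad"]
text = ([f"{a} {b} movie" for a in pos_words for b in pos_words] +
        [f"{a} {b} movie" for a in neg_words for b in neg_words])
y = np.array(["Positive"] * 16 + ["Negative"] * 16)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []

# Only y is needed to build stratified folds, so X can be a placeholder.
for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
    pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
    pipeline.fit([text[i] for i in train_idx], y[train_idx])
    y_predicted = pipeline.predict([text[i] for i in test_idx])
    scores.append(metrics.accuracy_score(y[test_idx], y_predicted))

print("Accuracy per fold:", np.round(scores, 3))
print("Mean accuracy: %0.3f" % np.mean(scores))
```

The mean across folds is a more stable estimate of out-of-sample performance than the score from one split.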

For convenience, let's write a function that retrieves the largest coefficients and indicates which words are best predictors of each class.

In [None]:
def showTopFeatures(classifier, vectorizer, n):
    # Print the n words with the largest and the smallest coefficients.
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    # Indices of the coefficients sorted in ascending order.
    order = np.argsort(classifier.coef_[0])
    temp_pos = order[-n:].tolist()
    temp_neg = order[:n].tolist()
    print("Top positive features:\n")
    for t in temp_pos: 
        print("%s" % (feature_names[t]))
    print("\n")
    print("Top negative features:\n")    
    for t in temp_neg: 
        print("%s" % (feature_names[t]))

In [None]:
showTopFeatures(model, vectorizer, 10)

<h3>3. Improving the model with hyperparameter optimization and feature selection</h3>

We have left the hyperparameter C at the default value, 1.  C controls the strength of regularization: smaller values regularize more strongly.  We may adjust the parameter to find a better model.

In [None]:
# Simple example of hyperparameter optimization.
for c in [0.01,0.05,0.1,1,5,10]:
    model = LinearSVC(C=c)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    print("F1-Score for C = %s: %0.3f" %(str(c), metrics.f1_score(y_test, y_predicted, pos_label='Positive')))

Setting C = 0.05 improves predictive performance, measured here by the F1-score on the test set.
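
Choosing C by looking at the test set tends to give an optimistic view of performance.  A common alternative is to search over the same grid with cross-validation on the training data, using GridSearchCV from sklearn.  The sketch below runs on a hypothetical toy corpus (not the Twitter data) and scores with the macro-averaged F1, which is our choice to avoid having to name a positive label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 16 positive and 16 negative documents.
pos_words = ["good", "great", "love", "happy"]
neg_words = ["bad", "awful", "hate", "sad"]
text = ([f"{a} {b} movie" for a in pos_words for b in pos_words] +
        [f"{a} {b} movie" for a in neg_words for b in neg_words])
y = ["Positive"] * 16 + ["Negative"] * 16

# make_pipeline names each step after its class, in lowercase.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
grid = [0.01, 0.05, 0.1, 1, 5, 10]
param_grid = {"linearsvc__C": grid}

# Five-fold cross-validation for every value of C.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
search.fit(text, y)

print("Best C:", search.best_params_["linearsvc__C"])
print("Cross-validated F1 (macro): %0.3f" % search.best_score_)
```

After the search, search.best_estimator_ holds the pipeline refitted on all the data passed to fit with the selected value of C.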

Next, we can compare the performance of various classifiers.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

classifiers = [('Support Vector Machine', LinearSVC(C=.05)),
               ('Decision Tree', DecisionTreeClassifier()), 
               # The L1 penalty requires a compatible solver, such as liblinear.
               ('Logistic Regression', LogisticRegression(penalty='l1', solver='liblinear'))]

for name, clf in classifiers:
    model = clf
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    print("Classifier: %s" %name)
    print("Accuracy: %0.3f" %metrics.accuracy_score(y_test, y_predicted))
    print("F1-Score: %0.3f" %metrics.f1_score(y_test, y_predicted, pos_label='Positive'))
    print("\n")

Feature selection can help to achieve a better bias-variance trade-off, by removing poor predictors from the model.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_features=10000)
kbest = SelectKBest(chi2, k=2000)

X = vectorizer.fit_transform(text)
X = kbest.fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
model = LinearSVC(C=5)
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print("Accuracy: %0.3f" %metrics.accuracy_score(y_test, y_predicted))
print("F1-Score: %0.3f" %metrics.f1_score(y_test, y_predicted, pos_label='Positive'))
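
Note that the cells above fit the vectorizer and the feature selector on the full sample before splitting, so the test set influences which features are kept.  One way to keep the test set untouched is to chain the steps in a pipeline fitted on the training data only.  The sketch below illustrates the idea on a hypothetical toy corpus, with k=4 chosen arbitrarily to suit its small vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics

# Hypothetical toy corpus: 16 positive and 16 negative documents.
pos_words = ["good", "great", "love", "happy"]
neg_words = ["bad", "awful", "hate", "sad"]
text = ([f"{a} {b} movie" for a in pos_words for b in pos_words] +
        [f"{a} {b} movie" for a in neg_words for b in neg_words])
y = ["Positive"] * 16 + ["Negative"] * 16

# Split the raw texts first, so that no step ever sees the test documents.
text_train, text_test, y_train, y_test = train_test_split(
    text, y, test_size=0.33, random_state=42, stratify=y)

# Vectorize, keep the 4 best features by chi-squared, then classify.
pipeline = make_pipeline(TfidfVectorizer(),
                         SelectKBest(chi2, k=4),
                         LinearSVC())
pipeline.fit(text_train, y_train)

y_predicted = pipeline.predict(text_test)
accuracy = metrics.accuracy_score(y_test, y_predicted)
print("Accuracy: %0.3f" % accuracy)
```

The fitted pipeline applies the vectorizer, the selector and the classifier in sequence whenever predict is called on raw text.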

<h3>4. Making predictions</h3>

If we are satisfied with the model, we can fit the chosen specification on the full sample and save it for later use.  In this case, we will use it to predict the sentiment of new documents.

In [None]:
model = LinearSVC(C=5)
model.fit(X, y)
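
To save a fitted model for later use, one option is the joblib library, which is installed alongside sklearn.  The sketch below is an illustration under our own assumptions: it uses a hypothetical toy corpus, a hypothetical file name, and a temporary directory, and it stores the vectorizer together with the classifier so that both can be restored at once.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 16 positive and 16 negative documents.
pos_words = ["good", "great", "love", "happy"]
neg_words = ["bad", "awful", "hate", "sad"]
text = ([f"{a} {b} movie" for a in pos_words for b in pos_words] +
        [f"{a} {b} movie" for a in neg_words for b in neg_words])
y = ["Positive"] * 16 + ["Negative"] * 16

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(C=5))
pipeline.fit(text, y)

new_text = ["great happy movie", "awful sad movie"]

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, for illustration only.
    path = os.path.join(tmpdir, "sentiment_model.joblib")
    joblib.dump(pipeline, path)      # Save the fitted pipeline to disk.
    restored = joblib.load(path)     # Load it back, as in a later session.

# The restored pipeline gives the same predictions as the original.
same = list(pipeline.predict(new_text)) == list(restored.predict(new_text))
print(same)
```

Saved models should be reloaded with the same version of sklearn that produced them, and only from trusted files.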

Let us load a new dataset with tweets, with the aim of predicting sentiment. 

In [None]:
newdf = pd.read_csv("unseen_documents.csv", encoding="utf-8", header=0)
newdf.head()

We need to convert the new documents into the same vector space.  This now requires two steps: applying the fitted vectorizer, then the fitted feature selector.

In [None]:
newtext = newdf.text.tolist()
newX = vectorizer.transform(newtext)
newX = kbest.transform(newX)

We can create a new variable in our dataset with the predictions from the trained model.

In [None]:
newdf['sentiment'] = model.predict(newX)

In [None]:
newdf.head()