# Tutorial - build a LinearSVC model with sklearn

This tutorial demonstrates how to use the scikit-learn (sklearn) package to build a LinearSVC model, rank features, and use the model for prediction. We will be using the Kaggle sentiment data again.

Note that sklearn provides two commonly used SVM classifiers: SVC and LinearSVC.

The SVC module allows for choosing nonlinear kernels, and it uses a one-vs-one strategy for multi-class classification.

The LinearSVC module supports only the linear kernel, and it uses a one-vs-all (one-vs-rest) strategy for multi-class classification. It is implemented with liblinear rather than libsvm, so LinearSVC is generally faster than SVC and scales better to large data sets. Since the linear kernel generally works well for high-dimensional, sparse text data, this tutorial demonstrates how to use LinearSVC for text classification.
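
The difference between the two multi-class strategies shows up in the shape of the decision scores. The following is a minimal sketch on randomly generated toy data (`X_toy` and `y_toy` are made up for illustration and are not the Kaggle data): with five classes, one-vs-one trains 5*4/2 = 10 pairwise classifiers, while one-vs-rest trains 5.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# toy data (assumption): 100 random examples, 4 features, 5 classes
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 4)
y_toy = np.arange(100) % 5

# SVC with one-vs-one decision scores: one column per pair of classes
svc = SVC(kernel="linear", decision_function_shape="ovo").fit(X_toy, y_toy)
ovo_shape = svc.decision_function(X_toy).shape

# LinearSVC uses one-vs-rest: one column per class
lin = LinearSVC(C=1).fit(X_toy, y_toy)
ovr_shape = lin.decision_function(X_toy).shape

print(ovo_shape)  # (100, 10)
print(ovr_shape)  # (100, 5)
```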

# Step 1: Read in data

In [1]:
# this step is the same as the NB script

# read in the training data
# the data set includes four columns: PhraseId, SentenceId, Phrase, Sentiment
# In this data set a sentence is further split into phrases 
# in order to build a sentiment classification model
# that can not only predict sentiment of sentences but also shorter phrases

# A data example:
# PhraseId SentenceId Phrase Sentiment
# 1 1 A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . 1

# the Phrase column includes the training examples
# the Sentiment column includes the training labels
# "0" for very negative
# "1" for negative
# "2" for neutral
# "3" for positive
# "4" for very positive


import pandas as p
train=p.read_csv("/Users/kenmckee/Desktop/GS/S18/tm/ks/train.tsv", delimiter='\t')
y=train['Sentiment'].values
X=train['Phrase'].values
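
To check the loading step without the Kaggle file, the same `read_csv` call can be tried on a small in-memory table. This is only a sketch: the two rows below are shortened, made-up examples in the same four-column layout, and `io.StringIO` stands in for the file path.

```python
import io
import pandas as pd

# two made-up rows (assumption) in the same tab-separated layout as train.tsv
sample_tsv = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades\t1\n"
    "2\t1\tA series\t2\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t")

# labels and texts are pulled out as numpy arrays, as in the cell above
y_sample = sample["Sentiment"].values
X_sample = sample["Phrase"].values
print(sample.columns.tolist())
print(X_sample, y_sample)
```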

# Step 2: Split train/test data for hold-out test

In [2]:
# this step is the same as the NB script

# check the sklearn documentation for train_test_split
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
# "test_size" : float, int, None, optional
# If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. 
# If int, represents the absolute number of test samples. 
# If None, the value is set to the complement of the train size. 
# By default, the value is set to 0.25. The default will change in version 0.21. It will remain 0.25 only if train_size is unspecified, otherwise it will complement the specified train_size.    

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
print(X_train[0])
print(y_train[0])
print(X_test[0])
print(y_test[0])

(93636,) (93636,) (62424,) (62424,)
almost in a class with that of Wilde
3
escape movie
2
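
The effect of `test_size` and `random_state` is easier to see on a tiny example. The sketch below uses ten made-up phrases (`X_demo`, `y_demo` are hypothetical): a float `test_size` is a proportion, an int is an absolute number of test examples, and a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# ten made-up examples (assumption) with labels 0..4
X_demo = np.array(["phrase %d" % i for i in range(10)])
y_demo = np.arange(10) % 5

# float test_size: 40% of the examples go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
print(X_tr.shape, X_te.shape)  # (6,) (4,)

# int test_size: exactly 3 test examples
_, X_te_int, _, _ = train_test_split(X_demo, y_demo, test_size=3, random_state=0)
print(len(X_te_int))  # 3

# the same random_state gives the same split again
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
print((X_te == X_te_again).all())  # True
```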



# Step 2.1 Data Checking

In [3]:
# this step is the same as the NB script

# Check how many training examples in each category
# this is important to see whether the data set is balanced or skewed

import numpy as np

training_labels = set(y_train)
print(training_labels)
# count the examples per label; np.unique replaces the deprecated scipy.stats.itemfreq
labels, counts = np.unique(y_train, return_counts=True)
training_category_dist = np.column_stack((labels, counts))
print(training_category_dist)

{0, 1, 2, 3, 4}
[[    0  4141]
 [    1 16449]
 [    2 47718]
 [    3 19859]
 [    4  5469]]


The sample output shows that the data set is skewed with 47718/93636=51% "neutral" examples. All other categories are smaller.
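
The proportions can be computed directly from the counts. The sketch below copies the five counts from the output above into an array (rather than recomputing them from `y_train`) and derives each category's share and the majority category.

```python
import numpy as np

# label counts copied from the training-set output above
labels = np.array([0, 1, 2, 3, 4])
counts = np.array([4141, 16449, 47718, 19859, 5469])

# share of each category in the training set
proportions = counts / counts.sum()
for label, share in zip(labels, proportions):
    print(label, round(float(share), 3))

# the most frequent category; always predicting it is a simple baseline
majority_label = int(labels[np.argmax(counts)])
print("majority category:", majority_label)
```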


# Step 3: Vectorization

In [4]:
# this step is the same as the NB script

# sklearn contains two vectorizers

# CountVectorizer can give you Boolean or TF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# TfidfVectorizer can give you TF or TFIDF vectors
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# Read the sklearn documentation to understand all vectorization options

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# several commonly used vectorizer settings

#  unigram boolean vectorizer, set minimum document frequency to 5
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True, min_df=5, stop_words='english')

#  unigram term frequency vectorizer, set minimum document frequency to 5
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False, min_df=5, stop_words='english')

#  unigram and bigram term frequency vectorizer, set minimum document frequency to 5
gram12_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), min_df=5, stop_words='english')

#  unigram tfidf vectorizer, set minimum document frequency to 5
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, min_df=5, stop_words='english')
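
The difference between Boolean, term frequency, and TF-IDF vectors is easiest to see on a tiny corpus. The two documents below are made up for illustration, and the vectorizers use default settings apart from `binary`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# two made-up documents (assumption); the vocabulary is sorted: bad, good, movie
docs = ["good good movie", "bad movie"]

# Boolean vectors: 1 if the word occurs, however often
bool_vec = CountVectorizer(binary=True).fit_transform(docs).toarray()
print(bool_vec)   # [[0 1 1] [1 0 1]]

# term frequency vectors: raw counts
count_vec = CountVectorizer(binary=False).fit_transform(docs).toarray()
print(count_vec)  # [[0 2 1] [1 0 1]]

# TF-IDF vectors: counts reweighted by inverse document frequency,
# then each row is scaled to unit length (the default norm is 'l2')
tfidf_vec = TfidfVectorizer(use_idf=True).fit_transform(docs).toarray()
print(np.linalg.norm(tfidf_vec, axis=1))
```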


## Step 3.1: Vectorize the training data

In [5]:
# this step is the same as the NB script

# The vectorizer can do "fit" and "transform"
# fit is a process to collect unique tokens into the vocabulary
# transform is a process to convert each document to vector based on the vocabulary
# These two processes can be done together using fit_transform(), or used individually: fit() or transform()

# fit vocabulary in training documents and transform the training documents into vectors
X_train_gram = gram12_count_vectorizer.fit_transform(X_train)
X_test_gram = gram12_count_vectorizer.transform(X_test)


X_train_tfidf = unigram_tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = unigram_tfidf_vectorizer.transform(X_test)
# check the content of a document vector
print(X_train_gram.shape)
print(X_train_gram[0].toarray())

# check the size of the constructed vocabulary
print(len(gram12_count_vectorizer.vocabulary_))

# print out the first 10 items in the vocabulary
print(list(gram12_count_vectorizer.vocabulary_.items())[:10])

# check word index in vocabulary
print(gram12_count_vectorizer.vocabulary_.get('imaginative'))

(93636, 34579)
[[0 0 0 ... 0 0 0]]
34579
[('class', 5020), ('wilde', 33787), ('derring', 7552), ('chilling', 4755), ('affecting', 825), ('meanspirited', 19199), ('personal', 22506), ('low', 18281), ('involved', 15905), ('worth', 34261)]
14942
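
The reason for calling `fit_transform()` on the training data but only `transform()` on the test data can be shown with a small sketch. The documents are made up; words that were not seen during `fit` are simply ignored at `transform` time, so the test vectors keep the same columns as the training vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

# fit the vocabulary on the (made-up) training documents and vectorize them
train_vec = vec.fit_transform(["a great movie", "a dull movie"])

# the default tokenizer drops one-letter tokens such as "a"
print(sorted(vec.vocabulary_))  # ['dull', 'great', 'movie']

# transform a new document with the fitted vocabulary;
# "unseen" and "film" are not in the vocabulary and are ignored
test_vec = vec.transform(["a great unseen film"])
print(test_vec.toarray())       # [[0 1 0]]

# both matrices have one column per vocabulary entry
print(train_vec.shape[1] == test_vec.shape[1])  # True
```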


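Sample output from a run with a different vectorizer setting (the vocabulary size and the word indices therefore differ from the output above):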

(93636, 11967)
[[0 0 0 ..., 0 0 0]]
11967
[('imaginative', 5224), ('tom', 10809), ('smiling', 9708), ('easy', 3310), ('diversity', 3060), ('impossibly', 5279), ('buy', 1458), ('sentiments', 9305), ('households', 5095), ('deteriorates', 2843)]
5224

## Step 3.2: Vectorize the test data

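The test data were already vectorized in the cell above. `transform()` reuses the vocabulary fitted on the training data, so the test vectors have the same columns as the training vectors.

Sample output (the number of columns depends on the vectorizer setting used in that run):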

(62424, 14324)

# Step 4: Train a LinearSVC classifier

In [6]:
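# import the LinearSVC module
from sklearn.svm import LinearSVC

# initialize one LinearSVC model per feature representation
# fit() returns the estimator itself, so fitting a single object twice
# would leave both names pointing at the model that was trained last
gram_svm_clf = LinearSVC(C=1)
svm_clf = LinearSVC(C=1)

# use the training data to train the models
gramsvm = gram_svm_clf.fit(X_train_gram,y_train)
# svm_clf (trained on the TF-IDF vectors) is the model used in the steps below
tfidfsvm = svm_clf.fit(X_train_tfidf,y_train)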
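
Because `fit()` returns the estimator itself, assigning its result to a new name does not create a second model. The sketch below illustrates this on random toy data (the arrays `X_a`, `X_b`, and `y_demo` are made up): refitting one object overwrites the earlier model, while two separate objects keep two separate sets of weights.

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy data (assumption): the same 6 labels with two different feature sets
rng = np.random.RandomState(0)
y_demo = np.array([0, 1, 2, 0, 1, 2])
X_a = rng.rand(6, 4)   # 4 features
X_b = rng.rand(6, 7)   # 7 features

# one object fitted twice: both names refer to the same, last-trained model
clf = LinearSVC(C=1)
first = clf.fit(X_a, y_demo)
second = clf.fit(X_b, y_demo)
print(first is second)       # True
print(first.coef_.shape)     # (3, 7)

# two objects: each keeps the weights for its own feature set
clf_a = LinearSVC(C=1).fit(X_a, y_demo)
clf_b = LinearSVC(C=1).fit(X_b, y_demo)
print(clf_a.coef_.shape, clf_b.coef_.shape)  # (3, 4) (3, 7)
```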

In [7]:

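# Train a MultinomialNB (MNB) classifier on the same vectors for comparison

# import the MNB module
from sklearn.naive_bayes import MultinomialNB

# initialize one MNB model per feature representation
gram_nb_clf = MultinomialNB()
nb_clf = MultinomialNB()

# use the training data to train the MNB models
grammnb = gram_nb_clf.fit(X_train_gram,y_train)
# nb_clf (trained on the TF-IDF vectors) is the model used in the steps below
tfidfmnb = nb_clf.fit(X_train_tfidf,y_train)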


# Step 4.1 Interpret a trained LinearSVC model

In [8]:
## interpreting LinearSVC models
## http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

## LinearSVC uses a one-vs-all strategy to extend the binary SVM classifier to multi-class problems
## for the Kaggle sentiment classification problem, there are five categories 0,1,2,3,4 with 0 as very negative and 4 as very positive
## LinearSVC builds five binary classifiers, "very negative vs. others", "negative vs. others", "neutral vs. others", "positive vs. others", "very positive vs. others",
## and then picks the most confident prediction as the final prediction.

## In each binary classifier, the weight of a feature shows how much it contributes to distinguishing the two concepts
## For category "0" (very negative), get all features and their weights and sort them in increasing order
## svm_clf was trained on the TF-IDF vectors, so its weights must be paired with the feature names of unigram_tfidf_vectorizer
## (in scikit-learn 1.0 and later, use get_feature_names_out() instead of get_feature_names())
feature_ranks = sorted(zip(svm_clf.coef_[0], unigram_tfidf_vectorizer.get_feature_names()))

## get the 10 features that are the best indicators of very negative sentiment
## (largest positive weights, at the bottom of the ranked list)
very_negative_10 = feature_ranks[-10:]
print("positive")
for item in very_negative_10:
    print(item)
print()

## get the 10 features that are the strongest indicators against "very negative" sentiment
## (most negative weights, at the top of the ranked list)
not_very_negative_10 = feature_ranks[0:10]
print("negative")
for item in not_very_negative_10:
    print(item)
print()


Note: the output below was recorded with the weights paired against the feature names of gram12_count_vectorizer, whose vocabulary is larger than that of the TF-IDF model. zip() stops at the shorter sequence, so these names do not correspond to the weights; pairing the weights with unigram_tfidf_vectorizer as in the code above yields different features.

positive
(2.098748016306035, 'excuse pair')
(2.172740688849538, 'formula level')
(2.179837148708545, 'formula comedy')
(2.180049840739211, 'blind date')
(2.2009494768298765, 'filming teeming')
(2.21238243968794, 'filled raw')
(2.238138902154803, 'credit')
(2.462147786384405, 'excuse demonstrate')
(2.5442704628170834, 'black white')
(2.5634850970395875, 'blank yearning')

negative
(-1.5438531996978093, 'characters angles')
(-1.321165559374924, 'auteur tsai')
(-1.1453612579630867, 'firebrand')
(-1.1388849769574414, '7th century')
(-1.0972851444204532, 'combustible')
(-1.0915542640535807, 'corniest hackneyed')
(-1.0786297052259861, 'dramatically substantial')
(-1.0635195925357532, 'brisk 85')
(-1.0473708386683755, 'cumbersome 3d')
(-1.0303677748214903, 'cute')
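
Feature ranking only makes sense when the weights and the feature names come from the same vectorizer. The sketch below shows the idea on a made-up four-document corpus with two classes (1 = positive, 0 = negative); all names prefixed with `toy_` are hypothetical. The feature names are built from `vocabulary_` in column order, which avoids depending on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# made-up toy corpus (assumption): label 1 = positive, 0 = negative
toy_docs = ["good movie", "great film", "bad movie", "awful film"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)
toy_clf = LinearSVC(C=1, random_state=0).fit(toy_X, toy_labels)

# feature names in column order, taken from the same vectorizer that built toy_X
toy_names = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

# sanity check: one weight per feature name, so zip() does not truncate anything
assert toy_clf.coef_.shape[1] == len(toy_names)

# with two classes there is a single row of weights;
# positive weights point to class 1, negative weights to class 0
toy_ranks = sorted(zip(toy_clf.coef_[0], toy_names))
toy_weights = dict((name, weight) for weight, name in toy_ranks)
for weight, name in toy_ranks:
    print(round(float(weight), 3), name)
```

The length check before `zip()` is a cheap guard against pairing weights with the vocabulary of a different vectorizer.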



In [9]:
## interpreting MultinomialNB models
## http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

## MultinomialNB estimates, for each category, the log probability of each feature given that category
## feature_log_prob_[0] holds these log probabilities for category "0" (very negative)
## (coef_ exposed the same values in older scikit-learn releases but was deprecated and later removed)
## nb_clf was trained on the TF-IDF vectors, so the values must be paired with the feature names of unigram_tfidf_vectorizer
## (in scikit-learn 1.0 and later, use get_feature_names_out() instead of get_feature_names())
feature_ranks = sorted(zip(nb_clf.feature_log_prob_[0], unigram_tfidf_vectorizer.get_feature_names()))

## get the 10 features with the highest log probability in the very negative category
## (they are at the bottom of the ranked list)
very_negative_10 = feature_ranks[-10:]
print("positive")
for item in very_negative_10:
    print(item)
print()

## get the 10 features with the lowest log probability in the very negative category
## (they are at the top of the ranked list); features that never occur in the category
## all share the same smoothed value
not_very_negative_10 = feature_ranks[0:10]
print("negative")
for item in not_very_negative_10:
    print(item)
print()


Note: the output below was recorded with the values paired against the feature names of gram12_count_vectorizer, whose vocabulary is larger than that of the TF-IDF model. zip() stops at the shorter sequence, so these names do not correspond to the values; pairing the values with unigram_tfidf_vectorizer as in the code above yields different features.

positive
(-6.645979011120781, 'fashioned emotionally')
(-6.62020923362483, 'blowout')
(-6.60369986228579, 'crystallize')
(-6.517311233308192, 'bored')
(-6.355845020937053, 'conceal')
(-6.131355008437069, 'formula comedy')
(-6.029810232070883, 'controversy')
(-5.792621365210278, 'capture visual')
(-5.411480448479903, 'affirming message')
(-5.186697744581525, 'daily')

negative
(-9.958278961750986, '10 minutes')
(-9.958278961750986, '100')
(-9.958278961750986, '101 premise')
(-9.958278961750986, '102 minute')
(-9.958278961750986, '103')
(-9.958278961750986, '105')
(-9.958278961750986, '105 minutes')
(-9.958278961750986, '10th')
(-9.958278961750986, '10th grade')
(-9.958278961750986, '11')



# Step 5: Test the LinearSVC classifier

In [14]:
list(X_test_gram.shape)
print(y_test.shape)

(62424,)


In [16]:

# print confusion matrix and classification report

from sklearn.metrics import confusion_matrix
y_pred = svm_clf.predict(X_test_tfidf)
cm=confusion_matrix(y_test, y_pred, labels=[0,1,2,3,4])
print(cm)
print()

from sklearn.metrics import classification_report
target_names = ['0','1','2','3','4']
print(classification_report(y_test, y_pred, target_names=target_names))

[[  795  1387   624   117     8]
 [  589  4336  5245   629    25]
 [  163  2299 26557  2684   161]
 [   24   408  5604  6220   812]
 [    2    40   551  2010  1134]]

             precision    recall  f1-score   support

          0       0.51      0.27      0.35      2931
          1       0.51      0.40      0.45     10824
          2       0.69      0.83      0.75     31864
          3       0.53      0.48      0.50     13068
          4       0.53      0.30      0.39      3737

avg / total       0.61      0.63      0.61     62424
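
The numbers in the classification report can be recomputed from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below copies the LinearSVC confusion matrix printed above into an array: recall divides each diagonal entry by its row sum, and precision divides it by its column sum.

```python
import numpy as np

# LinearSVC confusion matrix copied from the output above
cm_svm = np.array([
    [795, 1387, 624, 117, 8],
    [589, 4336, 5245, 629, 25],
    [163, 2299, 26557, 2684, 161],
    [24, 408, 5604, 6220, 812],
    [2, 40, 551, 2010, 1134],
])

correct = np.diag(cm_svm)            # correctly classified examples per category
support = cm_svm.sum(axis=1)         # number of true examples per category
recall = correct / support           # share of each true category that was found
precision = correct / cm_svm.sum(axis=0)  # share of each predicted category that was right
accuracy = correct.sum() / cm_svm.sum()

print(support)
print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 4))
```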



In [18]:

# print confusion matrix and classification report

from sklearn.metrics import confusion_matrix
y_pred = nb_clf.predict(X_test_tfidf)
cm=confusion_matrix(y_test, y_pred, labels=[0,1,2,3,4])
print(cm)
print()

from sklearn.metrics import classification_report
target_names = ['0','1','2','3','4']
print(classification_report(y_test, y_pred, target_names=target_names))

[[  107  1144  1613    67     0]
 [   61  2580  7821   361     1]
 [   19  1168 28673  1987    17]
 [    0   147  7942  4883    96]
 [    0    11  1374  2164   188]]

             precision    recall  f1-score   support

          0       0.57      0.04      0.07      2931
          1       0.51      0.24      0.33     10824
          2       0.60      0.90      0.72     31864
          3       0.52      0.37      0.43     13068
          4       0.62      0.05      0.09      3737

avg / total       0.57      0.58      0.53     62424



# Step 5.1 Interpret the prediction result

In [20]:
## get the confidence scores for all test examples from each of the five binary classifiers
svm_confidence_scores = svm_clf.decision_function(X_test_tfidf)
## get the confidence scores for the first test example
print(svm_confidence_scores[0])

## sample output: array([-1.05306321, -0.62746206,  0.31074854, -0.89709483, -1.08343089])
## because the confidence score is the highest for category 2, 
## the prediction should be 2. 

## Confirm by printing out the actual prediction for the first test example
print(svm_clf.predict(X_test_tfidf[0])[0])

[-1.014884   -0.38032099  0.16541523 -0.97047589 -1.23293783]
2
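
The step from confidence scores to a predicted label is an argmax over the five one-vs-rest scores. The sketch below copies the scores printed above for the first test example; the `classes` array is written out by hand on the assumption that it matches `svm_clf.classes_`, which lists the labels in sorted order.

```python
import numpy as np

# confidence scores of the first test example, copied from the output above
scores_first = np.array([-1.014884, -0.38032099, 0.16541523, -0.97047589, -1.23293783])

# label order assumed to match svm_clf.classes_ (sorted labels)
classes = np.array([0, 1, 2, 3, 4])

# the predicted label is the class with the highest score
predicted = int(classes[np.argmax(scores_first)])
print(predicted)  # 2
```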


# Step 5.2 Error Analysis

In [21]:
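# print out a specific type of error for further analysis

# print out the positive examples (label 3) that are mistakenly predicted as very negative (label 0)
# the count should equal the entry in row 3, column 0 of the confusion matrix of the model that produced y_pred
# here y_pred holds the MNB predictions computed above, and that entry is 0
# note if you use a different model or vectorizer option, your result might be different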

err_cnt = 0
for i in range(0, len(y_test)):
    if(y_test[i]==3 and y_pred[i]==0):
        print(X_test[i])
        err_cnt = err_cnt+1
print("errors:", err_cnt)

errors: 0
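
The same selection can be written with a NumPy boolean mask instead of a loop. The sketch below uses small made-up arrays (`texts_demo`, `y_true_demo`, `y_pred_demo` are hypothetical) to pick out the examples whose true label is 3 and whose predicted label is 0.

```python
import numpy as np

# made-up texts, true labels, and predictions (assumption)
texts_demo = np.array(["text a", "text b", "text c", "text d", "text e"])
y_true_demo = np.array([3, 3, 0, 4, 3])
y_pred_demo = np.array([0, 3, 0, 0, 0])

# True where a positive example (3) was predicted as very negative (0)
mask = (y_true_demo == 3) & (y_pred_demo == 0)

errors = texts_demo[mask]
for text in errors:
    print(text)
print("errors:", int(mask.sum()))  # errors: 2
```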



# Step 6: write the prediction output to file

In [24]:
y_pred=svm_clf.predict(X_test_tfidf)
# write one predicted label per line; the with-block closes the file automatically
with open("/Users/kenmckee/Desktop/GS/S18/tm/ks/linearSVC_prediction_output.csv", 'w') as output:
    for value in y_pred:
        output.write(str(value) + '\n')

# Step 6.1 Prepare submission to Kaggle sentiment classification competition

In [26]:
########## submit to Kaggle submission

# we are still using the model trained on 60% of the training data
# you can re-train the model on the entire data set 
#   and use the new model to predict the Kaggle test data
# below is sample code for using a trained model to predict Kaggle test data 
#    and format the prediction output for Kaggle submission

# read in the test data
kaggle_test=p.read_csv("/Users/kenmckee/Desktop/GS/S18/tm/ks/test.tsv", delimiter='\t') 

# preserve the id column of the test examples
kaggle_ids=kaggle_test['PhraseId'].values

# read in the text content of the examples
kaggle_X_test=kaggle_test['Phrase'].values

# vectorize the test examples using the vocabulary fitted from the 60% training data
# the vectorizer must already be fitted and must be the one used to train the classifier;
# unigram_count_vectorizer was never fitted, so calling its transform() raises
# "NotFittedError: CountVectorizer - Vocabulary wasn't fitted."
kaggle_X_test_vec=gram12_count_vectorizer.transform(kaggle_X_test)

# predict using a LinearSVC classifier trained on the matching unigram and bigram count vectors
kaggle_pred=LinearSVC(C=1).fit(X_train_gram, y_train).predict(kaggle_X_test_vec)

# combine the test example ids with their predictions
kaggle_submission=zip(kaggle_ids, kaggle_pred)

# prepare output file
outf=open('/Users/byu/Desktop/data/kaggle/kaggle_submission_linearSVC.csv', 'w')

# write header
outf.write('PhraseId,Sentiment\n')

# write predictions with ids to the output file
for phrase_id, sentiment in kaggle_submission:
    outf.write(str(phrase_id) + ',' + str(sentiment) + '\n')

# close the output file
outf.close()
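
As an alternative to writing the file line by line, the ids and predictions can be put into a pandas DataFrame and written with `to_csv`. This is only a sketch: the two ids and predictions are made up, and an in-memory buffer stands in for the output file.

```python
import io
import pandas as pd

# made-up ids and predictions (assumption), standing in for kaggle_ids and kaggle_pred
demo_ids = [156061, 156062]
demo_pred = [2, 3]

submission = pd.DataFrame({"PhraseId": demo_ids, "Sentiment": demo_pred})

# index=False keeps the row index out of the file; the header row is written automatically
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
print(lines)
```

Passing a file path instead of the buffer writes the same content to disk.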

# Exercise C

In [None]:
# generate your Kaggle submissions with boolean representation and TF representation
# submit to Kaggle
# report your scores here
# which model gave better performance in the hold-out test
# which model gave better performance in the Kaggle test

# Your code starts here

# Your code ends here

# Exercise D

In [None]:
# Use the entire Kaggle training data set to build a LinearSVC model
# then do 3-fold CV for the LinearSVC model that you built

# Your code starts here

# Your code ends here
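
As a starting point for the exercise, the sketch below shows the general shape of 3-fold cross validation with `cross_val_score` on a made-up toy corpus; it is not a solution for the Kaggle data. Putting the vectorizer and the classifier into one pipeline is a design choice that makes each fold fit its vocabulary on that fold's training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# made-up toy corpus (assumption): 6 positive and 6 negative phrases
cv_docs = [
    "good movie", "great film", "good story", "great acting", "good fun", "great cast",
    "bad movie", "awful film", "bad story", "awful acting", "bad plot", "awful cast",
]
cv_labels = [1] * 6 + [0] * 6

# vectorizer and classifier are refitted together inside every fold
pipeline = make_pipeline(CountVectorizer(), LinearSVC(C=1))

# cv=3 gives three folds; for a classifier the folds are stratified by label
scores = cross_val_score(pipeline, cv_docs, cv_labels, cv=3)
print(scores)
print(scores.mean())
```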