## Support vector machines

**Data** [Gender-annoted dataset of European parliament talks](https://www.kaggle.com/ellarabi/europarl-annotated-for-speaker-gender-and-age)

**Overreaching question** Can we develop a model which correctly predicts speakers' based on what they are saying?

## Data management

We connect the variable of interest into the textual data each speaker has said.
That data is stored as XML, so we need to do a bit of work before we can easily use it.
Also, transform the textual data to a feature matrix.

In [4]:
from bs4 import BeautifulSoup
metadata = open('./data/europarlament/europarl.de-en.dat', encoding='utf8').readlines()
all_texts = open('./data/europarlament/europarl.de-en.en.aligned.tok', encoding='utf8').readlines()

# 292724 rows

## check that both files have same number of rows
assert len(metadata) == len(all_texts)

## this time processign these takes already some time, so let's choose a random set of 1000 messages already now

import random

random.seed(1)

selected_lines = random.sample( range( len( metadata ) ) , k = 10000 )

print( metadata[0] )

from bs4 import BeautifulSoup

genders = []
selected_texts = []

for line in selected_lines:
    
    md = BeautifulSoup( metadata[ line ] )
    genders.append( md.line['gender'] )
    
    selected_texts.append( all_texts[ line ] )
    

print( len( genders ) )
print( len( selected_texts ) )

<LINE COUNT="1" EUROID="4550" NAME="Evans, Robert J" LANGUAGE="EN" GENDER="FEMALE" DATE_OF_BIRTH="8 May 1959" SESSION_DATE="00-01-17" AGE="40"/>

10000
10000


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer()
document_term_matrix = tf_vectorizer.fit_transform( selected_texts )

## Separate the train-test split

This is used later in the analysis to ensure we do not [overfit](https://en.wikipedia.org/wiki/Overfitting) the data when we train the machine learning classifier.
We choose to use 20% of data for testing.

In [6]:
from sklearn.model_selection import train_test_split

label_train, label_test, data_train, data_test = train_test_split( genders, document_term_matrix, test_size = .2 )

# Run and evaluate machine learning tasks

We now train the model using the **training** data and measure how well accuracy we achieved by examining **test data**.

In [7]:
from sklearn import svm

model = svm.SVC(kernel='linear') # Linear Kernel, default settings
model.fit( data_train, label_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [8]:
from sklearn import metrics
## check how well we did for testing data
label_test_pred = model.predict( data_test )
print( metrics.accuracy_score( label_test, label_test_pred ) )

0.5715


In [30]:
# understand predictions

predictors = {}

for i, name in enumerate( tf_vectorizer.get_feature_names() ):
    predictors[name] = i
    
for name, value in predictors.items():
    predictors[name] = model.coef_[0, value ]
    
print( predictors )



## Tasks

* Run the code as is and interprent the accuracy. What does that mean?

The accuracy in the latest run was about 0.57 which means that the model classifies 57% of the cases right. 

* Examine different metrics for [classification accuracy](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

In [12]:
from sklearn.metrics import classification_report

print(classification_report(label_test, label_test_pred))

print( metrics.confusion_matrix( label_test, label_test_pred ) )

              precision    recall  f1-score   support

      FEMALE       0.38      0.34      0.35       701
        MALE       0.66      0.70      0.68      1299

    accuracy                           0.57      2000
   macro avg       0.52      0.52      0.52      2000
weighted avg       0.56      0.57      0.57      2000

[[235 466]
 [391 908]]


* Fix issues in the text pre-processing: account for stop words, frequent terms and stem content in the document-term-matrix: does it have any implications on accuracy?

The code in the following sections removes stop words, frequent terms and stems the content: 

In [5]:
metadata = open('./data/europarlament/europarl.de-en.dat', encoding='utf8').readlines()
all_texts = open('./data/europarlament/europarl.de-en.en.aligned.tok', encoding='utf8').readlines()

## check that both files have same number of rows
assert len(metadata) == len(all_texts)

In [30]:
from collections import OrderedDict
from itertools import islice
import random

import nltk
from nltk.stem.snowball import EnglishStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

stemming_tool = EnglishStemmer()
regexp_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
stopwords = set(stopwords.words('english'))

random.seed(1)

selected_lines = random.sample( range( len( metadata ) ) , k = 10000 )

In [53]:
from bs4 import BeautifulSoup

genders = []
selected_texts = []
selected_texts_raw = []
word_frequencies = {}

for line in selected_lines:
    lowered_line = ''
    stemmed_line = ''
    
    selected_texts_raw.append( all_texts[ line ] )
    
    md = BeautifulSoup( metadata[ line ] )
    genders.append( md.line['gender'] )
    
    # Remove non-albhabetic words from the line: 
    tokenized_line = regexp_tokenizer.tokenize(all_texts[line])
    
    for word in tokenized_line:
        lowered_line += word.lower() + ' '
    
    # Remove the stop words from the line:
    line_without_stopwords = [word for word in word_tokenize(lowered_line) if word not in stopwords]
    
    #Stem the line:
    for word in line_without_stopwords:
    
        stemmed_word = stemming_tool.stem(word)
    
        if stemmed_word in word_frequencies:
            word_frequencies[stemmed_word] += 1
        else: 
            word_frequencies[stemmed_word] = 1
            
        stemmed_line += stemmed_word + ' '
    
    selected_texts.append( stemmed_line )

# Find out what words are the most frequently used words:
ordered_word_freq = OrderedDict(sorted(word_frequencies.items(), key = lambda x: x[1], reverse = True))
freq_words_n = int(len(ordered_word_freq)*(0.02)) # define which % of the words are considered to be frequent words
most_freq_words = list(islice(ordered_word_freq, freq_words_n))
non_freq_words = []

# Find out what words are least frequently used: 
for key in ordered_word_freq:
    if ordered_word_freq[key] <= 1: # add all words that frequence is under to the ignore words -list
        non_freq_words.append(key)

ignore_words = most_freq_words

ignore_words.extend(non_freq_words)

final_selected_texts = []

# Remove the most/less frequently used words from the texts:
for line in selected_texts:
    new_line = ''
    tokenized_line = word_tokenize(line)
    
    for word in tokenized_line:
        if word not in ignore_words:
            new_line += word + ' '

    final_selected_texts.append(new_line)
    
print(len(selected_texts_raw))
print(len(final_selected_texts))

10000
10000


Comparing the accuracy with "raw data" to "processed data":  

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

# Comparing the same data without and with preprocessing it

tf_vectorizer = CountVectorizer()
raw_document_term_matrix = tf_vectorizer.fit_transform( selected_texts_raw )

final_tf_vectorizer = CountVectorizer()
final_document_term_matrix = final_tf_vectorizer.fit_transform( final_selected_texts )

raw_label_train, raw_label_test, raw_data_train, raw_data_test = train_test_split( genders, raw_document_term_matrix, 
                                                                                test_size = .2 )
final_label_train, final_label_test, final_data_train, final_data_test = train_test_split( genders, final_document_term_matrix, 
                                                                                          test_size = .2 )

raw_model = svm.SVC(kernel='linear') # Linear Kernel, default settings
raw_model.fit( raw_data_train, raw_label_train )

final_model = svm.SVC(kernel='linear') # Linear Kernel, default settings
final_model.fit( final_data_train, final_label_train )

raw_label_test_pred = raw_model.predict( raw_data_test )
final_label_test_pred = final_model.predict( final_data_test )

print(classification_report(raw_label_test, raw_label_test_pred))
print( metrics.accuracy_score( raw_label_test, raw_label_test_pred ) )
print( metrics.confusion_matrix( raw_label_test, raw_label_test_pred ) )
print()
print(classification_report(final_label_test, final_label_test_pred))
print( metrics.accuracy_score( final_label_test, final_label_test_pred ) )
print( metrics.confusion_matrix( final_label_test, final_label_test_pred ) )

              precision    recall  f1-score   support

      FEMALE       0.36      0.31      0.33       717
        MALE       0.64      0.70      0.67      1283

    accuracy                           0.56      2000
   macro avg       0.50      0.50      0.50      2000
weighted avg       0.54      0.56      0.55      2000

0.5585
[[221 496]
 [387 896]]

              precision    recall  f1-score   support

      FEMALE       0.41      0.27      0.33       724
        MALE       0.65      0.77      0.71      1276

    accuracy                           0.59      2000
   macro avg       0.53      0.52      0.52      2000
weighted avg       0.56      0.59      0.57      2000

0.592
[[197 527]
 [289 987]]


After a few runs with different parameters, it seemed that preprocessing increased accuracy a little. However, after some point the accuracy became better because the classifier started to classify more and more cases as males. It seems that preprocessing makes the accuracy better, however, one of the reason for this is that at some point it just classifies more cases as males and since in the data seemed to be more males (for example, in one case 1276 males and 724 females) and this lead to that if you could classify all the cases as males, you would have the accuracy of 0.638. It is relatively good accuracy, but model like that would be quite useless. 

* Predictors includes each feature (as a key) and how good the variable was for said problem (as a value). Extract from this the best predictors.

In [15]:
predictors = {}

for i, name in enumerate( final_tf_vectorizer.get_feature_names() ):
    predictors[name] = i
    
for name, value in predictors.items():
    predictors[name] = model.coef_[0, value ]

ordered_predictors = {}
    
ordered_predictors = OrderedDict(sorted(predictors.items(), key = lambda x: abs(x[1]), reverse = True))

# best predictors (negative femal and positive male): 

for key, value in ordered_predictors.items():
    if abs(value) > 1.2:
        print(key, ': ', value)

mental :  -1.6591783920124497
undu :  -1.6266990214710628
serb :  -1.5952651487715381
announc :  -1.5672778548813888
algeria :  1.5648609759615573
cool :  -1.5454151427287277
tragic :  1.5264965047491117
depress :  1.4552451721244997
vienna :  -1.4301988174672993
late :  -1.379099341591621
lloyd :  1.3654943410585418
mismanag :  -1.3512433346116888
entir :  -1.3272740134232974
cold :  -1.2943583062786261
privat :  -1.2683886229636336
mayer :  -1.2633257412193393
elder :  1.250828306144851
strong :  -1.2438928953545858
remuner :  -1.2423825074436812
harmon :  1.2352436629115402
cage :  1.233900518095694
igc :  -1.231529785848558
percentag :  1.2274154289070096
licenc :  -1.2148511505776516


* Count the number of different labels in the dataset of 10,000 comments. What can you observe?

In [21]:
from collections import Counter

print(Counter(genders))
    

Counter({'MALE': 6508, 'FEMALE': 3492})


We observe that there are more males in this dataset. 

* Modify the code to use [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) model and SVM model. Which one seems to work better?

In [66]:
## Apparently we are working with data that calls for multinomial Bayesian model 

from sklearn.naive_bayes import MultinomialNB
multinomial_classifier = MultinomialNB()
multinomial_classifier.fit(final_data_train, final_label_train)

final_label_test_multinomial_pred = multinomial_classifier.predict( final_data_test )

print(classification_report(final_label_test, final_label_test_multinomial_pred))
print( metrics.accuracy_score( final_label_test, final_label_test_multinomial_pred ) )
print( metrics.confusion_matrix( final_label_test, final_label_test_multinomial_pred ) )

## I understood the SVM model part to mean that I should change the kernel - in the following section
## I will run the analysis with RBF kernel: 

rbf_model = svm.SVC(gamma='scale', kernel='rbf')
rbf_model.fit( final_data_train, final_label_train )

final_label_test_rbf_model = rbf_model.predict( final_data_test )

print(classification_report(final_label_test, final_label_test_rbf_model))
print( metrics.accuracy_score( final_label_test, final_label_test_rbf_model ) )
print( metrics.confusion_matrix( final_label_test, final_label_test_rbf_model ) )

              precision    recall  f1-score   support

      FEMALE       0.44      0.28      0.34       724
        MALE       0.66      0.80      0.72      1276

    accuracy                           0.61      2000
   macro avg       0.55      0.54      0.53      2000
weighted avg       0.58      0.61      0.58      2000

0.6085
[[ 200  524]
 [ 259 1017]]
              precision    recall  f1-score   support

      FEMALE       0.72      0.03      0.06       724
        MALE       0.64      0.99      0.78      1276

    accuracy                           0.64      2000
   macro avg       0.68      0.51      0.42      2000
weighted avg       0.67      0.64      0.52      2000

0.6445
[[  21  703]
 [   8 1268]]


With rbf the model accuracy improves, but at the same time we are once again in the situation where the model's accuracy increases since it tends to classify more and more cases as males. 

# Advanced magics

* There are many different ways to build a models using various supervised machine learning methods.
One can use different parameters of methods. This is known as *tuning* the model and can improve models' performance in terms 
of accuracy.
* [Grid search](https://scikit-learn.org/stable/modules/grid_search.html) is an approach to examine different parameters and examine what paremeters lead to best models.
* You can also work on data preprocessing to [scale them](https://scikit-learn.org/stable/modules/preprocessing.html) or try to more acressively to clean or remove data.

In [6]:
## defining parameters for different models
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

In [7]:
from sklearn.model_selection import GridSearchCV

many_models = GridSearchCV( svm.SVC(), param_grid )
many_models.fit( data_train, label_train )

print( many_models )



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
                         {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                          'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


* We have used a binary variable (male/female), however support vector machines can be used to [multi-category classification](https://scikit-learn.org/stable/modules/svm.html#multi-class-classification) or [linear variables through regression models](https://scikit-learn.org/stable/modules/svm.html#regression).

* If doing category classification, the algorithm is senstive to inbalances between classification, i.e. if there are more cases belonging to Category 1 than in Category 2. 

In [68]:
model = svm.SVC(kernel='linear', class_weight='balanced') # Linear Kernel, default settings
model.fit( final_data_train, final_label_train)

SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [69]:
balanced_label_test_pred = model.predict( final_data_test )
print(classification_report(final_label_test, balanced_label_test_pred))
print( metrics.accuracy_score( final_label_test, balanced_label_test_pred ) )
print( metrics.confusion_matrix( final_label_test, balanced_label_test_pred ) )

              precision    recall  f1-score   support

      FEMALE       0.39      0.44      0.42       724
        MALE       0.66      0.61      0.64      1276

    accuracy                           0.55      2000
   macro avg       0.53      0.53      0.53      2000
weighted avg       0.56      0.55      0.56      2000

0.551
[[319 405]
 [493 783]]


### Tasks

* Try different grid search parameters, see if your accuracy metric improve.


In [72]:
from sklearn.model_selection import GridSearchCV

param_grid = [
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['linear'], 'class_weight': ['balanced']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'class_weight': ['balanced']},

 ]

many_models = GridSearchCV( svm.SVC(), param_grid )
many_models.fit( final_data_train, final_label_train )

print( many_models )



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid=[{'C': [1, 10, 100, 1000], 'class_weight': ['balanced'],
                          'gamma': [0.001, 0.0001], 'kernel': ['linear']},
                         {'C': [1, 10, 100, 1000], 'class_weight': ['balanced'],
                          'gamma': [0.001, 0.0001], 'kernel': ['rbf']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)


In [48]:
import pandas as pd
results_df = pd.DataFrame(many_models.cv_results_)

Ranking the models: 

In [56]:
print(results_df.columns)
print(results_df[['rank_test_score', 'params']].sort_values(by='rank_test_score').iloc[0:10])

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_C', 'param_kernel', 'param_gamma', 'param_class_weight',
       'params', 'split0_test_score', 'split1_test_score', 'split2_test_score',
       'mean_test_score', 'std_test_score', 'rank_test_score'],
      dtype='object')
    rank_test_score                                             params
9                 1       {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
6                 2         {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
4                 3          {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
5                 3         {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
7                 3        {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
11                6      {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
8                 7        {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
18                8  {'C': 10, 'class_weight': 'balanced', 'gamma':...
21                9  {'C': 100, 'class_weight':

It seems that the model with parameters C: 100, gamma 0.0001, kernel: rbf is ranked the best. 

In [75]:
best_model = svm.SVC(C = 100, gamma=0.0001, kernel='rbf') 

In [76]:
best_model.fit( final_data_train, final_label_train)

best_model_pred = best_model.predict( final_data_test )

print(classification_report(final_label_test, best_model_pred))
print( metrics.accuracy_score( final_label_test, best_model_pred ) )
print( metrics.confusion_matrix( final_label_test, best_model_pred ) )

              precision    recall  f1-score   support

      FEMALE       1.00      0.00      0.01       724
        MALE       0.64      1.00      0.78      1276

    accuracy                           0.64      2000
   macro avg       0.82      0.50      0.39      2000
weighted avg       0.77      0.64      0.50      2000

0.639
[[   2  722]
 [   0 1276]]


The "best ranked" model seems to be quite good "in theory", however, the reason for good accuracy is that it defines most of the cases as males. However, we could attempt to balance it. 

* Does balancing improve accuracy with our data

In [80]:
balanced_model = svm.SVC(C = 100, gamma=0.0001, kernel='rbf', class_weight='balanced') 

balanced_model.fit( final_data_train, final_label_train)
balanced_test_pred = balanced_model.predict( final_data_test )


print(classification_report(final_label_test, balanced_test_pred))
print( metrics.accuracy_score( final_label_test, balanced_test_pred ) )
print( metrics.confusion_matrix( final_label_test, balanced_test_pred ) )


              precision    recall  f1-score   support

      FEMALE       0.41      0.35      0.38       724
        MALE       0.66      0.72      0.69      1276

    accuracy                           0.59      2000
   macro avg       0.54      0.53      0.53      2000
weighted avg       0.57      0.59      0.58      2000

0.5855
[[254 470]
 [359 917]]


 Balancing did not improve accuracy. However it removed the problem of classifying all the cases as males. 

* Use age variable to develop a regression model.

In [58]:
selected_lines = random.sample( range( len( metadata ) ) , k = 10000 )

ages = []
    
for line in selected_lines:
    
    md = BeautifulSoup( metadata[ line ] )
    ages.append( md.line['age'] )


In [79]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn import metrics


final_tf_vectorizer = CountVectorizer()
final_document_term_matrix = final_tf_vectorizer.fit_transform( final_selected_texts )

age_final_label_train, age_final_label_test, age_final_data_train, age_final_data_test = train_test_split( ages, 
                                                                                             final_document_term_matrix, 
                                                                                                          test_size = .2 )
regressor_machine = svm.SVR(kernel='linear')

regressor_machine.fit(age_final_data_train, age_final_label_train)
pred_age_regressor = regressor_machine.predict(age_final_data_test)

[53.18456568 51.0544255  52.74219454 ... 49.63574318 48.96975777
 57.3227156 ]


Apparently one way to evaluate the model is to use the score-function: 

In [97]:
confidence = regressor_machine.score(age_final_data_test, list(map(int, age_final_label_test)))

print(confidence)

-0.15377822437654443


Negative R^2 of self.predict(X) means that our model does worse than if it would just take the mean value. Perhaps our data should have been larger or it should have been  pre-processed better. 

## Some reflections

This exercise highlighted two aspects of machine learning methods. The biggest question was how much does the size of the data that is used to build the model affects the model? The model that was created in this exercise was based only on 10000 (of 292724 possible) examples. How much, for example, using 100000 cases would have affected the performance of the model (of course one way to know would be to "just try it" -- which I tried -- but after some time it felt like it would never finish and I gave up on this experiment)? Another question related to data was how much the imbalance of the data (meaning that there were more males than females) affected the accuracy of the model. These aspects highlighted the importance of the pre-processing of data and the "handcraft" involved when using machine learning methods. This became apparent when I searched information on the best practices of preprocessing and the general impression which I got was that "it depends on various factors..". It could perhaps be said that although machine learning methods give outputs that are somehow mathematically always right, the input (and the parameters used) is always the product of the various descisions made by the researcher. 

As a someone who do not have that much knowledge about machine learning (and quantitative methods in general) this exercise also made visible an interesting aspect of using machine learning methods. That is the computational resources that some of these methods seem to require (at least the Big O of SVM is described as being somehwere betseen n^2 and n^3 which, according to bigocheatsheet.com, can be described as "horrible"). This adds a new element to the practice of doing social scientific research. That is, social scientists having to buy or find computational resources from somewhere (or at least my understanding is that when doing the more traditional quantitative reserach it is really rare that one has to find additional computational resources since most of the statistical analysis are still possible in reasonable time with home computers). For me, this raises questions such us, how much does this additional computational power costs? How it is possible to evaluate how much time building a SVM on a large set of data takes? This kind of things are (I think) aspects of social scientific research that most research have not had to think about before (or at least after computers that used punch cards became obsolete). 

Support vector machines also raised the question on interpretation, explanation and prediction -- that is, what do we actually find out when we have a model that is good at classifiying cases. Similarily to descision tress, I think that SVM are a good tool for prediction. In addition, at least with textual data, the values of predictors could perhaps give us on descriptions on the differences of, for example, cultures. I'd imagine that SVM could be used to test the observation that metaphors are different in different cultures with a large data set of texts. I would hypothesize that there should be "metaphorical" words with different predictor values if metaphors vary between cultures. Although the model I managed to create in the exercise was quite bad, SVM is quite interesting method.