## Support vector machines

**Data** [Gender-annotated dataset of European Parliament talks](https://www.kaggle.com/ellarabi/europarl-annotated-for-speaker-gender-and-age)

**Overarching question** Can we develop a model which correctly predicts speakers' gender, based on what they are saying?

## Data management

Let's create a dataset with the variable of interest and the textual data.
The data about gender is stored as XML, so we need to do a bit of work before we can easily use it.
The code below also transforms the text data into a feature matrix (a document-term matrix of word counts).

In [None]:
data_dir = './data/europarl-annotated-for-speaker-gender-and-age/europarl.de-en/'

# Read both files line by line; the `with` blocks close the files afterwards
with open( data_dir + 'europarl.de-en.dat' ) as f:
    metadata = f.readlines()
with open( data_dir + 'europarl.de-en.en.aligned.tok' ) as f:
    all_texts = f.readlines()

## Check that both files have the same number of rows
assert len(metadata) == len(all_texts)

## Processing the data takes some time, so let's choose a random set of 1000 messages to try initial modeling

import random
random.seed(1) # Set seed for reproducible results

selected_lines = random.sample( range( len( metadata ) ) , k = 1000 )

print( metadata[0] )


from bs4 import BeautifulSoup

genders = []
selected_texts = []

# Parse metadata
for line in selected_lines:
    
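    # Name the parser explicitly so the result does not depend on which parsers are installed
    md = BeautifulSoup( metadata[ line ], 'html.parser' )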
    genders.append( md.line['gender'] )
    
    selected_texts.append( all_texts[ line ] )
    

print( len( genders ) )
print( len( selected_texts ) )

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer()
document_term_matrix = tf_vectorizer.fit_transform( selected_texts )

## Create the train-test split

We hold out part of the data as a test set, so that later in the analysis we can check that the classifier did not [overfit](https://en.wikipedia.org/wiki/Overfitting) to the data it was trained on. Let's use 20% of the data for testing.

In [None]:
from sklearn.model_selection import train_test_split

label_train, label_test, data_train, data_test = train_test_split( genders, document_term_matrix, test_size = .2 )
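
The order of the returned values is easy to mix up: `train_test_split` returns a train/test pair for each input, in the order the inputs were given. The sketch below shows this on made-up data (all `toy_` names are assumptions for illustration). It also uses two optional arguments that the cell above does not use: `random_state`, which makes the split repeatable, and `stratify`, which keeps the class proportions the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Made-up data: even-numbered rows are labelled "female", odd-numbered rows "male"
toy_labels = ["female", "male"] * 5
toy_rows = [[i] for i in range(10)]

toy_label_train, toy_label_test, toy_row_train, toy_row_test = train_test_split(
    toy_labels, toy_rows,
    test_size=0.2,        # 2 of the 10 rows go to the test set
    random_state=1,       # fixed seed, so the same split is produced every time
    stratify=toy_labels,  # keep the female/male ratio equal in both parts
)

print(len(toy_label_train), len(toy_label_test))
print(toy_label_test, toy_row_test)
```

Labels and rows stay aligned after the split, so the first test label still belongs to the first test row.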

## Run and evaluate the SVM classifier

We now train the model using the **training** data and measure its performance using the **test** dataset.

In [None]:
from sklearn import svm

model = svm.SVC(kernel='linear') # Linear Kernel, default settings
model.fit( data_train, label_train )

In [None]:
from sklearn import metrics

## Check how well the model predicts test data
label_test_pred = model.predict( data_test )
print( metrics.accuracy_score( label_test, label_test_pred ) )
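
An accuracy score is easier to judge next to a baseline. The sketch below uses made-up labels and predictions (not results from our data) to compare a model's accuracy with the accuracy of always guessing the most common class, and prints a confusion matrix to show where the errors fall.

```python
from collections import Counter
from sklearn import metrics

# Made-up true labels and predictions, for illustration only
toy_true = ["male"] * 7 + ["female"] * 3
toy_pred = ["male"] * 6 + ["female"] + ["male"] * 2 + ["female"]

# Baseline: always predict the most common class
majority_label, majority_count = Counter(toy_true).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_true)

model_accuracy = metrics.accuracy_score(toy_true, toy_pred)
print(majority_label, baseline_accuracy, model_accuracy)

# Rows are true classes, columns are predicted classes, in the order given by `labels`
toy_confusion = metrics.confusion_matrix(toy_true, toy_pred, labels=["female", "male"])
print(toy_confusion)
```

In this toy case the model is right 7 times out of 10, which is exactly what always guessing "male" would achieve, so the accuracy alone says little about whether the model has learned anything.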

In [None]:
# Check the importance of different words for the predictions

# Map each term to its weight in the linear model
predictors = {}

for i, name in enumerate( tf_vectorizer.get_feature_names_out() ):
    predictors[name] = model.coef_[0, i]

print( predictors )
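
Printing the whole dictionary is hard to read with thousands of terms. One way to inspect it is to sort the terms by weight and look at both ends of the ranking. The sketch below does this on a small made-up dictionary (`toy_predictors` and its weights are placeholders, not results from the model).

```python
# Made-up term weights, standing in for the `predictors` dictionary
toy_predictors = {
    "word_a": 0.8,
    "word_b": 0.01,
    "word_c": -1.2,
    "word_d": 0.5,
    "word_e": -0.6,
}

# Sort terms from the most negative to the most positive weight
ranked = sorted(toy_predictors.items(), key=lambda item: item[1])

top_n = 2
most_negative = ranked[:top_n]
most_positive = ranked[-top_n:][::-1]  # reversed, so the largest weight comes first

print(most_negative)
print(most_positive)
```

For a binary linear SVM in scikit-learn, positive weights should push a prediction towards the second class listed in `model.classes_` and negative weights towards the first, so it is worth printing `model.classes_` before interpreting the signs.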

### Things to try

* Run the above code as is and interpret the accuracy. What does the score mean?
* Examine different metrics for [classification accuracy](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).
* Fix issues in the text pre-processing. Account for stop words and frequent terms, and stem the content in the document-term matrix. Does this have any influence on the model's accuracy?
* `predictors` maps each feature in the data (i.e., each term) to its weight in the model, which indicates how important the term was for the predictions. Extract and inspect the best predictor features.
* Modify the code to use a [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) model instead of SVM. Which model seems to work better?

## Advanced magics

* Let's now try to improve the model's performance through *tuning* its parameters.
* [Grid search](https://scikit-learn.org/stable/modules/grid_search.html) is an approach to systematically assess the performance of different modeling parameter values.
* You can also work on preprocessing to [scale](https://scikit-learn.org/stable/modules/preprocessing.html) the data, or try more aggressive cleaning or removal of data.

In [None]:
## Define parameter range for different models
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
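
Before running a grid search it helps to know how many models it will train. The sketch below repeats the grid under a different name (`example_grid`, chosen here only to avoid overwriting `param_grid`) and expands it with `ParameterGrid`, which lists every parameter combination the search will try.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, repeated so that this example runs on its own
example_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(example_grid))
print(len(candidates))

for params in candidates:
    print(params)
```

The linear part contributes 4 combinations and the RBF part 4 × 2 = 8, so there are 12 candidates in total. With the default 5-fold cross-validation of `GridSearchCV`, each candidate is fitted 5 times, which is why the search can take a while.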

In [None]:
from sklearn.model_selection import GridSearchCV

many_models = GridSearchCV( svm.SVC(), param_grid )
many_models.fit( data_train, label_train )

print( many_models )

In [None]:
# Print best parameter after tuning 
print(many_models.best_params_) 
  
# Print how our model looks after hyper-parameter tuning 
print(many_models.best_estimator_) 

In [None]:
## Check how well the best model predicts
label_test_pred = many_models.predict( data_test )
print( metrics.accuracy_score( label_test, label_test_pred ) )

* We have so far used a binary variable (male/female) as the target. However, support vector machines can also be used to perform [multi-category classification](https://scikit-learn.org/stable/modules/svm.html#multi-class-classification) or to predict [continuous variables through regression models](https://scikit-learn.org/stable/modules/svm.html#regression).

* If doing multi-category classification, the algorithm is sensitive to imbalances between classes, i.e. when there are more cases belonging to Category 1 than to Category 2.

* This can be fixed through weighting to balance the classes.
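
To get a feeling for what the weighting does, the sketch below computes the "balanced" weights for a made-up, imbalanced label list (the labels are an assumption for illustration). Each class gets the weight `n_samples / (n_classes * n_samples_in_class)`, so the rarer class counts for more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 "male" cases and 2 "female" cases
toy_labels = np.array(["male"] * 8 + ["female"] * 2)
toy_classes = np.unique(toy_labels)  # sorted: "female", "male"

toy_weights = compute_class_weight(
    class_weight="balanced", classes=toy_classes, y=toy_labels
)

weight_by_class = dict(zip(toy_classes.tolist(), toy_weights.tolist()))
print(weight_by_class)
```

Here "female" gets 10 / (2 × 2) = 2.5 and "male" gets 10 / (2 × 8) = 0.625, so an error on a "female" case costs four times as much as an error on a "male" case.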

In [None]:
model = svm.SVC(kernel='linear', class_weight='balanced') # Linear kernel, classes weighted inversely to their frequency
model.fit( data_train, label_train)

In [None]:
## Check how well we did for testing data
label_test_pred = model.predict( data_test )
print( metrics.accuracy_score( label_test, label_test_pred ) )

### Things to try

* Try different grid search parameters and see if your accuracy metric improves.
* Does balancing improve accuracy with our data?
* Use the age variable to develop a regression model.