## Support vector machines

**Data** [Gender-annotated dataset of European Parliament speeches](https://www.kaggle.com/ellarabi/europarl-annotated-for-speaker-gender-and-age)

**Overarching question** Can we develop a model that correctly predicts speakers' gender based on what they say?

## Data management

We need to link the variable of interest (the speaker's gender) to the text of each speech.
The metadata is stored as XML, so we need to do a bit of work before we can easily use it.
We also transform the textual data into a feature matrix.
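
To make the XML step concrete, here is a minimal sketch of reading attributes from a single metadata line. The sample line is made up for illustration: only the `LINE` element and the `GENDER` attribute are taken from the code in this notebook, while the `NAME` and `AGE` attributes and all values are assumptions, so the real entries may look different.

```r
library(XML)

## hypothetical metadata line; real entries in europarl.de-en.dat may differ
entry <- '<LINE NAME="Example Speaker" GENDER="female" AGE="52"/>'

## asText = TRUE tells the parser that `entry` is XML content, not a file name
doc  <- xmlParse( entry, asText = TRUE )
line <- xmlRoot( doc )

example_gender <- xmlGetAttr( line, "GENDER" )  ## a single attribute
example_attrs  <- xmlAttrs( line )              ## all attributes as a named character vector

print( example_gender )
print( example_attrs )
```

Inspecting `xmlAttrs` on one or two real lines is a quick way to see which attributes the dataset actually provides before writing the extraction function.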

In [None]:
metadata_all <- readLines('./data/europarl-annotated-for-speaker-gender-and-age/europarl.de-en/europarl.de-en.dat')
texts_all <- readLines('./data/europarl-annotated-for-speaker-gender-and-age/europarl.de-en/europarl.de-en.en.aligned.tok')

## processing all of these already takes some time, so let's choose a random sample of 1000 texts
set.seed(1)

all_ids <- 1:length( metadata_all )
selected_ids <- sample( all_ids, 1000 )


metadata <- metadata_all[ selected_ids ]
texts <- texts_all[ selected_ids ]

In [None]:
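library(XML)

## extract the GENDER attribute from one XML metadata line
clean <- function( entry ){
    xml <- xmlTreeParse( entry, asText = TRUE ) ## entry is XML content, not a file name
    return( xmlGetAttr( xml$doc$children$LINE  , "GENDER" ) )
}

gender <- sapply( metadata, FUN = clean )
gender <- unname( gender ) ## sapply names each result by its input string; drop those names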

In [None]:
library(quanteda)

corp <- corpus( texts )

token <- tokens( corp )

document_terms <- dfm( token )
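
As a rough illustration of what the resulting feature matrix looks like, the sketch below builds a document-term matrix from two made-up sentences. The texts and the names `toy_texts` and `toy_dfm` are assumptions for this example only.

```r
library(quanteda)

## two made-up sentences, only for illustration
toy_texts <- c( d1 = "we support the proposal", d2 = "we reject the proposal" )

toy_dfm <- dfm( tokens( corpus( toy_texts ) ) )

## rows are documents, columns are distinct terms, cells are term counts
print( toy_dfm )
print( dim( toy_dfm ) )
```

Each distinct term becomes one column, which is why the real matrix gets very wide and why model training on it is slow.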

In [None]:
data <- convert( document_terms, "data.frame" )
data$label_for_ml <- as.vector( gender ) ## adding the label we seek to learn to data
data$label_for_ml <- as.factor( data$label_for_ml )
data <- data[, -c(1)] ## this column is added when converting to data frame, but it is useless => remove from analysis

dim( data )

## Create the train-test split

This is used later in the analysis to ensure we do not [overfit](https://en.wikipedia.org/wiki/Overfitting) the data when we train the machine learning classifier.
We use 80% of the data for training and hold out the remaining 20% for testing.

In [None]:
library( caret )

trainIndex <- createDataPartition(data$label_for_ml, p = .8, list = FALSE)

dataTrain <- data[ trainIndex,]
dataTest  <- data[-trainIndex,]
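
`createDataPartition` samples within each class, so the class shares in the training and test parts should stay close to each other. The following sketch checks this on made-up labels; the data frame `toy` and its 30/70 class split are assumptions standing in for `data`.

```r
library(caret)

set.seed(1)

## made-up, imbalanced labels standing in for data$label_for_ml
toy <- data.frame( x = rnorm(100),
                   label = factor( rep( c("female", "male"), times = c(30, 70) ) ) )

idx <- createDataPartition( toy$label, p = .8, list = FALSE )
toy_train <- toy[ idx, ]
toy_test  <- toy[ -idx, ]

## class shares should be nearly identical in both parts
print( prop.table( table( toy_train$label ) ) )
print( prop.table( table( toy_test$label ) ) )
```

Running the same two `prop.table` lines on `dataTrain` and `dataTest` gives a quick sanity check of the real split.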

# Run and evaluate machine learning tasks

We now train the model on the **training data** and measure the accuracy it achieves on the **test data**.

In [None]:
model <- train( label_for_ml ~., data = dataTrain, method = "svmLinear")
## this prints a lot of warnings
print( model )

In [None]:
test_pred <- predict( model, newdata = dataTest )
postResample( test_pred, as.factor(dataTest$label_for_ml) )
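
An accuracy value is easier to interpret next to a simple baseline, such as always predicting the most frequent class. The sketch below uses made-up vectors (`obs` and `pred` are assumptions standing in for `dataTest$label_for_ml` and `test_pred`) and plain base R.

```r
## made-up labels and predictions, only for illustration
obs  <- factor( c("male", "male", "male", "female", "male", "female", "male", "male") )
pred <- factor( c("male", "male", "male", "male", "male", "female", "male", "female"),
                levels = levels( obs ) )

## share of correct predictions
accuracy <- mean( pred == obs )

## baseline: always predict the most frequent class
baseline <- max( table( obs ) ) / length( obs )

print( table( predicted = pred, observed = obs ) )
print( c( accuracy = accuracy, baseline = baseline ) )
```

In this toy case the classifier is no better than the baseline, even though its accuracy looks respectable. With the real objects, `confusionMatrix( test_pred, dataTest$label_for_ml )` from caret reports a comparable table together with the no-information rate.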

In [None]:
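## svmLinear has no model-specific importance measure, so varImp falls back to
## a model-independent filter computed for each predictor separately
importance <- varImp( model )

print( importance )
plot( importance, top = 20 ) ## show only the 20 most important terms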

### Tasks

* Run the code as is and interpret the accuracy. What does it mean?
* Examine different metrics for [classification accuracy](https://topepo.github.io/caret/measuring-performance.html).
* Fix issues in the text pre-processing: account for stop words and frequent terms, and stem the content of the document-term matrix. Does this have any implications for accuracy?
* The `importance` object contains each feature (as a key) and how useful that variable was for the problem (as a value). Extract the best predictors from it.
* Count the number of different labels in the sample of 1,000 texts. What do you observe?
* Modify the code to use a [Naive Bayes](https://topepo.github.io/caret/train-models-by-tag.html#bayesian-model) model as well as the SVM model. Which one seems to work better?
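
As a starting point for the pre-processing task, here is one possible sketch using quanteda on two made-up sentences. The texts, the English stop word list, and the `min_docfreq = 2` threshold are assumptions chosen for illustration, not recommended settings.

```r
library(quanteda)

## made-up texts, only for illustration
toy_texts <- c( "The members are voting on the proposals .",
                "The member voted against the proposal ." )

toks <- tokens( corpus( toy_texts ), remove_punct = TRUE )
toks <- tokens_remove( toks, stopwords("en") )   ## drop English stop words
toks <- tokens_wordstem( toks )                  ## reduce words to their stems

clean_dfm <- dfm( toks )
## keep only terms that occur in at least 2 documents
clean_dfm <- dfm_trim( clean_dfm, min_docfreq = 2 )

print( featnames( clean_dfm ) )
```

`dfm_trim` also accepts `max_docfreq`, which can be used to drop terms that appear in almost every document.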

# Advanced magics

* There are many different ways to build a model using various supervised machine learning methods.
One can use different parameters for each method. This is known as *tuning* the model and can improve the model's performance in terms of accuracy.
* [Grid search](https://topepo.github.io/caret/model-training-and-tuning.html#basic-parameter-tuning) is an approach that tries different parameter values and examines which parameters lead to the best models.
* You can also work on data preprocessing to [scale the data](https://topepo.github.io/caret/pre-processing.html#centering-and-scaling) or try to clean or remove data more aggressively.

In [None]:
## defining parameters for different models
grid <- expand.grid( "C" = c(1, 10, 100, 1000) )

In [None]:
model <- train( label_for_ml ~., data = dataTrain, method = "svmLinear", tuneGrid = grid )
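
`train` evaluates every row of the grid by resampling the training data and keeps the best-performing value. The sketch below makes the resampling explicit with 5-fold cross-validation. It runs on a two-class subset of the built-in `iris` data as a stand-in for `dataTrain`; the names `toy`, `ctrl`, `toy_grid` and `tuned` and the chosen `C` values are assumptions for this example.

```r
library(caret)

set.seed(1)

## toy two-class problem built from iris, standing in for dataTrain
toy <- droplevels( iris[ iris$Species != "setosa", ] )

ctrl     <- trainControl( method = "cv", number = 5 )   ## 5-fold cross-validation
toy_grid <- expand.grid( C = c(0.1, 1, 10) )

tuned <- train( Species ~ ., data = toy, method = "svmLinear",
                trControl = ctrl, tuneGrid = toy_grid )

print( tuned$results[, c("C", "Accuracy")] )   ## resampled accuracy per grid value
print( tuned$bestTune )                        ## the value train selected
```

On the speech data each extra grid value or fold multiplies the training time, so it may be worth starting with a small grid.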

* We have used a binary variable (male/female). However, support vector machines can also be used for multi-category classification, or for continuous variables through regression models; see [different models](https://topepo.github.io/caret/available-models.html).

* In classification, the algorithm is sensitive to imbalances between classes, i.e. when there are more cases belonging to Category 1 than to Category 2. We might need to do [magic](https://topepo.github.io/caret/subsampling-for-class-imbalances.html) to control for these.

In [None]:
# TODO FIX THIS

In [None]:
library(ROSE)

## note: clean out the "," feature before running this code
fixedDataTrain <- ROSE( label_for_ml ~ ., data = dataTrain )$data ## ROSE() returns a list; the resampled data frame is in $data

model <- train( label_for_ml ~., data = fixedDataTrain, method = "svmLinear" )
print( model )
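
A simpler alternative to ROSE is random down-sampling, which caret provides as `downSample`. The sketch below shows it on made-up data; the data frame `toy` and its 20/80 class split are assumptions standing in for `dataTrain`.

```r
library(caret)

set.seed(1)

## made-up imbalanced data standing in for dataTrain
toy <- data.frame( x = rnorm(100),
                   label = factor( rep( c("female", "male"), times = c(20, 80) ) ) )

## randomly drop majority-class rows until both classes have the same size
balanced <- downSample( x = toy[, "x", drop = FALSE], y = toy$label, yname = "label" )

print( table( toy$label ) )       ## before balancing
print( table( balanced$label ) )  ## after balancing
```

Balancing should only be applied to the training data, so that the test set still reflects the class distribution the model will meet in practice.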

### Tasks

* Try different grid search parameters and see if your accuracy metric improves.
* Does balancing improve accuracy with our data?
* Use the age variable to develop a regression model.