<h1><center><b>Predicting Tags</b></center></h1>

#### Our objective is to predict tags for StackOverflow posts using a linear model, after carefully preprocessing the text features.

#### Our dataset consists of post titles from StackOverflow, split into three parts: train, validation and test. All corpora except test contain the titles of the posts and the corresponding tags (100 tags are available).

#### Note: We will also demonstrate how to write small test functions with dummy samples to check that our main and helper functions work properly.

In [9]:
# Important libraries needed

import pandas as pd
import numpy as np
import nltk
import re

import collections 
from collections import Counter
from itertools import chain

from scipy import sparse as sp_sparse
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

#### Here we need a list of stop words. It can be downloaded from nltk and then imported.

In [10]:
#nltk.download('stopwords')
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize
#nltk.download('punkt')

Load the corpora using *pandas* and look at the data:

In [11]:
def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    return data

In [33]:
train = read_data('data/train.tsv')
validation = read_data('data/validation.tsv')
test = read_data('data/test.tsv')

#### We load our dataset into train, validation and test sets:
 - *train* data for training the model;
 - *validation* data for evaluation and hyperparameters tuning;
 - *test* data for final evaluation of the model.

In [34]:
train.head()

Unnamed: 0,title,tags
0,How to draw a stacked dotplot in R?,['r']
1,mysql select all records where a datetime fiel...,"['php', 'mysql']"
2,How to terminate windows phone 8.1 app,['c#']
3,get current time in a specific country via jquery,"['javascript', 'jquery']"
4,Configuring Tomcat to Use SSL,['java']


In [35]:
train.shape

(100000, 2)

#### As you can see, the 'title' column contains the titles of the posts and the 'tags' column contains the tags. Note that the number of tags per post is not fixed; a post can have as many tags as necessary.
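
#### When read from a TSV file, each entry of the 'tags' column is a string that looks like a Python list, not an actual list. As a minimal sketch (the sample values below are hypothetical stand-ins for the column, and this is an alternative to the cleaning approach used later in this notebook), such strings can be parsed with `ast.literal_eval`:

```python
import ast

# Hypothetical sample mirroring how the 'tags' column looks after pd.read_csv
raw_tags = ["['r']", "['php', 'mysql']", "['c#']"]

# literal_eval safely evaluates a Python literal (here, a list of strings)
parsed_tags = [ast.literal_eval(t) for t in raw_tags]

print(parsed_tags)                          # [['r'], ['php', 'mysql'], ['c#']]
print([len(tags) for tags in parsed_tags])  # [1, 2, 1]
```

#### Parsing keeps each tag exactly as written, whereas character-level cleaning may alter tags that contain symbols such as '-' or '.'.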

#### We will further split our train and validation sets into titles and tags, stored as X_train, y_train, X_val and y_val. The test set has no tags, so only X_test is created.

In [36]:
X_train, y_train = train['title'].values, train['tags'].values
X_val, y_val = validation['title'].values, validation['tags'].values
X_test = test['title'].values

In [37]:
print(X_train[:3])
print(y_train[:3])

['How to draw a stacked dotplot in R?'
 'mysql select all records where a datetime field is less than a specified value'
 'How to terminate windows phone 8.1 app']
["['r']" "['php', 'mysql']" "['c#']"]


## Text preprocessing

#### One of the best-known difficulties of working with natural-language data is that it is unstructured. For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will see many weird tokens like **%^^3!@, *"Flip*, @@{SQL}, ComPAct etc. To prevent such problems, it is usually useful to clean the data first.
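
#### A minimal sketch of the problem, using two hypothetical titles that are not taken from the dataset: splitting on whitespace alone treats differently cased or punctuated variants of the same word as distinct tokens.

```python
# Hypothetical titles, not taken from the dataset
titles = ["Flip a {SQL} table", "flip a sql table?"]

vocab = set()
for title in titles:
    # naive tokenization: split on whitespace only
    vocab.update(title.split())

print(sorted(vocab))
# ['Flip', 'a', 'flip', 'sql', 'table', 'table?', '{SQL}']
print(len(vocab))  # 7 distinct tokens for only 4 underlying words
```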

#### We will implement a function 'text_prepare' that cleans our data and gives it some structure by converting all text to lowercase, replacing or removing bad symbols, and removing stop words.

In [38]:
# specific symbols that we want to replace with spaces
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')

# any character other than 0-9, a-z, space, #, + and _ is deleted
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# a set gives fast membership checks when filtering stop words
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text, join_symbol):
    """
        text: a string
        join_symbol: string used to join the remaining tokens
        
        return: modified initial string
    """
    # lowercase text
    text = text.lower() 

    # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(REPLACE_BY_SPACE_RE, " ", text)

    # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(BAD_SYMBOLS_RE, "", text)
    
    # r'\s+' matches one or more whitespace characters (the r prefix marks a raw string)
    text = re.sub(r'\s+', " ", text)

    # delete stopwords from text and join the remaining tokens
    text = join_symbol.join([i for i in text.split() if i not in STOPWORDS])
    
    return text

#### Defining a test function to check helper function is working properly

In [39]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex, ' ') != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [40]:
# Testing helper function
print(test_text_prepare())

Basic tests are passed.


#### We tested that our helper function for preparing text works correctly.

#### Now we can preprocess the titles using the function 'text_prepare', making sure that our data is more structured and that the titles do not contain bad symbols, stop words etc.

In [41]:
X_train = [text_prepare(x, " ") for x in X_train]
X_val = [text_prepare(x, " ") for x in X_val]
X_test = [text_prepare(x, " ") for x in X_test]

In [42]:
y_train = [text_prepare(x, ",") for x in y_train]
y_val = [text_prepare(x, ",") for x in y_val]
# the test set has no tags, so there is no y_test to prepare

In [43]:
print(X_train[:5])
print(X_val[:5])

print(y_train[:5])
print(y_val[:5])

['draw stacked dotplot r', 'mysql select records datetime field less specified value', 'terminate windows phone 81 app', 'get current time specific country via jquery', 'configuring tomcat use ssl']
['odbc_exec always fail', 'access base classes variable within child class', 'contenttype application json required rails', 'sessions sinatra used pass variable', 'getting error type json exist postgresql rake db migrate']
['r', 'php,mysql', 'c#', 'javascript,jquery', 'java']
['php,sql', 'javascript', 'rubyonrails,ruby', 'ruby,session', 'rubyonrails,ruby,json']


## Exploratory Data Analysis

#### Now we will find the 3 most popular tags and the 3 most popular words in the train data.

In [44]:
# Dictionary of all words from train corpus with their counts.
words_counts = Counter(chain.from_iterable([i.split(" ") for i in X_train]))

# Dictionary of all tags from train corpus with their counts
tags_counts = Counter(chain.from_iterable([i.split(",") for i in y_train]))

In [45]:
# Counter.most_common(n) returns the n most frequent items, sorted by decreasing count
top_3_most_common_words = words_counts.most_common(3)
top_3_most_common_tags = tags_counts.most_common(3)

In [46]:
print(top_3_most_common_words)
print(top_3_most_common_tags)

[('using', 8278), ('php', 5614), ('java', 5501)]
[('javascript', 19078), ('c#', 19077), ('java', 18661)]


#### After applying the sorting procedure, we see results of top 3 most common words/tags and their frequencies

In [47]:
print(f"Top three most popular words are: {', '.join(word for word, _ in top_3_most_common_words)}")
print(f"Top three most popular tags are: {', '.join(tag for tag, _ in top_3_most_common_tags)}")

Top three most popular words are: using, php, java
Top three most popular tags are: javascript, c#, java


## Transforming text to a vector

#### Machine learning algorithms work with numeric data, so we cannot use the provided text data "as is". There are many ways to transform text data into numeric vectors. Here we will use two of the most common and simple approaches: bag of words and TF-IDF.

## Bag of words

#### One of the well-known approaches is a *bag-of-words* representation. To create this transformation, we perform the following steps:
1. Find the *N* most popular words in the train corpus and enumerate them. Now we have a dictionary of the most popular words.
2. For each title in the corpora, create a zero vector with dimension equal to *N*.
3. For each title in the corpora, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.

#### Let's see a toy example. Imagine that we have *N* = 4 and the list of the most popular words is 
####     ['hi', 'you','me', 'are']

#### Then we need to enumerate them, for example, like this: 

####     {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
 
#### And we have the text, which we want to transform to the vector:
 
####     'hi how are you'
 
#### For this text we create a corresponding zero vector 
 
####     [0, 0, 0, 0]
     
#### Then we iterate over all words, and if a word is in the dictionary, we increase the value at the corresponding position in the vector:

####     'hi':  [1, 0, 0, 0]
####     'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
####     'are': [1, 0, 0, 1]
####     'you': [1, 1, 0, 1]

#### The resulting vector will be 
 
####     [1, 1, 0, 1]

#### We will implement the encoding described above in the function 'my_bag_of_words', with a dictionary size of 5000. To find the most common words, we will use the train data.

In [48]:
DICT_SIZE = 5000

# most_common_words contains the DICT_SIZE most frequent words, sorted by decreasing frequency
most_common_words = sorted(words_counts.items(), key = lambda x: x[1], reverse = True)[:DICT_SIZE] 

# WORDS_TO_INDEX maps each word to its index; INDEX_TO_WORDS is the inverse mapping
WORDS_TO_INDEX = {}
INDEX_TO_WORDS = {}

# enumerate also works when the corpus has fewer than DICT_SIZE distinct words
for i, (word, _) in enumerate(most_common_words):
    WORDS_TO_INDEX[word] = i
    INDEX_TO_WORDS[i] = word
    
ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        words_to_index: dictionary mapping each vocabulary word to its index
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    for word in text.split(" "):
        # a direct dictionary lookup avoids scanning the whole vocabulary for every word
        if word in words_to_index:
            # increase the count stored at the index of this word
            result_vector[words_to_index[word]] += 1
    
    # result_vector has one entry per vocabulary word, holding that word's count in the text
    return result_vector 

#### Defining a test function to check 'my_bag_of_words' function is working properly

In [49]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [50]:
print(test_my_bag_of_words())

Basic tests are passed.


#### We tested that our function for bag of words representation is working correctly. So, now we can apply the implemented function to all samples.

#### However, we should note that our data contains a lot of zero values, which causes problems with regard to space and time complexity in a dense matrix representation. We can use an alternative data structure for sparse data in which the zero values are not stored, thus keeping only the useful information. There are many types of such representations; for simplicity we will use scipy's CSR (compressed sparse row) matrix.
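
#### To make the saving concrete, here is a small sketch that compares the memory used by a dense array and by its CSR equivalent. The matrix size and the number of non-zero entries per row are assumptions chosen for illustration, not measurements from this dataset.

```python
import numpy as np
from scipy import sparse as sp_sparse

# Hypothetical sizes, much smaller than the notebook's 100000 x 5000 matrix
n_rows, n_cols = 500, 5000
rng = np.random.default_rng(0)

dense = np.zeros((n_rows, n_cols))
for row in range(n_rows):
    # assume about 5 distinct vocabulary words per title
    cols = rng.choice(n_cols, size=5, replace=False)
    dense[row, cols] = 1

csr = sp_sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus two index arrays
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print('non-zero entries:', csr.nnz)
print('dense bytes:', dense_bytes)
print('csr bytes:  ', csr_bytes)
```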

#### Note: The performance of CSR/CSC is severely limited by the overhead of generating indices. Blocked CSR/CSC can be a much better approach, especially on SIMD machines, because it allows loop unrolling and vectorisation to improve performance considerably compared to vanilla CSC/CSR.

In [51]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

In [53]:
print('X_train shape ', X_train_mybag.shape)
print('X_val shape ', X_val_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

X_train shape  (100000, 5000)
X_val shape  (30000, 5000)
X_test shape  (20000, 5000)


## TF-IDF

#### The second approach extends the bag-of-words framework by taking into account the total frequencies of words in the corpora. It penalizes words that are too frequent and provides a better feature space. 
#### We will implement the function 'tfidf_features' using 'TfidfVectorizer' from 'scikit-learn' and fit the vectorizer on the train data. We will filter out words that are too rare (occurring in fewer than 5 titles) and words that are too frequent (occurring in more than 90% of the titles). Also, we will use bigrams along with unigrams in our vocabulary. 
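
#### A toy sketch of the idea, on a hypothetical three-title corpus: a word that occurs in every title receives a lower inverse document frequency than a word that occurs in only one. The 'min_df' and 'max_df' filters are left out here because the toy corpus is too small for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: 'error' occurs in every title, 'numpy' in one
corpus = ["python error", "java error", "numpy error"]

vectorizer = TfidfVectorizer(token_pattern=r'(\S+)')
weights = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_   # word -> column index
idf = vectorizer.idf_            # one idf value per column

print('idf(error):', idf[vocab['error']])
print('idf(numpy):', idf[vocab['numpy']])

# Within the third title, the rarer word gets the larger weight
row = weights[2].toarray().ravel()
print(row[vocab['numpy']] > row[vocab['error']])
```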

In [54]:
def tfidf_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test — samples        
        return TF-IDF vectorized representation of each sample and vocabulary
    """
    # Create a TF-IDF vectorizer, fit it on the train set only,
    # then transform the train, val, and test sets and return the result
    
    # r'(\S+)' matches one or more non-whitespace characters, so tokens like 'c++' and 'c#' are kept
    tfidf_vectorizer = TfidfVectorizer(min_df = 5, max_df = 0.9, 
                                       ngram_range = (1,2), token_pattern = r'(\S+)')
    
    
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_val = tfidf_vectorizer.transform(X_val)
    X_test = tfidf_vectorizer.transform(X_test)

    # tfidf_vectorizer.vocabulary_ returns dictionary of word, index
    return X_train, X_val, X_test, tfidf_vectorizer.vocabulary_

#### Here we will test our 'tfidf_features' function by checking whether we have 'c++' or 'c#' in our vocabulary, as they are obviously important tokens in our tags prediction

In [57]:
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
tfidf_reversed_vocab = {i:word for word, i in tfidf_vocab.items()}

# the vocabulary dictionary is keyed by token, so membership can be checked directly
print("c#" in tfidf_vocab)
print("c++" in tfidf_vocab)

True
True


In [58]:
print(X_train_tfidf[:2])
print('X_test_tfidf ', X_test_tfidf.shape) 
print('X_val_tfidf ',X_val_tfidf.shape)

  (0, 12748)	0.4309937630129157
  (0, 14941)	0.7126565202061851
  (0, 4792)	0.5535025387941576
  (1, 4093)	0.39639224964237335
  (1, 14054)	0.4089312040982416
  (1, 10426)	0.3621376616529093
  (1, 17129)	0.18110148646398525
  (1, 14801)	0.29994308533196384
  (1, 9077)	0.3287166709387216
  (1, 5815)	0.2382368446529078
  (1, 4089)	0.2692803496632626
  (1, 13008)	0.2975359437533551
  (1, 14019)	0.22859508855051242
  (1, 10394)	0.20888863770024907
X_test_tfidf  (20000, 18300)
X_val_tfidf  (30000, 18300)


# MultiLabel classifier

#### Here each example can have multiple tags. To deal with this kind of prediction, we need to transform the labels into a binary form, and the prediction will be a mask of 0s and 1s. For this purpose it is convenient to use MultiLabelBinarizer from sklearn.

#### First we will convert the comma-separated tag string of each post in the training and validation sets into a set of tags
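
#### A small sketch of what the binarizer does, using hypothetical tag sets: it is fitted on the training labels, and the same tag-to-column mapping is then reused for the validation labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag sets, for illustration only
train_tags = [{'php', 'mysql'}, {'r'}, {'java'}]
val_tags = [{'java', 'php'}]

mlb_demo = MultiLabelBinarizer()
train_matrix = mlb_demo.fit_transform(train_tags)  # learns the tag columns
val_matrix = mlb_demo.transform(val_tags)          # reuses the same columns

print(mlb_demo.classes_)  # ['java' 'mysql' 'php' 'r']
print(train_matrix)       # [[0 1 1 0] [0 0 0 1] [1 0 0 0]]
print(val_matrix)         # [[1 0 1 0]]
```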

In [59]:
y_train = [set(i.split(',')) for i in y_train]
y_val = [set(i.split(',')) for i in y_val]

In [60]:
mlb = MultiLabelBinarizer()

# turn y_train and y_val into binary indicator matrices with one 0/1 column per tag;
# fit on the train tags only and reuse the same tag-to-column mapping for validation
y_train = mlb.fit_transform(y_train) 
y_val = mlb.transform(y_val)

In [61]:
y_val[0]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [62]:
mlb.classes_

array(['ajax', 'algorithm', 'android', 'angularjs', 'apache', 'arrays',
       'aspnet', 'aspnetmvc', 'c', 'c#', 'c++', 'class', 'cocoatouch',
       'codeigniter', 'css', 'csv', 'database', 'date', 'datetime',
       'django', 'dom', 'eclipse', 'entityframework', 'excel', 'facebook',
       'file', 'forms', 'function', 'generics', 'googlemaps', 'hibernate',
       'html', 'html5', 'image', 'ios', 'iphone', 'java', 'javascript',
       'jquery', 'json', 'jsp', 'laravel', 'linq', 'linux', 'list',
       'loops', 'maven', 'mongodb', 'multithreading', 'mysql', 'net',
       'nodejs', 'numpy', 'objectivec', 'oop', 'opencv', 'osx', 'pandas',
       'parsing', 'performance', 'php', 'pointers', 'python', 'python27',
       'python3x', 'qt', 'r', 'regex', 'rest', 'ruby', 'rubyonrails',
       'rubyonrails3', 'selenium', 'servlets', 'session', 'sockets',
       'sorting', 'spring', 'springmvc', 'sql', 'sqlserver', 'string',
       'swift', 'swing', 'twitterbootstrap', 'uitableview', 'unittestin

#### Checking the names of unique tags in our list

#### Now we will implement the function 'train_classifier' for training a classifier. Here we will use the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach k classifiers are trained, where k is the number of tags. As the base classifier, we use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. 
#### Note: Training might take some time, because the number of classifiers to train is large.
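
#### As an illustration of the One-vs-Rest idea, the sketch below trains on a tiny hand-made dataset (the word counts and tags are hypothetical) and shows that one binary classifier is fitted per tag column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical features: counts of the words 'python', 'pandas', 'java'
X_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1],
                   [1, 1, 0], [0, 0, 1], [1, 0, 0]])
# Hypothetical label columns: tags 'java', 'pandas', 'python'
Y_demo = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0],
                   [0, 1, 1], [1, 0, 0], [0, 0, 1]])

ovr = OneVsRestClassifier(LogisticRegression(max_iter=200)).fit(X_demo, Y_demo)

# One fitted binary LogisticRegression per tag
print(len(ovr.estimators_))

# The prediction for a new sample is a 0/1 mask over the tags
prediction = ovr.predict(np.array([[0, 0, 1]]))
print(prediction.shape)
```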


In [65]:
def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    # Create and fit LogisticRegression wrapped into OneVsRestClassifier.
    model = OneVsRestClassifier(LogisticRegression(max_iter = 200)).fit(X_train,y_train)
    
    return model

#### We will train the classifiers for different data transformations i.e. 'bag-of-words' and 'tf-idf'.
#### Note: If you receive a convergence warning, please set parameter 'max_iter' in LogisticRegression to a larger value (the default is 100).

In [66]:
classifier_mybag = train_classifier(X_train_mybag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

#### Now we can create predictions for the data. We will do two types of predictions: labels and scores.
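
#### The relation between the two kinds of prediction can be sketched on synthetic data (generated with `make_multilabel_classification`, so none of it comes from this dataset): with a LogisticRegression base estimator, the predicted labels are expected to correspond to the scores thresholded at 0.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel data, for illustration only
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=200)).fit(X_demo, Y_demo)

labels = clf.predict(X_demo)            # 0/1 mask, one column per class
scores = clf.decision_function(X_demo)  # signed distance, one column per class

print(labels.shape, scores.shape)
# A label is 1 exactly where the corresponding score is positive
print(np.array_equal(labels, (scores > 0).astype(int)))
```

#### Scores are useful for ranking-based metrics and for choosing a threshold other than the default one.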

In [67]:
y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag) 
y_val_predicted_scores_mybag = classifier_mybag.decision_function(X_val_mybag)

y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)

#### We can now check how our classifiers work by looking at the predicted labels for a few examples, first for bag of words and then for TF-IDF.

In [69]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_mybag)
y_val_inversed = mlb.inverse_transform(y_val)

for i in range(10):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Title:	odbc_exec always fail
True labels:	php,sql
Predicted labels:	


Title:	access base classes variable within child class
True labels:	javascript
Predicted labels:	


Title:	contenttype application json required rails
True labels:	ruby,rubyonrails
Predicted labels:	rubyonrails


Title:	sessions sinatra used pass variable
True labels:	ruby,session
Predicted labels:	ruby


Title:	getting error type json exist postgresql rake db migrate
True labels:	json,ruby,rubyonrails
Predicted labels:	json,rubyonrails


Title:	library found
True labels:	c++,ios,iphone,xcode
Predicted labels:	


Title:	csproj file programmatic adding deleting files
True labels:	c#
Predicted labels:	


Title:	typeerror makedirs got unexpected keyword argument exists_ok
True labels:	django,python
Predicted labels:	python


Title:	pan div using jquery
True labels:	html,javascript,jquery
Predicted labels:	javascript,jquery


Title:	hibernate intermediate advanced tutorials
True labels:	hibernate,java
Predicted labels:	

#### We checked how our bag of words classifier works. Now we can do the same for TF-IDF classifier.

In [68]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)

for i in range(10):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Title:	odbc_exec always fail
True labels:	php,sql
Predicted labels:	


Title:	access base classes variable within child class
True labels:	javascript
Predicted labels:	


Title:	contenttype application json required rails
True labels:	ruby,rubyonrails
Predicted labels:	json,rubyonrails


Title:	sessions sinatra used pass variable
True labels:	ruby,session
Predicted labels:	


Title:	getting error type json exist postgresql rake db migrate
True labels:	json,ruby,rubyonrails
Predicted labels:	rubyonrails


Title:	library found
True labels:	c++,ios,iphone,xcode
Predicted labels:	


Title:	csproj file programmatic adding deleting files
True labels:	c#
Predicted labels:	


Title:	typeerror makedirs got unexpected keyword argument exists_ok
True labels:	django,python
Predicted labels:	python


Title:	pan div using jquery
True labels:	html,javascript,jquery
Predicted labels:	javascript,jquery


Title:	hibernate intermediate advanced tutorials
True labels:	hibernate,java
Predicted labels:	hibe

#### Now we need to compare the results of different predictions, e.g. to see which classifier works better (bag of words or TF-IDF), or to try different regularization techniques in logistic regression or a different model like SVM, Naive Bayes etc. For all these experiments, we need to set up an evaluation procedure. 

## Evaluation
 
#### To evaluate the results we will use several classification metrics:
- Accuracy
- F1-score
- Area under ROC-curve
- Area under precision-recall curve

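#### We will implement the function 'print_evaluation_scores', which calculates the metrics below. The average precision summarizes the precision-recall curve and is reported under the names 'precision_*'; note that the function computes it from predicted labels, not from scores. The area under the ROC curve is not computed by this function.
* accuracy
* F1-score macro/micro/weighted
* Average precision macro/micro/weighted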
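
#### A toy sketch of how these metrics behave on multilabel data (the label matrices are hypothetical). For multilabel input, `accuracy_score` computes subset accuracy: a sample counts as correct only if all of its labels match, which is why accuracy tends to be much lower than the F1-scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted label masks for 4 samples and 3 tags
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]])

# Only rows 1 and 2 match exactly
accuracy = accuracy_score(y_true, y_pred)
# micro: pool all tag decisions (TP=5, FP=1, FN=1)
f1_micro = f1_score(y_true, y_pred, average='micro')
# macro: unweighted mean of the per-tag F1 values (1.0, 0.8, 0.667)
f1_macro = f1_score(y_true, y_pred, average='macro')

print(accuracy)            # 0.5
print(round(f1_micro, 3))  # 0.833
print(round(f1_macro, 3))  # 0.822
```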

In [92]:
def print_evaluation_scores(y_val, predicted):
    
    accuracy = accuracy_score(y_val, predicted)
    f1_score_macro = f1_score(y_val, predicted, average = 'macro')
    f1_score_micro = f1_score(y_val, predicted, average = 'micro')
    f1_score_weighted = f1_score(y_val, predicted, average = 'weighted')
    precision_macro = average_precision_score(y_val, predicted, average = 'macro')
    precision_micro = average_precision_score(y_val, predicted, average = 'micro')
    precision_weighted = average_precision_score(y_val, predicted, average = 'weighted')

    scores_names = ['accuracy', 'f1_score_macro', 'f1_score_micro', 'f1_score_weighted', 'precision_macro',
          'precision_micro', 'precision_weighted']
    scores_values = [accuracy, f1_score_macro, f1_score_micro, f1_score_weighted, precision_macro,
          precision_micro, precision_weighted]
    print(pd.DataFrame({'Names':scores_names, 'Values': scores_values}).set_index('Names'),'\n')
    
    

print('Bag-of-words')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

Bag-of-words
                      Values
Names                       
accuracy            0.357800
f1_score_macro      0.504803
f1_score_micro      0.671006
f1_score_weighted   0.648667
precision_macro     0.344404
precision_micro     0.481164
precision_weighted  0.510749 

Tfidf
                      Values
Names                       
accuracy            0.333900
f1_score_macro      0.445477
f1_score_micro      0.641718
f1_score_weighted   0.614248
precision_macro     0.301817
precision_micro     0.456897
precision_weighted  0.485003 



# Conclusion
#### Once we have the evaluation set up, we can experiment a bit with training our classifiers. We will use the weighted F1-score as the evaluation metric and try the following:
- compare the quality of the bag-of-words and TF-IDF approaches and choose one of them.
- for the chosen one, try *L1* and *L2*-regularization techniques in Logistic Regression with different coefficients (e.g. C equal to 0.1, 1, 10, 100).
- try further improvements of the preprocessing.
- try some other model like SVM, Naive Bayes, CNN.
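
#### The regularization experiment could be sketched as follows. This is only an outline under stated assumptions: the data is synthetic (from `make_multilabel_classification`), the train/validation split is a simple slice, and the 'liblinear' solver is chosen because it supports both penalties.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the vectorized titles and binarized tags
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0)
X_tr, Y_tr = X_demo[:200], Y_demo[:200]
X_va, Y_va = X_demo[200:], Y_demo[200:]

results = {}
for penalty in ['l1', 'l2']:
    for C in [0.1, 1, 10, 100]:
        base = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        clf = OneVsRestClassifier(base).fit(X_tr, Y_tr)
        # weighted F1 on the held-out part is the selection criterion
        results[(penalty, C)] = f1_score(Y_va, clf.predict(X_va), average='weighted')

best = max(results, key=results.get)
print('best setting:', best, 'weighted F1:', results[best])
```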

### Analysis of the most important features

#### Finally, we can also look at the features (words or n-grams) that have the largest weights in our model.
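
#### One possible way to do this is to read the coefficients of the per-tag estimators. The sketch below uses a tiny hand-made dataset, and the helper 'top_words' as well as the names 'index_to_word' and 'tag_names' are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical vocabulary and tags
index_to_word = {0: 'python', 1: 'pandas', 2: 'java'}
tag_names = ['java', 'pandas', 'python']

X_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1],
                   [1, 1, 0], [0, 0, 1], [1, 0, 0]])
Y_demo = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0],
                   [0, 1, 1], [1, 0, 0], [0, 0, 1]])

clf = OneVsRestClassifier(LogisticRegression(max_iter=200)).fit(X_demo, Y_demo)

def top_words(classifier, tag_index, index_to_word, k=2):
    """Return the k most positive and k most negative words for one tag."""
    coef = classifier.estimators_[tag_index].coef_[0]
    order = np.argsort(coef)  # ascending by weight
    top_positive = [index_to_word[i] for i in order[::-1][:k]]
    top_negative = [index_to_word[i] for i in order[:k]]
    return top_positive, top_negative

for i, tag in enumerate(tag_names):
    positive, negative = top_words(clf, i, index_to_word)
    print(f'{tag}: positive={positive}, negative={negative}')
```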