# PROBLEM

In order to train and use an ASR (Automatic Speech Recognition) system, many components are required, among them a pronunciation model. The simplest form of a pronunciation model is a pronunciation dictionary: a list of words in a given language together with their pronunciations.

Such dictionaries can be prepared by hand, generated with rule-based grammars, or produced by a trained model. The latter two approaches are much more time- and cost-effective, but they may introduce errors. A common source of such errors is a word with "non-native" pronunciation used in a language: often the name of a person or a product, sometimes inflected according to the morphological rules of the language under study.

Words such as *googlowałam*, *facebook* or *Williamów* are tricky examples in Polish. A simple pronunciation model may fail to provide the correct pronunciation for such tokens, introducing errors into the dataset. To avoid that, a "non-native" word detection model can be used to flag words that require attention, so that engineers or data annotators can inspect them.

**This is a binary classification problem (native vs non-native pronunciation, in this case Polish vs non-Polish), as further division into source languages is not necessary.**

As shown below, general-purpose state-of-the-art language detection models do not provide the information needed in this case. Given the examples listed above, they (correctly) identify *googlowałam* and *Williamów* as Polish words, without accounting for the "non-native" pronunciation.

For this reason, after basic tests, the author decided to build a dedicated solution.

### Imports

In [1]:
import numpy as np
import pandas as pd

from datetime import datetime

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, fbeta_score, make_scorer
from sklearn import set_config

from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, LSTM, Bidirectional, Conv1D, Flatten, MaxPooling1D, Dropout, GlobalMaxPool1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.metrics import Precision, Recall, PrecisionAtRecall, RecallAtPrecision
from tensorflow.keras.backend import clear_session
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

import tensorflow as tf

In [2]:
set_config(display='diagram')

np.random.seed(1)
tf.random.set_seed(1)

# DATA

The data presented here is a very small random sample of the OSCAR Polish corpus (https://oscar-corpus.com). It was cleaned and preprocessed with tools that are not openly available, so the full process cannot be presented here. The resulting sample was hand-annotated by the author.

The dataset consists of 12,000 words used in Polish texts, each annotated with 0 (pronunciation consistent with the rules of Polish pronunciation) or 1 (foreign pronunciation).

The dataset was divided into `train` and `test` using scikit-learn's `train_test_split` tool with options `test_size=0.2, random_state=42, stratify=y`.
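
The effect of `stratify=y` can be illustrated with a plain-Python sketch. This is not scikit-learn's implementation, only a simplified stand-in that shows the property it guarantees: both parts keep the same class ratio. The helper name `stratified_split` is hypothetical, and the class counts are inferred from outputs shown in this notebook (1193 positives in the 9600 training rows, 298 positives in the test confusion matrices).

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class and cut off its share for the test set.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Class counts inferred from the notebook outputs: 1491 positives in 12,000 words.
labels = [1] * 1491 + [0] * 10509
train_idx, test_idx = stratified_split(labels)

n_train_pos = sum(labels[i] for i in train_idx)
n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(n_train_pos / len(train_idx), n_test_pos / len(test_idx))
```

Without stratification, a random 20% cut of an imbalanced dataset could end up with a noticeably different share of positives in the test set than in the training set, which would distort recall-based metrics.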

In [3]:
data_train = pd.read_csv('trainset.csv')
data_train.head(5)

Unnamed: 0,word,not_pl
0,niekorzystnemu,0
1,konsensualna,0
2,czernych,0
3,rossija,1
4,mondi,0


In [4]:
data_test = pd.read_csv('testset.csv')
data_test.head(5)

Unnamed: 0,word,not_pl
0,szambie,0
1,kaiserslautern,1
2,krystalizująca,0
3,przestudiuje,0
4,zetknęliśmy,0


In [5]:
data_train.shape

(9600, 2)

In [6]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9600 entries, 0 to 9599
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    9600 non-null   object
 1   not_pl  9600 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 150.1+ KB


In [16]:
data_train.describe()

Unnamed: 0,not_pl
count,9600.0
mean,0.124271
std,0.329907
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


As the mean of the `data_train.not_pl` column shows, the dataset is imbalanced, with **negatives about 7 times more frequent than positives**. This, however, reflects the reality of the problem: only a fraction of the words in a language are loanwords from other languages.
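
The ratio quoted above follows directly from the `describe()` output. A short check, using only the row count and the mean reported in this notebook:

```python
n_rows = 9600
positive_share = 0.124271  # mean of not_pl reported by describe()

n_pos = round(n_rows * positive_share)   # words with foreign pronunciation
n_neg = n_rows - n_pos                   # words with Polish pronunciation
ratio = n_neg / n_pos
majority_accuracy = n_neg / n_rows       # accuracy of always predicting 0

print(n_pos, n_neg, round(ratio, 2), round(majority_accuracy, 4))
```

The last number is the accuracy of a trivial classifier that labels every word as Polish. It is high even though such a classifier finds no positives at all, which is why accuracy is a poor guide for this dataset.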

Because of this imbalance, model evaluation has to rely on metrics based on recall and precision rather than on accuracy. From the business perspective, the main goal is to minimize the number of false negatives (words with non-Polish pronunciation classified as Polish), that is, to achieve high recall even at the cost of average precision, because all words classified as positive are analysed in later stages of the business process.

**Therefore, the main metrics chosen for evaluation are F-beta scores with beta=1.5 and beta=2, as well as recall.**
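
To make the role of beta concrete, the sketch below computes F-beta directly from confusion-matrix counts using the standard formula. The helper name `fbeta_from_counts` and the counts are hypothetical, chosen only for illustration; the notebook itself relies on `sklearn.metrics.fbeta_score`.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta from confusion-matrix counts; beta > 1 weights recall above precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Hypothetical counts: precision = 80/120, recall = 80/100.
tp, fp, fn = 80, 40, 20

scores = {beta: fbeta_from_counts(tp, fp, fn, beta) for beta in (1, 1.5, 2)}
for beta, value in scores.items():
    print(f"beta={beta}: {value:.4f}")
```

In the formula each false negative is multiplied by beta squared while each false positive counts once, so with beta=1.5 a missed non-Polish word costs 2.25 times as much as a false alarm, and with beta=2 four times as much.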

In [7]:
# Function for saving scores for a chosen model.

def save_model_results(y_true, y_pred, model_name, save_file=True):
    
    '''
    Returns pd.DataFrame with scores for given y_true, y_pred:
    - accuracy
    - Fbeta-score(beta=1.5) = F1.5-score
    - F1-score
    - Fbeta-score(beta=2) = F2-score
    - recall
    - precision
    - TN, FP, FN, TP
    
    Saves a csv file with results in 'results' directory.
    '''
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    
    summary = [model_name,
               accuracy_score(y_true, y_pred),
               fbeta_score(y_true, y_pred, beta=1.5),
               f1_score(y_true, y_pred),
               fbeta_score(y_true, y_pred, beta=2),
               recall_score(y_true, y_pred),
               precision_score(y_true, y_pred),
               tn, fp, fn, tp]
        
    result_df = pd.DataFrame([summary])
    result_df.columns = ['model',
                       'accuracy',
                       'F1.5-score',
                       'F1-score',
                       'F2-score',
                       'recall',
                       'precision',
                       'tn','fp','fn','tp']
    
    result_df = result_df.sort_values(by=['F1.5-score'], ascending=False)
    
    # Write the CSV only when requested (the 'results' directory must already exist).
    if save_file:
        current_time = datetime.now().strftime("%y%m%d_%H%M")
        out_file = f"results/{model_name}_{current_time}.csv"
        result_df.to_csv(out_file, index=False)
    
    return result_df

## Basic data preparation

In [8]:
X_train = data_train.drop(columns='not_pl')
y_train = data_train.not_pl

X_test = data_test.drop(columns='not_pl')
y_test = data_test.not_pl

## Data preparation using CountVectorizer()

For neural networks consisting only of Dense() layers.

Words are represented as vectors of occurrence counts of character 3-grams.
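
The representation can be sketched in plain Python. The code below imitates what `analyzer='char_wb'` does for a single word (the word is padded with one space on each side before the n-grams are cut), but it is a simplified stand-in rather than the library code. The helper names and the word `mondial` are hypothetical.

```python
from collections import Counter

def char_wb_ngrams(word, n=3):
    """Character n-grams of a single word padded with one space on each side."""
    padded = f" {word.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def count_vector(word, vocabulary, n=3):
    """Counts of the word's n-grams, in the order given by the vocabulary."""
    counts = Counter(char_wb_ngrams(word, n))
    return [counts[gram] for gram in vocabulary]

# The vocabulary is built from training words only (two words from the train sample).
train_words = ["mondi", "rossija"]
vocabulary = sorted({g for w in train_words for g in char_wb_ngrams(w)})

print(char_wb_ngrams("mondi"))
# n-grams that never occurred in training have no column and are ignored.
print(count_vector("mondial", vocabulary))
```

The padding spaces let the model tell word-initial and word-final 3-grams apart from word-internal ones, which matters here because foreign words often differ from Polish ones in their endings.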

In [9]:
words_train = data_train.word
words_test = data_test.word

In [10]:
vectorizer = CountVectorizer(ngram_range=(3,3), analyzer='char_wb')

cv = vectorizer.fit(words_train)

Xcv_train = cv.transform(words_train)
Xcv_test = cv.transform(words_test)

In [11]:
Xcv_train

<9600x7198 sparse matrix of type '<class 'numpy.int64'>'
	with 89000 stored elements in Compressed Sparse Row format>

In [12]:
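# toarray() returns plain ndarrays; todense() would return np.matrix objects,
# which recent scikit-learn releases may not accept.
Xcv_train = Xcv_train.toarray()
Xcv_test = Xcv_test.toarray()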

In [13]:
scaler = StandardScaler()
maxabs = MaxAbsScaler()

scaler.fit(Xcv_train)
Xscaled_train = scaler.transform(Xcv_train)
Xscaled_test = scaler.transform(Xcv_test)

maxabs.fit(Xcv_train)
Xmaxabs_train = maxabs.transform(Xcv_train)
Xmaxabs_test = maxabs.transform(Xcv_test)

## Data preparation using Tokenizer()

For neural networks with an Embedding() layer.

Words are represented as equal-length vectors of integers, each integer encoding one character.
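
The idea can be sketched without Keras. The helpers below are hypothetical and simplified: they index characters by frequency, reserve 0 for padding, and left-pad every word to a common length, which is the same kind of output the `Tokenizer(char_level=True)` and `pad_sequences(padding='pre')` cells produce. The exact index assigned to each character may differ from what Keras would assign.

```python
from collections import Counter

def build_char_index(words):
    """Map each character to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(ch for w in words for ch in w)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(word, char_index, maxlen):
    """Encode a word as integers, left-padded with zeros to a fixed length."""
    # In this sketch, characters unseen in training are simply skipped.
    seq = [char_index[ch] for ch in word if ch in char_index]
    seq = seq[:maxlen]                       # cut overly long words at the end
    return [0] * (maxlen - len(seq)) + seq   # pad at the front

words = ["rossija", "mondi"]
char_index = build_char_index(words)
vocab_size = len(char_index) + 1             # +1 for the padding index 0
maxlen = max(len(w) for w in words)

encoded = [encode(w, char_index, maxlen) for w in words]
print(vocab_size, maxlen)
print(encoded)
```

Padding at the front keeps the end of each word aligned at the last positions of the vector, so a recurrent layer reads the actual characters last.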

In [14]:
tokenizer = Tokenizer(char_level=True)

In [15]:
tokenizer.fit_on_texts(words_train)

In [16]:
Xtok_train = tokenizer.texts_to_sequences(words_train)
Xtok_test = tokenizer.texts_to_sequences(words_test)

In [17]:
Xtok_train[:5]

[[5, 2, 4, 12, 3, 6, 7, 11, 10, 13, 5, 4, 14, 17],
 [12, 3, 5, 10, 4, 5, 10, 17, 1, 16, 5, 1],
 [9, 7, 4, 6, 5, 11, 9, 23],
 [6, 3, 10, 10, 2, 19, 1],
 [14, 3, 5, 18, 2]]

In [18]:
tokenizer.word_index

{'a': 1,
 'i': 2,
 'o': 3,
 'e': 4,
 'n': 5,
 'r': 6,
 'z': 7,
 'w': 8,
 'c': 9,
 's': 10,
 'y': 11,
 'k': 12,
 't': 13,
 'm': 14,
 'p': 15,
 'l': 16,
 'u': 17,
 'd': 18,
 'j': 19,
 'g': 20,
 'ł': 21,
 'b': 22,
 'h': 23,
 'ą': 24,
 'ę': 25,
 'ś': 26,
 'f': 27,
 'ó': 28,
 'ż': 29,
 'ń': 30,
 'ć': 31,
 'v': 32,
 'ź': 33,
 'x': 34,
 'q': 35,
 "'": 36,
 'á': 37}

In [19]:
vocab_size = len(tokenizer.word_index) + 1
vocab_size

38

In [20]:
maxlen = words_train.str.len().max()
maxlen

28

In [21]:
Xtok_train = pad_sequences(Xtok_train, padding='pre', truncating='post', maxlen=maxlen)
Xtok_test = pad_sequences(Xtok_test, padding='pre', truncating='post',  maxlen=maxlen)

In [22]:
Xtok_train[:5]

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  5,  2,
         4, 12,  3,  6,  7, 11, 10, 13,  5,  4, 14, 17],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        12,  3,  5, 10,  4,  5, 10, 17,  1, 16,  5,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  9,  7,  4,  6,  5, 11,  9, 23],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  6,  3, 10, 10,  2, 19,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0, 14,  3,  5, 18,  2]], dtype=int32)

In [23]:
ytok_train = np.array(y_train).reshape(-1,1)
ytok_test = np.array(y_test).reshape(-1,1)

# TESTING EXISTING SOLUTIONS

In this section, the author used information and code snippets available at the following links:

* https://amitness.com/2019/07/identify-text-language-python/
* https://www.nltk.org/api/nltk.classify.html?highlight=classify%20textcat#module-nltk.classify.textcat

## Solutions

### Fasttext

In [54]:
import fasttext

In [55]:
PRETRAINED_MODEL_PATH = './fasttext/lid.176.ftz'
fasttext_model = fasttext.load_model(PRETRAINED_MODEL_PATH)



In [68]:
sentences = ['Conoce Jim el Google Cloud Platform en Python o JavaScript?',
             'Wygooglowałam newsy o historii obu Williamów.']
predictions = fasttext_model.predict(sentences)
pd.DataFrame(predictions)

Unnamed: 0,0,1
0,[__label__es],[__label__pl]
1,[0.7234646],[0.9548101]


### Google Compact Language Detector v3 (CLD3)

In [58]:
import gcld3

In [59]:
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, 
                                        max_num_bytes=1000)

In [61]:
text1 = 'Conoce Jim el Google Cloud Platform en Python o JavaScript?'
cld_result1 = detector.FindLanguage(text=text1)

text2 = 'Wygooglowałam newsy o historii obu Williamów.'
cld_result2 = detector.FindLanguage(text=text2)

In [69]:
print('language:', cld_result1.language)
print('is the result reliable?', cld_result1.is_reliable)
print('probability:', cld_result1.probability)
print('================')
print('language:', cld_result2.language)
print('is the result reliable?', cld_result2.is_reliable)
print('probability:', cld_result2.probability)

language: es
is the result reliable? False
probability: 0.5869095325469971
language: pl
is the result reliable? True
probability: 0.9951347708702087


### nltk.classify.textcat

In [71]:
from nltk.classify import textcat

In [76]:
text1 = 'Conoce Jim el Google Cloud Platform en Python o JavaScript?'
cls = textcat.TextCat()

distances = cls.lang_dists(text1)
print(cls.guess_language(text1))

# show distances from languages in the corpus
sorted_distances = sorted(distances.items(), key = lambda kv: kv[1])
sorted_distances[:5]

eng


[('eng', 18446744073709640590),
 ('eng ', 27670116110564408190),
 ('deu', 36893488147419267083),
 ('dan', 46116860184274013704),
 ('sun', 55340232221128795629)]

In [77]:
text2 = 'Wygooglowałam newsy o historii obu Williamów.'
cls = textcat.TextCat()

distances = cls.lang_dists(text2)
print(cls.guess_language(text2))

sorted_distances = sorted(distances.items(), key = lambda kv: kv[1])
sorted_distances[:5]

pol


[('pol', 92598),
 ('eng', 73786976294838279945),
 ('eng ', 83010348331693041539),
 ('afr', 83010348331693055136),
 ('fri', 92233720368547823683)]

## Testing on toy example

In [70]:
toy = ['wygooglowałam', 'facebook', 'Williamów']

In [85]:
# FastText

predictions = fasttext_model.predict(toy)
pd.DataFrame(predictions, columns=toy, index=['language', 'probability'])

Unnamed: 0,wygooglowałam,facebook,Williamów
language,[__label__pl],[__label__es],[__label__pl]
probability,[0.99788636],[0.8521678],[0.8911561]


In [84]:
# CLD3

languages = []
probabilities = []

for item in toy:
    
    cld_result = detector.FindLanguage(text=item)
    languages.append(cld_result.language)
    probabilities.append(cld_result.probability)
    
cld3_toy = pd.DataFrame([languages, probabilities], columns=toy, index=['language', 'probability'])
cld3_toy

Unnamed: 0,wygooglowałam,facebook,Williamów
language,pl,la,pl
probability,0.999977,0.451175,0.494716


In [88]:
# NLTK TextCat

languages = []
dists = []

cls = textcat.TextCat()

for item in toy:

    distances = cls.lang_dists(item)
    sorted_distances = sorted(distances.items(), key = lambda kv: kv[1])
    
    languages.append(sorted_distances[0][0])
    dists.append(sorted_distances[0][1])
    
nltk_toy = pd.DataFrame([languages, dists], columns=toy, index=['language', 'distance'])
nltk_toy

Unnamed: 0,wygooglowałam,facebook,Williamów
language,pol,por,pol
distance,29382,11252,26358


## Testing on testset

In [90]:
X_test.head()

Unnamed: 0,word
0,szambie
1,kaiserslautern
2,krystalizująca
3,przestudiuje
4,zetknęliśmy


### Fasttext

In [137]:
def fasttext_testing(X_test, y_test):
    '''
    Tests FastText model on a given testset and returns scores:
    - accuracy
    - Fbeta-score(beta=1.5) = F1.5-score
    - F1-score
    - Fbeta-score(beta=2) = F2-score
    - recall
    - precision
    - TN, FP, FN, TP
    
    Saves a csv file with results and txt file with predictions created by Fasttext model.
    '''
    
    PRETRAINED_MODEL_PATH = './fasttext/lid.176.ftz'
    fasttext_model = fasttext.load_model(PRETRAINED_MODEL_PATH)
    
    y_pred_fasttext = []

    for item in X_test:

        predictions = fasttext_model.predict(item)
        
        if '__label__pl' in predictions[0][0]:
            y_pred_fasttext.append(0)
        else:
            y_pred_fasttext.append(1)

    with open('results/fasttext_results.txt', 'w') as f_out:
        for item in y_pred_fasttext:
            f_out.write(f"{item}\n")
    
    fasttext_df = save_model_results(y_test, y_pred_fasttext, 'fasttext')

    return fasttext_df

In [138]:
fasttext_testing(words_test, y_test)



Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,fasttext,0.767917,0.591516,0.488522,0.670701,0.892617,0.336283,1577,525,32,266


### Google CLD3

In [139]:
def cld3_testing(X_test, y_test):
    '''
    Tests Google CLD3 model on a given testset and returns scores:
    - accuracy
    - Fbeta-score(beta=1.5) = F1.5-score
    - F1-score
    - Fbeta-score(beta=2) = F2-score
    - recall
    - precision
    - TN, FP, FN, TP
    
    Saves a csv file with results and txt file with predictions created by Google CLD3 model.
    '''
    
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0,
                                            max_num_bytes=1000)
    
    y_pred_cld3 = []

    for item in X_test:

        cld_result = detector.FindLanguage(text=item)
        
        if cld_result.language == 'pl':
            y_pred_cld3.append(0)
        else:
            y_pred_cld3.append(1)

    with open('results/cld3_results.txt', 'w') as f_out:
        for item in y_pred_cld3:
            f_out.write(f"{item}\n")
    
    cld3_df = save_model_results(y_test, y_pred_cld3, 'cld3')

    return cld3_df

In [140]:
cld3_testing(words_test, y_test)

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,cld3,0.580833,0.473001,0.360051,0.573804,0.949664,0.222135,1111,991,15,283


### nltk.classify.textcat

In [141]:
def textcat_testing(X_test, y_test):
    '''
    Tests nltk.classify.textcat model on a given testset and returns scores:
    - accuracy
    - Fbeta-score(beta=1.5) = F1.5-score
    - F1-score
    - Fbeta-score(beta=2) = F2-score
    - recall
    - precision
    - TN, FP, FN, TP
    
    Saves a csv file with results and txt file with predictions created by nltk.classify.textcat model.
    '''
    
    cls = textcat.TextCat()
    
    y_pred_textcat = []

    for item in X_test:
        
        if cls.guess_language(item) == 'pol':
            y_pred_textcat.append(0)
        else:
            y_pred_textcat.append(1)

    with open('results/textcat_results.txt', 'w') as f_out:
        for item in y_pred_textcat:
            f_out.write(f"{item}\n")
    
    textcat_df = save_model_results(y_test, y_pred_textcat, 'textcat')

    return textcat_df

In [142]:
textcat_testing(words_test, y_test)

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,textcat,0.776667,0.61968,0.510949,0.703518,0.939597,0.350877,1584,518,18,280


As the tests show, the existing models do not perform well on this task. They not only produce false negatives on corner cases such as those in the toy example, but also achieve low precision.
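
The practical cost of the low precision can be read from the confusion matrices above. The following sketch recomputes precision and recall from the reported counts and adds the share of the test set that each model flags for manual review (the `flagged` value is an extra illustration, not a metric used elsewhere in this notebook).

```python
# (tn, fp, fn, tp) copied from the result tables above.
counts = {
    "fasttext": (1577, 525, 32, 266),
    "cld3": (1111, 991, 15, 283),
    "textcat": (1584, 518, 18, 280),
}

summary = {}
for name, (tn, fp, fn, tp) in counts.items():
    precision = tp / (tp + fp)   # share of flagged words that are truly non-Polish
    recall = tp / (tp + fn)      # share of non-Polish words that were flagged
    flagged = (tp + fp) / (tn + fp + fn + tp)
    summary[name] = (precision, recall, flagged)
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} flagged={flagged:.1%}")
```

Although only about 12% of the words are positives, FastText and TextCat flag roughly a third of the test set and CLD3 more than half, so most of the annotators' review effort would be spent on ordinary Polish words.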

# TRAINING A MODEL

### Baseline

LogisticRegression() with default settings, trained on CountVectorizer() character 3-gram counts (no scaling).

In [9]:
baseline_prepr = ColumnTransformer([("ngrams",
                                     make_pipeline(CountVectorizer(ngram_range=(3,3), analyzer="char_wb")),
                                     "word"),], sparse_threshold=1)

baseline = make_pipeline(baseline_prepr, LogisticRegression())

baseline.fit(X_train, y_train)

In [25]:
save_model_results(y_test, baseline.predict(X_test), 'baseline')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,baseline,0.936667,0.641373,0.692308,0.615994,0.573826,0.872449,2077,25,127,171


# ML algorithms

Algorithms chosen for testing:
* LogisticRegression()
* RidgeClassifier()
* DecisionTreeClassifier()
* SVC()
* RandomForestClassifier()
* KNeighborsClassifier()

### Parameters for GridSearchCV:

In [10]:
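# NOTE: 'dict' below is a plain string, not a valid class_weight value (scikit-learn
# documents a dict, 'balanced' or None); how such candidates are handled depends on the
# estimator and library version. Use None to search over "no class weighting".
# NOTE: the 'normalize' option of RidgeClassifier was removed in scikit-learn 1.2.
models_params_part1 = {'logreg': {'penalty': ['l1', 'l2'],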
                                  'solver': ['saga'],
                                  'max_iter': [100, 1000],
                                  'C': [0.1, 1, 10],
                                  'class_weight': ['dict', 'balanced'],
                                  'fit_intercept': [True, False]},
                       'ridge': {'alpha': [0.01, 0.1, 1, 10],
                                 'fit_intercept': [True, False],
                                 'normalize': [True, False],
                                 'class_weight': ['dict', 'balanced']},
                       'dtc': {'max_depth': [10, 50, 100, None],
                               'min_samples_split': [2, 5, 10, 20],
                               'min_samples_leaf': [1, 2, 5, 10, 20],
                               'criterion':['gini', 'entropy'],
                               'random_state': [42]}
                      }
                       

models_params_part2 = {'knn': {'n_neighbors': [2, 5, 10, 20, 50],
                               'metric': ['minkowski', 'canberra'],
                               'weights': ['uniform', 'distance']},
                       'rfc': {'random_state': [42],
                               'n_estimators': [100],
                               'bootstrap': [True, False],
                               'max_depth': [None],
                               'min_samples_split': [2, 5, 10],
                               'min_samples_leaf': [1, 2, 5, 10],
                               'criterion':['gini', 'entropy']},
                       'svm': {'kernel':['linear', 'poly', 'sigmoid', 'rbf'],
                               'degree':[3, 4],
                               'C':[0.01, 0.1, 1, 10, 100]}
                        }
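
Before launching a search it can help to check how many fits a grid implies. The sketch below counts the combinations for two of the grids defined above; the helper name `n_candidates` is hypothetical, and the fold count of 5 is the default reported in the search logs.

```python
from math import prod

def n_candidates(param_grid):
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return prod(len(values) for values in param_grid.values())

# Grids copied from the cell above.
logreg_grid = {'penalty': ['l1', 'l2'], 'solver': ['saga'], 'max_iter': [100, 1000],
               'C': [0.1, 1, 10], 'class_weight': ['dict', 'balanced'],
               'fit_intercept': [True, False]}
svm_grid = {'kernel': ['linear', 'poly', 'sigmoid', 'rbf'],
            'degree': [3, 4],
            'C': [0.01, 0.1, 1, 10, 100]}

n_folds = 5
for name, grid in [('logreg', logreg_grid), ('svm', svm_grid)]:
    candidates = n_candidates(grid)
    print(f"{name}: {candidates} candidates, {candidates * n_folds} fits")
```

Every extra option multiplies the total. In the SVC grid, `degree` is only used by the `poly` kernel, so for the other three kernels the two degree values repeat the same fit; splitting the grid per kernel would avoid that duplicated work.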

### Functions for grid search automation

In [22]:
def classifier_automation(model, param_grid, X_train, y_train, X_test, y_test,
                          scaler=None, svd=False, cv=None, full_results=False):
                          
    """Returns best parameters and F1.5 scores (mean cross-validation score on trainset, score on testset).

    Parameters:
    - model:
        - 'logreg' - sklearn.linear_model.LogisticRegression()
        - 'dtc' - sklearn.tree.DecisionTreeClassifier()
        - 'svm' - sklearn.svm.SVC()
        - 'rfc' - sklearn.ensemble.RandomForestClassifier()
        - 'knn' - sklearn.neighbors.KNeighborsClassifier()
        - 'ridge' - sklearn.linear_model.RidgeClassifier()
    - param_grid - parameter grid with parameters for chosen model, a dict with str keys and list values
    - X_train, X_test - array of data
    - y_train, y_test - array of target
    - scaler - whether data should be scaled, default None
        - 'standard' - StandardScaler()
        - 'maxabs' - MaxAbsScaler()
    - svd - whether TruncatedSVD() should be used in preprocessing, default False
    - cv - int, number of folds for cross validation, default None (meaning: 5-fold)
    - full_results - bool, whether GridSearchCV.cv_results_ should be returned as pd.DataFrame, default False

    """

    models = {'logreg': LogisticRegression(), 'dtc': DecisionTreeClassifier(), 'svm': SVC(),
              'rfc': RandomForestClassifier(), 'knn': KNeighborsClassifier(), 'ridge': RidgeClassifier()}
    
    scalers = {'standard': StandardScaler(with_mean=False), 'maxabs': MaxAbsScaler()}

    assert model in models.keys(), "Chosen model is not supported, choose from: \
                                    logreg, dtc, svm, rfc, knn, ridge."

    if svd:
        
        prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4),analyzer='char_wb')),
                                     ('svd', TruncatedSVD())])
        
        model_param_grid = {'preprocessing__ngrams__svd__n_components':[100,256],
                            'preprocessing__ngrams__svd__random_state':[42],
                            'preprocessing__ngrams__svd__algorithm':['arpack', 'randomized']}
        
    else:
        
        prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4),analyzer='char_wb'),)])
        model_param_grid = {}

    preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
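    # Build the pipeline; exactly one branch applies.
    if scaler == 'standard':

        # Centering is not supported for sparse input, hence with_mean=False.
        model_param_grid['scaler__with_mean'] = [False]

        pipe = Pipeline([('preprocessing', preprocessing),
                         ('scaler', StandardScaler()),
                         (model, models[model])])

    elif scaler == 'maxabs':
        pipe = Pipeline([('preprocessing', preprocessing),
                         ('scaler', MaxAbsScaler()),
                         (model, models[model])])

    elif scaler is None:
        pipe = Pipeline([('preprocessing', preprocessing),
                         (model, models[model])])

    else:
        raise ValueError("scaler must be 'standard', 'maxabs' or None.")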

    for key in param_grid.keys():
        model_key = model + '__' + key
        model_param_grid[model_key] = param_grid[key]
        
    print(model_param_grid)
    
    f1p5_scorer = make_scorer(fbeta_score, beta=1.5)
    
    grid = GridSearchCV(pipe, model_param_grid, cv=cv, n_jobs=4, verbose=3, scoring=f1p5_scorer)

    grid.fit(X_train, y_train)

    train_score = grid.best_score_

    if full_results:
        return pd.DataFrame(grid.cv_results_)
    else:
        return [grid.best_params_, train_score, grid.score(X_test, y_test)]
    
    
def multi_classifier_automation(models_params, X_train, y_train, X_test, y_test, filename,
                                scaler=None, svd=False, cv=None):

    """For each model returns a pd.DataFrame of best parameters and F1.5 scores for trainset
    (mean cross-validation score) and testset with best parameters.

    Parameters:
    - models_params - a dict with models as keys and dicts of parameters as values.
        Model names must be consistent with requirements for classifier_automation function.
    - X_train, X_test - array of data
    - y_train, y_test - array of target
    - filename - path for CSV export of results
    - scaler - whether data should be scaled, default None
    - svd - whether TruncatedSVD() should be used in preprocessing, default False
    - cv - int, number of folds for cross validation, default None

    """
    score = []

    for model in models_params.keys():
        result = classifier_automation(model, models_params[model], X_train, y_train, X_test, y_test,
                                       scaler=scaler, svd=svd, cv=cv, full_results=False)
        print(result)
        
        score.append([model, models_params[model],
                         result[0],
                         result[1],
                         result[2]])

    score_df = pd.DataFrame(score)
    score_df.columns = ['model',
                        'param_grid',
                        'best_parameters',
                        'score_train',
                        'score_test']
    
    now = datetime.now()
    current_time = now.strftime("%y%m%d_%H%M")
    
    out_file = f"results/{filename}_{current_time}.csv"

    score_df.to_csv(out_file, index=False)
    
    return score_df

## GridSearchCV

### CountVectorizer + model()

In [187]:
multi_classifier_automation(models_params_part1, X_train, y_train, X_test, y_test,
                            scaler=None, svd=False, filename='part1_CV_models')

{'logreg__penalty': ['l1', 'l2'], 'logreg__solver': ['saga'], 'logreg__max_iter': [100, 1000], 'logreg__C': [0.1, 1, 10], 'logreg__class_weight': ['dict', 'balanced'], 'logreg__fit_intercept': [True, False]}
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    6.4s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.6min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 16.8min finished


[{'logreg__C': 1, 'logreg__class_weight': 'balanced', 'logreg__fit_intercept': False, 'logreg__max_iter': 1000, 'logreg__penalty': 'l2', 'logreg__solver': 'saga'}, 0.7612086346595162, 0.7637418053454362]
{'ridge__alpha': [0.01, 0.1, 1, 10], 'ridge__fit_intercept': [True, False], 'ridge__normalize': [True, False], 'ridge__class_weight': ['dict', 'balanced']}
Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.6s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:   11.1s
[Parallel(n_jobs=4)]: Done 160 out of 160 | elapsed:   14.0s finished


[{'ridge__alpha': 10, 'ridge__class_weight': 'balanced', 'ridge__fit_intercept': False, 'ridge__normalize': True}, 0.7624286737351296, 0.7769964841788045]
{'dtc__max_depth': [10, 50, 100, None], 'dtc__min_samples_split': [2, 5, 10, 20], 'dtc__min_samples_leaf': [1, 2, 5, 10, 20], 'dtc__criterion': ['gini', 'entropy'], 'dtc__random_state': [42]}
Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    2.1s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:   11.3s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   41.0s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.6min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 800 out of 800 | elapsed:  2.7min finished


[{'dtc__criterion': 'gini', 'dtc__max_depth': None, 'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 2, 'dtc__random_state': 42}, 0.6346356419108045, 0.6247892074198987]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,logreg,"{'penalty': ['l1', 'l2'], 'solver': ['saga'], ...","{'logreg__C': 1, 'logreg__class_weight': 'bala...",0.761209,0.763742
1,ridge,"{'alpha': [0.01, 0.1, 1, 10], 'fit_intercept':...","{'ridge__alpha': 10, 'ridge__class_weight': 'b...",0.762429,0.776996
2,dtc,"{'max_depth': [10, 50, 100, None], 'min_sample...","{'dtc__criterion': 'gini', 'dtc__max_depth': N...",0.634636,0.624789


In [12]:
multi_classifier_automation(models_params_part2, X_train, y_train, X_test, y_test,
                            scaler=None, svd=False, filename='part2_CV_models')

{'knn__n_neighbors': [2, 5, 10, 20, 50], 'knn__metric': ['minkowski', 'canberra'], 'knn__weights': ['uniform', 'distance']}
Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    5.5s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:   12.5s finished


[{'knn__metric': 'minkowski', 'knn__n_neighbors': 2, 'knn__weights': 'distance'}, 0.24631795532739997, 0.2405816259087905]
{'rfc__random_state': [42], 'rfc__n_estimators': [100], 'rfc__bootstrap': [True, False], 'rfc__max_depth': [None], 'rfc__min_samples_split': [2, 5, 10], 'rfc__min_samples_leaf': [1, 2, 5, 10], 'rfc__criterion': ['gini', 'entropy']}
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   20.6s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed:  2.6min finished


[{'rfc__bootstrap': False, 'rfc__criterion': 'gini', 'rfc__max_depth': None, 'rfc__min_samples_leaf': 1, 'rfc__min_samples_split': 2, 'rfc__n_estimators': 100, 'rfc__random_state': 42}, 0.6044342113665725, 0.6230274693161894]
{'svm__kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'svm__degree': [3, 4], 'svm__C': [0.01, 0.1, 1, 10, 100]}
Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   25.5s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed:  4.6min finished


[{'svm__C': 10, 'svm__degree': 3, 'svm__kernel': 'sigmoid'}, 0.6693055352686506, 0.6836512261580382]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,knn,"{'n_neighbors': [2, 5, 10, 20, 50], 'metric': ...","{'knn__metric': 'minkowski', 'knn__n_neighbors...",0.246318,0.240582
1,rfc,"{'random_state': [42], 'n_estimators': [100], ...","{'rfc__bootstrap': False, 'rfc__criterion': 'g...",0.604434,0.623027
2,svm,"{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'...","{'svm__C': 10, 'svm__degree': 3, 'svm__kernel'...",0.669306,0.683651


### CountVectorizer + TruncatedSVD() + model()

In [189]:
multi_classifier_automation(models_params_part1, X_train, y_train, X_test, y_test,
                            scaler=None, svd=True, filename='part1_CV_TrSVD_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'logreg__penalty': ['l1', 'l2'], 'logreg__solver': ['saga'], 'logreg__max_iter': [100, 1000], 'logreg__C': [0.1, 1, 10], 'logreg__class_weight': ['dict', 'balanced'], 'logreg__fit_intercept': [True, False]}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   11.1s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  2.5min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  5.0min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  9.2min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 12.3min finished


[{'logreg__C': 10, 'logreg__class_weight': 'balanced', 'logreg__fit_intercept': True, 'logreg__max_iter': 1000, 'logreg__penalty': 'l1', 'logreg__solver': 'saga', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42}, 0.651744536480884, 0.6502560063016937]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'ridge__alpha': [0.01, 0.1, 1, 10], 'ridge__fit_intercept': [True, False], 'ridge__normalize': [True, False], 'ridge__class_weight': ['dict', 'balanced']}
Fitting 5 folds for each of 128 candidates, totalling 640 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.1s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.2s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  2.0min
[Parallel(n_jobs=4)]: Done 640 out of 640 | elapsed:  4.8min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'ridge__alpha': 10, 'ridge__class_weight': 'balanced', 'ridge__fit_intercept': False, 'ridge__normalize': True}, 0.6171370986704648, 0.624949290060852]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'dtc__max_depth': [10, 50, 100, None], 'dtc__min_samples_split': [2, 5, 10, 20], 'dtc__min_samples_leaf': [1, 2, 5, 10, 20], 'dtc__criterion': ['gini', 'entropy'], 'dtc__random_state': [42]}
Fitting 5 folds for each of 640 candidates, totalling 3200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   10.8s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  2.9min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  5.3min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  8.4min
[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed: 12.5min
[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed: 17.2min
[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed: 22.6min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed: 28.9min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed: 35.8min
[Parallel(n_jobs=4)]: Done 3200 out of 3200 | elapsed: 36.1min finished


[{'dtc__criterion': 'gini', 'dtc__max_depth': 10, 'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 10, 'dtc__random_state': 42, 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 100, 'preprocessing__ngrams__svd__random_state': 42}, 0.4431038545897241, 0.41995425957690113]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,logreg,"{'penalty': ['l1', 'l2'], 'solver': ['saga'], ...","{'logreg__C': 10, 'logreg__class_weight': 'bal...",0.651745,0.650256
1,ridge,"{'alpha': [0.01, 0.1, 1, 10], 'fit_intercept':...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.617137,0.624949
2,dtc,"{'max_depth': [10, 50, 100, None], 'min_sample...","{'dtc__criterion': 'gini', 'dtc__max_depth': 1...",0.443104,0.419954
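The candidate counts in the log above follow from the parameter grid: an exhaustive search tries every combination, so the number of candidates is the product of the lengths of all value lists, and the number of fits is that product times the number of folds. A minimal sketch of this arithmetic, using the logistic regression grid printed above (the helper name `count_fits` is illustrative and not part of the notebook):

```python
from math import prod

# Parameter grid as printed in the log for logistic regression with TruncatedSVD.
logreg_svd_grid = {
    'preprocessing__ngrams__svd__n_components': [100, 256],
    'preprocessing__ngrams__svd__random_state': [42],
    'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'],
    'logreg__penalty': ['l1', 'l2'],
    'logreg__solver': ['saga'],
    'logreg__max_iter': [100, 1000],
    'logreg__C': [0.1, 1, 10],
    'logreg__class_weight': ['dict', 'balanced'],
    'logreg__fit_intercept': [True, False],
}


def count_fits(param_grid, n_folds=5):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds


n_candidates, n_fits = count_fits(logreg_svd_grid)
print(n_candidates, n_fits)  # 192 960
```

The same calculation explains why the decision tree grid, with 640 candidates, takes the longest to search.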


In [13]:
multi_classifier_automation(models_params_part2, X_train, y_train, X_test, y_test,
                            scaler=None, svd=True, filename='part2_CV_TrSVD_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'knn__n_neighbors': [2, 5, 10, 20, 50], 'knn__metric': ['minkowski', 'canberra'], 'knn__weights': ['uniform', 'distance']}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   15.7s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  5.4min
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:  9.5min finished


[{'knn__metric': 'minkowski', 'knn__n_neighbors': 50, 'knn__weights': 'distance', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42}, 0.5716614698983179, 0.5842033590558331]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'rfc__random_state': [42], 'rfc__n_estimators': [100], 'rfc__bootstrap': [True, False], 'rfc__max_depth': [None], 'rfc__min_samples_split': [2, 5, 10], 'rfc__min_samples_leaf': [1, 2, 5, 10], 'rfc__criterion': ['gini', 'entropy']}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.1s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.0s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 12.7min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 24.9min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'rfc__bootstrap': False, 'rfc__criterion': 'entropy', 'rfc__max_depth': None, 'rfc__min_samples_leaf': 5, 'rfc__min_samples_split': 2, 'rfc__n_estimators': 100, 'rfc__random_state': 42}, 0.4798243965752268, 0.45348837209302334]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'svm__kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'svm__degree': [3, 4], 'svm__C': [0.01, 0.1, 1, 10, 100]}
Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.0s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   13.9s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  2.1min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 13.4min
[Parallel(n_jobs=4)]: Done 800 out of 800 | elapsed: 13.7min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'svm__C': 100, 'svm__degree': 3, 'svm__kernel': 'rbf'}, 0.6138550958147712, 0.6017902813299233]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,knn,"{'n_neighbors': [2, 5, 10, 20, 50], 'metric': ...","{'knn__metric': 'minkowski', 'knn__n_neighbors...",0.571661,0.584203
1,rfc,"{'random_state': [42], 'n_estimators': [100], ...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.479824,0.453488
2,svm,"{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.613855,0.60179


### CountVectorizer + TruncatedSVD() + StandardScaler() + model()

In [23]:
multi_classifier_automation(models_params_part1, X_train, y_train, X_test, y_test,
                            scaler='standard', svd=True, filename='part1_CV_TrSVD_standard_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'logreg__penalty': ['l1', 'l2'], 'logreg__solver': ['saga'], 'logreg__max_iter': [100, 1000], 'logreg__C': [0.1, 1, 10], 'logreg__class_weight': ['dict', 'balanced'], 'logreg__fit_intercept': [True, False]}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   14.0s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  3.5min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  7.7min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 14.4min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 19.1min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


[{'logreg__C': 10, 'logreg__class_weight': 'balanced', 'logreg__fit_intercept': True, 'logreg__max_iter': 1000, 'logreg__penalty': 'l2', 'logreg__solver': 'saga', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'scaler__with_mean': False}, 0.651757069835041, 0.6523113393915448]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'ridge__alpha': [0.01, 0.1, 1, 10], 'ridge__fit_intercept': [True, False], 'ridge__normalize': [True, False], 'ridge__class_weight': ['dict', 'balanced']}
Fitting 5 folds for each of 128 candidates, totalling 640 fits


[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.1s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.1s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 640 out of 640 | elapsed:  4.5min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'ridge__alpha': 0.01, 'ridge__class_weight': 'balanced', 'ridge__fit_intercept': False, 'ridge__normalize': True, 'scaler__with_mean': False}, 0.615994709357017, 0.6196754563894523]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'dtc__max_depth': [10, 50, 100, None], 'dtc__min_samples_split': [2, 5, 10, 20], 'dtc__min_samples_leaf': [1, 2, 5, 10, 20], 'dtc__criterion': ['gini', 'entropy'], 'dtc__random_state': [42]}
Fitting 5 folds for each of 640 candidates, totalling 3200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   10.6s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.2min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  5.1min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  8.1min
[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed: 12.2min
[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed: 16.8min
[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed: 21.6min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed: 27.1min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed: 33.2min
[Parallel(n_jobs=4)]: Done 3200 out of 3200 | elapsed: 33.4min finished


[{'dtc__criterion': 'gini', 'dtc__max_depth': 10, 'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 10, 'dtc__random_state': 42, 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 100, 'preprocessing__ngrams__svd__random_state': 42, 'scaler__with_mean': False}, 0.44182684688234913, 0.41995425957690113]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,logreg,"{'penalty': ['l1', 'l2'], 'solver': ['saga'], ...","{'logreg__C': 10, 'logreg__class_weight': 'bal...",0.651757,0.652311
1,ridge,"{'alpha': [0.01, 0.1, 1, 10], 'fit_intercept':...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.615995,0.619675
2,dtc,"{'max_depth': [10, 50, 100, None], 'min_sample...","{'dtc__criterion': 'gini', 'dtc__max_depth': 1...",0.441827,0.419954


In [24]:
multi_classifier_automation(models_params_part2, X_train, y_train, X_test, y_test,
                            scaler='standard', svd=True, filename='part2_CV_TrSVD_standard_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'knn__n_neighbors': [2, 5, 10, 20, 50], 'knn__metric': ['minkowski', 'canberra'], 'knn__weights': ['uniform', 'distance']}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   17.5s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.9min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed:  9.8min finished


[{'knn__metric': 'minkowski', 'knn__n_neighbors': 50, 'knn__weights': 'distance', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'scaler__with_mean': False}, 0.5655805926012489, 0.6012847965738758]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'rfc__random_state': [42], 'rfc__n_estimators': [100], 'rfc__bootstrap': [True, False], 'rfc__max_depth': [None], 'rfc__min_samples_split': [2, 5, 10], 'rfc__min_samples_leaf': [1, 2, 5, 10], 'rfc__criterion': ['gini', 'entropy']}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.4s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.1s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.3s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 12.9min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 25.6min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'rfc__bootstrap': False, 'rfc__criterion': 'entropy', 'rfc__max_depth': None, 'rfc__min_samples_leaf': 5, 'rfc__min_samples_split': 2, 'rfc__n_estimators': 100, 'rfc__random_state': 42, 'scaler__with_mean': False}, 0.4852259589332656, 0.4496124031007752]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'scaler__with_mean': [False], 'svm__kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'svm__degree': [3, 4], 'svm__C': [0.01, 0.1, 1, 10, 100]}
Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.3s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.2s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 87.3min
[Parallel(n_jobs=4)]: Done 800 out of 800 | elapsed: 95.1min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'scaler__with_mean': False, 'svm__C': 100, 'svm__degree': 3, 'svm__kernel': 'rbf'}, 0.6252713719622048, 0.5997506234413965]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,knn,"{'n_neighbors': [2, 5, 10, 20, 50], 'metric': ...","{'knn__metric': 'minkowski', 'knn__n_neighbors...",0.565581,0.601285
1,rfc,"{'random_state': [42], 'n_estimators': [100], ...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.485226,0.449612
2,svm,"{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.625271,0.599751


### CountVectorizer + TruncatedSVD() + MaxAbsScaler() + model()

In [25]:
multi_classifier_automation(models_params_part1, X_train, y_train, X_test, y_test,
                            scaler='maxabs', svd=True, filename='part1_CV_TrSVD_maxabs_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'logreg__penalty': ['l1', 'l2'], 'logreg__solver': ['saga'], 'logreg__max_iter': [100, 1000], 'logreg__C': [0.1, 1, 10], 'logreg__class_weight': ['dict', 'balanced'], 'logreg__fit_intercept': [True, False]}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   11.5s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  5.0min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  9.7min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 13.4min finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


[{'logreg__C': 10, 'logreg__class_weight': 'balanced', 'logreg__fit_intercept': True, 'logreg__max_iter': 1000, 'logreg__penalty': 'l1', 'logreg__solver': 'saga', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42}, 0.648948524034443, 0.6512770137524558]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'ridge__alpha': [0.01, 0.1, 1, 10], 'ridge__fit_intercept': [True, False], 'ridge__normalize': [True, False], 'ridge__class_weight': ['dict', 'balanced']}
Fitting 5 folds for each of 128 candidates, totalling 640 fits


[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.2s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.1s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.8min
[Parallel(n_jobs=4)]: Done 640 out of 640 | elapsed:  4.4min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'ridge__alpha': 1, 'ridge__class_weight': 'balanced', 'ridge__fit_intercept': False, 'ridge__normalize': True}, 0.6161625395690937, 0.6176708451273757]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'dtc__max_depth': [10, 50, 100, None], 'dtc__min_samples_split': [2, 5, 10, 20], 'dtc__min_samples_leaf': [1, 2, 5, 10, 20], 'dtc__criterion': ['gini', 'entropy'], 'dtc__random_state': [42]}
Fitting 5 folds for each of 640 candidates, totalling 3200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   10.4s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  2.6min
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  4.9min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  7.8min
[Parallel(n_jobs=4)]: Done 1144 tasks      | elapsed: 11.5min
[Parallel(n_jobs=4)]: Done 1560 tasks      | elapsed: 15.9min
[Parallel(n_jobs=4)]: Done 2040 tasks      | elapsed: 20.6min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed: 26.1min
[Parallel(n_jobs=4)]: Done 3192 tasks      | elapsed: 32.1min
[Parallel(n_jobs=4)]: Done 3200 out of 3200 | elapsed: 32.3min finished


[{'dtc__criterion': 'gini', 'dtc__max_depth': 10, 'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 10, 'dtc__random_state': 42, 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 100, 'preprocessing__ngrams__svd__random_state': 42}, 0.44551184762791624, 0.41995425957690113]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,logreg,"{'penalty': ['l1', 'l2'], 'solver': ['saga'], ...","{'logreg__C': 10, 'logreg__class_weight': 'bal...",0.648949,0.651277
1,ridge,"{'alpha': [0.01, 0.1, 1, 10], 'fit_intercept':...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.616163,0.617671
2,dtc,"{'max_depth': [10, 50, 100, None], 'min_sample...","{'dtc__criterion': 'gini', 'dtc__max_depth': 1...",0.445512,0.419954


In [26]:
multi_classifier_automation(models_params_part2, X_train, y_train, X_test, y_test,
                            scaler='maxabs', svd=True, filename='part2_CV_TrSVD_maxabs_models')

{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'knn__n_neighbors': [2, 5, 10, 20, 50], 'knn__metric': ['minkowski', 'canberra'], 'knn__weights': ['uniform', 'distance']}
Fitting 5 folds for each of 80 candidates, totalling 400 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   18.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:  1.9min
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:  5.9min
[Parallel(n_jobs=4)]: Done 400 out of 400 | elapsed: 10.1min finished


[{'knn__metric': 'minkowski', 'knn__n_neighbors': 50, 'knn__weights': 'distance', 'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42}, 0.5614632731216419, 0.5718475073313782]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'rfc__random_state': [42], 'rfc__n_estimators': [100], 'rfc__bootstrap': [True, False], 'rfc__max_depth': [None], 'rfc__min_samples_split': [2, 5, 10], 'rfc__min_samples_leaf': [1, 2, 5, 10], 'rfc__criterion': ['gini', 'entropy']}
Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    7.2s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   15.5s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  1.1min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 13.6min
[Parallel(n_jobs=4)]: Done 960 out of 960 | elapsed: 26.7min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'rfc__bootstrap': False, 'rfc__criterion': 'entropy', 'rfc__max_depth': None, 'rfc__min_samples_leaf': 5, 'rfc__min_samples_split': 2, 'rfc__n_estimators': 100, 'rfc__random_state': 42}, 0.48007397298957716, 0.4518716577540107]
{'preprocessing__ngrams__svd__n_components': [100, 256], 'preprocessing__ngrams__svd__random_state': [42], 'preprocessing__ngrams__svd__algorithm': ['arpack', 'randomized'], 'svm__kernel': ['linear', 'poly', 'sigmoid', 'rbf'], 'svm__degree': [3, 4], 'svm__C': [0.01, 0.1, 1, 10, 100]}
Fitting 5 folds for each of 160 candidates, totalling 800 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:    1.3s
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed:    6.2s
[Parallel(n_jobs=4)]: Done 280 tasks      | elapsed:   14.5s
[Parallel(n_jobs=4)]: Done 504 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed: 15.1min
[Parallel(n_jobs=4)]: Done 800 out of 800 | elapsed: 15.4min finished


[{'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256, 'preprocessing__ngrams__svd__random_state': 42, 'svm__C': 100, 'svm__degree': 3, 'svm__kernel': 'rbf'}, 0.6225244527068583, 0.6014349332013856]


Unnamed: 0,model,param_grid,best_parameters,score_train,score_test
0,knn,"{'n_neighbors': [2, 5, 10, 20, 50], 'metric': ...","{'knn__metric': 'minkowski', 'knn__n_neighbors...",0.561463,0.571848
1,rfc,"{'random_state': [42], 'n_estimators': [100], ...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.480074,0.451872
2,svm,"{'kernel': ['linear', 'poly', 'sigmoid', 'rbf'...",{'preprocessing__ngrams__svd__algorithm': 'ran...,0.622524,0.601435


## Retesting with best parameters

In [157]:
# LogisticRegression()
# pipeline = CountVectorizer() + model
# {'logreg__C': 1, 'logreg__class_weight': 'balanced', 'logreg__fit_intercept': False,
#  'logreg__max_iter': 1000, 'logreg__penalty': 'l2', 'logreg__solver': 'saga'}

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb'))])
preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
logreg = Pipeline([('preprocessing', preprocessing),
                 ('logreg', LogisticRegression(C=1, class_weight='balanced', fit_intercept=False, max_iter=1000,
                                               penalty='l2', solver='saga'))])

logreg.fit(X_train, y_train)

In [158]:
logreg_pred = logreg.predict(X_test)
save_model_results(y_test, logreg_pred, 'logreg')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,logreg,0.93625,0.763742,0.752827,0.769993,0.781879,0.725857,2014,88,65,233


In [159]:
# RidgeClassifier()
# pipeline = CountVectorizer() + model
# {'ridge__alpha': 10, 'ridge__class_weight': 'balanced', 'ridge__fit_intercept': False, 'ridge__normalize': True}
# Note: `normalize` is ignored when fit_intercept=False and was removed in scikit-learn 1.2.

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb'))])
preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
ridge = Pipeline([('preprocessing', preprocessing),
                 ('ridge', RidgeClassifier(alpha=10, class_weight='balanced',
                                           fit_intercept=False, normalize=True))])

ridge.fit(X_train, y_train)

In [161]:
ridge_pred = ridge.predict(X_test)
save_model_results(y_test, ridge_pred, 'ridge')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,ridge,0.93875,0.776996,0.764045,0.784443,0.798658,0.732308,2015,87,60,238


In [27]:
# DecisionTreeClassifier()
# pipeline = CountVectorizer() + model
# {'dtc__criterion': 'gini', 'dtc__max_depth': None,
#  'dtc__min_samples_leaf': 1, 'dtc__min_samples_split': 2, 'dtc__random_state': 42}

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb'))])
preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
dtc = Pipeline([('preprocessing', preprocessing),
                ('dtc', DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_leaf=1,
                                               min_samples_split=2, random_state=42))])

dtc.fit(X_train, y_train)

In [28]:
dtc_pred = dtc.predict(X_test)
save_model_results(y_test, dtc_pred, 'dtc')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,dtc,0.927083,0.624789,0.661509,0.605953,0.573826,0.780822,2054,48,127,171


In [34]:
# KNeighborsClassifier()
# pipeline = CountVectorizer() + TruncatedSVD() + StandardScaler() + model
# {'knn__metric': 'minkowski', 'knn__n_neighbors': 50, 'knn__weights': 'distance',
#  'preprocessing__ngrams__svd__algorithm': 'randomized', 'preprocessing__ngrams__svd__n_components': 256,
#  'preprocessing__ngrams__svd__random_state': 42, 
#  'scaler__with_mean': False}

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb')),
                              ('svd', TruncatedSVD(n_components=256, algorithm='randomized', random_state=42))])

preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
knn = Pipeline([('preprocessing', preprocessing),
                ('scaler', StandardScaler(with_mean=False)),
                ('knn', KNeighborsClassifier(n_neighbors=50, weights='distance', metric='minkowski'))])

knn.fit(X_train, y_train)

In [35]:
knn_pred = knn.predict(X_test)
save_model_results(y_test, knn_pred, 'knn')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,knn,0.84875,0.601285,0.543396,0.639432,0.724832,0.434608,1821,281,82,216


In [29]:
# RandomForestClassifier()
# pipeline = CountVectorizer() + model
# {'rfc__bootstrap': False, 'rfc__criterion': 'gini', 'rfc__max_depth': None,
#  'rfc__min_samples_leaf': 1, 'rfc__min_samples_split': 2, 'rfc__n_estimators': 100,
#  'rfc__random_state': 42}

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb'))])
preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
rfc = Pipeline([('preprocessing', preprocessing),
                ('rfc', RandomForestClassifier(n_estimators=100, bootstrap=False, criterion='gini', max_depth=None,
                                               min_samples_leaf=1, min_samples_split=2, random_state=42))])

rfc.fit(X_train, y_train)

In [30]:
rfc_pred = rfc.predict(X_test)
save_model_results(y_test, rfc_pred, 'rfc')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,rfc,0.935417,0.623027,0.679089,0.595497,0.550336,0.886486,2081,21,134,164


In [31]:
# SVC()
# pipeline = CountVectorizer() + model
# {'svm__C': 10, 'svm__degree': 3, 'svm__kernel': 'sigmoid'}

prepr_transformer = Pipeline([('vectorizer', CountVectorizer(ngram_range=(3, 4), analyzer='char_wb'))])
preprocessing = ColumnTransformer(transformers=[("ngrams", prepr_transformer, "word")], sparse_threshold=1)
        
svm = Pipeline([('preprocessing', preprocessing),
                ('svm', SVC(C=10, degree=3, kernel='sigmoid'))])

svm.fit(X_train, y_train)

In [32]:
svm_pred = svm.predict(X_test)
save_model_results(y_test, svm_pred, 'svm')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,svm,0.93375,0.683651,0.708257,0.670605,0.647651,0.781377,2048,54,105,193


After a parameter grid search over six popular classification algorithms in 4 variants of the pipeline, 5 of the 6 algorithms achieve their best results on the pipeline with minimal data preparation, where the sparse matrix from CountVectorizer() is fed directly to the classification algorithm. The exception is KNeighborsClassifier(), which scores best with TruncatedSVD() followed by StandardScaler().

The results of RidgeClassifier() and LogisticRegression() are promising but not satisfactory, which is why the author decided to test approaches using neural networks.
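The F1.5-score column in the result tables weights recall more heavily than precision. As a sketch of how it relates to the confusion-matrix counts reported above, the general F-beta formula can be evaluated directly from `tp`, `fp` and `fn`. The helper `fbeta_from_counts` is illustrative only and is not the notebook's `save_model_results`:

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score from confusion-matrix counts; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts taken from the LogisticRegression() and RidgeClassifier() result rows.
logreg_f15 = fbeta_from_counts(tp=233, fp=88, fn=65, beta=1.5)
ridge_f15 = fbeta_from_counts(tp=238, fp=87, fn=60, beta=1.5)

print(f"logreg F1.5 = {logreg_f15:.6f}")
print(f"ridge  F1.5 = {ridge_f15:.6f}")
```

Both values agree with the F1.5-score column, and the same numbers appear as `score_test` in the grid search summaries, which suggests that F1.5 was the metric optimised during the search.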

# Neural networks

The following experiments do not aspire to be a thorough exploration of the possibilities.

As the author is a novice in the field, the following tests are a quick overview of different neural network architectures and layer types, with minimal parameter variations.

In [89]:
# used for smaller, simple MLP networks
early_stopping = EarlyStopping(monitor="val_recall",
                               mode="max",
                               patience=5,
                               restore_best_weights=True)

# used where first run with early_stopping3 showed promise
early_stopping2 = EarlyStopping(monitor="val_recall",
                               mode="max",
                               patience=20,
                               restore_best_weights=True)

# default
early_stopping3 = EarlyStopping(monitor="val_recall",
                               mode="max",
                               patience=10,
                               restore_best_weights=True)
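The three callbacks differ only in `patience`. As a rough illustration of what that parameter changes, the sketch below simulates patience-based stopping on a list of validation scores. The function `simulate_early_stopping` and the score values are made up for this example. It is a simplified model of the idea, not Keras's implementation.

```python
def simulate_early_stopping(scores, patience):
    """Return (stopped_epoch, best_epoch) for a metric that should be maximised.

    Epochs are 0-indexed; stopped_epoch is None when training runs through
    all scores without the patience budget being exhausted.
    """
    best_score = float("-inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


val_recall = [0.50, 0.60, 0.55, 0.58, 0.59, 0.57, 0.56, 0.61]  # made-up values
print(simulate_early_stopping(val_recall, patience=5))
print(simulate_early_stopping(val_recall, patience=10))
```

With the smaller patience the run stops before the late improvement in the last epoch is seen. With `restore_best_weights=True` the model would then be reset to the weights of the best epoch found so far.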

In [115]:
regularization = l2(0.001)
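As a reminder of what this regularizer does when passed as `kernel_regularizer`: it adds a penalty to the loss equal to the factor times the sum of the squared weights of that layer. A minimal sketch in plain Python, where `l2_penalty` and the weight matrix are hypothetical and serve only as illustration:

```python
def l2_penalty(weights, factor=0.001):
    """Regularization factor times the sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)


kernel = [[0.5, -1.0], [2.0, 0.0]]  # made-up 2x2 weight matrix
print(l2_penalty(kernel))
```

Large weights are penalized quadratically, which pushes the network toward smaller weights.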

In [29]:
input_size = Xcv_train.shape[-1]
input_size

7198
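With this input size, the parameter counts that `summary()` reports for the MLP models can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


input_size = 7198  # value printed above
hidden = dense_params(input_size, 16)  # first Dense(16) layer
output = dense_params(16, 1)           # final Dense(1) layer
print(hidden, output, hidden + output)
```

Almost all parameters sit in the first layer, because each of the 7198 n-gram features is connected to every hidden unit.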

## MLP

In [30]:
clear_session()

mlp0 = Sequential()

mlp0.add(Dense(16, input_dim=input_size, activation='relu'))
mlp0.add(Dense(1, 'sigmoid'))

mlp0.summary()

mlp0.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp0.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


<tensorflow.python.keras.callbacks.History at 0x7fda9a374970>

In [31]:
mlp0_pred = (mlp0.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp0_pred, 'mlp0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp0,0.94,0.722998,0.741935,0.71281,0.694631,0.796154,2049,53,91,207
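The metric columns can be recomputed from the confusion counts in the same row. The sketch below assumes that `save_model_results` uses the standard F-beta definition (its implementation is not shown in this section). `fbeta_from_counts` is a hypothetical helper, and the counts are taken from the mlp0 row above.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score from confusion counts; beta > 1 weights recall more."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


tp, fp, fn = 207, 53, 91  # mlp0 row
print("precision", round(tp / (tp + fp), 6))
print("recall", round(tp / (tp + fn), 6))
for beta in (1, 1.5, 2):
    print(f"F{beta}", round(fbeta_from_counts(tp, fp, fn, beta), 6))
```

Because false negatives are multiplied by beta squared, F1.5 and F2 reward models that miss fewer non-native words, even at the cost of some extra false positives.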


In [32]:
clear_session()

mlp0scaled = Sequential()

mlp0scaled.add(Dense(16, input_dim=input_size, activation='relu'))
mlp0scaled.add(Dense(1, 'sigmoid'))

mlp0scaled.summary()

mlp0scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp0scaled.fit(Xscaled_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


<tensorflow.python.keras.callbacks.History at 0x7fda9c31cc70>

In [33]:
mlp0scaled_pred = (mlp0scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp0scaled_pred, 'mlp0scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp0scaled,0.919167,0.703055,0.690096,0.710526,0.724832,0.658537,1990,112,82,216


In [37]:
clear_session()

mlp0maxabs = Sequential()

mlp0maxabs.add(Dense(16, input_dim=input_size, activation='relu'))
mlp0maxabs.add(Dense(1, "sigmoid"))

mlp0maxabs.summary()

mlp0maxabs.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp0maxabs.fit(Xmaxabs_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50


<tensorflow.python.keras.callbacks.History at 0x7fdab2e25fd0>

In [38]:
mlp0maxabs_pred = (mlp0maxabs.predict(Xmaxabs_test) > 0.5).astype("int32")
save_model_results(y_test, mlp0maxabs_pred, 'mlp0maxabs')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp0maxabs,0.941667,0.728056,0.748201,0.717241,0.697987,0.806202,2052,50,90,208


In [39]:
clear_session()

mlp1 = Sequential()
mlp1.add(Dense(16, input_dim=input_size, activation='sigmoid'))
mlp1.add(Dense(1, 'sigmoid'))

mlp1.summary()

mlp1.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp1.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50


<tensorflow.python.keras.callbacks.History at 0x7fdab3154910>

In [40]:
mlp1_pred = (mlp1.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp1_pred, 'mlp1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp1,0.939583,0.712425,0.735883,0.699931,0.677852,0.804781,2053,49,96,202


In [41]:
clear_session()

mlp1scaled = Sequential()
mlp1scaled.add(Dense(16, input_dim=input_size, activation='sigmoid'))
mlp1scaled.add(Dense(1, 'sigmoid'))

mlp1scaled.summary()

mlp1scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp1scaled.fit(Xscaled_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50


<tensorflow.python.keras.callbacks.History at 0x7fdab34b3160>

In [42]:
mlp1scaled_pred = (mlp1scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp1scaled_pred, 'mlp1scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp1scaled,0.895833,0.678287,0.640805,0.701258,0.748322,0.560302,1927,175,75,223


In [43]:
clear_session()

mlp1maxabs = Sequential()
mlp1maxabs.add(Dense(16, input_dim=input_size, activation='sigmoid'))
mlp1maxabs.add(Dense(1, 'sigmoid'))

mlp1maxabs.summary()

mlp1maxabs.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp1maxabs.fit(Xmaxabs_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 115,201
Trainable params: 115,201
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50


<tensorflow.python.keras.callbacks.History at 0x7fda9d51c370>

In [44]:
mlp1maxabs_pred = (mlp1maxabs.predict(Xmaxabs_test) > 0.5).astype("int32")
save_model_results(y_test, mlp1maxabs_pred, 'mlp1maxabs')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp1maxabs,0.939167,0.709669,0.733577,0.696949,0.674497,0.804,2053,49,97,201


In [45]:
clear_session()

mlp2 = Sequential()
mlp2.add(Dense(16, input_dim=input_size, activation='relu'))
mlp2.add(Dense(16, 'relu'))
mlp2.add(Dense(1, 'sigmoid'))

mlp2.summary()

mlp2.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp2.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


<tensorflow.python.keras.callbacks.History at 0x7fda9e51d340>

In [46]:
mlp2_pred = (mlp2.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp2_pred, 'mlp2')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp2,0.93875,0.72645,0.740741,0.718686,0.704698,0.780669,2043,59,88,210


In [47]:
clear_session()

mlp2scaled = Sequential()
mlp2scaled.add(Dense(16, input_dim=input_size, activation='relu'))
mlp2scaled.add(Dense(16, 'relu'))
mlp2scaled.add(Dense(1, 'sigmoid'))

mlp2scaled.summary()

mlp2scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp2scaled.fit(Xscaled_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50


<tensorflow.python.keras.callbacks.History at 0x7fda9e5b5190>

In [48]:
mlp2scaled_pred = (mlp2scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp2scaled_pred, 'mlp2scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp2scaled,0.93125,0.758065,0.740157,0.768476,0.788591,0.697329,2000,102,63,235


In [51]:
clear_session()

mlp2maxabs = Sequential()
mlp2maxabs.add(Dense(16, input_dim=input_size, activation='relu'))
mlp2maxabs.add(Dense(16, 'relu'))
mlp2maxabs.add(Dense(1, 'sigmoid'))

mlp2maxabs.summary()

mlp2maxabs.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp2maxabs.fit(Xmaxabs_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50


<tensorflow.python.keras.callbacks.History at 0x7fda61308100>

In [52]:
mlp2maxabs_pred = (mlp2maxabs.predict(Xmaxabs_test) > 0.5).astype("int32")
save_model_results(y_test, mlp2maxabs_pred, 'mlp2maxabs')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp2maxabs,0.93875,0.714825,0.735135,0.703934,0.684564,0.793774,2049,53,94,204


As data scaled with MaxAbsScaler() did not bring a consistent improvement in the first 3 models, it was not used further.
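For reference, max-abs scaling divides every feature column by its largest absolute value, so zeros stay zeros and a sparse count matrix stays sparse. A minimal sketch of the idea in plain Python, where `max_abs_scale` and the toy counts are hypothetical and this is not scikit-learn's implementation:

```python
def max_abs_scale(rows):
    """Scale each column to the range [-1, 1] by its maximum absolute value."""
    n_cols = len(rows[0])
    # an all-zero column is left unchanged by using a scale of 1.0
    scales = [max(abs(row[j]) for row in rows) or 1.0 for j in range(n_cols)]
    return [[row[j] / scales[j] for j in range(n_cols)] for row in rows]


counts = [[1, 0, 2], [2, 0, 4], [0, 0, 1]]  # made-up n-gram counts
print(max_abs_scale(counts))
```

For character n-gram counts of single words, most non-zero entries are already 1, which may be one reason the scaled input behaves much like the raw counts.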

In [124]:
clear_session()

mlp3 = Sequential()
mlp3.add(Dense(16, input_dim=input_size, activation='relu', kernel_regularizer=regularization))
mlp3.add(Dense(16, 'relu', kernel_regularizer=regularization))
mlp3.add(Dense(1, 'sigmoid'))

mlp3.summary()

mlp3.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp3.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


<tensorflow.python.keras.callbacks.History at 0x7fda230caee0>

In [125]:
mlp3_pred = (mlp3.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp3_pred, 'mlp3')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp3,0.937917,0.726815,0.739054,0.720137,0.708054,0.772894,2040,62,87,211


In [122]:
clear_session()

mlp3scaled = Sequential()
mlp3scaled.add(Dense(16, input_dim=input_size, activation='relu', kernel_regularizer=regularization))
mlp3scaled.add(Dense(16, 'relu', kernel_regularizer=regularization))
mlp3scaled.add(Dense(1, 'sigmoid'))

mlp3scaled.summary()

mlp3scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp3scaled.fit(Xscaled_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50


<tensorflow.python.keras.callbacks.History at 0x7fda22ed9520>

In [123]:
mlp3scaled_pred = (mlp3scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp3scaled_pred, 'mlp3scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp3scaled,0.932083,0.712375,0.719449,0.708475,0.701342,0.738516,2028,74,89,209


In [127]:
clear_session()

mlp4 = Sequential()
mlp4.add(Dense(16, input_dim=input_size, activation='relu', kernel_regularizer=regularization))
mlp4.add(Dropout(0.5))
mlp4.add(Dense(16, 'relu', kernel_regularizer=regularization))
mlp4.add(Dropout(0.5))
mlp4.add(Dense(1, 'sigmoid'))

mlp4.summary()

mlp4.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp4.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fda23ce5cd0>

In [128]:
mlp4_pred = (mlp4.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp4_pred, 'mlp4')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp4,0.93125,0.647454,0.683301,0.628975,0.597315,0.798206,2057,45,120,178
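As a reminder of what the `Dropout(0.5)` layers do: during training each activation is zeroed with probability 0.5 and the survivors are scaled up so the expected sum is unchanged, while at prediction time the layer passes values through untouched. The sketch below illustrates this "inverted dropout" idea in plain Python. The `dropout` function and the activations are hypothetical, and this is not the Keras implementation.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` while training; identity otherwise."""
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)  # compensates for the dropped units
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [0.2, 1.5, 0.7, 0.9]  # made-up hidden-layer outputs
print(dropout(activations, 0.5, training=True, rng=rng))
print(dropout(activations, 0.5, training=False, rng=rng))
```

With only 16 units per hidden layer, dropping half of them in every step removes a large share of the network's capacity, which may contribute to the lower recall of mlp4.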


In [129]:
clear_session()

# note: unlike mlp4, this variant is defined without kernel_regularizer
mlp4scaled = Sequential()
mlp4scaled.add(Dense(16, input_dim=input_size, activation='relu'))
mlp4scaled.add(Dropout(0.5))
mlp4scaled.add(Dense(16, 'relu'))
mlp4scaled.add(Dropout(0.5))
mlp4scaled.add(Dense(1, 'sigmoid'))

mlp4scaled.summary()

mlp4scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp4scaled.fit(Xscaled_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                115184    
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 115,473
Trainable params: 115,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fda20a178b0>

In [130]:
mlp4scaled_pred = (mlp4scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp4scaled_pred, 'mlp4scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp4scaled,0.930833,0.710141,0.715753,0.707037,0.701342,0.730769,2025,77,89,209


In [135]:
clear_session()

mlp5 = Sequential()
mlp5.add(Dense(32, input_dim=input_size, activation='relu', kernel_regularizer=regularization))
mlp5.add(Dropout(0.5))
mlp5.add(Dense(16, 'relu', kernel_regularizer=regularization))
mlp5.add(Dense(1, 'sigmoid'))

mlp5.summary()

mlp5.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

mlp5.fit(Xcv_train, y_train,
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                230368    
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 230,913
Trainable params: 230,913
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50


<tensorflow.python.keras.callbacks.History at 0x7fd9f50a5910>

In [136]:
mlp5_pred = (mlp5.predict(Xcv_test) > 0.5).astype("int32")
save_model_results(y_test, mlp5_pred, 'mlp5')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp5,0.94,0.732589,0.746479,0.725034,0.711409,0.785185,2044,58,86,212


In [133]:
clear_session()

mlp5scaled = Sequential()
mlp5scaled.add(Dense(32, input_dim=input_size, activation='relu', kernel_regularizer=regularization))
mlp5scaled.add(Dropout(0.5))
mlp5scaled.add(Dense(16, 'relu', kernel_regularizer=regularization))
mlp5scaled.add(Dense(1, 'sigmoid'))

mlp5scaled.summary()

mlp5scaled.compile(loss="binary_crossentropy",
               optimizer="adam",
               metrics=[Precision(), Recall()])

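# fit on the scaled features so training matches the Xscaled_test input used for prediction
# (the output recorded below came from a run fit on Xcv_train)
mlp5scaled.fit(Xscaled_train, y_train,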
           epochs=50,
           batch_size=32,
           validation_split=0.1,
           callbacks=[early_stopping])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                230368    
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 230,913
Trainable params: 230,913
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50


<tensorflow.python.keras.callbacks.History at 0x7fda23905a00>

In [134]:
mlp5scaled_pred = (mlp5scaled.predict(Xscaled_test) > 0.5).astype("int32")
save_model_results(y_test, mlp5scaled_pred, 'mlp5scaled')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,mlp5scaled,0.782083,0.621076,0.514392,0.702689,0.92953,0.355584,1600,502,21,277


## Embedding layer

In [55]:
clear_session()

emb0 = Sequential()

emb0.add(Embedding(input_dim=vocab_size,
                   output_dim=50,
                   input_length=maxlen))
emb0.add(Flatten())
emb0.add(Dense(10, activation='relu'))
emb0.add(Dense(1, activation='sigmoid'))
emb0.compile(loss="binary_crossentropy",
             optimizer="adam",
             metrics=[Precision(), Recall()])

emb0.summary()

emb0.fit(Xtok_train, ytok_train,
         epochs=50,
         batch_size=10,
         validation_split=0.1,
         callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 50)            1900      
_________________________________________________________________
flatten (Flatten)            (None, 1400)              0         
_________________________________________________________________
dense (Dense)                (None, 10)                14010     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 15,921
Trainable params: 15,921
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50


<tensorflow.python.keras.callbacks.History at 0x7fda6ab3c130>

In [57]:
emb0_pred = (emb0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, emb0_pred, 'emb0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,emb0,0.92,0.655837,0.665505,0.650545,0.64094,0.692029,2017,85,107,191


In [58]:
clear_session()

emb1 = Sequential()

emb1.add(Embedding(input_dim=vocab_size,
                   output_dim=50,
                   input_length=maxlen))
emb1.add(GlobalMaxPool1D())
emb1.add(Dense(10, activation='relu'))
emb1.add(Dense(1, activation='sigmoid'))
emb1.compile(loss="binary_crossentropy",
             optimizer="adam",
             metrics=[Precision(), Recall()])

emb1.summary()

emb1.fit(Xtok_train, ytok_train,
         epochs=50,
         batch_size=10,
         validation_split=0.1,
         callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 50)            1900      
_________________________________________________________________
global_max_pooling1d (Global (None, 50)                0         
_________________________________________________________________
dense (Dense)                (None, 10)                510       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 2,421
Trainable params: 2,421
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fda6db54070>

In [59]:
emb1_pred = (emb1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, emb1_pred, 'emb1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,emb1,0.91125,0.601399,0.617594,0.592695,0.577181,0.664093,2015,87,126,172
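The only difference between emb0 and emb1 is how the embedded sequence of shape (28, 50) is reduced before the dense layers: `Flatten()` keeps every position (1400 values), while `GlobalMaxPool1D()` keeps only the maximum of each embedding channel over all positions (50 values). A toy illustration in plain Python, where the function names and the 3-step, 2-channel input are made up:

```python
def flatten(sequence):
    """Concatenate all time steps into one vector; position is preserved."""
    return [value for step in sequence for value in step]


def global_max_pool(sequence):
    """Keep the maximum of each channel across time steps; position is lost."""
    return [max(channel) for channel in zip(*sequence)]


embedded = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]]  # 3 steps, 2 channels
print(flatten(embedded))
print(global_max_pool(embedded))
```

Pooling directly over single-character embeddings discards the order of the letters, which may explain why emb1 scores lower than emb0.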


### Embedding layer + CNN

In [61]:
clear_session()

emb2 = Sequential()

emb2.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
emb2.add(Conv1D(128, 5, activation='relu'))
emb2.add(GlobalMaxPool1D())
emb2.add(Dense(10, activation='relu'))
emb2.add(Dense(1, activation='sigmoid'))

emb2.compile(loss="binary_crossentropy",
             optimizer="adam",
             metrics=[Precision(), Recall()])

emb2.summary()

emb2.fit(Xtok_train, ytok_train,
         epochs=50,
         batch_size=10,
         validation_split=0.1,
         callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 24, 128)           64128     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 10)                1290      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 69,229
Trainable params: 69,229
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fda9b225580>

In [62]:
emb2_pred = (emb2.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, emb2_pred, 'emb2')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,emb2,0.934167,0.77358,0.753125,0.785528,0.808725,0.704678,2001,101,57,241
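The shapes and parameter counts in the emb2 summary can be verified with a few lines of arithmetic. The helper names below are hypothetical. `vocab_size = 38` is not shown in this section and is inferred from the summary (3800 embedding parameters divided by 100 dimensions), and the convolution is assumed to use the Keras defaults of 'valid' padding and stride 1.

```python
def embedding_params(vocab_size, output_dim):
    """One vector of length output_dim per vocabulary entry."""
    return vocab_size * output_dim


def conv1d_output_length(input_length, kernel_size):
    """Number of window positions with 'valid' padding and stride 1."""
    return input_length - kernel_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


vocab_size = 38  # inferred from the summary, see above
maxlen = 28      # sequence length shown in the summary
print(embedding_params(vocab_size, 100))
print(conv1d_output_length(maxlen, 5))
print(conv1d_params(5, 100, 128))
```

Each of the 128 filters looks at windows of 5 consecutive characters, so the convolution acts as a learned detector of character 5-grams, and `GlobalMaxPool1D()` then records whether each pattern occurs anywhere in the word.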


In [63]:
clear_session()

emb3 = Sequential()

emb3.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
emb3.add(Conv1D(128, 5, activation='relu'))
emb3.add(GlobalMaxPool1D())
emb3.add(Dense(10, activation='relu'))
emb3.add(Dense(1, activation='sigmoid'))

emb3.compile(loss="binary_crossentropy",
             optimizer="adam",
             metrics=[Precision(), Recall()])

emb3.summary()

emb3.fit(Xtok_train, ytok_train,
         epochs=100,
         batch_size=10,
         validation_split=0.1,
         callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 24, 128)           64128     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 10)                1290      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 69,229
Trainable params: 69,229
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
...

<tensorflow.python.keras.callbacks.History at 0x7fda516065b0>

In [64]:
emb3_pred = (emb3.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, emb3_pred, 'emb3')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,emb3,0.92625,0.757649,0.730594,0.773694,0.805369,0.668524,1983,119,58,240
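
The shapes and parameter counts in the convolutional summaries above follow from simple arithmetic. The sketch below reproduces them in plain Python, assuming the Keras defaults for `Conv1D` (no padding, stride 1). The helper names are hypothetical, and the embedding count of 3800 is copied from the summary rather than computed, since `vocab_size` is defined outside this section.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # With no padding, the kernel only fits where it fully overlaps the input.
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of size kernel_size x in_channels per filter, plus one bias each.
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    # Weight matrix plus one bias per unit.
    return inputs * units + units


maxlen, embedding_dim = 28, 100          # from the (None, 28, 100) embedding output
embedding_p = 3800                       # copied from the summary

conv_len = conv1d_output_length(maxlen, kernel_size=5)
conv_p = conv1d_params(embedding_dim, filters=128, kernel_size=5)
hidden_p = dense_params(128, 10)         # GlobalMaxPool1D leaves 128 features
output_p = dense_params(10, 1)
total_p = embedding_p + conv_p + hidden_p + output_p

print(conv_len, conv_p, hidden_p, output_p, total_p)
```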


## SimpleRNN

In [66]:
clear_session()

smpl0 = Sequential()

smpl0.add(Embedding(input_dim=vocab_size,
                    output_dim=100,
                    input_length=maxlen))
smpl0.add(SimpleRNN(32, activation="tanh"))
smpl0.add(Dense(1, "sigmoid"))

smpl0.summary()

smpl0.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

smpl0.fit(Xtok_train, ytok_train,
          epochs=50,
          batch_size=32,
          validation_split=0.1,
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 32)                4256      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 8,089
Trainable params: 8,089
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50


<tensorflow.python.keras.callbacks.History at 0x7fda5501d910>

In [68]:
smpl0_pred = (smpl0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, smpl0_pred, 'smpl0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,smpl0,0.927917,0.743411,0.726698,0.753111,0.771812,0.686567,1997,105,68,230
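
The `SimpleRNN` parameter counts in the summaries of this section can be checked by hand: each unit has one weight per input feature, one weight per recurrent unit, and one bias. A minimal sketch, with `simple_rnn_params` as a hypothetical helper name and the embedding count copied from the summary:

```python
def simple_rnn_params(input_dim, units):
    # Input kernel (input_dim x units) + recurrent kernel (units x units) + bias.
    return units * (input_dim + units) + units


def dense_params(inputs, units):
    return inputs * units + units


embedding_p = 3800      # copied from the summary
embedding_dim = 100

rnn32_p = simple_rnn_params(embedding_dim, 32)
rnn64_p = simple_rnn_params(embedding_dim, 64)

total_32 = embedding_p + rnn32_p + dense_params(32, 1)
total_64 = embedding_p + rnn64_p + dense_params(64, 1)

print(rnn32_p, total_32)
print(rnn64_p, total_64)
```

Doubling the number of units from 32 to 64 more than doubles the recurrent layer's parameters, because the recurrent kernel grows quadratically with the unit count.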


In [69]:
clear_session()

smpl1 = Sequential()

smpl1.add(Embedding(input_dim=vocab_size,
                    output_dim=100,
                    input_length=maxlen))
smpl1.add(SimpleRNN(64, activation="tanh"))
smpl1.add(Dense(1, "sigmoid"))

smpl1.summary()

smpl1.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

smpl1.fit(Xtok_train, ytok_train,
          epochs=50,
          batch_size=32,
          validation_split=0.1,
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                10560     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 14,425
Trainable params: 14,425
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
...

Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50


<tensorflow.python.keras.callbacks.History at 0x7fda56479610>

In [71]:
smpl1_pred = (smpl1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, smpl1_pred, 'smpl1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,smpl1,0.925417,0.728364,0.714514,0.736358,0.751678,0.680851,1997,105,74,224


## LSTM

In [72]:
clear_session()

lstm0 = Sequential()

lstm0.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
lstm0.add(LSTM(32, return_sequences=True))
lstm0.add(LSTM(16))
lstm0.add(Dense(1, "sigmoid"))

lstm0.summary()

lstm0.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

lstm0.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
lstm (LSTM)                  (None, 28, 32)            17024     
_________________________________________________________________
lstm_1 (LSTM)                (None, 16)                3136      
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 23,977
Trainable params: 23,977
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50


<tensorflow.python.keras.callbacks.History at 0x7fda585950a0>

In [73]:
lstm0_pred = (lstm0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, lstm0_pred, 'lstm0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,lstm0,0.9225,0.742586,0.716463,0.758065,0.788591,0.656425,1979,123,63,235
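
An LSTM layer holds four gate-like weight sets (input, forget, cell candidate, output), each shaped like a `SimpleRNN` layer, which explains the parameter counts in the summary above. A minimal sketch for the stacked `lstm0` model, with `lstm_params` as a hypothetical helper name and the embedding count copied from the summary:

```python
def lstm_params(input_dim, units):
    # Four weight sets, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    return inputs * units + units


embedding_p = 3800                      # copied from the summary
first_lstm_p = lstm_params(100, 32)     # reads the 100-dimensional embeddings
second_lstm_p = lstm_params(32, 16)     # reads the 32-dimensional sequence output
output_p = dense_params(16, 1)
total_p = embedding_p + first_lstm_p + second_lstm_p + output_p

print(first_lstm_p, second_lstm_p, output_p, total_p)
```

The first layer needs `return_sequences=True` so that the second LSTM receives one 32-dimensional vector per time step rather than only the final state.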


In [77]:
clear_session()

lstm1 = Sequential()

lstm1.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
lstm1.add(LSTM(32))
lstm1.add(Dense(1, "sigmoid"))

lstm1.summary()

lstm1.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

lstm1.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 20,857
Trainable params: 20,857
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
...

Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50


<tensorflow.python.keras.callbacks.History at 0x7fda701b4e50>

In [78]:
lstm1_pred = (lstm1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, lstm1_pred, 'lstm1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,lstm1,0.934167,0.75114,0.742671,0.755968,0.765101,0.721519,2014,88,70,228


In [80]:
clear_session()

lstm2 = Sequential()

lstm2.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
lstm2.add(LSTM(32))
lstm2.add(Dense(16, "tanh"))
lstm2.add(Dense(1, "sigmoid"))

lstm2.summary()

lstm2.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

lstm2.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 16)                528       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 21,369
Trainable params: 21,369
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50


<tensorflow.python.keras.callbacks.History at 0x7fdaa8390070>

In [81]:
lstm2_pred = (lstm2.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, lstm2_pred, 'lstm2')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,lstm2,0.929583,0.744622,0.730463,0.752794,0.768456,0.696049,2002,100,69,229


In [82]:
clear_session()

lstm3 = Sequential()

lstm3.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
lstm3.add(LSTM(32))
lstm3.add(Dense(16, "sigmoid"))
lstm3.add(Dense(1, "sigmoid"))

lstm3.summary()

lstm3.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

lstm3.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 16)                528       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 21,369
Trainable params: 21,369
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50


<tensorflow.python.keras.callbacks.History at 0x7fdaabb1e1f0>

In [83]:
lstm3_pred = (lstm3.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, lstm3_pred, 'lstm3')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,lstm3,0.937917,0.778832,0.763116,0.787919,0.805369,0.725076,2011,91,58,240


In [84]:
clear_session()

lstm4 = Sequential()

lstm4.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
lstm4.add(LSTM(32))
lstm4.add(Dense(32, "relu"))
lstm4.add(Dense(16, "sigmoid"))
lstm4.add(Dense(1, "sigmoid"))

lstm4.summary()

lstm4.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

lstm4.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense (Dense)                (None, 32)                1056      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 22,425
Trainable params: 22,425
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fdaaebcabe0>

In [85]:
lstm4_pred = (lstm4.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, lstm4_pred, 'lstm4')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,lstm4,0.931667,0.737659,0.731148,0.741356,0.748322,0.714744,2013,89,75,223


## BiLSTM

In [87]:
clear_session()

bilstm0 = Sequential()

bilstm0.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
bilstm0.add(Bidirectional(LSTM(32)))
bilstm0.add(Dense(1, "sigmoid"))

bilstm0.summary()

bilstm0.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

bilstm0.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                34048     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 37,913
Trainable params: 37,913
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
...

<tensorflow.python.keras.callbacks.History at 0x7fdabb6c9190>

In [88]:
bilstm0_pred = (bilstm0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, bilstm0_pred, 'bilstm0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,bilstm0,0.9375,0.751928,0.75,0.753012,0.755034,0.745033,2025,77,73,225
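
Wrapping an LSTM in `Bidirectional` trains one copy that reads the word left to right and one that reads it right to left, then concatenates their outputs. This doubles both the parameter count and the output width, as the sketch below checks against the summaries in this section. `lstm_params` and `bilstm_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    # Four weight sets, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    # Forward and backward copies have identical shapes.
    return 2 * lstm_params(input_dim, units)


embedding_p = 3800                        # copied from the summary
units = 32
output_width = 2 * units                  # concatenated forward + backward states

first_layer_p = bilstm_params(100, units)
dense_p = output_width * 1 + 1
total_p = embedding_p + first_layer_p + dense_p

# Second layers of the stacked variants read the 64-dimensional sequence output.
second_layer_32_p = bilstm_params(output_width, 32)
second_layer_16_p = bilstm_params(output_width, 16)

print(output_width, first_layer_p, dense_p, total_p)
print(second_layer_32_p, second_layer_16_p)
```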


In [93]:
clear_session()

bilstm1 = Sequential()

bilstm1.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
bilstm1.add(Bidirectional(LSTM(32, return_sequences=True)))
bilstm1.add(Bidirectional(LSTM(32)))
bilstm1.add(Dense(1, "sigmoid"))

bilstm1.summary()

bilstm1.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

bilstm1.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping2])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
bidirectional (Bidirectional (None, 28, 64)            34048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                24832     
_________________________________________________________________
dense (Dense)                (None, 1)                 65        
Total params: 62,745
Trainable params: 62,745
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fdadeed0fd0>

In [94]:
bilstm1_pred = (bilstm1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, bilstm1_pred, 'bilstm1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,bilstm1,0.938333,0.733263,0.743056,0.727891,0.718121,0.769784,2038,64,84,214


In [95]:
clear_session()

bilstm2 = Sequential()

bilstm2.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
bilstm2.add(Bidirectional(LSTM(32, return_sequences=True)))
bilstm2.add(Bidirectional(LSTM(16)))
bilstm2.add(Dense(1, "sigmoid"))

bilstm2.summary()

bilstm2.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

bilstm2.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
bidirectional (Bidirectional (None, 28, 64)            34048     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 32)                10368     
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 48,249
Trainable params: 48,249
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50


<tensorflow.python.keras.callbacks.History at 0x7fdac14b0250>

In [97]:
bilstm2_pred = (bilstm2.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, bilstm2_pred, 'bilstm2')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,bilstm2,0.933333,0.737045,0.734219,0.738636,0.741611,0.726974,2019,83,77,221


In [98]:
clear_session()

bilstm3 = Sequential()

bilstm3.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
bilstm3.add(Bidirectional(LSTM(32)))
bilstm3.add(Dense(32, "relu"))
bilstm3.add(Dense(16, "sigmoid"))
bilstm3.add(Dense(1, "sigmoid"))

bilstm3.summary()

bilstm3.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

bilstm3.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                34048     
_________________________________________________________________
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 40,473
Trainable params: 40,473
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fdaf2a4e910>

In [99]:
bilstm3_pred = (bilstm3.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, bilstm3_pred, 'bilstm3')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,bilstm3,0.9375,0.751928,0.75,0.753012,0.755034,0.745033,2025,77,73,225


In [100]:
clear_session()

bilstm4 = Sequential()

bilstm4.add(Embedding(vocab_size,
                   output_dim=100,
                   input_length=maxlen))
bilstm4.add(Bidirectional(LSTM(32)))
bilstm4.add(Dense(16, "sigmoid"))
bilstm4.add(Dense(1, "sigmoid"))

bilstm4.summary()

bilstm4.compile(loss="binary_crossentropy",
              optimizer="adam",
              metrics=[Precision(), Recall()])

bilstm4.fit(Xtok_train, ytok_train, 
          epochs=50, 
          batch_size=32, 
          validation_split=0.1, 
          callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                34048     
_________________________________________________________________
dense (Dense)                (None, 16)                1040      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 38,905
Trainable params: 38,905
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
...

<tensorflow.python.keras.callbacks.History at 0x7fdb19a43640>

In [101]:
bilstm4_pred = (bilstm4.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, bilstm4_pred, 'bilstm4')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,bilstm4,0.942083,0.769429,0.767947,0.770261,0.771812,0.76412,2031,71,68,230


## Conv + LSTM

In [102]:
clear_session()

convlstm0 = Sequential()

convlstm0.add(Embedding(vocab_size,
                        output_dim=100,
                        input_length=maxlen))
convlstm0.add(Conv1D(16, 3, activation="relu"))
convlstm0.add(MaxPooling1D(2))
convlstm0.add(LSTM(32))
convlstm0.add(Dense(1, "sigmoid"))

convlstm0.summary()

convlstm0.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=[Precision(), Recall()])

convlstm0.fit(Xtok_train, ytok_train,
              epochs=50,
              batch_size=32,
              validation_split=0.1,
              callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 26, 16)            4816      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 13, 16)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                6272      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 14,921
Trainable params: 14,921
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
...

<tensorflow.python.keras.callbacks.History at 0x7fdb1f730100>

In [103]:
convlstm0_pred = (convlstm0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, convlstm0_pred, 'convlstm0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convlstm0,0.938333,0.77961,0.764331,0.788436,0.805369,0.727273,2012,90,58,240
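
In the Conv + LSTM models the convolution and pooling layers shorten the sequence before the LSTM sees it, which is why the LSTM here has far fewer parameters than in the plain LSTM models. The sketch below traces the sequence length and parameter counts for `convlstm0`, assuming the Keras defaults (no padding, pooling stride equal to the pool size). Helper names are hypothetical.

```python
def valid_output_length(input_length, window, stride):
    # Number of positions where a window fits entirely inside the input.
    return (input_length - window) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    return 4 * (units * (input_dim + units) + units)


maxlen, embedding_dim = 28, 100
embedding_p = 3800                                              # copied from the summary

conv_len = valid_output_length(maxlen, window=3, stride=1)      # after Conv1D(16, 3)
pool_len = valid_output_length(conv_len, window=2, stride=2)    # after MaxPooling1D(2)

conv_p = conv1d_params(embedding_dim, filters=16, kernel_size=3)
lstm_p = lstm_params(16, 32)                                    # LSTM reads 16 conv features
dense_p = 32 * 1 + 1
total_p = embedding_p + conv_p + lstm_p + dense_p

print(conv_len, pool_len)
print(conv_p, lstm_p, dense_p, total_p)
```

The LSTM processes 13 steps of 16 features instead of 28 steps of 100 features, so it needs 6,272 parameters rather than 17,024.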


In [104]:
clear_session()

convlstm1 = Sequential()

convlstm1.add(Embedding(vocab_size,
                        output_dim=100,
                        input_length=maxlen))
convlstm1.add(Conv1D(16, 3, activation="relu"))
convlstm1.add(MaxPooling1D(2))
convlstm1.add(LSTM(32))
convlstm1.add(Dense(16, "sigmoid"))
convlstm1.add(Dense(1, "sigmoid"))

convlstm1.summary()

convlstm1.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=[Precision(), Recall()])

convlstm1.fit(Xtok_train, ytok_train,
              epochs=50,
              batch_size=32,
              validation_split=0.1,
              callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 26, 16)            4816      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 13, 16)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 32)                6272      
_________________________________________________________________
dense (Dense)                (None, 16)                528       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 15,433
Trainable params: 15,433
Non-trainable params: 0
_________________________________________________________________
...

<tensorflow.python.keras.callbacks.History at 0x7fdb25d5d190>

In [105]:
convlstm1_pred = (convlstm1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, convlstm1_pred, 'convlstm1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convlstm1,0.935,0.766617,0.751592,0.775296,0.791946,0.715152,2008,94,62,236


In [106]:
clear_session()

convlstm2 = Sequential()

convlstm2.add(Embedding(vocab_size,
                        output_dim=100,
                        input_length=maxlen))
convlstm2.add(Conv1D(16, 3, activation="relu"))
convlstm2.add(MaxPooling1D(2))
convlstm2.add(LSTM(32, return_sequences=True))
convlstm2.add(LSTM(16))
convlstm2.add(Dense(16, "sigmoid"))
convlstm2.add(Dense(1, "sigmoid"))

convlstm2.summary()

convlstm2.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=[Precision(), Recall()])

convlstm2.fit(Xtok_train, ytok_train,
              epochs=50,
              batch_size=32,
              validation_split=0.1,
              callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 26, 16)            4816      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 13, 16)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 13, 32)            6272      
_________________________________________________________________
lstm_1 (LSTM)                (None, 16)                3136      
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
...

<tensorflow.python.keras.callbacks.History at 0x7fdaf29871f0>

In [107]:
convlstm2_pred = (convlstm2.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, convlstm2_pred, 'convlstm2')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convlstm2,0.924583,0.791802,0.743989,0.821362,0.88255,0.643032,1956,146,35,263


In [108]:
convlstm2.save('models/convlstm2')



INFO:tensorflow:Assets written to: models/convlstm2/assets




## Conv + BiLSTM

In [110]:
clear_session()

convbilstm0 = Sequential()

convbilstm0.add(Embedding(vocab_size,
                          output_dim=100,
                          input_length=maxlen))
convbilstm0.add(Conv1D(16, 3, activation="relu"))
convbilstm0.add(MaxPooling1D(2))
convbilstm0.add(Bidirectional(LSTM(32)))
convbilstm0.add(Dense(32, "relu"))
convbilstm0.add(Dense(16, "sigmoid"))
convbilstm0.add(Dense(1, "sigmoid"))

convbilstm0.summary()

convbilstm0.compile(loss="binary_crossentropy",
                    optimizer="adam",
                    metrics=[Precision(), Recall()])

convbilstm0.fit(Xtok_train, ytok_train,
                epochs=50,
                batch_size=32,
                validation_split=0.1,
                callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 26, 16)            4816      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 13, 16)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        

<tensorflow.python.keras.callbacks.History at 0x7fda47725ee0>

In [111]:
convbilstm0_pred = (convbilstm0.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, convbilstm0_pred, 'convbilstm0')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convbilstm0,0.938333,0.794551,0.770898,0.808442,0.83557,0.715517,2003,99,49,249


In [142]:
convbilstm0.save('models/convbilstm0')



INFO:tensorflow:Assets written to: models/convbilstm0/assets




In [113]:
clear_session()

# Same architecture as convbilstm0, but without the Dense(32) hidden layer
convbilstm1 = Sequential()

convbilstm1.add(Embedding(vocab_size,
                          output_dim=100,
                          input_length=maxlen))
convbilstm1.add(Conv1D(16, 3, activation="relu"))
convbilstm1.add(MaxPooling1D(2))
convbilstm1.add(Bidirectional(LSTM(32)))
convbilstm1.add(Dense(16, activation="sigmoid"))
convbilstm1.add(Dense(1, activation="sigmoid"))  # probability of the positive class

convbilstm1.summary()

convbilstm1.compile(loss="binary_crossentropy",
                    optimizer="adam",
                    metrics=[Precision(), Recall()])

# 10% of the training data is held out for validation;
# the early_stopping3 callback (defined earlier) may end training before epoch 50
convbilstm1.fit(Xtok_train, ytok_train,
                epochs=50,
                batch_size=32,
                validation_split=0.1,
                callbacks=[early_stopping3])

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 28, 100)           3800      
_________________________________________________________________
conv1d (Conv1D)              (None, 26, 16)            4816      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 13, 16)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 16)                1040      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 22,217
Trainable params: 22,217
Non-trainable params: 0
____________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7fdb5d190f70>

In [114]:
convbilstm1_pred = (convbilstm1.predict(Xtok_test) > 0.5).astype("int32")
save_model_results(y_test, convbilstm1_pred, 'convbilstm1')

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convbilstm1,0.920417,0.74241,0.712782,0.760103,0.795302,0.645777,1972,130,61,237


# RESULTS

### Combining results

In [36]:
csv_list = ['results/baseline_210228_2044.csv',
                 'results/bilstm0_210301_0103.csv',
                 'results/bilstm1_210301_0114.csv',
                 'results/bilstm1_210301_0125.csv',
                 'results/bilstm2_210301_0129.csv',
                 'results/bilstm3_210301_0131.csv',
                 'results/bilstm4_210301_0133.csv',
                 'results/cld3_df.csv',
                 'results/convbilstm0_210301_0140.csv',
                 'results/convbilstm1_210301_0142.csv',
                 'results/convlstm0_210301_0135.csv',
                 'results/convlstm1_210301_0136.csv',
                 'results/convlstm2_210301_0137.csv',
                 'results/dtc_210301_2159.csv',
                 'results/emb0_210301_0027.csv',
                 'results/emb1_210301_0029.csv',
                 'results/emb2_210301_0030.csv',
                 'results/emb3_210301_0032.csv',
                 'results/fasttext_df.csv',
                 'results/knn_210301_2205.csv',
                 'results/logreg_210301_0323.csv',
                 'results/lstm0_210301_0039.csv',
                 'results/lstm1_210301_0047.csv',
                 'results/lstm2_210301_0053.csv',
                 'results/lstm3_210301_0056.csv',
                 'results/lstm4_210301_0058.csv',
                 'results/mlp0_210301_0001.csv',
                 'results/mlp0maxabs_210301_0004.csv',
                 'results/mlp0scaled_210301_0002.csv',
                 'results/mlp1_210301_0005.csv',
                 'results/mlp1maxabs_210301_0009.csv',
                 'results/mlp1scaled_210301_0007.csv',
                 'results/mlp2_210301_0009.csv',
                 'results/mlp2maxabs_210301_0012.csv',
                 'results/mlp2scaled_210301_0011.csv',
                 'results/mlp3_210301_0149.csv',
                 'results/mlp3scaled_210301_0148.csv',
                 'results/mlp4_210301_0151.csv',
                 'results/mlp4scaled_210301_0151.csv',
                 'results/mlp5_210301_0154.csv',
                 'results/mlp5scaled_210301_0154.csv',
                 'results/ridge_210301_0324.csv',
                 'results/rfc_210301_2200.csv',
                 'results/smpl0_210301_0034.csv',
                 'results/smpl1_210301_0036.csv',
                 'results/svm_210301_2202.csv',
                 'results/textcat_df.csv']

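# Load every per-model results file and stack them into a single frame
df_list = [pd.read_csv(file, index_col=False) for file in csv_list]
combined_results = pd.concat(df_list, ignore_index=True)

# Drop the leftover index column ("Unnamed: 0") stored in the CSV files
combined_results = combined_results.drop(columns="Unnamed: 0")

# Rank the models by F1.5-score, best first
combined_results = combined_results.sort_values(by="F1.5-score", ascending=False, ignore_index=True)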

In [37]:
combined_results

Unnamed: 0,model,accuracy,F1.5-score,F1-score,F2-score,recall,precision,tn,fp,fn,tp
0,convbilstm0,0.938333,0.794551,0.770898,0.808442,0.83557,0.715517,2003,99,49,249
1,convlstm2,0.924583,0.791802,0.743989,0.821362,0.88255,0.643032,1956,146,35,263
2,convlstm0,0.938333,0.77961,0.764331,0.788436,0.805369,0.727273,2012,90,58,240
3,lstm3,0.937917,0.778832,0.763116,0.787919,0.805369,0.725076,2011,91,58,240
4,ridge,0.93875,0.776996,0.764045,0.784443,0.798658,0.732308,2015,87,60,238
5,emb2,0.934167,0.77358,0.753125,0.785528,0.808725,0.704678,2001,101,57,241
6,bilstm4,0.942083,0.769429,0.767947,0.770261,0.771812,0.76412,2031,71,68,230
7,convlstm1,0.935,0.766617,0.751592,0.775296,0.791946,0.715152,2008,94,62,236
8,logreg,0.93625,0.763742,0.752827,0.769993,0.781879,0.725857,2014,88,65,233
9,mlp2scaled,0.93125,0.758065,0.740157,0.768476,0.788591,0.697329,2000,102,63,235


In [169]:
combined_results.to_csv('combined_results.csv', index=False)

## Conclusions

Although none of the models reached an F1.5-score of 0.8 or higher, the results are promising. The difference between the scores of the existing language detection models (textcat, fasttext, cld3) and those of the newly trained models is large.

The best results were achieved by models combining convolutional layers with LSTM or bidirectional LSTM layers: `convbilstm0` (F1.5-score 0.795) and `convlstm2` (F1.5-score 0.792). In a business context, further analysis of the two models would be needed to choose the better one for the company's purposes. The choice of metric could also be reconsidered: `convlstm2` has the higher recall (0.883 vs 0.836) but the lower precision (0.643 vs 0.716), so putting yet more emphasis on recall, for example by ranking on F2-score, would lead to choosing `convlstm2`.
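
The effect of the metric choice can be checked directly from the confusion-matrix counts in the results table. The sketch below is a minimal illustration, not the notebook's `save_model_results` implementation; the helper name `f_beta` is hypothetical. It assumes the standard F-beta definition, in which recall is weighted β times as much as precision.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score from confusion-matrix counts (hypothetical helper)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# (tp, fp, fn) counts copied from the combined results table.
counts = {
    "convbilstm0": (249, 99, 49),
    "convlstm2": (263, 146, 35),
}

for beta in (1.0, 1.5, 2.0):
    scores = {
        name: f_beta(tp, fp, fn, beta)
        for name, (tp, fp, fn) in counts.items()
    }
    best = max(scores, key=scores.get)
    print(f"beta={beta}: best={best}, scores={scores}")
```

With these counts, `convbilstm0` ranks first for β = 1 and β = 1.5, while `convlstm2` overtakes it at β = 2, in line with the F1-score, F1.5-score and F2-score columns of the table.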

It is also worth noting that `ridge`, a classifier based on ridge regression, took fifth place, ahead of many (admittedly simple) CNN and RNN models.

The two best models (`convbilstm0` and `convlstm2`) have been saved to the `models` subdirectory.

## Further exploration

Possible ways to continue work on the project:

* acquiring more / more diverse training data
* additional ways of data preparation, for example:
    * would `TfidfVectorizer()` give better results than `CountVectorizer()`?
    * should n-grams other than 3-grams be used, perhaps of several different lengths?
    * other methods of dimensionality reduction, apart from `TruncatedSVD()`
    * would the results be better after using a lemmatizer on the dataset?
* additional algorithms to test, e.g. a voting classifier
* further study of the TensorFlow and Keras libraries, including:
    * grid search of hyperparameters
    * use of initial bias
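
As an illustration of the last item: with imbalanced classes, the bias of the output layer can be initialised to the log-odds of the positive class, so that the untrained network starts out predicting roughly the base rate rather than 0.5. The sketch below is a minimal example under stated assumptions. The class counts are the test-set totals from the confusion matrices above (298 positive, 2102 negative), used purely for illustration; in practice they would be computed from the training labels. The helper names are hypothetical.

```python
import math


def initial_output_bias(n_positive, n_negative):
    """Log-odds of the positive class; sigmoid(bias) equals the positive rate."""
    return math.log(n_positive / n_negative)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative counts only (test-set totals: tp + fn and tn + fp).
n_pos, n_neg = 298, 2102
b0 = initial_output_bias(n_pos, n_neg)

print(f"initial bias: {b0:.4f}")
print(f"implied positive rate: {sigmoid(b0):.4f}")

# In Keras, this value could then be passed to the output layer, e.g.:
# Dense(1, activation="sigmoid",
#       bias_initializer=tf.keras.initializers.Constant(b0))
```

Whether this improves the final scores or only speeds up the first epochs would have to be checked experimentally.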