In [1]:
from pathlib import Path

import pandas as pd
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from spacy.lang.nb.stop_words import STOP_WORDS
import xgboost as xgb


In [2]:
SAVE_PATH = Path('model')
SAVE_PATH.mkdir(exist_ok=True)
DATA_PATH = Path('../data/norec')

In [3]:
subset_names = ['train', 'test', 'dev']
subsets = {name: pd.read_pickle(DATA_PATH / f'norsk_kategori_{name}.pkl') for name in subset_names}

In [4]:
text = subsets['train'].iloc[0]['text']

In [5]:
text

"Franz Ferdinand :\n« You Could Have It So Much Better »\n( Domino Recording )\nHøsten blir mye bedre med Franz Ferdinand .\nDet var vanskelig å forestille seg at Franz Ferdinand , etter å ha stått bak et av fjorårets mest energiske og hit-spekkede debutalbum , kunne klare å overgå seg selv på oppfølgeren .\nMen de fire postpønkglade kunststudentene fra Glasgow fornekter seg ikke .\nPå « You Could Have It So Much Better » har de ikke bare forbedret låtskriverferdighetene sine , men de har også et mye mer variert uttrykk enn tidligere .\nFortsatt er det de allsang- og dansedikterende låtbombene som råder , anført av den uimotståelige singlen « Do You Want To » .\nMen mens de tidligere først og fremst var ute etter å fenge , virker det som om Alex Kapranos & Co. denne gangen har lagt mer jobb i selve oppbyggingen av låtene , noe åpningen « The Fallen » og den mangedelte « I'm Your Villain » ( med et nesten like hektende riff som « Take Me Out » -signaturen ) er gode eksempler på .\nSkott

We need to check if the training set is balanced. Grouping by rating and counting the number of samples with each value should do the trick.

In [6]:
subsets['train'].groupby(['rating']).count()

        text
rating
0       2681
1      14821


So the training set is imbalanced: class 1 outnumbers class 0 by roughly 5.5 to 1. We need to be aware of this and potentially correct for it.
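To put numbers on the imbalance, here is a small sketch that works directly from the counts. The train counts come from the table above and the dev counts are the support values in the classification reports in this notebook; the helper name `class_shares` is just for illustration. It also computes the accuracy a trivial "always predict the majority class" model would get on the dev set, which is a useful floor to compare accuracy figures against.

```python
# Class counts taken from this notebook (train table and dev support values)
train_counts = {0: 2681, 1: 14821}
dev_counts = {0: 276, 1: 1963}


def class_shares(counts):
    """Return the fraction of samples that belongs to each class."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_shares = class_shares(train_counts)
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

# Accuracy on dev if we always predicted the majority class of the training set
majority_label = max(train_counts, key=train_counts.get)
majority_dev_accuracy = dev_counts[majority_label] / sum(dev_counts.values())

print(f'train shares: {train_shares}')
print(f'imbalance ratio: {imbalance_ratio:.2f}')
print(f'majority-class dev accuracy: {majority_dev_accuracy:.3f}')
```

Since the majority-class floor is already close to 0.88 on dev, accuracy alone says little here; the per-class F1 scores are the more informative numbers.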

We'll create a vectorizer that keeps words appearing in at least 5 and at most 1000 documents, capped at the 10000 most frequent of those terms.

In [7]:
vectorizer = CountVectorizer(stop_words=STOP_WORDS, min_df=5, max_df=1000, max_features=10000)
vectorizer.fit_transform(subsets['train']['text'])

<17502x10000 sparse matrix of type '<class 'numpy.int64'>'
	with 1560406 stored elements in Compressed Sparse Row format>

In [8]:
vectorizer.get_feature_names()[:10]

['00', '000', '08', '10', '100', '1000', '1080', '1080p', '11', '110']

In [9]:
len(vectorizer.get_feature_names())

10000
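To make the three limits concrete, here is a simplified pure-Python sketch of the selection rule. When `min_df` and `max_df` are integers they are absolute document counts, and `max_features` then keeps the surviving terms with the highest total count across the corpus. The `select_vocabulary` helper, the whitespace tokenizer, the alphabetical tie-break, and the toy documents are all assumptions for illustration; this is not how `CountVectorizer` is implemented internally.

```python
from collections import Counter


def select_vocabulary(docs, min_df, max_df, max_features):
    """Keep terms whose document frequency is in [min_df, max_df],
    then keep the max_features terms with the highest total count."""
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total occurrences across all documents
    for doc in docs:
        tokens = doc.lower().split()  # simplified tokenizer (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term, df in doc_freq.items() if min_df <= df <= max_df]
    # Highest total count first; ties broken alphabetically (assumption)
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])


# Hypothetical toy corpus
docs = ['god film god', 'god plate', 'svak film', 'god konsert', 'svak plate']
vocab = select_vocabulary(docs, min_df=2, max_df=2, max_features=2)
print(vocab)
```

Here `god` is dropped for appearing in too many documents, `konsert` for appearing in too few, and the cap of two terms removes one of the remaining three.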

In [10]:
texts = {name: vectorizer.transform(subsets[name]['text']) for name in subset_names}
categories = {name: subsets[name]['rating'] for name in subset_names}

In [11]:
lr_model = LogisticRegression()
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Training metrics
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2681
           1       1.00      1.00      1.00     14821

    accuracy                           1.00     17502
   macro avg       1.00      1.00      1.00     17502
weighted avg       1.00      1.00      1.00     17502

Development metrics
              precision    recall  f1-score   support

           0       0.77      0.68      0.72       276
           1       0.96      0.97      0.96      1963

    accuracy                           0.94      2239
   macro avg       0.86      0.82      0.84      2239
weighted avg       0.93      0.94      0.93      2239



That's our baseline. With imbalanced training data, we end up with development-set F1 scores of 0.72 and 0.96 for class 0 (low) and class 1 (high) respectively. The perfect training scores show that the model is overfitting.
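As a reminder of where these numbers come from, the sketch below recomputes the dev-set F1 scores and the two averages from the precision, recall and support values in the report above. Because the inputs are already rounded to two decimals, the results only agree with the report to about two decimals. The function name `f1` and the dictionary layout are just for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded dev-set values from the baseline report
dev = {
    0: {'precision': 0.77, 'recall': 0.68, 'support': 276},
    1: {'precision': 0.96, 'recall': 0.97, 'support': 1963},
}

per_class = {label: f1(row['precision'], row['recall']) for label, row in dev.items()}

# Macro average: every class counts equally
macro = sum(per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
total = sum(row['support'] for row in dev.values())
weighted = sum(per_class[label] * dev[label]['support'] for label in dev) / total

print(per_class, macro, weighted)
```

The weighted average is dominated by class 1, so the macro average and the class-0 row are the ones to watch when judging how well the minority class is handled.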

In [12]:
lr_model = LogisticRegression(class_weight='balanced')
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Training metrics
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2681
           1       1.00      1.00      1.00     14821

    accuracy                           1.00     17502
   macro avg       1.00      1.00      1.00     17502
weighted avg       1.00      1.00      1.00     17502

Development metrics
              precision    recall  f1-score   support

           0       0.67      0.76      0.71       276
           1       0.97      0.95      0.96      1963

    accuracy                           0.92      2239
   macro avg       0.82      0.85      0.83      2239
weighted avg       0.93      0.92      0.93      2239



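That's surprising. The expected effect was slightly worse performance for the 1-class and better for the 0-class. The weighting did shift the 0-class in the expected direction, with recall rising from 0.68 to 0.76, but precision fell from 0.77 to 0.67, so its F1 score did not improve (0.71 versus 0.72).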
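For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to our training counts in plain Python; the helper name `balanced_class_weights` is made up for illustration.

```python
def balanced_class_weights(class_counts):
    """Per-class weight n_samples / (n_classes * count), the formula
    scikit-learn documents for class_weight='balanced'."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Training counts from the table earlier in this notebook
train_counts = {0: 2681, 1: 14821}
weights = balanced_class_weights(train_counts)
print(weights)

# After weighting, both classes contribute the same total weight to the loss
weighted_totals = {label: weights[label] * count for label, count in train_counts.items()}
print(weighted_totals)
```

So each 0-class sample counts about 5.5 times as much as a 1-class sample, which pushes the decision boundary toward predicting 0 more often. That fits the higher recall and lower precision we see for the 0-class.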

In [13]:
# Weight each class by (smallest class count / its own count), keyed by the actual rating label
class_counts = subsets['train'].groupby(['rating']).count()['text']
class_weights = {int(label): float(class_counts.min() / count) for label, count in class_counts.items()}
class_weights

{0: 1.0, 1: 0.18089197759935227}
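It is worth checking how these manual weights relate to the `'balanced'` ones. The sketch below, using only the training counts, suggests they keep the same ratio between the classes and differ only by a constant factor. Since class weights multiply the per-sample loss, a uniformly smaller scale should behave roughly like a lower `C` (stronger regularization) in `LogisticRegression`; treat that as a rule of thumb rather than an exact equivalence for every solver.

```python
# Training counts from the table earlier in this notebook
train_counts = {0: 2681, 1: 14821}
n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# Manual scheme from the cell above: smallest class count / class count
manual = {label: min(train_counts.values()) / count for label, count in train_counts.items()}

# Documented 'balanced' heuristic: n_samples / (n_classes * class count)
balanced = {label: n_samples / (n_classes * count) for label, count in train_counts.items()}

# If the two schemes are proportional, this factor is the same for every class
scale = {label: manual[label] / balanced[label] for label in train_counts}
print(scale)
```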

In [14]:
lr_model = LogisticRegression(class_weight=class_weights)
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Training metrics
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      2681
           1       1.00      0.99      1.00     14821

    accuracy                           1.00     17502
   macro avg       0.98      1.00      0.99     17502
weighted avg       1.00      1.00      1.00     17502

Development metrics
              precision    recall  f1-score   support

           0       0.67      0.78      0.72       276
           1       0.97      0.94      0.96      1963

    accuracy                           0.92      2239
   macro avg       0.82      0.86      0.84      2239
weighted avg       0.93      0.92      0.93      2239



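The conclusion is that the performance is not affected much by the class imbalance: the 0-class F1 score on the dev set stays at 0.71 to 0.72 across all three weighting schemes, and weighting mainly trades precision for recall.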

In [15]:
vectorizer = CountVectorizer(stop_words=STOP_WORDS, min_df=5, max_df=1000, max_features=1000)
vectorizer.fit_transform(subsets['train']['text'])
len(vectorizer.get_feature_names())
texts = {name: vectorizer.transform(subsets[name]['text']) for name in subset_names}
categories = {name: subsets[name]['rating'] for name in subset_names}
lr_model = LogisticRegression(class_weight='balanced')
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Training metrics
              precision    recall  f1-score   support

           0       0.50      0.89      0.64      2681
           1       0.98      0.84      0.90     14821

    accuracy                           0.85     17502
   macro avg       0.74      0.87      0.77     17502
weighted avg       0.90      0.85      0.86     17502

Development metrics
              precision    recall  f1-score   support

           0       0.43      0.76      0.55       276
           1       0.96      0.86      0.91      1963

    accuracy                           0.85      2239
   macro avg       0.70      0.81      0.73      2239
weighted avg       0.90      0.85      0.86      2239



In [16]:
vectorizer = CountVectorizer(stop_words=STOP_WORDS, min_df=5, max_df=1000, max_features=50000)
vectorizer.fit_transform(subsets['train']['text'])
print(f'Using {len(vectorizer.get_feature_names())} features')
texts = {name: vectorizer.transform(subsets[name]['text']) for name in subset_names}
categories = {name: subsets[name]['rating'] for name in subset_names}
lr_model = LogisticRegression(class_weight='balanced')
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Using 50000 features
Training metrics
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2681
           1       1.00      1.00      1.00     14821

    accuracy                           1.00     17502
   macro avg       1.00      1.00      1.00     17502
weighted avg       1.00      1.00      1.00     17502

Development metrics
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       276
           1       0.97      0.98      0.97      1963

    accuracy                           0.95      2239
   macro avg       0.90      0.88      0.89      2239
weighted avg       0.95      0.95      0.95      2239



Looks like all versions are overfitting except the one with 1000 features, but the 50000-feature version still does better on the dev set.
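The cells in this part repeat the same steps with a different `max_features` each time, so a small helper makes the comparison easier to extend. This is only a sketch: the function name `evaluate_vocab_size` and the toy corpus are hypothetical, and `min_df`, `max_df` and the stop word list are left out because the toy corpus is far too small for them. On the real data you would pass the NoReC texts and ratings instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def evaluate_vocab_size(train_texts, train_labels, dev_texts, dev_labels, max_features):
    """Fit a bag-of-words logistic regression and report class-0 F1 on train and dev."""
    vectorizer = CountVectorizer(max_features=max_features)
    x_train = vectorizer.fit_transform(train_texts)
    x_dev = vectorizer.transform(dev_texts)
    model = LogisticRegression(class_weight='balanced', max_iter=1000)
    model.fit(x_train, train_labels)
    return {
        'n_features': len(vectorizer.vocabulary_),
        'train_f1_class0': f1_score(train_labels, model.predict(x_train), pos_label=0),
        'dev_f1_class0': f1_score(dev_labels, model.predict(x_dev), pos_label=0),
    }


# Hypothetical toy data, only to show the call pattern
train_texts = ['bad boring film', 'awful boring plot', 'great fun film',
               'great lovely plot', 'lovely fun album', 'great album']
train_labels = [0, 0, 1, 1, 1, 1]
dev_texts = ['boring album', 'fun film']
dev_labels = [0, 1]

results = [evaluate_vocab_size(train_texts, train_labels, dev_texts, dev_labels, n)
           for n in (2, 5, 50)]
for row in results:
    print(row)
```

Returning a dictionary per run keeps the train and dev scores next to each other, which makes the overfitting gap easy to read off.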

In [23]:
vectorizer = CountVectorizer(stop_words=STOP_WORDS, min_df=5, max_df=1000, max_features=5000)
vectorizer.fit_transform(subsets['train']['text'])
len(vectorizer.get_feature_names())
texts = {name: vectorizer.transform(subsets[name]['text']) for name in subset_names}
categories = {name: subsets[name]['rating'] for name in subset_names}
lr_model = LogisticRegression(class_weight='balanced')
lr_model.fit(texts['train'], categories['train'])
print('Training metrics')
print(classification_report(categories['train'], lr_model.predict(texts['train'])))
print('Development metrics')
print(classification_report(categories['dev'], lr_model.predict(texts['dev'])))

Training metrics
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      2681
           1       1.00      0.99      0.99     14821

    accuracy                           0.99     17502
   macro avg       0.97      0.99      0.98     17502
weighted avg       0.99      0.99      0.99     17502

Development metrics
              precision    recall  f1-score   support

           0       0.59      0.72      0.65       276
           1       0.96      0.93      0.94      1963

    accuracy                           0.90      2239
   macro avg       0.77      0.82      0.79      2239
weighted avg       0.91      0.90      0.91      2239
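Putting the four `class_weight='balanced'` runs side by side helps when reading the results. The sketch below only uses the 0-class F1 scores already printed in the reports above (train, dev) and computes the gap between them as a rough overfitting indicator; the variable names are for illustration.

```python
# max_features -> (train F1, dev F1) for class 0, copied from the reports above
runs = {
    1000: (0.64, 0.55),
    5000: (0.97, 0.65),
    10000: (1.00, 0.71),
    50000: (1.00, 0.80),
}

# Train-minus-dev gap per run, rounded to the precision of the reports
gaps = {n: round(train_f1 - dev_f1, 2) for n, (train_f1, dev_f1) in runs.items()}
best_on_dev = max(runs, key=lambda n: runs[n][1])

for n in sorted(runs):
    train_f1, dev_f1 = runs[n]
    print(f'{n:>6} features: train {train_f1:.2f}  dev {dev_f1:.2f}  gap {gaps[n]:.2f}')
print(f'best dev F1 for class 0: {best_on_dev} features')
```

The dev score for the 0-class rises steadily with the vocabulary size, even though every run above 1000 features fits the training set almost perfectly.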



Some interesting results. If you understand what's happening, it's time to move over to the XGBoost variants.