In [1]:
import fasttext

1. (2 points) Turn the dataset into a dataset compatible with Fastext (see the _Tips on using FastText_ section a bit lower).
   * For pretreatment, only apply lower casing and punctuation removal.

First we load the data and apply preatreatment

In [2]:
from datasets import load_dataset, Dataset
imdb_dataset = load_dataset("imdb")
train_dataset = imdb_dataset["train"].train_test_split(
    stratify_by_column="label", test_size=0.2, seed=42
)
test_df = imdb_dataset["test"]
train_df = train_dataset["train"]
valid_df = train_dataset["test"]

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset imdb (/home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 33.76it/s]
Loading cached split indices for dataset at /home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-5f37fd0866e4f89f.arrow and /home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-dd5732a0e6ac784c.arrow


Then we create a file and insert the data in a format understandable by fasttext

In [3]:
from string import punctuation
from typing import Type
import re
def clean_text(text_data: str) -> str:
    """
    Function lowering and removing ponctuation of a string
    """
    text_data = text_data.lower()
    text_data = re.sub("[" + punctuation + "]( |$)", "", text_data)
    return text_data

def create_fasttext_file(dataset: Type[Dataset], name: str):
    with open(name, 'w') as f:
        for example in dataset.shuffle(seed=42):
            label = '__label__' + ('positive' if example['label'] == 1 else 'negative')
            text = clean_text(example['text']).replace('\n', ' ')
            f.write(label + ' ' + text + '\n')
create_fasttext_file(train_df, 'imdb_train.txt')

Loading cached shuffled indices for dataset at /home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-1646f8e39e83ca42.arrow


2. (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.

In [4]:
model = fasttext.train_supervised('imdb_train.txt')

Read 4M words
Number of words:  392111
Number of labels: 2
Progress: 100.0% words/sec/thread: 2989682 lr:  0.000000 avg.loss:  0.468147 ETA:   0h 0m 0s


In [5]:
create_fasttext_file(test_df, 'imdb_test.txt')

Loading cached shuffled indices for dataset at /home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-c1eaa46e94dfbfd3.arrow


In [6]:
result = model.test('imdb_test.txt')
result

(25000, 0.84156, 0.84156)

The model was evaluated on the 25000 element of the test dataset and have an accuracy of **0.842** (as there is no seeding with fastext, it can be changed by a new execution)

3. (2 points) Use the [hyperparameters search functionality](https://fasttext.cc/docs/en/autotune.html) of FastText and repeat step 2.
   * To do so, you'll need to [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) your training set into a training and a validation set.
   * Let the model search for 5 minutes (it's the default search time).
   * Don't forget to shuffle (and stratify) your splits. The dataset has its entry ordered by label (0s first, then 1s). Feeding the classifier one class and then the second can mess with its performances.

First we create the file for the val dataset, used for the hyperparameter search

In [7]:
create_fasttext_file(valid_df, 'imdb_val.txt')

Loading cached shuffled indices for dataset at /home/aeschylli/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-2b5fa46213387c00.arrow


Then we use the **train_supervised** function to do the hyperparameter search

In [8]:
best_model = fasttext.train_supervised(
    input='imdb_train.txt',
    autotuneValidationFile='imdb_val.txt',
    autotuneMetric='f1',)

Progress:  11.2% Trials:    5 Best score:  0.870400 ETA:   0h 4m26s

: 

: 

In [None]:
# Test the best model on the test data
test_data = 'imdb_test.txt'
result = best_model.test(test_data)
print("Accuracy:", result[1])

Accuracy: 0.87424


The accuracy for the best model is now : **0.87**. We have a 5% improvement wich is big in my opinion.

In [None]:
print('Basic model :')
print(f'lr: {model.lr}')
print(f'Ngrams: {model.wordNgrams}')
print(f'epoch: {model.epoch}')
print(f'dim: {model.dim}')
print('\nBest model :')
print(f'lr: {best_model.lr}')
print(f'Ngrams: {best_model.wordNgrams}')
print(f'epoch: {best_model.epoch}')
print(f'dim: {best_model.dim}')


In [None]:
import pandas as pd

data = pd.DataFrame(test_df)


(('__label__negative',), array([0.99893188]))

In [None]:
data['tuple']= data.text.apply(lambda x: best_model.predict(x))
data[['pred', 'proba']] = data.tuple.apply(pd.Series)
data.pred = data.pred.apply(lambda x :0 if x[0] == '__label__negative' else 1)
data.drop('tuple', axis=1, inplace=True)
data

Unnamed: 0,text,label,pred,proba
0,I love sci-fi and am willing to put up with a ...,0,0,[0.998931884765625]
1,"Worth the entertainment value of a rental, esp...",0,1,[0.5118221640586853]
2,its a totally average film with a few semi-alr...,0,0,[0.9995484352111816]
3,STAR RATING: ***** Saturday Night **** Friday ...,0,0,[0.9991713762283325]
4,"First off let me say, If you haven't enjoyed a...",0,1,[1.0000098943710327]
...,...,...,...,...
24995,Just got around to seeing Monster Man yesterda...,1,1,[0.7738313674926758]
24996,I got this as part of a competition prize. I w...,1,1,[0.68993079662323]
24997,I got Monster Man in a box set of three films ...,1,0,[0.6534895300865173]
24998,"Five minutes in, i started to feel how naff th...",1,0,[0.9462082982063293]


In [None]:
wrong = data.loc[data.label != data.pred]
wrong.reset_index(inplace=True)
wrong

Unnamed: 0,index,text,label,pred,proba
0,1,"Worth the entertainment value of a rental, esp...",0,1,[0.5118221640586853]
1,4,"First off let me say, If you haven't enjoyed a...",0,1,[1.0000098943710327]
2,22,The Forgotten (AKA: Don't Look In The Basement...,0,1,[0.6209208965301514]
3,28,Four things intrigued me as to this film - fir...,0,1,[0.9790410399436951]
4,41,Widow hires a psychopath as a handyman. Sloppy...,0,1,[0.7631122469902039]
...,...,...,...,...,...
3212,24981,"""Gaming? Nicotine? Fisticuffs? We're moving in...",1,0,[0.5742045044898987]
3213,24985,"This was on Showtime the other night, and I fi...",1,0,[0.9910656213760376]
3214,24986,"If you took a really good jack black movie, ad...",1,0,[0.961755633354187]
3215,24997,I got Monster Man in a box set of three films ...,1,0,[0.6534895300865173]


In [None]:
wrong.iloc[0].text

"Worth the entertainment value of a rental, especially if you like action movies. This one features the usual car chases, fights with the great Van Damme kick style, shooting battles with the 40 shell load shotgun, and even terrorist style bombs. All of this is entertaining and competently handled but there is nothing that really blows you away if you've seen your share before.<br /><br />The plot is made interesting by the inclusion of a rabbit, which is clever but hardly profound. Many of the characters are heavily stereotyped -- the angry veterans, the terrified illegal aliens, the crooked cops, the indifferent feds, the bitchy tough lady station head, the crooked politician, the fat federale who looks like he was typecast as the Mexican in a Hollywood movie from the 1940s. All passably acted but again nothing special.<br /><br />I thought the main villains were pretty well done and fairly well acted. By the end of the movie you certainly knew who the good guys were and weren't. The