In [22]:
import fasttext

1. (2 points) Turn the dataset into a dataset compatible with FastText (see the _Tips on using FastText_ section a bit lower).
   * For pretreatment, only apply lower casing and punctuation removal.

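First, we load the data and split the training set into a training and a validation set (shuffled and stratified by label). The pretreatment itself is applied later, when the FastText files are written.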

In [23]:
from datasets import load_dataset, Dataset
imdb_dataset = load_dataset("imdb")
train_dataset = imdb_dataset["train"].train_test_split(
    stratify_by_column="label", test_size=0.2, seed=42
)
test_df = imdb_dataset["test"]
train_df = train_dataset["train"]
valid_df = train_dataset["test"]




Then we write the data to a file in the format FastText expects: one example per line, starting with its `__label__` tag.

In [24]:
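import re
from string import punctuation

def clean_text(text_data: str) -> str:
    """
    Lowercase a string and remove its punctuation.
    """
    text_data = text_data.lower()
    # Replace every punctuation character with a space. re.escape is needed
    # because punctuation contains regex metacharacters (], \, ^, -), and
    # using a space keeps neighbouring words from being glued together.
    text_data = re.sub("[" + re.escape(punctuation) + "]", " ", text_data)
    return text_data

def create_fasttext_file(dataset: Dataset, name: str) -> None:
    """
    Write a shuffled dataset to `name`, one '__label__<class> <text>' line per example.
    """
    with open(name, 'w', encoding='utf-8') as f:
        for example in dataset.shuffle(seed=42):
            label = '__label__' + ('positive' if example['label'] == 1 else 'negative')
            # FastText reads one example per line, so newlines must be removed.
            text = clean_text(example['text']).replace('\n', ' ')
            f.write(label + ' ' + text + '\n')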
create_fasttext_file(train_df, 'imdb_train.txt')
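
To make the file format concrete, here is a minimal sketch of how a single example becomes one line of the training file. The helper name `to_fasttext_line` is hypothetical and only illustrates the `__label__<class> <text>` layout used above; it is not part of FastText.

```python
def to_fasttext_line(label: int, text: str) -> str:
    """Format one example as a FastText supervised-training line."""
    # The label comes first, marked by the "__label__" prefix,
    # followed by the text on the same line.
    tag = "positive" if label == 1 else "negative"
    # Collapse all whitespace (including newlines) so one example = one line.
    one_line = " ".join(text.split())
    return f"__label__{tag} {one_line}"


print(to_fasttext_line(1, "a great movie\nwith a great cast"))
# __label__positive a great movie with a great cast
```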



2. (2 points) Train a FastText classifier with default parameters on the training data, and evaluate it on the test data using accuracy.

In [25]:
model = fasttext.train_supervised('imdb_train.txt')

In [26]:
create_fasttext_file(test_df, 'imdb_test.txt')



In [27]:
result = model.test('imdb_test.txt')
result

(25000, 0.84172, 0.84172)

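The model was evaluated on the 25,000 examples of the test set and reaches an accuracy of **0.842**. The tuple returned by `model.test` holds the number of examples, the precision and the recall; with exactly one label per example, both are equal to the accuracy. The exact value can change from one execution to the next, since the training is not seeded here.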
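
As a small illustration of how to read that result, the sketch below unpacks the tuple shown in the output above. The helper `summarize_test_result` is hypothetical, and treating precision as accuracy assumes a single label per example, which is the case for this dataset.

```python
from typing import Tuple


def summarize_test_result(result: Tuple[int, float, float]) -> dict:
    """Unpack the (n_examples, precision, recall) tuple shown above."""
    n_examples, precision, recall = result
    return {
        "n_examples": n_examples,
        "precision": precision,
        "recall": recall,
        # Assumption: one true label and one predicted label per example,
        # so precision at 1 is the fraction of correct predictions.
        "accuracy": precision,
        "n_correct": round(n_examples * precision),
    }


summary = summarize_test_result((25000, 0.84172, 0.84172))
print(summary)
```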

3. (2 points) Use the [hyperparameters search functionality](https://fasttext.cc/docs/en/autotune.html) of FastText and repeat step 2.
   * To do so, you'll need to [split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) your training set into a training and a validation set.
   * Let the model search for 5 minutes (it's the default search time).
   * Don't forget to shuffle (and stratify) your splits. The dataset has its entry ordered by label (0s first, then 1s). Feeding the classifier one class and then the second can mess with its performances.
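
To illustrate what shuffling and stratifying mean here, the following is a minimal pure-Python sketch of a stratified split. The function `stratified_split` is a hypothetical helper written for this illustration; the notebook itself relies on `train_test_split(stratify_by_column="label", ...)` from `datasets`.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train indices, validation indices) with the same label ratio."""
    rng = random.Random(seed)
    # Group example indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_label.values():
        # Shuffle within each class, then take the same fraction from each.
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])

    # Shuffle again so the classes are interleaved, not fed one after the other.
    rng.shuffle(train_idx)
    rng.shuffle(valid_idx)
    return train_idx, valid_idx


# Toy labels ordered by class, like the dataset described above (0s first, then 1s).
labels = [0] * 50 + [1] * 50
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx), sum(labels[i] for i in valid_idx))
```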

First, we create the file for the validation set, which is used for the hyperparameter search.

In [28]:
create_fasttext_file(valid_df, 'imdb_val.txt')



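Then we use the **train_supervised** function to run the hyperparameter search: passing `autotuneValidationFile` enables it, and since no duration is given it runs for the default 5 minutes.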

In [29]:
best_model = fasttext.train_supervised(
    input='imdb_train.txt',
    autotuneValidationFile='imdb_val.txt',
    autotuneMetric='f1')

In [30]:
# Test the best model on the test data
test_data = 'imdb_test.txt'
result = best_model.test(test_data)
print("Accuracy:", result[1])

Accuracy: 0.86256


The accuracy of the best model is now **0.862**. This is an improvement of about 2 percentage points over the default model, which is a good gain for the same type of model.
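
The size of the gain can be worked out from the two accuracies reported in the outputs above. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
baseline_acc = 0.84172  # default model, from the first test output
tuned_acc = 0.86256     # autotuned model, from the second test output

# Difference in accuracy, in absolute terms.
absolute_gain = tuned_acc - baseline_acc
# Gain relative to the baseline accuracy.
relative_gain = absolute_gain / baseline_acc
# Share of the baseline's errors that the tuned model removes.
error_reduction = absolute_gain / (1 - baseline_acc)

print(f"absolute gain: {absolute_gain * 100:.2f} percentage points")
print(f"relative gain: {relative_gain:.2%}")
print(f"error reduction: {error_reduction:.2%}")
```

With these numbers the absolute gain is about 2.08 percentage points, which corresponds to roughly 13% fewer errors than the default model.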

In [31]:
print('Basic model :')
print(f'lr: {model.lr}')
print(f'Ngrams: {model.wordNgrams}')
print(f'epoch: {model.epoch}')
print(f'dim: {model.dim}')
print('\nBest model :')
print(f'lr: {best_model.lr}')
print(f'Ngrams: {best_model.wordNgrams}')
print(f'epoch: {best_model.epoch}')
print(f'dim: {best_model.dim}')


Basic model :
lr: 0.1
Ngrams: 1
epoch: 5
dim: 100

Best model :
lr: 0.04610774340466207
Ngrams: 1
epoch: 45
dim: 61


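We can see that the dimension of the vectors has been reduced (100 to 61), that the learning rate is smaller (0.1 to about 0.046), which helps to avoid overshooting the optimum, and that the model is trained for many more epochs (5 to 45). The word n-gram size is unchanged.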
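
The same comparison can be done programmatically. The sketch below lists only the hyperparameters that differ; `changed_hyperparams` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that carry the values printed above, using the same attribute names as the cell above (`lr`, `wordNgrams`, `epoch`, `dim`).

```python
from types import SimpleNamespace

HYPERPARAMS = ("lr", "wordNgrams", "epoch", "dim")


def changed_hyperparams(base, tuned, names=HYPERPARAMS) -> dict:
    """Return {name: (base value, tuned value)} for hyperparameters that differ."""
    return {
        name: (getattr(base, name), getattr(tuned, name))
        for name in names
        if getattr(base, name) != getattr(tuned, name)
    }


# Stand-ins for the two models, filled with the values printed above.
base = SimpleNamespace(lr=0.1, wordNgrams=1, epoch=5, dim=100)
tuned = SimpleNamespace(lr=0.04610774340466207, wordNgrams=1, epoch=45, dim=61)

for name, (before, after) in changed_hyperparams(base, tuned).items():
    print(f"{name}: {before} -> {after}")
```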

5. (1 point) Using the tuned model, take at least 2 wrongly classified examples from the test set, and try explaining why the model failed.

In [32]:
import pandas as pd

data = pd.DataFrame(test_df)


In [33]:
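# Predict on text preprocessed the same way as the training data.
data['tuple'] = data.text.apply(lambda x: best_model.predict(clean_text(x).replace('\n', ' ')))
# Split the (labels, probabilities) tuple into two columns.
data[['pred', 'proba']] = data.tuple.apply(pd.Series)
# Map the predicted label string back to the integer class (0 = negative, 1 = positive).
data.pred = data.pred.apply(lambda x: 0 if x[0] == '__label__negative' else 1)
data.drop('tuple', axis=1, inplace=True)
data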

Unnamed: 0,text,label,pred,proba
0,I love sci-fi and am willing to put up with a ...,0,0,[0.9984738826751709]
1,"Worth the entertainment value of a rental, esp...",0,0,[0.6790785193443298]
2,its a totally average film with a few semi-alr...,0,0,[0.996353805065155]
3,STAR RATING: ***** Saturday Night **** Friday ...,0,0,[0.9988548755645752]
4,"First off let me say, If you haven't enjoyed a...",0,1,[0.9999990463256836]
...,...,...,...,...
24995,Just got around to seeing Monster Man yesterda...,1,1,[0.5521440505981445]
24996,I got this as part of a competition prize. I w...,1,0,[0.7088169455528259]
24997,I got Monster Man in a box set of three films ...,1,0,[0.7206078767776489]
24998,"Five minutes in, i started to feel how naff th...",1,0,[0.9826634526252747]
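
The conversion from a raw prediction to the `pred` and `proba` columns can be sketched in isolation. This assumes, based on the code and output above, that a prediction is a pair of (labels, probabilities) with one entry each; `parse_prediction` is a hypothetical helper, and it returns the probability as a plain float rather than a one-element array.

```python
from typing import Sequence, Tuple


def parse_prediction(
    prediction: Tuple[Sequence[str], Sequence[float]]
) -> Tuple[int, float]:
    """Turn a (labels, probabilities) pair into (integer class, probability)."""
    labels, probabilities = prediction
    # Only the top label is used: 1 for positive, 0 for negative.
    pred = 1 if labels[0] == "__label__positive" else 0
    return pred, float(probabilities[0])


# Stand-in for a model prediction, using the values of the first row above.
example = (("__label__negative",), [0.9984738826751709])
print(parse_prediction(example))
# (0, 0.9984738826751709)
```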


In [34]:
# Keep the misclassified rows; reset_index returns a new DataFrame,
# which avoids modifying a slice of `data` in place.
wrong = data.loc[data.label != data.pred].reset_index()
# Sort by confidence, least confident predictions first.
wrong = wrong.sort_values(by='proba')
wrong.head(10)


Unnamed: 0,index,text,label,pred,proba
3292,23376,What is often neglected about Harold Lloyd is ...,1,0,[0.5001906156539917]
2606,18915,I don't understand why this movie has such a l...,1,0,[0.5002298355102539]
1457,11255,The core issues at play (God & Satan / Good & ...,0,1,[0.5005630850791931]
968,7446,...but I would be lying. A relative was a crew...,0,1,[0.5005725026130676]
2675,19368,Jack and Kate meet the physician Daniel Farady...,1,0,[0.5006703734397888]
108,741,The first official release of World Wrestling ...,0,1,[0.5006849765777588]
437,3340,honestly I don't know why this show lasted as ...,0,1,[0.5008364319801331]
2635,19113,A town in Japan is being taken over by a horri...,1,0,[0.5011528134346008]
115,811,This was a mish mash of a film that started ou...,0,1,[0.5012745261192322]
3290,23342,"Probably one of his lesser known films, it suf...",1,0,[0.5019996166229248]


In [35]:
wrong.iloc[0].text

'What is often neglected about Harold Lloyd is that he was an actor. Unlike Chaplin and Keaton, Lloyd didn\'t have the Vaudeville/Music Hall background and he wasn\'t a natural comedian. He came to Hollywood to act; and he discovered he had a knack for acting funny -- first in shorts, then in features. He made a name for himself as "Lonesome Luke", a Chaplin knock-off; with the "glasses character" that made him the all-American boy rather than a grotesque, Lloyd found his stride and his movies became some of the best produced during the silent era.<br /><br />He developed a reputation as a "daredevil" in some shorts, and retained this in some of his best movies ("Safety Last", "For Heaven\'s Sake", "Girl Shy"). He was more popular than either Chaplin or Keaton during the twenties and he became very rich before the advent of sound.<br /><br />The first sound movies were often disasters. To get the most out of their "sound", too much dialog was used in many movies.<br /><br />Lloyd\'s ac

I think the first wrongly classified example is a labelling error and that the model's guess is right: I would have guessed the same.

In [36]:
wrong.iloc[2269].text

'Something not-so-great. "Silence of the Lambs" remains Demme\'s only good film, and I\'m of course including all the overrated, left-wing exercises in big-screen indulgence some of which I haven\'t seen - and never will.<br /><br />This is a light comedy that takes a thriller turn somewhere half-way. The first half is a little hard to sit through; it\'s neither funny nor particularly interesting. Daniels plays an over-the-top dumb-as-dirt naive moron who is so gullible and trusting that it defies belief. His animated acting is often annoying, and barely funny. He actually lies to Griffith at the outset and tells her that he is married, with children, when in fact he is divorced! In all the history of mankind and movies any man who is married and meets a woman he likes will lie - if he lies - and say that he ISN\'T married. (Later on, the not-at-all credible or believable rationale behind this was revealed: he wanted to "protect (himself)". What a load of crap...) The fact that his cha

Once again, the author mostly describes the movie without clearly giving an opinion. This is why the model misclassified the example, with a very low certainty (almost 0.5).