# Description

This notebook trains a fastText model on the *sentiment140* training data. The model is saved as a binary file in the *models* directory and can then be used to predict the sentiment of football tweets.

WARNING: If training fails with a "Bad alloc" memory error, your computer does not have enough RAM for the chosen settings. Reduce the amount of training data (the training share is 1 - test_size, so this means raising test_size), or lower vector_size or ngrams.

In [1]:
# Run this cell only once (restart the kernel before running it again):
# it moves the working directory up one level.

import os

os.chdir('..')


In [2]:
import fastText
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from utils.fixed import *


In [3]:
TRAINING_DATA_PATH = os.path.join(MAIN_PATH, DATA_PATH, 'sentiment_m140_.csv')
# The train and test file names were swapped; each path now matches its contents.
TRAIN_FILE_PATH = os.path.join(MAIN_PATH, DATA_PATH, 'train-doc.txt')
TEST_FILE_PATH = os.path.join(MAIN_PATH, DATA_PATH, 'test-doc.txt')


### LOAD DATA TO DATAFRAME

In [4]:
SLANG_DICT = load_slang(SLANG_PATH)
train_df = pd.read_csv(TRAINING_DATA_PATH, sep=',', encoding="ISO-8859-1", lineterminator='\n', header=0)
train_df.columns = ['sentiment', 'id', 'date', 'query', 'user', 'tweetText']


### PREPARE DATA
The fastText algorithm needs training and testing data in a special format. The texts have to be saved in a txt file, one per line, separated by the newline character ('\n'). Each line also carries its label, marked with a special prefix and here appended at the end of the line, for example `__label__1` for positive sentiment and `__label__0` for negative sentiment.

Example:
```
car broke down __label__0
im at the river its awesome __label__1
```

The train and test datasets should be saved in two different txt files.
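
The format described above can be sketched in plain Python. This is a minimal illustration only: the helper names `to_fasttext_line` and `write_fasttext_file` and the temporary file are hypothetical and not part of this project.

```python
import os
import tempfile


def to_fasttext_line(text, label):
    """Return one training line: the text followed by its prefixed label."""
    # A newline inside a tweet would split it into two samples, so flatten it.
    clean = ' '.join(text.split())
    return f'{clean} __label__{label}'


def write_fasttext_file(path, texts, labels):
    """Write one labelled sample per line, ending with a trailing newline."""
    lines = [to_fasttext_line(t, l) for t, l in zip(texts, labels)]
    with open(path, mode='wt', encoding='utf-8') as f:
        f.write('\n'.join(lines) + '\n')


# Hypothetical sample data written to a throwaway location.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample-doc.txt')
write_fasttext_file(sample_path,
                    ['car broke down', 'im at the river\nits awesome'],
                    [0, 1])
with open(sample_path, encoding='utf-8') as f:
    print(f.read())
```

Flattening whitespace matters because a tweet that still contains a line break would otherwise be read as two samples, the first of them without a label.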

In [5]:
# Preprocess the tweets and append the label to each one
X, y = get_processed_tweets(train_df, SLANG_DICT, False, False, False, True)
indexes = list(train_df.index)
for i in indexes:
    X[i] = X[i] + ' ' + '__label__' + str(y[i])


### SPLIT DATA INTO TRAIN AND TEST DATASETS

In [6]:
test_size = 0.5
X_train, X_test, y_train, y_test, index_train, index_test = train_test_split(X, y, indexes, test_size=test_size,
                                                                             random_state=42)


### SAVE DATA TO TXT FILES

In [None]:
with open(TRAIN_FILE_PATH, mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(X_train))
    myfile.write('\n')
with open(TEST_FILE_PATH, mode='wt', encoding='utf-8') as myfile:
    myfile.write('\n'.join(X_test))
    myfile.write('\n')


### TRAIN THE MODEL
The model can be trained with many different parameters. Here you can change the vector size (`dim`) and the maximum length of word n-grams (`wordNgrams`).

WARNING: The next cell may fail with a memory error, depending on test_size, vector_size and ngrams. You can fix it by lowering vector_size or ngrams, or by shrinking the training set (a larger test_size leaves less data for training).
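
To get a feel for how these settings affect memory, here is a rough back-of-the-envelope sketch. It assumes the input embedding matrix dominates memory, that it holds one row per vocabulary word plus one per n-gram hash bucket (fastText's `bucket` parameter, 2,000,000 by default), and that values are stored as 32-bit floats. The function name and the vocabulary size are hypothetical, and the real footprint also includes the dictionary and other buffers.

```python
def estimate_input_matrix_gib(vocab_size, dim, bucket=2_000_000, bytes_per_float=4):
    """Rough size in GiB of a (vocab_size + bucket) x dim float32 matrix."""
    return (vocab_size + bucket) * dim * bytes_per_float / 2**30


# vocab_size is a made-up illustrative figure, not measured on sentiment140.
for dim in (50, 100, 200):
    size = estimate_input_matrix_gib(vocab_size=500_000, dim=dim)
    print(f'dim={dim}: about {size:.2f} GiB')
```

Under these assumptions the estimate grows linearly with the vector size, which is why lowering vector_size is usually the quickest way to get under the RAM limit.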

In [7]:
vector_size = 200
ngrams = 2
clf = fastText.train_supervised(TRAIN_FILE_PATH, wordNgrams=ngrams, dim=vector_size, minCount=1)


### SAVE THE MODEL

In [8]:
FASTTEXT_MODEL_PATH = os.path.join(MAIN_PATH, MODEL_PATH, 'fasttext-' + str(vector_size) + '.model')
clf.save_model(FASTTEXT_MODEL_PATH)

# To load the model later, run:
# model = fastText.load_model(FASTTEXT_MODEL_PATH)


### CALCULATING TRAIN SET ACCURACY

In [9]:
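# NOTE: this reads a 'text' column, while the columns set above are named
# 'tweetText'; 'text' is presumably added by get_processed_tweets.
train_list = list(train_df.loc[index_train, 'text'])
result = clf.predict(train_list)
# result[0] holds one list of labels per text; keep the top label as an int
labels_train = [int(labels[0].replace('__label__', '')) for labels in result[0]]
accuracy_score(y_train, labels_train)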


0.9429

### CALCULATING TEST SET ACCURACY

In [10]:
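# NOTE: this reads a 'text' column, while the columns set above are named
# 'tweetText'; 'text' is presumably added by get_processed_tweets.
test_list = list(train_df.loc[index_test, 'text'])
result = clf.predict(test_list)
# result[0] holds one list of labels per text; keep the top label as an int
labels_test = [int(labels[0].replace('__label__', '')) for labels in result[0]]
accuracy_score(y_test, labels_test)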


0.7772666666666667
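
Both accuracy cells repeat the same label parsing, so it could be moved into a small helper. The sketch below is one possible shape, assuming `clf.predict` on a list of texts returns a pair of (labels, probabilities) with one list of labels per text. The names `labels_to_ints` and `accuracy` are hypothetical, and the prediction result is a hand-written stand-in rather than real model output.

```python
def labels_to_ints(predicted_labels, prefix='__label__'):
    """Take the top label of each prediction and strip the prefix."""
    return [int(labels[0].replace(prefix, '')) for labels in predicted_labels]


def accuracy(y_true, y_pred):
    """Share of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hand-written stand-in for the output of clf.predict(list_of_texts).
fake_result = ([['__label__1'], ['__label__0'], ['__label__1']],
               [[0.9], [0.8], [0.6]])
predicted = labels_to_ints(fake_result[0])
print(predicted, accuracy([1, 0, 0], predicted))
```

With such a helper, the train and test cells would differ only in which index list and which true labels they pass in.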