<img src=https://pre00.deviantart.net/4547/th/pre/f/2017/180/1/0/tao_pai_pai_by_wandertof-d9xtmm9.jpg height="500" width="500">

# Baseline model

This is the entry point for the competition. It:

* Reads data from tweets' CSV files
* Computes Bag of Words (BoW) from textual representations (tweets text)
* Tests two models to find out which performs better
* Predicts classes for the submission/benchmark tweets
* Generates a suitable CSV for Kaggle InClass

## Data representation

The function `obtain_data_representation` fits the BoW transformation on the training set only and then applies it to both the train and test sets.

If no test set is provided, the input DataFrame is split into both train and test, 75% and 25% of the data respectively. This is done so as to be able to obtain an accuracy score, which will be the evaluation metric on Kaggle.
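
As a rough illustration of the 75%/25% split, here is a hand-rolled stand-in for `train_test_split` written with the standard library only. The helper `split_rows` and its `seed` parameter are hypothetical names used for this sketch, not part of the notebook or of `sklearn`. Note that the notebook calls `train_test_split` without a `random_state`, so every run produces a different split and slightly different scores.

```python
import math
import random


def split_rows(rows, test_size=0.25, seed=0):
    # Shuffle a copy so the caller's list is left untouched
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Round the test part up; everything else is train
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))           # stand-in for 20 tweets
train, test = split_rows(rows)
print(len(train), len(test))     # 15 5
```

Fixing the seed, as this sketch does, makes two runs comparable. Whether that is desirable for the competition is a design choice left to the reader.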

BoW is computed with the `CountVectorizer` class of `sklearn`, restricted to at most 300 features and to tokens that appear in at least 5 training tweets (`min_df=5`). A custom `token_pattern` keeps tokens of three or more characters (letters, `@` and `#`) plus a few explicitly listed short tokens such as `no`. The vocabulary is learnt by the `fit` method, whereas turning the text into numerical vectors (using the learnt vocabulary) is done by `transform`. Lastly, `fit_transform` performs both steps in a single call.
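
To make the difference between `fit` and `transform` concrete, the following is a simplified pure-Python stand-in for a bag of words. It is not how `CountVectorizer` is implemented: it ignores `min_df`, uses a shortened token pattern, and its tie-breaking may differ. The functions `tokenize`, `fit` and `transform` and the sample tweets are assumptions made for this sketch.

```python
import re
from collections import Counter

# Shortened version of the notebook's token pattern (assumption for this sketch)
TOKEN = re.compile(r"[A-Za-z@#]{3,}|no")


def tokenize(text):
    # Lowercase first, as CountVectorizer does by default
    return TOKEN.findall(text.lower())


def fit(texts, max_features):
    # Count every token over the corpus and keep the most frequent ones
    counts = Counter(tok for text in texts for tok in tokenize(text))
    kept = sorted(tok for tok, _ in counts.most_common(max_features))
    # Map each kept token to a column index
    return {tok: i for i, tok in enumerate(kept)}


def transform(texts, vocabulary):
    # One row per text, one column per vocabulary token
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            if tok in vocabulary:  # tokens unseen during fit are ignored
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


train_texts = ["@united late flight again", "thank you @united", "flight cancelled, no help"]
test_texts = ["no flight today"]

vocabulary = fit(train_texts, max_features=2)   # learnt from train only
x_train = transform(train_texts, vocabulary)
x_test = transform(test_texts, vocabulary)      # reuses the train vocabulary

print(vocabulary)  # {'@united': 0, 'flight': 1}
print(x_train)     # [[1, 1], [1, 0], [0, 1]]
print(x_test)      # [[0, 1]]
```

The test tweet contains `today` and `no`, but neither is in the learnt vocabulary, so only `flight` is counted. This is why the vocabulary must be fitted on the train data and merely applied to the test data.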

In [47]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def obtain_data_representation(df, test=None):
    # If there is no test data, split the input
    if test is None:
        # Divide data in train and test
        train, test = train_test_split(df, test_size=0.25)
        df.airline_sentiment = pd.Categorical(df.airline_sentiment)
    else:
        # Otherwise, all is train
        train = df
        
    # Create a Bag of Words (BoW), fitted on the train data only
    cv = CountVectorizer(
        max_features=300,
        token_pattern=r'[A-Za-z@#]{3,}|no|yes|wtf|hrs|jfk',
        min_df=5,
    )

    # Learn the vocabulary from the train tweets, then vectorize them
    vectorizer = cv.fit(train['text'])
    print(cv.vocabulary_)
    x_train = vectorizer.transform(train['text'])
    
    y_train = train['airline_sentiment'].values
    
    #print(cv.get_feature_names())
    print("vector shape: ", x_train.shape)
    print(type(x_train))
    print(x_train.toarray())
    
    # Obtain BoW for the test data, using the previously fitted one
    x_test = cv.transform(test['text'])
    try:
        y_test = test['airline_sentiment'].values
    except KeyError:
        # It might be the submission file, where we don't have target values
        y_test = None
        
    return {
        'train': {
            'x': x_train,
            'y': y_train
        },
        'test': {
            'x': x_test,
            'y': y_test
        }
    }

## Model training

Though this function might seem strange at first, the only thing to know is that training an `sklearn` model always follows the same pattern:

```python
# 1. Create the model
model = BernoulliNB()

# 2. Train with some data, where `x` are features and
#    `y` is the target category
model.fit(x, y)

# 3. Predict new categories for test data (with which we
#    have not trained!)
y_pred = model.predict(test_x)
```

We can also obtain the accuracy score with the `accuracy_score` function.
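
Accuracy is the fraction of predictions that match the true label, which is what `accuracy_score` returns with its default settings. The sketch below computes it by hand. The helper `accuracy` and the sample labels are hypothetical and only serve as illustration.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are right
y_true = ["negative", "neutral", "negative", "positive"]
y_pred = ["negative", "negative", "negative", "positive"]

print(accuracy(y_true, y_pred))  # 0.75
```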

In [48]:
from sklearn.metrics import accuracy_score

def train_model(dataset, dmodel, *model_args, **model_kwargs):
    # Create the model from the given class (e.g. BernoulliNB)
    model = dmodel(*model_args, **model_kwargs)
    
    # Train it
    model.fit(dataset['train']['x'], dataset['train']['y'])
    
    # Predict new values for test
    y_pred = model.predict(dataset['test']['x'])
    
    # Print accuracy score unless it's the submission dataset
    if dataset['test']['y'] is not None:
        score = accuracy_score(dataset['test']['y'], y_pred)
        print("Model score is: {}".format(score))

    # Done
    return model, y_pred

In [49]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier


df = pd.read_csv('tweets_public.csv', index_col='tweet_id')
df = df.loc[df.airline_sentiment_confidence > .99]
dataset = obtain_data_representation(df)

# Train a Bernoulli Naive Bayes
modelNB, _ = train_model(dataset, BernoulliNB)

# Train a K Nearest Neighbors Classifier
modelKN, _ = train_model(dataset, KNeighborsClassifier)

{'@united': 3, 'well': 279, 'you': 298, 'can': 50, 'fix': 90, 'missing': 163, 'because': 36, 'but': 47, 'make': 155, 'for': 100, '@usairways': 4, 'are': 28, 'people': 191, 'this': 251, 'has': 114, 'two': 265, 'with': 288, 'and': 21, 'airport': 15, '@southwestair': 2, 'all': 16, 'know': 141, 'when': 282, 'the': 244, 'new': 170, 'from': 102, 'amp': 20, 'will': 287, 'care': 52, 'about': 6, 'person': 192, 'was': 274, 'lax': 144, 'more': 165, 'now': 177, 'have': 115, 'cancelled': 51, 'flightled': 93, 'flt': 96, 'what': 281, 'going': 108, 'follow': 99, 'stranded': 232, 'flights': 95, 'flighted': 92, 'flight': 91, 'today': 256, 'again': 8, 'never': 169, '@americanair': 0, 'change': 53, 'not': 175, 'issue': 134, 'how': 128, 'customer': 65, 'service': 218, '@jetblue': 1, 'thank': 241, 'that': 243, 'didn': 75, 'help': 118, 'work': 291, 'there': 248, 'attendant': 29, 'http': 130, 'one': 180, 'got': 110, 'nothing': 176, 'out': 186, 'very': 269, 'your': 299, 'delay': 70, 'why': 285, 'tomorrow': 258

## Submit file

Once we have found the best model (BernoulliNB for the above simple test), we can train it with all the data (that is, avoid doing a train/test split) and predict sentiments for the real submission data.

The cell below does exactly this.

In [26]:
import datetime

def create_submit_file(df_submission, ypred):
    date = datetime.datetime.now().strftime("%m_%d_%Y-%H_%M_%S")
    filename = 'submission_' + date + '.csv'
    
    df_submission['airline_sentiment'] = ypred
    df_submission[['airline_sentiment']].to_csv(filename)
    
    print('Submission file created: {}'.format(filename))
    print('Upload it to Kaggle InClass')

    
# Read submission and retrain with whole data
df_submission = pd.read_csv('tweets_submission.csv', index_col='tweet_id')
# We use df_submission as test, otherwise it would split df in train/test
submission_dataset = obtain_data_representation(df, df_submission)
# Predict for df_submission
_, y_pred = train_model(submission_dataset, BernoulliNB)

# Create submission file with obtained y_pred
create_submit_file(df_submission, y_pred)

Submission file created: submission_11_14_2017-19_47_49.csv
Upload it to Kaggle InClass
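
The timestamped filename and the layout of the generated file can be checked in isolation. The sketch below uses only the standard library: a fixed timestamp replaces `datetime.now()` so the result is reproducible, and the tweet ids and labels are hypothetical. Since `to_csv` writes the index by default, the real file should have the same two columns, `tweet_id` and `airline_sentiment`.

```python
import csv
import datetime
import io

# Fixed timestamp for reproducibility; the notebook uses datetime.now()
stamp = datetime.datetime(2017, 11, 14, 19, 47, 49)
filename = 'submission_' + stamp.strftime("%m_%d_%Y-%H_%M_%S") + '.csv'
print(filename)  # submission_11_14_2017-19_47_49.csv

# Hypothetical ids and labels, only to show the two-column layout
predictions = [(1001, 'negative'), (1002, 'positive')]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['tweet_id', 'airline_sentiment'])
writer.writerows(predictions)
csv_text = buffer.getvalue()
print(csv_text)
```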
