### Task 3

#### Assignment task
<pre>
Download data from <a href='https://tinyurl.com/ym6xcth4'>link</a>

1. Build a feature Matrix using TF-IDF method
2. Fit the Logistic Regression Classifier and get Accuracy, Precision, Recall and AUC Score
3. Compare the results with Naive Bayes
4. Write your observations

</pre>


<img src='https://i.imgur.com/ELUAXGn.png'>

## 1. Build a feature Matrix using TF-IDF method

### 1a. Preprocessing

#### 1a.1 Reading the first dataset

In [43]:
import numpy as np
import math
import pandas as pd

df_A = pd.read_csv('../../datasets/Tweets.csv')

df_A


Unnamed: 0,tweet_id,airline_sentiment,text,tweet_created
0,5.703060e+17,neutral,@VirginAmerica What @dhepburn said.,24-02-2015 11:35
1,5.703010e+17,positive,@VirginAmerica plus you've added commercials t...,24-02-2015 11:15
2,5.703010e+17,neutral,@VirginAmerica I didn't today... Must mean I n...,24-02-2015 11:15
3,5.703010e+17,negative,@VirginAmerica it's really aggressive to blast...,24-02-2015 11:15
4,5.703010e+17,negative,@VirginAmerica and it's a really big bad thing...,24-02-2015 11:14
...,...,...,...,...
14635,5.695880e+17,positive,@AmericanAir thank you we got on a different f...,22-02-2015 12:01
14636,5.695870e+17,negative,@AmericanAir leaving over 20 minutes Late Flig...,22-02-2015 11:59
14637,5.695870e+17,neutral,@AmericanAir Please bring American Airlines to...,22-02-2015 11:59
14638,5.695870e+17,negative,"@AmericanAir you have my money, you change my ...",22-02-2015 11:59


#### 1a.2. Using Regex for preprocessing the text

In [44]:
# Keep only letters, digits and spaces (drops @, #, punctuation, emoji)
df_A['text'] = df_A['text'].str.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
# Drop digit runs together with any whitespace that follows them
df_A['text'] = df_A['text'].str.replace(r'\d+\s*', '', regex=True)
df_A['text'] = df_A['text'].str.lower()

df_A['text']

0                         virginamerica what dhepburn said
1        virginamerica plus youve added commercials to ...
2        virginamerica i didnt today must mean i need t...
3        virginamerica its really aggressive to blast o...
4        virginamerica and its a really big bad thing a...
                               ...                        
14635    americanair thank you we got on a different fl...
14636    americanair leaving over minutes late flight n...
14637    americanair please bring american airlines to ...
14638    americanair you have my money you change my fl...
14639    americanair we have ppl so we need know how ma...
Name: text, Length: 14640, dtype: object
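
To make the effect of the two patterns easier to see, here is a minimal sketch that applies the same three steps to a single string with the standard `re` module. The helper name `clean_tweet` and the sample tweet are hypothetical and are not part of the notebook.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the notebook's three cleaning steps to one string."""
    text = re.sub(r'[^A-Za-z0-9 ]+', '', text)  # drop everything except letters, digits, spaces
    text = re.sub(r'\d+\s*', '', text)          # drop digit runs and the whitespace after them
    return text.lower()

# Hypothetical sample, not taken from the dataset
sample = "@SomeAirline Flight 123 delayed 20 minutes!!"
print(clean_tweet(sample))  # someairline flight delayed minutes
```

Note that the second pattern removes numbers entirely, so a tweet mentioning a delay of "20 minutes" keeps only "minutes".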

#### 1a.3 Generating the TF-IDF feature matrix

In [45]:
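processed_reviews = df_A['text']

# Vocabulary in first-seen order, plus how often each word occurs in the whole corpus
word_index = {}          # word -> column index (avoids repeated list searches)
unique_words = []
unique_words_count = []

for review in processed_reviews:
    for word in review.split():
        if word not in word_index:
            word_index[word] = len(unique_words)
            unique_words.append(word)
            unique_words_count.append(1)
        else:
            unique_words_count[word_index[word]] += 1

tfidf_feature_matrix = np.zeros((len(processed_reviews), len(unique_words)))

# weight = (count of word in this tweet) * log(number of tweets / corpus count of word)
# Note: the denominator is the word's total occurrence count, not the number of tweets containing it
for n, review in enumerate(processed_reviews):
    words = review.split()
    for word in words:
        index = word_index[word]
        tfidf_feature_matrix[n][index] = words.count(word) * math.log(len(processed_reviews) / unique_words_count[index])

tfidf_feature_matrix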

array([[3.33768398, 3.10078925, 9.59151279, ..., 0.        , 0.        ,
        0.        ],
       [3.33768398, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [3.33768398, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        9.59151279],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
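
The cell above divides by each word's total number of occurrences in the corpus. The more common textbook definition of IDF uses the document frequency instead, that is, the number of documents that contain the word at least once. The following is a minimal sketch of that variant in plain Python, shown only for comparison. The function name `tfidf_rows` and the three toy documents are hypothetical, and the values it produces will differ from the matrix printed above.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) with rows[i][j] = tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    vocab = {}            # word -> column index, in first-seen order
    doc_freq = Counter()  # word -> number of documents containing it
    for tokens in tokenized:
        for word in tokens:
            vocab.setdefault(word, len(vocab))
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(docs)
    rows = []
    for tokens in tokenized:
        row = [0.0] * len(vocab)
        for word, tf in Counter(tokens).items():
            row[vocab[word]] = tf * math.log(n_docs / doc_freq[word])
        rows.append(row)
    return vocab, rows

# Hypothetical toy corpus
docs = ["good flight good crew", "bad flight", "good food"]
vocab, rows = tfidf_rows(docs)
print(rows[0][vocab["good"]])  # 2 * log(3 / 2)
```

In this toy corpus "good" occurs three times but in only two documents, so the document-frequency variant gives it a positive weight, whereas dividing by the total count would give `log(3 / 3) = 0`.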

## 2. Fit the Logistic Regression Classifier and get Accuracy, Precision, Recall and AUC Score

### 2a. Mapping the airline_sentiment data

In [46]:
df = df_A  # alias, not a copy: the mapping below also changes df_A

# Collapse the three sentiment labels into two classes for binary logistic regression:
# negative -> 0, neutral and positive -> 1

df['airline_sentiment'] = df['airline_sentiment'].map({'negative': 0, 'neutral': 1, 'positive': 1})
df['airline_sentiment'] = df['airline_sentiment'].astype('category')
df.head(10)

Unnamed: 0,tweet_id,airline_sentiment,text,tweet_created
0,5.70306e+17,1,virginamerica what dhepburn said,24-02-2015 11:35
1,5.70301e+17,1,virginamerica plus youve added commercials to ...,24-02-2015 11:15
2,5.70301e+17,1,virginamerica i didnt today must mean i need t...,24-02-2015 11:15
3,5.70301e+17,0,virginamerica its really aggressive to blast o...,24-02-2015 11:15
4,5.70301e+17,0,virginamerica and its a really big bad thing a...,24-02-2015 11:14
5,5.70301e+17,0,virginamerica seriously would pay a flight for...,24-02-2015 11:14
6,5.70301e+17,1,virginamerica yes nearly every time i fly vx t...,24-02-2015 11:13
7,5.703e+17,1,virginamerica really missed a prime opportunit...,24-02-2015 11:12
8,5.703e+17,1,virginamerica well i didntbut now i do d,24-02-2015 11:11
9,5.70295e+17,1,virginamerica it was amazing and arrived an ho...,24-02-2015 10:53


### 2b. Splitting the data into training and test sets and fitting the model

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

X = tfidf_feature_matrix

y = df_A['airline_sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

lg = LogisticRegression(C=0.5, solver='liblinear', penalty='l1')
lg.fit(X_train, y_train)

LogisticRegression(C=0.5, penalty='l1', solver='liblinear')

In [48]:
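# Hard class labels (0/1); the two scores below are computed from these labels, not from predicted probabilities
lg_y_pred = lg.predict(X_test)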

print("Accuracy:", lg.score(X_test, y_test))
print("Average precision-recall score", average_precision_score(y_test, lg_y_pred))
print("AUC score:", roc_auc_score(y_test, lg_y_pred))


Accuracy: 0.8306010928961749
Average precision-recall score 0.6829619580696455
AUC score: 0.8202985034653019
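
The assignment asks for precision and recall as separate numbers, while the cell above reports `average_precision_score`, which summarizes the precision-recall curve instead. scikit-learn provides `precision_score` and `recall_score` for the individual values, and `roc_auc_score` is normally given scores such as `lg.predict_proba(X_test)[:, 1]` rather than hard labels. The sketch below computes these quantities by hand to make the definitions explicit. The helper names and the five data points are hypothetical; this is not a rerun of the notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is ranked above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.7, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

precision, recall = precision_recall(y_true, y_pred)
auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, y_pred)
print(precision, recall, auc_from_scores, auc_from_labels)
```

On this toy data the scores rank every positive above every negative, so the AUC from scores is 1.0, while the AUC from thresholded labels is lower because thresholding discards the ranking. The AUC values printed in the notebook may therefore understate what the models achieve.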


## 3. Compare the result with Naive Bayes

In [49]:
from sklearn.naive_bayes import BernoulliNB

bb = BernoulliNB()
bb.fit(X_train, y_train)
bb_y_pred = bb.predict(X_test)

print("Accuracy:", bb.score(X_test, y_test))
print("Average precision-recall score", average_precision_score(y_test, bb_y_pred))
print("AUC score:", roc_auc_score(y_test, bb_y_pred))

Accuracy: 0.833879781420765
Average precision-recall score 0.6892610138445917
AUC score: 0.8171850404558888
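
One detail worth keeping in mind when reading this comparison: scikit-learn's `BernoulliNB` has a `binarize` parameter that defaults to `0.0`, so every feature greater than 0 is treated as 1 and everything else as 0 before fitting. With default settings the model therefore sees only whether a word is present, not its TF-IDF weight. The sketch below mimics that thresholding on a hypothetical row; `binarize_row` is an illustrative helper, not a library function.

```python
def binarize_row(row, threshold=0.0):
    """Map values strictly above the threshold to 1 and the rest to 0."""
    return [1 if value > threshold else 0 for value in row]

# Hypothetical TF-IDF weights for one tweet
tfidf_row = [3.34, 0.0, 9.59, 0.0, 1.2]
print(binarize_row(tfidf_row))  # [1, 0, 1, 0, 1]
```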


## 4. Observations

* The dataset is imbalanced, as the reviews are largely negative.

* The three sentiment labels were collapsed into two classes for binary classification: negative reviews are `0`, while neutral and positive reviews are `1`.

* The data was split into training and test sets in a 3:1 ratio using the `train_test_split` method (`test_size=0.25`).

* Logistic Regression (`C=0.5`, L1 penalty, `liblinear` solver) reaches an accuracy of `0.831`.

* BernoulliNB with default settings reaches an accuracy of `0.834`.

* The TF-IDF feature matrix performs better than the BoW matrix by approximately 9.8% accuracy.

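* BernoulliNB is marginally ahead of Logistic Regression in accuracy (`0.834` vs `0.831`), while Logistic Regression has the slightly higher AUC score (`0.820` vs `0.817`). The gap is small enough that it may not hold on a different split. Naive Bayes is often reported to do well when training data is limited, which may explain why it keeps pace here.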
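
Because the classes are imbalanced, accuracy is easier to interpret next to the score of a trivial model that always predicts the most frequent class. The sketch below shows how such a baseline could be computed. The helper name and the label list are hypothetical; in the notebook the input would be `y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical labels: five of class 0, three of class 1
labels = [0, 0, 0, 1, 0, 1, 0, 1]
print(majority_baseline_accuracy(labels))  # 0.625
```

A classifier is only informative to the extent that its accuracy exceeds this baseline.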
