# Classifying the Complete Dataset
* This notebook uses the tuned logistic regression baseline classifier to score tweet sentiment.
* VADER compound scores are also generated and compared with the baseline model.
* By comparing sentiment scores across the COVID and non-COVID DataFrames, we begin to assess the impact COVID has on tweet sentiment.

In [5]:
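import os
import sys

# sys.path entries are used literally, so expand '~' to the home directory first
sys.path.insert(0, os.path.expanduser('~/data_bootcamp/data-science-final-project/scripts/'))

# Import custom functions (the wildcard import presumably also exposes pd, pickle and TfidfVectorizer)
from functions import *
pd.set_option('display.max_colwidth', None)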


## Importing Filtered Tweets

In [6]:
# Loading filtered tweets from pickle file

df_full = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_filtered_tweets_master.pkl')
#df_no_retweets = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets.pkl')
#df_no_rt_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_covid_mention.pkl')
#df_no_rt_no_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_no_covid.pkl')

## Load Model

In [9]:
# Load the pickled grid-search object; the context manager closes the file handle
with open('/Users/lclark/data_bootcamp/data-science-final-project/models/LogReg_GridCV_3C_89p_40kfeats.sav', 'rb') as f:
    lr_model = pickle.load(f)
lr_model.best_params_

{'classifier': LogisticRegression(C=11.288378916846883, penalty='l1', solver='liblinear'),
 'classifier__C': 11.288378916846883,
 'classifier__penalty': 'l1',
 'classifier__solver': 'liblinear'}

## Classifying Tweets

### Logistic Regression Classification

* Roughly 25% of the full dataset is original tweets and the rest is retweets, so analyzing the full dataset may indicate whether people tend to retweet positive or negative tweets more frequently.
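Once labels have been assigned, one way to look at this question is to compare label shares within each group rather than raw counts. The following is a minimal sketch on a toy frame; the column names `is_retweet` and `lr_labels` match the notebook, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_full (values are illustrative, not from the dataset).
toy = pd.DataFrame({
    "is_retweet": [0, 0, 1, 1, 1, 1, 1, 1],
    "lr_labels":  [0, 4, 0, 0, 2, 4, 4, 4],
})

# Row-normalised crosstab: each row sums to 1, so the shares stay
# comparable even though retweets outnumber original tweets.
label_share = pd.crosstab(toy["is_retweet"], toy["lr_labels"], normalize="index")
print(label_share)
```

Normalising by row keeps the larger retweet group from dominating the comparison.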

In [11]:
df_full['full_text_clean'] = df_full['full_clean'].apply(joiner)

In [12]:
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, ngram_range=(1,2), max_features=40000)

X = vectorizer.fit_transform(df_full.full_text_clean)

In [13]:
df_full['lr_labels'] = lr_model.predict(X)
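One caveat is worth checking here. A classifier trained on TF-IDF features expects the columns laid out by the vocabulary learned at training time, while the cell above fits a fresh vectorizer on the full dataset. If the vectorizer was fitted separately when the model was trained, reusing that fitted object with `transform` keeps the feature columns aligned. The sketch below illustrates the pattern on a hypothetical four-document corpus; none of the texts or labels come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data (not from the notebook).
train_texts = ["great leader", "terrible response", "great news", "terrible day"]
train_labels = [4, 0, 4, 0]

# Fit the vectorizer once, on the training texts only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X_train, train_labels)

# New tweets are mapped onto the training vocabulary with transform(),
# not fit_transform(), so column i means the same n-gram in both matrices.
new_texts = ["great day", "terrible news today"]
X_new = vectorizer.transform(new_texts)
predictions = clf.predict(X_new)
```

In practice the fitted vectorizer would be pickled alongside the model, or placed inside the same scikit-learn `Pipeline`, so that both are loaded together.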

### VADER

In [18]:
df_full['vader_text'] = df_full['full_text'].apply(vader_preprocess)

In [15]:
df_full.head(1)

id_str,created_at,full_text,vader_text,no_hashtags,full_clean,covid_mention,retweet_count,user_name,is_retweet,full_text_clean,lr_labels
1294232573636304896,2020-08-14 11:21:05+00:00,Dr. Bonnie Henry is one of the greatest leaders of our time. She has done everything we asked her to do and more. Did we ask her to do the right thing?\n#bcpoli,Dr. Bonnie Henry is one of the greatest leaders of our time. She has done everything we asked her to do and more. Did we ask her to do the right thing? #bcpoli,"[dr, bonnie, henry, one, greatest, leaders, time, done, everything, asked, ask, right, thing]","[dr, bonnie, henry, one, greatest, leaders, time, done, everything, asked, ask, right, thing, bcpoli]",1,0,bcpoli,0,dr bonnie henry one greatest leaders time done everything asked ask right thing bcpoli,4


In [16]:
%%time

df_full['vader_scores'] = df_full['vader_text'].apply(vader_analyze)
# Expand each score dict into its own columns (neg, neu, pos, compound)
df_full = df_full.join(df_full['vader_scores'].apply(pd.Series))
df_full.drop(columns='vader_scores', inplace=True)
# Map the compound score to a 0/2/4 sentiment label
df_full['vader_label'] = df_full['compound'].apply(categorize).astype('int8')

CPU times: user 3min 57s, sys: 3.36 s, total: 4min 1s
Wall time: 4min 1s


id_str,created_at,full_text,vader_text,no_hashtags,full_clean,covid_mention,retweet_count,user_name,is_retweet,full_text_clean,lr_labels,neg,neu,pos,compound,vader_label
1294232573636304896,2020-08-14 11:21:05+00:00,Dr. Bonnie Henry is one of the greatest leaders of our time. She has done everything we asked her to do and more. Did we ask her to do the right thing?\n#bcpoli,Dr. Bonnie Henry is one of the greatest leaders of our time. She has done everything we asked her to do and more. Did we ask her to do the right thing? #bcpoli,"[dr, bonnie, henry, one, greatest, leaders, time, done, everything, asked, ask, right, thing]","[dr, bonnie, henry, one, greatest, leaders, time, done, everything, asked, ask, right, thing, bcpoli]",1,0,bcpoli,0,dr bonnie henry one greatest leaders time done everything asked ask right thing bcpoli,4,0.0,0.884,0.116,0.6369,4
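Expanding a column of dicts with `.apply(pd.Series)` builds one Series per row, which can be slow on a few hundred thousand tweets. A possible alternative, sketched here on hypothetical score dicts, is to build the score columns in one pass. It assumes `vader_analyze` returns a dict with the keys `neg`, `neu`, `pos` and `compound`, as the output columns above suggest.

```python
import pandas as pd

# Hypothetical score dicts with the same keys as the columns shown above.
toy = pd.DataFrame(
    {"vader_scores": [
        {"neg": 0.0, "neu": 0.884, "pos": 0.116, "compound": 0.6369},
        {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.4767},
    ]},
    index=pd.Index([101, 102], name="id_str"),
)

# Build all score columns at once from the list of dicts, keeping the
# original index so the join lines up row for row.
scores = pd.DataFrame(toy["vader_scores"].tolist(), index=toy.index)
expanded = toy.drop(columns="vader_scores").join(scores)
```

Most of the wall time reported above is likely spent in the VADER scoring itself, so this only trims the expansion step.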


In [60]:
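# Sample tweets that VADER labels neutral even though the compound score is not exactly zero
neutral_nonzero = (df_full['vader_label'] == 2) & (df_full['compound'] != 0)
df_full.loc[neutral_nonzero, ['created_at', 'full_text', 'lr_labels', 'compound', 'vader_label']].sample(n=10)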

## Comparing Logistic Regression Classification with VADER

In [17]:
# Logistic Regression Value Counts

df_full.lr_labels.value_counts()

2    183533
0    118437
4     97623
Name: lr_labels, dtype: int64

In [19]:
# VADER Value Counts with extracted full_text from retweet_status

df_full.vader_label.value_counts()

4    198743
0    121651
2     79199
Name: vader_label, dtype: int64
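The two sets of counts show that the logistic regression model labels most tweets neutral (2) while VADER labels most of them positive (4), but marginal counts do not say how often the two methods agree on the same tweet. A per-tweet comparison could be sketched as follows; the toy labels are invented for illustration.

```python
import pandas as pd

# Toy labels (illustrative only); columns are named as in df_full.
toy = pd.DataFrame({
    "lr_labels":   [0, 2, 4, 4, 2, 0],
    "vader_label": [0, 4, 4, 4, 2, 2],
})

# Share of tweets where both methods assign the same label.
agreement = (toy["lr_labels"] == toy["vader_label"]).mean()

# Rows: logistic regression label, columns: VADER label.
confusion = pd.crosstab(toy["lr_labels"], toy["vader_label"])
print(f"agreement: {agreement:.2%}")
print(confusion)
```

The off-diagonal cells of the crosstab show which disagreements are most common, for example tweets the model calls neutral that VADER calls positive.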

## VADER Value Counts before extracting the full_text from the retweet_status

If a tweet is a retweet, its text is truncated in the full_text column, so the complete text has to be extracted from the dictionary in retweet_status. Before that extraction, the VADER label counts were:

4  -  175328

0  -  106859

2  -  104546
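The extraction step described above could look like the sketch below. It assumes the raw tweets carry a `retweeted_status` column (the field name used by the Twitter v1.1 API) holding either a dict with its own `full_text` or nothing for original tweets. The helper `resolve_full_text` is a hypothetical name, not a function from the project's scripts.

```python
import pandas as pd

def resolve_full_text(row):
    """Return the untruncated text for one tweet row."""
    retweeted = row["retweeted_status"]
    # Retweets keep the complete text inside the nested status dict.
    if isinstance(retweeted, dict) and "full_text" in retweeted:
        return retweeted["full_text"]
    # Original tweets already carry their complete text.
    return row["full_text"]

# Toy rows: one truncated retweet and one original tweet.
toy = pd.DataFrame({
    "full_text": ["RT @user: a long tweet that gets cut", "an original tweet"],
    "retweeted_status": [{"full_text": "a long tweet that gets cut off no more"}, None],
})
toy["full_text"] = toy.apply(resolve_full_text, axis=1)
```

Scoring the complete text gives VADER more words to work with, which is consistent with the drop in neutral labels between the two sets of counts.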

In [68]:
def categorize(x, upper=0.05, lower=-0.05):
    """
    Categorizes tweets into sentiment categories of 0, 2 and 4:
    Negative, Neutral and Positive, respectively.
    0, 2 and 4 were chosen to compare against another model that classifies this way.
    The upper and lower variables are the standard thresholds from VADER Sentiment.
    """
    # Negative at or below the lower threshold, positive at or above the upper one
    if x <= lower:
        return '0'
    elif x < upper:
        return '2'
    else:
        return '4'
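For a whole column, the same thresholds can be applied without a Python-level loop. The sketch below uses `numpy.select` and follows VADER's documented convention: a compound score at or below -0.05 is negative, at or above 0.05 is positive, and anything in between is neutral. The function name `categorize_series` is hypothetical.

```python
import numpy as np
import pandas as pd

def categorize_series(compound, upper=0.05, lower=-0.05):
    """Map a Series of compound scores to 0 (negative), 2 (neutral), 4 (positive)."""
    conditions = [compound <= lower, compound >= upper]
    # Rows matching neither condition fall through to the neutral default.
    labels = np.select(conditions, [0, 4], default=2)
    return pd.Series(labels, index=compound.index, dtype="int8")

# Includes the boundary values, which are the easy ones to get wrong.
compound = pd.Series([-0.6, -0.05, -0.0499, 0.0, 0.05, 0.6369])
labels = categorize_series(compound)
```

Including the exact threshold values in a quick check like this makes it easy to confirm which side of the boundary each one lands on.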