# Classifying the Complete Dataset
* The tuned logistic regression baseline classifier is used in this notebook to score tweet sentiment.
* VADER compound scores are also generated and compared with the baseline model.

* By comparing sentiment scores of the covid and non-covid DataFrames, we will begin to assess the impact that covid has on tweet sentiment.

In [1]:
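import os
import sys

# sys.path entries are not tilde-expanded, so expand '~' explicitly
sys.path.insert(0, os.path.expanduser('~/data_bootcamp/data-science-final-project/scripts/'))
# Import custom functions (presumably this also brings pd, pickle, and TfidfVectorizer into scope)
from functions import *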

pd.set_option('display.max_colwidth', None)


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/lclark/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Importing Filtered Tweets

In [2]:
# Loading filtered tweets from pickle file

df_full = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_filtered_tweets_master.pkl')

# All the files below are a subset of df_filtered_tweets_master

#df_no_retweets = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets.pkl')
#df_no_rt_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_covid_mention.pkl')
#df_no_rt_no_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_no_covid.pkl')

## Load Model

In [35]:
import os
with open(os.path.expanduser('~/data_bootcamp/data-science-final-project/models/LogReg_GridCV_3C_89p_40kfeats.sav'), 'rb') as f:  # open() does not expand '~'
    lr_model = pickle.load(f)
lr_model.best_params_

## Classifying Tweets

### Logistic Regression Classification

* Roughly 25% of the full dataset is original tweets; the rest are retweets. Analyzing the full dataset may therefore indicate whether people tend to retweet positive or negative tweets more frequently.

In [4]:
df_full['full_text_clean'] = df_full['full_clean'].apply(joiner)

In [5]:
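# Caution: a vectorizer fitted here only suits lr_model if its vocabulary and column order
# match the one used in training; reusing the fitted training vectorizer is safer.
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, ngram_range=(1,2), max_features=40000)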

X = vectorizer.fit_transform(df_full.full_text_clean)

In [6]:
df_full['lr_labels'] = lr_model.predict(X)
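
The notebook does not say whether the pickled model wraps its own vectorizer. If it is a bare classifier, the columns of `X` must line up with the columns seen in training. The sketch below is a toy illustration in plain Python, not the scikit-learn implementation: `fit_vocabulary` and both corpora are hypothetical stand-ins, used only to show that the column index of a token depends on the corpus the vocabulary was fitted on.

```python
# Toy illustration (assumption: NOT how TfidfVectorizer is implemented).
# It only shows that a token's column index depends on the fitted corpus.

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

# Hypothetical corpora standing in for the training tweets and df_full.
train_docs = ["great debate tonight", "terrible debate"]
new_docs = ["awful campaign", "great campaign tonight"]

train_vocab = fit_vocabulary(train_docs)
new_vocab = fit_vocabulary(new_docs)

# The same token lands in a different column in each vocabulary.
print(train_vocab["great"])  # 1
print(new_vocab["great"])    # 2
```

A common remedy is to persist the fitted vectorizer together with the model (for example inside a scikit-learn `Pipeline`) and to call `transform`, not `fit_transform`, on new text.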

### VADER

In [7]:
%%time

# Analyze tweets, extract scores from dictionary result, drop dictionary result, categorize
df_full['vader_text'] = df_full['full_text'].apply(vader_preprocess)
df_full = vader_score_to_series(df_full)

CPU times: user 4min 4s, sys: 2.78 s, total: 4min 7s
Wall time: 4min 7s


In [19]:
# Testing a wider neutral band (±0.1) than the default ±0.05 around 0
df_full['vader_label_wider_neu'] = df_full['compound'].apply(lambda x: categorize(x, upper=0.1, lower=-0.1))
df_full['vader_label_wider_neu'].value_counts().sort_index()

0    117152
2     90757
4    191684
Name: vader_label_wider_neu, dtype: int64
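
The `categorize` helper lives in the custom `functions` module and is not shown in this notebook. A minimal sketch of what such a thresholding function might look like follows. The name `categorize_sketch`, the inclusive comparisons, and the 0/2/4 mapping to negative/neutral/positive are assumptions inferred from the outputs here, not the author's code.

```python
def categorize_sketch(compound, upper=0.05, lower=-0.05):
    """Map a VADER compound score to 0 (negative), 2 (neutral) or 4 (positive).

    Assumption: scores at or beyond a threshold count as polar, which follows
    the usual VADER convention; the real helper may treat boundaries differently.
    """
    if compound >= upper:
        return 4
    if compound <= lower:
        return 0
    return 2

# A compound score of 0.0772 is positive under the default band
# but falls into the neutral band once it is widened to +/-0.1.
score = 0.0772
print(categorize_sketch(score))                        # 4
print(categorize_sketch(score, upper=0.1, lower=-0.1)) # 2
```

Widening the band can only move tweets from the polar classes into the neutral class, which is consistent with the higher neutral count in the output above.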

In [20]:
df_full.vader_label_wider_neu.describe()

count    399593.000000
mean          2.373040
std           1.718242
min           0.000000
25%           0.000000
50%           2.000000
75%           4.000000
max           4.000000
Name: vader_label_wider_neu, dtype: float64

In [12]:
df_full[(df_full['vader_label'] == 4)][['created_at','vader_text','lr_labels','compound','vader_label']].sample(n=10)

id_str,created_at,vader_text,lr_labels,compound,vader_label
1320060313501224960,2020-10-24 17:51:18+00:00,"Encouraging everyone who hasn’t already done so, to get out today to exercise your right to vote and to pick up one of these prized stickers! To find your polling station visit #BCelxn2020 #bcpoli",4,0.7959,4
1316073449983365120,2020-10-13 17:48:56+00:00,"We are putting forward a vision for this province that puts puts people, health and wellbeing, security and safety, and the resilience of communities at the centre of every decision. #bcpoli #bcelxn #vanpoli #bcvotes",4,0.7351,4
1313658286734934016,2020-10-07 01:51:56+00:00,ICBC insurance challenge: Try to get a quote for insurance in Alberta. Couldn’t believe the discriminatory questions. Couldn’t complete it but it was a great indicator of what the Liberals are taking bribes for #bcpoli #BCElection2020,0,0.7783,4
1310714805649510400,2020-09-28 22:55:36+00:00,Leaving a big hole in govt revenue makes no sense - especially in a pandemic. Time for B.C. to properly tax billionaires like Chip Wilson and Jim Pattison #bcpoli #WealthTax #cdnpoli,2,0.0772,4
1316262378892222464,2020-10-14 06:19:40+00:00,"Love @jjhorgan recommending @GeorgeAbbottPhd’s great book Big Promises, Small Government on disastrous BC Liberal past giant tax cuts that slashed services - to Andrew Wilkinson in #BCelxn2020 debate! #bcpoli #bcndp",2,0.636,4
1316726659353997312,2020-10-15 13:04:33+00:00,Court upholds 47% loan interest rate. Parliament hasn't rewritten #usury law since 1978. #cdnpoli @KPateontheHill @FCACan #bcpoli,2,0.4588,4
1323728771506057216,2020-11-03 20:48:27+00:00,"This is a good, hopeful profile of a fossil fuel town in the midst of its economic transition. Switching to a model that can be sustained takes thought and work, but it is possible. Great reporting from @TheTyee's @amandajfollett: #bcpoli",4,0.8674,4
1320498194027929600,2020-10-25 22:51:17+00:00,Congratulations @jjhorgan on your victory tonight for the @bcndp. And congrats to my former colleagues @AdamPOlsen and @SoniaFurstenau for your re election tonight. #bcpoli.,0,0.8074,4
1310408261833687040,2020-09-28 02:37:30+00:00,"Today I took time away from the campaign to spent time with my family at my elementary school. Thank you to teachers, admin & support staff who are working hard to keep our children safe & get our schools back open so parents can return to work. #RichmondBC - #StevestonBC #BCpoli",0,0.7717,4
1318718898347364352,2020-10-21 01:01:00+00:00,"Question for all the medical doctors out there - how many of you would say ""yes"" to representing tobacco companies like Phillip Morris? #bcpoli",4,0.3612,4


## Comparing Logistic Regression Classification with VADER

In [22]:
# Logistic Regression Value Counts

df_full['lr_labels'].value_counts().sort_index()

0    118437
2    183533
4     97623
Name: lr_labels, dtype: int64

In [23]:
# VADER Value Counts with extracted full_text from retweet_status

df_full.vader_label.value_counts().sort_index()

0    121637
2     79211
4    198745
Name: vader_label, dtype: int64
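
Marginal counts alone do not show how often the two classifiers agree on the same tweet. The sketch below tabulates label pairs with the standard library; `agreement_table` is a hypothetical helper, and the labels are the ten rows from the sample printed earlier. That sample was filtered to `vader_label == 4`, so it is not representative of the full dataset. On the real data, `pd.crosstab(df_full['lr_labels'], df_full['vader_label'])` would give the same kind of table.

```python
from collections import Counter

def agreement_table(labels_a, labels_b):
    """Count (label_a, label_b) pairs and return them with the agreement rate."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    pairs = Counter(zip(labels_a, labels_b))
    matches = sum(n for (a, b), n in pairs.items() if a == b)
    return pairs, matches / len(labels_a)

# lr_labels and vader_label from the ten sampled rows shown earlier.
lr_sample = [4, 4, 0, 2, 2, 2, 4, 0, 0, 4]
vader_sample = [4] * 10

pairs, rate = agreement_table(lr_sample, vader_sample)
for (lr_label, vader_label), n in sorted(pairs.items()):
    print(f"lr={lr_label} vader={vader_label}: {n}")
print(f"agreement rate: {rate:.1f}")  # 0.4
```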

## VADER Value Counts before extracting the full_text from the retweet_status

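If a tweet is a retweet, its text is truncated in the full_text column, so the complete text has to be extracted from the dictionary in retweet_status. The counts below come from an earlier run on the truncated text. Note: that run covered fewer tweets (386,733 in total versus 399,593 above), so the raw counts are not directly comparable. Even so, the truncated text produced more neutral labels and fewer negative and positive ones, which gives some indication that sentiment, negative sentiment included, is often determined by the end of a tweet rather than the beginning.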

0  -  106859

2  -  104546

4  -  175328
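
Because the two runs cover different numbers of tweets, label shares are easier to compare than raw counts. The sketch below uses only the counts reported in this notebook; `label_shares` is a hypothetical helper name.

```python
def label_shares(counts):
    """Return the total and each label's share of that total."""
    total = sum(counts.values())
    return total, {label: n / total for label, n in counts.items()}

# Counts copied from the outputs in this notebook.
truncated = {0: 106859, 2: 104546, 4: 175328}   # before extracting full text
full_text = {0: 121637, 2: 79211, 4: 198745}    # after extracting full text

total_trunc, shares_trunc = label_shares(truncated)
total_full, shares_full = label_shares(full_text)

print(total_trunc)  # 386733
print(total_full)   # 399593
for label in (0, 2, 4):
    print(label, round(shares_trunc[label], 3), round(shares_full[label], 3))
```

The neutral share is lower in the full-text run, while both the negative and positive shares are higher.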

In [33]:
# Create distributable labelled bcpoli dataset.

#df_full_distribute = df_full[['covid_mention','neg','neu','pos','compound','vader_label']].reset_index()
#df_full_distribute.to_pickle('~/data_bootcamp/data-science-final-project/data/bcpoli_vader_labelled_tweets.sav')

# Export labelled df_full

#df_full.to_pickle('~/data_bootcamp/data-science-final-project/data/bcpoli_labelled_tweets.pkl')