# Classifying the Complete Dataset
* This notebook uses the tuned logistic regression baseline classifier to score tweet sentiment.
* VADER compound scores will also be generated and compared with the baseline model.

* By comparing sentiment scores of the COVID and non-COVID DataFrames, we will begin to assess the impact COVID has on tweet sentiment.

In [1]:
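import pickle
import sys

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

sys.path.insert(0, '/Users/lclark/data_bootcamp/data-science-final-project/scripts/')

# Import custom functions; the explicit imports above avoid relying on the star import for pd, pickle and TfidfVectorizer
from functions import *
pd.set_option('display.max_colwidth', None)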


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/lclark/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Importing Filtered Tweets

In [2]:
# Loading filtered tweets from pickle file

df_full = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_filtered_tweets_master.pkl')
#df_no_retweets = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets.pkl')
#df_no_rt_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_covid_mention.pkl')
#df_no_rt_no_covid = pd.read_pickle('~/data_bootcamp/data-science-final-project/data/df_original_tweets_no_covid.pkl')

## Load Model

In [3]:
# Open the model file in a context manager so the handle is closed after loading
with open('/Users/lclark/data_bootcamp/data-science-final-project/models/LogReg_GridCV_3C_89p_40kfeats.sav', 'rb') as f:
    lr_model = pickle.load(f)
lr_model.best_params_

{'classifier': LogisticRegression(C=11.288378916846883, penalty='l1', solver='liblinear'),
 'classifier__C': 11.288378916846883,
 'classifier__penalty': 'l1',
 'classifier__solver': 'liblinear'}

## Classifying Tweets

### Logistic Regression Classification

* Because only about 25% of the full dataset is original tweets and the rest are retweets, analyzing the full dataset may indicate whether people tend to retweet positive or negative tweets more frequently.

In [4]:
df_full['full_text_clean'] = df_full['full_clean'].apply(joiner)

In [5]:
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, ngram_range=(1,2), max_features=40000)

X = vectorizer.fit_transform(df_full.full_text_clean)

In [6]:
df_full['lr_labels'] = lr_model.predict(X)
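
One caveat with the cells above: `fit_transform` learns a new vocabulary from `df_full`, so column *j* of `X` refers to the same term as in training only if the two vocabularies happen to match. The toy sketch below is pure Python, with a hypothetical `build_vocab` helper standing in for a vectorizer's fit step and made-up example sentences. It illustrates how the same term can land in a different column once the corpus changes. If the vectorizer fitted at training time was saved, loading it and calling `transform` instead of `fit_transform` would avoid the problem; whether that applies here depends on how `lr_model` was trained, which this notebook does not show.

```python
def build_vocab(corpus):
    """Hypothetical stand-in for a vectorizer's fit step: term -> column index."""
    terms = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {term: idx for idx, term in enumerate(terms)}


train_corpus = ["good day", "bad day"]        # toy data, not from the project
new_corpus = ["a good day", "an awful day"]   # toy data, not from the project

train_vocab = build_vocab(train_corpus)
refit_vocab = build_vocab(new_corpus)

# The same term maps to different columns in the two vocabularies,
# so weights learned for one column would be applied to another term.
print(train_vocab["good"], refit_vocab["good"])
```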

### VADER

In [9]:
%%time
df_full['vader_text'] = df_full['full_text'].apply(vader_preprocess)
df_full = vader_score_to_series(df_full)

CPU times: user 4min 33s, sys: 4.34 s, total: 4min 37s
Wall time: 4min 39s


In [12]:
# Now all part of vader_score_to_series() ^

#df_full['vader_scores'] = df_full['vader_text'].apply(vader_analyze)
#df_full = df_full.join(df_full['vader_scores'].apply(pd.Series))
#df_full.drop(columns='vader_scores', inplace=True, axis=1)
#df_full['vader_label'] = df_full['compound'].apply(lambda x: categorize(x)).astype('int8')

In [10]:
df_full[['vader_label']].value_counts() 

vader_label
4              198743
0              121651
2               79199
dtype: int64

In [11]:
df_full[['lr_labels']].value_counts()

lr_labels
2            183533
0            118437
4             97623
dtype: int64

In [13]:
df_full[(df_full['vader_label'] == 0) & (df_full['vader_text'].str.contains('&amp'))][['created_at','vader_text','lr_labels','compound','vader_label']].sample(n=10)

id_str,created_at,vader_text,lr_labels,compound,vader_label
1295997680498249728,2020-08-19 08:15:00+00:00,"Utah teachers resign in droves over COVID-19 fears. Will this happen with Teachers &amp; Admin in #bced? How many will retire in the next few weeks. Will some move to Ontario? Or, for a lucky few, perhaps a personal leave for some self-care. #bced #bcpoli",2,-0.4515,0
1300443353541038080,2020-08-31 14:40:31+00:00,"...""We have lost so many of our loved ones, [.. we are shocked at the # of deaths in recent months. I lost my daughter 5 years ago at the beginning of the overdose crisis &amp; the fact that we are still losing so many people is a tragedy of our times.” #bcpoli @jjhorgan @adriandix",2,-0.9201,0
1309530520104050688,2020-09-25 16:29:40+00:00,"'Offensive to me': VC Health COVID19 reporting policy. DBH: “I think where some of the nuance came in was around exactly what we are talking about as an exposure event.” You’re putting non-enrolling &amp; enrolling Ts lives at risk, Dr. Daly. #bced #bcpoli",2,-0.6249,0
1309343520520429568,2020-09-25 04:06:36+00:00,@heckleNjive @richardzussman I've seen one film set. Everyone was both wearing masks and physically distanced. This was outside where transmission risk is much lower. Why are children &amp; teachers crowded together in classroom? Why the double standard? #bced #bcpoli #UnsafeSeptember,4,-0.5661,0
1303583887793115136,2020-09-09 06:39:52+00:00,"The wheels are already falling off the school ""plan"". We were told today that obviously parts of the plan are not possible, such as social distancing &amp; keeping cohorts apart at recess, so just ""do your best."" That's our plan. Districts have not been given resources. #bced #bcpoli",4,-0.1531,0
1326584439548178432,2020-11-11 17:55:51+00:00,"Another BC teacher weighs in on what's ACTUALLY happening in BC schools. BC Gov &amp; PHO's #differentrulesforschools leave students, staff, families vulnerable, not protected. Mixed messaging does not inspire confidence, trust, precaution adherence. #bced #bcpoli",2,-0.8981,0
1310316761149890560,2020-09-27 20:33:55+00:00,When CleanBC was announced we were told government would close 25% gap to meet our legislated emissions gap within two years. Then they dumped hundreds of millions into fracked gas &amp; LNG expansion. In fact 79% more $ to Oil and gas than Christy Clark’s government. THREAD #bcpoli,2,-0.4019,0
1317297541780705280,2020-10-17 02:53:02+00:00,"Yikes Seems @bcliberals want to sell off BC to the Chinese Government Teresa Wat needs media scrutiny &amp; so does @Wilkinson4BC Can’t believe @george_affleck chose this corrupted party to participate in. Just awful, worst of everything bad in our society #BCelxn2020 #bcpoli",4,-0.8225,0
1308956442959982592,2020-09-24 02:28:29+00:00,"John Horgan killed new 10 lane bridge to replace George Massey Tunnel, a project that was $900 mill under budget, and included 9,000 direct jobs; 500 mill for transit investments; replacement/upgrade of 3 interchanges &amp; 5 overpasses; &amp; widening of Hwy 99. #NDPtunnelvision #bcpoli",2,-0.6705,0
1325976086627319808,2020-11-10 01:38:29+00:00,@AlvinSingh ∞ᴴᴰv @KennedyStewart FOCUS PRIORITIES⚕ Ⅰ/PANDEMICSⅠⅠ/COVID19RECOVERYⅠⅠⅠ/SAVELIVES LIVELIHOODS ECONOMYᴴᴰ HERE&amp;NOWᴴᴰ #VANPOLI DUMBNOW✂AUDITOR DUMBER OLYMPICS✂ #DTES @JJHORGAN #VPD #YVR #BCPOLI #AI #SFU #UBC,0,-0.1531,0
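
The sampled tweets above still contain `&amp;`, the HTML-escaped form of `&`. If decoding such entities before scoring is desired, the standard library's `html.unescape` handles it. This is a minimal sketch: the `decode_entities` name is hypothetical, and whether `vader_preprocess` already does something similar is not shown in this notebook.

```python
import html


def decode_entities(text):
    """Replace HTML entities such as &amp; with the characters they stand for."""
    return html.unescape(text)


sample = "Will this happen with Teachers &amp; Admin in #bced?"
print(decode_entities(sample))
```

Applied to a column, this could run as `df_full['vader_text'].apply(decode_entities)` before the VADER scoring step.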


## Comparing Logistic Regression Classification with VADER

In [17]:
# Logistic Regression Value Counts

df_full.lr_labels.value_counts()

2    183533
0    118437
4     97623
Name: lr_labels, dtype: int64

In [19]:
# VADER Value Counts with extracted full_text from retweet_status

df_full.vader_label.value_counts()

4    198743
0    121651
2     79199
Name: vader_label, dtype: int64
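
Both label columns cover the same 399,593 tweets, so the raw counts are easier to compare as shares. The sketch below recomputes percentages from the counts printed above; the `shares` helper is a hypothetical name.

```python
# Counts copied from the two value_counts() outputs above
lr_counts = {2: 183533, 0: 118437, 4: 97623}
vader_counts = {4: 198743, 0: 121651, 2: 79199}


def shares(counts):
    """Convert label counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in sorted(counts.items())}


print("LR:   ", shares(lr_counts))
print("VADER:", shares(vader_counts))
```

On these counts, the logistic regression model labels about 46% of tweets neutral (2), while VADER labels about 20% neutral and about 50% positive (4). The negative (0) share is close to 30% for both.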

## VADER Value Counts before extracting the full_text from the retweet_status

If a tweet is a retweet, its text is truncated in the full_text column, so the complete text has to be extracted from the dictionary in retweet_status. Before that extraction, the VADER label counts were:

4  -  175328

0  -  106859

2  -  104546
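
The extraction described above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: each record is taken to be a dict with a `full_text` key and, for retweets, a `retweet_status` key holding a dict with its own `full_text`. The `extract_full_text` name and the sample tweets are hypothetical.

```python
def extract_full_text(tweet):
    """Return the untruncated text: the nested text for retweets, else the tweet's own."""
    nested = tweet.get("retweet_status")  # assumed key name, taken from the note above
    if isinstance(nested, dict) and nested.get("full_text"):
        return nested["full_text"]
    # Not a retweet (key missing, None or NaN): fall back to the top-level text
    return tweet["full_text"]


original = {"full_text": "A complete original tweet."}
retweet = {
    "full_text": "RT @user: A long tweet that gets cut off",
    "retweet_status": {"full_text": "A long tweet that gets cut off here in the timeline."},
}

print(extract_full_text(original))
print(extract_full_text(retweet))
```

The `isinstance` check also covers rows where the column holds a missing value rather than a dict, which is common when such records are loaded into a DataFrame.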

In [68]:
def categorize(x, upper=0.05, lower=-0.05):
    """
    Categorizes a VADER compound score into sentiment categories 0, 2 and 4:
    Negative, Neutral and Positive, respectively.
    0, 2 and 4 were chosen to compare against another model that classifies this way.
    The upper and lower defaults are the standard thresholds from VADER Sentiment:
    compound <= lower is negative and compound >= upper is positive.
    """
    if x <= lower:
        return 0  # negative
    elif x < upper:
        return 2  # neutral: strictly between the two thresholds
    else:
        return 4  # positive