# Obama & Trump Tweets - Part 2

- In this notebook, I will use the exported CSV file and extract meaningful information and insights.
- Here i will be using new libraries string and re (regular expressions) for text manipulation and extractions
- I will also use spaCy and textBlob libraries for NLP (Natural Language Processing) and Sentiment Analysis
- One more time, I will save the extracted information and store them into DataFrame format and then export it into CSV and Pickle file to use it in '**3. ObamaTrump_Tweets_Analysis_Final**'

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import string
import re

import spacy
from textblob import TextBlob

# Set Default Settings

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Load CSV File

In [45]:
# Load CSV file
df_tweets = pd.read_csv('Trump_Obama_Tweets.csv')

# Display the first few rows to understand your data
df_tweets.head()

Unnamed: 0.1,Unnamed: 0,id,user,date,text,favorite_counts,retweet_counts
0,0,1076308614372048897,Donald Trump,2018-12-22 02:49:05,OUR GREAT COUNTRY MUST HAVE BORDER SECURITY! https://t.co/ZGcYygMf3a,61773,17110
1,1,1076270321861312512,Donald Trump,2018-12-22 00:16:55,Wishing Supreme Court Justice Ruth Bader Ginsburg a full and speedy recovery!,66758,10652
2,2,1076256868190834689,Donald Trump,2018-12-21 23:23:27,Some of the many Bills that I am signing in the Oval Office right now. Cancelled my trip on Air Force One to Florida while we wait to see if the Democrats will help us to protect America’s Souther...,64177,15135
3,3,1076239448461987841,Donald Trump,2018-12-21 22:14:14,A design of our Steel Slat Barrier which is totally effective while at the same time beautiful! https://t.co/sGltXh0cu9,82539,21601
4,4,1076204680202670081,Donald Trump,2018-12-21 19:56:05,"Today, it was my honor to sign into law H.R. 7213, the “Countering Weapons of Mass Destruction Act of 2018.” The Act redesignates the @DHSgov Domestic Nuclear Detection Office as the Countering We...",34451,8423


# Understand The Data

In [13]:
# Features types
df_tweets.dtypes

Unnamed: 0          int64
id                  int64
user               object
date               object
text               object
favorite_counts     int64
retweet_counts      int64
dtype: object

In [15]:
# Check null values
df_tweets.isnull().sum()

Unnamed: 0         0
id                 0
user               0
date               0
text               0
favorite_counts    0
retweet_counts     0
dtype: int64

In [36]:
# To avoid displaying scientific numbers like '9.500000e+01'
pd.options.display.float_format = '{:20,.2f}'.format

In [42]:
# Default method will display numeric features only
df_tweets.describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,366.0,182.5,105.8,0.0,91.25,182.5,273.75,365.0
id,366.0,1.0031644307596804e+18,1.0103229142865848e+17,7.891478206954865e+17,9.340947109756682e+17,1.0696550220766186e+18,1.0733045173373318e+18,1.0763086143720488e+18
favorite_counts,366.0,245006.68,422746.32,2342.0,67516.5,97379.0,217377.25,4515657.0
retweet_counts,366.0,54334.59,118084.79,480.0,14639.0,22590.5,38621.5,1666772.0


In [27]:
# Using include=['O'] to display categorical features
df_tweets.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
user,366,2,Donald Trump,192
date,366,362,2018-12-08 14:19:31,2
text,366,366,"Happy Anniversary, @MichelleObama. For 26 years, you’ve been an extraordinary partner, someone who can always make me laugh, and my favorite person to see the world with. https://t.co/s8xoZ9j2YR",1


In [33]:
# Tweets shares
df_tweets.user.value_counts(normalize = True).round(3) * 100 

Donald Trump    52.5
Barak Obama     47.5
Name: user, dtype: float64


# Highlight Observation:

- Unwanted feature 'Unnamed: 0'
- We have 531 entries
- Donald Trump has more share than Obama with 52.5% (192 tweets)
- Feature [date] is not recognised as datetime format
- No null values
- The most liked tweet has 4,515,657 likes and the most retweeted tweet has 1,666,772 retweets
- High varience comparing the max/ min values for both likes and retweets counts. More details in EDA section
- Tweets [text] still raw. Needs to be cleaned up and extract insights/ usefull information out of it.

# Cleaning

1. Remove URLs from tweets
2. Extract 'hashtags' and 'mentions' from tweets
3. Extract additional information from tweets (i.e. number of spaces or ! or ? used.. etc)
4. Remove unwated features

In [46]:
# Remove URLs from the tweets
def re_remove_url(x):
    return re.sub(r'http\S+', '', x)

#Extracts hashtags from tweets
def extract_hashtags(x):
    try:
        hashtags = re.findall(r"#(\w+)", x) # Extract hashtags
        if not hashtags:
            return np.nan
        elif isinstance(hashtags, list): # Check if it's a list object
            return ', '.join(hashtags) # Convert the list to str object
        else:
            return hashtags
    except:
        return np.nan
     
#Extracts mentions from tweets
def extract_mentions(x):
    try:
        mentions = re.findall(r"@(\w+)", x) # Extract mentions
        if not mentions:
            return np.nan
        if isinstance(mentions, list):
            return ', '.join(mentions)
        else:
            return mentions
    except:
        return np.nan

# Add exctracted data in new columns
df_tweets['tweets']   = df_tweets.text.apply(lambda x: re_remove_url(x))
df_tweets['hashtags'] = df_tweets.text.apply(lambda x: extract_hashtags(x))
df_tweets['mentions'] = df_tweets.text.apply(lambda x: extract_mentions(x))

# Drop unwanted columns
df_tweets.drop(['text', 'Unnamed: 0'], axis = 1, inplace = True)

In [47]:
# Extract additional information about the tweets
tweets_length_list           = []
tweets_spaces_list           = []
tweets_uppercase_list        = []
tweets_punctuations_list     = []
tweets_questionmark_list     = []
tweets_exclamation_mark_list = []

def extract_text_details(x):
    tweets_length_list.append(len(x))                                                 # Length of the tweet
    tweets_spaces_list.append(sum([1 for l in x if l.isspace()]))                     # Total number of spaces exists in the tweet
    tweets_uppercase_list.append(sum([1 for l in x if l.isupper()]))                  # Total number of uppercases used in the tweet
    tweets_punctuations_list.append(sum([1 for l in x if l in string.punctuation]))   # Total number of punctuation exists in the tweet
    tweets_questionmark_list.append(x.count('?'))                                     # Total number of question marks in tweet 
    tweets_exclamation_mark_list.append(x.count('!'))                                 # Total number of exclamation marks in tweet

_ = df_tweets.tweets.apply(lambda x: extract_text_details(x)) # Since the function doesnt return values, it returns 'None' by default. Instead of displaying them, I stored them in temp object '_'
del _ # Delete _ object

df_tweets['tweets_length']           = tweets_length_list
df_tweets['tweets_spaces']           = tweets_spaces_list
df_tweets['tweets_uppercase']        = tweets_uppercase_list
df_tweets['tweets_punctuations']     = tweets_punctuations_list
df_tweets['tweets_questionmark']     = tweets_questionmark_list
df_tweets['tweets_exclamation_mark'] = tweets_exclamation_mark_list

# Sentiment Analysis

- Get polarity in percentage (emotions expressed in tweets weather it's positive or negative). Range is [-1.0, 1.0]
- Get subjectivity in percentage (opinions expressed in tweets like personal feelings, views, or beliefs). Range is [0, 1.0]
- Classify tweets [Very Negative, Negative, Neutral, Positive, Very Positive]
- Classify tweets as subjective or objective

In [48]:
# Extract polarity and subjectivity of the tweets 
polarity_list     = []
subjectivity_list = []

def polarity_subjectivity(x):
    analysis = TextBlob(x)
    polarity_list.append(round(analysis.polarity, 2))
    subjectivity_list.append(round(analysis.subjectivity, 2))
    
_ = df_tweets.tweets.apply(lambda x: polarity_subjectivity(x))
del _

df_tweets['polarity']     = polarity_list
df_tweets['subjectivity'] = subjectivity_list

In [61]:
# Very Positive / Positive / Very Negative / Negative / Neutral
def polarity_status(x):
    if x == 0:
        return 'Neutral'
    elif x > 0.00 and x < 0.50:
        return 'Positive'
    elif x >= 0.50:
        return 'Very Positive'
    elif x < 0.00 and x > -0.50:
        return 'Negative'
    elif x <= -0.50:
        return 'Very Negative'
    else:
        return 'Unknown'

# Very Positive / Positive / Very Negative / Negative / Neutral
def subjectivity_status(x):
    if x == 0:
        return 'Very Objective'
    elif x > 0.00 and x < 0.40:
        return 'Objective'
    elif x >= 0.40 and x < 0.70:
        return 'Subjective'
    elif x >= 0.70:
        return 'Very Subjective'

# Extract / Classify polarity and subjectivity
df_tweets['polarity_status'] = df_tweets.polarity.apply(lambda x: polarity_status(x))
df_tweets['subjectivity_status'] = df_tweets.subjectivity.apply(lambda x: subjectivity_status(x))

In [65]:
# Positive / Negative / Neutral numeric
# Very Positive and Positive are going to be ['is_positive']
neutral_list  = []
positive_list = []
negative_list = []

def polarity_status(x):
    if x == 0:
        neutral_list.append(1)
        positive_list.append(0)
        negative_list.append(0)
    elif x > 0.00:
        neutral_list.append(0)
        positive_list.append(1)
        negative_list.append(0)
    elif x < 0.00:
        neutral_list.append(0)
        positive_list.append(0)
        negative_list.append(1)
    
_ = df_tweets.polarity.apply(lambda x: polarity_status(x))
del _

df_tweets['is_neutral']  = neutral_list
df_tweets['is_positive'] = positive_list
df_tweets['is_negative'] = negative_list

In [66]:
# Convert [date] feature type to datetime type inorder to manipulate dates and times 
df_tweets.date = pd.to_datetime(df_tweets.date)

In [68]:
# Extract tweeting times [early, morning, noon, evening, midnight]
early_list    = []
morning_list  = []
noon_list     = []
evening_list  = []
midnight_list = []

def part_of_the_day(x):
    try:
        if x >= 5: 
            early_list.append(1)
            morning_list.append(0)
            noon_list.append(0)
            evening_list.append(0)
            midnight_list.append(0)
            return 'Early Morning'

        elif x >= 8: 
            early_list.append(0)
            morning_list.append(1)
            noon_list.append(0)
            evening_list.append(0)
            midnight_list.append(0)
            return 'Morning'

        elif x >= 12: 
            early_list.append(0)
            morning_list.append(0)
            noon_list.append(1)
            evening_list.append(0)
            midnight_list.append(0)
            return 'Afternoon'

        elif x >= 18: 
            early_list.append(0)
            morning_list.append(0)
            noon_list.append(0)
            evening_list.append(1)
            midnight_list.append(0)
            return 'Evening'

        elif x >= 0 and x < 5:
            early_list.append(0)
            morning_list.append(0)
            noon_list.append(0)
            evening_list.append(0)
            midnight_list.append(1)
            return 'Mid Night'
    except:
        early_list.append(np.nan)
        morning_list.append(np.nan)
        noon_list.append(np.nan)
        evening_list.append(np.nan)
        midnight_list.append(np.nan)
        return np.nan
    
df_tweets['part_of_day'] = df_tweets.date.dt.hour.apply(lambda x: part_of_the_day(x))

df_tweets['is_early']    = early_list
df_tweets['is_morning']  = morning_list
df_tweets['is_noon']     = noon_list
df_tweets['is_evening']  = evening_list
df_tweets['is_midnight'] = midnight_list 

# Language Analysis
- I will use spaCy for this part of analysis
- Extract additional information from the text

In [70]:
# Initiate nlp object. If the below did not run successfully, replace the 'en' with 'en_core_web_sm'
nlp = spacy.load('en')

In [71]:
is_norp_list    = []  # Nationalities or religious or political groups.
is_time_list    = []
is_org_list     = []  # Companies, agencies, institutions, etc.
is_gpe_list     = []  # Countries, cities, states.
is_loc_list     = []  # Non-GPE locations, mountain ranges, bodies of water.
is_product_list = []    
is_workart_list = []  # Titles of books, songs, etc.
is_fac_list     = []  # Buildings, airports, highways, bridges, etc.

is_noun_list    = []  # girl, cat, tree, air, beauty
is_pron_list    = []  # I, you, he, she, myself, themselves, somebody
is_adv_list     = []  # very, tomorrow, down, where, there
is_propn_list   = []  # Mary, John, London, NATO, HBO
is_verb_list    = []   
is_intj_list    = []  # psst, ouch, bravo, hello

def extract_tweet_style(x):
    
    doc = nlp(x)
    
    is_norp_list.append(sum([1 for i in doc.ents if i.label_ == 'NORP']))
    is_time_list.append(sum([1 for i in doc.ents if i.label_ == 'TIME']))
    is_org_list.append(sum([1 for i in doc.ents if i.label_ == 'ORG']))
    is_gpe_list.append(sum([1 for i in doc.ents if i.label_ == 'GPE']))
    is_loc_list.append(sum([1 for i in doc.ents if i.label_ == 'LOC']))
    is_product_list.append(sum([1 for i in doc.ents if i.label_ == 'PRODUCT']))
    is_workart_list.append(sum([1 for i in doc.ents if i.label_ == 'WORK_OF_ART']))
    is_fac_list.append(sum([1 for i in doc.ents if i.label_ == 'FAC']))

    is_noun_list.append((sum([1 for i in doc if i.pos_ == 'NOUN'])))
    is_pron_list.append((sum([1 for i in doc if i.pos_ == 'PRON'])))
    is_adv_list.append((sum([1 for i in doc if i.pos_ == 'ADV'])))
    is_propn_list.append((sum([1 for i in doc if i.pos_ == 'PROPN'])))
    is_verb_list.append((sum([1 for i in doc if i.pos_ == 'VERB'])))
    is_intj_list.append((sum([1 for i in doc if i.pos_ == 'INTJ'])))


_ = df_tweets.tweets.apply(lambda x: extract_tweet_style(x))
del _
                   
df_tweets['is_norp']     = is_norp_list
df_tweets['is_time']     = is_time_list
df_tweets['is_org']      = is_org_list
df_tweets['is_gpe']      = is_gpe_list
df_tweets['is_loc']      = is_loc_list
df_tweets['is_product']  = is_product_list
df_tweets['is_workart']  = is_workart_list
df_tweets['is_fac']      = is_fac_list

df_tweets['is_noun']     = is_noun_list
df_tweets['is_pron']     = is_pron_list
df_tweets['is_adv']      = is_adv_list
df_tweets['is_propn']    = is_propn_list
df_tweets['is_verb']     = is_verb_list
df_tweets['is_intj']     = is_intj_list

In [78]:
# This is the describtion of the new numeric features
df_tweets.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,366.0,1.0031644307596804e+18,1.0103229142865848e+17,7.891478206954865e+17,9.340947109756682e+17,1.0696550220766188e+18,1.0733045173373316e+18,1.0763086143720488e+18
favorite_counts,366.0,245006.68,422746.32,2342.0,67516.5,97379.0,217377.25,4515657.0
retweet_counts,366.0,54334.59,118084.79,480.0,14639.0,22590.5,38621.5,1666772.0
tweets_length,366.0,185.31,80.53,11.0,113.0,196.5,266.0,294.0
tweets_spaces,366.0,31.63,14.22,1.0,19.0,34.0,45.0,57.0
tweets_uppercase,366.0,9.42,8.36,1.0,4.0,7.0,12.0,62.0
tweets_punctuations,366.0,6.2,3.84,1.0,3.0,5.0,8.0,22.0
tweets_questionmark,366.0,0.07,0.32,0.0,0.0,0.0,0.0,3.0
tweets_exclamation_mark,366.0,0.46,0.6,0.0,0.0,0.0,1.0,3.0
polarity,366.0,0.19,0.27,-0.62,0.0,0.15,0.37,1.0


In [82]:
# This is the describtion of the objects (string) features
df_tweets.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
user,366,2,Donald Trump,192
tweets,366,365,Don't get tripped up by misinformation. Join the @OFA Truth Team today:,2
hashtags,35,18,GetCovered,7
mentions,58,41,MichelleObama,7
polarity_status,366,5,Positive,188
subjectivity_status,366,4,Subjective,171
part_of_day,366,2,Early Morning,317


In [83]:
# This is the describtion of the date feature
df_tweets.describe(include=['datetime']).T

Unnamed: 0,count,unique,top,freq,first,last
date,366,362,2018-12-19 01:13:30,2,2016-10-20 16:54:36,2018-12-22 02:49:05


In [None]:
df_tweets.groupby(['user']).agg({    'is_norp': 'sum',
                                     'is_time': 'sum',
                                     'is_org': 'sum',
                                     'is_gpe': 'sum',
                                     'is_loc': 'sum',
                                     'is_product': 'sum',
                                     'is_workart': 'sum',
                                     'is_fac': 'sum',
                                     'is_noun': 'sum',
                                     'is_pron': 'sum',
                                     'is_adv': 'sum',
                                     'is_propn': 'sum',
                                     'is_verb': 'sum',
                                     'is_intj': 'sum'  })

# Export DataFrame as CSV as backup and as Pickle file

In [84]:
# Export the new dataframe into CSV file and Pickle file (to store the dataframe as an object to use it later)

#df_tweets.to_csv('Trump_Obama_Tweets_Detailed.csv')
df_tweets.to_pickle('Trump_Obama_Tweets_Detailed.pickle')