## Exploring Twitter data


### Date: Feb 3 2021

#### Description

This notebook was provided by Carina Albrecht and edited to explore a Twitter dataset containing Tweets sent during the Capitol Hill riot.

## 1 Data exploration
Before exploring the Twitter data, we install a few libraries designed to work with text. You only need to install these libraries once; after that, it is recommended to comment out the install lines.

In [None]:
# textblob is a basic API for common natural language processing (NLP) tasks 
# such as part-of-speech tagging, noun phrase extraction, sentiment analysis, 
# classification, translation, and more. Learn more here https://textblob.readthedocs.io/en/dev/
%pip install textblob

In [None]:
# The wordcloud package will help us to make wordclouds to better display keywords in the discourse
# You can read more on wordcloud here https://pypi.org/project/wordcloud/
%pip install wordcloud

In [None]:
# langdetect is a language detection library ported from Google.
# More information is available here https://pypi.org/project/langdetect/
%pip install langdetect

The code above installs libraries that we import below for our analysis of the Twitter data. The dataset contains tweets sent during the Capitol Hill riot, and these libraries will assist in analyzing them. Remember that a library must be imported before it can be used for analysis or plotting. We import more than what was installed above: pandas (to work with data in a data frame), numpy (to perform basic array and matrix operations), re (to work with regular expressions for strings), string (for common string operations and constants, such as the list of punctuation characters), PIL (to load images), nltk (for sentiment analysis, stop words and stemming), sklearn (for feature extraction and machine learning), and matplotlib as well as plotly (for visualizations).

In [None]:
# import libraries
import pandas as pd
import numpy as np

# Text processing
import re 
from textblob import TextBlob
import string

# Word cloud visualization
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# Machine learning (sentiment analysis)
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from langdetect import detect
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

# This one will be used to help us with lexicon
import nltk

# Other visualization
import matplotlib.pyplot as plt
import plotly.express as px

In [None]:
csv_file = "tcat_Jan6th_Proud_Boys-20210106-20210106------------fullExport--9654fe3ff4.csv"
twitter_data = pd.read_csv(csv_file)

In [None]:
twitter_data.info()

In [None]:
# drop duplicates, inplace allows us to overwrite the data in memory
twitter_data.drop_duplicates(inplace=True)

In [None]:
# check for the first five entries in our data with no duplicates
twitter_data.head()

## 2 Data cleaning

In [None]:
twitter_data.columns

In [None]:
# Let's remove a few columns 
twitter_data = twitter_data.drop(['withheld_copyright', 'withheld_scope', 'truncated', 'lat', 'lng',
                                 'from_user_utcoffset', 'from_user_timezone', 'from_user_lang', 
                                 'from_user_withheld_scope'],axis=1)


In [None]:
# Clean the tweet text with regular expressions.
# A lambda is a small anonymous function; .map() applies it to every value in the column.
# Remove the retweet prefix, e.g. "RT @someletters: "
remove_rt = lambda x: re.sub(r"RT @\w+: ", "", x)
# Replace any remaining @mentions with a space (\w also matches underscores in handles)
rt = lambda x: re.sub(r"@\w+", " ", x)
twitter_data['text'] = twitter_data['text'].map(remove_rt).map(rt)
twitter_data['text'] = twitter_data['text'].str.lower()
twitter_data.head()
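
To see what the two substitutions do, here is a minimal sketch on a single made-up tweet. The sample text and the helper name `strip_mentions` are illustrative assumptions, not part of the dataset or the original notebook.

```python
import re

def strip_mentions(text):
    """Remove a retweet prefix, blank out @mentions, and lowercase the text."""
    text = re.sub(r"RT @\w+: ", "", text)  # drop "RT @handle: "
    text = re.sub(r"@\w+", " ", text)      # replace each remaining @mention with a space
    return text.lower()

sample = "RT @some_user: Thanks @Other_1 for the update"  # hypothetical tweet
cleaned = strip_mentions(sample)
print(repr(cleaned))
```

Each mention is replaced by a space, so the cleaned text can contain runs of spaces; later whitespace-based splitting is not affected by them.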

## 3 Machine learning modelling

Natural language processing (sentiment analysis)


In [None]:
# We are going to create two new columns: polarity and subjectivity
# polarity ranges over [-1, 1]: 1 is a positive statement, -1 is a negative statement
# subjectivity refers to personal opinion, emotion or judgement, whereas objective refers to fact
# subjectivity ranges over [0, 1], where 0 is very objective (fact) and 1 is very subjective (opinion)
twitter_data[['polarity','subjectivity']] = twitter_data['text'].apply(lambda text: pd.Series(TextBlob(text).sentiment))

In [None]:
twitter_data.head(3)

In [None]:
# Computing a score for the text column using SentimentIntensityAnalyzer
# Download the VADER lexicon; this avoids a "lexicon" lookup error
nltk.download('vader_lexicon')
# Create the analyzer once instead of on every iteration
analyzer = SentimentIntensityAnalyzer()
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for index, row in twitter_data['text'].items():
    # compute a score
    score = analyzer.polarity_scores(row)
    # Assign score categories to variables
    neg = score['neg']
    neu = score['neu']
    pos = score['pos']
    comp = score['compound']

    # If negative score (neg) is greater than positive score (pos), then the text should be categorized as "negative"
    if neg > pos:
        twitter_data.loc[index,"sentiment"] = 'negative'
    # If positive score (pos) is greater than the negative score (neg), then the text should be categorized as "positive"
    elif pos > neg:
        twitter_data.loc[index,"sentiment"] = "positive"
    # Otherwise (neg equals pos), categorize the text as "neutral"
    else:
        twitter_data.loc[index,"sentiment"] = "neutral"
    # Store the raw scores for every tweet, not only the neutral ones
    twitter_data.loc[index,'neg'] = neg
    twitter_data.loc[index,'pos'] = pos
    twitter_data.loc[index,'neu'] = neu
    twitter_data.loc[index,'compound'] = comp

In [None]:
# Let's go back to our data set and look specifically at the variable sentiment. What are the unique values?
twitter_data['sentiment'].unique()
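
The labelling loop compares VADER's `neg` and `pos` proportions. A commonly used alternative, suggested in VADER's own documentation, is to threshold the `compound` score at about ±0.05. The sketch below puts both rules side by side on hand-written score dictionaries; the function names and the example scores are assumptions made for illustration, not output from this dataset.

```python
def label_by_proportion(score):
    """Rule used in this notebook: compare the neg and pos proportions."""
    if score["neg"] > score["pos"]:
        return "negative"
    if score["pos"] > score["neg"]:
        return "positive"
    return "neutral"

def label_by_compound(score, threshold=0.05):
    """Alternative rule: threshold the compound score (0.05 is the conventional cut-off)."""
    if score["compound"] >= threshold:
        return "positive"
    if score["compound"] <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores shaped like the dictionaries returned by polarity_scores()
examples = [
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.05, "neu": 0.85, "pos": 0.10, "compound": 0.03},
    {"neg": 0.30, "neu": 0.70, "pos": 0.0, "compound": -0.6},
]
labels = [(label_by_proportion(s), label_by_compound(s)) for s in examples]
for pair in labels:
    print(pair)
```

A text with only a slight positive lean counts as "positive" under the first rule but "neutral" under the second, so the choice of rule can shift the class balance noticeably.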

## 4 Data visualizing 
In this section we want to understand the polarity and subjectivity of the tweets in our sample in a visual format. This will give us the ability to summarize thousands of Tweets in a more meaningful representation.

In [None]:
# Let's take a look at how many are labelled positive, negative or neutral
tw_list_negative = twitter_data[twitter_data['sentiment']=='negative']
tw_list_positive = twitter_data[twitter_data['sentiment']=='positive']
tw_list_neutral = twitter_data[twitter_data['sentiment']=='neutral']

# Let's count how many of these values belong to each category. We will define a function to count values.
def count_values_in_column(data,feature):
    
    total = data.loc[:,feature].value_counts(dropna=False)
    percentage = round(data.loc[:,feature].value_counts(dropna=False,normalize=True)*100,2)
    
    return pd.concat([total,percentage],axis=1, keys=['Total', 'Percentage'])

# Values for sentiment
pc = count_values_in_column(twitter_data, "sentiment")

pc
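
To check what `count_values_in_column` returns, it can be run on a tiny hand-made data frame. The four labels below are invented for the example; only the function body is taken from the cell above.

```python
import pandas as pd

def count_values_in_column(data, feature):
    # Absolute counts per category, including missing values
    total = data.loc[:, feature].value_counts(dropna=False)
    # Relative frequencies as percentages, rounded to two decimals
    percentage = round(data.loc[:, feature].value_counts(dropna=False, normalize=True) * 100, 2)
    return pd.concat([total, percentage], axis=1, keys=["Total", "Percentage"])

# Made-up labels, not taken from the dataset
toy = pd.DataFrame({"sentiment": ["positive", "positive", "negative", "neutral"]})
summary = count_values_in_column(toy, "sentiment")
print(summary)
```

The result has one row per category, with the count in `Total` and the share of all rows in `Percentage`.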

In [None]:
# Create a piechart
names = pc.index
size = pc['Percentage']
my_circle = plt.Circle((0,0), 0.7, color='white')
plt.pie(size, labels=names,colors=['blue','red','green'])
p = plt.gcf()
p.gca().add_artist(my_circle)
plt.show()

In [None]:
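# Function to create a word cloud
def create_wordcloud(text):
    # cloud.jpeg must be in the working directory; its shape is used as the mask
    mask = np.array(Image.open("cloud.jpeg"))
    stopwords = set(STOPWORDS)
    wc = WordCloud(background_color="white",
                   mask=mask,
                   max_words=3000,
                   stopwords=stopwords,
                   repeat=True)
    # Join the tweets into one string; str() on a large array abbreviates it with "..."
    wc.generate(" ".join(str(t) for t in text))
    wc.to_file("wc.png")
    print("Word Cloud Saved Successfully")
    path = "wc.png"
    display(Image.open(path))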

In [None]:
#Creating wordcloud for all tweets
create_wordcloud(twitter_data["text"].values)

In [None]:
#Creating wordcloud for positive sentiment
create_wordcloud(tw_list_positive["text"].values)

In [None]:
#Creating wordcloud for negative sentiment
create_wordcloud(tw_list_negative["text"].values)

In [None]:
#Calculating tweet’s length and word count
twitter_data['text_len'] = twitter_data['text'].astype(str).apply(len)
twitter_data['text_word_count'] = twitter_data['text'].apply(lambda x: len(str(x).split()))
round(pd.DataFrame(twitter_data.groupby("sentiment").text_len.mean()),2)

In [None]:
round(pd.DataFrame(twitter_data.groupby("sentiment").text_word_count.mean()),2)

In [None]:
nltk.download('stopwords')

In [None]:
#Removing Punctuation
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub(r'[0-9]+', '', text)  # remove digits
    return text
twitter_data['punct'] = twitter_data['text'].apply(lambda x: remove_punct(x))

# Applying tokenization: splitting a phrase, sentence, paragraph, or an entire text document into smaller units
def tokenization(text):
    text = re.split(r'\W+', text)
    return text
twitter_data['tokenized'] = twitter_data['punct'].apply(lambda x: tokenization(x.lower()))


#Removing stopwords
stopword = nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text
    
twitter_data['nonstop'] = twitter_data['tokenized'].apply(lambda x: remove_stopwords(x))

#Applying Stemmer
ps = nltk.PorterStemmer()
def stemming(text):
    text = [ps.stem(word) for word in text]
    return text
twitter_data['stemmed'] = twitter_data['nonstop'].apply(lambda x: stemming(x))

# Cleaning text: the steps above combined into one function for CountVectorizer
def clean_text(text):
    text_lc = "".join([word.lower() for word in text if word not in string.punctuation]) # remove punctuation
    text_rc = re.sub(r'[0-9]+', '', text_lc)  # remove digits
    tokens = re.split(r'\W+', text_rc)    # tokenization
    text = [ps.stem(word) for word in tokens if word not in stopword]  # remove stopwords and stemming
    return text
twitter_data.head()
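
The cell above chains four steps: strip punctuation and digits, tokenize, drop stop words, and stem. Here is a minimal sketch of the first three steps on one made-up sentence, using only the standard library. The small `STOP` set stands in for nltk's English stop-word list and stemming is left out; both are simplifications for illustration.

```python
import re
import string

STOP = {"the", "at", "on", "a", "is"}  # hypothetical stand-in for nltk's stop-word list

def preprocess(text):
    """Strip punctuation and digits, split into tokens, drop stop words."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    no_digits = re.sub(r"[0-9]+", "", no_punct)
    # re.split can return empty strings at the ends of the text; discard them
    tokens = [t for t in re.split(r"\W+", no_digits.lower()) if t]
    return [t for t in tokens if t not in STOP]

tokens = preprocess("Crowds gathered at the Capitol on Jan 6, 2021!")
print(tokens)
```

Unlike the notebook's `tokenization` function, this sketch drops the empty strings that `re.split` can produce, which otherwise show up as a token of their own in later counts.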

In [None]:
# Applying CountVectorizer
countVectorizer = CountVectorizer(analyzer=clean_text) 
countVector = countVectorizer.fit_transform(twitter_data['text'])
print('{} tweets contain {} unique terms'.format(countVector.shape[0], countVector.shape[1]))
#print(countVectorizer.get_feature_names_out())

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
count_vect_df = pd.DataFrame(countVector.toarray(), columns=countVectorizer.get_feature_names_out())

In [None]:
# Most Used Words
count = pd.DataFrame(count_vect_df.sum(),columns=["Value"])
countdf = count.sort_values("Value",ascending=False).head(20)

px.bar(countdf[1:],x=countdf.index[1:],y="Value")
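
Conceptually, `CountVectorizer` builds one row per tweet and one column per term, and the bar chart plots the column sums. The same totals can be sketched with `collections.Counter` on a few made-up token lists. The documents below are hypothetical, and this illustrates the idea rather than how scikit-learn implements it.

```python
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ["capitol", "riot", "capitol"],
    ["national", "guard", "capitol"],
    ["riot", "guard"],
]

per_doc = [Counter(tokens) for tokens in docs]  # one bag of words per document
totals = sum(per_doc, Counter())                # equivalent to summing each column
top = totals.most_common(2)                     # the most frequent terms overall
print(top)
```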


## 5 Conclusions

Refer back to the first table in Section 4: the majority of tweets (86.89%) were labelled as positive, whereas only 12.37% were labelled as negative. The word clouds share similar patterns, with prominent display of terms like "proud boys", "capitol" and "national guard". The most frequent terms include "boy", "stay", "trump" and "people". Why are the tweets so overwhelmingly positive?
