# Assignment 3: Trump, Twitter, and Text

Due Date: 11:59pm Monday, February 14, 2022   

#### Welcome to the third homework assignment of Data 200! In this assignment, we will work with Twitter data in order to analyze Donald Trump's tweets.

#### Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names below.

**Collaborators: list collaborators here**

In [None]:
# Run this cell to set up your notebook
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import zipfile
import seaborn as sns
from IPython.display import display, Latex, Markdown
import re

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)

### Before we start

All data is made from real-world phenomena, be it the movement of the planets, animal behavior, or human bodies and activities. Working with data always has a bearing back on how human beings know and act in the world. The dataset that you're about to work with in this homework consists of a compilation of President Trump's Tweets. It's important to acknowledge that these Tweets are more than just data -- they're the means by which the President expresses his opinions, performs public and foreign policy, and shapes the lives of people in the US and all over the world. More fundamentally, these Tweets are a powerful form of speech that is particularly significant on the eve of the 2020 US Presidential Election. We recognize that working with this data now, even in the context of a technical exercise, is not a neutral activity and may create difficult feelings in students. We encourage you to observe what you may be experiencing, and we invite you to consider these dimensions of data science work alongside your technical lessons. We're glad to discuss these issues together in section.

### Disclaimer about sns.distplot()
This homework was designed for a slightly older version of seaborn, which does not support the new displot method. Instead, in this homework we will heavily rely on distplot (with a t). As you may have noticed in lab 5, use of the distplot function triggers a deprecation warning to notify the user that they should replace all deprecated functions with the updated version. Generally, warnings should not be suppressed but we will do so in this assignment to avoid cluttering.

See the seaborn documentation on [distributions](https://seaborn.pydata.org/tutorial/distributions.html) and [functions](https://seaborn.pydata.org/tutorial/function_overview.html) for more details.

In [None]:
# Run this cell to suppress FutureWarnings, such as the distplot deprecation notice
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
# Import the dataset
tweets = pd.read_csv("tweets_01-08-2021.csv")
tweets.head()

### Question 0

There are many ways we could choose to read the President’s tweets. Why might someone be interested in doing data analysis on the President’s tweets? Name a kind of person or institution which might be interested in this kind of analysis. Then, give two reasons why a data analysis of the President's tweets might be interesting or useful for them. Answer in 2-3 sentences.

**Type your answer here, replacing this text.**

### Question 1

Construct a DataFrame called trump containing data from all the tweets stored in the tweets DataFrame loaded above. The index of the DataFrame should be the ID of each tweet (which looks something like 907698529606541312). It should have these columns:

- date: The time the tweet was created, encoded as a datetime object.
- device: The source device of the tweet.
- text: The text of the tweet.
- retweet: The retweet count of the tweet.

**Finally, the resulting DataFrame should be sorted by the index.**

Hint: You might want to explicitly specify the columns and indices using pd.DataFrame().
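
As a rough illustration of the hint, the sketch below builds a small DataFrame with an explicit index and column list, converts a string column to datetimes, and sorts by the index. The records and the names raw, created, message, and toy are made up for this example and are not the columns of the homework dataset.

```python
import pandas as pd

# Hypothetical raw records; these field names are assumptions for illustration only.
raw = pd.DataFrame({
    "id": [30, 10, 20],
    "created": ["2020-01-03 08:15:00", "2020-01-01 12:00:00", "2020-01-02 23:59:59"],
    "message": ["third", "first", "second"],
})

# Specify the data, the index, and the column order explicitly.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(raw["created"]).values,  # strings -> datetimes
        "text": raw["message"].values,
    },
    index=raw["id"].values,
    columns=["date", "text"],
).sort_index()  # sort rows by the index

print(toy)
```

Passing plain arrays (via .values) avoids pandas trying to align the new columns on the old row labels.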

In [None]:
trump = ...
trump.head()

### Question 2 Tweet source analysis 

In the following questions, we are going to find out the characteristics of Trump's tweets and the devices used to send them.

First, let's examine the device column:

In [None]:
trump['device'].unique()

In the following plot, we see that there are two device types that are more commonly used than others.

In [None]:
plt.figure(figsize=(8, 6))
trump['device'].value_counts().plot(kind="bar")
plt.xlabel('device')
plt.ylabel("Number of Tweets")
plt.title("Number of Tweets by Source");

#### Question 2 (a) 

Let's extract the rows with the two major devices, "Twitter for iPhone" and "Twitter for Android", and then replace "Twitter for iPhone" with "iPhone" and "Twitter for Android" with "Android".

In [None]:
trump_device = ...
trump_device.head()

#### Question 2 (b) 

Parse the date column into six new columns representing "year", "month", "day", "hour", "minute" and "second".
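
One possible way to pull components out of a datetime column is the pandas .dt accessor. The sketch below runs on a made-up two-row DataFrame called toy (a hypothetical name), not on the homework data.

```python
import pandas as pd

# Toy data: two made-up timestamps.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2016-11-08 21:30:05", "2019-03-01 06:02:59"])}
)

# Each datetime component is available as an attribute of the .dt accessor.
for part in ["year", "month", "day", "hour", "minute", "second"]:
    toy[part] = getattr(toy["date"].dt, part)

print(toy)
```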

In [None]:
...

#### Question 2 (c) 

Overlay the distributions of Trump's 2 most frequently used web technologies over the years. 

In [None]:
...

#### Question 2 (d) 

Please comment on the pattern(s) from the plot in Question 2 (c).

**Type your answer here, replacing this text.**

#### Question 2 (e) 

Is there a difference between Trump's tweet behavior across these devices? Draw the distribution over hours of the day that Trump tweets on each device for the 2 most commonly used devices. 

In [None]:
...

#### Question 2 (f) 

According to this [Verge article](https://www.theverge.com/2017/3/29/15103504/donald-trump-iphone-using-switched-android), Donald Trump switched from an Android to an iPhone sometime in March 2017.

Let's see if this information significantly changes our plot. Create a figure similar to your figure from question 2(e), but this time, only use tweets that were tweeted before 2017.

In [None]:
...

#### Question 2 (g) 

During the campaign, it was theorized that Donald Trump's tweets from Android devices were written by him personally, and the tweets from iPhones were from his staff. Does your figure give support to this theory? What kinds of additional analysis could help support or reject this claim?

**Type your answer here, replacing this text.**

## Part II Sentiment Analysis 

It turns out that we can use the words in Trump's tweets to calculate a measure of the sentiment of the tweet. For example, the sentence "I love America!" has positive sentiment, whereas the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media which is great for our usage.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [45]:
print(''.join(open("vader_lexicon.txt").readlines()[:10]))

$:	-1.5	0.80623	[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]
%)	-0.4	1.0198	[-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]
%-)	-1.5	1.43178	[-2, 0, -2, -2, -1, 2, -2, -3, -2, -3]
&-:	-0.4	1.42829	[-3, -1, 0, 0, -1, -1, -1, 2, -1, 2]
&:	-0.7	0.64031	[0, -1, -1, -1, 1, -1, -1, -1, -1, -1]
( '}{' )	1.6	0.66332	[1, 2, 2, 1, 1, 2, 2, 1, 3, 1]
(%	-0.9	0.9434	[0, 0, 1, -1, -1, -1, -2, -2, -1, -2]
('-:	2.2	1.16619	[4, 1, 4, 3, 1, 2, 3, 1, 2, 1]
(':	2.3	0.9	[1, 3, 3, 2, 2, 4, 2, 3, 1, 2]
((-:	2.1	0.53852	[2, 2, 2, 1, 2, 3, 2, 2, 3, 2]



As you can see, the lexicon contains emoticons too! Each row contains a word and the polarity of that word, measuring how positive or negative the word is.

#### Question 3
The creators of VADER describe the tool’s assessment of polarity, or “compound score,” in the following way:
“The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.”

As you can see, VADER doesn't "read" sentences, but works by parsing sentences into words and assigning a preset generalized score from their testing sets to each word separately.

VADER relies on humans to stabilize its scoring. The creators use Amazon Mechanical Turk, a crowdsourcing survey platform, to train its model. Its training set of data consists of a small corpus of tweets, New York Times editorials and news articles, Rotten Tomatoes reviews, and Amazon product reviews, tokenized using the natural language toolkit (NLTK). Each word in each dataset was reviewed and rated by at least 20 trained individuals who had signed up to work on these tasks through Mechanical Turk.

#### Question 3 (a) 

Please score the sentiment of one of the following words:

- police
- order
- Democrat
- Republican
- gun
- dog
- technology
- TikTok
- security
- face-mask
- science
- climate change
- vaccine

What score did you give it and why? Can you think of a situation in which this word would carry the opposite sentiment to the one you’ve just assigned?

**Type your answer here, replacing this text.**


#### Question 3 (b) 

VADER aggregates the sentiment of words in order to determine the overall sentiment of a sentence, and further aggregates sentences to assign just one aggregated score to a whole tweet or collection of tweets. This is a complex process; if you'd like to learn more about how VADER aggregates sentiment, you can read the VADER README linked above.

Are there circumstances (e.g. certain kinds of language or data) when you might not want to use VADER? What features of human speech might VADER misrepresent or fail to capture?

####  Question 3 (c) 
Read **vader_lexicon.txt** into a DataFrame called sent. The index of the DataFrame should be the words in the lexicon. sent should have one column named polarity, storing the polarity of each word.

Hint: The pd.read_csv function may help here. Since the file is tab-separated, be sure to set sep='\t' in your call to pd.read_csv.
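
As a sketch of reading tab-separated text with no header row, the example below parses two lines copied from the lexicon preview above out of an in-memory string rather than the real file. The column names token, std, and raw_scores, and the variable toy_lexicon, are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Two rows copied from the lexicon preview shown earlier, joined by tabs.
sample = (
    "$:\t-1.5\t0.80623\t[-1, -1, -1, -1, -3, -1, -3, -1, -2, -1]\n"
    "%)\t-0.4\t1.0198\t[-1, 0, -1, 0, 0, -2, -1, 2, -1, 0]\n"
)

toy_lexicon = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                                         # tab-separated fields
    header=None,                                      # the text has no header row
    names=["token", "polarity", "std", "raw_scores"], # hypothetical column names
    usecols=["token", "polarity"],                    # keep only the columns we need
    index_col="token",
)

print(toy_lexicon)
```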

In [None]:
sent = ...
sent.head()

Now, let's get rid of punctuation since it will cause us to fail to match words. Create a new column called **no_punc** in the trump DataFrame to be the lowercased text of each tweet with all punctuation replaced by a single space. We consider punctuation characters to be any character that isn't a Unicode word character or a whitespace character. You may want to consult the Python documentation on regex for this problem.

(Why don't we simply remove punctuation instead of replacing with a space? See if you can figure this out by looking at the tweet data.)
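
To compare the two approaches on a small scale, the sketch below applies one possible pattern for "not a word character and not whitespace" to a made-up string, once deleting matches and once replacing them with a space. The names toy_punct_re, removed, and spaced are hypothetical, and the pattern is only one reading of the definition above.

```python
import re

# One possible pattern: any character that is neither a word character nor whitespace.
toy_punct_re = r"[^\w\s]"

text = "Great rally,thank you!#MAGA"  # made-up example text

removed = re.sub(toy_punct_re, "", text.lower())   # delete punctuation
spaced = re.sub(toy_punct_re, " ", text.lower())   # replace punctuation with a space

print(removed.split())
print(spaced.split())
```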


In [None]:
# Save your regex in punct_re
punct_re = r''
trump['no_punc'] = ...

#### Question 3 (d) 

Now, let's convert the tweets into what's called a tidy format to make the sentiments easier to calculate. Use the no_punc column of trump to create a table called tidy_format. The index of the table should be the IDs of the tweets, repeated once for every word in the tweet. 

It has two columns:

1. num: The location of the word in the tweet. For example, if the tweet was "i love america", then the location of the word "i" is 0, "love" is 1, and "america" is 2.
2. word: The individual words of each tweet.

As usual, try to avoid using any for loops. Our solution uses a chain of 5 methods on the trump DataFrame, albeit using some rather advanced Pandas hacking.

- **Hint 1:** Try looking at the expand argument to pandas' str.split.
- **Hint 2:** Try looking at the stack() method.
- **Hint 3:** Try looking at the level parameter of the reset_index method.    
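
The three hints can be combined roughly as in the sketch below, which runs on a made-up two-tweet DataFrame. The names toy and toy_tidy are hypothetical, the explicit dropna() is an assumption added so the result does not depend on how a given pandas version handles missing values in stack(), and this is not necessarily the chain used in the reference solution.

```python
import pandas as pd

# Toy stand-in for the no_punc column, indexed by made-up tweet IDs.
toy = pd.DataFrame({"no_punc": ["i love america", "sad"]}, index=[101, 102])

toy_tidy = (
    toy["no_punc"]
    .str.split(expand=True)   # one column per word position: 0, 1, 2, ...
    .stack()                  # long format keyed by (tweet ID, word position)
    .dropna()                 # drop padding for tweets shorter than the longest one
    .reset_index(level=1)     # move the word position out of the index
    .rename(columns={"level_1": "num", 0: "word"})
)

print(toy_tidy)
```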
    

In [None]:
tidy_format = ...

#### Question 3 (e) 

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. Add a polarity column to the trump table. The polarity column should contain the sum of the sentiment polarity of each word in the text of the tweet.

**Hints:**
- You will need to merge the tidy_format and sent tables and group the final answer.
- If certain words are not found in the sent table, set their polarities to 0.
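
A minimal sketch of the merge-then-group idea on toy tables is shown below. The polarity values and the names toy_tidy, toy_lexicon, and toy_scores are made up for illustration, and DataFrame.join with on= is used here as one way to express a left merge against the lexicon's index.

```python
import pandas as pd

# Toy tidy table: one row per word, indexed by made-up tweet IDs.
toy_tidy = pd.DataFrame(
    {"num": [0, 1, 2, 0, 0], "word": ["i", "love", "america", "sad", "hello"]},
    index=[101, 101, 101, 102, 103],
)

# Toy lexicon with made-up polarity values, indexed by word.
toy_lexicon = pd.DataFrame({"polarity": [3.2, -2.1]}, index=["love", "sad"])

toy_scores = (
    toy_tidy.join(toy_lexicon, on="word")  # left join: keep every word of every tweet
    .fillna({"polarity": 0})               # words missing from the lexicon score 0
    .groupby(level=0)["polarity"]          # group rows by tweet ID
    .sum()
)

print(toy_scores)
```

A left join matters here: with an inner join, tweets whose words never appear in the lexicon (like tweet 103 above) would vanish instead of scoring 0.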

In [None]:
trump['polarity'] = ...

Now we have a measure of the sentiment of each of his tweets! Note that this calculation is rather basic; you can read over the VADER readme to understand a more robust sentiment analysis.

Now, run the cells below to see the most positive and most negative tweets from Trump in your dataset:

In [None]:
print('Most negative tweets:')
for t in trump.sort_values('polarity').head()['text']:
    print('\n  ', t)

In [None]:
print('Most positive tweets:')
for t in trump.sort_values('polarity', ascending=False).head()['text']:
    print('\n  ', t)

#### Question 3 (f)

Read the 5 most positive and 5 most negative tweets. Do you think these tweets are accurately represented by their polarity scores?

**Type your answer here, replacing this text.**

#### Question 4 

Now, let's try looking at the distributions of sentiments for tweets containing certain keywords.

#### Question 4 (a) 

In the cell below, create a single plot showing both the distribution of tweet sentiments for tweets containing *nytimes*, as well as the distribution of tweet sentiments for tweets containing *fox*.

Be sure to label your axes and provide a title and legend. Be sure to use different colors for fox and nytimes.

In [None]:
# Write your code below

#### Question 4 (b) 

Comment on what you observe in the plot above. Can you find another pair of keywords that lead to interesting plots? Describe what makes the plots interesting. 

**Type your answer here, replacing this text.**

#### Question 5 

Now, let's see whether there's a difference in sentiment for tweets with hashtags and those without.

#### Question 5 (a) 

First, we'll need to write some regex that can detect whether a tweet is a retweet and whether it contains a hashtag or a link. We say that:

- A tweet is a retweet if it contains the string 'rt' anywhere in the tweet, preceded and followed by a non-word character (the start and end of the string count as non-word characters).
- A tweet has a hashtag if it has the character '#' anywhere in the tweet followed by a letter.
- A tweet contains a link or a picture if it has http anywhere in the tweet.

(You can check out Trump's Twitter for why these criteria are true.)

In the cell below, assign rt_re to a regex pattern that identifies retweets and hash_link_re to a regex pattern that identifies tweets with hashtags or links.

**Hints:**
- Be sure to precede your regex pattern with r to make it a raw string (Ex: r'pattern'). To find out more, you can read the first paragraph of the documentation.
- You may find using regex word boundaries helpful for one of your patterns.
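
For readers unfamiliar with word boundaries, the sketch below shows how \b behaves on an unrelated word, using only Python's re module. The pattern and sample strings are made up and are not the homework's patterns.

```python
import re

# \b matches the empty position between a word character and a non-word
# character, or at the start/end of the string.
word_re = r"\bcat\b"

samples = ["cat", "the cat sat", "concatenate", "cat!"]
matches = [bool(re.search(word_re, s)) for s in samples]

for s, m in zip(samples, matches):
    print(repr(s), m)
```

Without the raw-string prefix, Python would read "\b" as a backspace character rather than a regex word boundary.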

In [None]:
rt_re = ...
hash_link_re = ...

#### Question 5 (b) 

Let's see whether there's a difference in sentiments for tweets with hashtags/links and those without.

Note: You may see a UserWarning when running the cell below. For the purposes of this homework, you can ignore it.

Run the cell below to see a distribution of tweet sentiments based on whether a tweet contains a hashtag or link.

In [None]:
sns.distplot(trump[trump['text'].str.contains(hash_link_re)]['polarity'],label='hashtag or link');
sns.distplot(trump[~trump['text'].str.contains(hash_link_re)]['polarity'],label='no hashtag or link');
plt.xlim(-10, 10);
plt.ylim(0, 0.4);
plt.title('Distribution of Tweet Polarities (hashtag/link vs none)');
plt.legend();

What do you notice about the distributions? Answer in 1-2 sentences.

**Type your answer here, replacing this text.**

## Congratulations! You have finished Assignment 3! 