# Analyzing Twitter Data with Python

Outline:
- Credentials and Authentication
- Collecting and Streaming Data
- Loading Tweets
- Analyzing Data

The `tweepy` module will be used to collect Twitter data with the Streaming API.

In [2]:
# !pip install tweepy

In [3]:
from tweepy import OAuthHandler, API
# import tweepy

import json

## Credentials

First, we need to create a Twitter account, validate it, and then create a Twitter developer account. The developer account can be created on the [Twitter developer](https://developer.twitter.com/en/apps) web page.

Second, we create an app to generate a _Consumer Key_, a _Consumer Secret_, an _Access Token_, and an _Access Token Secret_.

The steps to generate keys are as follows:
- Create a Twitter account, validate it with a phone number,
- Create a Twitter developer account,
- Create an app,
- Generate keys and tokens.

The app's API keys should be kept secure. In particular, do not commit API keys or access tokens to publicly accessible version control hosts such as GitHub or Bitbucket.

The Twitter credentials (keys and tokens) can be kept locally as a `.json` file in the following format.

In [None]:
{"consumer_key":"API key",
 "consumer_secret":"API secret key",
 "access_token_key":"Access token",
 "access_token_secret":"Access token secret"
}

In [5]:
# Load Twitter app information
with open('twitter_cred.json','r') as file:
    twitter_cred = json.load(file)

In [6]:
consumer_key = twitter_cred['consumer_key']
consumer_secret = twitter_cred['consumer_secret']
access_token = twitter_cred['access_token_key']
access_token_secret = twitter_cred['access_token_secret']
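
A missing or misspelled key in the credentials file only shows up as a `KeyError` when it is first accessed. As a minimal sketch, the hypothetical helper below (`load_credentials` is an illustration, not part of `tweepy`) loads the file and checks up front that the four keys used in this notebook are present.

```python
import json

# Key names follow the credentials file format used in this notebook.
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token_key", "access_token_secret")


def load_credentials(path):
    """Load Twitter credentials from a JSON file and check required keys.

    Hypothetical helper for illustration; not part of tweepy.
    """
    with open(path, "r") as file:
        cred = json.load(file)

    # Report every missing key at once instead of failing on the first one
    missing = [key for key in REQUIRED_KEYS if key not in cred]
    if missing:
        raise KeyError("Missing credential keys: " + ", ".join(missing))
    return cred
```

With this helper, the loading cell could be reduced to `twitter_cred = load_credentials('twitter_cred.json')`.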

The `tweepy` library needs these keys and tokens to authenticate with Twitter.

In [7]:
# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

We can print the username to see if our account is properly authenticated.

In [None]:
user = api.me()
print(user.name)
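
Note that `api.me()` belongs to the tweepy 3.x interface used in this notebook and, to our knowledge, is no longer available in tweepy 4. A sketch that relies on `verify_credentials()` instead, which exists in both major versions, could look like the following. The helper name `authenticated_name` is hypothetical.

```python
def authenticated_name(api):
    """Return the display name of the authenticated account.

    `api` is expected to be a tweepy API object. Hypothetical helper
    for illustration; not part of tweepy.
    """
    user = api.verify_credentials()
    # tweepy 3.x may return False when the credentials are rejected;
    # later versions raise an exception instead.
    if not user:
        raise RuntimeError("Twitter rejected the credentials")
    return user.name
```

Calling `print(authenticated_name(api))` should then print the same name as the cell above.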

## Collecting Data

The `tweepy.API` class is a wrapper for the Twitter API. Its methods are grouped as follows:
 - Timeline Methods,
 - Status Methods,
 - User Methods,
 - Direct Message Methods,
 - Friendship Methods,
 - Account Methods,
 - Favorite Methods,
 - Block Methods,
 - Saved Searches Methods, 
 - Spam Reporting Methods,
 - Help Methods,
 - List Methods,
 - Trends Methods,
 - Geo Methods.

[Source: tweepy](https://tweepy.readthedocs.io/en/v3.5.0/api.html#api-reference)

### Single Tweet

A **status** is a _tweet_ with various attributes such as _created at_, _id_, and _text_. (See [Twitter](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object) for the full list of attributes.)

For example, the status below is [the most retweeted tweet](https://mashable.com/article/most-retweeted-tweet-billionaire/?europe=true) in English.

In [9]:
# Most retweeted tweet
status = api.get_status('849813577770778624')

In [37]:
type(status)

tweepy.models.Status

In [34]:
# Print first 500 characters of the Status.
str(status)[:500]

"Status(_api=<tweepy.api.API object at 0x103326550>, _json={'created_at': 'Thu Apr 06 02:38:40 +0000 2017', 'id': 849813577770778624, 'id_str': '849813577770778624', 'text': 'HELP ME PLEASE. A MAN NEEDS HIS NUGGS https://t.co/4SrfHmEMo3', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 849813572351737856, 'id_str': '849813572351737856', 'indices': [38, 61], 'media_url': 'http://pbs.twimg.com/media/C8sk8QlUwAAR3qI.jpg', 'media_url_h"

We can convert the status object to a JSON-formatted string. The `indent` parameter makes the result easier to read.

In [38]:
# Status to json object (encoding)
tweet_json = json.dumps(status._json, indent=2)
print(tweet_json)

{
  "created_at": "Thu Apr 06 02:38:40 +0000 2017",
  "id": 849813577770778624,
  "id_str": "849813577770778624",
  "text": "HELP ME PLEASE. A MAN NEEDS HIS NUGGS https://t.co/4SrfHmEMo3",
  "truncated": false,
  "entities": {
    "hashtags": [],
    "symbols": [],
    "user_mentions": [],
    "urls": [],
    "media": [
      {
        "id": 849813572351737856,
        "id_str": "849813572351737856",
        "indices": [
          38,
          61
        ],
        "media_url": "http://pbs.twimg.com/media/C8sk8QlUwAAR3qI.jpg",
        "media_url_https": "https://pbs.twimg.com/media/C8sk8QlUwAAR3qI.jpg",
        "url": "https://t.co/4SrfHmEMo3",
        "display_url": "pic.twitter.com/4SrfHmEMo3",
        "expanded_url": "https://twitter.com/carterjwm/status/849813577770778624/photo/1",
        "type": "photo",
        "sizes": {
          "small": {
            "w": 382,
            "h": 680,
            "resize": "fit"
          },
          "thumb": {
            "w": 150,
         

In [39]:
type(tweet_json), type(status)

(str, tweepy.models.Status)
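
To see what `indent` changes without the long tweet output, the sketch below encodes a small made-up record (not a real tweet) both ways. Either way, `json.dumps` returns a plain `str`, which matches the type shown above.

```python
import json

# A small made-up record, not a real tweet.
record = {"id": 1, "text": "hello", "entities": {"hashtags": []}}

compact = json.dumps(record)            # everything on a single line
pretty = json.dumps(record, indent=2)   # one key per line, nested by 2 spaces

print(compact)
print(pretty)

# Both strings decode back to the same Python object.
assert json.loads(compact) == json.loads(pretty) == record
```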

In [40]:
status.text

'HELP ME PLEASE. A MAN NEEDS HIS NUGGS https://t.co/4SrfHmEMo3'

In [41]:
status.user.name

'Carter Wilkerson'

In [42]:
status.created_at

datetime.datetime(2017, 4, 6, 2, 38, 40)
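
`tweepy` has already parsed `created_at` into a `datetime` here, whereas the raw JSON stores it as a string such as `'Thu Apr 06 02:38:40 +0000 2017'`. When working with the raw JSON directly, the string can be parsed with the standard library. The sketch below assumes an English locale for the day and month names; unlike the value above, the result is timezone-aware.

```python
from datetime import datetime

# Format of the `created_at` string in the raw tweet JSON shown above.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

created_at = datetime.strptime("Thu Apr 06 02:38:40 +0000 2017",
                               TWITTER_TIME_FORMAT)
print(created_at.isoformat())  # 2017-04-06T02:38:40+00:00
```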

### Trends

In [43]:
# Get the available WOEID of a location 
#api.trends_available()

# Worldwide trends (WOEID=1)
trends_ww = api.trends_place(1)

# name of the first top trend worldwide
trends_ww[0]['trends'][0]['name']

'#TwitterBestFandom'
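
The indexing above reflects how the response is nested: a list with one entry per location, each holding its trends under the `'trends'` key. As a sketch, a hypothetical helper `top_trend_names` can collect several trend names at once. The sample response is made up and only mimics that nesting.

```python
def top_trend_names(trends_response, n=5):
    """Return the names of the first `n` trends in a trends response."""
    # The response is a list with one entry per requested location;
    # each entry keeps its trends under the 'trends' key.
    trends = trends_response[0]["trends"]
    return [trend["name"] for trend in trends[:n]]


# Made-up sample with the same nesting as the response used above.
sample_response = [{
    "trends": [
        {"name": "#TwitterBestFandom", "tweet_volume": None},
        {"name": "#python", "tweet_volume": 12345},
    ],
    "locations": [{"name": "Worldwide", "woeid": 1}],
}]

print(top_trend_names(sample_response, n=2))
```

With a live connection, `top_trend_names(trends_ww)` would be the equivalent call.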

### Streaming API

The Streaming API allows us to collect real-time Twitter data based on either a sample or keyword filtering.
>Using the streaming api has three steps.
 - Create a class inheriting from StreamListener,
 - Using that class create a Stream object,
 - Connect to the Twitter API using the Stream. [Source: tweepy](https://tweepy.readthedocs.io/en/v3.5.0/streaming_how_to.html)

The code below runs `SListener.py`, which defines the stream listener class. [Main Source](https://github.com/SocialDataAnalytics-Winter2018/lab04/blob/master/slistener.py)

In [45]:
%run SListener.py

Alternatively, we can load the file into the cell with the `%load` magic command. (`%load?` for more info)

In [81]:
# Print SListener.py
# %load SListener.py
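
`SListener.py` is not reproduced in this notebook. As a rough sketch of the first step listed above, a listener for tweepy 3.x might look like the following. `TimedFileListener`, its `out_path` parameter, and the fallback base class are assumptions made for illustration; this is not the actual `SListener` implementation.

```python
import time

try:
    from tweepy import StreamListener  # available in tweepy 3.x
except ImportError:
    # Placeholder so the sketch still runs where tweepy 3.x is not installed.
    StreamListener = object


class TimedFileListener(StreamListener):
    """Append each raw tweet to a file and stop after `time_limit` seconds.

    Hypothetical stand-in for illustration; not the SListener class.
    """

    def __init__(self, api=None, time_limit=60, out_path="tweets.json"):
        # Only forward `api` when the real tweepy base class is in use.
        if StreamListener is not object:
            super().__init__(api)
        self.start_time = time.time()
        self.time_limit = time_limit
        self.out_path = out_path

    def on_data(self, raw_data):
        # Returning False tells the stream to stop once the limit has passed.
        if time.time() - self.start_time >= self.time_limit:
            return False
        # Store one JSON document per line.
        with open(self.out_path, "a") as out:
            out.write(raw_data.strip() + "\n")
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream instead of retrying.
        return False
```

Writing one tweet per line keeps the output easy to read back line by line later on.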

In [78]:
from tweepy import Stream

# Instantiate the SListener object, specify a time limit to listen (in seconds) 
listen = SListener(api, time_limit=60)

# Instantiate the Stream object
stream = Stream(auth, listen)

There are various _streams_ available through Tweepy. We'll use _filter_ in this notebook.

In [79]:
# Set up keywords to track
keywords = ['datascience', 'python']

# Begin collecting data
stream.filter(track=keywords)

## Loading Tweets

Tweets are collected from the Streaming API in **JSON** (JavaScript Object Notation) format, a text format for structured data. We need to convert this text into a Python data structure to work on it.
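
As a small illustration of that conversion, the sketch below decodes a made-up JSON string (not a real tweet) and shows which Python types the standard `json` module produces.

```python
import json

# Made-up JSON text using several basic JSON types.
raw = ('{"id": 7, "text": "hi", "truncated": false, '
       '"coordinates": null, "hashtags": ["a", "b"]}')

decoded = json.loads(raw)

print(type(decoded))               # JSON object -> dict
print(type(decoded["hashtags"]))   # JSON array  -> list
print(decoded["truncated"])        # JSON false  -> False
print(decoded["coordinates"])      # JSON null   -> None
```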

### Single Tweet

In [48]:
# Convert from JSON to Python object (decode JSON)
tweet = json.loads(tweet_json)

# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])

HELP ME PLEASE. A MAN NEEDS HIS NUGGS https://t.co/4SrfHmEMo3
849813577770778624


In [54]:
# Print user name
print(tweet['user']['screen_name'])

# Print user follower count
print(tweet['user']['followers_count'])

# Print user location
print(tweet['user']['location'])

# Print user description
print(tweet['user']['description'])

# Print the number of retweets
print(tweet['retweet_count'])

carterjwm
99918
Reno, NV - San Diego, CA
I kinda like chicken nuggets
3528525


In [51]:
# Check whether the authenticated user has retweeted this tweet
print(tweet['retweeted'])

False


In [None]:
# Print the text of tweet which has been retweeted
#print(retweet['retweeted_status']['text'])

# Print the user handle of the tweet
#print(retweet['user']['screen_name'])

# Print the user handle of the tweet which has been retweeted
#print(retweet['retweeted_status']['user']['screen_name'])
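
The commented lines above assume a variable `retweet` that holds a retweet. A tweet is a retweet when it carries a `retweeted_status` object; the `retweeted` flag describes the authenticated user's own action instead. The sketch below uses a hypothetical helper, `describe_retweet`, and made-up tweets to show the same lookups with a guard for tweets that are not retweets.

```python
def describe_retweet(tweet):
    """Return (retweeter, original_author, original_text) for a retweet.

    Returns None when the tweet is not a retweet.
    """
    original = tweet.get("retweeted_status")
    if original is None:
        return None
    return (tweet["user"]["screen_name"],
            original["user"]["screen_name"],
            original["text"])


# Made-up tweets containing only the fields used above.
retweet = {
    "text": "RT @alice: hello",
    "user": {"screen_name": "bob"},
    "retweeted_status": {"text": "hello", "user": {"screen_name": "alice"}},
}
plain = {"text": "hello", "user": {"screen_name": "alice"}}

print(describe_retweet(retweet))
print(describe_retweet(plain))
```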

### Multiple Tweets in a JSON File

We may want to load hundreds of tweets for analysis. The collected file stores one tweet per line as JSON, so we read it line by line and then load the result into a pandas DataFrame.

In [72]:
# Read the first line of the file; each line holds one tweet as JSON
with open('tweets.json', 'r') as file:
    twt = file.readline()

tweet = json.loads(twt)
tweet['text']

'RT @pyblogsal: The oldest piece of CPython, the parser generator (pgen) has been retired and replaced with a new version! 🐍🖥️🥳 \n\nThe parser…'

In [55]:
import pandas as pd

In [92]:
def create_tweet_list(json_file):
    """
    Reads a file with one JSON tweet per line and returns a list of tweet
    dictionaries, ready to be converted to a DataFrame.
    """
    tweets_list = []

    with open(json_file, 'r') as tweets:
        # Iterate through each line (one tweet per line)
        for line in tweets:

            # Skip blank lines, e.g. the one left by a trailing newline
            if not line.strip():
                continue

            # Create Python object for the tweet
            tweet_obj = json.loads(line)

            # Check if 140+ character tweet
            if 'extended_tweet' in tweet_obj:
                # Save the extended tweet in 'extended_tweet-full_text'
                tweet_obj['extended_tweet-full_text'] = tweet_obj['extended_tweet']['full_text']

            tweets_list.append(tweet_obj)

    return tweets_list

In [93]:
# Create tweet list with the above function
tweet_list = create_tweet_list('tweets.json')

# Create a pandas DataFrame
tweets_df = pd.DataFrame(tweet_list)

In [95]:
tweets_df.head(3)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,extended_tweet-full_text,favorite_count,favorited,...,quote_count,reply_count,retweet_count,retweeted,retweeted_status,source,text,timestamp_ms,truncated,user
0,,,Sat Mar 02 19:19:29 +0000 2019,,"{'hashtags': [], 'urls': [], 'user_mentions': ...",,,,0,False,...,0,0,0,False,{'created_at': 'Sat Mar 02 15:58:50 +0000 2019...,"<a href=""http://tapbots.com/tweetbot"" rel=""nof...","RT @pyblogsal: The oldest piece of CPython, th...",1551554369837,False,"{'id': 18064071, 'id_str': '18064071', 'name':..."
1,,,Sat Mar 02 19:19:29 +0000 2019,"[0, 140]","{'hashtags': [], 'urls': [{'url': 'https://t.c...",,"{'full_text': 'Hier mal das, was das Projekt g...","Hier mal das, was das Projekt grundlegend zeig...",0,False,...,0,0,0,False,,"<a href=""http://twitter.com/download/android"" ...","Hier mal das, was das Projekt grundlegend zeig...",1551554369900,True,"{'id': 3292060903, 'id_str': '3292060903', 'na..."
2,,,Sat Mar 02 19:19:33 +0000 2019,"[0, 68]","{'hashtags': [], 'urls': [{'url': 'https://t.c...","{'media': [{'id': 1101925025471295488, 'id_str...",,,0,False,...,0,0,0,False,,"<a href=""https://dlvrit.com/"" rel=""nofollow"">d...",Python resulta ser el mejor lenguaje de 2018 h...,1551554373932,False,"{'id': 903393260860825600, 'id_str': '90339326..."


In [96]:
tweets_df['text'].head()

0    RT @pyblogsal: The oldest piece of CPython, th...
1    Hier mal das, was das Projekt grundlegend zeig...
2    Python resulta ser el mejor lenguaje de 2018 h...
3    RT @KirkDBorne: Is Robotic Process Automation ...
4    RT @AndrewinContact: What is the Internet of T...
Name: text, dtype: object
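
Columns such as `user` and `retweeted_status` still hold nested dictionaries, which are awkward to filter or group on. As a sketch, a hypothetical helper `flatten_tweet` can copy a few nested fields to top-level keys before the DataFrame is built, following the `extended_tweet-full_text` naming used above. The choice of fields is an assumption made for illustration.

```python
def flatten_tweet(tweet_obj):
    """Copy selected nested fields of a tweet to top-level keys."""
    flat = dict(tweet_obj)  # shallow copy; the input dict is left unchanged
    flat["user-screen_name"] = tweet_obj["user"]["screen_name"]

    # Only retweets carry a 'retweeted_status' object
    retweeted = tweet_obj.get("retweeted_status")
    if retweeted is not None:
        flat["retweeted_status-text"] = retweeted["text"]
        flat["retweeted_status-user-screen_name"] = retweeted["user"]["screen_name"]
    return flat


# Made-up tweet containing only the fields used above.
sample_tweet = {
    "text": "RT @alice: hello",
    "user": {"screen_name": "bob"},
    "retweeted_status": {"text": "hello", "user": {"screen_name": "alice"}},
}

print(flatten_tweet(sample_tweet)["retweeted_status-user-screen_name"])
```

The flattened dictionaries can be passed to pandas in the same way as before, for example `pd.DataFrame([flatten_tweet(t) for t in tweet_list])`.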

## Analyzing Data