# CSCI 4152/6509 Lab 7: Fetching Tweets with Python

**Mar 04/06 2019, 13:05-14:25 | Location: Lab 1209, Mona Campbell Building | Lab instructor: Dijana Kosmajac, Tukai Pain | Lab copyright: Colin Conrad**

## Lab Overview

- Understand How Application Programming Interfaces (APIs) Work
- Use the Tweepy Package to Extract Tweets and Profile Information
- Clean Text Using Regex
- Export Tweets in a CSV Format for Later Analysis

Be sure to get help from the teaching assistant whenever you have questions.

### Step 1: Logging in to server bluenose (or timberlea)

- As in the previous lab, log in to your account on the server bluenose (or timberlea).
- Change your directory to csci4152 or csci6509 (`cd csci6509` or `cd csci4152`), whichever is your registered course. This directory should have been already created in your previous lab.
- Now, using the command `mkdir lab7` create the directory *lab7*.
- After this, you should make this directory your current directory using the command: `cd lab7`.
- Copy contents of *lab7* directory from https://web.cs.dal.ca/~vlado/csci6509/files/lab7 using `wget https://web.cs.dal.ca/~vlado/csci6509/files/lab7/*`.
- Now, you should be able to access the notebook through https://timberlea.cs.dal.ca:8000/, as instructed in the previous lab.

### Step 2: Register with Twitter API

So far, we have been using Python to analyze data that was already available to us through the NLTK package. Today, we are going to learn how to use the Twitter API to collect a dataset for analysis. An API, in simple terms, is a software intermediary that allows two applications to talk to each other. APIs are often provided by platforms that want developers to build apps for them. For instance, Twitter provides an API, but so do Google Maps, Google Talk, Facebook, YouTube, AccuWeather and many more. APIs let us bring outside computing power into our apps, and they do this using web services. Most web services follow the Representational State Transfer (REST) architectural style to provide the service.

About REST: REST services are typically built on top of HTTP. For instance, HTTP methods such as GET or DELETE are used to request or remove data. For more information about REST and its relative, the SOAP protocol, see https://blog.smartbear.com/apis/understanding-soap-and-rest-basics/.
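
To make the GET idea concrete, here is a minimal sketch that builds (but does not send) an HTTP GET request with Python's standard library. The endpoint URL and the `screen_name` parameter are used for illustration only; a real call to the Twitter API also needs OAuth credentials, which is why this sketch never contacts the server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Illustrative endpoint and parameter; nothing is sent over the network.
base_url = "https://api.twitter.com/1.1/users/show.json"
query = urlencode({"screen_name": "google"})

# A REST "read" is an HTTP GET on a resource URL.
request = Request(base_url + "?" + query, method="GET")

print(request.get_method())  # the HTTP method
print(request.full_url)      # the resource being requested
```

In this lab we will not build such requests by hand; a library does it for us.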

In order to access the REST API, we will need a Twitter developer account. If you do not have a Twitter account, please visit twitter.com and sign up. You do not need to use your real name, but I encourage you to use your proper email address. It will also ask you for your phone number. Twitter may need this to verify that you are a real developer and not a spambot. NOTE: If you absolutely refuse to give Twitter your phone number, please come see me (TA Dijana Kosmajac) and we can accommodate.

When you have a Twitter account, visit https://developer.twitter.com/ and start your first app! Give it a name. This may require a few extra steps, because Twitter usually reviews the application for the API access. Make sure to accurately describe the purpose of your access. Once your app is configured and approved, you can access the app page. Near the top of the page, you will find "Keys and Access Tokens". You will need these to authenticate the access from your application, so keep the page open for now.

### Step 3: Tweepy Package

One of the best features of Python is that it has a package system which can greatly extend the basic features of the programming language. The `csv` package that we used before is part of the Python standard library and is maintained by the Python Software Foundation. Most packages, however, are maintained by a community of dedicated users. One such library is the Tweepy (`tweepy`) library, but other tools such as Natural Language Toolkit (`nltk`), Python Data Analysis Library (`pandas`), and SciPy (`scipy`) also work this way. To help manage these packages, modern Python versions are shipped with a package manager called `pip`. Tweepy should already be available on the Bluenose (Timberlea) server. For your own research, if you want to install the library on your local machine, you can do so by using pip.

`sudo pip install tweepy`
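
If you are unsure whether Tweepy is already installed in your environment, a quick check along the following lines may help. This is only a sketch using the standard library; it reports whether a package named `tweepy` can be found, without importing it.

```python
import importlib.util

# find_spec returns None when a top-level package cannot be found.
tweepy_available = importlib.util.find_spec("tweepy") is not None
print("tweepy installed:", tweepy_available)
```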

With Tweepy in hand, we can investigate the following script. For more information on the Tweepy package, visit https://github.com/tweepy/tweepy.

### Step 4: Twitter Scraper Script

Modify and run the following script. Add the required API keys.

In [18]:
#Twitter Profiler app. This is a simple script to configure the Twitter API

import tweepy # https://github.com/tweepy/tweepy
import time

# Twitter API credentials. Get yours from https://developer.twitter.com. 
# Twitter account is required. If you need help, visit
# https://developer.twitter.com/en/docs/basics/authentication/overview

consumer_key = "Thlrd***************" # insert your consumer_key
consumer_secret = "4MbdAF*******************************" # insert your consumer_secret
access_key = "*****************************" # insert your access_key
access_secret = "***********************" # insert your access_secret

# this function collects a twitter profile request and returns a Twitter object
def get_profile(screen_name):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    try:
        # https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show
        # describes get_user
        user_profile = api.get_user(screen_name)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return user_profile

# uses the function to query a Twitter user.
s = get_profile("google")

if type(s) != str:
    print(s._json) # this is using the raw message in json
                   # s instance is an object with attributes that correspond to the Twitter json response
else:
    print(s)

{'id': 20536157, 'id_str': '20536157', 'name': 'Google', 'screen_name': 'Google', 'location': 'Mountain View, CA', 'profile_location': None, 'description': '#HeyGoogle', 'url': 'https://t.co/ZKjExFETrD', 'entities': {'url': {'urls': [{'url': 'https://t.co/ZKjExFETrD', 'expanded_url': 'http://about.google', 'display_url': 'about.google', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 21846242, 'friends_count': 224, 'listed_count': 95941, 'created_at': 'Tue Feb 10 19:14:39 +0000 2009', 'favourites_count': 2474, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 106747, 'lang': None, 'status': {'created_at': 'Tue Mar 10 18:00:42 +0000 2020', 'id': 1237438243081261056, 'id_str': '1237438243081261056', 'text': 'Calling all programmers around the world: Be a part of the #CodeJam community. Registration is now open for our ann… https://t.co/YyS8MTNQ2Q', 'truncated': True, 'entities': {'hashtags': [{'text

The first thing you will notice is the import statement. This is familiar, but this time we are importing the tweepy and time libraries. The time library lets us work with time values, which is helpful when interpreting tweet timestamps.
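
As an illustration of what interpreting a timestamp can look like, the sketch below parses the `created_at` string from the profile output above. Note that it uses the standard `datetime` module rather than `time`, which is a substitution made for this example, and it assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime

# Timestamp copied from the profile output above.
created_at = "Tue Feb 10 19:14:39 +0000 2009"

# %a and %b are the abbreviated day and month names, %z is the UTC offset.
parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

print(parsed.isoformat())
print(parsed.year, parsed.month, parsed.day)
```

Once parsed, timestamps can be compared and sorted like any other value.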

The `consumer_key` and other variables are the keys that the Twitter API requires in order to run. We should take our keys from *developer.twitter.com* and put them in here. I have provided a partial screenshot for your reference.

<img src="snippy.png?modified=12345678" />

Once we have obtained the keys and copied them into the script, we should be able to get the data from the API. Let's take a look at the function `get_profile`.

The `auth` variable contains the OAuth process for logging into the Twitter API using your Twitter account keys. The `auth.set_access_token` call handles the other part of the Twitter OAuth process, which gets you into the API. Finally, the `api` variable saves the API session for use in the Python script. This activates Tweepy and lets you into Twitter's API securely. It works because OAuth works. If you want to learn more about OAuth, definitely check out the Wikipedia page.
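
The scripts in this lab paste the four keys directly into the source file. As an optional alternative, and only as a sketch, the keys could be read from environment variables so that they do not end up in a file you share or submit. The variable names and the `load_credentials` helper below are hypothetical, and the demonstration uses a fake environment so it runs without real keys.

```python
import os

# Hypothetical variable names, chosen for this sketch only.
REQUIRED = ["TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_KEY", "TWITTER_ACCESS_SECRET"]

def load_credentials(env=os.environ):
    """Return the four keys as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Demonstration with a fake environment so the sketch runs anywhere.
fake_env = {name: "dummy-value" for name in REQUIRED}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The returned values would then be passed to the same `OAuthHandler` and `set_access_token` calls used in the script above.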

### Step 5: Interpret Something Useful

We have code that works, but its output is very messy and not very readable. This is simply the Twitter API data. When you print Tweepy's Twitter API object, you will receive a lot of JSON, which is the format that the API uses to dump data.

Let's try modifying the script to only show some relevant features. We modified the end of the script, where we fetch only specific attributes:

In [6]:
# Twitter Profiler app. This is a simple script to configure the Twitter API

import tweepy # https://github.com/tweepy/tweepy
import time

# Twitter API credentials.

consumer_key = "Thlrd***************" # insert your consumer_key
consumer_secret = "4MbdAF*******************************" # insert your consumer_secret
access_key = "*****************************" # insert your access_key
access_secret = "***********************" # insert your access_secret

# This function collects a Twitter profile request and returns a Twitter object
def get_profile(screen_name):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    try:
        # https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show
        # describes get_user
        user_profile = api.get_user(screen_name)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return user_profile

# Uses the function to query a Twitter user.
s = get_profile("google")

if type(s) != str:
    print("ID:\t%s"%s.id_str)
    print("Name:\t%s (%s)"%(s.name,s.screen_name))
    print("Location:\t%s"%s.location)
    print("Description:\t%s"%s.description)
    print("Followers count:\t%d"%s.followers_count)
else:
    print(s)



ID:	20536157
Name:	Google (Google)
Location:	Mountain View, CA
Description:	#HeyGoogle
Followers count:	21833677


That's a bit more usable. How did we change this? When Tweepy sends a REST query to the Twitter API, the response comes back as JSON. JSON (JavaScript Object Notation) is a widely adopted format for transferring web data. To get the JSON data directly (as a Python dictionary), you can use the `s._json` property instead of accessing individual attributes.

The problem is that JSON, once parsed in Python, becomes nested dictionaries and lists rather than objects with attributes. Tweepy is a library designed specifically to transform JSON data into a Python-readable Twitter object. This is why we are able to simply write `s.location` and receive the location. As you likely know, a plain Python dictionary does not allow this: with a dictionary `d` you would have to write `d["location"]` instead.

Try tinkering with this. You can learn more about the Twitter profile object here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object
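
The difference between dictionary access and attribute access described above can be sketched with the standard library alone. This is not how Tweepy is implemented; it is only an illustration, using a few fields from the profile output shown earlier, of how parsed JSON can be wrapped so that fields become attributes.

```python
import json
from types import SimpleNamespace

# A few fields taken from the profile output shown earlier.
raw = ('{"id_str": "20536157", "name": "Google", '
       '"location": "Mountain View, CA", "followers_count": 21846242}')

# Plain parsing gives a dictionary: fields are read with square brackets.
as_dict = json.loads(raw)
print(as_dict["location"])

# An object_hook can wrap each JSON object so fields become attributes.
as_object = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))
print(as_object.location)
```

Both lines print the same location; only the access style differs.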

### Step 6: Retrieving Users' Tweets

So far, we have managed to get Python to print a given user's profile information. What about collecting user tweets? This is surprisingly simple because of how the tweepy library structures the data. Recognizing that each user is a Twitter user object, it follows that a user's tweets must likewise be in some way related to that object. Profiles and tweets have what is often called a "one-to-many" relationship, where there are many tweets to a given user.

To search for ways to collect tweets, we can look at the Twitter API documentation. Take a quick look and try to see what's going on: https://developer.twitter.com/en/docs/tweets/timelines/overview

The docs say that we should use the `user_timeline` function to access a user's tweets. Immediately, we might think that the simplest way to do this is to call the `user_timeline` function on the user object. However, this will not work, as `user_timeline` is a method of the Twitter API object, not the user object. This is what is implied when the documentation requires a `screen_name` or `id` to make the call.

It makes more sense to create a separate `get_tweets` function that takes the screen_name as input and returns the user tweets. Let's try something like the following:

In [20]:
#Twitter Profiler app. This is a simple script to configure the Twitter API
import tweepy, time, csv

consumer_key = "Thlrd***************" # insert your consumer_key
consumer_secret = "4MbdAF*******************************" # insert your consumer_secret
access_key = "*****************************" # insert your access_key
access_secret = "***********************" # insert your access_secret

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# this function collects a twitter profile request and returns a Twitter object
def get_profile(screen_name):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    try:
        # https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show
        # describes get_user
        user_profile = api.get_user(screen_name)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return user_profile

#this function collects twitter profile tweets and returns Tweet objects
def get_tweets(screen_name):
    try:
        # https://developer.twitter.com/en/docs/tweets/timelines/overview describes user_timeline
        tweets = api.user_timeline(screen_name, count=20)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return tweets

# uses the function to query a Twitter user.
t = get_tweets("google")
print(t)

[Status(_api=<tweepy.api.API object at 0x7f164e69a790>, _json={'created_at': 'Wed Mar 11 03:06:07 +0000 2020', 'id': 1237575500811272192, 'id_str': '1237575500811272192', 'text': '@Siddharthbapnaa Hi there. Thank you for letting us know about this issue, and your patience as we fixed the proble… https://t.co/dkFmZ8P9ex', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'Siddharthbapnaa', 'name': 'S', 'id': 2423573916, 'id_str': '2423573916', 'indices': [0, 16]}], 'urls': [{'url': 'https://t.co/dkFmZ8P9ex', 'expanded_url': 'https://twitter.com/i/web/status/1237575500811272192', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}]}, 'source': '<a href="http://www.conversocial.com" rel="nofollow">Conversocial</a>', 'in_reply_to_status_id': 1233676839266086912, 'in_reply_to_status_id_str': '1233676839266086912', 'in_reply_to_user_id': 2423573916, 'in_reply_to_user_id_str': '2423573916', 'in_reply_to_screen_name': 'Siddharthba

Try this script, complete with your Twitter keys. The result will be a big mess of API tweet results! This is a good sign because it means that our script is working. However, it needs to be further refined.

When using Tweepy, each tweet is an object in itself, complete with a series of properties which can be accessed. Our results currently give us a list of 20 Tweets, each with a lot of data. We can organize our data better by accessing just the text property of the tweets. Let's add the code to loop through the data and give us the tweet content:

In [8]:
for tweet in t:
    print(tweet.text +'\n')
    print(tweet._json["text"]+'\n')

@Iam_Dipanshu77 Hi Dipanshu. Let's see what we can do to help. Look out for a DM with next steps.

@Iam_Dipanshu77 Hi Dipanshu. Let's see what we can do to help. Look out for a DM with next steps.

@Oso75592541 Hmm. Let's see what we can do to help. Look out for a DM with next steps.

@Oso75592541 Hmm. Let's see what we can do to help. Look out for a DM with next steps.

Think about how much you read on your phone every day. Now when you say "Hey Google, read it," the browser on your… https://t.co/GcdSe6tlp9

Think about how much you read on your phone every day. Now when you say "Hey Google, read it," the browser on your… https://t.co/GcdSe6tlp9

@Gamer64T Hi there. We suggest adding extra layers of security to your account with these tips: https://t.co/k6X4NghPTb. Hope this helps.

@Gamer64T Hi there. We suggest adding extra layers of security to your account with these tips: https://t.co/k6X4NghPTb. Hope this helps.

@AzwerRizvi Mind if we jump in, Azwer? Just to confirm, have you a

### Step 7: Export to CSV

So far, we have managed to develop a script that collects a given user's tweets. However, the program currently prints the tweets that we find, and doesn't do anything useful with them. Let's modify the script to export the Tweets to a csv file.

We will have to change the script to include two things. As you may recall, we will first need to import the csv library. The second thing we will need is a script to 1) open a writable csv file and 2) write each tweet into a row of the csv file. In addition to the `tweet.text` we might also want some other information, such as the id of the tweet and the user who wrote it. We should also make sure there is a header for the csv file. The end of the script would thus look something like this:

In [9]:
with open('tweets.csv', 'w', newline='', encoding='utf-8') as outfile: # newline='' is recommended for csv files
    writer = csv.writer(outfile)
    writer.writerow([]) # insert the features here
    for tweet in t:
        writer.writerow([]) # insert the features here

This is nice, but there is one more useful feature we can add to our script before finishing. Currently, the script only evaluates a single user. We can modify it to include many users using a list. If we store all of the users we would like to investigate in a single list and then loop through the list, we can gather a greater variety of tweets. Let's modify the code one more time to include this feature. Make sure your code looks like the following:

In [10]:
#Twitter Profiler app. This is a simple script to configure the Twitter API

import tweepy, time, csv

consumer_key = "Thlrd***************" # insert your consumer_key
consumer_secret = "4MbdAF*******************************" # insert your consumer_secret
access_key = "*****************************" # insert your access_key
access_secret = "***********************" # insert your access_secret

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# this function collects a twitter profile request and returns a Twitter object
def get_profile(screen_name):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    try:
        # https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-show
        # describes get_user
        user_profile = api.get_user(screen_name)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return user_profile

#this function collects twitter profile tweets and returns Tweet objects
def get_tweets(screen_name):
    try:
        # https://developer.twitter.com/en/docs/tweets/timelines/overview describes user_timeline
        tweets = api.user_timeline(screen_name, count=20)
    except Exception as e:
        return "There was an error. Details: " + str(e) # str() is needed to join text and exception
    return tweets

# set of profiles that we want to obtain.
profiles = ["google", "msdev","rosiebarton","realDonaldTrump"]

with open('tweets.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["id","screen_name","created_at","text"])
    for profile in profiles:
        t = get_tweets(profile)
        if isinstance(t, str): # get_tweets returned an error message instead of tweets
            print(t)
            continue
        for tweet in t:
            writer.writerow([str(tweet.id),tweet.user.screen_name,tweet.created_at,tweet.text])
            # or
            # tweet = tweet._json
            # writer.writerow([tweet["id"],tweet["user"]["screen_name"],tweet["created_at"],tweet["text"]])

Save and run the script. It should produce a `tweets.csv` file. 

__Submit:__ Submit the output file `tweets.csv` using the `submit-nlp` command.

### Step 8: Preprocessing tweets

We have saved a csv file with the tweets collected from Twitter. Now, the task is to extract some information from the collected data. Take a look at the `Patterns` class:

In [3]:
import re

# Taken from: https://github.com/s/preprocessor/blob/master/preprocessor/defines.py

class Patterns:
    URL_PATTERN=re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
    HASHTAG_PATTERN = re.compile(r'#\w*')
    MENTION_PATTERN = re.compile(r'@\w*')
    RESERVED_WORDS_PATTERN = re.compile(r'^(RT|FAV)')

    try:
        # UCS-4
        EMOJIS_PATTERN = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    except re.error:
        # UCS-2
        EMOJIS_PATTERN = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')

    SMILEYS_PATTERN = re.compile(r"(?:X|:|;|=)(?:-)?(?:\)|\(|O|D|P|S){1,}", re.IGNORECASE)
    NUMBERS_PATTERN = re.compile(r"(^|\s)(\-?\d+(?:\.\d)*|\d+)")

This class has some basic regular expressions for data extraction from Twitter messages. You can use it to extract URLs, hashtags, mentions, emoticons, etc. Let's extract mentions and URLs from the saved tweets:

In [10]:
#Find mentions and URLs script. This is a simple script to read csv and find mentions and URLs

import csv, re, pandas as pd

tweets_df = pd.read_csv('tweets.csv')

for items in tweets_df['text'].items(): # items() yields (index, text) pairs; iteritems() was removed in newer pandas
    print("Mentions:")
    print(re.findall(Patterns.MENTION_PATTERN, items[1]))
    print("URLs:")
    print(re.findall(Patterns.URL_PATTERN, items[1]))
    cle = re.sub(Patterns.MENTION_PATTERN, "", (items[1]).lower())
    cle = re.sub(Patterns.URL_PATTERN, "", cle)
    print("Raw text:")
    print(items[1])
    print(cle)
    print("_"*40)
    

# or
# with open ('tweets.csv', 'r') as infile:
#     reader = csv.reader(infile,quotechar='"')
#     for row in reader:
#         print(row)

Mentions:
<class 'list'>
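
To try the extract-and-clean idea on a single string without reading the csv file, consider the following sketch. The mention pattern is copied from the `Patterns` class, while the URL pattern is a deliberately simplified stand-in (an assumption made for brevity) for the much longer one above. The sample text is one of the tweets printed in Step 6.

```python
import re

# MENTION_PATTERN is copied from the Patterns class; URL_PATTERN is a
# deliberately simplified stand-in for the much longer one above.
MENTION_PATTERN = re.compile(r'@\w*')
URL_PATTERN = re.compile(r'https?://\S+[^\s.,;:!?]')

# One of the tweets printed in Step 6.
text = ("@Gamer64T Hi there. We suggest adding extra layers of security to "
        "your account with these tips: https://t.co/k6X4NghPTb. Hope this helps.")

mentions = MENTION_PATTERN.findall(text)
urls = URL_PATTERN.findall(text)

# Lowercase, drop mentions and URLs, then collapse leftover whitespace.
cleaned = MENTION_PATTERN.sub("", text.lower())
cleaned = URL_PATTERN.sub("", cleaned)
cleaned = " ".join(cleaned.split())

print(mentions)
print(urls)
print(cleaned)
```

One thing to keep in mind: `findall` returns tuples rather than plain strings when a pattern contains capture groups, as the full `URL_PATTERN` in the `Patterns` class does.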


Create `find_hashtags.py`, which reads the previously created `tweets.csv` file. You should:
- extract word tokens from each tweet. On standard output, print the top 10 word tokens per account.
- extract hashtags per account. On standard output, print all hashtags per account.
- clean each tweet: lowercase, remove URLs, remove hashtags, remove mentions, remove reserved words (RT, FAV), remove emojis (emoji removal doesn't have to be perfect). Print cleaned tweets per account.
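
As a starting point for the per-account counting, here is a small sketch of one possible approach using `collections.Counter`. The rows are made up and stand in for what you would read from `tweets.csv`, and splitting tokens with `\w+` is a simplifying assumption; your own tokenization may differ.

```python
import re
from collections import Counter, defaultdict

# Made-up (account, text) pairs standing in for rows of tweets.csv.
rows = [
    ("acct_a", "hello world hello"),
    ("acct_a", "world of tweets hello"),
    ("acct_b", "another account another voice"),
]

# One Counter per account, filled with lowercase word tokens.
counts = defaultdict(Counter)
for account, text in rows:
    counts[account].update(re.findall(r"\w+", text.lower()))

# most_common(10) returns up to ten (token, count) pairs, most frequent first.
for account, counter in counts.items():
    top = [word for word, _ in counter.most_common(10)]
    print(account, top)
```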

__Submit:__ Submit the script file `find_hashtags.py` using the `submit-nlp` command.

Example output:

<div class="alert alert-block alert-info">
<samp>Google:<br>* top 10 words: in, trending, feedback, listed, using, focus, ... <br>* hashtags: #GoogleDoodle, #IWD2018, #MarioMaps<br>* clean tweets:<br>1. playfab is a complete back-end platform for live games ...<br>2. discover the top six tips you should know while working ...<br>3. ...<br>realDonaldTrump:<br>* top 10 words: In\u2026, United, Townshi\u2026, Thank, just, being, ...<br>* hashtags: #InternationalWomensDay, #MAGA<br>* clean tweets:<br>1. ...<br>...</samp>
</div>



This concludes today's lab.