Professor: Apostolos Filippas

Class: Web Analytics

Topic: Parsing web content III

**This material was developed by Jie Lu. You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.**

# <center><font color='red'>Scraping data using Twitter's API</font></center>

This notebook details how we can use an API to collect content, provided, of course, that the website offers one. As an introduction to using APIs to collect data, we will walk through some examples using Twitter's API.

# 0. Before class

Make sure you can load all the packages below before coming to class.

In [1]:
# load packages
# Note: please install the Tweepy package if you haven't
#       to do so, run "pip install tweepy" on your command line

# Tweepy allows us to access the Twitter API in Python
import tweepy  

# Pandas will allow us to load the extracted tweets/users into a dataframe
import pandas as pd

# JSON will allow us to work with JSON files
import json

# json_normalize() will allow us to normalize semi-structured JSON data into a flat table
# (available as pandas.json_normalize since pandas 1.0; the pandas.io.json path is deprecated)
from pandas import json_normalize

## 1. Authentication


Let's begin with some necessary authentication steps!

- To start using Twitter's API, we first have to obtain the correct **authentication keys and tokens**.
- These keys and tokens allow Twitter to securely identify each user and protect user privacy.

Authentication is handled by the `tweepy.OAuthHandler` class.  
Tweepy tries to make the authentication process as painless as possible for you.

To begin the authentication process we need to:
- register a client application with Twitter
- create a new application

If you followed the steps before class, you should now have your consumer key and secret. Keep these two handy, as you’ll need them. To find them, go to https://developer.twitter.com/en/apps/ and then click on Details:

<br>
<img src="https://docs.google.com/uc?export=download&id=1fxYeyGrs7lgDg-YlYEdJ7SlRmuQvpWWM" width="800" height="480">
<br>

In [3]:
# Variables that contain the credentials to access the Twitter API
# Use your own keys and tokens instead
consumer_key    = "imqH65lm6swRGWCYmzCRjIBE5"
consumer_secret = "FvIqyOhTnzBtb8C6TjCmJThbRVZduDDVMP4X25MIfC2enL7RGo"
access_key      = "725501506888196096-gdAWRWBX1aQbLU8aLMcBvh5QYs19Y10"
access_secret   = "1TEpIWTm73wBkDF1ANmcDfS5XmEfBndLGP8jDZUexI5fp"

# Set up access to the API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

# create the API object
api = tweepy.API(auth)
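
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes four hypothetical variable names (`TWITTER_CONSUMER_KEY` and so on); they are not part of Tweepy or Twitter's API, so substitute whatever names you export in your own shell.

```python
import os

# Hypothetical variable names; use whatever names you export in your shell
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)

def load_credentials(env=None):
    """Return the four credentials from env, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "placeholder-" + str(i) for i, name in enumerate(REQUIRED_VARS)}
creds = load_credentials(demo_env)
print(sorted(creds))
```

With real variables set, `load_credentials()` reads from `os.environ`, and the returned values can be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the string literals.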

Now, just to make sure it worked, let's check the most recent tweets from our stream.

In [4]:
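# up to 200 of the most recent tweets from my stream
recent_tweets = api.home_timeline(count=200)
for tweet in recent_tweets:
    # Print every tweet. Beware that a test such as
    # `'Airbnb' or '@Airbnb' or '#Airbnb' in tweet.text` is always True,
    # because the non-empty string 'Airbnb' is truthy.
    # To keep only Airbnb tweets, use: if 'airbnb' in tweet.text.lower():
    print(tweet.text)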

Top transfer Devin Leary has committed to Kentucky 🏈 https://t.co/0MPn02VMOK
𝙎𝙪𝙧𝙫𝙚𝙮 𝙨𝙖𝙮𝙨...

#SicEm | #PersonOverPlayer https://t.co/OaWzoc0KRK
Let's go! #RIDEHIGH https://t.co/p0HDuRc8Jn
RT @BoMcNally: Had a blast speaking at the @StanfordFball official visit weekend with some fellow football alumni. Preaching what I’m exper…
RT @jjordangarrison: COMMITTED! #RideHigh #AGTG 💚🤎 @mark_serve @CoachSpears1 @CoachWulff @coachjmusky https://t.co/ZHMm1iO9eH
BIG TIME commit out of the Northeast joining the #TunnelWorkersUnion and @StanfordFball https://t.co/SWwq96nwrz
RT @LouStagner: Yesterday I posted that making "net par" is a solid goal. 

I had a conversation with a 7-index friend who said:

"I am not…
RT @UWTarrant: We've got an easy and fun way for you to show your support for veterans this holiday season! Thanks to our @ArmedForcesBowl…
The Path... It's Paved By Legends. https://t.co/Q39a1S6HgE
If you didn't know, I've joined up with @Qb11Sd &amp; @DouglasTS to deliver a weekly podcast 
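
A note on filtering by keyword: none of the tweets printed above mention Airbnb, because a condition such as `'Airbnb' or '@Airbnb' or '#Airbnb' in tweet.text` does not test all three terms. Python's `or` returns its first truthy operand, so the whole expression evaluates to the string `'Airbnb'`, which is always truthy. The sketch below shows the pitfall and one possible fix; `mentions_any` and the sample texts are made up for illustration.

```python
def mentions_any(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

# Hypothetical sample texts, standing in for tweet.text values
samples = [
    "Just booked an #Airbnb for the weekend",
    "Top transfer has committed to Kentucky",
]

# The chained `or` returns its first truthy operand, the string 'Airbnb',
# so the condition is truthy for every text
always_true = [bool('Airbnb' or '#Airbnb' in text) for text in samples]

# Testing membership explicitly keeps only the matching text
matches = [text for text in samples if mentions_any(text, ["airbnb"])]

print(always_true)
print(matches)
```

Because `'#Airbnb'` and `'@Airbnb'` both contain `'Airbnb'`, a single case-insensitive substring test covers all three spellings.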

### Collecting tweets that mention Airbnb and the housing market

In [16]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Tweets_Housing = []
Airbnb_tweets7 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202201010000,
                                       maxResults = 100, toDate = 202202010000) 

for tweet in Airbnb_tweets7: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

34


In [17]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets8 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202202010000,
                                       maxResults = 100, toDate = 202203010000) 

for tweet in Airbnb_tweets8: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

134


In [18]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets9 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202203020000,
                                       maxResults = 100, toDate = 202204010000) 

for tweet in Airbnb_tweets9: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

234


In [19]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets10 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202204020000,
                                       maxResults = 100, toDate = 202205010000) 

for tweet in Airbnb_tweets10: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

334


In [20]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets11 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202205020000,
                                       maxResults = 100, toDate = 202206010000) 

for tweet in Airbnb_tweets11: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

434


In [21]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets12 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate = 202206020000,
                                       maxResults = 100, toDate = 202207010000) 

for tweet in Airbnb_tweets12: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

534


In [22]:
# Collect tweets that mention both "Airbnb" and "Housing Market"
Airbnb_tweets13 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202207020000,
                                       maxResults = 100, toDate = 202208010000) 

for tweet in Airbnb_tweets13: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

634


In [23]:
Airbnb_tweets14 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202208020000,
                                       maxResults = 100, toDate = 202209010000) 

for tweet in Airbnb_tweets14: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

734


In [24]:
Airbnb_tweets15 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202209020000,
                                       maxResults = 100, toDate = 202210010000) 

for tweet in Airbnb_tweets15: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

834


In [25]:
Airbnb_tweets16 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202210020000,
                                       maxResults = 100, toDate = 202211010000) 

for tweet in Airbnb_tweets16: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

934


In [26]:
Airbnb_tweets17 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202211020000,
                                       maxResults = 100, toDate = 202212010000) 

for tweet in Airbnb_tweets17: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

1034


In [27]:
Airbnb_tweets18 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202212020000,
                                       maxResults = 100, toDate = 202212090000) 

for tweet in Airbnb_tweets18: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

1088


In [28]:
Airbnb_tweets19 = api.search_full_archive(query ='"Airbnb" "Housing Market"' , label ='NewWebAnalyticsProject', fromDate =202212100000,
                                       maxResults = 100, toDate = 202212190000) 

for tweet in Airbnb_tweets19: 
    Tweets_Housing.append(tweet.text) 
print(len(Tweets_Housing))

1188
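
The cells above repeat the same call with hand-typed date windows, and some windows appear to leave a gap (for example, one cell ends at `202203010000` and the next starts at `202203020000`, so 1 March is not covered). A minimal sketch of generating contiguous windows in the `YYYYMMDDHHmm` format used by `fromDate` and `toDate` could look like this; `month_windows` is a hypothetical helper, not part of Tweepy.

```python
from datetime import datetime

def month_windows(start, end):
    """Yield (fromDate, toDate) strings covering [start, end) month by month."""
    fmt = "%Y%m%d%H%M"  # minute granularity, e.g. 202201010000
    current = start
    while current < end:
        # First day of the following month
        if current.month == 12:
            next_month = datetime(current.year + 1, 1, 1)
        else:
            next_month = datetime(current.year, current.month + 1, 1)
        stop = min(next_month, end)
        yield current.strftime(fmt), stop.strftime(fmt)
        current = stop

windows = list(month_windows(datetime(2022, 1, 1), datetime(2022, 12, 19)))
print(len(windows))
print(windows[0])
print(windows[-1])
```

Each pair can then be passed as `fromDate` and `toDate` to `api.search_full_archive` inside a single loop, with the same `query`, `label`, and `maxResults` arguments as in the cells above.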


In [29]:
print(Tweets_Housing)

["Airbnb and it's consequences have been a disaster for the housing market https://t.co/2n9AYgNqO7", '@GeorgeMonbiot An Airbnb tax? So the cost of Airbnb goes up? A windfall tax on gas companies? Taxing them as the co… https://t.co/feduGrizhC', '@ofvariousstorms The housing market in general needs to tank because tell me why my town of 45K people has 5 availa… https://t.co/oKR3SDttur', 'The death of our housing market is going to be because of companies like AirBNB &amp; Zillow. Just watch. This time the… https://t.co/SVR2LA8N7R', "What to expect from the housing market in 2022: Another sellers' market\n\nhttps://t.co/fcRcJMoXsZ\n\n#realestatetech… https://t.co/2UySKE4I38", 'The author forgot to mention the negative impact Airbnb has on neighbors and the housing market, but I agree - Airb… https://t.co/qV0oW4jw0S', 'NEW FULL-TEXT: "Airbnb and the housing market in Italy. Evidence from six Cities" by Congiu et al. https://t.co/tuVjlA6on2', 'Is the #NYC Housing Market Crushing your #airb

In [30]:
Dframe = pd.DataFrame(Tweets_Housing, columns = ['HM_Tweets']) 
Dframe.to_csv('2022_HM_Tweets.csv', index= False )
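
Many of the saved tweets end in "…" followed by a link, which suggests that `tweet.text` holds a shortened version of longer tweets. The sketch below assumes that the search response marks such tweets with a `truncated` flag and carries the complete text in an `extended_tweet` dictionary under `full_text`; check the response of your own endpoint before relying on this. `full_text` is a hypothetical helper, and the `SimpleNamespace` objects only mimic the assumed attributes.

```python
from types import SimpleNamespace

def full_text(tweet):
    """Return the untruncated text when available, else tweet.text."""
    extended = getattr(tweet, "extended_tweet", None)
    if getattr(tweet, "truncated", False) and extended:
        return extended.get("full_text", tweet.text)
    return tweet.text

# Stand-in objects that mimic the attributes assumed above
short = SimpleNamespace(text="Airbnb is fine", truncated=False)
long = SimpleNamespace(
    text="Airbnb and the housing market…",
    truncated=True,
    extended_tweet={"full_text": "Airbnb and the housing market in six cities"},
)
print(full_text(short))
print(full_text(long))
```

If the assumption holds, appending `full_text(tweet)` instead of `tweet.text` in the collection loops would store the complete tweets.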

### Collecting tweets that mention Airbnb

In [5]:
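# Collect tweets that mention Airbnb
# Note: Python evaluates ('Airbnb' or '#Airbnb' or '@Airbnb') to just 'Airbnb',
# so the alternatives are combined with the search API's OR operator instead
Tweets = []
Airbnb_tweets2 = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202201010000,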
                                       maxResults = 100, toDate = 202202010000) 

for tweet in Airbnb_tweets2: 
    Tweets.append(tweet.text) 
print(len(Tweets))

100


In [6]:

Airbnb_tweets3 = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202202010000,
                                       maxResults = 100, toDate = 202203010000) 

for tweet in Airbnb_tweets3: 
    Tweets.append(tweet.text) 
print(len(Tweets))

200


In [7]:

Airbnb_tweetsD = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202203010000,
                                       maxResults = 100, toDate = 202204010000) 

for tweet in Airbnb_tweetsD: 
    Tweets.append(tweet.text) 
print(len(Tweets))

300


In [8]:

Airbnb_tweetsO = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202204010000,
                                       maxResults = 100, toDate = 202205010000) 

for tweet in Airbnb_tweetsO: 
    Tweets.append(tweet.text) 
print(len(Tweets))

400


In [9]:

Airbnb_tweetsL = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202205010000,
                                       maxResults = 100, toDate = 202206010000) 

for tweet in Airbnb_tweetsL: 
    Tweets.append(tweet.text) 
print(len(Tweets))

500


In [10]:

Airbnb_tweets = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202206010000,
                                       maxResults = 100, toDate = 202207010000) 

for tweet in Airbnb_tweets: 
    Tweets.append(tweet.text) 
print(len(Tweets))

600


In [11]:
Airbnb_tweetsN = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202207010000,
                                       maxResults = 100, toDate= 202208010000)
for tweet in Airbnb_tweetsN: 
    Tweets.append(tweet.text)
print(len(Tweets))

700


In [12]:
Airbnb_tweetsU = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202208010000,
                                       maxResults = 100, toDate= 202209010000)
for tweet in Airbnb_tweetsU: 
    Tweets.append(tweet.text)
print(len(Tweets))

800


In [13]:
Airbnb_tweetsF = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202209010000,
                                       maxResults = 100, toDate= 202210010000)
for tweet in Airbnb_tweetsF: 
    Tweets.append(tweet.text)
print(len(Tweets))

900


In [14]:
Airbnb_tweetsG = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202210010000,
                                       maxResults = 100, toDate= 202211010000)
for tweet in Airbnb_tweetsG: 
    Tweets.append(tweet.text)
print(len(Tweets))

1000


In [15]:
Airbnb_tweetsH = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202211010000,
                                       maxResults = 100, toDate= 202212010000)
for tweet in Airbnb_tweetsH: 
    Tweets.append(tweet.text)
print(len(Tweets))

1100


In [16]:
Airbnb_tweetsQ = api.search_full_archive(query ='Airbnb OR #Airbnb OR @Airbnb', label ='NewWebAnalyticsProject', fromDate = 202212010000,
                                       maxResults = 100, toDate= 202212170000)
for tweet in Airbnb_tweetsQ: 
    Tweets.append(tweet.text)
print(len(Tweets))

1200


In [17]:
Df = pd.DataFrame(Tweets, columns = ['Tweets']) 
Df.to_csv('2022_Tweets.csv', index= False )

In [18]:
print(Tweets)

['ルールメイキングの基本は、アジェンダシェイピング。何を争点に置いて、利害関係者をどのタイミングで巻き込んで、どの方向にもっていくか。記事中にあるUberとAirbnbの明暗を分けた事例は大変参考になりますね。\n\n渉外...… https://t.co/Y6uDjTuhIN', 'oomfs going to wembley n2 and wanna do a big harrie airbnb COME FIND ME!!!', 'RT @BuzzPatterson: I propose that Canadians and Californians get together and banish our soy boys @JustinTrudeau and @GavinNewsom to an isl…', 'RT @SunilDa56757277: #app #application #appdesign #appdesigner #videogame #conceptart #digital #education #airbnb #lightroom #editing #Inst…', 'RT @SunilDa56757277: #app #application #appdesign #appdesigner #videogame #conceptart #digital #education #airbnb #lightroom #editing #Inst…', "from what i'm understanding (and i have brain damage so i'm prob wrong, pls correct me)\n\nif you film anywhere in pu… https://t.co/Uta4HdrE38", 'RT @SunilDa56757277: #app #application #appdesign #appdesigner #videogame #conceptart #digital #education #airbnb #lightroom #editing #Inst…', '@harrysbrasil airbnb muito provavelmente', 'Pourquoi les

In [19]:
print(len(Tweets))

1200
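
The printed list contains the same retweet several times, so the 1200 entries are not 1200 distinct texts. Before analysis it may be worth dropping exact duplicates, and perhaps retweets as well. The helpers below are a small illustrative sketch using only the standard library; the sample list is made up.

```python
def unique_texts(texts):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(texts))

def is_retweet(text):
    """Heuristic: retweets collected this way start with 'RT @'."""
    return text.startswith("RT @")

# Hypothetical sample, standing in for the Tweets list
sample = [
    "RT @someone: #airbnb #editing",
    "airbnb muito provavelmente",
    "RT @someone: #airbnb #editing",
]

deduplicated = unique_texts(sample)
originals_only = [text for text in deduplicated if not is_retweet(text)]
print(len(deduplicated))
print(originals_only)
```

Applying `unique_texts(Tweets)` before building the dataframe would keep one copy of each text in the CSV file.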


Yes! We got a bunch of tweets, both from our timeline and from the full-archive search!

On its own this isn't very useful in practice, but it shows us how we can use Twitter's API to get data. In fact, Tweepy is a really powerful tool:
- it has many methods that we can use
- it gives us a very convenient way to fetch large volumes of data
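
Each search call above returns at most `maxResults = 100` tweets, so fetching larger volumes means requesting one page after another. The generic sketch below follows "next" tokens until the results run out or a limit is reached. `collect_all`, `fake_fetch`, and `FAKE_PAGES` are hypothetical stand-ins that do not call Twitter. Tweepy's `Cursor` class is designed to handle paging for you; whether it fits a given endpoint is worth checking in the Tweepy documentation.

```python
def collect_all(fetch_page, limit):
    """Follow next tokens until no pages remain or `limit` items are collected."""
    items, token = [], None
    while len(items) < limit:
        page, token = fetch_page(token)
        items.extend(page)
        if token is None:
            break
    return items[:limit]

# Hypothetical stand-in for an API that serves three pages of results
FAKE_PAGES = {
    None: (["t1", "t2"], "a"),
    "a": (["t3", "t4"], "b"),
    "b": (["t5"], None),
}

def fake_fetch(token):
    """Return (items, next_token) for the requested page."""
    return FAKE_PAGES[token]

print(collect_all(fake_fetch, limit=10))
print(collect_all(fake_fetch, limit=3))
```

Keeping an explicit `limit` matters with rate-limited or metered endpoints, since every extra page counts against the request quota.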