# 11.4 Twitter Scraping

In this tutorial, we will build a scraper for Twitter. Twitter offers a free API to search for historical tweets; you can find the details [here](https://developer.twitter.com/en/docs/tweets/search/overview). However, the free API only covers the last 7 days, and older data requires a paid plan. If you need a moderate amount of data, web scraping can be an alternative.

Let's start from the Twitter advanced search page: https://twitter.com/search-advanced

Suppose you want to search a specific Twitter account, for example `@realDonaldTrump`. The URL we get is https://twitter.com/search?l=&q=from%3ArealDonaldTrump&src=typd which we can decompose into four parts:
- https://twitter.com/search?
- l=
- q=from%3ArealDonaldTrump
- src=typd

The second and last parts are not needed. The important part is the third, which carries our query. We can build a more succinct URL that still works: https://twitter.com/search?q=from%3ArealDonaldTrump. It returns exactly the same content.
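
As a quick check of this decomposition, the standard library can split the URL into its query parameters and rebuild the shorter form. This is a minimal sketch: the variable names are illustrative and nothing here contacts Twitter.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

full_url = 'https://twitter.com/search?l=&q=from%3ArealDonaldTrump&src=typd'

# Split the URL and decode its query string, keeping the empty "l=" parameter
parts = urlsplit(full_url)
params = dict(parse_qsl(parts.query, keep_blank_values=True))
print(params)
# {'l': '', 'q': 'from:realDonaldTrump', 'src': 'typd'}

# Rebuild the URL with the "q" parameter only
short_query = urlencode({'q': params['q']})
short_url = urlunsplit((parts.scheme, parts.netloc, parts.path, short_query, ''))
print(short_url)
# https://twitter.com/search?q=from%3ArealDonaldTrump
```

Note that `%3A` is simply the percent-encoded form of the colon in `from:realDonaldTrump`.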

Suppose we want to extract the tweets from the page above. If we scroll in the browser, we can see more of them. But how many do we get with a standard Python request? Let's see.

In [12]:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

# Prepare url
query = quote('realDonaldTrump')
url_twitter = 'https://twitter.com/search?q=from:%s'
url = url_twitter % query
print(url)

# Get page content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')

# Recover tweets
tweets = soup.select('div[class*="js-stream-tweet"]')
len(tweets)

https://twitter.com/search?q=from:realDonaldTrump


0

Weird... we do not get any results. Why? The answer lies in the headers. Let's try again with realistic, browser-like headers.

In [13]:
# Set headers
headers = {
        "Host": "twitter.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "de,en-US;q=0.7,en;q=0.3",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://twitter.com/search-advanced/",
        "Connection": "keep-alive"}

# Get page content
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')

# Recover tweets
tweets = soup.select('div[class*="js-stream-tweet"]')
len(tweets)

20

Having good headers makes a difference!
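
One way to see what differs between the two requests, without sending anything, is to prepare both and compare their headers. This is a minimal sketch using only the `requests` library from the cells above; the names `bare` and `custom` are illustrative, and it does not establish which specific header Twitter reacts to.

```python
import requests

url = 'https://twitter.com/search?q=from:realDonaldTrump'
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
    "X-Requested-With": "XMLHttpRequest",
}

# Prepare (but do not send) the request with and without custom headers
session = requests.Session()
bare = session.prepare_request(requests.Request('GET', url))
custom = session.prepare_request(requests.Request('GET', url, headers=browser_headers))

# By default, requests announces itself as "python-requests/<version>"
print(bare.headers['User-Agent'])
print(custom.headers['User-Agent'])
```

The default User-Agent makes it easy for a site to tell a script from a browser, which is a plausible reason for the empty first response.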

We have now seen how to scrape Twitter by username. The same logic lets us work out which URL components are needed to scrape Twitter by:
- query
- user
- location

Moreover, we can add further constraints like:
- location
- timespan
- language

You just have to fill in a field on https://twitter.com/search-advanced and see which URL Twitter produces.
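
Once the components are known, assembling the URL can be wrapped in a small function. The sketch below is a hypothetical helper (`build_search_url` is not part of any library) and it only uses the `from:`, `since:` and `until:` operators that appear in this tutorial's own URLs and scraper; other constraints would need to be read off the advanced search page first.

```python
from urllib.parse import quote

def build_search_url(query='', username=None, since=None, until=None):
    """Assemble a search URL from optional parts (hypothetical helper)."""
    terms = [query] if query else []
    if username:
        terms.append('from:' + username)
    if since:
        terms.append('since:' + since)   # date as YYYY-MM-DD
    if until:
        terms.append('until:' + until)   # date as YYYY-MM-DD
    # Percent-encode the whole query so spaces and colons are URL-safe
    return 'https://twitter.com/search?q=' + quote(' '.join(terms))

print(build_search_url(username='realDonaldTrump'))
# https://twitter.com/search?q=from%3ArealDonaldTrump
print(build_search_url(username='realDonaldTrump', since='2020-01-01'))
# https://twitter.com/search?q=from%3ArealDonaldTrump%20since%3A2020-01-01
```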

Let's now recover the tweet content.

In [14]:
# Clean tweets
print(url)
tweets_clean = [tweet.select('p[class*="tweet-text"]')[0].text for tweet in tweets]
tweets_clean

https://twitter.com/search?q=from:realDonaldTrump


['GAME OVER!pic.twitter.com/yvMa6bPqfy',
 'Incredible numbers!https://twitter.com/realdonaldtrump/status/1223099086389121026\xa0…',
 'THE BEST IS YET TO COME!pic.twitter.com/SOn6wRV9Zs',
 'Thank you Iowa, I love you!https://www.pscp.tv/w/cQOMgDEzNTg4Mzh8MXZPeHdvYkxCQUx4QrT6MSzbemiKv0jhIMKEWz80dLpl7urO3cn8NWkS7psa?t=56s\xa0…',
 'Don Lemon, the dumbest man on television (with terrible ratings!).https://twitter.com/dailycaller/status/1221999373829144578\xa0…',
 'Nadler ripped final argument away from Schiff, thinks Shifty did a terrible job. They are fighting big time!https://twitter.com/julio_rosas11/status/1223090988060758021\xa0…',
 'Great poll in Iowa, where I just landed for a Big Rally! #KAG2020pic.twitter.com/4YCo01XYCn',
 'Shifty Adam Schiff is a CORRUPT POLITICIAN, and probably a very sick man. He has not paid the price, yet, for what he has done to our Country!',
 'A very good question!https://twitter.com/marklevinshow/status/1221423157749452800\xa0…',
 'Thank you Nick!https://t
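
The extracted text has links glued to the preceding word and a trailing `\xa0…` after shortened links. If cleaner text is needed, a small post-processing step can help. This is only a sketch based on the patterns visible in the output above; `tidy_tweet_text` is a hypothetical helper and other tweets may need further rules.

```python
import re

def tidy_tweet_text(text):
    """Separate glued links from the tweet text (hypothetical helper)."""
    # Drop the non-breaking space and ellipsis appended to shortened links
    text = text.replace('\xa0…', '')
    # Insert a space before a link that directly follows another character
    return re.sub(r'(?<=[^\s/])(https?://|pic\.twitter\.com/)', r' \1', text)

print(tidy_tweet_text('GAME OVER!pic.twitter.com/yvMa6bPqfy'))
# GAME OVER! pic.twitter.com/yvMa6bPqfy
```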

However, we still have one problem: scrolling. We are able to get only the first 20 tweets. How do we get more? There is a simple trick.

In [15]:
# Get min position of current search
min_position = re.findall(r'data-max-position="(\w+)', response.text)[0]
min_position

'thGAVUV0VFVBaAgLWVi4Xv'

If we use the minimum position of the current search as the maximum position of the next search, we are effectively scrolling the page.
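
The step from one page to the next can be tried offline first. In the sketch below, `next_page_url` is a hypothetical helper and the HTML fragment is hand-written; only the attribute name and the position value are taken from the run above. Unlike indexing with `[0]`, it returns `None` when no position is present.

```python
import re

def next_page_url(base_url, html):
    """Return the URL of the next page, or None if no position is found."""
    match = re.search(r'data-max-position="(\w+)', html)
    if match is None:
        return None
    return base_url + '&max_position=' + match.group(1)

# Hand-written fragment standing in for response.text
html = '<div class="stream-container" data-max-position="thGAVUV0VFVBaAgLWVi4Xv">'
url = 'https://twitter.com/search?q=from:realDonaldTrump'

print(next_page_url(url, html))
# https://twitter.com/search?q=from:realDonaldTrump&max_position=thGAVUV0VFVBaAgLWVi4Xv
print(next_page_url(url, '<div></div>'))
# None
```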

In [16]:
# Scroll down
url2 = url + '&max_position=' + min_position
print(url2)

# Get page content
response = requests.get(url2, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')

# Recover tweets
tweets = soup.select('div[class*="js-stream-tweet"]')
tweets_clean = [tweet.select('p[class*="tweet-text"]')[0].text for tweet in tweets]
tweets_clean

https://twitter.com/search?q=from:realDonaldTrump&max_position=thGAVUV0VFVBaAgLWVi4Xv


['Incredible numbers!https://twitter.com/realdonaldtrump/status/1223099086389121026\xa0…',
 'Nadler ripped final argument away from Schiff, thinks Shifty did a terrible job. They are fighting big time!https://twitter.com/julio_rosas11/status/1223090988060758021\xa0…',
 'THE BEST IS YET TO COME!pic.twitter.com/SOn6wRV9Zs',
 'Americans across the political spectrum are disgusted by the Washington Democrats’ Partisan Hoaxes, Witch Hunts, & Con Jobs. Registered Democrats and Independents are leaving the Democrat Party in droves, & we are welcoming these voters to the Republican Party w/ wide open arms!pic.twitter.com/UCdQXY3vPn',
 'To keep America Safe, we have fully rebuilt the U.S. Military – it is now stronger, more powerful, and more lethal than ever before. Thanks to the courage of American Heroes, the ISIS Caliphate has been DESTROYED & its founder & leader – the animal known as al-Baghdadi – is DEAD!pic.twitter.com/9LXDf6mJKf',
 'This November, we are going to defeat the Radical Soc

Awesome! We can now write a small Twitter scraper.

In [17]:
import re
import requests
import datetime
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import quote

In [18]:
# Scraping function
def scrape_twitter(username, since=False, until=False, n=100):
    
    twitter_url = 'https://twitter.com/search?q=%s&max_position='

    # Add options
    options = ' from:' + username
    if since:
        options += ' since:' + since
    if until:
        options += ' until:' + until
    url = twitter_url % quote(options)

    # Scrape data
    active = True
    max_position = ''
    df_tweets = pd.DataFrame()

    print(url+max_position)

    while active:
        df, max_position = get_data(url+max_position)
        df_tweets = pd.concat([df_tweets, df], sort=False)  # DataFrame.append was removed in pandas 2.0
        if len(df_tweets) >= n or not max_position:
            active = False
            
    return df_tweets
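
The stopping rule of this loop can be checked without any network access. The sketch below replaces `get_data` with a hypothetical stand-in (`fake_get_data`, backed by made-up pages) and keeps the same condition: stop once at least `n` items are collected or no further position is returned.

```python
# Made-up pages: position -> (tweets on that page, position of the next page)
FAKE_PAGES = {
    '': (['tweet 1', 'tweet 2'], 'posA'),
    'posA': (['tweet 3', 'tweet 4'], 'posB'),
    'posB': (['tweet 5'], ''),
}

def fake_get_data(position):
    """Hypothetical stand-in for get_data, reading from FAKE_PAGES."""
    return FAKE_PAGES[position]

def collect(n):
    collected, position = [], ''
    while True:
        tweets, position = fake_get_data(position)
        collected.extend(tweets)
        # Same condition as in scrape_twitter
        if len(collected) >= n or not position:
            break
    return collected

print(len(collect(3)))    # 4: pages are always added whole
print(len(collect(100)))  # 5: the last page has no next position
```

Because pages are appended whole, the result can contain more than `n` items; with the DataFrame version, `df_tweets.head(n)` would trim it to exactly `n` rows.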

In [19]:
# Get data module
def get_data(url):

    # Set headers
    headers = {
        "Host": "twitter.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "de,en-US;q=0.7,en;q=0.3",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": url,
        "Connection": "keep-alive"}

    # Get data
    response = requests.get(url, headers=headers)

    # Clean data
    df = clean_data(response)
    positions = re.findall(r'data-max-position="(\w+)', response.text)
    min_position = positions[0] if positions else ''  # '' ends the scraping loop
    print(".", end='')

    return df, min_position

In [20]:
# Clean data
def clean_data(response):

    soup = BeautifulSoup(response.text, 'html5lib')
    tweets = soup.select('div[class*="js-stream-tweet"]')
    df = pd.DataFrame()

    for tweet in tweets:
        temp_df = pd.DataFrame(index=[0])
        temp_df['user_name'] = tweet.attrs['data-name']
        date = re.findall('data-time="(\\d*)"', str(tweet))[0]
        temp_df['date'] = datetime.datetime.fromtimestamp(int(date))
        temp_df['text'] = tweet.select('p[class*="tweet-text"]')[0].text
        stat = re.findall(r'data-tweet-stat-count="(\d*)"', str(tweet))
        temp_df['comments'] = int(stat[0])
        temp_df['retweets'] = int(stat[1])
        temp_df['likes'] = int(stat[2])
        temp_df['mentions'] = ", ".join(re.findall('(@\\w*)', tweet.text))
        temp_df['hashtags'] = ", ".join(re.findall('(#\\w*)', tweet.text))
        df = pd.concat([df, temp_df], sort=False)  # DataFrame.append was removed in pandas 2.0

    return df
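
The regular expressions used in `clean_data` can be tried on a small fragment before running them on real pages. This is a sketch under assumptions: the fragment is hand-written to imitate the attributes the function reads, and the timestamp is made up. It also converts the timestamp with an explicit UTC timezone, whereas `fromtimestamp` without a timezone uses the local one.

```python
import re
import datetime

# Hand-written fragment imitating the attributes read by clean_data
fragment = (
    '<div class="tweet js-stream-tweet" data-name="Example User">'
    '<span data-time="1580484805"></span>'
    '<p class="tweet-text">Thanks @someone! #Example</p>'
    '<span data-tweet-stat-count="12"></span>'
    '<span data-tweet-stat-count="34"></span>'
    '<span data-tweet-stat-count="56"></span>'
    '</div>'
)

# Unix timestamp -> timezone-aware datetime
timestamp = int(re.findall(r'data-time="(\d+)"', fragment)[0])
date_utc = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)

# The three counters appear in the order comments, retweets, likes
stats = re.findall(r'data-tweet-stat-count="(\d+)"', fragment)
comments, retweets, likes = (int(s) for s in stats)

text = re.findall(r'<p class="tweet-text">(.*?)</p>', fragment)[0]
mentions = re.findall(r'@\w+', text)
hashtags = re.findall(r'#\w+', text)

print(date_utc.isoformat())  # 2020-01-31T15:33:25+00:00
print(comments, retweets, likes, mentions, hashtags)

# A link glued to a hashtag leaks into the match
glued = re.findall(r'#\w+', 'Big Rally! #KAG2020pic.twitter.com/4YCo01XYCn')
print(glued)  # ['#KAG2020pic']
```

The last line shows a limitation worth keeping in mind: when a link follows a hashtag without a space, part of the link ends up in the extracted hashtag.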

In [21]:
# Scrape first 100 tweets of Donald Trump in 2020
username = 'realDonaldTrump'
since = '2020-01-01'
df = scrape_twitter(username=username, since=since)
df

https://twitter.com/search?q=%20from%3ArealDonaldTrump%20since%3A2020-01-01&max_position=
.....

| | user_name | date | text | comments | retweets | likes | mentions | hashtags |
|---|---|---|---|---|---|---|---|---|
| 0 | Donald J. Trump | 2020-01-31 15:33:25 | Incredible numbers!https://twitter.com/realdon... | 2468 | 7513 | 29688 | @realDonaldTrump, @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-31 15:25:15 | Nadler ripped final argument away from Schiff,... | 4455 | 8406 | 29144 | @realDonaldTrump, @Julio_Rosas11 | |
| 0 | Donald J. Trump | 2020-01-31 05:22:00 | THE BEST IS YET TO COME!pic.twitter.com/SOn6wR... | 9240 | 21057 | 78618 | @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-31 04:16:35 | Americans across the political spectrum are di... | 9338 | 16767 | 60148 | @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-31 04:04:06 | To keep America Safe, we have fully rebuilt th... | 3646 | 11960 | 47817 | @realDonaldTrump | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | Donald J. Trump | 2020-01-27 06:18:33 | I NEVER told John Bolton that the aid to Ukrai... | 44248 | 29265 | 123514 | @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-27 00:54:34 | .....Melania and I send our warmest condolence... | 6357 | 33875 | 261024 | @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-26 21:39:07 | Nothing done wrong, READ THE TRANSCRIPTS! | 19483 | 18156 | 110234 | @realDonaldTrump | |
| 0 | Donald J. Trump | 2020-01-26 16:06:45 | “Again: Read the Transcript!” Michael Goodwin,... | 7953 | 12146 | 56791 | @realDonaldTrump | |


We made it! Now we have a nice short script to scrape Twitter!
