# Scraping historical tweets without a Twitter Developer Account

The tool we will use:
- snscrape

What you need: 
- Python 3.8 or newer

What you don't need:
- a Twitter Developer Account


For a research project on public discourse about the results of international large-scale assessments, I needed to scrape historical tweets going all the way back to the beginning of Twitter. This is how I discovered **snscrape**, a wonderful tool that is easy to set up and use.

I didn't find snscrape right away. Initially I was reading through the intricate details of the Twitter Developer Account: the application procedure, the different levels of access, the limits and so on. Luckily, a friend recommended snscrape, and suddenly the task of collecting tweets became extremely easy.

Snscrape is a popular tool among social scientists for collecting tweets, at least in 2021. Apparently, it bypasses several limitations of the Twitter API.  
The best part is that you don't need Twitter developer account credentials (like you do with <a href='https://www.tweepy.org/'>Tweepy</a>, for example).


## Table of contents


1. [Installing snscrape](#1.-Installing-snscrape)
2. [How to use snscrape](#2.-How-to-use-snscrape)
3. [Calling snscrape CLI commands from Python Notebook](#3.-Calling-snscrape-CLI-commands-from-Python-Notebook)
4. [Using snscrape Python wrapper](#4.-Using-snscrape-Python-wrapper)
5. [Tweets meta-information gathered with snscrape](#5.-Tweets-meta-information-gathered-with-snscrape) 
6. [Dataset manipulation: JSON, CSV and Pandas DataFrame](#6.-Dataset-manipulation:-JSON,-CSV-and-Pandas-DataFrame)
7. [Basic exploration of our collected dataset of tweets](#7.-Basic-exploration-of-our-collected-dataset-of-tweets)
8. [Bonus: Publishing your Jupyter Notebook on Medium](#8.-Bonus:-Publishing-your-Jupyter-Notebook-on-Medium)
9. [What next ? Sentiment analysis](#9.-What-next-?-Sentiment-analysis)

We begin with some standard library imports.

In [2]:
import os
import subprocess

import json
import csv

import uuid

from IPython.display import display_javascript, display_html, display

import pandas as pd
import numpy as np

from datetime import datetime, date, time

## 1. Installing snscrape

Snscrape is available from its <a href='https://github.com/JustAnotherArchivist/snscrape'>official github project repository</a>.

Snscrape has two versions:
- the **released version**, which you can install by running this line in a command-line terminal: **pip3 install snscrape**
- the **development version**, which is said to have richer functionality.

I will use the latter.

First, let's check the current Python version, as the snscrape documentation mentions that **it requires Python 3.8 or higher**.

In [8]:
from platform import python_version
print(python_version())

3.8.3


If you see a version lower than 3.8, please upgrade your Python version before you continue this tutorial, otherwise you will **not be able to install snscrape**.
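
The same check can be done programmatically, so that a notebook reports the problem early instead of failing during installation. The following is a minimal sketch; the helper name `is_supported` is my own and is not part of snscrape.

```python
import sys

# Minimum version, taken from the snscrape requirement mentioned above.
MIN_VERSION = (3, 8)

def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

if not is_supported():
    print("Python", sys.version.split()[0], "is too old for snscrape; 3.8 or newer is needed.")
```

Comparing tuples rather than version strings avoids the trap where "3.10" sorts before "3.8" alphabetically.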

Installing the development version of snscrape.

In [9]:
pip install git+https://github.com/JustAnotherArchivist/snscrape.git

Collecting git+https://github.com/JustAnotherArchivist/snscrape.git
  Cloning https://github.com/JustAnotherArchivist/snscrape.git to c:\users\temp\appdata\local\temp\pip-req-build-0047h652
Building wheels for collected packages: snscrape
  Building wheel for snscrape (setup.py): started
  Building wheel for snscrape (setup.py): finished with status 'done'
  Created wheel for snscrape: filename=snscrape-0.3.5.dev98+gf64ce21-py3-none-any.whl size=50801 sha256=3f93a36eae72f4482ca953a1556968c0929e9301c71e64d460992144c5fe3494
  Stored in directory: C:\Users\Temp\AppData\Local\Temp\pip-ephem-wheel-cache-yk57j11p\wheels\92\42\87\33fa9b18f7a75d02643a9ca3743339aec9be28c6796267c7d8
Successfully built snscrape
Note: you may need to restart the kernel to use updated packages.


  Running command git clone -q https://github.com/JustAnotherArchivist/snscrape.git 'C:\Users\Temp\AppData\Local\Temp\pip-req-build-0047h652'


In [10]:
import snscrape.modules.twitter as sntwitter

## 2. How to use snscrape

There are three ways to use snscrape:
- through its command line interface (CLI) in a terminal
- from Python, by running the CLI commands from a Jupyter notebook, for example (if you don't want to use the terminal to run commands)
- through the snscrape Python wrapper, which unfortunately is not well documented

Parameters you can use with the CLI:
- --jsonl : output the data in JSONL format (one JSON object per line)
- --max-results : limit the number of tweets to collect
- --with-entity : Include the entity (e.g. user, channel) as the first output item (default: False)
- --since DATETIME : Only return results newer than DATETIME (default: None)
- --progress : Report progress on stderr (default: False)
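
To illustrate how these parameters fit together, here is a minimal sketch that assembles a command as a list of arguments. The function `build_snscrape_command` is a hypothetical helper of mine, not something snscrape provides; it only uses the flags listed above and the `twitter-search` scraper.

```python
import shlex

def build_snscrape_command(query, max_results=None, since=None, jsonl=True, progress=False):
    """Assemble the argument list for a `snscrape ... twitter-search` call."""
    args = ["snscrape"]
    if max_results is not None:
        args += ["--max-results", str(max_results)]
    if jsonl:
        args.append("--jsonl")
    if progress:
        args.append("--progress")
    if since is not None:
        args += ["--since", since]
    # The scraper name comes last, followed by the text to search for.
    args += ["twitter-search", query]
    return args

command = build_snscrape_command("#pisa2018 lang:fr", max_results=100, since="2018-12-01")
# shlex.join shows the command as it would be typed in a POSIX shell.
print(shlex.join(command))
```

A list like this can be passed directly to `subprocess.run(command, capture_output=True, text=True)`, which avoids shell quoting issues with characters such as `#`.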

In [11]:
#Run the snscrape help to see what options / parameters we can use
cmd = 'snscrape --help'

#This is similar to running os.system(cmd), which would show the command's output in the terminal
#window from which I started Jupyter Notebook (which is what I used to develop this code).
#By using subprocess, I capture the command's output in a variable, whose content I can then print here.
output = subprocess.check_output(cmd, shell=True)

print(output.decode("utf-8"))

usage: snscrape [-h] [--version] [-v] [--dump-locals] [--retry N] [-n N]
                [-f FORMAT | --jsonl] [--with-entity] [--since DATETIME]
                [--progress]
                {telegram-channel,vkontakte-user,weibo-user,facebook-group,instagram-user,instagram-hashtag,instagram-location,reddit-user,reddit-subreddit,reddit-search,twitter-thread,twitter-search,facebook-user,facebook-community,twitter-user,twitter-hashtag,twitter-list-posts,twitter-profile}
                ...

positional arguments:
  {telegram-channel,vkontakte-user,weibo-user,facebook-group,instagram-user,instagram-hashtag,instagram-location,reddit-user,reddit-subreddit,reddit-search,twitter-thread,twitter-search,facebook-user,facebook-community,twitter-user,twitter-hashtag,twitter-list-posts,twitter-profile}
                        The scraper you want to use

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and e

## 3. Calling snscrape CLI commands from Python Notebook

Notice I make use of a few snscrape parameters:  
- --max-results, to limit the search
- --jsonl, to get the results in JSON Lines format, which I redirect to a file
- --since yyyy-mm-dd, to collect tweets starting from this date
- twitter-search, the scraper to use, followed by the actual text to search for.  
    Notice I use 'until:yyyy-mm-dd'. This is a workaround for the fact that snscrape does not have an --until DATETIME parameter.  
    So I'm using the <strong>until</strong> operator, which is already built into Twitter search.  
    For more <strong>search operators</strong> that you can use and pass on to snscrape as part of the text to search for, see the <a href='https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators'>Twitter documentation on search operators</a>.
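
Since the search operators are just part of the query text, a small helper can keep them tidy. This is a sketch under my own naming (`build_query` is hypothetical); it only joins strings and covers the operators used in this tutorial.

```python
def build_query(text, since=None, until=None, lang=None, exclude=()):
    """Join a search text with Twitter search operators into one query string."""
    parts = [text]
    if lang:
        parts.append(f"lang:{lang}")
    if since:
        parts.append(f"since:{since}")
    if until:
        parts.append(f"until:{until}")
    # Each excluded filter becomes a negated operator such as -filter:replies.
    parts.extend(f"-filter:{name}" for name in exclude)
    # Operators must be separated by spaces, otherwise they merge with the text.
    return " ".join(parts)

query = build_query("#pisa2018", lang="fr", until="2019-12-31")
print(query)  # #pisa2018 lang:fr until:2019-12-31
```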

In [12]:
json_filename = 'pisa2018-query-tweets.json'

#Using the OS library to call CLI commands in Python
os.system(f'snscrape --max-results 5000 --jsonl --progress --since 2018-12-01 twitter-search "#pisa2018 lang:fr until:2019-12-31" > {json_filename}')

0

## 4. Using snscrape Python wrapper

In [13]:
start = date(2016, 12, 5)
start = start.strftime('%Y-%m-%d')

stop = date(2016, 12, 14)
stop = stop.strftime('%Y-%m-%d')

keyword = 'pisa2018'

In [15]:
maxTweets = 1000

#We are going to write the data into a csv file
filename = keyword + start + '-' + stop + '.csv'

#I will use the following Twitter search operators:
# since: - start date for Tweets collection
# until: - stop date for Tweets collection
# -filter:links - not very clear what this does, from Twitter search operators documentation: https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators
#                 but it looks like this will exclude tweets with links from the search results
# -filter:replies - removes @reply tweets from search results
#Note the space before 'since:'. Without it, the operator would be glued to the keyword.
query = keyword + ' since:' + start + ' until:' + stop + ' -filter:links -filter:replies'

#The with block closes the file even if scraping fails.
#Mode 'w' avoids writing a second header row when the cell is run again.
with open(filename, 'w', newline='', encoding='utf8') as csvFile:
    #We write to the csv file by using csv writer
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['id','date','tweet'])

    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        #Stop once exactly maxTweets tweets have been written
        if i >= maxTweets:
            break
        csvWriter.writerow([tweet.id, tweet.date, tweet.content])
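
An alternative to counting with `enumerate` and `break` is `itertools.islice`, which caps any iterator at a fixed number of items. The sketch below uses a made-up generator, `fake_items`, as a stand-in for a scraper's `get_items()`, so it runs without network access.

```python
from itertools import islice

def take(items, limit):
    """Return an iterator over at most `limit` items from `items`."""
    return islice(items, limit)

def fake_items():
    # Hypothetical stand-in for a scraper's get_items() generator.
    n = 0
    while True:
        yield {"id": n, "content": f"tweet {n}"}
        n += 1

first_three = list(take(fake_items(), 3))
print([item["id"] for item in first_three])  # [0, 1, 2]
```

Because `islice` is lazy, it stops pulling from the underlying generator as soon as the limit is reached, so no extra items are requested.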

## 5. Tweets meta-information gathered with snscrape

Let's have a look at all the information that is available for every single tweet scraped using snscrape.  

For this code I am using an example file that I made precisely for this purpose, which contains a single JSON object. If you want to use a JSON file created with the steps above, you need to make some changes before you can run json.load on it, because that file contains one JSON object per line rather than a single JSON document, as explained in <a href='https://stackoverflow.com/questions/21058935/python-json-loads-shows-valueerror-extra-data'>this stackoverflow discussion</a>.
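
One way to handle a file with one JSON object per line is to parse it line by line. This is a minimal sketch; `read_json_lines` is a hypothetical helper, and the sample data is made up rather than real snscrape output.

```python
import json
import os
import tempfile

def read_json_lines(path):
    """Parse a file holding one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Tiny made-up sample standing in for a file produced with --jsonl.
sample = '{"id": 1, "content": "first"}\n{"id": 2, "content": "second"}\n'
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

records = read_json_lines(tmp.name)
os.remove(tmp.name)
print(len(records))  # 2
```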

The solution for pretty printing JSON data inside a Jupyter Notebook comes from <a href='https://gist.github.com/nerevar/a068ee373e22391ad3a1413b3e554fb5'>this github project</a>.

Click on the + icons to expand the contents of that particular item.

In [16]:
filename = 'example.json'
  
with open(filename) as json_file:
    data = json.load(json_file)

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict) or isinstance(json_data, list):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;font: 12px/18px monospace !important;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
            renderjson.set_show_to_level(2);
            document.getElementById('%s').appendChild(renderjson(%s))
        });
      """ % (self.uuid, self.json_str), raw=True)

RenderJSON([data])

## 6. Dataset manipulation: JSON, CSV and Pandas DataFrame

### Converting JSON to Pandas DataFrame

Pandas DataFrame is **the** data structure of choice in Data Science, so we read the JSON file into a DataFrame.  

Then we save it as CSV, since CSV is the most common file type for small Data Science projects.

In [5]:
filename = 'pisa2018-query-tweets'
tweets_df = pd.read_json(filename +'.json', lines=True)

In [18]:
tweets_df.shape

(327, 23)

In [19]:
tweets_df.head(3)

| | url | date | content | renderedContent | id | user | outlinks | tcooutlinks | replyCount | retweetCount | ... | lang | source | sourceUrl | sourceLabel | media | retweetedTweet | quotedTweet | mentionedUsers | coordinates | place |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://twitter.com/Netcole_fr/status/12102181... | 2019-12-26 15:17:35+00:00 | Le périscolaire pour apprendre en s'amusant ..... | Le périscolaire pour apprendre en s'amusant ..... | 1210218104510386177 | `{'username': 'Netcole_fr', 'displayname': 'Net...` | `[https://www.amazon.fr/dp/1686530544]` | `[https://t.co/cC28XiWfc7]` | 0 | 0 | ... | fr | `<a href="https://mobile.twitter.com" rel="nofo...` | https://mobile.twitter.com | Twitter Web App | | | | | | |
| 1 | https://twitter.com/HugoLamanna/status/1208736... | 2019-12-22 13:08:48+00:00 | La #Chine première du classement #PISA2018, pa... | La #Chine première du classement #PISA2018, pa... | 1208736143585554432 | `{'username': 'HugoLamanna', 'displayname': 'La...` | `[]` | `[]` | 1 | 0 | ... | fr | `<a href="http://twitter.com/download/android" ...` | http://twitter.com/download/android | Twitter for Android | | | | | | |
| 2 | https://twitter.com/OSC_SciencesPo/status/1207... | 2019-12-18 16:30:45+00:00 | Le système scolaire français au prisme des com... | Le système scolaire français au prisme des com... | 1207337415486136323 | `{'username': 'OSC_SciencesPo', 'displayname': ...` | `[https://www.sciencespo.fr/liepp/fr/content/le...` | `[https://t.co/l93v8cALQh]` | 0 | 1 | ... | fr | `<a href="https://mobile.twitter.com" rel="nofo...` | https://mobile.twitter.com | Twitter Web App | | | | `[{'username': 'LIEPP_ScPo', 'displayname': 'LI...` | | |


### Saving DataFrame to CSV

In [20]:
tweets_df.to_csv(filename +'.csv', index = False)
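
The `user` column in the preview above holds a nested dictionary, which ends up in the CSV as its string form. If you prefer flat columns, here is a minimal sketch using only the standard library and made-up records; `flatten_record` is a hypothetical helper, and the choice of fields to keep is an assumption for illustration.

```python
import csv
import io

def flatten_record(record):
    """Copy selected top-level fields and lift user.username to the top level."""
    user = record.get("user") or {}
    return {
        "id": record.get("id"),
        "date": record.get("date"),
        "content": record.get("content"),
        "username": user.get("username"),
    }

# Made-up records shaped like the rows in the preview above.
records = [
    {"id": 1, "date": "2019-12-26", "content": "hello",
     "user": {"username": "alice", "displayname": "Alice"}},
    {"id": 2, "date": "2019-12-22", "content": "world", "user": None},
]

# Writing to an in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "date", "content", "username"])
writer.writeheader()
writer.writerows(flatten_record(r) for r in records)
print(buffer.getvalue())
```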

## 7. Basic exploration of our collected dataset of tweets

### Basic introduction to tweets

Tweets are messages of up to 280 characters (hence the term 'microblogging'). Just like on other social media platforms, you need to create an account before you can start participating in the tweetverse.  

Tweets act as short status updates. Tweets appear on timelines. Timelines are collections of tweets sorted in chronological order. On your account's home page, you're shown a timeline that displays tweets from the people you follow.

You can post your own brand new tweet, retweet an existing tweet (which means you just share the exact same tweet) or quote an existing tweet (similar to retweeting, but you can add your own comment to it).

You can also reply to someone else's tweets or 'like' them.  

Tweets often contain **entities**, which are mentions of:
- other users, which appear in the form of @other_user
- places
- urls
- media that was attached to the tweet
- hashtags, that look like #example_hashtag. Hashtags are just a way to apply a label on a tweet. If I'm tweeting something about results of PISA, the Programme for International Student Assessment, I will likely use #oecdpisa in my tweet, for example.
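
To make the idea of entities concrete, here is a minimal sketch that pulls hashtags and mentions out of a tweet's text with regular expressions. The patterns are deliberately simplified assumptions and do not reproduce Twitter's own parsing rules (for example, they would also match the part after `@` in an e-mail address).

```python
import re

# Simplified patterns: a marker followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def extract_entities(text):
    """Return the hashtags and mentions found in `text`, without their markers."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
    }

sample = "#PISA2018 : la France au dessus de la moyenne des pays de l'@OCDE_fr"
entities = extract_entities(sample)
print(entities)
```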

### Counting the number of Tweets we scraped  

The following cell is overkill in this particular scenario, but imagine you just scraped 1 million tweets and you want to know how many you got. The cell below is a very efficient way to count in that case. 

In [21]:
#The with block makes sure the file is closed after counting
with open(json_filename) as f:
    num = sum(1 for line in f)
print(num)

327


### Check tweets for a particular text

In [22]:
substring = 'justesse'

count = 0
#Each line of the file is one tweet, so we can search line by line
with open(json_filename, 'r') as f:
    for line in f:
        if substring in line:
            count = count + 1
            obj = json.loads(line)
            print(f'Tweet number {count}: {obj["content"]}')
print(count)

Tweet number 1: 💼#PISA2018 : la France de justesse au dessus de la moyenne des pays de l'@OCDE_fr, des #inégalités sociales toujours importantes, il est relevé chez les élèves la critique des conditions de la scolarité, l'autocensure...
https://t.co/HafH4Kcnn5
Lire ici : https://t.co/1CQq2yPE7y
1


The actual content of the tweet is available through tweets_df['content'] or tweets_df.content.  

renderedContent seems to contain the same information as content.

In [23]:
tweets_df.iloc[0].content

"Le périscolaire pour apprendre en s'amusant ... #pisa2018\n#UnPlanBpourLécole : https://t.co/cC28XiWfc7"

Links mentioned in the tweet are also listed separately in the outlinks column.

In [24]:
tweets_df.iloc[0].outlinks

['https://www.amazon.fr/dp/1686530544']

We can gauge the popularity of a tweet through these features:
- replyCount
- retweetCount
- likeCount
- quoteCount
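
As an illustration of how these four counters could be combined, the sketch below ranks a few made-up tweets by the plain sum of their counts. The unweighted sum and the name `engagement` are assumptions of mine; other weightings may suit your research question better.

```python
COUNT_FIELDS = ("replyCount", "retweetCount", "likeCount", "quoteCount")

def engagement(tweet):
    """Sum the four counters, treating missing or null values as zero."""
    return sum(tweet.get(field) or 0 for field in COUNT_FIELDS)

# Made-up tweets; only the counters matter for this example.
tweets = [
    {"id": 1, "replyCount": 0, "retweetCount": 0, "likeCount": 0, "quoteCount": 0},
    {"id": 2, "replyCount": 1, "retweetCount": 5, "likeCount": 9, "quoteCount": None},
    {"id": 3, "replyCount": 2, "retweetCount": 1, "likeCount": 3, "quoteCount": 0},
]

ranked = sorted(tweets, key=engagement, reverse=True)
print([t["id"] for t in ranked])  # [2, 3, 1]
```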

In [25]:
popularity_columns = ['replyCount', 'retweetCount', 'likeCount', 'quoteCount']
tweets_df.iloc[0][popularity_columns]

replyCount      0
retweetCount    0
likeCount       0
quoteCount      0
Name: 0, dtype: object

Find the most retweeted tweet in our dataset.

In [26]:
tweets_df.iloc[tweets_df.retweetCount.idxmax()][['content','retweetCount']]

content         #PISA2018\nLa France médiocrement classée dans...
retweetCount                                                  161
Name: 103, dtype: object

## 8. Bonus: Publishing your Jupyter Notebook on Medium

I published this notebook to Medium as a draft, directly from Jupyter, using the jupyter_to_medium package.

In [27]:
pip install jupyter_to_medium

Note: you may need to restart the kernel to use updated packages.


In [28]:
import jupyter_to_medium as jtm
jtm.publish('Scraping historical tweets without a Twitter Developer Account.ipynb',
            integration_token='YOUR_MEDIUM_INTEGRATION_TOKEN',  #keep your real token private
            pub_name=None,
            title='Scraping historical tweets without a Twitter Developer Account',
            tags=['scraping with Python', 'Twitter archive'],
            publish_status='draft',
            notify_followers=False,
            license='all-rights-reserved',
            canonical_url=None,
            chrome_path=None,
            save_markdown=False,
            table_conversion='chrome'
            )

  warn("Your element with mimetype(s) {mimetypes}"


loading image to medium


Image Storage Information from Medium
-------------------------------------

[
    {
        "data": {
            "url": "https://cdn-images-1.medium.com/proxy/1*9fYa3dFxd2p345blmwjnuQ.png",
            "md5": "9fYa3dFxd2p345blmwjnuQ"
        }
    }
]



Successfully posted to Medium!!!
--------------------------------
id                  79a2c61f76ab
title               Scraping historical tweets without a Twitter Developer Account
authorId            10c72e180be4b304a8c49f0e230277f5e2b89dd743592deaa2d8bdc3886f216fe
url                 https://medium.com/@mihaelagrigore/79a2c61f76ab
canonicalUrl        
publishStatus       draft
license             all-rights-reserved
licenseUrl          https://policy.medium.com/medium-terms-of-service-9db0094a1e0f
tags                ['scraping-with-python', 'twitter-archive']


{'data': {'id': '79a2c61f76ab',
  'title': 'Scraping historical tweets without a Twitter Developer Account',
  'authorId': '10c72e180be4b304a8c49f0e230277f5e2b89dd743592deaa2d8bdc3886f216fe',
  'url': 'https://medium.com/@mihaelagrigore/79a2c61f76ab',
  'canonicalUrl': '',
  'publishStatus': 'draft',
  'license': 'all-rights-reserved',
  'licenseUrl': 'https://policy.medium.com/medium-terms-of-service-9db0094a1e0f',
  'tags': ['scraping-with-python', 'twitter-archive']}}

And that's about it for a quick intro to scraping tweets without applying for a Twitter Developer Account, and with no limitations on the maximum number of tweets we can get or on how far back in time we can go.

## 9. What next ? Sentiment analysis

What should you do next with the tweets you just scraped? In my case, I was very interested in <a href='https://www.kaggle.com/mishki/twitter-sentiment-analysis-using-nlp-techniques'>NLP for sentiment analysis of tweets</a>. You could also try topic modelling using Latent Dirichlet Allocation (LDA), or build a network graph from this data and apply network analysis methods to it.