# I. DATA COLLECTION 📊

Data collection is inspired by the blog post [Scraping historical tweets without a Twitter Developer Account](https://mihaelagrigore.medium.com/scraping-historical-tweets-without-a-twitter-developer-account-79a2c61f76ab).

## 1. Import necessary libraries

In [1]:
import os
import subprocess

import json
import csv

import uuid

from IPython.display import display_javascript, display_html, display

import pandas as pd
import numpy as np

from datetime import datetime, date, time

## 2. Install snscrape and import sntwitter

In [2]:
pip install git+https://github.com/JustAnotherArchivist/snscrape.git

Collecting git+https://github.com/JustAnotherArchivist/snscrape.git
  Cloning https://github.com/JustAnotherArchivist/snscrape.git to c:\users\louis\appdata\local\temp\pip-req-build-_e845o89
  Resolved https://github.com/JustAnotherArchivist/snscrape.git to commit 0d824ab77334ed4ab6250e5e491171afeccfb298
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: snscrape
  Building wheel for snscrape (pyproject.toml): started
  Building wheel for snscrape (pyproject.toml): finished with status 'done'
  Created wheel for snscrape: filename=snscrape-0.6.2.20230321.dev50+g0d824ab

  Running command git clone --filter=blob:none --quiet https://github.com/JustAnotherArchivist/snscrape.git 'C:\Users\louis\AppData\Local\Temp\pip-req-build-_e845o89'

[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: C:\Users\louis\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [3]:
import snscrape.modules.twitter as sntwitter

## 3. Write our query

We want to collect tweets about Ukraine.


Parameters of the command below:

- `--jsonl`: sets the output format to JSONL (JSON Lines), where each line of the file is a valid JSON object. It is commonly used to store a large number of records, such as tweets, compactly.

- `--max-results`: sets the maximum number of results to retrieve.

- `--progress`: reports progress while scraping.

- `--since` and `until:`: delimit the time window of the tweets. `--since` is a command-line option, while `until:` is a search operator placed inside the query string.

- `lang:`: a search operator inside the query string that restricts results to tweets in a given language (here `en`).

- `twitter-search "ukraine ..."`: selects the Twitter search scraper and gives the search query. Here it searches for tweets containing the keyword "ukraine".

- `> ukraine-query-tweets.json`: redirects the output of the command to the given file. The retrieved tweets are saved in this file in JSONL format.
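
To make the JSONL format concrete, here is a minimal sketch that writes two records and reads them back with the standard `json` module. The records are made up for illustration; real snscrape output has many more fields per line.

```python
import io
import json

# Two made-up records standing in for tweets (hypothetical data).
records = [
    {"id": 1, "content": "first example tweet"},
    {"id": 2, "content": "second example tweet"},
]

# JSONL: one JSON object per line, separated by newlines.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Reading back: each non-empty line is parsed independently.
buffer.seek(0)
parsed = [json.loads(line) for line in buffer if line.strip()]
print(parsed[1]["content"])  # second example tweet
```

Because every line is a complete object, a file like this can be processed one record at a time without loading it all into memory.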

In [4]:
json_filename = 'ukraine-query-tweets.json'

# Use the os library to call the snscrape CLI from Python; a return value of 0 means success
os.system(f'snscrape --max-results 200000 --jsonl --progress --since 2022-12-01 twitter-search "ukraine lang:en until:2023-06-06" > {json_filename}')

0
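
The same command could also be launched with `subprocess.run` and an argument list, which avoids shell quoting problems and raises an error on a non-zero exit code. This is only a sketch, not the approach used above: `build_snscrape_command` and `run_scrape` are hypothetical helpers, and the flags are the ones already present in the `os.system` call.

```python
import subprocess


def build_snscrape_command(query, max_results, since):
    """Return the snscrape CLI call as an argument list (hypothetical helper)."""
    return [
        "snscrape",
        "--max-results", str(max_results),
        "--jsonl",
        "--progress",
        "--since", since,
        "twitter-search", query,
    ]


def run_scrape(command, output_path):
    """Run the command and write its stdout to output_path (hypothetical helper)."""
    with open(output_path, "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError if snscrape exits with an error.
        subprocess.run(command, stdout=out, check=True)


command = build_snscrape_command(
    query="ukraine lang:en until:2023-06-06",
    max_results=200000,
    since="2022-12-01",
)
print(command)
# To actually scrape (requires snscrape to be installed and network access):
# run_scrape(command, "ukraine-query-tweets.json")
```

Passing the query as a single list element means the spaces and colons inside it need no extra quoting.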

## 4. JSON and CSV conversions

In [5]:
filename = 'ukraine-query-tweets'
tweets_df = pd.read_json(filename +'.json', lines=True)
tweets_df.shape

(200000, 40)
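
For files much larger than this one, `pd.read_json` can also read JSONL in chunks when `lines=True` and `chunksize` are both given. The sketch below uses a small temporary file with made-up records; the chunk size of 2 is arbitrary, and the context-manager form assumes a reasonably recent pandas version.

```python
import json
import os
import tempfile

import pandas as pd

# Write a tiny JSONL file with made-up records (hypothetical data).
rows = [{"id": i, "retweetCount": i * 10} for i in range(5)]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")
    tmp_path = tmp.name

# With lines=True and chunksize set, read_json yields DataFrames of at most
# `chunksize` rows instead of loading the whole file at once.
total_rows = 0
with pd.read_json(tmp_path, lines=True, chunksize=2) as reader:
    for chunk in reader:
        total_rows += len(chunk)

os.remove(tmp_path)
print(total_rows)  # 5
```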

In [6]:
tweets_df.to_csv(filename +'.csv', index = False)

In [8]:
import pandas as pd

tweets_df = pd.read_csv(filename +'.csv')

display(tweets_df)

Unnamed: 0,_type,url,date,rawContent,renderedContent,id,user,replyCount,retweetCount,likeCount,...,bookmarkCount,pinned,editState,content,outlinks,outlinksss,tcooutlinks,tcooutlinksss,username,_snscrape
0,snscrape.modules.twitter.Tweet,https://twitter.com/PeterKropotki16/status/166...,2023-06-05 23:59:58+00:00,To understand RU's strategy regarding Ukraine ...,To understand RU's strategy regarding Ukraine ...,1665871111735705600,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,To understand RU's strategy regarding Ukraine ...,[],,[],,PeterKropotki16,0.6.2.20230321.dev50+g0d824ab
1,snscrape.modules.twitter.Tweet,https://twitter.com/aixlachapelle/status/16658...,2023-06-05 23:59:57+00:00,@SecBlinken @TiranaHassan @hrw Human right sho...,@SecBlinken @TiranaHassan @hrw Human right sho...,1665871111152873472,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,1,1,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,@SecBlinken @TiranaHassan @hrw Human right sho...,[],,[],,aixlachapelle,0.6.2.20230321.dev50+g0d824ab
2,snscrape.modules.twitter.Tweet,https://twitter.com/No_Suffer_Fools/status/166...,2023-06-05 23:59:57+00:00,@TruthHunter7778 @moravian63 @jimross03941273 ...,@TruthHunter7778 @moravian63 @jimross03941273 ...,1665871108824866817,"{'_type': 'snscrape.modules.twitter.User', 'us...",1,0,0,...,1,,{'_type': 'snscrape.modules.twitter.EditState'...,@TruthHunter7778 @moravian63 @jimross03941273 ...,[],,[],,No_Suffer_Fools,0.6.2.20230321.dev50+g0d824ab
3,snscrape.modules.twitter.Tweet,https://twitter.com/ddoregon2020/status/166587...,2023-06-05 23:59:55+00:00,@CavemanInASuit @RepClayHiggins Giuliani flew ...,@CavemanInASuit @RepClayHiggins Giuliani flew ...,1665871099567939586,"{'_type': 'snscrape.modules.twitter.User', 'us...",1,0,2,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,@CavemanInASuit @RepClayHiggins Giuliani flew ...,[],,[],,ddoregon2020,0.6.2.20230321.dev50+g0d824ab
4,snscrape.modules.twitter.Tweet,https://twitter.com/nerate_us/status/166587107...,2023-06-05 23:59:49+00:00,Nikki Haley says backing Ukraine is about prot...,Nikki Haley says backing Ukraine is about prot...,1665871076746752002,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,Nikki Haley says backing Ukraine is about prot...,['https://cnn.it/3OOR7cK'],https://cnn.it/3OOR7cK,['https://t.co/GYEi7tD7lY'],https://t.co/GYEi7tD7lY,nerate_us,0.6.2.20230321.dev50+g0d824ab
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,snscrape.modules.twitter.Tweet,https://twitter.com/collaisco/status/166447094...,2023-06-02 03:16:11+00:00,@LePapillonBlu2 He’d have a hard time finding ...,@LePapillonBlu2 He’d have a hard time finding ...,1664470942733705216,"{'_type': 'snscrape.modules.twitter.User', 'us...",2,0,5,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,@LePapillonBlu2 He’d have a hard time finding ...,[],,[],,collaisco,0.6.2.20230321.dev50+g0d824ab
199996,snscrape.modules.twitter.Tweet,https://twitter.com/robysrh/status/16644709059...,2023-06-02 03:16:03+00:00,I don't give a fuck about Ukraine and Russia t...,I don't give a fuck about Ukraine and Russia t...,1664470905937100801,"{'_type': 'snscrape.modules.twitter.User', 'us...",1,0,0,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,I don't give a fuck about Ukraine and Russia t...,[],,[],,robysrh,0.6.2.20230321.dev50+g0d824ab
199997,snscrape.modules.twitter.Tweet,https://twitter.com/wlcondra/status/1664470881...,2023-06-02 03:15:57+00:00,@BernieSanders What a dumb comment. Your debt ...,@BernieSanders What a dumb comment. Your debt ...,1664470881698193408,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,0,0,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,@BernieSanders What a dumb comment. Your debt ...,[],,[],,wlcondra,0.6.2.20230321.dev50+g0d824ab
199998,snscrape.modules.twitter.Tweet,https://twitter.com/kardinal691/status/1664470...,2023-06-02 03:15:54+00:00,another 15 such little helpers are waiting for...,another 15 such little helpers are waiting for...,1664470868267810817,"{'_type': 'snscrape.modules.twitter.User', 'us...",0,1,4,...,0,,{'_type': 'snscrape.modules.twitter.EditState'...,another 15 such little helpers are waiting for...,[],,[],,kardinal691,0.6.2.20230321.dev50+g0d824ab


## 5. First test to see if our dataset is correct

Count the number of tweets we scraped.


In [9]:
# Each line of the JSONL file is one tweet; the context manager closes the file
with open(json_filename) as f:
    num = sum(1 for _ in f)
print(num)

200000
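
Beyond the line count, two other quick checks are whether tweet ids are unique and whether the dates fall inside the requested window. The sketch below runs them on a made-up three-row frame; `sample_df` is a stand-in for `tweets_df`, and the same expressions should carry over assuming its `id` and `date` columns are as displayed above. Looking at the oldest date is also informative: in the table above the last displayed row is dated 2023-06-02, so the 200,000-result cap appears to be reached long before the `--since` date.

```python
import pandas as pd

# Made-up stand-in for tweets_df (hypothetical data).
sample_df = pd.DataFrame(
    {
        "id": [101, 102, 103],
        "date": pd.to_datetime(
            [
                "2023-06-02 03:15:54+00:00",
                "2023-06-05 23:59:57+00:00",
                "2023-06-05 23:59:58+00:00",
            ]
        ),
    }
)

# Window taken from the query: --since 2022-12-01 and until:2023-06-06.
window_start = pd.Timestamp("2022-12-01", tz="UTC")
window_end = pd.Timestamp("2023-06-06", tz="UTC")

ids_unique = sample_df["id"].is_unique
in_window = sample_df["date"].between(window_start, window_end).all()
oldest, newest = sample_df["date"].min(), sample_df["date"].max()
print(ids_unique, in_window, oldest, newest)
```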


Search the tweets for a particular substring, here "juste".


In [None]:
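substring = 'juste'

count = 0
# Scan the raw JSONL file line by line; the context manager closes the file
with open(json_filename, 'r') as f:
    for line in f:
        # This matches the substring anywhere in the JSON line, not only in the tweet text
        if substring in line:
            count += 1
            obj = json.loads(line)
            print(f'Tweet number {count}: {obj["content"]}')
print(count)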
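
The same search can also be run on the DataFrame, which restricts the match to the tweet text rather than the whole JSON line (where a username or URL could match too). A sketch on made-up rows, with `sample_df` standing in for `tweets_df`:

```python
import pandas as pd

# Made-up stand-in for tweets_df (hypothetical data).
sample_df = pd.DataFrame(
    {"content": ["You just need to look", "C'est juste un test", "Nothing here"]}
)

substring = "juste"
# regex=False treats the pattern as a literal string; na=False skips missing text.
mask = sample_df["content"].str.contains(substring, regex=False, na=False)
matches = sample_df.loc[mask, "content"]
print(len(matches))
```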

In [11]:
tweets_df.iloc[0].content

"To understand RU's strategy regarding Ukraine and against the collective West, you just need to look at this GIF. The dog denies the fire and endures the heat but believes that eventually the wood will burn out, the flames will disappear, and a new house will be rebuilt. https://t.co/PZawq7ji0q"

Links mentioned in the tweet are also listed separately in the outlinks column.

In [12]:
tweets_df.iloc[0].outlinks


'[]'
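
Note that the value above is the string `'[]'`, not an empty list: `tweets_df` was reloaded from CSV, which stores list-valued cells as text. One way to recover the lists is `ast.literal_eval`, sketched here on a made-up frame round-tripped through an in-memory CSV. It assumes every cell in the column holds a valid Python literal; columns with missing values would need extra handling.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column (hypothetical data).
original = pd.DataFrame({"outlinks": [[], ["https://example.com/a"]]})

# Round trip through CSV, as done with tweets_df above.
csv_buffer = io.StringIO()
original.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
reloaded = pd.read_csv(csv_buffer)

# After the round trip each cell is a string such as "[]".
cell_type_after_csv = type(reloaded.loc[0, "outlinks"])

# literal_eval safely parses Python literals (lists, dicts, strings, numbers).
reloaded["outlinks"] = reloaded["outlinks"].apply(ast.literal_eval)
print(cell_type_after_csv, reloaded.loc[1, "outlinks"])
```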

In [13]:
popularity_columns = ['replyCount', 'retweetCount', 'likeCount', 'quoteCount']
tweets_df.iloc[0][popularity_columns]

replyCount      0
retweetCount    0
likeCount       0
quoteCount      0
Name: 0, dtype: object

Find the most retweeted tweet in our dataset.

In [14]:
# idxmax returns an index label, so select the row with .loc rather than .iloc
tweets_df.loc[tweets_df.retweetCount.idxmax(), ['content', 'retweetCount']]

content         I wish the U.S. government supported its citiz...
retweetCount                                                 4891
Name: 43499, dtype: object
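
To look at several of the most retweeted tweets rather than only the top one, `DataFrame.nlargest` is one option. A sketch on made-up rows, with `sample_df` standing in for `tweets_df`:

```python
import pandas as pd

# Made-up stand-in for tweets_df (hypothetical data).
sample_df = pd.DataFrame(
    {
        "content": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "retweetCount": [3, 4891, 0, 12],
    }
)

# nlargest returns the n rows with the highest values, sorted in descending order.
top_retweeted = sample_df.nlargest(2, "retweetCount")[["content", "retweetCount"]]
print(top_retweeted)
```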