In this IPython notebook, we scrape tweets using different search modes depending on the use case, e.g., scraping tweets from a particular user or scraping followers' information.

We use the OSINT scraper `Twint` as a Python module. See [Twint's GitHub page](https://github.com/twintproject/twint) for more details.

Twint's configuration options are listed on [its wiki](https://github.com/twintproject/twint/wiki/Configuration).

### 1 - Installs and imports

In [1]:
!pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining twint from git+https://github.com/twintproject/twint.git@origin/master#egg=twint
  Cloning https://github.com/twintproject/twint.git (to revision origin/master) to ./src/twint
  Running command git clone -q https://github.com/twintproject/twint.git /content/src/twint
  Running command git checkout -q origin/master
Collecting aiodns
  Downloading aiodns-3.0.0-py3-none-any.whl (5.0 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting elasticsearch
  Downloading elasticsearch-8.3.3-py3-none-any.whl (382 kB)
Collecting aiohttp_socks
  Downloading aiohttp_socks-0.7.1-py3-none-any.whl (9.3 kB)
Collecting schedule
  Downloading s

In [2]:
!pip install nest_asyncio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Restart the runtime** before running the cells below so that the newly installed packages are picked up.

In [1]:
import twint
import pandas as pd
import nest_asyncio
from collections import Counter

In [2]:
nest_asyncio.apply()

### 2 - Declare parameters to acquire tweets/account info

In [3]:
# provide username
username = 'larrouturou'

# provide keyword or search string
keyword = 'zubair'

# provide conversation id for which we want to scrape the replies
conversation_id = '1541498855061164034'

# provide the number of most recent tweets you want to scrape
N = 100000

# provide the start date ('%Y-%m-%d %H:%M:%S' format)
since_date = '2022-06-27 00:00:00'

# provide the end date
until_date = '2022-06-30 00:00:00'
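Since the date strings must follow the `'%Y-%m-%d %H:%M:%S'` format, it can help to check them before launching a long scrape. The sketch below is one way to do that with the standard library only. `validate_date_range` and `DATE_FORMAT` are hypothetical names introduced here for illustration; they are not part of Twint.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # format stated in the parameter comments


def validate_date_range(since, until, fmt=DATE_FORMAT):
    """Parse both bounds and check that `since` is strictly earlier than `until`.

    Hypothetical helper for illustration; it is not part of Twint.
    """
    start = datetime.strptime(since, fmt)  # raises ValueError on a malformed string
    end = datetime.strptime(until, fmt)
    if start >= end:
        raise ValueError(f"since ({since}) must be earlier than until ({until})")
    return start, end


start, end = validate_date_range("2022-06-27 00:00:00", "2022-06-30 00:00:00")
window_days = (end - start).days
print("Scraping window in days:", window_days)
```

Calling the helper right after declaring `since_date` and `until_date` surfaces typos in the dates before any request is made.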

### 3 - Scrape tweets from a Twitter account

In [4]:
def get_tweets(user_name, num, since_date, until_date):
    """Return up to `num` tweets posted by `user_name` between the two dates as a DataFrame."""
    print("======================================")
    print(":: Acquiring tweets of", user_name, "::")
    print("======================================")

    # Configure the search
    c = twint.Config()
    c.Username = user_name

    c.Since = since_date
    c.Until = until_date
    c.Limit = num

    c.Store_object = True
    c.User_full = True
    c.Profile_full = True
    c.Hide_output = True  # do not echo every tweet to stdout

    # Collect the results in twint.storage.panda.Tweets_df
    c.Pandas = True
    twint.run.Search(c)

    return pd.DataFrame(twint.storage.panda.Tweets_df)

In [6]:
output = get_tweets(user_name=username, num=500, since_date=since_date, until_date=until_date)
print(output)

:: Acquiring tweets of larrouturou ::
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
                    id      conversation_id    created_at  \
0  1542194179001942019  1542194179001942019  1.656523e+12   
1  1542190261576556544  1542190261576556544  1.656522e+12   
2  1542176075698126848  1542176075698126848  1.656518e+12   
3  1541862880248741888  1541862880248741888  1.656444e+12   
4  1541498855061164034  1541498855061164034  1.656357e+12   
5  1541487123915833345  1541487123915833345  1.656354e+12   

                  date timezone place  \
0  2022-06-29 17:11:59    +0000         
1  2022-06-29 16:56:25    +0000         
2  2022-06-29 16:00:03    +0000         
3  2022-06-28 19:15:31    +0000         
4  2022-06-27 19:09:01    +0000         
5  2022-06-27 18:22:24    +0000         

                                               tweet language  \
0  En quelques mois, les macronistes sont passés ...       fr   
1  Jour de #honte à l'Assemblée nat
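
The `created_at` column prints as values around `1.656e+12`, which looks like Unix time in milliseconds. Assuming that interpretation holds, the column can be turned into readable timestamps with pandas. In this minimal sketch the two millisecond values are reconstructed from the `date` column of rows 0 and 4 above (assuming those dates are in UTC, as the `+0000` timezone suggests); they do not come from a live Twint call.

```python
import pandas as pd

# Sample mimicking the `id` and `created_at` columns of the output above.
# The millisecond values are reconstructed from the printed `date` column.
sample = pd.DataFrame({
    "id": [1542194179001942019, 1541498855061164034],
    "created_at": [1656522719000, 1656356941000],
})

# Interpret `created_at` as milliseconds since the Unix epoch, in UTC.
sample["created_at_utc"] = pd.to_datetime(sample["created_at"], unit="ms", utc=True)
print(sample[["id", "created_at_utc"]])
```

A proper datetime column makes it straightforward to sort, resample, or filter the tweets by time.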

### 4 - Scrape the account details of a user

In [7]:
def get_account_info(user_name):
    """Return the profile details of `user_name` as a DataFrame."""
    print("===========================================")
    print(":: Scraping account info of", user_name, "::")
    print("===========================================")

    # Configure the lookup
    c = twint.Config()
    c.Username = user_name

    c.Store_object = True
    c.User_full = True
    c.Profile_full = True
    c.Hide_output = True

    # Collect the result in twint.storage.panda.User_df
    c.Pandas = True
    twint.run.Lookup(c)

    return pd.DataFrame(twint.storage.panda.User_df)

In [8]:
result = get_account_info(user_name=username)
result.head()

:: Scraping account info of larrouturou ::


| | id | name | username | bio | url | join_datetime | join_date | join_time | tweets | location | following | followers | likes | media | private | verified | avatar | background_image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18578222 | Pierre Larrouturou | larrouturou | Justice sociale & climat - Député européen @No... | https://t.co/EVf5dMr0pA | 2009-01-03 12:10:04 UTC | 2009-01-03 | 12:10:04 UTC | 15053 | | 53321 | 56817 | 78822 | 1711 | False | True | https://pbs.twimg.com/profile_images/144102932... | https://pbs.twimg.com/profile_banners/18578222... |
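
The account frame is wide, so a transposed view of a few columns can be easier to read. The sketch below rebuilds a one-row frame from some of the values printed above (it does not call Twint) and also derives a followers-to-following ratio. The names `user_df`, `columns_of_interest`, `summary`, and `follower_ratio` are illustrative choices, not part of Twint.

```python
import pandas as pd

# One-row frame rebuilt from a few of the values printed above (not a live Twint call).
user_df = pd.DataFrame([{
    "id": 18578222,
    "username": "larrouturou",
    "tweets": 15053,
    "following": 53321,
    "followers": 56817,
    "verified": True,
}])

# Keep the columns of interest and transpose so that each field gets its own row.
columns_of_interest = ["username", "tweets", "following", "followers", "verified"]
summary = user_df[columns_of_interest].T

# Followers per followed account, as a rough indicator of audience size.
follower_ratio = user_df.loc[0, "followers"] / user_df.loc[0, "following"]

print(summary)
print("followers / following:", round(float(follower_ratio), 2))
```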


### 5 - Scrape tweets using a keyword

In [9]:
def get_all_tweets(keyword, num, since_date, until_date):
    """Return up to `num` tweets matching `keyword` between the two dates as a DataFrame."""
    print("==============================================")
    print(":: Acquiring tweets with keyword", keyword, "::")
    print("==============================================")

    # Configure the search
    c = twint.Config()
    c.Search = keyword

    c.Since = since_date
    c.Until = until_date

    c.Limit = num

    # Optional language filter and translation settings
    #c.Lang = 'nl'
    #c.Translate = True
    #c.TranslateDest = "en"

    c.Store_object = True
    c.User_full = True
    c.Profile_full = True
    c.Hide_output = True

    # Collect the results in twint.storage.panda.Tweets_df
    c.Pandas = True
    twint.run.Search(c)

    return pd.DataFrame(twint.storage.panda.Tweets_df)

In [11]:
result = get_all_tweets(keyword=keyword, num=500, since_date=since_date, until_date=until_date)
print(result)

:: Acquiring tweets with keyword zubair ::
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
                     id      conversation_id    created_at  \
0   1542296760806060035  1541673228254552066  1.656547e+12   
1   1542296375681028096  1542080777881456641  1.656547e+12   
2   1542296164661329921  1542296164661329921  1.656547e+12   
3   1542296152992690177  1542296152992690177  1.656547e+12   
4   1542295829993533441  1542049884471119873  1.656547e+12   
5   1542295795507920896  1542020613631791104  1.656547e+12   
6   1542295773311733760  1542048713975115776  1.656547e+12   
7   1542295506923036675  1542295506923036675  1.656547e+12   
8   1542295335409659904  1542295335409659904  1.656547e+12   
9   1542295306687381506  1542049884471119873  1.656547e+12   
10  1542295258641207297  1542078745158094848  1.656547e+12   
11  1542295164919889921  1542178634319814657  1.656547e+12   
12  1542294912514887680  1541759832813969410  1.656547e+12   
13  1542
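
In the output above, some rows have an `id` equal to their `conversation_id` while others do not. On Twitter, the tweet that starts a conversation carries its own ID as the conversation ID, so this comparison separates thread starters from replies. The sketch below uses the first four rows of the output as sample data. Both columns are cast to `str` before comparing, on the assumption that they may not share a dtype in the frame Twint produces.

```python
import pandas as pd

# Sample rows copied from the keyword-search output above.
tweets = pd.DataFrame({
    "id": [1542296760806060035, 1542296375681028096,
           1542296164661329921, 1542296152992690177],
    "conversation_id": ["1541673228254552066", "1542080777881456641",
                        "1542296164661329921", "1542296152992690177"],
})

# A tweet starts a conversation when its own ID is the conversation ID.
is_root = tweets["id"].astype(str) == tweets["conversation_id"].astype(str)
thread_starters = tweets[is_root]
replies_only = tweets[~is_root]

print(len(thread_starters), "thread starters,", len(replies_only), "replies")
```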

### 6 - Scrape replies to a tweet

In [13]:
def get_replies(user_name, conversation_id, num, since_date, until_date):
    """Return the tweets addressed to `user_name` that belong to `conversation_id`."""
    print("=======================================================")
    print(":: Acquiring tweet replies to ", conversation_id, "::")
    print("=======================================================")

    # Configure the search: collect tweets sent to the user in the date window
    c = twint.Config()

    c.Since = since_date
    c.Until = until_date
    c.Limit = num
    c.To = user_name

    c.Store_object = True
    c.User_full = True
    c.Profile_full = True
    c.Hide_output = True

    # Collect the results in twint.storage.panda.Tweets_df
    c.Pandas = True
    twint.run.Search(c)
    df = twint.storage.panda.Tweets_df

    # Keep only the tweets that are part of the requested conversation
    df = df[df['conversation_id'] == conversation_id]

    return pd.DataFrame(df)
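
The conversation filter in `get_replies` compares a column against the value passed in, so it silently returns an empty frame if one side holds strings and the other integers. A minimal sketch of a more forgiving filter follows; `filter_conversation` is a hypothetical helper, not a Twint function, and the sample rows are taken from outputs printed in this notebook.

```python
import pandas as pd


def filter_conversation(df, conversation_id):
    """Return the rows of `df` that belong to `conversation_id`.

    Hypothetical helper: both sides are compared as strings so the filter
    behaves the same whether IDs arrive as int or str.
    """
    wanted = str(conversation_id)
    mask = df["conversation_id"].astype(str) == wanted
    return df[mask].reset_index(drop=True)


demo = pd.DataFrame({
    "id": [1542236063791566848, 1542296760806060035],
    "conversation_id": ["1541498855061164034", "1541673228254552066"],
})

by_str = filter_conversation(demo, "1541498855061164034")
by_int = filter_conversation(demo, 1541498855061164034)
print(by_str)
```

Both calls select the same single row, regardless of how the conversation ID was typed.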

In [14]:
replies = get_replies(user_name=username, conversation_id=conversation_id, num=N, since_date=since_date, until_date=username and until_date)

print(replies)

:: Acquiring tweet replies to  1541498855061164034 ::
[!] No more data! Scraping will stop now.
found 0 deleted tweets in this search.
                     id      conversation_id    created_at  \
5   1542236063791566848  1541498855061164034  1.656533e+12   
10  1542223490262061057  1541498855061164034  1.656530e+12   
11  1542217317894848512  1541498855061164034  1.656528e+12   
12  1542217273007411200  1541498855061164034  1.656528e+12   
14  1542211665000861697  1541498855061164034  1.656527e+12   
15  1542211515826192393  1541498855061164034  1.656527e+12   
24  1542185902705119232  1541498855061164034  1.656521e+12   
25  1542168581579821056  1541498855061164034  1.656517e+12   
26  1542164488899530753  1541498855061164034  1.656516e+12   
27  1542138609028980741  1541498855061164034  1.656509e+12   
28  1542109682399350784  1541498855061164034  1.656503e+12   
30  1542104454098321408  1541498855061164034  1.656501e+12   
31  1542104061939695616  1541498855061164034  1.656501e+12 
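
Once the replies are collected, `Counter` (imported at the top of the notebook) offers a quick way to see who replied most often. The sketch below uses a made-up list of usernames as a stand-in for `replies["username"]`; the existence of a `username` column in Twint's tweet frame is an assumption here, and the names are purely illustrative.

```python
from collections import Counter

# Hypothetical usernames standing in for replies["username"]; the column
# name is an assumption about the layout of Twint's tweet DataFrame.
reply_authors = ["alice", "bob", "alice", "carol", "bob", "alice"]

# Count replies per author and keep the two most active ones.
author_counts = Counter(reply_authors)
top_two = author_counts.most_common(2)
print(top_two)
```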