# Twitter Scraping Tool

## Resources

- [Analyzing Tweets with NLP in minutes with Spark, Optimus and Twint](https://towardsdatascience.com/analyzing-tweets-with-nlp-in-minutes-with-spark-optimus-and-twint-a0c96084995f)

## Install Twint and Requirements

In [None]:
%%capture

# Install "twint" and requirements
!pip install -e ./twint -r ./twint/requirements.txt
!pip install -r requirements.txt

# Enable extension
!jupyter nbextension enable --py widgetsnbextension

## Import and Setup Libraries

In [1]:
# Patch asyncio to allow nested event loops; this avoids "event loop is already running" RuntimeErrors in notebooks.
import nest_asyncio
nest_asyncio.apply()
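
The patch above matters because a notebook kernel already runs an event loop. The sketch below reproduces the underlying failure using only the standard library. It assumes the scraper drives its own loop in a similar way, which is not verified here. `inner` and `outer` are hypothetical names used only for illustration.

```python
import asyncio


async def inner():
    # Stand-in for any coroutine a library might want to run
    return 42


async def outer():
    # We are now inside a running loop, much like code in a notebook cell
    loop = asyncio.get_running_loop()
    coro = inner()
    try:
        # Re-entering the running loop is rejected by plain asyncio
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # The coroutine never started, so close it explicitly
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

With `nest_asyncio.apply()` in effect, this kind of re-entrant call is permitted rather than raising.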

In [2]:
import twint
import pandas as pd
from datetime import datetime, timedelta
from tqdm.notebook import tqdm

## Scrape Twitter according to Config

In [7]:
# Run one search per day over the date interval [since, until] and return the combined results
def scrape_twitter(search, limit, since, until=None, output="../data/output.csv"):
    
    # Initialize search configuration
    config = twint.Config()

    # Search keys
    config.Search = search

    # Search settings
    config.Lang = "en"
    config.Limit = limit
    config.Verified = False
    
    # Output settings
    config.Hide_output = True
    config.Store_csv = True
    config.Output = output
    config.Pandas = True
    
    # Default to a single-day search when no end date is given
    if not until:
        until = since
        
    df = pd.DataFrame()
    
    start = datetime.strptime(since, '%Y-%m-%d')
    end = datetime.strptime(until, '%Y-%m-%d')
    delta = end - start
    
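    for i in tqdm(range(1, delta.days + 2)):
        itr = start + timedelta(days=i)
        # Bound each search to a single day so results cannot predate `since`
        config.Since = (itr - timedelta(days=1)).strftime('%Y-%m-%d')
        config.Until = itr.strftime('%Y-%m-%d')
        
        twint.run.Search(config)
        
        # Accumulate results, dropping tweets already collected
        df = pd.concat([df, twint.storage.panda.Tweets_df], ignore_index=True)
        df = df.drop_duplicates(subset='id')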
    
    return df
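
The function walks the interval one day at a time, using the day after each date as the upper bound of the search. As a minimal sketch of that date arithmetic on its own, the hypothetical helper `daily_windows` below (not part of twint or of this notebook) returns one `(day, next_day)` pair per day in the interval. It can be used to check which dates a given `since`/`until` pair will cover before starting a long scrape.

```python
from datetime import datetime, timedelta

DATE_FMT = "%Y-%m-%d"


def daily_windows(since, until=None):
    """Return one (day, next_day) pair of date strings per day in [since, until]."""
    start = datetime.strptime(since, DATE_FMT)
    # A missing end date means a single-day interval
    end = datetime.strptime(until, DATE_FMT) if until else start
    if end < start:
        raise ValueError("until must not be earlier than since")

    windows = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        next_day = day + timedelta(days=1)
        windows.append((day.strftime(DATE_FMT), next_day.strftime(DATE_FMT)))
    return windows


windows = daily_windows("2021-10-29", "2021-10-30")
print(windows)
```

Each pair could serve as the lower and upper bound of a single-day search.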

In [None]:
df = scrape_twitter(
    search="bitcoin",
    limit=500,
    since='2021-10-29',
    until='2021-10-30'
)

In [None]:
df