### I hope this algorithm helps you! If you need to contact me for any reason, feel free to follow me and write to me through my social media or e-mail =)

## LET'S SHARE AND OPEN-SOURCE IT!

#### GitHub: https://github.com/neural-insights

#### LinkedIn: https://www.linkedin.com/in/lucas-barone-peres/

#### Medium: https://medium.com/@lucas.barone.peres

#### e-mail: lucas.barone.peres@gmail.com

# **Collecting tweets**

### Installing SNScrape library

In [None]:
#Development version of SNScrape, installed from GitHub (the "!pip" syntax runs only inside Jupyter Notebook)

!pip install git+https://github.com/JustAnotherArchivist/snscrape.git

### SNScrape guides:

- SNScrape's GitHub: https://github.com/JustAnotherArchivist/snscrape

- Medium's article and tutorial: https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721

### Importing fundamental libraries

If you don't have some of these libraries installed, just uncomment the commands below to install or update them.

In [None]:
#!pip install pandas
#!pip install tqdm

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from time import sleep
from tqdm.auto import tqdm

#Ignore warnings
import warnings
warnings.filterwarnings("ignore")

### Declaring some accessory functions to check if input conditions are plausible

In [None]:
def check_input(input_type:str, date_type:str):
    while True:
        try:
            user_input = int(input(f"Enter the {date_type} {input_type}:"))
            if input_type == 'month':
                if not 1 <= user_input <= 12:
                    print("Invalid month: enter a number between 1 and 12")
                    continue
            elif input_type == 'day':
                if not 1 <= user_input <= 31:
                    print("Invalid day: enter a number between 1 and 31")
                    continue
            return user_input
        except ValueError:
            #int() raises ValueError when the text typed is not a whole number
            print("Invalid input: enter only whole numbers")
            continue

def get_date(date_type:str):
    while True:
        year = check_input(input_type='year', date_type=date_type)
        month = check_input(input_type='month', date_type=date_type)
        day = check_input(input_type='day', date_type=date_type)
        try:
            date = pd.to_datetime(f"{year}-{month}-{day}").date()
            return date
        except Exception as e:
            print(e)
            print("Please re-enter the desired date...")
            continue
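
The same checks can be exercised without typing anything at a prompt. The sketch below is a minimal, non-interactive variant: `validate_date` is a hypothetical helper (it is not one of the notebook's functions) and it assumes that relying on the standard library's `datetime.date`, which raises `ValueError` for impossible dates, is acceptable for your use case.

```python
from datetime import date

def validate_date(year: int, month: int, day: int):
    """Return a date object, or None when the values do not form a real calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        # Raised for things like month 13 or February 30th
        return None

print(validate_date(2022, 9, 16))  # a valid date
print(validate_date(2023, 2, 30))  # February has no 30th day, so None is returned
```

Returning `None` instead of printing a message keeps the helper easy to reuse in a loop that asks the user again.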

### Declaring the main function which will collect all inputs from user and analyze their viability

In [None]:
def search_your_tweets():
    name = input("Enter the name of any keyword you want to search on tweets:")

    while True:
        initial_date = get_date(date_type='initial')
        final_date = get_date(date_type='final') + pd.Timedelta(days=1)
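        #final_date already includes the extra day, so it must be strictly later than initial_date
        if final_date > initial_date: break
        else: print("Final date is earlier than initial date, please fix this...")

    #Because of the extra day added above, the same day can be chosen for both the initial and final dates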
    number_daily_tweets = int(input("Enter the number of tweets you want to catch per day:"))

    info_tuple = (name, initial_date, final_date, number_daily_tweets)
    return info_tuple
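
If you prefer to skip the prompts, a tuple with the same shape can be built by hand. This is only a sketch: `build_info_tuple` is a hypothetical helper, and it assumes plain `datetime.date` values are acceptable wherever the notebook uses dates. It mirrors the notebook's choice of adding one day to the final date.

```python
from datetime import date, timedelta

def build_info_tuple(keyword: str, first_day: date, last_day: date, per_day: int):
    """Build (keyword, initial_date, final_date, number_daily_tweets) without user prompts."""
    if last_day < first_day:
        raise ValueError("last_day is earlier than first_day")
    if per_day <= 0:
        raise ValueError("per_day must be a positive integer")
    # One day is added to the final date, as search_your_tweets does
    return (keyword, first_day, last_day + timedelta(days=1), per_day)

info = build_info_tuple("iPhone 14", date(2022, 9, 16), date(2022, 9, 16), 100)
print(info)
```

The resulting tuple can be unpacked into `tweet_collector` in the same way as the one returned by `search_your_tweets`.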

### Declaring lists of tweets items you need to scrape

At this point, you must decide exactly which information you need from tweets for your future analysis

In [None]:
#The first list gathers the tweet fields used in the coding demonstration that follows.
#The second list includes the tweet items that snscrape can scrape and return. You can create your own list
#with the fields you need based on this information, or just delete the unnecessary fields below.


default_str_list = ["date", "tweet_id", "tweet", "username", "replies_score",
                    "retweet_score", "likes_score", "quotes_score"]


all_fields_str_list = ['url', 'date', 'rawContent', 'renderedContent', 'id', 'content', 'user', 'replyCount', 'likeCount',
                       'quoteCount', 'conversationId', 'lang', 'source', 'sourceUrl', 'sourceLabel', 'links', 'media', 'retweetedTweet',
                       'quotedTweet', 'inReplyToTweetId', 'inReplyToUser', 'mentionedUsers', 'coordinates', 'place', 'hashtags',
                       'cashtags', 'card']


After deciding which tweet items you want to scrape, create a commented list with the same item names in the same order, but with the prefix "tweet." added to every field, as shown below. You will use this list inside the TwitterSearchScraper loop later, so keep it commented to avoid an error in the next cell.

**If you've chosen to proceed with the default list, you don't need to do anything.**

In [None]:
#Default list:
#[tweet.date, tweet.id, tweet.content, tweet.user.username,
# tweet.replyCount, tweet.retweetCount, tweet.likeCount, tweet.quoteCount]

#All fields (no quotes here: these are attributes of the tweet object, not strings):
#[tweet.url, tweet.date, tweet.rawContent, tweet.renderedContent, tweet.id, tweet.content, tweet.user, tweet.replyCount, tweet.likeCount,
# tweet.quoteCount, tweet.conversationId, tweet.lang, tweet.source, tweet.sourceUrl, tweet.sourceLabel, tweet.links, tweet.media,
# tweet.retweetedTweet, tweet.quotedTweet, tweet.inReplyToTweetId, tweet.inReplyToUser, tweet.mentionedUsers, tweet.coordinates, tweet.place,
# tweet.hashtags, tweet.cashtags, tweet.card]
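
As an alternative to copying the "tweet."-prefixed list by hand, the field names can be turned into attribute lookups programmatically. The sketch below is an assumption-laden illustration, not the notebook's method: `make_row_getter` is a hypothetical helper, and `fake_tweet` is a stand-in object built with `SimpleNamespace` so the example runs without scraping anything. It relies on `operator.attrgetter`, which accepts dotted names such as `user.username`.

```python
from operator import attrgetter
from types import SimpleNamespace

def make_row_getter(field_paths):
    """Return a function that extracts the given attribute paths from a tweet-like object."""
    getters = [attrgetter(path) for path in field_paths]
    return lambda tweet: [get(tweet) for get in getters]

# Stand-in for a scraped tweet (hypothetical data, only for demonstration)
fake_tweet = SimpleNamespace(
    id=1,
    content="hello",
    likeCount=3,
    user=SimpleNamespace(username="someone"),
)

get_row = make_row_getter(["id", "content", "user.username", "likeCount"])
print(get_row(fake_tweet))  # [1, 'hello', 'someone', 3]
```

With this approach a single list of attribute paths drives the extraction, so the field list and the row contents cannot drift out of order.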

### Building a function to catch tweets and store into a DataFrame

PAY ATTENTION TO THE UPPER-CASE COMMENTS! FOUR INSTRUCTIONS WILL GUIDE YOU IN SETTING YOUR PREFERENCES FOR THE TwitterSearchScraper.

In [None]:
def tweet_collector(name, initial_date, final_date, number_daily_tweets):

    # Expected maximum: number_daily_tweets for each day in the period
    total_tweets = (final_date - initial_date).days * number_daily_tweets

    # List to store tweets
    tweets_list = []

    # Progress bar
    progress_bar = tqdm(total = total_tweets)

    # End of the current one-day search window (exclusive); it starts at final_date and moves backward
    window_end = final_date

#NOW IT'S TIME TO MAKE SOME CHOICES!

    # Loop over the days, from the most recent to the oldest
    while window_end > initial_date:
        window_start = window_end - pd.Timedelta(days=1)

#(1) ENGLISH IS SET AS THE DEFAULT LANGUAGE. IF YOU NEED TO SEARCH IN ANOTHER LANGUAGE, CHANGE THE "lang:" VALUE IN THE TwitterSearchScraper QUERY
#(2) REPLIES ARE BEING FILTERED OUT. IF YOU WANT TO COLLECT REPLIES FOR YOUR DATA, JUST DELETE "-filter:replies"

        query = f'{name} since:{window_start} until:{window_end} lang:en -filter:replies'
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
            if i >= number_daily_tweets:
                break

#(3) IF YOU DECIDED NOT TO USE THE DEFAULT LIST, YOU MUST OVERWRITE THE LIST INSIDE tweets_list.append WITH YOUR OWN LIST OF TWEET FIELDS.
#    IN THIS CASE, COPY AND PASTE THE LIST YOU WROTE BEFORE WITH THE PREFIX "tweet." ADDED, FOLLOWING THE PREVIOUS INSTRUCTIONS.

            tweets_list.append([tweet.date, tweet.id, tweet.content, tweet.user.username,
                                tweet.replyCount, tweet.retweetCount, tweet.likeCount, tweet.quoteCount])
            sleep(0.01)
            progress_bar.update(1)

        # Move to the previous day even if fewer tweets than requested were found, so the loop always ends
        window_end = window_start

    progress_bar.close()

    # Creating a DataFrame

#(4) IF YOU ARE NOT USING THE DEFAULT LIST, YOU MUST OVERWRITE THE COLUMNS WITH YOUR OWN LIST OF FIELD NAMES DECLARED BEFORE
    df = pd.DataFrame(tweets_list, columns=default_str_list)

    # Storing as csv (final_date is the day after the last day searched)
    df.to_csv(f"{name}_tweets_{initial_date.strftime('%Y%m%d')}_{final_date.strftime('%Y%m%d')}.csv", encoding = 'utf-8', index = False)

    print("\n Congratulations! All tweets were collected! \n\n")
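
To see what search strings a day-by-day collection would send, the period can be split into one-day windows without touching the network. This is a minimal sketch under stated assumptions: `daily_queries` is a hypothetical helper, the final date is treated as exclusive, and the `since:`, `until:`, `lang:` and `-filter:replies` operators are the same ones used in the query inside `tweet_collector`.

```python
from datetime import date, timedelta

def daily_queries(keyword, initial_date, final_date, lang="en"):
    """Yield one search string per day, newest day first; final_date is exclusive."""
    window_end = final_date
    while window_end > initial_date:
        window_start = window_end - timedelta(days=1)
        yield f"{keyword} since:{window_start} until:{window_end} lang:{lang} -filter:replies"
        window_end = window_start

queries = list(daily_queries("iPhone 14", date(2022, 9, 16), date(2022, 9, 18)))
for query in queries:
    print(query)
```

Printing the queries first is a cheap way to confirm the date range before starting a long scraping run.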

# Now, it's up to you! You only need to run the cell below in order to collect your desired tweets! 

### Collecting your own tweets and saving as .csv

In [None]:
#ALL NUMBERS MUST BE INTEGERS!

info_tuple = search_your_tweets()

tweet_collector(*info_tuple)

#The file will be saved in your current folder, named with the keyword(s) you searched for + "_tweets_" + the initial and final dates (YYYYMMDD)

### Check your Data in a Pandas Dataframe

In [None]:
#Replace the file name inside the parentheses with the name of your saved .csv file (the name below is only an example)

my_tweets_df = pd.read_csv('trump_tweets_1.csv')
my_tweets_df.head(10)

### Further information about your data

In [None]:
my_tweets_df.info()
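
When reloading a saved file, it can help to parse the date column while reading, so that tweets can be counted per day. The sketch below uses a small in-memory CSV as a stand-in for a saved file; the two rows are made-up sample data, and the column names follow the default list defined earlier in this notebook.

```python
import io
import pandas as pd

# Made-up sample standing in for a saved .csv file
csv_text = (
    "date,tweet_id,tweet,likes_score\n"
    "2022-09-16 10:00:00+00:00,1,hello,3\n"
    "2022-09-17 11:30:00+00:00,2,world,5\n"
)

# parse_dates converts the "date" column from text to datetime values
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Number of tweets collected on each calendar day
tweets_per_day = sample_df.groupby(sample_df["date"].dt.date).size()
print(tweets_per_day)
```

With a real file, replace `io.StringIO(csv_text)` with the file name.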

### Examples of running tweet_collector function

**Let's take a look at what people are tweeting about the new iPhone 14 release in September 2022**

In [None]:
info_tuple = search_your_tweets()

tweet_collector(*info_tuple)

#Note that you can choose the same day for both the initial and final dates, because one day is added to the final date

#If the progress bar stalls or never completes, there are probably not enough tweets about your topic in the
#selected period. Interrupt the cell if it keeps running, then try a broader keyword, a longer period, or
#fewer tweets per day.
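
One way to make sure a collection step can never wait for tweets that do not exist is to cap the iteration instead of counting up to a target. The sketch below is illustrative only: `take_up_to` is a hypothetical helper and `fake_scraper` is a stand-in generator that replaces the real scraper so the example runs offline.

```python
from itertools import islice

def take_up_to(items, limit):
    """Return at most `limit` items; stops early if the source runs out."""
    return list(islice(items, limit))

def fake_scraper(n_available):
    """Stand-in for a scraper that yields only n_available results."""
    for i in range(n_available):
        yield f"tweet {i}"

print(len(take_up_to(fake_scraper(3), 10)))   # 3
print(len(take_up_to(fake_scraper(50), 10)))  # 10
```

Because `islice` simply stops when the source is exhausted, a day with few tweets returns a shorter list rather than blocking the loop.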

**Let's take a look at what people were tweeting about the new iPhone 14 release in September 2022**

In [None]:
my_tweets_df1 = pd.read_csv('iPhone 14_tweets_9.csv')
my_tweets_df1.head(15)

In [None]:
my_tweets_df1.info()

### Collecting data over multiple months and concatenating

**What have people been tweeting about the Ukraine war during the last 3 months? (Considering today as 01/04/2023)**

When several months are needed, we run the function once per month and then concatenate the resulting DataFrames.

In [None]:
#First month
info_tuple = search_your_tweets()

tweet_collector(*info_tuple)

In [None]:
#Second month
info_tuple = search_your_tweets()

tweet_collector(*info_tuple)

In [None]:
#Third month
info_tuple = search_your_tweets()

tweet_collector(*info_tuple)

In [None]:
ukraine_df1 = pd.read_csv('Ukraine war_tweets_1.csv')
ukraine_df2 = pd.read_csv('Ukraine war_tweets_2.csv')
ukraine_df3 = pd.read_csv('Ukraine war_tweets_3.csv')

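#ignore_index=True renumbers the rows, so the merged DataFrame has no repeated index values
df_merged = pd.concat([ukraine_df1, ukraine_df2, ukraine_df3], ignore_index=True)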
df_merged.info()

In [None]:
df_merged.head(5)

In [None]:
df_merged.tail(5)
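
If the monthly searches overlap, the same tweet may appear in more than one file. A minimal sketch of removing such repeats after concatenation follows; the two small DataFrames are made-up stand-ins for the monthly files, and the `tweet_id` column name comes from the default list used in this notebook.

```python
import pandas as pd

# Made-up stand-ins for two monthly files that share one tweet (tweet_id 2)
month1 = pd.DataFrame({"tweet_id": [1, 2], "tweet": ["a", "b"]})
month2 = pd.DataFrame({"tweet_id": [2, 3], "tweet": ["b", "c"]})

merged = (
    pd.concat([month1, month2], ignore_index=True)
    .drop_duplicates(subset="tweet_id")  # keep the first occurrence of each tweet
    .reset_index(drop=True)              # renumber rows after dropping duplicates
)
print(merged["tweet_id"].tolist())  # [1, 2, 3]
```

Deduplicating on the tweet identifier rather than on the text keeps distinct tweets that happen to have identical wording.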

## Thank you for reaching the end of my script! I hope it helped you with your task! If my algorithm was useful to you, or if you want to make suggestions or reviews, please follow me and write to me through my social media (links at the top of this notebook) =)
