# Twitter Scraper for Data Job and AI Sentiment Analysis

## Introduction

This notebook is part of a sentiment analysis project that examines public opinion about jobs in data and AI, specifically data science, data analysis, data engineering, and AI.

The purpose of this notebook is to scrape tweets related to the project's scope and create a dataset that will be used for the sentiment analysis task. The notebook contains code for scraping tweets related to specific search queries, processing the scraped data, and saving it to a CSV file.

## Dataset description

The dataset will consist of tweets related to data jobs and AI. Each search query pairs an AI-related term ("chatgpt", "GPT-3", "GPT-4", "openai", "AI", or "artificialintelligence") with a data-job term ("datascience", "dataanalysis", or "dataengineering"). The dataset will be a CSV file with columns for the tweet ID, date, content, query term, and job title. It will then be used for sentiment analysis to gain insight into public sentiment towards data jobs and AI.

# Setup

The setup for the notebook includes importing necessary libraries and defining any necessary functions. 

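In this notebook, we use the `snscrape` module to scrape tweets from Twitter, `pandas` to store the scraped data in a DataFrame, `datetime` to build the start date for year-filtered queries, `tqdm` to display a progress bar while scraping, and `os` to list the saved CSV files. The `time` module is also imported so that delays can be added between requests to reduce the risk of being rate-limited by Twitter, although the scraping function shown here does not call it.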

We have also defined a main function called `scrape_tweets` which takes in a search term, number of tweets, language, year, and whether to save the data as a CSV file or not. 

This function scrapes tweets using `snscrape` and returns a DataFrame of the scraped tweets. It also saves the DataFrame as a CSV file if requested.
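The setup above mentions delays as a way to avoid rate limiting. As a minimal sketch of how a capped, throttled loop over any iterator could look, the hypothetical helper `take_with_pause` below (not part of the notebook or of `snscrape`) stops after `limit` items and sleeps briefly every `pause_every` items. The parameter names and default values are assumptions chosen for illustration.

```python
import time
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_with_pause(items: Iterable[T],
                    limit: int,
                    pause_every: int = 100,
                    pause_seconds: float = 1.0) -> Iterator[T]:
    """Yield at most `limit` items, sleeping after every `pause_every` items.

    Hypothetical helper for illustration; the values are assumptions, not
    limits documented by Twitter or snscrape.
    """
    # islice stops after `limit` items without requesting an extra one.
    for index, item in enumerate(islice(items, limit), start=1):
        yield item
        if index % pause_every == 0:
            time.sleep(pause_seconds)


# Demonstration on a plain range instead of a live scraper.
sample = list(take_with_pause(range(1_000), limit=5, pause_every=2, pause_seconds=0.01))
print(sample)
```

In a scraping loop, the first argument would be the generator returned by the scraper, so the cap and the pauses apply without changing the rest of the code.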

In [None]:
%%capture
# Install the required packages (%%capture must be the first line of the cell)
!pip install pandas
!pip install snscrape

# Scraping Tweets

The code below uses the `snscrape` library to scrape Twitter for tweets matching search terms related to data science, data analysis, data engineering, and artificial intelligence.

The `scrape_tweets` function takes in the search term, the number of tweets to scrape, the language of the tweets, the year to search for tweets (if provided), and whether to save the scraped data as a CSV file.
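To make the query construction concrete, here is a small sketch that isolates the string-building step. `build_query` is a hypothetical name used only for this illustration; it follows the same `lang:` and `since:` pattern that `scrape_tweets` uses, and it does not contact Twitter.

```python
from datetime import datetime
from typing import Optional


def build_query(search_term: str,
                language: str = "en",
                year: Optional[int] = None) -> str:
    """Return the search string for a term, a language, and an optional start year."""
    query = f"{search_term} lang:{language}"
    if year is not None:
        # datetime raises ValueError for years outside the range 1..9999.
        since_date = datetime(year=year, month=1, day=1).strftime("%Y-%m-%d")
        query += f" since:{since_date}"
    return query


print(build_query("chatgpt datascience", year=2020))
# chatgpt datascience lang:en since:2020-01-01
```

Without a year, the query contains only the search term and the language filter.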

For each search term, the function is called with a specified number of tweets to scrape, year, and save_file flag. 

Each result is saved to a CSV file named after the search term and, if provided, the year, for example `chatgpt datascience_2020.csv`.
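The naming rule can be sketched as a small helper. `csv_filename` is a hypothetical name introduced for this illustration; the logic mirrors the filename expression inside `scrape_tweets`.

```python
from typing import Optional


def csv_filename(search_term: str, year: Optional[int] = None) -> str:
    """Return the CSV filename for a search term and an optional year."""
    stem = f"{search_term}_{year}" if year is not None else search_term
    return f"{stem}.csv"


print(csv_filename("GPT-4 dataengineering", 2020))
# GPT-4 dataengineering_2020.csv
print(csv_filename("AI datascience"))
# AI datascience.csv
```

Note that the space between the two words of the query is kept in the filename. The merge step later relies on that space to separate the search term from the job title.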

After scraping, the code reads in all the CSV files in a specified directory, adds a `query_term` and `job_title` column to each DataFrame, and merges all the DataFrames into one final DataFrame. 

The final DataFrame contains all the scraped tweets along with their respective search term and job title.
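The tagging and merging step described above can be sketched on in-memory data. The rows below are made up, `parse_filename` is a hypothetical helper, and the sketch assumes filenames of the form `<term> <job>_<year>.csv`.

```python
from typing import Tuple

import pandas as pd


def parse_filename(filename: str) -> Tuple[str, str]:
    """Split '<term> <job>_<year>.csv' into (term, job)."""
    stem = filename[: -len(".csv")]        # drop the extension
    query_part = stem.rsplit("_", 1)[0]    # drop the trailing "_<year>"
    term, job = query_part.split(" ", 1)
    return term, job


# Two tiny frames standing in for scraped CSV files (made-up rows).
frames = {
    "GPT-4 dataengineering_2020.csv": pd.DataFrame({"id": [1, 2], "content": ["a", "b"]}),
    "AI datascience_2020.csv": pd.DataFrame({"id": [3], "content": ["c"]}),
}

tagged = []
for name, df in frames.items():
    term, job = parse_filename(name)
    # assign returns a copy with the two extra columns added.
    tagged.append(df.assign(query_term=term, job_title=job))

final_df = pd.concat(tagged, ignore_index=True)
print(final_df)
```

Using `assign` keeps the input frames unchanged, which makes the step easier to re-run in a notebook.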

In [None]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time
import os

def scrape_tweets(search_term: str, 
                  number_of_tweets: int = 1000, 
                  language: str = 'en', 
                  year: int = None, 
                  save_file: bool = False) -> pd.DataFrame:
    """
    Scrape tweets using snscrape module and return a dataframe of the scraped tweets.
    
    Args:
        search_term (str): The search term to query on Twitter.
        number_of_tweets (int): The number of tweets to scrape. Defaults to 1000.
        language (str): The language of the tweets to scrape. Defaults to 'en'.
        year (int): The start year. If provided, the query returns tweets posted from January 1 of that year up to the current date. Defaults to None.
        save_file (bool): Whether to save the dataframe as a CSV file. Defaults to False.
    
    Returns:
        pd.DataFrame: A dataframe containing the scraped tweets with columns ['id', 'date', 'content'].
    """
    
    # Set the query string based on the search term, language, and year
    query = f"{search_term} lang:{language}"
    if year is not None:
        try:
            since_date = datetime(year=year, month=1, day=1).strftime('%Y-%m-%d')
            query += f" since:{since_date}"
        except ValueError:
            print(f"Invalid year provided: {year}")
            return pd.DataFrame()
    
    # Use snscrape to scrape tweets
    tweets = []
    for i, tweet in tqdm(enumerate(sntwitter.TwitterSearchScraper(query).get_items()), total=number_of_tweets):

        if i >= number_of_tweets:
            break
        tweets.append([tweet.id, tweet.date, tweet.rawContent])
        
    # Convert list of tweets to a pandas dataframe
    df = pd.DataFrame(tweets, columns=['id', 'date', 'content'])
    
    # Save the dataframe as a CSV file if requested
    if save_file:
        filename = f"{search_term}_{year}" if year is not None else f"{search_term}"
        df.to_csv(f"{filename}.csv", index=False)
    
    return df


In [None]:
# Define the list of queries
queries = [
    "chatgpt datascience",
    "chatgpt dataanalysis", 
    "chatgpt dataengineering",

    "GPT-3 datascience", 
    "GPT-3 dataanalysis", 
    "GPT-3 dataengineering", 

    "GPT-4 datascience", 
    "GPT-4 dataanalysis", 
    "GPT-4 dataengineering", 

    "openai datascience", 
    "openai dataanalysis", 
    "openai dataengineering", 

    "AI datascience",
    "AI dataanalysis", 
    "AI dataengineering", 

    "artificialintelligence datascience",
    "artificialintelligence dataanalysis",
    "artificialintelligence dataengineering", 

]

# Loop over the queries and scrape tweets for each query
for query in queries:
    print(f"Scraping tweets for query: {query}")

    # Call the scrape_tweets function to scrape tweets for the query
    scrape_tweets(query, 10_000, year=2020, save_file=True)

    # Print separator for readability
    print("="*80)

Scraping tweets for query: chatgpt datascience


 75%|███████▍  | 7465/10000 [05:52<01:59, 21.18it/s]


Scraping tweets for query: chatgpt dataanalysis


  2%|▏         | 160/10000 [00:05<05:40, 28.93it/s]


Scraping tweets for query: chatgpt dataengineering


  1%|          | 110/10000 [00:05<08:46, 18.77it/s]


Scraping tweets for query: GPT-3 datascience


 19%|█▉        | 1927/10000 [01:24<05:53, 22.83it/s]


Scraping tweets for query: GPT-3 dataanalysis


  0%|          | 3/10000 [00:00<48:01,  3.47it/s]


Scraping tweets for query: GPT-3 dataengineering


  0%|          | 19/10000 [00:02<17:43,  9.39it/s] 


Scraping tweets for query: GPT-4 datascience


  5%|▌         | 505/10000 [00:23<07:19, 21.59it/s]


Scraping tweets for query: GPT-4 dataanalysis


  0%|          | 17/10000 [00:01<13:42, 12.14it/s] 


Scraping tweets for query: GPT-4 dataengineering


  0%|          | 4/10000 [00:01<1:20:08,  2.08it/s]


Scraping tweets for query: openai datascience


 44%|████▎     | 4365/10000 [03:12<04:08, 22.71it/s]


Scraping tweets for query: openai dataanalysis


  0%|          | 28/10000 [00:01<09:22, 17.74it/s]


Scraping tweets for query: openai dataengineering


  0%|          | 28/10000 [00:02<15:11, 10.94it/s]


Scraping tweets for query: AI datascience


100%|██████████| 10000/10000 [08:34<00:00, 19.43it/s]


Scraping tweets for query: AI dataanalysis


 83%|████████▎ | 8296/10000 [04:38<00:57, 29.83it/s]


Scraping tweets for query: AI dataengineering


 53%|█████▎    | 5308/10000 [03:53<03:26, 22.76it/s]


Scraping tweets for query: artificialintelligence datascience


100%|██████████| 10000/10000 [08:18<00:00, 20.06it/s]


Scraping tweets for query: artificialintelligence dataanalysis


 45%|████▌     | 4518/10000 [03:37<04:24, 20.75it/s]


Scraping tweets for query: artificialintelligence dataengineering


 37%|███▋      | 3749/10000 [02:44<04:33, 22.85it/s]






In [None]:
# Define the directory where the CSV files are located
directory = '/content/'

# Define a function to add query and job title columns to a dataframe
def add_query_and_jobtitle(df, searchterm, jobtitle):
    df['query_term'] = searchterm
    df['job_title'] = jobtitle
    return df

# Define a dictionary to store the merged dataframes for each job title
merged_dfs = {}

# Loop over each CSV file in the directory
for filename in os.listdir(directory):
    
    # Check if the file is a CSV file
    if not filename.endswith('.csv'):
        continue
    
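    # Extract the search term and job title from a name like
    # "chatgpt datascience_2020.csv": drop ".csv", then the "_<year>" suffix
    stem = os.path.splitext(filename)[0].rsplit('_', 1)[0]
    parts = stem.split(' ')
    if len(parts) != 2:
        continue  # skip CSV files that do not follow the "<term> <job>" pattern
    searchterm, jobtitle = parts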
    
    # Read the CSV file into a dataframe
    filepath = os.path.join(directory, filename)
    df = pd.read_csv(filepath)
    
    # Add query and job title columns to the dataframe
    df = add_query_and_jobtitle(df, searchterm, jobtitle)
    
    # Merge the dataframe with the existing merged dataframe for this job title
    if jobtitle in merged_dfs:
        merged_dfs[jobtitle] = pd.concat([merged_dfs[jobtitle], df], ignore_index=True)
    else:
        merged_dfs[jobtitle] = df

# Concatenate all the merged dataframes into one final dataframe
final_df = pd.concat(merged_dfs.values(), ignore_index=True)

# Preview the first rows of the final dataframe
print('Final merged dataframe:')
final_df.head()

Final merged dataframe:


Unnamed: 0,id,date,content,query_term,job_title
0,1641069233017851904,2023-03-29 13:26:09+00:00,RT Using GPT-3.5-Turbo and GPT-4 to Apply Text...,GPT-4,dataengineering
1,1636501901494607873,2023-03-16 22:57:12+00:00,"""Ingestion solved? 🤔 #DataEngineering #AI #Op...",GPT-4,dataengineering
2,1623374696119996455,2023-02-08 17:34:23+00:00,OMG! Have you heard about GPT-4? This new AI t...,GPT-4,dataengineering
3,1439864014528471041,2021-09-20 08:08:27+00:00,GPT-3 and GPT-4 Could Ruin the Future Internet...,GPT-4,dataengineering
4,1641069233017851904,2023-03-29 13:26:09+00:00,RT Using GPT-3.5-Turbo and GPT-4 to Apply Text...,GPT-3,dataengineering
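
In the preview above, tweet `1641069233017851904` appears under both the GPT-4 and the GPT-3 query terms, because one tweet can match several queries. The notebook keeps both rows. If each tweet should be counted only once, one possible approach is to drop repeated IDs, as in this minimal sketch. The frame below reuses three IDs from the preview, and keeping the first occurrence is an assumption rather than a choice made in the notebook.

```python
import pandas as pd

# Small frame reusing IDs from the preview; the third row repeats the first ID.
preview = pd.DataFrame({
    "id": [1641069233017851904, 1636501901494607873, 1641069233017851904],
    "query_term": ["GPT-4", "GPT-4", "GPT-3"],
    "job_title": ["dataengineering"] * 3,
})

# Keep the first row seen for each tweet ID and renumber the index.
deduplicated = preview.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(len(preview), len(deduplicated))
# 3 2
```

Dropping duplicates this way discards the extra `query_term` labels, so it suits analyses that count tweets rather than query matches.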


In [None]:
# Saving the final dataframe
final_df.to_csv("Twitter_Final_data.csv", index=False)