# Twitter/X Extract Transform Load (ETL) Pipeline

In this project, we run a sentiment analysis of tweets to understand public opinion on various topics. Using data scraped from Twitter, we analyze the sentiment expressed in user-generated content with three methods: TextBlob, VADER, and Transformers. This analysis helps uncover how people feel about the themes, events, or issues highlighted in the tweets.


## Advantages of This Project
- **Multi-Method Analysis**: By applying several sentiment analysis techniques to the same tweets, we can compare their results and see where the methods agree or diverge on our dataset. Judging which method is most accurate would additionally require manually labeled tweets.

- **Data-Driven Insights**: The analysis enables stakeholders to make informed decisions based on public sentiment, enhancing engagement strategies and content curation.

- **Scalability**: The methodologies applied can be scaled to analyze larger datasets, making this project applicable for various organizations looking to gauge public opinion.


## Real-Life Applications
- **Market Research**: Businesses can use sentiment analysis to assess customer feedback on products, services, or brands, allowing for timely adjustments and improvements.

- **Social Media Monitoring**: Organizations can monitor brand sentiment on social media platforms to manage public relations and address negative sentiments proactively.

- **Political Campaigning**: Politicians and campaign managers can analyze sentiments around specific policies or announcements, helping them tailor their strategies to resonate with voters.


Overall, this project not only demonstrates the practical application of data science techniques in understanding social sentiment but also provides a valuable tool for various stakeholders in different domains.


## Twitter Data Scraping Using Nitter

In this section, we used the `ntscraper` library to scrape tweets from a Twitter account. The `Nitter` class initializes the scraper, which collects data from Twitter through a Nitter instance without relying on the official API.

We requested the latest 50 tweets from the user's timeline; as the log below shows, the scraper returned 38 before reaching an empty page. This data is used in the subsequent analysis. To verify the structure and content of the scraped tweets, we printed the output for inspection.


In [1]:
!pip install ntscraper



In [2]:
from ntscraper import Nitter

# Initialize the Nitter scraper
scraper = Nitter(log_level=1, skip_instance_check=False)

# Fetch tweets from a specific user
tweets = scraper.get_tweets("TEDTalks", mode="user", number=50)

# Check the structure of the tweets
print(tweets)

Testing instances: 100%|██████████| 16/16 [00:19<00:00,  1.24s/it]

19-Oct-24 21:08:44 - No instance specified, using random instance https://nitter.privacydev.net





19-Oct-24 21:08:49 - Current stats for TEDTalks: 20 tweets, 0 threads...
19-Oct-24 21:08:54 - Current stats for TEDTalks: 38 tweets, 0 threads...
19-Oct-24 21:08:55 - Empty page on https://nitter.privacydev.net
{'tweets': [{'link': 'https://twitter.com/TEDTalks/status/1841076091097616513#m', 'text': 'Can you guess who this cat represents? Introducing TED Games, a new way to experience the wonder and fun of TED every day.   Play our very first game — The Purring Test — right now: t.ted.com/tavdsCU  PS - Come back and let us know your score!', 'user': {'name': 'TED Talks', 'username': '@TEDTalks', 'profile_id': '877631054525472768', 'avatar': 'https://pbs.twimg.com/profile_images/877631054525472768/Xp5FAPD5_bigger.jpg'}, 'date': 'Oct 1, 2024 · 11:22 AM UTC', 'is-retweet': False, 'is-pinned': True, 'external-link': '', 'replying-to': [], 'quoted-post': {}, 'stats': {'comments': 22, 'retweets': 15, 'quotes': 3, 'likes': 140}, 'pictures': ['https://pbs.twimg.com/media/GYwrtr2XwAMH_PV.jpg'],
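
The log above shows the scraper stopping at 38 tweets after an empty page, although 50 were requested. A minimal sketch of a check for that situation, assuming the result keeps the `{'tweets': [...]}` shape printed above. The helper `check_scrape` and the sample dictionary are hypothetical and not part of `ntscraper`:

```python
def check_scrape(result, requested):
    """Return (returned_count, shortfall) for a scrape result dictionary."""
    # Assumption: the result has a 'tweets' key holding a list, as in the printed output.
    returned = len(result.get("tweets", []))
    return returned, max(requested - returned, 0)


# Hypothetical stand-in for the scraper output, trimmed to two tweets.
sample_result = {
    "tweets": [
        {"link": "https://twitter.com/example/status/1", "text": "first"},
        {"link": "https://twitter.com/example/status/2", "text": "second"},
    ]
}

returned, shortfall = check_scrape(sample_result, requested=5)
print(f"Returned {returned} tweets, {shortfall} fewer than requested.")
```

Running such a check right after the scrape makes it explicit that later totals and averages describe the tweets actually returned, not the number requested.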

## Organizing Scraped Data into a DataFrame

After scraping the tweets, the next step was to structure the data for easier manipulation and analysis. We created a pandas DataFrame to store the relevant features from the tweets. The extracted fields include:

- **Link**: URL to the specific tweet.
- **Text**: The content of the tweet.
- **User**: The name of the user who posted the tweet.
- **Likes**: Number of likes received by the tweet.
- **Quotes**: Number of times the tweet has been quoted.
- **Retweets**: Number of retweets.
- **Comments**: Number of comments.

This DataFrame allows us to efficiently explore and analyze the tweet data in the subsequent sections.


In [3]:
import pandas as pd

# One list per column; each scraped tweet appends one value to every list
data = {
    'link': [],
    'text': [],
    'user': [],
    'likes': [],
    'quotes': [],
    'retweets': [],
    'comments': []
}

# Pull the top-level fields and the nested 'user' and 'stats' values
for tweet in tweets['tweets']:
    data['link'].append(tweet['link'])
    data['text'].append(tweet['text'])
    data['user'].append(tweet['user']['name'])
    data['likes'].append(tweet['stats']['likes'])
    data['quotes'].append(tweet['stats']['quotes'])
    data['retweets'].append(tweet['stats']['retweets'])
    data['comments'].append(tweet['stats']['comments'])

df_tweets = pd.DataFrame(data)
df_tweets

Unnamed: 0,link,text,user,likes,quotes,retweets,comments
0,https://twitter.com/TEDTalks/status/1841076091...,Can you guess who this cat represents? Introdu...,TED Talks,140,3,15,22
1,https://twitter.com/TEDTalks/status/1847400688...,"""I'm not that worried about the weeks running ...",TED Talks,122,1,28,16
2,https://twitter.com/TEDTalks/status/1846667974...,Anyone can use big tech companies’ powerful AI...,TED Talks,60,0,9,2
3,https://twitter.com/TEDTalks/status/1846302531...,"""I did not come to see the chatbot of my grand...",TED Talks,73,1,15,5
4,https://twitter.com/MethaneSAT/status/18161294...,What if we could slow global warming in our li...,MethaneSAT,67,2,22,8
5,https://twitter.com/ianbremmer/status/18461764...,are we headed for another january 6th if harri...,ian bremmer,121,1,27,30
6,https://twitter.com/TEDTalks/status/1844785507...,"Congratulations to TED speaker @DemisHassabis,...",TED Talks,17,0,3,0
7,https://twitter.com/TEDTalks/status/1844785353...,Congratulations to TED speaker and @TheAudacio...,TED Talks,47,1,5,6
8,https://twitter.com/TEDTalks/status/1844766504...,#ICYMI: We’re on the search for the next great...,TED Talks,45,1,12,6
9,https://twitter.com/FaberBooks/status/18447518...,Peter Pomerantsev speaks @TEDTalks on how to f...,Faber Books,56,1,19,3
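
As an alternative to the explicit loop, the nested `user` and `stats` dictionaries can be flattened with `pandas.json_normalize`. The sketch below uses a small hypothetical sample that mirrors the structure of the scraped output; the values are made up and `df_sample` is an illustrative name:

```python
import pandas as pd

# Hypothetical sample with the same nesting as the scraped tweets.
sample = {
    "tweets": [
        {
            "link": "https://twitter.com/example/status/1",
            "text": "Hello world",
            "user": {"name": "Example User", "username": "@example"},
            "stats": {"comments": 2, "retweets": 5, "quotes": 1, "likes": 10},
        },
        {
            "link": "https://twitter.com/example/status/2",
            "text": "Second tweet",
            "user": {"name": "Example User", "username": "@example"},
            "stats": {"comments": 0, "retweets": 1, "quotes": 0, "likes": 3},
        },
    ]
}

# json_normalize turns nested keys into dotted column names such as 'stats.likes'.
flat = pd.json_normalize(sample["tweets"])

# Keep the seven fields used in this project and give them short names.
columns = {
    "link": "link",
    "text": "text",
    "user.name": "user",
    "stats.likes": "likes",
    "stats.quotes": "quotes",
    "stats.retweets": "retweets",
    "stats.comments": "comments",
}
df_sample = flat[list(columns)].rename(columns=columns)
print(df_sample)
```

Both approaches give the same seven columns. The loop is more explicit, while `json_normalize` is shorter when many nested fields are needed.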


## DataFrame Overview and Summary Statistics

To better understand the structure and composition of the tweet data, we performed the following checks:

- **DataFrame Info**: Provides a concise summary of the DataFrame, including the data types of each column, the number of non-null entries, and the memory usage.

- **Descriptive Statistics**: Using the `describe()` method, we generated summary statistics for the numerical columns such as the number of likes, quotes, retweets, and comments. This gave us insights into the distribution of these engagement metrics across the scraped tweets.

- **Column Names**: We retrieved the list of column names in the DataFrame to confirm the structure and ensure all relevant features were properly captured.


In [4]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   link      38 non-null     object
 1   text      38 non-null     object
 2   user      38 non-null     object
 3   likes     38 non-null     int64 
 4   quotes    38 non-null     int64 
 5   retweets  38 non-null     int64 
 6   comments  38 non-null     int64 
dtypes: int64(4), object(3)
memory usage: 2.2+ KB


In [5]:
df_tweets.describe()

Unnamed: 0,likes,quotes,retweets,comments
count,38.0,38.0,38.0,38.0
mean,151.657895,3.526316,31.763158,12.815789
std,343.915316,5.212961,39.393671,15.675519
min,6.0,0.0,1.0,0.0
25%,33.25,0.25,11.0,3.0
50%,54.0,1.0,19.0,6.0
75%,117.5,4.0,29.5,15.25
max,2056.0,21.0,156.0,71.0


In [6]:
df_tweets.columns

Index(['link', 'text', 'user', 'likes', 'quotes', 'retweets', 'comments'], dtype='object')

## Calculating Total Likes

To assess the overall engagement of the collected tweets, we calculated the total number of likes by summing the values in the 'likes' column. This metric gauges the collective popularity and reach of the tweets on the scraped timeline.


In [7]:
total_likes = df_tweets.likes.sum()
print('Total number of likes = {}.'.format(total_likes))

Total number of likes = 5763.
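
A total can hide how unevenly likes are distributed: in the summary statistics above, the mean (about 152) is well above the median (54), and the maximum is 2056. One way to quantify this is the share of the total contributed by the largest values. The helper `top_share` and the toy numbers below are hypothetical, not the scraped data:

```python
import pandas as pd


def top_share(series, n=1):
    """Fraction of the column total contributed by the n largest values."""
    total = series.sum()
    if total == 0:
        return 0.0
    return float(series.nlargest(n).sum() / total)


# Toy like counts with one dominant value.
toy_likes = pd.Series([6, 20, 54, 120, 800])
print(f"Share of the most-liked tweet: {top_share(toy_likes):.1%}")
```

Applied to the figures reported above, the single most-liked tweet (2056 likes) accounts for roughly 36% of the 5763 total, so the total says more about one post than about typical engagement.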


## Calculating Total Retweets

In addition to likes, retweets are another key metric of engagement on Twitter. We computed the total number of retweets by summing the values in the 'retweets' column. This gives an indication of how widely the tweets have been shared across the platform.

The total retweets provide insights into the extent of the tweets' amplification through user interactions.


In [9]:
total_retweets = df_tweets.retweets.sum()
print('Total number of retweets = {}.'.format(total_retweets))

Total number of retweets = 1207.
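
The same totals can be computed for every engagement column in one step by selecting the columns and calling `sum()` on the resulting DataFrame. A short sketch with a toy stand-in for `df_tweets` (the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for df_tweets with the four engagement columns.
toy = pd.DataFrame(
    {
        "likes": [10, 20, 30],
        "quotes": [0, 1, 2],
        "retweets": [1, 2, 3],
        "comments": [4, 5, 6],
    }
)

engagement_cols = ["likes", "quotes", "retweets", "comments"]

# DataFrame.sum() returns one total per column as a Series.
totals = toy[engagement_cols].sum()
print(totals)
```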


## Analyzing Tweet Engagement: Top and Least Liked Tweets

We identified the top 5 most liked tweets and compared them with the bottom 5 tweets based on the number of likes. This analysis provides insight into audience engagement, showcasing both the most and least successful content.

In [11]:
Top_likes = df_tweets.nlargest(5,'likes')
Top_likes

Unnamed: 0,link,text,user,likes,quotes,retweets,comments
33,https://twitter.com/AGRA_Africa/status/1834570...,"In August, AGRA President Dr @Agnes_Kalibata, ...",AGRA - Sustainably Growing Africa’s Food Systems.,2056,0,149,13
34,https://twitter.com/RoryStewartUK/status/18342...,I used to think the way to end poverty was to ...,Rory Stewart,598,21,156,71
20,https://twitter.com/TEDTalks/status/1840831479...,"According to @AndrewYang, US politics can be r...",TED Talks,535,18,129,51
22,https://twitter.com/owocki/status/184076981334...,How Quadratic Funding could finance your dream...,"Owocki.Ξth (🍄,🟢)",369,17,90,45
27,https://twitter.com/TEDTalks/status/1837584836...,"""We're all hallucinating all the time, includi...",TED Talks,281,8,87,39


In [12]:
Least_likes = df_tweets.nsmallest(5,'likes')
Least_likes

Unnamed: 0,link,text,user,likes,quotes,retweets,comments
12,https://twitter.com/TEDTalks/status/1844046605...,"Turns out, you love to listen to TED Talks on ...",TED Talks,6,0,3,1
11,https://twitter.com/TEDTalks/status/1844046606...,Thank you so much for your support!,TED Talks,7,0,1,1
13,https://twitter.com/TEDTalks/status/1844046603...,Fixable has been nominated for the @SignalAwar...,TED Talks,8,1,3,2
6,https://twitter.com/TEDTalks/status/1844785507...,"Congratulations to TED speaker @DemisHassabis,...",TED Talks,17,0,3,0
14,https://twitter.com/TEDTalks/status/1844046602...,Your favorite TED podcasts have been nominated...,TED Talks,20,1,3,3
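
Several of the most-liked rows come from accounts other than TED Talks (for example AGRA and Rory Stewart), which suggests the timeline includes content the account reposted. If the goal is to rank only the account's own posts, one option is to filter on the `user` column before calling `nlargest`. A sketch with made-up data:

```python
import pandas as pd

# Toy timeline mixing the account's own posts with one post from another account.
toy = pd.DataFrame(
    {
        "user": ["TED Talks", "Other Account", "TED Talks", "TED Talks"],
        "text": ["a", "b", "c", "d"],
        "likes": [50, 900, 120, 7],
    }
)

# Keep only rows authored by the account of interest, then rank by likes.
own = toy[toy["user"] == "TED Talks"]
top_own = own.nlargest(2, "likes")
print(top_own)
```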


## Retweet Analysis

To further analyze the retweet engagement, we sort the tweets based on their retweet counts. 

### Sorted Tweets by Retweets (Ascending Order)
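The tweets can be sorted in ascending order of retweets to identify tweets with lower engagement. The output also contains the sentiment columns because this cell was re-run after the sentiment analysis described later: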

In [31]:
df_tweets.sort_values('retweets')

Unnamed: 0,link,text,user,likes,quotes,retweets,comments,textblob_sentiment,vader_sentiment,transformer_sentiment,transformer_confidence
11,https://twitter.com/TEDTalks/status/1844046606...,Thank you so much for your support!,TED Talks,7,0,1,1,0.25,0.6696,POSITIVE,0.99985
6,https://twitter.com/TEDTalks/status/1844785507...,"Congratulations to TED speaker @DemisHassabis,...",TED Talks,17,0,3,0,0.0,0.9201,POSITIVE,0.999077
14,https://twitter.com/TEDTalks/status/1844046602...,Your favorite TED podcasts have been nominated...,TED Talks,20,1,3,3,0.5,0.4588,POSITIVE,0.658019
12,https://twitter.com/TEDTalks/status/1844046605...,"Turns out, you love to listen to TED Talks on ...",TED Talks,6,0,3,1,0.541667,0.8655,POSITIVE,0.993336
13,https://twitter.com/TEDTalks/status/1844046603...,Fixable has been nominated for the @SignalAwar...,TED Talks,8,1,3,2,0.386869,0.9523,POSITIVE,0.969175
7,https://twitter.com/TEDTalks/status/1844785353...,Congratulations to TED speaker and @TheAudacio...,TED Talks,47,1,5,6,0.0,0.8979,POSITIVE,0.998415
2,https://twitter.com/TEDTalks/status/1846667974...,Anyone can use big tech companies’ powerful AI...,TED Talks,60,0,9,2,0.240741,0.872,NEGATIVE,0.991087
35,https://twitter.com/TEDTalks/status/1834258109...,Can your phone read your mind? Ethicist @NitaF...,TED Talks,34,2,10,5,0.5,0.5859,POSITIVE,0.636579
18,https://twitter.com/tiltcollective/status/1842...,Now on @TEDTalks: Can we meet climate goals an...,Tilt Collective,27,6,11,5,0.166667,0.0,POSITIVE,0.956121
32,https://twitter.com/TEDTalks/status/1836472918...,You may have seen the deep fake memes of the U...,TED Talks,27,0,11,9,-0.166667,-0.2617,NEGATIVE,0.994895
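
In the output above, several tweets share the same retweet count (four tweets have 3 retweets), so their relative order is not determined by the sort key alone. Passing a second column to `sort_values` makes the order explicit. A sketch with toy values:

```python
import pandas as pd

# Toy data with a three-way tie on retweets.
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "retweets": [3, 1, 3, 3],
        "likes": [20, 7, 6, 8],
    }
)

# Sort by retweets first, then break ties by likes (both ascending).
ordered = toy.sort_values(["retweets", "likes"], ascending=[True, True])
print(ordered)
```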


## Retweet Sorting

To see which tweets were shared most widely, we sort the tweets by retweet count again, this time from highest to lowest.

### Sorted Tweets by Retweets (Descending Order)
Sorting in descending order of retweets puts the tweets with the highest engagement first:



In [17]:
df_tweets.sort_values('retweets', ascending=False)

Unnamed: 0,link,text,user,likes,quotes,retweets,comments
34,https://twitter.com/RoryStewartUK/status/18342...,I used to think the way to end poverty was to ...,Rory Stewart,598,21,156,71
33,https://twitter.com/AGRA_Africa/status/1834570...,"In August, AGRA President Dr @Agnes_Kalibata, ...",AGRA - Sustainably Growing Africa’s Food Systems.,2056,0,149,13
20,https://twitter.com/TEDTalks/status/1840831479...,"According to @AndrewYang, US politics can be r...",TED Talks,535,18,129,51
22,https://twitter.com/owocki/status/184076981334...,How Quadratic Funding could finance your dream...,"Owocki.Ξth (🍄,🟢)",369,17,90,45
27,https://twitter.com/TEDTalks/status/1837584836...,"""We're all hallucinating all the time, includi...",TED Talks,281,8,87,39
36,https://twitter.com/TEDTalks/status/1833408819...,Here's what you can say if you think a loved o...,TED Talks,167,10,74,13
17,https://twitter.com/TEDTalks/status/1843019648...,Practice makes perfect — but are you practicin...,TED Talks,118,3,36,9
30,https://twitter.com/ianbremmer/status/18371527...,all major decisions being made about what arti...,ian bremmer,89,4,34,11
28,https://twitter.com/patagonia/status/183716328...,"“Frankly, we’re running out of time for saving...",Patagonia,116,4,33,10
31,https://twitter.com/TEDTalks/status/1836805680...,The critical decisions about AI are being made...,TED Talks,71,5,30,25


## Sentiment Analysis

We now apply three sentiment analysis techniques: TextBlob, VADER, and Transformers. TextBlob and VADER are lexicon-based and return a continuous score, while the Transformers pipeline returns a class label with a confidence. Using all three lets us cross-check the emotional tone of the tweets on the selected account's timeline rather than rely on a single method.


### TextBlob Sentiment Analysis

For sentiment analysis, the TextBlob library was employed to evaluate the sentiment of the tweets. A function was created to determine the sentiment polarity of each tweet's text, yielding a score ranging from -1 (indicating negative sentiment) to 1 (indicating positive sentiment). This approach facilitated an effective assessment of the emotional tone present in the tweets.


In [34]:
!pip install textblob

from textblob import TextBlob

# TextBlob Sentiment Analysis
def get_textblob_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity  # Returns sentiment polarity (-1 to 1)
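
Scraped text can be missing or empty, for example after reloading the data from a file, and a scorer that expects a string may then raise an error. The sketch below wraps any scoring function with a guard. `safe_polarity` and `dummy_scorer` are hypothetical names; in this notebook the scorer would be `get_textblob_sentiment`:

```python
def safe_polarity(text, scorer, default=0.0):
    """Call scorer(text) only for non-empty strings; otherwise return default."""
    if not isinstance(text, str) or not text.strip():
        return default
    return scorer(text)


# Stand-in scorer for demonstration only; it ignores its input.
def dummy_scorer(text):
    return 0.5


print(safe_polarity("Thank you so much for your support!", dummy_scorer))
print(safe_polarity(float("nan"), dummy_scorer))
```

Returning a neutral default is a design choice. Returning `None` instead would keep missing texts distinguishable from neutral ones.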



### VADER Sentiment Analysis

To enhance the sentiment analysis, the VADER (Valence Aware Dictionary and sEntiment Reasoner) library was utilized. This library is particularly effective for analyzing sentiments in social media texts. A function was developed to compute the sentiment scores of each tweet using VADER, focusing on the compound score, which ranges from -1 (indicating negative sentiment) to 1 (indicating positive sentiment). This method allowed for a nuanced understanding of the emotional content in the tweets.


In [33]:
!pip install vaderSentiment

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER sentiment analyzer
vader_analyzer = SentimentIntensityAnalyzer()

# VADER Sentiment Analysis
def get_vader_sentiment(text):
    vader_score = vader_analyzer.polarity_scores(text)
    return vader_score['compound']  # Compound score (-1 to 1)
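
The compound score is continuous, so comparing it with a categorical method requires a cut-off. The VADER documentation suggests ±0.05 as typical thresholds, which the sketch below assumes. The helper name `label_from_compound` is hypothetical, and the example scores are taken from the sentiment columns shown elsewhere in this notebook:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores as they appear in the notebook's output tables.
for score in (0.6696, 0.0, -0.2617):
    print(score, label_from_compound(score))
```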



### Transformers Sentiment Analysis

The third method uses the Transformers library with a pre-trained sentiment model. A function was created to evaluate each tweet, producing a label of either 'POSITIVE' or 'NEGATIVE' along with a confidence score. Because no model is named, the pipeline falls back to its default checkpoint (a DistilBERT model fine-tuned on SST-2, as the log output below reports). Unlike the lexicon-based methods, this model takes word context into account, but it has no neutral class, so every tweet is assigned to one of the two labels.
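
To place the label and confidence on the same -1 to 1 axis as the TextBlob and VADER scores, one simple convention is to negate the confidence for 'NEGATIVE' labels. This is an illustrative assumption, not something the Transformers library does itself, and `signed_confidence` is a hypothetical helper:

```python
def signed_confidence(label, score):
    """Map ('POSITIVE' | 'NEGATIVE', confidence) to a value in [-1, 1]."""
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return -score
    raise ValueError(f"Unexpected label: {label!r}")


print(signed_confidence("POSITIVE", 0.99985))
print(signed_confidence("NEGATIVE", 0.991087))
```

The signed value should be read with care: a high confidence means the model is sure of the class, not that the text is strongly emotional.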


In [23]:
!pip install tf-keras
from transformers import pipeline

# Initialize Hugging Face's sentiment analysis pipeline (no model is named, so it uses the default DistilBERT SST-2 checkpoint)
transformer_analyzer = pipeline("sentiment-analysis")

# Transformers Sentiment Analysis (returns either 'POSITIVE' or 'NEGATIVE' with score)
def get_transformer_sentiment(text):
    result = transformer_analyzer(text)[0]
    return result['label'], result['score']  # Returns label ('POSITIVE'/'NEGATIVE') and confidence score

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.17.0
19-Oct-24 21:21:34 - From c:\Users\rajma\AppData\Local\Programs\Python\Python312\Lib\site-packages\tf_keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.






### Sentiment Analysis Application

The sentiment analysis was applied to the tweets using three different methods: TextBlob, VADER, and Transformers. 

1. **TextBlob Sentiment Analysis**: This method was used to calculate the sentiment polarity for each tweet, resulting in a new column `textblob_sentiment`.

2. **VADER Sentiment Analysis**: The VADER tool was employed to derive a compound sentiment score for each tweet, stored in the `vader_sentiment` column.

3. **Transformers Sentiment Analysis**: A deep learning model from the Transformers library was utilized to classify the sentiment of each tweet as 'POSITIVE' or 'NEGATIVE'. The results are captured in two columns: `transformer_sentiment` for the sentiment label and `transformer_confidence` for the associated confidence score.

The DataFrame now contains the original tweet text along with the computed sentiment analysis results for comparison and further evaluation.


In [30]:
# Apply TextBlob sentiment analysis to the 'text' column
df_tweets['textblob_sentiment'] = df_tweets['text'].apply(get_textblob_sentiment)

# Apply VADER sentiment analysis to the 'text' column
df_tweets['vader_sentiment'] = df_tweets['text'].apply(get_vader_sentiment)

# Apply Transformers sentiment analysis to the 'text' column.
# The model runs once per tweet; each result is a (label, score) tuple.
transformer_results = df_tweets['text'].apply(get_transformer_sentiment)
df_tweets['transformer_sentiment'] = transformer_results.apply(lambda result: result[0])
df_tweets['transformer_confidence'] = transformer_results.apply(lambda result: result[1])

# Display the DataFrame with all sentiment analysis columns
df_tweets[['user', 'text', 'textblob_sentiment', 'vader_sentiment', 'transformer_sentiment', 'transformer_confidence']]

Unnamed: 0,user,text,textblob_sentiment,vader_sentiment,transformer_sentiment,transformer_confidence
0,TED Talks,Can you guess who this cat represents? Introdu...,0.107846,0.7177,POSITIVE,0.998199
1,TED Talks,"""I'm not that worried about the weeks running ...",0.2,-0.1543,POSITIVE,0.967625
2,TED Talks,Anyone can use big tech companies’ powerful AI...,0.240741,0.872,NEGATIVE,0.991087
3,TED Talks,"""I did not come to see the chatbot of my grand...",0.35,0.0,NEGATIVE,0.912412
4,MethaneSAT,What if we could slow global warming in our li...,-0.1,0.5423,POSITIVE,0.997663
5,ian bremmer,are we headed for another january 6th if harri...,0.004861,-0.6486,NEGATIVE,0.993896
6,TED Talks,"Congratulations to TED speaker @DemisHassabis,...",0.0,0.9201,POSITIVE,0.999077
7,TED Talks,Congratulations to TED speaker and @TheAudacio...,0.0,0.8979,POSITIVE,0.998415
8,TED Talks,#ICYMI: We’re on the search for the next great...,0.330195,0.835,NEGATIVE,0.774227
9,Faber Books,Peter Pomerantsev speaks @TEDTalks on how to f...,0.8,-0.4019,POSITIVE,0.740075
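
The table shows that the methods do not always agree. A rough way to quantify this is to convert the VADER compound score to a label and compare it with the Transformers label. The sketch below copies the ten rows displayed above and assumes the ±0.05 cut-offs suggested in the VADER documentation; `vader_label` is a hypothetical helper:

```python
import pandas as pd

# VADER compound scores and Transformers labels copied from the ten rows shown above.
rows = pd.DataFrame(
    {
        "vader_sentiment": [0.7177, -0.1543, 0.872, 0.0, 0.5423,
                            -0.6486, 0.9201, 0.8979, 0.835, -0.4019],
        "transformer_sentiment": ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE",
                                  "NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
    }
)


def vader_label(compound, threshold=0.05):
    """Map a compound score to 'POSITIVE', 'NEGATIVE', or 'NEUTRAL' (assumed cut-offs)."""
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


rows["vader_label"] = rows["vader_sentiment"].apply(vader_label)

# A row counts as agreement only when both labels are identical.
rows["agree"] = rows["vader_label"] == rows["transformer_sentiment"]
agreement_rate = rows["agree"].mean()

print(f"Agreement on the displayed rows: {agreement_rate:.0%}")
print(rows[~rows["agree"]])
```

On these ten rows the two methods agree on five. Since the Transformers model has no neutral class, any tweet VADER scores as neutral counts as a disagreement under this definition.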
