<a href="https://colab.research.google.com/github/kanchanmaurya95/AI_ML_100_days/blob/main/Sentiment_Analysis_on_Twitter_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Sentiment Analysis on Twitter Data**
In this project, we apply Natural Language Processing (NLP) to Twitter data to perform sentiment analysis. Sentiment analysis, also known as opinion mining, is a subfield of NLP that analyzes text to determine the sentiment it expresses. It is useful for businesses, policymakers, and individuals who want to gauge public opinion on topics, products, or services.

Twitter, with its vast trove of short, opinion-rich texts, provides an excellent dataset for sentiment analysis. Each tweet is a reflection of the user's feelings, opinions, or thoughts on a subject, making Twitter an invaluable resource for understanding public sentiment.

**Key Concepts**

**Sentiment Analysis:** At its core, sentiment analysis involves classifying text as positive, negative, or neutral based on the emotion conveyed. This can be extended to detect more specific sentiments like happiness, anger, or surprise.

**Natural Language Processing (NLP):** NLP is a field of artificial intelligence that focuses on enabling machines to understand, interpret, and respond to human language.

**Machine Learning in NLP:** Sentiment analysis typically involves machine learning, where algorithms learn to classify text based on a training dataset.

**Project Steps**
1. **Data Acquisition:** We use the Sentiment140 dataset, a popular choice containing 1.6 million tweets labeled for sentiment. This notebook loads only its test split (`sentiment-140-test`, 498 tweets) using Deeplake, a Python client for Activeloop Hub.

Dataset documentation: https://datasets.activeloop.ai/docs/ml/datasets/sentiment-140-dataset/

2. **Data Preprocessing:** Tweets are preprocessed to convert them into a format suitable for analysis. This includes removing user handles and URLs, tokenization (splitting text into individual words or tokens), removing stop words (common words that don't contribute much meaning), and stemming (reducing words to a common root form).

3. **Feature Extraction:** The preprocessed text is transformed into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) method, which weights each word by how often it appears in a tweet and how rare it is across the dataset.

4. **Model Training:** We employ the Naive Bayes classifier, a popular choice for text classification tasks due to its simplicity and effectiveness. The model is trained on 80% of the loaded tweets to learn the association between features (words) and labels (sentiments); the remaining 20% is held out for testing.

5. **Model Evaluation:** The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1-score. These metrics provide insight into how well the model is classifying the sentiments.

6. **Prediction and Analysis:** Finally, the trained model is used to predict sentiments of new tweets, allowing us to analyze public opinion on various topics.
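
To make step 3 concrete, here is a minimal pure-Python sketch of TF-IDF weighting on three toy documents that resemble stemmed tweets. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`, `norm='l2'`). The toy documents and the helper name `tfidf` are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed to match scikit-learn's default)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize each document vector; guard against empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

toy_docs = ["love kindl", "love twitter", "hate economi"]
weights = tfidf(toy_docs)
for doc, row in zip(toy_docs, weights):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the first toy document, "kindl" appears in only one document while "love" appears in two, so "kindl" receives the larger weight. That is the sense in which TF-IDF favors terms that are distinctive for a tweet.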

Through this project, we aim to demonstrate the power of machine learning and NLP in extracting meaningful insights from textual data, particularly from social media platforms like Twitter.

In [27]:
import deeplake
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re


In [28]:
# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [29]:
# Build the stop-word set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Function to preprocess tweets
def preprocess_tweet(tweet):
    # Remove handles (@user)
    tweet = re.sub(r'@[A-Za-z0-9_]+', '', tweet)

    # Remove URLs
    tweet = re.sub(r'https?://[A-Za-z0-9./]+', '', tweet)

    # Lowercase and tokenize
    tokens = word_tokenize(tweet.lower())

    # Keep alphabetic tokens that are not stop words, then stem them
    processed_tokens = [STEMMER.stem(word) for word in tokens if word.isalpha() and word not in STOP_WORDS]
    return ' '.join(processed_tokens)
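
The regex cleaning step can be checked in isolation, without NLTK. The sketch below is only an illustration: the sample tweet and the helper name `strip_handles_and_urls` are made up. It also contrasts two URL patterns: a character class such as `[A-Za-z0-9./]+` stops at the first `?`, `=`, or `-`, while `\S+` consumes the URL up to the next whitespace.

```python
import re

def strip_handles_and_urls(text):
    """Remove @handles and http(s) URLs, then collapse whitespace (hypothetical helper)."""
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    text = re.sub(r'https?://\S+', '', text)
    return ' '.join(text.split())

# Made-up sample tweet with a handle and a URL that carries a query string
sample = "@stellargirl loving my Kindle2 http://example.com/review?id=42 so far"

# Broad pattern: the whole URL is removed
print(strip_handles_and_urls(sample))

# Narrow character class: the query string is left behind in the text
narrow = re.sub(r'https?://[A-Za-z0-9./]+', '', sample)
print(narrow)
```

In a pipeline that later keeps only alphabetic tokens, most leftover fragments are filtered out anyway, but purely alphabetic remnants of a URL could still slip through as words.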


In [30]:

# Load the dataset from Activeloop Hub
ds = deeplake.load("hub://activeloop/sentiment-140-test")



Opening dataset in read-only mode as you don't have write permissions.

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/sentiment-140-test

hub://activeloop/sentiment-140-test loaded successfully.

In [31]:


# Check the dimensions of the tensors
print("Tweet text tensor shape:", ds['tweet_text'].numpy().shape)
print("Sentiment tensor shape:", ds['sentiment_type'].numpy().shape)

# Each tensor is expected to carry a trailing singleton dimension, i.e. shape (N, 1).
# The next cell flattens both to 1D before building the DataFrame.


Tweet text tensor shape: (498, 1)
Sentiment tensor shape: (498, 1)


In [32]:
# Flatten the tensors and convert to DataFrame
tweets = ds['tweet_text'].numpy().flatten()
sentiments = ds['sentiment_type'].numpy().flatten()
df = pd.DataFrame({'tweet': tweets, 'sentiment': sentiments})

In [33]:
# Map the sentiment labels
df['sentiment'] = df['sentiment'].map({0: 'negative', 2: 'neutral', 4: 'positive'}).astype(str)

# Preprocess the tweets
df['processed_tweet'] = df['tweet'].apply(preprocess_tweet)




In [34]:
# Split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(df['processed_tweet'], df['sentiment'], test_size=0.2, random_state=42)




In [35]:
print(y_train.dtype)
print(y_train.unique())


object
['positive' 'negative' 'neutral']


In [36]:
# Vectorize the text
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)




In [37]:
# Train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
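
For intuition about what `MultinomialNB` fits, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace (add-one) smoothing on raw word counts. It is an illustration under assumptions: the toy data and the helper names `train_nb` and `predict_nb` are made up, and scikit-learn applies the same smoothing idea (`alpha=1.0` by default) to the TF-IDF weights rather than to integer counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods per class (hypothetical helper)."""
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_label / len(labels))
        # Add-alpha smoothing keeps unseen (class, word) pairs from getting zero probability
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score (hypothetical helper)."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # Words outside the training vocabulary are ignored
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

train_docs = ["love kindl", "love twitter", "hate economi", "hate monday"]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "love economi"))
```

The classifier scores each class by adding the log prior to the log likelihood of every word, so a single strongly class-associated word can dominate the prediction for a short tweet.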



In [38]:

# Predict on test set
y_pred = clf.predict(X_test_tfidf)



In [39]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision (Weighted): {precision}")
print(f"Recall (Weighted): {recall}")
print(f"F1 Score (Weighted): {f1}")


Accuracy: 0.65
Precision (Weighted): 0.7012058212058213
Recall (Weighted): 0.65
F1 Score (Weighted): 0.6270854441067207
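
The weighted scores above are easier to interpret with a per-class view. The sketch below, on made-up labels, shows how a support-weighted average is formed from per-class precision and recall; the helper names are hypothetical. It also shows why weighted recall always equals accuracy, which is why both print 0.65 above. In the notebook itself, `sklearn.metrics.classification_report(y_test, y_pred)` or the already imported `confusion_matrix` would give the per-class figures directly.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall, support)} (hypothetical helper)."""
    stats = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        stats[label] = (precision, recall, actual)
    return stats

def weighted_average(stats, index):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(support for _, _, support in stats.values())
    return sum(s[index] * s[2] for s in stats.values()) / total

# Made-up labels, not taken from the notebook's test split
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "positive", "negative", "positive", "positive", "neutral"]

stats = per_class_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted_precision = weighted_average(stats, 0)
weighted_recall = weighted_average(stats, 1)

print(stats)
print("weighted precision:", weighted_precision)
print("weighted recall:", weighted_recall, "accuracy:", accuracy)
```

In this toy case the model over-predicts "positive", which lowers precision for that class while the other two classes keep perfect precision but lose recall. A similar per-class breakdown on the real test split would show where the 0.65 accuracy comes from.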


In [40]:
# Select the first 50 tweets from the original dataset
# (some of these may have been in the training split, so this is a sanity check,
# not an out-of-sample evaluation)
first_50_tweets = df['tweet'].head(50)

# Preprocess these tweets
processed_tweets = first_50_tweets.apply(preprocess_tweet)

# Vectorize these tweets
vectorized_tweets = vectorizer.transform(processed_tweets)

# Predict sentiment
predicted_sentiment = clf.predict(vectorized_tweets)

# Display the results
for tweet, sentiment in zip(first_50_tweets, predicted_sentiment):
    print(f"Tweet: {tweet}")
    print(f"Predicted Sentiment: {sentiment}\n")


Tweet: @stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.
Predicted Sentiment: positive

Tweet: Reading my kindle2...  Love it... Lee childs is good read.
Predicted Sentiment: positive

Tweet: Ok, first assesment of the #kindle2 ...it fucking rocks!!!
Predicted Sentiment: negative

Tweet: @kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)
Predicted Sentiment: positive

Tweet: @mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)
Predicted Sentiment: positive

Tweet: @richardebaker no. it is too big. I'm quite happy with the Kindle2.
Predicted Sentiment: positive

Tweet: Fuck this economy. I hate aig and their non loan given asses.
Predicted Sentiment: negative

Tweet: Jquery is my new best friend.
Predicted Sentiment: positive

Tweet: Loves twitter
Predicted Sentiment: positive

Tweet: how can you not love Obama? he