<a href="https://colab.research.google.com/github/hasanemre11/python/blob/main/Sentiment_Analysis_on_Twitter_Data_Using_Machine_Learning_and_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Tweets Using NLP Techniques

## Introduction
This project performs sentiment analysis on tweets using Natural Language Processing (NLP) techniques. A 250,000-tweet random sample of the Sentiment140 dataset is used to train and evaluate machine learning models that classify tweets as positive or negative. The workflow covers text preprocessing, feature extraction with Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF), and model training with Logistic Regression, Random Forest, Gradient Boosting (scikit-learn and XGBoost), and Support Vector Machines (SVM).

## Data Exploration and Preprocessing

#### 1. Downloading the Dataset

In [1]:
from google.colab import userdata
import os
import pandas as pd
import numpy as np

os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')

!kaggle datasets download -d kazanova/sentiment140 -p /content

# Unzip the downloaded file
!unzip /content/sentiment140.zip -d /content/

# Read the CSV file into a DataFrame
df = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1', header=None)

# Sampling a subset of the data for faster processing
df = df.sample(n=250000, random_state=42)

# Display the first few rows of the DataFrame
df.head()

# Checking DataFrame information
df.info()


Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
License(s): other
Downloading sentiment140.zip to /content
Archive:  /content/sentiment140.zip
  inflating: /content/training.1600000.processed.noemoticon.csv  
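
The `df.sample(n=250000, random_state=42)` call above draws rows uniformly, so the two classes end up only approximately balanced in the subset. If an exactly balanced subset is wanted, one option is to sample the same number of rows from each label group. The following is a minimal sketch on a small synthetic frame; `toy`, `per_class`, and `balanced` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the raw Sentiment140 frame:
# column 0 holds the label (0 or 4), column 5 holds the tweet text.
toy = pd.DataFrame({
    0: [0] * 50 + [4] * 50,
    5: [f"tweet {i}" for i in range(100)],
})

per_class = 10  # hypothetical; 125_000 per class would give 250,000 rows

# Draw the same number of rows from each label group.
balanced = toy.groupby(0).sample(n=per_class, random_state=42)

print(balanced[0].value_counts().to_dict())
```

Whether exact balance matters here is a judgment call: with 250,000 uniformly sampled rows from a balanced source, the class split is already close to 50/50.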


#### 2. Renaming and Dropping Columns



*   Renaming column names: We rename the columns to provide meaningful labels.



In [4]:
# Renaming columns
df.columns = ['Target', 'ID', 'Date', 'Query', 'Username', 'Text']
df.head()


| | Target | ID | Date | Query | Username | Text |
|---|---|---|---|---|---|---|
| 541200 | 0 | 2200003196 | Tue Jun 16 18:18:12 PDT 2009 | NO_QUERY | LaLaLindsey0609 | @chrishasboobs AHHH I HOPE YOUR OK!!! |
| 750 | 0 | 1467998485 | Mon Apr 06 23:11:14 PDT 2009 | NO_QUERY | sexygrneyes | @misstoriblack cool , i have no tweet apps fo... |
| 766711 | 0 | 2300048954 | Tue Jun 23 13:40:11 PDT 2009 | NO_QUERY | sammydearr | @TiannaChaos i know just family drama. its la... |
| 285055 | 0 | 1993474027 | Mon Jun 01 10:26:07 PDT 2009 | NO_QUERY | Lamb_Leanne | School email won't open and I have geography ... |
| 705995 | 0 | 2256550904 | Sat Jun 20 12:56:51 PDT 2009 | NO_QUERY | yogicerdito | upper airways problem |




*   Dropping unnecessary columns: We drop the columns that are not relevant to the sentiment analysis task.


In [5]:
df.drop(['ID', 'Date', 'Query', 'Username'], axis=1, inplace=True)
df.head()

| | Target | Text |
|---|---|---|
| 541200 | 0 | @chrishasboobs AHHH I HOPE YOUR OK!!! |
| 750 | 0 | @misstoriblack cool , i have no tweet apps fo... |
| 766711 | 0 | @TiannaChaos i know just family drama. its la... |
| 285055 | 0 | School email won't open and I have geography ... |
| 705995 | 0 | upper airways problem |


#### 3. Data Cleaning, Tokenization, and Lemmatization

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Function to clean text with advanced preprocessing
def clean_text_advanced(text_series):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        text = re.sub(r'@\w+', '', text)  # Remove mentions
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)  # Remove URLs
        text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and special characters
        text = text.lower()  # Convert text to lowercase

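        # Expand contractions.
        # NOTE: punctuation (including apostrophes) has already been removed above,
        # so these patterns never match as written (e.g. "won't" becomes "wont").
        # To take effect they would have to run before the punctuation step.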
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"\'s", " is", text)
        text = re.sub(r"\'m", " am", text)
        text = re.sub(r"\'t", " not", text)
        text = re.sub(r"\'", "", text)

        # Tokenization and lemmatization
        tokenizer = TweetTokenizer()
        tokens = tokenizer.tokenize(text)
        tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

        return ' '.join(tokens)

    cleaned_text = text_series.apply(preprocess)
    return cleaned_text

# Applying text cleaning
df['Cleaned_Text'] = clean_text_advanced(df['Text'])
df[['Text', 'Cleaned_Text']].head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


                                                     Text  \
541200             @chrishasboobs AHHH I HOPE YOUR OK!!!    
750     @misstoriblack cool , i have no tweet apps  fo...   
766711  @TiannaChaos i know  just family drama. its la...   
285055  School email won't open  and I have geography ...   
705995                             upper airways problem    

                                             Cleaned_Text  
541200                                       ahhh hope ok  
750                                cool tweet apps razr 2  
766711  know family drama lamehey next time u hang kim...  
285055  school email wont open geography stuff revise ...  
705995                               upper airway problem  
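
In the cleaning cell above, punctuation is stripped before the contraction rules run, so the apostrophes those rules look for are already gone; this is why "won't" appears as "wont" in the output. One possible reordering is sketched below. It is an illustration only, not the notebook's method: `expand_then_strip`, `IRREGULAR`, and `SUFFIX_RULES` are hypothetical names, and the `'s` and `'d` rules are left out on purpose because they are ambiguous (possessive vs. "is", "had" vs. "would").

```python
import re

# Irregular negations need explicit entries; a generic "n't" rule would
# otherwise turn "won't" into "wo not".
IRREGULAR = {"won't": "will not", "can't": "cannot", "shan't": "shall not"}

SUFFIX_RULES = [
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ll\b", " will"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]

def expand_then_strip(text):
    """Lowercase, expand contractions, then remove remaining punctuation."""
    text = text.lower()
    for short, full in IRREGULAR.items():
        text = text.replace(short, full)
    for pattern, repl in SUFFIX_RULES:
        text = re.sub(pattern, repl, text)
    text = re.sub(r"[^\w\s]", "", text)        # punctuation is removed last
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(expand_then_strip("School email won't open, and they're late!"))
# -> school email will not open and they are late
```

Note that NLTK's English stopword list includes "not", so an expanded negation would still be dropped by the later stopword filter unless it is excluded from that list.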


## Training Machine Learning Models Using the Bag-of-Words (BoW) Method

#### 1. Applying the Bag-of-Words (BoW) Method



In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Initialize CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

# Convert text data to Bag-of-Words
X = vectorizer.fit_transform(df['Cleaned_Text'])
y = df['Target']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
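
In the cell above the vectorizer is fitted on all sampled tweets before the split, so the vocabulary is learned from test text as well. A possible alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, which fits the vocabulary on the training portion only. The sketch below uses a hypothetical toy corpus (`texts`, `labels`) and its own variable names so it does not overwrite the notebook's `X_train` and `X_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df['Cleaned_Text'] and df['Target'].
texts = ["love great day", "hate awful rain", "great fun friend",
         "awful sad day", "love fun trip", "hate sad news",
         "great love song", "awful hate traffic"]
labels = [4, 0, 4, 0, 4, 0, 4, 0]

# Split the raw text first so the vocabulary is learned from training data only.
X_train_txt, X_test_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

bow_lr = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(C=0.1, max_iter=1000)),
])
bow_lr.fit(X_train_txt, y_tr)

# Words that appear only in the test split are ignored at transform time.
print(len(bow_lr.named_steps["bow"].vocabulary_), "training-vocabulary terms")
toy_bow_preds = bow_lr.predict(X_test_txt)
print(toy_bow_preds)
```

For a bag-of-words model the effect of fitting on all data is usually small, but the pipeline form also makes cross-validation and later deployment simpler.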


#### 2. Model Training and Evaluation



*   Logistic Regression



In [9]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(C=0.1, penalty='l2', solver='lbfgs', max_iter=1000)
model_lr.fit(X_train, y_train)

# Make predictions
y_pred_lr = model_lr.predict(X_test)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
report_lr = classification_report(y_test, y_pred_lr)

print(f'Accuracy (Logistic Regression): {accuracy_lr}')
print('Classification Report (Logistic Regression):')
print(report_lr)


Accuracy (Logistic Regression): 0.7612666666666666
Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.78      0.72      0.75     37483
           4       0.74      0.80      0.77     37517

    accuracy                           0.76     75000
   macro avg       0.76      0.76      0.76     75000
weighted avg       0.76      0.76      0.76     75000





*   Random Forest Classifier



In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Model Evaluation
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

print(f'Accuracy (Random Forest): {accuracy_rf}')
print('Classification Report (Random Forest):')
print(report_rf)


Accuracy (Random Forest): 0.7432933333333334
Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.73      0.76      0.75     37483
           4       0.75      0.72      0.74     37517

    accuracy                           0.74     75000
   macro avg       0.74      0.74      0.74     75000
weighted avg       0.74      0.74      0.74     75000





*   Gradient Boosting Classifier



In [11]:
from sklearn.ensemble import GradientBoostingClassifier

model_gb = GradientBoostingClassifier()
model_gb.fit(X_train, y_train)

# Make predictions
y_pred_gb = model_gb.predict(X_test)

# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
report_gb = classification_report(y_test, y_pred_gb)

print(f'Accuracy (Gradient Boosting): {accuracy_gb}')
print('Classification Report (Gradient Boosting):')
print(report_gb)


Accuracy (Gradient Boosting): 0.6772133333333333
Classification Report (Gradient Boosting):
              precision    recall  f1-score   support

           0       0.79      0.49      0.60     37483
           4       0.63      0.87      0.73     37517

    accuracy                           0.68     75000
   macro avg       0.71      0.68      0.66     75000
weighted avg       0.71      0.68      0.67     75000



## Training Machine Learning Models Using the TF-IDF Method

#### 1. Applying the TF-IDF Method

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data to TF-IDF features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf_vectorizer.fit_transform(df['Cleaned_Text'])
y = df['Target']

# Remap target classes: 0 to 0 and 4 to 1
y = np.where(y == 4, 1, 0)

# Split data into training and testing sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)
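
The remapping from {0, 4} to {0, 1} above matters mainly for the XGBoost cell: to our understanding, recent versions of XGBoost's scikit-learn wrapper expect class labels numbered consecutively from 0. An alternative to `np.where` is scikit-learn's `LabelEncoder`, which also remembers the mapping so predictions can be converted back. A minimal sketch with hypothetical labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

raw = np.array([0, 4, 4, 0, 4])  # hypothetical Sentiment140-style labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw)             # 0 -> 0, 4 -> 1
restored = encoder.inverse_transform(encoded)    # back to the original codes

print(encoded.tolist())   # [0, 1, 1, 0, 1]
print(restored.tolist())  # [0, 4, 4, 0, 4]
```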


#### 2. Model Training and Evaluation



*  Gradient Boosting (XGBoost) Implementation


In [14]:
import xgboost as xgb
from xgboost import XGBClassifier

# Initialize XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=300, learning_rate=0.2, max_depth=4, colsample_bytree=0.9, subsample=0.9, random_state=42)

# Train the model
xgb_model.fit(X_train_tfidf, y_train)

# Model Evaluation
y_pred_xgb = xgb_model.predict(X_test_tfidf)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
report_xgb = classification_report(y_test, y_pred_xgb)

print(f'Accuracy (XGBoost): {accuracy_xgb}')
print('Classification Report (XGBoost):')
print(report_xgb)

Accuracy (XGBoost): 0.7348266666666666
Classification Report (XGBoost):
              precision    recall  f1-score   support

           0       0.79      0.64      0.71     37483
           1       0.70      0.83      0.76     37517

    accuracy                           0.73     75000
   macro avg       0.74      0.73      0.73     75000
weighted avg       0.74      0.73      0.73     75000





*   Random Forest Classifier Implementation



In [15]:
# Train the Random Forest model
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees (adjust n_estimators)

# Train the model
rf_model.fit(X_train_tfidf, y_train)

# Evaluate the model
y_pred_rf = rf_model.predict(X_test_tfidf)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

print(f'Accuracy (Random Forest): {accuracy_rf}')
print('Classification Report (Random Forest):')
print(report_rf)


Accuracy (Random Forest): 0.74912
Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.75      0.75      0.75     37483
           1       0.75      0.74      0.75     37517

    accuracy                           0.75     75000
   macro avg       0.75      0.75      0.75     75000
weighted avg       0.75      0.75      0.75     75000





*   Support Vector Machine (SVM) Implementation



In [17]:
from sklearn.svm import SVC

# Train the SVM model
svm_model = SVC(kernel='linear', C=1)  # Linear kernel; C=1 is a starting point and can be tuned
svm_model.fit(X_train_tfidf, y_train)

# Evaluate the model
y_pred_svm = svm_model.predict(X_test_tfidf)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)

print(f'Accuracy (SVM): {accuracy_svm}')
print('Classification Report (SVM):')
print(report_svm)


Accuracy (SVM): 0.7595733333333333
Classification Report (SVM):
              precision    recall  f1-score   support

           0       0.78      0.73      0.75     37483
           1       0.75      0.79      0.77     37517

    accuracy                           0.76     75000
   macro avg       0.76      0.76      0.76     75000
weighted avg       0.76      0.76      0.76     75000



## Project Report

### 1. Dataset Overview



*   Source: Sentiment140 dataset
*   Description: The dataset contains 1.6 million tweets with sentiment labels (0 for negative, 4 for positive). This project uses a random sample of 250,000 tweets, split 70/30 into training and test sets.



### 2. Data Preprocessing

#### 2.1 Data Loading and Inspection


*   Downloaded and unzipped the dataset, then sampled 250,000 rows (`random_state=42`) for faster processing.
*   Renamed columns and dropped irrelevant ones (ID, Date, Query, Username).





#### 2.2 Text Cleaning


*   Removed mentions, URLs, and special characters.
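*   Lowercased text and applied contraction-expansion rules. These rules run after punctuation removal, so apostrophes are already stripped and contractions such as "won't" end up as "wont".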
*   Tokenized and lemmatized text, removed stopwords.






### 3. Feature Extraction





#### 3.1 Bag-of-Words (BoW) Method


*   Used CountVectorizer to convert text data into BoW features.
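
To make the BoW representation concrete, the following minimal sketch builds a count matrix by hand with the standard library. It is a simplified illustration, not `CountVectorizer` itself: it splits on whitespace only and does no stop-word filtering, and the three documents are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned documents.
docs = ["cool tweet apps", "upper airway problem", "cool cool problem"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bow_vector(doc):
    """Return the count of each vocabulary term in one document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

matrix = [bow_vector(d) for d in docs]

print(vocab)   # ['airway', 'apps', 'cool', 'problem', 'tweet', 'upper']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 1, 0]
# [1, 0, 0, 1, 0, 1]
# [0, 0, 2, 1, 0, 0]
```

Each row is one document and each column one vocabulary term; word order is discarded, which is what "bag" refers to.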



#### 3.2 TF-IDF Method

*   Applied TfidfVectorizer to convert text data into TF-IDF features.
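
TF-IDF reweights raw counts so that terms appearing in many documents contribute less. The sketch below computes the weights by hand, following what we understand to be scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The documents and all names are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical cleaned documents.
docs = ["cool tweet apps", "upper airway problem", "cool cool problem"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

In this toy corpus "tweet" occurs in one document and "cool" in two, so "tweet" receives the larger IDF weight.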



### 4. Model Training and Evaluation

We evaluated the performance of different machine learning models using both the Bag-of-Words (BoW) and TF-IDF features. The evaluation metrics considered include Accuracy, Precision, Recall, and F1-Score.
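
For reference, the four metrics can be computed from confusion-matrix counts as sketched below. The counts are hypothetical and chosen only for illustration; they are not taken from the notebook's runs. In a `classification_report`, each row treats its own class as the "positive" class when computing precision and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for one class."""
    precision = tp / (tp + fp)   # of everything predicted as the class, how much was right
    recall = tp / (tp + fn)      # of everything truly in the class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
m = binary_metrics(tp=80, fp=20, fn=10, tn=90)
print({name: round(value, 3) for name, value in m.items()})
# {'accuracy': 0.85, 'precision': 0.8, 'recall': 0.889, 'f1': 0.842}
```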

#### 4.1 Bag-of-Words Models













*   Logistic Regression:

  *   Accuracy: 76.13%
  *   Precision: 0.78 (Negative), 0.74 (Positive)
  *   Recall: 0.72 (Negative), 0.80 (Positive)
  *   F1-Score: 0.75 (Negative), 0.77 (Positive)
  *   Analysis: This model had the highest accuracy of the BoW models. Recall is higher for the positive class (0.80) than for the negative class (0.72), so it misses more negative tweets than positive ones, while the F1-scores stay close across the two classes.

*   Random Forest Classifier:

  *   Accuracy: 74.33%
  *   Precision: 0.73 (Negative), 0.75 (Positive)
  *   Recall: 0.76 (Negative), 0.72 (Positive)
  *   F1-Score: 0.75 (Negative), 0.74 (Positive)
  *   Analysis: Performance is balanced across the two classes but about two percentage points below Logistic Regression in accuracy.

*   Gradient Boosting Classifier:

  *   Accuracy: 67.72%
  *   Precision: 0.79 (Negative), 0.63 (Positive)
  *   Recall: 0.49 (Negative), 0.87 (Positive)
  *   F1-Score: 0.60 (Negative), 0.73 (Positive)
  *   Analysis: The model had high recall for the positive class (0.87) but low recall for the negative class (0.49). It labels many negative tweets as positive, which pulls positive precision down to 0.63. The classifier was run with default hyperparameters.

#### 4.2 TF-IDF Models

*   XGBoost Classifier:

  *   Accuracy: 73.48%
  *   Precision: 0.79 (Negative), 0.70 (Positive)
  *   Recall: 0.64 (Negative), 0.83 (Positive)
  *   F1-Score: 0.71 (Negative), 0.76 (Positive)
  *   Analysis: Recall is high for the positive class (0.83) but lower for the negative class (0.64), so the model leans toward predicting positive. It may suit applications where missing a positive tweet is more costly than a false alarm.

*   Random Forest Classifier:

  *   Accuracy: 74.91%
  *   Precision: 0.75 (Negative), 0.75 (Positive)
  *   Recall: 0.75 (Negative), 0.74 (Positive)
  *   F1-Score: 0.75 (Negative), 0.75 (Positive)
  *   Analysis: The TF-IDF-based Random Forest model performed similarly to the BoW version, with slightly better balance between precision and recall.

*   Support Vector Machine (SVM):

  *   Accuracy: 75.96%
  *   Precision: 0.78 (Negative), 0.75 (Positive)
  *   Recall: 0.73 (Negative), 0.79 (Positive)
  *   F1-Score: 0.75 (Negative), 0.77 (Positive)
  *   Analysis: The linear SVM was the best TF-IDF model and nearly matched BoW Logistic Regression (75.96% vs. 76.13%), with a similar balance between precision and recall.

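### 5. Conclusion


The project showed that classical machine learning models can classify tweet sentiment with moderate accuracy on a 250,000-tweet sample of Sentiment140. Logistic Regression on BoW features (76.13%) and a linear SVM on TF-IDF features (75.96%) performed best, while the scikit-learn Gradient Boosting model with default settings was weakest (67.72%). Both top models are linear and show a reasonable balance between precision and recall. Because Logistic Regression was run only on BoW features and the SVM only on TF-IDF features, these results do not by themselves show which feature representation is better.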
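
The accuracies reported in the notebook outputs can be put side by side with a few lines of Python. The values below are copied from those outputs; the dictionary layout and names are only one possible way to organize them.

```python
# Accuracies copied from the notebook outputs above.
results = {
    ("BoW", "Logistic Regression"): 0.7612666666666666,
    ("BoW", "Random Forest"): 0.7432933333333334,
    ("BoW", "Gradient Boosting"): 0.6772133333333333,
    ("TF-IDF", "XGBoost"): 0.7348266666666666,
    ("TF-IDF", "Random Forest"): 0.74912,
    ("TF-IDF", "SVM (linear)"): 0.7595733333333333,
}

# Sort from best to worst accuracy.
ranking = sorted(results.items(), key=lambda kv: kv[1], reverse=True)

for (features, model), acc in ranking:
    print(f"{features:<7} {model:<20} {acc:.2%}")
```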