# Mini Project 4

# Problem Statement
The task is to build a model that determines the tone (negative, neutral, positive, or can’t
tell) of a text. The model is trained on the training data and must then predict the class of
test texts (data that were not used to build the model) with maximum accuracy.

# Data Dictionary

tweet_id: ID of the tweet

tweet: text of the tweet written by the user

sentiment: tone of the tweet, encoded as

Negative = 0,
Neutral = 1,
Positive = 2,
Can’t tell = 3

Perform Sentiment Analysis using knowledge of NLP.
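
The integer codes in the data dictionary are easier to read in reports when mapped back to their names. A minimal sketch follows; `SENTIMENT_LABELS` and `decode_sentiment` are hypothetical helper names that do not appear in the notebook.

```python
# Mapping from the integer codes in the data dictionary to readable labels.
# (Hypothetical helper names, introduced only for illustration.)
SENTIMENT_LABELS = {0: "Negative", 1: "Neutral", 2: "Positive", 3: "Can't tell"}


def decode_sentiment(code: int) -> str:
    """Return the label for a sentiment code, raising on an unknown code."""
    if code not in SENTIMENT_LABELS:
        raise ValueError(f"unknown sentiment code: {code}")
    return SENTIMENT_LABELS[code]


# Decode a few example codes.
decoded = [decode_sentiment(c) for c in (1, 1, 2, 0)]
print(decoded)
```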

In [1]:
# Only the imports that the cells below actually use
import warnings

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [2]:
# load the dataset
df=pd.read_csv('data-mini4.csv')
df

      tweet_id  tweet                                              sentiment
0     1701      #sxswnui #sxsw #apple defining language of tou...  1
1     1851      Learning ab Google doodles! All doodles should...  1
2     2689      one of the most in-your-face ex. of stealing t...  2
3     4525      This iPhone #SXSW app would b pretty awesome i...  0
4     3604      Line outside the Apple store in Austin waiting...  1
...   ...       ...                                                ...
7269  3343      @mention Google plze Tammi. I'm in middle of ...   1
7270  5334      RT @mention ÷¼ Are you all set? ÷_ {link} ÷...     1
7271  5378      RT @mention Aha! Found proof of lactation room...  1
7272  2173      We just launched our iPad app at #SXSW! Get al...  1


In [3]:
df.shape

(7274, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7274 entries, 0 to 7273
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   tweet_id   7274 non-null   int64 
 1   tweet      7273 non-null   object
 2   sentiment  7274 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 170.6+ KB


In [5]:
df['sentiment'].value_counts()

1    4311
2    2382
0     456
3     125
Name: sentiment, dtype: int64
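
The counts above show a strong class imbalance. The short sketch below turns them into shares of the dataset, using only the numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 4311, 2: 2382, 0: 456, 3: 125}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

for label in sorted(shares):
    print(f"class {label}: {counts[label]:>5} tweets, {shares[label]:.1%} of the data")
print("largest class:", majority_label)
```

Because the largest class (Neutral) covers well over half of the tweets, a plain accuracy figure should be read against the score of always predicting that class.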

In [6]:
df.isnull().sum()

tweet_id     0
tweet        1
sentiment    0
dtype: int64

In [7]:
# Remove rows with missing values (NaN) in the 'tweet' column;
# .copy() avoids a SettingWithCopyWarning on later column assignments
df = df.dropna(subset=['tweet']).copy()

In [8]:
# Preprocessing: Convert text to lowercase and remove any leading/trailing spaces
df['tweet'] = df['tweet'].str.lower().str.strip()

In [9]:
# Split the data into training and testing sets
X = df['tweet']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
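
The split above is purely random, so the rare classes may end up unevenly distributed between train and test. One option, not used in this notebook, is the `stratify` parameter of `train_test_split`, which keeps class proportions the same on both sides. The sketch below uses toy labels (an assumption, not the real data) to show the effect; applying it to the real split would change the numbers reported further down.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data with an imbalance loosely similar to the tweet labels (illustrative only).
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 60 + [2] * 30 + [0] * 5 + [3] * 5

# stratify=labels keeps each class at the same proportion in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", sorted(Counter(y_tr).items()))
print("test: ", sorted(Counter(y_te).items()))
```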

In [10]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

In [11]:
# Fit the vectorizer on the training data only, then apply the same
# vocabulary and IDF weights to the test data (NaN rows were already dropped)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
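
To make the fit/transform distinction concrete, the sketch below runs the same two calls on a few made-up sentences (the texts and the `*_demo` names are assumptions for illustration). The vocabulary is learned from the training texts only, so words that appear only in the test texts are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts; the second test text contains only words unseen in training.
train_texts = ["apple ipad launch", "google doodle launch"]
test_texts = ["apple doodle", "android tablet"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train_demo = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_test_demo = vectorizer.transform(test_texts)        # reuses them, learns nothing new

print(sorted(vectorizer.vocabulary_))          # words known to the vectorizer
print(X_train_demo.shape, X_test_demo.shape)   # same number of columns on both sides
print(X_test_demo[1].nnz)                      # non-zero entries for the unseen-words text
```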

In [12]:
# Train a Logistic Regression classifier
classifier = LogisticRegression(max_iter=500)
classifier.fit(X_train_tfidf, y_train)

LogisticRegression(max_iter=500)
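
The classifier above treats every training example equally. `LogisticRegression` also accepts `class_weight="balanced"`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`. This notebook does not use it; the sketch below only shows what those weights would look like, computed from the full-dataset counts printed earlier as a stand-in, since in practice they would be derived from `y_train`.

```python
# Class counts from the full dataset (used here as a stand-in for y_train counts).
counts = {0: 456, 1: 4311, 2: 2382, 3: 125}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
balanced = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, weight in balanced.items():
    print(f"class {label}: weight {weight:.3f}")
```

Rare classes receive weights well above 1 and the dominant class a weight below 1, so errors on rare classes cost more during training.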

In [13]:
# Predict on the test data
y_pred = classifier.predict(X_test_tfidf)

In [14]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.6831615120274914

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.05      0.08        88
           1       0.68      0.90      0.78       849
           2       0.68      0.46      0.55       495
           3       0.00      0.00      0.00        23

    accuracy                           0.68      1455
   macro avg       0.47      0.35      0.35      1455
weighted avg       0.66      0.68      0.65      1455
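
The two average rows in the report differ because they weight the classes differently. The sketch below recomputes them from the per-class rows above and adds the accuracy of always predicting the largest test class. The inputs are the rounded values from the report, so the results agree with it only to about two decimals.

```python
# Per-class rows copied from the classification report above:
# label -> (precision, recall, f1, support)
report = {
    0: (0.50, 0.05, 0.08, 88),
    1: (0.68, 0.90, 0.78, 849),
    2: (0.68, 0.46, 0.55, 495),
    3: (0.00, 0.00, 0.00, 23),
}

total = sum(row[3] for row in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total

# Accuracy of always predicting the largest test class (label 1).
majority_baseline = max(row[3] for row in report.values()) / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
print(f"majority-class baseline accuracy = {majority_baseline:.3f}")
```

The macro average is pulled down by classes 0 and 3, which the model almost never predicts correctly, while the weighted average and the accuracy are dominated by class 1.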



In [15]:
# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6831615120274914
