# **Apple-Twitter-Sentiment-Analysis**

##  Project Introduction: Apple Twitter Sentiment Analysis

This project aims to classify sentiments in tweets mentioning **Apple Inc.** into three categories: **positive**, **neutral**, or **negative**. Sentiment analysis is a key application of natural language processing (NLP) that helps organizations understand public opinion at scale.

We use a labeled dataset of tweets annotated with sentiment categories and confidence scores. By building a machine learning model to classify these tweets automatically, we aim to enable scalable monitoring of public sentiment related to Apple’s brand and products.

This project follows the **CRISP-DM (Cross-Industry Standard Process for Data Mining)** methodology, which provides a structured and iterative framework for carrying out data science projects effectively.

---

##  Business Understanding

Understanding customer and public sentiment is crucial for brand management, competitive analysis, and decision-making. In this project, we want to automate the detection of sentiment in social media posts about **Apple Inc.**

Manually labeling tweets is both time-consuming and prone to inconsistency. Automating this process allows:
- **Real-time insights** into how people perceive Apple products, services, and announcements.
- **Trend monitoring** around events such as product launches or controversies.
- **Actionable intelligence** for marketing, public relations, and strategic planning.

###  Objective

The main objectives of this project are:

- To preprocess the tweet data using **Natural Language Processing (NLP)** techniques.
- To perform **Exploratory Data Analysis (EDA)** on the text to understand word patterns, frequency, and sentiment distribution.
- To build a **machine learning classifier** that accurately predicts the sentiment of tweets.
- To evaluate the performance of different classifiers using appropriate **classification metrics** such as accuracy, precision, recall, and F1-score.
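
To make the metrics in the last objective concrete, here is a minimal hand-rolled sketch of accuracy and per-class precision, recall, and F1. It is an illustration only: the helper `per_class_metrics`, the label names, and the toy predictions are all hypothetical and do not come from the dataset or from any model in this project.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen.

    Hypothetical helper for illustration; not part of the project code.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    result = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        result[label] = (precision, recall, f1)
    return result


# Toy labels, invented for this example
y_true = ["neg", "neu", "neu", "pos", "neu", "neg"]
y_pred = ["neg", "neu", "neg", "pos", "neu", "neu"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Per-class scores matter here because a classifier can reach a high accuracy on imbalanced labels while performing poorly on the rarer classes.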


This system could later be extended for real-time social media tracking or integrated into customer feedback pipelines.

---

## Data Understanding

The dataset used in this project is the **Apple Twitter Sentiment DFE** dataset from data.world. It contains tweets labeled with sentiment classes and additional metadata. Here's a high-level overview of the dataset:

###  Key Columns:
- `text`: The tweet content (our main feature for analysis).
- `sentiment`: The sentiment label assigned to each tweet (our target variable).
- `sentiment:confidence`: A score indicating how confident the annotator or system was in the assigned label.

###  Observations from Initial Exploration:
- The dataset contains **3,886 rows** and several metadata columns (e.g., unit ID, query, annotation state) that are not relevant for sentiment modeling and will be dropped during preprocessing.
- The `sentiment` column contains **non-numeric values** (`not_relevant`) that must be cleaned or filtered out. Merged label codes such as `3\n1` appear only in the `sentiment_gold` column, which is not used for modeling.
- The `text` column contains **raw social media content**, including mentions, hashtags, links, and special characters that will need to be cleaned for modeling.
- The `sentiment:confidence` column varies from tweet to tweet (1,899 tweets have a score of exactly 1.0, the rest are lower) and may be used to **filter out low-confidence annotations** to improve model quality.

This understanding will guide our **data cleaning and feature engineering** efforts in the next stages of the project.
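
As a rough idea of what cleaning the raw tweet text could look like, the sketch below strips links, mentions, retweet markers, and special characters with regular expressions. The function `clean_tweet`, the patterns, and the sample tweet with its URL are assumptions made for illustration; they are not the preprocessing actually used later in the project.

```python
import re

# Hypothetical patterns; tune them to the actual data before relying on them
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and remove links, mentions, and special characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop links
    text = MENTION_RE.sub(" ", text)    # drop @mentions
    text = RETWEET_RE.sub(" ", text)    # drop the standalone "rt" marker
    text = text.replace("#", " ")       # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)  # drop digits and punctuation
    return " ".join(text.split())       # collapse repeated whitespace


# Invented example tweet (the URL is a placeholder)
example = "RT @JPDesloges: Why #AAPL Stock fell http://t.co/abc123 !!!"
cleaned = clean_tweet(example)
print(cleaned)
```

Keeping the hashtag word while dropping the `#` symbol is a design choice: tags such as `#AAPL` often carry the topic of the tweet, so discarding them entirely may lose signal.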



In [1]:
# importing necessary libraries
import re
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

%matplotlib inline

# downloading the NLTK tokenizer and WordNet resources
nltk.download("punkt_tab")
nltk.download("wordnet")

# suppressing warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HomePC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# loading the dataset
df=pd.read_csv("Apple-Twitter-Sentiment-DFE.csv",encoding="ISO-8859-1")
df.head()

| | `_unit_id` | `_golden` | `_unit_state` | `_trusted_judgments` | `_last_judgment_at` | `sentiment` | `sentiment:confidence` | `date` | `id` | `query` | `sentiment_gold` | `text` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 623495513 | True | golden | 10 | | 3 | 0.6264 | Mon Dec 01 19:30:03 +0000 2014 | 5.4e+17 | #AAPL OR @Apple | 3\nnot_relevant | #AAPL:The 10 best Steve Jobs emails ever...htt... |
| 1 | 623495514 | True | golden | 12 | | 3 | 0.8129 | Mon Dec 01 19:43:51 +0000 2014 | 5.4e+17 | #AAPL OR @Apple | 3\n1 | RT @JPDesloges: Why AAPL Stock Had a Mini-Flas... |
| 2 | 623495515 | True | golden | 10 | | 3 | 1.0 | Mon Dec 01 19:50:28 +0000 2014 | 5.4e+17 | #AAPL OR @Apple | 3 | My cat only chews @apple cords. Such an #Apple... |
| 3 | 623495516 | True | golden | 17 | | 3 | 0.5848 | Mon Dec 01 20:26:34 +0000 2014 | 5.4e+17 | #AAPL OR @Apple | 3\n1 | I agree with @jimcramer that the #IndividualIn... |
| 4 | 623495517 | False | finalized | 3 | 12/12/14 12:14 | 3 | 0.6474 | Mon Dec 01 20:29:33 +0000 2014 | 5.4e+17 | #AAPL OR @Apple | | Nobody expects the Spanish Inquisition #AAPL |




We load the Apple Twitter Sentiment dataset from a CSV file using `pandas.read_csv`. Since the file contains special characters, we specify the encoding as `ISO-8859-1`.

The dataset contains various metadata columns related to tweet annotations as well as the actual tweet text and sentiment labels.
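
The effect of the `encoding` argument can be seen on a tiny in-memory file. This is a sketch under an assumption: the text only says the file contains special characters, so the accented character used here is an invented stand-in for whatever bytes the real CSV contains.

```python
import io

import pandas as pd

# Invented one-column CSV, encoded as ISO-8859-1 (Latin-1)
raw = "text\ncaf\u00e9 prices at @Apple\n".encode("ISO-8859-1")

# The same bytes are not valid UTF-8, so a UTF-8 decode fails
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Telling pandas the right encoding reads the file without errors
df_demo = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")

print("decodes as UTF-8:", utf8_ok)
print(df_demo.loc[0, "text"])
```

ISO-8859-1 maps every possible byte to a character, so reading with it never raises a decoding error. The trade-off is that text stored in a different encoding would be read silently but incorrectly.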


In [3]:
# extracting the relevant columns (as a copy, so later edits do not touch df)
df_rel=df[["sentiment","sentiment:confidence","text"]].copy()
df_rel.head()

| | `sentiment` | `sentiment:confidence` | `text` |
|---|---|---|---|
| 0 | 3 | 0.6264 | #AAPL:The 10 best Steve Jobs emails ever...htt... |
| 1 | 3 | 0.8129 | RT @JPDesloges: Why AAPL Stock Had a Mini-Flas... |
| 2 | 3 | 1.0 | My cat only chews @apple cords. Such an #Apple... |
| 3 | 3 | 0.5848 | I agree with @jimcramer that the #IndividualIn... |
| 4 | 3 | 0.6474 | Nobody expects the Spanish Inquisition #AAPL |


From the loaded dataset, we extract only the columns that are directly relevant for sentiment analysis:

- `sentiment`: the label or target variable
- `sentiment:confidence`: confidence score of the assigned label
- `text`: the actual tweet text we want to analyze

This step simplifies our dataset to only what's necessary for modeling and text preprocessing.
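
When a column subset is modified later, taking an explicit copy keeps the original DataFrame untouched and avoids pandas' chained-assignment warnings. The sketch below uses a small invented frame with the same column names; the values are placeholders, not rows from the dataset.

```python
import pandas as pd

# Invented stand-in for the loaded dataset
df_demo = pd.DataFrame({
    "sentiment": ["3", "1"],
    "sentiment:confidence": [0.6264, 1.0],
    "query": ["#AAPL OR @Apple", "#AAPL OR @Apple"],
    "text": ["first tweet", "second tweet"],
})

# Select the modeling columns and take an explicit copy
subset = df_demo[["sentiment", "sentiment:confidence", "text"]].copy()

# Changes to the copy do not propagate back to the original frame
subset["sentiment"] = pd.to_numeric(subset["sentiment"])

print(subset.dtypes)
print(df_demo["sentiment"].tolist())
```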


In [4]:
# renaming the sentiment:confidence column
df_rel=df_rel.rename(columns={"sentiment:confidence":"sentiment confidence"})


Here, we rename the column `sentiment:confidence` to `sentiment confidence`, removing the colon from the name to improve readability in our analysis and model-building code.


In [5]:
# summary of the dataset
df_rel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sentiment             3886 non-null   object 
 1   sentiment confidence  3886 non-null   float64
 2   text                  3886 non-null   object 
dtypes: float64(1), object(2)
memory usage: 91.2+ KB



In this step, we examine a concise summary of the dataset. It contains 3 columns and 3,886 entries, with no missing values. The `sentiment confidence` column is already stored as `float64`, but the `sentiment` column is stored as an object type, suggesting that it needs further inspection and a data type conversion before modeling.


In [6]:
col_check=["sentiment","sentiment confidence"]
for col in col_check:
    print(df_rel[col].value_counts())

sentiment
3               2162
1               1219
5                423
not_relevant      82
Name: count, dtype: int64
sentiment confidence
1.0000    1899
0.6722      46
0.6884      32
0.6825      29
0.6635      27
          ... 
0.8578       1
0.4882       1
0.6490       1
0.6686       1
0.9230       1
Name: count, Length: 654, dtype: int64


We inspect the value counts of two key columns:
- `sentiment`: to see how many tweets fall into each label category
- `sentiment confidence`: to examine the spread of confidence scores assigned during annotation

This step helps us identify possible anomalies like unexpected labels (e.g., `not_relevant`) and decide on thresholding for label quality based on confidence.
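
The label counts printed above can be turned into class shares, and a confidence threshold can be applied with a boolean mask. In this sketch the counts are the ones shown in the output, while the threshold `MIN_CONFIDENCE` and the small demo frame are hypothetical values chosen only to illustrate the filtering step.

```python
import pandas as pd

# Label counts as reported by value_counts() above
counts = pd.Series({"3": 2162, "1": 1219, "5": 423, "not_relevant": 82})
shares = counts / counts.sum()
print(shares.round(3))

# Invented rows to illustrate confidence-based filtering
df_demo = pd.DataFrame({
    "sentiment": ["3", "1", "5", "not_relevant", "3"],
    "sentiment confidence": [1.0, 0.6722, 0.8129, 0.5848, 0.4882],
})

MIN_CONFIDENCE = 0.6  # hypothetical threshold, not taken from the project

confident = df_demo[df_demo["sentiment confidence"] >= MIN_CONFIDENCE]
print(f"kept {len(confident)} of {len(df_demo)} rows")
```

The shares show a clear imbalance: label `3` accounts for more than half of the tweets. Any threshold trades label quality against training-set size, so it is worth checking how many rows of each class survive before fixing a value.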


In [7]:
# converting 'sentiment' and 'sentiment confidence' to numeric values (invalid entries become NaN)
df_rel["sentiment confidence"] = pd.to_numeric(df_rel["sentiment confidence"], errors='coerce')
df_rel["sentiment"] = pd.to_numeric(df_rel["sentiment"], errors='coerce')
df_rel.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sentiment             3804 non-null   float64
 1   sentiment confidence  3886 non-null   float64
 2   text                  3886 non-null   object 
dtypes: float64(2), object(1)
memory usage: 91.2+ KB




This cell converts the `sentiment` column to numeric values using `pd.to_numeric`. The `sentiment confidence` column was already `float64`, so the same call leaves it unchanged.

We use `errors='coerce'` to handle invalid entries (e.g., text labels like `'not_relevant'`) by converting them to `NaN`.

After conversion, we call `.info()` again to verify the changes. The `sentiment` column is now `float64` with 3,804 non-null values, so the coercion introduced 82 missing values, matching the 82 `not_relevant` rows.
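
The behaviour of `errors='coerce'` is easy to check on a small Series. The sketch also shows why the result is a float column rather than an integer one: `NaN` is a floating-point value. The mapping `LABEL_NAMES` is an assumption about what the codes 1, 3, and 5 mean; it is not stated in this document and should be checked against the dataset documentation.

```python
import pandas as pd

raw = pd.Series(["3", "1", "5", "not_relevant"], name="sentiment")

# Invalid entries become NaN, which forces a float64 dtype
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())

# Once the NaN rows are dropped, the codes can be cast to integers
as_int = numeric.dropna().astype(int)

# ASSUMPTION: code-to-name mapping, to be verified before use
LABEL_NAMES = {1: "negative", 3: "neutral", 5: "positive"}
named = as_int.map(LABEL_NAMES)

print(numeric.dtype, n_missing)
print(named.tolist())
```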


In [None]:
# creating a function to clean the dataset
def clean(df):
    print(f"null : {df.isna().sum()}")
    df=df.dropna()
    print(f"is duplicated : {df.duplicated().sum()}")
    df=df.drop_duplicates()
    return df
    

Here we define a reusable function called `clean(df)` to perform essential dataset cleaning:
- Display the number of missing values (`NaN`)
- Drop rows with missing values
- Display the number of duplicate rows
- Remove duplicate rows

The cleaned DataFrame is returned for further use in model development.
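
One possible variation, shown here only as a sketch, returns the counts alongside the cleaned frame instead of printing them, which makes the step easier to test and log. The name `clean_with_report` and the demo data are hypothetical.

```python
import pandas as pd


def clean_with_report(df):
    """Drop rows with missing values, then duplicate rows.

    Returns the cleaned frame and a dict of the counts removed.
    Hypothetical variant of clean(df); the input is not modified.
    """
    rows_with_nulls = int(df.isna().any(axis=1).sum())
    without_nulls = df.dropna()
    # Duplicates are counted after the null rows are gone, as in clean(df)
    duplicate_rows = int(without_nulls.duplicated().sum())
    cleaned = without_nulls.drop_duplicates()
    report = {"rows_with_nulls": rows_with_nulls, "duplicate_rows": duplicate_rows}
    return cleaned, report


# Invented rows: one missing label and one exact duplicate
df_demo = pd.DataFrame({
    "sentiment": [3.0, 3.0, None, 1.0],
    "text": ["same tweet", "same tweet", "unlabeled", "other tweet"],
})

cleaned, report = clean_with_report(df_demo)
print(report)
print(cleaned)
```

Note that `duplicated()` compares all columns, so two rows with the same text but different labels or confidence scores are not treated as duplicates.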


In [None]:
# utilizing the clean function
df_clean=clean(df_rel)
df_clean.info()

null : sentiment               82
sentiment confidence     0
text                     0
dtype: int64
is duplicated : 382
<class 'pandas.core.frame.DataFrame'>
Index: 3422 entries, 0 to 3885
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   sentiment             3422 non-null   float64
 1   sentiment confidence  3422 non-null   float64
 2   text                  3422 non-null   object 
dtypes: float64(2), object(1)
memory usage: 106.9+ KB
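
The counts in this output are consistent with each other: 3,886 loaded rows minus 82 rows with a missing label minus 382 duplicates leaves 3,422 rows. The index still runs from 0 to 3885 because dropping rows does not renumber it. The sketch below checks the arithmetic and shows one way to renumber the index, using an invented frame; whether to reset the index is a choice the project has not stated.

```python
import pandas as pd

# Row accounting, using the numbers printed above
rows_loaded = 3886
rows_missing_label = 82
duplicate_rows = 382
rows_expected = rows_loaded - rows_missing_label - duplicate_rows
print(rows_expected)

# Invented frame with gaps in its index, as left behind by dropping rows
df_demo = pd.DataFrame({"text": ["a", "b", "c", "d"]}, index=[0, 2, 5, 9])

# drop=True discards the old index instead of adding it as a column
df_reset = df_demo.reset_index(drop=True)
print(df_reset.index.tolist())
```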
