**Mini Project**

Airline Tweet Sentiment Classifier using Natural Language Processing

Note:
1. Use the sample dataset from airline_tweets_sample.csv
2. Steps:
   1. Import libraries
   2. Load and explore the dataset
     - download the required NLTK resources, such as stopwords
     - load the dataset
     - check the dataset, at least the first 5 rows
   3. Clean and preprocess the text
     - convert to lowercase
     - remove URLs
     - remove special characters and numbers
     - remove stopwords
     - apply stemming
   4. Convert text to numerical vectors (TF-IDF)
     - convert text to TF-IDF
     - check X, y, and their shapes
   5. Split into train and test sets
     - 80% training, 20% testing
   6. Train and test a logistic regression model
   7. Evaluate accuracy and the classification report
   8. Predict sentiment for new example tweets
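
Steps 4 to 6 in the list above can also be chained into a single scikit-learn pipeline, so the fitted vocabulary and the classifier always travel together. The following is a minimal sketch, not the notebook's own code: the texts and labels are made-up, already-cleaned toy examples, and the names `toy_texts`, `toy_labels`, and `pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy tweets (not rows from airline_tweets_sample.csv)
toy_texts = [
    "flight delay worst experi",
    "love servic crew friendli",
    "lost luggag disappoint",
    "smooth board great job",
    "seat uncomfort staff polit",
    "flight okay noth special",
]
toy_labels = ["negative", "positive", "negative", "positive", "neutral", "neutral"]

# Vectorizer and classifier are fitted together, so new text is always
# transformed with the same vocabulary the model was trained on.
pipeline = make_pipeline(TfidfVectorizer(max_features=1000), LogisticRegression())
pipeline.fit(toy_texts, toy_labels)

toy_predictions = pipeline.predict(["great servic", "delay luggag"])
print(toy_predictions)
```

With this layout, `pipeline.predict` accepts cleaned strings directly, so there is no separate `transform` call to forget.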

   



In [1]:
# Standard library
import re

# Data handling
import numpy as np
import pandas as pd

# Text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Feature extraction, models, and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [2]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [18]:
data = pd.read_csv('airline_tweets_sample.csv')
data.head()

Unnamed: 0,text,sentiment
0,"@United flight was delayed for 3 hours, worst ...",negative
1,"Loved the service on @Delta, crew was super fr...",positive
2,"@AmericanAir lost my luggage again, so disappo...",negative
3,Smooth boarding and on-time arrival. Great job...,positive
4,The seats were uncomfortable but staff was polite,neutral
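
The `Unnamed: 0` column in the output above usually means the CSV was saved with its row index. A small sketch of how that column can be read as the index, and how the label balance can be checked, is shown here. The CSV text is hypothetical (only its layout mirrors the output above), and `toy_csv`, `toy_data`, and `label_counts` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical CSV text in the same layout as the output above
# (an unnamed index column, then text and sentiment); not the real file.
toy_csv = io.StringIO(
    ",text,sentiment\n"
    "0,Flight was late again,negative\n"
    "1,Friendly crew and quick boarding,positive\n"
    "2,The seats were uncomfortable but staff was polite,neutral\n"
    "3,Lost my bag,negative\n"
)

# index_col=0 reads the first column as the row index instead of "Unnamed: 0"
toy_data = pd.read_csv(toy_csv, index_col=0)
print(toy_data.columns.tolist())

# Class balance matters a lot when there are only a few rows per label
label_counts = toy_data["sentiment"].value_counts()
print(label_counts)
```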


In [4]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

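def clean_text(text):
    """Lowercase, strip URLs and non-letters, drop stopwords, and stem."""
    text = text.lower()
    # Remove URLs (http..., https..., www...)
    text = re.sub(r"http\S+|www\S+", "", text)
    # Keep only letters and whitespace (drops digits, punctuation, '@', '#')
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    words = text.split()
    # Drop English stopwords, then reduce each word to its Porter stem
    words = [w for w in words if w not in stop_words]
    words = [stemmer.stem(w) for w in words]
    return " ".join(words)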
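
To see what each regular expression in `clean_text` contributes, the steps can be run one at a time on a single string. This sketch uses only the standard library; the tweet and the tiny `toy_stop_words` set are made up for illustration, and stemming is left out so no NLTK data is needed.

```python
import re

# Hypothetical tweet and a tiny hand-picked stopword set, for illustration only;
# the notebook itself uses NLTK's English stopword list and PorterStemmer.
sample = "@United flight UA123 was delayed 3 hours! See https://example.com #fail"
toy_stop_words = {"was", "see"}

lowered = sample.lower()
no_urls = re.sub(r"http\S+|www\S+", "", lowered)      # URL removed
letters_only = re.sub(r"[^a-zA-Z\s]", "", no_urls)    # '@', '#', '!', digits removed
tokens = [w for w in letters_only.split() if w not in toy_stop_words]

print(lowered)
print(no_urls)
print(letters_only)
print(tokens)
# tokens -> ['united', 'flight', 'ua', 'delayed', 'hours', 'fail']
```

Note that the `@` is stripped but the handle itself survives as an ordinary word, so airline names become features.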

In [5]:
data['cleaned_text'] = data['text'].apply(clean_text)
print(data[['cleaned_text', 'sentiment']].head())

                                      cleaned_text sentiment
0         unit flight delay hour worst experi ever  negative
1            love servic delta crew super friendli  positive
2               americanair lost luggag disappoint  negative
3  smooth board ontim arriv great job southwestair  positive
4                       seat uncomfort staff polit   neutral


In [6]:
vectorizer = TfidfVectorizer(max_features=1000)
matrixs = vectorizer.fit_transform(data['cleaned_text'])
matrixs.shape

(30, 105)
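
The shape `(30, 105)` means 30 tweets by 105 distinct terms, well under the `max_features=1000` cap. As a rough illustration of what each cell holds, the sketch below computes TF-IDF by hand using the smoothed IDF and L2 row normalization that `TfidfVectorizer` applies by default. The documents are hypothetical, and `tfidf_row` is an illustrative helper, not part of scikit-learn.

```python
import math
from collections import Counter

# Hypothetical cleaned documents, for illustration only
docs = ["flight delay flight", "love servic", "flight love"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, as with smooth_idf=True
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalize the row

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print(len(matrix), len(vocab))  # rows = documents, columns = vocabulary terms
```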

In [7]:
X = matrixs
y = data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (24, 105)
X_test shape: (6, 105)
y_train shape: (24,)
y_test shape: (6,)
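
With only 6 test rows, a purely random split can leave a class with a single test example. One option worth trying is the `stratify` argument of `train_test_split`, which keeps class proportions similar in both parts. The sketch below uses hypothetical balanced labels (10 per class); the real class counts of the dataset are not shown in this notebook, so treat the numbers as an assumption.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: 10 per class, 30 in total
toy_y = ["negative"] * 10 + ["neutral"] * 10 + ["positive"] * 10
toy_X = [[i] for i in range(len(toy_y))]

# stratify=toy_y keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)
print(Counter(y_te))
```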


In [8]:

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.16666666666666666
Classification Report:
               precision    recall  f1-score   support

    negative       0.20      1.00      0.33         1
     neutral       0.00      0.00      0.00         1
    positive       0.00      0.00      0.00         4

    accuracy                           0.17         6
   macro avg       0.07      0.33      0.11         6
weighted avg       0.03      0.17      0.06         6




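
The `macro avg` and `weighted avg` rows in the report above follow directly from the per-class rows. The sketch below recomputes them in plain Python from the printed precision, recall, and support values; the helper names `f1`, `macro`, and `weighted` are illustrative. It also shows that the support-weighted recall equals the accuracy, here 1 correct prediction out of 6.

```python
# Per-class values read from the classification report above
rows = {
    "negative": {"precision": 0.20, "recall": 1.00, "support": 1},
    "neutral": {"precision": 0.00, "recall": 0.00, "support": 1},
    "positive": {"precision": 0.00, "recall": 0.00, "support": 4},
}

def f1(p, r):
    # Harmonic mean of precision and recall; 0 when both are 0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

for r in rows.values():
    r["f1"] = f1(r["precision"], r["recall"])

total = sum(r["support"] for r in rows.values())

def macro(metric):
    # Unweighted mean over classes
    return sum(r[metric] for r in rows.values()) / len(rows)

def weighted(metric):
    # Mean over classes, weighted by each class's support
    return sum(r[metric] * r["support"] for r in rows.values()) / total

for metric in ("precision", "recall", "f1"):
    print(metric, round(macro(metric), 2), round(weighted(metric), 2))
```

Because 4 of the 6 test tweets are positive and none of them is recovered, the weighted averages are pulled down further than the macro averages.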

In [17]:
new_tweet = [
    'My family and i had a very pleasent experience with our flight from Incheon to KL.Loved this service',
    'Twice I have had to deal with delayed flights with Malaysia Airlines.',
    'Arrival is on time and smooth ',
    'I was delighted with the flight, made so by an outstanding team of flight attendants led by an able purser.',
    ' I am extremely disappointed regarding damage to my child’s stroller during my recent flight from Kuala Lumpur to Doha',
    'The seats were uncomfortable but staff was polite',
    'Had extra free snacks, greatt	']

new_clean_tweet = [clean_text(tweet) for tweet in new_tweet]
new_tweet_vec = vectorizer.transform(new_clean_tweet)
prediction = model.predict(new_tweet_vec)
print('\nPrediction are: ')
for tweet, pred in zip(new_tweet, prediction):
    print(f"\nTweet: \n{tweet} \nPrediction: {pred}")


Prediction are: 

Tweet: 
My family and i had a very pleasent experience with our flight from Incheon to KL.Loved this service 
Prediction: negative

Tweet: 
Twice I have had to deal with delayed flights with Malaysia Airlines. 
Prediction: negative

Tweet: 
Arrival is on time and smooth  
Prediction: positive

Tweet: 
I was delighted with the flight, made so by an outstanding team of flight attendants led by an able purser. 
Prediction: negative

Tweet: 
 I am extremely disappointed regarding damage to my child’s stroller during my recent flight from Kuala Lumpur to Doha 
Prediction: negative

Tweet: 
The seats were uncomfortable but staff was polite 
Prediction: neutral

Tweet: 
Had extra free snacks, greatt	 
Prediction: positive
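
One likely reason for some of the predictions above is that `vectorizer.transform` silently ignores any token that was not in the training vocabulary, and a 105-term vocabulary covers few of the words in new tweets. The sketch below illustrates this with a hypothetical toy vocabulary; `toy_train` and `toy_vectorizer` are illustrative names, not the notebook's objects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned training texts, for illustration only
toy_train = ["flight delay", "love servic", "smooth board"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)

# "greatt" and "pleasent" are not in the fitted vocabulary, so transform() drops them
new_vecs = toy_vectorizer.transform(["greatt pleasent", "flight greatt"])

# Count how many known terms survive in each new text
known_terms_per_row = (new_vecs.toarray() != 0).sum(axis=1)
print(known_terms_per_row)
```

A row with no known terms is an all-zero vector, so the classifier can only fall back on what it learned about the classes overall.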
