## **Final Project**

### **A. Introduction**

**Team Members**  :
- Livia Amanda Annafiah
- Alfarabi
- Badriah Nursakinah

**Dataset**       : [Airline Reviews](https://www.kaggle.com/datasets/juhibhojani/airline-reviews/data)  

**Hugging Face**  : [Link](https://huggingface.co/spaces/liviamanda/FlightBuddy)


---

**Problem Statement**  

Choosing the right airline can greatly affect a traveler's overall experience, including comfort, service quality, and in-flight amenities. With many online reviews available, airline passengers often **rely on these reviews** to make informed decisions about which airline to choose. However, the large number of reviews can make it difficult and **time-consuming** to read through and understand the general opinion about an airline.

**FlightBuddy** aims to solve this problem by using advanced Natural Language Processing (NLP) techniques to analyze airline reviews quickly and accurately. By processing and understanding a large number of reviews, FlightBuddy can determine whether the opinions in the reviews are positive or negative.

---

**Objective**  

The main goal of **FlightBuddy** is to improve the decision-making process for travelers by providing personalized airline recommendations based on the analysis of review sentiments. Specifically, FlightBuddy aims to:

- Analyze the sentiment of airline reviews to classify them as positive or negative, with accuracy serving as the metric.
- Recommend five airlines with similar positive characteristics for users who have seen favorable reviews.
- Suggest top-rated alternative airlines for users who have encountered negative experiences, ensuring they have better options for future travel.

***This notebook focuses on testing the NLP model using new, unseen data.***

### **B. Libraries**

The following libraries are used for this inference:

In [1]:
# Libraries for data loading and manipulation
import os
import re
import zipfile
import numpy as np
import pandas as pd
from tensorflow.keras.models import load_model

# Libraries for pre-processing
import tensorflow as tf
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')

# Import library to ignore warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning, module='tensorflow')




[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\septi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\septi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### **C. Data Loading**

The model is stored within a compressed zip file, so it must be extracted first before use.

In [2]:
# Define path to zip file
zip_file_path = 'model_logreg.zip'

# Read the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    
    # Extract all contents to a directory named 'unzipped_model'
    zip_ref.extractall('unzipped_model')

# Load the model from the unzipped folder
unzipped_model_path = os.path.join('unzipped_model', 'model_logreg')
model = load_model(unzipped_model_path)





The model is successfully loaded.
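The extraction step can also be written defensively, so that a missing or corrupt archive fails with a clear message before `load_model` is called. The following is only a minimal sketch using the standard library: `extract_archive` and the throwaway `demo_model.zip` are hypothetical names for illustration and are not part of the project.

```python
import os
import tempfile
import zipfile


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted member names."""
    # is_zipfile returns False for a missing file as well as for a non-zip file
    if not zipfile.is_zipfile(zip_path):
        raise FileNotFoundError(f"{zip_path} is missing or is not a zip archive")
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Demonstration on a throwaway archive (a stand-in for the real model zip)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_model.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo_model/saved_model.pb", b"placeholder")

    members = extract_archive(demo_zip, os.path.join(tmp, "unzipped_model"))
    extracted_ok = os.path.isfile(
        os.path.join(tmp, "unzipped_model", "demo_model", "saved_model.pb")
    )
    print(members, extracted_ok)
```

Listing the archive members this way also makes it easy to confirm the folder name that is later passed to `os.path.join`.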

In this notebook, new data will be generated specifically for testing the model. As part of this analysis, two rows are created to represent two reviews.

In [3]:
# Create new data
df_inf = {'Review': ["First, I got delayed and after I waited for almost an hour, the flight got cancelled last minute.",
                     "The flight experience was excellent! The staff were friendly, and everything was smooth."]}

# Convert to dataframe
df_inf = pd.DataFrame(df_inf)

# Show data
df_inf

| | Review |
|---|---|
| 0 | First, I got delayed and after I waited for al... |
| 1 | The flight experience was excellent! The staff... |


### **D. Text Pre-processing**

Before moving on to the prediction phase, it's essential to preprocess the data just as it was done in the main notebook.

In [4]:
# Define stopwords
stpwds_id = list(set(stopwords.words('english')))

# Add custom stopwords
custom_stopwords = ['the', 'to', 'and', 'I', 'was', 'a', 'in', 'of', 'for', 'on', 'flight', 'with', 'that', 'my', 'is', 'not', 'were', 'they',
                    'The', 'at', 'we', 'had', 'from', 'but', 'have', 'it', 'this', 'no', 'as', 'me', 'you', 'our', 'be', 'are', 'an', 'very', 'so',
                    'service', 'their', 'We', 'time', 'airline', 'would', 'or', 'us', 'by', 'only', 'get', 'all', 'which']

stpwds_id.extend(custom_stopwords)

# Define Stemming
stemmer = PorterStemmer()

In [5]:
# Define function for text preprocessing
def text_preprocessing(text):
    # Case folding
    text = text.lower()

    # URL removal (http:// and https:// links)
    text = re.sub(r'https?://(?:www\.[^\s\n\r]+|[^\s\n\r]+)', '', text)

    # Hashtags removal
    text = re.sub(r'#', '', text)

    # Newline removal (\n)
    text = re.sub(r'[\n\r]', '', text)

    # Replaces the numbers with an empty string
    text = re.sub(r'\d+', '', text)

    # Whitespace removal
    text = text.strip()

    # URL removal (catches any remaining http... or www. fragments)
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"www\.\S+", " ", text)

    # Non-letter removal (such as emoticon, symbol (like μ, $, 兀), etc.)
    text = re.sub(r"[^A-Za-z\s']", " ", text)

    # Tokenization
    tokens = word_tokenize(text)

    # Stopwords removal
    tokens = [word for word in tokens if word not in stpwds_id]

    # Stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # Combining Tokens
    text = ' '.join(tokens)

    return text

Once the preprocessing function has been defined, it is applied to the new data.

In [6]:
# Applying Text Preprocessing to the Dataset
df_inf['Processed Review'] = df_inf['Review'].apply(text_preprocessing)

# Show before and after processing
df_inf

| | Review | Processed Review |
|---|---|---|
| 0 | First, I got delayed and after I waited for al... | first got delay wait almost hour got cancel la... |
| 1 | The flight experience was excellent! The staff... | experi excel staff friendli everyth smooth |


The preprocessing has been successfully completed, as seen by the before and after states of the data in the dataframe.
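To make the regex stage of the pipeline easier to inspect in isolation, here is a minimal sketch that uses only the standard library. It is not the notebook's `text_preprocessing` function: the hypothetical `clean_text` helper skips tokenization, stopword removal, and stemming, and it replaces every removed span with a space (an assumption made here so that neighbouring words never merge).

```python
import re


def clean_text(text):
    """Regex-only cleaning: lowercase, drop URLs and digits, keep letters."""
    text = text.lower()                                  # case folding
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\d+", " ", text)                     # digits
    text = re.sub(r"[^a-z\s']", " ", text)               # non-letters
    return " ".join(text.split())                        # collapse whitespace


# Made-up sample review, used only for illustration
sample = "Flight QR123 was delayed 2 hours!! See https://example.com/info\nNever again."
print(clean_text(sample))
```

Running individual steps on a short string like this is a quick way to check that each pattern removes what its comment claims before the function is applied to a whole dataframe column.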

### **E. Model Prediction**

Once the data has been preprocessed, the loaded model can be used for prediction to determine whether the review is positive (recommended) or negative (not recommended).

In [7]:
# Prediction array
model.predict(df_inf['Processed Review'])



array([[0.9634347, 0.0361073],
       [0.2070894, 0.8019827]], dtype=float32)

This array represents the model's predictions for two reviews. Each row corresponds to a review, and the two columns represent the probabilities assigned to each class (0 and 1, in this case, representing `Not Recommended` and `Recommended` respectively).

- **First review**:
  - The model predicts `Not Recommended` with a probability of approximately 0.963.
  - The model predicts `Recommended` with a probability of approximately 0.036.

- **Second review**:
  - The model predicts `Not Recommended` with a probability of approximately 0.207.
  - The model predicts `Recommended` with a probability of approximately 0.802.
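The reading of the array described above can be sketched in plain Python. This is an illustrative helper only: `LABELS` and `describe` are hypothetical names, and the class order (index 0 for `Not Recommended`, index 1 for `Recommended`) is taken from the description in this section.

```python
LABELS = ("Not Recommended", "Recommended")  # class 0, class 1


def describe(probabilities):
    """Return (label, score) for the highest-scoring class of each row."""
    results = []
    for row in probabilities:
        # Index of the largest score in the row
        best = max(range(len(row)), key=lambda i: row[i])
        results.append((LABELS[best], round(float(row[best]), 3)))
    return results


# Values copied from the prediction array above
predictions = [[0.9634347, 0.0361073],
               [0.2070894, 0.8019827]]

for label, score in describe(predictions):
    print(f"{label}: {score}")
```

Pairing each label with its score keeps the model's confidence visible instead of reducing the output to a bare class name.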

In [8]:
# Make predictions using the trained model
y_pred_inf = model.predict(df_inf['Processed Review'])

# Loop through each prediction result
for pred in y_pred_inf:
    
    # Get the index of the highest predicted value (argmax)
    pred_label = np.argmax(pred)
    
    # Check the predicted label and print the corresponding recommendation
    if pred_label == 0:
        print('Not Recommended')
    elif pred_label == 1:
        print('Recommended')

Not Recommended
Recommended


The code above uses the loaded model to predict the recommendation status of the pre-processed reviews. For each review, `np.argmax` selects the class with the higher predicted probability: index 0 maps to `Not Recommended` and index 1 to `Recommended`. When the two probabilities sum to 1, this is equivalent to a decision threshold of 0.5 on the `Recommended` probability.

Therefore, the first review is predicted as `Not Recommended`, while the second review is predicted as `Recommended`.
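The relationship between the argmax rule used in the code and a 0.5 threshold can be checked with a small sketch. The two helper functions are hypothetical and written in plain Python; ties are resolved toward class 0, which mirrors how `np.argmax` returns the first maximal index.

```python
def label_by_argmax(row):
    """Index of the larger of the two class scores (ties go to class 0)."""
    return 0 if row[0] >= row[1] else 1


def label_by_threshold(row, threshold=0.5):
    """Class 1 when its score exceeds the threshold, otherwise class 0."""
    return 1 if row[1] > threshold else 0


# Values copied from the prediction array above
predictions = [[0.9634347, 0.0361073],
               [0.2070894, 0.8019827]]

for row in predictions:
    # Print both decisions next to the row sum for comparison
    print(label_by_argmax(row), label_by_threshold(row), round(sum(row), 4))
```

Both rules agree on the two reviews here. They are only guaranteed to agree when the two scores in a row sum to 1; a row such as `[0.3, 0.45]` would be class 1 under argmax but class 0 under the threshold, so the argmax rule is the one that describes what the notebook code actually does.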

### **F. Conclusion**

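Overall, the model **successfully** classifies both unseen sample reviews: the **negative** review is labeled **not recommended** and the **positive** review is labeled **recommended**. With only two reviews, this result confirms that the inference pipeline works end to end rather than measuring the model's accuracy.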