<a href="https://colab.research.google.com/github/iamanantalok/Airline-Passenger-Referral-Prediction/blob/main/Capstone_Project_Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Airline Passenger Referral Prediction**  



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name**            - Anant Alok


# **Project Summary -**

The "Airline Passenger Referral Prediction" project aims to create a data-driven solution for forecasting passenger referrals in a commercial airline. Referrals are instances where current passengers recommend the airline's services to potential new customers. This project acknowledges the significant influence of word-of-mouth recommendations in the aviation sector and aims to optimize and capitalize on this marketing channel.

Key Project Objectives:

1. Data Collection and Preparation: The project involves gathering comprehensive historical data related to passenger referrals. This data includes various attributes such as passenger demographics, flight details, referral sources, and referral outcomes. Rigorous data cleaning and transformation processes will be carried out to ensure data quality and suitability for analysis.

2. Feature Engineering: The project will focus on creating meaningful features from the collected data. These features will include elements such as referral source types (e.g., social media, in-flight conversations), passenger flight frequency, loyalty status, and geographical variables. These engineered features will serve as the foundational inputs for the predictive model.

3. Advanced Model Development: The core of the project is the development of a robust predictive model using advanced machine learning techniques. This model will predict the likelihood of a passenger making a referral based on various features. Different algorithms, including logistic regression, decision trees, random forests, and potentially more advanced methods like gradient boosting and neural networks, will be explored and fine-tuned for optimal performance.

4. Model Training and Validation: The dataset will be split into distinct training and validation subsets to facilitate model training. Techniques such as cross-validation will be used to ensure the model's reliability and effectiveness. This iterative process will also help optimize the model's hyperparameters.

5. Performance Evaluation: Thorough evaluation of the model's predictive accuracy will be conducted using established metrics such as accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC). These metrics will provide a comprehensive understanding of the model's strengths and weaknesses.

6. Insights and Interpretability: The model's results will be carefully analyzed to extract valuable insights into the main drivers of passenger referrals. This analysis may include an examination of feature importance, shedding light on the key factors influencing successful referrals.

7. Integration Strategy: The project will outline a strategic plan for seamlessly integrating the predictive model into the airline's operational systems. The objective is to enable real-time referral predictions, enhancing the passenger experience during booking and post-flight interactions.

8. Ongoing Monitoring and Maintenance: A comprehensive monitoring system will be established to continuously track the model's performance. Regular updates and maintenance will be carried out to ensure sustained accuracy, accommodating changes in passenger behavior and evolving market dynamics.

9. Ethical Considerations: The project will actively address potential ethical concerns related to passenger privacy, data security, and fair treatment of individuals in predictions. Compliance with legal regulations and industry standards will be of utmost importance.

10. Business Implications: The project will assess the expected impact of the referral prediction model on critical business metrics. This includes forecasting increased customer acquisition rates, optimizing marketing campaigns, and ultimately improving overall customer satisfaction.

In conclusion, the "Airline Passenger Referral Prediction" project aims to utilize data-driven insights to transform the airline's marketing efforts. By accurately predicting passenger referrals and understanding the contributing factors, the project seeks to enable targeted marketing strategies, personalized customer interactions, and improved business outcomes.
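
As a rough illustration of objectives 4 and 5 above, the sketch below runs a hold-out split, 5-fold cross-validation, and several of the listed metrics. The synthetic data from `make_classification`, the choice of logistic regression, and every parameter value are assumptions made for illustration only; they are not the notebook's real features or its final model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the review features (assumption: not the real dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Hold out 20% of the rows for final evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training portion only
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

# Fit on the full training set, then score the held-out rows
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

results = {
    "cv_f1_mean": cv_scores.mean(),
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
print(results)
```

Keeping the test rows out of cross-validation means the final scores are computed on data the model never saw during tuning.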

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project is centered around an extensive dataset containing airline reviews from 2006 to 2019, covering various popular airlines globally. The dataset includes multiple-choice and free-text questions, offering a comprehensive view of passenger opinions. It was collected in Spring 2019 and serves as the project's foundation.

The primary aim is to develop a predictive model capable of identifying passengers likely to recommend the airline to others. Although passenger referrals significantly impact customer acquisition and loyalty, the airline currently lacks the ability to predict potential advocates. This limitation hampers the strategic utilization of this influential marketing channel.

The challenge is to build an accurate predictive model considering diverse passenger attributes and behaviors to forecast referral likelihood. Solving this problem holds the potential to maximize referrals, fostering business growth and customer engagement.

In essence, this project offers the airline a crucial opportunity to leverage historical data, advanced modeling, and predictive analytics to transform its marketing strategy. Accurate prediction of passenger referrals can enhance customer acquisition, foster loyalty, and elevate performance in the competitive aviation industry.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.impute import SimpleImputer  # Missing value imputation
import scipy.stats as stats  # Hypothesis testing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # Categorical encoding

# Libraries for text data preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer

# Import SelectKBest, f_regression for feature selection based on statistical tests
from sklearn.feature_selection import SelectKBest, f_regression

# Import train_test_split for splitting data into training and testing sets
from sklearn.model_selection import train_test_split

# Libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Importing model evaluation metrics
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score

# Libraries for cross-validation and hyperparameter tuning
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Suppress warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Import the necessary library to mount Google Drive
from google.colab import drive

# Mount Google Drive to access files
drive.mount('/content/drive')


In [None]:
# Load Dataset
airline_df = pd.read_excel("/content/drive/MyDrive/Capstone Project-3- Airline-Passenger-Referral-Prediction/data_airline_reviews.xlsx")


### Dataset First View

In [None]:
# Display the first five rows of the DataFrame
airline_df.head()

In [None]:
# Display the last five rows of the DataFrame
airline_df.tail()

### Dataset Rows & Columns count

In [None]:
# Count the rows and columns in the airline dataset
num_rows, num_cols = airline_df.shape

# Print the results
print(f"Airline dataset has {num_rows} rows and {num_cols} columns.")


### Dataset Information

In [None]:
# Display information about the airline dataset
airline_df.info()

#### Duplicate Values

In [None]:
# Count duplicate rows in the DataFrame
duplicate_count = airline_df.duplicated().sum()

# Print the total number of duplicate rows
print("Total Duplicate Rows in the DataFrame:", duplicate_count)

In [None]:
# Drop duplicate rows from the DataFrame in-place
airline_df.drop_duplicates(inplace=True)


#### Missing Values/Null Values

In [None]:
# Function to identify columns with missing values
def show_missing():
    missing = airline_df.columns[airline_df.isnull().any()].tolist()
    return missing

# Missing data counts
print('Missing Data Count')
print(airline_df[show_missing()].isnull().sum().sort_values(ascending=False))

# Separator for clarity
print('--' * 50)

# Missing data percentages
print('Missing Data Percentage')
print(round(airline_df[show_missing()].isnull().sum().sort_values(ascending=False) / len(airline_df) * 100, 5))


In [None]:
# Calculate the percentage of missing values in each column for airline_df
missing_percent = (airline_df.isnull().sum() / len(airline_df)) * 100

# Create a color map
cmap = plt.get_cmap('viridis')

# Normalize missing percentages to [0, 1] for colormap
normalized_missing = missing_percent / 100

# Create a bar plot with colors indicating missing values percentage
plt.figure(figsize=(13, 6))
bars = plt.bar(missing_percent.index, missing_percent, color=cmap(normalized_missing))
plt.title('Percentage of Missing Values by Column')
plt.xlabel('Columns')
plt.ylabel('Percentage')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Add a colorbar to indicate the missing values percentage gradient
sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=0, vmax=100))
sm.set_array([])  # Empty array so the colorbar can be drawn without plotted data
cbar = plt.colorbar(sm, ax=plt.gca(), pad=0.03)  # Attach the colorbar to the current axes
cbar.set_label('Missing Values Percentage')

plt.show()

### What did you know about your dataset?

**Key Observations:**

**Dataset Size:** The dataset comprises a total of 131,895 entries, or rows.

**Data Attributes:** Within the dataset, there are 17 columns, each representing distinct features.

**Non-Null Values:** The "Non-Null Count" reported for each column shows how many entries are actually present. Wherever this count is lower than the total number of rows, the column contains missing data, which must be handled to ensure the accuracy of our analysis and the performance of any modeling.

**Data Types:** The "Dtype" column denotes the data types for each feature. Specifically, seven columns are of data type float64, signifying numerical data. Meanwhile, ten columns are of object data type, encompassing both categorical variables and textual information.

**Insights from Non-Null Counts:**

It's worth noting that all 17 columns contain missing data in the form of NaN values: "airline," "overall," "author," "review_date," "customer_review," "aircraft," "traveller_type," "cabin," "route," "date_flown," "seat_comfort," "cabin_service," "food_bev," "entertainment," "ground_service," "value_for_money," and "recommended."

**Insights from Data Types:**

Among the features, seven are of a numeric data type (float64). These are likely to represent ratings or scores for various aspects of the airline experience.

Conversely, ten columns are of object data type, encompassing categorical variables and textual data. Notable examples include "airline," "author," "review_date," "customer_review," "aircraft," "traveller_type," "cabin," "route," "date_flown," and "recommended."

## ***2. Understanding Your Variables***

In [None]:
# Explore each column in airline_df
for column in airline_df.columns:
    print(f"Column: {column}")
    print("Data Type:", airline_df[column].dtype)
    print("Number of Unique Values:", airline_df[column].nunique())
    print("Value Counts:")
    print(airline_df[column].value_counts())
    print("-" * 30)


In [None]:
# Generate summary statistics for numerical features in the dataset
numerical_summary = airline_df.describe()

# Display the summary statistics
print(numerical_summary)

In [None]:
# Loop through each column in the DataFrame
for column in airline_df.select_dtypes(include=['object']).columns:
    print(f"Column: {column}")
    print("Number of Unique Values:", airline_df[column].nunique())
    print("Value Counts:")
    print(airline_df[column].value_counts())
    print("-" * 30)


### Variables Description

Here's a concise description of the features in the dataset:

1. **Airline:** The name of the airline operating the flight.

2. **Overall:** A numerical rating given by customers, typically ranging from 1 to 10, representing their overall satisfaction with the trip.

3. **Author:** The person who authored the trip review.

4. **Review Date:** The date on which the review was posted.

5. **Customer Review:** A free-text field containing the detailed review provided by customers.

6. **Aircraft:** The type or model of the aircraft used for the flight.

7. **Traveler Type:** This feature indicates the type of traveler, such as business or leisure.

8. **Cabin:** The specific cabin class or section of the airplane in which the passenger traveled.

9. **Route:** The route flown, given as origin and destination with an optional stopover (e.g., "A to B via C").

10. **Date Flown:** The date on which the flight took place.

11. **Seat Comfort:** A numerical rating, usually on a scale of 1 to 5, representing the comfort level of the seats.

12. **Cabin Service:** A numerical rating, typically on a scale of 1 to 5, reflecting the quality of service provided within the cabin.

13. **Food and Beverage:** A numerical rating, typically on a scale of 1 to 5, evaluating the quality of food and beverages provided during the flight.

14. **Entertainment:** A numerical rating, usually on a scale of 1 to 5, assessing the quality and availability of in-flight entertainment.

15. **Ground Service:** A numerical rating, typically on a scale of 1 to 5, evaluating the quality of services provided on the ground, such as check-in and baggage handling.

16. **Value for Money:** A numerical rating, typically on a scale of 1 to 5, indicating whether passengers felt they received value for the price paid.

17. **Recommended:** The target variable, indicating whether the customer would recommend the airline. It is recorded as "yes" or "no".

These features collectively provide valuable insights into customer experiences with different airlines, enabling analysis and assessment of various aspects of the airline industry.
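
Since "recommended" is the target for classification, it usually needs to be numeric before modeling. Below is a minimal sketch of one way to encode it, assuming the column holds the strings "yes" and "no" (as the count plot later in this notebook suggests). The toy frame and the `recommended_flag` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'recommended' column (made-up values)
toy = pd.DataFrame({"recommended": ["yes", "no", "no", "yes", "no"]})

# Map the labels to 1/0; any unexpected label becomes NaN and is easy to spot
label_map = {"yes": 1, "no": 0}
toy["recommended_flag"] = toy["recommended"].str.strip().str.lower().map(label_map)

print(toy["recommended_flag"].tolist())
```

An explicit mapping keeps the meaning of 1 and 0 fixed, whereas an automatic encoder assigns codes by sorted label order.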

### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable (column)
for column in airline_df.columns:
    unique_values = airline_df[column].unique()
    print(f"Column: {column}")
    print("Unique Values:", unique_values)
    print("-" * 30)


## 3. ***Data Wrangling***

#### **Managing Missing Data in the Dataset**

In [None]:
# Define a function to check for missing values in a DataFrame
def missing_values_check(df):
    # Calculate the percentage of missing values for each column
    percent_missing = df.isnull().sum() * 100 / len(df)

    # Create a DataFrame to store column names and their respective missing percentages
    missing_values_df = pd.DataFrame({'column_name': df.columns,
                                      'percent_missing': percent_missing})

    # Sort the DataFrame by the percentage of missing values in descending order
    return missing_values_df.sort_values('percent_missing', ascending=False)

# Call the missing_values_check function with the airline_df DataFrame
missing_values_result = missing_values_check(airline_df)

# Print the result, which shows columns with missing values sorted by their percentages
print("Columns with Missing Values (sorted by percentage of missing values):")
print(missing_values_result)


In [None]:
# Drop the 'aircraft' feature due to a high percentage of missing values
airline_df.drop(columns=['aircraft'], inplace=True)

In [None]:
# Drop rows with null values in the specified categorical features
airline_df.dropna(subset=['airline', 'author', 'review_date', 'customer_review', 'cabin', 'traveller_type', 'date_flown', 'route', 'recommended'], inplace=True)


In [None]:
# List of columns to analyze and impute missing values
columns_to_analyze = ['overall', 'seat_comfort', 'cabin_service', 'food_bev',
                      'entertainment', 'ground_service', 'value_for_money']

# Iterate through each numerical column
for column in columns_to_analyze:
    # Create a distribution plot for the current column
    plt.figure(figsize=(8, 5))
    sns.histplot(data=airline_df, x=column, kde=True)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

    # Calculate statistics for imputation
    median_value = airline_df[column].median()
    mean_value = airline_df[column].mean()
    mode_value = airline_df[column].mode()[0]

    # Print imputation options for the current column
    print(f"Column: {column}")
    print(f"Median Imputation: {median_value:.2f}")
    print(f"Mean Imputation: {mean_value:.2f}")
    print(f"Mode Imputation: {mode_value:.2f}")
    print("=" * 85)


1. We plot the distribution of each rating column with `sns.histplot()`.

2. We compute the median, mean, and mode of each column as candidate imputation values.

3. The distribution plots show the shape of the data: whether it is skewed or symmetric, and whether outliers are present.

4. Comparing the three candidate values against the shape of each distribution helps us choose the most appropriate imputation strategy. The median is a common choice for skewed data because it is less sensitive to extreme values than the mean.
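
A minimal sketch of median imputation on a toy frame, using scikit-learn's `SimpleImputer` (imported at the top of this notebook). The column values are made up for illustration, and this is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy ratings with gaps (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    "seat_comfort": [1.0, 4.0, np.nan, 5.0, 1.0],
    "food_bev": [2.0, np.nan, 4.0, 5.0, np.nan],
})

# Median imputation: each NaN is replaced by the median of its own column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(imputed)
```

Note that a median can fall between two ratings (2.5 for `seat_comfort` here). Casting such a value with `astype(int)` truncates it to 2, so rounding explicitly may be preferable when integer ratings are required.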

In [None]:
# List of columns to impute missing values
columns_to_impute = ['overall', 'seat_comfort', 'cabin_service', 'food_bev',
                     'entertainment', 'ground_service', 'value_for_money']

# Impute missing values with the median
for column in columns_to_impute:
    median_value = airline_df[column].median()  # Calculate median
    # Assign the result back; a chained inplace fillna may not update the DataFrame in newer pandas
    airline_df[column] = airline_df[column].fillna(median_value)

# Convert columns to integer data type
for column in columns_to_impute:
    airline_df[column] = airline_df[column].astype(int)  # Ratings are whole numbers


In [None]:
# Check for missing values in the DataFrame
missing_values = airline_df.isnull().sum()

# Print the count of missing values for each column
print("Missing Values Count:")
print(missing_values)

#### **Improving Date Handling: Datetime Conversion for 'review_date' and 'date_flown'**

In [None]:
def handle_review_date(date_review_values):
    fin_date = []
    for date in date_review_values:
        # Extracting day
        day = date.split()[0]
        if len(day) == 3:
            day = int(day[:1])
        else:
            day = int(day[:2])

        # Extracting month
        month = date.split()[1]
        month_map = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7,
                     'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
        month = month_map[month]

        # Extracting year
        year = date.split()[-1]

        # Constructing the date in 'yyyy-mm-dd' format
        fin_date.append(f'{year}-{month:02d}-{day:02d}')

    # Returning as pandas datetime
    return pd.to_datetime(fin_date)

# Convert the 'review_date' column to pandas datetime using the handle_review_date function
airline_df['review_date'] = handle_review_date(airline_df['review_date'])
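
The same conversion can be sketched in vectorized form. The example below assumes every review date follows the "<day><suffix> <Month> <year>" layout that the function above parses (for example "8th May 2019") and that month names are in English. The sample dates are made up.

```python
import pandas as pd

# Toy review dates in the assumed "<day><suffix> <Month> <year>" layout
raw = pd.Series(["8th May 2019", "21st April 2018", "3rd January 2017"])

# Strip the ordinal suffix after the day number, then parse with an explicit format
cleaned = raw.str.replace(r"^(\d{1,2})(st|nd|rd|th)\b", r"\1", regex=True)
parsed = pd.to_datetime(cleaned, format="%d %B %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

With an explicit `format`, any value that does not match the layout raises an error instead of being parsed silently in an unexpected way.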



In [None]:
def handle_date_flown(date_flown_values):
    fin_date = []
    for date in date_flown_values:
        if pd.isna(date):
            fin_date.append(pd.NaT)  # Handle missing values as NaT (the datetime missing marker)
        else:
            try:
                fin_date.append(pd.to_datetime(date))  # Try to convert to datetime
            except (ValueError, TypeError):  # Fall back only when parsing fails
                year = date.split()[1]
                month = date.split()[0]
                month_map = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7,
                             'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
                fin_date.append(pd.to_datetime(f'{year}-{month_map[month]:02d}-01'))  # Handle other date formats

    return fin_date

# Convert the 'date_flown' column to pandas datetime using the handle_date_flown function
airline_df['date_flown'] = handle_date_flown(airline_df['date_flown'])

# Extract the month component and create a new 'month' column
airline_df['month'] = airline_df['date_flown'].dt.month



#### **Enhancing Data Clarity: Splitting 'Route' into 'Arrival' and 'Departure' Columns**

In [None]:
def handle_route():
    final_route = []
    for route in airline_df.route.values:
        if pd.isna(route):
            final_route.append((np.nan, np.nan))  # Handle missing values as (np.nan, np.nan)
        else:
            to_ind = str(route).find(' to ')
            via_idx = str(route).find(' via ')
            # Skip all four characters of ' to ' so the destination has no leading space
            if via_idx == -1:
                final_route.append((str(route)[:to_ind], str(route)[to_ind + 4:]))
            else:
                final_route.append((str(route)[:to_ind], str(route)[to_ind + 4:via_idx]))
    return final_route

# Update the 'route' column using the handle_route function
airline_df['route'] = handle_route()

# A route reads "<origin> to <destination>", so the first element is the departure city
airline_df['departure_city'] = airline_df['route'].apply(lambda x: x[0])
airline_df['arrival_city'] = airline_df['route'].apply(lambda x: x[1])

# Drop the original 'route' column
airline_df.drop('route', inplace=True, axis=1)

### What all manipulations have you done and insights you found?



**Initial Data Cleanup:**
- To address data quality issues, the 'aircraft' feature, which contained nearly 70% missing values, was removed from the dataset.
- The remaining features were treated in two groups: rows with missing values in the categorical features were dropped, while missing values in the numerical rating features were imputed with the median.
- The 'date_flown' and 'review_date' columns, originally stored as object data types, were converted into Pandas DateTime objects to facilitate more effective exploratory data analysis (EDA).
- For improved data organization and analysis, the 'route' feature was split into two separate features, 'arrival_city' and 'departure_city', and the 'route' feature itself was subsequently dropped.
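
The route split described above can also be sketched with a single regular expression. The example assumes routes follow the "<origin> to <destination> [via <stop>]" layout, and the city names are made up for illustration.

```python
import pandas as pd

# Toy routes in the assumed "<origin> to <destination> [via <stop>]" layout
routes = pd.Series(["London to Paris", "Delhi to New York via Dubai", None])

# Named groups capture origin and destination; the optional "via" part is discarded
pattern = r"^(?P<departure_city>.+?) to (?P<arrival_city>.+?)(?: via .+)?$"
cities = routes.str.extract(pattern)

print(cities)
```

Rows that are missing or do not match the pattern come back as NaN in both columns, which makes malformed routes easy to find.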



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Exploring the Distributions of Numeric Features**

In [None]:
# Create a 2x3 grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))

# Distribution of Overall Ratings
sns.histplot(data=airline_df, x='overall', bins=10, kde=True, ax=axes[0, 0])
axes[0, 0].set_xlabel('Overall Rating')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Overall Ratings')

# Distribution of Seat Comfort Ratings
sns.histplot(data=airline_df, x='seat_comfort', bins=5, kde=True, ax=axes[0, 1])
axes[0, 1].set_xlabel('Seat Comfort Rating')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Seat Comfort Ratings')

# Distribution of Cabin Service Ratings
sns.histplot(data=airline_df, x='cabin_service', bins=5, kde=True, ax=axes[0, 2])
axes[0, 2].set_xlabel('Cabin Service Rating')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Distribution of Cabin Service Ratings')

# Distribution of Food and Beverage Ratings
sns.histplot(data=airline_df, x='food_bev', bins=5, kde=True, ax=axes[1, 0])
axes[1, 0].set_xlabel('Food and Beverage Rating')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Food and Beverage Ratings')

# Distribution of Entertainment Ratings
sns.histplot(data=airline_df, x='entertainment', bins=5, kde=True, ax=axes[1, 1])
axes[1, 1].set_xlabel('Entertainment Rating')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Entertainment Ratings')

# Distribution of Ground Service Ratings
sns.histplot(data=airline_df, x='ground_service', bins=5, kde=True, ax=axes[1, 2])
axes[1, 2].set_xlabel('Ground Service Rating')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_title('Distribution of Ground Service Ratings')

# Customize the layout and spacing
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
# Create a histogram (with KDE) for the distribution of Value for Money Ratings
plt.figure(figsize=(8, 6))
sns.histplot(data=airline_df, x='value_for_money', bins=6, kde=True)
plt.xlabel('Value for Money Rating')
plt.ylabel('Frequency')
plt.title('Distribution of Value for Money Ratings')
plt.show()


##### 1. Why did you pick the specific chart?

The histplot function, when used with kde (Kernel Density Estimation), is particularly well-suited for analyzing continuous numerical data. This function combines the traditional histogram (bar plot) with a smoothed density curve, allowing it to provide an estimate of the underlying data distribution. It serves as a valuable tool for visualizing the characteristics of continuous distributions, aiding in the identification of features like peaks and valleys within the data.
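
To make the two ingredients of a histogram with a KDE overlay concrete, here is a small NumPy-only sketch. The random sample, the fixed bandwidth of 0.4, and the helper name `gaussian_kde_1d` are all assumptions for illustration; seaborn chooses its bandwidth automatically, so its curve will differ in detail.

```python
import numpy as np

# Toy sample of continuous values (randomly generated, not from the dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=1.0, size=200)

# Histogram part: counts of observations falling into equal-width bins
counts, bin_edges = np.histogram(sample, bins=5)


def gaussian_kde_1d(data, grid, bandwidth):
    """KDE part: average of Gaussian bumps centred on each observation."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth


grid = np.linspace(sample.min() - 3, sample.max() + 3, 400)
density = gaussian_kde_1d(sample, grid, bandwidth=0.4)

# A density curve should integrate to roughly 1
area = np.sum(density) * (grid[1] - grid[0])
print(counts.sum(), round(area, 3))
```

A smaller bandwidth produces a bumpier curve and a larger one smooths away peaks, which matters when the underlying values are discrete ratings.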

##### 2. What is/are the insight(s) found from the chart?

1. Ratings of 1 and 2 are the most common values of the overall rating.
2. For Seat Comfort, the most frequent rating is 1, followed by 4.
3. For Cabin Service, the most frequent rating is 5, followed by 1.
4. For Food and Beverage, ratings of 2, 4, and 5 occur with roughly equal frequency.
5. For both Entertainment and Ground Service, the most frequent rating is 3, followed by 1.
6. For Value for Money, the most frequent rating is 1, suggesting that many passengers were not satisfied with what they received for the price paid.
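
Rankings like the ones above can be read directly from value counts rather than estimated from the plots. The sketch below uses made-up ratings in two toy columns to show the idea.

```python
import pandas as pd

# Toy ratings (made-up values standing in for the real columns)
toy = pd.DataFrame({
    "seat_comfort": [1, 1, 4, 4, 1, 3],
    "cabin_service": [5, 5, 1, 5, 1, 2],
})

# For each column, rank ratings by how often they occur and keep the top two
for column in toy.columns:
    top_two = toy[column].value_counts().head(2)
    print(column, top_two.index.tolist())
```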


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, the insights we've gathered point to opportunities for enhancement and the potential for positive business outcomes through the resolution of specific issues that lead to lower ratings. Improving seat comfort, enriching food and beverage choices, and delivering better value for money can positively impact passenger satisfaction and foster loyalty.

Conversely, there's a possibility of negative growth stemming from the frequent occurrence of low ratings in the overall experience, seat comfort, and value for money. These aspects play a critical role in passenger satisfaction and could result in unfavorable reviews, reduced repeat business, and damage to the airline's reputation if left unattended.

#### **Ranking Airlines by Review Count: Top 10 Airlines**

In [None]:
# Get the top 10 airlines by number of reviews (each row is one review)
top_10_airlines = airline_df['airline'].value_counts().head(10)

# Create a bar plot for the top 10 airlines by count
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_airlines.values, y=top_10_airlines.index, palette='viridis')
plt.xlabel('Number of Reviews')
plt.ylabel('Airline')
plt.title('Top 10 Airlines by Number of Reviews')
plt.show()


##### 1. Why did you pick the specific chart?

The selection of a bar plot for this visualization is appropriate as it efficiently illustrates the distribution of reviews among different airlines, facilitating a straightforward comparison of the top 10 airlines. This choice capitalizes on the advantages of bar plots in presenting categorical data and enabling meaningful side-by-side comparisons.

##### 2. What is/are the insight(s) found from the chart?

1. American Airlines has the highest number of reviews in the dataset. A high review count reflects how often passengers wrote about the airline, which may indicate a large passenger base but does not by itself indicate a strong reputation.

2. Spirit Airlines has the second-highest number of reviews. As a low-cost carrier, it likely attracts many budget-conscious travelers, although the chart alone does not show whether these reviews are favorable.

3. United Airlines and British Airways also receive a large number of reviews, which is consistent with their extensive range of destinations and services.

4. China Southern Airlines, Emirates, Delta Air Lines, and Turkish Airlines also appear in the top 10. Their presence may reflect strong demand in specific regions, for example China Southern Airlines for travel to and from China and Emirates for Middle East-bound travelers.

5. Frontier Airlines and Qatar Airways complete the top 10. They represent different business models: Frontier Airlines is a low-cost carrier, whereas Qatar Airways is a full-service carrier, so their service levels are not directly comparable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights can have a positive business impact for airlines by helping them:

1. Improve customer service, addressing specific areas of concern like food quality.
2. Target marketing campaigns effectively based on customer preferences, such as budget-conscious travelers.
3. Develop new products and services tailored to customer needs, like offering more legroom.

These insights also prevent negative growth by addressing issues such as uncomfortable seats, food dissatisfaction, and poor customer service. Addressing these concerns enhances customer satisfaction and loyalty, ultimately benefiting the airline's bottom line.

#### **Understanding Categorical Data Distribution**

In [None]:
# Distribution of Traveller Types
plt.figure(figsize=(8, 6))
sns.countplot(data=airline_df, x='traveller_type', palette='Set2')
plt.xlabel('Traveller Type')
plt.ylabel('Frequency')
plt.title('Distribution of Traveller Types')
plt.xticks(rotation=45)
plt.show()

# Distribution of Cabin Types
plt.figure(figsize=(8, 6))
sns.countplot(data=airline_df, x='cabin', palette='Set3')
plt.xlabel('Cabin Type')
plt.ylabel('Frequency')
plt.title('Distribution of Cabin Types')
plt.xticks(rotation=45)
plt.show()

# Distribution of Recommended
plt.figure(figsize=(8, 6))
sns.countplot(data=airline_df, x='recommended', palette='Pastel1')
plt.xlabel('Recommended')
plt.ylabel('Frequency')
plt.title('Distribution of Recommended')
plt.show()


##### 1. Why did you pick the specific chart?

The countplot is an effective choice for visualizing categorical data distributions. It is frequently employed when the goal is to grasp the frequency distribution of distinct categories or levels within a variable. This type of plot offers valuable insights into the composition and distribution of specific features, making it a valuable tool for univariate analyses of this nature.

##### 2. What is/are the insight(s) found from the chart?

1. "Solo Leisure" is the most frequent traveler type, suggesting that a substantial portion of passengers in the dataset travel independently.
2. "Economy Class" stands out as the predominant cabin type, with a notably higher frequency compared to other cabin types.
3. A greater number of passengers opted for a "no" recommendation than a "yes," implying that a significant portion of passengers did not have a positive enough experience to endorse the airline.
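
Because "recommended" is the prediction target, its class balance is worth quantifying as well as plotting. The sketch below uses made-up labels (not the real class counts) to show absolute counts next to relative shares.

```python
import pandas as pd

# Toy target column (made-up labels, not the real class counts)
recommended = pd.Series(["no", "no", "yes", "no", "yes", "no", "yes", "no"])

# Absolute counts are what a count plot draws as bar heights
counts = recommended.value_counts()

# Relative shares make any class imbalance explicit
shares = recommended.value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(3).to_dict())
```

If one class dominates, accuracy alone can be misleading, which is one reason to report precision, recall, and F1-score alongside it.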

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights can positively impact airlines by tailoring their services to customer needs, enhancing satisfaction, loyalty, and revenue. For instance, focusing on Economy Class improvements, addressing customer concerns, and boosting customer service can drive growth.

However, some insights point to risk. The larger number of 'no' recommendations suggests unmet expectations, which can lower satisfaction and loyalty. The small share of Business Class passengers may reflect price sensitivity among business travelers, limiting premium revenue. Because a countplot is a static snapshot, it cannot show whether either figure is rising or falling.


#### **Number of Travelers per Cabin Class**

In [None]:
# Set the color palette
sns.set_palette('gist_ncar')

# Count the number of passengers in each cabin class and create a pie chart
airline_df['cabin'].value_counts().plot(kind='pie', autopct='%1.0f%%', figsize=(10, 5))

# Add a title
plt.title('Distribution of Passengers by Cabin Class')

# Display the pie chart
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows each cabin class as a share of the whole, which suits a single categorical variable with only four levels. It makes the dominance of economy class immediately visible and puts the much smaller shares of business class, premium economy, and first class in proportion. These shares can guide airlines in pricing and service-related decision-making.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows the distribution of passengers across cabin classes: 79% travel in economy class, 15% in business class, 4% in premium economy, and 3% in first class. The labels are rounded to whole percentages, which is why they add up to 101%.

This pie chart provides insightful demographic information about airline passengers. It underscores that the majority of travelers choose economy class, likely due to its affordability. In contrast, the higher costs associated with first class and business class restrict their appeal to a smaller segment of passengers.
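
The four shares quoted above add up to 101% rather than 100%. This is expected when each slice label is rounded to a whole percent on its own, as `autopct='%1.0f%%'` does. The sketch below reproduces the effect with hypothetical counts; `cabin_counts` is an assumption chosen for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical counts per cabin class, chosen only to reproduce the rounding effect
cabin_counts = pd.Series({
    'Economy Class': 786,
    'Business Class': 146,
    'Premium Economy': 36,
    'First Class': 32,
})

# Exact percentage share of each class
exact_pct = cabin_counts / cabin_counts.sum() * 100

# Whole-percent labels, rounded slice by slice as the pie chart labels are
rounded_pct = exact_pct.round(0).astype(int)

print(rounded_pct)
print('Sum of exact shares:  ', round(exact_pct.sum(), 6))
print('Sum of rounded labels:', rounded_pct.sum())
```

Using one decimal place in the label format (for example `'%1.1f%%'`) reduces the visible discrepancy.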

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights can positively impact airlines in several ways:

1. **Economy Class Focus:** Given the majority of passengers travel in economy class, airlines can enhance marketing and services for this segment, including discounts, comfort improvements, and better in-flight entertainment.

2. **Expand Premium Economy:** If demand supports it, airlines can consider expanding premium economy offerings, such as more seats and enhanced amenities. The chart itself shows only the current share (4%), not a trend.

3. **Target Business Travelers:** Airlines can cater to the valuable business traveler segment by offering discounts, comfortable seats, and business-oriented amenities like Wi-Fi and workspaces.

However, if not managed carefully, these insights could also lead to negative growth. Overemphasizing economy class may alienate high-paying customers, and excessive price increases could deter budget-conscious travelers.

#### **Passenger Travel Patterns by Month**

In [None]:
# Create a bar plot for the distribution of travel months
plt.figure(figsize=(10, 6))
sns.countplot(data=airline_df, x='month', palette='PuBuGn')
plt.xlabel('Month')
plt.ylabel('Number of Passengers')
plt.title('Distribution of Passengers Travel by Month')
plt.xticks(ticks=range(0, 12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()


##### 1. Why did you pick the specific chart?

The 'month' feature falls under the category of categorical data, given that it comprises distinct and discrete categories representing months of the year. Bar plots serve as a powerful tool for visualizing the distribution of such categorical data. In this context, our goal is to ascertain the frequency of passenger travel for each month. Bar plots excel in presenting the count of occurrences for each category (i.e., month) in a straightforward and easily interpretable manner.

##### 2. What is/are the insight(s) found from the chart?



1. July is the most popular travel month, with 3,410 passengers. This is likely because July is a summer month and a prime season for vacations.
2. August follows closely as the second most popular month, with 3,321 passengers, also benefiting from the summer season.
3. June, December, and September are also prominent travel months, with 3,237, 3,202, and 3,014 passengers, respectively. June and September fall in summer and early fall, while December's volume likely reflects holiday travel rather than favorable weather.
4. Conversely, January, February, and March see the lowest travel numbers, with 2,965, 2,403, and 2,494 passengers, respectively. Cold winter weather in the Northern Hemisphere may discourage travel in these months, although January's count is close to September's.
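
The monthly totals quoted above can be tabulated directly instead of being read from the bars. This is a minimal sketch under the assumption that the month is derived from a date column such as `date_flown`; the frame `sample_df` and its dates are hypothetical.

```python
import calendar
import pandas as pd

# Hypothetical travel dates standing in for the dataset's date column
dates = pd.to_datetime(['2018-07-04', '2018-07-19', '2018-08-02', '2018-02-11', '2018-12-24'])
sample_df = pd.DataFrame({'date_flown': dates})

# Month number (1-12) for each row
sample_df['month'] = sample_df['date_flown'].dt.month

# Count rows per month in calendar order; months without rows appear as 0
month_counts = (
    sample_df['month']
    .value_counts()
    .reindex(range(1, 13), fill_value=0)
)
month_counts.index = [calendar.month_abbr[m] for m in month_counts.index]
print(month_counts)
```

Reindexing over all twelve months keeps the table aligned with the tick labels even if a month has no rows.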


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Gained insights from the chart can positively impact airlines and travel agencies in several ways:

1. **Strategic Planning:** Airlines can tailor marketing and pricing strategies to capitalize on peak travel months, optimizing revenue.
2. **Destination Planning:** Travel agencies can design itineraries around popular destinations during peak months, driving sales.

3. **Promotions:** Special offers and discounts can be crafted to boost travel demand during less popular months.

However, these insights can also have negative implications:

1. **Overbooking Risk:** Airlines might overbook flights during peak months, risking customer dissatisfaction.
2. **Service Reliability:** Travel agencies may face challenges if they oversell travel packages, leading to cancellations and refunds.
3. **Adaptation Delays:** Slow response to changing travel demand can result in missed revenue opportunities.


#### **Passenger Counts by Review Date and Date Flown**

In [None]:
# Create a figure with two subplots
plt.figure(figsize=(15, 5))

# Subplot 1: Distribution of passengers by review year
plt.subplot(1, 2, 1)
airline_df.groupby(airline_df.review_date.dt.year)['review_date'].count().plot(
    ylabel='Number of Passengers',
    xticks=range(2015, 2020)  # range() excludes its end value; 2020 keeps the 2019 tick
)
plt.title('Distribution of Passengers by Review Year')

# Subplot 2: Distribution of passengers by travel year
plt.subplot(1, 2, 2)
airline_df.groupby(airline_df.date_flown.dt.year)['date_flown'].count().plot(
    ylabel='Number of Passengers',
    xticks=range(2013, 2020)  # range() excludes its end value; 2020 keeps the 2019 tick
)
plt.title('Distribution of Passengers by Travel Year')

# Adjust layout and display the subplots
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The line plot layout enables a simultaneous comparison of passenger counts across multiple years for both review dates and date flown. This facilitates the identification of patterns, distinctions, and trends between these two perspectives, offering a comprehensive view of passenger distribution over time.

##### 2. What is/are the insight(s) found from the chart?

1. Both curves count rows of the review dataset, grouped once by review year and once by travel year. Review volume rises over time, which may reflect the growing use of online review platforms as well as growth in air travel.

2. The review-year counts exceed the travel-year counts. Every row is a review, so this gap most likely comes from reviews with a missing `date_flown`, not from passengers submitting more reviews than flights.

3. Both counts peak in 2018.

4. Both counts fall in 2019. The chart alone does not explain why; incomplete data for the final year is a plausible cause and should be checked before drawing conclusions about the market.
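
Both panels are built from the same rows, and `count()` skips missing values, so a gap between the two totals points to missing `date_flown` entries. The sketch below illustrates this with a hypothetical four-row frame; the dates are assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical rows: every review has a review_date, but date_flown is sometimes missing
sample_df = pd.DataFrame({
    'review_date': pd.to_datetime(['2017-03-01', '2018-05-10', '2018-06-02', '2019-01-15']),
    'date_flown': pd.to_datetime(['2017-02-20', None, '2018-05-30', None]),
})

# count() ignores missing values, so the totals differ although the rows are the same
reviews_per_year = sample_df.groupby(sample_df['review_date'].dt.year)['review_date'].count()
flown_per_year = sample_df.groupby(sample_df['date_flown'].dt.year)['date_flown'].count()

print('Rows counted by review year:', reviews_per_year.sum())
print('Rows counted by travel year:', flown_per_year.sum())
print('Rows without date_flown:    ', sample_df['date_flown'].isna().sum())
```

Checking `isna().sum()` on the real frame would confirm how much of the gap is explained by missing travel dates.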

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

1. **Improved Customer Service:** The airline can enhance customer service by addressing complaints promptly, efficiently resolving issues, and providing superior support.

2. **Attracting More Passengers:** Strategies like competitive pricing, improved in-flight entertainment, and new routes can help the airline draw more passengers.

3. **Industry Trends:** The insights can assist the airline in staying ahead of industry trends, enabling informed decisions and maintaining a competitive edge.

**Negative Impact:**

1. **Market Share Loss:** Failure to address factors contributing to declining passengers and reviews could result in continued market share erosion.

2. **Reputation Damage:** Ill-received business changes may lead to customer loss and harm the airline's reputation.

#### **Exploring the Relationship Between Overall and Sub-Ratings**

In [None]:
# Set the color palette
sns.set_palette('crest')

# List of different kinds of ratings columns (excluding 'overall')
review_columns = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a grid of subplots
fig, axes = plt.subplots(nrows=len(review_columns), ncols=1, figsize=(10, 6 * len(review_columns)))

# Loop through each review column and create a bar plot
for i, col in enumerate(review_columns):
    ax = axes[i]
    x = airline_df.groupby('overall')[col].value_counts().unstack()
    x.plot(kind='bar', ax=ax)
    ax.set_title(f'Relationship between {col.replace("_", " ").title()} and Overall Ratings')
    ax.set_xlabel('Overall Rating')
    ax.set_ylabel('Count')
    ax.legend(title=col.replace("_", " ").title(), loc='upper left')

# Adjust layout and show the plots
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are well-suited for comparing categorical data, where each category corresponds to a bar. They facilitate straightforward comparisons of category distributions across various groups, such as different 'overall' ratings.

##### 2. What is/are the insight(s) found from the chart?

Ratings for all six sub-categories (seat_comfort, cabin_service, food_bev, entertainment, ground_service, and value_for_money) shift toward higher values as the overall rating increases. Passengers who rate the airline highly overall also tend to be more satisfied with each individual aspect. The pattern looks less pronounced for some categories, such as entertainment, but grouped bar charts do not quantify the strength of a relationship. The correlation analysis later in this section reports value_for_money and ground_service as having the strongest association with the overall rating.
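
A compact way to check the upward pattern is to average each sub-rating at every overall rating. This is a minimal sketch with a hypothetical `sample_df`; the rating scales used here (overall up to 10, sub-ratings up to 5) are an assumption for illustration.

```python
import pandas as pd

# Hypothetical ratings standing in for airline_df (scales are assumed)
sample_df = pd.DataFrame({
    'overall':         [1, 1, 5, 5, 10, 10],
    'seat_comfort':    [1, 2, 3, 3, 5, 4],
    'value_for_money': [1, 1, 3, 4, 5, 5],
})

# Mean sub-rating at each overall rating: one summary value per group of bars
mean_by_overall = sample_df.groupby('overall')[['seat_comfort', 'value_for_money']].mean()
print(mean_by_overall)
```

If a column of `mean_by_overall` rises from top to bottom, that sub-rating increases with the overall rating.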

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

The insights gained can positively impact airlines in several ways:

1. **Enhanced Seat Comfort:** Investment in more spacious and comfortable seats.
2. **Improved Cabin Service:** Recruitment of experienced and friendly flight attendants.
3. **Enhanced Food and Beverage Offerings:** Partnerships with renowned chefs and diversified menus.
4. **Upgraded Entertainment:** Installation of larger screens and expanded channel offerings.
5. **Efficient Ground Service:** Faster check-in and baggage handling.
6. **Better Value for Money:** Competitive pricing and more affordable fares.

**Slower Growth:**

While no insights directly lead to negative growth, some could result in a slower growth rate if unaddressed. For instance, neglecting seat comfort, cabin service, or dining options may prompt passengers to opt for competitors offering superior services, potentially hindering growth.

#### **Understanding the Interplay of Overall Ratings, Traveler Types, and Cabin Choices**

In [None]:
# Set the color palette
sns.set_palette('crest')

# Create a grid of subplots
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 18))

# List of categorical features to compare with 'overall'
categorical_features = ['traveller_type', 'cabin']

# Loop through each categorical feature and create a grouped bar plot
for i, feature in enumerate(categorical_features):
    ax = axes[i]
    sns.countplot(data=airline_df, x=feature, hue='overall', ax=ax)
    ax.set_title(f'Comparison of Overall Ratings by {feature.replace("_", " ").title()}')
    ax.set_xlabel(feature.replace("_", " ").title())
    ax.set_ylabel('Count')
    ax.legend(title='Overall Rating')

# Adjust layout and show the plots
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The grouped bar plot is an effective choice for visually representing and comparing the distribution of 'overall' ratings across various categories within categorical features. It offers a clear and informative method for examining the relationship between these variables and uncovering possible patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

1. Solo leisure travelers have the highest average overall rating at 5.08, ahead of business travelers (4.44), couple leisure travelers (4.36), and family leisure travelers (4.32). This implies that solo leisure travelers tend to be more content with their overall airline experience than other traveler types.

2. Business class has the highest average overall rating at 6.41, narrowly ahead of first class (6.09) and well ahead of premium economy (5.17) and economy class (4.22). Passengers in the premium cabins report notably higher satisfaction with their overall experience than economy passengers.
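
The count plot shows distributions, not averages, so group means like those quoted above have to be computed separately. One way to do so is sketched below on a hypothetical `sample_df`; the numbers it prints are illustrative and do not reproduce the figures in the text.

```python
import pandas as pd

# Hypothetical stand-in for airline_df; values are illustrative only
sample_df = pd.DataFrame({
    'traveller_type': ['Solo Leisure', 'Solo Leisure', 'Business', 'Business', 'Family Leisure'],
    'cabin': ['Economy Class', 'Business Class', 'Business Class', 'Economy Class', 'Economy Class'],
    'overall': [6, 8, 7, 3, 4],
})

# Mean overall rating and group size per category, highest mean first
summaries = {}
for feature in ['traveller_type', 'cabin']:
    summaries[feature] = (
        sample_df.groupby(feature)['overall']
        .agg(['mean', 'count'])
        .sort_values('mean', ascending=False)
    )
    print(summaries[feature], end='\n\n')
```

Keeping the group size next to the mean helps judge how much weight a small group's average deserves.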

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Traveller Type:**

- **Solo Leisure Travelers:** Airlines can enhance service for solo leisure travelers with personalized amenities like priority check-in, free Wi-Fi, and power outlets.

- **Business Travelers:** Airlines can cater to business travelers with flexible ticket policies, premium amenities like lie-flat seats, and access to airport lounges.

**Cabin Type:**

- **Economy Class:** Improving space and comfort for economy class passengers by offering more legroom and wider aisles.

- **Economy Class Amenities:** Enhancing the in-flight experience for economy class passengers with complimentary food and drinks, Wi-Fi, and entertainment options.

These suggestions provide specific actions airlines can take to address passenger preferences and improve their overall experience.

#### **Recommended Feature Dynamics: Airlines, Traveler Types, and Cabins**


In [None]:
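# Group by 'airline' and 'recommended', count occurrences, and create a bar plot
# Note: unstack() by default returns rows in index order, so [:20] below selects the
# first 20 airlines alphabetically, not the 20 with the most reviews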
airline_count = airline_df.groupby(['airline', 'recommended']).agg({'recommended': 'count'}).rename(columns={'recommended': 'count'}).sort_values(by='count', ascending=False).unstack()
airline_count[:20].plot(kind='bar', figsize=(10, 5))
plt.legend(['No', 'Yes'])
plt.xlabel('Airline')
plt.ylabel('Count')
plt.title('Recommended vs. Airline')
plt.show()

# Group by 'traveller_type' and 'recommended', count occurrences, and create a bar plot
traveller_count = airline_df.groupby(['traveller_type', 'recommended']).agg({'recommended': 'count'}).rename(columns={'recommended': 'count'}).sort_values(by='count', ascending=False).unstack()
traveller_count[:4].plot(kind='bar', figsize=(10, 5))
plt.legend(['No', 'Yes'])
plt.xlabel('Traveller Type')
plt.ylabel('Count')
plt.title('Recommended vs. Traveller Type')
plt.show()

# Group by 'cabin' and 'recommended', count occurrences, and create a bar plot
cabin_count = airline_df.groupby(['cabin', 'recommended']).agg({'recommended': 'count'}).rename(columns={'recommended': 'count'}).sort_values(by='count', ascending=False).unstack()
cabin_count[:4].plot(kind='bar', figsize=(10, 5))
plt.legend(['No', 'Yes'])
plt.xlabel('Cabin')
plt.ylabel('Count')
plt.title('Recommended vs. Cabin')
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots offer a straightforward and interpretable way to visualize data. Here, each pair of bars shows the number of 'yes' and 'no' recommendations for one airline, traveller type, or cabin. The height of each bar corresponds to the number of passengers who gave that answer, making it easy to see which groups received more recommendations than non-recommendations.

##### 2. What is/are the insight(s) found from the chart?


**Recommended vs. Airline:**

- Airlines like ANA All Nippon Airways, Aegean Airlines, and Air New Zealand receive predominantly 'yes' (recommended) responses, indicating positive passenger sentiment.
- Airlines such as Adria Airways, Air Arabia, and Alaska Airlines have a higher count of 'no' (not recommended), suggesting areas for improvement in passenger recommendations.

**Recommended vs. Traveler Type:**

- Business travelers ('Business') tend to provide more 'yes' recommendations, reflecting a positive sentiment.
- Couples ('Couple Leisure') and solo travelers ('Solo Leisure') exhibit a relatively balanced distribution of 'yes' and 'no' recommendations.

**Recommended vs. Cabin:**

- 'Economy Class' passengers show a mixed distribution of 'yes' and 'no' recommendations, reflecting diverse experiences and preferences.
- 'Business Class' passengers give a higher count of 'yes' recommendations, indicating positive reception.
- 'First Class' and 'Premium Economy' passengers show a roughly balanced distribution of 'yes' and 'no' recommendations.
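
Raw counts are dominated by group size, so comparing groups is easier with the share of 'yes' answers within each group. The sketch below uses a row-normalised cross-tabulation on a hypothetical `sample_df`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for airline_df; values are illustrative only
sample_df = pd.DataFrame({
    'cabin': ['Economy Class'] * 4 + ['Business Class'] * 4,
    'recommended': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'no'],
})

# Row-normalised cross-tabulation: share of 'no' and 'yes' within each cabin
rec_share = pd.crosstab(sample_df['cabin'], sample_df['recommended'], normalize='index')
print(rec_share)
```

The same call works for `airline` or `traveller_type` by swapping the first argument.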

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



**Positive Business Impact:**

- **Enhanced Passenger Experience:** Airlines with more 'yes' recommendations can expect increased customer loyalty, positive word-of-mouth, and a strong brand reputation, ultimately attracting new customers and fostering repeat business.

- **Focus on Business Travelers:** Business travelers tend to provide more positive recommendations, offering airlines an opportunity to tailor services and amenities to this segment, potentially boosting bookings and loyalty.

- **Premium Cabin Services:** 'Business Class' passengers give more 'yes' recommendations, so airlines can build on their premium cabin services to attract luxury-seeking travelers and justify higher fares.

**Negative Growth Insights:**

- **Areas of Improvement:** Airlines with a higher count of 'no' recommendations may face negative growth unless they address the underlying issues causing dissatisfaction, which can result in reduced customer satisfaction, negative reviews, and decreased repeat business.

- **Economy Class Challenges:** The mixed distribution of 'yes' and 'no' recommendations among 'Economy Class' passengers highlights the need to address comfort, service quality, and amenities in this class to prevent decreased customer loyalty and retention.

#### **Passenger Ratings Across Traveler Types and Cabins**

In [None]:
# Set the color palette
sns.set_palette('Set2')

# List of different kinds of ratings columns
rating_columns = ['seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a grid of subplots
fig, axes = plt.subplots(nrows=len(rating_columns), ncols=2, figsize=(15, 6 * len(rating_columns)))

# Loop through each rating column and create bar plots for 'traveller_type' and 'cabin'
for i, col in enumerate(rating_columns):
    ax1 = axes[i, 0]
    ax2 = axes[i, 1]

    # Bar plot (mean rating per group) for 'traveller_type', split by recommendation
    sns.barplot(data=airline_df, x='traveller_type', y=col, hue='recommended', ax=ax1)
    ax1.set_title(f'{col.replace("_", " ").title()} by Traveller Type')
    ax1.set_xlabel('Traveller Type')
    ax1.set_ylabel(col.replace("_", " ").title())
    ax1.tick_params(axis='x', rotation=45)

    # Box plot (spread of ratings per group) for 'cabin', split by recommendation
    sns.boxplot(data=airline_df, x='cabin', y=col, hue='recommended', palette=['blue', 'yellow'], ax=ax2)
    ax2.set_title(f'{col.replace("_", " ").title()} by Cabin')
    ax2.set_xlabel('Cabin')
    ax2.set_ylabel(col.replace("_", " ").title())
    ax2.tick_params(axis='x', rotation=45)

# Adjust layout and show the plots
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Pairing a bar plot with a box plot for each rating shows both the average rating per traveller type and the spread of ratings per cabin, each split by recommendation. This makes it clear how different passenger segments rate each aspect of the air travel experience and helps airlines make well-informed decisions about where to improve their services.

##### 2. What is/are the insight(s) found from the chart?



**Seat Comfort:**

- Business travelers tend to rate seat comfort higher compared to other traveler types.
- First Class passengers generally rate seat comfort higher compared to other cabin types.

**Cabin Service:**

- Business travelers rate cabin service higher on average.
- Passengers in Business Class cabins tend to rate cabin service higher compared to other cabin types.

**Food and Beverage:**

- Solo travelers tend to rate food and beverage quality higher on average.
- Passengers in Business Class cabins rate food and beverage quality higher compared to other cabin types.

**Entertainment:**

- Business travelers tend to rate entertainment higher on average.
- Passengers in Business Class and First Class cabins generally rate entertainment higher compared to other cabin types.

**Ground Service:**

- Business travelers rate ground service higher on average.
- Passengers in Business Class cabins tend to rate ground service higher compared to other cabin types.

**Value for Money:**

- Solo travelers tend to rate value for money higher on average.
- Passengers in Business Class cabins rate value for money higher compared to other cabin types.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Understanding passenger preferences helps airlines enhance the aspects passengers value most, leading to increased customer satisfaction, loyalty, and repeat business. For example:

1. Improving seat comfort in economy class with wider seats and more legroom.
2. Enhancing cabin service with personalized attention from well-trained flight attendants.
3. Elevating food and beverage quality with chef-prepared meals on chinaware.
4. Expanding entertainment options with more channels and movies.
5. Enhancing ground service with dedicated check-in counters and lounges.

**Negative Growth Insights:**

Some insights could lead to negative growth if not addressed. For instance, the higher value-for-money rating in business class compared to economy class suggests potential overpricing in economy. Airlines risk losing customers to budget carriers offering lower fares if this issue isn't addressed.

#### **Impact of Flight Month and Years on Passenger Ratings**

In [None]:
# Extract year and month from review_date (used here as a proxy for the flight date; date_flown is not used)
airline_df['flight_year'] = airline_df['review_date'].dt.year
airline_df['flight_month'] = airline_df['review_date'].dt.month

# List of different kinds of ratings columns
rating_columns = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a grid of subplots
fig, axes = plt.subplots(nrows=len(rating_columns), ncols=2, figsize=(15, 6 * len(rating_columns)))

# Loop through each rating column and create line plots for flight year and month
for i, col in enumerate(rating_columns):
    ax1 = axes[i, 0]
    ax2 = axes[i, 1]

    # Line plot for flight year
    sns.lineplot(data=airline_df, x='flight_year', y=col, ax=ax1)
    ax1.set_title(f'{col.replace("_", " ").title()} by Flight Year')
    ax1.set_xlabel('Flight Year')
    ax1.set_ylabel(col.replace("_", " ").title())
    ax1.tick_params(axis='x', rotation=45)

    # Line plot for flight month
    sns.lineplot(data=airline_df, x='flight_month', y=col, ax=ax2)
    ax2.set_title(f'{col.replace("_", " ").title()} by Flight Month')
    ax2.set_xlabel('Flight Month')
    ax2.set_ylabel(col.replace("_", " ").title())
    ax2.set_xticks(range(1, 13))
    ax2.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Adjust layout and show the plots
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Line plots are a standard choice for visualizing temporal trends or variations in data. In this context, we aim to explore how passenger ratings have evolved across various flight years and months. Line plots provide a clear means of observing any patterns or shifts in ratings as we examine changes in the time variable, whether it's the year or the month.

##### 2. What is/are the insight(s) found from the chart?

**Impact of Flight Year on Ratings:**

- Overall ratings declined steadily from 2015 to 2019.
- Seat comfort, cabin service, food and beverage, entertainment, and ground service ratings followed a similar decreasing trend over the years.
- Value for money ratings initially dropped (2015-2016) but then began a slight recovery.

**Impact of Flight Month on Ratings:**

- Overall ratings remained relatively stable throughout the year, with a slight upswing during the middle months.
- Ratings for seat comfort, cabin service, food and beverage, and entertainment displayed similar patterns with minor fluctuations.
- Ground service and value for money ratings also showed consistent patterns with slight variations.
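
The yearly trend can be checked numerically by averaging the rating per year, which mirrors the per-year mean that the line plot draws. This is a minimal sketch on a hypothetical `sample_df`; the dates and ratings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical reviews standing in for airline_df; values are illustrative only
sample_df = pd.DataFrame({
    'review_date': pd.to_datetime(['2015-04-01', '2015-09-12', '2017-01-30', '2017-11-05', '2019-02-14']),
    'overall': [8, 6, 5, 5, 3],
})

# Mean overall rating and number of reviews per review year
yearly = (
    sample_df.groupby(sample_df['review_date'].dt.year)['overall']
    .agg(['mean', 'count'])
)
yearly.index.name = 'review_year'

# Change in the mean relative to the previous year present in the data
yearly['change'] = yearly['mean'].diff()
print(yearly)
```

A run of negative values in `change` corresponds to a steady decline; the `count` column shows how many reviews support each yearly mean.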

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

- **Continuous Improvement:** Tracking overall and specific ratings year by year shows whether service changes are working. Reversing the observed decline would signal improved services and passenger experiences, fostering loyalty and attracting more passengers.

- **Seasonal Adaptation:** Recognizing seasonal rating patterns enables the airline to adjust services during peak travel times, addressing specific issues.

**Areas for Improvement:**

- **Declining Ratings:** Persistent rating declines pose potential issues, including reduced satisfaction and customer attrition, necessitating corrective action.

- **Monthly Inconsistencies:** Fluctuations in monthly ratings highlight service quality inconsistencies, warranting investigation and improvement.

- **Competitor Comparison:** Contrasting ratings with competitors helps identify areas of deficiency, prompting efforts to bridge gaps in service quality.

#### **Analyzing the Impact of Overall Ratings on Passenger Recommendations**

In [None]:
# Calculate the mean overall rating for recommended and not recommended reviews
mean_rating_recommended = airline_df[airline_df['recommended'] == 'yes']['overall'].mean()
mean_rating_not_recommended = airline_df[airline_df['recommended'] == 'no']['overall'].mean()

# Create data for the donut chart
labels = ['Recommended', 'Not Recommended']
sizes = [mean_rating_recommended, mean_rating_not_recommended]
colors = ['#66b3ff','#99ff99']
explode = (0.1, 0)  # explode the 1st slice (Recommended)

# Create a donut chart (the percentages show each mean as a share of the sum of both means, not the ratings themselves)
plt.pie(sizes, labels=labels, colors=colors, autopct='%.1f%%', startangle=90, pctdistance=0.85, explode=explode)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Equal aspect ratio ensures that pie is drawn as a circle
plt.axis('equal')
plt.tight_layout()
plt.title('Mean Overall Rating for Recommended vs. Not Recommended Reviews')
plt.show()





##### 1. Why did you pick the specific chart?

The donut chart gives a quick visual comparison of the mean overall rating for recommended and not recommended reviews. Each slice shows one group's mean as a share of the sum of the two means, so the percentages indicate relative size rather than the ratings themselves. The distinct colors and the center circle highlight the contrast between the two categories.

##### 2. What is/are the insight(s) found from the chart?

The donut chart shows a clear contrast: the mean overall rating for recommended reviews is notably higher than the mean for not recommended reviews. Reviews marked as recommended carry high overall ratings and positive sentiment, whereas not recommended reviews carry considerably lower ratings and negative sentiment.
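
The two means can be inspected directly, together with the percentage that a pie or donut slice would display for them. The sketch below uses a hypothetical `sample_df`, so the printed values are illustrative and not the dataset's results.

```python
import pandas as pd

# Hypothetical stand-in for airline_df; values are illustrative only
sample_df = pd.DataFrame({
    'recommended': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
    'overall': [9, 8, 10, 2, 1, 3],
})

# Mean overall rating per recommendation group
mean_by_rec = sample_df.groupby('recommended')['overall'].mean()

# Difference between the two group means, in rating points
gap = mean_by_rec['yes'] - mean_by_rec['no']

# Percentage a pie slice would show: each mean as a share of the sum of means
pie_share = mean_by_rec / mean_by_rec.sum() * 100

print(mean_by_rec)
print('Gap in rating points:', gap)
print(pie_share.round(1))
```

The gap in rating points is usually the easier number to interpret, since the slice percentages depend on both means at once.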

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The significant difference in mean overall ratings between recommended and not recommended reviews indicates that positive customer sentiment and recommendations are correlated with higher overall ratings. This insight can be leveraged by the airline to focus on improving customer satisfaction, addressing issues highlighted in not recommended reviews, and enhancing the overall travel experience. By doing so, the airline can aim to increase the number of positive reviews and recommendations, which can attract more customers and foster loyalty.

#### **Correlation Heatmap**

In [None]:
# Correlation Heatmap visualization code
variables = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

plt.figure(figsize=(10, 8))
sns.heatmap(airline_df[variables].corr(), annot=True)
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap proves to be a valuable tool for swiftly pinpointing patterns of correlation among numerical features within the dataset. It serves to elucidate which features exhibit a more robust association with one another, offering insights that can guide subsequent analysis or modeling endeavors.

##### 2. What is/are the insight(s) found from the chart?

**Strong Positive Correlations:**
- Overall ratings are strongly positively correlated with seat comfort, cabin service, ground service, and value for money ratings.

**Moderate to Strong Positive Correlations:**
- Seat comfort, cabin service, and value for money ratings are moderately to strongly correlated with one another.

**Moderate Positive Correlations:**
- Food and beverage ratings are moderately positively correlated with cabin service and entertainment ratings.

**Weaker Positive Correlations:**
- Entertainment ratings have a weaker positive correlation (0.470) with ground service ratings.
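
The coefficients shown in the heatmap come from `DataFrame.corr()`, which computes Pearson correlation by default. A minimal sketch of ranking features by their correlation with `overall` follows; `sample_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical ratings standing in for airline_df; values are illustrative only
sample_df = pd.DataFrame({
    'overall':         [1, 3, 5, 7, 9, 10],
    'value_for_money': [1, 2, 3, 4, 5, 5],
    'entertainment':   [3, 1, 4, 2, 5, 3],
})

# Pairwise Pearson correlation matrix (the same table the heatmap visualises)
corr_matrix = sample_df.corr()

# Correlation of every other rating with 'overall', strongest first
overall_corr = corr_matrix['overall'].drop('overall').sort_values(ascending=False)
print(overall_corr)
```

Sorting the `overall` column this way gives a quick ranking of which sub-ratings move most closely with the overall score.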

#### **Pair Plot**

In [None]:
variables = ['overall', 'seat_comfort', 'cabin_service', 'food_bev', 'entertainment', 'ground_service', 'value_for_money']

# Create a scatterplot matrix using Seaborn
sns.set(style='ticks')
sns.pairplot(airline_df[variables])

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Pairplot is a data visualization tool that creates a grid of scatter plots to explore the pairwise relationships between multiple variables in a dataset. It's especially helpful for identifying correlations, patterns, and trends among variables, which can guide subsequent analysis and modeling.

##### 2. What is/are the insight(s) found from the chart?

In terms of passenger ratings, overall satisfaction shows a positive correlation with all other ratings. This means that passengers who rate specific categories highly tend to also give higher overall ratings.

The strongest positive correlation exists between overall satisfaction and "value for money" (0.906), followed by "ground service" (0.846) and "cabin service" (0.795). Passengers who believe they are receiving good value for their money are the most likely to give higher overall satisfaction ratings.

There's a fairly strong positive correlation (0.733) between "seat comfort" and "cabin service," indicating that passengers who rate seat comfort highly also tend to rate cabin service highly.

"Ground service" and "value for money" have a positive correlation (0.796), suggesting that passengers who experience better ground services also tend to rate the airline's value for money more positively.

"Food and beverage" and "entertainment" are positively correlated (0.610), implying that passengers who appreciate the quality of food and beverage also tend to rate the entertainment options higher.

The correlation between "entertainment" and "ground service" is weaker (0.470), indicating a milder relationship: passengers who find the entertainment satisfying are only somewhat more likely to rate ground services highly.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Hypothesis Statement 1:**
The distribution of ratings for food and beverage features at levels 2, 4, and 5 is approximately equal.

**Hypothesis Statement 2:**
On average, passengers in Business Class rate cabin service higher than passengers in Economy class.

**Hypothesis Statement 3:**
Passengers traveling for solo leisure purposes tend to give higher average ratings for seat comfort compared to passengers traveling for business purposes.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The food and beverage ratings of 2, 4, and 5 are equally distributed, meaning their frequencies are approximately equal.

Alternative Hypothesis (H1): The food and beverage ratings of 2, 4, and 5 are not equally distributed, meaning at least one frequency differs from the others.


#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis: Food and beverage ratings 2, 4, and 5 are equally distributed.

# Select the relevant data for food and beverage ratings 2, 4, and 5
food_bev_ratings = airline_df[airline_df['food_bev'].isin([2, 4, 5])]['food_bev']

# Calculate the observed frequencies of each rating
observed_freq = food_bev_ratings.value_counts()

# Calculate the expected frequency assuming an equal chance for each rating
expected_freq = len(food_bev_ratings) / 3

# Perform a chi-square test to determine if the observed and expected frequencies are significantly different
chi2_stat, p_value = stats.chisquare(observed_freq, f_exp=expected_freq)

# Print the results of the hypothesis test
print("\nHypothesis 1:")
print("------"*5)
print(f"Chi-square statistic: {chi2_stat}")
print(f"P-value: {p_value}")



##### Which statistical test have you done to obtain P-Value?

In this code, a chi-square goodness-of-fit test is used to calculate the p-value. It tests whether the observed frequencies of food and beverage ratings 2, 4, and 5 differ significantly from the expected frequencies under the assumption of an equal probability for each rating.

##### Why did you choose the specific statistical test?

I selected the chi-square test for Hypothesis 1, which aims to examine whether food and beverage ratings 2, 4, and 5 are equally distributed. This test is suitable for analyzing categorical data and determining if observed frequencies significantly differ from expected frequencies. In our case, we're assessing the distribution of these ratings to see if there's any statistically significant deviation from the expected distribution.
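
To make the test statistic concrete, here is a minimal hand-computed sketch of the goodness-of-fit calculation. The counts in `toy_counts` are hypothetical, not the notebook's real frequencies, and the helper names are illustrative. With three categories there are 3 - 1 = 2 degrees of freedom, and for that special case the chi-square upper-tail probability has the closed form exp(-x / 2), so no statistics library is needed here.

```python
from math import exp


def chi_square_uniform(observed):
    """Goodness-of-fit statistic against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


def p_value_three_categories(chi2_value):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    Only valid for exactly three categories (df = 2).
    """
    return exp(-chi2_value / 2)


# Hypothetical counts for ratings 2, 4 and 5 (assumption: illustrative only)
toy_counts = [30, 40, 50]

toy_chi2 = chi_square_uniform(toy_counts)
toy_p = p_value_three_categories(toy_chi2)
print(f"chi-square = {toy_chi2:.3f}, p-value = {toy_p:.4f}")
```

A small p-value would lead us to reject the hypothesis that the three ratings occur equally often.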

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): Passengers who travel in Business Class do not rate cabin service higher on average than passengers who travel in Economy class (the Business Class mean is less than or equal to the Economy class mean).

Alternative Hypothesis (H1): Passengers who travel in Business Class rate cabin service higher on average compared to passengers who travel in Economy class.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 2: Business Class passengers rate cabin service higher than Economy Class passengers.

# Select ratings for Business Class passengers
business_class_ratings = airline_df[airline_df['cabin'] == 'Business Class']['cabin_service']

# Select ratings for Economy Class passengers
economy_class_ratings = airline_df[airline_df['cabin'] == 'Economy Class']['cabin_service']

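# Perform an independent two-sample t-test with unequal variances assumed (Welch's t-test)
# alternative='greater' matches the one-sided H1 (Business > Economy); requires SciPy 1.6 or newer
t_stat, p_value = stats.ttest_ind(business_class_ratings, economy_class_ratings, equal_var=False, alternative='greater')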

# Print the results of the hypothesis test
print("\nHypothesis 2:")
print('---'*10)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

In this code, we perform an independent two-sample t-test to calculate the p-value for Hypothesis 2. This test helps us assess if there is a statistically significant difference between the means of two independent groups. Specifically, we use the t-test to compare the average cabin service ratings of passengers in Business Class and Economy Class.

##### Why did you choose the specific statistical test?

For Hypothesis 2, which compares cabin service ratings between Business Class and Economy Class passengers, we opted for the independent two-sample t-test for the following reasons:

1. Data Type: The cabin service ratings are numeric scores on an ordinal scale; treating them as approximately continuous makes a comparison of means with the t-test reasonable.

2. Two-Group Comparison: The hypothesis involves comparing the means of two distinct groups (Business Class and Economy Class passengers).

3. Normality Assumption: Although the t-test assumes normal distribution within each group, for larger sample sizes, it can still provide valid results even if this assumption is not entirely met.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): Passengers who travel for Solo Leisure do not rate seat comfort higher on average than passengers who travel for Business purposes (the Solo Leisure mean is less than or equal to the Business mean).

Alternative Hypothesis (H1): Passengers who travel for Solo Leisure tend to rate seat comfort higher on average compared to passengers who travel for Business purposes.

#### 2. Perform an appropriate statistical test.

In [None]:
# Hypothesis 3: Solo Leisure travelers rate seat comfort higher than Business travelers.

# Select ratings for Solo Leisure travelers
solo_leisure_ratings = airline_df[airline_df['traveller_type'] == 'Solo Leisure']['seat_comfort']

# Select ratings for Business travelers
business_travel_ratings = airline_df[airline_df['traveller_type'] == 'Business']['seat_comfort']

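# Perform an independent two-sample t-test with unequal variances assumed (Welch's t-test)
# alternative='greater' matches the one-sided H1 (Solo Leisure > Business); requires SciPy 1.6 or newer
t_stat, p_value = stats.ttest_ind(solo_leisure_ratings, business_travel_ratings, equal_var=False, alternative='greater')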

# Print the results of the hypothesis test
print("\nHypothesis 3:")
print("---"*10)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


##### Which statistical test have you done to obtain P-Value?

For Hypothesis 3, we use an independent two-sample t-test in its unequal-variance form, known as Welch's t-test, to calculate the p-value. It is implemented with the stats.ttest_ind() function from the SciPy library.

The independent two-sample t-test is chosen when comparing the means of two groups, particularly when the assumption of equal variances is not met. In this instance, we use the equal_var=False argument with the stats.ttest_ind() function, signifying that we do not assume equal variances between the two groups.

##### Why did you choose the specific statistical test?

We chose the independent two-sample t-test for Hypothesis 3 due to the following reasons:

1. Data Type: Our data involves two independent groups, Solo Leisure travelers and Business travelers, and we want to compare the mean seat comfort rating (a numeric score on an ordinal scale, treated here as approximately continuous) between them.

2. Sample Size and Normality: The t-test assumes approximately normal data, but with large samples (commonly taken as more than 30 per group) it remains reliable under moderate deviations from normality, provided the distributions are not highly skewed.

3. Equal Variances Not Assumed: We have no reason to assume the two groups share the same variance, so we use Welch's t-test, which does not require equal variances and remains reliable when they differ.
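
To illustrate what the unequal-variance test computes, here is a minimal sketch of Welch's t-statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The two samples and the helper name `welch_t` are hypothetical; the notebook itself relies on `stats.ttest_ind(..., equal_var=False)`, which also returns the p-value.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (variance() is the sample variance)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_value = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_value, dof


# Hypothetical seat comfort ratings (assumption: illustrative values only)
toy_solo_leisure = [4, 5, 3, 4, 5, 4]
toy_business = [3, 2, 4, 3, 1, 5, 2, 3]

toy_t, toy_dof = welch_t(toy_solo_leisure, toy_business)
print(f"t = {toy_t:.3f}, degrees of freedom = {toy_dof:.2f}")
```

Because each group contributes its own variance term, the statistic does not pool the variances, which is the difference from the classic Student's t-test.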

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Identify numerical columns in the dataframe
numerical_cols = airline_df.select_dtypes(include='number').columns

# Set the z-score threshold for identifying outliers
z_score_threshold = 3

# Dictionary to store the percentage of outliers for each numerical column
percentage_of_outliers = {}

# Loop through each numerical column and calculate the percentage of outliers
for col in numerical_cols:
    # Calculate the mean and standard deviation for the current column
    col_mean = airline_df[col].mean()
    col_std = airline_df[col].std()

    # Calculate the z-scores for all values in the column
    z_scores = np.abs((airline_df[col] - col_mean) / col_std)

    # Count the number of outliers based on the z-score threshold
    num_outliers = len(airline_df[z_scores > z_score_threshold])

    # Calculate the percentage of outliers for the current column
    percentage = (num_outliers / len(airline_df)) * 100

    # Store the percentage of outliers in the dictionary
    percentage_of_outliers[col] = percentage

# Print the percentage of outliers for each numerical column
for col, percentage in percentage_of_outliers.items():
    print(f"Percentage of outliers in {col}: {percentage:.2f}%")


##### What all outlier treatment techniques have you used and why did you use those techniques?

I did not apply any outlier treatment because the z-score check above (|z| > 3) found no outliers in the numerical columns.
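
As a rough illustration of why a z-score check may flag nothing on bounded rating scales, the sketch below applies the same |z| > 3 rule to two hypothetical samples. The data and the helper name `zscore_outlier_share` are assumptions made for this example, not values from the airline dataset.

```python
from statistics import mean, stdev


def zscore_outlier_share(values, threshold=3.0):
    """Percentage of values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like pandas .std()
    flagged = [v for v in values if abs((v - mu) / sigma) > threshold]
    return 100 * len(flagged) / len(values)


# Hypothetical ratings confined to a 1-5 scale: no value can sit far from the mean
bounded_ratings = [1, 2, 3, 4, 5] * 20

# Hypothetical unbounded measurements with one extreme value
with_extreme = [3] * 50 + [4] * 49 + [60]

print(f"Bounded ratings: {zscore_outlier_share(bounded_ratings):.2f}% outliers")
print(f"With extreme value: {zscore_outlier_share(with_extreme):.2f}% outliers")
```

On a scale that only spans a few points, even the most extreme rating usually stays within three standard deviations of the mean, so a zero count is the expected outcome rather than a sign that the check failed.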

### 2. Categorical Encoding

In [None]:
# Identify the categorical columns in the DataFrame
categorical_columns = airline_df.select_dtypes(include=["object"]).columns.unique()

# Print the categorical columns
print("Categorical Columns:")
for col in categorical_columns:
    print(col)


We will process the 'customer_review' feature as textual data. The 'author', 'arrival_city', and 'departure_city' features are not significant for our machine learning model, and we plan to drop them during feature manipulation.

In [None]:
# Define the categorical columns to be encoded
categorical_columns = ["airline", "traveller_type", "cabin", "recommended"]

# Create a label encoder object
le = LabelEncoder()

# Fit and transform the label encoder for each categorical column
for column in categorical_columns:
    airline_df[column] = le.fit_transform(airline_df[column])

# Print the encoded dataset
print("Encoded Dataset:")
print(airline_df)


#### What all categorical encoding techniques have you used & why did you use those techniques?

We use label encoding for the categorical columns because most machine learning algorithms require numerical input. Label encoding assigns a unique integer to each category. The integers carry no inherent meaning, so linear or distance-based models may read an artificial order into them; one-hot encoding avoids this at the cost of additional columns.

Advantages of label encoding include simplicity, efficiency, and compatibility with a wide range of machine learning algorithms.
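
To show what the encoding does to the values, here is a minimal sketch that maps categories to integers in sorted order, which is also how scikit-learn's LabelEncoder orders its classes. The helper name `label_encode` and the `toy_cabins` list are hypothetical ('First Class' is an assumed category added for illustration).

```python
def label_encode(values):
    """Map each distinct category to an integer code, using sorted order."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical cabin values (assumption: illustrative only)
toy_cabins = ["Economy Class", "Business Class", "First Class", "Economy Class"]

toy_codes, toy_mapping = label_encode(toy_cabins)
print(toy_mapping)
print(toy_codes)
```

The codes follow alphabetical order rather than any service tier, so 'Economy Class' lands between 'Business Class' and 'First Class'. This is worth keeping in mind when interpreting models that treat the codes as magnitudes.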

### 3. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

We have a 'customer_review' feature containing textual data in our dataset. To utilize this data in feature selection, we will convert the text reviews into numeric representations using Natural Language Processing (NLP). This will help us identify which reviews are providing recommendations.

In [None]:
# Install the vaderSentiment package using pip
!pip install vaderSentiment

In [None]:
# Import the SentimentIntensityAnalyzer class from the vaderSentiment package
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create a single SentimentIntensityAnalyzer object and reuse it for every review
sid_obj = SentimentIntensityAnalyzer()

# Create a function to get the sentiment score for a given sentence
def sentiment_scores(sentence):
    # Calculate the sentiment scores for the sentence
    sentiment_dict = sid_obj.polarity_scores(sentence)

    # Return the compound sentiment score (ranges from -1 to +1)
    return sentiment_dict['compound']

# Create a new column 'numeric_review' to store sentiment polarity for each customer review
# Apply the 'sentiment_scores' function to each customer review in the 'customer_review' column
airline_df['numeric_review'] = airline_df['customer_review'].apply(sentiment_scores)


# Display the first 10 rows of the DataFrame, showing 'customer_review' and 'numeric_review' columns
print(airline_df[['customer_review', 'numeric_review']].head(10))
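
The compound score stored in 'numeric_review' ranges from -1 (most negative) to +1 (most positive). As a reading aid, the sketch below buckets a score into a coarse label using the cut-offs of 0.05 and -0.05 that the VADER documentation suggests as a typical convention. The helper name `sentiment_label` and the `toy_scores` values are hypothetical, and the notebook itself keeps the raw numeric score as the model feature.

```python
def sentiment_label(compound, threshold=0.05):
    """Bucket a compound sentiment score into a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores (assumption: illustrative values only)
toy_scores = [0.91, -0.64, 0.02]

toy_labels = [sentiment_label(score) for score in toy_scores]
print(toy_labels)
```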


### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Drop unnecessary columns that are not needed for analysis
# This can help minimize feature correlation and simplify the dataset
columns_to_drop = ['author', 'review_date', 'date_flown', 'customer_review', 'month', 'flight_year', 'flight_month', 'arrival_city', 'departure_city']
airline_df = airline_df.drop(columns_to_drop, axis=1)


#### 2. Feature Selection

In [None]:
# f_classif is the ANOVA F-test for classification targets
from sklearn.feature_selection import SelectKBest, f_classif

# Separate the feature matrix 'X' and the target variable 'y'
X = airline_df.drop(columns=['recommended'])  # Feature matrix
y = airline_df['recommended']  # Target variable (binary class label)

# Number of top features to select
k = 11

# Perform feature selection using ANOVA F-tests
# SelectKBest is used to select the top k features based on their F-scores
# 'recommended' is a class label, so f_classif is used instead of f_regression
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X, y)

# Get the indices of the selected features
selected_feature_indices = selector.get_support(indices=True)

# Get the names and scores of the selected features
selected_feature_names = X.columns[selected_feature_indices]
selected_feature_scores = selector.scores_[selected_feature_indices]

# Now, 'X_selected' contains only the selected features, and 'selected_feature_names' contains their names.
# Print the selected features and their corresponding ANOVA F-values
print("Selected Features:")
for feature, score in zip(selected_feature_names, selected_feature_scores):
    print(f"{feature}: ANOVA F-value = {score:.2f}")


##### What all feature selection methods have you used and why?

For feature selection we used the SelectKBest method with the ANOVA (Analysis of Variance) F-test as the score function. Here's why:

SelectKBest with ANOVA: SelectKBest is a scikit-learn feature selection technique that keeps the 'k' highest-scoring features according to a univariate statistical test. For classification tasks, the ANOVA F-test (f_classif) measures how strongly the mean of each feature differs between the target classes.

Reason for Choice: We opted for this method because our target variable, 'recommended', is a binary class label, which makes this a classification task. ANOVA F-values help us gauge the significance of each feature's relationship with the target. By selecting the top 'k' features, we aim to retain the most informative ones while reducing model complexity and potential overfitting. Note that the data split below uses the full feature matrix 'X', so the F-values serve here mainly to rank the features.
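
To make the F-value concrete for a binary target such as 'recommended', here is a minimal hand-computed sketch of the one-way ANOVA F-statistic: the variance between class means divided by the variance within classes. The toy feature values and the helper name `anova_f` are hypothetical; scikit-learn's `f_classif` computes the same kind of statistic for each feature.

```python
from statistics import mean


def anova_f(feature, target):
    """One-way ANOVA F-statistic of a numeric feature grouped by class label."""
    groups = {}
    for value, label in zip(feature, target):
        groups.setdefault(label, []).append(value)

    grand_mean = mean(feature)
    n_samples, n_classes = len(feature), len(groups)

    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        group_mean = mean(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within


# Hypothetical data (assumption: illustrative values only)
toy_recommended = [1, 1, 1, 0, 0, 0]
toy_value_for_money = [5, 4, 5, 1, 2, 1]  # class means differ clearly
toy_noise = [3, 1, 5, 3, 5, 1]            # class means are identical

print(f"value_for_money F = {anova_f(toy_value_for_money, toy_recommended):.2f}")
print(f"noise F = {anova_f(toy_noise, toy_recommended):.2f}")
```

A feature whose values separate the two classes receives a large F-value, while a feature with the same mean in both classes scores zero.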

##### Which all features you found important and why?

1. airline: Influences passenger experience due to varying service levels.
2. overall: Direct measure of passenger satisfaction, a critical indicator.
3. traveller_type: Different traveler types may have distinct preferences.
4. seat_comfort: Crucial for comfort during air travel.
5. cabin_service: Impacts passenger satisfaction, including in-flight service.
6. food_bev: Quality of food and beverages affects comfort.
7. entertainment: Enhances the overall travel experience.
8. ground_service: Shapes perception through check-in, boarding, etc.
9. value_for_money: Reflects passenger's perceived worth of expenses.
10. numeric_review: May capture additional insights from customer review text.

### 5. Data Splitting

In [None]:
# Split the data into training and testing sets
# Use an 80-20 split ratio, with 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Print the shapes of the training and testing sets to verify the split
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


##### What data splitting ratio have you used and why?

The test_size parameter in train_test_split determines the proportion of data allocated to the testing set when splitting the dataset. In this code, test_size=0.2 is used, meaning 20% of the data is allocated to testing, leaving 80% for training. Commonly used ratios are 80:20 (test_size=0.2) and 70:30 (test_size=0.3) to balance training data and reliable testing evaluation.

## ***7. ML Model Implementation***

In [None]:
# Define the column names for the evaluation metrics DataFrame
column_names = ["MODEL NAME", "ACCURACY", "RECALL", "PRECISION", "F1-SCORE", "ROC AUC SCORE"]

# Create an empty DataFrame with the specified column names
evaluation_results = pd.DataFrame(columns=column_names)

# We can now populate this DataFrame with evaluation metrics for different models.

In [None]:
# Define a function to add evaluation metrics for a model to a DataFrame
def add_metrics_details(model_name, y_test, y_pred, df):
    # Collect the metrics for the model in a single-row DataFrame
    new_row = pd.DataFrame([{
        'MODEL NAME': model_name,
        'ACCURACY': accuracy_score(y_test, y_pred),
        'RECALL': recall_score(y_test, y_pred),
        'PRECISION': precision_score(y_test, y_pred),
        'F1-SCORE': f1_score(y_test, y_pred),
        'ROC AUC SCORE': roc_auc_score(y_test, y_pred)
    }])

    # DataFrame.append was removed in pandas 2.0, so concatenate the new row instead
    return pd.concat([df, new_row], ignore_index=True)


### Logistic Regression

In [None]:
# ML Model - 1 Implementation
log_reg = LogisticRegression(fit_intercept=True, max_iter=10000)

# Fit the Logistic Regression model on the training data
log_reg.fit(X_train, y_train)

# Predict using the trained model on the test data
y_pred_logreg = log_reg.predict(X_test)

# Calculate the accuracy score of the Logistic Regression model
score = log_reg.score(X_test, y_test)
print(f'Logistic regression score: {score}')

# Create a DataFrame to compare actual and predicted values
data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_logreg})

# Print the DataFrame to display actual and predicted values
print("Comparison of Actual and Predicted Values:")
print(data)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualize evaluation metric score chart

# Print the classification report
print(metrics.classification_report(y_test, y_pred_logreg))

# Calculate and print the accuracy score of the model
accuracy = accuracy_score(y_test, y_pred_logreg)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_logreg, labels=[1, 0])

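# Plot the confusion matrix as a heatmap (rows: actual class, columns: predicted class)
sns.heatmap(cm, annot=True, fmt="d", cmap='icefire', xticklabels=[1, 0], yticklabels=[1, 0])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()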


In [None]:
# Add evaluation metrics for the Logistic Regression model to evaluation_results
evaluation_results = add_metrics_details("Logistic Regression", y_test, y_pred_logreg, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (RandomizedSearchCV)

# Perform 10-fold cross-validation
scores = cross_val_score(log_reg, X_train, y_train, cv=10)
print("Cross-Validation Scores:", scores)
print("Mean CV Score:", scores.mean())
print("Max and Min CV Score:", scores.max(), scores.min())

# Define hyperparameters distribution
param_dist = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}

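# Create RandomizedSearchCV object
# The search space holds only six values of C, so n_iter is set to 6
# (scikit-learn warns and caps the number of iterations when n_iter is larger)
logreg_hyper = RandomizedSearchCV(log_reg, param_distributions=param_dist, n_iter=6, cv=10, random_state=0)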

# Fit the Logistic Regression model with hyperparameter optimization
logreg_hyper.fit(X_train, y_train)

# Get the best parameters and best score
best_params = logreg_hyper.best_params_
best_score = logreg_hyper.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

# Predict on the model with optimized hyperparameters
y_pred_logreg_hyper = logreg_hyper.predict(X_test)

In [None]:
# Visualize evaluation metric score chart for the model with optimized hyperparameters

# Print the classification report
print(metrics.classification_report(y_test, y_pred_logreg_hyper))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_logreg_hyper, labels=[1, 0])
print("Confusion Matrix:")
print(cm)

# Calculate and print the accuracy score of the model
accuracy = accuracy_score(y_test, y_pred_logreg_hyper)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')


In [None]:
# Add evaluation metrics for the Logistic Regression model with tuning to evaluation_results
evaluation_results = add_metrics_details("Logistic Regression With Tuning", y_test, y_pred_logreg_hyper, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


##### Which hyperparameter optimization technique have you used and why?

We used RandomizedSearchCV for the following reasons:

1. Faster Computation: RandomizedSearchCV can be faster than GridSearchCV because it evaluates only a random sample of the hyperparameter combinations rather than all of them.

2. Flexibility: It allows specifying the number of iterations (n_iter), which caps the cost of exploring a large hyperparameter space.

3. Better Exploration: When given continuous distributions, RandomizedSearchCV can try values that lie between grid points, which helps when the best values are not on the grid.

4. Resource-Efficient: For computationally expensive models and large datasets, fewer candidate evaluations mean fewer model fits than an exhaustive grid.

In this notebook the search space is a list of only six values of C, so the search ends up evaluating all of them.
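
The difference between a fixed list of values and a sampled distribution can be sketched with the standard library alone. The example below draws values of C uniformly on a log scale between 0.001 and 100; the helper name `sample_log_uniform` and the seed are assumptions for illustration. With scikit-learn, a comparable effect could be obtained by passing a distribution such as `scipy.stats.loguniform` in `param_distributions`.

```python
import random


def sample_log_uniform(low_exp, high_exp, n_iter, seed=0):
    """Draw n_iter values uniformly on a log10 scale between 10**low_exp and 10**high_exp."""
    rng = random.Random(seed)
    return [10 ** rng.uniform(low_exp, high_exp) for _ in range(n_iter)]


# Fixed candidate list, as used for C in the notebook
grid_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Ten randomly sampled candidates covering the same range
random_values = sample_log_uniform(-3, 2, n_iter=10)

print("Grid candidates:  ", grid_values)
print("Random candidates:", [round(value, 4) for value in random_values])
```

The sampled candidates fall between the grid points, which is where random search can find settings that a coarse grid would miss.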

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After hyperparameter tuning, we observed slight improvements in most metrics:

- Accuracy increased slightly.
- Recall remained unchanged.
- Precision showed a slight improvement.
- F1-Score increased slightly.
- ROC AUC Score also improved marginally.

Overall, these improvements, while consistent, are modest. Tuning refined the model slightly, but the gains may not justify the additional computational cost.

### K-Nearest Neighbour - ML Model

In [None]:
# ML Model Implementation (K-Nearest Neighbors)

# Number of neighbors (k) for KNN
k = 5

# Create a KNeighborsClassifier with k neighbors
knn = KNeighborsClassifier(n_neighbors=k)

# Fit the KNN model on the training data
knn.fit(X_train, y_train)

# Predict using the trained KNN model on the test data
y_pred_knn = knn.predict(X_test)

# Calculate the accuracy score of the KNN model
score = knn.score(X_test, y_test)
print(f'K-Nearest Neighbors score : {score}')

# Create a DataFrame to compare actual and predicted values
data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_knn})

# Print the DataFrame to display actual and predicted values
print("Comparison of Actual and Predicted Values for KNN:")
print(data)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualize evaluation metric score chart for the KNN model

# Print the classification report
print(metrics.classification_report(y_test, y_pred_knn))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_knn, labels=[1, 0])
print("Confusion Matrix:")
print(cm)

# Calculate and print the accuracy score of the KNN model
accuracy = accuracy_score(y_test, y_pred_knn)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')

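# Plot the confusion matrix as a heatmap (rows: actual class, columns: predicted class)
sns.heatmap(cm, annot=True, fmt="d", cmap='icefire', xticklabels=[1, 0], yticklabels=[1, 0])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for K-Nearest Neighbors')
plt.show()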


In [None]:
# Add evaluation metrics for the K-Nearest Neighbors (KNN) model to evaluation_results
evaluation_results = add_metrics_details("K-Nearest Neighbours", y_test, y_pred_knn, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Perform 5-fold cross-validation
scores = cross_val_score(knn, X_train, y_train, cv=5)
print("Cross-Validation Scores:", scores)
print("Mean CV Score:", scores.mean())
print("Max and Min CV Score:", scores.max(), scores.min())

# Define hyperparameters to tune
param_grid = {
    'n_neighbors': np.arange(1, 9),  # Number of neighbors
    'weights': ['uniform', 'distance'],  # Weighting method
    'p': [1, 2]  # Distance metric (1: Manhattan, 2: Euclidean)
}

# Create GridSearchCV with 5-fold cross-validation
knn_hyper = GridSearchCV(knn, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the KNN model with hyperparameter tuning
knn_hyper.fit(X_train, y_train)

# Get the best model from GridSearchCV
best_model = knn_hyper.best_estimator_

print("Best Parameters:", knn_hyper.best_params_)
print("Best Score:", knn_hyper.best_score_)

# Predict on the test data using the best model
y_pred_knn_hyper = best_model.predict(X_test)

In [None]:
# Visualize evaluation metric score chart for the KNN model with hyperparameter tuning

# Print the classification report
print(metrics.classification_report(y_test, y_pred_knn_hyper))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_knn_hyper, labels=[1, 0])
print("Confusion Matrix:")
print(cm)

# Calculate and print the accuracy score of the KNN model with hyperparameter tuning
accuracy = accuracy_score(y_test, y_pred_knn_hyper)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')


In [None]:
# Add evaluation metrics for the K-Nearest Neighbors (KNN) model with tuning to evaluation_results
evaluation_results = add_metrics_details("K-Nearest Neighbours With Tuning", y_test, y_pred_knn_hyper, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


##### Which hyperparameter optimization technique have you used and why?

In the provided code, GridSearchCV was employed for hyperparameter optimization with the following benefits:

1. Exhaustive Search: GridSearchCV systematically evaluates every hyperparameter combination in the specified grid, so no candidate within that grid is skipped.

2. Cross-Validation: It uses 5-fold cross-validation to estimate model performance on unseen data, enhancing reliability.

3. Best Parameters and Score: GridSearchCV identifies the best hyperparameter combination and provides the corresponding score, aiding in optimal parameter selection.

4. Automated Tuning: GridSearchCV automates the time-consuming process of hyperparameter tuning, saving effort.

5. Generalization: Because candidates are compared on cross-validated scores, the chosen hyperparameters are more likely to generalize to new data.
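
The cost of an exhaustive search can be estimated before running it. The sketch below expands the KNN grid used in this section into its individual candidates and counts the model fits implied by 5-fold cross-validation; the helper name `expand_grid` is hypothetical, and the count excludes the final refit of the best model on the full training set.

```python
from itertools import product

# Same search space as the KNN grid in this section
param_grid = {
    "n_neighbors": list(range(1, 9)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}


def expand_grid(grid):
    """Return every combination of the grid as a list of parameter dictionaries."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


candidates = expand_grid(param_grid)
n_folds = 5
total_fits = len(candidates) * n_folds

print(f"Candidates: {len(candidates)}")
print(f"Cross-validation fits: {total_fits}")
print("First candidate:", candidates[0])
```

Each additional hyperparameter multiplies the number of candidates, which is why exhaustive grids become expensive quickly.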

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After comparing the results, it appears that there's not a significant difference in performance between the K-Nearest Neighbors model and the tuned version. The metrics show minimal variation:

- Accuracy is slightly lower after tuning.
- Recall remains similar.
- Precision improves slightly.
- F1-Score remains relatively consistent.
- ROC AUC Score is similar.

Hyperparameter tuning in this case hasn't led to significant improvements. It's possible that the default hyperparameters were already well-suited for this dataset, highlighting the importance of not blindly applying tuning, as default settings can sometimes be effective.

### Support Vector Machine - ML Model

In [None]:
# Support Vector Machine (SVM) Implementation

# Create an SVM classifier
svm = SVC()

# Fit the SVM model on the training data
svm.fit(X_train, y_train)

# Predict using the trained SVM model on the test data
y_pred_svm = svm.predict(X_test)

# Calculate the accuracy score of the SVM model
score = svm.score(X_test, y_test)
print(f'Support Vector Machine Score : {score}')

# Create a DataFrame to compare actual and predicted values
data = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred_svm})

# Print the DataFrame to display actual and predicted values
print("Comparison of Actual and Predicted Values for SVM:")
print(data)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualize evaluation metric score chart for the SVM model

# Print the classification report
print(metrics.classification_report(y_test, y_pred_svm))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_svm, labels=[1, 0])
print("Confusion Matrix:")
print(cm)

# Calculate and print the accuracy score of the SVM model
accuracy = accuracy_score(y_test, y_pred_svm)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')

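# Plot the confusion matrix as a heatmap (rows: actual class, columns: predicted class)
sns.heatmap(cm, annot=True, fmt="d", cmap='icefire', xticklabels=[1, 0], yticklabels=[1, 0])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Support Vector Machine')
plt.show()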


In [None]:
# Add evaluation metrics for the Support Vector Machine (SVM) model to evaluation_results
evaluation_results = add_metrics_details("Support Vector Machine", y_test, y_pred_svm, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Perform 5-fold cross-validation for the SVM model
scores = cross_val_score(svm, X_train, y_train, cv=5)
print("Cross-Validation Scores:", scores)
print("Mean CV Score:", scores.mean())
print("Max and Min CV Score:", scores.max(), scores.min())

# Define hyperparameters to tune
# Note: each list holds a single value, so this grid contains only one candidate
param_grid = {
    'kernel': ['rbf'],
    'C': [10],
    'gamma': ['scale']
}

# Create GridSearchCV with 5-fold cross-validation
svm_hyper = GridSearchCV(estimator=svm, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the SVM model with hyperparameter tuning
svm_hyper.fit(X_train, y_train)

# Get the best model from GridSearchCV
best_model = svm_hyper.best_estimator_

print("Best Parameters:", svm_hyper.best_params_)
print("Best Score:", svm_hyper.best_score_)

# Predict on the test data using the best model with optimized hyperparameters
y_pred_svm_hyper = best_model.predict(X_test)


In [None]:
# Visualize evaluation metric score chart for the SVM model with hyperparameter tuning

# Print the classification report
print(metrics.classification_report(y_test, y_pred_svm_hyper))

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred_svm_hyper, labels=[1, 0])
print("Confusion Matrix:")
print(cm)

# Calculate and print the accuracy score of the SVM model with hyperparameter tuning
accuracy = accuracy_score(y_test, y_pred_svm_hyper)
print(f'\nAccuracy score % of the model is {round(accuracy * 100, 2)}%\n')


In [None]:
# Add evaluation metrics for the Support Vector Machine (SVM) model with tuning to evaluation_results
evaluation_results = add_metrics_details("Support Vector Machine With Tuning", y_test, y_pred_svm_hyper, evaluation_results)

# Print the updated evaluation_results
print(evaluation_results)


##### Which hyperparameter optimization technique have you used and why?

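I employed GridSearchCV in the SVM model for hyperparameter optimization. Note that the grid used here contains a single candidate (kernel='rbf', C=10, gamma='scale'), so the search cross-validates that one configuration rather than comparing alternatives. In general, GridSearchCV offers the following advantages: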

1. Exhaustive Search: GridSearchCV evaluates every combination in the supplied parameter grid, so no candidate within that grid is skipped. The grid used here contains a single combination (RBF kernel, C=10, gamma='scale').

2. Tuning Multiple Hyperparameters: SVM models have several hyperparameters (kernel, C, gamma), and GridSearchCV tunes them jointly rather than one at a time.

3. Cross-Validation: Each candidate is scored with k-fold cross-validation (5 folds here), which estimates generalization to unseen data more reliably than a single split and reduces the risk of selecting overfitted settings.

4. Best Model Selection: GridSearchCV provides the best hyperparameter combination and its corresponding model, facilitating straightforward model selection.

5. Custom Scoring Metric: It allows optimization based on a user-defined scoring metric, enhancing flexibility in model performance evaluation.
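
As a minimal sketch of the points above, the snippet below runs GridSearchCV over a grid with more than one candidate per hyperparameter. It uses synthetic data from `make_classification` rather than the airline dataset, and the candidate values for `C` and `gamma` are illustrative assumptions, not values tuned in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the airline review dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Illustrative grid: 3 values of C x 2 values of gamma = 6 combinations
demo_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.1],
}

# Every combination is scored with 5-fold cross-validation on the training split
demo_search = GridSearchCV(
    estimator=SVC(kernel="rbf"),
    param_grid=demo_grid,
    cv=5,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

n_candidates = len(demo_search.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best parameters:", demo_search.best_params_)
print("Best mean CV accuracy:", round(demo_search.best_score_, 4))
print("Held-out accuracy:", round(demo_search.score(X_te, y_te), 4))
```

With several candidates per hyperparameter, `best_params_` reflects an actual choice among alternatives, and the held-out score gives a check that is independent of the folds used for selection.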

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Comparing the results shows no substantial difference in performance between the baseline SVM model and the tuned version. The metrics vary only minimally:

- Accuracy slightly decreases after tuning.
- Recall remains similar.
- Precision improves slightly.
- F1-Score remains consistent.
- ROC AUC Score is also similar.

Hyperparameter tuning did not significantly improve the SVM model. Because the grid contained only one configuration (RBF kernel, C=10, gamma='scale'), this result shows only that this particular setting is no better than the baseline. The baseline hyperparameters may already be well-suited to this dataset, but a wider grid would be needed to confirm that.
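
One way to make such a comparison explicit is to compute the same metrics for both models on the same test split and tabulate the differences. The sketch below does this on synthetic data. The helper `score_model` is hypothetical, the tuned setting mirrors the grid used in this notebook, and the resulting numbers are illustrative only, not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the airline review dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

def score_model(model, X_test, y_test):
    """Hypothetical helper: return the five metrics discussed above as a dict."""
    y_pred = model.predict(X_test)
    y_score = model.decision_function(X_test)  # continuous scores for ROC AUC
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "ROC AUC": roc_auc_score(y_test, y_score),
    }

# scikit-learn defaults for SVC: kernel='rbf', C=1.0, gamma='scale'
baseline = SVC().fit(X_tr, y_tr)
tuned = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)

comparison = pd.DataFrame({
    "SVM": score_model(baseline, X_te, y_te),
    "SVM tuned": score_model(tuned, X_te, y_te),
})
comparison["Change"] = comparison["SVM tuned"] - comparison["SVM"]
print(comparison.round(4))
```

A `Change` column close to zero across all rows corresponds to the "no significant improvement" outcome described above.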

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

I considered precision, recall, and ROC AUC score for assessing positive business impact. Precision is the share of positive predictions that are correct, so it matters most when false positives are costly. Recall is the share of actual positives that the model identifies, so it matters most when false negatives are costly. ROC AUC summarizes how well the model separates the two classes across all decision thresholds. Which metric to prioritize depends on the relative costs and consequences of false positives and false negatives, and striking a balance between precision and recall is often crucial for optimizing business outcomes.
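
To make these metrics concrete, the sketch below derives precision and recall from confusion-matrix counts and checks them against scikit-learn. The labels and scores are small made-up arrays for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth, hard predictions, and scores (illustrative only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.6, 0.3, 0.85, 0.05])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
auc = roc_auc_score(y_true, y_score)  # threshold-independent ranking quality

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, ROC AUC={auc:.2f}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```

Precision and recall depend on the chosen decision threshold, while ROC AUC is computed from the continuous scores and does not.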

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
# Display the evaluation metrics for all models in the evaluation_results DataFrame
print("Evaluation Metrics for All Models:")
print(evaluation_results)


Based on the evaluation metrics above, I chose the Support Vector Machine (SVM) without hyperparameter tuning as the final prediction model. It performs well in terms of accuracy, recall, precision, and ROC AUC score. Here's a summary of its performance:

- **Accuracy:** The accuracy score indicates the overall correctness of the model's predictions, and the SVM without tuning has a high accuracy score.

- **Recall:** Recall, also known as sensitivity or true positive rate, measures the model's ability to correctly identify positive instances. The SVM without tuning has a good recall score, indicating that it effectively identifies positive cases.

- **Precision:** Precision measures the model's ability to make positive predictions correctly. The SVM without tuning has a reasonable precision score, suggesting that it maintains a good balance between correctly identifying positive cases and minimizing false positives.

- **ROC AUC Score:** The ROC AUC score reflects the model's ability to distinguish between positive and negative instances. The SVM without tuning has a high ROC AUC score, indicating strong discriminative power.

In summary, the SVM without hyperparameter tuning appears to provide a good trade-off between accuracy, recall, precision, and ROC AUC score, suggesting that it's performing well in identifying positive instances while maintaining overall prediction quality. However, it's important to consider other factors such as model complexity and computational resources when choosing a model for deployment in practice.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In [None]:
# Fit a linear-kernel SVM for explainability: coef_ is only available when kernel='linear',
# so this is a separate model from the RBF-kernel SVM used in the grid search above
svm_model = SVC(C=1, kernel='linear', probability=True)
svm_model.fit(X_train, y_train)

# Get the coefficients of the support vector machine model
coef = svm_model.coef_[0]

# Get the feature names from your feature dataframe
feature_names = X_train.columns

# Create a DataFrame to display the coefficients and their corresponding features
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coef})

# Sort the DataFrame by coefficient magnitude (absolute value)
coef_df = coef_df.reindex(coef_df['Coefficient'].abs().sort_values(ascending=False).index)

# Plot the coefficients using Matplotlib
plt.figure(figsize=(10, 6))
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color='skyblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Feature Importance - SVM Coefficients')
plt.show()
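
Because `coef_` is only defined for a linear kernel, the coefficients above describe a linear SVM rather than an RBF model. A model-agnostic alternative is permutation importance, sketched below with scikit-learn's `permutation_importance` on synthetic data. The feature names (`feature_0`, `feature_1`, ...) are placeholders, and this is one possible approach under those assumptions, not the analysis performed in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with placeholder feature names (NOT the airline dataset)
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=1
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(X_arr.shape[1])])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# RBF-kernel SVM: exposes no coef_, so coefficients cannot be read off directly
rbf_svm = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in accuracy
result = permutation_importance(
    rbf_svm, X_te, y_te, scoring="accuracy", n_repeats=10, random_state=1
)

importance_df = pd.DataFrame({
    "Feature": X_te.columns,
    "Mean importance": result.importances_mean,
    "Std": result.importances_std,
}).sort_values("Mean importance", ascending=False)
print(importance_df)
```

A larger mean drop in accuracy indicates that the model relies more heavily on that feature. Values near zero (or slightly negative) suggest the feature contributes little on the held-out data.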


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**Exploratory Data Analysis (EDA) Insights:**
- Traveller Types and Cabin Class: Business travelers tend to rate cabin service higher than other traveler types, and Business Class passengers rate cabin service higher than Economy Class passengers.
- Seat Comfort and Travel Purpose: Solo Leisure travelers tend to rate seat comfort higher compared to Business travelers.
- Food and Beverage Ratings: Ratings of 2, 4, and 5 occur with roughly equal frequency for food and beverage, indicating a balanced distribution across these ratings.

**Machine Learning Models Evaluated:**
- Logistic Regression: Achieved an initial accuracy of 96%.
- K-Nearest Neighbours: Performed well with an accuracy of 95.97%, which decreased slightly to 95.92% after hyperparameter tuning.
- Support Vector Machine (SVM): Emerged as a strong contender with an accuracy of 96.57%.

**Feature Importance:**
- Key attributes like seat comfort, cabin service, and overall ratings consistently influenced model predictions, albeit to varying degrees.

**Final Model Selection:**
- Chose the Support Vector Machine (SVM) model as the final prediction model due to its 96.57% accuracy; hyperparameter tuning did not yield a significant improvement over it.
- SVM's ability to handle complex relationships and non-linear decision boundaries made it a strong choice.

**Feature Importance Analysis:**
- Explored feature importance through coefficient analysis, providing insights into attribute contributions.
- Acknowledged that SVM coefficients are only available for a linear kernel, so the coefficient analysis does not directly explain models with non-linear kernels such as RBF.

**Conclusion:**
- Project journey encompassed data exploration, model construction, and evaluation, resulting in a reliable airline passenger referral prediction model.
- Insights gained can guide the airline industry in understanding passenger preferences and enhancing services for improved customer satisfaction and referrals.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***