## AIRLINE PASSENGER SATISFACTION PREDICTION - ASSIGNMENT 1

In this first assignment, we will work with tabular data in order to build a simple Multilayer Perceptron (MLP). In particular, we will *predict the level of satisfaction of airline passengers*. The tasks to be implemented are split into three main sections:

* Exploratory Data Analysis (EDA)
* Data Preprocessing
* MLP Implementation

In the first part of the assignment we will perform what is commonly known as an exploratory data analysis (EDA), in which we aim to gain some knowledge about the dataset and understand the nature of the data we're dealing with. We will get familiar with **Pandas**, a really useful **Python** package for working with tabular data. This exploratory analysis will tell us more about the distribution of our data and help us find anomalies that can potentially affect the training of our AI model.

Then, we will develop a feature engineering pipeline to preprocess our data. Here, we can decide how we want to deal with the anomalies previously found and modify/transform/prepare our data before it is used for training (for instance, it is quite common to include some feature standardization). This step usually also includes splitting the data into train, validation and test sets.

Finally, the MLP will be implemented with **TensorFlow**, one of the most commonly used Python packages when it comes to deep learning (DL) architectures. We will code our model from scratch and see how to implement the basic steps to train and test it using a proper data split. This is essential in order to have a "fair" evaluation of the methodology used in our project.

But enough theory, let's get our hands dirty!

**NOTE**: Throughout the different tasks in this assignment, you will find some questions marked as **Q**. These questions should be answered at the end of the Notebook (there is a Markdown cell prepared for this purpose).

### Exploratory Data Analysis (EDA)

The first thing you'll need to do is load the dataset from Kaggle. To do that, we need to add the dataset to our notebook. This can be done by clicking on *Add data* in the top-right corner of the notebook and searching for **Airline Passenger Satisfaction**. Once the dataset has been added (it should appear in our notebook's input, again in the top-right corner), we can start. If you want to learn more about the information contained in this dataset, check https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction.

In [None]:
# Importing the right packages in our Python 3 environment. In Kaggle, most of the well-known packages are already installed.

import numpy as np
import pandas as pd

from tensorflow.keras import callbacks, optimizers
from tensorflow.keras.layers import BatchNormalization, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

import seaborn as sns
from matplotlib import pyplot as plt

 **Task 1** 
 
Read the training data using read_csv and the right path. "Display" the dataframe (with head()) to get an overview of the columns (features) that are included in the dataset.
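
As a warm-up, here is a minimal sketch of how read_csv and head() behave on a tiny in-memory table. The CSV text and the name `toy_df` are made up for illustration; the sketch assumes, as `index_col=0` in the cell below suggests, that the first column of the file is a row index.

```python
import io

import pandas as pd

# Toy CSV text standing in for a real file (made-up values).
# The first, unnamed column plays the role of a row index.
csv_text = """,id,Gender,Age
0,101,Male,34
1,102,Female,51
2,103,Female,27
"""

# index_col=0 tells pandas to use the first column as the index
# instead of treating it as a feature.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy_df.head(2))  # first two rows only
print(toy_df.shape)    # (3, 3): three rows, three feature columns
```

With a real file, the only change is passing the file path instead of the `io.StringIO` object.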

In [None]:
# Hint: Input data files are available in the read-only "../input/" directory

train_df = pd.read_csv(, index_col=0) # Type your solution
train_df. # Type your solution

**Task 2** 

It is always important to get familiar with our data in order to make the proper decisions. To begin with, let's see the type of data we are working with (use info()).

In [None]:
train_df. # Type your solution

**Task 3** 

We can also focus on some of the features to get an initial idea of the distribution of our dataset. For instance, let's check how many instances we have for the column "satisfaction".
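
The counting idea can be sketched on a toy array first. The labels below are made up; `return_counts=True` is the NumPy option that returns how often each unique value occurs.

```python
import numpy as np

# Made-up labels, unrelated to the assignment data.
labels = np.array(["cat", "dog", "cat", "bird", "cat", "dog"])

# np.unique sorts the unique values; return_counts=True adds their frequencies.
values, counts = np.unique(labels, return_counts=True)

summary = {str(v): int(c) for v, c in zip(values, counts)}
print(summary)  # {'bird': 1, 'cat': 3, 'dog': 2}
```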

In [None]:
# Hint: use np.unique with return_counts=True.

unique, count = np.unique(, ) # type your solution
print("The number of occurrences of each class in the dataset = %s " % dict(zip(unique, count)), "\n")

**Task 4** 

As mentioned before, anomalies are an important aspect when dealing with tabular data. Missing values are among the most common, and one of the first choices we might need to make is how to deal with them. There are different ways of doing so: mean/median imputation, predicting the missing value from other features, dropping the affected samples... Check the number of missing values for every feature in our data!
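
To see what such a check returns, here is a small sketch on a made-up dataframe (the column names are hypothetical). isna() marks each missing cell as True, and sum() then counts the True values per column.

```python
import numpy as np
import pandas as pd

# Made-up table with a few missing entries.
toy_df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "delay": [np.nan, np.nan, 5.0, 0.0],
    "gender": ["F", "M", "F", "M"],
})

# Boolean mask of missing cells, summed column by column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)  # age: 1, delay: 2, gender: 0
```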

In [None]:
# Hint: you can use isna() and sum().

train_df. # type your solution

**Task 5** 

To make the following tasks simpler, we will convert the "satisfaction" column from string to integer.
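
The sketch below shows how .map() with a dictionary behaves on a made-up Series. One detail worth knowing: any value that is not a key of the dictionary becomes NaN, so a typo in a key silently creates missing values.

```python
import pandas as pd

# Made-up categories, unrelated to the assignment data.
sizes = pd.Series(["small", "large", "small", "medium"])

# "medium" is deliberately left out of the mapping.
size_codes = sizes.map({"small": 0, "large": 1})

print(size_codes.tolist())      # [0.0, 1.0, 0.0, nan]
print(size_codes.isna().sum())  # 1 unmapped value
```

Checking isna().sum() right after a mapping is a cheap way to confirm that every category was covered.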

In [None]:
# Hint: Use the .map() function and change 'neutral or dissatisfied' -> 0 and 'satisfied' -> 1
print("Values before conversion: %s" % list(train_df['satisfaction'][:5]))

train_df['satisfaction'] = train_df['satisfaction']. # Type your solution

print("Values after conversion: %s" % list(train_df['satisfaction'][:5]))

**Task 6**

Let's visualize some feature distributions! Such visualizations can give us an idea of which features are most relevant. Show the "Gender" distribution in terms of "satisfaction".

In [None]:
# Hint: you can use seaborn barplot with variables "Gender" and "satisfaction". 

sns.barplot(, )  # Type your solution

**Task 7** 

Continuing with our visualization approach, use a barplot once again to take a look at the passenger class distribution.

**Q1**: Explain what this plot is reporting. Is there any class that has a higher chance of being satisfied? And they say money can't buy everything...

In [None]:
# Hint: use seaborn barplot with the right axis - 0 or 1?

f,ax=plt.subplots(1,2,figsize=(18,8))
train_df['Class'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Customers By Class')
ax[0].set_ylabel('Count')

f = sns.barplot(, ) # Type your solution here

ax[1].set_title('different sample sizes')
plt.show()

**Task 8** 

Another important part of an EDA on tabular data is checking the correlation between features. If two features are positively correlated, an increase (or decrease) in one tends to come with an increase (or decrease) in the other; if they are negatively correlated, they move in opposite directions. Strongly correlated features carry similar information, so one adds little beyond what the other already provides. This is known as multicollinearity. In essence, one of the features can be deemed "redundant". One way to deal with it is to drop one of the redundant features, which, among other advantages, reduces training time. Another option is to reduce the feature space by means of an autoencoder (which you'll hear about in the theoretical lectures) or a dimensionality reduction technique such as principal component analysis (PCA). In this case, we will only plot a correlation heatmap for educational purposes.
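
To make the idea of redundancy concrete, the sketch below builds a made-up table in which two columns are exact linear functions of a third, and inspects the matrix returned by .corr(). All names and values are illustrative only.

```python
import pandas as pd

# Made-up numeric table: double_x and neg_x are fully determined by x.
toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],  # moves together with x
    "neg_x": [5, 4, 3, 2, 1],      # moves opposite to x
    "other": [3, 1, 4, 1, 5],      # no exact relation to x
})

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = toy_df.corr()
print(corr.round(2))

# A correlation of +1 or -1 means one column adds no new information.
print(corr.loc["x", "double_x"])  # 1.0 (up to rounding)
print(corr.loc["x", "neg_x"])     # -1.0 (up to rounding)
```

The resulting square matrix is exactly the kind of object a heatmap function can take as input.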

**Q2**: Which features display the largest positive and negative correlations with the variable "satisfaction"? Why are not all features included? Report the plot.

In [None]:
# Hint: use seaborn for plotting the heatmap. The correlation matrix of our data can be obtained with the .corr() method.

sns.heatmap(, annot=True, cmap='RdYlGn',linewidths=0.2, fmt='.2f') # Type your solution here
fig=plt.gcf()
fig.set_size_inches(14,10)
plt.show()

### Data Preprocessing

Thanks to the EDA, we were able to find two main issues (a more exhaustive analysis could have revealed other anomalies):
* Our data contains some missing values.
* There are columns that are irrelevant for our project.

Now, it is time to deal with them! 

Remember that, in this preprocessing step, it is a common practice to apply some feature normalization/standardization. After doing this, we will have our data prepared for training.

**Task 9**

We will drop the irrelevant variables first, such as "id", and then ALL the rows with empty values (NA). Define a variable *train_y* containing the outcome ("satisfaction"). In addition, define a variable *train_x* removing the "satisfaction" column, since it has been defined as our dependent variable.
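
The same sequence of steps can be sketched on a made-up table. The column names `row_id`, `feature` and `target` are hypothetical stand-ins, not the columns of the assignment data.

```python
import numpy as np
import pandas as pd

# Made-up table with an identifier column and one missing value.
toy_df = pd.DataFrame({
    "row_id": [10, 11, 12, 13],
    "feature": [0.5, np.nan, 1.5, 2.0],
    "target": [0, 1, 1, 0],
})

toy_df = toy_df.drop(columns=["row_id"])  # remove a column that carries no signal
toy_df = toy_df.dropna()                  # remove every row with at least one NA

toy_y = toy_df["target"]                  # dependent variable
toy_x = toy_df.drop(columns=["target"])   # everything else

print(toy_x.shape, toy_y.shape)  # (3, 1) (3,)
```

Dropping rows before separating features and labels keeps the two aligned, since both are taken from the same filtered dataframe.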

In [None]:
# Hint: use dropna()

train_df = train_df # Type your solution here (drop)
train_df = train_df # type your solution here (dropna)

train_y =  # type your solution here
train_x =  # type your solution here

**Task 10** 

As you might have noticed, the dataset includes several non-numerical (categorical) variables. In order to process them, we will need to transform them into "dummy variables" (we already did something similar with "satisfaction"). In this case, convert the non-numerical variables to dummies with get_dummies. Another suitable option is one-hot encoding from sklearn. Feel free to experiment with it and to find out more about the differences between them!
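
A small sketch of what get_dummies produces, using a made-up table with one numeric and one categorical column (names and values are illustrative):

```python
import pandas as pd

# Made-up features: one numeric, one categorical.
toy_x = pd.DataFrame({
    "age": [25, 40, 31],
    "travel": ["business", "personal", "business"],
})

# Numeric columns are kept as they are; each category of a
# non-numeric column becomes its own indicator column.
toy_encoded = pd.get_dummies(toy_x)

print(toy_encoded.columns.tolist())
# ['age', 'travel_business', 'travel_personal']
```

Note that the set of new columns depends on which categories appear in the data being encoded.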

In [None]:
# Hint: use pandas get_dummies()

train_x =  # Type your solution here
train_x.head()

**Task 11** 

Standardize the dataset. First, create the scaler object using StandardScaler() and then transform the data.
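
To see what standardization does numerically, the sketch below scales a made-up array and compares the result with the formula z = (x - mean) / std computed by hand. The array values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
toy_x = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

toy_scaler = StandardScaler()
toy_x_scaled = toy_scaler.fit_transform(toy_x)  # learn mean/std, then apply them

# Same computation by hand (population standard deviation, ddof=0).
manual = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0)

print(toy_x_scaled.mean(axis=0))          # approximately [0, 0]
print(toy_x_scaled.std(axis=0))           # approximately [1, 1]
print(np.allclose(toy_x_scaled, manual))  # True
```

After scaling, both features contribute on a comparable scale regardless of their original units.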

In [None]:
# Hint: Use .fit_transform() method to first fit to data and then transform it.

scaler =  # Type your solution here

train_x_scaled =  # Type your solution here

train_x_scaled = pd.DataFrame(train_x_scaled, columns=train_x.columns) # For visualization purposes, we convert the data to DataFrame
train_x_scaled.head()

**Task 12** 

The final step is to split the data into train and test subsets to get an unbiased evaluation of our model. In essence, we first use the training set to "teach" the model and make it learn patterns from the data. The test set is kept apart and not used until the model has been fully trained and we want to see the model's performance on "unseen data". Think about it this way: you have your own business and plan to deploy a model in the real world, so how do you evaluate how good the model is on "new data" that you did not have when you trained it? That is where the test set comes in handy!

This step can be quite tedious as you need to ensure your test set is representative enough of the actual problem to solve. It is important to consider the distribution of your data, the proportion of labels, the split percentage between train and test...

Luckily for us, our dataset is already split! So, once we've encoded our non-numerical variables and scaled the features for our training data, we just need to repeat the same procedure on the test set.
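
The point of reusing the training scaler can be sketched with made-up numbers: the scaler learns its mean and standard deviation from the training data only, and the test data is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up one-feature datasets.
toy_train = np.array([[0.0], [10.0]])  # mean 5, standard deviation 5
toy_test = np.array([[5.0], [20.0]])

toy_scaler = StandardScaler().fit(toy_train)     # statistics come from train only
toy_test_scaled = toy_scaler.transform(toy_test)  # no refitting on the test set

print(toy_scaler.mean_, toy_scaler.scale_)  # [5.] [5.]
print(toy_test_scaled.ravel())              # [0. 3.]
```

The scaled test values are not centered at zero, and that is expected: they are expressed relative to the training distribution, which is what the model saw during training.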

In [None]:
# Hint: To scale the test set, we'll reuse the previous scaler so that the same conversion used for the training set is applied. Use the .transform() method

# Read data
test_df = pd.read_csv(, index_col=0) # type your solution here

# Convert "satisfaction" column from string to integer
test_df['satisfaction'] =  # type your solution here

# Drop irrelevant variables and NA values
test_df =  # type your solution here (drop)
test_df =  # type your solution here (dropna)

# Define variables test_x and test_y
test_y =  # type your solution here
test_x =  # type your solution here

# Convert non-numerical (categorical) variables into dummy variables.
test_x =  # type your solution here

# Standardize the test set
test_x_scaled = # Type your solution here

test_x_scaled = pd.DataFrame(test_x_scaled, columns=test_x.columns) # For visualization purposes, we convert the data to DataFrame

print("Shape of the training set:", train_x_scaled.shape)
print("Shape of the training labels:", train_y.shape)
print("Shape of the testing set:", test_x_scaled.shape)
print("Shape of the testing labels:", test_y.shape)

### MLP Implementation

Time to create our classifier! We will design a very simple MLP classifier with three hidden layers. Keep in mind that designing a classifier takes time; good practice is to either test several configurations or select an existing architecture such as ResNet or EfficientNet.

**Task 13**

Design a classifier with 3 hidden layers of 25, 50 and 100 neurons, respectively, with ReLU as activation function. Remember to indicate the right input dimension in the first layer. What about the last layer? How many classes are we predicting? What activation function should we choose considering that we are aiming for a classification problem?
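
As a hedged illustration of the building blocks, the sketch below one-hot encodes made-up labels with to_categorical and stacks two hidden Dense layers on a declared input. The sizes (4 features, 8 and 3 neurons) are hypothetical and deliberately different from the ones requested above, and the output layer is left out on purpose.

```python
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Made-up integer labels for a hypothetical 2-class problem.
toy_labels = np.array([0, 1, 1, 0])
toy_labels_cat = to_categorical(toy_labels)
print(toy_labels_cat.shape)  # (4, 2): one column per class

# Hypothetical sizes, chosen only for this illustration.
n_features = 4
toy_model = Sequential()
toy_model.add(Input(shape=(n_features,)))   # declares the input dimension
toy_model.add(Dense(8, activation="relu"))  # 4*8 weights + 8 biases = 40 parameters
toy_model.add(Dense(3, activation="relu"))  # 8*3 weights + 3 biases = 27 parameters
# No output layer here: choosing its size and activation is part of the task.

print(toy_model.output_shape)    # (None, 3)
print(toy_model.count_params())  # 67
```

The parameter count of each Dense layer depends on the size of the layer before it, which is why the input dimension has to be stated for the first layer.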

In [None]:
# Before creating the model, transform your training labels to categorical with the to_categorical() function.
train_y_cat = # type your solution

def mlp_model():
    
    mlp = Sequential()
    mlp.add() # Type your solution
    mlp.add() # Type your solution
    mlp.add() # Type your solution
    mlp.add() # Type your solution
    
    mlp.summary()
    
    return mlp

**Task 14** 

Once we have defined our model, we will proceed to train it! The first thing we need to do is define the optimizer. There are several options available (feel free to explore them: https://keras.io/api/optimizers/), but for now we'll use Stochastic Gradient Descent (SGD) with a fixed learning rate of 0.001. Next, we compile the model and choose our loss function (which loss function is the most suitable for this task?) and metrics. Finally, we fit the model. Here, one needs to decide the batch size, the number of epochs and the callbacks (if any). We'll use a batch size of 128 and 200 epochs (it may take around 4-5 min). In addition, we will hold out a validation set (validation_split = 0.2), which will allow us to closely monitor the loss and accuracy during training. This gives us an initial idea of how the model performs on an independent set.
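
To build some intuition for what the optimizer does with the learning rate, batch size and number of epochs, here is a minimal NumPy sketch of mini-batch SGD on a made-up linear regression problem. It is not what Keras runs internally for our MLP; the data, the squared-error loss and the hyperparameter values are assumptions chosen so that the toy problem converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, noiseless regression data: y = X @ true_w.
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w = np.zeros(3)       # parameters to learn
learning_rate = 0.05  # step size (toy value, not the assignment's 0.001)
batch_size = 32
n_epochs = 100

for epoch in range(n_epochs):
    order = rng.permutation(len(X))  # reshuffle the samples every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]
        # Gradient of the mean squared error on this mini-batch only.
        grad = 2 * X[idx].T @ error / len(idx)
        w -= learning_rate * grad    # the SGD update rule

print(np.round(w, 3))  # should be close to true_w
```

One epoch is one full pass over the data, so the number of parameter updates is the number of epochs times the number of mini-batches per epoch.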

**Q3**: In this case, we've used accuracy as a metric to evaluate our ML algorithm. Do you think it is a suitable one? Why? Is there another metric that would be a better choice?

In [None]:
mlp = mlp_model()
optimizer =  # Type your solution
mlp.compile(optimizer=optimizer, loss="", metrics=["accuracy"]) # Type your solution. Compile the model!
history = mlp.fit() # Type your solution

**Task 15** 

Plot accuracy and loss curves.

**Q4**: After seeing the curves - do you think it is necessary to train for 200 epochs? Is there any way to automatically "control" for how long we train without the need to train for the initially specified number of epochs?

In [None]:
f,ax=plt.subplots(1,2,figsize=(18,8))

ax[0].plot() # Type your solution
ax[0].plot() # Type your solution
ax[0].set_title('model accuracy')
ax[0].set_ylabel('accuracy')
ax[0].set_xlabel('epoch')
ax[0].legend(['train', 'val'], loc='upper left')

ax[1].plot() # Type your solution
ax[1].plot() # Type your solution
ax[1].set_title('model loss')
ax[1].set_ylabel('loss')
ax[1].set_xlabel('epoch')
ax[1].legend(['train', 'val'], loc='upper left')

### Predictions on test set

Once the model has been trained, we can make predictions on our test set.

**Task 16**

Predict the class probability for the test set data. Convert those probabilities into a class prediction by taking the argmax() of them.
Compute the accuracy for your test set and plot a confusion matrix.
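
The post-processing steps can be sketched on made-up probabilities. The arrays `toy_probs` and `toy_true` are invented for illustration and do not come from any trained model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up class probabilities for 4 samples and 2 classes (each row sums to 1).
toy_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
toy_true = np.array([0, 1, 1, 1])

# axis=1 picks, for every row, the column (class) with the highest probability.
toy_pred = np.argmax(toy_probs, axis=1)
print(toy_pred)  # [0 1 0 1]

toy_acc = accuracy_score(toy_true, toy_pred)
print(toy_acc)   # 0.75

# In scikit-learn, rows are true classes and columns are predicted classes.
toy_conf = confusion_matrix(toy_true, toy_pred)
print(toy_conf)
```

Both metric functions expect integer class labels, so labels that were one-hot encoded for training need to be brought back to integers before the comparison.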

**Q5**: What is this confusion matrix reporting?

In [None]:
predictions = mlp.predict() # Type your solution

# Hint - obtain argmax of predictions
pred_argmax = np.argmax() # Type your solution

# Obtain accuracy with argmax
acc_test = accuracy_score() # Type your solution
# Obtain confusion matrix with argmax
conf_mat = confusion_matrix() # Type your solution

print("Accuracy test:", acc_test)
print("Confusion matrix:", conf_mat)

### QUESTIONS

**Q1** (Task 7): 

**Q2** (Task 8): 

**Q3** (Task 14): 

**Q4** (Task 15): 

**Q5** (Task 16): 