
# Market Segmentation For Airlines
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

In this problem, we'll see how clustering can be used to find similar groups of customers who belong to an airline's frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook "Data Mining for Business Intelligence," by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:


*   **Balance** = number of miles eligible for award travel
*   **QualMiles** = number of miles qualifying for TopFlight status
*   **BonusMiles** = number of miles earned from non-flight bonus transactions in the past 12 months
*   **BonusTrans** = number of non-flight bonus transactions in the past 12 months
*   **FlightMiles** = number of flight miles in the past 12 months
*   **FlightTrans** = number of flight transactions in the past 12 months
*   **DaysSinceEnroll** = number of days since enrolled in the frequent flyer program

## Problem 1.1 - Normalizing the Data

Read the dataset AirlinesCluster.csv into Python and call it "airlines".

Looking at the summary of airlines, which TWO variables have (on average) the smallest values? Which TWO variables have (on average) the largest values?

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

np.random.seed(42)


# Load Data
airlines = pd.read_csv("AirlinesCluster.csv")
print(airlines.describe())
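
One way to read the answer off the summary is to sort the column means. The sketch below uses a small made-up frame (the values are invented, not taken from AirlinesCluster.csv); with the real data, `airlines` would take the place of the hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical stand-in for the airlines frame; the numbers are invented.
toy = pd.DataFrame({
    "Balance": [28000, 19000, 41000],
    "BonusTrans": [1, 2, 4],
    "FlightTrans": [0, 0, 1],
    "BonusMiles": [170, 215, 4100],
})

# Sort the column means so the smallest and largest averages sit at the ends.
mean_by_column = toy.mean().sort_values()

smallest_two = list(mean_by_column.index[:2])
largest_two = list(mean_by_column.index[-2:])
print(smallest_two, largest_two)
```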

## Problem 1.2 - Normalizing the Data

In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?


*   If we don't normalize the data, the clustering algorithms will not work.
*   If we don't normalize the data, it will be hard to interpret the results of the clustering.
*   If we don't normalize the data, the clustering will be dominated by the variables that are on a larger scale.
*   If we don't normalize the data, the clustering will be dominated by the variables that are on a smaller scale.
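
To see how scale affects a distance-based method, the following sketch computes the Euclidean distance between two hypothetical customers and the share of the squared distance that each variable contributes. The two customers and their values are made up for illustration.

```python
import numpy as np

# Two hypothetical customers described by (Balance in miles, FlightTrans count).
a = np.array([50000.0, 1.0])
b = np.array([50500.0, 20.0])

# Raw Euclidean distance between the two customers.
raw_dist = np.linalg.norm(a - b)

# Fraction of the squared distance contributed by each variable.
share = (a - b) ** 2 / np.sum((a - b) ** 2)
print(raw_dist, share)
```

In this toy case a 500-mile difference in Balance accounts for nearly all of the distance, even though the difference of 19 flight transactions is arguably the more meaningful one.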




## Problem 1.3 - Normalizing the Data

Let's go ahead and normalize our data and create a normalized data frame called "airlines_norm".

If you look at the summary of airlines_norm, you should see that all of the variables now have mean zero.

In the normalized data, which variable has the largest maximum value? Which variable has the smallest minimum value?

In [None]:
# Normalize Data: z-score each column (subtract the mean, divide by the sample standard deviation)
airlines_norm = (airlines - airlines.mean()) / airlines.std()

In [None]:
list(zip(airlines.columns, airlines_norm.max(axis=0)))

In [None]:
list(zip(airlines.columns, airlines_norm.min(axis=0)))
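
The notebook imports StandardScaler but normalizes with pandas arithmetic. The two approaches are close but not identical: pandas `.std()` defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A minimal sketch on made-up data, assuming scikit-learn is installed:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for two airline variables.
toy = pd.DataFrame({"Balance": [100.0, 200.0, 300.0, 400.0],
                    "FlightTrans": [0.0, 1.0, 1.0, 6.0]})

# pandas: .std() defaults to the sample standard deviation (ddof=1).
z_pandas = (toy - toy.mean()) / toy.std()

# scikit-learn: StandardScaler divides by the population standard deviation (ddof=0).
z_sklearn = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Both versions have mean zero; they differ only by the factor sqrt((n - 1) / n).
n = len(toy)
means_are_zero = np.allclose(z_pandas.mean(), 0.0)
same_up_to_factor = np.allclose(z_pandas, z_sklearn * np.sqrt((n - 1) / n))
print(means_are_zero, same_up_to_factor)
```

With 3,999 rows the factor is very close to 1, so either choice should lead to essentially the same clustering.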

## Problem 2.1 - Hierarchical Clustering

Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is a potential choice for the number of clusters? Select all that apply.
*   2
*   3
*   6
*   7



In [None]:
# Hierarchical Clustering
# Condensed vector of pairwise Euclidean distances between customers
airline_dist = pdist(airlines_norm, metric="euclidean")
# Round the distances to 7 decimal places
airline_dist = np.round(airline_dist, 7)

# Ward linkage on the condensed distance vector
linkage_matrix = linkage(airline_dist, method="ward")
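
The reason this step is slow is the size of the distance vector. `pdist` returns a condensed vector with one entry per unordered pair of observations, and `squareform` can expand it to the full symmetric matrix. A small sketch on random data (the six points are made up) shows the relationship and the resulting size for 3,999 customers:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 7))  # 6 made-up observations, 7 variables

condensed = pdist(points, metric="euclidean")
n = points.shape[0]

# pdist stores each unordered pair once: n * (n - 1) / 2 entries.
print(len(condensed) == n * (n - 1) // 2)

# squareform expands the condensed vector into the full symmetric n x n matrix.
square = squareform(condensed)
print(square.shape)

# Number of pairwise distances for the 3,999 customers in this dataset.
n_airlines = 3999
pairs = n_airlines * (n_airlines - 1) // 2
print(pairs, f"{pairs * 8 / 1e6:.0f} MB as float64")
```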

In [None]:
# Plot Dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()
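
Reading a dendrogram with thousands of leaves by eye can be hard. As a rough complement, the merge heights stored in the third column of the linkage matrix can be inspected directly: a large drop between successive heights marks a place where a horizontal cut has room. The helper `suggest_k_from_gaps` below is a hypothetical illustration on synthetic blobs, not part of SciPy, and it reports only the single largest gap, whereas the question above allows several valid choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def suggest_k_from_gaps(Z, max_k=10):
    """Return the cluster count at the largest drop between successive merge heights.

    Hypothetical helper for illustration; Z is a SciPy linkage matrix.
    """
    # Heights of the last merges, largest first: heights[0] joins 2 clusters into 1.
    heights = Z[:, 2][::-1][:max_k]
    # gaps[i] is the room for a cut that leaves (i + 2) clusters.
    gaps = heights[:-1] - heights[1:]
    return int(np.argmax(gaps)) + 2

# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
blobs = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

Z = linkage(blobs, method="ward")
suggested_k = suggest_k_from_gaps(Z)
print(suggested_k)
```

On the airline data the same function could be called on `linkage_matrix`, though the dendrogram remains the reference for the question.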

## Problem 2.2 - Hierarchical Clustering

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the fcluster function. How many data points are in Cluster 1?

In [None]:
# Cut Tree into Clusters
cluster_groups = fcluster(linkage_matrix, 5, criterion='maxclust')
print(pd.Series(cluster_groups).value_counts())
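
Two details matter when reading this output: `fcluster` numbers clusters starting at 1, and `value_counts()` lists clusters by size rather than by label. A minimal sketch on synthetic data (two made-up groups of 4 and 9 points) shows how `sort_index()` puts the counts in label order:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in: two tight groups of different sizes.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(4, 2)),
                  rng.normal(5.0, 0.1, size=(9, 2))])

Z = linkage(data, method="ward")
labels = fcluster(Z, 2, criterion="maxclust")  # labels start at 1, not 0

# value_counts() orders by size; sort_index() orders by cluster label instead.
by_size = pd.Series(labels).value_counts()
by_label = by_size.sort_index()
print(by_label)
```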

## Problem 2.3 - Hierarchical Clustering

Now, compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret.

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)?

In [None]:
# Compute Cluster Means
# Work on a copy so the original airlines frame keeps only the seven variables
cluster_df = airlines.copy()
cluster_df["Cluster"] = cluster_groups
cluster_means = cluster_df.groupby("Cluster").mean()
print(cluster_means)
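
To answer the "largest average values" questions without scanning the table by eye, `idxmax` reports, for each variable, the cluster whose mean is largest (and `idxmin` the smallest). The table below is invented for illustration and does not reproduce the real cluster means.

```python
import pandas as pd

# Invented cluster means; rows are clusters, columns are variables.
toy_means = pd.DataFrame(
    {"Balance": [58000.0, 110000.0, 198000.0],
     "FlightTrans": [0.3, 15.6, 1.1],
     "DaysSinceEnroll": [6200.0, 2800.0, 5800.0]},
    index=pd.Index([1, 2, 3], name="Cluster"),
)

# For each variable, the cluster label whose mean is largest.
largest_in = toy_means.idxmax()
print(largest_in)

# Variables in which cluster 1 has the largest mean.
cluster_1_leads = list(largest_in[largest_in == 1].index)
print(cluster_1_leads)
```

With the real data, `cluster_means.idxmax()` would give the corresponding answer for each of the seven variables.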


How would you describe the customers in Cluster 1?
*   Relatively new customers who don't use the airline very often.
*   Medium-frequency travelers who primarily take elite-status flights.
*   Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
*   Relatively loyal customers who have not traveled that much.
*   Loyal customers who have accumulated a lot of points and awards to be redeemed through both flight and non-flight transactions.

## Problem 2.4 - Hierarchical Clustering

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)?

How would you describe the customers in Cluster 2?
*   Relatively new customers who don't use the airline very often.
*   Medium-frequency travelers who primarily take elite-status flights.
*   Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
*   Relatively loyal customers who have not traveled that much.
*   Loyal customers who have accumulated a lot of points and awards to be redeemed through both flight and non-flight transactions.

## Problem 2.5 - Hierarchical Clustering

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)?

How would you describe the customers in Cluster 3?
*   Relatively new customers who don't use the airline very often.
*   Medium-frequency travelers who primarily take elite-status flights.
*   Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
*   Relatively loyal customers who have not traveled that much.
*   Loyal customers who have accumulated a lot of points and awards to be redeemed through both flight and non-flight transactions.

## Problem 2.6 - Hierarchical Clustering

How would you describe the customers in Cluster 4?
*   Relatively new customers who don't use the airline very often.
*   Medium-frequency travelers who primarily take elite-status flights.
*   Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
*   Relatively loyal customers who have not traveled that much.
*   Loyal customers who have accumulated a lot of points and awards to be redeemed through both flight and non-flight transactions.

## Problem 2.7 - Hierarchical Clustering

Compared to the other clusters, Cluster 5 has the smallest average values in which variables (if any)?

How would you describe the customers in Cluster 5?
*   Relatively new customers who don't use the airline very often.
*   Medium-frequency travelers who primarily take elite-status flights.
*   Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
*   Relatively loyal customers who have not traveled that much.
*   Loyal customers who have accumulated a lot of points and awards to be redeemed through both flight and non-flight transactions.

## Problem 3.1 - K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument max_iter to 1000.

How many clusters have more than 1,000 observations?

In [None]:
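# K-Means Clustering
# Seed NumPy and pass the same seed as random_state so the initialization is reproducible
np.random.seed(88)
kmeans = KMeans(n_clusters=5, max_iter=1000, random_state=88)
kmeans.fit(airlines_norm)
# Number of observations assigned to each cluster
print(pd.Series(kmeans.labels_).value_counts())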
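
The count of clusters above a size threshold can be computed directly from the labels. The sketch below runs k-means on synthetic data (three made-up groups of 1,200, 1,100, and 50 points) rather than on the airline data; `n_init=10` is set explicitly here as an assumption, since the default differs across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in: three well-separated groups of 1200, 1100 and 50 points.
rng = np.random.default_rng(88)
sizes = [1200, 1100, 50]
centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
data = np.vstack([rng.normal(c, 0.5, size=(n, 2)) for c, n in zip(centers, sizes)])

toy_kmeans = KMeans(n_clusters=3, n_init=10, max_iter=1000, random_state=88).fit(data)

# Cluster sizes, then the number of clusters above the threshold.
cluster_sizes = pd.Series(toy_kmeans.labels_).value_counts()
n_large = int((cluster_sizes > 1000).sum())
print(n_large)
```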

## Problem 3.2 - K-Means Clustering

Now, compare the cluster centroids to each other either by dividing the data points into groups or by looking at the output of cluster_centers_. (Note that the output of cluster_centers_ will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to look at the original data like we did for hierarchical clustering.)

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
*   Yes, because the clusters are displayed in order of size, so the largest cluster will always be first.
*   Yes, because the clusters are displayed according to the properties of the centroid, so the cluster order will be similar.
*   No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
*   No, because the clusters produced by the k-means algorithm will never be similar to the clusters produced by the Hierarchical algorithm.
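
Because cluster numbers are arbitrary labels, two solutions are better compared with tools that ignore the numbering. The sketch below uses two hypothetical labelings of six customers, a contingency table from `pd.crosstab`, and the adjusted Rand index from scikit-learn:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Two hypothetical labelings of the same six customers:
# identical groups, different numbering.
hier_labels = [1, 1, 2, 2, 3, 3]
kmeans_labels = [2, 2, 0, 0, 1, 1]

# The contingency table shows which label in one solution matches which in the other.
table = pd.crosstab(pd.Series(hier_labels, name="hierarchical"),
                    pd.Series(kmeans_labels, name="kmeans"))
print(table)

# The adjusted Rand index ignores label numbering; 1.0 means identical partitions.
ari = adjusted_rand_score(hier_labels, kmeans_labels)
print(ari)
```

On the airline data, the same calls could be applied to `cluster_groups` and `kmeans.labels_` to see how far the two solutions agree.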


In [None]:
# K-Means Cluster Centers
print(pd.DataFrame(kmeans.cluster_centers_, columns=airlines.columns))
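
To read the centroids in original units, the z-score can be undone column by column, or the original data can be averaged within each cluster. The sketch below uses made-up data and labels to show that the two routes agree when each centroid is the mean of its assigned points, which is the assumption made here for a converged k-means fit.

```python
import numpy as np
import pandas as pd

# Made-up raw data and cluster labels standing in for airlines and kmeans.labels_.
raw = pd.DataFrame({"Balance": [100.0, 300.0, 5000.0, 7000.0, 200.0],
                    "FlightTrans": [0.0, 1.0, 8.0, 12.0, 2.0]})
labels = np.array([0, 0, 1, 1, 0])

mean, std = raw.mean(), raw.std()
norm = (raw - mean) / std

# Centroids in normalized units (the scale cluster_centers_ is reported in).
centers_norm = norm.groupby(labels).mean()

# Undo the z-score column by column to get centroids in original units.
centers_raw = centers_norm * std + mean

# Equivalent route: average the original data within each cluster.
direct = raw.groupby(labels).mean()
routes_agree = np.allclose(centers_raw, direct)
print(routes_agree)
```

With the fitted model, `airlines.groupby(kmeans.labels_).mean()` would give the per-cluster averages of the unnormalized variables.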