# Group 20 — Exploratory Data Analysis
This notebook performs an initial exploratory data analysis (EDA) on the provided customer and flights databases.
The goals are:
- Inspect imports and data quality
- Identify missing or strange values
- Preprocess data for downstream modeling

## Table of contents
- [Import data](#import-data)
- [Data Exploration](#data-exploration)
  - [Customer DB](#customer-db)
    - [Inspect import](#inspect-import-customer)
    - [Check categorical values](#check-cat-values-customer)
    - [Check Outliers](#check-outliers)
  - [Flights DB](#flights-db)
    - [Inspect import](#inspect-import-flights)
    - [Check Outliers](#check-outliers-flight)
- [Preprocessing](#preprocessing)
  - [Missing Values](#missing-values)
  - [Convert data types](#convert-data-types)

# <a id="import-data"></a> Import data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns
from scipy import stats

# Load the data

customer_db = pd.read_csv("data/DM_AIAI_CustomerDB.csv", index_col=0 )
flights_db = pd.read_csv("data/DM_AIAI_FlightsDB.csv")


# <a id="data-exploration"></a> Data Exploration

### <a id="customer-db"></a> Customer DB

#### <a id="inspect-import-customer"></a> Inspect import

In [None]:
customer_db.head()

In [None]:
customer_db.info()

In [None]:
duplicated_loyalty_ids = customer_db[customer_db['Loyalty#'].duplicated()]['Loyalty#'].unique()
print(f"Number of unique Duplicated Loyalty IDs: {len(duplicated_loyalty_ids)}")

In [None]:
sns.boxplot(x='LoyaltyStatus', y='Income', data=customer_db.dropna(subset=['Income']))
plt.title('Income by Loyalty Tier')
plt.xlabel('Loyalty Tier')
plt.ylabel('Income')
plt.show()

In [None]:
customer_db.groupby('LoyaltyStatus')['Income'].agg(['count', 'mean', 'median'])

In [None]:

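# One-way ANOVA: does mean Income differ across Education levels?
groups = [group["Income"].dropna() for _, group in customer_db.groupby("Education")]
groups = [g for g in groups if len(g) > 0]  # skip levels with no recorded Income
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA F-statistic: {f_stat:.3f}, p-value: {p_val:.4f}")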

#### <a id="check-cat-values-customer"></a> Check categorical values

In [None]:
categorical_cols = [
     'Province or State', 'City', 'Gender', 'Education',
    'Location Code', 'Marital Status', 'LoyaltyStatus', 'EnrollmentType'
]

# Create the figure and axes
fig, axes = plt.subplots(
    nrows=math.ceil(len(categorical_cols) / 3),
    ncols=3,
    figsize=(15, 15)
)
# TODO: remove Country subplot

# Generate a plot for each categorical column
for ax, col in zip(axes.flatten(), categorical_cols):
    customer_db[col].value_counts().plot(
        kind='barh',
        ax=ax,
        title=f'Distribution of {col}'
    )
    ax.set_xlabel("Count")
    ax.set_ylabel("")
    ax.invert_yaxis()

plt.tight_layout()
plt.show()

### Scatter Plot for the customer dataset


In [None]:
numeric_features = ['Income', 'Customer Lifetime Value']

customer_db_reset = customer_db.reset_index()

sns.scatterplot(
    data=customer_db_reset,
    x='Income',        
    y='Customer Lifetime Value',
    alpha=0.6
)

plt.suptitle('Relationships between Customer Features', y=1.02)
plt.show()

### Observations from the scatter plot

- **No clear linear relationship:** Income and Customer Lifetime Value are not strongly correlated. The points are widely scattered across all income levels, suggesting that a higher income does not necessarily imply a higher customer lifetime value.

- **Cluster of low-income customers:** There is a dense cluster of data points near Income = 0, indicating that a large portion of customers have very low recorded income (a value of 0 may also be a placeholder for unknown income).

- **High variance in Customer Lifetime Value:** Across all income ranges, CLV varies greatly: some customers with low income have high CLV, and vice versa. This suggests that factors other than income likely play a more important role in determining CLV.
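
The visual impressions above can be checked numerically. The following is a minimal sketch, assuming the `Income` and `Customer Lifetime Value` columns used in the plot. The helper name `income_clv_summary` and the small demo frame are hypothetical and only illustrate the idea. Spearman correlation is computed here as Pearson correlation on ranks.

```python
import numpy as np
import pandas as pd


def income_clv_summary(df: pd.DataFrame) -> dict:
    """Summarise the Income vs. CLV relationship (hypothetical helper)."""
    # Correlations are computed on rows where both values are present
    pair = df[["Income", "Customer Lifetime Value"]].dropna()
    income, clv = pair["Income"], pair["Customer Lifetime Value"]
    return {
        # Pearson measures linear association
        "pearson": income.corr(clv),
        # Spearman (Pearson on ranks) measures monotonic association
        "spearman": income.rank().corr(clv.rank()),
        # Share of all customers recorded with exactly zero income
        "zero_income_share": (df["Income"] == 0).mean(),
        # Share of all customers with no income recorded at all
        "missing_income_share": df["Income"].isna().mean(),
    }


# Hypothetical demo data; in the notebook, pass customer_db instead
demo = pd.DataFrame({
    "Income": [0, 0, 0, 40000, 85000, np.nan],
    "Customer Lifetime Value": [5200, 3100, 9800, 4100, 6000, 7000],
})
summary = income_clv_summary(demo)
print(summary)
```

Reporting zero and missing income separately helps distinguish the dense cluster at Income = 0 from rows that the scatter plot silently drops.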


#### <a id="check-outliers"></a> Check Outliers

In [None]:

numeric_features = ["Income", "Customer Lifetime Value"]


In [None]:
#checking the histogram of income and customer lifetime value
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for ax, feat in zip(np.atleast_1d(axes).flatten(), numeric_features):
    ax.hist(customer_db[feat].dropna(), bins=30)
    ax.set_title(feat, y=-0.13)

plt.suptitle("Customer DB: Histograms (Income and Customer Lifetime Value)")
plt.show()

In [None]:
# Box plots for Income and Customer Lifetime Value
fig, axes = plt.subplots(1, 2, figsize=(12, 3))

for ax, feat in zip(np.atleast_1d(axes).ravel(), numeric_features):
    ax.boxplot(customer_db[feat].dropna().values, vert=False)
    ax.set_title(feat)   # label each panel with its feature
    ax.set_yticks([])    # the single y tick carries no information

plt.suptitle("Income and Lifetime Value Boxplots - Customer DB")
plt.tight_layout()
plt.show()

In [None]:
missing_cust = customer_db.isna().sum()
display(pd.DataFrame({
    'Missing Count': missing_cust[missing_cust > 0],
    '%': (missing_cust[missing_cust > 0] / len(customer_db) * 100).round(2)
}))

### <a id="flights-db"></a> Flights DB

#### <a id="inspect-import-flights"></a> Inspect import

In [None]:
flights_db.head()

In [None]:
flights_db.info()

#### <a id="check-outliers-flight"></a> Check Outliers

In [None]:
numeric_features = [
    "NumFlights",
    "NumFlightsWithCompanions",
    "DistanceKM",
    "PointsAccumulated",
    "PointsRedeemed",
    "DollarCostPointsRedeemed"
]


In [None]:
# TODO : Use bar plots for discrete values
fig, axes = plt.subplots(2, 3, figsize=(20, 8), constrained_layout=True)

for ax, feat in zip(axes.flatten(), numeric_features):
    ax.hist(flights_db[feat].dropna(), bins=20)
    ax.set_title(feat, y=-0.13)

plt.suptitle("Flights DB — Histograms")
plt.show()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(20, 8), constrained_layout=True)

for ax, feat in zip(axes.flatten(), numeric_features):
    unique_vals = flights_db[feat].nunique()
    
    # Few distinct values: treat the feature as discrete and draw a bar plot
    if unique_vals < 30:
        value_counts = flights_db[feat].value_counts().sort_index()
        ax.bar(value_counts.index, value_counts.values)
    else:
        ax.hist(flights_db[feat].dropna(), bins=20)
    
    ax.set_title(feat, y=-0.13)

plt.suptitle("Flights DB — Bar Plots (discrete) and Histograms (continuous)")
plt.show()


In [None]:
# Check for fractional (non-integer) flight or companion-flight counts
invalid_fractional_flights = flights_db[
    (flights_db['NumFlights'] % 1 != 0) |
    (flights_db['NumFlightsWithCompanions'] % 1 != 0)
]

print(f"Number of rows with impossible fractional flight counts: {len(invalid_fractional_flights)}")
if not invalid_fractional_flights.empty:
    display(invalid_fractional_flights[[ 'Year', 'Month', 'NumFlights', 'NumFlightsWithCompanions']].head(10))

In [None]:
# Check for records with no flights but a recorded distance greater than 0
invalid_flights = flights_db[(flights_db['NumFlights'] == 0) & (flights_db['DistanceKM'] > 0)]

print(f"Number of inconsistent rows (NumFlights=0 & DistanceKM>0): {len(invalid_flights)}")
if not invalid_flights.empty:
    display(invalid_flights.head())

### Scatter Plots for the flights dataset

In [None]:
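# Pairwise scatter plots; the diagonal shows each feature's histogram
# (seaborn documents diag_kind as 'auto', 'hist', 'kde' or None, not 'scatter')
g = sns.pairplot(
    data=flights_db[numeric_features],
    diag_kind='hist',
    plot_kws={'alpha': 0.5},
    height=2
)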

plt.suptitle('Relationships between Flight Features', y=1.02)
plt.show()

### Observations from the Scatter Plot

#### **Strong Linear Correlations**
- **DistanceKM and PointsAccumulated:**  
  There is a *nearly perfect linear correlation* between these two variables. This makes sense since loyalty programs typically award points proportional to the distance flown. The diagonal line pattern confirms this rule-based relationship.  
  → **Interpretation:** The more kilometers a customer flies, the more points they accumulate — a direct, systematic link rather than behavioral variation.

- **PointsRedeemed and DollarCostPointsRedeemed:**  
  This pair also shows a *perfect positive linear relationship*. The cost in dollars grows in direct proportion to the points redeemed.  
  → **Interpretation:** The airline’s redemption system consistently converts points to monetary value at a fixed rate, suggesting a stable and predictable reward conversion mechanism.
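
One way to test the "fixed rate" reading of these two pairs is to look at the spread of the row-wise ratios. This is a minimal sketch under that assumption. The helper `conversion_ratio_stats` and the demo values are hypothetical, and the real rates in the dataset are not known here.

```python
import pandas as pd


def conversion_ratio_stats(df: pd.DataFrame, numerator: str, denominator: str) -> pd.Series:
    """Describe numerator / denominator on rows with a positive denominator."""
    active = df[df[denominator] > 0]
    ratio = active[numerator] / active[denominator]
    return ratio.describe()


# Hypothetical demo data; in the notebook, pass flights_db instead
demo = pd.DataFrame({
    "DistanceKM": [0, 1000, 2500, 4000],
    "PointsAccumulated": [0, 100, 250, 400],
    "PointsRedeemed": [0, 0, 500, 1000],
    "DollarCostPointsRedeemed": [0, 0, 5, 10],
})

points_per_km = conversion_ratio_stats(demo, "PointsAccumulated", "DistanceKM")
dollars_per_point = conversion_ratio_stats(demo, "DollarCostPointsRedeemed", "PointsRedeemed")
print(points_per_km[["min", "max", "std"]])
print(dollars_per_point[["min", "max", "std"]])
```

A standard deviation close to zero would support a single conversion rate. A wider spread would point to rounding, bonus points, or tier-dependent rules.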

---

#### **Moderate Positive Trends**
- **NumFlights and DistanceKM / PointsAccumulated:**  
  Customers who take more flights generally cover more distance and accumulate more points, although with more variation than the linear cases above.  
  → **Interpretation:** Some customers may take many short flights, while others take fewer long ones — explaining the moderate rather than perfect correlation.

- **NumFlights and NumFlightsWithCompanions:**  
  There’s a clear positive association — passengers who fly more frequently also tend to fly more often with companions.  
  → **Interpretation:** This could indicate a segment of loyal customers who consistently travel with family, friends, or colleagues, possibly representing a valuable demographic for group travel promotions.
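
The "many short flights versus few long flights" explanation can be explored with a derived average distance per flight. The sketch below assumes the `NumFlights` and `DistanceKM` columns. The new column name `AvgDistancePerFlight` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; in the notebook, use flights_db instead
demo = pd.DataFrame({
    "NumFlights": [0, 2, 10, 3],
    "DistanceKM": [0, 9000, 5000, 3000],
})

# Mask rows without flights so the division yields NaN instead of inf or 0/0
flights_positive = demo["NumFlights"].where(demo["NumFlights"] > 0)
demo["AvgDistancePerFlight"] = demo["DistanceKM"] / flights_positive

print(demo)
```

Records with many flights but a low average distance correspond to the short-haul pattern, and the opposite combination to the long-haul pattern.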

---

#### **Clustered Data Patterns**

- **Clustered distributions:**  
  For many features, data points appear concentrated in specific ranges (e.g., low flight counts, moderate distances).  
  → **Interpretation:** Most customers likely fly infrequently, while a smaller subset are heavy travelers. This skew could affect model training if not accounted for.
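
The skew mentioned above can be measured, and a log transform is one common way to reduce it before fitting scale-sensitive models. This is only a sketch on hypothetical values; whether a transform is appropriate depends on the model chosen later.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed demo data; in the notebook, use a flights_db column
distance = pd.Series([200, 500, 800, 1200, 1500, 2000, 5000, 30000], name="DistanceKM")

# log1p(x) = log(1 + x) compresses the long right tail and is also defined at 0,
# which matters for records with no flights
distance_log = np.log1p(distance)

print(f"Skewness before: {distance.skew():.2f}")
print(f"Skewness after log1p: {distance_log.skew():.2f}")
```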

---

#### **Weak or Nonlinear Relationships**
- **PointsRedeemed and NumFlightsWithCompanions:**  
  No clear pattern is visible here, suggesting that redeeming points does not depend on whether the customer tends to fly alone or with companions.  
  → **Interpretation:** Redemption behavior might be more influenced by individual loyalty strategies, travel frequency, or availability of redemption opportunities.

- **DistanceKM and PointsRedeemed:**  
  The relationship seems weakly positive but scattered, implying that not all high-distance travelers redeem their points frequently.  
  → **Interpretation:** Some high-value customers may be accumulating points for larger future redemptions or are less engaged in reward usage.

---

#### **General Insights**
- The pair plot confirms logical relationships within the airline loyalty data — distance, flights, and accumulated points are strongly linked, while redemption behavior is more customer-specific.  
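- The rule-based pairs (distance and points accumulated, points redeemed and their dollar cost) show almost no scatter around their lines, which suggests these fields are recorded consistently. However, several features are discrete or heavily concentrated at low values, so scaling or a transformation may be needed before modeling.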
- From a business perspective, segmenting customers based on flight frequency, distance traveled, and redemption patterns could yield meaningful insights for targeted marketing or retention strategies.


In [None]:
#checking boxplot 
fig, axes = plt.subplots(2, 3,figsize=(20, 8), constrained_layout=True)

for ax, feat in zip(axes.flatten(), numeric_features):
    sns.boxplot(data=flights_db, x=feat, ax=ax)
    ax.set_xlabel(feat)                           
    ax.set_ylabel('')        

fig.suptitle("Flights DB — Box Plots", y=1.02)
plt.show() 

### Correlation Matrix for the flights dataset

In [None]:
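fig = plt.figure(figsize=(10, 8))

corr = flights_db[numeric_features].corr(method="pearson")

# Fix the colour scale to [-1, 1] so the colours map directly to correlation strength
sns.heatmap(data=corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Flights DB: Pearson correlation matrix")
plt.show()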

### Observations from the Correlation Matrix

| Pair | Correlation | Interpretation |
|------|--------------|----------------|
| NumFlights and DistanceKM | 0.62 | Moderately strong positive: more flights generally go with more total distance flown. |
| NumFlights and PointsAccumulated | 0.62 | Moderately strong positive: more flights generally go with more points earned. |
| NumFlights and NumFlightsWithCompanions | 0.51 | Moderate positive: people who fly often also tend to fly with companions more. |
| NumFlightsWithCompanions and DistanceKM | 0.39 | Weak to moderate positive: more companion flights go with slightly more total distance. |
| PointsRedeemed and DollarCostPointsRedeemed | 1.00 | Perfect correlation: the two columns carry the same information (points redeemed and their dollar cost). |
| PointsRedeemed and NumFlights / DistanceKM / PointsAccumulated | 0.19–0.34 | Weak positive: redeeming points does not strongly depend on flying behavior in this dataset. |


### Insights

The dataset splits into two main groups:

- **Flight activity metrics:**  
  `NumFlights`, `NumFlightsWithCompanions`, `DistanceKM`, `PointsAccumulated`

- **Redemption metrics:**  
  `PointsRedeemed`, `DollarCostPointsRedeemed`

---

- These two groups are only weakly correlated with each other (0.19–0.34), suggesting that accumulating points and redeeming them are largely, though not fully, independent behaviors.

- The perfect correlation between `PointsRedeemed` and `DollarCostPointsRedeemed` indicates redundancy: only one of them (`PointsRedeemed`) needs to be kept.
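
Redundant columns of this kind can also be flagged programmatically instead of being read off the heatmap. The sketch below is one possible approach. The helper `highly_correlated_pairs`, the 0.95 threshold, and the demo values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = []
    for col_a in upper.index:
        for col_b in upper.columns:
            r = upper.loc[col_a, col_b]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs


# Hypothetical demo data; in the notebook, pass flights_db[numeric_features]
demo = pd.DataFrame({
    "NumFlights": [3, 1, 6, 4, 2],
    "PointsRedeemed": [0, 500, 0, 1000, 250],
    "DollarCostPointsRedeemed": [0.0, 5.0, 0.0, 10.0, 2.5],
})

pairs = highly_correlated_pairs(demo)
# Drop the second column of each flagged pair and keep the first
reduced = demo.drop(columns=sorted({col_b for _, col_b, _ in pairs}))
print(pairs)
print(list(reduced.columns))
```

Keeping the first column of each pair matches the choice stated above of retaining `PointsRedeemed`, provided it appears before `DollarCostPointsRedeemed` in the column order.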



# <a id="preprocessing"></a> Preprocessing

In [None]:
# Drop every customer whose 'Loyalty#' appears more than once (all copies, not only the repeats)
customer_db_cleaned = customer_db[~customer_db['Loyalty#'].isin(duplicated_loyalty_ids)]

# Drop the flight records that belong to those ambiguous 'Loyalty#' values
flights_db_cleaned = flights_db[~flights_db['Loyalty#'].isin(duplicated_loyalty_ids)]

# Share of records remaining after the removal
print(f'Customer DB records remaining: {customer_db_cleaned.shape[0] / customer_db.shape[0]:.2%}')
print(f'Flights DB records remaining: {flights_db_cleaned.shape[0] / flights_db.shape[0]:.2%}')

# Merge the cleaned dataframes
merged_db = customer_db_cleaned.merge(flights_db_cleaned, on='Loyalty#', how='inner')

# Check that the merge has the same number of rows as flights_db_cleaned
print(f'Merged DB has as many rows as cleaned Flights DB: {merged_db.shape[0] == flights_db_cleaned.shape[0]}')

## <a id="missing-values"></a> Missing Values

In this step, we define a small helper function, `missing_report()`, that counts the missing values in each column of both datasets (CustomerDB and FlightsDB). This helps identify which features may need cleaning or imputation later. The function returns the total number of missing entries and the corresponding percentage per column, sorted from highest to lowest.

In [None]:
def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    out = df.isna().agg(['sum', 'mean']).T
    out.columns = ['Total', 'Percentage']
    out['Percentage'] = (out['Percentage'] * 100).round(2)
    return out.sort_values(['Total', 'Percentage'], ascending=False)

In [None]:
customer_missing = missing_report(customer_db)
flights_missing  = missing_report(flights_db)

customer_missing

In [None]:
flights_missing

After running the missing values report, we can see that the CancellationID column in the CustomerDB dataset has a very high proportion of missing values (around 86%). This actually makes sense: most customers likely never cancelled their membership or subscription, so this field would naturally remain empty for them.

Apart from that, only two other columns, Income and Customer Lifetime Value, have a very small share of missing records (around 0.12%), which is negligible. All other fields in CustomerDB, as well as every column in the FlightsDB dataset, are completely filled, so missing data is a minor concern in both datasets.
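
A possible way to act on these findings is sketched below: the sparse cancellation column becomes an explicit flag, and the few missing numeric values are filled with the column median. This is one common option rather than the approach this notebook necessarily takes. The flag name `HasCancelled` and the demo rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; in the notebook, start from customer_db instead
demo = pd.DataFrame({
    "Income": [52000.0, np.nan, 0.0, 61000.0],
    "Customer Lifetime Value": [4100.0, 7800.0, np.nan, 5200.0],
    "CancellationID": [np.nan, 17.0, np.nan, np.nan],  # column name as reported above
})

cleaned = demo.copy()

# Turn the mostly empty cancellation column into an explicit boolean flag
cleaned["HasCancelled"] = cleaned["CancellationID"].notna()  # hypothetical column name

# Fill the few missing numeric values with the median of the observed values
for col in ["Income", "Customer Lifetime Value"]:
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())

print(cleaned)
```

The median is used here because both columns appear skewed in the histograms, which makes it less sensitive to extreme values than the mean.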

## <a id="convert-data-types"></a> Convert data types