
# **Artificial Intelligence - FEUP**
## **2nd Assignment - Supervised Learning**
### **3rd Year - 2nd Semester - 2024/2025**



# Import Libraries and Set Options

Import necessary libraries for data handling, visualization, and machine learning.  
Set pandas display options and seaborn plot style for better readability.

- **pandas** for loading CSV files and handling structured data efficiently.
- **NumPy** for fast numerical operations and array manipulations.
- **matplotlib** and **seaborn** to visualize data and uncover patterns through charts and plots.
- **scikit-learn** for machine learning tasks like splitting datasets, scaling features, training models, and measuring their performance with built-in metrics.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

sns.set_theme(style="whitegrid")
pd.set_option('display.max_columns', None)

# Load and Preview Dataset

Load the dataset from a CSV file and take a quick look at the first few rows.  
Also, check the shape, data types, and basic statistics of the dataset.

In [None]:
df = pd.read_csv('./original/train.csv')
display(df.head())  # Preview dataset; a bare df.head() in the middle of a cell is not rendered

print(f"Dataset shape: {df.shape}")
df.info()
df.describe()

# Check and Handle Missing Values

Identify if there are any missing values in the dataset.  
If any exist, fill them with the median of the respective numeric column.

In [None]:
missing_total = df.isnull().sum().sum()
print(f"Missing values in the training dataset: {missing_total}")

if missing_total > 0:
    print("Filling missing values with the column median.")
    df = df.fillna(df.median(numeric_only=True))
else:
    print("No missing values found.")
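
The grand total above does not show which columns are affected. As a minimal sketch, the per-column counts and the effect of median imputation can be inspected on a small frame. The frame `toy` and its values are hypothetical and stand in for `df`; they are not taken from the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the assignment's df
toy = pd.DataFrame({
    "age": [30, np.nan, 45, 50],
    "HDL": [55.0, 60.0, np.nan, np.nan],
    "smoking": [0, 1, 0, 1],
})

# Count missing values per column and keep only the affected columns
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]
print(missing_per_column)

# Median imputation: each NaN is replaced by the median of its own column
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled)
```

Listing the affected columns first makes it easier to judge whether a median fill is reasonable for each of them.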

# Exploratory Data Analysis (EDA)

Visualize the distribution of the target variable (smoking).  
Plot a correlation heatmap to understand relationships between numerical features.

In [None]:
# Distribution of smoking variable
sns.countplot(x='smoking', data=df)
plt.title("Smoker Status Distribution")
plt.xlabel("Smoker")
plt.ylabel("Count")
plt.show()

# Correlation heatmap (numeric columns only; pandas >= 2.0 raises on non-numeric columns otherwise)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(16, 12))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
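
A heatmap without annotations can be hard to read for a single column. One possible complement, sketched here on synthetic data, is to list each feature's correlation with the target sorted by absolute value. The frame `toy` and the columns `hemoglobin` and `noise` are generated for illustration only, with `hemoglobin` deliberately built to depend on `smoking`.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not the assignment dataset
rng = np.random.default_rng(0)
n = 500
smoking = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "smoking": smoking,
    "hemoglobin": 14 + 1.5 * smoking + rng.normal(0, 1, n),  # depends on the target
    "noise": rng.normal(0, 1, n),                            # independent of the target
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["smoking"]
    .drop("smoking")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```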

# Boxplots of Features by Smoking Status

Create boxplots for selected features grouped by smoking status to identify patterns or differences.

In [None]:
features = ['age', 'height(cm)', 'weight(kg)', 'waist(cm)', 'eyesight(left)', 'eyesight(right)',
            'hearing(left)', 'hearing(right)', 'systolic', 'relaxation', 'fasting blood sugar',
            'Cholesterol', 'triglyceride', 'HDL', 'LDL', 'hemoglobin', 'Urine protein',
            'serum creatinine', 'AST', 'ALT', 'Gtp', 'dental caries']

for feature in features:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x='smoking', y=feature, data=df)
    plt.title(f"{feature} by Smoking Status")
    plt.xlabel("Smoker")
    plt.ylabel(feature)
    plt.tight_layout()
    plt.show()

## Remove Irrelevant Features

Based on the boxplots and visual inspection, some features show little to no difference between smoker and non-smoker groups.  
These features will be removed to simplify the model and avoid noise.

We drop the following irrelevant features:  
`eyesight(left)`, `eyesight(right)`, `hearing(left)`, `hearing(right)`, and `dental caries`.


In [None]:
# List of irrelevant features to drop
irrelevant_features = ['eyesight(left)', 'eyesight(right)', 'hearing(left)', 'hearing(right)', 'dental caries']

# Drop irrelevant features
df = df.drop(columns=irrelevant_features)

# Verify the features were dropped
print("Columns after dropping irrelevant features:")
print(df.columns)
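
The choice above rests on visual inspection of the boxplots. A rough numeric check could back it up, for example a standardized mean difference between the two groups. The sketch below is an assumption, not the method used in this notebook: the helper `standardized_mean_difference` is hypothetical and the data are synthetic.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(frame, feature, target="smoking"):
    """Absolute difference of the group means divided by the pooled standard deviation."""
    group0 = frame.loc[frame[target] == 0, feature]
    group1 = frame.loc[frame[target] == 1, feature]
    pooled_std = np.sqrt((group0.var() + group1.var()) / 2)
    return abs(group1.mean() - group0.mean()) / pooled_std

# Synthetic data for illustration; not the assignment dataset
rng = np.random.default_rng(1)
smoking = np.tile([0, 1], 250)
toy = pd.DataFrame({
    "smoking": smoking,
    "hemoglobin": 14 + 1.5 * smoking + rng.normal(0, 1, smoking.size),  # shifted by group
    "eyesight(left)": 1.0 + rng.normal(0, 0.3, smoking.size),           # same in both groups
})

smd = {col: standardized_mean_difference(toy, col) for col in ["hemoglobin", "eyesight(left)"]}
print(smd)
```

Values near zero suggest the two groups are hard to tell apart on that feature alone, which is the pattern the boxplots are used to spot.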


## Remove Outliers (IQR Method)

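Based on the visual inspection above, we remove outliers from selected numeric features using the Interquartile Range (IQR) method: a row is dropped when its value falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].  
The features are filtered one after another, so each feature's quartiles are computed on the rows that survived the previous filters.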


In [None]:
# Use the IQR rule to filter out extreme values in numeric features
numeric_features = ['age', 'height(cm)', 'weight(kg)', 'waist(cm)', 'systolic', 'relaxation',
                    'fasting blood sugar', 'Cholesterol', 'triglyceride', 'HDL', 'LDL',
                    'hemoglobin', 'serum creatinine', 'AST', 'ALT', 'Gtp']

for feature in numeric_features:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    before = df.shape[0]
    df = df[(df[feature] >= Q1 - 1.5 * IQR) & (df[feature] <= Q3 + 1.5 * IQR)]
    after = df.shape[0]
    print(f"{feature}: removed {before - after} outliers")
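
To make the rule concrete, here is a small worked example on a single made-up series. The values are hypothetical and chosen so the bounds are easy to verify by hand.

```python
import pandas as pd

# Hypothetical values: eight ordinary points and one extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="toy_feature")

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # lower fence
upper = q3 + 1.5 * iqr       # upper fence

# Keep only the values inside the fences
kept = values[(values >= lower) & (values <= upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=[{lower}, {upper}]")
print(f"removed {len(values) - len(kept)} outlier(s)")
```

Here Q1 is 3 and Q3 is 7, so the fences are -3 and 13, and only the value 100 is removed.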


# Data Cleaning and Splitting

Drop the `id` column, which only identifies rows and carries no predictive signal.  
Separate the remaining columns into features (X) and the target variable (y).

In [None]:
df = df.drop(columns=['id'])  # Drop id column
X = df.drop(columns=['smoking'])  # Features
y = df['smoking']  # Target variable

# Feature Scaling

Standardize the features (zero mean, unit standard deviation per column) so that features measured on different scales contribute comparably to scale-sensitive models.

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
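
Standardization maps each value to z = (x - mean) / std, computed per column. The sketch below checks this on a small hypothetical array (`X_toy`, made-up values) by comparing `StandardScaler` with the manual formula; `StandardScaler` uses the population standard deviation, which matches NumPy's default `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature array, for illustration only
X_toy = np.array([
    [150.0, 50.0],
    [160.0, 60.0],
    [170.0, 70.0],
    [180.0, 80.0],
])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# Manual z-score with the population standard deviation (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, manual))
print(X_toy_scaled.mean(axis=0), X_toy_scaled.std(axis=0))
```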

# Train-Test Split

Split the data into training and testing sets with stratification to maintain class distribution.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
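
In the cells above the scaler is fitted on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative, sketched below on synthetic data as an assumption rather than as this notebook's pipeline, is to split first and fit the scaler on the training rows only. The names `X_toy`, `y_toy`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 200 rows, 3 features, 30% positive class
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(200, 3))
y_toy = np.array([0] * 140 + [1] * 60)

# Split first, keeping the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Fit the scaler on the training rows only, then reuse it on the test rows
split_scaler = StandardScaler()
X_tr_scaled = split_scaler.fit_transform(X_tr)
X_te_scaled = split_scaler.transform(X_te)

# Share of the positive class in each part
print(y_tr.mean(), y_te.mean())
```

With this ordering the test set stays unseen until evaluation, and the printed class shares show the effect of `stratify`.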