# COMP8260 - AI Systems Class 2
### Jupyter Notebook Solution

This notebook provides a structured solution to the tasks outlined in the class PDF document. Each task includes explanations, implementation, and answers to relevant questions.

## 1. Load and Explore the Dataset
We begin by loading the **Adult Dataset (version 2)** from `openml.org` using `fetch_openml`. We will explore the dataset structure, feature types, missing values, and target distribution.

In [None]:

from sklearn.datasets import fetch_openml
import pandas as pd

# Fetch the dataset
dataset = fetch_openml(data_id=1590, as_frame=True)

# Extract data and target
X = dataset.data
y = dataset.target

# Display dataset information
print("Dataset Info:")
X.info()  # info() prints its report directly and returns None, so print() would add a stray "None"

# Check missing values
print("\nMissing values per column:")
print(X.isnull().sum())

# Dataset size
print("\nDataset Size:", X.shape)

# Target class distribution
print("\nTarget Distribution:")
print(y.value_counts())
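
The target distribution matters because the accuracies reported later only mean something relative to a majority-class baseline. The following is a minimal sketch of that calculation; `y_demo` is a small made-up label series used purely for illustration, not the Adult target.

```python
import pandas as pd

# Hypothetical labels standing in for the real target column
y_demo = pd.Series(["<=50K"] * 6 + [">50K"] * 2, name="class")

# Relative class frequencies instead of raw counts
proportions = y_demo.value_counts(normalize=True)
print(proportions)

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = proportions.max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Applied to the real `y`, the same two lines give the accuracy a classifier has to beat to be useful.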


## 2. Split Data into Training and Test Sets
We will split the dataset into **80% training and 20% testing** using `train_test_split`.

In [None]:

from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display shapes
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)
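
When the classes are imbalanced, a purely random split can shift the class ratio between the two subsets. One common option, sketched here on made-up data (`X_demo` and `y_demo` are hypothetical), is to pass `stratify` to `train_test_split` so both subsets keep the original proportions. This is an optional variation, not something the task requires.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives, 20 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```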


## 3. Select Numerical Features
We extract only **numerical features** for training the first Decision Tree model.

In [None]:

# Select numerical features ('number' matches every numeric dtype, not only int64/float64)
X_train_num = X_train.select_dtypes(include='number')
X_test_num = X_test.select_dtypes(include='number')

# Display shapes
print("X_train_num Shape:", X_train_num.shape)
print("X_test_num Shape:", X_test_num.shape)
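
To make the dtype-based selection concrete, here is a small sketch on a made-up frame (`df_demo` and its values are hypothetical; the column names merely resemble those of the Adult data). It shows how `select_dtypes` separates numeric columns from categorical ones.

```python
import pandas as pd

# Hypothetical frame mixing numeric and categorical columns
df_demo = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40.0, 50.0, 35.5],
    "workclass": pd.Categorical(["Private", "State-gov", "Private"]),
})

# include="number" matches every numeric dtype (int32, int64, float64, ...)
numeric_part = df_demo.select_dtypes(include="number")
# exclude="number" returns the complementary, non-numeric columns
categorical_part = df_demo.select_dtypes(exclude="number")

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```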


## 4. Train a DecisionTreeClassifier on Numerical Data
We train a `DecisionTreeClassifier` using default parameters and evaluate its performance.

In [None]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train Decision Tree with default hyperparameters (random_state fixed only for reproducibility)
clf_num = DecisionTreeClassifier(random_state=42)
clf_num.fit(X_train_num, y_train)

# Evaluate the model
train_accuracy_num = accuracy_score(y_train, clf_num.predict(X_train_num))
test_accuracy_num = accuracy_score(y_test, clf_num.predict(X_test_num))

# Print results
print("Training Accuracy:", train_accuracy_num)
print("Testing Accuracy:", test_accuracy_num)
print("Tree Depth:", clf_num.get_depth())
print("Number of Leaves:", clf_num.get_n_leaves())
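
A tree grown with default parameters keeps splitting until its leaves are pure, which typically yields near-perfect training accuracy and a noticeably lower test accuracy. The sketch below illustrates this on synthetic data from `make_classification` (not the Adult dataset) by comparing an unrestricted tree with one limited to `max_depth=3`. The depth value is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (not the Adult dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Unrestricted tree: keeps splitting until all leaves are pure
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree: too small to memorise the label noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train:", tree.score(X_tr, y_tr), "test:", tree.score(X_te, y_te))
```

Comparing the gap between training and test accuracy for the two trees is a quick way to judge how much the unrestricted tree overfits.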


## 8. Optimize Decision Tree with GridSearchCV
We use `GridSearchCV` to find the best values for `max_depth`, `min_samples_split`, and `min_samples_leaf`.

In [None]:

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'max_depth': [10, 20, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

# Perform Grid Search (3-fold CV; random_state fixed so repeated runs give the same result)
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3, n_jobs=-1, return_train_score=True)
grid_search.fit(X_train_num, y_train)

# Display best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
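
Beyond the single best configuration, the `cv_results_` attribute of a fitted `GridSearchCV` holds the mean training and validation score of every parameter combination, which helps spot overfitting across the grid. The following sketch runs a tiny search on synthetic data (the data and the one-parameter grid are assumptions for illustration) and tabulates those scores.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the Adult dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5, 10]},
    cv=3,
    return_train_score=True,  # needed for the mean_train_score column
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[["params", "mean_train_score", "mean_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

The same table can be built from `grid_search.cv_results_`, since that search was also run with `return_train_score=True`.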


## 9. Implement a Pipeline with the Best Parameters
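We build a pipeline that imputes and one-hot encodes the categorical features, passes the numerical features through unchanged, and fits a decision tree with the best parameters found by the grid search. Note that those parameters were tuned on the numerical features only, so they are not necessarily optimal for the full feature set.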

In [None]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Select categorical features
categorical_features = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]

# Define categorical preprocessing pipeline
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # sparse_output requires scikit-learn >= 1.2
])

# Combine preprocessor
preprocessor = ColumnTransformer([
    ('cat', categorical_transformer, categorical_features)
], remainder='passthrough')

# Create pipeline using best params
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', DecisionTreeClassifier(**grid_search.best_params_, random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate pipeline
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print("Pipeline Training Accuracy:", train_accuracy)
print("Pipeline Testing Accuracy:", test_accuracy)
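
One possible refinement is to run the grid search on the pipeline itself, so the hyperparameters are tuned on the same encoded features the final model sees. The sketch below uses made-up toy data and a deliberately tiny grid (both hypothetical) and relies on scikit-learn's `<step name>__<parameter>` naming convention for pipeline parameters.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with one categorical and one numerical column
X_demo = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"] * 10,
    "age": list(range(20, 60)),
})
y_demo = pd.Series([0, 1, 0, 1] * 10)

demo_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
        remainder="passthrough",  # numerical columns pass through unchanged
    )),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# "<step name>__<parameter>" routes each value to the named pipeline step
demo_grid = {"classifier__max_depth": [1, 2, 3]}
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Best CV score:", demo_search.best_score_)
```

Because the preprocessing is refit inside every cross-validation fold, this setup also avoids leaking information from the validation folds into the encoder.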


## 📌 Conclusion
- We successfully trained and optimized a **Decision Tree Classifier**.
- **Grid Search** helped find the best hyperparameters.
- **Pipelines** were used to process categorical data and automate model training.
- The final model was evaluated by comparing training and test accuracy; a large gap between the two indicates overfitting.