

# **IFI8410: Programming for Business**



## **Assignment 09**

In [None]:
# Do not change the content of this cell. Execute this cell first, and every time after you restart the kernel.
%reload_ext autoreload
%autoreload 2

##  Multiple linear regression model.

### Problem 9.1: Data Exploration and Preprocessing for House Price Prediction

**Question:**

Analyze and preprocess the **HousePricePrediction.csv** dataset to prepare it for a machine learning model that predicts house prices.


**Tasks:**

- Load the dataset and perform an initial exploration to understand the distributions and relationships of each feature with price.

- Identify and Handle Missing Values (if any), filling them with appropriate values.

- Transform Features:

    * Convert date information to a more useful format, such as extracting the year and month as separate features.

    * One-hot encode boolean features **has_basement, renovated, nice_view, perfect_condition, has_lavatory,** and **single_floor**.

- Normalize Numeric Columns: Normalize or scale numerical features **bedrooms, grade, real_bathrooms, living_in_m2,** and **quartile_zone**.


**Requirements:**

- Use `pd.to_datetime()` to convert the date column and extract year and month.
  
- Fill missing values with either the median (for numerical columns) or the mode (for categorical columns).

- Normalize or scale all numerical features to a range suitable for machine learning models.


**Expected Output**:

- Return a cleaned **pandas DataFrame df** that can be used in further modeling tasks.

- This DataFrame should be free of missing values, contain transformed features, and have normalized numerical columns.


**Hint**:

- Use the pandas function `pd.to_datetime(. , errors='coerce')` to convert **date** to the `datetime` data type.
You can use this code block:
    
    `df['date'] = pd.to_datetime(df['date'], errors='coerce')`

    `df['year'] = df['date'].dt.year`

    `df['month'] = df['date'].dt.month`

    `df = df.drop(columns=['date'])`
  
- Separate the numerical and boolean columns via:

    `numeric_cols = df.select_dtypes(
        include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    ).columns`

    `bool_cols = df.select_dtypes(include=['bool']).columns`

- Impute numeric features with their **median** values and boolean features with their **most frequently occurring value**. Use scikit-learn's `SimpleImputer` class for that.

- Use `pd.get_dummies()` to one-hot encode boolean features, ensuring all values are in numeric form. Note: You will have to convert the boolean data type into integer. 

- Use `StandardScaler` from `sklearn.preprocessing` for normalization of numeric features.
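
As an illustration of how the hinted steps fit together, here is a minimal sketch on a hypothetical four-row table. The column values are invented for the example and are not taken from **HousePricePrediction.csv**; it is not a complete solution to the problem.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; the values are NOT taken from HousePricePrediction.csv.
toy = pd.DataFrame({
    "date": ["2014-05-02", "2014-06-15", "not a date", "2015-01-20"],
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "has_basement": [True, False, True, True],
})

# Unparseable dates become NaT, so year/month hold NaN for that row.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy = toy.drop(columns=["date"])

# Fill numeric gaps with each column's median.
numeric_cols = ["bedrooms", "year", "month"]
toy[numeric_cols] = SimpleImputer(strategy="median").fit_transform(toy[numeric_cols])

# Booleans become 0/1 integers.
toy["has_basement"] = toy["has_basement"].astype(int)

# Standardize one numeric column to zero mean and unit variance.
toy[["bedrooms"]] = StandardScaler().fit_transform(toy[["bedrooms"]])

print(toy)
```

Note that `errors='coerce'` can itself introduce missing values, which is why the imputation step in this sketch runs after the date features are extracted.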


In [None]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def preprocess_housing_data(df: pd.DataFrame) -> pd.DataFrame:

    # Write your code here



    return df


In [None]:
# Sample usage:
# Load the dataset
import pandas as pd
file_path = '/data/IFI8410/sess10/HousePricePrediction.csv'
df = pd.read_csv(file_path)

# Run the preprocessing function
processed_df = preprocess_housing_data(df)

# Display a few rows of the processed data
processed_df.head()


#### Save your solution to a file ...

In [None]:
%%writefile solution_9_1.py

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def preprocess_housing_data(df: pd.DataFrame) -> pd.DataFrame:


    return df


#### Test 9.1 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 1

### Problem 9.2: Feature Selection and Dataset Splitting

**Question:**

Using the preprocessed dataset from **Problem 9.1**, perform feature selection and split the dataset into training and testing sets.


**Tasks:**

- **Select Relevant Features**:

    * Choose predictor columns that likely impact the house price based on your understanding of the dataset. 
    * For example, you may want to include features like `bedrooms, grade, living_in_m2, real_bathrooms`, and categorical indicators for conditions.
    * Exclude the target column, price, from the features set.

- **Split the Dataset**:

    * Split the data into training and test sets with an 80:20 ratio.
    * Use price as the target variable for prediction.


**Requirements**:

- Use **price** as the target column.
  
- Use `train_test_split` from `sklearn.model_selection` to divide the data into training and test sets.

- Ensure that price is separated as the target variable, **y**, while other selected columns form **X**.

- Return **X_train, X_test, y_train,** and **y_test**.
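
The 80:20 split can be sketched on a hypothetical ten-row table as follows. The column values and `random_state=42` are assumptions made for the example; the assignment does not prescribe a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table; the values are invented for the example.
toy = pd.DataFrame({
    "bedrooms": range(10),
    "grade": range(10, 20),
    "price": range(100, 110),
})

# Everything except the target goes into X; the target alone is y.
X = toy.drop(columns=["price"])
y = toy["price"]

# test_size=0.2 keeps 80% of the rows for training.
# random_state is an assumed seed that makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```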


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

def select_and_split_data(df: pd.DataFrame, target_column: str = 'price') -> tuple:

    # Write your code here


    return X_train, X_test, y_train, y_test


In [None]:
# Sample usage:
# Load the dataset
file_path = '/data/IFI8410/sess10/HousePricePrediction.csv'
df = pd.read_csv(file_path)

# `processed_df` is the preprocessed DataFrame from Problem 9.1
X_train, X_test, y_train, y_test = select_and_split_data(processed_df)

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_2.py

from sklearn.model_selection import train_test_split
import pandas as pd

def select_and_split_data(df: pd.DataFrame, target_column: str = 'price') -> tuple:


    return X_train, X_test, y_train, y_test


#### Test 9.2 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 2

### Problem 9.3: Model Training and Evaluation

**Question:**

Using the training and test sets from **Problem 9.2**, train a linear regression model to predict house prices and evaluate its performance.


**Your task is to:**

- **Train the Model**: Use LinearRegression from sklearn.linear_model to train a model on the training set.

- **Evaluate the Model**: Calculate MAE, MSE, and RMSE on the test set predictions to understand the model’s performance.


**Requirements:**

- Use **LinearRegression()** to fit the model on the training data (X_train, y_train).

- Predict house prices for X_test.

- Calculate and return **MAE, MSE,** and **RMSE** metrics.


**Hint**:

- Use `mean_absolute_error, mean_squared_error, and np.sqrt()` to calculate evaluation metrics.
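
To make the three metrics concrete, the following sketch fits `LinearRegression` on hypothetical points that lie exactly on y = 2x + 1, then computes MAE, MSE, and RMSE for a separate pair of hand-made arrays. All numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical points on the line y = 2x + 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_toy, y_toy)
print(model.coef_, model.intercept_)

# Hand-made true values and predictions with errors of 1, 0, 2, and 1.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                        # square root of the MSE

print(mae, mse, rmse)
```

Because MSE squares each error, the single error of 2 contributes four times as much as an error of 1, which is why MSE and RMSE are more sensitive to large misses than MAE.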

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def train_and_evaluate_model(X_train, X_test, y_train, y_test) -> tuple:
    
    # Write your code here



    return mae, mse, rmse


In [None]:
# Example usage:
# Load the dataset
file_path = '/data/IFI8410/sess10/HousePricePrediction.csv'
df = pd.read_csv(file_path)

# Assuming X_train, X_test, y_train, y_test are already defined from Problem 9.2
mae, mse, rmse = train_and_evaluate_model(X_train, X_test, y_train, y_test)

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_3.py

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def train_and_evaluate_model(X_train, X_test, y_train, y_test) -> tuple:



    return mae, mse, rmse


#### Test 9.3 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 3

### 9.4 Logistic Regression Problem: Predicting Employee Attrition

**Problem:**

Train a Logistic Regression model using the employee dataset (**HR_Analytics.csv**) to predict whether an employee is likely to leave (attrition).


**Your Task:**

Write a function that performs the following steps:

- `Data Loading and Preprocessing`:

    * Load the dataset.

    * Convert specified categorical columns into numerical format using One-Hot Encoding.

    * Normalize numerical features to ensure consistency in model performance.

- `Data Splitting`: Split the dataset into training and test sets, using 70% of the data for training and 30% for testing.

- `Model Training`: Train a LogisticRegression model on the training set.

- `Model Evaluation`: Evaluate the model's performance on the test set using the following metrics:

    * Accuracy
    
    * Precision
    
    * Recall
    
    * F1-Score

- Return these calculated metrics as function outputs.


**Requirements**:

- Encoding Categorical Features: Use OneHotEncoder to convert categorical columns to numerical values. Ensure unknown levels are ignored, and set the encoder to drop the first category of each feature.

- Output Metrics: The function should return four metrics: **accuracy, precision, recall,** and **F1-score**.


**Hints**:

- One-Hot Encoding: Use OneHotEncoder from sklearn on categorical columns with the settings drop='first' and handle_unknown='ignore' to avoid issues with new categories.

- Data Splitting: Use a 70:30 train-test split ratio for balanced evaluation.

- Model Training: Use the LogisticRegression model from sklearn with appropriate settings.

- Metrics Calculation: Calculate **accuracy, precision, recall,** and **F1-score** using `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from `sklearn.metrics`.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def process_and_train_logistic_regression(
    df: pd.DataFrame, target: str, categorical_features: list, numeric_features: list
) -> tuple:

    # Write your code here
    

    return accuracy, precision, recall, f1score


In [None]:
# Example usage
file_path = '/data/IFI8410/sess10/HR_Analytics.csv'
df = pd.read_csv(file_path)

# Define target, categorical, and numeric columns
target = 'left'
categorical_features = ['Department', 'salary']
numeric_features = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours',
                    'time_spend_company', 'Work_accident', 'promotion_last_5years']

# Run the function (the F1 value is stored as `f1score` so that the imported
# `f1_score` function is not overwritten)
accuracy, precision, recall, f1score = process_and_train_logistic_regression(
    df, target, categorical_features, numeric_features
)

# Print labeled output
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1score:.4f}")


#### Save your solution to a file ...

In [None]:
%%writefile solution_9_4.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def process_and_train_logistic_regression(df: pd.DataFrame, target: str, categorical_features: list, numeric_features: list) -> tuple:


    return accuracy, precision, recall, f1score


#### Test 9.4 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 4

### 9.5: Visualizing Confusion Matrix for Employee Attrition Prediction

**Problem:**

Extend the logistic regression model to visualize the confusion matrix, showing the number of true positives, true negatives, false positives, and false negatives.


**Outline:**

Train the logistic regression model to predict employee attrition, as previously implemented.

Visualize the model’s confusion matrix to better understand classification results.

Your Task: Write a function that performs the following steps:


**Model Training and Prediction:**

Use the existing logistic regression model to train on the employee dataset and make predictions.


**Confusion Matrix Calculation:**

Calculate the confusion matrix metrics: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).


**Visualization:**

- Visualize the confusion matrix using a heatmap that clearly shows the counts for TP, TN, FP, and FN.

- Add labels and color to the heatmap to improve interpretability.


**Requirements**:

- Use `confusion_matrix` from `sklearn.metrics` to calculate **TP, TN, FP,** and **FN**.
  
- Use `seaborn` and `matplotlib` to create a heatmap of the confusion matrix.


**Hints**:

- Return the confusion matrix which is defined as `cm = confusion_matrix(y_test, y_pred)`.
  
- Visualize your results: Use `sns.heatmap` with annotations to visualize the confusion matrix. Label the axes as "Predicted" and "Actual" and add a title for clarity.
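
For a binary problem, scikit-learn lays out the confusion matrix with actual classes as rows and predicted classes as columns. The sketch below uses hypothetical labels to show how TN, FP, FN, and TP can be read out of `cm`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = employee left).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

# ravel() flattens the 2x2 matrix row by row.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```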


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

def compute_confusion_matrix(df: pd.DataFrame, target: str, categorical_features: list, numeric_features: list) -> np.ndarray:
    
    # Write your code here



    return cm


In [None]:
# Example usage
file_path = '/data/IFI8410/sess10/HR_Analytics.csv'
df = pd.read_csv(file_path)

# Define target, categorical, and numeric columns
target = 'left'
categorical_features = ['Department', 'salary']
numeric_features = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours',
                    'time_spend_company', 'Work_accident', 'promotion_last_5years']

# Run the function to visualize the confusion matrix
cm = compute_confusion_matrix(df, target, categorical_features, numeric_features)

# Visualize the confusion matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted Negative', 'Predicted Positive'],
            yticklabels=['Actual Negative', 'Actual Positive'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix for Employee Attrition Prediction')
plt.show()


#### Save your solution to a file ...

In [None]:
%%writefile solution_9_5.py

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

def compute_confusion_matrix(df: pd.DataFrame, target: str, categorical_features: list, numeric_features: list) -> np.ndarray:
    


    return cm


#### Test 9.5 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 5

### 9.6 Load Data, Split into Train and Test Sets

**Question:**

You are given the CSV file **Fraud_data.csv**.

- Use a Naive Bayes Classifier to predict fraud based on categorical features from a given dataset. 

- Preprocess categorical features using one-hot encoding and use a pipeline to simplify the process of transforming and training the model.

- Return the predicted labels as numpy array **y_pred_nb**.


**Hint:**

- Use **Pipeline** to handle both preprocessing (one-hot encoding for categorical columns) and model fitting in one step.

- **MultinomialNB** is suitable for categorical features, especially after one-hot encoding.

- Set up the pipeline to include both the preprocessing of categorical data and fitting of the Naive Bayes model.
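
The pipeline idea in the hints can be sketched on a hypothetical six-row table. The rows are invented, only two of the categorical columns are used, and the train/test split is left out to keep the example short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; NOT taken from the fraud dataset.
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Fault": ["Policy Holder", "Third Party", "Policy Holder",
              "Policy Holder", "Third Party", "Third Party"],
    "FraudFound_P": [1, 0, 1, 1, 0, 0],
})
categorical = ["Sex", "Fault"]

# One-hot encode the categorical columns inside the pipeline.
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical)]
)
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", MultinomialNB()),
])

# fit() encodes and trains in one call; predict() reuses the fitted encoder.
model.fit(toy[categorical], toy["FraudFound_P"])

# The second row contains a category that was never seen during fitting.
new_rows = pd.DataFrame({
    "Sex": ["Male", "Female"],
    "Fault": ["Policy Holder", "Unknown"],
})
y_pred = model.predict(new_rows)
print(y_pred)
```

With `handle_unknown="ignore"`, the unseen `Fault` value is encoded as all zeros instead of raising an error at prediction time.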


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

def train_naive_bayes(file_path: str, categorical_features: list, target_feature: str) -> np.ndarray:
    
    # Write your code here
    


    return y_pred_nb

In [None]:
# Load dataset
file_path = '/data/IFI8410/sess10/Fraud_data.csv' 
df = pd.read_csv(file_path)

# Define features and target
categorical_features = [
    'Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault', 
    'PolicyType', 'VehicleCategory', 'PoliceReportFiled', 
    'WitnessPresent', 'BasePolicy'
]
target_feature = 'FraudFound_P'

y_pred_nb = train_naive_bayes(file_path, categorical_features, target_feature)

print("Predictions:", y_pred_nb[:10])  # Display first 10 predictions

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_6.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_naive_bayes(file_path: str, categorical_features: list, target_feature: str) -> np.ndarray:


    
    return y_pred_nb


#### Test 9.6 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 6

### 9.7 Evaluation Metrics

**Question:** 

Evaluate a **Naive Bayes model** on a test dataset using common classification metrics.

After training your Naive Bayes model on a training set, evaluate the model on the test set by calculating the following metrics:

- **Accuracy:** The overall proportion of correct predictions.

- **Precision:** The proportion of true positive predictions out of all positive predictions.

- **Recall:** The proportion of true positive predictions out of all actual positives.

- **F1-Score:** The harmonic mean of precision and recall, providing a balance between the two.

Return the final result in the form of a dictionary:
**{
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1_score": f1
}**

**Hint:**

- You can use **accuracy_score**, **precision_score**, **recall_score**, and **f1_score** from `sklearn.metrics`.
  
- Ensure that **y_test** and **y_pred** are arrays of the same shape and data type.
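
The four definitions above can be checked by hand on a small hypothetical example with 2 true positives, 2 false negatives, 1 false positive, and 3 true negatives. The labels are invented for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: TP = 2, FN = 2, FP = 1, TN = 3.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

accuracy = accuracy_score(y_test, y_pred)    # (TP + TN) / all = 5 / 8
precision = precision_score(y_test, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_test, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_test, y_pred)                # 2 * P * R / (P + R) = 4 / 7

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1_score": f1,
}
print(metrics)
```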
  

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def naive_bayes_metrics(file_path: str, categorical_features: list, target_feature: str) -> dict:
    
    # Write your code here
    


    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }


In [None]:
# Example usage:
# Load data
file_path = '/data/IFI8410/sess10/Fraud_data.csv' 
df = pd.read_csv(file_path)

# Define features and target
categorical_features = [
    'Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault',
    'PolicyType', 'VehicleCategory', 'PoliceReportFiled',
    'WitnessPresent', 'BasePolicy'
]
target_feature = 'FraudFound_P'

metrics = naive_bayes_metrics(file_path, categorical_features, target_feature)

# Print metrics
print(f"Accuracy: {metrics['accuracy']:.2f}")
print(f"Precision: {metrics['precision']:.2f}")
print(f"Recall: {metrics['recall']:.2f}")
print(f"F1-Score: {metrics['f1_score']:.2f}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_7.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def naive_bayes_metrics(file_path: str, categorical_features: list, target_feature: str) -> dict:



    
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }

#### Test 9.7 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 7

### 9.8 Support Vector Machine (SVM)

**Question:**

You are provided with a dataset containing information about potential fraud cases. Your goal is to build an SVM model to predict fraud cases (**FraudFound_P** as target feature). However, the dataset is imbalanced, which could negatively impact the model's ability to detect fraud cases effectively. To improve performance:

- Use `OneHotEncoder` to preprocess categorical features.
- Apply `SMOTE` to balance the training data.
- Train an **SVM classifier** with a non-linear kernel (RBF).
- Return the predicted and original labels on the test set.


**Task:** 

Write a function `load_and_train_svm_model` that:

- Loads the dataset, processes categorical columns with `OneHotEncoder`, and applies `SMOTE` to balance the data.
  
- Trains the SVM model on the balanced data.
  
- Returns the predicted and original labels for the test data set **y_pred_svm, y_test**.


**Hint:**

- Use `ColumnTransformer` to handle one-hot encoding of categorical features in a pipeline (see `scikit-learn` library).
- Implement `SMOTE` from the `imblearn` library to create a balanced dataset by generating synthetic samples of the minority class. For example, use `smote = SMOTE(random_state=1)` and then apply `smote.fit_resample(.)` to your training data.
- Use SVC with the rbf kernel for non-linear classification.
- Be sure to preprocess the test set using the same one-hot encoding pipeline before making predictions.
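
The last hint, reusing the fitted encoder on the test set, can be sketched as follows. The rows are hypothetical, and the SMOTE resampling step is deliberately left out so that the example depends only on scikit-learn; a comment marks where it would go.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical training and test rows; NOT taken from the fraud dataset.
X_train = pd.DataFrame({
    "Make": ["Honda", "Toyota", "Honda", "Ford", "Toyota", "Ford"],
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
})
y_train = [1, 0, 1, 0, 0, 0]
X_test = pd.DataFrame({"Make": ["Honda", "Mazda"], "Sex": ["Male", "Female"]})

preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Make", "Sex"])]
)

# Learn the categories from the training rows only.
X_train_enc = preprocessor.fit_transform(X_train)

# (An oversampler such as SMOTE would be applied here, to the encoded
#  training data and y_train only, never to the test data.)

# Reuse the already fitted encoder; "Mazda" is unseen and encodes to zeros.
X_test_enc = preprocessor.transform(X_test)

svm = SVC(kernel="rbf", random_state=1)
svm.fit(X_train_enc, y_train)
y_pred_svm = svm.predict(X_test_enc)
print(y_pred_svm)
```

Calling `transform` rather than `fit_transform` on the test set keeps the column layout identical between training and prediction.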

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from collections import Counter

def load_and_train_svm_model(file_path: str, categorical_features: list, target_feature: str):

    # Write your code here
    

    

    return y_pred_svm, y_test


In [None]:
# Example usage
file_path = '/data/IFI8410/sess10/Fraud_data.csv' 
categorical_features = [
    'Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault', 
    'PolicyType', 'VehicleCategory', 'PoliceReportFiled', 
    'WitnessPresent', 'BasePolicy'
]
target_feature = 'FraudFound_P'

y_pred_svm, y_test = load_and_train_svm_model(file_path, categorical_features, target_feature)
print("SVM Predictions:", y_pred_svm[:10]) 


#### Save your solution to a file ...

In [None]:
%%writefile solution_9_8.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from collections import Counter

def load_and_train_svm_model(file_path: str, categorical_features: list, target_feature: str):

    

    return y_pred_svm, y_test


#### Test 9.8 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 8

### 9.9 Evaluation Metrics

**Question:**

Given a dataset with a target class imbalance, train an **SVM classifier** to predict the target variable while handling the imbalance in the data. Use SMOTE to balance the training data and apply one-hot encoding to categorical features.

**Task:** 

Write a function `train_and_evaluate_svm` that performs the following tasks:

- Load and preprocess the data, handling categorical features with one-hot encoding.
- Balance the training dataset using `SMOTE`.
- Train an SVM model using a pipeline that includes the necessary preprocessing and handles the class imbalance.
- Evaluate the model on the test set using metrics such as **accuracy, precision, recall,** and **F1-score**.
- Return the metrics as a dictionary in the form
  **metrics = {
        "accuracy": accuracy_score(y_test, y_pred_svm),
        "precision": precision_score(y_test, y_pred_svm, pos_label=1),
        "recall": recall_score(y_test, y_pred_svm, pos_label=1),
        "f1_score": f1_score(y_test, y_pred_svm, pos_label=1)
    }**

  
**Hint:**

- **One-Hot Encoding:** Use a `ColumnTransformer` to one-hot encode the categorical features, as this will help the model interpret the categorical data effectively.
- **SMOTE for Re-balancing Target Data:** Apply SMOTE to the one-hot encoded training data to create a balanced dataset.
- **Pipeline:** Set up a pipeline that includes preprocessing and model fitting. This will ensure that the same transformations are applied to both the training and test sets, maintaining consistency in feature representation.
- **Class Weights in SVM:** Use `class_weight='balanced'` in the SVM model to handle any remaining imbalance, which helps optimize the decision boundary accordingly.
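
To see what `class_weight='balanced'` does, the weights can be computed directly with scikit-learn's `compute_class_weight` helper. The label vector below is hypothetical (8 negatives, 2 positives).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 8 negatives and 2 positives.
y = np.array([0] * 8 + [1] * 2)
classes = np.array([0, 1])

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula written out by hand for comparison.
manual = len(y) / (len(classes) * np.bincount(y))

print(weights)
print(manual)
```

The minority class receives the larger weight, so misclassifying a fraud case costs the model more than misclassifying a non-fraud case. If the training data has already been fully balanced by resampling, both weights come out equal and the setting has no further effect.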


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE

def train_and_evaluate_svm(file_path: str, categorical_features: list, target_feature: str) -> dict:

    # Write your code here
    


    
    return metrics


In [None]:
# Example usage:
file_path = '/data/IFI8410/sess10/Fraud_data.csv' 
categorical_features = [
    'Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault', 
    'PolicyType', 'VehicleCategory', 'PoliceReportFiled', 
    'WitnessPresent', 'BasePolicy'
]
target_feature = 'FraudFound_P'

metrics = train_and_evaluate_svm(file_path, categorical_features, target_feature)
print("SVM Metrics:", metrics)

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_9.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE

def train_and_evaluate_svm(file_path: str, categorical_features: list, target_feature: str) -> dict:


    
    return metrics


#### Test 9.9 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 9

### 9.10 Comparison of Performance for Naive Bayes and SVM

**Question:**

- Compare the performance of a **Naive Bayes classifier** and an **SVM classifier** on the given dataset.

- Evaluate their respective **accuracy, precision, recall,** and **F1-score** to determine which model performs better. 

- Also, examine their confusion matrices to better understand where each model succeeds or fails.


**Tasks:**

Create a function `evaluate_model(y_true, y_pred, model_name)` that:

- Prints all metrics. You can use this format:

    `print(f"\n{model_name} Performance:")`

  
    `print(f"Accuracy: {accuracy:.2f}")`

  
    `print(f"Precision: {precision:.2f}")`

  
    `print(f"Recall: {recall:.2f}")`

  
    `print(f"F1-Score: {f1:.2f}")`
  
- Plots the confusion matrix. (Note: You can use the `sklearn` class `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)`.)


**Hint:**

- Train Two Models: Use the Naive Bayes and SVM classifiers, applying the same preprocessing steps (like one-hot encoding for categorical features).

- Evaluate with Metrics: After training both models, use accuracy, precision, recall, and F1-score to evaluate them.

- Compare Confusion Matrices: Plot the confusion matrices to observe how each model performs on true positives, false positives, true negatives, and false negatives.

- Interpret Results: Use the metrics and the confusion matrices to draw conclusions about the strengths and weaknesses of each model.
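
For the interpretation step, it can help to place the metrics of both models side by side. The following sketch uses hypothetical predictions for two placeholder models ("Model A" and "Model B" are invented names) and collects the scores in one table; it is separate from the `evaluate_model` function required above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions from two placeholder models.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
predictions = {
    "Model A": [0, 0, 0, 1, 1, 1, 0, 0],
    "Model B": [0, 0, 0, 0, 1, 0, 0, 0],
}

# One row of metrics per model.
rows = {
    name: {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),
    }
    for name, y_pred in predictions.items()
}
comparison = pd.DataFrame.from_dict(rows, orient="index")

print(comparison.round(2))
```

In this toy case both models reach the same accuracy but differ in precision and recall, which illustrates why accuracy alone can hide important differences on imbalanced data.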

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Load data
file_path = '/data/IFI8410/sess10/Fraud_data.csv' 
df = pd.read_csv(file_path)

# Define features and target
categorical_features = [
    'Make', 'AccidentArea', 'Sex', 'MaritalStatus', 'Fault', 
    'PolicyType', 'VehicleCategory', 'PoliceReportFiled', 
    'WitnessPresent', 'BasePolicy'
]
target_feature = 'FraudFound_P'

# Split data into features and target
X = df[categorical_features]
y = df[target_feature]

# Encode the target feature
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Set up preprocessing for categorical columns (One-Hot Encoding)
preprocessorForCategoricalColumns = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Naive Bayes Classifier
nbClassifier = MultinomialNB(alpha=1.0)
nb_model = Pipeline(steps=[('preprocessor', preprocessorForCategoricalColumns), ('classifier', nbClassifier)])
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

# SVM Classifier
svmClassifier = SVC(kernel='linear', random_state=1)
svm_model = Pipeline(steps=[('preprocessor', preprocessorForCategoricalColumns), ('classifier', svmClassifier)])
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

In [None]:
# Evaluation function
def evaluate_model(y_true, y_pred, model_name):
    
    # Write your code here
     


    
    plt.show()


In [None]:
# Example usage:

# Evaluate Naive Bayes
evaluate_model(y_test, y_pred_nb, "Naive Bayes Classifier")

# Evaluate SVM
evaluate_model(y_test, y_pred_svm, "SVM Classifier")

#### Save your solution to a file ...

In [None]:
%%writefile solution_9_10.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Evaluation function
def evaluate_model(y_true, y_pred, model_name):


    
    plt.show()
    

#### Test 9.10 Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 10

# Run all the tests again ...

In [None]:
! ./test/run_test.sh

# Homework Submission

- This homework is due by **2024-11-13, 6:00 PM (EST)**.

- Make sure that all your programs and output files are in the exact folder as specified in the instructions.

- All file names on this system are case sensitive. Verify the file names if you copy your work from a local computer to your home directory on ARC.

**Execute the cell below to submit your assignment**

In [None]:
! ./submit.sh -y