

# **IFI8410: Programming for Business**



## **Final Exam**

In [None]:
# Do not change the content of this cell. Execute this cell first, and again every time you restart the kernel.
%reload_ext autoreload
%autoreload 2

### Prepare dataset for classification problems

In [None]:
# Load the dataset
import pandas as pd
import numpy as np
file_path = '/data/IFI8410/finalexam/HR_Analytics.csv'
data = pd.read_csv(file_path)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# Create binary target feature for binary classification:
data.loc[data['salary'] == 'low', 'binary_salary'] = 0
data.loc[data['salary'] == 'medium', 'binary_salary'] = 0
data.loc[data['salary'] == 'high', 'binary_salary'] = 1
data['binary_salary'] = data['binary_salary'].astype(int)
data.drop(columns=['salary'], inplace = True)
data.head()

### Problem 1: Train a decision tree classification model on a dataset.

Write a function that processes the **HR_Analytics.csv** dataset and trains a Decision Tree classifier to predict the binary salary label:

`process_and_train_decision_tree(df: pd.DataFrame, categorical_features: list, target_feature: str) -> tuple:`

...


Your task is to:

- Load and preprocess the data by one-hot encoding categorical columns.
- Split the dataset into training and test sets with a 70:30 ratio.
- Train a DecisionTreeClassifier model using the training data.
- Evaluate the model using accuracy, precision, recall, and F1-score on the test set.

Requirements:

- Use a `OneHotEncoder` to encode categorical features, ensuring unknown levels are ignored.
- Configure the encoder to drop the first category and return dense output.
- Return the calculated metrics **accuracy, precision, recall, F1-score** on the test set in the function output.

Hint:

- Use **binary_salary** as the target feature; as created in the preparation cell above, it describes whether a person's salary is Low/Medium (target label = 0) or High (target label = 1).
- **Splitting the Data into Training and Test Data:** Use `train_test_split` with a 70% train and 30% test split.
- **One-Hot Encoding of Categorical Features:** Apply OneHotEncoder on the categorical columns, setting parameters to ignore unknown values, drop the first level, and return a dense matrix.
- **Numeric Features:** Make sure that all your numeric features are actually numeric.
- **Model Training:** Use `DecisionTreeClassifier` to train on the training data. Simply use `DecisionTreeClassifier(random_state=1)`.
- **Metrics Calculation:** Use scikit-learn’s **accuracy_score, precision_score, recall_score,** and **f1_score** functions to return all the classification metrics in your final output.


#### Create your solution here

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def process_and_train_decision_tree(df: pd.DataFrame, categorical_features: list, target_feature: str) -> tuple:
    
    # Write your code here
        

    
    return accuracy, precision, recall, f1score
    

#### Example Usage:

In [None]:
# Copy the dataset:
df = data.copy()

# Separate numerical, categorical and target features:
numerical_features = [
    'satisfaction_level',
    'last_evaluation',
    'number_project',
    'average_montly_hours',
    'time_spend_company',
]

categorical_features = [
    'Work_accident',
    'left',
    'promotion_last_5years',
    'Department',
]

target_feature = 'binary_salary'

# Run the training and evaluation function
accuracy, precision, recall, f1score = process_and_train_decision_tree(df, categorical_features, target_feature)

# Print metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1score:.2f}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_1.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def process_and_train_decision_tree(df: pd.DataFrame, categorical_features: list, target_feature: str) -> tuple:


    
    return accuracy, precision, recall, f1score
    

#### Test 1: Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 1

### Problem 2: Train a kNN classification model on a dataset

You are given the above dataset from Problem 1 with several features and a target column. Your task is to implement a **KNN classifier** that includes necessary preprocessing steps to handle both categorical and numerical data.

The aim is to build a reliable **KNN** model using **k=3** with optimized preprocessing.

Write a function `knn_salary_classification(data)`.


Problem Statement:

Implement the function `knn_salary_classification` that will:

- Take the dataset as a pandas DataFrame as input.

- Apply preprocessing to encode the categorical features. Use **Label Encoding**.

- Train a KNN classifier using **k=3**.

Evaluate the classifier using the testing set and return the **accuracy** of the classifier and the **classification report** (which includes metrics such as precision, recall, and F1-score) as a tuple in the function's final output.

Hint:

- Within the function, use the following list of column names as feature columns:

    `feature_columns = [
        'satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_montly_hours',
        'time_spend_company',
        'Work_accident',
        'left',
        'promotion_last_5years',
        'Department'
    ]`

- Use **binary_salary** as the target column.
- Use the `LabelEncoder` from `sklearn.preprocessing` for categorical variables.
- Use `KNeighborsClassifier` for the kNN classifier model.
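
As a small illustration of how `LabelEncoder` and `KNeighborsClassifier` fit together, the sketch below uses a made-up toy table; the column names `dept`, `hours`, and `label` are assumptions and not part of the exam dataset. It omits the train/test split and is not a solution template.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Toy data: one categorical column, one numeric column, one binary label.
toy = pd.DataFrame({
    "dept": ["sales", "hr", "technical", "sales", "hr", "technical"],
    "hours": [150, 160, 155, 250, 260, 255],
    "label": [0, 0, 0, 1, 1, 1],
})

# LabelEncoder maps each category to an integer, in sorted order of the category names.
encoder = LabelEncoder()
toy["dept_code"] = encoder.fit_transform(toy["dept"])
print(list(encoder.classes_))  # ['hr', 'sales', 'technical']

# kNN with k=3: each prediction is a majority vote among the 3 closest rows.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(toy[["dept_code", "hours"]], toy["label"])

query = pd.DataFrame({"dept_code": [1, 1], "hours": [158, 252]})
predictions = knn.predict(query)
print(predictions)  # [0 1]
```

Note that label encoding gives the categories an arbitrary numeric order, and kNN treats those integers as distances like any other number.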


#### Create your solution here

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier

def knn_salary_classification(data):
    
    # Write your code here
    



    return accuracy, report
    

#### Example Usage:

In [None]:
# Copy the dataset:
df = data.copy()

# Call the function using the full dataset
accuracy, report = knn_salary_classification(df)

# Display the full dataset results
print("\nAccuracy of KNN Salary Classification (k=3):", accuracy)
print("Classification Report:\n", report)

#### Save your solution to a file ...

In [None]:
%%writefile solution_2.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier

def knn_salary_classification(data):
    




    return accuracy, report
    

#### Test 2: Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 2

### Problem 3: Train a Naive-Bayes classification model on a dataset

Evaluate a **Naive Bayes model** on a test dataset using common classification metrics.

For this purpose, create a function `naive_bayes_metrics(df, categorical_features, target_feature)` which takes the dataset as pandas DataFrame, the categorical features as list and the target feature as a string as input variables.

**Important:** In this problem only the categorical features are used as input features for model training. You can drop the numerical features. Use the following list of categorical input features: `['Work_accident','left','promotion_last_5years','Department']`

After training your **Naive Bayes model** on a training set, evaluate the model on the test set by calculating the following metrics:

- **Accuracy:** The overall proportion of correct predictions.

- **Precision:** The proportion of true positive predictions out of all positive predictions.

- **Recall:** The proportion of true positive predictions out of all actual positives.

- **F1-Score:** The harmonic mean of precision and recall, providing a balance between the two.

Return the final result in the form of a dictionary: `{ "accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1 }`
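
To make the four definitions concrete, the following sketch computes them by hand from confusion-matrix counts on a made-up pair of label arrays and compares the result with the scikit-learn functions. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 is the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual = {
    "accuracy": (tp + tn) / len(y_true),
    "precision": precision,
    "recall": recall,
    "f1_score": 2 * precision * recall / (precision + recall),
}

# The same metrics via scikit-learn, collected in the dictionary layout described above.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
}

print(manual)
print(metrics)
```

Here the counts are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics equal 0.75.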

Hint:

- Perform a 70:30 train-test split of your input dataframe to create training and test data sets.

- One-hot encode the categorical features. You can use `OneHotEncoder` from `sklearn.preprocessing`.

- Use `MultinomialNB(alpha=1.0)` from `sklearn.naive_bayes` to train your Naive-Bayes classification model.

- You can use `ColumnTransformer` and `Pipeline` from `sklearn` to perform the feature pre-processing and model training together in one machine-learning pipeline.
  
- You can use **accuracy_score, precision_score, recall_score,** and **f1_score** from `sklearn.metrics` to compute the metrics.

- Ensure that **y_test** (the original test labels) and **y_pred** (the predicted test labels) are arrays of the same shape and data type.
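
The pipeline idea from the hints can be sketched on a toy table as follows. The column names `color` and `label` and the step names are made up, the split and metric steps are left out, and this is only one possible way to wire the pieces together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: the label follows the single categorical feature exactly.
toy = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red", "blue"],
    "label": [1, 1, 0, 0, 1, 0],
})

# One-hot encode the listed categorical columns; other columns would be dropped by default.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])]
)

# Chain preprocessing and the classifier so both are fitted in one call.
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", MultinomialNB(alpha=1.0)),
])
model.fit(toy[["color"]], toy["label"])

predictions = model.predict(pd.DataFrame({"color": ["red", "blue"]}))
print(predictions)  # [1 0]
```

Because the encoder sits inside the pipeline, it is fitted only on the data passed to `fit` and reused unchanged in `predict`.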


#### Create your solution here

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def naive_bayes_metrics(df: pd.DataFrame, categorical_features: list, target_feature: str) -> dict:
    
    # Write your code here
    


    
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }
    

#### Example Usage:

In [None]:
# Copy the dataset:
df = data.copy()

# Define numerical, categorical and target features:
# numerical_features = [
#     'satisfaction_level',
#     'last_evaluation',
#     'number_project',
#     'average_montly_hours',
#     'time_spend_company',
# ]

categorical_features = [
    'Work_accident',
    'left',
    'promotion_last_5years',
    'Department',
]

target_feature = 'binary_salary'

# Call the function using the dataset and the list of categorical features and the target feature
metrics = naive_bayes_metrics(df, categorical_features, target_feature)

# Print metrics
print(f"Accuracy: {metrics['accuracy']:.2f}")
print(f"Precision: {metrics['precision']:.2f}")
print(f"Recall: {metrics['recall']:.2f}")
print(f"F1-Score: {metrics['f1_score']:.2f}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_3.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def naive_bayes_metrics(df: pd.DataFrame, categorical_features: list, target_feature: str) -> dict:
    

    



    
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }

#### Test 3: Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 3

### Problem 4: Train a Linear Regression model on a dataset

Analyze and preprocess the **predict_home_value.csv** dataset to prepare it for a linear regression model that predicts house prices.

Tasks:

**Important:** For this problem, only use the **numeric features** in the dataset for model training. 


Split the Dataset:

- Create a function `select_and_split_data(df, numeric_columns, target_column)` that takes in the dataset as pandas DataFrame, the numeric columns and the target column to split the dataset into training and test sets.
- Split the data into training and test sets with an **80:20** ratio.
- Use **SALEPRICE** as the target column.
- The function shall return the training and test datasets for model training and evaluation, **X_train, X_test, y_train, y_test**.
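
An 80:20 split can be illustrated on a small made-up table. The column names and the `random_state` value below are assumptions for this sketch only; the task text does not prescribe a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with two numeric feature columns and one target column (10 rows).
toy = pd.DataFrame({
    "area": list(range(10)),
    "rooms": [2, 3] * 5,
    "price": [100 + 10 * i for i in range(10)],
})

X = toy[["area", "rooms"]]  # select only the feature columns
y = toy["price"]            # target column

# test_size=0.2 puts 20% of the rows into the test set; random_state is an arbitrary choice here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```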


Transform Features and Train Regression Model:

- Create a function `preprocess_train_and_evaluate_model(X_train, X_test, y_train, y_test, numeric_columns)` that preprocesses the numeric features, trains the ML model on the training data, and evaluates it on the test data.

- As numerical columns you can use `['LOTAREA', 'OVERALLCOND', 'YEARBUILT', 'FULLBATH', 'HALFBATH',
       'BEDROOMABVGR', 'KITCHENABVGR', 'TOTRMSABVGRD', 'FIREPLACES',
       'GARAGECARS', 'POOLAREA', 'MOSOLD', 'YRSOLD']`

- Normalize or scale the numerical columns.

- Predict house values for **X_test** via linear regression.

- Calculate and return **MAE (mean absolute error), MSE (mean squared error),** and **RMSE (root mean squared error)** as regression evaluation metrics.


Hint:

- Use `StandardScaler` from `sklearn.preprocessing` for normalization of numeric features.

- Use `LinearRegression()` from `sklearn.linear_model` to fit the model on the training data **X_train, y_train**.

- You can use `ColumnTransformer` and `Pipeline` from `sklearn` to perform the feature pre-processing and model training together in one machine-learning pipeline.

- Use **mean_absolute_error, mean_squared_error** from `sklearn.metrics` and **np.sqrt()** to calculate the three evaluation metrics based on the test data.
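
The scaling, fitting, and metric steps can be sketched on toy numbers. The data and the step names `scale` and `regress` are made up for illustration; the first part fits an exactly linear relationship, the second part computes the three metrics on a small hand-checkable example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Part 1: scale the feature, then fit a linear model (toy data with y = 2x + 1).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

model = Pipeline([
    ("scale", StandardScaler()),      # zero mean, unit variance per column
    ("regress", LinearRegression()),
])
model.fit(X_toy, y_toy)
prediction = model.predict(np.array([[5.0]]))
print(prediction)  # approximately [11.]

# Part 2: the three error metrics on made-up true and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3 = 1.0
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                        # square root of the MSE

print(mae, mse, rmse)
```

RMSE is in the same units as the target, which makes it easier to compare with MAE than the squared MSE.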


#### Prepare dataset for regression problem

In [None]:
# Load the dataset
import pandas as pd
file_path = '/data/IFI8410/finalexam/predict_home_value.csv'
data = pd.read_csv(file_path)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.drop(columns=['ID', 'POOLQC', 'FENCE'], inplace=True) 

In [None]:
data.columns

In [None]:
numeric_features = data.select_dtypes(
    include=['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
).columns
numeric_features

In [None]:
numeric_columns = ['LOTAREA', 'OVERALLCOND', 'YEARBUILT', 'FULLBATH', 'HALFBATH',
       'BEDROOMABVGR', 'KITCHENABVGR', 'TOTRMSABVGRD', 'FIREPLACES',
       'GARAGECARS', 'POOLAREA', 'MOSOLD', 'YRSOLD']

In [None]:
target_column = 'SALEPRICE'

#### Create your solution here

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def select_and_split_data(
    df: pd.DataFrame, numeric_columns: list, target_column: str = 'SALEPRICE'
) -> tuple:

    # Write your code here



    
    return X_train, X_test, y_train, y_test


def preprocess_train_and_evaluate_model(
    X_train, X_test, y_train, y_test, numeric_columns
) -> tuple:
    
    # Write your code here
    


    return mae, mse, rmse
    

#### Example Usage:

In [None]:
df = data.copy()

numeric_columns = ['LOTAREA', 'OVERALLCOND', 'YEARBUILT', 'FULLBATH', 'HALFBATH',
       'BEDROOMABVGR', 'KITCHENABVGR', 'TOTRMSABVGRD', 'FIREPLACES',
       'GARAGECARS', 'POOLAREA', 'MOSOLD', 'YRSOLD']

target_column = 'SALEPRICE'

X_train, X_test, y_train, y_test = select_and_split_data(df, numeric_columns, target_column)

mae, mse, rmse = preprocess_train_and_evaluate_model(X_train, X_test, y_train, y_test, numeric_columns)

# Print metrics
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")

#### Save your solution to a file ...

In [None]:
%%writefile solution_4.py

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def select_and_split_data(
    df: pd.DataFrame, numeric_columns: list, target_column: str = 'SALEPRICE'
) -> tuple:




    
    return X_train, X_test, y_train, y_test


def preprocess_train_and_evaluate_model(X_train, X_test, y_train, y_test, numeric_columns: list) -> tuple:
        


    

    return mae, mse, rmse


#### Test 4: Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 4

### Problem 5: Compare the results of KMEANS Clustering and Agglomerative Hierarchical Clustering (AHC) on a dataset

A dataset is prepared for you that uses `make_moons` from `sklearn.datasets` to generate two interleaved half-moon shapes. Dataset parameters: `n_samples=100` (100 data points, as in the preparation cell below), `noise=0.2` (introduces noise for complexity), `random_state=42` (ensures reproducibility).

You are tasked to compare two different clustering techniques on this complex dataset.

The goal is to understand how each method performs on non-linear (two-dimensional) data and visualize the clustering results.

Tasks:

**Apply the following two clustering algorithms to the dataset:**

**KMeans++:**
- Use `KMeans` from `sklearn.cluster`.
- Initialization method: **k-means++**.
- Number of clusters: 2.

**Agglomerative Hierarchical Clustering (AHC):**
- Use `AgglomerativeClustering` from `sklearn.cluster`.
- Number of clusters: 2.
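
Both estimators share the same `fit_predict` interface, which the sketch below shows on two well-separated toy blobs rather than on the half-moon data. The blob data, `n_init=10`, and `random_state=0` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Toy data: 20 points around (0, 0) and 20 points around (5, 5).
rng = np.random.default_rng(0)
blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# K-Means with k-means++ initialization; n_init and random_state are illustrative choices.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(blobs)

# Agglomerative clustering; the default linkage is Ward.
ahc = AgglomerativeClustering(n_clusters=2)
ahc_labels = ahc.fit_predict(blobs)

# Collect the label arrays in a dictionary keyed by method name.
toy_results = {"K-Means++": kmeans_labels, "AHC": ahc_labels}
for name, labels in toy_results.items():
    print(name, np.bincount(labels))
```

The cluster numbers themselves are arbitrary: the two methods may call the same group 0 or 1, so compare groupings rather than raw label values.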

**Visualize Clustering Results:** For each clustering method:
- Plot the data points colored by their assigned cluster.
- Display both clustering results as subplots of one figure (a 2x1 grid) for comparison.
- Save the visualization as `comparison_of_clustering_techniques.png`.

Create two functions:

- `apply_clustering_methods(X)` returns the **results** as a dictionary: the keys are the **method names** "K-Means++" and "AHC" as strings, the values are the **clustering labels** that are returned for each data point in the data set.

    `def apply_clustering_methods(X):`
          `...`
          `return results`
  
- `plot_clustering_results(X, results)` plots the data points in **X** with two different colors based on their predicted label:
  * Use the values in the dictionary **results** returned by `apply_clustering_methods`.
  * Use 'Feature 1' as x-axis label and 'Feature 2' as y-axis label.
  * You can combine the figures from the two clustering methods as subplots. Use:
  
    `fig, axes = plt.subplots(2, 1, figsize=(18, 12))`

    `axes = axes.flatten()`
  
  * Save the figure named **comparison_of_clustering_techniques.png**.
  * Submit the .png file as your solution.
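
The subplot layout can be tried out independently of any clustering. In the sketch below the points, the two label arrays, the figure size, the colormap, and the output file name `toy_cluster_panels.png` are all made up for illustration; the `Agg` backend is selected so the code also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Toy points with two arbitrary labelings standing in for clustering results.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
toy_results = {
    "Labeling A": (points[:, 0] > 0).astype(int),
    "Labeling B": (points[:, 1] > 0).astype(int),
}

# One subplot per entry in the dictionary, stacked vertically.
fig, axes = plt.subplots(2, 1, figsize=(6, 8))
axes = axes.flatten()
for ax, (name, labels) in zip(axes, toy_results.items()):
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm")
    ax.set_title(name)
    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")

fig.tight_layout()

# Hypothetical output location for this sketch only.
output_path = os.path.join(tempfile.gettempdir(), "toy_cluster_panels.png")
fig.savefig(output_path)
plt.close(fig)
```

Passing the label array as `c=` colors each point by its assigned group, so two groups appear in two colors.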


#### Prepare dataset for clustering problem

In [None]:
# Create a dataset with half-moon shapes
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=100, noise=0.2, random_state=42)

#### Create your solution here

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, Birch
from sklearn.mixture import GaussianMixture

def apply_clustering_methods(X):

    # Write your code here


    
    return results

def plot_clustering_results(X, results):

    # Write your code here





#### Example Usage:

In [None]:
# Try out your code:

# Apply clustering methods
clustering_results = apply_clustering_methods(X)

# Plot and save results
plot_clustering_results(X, clustering_results)


#### Save your solution to a file ...

In [None]:
%%writefile solution_5.py

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering, Birch
from sklearn.mixture import GaussianMixture

def apply_clustering_methods(X):


    
    
    return results

def plot_clustering_results(X, results):





#### Test 5: Execute the cell below to test your solution...

In [None]:
! test/run_test.sh 5

# Run all the tests again ...

In [None]:
! ./test/run_test.sh

# Final Exam

- The final exam will be held on **2024-12-11, 6:00 PM (EDT)**.

- Make sure that all your programs and output files are in the exact folder as specified in the instructions.

- All file names on this system are case sensitive. Verify the file names if you copy your work from a local computer to your home directory on ARC.

**Execute the cell below to submit your assignment**

In [None]:
! ./submit.sh -y