# Programming Assignment 1
* CSCI-5931 : Deep Learning
* Spring 2024
* Instructor: Ashis Kumer Biswas
* Student name: Carol Kiekhaefer


In [None]:
# run code in a Jupyter notebook cell
# % indicates it's a Jupyter magic command; allows me to execute conda commands directly from the notebook interface. 
# This command installs the pyarrow package into the currently active conda environment
#%conda install pyarrow  

## Little Background about the problem

`Customer churn` occurs when customers stop doing business with a company. It is also known as customer attrition, or loss of clients or customers.

You are given sensitive information on 9,000 customers of a European bank, EBQ. Your task is to build an Artificial Neural Network (ANN) based on the dataset such that later the ANN model can correctly predict who is going to leave next. This predictive analysis is vital for the EBQ bank to revise their business strategy towards customer retention. What do you think?

Anyway, you are recruited by the bank to do the data science. And, the head of the bank only trusts heads, i.e., brains…. I mean neural networks for making any decisions. And luckily you were in Dr. B’s class and you know something(?) about the ANN that you could successfully convince the head of the bank during the interview. He has put a lot of faith in you. Now, can you solve his problem?

Tasks:

1. Please download the zip file, `PA1-deliverables.zip`. Unzip it in your workspace. Here below is the file hierarchy of the "PA1-deliverables/" folder:

```
PA1-deliverables
├── 2024-Spring-DL-PA1-assignment.ipynb
├── dataset
│   └── datasetX.csv
├── figures
│   ├── le.png
│   ├── nn-1.png
│   ├── nn-1.svg
│   ├── nn-2.png
│   ├── nn-2.svg
│   ├── nn-3.png
│   ├── nn-3.svg
│   └── ohe.png
└── saved_models
```
As you can see, you will mostly be working with `2024-Spring-DL-PA1-assignment.ipynb`, i.e., the Jupyter notebook. The notebook accesses the dataset file `dataset/datasetX.csv`, which contains some customer information and is labeled (i.e., the target column, `Exited`, is present). Here below is a brief summary of the features you will find in the dataset:

* `CustomerId`: a unique identifier for each customer within the dataset. These values are not ordered sequentially within the dataset, and are only used to identify a specific customer. It typically does not have any influence on whether a customer leaves the business.
* `Surname`: A string used to identify the customer in the dataset. Surname may be distinct amidst all or most customers. Because of this, it most likely won't affect the target variable. 
* `CreditScore`: a numeric representation of the customer's individual fiscal credit score. Typically used to indicate eligibility for loans. Current credit scores use a range from 300 to 850, but the FICO auto score range uses 250-900. This feature likely determines retention rate of customers. 
* `Geography`: this feature contains a categorical string representing the name of a country the customer is from originally. 
* `Gender`: this feature contains a categorical string representing the gender of the customer ("Male"/"Female"). 
* `Age`: a numerical integer representation of a customer's age. Intuition suggests that older customers are likely to have higher retention than younger customers.
* `Tenure`: a numerical integer representation. It is assumed that this feature represents the number of total years the customer has been retained. It is likely that customers which have been retained longer will continue to be retained.
* `Balance`: a numerical floating point number (to two decimal places of precision) indicating the customer's current bank balance (assumed total across all accounts). Customers with a greater balance may be less likely to exit the account due to difficulty of transfer. 
* `NumOfProducts`: numeric integer value. It is assumed that this value represents the number of accounts (products) that this customer has open. Further evaluation of this feature would be needed to determine the usefulness of this feature, but at face-value, intuition dictates that a customer with more products is less likely to exit. 
* `HasCrCard`: boolean flag (0 or 1) representing whether the customer has a credit card or not. 
* `IsActiveMember`: boolean flag (0 or 1) representing whether the customer is an active member of the bank. It is assumed this indicates whether the customer has transactions on the regular banking statement. Intuition dictates that inactive members are more likely to exit. 
* `EstimatedSalary`: numerical floating point representation of the customer's predicted salary (to two decimal places). Intuition dictates that customers with different incomes may behave differently with respect to retention rate. 
* `Exited`: boolean flag (0 or 1) representing whether the customer has exited their account. This is the target variable for the dataset. It should not be dropped, but should not be included as the training input (X), and should instead be separated as the target label (y). 

You will also see an empty directory `saved_models/`, that is for you to save all the models you'd train in this assignment.

The `figures/` directory contains a few image files used to properly document this assignment. Please do not delete them, and when possible please move them together with this Jupyter notebook so that its contents display properly.

> In this Jupyter notebook please write your solutions / codes in the cells marked with `#Your solution goes here...`. You may add additional code cells after that cell if you desire. But, please do not remove any cell originally given in the notebook.

> After you solve the assignment in the jupyter notebook, be sure to execute and save it so that execution/results/printouts are also saved with it.
> Finally, submit only the saved jupyter notebook (`2024-Spring-DL-PA1-assignment.ipynb`) in Canvas to receive a grade. For this assignment, Canvas will only accept the jupyter notebook with the "*.ipynb" extension.

## Task 1 : (10 points)
* Define a function named `summarize_dataset` that takes only one argument: `csv_file`, where `csv_file` is the name of the given `csv` file with this assignment, i.e., `datasetX.csv`. 
  * The function is expected to summarize the given dataset in the following way:
```
total number of rows = a
total number of columns = b
number of columns having non-numeric values = c
columns with missing values = [ (d1, e1)  (d2, e2), ... ]
gender based summary of exited column = [ (f1, g1)  (f2, g2), ... ]
age based summary of exited column = [ ('below or equal to 40', h1)  ('above 40', h2) ]
credit score summary =  i +/- j 
```
  
where,

* `a` is total number of rows in the dataset.
* `b` is total number of columns in the dataset.
* `c` is number of columns having non-numeric values.
* $(d_i, e_i)$ (i.e., a pair/tuple entry) represents column name ($d_i$) and number of missing values present in that column ($e_i$). If number of missing values in a column is zero (0), you do not need to list it. Please sort the tuple entries in descending order of $e_i$ values.
* $g_i$ represents the percentage of gender $f_i$ who exited. Please sort the tuple entries in descending order of $g_i$ values. Also, print the percentages with 2 digits after the decimal point, followed by the `%` symbol.
* $h_1$ and $h_2$ represent the percentage of $\leq 40$ year olds who exited and the percentage of $>40$ year olds who exited, respectively. Also, print the percentages with 2 digits after the decimal point, followed by the `%` symbol.
* `i` and `j` are the average and standard deviation of credit scores among the data samples, respectively. Please print them the way it is shown above, both with 2 digits after the decimal point.
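
One pitfall worth noting for the sorted percentage lists: once a percentage has been formatted as a string such as `'9.50%'`, sorting compares text rather than numbers. The sketch below uses made-up group names and rates (they are illustrative assumptions, not values from `datasetX.csv`) to show sorting on the numeric value first and formatting afterwards.

```python
# Hypothetical exit rates per group, as raw floats (percent).
rates = {"GroupA": 9.5, "GroupB": 25.0, "GroupC": 16.25}

# Sort on the numeric value first, in descending order...
ordered = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)

# ...and only then format to two decimals with a trailing % sign.
formatted = [(name, f"{pct:.2f}%") for name, pct in ordered]
print(formatted)

# Sorting the already-formatted strings gives a different order,
# because '9.50%' compares greater than '25.00%' as text.
wrong = sorted(
    ((name, f"{pct:.2f}%") for name, pct in rates.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
print(wrong)
```

The two printed lists differ in order, which is why the numeric sort should happen before the `:.2f` formatting step.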


In [None]:
#Your solution goes here...
import pandas as pd
import numpy as np
import torch

def summarize_dataset(csv_file):
    
    """
    Summarize a churn dataset for the European Bank EBQ from a CSV file.
    
    This function processes a CSV file containing churn data of the European Bank EBQ,
    summarizing important statistics to aid in predictive analysis for customer retention.
    It outputs the total number of rows and columns, identifies non-numeric and missing values,
    and provides detailed summaries based on gender, age, and credit scores.
    
    This summary includes statistical insights crucial for preparing the data for an Artificial
    Neural Network (ANN) model aimed at predicting customer churn.

    The function provides the following summary statistics:
    - Total number of rows and columns in the dataset.
    - Number of columns containing non-numeric values.
    - A list of columns with missing values alongside the count of missing entries for each.
    - A gender-based summary of the 'Exited' column, indicating the percentage of customers 
      from each gender who have exited.
    - An age-based summary of the 'Exited' column, categorizing customer exits by age groups: 
      40 and below, and above 40.
    - A summary of the 'CreditScore' column, including its mean and standard deviation.

    Parameters:
    - csv_file (str): The filepath of the CSV file to be analyzed. This file should be 
                      formatted correctly and include at least the following columns: 
                      'Gender', 'Age', 'Exited', and 'CreditScore'. 

    Outputs:
    - Total number of rows and columns in the dataset.
    - Number of columns having non-numeric values.
    - Columns with missing values, sorted in descending order by the number of missing values.
    - Gender-based summary of the 'Exited' column, showing the percentage of customers who exited,
      sorted in descending order by percentage.
    - Age-based summary of the 'Exited' column, categorizing customers as 'below or equal to 40'
      and 'above 40', and displaying the exit percentages for these categories.
    - Credit score summary, including the mean and standard deviation of credit scores among the customers.
    
    This function prints the summary statistics directly to the console. The information 
    is organized and displayed in a user-friendly manner, allowing for easy interpretation.

    Example Usage:
    --------------
    >>> summarize_dataset('path/to/datasetX.csv')

    Note:
    - This function is designed with the assumption that the dataset structure aligns with 
      the specified requirements. If the dataset's structure or column names differ, 
      modifications to the function may be necessary.
    - Ensure that the dataset does not contain sensitive information or that appropriate 
      data handling measures are in place to protect privacy.

    Returns:
    pandas.DataFrame: The loaded dataset. The summary statistics are printed directly.
    """
    
    # Load the dataset
    data = pd.read_csv(csv_file)
    
    # Determine the total number of rows and columns
    total_rows = data.shape[0]
    total_columns = data.shape[1]
    
    # Identify non-numeric columns
    non_numeric_columns = data.select_dtypes(include=['object']).shape[1]
    
    # Identify columns with missing values and their counts
    missing_values = data.isnull().sum()
    columns_with_missing_values = [(column, missing) for column, missing in missing_values.items() if missing > 0]
    columns_with_missing_values.sort(key=lambda x: x[1], reverse=True)
    
    # Gender-based summary of the 'Exited' column
    # Sort on the numeric percentage first; sorting the formatted strings would order them as text
    gender_summary = (data.groupby('Gender')['Exited'].mean() * 100).sort_values(ascending=False)
    gender_summary = [(gender, f"{percentage:.2f}%") for gender, percentage in gender_summary.items()]
    
    # Age-based summary of the 'Exited' column
    age_below_or_equal_40 = data[data['Age'] <= 40]['Exited'].mean() * 100
    age_above_40 = data[data['Age'] > 40]['Exited'].mean() * 100
    age_based_summary = [
        ('below or equal to 40', f"{age_below_or_equal_40:.2f}%"),
        ('above 40', f"{age_above_40:.2f}%")
    ]
    
    # Credit score summary (mean and standard deviation)
    credit_score_mean = data['CreditScore'].mean()
    credit_score_std = data['CreditScore'].std()
    credit_score_summary = f"{credit_score_mean:.2f} +/- {credit_score_std:.2f}"
    
    # Print the summaries to the console
    print(f"total number of rows = {total_rows}")
    print(f"total number of columns = {total_columns}")
    print(f"number of columns having non-numeric values = {non_numeric_columns}")
    print(f"columns with missing values = {columns_with_missing_values}")
    print(f"gender based summary of exited column = {gender_summary}")
    print(f"age based summary of exited column = {age_based_summary}")
    print(f"credit score summary = {credit_score_summary}")
    
    return data




In [None]:
# Load the dataset via summarize_dataset
# Verify accuracy of the path
import os

# Path to the file, relative to the notebook's working directory
file_path = "./dataset/datasetX.csv"
data = summarize_dataset(file_path)


**Note that two variables, `Age` and `CreditScore`, have missing values, which will require imputation prior to use in any of the models.**

## Task 2
* Preprocessing the given dataset for the model training.

### Task 2.1 (10 points)

* First preprocessing that we are going to do on the dataset is dropping two features (i.e., columns) that, I think, are irrelevant and would not make any meaningful relationship with the `Exited` feature. The features are: `CustomerId` and `Surname`.
* Make sure to create a variable called `dataset_dropped` that will store the revised dataset.
* Please print the name of the columns of the revised dataset.

In [None]:
#Your solution goes here...

# Load the dataset where 'data' is our DataFrame loaded via summarize_dataset function  

# Drop the 'CustomerId' and 'Surname' columns
dataset_dropped = data.drop(['CustomerId', 'Surname'], axis=1)

# Store the names of the columns in the revised dataset in a variable
dataset_dropped_columns = dataset_dropped.columns

# Print the names of the columns in the revised dataset
print("Revised dataset columns:", dataset_dropped_columns)

### Task 2.2 (10 points)
* Second Preprocessing that we are going to do is *Shuffle Rows* of the dataset obtained from `Task 2.1`.
* "It is extremely important to shuffle the training data, so that you do not obtain entire minibatches of highly correlated examples. As long as the data has been shuffled, everything should work OK. Different random orderings will perform slightly differently from each other but this will be a small factor that does not matter much." -- [Ian Goodfellow](https://qr.ae/pGBgw8)
* Use a random seed value `4321` in case you will call any stochastic method.
* Make sure to create a variable called `dataset_shuffled` that will store the revised dataset.


In [None]:
# The following code finds the maximum value in the 'CreditScore' column
max_credit_score = dataset_dropped['CreditScore'].max()

### Explanation of key parts of the next code for dataset_shuffled

* sample(frac=1): This shuffles the entire dataset. The frac parameter specifies the fraction of rows to return in the random sample, so frac=1 means return all rows, but in a random order.

* random_state=4321: This sets the seed for the random number generator, ensuring that the shuffle operation is reproducible. Using the same seed value means you'll get the same order of shuffled rows every time you run this code.

* reset_index(drop=True): After shuffling, the index of the DataFrame will be in the order of the shuffled rows. reset_index(drop=True) resets the index to the default integer index (0, 1, 2, ...) and avoids adding the old index as a column in the DataFrame.

This approach ensures that our dataset is shuffled, addressing the concern of avoiding minibatches of highly correlated examples during training, which is crucial for training machine learning models effectively.
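
As a quick illustration of the three pieces above, here is a minimal sketch on a toy DataFrame (the toy values are assumptions for illustration and stand in for `dataset_dropped`). It checks that the same seed reproduces the same order, that the index is reset, and that no rows are lost.

```python
import pandas as pd

# Toy frame standing in for dataset_dropped (values are illustrative only).
toy = pd.DataFrame({"Age": [30, 41, 52, 63], "Exited": [0, 1, 0, 1]})

# Same seed twice: both shuffles should come out in the same row order.
first = toy.sample(frac=1, random_state=4321).reset_index(drop=True)
second = toy.sample(frac=1, random_state=4321).reset_index(drop=True)

print(first.equals(second))                         # do the two shuffles agree?
print(list(first.index))                            # index after reset_index
print(sorted(first["Age"]) == sorted(toy["Age"]))   # same rows, only reordered
```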

In [None]:
#Your solution goes here...

# Assuming dataset_dropped is our DataFrame obtained from the previous steps
# (e.g., after dropping the 'CustomerId' and 'Surname' columns)

# Shuffle the rows of the dataset
dataset_shuffled = dataset_dropped.sample(frac=1, random_state=4321).reset_index(drop=True)

# Print the first few rows of the shuffled dataset to verify
print(dataset_shuffled.head())


### Task 2.3: (10 points)

* Third Preprocessing that we will do is X-y Partitioning of the dataset obtained from `Task 2.2`.
* In its current state, the dataset contains both independent (input, `X`) and the target (output, `y`) features within the same dataframe. For ease of the training process, we need to partition the training features from the target feature into two separate dataframes. 
* Make sure, the following cell contains at least two variables: `X` and `y`:
  * `X` contains part of the dataset with only independent features, and 
  * `y` having only the dependent/target feature.

In [None]:
#Your solution goes here...

# Assuming dataset_shuffled is our DataFrame obtained from the previous step, Task 2.2
# and is already shuffled as per the last step

# Partition the dataset into X and y
# X contains all columns except the target feature 'Exited'
X = dataset_shuffled.drop('Exited', axis=1)

# y contains only the target feature 'Exited'
y = dataset_shuffled['Exited']

# Comments:
# -------------------
# - The .drop('Exited', axis=1) method is used on the shuffled DataFrame to create X.
#   This removes the 'Exited' column (specifying axis=1 for columns) and retains all other columns.
#   The resulting DataFrame, X, consists of only the independent features.

# - To create y, we simply select the 'Exited' column from the dataset_shuffled DataFrame.
#   The resulting y holds the values of the dependent (target) variable.

#  Print the shapes of X and y to verify the partitioning:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

# This provides a quick check to ensure X contains all features except target, and y contains only target feature. 
# Step is included to confirm that dataset has been correctly partitioned into independent & dependent variables.


### Task 2.4 (10 points)
* Fourth Preprocessing that we will do is Train-Test Split of X, y obtained from `Task 2.3`.
* Now that we have X and y tables with appropriate feature pruning performed, we must split the data into a training partition (`X_train, y_train`) and a testing partition (`X_test, y_test`). 
* The training partitions (`X_train, y_train`) will be used to train your model, while the test partition (`X_test, y_test`) will be set aside during the training steps, and will only be used to evaluate the trained model. 
* Training and test splits should be mutually exclusive, i.e., a sample cannot be in both the training and test sets.
* Please perform a 80-20 split, meaning 80% of the (X,y) dataset will be in (X_train, y_train) split, while, remaining 20% will be in (X_test,y_test) split. 
* Please use random seed `4321` prior to calling any stochastic methods.
* Make sure the following cell contains at least 4 variables: `X_train`, `y_train`, `X_test`, `y_test`.

In [None]:
#Your solution goes here...
from sklearn.model_selection import train_test_split

# X and y have been defined in previous steps as the features and target variable, respectively

# Perform an 80-20 train-test split
# Use random_state=4321 to ensure reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4321)

#  Comments:
# -------------------
# - `train_test_split` is called with X & y, datasets containing independent features and target variable.
# - `test_size=0.2` specifies that 20% of data should be allocated to test set, & 80% will be used for training.
# - `random_state=4321` sets seed for the random number generator used to shuffle data before splitting. 
# - This ensures that the split is reproducible; running this code multiple times will produce the same split, which is important for experimental repeatability.

# Print the shapes of the train and test sets to verify the split:
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)

# This will confirm that the dataset has been split according to the specified proportions.
# The shapes of X_train and y_train should match, and the shapes of X_test and y_test should match.
# This ensures that each feature vector in X has a corresponding target variable in y.



In [None]:
X_train.head()

### Task 2.5 (10 points)

* Fifth preprocessing that we will do is the *Conversion of Categorical features to Numerical*
* Please adopt the `One Hot Encoding` method instead of `Label Encoding` while converting the categorical features. 
* Make sure the following cell contains a variable named `X_train_ohe` that would contain one hot encoded `X_train` data; on the two categorical columns: 'Geography','Gender'. Please save the encoder for later use; e.g., encode `X_test` dataset, or any future test sample given to you. Under any circumstance, you must not encode `X_test` independently like you would do for `X_train`.
* Now, encode the `X_test` data using the one hot encoder you saved while you encoded the `X_train`, and name the variable `X_test_ohe`.


* **Both encoding techniques are outlined below**:
> A little background first: Categorical features are features that contain values that are not numeric. It would be absurd to work with non-numeric features if you ask neurons in your ANN to compute the weighted sum of inputs, and then pass through activation function, right? These maths are undefined. An obvious solution you may be intrigued to do is dropping the features! Aha! Wrong!! Every piece of data is precious... may present with valuable insights of the data samples to find the patterns to map inputs with output/targets. So, we should include them. But, how?

The answer is via "Encoding". 

Several types of encodings are used in practice. Here below are just 2 popular ones:
1. **Label Encoding**, where labels are encoded as consecutive integers. Say, a categorical feature named "Category" with three categorical values, {“Cat”, “Dog”, “Zebra”}, can be encoded to "0", "1", "2" respectively, as in the figure below. The issue is that this type of encoding may unintentionally impose an ordering on the categories, which may add bias to the training.


![label-encoding](figures/le.png)

2. **One Hot Encoding**, which ignores the ordering of the categories altogether. With one-hot, we convert each categorical value into a new binary column and assign a value of 1 or 0 to those columns. Each category is represented as a binary vector in which all the values are zero except the position of that category, which is marked with a 1. Also, don't forget to remove the original categorical features. Here below is just an example of how to convert the categorical feature called "Category", having the {“Cat”, “Dog”, “Zebra”} values, into three new binary features: "Cat", "Dog", "Zebra".

![label-encoding](figures/ohe.png)
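
The two encodings can be compared side by side on the "Category" example. This is only a sketch: it uses scikit-learn's `LabelEncoder` and pandas' `get_dummies` for brevity, which are illustrative choices here. `get_dummies` keeps no fitted state, so it does not replace the saved encoder that Task 2.5 asks for.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

categories = pd.Series(["Cat", "Dog", "Zebra", "Dog"], name="Category")

# Label encoding: one integer per category (assigned in sorted order).
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(categories)
print(list(label_encoder.classes_), list(label_codes))

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot = pd.get_dummies(categories, dtype=int)
print(one_hot)
```

Note how the label codes suggest an order (Cat < Dog < Zebra) that the one-hot columns do not.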

**A note on the Dummy Variable Trap**
The Dummy Variable Trap occurs when two or more dummy variables created by one-hot encoding are highly correlated (i.e., become multicollinear). This means that one variable can be predicted from the others, making it difficult to interpret the estimated coefficients in regression models. In other words, the individual effect of the dummy variables on the prediction model cannot be interpreted well because of multicollinearity.

Using the one-hot encoding method, a new dummy variable is created for each category of a categorical variable to represent the presence (1) or absence (0) of that category. For example, if tree species is a categorical variable made up of the values pine or oak, then tree species can be represented by converting each value to a one-hot vector. This means that a separate column is obtained for each category, where the first column represents if the tree is pine and the second column represents if the tree is oak. Each column will contain a 1 if the tree in question is of the column's species and a 0 otherwise. These two columns are multicollinear: if a tree is pine, then we know it's not oak, and vice versa. Machine learning models trained on a dataset with this multicollinearity can suffer. A remedy is to drop the first (or any one) of the dummy (i.e., one-hot) features created.
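
The pine/oak example can be checked directly. The sketch below, on a made-up four-tree sample, encodes the species once with all dummy columns and once with `drop="first"`, and shows that in the full encoding each row sums to 1, so either column is determined by the other.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up sample of four trees, one categorical column.
species = np.array([["pine"], ["oak"], ["oak"], ["pine"]])

full = OneHotEncoder().fit_transform(species).toarray()
reduced = OneHotEncoder(drop="first").fit_transform(species).toarray()

# With both columns kept, every row sums to 1, so either column is
# fully determined by the other (perfect collinearity).
print(full.sum(axis=1))

# Dropping the first category (alphabetically 'oak') leaves one column:
# 1.0 means pine, 0.0 means oak.
print(reduced.ravel())
```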

- Below, we set up a `ColumnTransformer` object from the `sklearn.compose` module of the scikit-learn library. It is a utility for performing column-wise transformations within a machine learning pipeline, allowing different columns of the input data to undergo distinct preprocessing steps before further analysis or model training.

- ColumnTransformer is a utility that allows different columns of the input data matrix to be transformed differently. This is useful in datasets where different types of features (e.g., categorical, numerical) require different preprocessing steps.  This class is designed to apply different transformations to different columns of a dataframe or NumPy array. 

- `transformers`: list of transformations to apply. Each element in the list is a tuple containing three elements:
    - First element of the tuple ('encoder') is a name or identifier for the transformer.
    - Second element (OneHotEncoder(drop='first')) specifies the transformer object to apply. OneHotEncoder is specified with parameter drop='first', which means for each categorical column, the first category is dropped to avoid creating collinear features (to avoid the dummy variable trap).
    - Third element of the tuple (['Geography', 'Gender']) specifies the columns to which the transformer should be applied. The square brackets denote a list in Python, meaning that the OneHotEncoder is applied to both the 'Geography' and 'Gender' columns.
- remainder='passthrough': Parameter that specifies what to do with remaining columns not explicitly selected for transformation in transformers list. By setting remainder='passthrough', all columns not specified in the transformers list are passed through without changes and are concatenated to the output of the transformers.
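
To make the column ordering concrete, here is a small sketch on a toy frame (the rows are invented for illustration). It suggests that the one-hot columns are placed first in the output, followed by the passed-through columns in their original relative order, which matters later when column names are reattached.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows; country names and values are illustrative only.
toy = pd.DataFrame({
    "CreditScore": [600, 700, 650],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [30, 45, 52],
})

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(drop="first"), ["Geography"])],
    remainder="passthrough",
)
out = ct.fit_transform(toy)

# Encoded columns (Germany, Spain; France is dropped) come first, then the
# passed-through columns in their original relative order (CreditScore, Age).
print(out)
```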

In addition to the transformations described above, we will also check missingness for each variable and impute based on percent missingness.

In [None]:
# Function to calculate the percentage of missing values in each column
def calculate_nan_percentage(df):
    nan_percentages = df.isnull().mean() * 100
    return nan_percentages

# Calculate the percentage of missing values for each column in the dataset
missing_percentages = calculate_nan_percentage(X_train)

missing_percentages

Note that the percentages of missing values in `Age` and `CreditScore` are 4.41% and 0.29% respectively.

- Under 1%: Missingness is often inconsequential, and any form of imputation (mean, median, mode) can be used.
- 1% - 5%: Simple imputations like mean/median for numerical features, or mode for categorical features, are typically safe.
- 5% - 15%: More caution is needed. Consider using model-based methods, like K-Nearest Neighbors (KNN) or Multiple Imputation by Chained Equations (MICE).
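
The rule of thumb above can be written as a small lookup. This is only a sketch: `suggest_imputation` is a hypothetical helper name, the thresholds simply mirror the list above, and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def suggest_imputation(df):
    """Map each column with missing values to a suggested strategy (hypothetical helper)."""
    pct_missing = df.isnull().mean() * 100
    suggestions = {}
    for column, pct in pct_missing.items():
        if pct == 0:
            continue  # nothing to impute
        if pct < 1:
            suggestions[column] = "any simple imputation"
        elif pct <= 5:
            suggestions[column] = "mean/median (numeric) or mode (categorical)"
        elif pct <= 15:
            suggestions[column] = "model-based (e.g. KNN, MICE)"
        else:
            suggestions[column] = "outside the ranges listed above"
    return suggestions

# Toy data: column 'a' has 4 of 100 values missing, column 'b' has none.
toy = pd.DataFrame({"a": [np.nan] * 4 + [1.0] * 96, "b": range(100)})
print(suggest_imputation(toy))
```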

In [None]:
# Find columns with missing values and get their names
columns_with_missing_values = X_train.columns[X_train.isnull().any()]

# Print the column names with missing values
print("Columns with missing values:", columns_with_missing_values)


In [None]:
#Your solution goes here...
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer

# Assuming X_train and X_test are defined and include 'Geography' and 'Gender'

# Define the column transformer with OneHotEncoder for 'Geography' and 'Gender'
column_transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(drop='first'), ['Geography', 'Gender'])
    ],
    remainder='passthrough'
)

# Fit and transform the X_train dataset, then transform the X_test dataset
X_train_ohe = column_transformer.fit_transform(X_train)
X_test_ohe = column_transformer.transform(X_test)

# Retrieve column names for the one-hot encoded features
ohe_feature_names = column_transformer.named_transformers_['encoder'].get_feature_names_out(input_features=['Geography', 'Gender'])

# Identify non-transformed (passed through) columns
non_transformed_cols = [col for col in X_train.columns if col not in ['Geography', 'Gender']]

# Combine all column names
all_feature_names = np.concatenate((ohe_feature_names, non_transformed_cols))

# Convert the output arrays to DataFrames with the proper column names
X_train_ohe = pd.DataFrame(X_train_ohe, columns=all_feature_names)
X_test_ohe = pd.DataFrame(X_test_ohe, columns=all_feature_names)

# Initialize imputers
mean_imputer = SimpleImputer(strategy='mean')
knn_imputer = KNNImputer(n_neighbors=5)

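# Apply mean imputation to 'CreditScore': fit on the training data only,
# then reuse the fitted imputer on the test data
X_train_ohe['CreditScore'] = mean_imputer.fit_transform(X_train_ohe[['CreditScore']]).ravel()
X_test_ohe['CreditScore'] = mean_imputer.transform(X_test_ohe[['CreditScore']]).ravel()

# Apply KNN imputation to 'Age' the same way (fit on train, transform test)
# Note: 'Age' is the only column passed in, so there are no other features
# on which to measure neighbor distance for rows where 'Age' is missing
X_train_ohe['Age'] = knn_imputer.fit_transform(X_train_ohe[['Age']]).ravel()
X_test_ohe['Age'] = knn_imputer.transform(X_test_ohe[['Age']]).ravel()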


#  Comments:
# -------------------
# - The OneHotEncoder is configured with drop='first' to avoid the dummy variable trap by dropping the first category in each categorical variable. This reduces the number of dummy variables by one per feature, mitigating multicollinearity.
# - The ColumnTransformer allows us to specify which columns should be one-hot encoded while leaving the rest of the dataset unchanged. It's a flexible tool for applying different transformations to different columns of a dataset.
# - We fit the ColumnTransformer to the X_train dataset using `fit_transform`, which both learns the encoding scheme (i.e., which categories exist in the data) and applies the transformation. 
# - For the X_test dataset, we use `transform` without fitting, ensuring that the same encoding scheme learned from X_train is applied. This is crucial to maintain consistency in the feature space between training and testing datasets.
# - It's important not to fit the encoder to X_test because doing so would allow the encoder to learn potentially different categories, leading to inconsistencies between the encoded features of the training and testing sets. 
# - Using the encoder fitted on the training set ensures that the testing set is transformed according to the same scheme, even if some categories are missing in the testing set.

# Note: After transforming with OneHotEncoder, the output is a NumPy array. 
# If we need to work with pandas DataFrame (i.e., to keep column names), we will need to convert it back to DataFrame. This might also involve manually setting the column names to reflect the one-hot encoded features and any other features passed through.
# Approach to convert output of ColumnTransformer back to a pandas DataFrame & manually handle column names for 
# one-hot encoded features along with other features is detailed in the next code chunk. 

### To convert the output of the ColumnTransformer back to a pandas DataFrame and manually handle column names for one-hot encoded features along with other features, we can follow these steps:
* .named_transformers_ is an attribute of the ColumnTransformer object and is a dictionary that stores the individual transformers used within the ColumnTransformer, accessible by their names. When we define a ColumnTransformer, we can assign a name to each transformer operation, which allows us to later reference these transformers by their given names.

* Retrieve Column Names for One-Hot Encoded Features: Use the `get_feature_names_out` method from the OneHotEncoder to get the names of the one-hot encoded features.

* After applying transformations like OHE, the original feature names often change. A single categorical feature with three categories would be transformed into three separate binary features. The `get_feature_names_out()` function generates the names for these new features based on the categories found in the data.

* Identify Non-Transformed Columns: Determine which columns were not transformed (i.e., passed through) and maintain their original names.

* Combine All Column Names: Create a comprehensive list of column names that includes both the one-hot encoded feature names and the names of the non-transformed features.

* Convert the Output Arrays to DataFrames: Use the comprehensive list of column names when converting the NumPy arrays (results of transformations) back into pandas DataFrames.

In [None]:

# The ColumnTransformer 'column_transformer' has already been defined and fitted as previously shown

# Step 1: Retrieve column names for the one-hot encoded features
ohe_feature_names = column_transformer.named_transformers_['encoder'].get_feature_names_out()

# Step 2: Identify non-transformed (passed through) columns
# ColumnTransformer does not directly provide the names of the untouched columns.
# Thus we will need to extract from column names that are not 'Geography' or 'Gender'- not encoded, e.g., from X_train
non_transformed_cols = [col for col in X_train.columns if col not in ['Geography', 'Gender']]

# Step 3: Combine all column names (ensure the order matches the order of columns in the transformed array)
all_feature_names = list(ohe_feature_names) + non_transformed_cols

# Step 4: Convert the output arrays to DataFrames with the proper column names
X_train_ohe_df = pd.DataFrame(X_train_ohe, columns=all_feature_names)
X_test_ohe_df = pd.DataFrame(X_test_ohe, columns=all_feature_names)

# Note: The DataFrame conversion assumes that the order of columns in the transformed array matches the order
# in 'all_feature_names'. If we apply other transformations or if order does not match, we will need to adjust accordingly.

# Now, X_train_ohe_df and X_test_ohe_df are pandas DataFrames with columns properly named.


### Task 2.6: (10 points)

* Sixth Preprocessing that we are going to do is *Normalization of X_train_ohe, and X_test_ohe*

* Now that we have all-numerical training and test datasets, `X_train_ohe` and `X_test_ohe` respectively, we can normalize each feature in both datasets. **Normalization** is just one of the ways to scale each feature. In class you'll learn a ton of other ways to scale. For this task, let's resort to **Normalization**.

> "The rule of thumb for scaling datasets, is we scale training dataset first, then using the statistics that we learn during the scaling process, we scale the test dataset. We do not learn any new statistics while we scale the test dataset."

* Also, scaling is commonly performed column-wise, and never sample/row wise.

* Make sure the following cell contains the two scaled variables: `X_train_scaled` and `X_test_scaled` based on the requirements mentioned above.

### Notes on the normalization that will be performed in the next code chunk:

* Maintaining Data Integrity: By fitting the scaler only on the training data, we ensure that the model is not exposed to any information from the test dataset during training. This practice prevents information leakage and ensures that the model's performance metrics are a reliable indicator of how well the model will perform on unseen data.

* Applying the Same Scale: Transforming the test data using the scaler fitted on the training data ensures that the test data is scaled using the same parameters (min and max values for each feature) as the training data. This consistency is crucial for the model's ability to generalize from training to testing data.

* Column-wise Scaling: Normalization (and scaling in general) is applied column-wise because we want to adjust each feature's values to a specific range while maintaining their distribution across all samples. This process does not mix information between features but standardizes their scales.

Now, X_train_scaled and X_test_scaled will be our normalized training and test datasets, respectively, ready for training machine learning models.
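
The three points above can be illustrated with a minimal pure-Python sketch of min-max scaling. The helper names `fit_min_max` and `apply_min_max` and the toy rows are assumptions made for this illustration. The sketch does not handle a column whose minimum equals its maximum.

```python
def fit_min_max(rows):
    """Learn per-column min and max from the training rows only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, col_min, col_max):
    """Scale each column with previously learned statistics (no re-fitting)."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, col_min, col_max)]
        for row in rows
    ]


# Hypothetical toy data: two features per sample
train_rows = [[600.0, 20.0], [700.0, 40.0], [800.0, 60.0]]
test_rows = [[650.0, 30.0], [900.0, 10.0]]

col_min, col_max = fit_min_max(train_rows)                   # statistics from training data only
train_scaled_toy = apply_min_max(train_rows, col_min, col_max)
test_scaled_toy = apply_min_max(test_rows, col_min, col_max)  # same statistics reused

print(train_scaled_toy)
print(test_scaled_toy)
```

Because the test rows are scaled with the training minimum and maximum, a test value outside the training range ends up outside [0, 1]. This is expected and is not a sign of leakage.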


In [None]:
#Your solution goes here...
import sklearn as sk
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# X_train_ohe and X_test_ohe are the NumPy arrays produced by the ColumnTransformer in the previous steps

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler to the training data and transform it
X_train_scaled = scaler.fit_transform(X_train_ohe)
# transform the test data using the same scaler but without fitting it again
X_test_scaled = scaler.transform(X_test_ohe)

# Convert the scaled arrays back to DataFrames with the correct column names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=all_feature_names)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=all_feature_names)


# Comments:
# -------------------
# "The rule of thumb for scaling datasets is to scale the training dataset first, then use the
# statistics (i.e., min and max values for MinMaxScaler) learned during the scaling process to scale
# the test dataset. We do not learn any new statistics while we scale the test dataset."
# This approach ensures that the model is not inadvertently exposed to any information from the test set
# during training, maintaining the integrity of the testing process and preventing data leakage.

# Scaling is performed column-wise to ensure that each feature is normalized independently.
# This is critical because different features may have different scales and distributions,
# and normalization allows each feature to contribute equally to the distance calculations in many algorithms.


In [None]:
import sys
print(sys.version)

## Task 3: (10 points)
* *Designing your first Artificial Neural Network (ANN) based classifier* using i) **Micrograd**, and ii) **Tensorflow** or **PyTorch**:

  > **Micrograd** by Andrej Karpathy [[video](https://youtu.be/VMj-3S1tku0?si=D5m1IJW5AkJzhvLE)][[git-repo](https://github.com/karpathy/micrograd.git)]

  > **Keras/Tensorflow** @ Python reference:  please take a look here [https://keras.io/getting-started/sequential-model-guide/](https://keras.io/getting-started/sequential-model-guide/). 

  > **PyTorch** @Python reference: Please take a look at [Deep Learning with PyTorch Guide](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)
  
  
### Step 1: The ANN architecture
* Let's design the first artificial network architecture for the classifier we would like to build. Here below is one. How did I get this architecture? Maybe in my dream! Haha. Someday you will get one too. Until that, let's follow the architecture below:
  ![Task 3 ANN architecture](figures/nn-1.png)
  * **Input layer** will have 11 units as the dimension of training set: `X_train_scaled` (i.e, number of columns = 11).
  * **First hidden layer** will have 5 neurons, each with "Rectified Linear Unit (`ReLU`)" as activation function.
  * **Second hidden layer** will have 4 neurons, each with "`ReLU`" as activation function.
  * **Output layer** will have just 1 neuron, with `sigmoid` activation function. 
    * The reason behind a single neuron with `sigmoid` activation at the output layer is that the output of this neuron gives the probability score of the target outcome: "Exited" True or False. If the output neuron produces a value above 0.5, we will say the neural network predicted "True"; otherwise, "False". This is the beauty of using the sigmoid function at the output layer: we can interpret the output value of the neuron as a probability score.
* The architecture will come to life when you initiate the training process with training data.
  * The training process needs a **gradient descent based optimizer**, and a convex looking **loss function**.
  * For this task, let's choose the `adam` optimizer, and the `binary_crossentropy` as the loss function.
### Step 2: The Training process
* Let's start the training process with the training dataset, `X_train_scaled`.
  * Gradient descent based optimization updates run in iterations. When the iterations have processed every training sample once, we say that `1 epoch` has passed. Let's continue the training for `25 epochs`. But, you are welcome to run longer than this. There are, however, simpler ways to determine whether you should stop your training early.
    * (Optional) Can you extract information about optimization in each epoch? If so, draw an epoch-loss plot, where the X-axis shows epoch numbers and the Y-axis shows the `binary_crossentropy` loss value in that particular epoch.
* Don't forget to save the model into a file in the `saved_models/` directory so that you can re-use it later for further prediction. Let's give it a name: `model-ann-11-5-4-1-xx` with an extension of your choosing, where `xx` must be replaced by one of `{mc,pt,tf}`: `mc` denotes a micrograd based model, `pt` a pytorch model, and `tf` a tensorflow model.
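
One simple way to decide when to stop training early is to watch the per-epoch loss and stop once it has stopped improving. The sketch below is only one possible rule: the function name `should_stop_early`, the `min_delta` and `patience` values, and the toy loss values are all assumptions chosen for illustration.

```python
def should_stop_early(loss_history, min_delta=1e-3, patience=3):
    """Return True when none of the last `patience` epochs improved on the
    best earlier loss by at least `min_delta`."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent = loss_history[-patience:]
    return all(best_before - loss < min_delta for loss in recent)


# Hypothetical per-epoch losses: fast improvement first, then a plateau
toy_losses = [0.70, 0.62, 0.55, 0.5495, 0.5493, 0.5492]

stop_epoch = None
for epoch in range(1, len(toy_losses) + 1):
    if should_stop_early(toy_losses[:epoch]):
        stop_epoch = epoch
        break

print("Stopped at epoch:", stop_epoch)
```

Using a patience of several epochs makes the rule less sensitive to a single flat step than comparing only two consecutive loss values.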

### Step 3: The Evaluation

#### (part 3.1) Evaluating your model with the entire test dataset:

* Load your trained model `model-ann-11-5-4-1-xx` from the file, and have it predict the entire test set you have at hand (`X_test_scaled`). Luckily, for each test sample in the set, you also have the ground truth `Exited` value in `y_test`.
* Please report/print your model's predictive performance on the test set in terms of `accuracy`, `precision`, `recall`, and `F1 scores`.
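
The four requested metrics can all be derived from the confusion counts. The following is a minimal pure-Python sketch, with the function name `binary_classification_metrics` and the toy labels invented for illustration. In the notebook itself the equivalent `sklearn.metrics` functions can be used instead.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = Exited, 0 = stayed
y_true_toy = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred_toy = [1, 0, 0, 1, 0, 1, 0, 0]

toy_metrics = binary_classification_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```

In this toy case accuracy is higher than recall, which shows why accuracy alone can hide missed churners when the classes are imbalanced.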

#### (part 3.2) Evaluating your model with 1 test sample with known Exited value

* Here is a single test sample for which we know the ground truth `Exited` value:

| CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember |EstimatedSalary | Exited |
| :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: |          ---: |
| 55443322 | Reynolds |709|Germany|Male|30|9|115479.48|2|1|1|134732.99|0|

* Load your trained model `model-ann-11-5-4-1-xx` from the file, and have it predict the test sample above. Please don't forget to preprocess this test sample so that it is compliant with the input and model requirements.
* Please report whether it predicts a 0 or 1 for the `Exited` target, and also comment whether your model makes a mistake or predicts correctly.

#### (part 3.3) Evaluating your model with 1 test sample without known Exited value

* Here is a single test sample for which we **do not know** the ground truth `Exited` value:

| CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember |EstimatedSalary | 
| :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: |
| 55443323 | Nguyen |603|France|Female|76|20|123456.78|5|1|1|55000.00|


* Load your trained model `model-ann-11-5-4-1-xx` from the file, and have it predict the test sample above. Please don't forget to preprocess this test sample so that it is compliant with the input and model requirements.
* Please report whether it predicts a 0 or 1 for the `Exited` target. Can you comment on this data sample whether your model captured the pattern in the population?



The rationale for the proposed ANN is as follows:

Input Layer: The 11 units correspond to the 11 columns of `X_train_scaled`, i.e., the customer features (excluding CustomerId, Surname, and Exited) after one-hot encoding. These attributes describe the customer's behavior and profile, which may influence their decision to leave the bank, and they provide the foundational data required for predicting churn.

First Hidden Layer: The first hidden layer with 5 neurons is a design choice that represents a level of abstraction derived from the input data. The use of ReLU (Rectified Linear Unit) activation function is suitable for introducing non-linearity into the model, allowing it to learn complex patterns in the data. The number of neurons (5) is fairly arbitrary but represents a balance between model complexity and computational efficiency. It's enough to begin processing the patterns from inputs without being overly complex.

Second Hidden Layer: The second hidden layer with 4 neurons continues the process of abstraction and pattern recognition in the data. Again, employing ReLU ensures the model can capture non-linear relationships. The decrease in neurons (from 5 to 4) from the first to the second hidden layer further refines the representations without making the network too wide. This slight reduction potentially reduces the risk of overfitting on the training data.

Output Layer: The output layer consists of a single neuron with a sigmoid activation function, which is a standard approach for binary classification problems like churn prediction (target variable Exited is either 0 or 1). The sigmoid function outputs a probability score between 0 and 1, indicating the likelihood of a customer exiting. A threshold of 0.5 is commonly used to classify outcomes as True (churn) or False (no churn), making the output easily interpretable as a probability.
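
To get a feel for how small this 11-5-4-1 network is, the number of trainable parameters can be counted layer by layer. This is a quick sketch, and the helper name `count_parameters` is made up for illustration. It assumes fully connected layers with one bias per neuron.

```python
layer_sizes = [11, 5, 4, 1]  # input, first hidden, second hidden, output


def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:])]


per_layer = count_parameters(layer_sizes)
total_parameters = sum(per_layer)
print(per_layer, total_parameters)
```

With fewer than a hundred parameters the model is cheap to train, which fits the stated aim of balancing model complexity against computational efficiency.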

Training Process
* Optimizer: The Adam optimizer, widely used for training deep learning models, combines the strengths of AdaGrad and RMSProp to enhance stochastic gradient descent. It keeps exponentially decaying averages of past gradients (momentum) and of past squared gradients, and uses them to adapt the learning rate of each parameter. This makes it computationally efficient and well suited to large datasets and to noisy or sparse gradients.
* Loss Function: Binary cross entropy is an appropriate loss function for binary classification problems. It measures the distance between the distribution of the predictions and the true distribution of the target variable. For churn prediction in this data, it quantifies how well the model predicts the actual customer churn status, making it an appropriate choice for optimizing the model during training.
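
As a small illustration of the loss described above, binary cross entropy for a label $y$ and predicted probability $p$ is $-[y\log p + (1-y)\log(1-p)]$, averaged over samples. The sketch below uses invented toy predictions, and the clipping constant `eps` is an assumption added only to avoid `log(0)`.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Same two labels, three hypothetical levels of prediction quality
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
uncertain = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(confident_right, uncertain, confident_wrong)
```

The loss grows as the predicted probability moves away from the true label, and confident mistakes are penalized most heavily.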

Overall, the performance and the choice of hyperparameters such as the number of neurons in the hidden layers may need to be fine-tuned based on the actual data and through validation techniques to achieve optimal results.



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import numpy as np

y_train_tensor = torch.FloatTensor(y_train.values).view(-1, 1)
print(y_train_tensor)

In [None]:
# Re-calculate percentage of missing values in each column to verify that the imputation was successful
def calculate_nan_percentage(df):
    nan_percentages = df.isnull().mean() * 100
    return nan_percentages

# Calculate the percentage of missing values for each column in the dataset
missing_percentages = calculate_nan_percentage(X_train_scaled)

missing_percentages


In [None]:
print(X_train_scaled.head())

**PyTorch approach for part 3.1: Evaluating your model with the entire test dataset:**

- First we will load our trained model `model-ann-11-5-4-1-xx` from the file, and have it predict the entire test set we have at hand (`X_test_scaled`). Luckily, for each test sample in the set, we also have the ground truth `Exited` value in `y_test`.

- We then report/print the model's predictive performance on the test set in terms of accuracy, precision, recall, and F1 scores.

### Information relevant to the ANN architecture, training, and evaluation

- import torch.nn as nn imports the neural network module from PyTorch (torch.nn). This module contains all the building blocks for creating neural networks, such as layers (e.g., linear, convolutional) and activation functions.

- import torch.nn.functional as F imports the functional API from PyTorch, which provides functions for operations like activation functions (ReLU, sigmoid, etc.), loss functions, and convolution operations. It's a stateless alternative to the nn module.

- import torch.optim as optim imports the optimization module from PyTorch (torch.optim). This module includes various optimization algorithms for training neural networks, such as SGD, Adam, etc.

- y_train_tensor = torch.FloatTensor(y_train.values).view(-1, 1) This line converts the y_train data into a PyTorch tensor of type FloatTensor, which is necessary for computations in PyTorch models. The .values extracts the values from a pandas Series or DataFrame as a NumPy array. The view(-1, 1) reshapes the tensor to ensure it has a single column, making it suitable for operations expecting a specific input shape (e.g., the loss function). The -1 in the view function infers the appropriate size for that dimension based on the original size, ensuring the total number of elements remains constant.

- ReLU (Rectified Linear Unit) is a nonlinear activation function used in neural networks. Its primary role is to introduce nonlinearity into the network, allowing the model to learn complex patterns in the data. The ReLU function is defined as f(x)=max(0,x), meaning it outputs the input directly if it is positive; otherwise, it outputs zero. It is applied to the output of neurons in intermediate layers of the network. For example, after a linear transformation in a layer, ReLU can be applied to each neuron's output before passing it to the next layer. ReLU helps mitigate the vanishing gradient problem, accelerates the convergence of stochastic gradient descent compared to sigmoid or tanh functions, and is computationally efficient. Limitations: It can lead to the dying ReLU problem, where a neuron outputs zero for all inputs and gradients cannot flow through it during backpropagation.

- Adam (Adaptive Moment Estimation) is an optimization algorithm used to minimize the loss function during the training of neural networks. Its role is to update the network weights iteratively based on training data to reduce the difference between the predicted output and the actual label.  Adam combines ideas from two other extensions of stochastic gradient descent: AdaGrad, which adapts the learning rate for each parameter, and RMSProp, which uses a moving average of squared gradients to normalize the gradient.

It is used during the training phase of the model, specifically for updating the weights after each batch of data is processed.

Adam is known for its effectiveness in practice, especially on problems with large datasets or many parameters. It adapts the learning rate of each parameter during training, is computationally efficient, and has low memory requirements.
Limitations: Despite its popularity, Adam might not converge to the optimal solution under certain conditions, and tuning its hyperparameters (learning rate, beta values) is crucial for optimal performance.
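
To make the description of Adam more concrete, here is a minimal single-parameter sketch of the update rule (moving averages of the gradient and the squared gradient, with bias correction). The function name `adam_minimize`, the toy objective $(w-3)^2$, and the hyperparameter values are assumptions chosen for illustration. In the notebook, `torch.optim.Adam` does this work.

```python
import math


def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    """Run Adam on a single scalar parameter and return the visited values."""
    w, m, v = w0, 0.0, 0.0
    trajectory = [w]
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        trajectory.append(w)
    return trajectory


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
trajectory = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print("after 1 step:", trajectory[1])
print("final value:", trajectory[-1])
```

Note that the very first step has a size close to `lr` regardless of the gradient's magnitude, because the gradient is divided by the square root of its own squared average.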

ReLU and Adam serve complementary purposes in the construction and training of ANNs, with ReLU focusing on the model's internal data processing and Adam on optimizing the learning process.
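
The ReLU definition f(x)=max(0,x) and the dying ReLU limitation mentioned earlier can be seen in a few lines of plain Python. The helper names and the toy pre-activation values are illustrative assumptions.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU (taken as 0 at x == 0, a common convention)."""
    return 1.0 if x > 0 else 0.0


pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_outputs = [relu(z) for z in pre_activations]
relu_grads = [relu_grad(z) for z in pre_activations]

print(relu_outputs)
print(relu_grads)
```

For non-positive inputs both the output and the gradient are zero. A neuron whose pre-activation is negative for every training sample therefore receives no gradient signal, which is the dying ReLU problem.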

### Problem 3.1 Solution

#### The code chunk below addresses the steps defined as: 

- Step 1: The ANN architecture
- Step 2: The Training process
- Step 3: The Evaluation: Model evaluation with the entire test dataset

In [None]:
#Your solution using PyTorch implementing the training process with stopping criteria

# Here class ChurnModel defines our neural network architecture.

# X_train_scaled and y_train are our features and labels respectively
# Code to define the model using PyTorch
# Layers are defined as:
# - Input layer will have 11 units as the dimension of training set: X_train_scaled (i.e., number of columns = 11).
# - First hidden layer will have 5 neurons, each with "Rectified Linear Unit (ReLU)" as activation function.
# - Second hidden layer will have 4 neurons, each with "ReLU" as activation function.
# - Output layer will have just 1 neuron, with "sigmoid" activation function.
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import numpy as np
from torch.optim import Adam


class ChurnModel(nn.Module):
    def __init__(self):
        super(ChurnModel, self).__init__()
        self.layer1 = nn.Linear(11, 5)
        self.layer2 = nn.Linear(5, 4)
        self.output = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = torch.sigmoid(self.output(x))  # Ensure output is between 0 and 1
        return x

#  Our X_train_scaled and y_train are already defined and the former is scaled properly
X_train_tensor = torch.FloatTensor(X_train_scaled.values)
y_train_tensor = torch.FloatTensor(y_train.values).view(-1, 1)

# Initialize the model, loss function, and optimizer
model = ChurnModel()
criterion = nn.BCELoss()  # Binary Cross Entropy Loss
optimizer = Adam(model.parameters(), lr=0.001)  # Learning rate can be adjusted


epochs = 25
loss_values = []
# Stop once the loss changes by less than this between consecutive epochs.
# Note: with one full-batch update per epoch and a small learning rate, per-epoch
# changes can be tiny, so this threshold may end training after only a few epochs.
early_stopping_threshold = 0.001
for epoch in range(epochs):
    optimizer.zero_grad()
    # Full-batch forward pass; the sigmoid in forward() already keeps outputs in [0, 1]
    outputs = model(X_train_tensor)

    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()
    loss_values.append(loss.item())
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

    # Compare with the previous epoch as soon as two loss values are available
    if len(loss_values) > 1 and abs(loss_values[-2] - loss_values[-1]) < early_stopping_threshold:
        print("Early stopping criteria met")
        break


# Plotting the loss values
plt.plot(range(1, len(loss_values)+1), loss_values, marker='o')
plt.title('Epoch vs. Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

# Save the model
model_type = 'pt'  # 'mc' for micrograd, 'pt' for PyTorch, 'tf' for TensorFlow
saved_models_dir = 'saved_models'
if not os.path.exists(saved_models_dir):
    os.makedirs(saved_models_dir)
model_filename = f'model-ann-11-5-4-1-{model_type}.pth'
model_save_path = os.path.join(saved_models_dir, model_filename)
torch.save(model.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")

# Load the saved weights into a fresh model instance for evaluation
model = ChurnModel()
model.load_state_dict(torch.load(model_save_path))
model.eval()

# Evaluation on the Test set
# X_test_scaled and y_test are defined and scaled
X_test_tensor = torch.FloatTensor(X_test_scaled.values)
y_test_tensor = torch.FloatTensor(y_test.values).view(-1, 1)
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = (y_pred >= 0.5).float()
    accuracy = accuracy_score(y_test_tensor.numpy(), y_pred_class.numpy())
    precision = precision_score(y_test_tensor.numpy(), y_pred_class.numpy())
    recall = recall_score(y_test_tensor.numpy(), y_pred_class.numpy())
    f1 = f1_score(y_test_tensor.numpy(), y_pred_class.numpy())

print("Model Evaluation Results on Test Set:")
print(f" - The model's Accuracy: {accuracy:.4f} indicates the proportion of the total number of predictions that were correct.")
print(f" - The model's Precision: {precision:.4f} reflects the proportion of positive identifications that were actually correct. It shows how reliable the model is when it predicts a sample as positive.")
print(f" - The model's Recall: {recall:.4f} measures the proportion of actual positives that were identified correctly. It highlights the model's capability to find all relevant cases within the dataset.")
print(f" - The model's F1 Score: {f1:.4f} is the harmonic mean of precision and recall, providing a balance between them. It's particularly useful when the class distribution is uneven.")

# Example: Including the model's architectural summary if needed
print("\nModel Architectural Summary:")
print(" - Input Layer: 11 units (matching the dimensionality of the input features)")
print(" - First Hidden Layer: 5 neurons with ReLU activation")
print(" - Second Hidden Layer: 4 neurons with ReLU activation")
print(" - Output Layer: 1 neuron with Sigmoid activation (for binary classification)")


#### Solution for (part 3.2) Evaluating your model with 1 test sample with known Exited value

* Here is a single test sample for which we know the ground truth `Exited` value:

| CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember |EstimatedSalary | Exited |
| :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: | :---        |    :----:   |          ---: |          ---: |
| 55443322 | Reynolds |709|Germany|Male|30|9|115479.48|2|1|1|134732.99|0|

* Load your trained model `model-ann-11-5-4-1-xx` from the file, and have it predict the test sample above. Please don't forget to preprocess this test sample so that it is compliant with the input and model requirements.
* Please report whether it predicts a 0 or 1 for the `Exited` target, and also comment whether your model makes a mistake or predicts correctly.

In [None]:
# Both 'column_transformer' and 'scaler' were previously defined and fitted on the training data
# Also assuming 'model' is loaded and ready for prediction

# Define the single sample data as a dictionary
sample_data = {
    'CreditScore': [709],
    'Geography': ['Germany'],
    'Gender': ['Male'],
    'Age': [30],
    'Tenure': [9],
    'Balance': [115479.48],
    'NumOfProducts': [2],
    'HasCrCard': [1],
    'IsActiveMember': [1],
    'EstimatedSalary': [134732.99]
}

# Convert the sample data to a DataFrame
sample_df = pd.DataFrame(sample_data)

# Apply the transformations to the sample
# Note: 'fit_transform' is used on training data, but for new data, we use 'transform' only
sample_transformed = column_transformer.transform(sample_df)

# The 'scaler' object (a MinMaxScaler) is already fitted to the training data, so we only call 'transform'
sample_scaled = scaler.transform(sample_transformed)

# Convert the scaled sample back to a DataFrame (for readability)
# Extract the feature names for the transformed columns
feature_names = column_transformer.get_feature_names_out()

# Create a DataFrame for the transformed and scaled sample
sample_scaled_df = pd.DataFrame(sample_scaled, columns=feature_names)

# Convert to PyTorch tensor before prediction
sample_tensor = torch.FloatTensor(sample_scaled_df.values)

### Could this have been done without making the dataframe?  Which is better?

# Convert to PyTorch tensor
# sample_tensor = torch.FloatTensor(sample_scaled)

# Ensure the model is in evaluation mode
model.eval()

# Predict the 'Exited' value for the sample
with torch.no_grad():
    prediction = model(sample_tensor)
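    predicted_class = int((prediction >= 0.5).item())  # Convert to binary class (0 or 1)

# Print the prediction and compare it with the known ground truth (Exited = 0 for this sample)
print(f"Predicted 'Exited' value: {predicted_class}")
print("The model predicts this sample correctly." if predicted_class == 0 else "The model makes a mistake on this sample.")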

#### Analogous implementation of our ANN with Micrograd

**Approach using Micrograd**

Micrograd is a minimalistic automatic differentiation library primarily used for educational purposes and developed by Andrej Karpathy. Unlike PyTorch or TensorFlow, Micrograd is much simpler and doesn't come with built-in functionalities for layers, optimizers, or loss functions in a full deep learning framework. Implementing an ANN with Micrograd involves creating these components from scratch.  We will use the approach presented by Karpathy in his Micrograd video.

Micrograd is designed to work with its own Value class, which is a tiny scalar autograd engine and does not support tensors or vectorized operations directly. This means that the ANN is implemented with scalar operations.  This is not efficient for real-world applications but serves as a good learning exercise.

For implementation with Micrograd, we do not need to implement the `adam` optimizer. Please use the default optimizer found in the code repo.
MANDATORY TASK: However, you need to add the `sigmoid` activation capability to follow along with task of the assignment.
OPTIONAL TASK: If you are interested, you may implement binary cross-entropy loss function, instead of the loss function developed in the code repo. Once again, this part is optional.
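
Adding `sigmoid` to a scalar autograd engine means defining both its forward value and its local derivative s(1 - s), so that gradients can flow through it during backpropagation. The sketch below uses a tiny stand-in class named `Scalar`, which is hypothetical and written only for this illustration. It is not micrograd's `Value` class, although it follows the same general idea of attaching a backward function to each result.

```python
import math


class Scalar:
    """Tiny illustrative stand-in for a scalar autograd value (not micrograd's Value)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.data))
        out = Scalar(s, (self,))

        def _backward():
            # Local derivative of the sigmoid: s * (1 - s)
            self.grad += s * (1.0 - s) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order, starting from this output
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


w, x = Scalar(2.0), Scalar(0.5)
y = (w * x).sigmoid()
y.backward()

# Finite-difference check of dy/dw
h = 1e-6
numeric_grad_w = (1 / (1 + math.exp(-(2.0 + h) * 0.5)) - 1 / (1 + math.exp(-(2.0 - h) * 0.5))) / (2 * h)
print(w.grad, numeric_grad_w)
```

The analytic gradient matching the finite-difference estimate is a quick sanity check that the backward rule for the sigmoid is wired correctly.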

#### My Micrograd approach will include:

- **Activation Functions**: The sigmoid function is defined for use in the output layer. This matches the activation function used in your PyTorch model.
    
- **Loss Function**: The binary_cross_entropy function implements the binary cross-entropy loss, suitable for binary classification tasks.
   
- **Model Components**: The Neuron, Layer, and SimpleANN classes are defined to construct the neural network. Each layer's neurons are initialized with small random weights for diversity in the learning process.
    
- **Training Loop**: Demonstrates how to iterate over epochs, compute forward passes, calculate loss, perform backpropagation, and update weights manually without an optimizer like Adam.

- **Data Conversion:** Convert train and test data into a format compatible with Micrograd using Value objects. This preprocessing step is crucial for ensuring that the data is in the correct format for training with Micrograd. By wrapping each number in the Value class, we enable Micrograd to compute gradients with respect to these numbers, which is essential for the optimization step in training neural networks. This conversion should be applied to both the training and testing datasets before using them for model training or evaluation with Micrograd.


In [None]:

#%pip install micrograd

The code below is a comprehensive implementation of a neural network using Micrograd, a minimalist automatic differentiation library. 

The code encompasses:

- the definition of activation functions,
- a binary cross-entropy loss function,
- the neural network architecture (including neurons, layers, and the overall model),
- data preprocessing, and
- the training loop.

**Key Points:**

- Activation Functions: sigmoid and relu are used for non-linear transformations within the network.

- Binary Cross-Entropy Loss: Used for binary classification tasks, provides a measure of the difference between the predicted values and the actual labels.

- Neuron and Layer: Building blocks of the network, with neurons forming layers, and layers forming the network.

- SimpleANN: Represents the neural network model, consisting of a sequence of layers.

- Data Preprocessing: Converts the raw data into a format compatible with Micrograd, ensuring each data point is wrapped in a Value object for automatic differentiation.

- Training Loop: Iterates over epochs, shuffles data, creates mini-batches, performs forward and backward passes, and updates model parameters.
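
The shuffling and mini-batch creation mentioned in the training loop point can be sketched independently of any autograd library. The generator name `iterate_minibatches`, the batch size, and the toy data below are assumptions made for illustration.

```python
import random


def iterate_minibatches(features, labels, batch_size, rng):
    """Yield (batch_features, batch_labels) pairs covering every sample once per epoch."""
    indices = list(range(len(features)))
    rng.shuffle(indices)  # a new random order for each epoch
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield [features[i] for i in batch_idx], [labels[i] for i in batch_idx]


# Hypothetical toy data: 10 samples with 2 features each
toy_features = [[float(i), float(i) * 2.0] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
rng = random.Random(0)  # fixed seed so the shuffle is reproducible

batches = list(iterate_minibatches(toy_features, toy_labels, batch_size=4, rng=rng))
batch_sizes = [len(batch_x) for batch_x, _ in batches]
print(batch_sizes)
```

Each epoch then consists of one forward pass, backward pass, and parameter update per mini-batch. The last batch is smaller when the number of samples is not a multiple of the batch size.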

The model = SimpleANN() line is crucial as it initializes the neural network model that we train with our dataset. This setup forms a complete pipeline for training a neural network using Micrograd, from data preprocessing to model evaluation.

I have developed my micrograd approach, and the full code addressing each of these `Key Points` is below.

I am aware that testing and improving such a comprehensive code snippet in one go is challenging, especially without the ability to run and verify the output directly in this environment. To effectively test and validate each part of the code, I will break it down into smaller, manageable sections.

Here's how I will approach this:

In [None]:
from micrograd.engine import Value
import numpy as np
import math

# Define the sigmoid activation function
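def sigmoid(x):
    # Sigmoid for non-linear activation; the output lies in (0, 1).
    # The result is built as a Value whose parent is x, with the local derivative
    # s * (1 - s) attached via _backward (the same pattern micrograd uses for relu()),
    # so that gradients flow back to x when backward() is called.
    s = 1.0 / (1.0 + math.exp(-x.data))
    out = Value(s, (x,), 'sigmoid')

    def _backward():
        x.grad += s * (1.0 - s) * out.grad
    out._backward = _backward
    return out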

# Define the ReLU activation function
def relu(x):
    # ReLU (Rectified Linear Unit) for non-linear activation, ensures output is non-negative
    return x if x.data > 0 else Value(0)

def test_activation_functions():
    # Test inputs for activation functions
    inputs = [Value(-1), Value(0), Value(1)]

    # Expected outputs for sigmoid: approximately sigmoid(-1), sigmoid(0), sigmoid(1)
    expected_sigmoid_outputs = [0.26894142, 0.5, 0.73105858]

    # Expected outputs for ReLU: relu(-1), relu(0), relu(1)
    expected_relu_outputs = [0, 0, 1]

    # Test Sigmoid
    sigmoid_outputs = [sigmoid(x).data for x in inputs]
    assert all(abs(so - eo) < 1e-5 for so, eo in zip(sigmoid_outputs, expected_sigmoid_outputs)), "Sigmoid function test failed"

    # Test ReLU
    relu_outputs = [relu(x).data for x in inputs]
    assert all(ro == eo for ro, eo in zip(relu_outputs, expected_relu_outputs)), "ReLU function test failed"

    print("Activation function tests passed.")

# Run the test
test_activation_functions()
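
The test above checks only the forward outputs of the activations. Training also depends on their derivatives, so a quick finite-difference check can help catch a wrong or missing gradient. The following is a minimal sketch on plain floats, independent of Micrograd; `sigmoid_float`, `sigmoid_grad`, and `numeric_grad` are hypothetical helper names that do not appear elsewhere in this notebook.

```python
import math

def sigmoid_float(z):
    """Plain-float sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative of the sigmoid: s * (1 - s), with s = sigmoid(z)."""
    s = sigmoid_float(z)
    return s * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare the analytic and numeric derivatives at the same test inputs
for z in (-1.0, 0.0, 1.0):
    analytic = sigmoid_grad(z)
    numeric = numeric_grad(sigmoid_float, z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

If the two columns agree to several decimal places, the analytic derivative is consistent with the forward function. The same comparison could be made against the `.grad` of a Micrograd `Value` after calling `backward()`.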


In [None]:
# Smoke test: run one forward pass through the network on fake data.
# `relu`, `sigmoid`, and `np` come from the activation-function cell above.
from micrograd.engine import Value

# Neuron, Layer, and SimpleANN classes for the 11-5-4-1 network
class Neuron:
    def __init__(self, nin):
        self.w = [Value(0.01 * np.random.randn()) for _ in range(nin)]
        self.b = Value(0)

    def __call__(self, x):
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return act

class Layer:
    def __init__(self, nin, nout, activation=relu):
        self.neurons = [Neuron(nin) for _ in range(nout)]
        self.activation = activation

    def __call__(self, x):
        outs = [self.activation(neuron(x)) for neuron in self.neurons]
        return outs

class SimpleANN:
    def __init__(self):
        self.layer1 = Layer(11, 5, activation=relu)
        self.layer2 = Layer(5, 4, activation=relu)
        self.output_layer = Layer(4, 1, activation=sigmoid)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.output_layer(x)
        return x[0]

    def parameters(self):
        params = []
        for layer in [self.layer1, self.layer2, self.output_layer]:
            for neuron in layer.neurons:
                params += neuron.w
                params.append(neuron.b)
        return params

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0

# Generating fake data for a single input
np.random.seed(42) # For reproducible random values
X_fake = [Value(x) for x in np.random.randn(11)] # Simulate a single input with 11 features

# Initialize the model and perform a forward pass with the fake data
model = SimpleANN()
try:
    y_pred = model.forward(X_fake)
    print("Forward pass successful. Output:", y_pred)
except TypeError as e:
    print("Error during forward pass:", str(e))


In [None]:
X_train_scaled.head()

In [None]:
import math
import numpy as np
from micrograd.engine import Value

# Natural log as a Micrograd node, so gradients can reach the prediction
def value_log(x, epsilon=1e-15):
    # Clamp the input to [epsilon, 1 - epsilon] to avoid log(0)
    clamped = max(min(x.data, 1 - epsilon), epsilon)
    out = Value(math.log(clamped), (x,), 'log')

    def _backward():
        x.grad += out.grad / clamped
    out._backward = _backward
    return out

# Binary cross entropy function for Value objects
def binary_cross_entropy(y_pred, y_true):
    # loss = -[y * log(p) + (1 - y) * log(1 - p)], kept on the autograd graph
    return -(y_true * value_log(y_pred) + (1 - y_true) * value_log(1 - y_pred))

class Neuron:
    def __init__(self, nin):
        self.w = [Value(0.01 * np.random.randn()) for _ in range(nin)]
        self.b = Value(0)

    def __call__(self, x):
        # x is a list of Value objects; perform element-wise multiplication and sum
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return act

class Layer:
    def __init__(self, nin, nout, activation=relu):
        self.neurons = [Neuron(nin) for _ in range(nout)]
        self.activation = activation

    def __call__(self, x):
        # Process each neuron's output through the activation function
        outs = [self.activation(neuron(x)) for neuron in self.neurons]
        return outs

class SimpleANN:
    def __init__(self):
        self.layer1 = Layer(11, 5, activation=relu)
        self.layer2 = Layer(5, 4, activation=relu)
        self.output_layer = Layer(4, 1, activation=sigmoid)

    def forward(self, x):
        # Process input through each layer
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.output_layer(x)
        return x[0]

    def parameters(self):
        params = []
        for layer in [self.layer1, self.layer2, self.output_layer]:
            for neuron in layer.neurons:
                params += neuron.w
                params.append(neuron.b)
        return params

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0

# Assuming X_train_scaled and y_train are defined elsewhere
X_train_scaled_values = [[Value(x) for x in sample] for sample in X_train_scaled.values]

# np.asarray(...).ravel() yields the label values whether y_train is a Series,
# a single-column DataFrame, or an array; iterating a DataFrame directly would
# yield its column names instead.
y_train_values = [Value(float(y)) for y in np.asarray(y_train).ravel()]

model = SimpleANN()
epochs = 100
lr = 0.01
batch_size = 32

def create_batches(X, y, batch_size):
    for i in range(0, len(X), batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

# Training loop: one gradient step per mini-batch
for epoch in range(epochs):
    permutation = np.random.permutation(len(X_train_scaled_values))
    X_train_shuffled = [X_train_scaled_values[i] for i in permutation]
    y_train_shuffled = [y_train_values[i] for i in permutation]
    
    batches = create_batches(X_train_shuffled, y_train_shuffled, batch_size)
    epoch_losses = []
    for X_batch, y_batch in batches:
        model.zero_grad()
        # Mean loss over the mini-batch, kept as a Value so it can be backpropagated
        losses = [binary_cross_entropy(model.forward(x), y) for x, y in zip(X_batch, y_batch)]
        batch_loss = sum(losses) / len(losses)
        batch_loss.backward()
        # Gradient descent update using the gradients of the batch-mean loss
        for p in model.parameters():
            p.data -= lr * p.grad
        
        epoch_losses.append(batch_loss.data)
    
    avg_loss = sum(epoch_losses) / len(epoch_losses)
    print(f"Epoch {epoch + 1}, Average Loss: {avg_loss:.4f}")
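
For a sigmoid output ŷ = σ(z) and the binary cross-entropy loss L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)], the gradient of the loss with respect to the pre-activation z simplifies to ŷ − y. That is the value the backward pass should deliver at the output neuron, so it makes a useful reference when debugging training. The sketch below verifies the identity numerically on plain floats; the helper names (`bce_float`, `loss_from_logit`, and so on) are hypothetical and the sample inputs are made up for illustration.

```python
import math

def sigmoid_float(z):
    """Plain-float sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_float(y_hat, y, eps=1e-15):
    """Binary cross-entropy for one sample, with y_hat clamped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def loss_from_logit(z, y):
    """Loss as a function of the output neuron's pre-activation z."""
    return bce_float(sigmoid_float(z), y)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Illustrative (pre-activation, label) pairs
for z, y in [(-2.0, 1.0), (0.5, 0.0), (1.5, 1.0)]:
    analytic = sigmoid_float(z) - y  # closed form: y_hat - y
    numeric = numeric_grad(lambda t: loss_from_logit(t, y), z)
    print(f"z={z:+.1f} y={y:.0f}  analytic={analytic:+.6f}  numeric={numeric:+.6f}")
```

A non-zero gradient here is also a reminder that the loss must stay connected to the prediction in the computation graph; if every parameter's `.grad` is zero after `backward()`, the graph was cut somewhere between the loss and the weights.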

In [None]:
# # Save the model
# import pickle
# model_filename = 'model-ann-11-5-4-1-micrograd.pkl'
# model_save_path = os.path.join(saved_models_dir, model_filename)
# with open(model_save_path, 'wb') as f:
#     pickle.dump(model, f)
# print(f"Model saved to {model_save_path}")

# Evaluate performance on the test set
# np.asarray gives the rows whether X_test_scaled is a DataFrame or an array
X_test_scaled_values = [[Value(x) for x in sample] for sample in np.asarray(X_test_scaled)]  # Convert to list of Value objects
# Create a function to predict output using our trained Micrograd model. 
# Since output layer uses a sigmoid activation, output will be in range [0, 1]. 
# As done with PyTorch, consider outputs greater than or equal to 0.5 as class 1 (positive class) and less than 0.5 as class 0 (negative class).
def predict(model, X):
    predictions = []
    for x in X:
        pred = model.forward(x)
        pred_class = 1 if pred.data >= 0.5 else 0
        predictions.append(pred_class)
    return predictions
# Calculate Performance Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

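# Convert y_test to integer values for comparison
# (np.asarray(...).ravel() works for a Series, a single-column DataFrame, or an array)
y_test_int = [int(y) for y in np.asarray(y_test).ravel()]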

# Predict on the test set
y_pred = predict(model, X_test_scaled_values)

# Calculate metrics
accuracy = accuracy_score(y_test_int, y_pred)
precision = precision_score(y_test_int, y_pred)
recall = recall_score(y_test_int, y_pred)
f1 = f1_score(y_test_int, y_pred)

# Print the metrics
print("Model Evaluation Results on Test Set:")
print(f" - The model's Accuracy: {accuracy:.4f}")
print(f" - The model's Precision: {precision:.4f}")
print(f" - The model's Recall: {recall:.4f}")
print(f" - The model's F1 Score: {f1:.4f}")

# Model Architectural Summary
print("\nModel Architectural Summary:")
print(" - Input Layer: 11 units (matching the dimensionality of the input features)")
print(" - First Hidden Layer: 5 neurons with ReLU activation")
print(" - Second Hidden Layer: 4 neurons with ReLU activation")
print(" - Output Layer: 1 neuron with Sigmoid activation (for binary classification)")
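
To make the reported metrics easier to interpret, here is a minimal sketch of how accuracy, precision, recall, and F1 follow from the confusion-matrix counts. It uses plain Python only; `binary_metrics` is a hypothetical helper, the toy labels are made up for illustration, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 3 TP, 1 FP, 1 FN, 3 TN
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

For churn data, where customers who exit are usually the minority class, precision and recall tend to be more informative than accuracy alone, since a model that always predicts "stays" can still score a high accuracy.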

In [None]:
#%pip install numpy scikit-learn


## Task 4: (10 points)

* Repeat Task 3 with the following new architecture of the neural network:

![Task 4 ANN architecture](figures/nn-2.png)

* Input layer will still have 11 units, matching the dimension of the training set (i.e., number of columns = 11).
* Hidden-layer-1: 8 neurons, with relu activation.
* Hidden-layer-2: 8 neurons, with relu activation.
* Hidden-layer-3: 8 neurons, with relu activation.
* Output-layer: 1 neuron with sigmoid.



In [None]:
#Your solution using Micrograd

In [None]:
#Your solution using either PyTorch or Tensorflow


## Task 5: (10 points)

* Repeat Task 3 with the following new architecture of the neural network:

![Task 5 ANN architecture](figures/nn-3.png)

* Input layer will still have 11 units, matching the dimension of the training set (i.e., number of columns = 11).
* Hidden-layer-1: 8 neurons, with relu activation.
* Hidden-layer-2: 4 neurons, with relu activation.
* Hidden-layer-3: 2 neurons, with relu activation.
* Output-layer: 1 neuron with sigmoid.
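
When comparing the architectures, it can help to know how many trainable parameters each one has, since every fully connected layer contributes `n_in * n_out` weights plus `n_out` biases. The sketch below counts them from the layer widths alone; `count_parameters` is a hypothetical helper, and the labels assume the 11-5-4-1 network from the earlier code is the Task 3 model.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

architectures = {
    "Task 3 (11-5-4-1)": [11, 5, 4, 1],
    "Task 4 (11-8-8-8-1)": [11, 8, 8, 8, 1],
    "Task 5 (11-8-4-2-1)": [11, 8, 4, 2, 1],
}

for name, sizes in architectures.items():
    print(f"{name}: {count_parameters(sizes)} trainable parameters")
# Task 3 (11-5-4-1): 89 trainable parameters
# Task 4 (11-8-8-8-1): 249 trainable parameters
# Task 5 (11-8-4-2-1): 145 trainable parameters
```

The same number should match `len(model.parameters())` for a Micrograd model built with those layer widths, which gives a quick sanity check that the layers are wired as intended.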



In [None]:
#Your solution using Micrograd

In [None]:
#Your solution using either PyTorch or Tensorflow


# That's all. Thanks for your work... :)

Now, do the following to earn credit --

0. Setting up:
    - Make sure you actually experiment with this assignment. I would (again) encourage you to review the compute resources available to you in this [video](https://youtu.be/XEfP9YTDdBM?si=NS0H46lOJZhI9b4H).
    - It's always better to work in a Python virtual environment. Here are some resources for creating and working in virtual environments: [[win+mac+ubuntu](https://ashiskb.info/posts/2022/09/biswas/blog-1-python-venv/)][[windows+gpu](https://ashiskb.info/posts/2023/08/biswas/blog-win10-tensorflow/)][[ubuntu+gpu](https://ashiskb.info/posts/2023/08/biswas/blog-ubuntu-tensorflow/)]
1. Please make sure to execute each cell in this Jupyter notebook, then hit the 'Save' button or use the "File > Save and Checkpoint" menu option to save the notebook.
2. Submit this notebook, "2024-Spring-DL-PA1-assignment.ipynb", to the Canvas "Assignment 1" entry.
3. Done!
