##### The cell below is for you to keep track of the libraries used and install those libraries quickly
##### Ensure that the proper library names are used and the syntax of `%pip install PACKAGE_NAME` is followed

In [None]:
#%pip install pandas
#%pip install matplotlib
# add commented pip installation lines for packages used as shown above for ease of testing
# the line should be of the format %pip install PACKAGE_NAME

## **DO NOT CHANGE** the filepath variable
##### Instead, create a folder named 'data' in your current working directory and
##### have the .parquet file inside that. A relative path *must* be used when loading data into pandas

In [3]:
# Can have as many cells as you want for code
import pandas as pd
filepath = "./data/catB_train.parquet"
# the initialised filepath MUST be a relative path to a folder named data that contains the parquet file



### **ALL** Code for machine learning and dataset analysis should be entered below.
##### Ensure that your code is clear and readable.
##### Comments and Markdown notes are advised to direct attention to pieces of code you deem useful.

## Inspiration
Our initial analysis of the data showed that most of the relevant columns are binary. We therefore started with a decision tree classifier for its interpretability. However, we later found the dataset to be heavily imbalanced, with only 700 positive cases out of 18,992 rows, which led us to switch to a Random Forest model. We chose this model because building multiple decision trees and aggregating their predictions reduces the risk of overfitting and, combined with class weights, copes better with imbalanced classes. In the end, this model helped produce a more accurate solution for predicting customer satisfaction in the insurance acquisition process.

## How we built it 
We built the product by trying out different types of classification algorithms. After testing several, we found that a Decision Tree was well suited to our choice of TRUE/FALSE columns. As we tested further, we realised its accuracy was not up to our standard and started looking for variants of the algorithm. That is how we found the Random Forest classification method.

After selecting the columns we wanted and splitting 'stat_flag' into three separate TRUE/FALSE columns, we dropped all rows containing NAs and trained the model with class weights suited to the problem. We then evaluated it on a randomly selected test dataframe, using a confusion matrix to check our accuracy.
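
The class weights mentioned above can be related to the class counts. The sketch below computes inverse-frequency ("balanced") weights from the counts quoted in the Inspiration section (700 positives out of 18,992 rows). It is only an illustration: the helper name `balanced_class_weights` is hypothetical, the counts are those before any rows were dropped, and the model itself uses the fixed weights `{0: 1, 1: 17}`.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Counts quoted in the text: 700 positives out of 18,992 rows.
counts = {0: 18992 - 700, 1: 700}
weights = balanced_class_weights(counts)

# Ratio of the positive weight to the negative weight (equals 18292 / 700).
relative_weight = weights[1] / weights[0]
print(weights)
print(f"positive class weighted {relative_weight:.1f}x the negative class")
```

This is the same formula scikit-learn applies when `class_weight="balanced"` is passed, so a hand-picked weight such as 17 can be compared against it.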

In this state, the test results for the dataset were:

- True Negative (TN): 3058
- False Positive (FP): 186
- False Negative (FN): 120
- True Positive (TP): 32

We measured the share of predicted positives that were correct, which is usually called precision rather than accuracy:

Precision = True Positives / (False Positives + True Positives)

Therefore, the precision was 32/218 = 0.147 (rounded).
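
For reference, the usual evaluation metrics can all be computed directly from the four counts above. The following is a minimal sketch; the function name `summarise_confusion` is ours, chosen for illustration. Note that TP / (TP + FP) is conventionally called precision, while accuracy is (TP + TN) divided by the total.

```python
def summarise_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Return accuracy, precision, recall and F1 for a binary confusion matrix."""
    total = tn + fp + fn + tp
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the test results reported above.
metrics = summarise_confusion(tn=3058, fp=186, fn=120, tp=32)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch reports an accuracy of about 0.910 but a precision of about 0.147 and a recall of about 0.211, which shows how accuracy alone can hide weak performance on the rare positive class.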

## Challenges we ran into 
Choosing a suitable machine learning model was one of our main challenges, as some of the models we considered at the beginning did not work well with the dataset. Finding the best model by trial and error took a significant amount of time, which was difficult given the short time we were given. There were also very few positive cases in the dataset, which made it harder to split the data into training and test sets that both contained enough of them.
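
One common way to ease the splitting problem described above is a stratified split, which keeps the share of positive cases the same in the training and test sets. The sketch below uses synthetic labels (not our dataset) and scikit-learn's `stratify` argument; our submitted code uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data for illustration only: 40 positives in 1,000 rows.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))
y = np.array([1] * 40 + [0] * 960)

# stratify=y preserves the positive rate in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"positive rate overall: {y.mean():.3f}")
print(f"positive rate train:   {y_train.mean():.3f}")
print(f"positive rate test:    {y_test.mean():.3f}")
```

Without stratification, a small test set can end up with too few positives by chance, which makes any metric computed on it unstable.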

## What we learned 
Over the past 3 days, my team and I were able to apply what we have learnt in the classroom, experimenting with different classifiers and figuring out the strengths and limitations of each. We tried several classifiers, such as KNN, SVM, Decision Tree and Random Forest, and found through trial and error that, for example, a Decision Tree was not the most suitable in this context due to the skewed nature of the dataset. We eventually settled on Random Forest, as comparing the accuracy of the models showed it to be the most suitable for this dataset.
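
A comparison like the one described above could be organised as a small loop over candidate models. This is a sketch on synthetic imbalanced data from `make_classification`, scored with cross-validated F1 rather than accuracy; the dataset, hyperparameters and scoring choice are assumptions for illustration, not the experiments we ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

candidates = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(class_weight="balanced"),
    "Decision Tree": DecisionTreeClassifier(class_weight="balanced", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

# Mean F1 over 5 stratified folds for each candidate.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    for name, clf in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean F1 = {score:.3f}")
```

Scoring on F1 instead of accuracy keeps a model from looking good simply by predicting the majority class.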

We also learned how different factors, such as status and purchase history, can drastically affect a customer's propensity to purchase. We saw this by training with different sets of variables and comparing performance on the training data with performance on the test data. This helped us understand how each factor in the insurance industry can influence the consumer's decision, and taught us the importance of data analytics and machine learning in this industry.
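
One way to quantify how much each factor contributes is the impurity-based `feature_importances_` attribute of a fitted random forest. The sketch below uses synthetic binary flags with hypothetical column names, where the target is constructed to depend mostly on one flag; it is not output from our model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical flag columns, for illustration only.
X = pd.DataFrame(
    rng.integers(0, 2, size=(500, 3)), columns=["flag_a", "flag_b", "flag_c"]
)

# The target follows flag_a, with about 10% of labels flipped as noise.
noise = rng.random(500) < 0.1
y = np.where(noise, 1 - X["flag_a"], X["flag_a"])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; larger values mean the flag drove more of the splits.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances are quick to obtain but only describe the fitted model, so they are best read as a rough ranking rather than a causal statement.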


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_parquet(filepath)

# Treat a missing target as "no purchase" so the target column is 0 or 1
df["f_purchase_lh"] = df["f_purchase_lh"].fillna(0)

# Binary flag columns used as features
feature_columns = ['flg_substandard', 'flg_is_borderline_standard', 'flg_is_revised_term', 'flg_is_rental_flat',
                   'flg_has_health_claim', 'flg_has_life_claim', 'flg_gi_claim', 'flg_is_proposal', 'flg_with_preauthorisation',
                   'flg_is_returned_mail']
target_column = 'f_purchase_lh'

# Keep only the features and the target, then remove rows with NA
df_selected = df[feature_columns + [target_column]].dropna()

# Separate the features from the prediction target
X = df_selected[feature_columns]
y = df_selected[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest Classifier that weights the rare positive class
# 17 times more heavily than the negative class
model = RandomForestClassifier(class_weight={0: 1, 1: 17})

# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the performance of the model
conf_matrix = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[3058  186]
 [ 120   32]]


## The cell below is **NOT** to be removed
##### The function is to be amended so that it accepts the given input (dataframe) and returns the required output (list).
##### It is recommended to test the function out prior to submission
-------------------------------------------------------------------------------------------------------------------------------
##### The hidden_data parsed into the function below will have the same layout columns wise as the dataset *SENT* to you
##### Thus, ensure that steps taken to modify the initial dataset to fit into the model are also carried out in the function below

In [5]:
def testing_hidden_data(hidden_data: pd.DataFrame) -> list:
    # Use the same binary flag columns the model was trained on (no target column here)
    selected_columns = ['flg_substandard', 'flg_is_borderline_standard', 'flg_is_revised_term', 'flg_is_rental_flat',
                        'flg_has_health_claim', 'flg_has_life_claim', 'flg_gi_claim', 'flg_is_proposal', 'flg_with_preauthorisation',
                        'flg_is_returned_mail']
    df_selected = hidden_data[selected_columns]

    # Assumption: missing flags are filled with 0 instead of dropping the row,
    # so that every input row receives exactly one prediction
    df_selected = df_selected.fillna(0)

    # model.predict returns a NumPy array; convert it to the required list
    result = model.predict(df_selected).tolist()
    return result

##### Cell to check testing_hidden_data function

In [None]:
# This cell should output a list of predictions.
test_df = pd.read_parquet(filepath)
test_df = test_df.drop(columns=["f_purchase_lh"])
print(testing_hidden_data(test_df))