##### The cell below is for you to keep track of the libraries used and install those libraries quickly
##### Ensure that the proper library names are used and the syntax of `%pip install PACKAGE_NAME` is followed

In [None]:
#%pip install pandas 
#%pip install matplotlib
# add commented pip installation lines for packages used as shown above for ease of testing
# the line should be of the format %pip install PACKAGE_NAME 
# %pip install joblib


## **DO NOT CHANGE** the filepath variable
##### Instead, create a folder named 'data' in your current working directory and 
##### have the .parquet file inside that. A relative path *must* be used when loading data into pandas

In [None]:
# Can have as many cells as you want for code
import pandas as pd
filepath = "./data/catB_train.parquet" 
# the initialised filepath MUST be a relative path to a folder named data that contains the parquet file

### **ALL** Code for machine learning and dataset analysis should be entered below. 
##### Ensure that your code is clear and readable.
##### Comments and Markdown notes are advised to direct attention to pieces of code you deem useful.

<u>Data Cleaning:</u>

Firstly, all columns with zero or only one unique value across all observations were discarded, as such variables will not serve as relevant predictors for the target column.  
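
A minimal sketch of this filter, assuming a pandas DataFrame. The toy frame and the helper name `drop_constant_columns` are illustrative only and are not part of the dataset or the submitted code.

```python
import numpy as np
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that have at most one distinct non-null value."""
    # nunique() ignores NaN by default, so an all-NaN column counts as 0 unique values
    keep = [col for col in df.columns if df[col].nunique() > 1]
    return df[keep].copy()


toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan],  # 0 unique values
    "constant": [1, 1, 1],                    # 1 unique value
    "varies": [0, 1, 0],                      # more than 1 unique value, kept
})
print(drop_constant_columns(toy).columns.tolist())
```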

For categorical data such as race_desc and annual_income_est, one-hot encoding was employed to create a dummy column for each category so that the downstream models can use the information. Date columns such as min_occ_date and cltdob_fix were converted to a number of years and an age respectively, numerical values that can be used directly in modeling. 
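
The two transformations can be sketched on a toy frame as follows. The rows and the fixed `reference_date` are made up for illustration (the notebook uses the date on which it is run), and integer division by 365 is only an approximation of age in years.

```python
import pandas as pd

toy = pd.DataFrame({
    "race_desc": ["Chinese", "Malay", None],
    "cltdob_fix": ["1980-06-15", "1995-01-02", "1970-12-31"],
})

# One-hot encode the category; missing values are bucketed as "Others" first
toy["race_desc"] = toy["race_desc"].fillna("Others")
dummies = pd.get_dummies(toy["race_desc"], prefix="is").astype(int)

# Convert a date of birth into whole years relative to a fixed reference date
reference_date = pd.Timestamp("2024-01-01")  # illustrative value only
age = (reference_date - pd.to_datetime(toy["cltdob_fix"])).dt.days // 365

encoded = pd.concat(
    [toy.drop(columns="race_desc"), dummies, age.rename("age")], axis=1
)
print(encoded)
```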

Columns representing the various insurance product metrics are keyed by unique policy identifiers, and we had no access to information about the exact policies behind those identifiers, so no real insight can be derived from them individually. We therefore grouped these columns first by product metric (ape_, sumins_, prempaid_) and then by policy type (gi, ltc, grp, inv, lh), and appended the mean amount of each subcategory as a new column to obtain a more generalized view of each individual's inclinations. The same approach was applied to the purchase and lapse metrics such as f_ever_bought_, n_months_last_bought_, lapse_ape_ and n_months_since_lapse_. Grouping similar information into meaningful categories should help the model's performance and prevent the noise from each unique policy from unduly influencing its predictions. 
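
A small sketch of the prefix-based grouping, assuming columns named `<metric>_<type>_<policy id>`. The policy identifiers and amounts below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "ape_gi_aaa111": [100.0, 0.0],      # hypothetical policy identifiers
    "ape_gi_bbb222": [300.0, 50.0],
    "ape_lh_ccc333": [10.0, 20.0],
    "sumins_lh_ccc333": [1000.0, 0.0],
})

metrics = ["ape", "sumins", "prempaid"]
policy_types = ["gi", "ltc", "grp", "inv", "lh"]

for metric in metrics:
    for policy_type in policy_types:
        prefix = f"{metric}_{policy_type}"
        group_cols = [c for c in toy.columns if c.startswith(prefix)]
        if not group_cols:
            continue
        # Row-wise mean across every policy in this (metric, type) group
        toy[f"consolidated_{prefix}"] = toy[group_cols].mean(axis=1)
        toy = toy.drop(columns=group_cols)

print(toy)
```

Each customer ends up with one averaged column per (metric, policy type) pair instead of one column per individual policy.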

We replaced the NaN values in categorical predictors such as flg_substandard and flg_is_borderline_standard with their respective column means to minimize the impact of these missing values on the model. 

Next, we removed the columns hh_size_est, ctrycode_desc and clntnum. hh_size_est was dropped because a similar column (hh_size) with higher precision already exists, whereas ctrycode_desc and clntnum are nominal categorical variables with no intrinsic meaning and hence add no value to either data visualization or prediction. 

We also observed that the target column (f_purchase_lh) contained values of 1 and NaN, and replaced the NaN values with 0, indicating that these individuals did not purchase life / health insurance in the next 3 months.

After transforming the columns described above, we applied a MinMaxScaler to scale the different features to a comparable range. 
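
For reference, min-max scaling maps each feature to the range [0, 1] via (x - min) / (max - min). The sketch below applies that formula with plain pandas on made-up values; `MinMaxScaler` with its default `feature_range=(0, 1)` computes the same column-wise transform.

```python
import pandas as pd

# Made-up values on very different scales
toy = pd.DataFrame({"age": [20, 40, 60], "hh_size": [1.0, 2.0, 5.0]})

# Column-wise (x - min) / (max - min); assumes no column is constant
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```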

<u>Data visualisation:</u>

Comparing the boxplots of the recency_lapse variable for customers who purchased life insurance products within the next three months and those who did not, customers who purchased generally tended to have more recent lapses. This could suggest that a recent lapse indicates the customer is still active and hence more likely to purchase insurance, whereas customers whose payments lapsed a long time ago are more likely to be inactive. 
![boxplot](./images/boxplot.jpeg) 

The bar plot of whether customers have a valid direct mailing address also appears to be a good indicator of the target. The overwhelming majority of customers who did not provide a valid mailing address, or any mailing address at all (as indicated by the grey bar), did not purchase life or health insurance in the next 3 months. This could be understood as these customers not anticipating or hoping for any follow-up regarding a policy, and hence not warranting further effort from the insurance agents. 
![is_valid_dm](./images/is_valid_dm.jpeg) 


<u>Feature selection:</u>

For feature selection, we used SelectKBest from the sklearn.feature_selection module to select the top k features that are most relevant to the target variable. The scoring function we used was mutual_info_classif, which is commonly used for classification tasks with both categorical and numerical variables. In this context, it quantifies the dependency between each feature and the target, assigning a higher score to features that are more informative about the target variable. Through cross validation with 5 splits, we found that roughly 16 features were required to yield the best results for model prediction. 
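
A minimal sketch of SelectKBest with mutual_info_classif on synthetic data. The data, the choice of `k=2` and the fixed `random_state` are assumptions for illustration and do not reproduce our actual feature set.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 500
y = rng.integers(0, 2, size=n_samples)

# Synthetic features: column 0 tracks the target closely, the rest are pure noise
informative = y + rng.normal(scale=0.1, size=n_samples)
noise = rng.normal(size=(n_samples, 4))
X = np.column_stack([informative, noise])

# Fix random_state so the mutual information estimates are reproducible
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func, k=2).fit(X, y)

print(selector.scores_)                    # one score per feature
print(selector.get_support(indices=True))  # indices of the k retained features
```

The informative column receives a much higher score than the noise columns, which is the behaviour the selector relies on when ranking the real features.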

<u>Modeling:</u>

From the data visualization figure, we noticed that an overwhelming 96.1% of the target variable f_purchase_lh were NaNs (1s: 710, NaNs: 17282), indicating that the vast majority of customers in the dataset did not purchase life or health insurance products within the next three months. On a heavily imbalanced dataset like this, machine learning models tend to be biased towards the majority class, leading to poor predictions for the minority class. We tackled this issue by first randomly oversampling the minority class (the 1s of f_purchase_lh) until it reached 10% of the size of the majority class, and thereafter applying the Synthetic Minority Oversampling Technique (SMOTE) to generate more synthetic data via interpolation between existing minority samples. This led to a much more balanced dataset suitable for training. Downsampling was not chosen because it risks removing important information in the process. 
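
The two resampling steps can be illustrated with a little arithmetic and a hand-written interpolation. This is a sketch of the idea only, not imblearn's implementation (SMOTE picks the neighbour from the k nearest minority samples), and it uses the raw class counts quoted above, which may differ slightly once rows are dropped during cleaning. The helper `interpolate` and the two sample points are hypothetical.

```python
import numpy as np

# Class counts quoted above (before any rows are dropped during cleaning)
n_minority, n_majority = 710, 17282

# Step 1: random oversampling up to a minority/majority ratio of 0.1
target_after_random = int(n_majority * 0.1)
extra_copies_needed = target_after_random - n_minority
print(target_after_random, extra_copies_needed)


# Step 2: SMOTE-style interpolation between a minority point and a neighbour
def interpolate(sample: np.ndarray, neighbour: np.ndarray,
                rng: np.random.Generator) -> np.ndarray:
    """Return a synthetic point on the segment between sample and neighbour."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbour - sample)


rng = np.random.default_rng(42)
a = np.array([0.2, 0.8])
b = np.array([0.4, 0.6])
synthetic = interpolate(a, b, rng)
print(synthetic)
```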

<u>Evaluation of model performance:</u>

We used the train_test_split function with the standard 80-20 split to hold out data for testing. We considered different models, both linear and nonlinear, such as LogisticRegression, SVM, MLPClassifier and RandomForestClassifier, and ultimately decided on LinearSVC, mainly because its regularisation parameter helps to guard against overfitting. We also tuned the hyperparameters by running cross validation, with 5 folds, over different values of the regularisation parameter, C, of the LinearSVC model. After this, we used the classification report to evaluate our model on metrics such as precision, recall and the F1 score. 
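
A hedged sketch of this tuning loop using GridSearchCV on synthetic data. The candidate values of C, the `scoring="f1"` choice and the `max_iter` setting are assumptions for illustration, not the exact settings we ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the cleaned, resampled feature matrix
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Candidate values of C are illustrative only
param_grid = {"C": [0.01, 0.03, 0.1, 1.0]}
search = GridSearchCV(
    LinearSVC(random_state=42, max_iter=10000),
    param_grid,
    cv=5,           # 5-fold cross validation on the training split
    scoring="f1",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # F1 on the held-out 20%
```

Smaller values of C apply stronger regularisation, so the search trades off fit on the training folds against generalisation to the validation fold.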

![MLPClassifier_confusion_matrix](./images/MLPClassifier_confusion_matrix.jpg) 
![LinearSVC_confusion_matrix](./images/LinearSVC_confusion_matrix.jpg) 
![DecisionTree_confusion_matrix](./images/DecisionTree_confusion_matrix.jpg) 

![LogisticRegression_confusion_matrix](./images/LogisticRegression_confusion_matrix.jpeg) 

![LogisticRegression_classification_report](./images/LogisticRegression_classification_report.jpeg) 

![LinearSVC_classification_report](./images/LinearSVC_classification_report.jpeg) 

To decide which model was best suited to our dataset, we trained several models and compared their performance. With reference to the confusion matrices, the MLP Classifier with 100 layers gave us the best true positive results with 2767 true positives, followed by LinearSVC at 2887. As for true negatives, logistic regression gave the best result at 89, followed by LinearSVC at 84. After careful consideration, we chose LinearSVC as it was consistent at identifying both positive and negative results, giving a weighted average F1 score of 0.88 as per the generated classification report, which was higher than the other models we tried. Furthermore, it was relatively efficient to train, taking just 20 seconds compared to 2 minutes for the MLP Classifier.
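
For readers unfamiliar with the weighted average F1 score in the classification report, the sketch below computes it from a confusion matrix. The counts are hypothetical and are not taken from our results.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 2x2 confusion matrix, rows = actual, columns = predicted
#              pred 0  pred 1
# actual 0       90      10
# actual 1        8      12
f1_class0 = f1_from_counts(tp=90, fp=8, fn=10)   # class 0 treated as "positive"
f1_class1 = f1_from_counts(tp=12, fp=10, fn=8)   # class 1 treated as "positive"

# Weighted average: each class's F1 weighted by its support (number of actual samples)
support0, support1 = 100, 20
weighted_f1 = (f1_class0 * support0 + f1_class1 * support1) / (support0 + support1)
print(round(f1_class0, 3), round(f1_class1, 3), round(weighted_f1, 3))
```

Because the weights follow class support, the weighted average is dominated by the majority class, so it is worth reading the per-class F1 scores alongside it.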

In [None]:
import numpy as np
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler
from decimal import Decimal
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline


def clean_data(input_data, isTraining):
    # Work on a copy so the caller's DataFrame is not modified in place
    test_df = input_data.copy()

    # cols to replace nan to mean
    columns_to_replace = [
        'flg_substandard', 'flg_is_borderline_standard', 'flg_is_revised_term',
        'flg_is_rental_flat', 'flg_has_health_claim', 'flg_has_life_claim', 
        'flg_gi_claim', 'flg_is_proposal', 'flg_with_preauthorisation',
        'flg_is_returned_mail', 'is_consent_to_mail', 'is_consent_to_email',
        'is_consent_to_call', 'is_consent_to_sms', 'is_valid_dm', 'is_valid_email',
        'is_housewife_retiree', 'is_sg_pr', 'is_class_1_2',
        'is_dependent_in_at_least_1_policy', 'f_ever_declined_la',
        'flg_latest_being_lapse', 'flg_latest_being_cancel', 'f_hold_839f8a',
        'f_hold_e22a6a', 'f_hold_d0adeb', 'f_hold_c4bda5', 'f_hold_ltc',
        'f_hold_507c37', 'f_hold_gi', 'f_ever_bought_ltc_1280bf',
        'f_ever_bought_grp_6fc3e6', 'f_ever_bought_grp_de05ae',
        'f_ever_bought_inv_dcd836', 'f_ever_bought_grp_945b5a',
        'f_ever_bought_grp_6a5788', 'f_ever_bought_ltc_43b9d5',
        'f_ever_bought_grp_9cdedf', 'f_ever_bought_lh_d0adeb',
        'f_ever_bought_grp_1581d7', 'f_ever_bought_grp_22decf',
        'f_ever_bought_lh_507c37', 'f_ever_bought_lh_839f8a',
        'f_ever_bought_inv_e9f316', 'f_ever_bought_grp_caa6ff',
        'f_ever_bought_grp_fd3bfb', 'f_ever_bought_lh_e22a6a',
        'f_ever_bought_grp_70e1dd', 'f_ever_bought_grp_e04c3a',
        'f_ever_bought_grp_fe5fb8', 'f_ever_bought_grp_94baec',
        'f_ever_bought_grp_e91421', 'f_ever_bought_lh_f852af',
        'f_ever_bought_lh_947b15', 'f_ever_bought_32c74c', 'f_elx',
        'f_mindef_mha', 'f_retail', 'flg_affconnect_show_interest_ever',
        'flg_affconnect_ready_to_buy_ever', 'flg_affconnect_lapse_ever',
        'flg_hlthclaim_839f8a_ever', 'recency_hlthclaim_839f8a',
        'flg_hlthclaim_14cb37_ever', 'giclaim_cnt_success',
        'recency_giclaim_success', 'giclaim_cnt_unsuccess',
        'recency_giclaim_unsuccess', 'flg_gi_claim_29d435_ever',
        'flg_gi_claim_058815_ever', 'flg_gi_claim_42e115_ever',
        'flg_gi_claim_856320_ever'
    ]
    for col in columns_to_replace:
        test_df[col] = test_df[col].fillna(test_df[col].mean())
    test_df.drop(columns=['hh_size_est', "ctrycode_desc",'clntnum'], inplace=True)

    if isTraining:
        #Fill na for target column with 0s
        test_df['f_purchase_lh']=test_df['f_purchase_lh'].fillna(0)

    # Cleaning annual income col
    test_df['annual_income_est'] = test_df['annual_income_est'].fillna('None')

    # Perform one-hot encoding
    df_encoded = pd.get_dummies(test_df['annual_income_est'], dummy_na=True)

    # Drop the last two dummy columns (with sorted categories, these are expected
    # to be the 'None' placeholder and the NaN indicator)
    df_encoded_ = df_encoded.iloc[:, :-2]
    test_df.drop('annual_income_est', axis=1, inplace=True)

    result_df = pd.concat([test_df, df_encoded_], axis=1)

    # Clean race
    result_df['race_desc'] = result_df['race_desc'].fillna('Others')
    df_dummies = pd.get_dummies(result_df['race_desc'], prefix='is_')

    # Concatenate the dummy columns with the original DataFrame
    df = pd.concat([result_df, df_dummies], axis=1)

    # Drop the original categorical column if needed
    df.drop('race_desc', axis=1, inplace=True)


    # Change dob to age by years
    current_date = datetime.now()
    mean_date = df[df['cltdob_fix'] != 'None']['cltdob_fix'].astype('datetime64[ns]').mean()
    if isTraining:
        df = df[df['cltdob_fix'] != 'None']
    else:
        df['cltdob_fix'] = df['cltdob_fix'].replace('None', mean_date)
    df['cltdob_fix'] = (current_date - pd.to_datetime(df['cltdob_fix'])).dt.days // 365
    df=df.rename(columns={'cltdob_fix': 'age'})

    mean_date = df[df['min_occ_date'] != 'None']['min_occ_date'].astype('datetime64[ns]').mean()
    # Change occ date to number of years
    if isTraining:
        df = df[df['min_occ_date'] != 'None']
    else:
        df['min_occ_date'] = df['min_occ_date'].replace('None', mean_date)
    df['min_occ_date'] = (current_date - pd.to_datetime(df['min_occ_date'])).dt.days // 365
    df=df.rename(columns={'min_occ_date': 'years_since_first_int'})

    # Clean gender
    df_dummy = pd.get_dummies(df['cltsex_fix'], prefix='is')


    # Concatenate the dummy columns with the original DataFrame
    df = pd.concat([df, df_dummy], axis=1)

    # Drop the original categorical column if needed
    df.drop('cltsex_fix', axis=1, inplace=True)
    df["is_Female"] = df["is_Female"].astype(int)
    df["is_Male"] = df["is_Male"].astype(int)
    
    # Clean Customer Status 
    df_dummy = pd.get_dummies(df['clttype'], prefix='is')
    # Concatenate the dummy columns with the original DataFrame
    df = pd.concat([df, df_dummy], axis=1)

    # Drop the original categorical column if needed
    df.drop('clttype', axis=1, inplace=True)
    df["is_P"] = df["is_P"].astype(int)
    df["is_G"] = df["is_G"].astype(int)
    df["is_C"] = df["is_C"].astype(int)

    # Clean stat_flag
    df_dummy = pd.get_dummies(df['stat_flag'], prefix='is')
    # Concatenate the dummy columns with the original DataFrame
    df = pd.concat([df, df_dummy], axis=1)



    # Drop the original categorical column if needed
    df.drop('stat_flag', axis=1, inplace=True)
    df["is_ACTIVE"] = df["is_ACTIVE"].astype(int)
    df["is_LAPSED"] = df["is_LAPSED"].astype(int)
    df["is_MATURED"] = df["is_MATURED"].astype(int)
    #grouping all the APEs, SUMINs, PREMPAIDs together

    p0 = ["lapse_", ""]
    p1 = ['ape', 'sumins', 'prempaid']
    types = ['_gi','_ltc', '_grp', '_inv', '_lh']
    for k in p0:
        for i in p1:
            for j in types:
                prefix = k + i + j
                grp_cols = [col for col in df.columns if col.startswith(prefix)]
                if grp_cols:
                    avg = df[grp_cols].mean(axis=1)
                    df = pd.concat([df, avg.rename("consolidated_"+prefix)], axis=1)
                    df.drop(columns=grp_cols, axis=1, inplace=True)

    for i in p1:
        grp_cols = [col for col in df.columns if col.startswith(i)]
        df.drop(columns=grp_cols, inplace=True)

    p1 = ['ltc', 'gi', 'lh', 'grp']
    for i in p1:
        prefix = "f_ever_bought_" + i
        grp_cols = [col for col in df.columns if col.startswith(prefix)]
        if grp_cols:
            avg = df[grp_cols].mean(axis=1).fillna(0)
            df = pd.concat([df, avg.rename("consolidated_" + prefix )], axis=1)
            df.drop(columns=grp_cols, axis=1, inplace=True)

    grp_cols = [col for col in df.columns if col.startswith("f_ever_bought")]
    df.drop(columns=grp_cols, inplace=True)

    p0 = ['last_bought_','since_lapse_']
    p1 = ['ltc', 'gi', 'lh', 'grp', 'inv']
    for i in p0:
        for j in p1:
            prefix = "n_months_" + i+ j
            grp_cols = [col for col in df.columns if col.startswith(prefix)]
            if grp_cols:
                avg = df[grp_cols].mean(axis=1).fillna(0)
                df = pd.concat([df, avg.rename("consolidated_" + prefix )], axis=1)
                df.drop(columns=grp_cols, axis=1, inplace=True)

    grp_cols = [col for col in df.columns if col.startswith("n_months")]
    df.drop(columns=grp_cols, inplace=True)

    thresh=0.9
    for col in df.columns:
        if df[col].isna().any():
            # Drop columns that are mostly missing; mean-impute the rest
            if df[col].isna().mean() > thresh:
                df.drop(columns=col,inplace=True)
            else:
                df[col]=df[col].astype(float)
                df[col]=df[col].fillna(df[col].mean())
        
        elif df[col].dtype == object:
            df[col]=df[col].astype(float)

    # MinMaxScaler, applied to the consolidated columns and other numeric features
    scaler=MinMaxScaler()
    columns=['consolidated_prempaid_grp',
            'consolidated_prempaid_lh',
            'consolidated_n_months_last_bought_ltc',
            'consolidated_n_months_last_bought_gi',
            'consolidated_n_months_last_bought_lh',
            'consolidated_n_months_last_bought_grp',
            'consolidated_n_months_last_bought_inv',
            'consolidated_n_months_since_lapse_ltc',
            'consolidated_n_months_since_lapse_lh',
            'consolidated_n_months_since_lapse_grp',
            'consolidated_n_months_since_lapse_inv',
            'consolidated_ape_ltc',
            'consolidated_ape_grp',
            'consolidated_ape_inv',
            'consolidated_ape_lh',
            'consolidated_lapse_ape_ltc',
            'consolidated_lapse_ape_grp',
            'consolidated_lapse_ape_inv',
            'consolidated_lapse_ape_lh',
            'consolidated_sumins_ltc',
            'consolidated_sumins_grp',
            'consolidated_sumins_inv',
            'consolidated_sumins_lh',
            'age',
            'hh_20',
            'pop_20',
            'hh_size'
    ]
    df[columns]=scaler.fit_transform(df[columns])

    if isTraining:
        # Dropping those with only 0 or 1 unique values
        columns_to_drop1 = [col for col in df.columns if df[col].nunique() <= 1]
        df = df.drop(columns=columns_to_drop1)
    else:
        df = df.fillna(df.mean())
    return df

# Oversampling, returns X_data, y_data
def oversample_data(x_input,y_input):
    over1=SMOTE(sampling_strategy='auto')
    over2=RandomOverSampler(sampling_strategy=0.1)
    steps=[('o7',over2),('o1',over1)]
    pipeline=Pipeline(steps=steps)
    X,y=pipeline.fit_resample(x_input,y_input)
    return X, y

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif,chi2
from sklearn.svm import LinearSVC
import numpy as np

class Model:
    def __init__(self, pre_trained_model = None, selected_cols = None):
        # Compare against None with `is` so array-like arguments are handled safely
        if pre_trained_model is None:
            self.model = LinearSVC(random_state=42, C=0.028,class_weight='balanced')
        else:
            self.model = pre_trained_model
        self.selected_col = selected_cols

    def fit(self, input_data):
        # Clean and oversample data
        np.random.seed(42)
        data = clean_data(input_data, True)
        # Select best features
        y=data['f_purchase_lh']
        X=data.drop(columns='f_purchase_lh')
        selector = SelectKBest(mutual_info_classif, k=16)
        selector.fit(X, y)
        selected_columns_indices = selector.get_support(indices=True)
        selected_columns = X.columns[selected_columns_indices]
        print(selected_columns)
        self.selected_col = selected_columns
        # Keep only the selected features, preserving their column names
        X_new = X[selected_columns]
        x_final,y_final = oversample_data(X_new,y)
        self.model.fit(x_final, y_final)
        

    def predict(self, X_test):
        data = clean_data(X_test, False)
        data_to_test = data[self.selected_col]
        pred_y = self.model.predict(data_to_test)
        
        return pred_y
    


In [None]:
#TESTING FOR BEST K

# from sklearn.svm import LinearSVC
# from sklearn.feature_selection import SelectKBest, mutual_info_classif
# from sklearn.model_selection import train_test_split, cross_val_score
# k_values = [15,16,17,18,19,20] #after testing lower bounds and higher bounds
# cv_scores = []
# for k in k_values:
#     selector = SelectKBest(mutual_info_classif, k=k)
#     X_new = selector.fit_transform(X,y)
#     selected_columns_indices = selector.get_support(indices=True)
#     selected_columns = X.columns[selected_columns_indices]
#     print(selected_columns)
#     X_new = pd.DataFrame(X, columns=selected_columns)
#     xnew,ynew=pipeline.fit_resample(X_new,y)
#     X_train, X_test, y_train, y_test = train_test_split(xnew, ynew,test_size=0.2)
#     # Model
#     model = LinearSVC(random_state=42, C=0.026)
#     scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
#     mean_score = np.mean(scores)
        
#         # Append the mean score to the list of cross-validation scores
#     cv_scores.append(mean_score)
# # Find the value of k with the highest cross-validation score
# best_k = k_values[np.argmax(cv_scores)]
# print(best_k)

In [None]:
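#OUR FUNCTION FOR LEARNING CURVE 

# import matplotlib.pyplot as plt
# from sklearn.model_selection import StratifiedKFold, learning_curve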
# kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# train_sizes, train_scores, test_scores = learning_curve(
#     model, xnew, ynew, cv=kfold, scoring='accuracy', train_sizes=np.linspace(0.1, 1.0, 10)
# )

# train_scores = 1-np.mean(train_scores,axis=1)#converting the accuracy score to misclassification rate
# test_scores = 1-np.mean(test_scores,axis=1)#converting the accuracy score to misclassification rate
# plt.plot(train_sizes, train_scores, label='Training')
# plt.plot(train_sizes, test_scores, label='Cross-validation')
# plt.xlabel('Training Set Size')
# plt.ylabel('Misclassification rate')
# plt.legend(loc='best')
# plt.show()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import classification_report

test_df = pd.read_parquet(filepath)
X=test_df.drop(columns='f_purchase_lh')
y=test_df['f_purchase_lh']
y=y.fillna(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2) 
concatenated_train_data = pd.concat([X_train, y_train], axis=1)
model = Model()
# Fit on the training split only so that the held-out 20% stays unseen
model.fit(concatenated_train_data)


print(X_test.shape)
pred_y = model.predict(X_test)


y_test = y_test.fillna(0)

f1_test = f1_score(y_test, pred_y)

print('The f1 score for the testing data:', f1_test)

#Printing the confusion matrix
print(confusion_matrix(y_test, pred_y))

classification_rep = classification_report(y_test, pred_y)
print('Classification Report:\n', classification_rep)

## The cell below is **NOT** to be removed
##### The function is to be amended so that it accepts the given input (dataframe) and returns the required output (list). 
##### It is recommended to test the function out prior to submission
-------------------------------------------------------------------------------------------------------------------------------
##### The hidden_data parsed into the function below will have the same layout columns wise as the dataset *SENT* to you
##### Thus, ensure that steps taken to modify the initial dataset to fit into the model are also carried out in the function below

In [None]:
import joblib
import pickle
import numpy as np
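def load_model(training_data):
    # Load the persisted estimator and the feature names chosen during training
    pre_trained_model = joblib.load('model.joblib')
    with open('selected_col.pkl', 'rb') as file:
        selected_col = pickle.load(file)
    model = Model(pre_trained_model)
    model.selected_col = selected_col
    # Refit on the supplied training data (this re-runs feature selection)
    model.fit(training_data)
    return model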
    
def testing_hidden_data(hidden_data: pd.DataFrame) -> list:
    '''DO NOT REMOVE THIS FUNCTION.

The function accepts a dataframe as input and return an iterable (list)
of binary classes as output.

The function should be coded to test on hidden data
and should include any preprocessing functions needed for your model to perform. 
    
All relevant code MUST be included in this function.'''
    final_result = model.predict(hidden_data)
    result=final_result.tolist()
    return result

##### Cell to check testing_hidden_data function

In [None]:
test_df=pd.read_parquet(filepath)
test_df=test_df.drop(columns=['f_purchase_lh'])
print(testing_hidden_data(test_df))

### Please have the filename renamed and ensure that it can be run with the requirements above being met. All the best!