## Purpose and Observation

### Purpose

- Fit a model on 'Dataset_model'.
- Dataset: created by isolating the first policy bought by each customer (using policy_owner_number and RCD).

### Steps
- Oversampling and 3-fold cross-validation are performed.
- Iteration 1:
    - Train-test split to check feature importance
    - K-fold validation to check parameters
- Iteration 2:
    - K-fold with selected features
- Iteration 3:
    - Fine-tuning: deliberately increasing recall to lower false negatives.


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

from sklearn.ensemble import RandomForestClassifier
from statistics import mean

from sklearn.model_selection import StratifiedKFold

from sklearn.preprocessing import LabelEncoder


# Read Dataset

**Dataset used: Dataset_model. Merged and cleaned; continuous variables imputed with the mean; remaining records with NaN dropped. Records for the first RCD are then isolated, grouped by policy_owner_number.**
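
The upstream preparation is not part of this notebook, so the following is only a minimal sketch of how the "first policy per customer" step could look. The toy rows and the use of `sort_values` plus `drop_duplicates` are assumptions for illustration; only the column names `policy_owner_number`, `policy_number` and `RCD` come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration only
policies = pd.DataFrame({
    "policy_owner_number": [101, 101, 102, 103, 103],
    "policy_number": ["A1", "A2", "B1", "C1", "C2"],
    "RCD": pd.to_datetime(
        ["2015-03-01", "2012-07-15", "2018-01-10", "2016-05-05", "2019-09-09"]
    ),
})

# Sort by RCD, then keep the earliest row for each policy owner
first_policy = (
    policies.sort_values("RCD")
    .drop_duplicates("policy_owner_number", keep="first")
)

display_cols = ["policy_owner_number", "policy_number", "RCD"]
print(first_policy[display_cols])
```

`drop_duplicates` keeps whole rows, so every column of the retained record belongs to the same policy.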

In [None]:
#import dataset
dataset = pd.read_csv("Dataset_model.csv")

#convert RCD to datetime
dataset["RCD"] = pd.to_datetime(dataset["RCD"])

In [None]:
dataset.head()

# Descriptive Stats for reference

In [None]:
#show full values in describe instead of scientific notation
pd.options.display.float_format = "{:.2f}".format
np.set_printoptions(suppress = True)
#describe and info
with pd.option_context('display.max_columns', None):
    display(dataset.head())
    display(dataset.describe())
    dataset.info()  #info() prints directly and returns None, so no display() needed

# Dropping unneeded variables
- Dropping columns
    - 'Own_Education': keeping Own_Edu
    - 'own_occupation': keeping Occ_Profile, Occupation_Group, Occupation
    - 'policy_number', 'policy_owner_number': Identifiers
    - 'RCD': Datetime
    - 'Freq': used to obtain target variable

In [None]:
dataset.drop(['Own_Education', 'own_occupation',
              'policy_number', 'policy_owner_number', 
              'RCD', 
              'Freq'], axis = 1, inplace = True)

# Feature engineering


In [None]:
#round floats up (ceil) and convert to int where the column has no NaN
floatlist = list(dataset.select_dtypes('float').columns)

for col in floatlist:
    dataset.loc[:,col] = dataset.loc[:,col].apply(np.ceil)
    if dataset.loc[:,col].isna().sum() == 0:
        dataset[col] = dataset[col].astype('int64')

In [None]:
# Label-encode categorical variables

#create list of categorical (object dtype) columns
catcolsm = list(dataset.select_dtypes('object').columns)

#encoded integer -> original label, one dictionary per column
label_maps = {}

#function to encode the non-null values of one column in place (nulls stay NaN)
def labelencode(df, col):
    le = LabelEncoder()  #fresh encoder per column
    mask = df[col].notnull()
    df.loc[mask, col] = le.fit_transform(df.loc[mask, col])
    label_maps[col] = dict(zip(range(len(le.classes_)), le.classes_))
    return df

#labelencode
for col in catcolsm:
    labelencode(dataset, col)

#review results
display(dataset.head())

#display dictionaries created
for col in catcolsm:
    display(label_maps[col])



In [None]:
#Convert object dtype to category dtype
for col in catcolsm:
    dataset[col] = dataset[col].astype('category')

In [None]:
#Separating target variable
X1=dataset.drop(['target'],axis=1)
y1=dataset[['target']]
X1.info()

###### Oversampling
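
With a float `sampling_strategy`, `RandomOverSampler` treats the value as the desired minority-to-majority ratio after resampling (binary targets only). The sketch below reproduces that arithmetic in plain Python on made-up labels; `random_oversample_indices` is a hypothetical helper written for illustration, not part of imblearn.

```python
import random
from collections import Counter

def random_oversample_indices(labels, ratio, seed=0):
    """Return row indices after duplicating random minority rows until
    minority_count / majority_count reaches `ratio` (binary labels)."""
    counts = Counter(labels)
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    target_minor = int(ratio * n_major)
    rng = random.Random(seed)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority]
    n_extra = max(target_minor - n_minor, 0)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return list(range(len(labels))) + extra

# Hypothetical toy target: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
idx = random_oversample_indices(labels, ratio=0.5)
resampled = [labels[i] for i in idx]

print("Before oversampling:", Counter(labels))
print("After oversampling:", Counter(resampled))
```

Because the added minority rows are exact copies, resampling before a split lets the same record land in both training and test folds. Resampling only the training indices of each fold is one way to avoid that.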

In [None]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
oversample = RandomOverSampler(sampling_strategy=0.5, random_state= 211)

In [None]:
#oversample fit
X1_over, y1_over = oversample.fit_resample(X1,y1)

#check Counts after oversampling
print("Before oversampling:", Counter(y1['target']))
print("After oversampling:", Counter(y1_over['target']))

# Random Forest Classifier: Iteration 1

### 1. dataset split, feature importance

#### 1a. train test split

In [None]:
#fit model1
#took 5-6 mins to run

x_train, x_test, y_train,  y_test = train_test_split(X1_over, y1_over, test_size=0.2, random_state=0)
rf=RandomForestClassifier(n_estimators=100)
rf.fit(x_train, y_train)

In [None]:
#feature importances from the fitted RandomForestClassifier, sorted ascending
imp_dict = dict(sorted(zip(x_train.columns, rf.feature_importances_), key = lambda item: item[1]))

#bar plot with feature names on the x axis
plt.bar(range(len(imp_dict)), list(imp_dict.values()))
plt.xticks(range(len(imp_dict)), list(imp_dict.keys()), rotation = 90)
plt.show()

display(imp_dict)

**Important variables from RandomForestClassifier** (importance above or close to 0.05)
- premium, afyp, sum_assured, Owner_salary, city, CUST_prod_cat, DSTNAME, STATNAME, age: high importance
- Own_Edu, Occupation: moderate importance
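
As a rough illustration of how such a ranking can be read off a fitted forest, the sketch below puts `feature_importances_` into a labelled, sorted pandas Series and applies a 0.05 cutoff. The data is synthetic and the column names `f0`..`f7` are hypothetical; none of the numbers relate to the results above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; column names f0..f7 are hypothetical
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Label each importance with its column name and sort descending
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

# Keep features at or above the cutoff used in this notebook
important = importances[importances >= 0.05]

print(importances.round(3))
print("kept:", list(important.index))
```

Impurity-based importances sum to 1 across features, so a fixed cutoff such as 0.05 becomes stricter as the number of features grows.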



#### 1b. Kfold
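
The manual loop in this section collects train and test scores fold by fold. For comparison, here is a minimal sketch of the same idea with scikit-learn's `cross_validate`, run on synthetic data (`X_demo`, `y_demo` are stand-ins, not the notebook's dataset). It does not return confusion matrices or ROC curves, so it would complement the loop rather than replace it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, weights=[0.7, 0.3], random_state=0
)

clf_demo = RandomForestClassifier(
    n_estimators=50, max_depth=15, min_samples_leaf=2, random_state=12
)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# One call fits the model on every fold and scores train and test parts
scores = cross_validate(
    clf_demo,
    X_demo,
    y_demo,
    cv=cv_demo,
    scoring=["accuracy", "f1", "precision", "recall", "roc_auc"],
    return_train_score=True,
)

for name in ["accuracy", "f1", "precision", "recall", "roc_auc"]:
    print(name, "test mean:", scores[f"test_{name}"].mean())
```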

In [None]:

#randomforestclassifier
clf1 = RandomForestClassifier(n_estimators = 50, 
                              random_state = 12,
                             max_features = 10,
                             max_depth = 15,
                             min_samples_leaf = 2)

#Kfolds
cv1 = StratifiedKFold(n_splits = 3, random_state=123, shuffle = True)

#list and Dataframe for different metrics
df_score1 = pd.DataFrame(columns = ['loop', 'metric', 'train', 'test'])

loop1, metric1, training1, testing1, confusion_matrix_all1 = [], [], [], [], []
auc1, fprs1, tprs1 = [], [], []

#Fit model
for (train,test), i in zip(cv1.split(X1_over,y1_over), range(3)):
    #fit
    X_train1, X_test1, y_train1, y_test1 = X1_over.iloc[train], X1_over.iloc[test], y1_over.iloc[train], y1_over.iloc[test]
    clf1.fit(X_train1, y_train1)
    
    #predict
    y_predict_train1 = clf1.predict(X_train1)
    y_predict_test1 = clf1.predict(X_test1)
    
    #scores
    #accuracy
    loop1.append(i)
    metric1.append('accuracy')
    training1.append(metrics.accuracy_score(y_train1, y_predict_train1))
    testing1.append(metrics.accuracy_score(y_test1, y_predict_test1))
    
    #f1
    loop1.append(i)
    metric1.append('f1')
    training1.append(metrics.f1_score(y_train1, y_predict_train1))
    testing1.append(metrics.f1_score(y_test1, y_predict_test1))
    
    #precision
    loop1.append(i)
    metric1.append('precision')
    training1.append(metrics.precision_score(y_train1, y_predict_train1))
    testing1.append(metrics.precision_score(y_test1, y_predict_test1))
    
    #recall
    loop1.append(i)
    metric1.append('recall')
    training1.append(metrics.recall_score(y_train1, y_predict_train1))
    testing1.append(metrics.recall_score(y_test1, y_predict_test1))
    
    #confusion matrix on test data
    confusion_matrix_all1.append(metrics.confusion_matrix(y_test1, y_predict_test1))
    
    #AUC, ROC for test data
    y_predict_prob1 = clf1.predict_proba(X_test1)
    auc_score1 = metrics.roc_auc_score(y_test1, y_predict_prob1[:,1])
    fpr1, tpr1, threshold1 = metrics.roc_curve(y_test1, y_predict_prob1[:, 1])
    fprs1.append(fpr1)
    tprs1.append(tpr1)
    auc1.append(auc_score1)
    #fprs and tprs contain 3 arrays now
    
#append Dataframe and view results
df_score1['loop'] = loop1
df_score1['metric'] = metric1
df_score1['train'] = training1
df_score1['test'] = testing1

display(df_score1)



In [None]:
#plot Rocs
for i in range(3): 
    plt.plot(fprs1[i], tprs1[i], linestyle = '--', label = 'fold %d (AUROC = %0.3f)' %(i,auc1[i]))
#title
plt.title('ROC plot')
#labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
#legend
plt.legend()
#show plot
plt.show()

print('mean accuracy on test dataset:', df_score1[df_score1['metric'] == 'accuracy']['test'].mean())
print('mean f1 score on test dataset:', df_score1[df_score1['metric'] == 'f1']['test'].mean())
print('mean precision on test dataset:', df_score1[df_score1['metric'] == 'precision']['test'].mean())
print('mean recall on test dataset:', df_score1[df_score1['metric'] == 'recall']['test'].mean())

##### Mean Confusion Matrix
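
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for a binary target the layout is `[[TN, FP], [FN, TP]]`. The sketch below averages three made-up fold matrices and derives the headline metrics from the result; the counts are hypothetical.

```python
import numpy as np

# Hypothetical per-fold confusion matrices: [[TN, FP], [FN, TP]]
fold_cms = [
    np.array([[80, 10], [5, 35]]),
    np.array([[78, 12], [7, 33]]),
    np.array([[82, 8], [6, 34]]),
]

# Element-wise mean across folds
mean_cm = np.mean(fold_cms, axis=0)
tn, fp, fn, tp = mean_cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / mean_cm.sum()

print(mean_cm)
print("precision:", precision, "recall:", recall, "accuracy:", accuracy)
```

Metrics computed from the averaged matrix are not necessarily identical to the mean of the per-fold metrics, though with equally sized folds they are usually close.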

In [None]:
cm1 = np.mean(confusion_matrix_all1, axis = 0)
total1 = np.sum(cm1)
cm1 = pd.DataFrame(cm1, columns = [0,1] , index = [0,1])
display(cm1)

#percentage
print('percentage:')
display(cm1/total1*100)

# Random Forest Classifier: Iteration 2

#### Select important features
1. Feature importance close to 0.05 or above.
2. Make sure at least one feature from every category (mentioned in the EDA: Amounts, etc.) is present. If a category is missing from the selection made in point 1, choose its variable with the highest feature importance.
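
The two rules above can be sketched as a small function. The importance values and all category names other than "Amounts" are hypothetical, and the hard 0.05 cutoff is a simplification of "close to 0.05", which was judged by eye in this notebook.

```python
def select_features(importances, groups, cutoff=0.05):
    """Rule 1: keep features with importance >= cutoff.
    Rule 2: for any category with no kept feature, add its top feature."""
    selected = [f for f, imp in importances.items() if imp >= cutoff]
    for members in groups.values():
        if not any(f in selected for f in members):
            best = max(members, key=lambda f: importances.get(f, 0.0))
            selected.append(best)
    return selected

# Hypothetical importances and EDA categories, for illustration only
importances = {
    "premium": 0.12,
    "age": 0.06,
    "channel_flag": 0.02,
    "contract_type": 0.01,
    "Policy_term": 0.03,
}
groups = {
    "Amounts": ["premium"],
    "Demographics": ["age"],
    "Channel": ["channel_flag", "contract_type"],
    "Policy": ["Policy_term"],
}

print(select_features(importances, groups))
```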


In [None]:
#Select features for iteration 2; target is unchanged
X2=X1_over[['premium', 'afyp', 'sum_assured', 'Owner_salary', 'city', 'CUST_prod_cat', 'DSTNAME', 'STATNAME', 'age',
           'Own_Edu', 'Occupation',
            'City_classification',
            'contract_type',
            'channel_flag',
            'Product_Club_Manual',
           'Policy_term']]
y2=y1_over[['target']]
X2.info()

#### Kfold

In [None]:

#randomforestclassifier
clf2 = RandomForestClassifier(n_estimators = 50, 
                              random_state = 13,
                             max_features = 10,
                             max_depth = 15,
                             min_samples_leaf = 2)

#Kfolds
cv2 = StratifiedKFold(n_splits = 3, random_state=124, shuffle = True)

#list and Dataframe for different metrics
df_score2 = pd.DataFrame(columns = ['loop', 'metric', 'train', 'test'])

loop2, metric2, training2, testing2, confusion_matrix_all2 = [], [], [], [], []
auc2, fprs2, tprs2 = [], [], []

#Fit model
for (train,test), i in zip(cv2.split(X2,y2), range(3)):
    #fit
    X_train2, X_test2, y_train2, y_test2 = X2.iloc[train], X2.iloc[test], y2.iloc[train], y2.iloc[test]
    clf2.fit(X_train2, y_train2)
    
    #predict
    y_predict_train2 = clf2.predict(X_train2)
    y_predict_test2 = clf2.predict(X_test2)
    
    #scores
    #accuracy
    loop2.append(i)
    metric2.append('accuracy')
    training2.append(metrics.accuracy_score(y_train2, y_predict_train2))
    testing2.append(metrics.accuracy_score(y_test2, y_predict_test2))
    
    #f1
    loop2.append(i)
    metric2.append('f1')
    training2.append(metrics.f1_score(y_train2, y_predict_train2))
    testing2.append(metrics.f1_score(y_test2, y_predict_test2))
    
    #precision
    loop2.append(i)
    metric2.append('precision')
    training2.append(metrics.precision_score(y_train2, y_predict_train2))
    testing2.append(metrics.precision_score(y_test2, y_predict_test2))
    
    #recall
    loop2.append(i)
    metric2.append('recall')
    training2.append(metrics.recall_score(y_train2, y_predict_train2))
    testing2.append(metrics.recall_score(y_test2, y_predict_test2))
    
    #confusion matrix on test data
    confusion_matrix_all2.append(metrics.confusion_matrix(y_test2, y_predict_test2))
    
    #AUC, ROC for test data
    y_predict_prob2 = clf2.predict_proba(X_test2)
    auc_score2 = metrics.roc_auc_score(y_test2, y_predict_prob2[:,1])
    fpr2, tpr2, threshold2 = metrics.roc_curve(y_test2, y_predict_prob2[:, 1])
    fprs2.append(fpr2)
    tprs2.append(tpr2)
    auc2.append(auc_score2)
    #fprs2 and tprs2 contain 3 arrays now
    
#append Dataframe and view results
df_score2['loop'] = loop2
df_score2['metric'] = metric2
df_score2['train'] = training2
df_score2['test'] = testing2

display(df_score2)



In [None]:
#plot Rocs
for i in range(3): 
    plt.plot(fprs2[i], tprs2[i], linestyle = '--', label = 'fold %d (AUROC = %0.3f)' % (i,auc2[i]))
#title
plt.title('ROC plot')
#labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
#legend
plt.legend()
#show plot
plt.show()

print('mean accuracy on test dataset:', df_score2[df_score2['metric'] == 'accuracy']['test'].mean())
print('mean f1 score on test dataset:', df_score2[df_score2['metric'] == 'f1']['test'].mean())
print('mean precision on test dataset:', df_score2[df_score2['metric'] == 'precision']['test'].mean())
print('mean recall on test dataset:', df_score2[df_score2['metric'] == 'recall']['test'].mean())

##### Mean Confusion Matrix

In [None]:
cm2 = np.mean(confusion_matrix_all2, axis = 0)
total2 = np.sum(cm2)
cm2 = pd.DataFrame(cm2, columns = [0,1] , index = [0,1])
display(cm2)

#percentage
print('percentage:')
display((cm2/total2*100))

# Random Forest Classifier: Iteration 3 (Fine tuning)

#### Predict with new threshold
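
`predict` labels a row positive when its predicted probability for class 1 is at least 0.5. Lowering that cutoff turns some false negatives into true positives, usually at the cost of extra false positives. A minimal sketch on made-up probabilities (the arrays and the helper `recall_precision` are hypothetical):

```python
import numpy as np

# Hypothetical true labels and predicted class-1 probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.20, 0.30, 0.60, 0.15, 0.28, 0.45, 0.90])

def recall_precision(y_true, proba, threshold):
    """Apply a probability cutoff and return (recall, precision)."""
    pred = (proba >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

r_default, p_default = recall_precision(y_true, proba, 0.5)
r_low, p_low = recall_precision(y_true, proba, 0.25)

print("threshold 0.50 -> recall", r_default, "precision", p_default)
print("threshold 0.25 -> recall", r_low, "precision", p_low)
```

On this toy data, recall rises from 0.25 to 0.75 when the threshold drops from 0.5 to 0.25, while the number of false positives goes from 1 to 2.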

In [None]:
#set threshold

threshold = 0.25

df_score3 = pd.DataFrame(columns = ['loop', 'metric', 'test'])
loop3, metric3, testing3, confusion_matrix_all3 = [],[],[], []
#For every fold in Kfold cv2
#create column predicted based on threshold
for (train,test), i in zip(cv2.split(X2,y2), range(3)):
    
    #fitting and getting predicted values using iteration 2 splits
    X_train2, X_test2, y_train2, y_test2 = X2.iloc[train], X2.iloc[test], y2.iloc[train], y2.iloc[test]
    clf2.fit(X_train2, y_train2)
    y_predict_prob2 = clf2.predict_proba(X_test2)
    predicted = (y_predict_prob2[:,1] >= threshold).astype('int')
    
    #accuracy
    loop3.append(i)
    metric3.append('accuracy')
    testing3.append(metrics.accuracy_score(y_test2, predicted))
    
    #f1
    loop3.append(i)
    metric3.append('f1')
    testing3.append(metrics.f1_score(y_test2, predicted))
    
    #precision
    loop3.append(i)
    metric3.append('precision')
    testing3.append(metrics.precision_score(y_test2, predicted))
    
    #recall
    loop3.append(i)
    metric3.append('recall')
    testing3.append(metrics.recall_score(y_test2, predicted))
    
    #confusion matrix on test data
    confusion_matrix_all3.append(metrics.confusion_matrix(y_test2, predicted))

df_score3['loop'] = loop3
df_score3['metric'] = metric3
df_score3['test'] = testing3

display(df_score3)

In [None]:
print('mean accuracy on test dataset:', df_score3[df_score3['metric'] == 'accuracy']['test'].mean())
print('mean f1 score on test dataset:', df_score3[df_score3['metric'] == 'f1']['test'].mean())
print('mean precision on test dataset:', df_score3[df_score3['metric'] == 'precision']['test'].mean())
print('mean recall on test dataset:', df_score3[df_score3['metric'] == 'recall']['test'].mean())

##### Mean Confusion Matrix

In [None]:
cm3 = np.mean(confusion_matrix_all3, axis = 0)
total3 = np.sum(cm3)
cm3 = pd.DataFrame(cm3, columns = [0,1] , index = [0,1])
display(cm3)

#percentage
print('percentage:')
display((cm3/total3*100))