## Purpose and Observation

### Purpose:-

- To fit a random forest classifier on 'Dataset_model'
- Dataset: created by isolating the first policy bought by each customer (using policy_owner_number and RCD)

### Observations:-

#### Main conclusion:-
- Final iteration
    - variables used: 'premium', 'afyp', 'sum_assured', 'Owner_salary', 'city', 'CUST_prod_cat', 'DSTNAME', 'STATNAME', 'age',
           'Own_Edu', 'Occupation',
            'contract_type',
            'City_classification',
            'channel_flag',
            'Product_Club_Manual',
           'Policy_term'
    - accuracy on training data: 0.9998
    - accuracy on testing data: 0.9392
    - AUC: 0.792
    - Confusion matrix (testing data only, default threshold of 0.5), as % of each predicted class:
        - True positives: 57.07
        - True negatives: 94.48
        - False positives: 42.93
        - False negatives: 5.52

- Fine tuning
    - accuracy on testing data: 0.9347
    - Confusion matrix (testing data only, threshold value = 0.37), as % of each predicted class:
        - True positives: 46.57
        - True negatives: 95.15
        - False positives: 53.43
        - False negatives: 4.85


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

from sklearn.ensemble import RandomForestClassifier



# Read Dataset

**Dataset used: Dataset_model. Merged and cleaned; continuous variables imputed with the mean, remaining records with NaN dropped. Records were then grouped by policy_owner_number and only the record with the first RCD was kept.**
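The isolation step described above is not part of this notebook. A minimal sketch of one way it could be done is shown below, assuming one row per policy; the toy values and the `policies` / `first_policy` names are made up for illustration, and only the column names come from the notebook.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented
policies = pd.DataFrame({
    "policy_owner_number": [101, 101, 102, 103, 103],
    "policy_number": ["A1", "A2", "B1", "C1", "C2"],
    "RCD": ["2019-05-01", "2017-03-15", "2020-01-10", "2018-07-07", "2018-02-02"],
})
policies["RCD"] = pd.to_datetime(policies["RCD"])

# Sort by owner and date, then keep the earliest policy of each owner
first_policy = (
    policies.sort_values(["policy_owner_number", "RCD"])
            .drop_duplicates(subset="policy_owner_number", keep="first")
            .reset_index(drop=True)
)
print(first_policy)
```

Sorting before `drop_duplicates` makes the "first" record the one with the earliest RCD; ties on the same date would need an additional sort key.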

In [None]:
#import dataset
dataset = pd.read_csv("Dataset_model.csv")

#convert RCD to datetime
dataset["RCD"] = pd.to_datetime(dataset["RCD"])

In [None]:
dataset.head()

# Descriptive Stats for reference

In [None]:
#tweak to show full values in describe instead of scientific notation
pd.options.display.float_format = "{:.2f}".format

#describe and info
with pd.option_context('display.max_columns', None):
    display(dataset.head())
    display(dataset.describe())
    dataset.info()  #info() prints directly and returns None

# Dropping unneeded variables
- Dropping columns
    - 'Own_Education': keeping Own_Edu
    - 'own_occupation': keeping Occ_Profile, Occupation_Group, Occupation
    - 'policy_number', 'policy_owner_number': Identifiers
    - 'RCD': Datetime
    - 'Freq': used to obtain target variable

In [None]:
dataset.drop(['Own_Education', 'own_occupation',
              'policy_number', 'policy_owner_number', 
              'RCD', 
              'Freq'], axis = 1, inplace = True)

# Feature engineering


In [None]:
#round floats up (ceil) and convert to int where the column has no NaN
floatlist = list(dataset.select_dtypes('float').columns)

for col in floatlist:
    dataset.loc[:,col] = dataset.loc[:,col].apply(np.ceil)
    if dataset.loc[:,col].isna().sum() == 0:
        dataset[col] = dataset[col].astype('int64')

In [None]:
# Label-encode categorical vars

#import library
from sklearn.preprocessing import LabelEncoder


#create list of categorical vars
catcolsm = list(dataset.select_dtypes('object').columns)

#code -> original label mapping for each encoded column
label_maps = {}

#function to encode the non-null values of one column (NaN are left untouched)
def labelencode(data, col):
    le = LabelEncoder()  #fresh encoder per column
    encoded = data.copy()
    mask = data.notnull()
    encoded.loc[mask] = le.fit_transform(data[mask])
    label_maps[col] = dict(zip(range(len(le.classes_)), le.classes_))
    return encoded

#labelencode and assign the result back to the dataframe
for col in catcolsm:
    dataset[col] = labelencode(dataset[col], col)

#review results
display(dataset.head())

#display dictionaries created.
for col in catcolsm:
    display(label_maps[col])



In [None]:
#Convert object dtype to category dtype
for col in catcolsm:
    dataset[col] = dataset[col].astype('category')

In [None]:
#Separating target variable
X=dataset.drop(['target'],axis=1)
y=dataset['target']  #1-D Series, as expected by fit()
X.info()

# Random Forest Classifier: Iteration 1

### 1. train test split, feature importance

In [None]:
#fit model
#took 5-6 mins to run

x_train, x_test, y_train,  y_test = train_test_split(X, y, test_size=0.2, random_state=0)
rf=RandomForestClassifier(n_estimators=100)
rf.fit(x_train, y_train)

In [None]:
#checking feature importances using RandomForestClassifier

plt.bar(x_train.columns, rf.feature_importances_)
plt.xticks(rotation=90)
plt.show()

display(list(zip(x_train.columns, rf.feature_importances_)))

**Important variables from RandomForestClassifier** (importance above or close to 0.05)
- premium, afyp, sum_assured, Owner_salary, city, CUST_prod_cat, DSTNAME, STATNAME, age: having high importance
- Own_Edu, Occupation: moderate importance
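The selection above was done by reading the importance list by eye. A minimal sketch of doing the same filtering in code follows, using synthetic data as a stand-in for the real dataset; `X_demo`, `rf_demo` and `cutoff` are hypothetical names, and a strict 0.05 cutoff will not reproduce the "close to 0.05" judgment calls.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X / y (not Dataset_model)
X_arr, y_demo = make_classification(n_samples=500, n_features=8,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(rf_demo.feature_importances_,
                        index=X_demo.columns).sort_values(ascending=False)

# Keep the features at or above the cutoff
cutoff = 0.05
selected = importances[importances >= cutoff].index.tolist()

print(importances)
print("selected:", selected)
```

Keeping the importances in a named `Series` avoids matching bar positions to column names by hand.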



### 2. Accuracy

In [None]:
#Check fit on train data

y_pred_train=rf.predict(x_train)
print("Accuracy against the training data: ", metrics.accuracy_score(y_train, y_pred_train))

In [None]:
#Check accuracy on Test Data

y_pred = rf.predict(x_test)
print("Accuracy against test data: ", metrics.accuracy_score(y_test, y_pred))

### 3. AUC, ROC, Confusion matrix

#### 3a. Prediction Probabilities

In [None]:
r_probs = [0 for _ in range(len(y_test))]
rf_probs = rf.predict_proba(x_test)

#Note: r_probs is a no-skill baseline (the same score for every record); it yields AUROC = 0.5, not the worst possible model

#### 3b. AUC

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

r_auc = roc_auc_score(y_test, r_probs) 
rf_auc = roc_auc_score(y_test, rf_probs[:, 1]) 
# We use [:,1] since predict_proba returns the probability of class 0 in column 0 and the probability of class 1 in column 1

print('Random classifier: AUROC = %.3f' % (r_auc))
print('RandomForestClassifier: AUROC = %.3f' % (rf_auc))
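The all-zero `r_probs` used above gives every record the same score. The small check below, on made-up labels, illustrates what AUROC such a baseline produces and how it compares with a perfect and a fully inverted ranking; all names in it are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels for illustration
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])

constant_scores = np.zeros(len(y_true))   # same score for everyone
perfect_scores = y_true.astype(float)     # ranks every 1 above every 0
inverted_scores = 1.0 - perfect_scores    # ranks every 0 above every 1

auc_constant = roc_auc_score(y_true, constant_scores)
auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_inverted = roc_auc_score(y_true, inverted_scores)

print(auc_constant, auc_perfect, auc_inverted)
```

A constant score carries no ranking information, so it sits on the diagonal of the ROC plot; an AUROC below 0.5 would mean the ranking is systematically reversed.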

#### 3c. ROC Curve

In [None]:
rf_fpr, rf_tpr, rf_threshold = roc_curve(y_test, rf_probs[:, 1])
r_fpr, r_tpr, r_threshold = roc_curve(y_test, r_probs)

In [None]:
#plot

plt.plot(r_fpr, r_tpr, linestyle = '--', label = 'Random prediction (AUROC = %0.3f)' % r_auc)
plt.plot(rf_fpr, rf_tpr, linestyle = '--', label = 'Random Forest (AUROC = %0.3f)' % rf_auc)

#title
plt.title('ROC plot')

#labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

#legend
plt.legend()

#show plot
plt.show()


#### 3d. Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = pd.DataFrame(confusion_matrix(y_test, y_pred), columns = [0,1] , index = [0,1])
display(cm)

In [None]:
#normalise each column (predicted class) so that it sums to 100%
cm_pct = cm / cm.sum(axis=0)
display(cm_pct*100)

plt.figure(figsize = (10,7))
sns.heatmap(cm_pct, annot = True, fmt = '.2%', cmap = 'Blues')
plt.show()

False negatives: 5.76%. Anything up to 10% is treated as acceptable.
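The percentages in this notebook sum to 100 within each predicted class, so the false-negative figure is a share of predicted negatives, not of actual positives. The sketch below, on made-up labels, contrasts the two normalisations using the `normalize` argument of `confusion_matrix` (assuming scikit-learn 0.22 or later); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Raw counts: rows are actual classes, columns are predicted classes
cm_counts = confusion_matrix(y_true, y_hat)

# Each column sums to 1: share within each predicted class
cm_by_pred = confusion_matrix(y_true, y_hat, normalize="pred")

# Each row sums to 1: share within each actual class
cm_by_true = confusion_matrix(y_true, y_hat, normalize="true")

print(cm_counts)
print(cm_by_pred.round(3))
print(cm_by_true.round(3))
```

In this toy case the false-negative cell is 2/7 of predicted negatives but 2/4 of actual positives, so the two views can differ a lot when classes are imbalanced.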

# Random Forest Classifier: Iteration 2

#### Select only important features 
feature importance close to 0.05 and above

In [None]:
#Selecting important features and separating target variable
X2=dataset[['premium', 'afyp', 'sum_assured', 'Owner_salary', 'city', 'CUST_prod_cat', 'DSTNAME', 'STATNAME', 'age',
           'Own_Edu', 'Occupation',
            'contract_type',
            'City_classification',
            'channel_flag',
            'Product_Club_Manual',
           'Policy_term']]
y2=dataset['target']  #1-D Series, as expected by fit()
X2.info()

### 1. train test split, feature importance

In [None]:
#fit model
#took 5-6 mins to run

x_train2, x_test2, y_train2,  y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=0)
rf=RandomForestClassifier(n_estimators=100)
rf.fit(x_train2, y_train2)

In [None]:
#checking feature importances again using RandomForestClassifier

plt.bar(x_train2.columns, rf.feature_importances_)
plt.xticks(rotation=90)
plt.show()

display(list(zip(x_train2.columns, rf.feature_importances_)))

Possible next steps for interpreting the model: partial dependence plots, or the Skater library.

### 2. Accuracy

In [None]:
#Check fit on train data

y_pred_train2=rf.predict(x_train2)
print("Accuracy against the training data: ", metrics.accuracy_score(y_train2, y_pred_train2))

In [None]:
#Check accuracy on Test Data

y_pred2 = rf.predict(x_test2)
print("Accuracy against test data: ", metrics.accuracy_score(y_test2, y_pred2))

### 3. AUC, ROC, Confusion matrix

#### 3a. Prediction Probabilities

In [None]:
r_probs2 = [0 for _ in range(len(y_test2))]
rf_probs2 = rf.predict_proba(x_test2)

#### 3b. AUC

In [None]:
r_auc2 = roc_auc_score(y_test2, r_probs2) 
rf_auc2 = roc_auc_score(y_test2, rf_probs2[:, 1]) 
# We use [:,1] since predict_proba returns the probability of class 0 in column 0 and the probability of class 1 in column 1

print('Random classifier: AUROC = %.3f' % (r_auc2))
print('RandomForestClassifier: AUROC = %.3f' % (rf_auc2))

#### 3c. ROC Curve

In [None]:
rf_fpr2, rf_tpr2, rf_threshold2 = roc_curve(y_test2, rf_probs2[:, 1])
r_fpr2, r_tpr2, r_threshold2 = roc_curve(y_test2, r_probs2)

In [None]:
#plot

plt.plot(r_fpr2, r_tpr2, linestyle = '--', label = 'Random prediction (AUROC = %0.3f)' % r_auc2)
plt.plot(rf_fpr2, rf_tpr2, linestyle = '--', label = 'Random Forest (AUROC = %0.3f)' % rf_auc2)

#title
plt.title('ROC plot')

#labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

#legend
plt.legend()

#show plot
plt.show()

#### 3d. Confusion Matrix

In [None]:
cm2 = pd.DataFrame(confusion_matrix(y_test2, y_pred2), columns = [0,1] , index = [0,1])
display(cm2)

In [None]:
#normalise each column (predicted class) so that it sums to 100%
cm2_pct = cm2 / cm2.sum(axis=0)
display(cm2_pct*100)

plt.figure(figsize = (10,7))
sns.heatmap(cm2_pct, annot = True, fmt = '.2%', cmap = 'Blues')
plt.show()

# Random Forest Fine tuning

#### 1. Predict with new threshold

In [None]:
#set threshold

threshold = 0.37

#create column predicted based on threshold

predicted = (rf_probs2[:,1] >= threshold).astype('int')
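The notebook does not show how 0.37 was chosen. One common way to pick a threshold is to sweep a range of values and compare the resulting errors; the sketch below does this on made-up scores, and the data, the grid of thresholds and the `sweep` table are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and scores loosely correlated with them
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
scores = np.clip(0.35 * y_true + rng.normal(0.35, 0.2, size=200), 0, 1)

rows = []
for t in np.arange(0.1, 0.91, 0.1):
    pred = (scores >= t).astype(int)
    # ravel() on a 2x2 matrix returns tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    rows.append({"threshold": round(float(t), 2),
                 "accuracy": accuracy_score(y_true, pred),
                 "fn": int(fn), "fp": int(fp)})

sweep = pd.DataFrame(rows)
print(sweep)
```

Lowering the threshold can only move predictions from 0 to 1, so false negatives fall while false positives rise; the table makes that trade-off visible. In practice the sweep would be run on a validation split, not on the test set used for the final report.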

#### 2. Accuracy

In [None]:
print("Accuracy against test data: ", metrics.accuracy_score(y_test2, predicted))

#### 3. Confusion Matrix

In [None]:
#Create confusion matrix again
cm3 = pd.DataFrame(confusion_matrix(y_test2, predicted), columns = [0,1] , index = [0,1])
display(cm3)

In [None]:
#normalise each column (predicted class) so that it sums to 100%
cm3_pct = cm3 / cm3.sum(axis=0)
display(cm3_pct*100)

plt.figure(figsize = (10,7))
sns.heatmap(cm3_pct, annot = True, fmt = '.2%', cmap = 'Blues')
plt.show()