# Purpose and Observation

## Purpose:
- EDA: variable profiling to gauge each variable's impact on the target variable and to identify columns that can be eliminated

## Main Observation:-

#### Variables explored:-
- Variables to explore (32)
    - Amounts (4): Owner_salary, afyp, premium, sum_assured
    - Gender (2): Own_gender, LA_gender
    - Education (2): Own_Edu, Own_Education
    - Occupation (4): Occ_Profile, Occupation_Group, own_occupation, Occupation
    - Internal categorization (2): risk_status, contract_type
    - Product (5): Product_Description, Par_NonPar, Product_brief_category, Product_Club_Manual, CUST_prod_cat
    - Location (5): city, DSTNAME, STATNAME, Focus_region, City_classification
    - Time (4): age, PPT, Policy_term, billing_frequency
    - Flags (3): channel_flag, Med_Flag, ECS_flag
    - Marital Status (1): Marital_status
- Variables not suitable for exploration:
    - Identifiers (2): policy_number, policy_owner_number
    - Date (1): RCD
    - Target variable itself and frequency (2): Freq, target

#### Results of exploration:-
- Amounts: significant impact on target variable
- Gender: low impact on target variable
- Education: moderate impact on target variable
- Occupation: significant impact on target variable (especially the more detailed categories: Occupation)
- Internal categorization: moderate impact on target variable
- Product: significant impact on target variable (especially the more detailed categories: CUST_prod_cat)
- Location: significant impact on target variable (STATNAME, DSTNAME, city)
- Time: significant impact on target variable (age, Policy_term)
- Flags: moderate impact on target variable
- Marital status: moderate impact on target variable


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import csv


# Read Clean and Missing Values Treated Data

#### Moving ahead with merged dataset.
**Dataset used: Merged_clean_and_dropped (merged and cleaned; continuous variables imputed with the mean; remaining records with NaN dropped)**
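
The treatment described above (mean imputation for continuous variables, then dropping records that still contain NaN) could look roughly like the sketch below. It is only an illustration: the toy values and the `continuous_cols` list are assumptions, not the actual cleaning code used to produce the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the merged dataset (values are made up).
raw = pd.DataFrame({
    "premium": [100.0, np.nan, 300.0, 200.0],
    "sum_assured": [1000.0, 2000.0, np.nan, 3000.0],
    "Own_gender": ["M", "F", None, "F"],
})

# Hypothetical list of continuous columns to impute.
continuous_cols = ["premium", "sum_assured"]

cleaned = raw.copy()
# Fill gaps in the continuous columns with each column's mean.
cleaned[continuous_cols] = cleaned[continuous_cols].fillna(cleaned[continuous_cols].mean())
# Drop records that still contain NaN in any other column.
cleaned = cleaned.dropna().reset_index(drop=True)
```

Imputing before dropping keeps records whose only gaps are in continuous columns; only records with missing categorical values are lost.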

In [None]:
df_merged = pd.read_csv('Dataset_model.csv')

df_merged["RCD"] = pd.to_datetime(df_merged["RCD"])



# Descriptive stats for reference

In [None]:
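# Scope both display options to this cell so later output is unaffected
with pd.option_context('display.max_columns', None, 'display.float_format', '{:.2f}'.format):
    display(df_merged.head())
    display(df_merged.describe())
    # info() prints its report directly and returns None, so it is not wrapped in display()
    df_merged.info()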

# EDA

### Variables and count

In [None]:
print('All variables:', df_merged.columns)
print('Total number of variables: ', len(df_merged.columns))


### Variables explored to check impact on target variable
- Variables to explore (32)
    - Amounts (4): Owner_salary, afyp, premium, sum_assured
    - Gender (2): Own_gender, LA_gender
    - Education (2): Own_Edu, Own_Education
    - Occupation (4): Occ_Profile, Occupation_Group, own_occupation, Occupation
    - Internal categorization (2): risk_status, contract_type
    - Product (5): Product_Description, Par_NonPar, Product_brief_category, Product_Club_Manual, CUST_prod_cat
    - Location (5): city, DSTNAME, STATNAME, Focus_region, City_classification
    - Time (4): age, PPT, Policy_term, billing_frequency
    - Flags (3): channel_flag, Med_Flag, ECS_flag
    - Marital Status (1): Marital_status
- Variables not suitable for exploration:
    - Identifiers (2): policy_number, policy_owner_number
    - Date (1): RCD
    - Target variable itself and frequency (2): Freq, target

### Binning Continuous Variables

In [None]:
# Owner_salary: decile buckets. Ranking first (method='first') breaks ties so the
# decile edges are unique; the interval labels are therefore rank ranges, not salary values.
df_merged['income_brackets'] = pd.qcut(df_merged['Owner_salary'].rank(method='first'), q=10)

# age: fixed bands
df_merged['age_brackets'] = pd.cut(df_merged['age'], bins=[0, 18, 24, 34, 44, 55, 110])

# Policy_term: fixed bands (same edges as age)
df_merged['Policy_term_brackets'] = pd.cut(df_merged['Policy_term'], bins=[0, 18, 24, 34, 44, 55, 110])

# billing_frequency: fixed bands; the lower edge of -1 keeps a value of 0 inside the first bucket
df_merged['billing_frequency_brackets'] = pd.cut(df_merged['billing_frequency'], bins=[-1, 3, 6, 9, 12])

# PPT: fixed bands (same edges as age)
df_merged['PPT_brackets'] = pd.cut(df_merged['PPT'], bins=[0, 18, 24, 34, 44, 55, 110])

# premium: decile buckets on ranks
df_merged['premium_brackets'] = pd.qcut(df_merged['premium'].rank(method='first'), q=10)

# afyp: decile buckets on ranks
df_merged['afyp_brackets'] = pd.qcut(df_merged['afyp'].rank(method='first'), q=10)

# sum_assured: decile buckets on ranks
df_merged['sum_assured_brackets'] = pd.qcut(df_merged['sum_assured'].rank(method='first'), q=10)
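
The amount variables are bucketed with `pd.qcut` on ranks rather than on the raw values. A likely reason (an assumption, since the notebook does not state it) is that heavily tied values make quantile edges coincide, which `pd.qcut` rejects by default. The sketch below uses made-up premiums to show the failure, the rank-based workaround, and one way to recover the actual value range of each bucket, since the bucket labels themselves are rank intervals.

```python
import pandas as pd

# Toy premiums with heavy ties (made-up values).
premium = pd.Series([100] * 6 + [200, 300, 400, 500], name="premium")

# Direct quantile binning fails here because several quantile edges coincide at 100.
try:
    pd.qcut(premium, q=5)
    direct_ok = True
except ValueError:
    direct_ok = False

# Ranking first yields distinct values 1..n, so the edges are always unique.
buckets = pd.qcut(premium.rank(method="first"), q=5)

# Bucket labels are rank intervals; summarise the real values inside each bucket.
value_ranges = premium.groupby(buckets, observed=True).agg(["min", "max", "count"])
```

Note that with `method='first'`, tied values can land in different buckets depending on row order, so neighbouring buckets may cover the same raw value.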

### Create lists for variables as noted above

In [None]:
Amount = ['income_brackets', 'afyp_brackets', 'premium_brackets', 'sum_assured_brackets']
Gender = ['Own_gender','LA_gender']
Education = ['Own_Edu','Own_Education']
Occupation = ['Occ_Profile', 'Occupation_Group', 'own_occupation', 'Occupation']
Int_Cat = ['risk_status', 'contract_type']
Product = ['Product_Description', 'Par_NonPar', 'Product_brief_category', 'Product_Club_Manual', 'CUST_prod_cat']
Location = ['City_classification', 'Focus_region', 'STATNAME', 'DSTNAME', 'city']
Time = ['age_brackets', 'PPT_brackets', 'Policy_term_brackets', 'billing_frequency_brackets']
Flag = ['channel_flag', 'Med_Flag', 'ECS_flag']
Mar_stat = ['Marital_status']
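
Column names are case- and spelling-sensitive, so a quick check that every name in these lists exists in the dataframe can catch a mismatch before the plotting loops raise a `KeyError`. The helper below is a hypothetical sketch (`missing_columns` is not part of the notebook) run on a toy frame; in the notebook it would be called with `df_merged` and the lists above.

```python
import pandas as pd

def missing_columns(df, groups):
    """Return {group_name: [columns absent from df]} for groups that have gaps."""
    missing = {}
    for name, cols in groups.items():
        absent = [c for c in cols if c not in df.columns]
        if absent:
            missing[name] = absent
    return missing

# Toy frame and groups (made up for illustration).
toy = pd.DataFrame({"Own_gender": ["M"], "LA_gender": ["F"], "Marital_status": ["Single"]})
groups = {
    "Gender": ["Own_gender", "LA_gender"],
    "Mar_stat": ["Martial_status"],  # deliberately misspelled to show the check
}
result = missing_columns(toy, groups)
```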

# Variable Profiling

#### 1. Amount

In [None]:
for col in Amount:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

TODO: check whether the x-tick labels and a legend need adjusting.

***NOTES: All 'Amount' variables have an impact on the target variable. All may be included in the model.***
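
The same plotting loop is repeated for every variable group in this notebook. One way to reduce the duplication is a small helper, sketched below. `plot_target_rate` is a hypothetical name, and the sketch uses plain Matplotlib bars instead of `sns.barplot` to stay dependency-light, so the styling will differ from the notebook's charts.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_target_rate(df, col, target="target", rotation=60):
    """Bar chart of the percentage of records with target == 1 per category of `col`."""
    rates = df.groupby(col, observed=True)[target].mean() * 100
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(rates.index.astype(str), rates.values)
    ax.set_title(col + "_Graph")
    ax.set_xlabel(col)
    ax.set_ylabel("% of records with target = 1")
    ax.tick_params(axis="x", labelrotation=rotation)
    return rates, fig

# Toy data (made up); in the notebook this would be df_merged and one of the lists above.
toy = pd.DataFrame({
    "Own_gender": ["M", "M", "F", "F"],
    "target": [1, 0, 1, 1],
})
rates, fig = plot_target_rate(toy, "Own_gender")
plt.close(fig)
```

Returning the rate table alongside the figure makes it easy to inspect the numbers behind each chart, and `fig.savefig(col + '.png', bbox_inches='tight')` can be called on the returned figure if the image is needed.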

#### 2. Gender

In [None]:
for col in Gender:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: The 'Gender' variables have very little impact on the target variable. Neither needs to be included in the model.***

#### 3. Education

In [None]:
for col in Education:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Education' variables have an impact on the target variable. Variables with broader, suitable categories (as per preliminary exploration) may be included.***

#### 4. Occupation

In [None]:
for col in Occupation:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Occupation' variables have an impact on the target variable. Variables with broader, suitable categories (as per preliminary exploration) may be included.***

#### 5. Int_Cat

In [None]:
for col in Int_Cat:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 90)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Int_Cat' variables have an impact on the target variable. However, their meaning is unclear, so they may be excluded from the model.***

#### 6. Product

In [None]:
for col in Product:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 90)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Product' variables have an impact on the target variable (Product_brief_category shows the lowest impact as per proportions). Variables with broader, suitable categories (as per preliminary exploration) may be included.***

#### 7. Location

In [None]:
for col in Location:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 90)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Location' variables have an impact on the target variable. Variables with broader, suitable categories (as per preliminary exploration) may be included (STATNAME may also be included).***

#### 8. Time

In [None]:
for col in Time:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Time' variables have an impact on the target variable. All may be included in the model.***

#### 9. Flag

In [None]:
for col in Flag:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: All 'Flag' variables have an impact on the target variable. All may be included in the model.***

#### 10. Marital Status

In [None]:
for col in Mar_stat:
    plt.figure(figsize= (10,5))
    x = pd.DataFrame(df_merged.groupby(col).target.mean()*100)
    sns.set(style= 'whitegrid')
    sns.barplot(x = x.index, y = 'target', data = x, palette = 'spring')
    plt.xticks(rotation = 60)
    plt.title(col+'_Graph')
    plt.xlabel(col)
    plt.ylabel("% of records with target = 1")
    plt.savefig(col+'.png', bbox_inches = 'tight')
    plt.show()
    plt.close()

***NOTES: Marital status has an impact on the target variable. It may be included in the model.***

## Observation:-

- Amounts: significant impact on target variable
- Gender: low impact on target variable
- Education: moderate impact on target variable
- Occupation: significant impact on target variable (especially the more detailed categories: Occupation)
- Internal categorization: moderate impact on target variable
- Product: significant impact on target variable (especially the more detailed categories: CUST_prod_cat)
- Location: significant impact on target variable (STATNAME, DSTNAME, city)
- Time: significant impact on target variable (age, Policy_term)
- Flags: moderate impact on target variable
- Marital status: moderate impact on target variable
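
The ratings above are read off the bar charts. A rough numeric companion, sketched below, is the spread between the highest and lowest target rate across a variable's categories. `target_rate_spread` and its `min_count` parameter are hypothetical, the toy data is made up, and any cut-off separating "significant" from "moderate" would be a further assumption that this notebook does not define.

```python
import pandas as pd

def target_rate_spread(df, cols, target="target", min_count=1):
    """Summarise min/max target rate (in %) per variable, ignoring tiny categories."""
    rows = []
    for col in cols:
        stats = df.groupby(col, observed=True)[target].agg(["mean", "count"])
        stats = stats[stats["count"] >= min_count]
        rows.append({
            "variable": col,
            "n_categories": len(stats),
            "min_rate_pct": stats["mean"].min() * 100,
            "max_rate_pct": stats["mean"].max() * 100,
        })
    out = pd.DataFrame(rows)
    out["spread_pct"] = out["max_rate_pct"] - out["min_rate_pct"]
    return out.sort_values("spread_pct", ascending=False).reset_index(drop=True)

# Toy data (made up): occupation separates the target, gender does not.
toy = pd.DataFrame({
    "Own_gender": ["M", "F", "M", "F"],
    "Occupation": ["Salaried", "Salaried", "Business", "Business"],
    "target": [1, 1, 0, 0],
})
summary = target_rate_spread(toy, ["Own_gender", "Occupation"])
```

The spread is sensitive to categories with few records, which is why the sketch allows a minimum count; high-cardinality variables such as city would need a sensible `min_count` before the numbers are comparable.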