# Purpose and Observation

## Purpose:
    - Preliminary data exploration to identify columns that can be eliminated
    
## Observations:-

### The following variables are compared to get a better understanding of the dataset
##### Categorical variables:-
    - Gender (2): Own_gender, LA_gender
    - Education (2): Own_Edu, Own_Education
    - Occupation (4):  Occ_Profile, Occupation_Group, own_occupation, Occupation
    - Product (5): Product_Description, Par_NonPar, Product_brief_category, Product_Club_Manual, CUST_prod_cat
    - Location (5): city, DSTNAME, STATNAME, Focus_region, City_classification
    - Flags (3): channel_flag, Med_Flag, ECS_flag
##### Continuous variables:-
    - Owner_salary, afyp, premium, sum_assured
   
## Main Conclusions:-

### Categorical variables:-
    
    - Gender:
        - Own_gender has some missing values.
        - LA_gender has no missing values, making it the better column to retain in case of a conflict.
    - Education:
        - Own_Education contains abbreviations of the values in Own_Edu and may be dropped.
        - We may retain Own_Edu.
    - Occupation:
        - own_occupation contains abbreviations of the values in Occupation and may be dropped.
        - Of the remaining three, Occupation is the most detailed category and Occ_Profile appears to be the broadest.
    - Product:
        - Product_Description is the most detailed category.
        - Product_Club_Manual is the second most detailed category.
        - Product_brief_category appears to be the broadest appropriate classification.
            - Also, 'SAFAL JEEVAN' is a part of 'TRADITIONAL' policies.
    - Location:
        - From most detailed to broadest, the categories are city, DSTNAME, STATNAME, City_classification, Focus_region.
        - City_classification: 'METRO' is not an appropriate label because it contains cities besides the 4 metros. 'MEGA' cities would be a more appropriate name than METRO.
    - Flags: all flag categories are present for every Product_brief_category.

### Continuous variables:-
   **For Owner_salary, afyp, premium, sum_assured**
    - The distributions are highly right-skewed.
    - The values show a lot of variation.
    - Hence, the mean value may be used for imputation, although for right-skewed data the median is usually more robust to the long tail.

# Exploration steps

### Categorical variables comparison: - 
   **Moving ahead with merged dataset**
    - Gender
        - Own_gender, LA_gender
    - Education
        - Own_Edu, Own_Education
    - Occupation
        - Occupation_Group, Occ_Profile, Occupation
    - Products
        - Product_Description, Product_Club_Manual, CUST_prod_cat, Product_brief_category, Par_NonPar
    - Location
        - Focus_region, STATNAME, DSTNAME, City_classification, city
    - Flags
        - ECS_flag, Med_Flag, channel_flag

### Continuous variables exploration: - 
    - Explored Owner_salary, afyp, premium, sum_assured

# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import csv


# Read cleaned data

#### Moving ahead with merged dataset

In [None]:
df_merged = pd.read_csv('Merged_clean.csv')

df_merged["RCD"] = pd.to_datetime(df_merged["RCD"])


In [None]:
# Cleaning code kept for reference inside a string literal so that it does not run
""" Remove the surrounding triple quotes to run this cleaning code if the clean data file is not available


df_merged=pd.read_csv('merged.csv')


#Marital_status, Own_Edu, Occupation_Group fix
df_merged.replace(['N', 'N.A', 'MISSING'], np.nan, inplace=True )


#STATNAME fix and Focus_region fix

objectlistm = list(df_merged.select_dtypes('object').columns)
for col in objectlistm:
    df_merged[col] = df_merged[col].str.upper()


# putting them under SOUTH
df_merged['Focus_region'] = df_merged['Focus_region'].replace(['KKG','ANDHRA','TAMIL NADU'], 'SOUTH')

# RCD and LA_DOB fix
df_merged["LA_DOB"]= pd.to_datetime(df_merged["LA_DOB"])
df_merged["RCD"] = pd.to_datetime(df_merged["RCD"])


#Float dtype fix
floatlist = list(df_merged.select_dtypes('float').columns)
display(df_merged.loc[:, floatlist])

for col in floatlist:
    df_merged.loc[:,col] = df_merged.loc[:,col].apply(np.ceil)
    if df_merged.loc[:,col].isna().sum() == 0:
        df_merged[col] = df_merged[col].astype('int64')

#drop Unnamed:0
x= "Unnamed: 0"

df_merged.drop([x], axis=1, inplace=True)

"""

# Descriptive stats for reference


In [None]:
pd.options.display.float_format = "{:.2f}".format

display(df_merged.describe())

# info() prints its summary directly and returns None, so display() is not needed
df_merged.info()

## Check Missing values

In [None]:
#check all records with at least one missing value
nulls = df_merged.isnull().any(axis=1).sum()
print("records with missing values: ", nulls)
print("percentage records with missing values: ", 100*nulls/len(df_merged))

In [None]:
# columns with at least one missing value
features_with_na = [feature for feature in df_merged.columns if df_merged[feature].isnull().sum() > 0]

for feature in features_with_na:
    print(feature, np.round(df_merged[feature].isnull().mean()*100,4), '% missing values')
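
The per-column loop above can also be collected into a single sorted table, which is easier to scan when many columns have gaps. This is a minimal sketch on a toy frame standing in for `df_merged`; the helper name `missing_summary` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns with any NaN."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(4),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data standing in for df_merged (values are illustrative only).
toy = pd.DataFrame({
    "Own_gender": ["M", None, "F", None],
    "LA_gender": ["M", "F", "F", "M"],
    "Owner_salary": [1.0, 2.0, np.nan, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(df_merged)`.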

## Categorical Variables Exploration

**Cross-tabulations are used to understand how the columns categorize the data.**

In [None]:
df_merged.select_dtypes('object').columns

### 1. Gender: Own_gender, LA_gender

In [None]:
#Own_gender vs LA_gender

display(pd.crosstab(df_merged['Own_gender'], df_merged['LA_gender']))
print('Null values in Own_gender:', df_merged['Own_gender'].isnull().sum())
print('Null values in LA_gender:', df_merged['LA_gender'].isnull().sum())

***NOTES: LA_gender has no null values, but it describes the gender of the person who is 'covered' rather than the proposer***

### 2. Education:  Own_Edu, Own_Education

In [None]:
#Own_Edu vs Own_Education

display(pd.crosstab(df_merged['Own_Education'],df_merged['Own_Edu']))
print('Null values in Own_Education:', df_merged['Own_Education'].isnull().sum())
print('Null values in Own_Edu:', df_merged['Own_Edu'].isnull().sum())

***NOTES: Either column can be used as the single broad education classification going forward***
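
The claim that one column is only an abbreviated form of another can be checked programmatically as well as by reading the cross-tabulation. The sketch below tests whether two columns map one-to-one; the helper name `is_recoding` and the toy category values are hypothetical and not taken from the dataset.

```python
import pandas as pd


def is_recoding(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True if each value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = df[[col_a, col_b]].dropna().drop_duplicates()
    # In a one-to-one mapping no value repeats on either side of the pair list.
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)


# Toy data with made-up category labels (illustrative only).
toy = pd.DataFrame({
    "Own_Edu": ["GRADUATE", "GRADUATE", "POST GRADUATE", "HSC"],
    "Own_Education": ["GRAD", "GRAD", "PG", "HSC"],
    "LA_gender": ["M", "F", "M", "F"],
})
print(is_recoding(toy, "Own_Edu", "Own_Education"))
print(is_recoding(toy, "Own_Edu", "LA_gender"))
```

Rows where either column is missing are ignored, so the null counts printed above still need to be compared separately.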

### 3. Occupation: Occupation_Group, Occ_Profile, Occupation, Own_occupation.

***Own_occupation contains only abbreviations of Occupation, so that comparison is skipped***

In [None]:
#Occupation_Group vs Occ_Profile

display(pd.crosstab(df_merged['Occupation_Group'],df_merged['Occ_Profile']))
print('Null values in Occupation_Group:', df_merged['Occupation_Group'].isnull().sum())
print('Null values in Occ_Profile:', df_merged['Occ_Profile'].isnull().sum())

In [None]:
# Occ_Profile vs Occupation
print('Null values in Occ_Profile:', df_merged['Occ_Profile'].isnull().sum())
print('Null values in Occupation:', df_merged['Occupation'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Occ_Profile','Occupation']).size())

In [None]:
# Occupation_Group vs Occupation
# To see the differences of occupation categorized into Salaried-High, Salaried-Low, Salaried-Medium

print('Null values in Occupation_Group:', df_merged['Occupation_Group'].isnull().sum())
print('Null values in Occupation:', df_merged['Occupation'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Occupation_Group','Occupation']).size())

***NOTES: Occ_Profile appears to be the broadest category. Occupation is the most detailed category.***
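
The broad-versus-detailed ordering can be cross-checked by counting distinct levels per column and testing whether the detailed column nests inside the broad one. A minimal sketch, assuming toy occupation labels that are invented for illustration:

```python
import pandas as pd

# Toy data with made-up labels standing in for df_merged.
toy = pd.DataFrame({
    "Occ_Profile": ["SALARIED", "SALARIED", "SALARIED", "SELF EMPLOYED"],
    "Occupation_Group": ["SALARIED-HIGH", "SALARIED-LOW", "SALARIED-LOW", "BUSINESS"],
    "Occupation": ["DOCTOR", "CLERK", "DRIVER", "SHOP OWNER"],
})

cols = ["Occ_Profile", "Occupation_Group", "Occupation"]
# Fewer distinct levels suggests a broader categorization.
levels = toy[cols].nunique().sort_values()
print(levels)

# A detailed column nests inside a broad one if each detailed value
# maps to at most one broad value.
nested = bool((toy.groupby("Occupation")["Occ_Profile"].nunique() <= 1).all())
print("Occupation nests inside Occ_Profile:", nested)
```

The same two checks apply to the product and location column groups.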

### 4. Products: Product_Description, Product_Club_Manual, CUST_prod_cat, Product_brief_category, Par_NonPar

In [None]:
#Product_Description vs Product_Club_Manual

with pd.option_context('display.max_rows', None, 'display.max_columns', None ):
    display(pd.crosstab(df_merged['Product_Description'], df_merged['Product_Club_Manual'], margins = True))

***NOTES: Both are almost equally detailed, though Product_Club_Manual appears to be the slightly broader categorization***

In [None]:
# grouped value counts view
print('Null values in Product_Description:', df_merged['Product_Description'].isnull().sum())
print('Null values in Product_Club_Manual:', df_merged['Product_Club_Manual'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_Club_Manual','Product_Description']).size())

In [None]:
#CUST_prod_cat vs Product_brief_category
print('Null values in CUST_prod_cat:', df_merged['CUST_prod_cat'].isnull().sum())
print('Null values in Product_brief_category:', df_merged['Product_brief_category'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(pd.crosstab(df_merged['CUST_prod_cat'], df_merged['Product_brief_category'], margins = True))


***NOTES: Product_brief_category seems to be the more suitable broad categorization of the two. Also, SAFAL JEEVAN appears to be TRADITIONAL insurance***

In [None]:
#CUST_prod_cat vs Product_Club_Manual: only for reference- Product_Club_Manual seems a better categorization

print('Null values in CUST_prod_cat:', df_merged['CUST_prod_cat'].isnull().sum())
print('Null values in Product_Club_Manual:', df_merged['Product_Club_Manual'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_Club_Manual','CUST_prod_cat']).size())


In [None]:
#same columns, different grouping
with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['CUST_prod_cat', 'Product_Club_Manual']).size())

In [None]:
#Product_Club_Manual vs Product_brief_category

print('Null values in Product_Club_Manual:', df_merged['Product_Club_Manual'].isnull().sum())
print('Null values in Product_brief_category:', df_merged['Product_brief_category'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(pd.crosstab(df_merged['Product_Club_Manual'], df_merged['Product_brief_category'], margins = True))


***NOTES: Both Product_brief_category and Product_Club_Manual give useful information as broad categorizations for reference. Product_Club_Manual is the more detailed of the two***

In [None]:
# Grouped value counts view of both product categories

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_brief_category','Product_Club_Manual']).size())

In [None]:
#Product_brief_category vs Par_NonPar

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_brief_category','Par_NonPar']).size())

print('\n')

#Product_Club_Manual vs Par_NonPar
with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Par_NonPar','Product_Club_Manual']).size())

***NOTES: Product_brief_category seems to be the better classifier, along with Product_Club_Manual. Par_NonPar is too broad a categorization***

### 5. Location: Focus_region, STATNAME, DSTNAME, city, City_classification

***city and DSTNAME are too detailed, so that comparison is skipped***

In [None]:
#Focus region vs STATNAME

print('Null values in Focus_region:', df_merged['Focus_region'].isnull().sum())
print('Null values in STATNAME:', df_merged['STATNAME'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Focus_region','STATNAME']).size())

***NOTES: STATNAME is too detailed, but has far fewer distinct values than city and DSTNAME***

In [None]:
#Focus region vs City_classification


print('Null values in Focus_region:', df_merged['Focus_region'].isnull().sum())
print('Null values in City_classification:', df_merged['City_classification'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(pd.crosstab(df_merged['Focus_region'],df_merged['City_classification'], margins = True))


In [None]:
#City_classification vs STATNAME

print('Null values in City_classification:', df_merged['City_classification'].isnull().sum())
print('Null values in STATNAME:', df_merged['STATNAME'].isnull().sum())

with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['City_classification','STATNAME']).size())

***NOTES: There should be only 4 METRO cities (perhaps the intended classification is MEGA CITIES). Null counts are the same in both columns. City_classification seems to be a good broad categorization, whereas Focus_region is too generic***

### 6. Flags: ECS_flag, Med_Flag, channel_flag

In [None]:
#ECS
with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_brief_category','ECS_flag']).size())
    
#Med flag
with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_brief_category','Med_Flag']).size())

#Channel flag
with pd.option_context('display.max_rows', None):
    display(df_merged.groupby(['Product_brief_category','channel_flag']).size())
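
Whether every flag value occurs in every product category can be read from the grouped counts above, or checked directly by looking for zero cells in a cross-tabulation. The sketch below assumes toy category names and `Y`/`N` flag values, which are illustrative and may differ from the real data.

```python
import pandas as pd

# Toy data with assumed labels (illustrative only).
toy = pd.DataFrame({
    "Product_brief_category": ["TRADITIONAL", "TRADITIONAL", "ULIP", "ULIP", "TERM"],
    "ECS_flag": ["Y", "N", "Y", "N", "Y"],
})

table = pd.crosstab(toy["Product_brief_category"], toy["ECS_flag"])
# A zero cell means that flag value never occurs for that product category.
zero = table.eq(0)
missing_combos = [
    (cat, flag)
    for cat in table.index
    for flag in table.columns
    if zero.loc[cat, flag]
]
print(missing_combos)
```

An empty list would confirm that all flag values are present in every category.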

## Continuous Variables Exploration

- Values at or beyond the 1st and 99th percentiles are removed before plotting (the data is too skewed to plot otherwise)
- Plotting
    - Owner_salary
    - afyp
    - premium
    - sum_assured
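
Since the same percentile trimming is applied to all four columns, it can be factored into one small helper. This is a sketch under assumptions: the name `trim_quantiles` is hypothetical, and synthetic log-normal data stands in for a column such as `Owner_salary`.

```python
import numpy as np
import pandas as pd


def trim_quantiles(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Keep values strictly between the lower and upper quantiles; NaN is dropped."""
    lo, hi = series.quantile([lower, upper])
    return series[(series > lo) & (series < hi)]


# Synthetic right-skewed data (illustrative only, not from the dataset).
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))
trimmed = trim_quantiles(toy)
print("No of records excluded:", len(toy) - len(trimmed))
```

The returned series keeps the original index, so `df_merged.loc[trimmed.index]` would recover the full rows used for a plot.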

#### 1. Owner_salary 

In [None]:
#checking null values

print('records with nan Owner salary: ', df_merged['Owner_salary'].isnull().sum())


In [None]:
#keep rows strictly between the 1st and 99th percentiles (rows with NaN are excluded too)
lo, hi = df_merged['Owner_salary'].quantile([0.01, 0.99])
l1 = df_merged.index[(df_merged['Owner_salary'] > lo) & (df_merged['Owner_salary'] < hi)]
print('No of records excluded:', len(df_merged) - len(l1))

#Showing records used to plot graph and records excluded for reference
display(df_merged.loc[l1])
display(df_merged.drop(l1, axis = 'index'))

In [None]:
#Plot graph (distplot is deprecated in recent seaborn; histplot with kde=True replaces it)
sns.histplot(df_merged.loc[l1, 'Owner_salary'], kde=True)

#### 2. afyp 

In [None]:
#checking null values
print('records with nan afyp: ', df_merged['afyp'].isnull().sum())

In [None]:
#keep rows strictly between the 1st and 99th percentiles (rows with NaN are excluded too)
lo, hi = df_merged['afyp'].quantile([0.01, 0.99])
l2 = df_merged.index[(df_merged['afyp'] > lo) & (df_merged['afyp'] < hi)]
print('No of records excluded:', len(df_merged) - len(l2))

In [None]:
#Plot graph (distplot is deprecated in recent seaborn; histplot with kde=True replaces it)
sns.histplot(df_merged.loc[l2, 'afyp'], kde=True)

#### 3. premium

In [None]:
#checking null values
print('records with nan premium: ', df_merged['premium'].isnull().sum())

In [None]:
#keep rows strictly between the 1st and 99th percentiles (rows with NaN are excluded too)
lo, hi = df_merged['premium'].quantile([0.01, 0.99])
l3 = df_merged.index[(df_merged['premium'] > lo) & (df_merged['premium'] < hi)]
print('No of records excluded:', len(df_merged) - len(l3))

In [None]:
#Plot graph (distplot is deprecated in recent seaborn; histplot with kde=True replaces it)
sns.histplot(df_merged.loc[l3, 'premium'], kde=True)

#### 4. sum_assured 

In [None]:
#checking null values
print('records with nan sum_assured: ', df_merged['sum_assured'].isnull().sum())

In [None]:
#keep rows strictly between the 1st and 99th percentiles (rows with NaN are excluded too)
lo, hi = df_merged['sum_assured'].quantile([0.01, 0.99])
l4 = df_merged.index[(df_merged['sum_assured'] > lo) & (df_merged['sum_assured'] < hi)]
print('No of records excluded:', len(df_merged) - len(l4))

In [None]:
#Plot graph (distplot is deprecated in recent seaborn; histplot with kde=True replaces it)
sns.histplot(df_merged.loc[l4, 'sum_assured'], kde=True)

#### Observations : 
- The distributions are highly right-skewed.
- The values show a lot of variation.
- Hence, the mean value may be used for imputation, although for right-skewed data the median is usually more robust to the long tail.
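
Because the mean of a right-skewed variable is pulled toward the long tail, it may be worth comparing it with the median before settling on an imputation value. The sketch below uses synthetic log-normal data with some entries set to missing; the numbers are illustrative and say nothing about the actual columns.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample with some missing entries (illustrative only).
rng = np.random.default_rng(42)
values = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000))
values.iloc[::50] = np.nan

print("skewness:", values.skew())
print("mean    :", values.mean())
print("median  :", values.median())

# Two candidate imputations for the missing entries.
filled_mean = values.fillna(values.mean())
filled_median = values.fillna(values.median())
```

Running the same three summary statistics on `Owner_salary`, `afyp`, `premium` and `sum_assured` would show how far the mean sits from the median in each column.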