# Purpose and Observations:

### Purpose:-
 - Finish cleaning the 'merged' dataset and export the cleaned data


### Observations:-
 - "policy_number" is a unique identifier for both 'merged' datasets
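
The uniqueness observation can be verified directly rather than assumed. A minimal sketch on a toy frame (the values and the `premium` column are made up for illustration; only the `policy_number` column name comes from this notebook):

```python
import pandas as pd

# Toy stand-in for the merged data; values are illustrative only.
toy = pd.DataFrame({
    "policy_number": [101, 102, 103],
    "premium": [5000, 7500, 6200],  # hypothetical column
})

# True only when no policy_number value repeats.
is_unique = toy["policy_number"].is_unique

# All rows that share a policy_number with another row (empty when unique).
duplicate_rows = toy[toy["policy_number"].duplicated(keep=False)]

print(is_unique, len(duplicate_rows))
```

Running the same two checks on `df_merged` would show whether any policy appears more than once before cleaning starts.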

### Steps:-
1. Marital Status categories: M = Married, S = Single, W = Widowed, D = Divorced; N is treated as NaN
2. Own_Edu: missing values are represented by N.A and MISSING; both are treated as NaN
3. Occupation_Group: missing values are represented by N.A; treated as NaN
4. Focus_region: coalesce Andhra and ANDHRA, then convert ANDHRA, TAMIL NADU, and KKG to SOUTH (all string columns are converted to upper case)
5. STATNAME: duplicates are present in different letter cases (all string columns are converted to upper case)
6. Convert RCD and LA_DOB to datetime
7. Convert float columns to int (values are rounded up; columns with missing values stay float)
8. Drop the Unnamed column


## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import csv


## Read data and Cleaning

**Moving ahead with the 'merged' dataset**

In [None]:

df_merged=pd.read_csv('merged.csv')

# info() prints its summary and returns None, so wrapping it in display() would also show "None"
df_merged.info()

display(df_merged.head())

In [None]:
#value counts
   
print(" MERGED ".center(80,'#'), "\n")
for col in df_merged.select_dtypes(include=['object']).columns:
    print(col.center(40,'#'), "\n")
    print(df_merged[col].value_counts(), "\n")

In [None]:
#describe them
pd.options.display.float_format = "{:.2f}".format

display(df_merged.describe())

#### Data cleaning:-
1. Marital Status categories: M = Married, S = Single, W = Widowed, D = Divorced, N = unknown <br />
    NOTE: N is treated as NaN for further analysis
2. Own_Edu: should N.A and MISSING be treated as the same thing, i.e. missing values? <br />
    NOTE: both are treated as NaN for further analysis
3. Occupation_Group: missing values are represented by N.A <br />
    NOTE: treated as NaN for further analysis
4. Focus_region: coalesce Andhra and ANDHRA, then convert ANDHRA, TAMIL NADU, and KKG to SOUTH
5. STATNAME: duplicates are present in different letter cases, so convert to a single case
6. Convert RCD and LA_DOB to datetime
7. Convert float columns to int
8. Drop the Unnamed column

#### 1. 2. 3.

In [None]:
# Marital_status, Own_Edu, Occupation_Group fix
# Note: replace() on the whole frame turns these exact values into NaN in every column
df_merged.replace(['N', 'N.A', 'MISSING'], np.nan, inplace=True)
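
Because the replacement above is applied to the whole frame, a literal `N` in any other column would also become NaN. If that is not wanted, the placeholders can be scoped per column. A minimal sketch, assuming the column names from the cell comment (`Marital_status`, `Own_Edu`) and a hypothetical `Smoker_flag` column where `N` is a real value:

```python
import numpy as np
import pandas as pd

# Toy data; Smoker_flag is a hypothetical column where "N" means "no".
toy = pd.DataFrame({
    "Marital_status": ["M", "N", "S"],
    "Own_Edu": ["GRADUATE", "N.A", "MISSING"],
    "Smoker_flag": ["Y", "N", "N"],
})

# Placeholders that mean "missing", listed per column.
missing_markers = {
    "Marital_status": ["N"],
    "Own_Edu": ["N.A", "MISSING"],
}

cleaned = toy.copy()
for col, markers in missing_markers.items():
    # Only the listed columns are touched; Smoker_flag keeps its "N" values.
    cleaned[col] = cleaned[col].replace(markers, np.nan)

print(cleaned)
```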

#### 4. 5.

In [None]:
#STATNAME fix and Focus_region fix

objectlistm = list(df_merged.select_dtypes('object').columns)
for col in objectlistm:
    df_merged[col] = df_merged[col].str.upper()


# putting them under SOUTH
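# Assign the result back: inplace=True on a selected column may act on a copy
# and is deprecated in recent pandas versions.
df_merged['Focus_region'] = df_merged['Focus_region'].replace(['KKG','ANDHRA','TAMIL NADU'], 'SOUTH')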
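
The order of the two operations matters: upper-casing first is what lets a single replacement catch every spelling. A small sketch on made-up values (the `NORTH` entry and the mixed-case `Tamil Nadu` are illustrative assumptions):

```python
import pandas as pd

# Toy Focus_region values with mixed letter case.
region = pd.Series(
    ["Andhra", "ANDHRA", "Tamil Nadu", "KKG", "NORTH"], name="Focus_region"
)

# Upper-case first, then map the three southern labels to SOUTH.
normalized = region.str.upper().replace(["KKG", "ANDHRA", "TAMIL NADU"], "SOUTH")

counts = normalized.value_counts()
print(counts)
```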

#### 6.

In [None]:
# RCD and LA_DOB fix
df_merged["LA_DOB"]= pd.to_datetime(df_merged["LA_DOB"])
df_merged["RCD"] = pd.to_datetime(df_merged["RCD"])
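
If some date strings are malformed, `pd.to_datetime` raises by default. A hedged sketch of a more defensive parse, assuming an ISO-style `YYYY-MM-DD` format (the actual format of `RCD` and `LA_DOB` in `merged.csv` is not shown here, so the `format` string is an assumption):

```python
import pandas as pd

# Toy date strings; the last one is deliberately invalid.
raw = pd.Series(["1985-03-14", "1990-12-01", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count how many values failed to parse so they can be inspected.
n_bad = parsed.isna().sum()
print(parsed.dtype, n_bad)
```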


#### 7.

In [None]:
#Float dtype fix
floatlist = list(df_merged.select_dtypes('float').columns)
display(df_merged.loc[:, floatlist])

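# Round up, then cast to int64 only where no values are missing
# (NaN cannot be stored in an int64 column, so those columns stay float).
for col in floatlist:
    df_merged[col] = np.ceil(df_merged[col])
    if df_merged[col].notna().all():
        df_merged[col] = df_merged[col].astype('int64')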
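
Columns that contain missing values are left as float by the loop above. One possible alternative, not used in this notebook, is pandas' nullable `Int64` dtype, which can hold integers alongside missing values. A minimal sketch with a hypothetical `term` column:

```python
import numpy as np
import pandas as pd

# Toy float column with a missing value; "term" is a hypothetical name.
toy = pd.DataFrame({"term": [10.0, 15.2, np.nan]})

# Round up, then cast to the nullable integer dtype (capital "I").
toy["term"] = np.ceil(toy["term"]).astype("Int64")

print(toy["term"])
```

Note that a plain CSV round trip would read such a column back as float unless the dtype is given again at import time.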


#### 8.

In [None]:
# drop the leftover index column "Unnamed: 0"
df_merged.drop(columns=["Unnamed: 0"], inplace=True)

In [None]:
#check revised value counts    
print(" MERGED ".center(80,'#'), "\n")
for col in df_merged.select_dtypes(include=['object']).columns:
    print(col.center(40,'#'), "\n")
    print(df_merged[col].value_counts(), "\n")

#### Export Cleaned data

In [None]:
#merged
df_merged.to_csv('Merged_clean.csv', index = False)

**NOTE: after re-importing the cleaned data, RCD and LA_DOB must be converted to datetime again, because they are read back as object dtype. <br />
This happens because CSV files store dates as plain text, not as a datetime type.**
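
One way to handle this at import time is the `parse_dates` argument of `pd.read_csv`. A minimal sketch that uses an in-memory CSV with made-up rows in place of `Merged_clean.csv`:

```python
import io
import pandas as pd

# In-memory stand-in for Merged_clean.csv; the rows are illustrative only.
csv_text = (
    "policy_number,RCD,LA_DOB\n"
    "101,2015-06-30,1980-01-15\n"
    "102,2016-11-02,1975-09-08\n"
)

# parse_dates converts the named columns to datetime while reading.
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["RCD", "LA_DOB"])

print(reloaded.dtypes)
```

In the notebook itself, `io.StringIO(csv_text)` would be replaced by the path `'Merged_clean.csv'`.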