# 4.3 Data Wrangling Exercise


## Submission questions

Create a short document (1-2 pages) in your GitHub repository describing the data wrangling steps that you undertook to clean your capstone project data set. What kind of cleaning steps did you perform? How did you deal with missing values, if any? Were there outliers, and how did you decide to handle them? This document will eventually become part of your milestone report.

## Approach

1. Data exploration (and visualization)
2. Data cleaning
3. Data transformation

This process is iterative: the steps have been, and are expected to be, repeated as the data analysis demands.

### Data exploration and cleaning

This step involved importing packages as necessary, loading the data, and checking the data types of the variables. During exploration, features with a very high share of missing values (> 96%) and features containing unknown or invalid entries were identified.
During cleaning, the mostly-missing columns were dropped and the remaining unknown/invalid entries were replaced with NaN.
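
The idea of flagging mostly-missing features and dropping them can be sketched on a toy dataframe. This is a minimal illustration, not the notebook's actual code: `toy_df`, its values, and the 50% `threshold` are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real file; the values are illustrative only.
toy_df = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "age": ["[70-80)", "[60-70)", "[50-60)", "[80-90)"],
})

# Treat the '?' placeholder as missing, as na_values="?" does at read time.
toy_df = toy_df.replace("?", np.nan)

# Percentage of missing values per column, relative to the total row count.
missing_pct = toy_df.isna().mean() * 100

# Hypothetical cut-off: drop columns with more than 50% missing values.
threshold = 50
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
reduced_df = toy_df.drop(columns=cols_to_drop)

print(missing_pct)
print(cols_to_drop)
```

Dividing by the total number of rows (which `isna().mean()` does implicitly) keeps the percentage meaningful even when most of a column is missing.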

In [1]:
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
#read the data file, treating '?' as a missing value (NaN)
input_file_name = 'diabetes.csv'
rawdata_df = pd.read_csv(input_file_name, na_values = "?", engine='python')

In [3]:
#shape of the dataframe
rawdata_df.shape

(101766, 50)

In [4]:
#make a copy of the dataframe
cleandata_df = rawdata_df.copy()

#checking the percentage of missing values per feature (if > 0)
#('?' was already read in as NaN via na_values, so count NaN rather than '?')
missing_pct = cleandata_df.isna().mean() * 100
print(missing_pct[missing_pct > 0])

In [5]:
#drop the feature 'weight'
cleandata_df.drop(['weight'], inplace = True, axis = 1)

In [6]:
#check the shape again 
cleandata_df.shape

(101766, 49)

In [7]:
#replace the 'Unknown/Invalid' placeholder with NaN
cleandata_df.replace('Unknown/Invalid', np.nan, inplace = True)

In [8]:
#checking that no '?' or 'Unknown/Invalid' placeholders remain
for z in cleandata_df.columns:
    if cleandata_df[z].dtype == object:
        n_left = cleandata_df[z].isin(['?', 'Unknown/Invalid']).sum()
        if n_left > 0:
            print(z, ': placeholder values remaining - ', n_left)

### Data Transformation

Columns whose values needed recoding were identified, and their values were replaced with new codes.
For example, the values of the "readmitted" column ("NO", ">30", and "<30") were recoded as 0 for "NO" and ">30", and as 1 for "<30".

In [9]:
#'Age' feature
age_values = {"age":{"[0-10)": 1, "[10-20)": 2, "[20-30)": 3, 
                    "[30-40)": 4, "[40-50)": 5, "[50-60)": 6,
                    "[60-70)": 7, "[70-80)": 8, "[80-90)": 9,
                    "[90-100)": 10}}
cleandata_df.replace(age_values, inplace = True)
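
One caveat with dictionary-based recoding is that `replace` leaves any value missing from the mapping unchanged, so a malformed label can slip through silently. The sketch below shows one possible way to detect such labels with `Series.map`, which returns NaN for unmapped values. The names `age_codes` and `toy_age`, and the malformed label itself, are hypothetical and used only for illustration.

```python
import pandas as pd

# Build the same decade-bracket codes as above: "[0-10)" -> 1, ..., "[90-100)" -> 10.
age_codes = {f"[{10 * i}-{10 * (i + 1)})": i + 1 for i in range(10)}

# Toy series; the last label is deliberately malformed.
toy_age = pd.Series(["[0-10)", "[70-80)", "[90-100)", "70-80"])

# map() yields NaN wherever a label has no entry in the mapping.
recoded = toy_age.map(age_codes)

# Collect the original labels that could not be recoded.
unmapped = toy_age[recoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list would indicate that every label in the column was covered by the mapping.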

In [10]:
#checking for number of readmissions within 30 days
cleandata_df['readmitted'].value_counts()

NO     54864
>30    35545
<30    11357
Name: readmitted, dtype: int64

In [11]:
#converting readmitted < 30 days value to 1 and rest to 0
readmitted_values = {"readmitted": {"NO": 0, ">30": 0, "<30": 1}}
cleandata_df.replace(readmitted_values, inplace = True)

In [12]:
#checking for new values
cleandata_df['readmitted'].value_counts()

0    90409
1    11357
Name: readmitted, dtype: int64
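
To put the recoded counts in perspective, the share of each class can be computed directly from the numbers printed above. This is a small arithmetic sketch; the `counts` series is typed in by hand from that output rather than read from the data file.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 90409, 1: 11357}, name="readmitted")

# Share of each class as a percentage of all encounters.
share = counts / counts.sum() * 100
print(share.round(2))
```

On the actual dataframe, `cleandata_df['readmitted'].value_counts(normalize=True)` should give the same proportions as fractions.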

In [13]:
#writing the dataframe to a tab-separated file for further analysis
output_file_name = "clean_data_diabetes.csv"
cleandata_df.to_csv(output_file_name, sep = '\t')
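
Because the file is written with a tab separator and includes the row index, reading it back needs matching arguments. The following is a minimal round-trip sketch on a toy dataframe, using an in-memory buffer in place of the real file; `toy_df` and `buffer` are illustrative names only.

```python
import io
import pandas as pd

# Toy dataframe standing in for the cleaned data.
toy_df = pd.DataFrame({"age": [7, 8], "readmitted": [0, 1]})

# Write with a tab separator; the index is written as the first column by default.
buffer = io.StringIO()
toy_df.to_csv(buffer, sep="\t")
buffer.seek(0)

# Read back with the same separator, restoring the first column as the index.
restored_df = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored_df)
```

Without `sep="\t"` the whole row would be parsed as a single column, and without `index_col=0` the saved index would reappear as an extra unnamed column.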