# Data Cleaning

### Demo by Joey Naro

In [None]:
# if you don't already have word2number
import sys
! {sys.executable} -m pip install word2number

In [None]:
from word2number import w2n
import pandas as pd

### This dataset was modified for this presentation. It was originally a dataset for detecting kidney disease.

In [None]:
# Load data from file
df = pd.read_csv('kidney_disease_uncleaned.csv')
df.head(15)

## What are some things that we might want to change?

In [None]:
df.shape

### As you can see, many of these features appear to be binary. However, inconsistent capitalization and spelling create extra feature values.

## Categorical Data Correction

### Let's take a closer look at the rbc feature.

In [None]:
# get unique rbc values


### It appears that rbc should only take the values normal and abnormal.

### Let's convert all values to lower case.

In [None]:
# convert rbc to lower case


# recheck the unique values in rbc
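
As a minimal sketch of the lower-casing step, assuming a small made-up Series in place of the real rbc column (the values below are illustrative, not taken from the file):

```python
import pandas as pd

# Made-up values for illustration; not taken from the real file
rbc = pd.Series(['normal', 'Normal', 'ABNORMAL', 'abnormal', None])

# .str.lower() lower-cases every string and leaves missing values missing
rbc_lower = rbc.str.lower()

# Collect the distinct non-missing values that remain
unique_values = rbc_lower.dropna().unique().tolist()
print(unique_values)
```

On the real data frame the same idea would be written against df['rbc'] and assigned back to that column.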


### Now all we have left is misspellings.

### We can do this one by one.

In [None]:
# change ubnormal to abnormal


# recheck the unique values in rbc
df['rbc'].unique()
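
One possible way to fix a single misspelling is Series.replace, which swaps values that match exactly. The sketch below uses a made-up Series rather than the real column:

```python
import pandas as pd

# Made-up values for illustration
rbc = pd.Series(['normal', 'ubnormal', 'abnormal'])

# Replace only the entries that are exactly 'ubnormal'
rbc_fixed = rbc.replace('ubnormal', 'abnormal')
print(rbc_fixed.unique())
```

This works well for a handful of known misspellings, but each one has to be listed explicitly.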

### We can also use regex to fix most of the misspellings at once.

In [None]:
# regex to correct normal values, na is needed when NaN is present


# recheck the unique values in rbc
df['rbc'].unique()
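
A rough sketch of the regex approach follows. The pattern (anything starting with "n" counts as normal) and the sample values are assumptions made for illustration; the real column may need a different pattern. Passing na=False makes missing values count as "no match" so the mask stays purely boolean.

```python
import pandas as pd

# Made-up values for illustration, including one missing entry
rbc = pd.Series(['normal', 'normel', 'nromal', 'abnormal', None])

# Assumed pattern: any value that starts with "n" is meant to be "normal".
# na=False treats missing values as non-matching instead of propagating NaN.
looks_normal = rbc.str.contains(r'^n', na=False)

# Overwrite the matching entries and leave everything else untouched
rbc_fixed = rbc.mask(looks_normal, 'normal')
print(rbc_fixed.dropna().unique())
```

Without na=False the mask would contain missing values, and using it to select rows could raise an error.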

### The NaN can be handled with value imputation, but we'll save that for another demo.

### Let's take a look at the values in pcc

In [None]:
df['pcc'].unique()

### We can easily get rid of those spaces in the middle by replacing them with nothing

In [None]:
# replace spaces with nothing


df['pcc'].unique()

### On a similar note, df['pcc'].str.strip() will remove leading and trailing whitespace but leave the middle untouched.
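
To make the difference between the two approaches concrete, here is a small sketch on made-up pcc-like values (the strings are illustrative only):

```python
import pandas as pd

# Made-up values with spaces at the ends and in the middle
pcc = pd.Series([' present', 'not present', 'notpresent ', 'pre sent'])

# Replacing ' ' with '' removes every space, wherever it appears
no_spaces = pcc.str.replace(' ', '', regex=False)

# strip() only trims the ends, so interior spaces survive
stripped = pcc.str.strip()

print(no_spaces.tolist())
print(stripped.tolist())
```

Removing every space is appropriate here because the valid values contain none; for free text, strip() is usually the safer choice.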

## Number Correction

### We can find hidden incorrect values by checking the data types. If a feature seems to have numeric values but is of type object, we know to look closer for incorrect values.
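
One way to locate the offending entries, sketched on a made-up data frame (df_demo and its values are hypothetical), is to attempt a numeric conversion with errors='coerce' and look at what failed to convert:

```python
import pandas as pd

# Made-up frame: 'id' looks numeric but contains one spelled-out number
df_demo = pd.DataFrame({'id': ['0', '1', 'two', '3'], 'age': [48, 7, 62, 48]})

# The text entry keeps the whole column from being stored as numbers
id_is_numeric = pd.api.types.is_numeric_dtype(df_demo['id'])

# errors='coerce' turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(df_demo['id'], errors='coerce')

# Keep entries that failed to parse but were not missing to begin with
failed = as_numbers.isna() & df_demo['id'].notna()
bad_values = df_demo.loc[failed, 'id'].tolist()
print(id_is_numeric, bad_values)
```

The list of failed values shows exactly which entries need manual attention before converting the column.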

### id values should be integers. Let's take a closer look.

In [None]:
df['id'].head(15)

### Here we can see that at least one number was spelled out.

### This is easy to fix by applying the word2number library.

In [None]:
# convert word id values to ints


# check the id values
df['id'].head(15)
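
A possible sketch of the conversion, using word2number's w2n.word_to_num on a made-up Series. The helper to_int is a hypothetical name introduced here, and the small fallback function exists only so the sketch still runs when the package is not installed:

```python
import pandas as pd

try:
    from word2number import w2n
    word_to_num = w2n.word_to_num
except ImportError:
    # Tiny stand-in (assumption) so the sketch runs without the package
    def word_to_num(text):
        return {'zero': 0, 'one': 1, 'two': 2, 'three': 3}[text.strip().lower()]

def to_int(value):
    """Hypothetical helper: pass digit strings through, translate words."""
    text = str(value).strip()
    if text.isdigit():
        return int(text)
    return word_to_num(text)

# Made-up id values with one spelled-out number
ids = pd.Series(['0', '1', 'two', '3'])
ids_fixed = ids.apply(to_int)
print(ids_fixed.tolist())
```

On the real data frame the same helper could be applied to df['id'], assuming the column has no missing values.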

In [None]:
df.head(15)

### It looks like some of the wc values have commas, which cause them not to be interpreted as integers. That's a simple fix.

In [None]:
# get rid of the commas


# clean up the feature
# '?' is a regex metacharacter, so match it literally with regex=False
df['wc'] = df['wc'].str.replace('?', '', regex=False)
df['wc'] = df['wc'].str.strip()

# convert the data type to numeric values
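
The whole clean-up for a column like wc might look like the sketch below. The sample values are made up; pd.to_numeric does the final conversion once the stray characters are gone.

```python
import pandas as pd

# Made-up values with commas, a stray '?', and extra whitespace
wc = pd.Series(['7,800', '6000', ' 9,600 ', '?5200'])

wc_clean = (
    wc.str.replace(',', '', regex=False)   # drop thousands separators
      .str.replace('?', '', regex=False)   # drop stray question marks
      .str.strip()                         # trim leading/trailing whitespace
)

# Now every entry parses cleanly as a number
wc_numeric = pd.to_numeric(wc_clean)
print(wc_numeric.tolist())
```

If some entries might still be unparseable, passing errors='coerce' to pd.to_numeric turns them into NaN instead of raising.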


In [None]:
df.head(15)

## Removing Duplicates

### Rows 12 and 13 are duplicates. Here's an easy way to fix that.

In [None]:
len(df)

In [None]:
# drop duplicates


len(df)
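
A minimal sketch of duplicate removal with DataFrame.drop_duplicates, on a made-up frame (df_demo is hypothetical):

```python
import pandas as pd

# Made-up frame in which rows 1 and 2 are identical
df_demo = pd.DataFrame({'age': [48, 7, 7, 62], 'bp': [80, 50, 50, 80]})

# Keep the first copy of each repeated row and renumber the index
deduped = df_demo.drop_duplicates().reset_index(drop=True)
print(len(df_demo), len(deduped))
```

drop_duplicates only removes rows that match in every column, which is why the earlier clean-up steps matter: values that differ only by a stray space or capital letter would not be treated as duplicates.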

In [None]:
df.dtypes

## One-hot Encoding

### Now that the numeric features are clean, we may also need to convert categorical values to numeric values. We can do this with one-hot encoding.

In [None]:
# One hot encode pc


new_df
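
One common way to one-hot encode a column is pd.get_dummies. The sketch below assumes that function and a made-up pc Series; the demo itself may use a different tool.

```python
import pandas as pd

# Made-up values for illustration
pc = pd.Series(['normal', 'abnormal', 'normal'], name='pc')

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(pc, prefix='pc', dtype=int)
print(encoded)
```

Each row has a 1 in exactly one of the new columns, corresponding to its original value.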

### Because the encoded columns are interdependent (for a binary feature, one column determines the other), we do not always need a column for every value.

In [None]:
# one hot encode with 2 columns


new_df
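
One way to drop the redundant column, assuming pd.get_dummies is used, is its drop_first option. This is a sketch of that option on made-up values, not necessarily the approach taken in the demo:

```python
import pandas as pd

# Made-up binary feature
pc = pd.Series(['normal', 'abnormal', 'normal'], name='pc')

# drop_first=True omits the first category ('abnormal'),
# so a 0 in pc_normal implies the row was abnormal
encoded = pd.get_dummies(pc, prefix='pc', drop_first=True, dtype=int)
print(encoded)
```

Note that this shortcut assumes no missing values: a row with NaN would also get a 0 and become indistinguishable from the dropped category.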

### We can take care of all of our one hot encoding in one easy step.

In [None]:
# more clean up
df['ba'] = df['ba'].str.replace(' ', '')
df['classification'] = df['classification'].str.strip()
df['dm'] = df['dm'].str.strip()

# one hot encode all categorical data
# ['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'classification']


new_df
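
Encoding several categorical columns in one call might look like the following sketch, which assumes pd.get_dummies and a made-up frame (df_demo and its values are hypothetical):

```python
import pandas as pd

# Made-up frame with one numeric and two categorical columns
df_demo = pd.DataFrame({
    'age': [48, 7, 62],
    'rbc': ['normal', 'abnormal', 'normal'],
    'htn': ['yes', 'no', 'no'],
})

categorical = ['rbc', 'htn']

# Each listed column is expanded into indicator columns named <column>_<value>
encoded = pd.get_dummies(df_demo[categorical], dtype=int)
print(encoded.columns.tolist())
```

The column name is used as the prefix automatically, so the origin of each indicator column stays clear.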

### We can then combine the encodings with the numeric values to have a data frame of fully numeric values.

In [None]:
# combine all numeric data
# ['age', 'bp', 'sg', 'bgr', 'bu', 'sc', 'wc']

new_df
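
Combining the numeric columns with the encodings can be sketched with pd.concat along the column axis. The frame below is made up, and using pd.get_dummies for the encodings is an assumption:

```python
import pandas as pd

# Made-up frame with two numeric columns and one categorical column
df_demo = pd.DataFrame({
    'age': [48, 7, 62],
    'bp': [80, 50, 80],
    'rbc': ['normal', 'abnormal', 'normal'],
})

numeric = df_demo[['age', 'bp']]
encoded = pd.get_dummies(df_demo[['rbc']], dtype=int)

# axis=1 places the frames side by side, aligned on the row index
combined = pd.concat([numeric, encoded], axis=1)
print(combined.dtypes)
```

Because concat aligns on the index, both pieces should come from the same data frame without re-indexing in between.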

In [None]:
new_df.dtypes

## Additional Links:

### Original dataset: https://www.kaggle.com/mansoordaku/ckdisease

### Recommended Pandas tutorial: https://www.youtube.com/watch?v=vmEHCJofslg