# Assignment: Data preprocessing

## Objectives

The objectives of this assignment are:
1.	to get familiar with the Jupyter Notebook environment
2.	to learn the basics of manipulating data frames

## Setup

In the assignment, use the CKD dataset that is available at [https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease](https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease).


## Task

Load the CKD dataset into a single data frame and read its description. Construct a pipeline for modifying the data frame.

The modified data frame should meet the following requirements:

- It should include exactly the following columns:
  - age
  - blood pressure
  - specific gravity
  - albumin
  - sugar
  - blood glucose random
  - blood urea
  - sodium
  - potassium
  - hemoglobin
  - packed cell volume
  - white blood cell count
  - red blood cell count
  - class

- The hemoglobin values should be expressed in g/l. In the original data set, they are expressed as g/dl.

- The values of the class column should be recoded as a or c (affected or control).

- Rows with three or more missing values should be removed. Indicate the number of rows left in the modified data frame.

// TODO

Next, split the data frame into two data frames, one for the affected individuals and one for the control individuals. Display the data frames, and indicate the number of rows in each data frame.

For each data frame, calculate the basic statistics for each column, and provide clear, readable histograms for each numerical column. Do you see any outliers? If so, how would you handle them?

Finally, calculate the correlation matrix and visualize it for each data frame. Clearly describe the results and your interpretation for it.


## Deliverables

Submit a GitHub permalink that points to the Jupyter notebook as instructed in Oma.
The submitted notebook must contain the step-to-step analysis pipeline, complete with Markdown blocks and comments that clearly explain what has been done.


In [41]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch dataset
chronic_kidney_disease = fetch_ucirepo(id=336)

# data (as pandas dataframes)
X = chronic_kidney_disease.data.features
y = chronic_kidney_disease.data.targets

df = pd.DataFrame(X)
df2 = pd.DataFrame(y)

df = df[['bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc']]
df['class'] = df2['class']

df = df.rename(columns={'bp': 'Blood pressure', 'sg': 'specific gravity', 'al': 'albumin', 'su': 'sugar',
                        'bgr': 'blood glucose random', 'bu': 'blood urea', 'sod': 'sodium', 'pot': 'potassium',
                        'hemo': 'hemoglobin', 'pcv': 'packed cell volume', 'wbcc': 'white blood cell count',
                        'rbcc': 'red blood cell count', })
df.hemoglobin = df.hemoglobin / 10
df['class'] = df['class'].map({'ckd': 'a', 'notckd': 'c'})

df = df.dropna(thresh=len(df.columns) - 2)

df


Unnamed: 0,Blood pressure,specific gravity,albumin,sugar,blood glucose random,blood urea,sodium,potassium,hemoglobin,packed cell volume,white blood cell count,red blood cell count,class
0,80.0,1.020,1.0,0.0,121.0,36.0,,,1.54,44.0,7800.0,5.2,a
3,70.0,1.005,4.0,0.0,117.0,56.0,111.0,2.5,1.12,32.0,6700.0,3.9,a
4,80.0,1.010,2.0,0.0,106.0,26.0,,,1.16,35.0,7300.0,4.6,a
5,90.0,1.015,3.0,0.0,74.0,25.0,142.0,3.2,1.22,39.0,7800.0,4.4,a
6,70.0,1.010,0.0,0.0,100.0,54.0,104.0,4.0,1.24,36.0,,,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,80.0,1.020,0.0,0.0,140.0,49.0,150.0,4.9,1.57,47.0,6700.0,4.9,c
396,70.0,1.025,0.0,0.0,75.0,31.0,141.0,3.5,1.65,54.0,7800.0,6.2,c
397,80.0,1.020,0.0,0.0,100.0,26.0,137.0,4.4,1.58,49.0,6600.0,5.4,c
398,60.0,1.025,0.0,0.0,114.0,50.0,135.0,4.9,1.42,51.0,7200.0,5.9,c
