## Part 1: Restructuring Data

In the first part of this programming exercise, your goal is to recover the original format of the Pima Indian Diabetes dataset. Here, you are given the same data, but in a much less manageable form. You should use the NumPy, SciPy, and/or pandas packages to implement a modular (i.e. function-based) pipeline for restructuring the data. The final result should be identical to the downloadable data.

You may have to look back at the data in `pima-indians-diabetes.csv` to figure out the format of the messy version here.

Avoid using outside tools like a text editor or a spreadsheet program. Instead, all your transformations should be done programmatically. If you are feeling really ambitious, you might even want to build in some defensive coding strategies!

In [7]:
import pandas as pd
import numpy as np

### Load data

In [2]:
messy_df = pd.read_csv("data/messy-pima-indians-diabetes.csv")

In [12]:
s = messy_df.shape
print("There are {} rows".format(s[0]))

There are 7833 rows


Each sample is spread across 9 rows, one per field. The columns of the restructured table must therefore be the following:
* times_pregnant
* plasma_glucose_concentration
* diastolic_blood_pressure
* triceps_thickness
* 2_hour_serum_insulin
* BMI
* diabetes_pedigreen
* age
* diabetes
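
As a quick sanity check on the nine-rows-per-sample layout, the reported row count can be compared with what that layout predicts. This is only a sketch: it assumes the original dataset holds 768 samples (the size of the published Pima data), and the variable names are illustrative.

```python
# Sanity check: does the reported row count fit a strict 9-rows-per-sample layout?
n_rows = 7833      # row count reported by messy_df.shape above
n_fields = 9       # one row per field in the list above
n_samples = 768    # assumption: number of samples in the original dataset

# How many complete groups of 9 fit, and how many rows are left over?
complete_groups, leftover = divmod(n_rows, n_fields)

# Rows a strict layout would need, and how many rows the file has beyond that.
expected_rows = n_samples * n_fields
extra_rows = n_rows - expected_rows

print(complete_groups, leftover)   # 870 3
print(expected_rows, extra_rows)   # 6912 921
```

Under that assumption, the file contains 921 rows more than a strict layout would need, and 7833 is not even a multiple of 9. A purely positional reshape would therefore misalign the data, which is a good reason to parse each row by its field name instead.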

In [45]:
messy_df.head(9)

| | Non-diabetic |
|---|---|
| 0 | times_pregnant6.0000 |
| 1 | plasma_glucose_concentration148.0000 |
| 2 | diastolic_blood_pressure72.0000 |
| 3 | triceps_thickness35.0000 |
| 4 | 2_hour_serum_insulin0.0000 |
| 5 | BMI33.6000 |
| 6 | diabetes_pedigreen0.6270 |
| 7 | age50.0000 |
| 8 | diabetes1.0000 |


### Extract the information for each column

* Make a list of all the column names
* Loop over the rows of the dataframe and split each string on the column name it contains
* Append the value that follows the name to that column's list in the dictionary

In [110]:
columns_list = [
    "times_pregnant",
    "plasma_glucose_concentration",
    "diastolic_blood_pressure",
    "triceps_thickness",
    "2_hour_serum_insulin",
    "BMI",
    "diabetes_pedigreen",
    "age",
    "diabetes",
]

# One empty list per column; the key order follows columns_list.
columns_dic = {c: [] for c in columns_list}

In [111]:
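# Each cell looks like "<column name><value>", e.g. "BMI33.6000".
for r in messy_df["Non-diabetic"]:
    for c in columns_list:
        # Substring test: "diabetes" also matches "diabetes_pedigreen" rows.
        if c in r:
            # Keep whatever follows the column name (still a string).
            v = r.split(c)[1]
            columns_dic[c].append(v)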

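Two columns, `times_pregnant` and `diabetes`, collect twice as many values as the others (1536 instead of 768). For `diabetes`, the substring test `c in r` is the cause: the name `diabetes` is also contained in every `diabetes_pedigreen` row, so each sample contributes one spurious entry just before the real one. The doubled `times_pregnant` count suggests that this field is repeated in the messy data itself.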

In [112]:
for c in columns_dic:
    print(c, len(columns_dic[c]))

times_pregnant 1536
plasma_glucose_concentration 768
diastolic_blood_pressure 768
triceps_thickness 768
2_hour_serum_insulin 768
BMI 768
diabetes_pedigreen 768
age 768
diabetes 1536
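
Because `diabetes` is a prefix of `diabetes_pedigreen`, a substring test cannot tell those two kinds of rows apart. A stricter alternative is to match the whole string against the known names followed by a number. The sketch below is not the approach used in this notebook; `FIELDS`, `ROW_PATTERN` and `parse_row` are hypothetical names, and it assumes every value is a plain decimal number.

```python
import re

# Field names as they appear in the messy file (copied from the list above).
FIELDS = [
    "times_pregnant",
    "plasma_glucose_concentration",
    "diastolic_blood_pressure",
    "triceps_thickness",
    "2_hour_serum_insulin",
    "BMI",
    "diabetes_pedigreen",
    "age",
    "diabetes",
]

# Longer names first, so "diabetes_pedigreen" is tried before its prefix "diabetes".
_names = "|".join(re.escape(f) for f in sorted(FIELDS, key=len, reverse=True))
# Assumption: every value is a plain decimal number such as "148.0000".
ROW_PATTERN = re.compile(rf"({_names})(-?\d+(?:\.\d+)?)")


def parse_row(text):
    """Return (field, value) for a 'name+number' string, or None if it does not fit."""
    match = ROW_PATTERN.fullmatch(str(text).strip())
    if match is None:
        return None
    return match.group(1), float(match.group(2))


print(parse_row("diabetes_pedigreen0.6270"))  # ('diabetes_pedigreen', 0.627)
print(parse_row("diabetes1.0000"))            # ('diabetes', 1.0)
print(parse_row("Non-diabetic"))              # None
```

Returning `None` for anything that does not fit the pattern makes unexpected rows easy to count and inspect instead of silently mis-assigning them.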


**Remove the redundant values**
* For the two doubled columns, keep only the values at odd positions (1, 3, 5, ...) and drop the rest

In [113]:
def remove_redundant_values(arr):
    """Return the values at odd positions (1, 3, 5, ...) of arr."""
    return arr[1::2]

In [114]:
columns_dic["times_pregnant"] = remove_redundant_values(columns_dic["times_pregnant"])
columns_dic["diabetes"] = remove_redundant_values(columns_dic["diabetes"])
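
To see why the odd positions are the ones to keep, it helps to look at what the extraction loop collects for `diabetes`: splitting a `diabetes_pedigreen` row on `diabetes` leaves a fragment such as `_pedigreen0.6270`, and that fragment always lands just before the real value. The sketch below uses made-up toy values and a hypothetical helper, `keep_odd_positions`, which adds a defensive length check.

```python
def keep_odd_positions(values):
    """Keep the entries at positions 1, 3, 5, ... (hypothetical helper name)."""
    # Defensive check: a doubled column must hold an even number of entries.
    if len(values) % 2 != 0:
        raise ValueError(f"expected an even number of values, got {len(values)}")
    return values[1::2]


# Toy input mimicking the "diabetes" list: each real value is preceded by the
# leftover of a "diabetes_pedigreen" row that was split on "diabetes".
collected = ["_pedigreen0.6270", "1.0000", "_pedigreen0.3510", "0.0000"]
print(keep_odd_positions(collected))  # ['1.0000', '0.0000']
```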

### Make a DataFrame from the Dictionary

In [115]:
df = pd.DataFrame(columns_dic)
df.head()

| | times_pregnant | plasma_glucose_concentration | diastolic_blood_pressure | triceps_thickness | 2_hour_serum_insulin | BMI | diabetes_pedigreen | age | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 | 1.0 |
| 1 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 | 1.0 |
| 2 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 | 1.0 |
| 3 | 3.0 | 78.0 | 50.0 | 32.0 | 88.0 | 31.0 | 0.248 | 26.0 | 1.0 |
| 4 | 2.0 | 197.0 | 70.0 | 45.0 | 543.0 | 30.5 | 0.158 | 53.0 | 1.0 |
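
The values produced by `r.split(c)[1]` are strings, so a comparison with the original file needs a numeric conversion first. The following is a minimal sketch on a two-row toy frame, not on the real data; `restructured` and `reference` are illustrative names, and in practice the reference would be read from `pima-indians-diabetes.csv`, whose exact layout is not shown here.

```python
import pandas as pd

# Toy stand-in for the restructured frame: values are still strings at this point.
restructured = pd.DataFrame({
    "times_pregnant": ["6.0000", "8.0000"],
    "BMI": ["33.6000", "23.3000"],
    "diabetes": ["1.0000", "1.0000"],
})

# Toy stand-in for the original data, already numeric.
reference = pd.DataFrame({
    "times_pregnant": [6.0, 8.0],
    "BMI": [33.6, 23.3],
    "diabetes": [1.0, 1.0],
})

# Convert every column to float; a leftover non-numeric string raises ValueError.
numeric = restructured.astype(float)

# Raises AssertionError describing the first difference, if there is one.
pd.testing.assert_frame_equal(numeric, reference)
```

Failing loudly on a non-numeric string is useful here, because it would catch a fragment such as `_pedigreen0.6270` that slipped through the cleanup step.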
