## Part 1: Restructuring Data

In the first part of this programming exercise, your goal is to recover the original format of the Pima Indian Diabetes dataset. Here, you are given the same data, but in a much less manageable form. You should use the NumPy, SciPy, and/or pandas packages to implement a modular (i.e., function-based) pipeline for restructuring the data. The final result should be identical to the downloadable data.

You may have to look back at the data in pima-indians-diabetes.csv to figure out the format of the messy version here.

Avoid using outside tools like a text editor or a spreadsheet program. Instead, all your transformations should be done programmatically in a way that can be tested in Part 2.

In [219]:
import pandas as pd
import numpy as np

In [220]:

# You should read in this data and restructure it to make it identical to the
# pima-indians-diabetes.csv introduced in the previous topic.
original_data = pd.read_csv('./data/pima-indians-diabetes.csv', header=0, index_col=None)
messy_data = "./data/messy-pima-indians-diabetes.csv"

original_data.head()


Unnamed: 0,times_pregnant,plasma_glucose_concentration,diastolic_blood_pressure,triceps_thickness,2-hour_serum_insulin,BMI,diabetes_pedigreen,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


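**Split data into diabetic and non-diabetic**
The non-diabetic rows form their own block of data, running up to the line that says `Diabetic`.
    - 0 for non-diabetic
    - 1 for diabetic

- Each value is its label followed directly by the number, with no separator.
- A regex is needed to separate the trailing number from the label and convert it to a float (the label itself can contain underscores and digits, as in `2_hour_serum_insulin`).
- Each row is also bookended by `times_pregnant`.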

1. times_pregnant6.0000
2. plasma_glucose_concentration148.0000
3. diastolic_blood_pressure72.0000
4. triceps_thickness35.0000
5. 2_hour_serum_insulin0.0000
6. BMI33.6000
7. diabetes_pedigreen0.6270
8. age50.0000
9. diabetes1.0000  **here diabetes 1 means non-diabetic**
10. times_pregnant6.0000
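
The label/value split described above could be sketched with a regex that anchors on the trailing number rather than on the first non-alphabetic character. This is only an illustration: `split_label_value` is a hypothetical helper name, and it assumes every data cell ends in a plain decimal number, as in the sample cells listed here.

```python
import re

# Lazily match the label, then a number anchored at the end of the string.
# Assumption: every data cell ends in a decimal number such as "148.0000".
_LABEL_VALUE = re.compile(r"^(?P<label>.*?)(?P<value>-?\d+(?:\.\d+)?)$")


def split_label_value(cell):
    """Split a cell like 'age50.0000' into ('age', 50.0)."""
    match = _LABEL_VALUE.match(cell)
    if match is None:
        # Header rows such as 'Non-diabetic' carry no trailing number.
        raise ValueError(f"no trailing number in {cell!r}")
    return match.group("label"), float(match.group("value"))


print(split_label_value("times_pregnant6.0000"))
print(split_label_value("2_hour_serum_insulin0.0000"))
```

Because the label group is lazy and the number is anchored at the end, a label that begins with a digit (`2_hour_serum_insulin`) is still kept whole.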

**Issue:** each record has 10 values, but 2 are left over at the end, so 8 values are probably missing somewhere in the dataset.  
Otherwise the total would be a multiple of 10.

**Procedure:**
1. Remove the 'Non-diabetic' and 'Diabetic' headers from the data
2. Remove the 'Participants' data points
3. Split the remaining column into chunks of 10, giving a list of lists with one entry per record
    - This doesn't work on its own because it assumes a complete dataset
    - Need to split it into an array with each record starting and ending at times_pregnant
    - OR still follow this procedure, then search the df column for values that don't match
4. Convert this into a dataframe
5. Remove the duplicate times_pregnant column
6. Reverse the diabetes column: 0 -> 1 and 1 -> 0
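
One way to act on the "chunk first, then search for mismatches" idea is to check every chunk of 10 against the expected label order. The sketch below is an assumption-laden illustration, not the notebook's method: `EXPECTED_LABELS` and `find_misaligned_chunks` are hypothetical names, and the check assumes each value starts with a digit directly after its label.

```python
# Expected label order for one record, bookended by times_pregnant.
EXPECTED_LABELS = [
    "times_pregnant", "plasma_glucose_concentration",
    "diastolic_blood_pressure", "triceps_thickness",
    "2_hour_serum_insulin", "BMI", "diabetes_pedigreen",
    "age", "diabetes", "times_pregnant",
]


def _matches(cell, label):
    # The label must be followed directly by a digit, so that
    # 'diabetes_pedigreen0.6270' does not pass as a 'diabetes' cell.
    return cell.startswith(label) and cell[len(label):len(label) + 1].isdigit()


def find_misaligned_chunks(lines, labels=EXPECTED_LABELS):
    """Return the indices of chunks that are short or out of order."""
    width = len(labels)
    bad = []
    for start in range(0, len(lines), width):
        chunk = lines[start:start + width]
        aligned = len(chunk) == width and all(
            _matches(cell, label) for cell, label in zip(chunk, labels)
        )
        if not aligned:
            bad.append(start // width)
    return bad
```

An empty result means every chunk lines up with the expected labels. The first index in a non-empty result points at the record where a value went missing, since everything after it is shifted.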


In [221]:

df = pd.read_csv(messy_data, header=None, delimiter='\n')
df

Unnamed: 0,0
0,Non-diabetic
1,times_pregnant6.0000
2,plasma_glucose_concentration148.0000
3,diastolic_blood_pressure72.0000
4,triceps_thickness35.0000
...,...
7829,BMI30.4000
7830,diabetes_pedigreen0.3150
7831,age23.0000
7832,diabetes0.0000
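
Some newer pandas releases may reject a newline as the `delimiter` in `read_csv`. If that happens, the lines can be read with plain Python instead. The sketch below assumes the file holds one value per line; `read_nonempty_lines` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import io


def read_nonempty_lines(handle):
    """Return stripped, non-empty lines from an open text file."""
    return [line.strip() for line in handle if line.strip()]


# In-memory stand-in for the messy file (hypothetical sample content).
sample = io.StringIO(
    "Non-diabetic\n"
    "times_pregnant6.0000\n"
    "\n"
    "plasma_glucose_concentration148.0000\n"
)
lines = read_nonempty_lines(sample)
print(lines)
```

With the real file, passing the resulting list to `pd.DataFrame(...)` should give a single column named `0`, which is the shape the later cells expect.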


In [223]:
# find and remove the Non-diabetic / Diabetic header rows and the Participants rows
def remove_string_data(df):
    # index of the first row holding each section header
    a = df[df[0] == 'Non-diabetic'].index[0]
    b = df[df[0] == 'Diabetic'].index[0]

    df = df.drop([a, b])
    # drop every row whose text starts with 'Participants'
    df = df.drop(df[df[0].str.startswith('Participants')].index)
    return df

new_df = remove_string_data(df)

df.shape[0] - new_df.shape[0]

154

In [224]:
data = np.array(new_df.values)
data = data.flatten()

data.shape

reshaped = np.reshape(data, (-1,10))

columns = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness', '2_hour_serum_insulin', 'BMI', 'diabetes_pedigreen', 'age', 'diabetes', 'duplicate_times_pregnant']

df = pd.DataFrame(reshaped, columns=columns)
df.head()



Unnamed: 0,times_pregnant,plasma_glucose_concentration,diastolic_blood_pressure,triceps_thickness,2_hour_serum_insulin,BMI,diabetes_pedigreen,age,diabetes,duplicate_times_pregnant
0,times_pregnant6.0000,plasma_glucose_concentration148.0000,diastolic_blood_pressure72.0000,triceps_thickness35.0000,2_hour_serum_insulin0.0000,BMI33.6000,diabetes_pedigreen0.6270,age50.0000,diabetes1.0000,times_pregnant6.0000
1,times_pregnant8.0000,plasma_glucose_concentration183.0000,diastolic_blood_pressure64.0000,triceps_thickness0.0000,2_hour_serum_insulin0.0000,BMI23.3000,diabetes_pedigreen0.6720,age32.0000,diabetes1.0000,times_pregnant8.0000
2,times_pregnant0.0000,plasma_glucose_concentration137.0000,diastolic_blood_pressure40.0000,triceps_thickness35.0000,2_hour_serum_insulin168.0000,BMI43.1000,diabetes_pedigreen2.2880,age33.0000,diabetes1.0000,times_pregnant0.0000
3,times_pregnant3.0000,plasma_glucose_concentration78.0000,diastolic_blood_pressure50.0000,triceps_thickness32.0000,2_hour_serum_insulin88.0000,BMI31.0000,diabetes_pedigreen0.2480,age26.0000,diabetes1.0000,times_pregnant3.0000
4,times_pregnant2.0000,plasma_glucose_concentration197.0000,diastolic_blood_pressure70.0000,triceps_thickness45.0000,2_hour_serum_insulin543.0000,BMI30.5000,diabetes_pedigreen0.1580,age53.0000,diabetes1.0000,times_pregnant2.0000


In [225]:
# check that every row starts with times_pregnant, which confirms no values are missing
a = df['times_pregnant'].str.startswith('times', na=False)
df.shape[0], a.sum()

(768, 768)

In [226]:
# remove the duplicate times_pregnant column
df = df.drop(columns=['duplicate_times_pregnant'])
list(df.columns)



['times_pregnant',
 'plasma_glucose_concentration',
 'diastolic_blood_pressure',
 'triceps_thickness',
 '2_hour_serum_insulin',
 'BMI',
 'diabetes_pedigreen',
 'age',
 'diabetes']

In [227]:
# strip each column's label from its values, leaving only the numeric text
for col in list(df.columns):
    df[col] = df[col].str.replace(col, '', regex=False)

df.head()

Unnamed: 0,times_pregnant,plasma_glucose_concentration,diastolic_blood_pressure,triceps_thickness,2_hour_serum_insulin,BMI,diabetes_pedigreen,age,diabetes
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0,1.0
1,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0,1.0
2,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
3,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1.0
4,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1.0


In [244]:
original_data.head()

Unnamed: 0,times_pregnant,plasma_glucose_concentration,diastolic_blood_pressure,triceps_thickness,2-hour_serum_insulin,BMI,diabetes_pedigreen,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [243]:
# convert diabetes values: the messy data uses 1 for non-diabetic, so flip 0 <-> 1
def convert_diabetes(x):
    if x == 1.0:
        return 0.0
    else:
        return 1.0

original_counts = df['diabetes'].value_counts()
df['diabetes'] = df['diabetes'].astype('float').apply(convert_diabetes)
  
print(original_counts, df['diabetes'].value_counts())
df.head()

1.0    500
0.0    268
Name: diabetes, dtype: int64 0.0    500
1.0    268
Name: diabetes, dtype: int64


Unnamed: 0,times_pregnant,plasma_glucose_concentration,diastolic_blood_pressure,triceps_thickness,2_hour_serum_insulin,BMI,diabetes_pedigreen,age,diabetes
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0,1.0
1,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0,1.0
2,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
3,3.0,78.0,50.0,32.0,88.0,31.0,0.248,26.0,1.0
4,2.0,197.0,70.0,45.0,543.0,30.5,0.158,53.0,1.0
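
Note that `convert_diabetes` returns 1.0 for anything other than 1.0, including unexpected values. A stricter variant could look the label up in a dictionary and fail loudly otherwise. This is a sketch only: `FLIPPED_LABEL` and `flip_label` are hypothetical names, not part of the notebook.

```python
# Explicit mapping, so unexpected labels are not silently turned into 1.0.
FLIPPED_LABEL = {0.0: 1.0, 1.0: 0.0}


def flip_label(value):
    """Swap 0 <-> 1 and raise ValueError for any other label."""
    try:
        return FLIPPED_LABEL[float(value)]
    except KeyError:
        raise ValueError(f"unexpected diabetes label: {value!r}") from None


print(flip_label("1.0000"), flip_label(0.0))
```

The same dictionary could also be passed to `Series.map`, in which case unmatched labels become `NaN` rather than raising.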


## Part 2: Unit Testing

Below are a series of simple arithmetic functions. Define a class of test cases for these functions that will adequately assure you they are working properly.

Hopefully you implemented Part 1 using a pipeline of functions. Here, you should design and implement unit tests for each function. Be sure to test edge cases with values not necessarily observed in the dataset. You may have to refer to the Python unittest package documentation: https://docs.python.org/3/library/unittest.html

In [266]:
import unittest
# Implement your tests here for the functions in the following cell.


class myTest(unittest.TestCase):
    def test_incr(self):
        self.assertEqual(incr(1), 2)
        self.assertEqual(incr(-1), 0)

    def test_decr(self):
        self.assertEqual(decr(1), 0)
        self.assertEqual(decr(-10), -11)
    
    def test_add(self):
        self.assertEqual(add(2,4), 6)

    def test_subt(self):
        self.assertEqual(subt(4), 3)
    
    def test_mult(self):
        self.assertEqual(mult(4,8), 32)
    
    def test_divi(self):
        self.assertEqual(divi(8,4), 2)
        with self.assertRaises(TypeError):
            divi('hello', 2)
    
        
# Should you want to delete a test case from within Jupyter notebook,
# you can run the following code to remove the class from the set of
# global variables: 
#
# `del myTest`

In [267]:
def incr(x):
    return x + 1

def decr(x):
    return x - 1

def add(x,y):
    return x + y

def subt(x):
    return x - 1

def mult(x,y):
    return x * y

def divi(x,y):
    return x / y
    

# This strange Python simulates running your code as if it were executed
# from the command-line, instead of within a Notebook. All it does is 
# call the automatically generated main() function (which is usually
# wrapped in Jupyter) with an explicit argument array with one value.
if __name__ == '__main__':
    unittest.main(argv=[''], exit=False, verbosity=2)

test_add (__main__.myTest) ... ok
test_decr (__main__.myTest) ... ok
test_divi (__main__.myTest) ... ok
test_incr (__main__.myTest) ... ok
test_mult (__main__.myTest) ... ok
test_subt (__main__.myTest) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.009s

OK
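
The tests above mostly cover typical integer inputs. A few further edge cases could be sketched as follows. The functions are copied from the notebook so the block runs on its own, and `EdgeCaseTests` is a hypothetical class name.

```python
import unittest


# Copies of the notebook's functions so the sketch runs on its own.
def add(x, y):
    return x + y


def mult(x, y):
    return x * y


def divi(x, y):
    return x / y


class EdgeCaseTests(unittest.TestCase):  # hypothetical class name
    def test_divi_by_zero(self):
        # Division by zero should raise rather than return a value.
        with self.assertRaises(ZeroDivisionError):
            divi(1, 0)

    def test_divi_returns_float(self):
        # True division keeps the fractional part.
        self.assertEqual(divi(7, 2), 3.5)

    def test_add_floats(self):
        # Compare floats approximately, not exactly.
        self.assertAlmostEqual(add(0.1, 0.2), 0.3)

    def test_mult_accepts_strings(self):
        # Documents that mult does not reject non-numeric input.
        self.assertEqual(mult('ab', 2), 'abab')


if __name__ == '__main__':
    unittest.main(argv=[''], exit=False, verbosity=2)
```

The string case is there to record current behaviour. If non-numeric input should be rejected, the function would need an explicit type check and the test would change accordingly.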
