Imports

In [1]:
import os
import pandas as pd

In [2]:
# Configure pandas display options so that wide tables are fully visible in Jupyter

In [3]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 150)

The data in the MSF_Dataset_Complete file is misleading: the column that should correspond to Hypertension actually contains the number of siblings. We suspect that a copy-paste error occurred when the "complete file" was created. This would explain why our initial results were so good, and also why our data set showed a number of high correlations (whole columns were improperly copy-pasted by the dataset creators).
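One way to look for copy-paste errors of this kind is to search for columns whose values are identical to those of another column. The following is only a minimal sketch on a toy frame: the column names, the values, and the helper `find_identical_columns` are hypothetical and not taken from the MSF files.

```python
import pandas as pd

# Hypothetical toy data: 'Hypertension' is an accidental copy of 'Siblings'.
toy = pd.DataFrame({
    "Siblings": [2, 0, 1, 3],
    "Hypertension": [2, 0, 1, 3],
    "Diabetes": [0, 0, 1, 0],
})

def find_identical_columns(df):
    """Return (first, duplicate) name pairs for columns with identical values."""
    pairs = []
    cols = list(df.columns)
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            # Series.equals compares values and dtype, not the column name.
            if df[left].equals(df[right]):
                pairs.append((left, right))
    return pairs

identical_pairs = find_identical_columns(toy)
print(identical_pairs)
```

An exact-equality check only catches verbatim copies; columns that were pasted with an offset or partially overwritten would need a correlation-based check instead.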

In order to get more accurate results, we need to import and combine the individual files, and label them properly.



Outline:

Import the following files:
    1) MSF_HealthOutcome_450
    2) MSF_Mother_lifestyle_450
    3) MSF_Mother_Social_450
    4) MSF_Mother_stress_450
    5) MSF_Physical&health_Fetaures_450   [sic]
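The combining step in this outline could look roughly like the sketch below. It assumes every per-topic frame ends up indexed by Mother_UID; the two small frames, the `Stress_Score` column, and the `__` prefix convention are hypothetical stand-ins, not the real files.

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-topic files, both indexed by Mother_UID.
outcome = pd.DataFrame({"Mother_UID": [1, 2, 3], "PreTerm": [0, 0, 1]}).set_index("Mother_UID")
stress = pd.DataFrame({"Mother_UID": [1, 2, 3], "Stress_Score": [4, 2, 5]}).set_index("Mother_UID")

# Prefix each frame's columns so the source file stays visible after combining.
frames = {"outcome": outcome, "stress": stress}
labelled = [df.add_prefix(f"{name}__") for name, df in frames.items()]

# Align on the shared index; join="inner" keeps only mothers present in every file.
combined = pd.concat(labelled, axis=1, join="inner")
print(combined)
```

Prefixing the columns is a design choice rather than a requirement: it makes a mislabelled column easier to trace back to the file it came from.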
    
    

    
    

In [4]:
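# All dataset files are stored in a directory called 'MSF Dataset_450', located next to this notebook.
# Note: the *_df names below hold file paths at first; they are rebound to DataFrames when each file is loaded.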

cwd = os.getcwd()

health_outcome_df = os.path.join(cwd, 'MSF Dataset_450', 'MSF_HealthOutcome_450.xlsx')

social_df = os.path.join(cwd, 'MSF Dataset_450', 'MSF_Dataset_Social_450_modified_nam.xlsx')
stress_df = os.path.join(cwd, 'MSF Dataset_450', 'MSF_Mother_stress_450_modified_nam.xlsx')
physhealth_df = os.path.join(cwd, 'MSF Dataset_450', 'MSF_Physical&health_Fetaures_450.xlsx')

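# Load the health outcomes and keep the result (previously the DataFrame was displayed but never stored).
health_outcome_df = pd.read_excel(health_outcome_df, skiprows = 5, index_col = 'Mother_UID')
health_outcome_df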



Mother_UID,PreTerm,Full Term,Weight_Baby_Kg,Hospital Stay in days,NICU Stay,Jaundice,C-section,Vaginal Delivery,Hours_In_Labour,Induce_Pain
1,0,1,2.566,5,0,0,0,1,18,0
2,0,1,3.1,5,0,0,0,1,20,0
3,0,1,2.15,7,0,0,1,0,5,0
4,0,1,2.5,5,0,0,0,1,10,0
5,0,1,2.67,5,0,0,0,1,20,0
6,0,1,2.56,7,0,0,1,0,10,0
7,1,0,2.1,7,1,1,1,0,18,0
8,0,1,2.5,5,0,1,0,1,20,0
9,0,1,2.94,7,0,0,1,0,12,0
10,0,1,2.9,7,0,0,1,0,15,0


In [5]:
# Notes on cleaning lifestyle_df

# We want our columns to be easily machine-readable. The current hierarchical organization of the columns is hard for algorithms to parse.
# I heavily modified the file in Excel and saved it as "MSF_Mother_lifestyle_450_modified_nam".

# Nested values are flattened.
# Daily diet and sleep patterns need to be parsed further to avoid repetition in column names.
# Side note: the surveyed sleep categories are poorly defined: someone who sleeps between 7 and 8 hours falls between the cracks of "More than 8 hours" and "Less than 7 hours".
lifestyle_path = os.path.join(cwd, 'MSF Dataset_450', 'MSF_Mother_lifestyle_450_modified_nam.xlsx')

lifestyle_df = pd.read_excel(lifestyle_path)
# Row 2 holds the flattened column names; label its first cell so it can serve as the index name.
lifestyle_df.iloc[2,0] = 'Mother_UID'
lifestyle_df.columns = lifestyle_df.iloc[2]
# Drop rows 0-2: everything up to and including the row just promoted to column names.
lifestyle_df = lifestyle_df.drop(range(0,3))
lifestyle_df = lifestyle_df.set_index("Mother_UID")
lifestyle_df.index = lifestyle_df.index.astype(int)
lifestyle_df

Mother_UID,Exercise_a,Exercise_b,Exercise_c,Laptop_a,Laptop_b,Laptop_c,Outside Food_a,Outside Food_b,Outside Food_c,Tea/Coffee_a,Tea/Coffee_b,Tea/Coffee_c,Cigratte_a,Cigratte_b,Cigratte_c,Alcohol_a,Alcohol_b,Alcohol_c,NOISE/AIR pollution_a,NOISE/AIR pollution_b,NOISE/AIR pollution_c,Health Concious_a,Health Concious_b,Health Concious_c,Diet_Grains_veg_pulses_rice_salad_a,Diet_More_pulses_and_rice_a,Diet_dairy_prods_a,Diet_snacks_high_carbs_a,Diet_non_vegetarian_a,Diet_fruits_salads_a,Diet_Grains_veg_pulses_rice_salad_b,Diet_More_pulses_and_rice_b,Diet_dairy_prods_b,Diet_snacks_high_carbs_b,Diet_non_vegetarian_b,Diet_fruits_salads_b,Diet_Grains_veg_pulses_rice_salad_c,Diet_More_pulses_and_rice_c,Diet_dairy_prods_c,Diet_snacks_high_carbs_c,Diet_non_vegetarian_c,Diet_fruits_salads_c,Sleep_early_riser_a,Sleep_night_owl_a,sleep_mr_thn_8_hrs_a,Sleep_lss_thn_7_hrs_a,Sleep_early_riser_b,Sleep_night_owl_b,Sleep_mr_thn_8_hrs_b,Sleep_lss_thn_7_hrs_b,Sleep_early_riser_c,Sleep_night_owl_c,Sleep_mr_thn_8_hrs_c,Sleep_lss_thn_7_hrs_c,sunlight_a,sunlight_b,sunlight_c,Travel_Time_a,Travel_Time_b,Travel_Time_c,Travel_Mode_a,Travel_Mode_b,Travel_Mode_c,Works_As_b,Works_As_c,Contraceptive_Time_,Contraceptive_Type_before_preg,Intercourse_,Cravings_a,Cravings_b,Cravings_c
1,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,3,3,3,2,2,2,3,3,3,3,1,1,6,1,2.0,2.0,2.0
2,4,4,4,3,3,3,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,0,3,3,3,2,2,2,3,3,3,3,1,1,6,1,2.0,2.0,2.0
3,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,4,4,4,3,1,1,6,1,2.0,2.0,2.0
4,4,4,4,3,3,3,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,0,2,2,2,2,2,2,3,3,3,1,1,1,6,1,2.0,2.0,2.0
5,4,4,4,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,4,4,4,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,4,4,4,2,2,2,2,2,2,1,1,1,6,1,2.0,1.0,3.0
6,4,4,4,3,3,3,2,2,2,3,3,3,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,2,2,2,1,1,4,2,1,2.0,2.0,2.0
7,3,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,2,2,2,1,1,1,6,1,1.0,2.0,2.0
8,4,4,4,3,3,3,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,0,0,1,0,0,0,3,3,3,2,2,2,4,4,4,1,1,1,6,1,2.0,3.0,2.0
9,3,3,3,3,3,3,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,4,4,4,2,2,2,3,3,3,3,1,1,6,1,2.0,2.0,2.0
10,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,3,3,3,3,3,3,4,1,1,6,1,2.0,2.0,2.0
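The flattening of nested column headers was done by hand in Excel here, but it can also be done in pandas. The sketch below assumes a two-level header in which the top level is the question and the second level is the sub-category (a/b/c); the toy frame is hypothetical, and the real file may need extra handling (for example merged cells) that is not shown.

```python
import pandas as pd

# Hypothetical two-level header: question on top, sub-category (a/b/c) underneath.
columns = pd.MultiIndex.from_tuples([
    ("Exercise", "a"), ("Exercise", "b"), ("Laptop", "a"), ("Laptop", "b"),
])
nested = pd.DataFrame([[3, 3, 2, 2], [4, 4, 3, 3]], index=[1, 2], columns=columns)
nested.index.name = "Mother_UID"

# Join the two header levels into one flat, machine-readable name per column.
flat = nested.copy()
flat.columns = [f"{top}_{sub}" for top, sub in nested.columns]
print(flat.columns.tolist())
```

Doing this step in code rather than in Excel would keep the original files untouched and make the renaming reproducible.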


In [6]:
lifestyle_freqs_df = lifestyle_df.describe()

In [7]:
lifestyle_freqs_df


2,Exercise_a,Exercise_b,Exercise_c,Laptop_a,Laptop_b,Laptop_c,Outside Food_a,Outside Food_b,Outside Food_c,Tea/Coffee_a,Tea/Coffee_b,Tea/Coffee_c,Cigratte_a,Cigratte_b,Cigratte_c,Alcohol_a,Alcohol_b,Alcohol_c,NOISE/AIR pollution_a,NOISE/AIR pollution_b,NOISE/AIR pollution_c,Health Concious_a,Health Concious_b,Health Concious_c,Diet_Grains_veg_pulses_rice_salad_a,Diet_More_pulses_and_rice_a,Diet_dairy_prods_a,Diet_snacks_high_carbs_a,Diet_non_vegetarian_a,Diet_fruits_salads_a,Diet_Grains_veg_pulses_rice_salad_b,Diet_More_pulses_and_rice_b,Diet_dairy_prods_b,Diet_snacks_high_carbs_b,Diet_non_vegetarian_b,Diet_fruits_salads_b,Diet_Grains_veg_pulses_rice_salad_c,Diet_More_pulses_and_rice_c,Diet_dairy_prods_c,Diet_snacks_high_carbs_c,Diet_non_vegetarian_c,Diet_fruits_salads_c,Sleep_early_riser_a,Sleep_night_owl_a,sleep_mr_thn_8_hrs_a,Sleep_lss_thn_7_hrs_a,Sleep_early_riser_b,Sleep_night_owl_b,Sleep_mr_thn_8_hrs_b,Sleep_lss_thn_7_hrs_b,Sleep_early_riser_c,Sleep_night_owl_c,Sleep_mr_thn_8_hrs_c,Sleep_lss_thn_7_hrs_c,sunlight_a,sunlight_b,sunlight_c,Travel_Time_a,Travel_Time_b,Travel_Time_c,Travel_Mode_a,Travel_Mode_b,Travel_Mode_c,Works_As_b,Works_As_c,Contraceptive_Time_,Contraceptive_Type_before_preg,Intercourse_,Cravings_a,Cravings_b,Cravings_c
count,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,200,200,200
unique,4,4,4,4,4,4,4,4,4,3,3,3,2,2,2,2,2,2,3,3,3,4,4,4,1,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,5,5,5,5,5,5,7,7,7,5,5,6,6,3,3,3,3
top,4,4,4,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,4,4,4,2,2,1,1,1,3,1,1,1,6,1,2,2,2
freq,198,196,177,176,176,193,225,225,221,234,233,217,438,442,448,436,438,447,228,229,229,352,353,352,450,449,444,441,395,429,450,449,444,441,395,429,450,449,444,442,395,429,335,328,264,423,336,329,265,421,324,293,311,422,240,239,241,219,217,221,196,196,204,237,292,367,365,280,159,176,160


In [8]:
# We have NaNs: the three Cravings columns have only 200 non-null values out of 450. We'll set these to -1 for now.

lifestyle_df.fillna(-1, inplace=True)
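Before filling, it can help to record which columns actually contain missing values. A minimal sketch on a hypothetical toy frame (the values are made up; only the column naming style follows the lifestyle file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in one column.
toy = pd.DataFrame({"Exercise_a": [3, 4, 3], "Cravings_a": [2.0, np.nan, np.nan]})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Fill with a sentinel that cannot be mistaken for a real response code.
filled = toy.fillna(-1)
```

The sentinel -1 only works as long as no real response uses that code, which is an assumption worth checking against the survey's codebook.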

In [9]:
lifestyle_df_objs = lifestyle_df.astype(str)
lifestyle_freqs_df = lifestyle_df_objs.describe()

In [10]:
lifestyle_freqs_df

2,Exercise_a,Exercise_b,Exercise_c,Laptop_a,Laptop_b,Laptop_c,Outside Food_a,Outside Food_b,Outside Food_c,Tea/Coffee_a,Tea/Coffee_b,Tea/Coffee_c,Cigratte_a,Cigratte_b,Cigratte_c,Alcohol_a,Alcohol_b,Alcohol_c,NOISE/AIR pollution_a,NOISE/AIR pollution_b,NOISE/AIR pollution_c,Health Concious_a,Health Concious_b,Health Concious_c,Diet_Grains_veg_pulses_rice_salad_a,Diet_More_pulses_and_rice_a,Diet_dairy_prods_a,Diet_snacks_high_carbs_a,Diet_non_vegetarian_a,Diet_fruits_salads_a,Diet_Grains_veg_pulses_rice_salad_b,Diet_More_pulses_and_rice_b,Diet_dairy_prods_b,Diet_snacks_high_carbs_b,Diet_non_vegetarian_b,Diet_fruits_salads_b,Diet_Grains_veg_pulses_rice_salad_c,Diet_More_pulses_and_rice_c,Diet_dairy_prods_c,Diet_snacks_high_carbs_c,Diet_non_vegetarian_c,Diet_fruits_salads_c,Sleep_early_riser_a,Sleep_night_owl_a,sleep_mr_thn_8_hrs_a,Sleep_lss_thn_7_hrs_a,Sleep_early_riser_b,Sleep_night_owl_b,Sleep_mr_thn_8_hrs_b,Sleep_lss_thn_7_hrs_b,Sleep_early_riser_c,Sleep_night_owl_c,Sleep_mr_thn_8_hrs_c,Sleep_lss_thn_7_hrs_c,sunlight_a,sunlight_b,sunlight_c,Travel_Time_a,Travel_Time_b,Travel_Time_c,Travel_Mode_a,Travel_Mode_b,Travel_Mode_c,Works_As_b,Works_As_c,Contraceptive_Time_,Contraceptive_Type_before_preg,Intercourse_,Cravings_a,Cravings_b,Cravings_c
count,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450,450
unique,4,4,4,4,4,4,4,4,4,3,3,3,2,2,2,2,2,2,3,3,3,4,4,4,1,2,2,2,2,2,1,2,2,2,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,5,5,5,5,5,5,7,7,7,5,5,6,6,3,4,4,4
top,4,4,4,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,4,4,4,2,2,1,1,1,3,1,1,1,6,1,-1,-1,-1
freq,198,196,177,176,176,193,225,225,221,234,233,217,438,442,448,436,438,447,228,229,229,352,353,352,450,449,444,441,395,429,450,449,444,441,395,429,450,449,444,442,395,429,335,328,264,423,336,329,265,421,324,293,311,422,240,239,241,219,217,221,196,196,204,237,292,367,365,280,250,250,250


Observations about lifestyle:

Almost all participants in the study report that they never smoked or drank: in every Cigratte and Alcohol column, the most common response (code 1) accounts for 436 to 448 of the 450 answers.
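An observation like this can be checked per column by computing the share of the most frequent response. The sketch below uses a hypothetical toy frame, and it assumes that code 1 means "no":

```python
import pandas as pd

# Hypothetical toy responses; code 1 is assumed to mean "no".
toy = pd.DataFrame({
    "Cigratte_a": [1, 1, 1, 2],
    "Alcohol_a": [1, 1, 1, 1],
})

# For each column: the most frequent value and the fraction of rows it covers.
summary = pd.DataFrame({
    "top": toy.mode().iloc[0],
    "share": toy.apply(lambda col: col.value_counts(normalize=True).iloc[0]),
})
print(summary)
```

Columns where a single answer covers nearly every row carry very little signal, so they are candidates for dropping before modelling.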

In [11]:
lifestyle_df.astype(int)

Mother_UID,Exercise_a,Exercise_b,Exercise_c,Laptop_a,Laptop_b,Laptop_c,Outside Food_a,Outside Food_b,Outside Food_c,Tea/Coffee_a,Tea/Coffee_b,Tea/Coffee_c,Cigratte_a,Cigratte_b,Cigratte_c,Alcohol_a,Alcohol_b,Alcohol_c,NOISE/AIR pollution_a,NOISE/AIR pollution_b,NOISE/AIR pollution_c,Health Concious_a,Health Concious_b,Health Concious_c,Diet_Grains_veg_pulses_rice_salad_a,Diet_More_pulses_and_rice_a,Diet_dairy_prods_a,Diet_snacks_high_carbs_a,Diet_non_vegetarian_a,Diet_fruits_salads_a,Diet_Grains_veg_pulses_rice_salad_b,Diet_More_pulses_and_rice_b,Diet_dairy_prods_b,Diet_snacks_high_carbs_b,Diet_non_vegetarian_b,Diet_fruits_salads_b,Diet_Grains_veg_pulses_rice_salad_c,Diet_More_pulses_and_rice_c,Diet_dairy_prods_c,Diet_snacks_high_carbs_c,Diet_non_vegetarian_c,Diet_fruits_salads_c,Sleep_early_riser_a,Sleep_night_owl_a,sleep_mr_thn_8_hrs_a,Sleep_lss_thn_7_hrs_a,Sleep_early_riser_b,Sleep_night_owl_b,Sleep_mr_thn_8_hrs_b,Sleep_lss_thn_7_hrs_b,Sleep_early_riser_c,Sleep_night_owl_c,Sleep_mr_thn_8_hrs_c,Sleep_lss_thn_7_hrs_c,sunlight_a,sunlight_b,sunlight_c,Travel_Time_a,Travel_Time_b,Travel_Time_c,Travel_Mode_a,Travel_Mode_b,Travel_Mode_c,Works_As_b,Works_As_c,Contraceptive_Time_,Contraceptive_Type_before_preg,Intercourse_,Cravings_a,Cravings_b,Cravings_c
1,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,3,3,3,2,2,2,3,3,3,3,1,1,6,1,2,2,2
2,4,4,4,3,3,3,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,0,3,3,3,2,2,2,3,3,3,3,1,1,6,1,2,2,2
3,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,4,4,4,3,1,1,6,1,2,2,2
4,4,4,4,3,3,3,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,0,2,2,2,2,2,2,3,3,3,1,1,1,6,1,2,2,2
5,4,4,4,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,4,4,4,1,1,1,1,0,0,1,1,1,1,0,0,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,4,4,4,2,2,2,2,2,2,1,1,1,6,1,2,1,3
6,4,4,4,3,3,3,2,2,2,3,3,3,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,2,2,2,1,1,4,2,1,2,2,2
7,3,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,2,2,2,2,2,2,1,1,1,6,1,1,2,2
8,4,4,4,3,3,3,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,0,0,1,0,0,0,3,3,3,2,2,2,4,4,4,1,1,1,6,1,2,3,2
9,3,3,3,3,3,3,2,2,2,1,1,1,1,1,1,1,1,1,2,2,2,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,4,4,4,2,2,2,3,3,3,3,1,1,6,1,2,2,2
10,3,3,3,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,3,3,3,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0,4,4,4,3,3,3,3,3,3,4,1,1,6,1,2,2,2


In [12]:
# Loading and cleaning social_df

# Like the lifestyle dataframe, the social dataframe contains nested categories. I've fixed these in the Excel file to expedite cleaning here.

social_df = pd.read_excel(social_df, index_col = "Mother_UID", skiprows=range(1,6))

In [20]:
#loading and cleaning stress_df

stress_df = pd.read_excel(stress_df, skiprows = [2,3,4,5], header=1, index_col = 'Mother_UID')
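The `skiprows` and `header` arguments used in the last two cells interact in a way that is easy to get wrong, so here is a small sketch of how they combine. It uses `pd.read_csv` on an in-memory string as a stand-in so it runs without a spreadsheet; `read_excel` accepts the same three parameters, and the sheet layout below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sheet layout: a title row, the real header, two junk rows, then data.
raw = "\n".join([
    "Survey,export",       # row 0: title
    "Mother_UID,Score",    # row 1: real header
    "units,points",        # row 2: junk
    "a,b",                 # row 3: junk
    "1,10",                # row 4: data
    "2,20",                # row 5: data
])

# skiprows drops rows 2 and 3 by their original position; header=1 then picks
# the second remaining row as the column names, discarding the title above it.
df = pd.read_csv(io.StringIO(raw), skiprows=[2, 3], header=1, index_col="Mother_UID")
print(df)
```

With the default `header=0`, passing `skiprows=range(1, 6)` follows the same logic: row 0 is kept as the header and the five rows below it are dropped.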

In [21]:
# Left off here ---------------------------------------------------------------------------------------------------
# Loading and cleaning physhealth_df
# As before, the organization here is messy. I'm going to rework it in Excel and then load it here.

physhealth_df = pd.read_excel(physhealth_df)
# rename() needs columns= to target column labels, and its result must be assigned to take effect.
physhealth_df = physhealth_df.rename(columns={'Unnamed: 0': 'Mother_UID'})
physhealth_df

,Mother_UID,Age_Of_Mother,weight_before_preg,wt_before_delivery,Height(cm),BMI,Hemoglobin,PCOS,Age_Father,Fertility_Treatment,Miscarriage History,Menstrual_Cycle,Unnamed: 12,Time_Taken_To_Concieve,Issues_Pregnancy,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,,1.0,2.0,3,4.0,5.0,6.0,7.0,8.0,9.0,10,11,12,13,14,15,16,17,18,19,20,21,22,23
1,,1.0,2.0,3,4.0,5.0,6.0,7.0,8.0,9.0,10,11,12,13,14,15,16,17,18,19,20,21,22,23
2,,,,Missing_Values,,,,,,,Missing_Values,10,,23,25/30,,,,,,,,,
3,,,,,,,,,,,,12,,28,thyroid,HyperTension,Diabetes,Gastric Issue,Cold/viral,LowAmnotic,HighAmniotic,,,
4,Mother_UID,,,,,,,,,,,a,b,a,a,b,c,d,e,f,g,h,IVF,no of births(single/Twins)
5,1,29.0,59.0,60,156.0,25.0,12.5,0.0,31.0,0.0,0,3,3,1,0,0,0,0,0,0,0,1,0,1
6,2,24.0,54.0,56,145.0,26.0,12.5,0.0,28.0,0.0,0,1,1,1,0,0,0,0,0,0,0,1,0,1
7,3,28.0,62.0,65,151.0,28.0,11.5,0.0,31.0,0.0,0,3,3,1,0,0,0,0,0,0,0,1,0,1
8,4,25.0,49.0,52,151.0,22.0,11.5,0.0,30.0,0.0,0,2,2,1,0,0,0,0,0,0,0,1,0,1
9,5,21.0,39.0,42,151.0,18.0,10.1,0.0,25.0,0.0,0,2,2,1,0,0,0,0,0,0,0,1,0,1
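The table above still carries several junk rows above the data. If the Excel rework is skipped, the same cleanup used for lifestyle_df could be done in pandas by locating the label row and promoting it to the header. The sketch below is a simplified, hypothetical version: in the real file the labels are spread over several rows, which this toy frame does not reproduce.

```python
import pandas as pd

# Hypothetical raw sheet: junk rows above the row that holds the real labels.
raw = pd.DataFrame([
    [None, 1, 2],
    [None, "Missing_Values", None],
    ["Mother_UID", "Age_Of_Mother", "BMI"],
    [1, 29, 25],
    [2, 24, 26],
])

# Locate the label row by searching the first column for the index name.
header_pos = raw.index[raw.iloc[:, 0] == "Mother_UID"][0]

# Keep only the rows below the label row and promote that row to column names.
clean = raw.iloc[header_pos + 1:].copy()
clean.columns = raw.iloc[header_pos].tolist()
clean = clean.set_index("Mother_UID")
clean.index = clean.index.astype(int)
clean = clean.astype(int)
print(clean)
```

Searching for the label instead of hard-coding a row number means the cell keeps working if the number of junk rows changes.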


In [None]:
# Easy EDA: check this out? https://www.youtube.com/watch?v=fRRYpyne2po