## Exploratory Data Analysis



#### Birthweight data

Variables in the data are as follows:


| Name        | Variable                                      | Data type |
|-------------|-----------------------------------------------|-----------|
| ID          | Baby number                                   |           |
| Length      | Length of baby (cm)                           | Scale     |
| Birthweight | Weight of baby (kg)                           | Scale     |
| Headcirc    | Head circumference                            | Scale     |
| Gestation   | Gestation (weeks)                             | Scale     |
| smoker      | Mother smokes: 1 = smoker, 0 = non-smoker     | Binary    |
| mage        | Maternal age                                  | Scale     |
| mnocig      | Number of cigarettes smoked per day by mother | Scale     |
| mheight     | Mother's height (cm)                          | Scale     |
| mppwt       | Mother's pre-pregnancy weight (kg)            | Scale     |
| fage        | Father's age                                  | Scale     |
| fedyrs      | Father's years in education                   | Scale     |
| fnocig      | Number of cigarettes smoked per day by father | Scale     |
| fheight     | Father's height (cm)                          | Scale     |
| lowbwt      | Low birth weight: 0 = no, 1 = yes             | Binary    |
| mage35      | Mother over 35: 0 = no, 1 = yes               | Binary    |

 
Birthweight is the dependent variable. Let's load the data and take a first look at it.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('../data/Birthweight_reduced_kg_R.csv')
df.head()

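|   | ID   | Length | Birthweight | Headcirc | Gestation | smoker | mage | mnocig | mheight | mppwt | fage | fedyrs | fnocig | fheight | lowbwt | mage35 |
|---|------|--------|-------------|----------|-----------|--------|------|--------|---------|-------|------|--------|--------|---------|--------|--------|
| 0 | 1360 | 56     | 4.55        | 34       | 44        | 0      | 20   | 0      | 162     | 57    | 23   | 10     | 35     | 179     | 0      | 0      |
| 1 | 1016 | 53     | 4.32        | 36       | 40        | 0      | 19   | 0      | 171     | 62    | 19   | 12     | 0      | 183     | 0      | 0      |
| 2 | 462  | 58     | 4.1         | 39       | 41        | 0      | 35   | 0      | 172     | 58    | 31   | 16     | 25     | 185     | 0      | 1      |
| 3 | 1187 | 53     | 4.07        | 38       | 44        | 0      | 20   | 0      | 174     | 68    | 26   | 14     | 25     | 189     | 0      | 0      |
| 4 | 553  | 54     | 3.94        | 37       | 42        | 0      | 24   | 0      | 175     | 66    | 30   | 12     | 0      | 184     | 0      | 0      |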


Let's check the mean and median maternal age.

In [4]:
mu = df['mage'].mean()
med = df.mage.median()

print(f"Mean maternal age: {mu}    Median maternal age: {med}")

Mean maternal age: 25.547619047619047    Median maternal age: 24.0
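A mean above the median, as in the output above, is often a sign that a few larger values pull the mean upward. As a minimal sketch, the snippet below uses only the five `mage` values visible in the `df.head()` output (an assumption made so the example runs without the CSV file), so its numbers differ from the full-data result. The names `mage_sample`, `sample_mean`, `sample_median`, and `gap` are illustrative.

```python
import pandas as pd

# First five maternal ages from the df.head() output (not the full dataset).
mage_sample = pd.Series([20, 19, 35, 20, 24], name="mage")

sample_mean = mage_sample.mean()
sample_median = mage_sample.median()

# A positive gap means the mean sits above the median; here the single
# 35-year-old mother is what pulls the mean up.
gap = sample_mean - sample_median

# Rounding in the format string keeps the printed values readable.
print(f"mean={sample_mean:.2f}  median={sample_median:.2f}  gap={gap:.2f}")
```

Formatting with `:.2f` also avoids the long decimal expansion printed by the original cell.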


Let's check the mean values of multiple columns:

- length in cm
- birthweight in kg

In [6]:
df[['Length','Birthweight']].mean()

Length         51.333333
Birthweight     3.312857
dtype: float64

Let's check the standard deviation of multiple columns:

- length in cm
- birthweight in kg

In [9]:
df[['Length','Birthweight']].std()

Length         2.935624
Birthweight    0.603895
dtype: float64
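One detail worth knowing when comparing these numbers with other tools: pandas `std()` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the difference on the five `Birthweight` values from the `df.head()` output only (an assumption so the example is self-contained); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five birthweights (kg) from the df.head() output, not the full dataset.
bw = pd.Series([4.55, 4.32, 4.10, 4.07, 3.94], name="Birthweight")

sample_std = bw.std()                   # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(bw.to_numpy())  # NumPy default: ddof=0 (divide by n)

# The same sample standard deviation computed by hand.
deviations = bw - bw.mean()
manual_std = np.sqrt((deviations ** 2).sum() / (len(bw) - 1))

print(f"pandas (ddof=1): {sample_std:.4f}")
print(f"NumPy  (ddof=0): {population_std:.4f}")
print(f"manual (n - 1):  {manual_std:.4f}")
```

With only 42 observations in the full dataset, the two conventions differ by a small but visible amount, so it helps to state which one is reported.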

Let's check the skewness of multiple columns:

- length in cm
- birthweight in kg

In [9]:
df[['Length','Birthweight']].skew()

Length        -0.247898
Birthweight   -0.055529
dtype: float64
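Negative values indicate a longer left tail. To make clear what `skew()` reports, here is a minimal sketch of the bias-adjusted sample skewness, which is the estimator pandas uses by default, checked against pandas on the five `Length` values from the `df.head()` output. The helper `adjusted_skewness` is a hypothetical name introduced for illustration, and the five-row sample is an assumption so the code runs without the CSV.

```python
import numpy as np
import pandas as pd

def adjusted_skewness(values):
    """Bias-adjusted sample skewness: sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)  # second central moment
    m3 = np.mean(d ** 3)  # third central moment
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

# First five baby lengths (cm) from the df.head() output, not the full dataset.
length_sample = pd.Series([56, 53, 58, 53, 54], name="Length")

skew_manual = adjusted_skewness(length_sample)
skew_pandas = length_sample.skew()
print(f"manual: {skew_manual:.6f}  pandas: {skew_pandas:.6f}")
```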

Let's check the excess kurtosis (0 for a normal distribution) of multiple columns:

- length in cm
- birthweight in kg

In [15]:
df[['Length','Birthweight']].kurtosis()

#plt.hist(df['Birthweight'], bins=25,density=True);

Length         1.062173
Birthweight    0.038160
dtype: float64
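The value reported by `kurtosis()` is the excess kurtosis (Fisher's definition) with a small-sample bias adjustment. As a sketch under the same assumption as before (only the five `Birthweight` values from the `df.head()` output are used), the hypothetical helper `adjusted_excess_kurtosis` below reproduces the pandas result by hand.

```python
import numpy as np
import pandas as pd

def adjusted_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis: (n-1)/((n-2)(n-3)) * ((n+1)*g2 + 6)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)       # second central moment
    m4 = np.mean(d ** 4)       # fourth central moment
    g2 = m4 / m2 ** 2 - 3.0    # unadjusted excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

# First five birthweights (kg) from the df.head() output, not the full dataset.
bw_sample = pd.Series([4.55, 4.32, 4.10, 4.07, 3.94], name="Birthweight")

kurt_manual = adjusted_excess_kurtosis(bw_sample)
kurt_pandas = bw_sample.kurtosis()
print(f"manual: {kurt_manual:.6f}  pandas: {kurt_pandas:.6f}")
```

The formula needs at least four observations, since the denominator contains `n - 3`.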

Let's get the main summary statistics for birthweight in a single call:

In [12]:
df['Birthweight'].describe()

count    42.000000
mean      3.312857
std       0.603895
min       1.920000
25%       2.940000
50%       3.295000
75%       3.647500
max       4.570000
Name: Birthweight, dtype: float64
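`describe()` reports the count, mean, standard deviation, and quartiles, but not skewness or kurtosis. One possible way to collect those alongside the mean and standard deviation is `agg` with a list of statistic names. The sketch below uses only the five rows visible in the `df.head()` output (an assumption so it runs without the CSV), and `sample` and `summary` are illustrative names.

```python
import pandas as pd

# First five rows of two columns from the df.head() output, not the full dataset.
sample = pd.DataFrame({
    "Length": [56, 53, 58, 53, 54],
    "Birthweight": [4.55, 4.32, 4.10, 4.07, 3.94],
})

# Each string names a DataFrame method; the result has one row per statistic.
summary = sample.agg(["mean", "std", "skew", "kurt"])
print(summary.round(3))
```

On the full data, the same call would be `df[['Length','Birthweight']].agg([...])`.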

Let's check the correlation between birth length and birthweight, then add head circumference and maternal age.

In [26]:
df[['Length','Birthweight']].corr()


|             | Length   | Birthweight |
|-------------|----------|-------------|
| Length      | 1.0      | 0.726833    |
| Birthweight | 0.726833 | 1.0         |


In [27]:
df[['Length','Birthweight','Headcirc']].corr()

|             | Length   | Birthweight | Headcirc |
|-------------|----------|-------------|----------|
| Length      | 1.0      | 0.726833    | 0.563172 |
| Birthweight | 0.726833 | 1.0         | 0.684616 |
| Headcirc    | 0.563172 | 0.684616    | 1.0      |


In [28]:
df[['Length','Birthweight','Headcirc','mage']].corr()

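|             | Length   | Birthweight | Headcirc | mage     |
|-------------|----------|-------------|----------|----------|
| Length      | 1.0      | 0.726833    | 0.563172 | 0.075268 |
| Birthweight | 0.726833 | 1.0         | 0.684616 | 0.000173 |
| Headcirc    | 0.563172 | 0.684616    | 1.0      | 0.145842 |
| mage        | 0.075268 | 0.000173    | 0.145842 | 1.0      |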
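By default `corr()` returns Pearson correlation coefficients, which measure linear association only; a value near zero, such as the one between `mage` and `Birthweight`, rules out a linear trend but not other kinds of relationship. As a minimal sketch of what is being computed, the hypothetical helper `pearson_r` below works out the coefficient by hand and compares it with pandas, using only the five rows from the `df.head()` output (an assumption so the code runs without the CSV; the resulting value differs from the full-data 0.726833).

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: sum of cross-deviations over the product of norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# First five rows of two columns from the df.head() output, not the full dataset.
sample = pd.DataFrame({
    "Length": [56, 53, 58, 53, 54],
    "Birthweight": [4.55, 4.32, 4.10, 4.07, 3.94],
})

r_manual = pearson_r(sample["Length"], sample["Birthweight"])
r_pandas = sample["Length"].corr(sample["Birthweight"])
print(f"manual: {r_manual:.6f}  pandas: {r_pandas:.6f}")
```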
