## Exploratory Data Analysis



#### Birthweight data

Variables in the data are as follows:


| Name        | Variable                                      | Data type |
|-------------|-----------------------------------------------|-----------|
| ID          | Baby number                                   |           |
| Length      | Length of baby (cm)                           | Scale     |
| Birthweight | Weight of baby (kg)                           | Scale     |
| Headcirc    | Head circumference                            | Scale     |
| Gestation   | Gestation (weeks)                             | Scale     |
| smoker      | Mother smokes, 1 = smoker and 0 = non-smoker  | Binary    |
| mage        | Maternal age                                  | Scale     |
| mnocig      | Number of cigarettes smoked per day by mother | Scale     |
| mheight     | Mother's height (cm)                          | Scale     |
| mppwt       | Mother's pre-pregnancy weight (kg)            | Scale     |
| fage        | Father's age                                  | Scale     |
| fedyrs      | Father's years in education                   | Scale     |
| fnocig      | Number of cigarettes smoked per day by father | Scale     |
| fheight     | Father's height (cm)                          | Scale     |
| lowbwt      | Low birth weight, 0 = No and 1 = Yes          | Binary    |
| mage35      | Mother over 35, 0 = No and 1 = Yes            | Binary    |

 
Birthweight is the dependent variable. Let's first load the data and inspect the first few rows.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('../data/Birthweight_reduced_kg_R.csv')
# df.info()

df.head()

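|   | ID   | Length | Birthweight | Headcirc | Gestation | smoker | mage | mnocig | mheight | mppwt | fage | fedyrs | fnocig | fheight | lowbwt | mage35 |
|---|------|--------|-------------|----------|-----------|--------|------|--------|---------|-------|------|--------|--------|---------|--------|--------|
| 0 | 1360 | 56     | 4.55        | 34       | 44        | 0      | 20   | 0      | 162     | 57    | 23   | 10     | 35     | 179     | 0      | 0      |
| 1 | 1016 | 53     | 4.32        | 36       | 40        | 0      | 19   | 0      | 171     | 62    | 19   | 12     | 0      | 183     | 0      | 0      |
| 2 | 462  | 58     | 4.1         | 39       | 41        | 0      | 35   | 0      | 172     | 58    | 31   | 16     | 25     | 185     | 0      | 1      |
| 3 | 1187 | 53     | 4.07        | 38       | 44        | 0      | 20   | 0      | 174     | 68    | 26   | 14     | 25     | 189     | 0      | 0      |
| 4 | 553  | 54     | 3.94        | 37       | 42        | 0      | 24   | 0      | 175     | 66    | 30   | 12     | 0      | 184     | 0      | 0      |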


Let's check the mean and median maternal age.

In [6]:
mu = df['mage'].mean()
med = df.mage.median()

print(f"Mean maternal age: {mu:.3f}    Median maternal age: {med}")

Mean maternal age: 25.548    Median maternal age: 24.0
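
The mean (25.548) sits above the median (24.0), which usually hints that a few older mothers pull the average up. A minimal sketch of that effect, using only the five `mage` values visible in the `df.head()` output and the standard-library `statistics` module (so it does not reproduce the full-dataset numbers):

```python
import statistics

# mage values of the five rows shown by df.head(); not the full dataset
mage_head = [20, 19, 35, 20, 24]

mean_age = statistics.mean(mage_head)
median_age = statistics.median(mage_head)

# One large value (35) raises the mean but leaves the median unchanged
print(f"mean={mean_age:.1f}  median={median_age}")
```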


In [15]:
print('Mean')
# Show the mean of three columns
print(df[['mage', 'Gestation', 'mheight']].mean())

print('\n\nMedian')
# Show the median of all columns
print(df.median())


Mean
mage          25.547619
Gestation     39.190476
mheight      164.452381
dtype: float64


Median
ID             821.000
Length          52.000
Birthweight      3.295
Headcirc        34.000
Gestation       39.500
smoker           1.000
mage            24.000
mnocig           4.500
mheight        164.500
mppwt           57.000
fage            29.500
fedyrs          14.000
fnocig          18.500
fheight        180.500
lowbwt           0.000
mage35           0.000
dtype: float64


Let's check the mean values of multiple columns

- length in cm
- birthweight in kg

In [6]:
df[['Length','Birthweight']].mean()

Length         51.333333
Birthweight     3.312857
dtype: float64

Let's check the standard deviation of multiple columns

- length in cm
- birthweight in kg

In [21]:
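# Variance (computed but not displayed: a notebook cell only shows its last expression)
df[['Length','Birthweight']].var()

# Standard deviation (this is the result displayed below)
df[['Length','Birthweight']].std()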


Length         2.935624
Birthweight    0.603895
dtype: float64
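
pandas' `var()` and `std()` default to the sample estimators (`ddof=1`, dividing by n - 1), and the standard deviation is the square root of the variance. A small sketch of that relationship with the standard library, using the five `Length` values from the `df.head()` output as assumed sample data rather than the full column:

```python
import math
import statistics

# Length values of the five rows shown by df.head(); not the full dataset
length_head = [56, 53, 58, 53, 54]

sample_var = statistics.variance(length_head)   # divides by n - 1, like pandas' default
pop_var = statistics.pvariance(length_head)     # divides by n, like ddof=0
sample_std = statistics.stdev(length_head)      # square root of the sample variance

print(sample_var, pop_var, sample_std)
print(math.isclose(sample_std, math.sqrt(sample_var)))
```

With only 42 babies in the data, the choice between n and n - 1 is small but visible in the third decimal place.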

Let's check the skewness of multiple columns

- length in cm
- birthweight in kg

In [9]:
df[['Length','Birthweight']].skew()

Length        -0.247898
Birthweight   -0.055529
dtype: float64
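
Both values are slightly negative, meaning a marginally longer left tail. pandas documents `skew()` as an unbiased (bias-corrected) estimator; the sketch below assumes this is the usual adjusted Fisher-Pearson formula. The helper name `sample_skewness` is made up for illustration.

```python
import math

def sample_skewness(values):
    """Bias-corrected sample skewness (adjusted Fisher-Pearson, assumed to match pandas)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # uncorrected skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1    # small-sample correction

symmetric = sample_skewness([1, 2, 3, 4, 5])    # symmetric data, no skew
right_tail = sample_skewness([1, 1, 1, 10])     # one large value, positive skew
print(symmetric, right_tail)
```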

Let's check the excess kurtosis (0 for a normal distribution) of multiple columns

- length in cm
- birthweight in kg

In [24]:
df[['Length','Birthweight']].kurtosis()

#plt.hist(df['Birthweight'], bins=10,density=True);

Length         1.062173
Birthweight    0.038160
dtype: float64
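
Birthweight's excess kurtosis is close to 0, so its tails look roughly normal, while Length has somewhat heavier tails. As a rough illustration of what is being computed, here is a sketch of the bias-corrected excess kurtosis, which is assumed to be the estimator behind `kurtosis()`. The helper name `sample_excess_kurtosis` is hypothetical.

```python
def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis (Fisher definition, normal distribution = 0)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    g2 = m4 / m2 ** 2 - 3                           # uncorrected excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

# Evenly spread values have lighter tails than a normal distribution
flat = sample_excess_kurtosis([1, 2, 3, 4, 5])
print(flat)
```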

Let's view the main summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for every column at once

In [27]:
df.describe()

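|       | ID          | Length    | Birthweight | Headcirc  | Gestation | smoker   | mage      | mnocig    | mheight    | mppwt    | fage      | fedyrs    | fnocig    | fheight  | lowbwt   | mage35   |
|-------|-------------|-----------|-------------|-----------|-----------|----------|-----------|-----------|------------|----------|-----------|-----------|-----------|----------|----------|----------|
| count | 42.0        | 42.0      | 42.0        | 42.0      | 42.0      | 42.0     | 42.0      | 42.0      | 42.0       | 42.0     | 42.0      | 42.0      | 42.0      | 42.0     | 42.0     | 42.0     |
| mean  | 894.071429  | 51.333333 | 3.312857    | 34.595238 | 39.190476 | 0.52381  | 25.547619 | 9.428571  | 164.452381 | 57.5     | 28.904762 | 13.666667 | 17.190476 | 180.5    | 0.142857 | 0.095238 |
| std   | 467.616186  | 2.935624  | 0.603895    | 2.399792  | 2.643336  | 0.505487 | 5.666342  | 12.511737 | 6.504041   | 7.198408 | 6.863866  | 2.160247  | 17.308165 | 6.978189 | 0.354169 | 0.297102 |
| min   | 27.0        | 43.0      | 1.92        | 30.0      | 33.0      | 0.0      | 18.0      | 0.0       | 149.0      | 45.0     | 19.0      | 10.0      | 0.0       | 169.0    | 0.0      | 0.0      |
| 25%   | 537.25      | 50.0      | 2.94        | 33.0      | 38.0      | 0.0      | 20.25     | 0.0       | 161.0      | 52.25    | 23.0      | 12.0      | 0.0       | 175.25   | 0.0      | 0.0      |
| 50%   | 821.0       | 52.0      | 3.295       | 34.0      | 39.5      | 1.0      | 24.0      | 4.5       | 164.5      | 57.0     | 29.5      | 14.0      | 18.5      | 180.5    | 0.0      | 0.0      |
| 75%   | 1269.5      | 53.0      | 3.6475      | 36.0      | 41.0      | 1.0      | 29.0      | 15.75     | 169.5      | 62.0     | 32.0      | 16.0      | 25.0      | 184.75   | 0.0      | 0.0      |
| max   | 1764.0      | 58.0      | 4.57        | 39.0      | 45.0      | 1.0      | 41.0      | 50.0      | 181.0      | 78.0     | 46.0      | 16.0      | 50.0      | 200.0    | 1.0      | 1.0      |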
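
The 25%, 50% and 75% rows are quartiles. Fractional values such as 52.25 arise because a quartile that falls between two sorted observations is interpolated linearly. A minimal sketch of that interpolation, assuming the standard library's `method='inclusive'` behaves like pandas' default linear interpolation; the ages below are made-up toy values, not rows from the dataset.

```python
import statistics

# Made-up toy values for illustration only (already sorted)
toy_ages = [18, 20, 22, 24, 29, 41]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(toy_ages, n=4, method='inclusive')

# q1 lies a quarter of the way from 20 to 22, q2 halfway between 22 and 24
print(q1, q2, q3)
```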


Let's check the correlation between birth length and birthweight

In [26]:
df[['Length','Birthweight']].corr()


|             | Length   | Birthweight |
|-------------|----------|-------------|
| Length      | 1.0      | 0.726833    |
| Birthweight | 0.726833 | 1.0         |


In [27]:
df[['Length','Birthweight','Headcirc']].corr()

|             | Length   | Birthweight | Headcirc |
|-------------|----------|-------------|----------|
| Length      | 1.0      | 0.726833    | 0.563172 |
| Birthweight | 0.726833 | 1.0         | 0.684616 |
| Headcirc    | 0.563172 | 0.684616    | 1.0      |


In [28]:
df[['Length','Birthweight','Headcirc','mage']].corr()

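|             | Length   | Birthweight | Headcirc | mage     |
|-------------|----------|-------------|----------|----------|
| Length      | 1.0      | 0.726833    | 0.563172 | 0.075268 |
| Birthweight | 0.726833 | 1.0         | 0.684616 | 0.000173 |
| Headcirc    | 0.563172 | 0.684616    | 1.0      | 0.145842 |
| mage        | 0.075268 | 0.000173    | 0.145842 | 1.0      |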
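
`corr()` uses the Pearson coefficient by default: the covariance of two columns divided by the product of their standard deviations, which always lies between -1 and 1. A sketch of that calculation in plain Python; the helper name `pearson_r` is made up, and the usage lines take only the five rows shown by `df.head()`, so the result will differ from the full-dataset value of 0.726833.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Length and Birthweight of the five rows shown by df.head()
length_head = [56, 53, 58, 53, 54]
weight_head = [4.55, 4.32, 4.10, 4.07, 3.94]
r_head = pearson_r(length_head, weight_head)
print(round(r_head, 3))
```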


Let's check the mean birthweight of babies whose mothers smoked versus did not smoke during pregnancy.

In [36]:
# Mean birthweight of babies whose mothers smoked
mean_smoked = df['Birthweight'][df['smoker'] == 1].mean()

# Mean birthweight of babies whose mothers did not smoke
mean_notsmoked = df['Birthweight'][df['smoker'] == 0].mean()

print(f"Mean birthweight for smoking mothers: {mean_smoked:.3f}, non-smoking mothers: {mean_notsmoked:.3f}")

Mean birthweight for smoking mothers: 3.134, non-smoking mothers: 3.510
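
The two filtered means can also be obtained in one pass by grouping on the `smoker` column. A minimal sketch with `groupby`; the small DataFrame below holds made-up toy values so the snippet runs on its own, and on the real data `toy` would simply be replaced by `df`.

```python
import pandas as pd

# Made-up toy values for illustration only; not rows from the birthweight file
toy = pd.DataFrame({
    'smoker':      [0,   0,   1,   1,   1],
    'Birthweight': [3.6, 3.4, 3.0, 3.2, 3.1],
})

# One mean per smoker group, indexed by the group label (0 or 1)
group_means = toy.groupby('smoker')['Birthweight'].mean()
print(group_means)

# Difference in mean birthweight: non-smokers minus smokers
difference = group_means.loc[0] - group_means.loc[1]
print(f"Difference: {difference:.3f} kg")
```

Grouping scales to more than two categories without extra filter lines, which is why it is often preferred over repeated boolean masks.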
