

## Codebook for the Chilean earthquake pregnancy outcome data set.
    For each variable (column), the first line below gives the Spanish name, the English equivalent, and the data type as labeled by Stata. The second line explains the variable in more detail.

    id_clinica = Clinic_ID (numeric, long)
    ID# of the medical clinic
      
    id_excel = Excel_File_ID (numeric, int)
    ID# of the Excel row in the first "whole data" file, before non-eligible pregnancies/deliveries were eliminated (see chart of data cleaning)

    dia = Day (numeric, byte)
     Day of the month the baby was born
     
    mes = Month (numeric, byte)
    Month of birth of the infant
    
    ano = Year (numeric, int)
    Year of birth
    
    sexo = Sex (numeric, byte)
    Sex of the infant. Original values: Mujer = Female; Hombre = Male. In the Stata data file these were converted to numeric values:
	0 (hombre) = male
	1 (mujer) = female
    
    peso = Weight (numeric, int)
     The birth weight of the infant, in grams, rounded to 1 gram
     
    talla = Length (numeric, float)
    The length (i.e., height) of the baby in cm, rounded to 0.1 cm. "Talla" actually means "size", but this is the length measurement.
    
    cc = Head_circ (head circumference; numeric, float)
    The circumference of the head, measured at about the forehead/eyebrow level, in cm, rounded to 0.01 cm.
    
    apgar = Apgar_1 (the Apgar score at one minute of life; numeric, byte)
    See apgar5 for a complete explanation.
     
    apgar5 = Apgar_5 (the Apgar score at five minutes of life; numeric, byte)
    The Apgar score is the sum of five physiologic parameters, each scored from zero to two, so it can range from 0 to 10. The score is routinely calculated for all newborn infants at 1 and 5 minutes of age. It is a clinical summary of how well a newborn is making the transition to life outside the uterus. A "normal" score is 7 or more. Lower scores indicate a "difficult" transition and may reflect either intrinsic or extrinsic factors affecting the infant. While both scores are useful as standardized documentation of the transition process, only the 5-minute Apgar has some predictive power for long-term outcome.
    
    comuna = Municipality (numeric, byte)
    The study authors used average annual income (in millions of Chilean pesos), binned into 3 categories, to characterize the "place of residence" (their term in the publication).
    In the Stata data set this takes 3 values, defined as follows:
	0 = > 1.5 million (high income)
	1 = 1.0 million to 1.5 million (medium income)
	2 = 500,000 to 1.0 million (low income)
    
    aeg = Wgt_for_age (the baby's weight for gestational age; numeric, byte)
    Newborn babies are rated as small for gestational age, appropriate for gestational age, or large for gestational age. These designations come from population-based nomograms and are a function of birth weight and gestational age. The designation is important because small for gestational age (SGA) and large for gestational age (LGA) babies are at higher risk of important problems in the first days of life. SGA is defined as below the 10th percentile for gestational age. LGA is defined as above the 90th percentile for gestational age. In the Stata data set these are designated as follows:
	0 (aeg) = Appropriate for Gestational Age (AGA)
	1 (peg) = Small for Gestational Age (SGA)
	2 (geg) = Large for Gestational Age (LGA)
    
    eg = Gest_age (gestational age; numeric, byte)
    The gestational age of the child at birth, given in completed weeks (i.e., 37 weeks plus 4 days is recorded as 37).
     
    trim_exp = Trimester (numeric, float)
    Pregnancies are divided into thirds, called trimesters. This variable indicates the trimester of the pregnancy that coincided with the date of the earthquake, whether in the quake year or in the control year. The study population consists of women who were pregnant at the time of the earthquake, or at the same time of year in the previous year. This variable classifies the trimester only against the time of year, irrespective of which year. In the Stata data set it is coded as follows:
	1 (primero) = First
	2 (segundo) = Second
	3 (tercero) = Third
    
    bajo_peso = Low_birthwgt (low birthweight; numeric, float)
    A different marker of a newborn's risk of problems at birth is whether the child is "Low Birthweight", routinely defined as <2500 grams at birth. It is an older and weaker means of identifying babies at risk in the newborn period than size for gestational age (above) or prematurity (below), but it continues to be recorded in many studies of newborns. The Stata data set gives two values:
	0 = normal birthweight (sobre 2500)
	1 = low birthweight (bajo 2500)
    
    pretermino = Premature (numeric, float)
    Designates a baby as having been born premature (less than 37 weeks of gestation). This is a stronger indicator of risk of problems in the newborn period than Low Birthweight. Values in Stata:
	0 (sobre 37) = Not premature
	1 (34 - 37) = Premature
        
    edad_mama = Maternal_age (numeric, float)
    Age of the mother, in years.
    
    paridad = Parity (numeric, float)
    Parity is the number of live births the mother has had before the current baby.
     
    trim_exp_g = Trim_study (the trimester of pregnancy at the study interval for each year)
    This variable splits the trim_exp variable into pregnancies in the control year and pregnancies in the earthquake year. Stata gives 6 values, as follows:
	1 = First trimester, year 2009 (primero2009)
	2 = Second trimester, year 2009 (segundo2009)
	3 = Third trimester, year 2009 (tercero2009)
	4 = First trimester, year 2010 (primero2010)
	5 = Second trimester, year 2010 (segundo2010)
	6 = Third trimester, year 2010 (tercero2010)
    
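    pi = Ponderal_index (numeric, float)
    Used as a more complex measure of adequate or inadequate fetal growth. Calculated as weight in grams divided by the cube of the length (or height) in cm. The values in this data set (mean about 2.73) are consistent with the conventional additional scaling by 100.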
    
    exposed = Exposed (numeric, float)
    Whether or not the pregnancy was exposed to the earthquake, i.e., whether it belongs to the control group or the exposed group. Given as a number for which the Stata file provides no value labels. The coding can be determined by data inspection, as follows:
	0 = Not exposed (control group)
	1 = Exposed (earthquake group)
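
The derived variables in the codebook above (Low_birthwgt, Premature, Ponderal_index) follow simple rules, which can be sketched as below. This is only an illustration: the function names are hypothetical, the factor of 100 in the ponderal index is an assumption inferred from the summary statistics rather than something the source file states, and the example measurements are illustrative values close to the sample means, not a record from the data set.

```python
def ponderal_index(weight_g, length_cm, scale=100.0):
    """Weight (g) divided by length (cm) cubed; scale=100 is an assumption."""
    return scale * weight_g / length_cm ** 3


def low_birthweight_flag(weight_g):
    """1 if birth weight is below 2500 g, else 0 (codebook definition)."""
    return 1 if weight_g < 2500 else 0


def premature_flag(gest_age_weeks):
    """1 if born before 37 completed weeks of gestation, else 0."""
    return 1 if gest_age_weeks < 37 else 0


# Illustrative infant (values chosen near the sample means, not a real record)
example = {"Weight": 3355, "Length": 49.7, "Gest_age": 38}

pi_value = ponderal_index(example["Weight"], example["Length"])
lbw = low_birthweight_flag(example["Weight"])
prem = premature_flag(example["Gest_age"])
print(round(pi_value, 2), lbw, prem)
```
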

    The data were examined for correlations using scatter plot matrices, correlation heat maps, and the numeric correlation matrix. The variables Clinic_ID and Excel_File_ID were dropped before computing correlations because they are record-keeping variables. Several strong correlations were seen. Using an absolute correlation above 0.25 as the cutoff, the variables "Year", "Length", "Weight", "Trim_study", and "Premature" were dropped. The variables "Day" and "Month" were also dropped as containing irrelevant information. Note that "Premature" nevertheless appears in the final dataframe listed below.
    
    Further correlation studies and further paring of the variables from the main dataframe were done as shown below. The final dataframe that I used for the miniproject was "df_newlite4", which had one outcome variable, "Exposed", and the following seven predictor variables:

Dataframe: df_newlite4

	Head_circ
	Sex
	Trimester
	Premature
	apgar_5
	Maternal_age
	Ponderal_index

The interim dataframes created were "df_newlite", "df_newlite2", and "df_newlite3".
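
The screening rule described above (flag variable pairs whose absolute correlation exceeds 0.25) can be sketched as follows. This is a minimal illustration, not the procedure actually run for the miniproject: the helper name `high_corr_pairs` and the toy dataframe are hypothetical, and the real decisions also relied on visual inspection of the plots.

```python
import pandas as pd


def high_corr_pairs(frame, threshold=0.25):
    """Return (col_a, col_b, r) for every column pair with |r| > threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, skip the diagonal
            r = float(corr.loc[a, b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    # Strongest correlations first
    return sorted(pairs, key=lambda item: -abs(item[2]))


# Toy data (hypothetical): b is an exact multiple of a, c is uncorrelated with a
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1, -1, 1, -1, 1],
})

pairs = high_corr_pairs(toy)
for a, b, r in pairs:
    print(a, b, round(r, 2))
```

On the real data, the same helper could be called as `high_corr_pairs(df)` once the dataframe has been loaded.
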

In [7]:

from __future__ import print_function
import pandas as pd
import numpy as np
from IPython.display import display

df = pd.read_csv('C:/Users/Admin/Documents/DATA SCIENCE/Data Sets/Chile_earthquake/Chilean_Earthquake_Binary_data.csv')
# Drop the record-keeping columns "Clinic_ID" and "Excel_File_ID" from the dataframe
df = df.drop(columns=['Clinic_ID', 'Excel_File_ID'], errors='ignore')


# Identify records with missing values
#display(df[df['Parity'].isna()].transpose().style.highlight_null())

# Replace NaN in Parity with 0, which is the most common value.
# Assigning the result back avoids relying on in-place modification of a column.
df['Parity'] = df['Parity'].fillna(0)

print('Summary Statistics for Initial Variables After Imputing NaN Values and Dropping "Clinic_ID", "Excel_File_ID"')
display(df.describe())

Summary Statistics for Initial Variables


statistic,Day,Month,Year,Sex,Weight,Length,Head_circ,apgar_1,apgar_5,Municipality,Wgt_for_age,Gest_age,Trimester,Low_birthwgt,Premature,Maternal_age,Parity,Trim_study,Ponderal_index,Exposed
count,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0,4076.0
mean,15.374141,7.113837,2009.517664,0.512022,3355.048086,49.696516,34.689463,8.767664,9.363837,1.480128,0.167566,38.802993,1.999755,0.013248,0.028214,27.839794,0.611138,3.552748,2.729498,0.517664
std,8.736868,2.762731,0.499749,0.499917,384.896728,1.723517,1.187216,0.730237,0.521778,0.638901,0.481448,1.07473,0.820145,0.11435,0.165604,5.380683,0.782109,1.677835,0.228661,0.499749
min,1.0,2.0,2009.0,0.0,1790.0,40.0,29.0,2.0,6.0,0.0,0.0,34.0,1.0,0.0,0.0,14.0,0.0,1.0,1.696,0.0
25%,8.0,5.0,2009.0,0.0,3098.75,48.5,34.0,9.0,9.0,1.0,0.0,38.0,1.0,0.0,0.0,24.0,0.0,2.0,2.579992,0.0
50%,15.0,7.0,2010.0,1.0,3345.0,50.0,35.0,9.0,9.0,2.0,0.0,39.0,2.0,0.0,0.0,28.0,0.0,4.0,2.717342,1.0
75%,23.0,10.0,2010.0,1.0,3610.0,51.0,35.5,9.0,10.0,2.0,0.0,40.0,3.0,0.0,0.0,31.0,1.0,5.0,2.866327,1.0
max,31.0,12.0,2010.0,1.0,4890.0,56.0,40.0,10.0,10.0,2.0,2.0,41.0,3.0,1.0,1.0,44.0,5.0,6.0,5.283747,1.0
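
The imputation in the cell above fills missing Parity values with 0, described as the most common value. A small sketch of how that choice could be checked before filling is shown below; the toy series is hypothetical and stands in for `df['Parity']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Parity'], with one missing value
parity = pd.Series([0, 0, 1, None, 2, 0], name="Parity")

# value_counts() ignores NaN by default, so this shows the observed labels only
print(parity.value_counts())

# mode() also skips NaN; take the first entry in case of ties
most_common = parity.mode().iloc[0]

# Fill the gaps with the most common value and assign the result back
filled = parity.fillna(most_common)
print(filled.isna().sum())
```
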


In [8]:
## Ignore the date columns and create the new dataframe "df_lite"
## Check variable types

columns_ignore = ['Day', 'Month', 'Year']
df_lite = df.drop(columns_ignore, axis=1)

df_lite.info()

## All variables are int64 or float64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4076 entries, 0 to 4075
Data columns (total 17 columns):
Sex               4076 non-null int64
Weight            4076 non-null int64
Length            4076 non-null float64
Head_circ         4076 non-null float64
apgar_1           4076 non-null int64
apgar_5           4076 non-null int64
Municipality      4076 non-null int64
Wgt_for_age       4076 non-null int64
Gest_age          4076 non-null int64
Trimester         4076 non-null int64
Low_birthwgt      4076 non-null int64
Premature         4076 non-null int64
Maternal_age      4076 non-null int64
Parity            4076 non-null float64
Trim_study        4076 non-null int64
Ponderal_index    4076 non-null float64
Exposed           4076 non-null int64
dtypes: float64(4), int64(13)
memory usage: 541.4 KB


In [None]:

# Map some categorical variables to work with parallel coordinate plots and have a more natural ordering

## HEBER: I don't know WHY this remap is necessary, as these are all numeric.
## But if it is not done, I get an error when splitting for logistic regression.
## It also creates a new problem: Series.map() returns NaN for every value that is not a key
## in the dictionary, and the keys here are text labels while the columns already hold numbers.
## So "Sex", "Premature", and "Trim_study" end up entirely NaN, as shown when the variables are listed using info().
## I had to fix this in the next code block by adding more code.
df_lite['Sex'] = df_lite['Sex'].map({'female': 0, 'male': 1})
df_lite['Premature'] = df_lite['Premature'].map({'Premature': 1, 'Not premature': 0})
df_lite['Trim_study'] = df_lite['Trim_study'].map({'First2009': '1', 'Second2009': '2', 'Third2009': '3', 
                          'First2010': '4', 'Second2010': '5', 'Third2010': '6' })


In [None]:
## PROBLEM: the variables Sex, Premature, and Trim_study are now all filled with NaN.
## This happened in the previous cell, where the mapping was meant "to replace text with numbers".
## But they were already numbers, so the remap eliminated all the data.
## Restore the original data from df.

df_newlite = df_lite.copy()  # copy, so df_newlite is a separate dataframe and not an alias of df_lite
df_newlite['Sex'] = df['Sex']
df_newlite['Premature'] = df['Premature']
df_newlite['Trim_study'] = df['Trim_study']

df_newlite.info()
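
The NaN problem described above can be reproduced on a toy series. The sketch below shows `Series.map` returning NaN for values that are not dictionary keys, and one possible guard that only maps columns that are not already numeric. The helper name `encode_if_text` and the toy data are hypothetical; this is not the fix used in the notebook, which simply restores the columns from `df`.

```python
import pandas as pd

mapping = {"female": 0, "male": 1}

# Toy column that is already numeric, like Sex in the loaded data
sex_numeric = pd.Series([0, 1, 1, 0], name="Sex")

# None of the values 0/1 is a key of the dict, so every entry becomes NaN
mapped = sex_numeric.map(mapping)
print(mapped.isna().all())


def encode_if_text(series, mapping):
    """Apply the mapping only when the column is not already numeric."""
    if pd.api.types.is_numeric_dtype(series):
        return series
    return series.map(mapping)


kept = encode_if_text(sex_numeric, mapping)                        # unchanged
encoded = encode_if_text(pd.Series(["female", "male"]), mapping)  # text -> numbers
print(kept.tolist(), encoded.tolist())
```
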

In [None]:
## GET NUMERIC CORRELATION MATRIX OF FULL DATA SET (df, not the reduced df_newlite)

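corr = df.corr()
# Styler.set_precision() has been removed from recent pandas; a format string works across versions
corr.style.background_gradient().format('{:.2f}')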

In [None]:
## Look at correlation plots of df_newlite

import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: dataframe to plot (here df_newlite)
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    # Label both axes with the variable names
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)

# The function was previously defined but never called
plot_corr(df_newlite)

## Show the scatter plot matrix
pd.plotting.scatter_matrix(df_newlite, alpha=0.3, figsize=(14, 14), diagonal='kde');


In [None]:

## CLEARLY THERE ARE SOME HIGH CORRELATIONS EVIDENT IN BOTH df AND df_newlite. FIRST, "TRIM_STUDY" HAS A CORRELATION
## OF ABOUT 0.87 WITH THE OUTCOME "EXPOSED", WHICH SHOULD BE OBVIOUS BECAUSE IT IS CODED AGAINST THIS OUTCOME.
## NEXT, LENGTH AND WEIGHT ARE USED TO CALCULATE PONDERAL_INDEX.
## SO BEGIN BY CHOOSING TO IGNORE "LENGTH", "WEIGHT", "TRIM_STUDY".
## DROP THE IGNORED COLUMNS FROM df (WHICH STILL HOLDS ALL COLUMNS) AND CREATE NEW DATAFRAME = df_newlite2
columns_ignore = ['Length', 'Weight', 'Trim_study', 'Day', 'Month', 'Year']
df_newlite2 = df.drop(columns_ignore, axis=1)

print('Summary Statistics for newlite2 data')
display(df_newlite2.describe())

In [None]:
## REPLOT THE CORRELATION HEAT MAP AND THE SCATTER PLOT MATRIX USING df_newlite2
## AND ALSO THE NUMERIC CORRELATION MATRIX
def plot_corr(df, size=14):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: dataframe to plot (here df_newlite2)
        size: vertical and horizontal size of the plot'''

    corr = df.corr()  # use the argument, not a hard-coded dataframe
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)

plot_corr(df_newlite2)

## Show the scatter plot matrix
pd.plotting.scatter_matrix(df_newlite2, alpha=0.3, figsize=(14, 14), diagonal='kde');

## NUMERIC CORRELATION MATRIX
corr = df_newlite2.corr()
corr.style.background_gradient().format('{:.2f}')

In [None]:
## Additional highly correlated variables are "Maternal_age" with "Parity" (0.45),
## "Gest_age" with "Premature" (0.54), and "Low_birthwgt" with "Premature" (0.30). "Ponderal_index" correlates with
## "Head_circ" (0.24), and "Gest_age" correlates with "Head_circ" (0.23).
## From this result I will further remove "Premature" because it correlates with several others.
## Creating df_newlite3

# Assign to a plain variable; "df.columns_ignore = ..." only set an attribute on df
# and left the old list in use, so "Premature" was not actually dropped.
columns_ignore = ['Length', 'Weight', 'Trim_study', 'Day', 'Month', 'Year', 'Premature']
df_newlite3 = df.drop(columns_ignore, axis=1)


In [None]:
## SUMMARY STATISTICS FOR df_newlite3

print('Summary Statistics for newlite3 data')
display(df_newlite3.describe())

In [None]:
## Numeric correlation matrix for df_newlite3
corr = df_newlite3.corr()
# display() is needed because only the last expression in a cell is shown automatically
display(corr.style.background_gradient().format('{:.2f}'))


## REPLOT CORRELATION HEAT MAP AND SCATTER PLOT MATRIX for df_newlite3

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: dataframe to plot (here df_newlite3)
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)

# Plot df_newlite3 (the earlier version plotted df_newlite2 by mistake)
plot_corr(df_newlite3)

## Show the scatter plot matrix
pd.plotting.scatter_matrix(df_newlite3, alpha=0.3, figsize=(14, 14), diagonal='kde');

In [None]:
## AT THIS POINT IN THE MINI-PROJECT I HAD COMPUTED BOTH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINE
## USING df_newlite3, AND EXAMINED AND COMPARED THE WEIGHTS OF BOTH MODELS.
## BASED ON THE WEIGHTS IN THE SVM MODEL, I FURTHER PRUNED THE DATAFRAME TO REMOVE VARIABLES WHICH
## HAD ABSOLUTE WEIGHTS BELOW 0.05. THE VARIABLES I REMOVED ARE "APGAR_1", "WGT_FOR_AGE", "GEST_AGE", "LOW_BIRTHWGT",
## "MUNICIPALITY", AND "PARITY". THIS CREATED df_newlite4.
## NOTE THAT "PREMATURE" IS NOT IN THE DROP LIST BELOW, SO IT IS RETAINED IN df_newlite4.

df_newlite4 = df.drop(['Day', 'Month', 'Year', 'Weight', 'Length', 'Trim_study',
                       'apgar_1', 'Municipality', 'Wgt_for_age', 'Gest_age',
                       'Low_birthwgt', 'Parity'], axis=1)
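
The pruning rule described in the comments above (drop variables whose absolute model weight is below 0.05) can be sketched as follows. The weights here are made-up placeholders for illustration only; they are not the SVM coefficients from the mini-project, which are not reproduced in this notebook.

```python
import pandas as pd

# HYPOTHETICAL weights, for illustration only (not the fitted SVM coefficients)
weights = pd.Series({
    "Head_circ": 0.21,
    "apgar_1": 0.01,
    "Sex": -0.12,
    "Parity": -0.03,
})

threshold = 0.05

# Keep a feature only if the magnitude of its weight reaches the threshold
keep = weights[weights.abs() >= threshold].index.tolist()
drop = weights[weights.abs() < threshold].index.tolist()

print("keep:", keep)
print("drop:", drop)
```

With a fitted linear model, the placeholder series would be replaced by the model's coefficients indexed by the feature names.
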

In [None]:
df_newlite4.info()

In [None]:
## HERE ARE THE CORRELATION MATRIX AND PLOTS FOR df_newlite4. I DID NOT CHANGE THE DF AFTER THIS IN THE MINI PROJECT

## Numeric correlation matrix for df_newlite4
corr = df_newlite4.corr()
# display() is needed because only the last expression in a cell is shown automatically
display(corr.style.background_gradient().format('{:.2f}'))


## REPLOT CORRELATION HEAT MAP AND SCATTER PLOT MATRIX for df_newlite4

def plot_corr(df, size=10):
    '''Plot a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: dataframe to plot (here df_newlite4)
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)

# Plot df_newlite4 (the earlier version plotted df_newlite2 by mistake)
plot_corr(df_newlite4)

## Show the scatter plot matrix
pd.plotting.scatter_matrix(df_newlite4, alpha=0.3, figsize=(14, 14), diagonal='kde');