# Baby Weight Prediction Project
## by Oliver Ochs

“A new baby's gender, name, time of birth, and birth weight are nice information for a birth announcement, but birth weight is especially important for an obstetrician. A large size at delivery has long been associated with an increased risk of injuries to a newborn and its mom. So the better a doctor can predict birth weight, the easier the delivery may be.” - WebMD

With the goal being to see what type of regression model works best on with dataset, this project tests time and performance of training models in order to predict newborn babies weights.

## Read data from CSV and describe variables.

In [1]:
import pandas as pd
import numpy as np

# description of dataset values
f = open('data-description.txt', 'r')
file_contents = f.read()
print(file_contents)
f.close()

# dataset
data = pd.read_csv("baby-weights-dataset.csv")

ID : Unique Identification number of a baby
SEX : Sex of the baby
MARITAL: Marital status of its parents
FAGE : Age of father
GAINED : Weight gained during pregnancy
VISITS : Number of prenatal visits
MAGE : Age of mother
FEDUC : Father's years of education
MEDUC : Mother's years of education
TOTALP : Total pregnancies
BDEAD : number of children born alive now dead
TERMS : Number of other terminations 
LOUTCOME : Outcome of last delivery
WEEKS : Completed weeks of gestation
RACEMOM : Race of mother/child from this set {0: 'Unknown',1:'OTHER_NON_WHITE', 2:'WHITE', 3:'BLACK', 4:'AMERICAN_INDIAN', 5:'CHINESE', 6:'JAPANESE', 7:'HAWAIIAN', 8:'FILIPINO', 9:'OTHER_ASIAN'}
RACEDAD : Race of Father from this set {0:'Unknown',1:'OTHER_NON_WHITE', 2:'WHITE', 3:'BLACK', 4:'AMERICAN_INDIAN', 5:'CHINESE', 6:'JAPANESE', 7:'HAWAIIAN', 8:'FILIPINO', 9:'OTHER_ASIAN'}
HISPMOM : Hispanic from this set {C:Cubans, M:Mexicans,  N:No, O:Colombians P:Peruvians, S:Salvadorans, U:Guatemalans }
HISPDAD : Is dad h

In [2]:
def initial_stats():
    # Computed mean, stdev, min, max, 25% percentile, median and 75% percentile of BWEIGHT target variable
    cols = ['ID', 'SEX', 'MARITAL', 'FAGE', 'GAINED', 'VISITS', 'MAGE', 'FEDUC', 'MEDUC', 'TOTALP', 'BDEAD',
            'TERMS', 'LOUTCOME', 'WEEKS', 'RACEMOM', 'RACEDAD', 'HISPMOM', 'HISPDAD', 'CIGNUM', 'DRINKNUM',
            'ANEMIA', 'CARDIAC', 'ACLUNG', 'DIABETES', 'HERPES', 'HYDRAM', 'HEMOGLOB', 'HYPERCH', 'HYPERPR',
            'ECLAMP', 'CERVIX', 'PINFANT', 'PRETERM', 'RENAL', 'RHSEN', 'UTERINE', 'BWEIGHT']
    mean = data[['BWEIGHT']].mean()
    std = data[['BWEIGHT']].std()
    min = data[['BWEIGHT']].min()
    max = data[['BWEIGHT']].max()
    tf = data[['BWEIGHT']].quantile(.25)
    med = data[['BWEIGHT']].median()
    sf = data[['BWEIGHT']].quantile(.75)
    result = np.array([mean, std, min, max, tf, med, sf])
    return result
print("Initial statistics on BWEIGHT variable in dataset")
print("mean:", initial_stats()[0])
print("standard deviation:", initial_stats()[1])
print("min:", initial_stats()[2])
print("max:", initial_stats()[3])
print("25th percentile:", initial_stats()[4])
print("median:", initial_stats()[5])
print("75th percentile:", initial_stats()[6])

Initial statistics on BWEIGHT variable in dataset
mean: [7.25806583]
standard deviation: [1.32946068]
min: [0.1875]
max: [13.0625]
25th percentile: [6.625]
median: [7.375]
75th percentile: [8.0625]


## Manipulate data to ensure proper functionality

In [3]:
def data_to_num(full_dataset):
    # Takes full_dataset (Pandas dataframe) as input, and returns a revised
    # full_dataset Dataframe after replacing all the non-numeric variables (i.e.,
    # categorical variables) with mapped numeric encoding.
    
    tonum = {"HISPDAD": {"C": 1, "M": 2, "N": 3, "O": 4,
                         "P": 5, "S": 6, "U": 7},
             "HISPMOM": {"C": 1, "M": 2, "N": 3, "O": 4,
                         "P": 5, "S": 6, "U": 7}
             }
    full_dataset.replace(tonum, inplace=True)
    return full_dataset
num_data = data_to_num(data)

In [4]:
def fill_nan(full_dataset):
    # Given the full_dataset (Pandas Dataframe), checks if there are missing values, and if yes,
    # counts how many, and impute the missing values with corresponding mean values.
    # Finally, returns the counting result as a Pandas dataframe with 2 columns
    # {variable_name,num_of_missing_values). Also, returns the revised full_dataset after the missing
    # value imputations is done.
    import pandas as pd

    missing_count = pd.DataFrame()
    revised_full_dataset = pd.DataFrame()

    missing_count = full_dataset.isnull().sum()
    revised_full_dataset = full_dataset.fillna(full_dataset.mean())
    return (missing_count, revised_full_dataset)
revised_num_dataset = fill_nan(num_data)
print("Missing values Filled")
print(revised_num_dataset[0])

Missing values Filled
ID          0
SEX         0
MARITAL     0
FAGE        0
GAINED      1
VISITS      0
MAGE        0
FEDUC       1
MEDUC       0
TOTALP      0
BDEAD       0
TERMS       0
LOUTCOME    0
WEEKS       1
RACEMOM     0
RACEDAD     0
HISPMOM     0
HISPDAD     0
CIGNUM      1
DRINKNUM    0
ANEMIA      0
CARDIAC     0
ACLUNG      0
DIABETES    0
HERPES      0
HYDRAM      1
HEMOGLOB    0
HYPERCH     0
HYPERPR     0
ECLAMP      0
CERVIX      0
PINFANT     0
PRETERM     0
RENAL       0
RHSEN       0
UTERINE     0
BWEIGHT     0
dtype: int64


## Correlation of features to target

In [5]:
def correlated_vars(full_dataset_all_numeric):
    # Takes in full_dataset (Pandas dataframe) where all the categorical variables are
    # already replaced with numeric values, returns a list of top 20 highly correlated variables
    # (with respect to the target variable) as a Pandas dataframe with 2 columns {variable,corr_score}.
    # The corr_score between a variable x and the target variable y is computed using the
    # Pearson Correlation Coefficient.

    import pandas as pd

    data = full_dataset_all_numeric.corr(method='pearson')['BWEIGHT'][1:-1]
    result = data.nlargest(20)

    return result.to_frame()
kept_vars = correlated_vars(revised_num_dataset[1])
kept_vars.style.background_gradient(cmap='coolwarm')

Unnamed: 0,BWEIGHT
WEEKS,0.565373
GAINED,0.173262
VISITS,0.129587
MAGE,0.068473
PINFANT,0.067073
MEDUC,0.055908
FEDUC,0.052674
FAGE,0.051447
DIABETES,0.010216
TOTALP,0.003201


## Separate variables and split into train/test

In [6]:
def separate_dataset(full_dataset):
    # Separates the full_dataset into two parts: X and y, where X denotes the input matrix
    # containing only the input variables, and y denotes the target vector containing only the target values
    # for exactly the same number of samples in the given full_dataset (pandas Dataframe).
    # Finally, returns X and y as a tuple
    import pandas as pd

    X = pd.DataFrame()
    y = pd.DataFrame()
    assert (len(X) == len(y))

    X = full_dataset[full_dataset.columns[:-1]]
    y = full_dataset['BWEIGHT']
    return (X, y)
separated_dataset = separate_dataset(revised_num_dataset[1])

In [7]:
from sklearn.model_selection import train_test_split
def split_dataset(X, y):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=45931)
    return (X_train, X_test, y_train, y_test)
dataset_split = split_dataset(*separated_dataset)

## Scaling

In [8]:
def min_max_scaler(X_train, X_test, y_train, y_test):
    # Given the 4 splits denoting the training and test dataset,
    # Applies min-max scaling on the training dataset (X_train).
    # Then scales the test dataset based on the metrics obtained when scaling the training dataset.
    # Finally, returns as a tuple the scaled X_train, X_test and the intact y_train and y_test.
    import pandas as pd
    from sklearn import preprocessing
    cols = list(X_train)

    scaler = preprocessing.MinMaxScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=cols)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=cols)


    return (X_train_scaled, X_test_scaled, y_train, y_test)
min_max_scaled_dataset = min_max_scaler(*dataset_split)

In [9]:
def standardize(X_train, X_test, y_train, y_test):
    # Given the 4 splits denoting the training and test dataset,
    # Applies standardization scaling on the training dataset (X_train).
    # Then scales the test dataset based on the metrics obtained when scaling the training dataset.
    # Finally, returns as a tuple the scaled X_train, X_test and the intact y_train and y_test.
    from sklearn import preprocessing
    cols = list(X_train)

    scaler = preprocessing.StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=cols)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=cols)

    return (X_train_scaled, X_test_scaled, y_train, y_test)
standardized_data = standardize(*dataset_split)

## Regression models and their performance

In [10]:
def fit_close_form(X_train_scaled, X_test_scaled, y_train, y_test):
    # Given the (X_train, y_train) pairs denoting input matrix and output vector respectively,
    # Fits a linear regression model using the close-form solution to obtain
    # the coefficients, beta's, as a numpy array of m+1 values.
    # Then using the computed beta values, predicts the test samples provided in the "X_test_scaled"
    # argument. Computes Root Mean Squared Error (RMSE) of the prediction.
    # Finally, returns the beta vector, y_pred, RMSE as a tuple.
    import pandas as pd
    import numpy as np
    beta = []
    y_pred = []
    RMSE = -1

    X = X_train_scaled.to_numpy()
    y = y_train.to_numpy()
    Xt = np.transpose(X)
    XtX = np.dot(Xt, X)
    Xty = np.dot(Xt, y)
    beta = np.linalg.solve(XtX, Xty)
    for Xtest, Ytest in zip(X_test_scaled.to_numpy(), y_test.to_numpy()):
        pred = np.dot(Xtest, beta)
        y_pred.append(pred)
    RMSE = np.sqrt(np.square(np.subtract(y_test, y_pred)).mean())
    return (beta, y_pred, RMSE)

#http://www2.lawrence.edu/fast/GREGGJ/Python/numpy/numpyLA.html Referenced

In [11]:
print("Min-Max Normalization of features")
print("Root Mean Squared Error (RMSE):", fit_close_form(*min_max_scaled_dataset)[2])
print("\n")
print("Standardization of features")
print("Root Mean Squared Error (RMSE):", fit_close_form(*standardized_data)[2])

Min-Max Normalization of features
Root Mean Squared Error (RMSE): 1.042421813184601


Standardization of features
Root Mean Squared Error (RMSE): 7.329759432596651


In [12]:
def fit_batch(X_train_scaled, X_test_scaled, y_train, y_test, learning_rate=0.001, nIteration=10):
    # Given the (X_train, y_train) pairs denoting input matrix and output vector respectively,
    # Fits a linear regression model using the batch gradient descent algorithm to obtain
    # the coefficients, beta's, as a numpy array of m+1 values.
    # Measures the cpu_time needed during the training step. cpu_time is not equal to the wall_time.
    # Then using the computed beta values, predicts the test samples provided in the "X_test_scaled"
    # argument. Computes Root Mean Squared Error (RMSE) of the prediction.
    # Finally, returns the beta vector, y_pred, RMSE, cpu_time as a tuple.
    from time import perf_counter
    import numpy as np
    import random
    random.seed(554433)
    beta = []
    y_pred = []
    RMSE = -1
    cpu_time = 0

    t_start = perf_counter()
    X = X_train_scaled.to_numpy()
    y = y_train.to_numpy()
    Xt = np.transpose(X)
    beta = np.random.uniform(0, 1, 36)
    m = len(y)
    for i in range(nIteration):
        beta = beta - learning_rate * (Xt.dot(X.dot(beta)-y)/m)
    t_stop = perf_counter()
    cpu_time = t_stop - t_start
    for Xtest, Ytest in zip(X_test_scaled.to_numpy(), y_test.to_numpy()):
        pred = np.dot(Xtest, beta)
        y_pred.append(pred)
    RMSE = np.sqrt(np.square(np.subtract(y_test, y_pred)).mean())
    return (beta, y_pred, RMSE, cpu_time)

In [13]:
print("Min-Max Normalization of features")
print("Root Mean Squared Error (RMSE): ", fit_batch(*min_max_scaled_dataset)[2])
print("Runtime:", fit_batch(*min_max_scaled_dataset)[3])
print("\n")
print("Standardization of features")
print("Root Mean Squared Error (RMSE): ", fit_batch(*standardized_data)[2])
print("Runtime:", fit_batch(*standardized_data)[3])

Min-Max Normalization of features
Root Mean Squared Error (RMSE):  4.341054723971213
Runtime: 0.03177400000000041


Standardization of features
Root Mean Squared Error (RMSE):  8.271494492845612
Runtime: 0.02924289999999985


In [14]:
def fit_stochastic(X_train_scaled, X_test_scaled, y_train, y_test, learning_rate=0.001, nIteration=7000):
    # Given the (X_train, y_train) pairs denoting input matrix and output vector respectively,
    # Fits a linear regression model using the stochastic gradient descent algorithm to obtain
    # the coefficients, beta's, as a numpy array of m+1 values.
    # Then using the computed beta values, predicts the test samples provided in the "X_test_scaled"
    # argument. Computes Root Mean Squared Error (RMSE) of the prediction.
    # Finally, returns the beta vector, y_pred, RMSE, cpu_time as a tuple.
    import numpy as np
    from time import perf_counter
    import random
    random.seed(554433)
    beta = []
    y_pred = []
    RMSE = -1
    cpu_time = 0

    t_start = perf_counter()
    X = X_train_scaled.to_numpy()
    y = y_train.to_numpy()
    beta = np.random.uniform(0,1,36)
    m = len(y)
    for i in range(nIteration):
        randomize = np.random.permutation(len(X))
        X = X[randomize]
        y = y[randomize]
        Xt = np.transpose(X)
        beta = beta - learning_rate * (Xt.dot(X.dot(beta) - y) / m)
    t_stop = perf_counter()
    cpu_time = t_stop - t_start
    for Xtest, Ytest in zip(X_test_scaled.to_numpy(), y_test.to_numpy()):
        pred = np.dot(Xtest, beta)
        y_pred.append(pred)
    RMSE = np.sqrt(np.square(np.subtract(y_test, y_pred)).mean())
    return (beta, y_pred, RMSE, cpu_time)

In [15]:
print("Min-Max Normalization of features")
print("Root Mean Squared Error (RMSE): ", fit_stochastic(*min_max_scaled_dataset)[2])
print("Runtime:", fit_stochastic(*min_max_scaled_dataset)[3])
print("\n")
print("Standardization of features")
print("Root Mean Squared Error (RMSE): ", fit_stochastic(*standardized_data)[2])
print("Runtime:", fit_stochastic(*standardized_data)[3])

Min-Max Normalization of features
Root Mean Squared Error (RMSE):  1.465083599936269
Runtime: 240.18791009999998


Standardization of features
Root Mean Squared Error (RMSE):  7.330681444886723
Runtime: 216.32823570000005


In [16]:
def fit_mini_batch(X_train_scaled, X_test_scaled, y_train, y_test, batch_size=64, learning_rate=0.001, nIteration=200):
    # Given the (X_train, y_train) pairs denoting input matrix and output vector respectively,
    # Fits a linear regression model using the mini-batch gradient descent algorithm to obtain
    # the coefficients, beta's, as a numpy array of m+1 values.
    # Then using the computed beta values, predicts the test samples provided in the "X_test_scaled"
    # argument. Computes Root Mean Squared Error (RMSE) of the prediction.
    # Finally, returns the beta vector, y_pred, RMSE, cpu_time as a tuple.
    from time import perf_counter
    import numpy as np
    import random
    random.seed(554433)
    beta = []
    y_pred = []
    RMSE = -1
    cpu_time = 0

    t_start = perf_counter()
    X = X_train_scaled.to_numpy()
    y = y_train.to_numpy()
    beta = np.random.uniform(0,1,36)
    m = len(y)
    for i in range(nIteration):
        randomize = np.random.permutation(len(X))
        X = X[randomize]
        y = y[randomize]
        Xt = np.transpose(X)
        for j in range(batch_size):
            beta = beta - learning_rate * (Xt.dot(X.dot(beta) - y) / m)
    t_stop = perf_counter()
    cpu_time = t_stop - t_start
    for Xtest, Ytest in zip(X_test_scaled.to_numpy(), y_test.to_numpy()):
        pred = np.dot(Xtest, beta)
        y_pred.append(pred)
    RMSE = np.sqrt(np.square(np.subtract(y_test, y_pred)).mean())

    return (beta, y_pred, RMSE, cpu_time)

In [17]:
print("Min-Max Normalization of features")
print("Root Mean Squared Error (RMSE): ", fit_mini_batch(*min_max_scaled_dataset)[2])
print("Runtime:", fit_mini_batch(*min_max_scaled_dataset)[3])
print("\n")
print("Standardization of features")
print("Root Mean Squared Error (RMSE): ", fit_mini_batch(*standardized_data)[2])
print("Runtime:", fit_mini_batch(*standardized_data)[3])

Min-Max Normalization of features
Root Mean Squared Error (RMSE):  1.3468882811489113
Runtime: 27.62068050000005


Standardization of features
Root Mean Squared Error (RMSE):  7.329615512424052
Runtime: 30.42870019999998


## Results

In [18]:
def predictive_results(batch_GD_result, stochastic_GD_result, minibatch_GD_result):
    # Given the 3 sets of tuples from the 3 experiments with batch gradient descent,
    # stochastic gradient descent and mini-batch gradient descent, returns a string from the set
    # {"batch-GD", "stochastic-GD", "minibatch-GD"} that demonstrated the best predictive performance
    # in terms of RMSE.

    (beta_B, y_pred_B, RMSE_B, cpu_time_B) = batch_GD_result
    (beta_S, y_pred_S, RMSE_S, cpu_time_S) = stochastic_GD_result
    (beta_M, y_pred_M, RMSE_M, cpu_time_M) = minibatch_GD_result

    RMSEs = {'Batch': RMSE_B, 'Stochastic': RMSE_S, 'Mini': RMSE_M}
    minpred = min(RMSEs.keys(), key=(lambda k: RMSEs[k]))
    return ('Best prediction by: ' + str(minpred) + ' with ' + str(RMSEs[minpred]))

In [19]:
print("Min-Max Normalization of features")
print(predictive_results(fit_batch(*min_max_scaled_dataset), fit_stochastic(*min_max_scaled_dataset), fit_mini_batch(*min_max_scaled_dataset)))
print("\n")
print("Standardization of features")
print(predictive_results(fit_batch(*standardized_data), fit_stochastic(*standardized_data), fit_mini_batch(*standardized_data)))

Min-Max Normalization of features
Best prediction by: Mini with 1.3827329243324744


Standardization of features
Best prediction by: Mini with 7.329587472973014


In [20]:
def time_results(batch_GD_result, stochastic_GD_result, minibatch_GD_result):
    # Given the 3 sets of tuples from the 3 experiments with batch gradient descent,
    # stochastic gradient descent and mini-batch gradient descent, returns a string from the set
    # {"batch-GD", "stochastic-GD", "minibatch-GD"} that demonstrated the least training time.

    (beta_B, y_pred_B, RMSE_B, cpu_time_B) = batch_GD_result
    (beta_S, y_pred_S, RMSE_S, cpu_time_S) = stochastic_GD_result
    (beta_M, y_pred_M, RMSE_M, cpu_time_M) = minibatch_GD_result

    cpu_times = {'Batch': cpu_time_B, 'Stochastic': cpu_time_S, 'minibatch': cpu_time_M}

    mintime = min(cpu_times.keys(), key=(lambda k: cpu_times[k]))
    return ('min runtime by: ' + str(mintime) + ' with ' + str(cpu_times[mintime]))

In [21]:
print("Min-Max Normalization of features")
print(time_results(fit_batch(*min_max_scaled_dataset), fit_stochastic(*min_max_scaled_dataset), fit_mini_batch(*min_max_scaled_dataset)))
print("\n")
print("Standardization of features")
print(time_results(fit_batch(*standardized_data), fit_stochastic(*standardized_data), fit_mini_batch(*standardized_data)))

Min-Max Normalization of features
min runtime by: Batch with 0.025800699999990684


Standardization of features
min runtime by: Batch with 0.020153799999889088
