# Regression: San Francisco Birthweight Data

In this project you will work with data collected from births that occurred between 1960 and 1967 among women in the San Francisco East Bay area. The dataset records details of the pregnancy and of the mother's health and circumstances, and reports the birthweight of the baby. You will use this data to train and interpret a regression model for birthweight.


In [None]:
# Import the libraries used throughout the project
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
link_to_file = "https://raw.githubusercontent.com/Center-for-Health-Data-Science/Python_part2/main/data/project_work/birth.csv"

df = pd.read_csv(link_to_file)
df.head()

The table below describes each column:

| Feature | Description |
|-------|-------|
| case | Case number |
| bwt | Baby's birthweight in ounces |
| gestation | Duration of pregnancy in days |
| age | Mother's age in years |
| height | Mother's height in inches |
| weight | Mother's weight in pounds |
| smoke | Whether the mother is a smoker |
| ses | Socioeconomic status of the mother in 3 levels (low, middle, high) |

## EDA and data cleaning

The first step with our data is Exploratory Data Analysis (EDA). Use these questions to guide your analysis:

* Which features/explanatory variables are present? Are they numeric or categorical? Should they all be interpreted the same way? What do you want to use as the outcome variable?
* Are there missing values?
* Is there an index or ID you should remove?
* Create bar plots for the categorical features and check if the categories are balanced.
* Create box plots and summary statistics for the numeric variables. Check their distributions and ranges. Are there outliers present?
* Remove data you think is unreliable or wrong.
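
The checks in the list above could be sketched as follows. This is only a minimal illustration on a small made-up table, not the real `birth.csv`: the values, the implausible `999` entry, and the cut-off of `500` are all assumptions chosen for the example, and you will need to decide for yourself what counts as unreliable in the actual data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the birthweight table (all values are made up).
toy = pd.DataFrame({
    "case": [1, 2, 3, 4, 5, 6],
    "bwt": [120, 113, 128, 108, 136, 999],   # 999 plays the role of an implausible value
    "gestation": [284, 282, 279, np.nan, 286, 280],
    "smoke": [0, 1, 0, 1, 0, 0],
})

# Column types as pandas inferred them.
print(toy.dtypes)

# Number of missing values per column.
missing = toy.isna().sum()
print(missing)

# Summary statistics (count, mean, min, max, quartiles) for numeric columns.
summary = toy[["bwt", "gestation"]].describe()
print(summary)

# Drop the ID column, rows with missing values, and rows above an
# illustrative plausibility threshold (500 is an arbitrary choice here).
cleaned = toy.drop(columns="case").dropna()
cleaned = cleaned[cleaned["bwt"] < 500]
print(cleaned)
```

The same `cleaned` table can then be passed to `sns.countplot` or `sns.boxplot` for the bar plots and box plots.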



<details>
<summary>Hint for finding out if a feature is numeric or categorical</summary>

To find out whether a feature is actually categorical, look at how many distinct values it takes. For example, a feature that contains only 1s and 0s is probably a binary categorical. If you don't remember how to do that, have a look at [Counting instances](https://colab.research.google.com/drive/1hfp2LU-TngXBsZYPUpzH0xei-XQ16o9J#scrollTo=iYhdmMRzCt_V).

</details>
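
As a small sketch of the hint above, the snippet below counts distinct values per column on a made-up table. The numeric coding of `smoke` (0/1) and `ses` (1/2/3) is an assumption for the example; check how these columns are actually coded in the real data.

```python
import pandas as pd

# Synthetic example; the coding of smoke and ses is assumed, not taken from birth.csv.
toy = pd.DataFrame({
    "age": [27, 33, 28, 23, 25, 31],
    "smoke": [0, 1, 0, 1, 0, 0],
    "ses": [1, 3, 2, 2, 1, 3],
})

# Number of distinct values per column; very few distinct values hints at a category.
n_unique = toy.nunique()
print(n_unique)

# Frequency of each level, useful for judging whether categories are balanced.
smoke_counts = toy["smoke"].value_counts()
print(smoke_counts)
```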

## Correlations

Take a look at the correlation between the numeric features and the outcome variable. Which numeric features exhibit the highest correlation with the outcome variable?
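
One possible way to rank features by their correlation with the outcome is sketched below. It uses a made-up table in which `gestation` is perfectly linear in `bwt` by construction, so the numbers say nothing about the real data; `bwt` as the outcome is an assumption based on the project description.

```python
import pandas as pd

# Synthetic values, constructed so that gestation correlates perfectly with bwt.
toy = pd.DataFrame({
    "bwt": [100, 110, 120, 130, 140],
    "gestation": [270, 275, 280, 285, 290],
    "age": [30, 22, 35, 24, 29],
})

# Pearson correlation of every numeric feature with the outcome.
corr_with_bwt = toy.corr()["bwt"].drop("bwt")

# Rank by absolute value, since strong negative correlations matter too.
ranked = corr_with_bwt.abs().sort_values(ascending=False)
print(ranked)
```

A heatmap of the full matrix, for example `sns.heatmap(toy.corr(), annot=True)`, gives the same information visually.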

## Preparing the data for modelling

* What do you need to do to prepare the data for modelling?
* Do you need to dummy code categorical features?
* Should you scale numeric features?
* Hint: Remember to assemble the X array of features to put into the model.
* Hint: Remember to split the data into training and test sets.

<details>
<summary>Hint for making the X array</summary>

The feature array X you will use for modelling should contain both the numeric and the categorical features in the form you need them (as you prepared them above).

</details>
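
The preparation steps above might look roughly like the sketch below. It runs on a made-up table, and several choices are assumptions rather than requirements: `ses` is taken to be coded 1/2/3, `smoke` 0/1, the test fraction is 25%, and `random_state=0` is arbitrary. The scaler is fitted on the training rows only so that no information from the test set leaks into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned data (values and codings are made up).
toy = pd.DataFrame({
    "bwt": [120, 113, 128, 108, 136, 138, 132, 120],
    "gestation": [284, 282, 279, 282, 286, 244, 245, 289],
    "age": [27, 33, 28, 23, 25, 33, 23, 25],
    "smoke": [0, 0, 1, 1, 0, 0, 0, 0],
    "ses": [1, 3, 2, 2, 1, 3, 2, 1],
})

numeric_cols = ["gestation", "age"]
y = toy["bwt"]

# Dummy code the three-level categorical; drop_first removes the redundant column.
dummies = pd.get_dummies(toy["ses"], prefix="ses", drop_first=True, dtype=int)

# Assemble X from numeric features, the binary feature, and the dummies.
X = pd.concat([toy[numeric_cols].astype(float), toy[["smoke"]], dummies], axis=1)

# Hold out a test set before any scaling is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
X_train = X_train.copy()
X_test = X_test.copy()

# Fit the scaler on the training rows, then apply it to both sets.
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
print(X_train)
```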

## PCA

Now that you have extracted and scaled the numeric features, you can run a PCA on them to investigate the structure of the data.

Including categorical features might make sense if they are ordered (category 1 is 'more' of the feature than category 0) **and** the distance between levels is constant (the change from level 0 to 1 is the same as the change from level 1 to 2, so going from level 0 to level 2 has approximately double the effect of going from 0 to 1). If either of these conditions is not fulfilled, it does not make sense to add the feature to the PCA, because there is no sensible mapping into a numerical space.

For this exercise, we'll stick to making a PCA of the numeric features.
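
A minimal PCA sketch with scikit-learn is shown below. The input is random data standing in for four scaled numeric features, so the explained variance it prints is meaningless for the real dataset; keeping two components is an assumption made only so the result is easy to plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for four numeric features (for instance gestation, age, height, weight).
rng = np.random.default_rng(0)
numeric = rng.normal(size=(50, 4))

# PCA is sensitive to scale, so standardise first.
scaled = StandardScaler().fit_transform(numeric)

# Project onto the first two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Fraction of the total variance captured by each component.
explained = pca.explained_variance_ratio_
print(explained)
```

The two columns of `scores` can be drawn with `plt.scatter(scores[:, 0], scores[:, 1])`, optionally coloured by a categorical feature or by the outcome.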


## Model

In this part you will define and train a linear regression model.
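
Defining and fitting a linear regression with scikit-learn could look like the sketch below. The training data is synthetic and noise-free, generated from known coefficients (an intercept of 120 and slopes of 5 and -3, all made up), so that you can see the model recover them; with the real data you would pass your own training features and outcome.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free training data generated from known coefficients.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2))
y_train = 120 + 5 * X_train[:, 0] - 3 * X_train[:, 1]

# Define the model and fit it to the training data.
model = LinearRegression()
model.fit(X_train, y_train)

print(model.intercept_, model.coef_)
```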

## Model evaluation

* How can we evaluate the performance of regression models?
* What do you think about the performance you observe?
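
Common metrics for regression models include the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²). The sketch below computes them on a handful of made-up true and predicted values; in the project you would compare the test-set outcome with the model's predictions on the test features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted birthweights in ounces.
y_test = np.array([120.0, 110.0, 130.0, 100.0])
y_pred = np.array([118.0, 112.0, 127.0, 103.0])

# RMSE and MAE are in the units of the outcome (ounces).
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)

# R² is the share of outcome variance explained by the predictions.
r2 = r2_score(y_test, y_pred)

print(rmse, mae, r2)
```

Comparing the same metrics on the training and test sets is one way to spot overfitting.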

## Model interpretation

In this section, have a look at the model parameters you have estimated. What do the estimates mean?
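
To read the estimates, it helps to pair each coefficient with its feature name. The sketch below does this on synthetic, noise-free data with made-up effects (6 for a scaled gestation feature, -8 for a 0/1 smoking indicator); the column names and effect sizes are assumptions for illustration and are not results from the birthweight data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features: one standardised numeric feature and one 0/1 indicator.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "gestation_scaled": rng.normal(size=60),
    "smoke": rng.integers(0, 2, size=60),
})

# Outcome generated from known, made-up effects without noise.
y = 120 + 6 * X["gestation_scaled"] - 8 * X["smoke"]

model = LinearRegression().fit(X, y)

# Label each coefficient with the feature it belongs to.
coefs = pd.Series(model.coef_, index=X.columns)
print(model.intercept_)
print(coefs)
```

In this toy setup, the coefficient of a standardised feature is the change in the predicted outcome per one standard deviation of that feature, and the coefficient of a 0/1 feature is the difference in predicted outcome between the two groups, in both cases holding the other features fixed.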
