**Import Libraries**

In [2]:
from collections import Counter
import math
import time

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import shuffle

from warnings import simplefilter
simplefilter('ignore', category=FutureWarning)

Note: we are collaborating via Git. Our project repository is at https://github.com/ppich1169/fertilityProjectCs109

# Milestone 1: Proposal

Over the past fifty years, fertility rates in the US have plummeted and are currently at a historic low. Conversations about why fertility has fallen so substantially, and how we can address the implications of this shift for government programs like Social Security, have been salient in recent public discourse and in the 2024 election cycle.

Interestingly, there is significant variation in fertility rates across US states. We’d like to understand the relative importance of various factors in determining a state’s fertility rate. 

We plan to run a multiple regression of fertility rate (available state by state from the CDC’s National Center for Health Statistics) on a number of regressors.

**Goal:** Create a regression that can predict the fertility rate of a state. Then, analyze the coefficients and/or use causal inference to understand why the fertility rate is going down.

# Milestone 2: Preprocessing

## 1. Access the data that you will be using for the final project by downloading, collecting, or scraping* from the relevant source(s)

### Response Variable

Our response variable is **fertility rate by state over time** which can be found at https://www.cdc.gov/nchs/pressroom/sosmap/fertility_rate/fertility_rates.htm. 

We accessed it via download and saved it as `fertility_rate_census.csv`. 

Note that fertility rate is defined as the **total number of births per 1,000 women aged 15-44**, and our dataset covers the fertility rate for each of the 50 states over 9 years (2014-2022).
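To make the definition concrete, here is a minimal sketch of the calculation. The function name and the input numbers are hypothetical and chosen only for illustration; they are not taken from the CDC file.

```python
def fertility_rate_per_1000(births: float, women_15_44: float) -> float:
    """Return births per 1,000 women aged 15-44."""
    if women_15_44 <= 0:
        raise ValueError("women_15_44 must be positive")
    return 1000.0 * births / women_15_44


# Hypothetical state: 60,000 births and 1,000,000 women aged 15-44.
example_rate = fertility_rate_per_1000(births=60_000, women_15_44=1_000_000)
print(example_rate)
```

Because the rate is already scaled by the number of women aged 15-44, it is comparable across states of very different sizes, unlike the raw `BIRTHS` count.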

### Predictors
_We used our previous knowledge and assumptions to create an **X** dataset of predictors that we believe may influence fertility rates_

Because fertility rate is evaluated statewide, and states vary significantly in population size, we have decided to **normalize** all of the predictors by looking at the percentage of each state's population that falls into a specific category.
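The normalization described above could be sketched as follows. This is only an illustration under assumptions: the helper `to_percent_share`, the column names, and the counts are all hypothetical, since the census file has not been chosen yet.

```python
import pandas as pd


def to_percent_share(df, count_cols, total_col):
    """Add a pct_<col> column for each count column, as a share of total_col."""
    out = df.copy()
    for col in count_cols:
        out[f"pct_{col}"] = 100 * out[col] / out[total_col]
    return out


# Hypothetical counts for two made-up states (not real census values).
counts = pd.DataFrame({
    "STATE": ["A", "B"],
    "total_households": [1000, 200],
    "households_below_poverty": [150, 40],
})

shares = to_percent_share(counts, ["households_below_poverty"], "total_households")
print(shares[["STATE", "pct_households_below_poverty"]])
```

In this toy example state A has more households below the poverty line in absolute terms, but state B has the higher share, which is the quantity we would want to feed into the regression.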

**ISABELLA SECTION**

We care about basic demographic data, which we get from the **census** at [**THIS LINK INSERT HERE**].
Here is the census data we consider important (based on our own subjective opinions):
- race
- socioeconomic status (household income) - percentage of households below the poverty line
- education level - share that are high school graduates 
- immigration status

We also need **age**, **sex**, **year**, and **state** in order to aggregate our data.

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

We also care about religion, political makeup, and whether certain abortion laws are in place (none of which are in the census). We will source these separately.

**PETER SECTION**

We decided to measure people's **religiousness** based on [**INSERT HERE**], which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**ELIZA SECTION**

We decided to measure people's **political orientation** based on [**INSERT HERE**], which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**MAIA SECTION**

We decided to determine each state's **abortion laws** based on [**INSERT HERE**], which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

## 2. Load the data into a Jupyter notebook and understand the data by examining, among other characteristics of interest, data missingness, imbalance, and scaling issues.

Some issues we have considered preliminarily, before even inspecting the data, include:


- **underreporting of immigration status**: based on a quick Google search, people tend to underreport whether they are immigrants. This is potentially a missingness issue.

- **multicollinearity**: There is most likely a relationship between racial background / household income and an individual's birth country (immigration status), so we may not be able to use both as predictors in the same equation. We don't yet know how strong this correlation will be, so we are not worried yet, but we would love to talk about it with a TF.

- **is this just too much data?**: looking at each of these attributes for each state for each year may just be too many dimensions. Is there really a big difference across years? Should we just look at one year? If so, which?
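For the multicollinearity concern, one possible first check is a correlation matrix plus variance inflation factors (VIFs). The sketch below uses purely synthetic data, and the column names `pct_foreign_born`, `pct_race_category`, and `pct_below_poverty` are hypothetical stand-ins for predictors we have not loaded yet. It relies on the identity that, for standardized predictors, the VIFs are the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data only: 50 fake "states", with one predictor built to be
# strongly correlated with another. None of these values are real.
rng = np.random.default_rng(0)
n_states = 50

pct_foreign_born = rng.normal(loc=12.0, scale=5.0, size=n_states)
pct_race_category = 0.9 * pct_foreign_born + rng.normal(scale=1.5, size=n_states)
pct_below_poverty = rng.normal(loc=13.0, scale=3.0, size=n_states)

X = pd.DataFrame({
    "pct_foreign_born": pct_foreign_born,
    "pct_race_category": pct_race_category,
    "pct_below_poverty": pct_below_poverty,
})

# Pairwise Pearson correlations between predictors.
corr = X.corr()

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the
# inverse of the predictor correlation matrix.
vif = pd.Series(
    np.diag(np.linalg.inv(corr.to_numpy())),
    index=corr.columns,
    name="VIF",
)

print(corr.round(2))
print(vif.round(2))
```

A common rule of thumb treats VIFs well above 5 to 10 as a warning sign. If the real predictors look like that, dropping one of the pair or using a regularized model such as Lasso (already imported above) are options to discuss with a TF.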

Now we are going to import and inspect each dataset, looking for missingness, imbalance, and scaling issues!

### Fertility Rate Data

In [5]:
fertility_rate = pd.read_csv('data/fertility_rate_census.csv')
print(fertility_rate.isna().sum(axis=0))
fertility_rate.describe()

YEAR              0
STATE             0
FERTILITY RATE    0
BIRTHS            0
URL               0
dtype: int64
              YEAR  FERTILITY RATE         BIRTHS
count   451.000000      451.000000     451.000000
mean   2018.008869       60.057871   75783.962306
std       2.588850        6.420758   86837.134690
min    2014.000000       44.300000    5133.000000
25%    2016.000000       56.000000   21758.500000
50%    2018.000000       60.200000   55971.000000
75%    2020.000000       63.500000   86486.000000
max    2022.000000       80.000000  502879.000000
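The summary above reports 451 rows, while 50 states over 9 years would give 50 × 9 = 450, so it may be worth checking whether the panel is balanced. The sketch below is one way to do that; the helper `panel_report` is hypothetical, and it is demonstrated on a tiny made-up frame rather than on the real file. We have not yet determined where the extra row comes from.

```python
import pandas as pd


def panel_report(df, state_col="STATE", year_col="YEAR"):
    """Summarize panel balance: rows per year, duplicate and missing pairs."""
    rows_per_year = df.groupby(year_col).size()

    # Rows whose (state, year) pair appears more than once.
    duplicated = df[df.duplicated([state_col, year_col], keep=False)]

    # Every (state, year) combination we would expect in a balanced panel.
    expected = pd.MultiIndex.from_product(
        [sorted(df[state_col].unique()), sorted(df[year_col].unique())],
        names=[state_col, year_col],
    )
    observed = pd.MultiIndex.from_frame(
        df[[state_col, year_col]].drop_duplicates()
    )
    missing = expected.difference(observed)
    return rows_per_year, duplicated, missing


# Made-up toy panel: one duplicated pair and one missing pair.
toy = pd.DataFrame({
    "STATE": ["AL", "AL", "AK", "AK"],
    "YEAR": [2014, 2015, 2014, 2014],
    "FERTILITY RATE": [60.0, 59.0, 70.0, 70.0],
})

rows_per_year, duplicated, missing = panel_report(toy)
print(rows_per_year)
print(duplicated)
print(list(missing))
```

Running `panel_report(fertility_rate)` on the loaded data should show whether the extra row is a duplicate, an additional jurisdiction in the `STATE` column, or something else.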


Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Census Data

In [None]:
census_data = pd.read_csv('data/')
print(census_data.isna().sum(axis=0))
census_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Religion Data

In [None]:
religion_data = pd.read_csv('data/')
print(religion_data.isna().sum(axis=0))
religion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Politics Data

In [None]:
politics_data = pd.read_csv('data/')
print(politics_data.isna().sum(axis=0))
politics_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Abortion Data

In [None]:
abortion_data = pd.read_csv('data/')
print(abortion_data.isna().sum(axis=0))
abortion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

## 3. Understand and describe the preprocessing required such that data is in a form amenable to later downstream tasks such as visualizing and modeling, as is appropriate to the specific project goals.



One of our biggest considerations is that our predictor data, X, is essentially 3D rather than 2D: we are looking at a set of predictors **across states** and **over time**. This might complicate our predictions, especially because we are trying to see **why** fertility is changing over time (so the reason that there are fewer babies in 2022 can't simply be that it is 2022). Thus, it might make more sense to choose only **one year** to regress on. It may also be hard to **visualize** multiple years and states simultaneously: for example, if we drew a map and colored each state by predictor category, we would have to choose one specific time to do so (and vice versa). PLEASE LET US KNOW WHAT YOU THINK!
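To make these options concrete, here is a small sketch of three ways to handle the state-by-year structure: slicing a single year, pivoting to a state × year table, and subtracting each year's mean. The numbers are made up, and year-demeaning is only one possible alternative to picking a single year, not something we have settled on.

```python
import pandas as pd

# Made-up long-format panel using the same column names as the fertility file.
panel = pd.DataFrame({
    "STATE": ["AL", "AL", "AK", "AK"],
    "YEAR": [2021, 2022, 2021, 2022],
    "FERTILITY RATE": [58.0, 57.0, 64.0, 63.0],
})

# Option 1: keep a single year, giving an ordinary 2D table (one row per state).
one_year = panel[panel["YEAR"] == 2022]

# Option 2: pivot to a wide state x year table, handy for maps or heatmaps.
wide = panel.pivot(index="STATE", columns="YEAR", values="FERTILITY RATE")

# Option 3: subtract each year's mean, so that "it is 2022" alone cannot
# explain a value; what remains is variation across states within a year.
year_mean = panel.groupby("YEAR")["FERTILITY RATE"].transform("mean")
panel["rate_demeaned_by_year"] = panel["FERTILITY RATE"] - year_mean

print(one_year)
print(wide)
print(panel)
```

Option 1 is the simplest to model and visualize, while option 3 keeps all years but removes the common time trend.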

It also may make sense to just try to minimize predictors in general as we don't want to get too specific!

### This is how we envision generally preprocessing...