**Import Libraries**

In [27]:
from collections import Counter
import plotly.express as px
import math
import time
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from warnings import simplefilter
simplefilter('ignore', category=FutureWarning)

Please note, we will be collaborating via Git. Please find our project at https://github.com/ppich1169/fertilityProjectCs109

# Milestone 1: Proposal

Over the past fifty years, fertility rates in the US have plummeted and are currently at a historic low. Conversations about why fertility has fallen so substantially and how we can address the implications of this shift for government programs like social security have been quite salient in recent public discourse and in the 2024 election cycle. 

Interestingly, there is significant variation in fertility rates across US states. We’d like to understand the relative importance of various factors in determining a state’s fertility rate. 

We plan to run a multiple regression of fertility rate (available state-by-state from the CDC’s National Center for Health Statistics) on a number of regressors.

**Goal:** Build a regression that predicts a state's fertility rate. Then analyze the coefficients and/or use causal inference to understand why the fertility rate is falling.

# Milestone 2: Preprocessing

## 1. Access the data that you will be using for the final project by downloading, collecting, or scraping* from the relevant source(s)

### Response Variable

Our response variable is **fertility rate by state over time** which can be found at https://www.cdc.gov/nchs/pressroom/sosmap/fertility_rate/fertility_rates.htm. 

We accessed it via download and saved it as `fertility_rate_census.csv`. 

Note that fertility rate is defined as the **total number of births per 1,000 women aged 15-44**, and our dataset covers the fertility rate for each of the 50 states over 9 years (2014-2022).
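
As a quick sanity check on this definition, the rate can be reproduced from raw counts. The sketch below is illustrative only: the function name and the example counts are hypothetical, not values taken from the CDC file.

```python
def fertility_rate_per_1000(births: int, women_15_44: int) -> float:
    """Births per 1,000 women aged 15-44."""
    if women_15_44 <= 0:
        raise ValueError("women_15_44 must be positive")
    return 1000 * births / women_15_44


# Hypothetical state: 60,000 births among 1,000,000 women aged 15-44
example_rate = fertility_rate_per_1000(60_000, 1_000_000)
print(example_rate)  # 60.0
```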

### Predictors
_We used our previous knowledge and assumptions to create an **X** dataset of predictors that we believe may influence fertility rates._

Because fertility rate is evaluated statewide, and states vary significantly in population size, we have decided to **normalize** all of the predictors by looking at the percentage of each state's population that falls into a specific category.
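
A minimal sketch of this normalization, assuming the raw data arrives as counts per state. The column names and numbers below are placeholders, not the real census schema.

```python
import pandas as pd

# Hypothetical counts; column names are placeholders, not the real census schema
counts = pd.DataFrame({
    "state": ["A", "B"],
    "population": [1_000_000, 250_000],
    "below_poverty": [150_000, 25_000],
})

# Express the category as a share of each state's population (in percent)
counts["pct_below_poverty"] = 100 * counts["below_poverty"] / counts["population"]
print(counts[["state", "pct_below_poverty"]])
```

Working in percentages keeps a large state and a small state on the same footing, even though their raw counts differ by an order of magnitude.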

**ISABELLA SECTION**

We care about basic demographic data, which we get from the **census** at [**THIS LINK INSERT HERE**]

Here is the census data we consider important (based on our own subjective opinions):
- race
- socioeconomic status (household income): percentage of households below the poverty line
- education level: share that are high school graduates
- immigration status

We also need **age**, **sex**, **year**, and **state** in order to aggregate our data.

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

We also care about religion, political makeup, and whether certain abortion laws are in place (none of which are in the census). We will source each of these separately.

**PETER SECTION**

We decided to determine people's **religiousness** based off of [**INSERT HERE**] which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**ELIZA SECTION**

We decided to determine people's **political orientation** based off of [**INSERT HERE**] which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**MAIA SECTION**

We decided to determine each state's **abortion laws** based on how late into pregnancy abortion is legally allowed, which can be found at https://lawatlas.org/datasets/abortion-bans. We chose this dataset because it is the only one we found that shows abortion bans in the 2014-2022 time frame and how they change (most only show current bans).

We accessed it by downloading it, converting it from xlsx to csv, and saving it as `abortion_data.csv` in our data folder.

The main thing to consider is that **just because abortion is legal doesn't mean it is accessible**. Many states may technically allow abortion but have only one clinic, so it's not attainable. That said, we have chosen this metric (when abortion is legal) to coincide with the current political debate about whether abortion should be legalized.

Significant preprocessing will be required. Each law is stored with an `Effective Date` and a `Valid Through Date`, which must be converted into a year (whichever law was in force for the majority of that year), and each ban (`6 weeks`, `8 weeks`, etc.) is categorical. It would make more sense to create a single variable listing the latest week at which abortion is legal (0, 6, 8, 12, 52, etc.).
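
One way the "majority of the year" rule could be sketched is to measure how many days of each calendar year every law covers and keep the law with the largest overlap. This is an illustrative sketch, not our final preprocessing: the toy dates, the week limits, and the helper name `law_by_year` are all hypothetical.

```python
import pandas as pd

# Hypothetical law periods for one state; dates and week limits are made up
laws = pd.DataFrame({
    "state": ["X", "X"],
    "start": pd.to_datetime(["2018-01-01", "2019-09-01"]),
    "end": pd.to_datetime(["2019-08-31", "2022-12-31"]),
    "latest_abortion": [20, 6],
})


def law_by_year(laws: pd.DataFrame, years: range) -> pd.DataFrame:
    """For each state and year, keep the law covering the most days of that year."""
    rows = []
    for year in years:
        year_start = pd.Timestamp(year=year, month=1, day=1)
        year_end = pd.Timestamp(year=year, month=12, day=31)
        # Days of overlap between each law's validity window and the calendar year
        overlap_start = laws["start"].clip(lower=year_start)
        overlap_end = laws["end"].clip(upper=year_end)
        days = (overlap_end - overlap_start).dt.days + 1
        rows.append(laws.assign(year=year, days=days)[days > 0])
    combined = pd.concat(rows, ignore_index=True)
    best = combined.loc[combined.groupby(["state", "year"])["days"].idxmax()]
    return best[["state", "year", "latest_abortion"]].reset_index(drop=True)


yearly = law_by_year(laws, range(2018, 2023))
print(yearly)
```

In this toy example the 20-week law covers January through August 2019, so 2019 is still labeled 20 and the 6-week limit applies from 2020 onward.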

## 2. Load the data into a Jupyter notebook and understand the data by examining, among other characteristics of interest, data missingness, imbalance, and scaling issues.

Some issues we have considered before even inspecting the data include:


- **Underreporting of immigration status**: a quick Google search suggests that people tend to underreport whether they are immigrants. This is potentially a missingness issue.

- **Multicollinearity**: There is most likely a relationship between racial background / household income and an individual's birth country (immigration status), so we may not be able to use both as predictors in the same equation. We don't know how strong this correlation will be, so we are not worried yet, but we would love to talk about it with a TF.

- **Is this just too much data?**: Looking at each of these attributes for each state for each year may simply be too many dimensions. Is there really a big difference across years? Should we just look at one? If so, which?

Now we are going to import and inspect each dataset, looking for missingness, imbalance, and scaling issues!
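
Once the predictors are assembled, the multicollinearity concern above could be checked with a correlation matrix and variance inflation factors (VIF). The sketch below runs on synthetic data. The column names are placeholders, and the helper `vif` is our own illustration built on NumPy least squares rather than a library routine.

```python
import numpy as np
import pandas as pd


def vif(predictors: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    values = {}
    for col in predictors.columns:
        y = predictors[col].to_numpy(dtype=float)
        others = predictors.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        r2 = 1 - residuals.var() / y.var()
        values[col] = np.inf if np.isclose(r2, 1.0) else 1 / (1 - r2)
    return pd.Series(values, name="VIF")


# Synthetic data: pct_immigrant is built to track pct_low_income closely
rng = np.random.default_rng(0)
pct_low_income = rng.uniform(5, 25, size=50)
demo = pd.DataFrame({
    "pct_low_income": pct_low_income,
    "pct_immigrant": 0.8 * pct_low_income + rng.normal(0, 1, size=50),
    "pct_hs_grad": rng.uniform(80, 95, size=50),
})

vif_values = vif(demo)
print(demo.corr().round(2))
print(vif_values.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely explained by the others. Where to draw that line for our data is a judgment call we would want to discuss.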

### Fertility Rate Data

In [24]:
fertility_rate = pd.read_csv('data/fertility_rate_census.csv')
print(fertility_rate.isna().sum(axis=0))
fertility_rate.describe()

YEAR              0
STATE             0
FERTILITY RATE    0
BIRTHS            0
URL               0
dtype: int64


| | YEAR | FERTILITY RATE | BIRTHS |
|---|---|---|---|
| count | 451.0 | 451.0 | 451.0 |
| mean | 2018.008869 | 60.057871 | 75783.962306 |
| std | 2.58885 | 6.420758 | 86837.13469 |
| min | 2014.0 | 44.3 | 5133.0 |
| 25% | 2016.0 | 56.0 | 21758.5 |
| 50% | 2018.0 | 60.2 | 55971.0 |
| 75% | 2020.0 | 63.5 | 86486.0 |
| max | 2022.0 | 80.0 | 502879.0 |


In [51]:
pivoted_fertility = fertility_rate.pivot(index='STATE', columns='YEAR', values='FERTILITY RATE')
pivoted_fertility[pivoted_fertility.isna().any(axis=1)]

| STATE | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|---|---|---|---|---|
| District of Columbia | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 57.3 |


Things to Consider:

**Missingness**: While there are no empty cells in the original dataset, pivoting by year shows that Washington DC appears only in 2022. We can simply delete this row, as Washington DC isn't technically a state. We treat this as Missing at Random because we know why DC is absent from 2014-2021: it isn't considered a state.

**Imbalance**: Since we aren't looking at different classes, there is no imbalance.

**Scaling**: By considering Fertility Rate (and not Number of Births), we are essentially normalizing our data, since births are divided by the number of women aged 15-44 in each state. This accounts for each state's population, so no state is overly weighted. In addition, since fertility rate is a response variable, not a predictor, we don't have to consider how large it is in relation to other variables. Thus, we don't have to scale it any further.
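
The missingness fix described above could be sketched as follows. Rather than hard-coding DC, this version keeps only the states observed in every year. The toy table reuses the column names from the CDC file, but its rows are made up for illustration (apart from the DC value shown above).

```python
import pandas as pd

# Toy stand-in for fertility_rate; rates are made up except the DC value
toy_fertility = pd.DataFrame({
    "YEAR": [2021, 2022, 2021, 2022, 2022],
    "STATE": ["Alabama", "Alabama", "Alaska", "Alaska", "District of Columbia"],
    "FERTILITY RATE": [58.0, 58.5, 64.0, 64.5, 57.3],
})

# Keep only states observed in every year present in the data
n_years = toy_fertility["YEAR"].nunique()
years_per_state = toy_fertility.groupby("STATE")["YEAR"].nunique()
complete_states = years_per_state[years_per_state == n_years].index
toy_clean = toy_fertility[toy_fertility["STATE"].isin(complete_states)].reset_index(drop=True)

print(toy_clean)
```

On the real data, the analogous check would be that the cleaned frame has 50 states × 9 years = 450 rows.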

### Census Data

In [None]:
census_data = pd.read_csv('data/')
print(census_data.isna().sum(axis=0))
census_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Religion Data

In [None]:
religion_data = pd.read_csv('data/')
print(religion_data.isna().sum(axis=0))
religion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Politics Data

In [None]:
politics_data = pd.read_csv('data/')
print(politics_data.isna().sum(axis=0))
politics_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Abortion Data

We are going to start by generally preprocessing the data (as outlined in the previous section) as otherwise it is not interpretable (everything is considered an object)

In [3]:
abortion_data = pd.read_csv('data/abortion_data.csv')

In [4]:
columns = ["State","Effective Date","Valid Through Date","Bans_gest_4 weeks postfertilization (6 weeks LMP) " ,"Bans_gest_6 weeks postfertilization (8 weeks LMP) " ,"Bans_gest_8 weeks postfertilization (10 weeks LMP)","Bans_gest_10 weeks postfertilization (12 weeks LMP) " ,"Bans_gest_12 weeks postfertilization (14 weeks LMP)","Bans_gest_13 weeks postfertilization (15 weeks LMP)","Bans_gest_16 weeks postfertilization (18 weeks LMP) ","Bans_gest_18 weeks postfertilization (20 weeks LMP)","Bans_gest_19 weeks postfertilization (21 weeks LMP)","Bans_gest20 weeks postfertilization (22 weeks LMP)","Bans_gest_21 weeks postfertilization (23 weeks LMP) " ,"Bans_gest_22 weeks postfertilization (24 weeks LMP)","Bans_gest_24 weeks postfertilization (26 weeks LMP)","Bans_gestViability","Bans_gest_Fetus is capable of feeling pain","Bans_gest_3rd trimester"]
abortion_data = abortion_data[columns]
new_names = {
    "State": "state",
    "Effective Date": "start",
    "Valid Through Date": "end",
    "Bans_gest_4 weeks postfertilization (6 weeks LMP) ": "4",
    "Bans_gest_6 weeks postfertilization (8 weeks LMP) ": "6",
    "Bans_gest_8 weeks postfertilization (10 weeks LMP)": "8",
    "Bans_gest_10 weeks postfertilization (12 weeks LMP) ": "10",
    "Bans_gest_12 weeks postfertilization (14 weeks LMP)": "12",
    "Bans_gest_13 weeks postfertilization (15 weeks LMP)": "13",
    "Bans_gest_16 weeks postfertilization (18 weeks LMP) ": "16",
    "Bans_gest_18 weeks postfertilization (20 weeks LMP)": "18",
    "Bans_gest_19 weeks postfertilization (21 weeks LMP)": "19",
    "Bans_gest20 weeks postfertilization (22 weeks LMP)": "20",
    "Bans_gest_21 weeks postfertilization (23 weeks LMP) ": "21",
    "Bans_gest_22 weeks postfertilization (24 weeks LMP)": "22",
    "Bans_gest_24 weeks postfertilization (26 weeks LMP)": "24",
    "Bans_gestViability": "24", #chose via google
    "Bans_gest_Fetus is capable of feeling pain": "25", #chose via google
    "Bans_gest_3rd trimester": "28" #chose via google
}
abortion_data = abortion_data.rename(columns=new_names)

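# Keep the identifying columns, then collapse the one-hot ban columns into one week limit
abortion_processed = abortion_data[['state', 'start', 'end']].copy()
# For each row, take the first column flagged "1" as the week limit; 40 means no ban found.
# Caveat: the rename maps two columns to "24", so row["24"] holds two values and never equals "1".
abortion_processed['latest_abortion'] = abortion_data[abortion_data.columns].apply(
    lambda row: next((int(col) for col in abortion_data.columns  if str(row[col]) == "1"), 40), axis=1)

# Parse the validity window and label each law with the year it took effect
abortion_processed['start'] = pd.to_datetime(abortion_processed['start'])
abortion_processed['end'] = pd.to_datetime(abortion_processed['end'])
abortion_processed['year'] = abortion_processed['start'].dt.year
# Length of each law's validity window in days
abortion_processed['length'] = (abortion_processed['end'] - abortion_processed['start']).dt.days
# Within each state and start year, keep the law with the longest validity window
abortion_processed = abortion_processed.loc[abortion_processed.groupby(['state', 'year'])['length'].idxmax()]
abortion_processed = abortion_processed.drop(columns=['start', 'end', 'length'])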

abortion_processed.head()

| | state | latest_abortion | year |
|---|---|---|---|
| 0 | Alabama | 20 | 2018 |
| 2 | Alabama | 20 | 2019 |
| 3 | Alabama | 20 | 2022 |
| 4 | Alaska | 40 | 2018 |
| 5 | Arizona | 18 | 2018 |


In [52]:
print(abortion_processed.isna().sum(axis=0))
print("missing from entire dataframe", 50*5-len(abortion_processed))
abortion_processed.describe()

state              0
latest_abortion    0
year               0
dtype: int64
missing from entire dataframe 113


| | latest_abortion | year |
|---|---|---|
| count | 137.0 | 137.0 |
| mean | 25.59854 | 2019.729927 |
| std | 10.328758 | 1.647206 |
| min | 4.0 | 2018.0 |
| 25% | 20.0 | 2018.0 |
| 50% | 20.0 | 2019.0 |
| 75% | 40.0 | 2021.0 |
| max | 40.0 | 2022.0 |


Things to Consider:

**Missingness**: While there are no null cells in our dataset, there are 113 "missing" rows relative to what we would expect (one row per state per year for 2018-2022, i.e. 50 × 5 = 250 rows). This is simply because of how the dataset was built: a new row appears only when a new law is enacted, not once per year. Thus, if we are missing a row for a certain year, we can simply **impute** the data by copying the previous row.

**Imbalance**: There is no imbalance as there is no population sampling. 

**Scaling**: We don't need to scale the data by itself. 

**Other**: We decided to change our data from a one-hot encoded dataset to a single quantitative variable. We did this because there is a clear ordering among the previously encoded categories (a 10 week ban is stronger than a 15 week ban and less strong than a 4 week ban). Thus, we made it ordinal. In addition, note that if there was no abortion ban, we arbitrarily set this ordinal variable to 40, the length of the average pregnancy in weeks. It might make sense to add an **indicator variable** that is 0 if there is no abortion ban at all (40 weeks) and 1 if there is an abortion ban.
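
The imputation and the indicator variable discussed above could be sketched together as follows. This is a minimal illustration on a toy table shaped like `abortion_processed`. The values are made up, and the names `sparse`, `filled`, `has_ban`, and `NO_BAN_WEEKS` are our own placeholders.

```python
import pandas as pd

NO_BAN_WEEKS = 40  # sentinel used above for "no gestational ban"

# Toy version of abortion_processed: one row per law change, values made up
sparse = pd.DataFrame({
    "state": ["X", "X", "Y"],
    "year": [2018, 2021, 2018],
    "latest_abortion": [20, 6, 40],
})

# Full state-year grid for 2018-2022
full_index = pd.MultiIndex.from_product(
    [sparse["state"].unique(), range(2018, 2023)], names=["state", "year"]
)

# Reindex onto the full grid, then carry the last known law forward within each state
filled = (
    sparse.set_index(["state", "year"])
    .reindex(full_index)
    .groupby(level="state")["latest_abortion"]
    .ffill()
    .reset_index()
)

# Indicator: 1 if any gestational ban is in place, 0 if there is none
filled["has_ban"] = (filled["latest_abortion"] < NO_BAN_WEEKS).astype(int)
print(filled)
```

Forward filling only copies earlier rows, so a state whose first recorded law starts after 2018 would still have missing values for its earliest years and would need separate handling.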

## 3. Understand and describe the preprocessing required such that data is in a form amenable to later downstream tasks such as visualizing and modeling, as is appropriate to the specific project goals.



One of our biggest considerations is that our predictor data, X, is essentially 3D rather than 2D: we are looking at a set of predictors **across states** and **over time**. This may complicate our predictions, especially because we are trying to see **why** fertility is changing over time (the reason there are fewer babies in 2022 can't simply be that it is 2022). Thus, it might make more sense to choose only **one year** to regress on. It may also be hard to **visualize** multiple years and states simultaneously. For example, if we drew a map and colored each state by predictor category, we would have to choose one specific time to do so (and vice versa). PLEASE LET US KNOW WHAT YOU THINK!

It may also make sense to minimize the number of predictors in general, as we don't want to get too specific!
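
One way to keep the design matrix two-dimensional is to store every dataset in long format, keyed by state and year, and merge on those keys. A single year can then be sliced out if we decide on a cross-sectional regression. The tables below are toy data with made-up values, and the names `fertility_long`, `abortion_long`, and `panel` are hypothetical.

```python
import pandas as pd

# Toy long-format tables keyed by (state, year); all values are made up
fertility_long = pd.DataFrame({
    "state": ["X", "X", "Y", "Y"],
    "year": [2021, 2022, 2021, 2022],
    "fertility_rate": [60.0, 59.0, 55.0, 54.5],
})
abortion_long = pd.DataFrame({
    "state": ["X", "X", "Y", "Y"],
    "year": [2021, 2022, 2021, 2022],
    "latest_abortion": [20, 6, 40, 40],
})

# One row per state-year; validate guards against duplicated keys
panel = fertility_long.merge(
    abortion_long, on=["state", "year"], how="inner", validate="one_to_one"
)

# Option: restrict to one year for a purely cross-sectional regression
single_year = panel[panel["year"] == 2022]
print(panel)
print(single_year)
```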

### This is how we envision generally preprocessing...

******DO THIS CELL TUESDAY, 29th!

### One way to get a preliminary sense of why fertility rate is decreasing is to look at where it varies the most and think about which factors dominate those areas, i.e. run PCA

In [36]:
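pivoted_fertility = fertility_rate.pivot(index='STATE', columns='YEAR', values='FERTILITY RATE')
# Drop rows with missing years (only District of Columbia) instead of filling with 0,
# which would act as an extreme outlier after standardization
pivoted_fertility = pivoted_fertility.dropna()
X_std = StandardScaler().fit_transform(pivoted_fertility)
pca = PCA(n_components=1)
principal_components = pca.fit_transform(X_std)

# Collect each state's PC1 score in a DataFrame for plotting
states = pivoted_fertility.index
pca = pd.DataFrame(data=principal_components, columns=['PC1'], index=states).reset_index()
pca.columns = ['state', 'PC1']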


In [44]:
fig = px.choropleth(
    pca,
    locations='state',
    locationmode="USA-states",
    color='PC1',
    scope="usa",  
    range_color=[-5,3],
    labels={'PC1': 'Principal Component Value'}
)

fig.update_layout(title_text="What states have the highest variation in fertility rates?")
fig.show()


We see that most of the variation is in the north and on the coasts, mainly in liberal, wealthy states. Political orientation and wealth are predictors we really care about!
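
Before reading too much into PC1, it may help to check how much of the total variance it captures. The sketch below computes the explained variance ratio with NumPy on a synthetic 50 × 9 matrix (not our data). A fitted scikit-learn `PCA` object exposes the same quantity as `explained_variance_ratio_`. The helper name and the synthetic numbers are assumptions for illustration.

```python
import numpy as np


def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Share of total variance captured by each principal component."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # same scaling as StandardScaler
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic 50 x 9 matrix: a state-level baseline plus small yearly noise
rng = np.random.default_rng(0)
baseline = rng.normal(60, 6, size=(50, 1))
rates = baseline + rng.normal(0, 1, size=(50, 9))

ratios = explained_variance_ratio(rates)
print(ratios.round(3))
```

When the yearly columns are highly correlated, as in this synthetic example, PC1 mostly reflects a state's overall fertility level across years, which is worth keeping in mind when interpreting the map.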