**Import Libraries**

In [100]:
from collections import Counter
import math
import time

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler
import re 

from warnings import simplefilter
simplefilter('ignore', category=FutureWarning)

Please note, we will be collaborating via Git. Please find our project at https://github.com/ppich1169/fertilityProjectCs109

# Milestone 1: Proposal

Over the past fifty years, fertility rates in the US have plummeted and are currently at a historic low. Conversations about why fertility has fallen so substantially, and how we can address the implications of this shift for government programs like Social Security, have been salient in recent public discourse and in the 2024 election cycle.

Interestingly, there is significant variation in fertility rates across US states. We’d like to understand the relative importance of various factors in determining a state’s fertility rate. 

We plan to run a multiple regression of fertility rate (available state by state from the CDC’s National Center for Health Statistics) on a number of regressors.

**Goal:** Create a regression that can predict the fertility rate of a state. Then, analyze coefficients and/or use causal inference to understand what factors most strongly predict a low fertility rate.

# Milestone 2: Preprocessing

## 1. Access the data that you will be using for the final project by downloading, collecting, or scraping* from the relevant source(s)

### Response Variable

Our response variable is **fertility rate by state over time** which can be found at https://www.cdc.gov/nchs/pressroom/sosmap/fertility_rate/fertility_rates.htm. 

We accessed it via download and saved it as `fertility_rate_census.csv`. 

Please note, fertility rate is defined as the **total number of births per 1,000 women aged 15-44**, and our dataset covers the fertility rate for each of the 50 states over 9 years (2014-2022).
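As a quick sanity check on the definition above, here is a minimal sketch of how the rate is computed from raw counts. The function name and the example counts are hypothetical and are not taken from the CDC file.

```python
def fertility_rate(births: int, women_15_44: int) -> float:
    """Births per 1,000 women aged 15-44."""
    if women_15_44 <= 0:
        raise ValueError("women_15_44 must be positive")
    return 1000 * births / women_15_44


# Hypothetical state-year: 60,000 births among 1,000,000 women aged 15-44
example_rate = fertility_rate(60_000, 1_000_000)
print(example_rate)  # 60.0
```

Because the rate is already scaled by the number of women aged 15-44, it is comparable across states of very different sizes.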

### Predictors
We created an **X** dataset of predictors that we believe may influence fertility rates based on the factors surfaced most frequently in economics literature about fertility rates at the nation-wide level and our assumptions about other factors that might be relevant. 

Because fertility rate is evaluated statewide, and states vary significantly in population size, we have decided to **normalize** all of the predictors by looking at the percentage of each state's population that falls into a specific category.
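A minimal sketch of this normalization, assuming we start from raw counts per state. The column names and numbers below are hypothetical placeholders, not values from our datasets.

```python
import pandas as pd

# Hypothetical raw counts per state (not real data)
counts = pd.DataFrame({
    "state": ["State A", "State B"],
    "population": [5_000_000, 500_000],
    "below_poverty_line": [750_000, 50_000],
    "foreign_born": [500_000, 25_000],
}).set_index("state")

# Divide each count column by the state's population to get a percentage
count_cols = ["below_poverty_line", "foreign_born"]
shares = counts[count_cols].div(counts["population"], axis=0) * 100
shares = shares.add_prefix("pct_")

print(shares)
```

After this step a large state and a small state with the same share in a category get the same predictor value, which is what we want when the response is itself a rate.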

**ISABELLA SECTION**

We care about basic demographic data, which we get from the **census** at [**THIS LINK INSERT HERE**]

Here is the census data we consider important (based on our own subjective opinions):
- race
- socioeconomic status (household income) - percentage of households below the poverty line
- education level - share that are high school graduates
- immigration status

We also need **age**, **sex**, **year**, and **state** in order to aggregate our data.

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

We also care about religion, political makeup, and whether certain abortion laws are in place (none of which are in the census). We will get each of these from a separate source.

**PETER SECTION**

We decided to measure people's **religiousness** based on [**INSERT HERE**], which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**ELIZA SECTION**

We decided to measure people's **political orientation** based on the **Cook Partisan Voting Index (Cook PVI)**, which is a measure of each state's political leaning relative to the nation as a whole.

**The primary challenge with using this index is that the methodology was switched in 2022 to weight the last presidential election 0.75 and the second-to-last 0.25, as opposed to the even (50/50) weighting of both elections that was used in prior years. We will either figure out how to reweight the outcomes based on the raw data, if we can get it, or exclude 2022 from our analysis.**

The calculation of the index is described in more detail here: https://www.cookpolitical.com/cook-pvi
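To illustrate what reweighting from raw vote shares might look like, here is a rough sketch. It assumes the index is the weighted average of the state's two-party Republican share minus the nation's, which is our reading of the methodology rather than Cook's published code. The function names are hypothetical. The Alabama shares come from our `cook_pvi_2022.csv` file, while the national shares are approximate values that should be checked against an official source before use.

```python
def two_party_rep_share(dem_pct: float, rep_pct: float) -> float:
    """Republican share of the two-party vote."""
    return rep_pct / (dem_pct + rep_pct)


def partisan_lean(state_recent, state_prior, nation_recent, nation_prior,
                  w_recent=0.5):
    """Weighted lean in percentage points; positive = more Republican than the nation.

    Each argument is a (dem_pct, rep_pct) pair. `w_recent` is the weight on the
    most recent election (0.5 for the old method, 0.75 for the 2022 method).
    """
    lean_recent = two_party_rep_share(*state_recent) - two_party_rep_share(*nation_recent)
    lean_prior = two_party_rep_share(*state_prior) - two_party_rep_share(*nation_prior)
    return 100 * (w_recent * lean_recent + (1 - w_recent) * lean_prior)


# Alabama shares from our data file; national shares are approximate assumptions
alabama_2020, alabama_2016 = (36.6, 62.0), (34.4, 62.1)
nation_2020, nation_2016 = (51.3, 46.8), (48.2, 46.1)

even_weighting = partisan_lean(alabama_2020, alabama_2016, nation_2020, nation_2016, w_recent=0.5)
new_weighting = partisan_lean(alabama_2020, alabama_2016, nation_2020, nation_2016, w_recent=0.75)
print(round(even_weighting, 2), round(new_weighting, 2))
```

If the two weightings give similar values for most states, excluding 2022 may not be necessary. That is something to check once we have the raw data for every year.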

We will follow a methodology similar to that used in "State-level Political Partisanship Strongly Correlates with Health Outcomes for US Children" (full citation below).* We converted the PVIs to numerical values, with negative values representing Democratic PVIs and positive values representing Republican ones (an arbitrary choice). Then, we scaled those values with sklearn's `MinMaxScaler()` so that each state has a rating between 0 and 1 representing how conservative it is, with 1 being the most conservative and 0 being the most liberal.

We accessed the data for 2022 from this source: https://datawrapper.dwcdn.net/0djXs/2/ and saved it in `cook_pvi_2022.csv` in our data folder. We tried scraping the website, but the data is not in the HTML. It is instead loaded from a separate source, so we were unable to access it using the same strategy as in HW 1.

*Paul, M., Zhang, R., Liu, B. et al. State-level political partisanship strongly correlates with health outcomes for US children. Eur J Pediatr 181, 273–280 (2022). https://doi.org/10.1007/s00431-021-04203-y

In [115]:
pol_df = pd.read_csv('data/cook_pvi_2022.csv')

# Drop DC because it's not a state
pol_df = pol_df[pol_df['State'] != 'District of Columbia'].reset_index(drop=True)

# Convert PVI strings (e.g. "R+15", "D+13") to signed integers:
# Republican leans are positive, Democratic leans are negative
unscaled_ratings = []
for pvi in pol_df['2022_PVI']:
    magnitude = int(re.search(r"(?<=\+)\d+", pvi).group())
    unscaled_ratings.append(magnitude if pvi.startswith('R') else -magnitude)

pol_df['unscaled_rating'] = unscaled_ratings

# Rescale to [0, 1]: 0 is the most liberal state, 1 the most conservative
pol_rating_scaler = MinMaxScaler()
scaled_pol_ratings = pol_rating_scaler.fit_transform(pol_df['unscaled_rating'].values.reshape(-1, 1))

pol_df['scaled_rating'] = scaled_pol_ratings

pol_df

| Unnamed: 0 | State | 2022_PVI | 2020_Biden | 2020_Trump | 2016_Clinton | 2016_Trump | unscaled_rating | scaled_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | R+15 | 36.60% | 62.00% | 34.40% | 62.10% | 15 | 0.756098 |
| 1 | Alaska | R+8 | 42.80% | 52.80% | 36.60% | 51.30% | 8 | 0.585366 |
| 2 | Arizona | R+2 | 49.40% | 49.10% | 44.60% | 48.10% | 2 | 0.439024 |
| 3 | Arkansas | R+16 | 34.80% | 62.40% | 33.70% | 60.60% | 16 | 0.780488 |
| 4 | California | D+13 | 63.50% | 34.30% | 61.70% | 31.60% | -13 | 0.073171 |
| 5 | Colorado | D+4 | 55.40% | 41.90% | 48.20% | 43.30% | -4 | 0.292683 |
| 6 | Connecticut | D+7 | 59.30% | 39.20% | 54.60% | 40.90% | -7 | 0.219512 |
| 7 | Delaware | D+7 | 58.70% | 39.80% | 53.10% | 41.70% | -7 | 0.219512 |
| 8 | Florida | R+3 | 47.90% | 51.20% | 47.80% | 49.00% | 3 | 0.463415 |
| 9 | Georgia | R+3 | 49.50% | 49.20% | 45.60% | 50.80% | 3 | 0.463415 |


**MAIA SECTION**

We decided to measure each state's **abortion laws** based on how late into pregnancy abortion is legally allowed, which can be found at https://lawatlas.org/datasets/abortion-bans. We chose this dataset because it is the only one we found that shows abortion bans across the 2014-2022 time frame and how they change (most sources only show the bans currently in place).

We accessed it by downloading it, converting it from xlsx to csv, and saving it as `abortion_data.csv` in our data folder.

The main thing to consider is that **just because abortion is legal doesn't mean it is accessible**. Many states may technically allow abortion but have only one clinic, so it's not attainable in practice. That said, we have chosen this metric (when abortion is legal) to coincide with the current political debate about whether abortion should be legalized.

Significant preprocessing will be required. Each law is recorded with an `Effective Date` and a `Valid Through Date`, which must be converted into years (assigning each year the law that was in effect for the majority of it), and each ban (`6 weeks`, `8 weeks`, etc.) is categorical. It would make more sense to create a single numeric variable listing the latest week at which abortion is legal (0, 6, 8, 12, 52, etc.).
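The majority-of-year rule described above could be sketched roughly as follows. The `Effective Date` and `Valid Through Date` column names come from the dataset, but the `State`, `ban`, and `latest_week` columns and the two example laws are hypothetical. The real file may encode bans differently (for example as one column per ban type), and labels that are not of the form "N weeks" would need their own mapping.

```python
import pandas as pd

# Hypothetical laws for one state (not real data)
laws = pd.DataFrame({
    "State": ["State A", "State A"],
    "Effective Date": pd.to_datetime(["2014-01-01", "2019-05-01"]),
    "Valid Through Date": pd.to_datetime(["2019-04-30", "2022-12-31"]),
    "ban": ["24 weeks", "6 weeks"],
})

# Turn the categorical ban label into a numeric week
laws["latest_week"] = laws["ban"].str.extract(r"(\d+)", expand=False).astype(int)


def days_in_year(start, end, year):
    """Number of days of `year` during which a law was in effect."""
    year_start = pd.Timestamp(year=year, month=1, day=1)
    year_end = pd.Timestamp(year=year, month=12, day=31)
    overlap = (min(end, year_end) - max(start, year_start)).days + 1
    return max(overlap, 0)


# Count the days each law covers in each year of our time frame
rows = []
for year in range(2014, 2023):
    for state, start, end, week in zip(laws["State"], laws["Effective Date"],
                                       laws["Valid Through Date"], laws["latest_week"]):
        rows.append({"State": state, "year": year, "latest_week": week,
                     "days": days_in_year(start, end, year)})
overlaps = pd.DataFrame(rows)

# For each state-year, keep the law that covered the most days
majority = overlaps.loc[overlaps.groupby(["State", "year"])["days"].idxmax()]
abortion_by_year = majority.set_index(["State", "year"])["latest_week"]

print(abortion_by_year)
```

In this example the second law takes effect on May 1, 2019, so it covers most of 2019 and that year is assigned the 6-week value.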

## 2. Load the data into a Jupyter notebook and understand the data by examining, among other characteristics of interest, data missingness, imbalance, and scaling issues.

Some preliminary issues we have considered before even inspecting the data include:

- **underreporting of immigration status**: based on a quick Google search, people tend to underreport whether they are immigrants. This is potentially a missingness issue.

- **multicollinearity**: There is most likely a relationship between racial background / household income and an individual's birth country (immigration status), so we may not be able to use both as predictors in the same model. We don't yet know how strong this correlation will be, so we are not worried yet, but we would love to talk about it with a TF.

- **is this just too much data??**: looking at each of these attributes for each state for each year may just be too many dimensions. Is there really a big difference across years? Should we just look at one? If so, which?
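For the multicollinearity concern in the list above, one way to quantify the problem once the data is loaded is a correlation matrix plus variance inflation factors (VIFs). The sketch below uses purely synthetic predictors with placeholder names. It computes VIFs as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 - R²) from regressing each predictor on the others.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: a and b are strongly related by construction, c is independent
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
X = pd.DataFrame({
    "predictor_a": base,
    "predictor_b": base + 0.1 * rng.normal(size=n),
    "predictor_c": rng.normal(size=n),
})

# Pairwise correlations between predictors
corr = X.corr()

# VIF for each predictor = diagonal of the inverse correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=X.columns)

print(corr.round(2))
print(vif.round(1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the exact cutoff is a judgment call we would want to confirm with a TF.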

Now we are going to import and inspect each dataset, looking for missingness, imbalance, and scaling issues!
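Since we run the same checks on every dataset, a small helper along these lines might save repetition. This is only a sketch: the function name `inspect_dataset` and the demo DataFrame are hypothetical, and the chosen summary columns are one possible set.

```python
import pandas as pd


def inspect_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of missingness, cardinality, range, and skew."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })
    # Range and skew only apply to numeric columns (NaN elsewhere)
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["skew"] = numeric.skew()
    return summary


# Hypothetical demo data (not real values)
demo = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "fertility_rate": [60.0, 55.5, None, 70.2],
    "population": [5_000_000, 500_000, 39_000_000, 1_000_000],
})
report = inspect_dataset(demo)
print(report)
```

Large differences in `min`/`max` across columns point to scaling issues, while a large `skew` or a very small `n_unique` hints at imbalance.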

### Fertility Rate Data

In [14]:
fertility_rate = pd.read_csv('data/fertility_rate_census.csv')
print(fertility_rate.isna().sum(axis=0))
fertility_rate.describe()


Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Census Data

In [None]:
census_data = pd.read_csv('data/')
print(census_data.isna().sum(axis=0))
census_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Religion Data

In [None]:
religion_data = pd.read_csv('data/')
print(religion_data.isna().sum(axis=0))
religion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Politics Data

In [None]:
politics_data = pd.read_csv('data/')
print(politics_data.isna().sum(axis=0))
politics_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Abortion Data

In [None]:
abortion_data = pd.read_csv('data/')
print(abortion_data.isna().sum(axis=0))
abortion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

## 3. Understand and describe the preprocessing required such that data is in a form amenable to later downstream tasks such as visualizing and modeling, as is appropriate to the specific project goals.



One of our biggest considerations is that our predictor data, X, is essentially 3D rather than 2D: we are looking at a set of predictors **across states** and **over time**. This might complicate our predictions, especially because we are trying to understand **why** fertility is changing over time (so the reason there are fewer babies in 2022 can't simply be that it is 2022). Thus, it might make more sense to choose only **one year** to regress on. It may also be hard to **visualize** multiple years and states simultaneously. For example, if we drew a map and colored each state by predictor category, we would have to choose one specific time to do so (and vice versa). PLEASE LET US KNOW WHAT YOU THINK!

It may also make sense to minimize the number of predictors in general, as we don't want the model to get too specific!
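One way to handle the state-by-year structure is to store everything in a "long" table with one row per (state, year) pair, which keeps the data 2D while preserving both dimensions. The sketch below uses hypothetical column names and numbers, and shows how a single year can then be sliced out for a one-year regression or map.

```python
import pandas as pd

# Hypothetical panel: one row per (state, year), not real data
panel = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "year": [2021, 2022, 2021, 2022],
    "pct_below_poverty": [15.0, 14.0, 10.0, 11.0],
    "fertility_rate": [60.0, 58.0, 52.0, 51.0],
}).set_index(["state", "year"])

# Cross-section for a single year (one row per state), e.g. for a map
one_year = panel.xs(2022, level="year")

# States as rows and years as columns for a single variable, e.g. for line plots
wide = panel["fertility_rate"].unstack("year")

print(one_year)
print(wide)
```

With this layout, choosing between a single-year model and a multi-year model becomes a matter of filtering rows rather than restructuring the data.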

### This is how we envision generally preprocessing...