**Import Libraries**

In [92]:
from collections import Counter
import math
import time
import requests
from bs4 import BeautifulSoup
import re
import json
import csv

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.utils import shuffle

from warnings import simplefilter
simplefilter('ignore', category=FutureWarning)

Please note, we will be collaborating via Git. Please find our project at https://github.com/ppich1169/fertilityProjectCs109

# Milestone 1: Proposal

Over the past fifty years, fertility rates in the US have plummeted and are currently at a historic low. Conversations about why fertility has fallen so substantially, and how we can address the implications of this shift for government programs like Social Security, have been salient in recent public discourse and in the 2024 election cycle.

Interestingly, there is significant variation in fertility rates across US states. We’d like to understand the relative importance of various factors in determining a state’s fertility rate. 

We plan to run a multiple regression of fertility rate (available state by state from the CDC’s National Center for Health Statistics) on a number of regressors.

**Goal:** Create a regression that can predict the fertility rate of a state. Then, analyze coefficients and/or use causal inference to understand why the fertility rate is falling.

# Milestone 2: Preprocessing

## 1. Access the data that you will be using for the final project by downloading, collecting, or scraping* from the relevant source(s)

### Response Variable

Our response variable is **fertility rate by state over time** which can be found at https://www.cdc.gov/nchs/pressroom/sosmap/fertility_rate/fertility_rates.htm. 

We accessed it via download and saved it as `fertility_rate_census.csv`. 

Please note, fertility rate is defined as the **total number of births per 1,000 women aged 15-44**, and our dataset looks at fertility rate for each of the 50 states over 9 years (2014-2022).

### Predictors
_We used our previous knowledge and assumptions to create an **X** dataset of predictors that we believe may influence fertility rates_

Because fertility rate is evaluated statewide, and states vary significantly in population size, we have decided to **normalize** all of the predictors by looking at the percentage of each state's population that falls into a specific category.
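As a minimal sketch of the normalization described above, the snippet below converts raw counts into percentages of a state's population. The column names (`total_population`, `below_poverty_line`, `high_school_graduates`), the state codes, and the numbers are all hypothetical placeholders, not the actual census fields.

```python
import pandas as pd

# Hypothetical counts for two made-up states; column names are illustrative only.
counts = pd.DataFrame(
    {
        "STATE": ["AA", "BB"],
        "total_population": [1_000_000, 250_000],
        "below_poverty_line": [120_000, 40_000],
        "high_school_graduates": [850_000, 200_000],
    }
)


def to_percentages(df, total_col, count_cols):
    """Return a copy of df with each count column replaced by a percentage of total_col."""
    out = df.copy()
    for col in count_cols:
        out[f"pct_{col}"] = 100 * out[col] / out[total_col]
    return out.drop(columns=count_cols)


shares = to_percentages(
    counts, "total_population", ["below_poverty_line", "high_school_graduates"]
)
print(shares)
```

Expressing every predictor as a share keeps large and small states on a comparable scale, which is the point of the normalization.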

**ISABELLA SECTION**

We care about basic demographic data, which we downloaded from the **census** at [**THIS LINK INSERT HERE**]

Here is the census data we consider important (based on our own subjective opinions):
- race
- socioeconomic status (household income) - percentage of households below the poverty line
- education level - share that are high school graduates 
- immigration status

And obviously we need **age**, **sex**, **year**, and **state** in order to aggregate our data.

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

We also care about religion, political makeup, and whether certain abortion laws are in place, none of which are in the census. We will obtain them from other sources.

**PETER SECTION**

We decided to determine people's **religiousness** based off of [**INSERT HERE**] which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**ELIZA SECTION**

We decided to determine people's **political orientation** based off of [**INSERT HERE**] which can be found at [**INSERT LINK HERE**]

We accessed it by [**SCRAPING (CODE BELOW) / DOWNLOADING**] and saved it in `dataset_name.csv` in our data folder

Some things to consider include: [**INSERT HERE**] 

**MAIA SECTION**

We decided to characterize each state's **abortion laws** by how late into pregnancy abortion is legally allowed, using the dataset at https://lawatlas.org/datasets/abortion-bans. We chose this dataset because it is the only one we found that shows abortion bans over the 2014-2022 time frame and how they change (most sources show only the bans currently in effect).

We accessed it by downloading it, converting it from xlsx to csv, and saving it as `abortion_data.csv` in our data folder.

The main thing to consider is that **just because abortion is legal doesn't mean it is accessible**. Many states may technically allow abortion but have only one clinic, so it is not attainable in practice. That said, we chose this metric (when abortion is legal) to coincide with the current political debate about whether abortion should be legalized.

Significant preprocessing will be required. Each law is recorded with an `Effective Date` and a `Valid Through Date`, which must be converted into a yearly value (whichever law was in effect for the majority of the year), and each ban (`6 weeks`, `8 weeks`, etc.) is categorical. It would make more sense to create a single numeric variable giving the latest week at which abortion is legal (0, 6, 8, 12, 52, etc.).
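The two preprocessing steps described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Effective Date` and `Valid Through Date` column names come from the dataset description, but the `State` and `Ban` columns, the state code `AA`, and the toy rows are hypothetical, and the real file may encode the bans differently (for example, as one column per ban).

```python
import re

import pandas as pd


def ban_to_weeks(label):
    """Parse a label such as '6 weeks' into the integer 6 (None if it contains no number)."""
    match = re.search(r"\d+", str(label))
    return int(match.group()) if match else None


def law_for_year(laws, year):
    """Return the row of `laws` in effect for the most days of `year`, or None if none overlaps."""
    year_start = pd.Timestamp(year=year, month=1, day=1)
    year_end = pd.Timestamp(year=year, month=12, day=31)
    # Clamp each law's validity window to the calendar year.
    start = laws["Effective Date"].where(laws["Effective Date"] > year_start, year_start)
    end = laws["Valid Through Date"].where(laws["Valid Through Date"] < year_end, year_end)
    days_in_year = ((end - start).dt.days + 1).clip(lower=0)
    if days_in_year.max() == 0:
        return None
    return laws.loc[days_in_year.idxmax()]


# Hypothetical laws for a single made-up state.
laws = pd.DataFrame(
    {
        "State": ["AA", "AA"],
        "Ban": ["20 weeks", "6 weeks"],
        "Effective Date": pd.to_datetime(["2014-01-01", "2019-05-01"]),
        "Valid Through Date": pd.to_datetime(["2019-04-30", "2022-12-31"]),
    }
)

rows = []
for year in range(2014, 2023):
    law = law_for_year(laws, year)
    weeks = ban_to_weeks(law["Ban"]) if law is not None else None
    rows.append({"State": "AA", "Year": year, "latest_week_legal": weeks})

yearly = pd.DataFrame(rows)
print(yearly)
```

In this toy example the second law takes effect in May 2019, so it covers most of that year and 2019 is assigned the 6-week value.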

## 2. Load the data into a Jupyter notebook and understand the data by examining, among other characteristics of interest, data missingness, imbalance, and scaling issues.

Some issues we have preliminarily considered, before even inspecting the data, include:


- **underreporting of immigration status**: based on a quick Google search, people tend to underreport whether they are immigrants. This is potentially a missingness issue.

- **multicollinearity**: There is most likely a relationship between racial background / household income and an individual's birth country (immigration status), so including all of them as predictors in the same regression could make the coefficients unstable and hard to interpret. We don't know how strong this correlation will be, so we are not worried yet, but we would love to talk about it with a TF.

- **is this just too much data??**: looking at each of these attributes for each state for each year may just be too many dimensions. Is there really a big difference across years? Should we just look at one? If so, which?
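For the multicollinearity concern in the list above, one possible check is to look at pairwise correlations and variance inflation factors (VIFs) once the predictors are assembled. The sketch below uses synthetic, standardized data with hypothetical column names (`pct_foreign_born`, `pct_group_a`, `pct_below_poverty`); it is not based on our actual census data, and the VIF helper is a plain NumPy implementation written for illustration.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on all the others."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        # Add an intercept column and fit ordinary least squares.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        vifs[col] = 1 / (1 - r2)
    return pd.Series(vifs)


# Synthetic predictors: the first two are built to be strongly correlated.
rng = np.random.default_rng(0)
n = 200
foreign = rng.normal(size=n)
X = pd.DataFrame(
    {
        "pct_foreign_born": foreign,
        "pct_group_a": 0.9 * foreign + 0.3 * rng.normal(size=n),
        "pct_below_poverty": rng.normal(size=n),
    }
)

print(X.corr().round(2))
vifs = variance_inflation_factors(X)
print(vifs.round(2))
```

A common rule of thumb treats VIFs well above 5 or 10 as a sign that a predictor is largely explained by the others, though the exact threshold is a judgment call.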

Now we are going to import and inspect each dataset, looking for missingness, imbalance, and scaling issues!

### Fertility Rate Data

In [93]:
fertility_rate = pd.read_csv('data/fertility_rate_census.csv')
print(fertility_rate.isna().sum(axis=0))
fertility_rate.describe()

YEAR              0
STATE             0
FERTILITY RATE    0
BIRTHS            0
URL               0
dtype: int64


| | YEAR | FERTILITY RATE | BIRTHS |
|---|---|---|---|
| count | 451.0 | 451.0 | 451.0 |
| mean | 2018.008869 | 60.057871 | 75783.962306 |
| std | 2.58885 | 6.420758 | 86837.13469 |
| min | 2014.0 | 44.3 | 5133.0 |
| 25% | 2016.0 | 56.0 | 21758.5 |
| 50% | 2018.0 | 60.2 | 55971.0 |
| 75% | 2020.0 | 63.5 | 86486.0 |
| max | 2022.0 | 80.0 | 502879.0 |


Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:
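One way to start filling in the missingness and imbalance notes is to check how completely the state-by-year panel is covered. The `describe()` output reports 451 rows, whereas 50 states over 9 years would give 450, so a check like the one below may help locate the extra row. The helper name `coverage_report` and the toy data are hypothetical; only the `STATE` and `YEAR` column names come from the dataset.

```python
import pandas as pd


def coverage_report(df, state_col="STATE", year_col="YEAR"):
    """Summarise how completely a state-by-year panel is filled in."""
    states_per_year = df.groupby(year_col)[state_col].nunique()
    years_per_state = df.groupby(state_col)[year_col].nunique()
    n_duplicates = int(df.duplicated(subset=[state_col, year_col]).sum())
    return states_per_year, years_per_state, n_duplicates


# Toy panel: state BB is missing 2015 and has a duplicated 2016 row.
toy = pd.DataFrame(
    {
        "YEAR": [2014, 2014, 2015, 2016, 2016, 2016],
        "STATE": ["AA", "BB", "AA", "AA", "BB", "BB"],
    }
)

states_per_year, years_per_state, n_duplicates = coverage_report(toy)
print(states_per_year)
print(years_per_state)
print("duplicate state-year rows:", n_duplicates)
```

The same function could be called on `fertility_rate` after it is loaded, for example `coverage_report(fertility_rate)`.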

### Census Data

In [94]:
# census_data = pd.read_csv('data/')
# print(census_data.isna().sum(axis=0))
# census_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Religion Data

In [121]:
import os
# WEB SCRAPING
# Step 1: Make a GET request to fetch the raw HTML content
url2020 = "https://www.thearda.com/us-religion/maps/us-state-maps"
url2010 = "https://www.thearda.com/us-religion/maps/us-state-maps?color=orange&m1=2_2_9999_2010"
url2000 = "https://www.thearda.com/us-religion/maps/us-state-maps?color=orange&m1=2_2_9999_2000"
response = requests.get(url2000)

# Check if the request was successful; raise_for_status() raises on 4xx/5xx responses
# so we never try to parse a failed download
response.raise_for_status()

# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 3: Print the prettified HTML content
# print(soup.prettify())
    
# Find the script tag containing the variable
script_tag = soup.find('script', string=re.compile(r'usa_map_div299992000_data = '))

# Extract the JavaScript code from the script tag
script_content = script_tag.string

# print(script_content)

# Locate the variable assignment and slice out everything up to the terminating semicolon
start_index = script_content.find('usa_map_div299992000_data =')
semicolon_index = script_content.find(';', start_index)

mapData = script_content[start_index:semicolon_index]
# print(mapData)

quoted_strings = re.findall(r'"(.*?)"', mapData)
values_strings = re.findall(r'(\d+\.\d+)', mapData)
year = [2000] * len(values_strings)

last_two_chars = [s[-2:] for s in quoted_strings]
print(last_two_chars)
print(len(values_strings))


# Create a CSV file and write the pairs to it
# csv_filename = 'AllReligionAdherence_2000.csv'
# with open(csv_filename, mode='w', newline='') as file:
#     writer = csv.writer(file)
#     writer.writerow(['Year','State', 'Adherence Rate per 1000'])  # Write the header
#     for year, quoted, value in zip(year,last_two_chars, values_strings):
#         writer.writerow([year,quoted, value])

# print(f"CSV file '{csv_filename}' created successfully.")

# Define the file paths
file_paths = [
    'data/AllReligionAdherence_2000.csv',
    'data/AllReligionAdherence_2010.csv',
    'data/AllReligionAdherence_2020.csv'
]

# List to hold DataFrames
dfs = []

# Process each file
for file_path in file_paths:
    # Extract the year from the file name
    file_name = os.path.basename(file_path)
    year = int(file_name.split('_')[-1].split('.')[0])  # keep Year numeric rather than a string
    
    # Read the CSV file into a DataFrame
    df = pd.read_csv(file_path)
    
    # Add the new column with the year
    df['Year'] = year
    
    # Append the DataFrame to the list
    dfs.append(df)

# Concatenate the DataFrames
combined_df = pd.concat(dfs, ignore_index=True)

# Optionally, save the combined DataFrame to a new CSV file
# combined_df.to_csv('data/AllReligionAdherence_combined.csv', index=False)

# Print the combined DataFrame
print(combined_df)



['UT', 'ND', 'DC', 'SD', 'MA', 'RI', 'MN', 'OK', 'WI', 'NY', 'NE', 'LA', 'IA', 'NM', 'PA', 'CT', 'NJ', 'AR', 'TX', 'IL', 'AL', 'MS', 'KY', 'MO', 'TN', 'KS', 'ID', 'NH', 'SC', 'WY', 'CA', 'NC', 'OH', 'GA', 'MT', 'MD', 'IN', 'MI', 'VA', 'FL', 'DE', 'AZ', 'CO', 'VT', 'ME', 'HI', 'WV', 'AK', 'NV', 'WA', 'OR']
51




     Year State  Adherence Rate per 1000
0    2000    UT                   747.30
1    2000    ND                   732.03
2    2000    DC                   731.64
3    2000    SD                   678.13
4    2000    MA                   640.97
..    ...   ...                      ...
148  2020    AK                   351.82
149  2020    MT                   348.37
150  2020    OR                   331.81
151  2020    ME                   307.94
152  2020    NH                   272.19

[153 rows x 3 columns]


In [97]:
# religion_data = pd.read_csv('data/')
# print(religion_data.isna().sum(axis=0))
# religion_data.describe()



Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Politics Data

In [98]:
# TODO: point this at the politics CSV once it has been saved in the data folder;
# passing the bare directory 'data/' raises IsADirectoryError.
# politics_data = pd.read_csv('data/')
# print(politics_data.isna().sum(axis=0))
# politics_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

### Abortion Data

In [None]:
abortion_data = pd.read_csv('data/abortion_data.csv')
print(abortion_data.isna().sum(axis=0))
abortion_data.describe()

Things to Consider:

**Missingness**:

**Imbalance**:

**Scaling**:

**Other**:

## 3. Understand and describe the preprocessing required such that data is in a form amenable to later downstream tasks such as visualizing and modeling, as is appropriate to the specific project goals.



One of our biggest considerations is that our predictor data, X, is essentially 3D rather than 2D: we are looking at a set of predictors **over state** and **over time**. This might complicate our predictions, especially because we are trying to see **why** fertility is changing over time (so the reason there are fewer babies in 2022 can't simply be that it is 2022). Thus, it might make more sense to choose only **one year** to regress on. It may also be hard to **visualize** multiple years and states simultaneously: if we drew a map and colored each state by predictor category, for example, we would have to choose one specific time to do so (and vice versa). PLEASE LET US KNOW WHAT YOU THINK!

It also may make sense to just try to minimize predictors in general as we don't want to get too specific!
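To make the state-by-year structure concrete, here is a rough sketch of one possible layout: keep the data in long format with one row per (state, year) pair, merge the response with the predictors on those two keys, and slice out a single year when needed. The values, the state codes, and the `pct_below_poverty` column are made up for illustration; only `STATE`, `YEAR`, and `FERTILITY RATE` match the fertility dataset's column names.

```python
import pandas as pd

# Hypothetical response and predictor tables in long format.
fertility = pd.DataFrame(
    {
        "STATE": ["AA", "AA", "BB", "BB"],
        "YEAR": [2021, 2022, 2021, 2022],
        "FERTILITY RATE": [60.0, 58.5, 55.0, 54.0],
    }
)
predictors = pd.DataFrame(
    {
        "STATE": ["AA", "AA", "BB", "BB"],
        "YEAR": [2021, 2022, 2021, 2022],
        "pct_below_poverty": [12.0, 11.5, 15.0, 15.5],
    }
)

# validate="one_to_one" raises if any (STATE, YEAR) pair is duplicated on either side.
panel = fertility.merge(
    predictors, on=["STATE", "YEAR"], how="inner", validate="one_to_one"
)
panel = panel.set_index(["STATE", "YEAR"]).sort_index()

# A single-year cross-section is an ordinary 2D table that can be regressed on or mapped.
one_year = panel.xs(2022, level="YEAR")
print(one_year)
```

With this layout the full panel stays available for later analysis, while choosing one year for a regression or a map reduces to a single `xs` call.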

### This is how we envision generally preprocessing...