The dataset I created myself covers only 51 universities. Because that is likely too few for a model to fit well, I will retrieve vaccination mandates (a binary variable) for as many universities as possible. I'll still need to train a fairly simple model, but it should perform better than one fit to my own data alone.

I will use the [Chronicle's list of colleges that require vaccination](https://www.chronicle.com/blogs/live-coronavirus-updates/heres-a-list-of-colleges-that-will-require-students-to-be-vaccinated-against-covid-19). Note that the Chronicle stopped tracking on 10/26/2021, but most mandates should have been announced before the fall semester began. My last recorded date is 8/24/2021, so this should not pose a problem.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [67]:
vacc_mandates = pd.read_csv('chronicle_vaccine_mandates.csv')
vacc_mandates.head()

Unnamed: 0,College,Announce date,Type,State ^Color denotes 2020 presidential result^,All employees ^(Vaccination required)^,Some employees<sup>1</sup> ^(Vaccination required)^,All students ^(Vaccination required)^,Only residential students ^(Vaccination required)^,Booster required?,state_pol
0,[Adelphi University](https://www.adelphi.edu/n...,"July 26, 2021",Private,NY,--,--,✓,--,--,D
1,[Adler University](https://www.illinois.gov/go...,"August 26, 2021",Private,IL,✓,--,✓,--,--,D
2,[Agnes Scott College](https://www.agnesscott.e...,"May 7, 2021",Private,GA,✓,--,✓,--,--,D
3,[Albany College of Pharmacy and Health Science...,--,Private,NY,✓,--,✓,--,--,D
4,[Albany Law School](https://www.albanylaw.edu/...,"August 13, 2021",Private,NY,✓,--,✓,--,--,D


Clean data

In [68]:
vacc_mandates.rename(columns={'Announce date': 'announce_date',
                              'State  ^Color denotes 2020 presidential result^': 'State', 
                              'All employees  ^(Vaccination required)^': 'all_employee_vacc', 
                              'Some employees<sup>1</sup> ^(Vaccination required)^': 'some_employee_vacc',
                              'All students  ^(Vaccination required)^': 'all_students_vacc',
                              'Only residential students  ^(Vaccination required)^': 'res_students_vacc',
                              'Booster required?': 'booster'
                             }, inplace=True)
vacc_mandates['College'] = vacc_mandates['College'].str.extract(r'\[(.+)\]') # isolate college name
vacc_mandates['announce_date'] = vacc_mandates['announce_date'].replace('--', np.nan) # assign back rather than a chained inplace call
vacc_mandates['announce_date'] = pd.to_datetime(vacc_mandates['announce_date'])
flag_cols = ['all_employee_vacc', 'some_employee_vacc', 'all_students_vacc', 'res_students_vacc', 'booster']
vacc_mandates[flag_cols] = vacc_mandates[flag_cols].replace({'--': '0', '✓': '1'}) # dash -> '0', check mark -> '1'
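
As a quick sanity check on the cleaning steps above, here is a minimal sketch on two made-up rows shaped like the Chronicle export. The toy values and the choice to store the flags as integers (rather than the strings `'0'`/`'1'` used above) are my assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like the Chronicle export (values are made up for illustration).
toy = pd.DataFrame({
    'College': ['[Example University](https://example.edu/covid)',
                '[Sample College](https://example.org/news)'],
    'announce_date': ['July 26, 2021', '--'],
    'all_students_vacc': ['✓', '--'],
})

# Keep only the text between the square brackets of the markdown link.
toy['College'] = toy['College'].str.extract(r'\[(.+)\]', expand=False)

# '--' marks a missing date; turn it into NaN before parsing so it becomes NaT.
toy['announce_date'] = pd.to_datetime(toy['announce_date'].replace('--', np.nan))

# Assumption: integer flags are more convenient for modeling than '0'/'1' strings.
toy['all_students_vacc'] = toy['all_students_vacc'].map({'✓': 1, '--': 0}).astype(int)

print(toy)
```

If the flags stay as strings, they would need an `astype(int)` at some later point before being used as numeric features.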

Now, clean the US News Best Colleges list, which I obtained as described in "Documenting COVID Decisions at US Universities".

In [108]:
college_rankings = pd.read_csv('usa_list.csv')
college_rankings.head()

Unnamed: 0,displayName,rankingDisplayRank,state,city,zip,description
0,Princeton University,#1,NJ,Princeton,8544.0,The ivy-covered campus of Princeton University...
1,Columbia University,#2,NY,New York,10027.0,Columbia University has three undergraduate sc...
2,Harvard University,#2,MA,Cambridge,2138.0,Harvard University is a private institution in...
3,Massachusetts Institute of Technology,#2,MA,Cambridge,2139.0,Though the Massachusetts Institute of Technolo...
4,Yale University,#5,CT,New Haven,6520.0,"Yale University, located in New Haven, Connect..."


In [109]:
college_rankings = (college_rankings.rename(columns={'displayName': 'College', 'rankingDisplayRank': 'ranking'})
                    .drop(columns='description'))

Convert the rankings to raw integer positions. Ties are broken in favor of listing order (which is alphabetical, I think). This shouldn't matter much because I will put the rankings into bins. Also make zip an int, as that's how it is in my other files. This requires dropping the one college with a missing zip, which is fine.

In [110]:
college_rankings['ranking'] = college_rankings.index
college_rankings = college_rankings.dropna(subset=['zip']).copy() # copy so the assignment below modifies a real frame, not a view
college_rankings['zip'] = college_rankings['zip'].astype(int)
college_rankings.head()

Unnamed: 0,College,ranking,state,city,zip
0,Princeton University,0,NJ,Princeton,8544
1,Columbia University,1,NY,New York,10027
2,Harvard University,2,MA,Cambridge,2138
3,Massachusetts Institute of Technology,3,MA,Cambridge,2139
4,Yale University,4,CT,New Haven,6520
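
To illustrate the tie-breaking and binning described above, here is a small sketch on made-up rows. It assumes every displayed rank looks like `#N`; the bin edges and tier labels are hypothetical and would need to be chosen for the real data.

```python
import pandas as pd

# Made-up colleges with the same tie pattern as the first rows of the real list.
ranks = pd.DataFrame({
    'College': ['A', 'B', 'C', 'D', 'E'],
    'rankingDisplayRank': ['#1', '#2', '#2', '#2', '#5'],
})

# Option 1 (used in this notebook): listing order as a 0-based rank,
# which breaks ties by row position.
ranks['ranking'] = ranks.index

# Option 2: parse the displayed rank so tied colleges share a value.
# Assumes every value has the form '#N'; anything else would need extra handling.
ranks['display_rank'] = (ranks['rankingDisplayRank']
                         .str.extract(r'#(\d+)', expand=False)
                         .astype(int))

# Hypothetical bin edges: intervals are (-1, 1], (1, 3], (3, 10].
ranks['rank_bin'] = pd.cut(ranks['ranking'], bins=[-1, 1, 3, 10],
                           labels=['tier_1', 'tier_2', 'tier_3'])

print(ranks)
```

With wide enough bins, the two options put most colleges in the same bin, which is why the tie-breaking rule should matter little.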


### Merge
Use an inner join, as manually filling in the colleges that don't match would take too much work.

In [111]:
vacc_mandates_top = college_rankings.merge(vacc_mandates, on='College', how='inner')
vacc_mandates_top.shape

(162, 14)
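
One way to see what the inner join drops is an outer merge with `indicator=True`. This is a sketch on made-up frames, not something I ran on the real data. Note that a ranked college can be unmatched either because its name is spelled differently in the two sources or because it is simply absent from the Chronicle list.

```python
import pandas as pd

# Made-up stand-ins for college_rankings and vacc_mandates.
rankings = pd.DataFrame({
    'College': ['Alpha University', 'Beta College', 'Gamma Institute'],
    'ranking': [0, 1, 2],
})
mandates = pd.DataFrame({
    'College': ['Alpha University', 'Gamma Institute', 'Delta University'],
    'all_students_vacc': [1, 1, 0],
})

# The '_merge' column records whether each row came from the left frame,
# the right frame, or both.
audit = rankings.merge(mandates, on='College', how='outer', indicator=True)
counts = audit['_merge'].value_counts()

# Ranked colleges with no exact name match in the mandate list.
unmatched_ranked = audit.loc[audit['_merge'] == 'left_only', 'College'].tolist()

print(counts)
print(unmatched_ranked)
```

Scanning the `left_only` names would show whether a few simple renames could recover more rows.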

In [115]:
vacc_mandates_top.head()

Unnamed: 0,College,ranking,state,city,zip,announce_date,Type,State,all_employee_vacc,some_employee_vacc,all_students_vacc,res_students_vacc,booster,state_pol
0,Princeton University,0,NJ,Princeton,8544,2021-04-20,Private,NJ,1,0,1,0,1,D
1,Columbia University,1,NY,New York,10027,2021-04-19,Private,NY,1,0,1,0,1,D
2,Harvard University,2,MA,Cambridge,2138,2021-05-05,Private,MA,1,0,1,0,1,D
3,Massachusetts Institute of Technology,3,MA,Cambridge,2139,2021-04-30,Private,MA,1,0,1,0,1,D
4,Yale University,4,CT,New Haven,6520,2021-04-19,Private,CT,1,0,1,0,0,D


In [122]:
vacc_mandates['announce_date'].idxmin()

732
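
`idxmin()` returns the row label of the earliest announcement and skips missing dates. The sketch below uses made-up rows to show how that label can be used to pull the row, and how to check the share of known announcement dates that fall on or before my last recorded date of 8/24/2021. The college names and dates are invented.

```python
import pandas as pd

# Made-up announcement dates, including one missing value.
df = pd.DataFrame({
    'College': ['Alpha University', 'Beta College', 'Gamma Institute', 'Delta University'],
    'announce_date': pd.to_datetime(['2021-07-26', None, '2021-03-25', '2021-09-15']),
})

# Label of the earliest announcement (NaT is skipped), then the full row.
first_label = df['announce_date'].idxmin()
first_row = df.loc[first_label]

# Share of known dates on or before the cutoff.
cutoff = pd.Timestamp('2021-08-24')
known = df['announce_date'].dropna()
share_before_cutoff = (known <= cutoff).mean()

print(first_row['College'], share_before_cutoff)
```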

The merge lost over half of the colleges; however, 162 rows is still more data than the 51 I had before, so that's acceptable. Now, save this dataframe, then apply the same functions as I did in "Analyzing Covid Decision Dates," this time running my script instead of another notebook.

In [118]:
# vacc_mandates_top.to_csv('vacc_mandates_top.csv', index=False)

Result from running the cleaning script

In [128]:
vacc_mandates_cleaned = pd.read_csv('vacc_mandates_cleaned.csv')
vacc_mandates_cleaned.head()

Unnamed: 0,College,ranking,state,city,zip,announce_date,Type,State,all_employee_vacc,some_employee_vacc,...,state_fips,STCOUNTYFP,county_fips,county_fips_str,median_income,total_population,avg_hhsize,avg_community_level,political_control_state,county_vote_diff
0,Princeton University,0,NJ,Princeton,8544,19.0,Private,NJ,1,0,...,34,34021,21,21,37223.0,368085.0,2.46,0.62766,Dem,0.374257
1,Columbia University,1,NY,New York,10027,18.0,Private,NY,1,0,...,36,36061,61,61,52409.0,1629153.0,2.07,0.535047,Dem,0.768507
2,New York University,27,NY,New York,10012,22.0,Private,NY,1,0,...,36,36061,61,61,52409.0,1629153.0,2.07,0.535047,Dem,0.768507
3,Fordham University,67,NY,New York,10023,15.0,Private,NY,1,0,...,36,36061,61,61,52409.0,1629153.0,2.07,0.535047,Dem,0.768507
4,Yeshiva University,73,NY,New York,10033,33.0,Private,NY,0,0,...,36,36061,61,61,52409.0,1629153.0,2.07,0.535047,Dem,0.768507


Check nulls

In [132]:
vacc_mandates_cleaned[['median_income', 'total_population', 'avg_hhsize', 'avg_community_level', 'political_control_state', 'county_vote_diff']].isna().sum()

median_income              0
total_population           0
avg_hhsize                 0
avg_community_level        0
political_control_state    0
county_vote_diff           0
dtype: int64

No nulls, so I can move on to a bit more preprocessing and then apply machine learning models.