### Building a county-by-county model for Wisconsin
- Used the combined AP data for the academic years 2018-19 through 2022-23.
- Each academic year is paired with the population and income data of its starting calendar year (2018 for 2018-19, and so on).
- Income and population data come from the CAINC1 data file released by bea.gov.
- Shawano County appears in the Wisconsin AP data but not in the CAINC1 file, so its values were added manually (population from census.gov, income from the Federal Reserve Bank of St. Louis).
- Used geopy to compute distances between counties and universities. For each county, the feature is the average distance to its five closest universities.
### Naïve modelling with statsmodels
- Used 20% of the 350 observations (county-year rows) as testing data and the rest as training data.
- The target variable is the percentage of exams scored 3 or above (the pass rate).
- The features are per capita income, population, and, for each of five university categories (R1/R2, public, private not-for-profit, STEM and land-grant), the average distance to the five closest universities and their average enrollment.
- The full model includes all features.
- The uni-metric model includes only the ten university-related metrics.
- The non-uni model includes only per capita income and population.
- The p-value for the full model compared with the uni-metric model is extremely low, so adding the non-uni features (population and per capita income) to the uni-metric model improves it.
- The p-value for the full model compared with the non-uni model is still small but comparatively larger, so adding the uni-metric features to the non-uni model improves it less.
- The root mean squared error (RMSE) on the testing data shows the same pattern: the full and non-uni models have very similar RMSE, whereas the uni-metric model's RMSE is larger.
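The nested-model comparisons above rest on an F-test between a restricted model and the full model. The following is a minimal sketch of that statistic on synthetic data, using only NumPy. The data, the coefficients, and the helper names `rss` and `nested_f_stat` are illustrative assumptions, not the notebook's actual fit. In statsmodels, calling `compare_f_test` on the full fitted OLS result with the restricted result as its argument reports this statistic together with its p-value.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def nested_f_stat(X_restricted, X_full, y):
    """F statistic for H0: the columns added in X_full have zero coefficients."""
    n = len(y)
    p_full = X_full.shape[1] + 1                    # +1 for the intercept
    q = X_full.shape[1] - X_restricted.shape[1]     # number of added features
    rss_r, rss_f = rss(X_restricted, y), rss(X_full, y)
    return ((rss_r - rss_f) / q) / (rss_f / (n - p_full))

# Synthetic stand-ins (assumed): 2 economic features, 10 university metrics.
rng = np.random.default_rng(0)
n = 280
econ = rng.normal(size=(n, 2))
uni = rng.normal(size=(n, 10))
y = 60 + 5 * econ[:, 0] + 3 * econ[:, 1] + 0.3 * uni[:, 0] + rng.normal(size=n)
full = np.column_stack([econ, uni])

f_add_econ = nested_f_stat(uni, full, y)   # full model vs uni-metric model
f_add_uni = nested_f_stat(econ, full, y)   # full model vs non-uni model
print(f_add_econ, f_add_uni)
```

A larger F statistic corresponds to a smaller p-value, so in this toy setup adding the economic features matters far more than adding the university metrics, which mirrors the pattern described in the bullets.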
### Modelling with sklearn
- Considered ordinary least squares (OLS) linear regression (with the three feature sets above), a PCA-then-linear-regression model (with n_components=0.95) and a Ridge model.
- With n_components=0.95, PCA reduced the 12 features to 7.
- Ran 5-fold cross-validation and compared the models' average root mean squared errors (RMSE).
- The Ridge and full models had the lowest RMSE, followed by the non-uni model and the PCA model.
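The cross-validated comparison above can be sketched as follows on synthetic data. This is only an illustration under stated assumptions: the feature matrix is random, standardizing before PCA and Ridge is a choice made here, and `alpha=1.0` is sklearn's default rather than a value taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumed) for the 12 features and the pass rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(280, 12))
y = 60 + X @ rng.normal(size=12) + rng.normal(size=280)

models = {
    'ols': LinearRegression(),
    # A float n_components keeps enough components to explain 95% of the variance.
    'pca_ols': make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression()),
    'ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=226)
cv_rmse = {}
for name, model in models.items():
    # sklearn reports the negated RMSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error')
    cv_rmse[name] = -scores.mean()
print(cv_rmse)
```

Reusing one `KFold` object for all models means every model is scored on the same five splits, which keeps the averages comparable.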

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from geopy.distance import distance

In [2]:
carnegie = pd.read_csv('data/carnegie_with_location.csv')
wisconsin_2223 = pd.read_excel('data/Wisconsin/Wisconsin.xlsx',sheet_name='2022-23')
uscounties = pd.read_csv('data/uscounties.csv')

In [3]:
carnegie

Unnamed: 0.1,Unnamed: 0,unitid,name,city,stabbr,basic2000,basic2005,basic2010,basic2015,basic2018,...,hbcu,tribal,hsi,msi,womens,rooms,selindex,address,latitude,longitude
0,0,100654,Alabama A & M University,Normal,AL,16,18,18,18,18,...,1,0,0,1,0,3220,1.0,"4900 Meridian St N, Huntsville, AL 35811, USA",34.783841,-86.572224
1,1,100663,University of Alabama at Birmingham,Birmingham,AL,15,15,15,15,15,...,0,0,0,0,0,2882,2.0,"1720 University Blvd, Birmingham, AL 35294, USA",33.502086,-86.805159
2,2,100690,Amridge University,Montgomery,AL,51,24,24,20,20,...,0,0,0,0,0,0,1.0,"1200 Taylor Rd, Montgomery, AL 36117, USA",32.362671,-86.173926
3,3,100706,University of Alabama in Huntsville,Huntsville,AL,16,16,15,16,16,...,0,0,0,0,0,2200,3.0,"Shelby Center for Science and Technology, 301 ...",34.725161,-86.640471
4,4,100724,Alabama State University,Montgomery,AL,21,18,18,19,19,...,1,0,0,1,0,2079,1.0,"915 S Jackson St, Montgomery, AL 36104, USA",32.362976,-86.293980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,3934,496186,California Institute of Arts & Technology-Nati...,National City,CA,-2,-2,-2,-2,-2,...,0,0,0,0,0,0,,"National City, CA, USA",32.678109,-117.099197
3935,3935,999903,Inter-American Defense College,Washington,DC,-2,-2,-2,-2,32,...,0,0,0,0,0,0,0.0,"210 B St, Washington, DC 20319, USA",38.871030,-77.017851
3936,3936,999907,The Judge Advocate General's School,Charlottesville,VA,-2,-2,-2,-2,32,...,0,0,0,0,0,0,0.0,"Judge Advocate General's School, 600 Massie Rd...",38.054027,-78.507671
3937,3937,999909,United States Army War College,Carlisle,PA,-2,-2,-2,-2,32,...,0,0,0,0,0,0,0.0,"651 Wright Ave, Carlisle, PA 17013, USA",40.211661,-77.172440


In [4]:
uscounties.sample(n=10)

Unnamed: 0,county,county_ascii,county_full,county_fips,state_id,state_name,lat,lng,population
969,Harrisonburg,Harrisonburg,Harrisonburg City,51660,VA,Virginia,38.4362,-78.8735,51784
1005,Duplin,Duplin,Duplin County,37061,NC,North Carolina,34.9365,-77.933,49312
3001,Schleicher,Schleicher,Schleicher County,48413,TX,Texas,30.8974,-100.5383,2474
1805,Bourbon,Bourbon,Bourbon County,21017,KY,Kentucky,38.2067,-84.2171,20228
2913,Fisher,Fisher,Fisher County,48151,TX,Texas,32.7428,-100.4022,3680
1383,Marshall,Marshall,Marshall County,21157,KY,Kentucky,36.8835,-88.3294,31706
3068,McCone,McCone,McCone County,30055,MT,Montana,47.6452,-105.7954,1746
532,Potter,Potter,Potter County,48375,TX,Texas,35.4013,-101.8939,117905
1215,Dallas,Dallas,Dallas County,1047,AL,Alabama,32.326,-87.1065,38326
210,Somerset,Somerset,Somerset County,34035,NJ,New Jersey,40.5635,-74.6164,344978


In [5]:
wisconsin_counties = uscounties[uscounties['state_id'] == 'WI']

In [6]:
wisconsin_counties = wisconsin_counties.reset_index()

In [7]:
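# Take an explicit copy so that the feature columns added later do not trigger a SettingWithCopyWarning.
wisconsin_counties = wisconsin_counties[['county','lat','lng','population']].copy()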

In [8]:
wisconsin_counties

Unnamed: 0,county,lat,lng,population
0,Milwaukee,43.0072,-87.9669,933063
1,Dane,43.0673,-89.4181,559891
2,Waukesha,43.0182,-88.3045,407290
3,Brown,44.4525,-88.0037,268393
4,Racine,42.7475,-88.0613,197068
...,...,...,...,...
67,Forest,45.6673,-88.7704,9239
68,Pepin,44.5829,-92.0016,7363
69,Iron,46.2623,-90.2420,6136
70,Florence,45.8485,-88.3981,4574


In [9]:
coord1 = (carnegie.iloc[1].latitude,carnegie.iloc[1].longitude)
coord2 = (wisconsin_counties.iloc[1].lat,wisconsin_counties.iloc[1].lng)

In [10]:
distance(coord1,coord2).miles

674.7673624288187

In [11]:
wisconsin_2223

Unnamed: 0.1,Unnamed: 0,COUNTY,STUDENTS_TESTED,EXAM_COUNT,EXAMS_3_OR_ABOVE,PERCENT_3_OR_ABOVE
0,0,Adams,42,79,22,27.848101
1,1,Ashland,25,48,21,43.750000
2,2,Barron,115,178,93,52.247191
3,3,Bayfield,22,34,23,67.647059
4,4,Brown,1875,2903,2066,71.167757
...,...,...,...,...,...,...
64,64,Waukesha,6439,11372,8537,75.070348
65,65,Waupaca,334,460,243,52.826087
66,66,Waushara,60,90,38,42.222222
67,67,Winnebago,840,1122,780,69.518717


In [12]:
# AP counties whose names have no match in wisconsin_counties.
wisconsin_2223[~wisconsin_2223['COUNTY'].isin(wisconsin_counties['county'])]

Unnamed: 0.1,Unnamed: 0,COUNTY,STUDENTS_TESTED,EXAM_COUNT,EXAMS_3_OR_ABOVE,PERCENT_3_OR_ABOVE
52,52,Saint Croix,1110,1675,1125,67.164179


In [13]:
wisconsin_counties.iloc[10:30]  #Saint Croix is given as St. Croix in "wisconsin_counties" dataframe.

Unnamed: 0,county,lat,lng,population
10,Washington,43.3685,-88.2307,136842
11,La Crosse,43.9066,-91.1152,120216
12,Sheboygan,43.7212,-87.9454,117741
13,Eau Claire,44.7268,-91.286,105697
14,Walworth,42.6685,-88.5419,105127
15,Fond du Lac,43.7536,-88.4883,104027
16,St. Croix,45.034,-92.4526,93752
17,Ozaukee,43.384,-87.9509,91745
18,Dodge,43.4163,-88.7075,89032
19,Jefferson,43.0208,-88.7759,85932


In [14]:
wisconsin_2223 = wisconsin_2223.replace(to_replace='Saint Croix', value='St. Croix')  # Replace 'Saint Croix' by 'St. Croix' to match wisconsin_counties
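The mismatch check and the rename above can be wrapped in a small reusable helper. The sketch below uses toy name lists, and the helper `unmatched` and the `aliases` table are hypothetical names introduced for illustration.

```python
import pandas as pd

def unmatched(names, reference):
    """Return the sorted names that do not appear in the reference collection."""
    names = pd.Series(names, dtype='object')
    return sorted(names[~names.isin(list(reference))].unique())

ap_counties = ['Adams', 'Brown', 'Saint Croix']   # toy AP county names
geo_counties = ['Adams', 'Brown', 'St. Croix']    # toy reference names

print(unmatched(ap_counties, geo_counties))       # ['Saint Croix']

# Apply an explicit alias table, then confirm nothing is left unmatched.
aliases = {'Saint Croix': 'St. Croix'}
fixed = pd.Series(ap_counties).replace(aliases)
assert unmatched(fixed, geo_counties) == []
```

Keeping the spelling fixes in one alias table makes it easier to apply the same corrections to every dataframe that carries county names.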

In [15]:
carnegie[carnegie['basic2021'].isin([15,16])]

Unnamed: 0.1,Unnamed: 0,unitid,name,city,stabbr,basic2000,basic2005,basic2010,basic2015,basic2018,...,hbcu,tribal,hsi,msi,womens,rooms,selindex,address,latitude,longitude
1,1,100663,University of Alabama at Birmingham,Birmingham,AL,15,15,15,15,15,...,0,0,0,0,0,2882,2.0,"1720 University Blvd, Birmingham, AL 35294, USA",33.502086,-86.805159
3,3,100706,University of Alabama in Huntsville,Huntsville,AL,16,16,15,16,16,...,0,0,0,0,0,2200,3.0,"Shelby Center for Science and Technology, 301 ...",34.725161,-86.640471
5,5,100751,The University of Alabama,Tuscaloosa,AL,15,16,16,16,15,...,0,0,0,0,0,8443,2.0,"Tuscaloosa, AL 35487, USA",33.211438,-87.540100
9,9,100858,Auburn University,Auburn,AL,15,16,16,16,15,...,0,0,0,0,0,4823,3.0,"Auburn, AL 36849, USA",32.598055,-85.494267
43,43,102094,University of South Alabama,Mobile,AL,16,18,18,16,16,...,0,0,0,0,0,3217,2.0,"307 N University Blvd, Mobile, AL 36688, USA",30.695941,-88.184236
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3417,3417,445188,University of California-Merced,Merced,CA,-2,-2,-2,16,16,...,0,0,0,0,0,4016,1.0,"5200 Lake Rd, Merced, CA 95343, USA",37.364703,-120.424094
3717,3717,482149,Augusta University,Augusta,GA,21,19,19,16,17,...,0,0,0,0,0,1228,2.0,"1120 15th St, Augusta, GA 30912, USA",33.470909,-81.989885
3744,3744,483124,Arizona State University Digital Immersion,Scottsdale,AZ,-2,-2,-2,17,16,...,0,0,0,0,0,0,0.0,"1151 S Forest Ave, Tempe, AZ, USA",33.422998,-111.927831
3793,3793,486840,Kennesaw State University,Kennesaw,GA,21,18,18,17,16,...,0,0,0,0,0,5116,2.0,"Kennesaw, GA, USA",34.023434,-84.615490


### First, a naive approach: for each category, take the five closest universities and use the average distance to them as a feature.

In [16]:
carnegie_full = pd.read_excel('data/CCIHE2021-PublicData.xlsx',sheet_name='Data')
carnegie_full

Unnamed: 0,unitid,name,city,stabbr,basic2000,basic2005,basic2010,basic2015,basic2018,basic2021,...,satv25,satm25,satcmb25,actcmp25,satacteq25,actfinal,appsf20,admitsf20,pctadmitf20,selindex
0,100654,Alabama A & M University,Normal,AL,16,18,18,18,18,18,...,430.0,410.0,840.0,15.0,15.0,15.000000,9855.0,8835.0,0.896499,1.0
1,100663,University of Alabama at Birmingham,Birmingham,AL,15,15,15,15,15,15,...,560.0,530.0,1090.0,22.0,21.0,21.875310,10391.0,8375.0,0.805986,2.0
2,100690,Amridge University,Montgomery,AL,51,24,24,20,20,20,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,1.0
3,100706,University of Alabama in Huntsville,Huntsville,AL,16,16,15,16,16,15,...,590.0,580.0,1170.0,24.0,24.0,24.000000,5793.0,4467.0,0.771103,3.0
4,100724,Alabama State University,Montgomery,AL,21,18,18,19,19,17,...,438.0,406.0,840.0,14.0,15.0,14.298675,7027.0,6948.0,0.988758,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,496186,California Institute of Arts & Technology-Nati...,National City,CA,-2,-2,-2,-2,-2,11,...,,,,,,,,,,
3935,999903,Inter-American Defense College,Washington,DC,-2,-2,-2,-2,32,32,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0
3936,999907,The Judge Advocate General's School,Charlottesville,VA,-2,-2,-2,-2,32,32,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0
3937,999909,United States Army War College,Carlisle,PA,-2,-2,-2,-2,32,32,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0


In [17]:
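# Copy two columns from the full Carnegie file. The assignment aligns on the row index,
# so it assumes both dataframes list the institutions in the same order.
carnegie['stem_rsd'] = carnegie_full['stem_rsd']
carnegie['anenr1920'] = carnegie_full['anenr1920']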
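Copying columns by position works only while both dataframes keep the same row order. A key-based alternative is to merge on `unitid`. The sketch below uses three institutions whose values are taken from the tables shown in this notebook; the dataframe names `located` and `full` are placeholders.

```python
import pandas as pd

located = pd.DataFrame({'unitid': [100654, 100663, 100706],
                        'latitude': [34.783841, 33.502086, 34.725161]})
# Same institutions, deliberately listed in a different row order.
full = pd.DataFrame({'unitid': [100706, 100654, 100663],
                     'stem_rsd': [50.0, None, 99.0],
                     'anenr1920': [11312, 6560, 25843]})

# validate='one_to_one' raises if unitid is duplicated on either side.
merged = located.merge(full[['unitid', 'stem_rsd', 'anenr1920']],
                       on='unitid', how='left', validate='one_to_one')
print(merged)
```

A left merge keeps the row order of `located`, so each institution receives its own values even though `full` is shuffled.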

In [18]:
def closest_five(carnegie_univ_data, lat, long):
    """Return the average distance (km) to the five closest universities.

    carnegie_univ_data: subset of the carnegie dataframe (with latitude and longitude).
    lat, long: coordinates of the location (usually a county).
    """
    # Geodesic distance from the location to every university in the subset.
    distances = [
        distance((lat, long), (univ_lat, univ_long)).km
        for univ_lat, univ_long in zip(carnegie_univ_data.latitude, carnegie_univ_data.longitude)
    ]
    # Average of the five smallest distances.
    return np.mean(np.sort(distances)[:5])

carnegie_r1r2 = carnegie[carnegie['basic2021'].isin([15,16])]
closest_five(carnegie_r1r2, 40.902771, -73.133850)  # These are the coordinates of Stony Brook.


44.58800478079978
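Calling geopy once per county-university pair is slow when the university list is long. A vectorized alternative is the haversine formula, sketched below with NumPy only. Note the swapped-in technique: haversine assumes a spherical Earth, whereas geopy's `distance` is geodesic on an ellipsoid, so the results differ slightly. The helper names and the toy coordinates are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)

def haversine_km(lat, lng, lats, lngs):
    """Great-circle distance in km from one point to arrays of points."""
    lat, lng = np.radians(lat), np.radians(lng)
    lats = np.radians(np.asarray(lats, dtype=float))
    lngs = np.radians(np.asarray(lngs, dtype=float))
    a = (np.sin((lats - lat) / 2) ** 2
         + np.cos(lat) * np.cos(lats) * np.sin((lngs - lng) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def mean_k_nearest_km(lat, lng, lats, lngs, k=5):
    """Average distance to the k nearest points."""
    d = haversine_km(lat, lng, lats, lngs)
    k = min(k, d.size)
    # np.partition places the k smallest values in the first k slots.
    return float(np.partition(d, k - 1)[:k].mean())

# Toy check: seven points on the equator, 1 to 7 degrees east of the origin.
toy_lats = np.zeros(7)
toy_lngs = np.arange(1, 8)
result = mean_k_nearest_km(0.0, 0.0, toy_lats, toy_lngs)
print(result)
```

On the equator the distance is the radius times the longitude difference in radians, so the five nearest points average to the distance spanned by 3 degrees.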

In [19]:
wisconsin_counties['closest_five_r1r2_avg'] = wisconsin_counties.apply(lambda x: closest_five(carnegie_r1r2,x.lat, x.lng), axis=1)

In [20]:
carnegie_public = carnegie[carnegie['control'] == 1]
wisconsin_counties['closest_five_public_avg'] = wisconsin_counties.apply(lambda x: closest_five(carnegie_public,x.lat, x.lng), axis=1)

In [21]:
carnegie_private_notprofit = carnegie[carnegie['control'] == 2]
wisconsin_counties['closest_five_private_nfp_avg'] = wisconsin_counties.apply(lambda x: closest_five(carnegie_private_notprofit,x.lat, x.lng), axis=1)

In [22]:
carnegie_landgrnt = carnegie[carnegie['landgrnt'] == 1]
wisconsin_counties['closest_five_landgrnt_avg'] = wisconsin_counties.apply(lambda x: closest_five(carnegie_landgrnt,x.lat, x.lng), axis=1)

In [23]:
carnegie

Unnamed: 0.1,Unnamed: 0,unitid,name,city,stabbr,basic2000,basic2005,basic2010,basic2015,basic2018,...,hsi,msi,womens,rooms,selindex,address,latitude,longitude,stem_rsd,anenr1920
0,0,100654,Alabama A & M University,Normal,AL,16,18,18,18,18,...,0,1,0,3220,1.0,"4900 Meridian St N, Huntsville, AL 35811, USA",34.783841,-86.572224,,6560
1,1,100663,University of Alabama at Birmingham,Birmingham,AL,15,15,15,15,15,...,0,0,0,2882,2.0,"1720 University Blvd, Birmingham, AL 35294, USA",33.502086,-86.805159,99.0,25843
2,2,100690,Amridge University,Montgomery,AL,51,24,24,20,20,...,0,0,0,0,1.0,"1200 Taylor Rd, Montgomery, AL 36117, USA",32.362671,-86.173926,,1079
3,3,100706,University of Alabama in Huntsville,Huntsville,AL,16,16,15,16,16,...,0,0,0,2200,3.0,"Shelby Center for Science and Technology, 301 ...",34.725161,-86.640471,50.0,11312
4,4,100724,Alabama State University,Montgomery,AL,21,18,18,19,19,...,0,1,0,2079,1.0,"915 S Jackson St, Montgomery, AL 36104, USA",32.362976,-86.293980,,4640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3934,3934,496186,California Institute of Arts & Technology-Nati...,National City,CA,-2,-2,-2,-2,-2,...,0,0,0,0,,"National City, CA, USA",32.678109,-117.099197,,383
3935,3935,999903,Inter-American Defense College,Washington,DC,-2,-2,-2,-2,32,...,0,0,0,0,0.0,"210 B St, Washington, DC 20319, USA",38.871030,-77.017851,,0
3936,3936,999907,The Judge Advocate General's School,Charlottesville,VA,-2,-2,-2,-2,32,...,0,0,0,0,0.0,"Judge Advocate General's School, 600 Massie Rd...",38.054027,-78.507671,,0
3937,3937,999909,United States Army War College,Carlisle,PA,-2,-2,-2,-2,32,...,0,0,0,0,0.0,"651 Wright Ave, Carlisle, PA 17013, USA",40.211661,-77.172440,,0


In [24]:
carnegie_stem = carnegie[carnegie['stem_rsd'] > 0] # We define STEM institute to be the one offering at least one STEM research/scholarship doctoral degrees.
wisconsin_counties['closest_five_stem_avg'] = wisconsin_counties.apply(lambda x: closest_five(carnegie_stem,x.lat, x.lng), axis=1)

In [25]:
def closest_five_enrollment(carnegie_univ_data, lat, long):
    """Return the average annual enrollment of the five closest universities.

    carnegie_univ_data: subset of the carnegie dataframe (with location and anenr1920).
    lat, long: coordinates of the location (usually a county).
    """
    univ_enrollment = pd.DataFrame({
        'distance': [
            distance((lat, long), (univ_lat, univ_long)).km
            for univ_lat, univ_long in zip(carnegie_univ_data.latitude, carnegie_univ_data.longitude)
        ],
        'enrollment': carnegie_univ_data['anenr1920'].values,
    })
    # Keep the five nearest universities and average their enrollment.
    closest = univ_enrollment.sort_values(by='distance')[:5]
    return np.mean(closest['enrollment'].values)

carnegie_r1r2 = carnegie[carnegie['basic2021'].isin([15,16])]
closest_five_enrollment(carnegie_r1r2, 40.902771, -73.133850)  # These are the coordinates of Stony Brook.

21050.8

In [26]:
def closest_five_rooms(carnegie_univ_data, lat, long):
    """Return the average number of dorm rooms of the five closest universities.

    carnegie_univ_data: subset of the carnegie dataframe (with location and rooms).
    lat, long: coordinates of the location (usually a county).
    """
    univ_rooms = pd.DataFrame({
        'distance': [
            distance((lat, long), (univ_lat, univ_long)).km
            for univ_lat, univ_long in zip(carnegie_univ_data.latitude, carnegie_univ_data.longitude)
        ],
        'rooms': carnegie_univ_data['rooms'].values,
    })
    # Keep the five nearest universities and average their dorm room counts.
    closest = univ_rooms.sort_values(by='distance')[:5]
    return np.mean(closest['rooms'].values)

carnegie_r1r2 = carnegie[carnegie['basic2021'].isin([15,16])]
closest_five_rooms(carnegie_r1r2, 40.902771, -73.133850)  # These are the coordinates of Stony Brook.

4800.2

In [27]:
wisconsin_counties['closest_five_avg_enrollment_r1r2'] = wisconsin_counties.apply(lambda x: closest_five_enrollment(carnegie_r1r2,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_enrollment_public'] = wisconsin_counties.apply(lambda x: closest_five_enrollment(carnegie_public,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_enrollment_private_nfp'] = wisconsin_counties.apply(lambda x: closest_five_enrollment(carnegie_private_notprofit,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_enrollment_landgrnt'] = wisconsin_counties.apply(lambda x: closest_five_enrollment(carnegie_landgrnt,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_enrollment_stem'] = wisconsin_counties.apply(lambda x: closest_five_enrollment(carnegie_stem,x.lat, x.lng), axis=1)

In [28]:
wisconsin_counties['closest_five_avg_dormrooms_r1r2'] = wisconsin_counties.apply(lambda x: closest_five_rooms(carnegie_r1r2,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_dormrooms_public'] = wisconsin_counties.apply(lambda x: closest_five_rooms(carnegie_public,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_dormrooms_private_nfp'] = wisconsin_counties.apply(lambda x: closest_five_rooms(carnegie_private_notprofit,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_dormrooms_landgrnt'] = wisconsin_counties.apply(lambda x: closest_five_rooms(carnegie_landgrnt,x.lat, x.lng), axis=1)
wisconsin_counties['closest_five_avg_dormrooms_stem'] = wisconsin_counties.apply(lambda x: closest_five_rooms(carnegie_stem,x.lat, x.lng), axis=1)
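The three helper functions recompute the same distances for every county and category. One possible refactoring computes the distances once and then averages any number of columns over the nearest rows. The sketch below assumes the distances are already available as an array. The function name `nearest_k_means` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def nearest_k_means(univ, distances, value_cols, k=5):
    """Mean distance and mean of each value column over the k nearest rows."""
    ranked = univ.assign(distance=np.asarray(distances, dtype=float)).nsmallest(k, 'distance')
    return ranked[['distance', *value_cols]].mean()

# Toy universities (made-up values); the sixth one is far away and should be excluded.
univ = pd.DataFrame({
    'unitid': [1, 2, 3, 4, 5, 6],
    'anenr1920': [1000, 2000, 3000, 4000, 5000, 60000],
    'rooms': [100, 200, 300, 400, 500, 6000],
})
distances = [10.0, 20.0, 30.0, 40.0, 50.0, 600.0]  # km, assumed precomputed once

summary = nearest_k_means(univ, distances, ['anenr1920', 'rooms'])
print(summary)
```

Here `summary` holds the average distance, enrollment and dorm rooms of the five nearest rows, so one call per county and category would replace three.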

In [29]:
wisconsin_counties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 19 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   county                                   72 non-null     object 
 1   lat                                      72 non-null     float64
 2   lng                                      72 non-null     float64
 3   population                               72 non-null     int64  
 4   closest_five_r1r2_avg                    72 non-null     float64
 5   closest_five_public_avg                  72 non-null     float64
 6   closest_five_private_nfp_avg             72 non-null     float64
 7   closest_five_landgrnt_avg                72 non-null     float64
 8   closest_five_stem_avg                    72 non-null     float64
 9   closest_five_avg_enrollment_r1r2         72 non-null     float64
 10  closest_five_avg_enrollment_public       72 non-null

In [30]:
import data_loaders

In [31]:
incomes = data_loaders.gimmeCountyIncomes()
incomes = incomes[incomes['State_Abbreviation']=='WI']
incomes = pd.concat([pd.DataFrame([['Shawano','WI',42033,43883,46611,50004,50444]], columns=incomes.columns), incomes], ignore_index=True)


In [32]:
incomes

Unnamed: 0,County,State_Abbreviation,2018,2019,2020,2021,2022
0,Shawano,WI,42033,43883,46611,50004,50444
1,Adams,WI,39048,41159,43993,46112,44696
2,Ashland,WI,38879,40644,43573,46557,46014
3,Barron,WI,47650,49537,54519,56928,58029
4,Bayfield,WI,46098,47552,49990,53818,52963
...,...,...,...,...,...,...,...
68,Waukesha,WI,71073,73569,76931,84113,87582
69,Waupaca,WI,44466,46260,49255,53368,54632
70,Waushara,WI,40346,41135,43781,46976,46697
71,Winnebago,WI,47336,48651,51855,56253,56878


In [33]:
# AP counties whose names have no match in the income table.
wisconsin_2223[~wisconsin_2223['COUNTY'].isin(incomes['County'])]

Unnamed: 0.1,Unnamed: 0,COUNTY,STUDENTS_TESTED,EXAM_COUNT,EXAMS_3_OR_ABOVE,PERCENT_3_OR_ABOVE
52,52,St. Croix,1110,1675,1125,67.164179


In [34]:
population = data_loaders.gimmeCountyPopulation()
population=population[population['State_Abbreviation']=='WI']
population = pd.concat([pd.DataFrame([['Shawano','WI',40725,40794,40873,40812,40886]], columns=population.columns), population], ignore_index=True)
population

Unnamed: 0,County,State_Abbreviation,2018,2019,2020,2021,2022
0,Shawano,WI,40725,40794,40873,40812,40886
1,Adams,WI,20533,20431,20675,20795,21226
2,Ashland,WI,16013,16021,16018,16054,16039
3,Barron,WI,46385,46645,46714,46746,46843
4,Bayfield,WI,15872,16062,16233,16304,16608
...,...,...,...,...,...,...,...
68,Waukesha,WI,403616,405563,407467,409080,410434
69,Waupaca,WI,51917,51830,51791,51992,51488
70,Waushara,WI,24401,24563,24549,24797,24999
71,Winnebago,WI,170879,171875,171800,170554,170718


In [35]:
wisconsin_data = {'county':[],'year':[],'population':[],'per_capita_income':[]}
years = ['2018','2019','2020','2021','2022']
for county in incomes['County'].values:
    for year in years:
        wisconsin_data['county'] = wisconsin_data['county']+[county]
        wisconsin_data['year'] = wisconsin_data['year']+[year]
        wisconsin_data['population'] = wisconsin_data['population']+[int(population[population['County'] == county][year].values[0])]
        wisconsin_data['per_capita_income']= wisconsin_data['per_capita_income']+[int(incomes[incomes['County'] == county][year].values[0])]
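The nested loop above reshapes two wide tables (one column per year) into one long table (one row per county and year). The same reshaping can be sketched with `melt` and `merge`. The example uses two counties and two years with values from the tables above, and the helper `to_long` is a hypothetical name.

```python
import pandas as pd

incomes = pd.DataFrame({'County': ['Adams', 'Ashland'],
                        'State_Abbreviation': ['WI', 'WI'],
                        '2018': [39048, 38879], '2019': [41159, 40644]})
population = pd.DataFrame({'County': ['Adams', 'Ashland'],
                           'State_Abbreviation': ['WI', 'WI'],
                           '2018': [20533, 16013], '2019': [20431, 16021]})

def to_long(wide, value_name):
    """Turn the year columns into rows: one row per (county, year)."""
    return wide.melt(id_vars=['County', 'State_Abbreviation'],
                     var_name='year', value_name=value_name)

long_data = (to_long(population, 'population')
             .merge(to_long(incomes, 'per_capita_income'),
                    on=['County', 'State_Abbreviation', 'year'],
                    validate='one_to_one')
             .rename(columns={'County': 'county'})
             .drop(columns='State_Abbreviation')
             .sort_values(['county', 'year'], ignore_index=True))
print(long_data)
```

The `validate='one_to_one'` argument makes the merge fail loudly if a county-year pair is duplicated in either table.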

In [36]:
wisconsin_data = pd.DataFrame(wisconsin_data)
wisconsin_data

Unnamed: 0,county,year,population,per_capita_income
0,Shawano,2018,40725,42033
1,Shawano,2019,40794,43883
2,Shawano,2020,40873,46611
3,Shawano,2021,40812,50004
4,Shawano,2022,40886,50444
...,...,...,...,...
360,Wood,2018,74085,45262
361,Wood,2019,74205,47142
362,Wood,2020,74197,49588
363,Wood,2021,74085,52678


In [37]:
int(population[population['County'] == 'Adams']['2018'].values[0])

20533

In [38]:
wisconsin_ap = pd.read_csv('data/Wisconsin/Wisconsin_combined.csv')

In [39]:
wisconsin_ap

Unnamed: 0,COUNTY,STUDENTS_TESTED,EXAM_COUNT,EXAMS_3_OR_ABOVE,PERCENT_3_OR_ABOVE,Year
0,Adams,45,76,22,28.947368,2018-19
1,Ashland,30,40,25,62.500000,2018-19
2,Barron,132,196,101,51.530612,2018-19
3,Bayfield,10,14,9,64.285714,2018-19
4,Brown,2207,3378,2356,69.745411,2018-19
...,...,...,...,...,...,...
344,Waukesha,6439,11372,8537,75.070348,2022-23
345,Waupaca,334,460,243,52.826087,2022-23
346,Waushara,60,90,38,42.222222,2022-23
347,Winnebago,840,1122,780,69.518717,2022-23


In [40]:
def clean_year(year):
    # Keep only the starting calendar year, e.g. '2018-19' -> '2018'.
    return year[:-3]
wisconsin_ap['Year']=wisconsin_ap['Year'].apply(clean_year)

In [41]:
list(wisconsin_ap[(wisconsin_ap['COUNTY'] == 'Adams') & (wisconsin_ap['Year'] == '2018')].COUNTY.values)

['Adams']

In [42]:
wisconsin_counties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 19 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   county                                   72 non-null     object 
 1   lat                                      72 non-null     float64
 2   lng                                      72 non-null     float64
 3   population                               72 non-null     int64  
 4   closest_five_r1r2_avg                    72 non-null     float64
 5   closest_five_public_avg                  72 non-null     float64
 6   closest_five_private_nfp_avg             72 non-null     float64
 7   closest_five_landgrnt_avg                72 non-null     float64
 8   closest_five_stem_avg                    72 non-null     float64
 9   closest_five_avg_enrollment_r1r2         72 non-null     float64
 10  closest_five_avg_enrollment_public       72 non-null

In [43]:
wisconsin_ap=wisconsin_ap.replace(to_replace='Saint Croix',value='St. Croix')
wisconsin_data=wisconsin_data.replace(to_replace='Saint Croix',value='St. Croix')
# Sanity check: collect the AP rows whose county has no match in wisconsin_counties.
n=[]
for i in wisconsin_ap.index:
    county = wisconsin_ap.iloc[i].COUNTY
    if list(wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_r1r2_avg.values) == []:
        n=n+[i]
print(n)

[]


In [44]:
list(wisconsin_counties.columns)

['county',
 'lat',
 'lng',
 'population',
 'closest_five_r1r2_avg',
 'closest_five_public_avg',
 'closest_five_private_nfp_avg',
 'closest_five_landgrnt_avg',
 'closest_five_stem_avg',
 'closest_five_avg_enrollment_r1r2',
 'closest_five_avg_enrollment_public',
 'closest_five_avg_enrollment_private_nfp',
 'closest_five_avg_enrollment_landgrnt',
 'closest_five_avg_enrollment_stem',
 'closest_five_avg_dormrooms_r1r2',
 'closest_five_avg_dormrooms_public',
 'closest_five_avg_dormrooms_private_nfp',
 'closest_five_avg_dormrooms_landgrnt',
 'closest_five_avg_dormrooms_stem']

In [45]:
years = ['2018','2019','2020','2021','2022']
lat = []
long = []
pop = []
pci = []
r1r2 = []
public = []
private_notprofit = []
landgrnt  = []
stem  = []
enrollment_r1r2 = []
enrollment_public = []
enrollment_private_nfp = []
enrollment_landgrnt = []
enrollment_stem = []
rooms_r1r2 = []
rooms_public = []
rooms_private_nfp = []
rooms_landgrant = []
rooms_stem = []


for i in wisconsin_ap.index:
    county = wisconsin_ap.iloc[i].COUNTY
    year = wisconsin_ap.iloc[i].Year
    lat = lat + [wisconsin_counties[(wisconsin_counties['county']==county)].lat.values[0]]
    long = long + [wisconsin_counties[(wisconsin_counties['county']==county)].lng.values[0]]
    pop = pop + [wisconsin_data[(wisconsin_data['county']==county) & (wisconsin_data['year'] == year)].population.values[0]]
    pci = pci + [wisconsin_data[(wisconsin_data['county']==county) & (wisconsin_data['year'] == year)].per_capita_income.values[0]]
    r1r2 = r1r2 + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_r1r2_avg.values[0]]
    public = public + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_public_avg.values[0]]
    private_notprofit = private_notprofit + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_private_nfp_avg.values[0]]
    landgrnt = landgrnt + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_landgrnt_avg.values[0]]
    stem = stem + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_stem_avg.values[0]]
    enrollment_r1r2 = enrollment_r1r2 + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_enrollment_r1r2.values[0]]
    enrollment_public = enrollment_public + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_enrollment_public.values[0]]
    enrollment_private_nfp = enrollment_private_nfp + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_enrollment_private_nfp.values[0]]
    enrollment_landgrnt = enrollment_landgrnt + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_enrollment_landgrnt.values[0]]
    enrollment_stem = enrollment_stem + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_enrollment_stem.values[0]]
    rooms_r1r2 = rooms_r1r2 + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_dormrooms_r1r2.values[0]]
    rooms_public = rooms_public + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_dormrooms_public.values[0]]
    rooms_private_nfp = rooms_private_nfp + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_dormrooms_private_nfp.values[0]]
    rooms_landgrant = rooms_landgrant + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_dormrooms_landgrnt.values[0]]
    rooms_stem = rooms_stem + [wisconsin_counties[(wisconsin_counties['county']==county)].closest_five_avg_dormrooms_stem.values[0]]

wisconsin_ap['Latitude'] = lat
wisconsin_ap['Longitude'] = long
wisconsin_ap['population'] = pop
wisconsin_ap['per_capita_income'] = pci
wisconsin_ap['closest_five_r1r2_avg'] = r1r2
wisconsin_ap['closest_five_public_avg'] = public
wisconsin_ap['closest_five_private_nfp_avg'] = private_notprofit
wisconsin_ap['closest_five_landgrnt_avg'] = landgrnt
wisconsin_ap['closest_five_stem_avg'] = stem
wisconsin_ap['closest_five_avg_enrollment_r1r2'] = enrollment_r1r2
wisconsin_ap['closest_five_avg_enrollment_public'] = enrollment_public
wisconsin_ap['closest_five_avg_enrollment_private_nfp'] = enrollment_private_nfp
wisconsin_ap['closest_five_avg_enrollment_landgrnt'] = enrollment_landgrnt
wisconsin_ap['closest_five_avg_enrollment_stem'] = enrollment_stem
wisconsin_ap['closest_five_avg_dormrooms_r1r2'] = rooms_r1r2
wisconsin_ap['closest_five_avg_dormrooms_public'] = rooms_public
wisconsin_ap['closest_five_avg_dormrooms_private_nfp'] = rooms_private_nfp
wisconsin_ap['closest_five_avg_dormrooms_landgrant'] = rooms_landgrant
wisconsin_ap['closest_five_avg_dormrooms_stem'] = rooms_stem
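The row-by-row lookup above is in effect two joins: one on (county, year) for the yearly economic data and one on county for the fixed university metrics. A minimal sketch with `merge` follows. The AP, population and income values come from the tables above, while the `closest_five_r1r2_avg` values are made up for illustration.

```python
import pandas as pd

ap = pd.DataFrame({'COUNTY': ['Adams', 'Waukesha', 'Waukesha'],
                   'Year': ['2018', '2018', '2019'],
                   'PERCENT_3_OR_ABOVE': [28.947368, 72.768352, 73.109682]})
yearly = pd.DataFrame({'county': ['Adams', 'Waukesha', 'Waukesha'],
                       'year': ['2018', '2018', '2019'],
                       'population': [20533, 403616, 405563],
                       'per_capita_income': [39048, 71073, 73569]})
county_features = pd.DataFrame({'county': ['Adams', 'Waukesha'],
                                'closest_five_r1r2_avg': [150.0, 90.0]})  # made-up values

joined = (ap
          # Yearly data: match on county and year.
          .merge(yearly, left_on=['COUNTY', 'Year'], right_on=['county', 'year'],
                 how='left', validate='many_to_one')
          .drop(columns=['county', 'year'])
          # Time-invariant university metrics: match on county only.
          .merge(county_features, left_on='COUNTY', right_on='county',
                 how='left', validate='many_to_one')
          .drop(columns='county'))
print(joined)
```

One difference from the loop: a missing match raises an IndexError at `.values[0]` in the loop, whereas a left merge leaves NaN, so a check such as `joined.isna().any()` is worth running afterwards.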

In [46]:
wisconsin_ap[wisconsin_ap['COUNTY'] == 'Waukesha']

Unnamed: 0,COUNTY,STUDENTS_TESTED,EXAM_COUNT,EXAMS_3_OR_ABOVE,PERCENT_3_OR_ABOVE,Year,Latitude,Longitude,population,per_capita_income,...,closest_five_avg_enrollment_r1r2,closest_five_avg_enrollment_public,closest_five_avg_enrollment_private_nfp,closest_five_avg_enrollment_landgrnt,closest_five_avg_enrollment_stem,closest_five_avg_dormrooms_r1r2,closest_five_avg_dormrooms_public,closest_five_avg_dormrooms_private_nfp,closest_five_avg_dormrooms_landgrant,closest_five_avg_dormrooms_stem
65,Waukesha,6567,11729,8535,72.768352,2018,43.0182,-88.3045,403616,71073,...,27215.0,15274.4,1273.2,30553.4,23576.2,5328.0,1596.6,517.0,8082.0,4383.4
135,Waukesha,6586,11506,8412,73.109682,2019,43.0182,-88.3045,405563,73569,...,27215.0,15274.4,1273.2,30553.4,23576.2,5328.0,1596.6,517.0,8082.0,4383.4
205,Waukesha,6409,11076,7838,70.765619,2020,43.0182,-88.3045,407467,76931,...,27215.0,15274.4,1273.2,30553.4,23576.2,5328.0,1596.6,517.0,8082.0,4383.4
275,Waukesha,6329,11029,8202,74.367576,2021,43.0182,-88.3045,409080,84113,...,27215.0,15274.4,1273.2,30553.4,23576.2,5328.0,1596.6,517.0,8082.0,4383.4
344,Waukesha,6439,11372,8537,75.070348,2022,43.0182,-88.3045,410434,87582,...,27215.0,15274.4,1273.2,30553.4,23576.2,5328.0,1596.6,517.0,8082.0,4383.4


In [47]:
wisconsin_data[wisconsin_data['county'] == 'Waukesha']

Unnamed: 0,county,year,population,per_capita_income
340,Waukesha,2018,403616,71073
341,Waukesha,2019,405563,73569
342,Waukesha,2020,407467,76931
343,Waukesha,2021,409080,84113
344,Waukesha,2022,410434,87582


In [48]:
#wisconsin_ap.to_csv('Wisconsin_closest_five_method.csv')

### "wisconsin_ap" is the dataframe that we will use for statistical analysis.

In [49]:
import seaborn as sns

In [50]:
from sklearn.model_selection import train_test_split
training, testing = train_test_split(wisconsin_ap, test_size = 0.2, random_state = 226)

In [None]:
#training.to_csv('data/Wisconsin/train_test_split/training.csv')


In [None]:
#testing.to_csv('data/Wisconsin/train_test_split/testing.csv')