### The method for inverse distance

In this notebook, I calculate the distance-weighted number of R1/R2 universities

The method is as follows:

- I find out longitudes and latitudes of all the R1/R2 universities in MA and all four nearby states (NY, CT, NH, RI).
- For the MA scenario, there are 38 universities I have to consider. (38 universities distributed in MA, NY, CT, NH, RI.)
- Then for each school district (call it District #1), I calculate the inverse distance quantities for District #1 with each of the 38 universities. I then sum over all 38 of them. This sum is interpreted as the distance-weighted number of R1/R2 universities for the District #1.
- To be specific about point (3), the inverse distance quantity between a school district and a university is calculated as

\begin{equation}
    {\rm inverse~distance~quantity} = \frac{1}{ \frac{\rm distance}{\rm 10~miles} + 1}
\end{equation}

- After points (3) and (4), I then repeat the same calculation for the second district (District #2), the third district (District #3), ...., all the way to the last district in MA.
- After all the calculations are done, I add all the inverse distance quantities into the existing dataframe. I call this new column the "Inverse Distance R1R2".



Note that the softening factor I used for MA is 10 miles. The reason for choosing this value is because:
- The typical school district size in MA is about 10 miles wide, 
- I imagine the typical commute distance in MA is roughly 10 miles
- For other states such as GA or WI, a proper values can be the typical size of the county in that state.
10 miles as the softening factor for GA and WI may be too small as GA and WI are big states.

In [1]:
import pandas as pd
import numpy as np
from geopy import distance

In [2]:
MA_district_coordinates = pd.read_excel('data/MA_AP_performance/MA_district_coordinates.xlsx')

def distance_district_univeristy(district_name, university_name):
    district_coor = MA_district_coordinates[MA_district_coordinates['District Name'] == district_name][['LONGITUD', 'LATITUDE']].to_numpy()[0][::-1]
    university_coor = university_coordinate_df[university_coordinate_df['INSTNM'] == university_name][['LONGITUD', 'LATITUDE']].to_numpy()[0][::-1]
    
    return distance.distance(district_coor, university_coor).miles    ## return unit is miles

def inverse_distance_weight(distance_miles):
    '''
    The softening factor I used for MA is 10 miles. The reason for choosing this value is because:
    (1) The typical school district size in MA is about 10 miles wide, 
    (2) I imagine the typical commute distance in MA is roughly 10 miles

    For other states such as GA or WI, a proper values can be the typical size of the county in that state.
    10 miles as the softening factor for GA and WI may be too small as GA and WI are big states.
    '''
    epsilon_soften_factor = 10.0  ## [miles]
    
    return 1.0 / ( (distance_miles/epsilon_soften_factor) + 1.0 )     ## there is no return unit

In [3]:
## search R1/R2 universities in MA and all four nearby states (NY, CT, NH, RI)
state_name_abbrev_arr = ['MA', 'NY', 'CT', 'NH', 'RI', 'ME', 'VT', 'NH']
university_coordinate_df = []

for state_abbrev in state_name_abbrev_arr:
    university_coordinate_df += [pd.read_excel('data/MA_AP_performance/hd2023_coordinate/hd2023_R1R2_data_%s.xlsx'%state_abbrev)]
university_coordinate_df = pd.concat(university_coordinate_df, ignore_index=True)

university_coordinate_df

Unnamed: 0,UNITID,INSTNM,CITY,COUNTYNM,State,LONGITUD,LATITUDE,R1R2,Annual enrollment,Number of dorm beds
0,164924,Boston College,Chestnut Hill,Middlesex County,MA,-71.169242,42.336213,True,16502,7611
1,164988,Boston University,Boston,Suffolk County,MA,-71.107942,42.351118,True,42047,10551
2,165015,Brandeis University,Waltham,Middlesex County,MA,-71.260155,42.365727,True,6403,2950
3,165334,Clark University,Worcester,Worcester County,MA,-71.823356,42.249987,True,3880,1523
4,166027,Harvard University,Cambridge,Middlesex County,MA,-71.118313,42.374471,True,41024,13694
5,166513,University of Massachusetts-Lowell,Lowell,Middlesex County,MA,-71.326809,42.652864,True,22192,4787
6,166629,University of Massachusetts-Amherst,Amherst,Hampshire County,MA,-72.526728,42.385999,True,35781,14015
7,166638,University of Massachusetts-Boston,Boston,Suffolk County,MA,-71.036865,42.312881,True,19107,1077
8,166683,Massachusetts Institute of Technology,Cambridge,Middlesex County,MA,-71.093226,42.359243,True,12195,5965
9,167358,Northeastern University,Boston,Suffolk County,MA,-71.088782,42.339992,True,30003,10011


# Calculate inverse distance for each school district in MA

In [4]:
MA_AP_performance_year = pd.read_excel('data/MA_AP_performance/AP_performance_18_22.xlsx', sheet_name='2022-23').reset_index(drop=True)

inverse_distance_weight_arr = []
for district_name in MA_AP_performance_year['District Name'].tolist():
    total_val = 0
    for university_name in university_coordinate_df['INSTNM'].tolist():
        distance_ij = distance_district_univeristy(district_name, university_name)
        total_val += inverse_distance_weight(distance_ij)
    inverse_distance_weight_arr.append(total_val)

MA_AP_performance_year['Inverse Distance R1R2'] = inverse_distance_weight_arr
MA_AP_performance_year = MA_AP_performance_year.sort_values(by='District Name').reset_index(drop=True)

In [5]:
MA_AP_performance_year

Unnamed: 0,District Name,District Code,Tests Taken,Score=1,Score=2,Score=3,Score=4,Score=5,% Score 1-2,% Score 3-5,Inverse Distance R1R2
0,Abington,10000,164,40,49,51,21,3,54.3,45.7,5.742173
1,Academy Of the Pacific Rim Charter Public (Dis...,4120000,154,120,25,5,3,1,94.2,5.8,7.525006
2,Acton-Boxborough,6000000,1442,22,79,229,473,639,7.0,93.0,6.292539
3,Advanced Math and Science Academy Charter (Dis...,4300000,387,17,51,87,118,114,17.6,82.4,6.140166
4,Agawam,50000,423,60,104,140,82,37,38.8,61.2,4.259957
...,...,...,...,...,...,...,...,...,...,...,...
282,Winchendon,3430000,53,10,16,10,14,3,49.1,50.9,4.547397
283,Winchester,3440000,1165,28,115,289,365,368,12.3,87.7,7.940076
284,Winthrop,3460000,319,157,76,54,29,3,73.0,27.0,7.601140
285,Woburn,3470000,397,103,68,108,79,39,43.1,56.9,7.442246
