# MSc. AI
## Capstone Project
### Darragh Minogue

### 1. Background


This project adapts the facility location problem to the school network in Ethiopia in order to expand access to secondary education. Of the 40,776 schools in the country, 37,039 offer primary and middle school education (grades 1-8), but only 3,737 offer secondary (grades 9-12). On average, middle schools are located more than 6km from the nearest secondary school, meaning children face long journeys or do not access secondary education at all.

The expansion of secondary education is a major priority for the Ministry of Education, Ethiopia. Ambitious five-year targets are set in the Education Sector Development Plans (ESDP). In 2013/2014, the Gross Enrollment Rate (GER) for lower secondary was 38.9% (3,466,972/8,914,837 children) and a target of 74% (6,596,979/8,914,837 children) was set. However, by 2019/2020, the actual GER was 51.05% (4,551,024/8,914,837 children), a more modest increase of 12.15 percentage points rather than the targeted 35.1. In the latest five-year ESDP (2020/2021-2024/2025), an increase of 24% was set, but the key performance indicator 'Number of newly established secondary schools' is listed as TBD ('To Be Determined'). Therein lies part of the problem: planning. Often, a portion of the budget is allocated for secondary education, and each region then decides how and where to spend it. Planners at the macro level identify primary schools with the highest demand (i.e. largest enrollment) and seek to construct schools to serve that particular school population. While this can often yield good results, the goal of this project is to improve on this approach by allowing secondary schools to be constructed across a catchment of multiple primary schools, where enrollment is sufficient and distance is minimal. Meta-heuristic optimisation techniques are used to achieve this.

The project makes use of readily available data from the Ministry of Education, Ethiopia, mainly 1) enrollment data from the Education Management Information System (EMIS) and 2) geolocation data, obtained from the World Bank in 2020. Since national population and census data in Ethiopia is inaccurate and outdated, it was not considered a reliable source for this project. Instead, it is assumed that middle schools are located within the community and that new secondary schools should therefore be constructed close to existing middle schools.

### 2. The Problem

In the problem, there is a list of existing middle schools and secondary schools in Ethiopia, along with their enrollment and location data (region, zone, district, longitude and latitude). The aim is to expand the secondary school network by constructing new secondary schools that best serve the demand from the existing middle school network. No limit is placed on capacity, as this is determined by budget, which is not known in advance and is subject to change. Key to solving the problem is understanding that when a school is located within a village, distance has no impact on enrollment or dropout. However, beyond the aspirational norm of 1-2km, distance affects initial access to school and also creates barriers to retention, completion and transition to higher levels. As such, the aim is to minimise distance while constructing schools in locations with the highest demand.

### 3. Approach

<b> Glossary of terms</b>
* MS: Middle Schools
* SS: Secondary Schools
* EE: Expected Enrollment
* EEI: Expected Enrollment Increase
* Feeder: a middle school that is part of the secondary school catchment area.
* x: proposed new secondary schools (the genotype).
* d: distance in km between each middle school and each proposed secondary school
* catchment: a middle school is considered within a catchment if it's within 5km of the secondary school
* proposed_schools: total number of new schools to construct

This solution proposes new SS locations that provide the highest estimated enrollment from feeder MS while keeping the distance to those feeders minimal. Given the many languages spoken in Ethiopia and its decentralised government, models are developed at a regional level.

Two different algorithms are used in this project to find an optimal solution: Random Search, and CMA-ES. In each of the algorithms, the objective function f() aims to <b> maximise overall expected enrollment </b> given a set of feasible locations: x. Feasibility of the locations or the genotype, x, is controlled using a box boundary of longitudes and latitudes for a given region. Initial starting points are also provided within the box boundary using a generate_random_sp() helper function. 

The fitness is determined using an expected_enrollment function. This takes four parameters: 1) the location of existing MS, 2) the enrollment of existing MS, 3) x, and 4) the current distance between existing MS and existing SS. The function then calculates the haversine distance in kilometers (km) between each new SS in x and all MS within a given region. This matrix, <b> d </b>, has shape (proposed_schools, len(ps)), where ps is the list of MS. Using this data, the algorithm then addresses three key challenges:

1) **The need to identify the MS with the minimal distance to the new SS**. In this case, if a MS is close to two or more new SS, only the closest is selected. But if the MS is closer to an existing SS, then it's ignored as it's already a feeder school to the existing SS.

2) **The need to estimate the expected enrollment of the new SS, based on the distance to its feeder MS**. If a MS is beyond a certain threshold distance in km, it should not be considered a feeder school for a SS. To handle this situation, a helper function is used to estimate the expected enrollment from a feeder MS to the nearby SS. It takes as parameters a) distance to nearest SS and b) MS enrollment. It then makes some assumptions about distance to return the expected enrollment per school. Theunynck (2014) recommends the adoption of a norm of 2 or 3km for junior secondary schools, and the function therefore assumes that if a school is constructed within 3km of a MS, there is no negative effect on the expected enrollment of its nearby SS. Between 3km and 5km, a linear dropoff is assumed. If a MS is located more than 5km away from the proposed new location, it is expected that zero students will attend from that feeder school.

3) **The need to divert expected enrollment from a MS if it is currently a feeder school to an existing SS but the new SS proposed is closer**. This is dealt with by subtracting the expected enrollment from the MS to the existing SS using the shape function, and then allocating it to the new SS which is closer. 
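
The drop-off described in point 2 can be written as a piecewise function. Here $e$ is the feeder MS enrollment and $d$ the distance in km to the nearest SS. The symbol $s(d, e)$ is notation introduced for this sketch, and the thresholds are the 3km and 5km values stated above.

$$
s(d, e) =
\begin{cases}
e & \text{if } d < 3 \\
e \left(1 - \dfrac{d - 3}{5 - 3}\right) & \text{if } 3 \le d \le 5 \\
0 & \text{if } d > 5
\end{cases}
$$

For example, a MS with 100 pupils located 4km from a new SS contributes $100 \times (1 - 1/2) = 50$ expected pupils.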

A fitness evaluation budget of 10,000 is set to ensure there are sufficient iterations to achieve convergence. Each algorithm is run 30 times with different random starting points. The results are stored in a csv file, with the top 4 results plotted below. 

### Assumptions and Limitations

1. It is assumed that 1km is equivalent to a 15 minute walk for children (Theunynck, S. 2014: p6).
2. Distance is calculated using the haversine function, i.e. as a straight line, so actual travel distances could be longer. Ethiopia is not well mapped and most children walk to school, so tools like the Google Maps API don't yield useful results on a large scale and don't account for informal walking routes. The final results should therefore be treated as an approximation and inspected closely, using a tool like ArcGIS or QGIS, for elevation and other obstacles that might affect walking distance or construction, e.g. buildings and rivers.
3. It is assumed that children beyond 5km are not likely to attend, but in some cases, this is not true. Some children walk extremely long distances to attend secondary school, while others stay with relatives or family friends to attend a school that is beyond 5km. Despite this being a reality, this shouldn't guide the construction as the goal is to minimise distance and create more equitable access to secondary. 
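
Assumption 2 relies on straight-line (great-circle) distance. As a minimal sketch of what that calculation involves, the block below implements the haversine formula directly in NumPy rather than through the `haversine` package used in this notebook. The function name `haversine_km`, the mean Earth radius of 6371.0088km and the sample coordinates are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch

def haversine_km(p1, p2):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1 = np.radians(np.asarray(p1, dtype=float)).T
    lat2, lon2 = np.radians(np.asarray(p2, dtype=float)).T
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Two illustrative points 0.01 degrees of latitude apart (roughly 1.1 km).
print(haversine_km((10.70, 36.70), (10.71, 36.70)))
```

Because the result is a straight line over the surface, it is a lower bound on the walking distance a child would actually cover.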

In [1]:
# Key imports
import pandas as pd
import numpy as np
from haversine import haversine, haversine_vector, Unit # for distance
import geopandas as gpd
import matplotlib.pyplot as plt
import cma
import time

# Suppress scientific notation in numpy output for easier reading.
np.set_printoptions(suppress=True)

In [2]:
df = pd.read_csv('data/clean_dataset_final.csv', converters={'point': pd.eval})

In [None]:
# Specify which test to perform: 1) basic test with 2 schools, 2) basic test with 1 district, 3) regional test.
declare_test = 3

In [None]:
# Declare key variables according to the test being performed.
region = 'Amhara' # Test Amhara region
woreda = 'ET030908' # for testing of one woreda/district in Amhara.

if declare_test == 1: # for micro test
    # read in the prepared dataset. Evaluate point data to make it readable by geopandas
    df = pd.read_csv('data/test_dataset2.csv', converters={'point': pd.eval})
    proposed_schools= 2
    gdf_woreda = gpd.read_file('eth_shape_files/json/eth_admin3v2.json')
    gdf_woreda_shp = gdf_woreda.loc[gdf_woreda['ADM3_PCODE']==woreda]['geometry'].reset_index(drop=True)
    df = df.loc[df['ADM3_PCODE'] == woreda]
    bounds = gdf_woreda_shp.bounds
elif declare_test == 2: # district or woreda test.
    df = pd.read_csv('data/test_dataset.csv', converters={'point': pd.eval})
    proposed_schools= 5
    gdf_woreda = gpd.read_file('eth_shape_files/json/eth_admin3v2.json')
    gdf_woreda_shp = gdf_woreda.loc[gdf_woreda['ADM3_PCODE']==woreda]['geometry'].reset_index(drop=True)
    df = df.loc[df['ADM3_PCODE'] == woreda]
    bounds = gdf_woreda_shp.bounds
else: # for regional test
    proposed_schools= 5
    df = pd.read_csv('data/clean_dataset.csv', converters={'point': pd.eval})
    # limit geojson to only selected region
    # limit clean dataset to only selected region
    gdf_region = gpd.read_file('eth_shape_files/json/eth_admin1v2.json') # read in geojson
    gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)
    df = df.loc[df['region'] == region]
    bounds = gdf_region_shp.bounds 

In [None]:
df.head(5)

In [None]:
df.columns

In [None]:
# Establish boundaries based on the bounds of region or woreda.
# Latitude is the Y axis, longitude is the X axis.

lat_bounds = bounds[['miny','maxy']].to_numpy(dtype=float)[0]
lon_bounds = bounds[['minx','maxx']].to_numpy(dtype=float)[0]
bounds = np.array([[lat_bounds[0], lon_bounds[0]], [lat_bounds[1], lon_bounds[1]]])
# array - [[lower lat bounds, lower lon bounds],[upper lat bounds, upper lon bounds]]
# CMA expects a list of size 2 for bounds
x1y1 = np.repeat([bounds[0,:]],proposed_schools, axis=0).flatten()
x2y2 = np.repeat([bounds[1,:]],proposed_schools, axis=0).flatten()
boundsxy = [x1y1,x2y2]
boundsxy
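
To make the layout of `boundsxy` concrete, here is a small sketch with made-up numbers: a hypothetical bounding box and two proposed schools. It shows that the flattened genotype is ordered `[lat_1, lon_1, lat_2, lon_2, ...]` and that each bound vector repeats the `(lat, lon)` limits once per school. The coordinates and the `toy_` names are assumptions for illustration only.

```python
import numpy as np

toy_schools = 2
# Hypothetical box: [[min lat, min lon], [max lat, max lon]]
toy_bounds = np.array([[10.0, 36.0], [12.0, 38.0]])

toy_lower = np.repeat([toy_bounds[0, :]], toy_schools, axis=0).flatten()
toy_upper = np.repeat([toy_bounds[1, :]], toy_schools, axis=0).flatten()
print(toy_lower)  # lower limits, one (lat, lon) pair per school
print(toy_upper)  # upper limits, same ordering

# A flat candidate solution reshapes back to one (lat, lon) row per school.
toy_x = np.array([10.5, 36.2, 11.7, 37.9])
toy_x_2d = toy_x.reshape(toy_schools, 2)
inside = np.all((toy_x >= toy_lower) & (toy_x <= toy_upper))
print(toy_x_2d, inside)
```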

In [None]:
# Create subset arrays required as input for enrollment function.
# 1. Primary school enrollment data. Only the last two grades as predictors for the two grades of lower secondary.
# 2. Primary school location data: lat lon point data. 
# 3. Secondary school location data: lat lon point data. 
# 4. Secondary school enrollment data. Only grades 9 and 10 enrollment. 

df_prim = df.loc[df['grade7_8'] > 0]
df_prim_enroll = df_prim['grade7_8'].reset_index(drop=True).to_numpy(dtype=float)
df_prim_loc = df_prim['point'].reset_index(drop=True).to_numpy()
df_prim_loc = np.array([np.array(i) for i in df_prim_loc], dtype=float)

df_sec = df.loc[ (df['gr_offer'] == 'G. 9-10') | (df['gr_offer'] == 'G. 9-12')]
df_sec_loc = df_sec['point'].reset_index(drop=True).to_numpy()
df_sec_enroll = df_sec['grade9_10'].reset_index(drop=True).to_numpy(dtype=float)
df_sec_loc = np.array([np.array(i) for i in df_sec_loc], dtype=float)

current_ps_distance = df_prim['nearest_lwr_sec'].to_numpy() # existing distance to secondary school

In [None]:
df_prim.value_counts

## Helper Functions

In [None]:
# Function to shape expected enrollment. 
def shape(distance, enrollment):
    # If less than 3km, all children expected to attend secondary i.e. distance not a factor
    min_walk = 3
    max_walk = 5 # distance greater than 5km (1hr 15 mins) assumes school too far, and zero enrollment expected.
    # if between 3-5km, return a linear dropoff.
    return np.where(distance<min_walk, enrollment,
             np.where(distance>max_walk, 0,
                     enrollment*(1-(distance-min_walk)/(max_walk-min_walk)))
            )

# Example below
shape(4, 100)
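
As a cross-check on the drop-off above, the same curve can be written with `np.interp`, which clamps to the end values outside the 3-5km range. This is an alternative formulation, not the function used in the rest of the notebook. `shape_interp` and `shape_where` are illustrative names, with `shape_where` restating the thresholds from `shape` so the block runs on its own.

```python
import numpy as np

MIN_WALK, MAX_WALK = 3, 5  # km thresholds, as in shape()

def shape_where(distance, enrollment):
    # Same nested np.where logic as shape(), repeated for a self-contained check.
    return np.where(distance < MIN_WALK, enrollment,
                    np.where(distance > MAX_WALK, 0,
                             enrollment * (1 - (distance - MIN_WALK) / (MAX_WALK - MIN_WALK))))

def shape_interp(distance, enrollment):
    # Share of pupils attending: 1 up to 3 km, falling linearly to 0 at 5 km.
    share = np.interp(distance, [MIN_WALK, MAX_WALK], [1.0, 0.0])
    return enrollment * share

distances = np.array([1.0, 3.0, 4.0, 5.0, 6.0])
print(shape_interp(distances, 100))
print(np.allclose(shape_interp(distances, 100), shape_where(distances, 100)))
```

Checking the boundary values (exactly 3km and exactly 5km) is a quick way to confirm that both versions agree on where the drop-off starts and ends.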

In [None]:
# Function to generate random starting points for each proposed school within box boundary.
def generate_random_sp():
    sp1 = np.random.uniform(low=lat_bounds[0], high=lat_bounds[1], size=40000)
    sp2 = np.random.uniform(low=lon_bounds[0], high=lon_bounds[1], size=40000)
    sp = np.vstack((sp1, sp2)).T
    return sp

In [None]:
# Create large vector of 40,000 to sample in RS and CMA. 
sp = generate_random_sp()

In [None]:
# Include only starting points that are within the regional polygon.
def check_region(vec):
    # lat = y, x=lon
    vec = gpd.points_from_xy(vec[:, 1], vec[:, 0])
    return vec.within(gdf_region_shp[0])

def check_woreda(vec):
    # lat = y, x=lon
    vec = gpd.points_from_xy(vec[:, 1], vec[:, 0])
    return vec.within(gdf_woreda_shp[0])

In [None]:
# Create a list of approx 20,000 within the regional boundaries. This will be sampled. 
sp = sp[check_region(sp)]
# sp = sp[check_woreda(sp)]

In [None]:
len(sp)

In [None]:
def generate_sp_proposed(sp_list):
    return sp_list[np.random.choice(sp_list.shape[0], proposed_schools,replace=False)]

In [None]:
# Helper for fitness function to take in a vector of primary schools and the distance, and return jagged array.
def get_close_schools(ps, current_ds):    
    # Gets (index and distance) of schools located less than 5km and less than current distance.
    return [[i, ps[i]] for i in range(len(ps)) if ((ps[i] < 5) & (ps[i] < current_ds[i]))]

In [None]:
z = haversine_vector(df_prim_loc,generate_sp_proposed(sp), Unit.KILOMETERS, comb=True)
z

In [None]:
np.min(z, axis=0)

In [None]:
np.argmin(z, axis=0)

In [None]:
np.argmin(haversine_vector(df_prim_loc,generate_sp_proposed(sp), Unit.KILOMETERS, comb=True), axis=0) # index of min distance of PS to x


## Fitness Function

In [None]:
# Constant
ee_old_constant = np.sum(df_sec_enroll) # existing enrollment for secondary

def expected_enrollment(prim_loc, x, prim_enroll, current_dist):
    ee_old = ee_old_constant.copy() # overall existing SS enrollment
    # Distance matrix in km, shape (len(x), len(prim_loc)): one row per proposed SS.
    d = haversine_vector(prim_loc, x, Unit.KILOMETERS, comb=True)
    d_min = np.min(d, axis=0) # distance from each PS to its nearest proposed SS
    d_index = np.argmin(d, axis=0) # index of that nearest proposed SS
    # Keep only PS within 5km of a proposed SS and closer to it than to their current SS.
    d2 = np.where((d_min < 5) & (d_min < current_dist))
    # Put into dataframe. Column 0 = SS, 1 = PS, 2 = distance, 3 = shaped enrollment.
    d3 = pd.DataFrame(np.vstack((d_index[d2], d2[0], d_min[d2], shape(d_min[d2], prim_enroll[d2[0]]))).T)
    d32 = d3.loc[d3.groupby([1])[2].idxmin()] # keep only the nearest SS if a PS is close to more than one.
    d32 = d32.groupby(0)[3].sum() # sum the shaped enrollment by proposed SS.
    eei = np.sum(d32) # overall expected enrollment at the proposed SS
    # Enrollment these PS were assumed to send to their current SS, which is now diverted.
    distance_current = np.sum(shape(current_dist[d2], prim_enroll[d2]))
    ee_old -= distance_current # subtract diverted enrollment from existing SS enrollment
    return eei + ee_old # expected enrollment at new SS + remaining enrollment at existing SS
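
To see what `expected_enrollment` computes, the following worked example uses a hand-made distance matrix instead of real coordinates. Three middle schools and two proposed secondary schools are assumed. All numbers, and the `toy_` names, are invented for illustration. It follows the same steps: pick the nearest proposed school, keep middle schools that are under 5km away and closer than their current secondary school, add their shaped enrollment, and subtract what they were assumed to contribute to the existing school.

```python
import numpy as np

def toy_shape(distance, enrollment):
    # Same thresholds as shape(): full enrollment under 3 km, zero beyond 5 km.
    return np.where(distance < 3, enrollment,
                    np.where(distance > 5, 0, enrollment * (1 - (distance - 3) / 2)))

# Rows: proposed secondary schools, columns: middle schools (distances in km).
toy_d = np.array([[2.0, 4.0, 9.0],
                  [6.0, 3.5, 1.0]])
toy_enroll = np.array([100.0, 200.0, 300.0])  # grade 7-8 enrollment per middle school
toy_current = np.array([10.0, 4.5, 0.5])      # km to the nearest existing secondary school
toy_existing_total = 1000.0                   # assumed current secondary enrollment

d_min = toy_d.min(axis=0)                     # nearest proposed school per middle school
feeders = np.where((d_min < 5) & (d_min < toy_current))[0]

gained = toy_shape(d_min[feeders], toy_enroll[feeders]).sum()
diverted = toy_shape(toy_current[feeders], toy_enroll[feeders]).sum()
print(feeders, gained, diverted, toy_existing_total - diverted + gained)
```

In this toy case the third middle school is ignored because its existing secondary school (0.5km) is closer than any proposed one, while the second is diverted from an existing school 4.5km away to a proposed school 3.5km away.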

In [None]:
# # Constant
# ee_old_constant = np.sum(df_sec_enroll) # existing enrollment for secondary

# def expected_enrollment(prim_loc, x, prim_enroll, current_dist):

#     ee_old  = ee_old_constant.copy() # copy of current overall secondary enrollment
#     feeder = {} # empty dictionary for list of ps, closest ss and distance in km.
#     eei = 0 # expected enrollment increase
#     d = haversine_vector(prim_loc, x, Unit.KILOMETERS, comb=True) # distance of PS to x. 
#     # keep only those < 5 and < current_dist
#     closest = [get_close_schools(d[i], current_dist) for i in range(proposed_schools)] 
    
#     for ss in range(proposed_schools):
#         for ps in range(len(closest[ss])): # for each PS distance to every proposed secondary school.
#             closest_prim = closest[ss][ps] # [0] == school index, [1] == distance to nearest new secondary.
#             # if not in feeder, add to feeder dict or # if another SS is closer, replace in feeder 
#             if (closest_prim[0] not in feeder) or (closest_prim[1] < feeder[closest_prim[0]][1]): 
#                 feeder[closest_prim[0]] = [ss, closest_prim[1], shape(closest_prim[1], prim_enroll[closest_prim[0]])]
            
#             # Current estimated enrollment feeding into current SS
#             distance_current = shape(current_dist[closest_prim[0]], prim_enroll[closest_prim[0]])
#             ee_old -= distance_current # remove old secondary students within catchment.
    
#     # return sum of all the feeder shaped enrollment.
#     eei = sum([row[2] for row in list(feeder.values())])
#     return eei + ee_old

## Objective Function

In [None]:
# The objective function, with the shape function applied inside expected_enrollment.
def f(x):
    x = x.reshape(proposed_schools,2) # reshape flat genotype to one (lat, lon) row per proposed school
    test_case = expected_enrollment(df_prim_loc, x, df_prim_enroll, current_ps_distance)
    return test_case*-1 # Negate so that minimising f maximises expected enrollment.

In [None]:
# example result of objective function.
f(generate_sp_proposed(sp))

## 1. Establish Benchmark

To establish a baseline, the traditional method that planners use to identify potential locations is followed. First, all PS (middle schools with grade 7-8 enrollment) further than a threshold distance from the nearest SS are identified; for this project, 5km is the cut-off. Second, these are sorted by enrollment and limited to the number of schools being proposed, e.g. 5. It is then assumed that these locations are the best places to construct new schools, as demand is highest. 

This project aims to test this theory and find improvements. 

In [None]:
dd = df_prim[df_prim['nearest_lwr_sec'] >5] # Find all PS > 5km distance
dd = dd.sort_values(['grade7_8'], ascending=False).head(proposed_schools) # filter by number of SS to construct i.e. 5.
benchmark = sum(dd['grade7_8']) # sum the enrollment for a basic indication of expected enrollment
benchmark

The figure above is the total enrollment of only the selected schools. However, these locations also need to be evaluated with the objective function. Why? Schools with high enrollment are often in more urban locations, close to many other primary schools that could also benefit from a new SS built nearby. The objective function therefore gives a wider estimate than the top-enrollment schools alone. 

In [None]:
benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
benchmark_f = f(benchmark_loc)

In [None]:
benchmark_loc

In [None]:
# Below are the figures to beat.
print('Overall expected enrollment: ', benchmark_f, '\n' \
      'Expected Enrollment Increase: ', round(abs(np.sum(df_sec_enroll) - abs(benchmark_f)),0))

## 2. Random Search

In [None]:
# Random Search algorithm is used to see if an improved solution can be identified.
def random_search(f, n):
    x = [generate_sp_proposed(sp) for _ in range(n)]
    fx = [[f(xi), xi] for xi in x]
    best_f, best_solution = min(fx, key=lambda x:x[0])
    return best_f, best_solution

In [None]:
n_starting_points = 30
maxits = 10000

In [None]:
%%time
# Run random search with 10,000 evaluations for each of the 30 starting points.
fx = []
for _ in range(n_starting_points):
    start_time = time.time()
    fx.append([random_search(f, maxits), time.time() - start_time])
    print(_,time.time() - start_time, 'starting point completed.')

In [None]:
results = pd.DataFrame(columns=['random_starting_point', 'algorithm', 'ee', 'eei', 'proposed_locations', 'time', 'sigma'])

In [None]:
rows = []
for i in range(0, len(fx)):
    rows.append({'random_starting_point':i, 'algorithm':'Random Search', 'ee':round(abs(fx[i][0][0]),0),
                 'eei':round(abs(np.sum(df_sec_enroll) - abs(fx[i][0][0])),0), 
                 'proposed_locations': fx[i][0][1], 'time':fx[i][1], 'sigma':'NA'})
# DataFrame.append was removed in pandas 2.0, so concatenate the new rows instead.
results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)

In [None]:
results

Random Search does not find an improved solution. 

## 3. CMA

In [None]:
results = results.reset_index(drop=True)

In [None]:
sigmas = (0.1, 0.4, 0.8, 0.9, 1.2, 1.4, 1.6)

In [None]:
%%time

fcma = []
for i in range(n_starting_points):
    for j in sigmas:
        start_time = time.time() # time each run separately rather than cumulatively across sigmas
        es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp).flatten(), sigma0=j,
                                  inopts={'bounds': boundsxy,'seed':1234})
        es.optimize(f, iterations=maxits // es.popsize)
        fcma.append((es.result[1], es.result[0].reshape(proposed_schools, 2), (time.time() - start_time), j))

In [None]:
print('Potential Good Sigma: ', np.sqrt(np.std(df_prim_loc[:,0])**2 + np.std(df_prim_loc[:,1])**2))

In [None]:
testing = df_prim['point'][df_prim['nearest_lwr_sec'] < 5].reset_index(drop=True).to_numpy()
testing = np.array([np.array(i) for i in testing], dtype=float)
print('Potential Good Sigma: ', np.sqrt(np.std(testing[:,0])**2 + np.std(testing[:,1])**2))

In [None]:
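rows = []
for i in range(0, len(fcma)):
    # fcma holds one entry per (starting point, sigma) pair, so recover the starting point index.
    rows.append({'random_starting_point':i // len(sigmas), 'algorithm':'CMA', 'ee':round(abs(fcma[i][0]),0),
                 'eei':round(abs(np.sum(df_sec_enroll) - abs(fcma[i][0])),0), 
                 'proposed_locations': fcma[i][1], 'time':fcma[i][2], 'sigma':fcma[i][3]})
# DataFrame.append was removed in pandas 2.0, so concatenate the new rows instead.
results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)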

In [None]:
results = results.sort_values(['eei'], ascending=False).reset_index()

In [None]:
results.to_csv('results_revamp.csv')

In [None]:
# from ast import literal_eval

In [None]:
# results4 = pd.read_csv('results3.csv', converters={'proposed_locations': literal_eval})

In [None]:
# Show results of top 4.
top_4 = results[:4]
top_4

In [None]:
# it is latitude then longitude.
# box = np.array([[10.713719, 36.689328], [10.713719, 36.96973],[10.964773, 36.96973], [10.964773, 36.689328], [10.713719, 36.689328]])
plt.figure(figsize=(15, 10))
# plt.plot(box[:,1], box[:,0], '.r-')
plt.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
plt.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
plt.scatter(top_4['proposed_locations'][3][:, 1], top_4['proposed_locations'][3][:, 0], s = 35, marker="o", label="New Secondary") # proposed new secondary schools
plt.gca().set_aspect('equal')
plt.legend()
plt.show()

In [None]:
# it is latitude then longitude.
# box = np.array([[10.713719, 36.689328], [10.713719, 36.96973],[10.964773, 36.96973], [10.964773, 36.689328], [10.713719, 36.689328]])
plt.figure(figsize=(15, 10))
# plt.plot(box[:,1], box[:,0], '.r-')
plt.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
plt.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
plt.scatter(benchmark_loc[:, 1], benchmark_loc[:, 0], s = 35, marker="o", label="New Secondary") # benchmark locations
plt.gca().set_aspect('equal')
plt.legend()
plt.show()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,15))
fig.suptitle('Top 4 Results. maxits=10,000, random_sp=30')

for i, ax in enumerate(axes.flat):
    ax.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
    if len(df_sec) != 0:
        ax.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
    new_loc = np.asarray(top_4['proposed_locations'][i]) # rows of (lat, lon)
    ax.scatter(new_loc[:, 1], new_loc[:, 0], s=35, marker="o", label="New Secondary")
    ax.set_title((str(top_4.loc[i]['algorithm']) + ', eei =  ' + str(top_4.loc[i]['eei'])
                  + ', sigma: ' + str(top_4.loc[i]['sigma'])), fontstyle='italic')

for ax in fig.get_axes():
    ax.legend()
    ax.label_outer()

CMA doesn't beat the benchmark. It may need to be re-run with different parameters.

In [None]:
winner = top_4.head(1)
winner

In [None]:
# f(winner['proposed_locations'][0].flatten())

In [None]:
# %%time
# # def update_expected_enrollment(x):
# ee_old_constant = np.sum(df_sec_enroll)
# ee_old = ee_old_constant.copy()

# feeder = {} # empty dictionary for list of ps, closest ss and distance in km.
# eei = 0 # expected enrollment increase
# d = haversine_vector(df_prim_loc, winner['proposed_locations'][0], Unit.KILOMETERS, comb=True) # distance of PS to x. 
# closest = [get_close_schools(d[i], current_ps_distance) for i in range(proposed_schools)] # keep only those < 5 and < current_dist

# for ss in range(proposed_schools):
#     for ps in range(len(closest[ss])): # for each PS distance to every proposed secondary school.
#         closest_prim = closest[ss][ps] # [0] == school index, [1] == distance to nearest new secondary.
#         # if not in feeder, add to feeder dict or # if another SS is closer, replace in feeder 
#         if (closest_prim[0] not in feeder) or (closest_prim[1] < feeder[closest_prim[0]][1]): 
#             feeder[closest_prim[0]] = [ss, closest_prim[1], shape(closest_prim[1], df_prim_enroll[closest_prim[0]])]

#         # Current estimated enrollment feeding into current SS
#         distance_current = shape(current_ps_distance[closest_prim[0]], df_prim_enroll[closest_prim[0]])
#         df_prim.loc[closest_prim[0], 'expected_enroll'] = shape(closest_prim[1], df_prim_enroll[closest_prim[0]])
#         df_prim.loc[closest_prim[0], 'expected_enroll'] -= distance_current
#         df_prim.loc[closest_prim[0], 'nearest_sch_code'] = ss
#         df_prim.loc[closest_prim[0], 'nearest_lwr_sec'] = closest_prim[1]
#         ee_old -= distance_current # remove old secondary students within catchment.

# # return sum of all the feeder shaped enrollment.
# eei = sum([row[2] for row in list(feeder.values())])
# print(eei, ee_old, ee_old_constant, ee_old+eei)

In [None]:
# f(winner['proposed_locations'][0])

In [None]:
# Alternative function to shape expected enrollment, with wider thresholds.
def shape2(distance, enrollment):
    # If less than 5km, all children expected to attend secondary i.e. distance not a factor
    min_walk = 5 
    max_walk = 8.94666 # distance greater than this (in km) assumes school too far, and zero enrollment expected.
    # if between min_walk and max_walk, return a linear dropoff.
    return np.where(distance<min_walk, enrollment,
             np.where(distance>max_walk, 0,
                     enrollment*(1-(distance-min_walk)/(max_walk-min_walk)))
            )

In [None]:
np.sum(df_sec_enroll)

In [None]:
np.sum(shape(current_ps_distance, df_prim_enroll))

In [None]:
np.sum(shape2(current_ps_distance, df_prim_enroll))

In [None]:
es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp).flatten(), sigma0=0.9,
                          inopts={'bounds': boundsxy,'seed':1234})
es.optimize(f, iterations=(maxits/ es.popsize))

In [None]:
benchmark_loc

In [None]:
es = cma.CMAEvolutionStrategy(benchmark_loc.flatten(), sigma0=0.9,
                          inopts={'bounds': boundsxy,'seed':1234})
es.optimize(f, iterations=(maxits/ es.popsize))

In [None]:
es.result

In [None]:
df_prim['urban_rural'].value_counts()

In [None]:
shape(100, 4)

In [None]:
df_prim.columns

In [None]:
np.mean(df_prim[df_prim['urban_rural'] == 2]['nearest_lwr_sec']) # distribution. 

In [None]:
np.mean(df_prim[df_prim['urban_rural'] == 1]['nearest_lwr_sec'])

In [None]:
# f(winner['proposed_locations'][0])
# np.mean(df_sec_enroll)
# update_expected_enrollment(winner['proposed_locations'][0])
# np.sum(eei)

In [None]:
# df_prim.groupby(['nearest_sch_code'])['expected_enroll'].agg('sum')[:5]

In [None]:
dd = haversine_vector(df_prim_loc,df_sec_loc, Unit.KILOMETERS, comb=True)

In [None]:
np.where(dd < 5)[0]