# MSc. AI
## Capstone Project
### Darragh Minogue

### 1. Background


This project adapts the facility location problem to the school network in Ethiopia to expand access to secondary education. Of the 40,776 schools in the country, 37,039 offer primary and middle school education (grades 1-8), but only 3,737 offer secondary (grades 9-12). On average, middle schools are located more than 6km away from the nearest secondary school, meaning children face long journeys or do not access secondary education at all.

The expansion of secondary education is a major priority for the Ministry of Education, Ethiopia. Ambitious five-year targets are set in the Education Sector Development Plans (ESDP). In 2013/2014, the Gross Enrollment Rate (GER) for lower secondary was 38.9% (3,466,972/8,914,837 children) and a target of 74% (6,596,979/8,914,837 children) was set. However, by 2019/2020, the actual GER was 51.05% (4,551,024/8,914,837 children), a more modest increase of 12.15 percentage points rather than the planned 35.1. In the latest five-year ESDP (2020/2021-2024/2025), an increase of 24% was set, but the key performance indicator 'Number of newly established secondary schools' is listed as 'TBD' (To Be Determined). Therein lies part of the problem: planning. Often, a portion of the budget is allocated for secondary education, then each region decides how and where to spend it. Planners then, at the macro level, identify primary schools with the highest demand (i.e. largest enrollment) and seek to construct schools to serve that particular school population. While this can often yield good results, the goal of this project is to improve on this approach by allowing secondary schools to be constructed across a catchment of multiple primary schools, if enrollment is sufficient and distance is minimal. Meta-heuristic optimisation techniques are used to achieve this.
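
The rates quoted above follow directly from the enrollment counts. A minimal check in Python, using only the figures given in this section (the variable names are illustrative):

```python
# Gross Enrollment Rate (GER) = enrolled children / school-age population, as a percentage.
population = 8_914_837  # lower secondary school-age population used above

baseline = 3_466_972 / population * 100  # 2013/2014 actual
target = 6_596_979 / population * 100    # ESDP target
actual = 4_551_024 / population * 100    # 2019/2020 actual

print(f"baseline {baseline:.2f}%, target {target:.2f}%, actual {actual:.2f}%")
print(f"planned increase:  {target - baseline:.2f} percentage points")
print(f"achieved increase: {actual - baseline:.2f} percentage points")
```

The increases are differences between two rates, so they are expressed in percentage points rather than percent.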

The project makes use of readily available data from the Ministry of Education, Ethiopia, mainly 1) enrollment data from the Education Management Information System (EMIS) and 2) geolocation data, obtained from the World Bank in 2020. Since national population and census data is inaccurate and outdated in Ethiopia, it was not considered a reliable source for this project. Instead, it is assumed that middle schools are located within the community and new secondary schools should therefore be constructed close to existing middle schools.

### 2. The Problem

In the problem, there is a list of existing middle schools and secondary schools in Ethiopia, with their enrollment and location data (region, zone, district, longitude and latitude). The aim is to expand the secondary school network by constructing new secondary schools that best serve the demand from the existing middle school network. No limit is placed on capacity, as this is determined by budget, which is not known in advance and is subject to change. Key to solving the problem is understanding that when a school is located within a village, distance has no impact on enrollment or dropout. However, beyond the aspirational norm of 1-2km, distance affects initial access to school and also creates barriers to retention, completion and transition to higher levels. As such, the aim is to minimise distance while constructing schools in the locations with the highest demand.

### 3. Approach

<b> Glossary of terms</b>
* MS: Middle Schools
* SS: Secondary Schools
* EE: Expected Enrollment
* EEI: Expected Enrollment Increase
* Feeder: a middle school that is part of the secondary school catchment area.
* x: proposed new secondary schools (the genotype).
* d: distance in km between each proposed new SS in x and each MS
* catchment: a middle school is considered within a catchment if it's 
* n: total number of proposed new SS to construct

This solution proposes new SS locations that can provide the highest estimated enrollment from feeder MS whilst also ensuring they are constructed at a minimal distance from those MS. Given the multitude of different languages and the decentralised Government of Ethiopia, models are developed at a regional level.

Two different algorithms are used in this project to find an optimal solution: Random Search, and CMA-ES. In each of the algorithms, the objective function f() aims to <b> maximise overall expected enrollment </b> given a set of feasible locations: x. Feasibility of the locations or the genotype, x, is controlled using a box boundary of longitudes and latitudes for a given region. Initial starting points are also provided within the box boundary using a generate_random_sp() helper function. 
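
To make the encoding concrete: a candidate solution for n new schools can be treated as a flat vector of 2n numbers (lat, lon, lat, lon, ...), with the box boundary repeated once per school. The sketch below uses made-up bounds purely for illustration; `make_flat_bounds` and `is_feasible` are hypothetical names and are not functions from this notebook.

```python
import numpy as np

def make_flat_bounds(bounds, n):
    """Repeat a [[min_lat, min_lon], [max_lat, max_lon]] box once per proposed school."""
    lower = np.tile(bounds[0], n)
    upper = np.tile(bounds[1], n)
    return lower, upper

def is_feasible(x_flat, lower, upper):
    """True if every coordinate of the flat genotype lies inside the box."""
    return bool(np.all((x_flat >= lower) & (x_flat <= upper)))

# Illustrative box only (not a real regional boundary).
bounds = np.array([[9.0, 38.0], [10.0, 39.5]])
n = 3
lower, upper = make_flat_bounds(bounds, n)

rng = np.random.default_rng(0)
x_flat = rng.uniform(lower, upper)  # one random genotype inside the box, length 2n
x = x_flat.reshape(n, 2)            # back to one (lat, lon) row per proposed school
```

A box is only a coarse feasibility check: points inside the box can still fall outside an irregular regional border, which is why the starting points are additionally filtered against the regional polygon.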

The fitness is determined using an expected_enrollment function. This takes in the parameters: 1) the location of existing MS, 2) the enrollment of existing MS, 3) x, and 4) the current distance between existing MS and existing SS. The function then calculates the haversine distance in kilometers (km) between each new SS in x and all MS within a given region. This matrix, <b> d </b>, is of size (proposed_schools, number of MS). Using this data, the algorithm then addresses three key challenges:
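
As an illustration of the distance step, the sketch below computes a pairwise haversine matrix with plain NumPy. It is not the `haversine` package call used in the notebook; `haversine_matrix`, the Earth radius constant and the coordinates are assumptions made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_matrix(new_ss, ms):
    """Pairwise great-circle distances in km.

    new_ss: array of shape (n_new, 2) with (lat, lon) in degrees.
    ms:     array of shape (n_ms, 2) with (lat, lon) in degrees.
    Returns an array of shape (n_new, n_ms).
    """
    lat1 = np.radians(new_ss[:, 0])[:, None]
    lon1 = np.radians(new_ss[:, 1])[:, None]
    lat2 = np.radians(ms[:, 0])[None, :]
    lon2 = np.radians(ms[:, 1])[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Illustrative coordinates only.
new_ss = np.array([[9.00, 38.70], [9.05, 38.80]])
ms = np.array([[9.00, 38.70], [9.01, 38.72], [10.00, 38.70]])

d = haversine_matrix(new_ss, ms)       # one row per proposed SS, one column per MS
nearest_new_ss = np.argmin(d, axis=0)  # for each MS, index of the closest proposed SS
```

Taking the minimum down each column gives, for every MS, its distance to the closest proposed SS, which is the quantity the three challenges below operate on.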

1) **The need to identify the MS with the minimal distance to the new SS**. In this case, if a MS is close to two or more new SS, only the closest is selected. But if the MS is closer to an existing SS, then it's ignored as it's already a feeder school to the existing SS.

2) **The need to estimate the expected enrollment of the new SS, based on the distance to its feeder MS**. If a MS is beyond a certain threshold distance in km, it should not be considered a feeder school for a SS. To handle this situation, a helper function is used to estimate the expected enrollment from a feeder MS to the nearby SS. It takes in as parameters a) distance to nearest SS and b) MS enrollment. It then makes some assumptions about distance to return the expected enrollment per school. Theunynck (2014) recommends the adoption of a norm of 2 or 3km for junior secondary schools and the function therefore assumes that if a school is constructed within 3km of a MS, there is no negative effect on the expected enrollment of its nearby SS. Above 3km, a linear dropoff is assumed between 3-5km. If a MS is located more than 5km away from the new location proposed, it is expected that zero students will attend from that feeder school.

3) **The need to divert expected enrollment from a MS if it is currently a feeder school to an existing SS but the new SS proposed is closer**. This is dealt with by subtracting the expected enrollment from the MS to the existing SS using the shape function, and then allocating it to the new SS which is closer. 
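
The three rules above can be sketched on a toy example. This is a simplified illustration rather than the notebook's fitness function: `shaped` is a stand-in for the shape helper with the same 3km and 5km thresholds, and all distances and enrollments are made up.

```python
import numpy as np

def shaped(distance, enrollment, full=3.0, zero=5.0):
    """Full enrollment within `full` km, none beyond `zero` km, linear in between."""
    frac = np.clip((zero - distance) / (zero - full), 0.0, 1.0)
    return enrollment * frac

# Toy data: 2 proposed SS (rows) and 4 MS (columns); all numbers are made up.
d = np.array([[1.0, 4.0, 6.0, 2.0],
              [2.5, 9.0, 7.0, 8.0]])           # km from each proposed SS to each MS
current_dist = np.array([8.0, 4.5, 3.0, 1.0])  # km from each MS to its nearest existing SS
enroll = np.array([100.0, 200.0, 150.0, 80.0])

# 1) Each MS is matched to its closest proposed SS only, and is ignored if that SS
#    is 5 km or more away or if an existing SS is at least as close.
d_min = d.min(axis=0)
is_feeder = (d_min < 5) & (d_min < current_dist)

# 2) Expected enrollment gained by the new SS, shaped by distance.
gained = shaped(d_min[is_feeder], enroll[is_feeder]).sum()

# 3) Enrollment diverted away from the existing SS that was previously the closest option.
diverted = shaped(current_dist[is_feeder], enroll[is_feeder]).sum()

net_increase = gained - diverted
```

In this toy case the first two MS become feeders: they contribute 100 and 100 to the new schools, 50 of which was already expected at an existing SS 4.5km away, giving a net increase of 150.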

A fitness evaluation budget of 10,000 is set to ensure there are sufficient iterations to achieve convergence. Each algorithm is run 30 times with different random starting points. The results are stored in a csv file, with the top 4 results plotted below. 
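
The experimental protocol (a fixed evaluation budget, repeated over independent runs) can be sketched as follows. The objective here is a toy stand-in, not the expected enrollment function, and the function and variable names are illustrative.

```python
import numpy as np

def toy_objective(x):
    """Stand-in for the negated expected enrollment; lower is better."""
    return float(np.sum((x - 0.3) ** 2))

def random_search(f, budget, dim, rng):
    """Evaluate `budget` uniformly sampled candidates and keep the best one."""
    candidates = rng.uniform(0.0, 1.0, size=(budget, dim))
    scores = np.array([f(x) for x in candidates])
    best = int(np.argmin(scores))
    return scores[best], candidates[best]

budget = 10_000  # fitness evaluations per run, as in the text
n_runs = 30      # independent runs
results = []
for run in range(n_runs):
    rng = np.random.default_rng(run)  # a different seed per run
    best_f, best_x = random_search(toy_objective, budget, dim=4, rng=rng)
    results.append({"run": run, "best_f": best_f})

best_overall = min(r["best_f"] for r in results)
```

Keeping the budget identical across algorithms is what makes their best fitness values comparable; the spread across the 30 runs indicates how sensitive each algorithm is to its starting point.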

### Assumptions and Limitations

1. It is assumed that 1km is equivalent to a 15 minute walk for children (Theunynck, S. 2014: p6).
2. Distance is calculated using the haversine function and is therefore a straight-line distance. Actual travel distances could be longer. Ethiopia is not well mapped and, since most children walk to school, tools like the Google Maps API don't yield useful results on a large scale and don't factor in more informal walking routes. Final results require close inspection for elevation and other issues that might impact walking distance or construction, e.g. buildings and rivers. The final results should therefore be treated as an approximation and inspected for these types of obstacles using a tool like ArcGIS or QGIS. 
3. It is assumed that children beyond 5km are not likely to attend, but in some cases, this is not true. Some children walk extremely long distances to attend secondary school, while others stay with relatives or family friends to attend a school that is beyond 5km. Despite this being a reality, this shouldn't guide the construction as the goal is to minimise distance and create more equitable access to secondary. 
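
Under the first assumption, the distance thresholds used in this project translate into walking times as sketched below. The helper name is hypothetical, and the conversion applies to straight-line distance, so real journeys may take longer.

```python
MINUTES_PER_KM = 15  # assumption 1 above (Theunynck, 2014)

def walking_minutes(distance_km):
    """One-way walking time in minutes under the 15 minutes per km assumption."""
    return distance_km * MINUTES_PER_KM

for km in (1, 3, 5):
    print(f"{km} km is roughly {walking_minutes(km)} minutes each way")
```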

In [1]:
# Key imports
import pandas as pd
import numpy as np
from haversine import haversine, haversine_vector, Unit # for distance
import geopandas as gpd
import matplotlib.pyplot as plt
import cma
import time
import os.path

# Suppress the scientific notation on numpy for easier reading.
np.set_printoptions(suppress=True)

In [2]:
df = pd.read_csv('data/clean_dataset_final.csv', converters={'point': pd.eval}) # read in cleaned dataset.

In [3]:
# drop Afar as only 5% of schools are mapped.
df = df[df['ADM1_EN'] != 'Afar']

# Helper Functions

In [4]:
def shape(distance, enrollment):
    """ Returns the expected enrollment based on distance.
    
        If distance is less than 3km, all children are expected to contribute to the supply of a new school. 
        The full enrollment is returned. If above 5km, the school is considered too far and no children contribute to 
        the supply. Zero enrollment is returned. If between 3-5km, a linear dropoff is assumed. e.g. for 100 children
        at a distance of 4km, 50 is returned. 

        Parameters:
            distance (float): the distance in km of a school to the next level school.
            enrollment (int): the enrollment of the school.
        
        Returns:
            shaped_enroll(float): expected feeder enrollment from one school.
    
    """
    min_walk = 3 # minimum km walking distance
    max_walk = 5 # maximum km walking distance
    shaped_enroll = np.where(distance < min_walk, enrollment,
             np.where(distance>max_walk, 0, enrollment*(1-(distance-min_walk)/(max_walk-min_walk))))
    return shaped_enroll

def generate_random_sp(bounds):
    """  Function to generate a vector of 40000 random (lat, lon) starting points for each 
         proposed school within box boundary.
    """
    lat = np.random.uniform(low=bounds[0][0], high=bounds[1][0], size=40000)
    lon = np.random.uniform(low=bounds[0][1], high=bounds[1][1], size=40000)
    sp = np.vstack((lat, lon)).T
    return sp

def check_region(vec, region_shp):
    """ Function to check if initial vector of starting points are located within the regional boundaries/polygon. 
        Returns boolean vector
    """
    vec = gpd.points_from_xy(vec[:, 1], vec[:, 0]) # lat = y, x=lon
    return vec.within(region_shp[0])

def generate_sp_proposed(sp_list, proposed_schools):
    """ Returns a vector of random starting points of size (proposed_schools)
    """
    return sp_list[np.random.choice(sp_list.shape[0], proposed_schools,replace=False)]

def get_bounds(df_):
    """ Returns the minimum and maximum box boundary for school construction.
    """
    bounds = np.array([[np.min(df_['lat']), np.min(df_['lon'])], [np.max(df_['lat']), np.max(df_['lon'])]])
    return bounds

def get_cma_bounds(bounds, n):
    """ CMA-ES expects a list of size 2 for bounds. This function returns the bounds reshaped for CMA-ES
    """
    x1y1 = np.repeat([bounds[0,:]],n, axis=0).flatten()
    x2y2 = np.repeat([bounds[1,:]],n, axis=0).flatten()
    boundsxy = [x1y1,x2y2]
    return boundsxy

def create_results_table():
    """ Function that checks to see if a results csv exists; if it does not, it creates the csv.
    """
    if not os.path.isfile('./results.csv'):
        results = pd.DataFrame(columns=['region','proposed_schools_n','starting_point', 'algorithm', 'ee',
                                        'eei', 'proposed_locations', 'time', 'sigma'])
        results.to_csv('results.csv', index=False)

In [5]:
def prepare_datasets(df_, region):
    """ A function which prepares datasets for the experiments.
    """
    df_ = df_.loc[df_['ADM1_EN'] == region] # filter dataset by region
    
    # Create subset numpy arrays for the Fitness Function.
    # 1. Full dataset filtered to only middle schools. 
    # 2. df_middle_enroll: MS enrollment data. Only the last two grades used as predictors for expected SS enrollment
    # 3. df_middle_loc: MS location data- lat lon point data. 
    # 4. df_sec_enroll: SS enrollment data. Only grades 9 and 10 enrollment.
    # 5. df_sec_loc: SS location data- lat lon point data. 
    # 6. current_ms_distance: existing distances from MS to SS

    df_middle = df_.loc[df_['grade_7_8'] > 0] # 1
    df_middle_enroll = df_middle['grade_7_8'].reset_index(drop=True).to_numpy(dtype=float) # 2
    df_middle_loc = df_middle['point'].reset_index(drop=True).to_numpy()
    df_middle_loc = np.array([np.array(i) for i in df_middle_loc], dtype=float) # 3

    df_sec = df_.loc[ (df_['gr_offer'] == 'G. 9-10') | (df_['gr_offer'] == 'G. 9-12')]
    df_sec_enroll = df_sec['grade9_10'].reset_index(drop=True).to_numpy(dtype=float) # 4
    df_sec_loc = df_sec['point'].reset_index(drop=True).to_numpy() 
    df_sec_loc = np.array([np.array(i) for i in df_sec_loc], dtype=float) # 5
    current_ms_distance = df_middle['nearest_lwr_sec'].to_numpy() # 6
    
    return [df_middle, df_middle_enroll, df_middle_loc, df_sec_enroll, df_sec_loc, current_ms_distance]

In [6]:
# Prepare regional datasets in a dictionary.
datasets = {}
for i in set(df['ADM1_EN']):
    datasets[i] = prepare_datasets(df, i)

In [7]:
# def benchmarks(region, data, n):
#     # *** ESTABLISH BENCHMARK ****
    
#     # Basic Benchmark
#     # Step 1. Find all MS with distance > 5km. 
#     # Step 2. Sort by enrollment and select n schools with the highest enrollment
#     # Step 3. Sum the enrollment to assume that the new schools will only serve n schools.
#     # Step 4. Save the results to results.csv

#     start_time_overall = time.time()

#     # Run Basic Benchmark Per Region
#     for i in data:
#         df_ms = data[i][0] # MS in region i
#         ee_old_constant = np.sum(data[i][3]) # total SS_enrolment in region i
#         if (i == 'Addis Ababa') | (i == 'Dire Dawa'):  
#             # Distances to secondary schools are lower in city administrations
#             dd = df_ms[df_ms['nearest_lwr_sec'] > 2] # Find all MS > 2km distance in city administrations
#         else:
#             dd = df_ms[df_ms['nearest_lwr_sec'] > 5] # Find all MS > 5km distance in regions.

#         dd = dd.sort_values(['grade_7_8'], ascending=False).head(n) # filter by number of SS to construct
#         benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment
#         benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
#         benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
        
#         create_results_table() # create table for results, if it doesn't already exist.
#         results = pd.read_csv('results.csv')
#         row1 = (pd.Series({'region': i, 'proposed_schools_n': n,
#                        'starting_point':np.nan, 'algorithm':'Basic Benchmark', 
#                        'ee': abs(ee_old_constant - abs(benchmark)), 'eei': benchmark, 
#                        'proposed_locations': benchmark_loc, 'time':0, 'sigma':np.nan}))
    
#         results = results.append(row1, ignore_index=True)
#         results.to_csv('results.csv', index=False)
        
#     print('Basic Algorithm Time:', n, time.time() - start_time_overall)

In [12]:
# Find all eligible region names. 
regions = set(df['ADM1_EN'])

In [15]:
# Run Basic Benchmark and save results.
for i in regions:
    benchmarks(i, datasets, 5)
    benchmarks(i, datasets, 10)

Basic Algorithm Time: 5 0.08278942108154297
Basic Algorithm Time: 10 0.07680726051330566
Basic Algorithm Time: 5 0.08205127716064453
Basic Algorithm Time: 10 0.07596230506896973
Basic Algorithm Time: 5 0.10330867767333984
Basic Algorithm Time: 10 0.1076042652130127
Basic Algorithm Time: 5 0.1159677505493164
Basic Algorithm Time: 10 0.1170186996459961
Basic Algorithm Time: 5 0.1162712574005127
Basic Algorithm Time: 10 0.10314393043518066
Basic Algorithm Time: 5 0.11569523811340332
Basic Algorithm Time: 10 0.10844922065734863
Basic Algorithm Time: 5 0.12757301330566406
Basic Algorithm Time: 10 0.11543416976928711
Basic Algorithm Time: 5 0.11879348754882812
Basic Algorithm Time: 10 0.11012792587280273
Basic Algorithm Time: 5 0.12350320816040039
Basic Algorithm Time: 10 0.12184524536132812


In [None]:
# def reduced_search_space(data):
#     dist = haversine_vector(data[2],data[2], Unit.KILOMETERS, comb=True) # distance of MS to x. 
#     # filter dataset if MS enrolment is above 200, or
#     min_distance = dist < 5 # schools within 5km of each other
#     find_enrolment = np.where(min_distance, data[1],0)
#     min_capacity = np.sum(find_enrolment, axis=0) > 200 #set min capacity at 200.
#     reduced_dataset = data[0][min_capacity].reset_index(drop=True)
#     return reduced_dataset

In [None]:
# Variable filtering of dataset based on max distance for inclusion. If >5km, only MS 5km from SS are considered. 

### TO DO: CREATE A FUNCTION THAT DOES THIS. 

min_dist_dict = { 
'Addis Ababa': 2, # Insufficient only 3 >5km: X, 9 MS > 3km: X, 21 > 2km: ✓
 'Amhara': 5, # 3376 MS > 5km: ✓
 'Benishangul Gumz': 5, # 204 MS > 5km: ✓
 'Dire Dawa': 5, # 29 MS > 5km: ✓
 'Harari': 3, # Insufficient only 10 MS >5km: X 21 > 3km: ✓
 'Oromia': 5, # 5037 MS > 5km: ✓
 'SNNP': 5, # 675 MS > 5km: ✓
 'Somali': 5, # 272 MS > 5km: ✓
 'Tigray': 5 # 686 schools > 5km: ✓
}
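
One possible way to address the TO DO above is sketched here. It picks, per region, the largest candidate threshold that still leaves enough MS to choose from. The function name, the candidate thresholds and in particular the `min_schools=20` cut-off are hypothetical; the comments in the dictionary only indicate that 10 MS was judged insufficient and 21 sufficient.

```python
import pandas as pd

def choose_min_distance(df_region, thresholds=(5, 3, 2), min_schools=20):
    """Return the largest threshold (km) with at least `min_schools` MS beyond it.

    `min_schools=20` is a hypothetical cut-off, not a value taken from the analysis.
    Falls back to the smallest threshold if none qualifies.
    """
    ms = df_region[df_region['grade_7_8'] > 0]  # middle schools only
    for t in sorted(thresholds, reverse=True):
        if (ms['nearest_lwr_sec'] > t).sum() >= min_schools:
            return t
    return min(thresholds)

# Toy frame with the same column names as the cleaned dataset; values are made up.
toy = pd.DataFrame({
    'ADM1_EN': ['Toy'] * 6,
    'grade_7_8': [120, 80, 60, 0, 200, 90],
    'nearest_lwr_sec': [6.2, 3.5, 2.4, 9.0, 1.0, 4.1],
})
threshold = choose_min_distance(toy, min_schools=2)
```

Applied to the full dataset, a dictionary like the one above could then be built with `{r: choose_min_distance(g) for r, g in df.groupby('ADM1_EN')}`, although the resulting values would depend on the chosen cut-off.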

In [None]:
# Use this for CMA. 
def find_eligible_MS(data, expected_schools, min_dict):
    
    # Step 1. Check if distance is > variable minimum distance by region outlined above.
    data = data[data['nearest_lwr_sec'] > data['ADM1_EN'].map(min_dict)]
    
    # Step 2. Prepare key MS datasets. 
    df_middle = data.loc[data['grade_7_8'] > 0].reset_index(drop=True) # Filter to MS only
    df_middle_enroll = df_middle['grade_7_8'].reset_index(drop=True).to_numpy(dtype=float) # Convert MS Enrolment to np
    df_middle_loc = df_middle['point'].reset_index(drop=True).to_numpy()
    df_middle_loc = np.array([np.array(i) for i in df_middle_loc], dtype=float) # Convert MS locational data to np.
    
    # Step 3. Calculate distance from MS to all other MS and check if < 5km
    dist = haversine_vector(df_middle_loc,df_middle_loc, Unit.KILOMETERS, comb=True) < 5 # distance of MS < 5km
    
    # Step 4. Find MS enrolment if within 5km
    find_enrolment = np.where(dist, df_middle_enroll,0) 
    new_enroll_ss = find_enrolment.copy() # create a copy for manipulation
    
    # Step 5. Expected new SS must be < length of dist. e.g. If length_dist = 15, new SS must be less than 15.
    length_dist = len(np.sum(new_enroll_ss, axis=1)) 
    
    # Step 6. Iteratively find top n enrolment of schools within 5km of each other. n=expected_schools or length_dist
    # To ensure there is no double counting, find  first max, then subtract from array of new_enroll_ss.
    # Repeat until expected enrolment found for top n schools. 

    for i in range(min(expected_schools, length_dist)):
        expected_enrol = np.sum(new_enroll_ss, axis=1) 
        max_ee_index = np.argpartition(expected_enrol, -(i+1))[-(i+1):][-(i+1)]
        max_array = new_enroll_ss[max_ee_index]
        new_enroll_ss = (new_enroll_ss - new_enroll_ss[max_ee_index])
        new_enroll_ss = new_enroll_ss.clip(min=0)
        new_enroll_ss[max_ee_index] = max_array
        
    # Step 7. Find new schools above minimum capacity of 200.
    new_enroll_ss = new_enroll_ss[np.sum(new_enroll_ss, axis=1) > 200] # minimum capacity of 200
    max_possible = min(len(new_enroll_ss), expected_schools) # total new SS possible
    new_schools_ee = np.sum(new_enroll_ss, axis=1) #  total ee possible from new SS.
    new_schools_index = np.argpartition(new_schools_ee, -(max_possible))[-(max_possible):]

    # Step 8. Find index of all new schools included in the final sample & return MS dataframe with only these MS.
    cluster_schools = np.where(new_enroll_ss[new_schools_index])[1] 
    cluster_schools = df_middle.loc[cluster_schools]
    region = list(pd.unique(data['ADM1_EN']))[0]
    return cluster_schools, region, max_possible

In [None]:
# Find all eligible MS, then return joined datasets. 
n = 5
dataframes = [find_eligible_MS(df[df['ADM1_EN']==i], n, min_dist_dict) for i in regions]
listy = {d[1]: d[2] for d in dataframes} # region -> number of new SS that is possible.
revised_df = pd.concat([d[0] for d in dataframes], ignore_index=False) # d avoids shadowing the global df

In [None]:
# Prepare the new datasets. 
datasets2 = {}
for i in regions:
    datasets2[i] = prepare_datasets(revised_df, i)

In [None]:
### FITNESS FUNCTION

def expected_enrollment(middle_loc, x, middle_enroll, current_dist, ee_old_constant):

    """ The Fitness Function returns the overall total expected enrollment given the new school locations x. 

        The function calculates the distance between each new school in x and the current middle schools to determine 
        the expected enrollment per new school. If a proposed new school is closer to a middle school than its current 
        secondary school, the enrollment from that middle school is subtracted from the existing secondary enrollment. 

        Parameters:
            middle_loc: a vector of (lat, lon) coordinates of all middle schools
            x: a vector of (lat, lon) coordinates for the newly proposed secondary schools
            middle_enroll: a vector of current enrollment at each middle school
            current_dist: a vector of the current distances between each middle school and its nearest secondary school.
            ee_old_constant: the total current secondary enrollment in the region.

        Returns:
            eei + ee_old (float): the expected enrollment increase from the schools in x (eei)
                                  + the current secondary enrollment (ee_old)

    """

    ee_old = ee_old_constant.copy() # Overall SS Enrollment
    d = haversine_vector(middle_loc,x, Unit.KILOMETERS, comb=True) # distance of MS to x.
    d_min = np.min(d, axis=0) # select only closest schools to avoid duplication. 

    d_index = np.argmin(d, axis=0) # index of min distance of MS to x
    d2 = np.where((d_min <5) & (d_min < current_dist)) # limit to only schools < 5km and schools < current distance
    # Put into dataframe for manipulation in pandas
    # index 0 = x, index 1 = MS, index 2 = distance, index 3 = shaped enrollment 
    d3 = pd.DataFrame(np.vstack((d_index[d2], d2[0], d_min[d2], shape(d_min[d2], middle_enroll[d2[0]]))).T) 
    d32 = d3.loc[d3.groupby([1])[2].idxmin()] # find only nearby SS if MS is close to more than 1 SS.
    d32 = d32.groupby(0)[3].sum() # The overall shaped enrollment for each school in x.
    eei = np.sum(d32) # overall expected enrollment increase for each school in x.
    # Find MS enrollment of MS if close to old SS. 
    distance_current = np.sum(shape(current_dist[d2], middle_enroll[d2]))
    ee_old -= distance_current # subtract shaped enrollment from overall SS enrollment as new school is close
    return eei + ee_old # return overall expected enrollment + current secondary enrollment

In [None]:
def f2(x, n, ms_loc, ms_enrol, ms_dist, ee_old):
    """ The Objective Function returns the negated expected enrollment, so that minimising it maximises enrollment.
    """
    x = x.reshape(n,2) # reshape for input into expected enrolment.
    t_case = expected_enrollment(ms_loc, x, ms_enrol, ms_dist, ee_old) # run fitness function.
    return t_case*-1 # Multiply by -1 for maximising.

In [None]:
np.sum(datasets['Tigray'][3])

In [None]:
np.sum(df['grade9_10'])

In [None]:
def establish_benchmarks(data, n, min_dict):
    # *** ESTABLISH BENCHMARK ****
    
    # Basic Benchmark
    # Step 1. Find all MS with distance > 5km. 
    # Step 2. Sort by enrollment and select n schools with the highest enrollment
    # Step 3. Sum the enrollment to assume that the new schools will only serve n schools.
    # Step 4. Save the results to results.csv
    start_time_basic = time.time()
    df_ms, df_ms_enrol, df_ms_loc, current_ms_dist = data[0], data[1], data[2], data[5]
    region = df_ms['ADM1_EN'].unique()[0]
    ee_old_constant = np.sum(data[3])
    df_ms = df_ms[df_ms['nearest_lwr_sec'] > df_ms['ADM1_EN'].map(min_dict)]
    dd = df_ms.sort_values(['grade_7_8'], ascending=False).head(n) # filter by number of SS to construct
    n = len(dd)
    benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment
    basic_time = time.time() - start_time_basic

    # Advanced Benchmark
    # Since MS with the highest enrollment are likely to be close to other MS, apply the fitness function to these schools
    # Step 1. Assume x is the locations of the top n schools
    # Step 2. Run fitness Function on these schools. 
    start_time_adv = time.time()
    benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
    benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
    
    # Run Benchmark on objective function. 
    benchmark_f = f2(benchmark_loc, n, df_ms_loc, df_ms_enrol, current_ms_dist, ee_old_constant) # Run Fitness Function
    # Store results
    create_results_table() # create table for results, if it doesn't already exist.
    results = pd.read_csv('results.csv')

    row1 = (pd.Series({'region': region, 'proposed_schools_n': n,
                       'starting_point':np.nan, 'algorithm':'Basic Benchmark', 
                       'ee': benchmark + ee_old_constant, 'eei': benchmark,
                       'proposed_locations': benchmark_loc, 'time':np.nan, 'sigma':np.nan}))

    row2 = (pd.Series({'region': region, 'proposed_schools_n': n,'starting_point': np.nan, 
                       'algorithm':'Advanced Benchmark', 'ee': abs(round(benchmark_f,0)),
                       'eei': round(abs(benchmark_f)-ee_old_constant,0), 
                        'proposed_locations': benchmark_loc, 'time':np.nan, 'sigma':np.nan}))

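    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
    results = pd.concat([results, pd.DataFrame([row1, row2])], ignore_index=True)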
    results.to_csv('results.csv', index=False)

    print(region, 'n=', n, ', Basic Algorithm Time:', basic_time)
    print(region, 'n=', n, ', ', 'Advanced Algorithm Time:', time.time() - start_time_adv)

In [None]:
n = 5
for i in regions:
    establish_benchmarks(datasets[i], n, min_dist_dict)

n = 10
for i in regions:
    establish_benchmarks(datasets[i], n, min_dist_dict)

In [None]:
def random_search(f, max_iterations, sp, n):
    x = [generate_sp_proposed(sp, n) for _ in range(max_iterations)]
    fx = [[f(xi), xi] for xi in x]
    best_f, best_solution = min(fx, key=lambda x:x[0])
    return best_f, best_solution

In [None]:
# NEED TO SHARE THE random_sp per region. 
def create_random_sps(region, data_bounds):
    bounds = get_bounds(data_bounds)
    sp = generate_random_sp(bounds)# Create a large sample of starting points
    sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
    return sp

In [None]:
create_random_sps('Amhara', datasets['Amhara'][0]) # datasets is keyed by region name; index 0 is the MS dataframe

In [None]:
def run_random_search(n, n_sp, max_iterations, data):
    """ Runs Random Search n_sp times for one region and appends the results to results.csv.
        Relies on the global gdf_region_shp holding the boundary of the same region as data.
    """
    start_time_rs = time.time()
    df_ms, df_ms_enrol, df_ms_loc, current_ms_dist = data[0], data[1], data[2], data[5]
    region = df_ms['ADM1_EN'].unique()[0] # region name, taken from the MS dataframe
    bounds = get_bounds(data[0])
    sp = generate_random_sp(bounds) # Create a large sample of starting points
    sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
    ee_old_constant = np.sum(data[3])
    
    def f(x):
        """ The Objective Function for Random Search.
        """
        x = x.reshape(n,2) # reshape for input into expected enrolment.
        test_case = expected_enrollment(df_ms_loc, x, df_ms_enrol, current_ms_dist, ee_old_constant) # run fitness function.
        return test_case*-1 # Multiply by -1 for maximising.
    
    fx = []
    for _ in range(n_sp):
        start_time = time.time()
        fx.append([random_search(f, max_iterations, sp, n), time.time() - start_time])
    
    create_results_table() # create table for results, if it doesn't already exist.
    results = pd.read_csv('results.csv')
    rows = []
    for i in range(0, len(fx)):
        rows.append(pd.Series({'region': region, 'proposed_schools_n': n,
                        'starting_point':i, 'algorithm':'Random Search', 'ee':round(abs(fx[i][0][0]),0),
                                    'eei':round(abs(ee_old_constant - abs(fx[i][0][0])),0), 
                                    'proposed_locations': fx[i][0][1], 'time':fx[i][1], 'sigma':'NA'}))
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
    results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)
    results.to_csv('results.csv', index=False) # save, otherwise the new rows are lost

    print(region, 'RS complete. Total time:', round(time.time() - start_time_rs,2))

In [None]:
run_random_search(5, 30, 1000, datasets['Harari'])

In [None]:
# Superseded: generate_random_sp() takes a single bounds array, as in the setup cell further down.
# sp = generate_random_sp(lat_bounds, lon_bounds)# Create a large sample of starting points
# sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.

In [None]:
# # Reduce search space for CMA-ES. 
# datasets2 = {}
# for i in datasets:
#     datasets2[i] = prepare_datasets(reduced_search_space(datasets[i]), i)
#     datasets2[i][3] = datasets[i][3]
#     datasets2[i][4] = datasets[i][4]

In [None]:
region = 'Oromia'
gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)
n = listy[region]
df_test1 = datasets[region]
df_test2 = datasets2[region]
df_ms_subset = df_test2[0]
df_ms_enrol_subset = df_test2[1]
df_ms_loc_subset = df_test2[2]
current_ms_dist_subset = df_test2[5]
df_ss_enrol = df_test1[3]
df_ss_loc = df_test1[4]

bounds = get_bounds(df_ms_subset)
cma_bounds = get_cma_bounds(bounds, n)

# Create a large initial sample of starting points. 

sp = generate_random_sp(bounds)# Create a large sample of starting points
sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
ee_old_constant = np.sum(df_ss_enrol) # existing enrollment for secondary. The constant figure. 

In [None]:
ee_old_constant

In [None]:
sum(df_ms_enrol_subset) # should be able to attain this. 

In [None]:
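def random_search(f, iterations, rand_sp):
    # Draw a fresh set of n candidate locations from the pool rand_sp at every iteration;
    # reusing the same point would evaluate one solution repeatedly.
    x = [generate_sp_proposed(rand_sp, n) for _ in range(iterations)]
    fx = [[f(xi), xi] for xi in x]
    best_f, best_solution = min(fx, key=lambda x:x[0])
    return best_f, best_solution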

In [None]:
#     *** OBJECTIVE FUNCTION ****
    
def f(x):
    """ The Objective Function returns the negated expected enrollment, so that minimising it maximises enrollment.
    """
    x = x.reshape(n,2) # reshape for input into expected enrolment.
    test_case = expected_enrollment(df_ms_loc_subset, x, df_ms_enrol_subset, current_ms_dist_subset, ee_old_constant) # run fitness function.
    return test_case*-1 # Multiply by -1 for maximising.

dd = df_ms_subset.sort_values(['grade_7_8'], ascending=False).head(n)
# print(dd)
benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment
# print(benchmark)
benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
benchmark_f = f(benchmark_loc) # Run Fitness Function

In [None]:
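def random_search(f, iterations, rand_sp):
    # Draw a fresh set of n candidate locations from the pool rand_sp at every iteration.
    x = [generate_sp_proposed(rand_sp, n) for _ in range(iterations)]
    fx = [[f(xi), xi] for xi in x]
    best_f, best_solution = min(fx, key=lambda x:x[0])
    return best_f, best_solution

# Note: n_starting_points and maxits are defined in the next cell; run that cell first.
start_time_rs = time.time()
fx = []
for _ in range(n_starting_points):
    start_time = time.time()
    fx.append([random_search(f, maxits, sp), time.time() - start_time])

create_results_table() # create table for results, if it doesn't already exist.
results = pd.read_csv('results.csv')
rows = []
for i in range(0, len(fx)):
    rows.append(pd.Series({'region': region, 'proposed_schools_n': n,
                    'starting_point':i, 'algorithm':'Random Search', 'ee':round(abs(fx[i][0][0]),0),
                                'eei':round(abs(ee_old_constant - abs(fx[i][0][0])),0), 
                                'proposed_locations': fx[i][0][1], 'time':fx[i][1], 'sigma':'NA'}))
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
results = pd.concat([results, pd.DataFrame(rows)], ignore_index=True)
results.to_csv('results.csv', index=False)

print(region, 'RS complete. Total time:', round(time.time() - start_time_rs,2))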

In [None]:
n_starting_points = 30
maxits = 30000

In [None]:
%%time
sigma0 = np.sqrt(np.std(df_ms_loc_subset[:,0])**2 + np.std(df_ms_loc_subset[:,1])**2)
sigmas = np.linspace(0, sigma0, 11)[1:] # Take 10 values evenly spaced out between 0 to sigma0

start_time_cma = time.time()
fcma = []
for i in range(n_starting_points):
    for j in sigmas:
        start_time = time.time() # time each (starting point, sigma) run separately
        es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp, n).flatten(), sigma0=j,
                                  inopts={'bounds': cma_bounds,'seed':1234})
        es.optimize(f, iterations=(maxits/ es.popsize))
        # append inside the sigma loop so every sigma is recorded, not only the last one
        fcma.append((es.result[1], es.result[0].reshape(n, 2), (time.time() - start_time), j))
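
The sigma grid in this cell scales the CMA-ES initial step size to the spread of the middle school coordinates. As a rough worked example using only the standard library, with `toy_points` as made-up coordinates and `sigma_grid` as a hypothetical helper name:

```python
import math
import statistics

def sigma_grid(points, steps=10):
    """Return `steps` evenly spaced sigma values in (0, sigma0]."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    # np.std defaults to the population standard deviation, hence pstdev here.
    sigma0 = math.sqrt(statistics.pstdev(lats) ** 2 + statistics.pstdev(lons) ** 2)
    # Equivalent to np.linspace(0, sigma0, steps + 1)[1:], which drops the zero value.
    return [sigma0 * k / steps for k in range(1, steps + 1)]

toy_points = [(9.0, 38.0), (10.0, 39.0), (11.0, 40.0), (12.0, 41.0)]
sigmas = sigma_grid(toy_points)
```

Dropping the first grid value matters because a step size of zero would leave the search stuck at its starting point.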

In [None]:
fcma

In [None]:
bounds

In [None]:
626675.0588190556 -  625248.0

In [None]:
1427/5

In [None]:
sum(df_ms_enrol_subset)

In [None]:
fcma

In [None]:
round(abs(ee_old_constant - abs(fcma[0][0])),0)

In [None]:
shape(3.570095, 84)

In [None]:

def run(df, region, n):
    """ Function to run all experiments per region according to the total new schools proposed. 
        Doesn't return any value. It appends the results dataframe and creates a set of visualisations per region.
    """
    # limit geojson to only selected region
    # limit clean dataset to only selected region
    gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
    gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)
    df = df.loc[df['ADM1_EN'] == region]
    
    # Find regional boundaries in lat lon.
    # Latitude is the Y axis, longitude is the X axis.
    bounds = gdf_region_shp.bounds 
    lat_bounds = bounds[['miny','maxy']].to_numpy(dtype=float)[0]
    lon_bounds = bounds[['minx','maxx']].to_numpy(dtype=float)[0]
    bounds = np.array([[lat_bounds[0], lon_bounds[0]], [lat_bounds[1], lon_bounds[1]]])
    # array - [[lower lat bounds, lower lon bounds],[upper lat bounds, upper lon bounds]]
    
    # CMA expects a list of size 2 for bounds
    x1y1 = np.repeat([bounds[0,:]],n, axis=0).flatten()
    x2y2 = np.repeat([bounds[1,:]],n, axis=0).flatten()
    boundsxy = [x1y1,x2y2]
    
    # Create subset arrays in numpy for input to the Fitness Function.
    # 1. df_middle_enroll: MS enrollment data. Only the last two grades used as predictors for expected SS enrollment
    # 2. df_middle_loc: MS location data- lat lon point data. 
    
    # 3. df_sec_enroll: SS enrollment data. Only grades 9 and 10 enrollment.
    # 4. df_sec_loc: SS location data- lat lon point data. 
    # 5. current_ms_distance: existing distances from MS to SS

    df_middle = df.loc[df['grade_7_8'] > 0]
    df_middle_enroll = df_middle['grade_7_8'].reset_index(drop=True).to_numpy(dtype=float) # 1
    df_middle_loc = df_middle['point'].reset_index(drop=True).to_numpy()
    df_middle_loc = np.array([np.array(i) for i in df_middle_loc], dtype=float) # 2

    df_sec = df.loc[ (df['gr_offer'] == 'G. 9-10') | (df['gr_offer'] == 'G. 9-12')]
    df_sec_enroll = df_sec['grade9_10'].reset_index(drop=True).to_numpy(dtype=float)
    df_sec_loc = df_sec['point'].reset_index(drop=True).to_numpy() 
    df_sec_loc = np.array([np.array(i) for i in df_sec_loc], dtype=float) # 3
    current_ms_distance = df_middle['nearest_lwr_sec'].to_numpy() # 5
    
    # Create a large initial sample of starting points. 
    sp = generate_random_sp(lat_bounds, lon_bounds)# Create a large sample of starting points
    sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
    
    ee_old_constant = np.sum(df_sec_enroll) # existing enrollment for secondary. The constant figure. 

In [None]:
joined

In [None]:
len(df.loc[df['grade_7_8'] > 0])

In [None]:
# Find clusters > 5km away, then identify the best location in these priority areas.

In [None]:
# Step 1. Filter by recommended distance per region
# Step 2. Calculate distance from MS to all other MS. 
# Step 3. Find Enrolment of MS within 5km (self inclusive)
# Step 4. Return feeder MS with a minimum capacity of 200 students. 
# Step 5. Find the index of MS with highest feeder enrolment. 
# Step 6. Return 
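
Steps 1 to 5 above can be sketched without any external packages. This is an illustrative outline under stated assumptions, not the notebook's `reduced_search_space` function: the dictionary keys, the `priority_clusters` name and the toy schools are all hypothetical, and the `> 200` capacity test mirrors the commented-out code elsewhere in the notebook.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(h))

def priority_clusters(schools, top_k, min_dist_to_sec=5.0, radius_km=5.0, min_capacity=200):
    # Step 1: keep middle schools further than the recommended distance from a secondary school.
    far = [s for s in schools if s["nearest_sec_km"] > min_dist_to_sec]
    ranked = []
    for i, centre in enumerate(far):
        # Steps 2-3: sum the enrolment of every remaining school within radius_km (self inclusive).
        feeder = sum(s["enrol"] for s in far
                     if haversine_km(centre["point"], s["point"]) < radius_km)
        # Step 4: keep only clusters above the minimum capacity.
        if feeder > min_capacity:
            ranked.append((feeder, i))
    # Step 5: indices (into `far`) of the clusters with the highest feeder enrolment.
    ranked.sort(reverse=True)
    return far, [i for _, i in ranked[:top_k]]

schools = [
    {"point": (9.00, 38.00), "enrol": 150, "nearest_sec_km": 8.0},
    {"point": (9.01, 38.01), "enrol": 120, "nearest_sec_km": 7.5},   # close to the first school
    {"point": (9.50, 38.50), "enrol": 90, "nearest_sec_km": 12.0},   # isolated, below capacity
    {"point": (9.02, 38.00), "enrol": 400, "nearest_sec_km": 1.0},   # already near a secondary school
]
far, best = priority_clusters(schools, top_k=1)
```

The double loop is quadratic in the number of schools, so for a full region a vectorised distance matrix (as the notebook does with `haversine_vector`) would be the more practical choice.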

In [None]:
reduced_search_space3(df[df['ADM1_EN']=='Oromia'], 10,m, min_dist_to_sec, min_dist_dict)

In [None]:
tt = np.array([5,2,3,8])
np.argpartition(tt, -3)[-3:]

In [None]:
# What about existing secondary?

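def reduced_search_space2(data, n, m):
    # data is a regional entry of `datasets`; data[0] is assumed to be its dataframe.
    # Keep only middle schools more than 5km from their nearest lower secondary school.
    reduced_dataset = data[0][data[0]['nearest_lwr_sec'] > 5]
    return reduced_dataset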

#     dist = haversine_vector(data[2],data[2], Unit.KILOMETERS, comb=True) # distance of MS to x. 
#     # filter dataset if MS enrolment is above 200, or
#     min_distance = dist < 5 # schools within 5km of each other
#     # should remove schools < 3km. 
#     find_enrolment = np.where(min_distance, data[1],0)
#     min_capacity = np.sum(find_enrolment, axis=0) > 200 #set min capacity at 200.
#     # sort instead of choose?
#     max20 = np.argpartition(min_capacity, -(n+m))[-(n+m):] # best 20 schools MUST have nearest_lwr_sec > 5 EDIT TOMORROW!
#     # then filter by those > 5km?
#     demand_points = np.argwhere(find_enrolment[max20])[:,1] # index of max20
#     reduced_dataset = data[0].iloc[demand_points].reset_index(drop=True)
#     return reduced_dataset

In [None]:
# df[df['nearest_lwr_sec'] > 5]

In [None]:
rr = reduced_search_space2(datasets['Oromia'], 5,5)

In [None]:
sum(rr['nearest_lwr_sec'] > 5)

In [None]:
# NEED TO remove 

In [None]:
testy = datasets['Amhara']

In [None]:
testy[0]['nearest_lwr_sec'] > 5

In [None]:
rr.head()

In [None]:
rr['nearest_lwr_sec']

In [None]:
# Reduce search space for CMA-ES. 
datasets2 = {}
for i in datasets:
    datasets2[i] = prepare_datasets(reduced_search_space(datasets[i]), i)
    datasets2[i][3] = datasets[i][3]
    datasets2[i][4] = datasets[i][4]

In [None]:
# calculate total area 

In [None]:
bounds = get_bounds(df)

In [None]:
bounds[:, 1]

In [None]:
x1y1 = np.repeat(bounds[0,:],2)
x2y2 = np.repeat(bounds[1,:],2)

In [None]:
poly = Polygon(zip(np.repeat(bounds[:, 0],2),np.repeat(bounds[:, 1],2)))

In [None]:
crs = 'epsg:4326'
polygon = gpd.GeoDataFrame(index=[0], crs=crs, geometry=[poly])  

In [None]:
polygon

In [None]:
zip(lon_list, lat_list)

In [None]:
from shapely.geometry import Polygon

In [None]:
    x1y1 = np.repeat([bounds[0,:]],n, axis=0).flatten()
    x2y2 = np.repeat([bounds[1,:]],n, axis=0).flatten()

In [None]:
# Calculate area difference between old and new functions. 

In [None]:
s = gpd.GeoSeries(bounds.flatten())

In [None]:
amhara = datasets['Amhara']

In [None]:
# filter dataset if MS enrolment is above 200, or
dist = haversine_vector(amhara[2],amhara[2], Unit.KILOMETERS, comb=True) # distance of MS to x. 
# dist = remove_diagonals(dist)
dist5 = dist < 5
enrol5 = np.where(dist5, amhara[1],0)
enrol = np.sum(enrol5, axis=0) > 200
# amhara[0].reset_index(drop=True)[enrol]

max20 = np.argpartition(enrol, -20)[-20:]
demand_points = np.argwhere(enrol5[max20])[:,1]

In [None]:
demand_points

In [None]:
set(enrol5[max20][0])

In [None]:
amhara[0].reset_index(drop=True).loc[max20]

In [None]:
amhara[0].reset_index(drop=True)[max20]

In [None]:
amhara[0][max20]

In [None]:
np.sum(enrol5, axis=0)

In [None]:
len(np.any(enrol5t, axis=0))

In [None]:
len(amhara[0])

In [None]:
enrol5t

In [None]:
len(amhara[0])

In [None]:
len(amhara[0].reset_index(drop=True)[enrol5t])

In [None]:
amhara[3][np.where(dist5[0], amhara[3],0)]

In [None]:
for i in dist5:
    np.where( i == True, )
    print(i)

In [None]:
len(dist5[0])

In [None]:
np.where(dist5)

In [None]:
for i in np.sum(dist5[amhara[3].astype(int)], axis=0):
    print(i)

In [None]:
for i in dist5:
    i[amhara[3]]

In [None]:
dist5[np.array(amhara[3])]

In [None]:
dist5

In [None]:
len(dist[0])

In [None]:
len(dist5[0])

In [None]:
for i in dist:
    print(i[dist5])

In [None]:
np.argwhere(dist <5)

In [None]:
np.where(dist <5) # np.where takes no axis argument; returns the (row, column) indices of pairs within 5km

In [None]:
dist5[amhara[1]]

In [None]:
# Substantially reduce the search space. 
np.where(dist[dist5]) # np.where takes no axis argument

In [None]:
len(dist[dist5])

In [None]:
dist[within5]

In [None]:
len(dist[0])

In [None]:
len(within5[0])

In [None]:
d_index = np.argmin(dist2, axis=0) # index of min distance of MS to x

In [None]:
d_index

In [None]:
# As the most expensive component is CMA, the modularity of the programme is geared around it.

In [None]:
# reduce size of dataset
# include only schools that are above a minimum enrolment of 200, or accumulation of MS_Enroll within 5km > 200. 

In [None]:
# removes self-comparisons.
def remove_diagonals(xx):
    # Remove diagonals as it is the distance to itself. 
    # Source: https://pyquestions.com/deleting-diagonal-elements-of-a-numpy-array
    xx = xx[~np.eye(xx.shape[0], dtype=bool)].reshape(xx.shape[0],-1)
    return xx

In [None]:
within5

In [None]:
amhara = prepare_datasets(df, 'Amhara')

In [None]:
amhara[0]

In [None]:
### Original Function. 


def run(df, region, n):
    """ Function to run all experiments per region according to the total new schools proposed. 
        Doesn't return any value. It appends the results dataframe and creates a set of visualisations per region.
    """
    # limit geojson to only selected region
    # limit clean dataset to only selected region
    gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
    gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)
    df = df.loc[df['ADM1_EN'] == region]
    
    # Find regional boundaries in lat lon.
    # Latitude is the Y axis, longitude is the X axis.
    bounds = gdf_region_shp.bounds 
    lat_bounds = bounds[['miny','maxy']].to_numpy(dtype=float)[0]
    lon_bounds = bounds[['minx','maxx']].to_numpy(dtype=float)[0]
    bounds = np.array([[lat_bounds[0], lon_bounds[0]], [lat_bounds[1], lon_bounds[1]]])
    # array - [[lower lat bounds, lower lon bounds],[upper lat bounds, upper lon bounds]]
    
    # CMA expects a list of size 2 for bounds
    x1y1 = np.repeat([bounds[0,:]],n, axis=0).flatten()
    x2y2 = np.repeat([bounds[1,:]],n, axis=0).flatten()
    boundsxy = [x1y1,x2y2]
    
    # Create subset arrays in numpy for input to the Fitness Function.
    # 1. df_middle_enroll: MS enrollment data. Only the last two grades used as predictors for expected SS enrollment
    # 2. df_middle_loc: MS location data- lat lon point data. 
    
    # 3. df_sec_enroll: SS enrollment data. Only grades 9 and 10 enrollment.
    # 4. df_sec_loc: SS location data- lat lon point data. 
    # 5. current_ms_distance: existing distances from MS to SS

    df_middle = df.loc[df['grade_7_8'] > 0]
    df_middle_enroll = df_middle['grade_7_8'].reset_index(drop=True).to_numpy(dtype=float) # 1
    df_middle_loc = df_middle['point'].reset_index(drop=True).to_numpy()
    df_middle_loc = np.array([np.array(i) for i in df_middle_loc], dtype=float) # 2

    df_sec = df.loc[ (df['gr_offer'] == 'G. 9-10') | (df['gr_offer'] == 'G. 9-12')]
    df_sec_enroll = df_sec['grade9_10'].reset_index(drop=True).to_numpy(dtype=float)
    df_sec_loc = df_sec['point'].reset_index(drop=True).to_numpy() 
    df_sec_loc = np.array([np.array(i) for i in df_sec_loc], dtype=float) # 3
    current_ms_distance = df_middle['nearest_lwr_sec'].to_numpy() # 5
    
    # Create a large initial sample of starting points. 
    sp = generate_random_sp(lat_bounds, lon_bounds)# Create a large sample of starting points
    sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
    
    ee_old_constant = np.sum(df_sec_enroll) # existing enrollment for secondary. The constant figure. 
    
#     best_sp = df_middle.nlargest(n, 'grade_7_8')['point'].reset_index(drop=True).to_numpy() 
#     best_sp = np.array([np.array(i) for i in best_sp], dtype=float) # 3

#     *** FITNESS FUNCTION ****

    def expected_enrollment(middle_loc, x, middle_enroll, current_dist):
        
        """ The Fitness Function returns the total expected enrollment given the new school locations x. 
        
            The function calculates the distance between each new school in x and the current middle schools to determine 
            the expected enrollment per new school. If a proposed school is closer to a middle school than its current 
            secondary school, that middle school's shaped enrollment is subtracted from the existing secondary enrollment 
            so that it is not counted twice. 
        
            Parameters:
                middle_loc: a vector of latitude and longitude coordinates of all middle schools
                x: a vector of latitude and longitude coordinates for the newly proposed secondary schools
                middle_enroll: a vector of current enrollment at each middle school
                current_dist: a vector of the current distances between each middle school and its nearest secondary school.
                            
            Returns:
                eei + ee_old (float): the expected enrollment increase from each school in x  (eei)
                                      + the current secondary enrollment (ee_old)
        
        """
        
        ee_old = ee_old_constant.copy() # Overall SS Enrollment
        d = haversine_vector(middle_loc,x, Unit.KILOMETERS, comb=True) # distance of MS to x. 
        d_min = np.min(d, axis=0) # select only closest schools to avoid duplication. 
        
        d_index = np.argmin(d, axis=0) # index of min distance of MS to x
        d2 = np.where((d_min <5) & (d_min < current_dist)) # limit to only schools < 5km and schools < current distance
        # Put into dataframe for manipulation in pandas
        # index 0 = x, index 1 = MS, index 2 = distance, index 3 = shaped enrollment 
        d3 = pd.DataFrame(np.vstack((d_index[d2], d2[0], d_min[d2], shape(d_min[d2], middle_enroll[d2[0]]))).T) 
        d32 = d3.loc[d3.groupby([1])[2].idxmin()] # find only nearby SS if MS is close to more than 1 SS.
        d32 = d32.groupby(0)[3].sum() # The overall shaped enrollment for each school in x.
        eei = np.sum(d32) # overall expected enrollment increase for each school in x.
        # Find MS enrollment of MS if close to old SS. 
        distance_current = np.sum(shape(current_dist[d2], middle_enroll[d2]))
        ee_old -= distance_current # subtract shaped enrollment from overall SS enrollment as new school is closer
        return eei + ee_old # return overall expected enrollment + current secondary enrollment

#     *** OBJECTIVE FUNCTION ****
    
    def f(x):
        """ The Objective Function returns the maximum expected enrollment.
        """
        x = x.reshape(n,2) # reshape for input into expected enrolment.
        test_case = expected_enrollment(df_middle_loc, x, df_middle_enroll, current_ms_distance) # run fitness function.
        return test_case*-1 # Multiply by -1 for maximising.
    
    
    # **** RESULTS ****
    
    create_results_table() # create table for results, if it doesn't already exist.
    results = pd.read_csv('results.csv')
    
    # *** ESTABLISH BENCHMARK ****
    
    # Basic Benchmark
    # Step 1. Find all MS with distance > 5km. 
    # Step 2. Sort by enrollment and select n schools with the highest enrollment
    # Step 3. Sum the enrollment, assuming that the new schools will only serve these n schools.
    # This is the basic benchmark
    
    if (region == 'Addis Ababa') | (region == 'Dire Dawa'):  
        # Distances to secondary schools are lower in city administrations
        dd = df_middle[df_middle['nearest_lwr_sec'] > 2] # Find all PS > 2km distance in city administrations
    else:
        dd = df_middle[df_middle['nearest_lwr_sec'] > 5] # Find all PS > 5km distance in regions.
    dd = dd.sort_values(['grade_7_8'], ascending=False).head(n) # filter by number of SS to construct
    benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment

    # Advanced Benchmark
    # Since MS with the highest enrollment are likely to be close to other MS, apply the fitness function to these schools
    # Step 1. Assume x is the locations of the top n schools
    # Step 2. Run fitness Function on these schools. 
    start_time = time.time()
    benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
    benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
    benchmark_f = f(benchmark_loc) # Run Fitness Function
       
    # Store results
    
    row1 = (pd.Series({'region': region, 'proposed_schools_n': n,
                       'starting_point':np.nan, 'algorithm':'Basic Benchmark', 
                       'ee': ee_old_constant + benchmark, 'eei': benchmark, # ee = existing enrollment + expected increase
                       'proposed_locations': benchmark_loc, 'time':0, 'sigma':np.nan}))
    
    row2 = (pd.Series({'region': region, 'proposed_schools_n': n,'starting_point': np.nan, 
                       'algorithm':'Advanced Benchmark', 'ee': round(abs(benchmark_f)),
                        'eei': round(abs(ee_old_constant  - abs(benchmark_f)),0), 
                        'proposed_locations': benchmark_loc, 'time':(time.time() - start_time), 'sigma':np.nan}))

    results = results.append(row1, ignore_index=True)
    results = results.append(row2, ignore_index=True)
    
    # Random Search Algorithm is used to see if an improved solution can be identified.
    
    # Set Parameters for experiments.
    n_starting_points = 30
    maxits = 20000 # max iterations
    
    best_sigma = {'Addis Ababa': 0.098445155,
                 'Amhara': 0.86193024,
                 'Benishangul Gumz': 0.443843224,
                 'Dire Dawa': 0.15484862300000002,
                 'Harari': 0.08544453699999999,
                 'Oromia': 2.018435726,
                 'SNNP': 0.14630103,
                 'Somali': 0.34336816299999995,
                 'Tigray': 0.555333321}
    
#     def random_search(f, iterations):
#         x = [generate_sp_proposed(sp, n) for _ in range(iterations)]
#         fx = [[f(xi), xi] for xi in x]
#         best_f, best_solution = min(fx, key=lambda x:x[0])
#         return best_f, best_solution
    
#     start_time_rs = time.time()
#     fx = []
#     for _ in range(n_starting_points):
#         start_time = time.time()
#         fx.append([random_search(f, maxits), time.time() - start_time])
        
#     for i in range(0, len(fx)):
#         row = (pd.Series({'region': region, 'proposed_schools_n': n,
#                         'iteration':i, 'algorithm':'Random Search', 'ee':round(abs(fx[i][0][0]),0),
#                                     'eei':round(abs(ee_old_constant - abs(fx[i][0][0])),0), 
#                                     'proposed_locations': fx[i][0][1], 'time':fx[i][1], 'sigma':'NA'}))
#         results = results.append(row, ignore_index=True)
        
#     print(region, 'RS complete. Total time:', round(time.time() - start_time_rs,2))
    
    # Covariance Matrix Adaptation-  CMA
    
    # Sigma is the initial standard deviation. Problem variables need to be scaled per region.
    # Take 10 values between 0 and the variance of all MS
#     sigma0 = 1.5*(np.sqrt(np.std(df_middle_loc[:,0])**2 + np.std(df_middle_loc[:,1])**2))
#     sigmas = np.linspace(0, sigma0, 11)[1:] # Take 10 values evenly spaced out between 0 to sigma0
    
    start_time_cma = time.time()
    fcma = []
    for i in range(n_starting_points):
        start_time = time.time()
#         for j in sigmas:
        es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp, n).flatten(), sigma0=best_sigma[region],
                                  inopts={'bounds': boundsxy,'seed':1234})
        es.optimize(f, iterations=(maxits/ es.popsize))
        fcma.append((es.result[1], es.result[0].reshape(n, 2), (time.time() - start_time), best_sigma[region]))
            
    for i in range(0, len(fcma)):
        row = (pd.Series({'region': region, 'proposed_schools_n': n,
                    'starting_point':i, 'algorithm':'CMA', 'ee':round(abs(fcma[i][0]),0),
                                'eei':round(abs(ee_old_constant - abs(fcma[i][0])),0), 
                                'proposed_locations': fcma[i][1], 'time':fcma[i][2], 'sigma':fcma[i][3]}))
        results = results.append(row, ignore_index=True)
    
    print(region, 'CMA complete. Total time:', round(time.time() - start_time_cma,2))
    
    results.to_csv('results.csv', index=False)
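
The logic of the fitness function inside `run` can be illustrated with a small, dependency-free sketch. This is a simplified reading of the docstring, not the author's code: the `shape` function is defined elsewhere in the notebook, so the linear decay used here is purely a hypothetical placeholder, and the loop replaces the vectorised NumPy and pandas steps.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(h))

def shape(distance_km, enrolment, max_km=5.0):
    # Hypothetical distance decay: full enrolment at 0 km, none at max_km or beyond.
    return enrolment * max(0.0, 1.0 - distance_km / max_km)

def expected_enrolment(middle_loc, proposed, middle_enrol, current_dist, existing_total):
    total = existing_total
    for loc, enrol, cur in zip(middle_loc, middle_enrol, current_dist):
        # Each middle school is served only by its nearest proposed school.
        nearest = min(haversine_km(loc, p) for p in proposed)
        if nearest < 5.0 and nearest < cur:
            total += shape(nearest, enrol)  # gain at the new, closer school
            total -= shape(cur, enrol)      # avoid double counting at the old school
    return total

middle_loc = [(9.0, 38.0), (9.5, 38.5)]
middle_enrol = [100, 80]
current_dist = [10.0, 2.0]
proposed = [(9.0, 38.0)]
total = expected_enrolment(middle_loc, proposed, middle_enrol, current_dist, 500.0)
objective_value = -total  # negated, since the optimisers minimise
```

In this toy case only the first middle school benefits: the proposed school sits on top of it, while the second already has a secondary school 2 km away and lies far from the proposed site.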

In [None]:
# always globally converge?
# if maximum enrolment 


   


In [None]:
## 

Experiments

In [None]:
boundy = get_bounds(df[df['ADM1_EN']=='Amhara'])

In [None]:
spish = generate_random_sp(boundy)

In [None]:
gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']=='Amhara']['geometry'].reset_index(drop=True)

In [None]:
sp = spish[check_region(spish, gdf_region_shp)] # only include points that are within regional boundaries.


In [None]:
spy = generate_sp_proposed(spish, 5)

In [None]:
spy

In [None]:
np.min(df['lon'])

In [None]:
bounds

In [None]:
sp = generate_random_sp(lat_bounds, lon_bounds)# Create a large sample of starting points
sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.

In [None]:
get_bounds(df[df['ADM1_EN']=='Amhara'], 'Amhara', 'other', 5)

In [None]:
get_bounds(df[df['ADM1_EN']=='Amhara'], 'Amhara', 'minmax', 5)

In [None]:
#     else:
#         gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
#         gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)

#         # Find regional boundaries in lat lon.
#         # Latitude is the Y axis, longitude is the X axis.
#         bounds = gdf_region_shp.bounds 
#         lat_bounds = bounds[['miny','maxy']].to_numpy(dtype=float)[0]
#         lon_bounds = bounds[['minx','maxx']].to_numpy(dtype=float)[0]
#         bounds = np.array([[lat_bounds[0], lon_bounds[0]], [lat_bounds[1], lon_bounds[1]]])
#         # array - [[lower lat bounds, lower lon bounds],[upper lat bounds, upper lon bounds]]



In [None]:
amhara5 = prepare_datasets(df, 'Amhara', 5)

In [None]:
amhara5[0]

In [None]:
# Experiments
# Basic Benchmark for 5, 10, 20 schools


In [None]:
addis_ababa =
amhara = 
oromia = 

In [None]:
def basic_benchmark(df_middle, region, n):
    # *** ESTABLISH BENCHMARK ****
    
    # Basic Benchmark
    # Step 1. Find all MS with distance > 5km. 
    # Step 2. Sort by enrollment and select n schools with the highest enrollment
    # Step 3. Sum the enrollment, assuming that the new schools will only serve these n schools.
    # This is the basic benchmark
    
    if (region == 'Addis Ababa') | (region == 'Dire Dawa'):  
        # Distances to secondary schools are lower in city administrations
        dd = df_middle[df_middle['nearest_lwr_sec'] > 2] # Find all MS > 2km distance in city administrations
    else:
        dd = df_middle[df_middle['nearest_lwr_sec'] > 5] # Find all MS > 5km distance in regions.
    dd = dd.sort_values(['grade_7_8'], ascending=False).head(n) # filter by number of SS to construct
    benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment
    return benchmark, dd
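
To make the basic benchmark concrete, here is a small sketch on made-up data using plain Python lists instead of a dataframe. The function name `basic_benchmark_sketch`, the `name` field and the toy schools are assumptions for illustration only; the thresholds follow the cell above.

```python
def basic_benchmark_sketch(middle_schools, region, n):
    """Sum grade 7-8 enrolment of the n largest middle schools far from a secondary school."""
    # City administrations use a 2 km threshold, other regions 5 km.
    threshold_km = 2 if region in ("Addis Ababa", "Dire Dawa") else 5
    far = [s for s in middle_schools if s["nearest_lwr_sec"] > threshold_km]
    top = sorted(far, key=lambda s: s["grade_7_8"], reverse=True)[:n]
    return sum(s["grade_7_8"] for s in top), top

toy_schools = [
    {"name": "A", "grade_7_8": 300, "nearest_lwr_sec": 6.0},
    {"name": "B", "grade_7_8": 250, "nearest_lwr_sec": 3.0},
    {"name": "C", "grade_7_8": 200, "nearest_lwr_sec": 8.0},
    {"name": "D", "grade_7_8": 100, "nearest_lwr_sec": 9.0},
]
regional_total, regional_top = basic_benchmark_sketch(toy_schools, "Amhara", 2)
city_total, city_top = basic_benchmark_sketch(toy_schools, "Addis Ababa", 2)
```

With the 5 km regional threshold, school B is excluded and the benchmark sums A and C; with the 2 km city threshold, B qualifies and replaces C.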
    
    

# Main

In [None]:

def run(df, region, n):
    """ Function to run all experiments per region according to the total new schools proposed. 
        Doesn't return any value. It appends the results dataframe and creates a set of visualisations per region.
    """
    # limit geojson to only selected region
    # limit clean dataset to only selected region
    gdf_region = gpd.read_file('eth_shape_files/json//eth_admin1v2.json') # read in geojson
    gdf_region_shp = gdf_region.loc[gdf_region['ADM1_EN']==region]['geometry'].reset_index(drop=True)
    df = df.loc[df['ADM1_EN'] == region]
    
    # Find regional boundaries in lat lon.
    # Latitude is the Y axis, longitude is the X axis.
    bounds = gdf_region_shp.bounds 
    lat_bounds = bounds[['miny','maxy']].to_numpy(dtype=float)[0]
    lon_bounds = bounds[['minx','maxx']].to_numpy(dtype=float)[0]
    bounds = np.array([[lat_bounds[0], lon_bounds[0]], [lat_bounds[1], lon_bounds[1]]])
    # array - [[lower lat bounds, lower lon bounds],[upper lat bounds, upper lon bounds]]
    
    # CMA expects a list of size 2 for bounds
    x1y1 = np.repeat([bounds[0,:]],n, axis=0).flatten()
    x2y2 = np.repeat([bounds[1,:]],n, axis=0).flatten()
    boundsxy = [x1y1,x2y2]
    
    # Create subset arrays in numpy for input to the Fitness Function.
    # 1. df_middle_enroll: MS enrollment data. Only the last two grades used as predictors for expected SS enrollment
    # 2. df_middle_loc: MS location data- lat lon point data. 
    # 3. df_sec_enroll: SS enrollment data. Only grades 9 and 10 enrollment.
    # 4. df_sec_loc: SS location data- lat lon point data. 
    # 5. current_ms_distance: existing distances from MS to SS

    df_middle = df.loc[df['grade_7_8'] > 0]
    df_middle_enroll = df_middle['grade_7_8'].reset_index(drop=True).to_numpy(dtype=float) # 1
    df_middle_loc = df_middle['point'].reset_index(drop=True).to_numpy()
    df_middle_loc = np.array([np.array(i) for i in df_middle_loc], dtype=float) # 2

    df_sec = df.loc[ (df['gr_offer'] == 'G. 9-10') | (df['gr_offer'] == 'G. 9-12')]
    df_sec_enroll = df_sec['grade9_10'].reset_index(drop=True).to_numpy(dtype=float)
    df_sec_loc = df_sec['point'].reset_index(drop=True).to_numpy() 
    df_sec_loc = np.array([np.array(i) for i in df_sec_loc], dtype=float) # 3
    current_ms_distance = df_middle['nearest_lwr_sec'].to_numpy() # 5
    
    # Create a large initial sample of starting points. 
    sp = generate_random_sp(lat_bounds, lon_bounds)# Create a large sample of starting points
    sp = sp[check_region(sp, gdf_region_shp)] # only include points that are within regional boundaries.
    
    ee_old_constant = np.sum(df_sec_enroll) # existing enrollment for secondary. The constant figure. 
    
#     best_sp = df_middle.nlargest(n, 'grade_7_8')['point'].reset_index(drop=True).to_numpy() 
#     best_sp = np.array([np.array(i) for i in best_sp], dtype=float) # 3

#     *** FITNESS FUNCTION ****

    def expected_enrollment(middle_loc, x, middle_enroll, current_dist):
        
        """ The Fitness Function returns the total expected enrollment given the new school locations x. 
        
            The function calculates the distance between each new school in x and the current middle schools to determine 
            the expected enrollment per new school. If a proposed school is closer to a middle school than its current 
            secondary school, that middle school's shaped enrollment is subtracted from the existing secondary enrollment 
            so that it is not counted twice. 
        
            Parameters:
                middle_loc: a vector of latitude and longitude coordinates of all middle schools
                x: a vector of latitude and longitude coordinates for the newly proposed secondary schools
                middle_enroll: a vector of current enrollment at each middle school
                current_dist: a vector of the current distances between each middle school and its nearest secondary school.
                            
            Returns:
                eei + ee_old (float): the expected enrollment increase from each school in x  (eei)
                                      + the current secondary enrollment (ee_old)
        
        """
        
        ee_old = ee_old_constant.copy() # Overall SS Enrollment
        d = haversine_vector(middle_loc,x, Unit.KILOMETERS, comb=True) # distance of MS to x. 
        d_min = np.min(d, axis=0) # select only closest schools to avoid duplication. 
        
        d_index = np.argmin(d, axis=0) # index of min distance of MS to x
        d2 = np.where((d_min <5) & (d_min < current_dist)) # limit to only schools < 5km and schools < current distance
        # Put into dataframe for manipulation in pandas
        # index 0 = x, index 1 = MS, index 2 = distance, index 3 = shaped enrollment 
        d3 = pd.DataFrame(np.vstack((d_index[d2], d2[0], d_min[d2], shape(d_min[d2], middle_enroll[d2[0]]))).T) 
        d32 = d3.loc[d3.groupby([1])[2].idxmin()] # find only nearby SS if MS is close to more than 1 SS.
        d32 = d32.groupby(0)[3].sum() # The overall shaped enrollment for each school in x.
        eei = np.sum(d32) # overall expected enrollment increase for each school in x.
        # Find MS enrollment of MS if close to old SS. 
        distance_current = np.sum(shape(current_dist[d2], middle_enroll[d2]))
        ee_old -= distance_current # subtract shaped enrollment from overall SS enrollment as new school is closer
        return eei + ee_old # return overall expected enrollment + current secondary enrollment

#     *** OBJECTIVE FUNCTION ****
    
    def f(x):
        """ The Objective Function returns the maximum expected enrollment.
        """
        x = x.reshape(n,2) # reshape for input into expected enrolment.
        test_case = expected_enrollment(df_middle_loc, x, df_middle_enroll, current_ms_distance) # run fitness function.
        return test_case*-1 # Multiply by -1 for maximising.
    
    
    # **** RESULTS ****
    
    create_results_table() # create table for results, if it doesn't already exist.
    results = pd.read_csv('results.csv')
    
    # *** ESTABLISH BENCHMARK ****
    
    # Basic Benchmark
    # Step 1. Find all MS with distance > 5km. 
    # Step 2. Sort by enrollment and select n schools with the highest enrollment
    # Step 3. Sum the enrollment, assuming that the new schools will only serve these n schools.
    # This is the basic benchmark
    
    if (region == 'Addis Ababa') | (region == 'Dire Dawa'):  
        # Distances to secondary schools are lower in city administrations
        dd = df_middle[df_middle['nearest_lwr_sec'] > 2] # Find all PS > 2km distance in city administrations
    else:
        dd = df_middle[df_middle['nearest_lwr_sec'] > 5] # Find all PS > 5km distance in regions.
    dd = dd.sort_values(['grade_7_8'], ascending=False).head(n) # filter by number of SS to construct
    benchmark = sum(dd['grade_7_8']) # sum the enrollment for a basic indication of expected enrollment

    # Advanced Benchmark
    # Since MS with the highest enrollment are likely to be close to other MS, apply the fitness function to these schools
    # Step 1. Assume x is the locations of the top n schools
    # Step 2. Run fitness Function on these schools. 
    start_time = time.time()
    benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
    benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
    benchmark_f = f(benchmark_loc) # Run Fitness Function
       
    # Store results
    
    row1 = (pd.Series({'region': region, 'proposed_schools_n': n,
                       'iteration':np.nan, 'algorithm':'Basic Benchmark', 
                       'ee': ee_old_constant + benchmark, 'eei': benchmark, # ee = existing enrollment + expected increase
                       'proposed_locations': benchmark_loc, 'time':0, 'sigma':np.nan}))
    
    row2 = (pd.Series({'region': region, 'proposed_schools_n': n,'iteration': np.nan, 
                       'algorithm':'Advanced Benchmark', 'ee': round(abs(benchmark_f)),
                        'eei': round(abs(ee_old_constant  - abs(benchmark_f)),0), 
                        'proposed_locations': benchmark_loc, 'time':(time.time() - start_time), 'sigma':np.nan}))

    results = results.append(row1, ignore_index=True)
    results = results.append(row2, ignore_index=True)
    
    # Random Search Algorithm is used to see if an improved solution can be identified.
    
    # Set Parameters for experiments.
    n_starting_points = 30
    maxits = 20000 # max iterations
    
    best_sigma = {'Addis Ababa': 0.098445155,
                 'Amhara': 0.86193024,
                 'Benishangul Gumz': 0.443843224,
                 'Dire Dawa': 0.15484862300000002,
                 'Harari': 0.08544453699999999,
                 'Oromia': 2.018435726,
                 'SNNP': 0.14630103,
                 'Somali': 0.34336816299999995,
                 'Tigray': 0.555333321}
    
#     def random_search(f, iterations):
#         x = [generate_sp_proposed(sp, n) for _ in range(iterations)]
#         fx = [[f(xi), xi] for xi in x]
#         best_f, best_solution = min(fx, key=lambda x:x[0])
#         return best_f, best_solution
    
#     start_time_rs = time.time()
#     fx = []
#     for _ in range(n_starting_points):
#         start_time = time.time()
#         fx.append([random_search(f, maxits), time.time() - start_time])
        
#     for i in range(0, len(fx)):
#         row = (pd.Series({'region': region, 'proposed_schools_n': n,
#                         'iteration':i, 'algorithm':'Random Search', 'ee':round(abs(fx[i][0][0]),0),
#                                     'eei':round(abs(ee_old_constant - abs(fx[i][0][0])),0), 
#                                     'proposed_locations': fx[i][0][1], 'time':fx[i][1], 'sigma':'NA'}))
#         results = results.append(row, ignore_index=True)
        
#     print(region, 'RS complete. Total time:', round(time.time() - start_time_rs,2))
    
    # Covariance Matrix Adaptation-  CMA
    
    # sigma0 is CMA's initial standard deviation (step size); it must be scaled to each region's extent.
    # Earlier sweep: 10 evenly spaced values between 0 (excluded) and 1.5 x the combined lat/lon standard deviation of all MS.
#     sigma0 = 1.5*(np.sqrt(np.std(df_middle_loc[:,0])**2 + np.std(df_middle_loc[:,1])**2))
#     sigmas = np.linspace(0, sigma0, 11)[1:] # Take 10 values evenly spaced out between 0 to sigma0
    
    start_time_cma = time.time()
    fcma = []
    for i in range(n_starting_points):
        start_time = time.time()
#         for j in sigmas:
        es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp, n).flatten(), sigma0=best_sigma[region],
                                  inopts={'bounds': boundsxy,'seed':1234})
        es.optimize(f, iterations=(maxits/ es.popsize))
        fcma.append((es.result[1], es.result[0].reshape(n, 2), (time.time() - start_time), best_sigma[region]))
            
    for i in range(0, len(fcma)):
        row = (pd.Series({'region': region, 'proposed_schools_n': n,
                    'iteration':i, 'algorithm':'CMA', 'ee':round(abs(fcma[i][0]),0),
                                'eei':round(abs(ee_old_constant - abs(fcma[i][0])),0), 
                                'proposed_locations': fcma[i][1], 'time':fcma[i][2], 'sigma':fcma[i][3]}))
        results = pd.concat([results, row.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    print(region, 'CMA complete. Total time:', round(time.time() - start_time_cma,2))
    
    results.to_csv('results.csv', index=False)

In [None]:
all_regions = set(df['ADM1_EN'])
for i in all_regions:
    run(df,i, 5)

In [None]:
all_regions = set(df['ADM1_EN'])
for i in all_regions:
    run(df,i, 10)

In [None]:
all_reg2 = all_regions - {'Harari'}  # every region except Harari (sets cannot be indexed by value)

In [None]:
all_reg2 = {'Addis Ababa',
 'Amhara',
 'Benishangul Gumz',
 'Dire Dawa',
 'Oromia',
 'SNNP',
 'Somali',
 'Tigray'}

In [None]:
for i in all_reg2:
    run(df,i, 10)

In [None]:
run(df,'Addis Ababa', 10) # only 8 schools in Harari

In [None]:
run(df,'SNNP', 10)# only 8 schools.

In [None]:
all_regions = set(df['ADM1_EN'])
for i in all_regions:
    run(df,i, 10)

In [None]:
run(df, 'Harari', 5)

In [None]:
run(df, 'Addis Ababa', 5)

In [None]:
run(df, 'Amhara', 5)

In [None]:
run(df, 'Benishangul Gumz', 5)

In [None]:
run(df, 'Dire Dawa', 5)

In [None]:
run(df, 'Oromia', 5)

In [None]:
run(df, 'SNNP', 5)

In [None]:
run(df, 'Somali', 5)

In [None]:
run(df, 'Tigray', 5)

In [None]:
# experiment 2

In [None]:
run(df, 'Harari', 5)

In [None]:
run(df, 'Addis Ababa', 5)

In [None]:
run(df, 'Amhara', 5)

In [None]:
run(df, 'Benishangul Gumz', 5)

In [None]:
run(df, 'Dire Dawa', 5)

In [None]:
run(df, 'Oromia', 5)

In [None]:
run(df, 'SNNP', 5)

In [None]:
run(df, 'Somali', 5)

In [None]:
run(df, 'Tigray', 5)

In [None]:
# experiment 3

In [None]:
run(df, 'Amhara', 10)

In [None]:
run(df, 'Oromia', 10)

In [None]:
run(df, 'SNNP', 10)

In [None]:
run(df, 'Tigray', 10)

In [None]:
run(df, 'Benishangul Gumz', 10)

In [None]:
run(df, 'Harari', 10)

In [None]:
run(df, 'Addis Ababa', 10) # Not necessary. 
run(df, 'Dire Dawa', 10)

In [None]:
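def main(n=5):
    # Run the experiment for every region. n is the number of schools to propose;
    # the default of 5 is an assumption (both 5 and 10 are used in the cells above).
    for region in sorted(set(df['ADM1_EN'])):
        run(df, region, n)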

In [None]:
dd = df_middle[df_middle['nearest_lwr_sec'] > 5] # Keep middle schools more than 5km from the nearest lower secondary school
dd = dd.sort_values(['grade7_8'], ascending=False).head(n) # Keep the n largest by grade 7-8 enrollment (n = number of SS to construct, e.g. 5)
benchmark = sum(dd['grade7_8']) # Sum their enrollment as a basic indication of expected enrollment
benchmark

The figure above is the total enrollment of only the top n schools selected as proposed sites. These locations also need to be evaluated with the objective function, because schools with high enrollment are often in more urban areas, close to many other primary schools whose students could also attend a new SS built nearby. The objective function therefore gives a wider estimate than the enrollment of the top schools alone.
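
To illustrate the point, here is a minimal sketch with made-up data. It is not the notebook's objective function `f`: it uses a hard 5km cutoff rather than a distance-based drop-off, and `haversine_km`, `catchment_enrollment` and `toy_schools` are hypothetical names introduced only for this example.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def catchment_enrollment(site, schools, radius_km=5.0):
    """Sum the enrollment of every feeder school within radius_km of the site."""
    return sum(enroll for loc, enroll in schools
               if haversine_km(site, loc) <= radius_km)

# Hypothetical feeder schools: ((lat, lon), grade 7-8 enrollment)
toy_schools = [
    ((9.000, 38.700), 400),  # largest single school, but isolated
    ((9.300, 38.900), 300),  # three schools roughly 1km apart
    ((9.305, 38.905), 250),
    ((9.310, 38.900), 200),
]

# Benchmark-style view: enrollment of the single largest school.
top_school_only = max(enroll for _, enroll in toy_schools)

# Catchment view: enrollment reachable from each candidate site.
by_catchment = {loc: catchment_enrollment(loc, toy_schools) for loc, _ in toy_schools}
best_site = max(by_catchment, key=by_catchment.get)

print(top_school_only, by_catchment[best_site], best_site)
```

In this toy layout the largest single school enrolls 400 students, while a site inside the cluster is within 5km of 750.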

In [None]:
benchmark_loc = dd['point'].reset_index(drop=True).to_numpy()
benchmark_loc = np.array([np.array(i) for i in benchmark_loc], dtype=float)
benchmark_f = f(benchmark_loc)

In [None]:
benchmark_loc

In [None]:
# Below are the figures to beat.
print('Overall expected enrollment: ', benchmark_f, '\n' \
      'Expected Enrollment Increase: ', round(abs(np.sum(df_sec_enroll) - abs(benchmark_f)),0))

## 2. Random Search

In [None]:
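# The Random Search algorithm is used to see if an improved solution can be identified.
def random_search(f, iterations):
    # Draw `iterations` random candidate solutions, each proposing n school locations.
    x = [generate_sp_proposed(sp, n) for _ in range(iterations)]
    fx = [[f(xi), xi] for xi in x]
    # f is minimised, so the best candidate has the smallest objective value.
    best_f, best_solution = min(fx, key=lambda pair: pair[0])
    return best_f, best_solution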

In [None]:
n_starting_points = 30
maxits = 10000

In [None]:
%%time
# Repeat the random search 30 times, each run drawing maxits (10,000) candidate solutions.
fx = []
for run_idx in range(n_starting_points):
    start_time = time.time()
    fx.append([random_search(f, maxits), time.time() - start_time])
    print(run_idx, time.time() - start_time, 'starting point completed.')

In [None]:
results

Random Search does not find an improved solution. 
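
For reference, the same search pattern can be sketched on a toy problem. This version keeps only the running best instead of storing every candidate; `random_search_min`, `toy_objective` and `toy_sample` are hypothetical stand-ins for `random_search`, `f` and `generate_sp_proposed`, not the notebook's own code.

```python
import random

def random_search_min(objective, sample, iterations, seed=0):
    """Pure random search: draw candidates independently and keep the best one seen."""
    rng = random.Random(seed)
    best_f, best_x = float("inf"), None
    for _ in range(iterations):
        x = sample(rng)        # draw a fresh candidate; no memory of earlier draws
        fx = objective(x)
        if fx < best_f:        # minimisation: smaller objective values are better
            best_f, best_x = fx, x
    return best_f, best_x

def toy_objective(x):
    # Toy bowl-shaped function with its minimum (value 0) at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def toy_sample(rng):
    # Uniform draw from the box [-5, 5] x [-5, 5].
    return (rng.uniform(-5, 5), rng.uniform(-5, 5))

best_f, best_x = random_search_min(toy_objective, toy_sample, iterations=2000)
print(best_f, best_x)
```

Because each draw is independent, more iterations can only match or improve the best value found for a given seed, but nothing steers the search towards promising regions.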

## 3. CMA

In [None]:
results = results.reset_index(drop=True)

In [None]:
sigmas = (0.1, 0.4, 0.8, 0.9, 1.2, 1.4, 1.6)

In [None]:
%%time

fcma = []
for i in range(n_starting_points):
    start_time = time.time()
    for j in sigmas:
        es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp, proposed_schools).flatten(), sigma0=j,
                                  inopts={'bounds': boundsxy,'seed':1234})
        es.optimize(f, iterations=(maxits/ es.popsize))
        fcma.append((es.result[1], es.result[0].reshape(proposed_schools, 2), (time.time() - start_time), j))

In [None]:
print('Potential Good Sigma: ', np.sqrt(np.std(df_prim_loc[:,0])**2 + np.std(df_prim_loc[:,1])**2))

In [None]:
testing = df_prim['point'][df_prim['nearest_lwr_sec'] < 5].reset_index(drop=True).to_numpy()
testing = np.array([np.array(i) for i in testing], dtype=float)
print('Potential Good Sigma: ', np.sqrt(np.std(testing[:,0])**2 + np.std(testing[:,1])**2))
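
The spread-based heuristic above, and the sigma sweep it was used to bound, can be sketched without the school data. This is a minimal sketch, assuming population standard deviations (NumPy's default for `np.std`). `spread_sigma`, `sigma_grid` and `toy_points` are hypothetical names, and the 1.5 scale factor follows the commented-out `sigma0` line earlier in the notebook.

```python
import math
import statistics

def spread_sigma(points, scale=1.0):
    """Combined spread of (lat, lon) points: scale * sqrt(std_lat^2 + std_lon^2)."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return scale * math.sqrt(statistics.pstdev(lats) ** 2 + statistics.pstdev(lons) ** 2)

def sigma_grid(sigma_max, steps=10):
    """Evenly spaced candidate sigmas in (0, sigma_max]; zero itself is excluded."""
    return [sigma_max * k / steps for k in range(1, steps + 1)]

# Hypothetical school coordinates on the corners of a 2 x 2 degree square.
toy_points = [(9.0, 38.0), (9.0, 40.0), (11.0, 38.0), (11.0, 40.0)]

sigma_max = spread_sigma(toy_points, scale=1.5)
candidates = sigma_grid(sigma_max)
print(round(sigma_max, 4), [round(s, 4) for s in candidates])
```

Each candidate would then be passed as `sigma0` to a separate CMA run, and the value giving the best result kept for that region.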

In [None]:
for i in range(0, len(fcma)):
    row = (pd.Series({'random_starting_point':i, 'algorithm':'CMA', 'ee':round(abs(fcma[i][0]),0),
                                'eei':round(abs(np.sum(df_sec_enroll) - abs(fcma[i][0])),0), 
                                'proposed_locations': fcma[i][1], 'time':fcma[i][2], 'sigma':fcma[i][3]}))
    results = pd.concat([results, row.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [None]:
results = results.sort_values(['eei'], ascending=False).reset_index(drop=True) # drop=True avoids adding a stray 'index' column

In [None]:
results.to_csv('results_revamp.csv')

In [None]:
# from ast import literal_eval

In [None]:
# results4 = pd.read_csv('results3.csv', converters={'proposed_locations': literal_eval})

In [None]:
###### Show results of top 4.
top_4 = results[:4]
top_4

In [None]:
# it is latitude then longitude.
# box = np.array([[10.713719, 36.689328], [10.713719, 36.96973],[10.964773, 36.96973], [10.964773, 36.689328], [10.713719, 36.689328]])
plt.figure(figsize=(15, 10))
# plt.plot(box[:,1], box[:,0], '.r-')
plt.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
plt.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
plt.scatter(top_4['proposed_locations'][3][:, 1], top_4['proposed_locations'][3][:, 0], s = 35, marker="o", label="New Secondary") # proposed new secondary schools
plt.gca().set_aspect('equal')
plt.legend()
plt.show()

In [None]:
# it is latitude then longitude.
# box = np.array([[10.713719, 36.689328], [10.713719, 36.96973],[10.964773, 36.96973], [10.964773, 36.689328], [10.713719, 36.689328]])
plt.figure(figsize=(15, 10))
# plt.plot(box[:,1], box[:,0], '.r-')
plt.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
plt.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
plt.scatter(benchmark_loc[:, 1], benchmark_loc[:, 0], s = 35, marker="o", label="New Secondary") # benchmark locations for new secondary schools
plt.gca().set_aspect('equal')
plt.legend()
plt.show()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,15))
fig.suptitle('Top 4 Results. maxits=10,000, random_sp=30')

# axes.flat walks the 2x2 grid row by row, so panel i shows result i.
for i, ax in enumerate(axes.flat):
    ax.scatter(df_prim_loc[:, 1], df_prim_loc[:, 0], s=df_prim_enroll/100, label="Prim") # s gives size
    if len(df_sec) != 0:
        ax.scatter(df_sec_loc[:, 1], df_sec_loc[:, 0], s=df_sec_enroll/100, label="Secondary") # s gives size
    # Row-wise indexing works whether the locations are NumPy arrays or nested lists.
    ax.scatter([row[1] for row in top_4['proposed_locations'][i]],
               [row[0] for row in top_4['proposed_locations'][i]], s=35,
               marker="o", label="New Secondary")
    ax.set_title((str(top_4.loc[i]['algorithm']) + ', eei =  ' + str(top_4.loc[i]['eei'])
                  + ', sigma: ' + str(top_4.loc[i]['sigma'])), fontstyle='italic')

for ax in fig.get_axes():
    ax.legend()
    ax.label_outer()

CMA does not beat the benchmark. It may need to be re-run with new parameters.

In [None]:
winner = top_4.head(1)
winner

In [None]:
# Function to shape expected enrollment.
def shape2(distance, enrollment):
    # Below min_walk (5km, about a 1hr 15 min walk), all children are expected to attend secondary, i.e. distance is not a factor.
    min_walk = 5
    max_walk = 8.94666 # beyond max_walk the school is assumed to be too far, and zero enrollment is expected.
    # Between min_walk and max_walk, expected enrollment drops off linearly.
    return np.where(distance<min_walk, enrollment,
             np.where(distance>max_walk, 0,
                     enrollment*(1-(distance-min_walk)/(max_walk-min_walk)))
            )
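
A scalar version of the same drop-off makes the thresholds easy to check by hand. This is a sketch only: `linear_dropoff` is a hypothetical helper that mirrors `shape2` for a single distance, using the same `min_walk` and `max_walk` values.

```python
def linear_dropoff(distance_km, enrollment, min_walk=5.0, max_walk=8.94666):
    """Expected enrollment from one feeder school at a given distance (km)."""
    if distance_km < min_walk:
        return float(enrollment)   # close enough: everyone is expected to attend
    if distance_km > max_walk:
        return 0.0                 # too far: no one is expected to attend
    # In between, enrollment falls linearly from 100% at min_walk to 0% at max_walk.
    return enrollment * (1 - (distance_km - min_walk) / (max_walk - min_walk))

midpoint = (5.0 + 8.94666) / 2
for d in (3.0, midpoint, 10.0):
    print(d, linear_dropoff(d, 100))
```

A school of 100 students contributes all 100 at 3km, about half at the midpoint between the two thresholds, and none at 10km.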

In [None]:
np.sum(df_sec_enroll)

In [None]:
np.sum(shape(current_ps_distance, df_prim_enroll))

In [None]:
np.sum(shape2(current_ps_distance, df_prim_enroll))

In [None]:
es = cma.CMAEvolutionStrategy(generate_sp_proposed(sp, proposed_schools).flatten(), sigma0=0.9,
                          inopts={'bounds': boundsxy,'seed':1234})
es.optimize(f, iterations=(maxits/ es.popsize))

In [None]:
benchmark_loc

In [None]:
es = cma.CMAEvolutionStrategy(benchmark_loc.flatten(), sigma0=0.9,
                          inopts={'bounds': boundsxy,'seed':1234})
es.optimize(f, iterations=(maxits/ es.popsize))

In [None]:
es.result

In [None]:
df_prim['urban_rural'].value_counts()

In [None]:
shape(100, 4)

In [None]:
df_prim.columns

In [None]:
np.mean(df_prim[df_prim['urban_rural'] == 2]['nearest_lwr_sec']) # distribution. 

In [None]:
np.mean(df_prim[df_prim['urban_rural'] == 1]['nearest_lwr_sec'])

In [None]:
# f(winner['proposed_locations'][0])
# np.mean(df_sec_enroll)
# update_expected_enrollment(winner['proposed_locations'][0])
# np.sum(eei)

In [None]:
# df_prim.groupby(['nearest_sch_code'])['expected_enroll'].agg('sum')[:5]

In [None]:
dd = haversine_vector(df_prim_loc,df_sec_loc, Unit.KILOMETERS, comb=True)

In [None]:
np.where(dd < 5)[0]