# Reading Data

In [13]:
import pandas as pd
from shapely.geometry import Point
from shapely import wkt
import numpy as np

In [14]:
project_data = pd.read_csv('My_CHI._My_Future._Programs_20241113.csv')
chi_nei=pd.read_csv('CommAreas_20241114.csv')

### Data cleaning - Impute Missing geographic cluster (neighborhood name) data
*By Luna Xu*

Step 1: Inspect the geographic cluster data. How many values are missing?

In [15]:
project_data['Geographic Cluster Name'].isnull().sum()

np.int64(130071)

There are 130071 program entries with missing geographic cluster names.

We know that some of the programs are online, so it makes sense that they don't have a geographic cluster. Let's set the geographic cluster name of online programs to 'online'.

In [16]:
project_data['Geographic Cluster Name'] = project_data.apply(
    lambda row: 'online' if row['Meeting Type'] == 'online' and pd.isnull(row['Geographic Cluster Name']) else row['Geographic Cluster Name'],
    axis=1
)
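The same fill can also be written without a row-wise `apply`, using a boolean mask and `.loc`. The sketch below runs on a small made-up frame; the column names mirror the ones in this notebook, but the 'face to face' value and the rows themselves are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical rows); only the column names come from the notebook.
toy = pd.DataFrame({
    'Meeting Type': ['online', 'face to face', 'online', 'face to face'],
    'Geographic Cluster Name': [None, None, 'LOOP', 'UPTOWN'],
})

# Rows that are online AND have no cluster name yet.
online_missing = (
    (toy['Meeting Type'] == 'online')
    & toy['Geographic Cluster Name'].isnull()
)

# Assign 'online' only to those rows; every other row is left unchanged.
toy.loc[online_missing, 'Geographic Cluster Name'] = 'online'

filled = toy['Geographic Cluster Name'].tolist()
remaining_missing = int(toy['Geographic Cluster Name'].isnull().sum())
```

On a large frame the mask version usually runs much faster than `apply(..., axis=1)`, because the comparison is evaluated column-wise rather than once per row.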

In [17]:
project_data['Geographic Cluster Name'].isnull().sum()

np.int64(121359)

Answer: there are now 121359 missing values, all for in-person programs.

Step 2: Inspect the neighborhood names in Geographic Cluster Name (in the program dataset) and in COMMUNITY (in the Chicago neighborhood boundaries dataset) to see whether they differ.

In [18]:
project_data_unique = project_data['Geographic Cluster Name'].unique()
chi_nei_unique = chi_nei['COMMUNITY'].unique()

# Find matches
matches = set(project_data_unique).intersection(chi_nei_unique)

# Find unmatched values
unmatched_project_data = set(project_data_unique) - matches
unmatched_chi_nei = set(chi_nei_unique) - matches

# Results
print(f"Matches: {matches}")
print(f"Unmatched in project_data: {unmatched_project_data}")
print(f"Unmatched in chi_nei: {unmatched_chi_nei}")

Matches: {'HYDE PARK', 'JEFFERSON PARK', 'NEAR SOUTH SIDE', 'MCKINLEY PARK', 'LINCOLN SQUARE', 'UPTOWN', 'HEGEWISCH', 'RIVERDALE', 'NEW CITY', 'NORTH CENTER', 'GREATER GRAND CROSSING', 'WEST LAWN', 'NORWOOD PARK', 'WASHINGTON PARK', 'MORGAN PARK', 'LAKE VIEW', 'MONTCLARE', 'LOOP', 'WASHINGTON HEIGHTS', 'MOUNT GREENWOOD', 'ENGLEWOOD', 'BRIDGEPORT', 'LOWER WEST SIDE', 'OHARE', 'AVALON PARK', 'FOREST GLEN', 'EAST SIDE', 'SOUTH DEERING', 'SOUTH LAWNDALE', 'DUNNING', 'NORTH LAWNDALE', 'NEAR NORTH SIDE', 'WEST ENGLEWOOD', 'AUBURN GRESHAM', 'CHICAGO LAWN', 'AUSTIN', 'PORTAGE PARK', 'AVONDALE', 'FULLER PARK', 'GRAND BOULEVARD', 'SOUTH SHORE', 'ROSELAND', 'EDISON PARK', 'DOUGLAS', 'IRVING PARK', 'CLEARING', 'SOUTH CHICAGO', 'ARCHER HEIGHTS', 'WEST PULLMAN', 'CALUMET HEIGHTS', 'BRIGHTON PARK', 'ROGERS PARK', 'NEAR WEST SIDE', 'GARFIELD RIDGE', 'KENWOOD', 'EAST GARFIELD PARK', 'CHATHAM', 'WEST RIDGE', 'LOGAN SQUARE', 'BURNSIDE', 'WOODLAWN', 'WEST GARFIELD PARK', 'HERMOSA', 'ASHBURN', 'ARMOUR SQUA

In [19]:
unmatched_geocluster_list = list(unmatched_project_data)

As shown above, Geographic Cluster Name in project_data contains more distinct names than COMMUNITY in chi_nei. Looking at the unmatched names in project_data, we can see that some programs use a 'zone' instead of a neighborhood name. Since we can't pinpoint how these zones correspond to neighborhoods, we also need to find the neighborhood name for programs that use an unmatched name.

To sum up, in the next step we try to impute the neighborhood name for entries that either have no geographic cluster name or have an unmatched one.
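To get a feel for how many rows each case covers, one option is to count the unmatched names with `isin` and `value_counts`. This is only a sketch on made-up data: the 'ZONE 1' and 'ZONE 2' labels and the `official` set are hypothetical stand-ins for the real unmatched names and the COMMUNITY values.

```python
import pandas as pd

# Hypothetical cluster names; None marks a missing value.
toy = pd.DataFrame({
    'Geographic Cluster Name': ['LOOP', 'ZONE 1', 'ZONE 1', None, 'UPTOWN', 'ZONE 2'],
})
official = {'LOOP', 'UPTOWN'}  # stand-in for the COMMUNITY names

names = toy['Geographic Cluster Name']

# Present but not an official neighborhood name.
unmatched_mask = names.notnull() & ~names.isin(official)

# How often each unmatched name occurs.
unmatched_counts = names[unmatched_mask].value_counts()

# Rows that need imputation: missing OR unmatched.
needs_imputation = names.isnull() | unmatched_mask
n_needs_imputation = int(needs_imputation.sum())
```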

#### Match method: use the latitude and longitude in project_data entries to find the corresponding neighborhood

We can see that project_data has the columns 'Latitude' and 'Longitude', which tell us where each program takes place.

We can also see that chi_nei has a column called 'the_geom', which holds multipolygons (sets of longitude-latitude pairs) that mark the boundaries of each neighborhood.

Thus, my first attempt to impute the missing geographic cluster data is to check whether a program's longitude-latitude point falls inside any the_geom multipolygon. If it does, the program is held in that neighborhood.
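As a minimal illustration of the point-in-polygon idea, the sketch below tests a few points against a made-up square. The coordinates are invented and do not correspond to any real neighborhood boundary; note that shapely expects (x, y), which here means (longitude, latitude).

```python
from shapely.geometry import Point, Polygon

# A made-up square "neighborhood" in (longitude, latitude) order.
square = Polygon([
    (-87.70, 41.80), (-87.60, 41.80), (-87.60, 41.90), (-87.70, 41.90),
])

inside = Point(-87.65, 41.85)
outside = Point(-87.50, 41.85)
on_edge = Point(-87.70, 41.85)

inside_result = square.contains(inside)    # point lies in the interior
outside_result = square.contains(outside)  # point lies outside the square

# contains() excludes points that lie exactly on the boundary,
# while covers() includes them.
edge_contains = square.contains(on_edge)
edge_covers = square.covers(on_edge)
```

A program located exactly on a shared boundary would therefore not be matched by `contains`; whether that matters depends on the data.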

Step 1: We extract the programs that have both a longitude and a latitude, and that either have no geographic cluster name or have one that is not a Chicago neighborhood name.

In [20]:
project_withlatlong=project_data.loc[
    ((project_data['Geographic Cluster Name'].isnull()) | (project_data['Geographic Cluster Name'].isin(unmatched_geocluster_list))) & 
    (project_data['Latitude'].notnull()) & 
    (project_data['Longitude'].notnull()),
    ['Program ID','Latitude','Longitude']
].copy()  # copy so that adding columns later does not raise SettingWithCopyWarning

Step 2: To check whether a longitude-latitude pair lies inside a boundary multipolygon, I decided to use the shapely library. So we need to convert the longitude and latitude into a shapely Point.

In [21]:
project_withlatlong['point_geom']=project_withlatlong.apply(
    lambda row: Point(row['Longitude'],row['Latitude']),axis=1
)

Step 3: We also need to convert the multipolygons in the Chicago neighborhood dataset into shapely geometries.

In [22]:
chi_nei['shapely_geom']=chi_nei['the_geom'].apply(wkt.loads)
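For reference, `wkt.loads` parses a Well-Known Text string into a shapely geometry. The sketch below uses an invented MULTIPOLYGON string with two small squares; it is not taken from the boundaries file.

```python
from shapely import wkt

# Invented WKT text: two separate squares in one multipolygon.
wkt_text = (
    'MULTIPOLYGON (((0 0, 2 0, 2 2, 0 2, 0 0)), '
    '((3 3, 4 3, 4 4, 3 4, 3 3)))'
)

geom = wkt.loads(wkt_text)

geom_type = geom.geom_type   # kind of geometry that was parsed
n_parts = len(geom.geoms)    # number of polygons inside the multipolygon
total_area = geom.area       # area of both squares combined (4 + 1)
```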

Step 4: We define a helper function that checks whether a program's longitude-latitude point is in any of the multipolygons.

In [23]:
def match_multiploygon(point,multipolygons):
    # Return the first multipolygon that contains the point, or None if none does
    for multipolygon in multipolygons:
        if multipolygon.contains(point):
            return multipolygon
    return None

Step 5: For each longitude-latitude pair, we check whether it is in any multipolygon (neighborhood).

In [24]:
project_withlatlong['shapely_geom']=project_withlatlong['point_geom'].apply(
    lambda point: match_multiploygon(point,chi_nei['shapely_geom'])
)

In [25]:
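# Both frames have a 'shapely_geom' column, so the left merge attaches each
# neighborhood's attributes (including COMMUNITY) to the programs matched to it
matched_program_neiname = pd.merge(project_withlatlong,chi_nei,how='left')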

In [26]:
attempt1_result = matched_program_neiname.loc[matched_program_neiname['COMMUNITY'].notnull(),['Program ID','COMMUNITY']]
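The steps above match on the geometry object and then merge to recover the name. A possible variant, sketched here on toy data, returns the neighborhood name directly and so skips the merge on geometries. The function name `find_community`, the 'AREA A' and 'AREA B' labels, and all coordinates are hypothetical.

```python
import pandas as pd
from shapely import wkt
from shapely.geometry import Point


def find_community(point, boundaries):
    """Return COMMUNITY for the first geometry containing point, else None."""
    for name, geom in zip(boundaries['COMMUNITY'], boundaries['shapely_geom']):
        if geom.contains(point):
            return name
    return None


# Toy boundaries: two adjacent unit squares (invented).
toy_nei = pd.DataFrame({
    'COMMUNITY': ['AREA A', 'AREA B'],
    'the_geom': [
        'MULTIPOLYGON (((0 0, 1 0, 1 1, 0 1, 0 0)))',
        'MULTIPOLYGON (((1 0, 2 0, 2 1, 1 1, 1 0)))',
    ],
})
toy_nei['shapely_geom'] = toy_nei['the_geom'].apply(wkt.loads)

# Toy programs: two fall inside a square, one falls outside both.
toy_programs = pd.DataFrame({
    'Program ID': [1, 2, 3],
    'Longitude': [0.5, 1.5, 5.0],
    'Latitude': [0.5, 0.5, 5.0],
})

toy_programs['COMMUNITY'] = [
    find_community(Point(lon, lat), toy_nei)
    for lon, lat in zip(toy_programs['Longitude'], toy_programs['Latitude'])
]

matched = toy_programs['COMMUNITY'].tolist()
```

Like the original helper, this is a linear scan over all neighborhoods for every point, which is workable for a few dozen boundaries but would benefit from a spatial index on larger boundary sets.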

In [31]:
project_data = pd.merge(project_data,attempt1_result,on='Program ID',how='left')

Step 6: We fill the missing geographic cluster names with the neighborhood names we found.

In [32]:
project_data.loc[
    project_data['Geographic Cluster Name'].isnull() & project_data['COMMUNITY'].notnull(),
    ['Program ID','Geographic Cluster Name','COMMUNITY']
]
project_data['Geographic Cluster Name'] = project_data['Geographic Cluster Name'].fillna(project_data['COMMUNITY'])
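One detail of `fillna` with a Series argument: it only fills rows where the target is null, aligned by index, and leaves existing values alone. The toy frame below (invented values, including the hypothetical 'ZONE 1' label) shows that a row with an existing name keeps it even when a COMMUNITY match is available.

```python
import pandas as pd

toy = pd.DataFrame({
    'Geographic Cluster Name': [None, 'ZONE 1', 'LOOP', None],
    'COMMUNITY': ['UPTOWN', 'AUSTIN', None, None],
})

# Null cluster names take the COMMUNITY value from the same row;
# non-null cluster names are not overwritten.
toy['Geographic Cluster Name'] = toy['Geographic Cluster Name'].fillna(toy['COMMUNITY'])

result = toy['Geographic Cluster Name'].tolist()
```

If the unmatched names should be replaced as well, they would need a separate assignment, since `fillna` does not touch them.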


In [33]:
project_data=project_data.drop(columns=['COMMUNITY'])

In [34]:
project_data['Geographic Cluster Name'].isnull().sum()

np.int64(996)

In this attempt, we brought the number of missing geographic cluster names down from 121359 to 996, imputing 120363 missing values.