# Marko Zlatic HopHacks Competition 2023 Python Code

**This Python code was created for the HopHacks Competition 2023.**

The Python code below is used to output a shapefile (.shp) file containing the resulting predictions of influenza infections for 180 countries from 2023 to 2028. The primary machine learning model used was the XGBoost regressor from the XGBoost API. Due to the robust data manipulation of the parent dataset, there was no hyperparameter tuning as the resulting R-Squared and RMSE values are considered ideal. Once when the production ready prediction shapefile is created, a supplementary Python script (mz_hophacks2023_upload_2_agol.py) was created to automate the process to upload and publish content in ArcGIS Online (AGOL).

The source dataset used to produce the predictions is from the World Heath Organization's (WHO) FluNet Database containing the records of several virus infections for severl countries from 1997 to 2023, found [here]("https://www.who.int/tools/flunet").

The primary feature layer enabling this web application can be found [here]("https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/MS_HopHacks_2023_Influenza_Global_Predictions_2023_to_2028/FeatureServer").

The file geodatabase (gdb) that contains the global country geopolitical boundaries used to generate the final shapefile is from Esri's ArcGIS Hub, found [here]("https://hub.arcgis.com/datasets/esri::world-countries-generalized/explore").

The web application rendering the results can be found in [GitHub]("https://github.com/mzlatic1/hophacks2023") for the source code and can also be viewed from the JSbin hosted link [here]("https://jsbin.com/diyihurola").

Author Info:<br/>
Name: Marko Zlatic<br/>
Date: September 17, 2023<br/>
Purpose: HopHacks 2023<br/>
Student Status: Graduate<br/>
Program: MSc. Geographic Information Systems<br/>
University: Johns Hopkins University


In [86]:
# load packages
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection, metrics
from xgboost import XGBRegressor
import geopandas as gpd
import pandas as pd
import numpy as np
import os
import math

In [70]:
# locate FluNet file and load data
folder = os.path.split(os.getcwd())[0]
datasets = os.path.join(folder, 'datasets')
csv_file = os.path.join(datasets, 'VIW_FNT.csv')

In [71]:
df = pd.read_csv(csv_file).fillna(0)

In [72]:
# manually encode country names (to be able to reverse transform once when results are produced)
country_names = {}
idx = 0
for c in list(df['COUNTRY_AREA_TERRITORY'].unique()):
    country_names[c] = idx
    idx += 1

In [73]:
# apply preprocessing computations
def preprocessing(input_df):
    drop_fields = []
    for col in list(input_df.columns):
        if 'date' in col.lower(): # apply sin transformation to make day and month values more cyclical
            input_df[col] = pd.to_datetime(input_df[col])
            input_df[col + '_DAY'] = input_df[col].apply(lambda row: np.sin(row.day / 365 * 2 * np.pi))
            input_df[col + '_MONTH'] = input_df[col].apply(lambda row: np.sin(row.year / 12 * 2 * np.pi))
            drop_fields.append(col)
        elif input_df[col].dtype == object:
            try:
                if col == 'COUNTRY_AREA_TERRITORY': # manually transform to store values for final export
                    input_df[col] = input_df[col].apply(lambda row: country_names[row])
                else:
                    input_df[col] = LabelEncoder().fit_transform(input_df[col])
            except:
                print('Didnt work with', col)
                drop_fields.append(col)
        elif input_df[col].dtype == int or input_df[col].dtype == float: # apply logarithm transformations due to extreme skews in many distributions
            if col != 'ISO_YEAR':
                log_med = np.log(np.median(input_df[col])) if np.median(input_df[col]) > 0 else 1
                input_df[col] = input_df[col].apply(lambda row: np.log(row) if row > 0 else log_med)
    input_df.drop(columns=drop_fields, axis=1, inplace=True)
    return input_df


In [74]:
df = preprocessing(df)

Didnt work with AOTHER_SUBTYPE_DETAILS
Didnt work with OTHER_RESPVIRUS_DETAILS
Didnt work with LAB_RESULT_COMMENT
Didnt work with WCR_COMMENT
Didnt work with ISO2


In [75]:
# prep labels and features
y_fields = ['INF_ALL', 'COUNTRY_AREA_TERRITORY', 'ISO_YEAR']
x_fields = [f for f in list(df.columns) if f not in y_fields]

In [76]:
x = df[x_fields]
y = df[y_fields]

In [77]:
# set up the training and testing subsets
xtrain, xtest, ytrain, ytest = model_selection.train_test_split(x, y, test_size=0.2, random_state=42)

In [78]:
# fit and train model
xg = XGBRegressor(n_jobs=-1)

In [79]:
xg.fit(xtrain, ytrain)
pred = xg.predict(xtest)

In [80]:
# output results
print('r squared', metrics.r2_score(pred, ytest))
print('rmse', metrics.mean_squared_error(ytest, pred, squared=True))
print('mse', metrics.mean_squared_error(ytest, pred, squared=False))

r squared 0.9997625584558998
rmse 0.455001706879856
mse 0.39338242739299917


In [81]:
# reverse transform and clean up predicted results
def finalize_and_export(pred_results):
    results = pd.DataFrame(pred_results, columns=y_fields)
    results['COUNTRY_AREA_TERRITORY'] = np.round(np.abs(results['COUNTRY_AREA_TERRITORY']))
    results['COUNTRY_AREA_TERRITORY'] = results['COUNTRY_AREA_TERRITORY'].apply(lambda row: list(country_names.keys())[list(country_names.values()).index(row)] if row in list(country_names.values()) else 'ERROR') # reverse transform country names back to original string value
    results['INF_ALL'] = results['INF_ALL'].apply(lambda row: math.floor(math.exp(row))) # reverse transform natural log
    results['ISO_YEAR'] = np.floor(results['ISO_YEAR'])
    results = results.groupby(y_fields[1:])[y_fields[0]].sum().to_frame().reset_index() # group by year and country name
    return results


In [82]:
grouped_pred = finalize_and_export(pred)

In [84]:
get_1999_after_index = df.query('ISO_YEAR > 2017').index.values.tolist() # frequency of data collection similar from 2017 to 2023
final_pred = finalize_and_export(xg.predict(x[x.index.isin(get_1999_after_index)])).astype({'ISO_YEAR': 'int32', 'INF_ALL': 'int32'})

In [103]:
# post processing
spat_ready_preds = final_pred.merge(df, on='COUNTRY_AREA_TERRITORY').rename(columns={"ISO2": "ISO"}) # rename to properly join with boundary feature class

In [87]:
country_boundary_gdb = os.path.join(datasets, 'f5c62d79-b3bb-440a-96cb-a5a8015a3fce.gdb') # load unzipped geodatabase

In [91]:
boundaries = gpd.read_file(country_boundary_gdb, driver='FileGDB', layer='World_Countries_Generalized')
boundaries.drop(columns=['COUNTRY', 'COUNTRYAFF', 'AFF_ISO', 'SHAPE_Length', 'SHAPE_Area'], axis=1, inplace=True) # remove unnecessary columns

In [105]:
spatial_preds = spat_ready_preds.merge(boundaries, on='ISO').astype({'ISO_YEAR': 'int32', 'INF_ALL': 'int32'})

In [106]:
spatial_preds = gpd.GeoDataFrame(spatial_preds, geometry=spatial_preds['geometry'], crs=boundaries.crs)

In [None]:
spatial_preds.to_file(os.path.join(datasets, 'mz_hophacks_2023_influenza_global_predictions.shp'))

In [None]:
# please review the mz_hophacks2023_upload_2_agol.py file to understand how to publish content to AGOL using the ArcGIS API for Python #
