# Marko Zlatic Urisa Digital Competition 2023 Python Code

**This Python code was created to uphold requirements for the Urisa Digital Competition 2023.**

The Python code below is used to output a csv file containing all the locations for predicted crime and their respected intervals from 2023 to 2033. XGBoost, as well as, Sklearn KMeans were the primary machine learning algorithms used to dictate the predicted type of crime and at what year the crime will occur. An XGBoost regressor was used to carry out the final computation and an R-Squared score dictated the accuracy of the model.

The source dataset used to produce the predictions is from the Part 1 Crime Data produced from the Baltimore Police Department found [here]("https://data.baltimorecity.gov/datasets/baltimore::part-1-crime-data/about").

The main feature layer powering this web application can be found [here]("https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/services/MZ_Urisa_Comp_Predicted_Crime_in_Baltimore/FeatureServer").

The web application hosting the results can be found in [GitHub]("https://github.com/mzlatic1/Urisa_Digital_Competition_2023/blob/gh-pages/mz_urisa_digital_comp23.html") for the source code and can also be viewed from the JSbin hosted link [here]("https://output.jsbin.com/fazozin").

Author Info:<br/>
Name: Marko Zlatic<br/>
Date: June 30, 2023<br/>
Purpose: Urisa Digital Competition 2023<br/>
Student Status: Graduate<br/>
Program: MSc. Geographic Information Systems<br/>
University: Johns Hopkins University


In [1]:
# Import necessary packages
from sklearn.preprocessing import FunctionTransformer, LabelEncoder
from sklearn import model_selection, metrics
from sklearn.cluster import KMeans
from xgboost import XGBRegressor
import pandas as pd
import numpy as np


**Step 1: Data Manipulation.** Before the supervised XGBoost regressor is able to accept the Part 1 Crime Dataset, there are some simple data manipulation procedures that will take place using Pandas and SKlearn. The primary data manipulation practices used are data subset, as well as, categorical and date transformations.

In [2]:
crime = r"C:\Users\marko\Part_1_Crime_Data.csv" # csv file containing crimes committed in Baltimore, MD

df_crime = pd.read_csv(crime, low_memory=False)

df_crime.drop(columns=['CCNO', 'GeoLocation', 'Total_Incidents'], axis=1, inplace=True) # remove unnecessary fields
df_crime['CrimeDateTime'] = pd.to_datetime(df_crime['CrimeDateTime'].replace('+00', ''), errors='coerce') # reformat date field

In [3]:
df_crime.head()

Unnamed: 0,CrimeDateTime,CrimeCode,Location,Description,Inside_Outside,Weapon,Post,Gender,Age,Race,Ethnicity,District,Neighborhood,Latitude,Longitude,Premise
0,2023-06-24 04:01:00+00:00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,15.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,
1,2023-06-24 04:01:00+00:00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,15.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,
2,2023-06-24 04:01:00+00:00,4B,600 LUCIA AVE,AGG. ASSAULT,,PERSONAL_WEAPONS,833.0,F,27.0,BLACK_OR_AFRICAN_AMERICAN,NOT_HISPANIC_OR_LATINO,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,
3,2023-06-24 04:01:00+00:00,3JK,600 LUCIA AVE,ROBBERY,,PERSONAL_WEAPONS,833.0,M,25.0,BLACK_OR_AFRICAN_AMERICAN,UNKNOWN,SOUTHWEST,YALE HEIGHTS,39.273302,-76.692439,
4,2023-06-24 03:45:00+00:00,5A,3200 LILY AVE,BURGLARY,,,922.0,M,48.0,,HISPANIC_OR_LATINO,SOUTHERN,CHERRY HILL,39.246432,-76.636819,


In [6]:
# Apply categorical transformations to text fields
# (this code effectively converts text categories to whole numbers)
description_dict = {}
num = 1
for col in list(df_crime.columns):
    if int(df_crime[[col]].isnull().sum()) > 0:
        if df_crime[col].dtype == object and col:
            df_crime[col].fillna('Unknown', inplace=True)
            df_crime[col] = LabelEncoder().fit_transform(df_crime[col]) + 1
    elif df_crime[col].dtype == object and col:
        if col != 'Description':
            df_crime[col] = LabelEncoder().fit_transform(df_crime[col]) + 1
        else:
            for description in list(df_crime[col].unique()):
                description_dict[description] = num
                num += 1
            df_crime[col] = df_crime[col].apply(lambda row: description_dict[row])

df_crime.dropna(axis=0, inplace=True) # remove remaining null values
df_crime.query('Longitude != 0 and Latitude != 0', inplace=True) # remove null geometries
df_crime = df_crime[df_crime['CrimeDateTime'].dt.year > 2012] # subset data after 2012

In [7]:
df_crime.info() # All fields are now an integer or float

<class 'pandas.core.frame.DataFrame'>
Int64Index: 447356 entries, 0 to 568613
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   CrimeDateTime   447356 non-null  datetime64[ns, UTC]
 1   CrimeCode       447356 non-null  int32              
 2   Location        447356 non-null  int32              
 3   Description     447356 non-null  int64              
 4   Inside_Outside  447356 non-null  int32              
 5   Weapon          447356 non-null  int32              
 6   Post            447356 non-null  float64            
 7   Gender          447356 non-null  int32              
 8   Age             447356 non-null  float64            
 9   Race            447356 non-null  int32              
 10  Ethnicity       447356 non-null  int32              
 11  District        447356 non-null  int32              
 12  Neighborhood    447356 non-null  int32              
 13  Latitude      

Before the code is able to continue, export the recently modified dataframe and use the Esri arcpy XY Table to Point Geoprocessing tool followed by the Density-based Clustering Geoprocessing tool. The message output produced from the Density-based Clustering tool contains the number of clusters, this will be the number of clusters used for the KMeans algorithm. The code is as follows:<br/>```
    arcpy.stats.DensityBasedClustering(
        in_features="Part_1_Crime_Data_with_TRANSFORMATIONS_XYTableToPoint",
        output_features=r"C:\Users\marko\Urisa_Digital_Comp23.gdb\Part_1_Crime_Data_with_TRANSFORMATIONS_XYTableToPoint_DensityBasedClustering",
        cluster_method="HDBSCAN",
        min_features_cluster=50,
        search_distance=None,
        cluster_sensitivity=None,
        time_field=None,
        search_time_interval=None
    )```

In [8]:
df_crime.to_csv(r"C:\Users\marko\Part_1_Crime_Data_with_TRANSFORMATIONS.csv")

In [9]:
df_crime['YEAR'] = df_crime['CrimeDateTime'].dt.year

The cosine_transformer() function was derived from a blog post created from NVIDIA's Eryk Lewinson found here: [https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/ Retrieved June 30, 2023]("https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/").

In [10]:
# Month and Day Transformations
def cosine_transformation(period):
    return FunctionTransformer(lambda row: np.cos(row / period * 2 * np.pi))

df_crime['MONTH'] = cosine_transformation(12).fit_transform(df_crime['CrimeDateTime'].dt.month)
df_crime['DAY'] = cosine_transformation(365).fit_transform(df_crime['CrimeDateTime'].dt.day)


In [11]:
df_crime.head() # Visualizing final version of dataframe before step 2.

Unnamed: 0,CrimeDateTime,CrimeCode,Location,Description,Inside_Outside,Weapon,Post,Gender,Age,Race,Ethnicity,District,Neighborhood,Latitude,Longitude,Premise,YEAR,MONTH,DAY
0,2023-06-24 04:01:00+00:00,45,19139,1,5,18,833.0,14,15.0,3,4,8,278,39.273302,-76.692439,157,2023,-1.0,0.915864
1,2023-06-24 04:01:00+00:00,45,19139,1,5,18,833.0,14,15.0,3,4,8,278,39.273302,-76.692439,157,2023,-1.0,0.915864
2,2023-06-24 04:01:00+00:00,45,19139,1,5,18,833.0,14,27.0,3,4,8,278,39.273302,-76.692439,157,2023,-1.0,0.915864
3,2023-06-24 04:01:00+00:00,31,19139,2,5,18,833.0,21,25.0,3,6,8,278,39.273302,-76.692439,157,2023,-1.0,0.915864
4,2023-06-24 03:45:00+00:00,49,11360,3,5,23,922.0,21,48.0,6,2,7,46,39.246432,-76.636819,157,2023,-1.0,0.915864


**Step 2: Running the Model.** Now that the data manipulation process is complete. It is now time to run the supervised machine learning model. To start, the KMeans algorithm is used (where k is equal to the number of clusters produced from the Esri arcpy Density-based Clustering geoprocessing tool) to predict the cluster id associated with each row; this outputs an array with cluster id's (in the same order as the input dataframe). The array of cluster id's is then joined back to the primary input dataframe. The dataframe is then split into testing and training subsets. The training subset is then run through the XGBoost regressor algorithm using the .fit() function. The same XGBoost regressor model is then run through the .predict() function where an array of predicted values is produced. The array of predicted values is then compared to the test subset to determine the R^2 (R-Squared) value; ultimately determine the fit of the model.

In [13]:
k = 2549 # number retrieved from arcpy density-based clustering tool
kmeans = KMeans(k, n_init='auto').fit_predict(df_crime[['YEAR', 'Description', 'Longitude', 'Latitude']])

In [14]:
clusters = pd.DataFrame(kmeans, columns=['cluster']).reset_index().rename(columns={'index': 'join_field'})
df_crime = df_crime.reset_index().rename(columns={'index': 'join_field'}).merge(clusters, on='join_field').drop(columns='join_field', axis=1)

In [15]:
df_crime.drop(columns=['CrimeDateTime'], axis=1, inplace=True) # Date field is no longer needed

In [16]:
y_fields = ['YEAR', 'Description', 'Longitude', 'Latitude']
x_fields = [c for c in list(df_crime.columns) if c not in y_fields]
x = df_crime[x_fields]
y = df_crime[y_fields]

In [17]:
xtrain, xtest, ytrain, ytest = model_selection.train_test_split(x, y, test_size=0.2, random_state=42)

In [18]:
params_xg = {
    'n_estimators': 500,
    'n_jobs': -1,
    'gpu_id': 0,
    'predictor': 'gpu_predictor',
    'verbosity': 1,
    'random_state': 42,
}
xg_model = XGBRegressor(**params_xg).fit(xtrain, ytrain)

In [19]:
ypred_xg = xg_model.predict(xtest)

In [20]:
r2_xg = metrics.r2_score(ypred_xg, ytest)
r2_xg

0.9678233969792526

**Step 3: Final Processing and Export Results.** The KMeans-XGBoost supervised model was able to produce a predicted model whose R-Squared value is 0.97. These results would be considered robust. The .predict() function is now used for the entire modified Part 1 Crime dataframe to produce a complete prediction of crime. Since the Description column was a text column, it was transformed in Step 1 and currently contains integers. The integers are then re-translated back to their original values and the Year column was rounded up using the numpy.ceil() function. Finally, exporting the results as a csv file; where the data is then imported into ArcGIS Pro and published as a feature layer.

In [21]:
export_pred = xg_model.predict(x)

In [22]:
df_ypred = pd.DataFrame(export_pred, columns=[y_fields])
df_ypred['Description'] = df_ypred['Description'].round()
for index, row in df_ypred.iterrows():
    idx_4_dict = int(row['Description'])
    row['Description_Val'] = list(description_dict.keys())[list(description_dict.values()).index(idx_4_dict)]
df_ypred['YEAR'] = np.ceil(df_ypred['YEAR'])
df_ypred.to_csv(r'C:\Users\marko\predicted_values.csv')
