<a href="https://colab.research.google.com/github/jygre51/AI_Two_Datasets_Assignment/blob/main/Assignment_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using Measures of economic freedom as a predictor for financially motivated crimes**


# Introduction

The advent of big data as a concept has greatly influenced the way in which businesses, politicians, and everyday people think about information and causality. With the computing power available since the introduction of transistor computing and the techniques bought forward by various machine learning algorithms, tenuous causal links can be fleshed out and backed with data in a timeframe that was previously impossible.

The following analysis explores the use of artificial neural networks in an attempt to establish one such tenuous causal link. Specifically, it aims to establish the efficacy of economic freedom indicators as a predictor for financially motivated crimes.

# Gathering Data

As with any machine learning project, data is king. In order to gain any meaningful insights from the data, it is important to have a repository of data that is large and of a high standard of quality. For this reason, the data used in this analysis has been selected for the quantity and relative reliability of data available.

Data for financially motivated crime occurrence has been taken from the UNODC's website[1]. It is keyed by country, detailing the occurence per 100,000 of various types of crimes in the year 2019. Three types of crime in particular have been pinpointed (theft, burglary, and fraud).

Economic freedom data has been taken from the Heritage Foundation's website[2]. It is keyed by country and details the 'score' for each country across 13 different economic freedom indicators for the year 2019.

The first step in implementing the model is importing and cleaning the data. Python offers many useful libraries for managing and manipulating data. Here, the pandas library is used in order to take advantage of its powerful DataFrame class:

In [None]:
import pandas as pd

# Import separate data files
freedom_data = pd.read_csv('freedom_scores_2019.csv')
crime_data = pd.read_csv('UN_ODC_Crime_Data.csv')

With the data imported, it is time to clean up. Since the data is keyed by country and for the same year, there is no need to reformat the dataframes before merging.

We can use pandas' built in merge tool to inner join the two dataframes. This removes occurences of countries that do not appear in both tables. After this, we can prepare for the next step of data preparation. Some country/attribute values are not available and have been left as either 0 or Null. In order to replace these empty values with some estimated value (using sklearn.impute.SimpleImputer), each occurence of missing data must be represented as null instead of 0. Fortunately, DataFrame's built-in replace function defaults to replacing chosen values with null when a replacement value is not specified:

In [None]:
# Merge by country and clean data (replace null values with median for feature)
merged_data = pd.merge(freedom_data, crime_data, on='Country', how='inner')
crime_and_freedom_nums = merged_data.drop("Country", axis=1).replace(0)

Having all empty values represented by null allows us to use sklearn's SimpleImputer class to replace all empty occurences with an estimated value. In this case, using SimpleImputer's 'median' strategy, an interpolative estimate of the median in each column is calculated and all empty cells are replaced with the median respective to their columns.

>It is important to note here that the two csv files containing the economic freedom indicators have been somewhat pre-prepared, with rows containing more than 50% empty cells having been removed. This prevents the Imputer from creating many rows with very similar data which would impair the neural network's ability to make reasonable estimates.

Using the Imputer is a simple matter of initialising the imputer instance (spcifying the desired strategy in the process), fitting the imputer instance to the given data, and transforming the data with SimpleImputer's transform function. Since the transform produces a series of data in a similar - but not identical - format to a DataFrame, it is necessary to convert the Imputer output back into a dataframe:



In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(crime_and_freedom_nums)
X = imputer.transform(crime_and_freedom_nums)

clean_data = pd.DataFrame(X, columns=crime_and_freedom_nums.columns)

In order for the neural network to make meaningful conclusions about the causal link between the predictors and target values, it is necessary to use a scaling method to ensure that the predictors and target values are on a similar scale. Recall that the economic freedom indicators are measured as a score out of 100 whilst the crime values are measured by occurence per 100,000. This discrepancy calls for scaling of some sort.

The two most popular scaling methods are 'normalisation' and 'standardisation'. Both methods scale the dataset in such a way that the mean of the values of each feature is 0. They differ in that normalisation ensures the values are restricted to a [-1,1] range whilst standardisation does not restrict the range of values.

This analysis employs a standardisation scaler (sklearn.preprocessing.StandardScaler) since sklearn's StandardScaler will standardise each column separately, whilst its Normalizer class will normalize over all values in the DataFrame at the same time.

Since the economic freedom data and crime data have come from different sources, it is important to separate them before applying standardisation measures. Note that fmc_measures only takes value for burglary, theft, and fraud, despite more measures being available. The rest are thrown away since they are not pertinent to the question at hand.

Using the StandardScaler is similar to using the imputer. First, a StandardScaler is instantiated for each different datasource. Each scaler is then fit to its respective sets of data. Finally, each StandardScaler's transform function is applied to its respective datasource to produce the scaled data:

In [None]:
from sklearn.preprocessing import StandardScaler

# Identify data to be used in neural net from each datasource
freedom_measures = ['Overall Score', 'Property Rights', 'Judicial Effectiveness', 'Government Integrity', 'Tax Burden', 'Government Spending', 'Fiscal Health', 'Business Freedom', 'Labor Freedom', 'Monetary Freedom', 'Trade Freedom', 'Investment Freedom', 'Financial Freedom']
fmc_measures = ['Burglary', 'Theft', 'Fraud']

# Fetch relevant data from clean_data table
X = clean_data[freedom_measures].values
Y = clean_data[fmc_measures].values

# Standardization of data
predictor_scaler = StandardScaler()
target_var_scaler = StandardScaler()

# Fit to respective datasets and store the fit objects to be used later
predictor_scaler_fit = predictor_scaler.fit(X)
target_var_scaler_fit = target_var_scaler.fit(Y)

# Apply transform to respective datasources
X = predictor_scaler_fit.transform(X)
Y = target_var_scaler_fit.transform(Y)

With the data imported and cleaned, it's time to start thinking about putting it into a machine learning model. Whilst this analysis is mostly explorative, and

In [None]:
import numpy as np


from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense










# Split the data into training and testing set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=47)


# Quick sanity check with the shapes of Training and testing datasets
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

# create ANN model
model = Sequential()

# Defining the Input layer and FIRST hidden layer, both are same!
model.add(Dense(units=10, input_dim=13, kernel_initializer='normal', activation='relu'))

# Defining the Second layer of the model
# after the first layer we don't have to specify input_dim as keras configure it automatically
model.add(Dense(units=10, kernel_initializer='normal', activation='relu'))

# Output will be 3 fully connected nodes
model.add(Dense(3, kernel_initializer='normal'))

# Compiling the model
model.compile(loss='mean_squared_error', optimizer='adam')

# Fitting the ANN to the Training set (batch size is 3 to adjust for small dataset)
# model.fit(X_train, Y_train ,batch_size = 3, epochs = 100, verbose=0)


Find best hyperparameters:

- manual grid search cross validation

In [None]:
np.set_printoptions(suppress=True)

def find_best_params(X_train, Y_train, X_test, Y_test):

    # Defining the list of hyper parameters to try
    batch_size_list = [4, 5, 6]
    epoch_list = [4, 5, 6, 7, 8]
    neuron_count_list = [5, 10, 15, 20]

    search_results_data = pd.DataFrame(columns=['trial_number', 'parameters', 'accuracy'])

    # initializing the trials
    trial_number=0
    for batch_size_trial in batch_size_list:
        for epochs_trial in epoch_list:
          for neuron_count in neuron_count_list:
              trial_number += 1
              # create ANN model
              model = Sequential()
              # Defining the first layer of the model
              model.add(Dense(units=neuron_count, input_dim=X_train.shape[1], kernel_initializer='normal', activation='relu'))

              # Defining the Second layer of the model
              model.add(Dense(units=neuron_count, kernel_initializer='normal', activation='relu'))

              # Output will be 3 fully connected nodes
              model.add(Dense(3, kernel_initializer='normal'))

              # Compiling the model
              model.compile(loss='mean_squared_error', optimizer='adam')

              # Fitting the ANN to the Training set
              model.fit(X_train, Y_train ,batch_size = batch_size_trial, epochs = epochs_trial, verbose=0)

              MAPE = np.mean(100 * (np.abs(Y_test - model.predict(X_test)/Y_test)))

              # printing the results of the current iteration
              print(trial_number, 'Parameters:','batch_size:', batch_size_trial,'-', 'epochs:',epochs_trial, 'neurons:',neuron_count, 'Accuracy:', 100-MAPE)

              search_results_data=search_results_data.append(pd.DataFrame(data=[[trial_number, str(batch_size_trial)+'-'+str(epochs_trial)+'-'+str(neuron_count), 100-MAPE]],
                                                                      columns=['trial_number', 'parameters', 'accuracy'] ))
    return(search_results_data)

# Calling the function
hyperparameter_tuning_results = find_best_params(X_train, Y_train, X_test, Y_test)

In [None]:
print(hyperparameter_tuning_results)

this reveals that 6-4-10 are the optimal hyperparameters.

train model on optimal parameters:

In [17]:
# Fitting the ANN to the Training set
model.fit(X_train, Y_train ,batch_size = 6, epochs = 4, verbose=0)

# Generating predictions on testing data
predictions=model.predict(X_test)

# Scaling the predicted Price data back to original price scale
predictions=target_var_scaler_fit.inverse_transform(predictions)

# Scaling the y_test Price data back to original price scale
y_test_orig=target_var_scaler_fit.inverse_transform(Y_test)

# Scaling the test data back to original scale
Test_Data=predictor_scaler_fit.inverse_transform(X_test)

TestingData=pd.DataFrame(data=Test_Data, columns=freedom_measures)
TestingData['Burglary']=y_test_orig[:,0]
TestingData['Theft']=y_test_orig[:,1]
TestingData['Fraud']=y_test_orig[:,2]
TestingData['pred_Burglary']=predictions[:,0]
TestingData['pred_Theft']=predictions[:,1]
TestingData['pred_Fraud']=predictions[:,2]
TestingData.head()



Unnamed: 0,Overall Score,Property Rights,Judicial Effectiveness,Government Integrity,Tax Burden,Government Spending,Fiscal Health,Business Freedom,Labor Freedom,Monetary Freedom,Trade Freedom,Investment Freedom,Financial Freedom,Burglary,Theft,Fraud,pred_Burglary,pred_Theft,pred_Fraud
0,68.6,66.7,51.9,39.8,89.7,69.0,89.3,63.1,64.5,82.7,86.0,70.0,50.0,149.271077,491.415505,58.086854,279.32132,733.059387,179.191315
1,62.2,71.7,49.8,43.7,55.6,26.5,71.3,71.7,51.1,84.0,86.0,85.0,50.0,276.80349,1517.802425,406.878309,281.801117,740.568481,180.123184
2,61.9,40.2,37.9,30.2,84.3,46.1,96.6,49.7,67.0,83.1,82.6,65.0,60.0,246.849322,529.84911,82.520112,276.032135,720.478394,180.606323
3,51.1,39.5,26.6,18.2,91.8,75.6,96.9,47.9,46.5,78.1,79.0,60.0,50.0,246.849322,529.84911,82.520112,271.981873,704.249573,182.417252
4,74.2,73.6,61.2,47.8,86.4,65.1,97.3,75.2,63.6,84.6,86.0,80.0,70.0,57.59748,382.649435,104.630156,280.581116,737.381775,179.045074


In [18]:
# Computing the absolute percent error
APE_burglary = 100*(abs(TestingData['Burglary']-TestingData['pred_Burglary'])/TestingData['Burglary'])
TestingData['Burglary_APE'] = APE_burglary

APE_theft = 100*(abs(TestingData['Theft']-TestingData['pred_Theft'])/TestingData['Theft'])
TestingData['Theft_APE'] = APE_theft

APE_fraud = 100*(abs(TestingData['Fraud']-TestingData['pred_Fraud'])/TestingData['Fraud'])
TestingData['Fraud_APE'] = APE_fraud

print('Accuracy in predicting Burglary:', 100-np.mean(APE_burglary))
print('Accuracy in predicting Theft:', 100-np.mean(APE_theft))
print('Accuracy in predicting Fraud:', 100-np.mean(APE_fraud))
TestingData.head()

Accuracy in predicting Burglary: 13.589247795577023
Accuracy in predicting Theft: -0.4787117535885983
Accuracy in predicting Fraud: -53.36382843789525


Unnamed: 0,Overall Score,Property Rights,Judicial Effectiveness,Government Integrity,Tax Burden,Government Spending,Fiscal Health,Business Freedom,Labor Freedom,Monetary Freedom,...,Financial Freedom,Burglary,Theft,Fraud,pred_Burglary,pred_Theft,pred_Fraud,Burglary_APE,Theft_APE,Fraud_APE
0,68.6,66.7,51.9,39.8,89.7,69.0,89.3,63.1,64.5,82.7,...,50.0,149.271077,491.415505,58.086854,279.32132,733.059387,179.191315,87.123538,49.173028,208.488585
1,62.2,71.7,49.8,43.7,55.6,26.5,71.3,71.7,51.1,84.0,...,50.0,276.80349,1517.802425,406.878309,281.801117,740.568481,180.123184,1.805478,51.207847,55.730453
2,61.9,40.2,37.9,30.2,84.3,46.1,96.6,49.7,67.0,83.1,...,60.0,246.849322,529.84911,82.520112,276.032135,720.478394,180.606323,11.822116,35.978032,118.863399
3,51.1,39.5,26.6,18.2,91.8,75.6,96.9,47.9,46.5,78.1,...,50.0,246.849322,529.84911,82.520112,271.981873,704.249573,182.417252,10.181333,32.915119,121.057929
4,74.2,73.6,61.2,47.8,86.4,65.1,97.3,75.2,63.6,84.6,...,70.0,57.59748,382.649435,104.630156,280.581116,737.381775,179.045074,387.141308,92.704263,71.121864


reflection notes:

- could have found multiple sources for crime data and collated to avoid using Imput to estimate values
- could use model on multiple years' worth of data and include year, country as predictors

# **References**



1.   The Heritage Foundation (2023). Economic Data and Statistics on World Economy and Economic Freedom. [online] Heritage.org. Available at: https://www.heritage.org/index/explore.


2.   United Nations (2018). Statistics and Data | UNODC. [online] Un.org. Available at: https://dataunodc.un.org/.

