# I. Introduction

The purpose of this project is to test the claim that **IQ level** affects many aspects of human life, in particular the **crime rate**.



# II. Short Information

I took the 2021 IQ data from the **World Population Review** website, which publishes a wide range of statistics about humanity: https://worldpopulationreview.com/country-rankings/average-iq-by-country

I also took the **2021 Crime Index statistics** by country: https://www.numbeo.com/crime/rankings_by_country.jsp

Based on the data from these two sources, I created a CSV file with the IQ and the crime index of each country.

Note: many countries are missing from the 2021 crime index statistics. To keep the statistics accurate, I did not predict the missing values; instead, I took the crime indexes of these countries from previous years (I don't think a country's crime ranking changes dramatically within a couple of years). That data was taken from https://numbeo.com/crime.
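
A minimal sketch of this fallback rule, assuming the yearly crime indexes are available as pandas Series indexed by country. The country names and numbers below are hypothetical placeholders, not Numbeo data:

```python
import pandas as pd

nan = float("nan")

# Hypothetical values for illustration only (not real Numbeo data).
crime_2021 = pd.Series({"country a": 40.0, "country b": nan, "country c": nan})
crime_2020 = pd.Series({"country a": 42.0, "country b": 55.0, "country c": nan})
crime_2019 = pd.Series({"country a": 41.0, "country b": 50.0, "country c": 30.0})

# Prefer 2021; where it is missing, fall back to the most recent earlier year.
crime_index = crime_2021.fillna(crime_2020).fillna(crime_2019)
print(crime_index)
```

`fillna` with another Series matches values by index label, so each country only receives an older value when its newer one is missing.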

I uploaded the CSV file to a **GitHub** repository for easy access to the data: https://github.com/h0pped/SADISM-project-1

Here is a direct link to the **CSV file**: https://raw.githubusercontent.com/h0pped/SADISM-project-1/main/stat.csv

---

Some of the source code was inspired by the lab .ipynb notes from our lessons. Information about how things work in Python (especially the pandas and sklearn libraries) was taken from the official documentation of these libraries:

*   https://byes.pl/systemy/
*   https://pandas.pydata.org/docs/
*   https://scikit-learn.org/0.21/documentation.html



# III. Acquire the data

In [None]:
import urllib.request
import pandas as pd

# Download the CSV file from the GitHub repository into the working directory
filename = 'stat.csv'
url = "https://raw.githubusercontent.com/h0pped/SADISM-project-1/main/"+filename
urllib.request.urlretrieve(url, filename)

# The first column (country name) is used as the index
data = pd.read_csv(filename, index_col=[0])
data.sample(15)


Country,IQ,Crime index
uzbekistan,87,34.7
qatar,78,12.29
netherlands,100,27.22
djibouti,68,57.53
myanmar,87,47.17
lesotho,67,61.44
austria,100,25.23
czech republic,98,25.31
cape verde,76,56.09
hong kong,108,21.73


# IV. Organizing the dataset

To work with the data, we need the **average crime index** for each **IQ value**.

**But there is a problem with the data:**


*   If you check the **.csv** file, you will see that some **IQ** values belong to only one country (for example, Equatorial Guinea or Gambia). An average based on a single country does not show the whole "picture" for that IQ value.

**Possible Fix:**


*   Consider only **IQ** values that are shared by at least 2 (or, even better, 3) countries, so that the **average crime index** for each **IQ** value is calculated more reliably.




In [None]:
import numpy as np

# Group the crime indexes of all countries by their IQ value
dictionary = dict()
avg = dict()
for index,row in data.iterrows():
  dictionary.setdefault(row['IQ'], []).append(row['Crime index'])

# Average only the IQ values that are represented by at least 3 countries
for x in dictionary:
  if len(dictionary[x]) >= 3: # filter
    avg[x] = round(sum(dictionary[x])/len(dictionary[x]), 1)


df = pd.DataFrame({"IQ":list(avg.keys()),"AVG Crime index":list(avg.values())})
display(df)


,IQ,AVG Crime index
0,64.0,59.9
1,67.0,57.7
2,68.0,55.0
3,69.0,54.7
4,70.0,41.0
5,71.0,57.8
6,76.0,52.8
7,79.0,47.5
8,80.0,53.1
9,81.0,56.5
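
The same aggregation can also be written with pandas `groupby` instead of a manual loop. The sketch below is only an illustration on hypothetical toy data (not the real stat.csv); the column names match the ones used above, and `MIN_COUNTRIES` is a name introduced here for the threshold:

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the real stat.csv).
toy = pd.DataFrame(
    {
        "IQ": [87, 87, 87, 100, 100, 68],
        "Crime index": [30.0, 40.0, 50.0, 25.0, 27.0, 57.5],
    },
    index=["a", "b", "c", "d", "e", "f"],
)

MIN_COUNTRIES = 3  # same threshold as the filter in the cell above

# Mean crime index and number of countries for every IQ value.
stats = toy.groupby("IQ")["Crime index"].agg(["mean", "count"])

# Keep only IQ values backed by enough countries, then round to one decimal.
result = (
    stats.loc[stats["count"] >= MIN_COUNTRIES, "mean"]
    .round(1)
    .rename("AVG Crime index")
    .reset_index()
)
print(result)
```

In the toy data only IQ 87 has three countries, so it is the only row that survives the filter.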


Extract data:




In [None]:
x = df['IQ'].values
y = df['AVG Crime index'].values

print(x)
print(y)

[ 64.  67.  68.  69.  70.  71.  76.  79.  80.  81.  82.  83.  84.  85.
  86.  87.  88.  89.  90.  91.  92.  94.  96.  97.  98.  99. 100. 101.]
[59.9 57.7 55.  54.7 41.  57.8 52.8 47.5 53.1 56.5 55.8 50.3 48.6 53.2
 44.1 46.5 50.1 46.1 42.8 43.9 49.  34.9 37.3 46.9 35.3 35.4 32.6 33.7]


# V. Machine Learning

The purpose of the models is to **predict the crime index** from the **IQ level** across the whole observed IQ range.


In [None]:
# =================== Linear model ===================
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_model = LinearRegression()
linear_model.fit(x.reshape(-1,1), y)
print(f'Linear model params: {linear_model.coef_}, {linear_model.intercept_}')
mean_lin_err = mean_squared_error(y, linear_model.predict(x.reshape(-1,1)))

# =================== Support Vector Machine (Regression) ===================
from sklearn.svm import SVR

svr_model = SVR()
svr_model.fit(x.reshape(-1,1), y)
mean_svr_model = mean_squared_error(y,svr_model.predict(x.reshape(-1,1)))

# =================== General Linear Model (GLM) ===================
# Here: linear regression fitted on polynomial features of x (degree 8)
from sklearn.preprocessing import PolynomialFeatures

GLM_degree = 8
GLM_model = LinearRegression()
preX = PolynomialFeatures(degree=GLM_degree, 
                          include_bias=True, 
                          interaction_only=False)
Xpreprocessed = preX.fit_transform(x.reshape(-1,1))
GLM_model.fit(Xpreprocessed, y)
mean_glm = mean_squared_error(y,GLM_model.predict(Xpreprocessed))

print(f'GLM model params: {GLM_model.coef_}, {GLM_model.intercept_}')

# ========== models predictions for the entire x axis ===================

x_axis = np.linspace(start=x.min(), stop=x.max(), num=300)
y_lin_pred_axis = linear_model.predict(x_axis.reshape(-1,1))
# the transformer is already fitted, so new points only need transform()
Xpreprocessed = preX.transform(x_axis.reshape(-1,1))
y_GLM_pred_axis = GLM_model.predict(Xpreprocessed)
y_svr_pred_axis = svr_model.predict(x_axis.reshape(-1,1))

print("MSE Linear: ",mean_lin_err)
print("MSE SVR: ",mean_svr_model)
print("MSE GLM: ",mean_glm)

Linear model params: [-0.59557244], 97.57928460032517
GLM model params: [ 0.00000000e+00  3.27360725e-02  1.94890841e-02  3.96851472e-01
 -1.94011681e-02  3.99350840e-04 -4.23241716e-06  2.28296603e-08
 -4.98062633e-11], -2892.7219674134226
MSE Linear:  23.11451230649873
MSE SVR:  32.474541770153714
MSE GLM:  16.17281651668347


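I am partially satisfied with the results: the GLM has the smallest mean squared error (about 16.2, compared with about 23.1 for the linear model and about 32.5 for SVR). Note that these errors are measured on the same data the models were fitted on.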
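
One way to check how the three models behave on data they have not seen is cross-validation. The sketch below was not part of the original experiment: it reuses the averaged values printed in the "Extract data" step and assumes leave-one-out cross-validation from scikit-learn as the evaluation scheme. No result is claimed here; the numbers depend on running it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Averaged data as printed in the "Extract data" step.
x = np.array([64, 67, 68, 69, 70, 71, 76, 79, 80, 81, 82, 83, 84, 85,
              86, 87, 88, 89, 90, 91, 92, 94, 96, 97, 98, 99, 100, 101], dtype=float)
y = np.array([59.9, 57.7, 55.0, 54.7, 41.0, 57.8, 52.8, 47.5, 53.1, 56.5,
              55.8, 50.3, 48.6, 53.2, 44.1, 46.5, 50.1, 46.1, 42.8, 43.9,
              49.0, 34.9, 37.3, 46.9, 35.3, 35.4, 32.6, 33.7])
X = x.reshape(-1, 1)

models = {
    "Linear": LinearRegression(),
    "SVR": SVR(),
    # Pipeline so the polynomial features are rebuilt inside every fold.
    "GLM (degree 8)": make_pipeline(PolynomialFeatures(degree=8), LinearRegression()),
}

cv_mse = {}
for name, model in models.items():
    # Each fold holds out one IQ value and fits the model on the rest.
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"Leave-one-out MSE {name}: {cv_mse[name]:.2f}")
```

If the degree 8 model's held-out error turns out much larger than its training error, that would suggest it is fitting noise in the 28 points rather than a real trend.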

# VI. Conclusion

In [1]:
import matplotlib.pyplot as plt

# =========== Visualize the data and models
plt.figure(figsize=(15,3))
plt.scatter(x,y, label='data')
plt.plot(x_axis, y_lin_pred_axis, color='tab:orange', label='Linear model')
plt.plot(x_axis, y_svr_pred_axis, color='tab:green', label='SVR model')
plt.plot(x_axis, y_GLM_pred_axis, color='tab:red', label='GLM model')
plt.xlabel(df.columns[0], fontsize=15)
plt.ylabel(df.columns[1], fontsize=15)
plt.legend()
plt.ylim([y.min()-1, y.max()+2])
plt.show()


*  Looking at the linear model, we can clearly see that the crime index decreases as the IQ level increases.

* The predictions would be more accurate with more data, but I think this is enough to see the correlation between these two measurements.

* Of course, checking the IQ level alone is not enough, because many other factors may influence the crime level, such as the standard of living, salary levels, etc. However, if you look at the input data, you will notice that most of the low-IQ entries come from countries with low social and economic development.
* Hence, I consider this exploration very "raw", but it can be used in future research on this topic.
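
The correlation mentioned in the list above can be put into a number. As a small sketch (an addition for illustration, using the averaged values printed in the "Extract data" step), the Pearson correlation coefficient and the least-squares line can be computed with NumPy; the line should reproduce the linear model parameters printed in Section V (slope about -0.596, intercept about 97.58):

```python
import numpy as np

# Averaged data as printed in the "Extract data" step.
x = np.array([64, 67, 68, 69, 70, 71, 76, 79, 80, 81, 82, 83, 84, 85,
              86, 87, 88, 89, 90, 91, 92, 94, 96, 97, 98, 99, 100, 101], dtype=float)
y = np.array([59.9, 57.7, 55.0, 54.7, 41.0, 57.8, 52.8, 47.5, 53.1, 56.5,
              55.8, 50.3, 48.6, 53.2, 44.1, 46.5, 50.1, 46.1, 42.8, 43.9,
              49.0, 34.9, 37.3, 46.9, 35.3, 35.4, 32.6, 33.7])

# Pearson correlation between IQ and the average crime index.
r = np.corrcoef(x, y)[0, 1]

# Least-squares straight line through the same points.
slope, intercept = np.polyfit(x, y, deg=1)

print(f"Pearson r: {r:.3f}")
print(f"crime index ~ {slope:.4f} * IQ + {intercept:.2f}")
```

A negative `r` matches the downward slope of the linear model, but it describes only these averaged points and says nothing about causes.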

Also, while working on this project, I found an interesting study on the relationship between IQ and crime rates in European countries (which, I guess, may apply to other countries as well) that partially supports the model above: https://www.researchgate.net/publication/258857651_An_analysis_of_the_relation_between_IQ_and_crime_in_the_European_countries