# Predicting the crime rate

## Loading the libraries
First we load the libraries used to work with the data:

*   NumPy to create and operate on N-dimensional arrays
*   pandas to load and manipulate tabular data
*   matplotlib to create plots

We import the libraries using their conventional aliases.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

## Loading the data
The datasets were downloaded from the official United Nations site https://dataunodc.un.org/ and contain various kinds of information, including:
*  Kidnapping
*  Rape
*  Drug Trafficking
*  Sexual assault
*  Burglary and Theft (together)
*  Homicide
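
The datasets report yearly values for the countries of the world. The period covered varies by file: the outputs below show 2003 to 2022 for the theft data and 1990 onwards for the homicide data.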

### Meaning of the columns
All the datasets we consider share the following columns:
*  Nation
*

## Theft crimes

In [3]:
import os.path
file = "data_cts_corruption_and_economic_crime.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    theftCrime = pd.read_csv(file, index_col=False, encoding='latin1')
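
The loading cell above only prints a message when the file is missing, so a later cell would fail with a `NameError`. A minimal sketch of a stricter loader follows; the helper name `load_csv` and the demo file are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_csv(path, encoding="latin1"):
    """Hypothetical helper: load a CSV or fail loudly if the file is missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing dataset: {path}")
    return pd.read_csv(path, index_col=False, encoding=encoding)


# Tiny demonstration with a temporary file (made-up data, not the UNODC CSV).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin1") as fh:
        fh.write("Country,Year,VALUE\nItaly,2020,1.0\n")
    demo = load_csv(demo_path)

# A missing file now raises instead of silently continuing.
try:
    load_csv("definitely_missing.csv")
    missing_raised = False
except FileNotFoundError:
    missing_raised = True
```

Raising early keeps the error close to its cause, which may be easier to debug than a `NameError` several cells later.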

In [132]:
theftCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22754 entries, 0 to 22753
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            22754 non-null  object 
 1   Country              22754 non-null  object 
 2   Region               22754 non-null  object 
 3   Subregion            22754 non-null  object 
 4   Indicator            22754 non-null  object 
 5   Dimension            22754 non-null  object 
 6   Category             22754 non-null  object 
 7   Sex                  22754 non-null  object 
 8   Age                  22754 non-null  object 
 9   Year                 22754 non-null  int64  
 10  Unit of measurement  22754 non-null  object 
 11  VALUE                22754 non-null  float64
 12  Source               22754 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 16.5 MB


In [133]:
theftCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Offences,by type of offence,Corruption,Total,Total,2013,Counts,782.0,CTS
1,AUT,Austria,Europe,Western Europe,Offences,by type of offence,Corruption,Total,Total,2013,Counts,3439.0,CTS
2,CHE,Switzerland,Europe,Western Europe,Offences,by type of offence,Corruption,Total,Total,2013,Counts,4884.0,CTS
3,CHL,Chile,Americas,Latin America and the Caribbean,Offences,by type of offence,Corruption,Total,Total,2013,Counts,339.0,CTS
4,COL,Colombia,Americas,Latin America and the Caribbean,Offences,by type of offence,Corruption,Total,Total,2013,Counts,23483.0,CTS


In [134]:
theftCrime.shape

(22754, 13)

### Data preprocessing
We filter the dataframe so that only rows measured in `Counts` are kept, discarding those expressed as `Rate per 100,000 population`. This is a first reduction of the data.

We also keep only the categories of interest, which in our case are `Theft` and `Burglary`.

In [135]:
theftCrime = theftCrime[theftCrime['Unit of measurement'] != "Rate per 100,000 population"]
theftCrime = theftCrime[theftCrime['Category'].isin(['Theft', 'Burglary'])]

In [136]:
theftCrime.shape

(3563, 13)
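
The two filters above are boolean masks applied one after the other. As an illustration on made-up rows (not the real dataset), the same masks can be built separately and combined with `&`:

```python
import pandas as pd

# Made-up rows that mimic the column layout of the UNODC files.
toy = pd.DataFrame({
    "Category": ["Theft", "Theft", "Burglary", "Fraud"],
    "Unit of measurement": ["Counts", "Rate per 100,000 population", "Counts", "Counts"],
    "VALUE": [120.0, 3.4, 45.0, 7.0],
})

is_count = toy["Unit of measurement"] != "Rate per 100,000 population"
is_wanted = toy["Category"].isin(["Theft", "Burglary"])

# Combining the masks with & keeps only rows that satisfy both conditions.
filtered = toy[is_count & is_wanted]
```

Naming the masks makes it easy to inspect how many rows each condition removes on its own.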

Looking at the dataframe, the features `Age`, `Sex`, `Iso3_code`, `Dimension`, `Indicator`, `Unit of measurement` and `Source` are not relevant to our model, so we remove them.

We also rename the column `VALUE` to `Value`.

In [137]:
theftCrime.rename(columns={'VALUE':'Value'}, inplace=True)
theftCrime = theftCrime.drop(['Unit of measurement', 'Sex', 'Iso3_code', 'Age', 'Source', 'Dimension', 'Indicator'], axis=1)

In [138]:
theftCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
2486,United Arab Emirates,Asia,Western Asia,Burglary,2003,1882.0
2487,Australia,Oceania,Australia and New Zealand,Burglary,2003,354020.0
2488,Azerbaijan,Asia,Western Asia,Burglary,2003,757.0
2489,Belgium,Europe,Western Europe,Burglary,2003,97007.0
2490,Bulgaria,Europe,Eastern Europe,Burglary,2003,28210.0


We also note, from the `info()` output shown earlier, that most features apart from `Year` and `Value` have dtype `object`: pandas stores text columns with the generic `object` dtype by default. We therefore convert these features to the dedicated `string` dtype.
We also remove any rows that contain null values.

In [139]:
theftCrime["Country"] = theftCrime["Country"].astype(pd.StringDtype())
theftCrime["Region"] = theftCrime["Region"].astype(pd.StringDtype())
theftCrime["Subregion"] = theftCrime["Subregion"].astype(pd.StringDtype())
theftCrime["Category"] = theftCrime["Category"].astype(pd.StringDtype())
theftCrime.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3563 entries, 2486 to 6048
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    3563 non-null   string 
 1   Region     3563 non-null   string 
 2   Subregion  3563 non-null   string 
 3   Category   3563 non-null   string 
 4   Year       3563 non-null   int64  
 5   Value      3563 non-null   float64
dtypes: float64(1), int64(1), string(4)
memory usage: 194.9 KB
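
The four conversions could also be written as a single call by passing a column-to-dtype mapping to `astype`. This is only a sketch on made-up data; the variable `text_columns` is introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["Italy"], "Region": ["Europe"], "Year": [2020]})

# Hypothetical list of the text columns to convert.
text_columns = ["Country", "Region"]

# "string" is the alias for pandas' dedicated string dtype.
toy = toy.astype({col: "string" for col in text_columns})
```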


In [140]:
theftCrime = theftCrime.dropna()
theftCrime.shape

(3563, 6)

The shape is unchanged, which is consistent with the `info()` output above: the dataset contains no null values.
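
`DataFrame.dropna()` returns a new DataFrame rather than modifying the one it is called on, so a shape comparison is only meaningful if the result is kept. A more direct check, sketched here on made-up data, counts the missing values per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Country": ["Italy", "Chile", None], "Value": [1.0, np.nan, 3.0]})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# dropna() leaves `toy` untouched and returns the cleaned copy.
cleaned = toy.dropna()
```

Here `toy` still has three rows after the call, while `cleaned` keeps only the row without missing values.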

In [263]:
theftCrime.describe()

Unnamed: 0,Year,Value
count,3496.0,3496.0
mean,2012.365275,144447.1
std,5.499032,519118.6
min,2003.0,2.0
25%,2008.0,2910.25
50%,2012.0,17494.0
75%,2017.0,94106.25
max,2022.0,7026802.0


We also count the number of countries in our dataset.

In [143]:
print(f"Total number of countries: {theftCrime['Country'].unique().size}")

Total number of countries: 149


## Homicide crimes

In [144]:
file = "data_cts_intentional_homicide.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    homicideCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [145]:
homicideCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117070 entries, 0 to 117069
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Iso3_code            117070 non-null  object 
 1   Country              117070 non-null  object 
 2   Region               117070 non-null  object 
 3   Subregion            117070 non-null  object 
 4   Indicator            117070 non-null  object 
 5   Dimension            117070 non-null  object 
 6   Category             117070 non-null  object 
 7   Sex                  117070 non-null  object 
 8   Age                  117070 non-null  object 
 9   Year                 117070 non-null  int64  
 10  Unit of measurement  117070 non-null  object 
 11  VALUE                117070 non-null  float64
 12  Source               117070 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 87.4 MB


In [146]:
homicideCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,35.0,CTS
1,CHE,Switzerland,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,28.0,CTS
2,COL,Colombia,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,15053.0,CTS
3,CZE,Czechia,Europe,Eastern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,69.0,CTS
4,DEU,Germany,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,455.0,CTS


In [147]:
homicideCrime.shape

(117070, 13)

### Data preprocessing
We filter the dataframe so that only rows measured in `Counts` are kept, discarding those expressed as `Rate per 100,000 population`. This is a first reduction of the data.

In [148]:
homicideCrime = homicideCrime[homicideCrime['Unit of measurement'] != "Rate per 100,000 population"]
homicideCrime.shape

(62850, 13)

Since we are analysing homicide-related crimes, we need a further reduction so that only the crimes of interest are considered. Below we check which indicators concern homicides.

In [149]:
homicide_categories = homicideCrime[homicideCrime['Indicator'].str.contains('homicide', case=False, na=False)]['Indicator'].unique()
print(homicide_categories)

['Persons arrested/suspected for intentional homicide'
 'Victims of intentional homicide'
 'Victims of intentional homicide â\x80\x93 City-level data'
 'Persons convicted for intentional homicide'
 'Death due to intentional homicide in prison'
 'Victims of Intentional Homicide - Regional Estimate']


In [150]:
homicideCrime = homicideCrime[homicideCrime['Indicator'].isin(['Victims of intentional homicide'])]
homicideCrime.shape

(46224, 13)
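
The two cells above use two different kinds of matching: `str.contains` for a case-insensitive substring search and `isin` for an exact match. A small sketch on a hand-written series shows the difference:

```python
import pandas as pd

indicators = pd.Series([
    "Victims of intentional homicide",
    "Persons convicted for intentional homicide",
    "Victims of Intentional Homicide - Regional Estimate",
    "Persons entering prison",
])

# case=False makes the substring search case-insensitive.
mentions_homicide = indicators.str.contains("homicide", case=False, na=False)

# isin() keeps only values that match one of the listed strings exactly.
exact_victims = indicators.isin(["Victims of intentional homicide"])
```

The substring search is useful for exploring which labels exist, while the exact match avoids accidentally including city-level or regional-estimate rows.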

As before, this removes a large part of the data and makes the dataset lighter.

As before, we also remove from this dataset the features that are not relevant to the study and rename the feature `VALUE` to `Value`. Here we also restructure the dataset for clarity and consistency: the `Indicator` column becomes `Category`, and its value `Victims of intentional homicide` is replaced with `Homicide`.

We also convert the features from `object` to `string`.

In [151]:
homicideCrime.rename(columns={'VALUE':'Value'}, inplace=True)
homicideCrime = homicideCrime.drop(['Unit of measurement', 'Sex', 'Iso3_code', 'Age', 'Source', 'Dimension', 'Category'], axis=1)
homicideCrime.rename(columns={'Indicator':'Category'}, inplace=True)
homicideCrime["Country"] = homicideCrime["Country"].astype(pd.StringDtype())
homicideCrime["Region"] = homicideCrime["Region"].astype(pd.StringDtype())
homicideCrime["Subregion"] = homicideCrime["Subregion"].astype(pd.StringDtype())
homicideCrime["Category"] = homicideCrime["Category"].astype(pd.StringDtype())

homicideCrime['Category'] = homicideCrime['Category'].replace('Victims of intentional homicide', 'Homicide')
homicideCrime.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46224 entries, 1408 to 47631
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    46224 non-null  string 
 1   Region     46224 non-null  string 
 2   Subregion  46224 non-null  string 
 3   Category   46224 non-null  string 
 4   Year       46224 non-null  int64  
 5   Value      46224 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 2.5 MB


In [152]:
homicideCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
1408,Aruba,Americas,Latin America and the Caribbean,Homicide,1990,0.0
1409,Anguilla,Americas,Latin America and the Caribbean,Homicide,1990,0.0
1410,Armenia,Asia,Western Asia,Homicide,1990,178.0
1411,Antigua and Barbuda,Americas,Latin America and the Caribbean,Homicide,1990,1.0
1412,Australia,Oceania,Australia and New Zealand,Homicide,1990,374.0


In [153]:
homicideCrime.shape

(46224, 6)

In [154]:
homicideCrime = homicideCrime.dropna()
homicideCrime.shape

(46224, 6)

Once again the shape of the dataset is unchanged after `dropna()`, which means there are no rows with null values.





In [157]:
homicideCrime.describe()

Unnamed: 0,Year,Value
count,37949.0,37949.0
mean,2014.391367,621.33777
std,6.691502,3402.204814
min,1990.0,0.444444
25%,2011.0,5.0
50%,2016.0,23.0
75%,2019.0,112.835821
max,2023.0,63788.0


We check the number of countries in the dataset:

In [158]:
print(f"Total number of countries: {homicideCrime['Country'].unique().size}")
countries_theft = set(theftCrime['Country'])
countries_homicide = set(homicideCrime['Country'])

# Compare the two sets
if countries_theft == countries_homicide:
    print("Both DataFrames contain the same countries.")
else:
    print("The DataFrames do not contain the same countries.")

Total number of countries: 204
The DataFrames do not contain the same countries.


In this case the number of countries (204) differs from that of the previous dataset (149), and the two datasets do not contain the same countries.

Therefore, before the datasets can be merged into a single one, we need to remove from each dataset the rows for countries that do not appear in the other.

First, let us look at the countries that differ between the two datasets.

In [159]:
different_in_theft = countries_theft - countries_homicide
different_in_homicide = countries_homicide - countries_theft

if different_in_theft:
    print(f"Countries in theftCrime but not in homicideCrime: {different_in_theft}")
    print(f"Count: {len(different_in_theft)}")
if different_in_homicide:
    print(f"Countries in homicideCrime but not in theftCrime: {different_in_homicide}")
    print(f"Count: {len(different_in_homicide)}")

Countries in theftCrime but not in homicideCrime: {'Kyrgyzstan', 'CÃ´te dâ\x80\x99Ivoire', 'Senegal', 'Sudan', 'Djibouti', 'Guinea'}
Count: 6
Countries in homicideCrime but not in theftCrime: {'Fiji', 'Uzbekistan', 'Liberia', 'Tunisia', 'Saint Pierre and Miquelon', 'Saint Helena', 'Cook Islands', 'Zambia', 'Niger', 'Malawi', 'United States Virgin Islands', 'Yemen', 'Vanuatu', 'South Sudan', 'Palau', 'Namibia', 'New Caledonia', 'Angola', 'Haiti', 'Tonga', 'Viet Nam', 'Cambodia', 'South Africa', 'Gibraltar', 'San Marino', 'Anguilla', 'Marshall Islands', 'Guam', 'Ethiopia', 'French Guiana', 'Venezuela (Bolivarian Republic of)', 'Greenland', 'RÃ©union', 'Tuvalu', 'Guadeloupe', 'Saint Martin (French Part)', 'Papua New Guinea', 'China', 'Montserrat', 'Mauritania', 'Saudi Arabia', 'Cuba', 'Iraq (Kurdistan Region)', 'Isle of Man', 'Turks and Caicos Islands', 'Aruba', 'American Samoa', 'Kiribati', 'Eritrea', 'Samoa', 'Cayman Islands', 'Ghana', 'Afghanistan', 'Iraq', 'Micronesia (Federated States of)', '

In [160]:
common_countries = countries_theft & countries_homicide
print(f"Number of countries in common: {len(common_countries)}")

Number of countries in common: 143


The two datasets have 143 countries in common, so we filter both of them down to these shared countries.

In [161]:
theftCrime = theftCrime[theftCrime['Country'].isin(common_countries)]
homicideCrime = homicideCrime[homicideCrime['Country'].isin(common_countries)]
print(f"theftCrime: {theftCrime.shape}")
print(f"homicideCrime: {homicideCrime.shape}")

theftCrime: (3496, 6)
homicideCrime: (36712, 6)
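
The filtering step above can be illustrated on two tiny made-up tables. The symmetric difference (`^`) is added here only to list the countries that appear on one side; all names and values are invented for the example.

```python
import pandas as pd

theft_toy = pd.DataFrame({"Country": ["Italy", "Chile", "Sudan"], "Value": [10.0, 20.0, 30.0]})
homicide_toy = pd.DataFrame({"Country": ["Italy", "Chile", "Fiji"], "Value": [1.0, 2.0, 3.0]})

# Intersection: countries present in both tables.
common = set(theft_toy["Country"]) & set(homicide_toy["Country"])
# Symmetric difference: countries present in exactly one table.
only_one_side = set(theft_toy["Country"]) ^ set(homicide_toy["Country"])

theft_kept = theft_toy[theft_toy["Country"].isin(common)]
homicide_kept = homicide_toy[homicide_toy["Country"].isin(common)]

# After filtering, both frames cover exactly the same countries.
assert set(theft_kept["Country"]) == set(homicide_kept["Country"])
```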


## Drug trafficking crimes

In [162]:
file = "/data_cts_prisons_and_prisoners.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    drugsCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [163]:
drugsCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70790 entries, 0 to 70789
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            70790 non-null  object 
 1   Country              70790 non-null  object 
 2   Region               70790 non-null  object 
 3   Subregion            70790 non-null  object 
 4   Indicator            70790 non-null  object 
 5   Dimension            70790 non-null  object 
 6   Category             70790 non-null  object 
 7   Sex                  70790 non-null  object 
 8   Age                  70790 non-null  object 
 9   Year                 70790 non-null  int64  
 10  Unit of measurement  70790 non-null  object 
 11  VALUE                70790 non-null  float64
 12  Source               70790 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 51.0 MB


In [164]:
drugsCrime.shape

(70790, 13)

In [165]:
drugsCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,BRB,Barbados,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,0.0,CTS
1,CRI,Costa Rica,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,173.0,CTS
2,DMA,Dominica,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,65.0,CTS
3,GBR_NI,United Kingdom (Northern Ireland),Europe,Northern Europe,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,61.0,CTS
4,ITA,Italy,Europe,Southern Europe,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,1443.0,CTS


We filter the dataframe, keeping only the rows whose `Unit of measurement` is `Counts` rather than `Rate per 100,000 population`.

In [166]:
drugsCrime = drugsCrime[drugsCrime['Unit of measurement'] != "Rate per 100,000 population"]
drugsCrime.shape

(44997, 13)

The features `Iso3_code`, `Indicator`, `Dimension`, `Sex`, `Age`, `Unit of measurement` and `Source` are not useful for our model, so we remove them. We also rename the feature `VALUE` to `Value` and remove any rows containing null values.

In [167]:
drugsCrime = drugsCrime.drop(['Iso3_code', 'Indicator', 'Dimension', 'Sex', 'Age', 'Unit of measurement', 'Source'], axis=1)
drugsCrime.rename(columns={'VALUE':'Value'}, inplace=True)
drugsCrime = drugsCrime.dropna()
drugsCrime

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
0,Barbados,Americas,Latin America and the Caribbean,Intentional Homicide,2016,0.0
1,Costa Rica,Americas,Latin America and the Caribbean,Intentional Homicide,2016,173.0
2,Dominica,Americas,Latin America and the Caribbean,Intentional Homicide,2016,65.0
3,United Kingdom (Northern Ireland),Europe,Northern Europe,Intentional Homicide,2016,61.0
4,Italy,Europe,Southern Europe,Intentional Homicide,2016,1443.0
...,...,...,...,...,...,...
44992,Zimbabwe,Africa,Sub-Saharan Africa,Total,2008,487.0
44993,Zimbabwe,Africa,Sub-Saharan Africa,Total,2017,475.0
44994,Zimbabwe,Africa,Sub-Saharan Africa,Total,2018,353.0
44995,Zimbabwe,Africa,Sub-Saharan Africa,Total,2019,450.0


The number of rows in the dataset is unchanged after `dropna()`, which means there are no rows with null values.

Next, we convert the relevant features from `object` to `string`.

In [168]:
drugsCrime["Country"] = drugsCrime["Country"].astype(pd.StringDtype())
drugsCrime["Region"] = drugsCrime["Region"].astype(pd.StringDtype())
drugsCrime["Subregion"] = drugsCrime["Subregion"].astype(pd.StringDtype())
drugsCrime["Category"] = drugsCrime["Category"].astype(pd.StringDtype())
drugsCrime.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44997 entries, 0 to 44996
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    44997 non-null  string 
 1   Region     44997 non-null  string 
 2   Subregion  44997 non-null  string 
 3   Category   44997 non-null  string 
 4   Year       44997 non-null  int64  
 5   Value      44997 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 2.4 MB


Since we are analysing crimes related to drug trafficking and drug use, we need to reduce the dataset further so that only the crimes of interest remain. Below we check which categories concern drugs.

In [169]:
drug_categories = drugsCrime[drugsCrime['Category'].str.contains('Drug', case=False, na=False)]['Category'].unique()
print(drug_categories)

<StringArray>
['Drug Possession', 'Drug Trafficking', 'Drug possession', 'Drug trafficking']
Length: 4, dtype: string


In [170]:
drugsCrime = drugsCrime[drugsCrime['Category'].isin(['Drug Trafficking', 'Drug trafficking'])]
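
The two spellings `Drug Trafficking` and `Drug trafficking` differ only in capitalisation. A possible alternative to listing both, sketched here on made-up values, is to compare a lower-cased copy of the column:

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Drug Trafficking", "Drug trafficking", "Drug Possession", "Total"],
    "Value": [3.0, 203.0, 12.0, 487.0],
})

# Lower-case the labels only for the comparison; the original column is untouched.
is_trafficking = toy["Category"].str.lower() == "drug trafficking"
trafficking = toy[is_trafficking]
```

This would also catch any other capitalisation variant, which may or may not be desirable depending on how the labels are defined in the source data.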

In [171]:
drugsCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
596,Barbados,Americas,Latin America and the Caribbean,Drug Trafficking,2016,3.0
597,Costa Rica,Americas,Latin America and the Caribbean,Drug Trafficking,2016,203.0
598,Dominica,Americas,Latin America and the Caribbean,Drug Trafficking,2016,0.0
599,United Kingdom (Northern Ireland),Europe,Northern Europe,Drug Trafficking,2016,3047.0
600,Honduras,Americas,Latin America and the Caribbean,Drug Trafficking,2016,87.0


After this additional filter, the `Category` column can also be dropped.

In [172]:
drugsCrime = drugsCrime.drop(['Category'], axis=1)

In [173]:
drugsCrime.shape

(682, 5)

In [174]:
drugsCrime.describe()

Unnamed: 0,Year,Value
count,682.0,682.0
mean,2017.567449,8160.313783
std,3.335696,27153.150717
min,2010.0,0.0
25%,2015.0,80.0
50%,2018.0,687.0
75%,2020.0,3066.5
max,2022.0,240113.0


In [175]:
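# Note: this counts the countries in homicideCrime (already filtered to the common countries), not in drugsCrime.
print(f"Total number of countries: {homicideCrime['Country'].unique().size}")

Total number of countries: 143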


## Sexual violence and kidnappings

In [176]:
file = "data_cts_violent_and_sexual_crime.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    violent_sexualCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [177]:
violent_sexualCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26114 entries, 0 to 26113
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            26114 non-null  object 
 1   Country              26114 non-null  object 
 2   Region               26114 non-null  object 
 3   Subregion            26114 non-null  object 
 4   Indicator            26114 non-null  object 
 5   Dimension            26114 non-null  object 
 6   Category             26114 non-null  object 
 7   Sex                  26114 non-null  object 
 8   Age                  26114 non-null  object 
 9   Year                 26114 non-null  int64  
 10  Unit of measurement  26114 non-null  object 
 11  VALUE                26114 non-null  float64
 12  Source               26114 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 19.3 MB


In [178]:
violent_sexualCrime.shape

(26114, 13)

In [179]:
violent_sexualCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,AZE,Azerbaijan,Asia,Western Asia,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,155.0,CTS
1,BEL,Belgium,Europe,Western Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,61959.0,CTS
2,BGR,Bulgaria,Europe,Eastern Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,3806.0,CTS
3,BHR,Bahrain,Asia,Western Asia,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,2701.0,CTS
4,BLR,Belarus,Europe,Eastern Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,4032.0,CTS


We filter the dataframe, keeping only the rows whose `Unit of measurement` is `Counts` rather than `Rate per 100,000 population`.

In [180]:
violent_sexualCrime = violent_sexualCrime[violent_sexualCrime['Unit of measurement'] != "Rate per 100,000 population"]
violent_sexualCrime.shape

(13073, 13)

The features `Iso3_code`, `Indicator`, `Dimension`, `Sex`, `Age`, `Unit of measurement` and `Source` are not useful for our model, so we remove them.
We also remove any rows containing null values.

In [181]:
violent_sexualCrime = violent_sexualCrime.drop(['Iso3_code', 'Indicator', 'Dimension', 'Sex', 'Age', 'Unit of measurement', 'Source'], axis=1)
violent_sexualCrime = violent_sexualCrime.dropna()
violent_sexualCrime

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
0,Azerbaijan,Asia,Western Asia,Serious assault,2003,155.0
1,Belgium,Europe,Western Europe,Serious assault,2003,61959.0
2,Bulgaria,Europe,Eastern Europe,Serious assault,2003,3806.0
3,Bahrain,Asia,Western Asia,Serious assault,2003,2701.0
4,Belarus,Europe,Eastern Europe,Serious assault,2003,4032.0
...,...,...,...,...,...,...
13068,Montenegro,Europe,Southern Europe,Acts intended to induce fear or emotional dist...,2021,10.0
13069,Mauritius,Africa,Sub-Saharan Africa,Acts intended to induce fear or emotional dist...,2021,342.0
13070,El Salvador,Americas,Latin America and the Caribbean,Acts intended to induce fear or emotional dist...,2021,4.0
13071,Serbia,Europe,Southern Europe,Acts intended to induce fear or emotional dist...,2021,1.0


The number of rows in the dataset is unchanged after `dropna()`, which means there are no rows with null values.

Next, we convert the relevant features from `object` to `string`.

In [182]:
violent_sexualCrime["Country"] = violent_sexualCrime["Country"].astype(pd.StringDtype())
violent_sexualCrime["Region"] = violent_sexualCrime["Region"].astype(pd.StringDtype())
violent_sexualCrime["Subregion"] = violent_sexualCrime["Subregion"].astype(pd.StringDtype())
violent_sexualCrime["Category"] = violent_sexualCrime["Category"].astype(pd.StringDtype())
violent_sexualCrime.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13073 entries, 0 to 13072
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    13073 non-null  string 
 1   Region     13073 non-null  string 
 2   Subregion  13073 non-null  string 
 3   Category   13073 non-null  string 
 4   Year       13073 non-null  int64  
 5   VALUE      13073 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 714.9 KB


Since we need to analyse both sexual violence and kidnappings, we create two copies of the dataframe so that the two cases can be analysed separately.

We also rename the feature `VALUE` to `sexual_violence_count` or `kidnapping_count`, depending on the phenomenon analysed.

In [183]:
sexual_violence = violent_sexualCrime.copy()
sexual_violence = sexual_violence[sexual_violence['Category'].isin(['Sexual violence'])]
sexual_violence.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
3796,Belgium,Europe,Western Europe,Sexual violence,2003,5565.0
3797,Bulgaria,Europe,Eastern Europe,Sexual violence,2003,1287.0
3798,Canada,Americas,Northern America,Sexual violence,2003,26128.0
3799,Czechia,Europe,Eastern Europe,Sexual violence,2003,1898.0
3800,Germany,Europe,Western Europe,Sexual violence,2003,54632.0


In [184]:
kidnapping = violent_sexualCrime.copy()
kidnapping = kidnapping[kidnapping['Category'].isin(['Kidnapping'])]
kidnapping.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
1977,United Arab Emirates,Asia,Western Asia,Kidnapping,2003,434.0
1978,Azerbaijan,Asia,Western Asia,Kidnapping,2003,28.0
1979,Belgium,Europe,Western Europe,Kidnapping,2003,1003.0
1980,Bulgaria,Europe,Eastern Europe,Kidnapping,2003,209.0
1981,Bahrain,Asia,Western Asia,Kidnapping,2003,5.0


After this additional filter, the `Category` column can also be dropped.

In [185]:
sexual_violence = sexual_violence.drop(['Category'], axis=1)
sexual_violence.rename(columns={'VALUE':'sexual_violence_count'}, inplace=True)
sexual_violence.shape

(1768, 5)

In [186]:
sexual_violence.describe()

Unnamed: 0,Year,sexual_violence_count
count,1768.0,1768.0
mean,2013.1431,6434.749434
std,5.244384,15363.198967
min,2003.0,0.0
25%,2009.0,238.0
50%,2013.0,1362.5
75%,2018.0,5163.25
max,2022.0,193566.0


In [187]:
kidnapping = kidnapping.drop(['Category'], axis=1)
kidnapping.rename(columns={'VALUE':'kidnapping_count'}, inplace=True)
kidnapping.shape

(1819, 5)

In [188]:
kidnapping.describe()

Unnamed: 0,Year,kidnapping_count
count,1819.0,1819.0
mean,2012.722375,792.048679
std,5.438161,4022.97388
min,2003.0,0.0
25%,2008.0,5.0
50%,2013.0,34.0
75%,2017.0,265.0
max,2022.0,65461.0


The `describe()` outputs show that, overall, the data span the years 2003 to 2022; individual countries may be covered for fewer years.

In [189]:
print(f"Total number of countries in the sexual_violence dataframe: {sexual_violence['Country'].unique().size}")

Total number of countries in the sexual_violence dataframe: 140


In [83]:
print(f"Total number of countries in the kidnapping dataframe: {kidnapping['Country'].unique().size}")

Total number of countries in the kidnapping dataframe: 147
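
Since the per-crime tables share the `Country`, `Region`, `Subregion` and `Year` columns, one way they could later be combined is a merge on those keys. The following is only a sketch on made-up numbers; the choice of an outer join is an assumption, not something the notebook has fixed at this point.

```python
import pandas as pd

keys = ["Country", "Region", "Subregion", "Year"]

# Invented values that follow the layout of the two dataframes above.
sexual_violence_toy = pd.DataFrame({
    "Country": ["Belgium", "Bulgaria"],
    "Region": ["Europe", "Europe"],
    "Subregion": ["Western Europe", "Eastern Europe"],
    "Year": [2003, 2003],
    "sexual_violence_count": [100.0, 50.0],
})
kidnapping_toy = pd.DataFrame({
    "Country": ["Belgium", "Bahrain"],
    "Region": ["Europe", "Asia"],
    "Subregion": ["Western Europe", "Western Asia"],
    "Year": [2003, 2003],
    "kidnapping_count": [10.0, 1.0],
})

# how="outer" keeps country-year pairs present in only one table (filled with NaN).
merged = sexual_violence_toy.merge(kidnapping_toy, on=keys, how="outer")
```

With `how="inner"` instead, only the country-year pairs present in both tables would survive, which trades coverage for completeness.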


## Wage and salaried workers

In [252]:
file = "API_SL.EMP.WORK.ZS_DS2_en_csv_v2_305657.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    df = pd.read_csv(file, index_col=False, encoding='latin1')

In [253]:
df.head()

Unnamed: 0,"ï»¿""Country Name""",Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,Unnamed: 68
0,Aruba,ABW,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,,,,,,...,,,,,,,,,,
1,Africa Eastern and Southern,AFE,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,,,,,,...,25.271594,25.258113,25.349516,25.355688,25.24392,24.609162,24.710108,25.06964,,
2,Afghanistan,AFG,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,,,,,,...,12.612444,13.21704,13.904849,14.638503,15.557754,15.666043,17.719207,18.259346,,
3,Africa Western and Central,AFW,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,,,,,,...,18.281948,18.174335,17.701215,17.669886,17.571845,17.282398,17.374956,17.564789,,
4,Angola,AGO,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,,,,,,...,36.332605,36.054109,35.796968,35.515452,34.833671,34.666696,34.883633,35.20936,,


In [254]:
# Missing values are left as NaN here; they are dropped after the melt below.
df.rename(columns={'ï»¿"Country Name"':'Country'}, inplace=True)

In [255]:
columns = df.columns[:3]
columns

Index(['Country', 'Country Code', 'Indicator Name'], dtype='object')
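
The `ï»¿` prefix in the first column name is what a UTF-8 byte-order mark looks like when it is decoded as Latin-1, and sequences such as `CÃ´te` in earlier outputs have the same origin. Assuming the source files are in fact UTF-8 (which these outputs suggest but do not prove), the `utf-8-sig` codec would decode them cleanly. A small self-contained sketch with in-memory bytes and made-up content:

```python
import io

import pandas as pd

# Made-up CSV content encoded as UTF-8 with a byte-order mark.
raw = '"Country Name",Value\nCôte d’Ivoire,1.0\n'.encode("utf-8-sig")

# Decoding the same bytes with two different codecs.
as_latin1 = pd.read_csv(io.BytesIO(raw), encoding="latin1")
as_utf8 = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")

# With utf-8-sig the header and the accented name come out intact;
# with latin1 both are garbled.
clean_header = as_utf8.columns[0]
garbled_header = as_latin1.columns[0]
```

If the real files decode without errors under `utf-8-sig`, the manual rename of the garbled column name would no longer be needed.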

In [256]:
columns_to_drop = [str(year) for year in range(1960, 1991)]

# Keep only the columns that are actually present in the DataFrame
columns_to_drop = [col for col in columns_to_drop if col in df.columns]

# Drop those columns from the DataFrame
df.drop(columns=columns_to_drop, inplace=True)

In [257]:
df.drop(columns=['Unnamed: 68'], inplace=True)

In [258]:
# Step 2: Use melt to turn the year columns into rows
df_melted = pd.melt(df, id_vars=['Country', 'Country Code', 'Indicator Name', 'Indicator Code'],
                    var_name='Year', value_name='Value')

# Step 3: Convert the 'Year' column to numeric
df_melted['Year'] = pd.to_numeric(df_melted['Year'], errors='coerce')

# Step 4: Drop the rows where 'Value' is NaN
df = df_melted.dropna(subset=['Value'])
df

Unnamed: 0,Country,Country Code,Indicator Name,Indicator Code,Year,Value
1,Africa Eastern and Southern,AFE,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,1991,24.085849
2,Afghanistan,AFG,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,1991,7.329954
3,Africa Western and Central,AFW,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,1991,14.621399
4,Angola,AGO,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,1991,23.131789
5,Albania,ALB,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,1991,33.767352
...,...,...,...,...,...,...
8506,Samoa,WSM,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,2022,64.825453
8508,"Yemen, Rep.",YEM,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,2022,45.522387
8509,South Africa,ZAF,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,2022,71.014147
8510,Zambia,ZMB,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,2022,26.980686
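
The wide-to-long reshaping performed by `pd.melt` may be easier to follow on a tiny made-up table with two year columns:

```python
import pandas as pd

# Invented wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "Country": ["Angola", "Albania"],
    "1991": [23.1, 33.8],
    "1992": [None, 34.0],
})

# Each year column becomes a (Year, Value) pair per country.
long = pd.melt(wide, id_vars=["Country"], var_name="Year", value_name="Value")
long["Year"] = pd.to_numeric(long["Year"])

# Rows without a value are dropped, as in the cell above.
long = long.dropna(subset=["Value"]).reset_index(drop=True)
```

The long layout has one row per country and year, which matches the structure of the crime dataframes built earlier.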


In [259]:
df = df.drop(columns=['Indicator Name', 'Indicator Code', 'Country Code'])


In [261]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7519 entries, 1 to 8511
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Country  7519 non-null   object 
 1   Year     7519 non-null   int64  
 2   Value    7519 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 235.0+ KB


In [262]:
df.head()

Unnamed: 0,Country,Year,Value
1,Africa Eastern and Southern,1991,24.085849
2,Afghanistan,1991,7.329954
3,Africa Western and Central,1991,14.621399
4,Angola,1991,23.131789
5,Albania,1991,33.767352


## Police personnel

In [4]:
file = "data_cts_access_and_functioning_of_justice.csv"
if not os.path.exists(file):
    print("Missing dataset")
else:
    police_personnel = pd.read_csv(file, index_col=False, encoding='latin1')

In [5]:
police_personnel.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102032 entries, 0 to 102031
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Iso3_code            102032 non-null  object 
 1   Country              102032 non-null  object 
 2   Region               102032 non-null  object 
 3   Subregion            102032 non-null  object 
 4   Indicator            102032 non-null  object 
 5   Dimension            102032 non-null  object 
 6   Category             102032 non-null  object 
 7   Sex                  102032 non-null  object 
 8   Age                  102032 non-null  object 
 9   Year                 102032 non-null  int64  
 10  Unit of measurement  102032 non-null  object 
 11  VALUE                102032 non-null  float64
 12  Source               102032 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 74.7 MB
