# Predicting the crime rate

## Loading the libraries
First we load the libraries needed to work with the data

*   NumPy to create and operate on N-dimensional arrays
*   pandas to load and manipulate tabular data
*   matplotlib to create plots

We import the libraries using their conventional aliases, together with scikit-learn (`sklearn`)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

## Loading the data
The datasets were downloaded from the official United Nations website https://dataunodc.un.org/ and contain information on several types of crime, including:
*  Kidnapping
*  Rape
*  Drug Trafficking
*  Sexual assault
*  Burglary and Theft (together)
*  Homicide

All the datasets contain data from 2013 to 2022 for the various countries of the world; as the `Year` statistics printed later by `describe()` show, some files also include other years.

### Meaning of the columns
All the datasets we consider share the following columns:
*  Nation
*

## Theft crimes

In [2]:
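import os.path

file = "data_cts_corruption_and_economic_crime.csv"
if not os.path.exists(file):
    # Without the file the DataFrame is not created, so the following cells will fail
    print("Missing dataset")
else:
    # latin1 accepts any byte sequence, so the read itself never fails on encoding
    theftCrime = pd.read_csv(file, index_col=False, encoding='latin1')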
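The same loading pattern is repeated for every dataset in this notebook. As a minimal sketch, it could be wrapped in a small helper that stops with a clear error when the file is missing. The name `load_cts_csv` is hypothetical and not part of the original notebook; the `latin1` default simply mirrors the cell above.

```python
from pathlib import Path

import pandas as pd


def load_cts_csv(path, encoding="latin1"):
    """Hypothetical helper: read a CSV export, failing early if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Raising here stops execution at the load step instead of at the
        # first later cell that uses the undefined DataFrame
        raise FileNotFoundError(f"Missing dataset: {csv_path}")
    return pd.read_csv(csv_path, index_col=False, encoding=encoding)
```

With such a helper, the cell above would reduce to `theftCrime = load_cts_csv("data_cts_corruption_and_economic_crime.csv")`.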

In [3]:
theftCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22754 entries, 0 to 22753
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            22754 non-null  object 
 1   Country              22754 non-null  object 
 2   Region               22754 non-null  object 
 3   Subregion            22754 non-null  object 
 4   Indicator            22754 non-null  object 
 5   Dimension            22754 non-null  object 
 6   Category             22754 non-null  object 
 7   Sex                  22754 non-null  object 
 8   Age                  22754 non-null  object 
 9   Year                 22754 non-null  int64  
 10  Unit of measurement  22754 non-null  object 
 11  VALUE                22754 non-null  float64
 12  Source               22754 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 16.5 MB


In [4]:
theftCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Offences,by type of offence,Corruption,Total,Total,2013,Counts,782.0,CTS
1,AUT,Austria,Europe,Western Europe,Offences,by type of offence,Corruption,Total,Total,2013,Counts,3439.0,CTS
2,CHE,Switzerland,Europe,Western Europe,Offences,by type of offence,Corruption,Total,Total,2013,Counts,4884.0,CTS
3,CHL,Chile,Americas,Latin America and the Caribbean,Offences,by type of offence,Corruption,Total,Total,2013,Counts,339.0,CTS
4,COL,Colombia,Americas,Latin America and the Caribbean,Offences,by type of offence,Corruption,Total,Total,2013,Counts,23483.0,CTS


In [5]:
theftCrime.shape

(22754, 13)

In [6]:
theftCrime.describe()

Unnamed: 0,Year,VALUE
count,22754.0,22754.0
mean,2015.348159,26752.75
std,5.059154,213446.5
min,2003.0,0.0
25%,2013.0,6.0
50%,2017.0,127.2963
75%,2019.0,1484.817
max,2022.0,7026802.0


### Data preprocessing
We filter the dataframe to keep only the rows whose unit is 'Counts' rather than 'Rate per 100,000 population'. This gives a first reduction of the data

In [7]:
theftCrime = theftCrime[theftCrime['Unit of measurement'] != "Rate per 100,000 population"]

In [8]:
theftCrime.shape

(11377, 13)
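The filter above removes the rate rows, which is the same as keeping only `Counts` as long as those are the only two units in the file. A small sketch on made-up rows (the values below are illustrative, not taken from the dataset) shows how the units could be inspected first and then selected explicitly:

```python
import pandas as pd

# Toy rows that mimic the structure of the dataset (values are made up)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Unit of measurement": ["Counts", "Rate per 100,000 population",
                            "Counts", "Rate per 100,000 population"],
    "VALUE": [782.0, 26.4, 3439.0, 40.1],
})

# Inspect which units are present before deciding what to keep
print(toy["Unit of measurement"].value_counts())

# Keep the rows measured in absolute counts
counts_only = toy[toy["Unit of measurement"] == "Counts"]
print(counts_only.shape)
```

Selecting with `== "Counts"` states the intent directly and would not let a third, unexpected unit slip through.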

Looking at the dataframe, the features `Age`, `Sex`, `Iso3_code`, `Dimension`, `Indicator`, `Unit of measurement` and `Source` are not relevant to our model, so we remove them.

We also rename the column `VALUE` to `Value`

In [9]:
# Assign the result instead of renaming in place: theftCrime is a filtered selection
theftCrime = theftCrime.rename(columns={'VALUE': 'Value'})
theftCrime = theftCrime.drop(['Unit of measurement', 'Sex', 'Iso3_code', 'Age', 'Source', 'Dimension', 'Indicator'], axis=1)

In [10]:
theftCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
0,Armenia,Asia,Western Asia,Corruption,2013,782.0
1,Austria,Europe,Western Europe,Corruption,2013,3439.0
2,Switzerland,Europe,Western Europe,Corruption,2013,4884.0
3,Chile,Americas,Latin America and the Caribbean,Corruption,2013,339.0
4,Colombia,Americas,Latin America and the Caribbean,Corruption,2013,23483.0


We also notice, from the information printed earlier, that every feature except `Year` and `Value` has dtype `object`. This is because pandas stores text columns with the generic `object` dtype by default. We therefore convert these features to the `string` dtype.
In addition, we remove any rows that contain null values.

In [11]:
theftCrime["Country"] = theftCrime["Country"].astype(pd.StringDtype())
theftCrime["Region"] = theftCrime["Region"].astype(pd.StringDtype())
theftCrime["Subregion"] = theftCrime["Subregion"].astype(pd.StringDtype())
theftCrime["Category"] = theftCrime["Category"].astype(pd.StringDtype())
theftCrime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11377 entries, 0 to 11376
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    11377 non-null  string 
 1   Region     11377 non-null  string 
 2   Subregion  11377 non-null  string 
 3   Category   11377 non-null  string 
 4   Year       11377 non-null  int64  
 5   Value      11377 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 622.2 KB
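The column-by-column conversion above can also be written in a single call by passing a mapping to `astype`. This is a sketch on a toy frame, with made-up rows, rather than the notebook's actual data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Armenia", "Austria"],
    "Region": ["Asia", "Europe"],
    "Year": [2013, 2013],
    "Value": [782.0, 3439.0],
})

# Convert only the text columns; the numeric columns keep their dtype
text_columns = ["Country", "Region"]
toy = toy.astype({column: "string" for column in text_columns})
print(toy.dtypes)
```

The same list of column names could be reused for each of the datasets in this notebook, since they share the same layout.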


In [12]:
theftCrime = theftCrime.dropna()  # dropna() returns a new DataFrame, so keep the result
theftCrime.shape

(11377, 6)

We see that our dataset contains no null values, since the shape has not changed; this agrees with the non-null counts reported by `info()`.
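Comparing shapes only works if the result of `dropna()` is kept, because the method returns a new DataFrame and leaves the original untouched. A more direct check is to count the missing values per column, as in this sketch on a toy frame with one artificial gap:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Armenia", "Austria", "Chile"],
    "Value": [782.0, np.nan, 339.0],
})

# dropna() returns a new DataFrame; `toy` itself still has three rows
cleaned = toy.dropna()

# Counting missing values per column answers the question directly
missing_per_column = toy.isna().sum()
print(missing_per_column)
print(toy.shape, cleaned.shape)
```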

Let us also count the number of countries in our dataset.

In [13]:
print(f"Total number of countries: {theftCrime['Country'].unique().size}")

Total number of countries: 157


## Homicide crimes

In [14]:
import os.path

file = "data_cts_intentional_homicide.csv"
if not os.path.exists(file):
    # Without the file the DataFrame is not created, so the following cells will fail
    print("Missing dataset")
else:
    homicideCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [15]:
homicideCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117070 entries, 0 to 117069
Data columns (total 13 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Iso3_code            117070 non-null  object 
 1   Country              117070 non-null  object 
 2   Region               117070 non-null  object 
 3   Subregion            117070 non-null  object 
 4   Indicator            117070 non-null  object 
 5   Dimension            117070 non-null  object 
 6   Category             117070 non-null  object 
 7   Sex                  117070 non-null  object 
 8   Age                  117070 non-null  object 
 9   Year                 117070 non-null  int64  
 10  Unit of measurement  117070 non-null  object 
 11  VALUE                117070 non-null  float64
 12  Source               117070 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 87.4 MB


In [16]:
homicideCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,35.0,CTS
1,CHE,Switzerland,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,28.0,CTS
2,COL,Colombia,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,15053.0,CTS
3,CZE,Czechia,Europe,Eastern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,69.0,CTS
4,DEU,Germany,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,455.0,CTS


In [17]:
homicideCrime.shape

(117070, 13)

In [18]:
homicideCrime.describe()

Unnamed: 0,Year,VALUE
count,117070.0,117070.0
mean,2014.886162,733.792988
std,6.167995,9906.653113
min,1990.0,0.0
25%,2012.0,0.243
50%,2016.0,2.5566
75%,2019.0,20.935593
max,2023.0,457945.484991


### Data preprocessing
We filter the dataframe to keep only the rows whose unit is 'Counts' rather than 'Rate per 100,000 population'. This gives a first reduction of the data

In [19]:
homicideCrime = homicideCrime[homicideCrime['Unit of measurement'] != "Rate per 100,000 population"]
homicideCrime.shape

(62850, 13)

As before, this removes almost half of the data and makes the dataset lighter.

As before, we also remove from this dataset the features that are not relevant to the study, and rename the feature `VALUE` to `Value`.

We also convert the text features from `object` to `string`.

In [20]:
homicideCrime = homicideCrime.rename(columns={'VALUE': 'Value'})
homicideCrime = homicideCrime.drop(['Unit of measurement', 'Sex', 'Iso3_code', 'Age', 'Source', 'Dimension', 'Indicator'], axis=1)
homicideCrime["Country"] = homicideCrime["Country"].astype(pd.StringDtype())
homicideCrime["Region"] = homicideCrime["Region"].astype(pd.StringDtype())
homicideCrime["Subregion"] = homicideCrime["Subregion"].astype(pd.StringDtype())
homicideCrime["Category"] = homicideCrime["Category"].astype(pd.StringDtype())
homicideCrime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62850 entries, 0 to 62849
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    62850 non-null  string 
 1   Region     62850 non-null  string 
 2   Subregion  62850 non-null  string 
 3   Category   62850 non-null  string 
 4   Year       62850 non-null  int64  
 5   Value      62850 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 3.4 MB


In [21]:
homicideCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
0,Armenia,Asia,Western Asia,National citizens,2013,35.0
1,Switzerland,Europe,Western Europe,National citizens,2013,28.0
2,Colombia,Americas,Latin America and the Caribbean,National citizens,2013,15053.0
3,Czechia,Europe,Eastern Europe,National citizens,2013,69.0
4,Germany,Europe,Western Europe,National citizens,2013,455.0


In [22]:
homicideCrime = homicideCrime.dropna()  # dropna() returns a new DataFrame, so keep the result
homicideCrime.shape

(62850, 6)

Once again the shape of the dataset has not changed after `dropna()`, which means that there are no rows with null values.

Let us check the number of countries in the dataset:

In [23]:
print(f"Total number of countries: {homicideCrime['Country'].unique().size}")
countries_theft = set(theftCrime['Country'])
countries_homicide = set(homicideCrime['Country'])

# Compare the two sets
if countries_theft == countries_homicide:
    print("Both DataFrames contain the same countries.")
else:
    print("The DataFrames do not contain the same countries.")

Total number of countries: 231
The DataFrames do not contain the same countries.


In this case the number of countries (231) differs from that of the previous dataset (157), and the two datasets do not contain the same countries.

So, before merging the datasets into a single one, we need to remove from `theftCrime` all the rows for countries that do not appear in `homicideCrime`, and vice versa.

First, let us check which countries differ between the two datasets.

In [24]:
different_in_theft = countries_theft - countries_homicide
different_in_homicide = countries_homicide - countries_theft

if different_in_theft:
    print(f"Countries in theftCrime but not in homicideCrime: {different_in_theft}")
    print(f"Count: {len(different_in_theft)}")
if different_in_homicide:
    print(f"Countries in homicideCrime but not in theftCrime: {different_in_homicide}")
    print(f"Count: {len(different_in_homicide)}")

Countries in theftCrime but not in homicideCrime: {'CÃ´te dâ\x80\x99Ivoire', 'Benin', 'Guinea', 'Senegal', 'Sudan'}
Count: 5
Countries in homicideCrime but not in theftCrime: {'Seychelles', 'All Africa', 'Tonga', 'San Marino', 'Saint Helena', 'Aruba', 'All Oceania', 'Samoa', 'Micronesia', 'Eastern Asia', 'Iraq', 'Southern Europe', 'Vanuatu', 'South Sudan', 'Gibraltar', 'Ethiopia', 'Fiji', 'Papua New Guinea', 'Viet Nam', 'Faroe Islands', 'Greenland', 'Southern Asia', 'South Africa', 'Anguilla', 'All Asia', 'Western Asia', 'Melanesia', 'Turks and Caicos Islands', 'New Caledonia', 'Martinique', 'Eritrea', 'Mauritania', 'British Virgin Islands', 'Iraq (Kurdistan Region)', 'Saint Pierre and Miquelon', 'Marshall Islands', 'Afghanistan', 'Montserrat', 'Polynesia', 'Tuvalu', 'Saint Martin (French Part)', 'American Samoa', 'French Guiana', 'Mayotte', 'Cuba', 'Ghana', 'All Americas', 'Central Asia', 'Isle of Man', 'Angola', 'Cayman Islands', 'French Polynesia', 'Haiti', 'Australia and New Zealand', 'Nort
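The name `'CÃ´te dâ\x80\x99Ivoire'` in the output above has the typical look of UTF-8 text that was decoded as Latin-1. A minimal sketch of how such a string can be repaired, assuming that this is indeed what happened here:

```python
# The garbled name exactly as it appears in the printed set
garbled = "CÃ´te dâ\x80\x99Ivoire"

# Re-encode with the codec used for reading, then decode as UTF-8
repaired = garbled.encode("latin1").decode("utf-8")
print(repaired)
```

If the source files are entirely UTF-8 (an assumption that is not verified here), passing `encoding='utf-8'` to `pd.read_csv` would avoid the problem at load time.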

In [25]:
common_countries = countries_theft & countries_homicide
print(f"Number of countries in common: {len(common_countries)}")

Number of countries in common: 152


The two datasets have 152 countries in common, so we filter both of them to keep only the countries they share.

In [26]:
theftCrime = theftCrime[theftCrime['Country'].isin(common_countries)]
homicideCrime = homicideCrime[homicideCrime['Country'].isin(common_countries)]
print(f"theftCrime: {theftCrime.shape}")
print(f"homicideCrime: {homicideCrime.shape}")

theftCrime: (11355, 6)
homicideCrime: (60234, 6)
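Once the two frames share the same countries, a merge also needs a single row per `Country` and `Year` on each side. The sketch below shows one possible way to get there on toy data. The helper `yearly_total` and the column names `theft_count` and `homicide_count` are hypothetical, and whether summing over categories is appropriate depends on how the categories in each file overlap.

```python
import pandas as pd

# Toy frames with the cleaned layout (values are made up)
theft_toy = pd.DataFrame({
    "Country": ["Italy", "Italy", "Chile"],
    "Year": [2016, 2016, 2016],
    "Category": ["Corruption", "Fraud", "Corruption"],
    "Value": [10.0, 5.0, 3.0],
})
homicide_toy = pd.DataFrame({
    "Country": ["Italy", "Chile"],
    "Year": [2016, 2016],
    "Value": [2.0, 4.0],
})


def yearly_total(frame, new_name):
    """Hypothetical helper: one summed value per (Country, Year)."""
    totals = frame.groupby(["Country", "Year"], as_index=False)["Value"].sum()
    return totals.rename(columns={"Value": new_name})


# An inner join keeps only the (Country, Year) pairs present in both frames
merged = yearly_total(theft_toy, "theft_count").merge(
    yearly_total(homicide_toy, "homicide_count"),
    on=["Country", "Year"],
    how="inner",
)
print(merged)
```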


## Drug trafficking

In [272]:
import os.path

file = "data_cts_prisons_and_prisoners.csv"
if not os.path.exists(file):
    # Without the file the DataFrame is not created, so the following cells will fail
    print("Missing dataset")
else:
    drugsCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [273]:
drugsCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70790 entries, 0 to 70789
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            70790 non-null  object 
 1   Country              70790 non-null  object 
 2   Region               70790 non-null  object 
 3   Subregion            70790 non-null  object 
 4   Indicator            70790 non-null  object 
 5   Dimension            70790 non-null  object 
 6   Category             70790 non-null  object 
 7   Sex                  70790 non-null  object 
 8   Age                  70790 non-null  object 
 9   Year                 70790 non-null  int64  
 10  Unit of measurement  70790 non-null  object 
 11  VALUE                70790 non-null  float64
 12  Source               70790 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 51.0 MB


In [274]:
drugsCrime.shape

(70790, 13)

In [275]:
drugsCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,BRB,Barbados,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,0.0,CTS
1,CRI,Costa Rica,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,173.0,CTS
2,DMA,Dominica,Americas,Latin America and the Caribbean,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,65.0,CTS
3,GBR_NI,United Kingdom (Northern Ireland),Europe,Northern Europe,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,61.0,CTS
4,ITA,Italy,Europe,Southern Europe,Persons entering prison,by selected crime,Intentional Homicide,Total,Total,2016,Counts,1443.0,CTS


We filter the dataframe, keeping only the rows whose `Unit of measurement` is `Counts` rather than `Rate per 100,000 population`.

In [276]:
drugsCrime = drugsCrime[drugsCrime['Unit of measurement'] != "Rate per 100,000 population"]
drugsCrime.shape

(44997, 13)

The features `Iso3_code`, `Indicator`, `Dimension`, `Sex`, `Age`, `Unit of measurement` and `Source` are not useful for our model, so we remove them. We also rename the feature `VALUE` to `Value` and remove any rows that contain null values.

In [277]:
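drugsCrime = drugsCrime.drop(['Iso3_code', 'Indicator', 'Dimension', 'Sex', 'Age', 'Unit of measurement', 'Source'], axis=1)
drugsCrime = drugsCrime.rename(columns={'VALUE': 'Value'})
drugsCrime = drugsCrime.dropna()  # dropna() returns a new DataFrame, so keep the result
drugsCrime  # display the cleaned frame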

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
0,Barbados,Americas,Latin America and the Caribbean,Intentional Homicide,2016,0.0
1,Costa Rica,Americas,Latin America and the Caribbean,Intentional Homicide,2016,173.0
2,Dominica,Americas,Latin America and the Caribbean,Intentional Homicide,2016,65.0
3,United Kingdom (Northern Ireland),Europe,Northern Europe,Intentional Homicide,2016,61.0
4,Italy,Europe,Southern Europe,Intentional Homicide,2016,1443.0
...,...,...,...,...,...,...
44992,Zimbabwe,Africa,Sub-Saharan Africa,Total,2008,487.0
44993,Zimbabwe,Africa,Sub-Saharan Africa,Total,2017,475.0
44994,Zimbabwe,Africa,Sub-Saharan Africa,Total,2018,353.0
44995,Zimbabwe,Africa,Sub-Saharan Africa,Total,2019,450.0


We see that the number of rows has not changed after calling `dropna()`, which means that there are no rows with null values.

Next, we convert the features that need it from `object` to `string`.

In [278]:
drugsCrime["Country"] = drugsCrime["Country"].astype(pd.StringDtype())
drugsCrime["Region"] = drugsCrime["Region"].astype(pd.StringDtype())
drugsCrime["Subregion"] = drugsCrime["Subregion"].astype(pd.StringDtype())
drugsCrime["Category"] = drugsCrime["Category"].astype(pd.StringDtype())
drugsCrime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44997 entries, 0 to 44996
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    44997 non-null  string 
 1   Region     44997 non-null  string 
 2   Subregion  44997 non-null  string 
 3   Category   44997 non-null  string 
 4   Year       44997 non-null  int64  
 5   Value      44997 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 2.4 MB


Since we are analysing crimes related to drug trafficking and drug use, we need to filter the dataset further so that it only contains the crimes of interest. In particular, below we check which categories concern drugs.

In [279]:
drug_categories = drugsCrime[drugsCrime['Category'].str.contains('Drug', case=False, na=False)]['Category'].unique()
print(drug_categories)

<StringArray>
['Drug Possession', 'Drug Trafficking', 'Drug possession', 'Drug trafficking']
Length: 4, dtype: string


In [280]:
drugsCrime = drugsCrime[drugsCrime['Category'].isin(['Drug Trafficking', 'Drug trafficking'])]
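The category labels differ only in capitalisation ('Drug Trafficking' and 'Drug trafficking'). Instead of listing every spelling, the comparison can be made on a lower-cased copy of the column. A small sketch on a standalone Series that reuses the labels printed above, plus one unrelated category:

```python
import pandas as pd

categories = pd.Series(
    ["Drug Possession", "Drug Trafficking", "Drug possession",
     "Drug trafficking", "Intentional Homicide"],
    dtype="string",
)

# Compare on a lower-cased copy so that both spellings match
is_trafficking = categories.str.lower() == "drug trafficking"
trafficking_only = categories[is_trafficking]
print(trafficking_only.tolist())
```

This keeps the original labels in the data while making the filter robust to further capitalisation variants.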

In [281]:
drugsCrime.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,Value
596,Barbados,Americas,Latin America and the Caribbean,Drug Trafficking,2016,3.0
597,Costa Rica,Americas,Latin America and the Caribbean,Drug Trafficking,2016,203.0
598,Dominica,Americas,Latin America and the Caribbean,Drug Trafficking,2016,0.0
599,United Kingdom (Northern Ireland),Europe,Northern Europe,Drug Trafficking,2016,3047.0
600,Honduras,Americas,Latin America and the Caribbean,Drug Trafficking,2016,87.0


Having applied this further filter, we can also drop the `Category` column.

In [282]:
drugsCrime = drugsCrime.drop(['Category'], axis=1)

In [283]:
drugsCrime.shape

(682, 5)

In [284]:
drugsCrime.describe()

Unnamed: 0,Year,Value
count,682.0,682.0
mean,2017.567449,8160.313783
std,3.335696,27153.150717
min,2010.0,0.0
25%,2015.0,80.0
50%,2018.0,687.0
75%,2020.0,3066.5
max,2022.0,240113.0


In [285]:
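# Note: this counts the countries in homicideCrime; for this dataset the column would be drugsCrime['Country']
print(f"Total number of countries: {homicideCrime['Country'].unique().size}")

Total number of countries: 226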


## Sexual violence and kidnapping

In [257]:
import os.path

file = "data_cts_violent_and_sexual_crime.csv"
if not os.path.exists(file):
    # Without the file the DataFrame is not created, so the following cells will fail
    print("Missing dataset")
else:
    violent_sexualCrime = pd.read_csv(file, index_col=False, encoding='latin1')

In [258]:
violent_sexualCrime.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26114 entries, 0 to 26113
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Iso3_code            26114 non-null  object 
 1   Country              26114 non-null  object 
 2   Region               26114 non-null  object 
 3   Subregion            26114 non-null  object 
 4   Indicator            26114 non-null  object 
 5   Dimension            26114 non-null  object 
 6   Category             26114 non-null  object 
 7   Sex                  26114 non-null  object 
 8   Age                  26114 non-null  object 
 9   Year                 26114 non-null  int64  
 10  Unit of measurement  26114 non-null  object 
 11  VALUE                26114 non-null  float64
 12  Source               26114 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 19.3 MB


In [259]:
violent_sexualCrime.shape

(26114, 13)

In [260]:
violent_sexualCrime.head()

Unnamed: 0,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,AZE,Azerbaijan,Asia,Western Asia,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,155.0,CTS
1,BEL,Belgium,Europe,Western Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,61959.0,CTS
2,BGR,Bulgaria,Europe,Eastern Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,3806.0,CTS
3,BHR,Bahrain,Asia,Western Asia,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,2701.0,CTS
4,BLR,Belarus,Europe,Eastern Europe,Violent offences,by type of offence,Serious assault,Total,Total,2003,Counts,4032.0,CTS


We filter the dataframe, keeping only the rows whose `Unit of measurement` is `Counts` rather than `Rate per 100,000 population`.

In [261]:
violent_sexualCrime = violent_sexualCrime[violent_sexualCrime['Unit of measurement'] != "Rate per 100,000 population"]
violent_sexualCrime.shape

(13073, 13)

The features `Iso3_code`, `Indicator`, `Dimension`, `Sex`, `Age`, `Unit of measurement` and `Source` are not useful for our model, so we remove them.
We also remove any rows that contain null values.

In [262]:
violent_sexualCrime = violent_sexualCrime.drop(['Iso3_code', 'Indicator', 'Dimension', 'Sex', 'Age', 'Unit of measurement', 'Source'], axis=1)
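violent_sexualCrime = violent_sexualCrime.dropna()  # dropna() returns a new DataFrame, so keep the result
violent_sexualCrime  # display the cleaned frame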

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
0,Azerbaijan,Asia,Western Asia,Serious assault,2003,155.0
1,Belgium,Europe,Western Europe,Serious assault,2003,61959.0
2,Bulgaria,Europe,Eastern Europe,Serious assault,2003,3806.0
3,Bahrain,Asia,Western Asia,Serious assault,2003,2701.0
4,Belarus,Europe,Eastern Europe,Serious assault,2003,4032.0
...,...,...,...,...,...,...
13068,Montenegro,Europe,Southern Europe,Acts intended to induce fear or emotional dist...,2021,10.0
13069,Mauritius,Africa,Sub-Saharan Africa,Acts intended to induce fear or emotional dist...,2021,342.0
13070,El Salvador,Americas,Latin America and the Caribbean,Acts intended to induce fear or emotional dist...,2021,4.0
13071,Serbia,Europe,Southern Europe,Acts intended to induce fear or emotional dist...,2021,1.0


We see that the number of rows has not changed after calling `dropna()`, which means that there are no rows with null values.

Next, we convert the features that need it from `object` to `string`.

In [263]:
violent_sexualCrime["Country"] = violent_sexualCrime["Country"].astype(pd.StringDtype())
violent_sexualCrime["Region"] = violent_sexualCrime["Region"].astype(pd.StringDtype())
violent_sexualCrime["Subregion"] = violent_sexualCrime["Subregion"].astype(pd.StringDtype())
violent_sexualCrime["Category"] = violent_sexualCrime["Category"].astype(pd.StringDtype())
violent_sexualCrime.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13073 entries, 0 to 13072
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    13073 non-null  string 
 1   Region     13073 non-null  string 
 2   Subregion  13073 non-null  string 
 3   Category   13073 non-null  string 
 4   Year       13073 non-null  int64  
 5   VALUE      13073 non-null  float64
dtypes: float64(1), int64(1), string(4)
memory usage: 714.9 KB


Since we need to analyse both sexual violence and kidnapping, we create two copies of the dataframe so that the two cases can be analysed separately.

We also rename the feature `VALUE` to `sexual_violence_count` or `kidnapping_count`, depending on the phenomenon being analysed.

In [264]:
sexual_violence = violent_sexualCrime.copy()
sexual_violence = sexual_violence[sexual_violence['Category'].isin(['Sexual violence'])]
sexual_violence.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
3796,Belgium,Europe,Western Europe,Sexual violence,2003,5565.0
3797,Bulgaria,Europe,Eastern Europe,Sexual violence,2003,1287.0
3798,Canada,Americas,Northern America,Sexual violence,2003,26128.0
3799,Czechia,Europe,Eastern Europe,Sexual violence,2003,1898.0
3800,Germany,Europe,Western Europe,Sexual violence,2003,54632.0


In [265]:
kidnapping = violent_sexualCrime.copy()
kidnapping = kidnapping[kidnapping['Category'].isin(['Kidnapping'])]
kidnapping.head()

Unnamed: 0,Country,Region,Subregion,Category,Year,VALUE
1977,United Arab Emirates,Asia,Western Asia,Kidnapping,2003,434.0
1978,Azerbaijan,Asia,Western Asia,Kidnapping,2003,28.0
1979,Belgium,Europe,Western Europe,Kidnapping,2003,1003.0
1980,Bulgaria,Europe,Eastern Europe,Kidnapping,2003,209.0
1981,Bahrain,Asia,Western Asia,Kidnapping,2003,5.0


Having applied this further filter, we can also drop the `Category` column.

In [266]:
sexual_violence = sexual_violence.drop(['Category'], axis=1)
sexual_violence.rename(columns={'VALUE':'sexual_violence_count'}, inplace=True)
sexual_violence.shape

(1768, 5)

In [267]:
sexual_violence.describe()

Unnamed: 0,Year,sexual_violence_count
count,1768.0,1768.0
mean,2013.1431,6434.749434
std,5.244384,15363.198967
min,2003.0,0.0
25%,2009.0,238.0
50%,2013.0,1362.5
75%,2018.0,5163.25
max,2022.0,193566.0


In [268]:
kidnapping = kidnapping.drop(['Category'], axis=1)
kidnapping.rename(columns={'VALUE':'kidnapping_count'}, inplace=True)
kidnapping.shape

(1819, 5)
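The filter, drop and rename steps are the same for both phenomena, so they could be collected in one function. The sketch below uses a hypothetical helper, `extract_category`, on a toy frame whose values are taken from the previews above:

```python
import pandas as pd


def extract_category(frame, category, value_name):
    """Hypothetical helper: rows of one category, with VALUE renamed."""
    subset = frame[frame["Category"] == category]
    # drop() and rename() return new frames, so `frame` is not modified
    return subset.drop(columns="Category").rename(columns={"VALUE": value_name})


toy = pd.DataFrame({
    "Country": ["Belgium", "Belgium", "Bahrain"],
    "Category": ["Sexual violence", "Kidnapping", "Kidnapping"],
    "Year": [2003, 2003, 2003],
    "VALUE": [5565.0, 1003.0, 5.0],
})

kidnapping_toy = extract_category(toy, "Kidnapping", "kidnapping_count")
print(kidnapping_toy)
```

Because the selection and the later steps all produce new objects, no explicit `copy()` of the source frame is needed in this version.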

In [269]:
kidnapping.describe()

Unnamed: 0,Year,kidnapping_count
count,1819.0,1819.0
mean,2012.722375,792.048679
std,5.438161,4022.97388
min,2003.0,0.0
25%,2008.0,5.0
50%,2013.0,34.0
75%,2017.0,265.0
max,2022.0,65461.0


We see that the dataset as a whole spans the years 2003 to 2022 (the minimum and maximum of `Year`); individual countries do not necessarily cover the whole range.
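The `describe()` output only gives the overall minimum and maximum of `Year`. To see the coverage of each country, the years can be grouped by `Country`, as in this sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Belgium", "Belgium", "Belgium", "Bahrain", "Bahrain"],
    "Year": [2003, 2010, 2022, 2005, 2018],
})

# First year, last year and number of distinct years for each country
coverage = toy.groupby("Country")["Year"].agg(
    first_year="min", last_year="max", years_available="nunique"
)
print(coverage)
```

Countries with a low `years_available` value are the ones for which later modelling would have the least data.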

In [271]:
print(f"Total number of countries in the sexual_violence dataframe: {sexual_violence['Country'].unique().size}")

Total number of countries in the sexual_violence dataframe: 140


In [270]:
print(f"Total number of countries in the kidnapping dataframe: {kidnapping['Country'].unique().size}")

Total number of countries in the kidnapping dataframe: 147
