In [94]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

### About the Eurostat data
To interpret and work with the Eurostat crime data correctly, first study the relevant [documentation](https://ec.europa.eu/eurostat/cache/metadata/en/crim_sims.htm); a summary, including the data-processing methodology, is given in [README.md](/README.rd).

Import and instantiation of the custom class, which handles automatic cleaning, transformation, and statistical recalculation of the data.

In [95]:
from eurostatlib.crimetable import EurostatCrimeTable

crime_table = EurostatCrimeTable()

# forward slashes keep the relative paths portable across operating systems
geo_df = pd.read_csv('data/geo.csv')
iccs_df = pd.read_csv('data/iccs.csv')

crime_table.load_data('data/estat_crim_off_cat.tsv', geo_df, iccs_df)
crime_table.create_summary_df_1all()
df = crime_table.country_crime_info_11

To compare figures from the summary table with the base values of the not-yet-recalculated df, call crime_table.filter_data('<country_name>', '<crime_name>') and then work with the filtered data available under crime_table.filtered_data (storing it in a separate variable is recommended).

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 861 entries, 0 to 860
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   country                    861 non-null    object 
 1   crime                      861 non-null    object 
 2   crime_category             861 non-null    object 
 3   count_years                861 non-null    int64  
 4   count_fill_values          861 non-null    int64  
 5   first_fill_year            755 non-null    object 
 6   last_fill_year             755 non-null    object 
 7   mean_value                 755 non-null    float64
 8   median_value               755 non-null    float64
 9   max_value                  755 non-null    float64
 10  max_value_year             755 non-null    object 
 11  min_value                  769 non-null    float64
 12  min_value_year             769 non-null    object 
 13  standard_deviation         755 non-null    float64

The summary table holds the most important computed and collected figures, which can be processed further and filtered by condition. It is best suited to monitoring suspicious values; these data are not intended for visualization. In the interactive Jupyter notebook and the dash/plotly app, an evaluation text is generated automatically from the computed values. The examples below show how the data can be worked with and what can be read from them.

In [97]:
df.head(2)

Unnamed: 0,country,crime,crime_category,count_years,count_fill_values,first_fill_year,last_fill_year,mean_value,median_value,max_value,...,min_value,min_value_year,standard_deviation,quality_range_fill_data,quality_range_unfill_data,missing_values_info,trend,relative_trend_strength,min_range_year,max_range_year
0,Albania,Acts against computer systems,hidden,15,7,2016,2022,3.89,2.93,7.14,...,2.33,2018,1.77,7,0,The time series has no missing values within t...,increasing,0.68,2008.0,2022.0
1,Albania,Attempted intentional homicide,visible,15,12,2008,2022,5.05,4.525,7.63,...,3.25,2019,1.67,15,3,The time series has 3 missing value(s) within ...,to missing value(s),,2008.0,2022.0


In [98]:
# a zero in count_years means that no record exists for the given crime, i.e. it has no time period at all
no_record_crime = df[df['count_years'] == 0] 

# for each crime, count how many countries publish no data on it; countries that often appear here include Bosnia and Herzegovina and England and Wales
country_no_record_crime = no_record_crime[['country', 'crime']].value_counts().reset_index()[['country', 'crime']]
country_no_record_crime['crime'].value_counts() # crimes that are not recorded, with the number of countries concerned

crime
Acts against computer systems                             13
Participation in an organized criminal group              13
Child pornography                                          9
Sexual exploitation                                        9
Bribery                                                    8
Money laundering                                           7
Corruption                                                 5
Fraud                                                      5
Kidnapping                                                 4
Attempted intentional homicide                             3
Burglary of private residential premises                   2
Sexual violence                                            2
Burglary                                                   2
Rape                                                       2
Sexual assault                                             1
Serious assault                                            1
Unlawful acts invo
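
The same zero-record rows can be counted from either side: per crime (as in the cell above) or per country. A minimal sketch on an invented toy frame follows; the country labels and rows are made up for illustration and are not Eurostat values, and only the column names `country`, `crime`, and `count_years` come from the summary table.

```python
import pandas as pd

# Toy stand-in for the summary table (invented rows, not Eurostat values).
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "crime": ["Fraud", "Bribery", "Rape", "Fraud", "Rape", "Fraud"],
    "count_years": [0, 0, 15, 0, 15, 15],
})

# Rows where the country reports nothing at all for the crime.
no_record = toy[toy["count_years"] == 0]

# Number of crimes with no record, per country.
crimes_missing_per_country = no_record["country"].value_counts()

# Number of countries with no record, per crime.
countries_missing_per_crime = no_record["crime"].value_counts()

print(crimes_missing_per_country)
print(countries_missing_per_crime)
```

Because each country-crime pair appears once in the summary table, a plain `value_counts()` on one column is enough; no prior deduplication is needed under that assumption.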

Low-quality data can be defined as time series that either have missing values within the recorded period or are too short for a trend to be derived, or rather confirmed. Note, however, that the base dataframe already contains trends computed from as few as 2 filled values.

In [114]:
# keep rows where at least one value is missing or the time series has 4 or fewer filled values
no_quality_data = df[(df['quality_range_unfill_data'] != 0) | (df['count_fill_values'] <= 4)]
no_quality_data.head()

Unnamed: 0,country,crime,crime_category,count_years,count_fill_values,first_fill_year,last_fill_year,mean_value,median_value,max_value,...,min_value,min_value_year,standard_deviation,quality_range_fill_data,quality_range_unfill_data,missing_values_info,trend,relative_trend_strength,min_range_year,max_range_year
1,Albania,Attempted intentional homicide,visible,15,12,2008,2022,5.05,4.525,7.63,...,3.25,2019,1.67,15,3,The time series has 3 missing value(s) within ...,to missing value(s),,2008.0,2022.0
3,Albania,Burglary,visible,15,8,2008,2022,11.34,8.67,26.36,...,4.54,2008,8.19,15,7,The time series has 7 missing value(s) within ...,to missing value(s),,2008.0,2022.0
4,Albania,Burglary of private residential premises,visible,15,12,2008,2022,35.79,37.185,57.21,...,21.51,2022,11.19,15,3,The time series has 3 missing value(s) within ...,to missing value(s),,2008.0,2022.0
8,Albania,Intentional homicide,visible,15,12,2008,2022,2.38,2.095,4.38,...,1.5,2022,0.82,15,3,The time series has 3 missing value(s) within ...,to missing value(s),,2008.0,2022.0
9,Albania,Kidnapping,visible,15,12,2008,2022,0.16,0.14,0.32,...,0.07,2011,0.08,15,3,The time series has 3 missing value(s) within ...,to missing value(s),,2008.0,2022.0


In [115]:
# most often only a few values are missing (data for 3, 1, or 4 years) within a longer reporting period
no_quality_data[['first_fill_year', 'last_fill_year', 'quality_range_unfill_data']].value_counts().reset_index().head() 

Unnamed: 0,first_fill_year,last_fill_year,quality_range_unfill_data,count
0,2008,2022,3,14
1,2008,2022,1,13
2,2008,2020,4,11
3,2019,2022,0,5
4,2009,2022,5,4


To work only with the better-filled data, without missing values, we could proceed as follows.

In [None]:
# i.e. no missing values within the recorded period and a time series of at least 5 filled years
# 861 (country-crime records) - 671 = 190 records dropped
quality_data = df[(df['quality_range_unfill_data'] == 0) & (df['count_fill_values'] > 4)] # 671 records
quality_data.head()

Unnamed: 0,country,crime,crime_category,count_years,count_fill_values,first_fill_year,last_fill_year,mean_value,median_value,max_value,...,min_value,min_value_year,standard_deviation,quality_range_fill_data,quality_range_unfill_data,missing_values_info,trend,relative_trend_strength,min_range_year,max_range_year
0,Albania,Acts against computer systems,hidden,15,7,2016,2022,3.89,2.93,7.14,...,2.33,2018,1.77,7,0,The time series has no missing values within t...,increasing,0.68,2008.0,2022.0
2,Albania,Bribery,hidden,15,7,2016,2022,10.29,9.55,17.68,...,5.66,2020,3.64,7,0,The time series has no missing values within t...,increasing,0.72,2008.0,2022.0
5,Albania,Child pornography,sensitive,15,6,2016,2021,0.88,0.17,4.38,...,0.1,2018,1.71,6,0,The time series has no missing values within t...,increasing,0.98,2008.0,2022.0
6,Albania,Corruption,hidden,15,7,2016,2022,37.86,35.99,46.46,...,29.76,2020,5.51,7,0,The time series has no missing values within t...,increasing,0.57,2008.0,2022.0
7,Albania,Fraud,hidden,15,7,2016,2022,32.25,32.55,35.39,...,28.04,2020,2.49,7,0,The time series has no missing values within t...,increasing,0.58,2008.0,2022.0
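
The low-quality and high-quality filters are logical complements of each other, so every row lands in exactly one of the two subsets. The sketch below checks this on an invented toy frame (the values are made up; only the two column names come from the summary table). It also includes a NaN in `quality_range_unfill_data`, which is what rows with `count_years == 0` appear to carry, to show which side such rows fall on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns used by the filters (invented values).
toy = pd.DataFrame({
    "quality_range_unfill_data": [0, 3, 0, np.nan, 0],
    "count_fill_values": [7, 12, 4, 0, 5],
})

# Same conditions as in the notebook cells.
low_mask = (toy["quality_range_unfill_data"] != 0) | (toy["count_fill_values"] <= 4)
high_mask = (toy["quality_range_unfill_data"] == 0) & (toy["count_fill_values"] > 4)

# NaN != 0 evaluates to True and NaN == 0 to False,
# so rows without any data end up in the low-quality subset.
is_partition = bool((low_mask ^ high_mask).all())

no_quality = toy[low_mask]
quality = toy[high_mask]
print(is_partition, len(quality), len(no_quality))
```

Given this, `df[~low_mask]` would select the same rows as the explicit high-quality condition, which avoids maintaining two thresholds by hand.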


Summary table for a specific country.

In [116]:
# A few selected columns are enough to see how an individual country fares in terms of crime.
# How an increase or decrease should be interpreted depends heavily on the category of the criminal offence!
trend_columns = ['country', 'crime', 'crime_category', 'count_years', 'quality_range_fill_data', 'quality_range_unfill_data', 'trend', 'relative_trend_strength']
switz_summ = df[df['country'] == 'Switzerland']
switz_summ = switz_summ[trend_columns].sort_values(by='crime_category')
switz_summ

Unnamed: 0,country,crime,crime_category,count_years,quality_range_fill_data,quality_range_unfill_data,trend,relative_trend_strength
819,Switzerland,Acts against computer systems,hidden,0,,,,
830,Switzerland,Participation in an organized criminal group,hidden,15,7.0,0.0,decreasing,0.66
826,Switzerland,Fraud,hidden,15,7.0,0.0,increasing,0.82
825,Switzerland,Corruption,hidden,15,7.0,0.0,decreasing,0.99
829,Switzerland,Money laundering,hidden,15,7.0,0.0,increasing,1.0
821,Switzerland,Bribery,hidden,15,7.0,0.0,increasing,0.76
839,Switzerland,Unlawful acts involving controlled drugs or pr...,hidden,15,15.0,0.0,increasing,0.5
824,Switzerland,Child pornography,sensitive,15,2.0,0.0,decreasing,1.0
831,Switzerland,Rape,sensitive,15,15.0,0.0,increasing,0.64
834,Switzerland,Sexual assault,sensitive,15,14.0,0.0,decreasing,0.69


In [None]:
# for a country, offences can be grouped by category and trend, with the trend strength averaged
switz_summ.groupby(['crime_category', 'trend'])['relative_trend_strength'].mean()

crime_category  trend              
hidden          decreasing             0.825000
                increasing             0.770000
sensitive       decreasing             0.740000
                increasing             0.785000
visible         decreasing             0.668333
                increasing             0.690000
                to missing value(s)         NaN
Name: relative_trend_strength, dtype: float64
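
A mean on its own hides how many series it is based on. As a sketch, the six `hidden` rows for Switzerland that have a trend in the table above can be re-aggregated with both `mean` and `count`; the row without any data (Acts against computer systems) is left out, and the frame name `hidden` is only illustrative.

```python
import pandas as pd

# 'hidden' rows for Switzerland, copied from the table shown above.
hidden = pd.DataFrame({
    "crime": [
        "Participation in an organized criminal group",
        "Fraud",
        "Corruption",
        "Money laundering",
        "Bribery",
        "Unlawful acts involving controlled drugs or pr...",
    ],
    "trend": ["decreasing", "increasing", "decreasing",
              "increasing", "increasing", "increasing"],
    "relative_trend_strength": [0.66, 0.82, 0.99, 1.0, 0.76, 0.5],
})

# Mean trend strength together with the number of series behind it.
summary = hidden.groupby("trend")["relative_trend_strength"].agg(["mean", "count"])
print(summary)
```

The means reproduce the `hidden` values in the output above (0.825 and 0.77), and the counts show that they rest on only 2 and 4 series respectively.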

Summary table for a specific criminal offence.

In [118]:
# Or use a few selected columns to see how an individual crime is trending across countries
crime_summ = df[df['crime'] == 'Intentional homicide']
crime_summ = crime_summ[trend_columns].sort_values(by='country')
crime_summ.head(10)

Unnamed: 0,country,crime,crime_category,count_years,quality_range_fill_data,quality_range_unfill_data,trend,relative_trend_strength
8,Albania,Intentional homicide,visible,15,15,3,to missing value(s),
29,Austria,Intentional homicide,visible,15,15,0,increasing,0.51
50,Belgium,Intentional homicide,visible,15,15,0,decreasing,0.59
71,Bosnia and Herzegovina,Intentional homicide,visible,15,5,0,decreasing,0.6
92,Bulgaria,Intentional homicide,visible,15,15,0,decreasing,0.69
113,Croatia,Intentional homicide,visible,15,15,0,decreasing,0.65
134,Cyprus,Intentional homicide,visible,15,15,0,decreasing,0.52
155,Czechia,Intentional homicide,visible,15,15,0,decreasing,0.58
176,Denmark,Intentional homicide,visible,15,15,0,increasing,0.51
197,England and Wales,Intentional homicide,visible,11,11,0,decreasing,0.52


In [119]:
# group by trend and compute the mean strength of each trend
crime_summ.groupby('trend')['relative_trend_strength'].mean()

trend
decreasing             0.636552
increasing             0.531667
to missing value(s)         NaN
Name: relative_trend_strength, dtype: float64

I warmly recommend trying out the dash app or the interactive Jupyter notebook.