In [6]:
import pandas as pd

# Crime ML by <Anette> Leslie PythonAi ITHS 2024

*What do I want to accomplish?*

Can we predict in which area a specific type of crime will increase, so that measures such as more police or more cameras can be put in place to stop crime in that area?


*My problems with the data:*

I downloaded statistics from BRÅ (https://statistik.bra.se/solwebb/action/index).
I picked the data I wanted to focus on: municipalities and city districts in the Stockholm area, and crimes against persons.


Here is an example of how the data looks:

Region;Brott;År;Period;Antal;/100 000 inv
Stockholm kommun;Brott mot brottsbalken;2022;Jan;11322;1153;
Stockholm kommun;Brott mot brottsbalken;2022;Feb;12199;1243;

I have a few problems to solve with this data:

1. The data contains Swedish letters (å, ä, ö) and is not stored as UTF-8, so I need to find the right encoding.
2. This looks like a regression problem, because I want to predict a number (how many crimes) rather than a category.
3. Which columns are most important to use for predicting the outcome?
4. I have 4 files, one per year. How do I combine them without copy-pasting?

*Conclusion*

Below I will work with my data and try to solve these problems.

# Fixing my data!!


In [7]:
df = pd.read_csv('./data/Databasfil2020.txt')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 13: invalid continuation byte

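OK, we have an encoding problem, so let's fix that! Byte 0xc5 at position 13 sits exactly where the Å of "År" should be in the header, so the file is not stored as UTF-8.

Let's check which encoding I need for Swedish.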

https://en.wikipedia.org/wiki/ISO/IEC_8859-1
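To see why ISO-8859-1 is a plausible fit, here is a small sketch that decodes the header line both ways. The byte string is an assumption, rebuilt from the header shown earlier and the byte reported in the error message. It is not read from the real file.

```python
# Assumed raw bytes of the header line; 0xC5 is how ISO-8859-1 stores "Å".
raw = b"Region;Brott;\xc5r;Period;Antal;/100 000 inv"

# The byte at position 13, the position named in the error message.
print(raw[13:14])

# ISO-8859-1 maps every single byte to one character, so this decodes cleanly.
print(raw.decode("ISO-8859-1"))

# UTF-8 reads 0xC5 as the start of a two-byte character and then rejects
# the "r" that follows it.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("UTF-8 fails:", err.reason)
```

Passing `encoding="ISO-8859-1"` to `pd.read_csv` applies the same decoding to the whole file.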



In [8]:
df = pd.read_csv('./data/Databasfil2020.txt', encoding="ISO-8859-1")
df.head()

Unnamed: 0,Region;Brott;År;Period;Antal;/100 000 inv
0,Botkyrka kommun;3-7 kap. Brott mot person;2020...
1,Botkyrka kommun;3-7 kap. Brott mot person;2020...
2,Botkyrka kommun;3-7 kap. Brott mot person;2020...
3,Botkyrka kommun;3-7 kap. Brott mot person;2020...
4,Botkyrka kommun;3-7 kap. Brott mot person;2020...


We have multiple files, so let's concatenate them.

In [9]:
import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv,  encoding="ISO-8859-1"), glob.glob("data/*.txt")))

In [10]:
df

Unnamed: 0,Region;Brott;År;Period;Antal;/100 000 inv
0,Botkyrka kommun;3-7 kap. Brott mot person;2021...
1,Botkyrka kommun;3-7 kap. Brott mot person;2021...
2,Botkyrka kommun;3-7 kap. Brott mot person;2021...
3,Botkyrka kommun;3-7 kap. Brott mot person;2021...
4,Botkyrka kommun;3-7 kap. Brott mot person;2021...
...,...
5539,Österåker kommun;7 kap. Brott mot familj;2023 ...
5540,Österåker kommun;7 kap. Brott mot familj;2023 ...
5541,Österåker kommun;7 kap. Brott mot familj;2023 ...
5542,Österåker kommun;7 kap. Brott mot familj;2023 ...


YEAH! That fixed the encoding problem.
But now I have a new problem: 22176 rows but just 1 column!

I need to split the data into proper columns. Let's use ; as the separator, because that is what separates the fields. Every data row also ends with an extra ;, so I pass index_col=False to stop pandas from using the first column as the index.

In [11]:
# Testing one file
#df = pd.read_csv('./data/Databasfil2020.txt', encoding="ISO-8859-1", sep=";", index_col=False )


df = pd.concat(map(functools.partial(pd.read_csv, encoding="ISO-8859-1", sep=";", index_col=False ), glob.glob("data/*.txt")))
df

Unnamed: 0,Region,Brott,År,Period,Antal,/100 000 inv
0,Botkyrka kommun,3-7 kap. Brott mot person,2021,Jan,241,253
1,Botkyrka kommun,3-7 kap. Brott mot person,2021,Feb,238,250
2,Botkyrka kommun,3-7 kap. Brott mot person,2021,Mar,196,206
3,Botkyrka kommun,3-7 kap. Brott mot person,2021,Apr,242,255
4,Botkyrka kommun,3-7 kap. Brott mot person,2021,Maj,254,267
...,...,...,...,...,...,...
5539,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Aug,0,0
5540,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Sep,0,0
5541,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Okt,0,0
5542,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Nov,0,0
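A minimal sketch of what `index_col=False` changes, using the two sample rows from the top of this notebook in place of the real files. The header has 6 fields, but each data row has a trailing `;`, which pandas reads as a 7th, empty field.

```python
import io
import pandas as pd

# Sample in the same shape as the BRÅ export: 6 header fields,
# but every data row ends with an extra ";".
sample = (
    "Region;Brott;År;Period;Antal;/100 000 inv\n"
    "Stockholm kommun;Brott mot brottsbalken;2022;Jan;11322;1153;\n"
    "Stockholm kommun;Brott mot brottsbalken;2022;Feb;12199;1243;\n"
)

# Default behaviour: one more data field than header names, so pandas
# uses the first column as the index and the values shift one header left.
shifted = pd.read_csv(io.StringIO(sample), sep=";")

# index_col=False: keep a plain 0..n-1 index and ignore the trailing field.
aligned = pd.read_csv(io.StringIO(sample), sep=";", index_col=False)

print(shifted.head())
print(aligned.head())
```

In `shifted` the region names end up in the index, while in `aligned` every value sits under its own header.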


Let's get to know our data.

In [12]:
# Which Regions do I have?
reg = df['Region'].drop_duplicates()
reg

0                                        Botkyrka kommun
168                                      Danderyd kommun
336                                       Haninge kommun
504                                      Huddinge kommun
672                                      Järfälla kommun
840                                       Knivsta kommun
1008                                      Lidingö kommun
1176                                        Nacka kommun
1344                                    Norrtälje kommun
1512                                        Salem kommun
1680                                      Sigtuna kommun
1848                                   Sollentuna kommun
2016                                        Solna kommun
2184                      Stadsdelsområde Bromma (Sthlm)
2352    Stadsdelsområde Enskede - Årsta - Vantör (Sthlm)
2520                      Stadsdelsområde Farsta (Sthlm)
2688            Stadsdelsområde Hägersten-Älvsjö (Sthlm)
2856        Stadsdelsområde Häs

In [13]:

# Which crime categories do I have?
crime = df['Brott'].drop_duplicates()
crime


0             3-7 kap. Brott mot person
24       3 kap. Brott mot liv och hälsa
60          därav misshandel inkl. grov
84     4 kap. Brott mot frihet och frid
108           5 kap. Ärekränkningsbrott
132                  6 kap. Sexualbrott
156             7 kap. Brott mot familj
Name: Brott, dtype: object

In [14]:

# Which years do I have?
year = df['År'].drop_duplicates()
year

0          2021
0          2020
0          2022
0    2023 prel.
Name: År, dtype: object
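The index label 0 appears four times in the output above because each file keeps its own row numbers after `pd.concat`. A small sketch of the difference `ignore_index=True` makes, using two tiny stand-in frames with made-up counts instead of the real files:

```python
import pandas as pd

# Two tiny stand-ins for per-year files (counts are made up for illustration).
part_a = pd.DataFrame({"År": ["2020", "2020"], "Antal": [1, 2]})
part_b = pd.DataFrame({"År": ["2021", "2021"], "Antal": [3, 4]})

# Default: every part keeps its own row labels, so labels repeat.
kept = pd.concat([part_a, part_b])

# ignore_index=True: rows are renumbered 0..n-1 across all parts.
fresh = pd.concat([part_a, part_b], ignore_index=True)

print(kept.index.tolist())
print(fresh.index.tolist())
```

With the real files this would mean adding `ignore_index=True` to the existing `pd.concat(...)` call. The years also come out as 2021, 2020, 2022, 2023 because `glob.glob` returns files in arbitrary order. Wrapping it in `sorted(...)` would make the order predictable.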

In [15]:
# Which periods (months) do I have?
month = df['Period'].drop_duplicates()
month

0     Jan
1     Feb
2     Mar
3     Apr
4     Maj
5     Jun
6     Jul
7     Aug
8     Sep
9     Okt
10    Nov
11    Dec
Name: Period, dtype: object

In [17]:
# Number of crimes (Antal is a count, not a rate)
rate = df['Antal']
rate

0       241
1       238
2       196
3       242
4       254
       ... 
5539      0
5540      0
5541      0
5542      0
5543      1
Name: Antal, Length: 22176, dtype: int64

In [18]:
# Crimes per 100 000 inhabitants
cap = df['/100 000 inv']
cap

0       253
1       250
2       206
3       255
4       267
       ... 
5539      0
5540      0
5541      0
5542      0
5543      2
Name: /100 000 inv, Length: 22176, dtype: int64

In [19]:
# What data types do I have?
df.dtypes

Region          object
Brott           object
År              object
Period          object
Antal            int64
/100 000 inv     int64
dtype: object
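År is an object column because of the value "2023 prel.", and Period holds month abbreviations. One option, which is not what this notebook does next, is to turn both into plain integers. The sketch below uses a small stand-in frame with made-up counts. The new column names `year` and `month` are hypothetical.

```python
import pandas as pd

# Small stand-in for the real DataFrame (same column names, made-up counts).
demo = pd.DataFrame({
    "År": ["2022", "2023 prel.", "2023 prel."],
    "Period": ["Dec", "Jan", "Maj"],
    "Antal": [5, 7, 9],
})

# Keep the four leading digits of the year and convert them to an integer.
demo["year"] = (
    demo["År"].astype(str).str.extract(r"(\d{4})", expand=False).astype(int)
)

# Swedish month abbreviations as they appear in the Period column.
months = ["Jan", "Feb", "Mar", "Apr", "Maj", "Jun",
          "Jul", "Aug", "Sep", "Okt", "Nov", "Dec"]
month_number = {name: i for i, name in enumerate(months, start=1)}
demo["month"] = demo["Period"].map(month_number)

print(demo[["year", "month", "Antal"]])
```

Integer years and months keep their natural order, which one-hot columns do not, so this may suit a trend over time better.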

I have a problem with my data types: four columns are of type object (text) and need to be converted into values that a model can work with.
Let's do some one-hot (boolean) encoding with the help of get_dummies.

In [20]:
encoded_reg = pd.get_dummies(df, columns=['Region'])
encoded_brott = pd.get_dummies(df, columns=['Brott'])
encoded_year = pd.get_dummies(df, columns=['År'])
encoded_per = pd.get_dummies(df, columns=['Period'])
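The four calls above each encode one column and leave the other three as text. If the goal is a single table with no object columns, a sketch like the following may be closer, shown on a small stand-in frame with made-up counts:

```python
import pandas as pd

# Small stand-in for the real DataFrame (made-up counts).
demo = pd.DataFrame({
    "Region": ["Botkyrka kommun", "Danderyd kommun"],
    "Brott": ["6 kap. Sexualbrott", "7 kap. Brott mot familj"],
    "År": ["2021", "2021"],
    "Period": ["Jan", "Feb"],
    "Antal": [3, 1],
})

# Encode all four text columns in one call; Antal is left untouched.
encoded = pd.get_dummies(demo, columns=["Region", "Brott", "År", "Period"])

print(encoded.dtypes)
```

Every remaining column is then numeric or boolean, so the frame can be handed to a model as it is.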


In [21]:
# Let's check the result
encoded_reg.info()
print()
encoded_brott.info(verbose = False, memory_usage = 'deep')
print()

encoded_year.info(verbose = False, memory_usage = 'deep')
print()
encoded_per.info(verbose = False, memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
Index: 22176 entries, 0 to 5543
Data columns (total 38 columns):
 #   Column                                                   Non-Null Count  Dtype 
---  ------                                                   --------------  ----- 
 0   Brott                                                    22176 non-null  object
 1   År                                                       22176 non-null  object
 2   Period                                                   22176 non-null  object
 3   Antal                                                    22176 non-null  int64 
 4   /100 000 inv                                             22176 non-null  int64 
 5   Region_Botkyrka kommun                                   22176 non-null  bool  
 6   Region_Danderyd kommun                                   22176 non-null  bool  
 7   Region_Haninge kommun                                    22176 non-null  bool  
 8   Region_Huddinge kommun                    

# Last Info

The target I will use for ML is Antal (the number of crimes), but I only have 4 years of data and I really don't know if that's enough. I guess I will see after setting up my ML model.

If I wanted to use the per-100 000-inhabitants column, I would probably need a list of how many people live in each municipality. Maybe that could be a later project.
So let's drop that column, since we don't need it.
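As a side note, a rough population figure can be backed out of the two numeric columns already in the data. This assumes the column is computed as crimes divided by population times 100 000, which is an assumption about how BRÅ derives it. The numbers below are the first Botkyrka row (January 2021) from the table above.

```python
# First Botkyrka row: 241 crimes, 253 crimes per 100 000 inhabitants.
antal = 241
per_100k = 253

# rate = antal / population * 100 000  =>  population = antal / rate * 100 000
estimated_population = antal / per_100k * 100_000

print(round(estimated_population))
```

The result is only approximate, because the rate is rounded to a whole number, and rows where the rate is 0 cannot be used this way.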


In [24]:
# drop() returns a new DataFrame, so assign it back to actually remove the column
df = df.drop(columns=['/100 000 inv'])
df


Unnamed: 0,Region,Brott,År,Period,Antal
0,Botkyrka kommun,3-7 kap. Brott mot person,2021,Jan,241
1,Botkyrka kommun,3-7 kap. Brott mot person,2021,Feb,238
2,Botkyrka kommun,3-7 kap. Brott mot person,2021,Mar,196
3,Botkyrka kommun,3-7 kap. Brott mot person,2021,Apr,242
4,Botkyrka kommun,3-7 kap. Brott mot person,2021,Maj,254
...,...,...,...,...,...
5539,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Aug,0
5540,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Sep,0
5541,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Okt,0
5542,Österåker kommun,7 kap. Brott mot familj,2023 prel.,Nov,0
