## Project Description, Resources & Checklist

1. Is it a supervised or an unsupervised machine learning project?
2. Is it a classification or a regression task?
3. Identify the target feature, or the features to be clustered.
4. What are the available solutions to the problem?
5. How do I intend to measure the performance of my model?
6. How will my solution be deployed and utilized?

## Data loading

In [1]:
library(dplyr)
library(pastecs)
library(ggplot2)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

"package 'pastecs' was built under R version 3.3.3"Loading required package: boot

Attaching package: 'pastecs'

The following objects are masked from 'package:dplyr':

    first, last



In [2]:
## Dataset A - Crime reported and police station
dataset_A <- read.csv("Dataset_A.csv")

# structure of dataset
str(dataset_A)

'data.frame':	30861 obs. of  4 variables:
 $ Province        : Factor w/ 9 levels "Eastern Cape",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Police_Station  : Factor w/ 1143 levels "Aberdeen","Acornhoek",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Crime_Category  : Factor w/ 27 levels "All theft not mentioned elsewhere",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Period_2015_2016: int  51 4 87 5 0 15 74 0 8 48 ...


In [3]:
## Reshaping the data frame to wide format:
## one row per police station, one column per crime category
library(tidyr)
dataset_A_wide <- spread(dataset_A, key = Crime_Category, value = Period_2015_2016)
head(dataset_A_wide, n=5)


Attaching package: 'tidyr'

The following object is masked from 'package:pastecs':

    extract



Province,Police_Station,All theft not mentioned elsewhere,Arson,Assault with the intent to inflict grievous bodily harm,Attempted murder,Bank robbery,Burglary at non-residential premises,Burglary at residential premises,Carjacking,...,Robbery at residential premises,Robbery of cash in transit,Robbery with aggravating circumstances,Sexual Offences,Sexual offences as result of police action,Shoplifting,Stock-theft,Theft of motor vehicle and motorcycle,Theft out of or from motor vehicle,Truck hijacking
Eastern Cape,Aberdeen,51,4,87,5,0,15,74,0,...,2,0,8,14,0,0,20,2,7,0
Eastern Cape,Addo,97,2,150,25,0,87,144,0,...,12,1,41,55,0,0,21,8,11,0
Eastern Cape,Adelaide,47,2,75,0,0,22,85,0,...,2,0,12,18,0,7,22,4,12,0
Eastern Cape,Afsondering,11,1,54,5,0,7,29,0,...,6,0,13,28,0,0,97,0,6,0
Eastern Cape,Alexandria,76,0,86,17,0,27,116,2,...,6,0,36,41,0,5,35,6,13,3
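
The long-to-wide reshape can also be written with `pivot_wider()`, which newer tidyr releases recommend over `spread()`. The following is a minimal sketch on a toy data frame, assuming tidyr 1.0.0 or later is installed. The name `crime_long` is hypothetical and only a few values from the output above are reused.

```r
library(tidyr)

# Toy long-format data mirroring the columns of dataset_A (hypothetical subset)
crime_long <- data.frame(
  Police_Station   = c("Aberdeen", "Aberdeen", "Addo", "Addo"),
  Crime_Category   = c("Arson", "Attempted murder", "Arson", "Attempted murder"),
  Period_2015_2016 = c(4, 5, 2, 25),
  stringsAsFactors = FALSE
)

# One row per station, one column per crime category
crime_wide <- pivot_wider(
  crime_long,
  names_from  = Crime_Category,
  values_from = Period_2015_2016
)

print(crime_wide)
```

Naming the `names_from` and `values_from` arguments makes it explicit which column becomes the headers and which supplies the cell values.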


In [4]:
# structure of the wide dataset
str(dataset_A_wide)


'data.frame':	1143 obs. of  29 variables:
 $ Province                                               : Factor w/ 9 levels "Eastern Cape",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Police_Station                                         : Factor w/ 1143 levels "Aberdeen","Acornhoek",..: 1 4 5 6 14 15 16 17 18 35 ...
 $ All theft not mentioned elsewhere                      : int  51 97 47 11 76 505 183 16 205 12 ...
 $ Arson                                                  : int  4 2 2 1 0 4 3 0 1 3 ...
 $ Assault with the intent to inflict grievous bodily harm: int  87 150 75 54 86 137 299 28 140 87 ...
 $ Attempted murder                                       : int  5 25 0 5 17 14 7 1 0 4 ...
 $ Bank robbery                                           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Burglary at non-residential premises                   : int  15 87 22 7 27 259 62 6 69 0 ...
 $ Burglary at residential premises                       : int  74 144 85 29 116 259 245 12 141 46 ...
 $ Carjacking        

In [20]:
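# Note: a factor used as a row index is treated as its integer level codes,
# so this returns rows 1, 4, 5, 6, 14, ... rather than looking rows up by name
dataset_A_wide[dataset_A_wide$Police_Station, ]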

Unnamed: 0,Province,Police_Station,All theft not mentioned elsewhere,Arson,Assault with the intent to inflict grievous bodily harm,Attempted murder,Bank robbery,Burglary at non-residential premises,Burglary at residential premises,Carjacking,...,Robbery at residential premises,Robbery of cash in transit,Robbery with aggravating circumstances,Sexual Offences,Sexual offences as result of police action,Shoplifting,Stock-theft,Theft of motor vehicle and motorcycle,Theft out of or from motor vehicle,Truck hijacking
1,Eastern Cape,Aberdeen,51,4,87,5,0,15,74,0,...,2,0,8,14,0,0,20,2,7,0
4,Eastern Cape,Afsondering,11,1,54,5,0,7,29,0,...,6,0,13,28,0,0,97,0,6,0
5,Eastern Cape,Alexandria,76,0,86,17,0,27,116,2,...,6,0,36,41,0,5,35,6,13,3
6,Eastern Cape,Algoapark,505,4,137,14,0,259,259,24,...,34,0,325,53,1,223,2,89,429,1
14,Eastern Cape,Baviaanskloof,0,0,1,0,0,3,3,0,...,0,0,0,1,0,0,2,0,0,0
15,Eastern Cape,Beacon Bay,205,2,170,2,0,23,346,1,...,14,0,75,32,0,81,8,11,67,1
16,Eastern Cape,Bedford,45,0,83,1,0,19,48,0,...,1,0,7,22,0,3,29,1,4,0
17,Eastern Cape,Bell,22,1,20,2,0,3,29,0,...,3,0,7,13,0,0,17,1,6,0
18,Eastern Cape,Berlin,27,1,37,0,0,11,25,0,...,2,0,10,17,0,0,21,5,10,0
35,Eastern Cape,Chungwa,13,2,89,3,0,18,33,1,...,2,0,16,40,0,0,33,1,4,0


In [13]:
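# Check the dataset for duplicates: count station names that occur more than once
# (length() would only return the number of rows, so sum() is used instead)
sum(duplicated(dataset_A_wide$Police_Station))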
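
As a small illustration of the duplicate check: `duplicated()` returns a logical vector that is as long as its input, so its length is just the number of rows, while its sum counts the repeated entries. The sketch below uses a hypothetical vector with one deliberately repeated station name.

```r
# Hypothetical station names with one repeat, for illustration only
stations <- c("Aberdeen", "Addo", "Addo", "Adelaide")

dup_flags <- duplicated(stations)

n_values     <- length(dup_flags)        # number of elements, not duplicates
n_duplicates <- sum(dup_flags)           # number of repeated entries
first_dup    <- anyDuplicated(stations)  # index of the first repeat, 0 if none

print(c(n_values = n_values, n_duplicates = n_duplicates, first_dup = first_dup))
```

`anyDuplicated()` is a convenient yes/no check when only the presence of duplicates matters.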

In [7]:
# dataset B - Police stations and population they cover
library(xlsx)
dataset_B <- read.csv("Dataset_B.csv")
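# Note: a factor used as a row index is treated as its integer level codes,
# so these calls do not list rows in order; dataset_A repeats row 1 (1, 1.1, ...)
dataset_B[dataset_B$Police_Station, ]
dataset_A[dataset_A$Police_Station, ]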


Unnamed: 0,Police_Station,population_estimate
1,ABERDEEN,9867
2,ACORNHOEK,127623
3,ACTONVILLE,52831
4,ADDO,20938
5,ADELAIDE,13588
6,AFSONDERING,21315
7,AGGENEYS,2384
8,AKASIA,191804
9,ALBERTINIA,8328
10,ALBERTON,90144


Unnamed: 0,Province,Police_Station,Crime_Category,Period_2015_2016
1,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.1,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.2,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.3,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.4,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.5,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.6,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.7,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.8,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
1.9,Eastern Cape,Aberdeen,All theft not mentioned elsewhere,51
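
The repeated row names (`1`, `1.1`, `1.2`, ...) in the output above come from indexing with a factor: R uses the factor's integer level codes as row numbers. A minimal sketch of the effect, using a hypothetical three-row data frame with made-up population values:

```r
# Hypothetical data; the population values are illustrative only
stations <- data.frame(
  Police_Station      = factor(c("Berlin", "Aberdeen", "Addo")),
  population_estimate = c(300, 100, 200)
)

# Levels are sorted alphabetically: Aberdeen = 1, Addo = 2, Berlin = 3
codes <- as.integer(stations$Police_Station)  # 3 1 2

# Indexing by the factor selects rows 3, 1, 2 (the codes), not rows by name
by_factor <- stations[stations$Police_Station, ]

# To preview the first rows in their stored order, head() is the usual tool
first_rows <- head(stations, n = 2)

print(by_factor)
print(first_rows)
```

When the codes repeat, as they do for `dataset_A` where the first rows all belong to the same station, the same row is returned many times.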


In [29]:
str(dataset_B$Police_Station)

 Factor w/ 1140 levels "ABERDEEN","ACORNHOEK",..: 1 2 3 4 5 6 7 8 9 10 ...


In [9]:
# Dataset C - Police stations and their coordinates
dataset_C <- read.table("Dataset_C.tsv", header=TRUE, sep='\t')
# Note: indexing by a factor uses its integer level codes as row numbers
dataset_C[dataset_C$Police_Station, ]

Unnamed: 0,Police_Station,LongitudeY,LatitudeX
1,ABERDEEN,-32.47634,24.06098
2,ACORNHOEK,-24.59710,31.04835
3,ACTONVILLE,-26.21198,28.29975
4,ADDO,-33.54769,25.69029
5,ADELAIDE,-32.70725,26.29255
6,AFSONDERING,-30.16502,28.96145
7,AGGENEYS,-29.24206,18.84713
8,AKASIA,-25.62571,28.09538
9,ALBERTINIA,-34.20860,21.58325
10,ALBERTON,-26.26088,28.12692
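
The coordinate columns appear to be mislabelled: assuming these are South African stations, the negative values under `LongitudeY` look like southern-hemisphere latitudes and the positive values under `LatitudeX` look like eastern longitudes. The sketch below shows one way to rename the columns and sanity-check the ranges. It uses three rows copied from the output above, and the names `coords`, `latitude` and `longitude` are hypothetical.

```r
# Three rows copied from the dataset_C output above
coords <- data.frame(
  Police_Station = c("ABERDEEN", "ACORNHOEK", "ADDO"),
  LongitudeY     = c(-32.47634, -24.59710, -33.54769),
  LatitudeX      = c(24.06098, 31.04835, 25.69029),
  stringsAsFactors = FALSE
)

# Assumption: the labels are swapped, so rename rather than move any values
names(coords)[names(coords) == "LongitudeY"] <- "latitude"
names(coords)[names(coords) == "LatitudeX"]  <- "longitude"

# Sanity check: every coordinate lies within the valid global ranges
valid <- all(abs(coords$latitude) <= 90) && all(abs(coords$longitude) <= 180)

print(coords)
print(valid)
```

Renaming only changes the labels, so any downstream mapping code would need to use the new names.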


In [10]:
# Check dataset for duplicates
dataset_C[duplicated(dataset_C$Police_Station), ]

Police_Station,LongitudeY,LatitudeX


In [27]:
str(dataset_B)

'data.frame':	1140 obs. of  2 variables:
 $ Police_Station     : Factor w/ 1140 levels "ABERDEEN","ACORNHOEK",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ population_estimate: int  9867 127623 52831 20938 13588 21315 2384 191804 8328 90144 ...


In [21]:
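## Merging the datasets
## Left join: keep every row of dataset_A_wide and attach the matching population estimate
## Note: station names are title case in dataset_A_wide but upper case in dataset_B,
## so the keys do not match as-is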

dataset_AB <- left_join(dataset_A_wide, dataset_B, by="Police_Station")

"joining factors with different levels, coercing to character vector"

In [23]:
head(dataset_AB)

Province,Police_Station,All theft not mentioned elsewhere,Arson,Assault with the intent to inflict grievous bodily harm,Attempted murder,Bank robbery,Burglary at non-residential premises,Burglary at residential premises,Carjacking,...,Robbery of cash in transit,Robbery with aggravating circumstances,Sexual Offences,Sexual offences as result of police action,Shoplifting,Stock-theft,Theft of motor vehicle and motorcycle,Theft out of or from motor vehicle,Truck hijacking,population_estimate
Eastern Cape,Aberdeen,51,4,87,5,0,15,74,0,...,0,8,14,0,0,20,2,7,0,
Eastern Cape,Addo,97,2,150,25,0,87,144,0,...,1,41,55,0,0,21,8,11,0,
Eastern Cape,Adelaide,47,2,75,0,0,22,85,0,...,0,12,18,0,7,22,4,12,0,
Eastern Cape,Afsondering,11,1,54,5,0,7,29,0,...,0,13,28,0,0,97,0,6,0,
Eastern Cape,Alexandria,76,0,86,17,0,27,116,2,...,0,36,41,0,5,35,6,13,3,
Eastern Cape,Algoapark,505,4,137,14,0,259,259,24,...,0,325,53,1,223,2,89,429,1,
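
The `population_estimate` column is empty in the joined output, most likely because the join keys differ in case (`Aberdeen` in the crime data, `ABERDEEN` in the population data). One possible fix is to join on a normalised key. The sketch below uses toy data frames built from values shown earlier. The names `crimes`, `population` and `station_key` are hypothetical.

```r
library(dplyr)

# Toy versions of the two datasets (values copied from the outputs above)
crimes <- data.frame(
  Police_Station = c("Aberdeen", "Addo", "Adelaide"),
  Arson          = c(4, 2, 2),
  stringsAsFactors = FALSE
)
population <- data.frame(
  Police_Station      = c("ABERDEEN", "ADDO", "ADELAIDE"),
  population_estimate = c(9867, 20938, 13588),
  stringsAsFactors = FALSE
)

# Naive join: the keys differ in case, so no row finds a match
naive <- left_join(crimes, population, by = "Police_Station")

# Normalise the key on both sides: character, trimmed, upper case
crimes$station_key     <- toupper(trimws(as.character(crimes$Police_Station)))
population$station_key <- toupper(trimws(as.character(population$Police_Station)))

# Join on the normalised key, keeping the original station name from crimes
joined <- left_join(
  crimes,
  population[, c("station_key", "population_estimate")],
  by = "station_key"
)

# Rows of crimes that still have no partner after normalisation
unmatched <- anti_join(crimes, population, by = "station_key")

print(joined)
print(nrow(unmatched))
```

On the real data some stations may remain unmatched even after normalisation, since `dataset_A_wide` has 1143 stations and `dataset_B` has 1140, so inspecting the `anti_join()` result is worthwhile before modelling.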
