# Machine Learning Project 
#### Forensic Glass Fragment Analysis using Classification 
---

*Ludek Cizinský (luci@itu.dk)*, *Lukas Rasocha (lukr@itu.dk)*, *Mika Senghaas (jsen@itu.dk)*

Lecturers: *Therese Graversen*, *Djordje Grbic*, *Payam Zahadat*

Deadline: *3rd of January 2022*

Last Modified: *24th of November 2021*

## Assignment Description
---
This project explores the possibility of using the elemental composition and refractive index to determine the origin of a very small glass fragment. This was studied by Evett and Spiehler (1987), who wrote in their paper:

```
Glass is a material which figures prominently in the investigation of crimes such as burglary and criminal damage in which it is common for a window to be smashed violently, either to gain access or as an act of vandalism. If a suspect is apprehended for such an offence then it is almost a routine matter to submit articles of his clothing to a forensic science laboratory so that a scientist may determine whether or not there is evidential material present.
Evett and Spiehler (1987)
```

The study by Evett and Spiehler (1987) is just one example of the many contributions to the field of forensic science that came out of the UK Forensic Science Service before it was closed in 2012. For this project, we use the data from their study, as found in (Dua and Graff, 2019), to investigate suitable classification techniques for determining the origin of a glass fragment.

## Environment Setup
---
This project relies on several Python libraries. Make sure the dependencies listed in `requirements.txt` are installed locally using the *Python Package Manager* `pip`. If `pip` is available, running the next code cell installs all required dependencies. Documentation for the main libraries is available via the links below:

- [*NumPy* Quickstart](https://numpy.org/doc/stable/user/quickstart.html)
- [*Matplotlib* Documentation](https://matplotlib.org/stable/tutorials/introductory/usage.html)
- [*Pandas* Documentation](https://pandas.pydata.org/docs/)
- [*Sklearn* Documentation](https://scikit-learn.org/stable/)

### Libraries

In [1]:
%%capture
!pip install -r requirements.txt

In [4]:
# python standard library
from time import time                                           # used for timing execution
from datetime import date, datetime                             # get current date and time
import json                                                     # read/ write json
import re                                                       # regex search 
import os                                                       # os operations
import random                                                   # randomness
from collections import Counter                                 # efficient counting
import contextlib

# jupyter library
from IPython.display import display, Image, Markdown            # display images and markdown in jupyter

# general data science libraries
from matplotlib import pyplot as plt                            # basic plotting
import seaborn as sns                                           # advanced plotting
import numpy as np                                              # for representing n-dimensional arrays
import scipy as sp                                              # numerical computation
import pandas as pd                                             # dataframes

# sklearn imports 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn.tree import DecisionTreeClassifier as SklearnDecisionTreeClassifier  # aliased: the custom class imported below reuses this name

# deep learning framework TODO: pytorch imports here

# custom imports
from scripts.models import DecisionTreeClassifier
from scripts.models import NeuralNetwork

from scripts.metrics import *
from scripts.plotting import *

### Set global style of plots

Below you can set the global style for all plots, along with any other plot-related configuration.

In [5]:
sns.set_style("darkgrid")
sns.set(rc={"xtick.bottom" : True, "ytick.left" : True})

### Flags

Flags control the run flow of the notebook when it is executed in one go. This is useful for preventing operations that should only produce a result once from running multiple times.

In [6]:
# section flags
LOAD_DATA = True       # loads raw data for initial inspection
PRODUCE_PLOTS = False  # global parameter to generate plots

TRAIN_MODELS = False   # global parameter to train models
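As a minimal sketch of how such flags could gate individual sections, consider the following. The flag values and the `executed` list are illustrative assumptions; the comments only hint at what a guarded section might do.

```python
# Hypothetical flag values; in the notebook they would be set once in the flags cell.
LOAD_DATA = True
PRODUCE_PLOTS = False
TRAIN_MODELS = False

executed = []  # records which sections ran (for illustration only)

if LOAD_DATA:
    executed.append("load_data")      # e.g. read the raw CSV files

if PRODUCE_PLOTS:
    executed.append("produce_plots")  # e.g. save figures to data/figures

if TRAIN_MODELS:
    executed.append("train_models")   # e.g. run a grid search

print(executed)  # ['load_data']
```

With this pattern, rerunning the whole notebook skips every section whose flag is `False`.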

### Constants



In [7]:
PATH_TO = {}
PATH_TO['data'] = {}

PATH_TO['data']['raw'] = 'data/raw'
PATH_TO['data']['figures'] = 'data/figures'
PATH_TO['data']['summaries'] = 'data/summaries'
PATH_TO['data']['metadata'] = 'data/metadata'

PATH_TO['models'] = {}

### Folder Structure

Create the folders to read from and write to, if they do not exist yet.
An easy way to check that all folders have been created is to open a shell, go to the project folder, and run `tree`, which prints an overview of the directory structure. See the README for the expected folder structure.

In [8]:
# iterate over path_to dict
exists = []
for subdir in PATH_TO.keys():
    for path in PATH_TO[subdir].values():
        if not os.path.exists(path):
            os.makedirs(path) 
            print(f'created path: {path}')
        else: 
            exists.append(path)

if len(exists) != 0:
    print(f'the following paths already existed: {exists}')

created path: data/figures
created path: data/summaries
created path: data/metadata
the following paths already existed: ['data/raw']


## #01 Data preprocessing and cleaning
---

In [9]:
!head -5 data/raw/df_train.csv

RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,type
1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.0,0.35,2
1.52081,13.78,2.28,1.43,71.99,0.49,9.85,0.0,0.17,2
1.51708,13.72,3.68,1.81,72.06,0.64,7.88,0.0,0.0,2
1.52739,11.02,0.0,0.75,73.08,0.0,14.96,0.0,0.0,2


In [10]:
!head -5 data/raw/df_test.csv

RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,type
1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1.51721,12.87,3.48,1.33,73.04,0.56,8.43,0.0,0.0,1
1.52213,14.21,3.82,0.47,71.77,0.11,9.57,0.0,0.0,1
1.51623,14.2,0.0,2.79,73.46,0.04,9.04,0.4,0.09,7


In [11]:
features = [
    'refractive_index',
    'sodium',
    'magnesium',
    'aluminium',
    'silicon',
    'potassium',
    'calcium',
    'barium',
    'iron',
]

In [12]:
classes = [
    'window_from_building_(float_processed)',
    'window_from_building_(non_float_processed)',
    'window_from_vehicle',
    'container',
    'tableware',
    'headlamp'
]
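The `type` column only takes the values 1, 2, 3, 5, 6 and 7 in the training data, while `classes` holds six names. A minimal sketch of a lookup between the two follows. It assumes that the names are listed in ascending order of type id and that id 4 is simply absent; `type_ids` and `type_to_class` are hypothetical names.

```python
# Class names in the order of the notebook's `classes` list.
classes = [
    'window_from_building_(float_processed)',
    'window_from_building_(non_float_processed)',
    'window_from_vehicle',
    'container',
    'tableware',
    'headlamp',
]

# Assumption: type ids as they occur in the data; id 4 does not occur.
type_ids = [1, 2, 3, 5, 6, 7]

# Map each numeric type id to its readable class name.
type_to_class = dict(zip(type_ids, classes))

print(type_to_class[5])
```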

In [13]:
# loading in train and test data splits 
train = np.loadtxt(f"{PATH_TO['data']['raw']}/df_train.csv", delimiter=',', skiprows=1)
test = np.loadtxt(f"{PATH_TO['data']['raw']}/df_test.csv", delimiter=',', skiprows=1)

# split feature matrix from target
X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]

In [58]:
X_test

array([[1.52101e+00, 1.36400e+01, 4.49000e+00, 1.10000e+00, 7.17800e+01,
        6.00000e-02, 8.75000e+00, 0.00000e+00, 0.00000e+00],
       [1.51721e+00, 1.28700e+01, 3.48000e+00, 1.33000e+00, 7.30400e+01,
        5.60000e-01, 8.43000e+00, 0.00000e+00, 0.00000e+00],
       [1.52213e+00, 1.42100e+01, 3.82000e+00, 4.70000e-01, 7.17700e+01,
        1.10000e-01, 9.57000e+00, 0.00000e+00, 0.00000e+00],
       [1.51623e+00, 1.42000e+01, 0.00000e+00, 2.79000e+00, 7.34600e+01,
        4.00000e-02, 9.04000e+00, 4.00000e-01, 9.00000e-02],
       [1.51829e+00, 1.44600e+01, 2.24000e+00, 1.62000e+00, 7.23800e+01,
        0.00000e+00, 9.26000e+00, 0.00000e+00, 0.00000e+00],
       [1.51602e+00, 1.48500e+01, 0.00000e+00, 2.38000e+00, 7.32800e+01,
        0.00000e+00, 8.76000e+00, 6.40000e-01, 9.00000e-02],
       [1.51610e+00, 1.34200e+01, 3.40000e+00, 1.22000e+00, 7.26900e+01,
        5.90000e-01, 8.32000e+00, 0.00000e+00, 0.00000e+00],
       [1.51763e+00, 1.26100e+01, 3.59000e+00, 1.31000e+00, 7.

## #02 Exploratory Data Analysis
---

### Define input data for analysis

In [27]:
df = pd.read_csv(f"{PATH_TO['data']['raw']}/df_train.csv", usecols=range(9))
df.head()

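|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.51839 | 12.85 | 3.67 | 1.24 | 72.57 | 0.62 | 8.68 | 0.0 | 0.35 |
| 1 | 1.52081 | 13.78 | 2.28 | 1.43 | 71.99 | 0.49 | 9.85 | 0.0 | 0.17 |
| 2 | 1.51708 | 13.72 | 3.68 | 1.81 | 72.06 | 0.64 | 7.88 | 0.0 | 0.0 |
| 3 | 1.52739 | 11.02 | 0.0 | 0.75 | 73.08 | 0.0 | 14.96 | 0.0 | 0.0 |
| 4 | 1.5221 | 13.73 | 3.84 | 0.72 | 71.76 | 0.17 | 9.74 | 0.0 | 0.0 |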


In [45]:
df2 = pd.read_csv(f"{PATH_TO['data']['raw']}/df_train.csv")
df2.head()

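|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.51839 | 12.85 | 3.67 | 1.24 | 72.57 | 0.62 | 8.68 | 0.0 | 0.35 | 2 |
| 1 | 1.52081 | 13.78 | 2.28 | 1.43 | 71.99 | 0.49 | 9.85 | 0.0 | 0.17 | 2 |
| 2 | 1.51708 | 13.72 | 3.68 | 1.81 | 72.06 | 0.64 | 7.88 | 0.0 | 0.0 | 2 |
| 3 | 1.52739 | 11.02 | 0.0 | 0.75 | 73.08 | 0.0 | 14.96 | 0.0 | 0.0 | 2 |
| 4 | 1.5221 | 13.73 | 3.84 | 0.72 | 71.76 | 0.17 | 9.74 | 0.0 | 0.0 | 1 |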


In [33]:
t = pd.read_csv(f"{PATH_TO['data']['raw']}/df_train.csv", usecols=['type'])['type']
t

0      2
1      2
2      2
3      2
4      1
      ..
144    1
145    2
146    7
147    1
148    1
Name: type, Length: 149, dtype: int64

### Training data - shape

With only 149 records and 9 features, the dataset is very small relative to the number of features. We should take this into account and consider reducing the feature space to avoid the `curse of dimensionality`.

In [50]:
df.shape

(149, 9)
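One possible way to reduce the feature space is principal component analysis (PCA). The sketch below is an assumption for illustration, not a method chosen in this notebook. It runs on random numbers with the same shape as the training data, and `n_components=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the training data (149 x 9);
# these are random numbers, not the glass measurements.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(149, 9))

# Standardize first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first three principal components (arbitrary choice).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

On the real data, the share of explained variance would indicate how many components are worth keeping.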

### Definition of features

Apart from the `Refractive index`, we were also given each glass fragment’s chemical composition as the weight percent of eight different elements. Before training our models, we should make sure that the data is normalized, since `RI` varies on a much smaller scale (values around 1.52) than the weight percentages (for example, `Si` at around 72 %).
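A minimal sketch of such a normalization with the already imported `StandardScaler` is shown below. It uses the first four training rows and the first test row printed earlier. Fitting the scaler on the training split only is an assumption about how the preprocessing would be set up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe (first rows of the raw files).
X_train = np.array([
    [1.51839, 12.85, 3.67, 1.24, 72.57, 0.62, 8.68, 0.0, 0.35],
    [1.52081, 13.78, 2.28, 1.43, 71.99, 0.49, 9.85, 0.0, 0.17],
    [1.51708, 13.72, 3.68, 1.81, 72.06, 0.64, 7.88, 0.0, 0.0],
    [1.52739, 11.02, 0.0, 0.75, 73.08, 0.0, 14.96, 0.0, 0.0],
])
X_test = np.array([[1.52101, 13.64, 4.49, 1.1, 71.78, 0.06, 8.75, 0.0, 0.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test split

print(X_train_scaled.mean(axis=0).round(6))
```

Reusing the training statistics for the test split avoids leaking information from the test data into the model.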

#### RI (Refractive index)
- a standard measurement taken for forensic purposes, since it varies significantly between different types of glass and can be measured precisely even on small glass fragments

- Definition from [Wikipedia](https://en.wikipedia.org/wiki/Refractive_index):
```
In optics, the refractive index (also known as refraction index or index of refraction) of a material is a dimensionless number that describes how fast light travels through the material.
```

#### Na (Sodium)

#### Mg (Magnesium)

#### Al (Aluminum)

#### Si (Silicon)

#### K (Potassium)

#### Ca (Calcium)

#### Ba (Barium)

#### Fe (Iron)

### Is row sum 100 %?

In [44]:
df.iloc[5, 1:].sum()

99.85
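The check above covers a single row. A sketch that applies it to every row is shown below, using the first four training rows for illustration. The tolerance of one percentage point is an arbitrary assumption.

```python
import pandas as pd

columns = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
df_sample = pd.DataFrame(
    [
        [1.51839, 12.85, 3.67, 1.24, 72.57, 0.62, 8.68, 0.0, 0.35],
        [1.52081, 13.78, 2.28, 1.43, 71.99, 0.49, 9.85, 0.0, 0.17],
        [1.51708, 13.72, 3.68, 1.81, 72.06, 0.64, 7.88, 0.0, 0.0],
        [1.52739, 11.02, 0.0, 0.75, 73.08, 0.0, 14.96, 0.0, 0.0],
    ],
    columns=columns,
)

# Sum the weight percentages of the eight elements; RI (column 0) is skipped.
row_sums = df_sample.iloc[:, 1:].sum(axis=1)

# Flag rows whose sum lies within one percentage point of 100 (assumed tolerance).
close_to_100 = (row_sums - 100).abs() < 1.0

print(row_sums.round(2).tolist())
print(close_to_100.all())
```

Running the same two lines on the full `df` would show whether any record deviates noticeably from 100 %.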

### Numerical summary of all features

#### Overall

In [36]:
df.describe()

|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| count | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 | 149.0 |
| mean | 1.518427 | 13.42047 | 2.724765 | 1.434698 | 72.623758 | 0.485168 | 8.924295 | 0.199799 | 0.061611 |
| std | 0.003213 | 0.863283 | 1.422193 | 0.50647 | 0.783145 | 0.569998 | 1.511192 | 0.553319 | 0.097642 |
| min | 1.51115 | 10.73 | 0.0 | 0.29 | 69.81 | 0.0 | 5.43 | 0.0 | 0.0 |
| 25% | 1.51663 | 12.93 | 2.28 | 1.17 | 72.28 | 0.13 | 8.22 | 0.0 | 0.0 |
| 50% | 1.51769 | 13.3 | 3.49 | 1.36 | 72.78 | 0.55 | 8.59 | 0.0 | 0.0 |
| 75% | 1.51916 | 13.83 | 3.61 | 1.62 | 73.05 | 0.61 | 9.14 | 0.0 | 0.11 |
| max | 1.53393 | 17.38 | 3.98 | 3.5 | 75.41 | 6.21 | 16.19 | 3.15 | 0.37 |


**Key highlights**

- The main component of all glass fragments is `Silicon`, with a mean of about `72.6 %`. `Sodium` follows with about `13.4 %` and `Calcium` with about `8.9 %`. It will likely be more informative to look at these values per `type of glass`.

#### Mean per type per feature

In [49]:
df2.groupby('type').mean()

| type | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.518684 | 13.241837 | 3.524694 | 1.167347 | 72.639388 | 0.457143 | 8.784286 | 0.014082 | 0.063469 |
| 2 | 1.518744 | 13.110189 | 3.080943 | 1.402642 | 72.555283 | 0.519623 | 9.028679 | 0.07 | 0.08717 |
| 3 | 1.517991 | 13.399167 | 3.566667 | 1.185 | 72.4625 | 0.375833 | 8.769167 | 0.0125 | 0.080833 |
| 5 | 1.519182 | 12.777778 | 0.938889 | 1.982222 | 72.344444 | 1.254444 | 10.236667 | 0.244444 | 0.031111 |
| 6 | 1.51682 | 14.913333 | 0.818333 | 1.305 | 73.508333 | 0.0 | 9.31 | 0.0 | 0.0 |
| 7 | 1.517357 | 14.5345 | 0.6915 | 2.117 | 72.724 | 0.3275 | 8.3775 | 1.151 | 0.01 |


**Key highlights**

- From this table, we can see that some columns may be more helpful than others for classifying the data. For example, `Calcium` seems to be a good indicator of the class a given record belongs to.

- Interestingly, types `1 - 3` have similar mean values for `Magnesium` (about 3.1 to 3.6), and the same holds for types `5 - 7` (about 0.7 to 0.9). Note that type `4` does not occur in the training data.
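To compare features more systematically, one crude heuristic is to express the range of the per-type means in units of each feature's overall standard deviation. This is only an illustrative assumption, not part of the original analysis. It ignores class sizes and within-class variance, so it should not be read as a feature ranking. The names `type_means`, `overall_std` and `spread` are hypothetical.

```python
import pandas as pd

features = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]

# Per-type means copied from the table above.
type_means = pd.DataFrame(
    [
        [1.518684, 13.241837, 3.524694, 1.167347, 72.639388, 0.457143, 8.784286, 0.014082, 0.063469],
        [1.518744, 13.110189, 3.080943, 1.402642, 72.555283, 0.519623, 9.028679, 0.07, 0.08717],
        [1.517991, 13.399167, 3.566667, 1.185, 72.4625, 0.375833, 8.769167, 0.0125, 0.080833],
        [1.519182, 12.777778, 0.938889, 1.982222, 72.344444, 1.254444, 10.236667, 0.244444, 0.031111],
        [1.51682, 14.913333, 0.818333, 1.305, 73.508333, 0.0, 9.31, 0.0, 0.0],
        [1.517357, 14.5345, 0.6915, 2.117, 72.724, 0.3275, 8.3775, 1.151, 0.01],
    ],
    index=[1, 2, 3, 5, 6, 7],
    columns=features,
)

# Overall standard deviations copied from the `df.describe()` output.
overall_std = pd.Series(
    [0.003213, 0.863283, 1.422193, 0.50647, 0.783145, 0.569998, 1.511192, 0.553319, 0.097642],
    index=features,
)

# Range of the class means, scaled by the overall standard deviation of each feature.
spread = ((type_means.max() - type_means.min()) / overall_std).sort_values(ascending=False)

print(spread.round(2))
```

Larger values mean that the class means lie further apart relative to the overall variation of that feature.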

### Distribution plot

In [56]:
df.hist(figsize=(12, 10))  # one histogram per feature column of df
plt.tight_layout()         # avoid overlapping subplot titles

## #03 Custom Implementations
---

## #04 Training and Evaluating Models
---

### Decision Trees
---

### Neural Networks
---

### Ensemble Classifier
---