# Group 5 Project Proposal - Glass Classification

## Introduction

Classifying glass types can be important in criminal investigations. Glass left at a crime scene can be used as evidence, so it is important to determine correctly whether two or more fragments came from the same source or from different ones. Doing so can help identify methods of escape, weapons, and other pieces of evidence.

Our goal is to build a model that identifies the type of a glass sample obtained from a crime scene from the weight percentages of 8 oxides. More specifically, we hope to answer the question: given the weight percent of 8 oxides in a glass sample, which of the 6 glass types is it?

The particular dataset we have chosen comes from the USA Forensic Science Service and classifies 6 types of glass based on their oxide content (see below for more details). The dataset also contains the refractive index of each glass observation.

Note that float-processing is a glass manufacturing process that creates a smooth, thick and uniform surface. In this dataset, there is no data for “vehicle_windows_non_float_processed” glass.

<b>More details on the glass identification dataset:</b>

<b>Glass Oxides (measured in weight percent)</b>
<ul>
<li>Na: Sodium</li>
<li>Mg: Magnesium</li>
<li>Al: Aluminum</li>
<li>Si: Silicon</li>
<li>K: Potassium</li>
<li>Ca: Calcium</li>
<li>Ba: Barium</li>
<li>Fe: Iron</li>
</ul>
<b>Glass Types</b>
<ul>
<li>building_windows_float_processed</li>
<li>building_windows_non_float_processed</li>
<li>vehicle_windows_float_processed</li>
<li>vehicle_windows_non_float_processed (<b>none in this dataset</b>)</li>
<li>containers</li>
<li>tableware</li>
<li>headlamps</li>
</ul>

## Preliminary Exploratory Data Analysis

#### Read data from the web & Clean data into tidy format

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [2]:
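# Download the glass identification data; the file has no header row, so column names are supplied
dataset_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data"
col_names <- c("id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "glass_type")
glass_data_raw <- read.table(dataset_url, sep = ",", header = FALSE, col.names = col_names) |>
    mutate(glass_type = as_factor(glass_type))  # numeric class codes become factor levels
glass_data_raw

# Class names listed in the order of the codes present in the data (1, 2, 3, 5, 6, 7; code 4 has no observations)
glass_type_names <- c("building_windows_float_processed", "building_windows_non_float_processed", "vehicle_windows_float_processed", "containers", "tableware", "headlamps")
# Drop the refractive index, since only the 8 oxides are used as predictors
glass_data_processed <- glass_data_raw |> select(-RI)
# Replace the numeric level names with descriptive ones (assigned by position)
levels(glass_data_processed$glass_type) <- glass_type_names
glass_data_processed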

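| id | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | glass_type |
|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0.00 | 1 |
| 2 | 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0.00 | 1 |
| 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0.00 | 1 |
| 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0.00 | 1 |
| 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0.00 | 1 |
| 6 | 1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0 | 0.26 | 1 |
| 7 | 1.51743 | 13.30 | 3.60 | 1.14 | 73.09 | 0.58 | 8.17 | 0 | 0.00 | 1 |
| 8 | 1.51756 | 13.15 | 3.61 | 1.05 | 73.24 | 0.57 | 8.24 | 0 | 0.00 | 1 |
| 9 | 1.51918 | 14.04 | 3.58 | 1.37 | 72.08 | 0.56 | 8.30 | 0 | 0.00 | 1 |
| 10 | 1.51755 | 13.00 | 3.60 | 1.36 | 72.99 | 0.57 | 8.40 | 0 | 0.11 | 1 |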


| id | Na | Mg | Al | Si | K | Ca | Ba | Fe | glass_type |
|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0.00 | building_windows_float_processed |
| 2 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0.00 | building_windows_float_processed |
| 3 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0.00 | building_windows_float_processed |
| 4 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0.00 | building_windows_float_processed |
| 5 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0.00 | building_windows_float_processed |
| 6 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0 | 0.26 | building_windows_float_processed |
| 7 | 13.30 | 3.60 | 1.14 | 73.09 | 0.58 | 8.17 | 0 | 0.00 | building_windows_float_processed |
| 8 | 13.15 | 3.61 | 1.05 | 73.24 | 0.57 | 8.24 | 0 | 0.00 | building_windows_float_processed |
| 9 | 14.04 | 3.58 | 1.37 | 72.08 | 0.56 | 8.30 | 0 | 0.00 | building_windows_float_processed |
| 10 | 13.00 | 3.60 | 1.36 | 72.99 | 0.57 | 8.40 | 0 | 0.11 | building_windows_float_processed |
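
The relabelling step above assigns the names by position, so it is correct only while the factor levels appear in the order 1, 2, 3, 5, 6, 7. A minimal sketch of a more explicit alternative, which ties each numeric code to its name, is shown below. The names `glass_type_labels`, `toy_codes` and `toy_types` are hypothetical and used only for illustration; they are not part of the original analysis.

```r
# Explicit code-to-name lookup (code 4 is omitted because it has no observations)
glass_type_labels <- c(
  "1" = "building_windows_float_processed",
  "2" = "building_windows_non_float_processed",
  "3" = "vehicle_windows_float_processed",
  "5" = "containers",
  "6" = "tableware",
  "7" = "headlamps"
)

# A few made-up class codes standing in for the glass_type column
toy_codes <- c(7, 1, 5, 1, 2)

# factor() matches each code to its level, so the result does not depend on row order
toy_types <- factor(
  toy_codes,
  levels = as.integer(names(glass_type_labels)),
  labels = unname(glass_type_labels)
)
toy_types

# Any code missing from the lookup would become NA, which makes mistakes easy to spot
stopifnot(!anyNA(toy_types))
```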


#### Summarize the data with the training set

In [3]:
glass_split <- initial_split(glass_data_processed, prop = 0.75, strata = glass_type)
glass_training <- training(glass_split)
glass_testing <- testing(glass_split)
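
Because `initial_split()` draws rows at random, the training set, and every summary computed from it, changes between runs unless a seed is set first. The sketch below illustrates this on a small made-up data frame. The data frame `toy`, the helper `split_with_seed()` and the seed value 2022 are assumptions for illustration only.

```r
library(rsample)

# Made-up stand-in for the glass data: one oxide column and an imbalanced label
toy <- data.frame(
  Na = seq(12, 15, length.out = 40),
  glass_type = factor(rep(c("containers", "headlamps"), times = c(10, 30)))
)

# Hypothetical helper: fix the seed, then make a stratified 75/25 split
split_with_seed <- function(seed) {
  set.seed(seed)
  initial_split(toy, prop = 0.75, strata = glass_type)
}

first_training <- training(split_with_seed(2022))
second_training <- training(split_with_seed(2022))

# The same seed gives the same training rows on every run
identical(first_training, second_training)

# Stratifying keeps both glass types represented in the training set
table(first_training$glass_type)
```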

In [4]:
# Count the observations and average each oxide's weight percent per glass type
glass_summary <- glass_training |>
    group_by(glass_type) |>
    summarize(count = n(),
              across(Na:Fe, mean, .names = "avg_{.col}"))
glass_summary

| glass_type | count | avg_Na | avg_Mg | avg_Al | avg_Si | avg_K | avg_Ca | avg_Ba | avg_Fe |
|---|---|---|---|---|---|---|---|---|---|
| `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| building_windows_float_processed | 52 | 13.29519 | 3.5767308 | 1.137885 | 72.52019 | 0.4221154 | 8.868269 | 0.01711538 | 0.06326923 |
| building_windows_non_float_processed | 60 | 13.093 | 2.9313333 | 1.376833 | 72.62017 | 0.4895 | 9.1965 | 0.05916667 | 0.07 |
| vehicle_windows_float_processed | 10 | 13.333 | 3.537 | 1.245 | 72.347 | 0.451 | 8.8 | 0.015 | 0.097 |
| containers | 8 | 12.96625 | 0.77125 | 2.2625 | 71.8925 | 2.06375 | 9.5275 | 0.305 | 0.06375 |
| tableware | 5 | 14.886 | 1.264 | 1.132 | 73.696 | 0.0 | 8.87 | 0.0 | 0.0 |
| headlamps | 24 | 14.50167 | 0.3779167 | 2.157083 | 72.97875 | 0.2945833 | 8.570833 | 1.05 | 0.01625 |


#### Visualize the data with the training set

The training data contain the oxide measurements along with the glass type labels, so we visualize them with bar graphs that compare the oxides within each glass type. As shown below, each glass type has its own bar graph, with the 8 oxides on the x-axis and the weight percentage of each oxide on the y-axis. The bar graphs are placed beside one another so that differences between glass types can be compared visually before they are compared numerically.

<img src="img/VisualizeData.png"/>
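
A plot of this kind could be produced along the following lines. This is only a sketch and not necessarily how the figure above was made: `glass_means` holds a few training-set means rounded from the summary table, and the object names `glass_long` and `oxide_plot` are hypothetical.

```r
library(tidyverse)

# A few training-set means, rounded from the summary table (two glass types, three oxides)
glass_means <- tribble(
  ~glass_type,                        ~avg_Na, ~avg_Mg, ~avg_Al,
  "building_windows_float_processed",   13.30,    3.58,    1.14,
  "headlamps",                          14.50,    0.38,    2.16
)

# Reshape to one row per (glass type, oxide) pair so ggplot can map oxide to the x-axis
glass_long <- glass_means |>
  pivot_longer(
    cols = starts_with("avg_"),
    names_to = "oxide",
    names_prefix = "avg_",
    values_to = "mean_weight_percent"
  )

# One panel per glass type, placed side by side for visual comparison
oxide_plot <- ggplot(glass_long, aes(x = oxide, y = mean_weight_percent)) +
  geom_col() +
  facet_wrap(~ glass_type) +
  labs(x = "Oxide", y = "Mean weight percent")
oxide_plot
```

With the full summary, `starts_with("avg_")` would pick up all 8 oxide columns without further changes.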

## Methods

#### Data Analysis Explanation

Glass type will be predicted from the 8 oxide variables. The model will be trained on a subset of the processed data (the training set). Because some glass types have far fewer observations than others, we will oversample the rare classes so that every glass type has the same number of points, which avoids bias toward the common classes. We will use cross-validation over multiple values of k and select the k with the highest average accuracy. The resulting model will then predict the glass type on the testing set, and the accuracy of those predictions will be used to assess the model. Finally, we will interpret this accuracy in the context of our project, in particular how useful it is for identifying glass types during criminal investigations.
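
The plan above could be sketched with tidymodels roughly as follows. This is an illustration under several assumptions rather than the final analysis: it assumes that k is the number of neighbours in a K-nearest neighbours classifier, that the `themis` and `kknn` packages are installed, and it uses synthetic two-oxide data (`toy_glass`) in place of the real training set. All object names, the seed, and the grid of k values are hypothetical.

```r
library(tidymodels)
library(themis)  # provides step_upsample()

set.seed(2022)

# Synthetic, imbalanced stand-in for the glass data (two oxides, two glass types)
toy_glass <- tibble(
  Na = c(rnorm(60, 13.3, 0.4), rnorm(20, 14.5, 0.4)),
  Mg = c(rnorm(60, 3.5, 0.3), rnorm(20, 0.4, 0.3)),
  glass_type = factor(rep(c("building_windows_float_processed", "headlamps"),
                          times = c(60, 20)))
)

toy_split <- initial_split(toy_glass, prop = 0.75, strata = glass_type)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize the predictors, then oversample rare classes to match the largest one.
# step_upsample() is skipped at prediction time, so held-out data are never resampled.
glass_recipe <- recipe(glass_type ~ ., data = toy_train) |>
  step_normalize(all_predictors()) |>
  step_upsample(glass_type, over_ratio = 1)

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over a small grid of candidate k values
folds <- vfold_cv(toy_train, v = 5, strata = glass_type)
k_grid <- tibble(neighbors = seq(1, 9, by = 2))

cv_accuracy <- workflow() |>
  add_recipe(glass_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Keep the k with the highest mean cross-validated accuracy
best_k <- cv_accuracy |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the whole training set with the chosen k and score the test set
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(glass_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = toy_train)

test_accuracy <- predict(final_fit, toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = glass_type, estimate = .pred_class) |>
  filter(.metric == "accuracy")
test_accuracy
```

Placing the oversampling step inside the recipe means it is re-applied within each cross-validation fold, so duplicated rows do not leak into the validation data. On the real data, the `id` column would presumably need to be dropped before using `glass_type ~ .`, since it is a row label rather than a measurement.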

#### Visualizing the Results

We plan to create bar graphs to visualize our data analysis and results. For each glass type, a bar graph will show the 8 oxides (x-axis) against their weight percentages (y-axis). We will then place the graphs next to each other for a visual comparison before comparing the values numerically, to provide insight into our project question.

## Expected Outcomes and Significance

#### What do you expect to find?

Based on the analysis and our results, we expect to find that float-processed glass has the highest silicon content, because float-processing is a manufacturing process that creates a smooth, thick and uniform surface. We also expect each glass type to have a distinguishing ratio of oxide weight percentages.
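
One way the silicon expectation could be checked is to rank the glass types by mean silicon content. The sketch below uses the training-set means from the summary table in the exploratory analysis, rounded to two decimals. The names `avg_si`, `highest_si_type` and `is_float` are hypothetical, and the ranking is preliminary, since it comes from a single random training split.

```r
# Mean silicon weight percent per glass type, rounded from the training-set summary table
avg_si <- c(
  building_windows_float_processed     = 72.52,
  building_windows_non_float_processed = 72.62,
  vehicle_windows_float_processed      = 72.35,
  containers                           = 71.89,
  tableware                            = 73.70,
  headlamps                            = 72.98
)

# Rank the glass types from highest to lowest mean silicon content
sort(avg_si, decreasing = TRUE)
highest_si_type <- names(which.max(avg_si))
highest_si_type

# Compare float-processed types with all other types
# (the "non_float" names also end in "_float_processed", so they are excluded explicitly)
is_float <- grepl("_float_processed$", names(avg_si)) & !grepl("non_float", names(avg_si))
c(float = mean(avg_si[is_float]), other = mean(avg_si[!is_float]))
```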

#### What impact could such findings have?

In forensics, identifying a sample of glass found at a crime scene could help establish where criminal activity originated. If the glass type is correctly identified, it can also be used as evidence. Hence, improving the efficiency of glass analysis can directly affect how, and how quickly, law enforcement can link suspects to a crime or rule them out.

#### What future questions could this lead to?

This analysis can lead to further research and improvements in forensic science for quickly classifying glass found at a crime scene from its oxide content. It also raises further questions, such as “Given the location of a crime scene, what is the most common glass type found?” and “How does the density of a glass sample relate to the weight percentages of its oxides?”