# Report

## Introduction

Understanding the various types of glass is crucial for numerous applications, from architectural design to container manufacturing. Each glass type possesses unique characteristics based on its elemental composition and refractive index. This study focuses on identifying these glass types using a k-nearest neighbors (KNN) model, considering different elements and variations in refractive index. Specifically, we explore float-processed glass, known for its economical production in flat glass manufacturing; container glass, used in various bottles and jars; tableware, which includes glass items for dining; and headlamp glass, vital in automotive safety.

**We want to answer the following question:**
How can we accurately predict the type of glass in an unknown sample using a KNN model trained on a dataset comprising refractive index and elemental compositions?

## Types of Glass:

Float-Processed Glass: Developed in the 1950s, the float process is pivotal in producing flat glass for vehicles and buildings. Key raw materials include silica, lime, soda, and cullet, which are heated to form molten glass and then floated on molten tin to achieve flatness. \
Container Glass: Encompasses glass bottles and jars for beverages, pharmaceuticals, cosmetics, and foods. \
Tableware: Includes items used for dining. Tableware in general may be made of glass, steel, or copper; this study concerns the glass items. \
Headlamp Glass: Essential in vehicles, headlamps provide illumination for safe driving and are distinct in their design and material requirements.

## Preliminary Data Analysis
### 1. Data Visualization Insights:

Refractive Index vs. Silicon Level: Understanding this relationship can shed light on the glass's optical properties. Its significance for the prediction model, however, still needs to be established. \
Material Composition in Glass Types: Initial observations suggest that silicon levels are uniform across glass types, which calls the predictive value of silicon into question. \
Glass Mean Table and Plot: These show how elemental composition varies across glass types. In-depth analysis under each plot will help in understanding its relevance to the study.
### 2. Dataset Description:
The dataset comprises the refractive index and eight elemental compositions representing various glass types. It's essential to analyze these elements to identify patterns and correlations pertinent to different glass categories.

### Glass Dataset

**Type of Glass** \
1 : Building Windows (float-processed) \
2 : Building Windows (non-float-processed) \
3 : Vehicle Windows (float-processed) \
4 : Vehicle Windows (non-float-processed) *none in this dataset* \
5 : Containers \
6 : Tableware \
7 : Headlamps 

Source: German, B. (1987). Glass Identification. UCI Machine Learning Repository. https://doi.org/10.24432/C5WW2P.

### About the Data
The dataset from the UCI Machine Learning Repository, donated on August 31, 1987, by the USA Forensic Science Service, focuses on glass identification. It consists of 214 instances, each with 9 features, and is used for classification tasks in the fields of physics and chemistry. Each instance records the refractive index and the content of eight elements, which together characterize the type of glass.

The dataset includes various types of glass, defined by their oxide content such as sodium (Na), magnesium (Mg), aluminum (Al), silicon (Si), potassium (K), calcium (Ca), barium (Ba), and iron (Fe). Each instance has a unique ID number and a measure of refractive index (RI). The types of glass are categorized into seven classes: 1) building windows float processed, 2) building windows non-float processed, 3) vehicle windows float processed, 4) vehicle windows non-float processed (not included in the dataset), 5) containers, 6) tableware, and 7) headlamps.

Notably, the dataset does not contain data for category 4 (vehicle windows non-float processed), so only six of the seven classes actually appear.

## Methodology
### 1. A Classification Problem:

We treat the question as a classification problem: the unknown sample must be assigned to a class (a glass type). Our dataset defines seven types, six of which are present, along with a series of continuous variables describing the characteristics of those types. This scenario is well suited to a k-nearest neighbors (KNN) classification model.

### 2. Variable Selection:

The dataset comprises nine continuous features: the refractive index (RI) and the weight percent of the oxides of sodium (Na), magnesium (Mg), aluminum (Al), silicon (Si), potassium (K), calcium (Ca), barium (Ba), and iron (Fe). Each of these elements contributes distinctively to the properties of glass, so we include them as predictors. The 'Id_number' feature, which is merely an identifier with no bearing on the glass properties, is excluded from the analysis. The selection of variables is based on their chemical significance in glass composition. Elements like Na, Mg, and Si are known to affect the physical and chemical properties of glass, making them key variables for prediction. The refractive index (RI) is a crucial optical property and can indicate glass type. \
Analysis of Variance: How each variable varies across the different glass types should be examined. Variables whose values differ more between glass types than within a single type are more likely to contribute to effective classification. \
Unique Distributions: Investigating the distribution of these elements across different glass types can reveal unique patterns, aiding in accurate classification.
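
One way to make the variance check above concrete is to compare how far apart the per-type means of a variable lie with how much the variable varies inside each type. The following is a minimal plain-Python sketch, not the analysis used in this report: the helper name `between_within_ratio` is hypothetical and the numbers are made-up toy values, not measurements from the glass dataset.

```python
from collections import defaultdict
from statistics import mean, pvariance


def between_within_ratio(values, labels):
    """Variance of the per-type means divided by the average within-type variance."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    group_means = [mean(group) for group in groups.values()]
    between = pvariance(group_means)
    within = mean([pvariance(group) for group in groups.values()])
    return between / within if within > 0 else float("inf")


# Made-up toy values for two hypothetical variables measured on three glass types.
labels = [1, 1, 1, 2, 2, 2, 7, 7, 7]
separating = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 5.2, 4.8]       # means differ by type
uniform = [72.0, 73.0, 72.5, 72.2, 72.8, 72.6, 72.4, 72.9, 72.1]  # similar in every type

ratio_separating = between_within_ratio(separating, labels)
ratio_uniform = between_within_ratio(uniform, labels)
print(ratio_separating, ratio_uniform)
```

A large ratio means the types are well separated on that variable; a ratio near zero means the variable looks much the same in every type and is unlikely to help a distance-based classifier.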

### 3. KNN Model Development:

After our variable selection, we follow a series of steps to create our KNN model. These include balancing the dataset, standardizing the data, cross-validating, and testing.

We first clean up our data. This includes adding column labels to our table, removing the 'Id_number' variable, which we have determined not to use, and turning our Type variable into a factor.

Then, we split our dataset into training and testing data. 
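
As an illustration of the splitting step, the sketch below keeps the proportion of each glass type the same in both parts. It is plain Python on made-up rows; the 75/25 proportion, the stratification by type, the seed, and the helper name `stratified_split` are all assumptions, since the report does not state how the split was made.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.75, seed=1):
    """Split rows into train and test parts, keeping class proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)  # random assignment within each class
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Made-up rows of the form (row id, glass type): eight of type 1, four of type 7.
rows = [(i, 1) for i in range(8)] + [(i, 7) for i in range(8, 12)]
train_rows, test_rows = stratified_split(rows, label_of=lambda row: row[1])
print(len(train_rows), len(test_rows))
```

The test rows are set aside here and play no part in balancing, standardizing, or choosing k.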

From the summary table created for our dataset, we recognise that some classes have fewer observations than others. To prevent this from skewing our results, we use the function step_upsample to balance our dataset. This function duplicates observations in the under-represented classes so that every class contains the same number of observations. This helps keep the model from favoring the larger classes.
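
The idea behind upsampling can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the internals of step_upsample: the helper name `upsample` is hypothetical, rows are duplicated by random sampling with replacement, and the rows themselves are made-up toy values.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, label_of, seed=1):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


# Made-up training rows of the form (feature value, glass type).
train_rows = [(0.1, 1), (0.2, 1), (0.3, 1), (0.4, 1), (0.9, 6), (1.5, 7), (1.6, 7)]
balanced_rows = upsample(train_rows, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced_rows)
print(dict(counts))
```

No new measurements are created: the balanced set contains only copies of existing rows, which is why it should be built from the training data alone.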

Next, we standardize our data with step_scale(all_predictors()) and step_center(all_predictors()). A crucial part of how the KNN model functions is determining the "nearest" data points to the unknown point with a distance equation. For two observations *a* and *b* with *n* predictor variables, the general distance equation is as follows:
$$\text{Distance} = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}$$
Because our variables span very different ranges, it is necessary to standardize them so that a variable with a large scale does not dominate the distance and skew the prediction.
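
The effect of standardizing on the distance can be seen in a small example. The sketch below is plain Python on made-up values: the first column is on a small scale (as a refractive index would be) and the second on a large scale (as a weight percent would be). The helper names `euclidean` and `standardize` are hypothetical, and the sample standard deviation is assumed as the scaling factor.

```python
import math
from statistics import mean, stdev


def euclidean(a, b):
    """Straight-line distance between two observations with the same predictors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Center every column at 0 and scale it to a standard deviation of 1."""
    columns = list(zip(*rows))
    centers = [mean(column) for column in columns]
    scales = [stdev(column) for column in columns]
    return [
        tuple((x - c) / s for x, c, s in zip(row, centers, scales))
        for row in rows
    ]


# Made-up toy rows: (small-scale variable, large-scale variable).
rows = [(1.51, 70.0), (1.53, 70.0), (1.51, 74.0), (1.53, 74.0)]

raw_01 = euclidean(rows[0], rows[1])  # differs only in the small-scale column, about 0.02
raw_02 = euclidean(rows[0], rows[2])  # differs only in the large-scale column, 4.0

scaled = standardize(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
print(raw_01, raw_02, scaled_01, scaled_02)
```

Before standardizing, a full-range change in the small-scale column barely registers next to the large-scale column; afterwards the two changes count equally.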

Following on, we cross-validate different values of k to determine the value that provides the best-fitting model. This helps guard against overfitting, where the model is too influenced by the training data and therefore cannot predict the testing data or new unknowns as accurately.
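
The selection of k can be sketched as follows. This is a minimal plain-Python illustration, not the report's pipeline: the number of folds (5), the candidate values of k, the unweighted majority vote, and the helper names `knn_predict` and `cv_accuracy` are assumptions, and the data are two randomly generated toy clusters rather than glass measurements. Standardization is left out to keep the example short.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to the query point."""
    nearest = sorted(
        train,
        key=lambda row: sum((x - q) ** 2 for x, q in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(rows, k, folds=5, seed=1):
    """Average accuracy of KNN over the folds, each fold held out once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    correct = 0
    for fold in range(folds):
        validation = shuffled[fold::folds]
        training = [row for i, row in enumerate(shuffled) if i % folds != fold]
        correct += sum(knn_predict(training, x, k) == y for x, y in validation)
    return correct / len(shuffled)


# Randomly generated toy data: two clusters labeled "A" and "B".
rng = random.Random(0)
rows = [((rng.gauss(0, 1), rng.gauss(0, 1)), "A") for _ in range(30)]
rows += [((rng.gauss(4, 1), rng.gauss(4, 1)), "B") for _ in range(30)]

accuracy_by_k = {k: cv_accuracy(rows, k) for k in (1, 3, 5, 7, 9)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

Every row is predicted by a model that never saw it during fitting, so the accuracy for each k estimates performance on new samples rather than on the training data.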

Finally, we build the KNN model with the value of k selected by cross-validation.

## Results

#more code or something

## Discussion


blah

## References

reference