<a href="https://colab.research.google.com/github/laurabrin/Classifying-EMPA-mineral-data/blob/adding-input/Classifying_mineral_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Electron Microprobe Mineral Data Using KNN and Decision Tree Algorithms

This project is for the CMPT 3830: ML Work Integrated Project course at NorQuest College and was completed by Laura Brin.

The goal of this project is to build KNN and Decision Tree models that assign a mineral classification to Electron Microprobe (EMP) data.

Part 1 creates the two models and optimizes them for recall. Part 2 assesses whether the KNN or the decision tree classification algorithm is better at correctly labelling EMP mineral data.

Loading libraries and data needed

In [None]:
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

from sklearn.tree import DecisionTreeClassifier 
from sklearn.preprocessing import OneHotEncoder

from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus



In [None]:
df_emp7030 = pd.read_csv("/content/70-30-EMPA.csv")

### The Data

Electron microprobe analysis (EMPA) is a non-destructive technique commonly used to determine mineral composition. A beam of electrons is directed at a mineral sample, exciting electrons in the sample's atoms. When the electrons return to their original state, the resulting X-rays are collected, measured, and assigned to particular elements. The concentration of minor elements in a mineral's chemical composition can help tell the story of how a particular rock formed (for example, its temperature and pressure).

In this project we will use the measurements of 12 element oxides to classify each EMPA sample as one of 7 minerals: olivine, garnet, clinopyroxene, orthopyroxene, nickel oxide, chromite and spinel. Minerals have a defined chemical structure that should lend itself well to classification. In manual classification, the ratio between major elements (Si, Mg, Fe, Al) determines the mineral assignment. I am interested to see whether either model over-emphasizes the importance of minor elements, which would almost certainly lead to overfitting and poor accuracy.

Mineral assignment codes: no=nickel oxide, chr=chromite, cpx=clinopyroxene, grt=garnet, ol=olivine, opx=orthopyroxene, sp=spinel


In [None]:
df_emp7030.head(15)

In [None]:
df_emp7030.dtypes

In [None]:
df_emp7030['Mineral'].unique()

In [None]:
# Fix a label with trailing whitespace and encode the train/test flag as 0/1.
df_emp7030.replace(to_replace="opx ", value="opx",inplace=True)
df_emp7030.replace(to_replace="train", value=0,inplace=True)
df_emp7030.replace(to_replace="test", value=1,inplace=True)

In [None]:
df_emp7030.drop(df_emp7030[df_emp7030["Name"]=="P1-17"].index, inplace=True)

In [None]:
oxides=["SiO2","TiO2","Al2O3","V2O3","Cr2O3","MgO","CaO","MnO","FeO","NiO","Na2O","K2O","Total"]
oxides_striped=["SiO2","Al2O3","V2O3","Cr2O3","MgO","CaO","FeO","NiO"]

In [None]:
df_emp7030.describe()

In [None]:
# Negative concentrations are not physically meaningful, so clip them to 0.
for index in oxides:
  df_emp7030[index]=df_emp7030[index].clip(lower=0)
df_emp7030.head(15)

In [None]:
# Keep only the analyses whose oxide total is above 90.
mask1=df_emp7030["Total"].values>90
df_emp7030=df_emp7030.loc[mask1]


In [None]:
df_emp7030.boxplot(column="SiO2")

In [None]:
df_emp7030["Mineral"].describe()

### KNN

In [None]:
df_knn=df_emp7030.copy()
# Columns not used as features: the label, identifiers, minor oxides and the total.
drop_cols=["Mineral","Name","Date","dataset","MnO","Na2O","K2O","TiO2","Total"]

# dataset==0 marks training rows and dataset==1 marks test rows.
X_train=df_knn.loc[df_knn["dataset"]==0].drop(drop_cols,axis=1)
X_test=df_knn.loc[df_knn["dataset"]==1].drop(drop_cols,axis=1)

y_train=df_knn.loc[df_knn["dataset"]==0,"Mineral"]
y_test=df_knn.loc[df_knn["dataset"]==1,"Mineral"]

In [None]:
X_train.head(10)

In [None]:
knn=KNeighborsClassifier(n_neighbors=7)

In [None]:
knn.fit(X_train,y_train)
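
KNN classifies by distance, so an oxide with a wide range of values (such as SiO2) can outweigh one with a narrow range (such as NiO). `StandardScaler` is imported above but not applied to the features. The following is a minimal sketch of how scaling could be combined with KNN in a pipeline. The data is synthetic, and the names `demo_X`, `demo_y` and `scaled_knn` are assumptions made for illustration, not part of this project.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the oxide table (values are made up, not real EMPA data).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "SiO2": np.concatenate([rng.normal(40, 1, 30), rng.normal(55, 1, 30)]),
    "NiO": np.concatenate([rng.normal(0.4, 0.05, 30), rng.normal(0.1, 0.05, 30)]),
})
demo_y = np.array(["ol"] * 30 + ["opx"] * 30)

# The pipeline standardizes each column using statistics from the fitted data,
# then passes the scaled values to KNN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(demo_X, demo_y)

# Predict the first and last synthetic rows.
demo_pred = scaled_knn.predict(demo_X.iloc[[0, 59]])
print(demo_pred)
```

Whether scaling helps on the real data would need to be checked by comparing test scores with and without it.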

In [None]:
y_predict=knn.predict(X_test)


In [None]:
# Recall is macro-averaged; precision and F1 are weighted by class support.
print("knn accuracy score:", accuracy_score(y_test,y_predict))
print("knn recall score:", recall_score(y_test,y_predict, average="macro"))
print("knn precision score:", precision_score(y_test,y_predict, average="weighted"))
print("knn f1 score:", f1_score(y_test,y_predict,average="weighted"))
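
Part 1 aims to optimize the models for recall, while the cell above uses a single fixed `n_neighbors=7`. One possible way to search over neighbor counts is a cross-validated grid search scored on macro recall. The sketch below uses synthetic data in place of `X_train` and `y_train`; the candidate values in `param_grid` are illustrative assumptions, not tuned values from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for X_train / y_train (assumption).
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Candidate neighbor counts are illustrative only.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="recall_macro",  # same averaging as recall_score(..., average="macro")
    cv=cv,
)
search.fit(demo_X, demo_y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("cross-validated macro recall:", round(search.best_score_, 3))
```

The same pattern would apply to the decision tree by swapping in `DecisionTreeClassifier()` and a grid such as `{"min_samples_leaf": [1, 3, 5]}`.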

### Decision Tree

In [None]:
df_tree=df_emp7030.copy()
# Columns not used as features: the label, identifiers, minor oxides and the total.
drop_cols=["Mineral","Name","Date","dataset","MnO","Na2O","K2O","TiO2","Total"]

# dataset==0 marks training rows and dataset==1 marks test rows.
X_train2=df_tree.loc[df_tree["dataset"]==0].drop(drop_cols,axis=1)
X_test2=df_tree.loc[df_tree["dataset"]==1].drop(drop_cols,axis=1)

y_train2=df_tree.loc[df_tree["dataset"]==0,"Mineral"]
y_test2=df_tree.loc[df_tree["dataset"]==1,"Mineral"]

In [None]:
tree=DecisionTreeClassifier(min_samples_leaf=3)

In [None]:
tree.fit(X_train2,y_train2)

In [None]:
y_predict2=tree.predict(X_test2)

In [None]:
print("tree accuracy score:", accuracy_score(y_test2,y_predict2))
print("tree recall score:", recall_score(y_test2,y_predict2, average="macro"))
print("tree precision score:", precision_score(y_test2,y_predict2, average="weighted"))
print("tree f1 score:", f1_score(y_test2,y_predict2,average="weighted"))
print(classification_report(y_test2,y_predict2))
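
The data section asks whether a model over-emphasizes minor elements. For a fitted decision tree, the `feature_importances_` attribute gives one way to look at this: it reports each feature's share of the total impurity reduction. The sketch below uses made-up data in which only SiO2 carries signal; `demo_X`, `demo_y` and `demo_tree` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the label depends only on SiO2; NiO is pure noise.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    "SiO2": rng.uniform(35, 60, 200),
    "NiO": rng.uniform(0, 0.5, 200),
})
demo_y = np.where(demo_X["SiO2"] > 47, "opx", "ol")

demo_tree = DecisionTreeClassifier(min_samples_leaf=3, random_state=0)
demo_tree.fit(demo_X, demo_y)

# Importances sum to 1; a large share on a minor oxide would be worth a closer look.
importances = pd.Series(demo_tree.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```

Applied to the real model, the equivalent call would be `pd.Series(tree.feature_importances_, index=X_train2.columns)`.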

In [None]:
#Tree visualization
dot_data = StringIO()
export_graphviz(tree, out_file=dot_data, feature_names=oxides_striped, class_names= tree.classes_, 
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
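
If Graphviz or `pydotplus` is not available in the environment, a text rendering of the tree is a lighter alternative. The sketch below uses `sklearn.tree.export_text` on the built-in iris data, which is only a stand-in for the mineral data; `demo_tree` and `rules` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is used only as a small built-in stand-in for the mineral data.
iris = load_iris()
demo_tree = DecisionTreeClassifier(min_samples_leaf=3, random_state=0)
demo_tree.fit(iris.data, iris.target)

# export_text returns the split rules as an indented string.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

For the model in this notebook, the corresponding call would be `export_text(tree, feature_names=list(X_train2.columns))`.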

### Model Assessments and visualizations

In [None]:
# Named df_input so the built-in input() function is not shadowed.
df_input = pd.read_csv("/content/input_test_set.csv")

In [None]:
df_input.dropna(axis=0, how="all",inplace=True)

In [None]:
df_input.head()

In [None]:
for index in oxides:
  df_input[index]=df_input[index].clip(lower=0)


In [None]:
X_input=df_input.drop(["Mineral","Name","Date","dataset","MnO","Na2O","K2O","TiO2","Total"],axis=1)
# The first 8 rows are held apart in the *_UK variables; the remaining rows are scored below.
X_input_UK=X_input[:8]
X_input=X_input[8:]
y_input=df_input["Mineral"][8:]
y_input_UK=df_input["Mineral"][:8]


In [None]:
y_input.shape

In [None]:
# Reassign rather than fill in place, because X_input is a slice of a larger frame.
X_input=X_input.fillna(0)

In [None]:
input_predict_knn=knn.predict(X_input)
input_predict_tree=tree.predict(X_input)


In [None]:
input_accuracy_knn=accuracy_score(y_input,input_predict_knn)
input_accuracy_tree=accuracy_score(y_input,input_predict_tree)
print("knn:",input_accuracy_knn)
print("tree:",input_accuracy_tree)

In [None]:
# Fraction of samples on which the two models give the same label.
input_accuracy_compare=accuracy_score(input_predict_knn,input_predict_tree)
print("agreement between knn and tree predictions:",input_accuracy_compare)
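
Overall accuracy can hide which minerals each model confuses. Since Part 2 compares the two models on correct labelling, a per-mineral view may help. The sketch below builds a confusion matrix and per-class recall from hypothetical labels and predictions; none of the values are results from this project.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions; not results from this project.
labels = ["chr", "ol", "opx"]
y_true = ["ol", "ol", "opx", "opx", "chr", "chr"]
pred_knn = ["ol", "ol", "opx", "ol", "chr", "chr"]
pred_tree = ["ol", "opx", "opx", "opx", "chr", "chr"]

# Rows are true minerals, columns are predicted minerals.
cm_knn = pd.DataFrame(confusion_matrix(y_true, pred_knn, labels=labels),
                      index=labels, columns=labels)
print(cm_knn)

# average=None returns one recall value per mineral, in the order of `labels`.
per_class = pd.DataFrame({
    "knn": recall_score(y_true, pred_knn, labels=labels, average=None),
    "tree": recall_score(y_true, pred_tree, labels=labels, average=None),
}, index=labels)
print(per_class)
```

With the notebook's own variables, `y_input`, `input_predict_knn` and `input_predict_tree` would take the place of the hypothetical lists.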

### Part 3: New Data

### Import new data

Save the completed new data template as New_data_template.xlsx in the /content folder of this Colab notebook so that the cell below can read it. If you prefer a different file name, change the path in the cell below to match.

In [None]:
xls = pd.ExcelFile('/content/New_data_template.xlsx')
df_new=pd.read_excel(xls, 'Format')
df_new=df_new.loc[~(df_new==0).all(axis=1)]
new_x=df_new.drop(["Name","MnO","Na2O","K2O","TiO2","Total"],axis=1)
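
`knn.predict` expects the same feature columns, in the same order, that the model was trained on. A small check before predicting can catch a template whose columns were renamed or reordered. The sketch below is one possible approach; `check_feature_columns` and `EXPECTED_FEATURES` are hypothetical names, and the column order is assumed to match the training data.

```python
import pandas as pd

# Assumed training feature order (same names as oxides_striped above).
EXPECTED_FEATURES = ["SiO2", "Al2O3", "V2O3", "Cr2O3", "MgO", "CaO", "FeO", "NiO"]

def check_feature_columns(frame, expected=EXPECTED_FEATURES):
    """Hypothetical helper: return `frame` with columns in the expected order,
    raising a readable error if any expected column is missing."""
    missing = [col for col in expected if col not in frame.columns]
    if missing:
        raise ValueError(f"New data is missing columns: {missing}")
    return frame[expected]

# Made-up row with the columns deliberately out of order.
demo_new = pd.DataFrame([{"MgO": 49.0, "SiO2": 40.5, "FeO": 9.5, "NiO": 0.4,
                          "Al2O3": 0.0, "V2O3": 0.0, "Cr2O3": 0.0, "CaO": 0.1}])
demo_checked = check_feature_columns(demo_new)
print(list(demo_checked.columns))
```

In this notebook the helper could be applied as `new_x = check_feature_columns(new_x)` before calling `knn.predict`.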


In [None]:
new_predict=knn.predict(new_x)
new_predict

In [None]:
# Attach predictions by position; df_new's index may have gaps after the zero-row filter.
print_df=df_new.assign(Mineral=new_predict)
print_df

In [None]:
print_df.to_excel("New_data_assigned.xlsx",sheet_name='Mineral Assign')

Retrieve the mineral assignments from the New_data_assigned.xlsx spreadsheet.