<center><img src = "https://www.geoilenergy.com/images/software/index/geovariances_logo.png" width="20%"></center>

## **Downloading data and plotting scripts**

The `curl` command downloads the dataset used in the course from the GitHub repository. If you are in a Google Colaboratory session, you will also need to download the plotting script from Geovariances.


In [None]:
# Downloads dataset from GitHub
!curl -o phosphate_assay_sampled_geomet.csv https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/phosphate_assay_sampled_geomet.csv

# If you are in a Google Colab session, make sure to also download the GeoVariances module for plotting!
# !curl -o plotting_gv.py https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/plotting_gv.py

## **Importing four libraries:**

<details>
<summary><strong>libraries</strong></summary>

**Pandas:** Used for data manipulation and analysis.

**Numpy:** Used for scientific computing and working with arrays.

**Matplotlib:** Used for data visualization and creating plots.

**Plotting_gv:** A custom plotting library created by GV Americas, which contains additional plotting functions and custom styles.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotting_gv as gv

## **Upload the clustered data with Google Colab files**

In [None]:
if 'google.colab' in str(get_ipython()):
    from google.colab import files
    uploaded = files.upload()
    filename = list(uploaded.keys())[0]
else:
    filename = "phosphate_assay_sampled_geomet_clustered.csv"

## **Reading data with Pandas**

In [None]:
data = pd.read_csv(filename)

#data.head(5)
data.columns

# **Data preprocessing analysis: cleaning and processing**

## **Clean dataframe with `dataframe.dropna()`**

In [None]:
data0 = data.dropna()
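
Before dropping rows, it can help to check how many would be lost. The following is a minimal sketch on a toy DataFrame; the values are made up for illustration and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the column names mirror two course variables
toy = pd.DataFrame({
    "P2O5": [10.2, np.nan, 8.1, 12.4],
    "CAO": [15.0, 14.2, np.nan, 18.3],
})

# Count missing values per column before cleaning
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value
toy_clean = toy.dropna()

rows_removed = len(toy) - len(toy_clean)
print(missing_per_column)
print(f"rows removed: {rows_removed}")
```

The same two checks (`isna().sum()` and the difference in length) can be applied to `data` and `data0`.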

## **Declaring variables to filter data**

In [None]:
coords = ['X', 'Y', 'Z']

cat_var = ['ALT']

variables =  ['AL2O3', 'BAO', 'CAO', 'FE2O3', 'MGO', 'NB2O5', 'P2O5', 'SIO2', 'TIO2']

clusters = ['aggl_5k','kmeans_3k', 'aggl_3k', 'kmeans_4k', 'aggl_4k', 'kmeans_5k', 'kmeans_2k', 'aggl_2k', 'kmeans_6k', 'aggl_6k']

geomet = ["Reagente", "Recuperacao"]

## **Define the target to be supervised!**

In [None]:
target = 'kmeans_5k'

## **Splitting features (X) and target (y)**

In [None]:
X = data0[variables].values #declaring variable or features
y = data0[target].values # declaring what will be the target class of the model

## **Counting target categories**

In [None]:
data0[target].value_counts()

## **Features samples size**

In [None]:
len(X)

# **Split train, test samples with sklearn.model_selection**

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    X, # X features, independent variables
    y, # y target, dependent variable
    test_size=0.3, # fraction of the data held out for testing (the remaining 70% is used for training)
    shuffle=True, # shuffles the data before splitting, so the split does not depend on the original row order
    random_state=100, # random seed: ensures reproducibility, i.e., the data split will always be the same
    stratify=y) # keeps the same class proportions in the training and testing sets


In [None]:
print('Number of training samples:')
len(X_train)

In [None]:
print('Number of test samples:')
len(X_test)
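
To see what `stratify` does, the class proportions of both subsets can be compared after the split. This is a minimal sketch on synthetic, imbalanced labels (hypothetical data, not the course dataset), using the same split settings as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical): 80 samples of class 0, 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=True, random_state=100, stratify=y_toy
)

# Share of class 1 in each subset; with stratify both should stay close to 0.20
train_share = np.mean(y_tr == 1)
test_share = np.mean(y_te == 1)
print(f"train: {train_share:.2f}, test: {test_share:.2f}")
```

Without `stratify`, the shares of a rare class can drift noticeably between the two subsets, especially for small datasets.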

# **Data transformation: `StandardScaler()` using Sklearn.preprocessing**
<details>
<summary>

$$ z = \frac{x-\mu}{\sigma}$$
</summary>

Where $\mu$ is the mean of the training samples and $\sigma$ is their standard deviation. Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)


**Note: The transformation will be applied only on X/features!**
</details>
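
As a worked example of the formula, the sketch below standardizes a test value by hand and with `StandardScaler`, in both cases using statistics computed from the training values only. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test values
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Manual z-score using the training mean and standard deviation
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0), as used by StandardScaler
z_manual = (test - mu) / sigma

# Same transformation through scikit-learn, fitted on the training data only
scaler_toy = StandardScaler().fit(train)
z_sklearn = scaler_toy.transform(test)

print(z_manual, z_sklearn)
```

Fitting the scaler on the training data only keeps information from the test set out of the preprocessing step.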

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# We only need to transform the X, that is, the features where the models will be trained and validated
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Standardization puts all features on a comparable scale, which matters most for distance-based algorithms such as KNN and SVM.

## **Training models**

### **KNN Classifier**
Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [None]:
from sklearn.neighbors import KNeighborsClassifier    

In [None]:
nn = 50 #number of neighbors


knn = KNeighborsClassifier(
    n_neighbors=nn, # number of neighbors to be considered
    weights='distance', # weights each neighbor by the inverse of its distance, so closer neighbors have more influence
    p=2 # power parameter of the Minkowski metric: p=2 corresponds to the Euclidean distance
    )

knn.fit(X_train, y_train) # fitting the model to the training data

y_pred = knn.predict(X_test) #predicting values from the model on the test data
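
The effect of `weights='distance'` can be seen on a tiny example. In this sketch (hypothetical 1D data), one class-0 sample lies very close to the query point and two class-1 samples lie farther away, so a plain majority vote and a distance-weighted vote disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1D training set: one class-0 sample near the query, two class-1 samples farther away
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.1]])

# Uniform weights: each of the 3 neighbors counts as one vote
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform", p=2).fit(X_toy, y_toy)

# Distance weights: each vote is weighted by 1 / distance to the query
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance", p=2).fit(X_toy, y_toy)

pred_uniform = uniform.predict(query)[0]    # class 1 wins the majority vote (2 against 1)
pred_weighted = weighted.predict(query)[0]  # class 0 wins because its neighbor is much closer
print(pred_uniform, pred_weighted)
```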



## **Validating KNN model with confusion matrix and classification report**


In [None]:
gv.confusion_matrix_plot(knn, y_test, y_pred, f'Confusion Matrix KNN {target}', report=True )
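
`gv.confusion_matrix_plot` belongs to the course plotting module and its internals are not shown here. Assuming it reports the standard scikit-learn metrics, the underlying numbers can be sketched with `confusion_matrix` and `classification_report` on a few hypothetical labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Precision, recall and F1-score per class, plus overall accuracy
print(classification_report(y_true_toy, y_pred_toy, zero_division=0))
```

The diagonal of the matrix counts the correctly classified samples of each class; off-diagonal entries show which classes are confused with each other.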

## **SVM**

Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/svm.html)

## **Linear and RBF SVC: support vector classification**

In [None]:
from sklearn.svm import SVC

In [None]:
svm = SVC(
    kernel='linear', # kernel used to build the separating hyperplanes ('linear' or 'rbf' in this course)
    C=1, # regularization parameter: the higher C, the more strongly misclassified points are penalized (less regularization)
    gamma='scale', # kernel coefficient: used by the 'rbf' kernel and ignored by the 'linear' kernel
    class_weight='balanced', # weights classes inversely proportional to their frequencies (can also be passed manually as a dictionary)
    random_state=100, # reproducibility of results
    probability=True) # enables class probability estimates (predict_proba), at the cost of slower training
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
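
The cell above uses the linear kernel. As a rough illustration of when the RBF kernel helps, the sketch below compares both kernels on a synthetic dataset of two concentric circles (generated with `make_circles`, not the course data), where no straight line separates the classes.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: two concentric circles, not linearly separable
X_toy, y_toy = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

# Same C for both models; gamma only affects the RBF kernel
linear_svm = SVC(kernel="linear", C=1).fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf", C=1, gamma="scale").fit(X_tr, y_tr)

linear_acc = linear_svm.score(X_te, y_te)
rbf_acc = rbf_svm.score(X_te, y_te)
print(f"linear: {linear_acc:.2f}, rbf: {rbf_acc:.2f}")
```

On real data the difference is usually smaller, so it is worth validating both kernels on the test set before choosing one.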

## **Validating SVM model with confusion matrix and classification report**


In [None]:
gv.confusion_matrix_plot(svm, y_test, y_pred, f'Confusion Matrix SVM {target}', report=True)

## **Decision Trees**

Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/tree.html)

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
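tree = DecisionTreeClassifier(
    random_state=100, # reproducibility of results
    criterion='gini', # impurity measure used to evaluate the splits
    max_depth=8, # maximum depth of the tree
    min_samples_split=100 # minimum number of samples a node must have to be split
)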

tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
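
`max_depth` and `min_samples_split` limit how far the tree can grow. The sketch below, on synthetic data from `make_classification` (not the course dataset), compares a tree with the settings used above against one with no constraints.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data for illustration only
X_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=100
)

# Constrained tree: same hyperparameters as in the cell above
constrained = DecisionTreeClassifier(
    max_depth=8, min_samples_split=100, random_state=100
).fit(X_toy, y_toy)

# Unconstrained tree: grows until the leaves are pure
unconstrained = DecisionTreeClassifier(random_state=100).fit(X_toy, y_toy)

print("constrained:", constrained.get_depth(), constrained.get_n_leaves())
print("unconstrained:", unconstrained.get_depth(), unconstrained.get_n_leaves())
```

A smaller tree is easier to interpret and less prone to memorizing the training data, at the cost of some flexibility.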


## **Validating Decision Trees model with confusion matrix and classification report**


In [None]:
gv.confusion_matrix_plot(tree, y_test, y_pred, f'Confusion Matrix Decision Tree {target}', report=True )


## **Random Forests**

Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
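rf = RandomForestClassifier(
    n_estimators=300, # number of trees in the forest
    max_depth=8, # maximum depth of each tree
    min_samples_split=5, # minimum number of samples a node must have to be split
    random_state=100, # reproducibility of results
)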

rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)

## **Validating Random Forest model with confusion matrix and classification report**


In [None]:
gv.confusion_matrix_plot(rf, y_test, y_pred, f'Confusion Matrix RandomForest {target}', report=True )


## **Which input features are most important in predicting the target variable for Random Forest Model**
<details>
<summary>Note:</summary> Feature importance provides a way to identify which features have the most predictive power for a given target variable, and can be useful for optimizing model performance or gaining insights into the relationships between features and the target variable.
</details>
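
How `gv.features_importance` computes its values is not shown in this notebook. As a sketch of two common approaches in scikit-learn, the example below uses synthetic data in which only the first two features are informative, and compares the impurity-based `feature_importances_` attribute with `permutation_importance` evaluated on held-out data. The feature names `f0` to `f4` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 2 informative features in columns 0 and 1
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=100,
)
names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical feature names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)
rf_toy = RandomForestClassifier(n_estimators=100, random_state=100).fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data, sums to 1
impurity = rf_toy.feature_importances_

# Permutation importance: drop in test score when one feature is shuffled
perm = permutation_importance(rf_toy, X_te, y_te, n_repeats=5, random_state=100)

ranking = sorted(zip(names, perm.importances_mean), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```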

In [None]:
gv.features_importance(rf, X_test, variables, y_test, clf=True)

## **Model evaluation with K-Folds**
<details>
<summary>
The purpose of this plot is to visualize the performance of a model when evaluated with a k-fold cross-validation strategy.</summary>

The x-axis represents the different folds used in the cross-validation (1 to k), while the y-axis represents the performance metric chosen to evaluate the model.

Each box in the plot represents the distribution of scores obtained for the corresponding fold. 

This plot can help to understand the variability of the model's performance across different folds, and whether the model is overfitting or underfitting.

If the performance is consistent across all folds, the model is likely to generalize well to new data. 

If the performance is highly variable, the model may need to be improved or re-evaluated with a different strategy.
</details>
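
The signature of `gv.evaluate_kfolds` is specific to the course module. A comparable evaluation can be sketched with scikit-learn alone, here with a stratified 10-fold split and the `balanced_accuracy` metric on synthetic data (an assumption for illustration, not the course dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic classification data for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=100)

# 10 stratified folds: each fold keeps the class proportions of the full set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)

# One score per fold; the model is refitted on the other 9 folds each time
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=100),
    X_toy, y_toy, cv=cv, scoring="balanced_accuracy",
)

print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A small standard deviation across folds suggests stable performance; a large one suggests the score depends strongly on which samples end up in each fold.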

In [None]:
gv.evaluate_kfolds(X_train, y_train, 10, 5, rf, classcore='balanced_accuracy')

# **Practice**

<details>
    <summary><strong><u>Supervised learning process</u></strong>
    </summary>

1) In this exercise, you will reproduce the supervised learning process presented in the notebook, but with a new set of variables!
<details>
    <summary>&#128161;</summary>
  
new_variables =  [...]

X = data0[new_variables].values #declaring variable or features
  
</details>

In [None]:
## code


<details>
    <summary><strong><u>Statistical analysis</u></strong>
    </summary>

2) Perform a statistical analysis of the data to understand the distributions and their correlations.

<details>
    <summary>&#128161;</summary>
  
Use  "scatter matrix" and "correlation matrix"

</details>  

In [None]:
## code


<details>
    <summary><strong><u>Defining features</u></strong>
    </summary>
    
3) Define the features to be used for training the model and your geometallurgical target variable.

<details>
    <summary>&#128161;</summary>

new_target = [...]

y = data0[new_target].values # declaring what will be the target class of the model  

Note: try not to use the same target variable as the one used in the group exercise, so that you obtain different results.
</details>

In [None]:
## code



<details>
    <summary><strong><u>Counting categories</u></strong>
    </summary>
    
4) Count the categories of the target variable.
<details>
    <summary>&#128161;</summary>

...value_counts()

</details>

In [None]:
## code





<details>
    <summary><strong><u>Train, test set split!</u></strong>
    </summary>
    
5) Split the data into train and test sets, taking into account the discussion of your analysis.
<details>
    <summary>&#128161;</summary>

X_train, X_test, y_train, y_test = train_test_split.....

</details>

In [None]:
## code

 
<details>
    <summary><strong><u>Standardization</u></strong>
    </summary>
    
6) Preprocess the data by applying standardization.
<details>
    <summary>&#128161;</summary>

with **StandardScaler()**.

Remember... only for your features! 

And... don't forget to apply it to your test and train variables.

</details>


In [None]:
## code

 
<details>
    <summary><strong><u>Choosing algorithms</u></strong>
    </summary>
    
7) Choose one of the algorithms covered in this notebook, train your model, and then validate it!
<details>
    <summary>&#128161;</summary>

Remember... training the model is done only on your training data, while the validations are performed on the test data!

</details>


In [None]:
## code

<details>
    <summary><strong><u>Confusion Matrix and Classification report</u></strong>
    </summary>
    
8) Plot a confusion matrix and the corresponding classification report for your model.
<details>
    <summary>&#128161;</summary>

gv.confusion_matrix_plot(...)

</details>

In [None]:
## code