## Downloading data and plotting scripts

The `curl` command below downloads the dataset used in the course from the course repository. If you are working in a Google Colaboratory session, you will also need to download the plotting scripts from Geovariances.


In [None]:
# Downloads dataset from GitHub
!curl -o phosphate_assay_sampled_geomet.csv https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/phosphate_assay_sampled_geomet.csv

# If you are in a Google Colab session, make sure to also download the GeoVariances module for plotting!
#!curl -o plotting_gv.py https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/plotting_gv.py

# Download the StandardScaler model
#!curl -o std_scaler.bin https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/std_scaler.bin

## Importing four libraries:

**Pandas**: used for data manipulation and analysis.

**Numpy**: used for scientific computing and working with arrays.

**Matplotlib**: used for data visualization and creating plots.

**Plotting_gv**: a custom plotting library created by GV Americas, which contains additional plotting functions and custom styles.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotting_gv as gv

## Reading data with Pandas

In [None]:
data = pd.read_csv('phosphate_assay_sampled_geomet.csv')

data

# Data preprocessing analysis: cleaning and processing

## Clean dataframe with `dataframe.dropna()`

In [None]:
data0 = data.dropna()

data0.shape[0]
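Before dropping rows, it can help to check how many values are missing in each column. The following is a minimal sketch on a toy DataFrame; the column values are made up for illustration and are not the course data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the assay table (illustrative values only)
toy = pd.DataFrame({
    "P2O5": [10.2, np.nan, 8.7, 12.1],
    "CAO": [15.0, 14.2, np.nan, 16.3],
    "MGO": [1.1, 0.9, 1.4, 1.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# dropna() removes every row that contains at least one missing value
toy_clean = toy.dropna()

print(missing_per_column)
print("rows before:", len(toy), "rows after:", len(toy_clean))
```

Note that `dropna()` with its default arguments discards a whole row even if only one column is missing, so the row count can shrink quickly on sparse tables.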

## Declaring variables to filter data

In [None]:
coords = ['x', 'y', 'z']

lito_var = ['ALT']

variables =  ['AL2O3', 'CAO', 'FE2O3', 'MGO',  'P2O5', 'SIO2', 'TIO2', 'NB2O5', 'BAO']

gmt = ['Consumo_coletor_(g/t)', 'MASSA_T']

## Flagging outliers with `gv.flag_outliers()`

In [None]:
gv.flag_outliers(data, 'NB2O5', remove_outliers=False)
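The internals of `gv.flag_outliers()` are not shown in this notebook. As an illustration of the general idea only, the sketch below flags values with Tukey's interquartile-range rule; the helper name `flag_outliers_iqr`, the factor `k=1.5`, and the sample values are assumptions, and the course function may use a different criterion.

```python
import pandas as pd

def flag_outliers_iqr(series, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Illustrative values: one sample is far above the rest
values = pd.Series([0.10, 0.12, 0.11, 0.13, 0.12, 0.95], name="NB2O5")

is_outlier = flag_outliers_iqr(values)
print(values[is_outlier])
```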

# Exploratory data analysis

## Scatter matrix with `gv.scatter_matrix()`

In [None]:
gv.scatter_matrix(data[variables+gmt], figsize=(30,30))

## Correlation Matrix with `gv.correlation_matrix()`

In [None]:
gv.correlation_matrix(data[variables+gmt], fsize=(15,15))
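The numbers behind a correlation matrix can also be inspected directly with pandas. This is a minimal sketch on synthetic data (columns `A`, `B`, `C` are invented for the example), assuming the Pearson coefficient; whether `gv.correlation_matrix()` uses the same coefficient is not stated here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

toy = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + rng.normal(scale=0.1, size=200),  # strongly tied to A
    "C": rng.normal(size=200),                       # generated independently of A
})

# Pairwise Pearson correlation between all columns
corr = toy.corr(method="pearson")
print(corr.round(2))
```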

## Splitting features (X) and target (y)

In [None]:
X = data0[variables].values  # features (independent variables)
y = data0[gmt[0]]  # target: gmt[0] or gmt[1]

## Split train, test samples with `sklearn.model_selection`

In [None]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    X,  # X features, independent variables
    y,  # y target, dependent variable
    test_size=0.3,  # fraction of the data held out for testing (the remaining 70% is used for training)
    shuffle=True,  # shuffle the data before splitting so the split is not biased by the sample order
    random_state=100,  # random seed: makes the results reproducible, i.e. the split is always the same
)


In [None]:
print('Number of training samples:')
len(X_train)

In [None]:
print('Number of test samples:')
len(X_test)
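The effect of `test_size` and `random_state` can be checked on a small synthetic array. A minimal sketch, using made-up data of 100 samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)  # 100 samples, 2 features
y_toy = np.arange(100)

# Two splits with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=True, random_state=100)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, shuffle=True, random_state=100)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 70 training samples, 30 test samples

# The same seed yields the same partition
print(np.array_equal(split_a[3], split_b[3]))
```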

# Data transformation: `StandardScaler()` from `sklearn.preprocessing`

$$ z = \frac{x-\mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the training samples, computed separately for each feature. Documentation can be found on the [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [None]:
from sklearn.preprocessing import StandardScaler
from joblib import load

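# Requires std_scaler.bin (see the optional download in the first cell)
scaler = load("std_scaler.bin")

# Only X needs to be transformed, i.e. the features on which the models are trained and validated
# fit() re-estimates the mean and standard deviation from the training data only
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)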
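To connect the formula to the code, the sketch below applies $z = (x-\mu)/\sigma$ by hand and compares the result with `StandardScaler`. The arrays are toy values chosen for the example; the statistics come from the training array only and are then reused on the test array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_te = np.array([[2.5, 25.0]])

# Fit on the training data only
scaler = StandardScaler().fit(X_tr)

# Manual version of the same transformation
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0), as used by StandardScaler

z_manual = (X_te - mu) / sigma
z_sklearn = scaler.transform(X_te)

print(z_manual)
print(z_sklearn)
```

Fitting on the training set alone keeps information about the test samples out of the model inputs.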

## KNN Regressor

Documentation can be found on [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

In [None]:
from sklearn.neighbors import KNeighborsRegressor


In [None]:
nn = 50  # number of neighbors


knn = KNeighborsRegressor(
    n_neighbors=nn,  # number of neighbors to consider
    weights='distance',  # how the neighbors are weighted: here by the inverse of their distance, so closer samples count more
    p=2,  # p=2 selects the Euclidean distance as the metric
)

knn.fit(X_train, y_train)  # fit the model on the training data

y_pred = knn.predict(X_test)  # predict the values for the test data with the fitted model
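With `weights='distance'`, each prediction is an average of the neighbors' target values weighted by the inverse of their distance. The sketch below reproduces one prediction by hand on a toy one-dimensional dataset (values invented for the example) and compares it with scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_tr = np.array([[0.0], [1.0], [3.0], [6.0]])
y_tr = np.array([10.0, 20.0, 30.0, 40.0])
x_new = np.array([[2.0]])

knn_toy = KNeighborsRegressor(n_neighbors=3, weights="distance", p=2).fit(X_tr, y_tr)

# Manual inverse-distance weighting over the 3 nearest neighbors
dist = np.linalg.norm(X_tr - x_new, axis=1)  # Euclidean distances to every training sample
nearest = np.argsort(dist)[:3]               # indices of the 3 closest samples
w = 1.0 / dist[nearest]                      # inverse-distance weights
manual = np.sum(w * y_tr[nearest]) / np.sum(w)

print(manual, knn_toy.predict(x_new)[0])
```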

## Validating KNN regressor model with `gv.validate_regression()`

It compares the model predictions with the true values and evaluates how well the model is making predictions.

Regression validation is an essential part of data analysis and machine learning model development. It is a powerful tool that helps to evaluate the quality and performance of models, enabling you to make adjustments and improvements to obtain more accurate predictions.

The plot also reports statistics that summarize the performance of the regression model. These include:

*   mean absolute error (MAE): the average of the absolute differences between each prediction and the corresponding true value;
*   mean squared error (MSE): the average of the squared differences, which penalizes larger errors more heavily;
*   coefficient of determination (R²): the proportion of the variance in the true values that the model explains.


The graph "grade x error" allows you to see how the errors are distributed across the range of true values, and to identify any patterns or trends in the errors.

In [None]:
gv.validate_regression(y_pred, y_test, title='Validating KNN Regressor')
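The three statistics listed above can also be computed directly with `sklearn.metrics`. A minimal sketch on four made-up values; how `gv.validate_regression()` computes them internally is not shown here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 12.0])

# scikit-learn metrics take the true values first, then the predictions
mae = mean_absolute_error(y_true, y_hat)  # mean of |error|
mse = mean_squared_error(y_true, y_hat)   # mean of error squared
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

print(mae, mse, r2)
```

Here the single error of 3 contributes 9 of the 11 units of squared error, which illustrates how MSE penalizes large errors more than MAE does.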

## SVM

Documentation can be found on [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

## Linear and RBF SVR: support vector regression

In [None]:
from sklearn.svm import SVR

In [None]:
svm = SVR(
    kernel='rbf',  # kernel used to build the regression function, e.g. 'linear' or 'rbf'
    C=1,  # regularization parameter: the larger C is, the more strictly points outside the margin are penalized
    gamma='scale',  # kernel coefficient; relevant for the 'rbf' kernel and ignored by 'linear'
)
svm.fit(X_train,y_train)

y_pred = svm.predict(X_test)
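To illustrate the difference between the linear and RBF kernels, the sketch below fits both to a synthetic nonlinear signal (a noisy sine curve invented for the example) and compares their R² on the same data.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVR(kernel=kernel, C=1, gamma="scale").fit(X_toy, y_toy)
    scores[kernel] = model.score(X_toy, y_toy)  # R² on the data used for fitting

print(scores)
```

On a curved relationship like this one, the RBF kernel can follow the shape of the signal while the linear kernel is limited to a straight line.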

## Validating SVM regressor model with `gv.validate_regression()`


In [None]:
gv.validate_regression(y_pred, y_test, title='Validating SVM Regressor')

## Decision Tree Regressor

Documentation can be found on [`scikit-learn`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree = DecisionTreeRegressor(
    random_state=100,
    max_depth=8,
    min_samples_split=100,
)

tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)


## Validating DTrees regressor model with `gv.validate_regression()`

In [None]:
gv.validate_regression(y_pred, y_test, title='Validating Decision Tree Regressor')

## Random Forest Regressor

Documentation can be found on [scikit-learn website](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(
    n_estimators=50,
    max_depth=6,
    min_samples_split=10,
    random_state=100,
)

rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)

## Validating RF regressor model with `gv.validate_regression()`


In [None]:
gv.validate_regression(y_pred, y_test, title='Validating RF Regressor')

## Which input features matter most for predicting the target with the Random Forest model?

Note: Feature importance provides a way to identify which features have the most predictive power for a given target variable, and can be useful for optimizing model performance or gaining insights into the relationships between features and the target variable.

In [None]:
gv.features_importance(rf, X_test, variables, y_test, clf=False)
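scikit-learn offers two common ways to estimate feature importance for a fitted forest: the impurity-based `feature_importances_` attribute and `permutation_importance`. Which of these `gv.features_importance()` relies on is not stated here, so the sketch below simply shows both on synthetic data where only the first feature drives the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

rf_toy = RandomForestRegressor(n_estimators=50, random_state=100).fit(X_toy, y_toy)

# Impurity-based importances: non-negative and sum to 1
impurity_imp = rf_toy.feature_importances_

# Permutation importances: drop in score when one feature is shuffled
perm = permutation_importance(rf_toy, X_toy, y_toy, n_repeats=5, random_state=100)

print(impurity_imp)
print(perm.importances_mean)
```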

## Model evaluation with K-Folds

The purpose of this plot is to visualize the performance of a model when evaluated with a k-fold cross-validation strategy.

The x-axis represents the different folds used in the cross-validation (1 to k), while the y-axis represents the performance metric chosen to evaluate the model.

Each box in the plot represents the distribution of scores obtained for the corresponding fold. 

This plot can help to understand the variability of the model's performance across different folds, and whether the model is overfitting or underfitting.

If the performance is consistent across all folds, the model is likely to generalize well to new data. 

If the performance is highly variable, the model may need to be improved or re-evaluated with a different strategy.

In [None]:
gv.evaluate_kfolds(X_train, y_train, 10, 5, rf, clf=False)
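A comparable evaluation can be sketched with scikit-learn alone. The meaning of the positional arguments `10` and `5` in `gv.evaluate_kfolds()` is not documented in this notebook; the example assumes 10 folds repeated 5 times and R² as the metric, and uses synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=200)

model = RandomForestRegressor(n_estimators=20, max_depth=6, random_state=100)

# Assumed setup: 10 folds, repeated 5 times with different shuffles
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=100)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")

# One score per fold and repetition; the spread indicates how stable the model is
print(scores.mean(), scores.std())
```

A small standard deviation relative to the mean suggests that performance does not depend strongly on which samples end up in each fold.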

# Practice

In this exercise, you will reproduce the supervised learning process presented in the notebook, but with a new set of variables!

Perform a statistical analysis of the data using a scatter matrix and a correlation matrix to understand the distributions and their correlations.

In [None]:
## code

Define the features to be used for training the model and your geometallurgical target variable.

Note: try not to use the same target variable as the one used in the group exercise to obtain different tests.

In [None]:
## code

Preprocess the data by applying standardization with **StandardScaler()**.

Remember... only for your features! 

And... don't forget to apply it to your test and train variables.


In [None]:
## code

Choose one of the algorithms worked on and explained, and train your model, then perform its validations! 

Remember... training the model is done only on your training data, while the validations are performed on the test data!


In [None]:
## code

Plot a regression validation of your model with **gv.validate_regression()**.

In [None]:
## code