# Milestone 1: Exploratory Data Analysis

**Authors**: __Khizer Zakir & Rodrigo Brust Santos__

__October 2023__

# Table of Contents

1 - [Data Description]()

    1.1  [Data Collection]()

    1.2 [Meaning of the Dataset]()

2 - [Data Preparation]()

    2.1 [Data Pre-Processing]()

    2.2 [Cleaning Tidy Data]()

    2.3 [Data Splitting]()

    2.4 [Data Normalization]()

    2.5 [Explanatory Variables]()

    2.6 [Response Variable]()

3 - [Visual Exploration]()

    3.1 

4 - [Statistical Exploration]()

    4.1 [Descriptive Statistics]()

    4.2 [Correlation Analysis]()

    4.3 [Spatial Correlation Analysis]()

5 - [Evaluation Protocol]()

6 - [References]()


<a name="dataset-description"></a>
## 1. Dataset Description 

**Objective**: Prediction/Interpolation of elemental concentrations

### 1.1 Data Collection

The data was obtained from the Brazilian Geological Survey (CPRM), from the project folder of the [Distrito Zincífero de Vazante, Minas Gerais (MG), Brasil](https://rigeo.cprm.gov.br/handle/doc/19397) (_Zinc-enriched District of Vazante, Minas Gerais State, Brazil_), acronym `DZV`. The stream sediments dataset was requested from CPRM by e-mail and promptly made available.

The dataset explored here is the Stream Sediments Geochemical Samples from the DZV, a program conducted in the northwest region of Minas Gerais State, Brazil, and published in 2017 to aid exploration and development of the mineral industry focused on **Zinc ore**.

`Stream sediments` are sediment samples collected from a stream, river, or other water body. They are the product of physical and chemical weathering of rocks, carried into the drainage channel by surface runoff (Rose et al., 1979). Such material is fundamental for targeting mineral ore bodies, and it has been used by humankind for thousands of years (Ottesen and Theobald, 1994; Doherty et al., 2023).


### 1.2 Meaning of the Dataset

#### Context

The `DZV` lies within the Brasília Belt, more specifically in its center-south part (Almeida, 1967 & 1968). This geological terrain was formed in the Brasiliano Cycle, during the Neoproterozoic (1 Ga to 500 Ma), when the São Francisco, Amazonas, and Paranapanema cratons collided. This collision was one of many that created the large southern paleocontinent Gondwana (Valeriano et al., 2004; Pimentel et al., 2011).

Cratons are solid and stable parts of the continental lithosphere. Due to their low density compared to oceanic crust, cratons can resist tectonic processes such as rifting and subduction. As a consequence, these resistant blocks of rock are commonly found in the interior of tectonic plates and continents (Petit, 2010). Figure 1 shows the main cratons of South America and Africa when these continents were joined in the supercontinent Pangea.

|![cratons-1.svg](./images/Cratons_West_Gondwana.svg)|
|:--:|
|*Figure 1: South America's and Africa's cratons represented by dark brown. Source: [Woudloper, 2010](https://en.wikipedia.org/wiki/Craton#/media/File:Cratons_West_Gondwana.svg)*|

The collision of these cratons formed the Brasília Belt (Figure 2), as stated above. It generated three geological compartments of the fold belt: the Inner Zone, the External Zone, and the Cratonic Domain. Since the area of interest (AOI) is located in the External Zone, that is where we focus, as highlighted by the purple box in Figure 2.

|![brasilia-belt.png](./images/brasilia_belt_draw.png)|
| :--: |
|*Figure 2: Brasília Belt geological setting and position. It is a strip that connects the São Francisco Craton and the Paranapanema Craton, the latter now lying under the Paraná Basin. The stream sediment samples are located in the region of the purple square, in the cities of Paracatu, Vazante, and Coromandel. Source: Valeriano, 2016.*|

The `DZV` is located in the External Zone of the Brasília Belt. The main Zn mineralizations are within the Vazante Group (Dias et al., 2015). In geology, a group is a set of geologic formations, and each formation is characterized by a few lithologies. The Vazante Group formations referred to here are the Lapa Fm, Serra do Poço, Morro do Calcário Fm, and Serra do Garrote Fm. They are displayed from the youngest to the oldest, as shown in Figure 3. Groups and formations are classified according to the similarity of their lithologies, which distinguishes them from the surrounding rock bodies, and according to the particular position they occupy in the layers of rock exposed in a geographical region (Boggs, 1987).

| ![vazante_group.png](./images/vazante_group_modified.png) |
|:--:|
|_Figure 3: Vazante Group stratigraphic column. Modified from [Aldi et al. (2022)](https://www.researchgate.net/publication/362722519_LA-ICP-MS_Trace_Element_Composition_of_Sphalerite_and_Galena_of_the_Proterozoic_Carbonate-Hosted_Morro_Agudo_Zn-Pb_Sulfide_District_Brazil_Insights_into_Ore_Genesis)_|


In total, there are 6 mines in the `DZV` extracting Zn ore, mainly from the mineral willemite (Zn2SiO4, zinc silicate). In some areas, the concentration of Zn can reach 8000 ppm (Dias et al., 2015).

#### Objectives

With this background in mind, the objective of this project is to find, with the aid of Machine Learning regression algorithms, prospective areas rich in Zn based on the stream sediment samples collected in the `DZV`. Given the first law of geography, according to which "everything is related to everything else, but near things are more related than distant things", and given the geological setting of the region, the Zn-rich areas are expected to be located in the eastern part of the grid of stream sediment samples.


___

### 1.3 Data Preparation

#### 1.3.1 Cleaning tidy data

In [1]:
# Import necessary libraries
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


In [2]:
# Load the dataset
file_path = '../dataset/stream_samples_original.csv'
df = pd.read_csv(file_path, sep = ';')

# Display the first few rows of the dataset
df.head()


Unnamed: 0,Estação,N__Lab_,Long__X_,Lat__Y_,Folha,Ag (ppm),Al (%),As (ppm),Au (ppm),B (ppm),...,Ta (ppm),Te (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zn (ppm),Zr (ppm)
0,AC-0002,CDE225,248757,7972050,Coromandel,0.03,2.63,4,<0.1,<10,...,<0.05,0.06,7.4,0.03,0.7,73,0.3,13.15,27,6.3
1,AC-0003,CDE226,244460,7973135,Coromandel,0.02,1.93,2,<0.1,<10,...,<0.05,0.14,8.2,0.06,0.94,58,0.3,23.9,58,6.9
2,AC-0004,CDE227,244044,7970217,Coromandel,0.04,1.47,3,<0.1,<10,...,<0.05,0.08,5.4,0.04,0.65,55,0.2,10.4,34,1.8
3,AC-0005,CDE228,242895,7970593,Coromandel,0.05,1.72,23,<0.1,<10,...,<0.05,<0.05,6.9,0.04,1.01,66,0.7,9.34,27,4.1
4,AC-0006,CDE229,242999,7971416,Coromandel,0.04,0.97,7,<0.1,<10,...,<0.05,<0.05,5.7,0.05,0.82,41,0.4,7.45,28,1.5


In [3]:
#-- renaming columns

df = df.rename(columns={'Estação':'station',
                        'N__Lab_':'lab',
                        'Long__X_':'x',
                        'Lat__Y_': 'y',
                        'Folha':'map_idx'})

❗🔴 **IT IS NO LONGER NECESSARY TO REORDER THE DF** DELETE THIS PART

In [4]:
#-- reordering columns to make it easier to slice

#-- moved map_idx to position 2 (right after lab) and Zn (ppm) to the last column

reorder_cols = ['station', 'lab','map_idx', 'x', 'y', 'Ag (ppm)', 'Al (%)', 'As (ppm)',
       'Au (ppm)', 'B (ppm)', 'Ba (ppm)', 'Be (ppm)', 'Bi (ppm)', 'Ca (%)',
       'Cd (ppm)', 'Ce (ppm)', 'Co (ppm)', 'Cr (ppm)', 'Cs (ppm)', 'Cu (ppm)','Fe (%)', 'Ga (ppm)', 'Ge (ppm)', 'Hf (ppm)', 'Hg (ppm)', 'In (ppm)','K (%)', 'La (ppm)', 'Li (ppm)', 'LREE (ppm)', 'Mg (%)', 'Mn (ppm)','Mo (ppm)', 'Na (%)', 'Nb (ppm)', 'Ni (ppm)', 'P (ppm)', 'Pb (ppm)','Rb (ppm)', 'Re (ppm)', 'S (%)', 'Sb (ppm)', 'Sc (ppm)', 'Se (ppm)','Sn (ppm)', 'Sr (ppm)', 'Ta (ppm)', 'Te (ppm)', 'Th (ppm)', 'Ti (%)','U (ppm)', 'V (ppm)', 'W (ppm)', 'Y (ppm)', 'Zr (ppm)', 'Zn (ppm)']

df_reorder = df[reorder_cols]
df_reorder.head(2)

Unnamed: 0,station,lab,map_idx,x,y,Ag (ppm),Al (%),As (ppm),Au (ppm),B (ppm),...,Ta (ppm),Te (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zr (ppm),Zn (ppm)
0,AC-0002,CDE225,Coromandel,248757,7972050,0.03,2.63,4,<0.1,<10,...,<0.05,0.06,7.4,0.03,0.7,73,0.3,13.15,6.3,27
1,AC-0003,CDE226,Coromandel,244460,7973135,0.02,1.93,2,<0.1,<10,...,<0.05,0.14,8.2,0.06,0.94,58,0.3,23.9,6.9,58


In [5]:
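#-- keeping only the coordinates and element columns (dropping station, lab and map_idx)
df_elements = df_reorder.iloc[:, 3:].copy()

#-- selecting the columns stored as strings (they contain '<' or '>' qualifiers)
string_elements_name = df_elements.select_dtypes(include=['object'])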

print('Shape of the dataframe with all elements:', df_elements.shape)


Shape of the dataframe with all elements: (709, 53)


- Before explaining the variables, it is necessary to remove the elements that have fewer than 50% valid records, i.e., values not flagged with `<` or `>` as outside the detection limits. This is important because, geologically, it is impossible to find any anomaly pattern in data that was not detected. To avoid any misinterpretation or spurious correlation from these elements, it is fundamental to remove them. ❗🔴 **ADOPT ANOTHER STRATEGY AND REMOVE THIS PART**

In [6]:
#-- Remove elements with less than 50% of valid records

# Creating an empty list for elements to remove
remove = []

# Iterate through all elements
for e in string_elements_name.columns:
    try:
        # Count the number of occurrences of '<' and '>'
        minus = df_elements[e].str.count('<').sum()
        plus = df_elements[e].str.count('>').sum()
        # Calculate validity percentage
        validity = 1 - ((minus + plus) / df_elements.shape[0])
        
        # Check if the validity is less than 0.5, mark for removal
        if validity < 0.5:
            print(f'{e} was removed. Expected at least 50%, but got {100*validity:.2f}%.')
            remove.append(e)
    except KeyError:
        pass

# Filter the dataframe by dropping specified columns
df_elements = df_elements.drop(columns=remove)

display(df_elements.head())

df_elements.shape



Au (ppm) was removed. Expected at least 50%, but got 1.41%.
B (ppm) was removed. Expected at least 50%, but got 1.69%.
Ge (ppm) was removed. Expected at least 50%, but got 30.61%.
Na (%) was removed. Expected at least 50%, but got 6.21%.
Re (ppm) was removed. Expected at least 50%, but got 0.00%.
S (%) was removed. Expected at least 50%, but got 30.18%.
Se (ppm) was removed. Expected at least 50%, but got 8.32%.
Ta (ppm) was removed. Expected at least 50%, but got 9.73%.
Te (ppm) was removed. Expected at least 50%, but got 38.79%.


Unnamed: 0,x,y,Ag (ppm),Al (%),As (ppm),Ba (ppm),Be (ppm),Bi (ppm),Ca (%),Cd (ppm),...,Sn (ppm),Sr (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zr (ppm),Zn (ppm)
0,248757,7972050,0.03,2.63,4,76,1.0,0.31,0.05,0.03,...,1.5,3.3,7.4,0.03,0.7,73,0.3,13.15,6.3,27
1,244460,7973135,0.02,1.93,2,84,1.7,0.29,0.04,0.01,...,1.6,3.1,8.2,0.06,0.94,58,0.3,23.9,6.9,58
2,244044,7970217,0.04,1.47,3,70,0.7,0.19,0.11,0.06,...,0.9,3.9,5.4,0.04,0.65,55,0.2,10.4,1.8,34
3,242895,7970593,0.05,1.72,23,80,1.1,0.32,0.04,0.08,...,1.1,3.1,6.9,0.04,1.01,66,0.7,9.34,4.1,27
4,242999,7971416,0.04,0.97,7,56,0.5,0.22,0.03,0.05,...,0.7,2.5,5.7,0.05,0.82,41,0.4,7.45,1.5,28


(709, 44)

❗🔴**ADOPT THE LOWER (LDI) AND UPPER (LDS) DETECTION LIMIT TECHNIQUE FOR THE VALUES BELOW**

In [7]:
#-- strip the '<' and '>' qualifiers and convert every column to float

for col in df_elements:
    # numeric columns need no conversion
    if is_numeric_dtype(df_elements[col]):
        continue
    # string columns: remove both symbols in one pass, then convert to float
    if is_string_dtype(df_elements[col]):
        df_elements[col] = (df_elements[col]
                            .str.replace(r'[<>]', '', regex=True)
                            .astype('float'))
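
The note above asks for explicit handling of the lower (LDI) and upper (LDS) detection limits. The following is a minimal sketch of one possible approach, assuming the common geochemical convention of replacing a value reported as `<limit` by half the limit and a value reported as `>limit` by the limit itself. The helper `censored_to_float`, its `lower_factor` parameter, and the demo values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def censored_to_float(value, lower_factor=0.5):
    """Convert a raw lab value such as '<0.1', '>100' or '2.63' to float.

    Assumption: values below the lower detection limit are replaced by
    lower_factor * limit; values above the upper limit are set to the limit.
    """
    text = str(value).strip()
    if text.startswith('<'):
        return float(text[1:]) * lower_factor
    if text.startswith('>'):
        return float(text[1:])
    return float(text)


# Made-up values that mimic the format of the raw CSV
demo = pd.Series(['<0.1', '0.06', '>100', '7.4'])
converted = demo.map(censored_to_float)
print(converted.tolist())
```

Applied column by column, for example with `df_elements[col].map(censored_to_float)`, this would keep a censored `<0.1` distinct from a measured `0.1`. Whether a factor of 0.5 suits this dataset remains to be checked.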

In [8]:
df_elements.dtypes

x               int64
y               int64
Ag (ppm)      float64
Al (%)        float64
As (ppm)      float64
Ba (ppm)      float64
Be (ppm)      float64
Bi (ppm)      float64
Ca (%)        float64
Cd (ppm)      float64
Ce (ppm)      float64
Co (ppm)      float64
Cr (ppm)        int64
Cs (ppm)      float64
Cu (ppm)      float64
Fe (%)        float64
Ga (ppm)      float64
Hf (ppm)      float64
Hg (ppm)      float64
In (ppm)      float64
K (%)         float64
La (ppm)      float64
Li (ppm)      float64
LREE (ppm)    float64
Mg (%)        float64
Mn (ppm)        int64
Mo (ppm)      float64
Nb (ppm)      float64
Ni (ppm)      float64
P (ppm)       float64
Pb (ppm)      float64
Rb (ppm)      float64
Sb (ppm)      float64
Sc (ppm)      float64
Sn (ppm)      float64
Sr (ppm)      float64
Th (ppm)      float64
Ti (%)        float64
U (ppm)       float64
V (ppm)         int64
W (ppm)       float64
Y (ppm)       float64
Zr (ppm)      float64
Zn (ppm)        int64
dtype: object

In [9]:
df_elements.isna().sum()

x             0
y             0
Ag (ppm)      0
Al (%)        0
As (ppm)      0
Ba (ppm)      0
Be (ppm)      0
Bi (ppm)      0
Ca (%)        0
Cd (ppm)      0
Ce (ppm)      0
Co (ppm)      0
Cr (ppm)      0
Cs (ppm)      0
Cu (ppm)      0
Fe (%)        0
Ga (ppm)      0
Hf (ppm)      0
Hg (ppm)      0
In (ppm)      0
K (%)         0
La (ppm)      0
Li (ppm)      0
LREE (ppm)    0
Mg (%)        0
Mn (ppm)      0
Mo (ppm)      0
Nb (ppm)      0
Ni (ppm)      0
P (ppm)       0
Pb (ppm)      0
Rb (ppm)      0
Sb (ppm)      0
Sc (ppm)      0
Sn (ppm)      0
Sr (ppm)      0
Th (ppm)      0
Ti (%)        0
U (ppm)       0
V (ppm)       0
W (ppm)       0
Y (ppm)       0
Zr (ppm)      0
Zn (ppm)      0
dtype: int64

In [10]:
#-- export cleaned data

df_elements.to_csv('../dataset/stream_samples_cleaned.csv', index = False)

_____

#### 1.3.2 Splitting Data 

❗🔴 EXPLAIN THIS BETTER. IF POSSIBLE, FIND PAPERS THAT SUPPORT THE STRATEGY.

Explain how you will split the dataset into training and testing data to avoid data leakage.

- Now that our dataset is cleaned, it is possible to split it into training, validation, and test datasets.

- We are using an 80/20 proportion for training and testing.

- The dataset was split into training and testing before any normalization was performed. It is important to note that, despite the separation, the variable we want to predict, the Zinc concentration, is still present in both tables (training and testing), and will be properly processed. If it is not properly processed, it will result in data leakage.

- Furthermore, in order to improve the training part of the algorithm, cross-validation with k=5 will be used, so that each fold holds out 20% of the training data for validation.


In [11]:
# set aside 20% of the data as the test set; the remaining 80% is used for training
train, test = train_test_split(df_elements,test_size=0.2, shuffle=True, random_state=42)

print("train shape: {}".format(train.shape))
print("test shape: {}".format(test.shape))

train shape: (567, 44)
test shape: (142, 44)
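
The 5-fold cross-validation mentioned above is not shown in the notebook. The following is a minimal sketch of how it could be layered on top of the 80/20 split, assuming scikit-learn's `KFold` and a small synthetic frame. The `demo` data and the names `train_demo`, `test_demo`, and `fold_sizes` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the cleaned dataset (100 made-up samples)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'x': rng.integers(240_000, 310_000, size=100),
    'y': rng.integers(7_930_000, 7_980_000, size=100),
    'Zn (ppm)': rng.integers(1, 500, size=100),
})

# Step 1: hold out 20% as the test set; it is not touched during training
train_demo, test_demo = train_test_split(
    demo, test_size=0.2, shuffle=True, random_state=42
)

# Step 2: split the training portion into 5 folds; each fold uses
# 4/5 of the training rows for fitting and 1/5 for validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fit_idx, val_idx in kfold.split(train_demo):
    fit_part = train_demo.iloc[fit_idx]
    val_part = train_demo.iloc[val_idx]
    fold_sizes.append((len(fit_part), len(val_part)))

print(fold_sizes)
```

Any step that learns parameters from the data (for example, a scaler) would be fitted on `fit_part` only inside the loop, so that neither the validation fold nor the test set leaks into training.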


#### 1.3.3 Dataset normalization

❗🔴 **CHECK XAVIER 2023 AND SALOMÃO 2020 REGARDING DATASET NORMALIZATION**

In order to normalize the dataset, a log transformation (base 10) will be used.

- A log transformation is a good approach because it compresses the wide range of concentrations and captures relative changes and their magnitude, which is useful when there is a lot of variance. Since the logarithm of a value below 1 is negative, the absolute value is taken afterwards so that all transformed values are non-negative.

In [12]:
train.head(2)

Unnamed: 0,x,y,Ag (ppm),Al (%),As (ppm),Ba (ppm),Be (ppm),Bi (ppm),Ca (%),Cd (ppm),...,Sn (ppm),Sr (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zr (ppm),Zn (ppm)
310,269916,7935396,0.01,1.32,4.0,76.0,0.9,0.14,0.04,0.07,...,1.0,5.6,5.0,0.02,0.75,56,0.1,5.49,9.7,22
523,305017,7948833,0.04,0.77,2.0,18.0,0.1,0.06,0.01,0.01,...,0.9,1.3,3.5,0.01,0.4,23,0.3,1.51,2.8,4


In [17]:
tmp_train = train.copy()
tmp_test = test.copy()


for i in tmp_train.columns[2:]:
    tmp_train[i] = np.absolute(np.log10(tmp_train[i]))
    
print('--- Train dataframe normalized ---')

for i in tmp_test.columns[2:]:
    tmp_test[i] = np.absolute(np.log10(tmp_test[i]))
    
print('--- Test dataframe normalized ---')



--- Train dataframe normalized ---
--- Test dataframe normalized ---
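
One property of the transformation above is worth checking: `np.log10` is negative for concentrations below 1, and taking the absolute value maps, for example, 0.1 and 10 to the same number. The following is a minimal sketch on made-up values (the `demo` frame is hypothetical, not taken from the dataset). It compares the absolute log with the signed log, which can be inverted with `10 ** value`.

```python
import numpy as np
import pandas as pd

# Made-up concentrations spanning values below and above 1
demo = pd.DataFrame({
    'Ag (ppm)': [0.01, 0.1, 1.0, 10.0],
    'Zn (ppm)': [4.0, 22.0, 58.0, 8000.0],
})

# Signed log10 keeps the ordering of the original values
signed = np.log10(demo)

# Absolute log10, as used in the cell above, folds values below 1
# onto the same scale as values above 1
absolute = np.absolute(signed)

# The signed transform can be inverted to recover the concentrations
restored = 10 ** signed

print(signed['Ag (ppm)'].tolist())
print(absolute['Ag (ppm)'].tolist())
```

If the ordering of low concentrations matters for the model, keeping the signed values may be preferable. This is a design choice to validate against the references noted above.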


In [18]:
#-- exporting the normalized training and testing sets (both are already dataframes)

tmp_train.to_csv('../dataset/train_test/training.csv', index = False)

tmp_test.to_csv('../dataset/train_test/testing.csv', index = False)


Now that we have the data cleaned, without NaNs, with every element stored as a number, restricted to the elements with enough valid measurements, and with the training and testing samples properly divided and log-normalized, it is possible to better explain what our explanatory and response variables are.


### 1.4 Explanatory Variables
List and describe the types and numbers of explanatory variables.


In [19]:
#-- transforming df to np array
tmp = df_elements.to_numpy()

#separating the Zn into an array
Y = tmp[:, -1]

#separating all  explanatory variables into X
X = tmp[:, :-1]

In [20]:
#Explanatory Variables

print("Feature Name:", df_elements.columns[:-1])

print("\n Type of X Values:")
print(X.dtype)
print("\nshape of X:")
print(X.shape)

Feature Name: Index(['x', 'y', 'Ag (ppm)', 'Al (%)', 'As (ppm)', 'Ba (ppm)', 'Be (ppm)',
       'Bi (ppm)', 'Ca (%)', 'Cd (ppm)', 'Ce (ppm)', 'Co (ppm)', 'Cr (ppm)',
       'Cs (ppm)', 'Cu (ppm)', 'Fe (%)', 'Ga (ppm)', 'Hf (ppm)', 'Hg (ppm)',
       'In (ppm)', 'K (%)', 'La (ppm)', 'Li (ppm)', 'LREE (ppm)', 'Mg (%)',
       'Mn (ppm)', 'Mo (ppm)', 'Nb (ppm)', 'Ni (ppm)', 'P (ppm)', 'Pb (ppm)',
       'Rb (ppm)', 'Sb (ppm)', 'Sc (ppm)', 'Sn (ppm)', 'Sr (ppm)', 'Th (ppm)',
       'Ti (%)', 'U (ppm)', 'V (ppm)', 'W (ppm)', 'Y (ppm)', 'Zr (ppm)'],
      dtype='object')

 Type of X Values:
float64

shape of X:
(709, 43)


**Location Variables**

- `station` (_object_): station ID, i.e., the identifier of the collection point.

- `lab` (_object_): identification of the lab that performed the stream sediment analysis.

- `x` (_int64_): coordinate on the X axis.

- `y` (_int64_): coordinate on the Y axis.

- `map_idx` (_object_): mapping site (map sheet) where the sample was collected.

**Explanatory Variables**

- `element (ppm)` (_float64_): element measured in parts per million

- `element (%)`(_float64_): element measured in percentage

| Chemical Element | Element Name |
|------------------|--------------|
| Ag (ppm)         | Silver       |
| Al (%)           | Aluminum     |
| As (ppm)         | Arsenic      |
| Au (ppm)         | Gold         |
| B (ppm)          | Boron        |
| Ba (ppm)         | Barium       |
| Be (ppm)         | Beryllium    |
| Bi (ppm)         | Bismuth      |
| Ca (%)           | Calcium      |
| Cd (ppm)         | Cadmium      |
| Ce (ppm)         | Cerium       |
| Co (ppm)         | Cobalt       |
| Cr (ppm)         | Chromium     |
| Cs (ppm)         | Cesium       |
| Cu (ppm)         | Copper       |
| Fe (%)           | Iron         |
| Ga (ppm)         | Gallium      |
| Ge (ppm)         | Germanium    |
| Hf (ppm)         | Hafnium      |
| Hg (ppm)         | Mercury      |
| In (ppm)         | Indium       |
| K (%)            | Potassium    |
| La (ppm)         | Lanthanum    |
| Li (ppm)         | Lithium      |
| LREE (ppm)       | Light Rare Earth Elements |
| Mg (%)           | Magnesium    |
| Mn (ppm)         | Manganese    |
| Mo (ppm)         | Molybdenum   |
| Na (%)           | Sodium       |
| Nb (ppm)         | Niobium      |
| Ni (ppm)         | Nickel       |
| P (ppm)          | Phosphorus   |
| Pb (ppm)         | Lead         |
| Rb (ppm)         | Rubidium     |
| Re (ppm)         | Rhenium      |
| S (%)            | Sulfur       |
| Sb (ppm)         | Antimony     |
| Sc (ppm)         | Scandium     |
| Se (ppm)         | Selenium     |
| Sn (ppm)         | Tin          |
| Sr (ppm)         | Strontium    |
| Ta (ppm)         | Tantalum     |
| Te (ppm)         | Tellurium    |
| Th (ppm)         | Thorium      |
| Ti (%)           | Titanium     |
| U (ppm)          | Uranium      |
| V (ppm)          | Vanadium     |
| W (ppm)          | Tungsten     |
| Y (ppm)          | Yttrium      |
| Zr (ppm)         | Zirconium    |


*_Disclaimer_*: in geology, the chemical elements are separated into two types: `Major` and `Trace`.

All `Major elements`, such as Al, K, and Fe, are measured in percentage (%). These elements typically constitute a significant portion of the sediment's composition, so their concentrations are conveniently expressed as percentages.

Meanwhile, all `Trace elements` are measured in ppm. These elements, present in much lower concentrations, include precious metals, heavy metals, and other trace elements such as gold (Au), silver (Ag), copper (Cu), and various rare earth elements. Even in trace amounts, these elements can be essential indicators of mineral deposits or provide valuable information about the geological environment.
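
Since the dataset mixes both units, it can help to keep the conversion in mind: 1% corresponds to 10,000 ppm. The following is a minimal sketch of the arithmetic. The helper names `percent_to_ppm` and `ppm_to_percent` are hypothetical and not part of the notebook.

```python
# 1 % = 1 part per hundred = 10,000 parts per million
PPM_PER_PERCENT = 10_000


def percent_to_ppm(value_percent):
    """Convert a concentration from percent to parts per million."""
    return value_percent * PPM_PER_PERCENT


def ppm_to_percent(value_ppm):
    """Convert a concentration from parts per million to percent."""
    return value_ppm / PPM_PER_PERCENT


# Example values: an Al reading in % and a high Zn reading in ppm
al_ppm = percent_to_ppm(2.63)
zn_percent = ppm_to_percent(8000)
print(al_ppm, zn_percent)
```

For instance, the 8000 ppm of Zn mentioned earlier corresponds to 0.8%.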



### 1.5 Response Variable
Describe the response variable and its type.

**Response Variable**

- `Zn (ppm)` (_int64_): concentration of Zinc in ppm.

| Chemical Element | Element Name |
|------------------|--------------|
| Zn (ppm)| Zinc | 


In [21]:
res_variables = df_elements.iloc[:, -1]

print("Feature Name:", res_variables.name)

print("\nType of Y Values:")
print(res_variables.dtypes)

print("\nshape of Y:")
print(res_variables.shape)

Feature Name: Zn (ppm)

Type of Y Values:
int64

shape of Y:
(709,)


❗🔴**CHECK THE POSSIBILITY OF ADDING MORE DATA, SUCH AS LITHOLOGY AND/OR BASINS. WOULD THIS IMPROVE THE MODEL? CHECK IT WITH CHARLOTTE.**

❗🔴**ALSO DISCUSS HOW TO TREAT ZN: WILL IT BE A REGRESSION OR A CLASSIFICATION PROBLEM?**