## **Introduction to Exploratory Data Analysis with Pandas, Matplotlib and Seaborn!**


## Course content:

1.   Pandas: import, manipulate, visualize and describe your data
2.   Visualize your data with different plots using the Geovariances plotting module


## Downloading data and plotting scripts

The `curl` command downloads the dataset used in this course from the repository. If you are in a Google Colaboratory session, you will also need to download the plotting script from Geovariances.


In [None]:
# Downloads dataset from GitHub
!curl -o phosphate_assay_sampled_geomet.csv https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/phosphate_assay_sampled_geomet.csv

# If you are in a Google Colab session, uncomment the next line to also download the Geovariances plotting module
# !curl -o plotting_gv.py https://raw.githubusercontent.com/gv-americas/ml_course_americas/main/plotting_gv.py

## Importing four libraries:

**Pandas**: used for data manipulation and analysis.

**Numpy**: used for scientific computing and working with arrays.

**Matplotlib**: used for data visualization and creating plots.

**Plotting_gv**: a custom plotting library created by GV Americas, which contains additional plotting functions and custom styles.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotting_gv as gv


## Reading data with Pandas

In [None]:
data = pd.read_csv("phosphate_assay_sampled_geomet.csv")


## Inspecting the data: display the first few rows of the DataFrame

In [None]:
data.head()


## Printing dataset information and columns

In [None]:
print("Dataset information:\n")
data.info()
data.columns


## Declaring variables to filter the data

In [None]:
variables = ["AL2O3", "BAO", "CAO", "FE2O3", "MGO", "NB2O5", "P2O5", "SIO2", "TIO2"]

lito_var = "ALT"

geomet = ["Consumo_coletor_(g/t)", "MASSA_T"]

x, y, z = data["X"], data["Y"], data["Z"]


## Number of rows x Number of columns

In [None]:
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")


## Counting drillholes

In [None]:
ddh = len(np.unique(data["Name"]))
print(f"Total drillhole count: {ddh}")
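
The same count can also be obtained with pandas alone. The sketch below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not the course dataset) to compare `np.unique` with `Series.nunique()` and to list the number of samples per drillhole.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the course dataset
toy = pd.DataFrame(
    {
        "Name": ["DH-01", "DH-01", "DH-02", "DH-03", "DH-03", "DH-03"],
        "P2O5": [4.2, 7.9, 5.5, 6.1, 3.3, 8.0],
    }
)

# Count distinct drillhole names in two equivalent ways
n_holes_numpy = len(np.unique(toy["Name"]))
n_holes_pandas = toy["Name"].nunique()

# Number of samples (rows) per drillhole, ordered by name
samples_per_hole = toy["Name"].value_counts().sort_index()

print(f"Drillholes (numpy): {n_holes_numpy}")
print(f"Drillholes (pandas): {n_holes_pandas}")
print(samples_per_hole)
```

Note that `nunique()` ignores missing values by default, which may matter if some rows have no drillhole name.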


## Lithologies: ALT variable

In [None]:
lito = np.unique(data["ALT"])

print(lito)


## Counting lithology categories with a Count Plot

In [None]:
gv.count_cat(cat=data["ALT"], color="#116981")
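
The numbers behind a count plot can be tabulated directly with pandas. This is a minimal sketch on made-up category codes (the values in `toy` are hypothetical); it shows absolute counts and proportions per category.

```python
import pandas as pd

# Hypothetical lithology codes; the real ALT categories may differ
toy = pd.DataFrame({"ALT": ["A", "B", "B", "C", "B", "A"]})

# Absolute number of samples per category
counts = toy["ALT"].value_counts()

# Share of samples per category (sums to 1)
proportions = toy["ALT"].value_counts(normalize=True)

print(counts)
print(proportions.round(3))
```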


## Describing basic statistics for each variable

In [None]:
print("Data statistics:\n")
data.describe().round(2)


## Printing NaN Values with Pandas

In [None]:
print("NaN values for each column:")

print(data.isnull().sum())
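
Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a hypothetical `toy` table, combines the missing-value count and percentage per column in one summary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing assays
toy = pd.DataFrame(
    {
        "P2O5": [4.2, np.nan, 5.5, 6.1],
        "CAO": [10.0, 12.5, np.nan, np.nan],
        "ALT": ["A", "B", "B", "A"],
    }
)

# isnull() gives a boolean table; sum() counts True values, mean() gives their fraction
missing_count = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```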


## Plotting Histogram

In [None]:
gv.histogram(
    data["P2O5"], title="Histogram for $P_2O_5$", bins=20, color="#116981", cum=False
)
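
To see what a histogram summarizes, the bin counts can be computed without plotting. This sketch uses synthetic lognormal values (an assumption made only for illustration, not the real $P_2O_5$ grades) and NumPy's `np.histogram`; how `gv.histogram` computes its bins internally is not shown in this notebook.

```python
import numpy as np

# Synthetic, reproducible values standing in for a grade variable
rng = np.random.default_rng(0)
grades = rng.lognormal(mean=1.5, sigma=0.5, size=500)

# 20 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(grades, bins=20)

# Cumulative fraction of samples up to the right edge of each bin
cum_fraction = np.cumsum(counts) / counts.sum()

print(counts)
print(edges.round(2))
print(cum_fraction.round(3))
```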


## Plotting Scatterplot

In [None]:
gv.scatterplot(data, "P2O5", "CAO", lineregress=False)


## Plotting Scattermatrix and Correlations

In [None]:
gv.scatter_matrix(data[variables], figsize=(20, 20))


## Plotting Correlation Matrix

In [None]:
gv.correlation_matrix(data[variables], (10, 10), method="pearson")
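
The values displayed in a correlation matrix can be inspected as a plain table with `DataFrame.corr`. The sketch below uses made-up, perfectly linear data (hypothetical values) so the expected Pearson coefficients are easy to check by eye: one pair rises together, the other moves in opposite directions.

```python
import pandas as pd

# Hypothetical values: CAO increases linearly with P2O5, SIO2 decreases linearly
toy = pd.DataFrame(
    {
        "P2O5": [2.0, 4.0, 6.0, 8.0, 10.0],
        "CAO": [5.0, 9.0, 13.0, 17.0, 21.0],
        "SIO2": [40.0, 35.0, 30.0, 25.0, 20.0],
    }
)

# Pairwise Pearson correlation coefficients between all numeric columns
pearson = toy.corr(method="pearson")
print(pearson.round(2))
```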


## Plotting Boxplots per category: understanding all distributions

In [None]:
gv.boxplots(data[variables], variables, data["ALT"])
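
The statistics that a per-category boxplot draws can also be tabulated with a pandas `groupby`. This is a minimal sketch on a hypothetical `toy` table with two made-up lithology codes.

```python
import pandas as pd

# Hypothetical toy table: two lithology codes with a few assays each
toy = pd.DataFrame(
    {
        "ALT": ["A", "A", "B", "B", "B"],
        "P2O5": [4.0, 6.0, 10.0, 12.0, 14.0],
    }
)

# Summary statistics of P2O5 within each lithology category
per_category = toy.groupby("ALT")["P2O5"].agg(["count", "median", "min", "max"])
print(per_category)
```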


## Flagging outliers of a specific variable

In [None]:
gv.flag_outliers(data, "P2O5", remove_outliers=True)
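
The rule that `gv.flag_outliers` applies is not shown in this notebook. As an illustration only, the sketch below assumes the common Tukey rule (values more than 1.5 times the interquartile range beyond the quartiles); the helper `flag_outliers_iqr` and the `toy` data are hypothetical.

```python
import pandas as pd


def flag_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean Series that is True where a value lies outside the Tukey fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical grades with one clearly anomalous value
toy = pd.DataFrame({"P2O5": [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 30.0]})

# Flag first, then decide whether to drop the flagged rows
toy["is_outlier"] = flag_outliers_iqr(toy["P2O5"])
cleaned = toy.loc[~toy["is_outlier"]]

print(toy)
print(f"Rows kept after removing flagged values: {len(cleaned)}")
```

Keeping the flag as a column, rather than deleting rows immediately, makes it possible to review the flagged samples before removing them.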


## Location Map 

In [None]:
gv.locmap(
    data["X"],
    data["Y"],
    data["P2O5"],
    cat=False,
    figsize=(20, 13),
    title="Location map $P_2O_5$",
)


## Cross Section

In [None]:
gv.locmap(
    data["X"],
    data["Z"],
    data["P2O5"],
    cat=False,
    figsize=(30, 20),
    title="Cross section location map $P_2O_5$",
)


## Creating cut-off domains

In [None]:
data["ORE"] = np.where(data["P2O5"] >= 6, "Rich", "Poor")
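
`np.where` handles a single cut-off. For more than two classes, one option is `np.select`, which checks a list of conditions in order. The sketch below repeats the cut-off of 6 used above and adds a second threshold of 10, which is a hypothetical value chosen only for illustration, as are the `toy` data and the class labels.

```python
import numpy as np
import pandas as pd

# Hypothetical grades
toy = pd.DataFrame({"P2O5": [2.0, 5.9, 6.0, 9.5, 12.0]})

# Two classes with a single cut-off, as in the cell above
toy["ORE"] = np.where(toy["P2O5"] >= 6, "Rich", "Poor")

# Three classes: conditions are evaluated in order, the first match wins
conditions = [toy["P2O5"] >= 10, toy["P2O5"] >= 6]
labels = ["High", "Medium"]
toy["ORE3"] = np.select(conditions, labels, default="Low")

print(toy)
```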


## Visualize Rich and Poor domains

In [None]:
gv.locmap(
    data["X"],
    data["Z"],
    data["ORE"],
    cat=True,
    figsize=(30, 10),
    title="Categorical location map for $P_2O_5$",
)


In [None]:
gv.tdscatter(data["X"], data["Y"], data["Z"], data["ORE"], zex=1, s=3, azim=80, elev=20)


## Practice

Generate different histograms to analyze the distributions of the main variables. Analyze and comment on the results. Use the function **gv.histogram()** to generate the plots.





In [None]:
## code


Generate scatter plots and a new scatter plot matrix to understand the dispersion and correlations between variables. Also, use the correlation matrix. To do this, use the functions **gv.scatterplot(), gv.scatter_matrix(), and gv.correlation_matrix()**.

Note: declare a new list containing the chemical and metallurgical variables of interest, and pass it to the scatter plot matrix and the correlation matrix. In this step, include the geometallurgical variables.



In [None]:
## code


Alongside the boxplot analysis, flag outliers for other variables with the function **gv.flag_outliers()** and analyze the results.







In [None]:
## code


Make a new domain division with other cut-off grades of your interest using other variables. Use the **np.where()** function.

In [None]:
## code


Visualize the results with a location map, a horizontal section, and a three-dimensional plot using **gv.locmap()** and **gv.tdscatter()**.

In [None]:
## code
