## Data Preparation for the First Model
Welcome to the first notebook. Here we'll take the data from download to the form we will use to train our first model - **'Wh’re Art Thee Min’ral?'**.

The steps we'll be following here are:
- Downloading the SARIG Geochem Data Package. **(~438 MB)**
- Understanding the data columns in our csv of interest.
- Cleaning and applying some processing.
- Saving our processed file into a csv.
- _And seeing some unnecessary memes in between_.

You can upload this notebook and run it on Google Colab, or run it locally in Jupyter Notebook.

In [6]:
# import the required package - Pandas
import pandas as pd

You can download the data by clicking [here](https://unearthed-exploresa.s3-ap-southeast-2.amazonaws.com/Unearthed_5_SARIG_Data_Package.zip), or by running the cell below.

We recommend using **Google Colab** and downloading the data there if you have a poor internet connection.

![](https://media.giphy.com/media/FgiHOQyKUJmwg/giphy.gif)

Colab has a decent download speed of around **~15-20 MB/s**, which is more than enough for this download.

In [None]:
# You can simply download the data by running this cell
!wget https://unearthed-exploresa.s3-ap-southeast-2.amazonaws.com/Unearthed_5_SARIG_Data_Package.zip

--2020-07-26 10:57:12--  https://unearthed-exploresa.s3-ap-southeast-2.amazonaws.com/Unearthed_5_SARIG_Data_Package.zip
Resolving unearthed-exploresa.s3-ap-southeast-2.amazonaws.com (unearthed-exploresa.s3-ap-southeast-2.amazonaws.com)... 52.95.128.54
Connecting to unearthed-exploresa.s3-ap-southeast-2.amazonaws.com (unearthed-exploresa.s3-ap-southeast-2.amazonaws.com)|52.95.128.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458997620 (438M) [application/zip]
Saving to: ‘Unearthed_5_SARIG_Data_Package.zip’


2020-07-26 10:57:35 (19.5 MB/s) - ‘Unearthed_5_SARIG_Data_Package.zip’ saved [458997620/458997620]
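If `wget` is not available (for example on a local Windows Jupyter setup), the same download can be sketched with the Python standard library. The helper name `download_once` and the `.part` temporary suffix are illustrative choices, not part of the original notebook; only the URL comes from the cell above.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the wget cell above.
DATA_URL = (
    "https://unearthed-exploresa.s3-ap-southeast-2.amazonaws.com/"
    "Unearthed_5_SARIG_Data_Package.zip"
)


def download_once(url, dest):
    """Download url to dest unless dest already exists; return the path.

    Hypothetical helper: it writes to a temporary '.part' file first so an
    interrupted download never leaves a half-written archive behind.
    """
    dest = Path(dest)
    if dest.exists():
        return dest
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest


# Usage (uncomment to fetch the archive):
# download_once(DATA_URL, "Unearthed_5_SARIG_Data_Package.zip")
```

Skipping the download when the file already exists means the cell can be re-run safely after a kernel restart.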





If you want to keep the downloaded file for later use, you can first mount your Google Drive and then extract the files there. You can read more about mounting Google Drive in Colab [here](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

**Note** - One of the files is really big (~10 GB), so it might take some time to extract. *Don't think that it's stuck!*
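As a rough sketch of the Drive option: `drive.mount` from `google.colab` is the standard Colab call, while the helper name `choose_data_dir`, the `MyDrive` folder name, and the `GeoChemData` subfolder are assumptions made for illustration. Outside Colab the helper simply falls back to a local folder.

```python
from pathlib import Path


def choose_data_dir(subdir="GeoChemData"):
    """Return a Drive-backed folder on Colab, or a local folder elsewhere.

    Hypothetical helper; the '/content/drive/MyDrive' layout is an assumption
    about how Colab exposes the mounted Drive.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return Path(subdir)
    drive.mount("/content/drive")  # prompts for authorisation in Colab
    return Path("/content/drive/MyDrive") / subdir


data_dir = choose_data_dir()
print(data_dir)
```

Files extracted under the returned folder would then survive the end of the Colab session, at the cost of slower I/O than the local disk.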

In [None]:
# Let's first create a directory to extract the downloaded zip file.
!mkdir 'GeoChemData'

# Now let's unzip the files into the data directory that we created.
!unzip 'Unearthed_5_SARIG_Data_Package.zip' -d 'GeoChemData/'

Archive:  Unearthed_5_SARIG_Data_Package.zip
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_core_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_details_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_litho_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_petrophys_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_reference_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_strat_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_fieldobs_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_fieldobs_litho_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_fieldobs_note_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_fieldobs_struct_exp.csv  
  inflating: GeoChemData/SARIG_Data_Package3_Exported06072
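Where the `unzip` command is not installed, a minimal equivalent using Python's `zipfile` module might look like the following. The function name `extract_archive` is hypothetical; printing each member as it is extracted gives some reassurance that the large file is still being processed.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir, printing progress."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # file_size is the uncompressed size in bytes
            print(f"extracting {member.filename} ({member.file_size / 1e6:.1f} MB)")
            archive.extract(member, target_dir)
            extracted.append(member.filename)
    return extracted


# Usage (assumes the archive has already been downloaded):
# extract_archive("Unearthed_5_SARIG_Data_Package.zip", "GeoChemData")
```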

In [21]:
# Read sarig_dh_details_exp.csv into a dataframe.
# We use unicode_escape as the encoding to avoid a UTF-8 decoding error.
df_details = pd.read_csv('/content/GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_details_exp.csv', encoding='unicode_escape')



In [22]:
# Let's view the first few rows
df_details.head()

Unnamed: 0,DRILLHOLE_NO,DH_NAME,DH_OTHER_NAME,PACE_DH,PACE_ROUND_NO,REPRESENTATIVE_DH,REPRESENTATIVE_DH_COMMENTS,DH_UNIT_NO,MAX_DRILLED_DEPTH,MAX_DRILLED_DEPTH_DATE,CORED_LENGTH,TENEMENT,OPERATOR_CODE,OPERATOR_NAME,TARGET_COMMODITIES,MINERAL_CLASS,PETROLEUM_CLASS,STRATIGRAPHIC_CLASS,ENGINEERING_CLASS,SEISMIC_POINT_CLASS,WATER_WELL_CLASS,WATER_POINT_CLASS,DRILLING_METHODS,STRAT_LOG,LITHO_LOG,PETROPHYSICAL_LOG,GEOCHEMISTRY,PETROLOGY,BIOSTRATIGRAPHY,SPECTRAL_SCANNED,CORE_LIBRARY,REFERENCES,HISTORICAL_DOCUMENTS,COMMENTS,MAP_250000,MAP_100000,MAP_50K_NO,SITE_NO,EASTING_GDA2020,NORTHING_GDA2020,ZONE_GDA2020,LONGITUDE_GDA2020,LATITUDE_GDA2020,LONGITUDE_GDA94,LATITUDE_GDA94,HORIZ_ACCRCY_M,ELEVATION_M,INCLINATION,AZIMUTH,SURVEY_METHOD_CODE,SURVEY_METHOD
0,1,GINGERAH HILL 1,,N,,N,,3359 1,1473.5,14/09/1986,,,,,,N,Y,N,N,N,N,N,,N,N,N,N,N,N,N,N,N,N,,SE5114 MUNRO,3359 Cudalgarra,2,124,433537.93,7847062.23,51,122.366742,-19.469838,122.366734,-19.469824,,,,,,
1,2,BROOKE 1,,N,,N,,3458 1,2035.1,21/07/1988,,,,,,N,Y,N,N,N,N,N,,N,N,N,N,N,N,N,N,N,N,,SE5114 MUNRO,3458 Brooke,1,125,491238.13,7816262.34,51,122.91637,-19.749266,122.916362,-19.749252,,,,,,
2,3,SAHARA 1,,N,,N,,3555 1,2120.19,26/02/1965,,,,,,N,Y,N,N,N,N,N,,N,N,N,N,N,N,N,N,N,Y,,SF5107 SAHARA,3555 Tandalgoo,1,126,540836.18,7674882.1,51,123.392991,-21.026384,123.392983,-21.02637,200.0,,,,MAP,Map Plot
3,4,NYALAYI 1/90,,N,,N,,3743 1,96.0,06/12/1990,,,,,Water,N,N,N,N,N,Y,N,Rotary,N,Y,N,N,N,N,N,N,Y,Y,,SG5115 THROSSELL,3743 Buldya,1,127,647138.76,6991660.71,51,124.485432,-27.18989,124.485424,-27.189876,,,,,,
4,5,GAMBANGA 1,,N,,N,,3833 1,391.06,05/03/1960,,,,,,N,Y,N,N,N,N,N,,N,N,N,N,N,N,N,N,N,Y,,SI5104 CULVER,3833 Price,3,128,664137.34,6426048.14,51,124.743123,-32.290334,124.743116,-32.29032,,,,,,


In [23]:
# Data Column Information
df_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 321843 entries, 0 to 321842
Data columns (total 51 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   DRILLHOLE_NO                321843 non-null  int64  
 1   DH_NAME                     191457 non-null  object 
 2   DH_OTHER_NAME               26298 non-null   object 
 3   PACE_DH                     321843 non-null  object 
 4   PACE_ROUND_NO               6535 non-null    float64
 5   REPRESENTATIVE_DH           321843 non-null  object 
 6   REPRESENTATIVE_DH_COMMENTS  97696 non-null   object 
 7   DH_UNIT_NO                  321843 non-null  object 
 8   MAX_DRILLED_DEPTH           303597 non-null  float64
 9   MAX_DRILLED_DEPTH_DATE      296142 non-null  object 
 10  CORED_LENGTH                51566 non-null   float64
 11  TENEMENT                    321843 non-null  object 
 12  OPERATOR_CODE               155645 non-null  object 
 13  OPERATOR_NAME 

### What columns do we need?
We only need the following three columns from this dataframe:
- `LONGITUDE_GDA94`: The longitude of the drillhole location in the **EPSG:4283** Coordinate Reference System (CRS).

- `LATITUDE_GDA94`: The latitude of the drillhole location in the **EPSG:4283** Coordinate Reference System (CRS).

- `MINERAL_CLASS`: A column containing **two unique values (Y/N)** representing whether there is any mineralization or not.

> *Note - We are using GDA94 over GDA2020 because it is the more established standard.* You can read more about it on our glossary page.
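Before relying on these three columns, it can help to confirm they look the way the descriptions above say they should. The sketch below is only an illustration: `check_columns` is a hypothetical helper, and the range limits simply reflect that GDA94 (EPSG:4283) coordinates are geographic degrees.

```python
import pandas as pd


def check_columns(df):
    """Return simple sanity checks for the three columns of interest."""
    lon = df["LONGITUDE_GDA94"].dropna()
    lat = df["LATITUDE_GDA94"].dropna()
    classes = set(df["MINERAL_CLASS"].dropna().unique())
    return {
        "lon_in_range": bool(lon.between(-180, 180).all()),
        "lat_in_range": bool(lat.between(-90, 90).all()),
        "only_y_n": classes <= {"Y", "N"},
    }


# Two rows copied from the df_details.head() output, for demonstration.
sample = pd.DataFrame(
    {
        "LONGITUDE_GDA94": [122.366734, 124.485424],
        "LATITUDE_GDA94": [-19.469824, -27.189876],
        "MINERAL_CLASS": ["N", "N"],
    }
)
print(check_columns(sample))
```

On the full dataframe the call would be `check_columns(df_details)`; any `False` value points at rows worth inspecting before training.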



In [24]:
# Here the only relevant data we need is the location and the Mineral Class (Yes/No)
df_final = df_details[['LONGITUDE_GDA94','LATITUDE_GDA94', 'MINERAL_CLASS']]

# Drop the rows with null values 
df_final = df_final.dropna()
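Since only three of the 51 columns are kept, an alternative is to load just those columns in the first place, which should reduce memory use on a file of this size. This is a sketch rather than the notebook's own approach: `read_needed_columns` is a hypothetical helper built on the `usecols` parameter of `pd.read_csv`.

```python
import pandas as pd

NEEDED_COLUMNS = ["LONGITUDE_GDA94", "LATITUDE_GDA94", "MINERAL_CLASS"]


def read_needed_columns(csv_path, encoding="unicode_escape"):
    """Load only the three required columns and drop rows with nulls."""
    df = pd.read_csv(csv_path, usecols=NEEDED_COLUMNS, encoding=encoding)
    # usecols keeps the file's column order, so reorder explicitly.
    return df[NEEDED_COLUMNS].dropna()


# Usage (path as used earlier in this notebook):
# df_final = read_needed_columns(
#     "/content/GeoChemData/SARIG_Data_Package3_Exported06072020/sarig_dh_details_exp.csv"
# )
```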

In [25]:
# Let's print out a few rows of the new dataframe.
df_final.head()

Unnamed: 0,LONGITUDE_GDA94,LATITUDE_GDA94,MINERAL_CLASS
0,122.366734,-19.469824,N
1,122.916362,-19.749252,N
2,123.392983,-21.02637,N
3,124.485424,-27.189876,N
4,124.743116,-32.29032,N


In [31]:
# Let's check the data points in both classes
print("Number of rows with Mineral Class Yes is", len(df_final.query('MINERAL_CLASS=="Y"')))
print("Number of rows with Mineral Class No is", len(df_final.query('MINERAL_CLASS=="N"')))

Number of rows with Mineral Class Yes is 147407
Number of rows with Mineral Class No is 174436


321843

The total number of rows in the new dataset is **147407 (Y) + 174436 (N) = 321843**, which should be sufficient for training our models.

Also, the ratio of Class `'N'` to Class `'Y'` is roughly 1 : 0.85, which is quite _**balanced**_.
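The class counts and their ratio can also be recomputed directly with `value_counts`. The sketch below rebuilds a label column from the two counts printed above purely for demonstration; in the notebook the input would be `df_final['MINERAL_CLASS']`. The helper name `class_balance` is hypothetical.

```python
import pandas as pd


def class_balance(labels):
    """Return absolute counts and fractional shares for each class."""
    counts = labels.value_counts()
    shares = labels.value_counts(normalize=True)
    return counts, shares


# Synthetic labels built from the counts reported in this notebook.
labels = pd.Series(["Y"] * 147407 + ["N"] * 174436, name="MINERAL_CLASS")
counts, shares = class_balance(labels)

print(counts.to_dict())
print(round(counts["Y"] / counts["N"], 3))  # 0.845
```

A Y-to-N count ratio of about 0.845 means class `'Y'` makes up a little under half of the rows.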

![](https://media.giphy.com/media/Q1LPV0vs7oKqc/giphy.gif)

Now that we have our dataframe, let's go ahead and save our progress into a new CSV before the session expires!

![](https://www.meme-arsenal.com/memes/4d79ebd426c488f01201fa1c70f704c8.jpg)

In [35]:
# Create a new directory to save the csv (-p avoids an error if it already exists).
!mkdir -p 'GeoChemData/exported'

# Convert the dataframe into a new csv file inside that directory.
df_final.to_csv('GeoChemData/exported/mod1_unsampled.csv')

In [36]:
# Finally, if you are on Google Colab, you can download the saved file using ->
from google.colab import files
files.download('GeoChemData/exported/mod1_unsampled.csv')

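One detail worth knowing about the export step: by default `DataFrame.to_csv` also writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. If that extra column is not wanted, passing `index=False` avoids it. The round trip below is a small sketch with a hypothetical helper name and sample rows copied from `df_final.head()`.

```python
import os
import tempfile

import pandas as pd


def save_and_reload(df, csv_path):
    """Write df without its index, then read the file back."""
    df.to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)


sample = pd.DataFrame(
    {
        "LONGITUDE_GDA94": [122.366734, 122.916362],
        "LATITUDE_GDA94": [-19.469824, -19.749252],
        "MINERAL_CLASS": ["N", "N"],
    }
)
csv_path = os.path.join(tempfile.mkdtemp(), "mod1_sample.csv")
reloaded = save_and_reload(sample, csv_path)
print(list(reloaded.columns))
```

Whether to keep the index column depends on what later notebooks expect, so treat this as an option rather than a required change.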