# Data cleansing and extraction:

The dataset used in this project is available both on [Kaggle](https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset?resource=download) and in the University of California, Irvine's ML [Repository](https://archive.ics.uci.edu/dataset/31/covertype), which, in addition to the data, offers a more detailed explanation of the variables' meaning than the Kaggle page.

Let's begin by extracting the data and loading it into a pandas DataFrame:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("../data/raw/covtype.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    581012 non-null  int64
 11  Wilderness_Area2                    581012 non-null 

Our feature variables are classified in the following way:
- 10 continuous variables:
    * Elevation (m)
    * Aspect (azimuth, degrees from North)
    * Slope (degrees)
    * Horizontal_Distance_To_Hydrology (m)
    * Vertical_Distance_To_Hydrology (m)
    * Horizontal_Distance_To_Roadways (m)
    * Hillshade_9am (index, from 0 to 255)
    * Hillshade_Noon (index, from 0 to 255)
    * Hillshade_3pm (index, from 0 to 255)
    * Horizontal_Distance_To_Fire_Points (m)
- Qualitative variables:
    * Wilderness_Area: 4 wilderness areas are included in the dataset (*Rawah, Neota, Comanche Peak* and *Cache la Poudre*). Using **one-hot encoding**, one binary column was created per wilderness area to indicate membership. The column number for each wilderness area is:
        - Rawah: area 1.
        - Neota: area 2.
        - Comanche Peak: area 3.
        - Cache la Poudre: area 4.
    * Soil_Type: 40 different soil types are included, each with its own binary column. Presence of a soil type is expressed with a 1 in the corresponding column. Whether a piece of land can have more than one soil type at once has not been verified yet (see the open questions further down).

First things first: what kind of variable is the target variable `Cover_Type`?

In [3]:
df["Cover_Type"].value_counts()

Cover_Type
2    283301
1    211840
3     35754
7     20510
6     17367
5      9493
4      2747
Name: count, dtype: int64

Our target variable is a **categorical variable**, and its classes are far from uniformly distributed: cover types 1 and 2 together account for about 85% of the entries, while types 3, 4, 5, 6 and 7 are underrepresented (type 4 has only 2,747 entries). This is a challenge we will need to face later, while training our models, because a classifier trained on imbalanced data tends to favor the majority classes.
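To put numbers on the imbalance, the sketch below takes the counts printed above and derives each class's share of the data plus an inverse-frequency weight, `n_samples / (n_classes * class_count)`. This weighting is only one possible way to compensate for the imbalance, not a choice made in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series(
    {2: 283301, 1: 211840, 3: 35754, 7: 20510, 6: 17367, 5: 9493, 4: 2747},
    name="count",
)

# Share of each class in the full dataset.
proportions = counts / counts.sum()

# Inverse-frequency weights: n_samples / (n_classes * class_count).
# Rare classes get weights above 1, frequent classes get weights below 1.
class_weights = counts.sum() / (len(counts) * counts)

summary = pd.DataFrame(
    {"proportion": proportions.round(4), "weight": class_weights.round(2)}
)
print(summary.sort_index())
```

Weights like these could later be passed to models that accept per-class weights, as an alternative to resampling the training data.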

The category behind each value is the following:
- Cover type 1: spruce/fir
- Cover type 2: lodgepole pine
- Cover type 3: ponderosa pine
- Cover type 4: cottonwood/willow
- Cover type 5: aspen
- Cover type 6: Douglas-fir
- Cover type 7: krummholz

# Summary:

The dataset is large. At first glance, we can see that:
- There are more than half a million entries (581,012) and 55 columns.
- All variables are numeric, but many of them are the result of **one-hot encoding** categorical variables into binary columns.
- `Cover_Type` is the target variable we want to predict. With this much data, we can try many different classical machine learning models (and maybe some basic neural networks) to predict the cover type of a pixel.
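Since `df.info()` reports every column as `int64`, one way to lighten the DataFrame is to downcast each column to the smallest integer type that fits its values. The sketch below shows the idea on a tiny made-up frame (the values are illustrative, not taken from the dataset); whether the saving matters for this project is an assumption to be checked on the real data.

```python
import pandas as pd

# Toy stand-in mirroring three kinds of columns in the dataset (values are made up).
toy = pd.DataFrame(
    {
        "Elevation": [2596, 2590, 2804],   # metres
        "Hillshade_9am": [221, 220, 234],  # index from 0 to 255
        "Wilderness_Area1": [1, 0, 1],     # one-hot flag
    }
).astype("int64")

before = toy.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest signed integer dtype that fits each column.
small = toy.apply(pd.to_numeric, downcast="integer")

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```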

# Tests

In [None]:
sns.histplot(df["Hillshade_9am"])

In [None]:
df["Hillshade_9am"].head()

In [None]:
# Start a single label column: 1 for rows in wilderness area 1, 0 otherwise.
df["Wilderness_Area"] = (df["Wilderness_Area1"] == 1).astype(int)

In [None]:
# Rows in wilderness area 2 get label 2 (areas 3 and 4 are not handled yet).
df.loc[df["Wilderness_Area2"] == 1, "Wilderness_Area"] = 2

In [None]:
df["Wilderness_Area"].value_counts()
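The cells above build the label column one area at a time. A possible generalization to all four areas is sketched below on a toy frame (not the real data). It assumes the columns are named `Wilderness_Area1` to `Wilderness_Area4` and that every row has exactly one of them set to 1, which the sketch checks before collapsing.

```python
import pandas as pd

# Toy stand-in for the four one-hot wilderness columns (not the real data).
toy = pd.DataFrame(
    {
        "Wilderness_Area1": [1, 0, 0, 0],
        "Wilderness_Area2": [0, 1, 0, 0],
        "Wilderness_Area3": [0, 0, 1, 0],
        "Wilderness_Area4": [0, 0, 0, 1],
    }
)

area_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]

# The shortcut below is only valid if every row has exactly one area set.
assert (toy[area_cols].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the name of the column holding the 1 in each row;
# stripping the prefix leaves the area number.
toy["Wilderness_Area"] = (
    toy[area_cols]
    .idxmax(axis=1)
    .str.replace("Wilderness_Area", "", regex=False)
    .astype(int)
)
print(toy["Wilderness_Area"].tolist())
```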

## Open questions: to be answered in the future

1. Are different soil types compatible? That is, can a pixel have more than one soil type, and thus more than one value of 1 across the soil columns?
2. Are all soil types equally represented? Is there a soil type whose column is almost entirely 0? If so, we could drop that column.
3. Can we group the different soil types into a smaller number of columns? Maybe different soil types show themselves in groups. In that case, we could just create a new column per group and drop the individual columns.
4. From the ML point of view: which are the most influential columns? How much can we simplify the dataset by dropping columns?
5. Is it possible to reduce the number of columns by combining them into a single column? Not by creating columns that represent soil type groups, but by merging all of them (or as many of them as possible) into a single column, where each soil type would be represented by an index value.
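Questions 1, 2 and 5 can be explored with a few pandas operations. The sketch below runs on a toy frame with three soil columns; the `Soil_Type1`, `Soil_Type2`, ... naming is an assumption, since the `df.info()` output above is cut off before the soil columns. It shows how the checks could be written and says nothing about what the real data contains.

```python
import numpy as np
import pandas as pd

# Toy stand-in with three soil columns (column names are assumed, values made up).
toy = pd.DataFrame(
    {
        "Soil_Type1": [1, 0, 0, 1, 0],
        "Soil_Type2": [0, 1, 0, 0, 1],
        "Soil_Type3": [0, 0, 1, 0, 0],
    }
)
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]

# Question 1: how many soil types are set per row? A maximum of 1 means
# the soil types never overlap.
types_per_row = toy[soil_cols].sum(axis=1)
print(types_per_row.value_counts())

# Question 2: share of rows in which each soil type is present, rarest first.
presence = toy[soil_cols].mean().sort_values()
print(presence)

# Question 5: only if every row has exactly one soil type, merge the columns
# into one index column. argmax gives the position within soil_cols, so this
# matches the soil type number only if the columns are in numeric order.
if (types_per_row == 1).all():
    toy["Soil_Type"] = np.argmax(toy[soil_cols].to_numpy(), axis=1) + 1
    print(toy["Soil_Type"].tolist())
```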

## Future steps:
(provisional, this cell will be moved to a different notebook or file in the future)

- Is this compatible with other forests (forests with other tree species)? Is it compatible with other forest types?
- Can it be combined with satellite data?
- Is it possible to create a pipeline where satellite data is preprocessed with ML algorithms (computer vision, CV?) to get scalar values that represent each soil type, and then combined with the model developed in this project to predict cover type? Would this provide any improvement over directly predicting the tree type with CV?

In [None]:
df["Cover_Type"].sort_values().unique()  # NOTE: interesting. Which is more efficient: calling unique() first or sort_values() first?
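On the question in the comment above: both orders return the same values, but deduplicating first means only the handful of distinct labels has to be sorted instead of every row, so it should generally be the cheaper option. The sketch below compares the two on a randomly generated stand-in series (not the real column); timing both with `%timeit` on the actual data would confirm the difference.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["Cover_Type"]: many rows, only seven distinct labels.
rng = np.random.default_rng(0)
cover = pd.Series(rng.integers(1, 8, size=100_000), name="Cover_Type")

# Option A (the cell above): sort every row, then deduplicate.
sorted_then_unique = cover.sort_values().unique()

# Option B: deduplicate first, then sort only the distinct labels.
unique_then_sorted = np.sort(cover.unique())

# Both orders give the same result.
assert (sorted_then_unique == unique_then_sorted).all()
print(unique_then_sorted)
```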