# Getting to know the Covertype dataset

## Description

The Covertype dataset provides cartographic variables (no remotely sensed data) for predicting forest cover types. The data covers four wilderness areas in the Roosevelt National Forest of northern Colorado, where cover types result mainly from ecological processes rather than human-caused disturbance.

The original dataset has 12 attributes (10 quantitative, 2 qualitative), but the data is stored in 54 feature columns: 10 hold the quantitative variables, 4 are binary indicators encoding the wilderness area, and 40 are binary indicators encoding the soil type. A final column holds the cover type to be predicted. The table below gives more detail:

| Attribute name           | Type                     | Measurement     | Description                                     |
| ------------------------ |:-------------------------|:----------------|------------------------------------------------:|
| elevation                | quantitative             | meters          | Elevation in meters                             |
| aspect                   | quantitative             | degrees azimuth | Aspect in degrees azimuth                       |
| slope                    | quantitative             | degrees         | Slope in degrees                                |
| horiz_dist_hydro         | quantitative             | meters          | Horiz. dist. to nearest surface water           |
| vert_dist_hydro          | quantitative             | meters          | Vert. dist. to nearest surface water            |
| horiz_dist_road          | quantitative             | meters          | Horiz. dist. to nearest roadway                 |
| hillshade_9              | quantitative             | 0 to 255 index  | Hillshade index at 9am, summer solstice         |
| hill_shade_noon          | quantitative             | 0 to 255 index  | Hillshade index at noon, summer solstice        |
| hill_shade_15            | quantitative             | 0 to 255 index  | Hillshade index at 3pm, summer solstice         |
| horiz_dist_fire          | quantitative             | meters          | Horiz. dist. to nearest wildfire ignition point |
| wild_area_[0-3]          | qualitative (4 classes)  | binary          | Wilderness area designation                     |
| soil_type_[0-39]         | qualitative (40 classes) | binary          | Soil type designation                           |
| cover_type               | integer                  | 1 to 7          | Forest cover type designation                   |
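
Because the two qualitative attributes are spread over binary indicator columns, it can be convenient to collapse them back into a single categorical column. The snippet below is a minimal sketch on a made-up frame, assuming the `wild_area_<i>` naming used later in this notebook; the `toy` frame and the derived `wild_area` column are hypothetical and only serve as illustration.

```python
import pandas as pd

# Toy frame with the same one-hot layout as the wilderness-area columns
# (values are made up for illustration, not taken from the dataset).
toy = pd.DataFrame(
    {
        "wild_area_0": [1, 0, 0],
        "wild_area_1": [0, 0, 1],
        "wild_area_2": [0, 1, 0],
        "wild_area_3": [0, 0, 0],
    }
)

wild_cols = [f"wild_area_{i}" for i in range(4)]

# Each row should have exactly one active indicator.
assert (toy[wild_cols].sum(axis=1) == 1).all()

# idxmax returns the name of the column holding the 1; stripping the
# prefix recovers the class index as an integer.
toy["wild_area"] = (
    toy[wild_cols]
    .idxmax(axis=1)
    .str.replace("wild_area_", "", regex=False)
    .astype(int)
)
print(toy["wild_area"].tolist())
```

The same pattern should apply to the 40 soil-type indicators by swapping the prefix and the range.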

The task is to classify the forest cover into one of seven types:

| Number | Type | Image     |
| -------|------|-----------|
| 1      | Spruce/Fir|<img src="imgs/spruce.jpg" width="150px" height="150px"/>|
| 2      | Lodgepole Pine|<img src="imgs/lodge.jpg" width="150px" height="150px"/>|
| 3      | Ponderosa Pine|<img src="imgs/ponderosa.jpg" width="150px" height="150px"/>|
| 4      | Cottonwood/Willow|<img src="imgs/cottonwood.jpg" width="150px" height="150px"/>|
| 5      | Aspen|<img src="imgs/aspen.jpg" width="150px" height="150px"/>|
| 6      | Douglas-fir|<img src="imgs/douglas.jpg" width="150px" height="150px"/>|
| 7      | Krummholz|<img src="imgs/krumm.jpg" width="150px" height="150px"/>|

## Loading the data and naming the columns

Below is the code for loading and previewing the raw dataset using the `pandas` library:

In [None]:
import pandas as pd
# read data as csv
dataset = pd.read_csv("datasets/covtype.data", header=None)
# preview the first five rows
dataset.head()

The original file has no header row, so columns can only be addressed by integer position. Assigning descriptive column names makes the data easier to work with:

In [None]:
# list of column names
column_names = (
    ["elevation", "aspect", "slope",
     "horiz_dist_hydro", "vert_dist_hydro",
     "horiz_dist_road", "hillshade_9",
     "hill_shade_noon", "hill_shade_15", "horiz_dist_fire"]
    + [f"wild_area_{i}" for i in range(4)]
    + [f"soil_type_{i}" for i in range(40)]
    + ["cover_type"]
)
# change column names in dataframe
dataset.columns = column_names
# confirm the dataset size
print(f"Dataset shape: {dataset.shape}")
# check the resulting dataset with column names
dataset.head()
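
As an alternative to renaming after loading, `pandas.read_csv` accepts the names directly through its `names` parameter. The sketch below illustrates this on a two-row in-memory buffer; the rows and the shortened `names` list are made up for the example and are not the real file contents.

```python
import io

import pandas as pd

# Two made-up rows standing in for a headerless file (not real Covertype data).
raw = io.StringIO("2596,51,3,5\n2590,56,2,5\n")

# Hypothetical shortened name list, used only for this illustration.
names = ["elevation", "aspect", "slope", "cover_type"]

# header=None tells pandas the first line is data; names= labels the columns.
toy = pd.read_csv(raw, header=None, names=names)

# Guard against a name list that does not match the file's width.
assert toy.shape[1] == len(names)
print(toy.columns.tolist())
```

With the full 55-entry list, the same call on `datasets/covtype.data` would presumably give the named frame in one step.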

## Dealing with class imbalance

The table below shows the number of rows per class:

| Type | Number of rows |
| -----|----------------|
| Spruce/Fir | 211840|
| Lodgepole Pine| 283301|
| Ponderosa Pine | 35754|
| Cottonwood/Willow| 2747|
| Aspen | 9493|
| Douglas-fir | 17367|
| Krummholz | 20510|
| **Total** | **581012**|
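
To make the spread between classes concrete, the counts in the table can be turned into percentages and a largest-to-smallest ratio. The snippet below is a plain-Python sketch that uses only the numbers from the table; the variable names are illustrative.

```python
# Row counts copied from the table above.
counts = {
    "Spruce/Fir": 211840,
    "Lodgepole Pine": 283301,
    "Ponderosa Pine": 35754,
    "Cottonwood/Willow": 2747,
    "Aspen": 9493,
    "Douglas-fir": 17367,
    "Krummholz": 20510,
}

total = sum(counts.values())
smallest = min(counts.values())
largest = max(counts.values())

# Share of each class in the full dataset, in percent.
shares = {name: 100 * n / total for name, n in counts.items()}

# Size of a subset in which every class is cut down to the smallest
# class's row count, and the number of rows that would be left over.
balanced_rows = smallest * len(counts)
remaining_rows = total - balanced_rows

for name, share in shares.items():
    print(f"{name}: {share:.2f}%")
print(f"largest/smallest class ratio: {largest / smallest:.1f}")
print(f"balanced subset: {balanced_rows} rows, remaining: {remaining_rows} rows")
```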

The dataset is highly imbalanced: the largest class has about 100 times as many rows as the smallest. Because the dataset is large and even the smallest class still has a substantial number of instances (2747), the dataset is reduced so that every class has 2747 instances. These instances are selected at random within each class, and the rest are kept in a separate dataset. The code below does this:

In [None]:
from sklearn.utils import resample
# group by cover type
groups = dataset.groupby("cover_type")
# get minimum number of instances in a class
number_samples = groups.size().min()
# produce the new dataset: sample without replacement so that each class
# contributes number_samples distinct rows
new_dataset = pd.concat([resample(df, replace=False,
                                  n_samples=number_samples,
                                  random_state=123) for _, df in groups])
# keep the remaining rows (resample preserves the original index labels)
remaining_dataset = dataset.drop(new_dataset.index)
# check sizes
print("New dataset shape: " + str(new_dataset.shape))
print("Remaining dataset shape: " + str(remaining_dataset.shape))
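
The outcome of such a split can be checked directly: every class should have the same row count, and no row should appear in both parts. The following is a small sketch on made-up data. It uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) instead of the `resample` call above, so it is an alternative illustration and not the notebook's own method.

```python
import pandas as pd

# Made-up imbalanced data: 6 rows of class 1, 3 of class 2, 2 of class 3.
toy = pd.DataFrame(
    {
        "elevation": range(11),
        "cover_type": [1] * 6 + [2] * 3 + [3] * 2,
    }
)

# Size of the smallest class.
n_per_class = int(toy.groupby("cover_type").size().min())

# Draw n_per_class distinct rows from every class (no replacement by default).
balanced = toy.groupby("cover_type").sample(n=n_per_class, random_state=123)

# Rows are identified by index label, so dropping the sampled labels
# leaves exactly the rows that were not selected.
remaining = toy.drop(balanced.index)

# Every class has the same number of rows, and the two parts partition the data.
assert (balanced["cover_type"].value_counts() == n_per_class).all()
assert balanced.index.intersection(remaining.index).empty
assert len(balanced) + len(remaining) == len(toy)
print(balanced.shape, remaining.shape)
```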

## Missing values analysis

The dataset has no missing values. This can be checked with the `missingno` library, which draws a simple visualization of the new dataset (the dataset of remaining instances behaves the same way). White marks a missing value in a column; for this dataset every column is completely dark:

In [None]:
import missingno as msno
# render plots inline in the notebook
%matplotlib inline
# plot a graph showing the missing values (in this case, there are none)
msno.matrix(new_dataset)
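
A numeric check can complement the plot, since a single missing cell is easy to overlook visually in a large matrix. The sketch below counts missing entries with `DataFrame.isna` on a made-up frame that deliberately contains one gap; the `toy` frame is hypothetical.

```python
import pandas as pd

# Made-up frame with one deliberately missing value, for illustration only.
toy = pd.DataFrame(
    {
        "elevation": [2596.0, float("nan"), 2804.0],
        "slope": [3, 2, 9],
    }
)

# Number of missing entries per column, and the overall total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column.to_dict())
print("total missing:", total_missing)
```

Applied to `new_dataset`, the total would be expected to be 0 if the dataset indeed has no missing values.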

## Save the new datasets

After generating the new dataset and the dataset of remaining instances, both are saved as CSV files for use in the next steps of this work:

In [None]:
# save both datasets
new_dataset.to_csv("datasets/new_dataset_covertype.csv", index=False)
remaining_dataset.to_csv("datasets/remaining_dataset_covertype.csv", index=False)