# Getting to know the Covertype dataset

## Description

The Covertype dataset provides cartographic variables (no remotely sensed data) for predicting forest cover types. The data covers four wilderness areas in the Roosevelt National Forest of northern Colorado, where cover types result mainly from ecological processes rather than human-caused disturbance.

The original dataset contains 12 attributes (10 quantitative, 2 qualitative), but the collected data is organized in 54 feature columns: 10 hold the quantitative variables, 4 are binary indicators encoding the wilderness area, and 40 are binary indicators encoding the soil type. One final column holds the cover type label, for 55 columns in total. A more detailed description is given in the table below:

| Attribute name           | Type             | Measurement    | Description                          |
| ------------------------ |:-----------------|:---------------|-------------------------------------:|
| elevation                | quantitative     | meters         | Elevation in meters                  |
| aspect                   | quantitative     | azimuth        | Aspect in degrees azimuth            |
| slope                    | quantitative     | degrees        | Slope in degrees                     |
| horiz_dist_hydro         | quantitative     | meters         | Horiz. Dist. to nearest surface water |
| vert_dist_hydro          | quantitative     | meters         | Vert. Dist. to nearest surface water |
| horiz_dist_road          | quantitative     | meters         | Horiz. Dist. to nearest roadway      |
| hillshade_9              | quantitative     | 0 to 255 index | Hillshade index at 9am, summer solstice |
| hillshade_noon           | quantitative     | 0 to 255 index | Hillshade index at noon, summer solstice |
| hillshade_15             | quantitative     | 0 to 255 index | Hillshade index at 3pm, summer solstice |
| horiz_dist_fire          | quantitative     | meters         | Horiz. Dist. to nearest wildfire ignition points |
| wild_area_[0-3]          | qualitative      | binary         | Wilderness area designation          |
| soil_type_[0-39]         | qualitative      | binary         | Soil type designation                |
| cover_type               | integer          | 1 to 7         | Forest cover type designation        |
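
The column layout described above can be sketched in a few lines. This is a minimal illustration rather than code from the dataset's authors: it assumes the 54 feature values appear in the order quantitative, wilderness, soil, and that exactly one flag is set in each binary block. The helper `split_row` and the sample values are hypothetical.

```python
import numpy as np

# Block sizes taken from the description: 10 + 4 + 40 = 54 feature columns
N_QUANT, N_WILD, N_SOIL = 10, 4, 40

def split_row(features):
    """Split 54 feature values into quantitative values and two category indices."""
    features = np.asarray(features)
    if features.shape != (N_QUANT + N_WILD + N_SOIL,):
        raise ValueError("expected 54 feature values")
    quantitative = features[:N_QUANT]
    wild_flags = features[N_QUANT:N_QUANT + N_WILD]
    soil_flags = features[N_QUANT + N_WILD:]
    # Assumption: exactly one flag is set in each binary block
    if wild_flags.sum() != 1 or soil_flags.sum() != 1:
        raise ValueError("each binary block should contain exactly one 1")
    return quantitative, int(wild_flags.argmax()), int(soil_flags.argmax())

# Made-up row for illustration (not taken from the dataset)
row = np.zeros(54, dtype=int)
row[:10] = [3000, 90, 10, 100, 20, 500, 220, 230, 140, 1000]
row[10 + 2] = 1    # wilderness area index 2
row[14 + 28] = 1   # soil type index 28

quantitative, wild_idx, soil_idx = split_row(row)
print(quantitative.tolist(), wild_idx, soil_idx)
```

Decoding the binary blocks this way recovers the two qualitative attributes, which is how 12 attributes end up stored as 54 feature columns.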

The task is to classify the forest cover of each observation into one of seven types:

| Number | Type | Image     |
| -------|------|-----------|
| 1      | Spruce/Fir|<img src="imgs/spruce.jpg" width="150px" height="150px"/>|
| 2      | Lodgepole Pine|<img src="imgs/lodge.jpg" width="150px" height="150px"/>|
| 3      | Ponderosa Pine|<img src="imgs/ponderosa.jpg" width="150px" height="150px"/>|
| 4      | Cottonwood/Willow|<img src="imgs/cottonwood.jpg" width="150px" height="150px"/>|
| 5      | Aspen|<img src="imgs/aspen.jpg" width="150px" height="150px"/>|
| 6      | Douglas-fir|<img src="imgs/douglas.jpg" width="150px" height="150px"/>|
| 7      | Krummholz|<img src="imgs/krumm.jpg" width="150px" height="150px"/>|
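
When inspecting predictions, it can help to translate the integer labels back into the names from the table. The sketch below shows one possible way with `pandas`; the dictionary name `COVER_TYPE_NAMES` and the sample labels are made up for illustration.

```python
import pandas as pd

# Integer label -> cover type name, copied from the table above
COVER_TYPE_NAMES = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}

# Made-up labels standing in for the cover_type column
labels = pd.Series([1, 2, 7, 5], name="cover_type")

# Series.map looks up each integer in the dictionary
names = labels.map(COVER_TYPE_NAMES)
print(names.tolist())
```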


## Loading the data and naming the columns

Below is the code for loading and previewing the raw dataset using the `pandas` library:

In [None]:
import pandas as pd
# read the comma-separated file; it has no header row
dataset = pd.read_csv("dataset/covtype.data", header=None)
# preview the first five rows
dataset.head()

Note that the original file has no header row, so `pandas` labels the columns with the integers 0 to 54. Assigning descriptive column names makes the dataframe easier to read and to index:

In [None]:
# list of column names: 10 quantitative, 4 wilderness, 40 soil, 1 target
column_names = (
    ["elevation", "aspect", "slope", "horiz_dist_hydro", "vert_dist_hydro",
     "horiz_dist_road", "hillshade_9", "hillshade_noon", "hillshade_15", "horiz_dist_fire"]
    + [f"wild_area_{i}" for i in range(4)]
    + [f"soil_type_{i}" for i in range(40)]
    + ["cover_type"]
)
# change column names in dataframe
dataset.columns = column_names
# confirm the dataset size
print("Dataset shape: " + str(dataset.shape))
# check the resulting dataset with column names
dataset.head()
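
Once the columns are named, whole groups of them can be selected by label instead of by integer position. The sketch below illustrates this on a tiny synthetic frame, not the real data: the frame `toy` is hypothetical, keeps only two of the quantitative columns for brevity, and assumes exactly one wilderness flag and one soil flag per row.

```python
import numpy as np
import pandas as pd

# Reduced, made-up stand-in for the real dataframe
toy_columns = (
    ["elevation", "slope"]
    + [f"wild_area_{i}" for i in range(4)]
    + [f"soil_type_{i}" for i in range(40)]
    + ["cover_type"]
)
toy = pd.DataFrame(np.zeros((3, len(toy_columns)), dtype=int), columns=toy_columns)
toy["elevation"] = [2500, 2800, 3100]
toy["wild_area_2"] = 1
toy["soil_type_7"] = 1
toy["cover_type"] = [1, 2, 2]

# select column groups by name instead of by integer position
wild_block = toy.filter(like="wild_area_")
soil_block = toy.filter(like="soil_type_")

# check the one-flag-per-row assumption for both binary blocks
one_wild = (wild_block.sum(axis=1) == 1).all()
one_soil = (soil_block.sum(axis=1) == 1).all()

# count how many rows fall in each cover type
counts = toy["cover_type"].value_counts().sort_index()
print(wild_block.shape, soil_block.shape, one_wild, one_soil)
print(counts.to_dict())
```

The same `filter` and `value_counts` calls should apply to `dataset` after the renaming step, for example to see how the rows are distributed across the seven cover types.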