# Data Collection

For this project, we aim to build a **predictive model** that **classifies forest cover type** from **cartographic variables**. Understanding the relationships among these variables could be valuable for **land management** and **ecological studies**.

## 1. Data Source and Subject

- **Subject**: The dataset contains information about forest cover types (Spruce/Fir, Lodgepole Pine, Aspen, and others) in four wilderness areas of the Roosevelt National Forest in northern Colorado. Each observation represents a $30 \times 30$ meter cell and includes cartographic variables such as elevation, slope, aspect, distance to hydrology, and soil type.
- **Source**: We use the "Covertype" dataset, which originates from the UCI Machine Learning Repository and is hosted on Kaggle: https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset/.

## 2. Data License

The dataset is made available under the Public Domain. There are no restrictions on its use, modification, or distribution.

## 3. Data Collection Method

- **Method**: The data was originally collected by the US Forest Service (USFS) Region 2 Resource Information System (RIS). The observations were derived from data obtained from the US Geological Survey (USGS) and the USFS.
  - **Cartographic Variables**: Cartographic variables were determined using USGS Digital Elevation Models (DEMs) and other geographic information system (GIS) data.
  - **Soil Type**: Soil type designations were based on USFS Ecological Landtype Units (ELUs).
  - **Cover Type**: The actual forest cover type for each $30 \times 30$ meter cell was determined using observations and aerial photography.

## 4. Data Ethics and Limitations

- **Data Collection Date**: The dataset was donated on 7/31/1998. Forest cover type can change over time due to natural disturbances and human activities, so the data may not reflect current forest conditions in Roosevelt National Forest.
- **Spatial Resolution**: The $30 \times 30$ meter resolution may not capture fine-scale variations in forest cover type within each cell.
- **Data Accuracy**: While the data has been widely used, it's important to note that the accuracy of the original USFS and USGS data sources can influence the reliability of the dataset.
- **Limited Scope**: This dataset is specific to a limited region in Colorado, so our findings might not be directly transferable to other forests.

# Exploratory Data Analysis (EDA)

## 1. Import Libraries

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 2. Basic Description

In [None]:
df = pd.read_csv('covtype.csv')
df.head()

In [None]:
rows, cols = df.shape
print(f'Number of rows: {rows}')
print(f'Number of columns: {cols}')

The dataset has 581,012 rows and 55 columns.

Each row represents a $30 \times 30$ meter cell of forest land within the **Roosevelt National Forest of northern Colorado**.

The columns represent the following cartographic and ecological variables:

*   **Elevation:** Elevation in meters.
*   **Aspect:** Aspect in degrees azimuth.
*   **Slope:** Slope in degrees.
*   **Horizontal\_Distance\_To\_Hydrology:** Horizontal distance to the nearest surface water feature (in meters).
*   **Vertical\_Distance\_To\_Hydrology:** Vertical distance to the nearest surface water feature (in meters).
*   **Horizontal\_Distance\_To\_Roadways:** Horizontal distance to the nearest roadway (in meters).
*   **Hillshade\_9am:** Hillshade index at 9 am on the summer solstice (0 to 255 index).
*   **Hillshade\_Noon:** Hillshade index at noon on the summer solstice (0 to 255 index).
*   **Hillshade\_3pm:** Hillshade index at 3 pm on the summer solstice (0 to 255 index).
*   **Horizontal\_Distance\_To\_Fire\_Points:** Horizontal distance to the nearest wildfire ignition point (in meters).
*   **Wilderness\_Area\_1 to Wilderness\_Area\_4:** One-hot encoded columns representing the wilderness area:
    *   1: Rawah Wilderness Area
    *   2: Neota Wilderness Area
    *   3: Comanche Peak Wilderness Area
    *   4: Cache la Poudre Wilderness Area
*   **Soil\_Type\_1 to Soil\_Type\_40:** One-hot encoded columns representing 40 different soil types (ELUs).
*   **Cover\_Type:** The forest cover type (integer class label 1-7), which is the target variable:
    *   1: Spruce/Fir
    *   2: Lodgepole Pine
    *   3: Ponderosa Pine
    *   4: Cottonwood/Willow
    *   5: Aspen
    *   6: Douglas-fir
    *   7: Krummholz
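Because `Cover_Type` is stored as an integer label, summaries are easier to read once the labels are mapped back to names. The following is a minimal sketch of that mapping, using the label list above. The helper `cover_type_distribution` and the toy DataFrame are hypothetical and only stand in for the real `df`.

```python
import pandas as pd

# Label-to-name mapping taken from the list above
COVER_TYPE_NAMES = {
    1: 'Spruce/Fir',
    2: 'Lodgepole Pine',
    3: 'Ponderosa Pine',
    4: 'Cottonwood/Willow',
    5: 'Aspen',
    6: 'Douglas-fir',
    7: 'Krummholz',
}

def cover_type_distribution(frame):
    """Return the share of each cover type, indexed by its readable name."""
    names = frame['Cover_Type'].map(COVER_TYPE_NAMES)
    return names.value_counts(normalize=True)

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({'Cover_Type': [1, 2, 2, 5, 7, 2]})
distribution = cover_type_distribution(toy)
print(distribution)
```

On the real data, `cover_type_distribution(df)` would show how balanced the seven classes are, which matters when choosing evaluation metrics later.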

## 3. Data Quality Checks

In [None]:
duplicate_rows = df.duplicated().sum()
print(f'Number of duplicated rows: {duplicate_rows}')

The dataset contains no duplicated rows.

In [None]:
df.info()

**Observation:** The data types appear appropriate. Numerical features are `int64`, and the one-hot encoded categorical features are `int64` (which is suitable for binary 0/1 representation).
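Data types alone do not confirm that the one-hot groups are well formed. As a rough sketch, one could also check that each group contains only 0/1 values and that every row has exactly one active column. The helper `check_one_hot` is hypothetical, and the column prefixes `Wilderness_Area` and `Soil_Type` are assumptions; the actual CSV header may spell them slightly differently.

```python
import pandas as pd

def check_one_hot(frame, prefix):
    """Check that columns starting with `prefix` form a valid one-hot group."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    group = frame[cols]
    # All entries in the group should be 0 or 1
    only_binary = bool(group.isin([0, 1]).all().all())
    # Every row should have exactly one active column in the group
    rows_with_one_active = int((group.sum(axis=1) == 1).sum())
    return {
        'n_columns': len(cols),
        'only_binary': only_binary,
        'invalid_rows': len(frame) - rows_with_one_active,
    }

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Wilderness_Area_1': [1, 0, 0],
    'Wilderness_Area_2': [0, 1, 1],
    'Soil_Type_1': [1, 0, 1],
    'Soil_Type_2': [0, 1, 1],  # last row has two active soil types
})
wilderness_report = check_one_hot(toy, 'Wilderness_Area')
soil_report = check_one_hot(toy, 'Soil_Type')
print(wilderness_report)
print(soil_report)
```

A non-zero `invalid_rows` count on the real `df` would point to rows with no assigned category or with several categories at once.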

## 4. Numerical Data Analysis

Let's analyze each numerical column: `Elevation`, `Aspect`, `Slope`, `Horizontal_Distance_To_Hydrology`, `Vertical_Distance_To_Hydrology`, `Horizontal_Distance_To_Roadways`, `Hillshade_9am`, `Hillshade_Noon`, `Hillshade_3pm`, `Horizontal_Distance_To_Fire_Points`.

1. `Elevation`
- Represents the height above sea level (in meters).
- Expected range: Depends on the location of the forest; typically between 0 and 8848 (highest elevation on Earth).
- Potential abnormalities:
    - Extremely low values (e.g., negative or close to zero, as forests aren’t in deep depressions).
    - Extremely high values (e.g., > 8848 meters, above the highest point on Earth).
  
2. `Aspect`
- Represents the compass direction of the slope (0 to 360 degrees).
- Expected range: 0–360 (where 0 = North, 90 = East, etc.).
- Potential abnormalities:
    - Values outside 0–360.
    - Missing or default values (e.g., if coded as -1 or 999).
      
3. `Slope`
- Represents the steepness of the slope in degrees.
- Expected range: 0–90 degrees (0 = flat, 90 = vertical).
- Potential abnormalities:
    - Values greater than 90 or less than 0.
    - Extremely high slopes (close to 90) could indicate rare or misreported areas.
      
4. `Horizontal_Distance_To_Hydrology`
- Horizontal distance to the nearest water body (in meters).
- Expected range: Depends on the forest’s geography but typically positive values (as distances cannot be negative).
- Potential abnormalities:
    - Negative values (not possible for distance).
    - Extremely large distances (e.g., more than 100 km, which might indicate erroneous or unscaled data).
      
5. `Vertical_Distance_To_Hydrology`
- Vertical distance to the nearest water body (in meters).
- Expected range: Positive or negative values (positive = above water level; negative = below water level).
- Potential abnormalities:
    - Extremely large positive or negative values (e.g., beyond ±500 meters, as water bodies are usually close to ground level).
      
6. `Horizontal_Distance_To_Roadways`
- Horizontal distance to the nearest road (in meters).
- Expected range: Positive values (distance cannot be negative).
- Potential abnormalities:
    - Negative values.
    - Extremely large distances (e.g., >100 km in highly connected areas might be questionable).
      
7. `Hillshade_9am`, `Hillshade_Noon`, `Hillshade_3pm`
- Indicate the amount of sunlight (0–255) at specific times of day.
- Expected range: 0–255 (0 = no sunlight, 255 = maximum sunlight).
- Potential abnormalities:
    - Values outside 0–255.
    - Simultaneously high values for all three times (unlikely due to natural shading and light direction).
    - Simultaneously low values for all three times (e.g., 0, 0, 0, which might indicate data recording errors).
      
8. `Horizontal_Distance_To_Fire_Points`
- Horizontal distance to the nearest fire point (in meters).
- Expected range: Positive values (distance cannot be negative).
- Potential abnormalities:
    - Negative values.
    - Extremely large distances (e.g., more than 100 km, which might indicate data errors or poor scaling).
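
The joint hillshade conditions listed under item 7 (all three readings high, or all three low) cannot be expressed as single-column range checks. A minimal sketch of a row-wise check follows. The helper `flag_joint_hillshade` and the cut-offs `low=0` and `high=250` are hypothetical choices for illustration, not thresholds from the dataset documentation.

```python
import pandas as pd

HILLSHADE_COLS = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']

def flag_joint_hillshade(frame, low=0, high=250):
    """Flag rows where all three hillshade readings are extreme together.

    `low` and `high` are hypothetical cut-offs chosen for illustration.
    """
    shade = frame[HILLSHADE_COLS]
    all_low = (shade <= low).all(axis=1)    # e.g. (0, 0, 0)
    all_high = (shade >= high).all(axis=1)  # bright at all three times
    return pd.DataFrame({'all_low': all_low, 'all_high': all_high})

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Hillshade_9am': [0, 220, 252],
    'Hillshade_Noon': [0, 235, 254],
    'Hillshade_3pm': [0, 140, 251],
})
flags = flag_joint_hillshade(toy)
print(flags.sum())
```

Summing each flag column gives the number of suspicious rows, which can then be inspected by hand before deciding whether they are errors.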

In [None]:
numerical_cols = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
                  'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
                  'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
                  'Horizontal_Distance_To_Fire_Points']

def identify_abnormal_values(df):
    # Plausible (lower, upper) bounds for each numerical feature
    bounds = {
        'Elevation': (0, 8848),
        'Aspect': (0, 360),
        'Slope': (0, 90),
        'Horizontal_Distance_To_Hydrology': (0, 100000),
        'Vertical_Distance_To_Hydrology': (-500, 500),
        'Horizontal_Distance_To_Roadways': (0, 100000),
        'Hillshade_9am': (0, 255),
        'Hillshade_Noon': (0, 255),
        'Hillshade_3pm': (0, 255),
        'Horizontal_Distance_To_Fire_Points': (0, 100000),
    }

    # Collect the values that fall outside the bounds for each feature
    abnormal_values = {}
    for col, (lower, upper) in bounds.items():
        abnormal_values[col] = df.loc[(df[col] < lower) | (df[col] > upper), col]

    # Return abnormal values for each feature
    return abnormal_values

abnormal_values = identify_abnormal_values(df)

for col in numerical_cols:
    print(f'Analysis of column: {col}')

    plt.hist(df[col])
    plt.title(f"Distribution of {col}")
    plt.show()

    missing_percentage = df[col].isnull().sum() * 100.0 / rows
    print(f'Percentage of missing values: {missing_percentage:.2f}%')

    min_val = df[col].min()
    max_val = df[col].max()

    print(f'Min value: {min_val}')
    print(f'Max value: {max_val}')

    print(f'Number of abnormal values: {len(abnormal_values[col])}')
    print('Abnormal values: ', abnormal_values[col])

    print('\n')
    

The check flags 113 values in `Vertical_Distance_To_Hydrology`, all of them more than 500 meters above the nearest surface water feature. A vertical offset of this size is plausible in steep, mountainous terrain, where a cell high on a slope can sit far above the nearest stream or lake.

Some further information about the location (Roosevelt National Forest):
- **Location:** Roosevelt National Forest is located in the Rocky Mountains of northern Colorado, which are highly mountainous areas.
- **Elevation:** The elevation of this forest varies widely due to the mountainous terrain. Some areas in the forest have elevations ranging from about 1,500 meters (5,000 feet) to over 4,000 meters (13,000 feet).
- **Water Bodies:** There are numerous rivers, streams, and lakes throughout the forest, and some of them are indeed located at high elevations.

**Conclusion:** Given this terrain, the 113 values above 500 meters are plausible rather than clear data errors.
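
One way to support this conclusion is to look at the terrain of the flagged cells: if they sit at high elevation on steep slopes, a large vertical offset to water is easier to believe. The sketch below assumes the column names listed earlier; the helper `summarize_high_vertical_distance` and the toy values are hypothetical.

```python
import pandas as pd

def summarize_high_vertical_distance(frame, threshold=500):
    """Summarize terrain features for cells far above the nearest water."""
    flagged = frame[frame['Vertical_Distance_To_Hydrology'] > threshold]
    cols = ['Elevation', 'Slope', 'Horizontal_Distance_To_Hydrology']
    return flagged[cols].agg(['count', 'min', 'median', 'max'])

# Toy stand-in for the real data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'Elevation': [3400, 2600, 3550, 3600],
    'Slope': [35, 10, 40, 42],
    'Horizontal_Distance_To_Hydrology': [900, 120, 1100, 1000],
    'Vertical_Distance_To_Hydrology': [520, 15, 560, 601],
})
summary = summarize_high_vertical_distance(toy)
print(summary)
```

Running `summarize_high_vertical_distance(df)` on the real data would show whether the 113 flagged cells share the high-elevation, steep-slope profile that makes such values credible.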