---
The "Forest Cover Type" dataset is a well-known dataset with about half a million samples, in which the objective is to predict the type of tree cover found on a given plot of land. Each row in the dataset corresponds to a small plot of land and lists several geological and geographic features of that plot.

The file available here contains 20000 samples chosen at random from the original dataset. In addition, only samples from territories 1 and 3 were kept.

| #  | Column                             | Meaning |
|--- | ------                             | ----------- |
| 0  | Elevation                          | Terrain elevation (m) |
| 1  | Aspect                             | Terrain azimuth (degrees) |
| 2  | Slope                              | Terrain slope (degrees) |
| 3  | Horizontal_Distance_To_Hydrology   | Horizontal distance to the nearest hydrological feature (lakes, rivers, etc.) (m) |
| 4  | Vertical_Distance_To_Hydrology     | Vertical distance to the nearest hydrological feature (lakes, rivers, etc.) (m) |
| 5  | Horizontal_Distance_To_Roadways    | Horizontal distance to the nearest roadway (m) |
| 6  | Hillshade_9am                      | Hillshade index at 9 am on the summer solstice |
| 7  | Hillshade_Noon                     | Hillshade index at noon on the summer solstice |
| 8  | Hillshade_3pm                      | Hillshade index at 3 pm on the summer solstice |
| 9  | Horizontal_Distance_To_Fire_Points | Horizontal distance to the nearest fire point (m) |
| 10 | Cover_Type                         | Tree type (categorical variable) (this is our target) |
| 11 | soil_type                          | Soil type (categorical variable) |
| 12 | Wilderness                         | Data collection region (categorical variable) |

The file "covtype_info.txt" contains more information about this dataset, including the exact definitions of soil type, data collection region, and tree type.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('dataset.csv')
df['Cover_Type'] = df['Cover_Type'].astype('category')
df['soil_type'] = df['soil_type'].astype('category')
df['Wilderness'] = df['Wilderness'].astype('category')

In [3]:
df.head(3)

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Cover_Type,soil_type,Wilderness
0,2826,330,14,60,10,1549,187,222,175,2563,2,20,3
1,3283,27,3,0,0,4401,218,232,151,3653,1,23,3
2,2923,60,12,268,63,3555,229,215,118,5196,2,29,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column                              Non-Null Count  Dtype   
---  ------                              --------------  -----   
 0   Elevation                           20000 non-null  int64   
 1   Aspect                              20000 non-null  int64   
 2   Slope                               20000 non-null  int64   
 3   Horizontal_Distance_To_Hydrology    20000 non-null  int64   
 4   Vertical_Distance_To_Hydrology      20000 non-null  int64   
 5   Horizontal_Distance_To_Roadways     20000 non-null  int64   
 6   Hillshade_9am                       20000 non-null  int64   
 7   Hillshade_Noon                      20000 non-null  int64   
 8   Hillshade_3pm                       20000 non-null  int64   
 9   Horizontal_Distance_To_Fire_Points  20000 non-null  int64   
 10  Cover_Type                          20000 non-null  category
 11  soil_type                   

In [5]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Elevation,20000.0,2990.01095,222.092373,2286.0,2845.0,3001.0,3155.0,3846.0
Aspect,20000.0,150.3866,110.280571,0.0,55.0,120.0,246.0,360.0
Slope,20000.0,13.5714,7.048431,0.0,8.0,13.0,18.0,50.0
Horizontal_Distance_To_Hydrology,20000.0,274.0208,217.941151,0.0,108.0,228.0,391.0,1371.0
Vertical_Distance_To_Hydrology,20000.0,45.6834,58.822985,-153.0,6.0,28.0,67.0,595.0
Horizontal_Distance_To_Roadways,20000.0,2540.88885,1545.080692,0.0,1308.0,2250.0,3548.0,7063.0
Hillshade_9am,20000.0,213.8832,24.622174,79.0,201.0,219.0,232.0,254.0
Hillshade_Noon,20000.0,224.01065,18.740213,95.0,213.0,226.0,237.0,254.0
Hillshade_3pm,20000.0,141.6681,36.566896,0.0,119.0,142.0,166.0,251.0
Horizontal_Distance_To_Fire_Points,20000.0,2066.9625,1330.04027,0.0,1131.0,1809.0,2609.0,7089.0


In [6]:
df['Cover_Type'].value_counts()

2    10400
1     7618
7      710
3      583
5      370
6      319
Name: Cover_Type, dtype: int64

In [7]:
df[df.isnull().any(axis=1)].head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Cover_Type,soil_type,Wilderness


Cover type 2 corresponds to "lodgepole pine" trees. Our task is to build a classifier that predicts whether a plot of land is covered by this type of tree or not.

To do this, we must reclassify the trees as lodgepole pine or not lodgepole pine, as follows:

In [8]:
df["Cover_Type"] = (df["Cover_Type"] == 2).astype(int)
df["Cover_Type"].value_counts()

1    10400
0     9600
Name: Cover_Type, dtype: int64

In [9]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
print("{} train + {} test".format(len(train_set), len(test_set)))

16000 train + 4000 test


In [10]:
X_train = train_set.drop(columns=["Cover_Type"])
y_train = train_set["Cover_Type"]

X_test = test_set.drop(columns=["Cover_Type"])
y_test = test_set["Cover_Type"]

In [11]:
log_reg = LogisticRegression(max_iter=10000, random_state=42)

log_reg.fit(X_train, y_train)

some_data = X_train.iloc[:10]
some_valid = y_train.iloc[:10]

pred_data = log_reg.predict(some_data)
print("Prediction: {}".format(pred_data))

# Compare with the original values:
print("Original: {}".format(some_valid.values))

y_test_pred = log_reg.predict(X_test)
accuracy_score(y_test, y_test_pred)

Prediction: [0 1 0 1 1 0 0 0 0 0]
Original: [1 0 1 1 0 1 0 0 0 0]


0.6995
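
To put the 0.6995 accuracy in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below computes that reference value from the class counts printed earlier for the full 20000-row file. The 4000-row test split may have slightly different proportions, so treat the number as approximate. The variable names are illustrative.

```python
# Class counts after binarizing Cover_Type (from the value_counts output above).
class_counts = {1: 10400, 0: 9600}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a rule that always predicts the majority class.
majority_accuracy = class_counts[majority_class] / total

print("Majority class: {}".format(majority_class))
print("Majority-class accuracy: {:.4f}".format(majority_accuracy))
```

On these counts the trivial rule scores 0.52, so the logistic regression improves on it by roughly 18 percentage points.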

The Slope and Aspect variables are currently measured in degrees. Let's try converting them to radians.

In [12]:
def degree_to_rad(df):
    df['Aspect'] = df['Aspect'] * (np.pi/180)
    df['Slope'] = df['Slope'] * (np.pi/180)
    
df_radians = df.copy()
degree_to_rad(df_radians)
train_set_rad, test_set_rad = train_test_split(df_radians, test_size=0.2, random_state=42)

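# Build the features and target from the radian-converted splits.
X_train_rad = train_set_rad.drop(columns=["Cover_Type"])
y_train_rad = train_set_rad["Cover_Type"]

X_test_rad = test_set_rad.drop(columns=["Cover_Type"])
y_test_rad = test_set_rad["Cover_Type"]

log_reg_rad = LogisticRegression(max_iter=10000, random_state=42)

log_reg_rad.fit(X_train_rad, y_train_rad)

some_data_rad = X_train_rad.iloc[:10]
some_valid_rad = y_train_rad.iloc[:10]

pred_data_rad = log_reg_rad.predict(some_data_rad)
print("Prediction: {}".format(pred_data_rad))

# Compare with the original values:
print("Original: {}".format(some_valid_rad.values))

y_test_pred_rad = log_reg_rad.predict(X_test_rad)
accuracy_score(y_test_rad, y_test_pred_rad)

Prediction: [0 1 0 1 1 0 0 0 0 0]
Original: [1 0 1 1 0 1 0 0 0 0]


0.6995

The output above was recorded with `X_train_rad` and `X_test_rad` built from `train_set` and `test_set`, that is, from the data still in degrees, which is why it matches the baseline exactly. With the radian-converted splits used in the cell, the predictions and accuracy may differ slightly on a rerun.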
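
Converting degrees to radians multiplies a column by the constant π/180, and a linear model can absorb such a rescaling into its coefficient. The sketch below illustrates this with NumPy on the Aspect and Slope values of the three rows shown by `df.head(3)`. The weights and intercept are made-up numbers for illustration, not fitted values. Note that `LogisticRegression` applies L2 regularization by default, which penalizes coefficient size, so a refit on rescaled features may still differ slightly.

```python
import numpy as np

# Aspect and Slope (degrees) for the first three rows shown by df.head(3).
X_deg = np.array([[330.0, 14.0],
                  [27.0, 3.0],
                  [60.0, 12.0]])

# Same conversion as degree_to_rad: multiply by pi / 180.
X_rad = np.deg2rad(X_deg)

# Hypothetical weights and intercept of a linear decision function (not fitted).
w_deg = np.array([0.002, -0.05])
b = 0.1

# Rescaling the weights by 180 / pi compensates for the change of units.
w_rad = w_deg * (180.0 / np.pi)

scores_deg = X_deg @ w_deg + b
scores_rad = X_rad @ w_rad + b

# The two decision functions agree up to floating-point rounding.
print(np.allclose(scores_deg, scores_rad))
```

Because the decision scores coincide, a change of units alone is not expected to make the problem easier for a linear classifier.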