# Forest Cover Type Prediction
W207 Spring 2020

Elevation - Elevation in meters\
Aspect - Aspect in degrees azimuth\
Slope - Slope in degrees\
Horizontal_Distance_To_Hydrology - Horizontal distance to nearest surface water features\
Vertical_Distance_To_Hydrology - Vertical distance to nearest surface water features\
Horizontal_Distance_To_Roadways - Horizontal distance to nearest roadway\
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice\
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice\
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice\
Horizontal_Distance_To_Fire_Points - Horizontal distance to nearest wildfire ignition points\
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation\
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation\
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

## Import Libraries & Data

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from scipy import stats

In [2]:
# Data source: https://www.kaggle.com/c/forest-cover-type-prediction/overview
# Load data

train_path = '../data/train.csv'
unlabeled_path = '../data/test.csv'
train_csv = np.genfromtxt(train_path, delimiter=',', names=True)
unlabeled_csv = np.genfromtxt(unlabeled_path, delimiter=',', names=True)

# Extract feature and label strings
all_headers = train_csv.dtype.names
feature_name = unlabeled_csv.dtype.names
label_name = set(train_csv['Cover_Type'])

## Summary Statistics

In [14]:
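# Build a DataFrame from the structured array, then keep only its transposed summary
# (one row per column, one column per statistic); later cells index into this summary.
df = pd.DataFrame(train_csv, columns=list(all_headers))
df = df.describe().T
df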

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 15120.0 | 7560.5 | 4364.91237 | 1.0 | 3780.75 | 7560.5 | 11340.25 | 15120.0 |
| Elevation | 15120.0 | 2749.322553 | 417.678187 | 1863.0 | 2376.0 | 2752.0 | 3104.0 | 3849.0 |
| Aspect | 15120.0 | 156.676653 | 110.085801 | 0.0 | 65.0 | 126.0 | 261.0 | 360.0 |
| Slope | 15120.0 | 16.501587 | 8.453927 | 0.0 | 10.0 | 15.0 | 22.0 | 52.0 |
| Horizontal_Distance_To_Hydrology | 15120.0 | 227.195701 | 210.075296 | 0.0 | 67.0 | 180.0 | 330.0 | 1343.0 |
| Vertical_Distance_To_Hydrology | 15120.0 | 51.076521 | 61.239406 | -146.0 | 5.0 | 32.0 | 79.0 | 554.0 |
| Horizontal_Distance_To_Roadways | 15120.0 | 1714.023214 | 1325.066358 | 0.0 | 764.0 | 1316.0 | 2270.0 | 6890.0 |
| Hillshade_9am | 15120.0 | 212.704299 | 30.561287 | 0.0 | 196.0 | 220.0 | 235.0 | 254.0 |
| Hillshade_Noon | 15120.0 | 218.965608 | 22.801966 | 99.0 | 207.0 | 223.0 | 235.0 | 254.0 |
| Hillshade_3pm | 15120.0 | 135.091997 | 45.895189 | 0.0 | 106.0 | 138.0 | 167.0 | 248.0 |


In [16]:
df.loc['Aspect', 'mean']

156.67665343915343

**'Id', 'Soil_Type7', and 'Soil_Type15' carry no meaningful information according to the summary statistics, so they are dropped from the datasets.**
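One way to spot such columns programmatically is to look for a standard deviation of zero in the transposed `describe()` output. The sketch below runs on a small hypothetical frame (`toy`, with made-up values) rather than the real training data, and it would not flag `Id`, which is uninformative for a different reason (it is only a row identifier).

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Elevation":  [2596.0, 2590.0, 2804.0, 2785.0],
    "Soil_Type7": [0.0, 0.0, 0.0, 0.0],   # constant column: no information
    "Soil_Type8": [0.0, 1.0, 0.0, 0.0],
})

# Same layout as the summary above: one row per column, one column per statistic.
summary = toy.describe().T

# A column whose standard deviation is zero takes a single value everywhere.
constant_cols = summary.index[summary["std"] == 0].tolist()
print(constant_cols)
```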

In [4]:
# Exclude columns with no useful data for prediction
new_headers = [n for n in all_headers if n not in ['Id', 'Soil_Type7', 'Soil_Type15']]
new_train = train_csv[new_headers]

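# Convert from structured arrays to 2D arrays, keeping the same feature columns
# in the unlabeled data as in the training data (new_headers minus the label)
unlabeled_data = np.array(unlabeled_csv[new_headers[:-1]].tolist())
new_train = np.array(new_train.tolist())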

In [20]:
# Rescale a bounded continuous feature to [-1, 1] using its observed min and max
# (df holds the transposed summary statistics computed above)
aspect_min = df.loc['Aspect', 'min']
aspect_max = df.loc['Aspect', 'max']
aspect_rescale = (2 * train_csv['Aspect'] - aspect_max - aspect_min) / (aspect_max - aspect_min)

aspect_rescale

array([-0.71666667, -0.68888889, -0.22777778, ..., -0.25555556,
       -0.07222222,  0.09444444])
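The same mapping could be wrapped in a small helper so it can be reused for the other bounded features (for example the hillshade indices). This is a minimal sketch; the function name `rescale_symmetric` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rescale_symmetric(x, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1].

    Equivalent to 2 * (x - lo) / (hi - lo) - 1.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ to rescale")
    x = np.asarray(x, dtype=float)
    return (2 * x - hi - lo) / (hi - lo)

# Illustrative Aspect values in degrees, using the bounds 0 and 360.
aspect = np.array([0.0, 51.0, 180.0, 360.0])
scaled = rescale_symmetric(aspect, lo=0.0, hi=360.0)
print(scaled)
```

If a dev or test set is rescaled as well, reusing the `lo` and `hi` taken from the training data keeps the transformation consistent across splits.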

In [5]:
# Train, dev, test split (80/10/10)
split1 = int(len(train_csv)* 0.80)
split2 = int(split1 + (len(train_csv) - split1) / 2)

train_data, train_labels = new_train[:split1,:-1], new_train[:split1,-1]
dev_data, dev_labels     = new_train[split1:split2,:-1], new_train[split1:split2,-1]
test_data, test_labels   = new_train[split2:,:-1], new_train[split2:,-1]

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         label_name)
print('number of features:',   train_data.shape[1])
print('feature names:\n',   new_headers[:-1])

training label shape: (12096,)
dev label shape: (1512,)
test label shape: (1512,)
labels names: {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
number of features: 52
feature names:
 ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13', 'Soil_Type14', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soi
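The split above slices rows in file order. If the file happened to be ordered in some way (not verified here), a shuffled, stratified split would keep the class proportions similar across the three sets. The sketch below shows that alternative on synthetic data using `train_test_split`; it is not what the notebook ran, and the array names and sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the 7 cover-type labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(1, 8, size=1000)

# First carve off 20% of the rows, then halve that part into dev and test,
# stratifying on the labels each time to preserve class proportions.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

print(X_train.shape, X_dev.shape, X_test.shape)
```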

## KNN Baseline

In [6]:
# k-NN

# search for an optimal value of K for KNN
k_range = [1,4,7,10]

# list of scores from k_range
k_scores = []

def KNN(k):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(train_data, train_labels)
    model_pred = model.predict(dev_data)
    return model_pred

for k in k_range:
    score = metrics.f1_score(dev_labels, KNN(k), average="weighted")
    k_scores.append(score)
    print("The f1 score for {}-NN is {}".format(k, score))

The f1 score for 1-NN is 0.8801364177041876
The f1 score for 4-NN is 0.8434970753968075
The f1 score for 7-NN is 0.824328016091458
The f1 score for 10-NN is 0.8081237472777235


## NB Baseline

Ideas from:  
https://stackoverflow.com/questions/14254203/mixing-categorial-and-continuous-data-in-naive-bayes-classifier-using-scikit-lea

https://stackoverflow.com/questions/14274771/choosing-classification-algorithm-to-classify-mix-of-nominal-and-numeric-data?rq=1

https://stackoverflow.com/questions/32707914/different-types-of-features-to-train-naive-bayes-in-python-pandas

Handling discrete and continuous inputs together:  
https://www.quora.com/Machine-Learning/What-are-good-ways-to-handle-discrete-and-continuous-inputs-together/answer/Arun-Iyer-1

In [7]:
# Evaluate binary features and continuous features independently (naive Bayes treats features as conditionally independent given the class, so correlations between them are ignored anyway)

# BernoulliNB (Binary features only)
train_data_binary = train_data[:,10:]
dev_data_binary = dev_data[:,10:]

bnb = BernoulliNB(alpha=1)
bnb.fit(train_data_binary, train_labels)
acc_bnb = bnb.score(dev_data_binary, dev_labels)
print("BernoulliNB model accuracy using binary features only:", acc_bnb)

BernoulliNB model accuracy using binary features only: 0.5734126984126984


In [8]:

# GaussianNB (Continuous features only)
train_data_cont = train_data[:,:10]
dev_data_cont = dev_data[:,:10]

gnb = GaussianNB()
gnb.fit(train_data_cont, train_labels)
acc_gnb = gnb.score(dev_data_cont, dev_labels)
print("GaussianNB model accuracy using continuous features only:", acc_gnb)

GaussianNB model accuracy using continuous features only: 0.6064814814814815


In [9]:
gnb.predict_proba(dev_data_cont)

array([[3.63502140e-04, 7.40240341e-03, 1.09878853e-01, ...,
        8.16179008e-01, 6.61456708e-02, 3.67283001e-10],
       [9.10726648e-02, 2.88693566e-01, 1.08801543e-03, ...,
        6.14535176e-01, 4.60216242e-03, 8.41433746e-06],
       [4.13403388e-03, 2.13667357e-02, 3.25880440e-03, ...,
        9.63487591e-01, 7.75229297e-03, 5.33288634e-07],
       ...,
       [7.18784140e-01, 2.22141970e-01, 2.63096061e-04, ...,
        1.36792439e-03, 8.35873827e-04, 5.66069958e-02],
       [1.85528181e-02, 1.69590531e-01, 6.94464292e-02, ...,
        6.95364192e-01, 4.70452772e-02, 7.37804812e-07],
       [2.95569242e-02, 1.46795981e-01, 1.76030173e-02, ...,
        7.77047432e-01, 2.89957779e-02, 8.21839297e-07]])

**Improvement ideas:**

Preprocessing:
1. Rescale bounded features and standardize the remaining continuous features
2. Feature selection
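Item 1 could be sketched as a `Pipeline` that standardizes only the continuous columns and passes the binary columns through unchanged before fitting k-NN. This runs on synthetic data; `StandardScaler` and `ColumnTransformer` are one possible choice rather than something the notebook has evaluated, and the column counts are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 continuous columns on very different scales + 2 binary columns.
rng = np.random.default_rng(0)
n = 300
X_cont = rng.normal(loc=[2700.0, 150.0, 16.0], scale=[400.0, 110.0, 8.0], size=(n, 3))
X_bin = rng.integers(0, 2, size=(n, 2)).astype(float)
X = np.hstack([X_cont, X_bin])
y = rng.integers(1, 4, size=n)

# Standardize columns 0-2 only; binary indicator columns are left as they are.
scale_continuous = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1, 2])],
    remainder="passthrough",
)

model = Pipeline([
    ("prep", scale_continuous),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])

# The scaler statistics are learned from the training rows only.
model.fit(X[:240], y[:240])
pred = model.predict(X[240:])
print(pred[:10])
```

Because k-NN relies on distances, leaving features such as elevation (thousands of meters) unscaled lets them dominate the neighbor search, which is the motivation for trying this step.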

Modeling:
1. Build a new feature vector from the class-assignment probabilities of the two models (for example, `np.hstack` of the BernoulliNB and GaussianNB `predict_proba` outputs), then fit another GaussianNB on these new features
2. Write a function that merges the two models' probabilities (the product of the Bernoulli and Gaussian terms); this is consistent with naive Bayes, which evaluates each feature's likelihood independently given the class
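Both modeling ideas can be sketched on synthetic data. For idea 2, multiplying the two posteriors counts the class prior twice, so the sketch subtracts one log prior before renormalizing: log P(y | x_bin, x_cont) = log P(y | x_bin) + log P(y | x_cont) - log P(y) + const. The data, the split, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic data: 3 hypothetical classes, 4 continuous and 5 binary features.
rng = np.random.default_rng(0)
n = 600
y = rng.integers(1, 4, size=n)
X_cont = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 4))
X_bin = (rng.random((n, 5)) < (y[:, None] / 4.0)).astype(float)

train, dev = slice(0, 480), slice(480, None)
bnb = BernoulliNB(alpha=1).fit(X_bin[train], y[train])
gnb = GaussianNB().fit(X_cont[train], y[train])

# Idea 2: merge the two models in log space, removing the doubled prior.
log_post = (bnb.predict_log_proba(X_bin[dev])
            + gnb.predict_log_proba(X_cont[dev])
            - bnb.class_log_prior_)
log_post -= logsumexp(log_post, axis=1, keepdims=True)   # rows sum to 1 again
combined_pred = bnb.classes_[np.argmax(log_post, axis=1)]

# Idea 1: stack both probability outputs and fit a second GaussianNB on them.
stack_train = np.hstack([bnb.predict_proba(X_bin[train]),
                         gnb.predict_proba(X_cont[train])])
stack_dev = np.hstack([bnb.predict_proba(X_bin[dev]),
                       gnb.predict_proba(X_cont[dev])])
stacked = GaussianNB().fit(stack_train, y[train])
stacked_pred = stacked.predict(stack_dev)

print("combined accuracy:", np.mean(combined_pred == y[dev]))
print("stacked accuracy:", np.mean(stacked_pred == y[dev]))
```

In the stacking variant the second model is trained on probabilities that the first-level models produced for their own training rows, which tends to be optimistic; out-of-fold predictions would be a more careful choice.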