# Forest Cover Type Prediction

Submitted by:
* Juanjo Carin
* Tuhin Mahmud
* Vamsi Sakhamuri

Date: July 16, 2015

Kaggle Competition hosted at https://www.kaggle.com/c/forest-cover-type-prediction

## Description

### Use cartographic variables to classify forest categories

Random forests? Cover trees? Not so fast, computer nerds. We're talking about the real thing.

In this competition we are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.

First we import the libraries we'll use throughout this project.

In [2]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

Next we load the training and test data sets.

In [3]:
ff = "train.csv" # edit this path to point to your copy of the data
f = open(ff)
column_names = f.readline() # consume the header row (column names) before np.loadtxt reads the numeric rows

data = np.loadtxt(f, delimiter=",")

y, X = data[:, -1], data[:, :-1]

ff_test = "test.csv" # edit this path to point to your copy of the data
f_test = open(ff_test)
column_names_test = f_test.readline() # consume the header row (column names) before np.loadtxt reads the numeric rows

data_test = np.loadtxt(f_test, delimiter=",")

# note there are no labels here!
X_test = data_test

print 'The test dataset contains {0} observations with {1} features each.'.format(X_test.shape[0], X_test.shape[1])
print '\t(The 1st one is not really a feature but an observation ID.)'
print 'The training dataset contains {0} observations with the same {1} features each.'.format(X.shape[0], X.shape[1])
print 'For this training set we know the corresponding category (forest cover type) of the {0} observations.'.format(y.shape[0])

The test dataset contains 565892 observations with 55 features each.
	(The 1st one is not really a feature but an observation ID.)
The training dataset contains 15120 observations with the same 55 features each.
For this training set we know the corresponding category (forest cover type) of the 15120 observations.


Note that the test set is about 37 times larger than the training set.

To evaluate our performance, we'll split the training set into two subsets: **training** data (80%) and **development** (aka **validation**) data (20%). **Test** data *must not* be used to validate our models, otherwise we might introduce bias: the more often we look at the error rate on the test set, the more we learn about the test data, and the more that knowledge (which is very specific to that test set) leaks into the way we solve the problem.

We also discard the 1st variable (ID), which does not provide any information about the forest cover type.

In [4]:
train_size = int(X.shape[0] * 0.8)
y_train, X_train = y[:train_size], X[:train_size, 1:]
y_dev, X_dev = y[train_size:], X[train_size:, 1:]
X_test = X_test[:, 1:]
print X_dev.shape, X_train.shape

(3024L, 54L) (12096L, 54L)
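
The slice-based split above takes the first 80% of rows in file order, which is only representative if `train.csv` is not sorted by cover type or area. That has not been checked here, so the following is a minimal sketch of shuffling the row indices with a fixed seed before splitting. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the competition data.

```python
import numpy as np

# Hypothetical stand-in data: 100 rows, an ID column plus 3 features.
rng = np.random.RandomState(0)
X_demo = np.hstack([np.arange(1, 101).reshape(-1, 1), rng.rand(100, 3)])
y_demo = rng.randint(1, 8, size=100)  # 7 cover types, labelled 1..7

# Shuffle row indices once, with a fixed seed so the split is reproducible.
shuffle = np.random.RandomState(42).permutation(X_demo.shape[0])
X_shuf, y_shuf = X_demo[shuffle], y_demo[shuffle]

# 80% training, 20% development; column 0 (the ID) is dropped.
split_at = int(X_shuf.shape[0] * 0.8)
X_tr, y_tr = X_shuf[:split_at, 1:], y_shuf[:split_at]
X_dv, y_dv = X_shuf[split_at:, 1:], y_shuf[split_at:]

print(X_tr.shape)
print(X_dv.shape)
```

Applying the same permutation to both `X_demo` and `y_demo` keeps each label attached to its row.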


The first 10 features of each observation (`Elevation` to `Horizontal_Distance_To_Fire_Points`) are continuous, with different ranges, while the remaining 44 are all binary.

We'll try `preprocessing.StandardScaler` (standardizes features by removing the mean and scaling to unit variance) as well as `preprocessing.MinMaxScaler` (scales each feature to a given range, e.g., [0,1]). We could also binarize those 10 features using `preprocessing.binarize`, but the appropriate thresholds are unknown.

http://scikit-learn.org/stable/modules/preprocessing.html
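
To make the two transforms concrete, here is a small sketch that applies the underlying formulas by hand with NumPy. The array `toy` is made-up data, not taken from the competition files. Min-max scaling computes `(x - min) / (max - min)` per column; standardization computes `(x - mean) / std` per column, using the population standard deviation.

```python
import numpy as np

# Hypothetical toy data: column 0 looks like an elevation in meters,
# column 1 like a slope in degrees, so the ranges differ a lot.
toy = np.array([[2000.0, 5.0],
                [2500.0, 10.0],
                [3000.0, 30.0]])

# Min-max scaling: map each column onto [0, 1].
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
toy_minmax = (toy - col_min) / (col_max - col_min)

# Standardization: zero mean and unit variance per column (ddof=0).
col_mean, col_std = toy.mean(axis=0), toy.std(axis=0)
toy_std = (toy - col_mean) / col_std

print(toy_minmax)
print(toy_std.mean(axis=0), toy_std.std(axis=0))
```

A column whose values are all identical would make `col_max - col_min` or `col_std` zero. The scikit-learn scalers guard against that case; this sketch does not.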

In [13]:
from sklearn import preprocessing

# Scale to range [0,1]
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_dev_minmax = min_max_scaler.transform(X_dev)
X_test_minmax = min_max_scaler.transform(X_test)

# Scale to mean = 0, sd = 1
std_scaler = preprocessing.StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_dev_std = std_scaler.transform(X_dev)
X_test_std = std_scaler.transform(X_test)

[ 0.41137966  0.98333333  0.26923077  0.          0.20857143  0.38955007
  0.77165354  0.74193548  0.62903226  0.95002154  1.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          1.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.        ]
[-0.18829693  1.80240932 -0.27947559 -1.09616471 -0.83263722  0.66460892
 -0.55909617 -0.21454597  0.46144063  4.45393267  1.62382295 -0.18155792
 -0.81629972 -0.64412912 -0.15365563 -0.2026165  -0.2528751  -0.23144494
 -0.10503856 -0.20829847  0.         -0.00909279 -0.02876462 -0.38973106
 -0.16056211 -0.1382948  -0.17985039 -0.10544009  0.         -0.08412399
 -0.19747591 -0.07060485 -

0.00047020884743512534

Now that we have our data prepared, we start by training a very simple kNN model.

In [17]:
kNN = KNeighborsClassifier(n_neighbors=1)

How well does our first model perform on the development data?

In [18]:
kNN.fit(X_train, y_train)
print 'Accuracy using k = 1 neighbor and non-scaled data: {0}'.format(kNN.score(X_dev, y_dev))
kNN.fit(X_train_std, y_train)
print 'Accuracy using k = 1 neighbor and standardized data: {0}'.format(kNN.score(X_dev_std, y_dev))
kNN.fit(X_train_minmax, y_train)
print 'Accuracy using k = 1 neighbor and scaled data: {0}'.format(kNN.score(X_dev_minmax, y_dev))

Accuracy using k = 1 neighbor and non-scaled data: 0.853505291005
Accuracy using k = 1 neighbor and standardized data: 0.79828042328
Accuracy using k = 1 neighbor and scaled data: 0.812169312169
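
The comparison above fixes k = 1 and varies only the preprocessing. A natural next step, sketched here under stated assumptions, is to search over k with cross-validation while keeping the scaler inside a `Pipeline`, so that the scaling parameters are re-estimated on each training fold. The data is synthetic (`make_classification` stands in for the forest cover features), the grid of k values is arbitrary, and the import uses `sklearn.model_selection`, where recent scikit-learn releases keep `GridSearchCV`; this notebook's own import from `sklearn.grid_search` reflects the older location.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (an assumption, not the real set).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Scaler and classifier chained so the scaler is fit on each training fold only.
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Arbitrary illustrative grid over the number of neighbors.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Since non-scaled data scored best at k = 1, the scaling step itself could also be treated as a choice to compare rather than a fixed part of the pipeline.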


In [None]:
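# Here's what we need to send back to Kaggle.
# The model was last fit on min-max scaled data, so refit it on the non-scaled
# training data (the best-scoring option above) before predicting on X_test.
kNN.fit(X_train, y_train)
preds = kNN.predict(X_test)
print preds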

Finally, we need to save the output to a text file and upload the results to Kaggle (see the competition page at https://www.kaggle.com/c/forest-cover-type-prediction for the expected submission format).

In [None]:
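# Write the submission file: a header plus one "Id,Cover_Type" row per test observation.
# IDs are read from the first column of data_test instead of being assumed
# to continue from the last training ID.
with open("test_labeled.csv", "w") as test_lab_f: # edit this path as needed
    test_lab_f.write("Id,Cover_Type")
    for obs_id, pp in zip(data_test[:, 0], preds):
        test_lab_f.write("\n" + str(int(obs_id)) + "," + str(int(pp)))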
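
As a complementary sketch, the same kind of file can be written with the standard-library `csv` module and read back as a sanity check before uploading. The IDs and predictions (`demo_ids`, `demo_preds`) are made-up placeholders, an in-memory buffer stands in for the file on disk, and the snippet assumes Python 3.

```python
import csv
import io

# Hypothetical placeholders for test IDs and predicted cover types.
demo_ids = [15121, 15122, 15123]
demo_preds = [2.0, 1.0, 5.0]  # kNN returns floats when y was loaded as float

# Write the header plus one "Id,Cover_Type" row per observation.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["Id", "Cover_Type"])
for obs_id, pred in zip(demo_ids, demo_preds):
    writer.writerow([int(obs_id), int(pred)])

# Read the content back and check its shape before uploading.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows[0])
print(len(rows) - 1)
```

Checking that the number of data rows matches the number of test observations catches an off-by-one in the ID column before the file is submitted.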