# Forest Cover Type Prediction

Submitted by:
Juanjo Carin, Vamsi Sakhamuri, Tuhin Mahmud

Date: July 16, 2015

![Linear Combinations](/image/front_page.png)

Kaggle Competition hosted at https://www.kaggle.com/c/forest-cover-type-prediction


## Description
### Use cartographic variables to classify forest categories
Random forests? Cover trees? Not so fast, computer nerds. We're talking about the real thing.

In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) 
from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type 
for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information 
System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. 
The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables 
such as wilderness areas and soil type.

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. 
These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more
a result of ecological processes than of forest management practices.
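The description mentions that qualitative variables such as wilderness area and soil type arrive as binary indicator columns. As a minimal sketch of what that layout means, the snippet below collapses a block of such 0/1 columns back into a single integer code per row. The helper name `one_hot_to_index` and the toy array are assumptions made for illustration; they are not part of the competition data or of the original notebook.

```python
import numpy as np

def one_hot_to_index(block):
    """Collapse a block of 0/1 indicator columns into one integer code per row."""
    block = np.asarray(block)
    # Each row is expected to have exactly one active indicator.
    if not np.all(block.sum(axis=1) == 1):
        raise ValueError("each row must have exactly one column set to 1")
    # argmax returns the position of the single 1 in every row.
    return block.argmax(axis=1)

# Toy example: 3 cells described by 4 hypothetical indicator columns.
wilderness_block = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
])
wilderness_code = one_hot_to_index(wilderness_block)
print(wilderness_code)  # [0 2 1]
```

A single code column like this can be handy for quick inspection or plotting, while the models below can consume the indicator columns directly.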


In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV  # formerly sklearn.grid_search, removed in scikit-learn 0.20

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

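# For producing decision tree diagrams.
from IPython.display import Image, display
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
# pydot is an optional extra; without it this cell fails with
# "ImportError: No module named pydot".
import pydot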

Let's talk about how to enter a Kaggle competition.
First, you sign up for Kaggle.

https://www.kaggle.com

Then you find a competition. We'll use the Forest Cover Type Prediction competition:

https://www.kaggle.com/c/forest-cover-type-prediction

We now have a couple of CSV files, test.csv and train.csv.

Find where they live on your file system. Now we need to load the data:

In [6]:
ff = "train.csv"  # edit this path to point at your copy
with open(ff) as f:
    column_names = f.readline()  # header row; ordinarily you would need it
    data = np.loadtxt(f, delimiter=",")

# The last column is the label (Cover_Type); everything before it is kept
# as a feature. Note that column 0 is the Id, which stays in X here.
y, X = data[:, -1], data[:, :-1]

ff_test = "test.csv"  # edit this path to point at your copy
with open(ff_test) as f_test:
    column_names_test = f_test.readline()  # header row
    data_test = np.loadtxt(f_test, delimiter=",")

# Note there are no labels here!
X_test = data_test

print(X_test.shape)
print(X.shape)
print(y.shape)

(565892L, 55L)
(15120L, 55L)
(15120L,)
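The shapes above show 55 columns in both `X` and `X_test`, which means the `Id` column is still sitting among the features. The following is a hedged sketch of how the identifier, the features and the label could be separated, assuming the `Id` is the first column and, for training data, `Cover_Type` is the last one. The helper `split_columns` and the toy array are hypothetical and only illustrate the slicing.

```python
import numpy as np

def split_columns(data, has_label):
    """Return (ids, features, labels); labels is None when has_label is False."""
    data = np.asarray(data)
    ids = data[:, 0].astype(int)            # assumed: first column is the Id
    if has_label:
        labels = data[:, -1].astype(int)    # assumed: last column is the label
        return ids, data[:, 1:-1], labels
    return ids, data[:, 1:], None

# Toy rows: Id, two made-up features, label.
toy_train = np.array([
    [1.0, 10.0, 0.5, 5.0],
    [2.0, 20.0, 0.1, 5.0],
    [3.0, 30.0, 0.9, 2.0],
])
toy_ids, toy_X, toy_y = split_columns(toy_train, has_label=True)
print(toy_ids, toy_X.shape, toy_y)
```

Dropping the identifier matters most for distance-based models such as KNN, where a row number would otherwise contribute to the distance between two cells.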


In [7]:
print(y)

[ 5.  5.  2. ...,  3.  3.  3.]
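Printing the label vector only shows its first and last few entries. A quick way to see how many examples each cover type has is to count the distinct values, sketched here on a small made-up label array (`y_toy` is an illustrative stand-in, not the real `y`).

```python
import numpy as np

# Made-up labels standing in for the real y.
y_toy = np.array([5., 5., 2., 1., 2., 3., 3., 3.])

# np.unique with return_counts=True gives each distinct label and its frequency.
classes, counts = np.unique(y_toy, return_counts=True)
for label, count in zip(classes, counts):
    print(int(label), int(count))
```

Running the same two lines on the real `y` would show whether the classes are balanced, which is worth knowing before reading too much into an accuracy score.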


Now we train a model. Let's just do KNN; no time to think!

In [8]:
# To make this faster, just use the first 1000 rows.
y_train, X_train = y[:1000], X[:1000, :]

kn = KNeighborsClassifier(n_neighbors=1)

kn.fit(X_train, y_train)

# Here's what we need to send back to Kaggle.
preds = kn.predict(X_test)
print(preds)

[ 1.  1.  1. ...,  1.  1.  1.]
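Because the data is unscaled and KNN relies on distances, one possible refinement is to standardize the features and to estimate accuracy on a held-out split before uploading anything. The sketch below does this on synthetic data only; `X_demo`, `y_demo`, the subset size and the split ratio are all assumptions chosen for illustration, and it draws a random subset rather than the first rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in: 600 rows, 5 features on very different scales, 3 classes.
X_demo = rng.normal(size=(600, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 1.0])
y_demo = np.digitize(X_demo[:, 0], [-0.5, 0.5]) + 1  # labels 1..3 driven by feature 0

# Draw a random subset instead of taking the first rows.
subset = rng.permutation(len(y_demo))[:400]
X_sub, y_sub = X_demo[subset], y_demo[subset]

# Hold out a quarter of the subset for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_sub, y_sub, test_size=0.25, random_state=0)

# The scaler is fitted on the training part only, then applied before KNN.
model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)
print("hold-out accuracy: %.3f" % val_accuracy)
```

Wrapping the scaler and the classifier in one `Pipeline` keeps the validation rows out of the scaling statistics, so the hold-out score is a fairer preview of the leaderboard score.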


## Use Decision Tree

In [17]:
from sklearn.tree import DecisionTreeClassifier

# Fit on the same 1000-row subset used for KNN.
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

preds_dec = dt.predict(X_test)
print(preds_dec)

[ 2.  2.  2. ...,  2.  2.  2.]
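The tree above is grown with default settings, so its depth is unconstrained. A hedged sketch of how `GridSearchCV` could be used to pick a depth by cross-validation follows; the synthetic data, the candidate depths and the 3-fold setting are assumptions for illustration, not values taken from the original experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in: 300 rows, 4 features, 3 noisy classes driven by feature 0.
X_demo = rng.uniform(size=(300, 4))
noisy_signal = X_demo[:, 0] + 0.1 * rng.normal(size=300)
y_demo = np.digitize(noisy_signal, [0.33, 0.66]) + 1

# Try a few maximum depths and keep the one with the best cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a tree refitted on all the rows passed to `fit`, and it can be used for prediction in the same way as `dt`.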


## Output results
Now we need to save the output to a text file and upload the results to Kaggle!

Read the data page to make sure your submission is in the right format:

https://www.kaggle.com/c/forest-cover-type-prediction/data

In [11]:
# This assumes the test-set Ids continue from the last training Id,
# so counting starts after the final training row.
idx = X.shape[0]

with open("test_labeled.csv", "w") as test_lab_f:  # edit this path as needed
    test_lab_f.write("Id,Cover_Type")
    for pp in preds_dec:
        idx += 1
        test_lab_f.write("\n")
        test_lab_f.write(str(idx) + "," + str(int(pp)))
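As an alternative to assembling each line by hand, the standard-library `csv` module can write the same two-column file. The sketch below is one way to do it under the assumption that the Ids are available as a sequence (for example, read from the first column of the test file). The helper name `write_submission`, the temporary path and the three sample rows are hypothetical.

```python
import csv
import os
import tempfile

def write_submission(path, ids, predictions):
    """Write an Id,Cover_Type file with one row per prediction."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["Id", "Cover_Type"])
        for row_id, pred in zip(ids, predictions):
            # Cast to int so labels are written as 2 rather than 2.0.
            writer.writerow([int(row_id), int(pred)])

# Toy usage with three made-up predictions, written to a temporary folder.
demo_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
write_submission(demo_path, ids=[15121, 15122, 15123], predictions=[2.0, 2.0, 1.0])

with open(demo_path) as handle:
    submission_text = handle.read()
print(submission_text)
```

Passing the Ids explicitly avoids relying on the test rows being numbered consecutively after the training rows.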

## Results
* July 11, 2015 - Simple KNN on the data: <a href="image/kaggle.result-07-11-2015.png">July 11 results, score: 0.4236</a>
* July 18, 2015 - Decision Tree on the data: <a href="image/kaggle.result-07-18-2015.png">July 18 results, score: 0.53197</a>






## Source Code
https://github.com/juanjocarin/W207-final-project


## Acknowledgements

Kaggle is hosting this competition for the machine learning community to use for fun and practice. 
This dataset was provided by Jock A. Blackard and Colorado State University. We also thank the UCI machine 
learning repository for hosting the dataset. If you use the problem in a publication, please cite:

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California,
    School of Information and Computer Science.