# Forest CoverType dataset


Group Members
Mohd Sadique A20380442

Description
===========
Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types). 

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so existing forest cover types are more a result of ecological processes than of forest management practices.

Some background information for these four wilderness areas: Neota (area 2) probably has the highest mean elevational value of the 4 wilderness areas. Rawah (area 1) and Comanche Peak (area 3) would have a lower mean elevational value, while Cache la Poudre (area 4) would have the lowest mean elevational value. 

As for primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have Ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4). 

The Rawah and Comanche Peak areas would tend to be more typical of the overall dataset than either Neota or Cache la Poudre, due to their assortment of tree species and range of predictive variable values (elevation, etc.). Cache la Poudre would probably be the most distinctive of the four, due to its relatively low elevation range and species composition.



In [1]:
import numpy as np
import urllib
from sklearn import preprocessing
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from numpy import loadtxt, where
from pylab import scatter, show, legend, xlabel, ylabel
from matplotlib.colors import ListedColormap
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


In [None]:
#The Clean Version of the input File is being used for testing
input_file = "testdata.csv"
df = pd.read_csv(input_file)


In [None]:
# The Data Shape of the input File 
df.shape

In [None]:
# The data used for testing has 500049 tuples and 11 columns: 10 features plus the CoverType target

# Data Table

In [39]:
df.head(10)

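| | Elevation | Aspect Slope | Slope | Horizontal Dist. | Vertical Distance | Hor. Dist. To road | HillShade_9am | HillShade_Noon | HillShade_3pm | Hor. Dist. To Fire | CoverType |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | 5 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | 5 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | 2 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | 2 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | 5 |
| 5 | 2579 | 132 | 6 | 300 | -15 | 67 | 230 | 237 | 140 | 6031 | 2 |
| 6 | 2606 | 45 | 7 | 270 | 5 | 633 | 222 | 225 | 138 | 6256 | 5 |
| 7 | 2605 | 49 | 4 | 234 | 7 | 573 | 222 | 230 | 144 | 6228 | 5 |
| 8 | 2617 | 45 | 9 | 240 | 56 | 666 | 223 | 221 | 133 | 6244 | 5 |
| 9 | 2612 | 59 | 10 | 247 | 11 | 636 | 228 | 219 | 124 | 6230 | 5 |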


In [43]:
# Counting the Total Number of Target values
totalTarget = df["CoverType"].count()

# Count of each cover type in the target (classes 1 through 7),
# sorted by class label rather than by frequency
classCounts = df["CoverType"].value_counts().sort_index()

# Probabilities of the counts: each class count divided by the total
classProbs = classCounts / totalTarget

for coverType, prob in classProbs.items():
    print("Probability of %ds in the Target " % int(coverType), '%0.3f' % prob)


Probability of 1s in the Target  0.203
Probability of 2s in the Target  0.576
Probability of 3s in the Target  0.043
Probability of 4s in the Target  0.043
Probability of 5s in the Target  0.048
Probability of 6s in the Target  0.043
Probability of 7s in the Target  0.043


Since class 2 is the most frequent cover type in the target (probability 0.576), we'll use 0.576 as our baseline accuracy: the score a model would get by always predicting class 2.
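The majority-class baseline can also be checked with scikit-learn's `DummyClassifier`. The sketch below is illustrative only: the small `y` array is a hypothetical stand-in for `df["CoverType"]`, not the real data, and it simply shows that the share of the most frequent class equals the accuracy of always predicting that class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy target standing in for df["CoverType"]; class 2 is the majority
y = np.array([2, 2, 2, 1, 2, 5, 2, 1, 2, 3])
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Share of the most frequent class = accuracy of always predicting it
values, counts = np.unique(y, return_counts=True)
majority_class = values[counts.argmax()]
baseline = counts.max() / counts.sum()

# DummyClassifier with strategy="most_frequent" reproduces the same number
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
dummy_accuracy = dummy.score(X, y)

print("Majority class:", majority_class)
print("Baseline accuracy: %0.3f" % baseline)
print("DummyClassifier accuracy: %0.3f" % dummy_accuracy)
```

Any real model should beat this number before its accuracy is considered meaningful.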

# Splitting Train and Test Data

In [48]:
array = df.values
# Features
X = list(array[:,0:10])
# Target
Y = list(array[:,10])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70, random_state=0)
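
The description notes that the data is in raw, unscaled form, and models such as logistic regression, MLPs and SVMs are often sensitive to feature scale. The following is a minimal sketch of one possible way to add scaling, not something this notebook does: it wraps `StandardScaler` and a classifier in a pipeline so the scaler is refit inside each cross-validation fold. The data is synthetic (`make_classification`), and `X_demo`, `y_demo` and `pipe` are names introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cover type features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
# Inflate one column so the features sit on very different scales,
# loosely mimicking raw elevation next to slope
X_demo[:, 0] *= 1000

# The scaler is fit on the training part of each fold only,
# so no statistics from the held-out part leak into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print("Accuracy with scaling: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))
```

Whether scaling would change the results reported in this notebook is not tested here.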

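**As the results below show, Logistic Regression accuracy is best at C = 2, with a cross-validated accuracy of 0.75447. The differences between C values of 0.01 and above are smaller than the reported spread.**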

In [50]:
# Calculating accuracy of LOGISTIC REGRESSION using CROSS_VAL_SCORE
C= [8,4,2,1,0.1,0.01,0.001,0.002,0.0001,0.00001]
for i in C: 
    clf = LogisticRegression(C = i, max_iter = 1000, random_state=0)
    scoresLR = cross_val_score(clf,X_train,Y_train,cv=10,scoring='accuracy')
    print("Accuracy with C = ",i,": %0.5f(+/- %0.5f)" %(scoresLR.mean(), scoresLR.std()*2))

Accuracy with C =  8 : 0.75395(+/- 0.01193)
Accuracy with C =  4 : 0.75430(+/- 0.01094)
Accuracy with C =  2 : 0.75447(+/- 0.01213)
Accuracy with C =  1 : 0.75441(+/- 0.01206)
Accuracy with C =  0.1 : 0.75421(+/- 0.01083)
Accuracy with C =  0.01 : 0.75398(+/- 0.01122)
Accuracy with C =  0.001 : 0.75213(+/- 0.01149)
Accuracy with C =  0.002 : 0.75310(+/- 0.01159)
Accuracy with C =  0.0001 : 0.74616(+/- 0.00876)
Accuracy with C =  1e-05 : 0.73146(+/- 0.01151)


**Best Accuracy with Decision Tree Classifier is 0.89353**

In [51]:
from sklearn import tree
dtree = tree.DecisionTreeClassifier()
scoresDT = cross_val_score(dtree,X_train,Y_train,cv=10,scoring='accuracy')
print("Accuracy with Decision Tree: %0.5f (+/- %0.5f)" % (scoresDT.mean(), scoresDT.std() * 2))

Accuracy with Decision Tree: 0.89353 (+/- 0.00977)


**Best Accuracy with Naive Bayes is 0.69869**

In [52]:
from sklearn.naive_bayes import GaussianNB
gauss = GaussianNB()
scores = cross_val_score(gauss,X_train,Y_train,cv=10,scoring='accuracy')
print("Accuracy with Gaussian Naive Bayes: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))

Accuracy with Gaussian Naive Bayes: 0.69869 (+/- 0.01266)


**Best Accuracy with MLP Classifier is 0.80208, obtained when learning_rate_init = 0.006**

In [None]:
from sklearn.neural_network import MLPClassifier
L = [1,0.1,0.01,0.001,0.002,0.003,0.004,0.005,0.006,0.0001,0.00001, 0.000001]
for l in L: 
    MLPC = MLPClassifier(random_state=0, learning_rate='adaptive',max_iter=10000, learning_rate_init=l)
    scoresMLPC = cross_val_score(MLPC,X_train,Y_train,cv=10,scoring='accuracy')
    print("Accuracy with MLPC Classifier with",l,": %0.5f (+/- %0.5f)" % (scoresMLPC.mean(), scoresMLPC.std() * 2))

Accuracy with MLPC Classifier with 1 : 0.57598 (+/- 0.00067)
Accuracy with MLPC Classifier with 0.1 : 0.57598 (+/- 0.00067)
Accuracy with MLPC Classifier with 0.01 : 0.77824 (+/- 0.05321)
Accuracy with MLPC Classifier with 0.001 : 0.71244 (+/- 0.14857)
Accuracy with MLPC Classifier with 0.002 : 0.72115 (+/- 0.05858)
Accuracy with MLPC Classifier with 0.003 : 0.69509 (+/- 0.18500)
Accuracy with MLPC Classifier with 0.004 : 0.69856 (+/- 0.13267)
Accuracy with MLPC Classifier with 0.005 : 0.76531 (+/- 0.06178)
Accuracy with MLPC Classifier with 0.006 : 0.80208 (+/- 0.01917)
Accuracy with MLPC Classifier with 0.0001 : 0.75903 (+/- 0.02978)
Accuracy with MLPC Classifier with 1e-05 : 0.76954 (+/- 0.02184)


**Best Accuracy with Support Vector Machines is 0.86523**

In [None]:
from sklearn import svm
from sklearn.model_selection import cross_val_score
clf = svm.SVC()
scores = cross_val_score(clf,X_train,Y_train,cv=10,scoring='accuracy')
print("Accuracy with Support Vector Machines: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))

**Accuracy of the various models in order of increasing accuracy (best cross-validated score reported above for each):**

1) Naive Bayes : 0.69869

2) Logistic Regression : 0.75447

3) MLP Classifier : 0.80208

4) Support Vector Machines : 0.86523

5) Decision Tree Classifier : 0.89353
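
A ranking like the one above can be produced programmatically rather than copied by hand, which avoids transcription errors. The sketch below assumes synthetic data from `make_classification` and only three of the models for speed; `X_demo`, `y_demo`, `models` and `ranking` are illustrative names, and the printed scores will not match the cover type results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean cross-validated accuracy per model
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Sort by accuracy, lowest first, and print a numbered list
ranking = sorted(results.items(), key=lambda item: item[1])
for rank, (name, acc) in enumerate(ranking, start=1):
    print("%d) %s : %0.5f" % (rank, name, acc))
```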

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Fit on the training split and evaluate once on the held-out test split
clf = LogisticRegression()
clf.fit(X_train, Y_train)
predict = clf.predict(X_test)

a = accuracy_score(Y_test, predict)                  # single scalar
p = precision_score(Y_test, predict, average=None)   # one value per class
r = recall_score(Y_test, predict, average=None)      # one value per class

# Accuracy is a single number, so it has no spread to report.
# For precision and recall, the +/- term is twice the standard deviation across classes.
print("Accuracy Score : %0.5f" % a)
print("Precision Score : %0.5f(+/- %0.5f)" % (p.mean(), p.std()*2))
print("Recall Score : %0.5f(+/- %0.5f)" % (r.mean(), r.std()*2))
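
Because the classes are heavily imbalanced, averaged precision and recall can hide poor performance on the rare cover types. As a hedged illustration, the sketch below prints a confusion matrix and a per-class report with scikit-learn's `confusion_matrix` and `classification_report`. It runs on synthetic data, and `X_demo`, `y_demo`, `model` and `cm` are names introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in for the cover type data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.70, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)
print(cm)

# Precision, recall and F1 for each class separately
print(classification_report(y_te, pred, zero_division=0))
```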