# Theoretical Questions

### Q1: 
What issues could arise if the distribution of the test data differs significantly from the distribution of the training data?

### Q2:

What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? 

In which case would each error metric be appropriate?
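
As a starting point for this question (a sketch, not a full answer): when the prediction is a single constant, squared error is minimized by the mean and absolute error by the median, so one outlier moves the squared-error fit much further. The data values and the helper names `sse` and `sae` below are made up for illustration.

```python
from statistics import mean, median

# Hypothetical targets: four typical values and one large outlier.
y = [10.0, 11.0, 12.0, 13.0, 100.0]

def sse(c, values):
    """Sum of squared errors of the constant prediction c."""
    return sum((v - c) ** 2 for v in values)

def sae(c, values):
    """Sum of absolute errors of the constant prediction c."""
    return sum(abs(v - c) for v in values)

# Search a grid of candidate constants (0.0, 0.1, ..., 100.0) for each loss.
candidates = [i / 10 for i in range(0, 1001)]
best_sq = min(candidates, key=lambda c: sse(c, y))
best_abs = min(candidates, key=lambda c: sae(c, y))

print(best_sq, mean(y))     # the squared-error minimizer matches the mean
print(best_abs, median(y))  # the absolute-error minimizer matches the median
```

The outlier at 100 pulls the squared-error minimizer far above the bulk of the data, while the absolute-error minimizer stays at the middle value.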

### Q3:

What error metric would you use to evaluate how good a binary classifier is?

What if the classes are imbalanced?

What if there are more than two classes?
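
One way to see why class imbalance matters is to score a trivial classifier that always predicts the majority class. The sketch below uses made-up labels (95 negatives, 5 positives) and computes the metrics by hand; it is only an illustration, not a recommendation of a specific metric.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0       # true positive rate
specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, balanced_accuracy)
```

Accuracy looks high even though no positive example is ever found, while recall and balanced accuracy expose the problem.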

### Q4: 

Why might it be preferable to include fewer predictors rather than many?


# Programming Homework

### Question: Forest Cover Type Prediction

###### Focus: Multi-Class Classification

In the second part of this week's programming homework, you are tasked with developing a classifier that predicts forest cover type.

An interesting aspect of this dataset is that all of the predictors are cartographic variables rather than remotely sensed data.

Data is available at:  https://www.kaggle.com/c/forest-cover-type-prediction

* train.csv : the file provides both the predictors and the response variable (Cover_Type)
* test.csv  : the file provides the predictors only
* sampleSubmission.csv: the file shows the submission format we expect

Your task:
* Understand and visualize the data collected by the US Forest Service
* Build a classifier which can accurately classify the forest cover

Hints:
* Always start with a data walk (i.e. visualization, data exploration) 
* Consider standardizing the predictors
* Always consider a simple model
* Consider feature engineering
* Always use Cross Validation to conduct model selection
* pandas (a Python package) has functions to deal with timestamps effectively





In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
import xgboost as xgb

ImportError: No module named 'xgboost'

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
samp_sub = pd.read_csv('sampleSubmission.csv')

In [3]:
train

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,1,2596,51,3,258,0,510,221,232,148,...,0,0,0,0,0,0,0,0,0,5
1,2,2590,56,2,212,-6,390,220,235,151,...,0,0,0,0,0,0,0,0,0,5
2,3,2804,139,9,268,65,3180,234,238,135,...,0,0,0,0,0,0,0,0,0,2
3,4,2785,155,18,242,118,3090,238,238,122,...,0,0,0,0,0,0,0,0,0,2
4,5,2595,45,2,153,-1,391,220,234,150,...,0,0,0,0,0,0,0,0,0,5
5,6,2579,132,6,300,-15,67,230,237,140,...,0,0,0,0,0,0,0,0,0,2
6,7,2606,45,7,270,5,633,222,225,138,...,0,0,0,0,0,0,0,0,0,5
7,8,2605,49,4,234,7,573,222,230,144,...,0,0,0,0,0,0,0,0,0,5
8,9,2617,45,9,240,56,666,223,221,133,...,0,0,0,0,0,0,0,0,0,5
9,10,2612,59,10,247,11,636,228,219,124,...,0,0,0,0,0,0,0,0,0,5


In [4]:
test

Unnamed: 0,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type31,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40
0,15121,2680,354,14,0,0,2684,196,214,156,...,0,0,0,0,0,0,0,0,0,0
1,15122,2683,0,13,0,0,2654,201,216,152,...,0,0,0,0,0,0,0,0,0,0
2,15123,2713,16,15,0,0,2980,206,208,137,...,0,0,0,0,0,0,0,0,0,0
3,15124,2709,24,17,0,0,2950,208,201,125,...,0,0,0,0,0,0,0,0,0,0
4,15125,2706,29,19,0,0,2920,210,195,115,...,0,0,0,0,0,0,0,0,0,0
5,15126,2699,21,18,30,3,2890,206,200,127,...,0,0,0,0,0,0,0,0,0,0
6,15127,2699,15,17,30,6,2860,202,202,133,...,0,0,0,0,0,0,0,0,0,0
7,15128,2696,10,16,30,6,2830,202,207,140,...,0,0,0,0,0,0,0,0,0,0
8,15129,2696,17,13,30,6,2770,208,211,138,...,0,0,0,0,0,0,0,0,0,0
9,15130,2693,14,15,30,7,2741,205,209,138,...,0,0,0,0,0,0,0,0,0,0


In [None]:
samp_sub

In [8]:
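# Separate predictors from the response; note that 'Id' (a row identifier) is still among the predictors
X_train = train.drop('Cover_Type', axis=1)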
Y_train = train['Cover_Type']
X_test  = test.copy()
X_train.shape, Y_train.shape, X_test.shape

((15120, 55), (15120,), (565892, 55))

In [9]:
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
# Accuracy on the training data (in percent), not a held-out estimate
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

28.84

In [10]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

66.920000000000002

In [None]:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

In [None]:
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

In [None]:
# XGBoost
#gradboost = xgb.XGBClassifier(n_estimators=1000)
#gradboost.fit(X_train, Y_train)
#Y_pred = gradboost.predict(X_test)
#acc_grad = round(gradboost.score(X_train, Y_train) * 100, 2)
#acc_grad

In [None]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=1000)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

In [None]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

In [None]:
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

In [None]:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

In [None]:
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

In [None]:
models = pd.DataFrame({
    'Classifier Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
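
The scores in the cells above are accuracies on the training data, so they can overstate how well a model generalizes. The hints suggest standardizing the predictors and using cross-validation for model selection; a minimal sketch of that combination follows. It runs on synthetic data from `make_classification` as a stand-in for the real predictors, and the sample size, number of classes, and `max_iter=1000` are assumptions chosen only for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and Cover_Type (assumption: 3 classes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Scaling lives inside the pipeline, so it is refit on each training fold
# and never sees the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy: one score per held-out fold.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

The same pattern should carry over to the notebook by passing `X_train` and `Y_train` in place of the synthetic arrays and swapping in any of the classifiers compared above.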


## Project Homework

0. Create and turn in a "Product Requirement Document" that contains the items below.

1. Include 2 prioritized lists of requirements:

    * user requirements, mark items where you have low confidence
    * technical requirements, mark items where you have low confidence

2. Decide how you will validate the user requirements that you are not sure about.

3. Decide how you will learn about and test the technical challenges that you are not sure you can achieve at this time.

4. Whiteboard or draw an architecture and/or data model based on what you know. Take a picture of it and include it in the document.

5. Make a list of what should be done in the next week, and assign a name to each item.

6. Insert your low-tech demo picture (possibly modified) into this document.

Turn in this document, and start a folder for your project where you save a copy for your own review later.




## Extra Credit: Forecasting Bike Sharing Demand 

###### Focus: Regression and Feature Engineering

In this week's programming homework, you are tasked with developing a system that forecasts the use of a city bikeshare system.

The question is based on a for-knowledge competition hosted on Kaggle.
Reference: Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.

Data is available at: https://www.kaggle.com/c/bike-sharing-demand

Three files are provided:
* train.csv : the file provides both the predictors and the response variable
* test.csv  : the file provides the predictors only
* sampleSubmission.csv: the file shows the submission format we expect

Your task:
* Understand and visualize the data generated by bike sharing systems
* Predict the number of total rentals in each time period during the test phase

Hints: 
* Always start with a data walk (i.e. visualization, data exploration) 
* Consider standardizing the predictors
* Always consider a simple model
* Consider feature engineering
* Always use Cross Validation to conduct model selection
* pandas (a Python package) has functions to deal with timestamps effectively
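
As a small illustration of the timestamp hint, the sketch below parses date strings with pandas and derives calendar features that a regression model could use. The rows are made up, and the column names `datetime` and `count` are assumptions about the layout of train.csv; adjust them to match the actual file.

```python
import pandas as pd

# Hypothetical rows; the column names are assumptions about train.csv.
demo = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 13:00:00", "2011-01-03 08:00:00"],
    "count": [16, 94, 72],
})

# Parse the strings into timestamps, then derive calendar features via .dt.
demo["datetime"] = pd.to_datetime(demo["datetime"])
demo["hour"] = demo["datetime"].dt.hour
demo["dayofweek"] = demo["datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
demo["month"] = demo["datetime"].dt.month

print(demo[["hour", "dayofweek", "month"]])
```

Features such as hour of day and day of week are often useful here because rental demand tends to follow daily and weekly cycles.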


