# SYD DAT 4 Lab 2 - Visualisation and Regression

## Homework - Due 29th April 2016

#### Setup
* Signup for an AWS account

#### Communication
* Imagine you are trying to explain Linear Regression to someone with no programming or maths experience. How would you explain the overall process, what a p-value means and what R-Squared means?
* Read the paper [Useful things to know about machine learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).
    * What have we covered so far from this paper? 
    * Explain sections 6-13 in your own words

#### Machine Learning
* Read chapters 3 and 6 of Introduction to Statistical Learning
* Describe 3 ways we can select what features to use in a model
* Complete the first 3 exercises from Chapter 3 in Python

#### Course Project
* For the following, set up a new GitHub repository for your project and share it with Matt and Ian over Slack.
* Load the data you have gathered for your project into Python and run some summary statistics over the data. Are there any interesting features of the data that jump out? (Include the code)
* Draft/Sketch on paper (or wireframe) some data visualisations that would be useful for you to explore your data set
* Are there any regression or clustering techniques you could use in your project? Write them down (with the corresponding scikit-learn function) and what you think you would get out of them.


**Instructions: copy this file and append your name to the filename, e.g. Homework2_ian_hansel.ipynb.
Then commit this in your local repository, push it to your GitHub account and create a pull request so I can see your work. Remember, if you get stuck, to look at the slides going over Fork, Clone, Commit, Push and Pull request.**

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

import sklearn as ml
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

# this allows plots to appear directly in the notebook
%matplotlib inline

### Communication

#### Linear Regression

Linear regression fits the straight line that minimises the residual sum of squares (RSS) between the observed responses in the dataset and the responses predicted by the linear approximation. This is the least squares approach: of all possible lines, it picks the one with the smallest sum of squared errors.

The residual sum of squares (RSS) is a statistic that measures the amount of variation in a data set that is not explained by the regression model, i.e. the error remaining between the regression function and the data. A smaller RSS means the regression function explains more of the variation in the data.
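As a minimal sketch of this idea, the following fits a least-squares line to a small made-up dataset with NumPy and computes its RSS. The data values and the helper name `rss` are assumptions for illustration only and are not part of the homework data.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of y ~ intercept + slope * x
# (np.polyfit returns the highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, deg=1)


def rss(b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))


rss_best = rss(intercept, slope)          # RSS of the least-squares line
rss_shifted = rss(intercept + 0.5, slope)  # RSS of a deliberately worse line
print(rss_best, rss_shifted)
```

Any line other than the least-squares one, such as the shifted line here, should give an RSS that is at least as large.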

**Assessing the accuracy of the coefficient estimates**

- Null Hypothesis - there is no relationship between X and Y
- Alternate Hypothesis - there is some relationship between X and Y

To test the Null Hypothesis, we need to determine whether our estimate of the coefficient of X is sufficiently far from zero that we can be confident that the actual value is non-zero. We compute the t-statistic, which measures the number of standard errors the estimated coefficient is away from zero.

A small **p-value** indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence if we see a small p-value, we can infer that there is an association between the predictor and the response.
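The relationship between the coefficient, its standard error, the t-statistic and the p-value can be sketched as below. This is only an illustration on synthetic data (the true slope of 0.5, the sample size and the seed are all assumptions); it uses `scipy.stats.linregress` for the fit and then recomputes the two-sided p-value by hand from the t-distribution with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic data: the slope of 0.5 is an assumption chosen for the demo
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

fit = stats.linregress(x, y)

# t-statistic: how many standard errors the slope estimate is from zero
t_stat = fit.slope / fit.stderr

# Two-sided p-value under the null hypothesis that the true slope is zero
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(t_stat, p_value, fit.pvalue)
```

The hand-computed `p_value` should agree with the `pvalue` that `linregress` reports for the slope.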

The **coefficient of determination**, r², is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It indicates how well the model fits the data, and so how much confidence we can place in predictions made from it. The coefficient of determination satisfies 0 ≤ r² ≤ 1 and denotes the strength of the linear association between x and y. It does not count how many data points lie close to the line of best fit; it measures how much of the variation in y the line accounts for. For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.
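A small sketch of how this quantity can be computed, assuming made-up data: r² is calculated as 1 - RSS/TSS (TSS being the total sum of squares around the mean of y) and compared with the squared correlation coefficient, which should match for a simple linear regression. The last line reproduces the arithmetic of the worked example above.

```python
import numpy as np

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.9, 4.1, 5.2, 5.8])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

rss = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
tss = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - rss / tss

# For simple linear regression, r_squared equals the squared correlation
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)

# Worked example from the text: r = 0.922
r2_example = 0.922 ** 2
print(round(r2_example, 3))
```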

#### Useful things to know about machine learning

The paper focuses mainly on machine learning from the point of view of classification models. I found the paper difficult to follow. From what I could understand, the key takeaways from the paper that we have covered are:

- No matter how much data we have, it is very unlikely that we will see those exact data in the test or production environment. So, doing well on the training set is not a measure of success.
- **Machine learning is not magic, it can't get something from nothing. What it does is get more from less.**
- **Bias** and **Variance** using the dart example
 - Bias is a learner's tendency to consistently learn the same wrong thing.
 - Variance is the tendency to learn random things irrespective of the real signal.
- Cross-validation and k-fold cross-validation: both can help combat overfitting.
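As a rough sketch of k-fold cross-validation, the example below scores a linear regression on five held-out folds with scikit-learn. The synthetic data, the coefficients and the choice of five folds are assumptions for illustration; `KFold` and `cross_val_score` are imported from `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (coefficients are assumptions for the demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Each of the 5 folds is held out once while the model is fit on the other 4
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)         # one held-out R-squared per fold
print(scores.mean())  # average out-of-sample performance
```

Because every score is computed on data the model did not see during fitting, the average is a more honest estimate of performance than the training-set fit.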

The *curse of dimensionality* is a term coined by Bellman in 1961. It refers to the fact that many algorithms that work fine in low dimensions become unmanageable when the input is high-dimensional. One might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact the curse of dimensionality may outweigh their benefits.

The most important factor in making a machine learning project a success or failure is the features used. If you have many independent features that each correlate well with the class, learning is easy. Often the raw data is not in a form that the learner can fully utilize, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes.

Machine learning is not a one-shot process of cleaning a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data or the learner, and repeating.

A dumb algorithm with lots and lots of data beats a clever one with a modest amount of data. Machine learning is all about letting data do the heavy lifting.

### Machine Learning
#### Chapter 3 - Question 1 to 3

Question 1: 
Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

In [3]:
df = pd.read_csv('../data/ISLR_Data/Advertising.csv')
df = df.drop('Unnamed: 0', axis=1)
df.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 9.3 |
| 3 | 151.5 | 41.3 | 58.5 | 18.5 |
| 4 | 180.8 | 10.8 | 58.4 | 12.9 |


In [4]:
# Statsmodel API
#model = smf.ols(formula='Sales ~ TV', data=df[:100])
model = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=df)
result = model.fit()
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Sales | R-squared: | 0.897 |
| Model: | OLS | Adj. R-squared: | 0.896 |
| Method: | Least Squares | F-statistic: | 570.3 |
| Date: | Mon, 25 Apr 2016 | Prob (F-statistic): | 1.58e-96 |
| Time: | 12:38:37 | Log-Likelihood: | -386.18 |
| No. Observations: | 200 | AIC: | 780.4 |
| Df Residuals: | 196 | BIC: | 793.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 3.554 |
| TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 0.049 |
| Radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 0.206 |
| Newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 0.011 |

| | | | |
|---|---|---|---|
| Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 151.241 |
| Skew: | -1.327 | Prob(JB): | 1.44e-33 |
| Kurtosis: | 6.332 | Cond. No. | 454.0 |


Answer 1:
Null Hypothesis for TV: TV advertising has no relationship with sales when radio and newspaper advertising are included in the model
Null Hypothesis for Radio: Radio advertising has no relationship with sales when TV and newspaper advertising are included in the model
Null Hypothesis for Newspaper: Newspaper advertising has no relationship with sales when radio and TV advertising are included in the model

coef: estimated coefficients (the intercept and one slope for each predictor)
std err: standard error of each coefficient estimate
t: coef/std err. It tells us how many standard errors the coefficient estimate is away from zero
p value: We use p-values to determine statistical significance in a hypothesis test. Since the p-values of TV and Radio are less than 0.05, we reject the null hypothesis for them: both TV and radio advertising are associated with sales. For Newspaper the p-value (0.860) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that newspaper advertising is associated with sales once TV and radio are accounted for.

Question 2:
Carefully explain the differences between the KNN classifier and KNN regression methods.

Answer 2:
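Both methods start by finding the K training observations nearest to the point being predicted. The KNN classifier gives a classification output for Y: it predicts the class that is most common among those K neighbours. KNN regression predicts a quantitative value of Y: the average of the responses of those K neighbours.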
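The difference can be sketched with scikit-learn's `KNeighborsClassifier` and `KNeighborsRegressor` on a tiny made-up dataset (the data points and K = 3 are assumptions for illustration). Both models look at the same three nearest neighbours of the query point, but one takes a majority vote over labels and the other averages numeric responses.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up one-dimensional training data
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])                  # qualitative response
values = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # quantitative response

x0 = np.array([[2.5]])  # query point; its 3 nearest neighbours are 2.0, 3.0 and 1.0

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, values)

predicted_class = clf.predict(x0)[0]  # majority class among the 3 neighbours
predicted_value = reg.predict(x0)[0]  # mean response of the 3 neighbours
print(predicted_class, predicted_value)
```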

Answer 3(a)(i)
Y = 50 + 20*x1 + 0.07*x2 + 35*x3 + 0.01*x4 + -10*x5
where,
x1 = GPA
x2 = IQ
x3 = Gender
x4 = Interaction between GPA and IQ
x5 = Interaction between GPA and Gender
Or,
Y = 50 + 20*x1 + 0.07*x2 + 35*x3 + 0.01*(x1*x2) + -10*(x1*x3)

So,
For Male (x3=0)  : Y = 50 + 20*x1 + 0.07*x2 + 0.01*(x1*x2)
For Female (x3=1): Y = 50 + 20*x1 + 0.07*x2 + 35 + 0.01*(x1*x2) + -10*(x1)

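Subtracting the male equation from the female equation gives a difference of 35 - 10*x1. Because of the negative term, which group earns more depends on GPA: females earn more on average when GPA is below 3.5, and males earn more on average when GPA is above 3.5. So males do not earn more than females at every GPA.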

Answer 3(a)(ii)
We cannot say in general that females earn more on average than males: it depends on GPA.

Answer 3(a)(iii)
At high values of GPA (above 3.5), males earn more than females on average because of the negative term -10*(x1).

Answer 3(a)(iv)
At high values of GPA, males earn more than females on average because of the negative term -10*(x1), so females do not earn more when GPA is high.
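The four parts of 3(a) can be checked numerically from the fitted equation. The sketch below encodes the model from Answer 3(a)(i), with x3 = 1 for female and x3 = 0 for male; the function names `salary` and `female_minus_male` and the default IQ of 110 are assumptions introduced for illustration.

```python
def salary(gpa, iq, female):
    """Predicted salary from the fitted model (female = 1, male = 0)."""
    return (50 + 20 * gpa + 0.07 * iq + 35 * female
            + 0.01 * gpa * iq - 10 * gpa * female)


def female_minus_male(gpa, iq=110):
    """Difference in predicted salary at the same GPA and IQ.

    Algebraically this is 35 - 10 * gpa, so it does not depend on IQ.
    """
    return salary(gpa, iq, 1) - salary(gpa, iq, 0)


for gpa in (2.0, 3.5, 4.0):
    print(gpa, female_minus_male(gpa))
```

The difference is positive below a GPA of 3.5, zero at 3.5 and negative above it, which matches the reasoning in the answers.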

In [5]:
#Answer 3(b)
# Predicted salary for a female (x3 = 1) with GPA 4.0 and IQ 110
y = 50 + 20*4.0 + 0.07*110 + 35*1 + 0.01*(4.0*110) + -10*(4.0*1)
print(y)

137.1


Answer 3(c)
The coefficient of the GPA/IQ interaction is very small, but that does not mean that the interaction effect is negligible. The size of a coefficient depends on the scale of the variables, so it is the p-value (or t-statistic) of the coefficient that tells us whether there is evidence of an interaction effect.

#### 3 ways we can select what features to use in a model

A standard linear model is commonly used to describe the relationship between a response Y and a set of predictors X1, X2 .. Xp. One usually fits this model using least squares.

Alternative fitting procedures can result in better prediction accuracy and model interpretability. If the number of observations n is not much larger than the number of variables p, then the least squares fit can have a lot of variability, resulting in over-fitting and poor predictions on future observations not used in model training.

There might also be the scenario that the variables are not associated with the response. Including such variables adds to the complexity of the model.

Shrinkage, or regularization, is an approach that fits a model involving all p predictors. However, the coefficients are shrunk towards zero, which has the effect of reducing variance.

1. Ridge Regression
2. Lasso
3. Elastic Net
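A minimal sketch of the three methods using scikit-learn's `Ridge`, `Lasso` and `ElasticNet` is shown below. The synthetic data (only the first two of ten predictors matter) and the penalty strengths are assumptions chosen for illustration, not tuned values. One point worth keeping in mind: ridge shrinks coefficients without setting them exactly to zero, whereas the lasso and elastic net penalties can set some coefficients exactly to zero, which is what makes them usable for selecting features.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of the ten predictors affect y
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Penalised methods are sensitive to scale, so standardise the predictors
X_scaled = StandardScaler().fit_transform(X)

# Penalty strengths are illustrative assumptions, not tuned values
models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.3),
    "elastic_net": ElasticNet(alpha=0.3, l1_ratio=0.5),
}

n_nonzero = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    # Count how many predictors keep a non-zero coefficient
    n_nonzero[name] = int(np.sum(model.coef_ != 0))

print(n_nonzero)
```

In practice the penalty strength would usually be chosen by cross-validation, for example with `RidgeCV`, `LassoCV` or `ElasticNetCV`.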