# SYD9DAT - Homework 2 - Grace Palma

### Course Project

<b>SETUP</b>

* Sign up for an AWS account - Username: gracepalma

<b>COMMUNICATION</b>

Imagine you are trying to describe Linear Regression to someone - but they have no programming/maths experience! How would you explain...

<b><U> Linear Regression - the overall process?</U></b>

Linear regression is a method for identifying the relationship between a predictor and an outcome. Its most basic form is simple linear regression, where a response is predicted from a single variable. Linear regression assumes a linear relationship between the predictor and the response: the outcome changes by a constant amount for each unit change in the predictor, so when the two variables are plotted on a graph the points trace a straight line.

For example, suppose we want to look at the relationship between hours spent studying (predictor) and GPA scores (outcome). We can create a simple linear regression model to approximate how hours spent studying affect GPA scores.

Simple linear regression uses this equation <b>Y = β0 + β1X</b>

<i>Where Y is the outcome, X is the predictor, and β0 and β1 are two unknown constants (the model coefficients) for the intercept and the slope. The intercept β0 is the value of the outcome (Y) when the predictor (X) is 0, and the slope β1 is the estimated change in the outcome for each one-unit increase in the predictor.</i>

For our example above the equation would look something like this:

    GPA score = β0 + β1(Hours spent studying)

Once we've used our training data to estimate β0 and β1, we can predict GPA scores from hours spent studying. As previously stated, linear regression assumes a straight-line relationship between the variables. That means the model could find either a direct or an inverse relationship between them:

In [1]:
from IPython.display import Image
from IPython.core.display import HTML
Image(url= "https://image.slidesharecdn.com/labreportwalkthrough-140918235306-phpapp02/95/lab-report-walk-through-16-638.jpg")

A direct linear relationship in this example would mean that as the number of hours spent studying increases, GPA scores also increase, i.e. those who spent more hours studying had better grade point averages.

An inverse linear relationship in this example would mean that as the number of hours spent studying increases, students' GPA scores decrease, i.e. the more people studied, the worse they scored.
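
To make this concrete, here is a minimal sketch of fitting the line by ordinary least squares in plain Python. The hours and GPA values are made up for illustration, and `fit_line` and `predict_gpa` are hypothetical helper names, not part of any library.

```python
# Hypothetical data: weekly study hours and GPA for six students
hours = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
gpa = [2.1, 2.5, 2.8, 3.2, 3.4, 3.9]


def fit_line(x, y):
    """Return least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Slope: how x and y vary together, divided by how much x varies
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    # Intercept: forces the line through the point of means
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


beta0, beta1 = fit_line(hours, gpa)


def predict_gpa(hours_studied):
    """Predict GPA from hours studied using the fitted coefficients."""
    return beta0 + beta1 * hours_studied


# A positive slope is a direct relationship, a negative slope an inverse one
relationship = "direct" if beta1 > 0 else "inverse"
print(f"GPA = {beta0:.2f} + {beta1:.2f} * hours ({relationship} relationship)")
```

Calling `predict_gpa(7)` would then give the model's estimate for a student who studies 7 hours a week.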

<B><u>What a p-value is?</u></B>

Hypothesis tests are used to assess the validity of a claim made about a population, based on a sample. When performing a hypothesis test, the p-value is a value between 0 and 1 that is used to determine the statistical significance of your results. Returning to the example above, suppose we use a cut-off (significance level) of 0.05.

If we hypothesize that hours spent studying have nothing to do with GPA scores, and the p-value we get from our sample is 0.001, our hypothesis is rejected. This means that, based on our sample, the number of hours spent studying does appear to have an impact on GPA scores.

This is because the p-value is the probability of seeing a relationship at least as strong as the one in our sample if hours spent studying truly had no impact on GPA scores. Here that probability is low, much lower than our chosen cut-off (0.001 < 0.05). Simply put, a pattern like ours would be very unlikely to show up by chance alone if studying made no difference.
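
One way to see what a p-value measures is a permutation test, sketched below in plain Python. This is a different technique from the t-test that regression software usually reports, and the data and the helper name `slope` are hypothetical. The idea: if hours studied had nothing to do with GPA, shuffling the GPA scores among students should produce slopes as steep as the real one fairly often.

```python
import random

# Hypothetical data: weekly study hours and GPA for six students
hours = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
gpa = [2.1, 2.5, 2.8, 3.2, 3.4, 3.9]


def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx


observed = abs(slope(hours, gpa))

rng = random.Random(0)  # fixed seed so the run is repeatable
n_shuffles = 10_000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    shuffled = gpa[:]
    rng.shuffle(shuffled)  # break any real link between hours and GPA
    if abs(slope(hours, shuffled)) >= observed:
        at_least_as_extreme += 1

# Share of shuffles whose slope is at least as steep as the real one
p_value = at_least_as_extreme / n_shuffles
alpha = 0.05  # the cut-off chosen before looking at the data
reject_null = p_value < alpha
print(f"p-value ~ {p_value:.4f}, reject 'no relationship': {reject_null}")
```

Because very few shuffles match the real slope, the estimated p-value falls below the cut-off and the "no relationship" hypothesis is rejected.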

<B><u>What R-Squared means?</u></B>

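R-squared (R²) is a statistical measure of how close the data are to the fitted regression line. It is built from the residual sum of squares (RSS), also known as the sum of squared residuals (SSR), which measures the variation between the data and the model, and the total sum of squares (TSS), which measures the variation of the data around their mean: R² = 1 - RSS/TSS. In other words, R² is the proportion of the variation in the outcome that the model explains.
Hence an R² that is closer to 1 (equivalently, an RSS that is small relative to TSS) would indicate a model that tightly fits your data.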
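
As a rough illustration, R-squared can be computed directly from its definition, R² = 1 - RSS/TSS, where TSS is the total sum of squares of the outcome around its mean. The data below are hypothetical and the sketch uses plain Python only.

```python
# Hypothetical data: weekly study hours and GPA for six students
hours = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
gpa = [2.1, 2.5, 2.8, 3.2, 3.4, 3.9]

# Fit GPA = beta0 + beta1 * hours by ordinary least squares
mean_x = sum(hours) / len(hours)
mean_y = sum(gpa) / len(gpa)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, gpa))
s_xx = sum((x - mean_x) ** 2 for x in hours)
beta1 = s_xy / s_xx
beta0 = mean_y - beta1 * mean_x
predicted = [beta0 + beta1 * x for x in hours]

# RSS: variation the model fails to explain (smaller is better)
rss = sum((y - p) ** 2 for y, p in zip(gpa, predicted))
# TSS: total variation of the outcome around its mean
tss = sum((y - mean_y) ** 2 for y in gpa)

# Proportion of the variation in GPA explained by hours studied
r_squared = 1 - rss / tss
print(f"RSS = {rss:.4f}, TSS = {tss:.4f}, R-squared = {r_squared:.3f}")
```

Note that it is R-squared, not RSS, that lies between 0 and 1 here; RSS is measured in squared units of the outcome and a good fit pushes it towards 0.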

<b><U>Read the paper Useful things to know about machine learning.</U></b>

What have we covered so far from this paper?

* Classifiers 
* K-nearest neighbor 
* Logistic regression 
* Decision trees 
* Accuracy/Error rate 
* Squared error 
* Bias & variance 
* Overfitting 
* Hypothesis testing 
* Cross validation 
* Regularisation 

Explain sections 6-13 in your own words.

Section 6: Intuition fails in high dimensions 

* Section 6 explains the curse of dimensionality: the obstructive effect that a large number of features (dimensions) has on many machine learning algorithms and methods. For example, if only 2 dimensions turn out to be relevant in a model that looks at 100 dimensions, the noise from the 98 irrelevant dimensions swamps the signal from the 2 relevant ones, which makes predictions effectively arbitrary. The same issue arises even when the majority of the dimensions are relevant, because in high dimensions all examples look alike.
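
The claim that examples look alike in high dimensions can be checked with a small experiment. The sketch below (plain Python, with an arbitrary choice of 200 random points in a unit hypercube) compares the distance from one point to its nearest and farthest neighbours. `nearest_to_farthest_ratio` is a hypothetical helper name. A ratio near 1 means the nearest and farthest points are almost equally far away, which is what tends to happen as the number of dimensions grows.

```python
import math
import random


def nearest_to_farthest_ratio(n_dims, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one random point to the rest."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    distances = [math.dist(query, p) for p in others]
    return min(distances) / max(distances)


for n_dims in (2, 10, 100, 500):
    ratio = nearest_to_farthest_ratio(n_dims)
    print(f"{n_dims:>4} dimensions: nearest/farthest = {ratio:.2f}")
```

This matters for methods such as K-nearest neighbour, which rely on some points being meaningfully closer than others.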

Section 7: Theoretical guarantees are not what they seem 

* In machine learning, theoretical guarantees do not provide certainty about the accuracy of a model. Their main purpose is to support the understanding and design of algorithms. Always exercise caution when dealing with theoretical guarantees in machine learning, because the bounds obtained are often loose.


Section 8: Feature engineering is the key 

* Machine learning is an iterative process of learning, analysing and modifying data (i.e. feature engineering). Feature engineering is an integral step in machine learning because it ensures that the appropriate features are selected for your model.

Section 9: More data beats a cleverer algorithm

* This section explains that, as a general rule, a more complex or clever algorithm with less data is often less effective than a simple algorithm with more data.

Section 10: Learn many models, not just one 

* The best model or learner often varies from application to application, hence there is a current move towards combining many model variations, which has often been found to yield much better results. Several techniques can be used to create such models: a) bagging generates random variations of the training data through resampling and learns a model on each, b) boosting varies the weights allocated to the training samples so that each new model focuses on the examples earlier ones got wrong, and c) stacking feeds the outputs of individual classifiers into a higher-level learner, which decides how best to combine them.
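
Bagging in particular is simple enough to sketch in a few lines. The example below is a hedged illustration in plain Python, not the paper's code: it resamples a small hypothetical dataset with replacement, fits a least-squares line to each resample, and averages the predictions (averaging is the usual analogue of voting when the predictions are numeric). `fit_line` and `bagged_predict` are hypothetical helper names.

```python
import random

# Hypothetical data: weekly study hours and GPA for eight students
hours = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0]
gpa = [2.0, 2.3, 2.4, 2.9, 3.0, 3.3, 3.4, 3.8]


def fit_line(x, y):
    """Return least-squares (intercept, slope) for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the run is repeatable
models = []
while len(models) < 50:
    # Bootstrap: draw as many rows as the original data, with replacement
    idx = [rng.randrange(len(hours)) for _ in hours]
    sample_x = [hours[i] for i in idx]
    if len(set(sample_x)) < 2:
        continue  # a line cannot be fitted to a single distinct x value
    models.append(fit_line(sample_x, [gpa[i] for i in idx]))


def bagged_predict(hours_studied):
    """Average the predictions of all bootstrap models."""
    predictions = [b0 + b1 * hours_studied for b0, b1 in models]
    return sum(predictions) / len(predictions)


print(f"Bagged GPA prediction for 6 hours: {bagged_predict(6.0):.2f}")
```

Each individual line depends on which rows happened to be drawn, but the average over many resamples is more stable than any single one.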

Section 11: Simplicity does not imply accuracy 

* Section 11 points out that there is no necessary connection between the number of parameters in a model and its tendency to overfit. Simpler hypotheses are preferred not because simplicity leads to greater accuracy, but because simplicity is a virtue in its own right.

Section 12: Representable does not imply learnable 

* This section highlights that a function being representable does not mean it can be learned. Whether a function is learnable can depend on the time, data and memory available.

Section 13: Correlation does not imply causation 

* In machine learning, we often deal with observational data rather than experimental data. In observational data, the predictive variables are not under the control of the learner, which limits its ability to support causal inferences. However, the correlations a learner finds can still be useful pointers for further investigating potential causes.


<b><U>Machine Learning</U></b>

Describe 3 ways we can select what features to use in a model.

1. Domain knowledge of a specific field or topic can help dictate which features are the most suitable to use in a model
2. Looking at the correlation between features using scatterplots and correlation coefficients
3. Lasso regularisation
4. Looking at which variables account for a significant proportion of the variance
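
One way to apply the correlation-based approach is to compute the Pearson correlation coefficient between each candidate feature and the outcome, then rank the features by its absolute value. The feature names and values below are made up for illustration, and `pearson` is a hypothetical helper written in plain Python.

```python
import math

# Hypothetical dataset: three candidate features and an outcome (GPA)
features = {
    "hours_studied": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "hours_of_sleep": [6.0, 8.0, 7.0, 6.5, 7.5, 7.0],
    "classes_missed": [9.0, 7.0, 8.0, 4.0, 3.0, 1.0],
}
gpa = [2.1, 2.5, 2.8, 3.2, 3.4, 3.9]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


correlations = {name: pearson(values, gpa) for name, values in features.items()}

# Rank by strength of the linear relationship, ignoring its direction
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name:<15} r = {correlations[name]:+.2f}")
```

Features near the bottom of the ranking are candidates to drop, although a weak linear correlation does not rule out a non-linear relationship.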

### Course Project
* Create a github repository for your project and share it with Allen and Kieran over Slack.
* Load the data you have gathered for your project into Python and run some summary statistics over the data. Are there any interesting features of the data that jump out?
* Create some data visualisations that explore some aspect of your data set.
* Are there any regression or clustering techniques you could use in your project? Write them down (with the corresponding scikit-learn function) and what you think you would get out of it. Try a preliminary model, if you get a chance (you can choose a small subset of features to begin with).

In [6]:
import numpy as np
import pandas as pd

In [7]:
# graphic and data viz tools
import matplotlib.mlab as mlab
import plotly.plotly as plotly
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# upload data file
df = pd.read_csv("~/Documents/GMTPalma/PP_Feed.csv")
df.head(10)

|  | collection_index | tag_version | collector_version | tag_id | account_id | event_type | event_id | event_time | utc_year | utc_month | ... | doc_title | page_hostname | ref_hostname | creative_id | campaign_id | line_item_id | campaign_birth | campaign_views | device_type | load_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017081400 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 21f2a741-d8ce-4b8c-86bf-e43e8cd42085 | 2017-08-14 00:03:04 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com | my.pureprofile.com |  |  |  |  | 0 | pc | 201708140000_20 |
| 1 | 2017081400 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 81d95c1c-ff6b-4d49-a7e4-736894ec9f06 | 2017-08-14 00:18:23 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | tablet | 201708140000_20 |
| 2 | 2017081400 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 5ddaecf0-49f6-4b7b-a9f8-c3e401e33df8 | 2017-08-14 00:14:41 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | pc | 201708140000_20 |
| 3 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 26b75f16-99bf-4429-8aba-7105a9c00d42 | 2017-08-14 01:36:24 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | pc | 201708140120_20 |
| 4 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 9cdc0149-a13c-4b3a-8f5f-d568d7c57925 | 2017-08-14 01:34:14 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | pc | 201708140120_20 |
| 5 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | f6ed1435-3a05-482c-a0f3-197d55d12b27 | 2017-08-14 01:25:03 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | pc | 201708140120_20 |
| 6 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | fbfc83a8-1039-4040-ac9c-69504937e258 | 2017-08-14 01:29:16 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | pc | 201708140120_20 |
| 7 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 979d9940-d7a5-4d84-bd61-4627e395ea81 | 2017-08-14 01:32:12 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com |  |  |  |  |  | 0 | mobile | 201708140120_20 |
| 8 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | 9c594180-17c1-4b62-93d8-ae27b79cb014 | 2017-08-14 01:27:45 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com | survey.pureprofile.com |  |  |  |  | 0 | mobile | 201708140120_20 |
| 9 | 2017081401 | 1.6.3 | 1.7.5 | pplcorp | pplcorp | page | c0369b0a-d8cb-4b74-99ee-3a46d10dcecd | 2017-08-14 01:20:24 | 2017 | 8 | ... | Pureprofile | my.pureprofile.com | www.msn.com |  |  |  |  | 0 | pc | 201708140120_20 |


In [9]:
df.describe()

|  | utc_month | utc_day | utc_hour | user_month | user_day | user_hour | network_visit_views | creative_id | campaign_id | line_item_id | campaign_birth | campaign_views |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 109962.0 | 109962.0 | 109962.0 | 109962.0 | 109962.0 | 109962.0 | 109962.0 | 0.0 | 0.0 | 0.0 | 0.0 | 109962.0 |
| mean | 7.425693 | 16.733999 | 9.836798 | 7.430703 | 16.7644 | 13.884315 | 2.697577 |  |  |  |  | 0.0 |
| std | 0.49445 | 8.79661 | 7.015596 | 0.495177 | 8.81341 | 5.31045 | 4.704571 |  |  |  |  | 0.0 |
| min | 7.0 | 1.0 | 0.0 | 7.0 | 1.0 | 0.0 | 1.0 |  |  |  |  | 0.0 |
| 25% | 7.0 | 9.0 | 4.0 | 7.0 | 9.0 | 10.0 | 1.0 |  |  |  |  | 0.0 |
| 50% | 7.0 | 17.0 | 9.0 | 7.0 | 17.0 | 14.0 | 1.0 |  |  |  |  | 0.0 |
| 75% | 8.0 | 25.0 | 14.0 | 8.0 | 26.0 | 18.0 | 3.0 |  |  |  |  | 0.0 |
| max | 8.0 | 31.0 | 23.0 | 8.0 | 31.0 | 23.0 | 857.0 |  |  |  |  | 0.0 |
