# Regression

Regression modeling is any attempt to predict or explain a continuous variable from a collection of input data. This could be a student's GPA, the position of a planet orbiting a sun, or the color of a pixel in a photo. Predicting a category, such as whether a student is a STEM student or whether an event will occur (a change of major, an earthquake), is not a regression task; it is classification.

After completing this tutorial you should be able to:

* use `scikit-learn` to split data into training and testing sets
* understand the model, fit, score paradigm in `scikit-learn` and apply it to a problem
* understand the most important visualizations of regression analysis: actual vs. predicted, actual vs. residuals, residuals distribution vs. assumed theoretical distribution (in the case of OLS models)
* have a conceptual understanding of the basic goal of any regression task
* have some understanding that most statistical "tests" are typically just specific solutions of a linear regression problem
* have some understanding of the assumptions of linear models

## Further reading

1. Hands-On Machine Learning, probably the best practical machine learning textbook ever written: https://github.com/ageron/handson-ml
2. Common statistical tests are linear models. Stop thinking statistics are something other than y = mx + b; they are not: https://lindeloev.github.io/tests-as-linear/?fbclid=IwAR09Rp4Vv18fOO4lg0ITnCYJICCC1iuzeq-tNYPWsnmK6CrGgdErpvHfyWE

## Data

In the data folder you should have some data. Import the data as you did in the previous tutorial, "exploring data". The first step in any regression task is to explore the raw data.

### DATA TODO 

1. Visualize the data in three different ways
2. Which variables may covary? How will this affect your model?
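
One possible way to start on the second question is a correlation matrix. The sketch below uses a synthetic stand-in for the workshop data, since the real columns are not shown here; the column names `hours_studied`, `hours_in_library`, and `commute_minutes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the workshop data; the column names are hypothetical.
rng = np.random.default_rng(0)
hours_studied = rng.normal(10, 2, size=200)
df = pd.DataFrame({
    "hours_studied": hours_studied,
    # Built from hours_studied on purpose, so these two columns covary strongly.
    "hours_in_library": 0.8 * hours_studied + rng.normal(0, 0.5, size=200),
    # Generated independently of the other two columns.
    "commute_minutes": rng.normal(30, 10, size=200),
})

# Pairwise Pearson correlations; values near +1 or -1 flag columns that move together.
corr = df.corr()
print(corr.round(2))
```

Strongly correlated inputs carry overlapping information, which tends to make the individual coefficients of a linear model harder to interpret.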

## Modeling

Modeling data is as much an art as it is a science. There is no "true" model; there is only a model that reduces error to an acceptable amount. Most models attempt to do this automatically by minimizing some sort of cost function (or error) using some kind of solver algorithm. These solvers are beyond the scope of this workshop, but it is important to know that they exist and roughly how they work. If you are interested in this sort of thing, I recommend starting with [this Stats Stack Exchange thread](https://stats.stackexchange.com/questions/160179/do-we-need-gradient-descent-to-find-the-coefficients-of-a-linear-regression-mode) and googling each solver in the answer that seems interesting. The thread covers only linear least squares models, but it's a good place to start. Moving on...
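
To make "minimizing a cost function with a solver" a little more concrete, here is a minimal sketch that fits the same linear model two ways: with NumPy's direct least squares solver and with a hand-written gradient descent loop on the mean squared error. The data are synthetic, and the learning rate and iteration count are illustrative assumptions, not recommendations.

```python
import numpy as np

# Synthetic data with known coefficients: y = 3 + 2*x1 - 1*x2 + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Solver 1: direct linear least squares.
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# Solver 2: plain gradient descent on the mean squared error.
beta_gd = np.zeros(A.shape[1])
learning_rate = 0.1  # assumed value; small enough to converge on this data
for _ in range(2000):
    gradient = 2.0 / len(y) * A.T @ (A @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print("least squares:   ", beta_lstsq.round(3))
print("gradient descent:", beta_gd.round(3))
```

Both solvers minimize the same cost function, so they should land on essentially the same coefficients; they differ only in how they get there.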

### MODELING TODO
1. Split the data into training and testing data sets using the `sklearn.model_selection` function `train_test_split` [[link]](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
2. Create an OLS model using the `sklearn.linear_model` class `LinearRegression` [[link]](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)
3. Score the model using your model's built-in `score` method. What does this number represent? What is it summarizing?
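
The three steps above can be sketched roughly as follows. This uses synthetic data rather than the workshop data set, and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the workshop data: three inputs, one continuous target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=300)

# 1. Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Create the model and fit it on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Score on rows the model has never seen.
#    For scikit-learn regressors, score() returns the coefficient of determination (R^2).
r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.3f}")
```

Scoring on the held-out rows rather than the training rows gives a fairer picture of how the model behaves on new data.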

## Analysing the model output

Now that we have established that the goal of the model is to minimize the error, created a model, and found some brilliant, amazing score for it, we must still recognize that the model has some error. The error, or residual, is just the difference between an actual observed value and the value the model predicts for it, i.e. the distance from the model "plane" to the observed data point, as shown below:

<img src="https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/images/Residuals.png" />

These residuals are data in their own right. But instead of being data about students, courses, etc., they are data about the model and how it makes predictions. Thus we can use them to describe the model's performance.

### ANALYSING TODO
 
1. Create predicted data using the model's `predict` method. Now make a scatter plot to compare it to the actual values and draw a diagonal through this plot. What "shape" is the scatter plot "blob"? Does the "blob" follow the diagonal line or does it deviate in some way?
2. Write a function to calculate the residuals of the model. Plot the actual values versus the residuals using a scatter plot.
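
A minimal sketch of both plots, again on synthetic data. The helper name `residuals` is a hypothetical choice, and it assumes the common convention of actual minus predicted.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def residuals(model, X, y):
    """Return actual minus predicted values for a fitted model."""
    return y - model.predict(X)


# Synthetic stand-in for the workshop data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
res = residuals(model, X_test, y_test)

fig, (ax_pred, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted: a perfect model would put every point on the diagonal.
ax_pred.scatter(y_test, y_pred, alpha=0.6)
limits = [y_test.min(), y_test.max()]
ax_pred.plot(limits, limits, color="black", linestyle="--")
ax_pred.set_xlabel("actual")
ax_pred.set_ylabel("predicted")

# Actual vs. residuals: the horizontal line at zero marks a perfect prediction.
ax_res.scatter(y_test, res, alpha=0.6)
ax_res.axhline(0, color="black", linestyle="--")
ax_res.set_xlabel("actual")
ax_res.set_ylabel("residual (actual - predicted)")

fig.tight_layout()
plt.show()
```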

## CHALLENGE

1. Can you think of a model that ISN'T a "linear model" that could be used to perform a regression? Find the `sklearn` implementation and apply it to this data.
2. Is your new model better or worse than the model you already built? How are the residuals different?
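
As one possible starting point (a random forest is only one of many non-linear regressors in `sklearn`, and not necessarily the intended answer), the sketch below compares `RandomForestRegressor` with `LinearRegression`. The data are synthetic and deliberately non-linear, so the outcome here says nothing about which model will do better on the workshop data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a deliberately non-linear relationship: y = sin(2x) + noise.
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both models follow the same model, fit, score pattern.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)
r2_forest = forest.score(X_test, y_test)
print(f"linear R^2: {r2_linear:.3f}")
print(f"forest R^2: {r2_forest:.3f}")

# Residuals for each model, ready for the same plots as in the analysis section.
res_linear = y_test - linear.predict(X_test)
res_forest = y_test - forest.predict(X_test)
```

On data like this the linear model's residuals keep a visible wave pattern, while the forest's residuals should look closer to structureless noise.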