# Regression

Regression modeling is any attempt to predict or explain a continuous variable from a collection of input data. The target could be a student's GPA, the position of a planet orbiting a sun, or the color of a pixel in a photo. Predicting a category, such as whether a student is a STEM student or whether an event occurs (a change of major, an earthquake), is not a regression task; it is classification.

After completing this tutorial you should be able to:

* use `scikit-learn` to split data into training and testing sets
* understand the model, fit, score paradigm in `scikit-learn` and apply it to a problem
* understand the most important visualizations of regression analysis: actual vs. predicted, actual vs. residuals, and residuals distribution vs. assumed theoretical distribution (in the case of OLS models)
* have a conceptual understanding of the basic goal of any regression task
* have some understanding that most statistical "tests" are typically just specific solutions of a linear regression problem
* have some understanding of the assumptions of linear models

## Further reading

1. Hands-On Machine Learning, probably the best practical machine learning textbook ever written: https://github.com/ageron/handson-ml
2. Common statistical tests are linear models. Stop thinking statistics are something other than y=mx+b; they are not. https://lindeloev.github.io/tests-as-linear/

## Data

In the data folder there is a file named `regression_data.csv`. Import the data as you did in the previous tutorial, "exploring data". The first step in any regression task is to explore the raw data.

## Import the data

We will first need to import the data. To do so, we first import the libraries needed to load and visualize the data. Then we can read the data into a dataframe for analysis.

1. First import the ``pandas``, ``numpy``, and ``matplotlib.pyplot`` libraries
2. Then, import the data into a dataframe using the pandas ``read_csv()`` function.
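
A minimal sketch of these two steps. The real file lives at `data/regression_data.csv`; the small in-memory CSV and its column names (`feature_a`, `feature_b`, `target`) are made-up stand-ins so the snippet runs on its own.

```python
import io

import matplotlib.pyplot as plt  # imported here for the plots made later on
import numpy as np
import pandas as pd

# In the tutorial you would read the real file instead:
#   df = pd.read_csv("data/regression_data.csv")
# The CSV text below is a hypothetical stand-in, not the real data.
csv_text = """feature_a,feature_b,target
1.0,2.0,3.1
2.0,1.0,3.9
3.0,4.0,7.2
4.0,3.0,8.1
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first few rows
print(df.describe())  # summary statistics for each numeric column
```

Looking at `shape`, `head()`, and `describe()` right after loading is a quick way to confirm the file was parsed the way you expected.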

### DATA TODO 

1. Visualize the data in three different ways
2. Which variables may covary? How will this affect your model?
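
One way to look for covarying variables is a correlation matrix. The sketch below uses synthetic data (the columns `x1`, `x2`, `x3` and the 0.8 threshold are assumptions for illustration, not part of the workshop data) where `x2` is deliberately built from `x1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # built to covary strongly with x1
x3 = rng.normal(size=n)                    # generated independently of x1 and x2
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()
print(corr.round(2))

# Collect each pair of columns whose absolute correlation exceeds the
# (arbitrary, hypothetical) threshold
threshold = 0.8
cols = list(corr.columns)
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

For a visual version of the same check, `pd.plotting.scatter_matrix(df)` draws every pairwise scatter plot in one figure.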

## Modeling

Modeling data is as much an art as it is a science. There is no "true" model; there is only a model that reduces error to an acceptable amount. Most models attempt to do this automatically by minimizing some sort of cost function (or error) using some kind of solver algorithm. These solving methods are beyond the scope of this workshop, but it is important to know that they exist and roughly how they work. If you are interested in this sort of thing, I recommend starting with [this stats exchange thread](https://stats.stackexchange.com/questions/160179/do-we-need-gradient-descent-to-find-the-coefficients-of-a-linear-regression-mode) and googling each solver in the answer that seems interesting. The thread only covers Linear Least Squares models, but it's a good place to start. Moving on...
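
To make "minimizing a cost function" concrete, here is a small sketch using NumPy's least squares solver on a made-up line (slope 2, intercept 1, plus noise). The data and the `cost` helper are assumptions for illustration; this is one solver among many, not the only way a linear model can be fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)  # made-up data

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])


def cost(params):
    """Mean squared error of the line described by params = [slope, intercept]."""
    return np.mean((A @ params - y) ** 2)


# np.linalg.lstsq finds the parameters that minimize the squared error
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = solution
print(slope, intercept)

# The fitted parameters give a lower cost than an arbitrary guess
guess = np.array([1.0, 0.0])
print(cost(solution), cost(guess))
```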

### MODELING TODO
1. split the data into training and testing data sets using the `sklearn.model_selection` function `train_test_split` [[link]](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
2. create an OLS model using the `sklearn.linear_model` class `LinearRegression` [[link]](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)
3. score the model using your model's built-in `score` method. What does this number represent? What is it summarizing?
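
The model, fit, score pattern can be sketched as follows. The data here is synthetic (three made-up features with known coefficients), and `test_size=0.25` and `random_state=0` are arbitrary choices; with the workshop data you would pass your own feature columns and target instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, linear target plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()   # model
model.fit(X_train, y_train)  # fit, using the training data only

# score, using the held-out data; for LinearRegression this is R^2
r2_test = model.score(X_test, y_test)
print(r2_test)
```

Comparing `model.score(X_train, y_train)` with the test score is a quick check for overfitting: a much higher training score suggests the model does not generalize well.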

## Analysing the model output

Now that we have established that the goal of the model is to minimize the error, created a model, and found some brilliant, amazing score for it, we still must recognize that the model has some error. The error/residual is really just the vertical distance from the model "plane" to the actual value, that is, the actual value minus the predicted value, as shown below:

<img src="https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/images/Residuals.png" />

These residuals are data in their own right. But instead of being data about students, courses, etc., they are data about the model and how it is making predictions. Thus we can use them to describe the model's performance.

### ANALYSING TODO
 
1. Create predicted data using the model's `predict` method. Now make a scatter plot to compare it to the actual values and draw a diagonal through this plot. What "shape" does the scatter plot "blob" have? Does the "blob" follow the diagonal line or does it deviate in some way?
2. Write a function to calculate the residuals of the model. Plot the actual values versus the residuals using a scatter plot.
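
A possible shape for these two plots, sketched on synthetic data. The `residuals` helper name and the data are assumptions; the sign convention used here is actual minus predicted.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def residuals(y_actual, y_predicted):
    """Residual = actual value minus predicted value."""
    return np.asarray(y_actual) - np.asarray(y_predicted)


# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
res = residuals(y_test, y_pred)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted, with a diagonal marking perfect predictions
ax1.scatter(y_test, y_pred, alpha=0.6)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax1.plot(lims, lims, color="black", linestyle="--")
ax1.set_xlabel("actual")
ax1.set_ylabel("predicted")

# Actual vs. residuals, with a horizontal line at zero error
ax2.scatter(y_test, res, alpha=0.6)
ax2.axhline(0, color="black", linestyle="--")
ax2.set_xlabel("actual")
ax2.set_ylabel("residual")

fig.tight_layout()
```

Outside a notebook, call `plt.show()` to display the figure.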

## Model Features - training and fitting

All models have some input data X and some output prediction Y. In `scikit-learn`, the input data X has the shape $n \times m$: $n$ rows, one per data "point" (each a vector if $m>1$), and $m$ columns (or features). For many models, you can return values from the model that give some indication of how "important" each feature was to the model's training. Typically, the larger the magnitude of this value, the more important the feature is for prediction, provided the features are on comparable scales. For linear models these values are called the model *coefficients*; for other models they may be called *feature importances*. These values are always calculated from the data that was used to train (fit) the model. Thus, they don't really tell us how important the features are for new data, but rather how important the features were in deciding the "shape" of the model itself.

### TRAINING AND FITTING TODO

1. make a bar graph of the coefficient of each feature in the model. Which is the most important feature for fitting? Which is least important?
2. Linear model coefficients often have [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval). Can you describe a way you might generate this interval for each coefficient?
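
One possible approach to the interval question (among several) is the bootstrap: refit the model on many resampled copies of the training data and look at the spread of each coefficient. The sketch below assumes synthetic data with known coefficients; the 500 resamples and the 95% percentile interval are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the third feature has no real effect
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
true_coefs = np.array([3.0, -2.0, 0.0])
y = X @ true_coefs + rng.normal(scale=1.0, size=n)

model = LinearRegression().fit(X, y)
print(model.coef_)  # one coefficient per feature (column of X)

# Bootstrap: resample rows with replacement, refit, store the coefficients
n_boot = 500
boot_coefs = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot_coefs[b] = LinearRegression().fit(X[idx], y[idx]).coef_

# Percentile interval for each coefficient
lower, upper = np.percentile(boot_coefs, [2.5, 97.5], axis=0)
for i, (lo, hi) in enumerate(zip(lower, upper)):
    print(f"feature {i}: coef={model.coef_[i]:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```

The half-widths of these intervals can be passed as `yerr` to `plt.bar` to add error bars to the coefficient bar graph.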

## Model Features - predicting

The corollary to each feature's coefficient or importance value is the amount of variance that feature explains in the prediction. Remember, we have split the data into two separate sets, the training data and the testing data. The test data is never shown to the model until after the model is "fit" to the training data. This secrecy is why we are able to test the predictive power of each model. This secret or "hold out" data can be used to measure the "explained variance" of each coefficient/feature. One method of doing this is called [recursive feature elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). Essentially, the coefficients of the model are ordered by magnitude, and the feature with the smallest is removed; the model is then refit and the process repeats until only one feature is left. Scoring the model with its `score` method at each iteration shows how much predictive power each removal costs. This provides a ranking based on the predictive power of the features.

### PREDICTING TODO

1. Using the `RFE` class, calculate the explained variance of each of the features in your model.
2. Plot the scores returned for each of the combination of features from largest contributions to smallest as a line plot.
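
A sketch of one way to combine `RFE` with held-out scores, on synthetic data with made-up coefficients. `RFE` itself is used here only to produce the ranking; the scoring loop that refits on the top-ranked features is an assumption about how you might get one score per feature subset, not something `RFE` returns directly.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: feature 0 matters most, feature 3 not at all
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([4.0, 2.0, 1.0, 0.0]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Eliminate one feature per step until a single feature remains
selector = RFE(LinearRegression(), n_features_to_select=1, step=1)
selector.fit(X_train, y_train)
ranking = selector.ranking_  # 1 = last feature left, larger = removed earlier
print(ranking)

# Refit on the top-k ranked features and score each fit on the held-out data
order = np.argsort(ranking)  # feature indices, strongest first
scores = []
for k in range(1, X.shape[1] + 1):
    cols = order[:k]
    fitted = LinearRegression().fit(X_train[:, cols], y_train)
    scores.append(fitted.score(X_test[:, cols], y_test))
print(scores)
```

The gain in score from adding each feature gives a rough sense of its contribution, and `plt.plot(range(1, len(scores) + 1), scores, marker="o")` draws the line plot.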

## WRAP UP

A lot of what you have learned here is relevant to all machine learning tasks, whether it's classification, text analysis, or something more esoteric. Splitting data, training different models, and comparing the outputs is the bread and butter of any data scientist's toolbox. Great job!

## CHALLENGE

1. Can you think of a model that ISN'T a "linear model" that could be used to perform a regression? Find the `sklearn` implementation and apply it to this data.
2. Is your new model better or worse than the model you already built? How are the residuals different?
3. Can you think of different ways of splitting the data for training and testing the models? Why might they be better than the way the data is split here?
4. What other ways could there be to measure the contribution to prediction than recursive feature elimination? Can you test this on your model and compare the two methods?
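
As a starting point, here is a sketch that touches several of these challenges at once. It assumes synthetic, deliberately nonlinear data, and picks a random forest, 5-fold cross-validation, and permutation importance as examples; other models, splitting schemes, and importance measures are equally valid choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data: the target depends on the square of feature 0
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# A different way to split: 5-fold cross-validation instead of a single split
cv = KFold(n_splits=5, shuffle=True, random_state=0)
linear_scores = cross_val_score(LinearRegression(), X, y, cv=cv)
forest_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=cv
)
print("linear R^2 per fold:", linear_scores.round(2))
print("forest R^2 per fold:", forest_scores.round(2))

# A different way to measure contribution: shuffle one feature at a time in
# the held-out data and record how much the score drops
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
print("permutation importance:", result.importances_mean.round(2))
```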