# Lab: Regression Analysis

### Before you start:

* Read the README.md file
* Comment as much as you can and use the resources (README.md file) 

Happy learning!

## Challenge 1
I work at a coding bootcamp, and I have developed a theory that the younger my students are, the more often they are late to class. In order to test my hypothesis, I have collected some data in the following table:

| StudentID | Age | Tardies |
|-----------|-----|---------|
| 1         | 17  | 10      |
| 2         | 51  | 1       |
| 3         | 27  | 5       |
| 4         | 21  | 9       |
| 5         | 36  | 4       |
| 6         | 48  | 2       |
| 7         | 19  | 9       |
| 8         | 26  | 6       |
| 9         | 54  | 0       |
| 10        | 30  | 3       |

Use this command to create a dataframe with the data provided in the table. 
~~~~
student_data = pd.DataFrame({'X': [x_values], 'Y': [y_values]})
~~~~

In [None]:
# Your code here. 
import pandas as pd
import numpy as np
import seaborn as sns
#Paolo: adding matplotlib
import matplotlib.pyplot as plt

student_data = pd.DataFrame({'X': [17, 51, 27, 21, 36, 48, 19, 26, 54, 30], 'Y': [10, 1, 5, 9, 4, 2, 9, 6, 0, 3]})

# checking data
student_data.head()

Draw a dispersion diagram (scatter plot) for the data.

In [None]:
# Your code here.
sns.scatterplot(x="X", y="Y", data=student_data)


Do you see a trend? Can you make any hypotheses about the relationship between age and number of tardies?

In [None]:
# Your response here. 
# It does seem that older students have fewer tardies; the trend is very clear

Calculate the covariance and correlation of the variables in your plot. What is the difference between these two measures? Compare their values. What do they tell you in this case? Add your responses as comments after your code.

In [None]:
# Your response here.

student_data.cov()

student_data.corr()

# negative correlation: as X increases, Y decreases
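
On the difference between the two measures: covariance carries the units of both variables (here, years times tardies), while correlation is the covariance rescaled by the two standard deviations, so it is unitless and always lies between -1 and 1. A minimal sketch of that relationship with the table data, using plain numpy; the names `age` and `tardies` are illustrative and not part of the lab code.

```python
import numpy as np

# Data from the table above (illustrative variable names)
age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30], dtype=float)
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3], dtype=float)

# Sample covariance (ddof=1), the same convention DataFrame.cov() uses
cov_xy = np.cov(age, tardies, ddof=1)[0, 1]

# Pearson correlation = covariance / (std of X * std of Y)
corr_xy = cov_xy / (age.std(ddof=1) * tardies.std(ddof=1))

print(f"covariance:  {cov_xy:.2f}")
print(f"correlation: {corr_xy:.2f}")
```

The covariance only tells you the direction of the relationship (negative here); its size depends on the units. The correlation, at roughly -0.94, also tells you the linear relationship is strong.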

Build a regression model for this data. What will be your outcome variable? What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Your response here
import statsmodels.api as sm

model = sm.OLS(student_data['Y'],student_data['X'])
results = model.fit()
predictions = results.predict(student_data['X'])
pd.DataFrame({'observed':student_data['Y'], 'predicted':predictions}).sort_values(by='observed')

# we can still see the negative slope: as Y (observed / tardies) increases, X (prediction / age) decreases
# but here predicted (X) does not represent the age, right?
#Paolo: the prediction is the number of tardies (Y) for a student of a given age (X)
# what does predicted 5 mean for observed (tardies) 0?
#Paolo: If we take the first row, it means that the linear regression model predicts about 5 tardies while in reality the
#observed tardies are 0, so the model makes an error (for this one row) of about 5 tardies.
#The same applies to the other rows.
#Linear regression tries to model reality, but it is not perfect


In [None]:
pd.DataFrame({'observed':student_data['Y'], 'predicted':predictions})

In [None]:
results.summary()

Plot your regression model on your scatter plot.

In [None]:
#Paolo: in the summary you have only one coefficient (0.0961), which is the slope of the line. The model
# is trying to fit y=ax to your data by changing only a, and it is not very successful; see below, and also notice the poor R^2.

In [None]:
# Your code here.
# this is a regression plot, but I'm not sure how to add my prediction to the model?
sns.regplot(x="X", y="Y", data=student_data)
#Paolo: yes, regplot generates the linear model automatically, given X and Y
#Paolo: regplot fits a line with an intercept (y=ax+b), unlike the OLS model you used above (y=ax), so you can get a different result

In [None]:
#Paolo: Here is the comparison between your model (red) and the observed data (black dots)
plt.scatter(student_data['X'], student_data['Y'], color = 'black')
plt.plot(student_data['X'],predictions, '--',color='red');
#Paolo: in this case the model generated by OLS is not accurate

Interpret the results of your model. What conclusions can you draw from your model, and how confident are you in these conclusions? Can we say that age is a good predictor of tardiness? Add your responses as comments after your code.

In [None]:
# Your response here. 
# I'm not really sure how to test this, but as there is such a clear trend, I would assume that yes

# age seems to be a clear factor in tardiness, so I would use it to predict
#Paolo: in this case the generated model does not fit the data well

## Note
The issue here is that the statsmodels OLS class does not include an intercept by default. So your linear model is not y=ax+b but just y=ax: it tries, and fails, to fit the data with a single parameter, a. To fit an intercept, you have to add a constant column manually.


In [None]:
x = sm.add_constant(student_data.X) #Paolo: here you are adding a constant column to X so the model can fit an intercept
model_intercept = sm.OLS(student_data.Y,x) # Paolo: this model is y=ax+b (b is the intercept)
results_with_intercept = model_intercept.fit()

In [None]:
plt.scatter(student_data['X'], student_data['Y'], color = 'black')
plt.plot(student_data['X'],results_with_intercept.predict(x), '--',color='red');
#Paolo: better fit with additional parameter

In [None]:
# Paolo: R^2 is a good indication of how well the model fits the data
results_with_intercept.summary()

In [None]:
#Paolo: in the summary above you now have two parameters: const (about 12.89, the intercept) and the slope of the line
#(about -0.24).
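
For a single predictor, the two parameters have a closed form: the slope is cov(X, Y) / var(X), and the intercept is mean(Y) - slope * mean(X). The sketch below computes both by hand with numpy and cross-checks them against `np.polyfit`, used here as a stand-in for the statsmodels fit; the variable names are illustrative and not part of the lab code.

```python
import numpy as np

# Data from the Challenge 1 table (illustrative variable names)
age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30], dtype=float)
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3], dtype=float)

# Closed-form least-squares estimates for y = a*x + b
slope = np.cov(age, tardies, ddof=1)[0, 1] / age.var(ddof=1)
intercept = tardies.mean() - slope * age.mean()

# Cross-check with numpy's degree-1 least-squares polynomial fit
slope_np, intercept_np = np.polyfit(age, tardies, deg=1)

# R^2: share of the variance in tardies explained by the fitted line
predicted = intercept + slope * age
ss_res = np.sum((tardies - predicted) ** 2)
ss_tot = np.sum((tardies - tardies.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.4f}, intercept={intercept:.4f}, R^2={r_squared:.3f}")
```

These values should agree, up to rounding, with the const, slope, and R-squared reported in the summary of the model with an intercept.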

## Challenge 2
For the second part of this lab, we will use the vehicles.csv data set, which you can download from [here](https://drive.google.com/file/d/1EyAN0RXmAM5OLzKcxyWqdExQJ3KiswO9/view?usp=sharing). Please place the data in the data folder provided for this lab. You can also find a copy of the dataset in the GitHub folder. This dataset includes variables related to vehicle characteristics, including the model, make, and energy efficiency standards, as well as each car's CO2 emissions. As discussed in class, the goal of this exercise is to predict vehicles' CO2 emissions based on several independent variables.

In [None]:
# Import any libraries you may need 

In [None]:
# Import the data
vehicles = pd.read_csv('../data/vehicles.csv')
vehicles.head()

Let's use the following variables for our analysis: Year, Cylinders, Fuel Barrels/Year, Combined MPG, and Fuel Cost/Year. We will use 'CO2 Emission Grams/Mile' as our outcome variable. 

Calculate the correlations between each of these variables and the outcome. Which variable do you think will be the most important in determining CO2 emissions? Which provides the least amount of helpful information for determining CO2 emissions? Add your responses as comments after your code.

In [None]:
# Your response here. 
vehicles[['Year', 'Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year', 'CO2 Emission Grams/Mile']].corr()

# I would say that Fuel Barrels/Year (because of the .98 positive correlation) would be the best to determine CO2 emissions
# the variable Year seems to be the least helpful
#Paolo:yes

Build a regression model for this data. What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Your response here. 
# here a linear regression model with Fuel Barrels/Year as it had the highest correlation
x = vehicles['Fuel Barrels/Year']
y = vehicles['CO2 Emission Grams/Mile']

results = sm.OLS(y, x).fit()
predictions = results.predict(x)
lin = pd.DataFrame({'observed': y, 'predicted':predictions})
lin
#seems to be very accurate!

In [None]:
# here a multiple linear regression model without Year, as it did not seem useful
x = vehicles[['Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year']]
y = vehicles['CO2 Emission Grams/Mile']

results = sm.OLS(y, x).fit()
predictions = results.predict(x)
mul = pd.DataFrame({'observed': y, 'predicted':predictions})
mul




In [None]:
#the linear seems more accurate, we can check:
dif_lin = lin['observed']-lin['predicted']

dif_mul = mul['observed']-mul['predicted']


dif_mul.std() # has a std of 18.06
dif_lin.std() # has a std of 19.94

# because the std is smaller on the multiple regression, I think this one is more accurate

#Paolo: ok, but for now use a standard error measure like mean_squared_error to evaluate. Your way of evaluating the accuracy
# does not work in all cases; see below for an example with the root mean squared error (the definition is in the lesson)
#Paolo: in this case your calculation and the root mean squared error give the same result, but that is not always true. For example, if
# the mean of the differences in dif_lin is not close to zero, you get a different result from the root mean squared error.
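
The reason the two measures can disagree is that the standard deviation measures spread around the mean of the differences, while the root mean squared error measures distance from zero. For the population standard deviation (ddof=0), RMSE^2 = std^2 + mean^2. A minimal sketch with made-up residuals (hypothetical values, not taken from the vehicles data):

```python
import numpy as np

# Hypothetical residuals (observed - predicted) whose mean is far from zero,
# e.g. a model that systematically under-predicts by about 5
residuals = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0])

rmse = np.sqrt(np.mean(residuals ** 2))  # distance from zero
std_pop = residuals.std(ddof=0)          # spread around the mean
std_sample = residuals.std(ddof=1)       # what pandas' Series.std() computes by default
bias = residuals.mean()

print(f"mean of residuals: {bias:.3f}")
print(f"std (ddof=0):      {std_pop:.3f}")
print(f"std (ddof=1):      {std_sample:.3f}")
print(f"RMSE:              {rmse:.3f}")
print(f"std^2 + mean^2:    {std_pop ** 2 + bias ** 2:.3f}  vs  RMSE^2: {rmse ** 2:.3f}")
```

Here the standard deviation is small even though every prediction is off by several units, so it would make a biased model look accurate. The RMSE does not have that blind spot.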

In [None]:
#Paolo: example using python library for root mean squared error 
from sklearn.metrics import mean_squared_error
from math import sqrt
rms_lin = sqrt(mean_squared_error(lin['observed'], lin['predicted']))#Paolo: root mean squared error
rms_lin

Print your regression summary, and interpret the results. What are the most important variables in your model and why? What conclusions can you draw from your model, and how confident are you in these conclusions? Add your responses as comments after your code.

In [None]:
# Your response here. 
results.summary()
# did we learn how to read this in class? Because I don't know what I'm looking at
# apparently R-squared is an indicator of how well the model or regression line “fits” the data
# as our result is almost 1, I'm going to assume this is very high
#Paolo:yes R^2 is a good indication
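
One caveat when reading R^2 for a model fitted without a constant column: to my knowledge statsmodels then reports an uncentered R^2 (labelled "R-squared (uncentered)" in the summary), computed as 1 - SSR / sum(y^2) instead of 1 - SSR / sum((y - mean(y))^2). The uncentered version can be close to 1 simply because y is far from zero. A minimal sketch of the difference with synthetic data (not the vehicles data set), using plain numpy as a stand-in for the statsmodels fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y sits far from zero and does not actually depend on x
x = rng.uniform(10, 20, size=200)
y = 400 + rng.normal(0, 10, size=200)

# Least-squares fit of y = a*x (no intercept), as with OLS and no constant column
a = np.sum(x * y) / np.sum(x ** 2)
ss_res = np.sum((y - a * x) ** 2)

# Uncentered R^2 compares against predicting 0 for every row
r2_uncentered = 1 - ss_res / np.sum(y ** 2)
# Centered R^2 compares against predicting the mean of y for every row
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

In this sketch the uncentered R^2 looks excellent while the centered one is negative, meaning the model is worse than just predicting the mean. Adding a constant with sm.add_constant, as in the Challenge 1 note, makes the reported R^2 the centered one.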

## Bonus Challenge: Error Analysis

I am suspicious about the last few parties I have thrown: it seems that the more people I invite the more people are unable to attend. To know if my hunch is supported by data, I have decided to do an analysis. I have collected my data in the table below, where X is the number of people I invited, and Y is the number of people who attended. 

|  X |  Y |
|----|----|
| 1  |  1 |
| 3  |  2 |
| 4  |  4 |
| 6  |  4 |
| 8  |  5 |
| 9  |  7 |
| 11 |  8 |
| 14 |  13 |

We want to know if the relationship modeled by the two random variables is linear or not, and therefore if it is appropriate to model it with a linear regression. 
First, build a dataframe with the data. 

In [None]:
# Your code here. 

Draw a dispersion diagram (scatter plot) for the data, and fit a regression line.

In [None]:
# Your code here.

What do you see? What does this plot tell you about the likely relationship between the variables? Print the results from your regression.

In [None]:
# Your response here. 

Do you see any problematic points, or outliers, in your data? Remove these points and recalculate your regression. Print the new dispersion diagram with your new model and the results of your model. 

In [None]:
# Your response here. 

What changed? Based on the results of the two models and your graphs, what can you say about the form of the data with the problematic point and without it?

In [None]:
# Your response here. 