# Lab: Regression Analysis

### Before you start:

* Read the README.md file
* Comment as much as you can and use the resources (README.md file) 

Happy learning!

## Challenge 1
I work at a coding bootcamp, and I have developed a theory that the younger my students are, the more often they are late to class. In order to test my hypothesis, I have collected some data in the following table:

| StudentID | Age | Tardies |
|-----------|-----|---------|
| 1         | 17  | 10      |
| 2         | 51  | 1       |
| 3         | 27  | 5       |
| 4         | 21  | 9       |
| 5         | 36  | 4       |
| 6         | 48  | 2       |
| 7         | 19  | 9       |
| 8         | 26  | 6       |
| 9         | 54  | 0       |
| 10        | 30  | 3       |

In [None]:
%pip install statsmodels

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

Use this command to create a dataframe with the data provided in the table. 
~~~~
student_data = pd.DataFrame({'Age': [17,51,27,21,36,48,19,26,54,30], 'Tardies': [10,1,5,9,4,2,9,6,0,3]})
~~~~

In [None]:
student_data = pd.DataFrame({'Age': [17,51,27,21,36,48,19,26,54,30], 'Tardies': [10,1,5,9,4,2,9,6,0,3]})
student_data.head()

Draw a dispersion diagram (scatter plot) for the data.

In [None]:
# Drawing a scatter plot to visualize the relationship between Age and Tardies
sns.scatterplot(x='Age', y='Tardies', data=student_data)
plt.title('Age vs Tardies')
plt.show()

Do you see a trend? Can you make any hypotheses about the relationship between age and number of tardies?

Yes, there is a clear pattern: the points fall along a downward-sloping line, so older students tend to be late less often. My hypothesis is that age and number of tardies have a negative, roughly linear relationship.

Calculate the covariance and correlation of the variables in your plot. What is the difference between these two measures? Compare their values. What do they tell you in this case? Add your responses as comments after your code.

In [None]:
# Calculating covariance
covariance = student_data.cov().loc['Age', 'Tardies']
print(f'Covariance: {covariance}')

# Calculating correlation
correlation = student_data.corr().loc['Age', 'Tardies']
print(f'Correlation: {correlation}')

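# Explanation:
# Covariance (about -45.6) is negative: when Age goes up, Tardies tends to go down.
# Its size depends on the units (years x tardies), so it is hard to judge the strength of the relationship from it alone.
# Correlation (about -0.94) is the covariance divided by both standard deviations, so it is unitless and always lies between -1 and 1.
# A value this close to -1 indicates a strong negative linear relationship between age and tardiness.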
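
The link between the two measures can be checked by hand. The sketch below, which uses plain NumPy arrays instead of the `student_data` frame, rescales the sample covariance by both sample standard deviations and should reproduce the correlation printed above.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Sample covariance (ddof=1), expressed in units of years * tardies
cov_xy = np.cov(age, tardies, ddof=1)[0, 1]

# Dividing by both sample standard deviations removes the units
corr_xy = cov_xy / (age.std(ddof=1) * tardies.std(ddof=1))

print(f"covariance:  {cov_xy:.3f}")
print(f"correlation: {corr_xy:.3f}")
```

Rescaling Age (for example, to months) would change the covariance but leave the correlation untouched, which is why correlation is the easier number to compare across datasets.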

Build a regression model for this data. What will be your outcome variable? What type of regression are you using? Add your responses as comments after your code.

In [None]:
# The outcome variable is 'Tardies'; 'Age' is the predictor.
# With a single predictor, this is a simple linear regression (ordinary least squares).

model = smf.ols('Tardies ~ Age', data=student_data).fit()
print(model.summary())
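
For a single predictor, the coefficients in the summary can also be obtained from the closed-form least-squares formulas. The following is a minimal sketch using NumPy only; the age of 40 in the prediction is a hypothetical value chosen for illustration and is not part of the data.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Ordinary least squares with one predictor:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope = np.cov(age, tardies, ddof=1)[0, 1] / age.var(ddof=1)
intercept = tardies.mean() - slope * age.mean()

print(f"Tardies = {intercept:.3f} + ({slope:.3f}) * Age")

# Hypothetical student aged 40 (not in the data), for illustration only
predicted = intercept + slope * 40
print(f"Predicted tardies at age 40: {predicted:.2f}")
```

The slope and intercept should match the `Age` and `Intercept` rows of the statsmodels summary up to rounding.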

Plot your regression model on your scatter plot.

In [None]:
# Plotting the regression line on the scatter plot
sns.regplot(x='Age', y='Tardies', data=student_data)
plt.title('Regression Model: Age vs Tardies')
plt.show()

Interpret the results of your model. What conclusions can you draw from your model, and how confident are you in these conclusions? Can we say that age is a good predictor of tardiness? Add your responses as comments after your code.

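The model fits well: R-squared is about 0.88, so age explains roughly 88% of the variation in tardies. The Age coefficient is negative (about -0.24), meaning each additional year of age goes with roughly a quarter of a tardy less. Its p-value is well below 0.05, so the relationship is unlikely to be due to chance. Age looks like a good predictor of tardiness in this sample, but with only 10 students I would be careful about generalizing beyond it.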
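
To make the R-squared claim concrete, it can be recomputed from the residuals. This is a small sketch with `np.polyfit` standing in for the statsmodels fit; the residual standard error at the end gives a rough sense of how far a typical prediction is from the observed number of tardies.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(age, tardies, deg=1)
fitted = intercept + slope * age

# R-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum((tardies - fitted) ** 2)
ss_tot = np.sum((tardies - tardies.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Residual standard error: n - 2 degrees of freedom (slope and intercept)
residual_se = np.sqrt(ss_res / (len(age) - 2))

print(f"R-squared:               {r_squared:.3f}")
print(f"Residual standard error: {residual_se:.2f} tardies")
```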

## Challenge 2
For the second part of this lab, we will use the vehicles.csv data set. You can find a copy of the dataset in the GitHub folder. This dataset includes variables related to vehicle characteristics, including the model, make, and energy efficiency standards, as well as each car's CO2 emissions. As discussed in class, the goal of this exercise is to predict vehicles' CO2 emissions based on several independent variables.

In [None]:
# Import any libraries you may need & the data
vehicles = pd.read_csv("vehicles.csv")

Let's use the following variables for our analysis: Year, Cylinders, Fuel Barrels/Year, Combined MPG, and Fuel Cost/Year. We will use 'CO2 Emission Grams/Mile' as our outcome variable. 

Calculate the correlations between each of these variables and the outcome. Which variable do you think will be the most important in determining CO2 emissions? Which provides the least amount of helpful information for determining CO2 emissions? Add your responses as comments after your code.

In [None]:
# Calculating correlations between the selected variables and the outcome
features = ['Year', 'Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year', 'CO2 Emission Grams/Mile']
correlation_matrix = vehicles[features].corr()
print(correlation_matrix['CO2 Emission Grams/Mile'])

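# Interpretation:
# 'Fuel Barrels/Year' and 'Fuel Cost/Year' are strongly positively correlated with CO2 emissions (they rise together).
# 'Combined MPG' is negatively correlated: higher MPG goes with lower CO2 emissions.
# I think 'Fuel Barrels/Year' is the most important one, since burning fuel is what produces the CO2.
# The least helpful variable is whichever one has the correlation closest to 0 in the output above.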
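
One way to answer the "most" and "least" questions systematically is to rank the predictors by the absolute value of their correlation with the outcome. The sketch below uses synthetic data with hypothetical column names, not the vehicles dataset, purely to show the ranking step.

```python
import numpy as np

# Synthetic stand-in data (NOT the vehicles dataset); column names are hypothetical
rng = np.random.default_rng(0)
n = 200
columns = {
    "strong_driver": rng.normal(size=n),
    "inverse_driver": rng.normal(size=n),
    "weak_driver": rng.normal(size=n),
}
outcome = (3 * columns["strong_driver"]
           - 2 * columns["inverse_driver"]
           + 0.1 * columns["weak_driver"]
           + rng.normal(size=n))

# Pearson correlation of each column with the outcome
correlations = {name: np.corrcoef(values, outcome)[0, 1]
                for name, values in columns.items()}

# Rank by absolute value: the sign gives direction, the size gives strength
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name:>15}: {r:+.3f}")
```

Sorting on the absolute value matters because a strongly negative correlation (as with MPG) is just as informative as a strongly positive one.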

Build a regression model for this data. What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Building a Multiple Linear Regression model
# I am using Multiple Linear Regression because there are multiple independent variables.

# Renaming columns to remove spaces/special chars for statsmodels formula
vehicles.columns = [col.replace(' ', '_').replace('/', '_') for col in vehicles.columns]

model_vehicles = smf.ols('CO2_Emission_Grams_Mile ~ Year + Cylinders + Fuel_Barrels_Year + Combined_MPG + Fuel_Cost_Year', data=vehicles).fit()

Print your regression summary, and interpret the results. What are the most important variables in your model and why? What conclusions can you draw from your model, and how confident are you in these conclusions? Add your responses as comments after your code.

In [None]:
print(model_vehicles.summary())

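# Interpretation:
# The R-squared is very high, so the model explains almost all of the variation in CO2 emissions.
# 'Fuel_Barrels_Year' seems to matter the most.
# I'm fairly confident that how much fuel a car uses tells us a lot about its CO2 emissions.
# Caveat: Fuel_Barrels_Year, Fuel_Cost_Year and Combined_MPG all measure fuel use, so they are highly correlated
# with each other, and their individual coefficients should be read with caution.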
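
When several predictors measure nearly the same thing, a variance inflation factor (VIF) check can show how much they overlap. The following is a minimal NumPy sketch on synthetic data with hypothetical column names; it is not run on the vehicles data. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` that could be used instead.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (2-D array without an intercept column)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic stand-ins (hypothetical names): two near-duplicate columns, one unrelated
rng = np.random.default_rng(0)
n = 200
fuel_use = rng.normal(size=n)
fuel_cost = 2 * fuel_use + rng.normal(scale=0.1, size=n)
unrelated = rng.normal(size=n)

vifs = variance_inflation_factors(np.column_stack([fuel_use, fuel_cost, unrelated]))
for name, value in zip(["fuel_use", "fuel_cost", "unrelated"], vifs):
    print(f"{name:>10}: VIF = {value:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a coefficient is hard to interpret on its own, even when the model as a whole predicts well.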

## Bonus Challenge: Error Analysis

I am suspicious about the last few parties I have thrown: it seems that the more people I invite the more people are unable to attend. To know if my hunch is supported by data, I have decided to do an analysis. I have collected my data in the table below, where X is the number of people I invited, and Y is the number of people who attended. 

| X  | Y  |
|----|----|
| 1  | 1  |
| 3  | 2  |
| 4  | 4  |
| 6  | 4  |
| 8  | 5  |
| 9  | 7  |
| 11 | 8  |
| 14 | 13 |

We want to know whether the relationship between the two variables is linear, and therefore whether it is appropriate to model it with a linear regression.
First, build a dataframe with the data. 

In [None]:
party_data = pd.DataFrame({'X': [1,3,4,6,8,9,11,14], 'Y': [1,2,4,4,5,7,8,13]})
party_data

Draw a dispersion diagram (scatter plot) for the data, and fit a regression line.

In [None]:
# Scatter plot with regression line
sns.regplot(x='X', y='Y', data=party_data)
plt.title('Party Guests: Invited vs Attended')
plt.show()

model_party = smf.ols('Y ~ X', data=party_data).fit()
print(model_party.summary())

What do you see? What does this plot tell you about the likely relationship between the variables? Print the results from your regression.

Most of the points rise along a roughly straight line, so the relationship looks positive and close to linear. The last point (14, 13) sits further above the line than the others, so I suspect it is an outlier.

Do you see any problematic points, or outliers, in your data? Remove these points and recalculate your regression. Print the new dispersion diagram with your new model and the results of your model. 

Yes, the last point (14, 13) looks like an outlier: it lies above the line that the other points form, and it is also the most extreme X value, so it pulls the fitted line toward itself. I will remove it to see how the model changes.
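
Rather than judging by eye, the suspicion can be backed with two simple diagnostics: the residual of each point and its leverage (how far its X value is from the mean of X). This is a rough sketch using NumPy, with the leverage formula for simple linear regression written out explicitly.

```python
import numpy as np

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 4, 4, 5, 7, 8, 13], dtype=float)

# Fit the line on all points and compute each point's residual
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Leverage in simple regression: 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2)
n = len(x)
leverage = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

for xi, yi, ri, hi in zip(x, y, residuals, leverage):
    print(f"X={xi:>4.0f}  Y={yi:>4.0f}  residual={ri:+.2f}  leverage={hi:.2f}")
```

A point that combines a large residual with high leverage has the most influence on the fitted line, which is the case worth checking before deciding to drop an observation.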

In [None]:
# Removing the outlier (the last point where X=14)
party_data_clean = party_data[party_data['X'] < 14]

# Recalculating regression
model_party_clean = smf.ols('Y ~ X', data=party_data_clean).fit()

# Plotting the new model
sns.regplot(x='X', y='Y', data=party_data_clean)
plt.title('Party Guests (Outlier Removed)')
plt.show()

print(model_party_clean.summary())

What changed? Based on the results of the two models and your graphs, what can you say about the form of the data with the problematic point and without it?

After removing that point, the line fits the remaining points slightly better: R-squared rises from about 0.93 to about 0.94, and the slope drops from about 0.85 to about 0.68. So that single point was tilting the line upward. Without it the data look very linear, although with only 7 or 8 observations either model should be treated with caution.
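
The comparison can be reproduced side by side. The sketch below fits both lines with `np.polyfit` in place of statsmodels; `fit_line` is a helper introduced here for illustration only.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept, r_squared) of a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r_squared

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 4, 4, 5, 7, 8, 13], dtype=float)

# Same filter as in the lab: drop the point where X = 14
keep = x < 14

full = fit_line(x, y)
clean = fit_line(x[keep], y[keep])

for label, (slope, intercept, r_squared) in [("all points", full), ("without X=14", clean)]:
    print(f"{label:>13}: slope={slope:.3f}  intercept={intercept:.3f}  R^2={r_squared:.3f}")
```

The change in slope is larger than the change in R-squared, which suggests the removed point mattered mainly through its pull on the line, not through a poor overall fit.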