# Lab: Regression Analysis

## Challenge 1
I work at a coding bootcamp, and I have developed a theory that the younger my students are, the more often they are late to class. In order to test my hypothesis, I have collected some data in the following table:

| StudentID | Age | Tardies |
|--------|-----|------------|
| 1      | 17  | 10         |
| 2      | 51  | 1          |
| 3      | 27  | 5          |
| 4      | 21  | 9         |
| 5      | 36  |  4         |
| 6      | 48  |  2         |
| 7      | 19  |  9         |
| 8      | 26  | 6          |
| 9      | 54  |  0         |
| 10     | 30  |  3         |

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

Use this command to create a dataframe with the data provided in the table. 
~~~~
student_data = pd.DataFrame({'Age': [17,51,27,21,36,48,19,26,54,30], 'Tardies': [10,1,5,9,4,2,9,6,0,3]})
~~~~

In [None]:
student_data = pd.DataFrame({'Age': [17,51,27,21,36,48,19,26,54,30], 'Tardies': [10,1,5,9,4,2,9,6,0,3]})

Draw a dispersion diagram (scatter plot) for the data.

In [None]:
# Create scatter plot
plt.figure(figsize=(8, 6))  # Set the figure size
sns.scatterplot(x='Age', y='Tardies', data=student_data)

# Add labels and title
plt.xlabel('Age')
plt.ylabel('Tardies')
plt.title('Scatter Plot of Age vs. Tardies')

# Show the plot
plt.show()

Do you see a trend? Can you make any hypotheses about the relationship between age and number of tardies?

### Do you see a trend?

The scatter plot suggests an inverse trend between age and the number of tardies: as age increases, the number of tardies generally decreases.

### Can you make any hypotheses about the relationship between age and number of tardies?

- **Maturity Impact**: Older individuals might have better time management skills, leading to fewer tardies.
- **Responsibilities and Priorities**: Older individuals could have different responsibilities or priorities that reduce tardiness.
- **Environment or Routine**: Younger individuals might experience less structured routines, contributing to higher tardiness.

Calculate the covariance and correlation of the variables in your plot. What is the difference between these two measures? Compare their values. What do they tell you in this case? Add your responses as comments after your code.

In [None]:
# Calculate covariance
covariance = student_data['Age'].cov(student_data['Tardies'])

# Calculate correlation
correlation = student_data['Age'].corr(student_data['Tardies'])

# Print results
print(f"Covariance between Age and Tardies: {covariance:.4f}")
print(f"Correlation between Age and Tardies: {correlation:.4f}")


### Comments on results

- Covariance: Covariance measures the direction of the linear relationship between Age and Tardies. 

    In this case, a negative covariance indicates that as Age increases, Tardies tend to decrease. However, the magnitude of covariance depends on the units of the variables, making it hard to interpret directly.

- Correlation: Correlation standardizes this relationship, providing a value between -1 and 1. A correlation near -1 indicates a strong negative linear relationship. 

    In this case, the correlation value gives a clear indication of the strength and direction of the relationship without being affected by the variables' units.
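
The link between the two measures can be made explicit: correlation is the covariance divided by the product of the two standard deviations. The following is a minimal sketch with NumPy that reuses the lab's data; the variable names and the rescaling of Age to months are illustrative assumptions, not part of the original exercise.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Sample covariance (ddof=1 matches the pandas default for .cov())
cov_xy = np.cov(age, tardies, ddof=1)[0, 1]

# Correlation = covariance rescaled by both sample standard deviations
corr_xy = cov_xy / (age.std(ddof=1) * tardies.std(ddof=1))

# Expressing Age in months changes the covariance but not the correlation
age_months = age * 12
cov_months = np.cov(age_months, tardies, ddof=1)[0, 1]
corr_months = np.corrcoef(age_months, tardies)[0, 1]

print(f"Covariance (years):  {cov_xy:.4f}")
print(f"Covariance (months): {cov_months:.4f}")
print(f"Correlation (years):  {corr_xy:.4f}")
print(f"Correlation (months): {corr_months:.4f}")
```

The covariance scales with the unit of measurement, while the correlation stays the same, which is why the correlation is the easier of the two to interpret.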


Build a regression model for this data. What will be your outcome variable? What type of regression are you using? Add your responses as comments after your code.

In [None]:
# Regression Model
# Define independent (X) and dependent (y) variables
X = student_data[['Age']]  # Predictor variable (independent variable)
y = student_data['Tardies']  # Outcome variable (dependent variable)

# Initialize and fit the regression model
model = LinearRegression()
model.fit(X, y)

# Print regression coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_[0]}")

### Comments on the regression model

- Outcome variable: The dependent variable in this regression is 'Tardies', which we aim to predict based on 'Age'.
- Type of regression: __I'm using a simple linear regression model as there is only one independent variable (Age)__.

The coefficient indicates the rate at which Tardies decrease with Age. The following plot shows the regression line, which helps us understand the linear relationship between Age and Tardies.
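
To see where the coefficient comes from, the slope and intercept of a simple linear regression can be computed by hand. This is a sketch under the usual least-squares formulas, reusing the lab's data; `example_age` is a hypothetical value chosen only for illustration.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Least-squares slope: covariance of x and y divided by the variance of x
slope = np.cov(age, tardies, ddof=1)[0, 1] / age.var(ddof=1)

# The fitted line passes through the point of means
intercept = tardies.mean() - slope * age.mean()

# Hypothetical example: expected number of tardies for a 40-year-old student
example_age = 40
predicted = intercept + slope * example_age

print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"Predicted tardies at age {example_age}: {predicted:.2f}")
```

These values should agree with `model.coef_[0]` and `model.intercept_` from the sklearn fit, since both solve the same least-squares problem.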

Plot your regression model on your scatter plot.

In [None]:
# Predict values and add to the data for visualization
student_data['Predicted_Tardies'] = model.predict(X)

# Plot the regression line
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Tardies', data=student_data, label='Actual Data')
plt.plot(student_data['Age'], student_data['Predicted_Tardies'], color='red', label='Regression Line')
plt.xlabel('Age')
plt.ylabel('Tardies')
plt.title('Regression Model: Age vs. Tardies')
plt.legend()
plt.show()

In [None]:
import statsmodels.api as sm

# Calculate R-squared
r_squared = model.score(X, y)
print(f"R-squared: {r_squared}")

# Calculate mean squared error on the fitted data
mse = np.mean((model.predict(X) - y) ** 2)
print(f"MSE: {mse:.4f}")


# Calculate p-values using statsmodels
X_with_constant = sm.add_constant(X)  # Add constant term for intercept
ols_model = sm.OLS(y, X_with_constant).fit()
p_values = ols_model.pvalues

# Print p-values
print("P-values:")
print(p_values)

Interpret the results of your model. What conclusions can you draw from your model and how confident in these conclusions are you? Can we say that age is a good predictor of tardiness? Add your responses as comments after your code.

### Comments on results:

The scatter plot with the fitted regression line (red) shows a clear negative linear relationship: tardiness decreases as age increases. The high R-squared and the low p-values support this conclusion, although with only 10 students the estimates should be treated with some caution.

The R-squared indicates that about 88% of the variability in tardiness is explained by age. Within the scope of this dataset, age is therefore a good predictor of tardiness.

The p-value for both the constant (intercept) and the age variable is very low (below 0.05), indicating that the relationship between age and tardiness is statistically significant. In other words, the effect of age on tardiness is unlikely to have occurred by chance.
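
As a cross-check on the R-squared value, it can be recomputed from the residuals. The sketch below uses `np.polyfit` in place of sklearn, which is an assumption made only to keep the example short; it fits the same least-squares line.

```python
import numpy as np

age = np.array([17, 51, 27, 21, 36, 48, 19, 26, 54, 30])
tardies = np.array([10, 1, 5, 9, 4, 2, 9, 6, 0, 3])

# Fit a degree-1 polynomial; polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(age, tardies, 1)
fitted = intercept + slope * age

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((tardies - fitted) ** 2)
ss_tot = np.sum((tardies - tardies.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For simple linear regression, R-squared equals the squared correlation
r = np.corrcoef(age, tardies)[0, 1]

print(f"R-squared from residuals: {r_squared:.4f}")
print(f"Squared correlation:      {r ** 2:.4f}")
```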

## Challenge 2
For the second part of this lab, we will use the vehicles.csv data set. You can find a copy of the dataset in the GitHub folder. This dataset includes variables related to vehicle characteristics, including the model, make, and energy efficiency standards, as well as each car's CO2 emissions. As discussed in class, the goal of this exercise is to predict vehicles' CO2 emissions based on several independent variables. 

In [None]:
# Import any libraries you may need & the data
vehicles = pd.read_csv("../vehicles.csv")
vehicles.head()

Let's use the following variables for our analysis: Year, Cylinders, Fuel Barrels/Year, Combined MPG, and Fuel Cost/Year. We will use 'CO2 Emission Grams/Mile' as our outcome variable. 

Calculate the correlations between each of these variables and the outcome. Which variable do you think will be the most important in determining CO2 emissions? Which provides the least amount of helpful information for determining CO2 emissions? Add your responses as comments after your code.

In [None]:
# Select the relevant variables
variables = ['Year', 'Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year', 'CO2 Emission Grams/Mile']
df = vehicles[variables]

# Calculate the correlations between each variable and the outcome variable
outcome = 'CO2 Emission Grams/Mile'
predictors = ['Year', 'Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year']

correlations = {}
for predictor in predictors:
    corr = df[predictor].corr(df[outcome])
    correlations[predictor] = corr

# Print the correlation coefficients
print("Correlation coefficients with CO2 Emission Grams/Mile:")
for predictor in correlations:
    print(f"  o {predictor}: {correlations[predictor]:.4f}")

### Comments on results:
- Based on the correlation coefficients, 'Fuel Barrels/Year' has the highest positive correlation with CO2 Emission Grams/Mile, so it is likely the most informative single variable for predicting CO2 emissions.

- 'Year' has the weakest correlation (smallest absolute value) with CO2 Emission Grams/Mile, suggesting it provides the least amount of helpful information for determining CO2 emissions.
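
When ranking predictors by correlation, the sign only gives the direction; the strength is the absolute value. The sketch below illustrates this on synthetic data (not the vehicles dataset); every variable name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic outcome and three hypothetical predictors
outcome = rng.normal(size=n)
predictors = {
    "strong_negative": -outcome + rng.normal(scale=0.3, size=n),
    "moderate_positive": outcome + rng.normal(scale=1.5, size=n),
    "unrelated": rng.normal(size=n),
}

# Pearson correlation of each predictor with the outcome
correlations = {
    name: np.corrcoef(values, outcome)[0, 1]
    for name, values in predictors.items()
}

# Rank by absolute value so strong negative correlations are not overlooked
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, corr in ranked:
    print(f"{name}: {corr:.4f}")
```

Sorting the raw values instead of the absolute values would place the strongly negative predictor last, even though it carries the most information about the outcome.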

Build a regression model for this data. What type of regression are you using? Add your responses as comments after your code.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define predictors (X) and outcome (y)
X = df[['Year', 'Cylinders', 'Fuel Barrels/Year', 'Combined MPG', 'Fuel Cost/Year']]
y = df['CO2 Emission Grams/Mile']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

### Comments:
1. The model is a multiple linear regression: it assumes a linear relationship between the outcome variable and several predictors (independent variables) at once.
2. The LinearRegression model from sklearn supports multiple predictors without any additional modifications. It assigns a coefficient to each predictor, which represents the expected change in the outcome per unit change in that predictor while holding the other predictors constant.
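
To make "one coefficient per predictor" concrete, here is a minimal sketch of multiple linear regression solved directly with least squares. The data are synthetic, and the generating equation `y = 3 + 2*x1 - 1*x2` is an assumption chosen so the recovered coefficients can be checked.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two hypothetical predictors
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed data-generating process: y = 3 + 2*x1 - 1*x2 + small noise
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares solution
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2 = coef

print(f"Intercept: {intercept:.3f}")
print(f"Coefficient for x1: {b1:.3f}")
print(f"Coefficient for x2: {b2:.3f}")
```

The column of ones plays the same role as the intercept that `LinearRegression` fits by default.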

Print your regression summary, and interpret the results. What are the most important variables in your model and why? What conclusions can you draw from your model and how confident in these conclusions are you? Add your responses as comments after your code.

In [None]:
# Make predictions with the sklearn LinearRegression Model
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.4f}")

# Print the coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
print("\nCoefficients:")
print(coefficients)

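# Fit the regression model using statsmodels (statsmodels.api).
# add_constant adds an intercept term, matching sklearn's LinearRegression default.
import statsmodels.api as sm
model_ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()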

# Print the regression summary
print(model_ols.summary())

# Extract the p-values
p_values = model_ols.pvalues
print("\nP-values for the predictors:")
print(p_values)

# Extract the coefficients
coefficients = model_ols.params
print("\nCoefficients of the predictors:")
print(coefficients)

### Interpretation of the results
1. The regression summary provides coefficients, p-values, and R-squared values.
2. The most important variables in the model can be identified by looking at the p-values and coefficients:
    - Variables with smaller p-values (typically < 0.05) are statistically significant.
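    - Larger coefficients (in absolute value) indicate a stronger impact on the dependent variable, but only when the predictors are on comparable scales; otherwise they should be standardized before comparing.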
3. Based on the coefficients:
    - 'Fuel Barrels/Year' is likely the most important variable because it has a large positive coefficient, indicating a strong impact on CO2 emissions.
    - 'Combined MPG' has a significant negative coefficient, meaning higher MPG reduces CO2 emissions.
4. The 'Year' variable may have the least importance if its p-value is high, suggesting it does not contribute significantly to the model.
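
Comparing coefficient sizes is only meaningful when the predictors share a scale. The sketch below shows, on synthetic data with hypothetical names, how raw coefficients can point to one predictor while standardized coefficients point to the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical predictors on very different scales
small_scale = rng.normal(scale=1.0, size=n)
large_scale = rng.normal(scale=1000.0, size=n)

# Assumed data-generating process for illustration
y = 5.0 * small_scale + 0.02 * large_scale + rng.normal(scale=1.0, size=n)


def fit_slopes(X, y):
    """Return the least-squares slopes (intercept fitted but dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]


X = np.column_stack([small_scale, large_scale])
raw = fit_slopes(X, y)

# Standardize each predictor to mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = fit_slopes(Z, y)

print("Raw coefficients:         ", np.round(raw, 3))
print("Standardized coefficients:", np.round(standardized, 3))
```

The raw coefficient of `large_scale` is tiny only because its unit is small relative to its spread; after standardizing, it turns out to have the larger effect per standard deviation.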

### Conclusions from the model:
- The model explains a portion of the variability in CO2 emissions (as indicated by the R-squared value).
- Fuel consumption metrics ('Fuel Barrels/Year' and 'Combined MPG') are key drivers of CO2 emissions.
- Variables like 'Year' may add little explanatory power to the model.

### Confidence in conclusions:
- Confidence depends on the p-values and overall R-squared.
- If the p-values for key variables are low and the R-squared is high, we can be confident in the model's conclusions.
- However, external factors not included in the model may also affect CO2 emissions, so the model's predictions have limitations.


## Bonus Challenge: Error Analysis

I am suspicious about the last few parties I have thrown: it seems that the more people I invite the more people are unable to attend. To know if my hunch is supported by data, I have decided to do an analysis. I have collected my data in the table below, where X is the number of people I invited, and Y is the number of people who attended. 

|  X |  Y |
|----|----|
| 1  |  1 |
| 3  |  2 |
| 4  |  4 |
| 6  |  4 |
| 8  |  5 |
| 9  |  7 |
| 11 |  8 |
| 14 |  13 |

We want to know if the relationship modeled by the two random variables is linear or not, and therefore if it is appropriate to model it with a linear regression. 
First, build a dataframe with the data. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Data
X = np.array([1, 3, 4, 6, 8, 9, 11, 14])  # Number of people invited
Y = np.array([1, 2, 4, 4, 5, 7, 8, 13])  # Number of people attended

# Calculate correlation coefficient
correlation_coefficient = np.corrcoef(X, Y)[0, 1]

# Perform linear regression
slope, intercept, r_value, p_value, std_err = linregress(X, Y)

# Generate regression line
regression_line = slope * X + intercept

# Print results
print("Correlation Coefficient (r):", correlation_coefficient)
print("Linear Regression Equation: Y = {:.2f}X + {:.2f}".format(slope, intercept))
print("Coefficient of Determination (r^2):", r_value**2)
print("P-value:", p_value)
print("Standard Error:", std_err)

Draw a dispersion diagram (scatter plot) for the data, and fit a regression line.

In [None]:
# Plot the data and regression line
plt.scatter(X, Y, label="Data points", color="blue")
plt.plot(X, regression_line, color='red', label=f"Regression line: Y = {slope:.2f}X + {intercept:.2f}")
plt.xlabel("Number of People Invited (X)")
plt.ylabel("Number of People Attended (Y)")
plt.title("Analysis of Invitations vs Attendance")
plt.legend()
plt.grid(True)
plt.show()

What do you see? What does this plot tell you about the likely relationship between the variables? Print the results from your regression.

Do you see any problematic points, or outliers, in your data? Remove these points and recalculate your regression. Print the new dispersion diagram with your new model and the results of your model. 

### What do I see?
- The data points (blue dots) represent the number of people invited (X) versus the number of people who attended (Y).
- A red regression line models the relationship between X and Y, with the equation Y = 0.85X - 0.44.

### What does the plot tell about the relationship?
- The plot suggests a positive linear relationship: as the number of people invited increases, the number of attendees also increases.
- The slope of 0.85 indicates that, on average, for every additional person invited, 0.85 people attend.
- The relationship is quite strong, as shown by a high correlation coefficient.

### Regression Results:
- Correlation Coefficient (r): 0.965 (strong positive correlation).
- Regression Equation: Y=0.85X − 0.44.
- Coefficient of Determination (R²): 0.932 (93.2% of the variance in Y is explained by X).
- P-value: Statistically significant (exact value can be derived from the code).

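This analysis does not support the idea that inviting more people lowers attendance: more invitations generally result in more attendees. Because the slope is below 1, however, the number of people who do not attend also grows with the guest list, by roughly 0.15 per additional invitation.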
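
One way to test the original hunch more directly is to regress the number of absences (invited minus attended) on the number of invitations. This is a sketch on the lab's data; the `absences` variable is an addition for illustration and not part of the original exercise.

```python
import numpy as np

X = np.array([1, 3, 4, 6, 8, 9, 11, 14])  # Number of people invited
Y = np.array([1, 2, 4, 4, 5, 7, 8, 13])   # Number of people who attended

# People who were invited but did not attend
absences = X - Y

# Fit both lines; polyfit returns the slope first, then the intercept
slope_att, intercept_att = np.polyfit(X, Y, 1)
slope_abs, intercept_abs = np.polyfit(X, absences, 1)

print(f"Attendance per extra invitation: {slope_att:.4f}")
print(f"Absences per extra invitation:   {slope_abs:.4f}")
```

Since absences are defined as X minus Y, the two slopes always sum to 1, so an attendance slope below 1 implies that absences rise with the number of invitations.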

In [None]:
# Specify points to remove (index-based)
indices_to_remove = [1, 7]  # Removes the points (3, 2) and (14, 13); adjust to exclude other points
X_filtered = np.delete(X, indices_to_remove)
Y_filtered = np.delete(Y, indices_to_remove)

# Recalculate regression with filtered data
slope, intercept, r_value, p_value, std_err = linregress(X_filtered, Y_filtered)
correlation_coefficient = np.corrcoef(X_filtered, Y_filtered)[0, 1]

# Generate regression line
regression_line = slope * X_filtered + intercept

# Print updated results
print("Updated Regression Model:")
print(f"Correlation Coefficient (r): {correlation_coefficient:.3f}")
print(f"Linear Regression Equation: Y = {slope:.2f}X + {intercept:.2f}")
print(f"Coefficient of Determination (r^2): {r_value**2:.3f}")
print(f"P-value: {p_value}")
print(f"Standard Error: {std_err:.3f}")

# Plot filtered data and new regression line
plt.scatter(X_filtered, Y_filtered, label="Filtered Data points", color="blue")
plt.plot(X_filtered, regression_line, color='red', label=f"Regression line: Y = {slope:.2f}X + {intercept:.2f}")
plt.xlabel("Number of People Invited (X)")
plt.ylabel("Number of People Attended (Y)")
plt.title("Updated Analysis of Invitations vs Attendance")
plt.legend()
plt.grid(True)
plt.show()


What changed? Based on the results of the two models and your graphs, what can you say about the form of the data with the problematic point and without it?

### What changed between the two models?
- Slope of the Regression Line:
    - Original: 0.85, suggesting a steeper relationship between the number of people invited and attendees.
    - Updated: 0.66, indicating a less steep relationship.
- Intercept:
    - Original: −0.44, with attendees starting at a lower baseline.
    - Updated: 0.52, with attendees starting at a higher baseline when fewer people are invited.
- Coefficient of Determination (R²): 
    - Original: 0.932, indicating that 93.2% of the variability in attendance was explained by the invitations.
    - Updated: 0.937, slightly higher, showing that the updated model fits the filtered data slightly better.

### Impact of Removing Points:
- Removing the two points flattened the regression line: each additional invitation is now associated with about 0.66 additional attendees instead of 0.85. The quality of the fit (R²) barely changed.
- The remaining data points follow a more consistent trend.

### With vs. Without the Problematic Points:
- With Problematic Points: The data showed a slightly steeper trend, but certain outliers or influential points might have skewed the slope and intercept.
- Without Problematic Points: The data now aligns slightly more closely with a linear trend, leading to a model that better represents the majority of the data points.

### What this tells us about the form of the data:
- With the problematic points, the relationship appeared steeper, but the slope might not have been representative of the overall trend.
- Removing these points reveals a more moderate but consistent linear relationship between the number of invitations and attendees.
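
A systematic way to find influential points is to refit the line with each point left out in turn and see how much the slope moves. The following leave-one-out sketch uses the lab's data; it is one possible diagnostic, not the method prescribed by the exercise.

```python
import numpy as np

X = np.array([1, 3, 4, 6, 8, 9, 11, 14])  # Number of people invited
Y = np.array([1, 2, 4, 4, 5, 7, 8, 13])   # Number of people who attended

# Slope of the model fitted on all points
full_slope, full_intercept = np.polyfit(X, Y, 1)

# Refit with each point removed and record the resulting slope
loo_slopes = []
for i in range(len(X)):
    x_i = np.delete(X, i)
    y_i = np.delete(Y, i)
    slope_i, _ = np.polyfit(x_i, y_i, 1)
    loo_slopes.append(slope_i)
    print(f"Without ({X[i]}, {Y[i]}): slope = {slope_i:.4f} "
          f"(change = {slope_i - full_slope:+.4f})")
```

Points whose removal shifts the slope the most are the ones worth inspecting before deciding whether to exclude them.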