<img src="https://github.com/CorndelDataAnalyticsDiploma/workshop/blob/master/Corndel%20Digital%20Logo%20Centre.png?raw=true" alt="Corndel" width ="301.5" align="left">


# Linear Regression Analysis


In this first section we will look at the traditional statisticians' approach to Linear Regression.
These models are built to **gain understanding about the relationships between the dependent and independent variables** (e.g. how does a change in the number of hours students study affect their exam score?).

This traditional approach involves ensuring lots of assumptions are satisfied.

We will not go into these assumptions in depth, as our focus is on building predictive models, which take a different approach.

In the second section we will build a predictive model and evaluate this against our data to determine how good a model it is.






## Simple Linear Regression

In [None]:
# Import all relevant libraries 

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

In [None]:
# enlarge seaborn plots
# sns.set_theme(rc={'figure.figsize':(11.7,8.27)})

## Step 1 EDA

🤖`*"How do I import a csv named 'fertililizer_yield_50-85.csv' into pandas?"*`

### Load in data and quick overview

<div class="alert alert-block alert-warning">

-**Exploratory Data Analysis**
    
        -- It is crucial that we explore the data before we attempt any kind of modelling




In [None]:
# Import CSV and view head
crop_yield = pd.read_csv('data/fertililizer_yield_50-85.csv')

crop_yield.head()

* fertilizer_vol: how much fertilizer was applied (in litres)
* farm_size: the area the fertilizer was applied to (in acres)
* planting density: the number of plants per acre
* yield: the amount of crop produced (per acre)

### Summary statistics and DataFrame info

In [None]:
# numerical analysis of data

crop_yield.describe().round(2)

In [None]:
# get info of dataframe

crop_yield.info()

### Scatter Plot, investigate Linearity

We will investigate whether we can build a model to understand the relationship between the amount of fertilizer that is applied to crops and the crop yield. 

🤖`*"How do I build a scatter plot in seaborn from a dataframe named crop_yield, I want fertilizer_vol on the x_axis and yield on the y-axis?"*`

<div class="alert alert-block alert-warning">
    
An AI chatbot will ask you if you would like a regression line plotted on this too. Seaborn can do this; however, we cannot extract the model parameters from it. Querying the chatbot further will direct you to use Statsmodels, SciPy or scikit-learn.

In [None]:
# Use seaborn to create a scatter plot of "fertilizer_vol" against "yield"

sns.scatterplot(data=crop_yield, x="fertilizer_vol", y="yield")
plt.show()

<div class="alert alert-block alert-warning">

### Questions?
    Can we see a linear relationship?
    
    Is it viable to proceed?
    

(We will ignore the possible presence of heteroscedasticity and move on for now. It is important for the Statisticians' approach and explainability of features, but not for predictive modelling).

# Correlation coefficient

In [None]:
# Quantify this with the correlation coefficient

crop_yield['fertilizer_vol'].corr(crop_yield["yield"]).round(2)

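As a rough illustration of what `.corr()` computes, the sketch below calculates Pearson's correlation coefficient by hand and compares it with the pandas result. The numbers and the names `fertilizer` and `crop` are made up for illustration; they are not taken from the workshop CSV.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not the workshop data)
fertilizer = pd.Series([50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0])
crop = pd.Series([4.1, 4.9, 5.2, 6.1, 6.4, 7.3, 7.6, 8.5])

# Pearson's r = sum of co-deviations / sqrt(product of summed squared deviations)
x_dev = fertilizer - fertilizer.mean()
y_dev = crop - crop.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas uses Pearson's method by default
r_pandas = fertilizer.corr(crop)

print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 or -1 indicates a strong linear relationship; a value close to 0 indicates little or no linear relationship.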
## Building the Linear Model

When you pass data directly to `sm.OLS`, statsmodels does not add an intercept; you add one yourself with `sm.add_constant`. This gives users explicit control over model specification, unlike libraries such as scikit-learn, which include an intercept by default for convenience.

Why this design choice?

Transparency and statistical clarity: statsmodels is designed for statisticians and researchers who often want precise control over the design matrix (the independent variables). Automatically adding a constant could obscure the model structure or interfere with special cases (e.g. no-intercept models, fixed effects).

In [None]:
# Build the model and print out the summary
# Prepare data

# select the independent variable
X = sm.add_constant(crop_yield['fertilizer_vol'])  # adds intercept term
# select the dependent variable
y = crop_yield['yield']

# Fit model
model = sm.OLS(y, X).fit()
model.summary()

<div class="alert alert-block alert-warning">

    -- P-value: a feature should have a p-value of less than 0.05 for us to regard it as significant in our model. A value higher than this is generally viewed as statistically insignificant, and the feature should be excluded from our explanatory variables.
    
    -- R^2 is the coefficient of determination. It represents the proportion of the variance in the target variable that is explained by the input variables.

In [None]:
# Using Matplotlib, plot the model on the same chart as the data

# set up for multiple plots
fig, ax = plt.subplots()

# Scatter plot
ax.scatter(crop_yield['fertilizer_vol'], crop_yield['yield'])
plt.xlabel('Fertilizer Volume')
plt.ylabel('Yield')


# Regression line on same axes
sm.graphics.abline_plot(model_results=model, ax=ax, color='red')

# display the plot
plt.show()

# Interpolation, Extrapolation and Outliers

Our model was built on data where between 50 and 85 litres of fertilizer was applied to the crops. 

What happens when we use a dataset that includes observations where between *1 and 104* litres of fertilizer was applied to the crops?

In [None]:
# load in the full data and display as a scatter plot

crop_yield_full = pd.read_csv('data/fertililizer_yield.csv')
sns.scatterplot(data=crop_yield_full, x="fertilizer_vol", y="yield");

In [None]:
# Using Matplotlib, plot the model on the same chart as the data

fig, ax = plt.subplots()

# Scatter plot
ax.scatter(crop_yield_full['fertilizer_vol'], crop_yield_full['yield'])
plt.xlabel('Fertilizer Volume')
plt.ylabel('Yield')

# Regression line on same axes
sm.graphics.abline_plot(model_results=model, ax=ax, color='red')

# Set limits
#ax.set(ylim=(2,11), xlim=(50,85))

plt.show()

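To put rough numbers on what happens when a model is used outside the range it was built on, the sketch below fits a straight line to the 50-85 litre portion of an invented S-shaped yield curve and compares the prediction error inside and outside that range. The curve `true_yield` is a hypothetical stand-in, not the workshop data.

```python
import numpy as np

# Hypothetical yield curve, invented for illustration: roughly flat at low
# volumes, rising in the middle, and levelling off at high volumes
def true_yield(volume):
    return 10 / (1 + np.exp(-(volume - 65) / 8))

volume_all = np.linspace(1, 104, 104)
in_range = (volume_all >= 50) & (volume_all <= 85)

# Fit a straight line using only the 50-85 litre observations
slope, intercept = np.polyfit(
    volume_all[in_range], true_yield(volume_all[in_range]), deg=1
)
predicted = intercept + slope * volume_all

# Compare the average absolute error inside and outside the fitted range
abs_error = np.abs(predicted - true_yield(volume_all))
error_inside = abs_error[in_range].mean()
error_outside = abs_error[~in_range].mean()

print(f"mean abs error inside 50-85:  {error_inside:.2f}")
print(f"mean abs error outside 50-85: {error_outside:.2f}")
```

In this made-up case the line even predicts a negative yield at very low fertilizer volumes, which is physically impossible.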
<div class="alert alert-block alert-warning">

-**Extrapolation**
    
    -- Here we can see the dangers of extrapolation.
    -- Extrapolation means drawing conclusions or making predictions about data outside the range of what we have previously seen.
    
    
    
-**Solutions**
     
    -- There are three distinct populations, roughly 0-50, 50-90 and 90+
    -- Different ways to model this are:
        -- Piecewise: separate linear models for each distinct population
        -- A non-linear regression model
    
-**Outliers**
    
    -- Do you see any outliers?
    -- What are the options for these?

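The piecewise option listed above can be sketched as follows. The data are invented for illustration (three straight-line regimes), and the breakpoints 50 and 90 are the rough values mentioned above, treated here as an assumption rather than something estimated from the data.

```python
import numpy as np

# Hypothetical data, invented for illustration: three regimes with
# different slopes, joined at the assumed breakpoints 50 and 90
volume = np.arange(1, 105, dtype=float)
crop = np.piecewise(
    volume,
    [volume < 50, (volume >= 50) & (volume < 90), volume >= 90],
    [
        lambda v: 1.0 + 0.02 * v,
        lambda v: 2.0 + 0.20 * (v - 50),
        lambda v: 10.0 - 0.10 * (v - 90),
    ],
)

# Fit one straight line per segment
segments = {
    "0-50": volume < 50,
    "50-90": (volume >= 50) & (volume < 90),
    "90+": volume >= 90,
}
fits = {}
for name, mask in segments.items():
    slope, intercept = np.polyfit(volume[mask], crop[mask], deg=1)
    fits[name] = (slope, intercept)
    print(f"{name}: yield = {intercept:.2f} + {slope:.2f} * volume")
```

Each segment gets its own slope and intercept, so a prediction should use the line for the segment that the new observation falls into.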
# Menti Quiz

# Break

# Exercise - Break out rooms (20 mins)

The csv file `who_life_exp_albania.csv` contains data about life expectancy in Albania between 2000 and 2014. It was compiled by the [World Health Organization](https://www.who.int/).

A data dictionary is available at https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

Your task is to understand the relationship between **life_expectancy** (the dependent variable) and an independent variable of your choice, by creating a simple linear regression model. 

Use the approach we have used so far:

* **load the dataset** in a notebook
* **explore the data** (summary statistics and DataFrame info)
* **investigate linearity** between the DV and potential IVs (scatter plot and calculate correlation coefficient)
* **validate that your chosen data is suitable** for linear regression
* if required, **handle outliers** (or choose another suitable IV)

When you have found an IV that is suitable for building a linear model with:

* **build a linear model**
* from the model summary, **evaluate it**.

As a group, be ready to share:

* why you think your data is suitable for linear regression
* your model's R-squared value. What does it mean?
* your IV's coefficient and p-value. What do they mean?

Ignore any statsmodels warnings you may see - this will be because there is a small number of observations in the dataset. 