# Analysis and Visualization of Complex Agro-Environmental Data
---
## Regression

### 1. Run simple linear regression

Simple linear regression is the simplest form of regression analysis. It is commonly used in Exploratory Data Analysis to explore whether a continuous response variable is related to an independent (or predictor) variable. When the aim is to model a continuous response with multiple linear regression from a large set of candidate variables, simple regression is often used as a first filter to select a subset of candidates. Again, a significant effect of a predictor on the response variable does not imply causation, whereas a causal relationship is expected to show up as an effect of the predictor on the response. Along with correlation analysis, regression is also important as a basis for establishing hypotheses to be tested with more elaborate confirmatory statistics.

Linear regression models can be run with several Python modules, such as SciPy, statsmodels and scikit-learn.
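Whichever module is used, the fitted line is the ordinary least-squares solution, which has a closed form for a single predictor. The sketch below is only an illustration of that calculation: the helper `simple_ols` and the five data points are hypothetical and are not part of the Iris example.

```python
# Closed-form ordinary least squares for one predictor.
# The helper name and the data values are illustrative assumptions.
from math import sqrt


def simple_ols(x, y):
    """Return (slope, intercept, r) for the line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # change in y per unit change in x
    intercept = mean_y - slope * mean_x  # the line passes through the means
    r = sxy / sqrt(sxx * syy)            # Pearson correlation coefficient
    return slope, intercept, r


x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
slope_demo, intercept_demo, r_demo = simple_ols(x_demo, y_demo)
print(f"slope={slope_demo:.3f}, intercept={intercept_demo:.3f}, r={r_demo:.4f}")
```

The modules used in the rest of this section return the same slope and intercept, together with the quantities needed for inference.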

##### Example with the Iris dataset

In [None]:
# import the packages we are going to be using
import numpy as np # for numerical arrays
import pandas as pd # to handle data frames
import matplotlib.pyplot as plt # for plotting
import seaborn as sns # for plotting
from scipy import stats # to compute statistics

# import data ('iris' dataset)
data = sns.load_dataset('iris')
print(data)

In [None]:
# Use the `iris` dataset to model sepal width as a function of petal width

x = data["petal_width"]  # predictor
y = data["sepal_width"]  # response

# linregress returns the slope, the intercept, the correlation coefficient r,
# the p-value for H0: slope = 0, and the standard error of the slope
slope, intercept, r, p, std_err = stats.linregress(x, y)

# plot data with fitted line
def myfunc(x):
  return intercept + slope * x # function that returns fitted values

mymodel = list(map(myfunc, x)) # apply function to each x value

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

In [None]:
# Run the regression test (H0: the slope of the relationship is 0)
print('slope estimate=%.2f, r=%.2f, p=%.6f' % (slope, r, p))
alpha = 0.05
if p <= alpha:
    print('reject H0 that the slope of the relationship is = 0')
else:
    print('fail to reject H0 that the slope of the relationship is = 0')
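The p-value used in this test comes from a two-sided t-test on the slope with n - 2 degrees of freedom. The following sketch reproduces it by hand from the slope and its standard error. The data are simulated with a fixed seed purely for illustration, and the names `x_sim`, `y_sim` and `p_manual` are assumptions, not part of the Iris example.

```python
import numpy as np
from scipy import stats

# Simulated data (illustrative only): a weak linear trend plus noise
rng = np.random.default_rng(0)
x_sim = rng.normal(size=50)
y_sim = 1.0 + 0.5 * x_sim + rng.normal(scale=1.0, size=50)

res = stats.linregress(x_sim, y_sim)

# t statistic for H0: slope = 0, with n - 2 degrees of freedom
n_obs = len(x_sim)
t_stat = res.slope / res.stderr
p_manual = 2 * stats.t.sf(abs(t_stat), df=n_obs - 2)  # two-sided p-value

print(f"t = {t_stat:.3f}")
print(f"p (by hand)    = {p_manual:.6g}")
print(f"p (linregress) = {res.pvalue:.6g}")
```

The two p-values should agree up to floating-point error, which shows that the decision against `alpha` is a decision about how far the estimated slope is from zero relative to its standard error.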

In [None]:
# Using statsmodels instead: gives an extended output with much less code
# Do not worry about the outputs that you are not yet familiar with.

import statsmodels.api as sm

x = sm.add_constant(x) # adding a constant (Intercept)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)

In [None]:
# Now using sklearn

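from sklearn import linear_model

# sklearn expects a 2D array of predictors. Select the column from `data`
# directly, because `x` now also holds the constant added for statsmodels.
regr = linear_model.LinearRegression()  # the intercept is fitted by default
regr.fit(data[["petal_width"]], data["sepal_width"])

print('Intercept:', regr.intercept_)
print('Coefficients:', regr.coef_)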
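Unlike SciPy and statsmodels, scikit-learn reports no p-values. It is geared towards prediction, and its estimators expect the predictors as a 2D array with one column per variable. The sketch below, on simulated data with illustrative names (`x_sim`, `y_sim`, `reg`), shows the reshaping step together with `score` (R²) and `predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data (illustrative only)
rng = np.random.default_rng(2)
x_sim = rng.normal(size=40)
y_sim = 3.0 + 1.5 * x_sim + rng.normal(scale=0.3, size=40)

X_sim = x_sim.reshape(-1, 1)  # shape (n_samples, 1): one predictor column
reg = LinearRegression().fit(X_sim, y_sim)

r2 = reg.score(X_sim, y_sim)                    # coefficient of determination
y_new = reg.predict(np.array([[0.0], [1.0]]))   # fitted values at x = 0 and x = 1

print(f"intercept = {reg.intercept_:.3f}, slope = {reg.coef_[0]:.3f}")
print(f"R-squared = {r2:.3f}")
print("predictions at x = 0 and x = 1:", y_new)
```

The prediction at x = 0 equals the intercept, and the difference between the two predictions equals the slope.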

In [None]:
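# same scatterplot but showing data per species
# (plot the original columns: `x` now also contains the statsmodels constant)
sns.scatterplot(x=data["petal_width"], y=data["sepal_width"], hue=data["species"])
plt.plot(data["petal_width"], mymodel)
plt.show()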

In [None]:
# Plot by group
sns.lmplot(x='petal_width', y='sepal_width', hue="species", data=data)

In [None]:
# Run the same regression but only for the species 'versicolor'

versicolor = data[data['species'] == 'versicolor']
x = versicolor['petal_width']
y = versicolor['sepal_width']

# linregress returns the slope, the intercept, the correlation coefficient r,
# the p-value for H0: slope = 0, and the standard error of the slope
slope, intercept, r, p, std_err = stats.linregress(x, y)

# plot data with fitted line
def myfunc(x):
  return intercept + slope * x # function that returns fitted values

mymodel = list(map(myfunc, x)) # apply function to each x value

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

In [None]:
# Model output table
x = sm.add_constant(x) # adding a constant (Intercept)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)
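The single-species fit can be repeated for every group in one pass, which gives the coefficients that a per-group plot only shows graphically. This is a minimal sketch on a simulated data frame: the column names `group`, `x` and `y` and the three groups are hypothetical, chosen only to mimic a data set with one grouping variable.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated data frame with three groups (illustrative only)
rng = np.random.default_rng(3)
frames = []
for name, true_slope in [("a", 0.2), ("b", 1.0), ("c", -0.5)]:
    x_g = rng.uniform(0, 3, size=30)
    y_g = 1.0 + true_slope * x_g + rng.normal(scale=0.2, size=30)
    frames.append(pd.DataFrame({"group": name, "x": x_g, "y": y_g}))
df_sim = pd.concat(frames, ignore_index=True)

# Fit one simple regression per group and collect the key values
rows = []
for name, sub in df_sim.groupby("group"):
    fit_g = stats.linregress(sub["x"], sub["y"])
    rows.append({
        "group": name,
        "slope": fit_g.slope,
        "intercept": fit_g.intercept,
        "r": fit_g.rvalue,
        "p": fit_g.pvalue,
        "n": len(sub),
    })
summary = pd.DataFrame(rows).set_index("group")
print(summary.round(3))
```

With the Iris data, the same loop would group by `species` and use `petal_width` and `sepal_width` as the two columns.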