**Applied Statistics**<br/>
Prof. Dr. Jan Kirenz <br/>
Hochschule der Medien Stuttgart

In [5]:
# Python set up (load modules) 
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.compat import lzip
from statsmodels.stats.outliers_influence import summary_table
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.stats.outliers_influence import OLSInfluence
from statsmodels.graphics.regressionplots import plot_leverage_resid2
import matplotlib.pyplot as plt
%matplotlib inline 
plt.style.use('ggplot') 
import seaborn as sns  
sns.set() 
from IPython.display import Image

# Application 5: Linear regression 'Auto' 

This question involves the use of simple linear regression on the **Auto data set** (see data description). 

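(a) Use the lm() function (in this Python notebook: `smf.ols()` from statsmodels) to perform a simple linear regression with **mpg** as the response and **horsepower** as the predictor. Use the summary() function (here: the `.summary()` method of the fitted model) to print the results. Comment on the output. For example: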
   1. Is there a relationship between the predictor and the response?
   2. How strong is the relationship between the predictor and the response?
   3. Is the relationship between the predictor and the response positive or negative?
   4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?

(b) Plot the response and the predictor. Display the least squares regression line.

(c) Produce some diagnostic plots (e.g. 1. Residuals vs fitted plot, 2. Normal Q-Q plot, 3. Scale-location plot, 4. Residuals vs leverage plot) to describe the linear regression fit. Comment on any problems you see with the fit. 

---

## 1 Import data

In [32]:
# Load the csv data files into pandas dataframes
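
A minimal sketch of the import step. The file name and location are not given here, so the example reads a few made-up rows from an in-memory string instead of a real file. It also assumes that missing values are coded as `?`, which is a common reason for a numeric column to be read as text; check this against the actual file.

```python
import io
import pandas as pd

# Toy stand-in for the real file; these rows are made up for illustration.
csv_text = """mpg,cylinders,horsepower,weight,name
18.0,8,130,3504,car a
25.0,4,?,2046,car b
22.0,6,95,2833,car c
"""

# With a real file, pass its path instead of the StringIO object.
# na_values="?" is an assumption about how missing values are coded.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(df.dtypes)
print(df.head())
```

Declaring the missing-value marker at import time lets pandas parse the affected column as a numeric type right away.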

## 2 Tidying data

### 2.1 Data inspection

First of all, let's take a look at the variables (columns) in the data set.

In [3]:
# show all variables in the data set

In [2]:
# show the first 5 rows (i.e. head of the DataFrame)

In [5]:
# show the length of the variable id (i.e. the number of observations)

397

In [4]:
# check for duplicates and print results (if the two numbers match, we have no duplicates)
# show the length of the variable id (i.e. the number of observations)

# count the number of unique ids


Duplicates cannot be checked reliably this way, since several distinct cars may plausibly share the same name.
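
One possible way to make the check concrete, sketched on a toy DataFrame (the rows and the column name `name` are made up for illustration): compare the number of rows with the number of unique names, and separately count rows that are identical in every column. Only the latter are strong candidates for true duplicates.

```python
import pandas as pd

# Toy data, made up for illustration: "car a" appears twice with identical values.
df = pd.DataFrame({
    "name": ["car a", "car b", "car a", "car c"],
    "mpg": [18.0, 25.0, 18.0, 22.0],
})

n_rows = len(df)                                # number of observations
n_unique_names = df["name"].nunique()           # number of distinct names
n_identical_rows = int(df.duplicated().sum())   # rows equal to an earlier row in all columns

print(n_rows, n_unique_names, n_identical_rows)
```

A mismatch between the first two numbers only shows that names repeat; `duplicated()` is the stricter test.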

In [6]:
# data overview (with meta data)


In [7]:
# change data type

#df['horsepower'] = pd.to_numeric(df['horsepower']) # produces error
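
A hedged sketch of why the conversion above can fail and one way around it. It assumes the column contains a non-numeric placeholder such as `?` (an assumption; inspect the real column first). With `errors="coerce"`, `pd.to_numeric` turns unparseable entries into `NaN` instead of raising.

```python
import pandas as pd

# Toy column, made up for illustration; "?" stands in for a missing value.
horsepower_raw = pd.Series(["130", "?", "95"], name="horsepower")

# Plain conversion raises because "?" cannot be parsed as a number.
try:
    pd.to_numeric(horsepower_raw)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# errors="coerce" replaces unparseable entries with NaN.
horsepower = pd.to_numeric(horsepower_raw, errors="coerce")

print(conversion_failed)
print(horsepower)
```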


### 2.2 Handle missing values

In [8]:
# show missing values (missing values, if present, will be displayed in yellow)


We can also check the column-wise distribution of null values:
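
A minimal sketch of the column-wise check on a toy DataFrame (values made up for illustration). Dropping incomplete rows, shown at the end, is only one possible way to handle them and is an assumption, not a prescribed step.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration, with one missing horsepower value.
df = pd.DataFrame({
    "mpg": [18.0, 25.0, 22.0],
    "horsepower": [130.0, np.nan, 95.0],
})

# Number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# One option: keep only rows with an observed horsepower value
df_complete = df.dropna(subset=["horsepower"])
print(len(df_complete))
```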

## 3 Transform data

In [9]:
# summary statistics for all numerical columns

In [10]:
# summary statistics for all categorical columns
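
The two summaries can be sketched as follows, using a toy DataFrame with made-up values. The column `origin` and its categories are hypothetical and only serve to show a categorical column.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "mpg": [18.0, 25.0, 22.0, 30.0],
    "horsepower": [130.0, 88.0, 95.0, 70.0],
    "origin": pd.Series(["a", "b", "a", "b"], dtype="category"),
})

# Numerical columns: count, mean, std, min, quartiles, max
numeric_summary = df.describe()

# Categorical columns: count, unique, top, freq
categorical_summary = df.describe(include=["category"])

print(numeric_summary)
print(categorical_summary)
```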


## 4 Visualize data

### Distribution of Variables

## 5 Model

# Task a)

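(a) Use the lm() function (in this Python notebook: `smf.ols()` from statsmodels) to perform a simple linear regression with **mpg** as the response and **horsepower** as the predictor. Use the summary() function (here: the `.summary()` method of the fitted model) to print the results.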

### Simple Linear Regression

In [13]:
# fit linear model with statsmodels.formula.api (with R-style formulas) 
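
A minimal sketch of the model fit with an R-style formula. To stay self-contained it uses synthetic data generated in the cell (the coefficients 40 and -0.15 and the noise level are arbitrary assumptions), so the numbers it prints say nothing about the Auto data. The names `df_demo` and `lm` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Auto data): a noisy negative linear relationship.
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 3, size=100)
df_demo = pd.DataFrame({"mpg": mpg, "horsepower": horsepower})

# "response ~ predictor"; the intercept is added automatically.
lm = smf.ols("mpg ~ horsepower", data=df_demo).fit()

# Coefficients, standard errors, t-tests, R-squared, F-statistic
print(lm.summary())
```

In the summary, the p-value of the `horsepower` coefficient addresses question 1, R-squared addresses question 2, and the sign of the coefficient addresses question 3.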

### Interpretation

**1. Is there a relationship between the predictor and the response?**

...

**2. How strong is the relationship between the predictor and the response?**

In [14]:
# Test relationship and strength with correlation stats.pearsonr()
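
A sketch of the correlation test on synthetic data (generated in the cell with arbitrary, assumed parameters; not the Auto data). It also illustrates that in simple linear regression the squared Pearson correlation equals the R-squared of the fit, which is computed here with `np.polyfit` purely for comparison.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the Auto data).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 3, size=100)

# Pearson correlation coefficient and two-sided p-value
r, p_value = stats.pearsonr(horsepower, mpg)

# R-squared of the least squares line, computed from its residuals
slope, intercept = np.polyfit(horsepower, mpg, 1)
residuals = mpg - (intercept + slope * horsepower)
r_squared = 1 - np.sum(residuals**2) / np.sum((mpg - mpg.mean())**2)

print(r, p_value)
print(r**2, r_squared)
```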


...

**3. Is the relationship between the predictor and the response positive or negative?**

...

**4. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?**

...
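
One way to obtain the prediction and both intervals with statsmodels, sketched on synthetic data (generated in the cell with arbitrary, assumed parameters; the printed values are not the answer for the Auto data). `get_prediction()` with `summary_frame(alpha=0.05)` returns the point prediction, the 95% confidence interval for the mean response (`mean_ci_*`) and the 95% prediction interval for a single new observation (`obs_ci_*`).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Auto data).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 3, size=100)
df_demo = pd.DataFrame({"mpg": mpg, "horsepower": horsepower})
lm = smf.ols("mpg ~ horsepower", data=df_demo).fit()

# New observation with horsepower = 98
new_data = pd.DataFrame({"horsepower": [98]})
prediction = lm.get_prediction(new_data).summary_frame(alpha=0.05)

print(prediction[["mean", "mean_ci_lower", "mean_ci_upper",
                  "obs_ci_lower", "obs_ci_upper"]])
```

The prediction interval is always wider than the confidence interval, because it also accounts for the irreducible error of an individual observation.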

---
---

# Task b)

(b) Plot the response and the predictor. Display the least squares regression line.

We use [Seaborn's lmplot](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to plot the regression line:

In [15]:
# Plot regression line with 95% confidence interval
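
A minimal sketch of the plot on synthetic data (generated in the cell with arbitrary, assumed parameters; not the Auto data). `ci=95` draws the 95% confidence band around the fitted line; the name `grid` is illustrative.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT the Auto data).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 3, size=100)
df_demo = pd.DataFrame({"mpg": mpg, "horsepower": horsepower})

# Scatter plot with least squares line and 95% confidence band
grid = sns.lmplot(x="horsepower", y="mpg", data=df_demo, ci=95,
                  scatter_kws={"alpha": 0.5},
                  line_kws={"color": "red"})
```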


---
---

# Task c)

(c) Produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

### 1) Residuals vs fitted plot

In [17]:
# replace "foo"

# fitted values
model_fitted_y = foo
# Basic plot
# x and y must be passed as keyword arguments in current seaborn versions
plot = sns.residplot(x=foo, y='mpg', data=foo, lowess=True, 
                     scatter_kws={'alpha': 0.5}, 
                     line_kws={'color': 'red', 
                               'lw': 1, 'alpha': 0.8});

plot.set_title('Residuals vs Fitted');
plot.set_xlabel('Fitted values');
plot.set_ylabel('Residuals');

Interpretation: ...

### 2) Normal Q-Q

This plot shows the standardized (z-score) residuals against the theoretical normal quantiles. Points that lie far from the diagonal line indicate departures from normality and may warrant further investigation.

In [18]:
# replace "foo"

# Use standardized residuals
# line='45' adds the 45-degree reference line
sm.qqplot(foo.get_influence().resid_studentized_internal, line='45');

Interpretation ...

### 3) Scale-Location plot

In [19]:
# replace "foo"

# Scale Location plot

# Square root of the absolute standardized residuals
sqrt_abs_resid = np.sqrt(np.abs(foo.get_influence().resid_studentized_internal))

plt.scatter(foo.fittedvalues, sqrt_abs_resid, alpha=0.5)
# x and y must be keyword arguments; ci=None switches off the confidence band
sns.regplot(x=foo.fittedvalues, y=sqrt_abs_resid, 
            scatter=False, ci=None, lowess=True,
            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});

Interpretation ...

### 4) Residuals vs leverage plot

In [20]:
# replace "foo"

fig, ax = plt.subplots(figsize=(8,6))
fig = plot_leverage_resid2(foo, ax = ax)
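
The quantities behind a leverage plot can also be inspected numerically. The sketch below uses synthetic data (generated in the cell with arbitrary, assumed parameters; not the Auto data) and the influence object of a fitted statsmodels OLS model. The cut-off of twice the average leverage is a common rule of thumb, not a fixed criterion.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Auto data).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 3, size=100)
df_demo = pd.DataFrame({"mpg": mpg, "horsepower": horsepower})
lm = smf.ols("mpg ~ horsepower", data=df_demo).fit()

influence = lm.get_influence()
leverage = influence.hat_matrix_diag                # diagonal of the hat matrix
std_resid = influence.resid_studentized_internal    # standardized residuals
cooks_d = influence.cooks_distance[0]               # Cook's distance per observation

# Rule of thumb: flag observations with more than twice the average leverage
n_params = len(lm.params)
threshold = 2 * n_params / len(df_demo)
high_leverage = np.where(leverage > threshold)[0]

print(leverage.sum())   # equals the number of estimated parameters
print(high_leverage)
```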

---
---