adjust 05_DataModels.qmd
thomasmanke committed Apr 5, 2024
1 parent 8b47aea commit 9bb5ffe
Showing 1 changed file with 29 additions and 21 deletions: qmd/05_DataModels.qmd
- selection/unselection of elements `[,-5]`
- plot() function and arguments

```{r cor_plot}
# assign species-colors to each observation
cols = iris$Species # understand how color is defined
plot(iris[,-5], col=cols, lower.panel=NULL) # "cols" was defined in task above
```
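The chunk that actually fits the linear model is collapsed in this diff. A minimal sketch so that the `fit` object used below is defined, assumed from the later "Poor Model" section (which says to just replace "Petal" with "Sepal"); the chunk label and comments are our additions:

```{r fit_petal}
# fit Petal.Width as a linear function of Petal.Length (assumed from context)
plot(Petal.Width ~ Petal.Length, data=iris, col=cols)
fit = lm(Petal.Width ~ Petal.Length, data=iris)   # "fit" is used in the tasks below
abline(fit, lwd=3, lty=2)                         # add regression line
```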

**Task**: Extract the coefficients of the fitted line.

```{r get_coef, echo=FALSE}
fit$coefficients
coef(fit)
```

There are many more methods to access information for the `lm` class:
```{r class_methods}
methods(class='lm')
```


This is a good fit as suggested by a

- small residual standard error
- a large coefficient of determination $R^2$
- a large F-statistic $\to$ small p-value
- and by visualization

Fraction of variation explained by model:
$$
R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i(y_i - y(\theta,x_i))^2}{\sum_i(y_i-\bar{y})^2}
$$
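As a sanity check, $R^2$ can also be computed directly from this formula and compared with the value reported by `summary()`. A small sketch, assuming `fit` is the Petal.Width ~ Petal.Length model from above:

```{r r2_by_hand}
RSS = sum(residuals(fit)^2)                                # residual sum of squares
TSS = sum((iris$Petal.Width - mean(iris$Petal.Width))^2)   # total sum of squares
c(1 - RSS/TSS, summary(fit)$r.squared)                     # both values should agree
```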



# Predictions (with confidence intervals)
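The code of this section is collapsed in the diff; only a `polygon()` call shading the confidence band is visible. A minimal sketch of the usual approach (predict on a grid of new values, then draw the band); variable names follow the template reused in the "Poor Model" section below:

```{r predict_ci}
x   = iris$Petal.Length                            # explanatory variable from the fit
xn  = seq(min(x), max(x), length.out = 100)        # grid of new explanatory values
ndf = data.frame(Petal.Length = xn)                # put them into a data frame
p   = predict(fit, newdata = ndf, interval = "confidence")  # columns: fit, lwr, upr

plot(Petal.Width ~ Petal.Length, data = iris, col = cols)
abline(fit, lwd = 3, lty = 2)
polygon(c(rev(xn), xn), c(rev(p[, "upr"]), p[, "lwr"]),
        col = rgb(1, 0, 0, 0.5), border = NA)      # border value assumed; original line is truncated
```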
# Poor Model

Just replace "Petal" with "Sepal"
```{r poor_model}
plot(Sepal.Width ~ Sepal.Length, data=iris, col=cols)
fit1=lm(Sepal.Width ~ Sepal.Length, data=iris)
abline(fit1, lwd=3, lty=2)
confint(fit1) # estimated slope is indistinguishable from zero
summary(fit1)
```
*Interpretation*:

- slope is not significantly distinct from 0.
- model does not account for much of the observed variation.


**Task**: Use the above template to make predictions for the new poor model.
```{r poor_model_pred, echo=FALSE}
x=iris$Sepal.Length # explanatory variable from fit (here:Sepal.Length)
xn=seq(min(x), max(x), length.out = 100) # define range of new explanatory variables
ndf=data.frame(Sepal.Length=xn) # put them into data frame
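# the remaining steps of this chunk are collapsed in the diff;
# a minimal assumed reconstruction so that "p" and the plot below exist:
p = predict(fit1, newdata = ndf, interval = "confidence")  # confidence bounds on the grid
plot(Sepal.Width ~ Sepal.Length, data = iris, col = cols)  # re-draw the scatter plot
abline(fit1, lwd = 3, lty = 2)
lines(xn, p[, "lwr"])                                      # lower confidence bound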
lines(xn, p[,"upr"] )
```
In the iris example the "Species" variable is a factor (categorical variable) with 3 levels.
Other typical examples: different experimental conditions or treatments.

```{r model_factorial}
plot(Petal.Width ~ Species, data=iris)
fit=lm(Petal.Width ~ Species, data=iris)
summary(fit)
anova(fit)
```
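For a factor explanatory variable, `lm()` encodes the levels as dummy variables (treatment contrasts by default), so the reported coefficients are differences relative to the first factor level. A small sketch to inspect this encoding (not part of the original chunk):

```{r factor_encoding}
contrasts(iris$Species)                      # default treatment contrasts
head(model.matrix(~ Species, data = iris))   # design matrix that lm() uses internally
```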
# More complicated models
Determine the residual standard error `sigma` for different fits of varying complexity.

```{r model_comp}
fit=lm(Petal.Width ~ Petal.Length, data=iris)
paste(sigma(fit), deparse(formula(fit)))
fit=lm(Petal.Width ~ Petal.Length + Sepal.Length, data=iris) # function of more than one variable
paste(sigma(fit), deparse(formula(fit)))
fit=lm(Petal.Width ~ Species, data=iris) # function of categorical variables
paste(sigma(fit), deparse(formula(fit)))
fit=lm(Petal.Width ~ . , data=iris) # function of all other variables (numerical and categorical)
paste(sigma(fit), deparse(formula(fit)))
```

... more complex models tend to have smaller residual standard error (overfitting?)
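One way to judge whether the extra complexity is warranted is to compare nested models formally, e.g. with an F-test or an information criterion. An illustrative sketch, not part of the original document (`m1`/`m2` are hypothetical names):

```{r model_anova}
m1 = lm(Petal.Width ~ Petal.Length, data = iris)  # simple model
m2 = lm(Petal.Width ~ . , data = iris)            # full model (m1 is nested in m2)
anova(m1, m2)                                     # F-test for the added terms
AIC(m1, m2)                                       # information criterion penalizing complexity
```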
Linear models $y_i=\theta_0 + \theta_1 x_i + \epsilon_i$ make certain assumptions:

* residuals $\epsilon_i$ are independent from each other (non-linear patterns?)
* residuals are normally distributed
* have equal variance $\sigma^2$ ("homoscedasticity")
* no outliers (large residuals) or observations with strong influence on fit
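These assumptions are typically checked visually; base R provides standard diagnostic plots for `lm` objects. A minimal sketch, assuming a fitted object `fit` from the chunks above:

```{r lm_diagnostics}
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(fit)              # residuals vs fitted, QQ plot, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
```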

***

# Review
* dependencies between variables can often be modeled
* linear model lm(): fitting, summary and interpretation
* correlation coefficients can be misleading
* linear models may not be appropriate $\to$ example(anscombe)
