# Lab 3 Practice: Introduction to Simple Linear Regression

## Reminder - working with notebooks

#### 1) It is important to save your work, exit the notebook, and log out of syzygy whenever you are finished working on the notebook for that session. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in to resume working on the notebook.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**


#### 2) When you resume your work on a notebook, your previous work/output may still be displayed, but none of your previous work is maintained in memory accessible by the notebook. In particular, you will need to load the dataset again in order to continue working with the data. One easy way to refresh your notebook is to go to the notebook cell where you left off and do the following.

- **Select Kernel > Restart Kernel and Run up to Selected Cell.**
#### This will run all of the code in your notebook up to the selected cell.

## Objectives

* Relationship between two numerical variables
* Graphical summary: scatterplot
* Numerical summary: correlation
* Modelling linear relationship: linear regression

### Research Question

How well do SAT scores or high school GPA predict student success at the postsecondary level?  

## University/College Admission Decisions

Universities and colleges in Canada generally rely upon high school grades when making admission decisions regarding high school graduates. In the U.S., the use of standardized test scores when making admission decisions is quite common. The SAT (Scholastic Assessment Test) and ACT (American College Testing) are the most common entrance exams used by U.S. universities and colleges. Considerable research has been undertaken into the ability of these standardized tests and high school grades to predict success at university. In what follows, we will study the relationship of both SAT scores and high school GPA with first year GPA at university/college. 

**Note: SAT has a verbal component and a math component. The maximum score for each of these is 800, which means the maximum total score is 1600. As you will see, our dataset records the verbal SAT percentile and math SAT percentile for each student instead of the raw score. The maximum percentile on each component would be 100, making 200 the maximum total of the verbal and math percentiles.**

## Data Information:

### Data Set:

This dataset contains SAT and GPA data for 1000 students at an unnamed college.


#### Name: #### 
* `satgpa` - SAT and GPA data for 1000 students at an unnamed college

#### Variables: ####
* `sex` - Gender of the student.
* `sat_v` - Verbal SAT percentile.
* `sat_m` - Math SAT percentile.
* `sat_sum` - Total of verbal and math SAT percentiles.
* `hs_gpa` - High school grade point average.
* `fy_gpa` - First year (college) grade point average.

## Load Data: 

In [None]:
source("http://www.openintro.org/data/R/satgpa.R")

The `source` function is used to import the dataset that will be used in the tutorial. The data that is available to you is called `satgpa`.

We may wish to begin exploring the `satgpa` dataset with some combination of the `names`, `dim`, `str`, and `head` commands.

`names` lists the names of the variables in the dataset. 

In [None]:
names(satgpa)

`dim` reports the dimension, or size, of the dataset measured as the number of rows (observations) and number of columns (generally the number of variables).

In [None]:
dim(satgpa)

`str` reports the structure of the dataset. It lists the variables in the dataset, identifies the type of each variable, and provides a glimpse of some of the data recorded for each variable.

In [None]:
str(satgpa)

`head` reports the first few rows of data in the dataset. The default is the first six rows, but the `n = k` argument may be included to specify the number of rows to display.

In [None]:
head(satgpa)
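As a small sketch of the `n = k` argument and of how these exploration commands fit together, the cell below builds a tiny made-up data frame called `toy` (a hypothetical name, not part of the lab data) so that it runs without loading anything.

```r
# A tiny made-up data frame, used only to illustrate the commands
toy <- data.frame(
  id    = 1:8,
  score = c(55, 61, 70, 48, 90, 77, 66, 83),
  group = c("a", "b", "a", "b", "a", "b", "a", "b")
)

names(toy)        # variable names
dim(toy)          # number of rows, then number of columns
str(toy)          # type of each variable and a glimpse of the values
head(toy, n = 3)  # first three rows instead of the default six
```

With the lab data loaded, the corresponding call would be `head(satgpa, n = 3)`.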

### Graphical Summary

Before investigating the relationship between total SAT score and first year GPA, or between high school GPA and first year GPA, it is a good idea to review the individual variables. One common way to visualize a single numerical variable is with **histograms** (`hist`). 

#### Histogram

We can begin by looking at the histogram, `hist`, for the variable `fy_gpa` with the command

In [None]:
hist(satgpa$fy_gpa, xlab = "GPA", main = "First Year GPA")

How might one describe the distribution of first year GPA?

First year GPA appears to be unimodal and slightly left skewed. Typical first year GPA is between 2 and 3, and almost all GPAs are between 1 and 4 (or 0 and 4). There do not appear to be any unusual observations. 

Next, we can look at the total verbal and math SAT percentiles, `sat_sum`.

In [None]:
hist(satgpa$sat_sum, xlab = "Total Percentile", main = "Total Verbal and Math SAT Percentile")

How might one describe the distribution of total verbal and math SAT percentiles?

The distribution of total verbal and math SAT percentiles appears to be unimodal and symmetric (bell-shaped). A typical total percentile is around 100, and almost all total percentiles are between 60 and 140 (or 70 and 130, or 50 and 150). There do not appear to be any unusual observations. 

#### Question: Construct a histogram for the high school GPA, `hs_gpa`, of the subjects in the study.

<details>

<summary><b>Click to view sample code:</b></summary>

```
hist(satgpa$hs_gpa, xlab = "GPA", main = "High School GPA")
```

</details>

#### Question: Describe the distribution of high school GPA.

#### Answer

<details>

<summary><b>Sample Answer:</b></summary>

The distribution of high school GPA may be bimodal, with one mode concentrated around a GPA of 3.5-4.0 and another mode concentrated around 2.75-3.0. 

**Note 1: If the distribution is actually bimodal, what might explain the two modes? Perhaps we are observing two distinct groups of students: one that gained admission primarily on the basis of a high high school GPA, and another whose high school GPA may not be as high but who perhaps performed well on one of the standardized tests. Alternatively, perhaps the mode at the lower GPA reflects a tendency to bump students sitting at C or C+ to a B-.**

**Note 2: Why doesn't the distribution of high school GPA look more like the distribution of first year GPA? If we considered high school GPA for all students in their final year of high school, it would probably be more similar: bell-shaped, centred around 2-3, and ranging from 0-4. However, we are only looking at those students who went on to university/college, so it is reasonable that GPA for these students will tend to be concentrated at higher values. Once these students are in university/college, there is nothing to prevent grades from exhibiting a more bell-shaped distribution ranging from 0-4.**

</details>
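Whether a histogram looks bimodal can depend on how many bins are used. The sketch below illustrates the optional `breaks` argument of `hist`, which suggests a number of bins (R treats it as a suggestion and chooses nearby round break points). It uses `faithful`, a dataset that ships with R, as a stand-in so that the cell runs on its own; it is not the lab data.

```r
# `faithful` ships with R: eruption durations (in minutes) of the Old Faithful geyser.
# It is used here only as a convenient stand-in for the lab data.
hist(faithful$eruptions, breaks = 5,
     xlab = "Eruption duration (min)", main = "Roughly 5 bins")
hist(faithful$eruptions, breaks = 20,
     xlab = "Eruption duration (min)", main = "Roughly 20 bins")
```

Trying something similar with `hist(satgpa$hs_gpa, breaks = 20)` is one way to judge whether the second mode persists when the bins change.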

#### Scatterplots

**Scatterplots** are useful for exploring the **relationship** between **two numerical variables**. 

The `plot` function may be used to create a scatterplot, by including the two quantitative variables of interest as arguments.  

Construct a scatterplot to display the relationship between first year GPA, `fy_gpa`, and total verbal and math SAT percentiles, using the variable `sat_sum` as the predictor (explanatory) variable and the variable `fy_gpa` as the response variable.  

There are two ways to do this using the `plot` function.

One way is to use `x = ` and `y = ` to identify the explanatory variable and response variable. 

In [None]:
plot(x = satgpa$sat_sum, y = satgpa$fy_gpa)

Alternatively, one may use `y ~ x` notation to state the relationship between the two variables of interest. For example:

In [None]:
plot(satgpa$fy_gpa ~ satgpa$sat_sum)

The `plot` function only requires the data to be graphed, but there are some optional arguments that may be used to customize the plot. Some of these optional arguments include:

* `xlab` - specify the label for the x-axis, e.g. `xlab = "x-axis label"`
* `xlim` - specify the minimum and maximum value for the x-axis, e.g. `xlim = c(minimum, maximum)`
* `ylab` - specify the label for the y-axis, e.g. `ylab = "y-axis label"`
* `ylim` - specify the minimum and maximum value for the y-axis, e.g. `ylim = c(minimum, maximum)`
* `main` - specify a main title for the graph, e.g. `main = "Main Title for the Graph"`

Not all of these options will be relevant in each situation. There are also more optional arguments that we will explore in coming weeks.

In [None]:
plot(x = satgpa$sat_sum, y = satgpa$fy_gpa, xlab = "Total Verbal and Math SAT Percentiles", ylab = "First Year GPA")
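The remaining optional arguments (`xlim`, `ylim`, `main`) can be combined in the same way. The following is a minimal sketch that also shows the formula notation paired with a `data` argument, which saves repeating the data frame name. It uses `cars`, a small dataset that ships with R, as a stand-in for `satgpa`; the axis limits are arbitrary choices made for illustration.

```r
# `cars` ships with R (speed in mph, stopping distance in ft);
# it stands in for satgpa so the example runs without a download.
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     xlim = c(0, 30), ylim = c(0, 130),   # limits chosen to cover all points
     main = "Stopping Distance versus Speed")
```

With the lab data loaded, the analogous call would be `plot(fy_gpa ~ sat_sum, data = satgpa)`.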

#### Question: <br>-Does there appear to be a relationship between total verbal and math SAT percentiles and first year GPA? <br>-If there does appear to be a relationship, does the relationship appear to be positive or negative? <br>-Does the relationship look linear? <br>-How strong is the relationship? 

#### Answer: 

<details>

<summary><b>Sample Answer:</b></summary>

*There does appear to be a relationship between total verbal and math SAT percentiles and first year GPA. In particular, as the total SAT percentiles increase, first year GPA also tends to increase. This represents a positive relationship. The relationship appears to be roughly linear and moderately strong. There are a few unusual observations.*

</details>

#### Question: If you knew a student's total verbal and math SAT percentiles, would you be comfortable using a linear model to predict first year GPA for that student? 

#### Answer:

<details>

<summary><b>Sample Answer:</b></summary>

*Since the relationship seems roughly linear, I would feel comfortable using a linear model to predict first year GPA using total verbal and math SAT percentiles. However, because the data are so spread out, the linear relationship between total SAT percentiles and first year GPA is not very strong. Hence, I would not expect my predictions to be especially accurate.*


</details>

#### Question: Construct a scatterplot to display the relationship between high school GPA, `hs_gpa`, and first year GPA, `fy_gpa`, using the variable `hs_gpa` as the predictor (explanatory) variable and the variable `fy_gpa` as the response variable.  

<details>

<summary><b>Click to view sample code:</b></summary>

<br>

Option 1
```
plot(x = satgpa$hs_gpa, y = satgpa$fy_gpa, xlab = "High School GPA", ylab = "First Year GPA")
```

<br>

Option 2
```
plot(satgpa$fy_gpa ~ satgpa$hs_gpa, xlab = "High School GPA", ylab = "First Year GPA")
```

</details>

### Numerical Summary

When summarizing a single quantitative variable, we used numerical quantities such as the mean and standard deviation. It would be useful to be able to summarize the relationship between two quantitative variables. If the relationship looks linear, we can describe the relationship by quantifying the strength of the linear relationship, and by choosing the line that best fits, or summarizes, the relationship.

### Correlation

The scatterplot provides a visual indication of how closely the points follow a straight line. The **correlation coefficient** is a numerical summary used to quantify the **strength of the linear relationship** between two quantitative variables. Correlation ranges from -1, for variables with a perfect negative linear relationship, to +1, for variables with a perfect positive linear relationship. A correlation of 0 represents no linear relationship between two numerical variables. The `cor` function is used to compute it: `cor(x, y)` calculates the correlation between two quantitative variables `x` and `y`.

In [None]:
cor(satgpa$sat_sum, satgpa$fy_gpa)
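To make the two extremes concrete, here is a small sketch with made-up vectors (not taken from `satgpa`). It also checks one standard way of computing the correlation by hand: add up the products of the z-scores of `x` and `y`, then divide by n - 1.

```r
# Toy vectors, invented purely for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

cor(x, 2 * x + 1)    # points lie exactly on an increasing line
cor(x, 10 - 3 * x)   # points lie exactly on a decreasing line

# Correlation by hand: sum of products of z-scores, divided by n - 1
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_by_hand <- sum(z_x * z_y) / (length(x) - 1)

all.equal(r_by_hand, cor(x, y))   # TRUE
```

The first two calls return 1 and -1 (up to rounding), matching the endpoints of the range described above.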

#### Question: Calculate the correlation between `hs_gpa` and `fy_gpa`.

<details>

<summary><b>Click to view sample code:</b></summary>


```
cor(satgpa$hs_gpa, satgpa$fy_gpa)
```

</details>

### The linear model

Choosing a line that summarizes the relationship between two quantitative variables means choosing an **intercept, $b_0$**, and a **slope, $b_1$**, such that for each value of the explanatory variable, $x$, we can predict the value of the response variable, $\hat{y}$.

$$\hat{y} = b_0 + b_1\cdot x$$

To find the line that **best** summarizes the relationship, we need to consider the difference between the observed values and the values predicted by the line.

$$ e_i = y_i - \hat{y}_i$$

The $e_i$'s represent the residuals, or prediction errors, associated with the proposed linear relationship.  

The most common way to do linear regression is to select the line, that is, to select the intercept, $b_0$, and slope, $b_1$, that minimizes the sum of squared residuals.

$$\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

It is rather cumbersome to find the correct **least squares line**, i.e. the line that minimizes the sum of squared residuals, by hand, whether through trial and error or through the use of calculus. 
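For reference, the calculus approach leads to the well-known closed-form estimates $b_1 = r \cdot s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$, where $r$ is the correlation and $s_x$, $s_y$ are the standard deviations. The sketch below, on made-up toy data, computes these and checks that nudging the line in either direction increases the sum of squared residuals. The helper name `ssr` is hypothetical, introduced only for this illustration.

```r
# Toy data, invented purely for illustration (not from satgpa)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates
b1 <- cor(x, y) * sd(y) / sd(x)   # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Sum of squared residuals for any candidate line
ssr <- function(intercept, slope) {
  y_hat <- intercept + slope * x
  sum((y - y_hat)^2)
}

ssr(b0, b1)         # the least squares line
ssr(b0, b1 + 0.1)   # a slightly steeper line: larger sum of squares
ssr(b0 + 0.2, b1)   # the same line shifted upward: larger sum of squares
```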

Instead we can use the `lm` function in R to fit the linear model (find the least squares regression line).

In [None]:
regress1 = lm(fy_gpa ~ sat_sum, data = satgpa)

* The first argument of the `lm` function is a formula that takes the form `y ~ x`. Here it can be read as saying that we want to make a linear model of `fy_gpa` as a function of `sat_sum`.  
  

* The second argument specifies that R should look in the `satgpa` data frame to find the `fy_gpa` and `sat_sum` variables.
  
  
* The output of the `lm` function is being saved to a new object called `regress1` here, that contains information about the linear model that was just fit. We can access this information using the `summary` function.

In [None]:
summary(regress1)

Consider this output piece by piece. 

* First, the formula used to describe the model is shown at the top. 
  
  
* After the formula you find the five-number summary of the residuals.  
  
  
* The `Coefficients` table shown next is key; its first column displays the linear model's *y*-intercept and the coefficient of the explanatory variable, `sat_sum`. With this table, we can write down the least squares regression line for the linear model:

$$\hat{fy\_gpa} = 0.00193 + 0.02387 \cdot sat\_sum$$

or

$$\hat{y} = 0.00193 + 0.02387 \cdot x$$

The output also contains additional information. Some of this will be covered in a subsequent lab, and some is not covered in our course.
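If only the intercept and slope are needed, they can be pulled out of a fitted model directly rather than read off the printed table. The following is a minimal sketch using `coef` and `residuals`, fitted to `cars` (a dataset that ships with R) as a stand-in for `satgpa`; the object names `fit` and `b` are hypothetical.

```r
# `cars` ships with R; it stands in for satgpa so the cell runs on its own
fit <- lm(dist ~ speed, data = cars)

b <- coef(fit)     # named vector holding the intercept and the slope
b["(Intercept)"]   # intercept
b["speed"]         # slope: coefficient of the explanatory variable

# Minimum, quartiles, and maximum of the residuals
quantile(residuals(fit))
```

With the lab model, `coef(regress1)` would return the same two numbers shown in the first column of the `Coefficients` table.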

#### Question:  Fit a new model that uses `hs_gpa` to predict `fy_gpa`. (Use a different model name than `regress1` to save the results)

<details>

<summary><b>Click to view sample code:</b></summary>


```
regress2 = lm(fy_gpa ~ hs_gpa, data = satgpa)
summary(regress2)
```

</details>

#### Question: Using the R output from fitting the linear model, write the equation of the regression line.  

#### Answer:  

<details>

<summary><b>Sample Answer:</b></summary>

$\hat{fy\_gpa} = 0.09132 + (0.74314) \times (hs\_gpa)$

#### Your answer does not need to use $\hat{fy\_gpa}$ notation. Your answer could simply use *fy_gpa*, or *fy_gpa-hat*.  

</details>

### Add regression line to scatterplot

Let’s create a scatterplot of `fy_gpa` versus `sat_sum` with the least squares line overlaid on top. The function `abline` may be used to add a line of the form $y=a+bx$ to any plot. For example, to add a line corresponding to the regression equation, one could use `abline(a=0.00193, b=0.02387)`.

However, since the regression object `regress1` contains all of the relevant information regarding the slope and intercept of the regression line, the `abline` function can be used in conjunction with the `regress1` object. Therefore, `abline(regress1)` will add the regression line to the scatterplot. 

In [None]:
plot(x = satgpa$sat_sum, y = satgpa$fy_gpa, xlab = "Total Verbal and Math SAT Percentiles", ylab = "GPA in first year of university/college")
abline(regress1)

#### Question: Create a scatterplot of first year GPA, `fy_gpa`, and high school GPA, `hs_gpa`, with the least squares line overlaid on the scatterplot.

<details>

<summary><b>Click to view sample code:</b></summary>


```
plot(x = satgpa$hs_gpa, y = satgpa$fy_gpa, xlab = "High school GPA", ylab = "GPA in first year of university/college")
abline(regress2)
```

</details>

### Prediction and prediction errors

The regression equation may be used to predict `y` at any value of `x`. When predictions are made for values of `x` that are beyond the range of the observed data, it is referred to as **extrapolation** and is not usually recommended. However, predictions made within the range of the data are more reliable. In a subsequent lab we will consider an R function that may be used for prediction, but for now we will rely upon calculations by hand.
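One simple way to check whether a prediction would be an extrapolation is to compare the new `x` value with the range of the observed `x` values. The sketch below uses made-up numbers; `observed_x` and `new_x` are hypothetical names introduced for illustration.

```r
# Hypothetical observed values of the explanatory variable
observed_x <- c(62, 85, 97, 104, 118, 131)
new_x <- 150   # value at which we would like to predict

rng <- range(observed_x)   # smallest and largest observed x
in_range <- new_x >= rng[1] && new_x <= rng[2]

in_range   # FALSE here, so predicting at new_x would be extrapolation
```

With the lab data loaded, `range(satgpa$sat_sum)` gives the interval over which predictions from the first model are most trustworthy.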

#### Question: Student number 1 in the dataset had total verbal and math SAT percentiles of 127. What first year GPA would the least squares regression line predict for a student with total SAT percentiles of 127?
**You can just use your calculator together with the regression equation in order to predict first year GPA. There is a `predict` function that may be used for this purpose, but we will not use this function at this time.** 

#### Answer:

<details>

<summary><b>Sample Answer:</b></summary>

$$\hat{fy\_gpa} = 0.00193 + 0.02387 \cdot sat\_sum = 0.00193 + (0.02387)(127) \approx 3.03$$

*The model predicts a first year GPA of approximately 3.03 for a student with total SAT percentiles of 127.*

#### Your answer does not need to use $\hat{fy\_gpa}$ notation. Your answer could simply use *fy_gpa*, or *fy_gpa-hat*.  

</details>

####  Question: Student number 1 in the dataset actually recorded a first year GPA of 3.18. Did the linear model overestimate or underestimate first year GPA for Student 1?

#### Answer:  

<details>

<summary><b>Sample Answer:</b></summary>

*The model underestimated the actual first year GPA for Student 1 by about 0.15 points $(y-\hat{y}= 3.18 - 3.03 = 0.15)$. A positive residual means the observed value lies above the regression line.*

</details>
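R can also serve as the calculator for these two questions. The sketch below simply types in the rounded coefficients from the regression equation above together with Student 1's values; the object names are hypothetical.

```r
b0 <- 0.00193   # intercept of the fitted line (rounded, from the output above)
b1 <- 0.02387   # slope of the fitted line (rounded)

sat_sum_1 <- 127    # Student 1's total SAT percentiles
fy_gpa_1  <- 3.18   # Student 1's observed first year GPA

predicted <- b0 + b1 * sat_sum_1   # predicted first year GPA
residual  <- fy_gpa_1 - predicted  # observed minus predicted

round(predicted, 2)   # 3.03
round(residual, 2)    # 0.15
```

Because the residual is positive, the observed GPA is above the line, which is another way of saying the model underestimated.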

#### Let’s stop here. 

#### It is important to save your work, exit the notebook, and log out of syzygy when you are done. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**