## Introduction to Regression - Lesson 1

#### Putting it into perspective

✅ There are many types of regression methods, and the one you choose depends on the question you're trying to answer. For example, if you want to predict the likely height of a person based on their age, you would use `linear regression`, as you're looking for a **numerical value**. On the other hand, if you're trying to determine whether a type of cuisine should be classified as vegan or not, you're dealing with a **category assignment**, so you would use `logistic regression`. You'll learn more about logistic regression later. Take a moment to think about some questions you could ask of data and which of these methods might be most suitable.
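
As a rough illustration of the distinction above, here is a minimal base R sketch that fits both kinds of model. All numbers are invented for this example (they do not come from any real dataset), and the variable names are hypothetical. `lm()` produces a numeric prediction, while `glm()` with `family = binomial` produces a probability that can then be mapped to a category.

```r
# Linear regression: predict a numeric value (height in cm) from age
# (toy numbers, invented purely for illustration)
age    <- c(2, 4, 6, 8, 10, 12, 14, 16)
height <- c(86, 102, 115, 128, 138, 149, 160, 168)
height_fit  <- lm(height ~ age)
pred_height <- predict(height_fit, newdata = data.frame(age = 9))

# Logistic regression: predict a category (coded 0/1) from a score
# (again toy numbers; the classes deliberately overlap)
score <- 1:10
label <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
label_fit <- glm(label ~ score, family = binomial)
prob <- predict(label_fit, newdata = data.frame(score = 7), type = "response")

pred_height  # a number on the same scale as height
prob         # a probability between 0 and 1
```

The first result is a quantity you could plot on a number line; the second only becomes an answer once you pick a cut-off (for example 0.5) to turn the probability into a class.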

In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine you wanted to test a treatment for diabetic patients. Machine Learning models could help you identify which patients might respond better to the treatment based on combinations of variables. Even a very simple regression model, when visualized, could reveal insights about variables that might assist in organizing your theoretical clinical trials.

With that in mind, let's dive into this task!

<p >
   <img src="../../images/encouRage.jpg"
   width="630"/>
   <figcaption>Artwork by @allison_horst</figcaption>
</p>



## 1. Setting up our tools

For this task, we'll need the following packages:

-   `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [set of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier, and more enjoyable!

-   `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [set of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.

You can install them using:

`install.packages(c("tidyverse", "tidymodels"))`

The script below checks if you have the necessary packages to complete this module and installs any that are missing.


In [2]:
suppressWarnings(if(!require("pacman")) install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels)

Loading required package: pacman



Now, let's load these awesome packages and make them available in our current R session. (This is purely for illustration; `pacman::p_load()` already did that for you.)


In [None]:
# load the core Tidyverse packages
library(tidyverse)

# load the core Tidymodels packages
library(tidymodels)


## 2. The diabetes dataset

In this exercise, we'll showcase our regression skills by making predictions using a diabetes dataset. The [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt) contains `442 samples` of data related to diabetes, with 10 predictor feature variables: `age`, `sex`, `body mass index`, `average blood pressure`, and `six blood serum measurements`, as well as an outcome variable `y`, which is a quantitative measure of disease progression one year after baseline.

|Number of observations|442|
|----------------------|:---|
|Number of predictors|First 10 columns are numeric predictive values|
|Outcome/Target|Column 11 is a quantitative measure of disease progression one year after baseline|
|Predictor Information|- age: age in years<br>- sex<br>- bmi: body mass index<br>- bp: average blood pressure<br>- s1: tc, total serum cholesterol<br>- s2: ldl, low-density lipoproteins<br>- s3: hdl, high-density lipoproteins<br>- s4: tch, total cholesterol / HDL<br>- s5: ltg, possibly log of serum triglycerides level<br>- s6: glu, blood sugar level|

> 🎓 Remember, this is supervised learning, and we need a target variable named 'y'.

Before you can work with data in R, you need to import the data into R's memory or establish a connection that allows R to access the data remotely.

> The [readr](https://readr.tidyverse.org/) package, part of the Tidyverse, offers a fast and user-friendly way to read rectangular data into R.

Now, let's load the diabetes dataset. It is described at <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html>, and the code below reads the raw text file (`diabetes.rwrite1.txt`) hosted on the same site.

We'll also perform a quick check on our data using `glimpse()` and display the first 5 rows with `slice()`.

Before moving forward, let's introduce something you'll frequently encounter in R code 🥁🥁: the pipe operator `%>%`

The pipe operator (`%>%`) allows you to perform operations in a logical sequence by passing an object forward into a function or expression. You can think of the pipe operator as saying "and then" in your code.
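
To see the "and then" reading in action, here is a small sketch comparing nested calls with piped calls. It loads `magrittr` directly, which is where `%>%` comes from (tidyverse packages such as dplyr re-export it); the `values` vector is made up for the example.

```r
library(magrittr)  # provides %>%

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# Piped calls are read left to right:
# take values, and then take square roots, and then sum them
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms compute the same thing; the piped version simply lists the steps in the order they happen.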


In [None]:
# Import the data set
# (read_table2() is deprecated as of readr 2.0; read_table() is its replacement)
diabetes <- read_table(file = "https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt")


# Get a glimpse and dimensions of the data
glimpse(diabetes)


# Select the first 5 rows of the data
diabetes %>% 
  slice(1:5)

`glimpse()` reveals that this dataset contains 442 rows and 11 columns, with all columns being of the `double` data type.

<br>

> `glimpse()` and `slice()` are functions from [`dplyr`](https://dplyr.tidyverse.org/). Dplyr, part of the Tidyverse, is a collection of tools for data manipulation that provides a consistent set of verbs to address common data manipulation tasks.

<br>

Now that we have the dataset, let's focus on one feature (`bmi`) for this exercise. To do this, we need to select the relevant columns. So, how can we achieve this?

[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) enables us to *choose* (and optionally rename) specific columns in a data frame.
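
As a quick sketch of the "optionally rename" part, the example below uses a tiny made-up tibble (`toy`, whose values are not from the diabetes data) and a hypothetical new column name, `body_mass_index`.

```r
library(dplyr)

# A tiny made-up tibble, for illustration only
toy <- tibble(
  age = c(50, 61),
  bmi = c(24.1, 30.2),
  y   = c(120, 210)
)

# Keep two columns, renaming `bmi` along the way (new_name = old_name)
toy_select <- toy %>%
  select(body_mass_index = bmi, y)

names(toy_select)
```

Columns that are not mentioned in `select()` (here, `age`) are dropped from the result.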


In [None]:
# Select predictor feature `bmi` and outcome `y`
diabetes_select <- diabetes %>% 
  select(c(bmi, y))

# Print the first 10 rows
diabetes_select %>% 
  slice(1:10)

## 3. Training and Testing Data

In supervised learning, it's a common approach to *divide* the data into two subsets: a (usually larger) set used to train the model, and a smaller "reserved" set used to evaluate the model's performance.

Now that our data is prepared, we can explore whether a machine can assist in identifying a logical way to split the numbers in this dataset. To achieve this, we can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework. This package allows us to create an object that contains the details of *how* the data should be split, followed by two additional rsample functions to extract the resulting training and testing sets:


In [None]:
set.seed(2056)
# Split 67% of the data for training and the rest for testing
diabetes_split <- diabetes_select %>% 
  initial_split(prop = 0.67)

# Extract the resulting train and test sets
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)

# Print the first 10 rows of the training set
diabetes_train %>% 
  slice(1:10)
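
For intuition, a random train/test split can be approximated by hand in base R. This is only a sketch of the idea, not rsample's actual implementation: the row count of 442 comes from the dataset description, while `toy`, `train_idx`, and rounding down with `floor()` are assumptions made for this example.

```r
set.seed(2056)

# Stand-in data frame with the same number of rows as the diabetes data
toy <- data.frame(id = seq_len(442))

# Randomly choose about 67% of the row indices for training
n_train   <- floor(0.67 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

# Rows that were picked go to training; all remaining rows go to testing
toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

c(train = nrow(toy_train), test = nrow(toy_test))
```

The key property is that every row lands in exactly one of the two sets, so the model is always evaluated on data it has not seen during training.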

## 4. Train a linear regression model with Tidymodels

Now it's time to train our model!

In Tidymodels, models are defined using the `parsnip` package by specifying three key aspects:

-   The **type** of model distinguishes between options like linear regression, logistic regression, decision tree models, and others.

-   The **mode** of the model refers to common tasks like regression or classification; some model types can handle both modes, while others are limited to one.

-   The **engine** is the computational tool that will be used to fit the model. These are often R functions or packages, such as **`"lm"`** (the `lm()` function from base R's stats package) or **`"ranger"`**.

This modeling information is stored in a model specification, so let's create one!


In [None]:
# Build a linear model specification
lm_spec <- 
  # Type
  linear_reg() %>% 
  # Engine
  set_engine("lm") %>% 
  # Mode
  set_mode("regression")


# Print the model specification
lm_spec
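
If you're curious how a specification maps onto the underlying engine, parsnip's `translate()` can show the call template it will use. The sketch below rebuilds the same specification so that it runs on its own.

```r
library(tidymodels)

# The same specification as in the lesson: type, engine, mode
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Show how parsnip translates this specification for the "lm" engine
translate(lm_spec)
```

Nothing has been fitted at this point; the specification is only a description of the model we intend to train.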

After a model has been *defined*, it can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, usually with a formula and some data.

`y ~ .` indicates that we will fit `y` as the target or predicted value, explained by all the predictors/features, i.e., `.` (in this case, we only have one predictor: `bmi`).


In [None]:
# Build a linear model specification
lm_spec <- linear_reg() %>% 
  set_engine("lm") %>%
  set_mode("regression")


# Train a linear regression model
lm_mod <- lm_spec %>% 
  fit(y ~ ., data = diabetes_train)

# Print the model
lm_mod

From the model output, we can see the coefficients learned during training: an intercept and a slope for `bmi`. Together they define the line of best fit, which for the `lm` engine is the line that minimizes the sum of squared differences between the actual and predicted values of `y`.
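
To make the idea of a best-fit line concrete, here is a small base R sketch using made-up numbers (not the diabetes data). With a single predictor, the least-squares slope and intercept have a simple closed form, and `lm()` arrives at the same values.

```r
# Made-up data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() estimates the same two coefficients
fit <- lm(y ~ x)
coef(fit)

all.equal(unname(coef(fit)), c(intercept, slope))
```

The names `x`, `y`, `slope`, and `intercept` are chosen just for this example.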

<br>

## 5. Make predictions on the test set

Now that we've trained a model, we can use it to predict the disease progression `y` for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). We'll use these predictions to draw the fitted regression line through the data.


In [None]:
# Make predictions for the test set
predictions <- lm_mod %>% 
  predict(new_data = diabetes_test)

# Print out some of the predictions
predictions %>% 
  slice(1:5)

Woohoo! 💃🕺 We just trained a model and used it to make predictions!

When making predictions, the tidymodels convention is to always generate a tibble/data frame of results with standardized column names. This ensures that combining the original data with the predictions is straightforward and results in a format that can be easily used for further tasks like plotting.

`dplyr::bind_cols()` efficiently combines multiple data frames by columns.


In [None]:
# Combine the predictions and the original test set
results <- diabetes_test %>% 
  bind_cols(predictions)


results %>% 
  slice(1:5)
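
As an aside, parsnip also provides an `augment()` method for fitted models that appends the prediction columns to the data in one step. The sketch below compares the two routes on simulated data; `toy` and `toy_fit` are hypothetical names, and the numbers are randomly generated rather than taken from the diabetes dataset.

```r
library(tidymodels)

set.seed(1)

# Simulated data, for illustration only
toy <- tibble(bmi = rnorm(50))
toy <- toy %>% mutate(y = 150 + 40 * bmi + rnorm(50, sd = 10))

toy_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ bmi, data = toy)

# Route 1: predict(), then bind the .pred column onto the data
manual <- toy %>%
  bind_cols(predict(toy_fit, new_data = toy))

# Route 2: augment() adds the prediction columns directly
augmented <- augment(toy_fit, new_data = toy)

all.equal(manual$.pred, augmented$.pred)
```

Either route gives a `.pred` column next to the original variables, which is the shape needed for plotting.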

## 6. Plot modelling results

Now, it's time to visualize this 📈. We'll create a scatter plot of all the `y` and `bmi` values from the test set, and then use the predictions to draw the model's fitted line through the data.

R offers several systems for creating graphs, but `ggplot2` is one of the most elegant and versatile. It allows you to build graphs by **combining independent components**.


In [None]:
# Set a theme for the plot
theme_set(theme_light())
# Build the plot from the combined results
results %>% 
  ggplot(aes(x = bmi)) +
  # Add the observed values as points
  geom_point(aes(y = y), size = 1.6) +
  # Add the predicted values as a line
  geom_line(aes(y = .pred), color = "blue", linewidth = 1.5)

> ✅ Take a moment to think about what's happening here. A straight line is passing through many small data points, but what is its purpose exactly? Can you understand how this line can help predict where a new, unseen data point might fall in relation to the plot's y-axis? Try to describe in words the practical application of this model.
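
One way to answer that question is to try it. The base R sketch below uses made-up numbers (not the diabetes data) to show that a prediction for a new point is simply the height of the fitted line at that point's x value.

```r
# Made-up training data, for illustration only
bmi <- c(-0.05, -0.02, 0.00, 0.03, 0.06)
y   <- c(95, 130, 150, 190, 230)

fit <- lm(y ~ bmi)

# A new, unseen bmi value
new_point <- data.frame(bmi = 0.04)

# The model's prediction ...
by_predict <- predict(fit, newdata = new_point)

# ... is just intercept + slope * x, read off the fitted line
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["bmi"]] * 0.04

all.equal(unname(by_predict), by_hand)
```

In practice, this means that given only a new patient's `bmi`, the line supplies an estimate of where their `y` value would fall.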

Congratulations, you've built your first linear regression model, made a prediction with it, and visualized it in a plot!



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
