# What factors are driving pay discrimination between men and women in your organization?

## Goals

In this case, we will set up a foundational understanding of statistics necessary for linear regression and then introduce linear regression starting with 2 parameters. We hope students will thoroughly understand the working components of a linear regression model such as interpreting the coefficients and understanding various metrics to properly evaluate the performance of the model.

## Introduction

**Business Context**. You are a data scientist in a medium-sized organization. Your company is undergoing an internal review of its hiring practices and employee compensation. In recent years, your firm has had low success in converting high-quality female candidates that it has wanted to hire. Management hypothesizes that this is due to possible pay discrimination and wants to figure out what is causing it.

**Business Problem.** As part of the internal review, the human resources department has approached you to specifically investigate the following question: **"On balance, are men paid more than women in your organization? If so, what is driving this gap?"**

**Analytical Context**. The human resources department has provided you with an employee database that contains information on various attributes such as performance, education, income, seniority, etc. We will use linear regression techniques on this dataset to solve the business problem described above. We will see how linear regression quantifies the relationship between the output variable (pay) and the input variables (e.g. education, age, seniority, etc.).

The case is structured as follows: we will (1) perform exploratory data analysis to visually investigate the differences in pay; (2) use the observed insights to formally fit regression models; and finally (3) address the pay discrimination issue.

Importing the applets:

In [None]:
import c1applet.pairwise_boxplots as pairwise_boxplots
import c1applet.pairwise_boxplots_gender as pairwise_boxplots_gender

## Data exploration

Here are the first 5 rows of the dataset:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>job_title</th>      <th>age_years</th>      <th>performance_score</th>      <th>education</th>      <th>seniority_years</th>      <th>pay_yearly</th>      <th>male_female</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>Project Manager</td>      <td>34</td>      <td>33.33</td>      <td>High School</td>      <td>4</td>      <td>118503</td>      <td>M</td>    </tr>    <tr>      <th>1</th>      <td>Marketing associate</td>      <td>66</td>      <td>16.67</td>      <td>High School</td>      <td>3</td>      <td>129393</td>      <td>M</td>    </tr>    <tr>      <th>2</th>      <td>Marketing associate</td>      <td>51</td>      <td>50.00</td>      <td>Masters</td>      <td>8</td>      <td>139440</td>      <td>M</td>    </tr>    <tr>      <th>3</th>      <td>Sales representative</td>      <td>26</td>      <td>16.67</td>      <td>Masters</td>      <td>3</td>      <td>118191</td>      <td>F</td>    </tr>    <tr>      <th>4</th>      <td>Account executive</td>      <td>36</td>      <td>50.00</td>      <td>PhD</td>      <td>4</td>      <td>77717</td>      <td>M</td>    </tr>  </tbody></table>

The available features are:

1. **job_title**: the title of the job (e.g. “Graphic Designer”, “Software Engineer”, etc)
2. **age_years**: age
3. **performance_score**: on a scale of 0 to 100, 0 being the lowest and 100 being the highest
4. **education**: different levels of education (e.g. "College", "PhD", "Masters", "High School")
5. **seniority_years**: years of seniority
6. **pay_yearly**: pay in dollars
7. **male_female**: male or female

There are 241 males and 222 females in this dataset, for a total of 463 people.

### Exercise 1

#### 1.1

This is a box plot comparing pay between men and women. What can you conclude?

![](data/images/pay_gender_boxplot.png)

**Answer.**

-------

#### 1.2

A $t$-test on the difference in average pay between men and women produced a $p$-value of `0.00168638900645661`. What does this suggest?


**Answer.**

-------

#### 1.3

Inspect the scatterplot below and the boxplots of pay with respect to the following attributes: `seniority_years`, `education`, `job_title`, and `performance_score`. What patterns do you observe?

![Pay vs. Age](data/images/pay_age_scatterplot.png)

In [None]:
pairwise_boxplots.app.run_server(port="8050", mode="inline")

**Answer.**

-------

### Exercise 2

Now, let's make the same plots as in Exercise 1.3, but additionally differentiate by gender. What patterns do you observe?

You can use the following table to complement the plots:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th>Descriptive statistics of <code>pay_yearly</code></th>      <th>F</th>      <th>M</th>    </tr>  </thead>  <tbody>    <tr>      <th>count</th>      <td>222.00</td>      <td>241.00</td>    </tr>    <tr>      <th>mean</th>      <td>96255.95</td>      <td>103821.54</td>    </tr>    <tr>      <th>std</th>      <td>26971.22</td>      <td>24558.90</td>    </tr>    <tr>      <th>min</th>      <td>38006.00</td>      <td>43848.00</td>    </tr>    <tr>      <th>25%</th>      <td>76199.75</td>      <td>89361.00</td>    </tr>    <tr>      <th>50%</th>      <td>96413.50</td>      <td>103432.00</td>    </tr>    <tr>      <th>75%</th>      <td>114349.25</td>      <td>120357.00</td>    </tr>    <tr>      <th>max</th>      <td>183827.00</td>      <td>181662.00</td>    </tr>  </tbody></table>

In [None]:
pairwise_boxplots_gender.app.run_server(port="8050", mode="inline")

**Answer.**

-------

### Exercise 3

Which of the following is true? Select all that apply.

<ul>
I. The average salary for men is around USD \\$7,500 more than that of women in this organization.<br>
II. Men are paid significantly more than women due solely to gender differences.<br>
</ul>

**Answer.**

-------

## What are the variables that influence pay?

As we discussed in Exercise 3 and also saw in the data exploration, even though there is a significant pay gap between the genders, there are also some other factors at work which are driving this difference. Thus, ignoring these factors while addressing pay discrimination could lead to wrong or misleading conclusions.

How do we take into account the influence of the other variables on pay? What are these variables? A good place to start is our earlier exploratory data analysis. Age is the one variable we examined with a scatterplot, so let's look again at our plot of pay vs. age:

![Pay vs. Age](data/images/pay_age_scatterplot.png)

Pay seems to be positively correlated with age; i.e. the older someone is, the more they tend to get paid. Thus, if the men in our dataset tend to be older than the women, the pay difference we see between men and women could partly be a consequence of age.

### Exercise 4

Here are some simulated scenarios. In each of these cases, guess the correlation:

![Guess the correlation](data/images/guess_the_correlation.png)

**Answer.**

-------

### Exercise 5

Is the following statement true or false, and why: "If the correlation between two variables is zero, then the two variables are unrelated."?

**Answer.**

-------

To find the variables that have the greatest influence on pay, we can compute a correlation matrix and plot it as a heat map:

![Correlation matrix](data/images/correlation_matrix.png)

It seems that the two variables that are most linearly related with pay are age and seniority. We must make sure to include them in our models.
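To make the numbers in the heat map concrete, here is a minimal sketch of how a single Pearson correlation coefficient can be computed by hand. The function name `pearson_correlation` and the six age and pay values are assumptions made up for illustration; they are not taken from the case dataset.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unscaled covariance)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unscaled variances)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical values for illustration only (not from the case dataset)
age_years = [25, 32, 41, 48, 55, 63]
pay_yearly = [70000, 95000, 88000, 120000, 101000, 125000]

r = pearson_correlation(age_years, pay_yearly)
print(f"correlation between age and pay: {r:.3f}")
```

Each cell of a correlation matrix holds this quantity for one pair of variables, which is why the matrix is symmetric and has ones on its diagonal.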

## Using linear models to account for variables correlated with pay

Once we identify some variables that are correlated with the output variable, we can use a linear model to capture this relationship quantitatively. A linear model does this by finding a line that [**best fits**](https://mathbits.com/MathBits/TISection/Statistics1/LineFit.htm) the data points:

![Pay vs. age line of best fit](data/images/pay_age_line_best_fit.png)

A line has two parameters - intercept ($\beta_0$) and slope ($\beta_1$), also known as the **coefficients** of the model. Thus, a linear model for pay vs. age can be represented as:

$$ PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \varepsilon $$

The interpretation of the coefficient $\beta_1$ is the following: an increase of one year in age is associated, on average, with a change of USD $\beta_1$ in pay. The intercept $\beta_0$ can be thought of as a sort of "baseline" pay. The difference between the actual value (the data point) and the value predicted by the line is the error or **residual**, represented above by the Greek letter $\varepsilon$ (epsilon).

This diagram summarizes the above ideas:

![Line of best fit explained](data/images/line_best_fit_explained.jpg)
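The intercept, slope, and residuals described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `fit_line` is a hypothetical helper that uses the standard closed-form least-squares formulas for a single input variable, and the data values are made up rather than taken from the case dataset.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # The least-squares line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical values for illustration only (not from the case dataset)
age_years = [25, 32, 41, 48, 55, 63]
pay_yearly = [70000, 95000, 88000, 120000, 101000, 125000]

beta_0, beta_1 = fit_line(age_years, pay_yearly)

# Residual = actual value minus the value predicted by the line
predicted = [beta_0 + beta_1 * age for age in age_years]
residuals = [pay - pay_hat for pay, pay_hat in zip(pay_yearly, predicted)]

print(f"intercept (beta_0): {beta_0:.2f}")
print(f"slope (beta_1):     {beta_1:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

Points above the line have positive residuals and points below it have negative ones; for a least-squares line with an intercept, the residuals sum to zero.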

## Interpreting the output of a linear model


The straight line that we drew on the pay vs. age scatterplot was fitted using a statistical model called a **linear regression**, whose outputs can be seen below:


<pre class="western" style="border: none; padding: 0cm; text-align: center; orphans: 2; widows: 2; background: #ffffff"><code>
OLS Regression Results
==============================================================================

<font color="red"><b>Dep. Variable:             pay_yearly</b></font>   <font color="red"><b>R-squared:                       0.238</b></font>
Model:                            OLS   Adj. R-squared:                  0.236
Method:                 Least Squares   F-statistic:                     143.8
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           5.00e-29
Time:                        14:39:39   Log-Likelihood:                -5300.3
No. Observations:                 463   AIC:                         1.060e+04
Df Residuals:                     461   BIC:                         1.061e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 <font color="red"><b>coef</b></font>    std err          t      <font color="red"><b>P&gt;|t|</b></font>      [0.025      0.975]
------------------------------------------------------------------------------
<font color="red"><b>Intercept   6.384e+04</b></font>   3209.744     19.891      <font color="red"><b>0.000</b></font>    5.75e+04    7.02e+04
<font color="red"><b>age_years    873.5006</b></font>     72.840     11.992      <font color="red"><b>0.000</b></font>     730.360    1016.641
==============================================================================
Omnibus:                        2.346   Durbin-Watson:                   2.033
Prob(Omnibus):                  0.310   Jarque-Bera (JB):                2.414
Skew:                           0.151   Prob(JB):                        0.299
Kurtosis:                       2.817   Cond. No.                         134.
==============================================================================
</code></pre>


Although the output table above contains a lot of information, we only need to focus on a small number of quantities. These are the output variable (also known as the **dependent variable**), $R^2$, and the coefficients (the estimates of $\beta_0$ and $\beta_1$) and their $p$-values. We have highlighted them in red for your convenience.

### Coefficients

The intercept $\beta_0$ is about USD \\$63,840. This can be thought of as the baseline pay; that is, the expected pay of a person of age zero. Frequently, the intercept does not have a meaningful interpretation (like in this case) - that is okay so long as we acknowledge it and have a sound explanation as to why. The slope (the coefficient $\beta_1$ for the age) is USD \\$873.50. The interpretation of this coefficient is as explained before - if an employee becomes one year older, their pay is expected to increase by USD \\$873.50 on average.

Input variables like age are also called **independent variables** in the context of a linear regression model.

### $p$-values

You might notice that each coefficient in the above output table has an associated $p$-value. This is because the coefficients are estimated from our available data, so they may not necessarily represent the "true" coefficients across the entire population. The null hypothesis being tested here for $\beta_1$ is:

$$ H_0: \beta_1 = 0 $$

and the alternative is

$$ H_a: \beta_1 \neq 0, $$

and similarly for $\beta_0$. The $p$-value of $\beta_1$ (given under the column `P>|t|` in the table) is displayed as 0.000, meaning that it is smaller than 0.0005. Thus, the difference between zero and the estimate of $\beta_1$ is statistically significant at the 0.05 significance level, and we reject the null hypothesis. This implies that age is indeed associated with at least some of the differences in pay.

If a coefficient in your model is not statistically significant, it means that, assuming your model represents reality reasonably well, the data do not provide enough evidence of an association between that variable and your output variable.
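As a rough check on where the `t` and `P>|t|` columns come from, the sketch below recomputes the test statistic for `age_years` from the coefficient and standard error printed in the table. The $p$-value here uses a normal approximation, which is an assumption made for simplicity (the table itself is based on a $t$ distribution with 461 residual degrees of freedom, which is very close to normal).

```python
import math

# Values copied from the age_years row of the regression table above
coef = 873.5006
std_err = 72.840

# Test statistic: how many standard errors the estimate lies away from zero
t_stat = coef / std_err

# Two-sided p-value under a standard normal approximation (an assumption;
# the table uses a t distribution, which is nearly identical at 461 df)
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.3f}")  # t = 11.992, matching the table
print(f"approximate p-value = {p_approx:.3g}")
print("reject H0 at the 0.05 level:", p_approx < 0.05)
```

The approximate $p$-value is far too small to show with three decimal places, which is why the table prints it as 0.000.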

### $R$-squared

Another key quantity to pay attention to when interpreting a regression table is [**$R$-squared**](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit) (a.k.a. the **coefficient of determination**). (Note that the table shows $R^2$ and adjusted $R^2$ - we will focus on $R^2$ for now.) This quantity is always going to be between 0 and 1. The higher the $R^2$, the higher the percentage of observed variation that can be explained by the model. You can think of it as an indicator of how well your predictions match the actual data points; i.e. when $R^2=1$ your model's predictions are perfect, and when $R^2=0$ they are no better than always predicting the average of the output variable. Thus, $R^2$ is said to be a representation of the model's **goodness of fit**.

Here are two example models - one with high goodness of fit, and another with reasonable but notably lower goodness of fit:

![High and low R squared](data/images/high_low_r_2.png)

For the pay vs. age model, $R^2 = 0.238$. Since this model only explains about 23.8% of the variation, this motivates us to investigate if factors other than age can be used to explain the pay differences and thus improve our predictions.

One important mathematical fact to keep in mind is that when we are dealing with two-variable relationships, $R^2 \equiv \rho^2$, where $\rho$ (rho) is the correlation coefficient. In the pay vs. age scatterplot, the correlation coefficient $\rho$ was 0.4876, and therefore $R^2=0.4876^2=0.238$. From this, it follows that *the stronger the linear correlation between two variables, the better the goodness of fit of the corresponding linear regression*.
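The identity between $R^2$ and $\rho^2$ can be checked numerically. The sketch below is an illustration under stated assumptions: the data values are made up, and $R^2$ is computed with its standard definition as one minus the share of variation around the mean that the fitted line leaves unexplained.

```python
import math

# Hypothetical values for illustration only (not from the case dataset)
age_years = [25, 32, 41, 48, 55, 63]
pay_yearly = [70000, 95000, 88000, 120000, 101000, 125000]

n = len(age_years)
mean_x = sum(age_years) / n
mean_y = sum(pay_yearly) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age_years, pay_yearly))
s_xx = sum((x - mean_x) ** 2 for x in age_years)
s_yy = sum((y - mean_y) ** 2 for y in pay_yearly)  # total variation in pay

# Correlation coefficient between the two variables
rho = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line and its predictions
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in age_years]

# R-squared: 1 minus the fraction of variation left unexplained by the line
unexplained = sum((y - y_hat) ** 2 for y, y_hat in zip(pay_yearly, predicted))
r_squared = 1 - unexplained / s_yy

print(f"rho^2     = {rho ** 2:.6f}")
print(f"R-squared = {r_squared:.6f}")
```

The two printed values agree up to floating-point error. This holds only for a regression with a single input variable and an intercept.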

There are many possible lines that we can fit to our data, but as we said earlier, the one we want is the line of best fit, which is defined as the line (among all the possible lines) that maximizes the model's $R^2$.

The usual way to solve this optimization problem is by taking each residual, squaring it (i.e. taking it to the second power), and finally summing up all the squared residuals. As an example, let's say that you have these four residuals:

* 7.2
* -5.1
* -3.33
* 6.7

Then the sum of the squared residuals (**SSR**) will be:

$$
SSR = (7.2)^2 + (-5.1)^2 + (-3.33)^2 + (6.7)^2 \approx 133.829
$$

We square the residuals to avoid negative numbers canceling out positive numbers.
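The arithmetic above is easy to reproduce. This short sketch uses the four residuals from the example and contrasts the plain sum, where positive and negative residuals partly offset each other, with the sum of squares.

```python
# The four residuals from the example above
residuals = [7.2, -5.1, -3.33, 6.7]

# Without squaring, negative residuals offset positive ones
plain_sum = sum(residuals)

# Squaring makes every term non-negative before summing
ssr = sum(e ** 2 for e in residuals)

print(f"plain sum of residuals:   {plain_sum:.2f}")  # 5.47
print(f"sum of squared residuals: {ssr:.4f}")        # 133.8289
```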

Different lines will give you different SSRs. The goal is to find the one with the lowest SSR, which will in turn give you the highest possible $R^2$. Here is an example of two lines fitted to the same set of data points. You can see that the line of best fit is the one with the smaller squares (the squares represent squared residuals):

![Squared residuals](data/images/squared_residuals.png)

You might wonder why minimizing the SSR maximizes the $R^2$ - we will touch upon this in a future case.

## Looking at age and gender

Now that we have seen that age explains some of the relationship with pay, let's consider a model in which we take age and gender into account simultaneously. We will move from fitting a [**simple linear regression**](https://en.wikipedia.org/wiki/Simple_linear_regression) (one independent variable, one dependent variable) to fitting a [**multiple linear regression**](https://www.investopedia.com/terms/m/mlr.asp) (several independent variables, one dependent variable). Age is a numeric variable (e.g. 26.5, 32). In contrast, gender takes only two values - male and female, which makes it categorical.


Linear regressions can incorporate categorical variables just as easily as numerical variables. The trick is to code the categories as numbers so that the model can interpret them. There are [a handful](https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2) of ways to code categorical variables, but one of the most common is to transform them into sets of ones and zeroes. We will look at this topic in more detail in future cases, but for now, you don't have to worry too much about it because Python usually does the conversion in the background for you. What *is* necessary to keep in mind though is that the way we interpret the coefficients of categorical variables in a linear model is slightly different from those of numeric variables:

<pre class="western" style="border: none; padding: 0cm; text-align: left; orphans: 2; widows: 2; background: #ffffff"><code>
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             pay_yearly   R-squared:                       0.258
Model:                            OLS   Adj. R-squared:                  0.255
Method:                 Least Squares   F-statistic:                     79.99
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           1.54e-30
Time:                        20:02:08   Log-Likelihood:                -5294.0
No. Observations:                 463   AIC:                         1.059e+04
Df Residuals:                     460   BIC:                         1.061e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         6.006e+04   3344.884     17.957      0.000    5.35e+04    6.66e+04
<font color="red"><b>male_female[T.M]  7398.3050</b></font>   2087.361      3.544      <font color="red"><b>0.000</b></font>    3296.361    1.15e+04
age_years          871.8142     71.945     12.118      0.000     730.432    1013.197
==============================================================================
Omnibus:                        3.607   Durbin-Watson:                   2.034
Prob(Omnibus):                  0.165   Jarque-Bera (JB):                3.688
Skew:                           0.203   Prob(JB):                        0.158
Kurtosis:                       2.840   Cond. No.                         146.
==============================================================================
</code></pre>

The interpretation of the coefficient of age is similar as before - within same-sex groups, if age increases by one year, pay is expected to increase by USD \\$872 (notice that this value is different from the one we obtained in the previous regression - this is because the variability present in `pay_yearly` now has to be split among more explanatory variables).

Now focus on the coefficient of the `male_female` variable (we highlighted it for you). It shows male (`T.M`) only, because the category female is taken as the default category. That means that the coefficient represents the additional or reduced `pay_yearly` associated with someone being male as opposed to female. Ultimately, the choice of default category doesn't matter - we could easily have chosen to make male the default category, in which case the coefficient for gender would be labeled `T.F` and would have the same size with the opposite sign. The coefficient \\$7,398 is interpreted as follows: for employees *of the same age*, men on average make USD \\$7,398 more than women do.
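To illustrate the role of the default category, here is a minimal sketch that codes gender as a column of ones and zeroes and fits the model with NumPy's least-squares solver. Everything in it is an assumption for illustration: the eight employees are made up and constructed without noise (pay = 60000 + 7000 for men + 900 per year of age), and `fit` is a hypothetical helper, not the routine used to produce the table above.

```python
import numpy as np

# Hypothetical employees, constructed without noise so the fit is easy to check:
# pay = 60000 + 7000 * is_male + 900 * age
gender = ["F", "M", "F", "M", "F", "M", "F", "M"]
age = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
is_male = np.array([1.0 if g == "M" else 0.0 for g in gender])
pay = 60000 + 7000 * is_male + 900 * age

def fit(dummy):
    """Least-squares fit of pay on an intercept, a 0/1 gender dummy, and age."""
    X = np.column_stack([np.ones_like(age), dummy, age])
    coefs, *_ = np.linalg.lstsq(X, pay, rcond=None)
    return coefs  # [intercept, dummy coefficient, age coefficient]

# Female as the default category: the dummy equals 1 for men
coefs_f_base = fit(is_male)
# Male as the default category: the dummy equals 1 for women
coefs_m_base = fit(1.0 - is_male)

print("female default -> intercept, gender, age:", np.round(coefs_f_base, 2))
print("male default   -> intercept, gender, age:", np.round(coefs_m_base, 2))
```

Switching the default category flips the sign of the gender coefficient and moves the difference into the intercept, while the age coefficient and the model's predictions stay the same.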

But we still haven't satisfactorily answered our main question. So far, we have only accounted for age in addition to gender when explaining pay gaps. There are still a few more factors that could affect pay. We consider education next. The following plot shows that employees with a PhD are paid more:

![Pay vs. education](data/images/pay_education_boxplot.png)

### Exercise 6

This is the output of the model

$$
PAY{\_}YEARLY = \beta_0 + \beta_1 AGE{\_}YEARS + \beta_2 {MALE{\_}FEMALE} + \beta_3 EDUCATION + \varepsilon
$$

<br>

~~~plain
                        OLS Regression Results                            
==============================================================================
Dep. Variable:             pay_yearly   R-squared:                       0.290
Model:                            OLS   Adj. R-squared:                  0.283
Method:                 Least Squares   F-statistic:                     37.38
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           3.93e-32
Time:                        20:42:46   Log-Likelihood:                -5283.8
No. Observations:                 463   AIC:                         1.058e+04
Df Residuals:                     457   BIC:                         1.060e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  5.69e+04   3658.387     15.552      0.000    4.97e+04    6.41e+04
male_female[T.M]          7271.6447   2054.014      3.540      0.000    3235.161    1.13e+04
education[T.High School]  -259.3779   2851.527     -0.091      0.928   -5863.108    5344.353
education[T.Masters]      4890.6387   2894.975      1.689      0.092    -798.476    1.06e+04
education[T.PhD]          1.156e+04   3007.694      3.842      0.000    5645.114    1.75e+04
age_years                  859.2562     70.717     12.151      0.000     720.286     998.226
==============================================================================
Omnibus:                        2.897   Durbin-Watson:                   2.038
Prob(Omnibus):                  0.235   Jarque-Bera (JB):                2.919
Skew:                           0.192   Prob(JB):                        0.232
Kurtosis:                       2.937   Cond. No.                         203.
==============================================================================
~~~

Compare its $R^2$ with that of the previous model. What conclusions can we make?

**Answer.**

-------

### Exercise 7

Which of the following statements is true regarding the above model? Select all that apply.

<ul>
I. After accounting for age and gender, employees with a college education are paid on average USD \\$11,560 less than those with a PhD.<br>
II. After accounting for age and gender, employees with a masters degree are paid on average USD \\$4,891 more than those with just a high school degree.<br>
</ul>

**Answer.**

-------

## Integrated model accounting for all variables

Let us account for all the other factors that could explain pay gaps at once. Adding `job_title`, `performance_score`, and `seniority_years`:

~~~plain
                         OLS Regression Results                            
==============================================================================
Dep. Variable:             pay_yearly   R-squared:                       0.506
Model:                            OLS   Adj. R-squared:                  0.493
Method:                 Least Squares   F-statistic:                     38.41
Date:                Wed, 02 Jun 2021   Prob (F-statistic):           2.11e-61
Time:                        21:02:00   Log-Likelihood:                -5199.9
No. Observations:                 463   AIC:                         1.043e+04
Df Residuals:                     450   BIC:                         1.048e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                           4.41e+04   4251.087     10.373      0.000    3.57e+04    5.25e+04
job_title[T.Dog trainer]          -1.095e+04   2778.457     -3.942      0.000   -1.64e+04   -5491.545
job_title[T.Marketing associate]   1.337e+04   3368.690      3.970      0.000    6752.164       2e+04
job_title[T.Project Manager]       1.373e+04   2899.199      4.736      0.000    8032.270    1.94e+04
job_title[T.Sales representative] -1207.9643   2816.886     -0.429      0.668   -6743.848    4327.920
job_title[T.Web Designer]         -1455.6120   3307.946     -0.440      0.660   -7956.551    5045.327
education[T.High School]           -278.3053   2404.785     -0.116      0.908   -5004.309    4447.698
education[T.Masters]               4937.7643   2446.576      2.018      0.044     129.631    9745.898
education[T.PhD]                   9838.9167   2537.772      3.877      0.000    4851.562    1.48e+04
male_female[T.M]                   3709.8452   1853.780      2.001      0.046      66.704    7352.987
age_years                           666.1748     62.980     10.578      0.000     542.403     789.946
performance_score                    83.8079     31.411      2.668      0.008      22.077     145.539
seniority_years                    3613.3282    356.544     10.134      0.000    2912.630    4314.026
==============================================================================
Omnibus:                        0.094   Durbin-Watson:                   1.939
Prob(Omnibus):                  0.954   Jarque-Bera (JB):                0.062
Skew:                          -0.028   Prob(JB):                        0.969
Kurtosis:                       3.002   Cond. No.                         466.
==============================================================================
~~~
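One way to read a table like this is to plug its coefficients into the model equation. The sketch below does so for a hypothetical employee profile (a dog trainer with a Masters degree, aged 30, with a performance score of 50 and 3 years of seniority); the profile and the helper `predict_pay` are assumptions for illustration. Because the table prints the intercept and the dog trainer coefficient in rounded scientific notation, the predictions are approximate.

```python
# Coefficients copied from the regression table above. The intercept and the
# dog trainer coefficient appear there in rounded scientific notation.
coef = {
    "Intercept": 4.41e4,
    "job_title[T.Dog trainer]": -1.095e4,
    "education[T.Masters]": 4937.7643,
    "male_female[T.M]": 3709.8452,
    "age_years": 666.1748,
    "performance_score": 83.8079,
    "seniority_years": 3613.3282,
}

def predict_pay(is_male, age_years, performance_score, seniority_years):
    """Predicted yearly pay for a hypothetical dog trainer with a Masters degree."""
    return (
        coef["Intercept"]
        + coef["job_title[T.Dog trainer]"]    # job title dummy equals 1
        + coef["education[T.Masters]"]        # education dummy equals 1
        + coef["male_female[T.M]"] * is_male  # 1 for men, 0 for women
        + coef["age_years"] * age_years
        + coef["performance_score"] * performance_score
        + coef["seniority_years"] * seniority_years
    )

# Two hypothetical employees who differ only in gender
pay_female = predict_pay(is_male=0, age_years=30, performance_score=50, seniority_years=3)
pay_male = predict_pay(is_male=1, age_years=30, performance_score=50, seniority_years=3)

print(f"predicted pay (female): {pay_female:,.2f}")
print(f"predicted pay (male):   {pay_male:,.2f}")
print(f"difference:             {pay_male - pay_female:,.2f}")
```

The difference between the two predictions equals the `male_female[T.M]` coefficient, which is what "holding all other variables fixed" means in practice.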

### Exercise 8

Taking all of the factors into account, which one of the following jobs pays the highest?

<ul>
A. Marketing associate<br>
B. Sales representative<br>
C. Project manager<br>
D. Web designer<br>
</ul>

**Answer.**

-------

## Revisiting the question of pay discrimination

Now that we have looked at and accounted for various attributes that are correlated with pay, let's revisit the question of what is driving pay discrimination. Our last model, which accounts for all of the variables in the dataset, has an $R$-squared of 50.6%. This is a vast improvement over the simple model (pay vs. age), which had an $R$-squared of 23.8%.

### Exercise 9

Based on the analysis we have done so far, which of the following statements is correct? Select all that apply.

<ul>
I. After accounting for job title, education, performance and age, the proportion of pay difference attributable solely to gender is small.<br>
II. There is evidence that pay discrimination between men and women is due solely to gender.<br>
III. There is reason to believe that there could be a disproportionate number of women in lower paying jobs, while there could be more men in higher paying jobs, such as project manager or marketing associate.<br>
</ul>

**Answer.**

-------

## Investigating the distribution of gender across seniority and job types

Motivated by the previous exercise, let's look into how women are distributed across various factors. The following plot shows that men and women are similarly distributed by seniority:

![Gender vs. seniority (MALE)](data/images/gender_seniority_male.png)
![Gender vs. seniority (FEMALE)](data/images/gender_seniority_female.png)

However, looking at the distribution of women across various job types tells a different story. From the following barplot, we see that women are underrepresented in the project manager and marketing associate roles (which are the best paid). Also, women are disproportionately overrepresented in the lower-paid dog trainer job title:

![Gender vs. job title](data/images/gender_job_title.png)


## Conclusions

We used the techniques of linear regression to determine whether or not there was gender-based pay discrimination within your organization. We modeled the effect of various input variables (in this case, gender, age, education, job title, performance, and seniority) to explain the observed variation of an output variable (in this case, pay). We looked at the $R^2$ coefficient of our linear models to help us measure what percentage of observed variation in pay was explained by the input variables.

We saw that the difference in average salary between men and women is statistically significant. Further exploration of the data suggested that a big driver of this difference was due to women being overrepresented in the lowest-paying jobs, and underrepresented in the highest-paying jobs.

Thus, an investigation into the hiring, promotion, and job placement practices of men and women is warranted. In your report to the human resources department, you should ask them to look into the following questions:

1. Are women choosing to or being forced to take lower paying jobs?
2. Are women being discriminated against in hiring processes for higher-paying jobs?

## Takeaways

In this case, you learned how to leverage your skills in exploratory data analysis to build an effective linear model that accounted for several factors related to the outcome of interest (pay). Crucially, we learned that:

1. Looking directly at the relationship between the outcome of interest and the input variable of interest is not enough - there may be several confounding variables.
2. Conducting EDA before doing any model-building is important to discover and account for these confounding variables that could be driving variation in the outcome of interest.
3. $R^2$ is an important quantity that measures how well your model explains the observed variation. It can be used to compare different models.
4. Analyzing the coefficients of a linear regression to gain an understanding of how the various parameters impact the final output is extremely important - this interpretability is a key part of translating data to business action.

These days, the media constantly shines a spotlight on more advanced machine learning algorithms such as neural networks. It is important for you to recognize the immense value of linear regression, particularly for its inference capabilities and interpretability. While a neural network may outperform linear regression in certain tasks, it is much more of a black box, and having clarity on how the inputs of a model link to the outputs is extremely important in most scenarios.

## Attribution

"Correlation examples2.svg", 9 May 2011, DenisBoigelot, Public Domain, https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg

"R2values.svg", 6 April 2018, Debenben, CC BY-SA 4.0, https://de.wikipedia.org/wiki/Datei:R2values.svg

"Coefficient of Determination.svg", 6 September 2010, Orzetto, CC BY-SA 3.0, https://commons.wikimedia.org/wiki/File:Coefficient_of_Determination.svg