**D1DAE: Análise Estatística para Ciência de Dados** <br/>
IFSP Campinas

Prof. Dr. Samuel Martins <br/><br/>

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

#### Custom CSS

In [1]:
%%html
<style>
.dashed-box {
    border: 1px dashed black !important;
    /* font-size: var(--jp-content-font-size1) !important; */
}


.dashed-box tr {
    background-color: white !important;
}
</style>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)

# Simple Linear Regression

## 📊 1. Exploring the Data


A toy dataset created for studying simple linear regression. <br/>
https://www.kaggle.com/karthickveerakumar/salary-data-simple-linear-regression

The file used in this notebook is the same dataset with its rows shuffled.

### 1.1. Importing the Dataset

In [4]:
df = pd.read_csv('datasets/experience_salary_dataset.csv')
df

Unnamed: 0,YearsExperience,Salary
0,8.7,109431.0
1,5.3,83088.0
2,9.0,105582.0
3,6.0,93940.0
4,9.6,112635.0
5,2.9,56642.0
6,4.0,56957.0
7,3.9,63218.0
8,3.2,54445.0
9,3.2,64445.0


## 🤖 2. Estimating a Linear Regressor

In [18]:
# Getting the independent and dependent variables
X = df[['YearsExperience']]
y = df['Salary']

In [19]:
# splitting the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
# training a linear regression by OLS from the statsmodels package
# https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html

import statsmodels.api as sm

# Our model needs an intercept so we add a column of 1s
X_train = sm.add_constant(X_train)

# unlike sklearn's fit(X, y), statsmodels takes the target first: OLS(y, X)
model = sm.OLS(y_train, X_train)

# fit the line
results = model.fit()
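
To see why the column of 1s matters, here is a minimal NumPy-only sketch of the least-squares problem that an OLS fit solves. It assumes only the ten rows displayed in section 1.1 (not the notebook's training split), so its coefficients will differ from the ones reported below. The names `years`, `salary` and `design` are illustrative and not part of the notebook.

```python
import numpy as np

# The ten rows displayed in section 1.1 (a subset, for illustration only)
years = np.array([8.7, 5.3, 9.0, 6.0, 9.6, 2.9, 4.0, 3.9, 3.2, 3.2])
salary = np.array([109431.0, 83088.0, 105582.0, 93940.0, 112635.0,
                   56642.0, 56957.0, 63218.0, 54445.0, 64445.0])

# Design matrix: a column of 1s (the intercept term) next to the feature.
# This mirrors the layout that sm.add_constant produces.
design = np.column_stack([np.ones_like(years), years])

# Least-squares solution of design @ beta ~ salary
beta, *_ = np.linalg.lstsq(design, salary, rcond=None)
intercept, slope = beta

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Without the column of 1s, the same call would estimate a single slope and force the fitted line through the origin.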

In [21]:
results.params

const              24568.367141
YearsExperience     9694.407463
dtype: float64
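
Read as a line, these two parameters give `Salary ≈ const + YearsExperience × years`. The arithmetic can be sketched as follows, with the coefficients copied from the output above. The helper `predict_salary` and the 5-year input are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the results.params output above
CONST = 24568.367141   # intercept: estimated salary at 0 years of experience
SLOPE = 9694.407463    # estimated salary change per additional year

def predict_salary(years_experience: float) -> float:
    """Point estimate from the fitted line (hypothetical helper)."""
    return CONST + SLOPE * years_experience

estimate = predict_salary(5.0)
print(f"{estimate:.2f}")  # → 73040.40
```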

In [22]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 Salary   R-squared:                       0.949
Model:                            OLS   Adj. R-squared:                  0.947
Method:                 Least Squares   F-statistic:                     409.5
Date:                Tue, 18 Apr 2023   Prob (F-statistic):           1.04e-15
Time:                        15:14:59   Log-Likelihood:                -241.53
No. Observations:                  24   AIC:                             487.1
Df Residuals:                      22   BIC:                             489.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            2.457e+04   2811.221     

<br/>

We will cover _p-values_ and _hypothesis testing_ later; in this case, the test rejects the null hypothesis that the coefficients have no influence on the y-variable. <br/>
For now, let's focus only on the **confidence interval** for each _coefficient_.

Considering a **significance level α=5%**, and therefore a **confidence level CL=95%**, we can say: <br/>
We are **95% confident (confidence level)** that:
- For each _additional **year of experience**_, an employee's salary (in the population) is between _\$8,700.904_ and _\$10,700.0_ ***higher***.