In [None]:
# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib seaborn jupyterlab_myst ipython

---
license:
    code: MIT
    content: CC-BY-4.0
github: https://github.com/ocademy-ai/machine-learning
venue: By Ocademy
open_access: true
bibliography:
  - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib
---

##  Univariate linear regression


In this section, we will use a dataset containing real-life information about years of work experience and the corresponding salaries. We will explore the potential relationship in the data step by step and then fit a simple linear regression to it.

1 . We need to import some libraries and the dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

salary_dataset = pd.read_csv("../../assets/data/Salary_Data.csv")

2 . Understand the basic information and structure of the dataset.

In [None]:
# It displays the top 5 rows of the data
salary_dataset.head()

In [None]:
# It provides some information regarding the columns in the data
salary_dataset.info()

In [None]:
# It provides basic statistical characteristics of the dataset.
salary_dataset.describe()

Most of the time, it is difficult to identify correlations between data if we merely rely on viewing tables. Therefore, we need to make the dataset more visually intuitive and vivid!

3 . Visualize the salary dataset.

In [None]:
# A scatter plot shows how salary varies with years of experience
year = salary_dataset.YearsExperience
salary = salary_dataset.Salary
plt.scatter(year, salary)
plt.show()

The points lie close to a straight line, so a linear model is a reasonable choice. In the next step, we proceed with univariate linear regression.
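
Before fitting a line, the visual impression can be backed with a number. The sketch below computes Pearson's correlation coefficient with `np.corrcoef`. The arrays `years_demo` and `salary_demo` are hypothetical values made up for illustration, not rows from `Salary_Data.csv`.

```python
import numpy as np

# Hypothetical values for illustration only; not taken from Salary_Data.csv
years_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
salary_demo = np.array([40000.0, 43500.0, 50200.0, 55100.0, 61000.0, 64800.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r
r_demo = np.corrcoef(years_demo, salary_demo)[0, 1]
print(f"Pearson correlation: {r_demo:.3f}")
```

A value close to 1 suggests a strong positive linear relationship. The same call should work on the notebook's `year` and `salary` columns.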

4 . Split the dataset into the Training set and Test set.

First, extract the data for years of experience and salary from the dataset separately.

In [None]:
# Feature matrix: every column except the last (YearsExperience), kept as a 2-D array
X = salary_dataset.iloc[:, :-1].values

# Target vector: the column at index 1 (Salary), as a 1-D array
y = salary_dataset.iloc[:, 1].values


`X`: the first column, an array holding the years of experience.

`y`: the last column, an array holding the salaries.
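
The two slices return arrays of different shapes, which matters because scikit-learn estimators expect a 2-D feature matrix and accept a 1-D target. The following minimal sketch uses a hypothetical four-row frame, `df_demo`, that only mimics the column layout of the salary dataset.

```python
import pandas as pd

# Hypothetical two-column frame that mimics the layout of the salary dataset
df_demo = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5],
    "Salary": [39000.0, 43000.0, 54000.0, 61000.0],
})

X_demo = df_demo.iloc[:, :-1].values  # column slice -> 2-D array, one row per sample
y_demo = df_demo.iloc[:, 1].values    # single column -> 1-D array

print(X_demo.shape, y_demo.shape)  # (4, 1) (4,)
```

Selecting the feature with a slice (`:-1`) rather than a single index is what keeps `X` two-dimensional.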

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

* `test_size=0.3`: we split our dataset (30 observations) into two parts, a training set and a test set, and the **test set** takes 30% of the data (9 observations go into the **test set**). Passing `1/2` or `0.5` would give a 50% split; the two values are the same. The test set should not be too big; if it is, we will lack data to train on. A test set of around 5% to 30% is a common choice.

* `train_size`: if `test_size` is already set, the rest of the data is automatically assigned to the training set.

* `random_state`: this is the seed for the random number generator. We can also pass an instance of the **RandomState** class. If we leave it as `None` (the default), the global **RandomState** instance used by **np.random** is used and the split changes from run to run; an integer such as `0` or `100` makes the split reproducible.
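
The effect of these parameters can be checked on a synthetic stand-in for a 30-row dataset. This is a small sketch; `X_demo` and `y_demo` are made-up arrays, not the salary data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a dataset with 30 observations and one feature
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

# Two calls with the same integer seed
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)

print(len(X_tr), len(X_te))  # 21 9

# The same seed yields the same shuffle, so both test sets are identical
same_split = np.array_equal(X_te, X_te2)
print(same_split)  # True
```

Fixing the seed makes results comparable between runs, which is useful while following a tutorial.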

5 . Build the regression model

In [None]:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

`regressor = LinearRegression()`: the model object that implements linear regression.

`regressor.fit`: here we pass `X_train`, which contains the years of experience, and `y_train`, which contains the corresponding salaries, to fit the model. This is the training process.
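
To make the training step less of a black box: with a single feature, ordinary least squares has a closed-form solution, and `LinearRegression` should arrive at the same line. The sketch below compares the two on hypothetical data (`x_demo`, `y_demo`). It illustrates the textbook formula and does not describe scikit-learn's internal solver.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data for illustration only
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([41000.0, 45500.0, 52000.0, 55500.0, 62000.0])

# Closed-form least squares for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
x_mean, y_mean = x_demo.mean(), y_demo.mean()
slope = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# scikit-learn needs a 2-D feature matrix, hence the reshape
model_demo = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print("manual :", slope, intercept)
print("sklearn:", model_demo.coef_[0], model_demo.intercept_)
```

Both lines should print the same slope and intercept, up to floating-point rounding.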

In [None]:
# Predicting the Salary for the Test values
y_pred = regressor.predict(X_test)

6 . Visualize the fitted model with the training and test data

In [None]:
plt.scatter(X_train, y_train, color='red', label='Training data')
plt.scatter(X_test, y_test, color='blue', label='Test data')
plt.plot(X_train, regressor.predict(X_train), color='green')
plt.title('Salary VS Experience')
plt.xlabel('Year of Experience')
plt.ylabel('Salary')
plt.legend()  
plt.show()

The linear regression line is generated from the `Training data`. 

The red points in the graph represent the `Training data`, while the blue points represent the `Test data`.

It seems like our trained model performs well according to the plot. However, a plot alone is not enough; in most cases we need to quantify the error the model makes when it is used for predictions.

7 . Evaluate the model

**Mean squared error** (MSE), the mean of the squared differences between expected and predicted values, is a commonly used metric for evaluating model performance in regression problems.
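
The definition can be written out by hand and compared with scikit-learn's `mean_squared_error`. The arrays `y_true_demo` and `y_hat_demo` below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted salaries
y_true_demo = np.array([50000.0, 60000.0, 70000.0])
y_hat_demo = np.array([52000.0, 59000.0, 67000.0])

# Errors are -2000, 1000 and 3000; MSE is the mean of their squares
mse_manual = np.mean((y_true_demo - y_hat_demo) ** 2)

# The square root brings the error back to salary units
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true_demo, y_hat_demo))
print(rmse_manual)
```

Because MSE is expressed in squared units, its square root (RMSE) is usually easier to compare with the salaries themselves.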

In [None]:
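from sklearn.metrics import mean_squared_error, r2_score

# Mean squared error, in squared salary units
mse = mean_squared_error(y_test, y_pred)

# The root of the MSE is in salary units, so it can be compared with the mean salary
rmse = np.sqrt(mse)
print(f"MSE: {mse:.3g}, RMSE: {rmse:.3g} ({rmse / np.mean(y_test) * 100:.3g}% of the mean test salary)")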

Another indicator of model quality is `r2_score`, which measures how much of the variance of the target variable the model explains. The **R² score** is at most 1, and a value closer to 1 indicates a better fit of the model to the data; it is 0 for a model that always predicts the mean of the target and can be negative for a model that does worse than that.
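
R² is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that definition in a hypothetical helper, `r2_manual`, and checks it against `r2_score` on made-up values, including a deliberately poor set of predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_hat):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical actual salaries and two sets of predictions
y_true_demo = np.array([50000.0, 60000.0, 70000.0])
good_pred = np.array([52000.0, 59000.0, 67000.0])
bad_pred = np.array([70000.0, 60000.0, 50000.0])  # trend reversed on purpose

print(r2_manual(y_true_demo, good_pred), r2_score(y_true_demo, good_pred))
print(r2_manual(y_true_demo, bad_pred), r2_score(y_true_demo, bad_pred))
```

The first pair is about 0.93, while the reversed predictions give a negative score, which shows that the score has no lower bound of 0.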

In [None]:
score = r2_score(y_test,y_pred)
print("Coefficient of determination (R^2):", score)

8 . Obtain the regression line

In [None]:
# Intercept and coefficient of the line
print('Intercept of the model:',regressor.intercept_)
print('Coefficient of the line:',regressor.coef_)
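
The intercept and coefficient fully describe the fitted line, so a prediction can be reproduced by hand as `intercept + coefficient * years`. The sketch below checks this on a small hypothetical model, `demo_model`, fitted to made-up points that lie exactly on a line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points lying exactly on salary = 35000 + 5000 * years
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([40000.0, 45000.0, 50000.0, 55000.0])
demo_model = LinearRegression().fit(x_demo, y_demo)

# Predict for 6 years of experience in two ways
new_years = 6.0
manual_pred = demo_model.intercept_ + demo_model.coef_[0] * new_years
api_pred = demo_model.predict(np.array([[new_years]]))[0]

print(manual_pred, api_pred)
```

Both values should be about 65000. The same arithmetic applies to `regressor.intercept_` and `regressor.coef_` from the notebook.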

## Self study

In this tutorial, you worked with simple (univariate) linear regression, rather than multiple linear regression. Read a little about the differences between these methods, or take a look at [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef).

Read more about the concept of regression and think about what kinds of questions can be answered by this technique. Take this [tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-77952-leestott) to deepen your understanding.

## Your turn! 🚀

Plot a different variable from this dataset. Hint: edit this line: `X = X[:, np.newaxis, 2]`. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease?

Assignment - [Regression with scikit-learn](../../assignments/ml-fundamentals/regression-with-scikit-learn.md)

## Acknowledgments

Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter.

---