# Linear Regression in scikit-learn - Introduction

This dataset concerns housing values in the suburbs of Boston. The original dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University; here it is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/).

Your goal is to create and train a model that can estimate the median value of owner-occupied homes (`MEDV`).

### Dataset description (columns)

     1. CRIM     per capita crime rate by town
     2. ZN       proportion of residential land zoned for lots over
                 25,000 sq.ft.
     3. INDUS    proportion of non-retail business acres per town
     4. CHAS     Charles River dummy variable (= 1 if tract bounds
                 river; 0 otherwise)
     5. NOX      nitric oxides concentration (parts per 10 million)
     6. RM       average number of rooms per dwelling
     7. AGE      proportion of owner-occupied units built prior to 1940
     8. DIS      weighted distances to five Boston employment centres
     9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's of dollars
    

<a target = "_blank" href = "https://colab.research.google.com/github/PrzemekSekula/DeepLearningClasses1/blob/master/LinearRegressionSKLearn/LinearRegressionSKLearn - Empty.ipynb" >
<img src = "https://www.tensorflow.org/images/colab_logo_32px.png" / >
Run in Google Colab </a>


Load and display data.

In [4]:
import pandas as pd
import numpy as np

# Download the CSV file
!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv

# Load the data
df = pd.read_csv("housing.csv")

# Preview the data
df.head()


--2025-06-10 20:00:16--  https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38448 (38K) [text/plain]
Saving to: ‘housing.csv.2’


2025-06-10 20:00:17 (780 KB/s) - ‘housing.csv.2’ saved [38448/38448]



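|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |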


### Task 1
Select X (columns `['CRIM', 'TAX', 'RM']`) and y (column `MEDV`)

In [7]:
# Enter your code here
x = df[['CRIM', 'TAX', 'RM']]
x.head()

|   | CRIM | TAX | RM |
|---|---|---|---|
| 0 | 0.00632 | 296.0 | 6.575 |
| 1 | 0.02731 | 242.0 | 6.421 |
| 2 | 0.02729 | 242.0 | 7.185 |
| 3 | 0.03237 | 222.0 | 6.998 |
| 4 | 0.06905 | 222.0 | 7.147 |


In [8]:
# Enter your code here
# Double brackets keep y as a one-column DataFrame rather than a Series
y = df[['MEDV']]
y.head()

|   | MEDV |
|---|---|
| 0 | 24.0 |
| 1 | 21.6 |
| 2 | 34.7 |
| 3 | 33.4 |
| 4 | 36.2 |


### Task 2
Split the data into two subsets, with `random_state` set to 1:
- train subset: 70% of data
- test subset: 30% of data

In [9]:
# Enter your code here
from sklearn.model_selection import train_test_split

In [10]:
# Enter your code here
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)


print("x train: ", x_train.shape)
print("x test: ", x_test.shape)
print("y train: ", y_train.shape)
print("y test: ", y_test.shape)

x train:  (354, 3)
x test:  (152, 3)
y train:  (354, 1)
y test:  (152, 1)


### Task 3
Create and train a linear regression model.

In [None]:
# Enter your code here
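
One possible way to approach Task 3 is sketched below. So that the block runs on its own, it builds a small synthetic stand-in for the housing data: the coefficients, noise level, and sample count are invented for illustration, and only the column names mirror the notebook. In the notebook itself you would fit on the `x_train` and `y_train` created in Task 2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption: all values are invented).
rng = np.random.default_rng(0)
n = 200
x = pd.DataFrame({
    "CRIM": rng.exponential(3.0, n),
    "TAX": rng.uniform(200, 700, n),
    "RM": rng.normal(6.3, 0.7, n),
})
y = pd.DataFrame({
    "MEDV": 9.0 * x["RM"] - 0.2 * x["CRIM"] - 0.01 * x["TAX"] - 25
            + rng.normal(0, 3, n)
})

# Same split settings as Task 2.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1
)

# Create the model and fit it on the training subset only.
model = LinearRegression()
model.fit(x_train, y_train)

# One coefficient per feature plus a single intercept.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

Because `y` is a one-column DataFrame here, `model.coef_` comes back as a 2-D array with one row; with a 1-D target it would be a flat array.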

### Task 4
Compute the $R^2$ coefficient for the train and test datasets. Use `model.score()` to do it.

$$R^2 = 1 - \frac{\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $\overline{y}$ - mean value of `y`
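
To make the formula concrete, the sketch below computes $R^2$ term by term and compares it with `model.score()`. The data are synthetic (invented coefficients and noise), and the variable names `ss_res`, `ss_tot`, `r2_manual`, and `r2_from_score` are chosen here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (assumption: values are invented).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Numerator: sum of squared residuals between real values and predictions.
ss_res = np.sum((y - y_hat) ** 2)
# Denominator: sum of squared deviations of y from its mean.
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# model.score() returns the same coefficient for a regressor.
r2_from_score = model.score(X, y)
print(r2_manual, r2_from_score)
```

In the notebook, calling `model.score(x_train, y_train)` and `model.score(x_test, y_test)` gives the two values the task asks for.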

In [None]:
# Enter your code here

### MAPE - Mean Absolute Percentage Error

$$MAPE = \frac{1}{n} \sum{ \left\lvert{\frac{y-\hat{y}}{y}}\right\rvert}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $n$ - number of samples
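
As a quick worked example of the formula, the sketch below evaluates $MAPE$ by hand for three made-up values (they are not taken from the housing data). Note that the formula divides by $y$, so it is undefined whenever a real value equals zero.

```python
# Three invented real values and predictions.
y_true = [20.0, 25.0, 40.0]
y_pred = [22.0, 20.0, 40.0]

# Per-sample absolute percentage errors:
# |20 - 22| / 20 = 0.10, |25 - 20| / 25 = 0.20, |40 - 40| / 40 = 0.00
abs_pct_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]

# Average over the n = 3 samples: (0.10 + 0.20 + 0.00) / 3 = 0.10
mape_value = sum(abs_pct_errors) / len(abs_pct_errors)
print(round(mape_value, 4))
```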

### Task 5
Create a function `mape` that returns the $MAPE$ value given $X$, $y$, and the model that is used to create the $\hat{y}$ estimates. Then use your function to compute $MAPE$ for the train and test datasets.

In [None]:
def mape(model, X, y):
    # Enter your code here
    pass

In [None]:
# Enter your code here
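
A minimal sketch of one way to write `mape` follows; it is not the only valid solution. It flattens both arrays before dividing, which is an assumption made here to avoid accidental broadcasting when `y` is a one-column DataFrame and predictions are 2-D (or vice versa). The data and the simple slice-based split are synthetic and exist only to make the block runnable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def mape(model, X, y):
    """Return the mean absolute percentage error of model.predict(X) against y."""
    y_true = np.asarray(y, dtype=float).ravel()
    y_pred = np.asarray(model.predict(X), dtype=float).ravel()
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Synthetic data with strictly positive targets (assumption: invented values).
rng = np.random.default_rng(2)
X = rng.uniform(1, 10, size=(120, 3))
y = 5 + X @ np.array([1.5, 0.5, 2.0]) + rng.normal(scale=1.0, size=120)

# Simple 70/30 slice, used here only to keep the example short.
X_train, X_test = X[:84], X[84:]
y_train, y_test = y[:84], y[84:]

model = LinearRegression().fit(X_train, y_train)
print("train MAPE:", mape(model, X_train, y_train))
print("test MAPE:", mape(model, X_test, y_test))
```

The result is a fraction, so multiply by 100 if you want to report it as a percentage.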

## Random forest regressor

In [None]:
# Enter your code here
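
For this section, a random forest can be trained and scored with the same `fit` / `score` pattern as the linear model. The sketch below assumes scikit-learn's `RandomForestRegressor` with default settings and uses invented, deliberately nonlinear synthetic data; `random_state=1` on the forest is an extra choice made here so the run is repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (assumption: values are invented).
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * np.sin(X[:, 0]) + np.sqrt(X[:, 1]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Default forest: trees are grown until leaves are (nearly) pure.
forest = RandomForestRegressor(random_state=1)
forest.fit(X_train, y_train)

train_r2 = forest.score(X_train, y_train)
test_r2 = forest.score(X_test, y_test)
print("train R^2:", train_r2)
print("test R^2:", test_r2)
```

A train score that is clearly higher than the test score is the usual sign of overfitting to look for.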

### Task 6
Experiment with the `min_samples_leaf` parameter to avoid overfitting.

In [None]:
# Enter your code here
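
One way to run the experiment is to loop over a few candidate values and record the train and test $R^2$ for each. The candidate values `(1, 5, 20)` and the synthetic data below are arbitrary choices for illustration; suitable values for the housing data may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (assumption: values are invented).
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * np.sin(X[:, 0]) + np.sqrt(X[:, 1]) + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Map each min_samples_leaf value to its (train R^2, test R^2) pair.
results = {}
for leaf in (1, 5, 20):
    forest = RandomForestRegressor(min_samples_leaf=leaf, random_state=1)
    forest.fit(X_train, y_train)
    results[leaf] = (
        forest.score(X_train, y_train),
        forest.score(X_test, y_test),
    )
    print(f"min_samples_leaf={leaf}: "
          f"train R^2={results[leaf][0]:.3f}, test R^2={results[leaf][1]:.3f}")
```

Larger values force every leaf to average over more training samples, which typically lowers the train score and narrows the gap to the test score; the value worth keeping is the one with the best test score.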

# Part 2

### Task 7
Select all 13 features as $X$ and split the dataset into two subsets (using the same split ratio and random state as before).

In [None]:
# Enter your code here

In [None]:
# Enter your code here

In [None]:
# Enter your code here

In [None]:
# Enter your code here
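
A possible sketch for Task 7: take every column except the target as features, then reuse the Task 2 split settings. The DataFrame below is filled with random numbers purely so the block is runnable; only the column names come from the dataset description. In the notebook you would apply the same two selection lines to the real `df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
           "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Placeholder data (assumption: random values, not the real housing data).
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.uniform(0, 1, size=(100, len(columns))), columns=columns)

# All 13 features: everything except the target column.
X = df.drop(columns=["MEDV"])
y = df[["MEDV"]]

# Same split ratio and random state as in Task 2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)
```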

### Task 8
Train and test a linear regression model. Compare the results with the previous ones.

In [None]:
# Enter your code here

### Task 9
Train and test a Random Forest model (keep all parameters at their defaults). Does your model suffer from overfitting or underfitting?

In [None]:
# Enter your code here

### Task 10
Try modifying the `min_samples_leaf` parameter to get the best model possible.

In [None]:
# Enter your code here