Given the problem of predicting the sale price of a property from its size, we define:

$x=area$

$y=price$

This gives us the pairs $(x, y)$, which are the data used to fit our linear regression.
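
For reference, the model fitted to these pairs can be written as below. The symbols $\beta_0$ (intercept) and $\beta_1$ (slope) are notation assumed here; in the class defined later they correspond to `bias` and `weights`.

$$\hat{y} = \beta_0 + \beta_1 x$$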

The data come from a dataset published on Kaggle, available at the link below:

https://www.kaggle.com/code/ashydv/housing-price-prediction-linear-regression

Importing the libraries and loading the data.

In [97]:

import numpy as np
import pandas as pd
import random

import matplotlib.pyplot as plt 
import seaborn as sns

In [98]:
housing = pd.read_csv("Housing.csv")  # read_csv already returns a DataFrame

In [99]:
housing.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


Splitting the data into train and test sets

In [120]:
# Split the data into housing_train and housing_test

from sklearn.model_selection import train_test_split

housing_train, housing_test = train_test_split(housing, test_size=0.5, random_state=0)

In [121]:
X_train=housing_train[['area']]
X_test=housing_test[['area']]

In [122]:
y_train=housing_train[['price']]
y_test=housing_test[['price']]

In [124]:
X_train.head()


      area
2     9960
221   3420
513   4400
146  10500
241   3760

In [126]:
y_train.head()

        price
2    12250000
221   4767000
513   2485000
146   5600000
241   4550000

Now that we have separated the variables to be used, area and price, we can build the regression.

In [127]:
X_train.shape

(272, 1)

In [128]:
y_train.shape

(272, 1)

In [129]:
random.seed(32)
from numpy import linalg

Applying feature scaling to the data

Defining the linear regression class

In [148]:
class linear_regression():
    def __init__(self, learning_rate, iterations):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.weights = None
        self.min_X = None
        self.max_X = None
        self.bias = None
        self.min_y = None
        self.max_y = None
        
        
        
    def fit(self, X, y):
        X=np.column_stack((np.ones(len(X)), X))
        
        XtX=np.dot(X.T, X)
        XtX_inv=linalg.inv(XtX)
        XtY=np.dot(X.T, y)
        B=np.dot(XtX_inv, XtY)
        self.weights = B[1:]
        self.bias = B[0]

        
    def predict(self, X):
        X=np.column_stack((np.ones(len(X)), X))
        y_predicted = np.dot(X, self.weights) + self.bias
        return y_predicted   
    
    def cost_function(self, X, y):
        # Following the MSE (Mean Squared Error) formula
        y_predicted = np.dot(X, self.weights) + self.bias
        return np.mean((y_predicted - y) ** 2)
    
    def feature_scaling_X_train(self, X):
        self.min_X = np.min(X, axis=0)
        self.max_X = np.max(X, axis=0)
        return (X - self.min_X) / (self.max_X - self.min_X)
    
    def feature_scaling_y_train(self, y):
        self.min_y = np.min(y, axis=0)
        self.max_y = np.max(y, axis=0)
        return (y - self.min_y) / (self.max_y - self.min_y)
    
    def inverse_feature_scaling_y(self, y):
        return y * (self.max_y - self.min_y) + self.min_y
    
    def inverse_feature_scaling_X(self, X):
        return X * (self.max_X - self.min_X) + self.min_X
    
    def feature_scaling_X_test(self, X):

        return (X - self.min_X) / (self.max_X - self.min_X)
    

    def feature_scaling_y_test(self, y):
        return (y - self.min_y) / (self.max_y - self.min_y)
    
    # Coefficient of determination (R^2) of the model, not a classification accuracy
    
    def score(self, X, y):      
        y_predicted = self.predict(X)
        u = ((y - y_predicted) ** 2).sum()
        v = ((y - y.mean()) ** 2).sum()
        return 1 - u/v
    
    def plot_regression_line(self, X, y):
        plt.scatter(X, y, color = "m", marker = "o", s = 30)
        y_predicted = self.predict(X)
        plt.plot(X, y_predicted, color = "g")
        plt.xlabel('area')
        plt.ylabel('price')
        plt.show()
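
The `fit` method above solves the normal equation $B = (X^T X)^{-1} X^T y$, where $X$ carries a leading column of ones for the intercept. As a cross-check, here is a minimal sketch on synthetic data (the values are made up for illustration) that compares this closed form against `np.linalg.lstsq`:

```python
import numpy as np

# Synthetic data, made up for illustration: y = 2 + 3x exactly
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = 2.0 + 3.0 * x

# Design matrix with a leading column of ones for the intercept
X = np.column_stack((np.ones(len(x)), x))

# Normal equation, as in fit(): B = (X^T X)^{-1} X^T y
B_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Least-squares solver, which avoids forming the explicit inverse
B_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(B_normal)  # intercept and slope, approximately [2, 3]
print(B_lstsq)
```

Both give the same coefficients here. The solver is usually the more numerically robust choice when $X^T X$ is close to singular.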

                

In [131]:
# Instantiate the model (learning_rate and iterations are stored but not used by the closed-form fit)

model = linear_regression(learning_rate=1, iterations=1000)

In [144]:
# Scale the training data
X_train = model.feature_scaling_X_train(X_train)
y_train = model.feature_scaling_y_train(y_train)
print(X_train)

         area
2    0.586813
221  0.107692
513  0.179487
146  0.626374
241  0.132601
..        ...
70   0.150183
277  0.616117
9    0.278388
359  0.120879
192  0.340659

[272 rows x 1 columns]
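
Min-max scaling maps each value to $(x - x_{min}) / (x_{max} - x_{min})$. The printed values are consistent with a training minimum of 1950 and a maximum of 15600; these two numbers are inferred from the output above and are not shown in the notebook. A quick check under that assumption:

```python
# Assumed training-set extremes, inferred from the printed output
min_area, max_area = 1950, 15600

def min_max_scale(x, lo, hi):
    """Map x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# Raw areas taken from the first rows of the X_train listing shown earlier
for area in (9960, 3420, 4400, 10500, 3760):
    print(area, round(min_max_scale(area, min_area, max_area), 6))
```

Under these assumed extremes, the five results round to the first five rows printed above.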


In [145]:
# Train the model
model.fit(X_train, y_train)

In [149]:
# Scale the test data using the min and max learned from the training data
X_test = model.feature_scaling_X_test(X_test)
y_test = model.feature_scaling_y_test(y_test)
print(X_test)
    

         area
239 -0.142868
113 -0.142868
325 -0.142868
66  -0.142868
479 -0.142868
..        ...
494 -0.142868
322 -0.142868
253 -0.142868
299 -0.142868
463 -0.142868

[273 rows x 1 columns]
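
Every row of `X_test` printing the same value, -0.142868, suggests that something went wrong before `predict` was called. One possible explanation (an assumption, not confirmed by the notebook) is that the scaling cell was run several times on the already scaled `X_test`. Repeated min-max scaling converges to the fixed point $x^* = -x_{min} / (x_{max} - x_{min} - 1)$ whatever the starting value. With training extremes of 1950 and 15600 (inferred from the scaled `X_train` output, also an assumption), that fixed point is about -0.142868:

```python
# Assumed training-set extremes, inferred from the scaled X_train output
min_area, max_area = 1950, 15600

def min_max_scale(x, lo, hi):
    """Map x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# Re-apply the scaling to its own output, as re-running the cell would do
x = 7420.0  # arbitrary starting area
for _ in range(10):
    x = min_max_scale(x, min_area, max_area)

# Solving x = (x - lo) / (hi - lo) for x gives the fixed point
fixed_point = -min_area / (max_area - min_area - 1)
print(round(x, 6), round(fixed_point, 6))
```

If this is the cause, scaling from the untouched `housing_test[['area']]` each time, instead of overwriting `X_test` in place, would keep the cell safe to re-run.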


In [150]:
# Test the model
y_predicted = model.predict(X_test)

ValueError: shapes (273,2) and (1,1) not aligned: 2 (dim 1) != 1 (dim 0)
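
The error comes from `predict`: it prepends a column of ones, which gives `X` the shape (273, 2), but `self.weights` holds only the slope (`B[1:]`, shape (1, 1)) because the intercept is stored separately in `self.bias`. One way to make the shapes agree is to drop the `column_stack` call in `predict` and keep adding `self.bias`. A minimal sketch of that change, using a hypothetical stripped-down class `SimpleLinearRegression` and made-up data:

```python
import numpy as np

class SimpleLinearRegression:
    """Hypothetical stripped-down version of the class above."""

    def fit(self, X, y):
        # Add the intercept column only while solving for the coefficients
        Xb = np.column_stack((np.ones(len(X)), X))
        B = np.linalg.inv(Xb.T @ Xb) @ (Xb.T @ y)
        self.bias = B[0]
        self.weights = B[1:]

    def predict(self, X):
        # No column of ones here: the intercept is added through self.bias
        return np.dot(X, self.weights) + self.bias

# Made-up data with shape (n, 1), like X_train and y_train
X = np.array([[0.0], [0.5], [1.0]])
y = 1.0 + 2.0 * X

model = SimpleLinearRegression()
model.fit(X, y)
y_predicted = model.predict(np.array([[0.25], [0.75]]))
print(y_predicted.shape)  # (2, 1)
```

An alternative with the same effect would be to keep the column of ones in `predict` and multiply by the full coefficient vector `B` instead of splitting it into `weights` and `bias`.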