# Introduction 
#### In this project, I use linear regression to predict real estate prices based on the area of the property

## Libraries and tools used in this project

#### 1 - Python
#### 2 - Pandas
#### 3 - Numpy
#### 4 - Scikit-Learn
#### 5 - Statsmodels (Statistical Approach)
#### 6 - Matplotlib
#### 7 - Seaborn

### Importing Libraries

In [49]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Importing Dataset

In [50]:
df = pd.read_csv('../input/real-estate/real_estate.csv')
df

In [51]:
df.info()

In [52]:
df.describe().round(2)

### Selecting Feature and Target

In [53]:
X = df['area']

In [54]:
y = df['price']

### Exploring Data

In [55]:
plt.scatter(X,y)
plt.xlabel('Area', fontsize=15)
plt.ylabel('Price', fontsize=15)
plt.show()

In [56]:
sns.regplot(data=df, x=X, y=y)

#### **The plot shows a roughly linear relationship between area and price, so a linear regression model is a reasonable choice for this data**

### Model Building & Training

#### **I will build this model with two libraries and compare the results**

### 1] Using Statsmodels (Statistical Approach)

In [57]:
import statsmodels.api as sm

In [58]:
X_stat = sm.add_constant(X)
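
The constant is needed because `sm.OLS` does not fit an intercept unless the design matrix contains a column of ones, which is what `sm.add_constant` prepends. A minimal NumPy sketch of the effect, using made-up values (`area_demo` is a hypothetical stand-in for `df['area']`, and this mimics the result rather than calling statsmodels):

```python
import numpy as np

# Hypothetical area values standing in for df['area'].
area_demo = np.array([650.0, 800.0, 1200.0])

# Prepend a column of ones so the model can learn an intercept term.
design_demo = np.column_stack([np.ones_like(area_demo), area_demo])
print(design_demo.shape)  # (3, 2): one constant column, one feature column
```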

In [59]:
reg_stat = sm.OLS(y,X_stat).fit()

In [60]:
reg_stat.params

In [61]:
# The linear equation
# y = mx + b
# price = 223.178743 * area + 101912.601801

#### Now let's try the same thing with Scikit-Learn

### 2] Using Scikit-Learn (Machine Learning Approach)

In [62]:
from sklearn.linear_model import LinearRegression

In [63]:
X_sk = X.values.reshape(-1,1)
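
The reshape is required because scikit-learn estimators expect a 2-D feature array of shape `(n_samples, n_features)`, while a single pandas column yields a 1-D array. A small illustration with made-up values (`area_1d` is hypothetical, not the real column):

```python
import numpy as np

# Hypothetical 1-D array, like the output of df['area'].values.
area_1d = np.array([650.0, 800.0, 1200.0])

# -1 lets NumPy infer the number of rows; 1 means a single feature column.
area_2d = area_1d.reshape(-1, 1)
print(area_1d.shape)  # (3,)
print(area_2d.shape)  # (3, 1)
```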

In [64]:
reg_sk = LinearRegression().fit(X_sk, y)

In [65]:
reg_sk.coef_

In [66]:
reg_sk.intercept_

In [67]:
# using Linear equation
# y = mx + b
# price = 223.178743 * area + 101912.601801

#### **Both libraries give the same results. Now let's build the prediction model**
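
The agreement is expected: both libraries solve the same ordinary least squares problem, which for a single feature has a closed-form solution (slope = covariance of x and y divided by variance of x, intercept = mean of y minus slope times mean of x). A minimal sketch on synthetic data, assuming nothing about the real dataset (`area_demo` and `price_demo` are generated only for illustration):

```python
import numpy as np

# Synthetic data for illustration only (not the real estate dataset).
rng = np.random.default_rng(0)
area_demo = rng.uniform(500, 1500, size=50)
price_demo = 200.0 * area_demo + 100_000.0 + rng.normal(0, 5_000, size=50)

# Closed-form OLS estimates for a single feature.
x_mean, y_mean = area_demo.mean(), price_demo.mean()
slope = np.sum((area_demo - x_mean) * (price_demo - y_mean)) / np.sum((area_demo - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# np.polyfit with deg=1 solves the same least squares problem,
# returning the coefficients from highest degree to lowest.
slope_np, intercept_np = np.polyfit(area_demo, price_demo, deg=1)
print(np.isclose(slope, slope_np), np.isclose(intercept, intercept_np))
```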

### Model Prediction 

#### **As before, we start with Statsmodels**

### 1] Using Statsmodels (Statistical Approach)

##### Evaluating Results

In [68]:
reg_stat.summary()

In [69]:
# y = mx + b
# price = 223.178743 * area (X) + 101912.601801

In [70]:
# Let's substitute the values into the equation.
# Use the original area column X here: X_stat also contains the constant column.
plt.scatter(X,y)
y_hat = 223.178743 * X + 101912.601801
fig = plt.plot(X, y_hat, c = 'r')
plt.xlabel('Area', fontsize = 15)
plt.ylabel('Price', fontsize = 15)
plt.show()

### 2] Using Scikit-Learn (Machine Learning Approach)

#### First, let's predict the price for an arbitrary area

In [71]:
reg_sk.predict([[1000]])

It seems to work well, so let's continue.
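
As a sanity check, the prediction for an area of 1000 can be reproduced by hand from the linear equation. This sketch assumes the fitted coefficients match the values quoted in the equation earlier in this notebook:

```python
# Coefficients copied from the linear equation quoted earlier in the notebook.
slope = 223.178743
intercept = 101912.601801

area = 1000
price = slope * area + intercept
print(round(price, 2))  # 325091.34
```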

#### Splitting Data

In [72]:
from sklearn.model_selection import train_test_split

In [73]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
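
Here `test_size=0.33` holds out roughly a third of the rows for evaluation, and `random_state=42` fixes the shuffle so the split is reproducible. The idea can be sketched with NumPy alone. This illustrates the concept rather than scikit-learn's exact procedure, and `n_rows` is a made-up dataset size:

```python
import numpy as np

n_rows = 100  # hypothetical number of rows in the dataset

# A seeded generator makes the shuffle reproducible across runs.
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_rows)

# Hold out about 33% of the shuffled row indices for testing.
n_test = int(round(0.33 * n_rows))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
print(len(train_idx), len(test_idx))
```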

#### Model Re-Building

In [78]:
from sklearn.linear_model import LinearRegression

In [80]:
reg = LinearRegression()

In [81]:
reg.fit(X_train.values.reshape(-1,1), y_train)

In [82]:
y_pred = reg.predict(X_test.values.reshape(-1,1))

### Evaluating Results

Scikit-learn Regression Metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

Regression Metrics Demo: https://www.geogebra.org/m/yybenxjm

Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE):
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE):
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE):
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

All of these are **loss functions**, because we want to minimize them.
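
To make the formulas concrete, here is a small worked example that computes all three metrics directly with NumPy. The values in `y_true_demo` and `y_pred_demo` are made up for illustration:

```python
import numpy as np

# Made-up true and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

errors = y_true_demo - y_pred_demo   # [1, 0, -2]
mae = np.mean(np.abs(errors))        # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)           # (1 + 0 + 4) / 3 ≈ 1.667
rmse = np.sqrt(mse)                  # ≈ 1.291
print(mae, mse, rmse)
```

Because RMSE is in the same units as the target (price), it is usually the easiest of the three to interpret.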

In [83]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [84]:
# MAE
print(mean_absolute_error(y_test, y_pred))

In [85]:
# MSE
print(mean_squared_error(y_test, y_pred))

In [86]:
# RMSE
print(np.sqrt(mean_squared_error(y_test, y_pred)))

In [87]:
np.mean(y_test)

In [88]:
np.mean(y_pred)

In [90]:
# R^2 on the training set, using the model fitted on the training split (reg)
reg.score(X_train.values.reshape(-1,1), y_train)

In [91]:
# R^2 on the test set
reg.score(X_test.values.reshape(-1,1), y_test)

In [92]:
# r2_score expects the true values first, then the predictions
r2_score(y_test, y_pred)
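
R² is not symmetric in its arguments: the total sum of squares is computed around the mean of the true values, which is why `r2_score` takes `y_true` first and `y_pred` second. A minimal sketch with made-up numbers (`r2_manual` is a hypothetical helper written for this illustration, not a scikit-learn function):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around mean(y_true)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up true and predicted values.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.1, 1.9, 3.2, 3.8])

print(r2_manual(y_true_demo, y_pred_demo))  # ≈ 0.98
print(r2_manual(y_pred_demo, y_true_demo))  # ≈ 0.978 (arguments swapped)
```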

## Please leave an upvote and a comment to help me continue my data science journey and improve my work. Thanks!