# Data Processing and Visualization Tutorial 📊 Linear Regression

## Prediction of House Price Using Linear Regression

### Data

The Sacramento real estate transactions file lists 985 real estate transactions in the Sacramento area over a five-day period, as reported by the Sacramento Bee.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### 1. Explore Data

Perform exploratory data analysis (EDA) on this dataset and identify the dependent and independent variables for predicting house price.

In [None]:
# Read CSV
house_csv = "/Sacramento_transactions.csv"
house = pd.read_csv(house_csv)
house.head()

In [None]:
house.dtypes # Check data types

In [None]:
house.describe() # summary statistics

In [None]:
import folium

# Create a map object centered on the mean coordinates of the listings
m = folium.Map(location=[house['latitude'].mean(), house['longitude'].mean()], zoom_start=12)

# Add markers for each location
for index, row in house.iterrows():
    folium.Marker([row['latitude'], row['longitude']], popup=row['street']).add_to(m)

# Display the map
m

In [None]:
house['city'].value_counts() # Count of unique values

In [None]:
# check for missing values
house.isnull().sum()

In [None]:
def draw_scatter_plot(x, y):
    plt.scatter(x, y)
    plt.xlabel('Square Feet')
    plt.ylabel('Price')
    plt.title('Scatter Plot of Square Feet vs Price')
    plt.show()

draw_scatter_plot(house['sq__ft'], house['price'])

### 2. Predict Price

We are going to predict `price` from available information.

#### 2.1 What is the Target Variable? Why?

In [None]:
# The target variable is 'price', because it is the quantity we want to predict.

#### 2.2 List all possible variables which might be Independent/Predictor variable.

In [None]:
# Variables that might serve as independent/predictor variables:
# 'beds', 'baths', 'sq__ft', 'type', 'city', 'state', 'zip', 'sale_date', 'latitude', 'longitude'

#### 2.3 Find correlation between variables

Find correlation between variables. Which is the best predictor? Why? State your reason.

In [None]:
numeric_columns = house.select_dtypes(include=[np.number]).columns # Select only numeric columns
corr = house[numeric_columns].corr() # Calculate correlation matrix
corr
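
To answer "which is the best predictor?", it can help to rank each numeric column by the absolute value of its correlation with `price`. The sketch below illustrates the idea on synthetic stand-in data, not the Sacramento file; the `demo` frame and its values are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Sacramento file); column names mirror the tutorial.
rng = np.random.default_rng(0)
sq_ft = rng.uniform(500, 3000, size=200)
demo = pd.DataFrame({
    "sq__ft": sq_ft,
    "beds": np.round(sq_ft / 600 + rng.normal(0, 0.7, size=200)),
    "price": 100 * sq_ft + rng.normal(0, 40_000, size=200),
})

# Correlation of every numeric column with the target, strongest first.
price_corr = (
    demo.corr()["price"]
    .drop("price")                          # the target trivially correlates with itself
    .sort_values(key=abs, ascending=False)  # rank by strength, ignoring sign
)
print(price_corr)
```

On the real data, the same pattern would be `corr["price"].drop("price")` applied to the matrix computed above.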

#### 2.4 Find Coefficient and Intercept using Linear Regression

Use `LinearRegression` from the `sklearn.linear_model` package to find the coefficient and intercept.

Create an instance of `LinearRegression`.

Explore the following methods:

- fit
- predict
- score
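
As a quick orientation, the three methods can be tried on a tiny noise-free dataset where the answer is known in advance. This is a minimal sketch on synthetic data following y = 50 + 3x; the names `X_demo`, `y_demo`, and `demo_model` are illustrative and not part of the housing analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 50 + 3x (illustration only).
X_demo = np.arange(10).reshape(-1, 1)   # 2-D array: (n_samples, n_features)
y_demo = 50 + 3 * X_demo.ravel()

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)               # estimates intercept_ and coef_
new_pred = demo_model.predict([[10]])        # prediction for an unseen x
r2_demo = demo_model.score(X_demo, y_demo)   # R^2 on the data passed in

print(demo_model.intercept_, demo_model.coef_, new_pred, r2_demo)
```

Because the data are exactly linear, the fitted intercept and slope recover 50 and 3 and the score is 1 up to floating-point error; real housing data will be far noisier.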

In [None]:
from sklearn.linear_model import LinearRegression

##### 2.4.1 Fit predictor and target variables using linear regression

In [None]:
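# Select the predictor (as a 2-D array) and the target
X = house['sq__ft'].values.reshape(-1,1) # values converts it into a numpy array
Y = house['price']

# Fit a linear regression on the full dataset
linreg = LinearRegression()
linreg.fit(X, Y)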

##### 2.4.2 Find $R^2$ Score

Find $R^2$ using the `score` method of `LinearRegression`.
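
For reference, the value returned by `score` is the coefficient of determination. For targets $y_i$, predictions $\hat{y}_i$, and mean target $\bar{y}$, it is defined as:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A value of 1 means perfect prediction, 0 means the model does no better than always predicting the mean, and the value can be negative on held-out data.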

### 3. Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

#### 3.1 Create training and testing subsets

Hint: use the `train_test_split` function.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

In [None]:
# 
## Create training and testing subsets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
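
To see what `test_size` and `random_state` do, the following sketch splits a synthetic 1000-row array twice with the same settings. The arrays and the `xa_`/`xb_` names are assumptions for illustration only; they stand in for `X` and `Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in arrays (1000 rows, one feature), not the housing data.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = 2 * X_demo.ravel()

xa_train, xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
xb_train, xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# test_size=0.2 holds out 20% of the rows for testing.
print(xa_train.shape, xa_test.shape)
# Same random_state gives the identical split, so results are reproducible.
print(np.array_equal(xa_test, xb_test))
```

Rows are shuffled before splitting, and each `X` row stays paired with its `y` value.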

#### 3.2 Check Shape, Sample of Test Train Data

In [None]:
# 
## Check training/test data
print("X Training data shape:", pd.DataFrame(x_train).shape)
print("X Testing data shape:", pd.DataFrame(x_test).shape)


print("Sample of training data:")
print(pd.DataFrame(x_train[:10])) # Display first 10 records

print("Sample of testing data:")
print(pd.DataFrame(x_test[:10])) # Display first 10 records

In [None]:
print("Y Training data:", pd.DataFrame(y_train).shape) # Check Y training data
print("Y Testing data:", pd.DataFrame(y_test).shape) # Check Y testing data

print("Sample of Y training data:") # Display sample of Y training data
print(pd.DataFrame(y_train[:10])) # Display sample of Y training data

print("Sample of Y testing data:") # Display sample of Y testing data
print(pd.DataFrame(y_test[:10])) # Display sample of Y testing data

#### 3.3 Using Linear Regression Find The Score

1. Fit model using `x_train`, `y_train`
2. Find score using `x_test`, `y_test`

In [None]:
model = LinearRegression() # Create a Linear Regression Model object

In [None]:
# 
model.fit(x_train, y_train) # Fit the model

In [None]:
model.intercept_, model.coef_ # Display the intercept and coefficient

In [None]:
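a = model.intercept_  # intercept
b = model.coef_       # slope, an array of shape (1,)
x = 1204              # square footage of the house to price

Ypred = a + b*x       # price = intercept + slope * sq__ft
Ypred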

In [None]:
print('Predicted House Price:', Ypred[0])
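
For a single-feature model, a prediction is the intercept plus the coefficient times the input, which is what `predict` computes. The sketch below checks that equivalence on synthetic data; the prices and square footages are made up for illustration and `demo_model` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
sqft_demo = rng.uniform(800, 3000, size=50).reshape(-1, 1)
price_demo = 30_000 + 120 * sqft_demo.ravel() + rng.normal(0, 5_000, size=50)

demo_model = LinearRegression().fit(sqft_demo, price_demo)

x_new = 1204
by_hand = demo_model.intercept_ + demo_model.coef_[0] * x_new  # intercept + slope * x
by_predict = demo_model.predict([[x_new]])[0]                  # same value via predict

print(by_hand, by_predict)
```

The two numbers agree up to floating-point error, which makes the manual formula a handy sanity check on the fitted parameters.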

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, model.predict(x_test)) # Calculate Mean Squared Error

In [None]:
# 
y_pred = model.predict(x_test) # Predictions
y_pred # Predicted values

### 4. Predict House Price

Assume we have the following information about a house:

- street:	1140 EDMONTON DR
- city:	SACRAMENTO
- zip:	95833
- state:	CA
- beds:	3
- baths:	2
- sq__ft:	1204
- type:	Residential

**Predict the price of this house using linear regression model.**

In [None]:
# Define the features of the house (the model uses only square footage)
new_house = pd.DataFrame({'sq__ft': [1204]})

# Predict the price; .values matches the NumPy array the model was fitted on
predicted_price = model.predict(new_house[['sq__ft']].values)
round_price = round(predicted_price[0], 2)
print(f"Predicted Price of the House: ${round_price:,.2f}") # Predicted price of the house

#### Find the error

In [None]:
y_preds = model.predict(x_test) # Predictions from the model fitted on the training data

y_preds[:10]

In [None]:
model.score(x_test, y_test) # Calculate the R-squared value

In [None]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, y_preds) # Mean squared error on the test set
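
The mean squared error is in squared dollars, which is hard to read. Taking its square root gives the root mean squared error (RMSE), in the same units as `price`. A minimal sketch with hypothetical actual and predicted prices (not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted prices, for illustration only.
y_true_demo = np.array([200_000, 150_000, 320_000, 275_000])
y_pred_demo = np.array([210_000, 140_000, 300_000, 280_000])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # average squared error
rmse_demo = np.sqrt(mse_demo)                            # back in dollars
print(mse_demo, rmse_demo)
```

Here the errors are 10,000, 10,000, 20,000, and 5,000 dollars, giving an MSE of 156,250,000 and an RMSE of 12,500. On the real data, the same computation would be `np.sqrt(mean_squared_error(y_test, y_preds))`.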