# Linear Regression: Predicting House Rental Prices

This example parallels [Chapter 16 in Inferential Thinking](https://inferentialthinking.com/chapters/16/Inference_for_Regression.html).

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

**The following description is copied from this [Kaggle dataset](https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data)**

### Context
The spectrum of housing options in India is incredibly diverse, spanning from the opulent palaces once inhabited by maharajas of yore, to the contemporary high-rise apartment complexes in bustling metropolitan areas, and even to the humble abodes in remote villages, consisting of modest huts. This wide-ranging tapestry of residential choices reflects the significant expansion witnessed in India's housing sector, which has paralleled the upward trajectory of income levels in the country. According to the findings of the Human Rights Measurement Initiative, India currently achieves 60.9% of what is theoretically attainable, considering its current income levels, in ensuring the fundamental right to housing for its citizens. In the realm of housing arrangements, renting, known interchangeably as hiring or letting, constitutes an agreement wherein compensation is provided for the temporary utilization of a resource, service, or property owned by another party. Within this arrangement, a gross lease is one where the tenant is obligated to pay a fixed rental amount, and the landlord assumes responsibility for covering all ongoing property-related expenses. The concept of renting also aligns with the principles of the sharing economy, as it fosters the utilization of assets and resources among individuals or entities, promoting efficiency and access to housing solutions for a broad spectrum of individuals.

### Content
Within this dataset, you will find a comprehensive collection of data pertaining to nearly 4700+ available residential properties, encompassing houses, apartments, and flats offered for rent. This dataset is rich with various attributes, including the number of bedrooms (BHK), rental rates, property size, number of floors, area type, locality, city, furnishing status, tenant preferences, bathroom count, and contact information for the respective point of contact.

### Dataset Glossary (Column-Wise)
* BHK: Number of Bedrooms, Hall, Kitchen.
* Rent: Rent of the Houses/Apartments/Flats.
* Size: Size of the Houses/Apartments/Flats in Square Feet.
* Floor: Houses/Apartments/Flats situated in which Floor and Total Number of Floors (Example: Ground out of 2, 3 out of 5, etc.)
* Area Type: Size of the Houses/Apartments/Flats calculated on either Super Area or Carpet Area or Build Area.
* Area Locality: Locality of the Houses/Apartments/Flats.
* City: City where the Houses/Apartments/Flats are Located.
* Furnishing Status: Furnishing Status of the Houses/Apartments/Flats, either it is Furnished or Semi-Furnished or Unfurnished.
* Tenant Preferred: Type of Tenant Preferred by the Owner or Agent.
* Bathroom: Number of Bathrooms.
* Point of Contact: Whom should you contact for more information regarding the Houses/Apartments/Flats.

In [None]:
# Load the Data
house = Table.read_table('data/House_Rent_Dataset.csv')
house

In [None]:
house.stats()

In [None]:
# Explore likely predictor of rental price: the size of the house
plt.scatter(house['Size'], house['Rent'])
plt.xlabel('Size')
plt.ylabel('Rent');

In [None]:
np.unique(house.column('City'))

In [None]:
# Choose a particular city. Prices probably vary by city.
kolkata = house.where('City', are.equal_to('Kolkata'))
plt.scatter(kolkata['Size'], kolkata['Rent'])
plt.xlabel('Size')
plt.ylabel('Rent');

In [None]:
# Remove the one extreme outlier,
# though it might be interesting to know why such a small house rents so high.
kolkata = kolkata.where('Rent', are.below(100000))
plt.scatter(kolkata['Size'], kolkata['Rent'])
plt.xlabel('Size')
plt.ylabel('Rent');

The scatter plot shows a positive association between house size and rental price: larger houses tend to rent for more.

In [None]:
def standard_units(values):
    """Convert any array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)


def correlation(t, label_x, label_y):
    """Return the correlation coefficient r of two columns of table t."""
    return np.mean(
        standard_units(t.column(label_x)) * standard_units(t.column(label_y))
    )


# Regression
def slope(t, label_x, label_y):
    """Return the slope of the regression line: r * SD(y) / SD(x)."""
    r = correlation(t, label_x, label_y)
    return r * np.std(t.column(label_y)) / np.std(t.column(label_x))


def intercept(t, label_x, label_y):
    """Return the intercept of the regression line: mean(y) - slope * mean(x)."""
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y) * np.mean(
        t.column(label_x)
    )


def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

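As a sanity check on the formulas above, the sketch below compares the correlation-based slope and intercept with an ordinary least-squares fit from `np.polyfit`. It uses synthetic data, and the names `demo_x`, `demo_y`, and the generating parameters are assumptions made for illustration; none of it comes from the rental dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the rental dataset)
rng = np.random.default_rng(0)
demo_x = rng.uniform(300, 3000, size=200)
demo_y = 12.0 * demo_x + 2000.0 + rng.normal(0, 3000, size=200)

# Slope and intercept from the correlation formulas
x_su = (demo_x - np.mean(demo_x)) / np.std(demo_x)
y_su = (demo_y - np.mean(demo_y)) / np.std(demo_y)
r = np.mean(x_su * y_su)
slope_from_r = r * np.std(demo_y) / np.std(demo_x)
intercept_from_r = np.mean(demo_y) - slope_from_r * np.mean(demo_x)

# Slope and intercept from a degree-1 least-squares fit
# (np.polyfit returns the highest-degree coefficient first)
slope_ls, intercept_ls = np.polyfit(demo_x, demo_y, 1)

print(np.isclose(slope_from_r, slope_ls))
print(np.isclose(intercept_from_r, intercept_ls))
```

The two approaches should agree up to floating-point error, because the regression line defined through `r` is the least-squares line.
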
In [None]:
# Find the correlation coefficient
correlation(kolkata, "Size", "Rent")

In [None]:
kolkata.scatter('Size', 'Rent')
slp = slope(kolkata, 'Size', 'Rent')
inter = intercept(kolkata, 'Size', 'Rent')
print("Slope: %4.2f Intercept:  %4.2f" % (slp, inter))
plt.scatter(0, inter)  # mark the intercept at x = 0
# Overlay the regression line on the scatter plot
plt.plot(kolkata.column('Size'), fit(kolkata, 'Size', 'Rent'), lw=4, color='gold')
plt.show()

## Bootstrap to find the confidence interval for the slope

In [None]:
slopes = make_array()
for i in np.arange(5000):
    bootstrap_sample = kolkata.sample()
    bootstrap_slope = slope(bootstrap_sample, 'Size', 'Rent')
    slopes = np.append(slopes, bootstrap_slope)
Table().with_column('Bootstrap Slopes', slopes).hist(bins=20)

## Find the 95% interval

In [None]:
left = percentile(2.5, slopes)
right = percentile(97.5, slopes)
left, right

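The same percentile bootstrap can be sketched with NumPy alone, which may help clarify what `Table.sample()` and `percentile` are doing. Everything here is an illustrative assumption: the data are synthetic, and the helper names `slope_xy` and `bootstrap_slope_interval` are hypothetical. Note that `np.percentile` interpolates between values by default, so its endpoints can differ slightly from those of the `datascience` `percentile` function.

```python
import numpy as np

def slope_xy(x, y):
    """Regression slope of y on x, computed as r * SD(y) / SD(x)."""
    r = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))
    return r * y.std() / x.std()

def bootstrap_slope_interval(x, y, repetitions=2000, seed=0):
    """Approximate 95% percentile-bootstrap interval for the slope."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(repetitions)
    for i in range(repetitions):
        # Resample (x, y) pairs with replacement, keeping the sample size
        idx = rng.integers(0, n, size=n)
        slopes[i] = slope_xy(x[idx], y[idx])
    return np.percentile(slopes, 2.5), np.percentile(slopes, 97.5)

# Synthetic data (an assumption for illustration; not the rental dataset)
data_rng = np.random.default_rng(1)
demo_x = data_rng.uniform(300, 3000, size=300)
demo_y = 12.0 * demo_x + 2000.0 + data_rng.normal(0, 3000, size=300)

lo, hi = bootstrap_slope_interval(demo_x, demo_y)
print(lo, hi)
print("Interval excludes 0:", not (lo <= 0 <= hi))
```

If the interval excludes 0, the data are inconsistent (at roughly the 5% level) with a true slope of 0, which is how Chapter 16 uses this interval.
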
## Prediction Interval

Suppose we want the rent predicted for a house of 1500 sq ft, along with an approximate 95% confidence interval for the height of the regression line at that size.

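In symbols, the prediction and its bootstrap interval can be written roughly as follows. The notation is an assumption introduced here for illustration: $a$ and $b$ are the slope and intercept of the regression line, and $\hat{y}^{*}_{1}, \dots, \hat{y}^{*}_{B}$ are the fitted values at $x = 1500$ from $B$ bootstrap resamples.

```latex
\hat{y}(x) = a\,x + b,
\qquad a = r\,\frac{\mathrm{SD}(y)}{\mathrm{SD}(x)},
\qquad b = \bar{y} - a\,\bar{x}

\text{approximate 95\% interval at } x = 1500:\quad
\Bigl[\, P_{2.5}\bigl(\hat{y}^{*}_{1}, \dots, \hat{y}^{*}_{B}\bigr),\;
        P_{97.5}\bigl(\hat{y}^{*}_{1}, \dots, \hat{y}^{*}_{B}\bigr) \Bigr]
```

Here $P_{q}$ denotes the $q$th percentile of the bootstrapped predictions.
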
In [None]:
def fitted_value(table, x, y, given_x):
    """Return the height of the regression line at x = given_x."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b

In [None]:
kolkata.scatter('Size', 'Rent')
inter = intercept(kolkata, 'Size', 'Rent')
plt.scatter(0, inter)  # mark the intercept at x = 0
plt.plot(kolkata.column('Size'), fit(kolkata, 'Size', 'Rent'), lw=4, color='gold')
# Vertical segment from 0 up to the fitted rent at 1500 sq ft
plt.plot([1500, 1500], [0, fitted_value(kolkata, 'Size', 'Rent', 1500)], lw=4)
plt.show()

In [None]:
# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):
    
    # For each repetition:
    # Bootstrap the scatter; 
    # get the regression prediction at new_x; 
    # augment the predictions list
    predictions = make_array()
    for i in np.arange(repetitions):
        bootstrap_sample = table.sample()
        # Use a distinct local name so it does not shadow the function name
        predicted = fitted_value(bootstrap_sample, x, y, new_x)
        predictions = np.append(predictions, predicted)
        
    # Find the ends of the approximate 95% prediction interval
    left = percentile(2.5, predictions)
    right = percentile(97.5, predictions)
    
    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)
    
    # Display results
    Table().with_column('Prediction', predictions).hist(bins=20)
    plt.xlabel('predictions at x='+str(new_x))
    plt.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print(left, right)

In [None]:
bootstrap_prediction(kolkata, 'Size', 'Rent', 1500, 5000)

In [None]:
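kolkata.scatter('Size', 'Rent')
inter = intercept(kolkata, 'Size', 'Rent')
plt.scatter(0, inter)  # mark the intercept at x = 0
plt.plot(kolkata.column('Size'), fit(kolkata, 'Size', 'Rent'), lw=4, color='gold')
# Interval endpoints are hard-coded, presumably from one run of
# bootstrap_prediction; rerunning the bootstrap gives slightly different values.
plt.plot([1500, 1500], [18172., 22654.], lw=4)
plt.show()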