# Bangalore House Price Prediction Project

Let's start the project by downloading the data set from Kaggle.

Before downloading it, go to [Kaggle](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data) and have a look at the data.

### Install the Kaggle package if not already installed

    !pip install kaggle

### Add the JSON file to the Kaggle folder

    C:/Users/.../.kaggle/kaggle.json

The file comes from your Kaggle account. Go to [Kaggle Account](https://www.kaggle.com/harrykt/account) and create a new API token.

This downloads the `kaggle.json` file. It holds your username and API key, so that Kaggle knows who is downloading the data set. Treat it like a password rather than a public key and keep it private.
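
Before running the download, it can help to check that the token is where the Kaggle command-line tool looks for it. Here is a minimal sketch, assuming the default location `.kaggle/kaggle.json` inside your home folder (the same folder as the Windows path above). The helper names `kaggle_credentials_path` and `has_kaggle_credentials` are made up for this example and are not part of the Kaggle package.

```python
from pathlib import Path


def kaggle_credentials_path(home: Path) -> Path:
    """Assumed default token location: <home>/.kaggle/kaggle.json."""
    return home / ".kaggle" / "kaggle.json"


def has_kaggle_credentials(home: Path) -> bool:
    """Return True if a non-empty kaggle.json exists under the given home folder."""
    token = kaggle_credentials_path(home)
    return token.is_file() and token.stat().st_size > 0


# Check the current user's home folder
print("Kaggle token found:", has_kaggle_credentials(Path.home()))
```

If this prints `False`, the download command will most likely complain about missing credentials, so it is worth fixing the file location first.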

### Download the data set

This can be done using the following command.

    !kaggle datasets download -d amitabhajoy/bengaluru-house-price-data

---

## Unzipping the data set

After downloading the data set, you can unzip it with the following script rather than doing it manually.

Why?
Just to look cool. Haha

    import zipfile
    with zipfile.ZipFile("bengaluru-house-price-data.zip","r") as zip_ref:
        zip_ref.extractall("bengaluru-house-price-data")
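
If you also want to see what was inside the archive, the same idea can be wrapped in a small function that returns the member names. This is only a sketch: `extract_archive` is a hypothetical helper, and the demo builds a throwaway archive in a temporary folder so it runs without the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the names of its members."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(target_dir)
    return names


# Demo on a tiny archive created on the fly (not the real data set)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "out")
    demo_ok = (Path(tmp) / "out" / "example.csv").is_file()

print(extracted, demo_ok)
```

For the real file you would call `extract_archive("bengaluru-house-price-data.zip", "bengaluru-house-price-data")`.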

---

## Organizing the folder

Let's move the CSV file to the `dataset` folder, and delete the extracted folder and the zip file, which are no longer needed.

This can be done using:

    import os
    import shutil

Moving the csv file to the dataset folder:
    
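    # Create the destination folder first; shutil.move fails if it is missing
    os.makedirs('dataset', exist_ok=True)
    src = os.path.abspath('bengaluru-house-price-data/Bengaluru_House_Data.csv')
    dst = os.path.abspath('dataset/Bengaluru_House_Data.csv')
    _ = shutil.move(src, dst)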
    
Deleting unnecessary files. 
Make sure the path is correct, because this deletes things permanently.
I am **not responsible** if you do it incorrectly.
You can just ignore this step and do it manually.

    src_1 = os.path.abspath('bengaluru-house-price-data/')
    src_2 = os.path.abspath('bengaluru-house-price-data.zip')
    shutil.rmtree(src_1)
    os.remove(src_2)
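
Since the deletion is permanent, one way to make it a bit safer is to refuse to delete anything that is not inside the folder you expect. The sketch below is one possible guard, not the method used in this project; `remove_if_inside` is a hypothetical helper and the demo runs entirely in a temporary folder.

```python
import os
import shutil
import tempfile


def remove_if_inside(path, root):
    """Delete path (file or folder) only if it lies strictly inside root.

    Returns True if something was deleted, False otherwise.
    """
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    try:
        inside = os.path.commonpath([path, root]) == root and path != root
    except ValueError:
        # Raised on Windows when the two paths are on different drives
        inside = False
    if not inside or not os.path.exists(path):
        return False
    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)
    return True


# Demo in a temporary folder: a sub-folder and a file are removed,
# the root itself and its parent are left alone
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "extracted")
    os.makedirs(folder)
    archive = os.path.join(root, "archive.zip")
    open(archive, "w").close()
    results = [
        remove_if_inside(folder, root),
        remove_if_inside(archive, root),
        remove_if_inside(root, root),
        remove_if_inside(os.path.dirname(root), root),
    ]

print(results)
```

In the project you could pass the project folder as `root`, so that a typo in the path cannot wipe something outside of it.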

---

## Data Cleaning

Import the necessary libraries and clean the data.

In [5]:

    # Importing necessary libraries
    import pandas as pd
    import numpy as np

    import matplotlib
    from matplotlib import pyplot as plt
    %matplotlib inline
    matplotlib.rcParams["figure.figsize"] = (20, 10)


In [7]:

    csv_path = os.path.abspath('dataset/Bengaluru_House_Data.csv')
    df = pd.read_csv(csv_path)
    df.head()

|   | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK |  | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK |  | 1200 | 2.0 | 1.0 | 51.0 |


In [8]:

    df.shape

    (13320, 9)

In [9]:

    df.groupby('area_type')['area_type'].agg('count')

    area_type
    Built-up  Area          2418
    Carpet  Area              87
    Plot  Area              2025
    Super built-up  Area    8790
    Name: area_type, dtype: int64
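
The same kind of count can also be obtained with `value_counts`, which sorts by frequency and can report shares instead of raw counts. A small sketch on made-up data (the `toy` frame is not the real data set):

```python
import pandas as pd

# Toy stand-in for the 'area_type' column; the values are illustrative only
toy = pd.DataFrame({"area_type": ["Plot Area", "Built-up Area", "Plot Area",
                                  "Carpet Area", "Plot Area", "Built-up Area"]})

# Counts per category, indexed by category label (as in the groupby above)
by_group = toy.groupby("area_type")["area_type"].agg("count")

# value_counts gives the same numbers, sorted by frequency instead
by_value = toy["area_type"].value_counts()

# normalize=True turns the counts into shares of all rows
shares = toy["area_type"].value_counts(normalize=True)

print(by_group, by_value, shares, sep="\n\n")
```

On the real data this would show, for example, that the 8790 "Super built-up Area" rows make up about 66% of the 13320 rows.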

In [10]:

    new_df = df.drop(['area_type', 'society', 'balcony', 'availability'], axis='columns')
    new_df.head()

|   | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.0 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.0 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.0 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.0 |


In [12]:

    new_df.isnull().sum()

    location       1
    size          16
    total_sqft     0
    bath          73
    price          0
    dtype: int64

## Dropping NULL Values

From the counts above, somewhere between 73 and 90 rows contain at least one NULL value: at least 73, because `bath` alone has 73, and at most 90 (1 + 16 + 73) if no row is missing more than one field.
That is less than 1% of the 13320 rows.
So instead of filling in those values, we can drop the rows directly.
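
The bounds above can be checked on a small example. This is a sketch on made-up data (the `toy` frame is not the real data set): the largest per-column count is a lower bound, the sum of the counts is an upper bound, and `isnull().any(axis=1)` gives the exact number of affected rows.

```python
import numpy as np
import pandas as pd

# Toy frame with overlapping missing values; the data is illustrative only
toy = pd.DataFrame({
    "location": ["A", None, "C", "D", "E"],
    "size": ["2 BHK", "3 BHK", None, "1 BHK", None],
    "bath": [2.0, np.nan, np.nan, 1.0, 3.0],
})

per_column = toy.isnull().sum()       # missing values per column
lower_bound = int(per_column.max())   # at least this many rows are affected
upper_bound = int(per_column.sum())   # at most this many rows are affected

# Exact number of rows with at least one missing value
rows_with_null = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows
rows_left = len(toy.dropna())

print(lower_bound, rows_with_null, upper_bound, rows_left)
```

Running the same `isnull().any(axis=1).sum()` on `new_df` would give the exact number of rows that `dropna()` removes.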

In [14]:

    df3 = new_df.dropna()
    df3.isnull().sum()

    location      0
    size          0
    total_sqft    0
    bath          0
    price         0
    dtype: int64

## Finding out unique values

Some values in the `size` column mean the same thing but are written in different formats (for example `4 BHK` and `4 Bedroom`), so let's make the data consistent.

In [15]:

    df3['size'].unique()

    array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
           '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
           '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
           '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
           '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
           '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)

In [16]:

    # Copy first so the assignment does not trigger SettingWithCopyWarning
    df3 = df3.copy()
    df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
    df3.head()

|   | location | size | total_sqft | bath | price | bhk |
|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 | 2 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.0 | 4 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.0 | 3 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.0 | 3 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.0 | 2 |
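
The `bhk` extraction above relies on every `size` value starting with a number followed by a space. An alternative sketch uses the pandas string accessor with a regular expression; it runs on a made-up `sizes` series and assumes, as in the list of unique values above, that each value begins with digits.

```python
import pandas as pd

# A few values in the formats seen in the 'size' column (illustrative sample)
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "10 BHK", "3 Bedroom"])

# Approach from the notebook: split on the space and convert the first part
rooms_split = sizes.apply(lambda x: int(x.split(' ')[0]))

# Alternative: capture the leading digits with a regular expression
rooms_regex = sizes.str.extract(r"^(\d+)", expand=False).astype(int)

# assign() returns a new frame with the extra column, leaving the input untouched
frame = pd.DataFrame({"size": sizes}).assign(bhk=rooms_regex)

print(frame)
```

Both approaches give the same numbers here. The regular expression version does not depend on the separator being a single space.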


In [23]:

    df3 = df3.drop(['bhk'], axis=1)
    df3.head()

|   | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.0 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.0 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.0 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.0 |
