## Housing Case Study
Problem Statement:

Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants:

- To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.
- To create a linear model that quantitatively relates house prices to variables such as number of rooms, area, number of bathrooms, etc.
- To know the accuracy of the model, i.e. how well these variables can predict house prices.

In [2]:
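#1 Import the dataset (Housing.csv) and preview it
import pandas as pd

# Local path from the author's machine; adjust it to wherever Housing.csv is stored.
df = pd.read_csv('C:/Users/NAGARAJU/Downloads/Housing.csv')
df  # shows the first and last rows; df.head() would show only the top five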

index,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished
...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished


In [3]:
#2 Check the shape of the DataFrame (rows, columns)
df.shape

(545, 13)

In [4]:
#3 Look at the data types of the columns
df.dtypes

price                int64
area                 int64
bedrooms             int64
bathrooms            int64
stories              int64
mainroad            object
guestroom           object
basement            object
hotwaterheating     object
airconditioning     object
parking              int64
prefarea            object
furnishingstatus    object
dtype: object

In [5]:
#4 Check for missing values; if there are any, replace them with appropriate values
df.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

In [6]:
#5 Prepare x (independent variables) and y (dependent variable)
x = df.drop('price', axis=1)
y = df['price']

In [11]:
#6 Visualise the relationship between the independent variables and the dependent variable using scatterplots
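The cell for this step is empty. A minimal sketch of one way to fill it, assuming matplotlib is installed: since Housing.csv is not bundled here, a small hypothetical frame `demo` (five rows copied from the preview above) stands in for `df`, and `plot_feature_scatter` is an illustrative helper rather than part of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_scatter(frame, target, features):
    """Draw one scatterplot of `target` against each column in `features`."""
    fig, axes = plt.subplots(1, len(features),
                             figsize=(4 * len(features), 3.5), squeeze=False)
    for ax, col in zip(axes[0], features):
        ax.scatter(frame[col], frame[target], alpha=0.6)
        ax.set_xlabel(col)
        ax.set_ylabel(target)
    fig.tight_layout()
    return fig, axes[0]


# Hypothetical stand-in for df: five rows copied from the preview above.
demo = pd.DataFrame({
    "price": [13300000, 12250000, 12250000, 12215000, 11410000],
    "area": [7420, 8960, 9960, 7500, 7420],
    "bedrooms": [4, 4, 3, 4, 4],
    "parking": [2, 3, 2, 3, 2],
})
fig, axes = plot_feature_scatter(demo, "price", ["area", "bedrooms", "parking"])
```

In the notebook itself, the call would presumably take `df` and the numeric column names in place of `demo`.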


In [7]:
#7 Encoding categorical data in X
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
categorical_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea', 'furnishingstatus']
for col in categorical_columns:
    # Classes are coded in sorted order: no -> 0, yes -> 1;
    # furnished -> 0, semi-furnished -> 1, unfurnished -> 2
    x[col] = labelencoder.fit_transform(x[col])


In [None]:
#8 Avoiding the Dummy Variable Trap
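The cell for this step is empty. The dummy variable trap arises when a categorical column is one-hot encoded into one indicator per category, because the indicators then sum to one and are collinear with the intercept. The label encoding in step #7 creates no indicator columns, so the sketch below shows an alternative encoding, not what the notebook does: `pd.get_dummies` with `drop_first=True`, applied to a hypothetical frame `demo`.

```python
import pandas as pd

# Hypothetical stand-in for the categorical part of X.
demo = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
    "mainroad": ["yes", "no", "yes", "yes"],
})

# One-hot encode and drop the first level of each column, so k categories
# become k-1 indicator columns and no column is implied by the others.
encoded = pd.get_dummies(demo, columns=["furnishingstatus", "mainroad"],
                         drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

The dropped level ("furnished", "no") becomes the baseline against which the remaining coefficients are interpreted.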



In [20]:
#9 Apply feature scaling on numerical variables
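The cell for this step is empty. A minimal sketch, assuming standardisation (zero mean, unit variance) is the intended scaling; `MinMaxScaler` would be another reasonable choice. The frame `demo` is a hypothetical stand-in for the numeric columns of `x`, built from rows in the preview above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the numeric columns of x.
demo = pd.DataFrame({
    "area": [7420, 8960, 9960, 7500, 7420],
    "bedrooms": [4, 4, 3, 4, 4],
    "bathrooms": [2, 4, 2, 2, 1],
    "stories": [3, 4, 2, 2, 2],
    "parking": [2, 3, 2, 3, 2],
})
numeric_columns = ["area", "bedrooms", "bathrooms", "stories", "parking"]

# Standardise each column: subtract its mean and divide by its standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(demo[numeric_columns]),
    columns=numeric_columns,
    index=demo.index,
)
print(scaled.round(2))
```

Fitting the scaler on all of `x` before the split in step #10 lets test-set statistics influence the training data. A stricter variant would call `fit_transform` on `x_train` and only `transform` on `x_test`.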


In [23]:
#10 Split data into training and testing sets (70% train, 30% test): x_train, x_test, y_train, y_test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [25]:
x_train

index,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
126,7160,3,1,1,1,0,1,0,0,2,1,2
363,3584,2,1,1,1,0,0,1,0,0,0,1
370,4280,2,1,1,1,0,0,0,1,2,0,1
31,7000,3,1,4,1,0,0,0,1,2,0,1
113,9620,3,1,1,1,0,1,0,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
71,6000,4,2,4,1,0,0,0,1,0,0,2
106,5450,4,2,1,1,0,1,0,1,0,1,1
270,4500,3,2,3,1,0,0,1,0,1,0,0
435,4040,2,1,1,1,0,0,0,0,0,0,2


In [26]:
x_test

index,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
316,5900,4,2,2,0,0,1,0,0,1,0,2
77,6500,3,2,3,1,0,0,0,1,0,1,0
360,4040,2,1,1,1,0,0,0,0,0,0,1
90,5000,3,1,2,1,0,0,0,1,0,0,1
493,3960,3,1,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
395,3600,6,1,2,1,0,0,0,0,1,0,2
425,3185,2,1,1,1,0,1,0,0,2,0,0
195,4410,4,3,2,1,0,1,0,0,2,0,1
452,9000,3,1,2,1,0,0,0,0,2,0,1


In [27]:
y_train

126    5880000
363    3710000
370    3640000
31     8400000
113    6083000
        ...   
71     6755000
106    6160000
270    4340000
435    3290000
102    6195000
Name: price, Length: 381, dtype: int64

In [28]:
y_test

316    4060000
77     6650000
360    3710000
90     6440000
493    2800000
        ...   
395    3500000
425    3360000
195    4970000
452    3150000
154    5530000
Name: price, Length: 164, dtype: int64

In [33]:
#11 Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression

mlr=LinearRegression()
mlr.fit(x_train,y_train)
y_pred=mlr.predict(x_train)
y_pred

array([ 5332298.80162373,  3560000.36855075,  4404360.98837882,
        6422828.76367875,  6350894.77980776,  6125803.7400548 ,
        3154901.01336098,  4641274.2767926 ,  8811989.65931224,
        5151873.94105849,  7362142.90153429,  8117826.44806162,
        3323065.80214996,  6980537.59138662,  5358025.69101714,
        4519725.42144618,  3419875.40619322,  4377646.55495528,
        6385788.92284523,  5819537.0668614 ,  5258023.92404966,
        3835676.52871266,  6749594.55318048,  3217825.87201888,
        3232110.43832742,  3000311.50208294,  8531123.97262171,
        6055252.83880228,  6263181.51388486,  4355706.89446947,
        4899309.8183723 ,  7461955.39104298,  5328194.73506062,
        4087447.07413147,  2857431.11794584,  8414396.76104224,
        4547026.95451487,  5155699.58974984,  5750985.84469392,
        5445311.17081142,  2661054.75112932,  5668149.26438808,
        3256024.39161894,  4857095.26556867,  7102084.11357081,
        5724018.21510571,  6186517.53781

In [34]:
#12 Predict on the train set and calculate the error = y_pred - y_train
error=y_pred-y_train
error

126   -5.477012e+05
363   -1.499996e+05
370    7.643610e+05
31    -1.977171e+06
113    2.678948e+05
           ...     
71    -1.913138e+05
106    2.095515e+05
270    1.985625e+06
435   -4.325689e+05
102    9.410638e+05
Name: price, Length: 381, dtype: float64

In [None]:
#13 Residual plot (train set): plot y_pred on the x-axis and errors on the y-axis
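The cell for this step is empty. A minimal sketch, assuming matplotlib is installed: `plot_residuals` is an illustrative helper, and `demo_pred` / `demo_true` are hypothetical stand-ins built from the first five train-set predictions and targets printed above (rounded to two decimals).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_residuals(y_pred, y_true, title):
    """Scatter predictions (x-axis) against errors y_pred - y_true (y-axis)."""
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - np.asarray(y_true, dtype=float)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(y_pred, errors, alpha=0.6)
    ax.axhline(0, color="grey", linestyle="--")  # zero-error reference line
    ax.set_xlabel("Predicted price")
    ax.set_ylabel("Error (y_pred - y_true)")
    ax.set_title(title)
    return ax, errors


# Hypothetical stand-ins: first five train predictions and targets from the outputs above.
demo_pred = [5332298.80, 3560000.37, 4404360.99, 6422828.76, 6350894.78]
demo_true = [5880000, 3710000, 3640000, 8400000, 6083000]
ax, errors = plot_residuals(demo_pred, demo_true, "Residuals (train set)")
```

In the notebook, the call would presumably be `plot_residuals(y_pred, y_train, ...)` after step #12. The same helper can be applied to the test-set predictions and `y_test` for step #15. Points scattered evenly around the zero line, with no funnel or curve, are what a well-specified linear model would typically show.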

In [35]:
#14 Predict on the test set (this overwrites the train-set y_pred)
y_pred=mlr.predict(x_test)
y_pred

array([5407508.87024418, 7097185.46706855, 3055462.44314053,
       4476945.19636315, 3315983.65663579, 3618373.03255259,
       5758111.46044028, 6466502.43909126, 2830273.16469119,
       2588804.65810567, 9649589.31414054, 2830606.51113843,
       3048137.62898116, 3392779.60203048, 3823232.9673009 ,
       5358170.87034031, 2955016.41578148, 4836054.53230682,
       4603068.47740645, 3551464.60674927, 5625018.82657786,
       5796938.54363456, 2758483.74755246, 4873266.20950521,
       5600804.93370716, 7772078.63540938, 3381536.16270183,
       5370732.06725796, 8352665.9587942 , 3406110.06934798,
       6335677.41367624, 3427228.10570008, 6740746.88053742,
       4205633.93578768, 3624702.80095917, 5797171.46441145,
       5080025.13346592, 4386055.52335342, 3070137.54474224,
       4635050.40917587, 4743419.55702888, 3433682.48420934,
       7076940.4807988 , 4096598.07073101, 3741261.35302813,
       4308416.36745432, 6678982.6364043 , 4092649.04459023,
       3872211.05471678,

In [None]:
#15 Residual plot (test set): plot y_pred on the x-axis and errors on the y-axis

In [None]:
#16 Print Mean Squared Error and R Squared Value
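The cell for this step is empty. A minimal sketch using `mean_squared_error` and `r2_score` from `sklearn.metrics`: `demo_true` and `demo_pred` are hypothetical stand-ins built from the first five test targets and predictions printed above, so the numbers they produce say nothing about the full model.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical stand-ins: first five test targets and predictions from the outputs above.
demo_true = [4060000, 6650000, 3710000, 6440000, 2800000]
demo_pred = [5407508.87, 7097185.47, 3055462.44, 4476945.20, 3315983.66]

mse = mean_squared_error(demo_true, demo_pred)  # mean of squared errors
r2 = r2_score(demo_true, demo_pred)             # 1 - SS_residual / SS_total

print(f"Mean Squared Error: {mse:.3e}")
print(f"R Squared: {r2:.3f}")
```

In the notebook, the same two calls would presumably take `y_test` and the `y_pred` from step #14.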

In [None]:
#17 Check the Adjusted R Squared value (by selecting different numbers of input variables instead of all of them)
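The cell for this step is empty. scikit-learn has no built-in adjusted R Squared, so the sketch below computes it from the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of input variables. `adjusted_r2` is an illustrative helper, and the R Squared values passed to it are made up for the example. Only n = 164 (test rows) and p = 12 (columns in `x`) come from the outputs above.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical R^2 values, for illustration only.
full_model = adjusted_r2(0.65, n_samples=164, n_features=12)    # all 12 inputs
reduced_model = adjusted_r2(0.64, n_samples=164, n_features=5)  # a 5-input subset

print(f"Adjusted R Squared, 12 inputs: {full_model:.4f}")
print(f"Adjusted R Squared,  5 inputs: {reduced_model:.4f}")
```

With these made-up inputs the smaller model scores higher on adjusted R Squared despite its lower raw R Squared, which is the kind of trade-off this step is meant to surface. In the notebook, each candidate subset would presumably be refit with `LinearRegression` and scored with `r2_score` before being passed to the helper.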