# Problem Statement

Airbnb, Inc. is an online marketplace for arranging or offering lodging (primarily homestays) and tourism experiences. Airbnb has close to 150 million customers worldwide. Price is the most important factor customers consider when booking a property, so pricing properties strategically is important to avoid losing customers to competitors.  
  
We have data on 74,111 Airbnb properties across countries. Based on this data, build simple and multiple linear regression models to predict a strategic price for a newly listed property on Airbnb.


In [7]:
import numpy as np   
import pandas as pd    
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt   
import matplotlib.style

### Importing data

In [8]:
# reading the CSV file into pandas dataframe
df = pd.read_csv("ThreeCars.csv")  

### EDA

In [9]:
# Check top few records to get a feel of the data structure
df.head()

,Unnamed: 0,Price,Age,Mileage,Porsche,Jaguar,BMW
0,1,69.4,3,21.5,1,0,0
1,2,56.9,3,43.0,1,0,0
2,3,49.9,2,19.9,1,0,0
3,4,47.4,4,36.0,1,0,0
4,5,42.9,4,44.0,1,0,0


In [10]:
df.shape

(90, 7)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  90 non-null     int64  
 1   Price       90 non-null     float64
 2   Age         90 non-null     int64  
 3   Mileage     90 non-null     float64
 4   Porsche     90 non-null     int64  
 5   Jaguar      90 non-null     int64  
 6   BMW         90 non-null     int64  
dtypes: float64(2), int64(5)
memory usage: 5.0 KB


In [13]:
df.describe(include="all")

,Unnamed: 0,Price,Age,Mileage,Porsche,Jaguar,BMW
count,90.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,45.5,37.575556,5.655556,41.321889,0.333333,0.333333,0.333333
std,26.124701,17.641265,3.895146,23.516371,0.474045,0.474045,0.474045
min,1.0,12.0,0.0,0.67,0.0,0.0,0.0
25%,23.25,23.9,3.25,20.75,0.0,0.0,0.0
50%,45.5,33.7,5.0,42.85,0.0,0.0,0.0
75%,67.75,49.975,7.0,59.825,1.0,1.0,1.0
max,90.0,83.0,22.0,100.7,1.0,1.0,1.0


### Unique values for categorical variables
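
No cell survives under this heading. As a minimal sketch of what such a step might look like, the snippet below lists the unique values of each categorical column and then one-hot encodes them. The toy DataFrame `listings` and its values are hypothetical; only the column-name pattern (for example `room_type_Private room`) mirrors the names used later in this notebook.

```python
import pandas as pd

# Hypothetical toy listings; the values are illustrative, not taken from the real data.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "cancellation_policy": ["flexible", "moderate", "strict", "flexible"],
    "accommodates": [4, 2, 1, 2],
})

# Columns treated as categorical in this sketch (an assumption).
categorical_cols = ["room_type", "cancellation_policy"]

# Show the distinct levels and how often each one occurs.
for col in categorical_cols:
    print(col, "->", sorted(listings[col].unique()))
    print(listings[col].value_counts())

# One-hot encode; drop_first=True drops the first level of each column,
# which then acts as the baseline category in a regression.
encoded = pd.get_dummies(listings, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True`, the dropped level (here `Entire home/apt` and `flexible`) is absorbed into the intercept, which avoids perfectly collinear dummy columns.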

### Linear Regression Model
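
The cells in this section use `X_train`, `X_test`, `y_train`, and `y_test`, which are not created in the cells shown here. The following is a minimal sketch of how such a split might be produced. The toy data, the 70/30 ratio, and the `random_state` value are illustrative assumptions, not the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the prepared listings table.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
})
toy["log_price"] = (
    3.5 + 0.1 * toy["accommodates"] + 0.2 * toy["bathrooms"]
    + rng.normal(0, 0.3, size=n)
)

# Predictors and target. y is kept as a one-column DataFrame so that
# LinearRegression returns a 2-D coef_, matching the coef_[0][idx] indexing used below.
X = toy.drop("log_price", axis=1)
y = toy[["log_price"]]

# Hold out 30% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```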

In [None]:
# Instantiate LinearRegression and find the best-fit model on the training data
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)


In [None]:
regression_model.coef_

In [None]:
# enumerate() returns a lazy iterator, so wrap it in list() to see the (index, name) pairs
print(list(enumerate(X_train.columns)))

In [None]:
# Let us explore the coefficients for each of the independent attributes
for idx, col_name in enumerate(X_train.columns):
    print(idx,col_name)
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

In [None]:
# Let us check the intercept for the model

intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

In [None]:
# R square on training data
regression_model.score(X_train, y_train)

About 50% of the variation in log_price is explained by the predictors in the model on the training set.
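
The "share of variation explained" reading comes from the definition R² = 1 - SS_res / SS_tot. As a small sketch on synthetic data (the data and coefficients are made up for illustration), the manual calculation agrees with what `LinearRegression.score` returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, purely illustrative.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.5, size=100)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Residual sum of squares: variation left unexplained by the model.
ss_res = np.sum((y - y_hat) ** 2)
# Total sum of squares: variation of y around its mean.
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X, y)
print(r2_manual, r2_sklearn)
```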

In [None]:
# R square on testing data
regression_model.score(X_test, y_test)

In [None]:
import math

In [None]:
#RMSE on Training data
# The model is already fitted above, so predict directly instead of refitting
predicted_train = regression_model.predict(X_train)
np.sqrt(metrics.mean_squared_error(y_train, predicted_train))

In [None]:
#RMSE on Testing data
# Reuse the model fitted on the training data; no refit is needed
predicted_test = regression_model.predict(X_test)
np.sqrt(metrics.mean_squared_error(y_test, predicted_test))
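
RMSE is the square root of the mean squared difference between actual and predicted values, expressed in the units of the target (here, log_price units). A minimal sketch with made-up numbers, showing that the manual formula matches the `metrics.mean_squared_error` route used above:

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values on the log scale.
y_true = np.array([4.0, 5.0, 4.5, 6.0])
y_pred = np.array([4.2, 4.8, 4.5, 5.5])

# Manual RMSE: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
print(rmse_manual, rmse_sklearn)
```

Comparing the training and testing RMSE side by side is a quick check for overfitting: a test RMSE much larger than the training RMSE suggests the model does not generalize well.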

### Linear Regression using statsmodels

In [None]:
# concatenate X and y into a single dataframe
data_train = pd.concat([X_train, y_train], axis=1)
data_train.head()

In [None]:
data_test = pd.concat([X_test, y_test], axis=1)
data_test.head()

In [None]:
data_train.rename(columns = {"room_type_Entire home/apt": "room_type_entire_home", "room_type_Private room": "room_type_private_room", 
                     "room_type_Shared room": "room_type_shared_room"}, 
                      inplace = True) 

data_test.rename(columns = {"room_type_Entire home/apt": "room_type_entire_home", "room_type_Private room": "room_type_private_room", 
                     "room_type_Shared room": "room_type_shared_room"}, 
                      inplace = True) 

In [None]:
data_train.columns

In [None]:
expr = 'log_price ~ accommodates + bathrooms + instant_bookable + review_scores_rating + bedrooms + beds + room_type_private_room + room_type_shared_room + cancellation_policy_moderate + cancellation_policy_strict + cleaning_fee_True'

In [None]:
import statsmodels.formula.api as smf
lm1 = smf.ols(formula= expr, data = data_train).fit()
lm1.params

In [None]:
print(lm1.summary())

The p-value of the overall F-test is less than alpha, so we reject H0 in favor of Ha: at least one regression coefficient is not 0. In this model, all of the regression coefficients are non-zero.
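
The overall test compares the fitted model against an intercept-only model. As a sketch of the arithmetic behind the F-statistic and "Prob (F-statistic)" lines of an OLS summary, the snippet below computes both by hand on synthetic data. The data, the number of predictors `p`, and `alpha = 0.05` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: n observations, p predictors (one has a true coefficient of 0).
rng = np.random.default_rng(7)
n, p = 120, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

resid = y - design @ beta
ss_res = resid @ resid                   # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)

alpha = 0.05
reject_h0 = p_value < alpha
print(f_stat, p_value, reject_h0)
```

Rejecting H0 here only says that at least one coefficient differs from 0. Whether each individual coefficient is significant is judged from the per-coefficient t-tests in the summary table.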

In [None]:
# Calculate MSE
mse = np.mean((lm1.predict(data_train.drop('log_price',axis=1))-data_train['log_price'])**2)
mse

In [None]:
#Root Mean Squared Error - RMSE
math.sqrt(mse)

In [None]:
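# Alternative: mse_resid divides the residual sum of squares by the residual
# degrees of freedom rather than by n, so this value is slightly larger than the RMSE above
np.sqrt(lm1.mse_resid)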
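
The two RMSE values differ only in the divisor: the mean-based version divides the residual sum of squares by n, while the degrees-of-freedom version divides by n - p - 1 (p predictors plus an intercept). A NumPy-only sketch on synthetic data, assuming an ordinary least squares fit with an intercept:

```python
import numpy as np

# Synthetic data, purely illustrative.
rng = np.random.default_rng(3)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.4, size=n)

# OLS fit with an explicit intercept column.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
ssr = np.sum((y - design @ beta) ** 2)   # residual sum of squares

rmse_mean = np.sqrt(ssr / n)              # divisor n (plain mean of squared errors)
rmse_resid = np.sqrt(ssr / (n - p - 1))   # divisor n - p - 1 (residual degrees of freedom)
print(rmse_mean, rmse_resid)
```

With tens of thousands of rows and about a dozen predictors, the two values are nearly identical; the gap matters mainly for small samples.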

In [None]:
# Prediction on Test data
y_pred = lm1.predict(data_test)

In [None]:
plt.scatter(y_test['log_price'], y_pred)
plt.xlabel("Actual log_price")
plt.ylabel("Predicted log_price")
plt.show()

In [None]:
for i,j in np.array(lm1.params.reset_index()):
    print('+ ({}) * {} '.format(round(j,2),i),end=' ')

# Conclusion

The final Linear Regression equation is  
  
<b>log_price = b0 + b1 * instant_bookable[T.True] + b2 * accommodates + b3 * bathrooms + b4 * review_scores_rating + b5 * bedrooms + b6 * beds + b7 * room_type_private_room + b8 * room_type_shared_room + b9 * cancellation_policy_moderate + b10 * cancellation_policy_strict + b11 * cleaning_fee_True </b>
  
<b>log_price = 3.43 + (-0.07) * instant_bookable[T.True] + (0.1) * accommodates + (0.18) * bathrooms + (0.01) * review_scores_rating + (0.16) * bedrooms + (-0.05) * beds + (-0.61) * room_type_private_room + (-1.08) * room_type_shared_room + (-0.06) * cancellation_policy_moderate + (-0.01) * cancellation_policy_strict + (-0.08) * cleaning_fee_True</b>  
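
As a worked example of using this equation, the sketch below plugs a hypothetical listing into the rounded coefficients. The listing's feature values are invented for illustration, and converting back with `math.exp` assumes that log_price is the natural logarithm of price, which the notebook does not state.

```python
import math

# Rounded coefficients copied from the equation above.
intercept = 3.43
coefficients = {
    "instant_bookable": -0.07,
    "accommodates": 0.1,
    "bathrooms": 0.18,
    "review_scores_rating": 0.01,
    "bedrooms": 0.16,
    "beds": -0.05,
    "room_type_private_room": -0.61,
    "room_type_shared_room": -1.08,
    "cancellation_policy_moderate": -0.06,
    "cancellation_policy_strict": -0.01,
    "cleaning_fee_True": -0.08,
}

# Hypothetical listing: an entire home for 2 guests with a cleaning fee,
# not instantly bookable, and neither a moderate nor a strict cancellation policy.
listing = {
    "instant_bookable": 0,
    "accommodates": 2,
    "bathrooms": 1,
    "review_scores_rating": 95,
    "bedrooms": 1,
    "beds": 1,
    "room_type_private_room": 0,
    "room_type_shared_room": 0,
    "cancellation_policy_moderate": 0,
    "cancellation_policy_strict": 0,
    "cleaning_fee_True": 1,
}

# Linear predictor: intercept plus the sum of coefficient * feature value.
predicted_log_price = intercept + sum(
    coefficients[name] * value for name, value in listing.items()
)

# Back-transform to the price scale (assumes a natural-log target).
predicted_price = math.exp(predicted_log_price)
print(round(predicted_log_price, 2), round(predicted_price, 2))
```

Because the coefficients are rounded to two decimals, a prediction made this way will differ slightly from `lm1.predict`.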
  
When accommodates increases by 1 unit, log_price increases by 0.1 units, holding all other predictors constant.  
Similarly, when the number of bathrooms increases by 1, log_price increases by 0.18 units, holding all other predictors constant.
  
  
Some coefficients are negative. For instance, the coefficient of room_type_shared_room is -1.08. This implies that a shared room has a log_price 1.08 units lower than the baseline room type (entire home/apt, which is omitted from the formula), holding all other predictors constant.
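
Because the response is on a log scale, a coefficient b can also be read as an approximate percentage change in price: a one-unit increase multiplies the price by exp(b), a change of (exp(b) - 1) * 100 percent. The sketch below applies this to three of the coefficients above, assuming log_price is the natural logarithm of price (an assumption, since the notebook does not say which base was used).

```python
import math

# Coefficients taken from the fitted equation above.
coefficients = {
    "accommodates": 0.1,
    "bathrooms": 0.18,
    "room_type_shared_room": -1.08,
}

# Percentage change in price for a one-unit increase in each predictor,
# holding the others constant: (exp(b) - 1) * 100.
pct_change = {
    name: (math.exp(b) - 1) * 100 for name, b in coefficients.items()
}

for name, pct in pct_change.items():
    print(f"{name}: {pct:+.1f}% change in price")
```

For small coefficients the percentage is close to 100 * b (0.1 gives roughly 10.5%), but for large ones such as -1.08 the two diverge sharply (about -66% rather than -108%).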



In [None]:
qz = pd.read_csv("ThreeCars.csv")

In [None]:
regression_model.score(X_train, y_train)

In [None]:
qz.head()

In [None]:
# drop() returns a new DataFrame, so assign the result to keep the change
qz = qz.drop('Unnamed: 0', axis=1)

In [None]:
# Target: the Price column
y = qz[['Price']]

In [None]:
# Predictors: every column except Price
x = qz.drop('Price', axis=1)