## Overview

This notebook will show you how to create and query a table or DataFrame that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is a Databricks File System that allows you to store data for querying inside of Databricks. This notebook assumes that you have a file already inside of DBFS that you would like to read from.

This notebook is written in **Python** so the default cell type is Python. However, you can use different languages by using the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [2]:
#Let's break this analysis down into 6 steps in order to achieve the goal.
#1. Import libraries.
#2. Load the dataset.
#3. Data cleaning.
#4. Data visualization: use plots to find relations between the features.
#5. Find the correlations.
#6. Predictions.

In [3]:
# File location and type
file_location = "AB_NYC_2019.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149.0,1.0,9,2018-10-19,0.21,6.0,365.0
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225.0,1.0,45,2019-05-21,0.38,2.0,355.0
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150.0,3.0,0,,,1.0,365.0
3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89.0,1.0,270,2019-07-05,4.64,1.0,194.0
5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80.0,10.0,9,2018-11-19,0.1,1.0,0.0
5099,Large Cozy 1 BR Apartment In Midtown East,7322,Chris,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200.0,3.0,74,2019-06-22,0.59,1.0,129.0
5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60.0,45.0,49,2017-10-05,0.4,1.0,0.0
5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79.0,2.0,430,2019-06-24,3.47,1.0,220.0
5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79.0,2.0,118,2017-07-21,0.99,1.0,0.0
5238,Cute & Cozy Lower East Side 1 bdrm,7549,Ben,Manhattan,Chinatown,40.71344,-73.99037,Entire home/apt,150.0,1.0,160,2019-06-09,1.33,4.0,188.0


In [4]:
#installing the required libraries in Databricks.

In [3]:
%sh pip install plotly geopandas folium iplot

UsageError: %%sh is a cell magic, but the cell body is empty.


In [6]:
#1.import the required libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly as py
import seaborn as sns
from sklearn import preprocessing
import geopandas as gpd
import iplot as iplot

In [3]:
from plotly import __version__
import plotly.offline as py 
from plotly.offline import init_notebook_mode, plot
init_notebook_mode(connected=True)
from plotly import tools
import plotly.graph_objs as go
import plotly.express as px
import folium
from folium.plugins import MarkerCluster
from folium import plugins
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [4]:
df = pd.read_csv('../data/AB_NYC_2019.csv.zip')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '../data/AB_NYC_2019.csv.zip'

In [8]:
#Data Cleaning.

In [9]:
#to check whether there are any null values in the dataframe.

In [12]:
import numpy as np
from pyspark import SparkConf, SparkContext

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

NameError: name 'spark' is not defined

In [13]:
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()

In [14]:
from pyspark.sql.functions import col, count, isnan, lit, sum

def count_not_null(c, nan_as_null=False):
    """Use conversion between boolean and integer
    - False -> 0
    - True ->  1
    """
    pred = col(c).isNotNull() & (~isnan(c) if nan_as_null else lit(True))
    return sum(pred.cast("integer")).alias(c)

df.agg(*[count_not_null(c) for c in df.columns]).show()
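
The two Spark cells above count NaN and non-null values separately, because Spark treats null and NaN as different things. For comparison, here is a minimal pandas sketch on the same toy values. It assumes the data fits in memory as a pandas DataFrame, and the names `toy`, `missing_per_column` and `present_per_column` are made up for this example. pandas treats `None` and NaN alike in a numeric column, so one call covers both.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the Spark example above: None and NaN both mark missing values.
toy = pd.DataFrame(
    {
        "session": [1, 1, 1, 1, 1, 1, 1],
        "timestamp1": [1, 2, 3, 4, 5, 6, 6],
        "id2": [None, 5.0, np.nan, None, 10.0, float("nan"), float("nan")],
    }
)

# isna() flags both None and NaN; summing the boolean mask counts them per column.
missing_per_column = toy.isna().sum()
present_per_column = toy.notna().sum()

print(missing_per_column)
print(present_per_column)
```

On the real listings data the same two calls would be applied to the loaded pandas DataFrame instead of `toy`.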

In [15]:
df1=df.sort_values(by=['number_of_reviews'],ascending=False).head(1000)

In [16]:
df2=df.sort_values(by=['price'],ascending=False).head(1000)

In [17]:
print('Rooms with the most number of reviews')
Long=-73.80
Lat=40.80
mapdf1=folium.Map([Lat,Long],zoom_start=10,)

mapdf1_rooms_map=plugins.MarkerCluster().add_to(mapdf1)

for lat,lon,label in zip(df1.latitude,df1.longitude,df1.name):
    folium.Marker(location=[lat,lon],icon=folium.Icon(icon='home'),popup=label).add_to(mapdf1_rooms_map)
mapdf1.add_child(mapdf1_rooms_map)

mapdf1

In [18]:
print('Most Expensive rooms')
Long=-73.80
Lat=40.80
mapdf1=folium.Map([Lat,Long],zoom_start=10,)

mapdf1_rooms_map=plugins.MarkerCluster().add_to(mapdf1)

for lat,lon,label in zip(df2.latitude,df2.longitude,df2.name):
    folium.Marker(location=[lat,lon],icon=folium.Icon(icon='home'),popup=label).add_to(mapdf1_rooms_map)
mapdf1.add_child(mapdf1_rooms_map)

mapdf1

In [19]:
plt.figure(figsize=(10,10))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group',s=20, data=df)
plt.show()
display()

In [20]:
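df3=df.groupby(['neighbourhood_group']).mean(numeric_only=True)  #numeric_only skips text columns such as name, which recent pandas versions refuse to average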

In [21]:
df3.drop(['latitude', 'longitude','host_id','id'],axis=1)

neighbourhood_group,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
Bronx,87.496792,4.560953,26.004583,1.837831,2.233731,165.758937
Brooklyn,124.383207,6.056556,24.202845,1.283212,2.284371,100.232292
Manhattan,196.875814,8.579151,20.985596,1.272131,12.79133,111.97941
Queens,99.517649,5.181433,27.700318,1.9412,4.060184,144.451818
Staten Island,114.812332,4.831099,30.941019,1.87258,2.319035,199.678284


In [22]:
#To learn about different hosts and areas, we will visualize the data in graphs and charts that are easy to understand.

#check the host with the highest amount of rooms
top_host=df.host_id.value_counts().head(10)
top_host

In [23]:
#The host with the most listings has over 327 rooms within New York alone!

In [24]:
#counting the amount of listings per neighbourhood group
plt.figure(figsize=(15,7))
viz_1 = sns.countplot(x='neighbourhood_group', hue = 'room_type', data = df, palette = 'pastel')
viz_1.set_title('Number and types of listings for each neighbourhood_group')

In [25]:
display()

In [26]:
#From the above bar graph we can infer that most listings are located in either Brooklyn or Manhattan, and the most common room types are private rooms and entire homes/apartments.

In [27]:
#because of extreme values in this dataset, we cut them off for this visualization
#creating a sub-dataframe with no extreme values / price less than 500
sub_1 = df[df.price < 500]
#using violinplot to showcase density and distribution of prices
plt.figure(figsize = (15,7))
viz_2 = sns.violinplot(data = sub_1, x = 'neighbourhood_group', y = 'price', palette = 'pastel')
viz_2.set_title('Density and distribution of prices for each neighbourhood_group')

In [28]:
display()

In [29]:
#From the above violin plot we can infer that Manhattan, on average, has the highest prices for its Airbnb listings, followed by Brooklyn, Queens, Staten Island, and the Bronx, in that order. This is no surprise, since Manhattan is widely known as one of the most expensive places to live. Manhattan also shows the widest variation in price.
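
A violin plot is read by eye, so it can help to put numbers next to it. The sketch below applies the same price cutoff of 500 and then summarizes price per neighbourhood group. The rows in `sample` are made up for illustration and are not taken from AB_NYC_2019.csv; `sample`, `trimmed` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up prices for illustration only; these are not rows from AB_NYC_2019.csv.
sample = pd.DataFrame(
    {
        "neighbourhood_group": [
            "Manhattan", "Manhattan", "Manhattan",
            "Brooklyn", "Brooklyn",
            "Bronx", "Bronx",
        ],
        "price": [150, 250, 900, 90, 130, 60, 80],
    }
)

# Same cutoff as the violin plot: ignore listings priced at 500 or more.
trimmed = sample[sample["price"] < 500]

# Count, centre and spread of price per group, ordered by median price.
summary = (
    trimmed.groupby("neighbourhood_group")["price"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Run against the real `sub_1` dataframe, the `median` and `std` columns would give a direct check on the ranking and the spread described above.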

In [30]:
#Now, we move on to see if neighbourhood groups affect how well each listing is booked, or its availability.
#using violinplot to showcase density and distribution of availability

plt.figure(figsize = (15,7))
viz_3 = sns.violinplot(data = df, x = 'neighbourhood_group', y = 'availability_365', palette = 'pastel')
viz_3.set_title('Density and distribution of availability for each neighbourhood_group')

In [31]:
display()

In [32]:
#From the above graph we can infer that the busiest listings are in Brooklyn and Manhattan, where the violin shape widens at the base. A wide base means many listings are available for booking on zero days out of 365, that is, they are always booked. Queens also had success with listings, but not as much as the previous two neighbourhood groups. All five violins are skewed toward either the top or the bottom, meaning that rooms either sold very well or were not booked at all. This is interesting because it suggests there is a barrier for every listing to overcome: a new listing finds it hard to get booked, either for lack of reviews or because it gets lost in Airbnb's recommendation system, which rarely recommends new listings to users. However, once a listing finds success, gains reviews and is featured in Airbnb's app, it gets booked and sells out like hotcakes.
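
The skew toward the top or bottom of each violin can be quantified as the share of listings at each extreme. This is a minimal sketch on made-up values, not the real dataset. The 300-day threshold for "mostly available" and the names `flags` and `shares` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up availability values for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "neighbourhood_group": ["Brooklyn"] * 4 + ["Queens"] * 4,
        "availability_365": [0, 0, 0, 365, 0, 120, 300, 365],
    }
)

# Flag the two extremes of the distribution.
flags = sample.assign(
    never_available=sample["availability_365"] == 0,
    mostly_available=sample["availability_365"] >= 300,  # assumed threshold
)

# The mean of a boolean column is the fraction of True values in each group.
shares = flags.groupby("neighbourhood_group")[["never_available", "mostly_available"]].mean()
print(shares)
```

A large `never_available` share corresponds to a violin that is wide at the base, and a large `mostly_available` share to one that is wide at the top.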

In [33]:
#Now let's move on to see how the top hosts keep their listings busy.

#let's make a list of the top 5 hosts
top_five_host=df.host_id.value_counts().head(5)
top_five_host

In [34]:
#create a dataframe that only contains listings from the top hosts
airbnb_top_host = df.loc[df.host_id.isin([219517861, 107434423, 30283594, 137358866, 12243051])]
airbnb_top_host.head()


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
9740,7491713,NYC Lavish Studio Apartment Steps from SoHo!,30283594,Kara,Manhattan,Financial District,40.70862,-74.01408,Entire home/apt,169,30,3,2018-12-07,0.09,121,364
10075,7730160,Furnished NYC 1BR apt near Rockefeller Center!!!,30283594,Kara,Manhattan,Theater District,40.75967,-73.98573,Entire home/apt,135,30,0,,,121,174
10335,7913426,LUX 1-Bedroom NYC Apartment Near Times Square!,30283594,Kara,Manhattan,Theater District,40.75654,-73.98891,Entire home/apt,369,30,0,,,121,364
10398,7966358,NYC High End 2BR Midtown West Apt,30283594,Kara,Manhattan,Midtown,40.76633,-73.98145,Entire home/apt,335,30,0,,,121,201
10490,8045421,NYC Chelsea Luxury 1BR Apt,30283594,Kara,Manhattan,Chelsea,40.74465,-73.99253,Entire home/apt,129,30,3,2017-12-31,0.07,121,161


In [35]:
##visualization of top hosts' listings' location and room type
plt.figure(figsize = (15,7))
viz_4 = sns.catplot(x = 'host_id', hue = 'room_type',col='neighbourhood_group',
                    data = airbnb_top_host, palette = 'pastel', kind = 'count')

In [36]:
display()

In [37]:
#From this catplot, we can infer that most, if not all, of the top hosts have all their listings in Manhattan. These hosts mostly own listings of room type 'Entire home/apt', with the exception of the host with the second-most listings, who mainly lists private rooms.

In [38]:
#using violinplot to showcase density and distribution of availability

plt.figure(figsize = (15,7))
viz_5 = sns.violinplot(data = airbnb_top_host, x = 'host_id', y = 'availability_365', palette = 'pastel')
viz_5.set_title('Density and distribution of availability for each top host')

In [39]:
display()

In [40]:
#Interestingly, host_id '137358866' received, on average, more frequent bookings than the others. This host mainly lists private rooms, so room type might be a factor leading to more bookings.

In [41]:
df4=df.groupby(['neighbourhood_group','neighbourhood']).mean(numeric_only=True)  #numeric_only skips text columns, which recent pandas versions refuse to average

In [42]:
r1=df4.loc['Bronx'].number_of_reviews.sum().round()
r2=df4.loc['Brooklyn'].number_of_reviews.sum().round()
r3=df4.loc['Manhattan'].number_of_reviews.sum().round()
r4=df4.loc['Queens'].number_of_reviews.sum().round()
r5=df4.loc['Staten Island'].number_of_reviews.sum().round()

In [43]:
abcd=df['neighbourhood_group'].value_counts()
dfabcd=pd.DataFrame(abcd)
dfabcd.reset_index(inplace=True)

In [44]:
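reviews = [r1,r2,r3,r4,r5]
#label each total explicitly, in the same order as r1..r5; value_counts() orders by frequency, which need not match
groups = ['Bronx','Brooklyn','Manhattan','Queens','Staten Island']
review = pd.DataFrame({'index': groups, 'values': reviews})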



trace10 = go.Bar(x=review['index'],y=review['values'],marker=dict(color=['Blue','Red','Green','Black','Purple']),width=0.4)

data=[trace10]
layout = go.Layout(title='Number of reviews by Neighbourhood',height=400,width=800)
fig= go.Figure(data=data,layout=layout)
display()

In [45]:
r1=df4.loc['Bronx'].reviews_per_month.mean()
r2=df4.loc['Brooklyn'].reviews_per_month.mean()
r3=df4.loc['Manhattan'].reviews_per_month.mean()
r4=df4.loc['Queens'].reviews_per_month.mean()
r5=df4.loc['Staten Island'].reviews_per_month.mean()

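rev = [r1,r2,r3,r4,r5]

#label each mean explicitly, in the same order as r1..r5; value_counts() orders by frequency, which need not match
groups = ['Bronx','Brooklyn','Manhattan','Queens','Staten Island']
rev_per_month = pd.DataFrame({'index': groups, 'values': rev})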


trace2 = go.Scatter(x=rev_per_month['index'],y=rev_per_month['values'],marker=dict(color=['Blue','Red','Green','Black','Purple']))
data=[trace2]
layout = go.Layout(title='Average Reviews per month per place by Neighbourhood',height=400,width=800)
fig= go.Figure(data=data,layout=layout,)
display()

In [46]:
df['room_type'].value_counts()

In [47]:
df5 = df.groupby(['neighbourhood_group','room_type']).mean(numeric_only=True)  #numeric_only skips text columns, which recent pandas versions refuse to average

In [48]:
room_types_neighbourhoods=df5.drop(['id','host_id','latitude','longitude','number_of_reviews','reviews_per_month'],axis=1)

In [49]:
room_types_neighbourhoods

neighbourhood_group,room_type,price,minimum_nights,calculated_host_listings_count,availability_365
Bronx,Entire home/apt,127.506596,5.957784,1.865435,158.0
Bronx,Private room,66.788344,3.858896,2.338957,171.331288
Bronx,Shared room,59.8,3.366667,3.416667,154.216667
Brooklyn,Entire home/apt,178.327545,6.531332,1.837849,97.205147
Brooklyn,Private room,76.500099,5.539479,2.547177,99.917983
Brooklyn,Shared room,50.527845,7.753027,6.171913,178.007264
Manhattan,Entire home/apt,249.239109,10.539283,18.922721,117.140996
Manhattan,Private room,116.776622,5.44688,3.188048,101.845026
Manhattan,Shared room,88.977083,6.766667,3.885417,138.572917
Queens,Entire home/apt,147.050573,5.369752,1.677958,132.267176


In [50]:
#No extra install is needed here: scatter_matrix is part of plotly.express, which is already imported as px.

In [51]:
fig = px.scatter_matrix(room_types_neighbourhoods,height=1000,width=900,color="minimum_nights")
fig.update_traces(diagonal_visible=False)
display()

In [52]:

df6 = df.groupby(['room_type']).mean(numeric_only=True)  #numeric_only skips text columns, which recent pandas versions refuse to average
room_types =df6.drop(['id','host_id','latitude','longitude','number_of_reviews','reviews_per_month'],axis=1)

In [53]:
room_types

room_type,price,minimum_nights,calculated_host_listings_count,availability_365
Entire home/apt,211.794246,8.506907,10.698335,111.920304
Private room,89.780973,5.3779,3.227717,111.203933
Shared room,70.127586,6.475,4.662931,162.000862


In [54]:
df7 = df.groupby(['neighbourhood_group','room_type'])['id'].agg('count')

In [55]:
rmtng = pd.DataFrame(df7)

In [56]:
rmtng.reset_index(inplace=True)

In [57]:
Bronx = rmtng[rmtng['neighbourhood_group']=='Bronx']
Brooklyn = rmtng[rmtng['neighbourhood_group']=='Brooklyn']
Manhattan = rmtng[rmtng['neighbourhood_group']=='Manhattan']
Queens = rmtng[rmtng['neighbourhood_group']=='Queens']
StatenIsland = rmtng[rmtng['neighbourhood_group']=='Staten Island']

df8 = df.groupby(['room_type']).count()
rooms = df8.drop(['host_name','name','host_id','latitude','longitude','number_of_reviews','reviews_per_month','neighbourhood_group','neighbourhood','price','minimum_nights','calculated_host_listings_count','availability_365','last_review'],axis=1)
rooms.reset_index(inplace=True)

In [58]:
%sh pip install iplot

In [59]:
#Preprocessing and preparing Data for prediction
df = pd.read_csv("/dbfs/FileStore/tables/AB_NYC_2019.csv")
df1 = df.copy()  #copy so the in-place edits below do not also modify df

df1.drop(['name','id','host_name','last_review'],axis=1,inplace=True)
df1['reviews_per_month']=df1['reviews_per_month'].replace(np.nan, 0)

le = preprocessing.LabelEncoder()
le.fit(df1['neighbourhood_group'])    
df1['neighbourhood_group']=le.transform(df1['neighbourhood_group'])

le = preprocessing.LabelEncoder()
le.fit(df1['neighbourhood'])
df1['neighbourhood']=le.transform(df1['neighbourhood'])

le = preprocessing.LabelEncoder()
le.fit(df1['room_type'])
df1['room_type']=le.transform(df1['room_type'])

df1.sort_values(by='price',ascending=True,inplace=True)

df1.head()

Unnamed: 0,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
25796,86327101,1,13,40.68258,-73.91284,1,0,1,95,4.35,6,222
25634,15787004,1,28,40.69467,-73.92433,1,0,2,16,0.71,5,0
25433,131697576,0,62,40.83296,-73.88668,1,0,2,55,2.56,4,127
25753,1641537,1,91,40.72462,-73.94072,1,0,2,12,0.53,2,0
23161,8993084,1,13,40.69023,-73.95428,1,0,4,1,0.05,4,28
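
To make the integer codes in the table above easier to read, here is a plain-Python sketch of how `LabelEncoder` assigns them: it sorts the distinct labels and numbers them from 0. The sample list is illustrative and much shorter than the real columns, and `classes`, `code_for` and `encoded` are hypothetical names.

```python
# Illustrative values only; the real columns have many more rows and categories.
room_types = ["Private room", "Entire home/apt", "Shared room", "Private room"]

# LabelEncoder numbers the distinct labels in sorted order, starting at 0.
classes = sorted(set(room_types))
code_for = {label: code for code, label in enumerate(classes)}
encoded = [code_for[label] for label in room_types]

print(code_for)
print(encoded)
```

With a fitted encoder, `le.classes_` holds the same sorted labels and `le.inverse_transform` maps codes back to names. Note that these codes impose an arbitrary order on the categories, which a linear model will treat as numeric. One-hot encoding (for example with the `OneHotEncoder` imported earlier) is a common alternative.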


In [60]:
#Summary of data visualisation:
#For this dataset, the differences between many variables within the dataframe are relatively small. Nevertheless, we can extract information from them with cleaning and visualization. First, we examined the number and room types of Airbnb listings in each neighbourhood group. We found that most listings were private rooms or entire homes, mostly in Manhattan and Brooklyn. We also examined prices in each area and confirmed that Manhattan is expensive.
#Next, we explored the hosts with the most listings. From the graph, almost all the top Airbnb hosts listed entire homes/apts in Manhattan. However, the availability distributions for listings owned by top hosts also show, contrary to my assumption, that top hosts had a number of listings available for over 300 days, which suggests they are not booked that much. Interestingly, the top host who mainly lists private rooms is comparatively harder to book, that is, booked more often.

In [61]:
#After understanding and visualising the data, we found the correlations. Machine learning algorithms are now used to predict price.

In [62]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression

In [63]:
#Linear Regression Model
lm = LinearRegression()

In [64]:
X = df1[['host_id','neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
y = df1['price']

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
lm.fit(X_train,y_train)


In [66]:
predicts = lm.predict(X_test)

In [67]:
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

In [68]:
print("Root mean squared error is:")
np.sqrt(metrics.mean_squared_error(y_test,predicts))

In [69]:
print('r2 score is:')
r2 = r2_score(y_test,predicts)
r2*100

In [70]:
print("Mean absolute error is:")
mean_absolute_error(y_test,predicts)

In [71]:
error_diff = pd.DataFrame({'Actual Values': np.array(y_test).flatten(), 'Predicted Values': predicts.flatten()})
error_diff1 = error_diff.head(20)

In [72]:
error_diff1.head(5)

Unnamed: 0,Actual Values,Predicted Values
0,400,135.112125
1,140,210.599057
2,195,176.222978
3,120,100.439183
4,88,88.545753
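
The three metrics printed above can also be written out by hand, which shows what each one measures. The sketch below uses only the five rows displayed in the output above, not the full test set, so its numbers will differ from the notebook's scores. The variable names are chosen for illustration.

```python
import math

# The five rows shown in the output above (linear regression); not the full test set.
actual = [400, 140, 195, 120, 88]
predicted = [135.112125, 210.599057, 176.222978, 100.439183, 88.545753]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: average size of the miss, in price units.
mae = sum(abs(e) for e in errors) / n

# Root mean squared error: like MAE, but large misses are weighted more heavily.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R squared: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly, as with the 400-priced listing in the first row.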


In [73]:
#Actual values vs Predicted values

title=['Pred vs Actual']
fig = go.Figure(data=[
    go.Bar(name='Predicted', x=error_diff1.index, y=error_diff1['Predicted Values']),
    go.Bar(name='Actual', x=error_diff1.index, y=error_diff1['Actual Values'])
])

fig.update_layout(barmode='group')
display()

In [74]:
#Linear Model Predictions
plt.figure(figsize=(16,8))
sns.regplot(x=predicts, y=y_test)  #recent seaborn versions require x and y as keyword arguments
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Linear Model Predictions")
display()


In [75]:
from sklearn.ensemble import  GradientBoostingRegressor
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

from sklearn.model_selection import train_test_split

In [76]:
X = df1[['host_id','neighbourhood_group','neighbourhood','latitude','longitude','room_type','minimum_nights','number_of_reviews','reviews_per_month','calculated_host_listings_count','availability_365']]
y = df1['price']

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


In [78]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.01)

In [79]:
GBoost.fit(X_train,y_train)

In [80]:
predict = GBoost.predict(X_test)

In [81]:
print("Root mean squared error is:")
np.sqrt(metrics.mean_squared_error(y_test,predict))

In [82]:
print('r2 score is:')
r2 = r2_score(y_test,predict)
r2*100

In [83]:
error_diff = pd.DataFrame({'Actual Values': np.array(y_test).flatten(), 'Predicted Values': predict.flatten()})
error_diff1 = error_diff.head(20)


In [84]:
print("Mean absolute error is:")
mean_absolute_error(y_test,predict)

In [85]:
error_diff1.head()

Unnamed: 0,Actual Values,Predicted Values
0,400,125.903567
1,140,156.575618
2,195,161.337522
3,120,94.543442
4,88,71.978409


In [86]:
title='Pred vs Actual'
fig = go.Figure(data=[
    go.Bar(name='Predicted', x=error_diff1.index, y=error_diff1['Predicted Values']),
    go.Bar(name='Actual', x=error_diff1.index, y=error_diff1['Actual Values'])
])
fig.update_layout(barmode='group')
display()

In [87]:
plt.figure(figsize=(10,8))
sns.regplot(x=predict, y=y_test)  #recent seaborn versions require x and y as keyword arguments
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Gradient Boosted Regressor model Predictions")
plt.show()
display()

In [88]:
import xgboost
import warnings 
warnings.simplefilter(action='ignore')


In [89]:
xgb = xgboost.XGBRegressor(n_estimators=310,learning_rate=0.1,objective='reg:squarederror')
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

In [90]:
print("Root mean squared error is:")
np.sqrt(metrics.mean_squared_error(y_test,xgb_pred))

In [91]:
print('r2 score is:')
r2 = r2_score(y_test,xgb_pred)
r2*100

In [92]:
print("Mean absolute error is:")
mean_absolute_error(y_test,xgb_pred)

In [93]:
error_diff = pd.DataFrame({'Actual Values': np.array(y_test).flatten(), 'Predicted Values': xgb_pred.flatten()})
error_diff1 = error_diff.head(20)

In [94]:
error_diff1.head()

Unnamed: 0,Actual Values,Predicted Values
0,400,122.94558
1,140,145.77063
2,195,164.780121
3,120,97.577797
4,88,74.028694
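
The notebook prints the same five test rows for each of the three models, so they can be placed side by side. The sketch below computes the mean absolute error on those five displayed rows only. Five rows are far too few to rank the models; the full-test-set metrics printed in the cells above are the ones to compare. The dictionary keys and the helper `mean_abs_error` are hypothetical names.

```python
# Actual prices and predictions copied from the three output tables above.
actual = [400, 140, 195, 120, 88]
shown = {
    "linear": [135.112125, 210.599057, 176.222978, 100.439183, 88.545753],
    "gradient_boosting": [125.903567, 156.575618, 161.337522, 94.543442, 71.978409],
    "xgboost": [122.94558, 145.77063, 164.780121, 97.577797, 74.028694],
}


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_by_model = {name: mean_abs_error(actual, preds) for name, preds in shown.items()}

# Print models from smallest to largest error on these five rows.
for name, value in sorted(mae_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

All three models miss the 400-priced listing by a wide margin, which dominates the error on this small sample.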


In [95]:
import plotly.express as px
import plotly.graph_objects as go
title='Pred vs Actual'
fig = go.Figure(data=[
    go.Bar(name='Predicted', x=error_diff1.index, y=error_diff1['Predicted Values']),
    go.Bar(name='Actual', x=error_diff1.index, y=error_diff1['Actual Values'])
])
fig.update_layout(barmode='group')
display()

In [96]:
plt.figure(figsize=(10,8))
sns.regplot(x=xgb_pred, y=y_test)  #recent seaborn versions require x and y as keyword arguments
plt.xlabel('Predictions')
plt.ylabel('Actual')
plt.title("Xgboost Regressor Predictions")
plt.show()
display()