# Scenario
Create a model that predicts whether an Airbnb listing in San Diego, California will earn a perfect 5.0 rating. The model is meant to give Airbnb hosts a way to evaluate their rentals and check that they meet the criteria associated with a perfect review score.

## Alternative Scenario
- Airbnb is an industry disruptor, and many hosts earn a good deal of income from it.
- The goal of a host is to be a "5 Star Host" or "Superhost".
- What can hosts do to get more 5-star ratings?
- Do any particular features matter?

## Questions to Answer

1. How many units have a perfect rating?
2. How long have they had a perfect rating?
3. How many reviews should a unit have to be considered? (i.e., a single 5.0 review isn't enough)
4. Which review metrics have the most impact?
5. Which house factors have the most impact?
6. What is the relationship between price and rating?

## Loading Data

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mtick
from matplotlib.pylab import rcParams
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, mean_squared_log_error, roc_curve, auc
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it.
from sklearn.metrics import ConfusionMatrixDisplay
from xgboost import XGBClassifier
import numpy as np


In [2]:
pd.set_option('display.max_rows', 1000)
plt.style.use('fivethirtyeight')

## Reviews_df: Details on each review post

In [3]:
reviews_df = pd.read_csv('reviews.csv.gz')

In [4]:
reviews_df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,29967,62788,2010-07-09,151260,Debbie,When I booked our stay in San Diego at Dennis ...
1,29967,64568,2010-07-14,141552,Eric,This was my first experience with using airbnb...
2,29967,67502,2010-07-22,141591,David,We found the house to be very accommodating--e...
3,29967,70466,2010-07-29,125982,Anders,As advertised and more. Dennis was very helpfu...
4,29967,74876,2010-08-07,29835,Miyoko,We had a great time in San Diego. Denis' house...


## Listing_DF: Baseline DF with many columns

In [5]:
listing_df = pd.read_csv('listings.csv.gz')

In [6]:
listing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10935 entries, 0 to 10934
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            10935 non-null  int64  
 1   listing_url                                   10935 non-null  object 
 2   scrape_id                                     10935 non-null  int64  
 3   last_scraped                                  10935 non-null  object 
 4   name                                          10935 non-null  object 
 5   description                                   10809 non-null  object 
 6   neighborhood_overview                         7440 non-null   object 
 7   picture_url                                   10935 non-null  object 
 8   host_id                                       10935 non-null  int64  
 9   host_url                                      10935 non-null 

In [7]:
listing_df['price']

0         $60.00
1        $282.00
2        $348.00
3        $368.00
4        $264.00
          ...   
10930    $228.00
10931    $168.00
10932    $500.00
10933     $67.00
10934     $88.00
Name: price, Length: 10935, dtype: object

### Fixing Price

In [8]:
# Strip the dollar sign and thousands separators, then convert to float.
listing_df['price'] = listing_df['price'].map(lambda x: x.replace('$', ''))
listing_df['price'] = listing_df['price'].map(lambda x: x.replace(',', ''))
listing_df['price'] = listing_df['price'].astype(float)
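
The same cleanup can be written with vectorized string methods. This is a minimal sketch on a made-up `raw_prices` Series (not the real `listing_df['price']` column); it assumes prices look like `$1,282.00` and that missing values should become `NaN` rather than raise.

```python
import pandas as pd

# Toy stand-in for listing_df['price']; the values are illustrative only.
raw_prices = pd.Series(["$60.00", "$1,282.00", "$348.00", None])

# Remove "$" and "," in one regex pass, then convert to numbers.
# errors="coerce" turns anything unparseable (including missing values) into NaN.
clean_prices = pd.to_numeric(
    raw_prices.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(clean_prices.tolist())
```

Unlike `astype(float)` on the mapped strings, this version does not fail if a price is missing.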

In [9]:
listing_df['price']

0         60.0
1        282.0
2        348.0
3        368.0
4        264.0
         ...  
10930    228.0
10931    168.0
10932    500.0
10933     67.0
10934     88.0
Name: price, Length: 10935, dtype: float64

### Amenities

In [10]:
listing_df['amenities'] = listing_df['amenities'].astype('str')

In [11]:
listing_df['amenities']

0        ["First aid kit", "Private patio or balcony", ...
1        ["First aid kit", "Ethernet connection", "Bike...
2        ["Shower gel", "Private patio or balcony", "TV...
3        ["First aid kit", "Private patio or balcony", ...
4        ["First aid kit", "Shower gel", "Outdoor showe...
                               ...                        
10930    ["Shower gel", "Private patio or balcony", "Ai...
10931    ["55\" HDTV with Netflix", "First aid kit", "S...
10932    ["First aid kit", "Air conditioning", "TV", "S...
10933    ["TV", "Keypad", "Stove", "Iron", "Hangers", "...
10934    ["TV", "Stove", "Iron", "Hangers", "Dishes and...
Name: amenities, Length: 10935, dtype: object

In [12]:
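# Split the amenities string on commas and explode to one row per (listing, amenity).
# Note: `data` is an alias of listing_df, so the new column is added to listing_df as well.
data = listing_df
data['amenities_list'] = data['amenities'].str.split(pat=",")
amenities_data = data.explode('amenities_list')
amenities_data.head(5)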

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,amenities_list
0,53157684,https://www.airbnb.com/rooms/53157684,20220323041300,2022-03-23,Lovely 1 bedroom near Coronado Island,We are a young married couple with two small f...,Imperial beach pier <br />Coronado hotel <br /...,https://a0.muscache.com/pictures/b613f1c8-e4b1...,43636297,https://www.airbnb.com/users/show/43636297,...,4.0,5.0,,t,1,0,1,0,0.26,"[""First aid kit"""
0,53157684,https://www.airbnb.com/rooms/53157684,20220323041300,2022-03-23,Lovely 1 bedroom near Coronado Island,We are a young married couple with two small f...,Imperial beach pier <br />Coronado hotel <br /...,https://a0.muscache.com/pictures/b613f1c8-e4b1...,43636297,https://www.airbnb.com/users/show/43636297,...,4.0,5.0,,t,1,0,1,0,0.26,"""Private patio or balcony"""
0,53157684,https://www.airbnb.com/rooms/53157684,20220323041300,2022-03-23,Lovely 1 bedroom near Coronado Island,We are a young married couple with two small f...,Imperial beach pier <br />Coronado hotel <br /...,https://a0.muscache.com/pictures/b613f1c8-e4b1...,43636297,https://www.airbnb.com/users/show/43636297,...,4.0,5.0,,t,1,0,1,0,0.26,"""Keurig coffee machine"""
0,53157684,https://www.airbnb.com/rooms/53157684,20220323041300,2022-03-23,Lovely 1 bedroom near Coronado Island,We are a young married couple with two small f...,Imperial beach pier <br />Coronado hotel <br /...,https://a0.muscache.com/pictures/b613f1c8-e4b1...,43636297,https://www.airbnb.com/users/show/43636297,...,4.0,5.0,,t,1,0,1,0,0.26,"""Dining table"""
0,53157684,https://www.airbnb.com/rooms/53157684,20220323041300,2022-03-23,Lovely 1 bedroom near Coronado Island,We are a young married couple with two small f...,Imperial beach pier <br />Coronado hotel <br /...,https://a0.muscache.com/pictures/b613f1c8-e4b1...,43636297,https://www.airbnb.com/users/show/43636297,...,4.0,5.0,,t,1,0,1,0,0.26,"""Outdoor dining area"""


In [13]:
amenities_data['amenities_list']

0                   ["First aid kit"
0         "Private patio or balcony"
0            "Keurig coffee machine"
0                     "Dining table"
0              "Outdoor dining area"
                    ...             
10934                 "Refrigerator"
10934               "Cooking basics"
10934                      "Kitchen"
10934                   "Hair dryer"
10934                  "Beachfront"]
Name: amenities_list, Length: 389144, dtype: object

In [14]:
# Remove brackets, quotes, and other punctuation, then trim surrounding whitespace.
amenities_data['amenities_list_clean'] = (
    amenities_data['amenities_list'].str.replace(r'[^\w\s]+', '', regex=True).str.strip()
)
amenities_data.drop('amenities_list', axis=1, inplace=True)

In [21]:
need_to_encode = amenities_data[['id', 'amenities_list_clean']]

ohe = OneHotEncoder()
ohe.fit(need_to_encode)

ohe_1 = ohe.transform(need_to_encode).toarray()

# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it.
ohe_df = pd.DataFrame(ohe_1, columns=ohe.get_feature_names_out(need_to_encode.columns))
ohe_df.head(2)

Unnamed: 0,id_29967,id_38245,id_54001,id_62274,id_62949,id_67441,id_75668,id_77785,id_79300,id_103417,...,amenities_list_clean_TV with Netflix,amenities_list_clean_TV with standard cable,amenities_list_clean_The Honest Company body soap,amenities_list_clean_TresSemme conditioner,amenities_list_clean_Waterfront,amenities_list_clean_Wifi,amenities_list_clean_Window guards,amenities_list_clean_elseve conditioner,amenities_list_clean_generic shampoo,amenities_list_clean_kitchen aid oven
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
ohe_df.groupby('id')
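
Grouping by listing needs an `id` column, but in the encoded frame above the ids were themselves one-hot encoded into `id_...` columns. One alternative is to build a per-listing amenity indicator table directly. The sketch below uses a made-up `toy` frame and assumes the `amenities` strings are valid JSON arrays, which is what the printed values suggest; parsing them with `json.loads` also keeps amenities that contain commas or quotes intact.

```python
import json
import pandas as pd

# Toy stand-in for listing_df[['id', 'amenities']]; ids and amenities are illustrative.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "amenities": [
        '["Wifi", "Kitchen", "55\\" HDTV with Netflix"]',
        '["Wifi", "Hair dryer"]',
        '["Kitchen"]',
    ],
})

# Parse each JSON array into a Python list, then go to one row per (listing, amenity).
toy["amenities_list"] = toy["amenities"].apply(json.loads)
long_form = toy[["id", "amenities_list"]].explode("amenities_list")

# Cross-tabulate to one row per listing and one 0/1 column per amenity.
amenity_flags = pd.crosstab(long_form["id"], long_form["amenities_list"]).clip(upper=1)
print(amenity_flags)
```

The result has one row per listing, so it can be joined back to the listing-level frame on `id`.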

## Review Score DF

In [None]:
#review_score_df = listing_df[['id', 'price', 'review_scores_rating', 'review_scores_accuracy',
#                             'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
#                             'review_scores_location', 'review_scores_value', 'number_of_reviews',
#                             'number_of_reviews_ltm', 'number_of_reviews_l30d']]

In [None]:
listing_df.info()

In [None]:
review_score_df = listing_df[['price', 'review_scores_rating', 'review_scores_accuracy',
                             'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
                             'review_scores_location', 'review_scores_value','accommodates',
                              'bedrooms', 'beds', 'instant_bookable',
                             'property_type', 'room_type', 'amenities', 'availability_365', 'availability_30',
                             'availability_90',
                             'host_id', 'calculated_host_listings_count', 'host_response_time', 'host_response_rate',
                             'host_is_superhost']]

In [None]:
# Work on a copy so that adding columns below does not trigger SettingWithCopyWarning.
df = review_score_df.copy()

In [None]:
df.head()

In [None]:
df['availability_365'].value_counts()

In [None]:
df['availability_90'].value_counts()

In [None]:
df['availability_30'].value_counts()

### Analysis of Availability
- 2939 units (26.9%) have no availability in the next 30 days.
- 1479 units (13.5%) have no availability in the next 90 days.
- 5102 units (46.7%) have 5 or fewer days available in the next 30 days.
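
Shares like these can be computed directly from boolean masks, since the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up availability values (the column names match the dataset, the numbers do not):

```python
import pandas as pd

# Toy stand-in for the availability columns; values are illustrative only.
availability = pd.DataFrame({
    "availability_30": [0, 0, 3, 5, 12, 30, 0, 7],
    "availability_90": [0, 10, 40, 5, 60, 90, 0, 30],
})

# Fraction of listings with no open days in each window.
share_full_30 = (availability["availability_30"] == 0).mean()
share_full_90 = (availability["availability_90"] == 0).mean()

# Fraction of listings with 5 or fewer open days in the next 30 days.
share_nearly_full_30 = (availability["availability_30"] <= 5).mean()

print(f"{share_full_30:.1%}, {share_full_90:.1%}, {share_nearly_full_30:.1%}")
```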

In [None]:
df.info()

In [None]:
df['instant_bookable'].value_counts()

### Host Info

In [None]:
df['host_id'].value_counts()

In [None]:
df['calculated_host_listings_count'].value_counts()

In [None]:
#df['host_listings_1-4'] = df['calculated_host_listings_count'] <= 4
#df['host_listings_3-4'] = df['calculated_host_listings_count'] == range(3 , 4)
df['host_listings_5+'] = df['calculated_host_listings_count'] >= 5

In [None]:
df['host_listings_5+'].value_counts()

In [None]:
#df.groupby('host_id').mean()

In [None]:
df['host_is_superhost'].value_counts()

### New Feature: Capacity Ranges

In [None]:
df['accommodates'].value_counts()

In [None]:
#df['capacity_0-2'] = df['accommodates'] <= 2

In [None]:
#df['capacity_3-4'] = df['accommodates'] == range(3-4)

In [None]:
#df['capacity_5-9'] = df['accommodates'] == range(5-9)
#df['capacity_10+'] = df['accommodates'] >= 10

In [None]:
#df['capacity_-2'] = df['accommodates'] <= 2
#df['capacity_-4'] = df['accommodates'] <= 4
#df['capacity_-6'] = df['accommodates'] <= 6
#df['capacity_-10'] = df['accommodates'] <= 10
#df['capacity_11+'] = df['accommodates'] >= 11


In [None]:
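# Capacity bands: couple (1-2 guests), family (3-4 guests), large (5+ guests).
df['capacity_couple'] = df['accommodates'] <= 2
# Comparing a Series to range(3, 4) does not test membership; between() is inclusive on both ends.
df['capacity_family'] = df['accommodates'].between(3, 4)
# Assumed intent: "large" means 5 or more guests (the original <= 5 overlapped the other bands).
df['capacity_large'] = df['accommodates'] >= 5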
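
Non-overlapping bands like these can also be produced in one step with `pd.cut`. The sketch below runs on a made-up `accommodates` Series; the band edges (1-2, 3-4, 5+) are an assumption about the intended ranges, not something confirmed by the notebook.

```python
import pandas as pd

# Toy guest capacities; values are illustrative only.
accommodates = pd.Series([1, 2, 3, 4, 5, 8, 16], name="accommodates")

# Bins are right-inclusive by default: (0, 2], (2, 4], (4, inf).
capacity_band = pd.cut(
    accommodates,
    bins=[0, 2, 4, float("inf")],
    labels=["couple", "family", "large"],
)
print(capacity_band.tolist())
```

A single categorical column guarantees that each listing falls in exactly one band, which separate boolean columns do not.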

In [None]:
df['capacity_family'].value_counts()

### New Feature: Bedroom Ranges

In [None]:
df['bedrooms'].value_counts()

In [None]:
#df['bedrooms_1'] = df['bedrooms'] == 1
#df['bedrooms_2'] = df['bedrooms'] == 2
#df['bedrooms_3'] = df['bedrooms'] == 3
#df['bedrooms_4+'] = df['bedrooms'] >= 4

In [None]:
df['bedrooms_1-2'] = df['bedrooms'] <= 2
df['bedrooms_3+'] = df['bedrooms'] >= 3

In [None]:
df['bedrooms_1-2'].value_counts()

### New Feature: Booked
- Booked: for the next 30 days, at most 3 days are still available (at least 27 days are already taken).

In [None]:
df['booked'] = df['availability_30'] <= 3

In [None]:
df.head()

In [None]:
df.drop(['availability_30', 'availability_90', 'availability_365'], axis=1, inplace=True)

In [None]:
df.info()

### Number of Units with a 5.0 Avg Rating

In [None]:
df['review_scores_rating'].value_counts()

In [None]:
df['review_scores_rating'].isna().sum()

There are 1527 null records that need to be dealt with. If I drop them, I will lose 14% of my data.

In [None]:
nulls = df[df['review_scores_rating'].isna()]

In [None]:
nulls.head(3)

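The null rows appear to be listings with no ratings at all. Let's drop them for now. Note that `dropna()` with no arguments removes rows with a null in any column, not only `review_scores_rating`, so more than 1527 rows are removed.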

In [None]:
df = df.dropna()
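
To drop only the listings that lack a rating, `dropna` accepts a `subset` argument. A minimal sketch on a made-up two-column frame, showing the difference between the two calls:

```python
import numpy as np
import pandas as pd

# Toy frame: one row is missing the rating, another is missing bedrooms.
toy = pd.DataFrame({
    "review_scores_rating": [5.0, np.nan, 4.8, 4.9],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
})

# Drops only rows where the rating itself is missing.
rated_only = toy.dropna(subset=["review_scores_rating"])

# Drops rows with a missing value in any column.
complete_rows = toy.dropna()

print(len(rated_only), len(complete_rows))
```

Which variant is appropriate depends on whether the other columns with nulls are actually used as features.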

In [None]:
df

<b> 8385 Records are left after dropping null values </b>

### Creating Classifier Column

In [None]:
df['rating_5'] = df['review_scores_rating'] == 5.00
df['accuracy_5'] = df['review_scores_accuracy'] == 5.00
df['cleanliness_5'] = df['review_scores_cleanliness'] == 5.00
df['checkin_5'] = df['review_scores_checkin'] == 5.00
df['location_5'] = df['review_scores_location'] == 5.00
df['value_5'] = df['review_scores_value'] == 5.00
df['communication_5'] = df['review_scores_communication'] == 5.00

In [None]:
df['rating_5'].value_counts()

### 5.0 Rating Analysis:
- Out of 8385 records:
    - 22.5% had a perfect rating
    - 77.5% did not have a perfect rating

### Future Work: Analyze how much more likely a 5.0 rated rental is to be successful than a non-5.0 rated rental

In [None]:
df['location_5'].value_counts()

### Creating Binned Price Categories

In [None]:
df['price'].describe()

In [None]:
df['price_high'] = df['price'] >= 350
df['price_low'] = df['price'] <= 120

In [None]:
#df['price_high'] = df['price'] >= 200
#df['price_low'] = df['price'] < 200

In [None]:
df

### Checking # of reviews

In [None]:
#df['number_of_reviews'].value_counts()

In [None]:
#df['5+_reviews'] = df['number_of_reviews'] >= 5.00

In [None]:
df

In [None]:
df['stripped_rating'] = df['review_scores_rating'].astype(str).str[:1]

In [None]:
df['stripped_rating']

In [None]:
df = df.sort_values('stripped_rating', ascending=False)

In [None]:
df.info()

## Attempting visualization

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))

# Mean price for each leading digit of the overall rating.
p = sns.lineplot(data=df, x='stripped_rating', y='price', color='blue', ax=ax)

plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

# Format the y-axis ticks as dollar amounts.
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)

plt.tight_layout()
plt.show()

### Analysis: 
This is a very basic plot, but it seems to show a pricing "sweet spot" at around $300.

In [None]:
#new_df = df[df['5+_reviews'] == True]

In [None]:
#new_df['5+_reviews'].value_counts()

In [None]:
scatter_df = df[df['review_scores_rating'] >= 4.0]

In [None]:
scatter_mean = scatter_df['price'].mean()
scatter_mean

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))

# Mean price at each overall rating, for listings rated 4.0 or higher.
p = sns.lineplot(data=scatter_df, x='review_scores_rating', y='price', color='blue', ax=ax)

# Red reference line: mean price across these listings.
y1 = p.axhline(scatter_mean, color='red')

plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

# Format the y-axis ticks as dollar amounts.
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)

plt.tight_layout()
plt.show()

In [None]:
df.head(1)

In [None]:
df.sort_values('rating_5', ascending=False).head()

### Creating Analysis_df

In [None]:
analysis_df = df.copy()

## Modelling

### Getting Data Ready for Modelling

In [None]:
cont_features = [col for col in df.columns if df[col].dtype in [np.float64, np.int64]]
feature_df = df.loc[:, cont_features]
feature_df.head()

In [None]:
#test_df = df.drop(cont_features, axis=1)

In [None]:
#test_df

In [None]:
#feature_df.drop(['review_scores_rating', 'review_scores_accuracy','review_scores_cleanliness',
#                 'review_scores_checkin', 'review_scores_communication','review_scores_value',
#                 'review_scores_location', 'accommodates', 'bedrooms',
#                 'beds'], axis=1, inplace=True#)

In [None]:
#feature_df.drop(['review_scores_location'], axis=1, inplace=True)

In [None]:
feature_df.head()

In [None]:
df.info()

#### Cleaning and Splitting Amenities

In [None]:
df_bravo = df.copy()

In [None]:
data= df_bravo
data['amenities_list'] = data['amenities'].str.split(pat=",")
amenities_data = data.explode('amenities_list')
amenities_data.head(5)

In [None]:
# Remove brackets, quotes, and other punctuation, then trim surrounding whitespace.
amenities_data['amenities_list_clean'] = (
    amenities_data['amenities_list'].str.replace(r'[^\w\s]+', '', regex=True).str.strip()
)

In [None]:
amenities_data.drop('amenities_list', axis=1, inplace=True)

In [None]:
df = amenities_data

In [None]:
df.head()
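
One thing to check at this point: `explode` produces one row per (listing, amenity), so every listing-level value such as `rating_5` is repeated once per amenity. If the modelling frame is built from the exploded rows, identical rows can land in both the training and test sets. The sketch below uses a made-up two-listing frame to show the effect and one way to get back to one row per listing, assuming the original index identifies the listing.

```python
import pandas as pd

# Toy frame: two listings with three and one amenities respectively.
toy = pd.DataFrame({
    "rating_5": [True, False],
    "amenities_list": [["Wifi", "Kitchen", "TV"], ["Wifi"]],
})

# explode repeats the index and all other columns once per amenity.
exploded = toy.explode("amenities_list")

# Keep the first row for each original index to return to one row per listing.
one_row_per_listing = exploded[~exploded.index.duplicated(keep="first")]

print(len(exploded), len(one_row_per_listing))
```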

#### One Hot Encoding

In [None]:
need_to_encode = df[['rating_5', 'accuracy_5', 'cleanliness_5', 'checkin_5', 
                    'location_5', 'value_5', 'communication_5', 'price_high', 'price_low', 'room_type',
                     'capacity_couple', 'capacity_family', 'capacity_large',
                    #'capacity_1-4', 'capacity_5+',
                    # 'capacity_-2','capacity_-4', 'capacity_-6', 'capacity_-10', 'capacity_11+',
                    #'bedrooms_1', 'bedrooms_2', 'bedrooms_3', 'bedrooms_4+', 'instant_bookable']]
                     'bedrooms_1-2', 'bedrooms_3+', 'instant_bookable', 'booked',
                     'host_listings_5+', 'host_is_superhost'
                    ]]
ohe = OneHotEncoder()
ohe.fit(need_to_encode)

ohe_1 = ohe.transform(need_to_encode).toarray()

# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it.
ohe_df = pd.DataFrame(ohe_1, columns=ohe.get_feature_names_out(need_to_encode.columns))
ohe_df.head(2)

In [None]:
# Combining everything together
#cleaned_df = pd.concat([pd.DataFrame(feature_df), ohe_df], axis=1)
#cleaned_df.head(2)

In [None]:
cleaned_df = ohe_df

In [None]:
ohe_df['host_is_superhost_t']

### Top Rated and Low Rated DFs Created

In [None]:
top_rated = cleaned_df[cleaned_df['rating_5_True'] == 1]

In [None]:
low_rated = cleaned_df[cleaned_df['rating_5_False'] == 1]

In [None]:
cleaned_df.info()

#### Dropping One Value for Categoricals

In [None]:
cleaned_df = cleaned_df.drop(['rating_5_False', 'accuracy_5_False',
                             'cleanliness_5_False', 'checkin_5_False', 'location_5_False',
                             'value_5_False', 'price_high_False', 'price_low_False',
                           # 'bedrooms_1_False', 'bedrooms_2_False', 'bedrooms_3_False', 'bedrooms_4+_False',
                             'bedrooms_1-2_False', 'bedrooms_3+_False',
                            #'capacity_1-4_False', 'capacity_5+_False',
                             # 'capacity_-2_False','capacity_-4_False', 'capacity_-6_False',
                             # 'capacity_-10_False', 'capacity_11+_False',
                              'capacity_couple_False', 'capacity_family_False', 'capacity_large_False',
                              'instant_bookable_f', 'booked_False', 'room_type_Hotel room',
                             'communication_5_False', 'host_is_superhost_f', 'host_listings_5+_False'],
                              axis=1)

In [None]:
#top_rated_df = top_rated.drop(['rating_5_False', 'accuracy_5_False',
#                             'cleanliness_5_False', 'checkin_5_False', 'location_5_False',
#                             'value_5_False', 'price_high_False', 'price_low_False',
 #                          # 'bedrooms_1_False', 'bedrooms_2_False', 'bedrooms_3_False', 'bedrooms_4+_False',
  #                           'bedrooms_1-2_False', 'bedrooms_3+_False',
   #                         'capacity_1-4_False', 'capacity_5+_False', 'instant_bookable_f', 'booked_False'],
    #                          axis=1)

In [None]:
#low_rated_df = low_rated.drop(['rating_5_False', 'accuracy_5_False',
 #                            'cleanliness_5_False', 'checkin_5_False', 'location_5_False',
  #                           'value_5_False', 'price_high_False', 'price_low_False',
   #                        # 'bedrooms_1_False', 'bedrooms_2_False', 'bedrooms_3_False', 'bedrooms_4+_False',
    #                         'bedrooms_1-2_False', 'bedrooms_3+_False',
     #                       'capacity_1-4_False', 'capacity_5+_False', 'instant_bookable_f', 'booked_False'],
      #                        axis=1)

In [None]:
#Dropping a few of the redundant values.
#cleaned_df= cleaned_df.drop(['rating_5_False', 'accuracy_5_False',
#                             'cleanliness_5_False', 'checkin_5_False', 'location_5_False',
 #                            'value_5_False', 'price_high_False', 'price_low_False',
  #                         # 'bedrooms_1_False', 'bedrooms_2_False', 'bedrooms_3_False', 'bedrooms_4+_False',
   #                          'bedrooms_1-2_False', 'bedrooms_3+_False',
    #                        'capacity_1-4_False', 'capacity_5+_False', 'instant_bookable_f'], axis=1)

In [None]:
#cleaned_df= cleaned_df.drop(['price_low_True', 'price_high_True'], axis=1)
#cleaned_df= cleaned_df.drop(['price'], axis=1)

In [None]:
#cleaned_df.head(1)

In [None]:
cleaned_df['rating_5_True'].value_counts()

In [None]:
cleaned_df['instant_bookable_t'].value_counts()

#### Dealing with Class Imbalance

- Rentals without a 5.0 rating outnumber rentals with one by about 3.4 to 1.
- <b> Solution </b>
    - Always use class weight parameter in Decision Tree Classifier
    - Always stratify Train Test Split.
    - Add SMOTE to Training Sets.
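
For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that with scikit-learn's `compute_class_weight` on a made-up label vector with the same 77.5% / 22.5% split; the 1000-row size is arbitrary.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with the same class proportions as the 5.0-rating target.
y = np.array([0] * 775 + [1] * 225)

# balanced weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The minority class is up-weighted by the imbalance ratio (about 3.4).
imbalance_ratio = weights[1] / weights[0]
print(weights, round(imbalance_ratio, 2))
```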

In [None]:
cleaned_df['rating_5_True']

In [None]:
cleaned_df.isna().sum()

In [None]:
#cleaned_df = cleaned_df.dropna()
#top_rated_df = top_rated_df.dropna()
#low_rated_df = low_rated_df.dropna()

In [None]:
#top_rated_df.isna().sum()

In [None]:
#cleaned_df.info()

In [None]:
#cleaned_df.isna().sum()

In [None]:
#cleaned_df.drop(['number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'price'], axis=1, inplace=True)

In [None]:
#dropping accuracy to see what the feature importances are
#cleaned_df.drop(['accuracy_5_True'], axis=1, inplace=True)

In [None]:
#dropping rating_5 and searching for accuracy as the target
#cleaned_df.drop(['rating_5_True'], axis=1, inplace=True)
#top_rated_df.drop(['rating_5_True'], axis=1, inplace=True)
#low_rated_df.drop(['rating_5_True'], axis=1, inplace=True)

In [None]:
# dropping price_high as it is redundant
# price_low means that the listing is below the mean avg price
#cleaned_df.drop(['price_high_True'], axis=1, inplace=True)
#top_rated_df.drop(['price_high_True'], axis=1, inplace=True)
#low_rated_df.drop(['price_high_True'], axis=1, inplace=True)

In [None]:
booked_df = cleaned_df[cleaned_df['booked_True'] == 1]
booked_df.head()

In [None]:
booked_df.mean()

In [None]:
#cleaned_df.head(1)

In [None]:
#top_rated_df.describe()

In [None]:
#new_df = top_rated_df[['checkin_5_True', 'location_5_True', 'price_low_True', 'capacity_5+_True',
#                    'instant_bookable_t', 'booked_True']]

In [None]:
balanced_df = cleaned_df.copy()
#balanced_df = top_rated_df.copy()
#balanced_df = new_df.copy()

X = balanced_df.drop(['rating_5_True'], axis=1)
y = balanced_df['rating_5_True']
#X = balanced_df.drop(['booked_True'], axis=1)
#y = balanced_df['booked_True']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, stratify=y, random_state=23)

smote = SMOTE(random_state=23)
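# fit_sample was removed from imbalanced-learn; fit_resample is the current method name.
# Only the training split is oversampled, so the test set keeps the real class balance.
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)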
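
The effect of `stratify=y` can be checked by comparing class proportions in each split. This sketch uses synthetic features and a made-up label vector with a 22.5% positive rate; SMOTE is left out so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels; 22.5% of the labels are positive.
rng = np.random.default_rng(23)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 225 + [0] * 775)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=23
)

# With stratification, both splits keep roughly the original positive rate.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```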

### Metrics Function

In [None]:
def get_metrics(clf, y_pred):
    """Print test-set precision, recall, F1, and ROC AUC for y_pred, then the mean
    10-fold cross-validation score of clf on the resampled training data.

    Uses the notebook-level y_test, X_train_resampled, and y_train_resampled.
    """
    clf_prec = precision_score(y_test, y_pred) * 100
    print('Precision is: {0}'.format(clf_prec))
    clf_rcl = recall_score(y_test, y_pred) * 100
    print('Recall is: {0}'.format(clf_rcl))
    clf_f1 = f1_score(y_test, y_pred) * 100
    print('F1 Score is: {0}'.format(clf_f1))
    # ROC AUC here is computed from hard 0/1 predictions rather than probabilities.
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
    clf_roc_auc = auc(false_positive_rate, true_positive_rate)
    print('ROC AUC is: {0}'.format(round(clf_roc_auc, 2)))
    # cross_val_score clones and refits clf, using the estimator's default scorer.
    clf_cv_score = np.mean(cross_val_score(clf, X_train_resampled, y_train_resampled, cv=10))
    print('Cross Validation Score is: {0}'.format(round(clf_cv_score, 3)))
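
ROC AUC is usually computed from predicted scores rather than hard labels, because hard labels give the curve only one threshold point. The sketch below contrasts the two on synthetic data from `make_classification`, with a small decision tree standing in for the notebook's models; none of these names come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (roughly 22.5% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.775, 0.225], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=23)

model = DecisionTreeClassifier(max_depth=4, random_state=23).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (what a label-based roc_curve call measures).
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of the positive class.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(round(auc_from_labels, 3), round(auc_from_scores, 3))
```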

### Choosing Evaluation Metrics
- My goal is to predict whether a person will get a perfect 5.0 Airbnb rating.
- Which is worse?
    - Model predicts that someone has a perfect rating, but they actually don't? (more false Positives)
    - Model predicts that someone does not have a perfect rating, but they actually do? (more false negatives)

<b> Decision </b>
- I want false positives to be as low as possible.
- If my model says that a property will have a 5.0 score, I want that to be a near guarantee.
- If it misses some properties that still get a 5.0 score, that is fine.
- <b>Therefore, I am most concerned with Precision, balanced out by F1 score.</b>
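
To make the trade-off concrete, here is how the three metrics follow from confusion-matrix counts. The counts are made up for illustration: 80 true positives, 20 false positives, and 40 false negatives.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted 5.0 listings, how many really are 5.0
    recall = tp / (tp + fn)      # of real 5.0 listings, how many the model found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Lowering false positives raises precision even if recall drops, which is the trade-off chosen above.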

## Baseline Decision Tree

In [None]:
# Baseline without SMOTE: fit on the original (imbalanced) training split.
dt0 = DecisionTreeClassifier(random_state=23, class_weight="balanced")
dt0.fit(X_train, y_train)
dt0_y_pred = dt0.predict(X_test)
get_metrics(dt0, dt0_y_pred)

In [None]:
dt1 = DecisionTreeClassifier(random_state=23, class_weight="balanced")
dt1.fit(X_train_resampled, y_train_resampled)
dt1_y_pred = dt1.predict(X_test)
get_metrics(dt1, dt1_y_pred)

In [None]:
dt1_matrix = confusion_matrix(y_test, dt1_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(dt1_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('Baseline Decision Tree Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - order must match the class order (0 = not 5.0, 1 = 5.0)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()
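
When reading these heatmaps it helps to remember the layout scikit-learn uses: rows are actual classes, columns are predicted classes, both in sorted label order. A small sketch with made-up labels that unpacks the four cells by name:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = perfect 5.0 rating, 0 = anything else.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels the matrix is [[tn, fp], [fn, tp]]; ravel() flattens it row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```

The top-right cell (`fp`) is the one this project tries hardest to keep small.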

### Refining Decision Tree through GridSearchCV

In [None]:
dt_param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 3, 4, 5, 6],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6]
}

In [None]:
# Instantiate GridSearchCV
dt2 = DecisionTreeClassifier(random_state=23)

dt_grid_search = GridSearchCV(dt2, dt_param_grid, cv=3, scoring = 'precision')

# Fit to the data
dt_grid_search.fit(X_train_resampled, y_train_resampled)
dt_grid_search.best_params_
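
Rather than copying the best parameters by hand into a new classifier, the fitted search object exposes the refit model as `best_estimator_`. A minimal sketch on synthetic data, with a deliberately small grid; the data and grid are placeholders, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the resampled training set.
X, y = make_classification(n_samples=300, n_features=6, random_state=23)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=23),
    {"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="precision",
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is already fitted on all of X, y
# using the best parameter combination found by the search.
best_params = search.best_params_
best_tree = search.best_estimator_
print(best_params)
```

Note that a model built this way carries only the parameters in the grid plus those passed to the base estimator, so `class_weight` would need to be set on the base estimator to be included.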

## Decision Tree 2

In [None]:
dt2 = DecisionTreeClassifier(criterion='entropy', max_depth=None, min_samples_split=2,
                             min_samples_leaf=2, class_weight='balanced', random_state=23)
dt2.fit(X_train_resampled, y_train_resampled)
dt2_y_pred = dt2.predict(X_test)
get_metrics(dt2, dt2_y_pred)

In [None]:
dt2_matrix = confusion_matrix(y_test, dt2_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(dt2_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('Decision Tree 2 Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - order must match the class order (0 = not 5.0, 1 = 5.0)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()

## Random Forests

In [None]:
rf1_clf = RandomForestClassifier(random_state=23, class_weight="balanced")
rf1_clf.fit(X_train_resampled, y_train_resampled)
rf1_y_pred = rf1_clf.predict(X_test)
get_metrics(rf1_clf, rf1_y_pred)

In [None]:
rf1_matrix = confusion_matrix(y_test, rf1_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(rf1_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('Random Forests 1 Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - order must match the class order (0 = not 5.0, 1 = 5.0)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()

### GridSearch CV

In [None]:
rf_param_grid = {
    'n_estimators': [10, 30, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 6, 10],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [3, 6]
}

In [None]:
rf2_clf = RandomForestClassifier(random_state=23)


rf1_grid_search= GridSearchCV(rf2_clf, rf_param_grid, scoring = 'precision', cv=3)
rf1_grid_search.fit(X_train_resampled, y_train_resampled)

print("")
print(f"Random Forest  Optimal Parameters: {rf1_grid_search.best_params_}")

## Random Forests 2

In [None]:
rf2_clf = RandomForestClassifier(criterion= 'entropy', max_depth= None, min_samples_leaf= 3,
                                min_samples_split= 10, n_estimators= 100, random_state=23,
                                class_weight='balanced')
rf2_clf.fit(X_train_resampled, y_train_resampled)
rf2_y_pred = rf2_clf.predict(X_test)
get_metrics(rf2_clf, rf2_y_pred)

In [None]:
rf2_matrix = confusion_matrix(y_test, rf2_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(rf2_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('Random Forests 2 Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - order must match the class order (0 = not 5.0, 1 = 5.0)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()

## XGBoost Model

In [None]:
# Instantiate XGBClassifier
clf = XGBClassifier(random_state=23)

# Fit XGBClassifier
xg1 = clf.fit(X_train_resampled, y_train_resampled)

# Predict on training and test sets
training_preds = clf.predict(X_train_resampled)
xg1_y_pred = clf.predict(X_test)
get_metrics(xg1, xg1_y_pred)

In [None]:
xg1_matrix = confusion_matrix(y_test, xg1_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(xg1_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('XG Boost 1 Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - order must match the class order (0 = not 5.0, 1 = 5.0)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()

### GridSearch

In [None]:
boost_param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}

## XGBoost 2

In [None]:
xg2 = XGBClassifier(random_state=23)

grid_clf = GridSearchCV(xg2, boost_param_grid, scoring='precision', cv=3, n_jobs=1)
grid_clf.fit(X_train_resampled, y_train_resampled)

best_parameters = grid_clf.best_params_

print('Grid Search found the following optimal parameters: ')
for param_name in sorted(best_parameters.keys()):
    print('%s: %r' % (param_name, best_parameters[param_name]))

In [None]:
xg2 = XGBClassifier(learning_rate=0.2, max_depth=6, min_child_weight=2,
                    n_estimators=100, subsample=0.7, random_state=23)
xg2.fit(X_train_resampled, y_train_resampled)
xg2_y_pred = xg2.predict(X_test)
get_metrics(xg2, xg2_y_pred)

In [None]:
xg2_matrix = confusion_matrix(y_test, xg2_y_pred)

fig, ax = plt.subplots(figsize=(5,5))

ax = sns.heatmap(xg2_matrix, annot=True, cmap='Blues', fmt='d')

ax.set_title('XG Boost 2 Confusion Matrix', fontsize = 30);
ax.set_xlabel('\nPredicted Values',fontsize = 20)
ax.set_ylabel('Actual Values ', fontsize=20);

## Tick labels - list must follow the sorted class order (negative class first)
ax.xaxis.set_ticklabels(['Bad Score','5.0 Score'])
ax.yaxis.set_ticklabels(['Bad Score','5.0 Score'])

## Display the visualization of the Confusion Matrix.
plt.show()

## FINAL MODEL = XG Boost 1

In [None]:
get_metrics(xg1, xg1_y_pred)
#get_metrics(rf1_clf, rf1_y_pred)

### Model Evaluation:
- <b> Precision: </b> When this model predicts that a rental will receive a perfect 5 star overall Airbnb rating, it is correct 79.4% of the time.
    - This is 30% better than random guessing.
    - The Final Model is also a slight improvement over the baseline model (about 2% better).
- <b> Recall and F1 Score: </b> The Final Model's Recall and F1 Scores are also slightly higher than the baseline model's. The F1 Score indicates that Precision is balanced well with Recall.
- <b> ROC AUC Score: </b> Summarizes the True Positive Rate against the False Positive Rate. My Random Forests 2 Model performed better on this metric, but also produced more false positives. I want to avoid false positives if at all possible, so I chose XG Boost 1 over Random Forests 2.
- <b> Cross Validation Score: </b> The high score suggests that this model generalizes well to new data that it was not trained on.
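
The metrics in this list can be worked out by hand from the four cells of a confusion matrix. The sketch below uses made-up counts (an assumption for illustration, not this notebook's results) to show why precision is the metric that drops when false positives go up.

```python
# Hypothetical confusion-matrix counts, NOT the results of the models above.
# The order matches sklearn's confusion_matrix(y_true, y_pred).ravel()
# for a binary target: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = 50, 10, 15, 25

# Precision: of the rentals predicted as 5.0, how many really are 5.0?
precision = tp / (tp + fp)

# Recall (True Positive Rate): of the real 5.0 rentals, how many were found?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: the x-axis of the ROC curve
false_positive_rate = fp / (fp + tn)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"FPR:       {false_positive_rate:.3f}")
```

With these toy counts, every extra false positive lowers precision but leaves recall unchanged, which is why precision is the metric to watch when false positives are the main concern.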


## Feature Importance

In [None]:
feature_names = list(X)
feature_names

In [None]:
xg1_importance = xg1.feature_importances_
xg1_importance

In [None]:
#feature_importance_df = pd.DataFrame(dt2_importance, feature_names)
#feature_importance_df = pd.DataFrame(rf1_importance, feature_names)
feature_importance_df = pd.DataFrame(xg1_importance, feature_names)
feature_importance_df= feature_importance_df.reset_index()
feature_importance_df.rename(columns={'index': 'Feature', 0: 'Importance'}, inplace=True)
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
feature_importance_df

In [None]:
# plot feature importance
fig, ax = plt.subplots(figsize=(50,20))
p = sns.barplot(data=feature_importance_df, x='Importance', y='Feature', color ='mediumpurple' );
p.set_xlabel("Importance", fontsize = 50)

p.set_ylabel("Feature", fontsize = 50)
plt.xticks(fontsize=40)
plt.yticks(fontsize=40)

p.set_title("Features by Importance", fontsize = 100)

plt.show();

## <b>Review Metric DF (or Feature Analysis DF)</b>

In [None]:
review_metrics = df[['rating_5', 'cleanliness_5', 'checkin_5', 'location_5', 'value_5',
                     'communication_5', 'accuracy_5',
                    'instant_bookable', 'booked', 'host_is_superhost',
                    'host_listings_5+', 'capacity_couple', 'capacity_family', 'capacity_large',
                     'bedrooms_1-2', 'bedrooms_3+', 'room_type' ]]

In [None]:
review_metrics.info()

## Function get_stats( )

In [None]:
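def get_stats(df):
    # df: output of groupby(<boolean column>).sum(), indexed by False / True

    # One row per metric, with the False / True group sums as columns
    df_transposed = df.transpose()
    df_transposed = df_transposed.reset_index()
    df_transposed.rename(columns={'index': 'Metric'}, inplace=True)
    stats_df = df_transposed
    # Total count per metric across both groups
    total = stats_df.apply(lambda x: x[False] + x[True], axis=1)
    stats_df['total'] = total
    # Share of each metric's count that falls in the True group
    mean = stats_df.apply(lambda x: x[True] / x['total'], axis=1)
    stats_df['mean'] = mean

    return stats_df.sort_values('mean', ascending=False)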

In [None]:
def plot_stats(df):
    
    fig, ax = plt.subplots(figsize=(50,20))
    p = sns.lineplot(data=df.head(10), x='Metric', y='mean', color ='seagreen' );
    p.set_ylabel("Percentage of 5 Star Rentals", fontsize = 50)

    p.set_xlabel("Feature", fontsize = 50)
    plt.xticks(fontsize=30)
    plt.yticks(fontsize=40)

    ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=None, symbol='%', is_latex=False))

    p.set_title("Frequency of Features in 5 Star Rentals ", fontsize = 100)

    return plt.show();

### Accuracy Stats

In [None]:
accuracy_metrics = review_metrics.groupby('accuracy_5').sum()
accuracy_metrics

In [None]:
accuracy_stats = get_stats(accuracy_metrics)
accuracy_stats

In [None]:
accuracy_plot = plot_stats(accuracy_stats)
accuracy_plot

### Communication Stats

In [None]:
communication_metrics = review_metrics.groupby('communication_5').sum()
communication_metrics

In [None]:
communication_stats = get_stats(communication_metrics)
communication_stats

In [None]:
communication_plot = plot_stats(communication_stats)
communication_plot

### Superhost Stats

In [None]:
superhost_metrics = review_metrics.groupby('host_is_superhost').sum()
superhost_metrics

In [None]:
superhost_stats = get_stats(superhost_metrics)
superhost_stats

In [None]:
superhost_plot = plot_stats(superhost_stats)
superhost_plot

### Cleanliness Stats

In [None]:
cleanliness_metrics = review_metrics.groupby('cleanliness_5').sum()
cleanliness_metrics

In [None]:
cleanliness_stats = get_stats(cleanliness_metrics)
cleanliness_stats

In [None]:
cleanliness_plot = plot_stats(cleanliness_stats)
cleanliness_plot

### Perfect Rating Stats

In [None]:
review_analysis_df = review_metrics.groupby('rating_5').sum()

In [None]:
rating_5_stats = get_stats(review_analysis_df)
rating_5_stats

In [None]:
rating_5_plot = plot_stats(rating_5_stats)
rating_5_plot

## Attempting Master LinePlot

In [None]:
fig, ax = plt.subplots(figsize=(50,20))
p = sns.lineplot(data=communication_stats, x='Metric', y='mean', color ='green' );
p = sns.lineplot(data=accuracy_stats, x='Metric', y='mean', color ='blue' );
p = sns.lineplot(data=cleanliness_stats, x='Metric', y='mean', color ='black' );
p = sns.lineplot(data=superhost_stats, x='Metric', y='mean', color ='purple' );
p = sns.lineplot(data=rating_5_stats, x='Metric', y='mean', color ='red' );


p.set_ylabel("Percentage of 5 Star Rentals", fontsize = 50)

p.set_xlabel("Feature", fontsize = 50)
plt.xticks(fontsize=30)
plt.yticks(fontsize=40)

ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=None, symbol='%', is_latex=False))

p.set_title("Frequency of Features in 5 Star Rentals ", fontsize = 100)

plt.show();

In [None]:
review_analysis_transposed = review_analysis_df.transpose()
review_analysis_transposed = review_analysis_transposed.reset_index()
review_analysis_transposed.rename(columns={'index': 'Metric'}, inplace=True)
df2 = review_analysis_transposed
total = df2.apply(lambda x: x[False] + x[True], axis=1)
df2['total'] = total
mean = df2.apply(lambda x: x[True] / x['total'], axis=1)
df2['mean'] = mean

df2 = df2.sort_values('mean', ascending=False)
df2

In [None]:
# plot the top 10 features by share of 5 star rentals
fig, ax = plt.subplots(figsize=(50,20))
p = sns.barplot(data=df2.head(10), x='Metric', y='mean', color ='seagreen' );
p.set_ylabel("Percentage of 5 Star Rentals", fontsize = 50)

p.set_xlabel("Feature", fontsize = 50)
plt.xticks(fontsize=30)
plt.yticks(fontsize=40)

ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=None, symbol='%', is_latex=False))

p.set_title("Frequency of Features in 5 Star Rentals ", fontsize = 100)

plt.show();

In [None]:
review_metrics['checkin_5'].mean()

In [None]:
new_df = pd.DataFrame(list(review_metrics.columns.values))

In [None]:
new_df

In [None]:
review_metrics['checkin_5'].value_counts()

In [None]:
review_metrics.describe()

In [None]:
review_metrics.groupby('rating_5').sum()

In [None]:
review_metrics.groupby('rating_5').count()

In [None]:
review_metrics.groupby('accuracy_5').count()

In [None]:
review_metrics.groupby('cleanliness_5').count()

In [None]:
review_metrics.groupby('accuracy_5').sum()

In [None]:
review_metrics.groupby('accuracy_5').mean()

In [None]:
review_metrics.groupby('accuracy_5').sum()

In [None]:
balanced_df['accuracy_5_True']

## Analysis of Features

In [None]:
balanced_df.info()

In [None]:
analysis_df_2 = balanced_df.copy()

In [None]:
#analysis_df.info()

In [None]:
#analysis_df.iloc[:, 12:31]

In [None]:
#analysis_df.drop(analysis_df.iloc[:, 12:31], axis=1, inplace=True)

In [None]:
#analysis_df['review_scores_rating'].value_counts()

In [None]:
top_accuracy_df = analysis_df[analysis_df['review_scores_accuracy'] >= 4]
top_accuracy_df = top_accuracy_df[top_accuracy_df['review_scores_rating'] >= 4]

In [None]:
import matplotlib.ticker as mtick

fig, ax = plt.subplots(figsize=(10,10))
p = sns.scatterplot(x="review_scores_accuracy", y="review_scores_rating", data=top_accuracy_df);

p.invert_xaxis()
p.invert_yaxis()




plt.show();

### Analysis:
- Even a small drop in the accuracy score corresponds to a drop in the overall review score rating.

In [None]:
balanced_df['instant_bookable_t'].value_counts()

## Comparison: Location

In [None]:
top_location_df = analysis_df[analysis_df['review_scores_location'] >= 3.5]
top_location_df = top_location_df[top_location_df['review_scores_rating'] >= 3.5]

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
p = sns.scatterplot(x="review_scores_rating", y="review_scores_location", data=top_location_df);

p.invert_xaxis()
p.invert_yaxis()


plt.show();

### Analysis:
- Location has a much less linear relationship with the overall rating than accuracy does.
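
One way to put a number on "less linear" is the Pearson correlation of each sub-score with the overall rating. The sketch below is only an illustration: the column names come from this notebook, but the five rows are invented toy values, so the resulting correlations say nothing about the real San Diego data.

```python
import pandas as pd

# Toy values invented for illustration, NOT taken from the Airbnb dataset.
toy_df = pd.DataFrame({
    'review_scores_rating':   [5.0, 4.9, 4.8, 4.7, 4.6],
    'review_scores_accuracy': [5.0, 4.9, 4.75, 4.7, 4.55],
    'review_scores_location': [4.9, 5.0, 4.8, 5.0, 4.9],
})

# Pearson correlation (the default method of Series.corr)
accuracy_corr = toy_df['review_scores_rating'].corr(toy_df['review_scores_accuracy'])
location_corr = toy_df['review_scores_rating'].corr(toy_df['review_scores_location'])

print(f"rating vs accuracy: {accuracy_corr:.2f}")
print(f"rating vs location: {location_corr:.2f}")
```

Running the same two `corr` calls on `top_accuracy_df` and `top_location_df` would give a comparable pair of numbers for the real data, assuming those columns are numeric.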

In [None]:
perfect_score_df = analysis_df[analysis_df['review_scores_rating'] == 5.0]

In [None]:
perfect_score_df.mean()

In [None]:
not_perfect_score_df = analysis_df[analysis_df['review_scores_rating'] < 5.0]

In [None]:
not_perfect_df_mean = not_perfect_score_df.mean().reset_index()

In [None]:
not_perfect_df_mean.rename(columns={"index": "category", 0: "Mean"}, inplace=True)

In [None]:
not_perfect_df_mean.info()

### Attempting to Group records and analyze based on how many records of each type are 5.0 rated

In [None]:
analysis_df.head()

In [None]:
analysis_df['instant_bookable'] = analysis_df['instant_bookable'].astype('bool')

In [None]:
contrast_df = analysis_df_2.groupby('rating_5_True').sum()

In [None]:
contrast_df_transposed = contrast_df.transpose()

In [None]:
contrast_df_transposed = contrast_df_transposed.reset_index()

In [None]:
total = contrast_df_transposed.apply(lambda x: x[1.0] + x[0.0], axis=1)

In [None]:
contrast_df_transposed['total'] = total

In [None]:
contrast_df_transposed

In [None]:
delta = contrast_df_transposed.apply(lambda x: x[1.0] - x[0.0], axis=1)

In [None]:
#delta_percent = contrast_df_transposed.apply(lambda x: x[0.0] / x[1.0], axis=1)

In [None]:
percent_of_total_5_star = contrast_df_transposed.apply(lambda x: x[1.0] / x['total'], axis=1)
percent_of_total_not = contrast_df_transposed.apply(lambda x: x[0.0] / x['total'], axis=1)

In [None]:
contrast_df_transposed['delta'] = delta
#contrast_df_transposed['delta_percent'] = delta_percent
contrast_df_transposed['5_star_percent'] = percent_of_total_5_star
contrast_df_transposed['not_5_star_percent'] = percent_of_total_not

In [None]:
contrast_df_transposed.rename(columns={0.0:'Not_5_Star', 1.0:'5_Star_Rating'}, inplace=True)

In [None]:
contrast_df_transposed.sort_values('delta', ascending=False)

In [None]:
contrast_df.iloc[:, 2:8]

In [None]:
contrast_df.drop(contrast_df.iloc[:, 2:8], axis=1, inplace=True)

In [None]:
contrast_df

In [None]:
contrast_df['instant_bookable']

In [None]:
contrast_df['accuracy_5']

In [None]:
analysis_df['value_5'].value_counts()

In [None]:
review_metrics = contrast_df[['accuracy_5', 'cleanliness_5', 'checkin_5', 'location_5', 'value_5',
                                        'booked']]

## Analysis of Review Metrics

In [None]:
review_metrics

### Analysis:
- Units with a perfect accuracy rating were 72% more likely to have a perfect overall rating.
- Units with a perfect cleanliness rating were 64% more likely to have a perfect overall rating.
- Units with a perfect checkin rating were 63% more likely to have a perfect overall rating.
- Units with a perfect location rating were 58% more likely to have a perfect overall rating.
- Units with a perfect value rating were 56% more likely to have a perfect overall rating.
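
"More likely" can be read either as a difference in rates or as a ratio of rates, so it helps to compute both. The sketch below shows the two readings on a tiny invented table; the column names `accuracy_5` and `rating_5` come from this notebook, but the rows are made up and do not reproduce the percentages listed above.

```python
import pandas as pd

# Toy rows invented for illustration, NOT the real listings.
toy_df = pd.DataFrame({
    'accuracy_5': [True, True, True, True, False, False, False, False],
    'rating_5':   [True, True, True, False, True, False, False, False],
})

# Share of perfect overall ratings among units WITH a perfect accuracy score
rate_with = toy_df.loc[toy_df['accuracy_5'], 'rating_5'].mean()

# Share of perfect overall ratings among units WITHOUT a perfect accuracy score
rate_without = toy_df.loc[~toy_df['accuracy_5'], 'rating_5'].mean()

# Two common readings of "more likely"
difference = rate_with - rate_without   # gap in percentage points
ratio = rate_with / rate_without        # relative likelihood

print(f"With perfect accuracy:    {rate_with:.0%}")
print(f"Without perfect accuracy: {rate_without:.0%}")
print(f"Difference: {difference:.0%} points, ratio: {ratio:.1f}x")
```

Stating which of the two readings a percentage refers to makes statements like the ones above easier to compare across metrics.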


## Analysis of Accuracy_5

In [None]:
analysis_df['accuracy_5'].value_counts()

In [None]:
accuracy_test_df = analysis_df[analysis_df['accuracy_5'] == False]

In [None]:
accuracy_test_df

In [None]:
accuracy_test_df.groupby('rating_5').mean()

## Analysis of Instant Bookable

In [None]:
analysis_df_2['instant_bookable_t'].value_counts()

In [None]:
analysis_df_2.groupby('rating_5_True').mean()

In [None]:
review_diff = review_metrics.diff()

In [None]:
review_diff

In [None]:
review_transposed = review_metrics.transpose()

In [None]:
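# diff(axis=1) returns one column per group; keep the last one (second group minus first)
review_transposed['delta'] = review_transposed.diff(axis=1).iloc[:, -1]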

In [None]:
new_review_df = review_transposed.reset_index()

In [None]:
new_review_df

In [None]:
new_review_df.info()

In [None]:
new_review_df.diff(axis=1)

In [None]:
new_review_df['delta'] = new_review_df.apply(lambda x: x[1.0] - x[0.0], axis=1)

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
p = sns.scatterplot(data=not_perfect_df_mean, x="Mean", y="category");

p.invert_xaxis()
p.invert_yaxis()


plt.show();