## Business Understanding
Following on from the previous notebook, this notebook addresses the third of the three questions:
1. What are the busiest times to visit Boston and Seattle? By how much do prices spike?
2. How well can we predict the price of the listing? Which features affect the price the most?
3. **Which kind of host tends to have a better review?**

## Data Understanding

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from datetime import datetime,timedelta

In [None]:
%matplotlib inline

df_listings_bos = pd.read_csv('./boston/listings.csv')

df_listings_sea = pd.read_csv('./seattle/listings.csv')

In [None]:
df_listings_bos.info()

In [None]:
def clean_data(df):
    '''
    INPUT:
    df - pandas dataframe of listings to clean (works for both the Boston and Seattle datasets)
    
    OUTPUT:
    df - a new dataframe that has the following characteristics:
            1. include only the columns related to host and review
            2. Remove $ sign in "price" and convert the string into float
            3. Remove % sign in "host_response_rate", "host_acceptance_rate" and convert the string into float
            4. Drop the rows without review
            5. Convert t/f into binary True/False
            6. Create a column "host_about_exist" showing if the about section of the host exist
            7. Convert string to datetime
            8. Create "host_for_days": the number of days from the host's account creation until today
            9. Drop the rows with missing values in host_location, host_response_time, host_response_rate, host_acceptance_rate or host_neighbourhood
    '''
    # include columns related to host
    host_columns = ['price', 'host_name', 'host_since', 'host_location','host_about',
                    'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
                    'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications',
                    'host_has_profile_pic', 'host_identity_verified'
                   ]
    # include columns related to review
    review_columns = ['number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
                     'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
                     'review_scores_value']
    
    df = df[host_columns + review_columns].copy()
    
    # Remove $ sign in "price" and convert the string into float
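    # regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
    df['price'] = df['price'].str.replace(r"[$, ]", "", regex=True).astype("float")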
    
    # Remove % sign in "host_response_rate", "host_acceptance_rate" and convert the string into float
    df['host_response_rate'] = df['host_response_rate'].map(lambda r: str(r).replace('%','')).astype("float")
    df['host_acceptance_rate'] = df['host_acceptance_rate'].map(lambda r: str(r).replace('%','')).astype("float")
    
    # Drop the rows without reviews, since we are analysing reviews
    df.dropna(subset=review_columns, inplace=True)

    
    # Convert "t"/"f" strings into booleans (anything other than "t", including NaN, becomes False)
    for col in ["host_is_superhost", "host_has_profile_pic", "host_identity_verified"]:
        df[col] = df[col] == "t"
    
    # For simplicity, we only record whether the host filled in the about section
    # (missing or empty text counts as not filled in)
    df["host_about_exist"] = df["host_about"].fillna("").apply(bool)
    
    

    # Convert string to datetime
    df["host_since"] = pd.to_datetime(df["host_since"])
    
    
    # Create host_for_days: the number of days from account creation until today
    today = pd.Timestamp.now()
    df["host_for_days"] = (today - df["host_since"]).dt.days
    
    
    df.drop(columns=["host_about", "host_since"], inplace=True)
    
    # Drop the rows with missing values in the remaining host columns
    df.dropna(subset=["host_location", "host_response_time", "host_response_rate", "host_acceptance_rate", "host_neighbourhood"], inplace=True)
    
    return df

df_listings_bos = clean_data(df_listings_bos)
df_listings_sea = clean_data(df_listings_sea)
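The string-to-number conversions inside `clean_data` can be checked in isolation. The sketch below is illustrative only: `parse_money` and `parse_percent` are hypothetical helper names, the input values are made up, and `str.rstrip` is used here as an alternative to the `map`/`lambda` approach in the function above.

```python
import numpy as np
import pandas as pd

def parse_money(series):
    """Strip '$', ',' and spaces from strings such as '$1,250.00' and convert to float."""
    return series.str.replace(r"[$, ]", "", regex=True).astype(float)

def parse_percent(series):
    """Strip a trailing '%' from strings such as '95%' and convert to float.

    Missing values stay missing (NaN).
    """
    return series.str.rstrip("%").astype(float)

# Made-up values for illustration, not taken from the listings files
prices = parse_money(pd.Series(["$85.00", "$1,250.00"]))
rates = parse_percent(pd.Series(["95%", np.nan, "100%"]))
print(prices.tolist())
print(rates.tolist())
```

Running small checks like this before cleaning the full dataset makes it easier to spot a pattern that silently fails to match.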

In [None]:
df_listings_bos.info()

In [None]:
df_listings_sea.info()

In [None]:
plt.figure(figsize=(10,9))
plt.title('Review rating distribution in Seattle')
ax = sns.histplot(df_listings_sea['review_scores_rating'])
df_listings_sea['review_scores_rating'].describe()

In [None]:
plt.figure(figsize=(10,9))
plt.title('Review rating distribution in Boston')
ax = sns.histplot(df_listings_bos['review_scores_rating'])
df_listings_bos['review_scores_rating'].describe()

Next, we correlate the review ratings with the numeric host features to see whether any of them move together.

In [None]:
corr = df_listings_bos.select_dtypes(include=['int64', 'float64']).corr()
_mask = np.zeros_like(corr)
_mask[np.triu_indices_from(_mask)] = True
plt.figure(figsize=(24,12))
plt.title('Heatmap of corr of features in Boston')
sns.heatmap(corr, mask = _mask, vmax=.3, square=True, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

In [None]:
corr = df_listings_sea.select_dtypes(include=['int64', 'float64']).corr()
_mask = np.zeros_like(corr)
_mask[np.triu_indices_from(_mask)] = True
plt.figure(figsize=(24,12))
plt.title('Heatmap of corr of features in Seattle')
sns.heatmap(corr, mask = _mask, vmax=.3, square=True, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

From the heatmaps, there are a few takeaways (these are correlations, not proof of cause and effect):
1. Hosts with a higher response rate tend to have better reviews.
2. A higher acceptance rate is also associated with a higher review rating in Boston, but not in Seattle.
3. Hosts with more listings tend to have worse reviews, especially for check-in.
4. A higher price is associated with neither a better overall review nor a worse value rating (whether the stay was worth the price).
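The heatmap can be hard to read when only one target matters. A minimal sketch, assuming we only care about `review_scores_rating`, is to pull out that single column of the correlation matrix and sort it. The helper name `rating_correlations` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rating_correlations(df, target="review_scores_rating"):
    """Return each numeric column's Pearson correlation with `target`, highest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Made-up toy data for illustration only (not the Airbnb listings)
rng = np.random.default_rng(0)
n = 200
response_rate = rng.uniform(50, 100, n)
listings_count = rng.integers(1, 30, n).astype(float)
toy = pd.DataFrame({
    "host_response_rate": response_rate,
    "host_listings_count": listings_count,
    "review_scores_rating": 80 + 0.15 * response_rate - 0.2 * listings_count
                            + rng.normal(0, 2, n),
})
print(rating_correlations(toy))
```

On the real data, the same helper could be called as `rating_correlations(df_listings_bos)` and `rating_correlations(df_listings_sea)` to compare the two cities side by side.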

## host_response_time
We also want to know whether the host's response time affects the review.

In [None]:
plt.figure(figsize=(25,10))
plt.title("host_response_time vs review_scores_rating in Boston")

sns.boxplot(data=df_listings_bos, x='host_response_time', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([50,102])
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title("host_response_time vs review_scores_rating in Seattle")

sns.boxplot(data=df_listings_sea, x='host_response_time', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([50,102])
plt.show()

In both plots, the review ratings are similar as long as the host responds within a day. When the host takes a few days or more to respond, the ratings are generally lower.
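To back up the reading of the boxplots with numbers, one option is a per-category summary table. The sketch below assumes the four category labels commonly found in `host_response_time`; the helper name `rating_by_response_time` and the toy ratings are hypothetical.

```python
import pandas as pd

# Assumed category labels, ordered from fastest to slowest response
RESPONSE_ORDER = ["within an hour", "within a few hours",
                  "within a day", "a few days or more"]

def rating_by_response_time(df):
    """Summarise review_scores_rating for each host_response_time category."""
    ordered = df["host_response_time"].astype(
        pd.CategoricalDtype(categories=RESPONSE_ORDER, ordered=True)
    )
    return (
        df.assign(host_response_time=ordered)
          .groupby("host_response_time", observed=False)["review_scores_rating"]
          .agg(["count", "median", "mean"])
    )

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "host_response_time": ["within an hour", "within an hour", "within a day",
                           "a few days or more", "a few days or more",
                           "within a few hours"],
    "review_scores_rating": [96, 94, 93, 80, 86, 95],
})
summary = rating_by_response_time(toy)
print(summary)
```

Including the `count` column matters here: a category with only a handful of listings can show a low median purely by chance.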

## host_is_superhost

In [None]:
plt.figure(figsize=(25,10))
plt.title("The effect of superhost on rating in Boston")

sns.boxplot(data=df_listings_bos, x='host_is_superhost', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title("The effect of superhost on rating in Seattle")

sns.boxplot(data=df_listings_sea, x='host_is_superhost', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

In both plots, superhosts have noticeably higher and more consistent ratings (mostly above 90) than hosts who are not superhosts.
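One way to check whether the gap between superhosts and other hosts is larger than chance alone would produce is a permutation test on the difference in mean rating. This is a sketch rather than part of the original analysis; the function name `permutation_test_mean_diff` and the ratings below are made up.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for the difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by permuting the pooled ratings
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up ratings for illustration only
superhost = [98, 97, 96, 99, 95, 97, 98, 96]
other = [92, 88, 95, 85, 90, 93, 80, 91]
diff, p = permutation_test_mean_diff(superhost, other)
print(f"difference in means: {diff:.2f}, p-value: {p:.4f}")
```

On the cleaned data, the two samples could be selected with `df.loc[df["host_is_superhost"], "review_scores_rating"]` and `df.loc[~df["host_is_superhost"], "review_scores_rating"]`.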

## host_has_profile_pic

In [None]:
df_listings_bos.host_has_profile_pic.value_counts()

In [None]:
df_listings_sea.host_has_profile_pic.value_counts()

Very few hosts lack a profile picture, so this column carries little information and we can drop it.
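The decision to drop a near-constant column can be made explicit by measuring the share of its rarest value. The helper `minority_share` and the 1% cut-off below are assumptions chosen for illustration, and the data is made up.

```python
import pandas as pd

def minority_share(series):
    """Return the fraction of rows that hold the least common value of `series`.

    Note: a column with a single distinct value returns 1.0, so check
    series.nunique() as well before relying on this number.
    """
    shares = series.value_counts(normalize=True)
    return shares.min()

# Made-up column for illustration: 5 of 1000 hosts without a profile picture
toy = pd.Series([True] * 995 + [False] * 5)
share = minority_share(toy)

THRESHOLD = 0.01  # hypothetical cut-off, not taken from the original analysis
print(f"minority share: {share:.3%}, drop column: {share < THRESHOLD}")
```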

## host_identity_verified

In [None]:
plt.figure(figsize=(25,10))
plt.title("host_identity_verified vs rating in Boston")

sns.boxplot(data=df_listings_bos, x='host_identity_verified', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title("host_identity_verified vs rating in Seattle")

sns.boxplot(data=df_listings_sea, x='host_identity_verified', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

Ratings are slightly higher for hosts whose identity is verified.

## host_about_exist

In [None]:
plt.figure(figsize=(25,10))
plt.title("Effect of the host about section's existence on rating in Boston")

sns.boxplot(data=df_listings_bos, x='host_about_exist', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

In [None]:
plt.figure(figsize=(25,10))
plt.title("Effect of the host about section's existence on rating in Seattle")

sns.boxplot(data=df_listings_sea, x='host_about_exist', y='review_scores_rating')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=15)

plt.ylim([60,102])
plt.show()

The average rating differs little between hosts who filled in the about section and those who did not. However, the spread of ratings is slightly smaller for hosts who filled it in.
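The spread of ratings can be quantified rather than read off the boxplot. The following sketch compares the mean, standard deviation and interquartile range per group; the helper name `rating_spread` and the toy ratings are assumptions for illustration.

```python
import pandas as pd

def rating_spread(df, by, target="review_scores_rating"):
    """Return mean, standard deviation and interquartile range of `target` per group."""
    grouped = df.groupby(by)[target]
    iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
    return pd.DataFrame({"mean": grouped.mean(), "std": grouped.std(), "iqr": iqr})

# Made-up toy data: same mean rating in both groups, different spread
toy = pd.DataFrame({
    "host_about_exist": [True] * 4 + [False] * 4,
    "review_scores_rating": [94, 95, 96, 95, 88, 100, 92, 100],
})
spread = rating_spread(toy, by="host_about_exist")
print(spread)
```

The interquartile range corresponds to the height of each box in the boxplots, so it is the most direct numeric counterpart to the figures.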