## Project Motivation 
For this project, I was interested in using the Washington D.C. Airbnb data to better understand:
### Properties & Pricing:
1. How does pricing look across neighborhoods? Can I still find a relatively cheaper priced listing in one of the more expensive neighborhoods?
2. How expensive are the top 5% of listings?
3. Does proximity to attractions play a factor in price?
4. Are there common property types, room types, bedroom and bathroom counts?
### Hosts:
5. Do hosts with multiple listings tend to be in certain neighborhoods?
6. Do hosts with multiple listings stick to certain price points?
7. Can we expect better reviews from hosts with businesses?
### Reviews:
8. Does sentiment in reviews tell us which neighborhoods or price ranges are better? 


In [None]:
# imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import json
import ipywidgets as widgets
import plotly
%matplotlib inline

# my favorite
plt.style.use("fivethirtyeight")

# show full columns
pd.set_option('display.max_columns', None)

# cell width 
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

## Write figures for Medium

In [None]:
import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls

username = 'lawrencedugom'
api_key = 'o6hC0fxFQ8liKaqnDCbr'
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)



In [None]:
import plotly.io as pio


# where is orca 
plotly.io.orca.config.executable = '/Users/ldugom/anaconda3/envs/ds/bin/orca'

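def w_image(fig, name, width=1500, height=1000):
    """Upload the figure to Chart Studio and save a PNG copy locally."""
    
    # write html and png 
    #pio.write_html(fig, file=f"{name}.html", auto_open=True, width=width, height=height)
    # pass the name as `filename`; the second positional argument of py.plot is `validate`
    py.plot(fig, filename=name, auto_open=False, width=width, height=height)
    fig.write_image(f"{name}.png", width=width, height=height)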


## Datasets available to us: listings, reviews, geographical information

In [None]:
# listings data
ls = pd.read_csv("data/listings.csv")
ls_d = pd.read_csv("data/listings 2.csv")

# reviews data
rs = pd.read_csv("data/reviews.csv")
rs_d = pd.read_csv("data/reviews 2.csv")


# geography data
geo = pd.read_csv("data/neighbourhoods.csv")

with open("data/neighbourhoods.geojson") as jsonfile:
    geojson = json.load(jsonfile)

## Preliminary: Where are the neighbourhoods in DC?

In [None]:
import plotly.express as px

# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.choropleth_mapbox(ls, geojson=geojson, title="Washington D.C. Neighbourhood Map",
                           locations="neighbourhood", featureidkey="properties.neighbourhood",opacity=0.5,
                           mapbox_style="light", zoom=11)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()



## Explore numerical attributes

In [None]:
# melt listings data for graphing purposes 
melted_ls = pd.melt(ls,id_vars=['id', 'name', 'host_id', 'host_name', 'neighbourhood_group','neighbourhood', 'latitude', 'longitude'], value_vars=ls.select_dtypes(include=np.number).iloc[:, 5:].columns.tolist())
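To make the reshaping step concrete, here is a minimal sketch of what `pd.melt` does, using a made-up three-column frame. The `toy` data and the `long_df` name are assumptions for illustration only and are not part of the listings data.

```python
import pandas as pd

# Hypothetical toy frame: one row per listing, one column per numeric attribute.
toy = pd.DataFrame({
    "id": [1, 2],
    "price": [100, 200],
    "minimum_nights": [1, 3],
})

# Melt to long format: one row per (listing, attribute) pair.
# The attribute name lands in "variable" and its value in "value".
long_df = toy.melt(id_vars=["id"], value_vars=["price", "minimum_nights"])
print(long_df)
```

The long format is what lets a single dropdown filter on `variable` and plot `value`, whichever attribute is selected.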


In [None]:
# create dropdown for attributes in melted dataframe
dropdown_attribute = widgets.Dropdown(options = sorted(melted_ls.variable.unique()))

# output
output = widgets.Output()


def view_attribute(attribute):
    
    # clear output for new attribute to be plotted 
    output.clear_output()
    
    # filter df to selected attribute
    filtered = melted_ls[melted_ls['variable'] == attribute].copy()
    
    # keep values between the 10th and 80th percentiles so extremes don't dominate the color scale
    filtered = filtered[filtered.value.between(filtered.value.quantile(.10), filtered.value.quantile(.80))]
 
    with output:
        # set token
        px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
        # label the color scale with the selected attribute rather than always calling it a price
        fig = px.scatter_mapbox(filtered.rename({'value': attribute}, axis=1), lat="latitude", lon="longitude", color=attribute, template="gridon", 
                                color_continuous_scale=plotly.colors.diverging.RdYlGn,
                                   opacity=0.4, center={"lat": 38.9072, "lon": -77.0369},
                                   mapbox_style="light", zoom=11)

        fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
        fig.show()

# change attribute
def dropdown_attribute_handler(change):
    view_attribute(change.new)
    
dropdown_attribute.observe(dropdown_attribute_handler, names="value")
display(dropdown_attribute)
display(output)

#  How does pricing look across neighborhoods? Can I still find a relatively cheaper priced listing in one of the more expensive neighborhoods?


Neighborhood averages range from $83 to $198 per listing. Georgetown, Southwest, and Spring Valley are the most expensive neighborhoods in the city. 
Even within the 10 most expensive neighborhoods, you can still save $50–100 or more per night by choosing one of the less expensive options. Comparing the top 12 neighborhoods in the list with the bottom 12, an average-priced listing in the lower-priced tier costs roughly half as much.

In [None]:
# calculate outliers within neighbourhoods 
stdev = 3.0

zscores = ls[['neighbourhood', 'price']].groupby('neighbourhood').transform(
    lambda group: (group - group.mean()).div(group.std())).abs()

outliers = zscores > stdev

# take out outliers 
ls_nonoutliers = ls[~outliers.any(axis=1)].copy()


# aggregate all observations by neighbourhood to get the mean, median, and count 
summary = ls.groupby("neighbourhood").agg({'price':['mean', 'median', 'count']}).reset_index()
summary.columns = ['Neighbourhood', 'Mean Price', 'Median Price', '# of listings']

# strip outliers and redo agg
summary_nonoutliers = ls_nonoutliers.groupby("neighbourhood").agg({'price':['mean', 'count']}).reset_index()
summary_nonoutliers.columns = ['Neighbourhood', 'Mean Price', '# of listings']

# merge 
summary_final = pd.merge(summary, summary_nonoutliers, on='Neighbourhood', suffixes=[" with outliers", " without outliers"])
summary_final["# of outliers"] = summary_final["# of listings with outliers"] - summary_final["# of listings without outliers"] 
summary_final.drop(['# of listings without outliers'], axis=1, inplace=True)

# mean price by neighbourhood 
(
    summary_final
    .sort_values(["Mean Price without outliers"], ascending=False)[['Neighbourhood', 'Mean Price with outliers', 'Mean Price without outliers', 'Median Price', '# of outliers']]
    .style.background_gradient(cmap='RdYlGn', subset=['Mean Price with outliers', 'Median Price', 'Mean Price without outliers']) 
    
          
)
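The within-neighbourhood outlier filter above can be sanity-checked on a toy frame. This is a minimal sketch, not the notebook's data: the `toy` frame and its prices are made up, and the threshold is lowered to 2 standard deviations because a group of only ten values cannot reach a z-score of 3.

```python
import pandas as pd

# Hypothetical toy data: ten listings in neighbourhood "A" (one extreme), three in "B".
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 10 + ["B"] * 3,
    "price": [100, 105, 95, 110, 90, 100, 102, 98, 104, 1000, 50, 55, 60],
})

# z-score of each price relative to its own neighbourhood's mean and std
grouped = toy.groupby("neighbourhood")["price"]
zscores = (toy["price"] - grouped.transform("mean")) / grouped.transform("std")

threshold = 2.0  # assumption for the toy example; the notebook uses 3.0
toy_clean = toy[zscores.abs() <= threshold]
print(toy_clean)
```

Because the z-scores are computed per group, the $1000 listing is flagged relative to neighbourhood "A" only, and the cheaper listings in "B" are left alone.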

## Which neighborhoods have the widest variety of price ranges?  

In [None]:
# Looking at the averages across neighborhoods, it does look like one can find a relatively cheap listing in a more expensive neighborhood. 
# In fact, nine of the ten most expensive neighborhoods also have the largest IQRs of all neighborhoods, meaning that we can still find apartments priced at less than
# half of the neighborhood average (these neighborhoods have the widest range of values)

In [None]:
# drop the top 5% of prices so extreme listings don't distort the spread
ls_price_summary = ls.query("price <= price.quantile(.95)")

ls_price_summary = ls_price_summary.groupby("neighbourhood")['price'].describe().reset_index()
ls_price_summary['IQR'] = ls_price_summary['75%']  - ls_price_summary['25%']

(
ls_price_summary
[["neighbourhood", 'mean', 'std', '25%', '50%', '75%', 'IQR']].round(1)
.sort_values(["IQR","mean"], ascending=False).style.background_gradient(cmap="RdYlGn")
    .format(lambda x: "${:.0f}".format(x) if type(x) != str else x)

)
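As a quick illustration of why the IQR is a useful measure of price variety, here is a minimal sketch on made-up prices. The `toy` frame is hypothetical; the point is only that two neighbourhoods can have similar typical prices but very different spreads.

```python
import pandas as pd

# Hypothetical prices: "A" is spread out, "B" is tightly clustered.
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 5 + ["B"] * 5,
    "price": [50, 100, 150, 200, 250, 90, 95, 100, 105, 110],
})

# 25th and 75th percentiles per neighbourhood, one column per quantile
q = toy.groupby("neighbourhood")["price"].quantile([0.25, 0.75]).unstack()

# IQR = 75th percentile - 25th percentile
iqr = q[0.75] - q[0.25]
print(iqr)
```

A large IQR means the middle half of listings covers a wide price band, which is what makes a cheaper listing findable in an otherwise expensive neighbourhood.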

In [None]:
ls_nonoutliers['price'].hist(bins=50)

In [None]:
ls[ls.price <= ls.price.quantile(.95)].price.hist(bins=50);

In [None]:
# ten most expensive and ten cheapest neighbourhoods
most_expensive_neighbourhoods = summary_final.sort_values(["Mean Price without outliers"], ascending=False)[:10].Neighbourhood.values.tolist()
cheapest_neighbourhoods= summary_final.sort_values(["Mean Price without outliers"])[:10].Neighbourhood.values.tolist()

# copy so that adding a column does not trigger a SettingWithCopyWarning
top = ls[ls.neighbourhood.isin(most_expensive_neighbourhoods + cheapest_neighbourhoods)].copy()
top['category'] = top['neighbourhood'].apply(lambda x: 'Top 10 Cheapest' if x in cheapest_neighbourhoods else 'Top 10 Most Expensive')


# average price per category: a neighbourhood from the cheapest group costs roughly a third as much as one from the most expensive group
top.groupby("category")['price'].mean().reset_index().rename({'price':'Average price in Category'}, axis=1) 

## Looking at the 10 most expensive and 10 least expensive neighbourhoods visually (out of 39 total)

In [None]:
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(top, lat="latitude", lon="longitude", color="category", template="simple_white",center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=10)
fig.show()



In [None]:
# average IQR
ls_price_summary.IQR.mean()

# median IQR
ls_price_summary.IQR.median()

## How expensive is the top 25%? 
### Answer: The top listings start at ~\$200, with a few going all the way up to \$10,000! Georgetown, Capitol Hill, and Downtown/Chinatown stand out the most.

In [None]:
# clean price 
ls_d["price"] = ls_d["price"].str[1:].str.replace(",","").astype(float)

# select the top 25% of listings by price (mask built from ls_d itself so the rows line up)
outliers_df = ls_d[ls_d.price.between(ls_d.price.quantile(.75), ls_d.price.quantile(1))].copy().dropna(subset=["beds"])

sns.set_style("dark") 
sns.set_palette("Set3")

fig = px.scatter(outliers_df, y="neighbourhood_cleansed", x="price", size="beds", title="Top 25% of listings<br>sized by bedroom count",
                 color="beds", template="plotly_dark", color_continuous_scale="sunset", width=1500, height=1000)

fig.show()

In [None]:
# write out figure 
py.plot(fig, filename="top25%", auto_open=False, width=1700, height=1250)


## At a glance, it looks like the more expensive listings tend to be closer to downtown. Can we assume that the farther you move from the center of DC, the more likely you are to find a cheaper listing?

In [None]:
# bottom 95%
ls_95 = ls.query("price <= price.quantile(.95)").copy()
lsd_95 = ls_d.query("price <= price.quantile(.95)").copy()

# bin prices into deciles so outliers don't dominate a continuous color scale
ls_95['Price Decile'] = pd.qcut(ls_95['price'], 10)
lsd_95['Price Decile'] = pd.qcut(lsd_95['price'], 10)


# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(ls_95.sample(frac=.60), lat="latitude", lon="longitude", color="Price Decile", template="simple_white", opacity=0.5, 
                         color_discrete_sequence=px.colors.diverging.RdYlGn, category_orders={"Price Decile":sorted(ls_95['Price Decile'].unique().tolist())}, hover_data=["neighbourhood", "price"],
                           center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=10.5)


fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
py.plot(fig, filename="price deciles", auto_open=False, width=1500, height=1200)


In [None]:
fig.write_image("binned_prices.png", height=850, width=1200)
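For readers unfamiliar with `pd.qcut`, the decile binning used for the map can be sketched on a simple made-up series. The `prices` values below are hypothetical; the sketch only shows that `qcut` splits on quantiles, so each bin holds about the same number of listings regardless of how skewed the prices are.

```python
import pandas as pd

# Hypothetical prices: the integers 1..100
prices = pd.Series(range(1, 101), name="price")

# Ten quantile-based bins; each value is labeled with the interval it falls in
deciles = pd.qcut(prices, 10)
counts = deciles.value_counts()

# labels=False returns integer bin codes (0 = cheapest decile, 9 = most expensive)
codes = pd.qcut(prices, 10, labels=False)
print(counts.sort_index())
```

With equal-width bins, a handful of very expensive listings would squeeze almost everything into the first bin; quantile bins keep the color categories evenly populated.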

## To test our hypothesis that listings closer to downtown and the attractions fall in a more expensive price range, we'll calculate the distance to Capital One Arena in Downtown DC as well as to the Washington Monument, which is at the center of many DC attractions

In [None]:
monument = {'lat':38.8872036, 'lon':-77.045968}
arena = {'lat':38.8980942,'lon':-77.0208438}
center = {'lat':38.9072,'lon':-77.0369}
wh = {'lat': 38.8977, 'lon':-77.0365}
from math import radians, cos, sin, asin, sqrt


# Thanks https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
    
    
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 3956  # radius of the earth in miles, so the result is in miles
    return c * r
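The same formula can also be written with NumPy so that it works on whole columns at once instead of row by row. This is a sketch under my own assumptions, not the notebook's method: the function name `haversine_miles` and the `radius` parameter are hypothetical, and the argument order (longitude first, then latitude) mirrors the function above.

```python
import numpy as np

def haversine_miles(lon1, lat1, lon2, lat2, radius=3956.0):
    """Great-circle distance in miles; accepts scalars or NumPy arrays in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

# Two hypothetical points: the White House itself, and a point one degree of latitude north of it
lons = np.array([-77.0365, -77.0365])
lats = np.array([38.8977, 39.8977])

# Distance from each point to the White House (longitude first, then latitude)
d = haversine_miles(lons, lats, -77.0365, 38.8977)
print(d)
```

One degree of latitude is roughly 69 miles, which makes a handy check that the longitude and latitude arguments have not been swapped.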

## Calculate each property's distance to the Washington Monument, Capital One Arena, the city center, and the White House

In [None]:
# haversine expects (lon1, lat1, lon2, lat2), so pass longitude before latitude
ls_d["distance_to_monument"] = ls_d[['latitude', 'longitude']] \
    .apply(lambda row: haversine(row['longitude'], row['latitude'], monument.get("lon"), monument.get("lat")), axis=1)


ls_d["distance_to_arena"] = ls_d[['latitude', 'longitude']] \
    .apply(lambda row: haversine(row['longitude'], row['latitude'], arena.get("lon"), arena.get("lat")), axis=1)


ls_d["distance_to_center"] = ls_d[['latitude', 'longitude']] \
    .apply(lambda row: haversine(row['longitude'], row['latitude'], center.get("lon"), center.get("lat")), axis=1)


ls_d["distance_to_whitehouse"] = ls_d[['latitude', 'longitude']] \
    .apply(lambda row: haversine(row['longitude'], row['latitude'], wh.get("lon"), wh.get("lat")), axis=1)


In [None]:
# correlate the numeric listing columns and the price-decile dummies with the distance features
# (lsd_95 is derived from ls_d, so its index lines up with ls_d's rows)
distance_corrs = pd.concat([ls_d.select_dtypes(include=np.number), pd.get_dummies(lsd_95['Price Decile'])], axis=1).corr()[["distance_to_monument", "distance_to_arena", "distance_to_center", "distance_to_whitehouse"]].reset_index().dropna()

In [None]:
# quick curious check 
px.scatter(ls_d[ls_d.review_scores_location.notnull()], x="distance_to_monument", y="review_scores_location", trendline='lowess')

In [None]:
# look at correlations overall 
distance_corrs.round(3).sort_values("distance_to_whitehouse").style.background_gradient(low=0.0, high=.10)


## Answer: The White House/monument distances show a slightly negative association with location review scores and host listing count, and a weak positive correlation with bedrooms, availability, and the lowest-priced decile of listings.

# Question 2: What kind of properties are these listings (property type, room type, bedroom count, bathroom count, etc.)?

### Property Types: 44% Apartments, 20% Houses, 15% Townhouses (these three make up the biggest share)

In [None]:
# groupby prop type and then figure their share overall
proptypes = (
                ls_d.
                    groupby("property_type").count()["price"]
                    .reset_index().rename({'price':'# of listings'}, axis=1)
)


(
    proptypes
        .sort_values("# of listings", ascending=False)
        .assign(share=proptypes["# of listings"]/ls_d.shape[0])
        .style.background_gradient(cmap='Purples', subset=['share'])
    

)

In [None]:
prop_summary = ls_d.groupby(["neighbourhood_cleansed", "property_type"])['id'].count().reset_index().rename({"id":"listing count"},axis=1).sort_values("listing count",ascending=False)
prop_fig = px.bar(prop_summary.sort_values("listing count",ascending=False),
       title="Property Type by Neighborhood",x='neighbourhood_cleansed', y="listing count", color="property_type", 
                  template="plotly_dark", height=900, color_discrete_sequence=plotly.colors.qualitative.Light24)


In [None]:
prop_fig.update_xaxes(title='Neighborhood')

In [None]:
# save
py.plot(prop_fig, filename="property_types", auto_open=False, width=1700, height=1250)


### Room Types: The overwhelming majority of listings are entire homes/apartments (~71%), followed by private rooms (~25%).

In [None]:
roomtypes = (
                ls_d.
                    groupby("room_type").count()["price"]
                    .reset_index().rename({'price':'# of listings'}, axis=1)
)


(
    roomtypes
        .sort_values("# of listings", ascending=False)
        .assign(share=roomtypes["# of listings"]/ls_d.shape[0])
        .style.background_gradient(cmap='Blues', subset=['share'])
    

)

In [None]:
# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.scatter_mapbox(ls_d.dropna(subset=['beds']), lat="latitude", lon="longitude", color="room_type", template="simple_white", size="beds",
                          hover_data=["neighbourhood", "price"],
                           center={"lat": 38.895, "lon": -77.024},
                           mapbox_style="basic", zoom=11.5)

fig.update_layout(margin={"r":5,"t":0,"l":0,"b":0})
py.plot(fig, filename="room_types", auto_open=False, width=1500, height=100)


### Summary statistics by property type: mode and mean of price, bathrooms, bedrooms, and beds

In [None]:
prop_df = ls_d[['price', 'bathrooms','bedrooms','beds', 'property_type']].copy()
prop_df = prop_df.groupby("property_type").agg([('mode', lambda x: x.value_counts().index[0]),'count', 'mean']).reset_index()
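# flatten the two-level column names; strip('_') keeps "property_type" from becoming "property_type_"
prop_df.columns = ['_'.join(col).strip('_') for col in prop_df.columns.values]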
prop_df.rename({'beds_count':'number_of_listings'}, axis=1, inplace=True)
prop_df.drop([col for col in prop_df.columns if "count" in col], axis=1, inplace=True)
prop_df.sort_values("number_of_listings", ascending=False)
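An alternative way to get the same kind of table without flattening a two-level column index is pandas' named aggregation. This is a minimal sketch on a made-up frame: the `toy` data and the output column names are assumptions for illustration, not the notebook's results.

```python
import pandas as pd

# Hypothetical listings: three apartments and two houses
toy = pd.DataFrame({
    "property_type": ["Apartment", "Apartment", "Apartment", "House", "House"],
    "beds": [1, 1, 2, 3, 3],
})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby("property_type").agg(
    beds_mode=("beds", lambda s: s.mode().iloc[0]),  # most common bed count
    beds_mean=("beds", "mean"),
    number_of_listings=("beds", "count"),
).reset_index()
print(summary)
```

Each output column is named up front, so there is no need to join tuple column names or drop leftover count columns afterwards.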

# Question 3 (Host Analysis): Are there hosts that have multiple listings or run businesses on Airbnb? Do they tend to stay in certain neighborhoods? Do they stick to certain price points? Are their reviews impeccable thanks to their experience?

## Are there hosts that have multiple listings or run businesses on Airbnb?

In [None]:
ls_subset = ls_d[['id', 'host_id','host_name', 'host_about', 'host_response_time','host_since', 
      'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_listings_count', 'host_total_listings_count', 'neighbourhood',
      'latitude', 'longitude', 'room_type', 'property_type', 'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights',
      'availability_30', 'availability_60', 'availability_90','availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 
      'review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
      'reviews_per_month']].copy()


In [None]:
# hosts that have more than one listing 
host_summary = (
    ls_subset
        .groupby(["host_id", "host_name"])
        .count().reset_index()
        .sort_values("id", ascending=False)
        .iloc[:,:3]
        .rename({"id":"# of listings"}, axis=1)
    )

host_summary_multiple = host_summary[host_summary["# of listings"] > 1]



## Answer 1: 46% of the listings are owned by hosts with multiple listings. These hosts are probably running businesses.

In [None]:
np.sum(host_summary_multiple["# of listings"]) / np.sum(host_summary["# of listings"]) 
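The share computed above is simply "listings owned by multi-listing hosts" divided by "all listings". A minimal sketch on made-up data shows the arithmetic; the `toy` frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical listings: host 10 has three, host 20 has one, host 30 has two
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [10, 10, 10, 20, 30, 30],
})

# number of listings per host
listings_per_host = toy.groupby("host_id")["id"].count()

# hosts with more than one listing, and the share of all listings they own
multi = listings_per_host[listings_per_host > 1]
share = multi.sum() / listings_per_host.sum()
print(round(share, 3))
```

Here five of the six listings belong to hosts with more than one listing, even though only two of the three hosts are multi-listing hosts, which is why the share of listings and the share of hosts can differ a lot.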

## Do these hosts have good scores and reviews? 


In [None]:
host_reviews = host_summary_multiple.merge(ls_subset, on = ['host_id', 'host_name'])

In [None]:
# weight the scores of the hosts that have more than one listing by "reviews per month"
host_grouped = (
                host_reviews
                    .dropna()
                    .groupby(['host_id', 'host_name'])
                    [['review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value',
                          'reviews_per_month']]
)

(
    # collect weighted averages 
    host_grouped
        .apply(lambda x: pd.Series(np.average(x[['review_scores_rating','review_scores_accuracy',
                                                 'review_scores_cleanliness','review_scores_checkin','review_scores_communication',
                                                 'review_scores_location','review_scores_value']], weights=x["reviews_per_month"], axis=0),
                                   ['review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',
                                    'review_scores_communication','review_scores_location','review_scores_value']))
    
    # merge in helpful information 
    .merge(host_summary_multiple, on=['host_id', 'host_name'])
    .sort_values("# of listings", ascending=False)
    .style.background_gradient(cmap='RdYlGn',subset=['review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',
                                    'review_scores_communication','review_scores_location','review_scores_value'])
    
)
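To show what the weighting does, here is a minimal sketch of a per-host weighted average using `np.average` on a made-up frame. The `toy` data and the helper name `weighted_rating` are assumptions for illustration; only one score column is used to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical listings for two hosts
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 2],
    "review_scores_rating": [90.0, 100.0, 80.0, 60.0],
    "reviews_per_month": [1.0, 3.0, 2.0, 2.0],
})

def weighted_rating(group):
    # listings that are reviewed more often count more toward the host's score
    return np.average(group["review_scores_rating"], weights=group["reviews_per_month"])

weighted = (
    toy.groupby("host_id")[["review_scores_rating", "reviews_per_month"]]
       .apply(weighted_rating)
)
print(weighted)
```

Host 1's busier listing (three reviews per month, rated 100) pulls the weighted score above the plain mean of 95. Note that `np.average` raises an error when a group's weights sum to zero, so hosts whose listings all have zero reviews per month would need to be filtered out first.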

## Answer 2: We do not see much fluctuation in the scores here, apart from a couple of bad apples with low accuracy scores. Review scores are generally positive, both for the top hosts and overall.


In [None]:
ls_d[[col for col in ls_d.columns if "review" in col]].mean()

In [None]:
rs_d.head()

## Do they focus on one or two neighbourhoods? Do they focus on certain price points?

In [None]:
# top 100 hosts by # of listings 
top100hosts = host_summary[:100].host_id.values.tolist()

# get the mean listing price and the most common neighbourhood for each host with multiple listings
prices_neighbourhoods = \
(
    host_reviews[['id', 'host_name', 'host_id', 'price','neighbourhood']]
    .groupby(["host_name","host_id"])
    .agg({'price':'mean', 'neighbourhood':lambda x:x.value_counts().index[0]})
    .reset_index()
    .rename({'price':'mean_price'}, axis=1)
)

multiples = prices_neighbourhoods.merge(host_summary[['host_id','# of listings']], on='host_id').sort_values("# of listings", ascending=False)

In [None]:
host_summary_multiple.head()

## Answer 3 Part 1: Out of the 1,078 hosts with multiple listings, 825 (~77%) have listings in only one neighbourhood

In [None]:
h_summary = \
(
    ls[ls.host_id.isin(host_summary_multiple.host_id.unique())]
   .groupby(["host_id", "neighbourhood"])["id"].agg(["count", "sum"]).reset_index()
)

In [None]:
# share of multi-listing hosts whose listings span more than one neighbourhood
(h_summary.groupby("host_id")["neighbourhood"].nunique() > 1).mean()

In [None]:
# host ids with listings in more than one neighbourhood
multi_neigh = h_summary.groupby("host_id")["neighbourhood"].nunique().loc[lambda s: s > 1].index.tolist()

In [None]:
sns.catplot(height=10, kind='bar',aspect=1.5,
    data = h_summary.groupby("neighbourhood")['count'].sum().reset_index().sort_values("count", ascending=False)
            .rename({'count':'Number of listings', 'neighbourhood':'Neighborhood' }, axis=1),
    y='Neighborhood', x='Number of listings')

## Answer 3 Part 2: With the average IQR (of the middle 80%) of hosts with multiple listings being $40, it looks like these "businesses" tend to focus on certain price points. There is no strong linear relationship between number of listings and IQR either, with the correlation around 8%

In [None]:
q3 = ls[ls.host_id.isin(multiples.host_id.unique())].groupby(["host_id"])["price"].describe().reset_index()

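# keep hosts whose mean price falls between the 5th and 95th percentiles
# (copy so that adding the IQR column does not trigger a SettingWithCopyWarning)
q3_filtered = q3[q3["mean"].between(q3["mean"].quantile(.05), q3["mean"].quantile(.95))].copy()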

q3_filtered["IQR"] = q3_filtered["75%"] - q3_filtered["25%"]
                                                                                                                                                      
q3_filtered.sort_values("count", ascending=False).style.background_gradient(cmap="RdYlGn", subset=["25%", "50%", "75%", "IQR"])

In [None]:
# mean IQR across hosts with multiple listings
q3_filtered["IQR"].mean()

In [None]:
# does IQR tend to go up with # of listings? ("count" is the number of listings per host)
q3_filtered["count"].corr(q3_filtered["IQR"])

In [None]:
iqr = px.scatter(q3_filtered.rename({"mean":"Mean Price of listings", "50%":"50th percentile"},axis=1), x="Mean Price of listings", y='IQR', color='50th percentile', trendline='ols', 
                 template="ggplot2", color_continuous_scale=plotly.colors.sequential.Darkmint,
          title="<b>Mean Price vs IQR </b> (hosts with multiple listings)")

In [None]:
iqr.update_yaxes(showgrid=False, tickformat='$')
iqr.update_xaxes(showgrid=False, tickformat='$')


In [None]:
iqr.write_image("iqr.png", width=1200)

# Question 4 (Review Analysis): Even though reviews are subjective, do certain neighbourhoods or price ranges reveal any patterns in review sentiment?

### For this we'll use NLTK's Vader Sentiment Intensity Analyzer. Learn more here: https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# build the analyzer once instead of re-creating it for every review
nltk_sentiment = SentimentIntensityAnalyzer()

def get_sentiment(sentence):
    """Return VADER's neg/neu/pos scores plus the 'normalized, weighted composite score' (compound)."""
    return nltk_sentiment.polarity_scores(sentence)



In [None]:
# make a reviews dataframe 
reviews = (
            rs_d
                .rename({'id':'review_id'}, axis=1)
                .merge(ls[['host_id', 'host_name', 'id']], left_on='listing_id', right_on='id')
                .drop("id", axis=1)
)

In [None]:
# attach sentiment scores
reviews_final = pd.concat([reviews, reviews['comments'].astype(str).apply(get_sentiment).apply(pd.Series)], axis=1)


## Analyze sentiment scores by host, by neighbourhood, by price range 

In [None]:
vader_scores = pd.read_csv("../../sentiment_scores.csv")

In [None]:
vader_merged = vader_scores.merge(ls, left_on=['listing_id'], right_on=['id'] )

In [None]:
vader_summary = vader_merged[['listing_id', 'neighbourhood', 'date', 'neg', 'neu', 'pos', 'compound', 'price']]

## Using VADER's "normalized" compound score, it looks like all neighbourhoods have good reviews overall. The bottom 10 or so neighbourhoods have relatively lower scores, but they are still well above neutral (the compound score ranges from -1 to 1).

In [None]:
vader_groupby = vader_summary.groupby("neighbourhood")["compound"].describe().sort_values(["50%", "count"], ascending=False).reset_index()
vader_groupby.style.background_gradient(cmap="RdYlGn", subset=["25%", "50%", "75%"])

In [None]:
import plotly.express as px

# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.choropleth_mapbox(vader_groupby, geojson=geojson, color="mean", 
                               locations="neighbourhood", featureidkey="properties.neighbourhood",opacity=0.5, color_continuous_scale=px.colors.diverging.Geyser,
                           center={"lat": 38.9072, "lon": -77.0369},
                           mapbox_style="light", zoom=10)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
py.plot(fig, filename="mean_sentiment", auto_open=False, width=1500, height=1000)


In [None]:
import plotly.express as px

# set token
px.set_mapbox_access_token("pk.eyJ1IjoibGF3cmVuY2VkIiwiYSI6ImNrODFzZnFnNzA0YmczZW9nNWN4aTFvdngifQ.VlB5-L7owXKEXo8JEePk7w")
fig = px.choropleth_mapbox(vader_groupby, geojson=geojson, color="std", 
                               locations="neighbourhood", featureidkey="properties.neighbourhood",opacity=0.5, color_continuous_scale=px.colors.diverging.Geyser,
                           center={"lat": 38.9072, "lon": -77.0369},
                           mapbox_style="light", zoom=10)

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
py.plot(fig, filename="std_sentiment", auto_open=False, width=1500, height=1000)


## Does price range uncover patterns in reviews ?

In [None]:
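# attach the price deciles computed on ls_95 (vader_merged has no "Price Decile" column of its own)
vader_deciles = vader_merged.merge(ls_95[["id", "Price Decile"]].rename(columns={"id": "listing_id"}), on="listing_id")
vader_deciles.groupby("Price Decile")["compound"].describe().reset_index().sort_values("Price Decile").style.background_gradient(cmap="RdYlGn", subset=["mean"])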

## Answer: Price range does not reveal much, apart from the two lowest-priced deciles having relatively lower scores.
