# Project Airbnb Analysis v1

### 1) Business understanding


##### The main problem is helping a persona (a fictional traveler) choose a home for their next vacation based on their preferences.


### 2) Data understanding

##### The analysis is based on data downloaded from http://insideairbnb.com/get-the-data.html

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk import FreqDist
from nltk.util import ngrams
import nltk
from nltk.stem import WordNetLemmatizer 

In [None]:
# Loading all three datasets

calendar = pd.read_csv('calendar.csv', sep=',')
listings = pd.read_csv('listings.csv', sep=',')
reviews = pd.read_csv('reviews.csv', sep=',')

In [None]:
# Checking the head of each dataframe

calendar.head(2)

In [None]:
listings.head(2)

In [None]:
reviews.head(2)

##### Calendar is a dataframe with the asking prices (price and adjusted_price) of the places offered on the Airbnb website for each date, along with other variables such as the minimum and maximum number of nights permitted.
##### Listings is a dataframe with every place registered on the Airbnb website and its descriptive fields, such as a picture, reviews per month, neighborhood overview, etc.
##### Reviews is a dataframe with all the reviews written by guests who have already stayed in those places.

### 3) Preparing data

##### NaN values, if found, should be replaced by the average price of the neighborhood, since the analysis is based on average prices.
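
##### A minimal sketch of that imputation, assuming a dataframe shaped like the merged `calendar` built in the cells below (a float `adjusted_price` and a `neighbourhood_cleansed` column). The toy data and the helper name `fill_price_by_neighbourhood` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_price_by_neighbourhood(df, price_col='adjusted_price',
                                group_col='neighbourhood_cleansed'):
    """Return a copy of df where missing prices are replaced by the group's mean price."""
    out = df.copy()
    # transform('mean') returns one value per row, aligned with the original index;
    # NaN prices are ignored when the mean is computed.
    group_mean = out.groupby(group_col)[price_col].transform('mean')
    out[price_col] = out[price_col].fillna(group_mean)
    return out

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['Fremont', 'Fremont', 'Fremont', 'Ravenna', 'Ravenna'],
    'adjusted_price': [100.0, np.nan, 200.0, 80.0, np.nan],
})

filled = fill_price_by_neighbourhood(toy)
print(filled)
```

##### A neighborhood whose prices are all missing would stay NaN with this approach, so it is worth re-checking the NaN share afterwards.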

In [None]:
# Converting adjusted price to a workable float.
# regex=False makes '$' a literal character instead of a regular-expression anchor.

calendar['adjusted_price'] = calendar['adjusted_price'].str.replace('$', '', regex=False)
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(',', '', regex=False)
calendar['adjusted_price'] = calendar['adjusted_price'].astype('float')

In [None]:
# Renaming the calendar columns and adding the neighborhood of each listed house.

calendar.columns = ['id', 'date', 'available', 'price', 'adjusted_price', 'min_nights', 'max_nights']

calendar = pd.merge(calendar, listings[['id', 
                                        'neighbourhood_cleansed', 
                                        'neighbourhood_group_cleansed']], on='id')
calendar.head(2)

In [None]:
# Reporting the share of NaN values per column (in %) in a dataframe

na_df = (100 * calendar.isna().mean()).to_frame('NaN values %')


na_df

In [None]:
reviews.head(2)

In [None]:
# Merging neighborhood and reviews into a small dataframe for the NLTK analysis

small_listings = pd.merge(reviews, listings[['id', 'neighbourhood_cleansed', 'description']], left_on='listing_id', right_on='id').drop(['id_x', 'id_y'], axis=1)
small_listings = small_listings[['neighbourhood_cleansed', 'comments', 'description']]
# dropna returns a new dataframe, so the result has to be assigned back
small_listings = small_listings.dropna(subset=['neighbourhood_cleansed', 'comments'], axis=0)
small_listings.head()

In [None]:
# Selecting stopwords and including some stopwords in the list.

my_stopwords = stopwords.words('english')
my_stopwords.extend(['br', '', 'The', 'for'])

In [None]:
# Counting reviews per neighborhood and selecting the top 20
# The more reviews, the more reliable the analysis.

listings_top_20 = small_listings.groupby('neighbourhood_cleansed').agg({'comments': 'count'}).sort_values(by='comments', 
                                                                                        ascending=False).head(20)
listings_top_20

In [None]:
%%time
# Selecting "adjective + noun" bigrams from the reviews of each neighborhood.
# The original run took several hours (in my case 6 hours),
# which is why I chose to save the result in a pickle file.
# Note: %%time is a cell magic and must be the first line of the cell.

lemmatizer = WordNetLemmatizer()

neigh_list = list(listings_top_20.index)
all_bigrams = {}
for neigh in neigh_list:
    bigram = []
    for sentence in small_listings[small_listings['neighbourhood_cleansed'] == neigh].comments:
        if not isinstance(sentence, str):
            continue  # skip missing (NaN) comments
        # Tag each comment once instead of re-tagging it for every token
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
            if tag == 'JJ' and next_tag in ['NN', 'NNS']:
                adjective = lemmatizer.lemmatize(word, pos='a')
                noun = lemmatizer.lemmatize(next_word, pos='n')
                bigram.append((adjective.lower(), noun.lower()))
    all_bigrams[str(neigh)] = bigram
    print('Updated new neighborhood: {}'.format(neigh))
    # Build a file-system friendly name for the per-neighborhood CSV
    file_neigh = neigh.replace(' ', '_').replace('-', '').replace('/', '')
    pd.DataFrame({str(neigh): bigram}).to_csv('bigram_' + file_neigh + '.csv', sep=',')

In [None]:
# Saving the bigram dict in pickle

with open("reviews_bigrams_top20.pkl", "wb") as file_save:
    pickle.dump(all_bigrams, file_save)

In [None]:
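# Reading pickle dict
# Note: this loads reviews_bigrams.pkl, a different file name from the one saved in the previous cell.

with open("reviews_bigrams.pkl", "rb") as reviews_bigrams:
    all_bigrams = pickle.load(reviews_bigrams)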

In [None]:
# Function to get the top bigrams

from collections import Counter

def get_top_bigrams(word, how_many, neigh):
    '''
    Returns a sorted dataframe with bigrams in the index and the neighborhood
    as the single column, showing the frequency (in %) of each bigram:

        word = the word we want to search for in the collected bigrams,
        how_many = how many bigrams to return, by descending frequency,
        neigh = the neighborhood (a key of all_bigrams) we want to analyze
    '''

    neigh_bigrams = all_bigrams[neigh]
    # Count every bigram once instead of calling list.count inside a loop
    counts = Counter(neigh_bigrams)

    # Frequency is relative to all bigrams of the neighborhood, not only those containing the word
    best_bigrams = {bigram: round(100 * count / len(neigh_bigrams), 2)
                    for bigram, count in counts.items() if word in bigram}

    best_df = pd.DataFrame(data=list(best_bigrams.values()), index=list(best_bigrams.keys()), columns=[str(neigh)])

    return best_df.sort_values(by=str(neigh), ascending=False).head(how_many)

In [None]:
# Function to get the most frequent bigrams containing a word across a list of neighborhoods.

def neigh_by_noun(word, neighlist):
    '''
    Returns a dataframe with the 5 most frequent bigrams containing `word`,
    with one column per neighborhood in the list.

    word = the word in a bigram we want to search for
    neighlist = list of neighborhoods to include as columns in the returned dataframe
    '''

    df = pd.DataFrame()

    for neigh in neighlist:
        try:
            df = pd.concat([df, get_top_bigrams(word, 5, neigh)], axis=1).fillna(0)
        except (KeyError, TypeError):
            # Skip neighborhoods that are missing from all_bigrams
            continue

    return df.nlargest(5, list(df.columns), keep='first')
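
##### To make the percentages returned by these functions concrete, here is a small worked example on a hand-made bigram list. The list `toy_bigrams` and the helper `bigram_share` are hypothetical and only mirror the calculation: the share of a bigram is its count divided by the number of all bigrams collected for the neighborhood, not only those containing the searched word.

```python
from collections import Counter

# Hypothetical bigrams for one neighborhood, in the (adjective, noun) form used above
toy_bigrams = [('quiet', 'neighborhood'), ('great', 'location'),
               ('quiet', 'neighborhood'), ('nice', 'neighborhood'),
               ('great', 'host')]

def bigram_share(bigrams, word):
    """Percentage of all bigrams taken by each bigram that contains `word`."""
    counts = Counter(bigrams)
    total = len(bigrams)
    shares = {bg: round(100 * n / total, 2) for bg, n in counts.items() if word in bg}
    # Sort by descending share
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))

shares = bigram_share(toy_bigrams, 'neighborhood')
print(shares)
# {('quiet', 'neighborhood'): 40.0, ('nice', 'neighborhood'): 20.0}
```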

In [None]:
# This is the neighborhood list I chose for this problem

neighlist = ['Atlantic', 
'South Lake Union', 
'Eastlake', 
'Stevens', 
'Green Lake', 
'Wallingford', 
'Fremont', 
'Loyal Heights', 
'North Delridge', 
'Mount Baker', 
'North Beacon Hill', 
'Columbia City', 
'Brighton', 
'Adams', 
'Belltown', 
'Broadway', 
'Interbay', 
'Minor', 
'Seward Park', 
'Pioneer Square', 
'Ravenna', 
'Leschi', 
'University District', 
'Greenwood', 
'Fairmount Park', 
'Mid-Beacon Hill', 
'Roosevelt', 
'Yesler Terrace', 
'North College Park']

### 4) Data visualisation

##### 4.1) How do average prices across Seattle behave?

In [None]:
# Plotting raw ask prices

# Select the numeric column before averaging so non-numeric columns are not passed to mean()
calendar.groupby('date')['adjusted_price'].mean().plot(figsize=(16,5), linewidth=2, color='tab:blue')
plt.xlabel('')
plt.xticks(fontsize=14)
plt.ylabel('Price US$/day', fontsize=14)
plt.yticks(fontsize=14)

##### The average price clearly oscillates in the short term.

##### 4.2) What about the behavior of prices over the longer term?
##### With a seven-day moving average the short-term spikes vanish, which indicates a weekly seasonality.
##### The smoothed price spikes around Thanksgiving and Christmas, and the highest values are found in summer.

In [None]:
# Plotting seven days moving average in price

hist_prices = calendar.groupby('date').agg({'adjusted_price': 'mean'}).rolling(window = 7).agg({'adjusted_price': ('mean', 'std')})

hist_prices['adjusted_price']['mean'].plot(figsize=(18,6), linewidth=2, color='tab:blue')
plt.xlabel('')
plt.xticks(fontsize=14)
plt.ylabel('Price US$', fontsize=14)
plt.yticks(fontsize=14)
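
##### A complementary check for weekly seasonality, sketched on synthetic data: averaging by day of the week makes a weekend premium explicit, and a seven-day rolling mean flattens it because every window contains each weekday exactly once. The dates and prices below are made up; on the real data the same grouping would be applied to the daily mean of `calendar['adjusted_price']` after converting `date` with `pd.to_datetime`.

```python
import numpy as np
import pandas as pd

# Synthetic daily average prices: a base price plus a weekend premium (illustrative numbers)
dates = pd.date_range('2021-01-04', periods=28, freq='D')   # four full weeks, starting on a Monday
is_weekend_night = dates.dayofweek.isin([4, 5])              # Friday and Saturday nights
daily = pd.Series(np.where(is_weekend_night, 120.0, 100.0),
                  index=dates, name='adjusted_price')

# Average price per day of the week: the premium shows up on Friday and Saturday
by_weekday = daily.groupby(daily.index.day_name()).mean()
print(by_weekday)

# Seven-day moving average: the weekly pattern disappears
smoothed = daily.rolling(window=7).mean().dropna()
print(smoothed.head())
```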

##### 4.3) How are average prices spread geographically? What are the most and least expensive neighborhoods?

In [None]:
# Filtering neighborhoods with more than 30 listings and plotting the average price over the whole dataset

count_neigh = listings['neighbourhood_cleansed'].value_counts()
min_count = 30

filtered = list(count_neigh[count_neigh > min_count].index)

cal_filt = calendar['neighbourhood_cleansed'].isin(filtered)

neigh_prices = calendar[cal_filt].groupby('neighbourhood_cleansed').agg({'adjusted_price':
                                                                'mean'}).sort_values(by='adjusted_price', 
                                                                                     ascending=False)



barplot = neigh_prices.plot(kind='bar', figsize=(19,6), color='goldenrod')

# Highlight the most expensive neighborhoods in red and the least expensive in green
highlights = {'r': ['Southeast Magnolia', 'West Queen Anne', 'Central Business District'],
              'g': ['North College Park', 'Roosevelt', 'Yesler Terrace', 'Mid-Beacon Hill', 'Fairmount Park']}
for color, names in highlights.items():
    for position in neigh_prices.index.get_indexer(names):
        barplot.patches[position].set_facecolor(color)


fontsize=16

plt.xticks(fontsize=fontsize);
plt.xlabel('Neighbourhood', fontsize=fontsize)
plt.yticks(fontsize=fontsize);
plt.ylabel('Price US$/day', fontsize=fontsize)
plt.legend('')

##### The most expensive: Southeast Magnolia, West Queen Anne and Central Business District.
##### The least expensive: North College Park, Roosevelt, Yesler Terrace, Mid-Beacon Hill and Fairmount Park.

##### 4.4) How can the reviews help us to choose a neighborhood and book our stay? 

In [None]:
# Visualizing the bigram by neighborhood in a pandas dataframe

occurrences_dt = neigh_by_noun('neighborhood', neighlist)
occurrences_dt

In [None]:
# In bigram_wanted we can change the words to filter whatever we want.

bigram_wanted = ('quiet', 'neighborhood')

occurrences_dt.loc[bigram_wanted].sort_values(ascending=False).plot(kind='bar', figsize=(18,6))

plt.xticks(fontsize=fontsize);
plt.xlabel('Neighbourhood', fontsize=fontsize)
plt.yticks(fontsize=fontsize);
plt.ylabel('% of '+str(bigram_wanted), fontsize=fontsize)
plt.legend('')

### 5) Explaining key insights

##### Considering each step in this notebook:
##### - Asking prices rise as the weekend approaches, so there is a weekly seasonality in this dataset.
##### - Prices are higher around certain dates such as Thanksgiving and Christmas, fall at the beginning of the year and peak in summer.
##### - We can check the prices of the houses, but since we don't have the area of each place, we can't normalize the price per neighborhood as dollars per square foot.
##### - Average prices per neighborhood vary from less than 80 dollars to over 250 dollars per night.
##### - According to the reviews, the neighborhood most often described as "quiet" is Loyal Heights, followed by Ravenna.