# Section 1 : Business Understanding

Airbnb has published its rental listings in and around Boston, MA, along with the reviews left by guests and the dates on which properties were available. In this blog, we review this data to learn about the neighborhoods: which types of properties are available, what they cost, where the listings are concentrated, how many reviews they receive, and lastly, how guests described the properties after their stay.

## Questions  
We analyze the listings and the reviews datasets in an attempt to answer the following questions:  
### Listings  
What types of properties are available, whether the host rents a room or the whole property, how many properties are available in each neighborhood, and how rental prices compare across neighborhoods.  

**Question 1.** Types of properties and room types  
**Question 2.** Proportion of room types in the properties  
**Question 3.** Rental properties in a neighborhood  
**Question 4.** Rental price  
  
### Reviews  
How many reviews guests leave, which hints at the neighborhoods that are popular among guests; how many guests stayed at multiple properties; which guests left the most reviews; and finally, the sentiments guests used in reviews to describe the neighborhood.  

**Question 1.** How the number of reviews compares among neighborhoods  
**Question 2.** How many reviewers stayed at more than one property  
**Question 3.** Guests leaving the most reviews  
**Question 4.** Sentiments to describe the neighborhood  

In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import matplotlib.ticker as mtick


import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from wordcloud import WordCloud 
from sklearn.feature_extraction.text import TfidfVectorizer


stop_words = stopwords.words("english")

%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Section 2 : Data Understanding

## Gather

In [5]:
listingsdf = pd.read_csv('../input/boston/listings.csv')
calendardf = pd.read_csv('../input/boston/calendar.csv')
reviewsdf  = pd.read_csv('../input/boston/reviews.csv')

## Explore/ Analyze

In [6]:
# shape of the data sets
print('Dataframe shapes:\nListings - {0} \nCalendar - {1} \nReviews  - {2}'.format(listingsdf.shape, calendardf.shape, reviewsdf.shape ))

print('\nnumber of rows\nListings - {0:9,} \nCalendar - {1:9,} \nReviews  - {2:9,}'.format(listingsdf.shape[0], calendardf.shape[0], reviewsdf.shape[0]))

In [7]:
listingsdf.columns

In [8]:
listingsdf.head()

In [9]:
listingsdf['neighbourhood']

In [10]:
reviewsdf.head()

In [11]:
calendardf.head()

In [12]:
listingsdf.info()

In [13]:
reviewsdf.info()

In [14]:
calendardf.info()

In [15]:
print('REVIEWS: first review date = ' , reviewsdf.date.min(), ' , last review date = ', reviewsdf.date.max())
print('CALENDAR: first date = ' , calendardf.date.min(), ' , last date = ', calendardf.date.max())

# Section 3 : Prepare Data

## Analyzing Null Values

In [16]:
def listingdfValues(df=listingsdf, type='null', cutoff=0.01 ):
    if (type == 'null'):
        data = df.isna().sum().to_frame().reset_index()
    else:
        data = df.notna().sum().to_frame().reset_index()

    print('\n')
    data.rename(columns={0:'count'}, inplace=True)
    data['count_percentage'] = data['count'] / df.shape[0]

    data = data[data['count_percentage'] > cutoff]

    plt.figure(figsize=(38,12))
    plt.xticks(rotation=80)
    if (type == 'null') :      
        plt.title('Columns with percentage of null values in Listing dataset', fontsize=20 )
    else:
        plt.title('Columns with percentage of non-null values ', fontsize=20 )

    ax = sns.barplot(x='index', y='count_percentage', data=data )
    sns.despine()
    ax.set_xlabel('Columns in listing dataset', fontsize=20)
    ax.set_ylabel('percentage of Values', fontsize=20)

#     for container in ax.containers:
#         ax.bar_label(container*100, fmt='%.2f')

    for i in range (data.shape[0]):
        count = data.iloc[i]['count_percentage']

        # Refer here for details of the text() - https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.text.html
        plt.text(i, count + 0.02 , '{:0.1f}'.format(count*100), ha = 'center', va='center', rotation=90)

listingdfValues(listingsdf, 'null', 0.01)
listingdfValues(listingsdf, 'notnull', 0.0)

## Cleaning Price column

In [18]:
# the price column is a string with a $ sign and thousands separators; strip both and store the result as a float in a new column
listingsdf['price_cleansed'] = listingsdf.price.replace({r'[\$,]':''}, regex=True).astype(float)

### Outlier Prices

In [19]:
def outlier_prices(df=listingsdf, cutoff=500):
    data = df[df['price_cleansed']>cutoff].groupby('price_cleansed')['id'].count().to_frame()
    display("prices with cutoff of " + str(cutoff) +  " ", data[data['id']>1].T)

outlier_prices(listingsdf, 500)
outlier_prices(listingsdf, 600)
outlier_prices(listingsdf, 700)

### Create reviews_neighborhood Dataframe

In [20]:
reviews_neighborhood_df = reviewsdf.merge(listingsdf['neighbourhood_cleansed'], how='left', left_on=reviewsdf['listing_id'], right_on=listingsdf['id'] )

# Section 4 : Evaluation

## Four questions on the listings dataset, followed by four questions on the reviews dataset

## Question 1 : Type of Properties and Room Types

### **Q1 Visualize**

In [22]:
def propertyTypes(df=listingsdf, byType='property_type'):
    '''
    create a visualization of the property types or room types, and also output the data in the form of a table
    
    PARAMETERS:
    df : Dataframe - default is listingsdf. The method is specific to listingsdf; accepting it as a parameter permits us to pass a modified listingsdf
    byType: default is 'property_type'; allows the same method to produce the visual for property_type as well as room_type
    
    RETURNS:
    N/A
    '''
    
    if (byType == 'property_type'):
        rotation = 75
    else:
        rotation = 0
    
    data = df  # use the dataframe passed by the caller rather than the global listingsdf
    data = data.groupby(byType)['id'].count().reset_index()
    total = data['id'].sum()
    data['percent'] = (data['id']/total * 100).round(2)

    plt.figure(figsize=(12,6))
    
    ax = sns.barplot(x=byType, y='id', data=data)
    
    
    byTypeText = byType.replace("_"," ")
    plt.xticks(rotation=rotation)
    plt.title('count of the ' + byTypeText, fontsize=24)
    ax.set_xlabel(byTypeText, fontsize=16)
    ax.set_ylabel('count of ' + byTypeText, fontsize=16)
#     ax.grid(axis='y', linewidth=.4)

    for container in ax.containers:
        ax.bar_label(container, fontsize=14) #, fmt='%.2f')
        
    ax2=ax.twinx()
    
    ax = sns.barplot(x=byType, y='percent', data=data)
    
    ax.set_ylabel('percentage', fontsize=20)
    ax.grid(which='both', axis='y', linewidth=0.4)
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    

#     ax.figure.legend()
    plt.yticks(fontsize=14)
    ax.tick_params(right=False)
    sns.despine()
    
    display(data.rename(columns={'id':'count'}).T)

propertyTypes(listingsdf, 'property_type')
propertyTypes(listingsdf, 'room_type')

### **Q1 Explanation**

Property types include apartments, bed & breakfasts, condos, townhouses, and houses. Apartments are by far the most common (almost 73%), followed by houses (~16%) and condos (~6%), which together account for 95% of the 3,582 listings. That leaves 177 listings, or 5% of 3,582, for all other property types.

Shared rooms are also an option, but only 80 listings offer them. The remaining listings offer either the entire apartment/house or a private room.
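
The percentages quoted above are shares of the listings that have a property type. As a minimal sketch of that arithmetic, the snippet below uses a small made-up table (the `toy` dataframe and its values are assumptions for illustration, not the Boston data) and lets pandas compute the share of each type.

```python
import pandas as pd

# Hypothetical toy listings; the notebook itself reads listings.csv.
toy = pd.DataFrame({
    "id": range(1, 11),
    "property_type": ["Apartment"] * 7 + ["House"] * 2 + ["Condominium"],
})

# Share of each property type as a percentage of the rows with a value.
shares = (toy["property_type"].value_counts(normalize=True) * 100).round(2)
print(shares)
```

`value_counts` skips missing values by default, so the denominator is the number of listings with a known property type rather than the total row count.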

## Question 2 : Proportion of room types in the properties

### **Q2. Visualize**

In [23]:

data = listingsdf 
# name the count column explicitly so the plot does not depend on the pandas version (older releases call it 0, newer ones 'count')
type_count = data[['property_type', 'room_type']].value_counts().reset_index(name='count').sort_values(['property_type','room_type'])

# ax = plt.subplot()
plt.figure(figsize=(18,5))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax = sns.barplot(x='property_type', y='count', hue='room_type', data=type_count)
sns.despine()
plt.xticks(rotation=75)
plt.title('property types vs room type count of properties', fontsize=20)
plt.legend(loc='upper right',fancybox=True)
ax.set_xlabel('property type', fontsize=16)
ax.set_ylabel('properties count', fontsize=16)
ax.grid(axis='y', linewidth=.35)

for container in ax.containers:
    ax.bar_label(container) #, fmt='%.1f')

# for i in range(type_count.shape[0]):
#     count = type_count.iloc[i][0]
    
#     # Refer here for details of the text() - https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.text.html
# #     plt.text(i, count + 0.05 , '{:0.1f}'.format(count), rotation=90, ha = 'center', va='top')
    
#     plt.text(i, count + 0.05 , '{:0.1f}'.format(count), rotation=90, ha = 'center', va='top')



In [29]:
def nhoodPropertyTypes(df=listingsdf, byType='property_type'):
    '''
    visualize and output the data of property types in neighbourhood.
    Same method is used for creating visuals for both property_type as well as room_type
    
    PARAMETERS:
    df - dataframe - default is listingsdf. A variant of listingsdf can also be passed; however, it needs to have the columns used in this method.
    byType - string - default is 'property_type'; selects whether the bars are split by property_type or room_type
    '''
    
    if (byType == 'property_type'):
        rotation = 75
    else:
        rotation = 0
    
    data = df[df['property_type'].isin(['Apartment', 'Condominium', 'House']) ]  # filter the dataframe passed by the caller
    data = data.groupby(['neighbourhood_cleansed', 'property_type'])['id'].count().reset_index()
    
    total = data['id'].sum()
    data['percent'] = (data['id']/total * 100).round(2)

    plt.figure(figsize=(24, 8))
    ax = sns.barplot(x='neighbourhood_cleansed', y='id', hue=byType, orient="v", data=data)
    sns.despine()
    
    byType = byType.replace("_"," ")
    plt.xticks(rotation=rotation)
    plt.title('count of the ' + byType, fontsize=24)
    ax.set_xlabel('neighborhood', fontsize=16)  # the x axis shows neighborhoods; byType is the hue
    ax.set_ylabel('count of ' + byType, fontsize=16)
    ax.grid(axis='y', linewidth=.4)
    plt.legend(loc='upper right')

    for container in ax.containers:
        ax.bar_label(container, fontsize=10) #, fmt='%.2f')
        
    display(data.rename(columns={'id':'count'}).T)

nhoodPropertyTypes(listingsdf, 'property_type')
# nhoodPropertyTypes(listingsdf, 'room_type')

### **Q2 Explanation**

Most apartment and condo listings offer the entire unit as a rental (67.5%), compared with 26% of house listings. While 30% of apartments (31% of condos) offer a private room, the figure is 72% for houses. A homeowner renting out a private room suggests that the home has a spare room to offer. Apartments and condos generally have at most three bedrooms, so the prevalence of entire-unit rentals suggests that many of these listings have one or two bedrooms.


The green bar in the second visual gives a hint that the neighborhood is a suburb of Boston, such as West Roxbury or Hyde Park, where houses outnumber apartments. On the other hand, the tall blue bars represent the neighborhoods of Back Bay, Beacon Hill, Downtown, Fenway, Mission Hill, North End, South Boston Waterfront, South End, and West End, which consist mostly of apartment buildings.
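
The room-type percentages discussed above are proportions within each property type. One way to compute them is a row-normalized cross-tabulation. The sketch below runs on a small made-up table (the `toy` dataframe is an assumption, not the Boston data).

```python
import pandas as pd

# Hypothetical toy listings with the two columns the comparison needs.
toy = pd.DataFrame({
    "property_type": ["Apartment"] * 4 + ["House"] * 4,
    "room_type": ["Entire home/apt"] * 3 + ["Private room"]
                 + ["Entire home/apt"] + ["Private room"] * 3,
})

# normalize="index" divides each row by its own total, so every
# property type sums to 100% across its room types.
mix = pd.crosstab(toy["property_type"], toy["room_type"], normalize="index") * 100
print(mix.round(1))
```

Normalizing by row answers "of all apartments, what share is rented whole?", which is the comparison the explanation makes; normalizing over the whole table would answer a different question.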

## Question 3 : Rental properties in a neighborhood

### **Q3. Visualize**

In [27]:
def averagePopertiesNhood(df=listingsdf):
    '''
    accepts the dataframe, default is listingdf. 
    Calculate the mean price for each of the neighborhood
    
    PARAMETERS:
    df - dataframe, to use for calculating the average price of the neighbourhood
    '''

    pd.set_option('display.max_columns', 0)

    # average price of properties in the neighborhood
    data2=df.groupby('neighbourhood_cleansed')['price_cleansed'].mean().reset_index()
    # number of properties in the neighborhood
    data3=df.groupby('neighbourhood_cleansed')['id'].count().reset_index()

    # 'average' holds each neighborhood's share of all listings, as a percentage
    data3['average'] = (data3['id']/df.shape[0] * 100).round(2)
    data = data3.merge(data2)
    data['price_cleansed'] = data['price_cleansed'].round(2)
    data.rename(columns={'neighbourhood_cleansed': 'neighborhood','price_cleansed':'average price', 'id':'count'}, inplace=True)

    display(data.T)
    
    plt.figure(figsize=(20,8))
    plt.xticks(fontsize=14, rotation=90)
    plt.yticks(fontsize=14)

    ax = sns.barplot(x='neighborhood', y='count', data=data)
    ax.set_ylabel('Number of properties', fontsize=20)
    ax.set_xlabel('Neighborhoods'       , fontsize=20)

    for container in ax.containers:
        ax.bar_label(container, label_type = 'edge', fontsize=14) #, fmt='%.2f')
#     ax.legend(loc=0)

    ax2=ax.twinx()
    ax = sns.barplot(x='neighborhood', y='average', data=data)
    ax.set_ylabel('percentage of properties', fontsize=20)
    ax.grid(which='both', axis='y', linewidth=0.4)
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())
    

#     ax.figure.legend()
    plt.yticks(fontsize=14)
    ax.tick_params(right=False)
    

    plt.title('Number of rental properties in the neighborhood',fontsize=20)
    sns.despine()
    
averagePopertiesNhood(listingsdf)

### **Q3. Explanation**
A little over half of the neighborhoods, 13 out of 25 (52%), hold over 80% of all listed rentals, i.e., 2,952 out of 3,585 properties. These are the neighborhoods of Jamaica Plain, South End, Back Bay, Fenway, Dorchester, Allston, Beacon Hill, Brighton, South Boston, Downtown, East Boston, Roxbury and North End.

The remaining 12 out of 25 neighborhoods (48%) hold about 20% of the listed rentals, 776 out of 3,585 properties. These neighborhoods are Mission Hill, Charlestown, South Boston Waterfront, Chinatown, Roslindale, West End, West Roxbury, Hyde Park, Bay Village, Mattapan, Longwood Medical Area, and Leather District.
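
The split described above comes from ranking neighborhoods by listing count and accumulating until a target share is reached. A minimal sketch of that calculation follows; the neighborhood names and counts are made up, and the 80% threshold is simply the figure used in the text.

```python
import pandas as pd

# Hypothetical listing counts per neighborhood (not the Boston numbers).
counts = pd.Series({"A": 50, "B": 35, "C": 8, "D": 4, "E": 3})

ranked = counts.sort_values(ascending=False)
running_total = ranked.cumsum()
threshold = 0.8 * ranked.sum()

# Neighborhoods whose running total is still below the threshold, plus
# the one that crosses it, are the ones needed to reach 80%.
n_needed = int((running_total < threshold).sum()) + 1
top = ranked.index[:n_needed].tolist()
print(n_needed, top)
```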

## Question 4 : Rental prices

### **Q4. Visualize**

## Average Price in the neighborhood

In [30]:
def averagePriceNhood(df=listingsdf, cutoff=500):
    '''
    accepts the dataframe, default is listingdf. 
    Calculate the mean price for each of the neighborhood
    
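    PARAMETERS:
    df - dataframe, to use for calculating the average price of the neighbourhood
    cutoff - number, default is 500; listings priced above this value are treated as outliers and excluded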
    '''

    pd.set_option('display.max_columns', 0)
    
    df = df[df['price_cleansed'] <= cutoff]

    # data = listingsdf.groupby(['neighbourhood_cleansed','price_cleansed'])['city'].count().reset_index()
    data = df.sort_values('neighbourhood_cleansed') #.groupby(['neighbourhood_cleansed'])['price_cleansed'].mean().reset_index()

    data2 = data.groupby('neighbourhood_cleansed')
    # fig, ax = plt.subplots()
    plt.figure(figsize=(20,8))
    plt.xticks(fontsize=14, rotation=90)
    plt.yticks(fontsize=14)

#     ax = sns.countplot(x='neighbourhood_cleansed',  data=data)
#     ax.set_ylabel('Number of properties', fontsize=20)
#     ax.set_xlabel('Neighborhoods'       , fontsize=20)
#     for container in ax.containers:
#         ax.bar_label(container, label_type = 'edge', fontsize=14) #, fmt='%.2f')
#     ax.legend(loc=0)

#     ax2=ax.twinx()
    ax = sns.lineplot(x='neighbourhood_cleansed', y='price_cleansed',       data=data)
    ax.set_xlabel('Neighborhoods'       , fontsize=20)
    ax.set_ylabel('average price with confidence interval', fontsize=20)
    
    ax.grid(axis='y', linewidth=0.4)
#     ax.figure.legend()
    plt.yticks(fontsize=14)
    sns.despine()
    

    plt.title('Average rental price in the neighborhood',fontsize=20)

# plt.show()

    display(data.groupby('neighbourhood_cleansed')['price_cleansed'].mean().round(2)
            .to_frame().reset_index().rename(columns={'neighbourhood_cleansed':'neighborhood', 'price_cleansed':'average price'}).T)
    
averagePriceNhood(listingsdf, 600)

In [31]:
listingsdf[listingsdf['neighbourhood_cleansed'] == 'Leather District'] #   ['price_cleansed'].sum() / 56

## price range in each neighborhood

In [32]:
def priceRange(df=listingsdf, cutoff=500):
    df = df[df['price_cleansed'] <= cutoff]
    data1 = df.groupby('neighbourhood_cleansed')['price_cleansed'].min().round(0).to_frame().reset_index()
    data2 = df.groupby('neighbourhood_cleansed')['price_cleansed'].max().round(0).to_frame().reset_index()
    data3 = df.groupby('neighbourhood_cleansed')['price_cleansed'].mean().round(2).to_frame().reset_index()
    data1.rename(columns={'price_cleansed':'min price'}, inplace=True)
    data2.rename(columns={'price_cleansed':'max price'}, inplace=True)
    data3.rename(columns={'price_cleansed':'avg price'}, inplace=True)

    data = data1.merge(data2.merge(data3))
    data.rename(columns={'neighbourhood_cleansed':'neighborhood'}, inplace=True)
    display(data.T)


priceRange(listingsdf, 600)

In [33]:
def priceRange(df=listingsdf, cutoff=500):
    plt.figure(figsize=(20,10))
    data = df[df['price_cleansed'] <= cutoff].sort_values('neighbourhood_cleansed')  # honor the cutoff argument instead of a hard-coded 600
#     data = df.sort_values('neighbourhood_cleansed')

    plt.title('Neighborhood price ranges ', fontsize=20)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    
    sns.despine()
    
    ax = sns.boxplot(y='neighbourhood_cleansed', x='price_cleansed', data=data)
    ax.set_xlabel('Price', fontsize=20)
    ax.set_ylabel('Neighborhoods', fontsize=20)
    ax.grid(axis='x', linewidth=0.45)


priceRange(listingsdf, 600)
# ax = sns.swarmplot(y='neighbourhood', x='price_cleansed', data=d)

### **Q4. Explanation**

The average rental price across the neighborhoods falls between $75 and $260.

There are price outliers in the listings data, as shown in table 5 below: one property is listed at $4,000 and another at $3,000. These could be monthly prices, but we can’t be certain, so we ignore these and all other prices above $600.
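
To make the effect of the cutoff concrete, the sketch below cleans a few made-up price strings, drops everything above $600, and summarizes what remains per neighborhood. The `toy` dataframe is an assumption for illustration; the column names mirror the ones used in the notebook.

```python
import pandas as pd

# Hypothetical listings; prices are strings, as in listings.csv.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Back Bay", "Back Bay", "Roxbury", "Roxbury"],
    "price": ["$250.00", "$4,000.00", "$80.00", "$100.00"],
})

# Strip the dollar sign and thousands separator, then convert to float.
toy["price_cleansed"] = (
    toy["price"].str.replace(r"[\$,]", "", regex=True).astype(float)
)

cutoff = 600  # same threshold as the text; an analyst's choice, not a rule
kept = toy[toy["price_cleansed"] <= cutoff]

summary = kept.groupby("neighbourhood_cleansed")["price_cleansed"].agg(
    ["min", "max", "mean"]
)
print(summary)
```

Without the filter, the single $4,000 listing would pull the Back Bay mean in this toy table up to $2,125, which is why the averages are reported only after the cutoff is applied.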

# Reviewers - Four questions on reviewers dataset

## Question 1 : How the number of reviews compares among neighborhoods

### **Q1. Visualize**

## Reviews received by neighborhoods


In [34]:
# reviews_neighborhood_df = reviewsdf.merge(listingsdf['neighbourhood_cleansed'], how='left', left_on=reviewsdf['listing_id'], right_on=listingsdf['id'] )

In [35]:
reviews_neighborhood_df.head()

In [36]:
def nhoodReviews(df=reviews_neighborhood_df):
    
    # data = reviews_neighborhood_df.groupby('neighbourhood_cleansed')['reviewer_id'].count()
    plt.figure(figsize=(16,10))
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    
    df.sort_values('neighbourhood_cleansed', inplace=True)
    plt.title('Neighborhood count of reviews', fontsize=20)
    ax = sns.countplot(y='neighbourhood_cleansed' , data=df)
    sns.despine()
    ax.set_xlabel('Number of Reviews', fontsize=20)
    ax.set_ylabel('Neighborhoods', fontsize=20)
    ax.grid(axis='x', linewidth=.4)
    
    for container in ax.containers:
        ax.bar_label(container,fontsize=12) #, fmt='%.2f')
        
    data1 = df.groupby('neighbourhood_cleansed')['id'].count()

nhoodReviews(reviews_neighborhood_df)

In [37]:
data1 = reviews_neighborhood_df.groupby('neighbourhood_cleansed')['id'].count().to_frame()
data2 = reviews_neighborhood_df.groupby('neighbourhood_cleansed')
data1.T

### **Q1. Explanation**

Jamaica Plain has the most reviews and also has the most properties listed for rental (over 9%).
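
Counting reviews per neighborhood requires attaching each review to its listing's neighborhood first. The sketch below shows one way to do that join on a named key column; the two small dataframes are made up for illustration, and renaming `id` to `listing_id` in the lookup table is a choice made here so that the reviews' own `id` column survives the merge unchanged.

```python
import pandas as pd

# Hypothetical listings and reviews (not the Boston data).
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_cleansed": ["Jamaica Plain", "Jamaica Plain", "Downtown"],
})
reviews = pd.DataFrame({"id": [10, 11, 12, 13], "listing_id": [1, 1, 2, 3]})

# Lookup table keyed by listing_id, so the merge key has one name on both sides.
lookup = listings[["id", "neighbourhood_cleansed"]].rename(columns={"id": "listing_id"})
merged = reviews.merge(lookup, how="left", on="listing_id")

per_nhood = (
    merged.groupby("neighbourhood_cleansed")["id"].count().sort_values(ascending=False)
)
print(per_nhood)
```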

## Question 2 : How many reviewers stayed at more than one property

### Q2. Visualize

In [38]:
def stayMoreThanOneProperty(df=reviewsdf, min_property_stay=3):
    '''
    find the reviewers that have stayed at more than a single property.
    reviewsdf has the reviewer id and the property where the reviewer stayed and left feedback. An individual may have
    stayed at the same property multiple times and left a review each time.
    Here we want to capture the unique properties that a reviewer has stayed at.
    
    PARAMETERS:
    df : the reviewsdf dataframe, used as the default when no other df is passed
    min_property_stay: default is 3; keeps only the reviewers that have stayed at no fewer than this number of different properties
    
    RETURN:
    none
    '''

#     min_property_stay = 

    # group by reviewer and listing-id, to get unique reviewer-id, and listing-id pair
    data  = df.groupby(['reviewer_id','listing_id']).count().reset_index()

    # count the unique listing, reviewer stayed at
    data2 = data.groupby(['reviewer_id'])['listing_id'].count().to_frame().reset_index()

    # filter the result set
    data3 = data2[data2['listing_id'] >= min_property_stay]

    plt.figure(figsize=(9,4))
    plt.title('reviewers with atleast ' + str(min_property_stay) + ' different property stays' , fontsize=16)
    ax = sns.countplot(x='listing_id' , data=data3)
    ax.grid(axis='y', linewidth=.35)

    sns.despine()
    ax.set_xlabel('count of unique property stays', fontsize=14)
    ax.set_ylabel('number of reviewers', fontsize=14)
    for container in ax.containers:
        ax.bar_label(container, fontsize=12) #, fmt='%.2f')

stayMoreThanOneProperty(reviewsdf, 1)
print('\n')
stayMoreThanOneProperty(reviewsdf, 3)

### **Q2. Explanation**

In analyzing the reviews dataset, we find (Fig. 10) a sizable group of 300 individuals who stayed at three different properties and left reviews. We also find two guests who stayed at 10 and 15 different properties, respectively, and left reviews. Reviewers with a stay at a single property far outnumber the rest, at 60,992 out of 68,275 reviewers. They are followed by reviewers who stayed in at least two different properties, at 2,352 out of 68,275.
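
The counts above depend on treating repeat reviews of the same property as a single stay. A compact way to express that is `nunique` per reviewer; the sketch below uses made-up reviewer and listing ids, and it is an alternative formulation rather than the two-step `groupby` used in the function.

```python
import pandas as pd

# Hypothetical reviews: reviewer 1 reviewed listing 100 twice.
reviews = pd.DataFrame({
    "reviewer_id": [1, 1, 1, 2, 2, 3],
    "listing_id": [100, 100, 101, 100, 102, 103],
})

# Number of distinct listings each reviewer has reviewed.
unique_stays = reviews.groupby("reviewer_id")["listing_id"].nunique()

# How many reviewers fall into each "number of distinct properties" bucket.
distribution = unique_stays.value_counts().sort_index()
print(distribution)
```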

## Question 3 : How many guests left the most reviews?

### Q3. Visualize

In [39]:
def mostUniquePropertiesReviews(df=reviewsdf, min_properties=7):
    
    min_reviews = min_properties  # use the caller's threshold instead of a hard-coded 7
    # group by reviewer and listing-id, to get unique reviewer-id, and listing-id pair
    data  = df.groupby(['reviewer_id','listing_id']).count().reset_index()

    data2 = data.groupby(['reviewer_id'])['listing_id'].count().to_frame().reset_index()
    data2 = data2.merge(reviewsdf['reviewer_name'], how='left', left_on=data2['reviewer_id'], right_on=reviewsdf['reviewer_id'])
    data3 = data2[data2['listing_id'] >= min_reviews].sort_values('reviewer_name')

    plt.figure(figsize=(20,8))
    plt.xticks(fontsize=14, rotation=90)
    plt.yticks(fontsize=14)
    
    plt.title('reviewers with atleast ' + str(min_reviews) + ' unique property stays', fontsize=22)
    
    ax = sns.barplot(x='reviewer_name', y='listing_id', data=data3)
    ax.set_xlabel('Reviewers', fontsize=20)
    ax.set_ylabel('Number of unique properties stay', fontsize=20)
    ax.grid(axis='y', linewidth=0.35)
    sns.despine()
    for container in ax.containers:
        ax.bar_label(container, fontsize=14) #, fmt='%.2f')

mostUniquePropertiesReviews(reviewsdf, 7)

In [40]:
reviewsdf[reviewsdf['reviewer_id'] == 18607361].sort_values(['listing_id','date'])

In [41]:
reviews_neighborhood_df = reviewsdf.merge(listingsdf['neighbourhood_cleansed'], how='left', left_on=reviewsdf['listing_id'], right_on=listingsdf['id'] )
reviews_neighborhood_df.head()

In [None]:
# len(data3.reviewer_name.unique())

In [None]:
reviewsdf['reviewer_id'].value_counts()

In [42]:
neighborhoods = reviews_neighborhood_df['neighbourhood_cleansed'].unique()
neighborhoods

In [43]:
reviews_neighborhood_df[reviews_neighborhood_df['neighbourhood_cleansed'] == 'Downtown'].shape

### **Q3. Explanation**

There were 17 guests who stayed in at least 7 different properties (Fig. 11), with 15 properties being the most rented by a guest, “Frank (and Meredith)”.

## Question 4 : How did the guests describe the neighborhood?

### Q4. Visualize

In [44]:
# from tqdm import tqdm
# def review

vectorizer = TfidfVectorizer(ngram_range=(1,2))
lemmatizer = WordNetLemmatizer()
count = 0
for neighborhood in neighborhoods:
    print('\n\n')
    count += 1
    corpus = ''
    nhood = reviews_neighborhood_df[reviews_neighborhood_df['neighbourhood_cleansed'] == neighborhood]
    
#     for rec in tqdm(range(nhood.shape[0])):
    for rec in range(nhood.shape[0]):

    #     list_id  = reviewsdf.listing_id[rec]
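        # read the comment from this neighborhood's rows, not from the first rows of reviewsdf
        comments = str(nhood.comments.iloc[rec])
        # str.replace matches its pattern literally, so remove the dollar sign and the comma one at a time
        comments = comments.replace('$', ' ').replace(',', ' ')
        # keep a space between reviews so the last word of one does not fuse with the first word of the next
        corpus += comments + ' '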


    words = word_tokenize(corpus)
    tokens = [lemmatizer.lemmatize(word).lower().strip() for word in words if word.lower() not in stop_words]
#     tokens = [lemmatizer.lemmatize(word).lower().strip() for word in words ]
    
    corpus = " ".join(tokens)

    if (len(corpus) > 0):
        vectors = vectorizer.fit_transform([corpus])
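        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
        names = vectorizer.get_feature_names_out()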

        data = vectors.todense().tolist()# Create a dataframe with the results

        df = pd.DataFrame(data, columns=names)

        wordcloud = WordCloud(background_color="white", max_words=100).generate_from_frequencies(df.T.sum(axis=1))
        plt.figure(figsize=(14,14))

        plt.imshow(wordcloud)
        plt.axis('off')
        plt.title('Neighborhood - ' + neighborhood, fontsize=24)
        plt.show()
        
    if (count > 2):
        break
        
print('done!')
# print(word_bag)

### **Q4. Explanation**

Analyzing the comments for the most popular properties shows positive words and phrases like “great,” “would definitely,” “great host,” “helpful,” “great location,” and “comfortable.”
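
The phrases above are the most heavily weighted single words and word pairs in each neighborhood's corpus. Because each neighborhood is vectorized as a single document, the TF-IDF weights (with scikit-learn's default settings) are proportional to plain term counts, so the ranking can be illustrated with a simple counter. The sketch below swaps in `collections.Counter` for `TfidfVectorizer`; the comments and the tiny stop list are made up for illustration.

```python
from collections import Counter

# Hypothetical review comments (the notebook reads them from reviews.csv).
comments = [
    "Great location and a great host",
    "The host was helpful and the location was great",
]
# Tiny illustrative stop list; the notebook uses NLTK's English stop words.
stop_words = {"the", "and", "a", "was"}

counts = Counter()
for text in comments:
    tokens = [w for w in text.lower().split() if w not in stop_words]
    counts.update(tokens)                                         # single words
    counts.update(" ".join(p) for p in zip(tokens, tokens[1:]))   # word pairs

print(counts.most_common(5))
```

Note that word pairs are formed after stop words are removed, so a pair such as "location great" can appear even though the words were not adjacent in the original sentence; the notebook's pipeline behaves the same way.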

In [None]:
# stop_words