<p  style="text-align: center;"><font size="12"><b>NEW YORK CITY AIRBNB DATA</b></font></p>

<img src="https://www.esquireme.com/public/images/2019/11/03/airbnb-678x381.jpg" alt="NYC">  

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to find a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

A thorough analysis of this data will provide valuable insights that Airbnb may use for marketing and advertising initiatives or any number of other business decisions. It may also be of interest to potential clients, investors, or people with a general interest.

Containing about 49,000 rows, the dataset is divided into 16 columns containing both numeric and categorical values.

<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Table of Contents</h3>

* <a href="#1">I. LOAD LIBRARIES AND PACKAGES</a>  
* <a href="#2">II. DATA OVERVIEW AND INSIGHTS</a>  
* <a href="#3">III. MISSING VALUES</a>  
* <a href="#4">IV. OUTLIERS</a>  
* <a href="#5">V. FEATURE ENGINEERING, PART 1</a>  
* <a href="#6">VI. EXPLORATORY DATA ANALYSIS</a>  
    * <a href="#6a">VIa. UNIVARIATE ANALYSIS</a>
    * <a href="#6b">VIb. BIVARIATE ANALYSIS</a>
    * <a href="#6c">VIc. WORD ANALYSIS</a>
    * <a href="#6d">VId. MAP VISUALIZATIONS</a>  
* <a href="#7">VII. MODEL DEVELOPMENT</a>

# <a id='1'>I. LOAD LIBRARIES AND PACKAGES</a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

from subprocess import check_output
from wordcloud import WordCloud, STOPWORDS

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image, HTML

# transforming a JSON file into a pandas DataFrame
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <a id='2'>II. DATA OVERVIEW & INSIGHTS</a>

In [None]:
df = pd.read_csv('../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv')
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print(df['room_type'].value_counts())
print('')
print(df['neighbourhood_group'].value_counts())
print('')
print(df['neighbourhood'].value_counts())

    

# <a id='3'>III. MISSING VALUES</a>

In [None]:
missing = df.notnull()
for col in missing.columns:
    print(col)
    print(missing[col].value_counts())
    print('')
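
For a more compact view of the same information, the missing values can be counted per column in one call. The snippet below is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up)
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "number_of_reviews": [0, 3, 0],
    "reviews_per_month": [np.nan, 0.5, np.nan],
})

# Number of missing values in each column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

Running the same two calls on `df` would give one line per column instead of a block of True/False counts.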

The "reviews_per_month" column has a lot of missing values. This is most likely because many listings have received no reviews at all, and the value is recorded as NaN instead of 0. Let's correct this by changing NaN to 0 wherever 'number_of_reviews' is 0.

In [None]:
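# Set the rate to 0 only where a listing has no reviews and the rate is missing
no_reviews = (df['number_of_reviews'] == 0) & df['reviews_per_month'].isna()
df.loc[no_reviews, 'reviews_per_month'] = 0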
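
To check that a conditional fill touches only the intended rows, it can help to try it on a tiny example first. This is a sketch on invented data (`toy` is hypothetical); the last row deliberately has reviews but a missing rate, so it should stay missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: two listings without reviews, one normal, one with a genuine gap
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 0, 2],
    "reviews_per_month": [np.nan, 0.4, np.nan, np.nan],
})

# Fill with 0 only where there are no reviews AND the rate is missing
no_reviews = (toy["number_of_reviews"] == 0) & toy["reviews_per_month"].isna()
toy.loc[no_reviews, "reviews_per_month"] = 0

# Any gap that is not explained by "no reviews" is left for separate handling
remaining_gaps = int(toy["reviews_per_month"].isna().sum())
print(toy)
print("Gaps left:", remaining_gaps)
```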

# <a id='4'>IV. OUTLIERS</a>

Let's take a look at our target variable, 'price', and some of our key feature variables and plot the outliers. 

In [None]:
fig = px.box(df,x="price")
fig.update_layout(width=1000, height=300)
fig.show()

In [None]:
fig = px.box(df,x="minimum_nights")
fig.update_layout(width=1000, height=300)
fig.show()

In [None]:
fig = px.box(df,x="number_of_reviews")
fig.update_layout(width=1000, height=300)
fig.show()

In [None]:
fig = px.box(df,x="reviews_per_month")
fig.update_layout(width=1000, height=300)
fig.show()

We need to drop some of the rows that contain extreme outliers to bring our dataset into a more manageable range of values. This will help us later on when it's time to do predictive modeling.
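
One common convention for deciding what counts as an extreme value is the 1.5 × IQR rule often used for box-plot whiskers. The sketch below shows that rule on made-up prices; the helper `upper_fence` is hypothetical, and this is not necessarily how the cutoffs used in this notebook were chosen.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper limit beyond which values count as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Made-up nightly prices with one extreme listing
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 5000])

fence = upper_fence(prices)
trimmed = prices[prices < fence]  # keep only values below the fence
print(fence, len(trimmed))
```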

# <a id='5'>V. FEATURE ENGINEERING, PART 1</a>

It's a minor issue, but let's remove the 'u' and change 'neighbourhood' to 'neighborhood'. Also, in NYC neighborhood groups are called boroughs, so let's make that adjustment. There will be some additional feature engineering needed before we develop any predictive models, so we'll label this section part 1.

In [None]:
df.rename(columns={'neighbourhood_group':'borough',
                   'neighbourhood':'neighborhood'}, inplace=True)

In [None]:
df.drop(df[df['price'] >= 400].index, inplace = True) 
df.drop(df[df['minimum_nights'] >= 12].index, inplace = True) 
df.drop(df[df['number_of_reviews'] >= 69].index, inplace = True) 
df.drop(df[df['reviews_per_month'] >= 4.64].index, inplace = True) 

In [None]:
print('Min Price: ', df['price'].min(), '| Max Price: ', df['price'].max())

In [None]:
# CREATE A NEW COLUMN BY CATEGORIZING THE PRICES INTO FOUR DISTINCT GROUPS

# pd.cut with right=False bins prices into [0, 100), [100, 200), [200, 300), [300, 400)
df['price_range'] = pd.cut(df['price'],
                           bins=[0, 100, 200, 300, 400],
                           right=False,
                           labels=['0 to 99', '100 to 199', '200 to 299', '300 to 399']).astype(object)

# REORDER COLUMNS
cols = ['id', 'name', 'host_id', 'host_name', 'borough', 'neighborhood',
       'latitude', 'longitude', 'room_type', 'price', 'price_range',
       'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
        'calculated_host_listings_count', 'availability_365']  
df = df[cols]

In [None]:
df.shape

After dropping many of the outlier rows, our dataset has been reduced by about 14,000 rows. The trade-off should be worth it, since a narrower range of values should make our models more accurate. Besides, how many people are really going to rent a place for $10,000?
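
A quick way to quantify a reduction like this is to compare row counts before and after applying the filters. The following is a sketch on invented rows (`toy` is hypothetical), using a single boolean mask with two of the same cutoffs.

```python
import pandas as pd

# Made-up listings: three of the five break at least one cutoff
toy = pd.DataFrame({
    "price": [80, 150, 450, 10000, 120],
    "minimum_nights": [2, 30, 1, 3, 5],
})

rows_before = len(toy)

# Keep only rows that pass every condition
keep = (toy["price"] < 400) & (toy["minimum_nights"] < 12)
filtered = toy[keep]

rows_removed = rows_before - len(filtered)
print(f"Removed {rows_removed} of {rows_before} rows")
```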



<a id="6"><p  style="text-align: center;"><font size="6"><b>VI. EXPLORATORY DATA ANALYSIS</b></font></p></a>

In [None]:
plt.figure(figsize=(10,10))
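# correlate numeric columns only; newer pandas versions raise an error on text columns
sns.heatmap(df.select_dtypes(include='number').corr(), cbar = True,  square = True, annot=True, cmap= 'YlGnBu')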
plt.title('VARIABLE CORRELATION MATRIX')

## <a id="6a">VIa. UNIVARIATE ANALYSIS</a>

In [None]:

plt.figure(figsize=(16, 4))
sns.set_style("darkgrid")
ax = sns.boxplot(x="price", data=df, showmeans=True, palette='rocket')
plt.title('PRICE')
plt.show()

### BOROUGH COUNT & PERCENTAGE

In [None]:
#CREATE A DATAFRAME WITH A COUNT OF EACH BOROUGH
boroughs = df.groupby(['borough'])[['id']].count()
boroughs.reset_index(inplace=True)
boroughs.rename(columns={'id':'count'}, inplace=True)
boroughs.sort_values(by='count', ascending=False, inplace=True)

#CREATE BAR CHART AND PIE CHART FOR BOROUGH VALUES
plt.style.use('fivethirtyeight')

plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
sns.barplot(x='borough',y='count', data=boroughs, palette='viridis')


plt.subplot(1,2,2)
plt.pie(boroughs['count'], explode=[0.05,0.05,0,0.05,0.25], labels=boroughs['borough'], shadow=True, startangle=90)

plt.show()

### ROOM TYPES

In [None]:
room_type = df.groupby(['borough', 'room_type'])[['id']].count()
room_type.reset_index(inplace=True)
room_type.rename(columns={'id':'count'}, inplace=True)
room_type = room_type.sort_values(by='count', ascending=False)[:10]

room_totals = room_type.groupby(['room_type'])[['count']].sum()

plt.style.use('fivethirtyeight')
plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
sns.barplot(x='count',y='room_type', hue='borough', data=room_type, palette='rocket')
plt.legend(loc='lower right')
plt.title('ROOM TYPE COUNT')

plt.subplot(1,2,2)
plt.pie(room_totals['count'], explode=[0.05,0.05,0], labels=room_totals.index, shadow=True, startangle=90)

plt.title('ROOM TYPE PERCENTAGE')
plt.show()

Manhattan has the most entire home/apt listings, while Brooklyn has the most private room listings. Staten Island doesn't show up in the visualization at all, most likely because the chart keeps only the ten largest borough and room-type combinations.
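
To see every borough and room type side by side, including the small groups that a top-ten cut would drop, a cross-tabulation is one option. This is a minimal sketch on made-up rows (`toy` is hypothetical).

```python
import pandas as pd

# Made-up listings; Staten Island has a single entry
toy = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Staten Island"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Private room", "Private room"],
})

# Rows are boroughs, columns are room types, cells are listing counts
counts = pd.crosstab(toy["borough"], toy["room_type"])
print(counts)
```

Combinations with no listings appear as 0 rather than disappearing from the table.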

## <a id="6b">VIb. BIVARIATE ANALYSIS</a>

### HOSTS WITH THE MOST

In [None]:
host_most = df.groupby(['host_name'])[['host_id']].count()
host_most.reset_index(inplace=True)
host_most.rename(columns={'host_id':'count'}, inplace=True)
host_most = host_most.sort_values(by=['count'], ascending=False)[:10]

plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='host_name', data=host_most, palette='YlGnBu_r')
for i, v in enumerate(host_most['count']):
    ax.text(v + 3, i + .25, str(v), color='blue', fontsize=12, fontweight='bold')
plt.title('10 hosts with most listings')
plt.show()

### PRICE x BOROUGH

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
ax = sns.boxplot(x="price", y="borough", data=df, showmeans=True, palette='rocket')
plt.title('PRICE x BOROUGH')

plt.subplot(1,2,2)
# kdeplot draws the same density curves as the deprecated distplot(hist=False)
sns.kdeplot(df[df.borough=='Manhattan'].price,color='maroon',label='Manhattan')
sns.kdeplot(df[df.borough=='Brooklyn'].price,color='black',label='Brooklyn')
sns.kdeplot(df[df.borough=='Queens'].price,color='green',label='Queens')
sns.kdeplot(df[df.borough=='Staten Island'].price,color='blue',label='Staten Island')
sns.kdeplot(df[df.borough=='Bronx'].price,color='orange',label='Bronx')
plt.title('BOROUGH-WISE PRICE DISTRIBUTION ')
plt.xlim(0,500)
plt.legend()

plt.show()

### TOP 10 NEIGHBORHOODS ACROSS ALL BOROUGHS

In [None]:
plt.style.use('fivethirtyeight')
fig,ax=plt.subplots(1,2,figsize=(15,8))

clr = ("mediumorchid", "forestgreen", "gold", "red", "purple",'cadetblue','hotpink','orange','darksalmon','brown')
df.neighborhood.value_counts().sort_values(ascending=False)[:10].sort_values().plot(kind='barh',color=clr,ax=ax[0])
ax[0].set_title("TOP 10 MOST LISTED NEIGHBORHOODS",size=20)
ax[0].set_xlabel('rooms',size=18)


neigh_count=df['neighborhood'].value_counts()
neigh_names=list(df['neighborhood'].value_counts().index)[:10]
counts=list(neigh_count[:10])
counts.append(neigh_count.sum()-neigh_count[:10].sum())
neigh_names.append('Other')

type_dict=pd.DataFrame({"group":neigh_names,"counts":counts})
clr1=('brown','indianred','orange','darkcyan','cadetblue','purple','red','gold','forestgreen','blue','lightskyblue')
qx = type_dict.plot(kind='pie', y='counts', labels=neigh_names,colors=clr1,autopct='%1.1f%%', pctdistance=0.9, radius=1.2,ax=ax[1])

plt.legend(loc=0, bbox_to_anchor=(1.15,0.4)) 
plt.subplots_adjust(wspace =0.5, hspace =0)
plt.ioff()
plt.ylabel('')
pass

### TOP NEIGHBORHOODS IN EACH BOROUGH

A visualization of the 10 neighborhoods with the most listings in each borough

In [None]:
borough_neigh = df.groupby(['borough','neighborhood'])[['id']].count()
borough_neigh.reset_index(inplace=True)
borough_neigh.rename(columns={'id':'count'}, inplace=True)
borough_neigh = borough_neigh.sort_values(by=['count'], ascending=False)

top_10_man = borough_neigh.loc[borough_neigh['borough'] == 'Manhattan'].sort_values(by='count', ascending=False)[:10]
top_10_bklyn = borough_neigh.loc[borough_neigh['borough'] == 'Brooklyn'].sort_values(by='count', ascending=False)[:10]
top_10_queens = borough_neigh.loc[borough_neigh['borough'] == 'Queens'].sort_values(by='count', ascending=False)[:10]
top_10_bronx = borough_neigh.loc[borough_neigh['borough'] == 'Bronx'].sort_values(by='count', ascending=False)[:10]
top_10_si = borough_neigh.loc[borough_neigh['borough'] == 'Staten Island'].sort_values(by='count', ascending=False)[:10]
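
The five per-borough selections above follow the same pattern, so they could also be produced in one pass. The sketch below assumes a frame shaped like `borough_neigh`, with made-up counts, and uses `top_n = 2` only to keep the toy example small.

```python
import pandas as pd

# Made-up counts in the same shape as borough_neigh
toy = pd.DataFrame({
    "borough": ["Manhattan"] * 3 + ["Brooklyn"] * 3,
    "neighborhood": ["Harlem", "Chelsea", "SoHo",
                     "Bushwick", "Williamsburg", "Park Slope"],
    "count": [30, 20, 10, 25, 40, 5],
})

top_n = 2  # the notebook uses 10

# Sort once, then keep the first top_n rows within each borough
top_per_borough = (
    toy.sort_values("count", ascending=False)
       .groupby("borough")
       .head(top_n)
)
print(top_per_borough)
```

A single borough can then be pulled out with a filter such as `top_per_borough[top_per_borough["borough"] == "Manhattan"]`.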


In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='neighborhood', hue='borough', data=top_10_man, palette='rocket')
plt.title('TOP NEIGHBORHOODS IN MANHATTAN')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='neighborhood', hue='borough', data=top_10_bklyn, palette='mako')
plt.title('TOP NEIGHBORHOODS IN BROOKLYN')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='neighborhood', hue='borough', data=top_10_queens, palette='copper')
plt.title('TOP NEIGHBORHOODS IN QUEENS')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='neighborhood', hue='borough', data=top_10_si, palette='prism')
plt.title('TOP NEIGHBORHOODS IN STATEN ISLAND')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='count',y='neighborhood', hue='borough', data=top_10_bronx, palette='magma')
plt.title('TOP NEIGHBORHOODS IN BRONX')
plt.show()

### AVERAGE PRICE IN EACH BOROUGH'S NEIGHBORHOOD

In [None]:
neigh_price = df.groupby(['borough','neighborhood'])[['price']].mean()
neigh_price.reset_index(inplace=True)
neigh_price.rename(columns={'price':'avg_price'}, inplace=True)
neigh_price = neigh_price.sort_values(by=['avg_price'], ascending=False)
neigh_price = neigh_price.round(2)

price_man = neigh_price.loc[neigh_price['borough'] == 'Manhattan'].sort_values(by='avg_price', ascending=False)[:10]
price_bklyn = neigh_price.loc[neigh_price['borough'] == 'Brooklyn'].sort_values(by='avg_price', ascending=False)[:10]
price_queens = neigh_price.loc[neigh_price['borough'] == 'Queens'].sort_values(by='avg_price', ascending=False)[:10]
price_bronx = neigh_price.loc[neigh_price['borough'] == 'Bronx'].sort_values(by='avg_price', ascending=False)[:10]
price_si = neigh_price.loc[neigh_price['borough'] == 'Staten Island'].sort_values(by='avg_price', ascending=False)[:10]


In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='neighborhood', data=price_man, palette='rocket')
plt.title('TOP NEIGHBORHOODS IN MANHATTAN BY AVG PRICE')
for i, v in enumerate(price_man['avg_price']):
    ax.text(v + 3, i + .25, ('$'+str(v)), color='blue', fontsize=12, fontweight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='neighborhood', data=price_bklyn, palette='mako')
plt.title('TOP NEIGHBORHOODS IN BROOKLYN BY AVG PRICE')
for i, v in enumerate(price_bklyn['avg_price']):
    ax.text(v+3, i, ('$'+str(v)), color='blue', va='center', fontsize=12, fontweight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='neighborhood', data=price_queens, palette='copper')
plt.title('TOP NEIGHBORHOODS IN QUEENS BY AVG PRICE')
for i, v in enumerate(price_queens['avg_price']):
    ax.text(v+3, i, ('$'+str(v)), color='blue', va='center', fontsize=12, fontweight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='neighborhood', data=price_si, palette='prism')
plt.title('TOP NEIGHBORHOODS IN STATEN ISLAND BY AVG PRICE')
for i, v in enumerate(price_si['avg_price']):
    ax.text(v+3, i, ('$'+str(v)), color='blue', va='center', fontsize=12, fontweight='bold')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='neighborhood', data=price_bronx, palette='magma')
plt.title('TOP NEIGHBORHOODS IN BRONX BY AVG PRICE')
for i, v in enumerate(price_bronx['avg_price']):
    ax.text(v+3, i, ('$'+str(v)), color='blue', va='center', fontsize=12, fontweight='bold')
plt.show()

### AVERAGE PRICE OF EACH ROOM TYPE

In [None]:
room_price = df.groupby(['borough', 'room_type'])[['price']].mean()
room_price.reset_index(inplace=True)
room_price.rename(columns={'price':'avg_price'}, inplace=True)
room_price = room_price.sort_values(by=['avg_price'], ascending=False)
room_price = room_price.round(2)

In [None]:
plt.figure(figsize=(12, 8))
sns.set_style("darkgrid")
ax = sns.barplot(x='avg_price',y='room_type', hue='borough', data=room_price, palette='viridis')
plt.title('AVG PRICE OF ROOM TYPE x BOROUGH')
plt.show()

### PRICE x MINIMUM NIGHTS

In [None]:
plt.figure(figsize=(8, 5))
sns.set_style("darkgrid")
sns.scatterplot(data=df, x="minimum_nights", y="price")
plt.title('PRICE x MINIMUM NIGHTS')
plt.show()

### PRICE x NUMBER OF REVIEWS

In [None]:
plt.figure(figsize=(8, 5))
sns.set_style("darkgrid")
sns.scatterplot(data=df, x="number_of_reviews", y="price")
plt.title('PRICE x NUMBER OF REVIEWS')
plt.show()

### PRICE x REVIEWS PER MONTH 

In [None]:
plt.figure(figsize=(8, 5))
sns.set_style("darkgrid")
sns.scatterplot(data=df, x="reviews_per_month", y="price")
plt.title('PRICE x REVIEWS PER MONTH')
plt.show()

## <a id="6c">VIc. WORD ANALYSIS</a>

In [None]:
name_price0 = df[['name','price_range']].loc[df['price_range'] == '0 to 99']
name_price1 = df[['name','price_range']].loc[df['price_range'] == '100 to 199']
name_price2 = df[['name','price_range']].loc[df['price_range'] == '200 to 299']
name_price3 = df[['name','price_range']].loc[df['price_range'] == '300 to 399']

In [None]:
# The 'name' column needs a little more coding before we can analyze it.

# collecting the name strings for each price range into lists
names0 = list(name_price0.name)
names1 = list(name_price1.name)
names2 = list(name_price2.name)
names3 = list(name_price3.name)

# inconsequential words to leave out of the counts
skip_words = ['in','a','of','and','the','with','to','&','for','from','-']

# function that splits a name string into separate words
def split_name(name):
    spl=str(name).split()
    return spl

# function that lowercases every word in a list of names and drops the inconsequential ones
def collect_words(names):
    words = []
    for x in names:
        for word in split_name(x):
            word=word.lower()
            if word not in skip_words:
                words.append(word)
    return words

# lists of the words to be counted, one per price range
word_count0 = collect_words(names0)
word_count1 = collect_words(names1)
word_count2 = collect_words(names2)
word_count3 = collect_words(names3)

In [None]:
from collections import Counter
#let's see top 25 used words by host to name their listing
top_25_w0=Counter(word_count0).most_common()
top_25_w0=top_25_w0[0:25]
df_top25_w0=pd.DataFrame(top_25_w0)
df_top25_w0.rename(columns={0:'words', 1:'count'}, inplace=True)

top_25_w1=Counter(word_count1).most_common()
top_25_w1=top_25_w1[0:25]
df_top25_w1=pd.DataFrame(top_25_w1)
df_top25_w1.rename(columns={0:'words', 1:'count'}, inplace=True)

top_25_w2=Counter(word_count2).most_common()
top_25_w2=top_25_w2[0:25]
df_top25_w2=pd.DataFrame(top_25_w2)
df_top25_w2.rename(columns={0:'words', 1:'count'}, inplace=True)

top_25_w3=Counter(word_count3).most_common()
top_25_w3=top_25_w3[0:25]
df_top25_w3=pd.DataFrame(top_25_w3)
df_top25_w3.rename(columns={0:'words', 1:'count'}, inplace=True)
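
`Counter.most_common` also accepts the number of entries to return, so the slicing step can be folded into the call. The sketch below shows the idea on three invented listing names; `top_words` and `STOP` are hypothetical names, not part of the notebook.

```python
from collections import Counter

# Same inconsequential words the notebook filters out
STOP = {'in', 'a', 'of', 'and', 'the', 'with', 'to', '&', 'for', 'from', '-'}

def top_words(names, n=25):
    """Return the n most frequent lowercased words across the given names."""
    counts = Counter(
        word
        for name in names
        for word in str(name).lower().split()
        if word not in STOP
    )
    return counts.most_common(n)

# Made-up listing names
sample = ["Cozy room in Brooklyn", "Cozy studio", "Sunny room with a view"]
top = top_words(sample, n=2)
print(top)
```

The resulting list of `(word, count)` pairs can be passed to `pd.DataFrame(..., columns=['words', 'count'])` to get the same table shape used for the bar charts.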

In [None]:
plt.rcParams['figure.figsize']=(12,12)    #(6.0,4.0)
plt.rcParams['font.size']=12                #10 
plt.rcParams['savefig.dpi']=100             #72 
plt.rcParams['figure.subplot.bottom']=.1 

stopwords = set(STOPWORDS)
# stop_words = ["new york"] + list(STOPWORDS)

wordcloud0 = WordCloud(background_color='white',
                       stopwords=stopwords,
                       max_words=200,
                       max_font_size=40, 
                       random_state=42).generate(' '.join(name_price0['name'].dropna().astype(str)))  # join every name; str(Series) only gives a truncated preview

plt.style.use('fivethirtyeight')
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))

ax1.imshow(wordcloud0)
ax1.set_title('WORD CLOUD for $0-99 PRICE RANGE')
ax1.axis('off')

sns.barplot(x='words', y='count', data=df_top25_w0, ax=ax2)
ax2.set_title('WORD COUNT FOR $0-99 LISTINGS')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=80)

plt.show()
fig.savefig("word0.png", dpi=900)

In [None]:
plt.rcParams['figure.figsize']=(12,12)    #(6.0,4.0)
plt.rcParams['font.size']=12                #10 
plt.rcParams['savefig.dpi']=100             #72 
plt.rcParams['figure.subplot.bottom']=.1 

stopwords = set(STOPWORDS)

wordcloud1 = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=200,
                      max_font_size=40, 
                      random_state=42).generate(' '.join(name_price1['name'].dropna().astype(str)))  # join every name; str(Series) only gives a truncated preview

plt.style.use('fivethirtyeight')
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))

ax1.imshow(wordcloud1)
ax1.set_title('WORD CLOUD for $100-199 PRICE RANGE')
ax1.axis('off')

sns.barplot(x='words', y='count', data=df_top25_w1, ax=ax2)
ax2.set_title('WORD COUNT FOR $100-199 LISTINGS')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=80)

plt.show()
fig.savefig("word1.png", dpi=900)

In [None]:
plt.rcParams['figure.figsize']=(12,12)    #(6.0,4.0)
plt.rcParams['font.size']=12                #10 
plt.rcParams['savefig.dpi']=100             #72 
plt.rcParams['figure.subplot.bottom']=.1 

stopwords = set(STOPWORDS)

wordcloud2 = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=200,
                      max_font_size=40, 
                      random_state=42).generate(' '.join(name_price2['name'].dropna().astype(str)))  # join every name; str(Series) only gives a truncated preview

plt.style.use('fivethirtyeight')
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))

ax1.imshow(wordcloud2)
ax1.set_title('WORD CLOUD for $200-299 PRICE RANGE')
ax1.axis('off')

sns.barplot(x='words', y='count', data=df_top25_w2, ax=ax2)
ax2.set_title('WORD COUNT FOR $200-299 LISTINGS')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=80)

plt.show()
fig.savefig("word2.png", dpi=900)

In [None]:
plt.rcParams['figure.figsize']=(12,12)    #(6.0,4.0)
plt.rcParams['font.size']=12                #10 
plt.rcParams['savefig.dpi']=100             #72 
plt.rcParams['figure.subplot.bottom']=.1 

stopwords = set(STOPWORDS)

wordcloud3 = WordCloud(background_color='white',
                      stopwords=stopwords,
                      max_words=200,
                      max_font_size=40, 
                      random_state=42).generate(' '.join(name_price3['name'].dropna().astype(str)))  # join every name; str(Series) only gives a truncated preview

plt.style.use('fivethirtyeight')
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(16,6))

ax1.imshow(wordcloud3)
ax1.set_title('WORD CLOUD for $300-399 PRICE RANGE')
ax1.axis('off')

sns.barplot(x='words', y='count', data=df_top25_w3, ax=ax2)
ax2.set_title('WORD COUNT FOR $300-399 LISTINGS')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=80)

plt.show()
fig.savefig("word3.png", dpi=900)

## <a id="6d">VId. MAP VISUALIZATIONS</a>

In [None]:
# MANHATTAN MAP SPECS
# .copy() makes each subset independent, so adding 'label' below does not trigger SettingWithCopyWarning
df_manhattan0 = df.loc[(df['borough'] == 'Manhattan') & (df['price_range'] == '0 to 99')].copy()
df_manhattan1 = df.loc[(df['borough'] == 'Manhattan') & (df['price_range'] == '100 to 199')].copy()
df_manhattan2 = df.loc[(df['borough'] == 'Manhattan') & (df['price_range'] == '200 to 299')].copy()
df_manhattan3 = df.loc[(df['borough'] == 'Manhattan') & (df['price_range'] == '300 to 399')].copy()

avg_man_lat = df['latitude'].loc[df['borough'] == 'Manhattan'].mean()
avg_man_long = df['longitude'].loc[df['borough'] == 'Manhattan'].mean()

df_manhattan0['label'] = df_manhattan0['room_type']+(': $')+df_manhattan0['price'].astype(str)
df_manhattan1['label'] = df_manhattan1['room_type']+(': $')+df_manhattan1['price'].astype(str)
df_manhattan2['label'] = df_manhattan2['room_type']+(': $')+df_manhattan2['price'].astype(str)
df_manhattan3['label'] = df_manhattan3['room_type']+(': $')+df_manhattan3['price'].astype(str)

#BROOKLYN MAP SPECS
df_brooklyn0 = df.loc[(df['borough'] == 'Brooklyn') & (df['price_range'] == '0 to 99')].copy()
df_brooklyn1 = df.loc[(df['borough'] == 'Brooklyn') & (df['price_range'] == '100 to 199')].copy()
df_brooklyn2 = df.loc[(df['borough'] == 'Brooklyn') & (df['price_range'] == '200 to 299')].copy()
df_brooklyn3 = df.loc[(df['borough'] == 'Brooklyn') & (df['price_range'] == '300 to 399')].copy()

avg_bk_lat = df['latitude'].loc[df['borough'] == 'Brooklyn'].mean()
avg_bk_long = df['longitude'].loc[df['borough'] == 'Brooklyn'].mean()

df_brooklyn0['label'] = df_brooklyn0['room_type']+(': $')+df_brooklyn0['price'].astype(str)
df_brooklyn1['label'] = df_brooklyn1['room_type']+(': $')+df_brooklyn1['price'].astype(str)
df_brooklyn2['label'] = df_brooklyn2['room_type']+(': $')+df_brooklyn2['price'].astype(str)
df_brooklyn3['label'] = df_brooklyn3['room_type']+(': $')+df_brooklyn3['price'].astype(str)

#QUEENS MAP SPECS
df_queens0 = df.loc[(df['borough'] == 'Queens') & (df['price_range'] == '0 to 99')].copy()
df_queens1 = df.loc[(df['borough'] == 'Queens') & (df['price_range'] == '100 to 199')].copy()
df_queens2 = df.loc[(df['borough'] == 'Queens') & (df['price_range'] == '200 to 299')].copy()
df_queens3 = df.loc[(df['borough'] == 'Queens') & (df['price_range'] == '300 to 399')].copy()

avg_queens_lat = df['latitude'].loc[df['borough'] == 'Queens'].mean()
avg_queens_long = df['longitude'].loc[df['borough'] == 'Queens'].mean()

df_queens0['label'] = df_queens0['room_type']+(': $')+df_queens0['price'].astype(str)
df_queens1['label'] = df_queens1['room_type']+(': $')+df_queens1['price'].astype(str)
df_queens2['label'] = df_queens2['room_type']+(': $')+df_queens2['price'].astype(str)
df_queens3['label'] = df_queens3['room_type']+(': $')+df_queens3['price'].astype(str)

# STATEN ISLAND SPECS
df_si0 = df.loc[(df['borough'] == 'Staten Island') & (df['price_range'] == '0 to 99')].copy()
df_si1 = df.loc[(df['borough'] == 'Staten Island') & (df['price_range'] == '100 to 199')].copy()
df_si2 = df.loc[(df['borough'] == 'Staten Island') & (df['price_range'] == '200 to 299')].copy()
df_si3 = df.loc[(df['borough'] == 'Staten Island') & (df['price_range'] == '300 to 399')].copy()

avg_si_lat = df['latitude'].loc[df['borough'] == 'Staten Island'].mean()
avg_si_long = df['longitude'].loc[df['borough'] == 'Staten Island'].mean()

df_si0['label'] = df_si0['room_type']+(': $')+df_si0['price'].astype(str)
df_si1['label'] = df_si1['room_type']+(': $')+df_si1['price'].astype(str)
df_si2['label'] = df_si2['room_type']+(': $')+df_si2['price'].astype(str)
df_si3['label'] = df_si3['room_type']+(': $')+df_si3['price'].astype(str)

#BRONX SPECS
df_bronx0 = df.loc[(df['borough'] == 'Bronx') & (df['price_range'] == '0 to 99')].copy()
df_bronx1 = df.loc[(df['borough'] == 'Bronx') & (df['price_range'] == '100 to 199')].copy()
df_bronx2 = df.loc[(df['borough'] == 'Bronx') & (df['price_range'] == '200 to 299')].copy()
df_bronx3 = df.loc[(df['borough'] == 'Bronx') & (df['price_range'] == '300 to 399')].copy()

avg_bronx_lat = df['latitude'].loc[df['borough'] == 'Bronx'].mean()
avg_bronx_long = df['longitude'].loc[df['borough'] == 'Bronx'].mean()

df_bronx0['label'] = df_bronx0['room_type']+(': $')+df_bronx0['price'].astype(str)
df_bronx1['label'] = df_bronx1['room_type']+(': $')+df_bronx1['price'].astype(str)
df_bronx2['label'] = df_bronx2['room_type']+(': $')+df_bronx2['price'].astype(str)
df_bronx3['label'] = df_bronx3['room_type']+(': $')+df_bronx3['price'].astype(str)
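
Since every borough and price range gets the same treatment, the subsets could also be built in a loop-free way and stored in a dictionary. This is a sketch on invented rows; `toy`, `labeled`, and `subsets` are hypothetical names, and the label format mirrors the one used above.

```python
import pandas as pd

# Made-up listings with the columns the map cells rely on
toy = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Bronx"],
    "price_range": ["0 to 99", "100 to 199", "0 to 99"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [80, 150, 40],
})

# Build the popup label once for all rows
labeled = toy.assign(label=toy["room_type"] + ": $" + toy["price"].astype(str))

# One sub-frame per (borough, price_range) pair, keyed by that pair
subsets = {key: group for key, group in labeled.groupby(["borough", "price_range"])}

print(subsets[("Manhattan", "0 to 99")])
```

With this layout, `subsets[("Brooklyn", "100 to 199")]` would play the role of `df_brooklyn1`.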

### MANHATTAN PRICE MAP

In [None]:
manhattan_map = folium.Map(location=[avg_man_lat, avg_man_long], width='100%', height='100%', zoom_start=13)


for lat, lng, label in zip(df_manhattan0.latitude, df_manhattan0.longitude, df_manhattan0.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#9795cf', 
                                 fill_color='#9795cf').add_to(manhattan_map)

for lat, lng, label in zip(df_manhattan1.latitude, df_manhattan1.longitude, df_manhattan1.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#6b67d6', 
                                 fill_color='#6b67d6').add_to(manhattan_map)

for lat, lng, label in zip(df_manhattan2.latitude, df_manhattan2.longitude, df_manhattan2.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#342ded', 
                                 fill_color='#342ded').add_to(manhattan_map)

for lat, lng, label in zip(df_manhattan3.latitude, df_manhattan3.longitude, df_manhattan3.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#0600a6', 
                                 fill_color='#0600a6').add_to(manhattan_map)

manhattan_map
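
The four marker loops differ only in the color assigned to each price range, so that mapping can be kept in one place. The sketch below is pure Python and does not call folium itself; `PRICE_COLORS` and `marker_kwargs` are hypothetical names, while the hex values are the ones used in the map cells.

```python
# Color ramp used by the maps: lighter for cheaper, darker for pricier
PRICE_COLORS = {
    "0 to 99": "#9795cf",
    "100 to 199": "#6b67d6",
    "200 to 299": "#342ded",
    "300 to 399": "#0600a6",
}

def marker_kwargs(label, price_range):
    """Build the keyword arguments shared by every circle marker."""
    color = PRICE_COLORS[price_range]
    return {
        "popup": label,
        "radius": 5,
        "fill": True,
        "color": color,
        "fill_color": color,
    }

spec = marker_kwargs("Private room: $80", "0 to 99")
print(spec)

# In the notebook, the dict could be unpacked into the existing marker call, e.g.
# folium.features.CircleMarker([lat, lng], **spec).add_to(manhattan_map)
```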

### BROOKLYN PRICE MAP

In [None]:
brooklyn_map = folium.Map(location=[avg_bk_lat, avg_bk_long], width='100%', height='100%', zoom_start=13)


for lat, lng, label in zip(df_brooklyn0.latitude, df_brooklyn0.longitude, df_brooklyn0.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#9795cf', 
                                 fill_color='#9795cf').add_to(brooklyn_map)

for lat, lng, label in zip(df_brooklyn1.latitude, df_brooklyn1.longitude, df_brooklyn1.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#6b67d6', 
                                 fill_color='#6b67d6').add_to(brooklyn_map)

for lat, lng, label in zip(df_brooklyn2.latitude, df_brooklyn2.longitude, df_brooklyn2.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#342ded', 
                                 fill_color='#342ded').add_to(brooklyn_map)

for lat, lng, label in zip(df_brooklyn3.latitude, df_brooklyn3.longitude, df_brooklyn3.label):
    folium.features.CircleMarker([lat, lng],
                                 popup=label,
                                 radius=5, 
                                 fill=True, 
                                 color='#0600a6', 
                                 fill_color='#0600a6').add_to(brooklyn_map)

brooklyn_map

### QUEENS PRICE MAP

In [None]:
queens_map = folium.Map(location=[avg_queens_lat, avg_queens_long], width='100%', height='100%', zoom_start=13)


# One (DataFrame, color) pair per price tier; colors darken from tier 0 to tier 3.
queens_tiers = [(df_queens0, '#9795cf'), (df_queens1, '#6b67d6'),
                (df_queens2, '#342ded'), (df_queens3, '#0600a6')]

# Add one circle marker per listing, tier by tier.
for tier_df, tier_color in queens_tiers:
    for lat, lng, label in zip(tier_df.latitude, tier_df.longitude, tier_df.label):
        folium.features.CircleMarker([lat, lng],
                                     popup=label,
                                     radius=5,
                                     fill=True,
                                     color=tier_color,
                                     fill_color=tier_color).add_to(queens_map)

queens_map

### STATEN ISLAND PRICE MAP

In [None]:
si_map = folium.Map(location=[avg_si_lat, avg_si_long], width='100%', height='100%', zoom_start=13)


# One (DataFrame, color) pair per price tier; colors darken from tier 0 to tier 3.
si_tiers = [(df_si0, '#9795cf'), (df_si1, '#6b67d6'),
            (df_si2, '#342ded'), (df_si3, '#0600a6')]

# Add one circle marker per listing, tier by tier.
for tier_df, tier_color in si_tiers:
    for lat, lng, label in zip(tier_df.latitude, tier_df.longitude, tier_df.label):
        folium.features.CircleMarker([lat, lng],
                                     popup=label,
                                     radius=5,
                                     fill=True,
                                     color=tier_color,
                                     fill_color=tier_color).add_to(si_map)

si_map

### BRONX PRICE MAP

In [None]:
bronx_map = folium.Map(location=[avg_bronx_lat, avg_bronx_long], width='100%', height='100%', zoom_start=13)


# One (DataFrame, color) pair per price tier; colors darken from tier 0 to tier 3.
bronx_tiers = [(df_bronx0, '#9795cf'), (df_bronx1, '#6b67d6'),
               (df_bronx2, '#342ded'), (df_bronx3, '#0600a6')]

# Add one circle marker per listing, tier by tier.
for tier_df, tier_color in bronx_tiers:
    for lat, lng, label in zip(tier_df.latitude, tier_df.longitude, tier_df.label):
        folium.features.CircleMarker([lat, lng],
                                     popup=label,
                                     radius=5,
                                     fill=True,
                                     color=tier_color,
                                     fill_color=tier_color).add_to(bronx_map)

bronx_map
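
The borough map cells differ only in the map object and the tier DataFrames they read, so the tier-to-color pairing could be kept in one place. The following is a minimal sketch of that idea, not the notebook's own code: `TIER_COLORS` and `marker_specs` are hypothetical names introduced here, the coordinates and labels are made up, and plain lists of `(latitude, longitude, label)` tuples stand in for the tier DataFrames.

```python
# Hypothetical helper, not part of the original notebook.
# Colors copied from the map cells, tier 0 through tier 3.
TIER_COLORS = ['#9795cf', '#6b67d6', '#342ded', '#0600a6']


def marker_specs(tiers, colors=TIER_COLORS):
    """Yield one dict of circle-marker settings per listing.

    `tiers` is a sequence of iterables, one per price tier, each yielding
    (latitude, longitude, label) tuples.
    """
    for tier_rows, tier_color in zip(tiers, colors):
        for lat, lng, label in tier_rows:
            yield {
                'location': [lat, lng],
                'popup': label,
                'radius': 5,
                'fill': True,
                'color': tier_color,
                'fill_color': tier_color,
            }


# Made-up coordinates and labels, for illustration only.
example_tiers = [
    [(40.85, -73.88, 'listing A')],
    [(40.83, -73.90, 'listing B'), (40.84, -73.87, 'listing C')],
]
specs = list(marker_specs(example_tiers))
print(len(specs), specs[0]['color'], specs[-1]['color'])
```

With folium imported as in the notebook, each dict could then be passed along as `folium.features.CircleMarker(**spec).add_to(bronx_map)`, and a tier DataFrame can be turned into row tuples with `zip(df.latitude, df.longitude, df.label)`.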

# <a id="7">VII. MODEL DEVELOPMENT</a>

COMING SOON...