# NYC Airbnb Open Data

This journal analyzes Airbnb listing data for the greater NYC area.

## Context
Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to present a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

## Content
This data file includes the information needed to find out more about hosts and geographical availability, along with the metrics necessary to make predictions and draw conclusions.

## Acknowledgements
This public dataset is part of Airbnb,

In [16]:
# Standard Imports
import numpy as np
import pandas as pd
import os

In [14]:
# Set Working Directory
os.chdir(r"C:\Users\joshu\OneDrive\Desktop\archive")

In [30]:
# read data
ab = pd.read_csv('AB_NYC_2019.csv')

ab.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,58.5,327.0,365.0


In [31]:
ab.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


What immediately catches my attention is the text data in the 'name' variable. It may be interesting to run some language processing on it in the hope of seeing:
- which keywords are commonly used in Airbnb listing names
- which words are commonly associated with the best business results

Additionally, it may be useful to explore:
- which hosts are busiest and why
- how does location factor into this (geospatial analysis)
- can we predict price?

Data Preparation: First we must process the 'name' text by converting it to lowercase, removing punctuation, and tokenizing it (splitting each name into a list of individual standardized words).
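To make these steps concrete before bringing in NLTK, here is a minimal standard-library sketch of the same idea. It is not the code used in this journal: `simple_text_process` and the tiny `STOP_WORDS` set are hypothetical names introduced for illustration, and splitting on whitespace is only a rough stand-in for a real tokenizer.

```python
import string

# Hypothetical, tiny stop-word list for illustration only
STOP_WORDS = {"the", "of", "by", "in", "a", "and"}

# Translation table that maps every punctuation character to a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def simple_text_process(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(simple_text_process("Clean & quiet apt home by the park"))
print(simple_text_process("THE VILLAGE OF HARLEM....NEW YORK !"))
```

Unlike a proper tokenizer, this sketch breaks tokens such as "studio/loft" into two words, which may or may not be what is wanted for listing names.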

In [58]:
# Text-processing imports
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# word_tokenize additionally requires NLTK's Punkt tokenizer data to be installed
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\joshu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
# Tokenize Airbnb listing names
ab['name'] = ab['name'].astype(str)
ab['tokenized_name'] = ab['name'].apply(lambda x: word_tokenize(x.lower()))

ab['tokenized_name']

0              [clean, &, quiet, apt, home, by, the, park]
1                                [skylit, midtown, castle]
2           [the, village, of, harlem, ...., new, york, !]
3                    [cozy, entire, floor, of, brownstone]
4        [entire, apt, :, spacious, studio/loft, by, ce...
                               ...                        
48890    [charming, one, bedroom, -, newly, renovated, ...
48891    [affordable, room, in, bushwick/east, williams...
48892        [sunny, studio, at, historical, neighborhood]
48893          [43rd, st., time, square-cozy, single, bed]
48894    [trendy, duplex, in, the, very, heart, of, hel...
Name: tokenized_name, Length: 48895, dtype: object

In [60]:
# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def text_process(text):
    # Tokenize and lowercase the text
    tokens = word_tokenize(text.lower())
    
    # Drop tokens found in string.punctuation (this is a substring check,
    # so multi-character tokens such as '....' are kept)
    tokens = [token for token in tokens if token not in string.punctuation]
    
    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]
    
    return tokens

# Apply the preprocessing function to the 'name' column
ab['tokenized_name'] = ab['name'].apply(text_process)

In [61]:
ab['tokenized_name']

0                          [clean, quiet, apt, home, park]
1                                [skylit, midtown, castle]
2                       [village, harlem, ...., new, york]
3                        [cozy, entire, floor, brownstone]
4        [entire, apt, spacious, studio/loft, central, ...
                               ...                        
48890    [charming, one, bedroom, newly, renovated, row...
48891      [affordable, room, bushwick/east, williamsburg]
48892            [sunny, studio, historical, neighborhood]
48893          [43rd, st., time, square-cozy, single, bed]
48894           [trendy, duplex, heart, hell, 's, kitchen]
Name: tokenized_name, Length: 48895, dtype: object

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a technique to convert text data into numerical vectors that represent the importance of words in each document.
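As a rough illustration of what those numbers mean, the sketch below computes the textbook form of TF-IDF by hand on a hypothetical toy corpus. The function name `tf_idf` and the three example documents are my own assumptions. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its values will differ from these, but the intuition is the same: a word that appears in every document gets little or no weight, while a rare word gets more.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook form: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus of tokenized listing names
docs = [
    ["cozy", "studio", "in", "brooklyn"],
    ["sunny", "studio", "in", "manhattan"],
    ["cozy", "loft", "in", "brooklyn"],
]
scores = tf_idf(docs)
print(scores[1])
```

Here "in" occurs in all three documents, so its weight is zero, while "sunny" occurs in only one and receives the largest weight in the second document.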

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer

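# Note: the vectorizer is fit on the raw 'name' column, not on the cleaned
# tokens, so TfidfVectorizer applies its own lowercasing and tokenization
# and stop words such as 'you' and 'your' remain in the vocabulary

# Initialize the TfidfVectorizer, keeping the 1000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the listing names
tfidf_matrix = tfidf_vectorizer.fit_transform(ab['name'])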

# Convert the TF-IDF matrix into a pandas DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Add the TF-IDF values to your dataset
ab = pd.concat([ab, tfidf_df], axis=1)

ab

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,yoga,york,yorker,you,young,your,yours,yourself,zen,zoo
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,...,0.0,0.50125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## NMF

In [67]:
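def nmf(X, r, n_iter=50, eps=1e-11):
    # Factor a nonnegative matrix X (m x n) into A (m x r) and S (r x n)
    # using multiplicative updates; eps guards against division by zero
    A = np.random.rand(X.shape[0], r)
    S = np.random.rand(r, X.shape[1])
    for _ in range(n_iter):
        A = np.multiply(A, np.divide(X @ S.T, A @ S @ S.T + eps))
        S = np.multiply(S, np.divide(A.T @ X, A.T @ A @ S + eps))
    return A, S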

In [68]:
print(tfidf_matrix.shape)

(48895, 1000)


In [71]:
A,S = nmf(X=tfidf_matrix.T, r=20)
print(A.shape, S.shape)

(1000, 20) (20, 48895)
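
To sanity-check the factorization idea on something small, the sketch below runs the same kind of multiplicative updates on a hypothetical 6 x 5 toy matrix and tracks the reconstruction error `||X - A S||`. The name `nmf_sketch`, the seed, and the toy data are assumptions for illustration; this is a dense-array variant, not the exact function used above.

```python
import numpy as np

def nmf_sketch(X, r, n_iter=200, eps=1e-11, seed=0):
    """Multiplicative-update NMF on a dense nonnegative array (illustrative)."""
    rng = np.random.default_rng(seed)
    A = rng.random((X.shape[0], r))
    S = rng.random((r, X.shape[1]))
    errors = []
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so entries stay nonnegative
        A *= (X @ S.T) / (A @ S @ S.T + eps)
        S *= (A.T @ X) / (A.T @ A @ S + eps)
        # Frobenius norm of the reconstruction residual
        errors.append(np.linalg.norm(X - A @ S))
    return A, S, errors

# Hypothetical toy matrix: 6 "terms" x 5 "documents" built from 2 hidden topics
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))

A, S, errors = nmf_sketch(X, r=2)
print(A.shape, S.shape)
print("first error:", errors[0], "last error:", errors[-1])
```

Because the factors start from random values, different seeds can give different (but similarly good) factorizations, which is worth keeping in mind when reading the topics.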


In [74]:
# Print the topics
feature_names = tfidf_vectorizer.get_feature_names_out() # top 1000 terms (vocabulary)

for i, topic in enumerate(A.T):
    print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in np.array(feature_names)[topic.argsort()[-10:]]])))

Topic 1: 1br,bedford,modern,south,east,location,bright,prime,in,williamsburg
Topic 2: train,sq,quiet,time,columbia,subway,and,times,square,near
Topic 3: downtown,location,15,min,luxury,minutes,in,from,midtown,manhattan
Topic 4: west,beautiful,modern,large,luxury,charming,chelsea,midtown,in,studio
Topic 5: bed,space,greenpoint,tribeca,soho,huge,in,artist,bushwick,loft
Topic 6: br,comfortable,1br,quiet,duplex,sunny,clean,bright,and,spacious
Topic 7: renovated,entire,sunny,new,garden,luxury,modern,in,beautiful,apartment
Topic 8: greenwich,1br,charming,on,lower,west,upper,side,village,east
Topic 9: brownstone,nyc,large,bed,1br,modern,br,in,sunny,apt
Topic 10: bed,terrace,duplex,views,bathroom,backyard,garden,balcony,view,with
Topic 11: 15,jfk,mins,next,train,nyc,min,subway,close,to
Topic 12: bright,oasis,duplex,townhouse,charming,house,heights,brownstone,in,brooklyn
Topic 13: huge,astoria,charming,master,two,beautiful,large,in,one,bedroom
Topic 14: perfect,at,place,by,best,house,city,on,in
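
When reading these topics it helps to remember how `argsort()[-10:]` orders its result: it sorts ascending, so the strongest term of each topic is printed last. A small sketch with a hypothetical five-word vocabulary and made-up weights shows the ordering, along with one way to list the strongest term first instead.

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights
feature_names = np.array(["loft", "sunny", "studio", "park", "cozy"])
topic = np.array([0.10, 0.70, 0.05, 0.40, 0.90])

# argsort sorts ascending, so the last k indices are the k largest weights
top3_ascending = feature_names[topic.argsort()[-3:]]
# Reverse the sorted indices to list the strongest term first
top3_descending = feature_names[topic.argsort()[::-1][:3]]

print(",".join(top3_ascending))
print(",".join(top3_descending))
```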



## Sentiment Analysis

In [75]:
from textblob import TextBlob

ab['Sentiment'] = ab['name'].apply(lambda x: TextBlob(x).sentiment.polarity)

In [77]:
ab

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,york,yorker,you,young,your,yours,yourself,zen,zoo,Sentiment
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.322222
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,...,0.50125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.100000
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.200000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.418182
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,...,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.071429


## Word Frequency Analysis


In [78]:
from collections import Counter

# Tokenize the listing names and count word frequency
words = word_tokenize(" ".join(ab['name']))
word_freq = Counter(words)
most_common_words = word_freq.most_common(10)
print(most_common_words)
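
Since the listing names were already cleaned into `ab['tokenized_name']` earlier, the frequency count could also be run on those token lists, which would avoid counting punctuation and stop words. The sketch below shows the idea on a hypothetical three-listing stand-in (`tokenized_names`), not on the real column.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for ab['tokenized_name']: one token list per listing
tokenized_names = [
    ["clean", "quiet", "apt", "home", "park"],
    ["cozy", "entire", "floor", "brownstone"],
    ["entire", "apt", "spacious", "studio/loft", "central", "park"],
]

# Flatten the per-listing token lists and count each word
word_freq = Counter(chain.from_iterable(tokenized_names))
print(word_freq.most_common(3))
```

On the real data the same pattern would presumably be `Counter(chain.from_iterable(ab['tokenized_name']))`, since iterating a pandas Series yields its values.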