# Airbnb data

Choose a city from https://insideairbnb.com/get-the-data/, right-click its `listings.csv.gz` link, copy the link address, and paste that URL into the download cell below.

In [6]:
# Download data

# These are Airbnb listings from Austin TX
! wget https://data.insideairbnb.com/united-states/tx/austin/2024-09-13/data/listings.csv.gz -O austin_listings.csv.gz # descriptions in this file contain <br /> tags
! gunzip -c austin_listings.csv.gz > austin_listings.csv # This unzips the file into a regular CSV

# Airbnb listings from Albany NY
# ! wget https://data.insideairbnb.com/united-states/ny/albany/2024-11-05/data/listings.csv.gz -O albany_listings.csv.gz
# ! gunzip -c albany_listings.csv.gz > albany_listings.csv # This unzips the file into a regular CSV

# Airbnb listings from Mexico City, Mexico
# ! wget https://data.insideairbnb.com/mexico/df/mexico-city/2024-09-25/data/listings.csv.gz -O mexico_city_listings.csv.gz
# ! gunzip -c mexico_city_listings.csv.gz > mexico_city_listings.csv # This unzips the file into a regular CSV

--2025-01-11 11:27:24--  https://data.insideairbnb.com/united-states/tx/austin/2024-09-13/data/listings.csv.gz
Resolving data.insideairbnb.com (data.insideairbnb.com)... 18.165.98.12, 18.165.98.45, 18.165.98.33, ...
Connecting to data.insideairbnb.com (data.insideairbnb.com)|18.165.98.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8902975 (8.5M) [application/x-gzip]
Saving to: ‘austin_listings.csv.gz’


2025-01-11 11:27:25 (54.5 MB/s) - ‘austin_listings.csv.gz’ saved [8902975/8902975]
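
If `wget` and `gunzip` are not available (for example on Windows), the decompression step can be done from Python instead. The following is a minimal sketch using only the standard library; the helper name `gunzip_file` is made up for illustration. Note also that `pd.read_csv` can read a `.csv.gz` file directly, inferring the compression from the file extension, so the explicit unzip is optional.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path, similar to `gunzip -c src > dst`."""
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine

# Example usage (assumes austin_listings.csv.gz has already been downloaded):
# gunzip_file('austin_listings.csv.gz', 'austin_listings.csv')
```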



In [1]:
# Load data with pandas

import pandas as pd

listings = pd.read_csv('austin_listings.csv') # reads CSV file into a pandas dataframe
listings.info() # provide basic information about this dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15244 entries, 0 to 15243
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            15244 non-null  int64  
 1   listing_url                                   15244 non-null  object 
 2   scrape_id                                     15244 non-null  int64  
 3   last_scraped                                  15244 non-null  object 
 4   source                                        15244 non-null  object 
 5   name                                          15244 non-null  object 
 6   description                                   14775 non-null  object 
 7   neighborhood_overview                         8662 non-null   object 
 8   picture_url                                   15243 non-null  object 
 9   host_id                                       15244 non-null 

In [9]:
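# Expand pandas view (good for seeing more of text)
pd.set_option('display.max_colwidth', None)
selected_data = listings[['description', 'neighborhood_overview']]  # the two free-text columns shown below
selected_data.head()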

Unnamed: 0,description,neighborhood_overview
0,"Great central location for walking to Convention Center, Rainey Street, East 6th Street, Downtown, Congress Ave Bats.<br /><br /> Free wifi<br /><br />No Smoking, No pets","My neighborhood is ideally located if you want to walk to bars and restaurants downtown, East 6th Street or Rainey Street. The Convention Center is only 3 1/2 blocks away and a quick 10 minute walk. Whole foods store located 5 blks , easily walkable."
1,,Quiet neighborhood with lots of trees and good neighbors.
2,"Great studio apartment, perfect a single person or a couples. Available as a month-to-month rental. If you're looking for a different month than the one that's open, please ask. Just 1 mile into downtown. Convenient for walking, biking, rideshare or busing into downtown, UT campus and other central Austin spots. Walk to the 10-mile looped Town Lake Trail. Airy space with very nice amenities, fresh coffee beans and a private patio.","Travis Heights is one of the oldest neighborhoods in Austin. Our house was built in 1937. We rebuilt the apartment in 2009 (well, finished and furnished it for rental then). From the studio it's a pretty easy 1-mile walk through the neighborhood to all the shops and restaurants on South Congress."
3,"Clean, private space with everything you need for a quiet, comfy, private stay close to Zilker Park and Barton Springs, the river, parks, trails, and downtown. King bed, vaulted ceilings, high-speed fiber internet. Quality furnishings and amenities will make you feel at home. We offer contactless check-in/checkout, if you like (and we are vaccinated).",The neighborhood is fun and funky (but quiet)! People are friendly and you can't beat the location.
4,Studio rental on lower level of home located in a 1950s neighborhood less than two miles from downtown Austin and close to bus routes.<br /><br />On stays less than 30 nights additional Austin city hotel taxes of 11% will be collected separately following confirmation of reservation.<br /><br />Texas state hotel taxes will be collected by Airbnb.<br /><br />Hotel taxes apply for all stays of 29 nights or less. No hotel taxes are charged for rentals of 30 nights or more.,


# Cleaning with regular expressions
Let's remove extraneous markup such as the `<br />` HTML line-break tags.

In [2]:
# Do this with pandas' built-in string functions
# TODO: link to these functions
listings['description_processed'] = listings['description'].str.replace(r'<br\s*/>', ' ', regex=True)
listings[['description', 'description_processed']].head()

Unnamed: 0,description,description_processed
0,Great central location for walking to Convent...,Great central location for walking to Convent...
1,,
2,"Great studio apartment, perfect a single perso...","Great studio apartment, perfect a single perso..."
3,"Clean, private space with everything you need ...","Clean, private space with everything you need ..."
4,Studio rental on lower level of home located i...,Studio rental on lower level of home located i...
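
The truncated preview above hides the effect of the replacement, so here is a small standalone sketch of what the pattern `<br\s*/>` does and does not match. It uses Python's `re` module rather than pandas; the first sample string comes from the first listing shown earlier, and the other two are hypothetical.

```python
import re

# Same pattern as in the cell above: "<br", optional whitespace, then "/>"
BR_TAG = re.compile(r'<br\s*/>')

samples = [
    'Free wifi<br /><br />No Smoking, No pets',  # from the first listing above
    'one<br/>two',   # hypothetical: no space before the slash
    'one<br>two',    # hypothetical: no slash at all, NOT matched by this pattern
]

cleaned = [BR_TAG.sub(' ', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Each tag becomes one space, so two consecutive tags leave a double space. That is harmless here because the tokenization step later rejoins tokens with single spaces.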


In [3]:
# Convert NaN values to empty strings
listings['description_processed'] = listings['description_processed'].fillna('')
listings[['description', 'description_processed']].head()

Unnamed: 0,description,description_processed
0,Great central location for walking to Convent...,Great central location for walking to Convent...
1,,
2,"Great studio apartment, perfect a single perso...","Great studio apartment, perfect a single perso..."
3,"Clean, private space with everything you need ...","Clean, private space with everything you need ..."
4,Studio rental on lower level of home located i...,Studio rental on lower level of home located i...


# Tokenization
Tokenization is the process of breaking text up into tokens: words and punctuation marks. Here we will use the `nltk` package to tokenize.

In [4]:
import nltk

In [7]:
# Only need to do once
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /ihome/cs1671_2025s/mmyoder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Apply tokenizer from nltk to column
def tokenize(text):
    tokens_list = nltk.word_tokenize(text)  # split into word and punctuation tokens
    return ' '.join(tokens_list)  # rejoin with single spaces so the column stays a string

listings['description_processed'] = listings['description_processed'].map(tokenize)
listings[['description', 'description_processed']].head()
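
To get a feel for what tokenization does to a single string, here is a rough regex-based sketch. It is only a simplified stand-in and not the algorithm `nltk.word_tokenize` uses; the function name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Rough stand-in for a word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = 'Clean, private space with everything you need.'
tokens = simple_tokenize(text)
print(tokens)            # list of tokens, punctuation separated from words
print(' '.join(tokens))  # same "tokens joined by spaces" format used in the column
```

The two approaches differ on details such as contractions: `nltk.word_tokenize` should split "you're" into "you" and "'re", whereas this regex produces "you", "'", and "re".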

# Lowercasing

In [9]:
listings['description_processed'] = listings['description_processed'].str.lower()
listings[['description', 'description_processed']].head()

Unnamed: 0,description,description_processed
0,Great central location for walking to Convent...,great central location for walking to conventi...
1,,
2,"Great studio apartment, perfect a single perso...","great studio apartment , perfect a single pers..."
3,"Clean, private space with everything you need ...","clean , private space with everything you need..."
4,Studio rental on lower level of home located i...,studio rental on lower level of home located i...


# Stemming

In [15]:
# Progress bar since it takes a while
from tqdm.auto import tqdm
tqdm.pandas()

stemmer = nltk.PorterStemmer()

def stem(text):
    tokens = text.split()
    stemmed_tokens = [stemmer.stem(t) for t in tokens]
    return ' '.join(stemmed_tokens)

listings['description_processed'] = listings['description_processed'].progress_map(stem)
listings[['description', 'description_processed']].head()

Unnamed: 0,description,description_processed
0,Great central location for walking to Convent...,great central locat for walk to convent center...
1,,
2,"Great studio apartment, perfect a single perso...","great studio apart , perfect a singl person or..."
3,"Clean, private space with everything you need ...","clean , privat space with everyth you need for..."
4,Studio rental on lower level of home located i...,studio rental on lower level of home locat in ...


# Could put subword tokenization (just have an example)
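
One possible example is byte-pair encoding (BPE), which starts from single characters and repeatedly merges the most frequent adjacent pair of symbols. The code below is a minimal sketch on a toy word list, not a production tokenizer: real BPE implementations typically add end-of-word markers and train on a large corpus. The function names and the toy words are assumptions made for illustration.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; ties go to the pair seen first."""
    vocab = dict(Counter(tuple(w) for w in words))  # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

def segment(word, merges):
    """Split a new word into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)

merges, vocab = learn_bpe(['walk', 'walking', 'walkable'], num_merges=3)
print(merges)
print(segment('walked', merges))
```

With these three toy words, the shared prefix is merged step by step into one symbol, so an unseen word like "walked" is split into a known subword plus leftover characters.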