Airbnb scraper by APIFY
https://apify.com/dtrungtin/airbnb-scraper

## Dataset structure

This dataset has only a few columns because the scraper was run with the Excel-friendly output format. Changing that option presumably yields a richer, more complex dataset.

This dataset was extracted on **08-June-2023**, and the scrape period was **6 months**.

`dataset_airbnb-scraper_2023-06-08_14-57-41-189.json`

|Column|Type|Description|
|------|----|-----------|
|url|str|URL of the listing|
|name|str|Name of the listing|
|stars|float|Star rating of the listing|
|numberOfGuests|int|Maximum number of occupants|
|address|str|City, State, Country|
|roomType|str|Type of the listing (e.g., full apartment, house)|
|location|dict|`{lat, lng}`|
|reviews|list|Reviews of the listing|
|pricing|dict|Dictionary containing currency, rate, etc.|
|photos|list|URLs of the listing's photos|
|primaryHost|dict|Details of the host|
|additionalHosts|list|Details of any additional hosts|
|isHostedBySuperhost|bool|Is the host a Superhost?|
|isAvailable|bool|Is the listing available?|
|calendar|list|List of `{Available, Date}` pairs|
|occupancyPercentage|float|Percentage of occupancy in the specified period|

* Table compiled from the output of the `preliminar.ipynb` notebook.
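The nested `location` column can be split into flat `lat` and `lng` columns, which is handier for plotting or distance calculations. The following is a minimal sketch, assuming every `location` value is a dict with exactly those two keys, as the table describes. The two-row frame stands in for the real file; its values are copied from the first two listings in the dataset.

```python
import pandas as pd

# Small stand-in for the real dataset (values taken from the first two listings).
sample = pd.DataFrame({
    "numberOfGuests": [4, 3],
    "location": [
        {"lat": 6.204, "lng": -75.564},
        {"lat": 6.24723, "lng": -75.59545},
    ],
})

# json_normalize expands each dict into one column per key.
coords = pd.json_normalize(sample["location"].tolist())
coords.index = sample.index  # keep rows aligned with the original frame

# Replace the dict column with the two flat columns.
flat = sample.drop(columns=["location"]).join(coords)
print(flat.columns.tolist())
```

The same pattern should work for other dict-valued columns, although deeply nested ones such as `pricing` produce dotted column names.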

In [31]:
import pandas as pd

data = pd.read_json('dataset_airbnb-scraper_2023-06-08_14-57-41-189.json')
data.columns

Index(['url', 'name', 'stars', 'numberOfGuests', 'address', 'roomType',
       'location', 'reviews', 'pricing', 'photos', 'primaryHost',
       'additionalHosts', 'isHostedBySuperhost', 'isAvailable', 'calendar',
       'occupancyPercentage'],
      dtype='object')

In [44]:
df = data[['stars', 'numberOfGuests', 'roomType', 'location', 'pricing',
            'isHostedBySuperhost', 'occupancyPercentage']]
df

| |stars|numberOfGuests|roomType|location|pricing|isHostedBySuperhost|occupancyPercentage|
|-|-----|--------------|--------|--------|-------|-------------------|-------------------|
|0|4.88|4|Alojamiento entero: apto. residencial|`{'lat': 6.204, 'lng': -75.564}`|`{'rate': {'amount': 159, 'amountFormatted': '$...`|False|0.97|
|1||3|Alojamiento entero: apto. residencial|`{'lat': 6.24723, 'lng': -75.59545}`|`{'rate': {'amount': 237, 'amountFormatted': '$...`|False|3.40|
|2|4.69|5|Alojamiento entero: apartamento con servicios|`{'lat': 6.201, 'lng': -75.574}`|`{'rate': {'amount': 188, 'amountFormatted': '$...`|True|16.99|
|3|4.19|6|Alojamiento entero: piso|`{'lat': 6.21169, 'lng': -75.57166}`|`{'rate': {'amount': 180, 'amountFormatted': '$...`|False|3.40|
|4||2|Habitación privada en: bed and breakfast|`{'lat': 6.20033, 'lng': -75.56914}`|`{'rate': {'amount': 176, 'amountFormatted': '$...`|True|13.11|
|...|...|...|...|...|...|...|...|
|1234|4.96|5|Alojamiento entero: apto. residencial|`{'lat': 6.15719, 'lng': -75.60836}`|`{'rate': {'amount': 54, 'amountFormatted': '$5...`|True|21.36|
|1235|4.78|4|Alojamiento entero: piso|`{'lat': 6.19552, 'lng': -75.57851}`|`{'rate': {'amount': 54, 'amountFormatted': '$5...`|True|6.80|
|1236|4.50|6|Alojamiento entero: apartamento con servicios|`{'lat': 6.20688, 'lng': -75.56564}`|`{'rate': {'amount': 57, 'amountFormatted': '$5...`|False|22.33|
|1237|4.82|2|Alojamiento entero: piso|`{'lat': 6.20928, 'lng': -75.55877}`|`{'rate': {'amount': 50, 'amountFormatted': '$5...`|True|0.97|


In [45]:
df.pricing[0]

{'rate': {'amount': 159,
  'amountFormatted': '$159',
  'currency': 'USD',
  'isMicrosAccuracy': False},
 'rateType': 'nightly'}
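Since each `pricing` value is a nested dict like the one above, pulling fields out with plain indexing fails on any listing where a key is missing. Below is a hedged sketch of a more defensive helper. The name `flatten_pricing` is hypothetical, and it is only an assumption that some listings may lack `rate` or `rateType`.

```python
def flatten_pricing(pricing):
    """Return (amount, currency, rate_type) from one `pricing` value.

    Missing keys or non-dict inputs yield None instead of raising.
    """
    if not isinstance(pricing, dict):
        return (None, None, None)
    rate = pricing.get("rate") or {}
    return (rate.get("amount"), rate.get("currency"), pricing.get("rateType"))


# The pricing dict of the first listing, as printed above.
example = {
    "rate": {
        "amount": 159,
        "amountFormatted": "$159",
        "currency": "USD",
        "isMicrosAccuracy": False,
    },
    "rateType": "nightly",
}

print(flatten_pricing(example))  # (159, 'USD', 'nightly')
```

On the full frame, something like `data['pricing'].apply(flatten_pricing)` would return one tuple per listing.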

In [46]:
data.roomType.unique()

array(['Alojamiento entero: apto. residencial',
       'Alojamiento entero: apartamento con servicios',
       'Alojamiento entero: piso',
       'Habitación privada en: bed\xa0and\xa0breakfast',
       'Habitación en hotel boutique', 'Alojamiento entero: loft',
       'Alojamiento entero: vivienda', 'Alojamiento entero: casa rural',
       'Habitación en hotel', 'Habitación en apartahotel', 'Granja',
       'Habitación privada en: piso', 'Alojamiento entero: villa',
       'Alojamiento entero: vivienda vacacional',
       'Habitación privada en: vivienda',
       'Habitación en apartamento con servicios',
       'Alojamiento entero: adosado', 'Alojamiento entero',
       'Habitación privada en: cabaña en la naturaleza',
       'Alojamiento entero: cabaña',
       'Habitación privada en: apto. residencial',
       'Habitación privada en: apartamento con servicios',
       'Casa del árbol', 'Habitación privada'], dtype=object)
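One of the values above contains non-breaking spaces (`\xa0`), which makes exact string comparisons against `'bed and breakfast'` fail. A small sketch of one way to normalize them, using Unicode NFKC normalization from the standard library; the helper name `clean_room_type` is hypothetical, and NFKC also rewrites other compatibility characters, which may or may not be wanted here.

```python
import unicodedata


def clean_room_type(value):
    """Map compatibility characters such as U+00A0 to plain ones and trim whitespace."""
    return unicodedata.normalize("NFKC", value).strip()


raw = "Habitación privada en: bed\xa0and\xa0breakfast"
cleaned = clean_room_type(raw)
print(repr(cleaned))
```

Applied to the frame, this could look like `data['roomType'].map(clean_room_type)`.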

In [47]:
# Work on an independent copy so the assignments below do not trigger SettingWithCopyWarning
df = df.copy()

# Nightly price taken from the nested pricing dict
df['pricepernight'] = df['pricing'].apply(lambda x: x['rate']['amount'])
df = df.drop(columns=['pricing'])

# Coarse property type derived from keywords in roomType (first match wins)
df['propertyType'] = df.roomType.apply(
    lambda x: 'hotel' if 'hotel' in x.lower() else
              'habitacion' if 'habit' in x.lower() else
              'piso' if 'piso' in x.lower() else
              'piso' if 'loft' in x.lower() else
              'piso' if 'privada' in x.lower() else
              'piso' if 'apartamento' in x.lower() else
              'piso' if 'apto. residencial' in x.lower() else
              'casa' if 'casa' in x.lower() else
              'casa' if 'granja' in x.lower() else
              'casa' if 'villa' in x.lower() else
              'casa' if 'entero' in x.lower() and 'vivienda' in x.lower() else
              'casa' if 'entero' in x.lower() and 'alojamiento' in x.lower() else
              'casa' if 'adosado' in x.lower() else
              ''
)

# 1 for Superhost listings, 0 otherwise
df['superhost'] = df.isHostedBySuperhost.eq(True).astype(int)
df


| |stars|numberOfGuests|roomType|location|isHostedBySuperhost|occupancyPercentage|pricepernight|propertyType|
|-|-----|--------------|--------|--------|-------------------|-------------------|-------------|------------|
|0|4.88|4|Alojamiento entero: apto. residencial|`{'lat': 6.204, 'lng': -75.564}`|False|0.97|159|piso|
|1||3|Alojamiento entero: apto. residencial|`{'lat': 6.24723, 'lng': -75.59545}`|False|3.40|237|piso|
|2|4.69|5|Alojamiento entero: apartamento con servicios|`{'lat': 6.201, 'lng': -75.574}`|True|16.99|188|piso|
|3|4.19|6|Alojamiento entero: piso|`{'lat': 6.21169, 'lng': -75.57166}`|False|3.40|180|piso|
|4||2|Habitación privada en: bed and breakfast|`{'lat': 6.20033, 'lng': -75.56914}`|True|13.11|176|habitacion|
|...|...|...|...|...|...|...|...|...|
|1234|4.96|5|Alojamiento entero: apto. residencial|`{'lat': 6.15719, 'lng': -75.60836}`|True|21.36|54|piso|
|1235|4.78|4|Alojamiento entero: piso|`{'lat': 6.19552, 'lng': -75.57851}`|True|6.80|54|piso|
|1236|4.50|6|Alojamiento entero: apartamento con servicios|`{'lat': 6.20688, 'lng': -75.56564}`|False|22.33|57|piso|
|1237|4.82|2|Alojamiento entero: piso|`{'lat': 6.20928, 'lng': -75.55877}`|True|0.97|50|piso|
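The keyword-based `propertyType` mapping can also be written as an ordered rule list, which is easier to read and extend than a long chained conditional. This is only a sketch that tries to reproduce the same precedence as the notebook cell; the names `RULES` and `classify_room_type` are hypothetical.

```python
# Ordered (label, keywords) rules; the first rule with a matching keyword wins,
# mirroring the order of the chained conditional in the notebook.
RULES = [
    ("hotel", ["hotel"]),
    ("habitacion", ["habit"]),
    ("piso", ["piso", "loft", "privada", "apartamento", "apto. residencial"]),
    ("casa", ["casa", "granja", "villa"]),
]


def classify_room_type(room_type):
    """Map a raw roomType string to 'hotel', 'habitacion', 'piso', 'casa' or ''."""
    text = room_type.lower()
    for label, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return label
    # Entire-place listings not matched above fall back to 'casa'.
    if "entero" in text and ("vivienda" in text or "alojamiento" in text):
        return "casa"
    if "adosado" in text:
        return "casa"
    return ""


for value in ["Alojamiento entero: piso", "Habitación en apartahotel", "Granja"]:
    print(value, "->", classify_room_type(value))
```

Note that `'Habitación en apartahotel'` lands in `hotel` because the `hotel` rule is checked first, which matches the behaviour of the original expression. With pandas, the function could be applied as `df['roomType'].map(classify_room_type)`.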
