# Data Analysis Project


# _An Analysis of AirBnB Listings in London_

### _Contents_

- [Project Plan](#section-1)
    - [The Data](#section-1-1)
        - [Data Source](#section-1-1-1)
        - [Data Content](#section-1-1-2)
        - [Data Quality](#section-1-1-3)
    - [Project Aims and Objectives](#section-1-2)
        - [Objective 1: Predictors of Price and Review Score](#section-1-2-1)
        - [Objective 2: Mapping the Distributions of Listings](#section-1-2-2)
        - [Objective 3: The Language of Short-Term Rentals](#section-1-2-3)
    - [System Design](#section-1-3)
        - [Architecture](#section-1-3-1)
        - [Processing Modules and Algorithms](#section-1-3-2)
- [Program Code](#section-2)
    - [Importing Packages](#section-2-1)
    - [Data Engineering](#section-2-2)
        - [Raw Data Extraction](#section-2-2-1)
        - [Fixing Boolean Columns](#section-2-2-2)
        - [Fixing Cost Columns](#section-2-2-3)
    - [Data Enhancement](#section-2-3)
        - [Feature Selection](#section-2-3-1)
        - [Distance From Centre](#section-2-3-2)
        - [Price Per Bedroom](#section-2-3-3)
        - [Amenities Count](#section-2-3-4)
        - [Room Type To Entire Property](#section-2-3-5)
    - [Data Analysis](#section-2-4)
        - [Numerical Outliers](#section-2-4-1)
        - [Word Count Outliers](#section-2-4-2)
        - [Borough Aggregation](#section-2-4-3)
    - [Preparing Visualisations](#section-2-5)
        - [Sorted Correlation Heatmaps (Objective 1)](#section-2-5-1)
        - [Mapping with Style (Objective 2)](#section-2-5-2)
        - [Concentric Circles (Objective 2)](#section-2-5-3)
        - [Circle Map (Objective 2)](#section-2-5-4)
        - [Generating Word Clouds (Objective 3)](#section-2-5-5)
- [Project Outcome](#section-3)
    - [Overview of Results](#section-3-1)
    - [Objective 1: Predictors of Price and Review Score](#section-3-2)
        - [Explanation](#section-3-2-1)
        - [Visualisation](#section-3-2-2)
    - [Objective 2: Mapping the Distributions of Listings](#section-3-3)
        - [Explanation](#section-3-3-1)
        - [Visualisation](#section-3-3-2)
    - [Objective 3: The Language of Short-Term Rentals](#section-3-4)
        - [Explanation](#section-3-4-1)
        - [Visualisation](#section-3-4-2)
- [Conclusion and Presentation](#section-4)
    - [Achievements](#section-4-1)
    - [Limitations](#section-4-2)
    - [Future Work](#section-4-)

# <u>Project Plan</u> <a class="anchor" id="section-1"></a>

The popularity of AirBnB has transformed the landscape of short-term rentals, offering travelers accommodation and hosts an opportunity for supplementary income. This data science project delves into the dynamics of AirBnB listings in London.

## The Data <a class="anchor" id="section-1-1"></a>

### Data Source <a class="anchor" id="section-1-1-1"></a>

To identify a suitable dataset for this project, I initially searched through the Kaggle.com data catalogue. There, I discovered that a user had uploaded London AirBnB listing data extracted from the web on November 6, 2023. While this was promising, the static nature of the data presented a potential challenge.

Examining the source information provided by the user, I discovered that the London listings data is web-scraped on a monthly basis by a site called ___Inside AirBnB___: http://insideairbnb.com/get-the-data.

I have been working with the dataset obtained on September 6, 2023. However, my objective for this project is repeatability: one year from now, the code can be re-executed to assess the current state of the market and track its evolution.

### Data Content <a class="anchor" id="section-1-1-2"></a>

There are two datasets from ___Inside AirBnB___ that we are going to use:

___Listings Data___

URL: http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/data/listings.csv.gz

This CSV contains all the London listings as at 2023-09-06. I've prepared a dictionary of all the columns and their data types in a code cell below. However, we will only be including a subset of those in our analysis.

In the following table I highlight a few columns that will be the focus of the analysis:

| Column Name | Type | Description |
| --- | --- | --- |
| price | string | The cost per night of the listing |
| neighbourhood_cleansed | string | The London borough in which the listing sits |
| latitude, longitude | float | The exact coordinates of the listing |
| beds | integer | The number of beds on the listing |
| review_scores_rating | float | The overall rating score from the listing |
| review_scores_* | float | 6 columns breaking down the overall rating |
| description | string | The text entered by the host to describe and market the property |
| amenities | string | A list of the amenities on a property, read in as a string |
| calculated_host_listings_count | integer | Number of listings owned by the host. Calculated from this dataset |

<br>

___Neighbourhood geojson___

URL: http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/visualisations/neighbourhoods.geojson

This is a geojson file that contains the shapes of each of the 32 London boroughs.

| Column Name | Type | Description |
| --- | --- | --- |
| neighbourhood | string | Name of the Borough |
| neighbourhood_group | string | Always Null |
| geometry | MultiPolygon | The boundary of the borough to be used to draw the outline. |

<br>

_See below the full list of columns and data types._

In [None]:
column_data_types = {
    "id": "int64",
    "listing_url": "string",
    "scrape_id": "int64",
    "last_scraped": "datetime",
    "source": "string",
    "name": "string",
    "description": "string",
    "neighborhood_overview": "string",
    "picture_url": "string",
    "host_id": "int64",
    "host_url": "string",
    "host_name": "string",
    "host_since": "datetime",
    "host_location": "string",
    "host_about": "string",
    "host_response_time": "category",
    "host_response_rate": "string",
    "host_acceptance_rate": "string",
    "host_is_superhost": "string",
    "host_thumbnail_url": "string",
    "host_picture_url": "string",
    "host_neighbourhood": "category",
    "host_listings_count": "float64",
    "host_total_listings_count": "float64",
    "host_verifications": "string",
    "host_has_profile_pic": "string",
    "host_identity_verified": "string",
    "neighbourhood": "category",
    "neighbourhood_cleansed": "category",
    "neighbourhood_group_cleansed": "category",
    "latitude": "float64",
    "longitude": "float64",
    "property_type": "category",
    "room_type": "category",
    "accommodates": "int64",
    "bathrooms": "float64",
    "bathrooms_text": "string",
    "bedrooms": "float64",
    "beds": "float64",
    "amenities": "string",
    "price": "string",
    "minimum_nights": "int64",
    "maximum_nights": "int64",
    "minimum_minimum_nights": "float64",
    "maximum_minimum_nights": "float64",
    "minimum_maximum_nights": "float64",
    "maximum_maximum_nights": "float64",
    "minimum_nights_avg_ntm": "float64",
    "maximum_nights_avg_ntm": "float64",
    "calendar_updated": "string",
    "has_availability": "string",
    "availability_30": "int64",
    "availability_60": "int64",
    "availability_90": "int64",
    "availability_365": "int64",
    "calendar_last_scraped": "datetime",
    "number_of_reviews": "int64",
    "number_of_reviews_ltm": "int64",
    "number_of_reviews_l30d": "int64",
    "first_review": "datetime",
    "last_review": "datetime",
    "review_scores_rating": "float64",
    "review_scores_accuracy": "float64",
    "review_scores_cleanliness": "float64",
    "review_scores_checkin": "float64",
    "review_scores_communication": "float64",
    "review_scores_location": "float64",
    "review_scores_value": "float64",
    "license": "string",
    "instant_bookable": "string",
    "calculated_host_listings_count": "int64",
    "calculated_host_listings_count_entire_homes": "int64",
    "calculated_host_listings_count_private_rooms": "int64",
    "calculated_host_listings_count_shared_rooms": "int64",
    "reviews_per_month": "float64"
    }


### Data Quality <a class="anchor" id="section-1-1-3"></a>

The dataset is of particular interest for this type of project, as it exhibits both positive and negative aspects concerning its quality. In this project, I aim to harness the positive elements while mitigating the impact of the negatives:

___Positives___

- This data is highly comprehensive, well-populated and accurate. In fact, it appears to scrape all available information that can be discovered about a listing when exploring the Airbnb website.

- The source website routinely scrapes the data. If you set up a pipeline to monitor this, you could expand your dataset or keep it updated with each scrape.

- Investigating records within the dataset is easy, as a URL is provided for each listing. This also increases confidence in the data's reliability, as it's easy to verify correctness on a record-by-record basis.

- The dataset encompasses various types of data to explore, including unstructured text entries, location information, numerical values, and categorical data.

- Many fields, such as host response rate and review scores, are calculated by Airbnb, ensuring their reliability.


___Negatives___

- While the columns I'm interested in are well-populated, some columns in the dataset exhibit high null rates.

- Portions of the dataset rely on user input, which can introduce issues related to data entry errors and outliers.

- Some intriguing property details, like square footage, have been omitted from the dataset.

- The dataset contains redundancies, including numerous columns that won't be used in the analysis. I'll be removing these to reduce noise in the analysis.

## Project Aim and Objectives <a class="anchor" id="section-1-2"></a>

This project seeks to offer a comprehensive analysis of Airbnb listings, providing a holistic view of the short-term rental market in London. My goal is to equip both hosts and travellers with valuable insights, enabling them to make informed decisions in this dynamic marketplace.

The focus will be on two dimensions: Price and Review score. I will start with a general numerical analysis of factors influencing these dimensions and then dive deeper into the geographical and linguistic factors, using vivid and interesting visualisations to bring the data to life.

#### Objective 1: _Numerical Predictors of Price and Review Score_ <a class="anchor" id="section-1-2-1"></a>

This objective sets the foundation for our analyses. Using statistical techniques and feature selection, I aim to reveal how Price and Review Score correlate with a range of features generated from the data.

**Desired Output:** Heatmaps displaying the correlation of the dataset features with price and review score.

#### Objective 2: _Mapping the Distributions of Listings_ <a class="anchor" id="section-1-2-2"></a>

Utilising mapping and visualization, I will analyse the distribution of listings, discerning patterns in pricing and review scores across London. These insights will guide travelers and hosts in choosing the most suitable locations and properties.

**Desired Output:** Visualisations showing the difference in density, price and review scores across London boroughs and how these factors change as you move away from the centre.

#### Objective 3: _The Language of Short-Term Rentals_ <a class="anchor" id="section-1-2-3"></a>

Building upon our analysis of pricing and review scores, I explore linguistic strategies employed in marketing Airbnb listings. This objective aims to uncover the language in listings associated with high and low review scores, as well as those used to promote more expensive properties vs cheaper ones.

**Desired Output:** Word clouds categorised by price and review score demonstrating the language that correlates with those factors.

## System Design <a class="anchor" id="section-1-3"></a>

### Architecture <a class="anchor" id="section-1-3-1"></a>

The system design follows a common industry pattern with four stages:

1. Data Engineering, using an ELT (Extract, Load, Transform) pipeline. Unlike ETL, the idea is to load a raw cut of the dataset first and only transform it once it is stored on our machine. The transformations at this stage are limited to simple data fixes and cleaning of values.

2. Enhancement, using the existing features of the dataset to derive more features for use in the analyses.

3. Analysis, where outliers, null values and anomalies are handled. Here we will also aggregate and combine our two datasets ready for visualisation.

4. Visualisation, where I prepare the final outputs.

![image.png](architechture.png)

### Processing Modules and Algorithms <a class="anchor" id="section-1-3-2"></a>

The most significant tools I will be using throughout this project are as follows:

* pandas read and write functionality to capture and store the source data.
* Using DataFrame loc to conditionally slice a DataFrame, adjust values or remove outliers.
* The haversine formula, used to compute distance between two points on a globe.
* Aggregations such as count and mean across boroughs and merging with the borough dataset to map the resulting values.
* Lambda functions used to apply a small function to a whole DataFrame.
* HTML alignment used to have more control over the position and size of visualisations.

<br>

# <u>Program Code</u>  <a class="anchor" id="section-2"></a>

#### _Importing Packages_ <a class="anchor" id="section-2-1"></a>

It is standard practice to import packages at the top of your code to show what dependencies are required to run the file.

A couple of the packages we are going to import do not come as standard with Anaconda so I have provided a code snippet to pip install here.

In [None]:
%pip install geopandas
%pip install folium
%pip install wordcloud

I have provided some more detail about the use of each package.

In [None]:
# General purpose data analysis libraries.
import pandas as pd
import numpy as np
import math
import functools

# For statistical and numerical plotting (Objective 1)
import matplotlib.pyplot as plt
import seaborn as sns

# For plotting maps and handling spatial data (Objective 2)
import folium
from folium import plugins
from folium.plugins import FastMarkerCluster
import geopandas

# Used to generate Word Clouds and handle text data (Objective 3)
import string
from collections import Counter
from wordcloud import WordCloud

# Used to have more control when displaying visualisations 
from IPython.display import HTML

As a preference I set the max_columns option so that I can scroll across all the columns without them being truncated.

In [None]:
pd.options.display.max_columns = 500

<br>

## Data Engineering <a class="anchor" id="section-2-2"></a>

#### _Raw Data Extraction_ <a class="anchor" id="section-2-2-1"></a>

The goal for this section is to connect to our data source, pull in the required data and store a copy in our local files.

With the column_data_types dictionary I produced during my initial analysis, I can read in the raw dataset. First, though, I need to separate out the date columns, because pandas parses dates through a separate argument. This lets me pass the dictionary as the dtype argument and a list of date column names as the parse_dates argument.

In [None]:
print(f"Column Type Dict Length: {len(column_data_types)}")

date_columns = []
for col, dtype in column_data_types.items():
    if dtype == "datetime":
        date_columns.append(col)

for column in date_columns:
    column_data_types.pop(column)

print(f"New Column Type Dict Length: {len(column_data_types)}")
print(f"Date Column List Length: {len(date_columns)}")
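
The split matters because pandas expects date columns to be named in `parse_dates` rather than typed through `dtype`. A minimal sketch of the same pattern on a made-up three-column CSV (the sample rows and the `sample_*` names are illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Made-up sample in a similar shape to the listings file (illustrative only).
sample_csv = io.StringIO(
    'id,host_since,price\n'
    '1,2019-05-01,$42.00\n'
    '2,2021-11-15,"$1,250.00"\n'
)
sample_types = {"id": "int64", "host_since": "datetime", "price": "string"}

# Date columns go to parse_dates; everything else goes to dtype.
sample_dates = [col for col, t in sample_types.items() if t == "datetime"]
sample_dtypes = {col: t for col, t in sample_types.items() if t != "datetime"}

sample_df = pd.read_csv(sample_csv, dtype=sample_dtypes, parse_dates=sample_dates)
print(sample_df.dtypes)
```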

With these separated, we can use pandas to extract the data directly from the source. This code cell reads from the URL of the CSV so that the extraction is repeatable.

In [None]:
raw_df = pd.read_csv(
    "http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/data/listings.csv.gz", 
    dtype=column_data_types, 
    parse_dates=date_columns
    )

print(f"DataFrame Shape: {raw_df.shape}")

Now we have the data loaded with the expected number of records we can store it in a file. I am going to use parquet as storing in csv format can cause issues with our data types.

In [None]:
raw_df.to_parquet("listings.parquet")

I will do the same with the boroughs.geojson file.

In [None]:
boroughs = geopandas.read_file("http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/visualisations/neighbourhoods.geojson")
boroughs.to_file("boroughs.geojson", driver='GeoJSON')

So the E and L of the data pipeline are complete with the desired result: we can connect to the source and pull the two datasets into files on our machine. In a more scaled-up approach the data might go into an S3 bucket or a database, but local files are fine for our project.

Now we have a repeatable way of loading our 2 key datasets into the notebook with the following code.

In [None]:
listings_df = pd.read_parquet("listings.parquet")
boroughs_geo = geopandas.read_file("boroughs.geojson")

print(f"Listings Shape: {listings_df.shape}")
print(f"Boroughs Shape: {boroughs_geo.shape}")

<br>

#### _Fixing Boolean Columns_ <a class="anchor" id="section-2-2-2"></a>

This section fixes an issue in this dataset where columns that should be Boolean are set to "t" and "f" rather than True or False.

In [None]:
bool_columns = [
    'instant_bookable',
    'has_availability',
    'host_has_profile_pic',
    'host_identity_verified',
    'host_is_superhost'
]
display(listings_df[bool_columns].head())

See how the True and False values are stored as t and f in the dataset. pandas reads these as strings, which doesn't reflect the underlying value in the listing.

The aim is to change these values to 1 or 0 and cast them to float, which will allow them to be used for correlation later on.

In [None]:
for col in bool_columns:
    listings_df.loc[listings_df[col] == "t", col] = "1"
    listings_df.loc[listings_df[col] == "f", col] = "0"
    listings_df[col] = listings_df[col].astype(float)
display(listings_df[bool_columns].head())
print(f"Listings Shape: {listings_df.shape}")

See how now t has become 1.0 and f has become 0.0 which has resolved this data issue. 
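
An equivalent approach, shown here only as a sketch on a made-up Series, is to map both flags in one step with `Series.map`. Any value that is neither `t` nor `f`, including a missing one, becomes `NaN`.

```python
import pandas as pd

# Made-up flags column, including a missing value (illustrative only).
flags = pd.Series(["t", "f", None, "t"])

# Values not found in the dictionary, including None, map to NaN.
flags_numeric = flags.map({"t": 1.0, "f": 0.0}).astype(float)
print(flags_numeric.tolist())
```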

#### _Fixing Cost Columns_ <a class="anchor" id="section-2-2-3"></a>

Next we will look at an issue with the price column. We are looking at it as a list to make it easy to add future columns if they appear in the dataset. 

In [None]:
cost_columns = [
    "price"
]
display(listings_df[cost_columns+["listing_url"]].head(5))

Printing out this column, we see that the values contain $ and , characters, which prevents them from being cast to float and therefore prevents any numerical analysis.

The currency symbol is also incorrect. If you check these listings, the one at index 0, for example, has a price per night of £42, not $42.

In [None]:
for col in cost_columns:
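    # Raw string so that \$ is passed to the regex engine rather than treated as an invalid string escape
    listings_df[col] = listings_df[col].str.replace(r"\$|,", "", regex=True)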
    listings_df[col] = listings_df[col].astype(float)
display(listings_df[cost_columns+["listing_url"]].head(5))

Replacing and recasting has fixed this issue. Now price is a numerical value that we can analyse as such.
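
To see the cleaning step in isolation, here is a small standard-library sketch on made-up price strings. The helper name `parse_price` is hypothetical and not part of the notebook; the character class `[$,]` matches the same characters as the pattern used above.

```python
import re

def parse_price(text):
    """Strip dollar signs and thousands separators, then convert to float."""
    return float(re.sub(r"[$,]", "", text))

# Made-up examples in the same format as the price column.
for raw in ["$42.00", "$1,250.00"]:
    print(raw, "->", parse_price(raw))
```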

<br>

## Data Enhancement <a class="anchor" id="section-2-3"></a>

#### _Feature Selection_ <a class="anchor" id="section-2-3-1"></a>

Due to the breadth of this dataset, there are many columns which will not be of use during this project. To reduce the noise in the dataset, I will specify in a list the columns that I will use and leave the remaining ones out. However, if a requirement to use these columns comes up later, they can easily be added here.

In [None]:
select_columns = [
    # "id",
    "listing_url",
    # "scrape_id",
    # "source",
    "name",
    "description",
    # "neighborhood_overview",
    # "picture_url",
    # "host_id",
    # "host_url",
    # "host_name",
    # "host_location",
    # "host_about",
    "host_response_time",
    "host_response_rate",
    "host_acceptance_rate",
    "host_is_superhost",
    # "host_thumbnail_url",
    # "host_picture_url",
    # "host_neighbourhood",
    # "host_listings_count",
    "host_total_listings_count",
    # "host_verifications",
    "host_has_profile_pic",
    "host_identity_verified",
    # "neighbourhood",
    "neighbourhood_cleansed",
    # "neighbourhood_group_cleansed",
    "latitude",
    "longitude",
    "property_type",
    "room_type",
    # "accommodates",
    # "bathrooms",
    "bathrooms_text",
    # "bedrooms",
    "beds",
    "amenities",
    "price",
    # "minimum_nights",
    # "maximum_nights",
    # "minimum_minimum_nights",
    # "maximum_minimum_nights",
    # "minimum_maximum_nights",
    # "maximum_maximum_nights",
    # "minimum_nights_avg_ntm",
    # "maximum_nights_avg_ntm",
    # "calendar_updated",
    # "has_availability",
    # "availability_30",
    # "availability_60",
    # "availability_90",
    # "availability_365",
    "number_of_reviews",
    "number_of_reviews_ltm",
    # "number_of_reviews_l30d",
    "review_scores_rating",
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value",
    # "license",
    "instant_bookable",
    "calculated_host_listings_count",
    # "calculated_host_listings_count_entire_homes",
    # "calculated_host_listings_count_private_rooms",
    # "calculated_host_listings_count_shared_rooms",
    "reviews_per_month"
]
# .copy() avoids SettingWithCopyWarning when new columns are added later
listings_df = listings_df[select_columns].copy()
listings_df.head()

With this feature set as a baseline we can now start to think about enhancement.

<br>

#### _Distance From the Centre_ <a class="anchor" id="section-2-3-2"></a>

The Listings DataFrame contains coordinates for each listing.

In [None]:
listings_df[["listing_url", "latitude", "longitude"]].head(5)

This information is useful on its own, but it would be good to have a single data point that measures how close the listing is to the centre of London. This means we need to choose a centre point. This is a question with seemingly no right answer, but after some research I found this article, which gives coordinates for the exact centre of London based on geographic measurements:

https://www.standard.co.uk/news/london/london-s-real-centre-point-is-next-to-bench-on-the-victoria-embankment-by-the-thames-9381800.html

The article gives these coordinates: 51°30’37.6”N 0°06’56.3”W

<iframe src="https://www.google.com/maps/embed?pb=!1m17!1m12!1m3!1d2483.151070786313!2d-0.11821918763198132!3d51.51044437169647!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m2!1m1!2zNTHCsDMwJzM3LjYiTiAwwrAwNic1Ni4zIlc!5e0!3m2!1sen!2suk!4v1695904951512!5m2!1sen!2suk" width="400" height="300" style="border:0;" allowfullscreen="" loading="lazy" referrerpolicy="no-referrer-when-downgrade"></iframe>

Google Maps converts these degrees-minutes-seconds coordinates to decimal latitude and longitude. We can use this to get a location for the centre of London to use throughout our analysis.

The lat-long given by Google is 51.510444, -0.115639

In [None]:
CENTRE_OF_LONDON = (51.510444, -0.115639)
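
As a cross-check, the conversion from degrees, minutes and seconds to decimal degrees can be reproduced by hand. This is a minimal sketch; `dms_to_decimal` is a hypothetical helper and not part of the notebook.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value

# The coordinates quoted in the article: 51°30'37.6"N 0°06'56.3"W
lat = dms_to_decimal(51, 30, 37.6, "N")
lon = dms_to_decimal(0, 6, 56.3, "W")
print(round(lat, 6), round(lon, 6))
```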

Now we have a central location to measure from we need to write a function that can compute the distance between two coordinate points. 

Since we are dealing with a globe, a simple Pythagorean calculation on the raw coordinates would be inaccurate, even over a small region. So, we need to calculate the distance between two points on a sphere, which requires the haversine formula:

- https://www.educative.io/answers/how-to-calculate-distance-using-the-haversine-formula
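
For reference, the formula gives the great-circle distance $d$ between two points with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ (all in radians) on a sphere of radius $R$:

$$a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)$$

$$d = 2R \, \operatorname{atan2}\!\left(\sqrt{a}, \sqrt{1 - a}\right)$$

Treating the Earth as a perfect sphere is itself an approximation, which is acceptable at the scale of a single city.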

Here is my basic implementation:

In [None]:
def haversine(point_a, point_b):
    # Unpacking Lat and Longs
    lat_a, lon_a = point_a
    lat_b, lon_b = point_b
    R = 6371  # radius of earth in km
    lat_delta = math.radians(lat_b - lat_a)
    lon_delta = math.radians(lon_b - lon_a)
    a = (math.sin(lat_delta / 2) ** 2) + math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b)) * math.sin(lon_delta / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c  # distance in km
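
A quick way to sanity-check an implementation like this is a case with a known answer: two points on the same meridian, one degree of latitude apart, should be separated by R times one degree in radians, roughly 111.2 km for R = 6371 km. The sketch below repeats the function so that it runs on its own; the test points are made up.

```python
import math

def haversine(point_a, point_b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat_a, lon_a = point_a
    lat_b, lon_b = point_b
    R = 6371  # radius of earth in km
    lat_delta = math.radians(lat_b - lat_a)
    lon_delta = math.radians(lon_b - lon_a)
    a = (math.sin(lat_delta / 2) ** 2
         + math.cos(math.radians(lat_a)) * math.cos(math.radians(lat_b))
         * math.sin(lon_delta / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c

# One degree of latitude along the same meridian.
one_degree_km = haversine((51.0, 0.0), (52.0, 0.0))
# Expected arc length: radius times the angle in radians.
expected_km = 6371 * math.radians(1)
print(round(one_degree_km, 2), round(expected_km, 2))
```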

We can apply this function to our full dataset to compare the distance of each listing to the centre. 

In [None]:
listings_df["distance_from_centre_km"] = listings_df[["latitude", "longitude"]].apply(
    lambda row: haversine(CENTRE_OF_LONDON, (row['latitude'], row['longitude'])), 
    axis=1
)
listings_df[["listing_url", "latitude", "longitude", "distance_from_centre_km"]].head()
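
Row-wise `apply` calls the Python function once per listing. If that becomes slow on the full dataset, the same formula can be evaluated on whole arrays with NumPy. The following is a sketch rather than the method used in this notebook; `haversine_vectorised` and the sample coordinates are hypothetical.

```python
import numpy as np

def haversine_vectorised(centre, lats, lons, radius_km=6371):
    """Distance in km from one (lat, lon) centre to arrays of points in degrees."""
    lat_c, lon_c = np.radians(centre)
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    a = (np.sin((lats - lat_c) / 2) ** 2
         + np.cos(lat_c) * np.cos(lats) * np.sin((lons - lon_c) / 2) ** 2)
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

centre = (51.510444, -0.115639)
# Made-up points: the centre itself and a point one degree further north.
sample_lats = [51.510444, 52.510444]
sample_lons = [-0.115639, -0.115639]
distances = haversine_vectorised(centre, sample_lats, sample_lons)
print(distances)
```

In the notebook this could be called with `listings_df["latitude"]` and `listings_df["longitude"]` as the two array arguments.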

One more interesting thing to add would be distance from centre on the North/South axis and East/West axis. For our purposes, having these in radians is fine as we will only be using them for correlations as opposed to precise mapping.

In [None]:
listings_df["distance_from_centre_N"] = listings_df["latitude"] - CENTRE_OF_LONDON[0]
listings_df["distance_from_centre_S"] = CENTRE_OF_LONDON[0] - listings_df["latitude"]
listings_df["distance_from_centre_E"] = listings_df["longitude"] - CENTRE_OF_LONDON[1]
listings_df["distance_from_centre_W"] = CENTRE_OF_LONDON[1] - listings_df["longitude"]

compass_differences = ["distance_from_centre_N", "distance_from_centre_S", "distance_from_centre_E", "distance_from_centre_W"]

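for col in compass_differences:
    # Keep only the positive direction (np.NaN was removed in NumPy 2.0, so use np.nan)
    listings_df.loc[listings_df[col] < 0, col] = np.nan
    # Assign the result back, otherwise the conversion to radians is discarded
    listings_df[col] = np.radians(listings_df[col].astype(float))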

We have now added 5 more features to our dataset which will give us more ways to analyse the listings based on their location.

<br>

#### _Price Per Bedroom_ <a class="anchor" id="section-2-3-3"></a>

An AirBnB listing displays the price for renting per night, and this is reflected in our dataset. On its own this is misleading, as a property with a single bed might cost the same as a property with 10 beds. The price column would suggest that these properties are equivalent, hiding the fact that one could house many more guests.

Thus, I will create a new price_per_bed column, giving a price feature that is less dependent on the number of beds.

In [None]:
listings_df["price_per_bed"] = listings_df["price"]/listings_df["beds"]
listings_df[["price_per_bed", "price", "beds"]].head()

Now we have a price per night per bed metric which will give a better representation of the price of a listing.

#### _Amenities Count_ <a class="anchor" id="section-2-3-4"></a>

At the moment it is hard to gain any meaningful insight from the amenities column, as it's just a list of the names of the amenities on the listing. We can use string splitting to give us another numerical column, the number of amenities, which can be used for the correlation heatmap.

In [None]:
listings_df["amenities_count"] = listings_df["amenities"].apply(lambda x: len(list(x.split(", "))))

pd.options.display.max_colwidth = 300
display(listings_df[["listing_url", "amenities", "amenities_count"]].head())
pd.reset_option('display.max_colwidth')
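
Splitting on `", "` counts an empty list `[]` as one amenity and would miscount any amenity whose name itself contains a comma followed by a space. If the column holds a JSON-style array of strings (an assumption about the file format that should be checked against the data), parsing it gives an exact count. A sketch on made-up values; `count_amenities` is a hypothetical helper.

```python
import json

def count_amenities(raw):
    """Count the entries in a JSON-style list stored as a string."""
    return len(json.loads(raw))

# Made-up amenities strings (illustrative only).
samples = ['["Wifi", "Kitchen", "Heating"]', '["TV, with cable"]', "[]"]
for raw in samples:
    # Compare the parsed count with the count from simple string splitting.
    print(count_amenities(raw), len(raw.split(", ")))
```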

#### _Room Type to Entire Property_ <a class="anchor" id="section-2-3-5"></a>

Doing some further checks on room type can allow us to infer another boolean field: Whether the listing is for the entire property or not.

In [None]:
listings_df["room_type"].value_counts()

Examining these room types, it's easy to see that a listing is either an entire property or some type of single room: a room, a shared room or a hotel room. It will be interesting to turn this into a numerical representation.