# Data Analysis Project


# _An Analysis of AirBnB Listings in London_

### _Contents_

- [Project Plan](#section-1)
    - [The Data](#section-1-1)
        - [Data Source](#section-1-1-1)
        - [Data Content](#section-1-1-2)
        - [Data Quality](#section-1-1-3)
    - [Project Aims and Objectives](#section-1-2)
        - [Objective 1: Predictors of Price and Review Score](#section-1-2-1)
        - [Objective 2: Mapping the Distributions of Listings](#section-1-2-2)
        - [Objective 3: The Language of Short-Term Rentals](#section-1-2-3)
    - [System Design](#section-1-3)
        - [Architecture](#section-1-3-1)
        - [Processing Modules and Algorithms](#section-1-3-2)
- [Program Code](#section-2)
    - [Importing Packages](#section-2-1)
    - [Data Engineering](#section-2-2)
        - [Raw Data Extraction](#section-2-2-1)
        - [Fixing Boolean Columns](#section-2-2-2)
        - [Fixing Cost Columns](#section-2-2-3)
    - [Data Enhancement](#section-2-3)
        - [Feature Selection](#section-2-3-1)
        - [Distance From Centre](#section-2-3-2)
        - [Price Per Bedroom](#section-2-3-3)
        - [Amenities Count](#section-2-3-4)
        - [Room Type To Entire Property](#section-2-3-5)
    - [Data Analysis](#section-2-4)
        - [Numerical Outliers](#section-2-4-1)
        - [Word Count Outliers](#section-2-4-2)
        - [Borough Aggregation](#section-2-4-3)
    - [Preparing Visualisations](#section-2-5)
        - [Sorted Correlation Heatmaps (Objective 1)](#section-2-5-1)
        - [Mapping with Style (Objective 2)](#section-2-5-2)
        - [Concentric Circles (Objective 2)](#section-2-5-3)
        - [Circle Map (Objective 2)](#section-2-5-4)
        - [Generating Word Clouds (Objective 3)](#section-2-5-5)
- [Project Outcome](#section-3)
    - [Overview of Results](#section-3-1)
    - [Objective 1: Predictors of Price and Review Score](#section-3-2)
        - [Explanation](#section-3-2-1)
        - [Visualisation](#section-3-2-2)
    - [Objective 2: Mapping the Distributions of Listings](#section-3-3)
        - [Explanation](#section-3-3-1)
        - [Visualisation](#section-3-3-2)
    - [Objective 3: The Language of Short-Term Rentals](#section-3-4)
        - [Explanation](#section-3-4-1)
        - [Visualisation](#section-3-4-2)
- [Conclusion and Presentation](#section-4)
    - [Achievements](#section-4-1)
    - [Limitations](#section-4-2)
    - [Future Work](#section-4-)

# <u>Project Plan</u> <a class="anchor" id="section-1"></a>

The popularity of AirBnB has transformed the short-term rental market, offering travellers accommodation and hosts a source of supplementary income. This data science project examines AirBnB listings in London: what predicts price and review score, how listings are distributed across the city, and the language hosts use to describe their properties.

## The Data <a class="anchor" id="section-1-1"></a>

### Data Source <a class="anchor" id="section-1-1-1"></a>

To identify a suitable dataset for this project, I initially searched the Kaggle.com data catalogue. There, I discovered that a user had uploaded London AirBnB listing data extracted from the web on November 6, 2023. While this was promising, a one-off upload is static: it cannot be refreshed to repeat the analysis at a later date.

Examining the source information provided by the user, I discovered that the London listings data is web-scraped on a monthly basis by a site called ___Inside AirBnB___: http://insideairbnb.com/get-the-data.

I have been working with the snapshot dated September 6, 2023. However, the project is designed to be repeatable: the code can be re-run against a later snapshot, for example one year from now, to assess the current state of the market and track how it has evolved.
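
One way to support that repeatability is to derive the download URLs from the snapshot date rather than hard-coding them. The sketch below is only an illustration: `snapshot_urls` is a hypothetical helper, and it assumes that future ___Inside AirBnB___ snapshots keep the same URL layout as the 2023-09-06 files used in this project.

```python
from datetime import date

# Base path taken from the listings URL used in this project. Assumption:
# later snapshots are published under the same layout.
BASE_URL = "http://data.insideairbnb.com/united-kingdom/england/london"


def snapshot_urls(snapshot: date) -> dict[str, str]:
    """Build the listings and neighbourhood URLs for one snapshot date."""
    stamp = snapshot.isoformat()  # formats the date as YYYY-MM-DD
    return {
        "listings": f"{BASE_URL}/{stamp}/data/listings.csv.gz",
        "neighbourhoods": f"{BASE_URL}/{stamp}/visualisations/neighbourhoods.geojson",
    }


urls = snapshot_urls(date(2023, 9, 6))
print(urls["listings"])
print(urls["neighbourhoods"])
```

Re-running the analysis on a newer snapshot would then only require changing the date passed to the helper, provided a snapshot was actually published on that date.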

### Data Content <a class="anchor" id="section-1-1-2"></a>

We will use two datasets from ___Inside AirBnB___:

___Listings Data___

URL: http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/data/listings.csv.gz

This CSV contains all the London listings as at 2023-09-06. I've prepared a dictionary of all the columns and their data types in a code cell below. However, only a subset of those columns is included in the analysis.

In the following table I highlight a few columns that will be the focus of the analysis:

| Column Name | Type | Description |
| --- | --- | --- |
| price | string | The cost per night of the listing |
| neighbourhood_cleansed | string | The London borough in which the listing sits |
| latitude, longitude | float | The exact coordinates of the listing |
| beds | float | The number of beds in the listing |
| review_scores_rating | float | The overall rating score of the listing |
| review_scores_* | float | 6 columns breaking down the overall rating |
| description | string | The text entered by the host to describe and market the property |
| amenities | string | A list of the property's amenities, read in as a single string |
| calculated_host_listings_count | integer | Number of listings owned by the host. Calculated from this dataset |
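
Two of these columns arrive as strings but carry structured values, so they need converting before analysis. The following is a minimal sketch, not the cleaning code used later in this notebook. It assumes that prices look like `"$1,250.00"` and that amenities are stored as a JSON-style list; the sample values and the helper names `parse_price` and `parse_amenities` are made up for illustration.

```python
import json


def parse_price(raw: str) -> float:
    """Convert a price string such as "$1,250.00" into a float."""
    # Assumption: a leading dollar sign and comma thousands separators.
    return float(raw.replace("$", "").replace(",", "").strip())


def parse_amenities(raw: str) -> list[str]:
    """Convert a JSON-style list stored as text into a Python list."""
    return json.loads(raw)


# Made-up sample values in the assumed formats.
price = parse_price("$1,250.00")
amenities = parse_amenities('["Wifi", "Kitchen", "Washer"]')

print(price)
print(len(amenities))
```

Counting the parsed list gives a simple numeric feature for each listing, which is the idea behind the amenities count derived later in the project.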

<br>

___Neighbourhood geojson___

URL: http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/visualisations/neighbourhoods.geojson

This GeoJSON file contains the boundary shape of each of the 32 London boroughs.

| Column Name | Type | Description |
| --- | --- | --- |
| neighbourhood | string | Name of the Borough |
| neighbourhood_group | string | Always Null |
| geometry | MultiPolygon | The boundary of the borough to be used to draw the outline. |
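
To make the structure in the table concrete, the sketch below parses a small hand-written GeoJSON sample using only the standard library. The sample is an assumption about the file's shape based on the columns above; the coordinates are invented and do not describe a real borough boundary.

```python
import json

# Hand-written sample mirroring the three columns above. The coordinates are
# made up for illustration only.
sample = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"neighbourhood": "Camden", "neighbourhood_group": null},
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [[[[-0.2, 51.5], [-0.1, 51.5], [-0.1, 51.6], [-0.2, 51.5]]]]
      }
    }
  ]
}
"""

collection = json.loads(sample)

# Map each borough name to its boundary geometry.
boroughs = {
    feature["properties"]["neighbourhood"]: feature["geometry"]
    for feature in collection["features"]
}

print(sorted(boroughs))
print(boroughs["Camden"]["type"])
```

The `neighbourhood` value is what links each boundary to the `neighbourhood_cleansed` column of the listings data when aggregating by borough.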

<br>

_See below the full list of columns and data types._

In [None]:
column_data_types = {
    "id": "int64",
    "listing_url": "string",
    "scrape_id": "int64",
    "last_scraped": "datetime",
    "source": "string",
    "name": "string",
    "description": "string",
    "neighborhood_overview": "string",
    "picture_url": "string",
    "host_id": "int64",
    "host_url": "string",
    "host_name": "string",
    "host_since": "datetime",
    "host_location": "string",
    "host_about": "string",
    "host_response_time": "category",
    "host_response_rate": "string",
    "host_acceptance_rate": "string",
    "host_is_superhost": "string",
    "host_thumbnail_url": "string",
    "host_picture_url": "string",
    "host_neighbourhood": "category",
    "host_listings_count": "float64",
    "host_total_listings_count": "float64",
    "host_verifications": "string",
    "host_has_profile_pic": "string",
    "host_identity_verified": "string",
    "neighbourhood": "category",
    "neighbourhood_cleansed": "category",
    "neighbourhood_group_cleansed": "category",
    "latitude": "float64",
    "longitude": "float64",
    "property_type": "category",
    "room_type": "category",
    "accommodates": "int64",
    "bathrooms": "float64",
    "bathrooms_text": "string",
    "bedrooms": "float64",
    "beds": "float64",
    "amenities": "string",
    "price": "string",
    "minimum_nights": "int64",
    "maximum_nights": "int64",
    "minimum_minimum_nights": "float64",
    "maximum_minimum_nights": "float64",
    "minimum_maximum_nights": "float64",
    "maximum_maximum_nights": "float64",
    "minimum_nights_avg_ntm": "float64",
    "maximum_nights_avg_ntm": "float64",
    "calendar_updated": "string",
    "has_availability": "string",
    "availability_30": "int64",
    "availability_60": "int64",
    "availability_90": "int64",
    "availability_365": "int64",
    "calendar_last_scraped": "datetime",
    "number_of_reviews": "int64",
    "number_of_reviews_ltm": "int64",
    "number_of_reviews_l30d": "int64",
    "first_review": "datetime",
    "last_review": "datetime",
    "review_scores_rating": "float64",
    "review_scores_accuracy": "float64",
    "review_scores_cleanliness": "float64",
    "review_scores_checkin": "float64",
    "review_scores_communication": "float64",
    "review_scores_location": "float64",
    "review_scores_value": "float64",
    "license": "string",
    "instant_bookable": "string",
    "calculated_host_listings_count": "int64",
    "calculated_host_listings_count_entire_homes": "int64",
    "calculated_host_listings_count_private_rooms": "int64",
    "calculated_host_listings_count_shared_rooms": "int64",
    "reviews_per_month": "float64"
    }
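
Note that `"datetime"` is not a dtype string that pandas accepts, so the dictionary cannot be passed to a CSV reader unchanged. The sketch below shows one possible way to split it into real dtypes and date columns. It is an assumption about how the dictionary might be consumed, not necessarily how the extraction code in this notebook does it, and `split_column_types` is a hypothetical helper. A short excerpt of the dictionary is repeated so the example runs on its own.

```python
# Excerpt of column_data_types, repeated here so the sketch is self-contained.
column_data_types = {
    "id": "int64",
    "last_scraped": "datetime",
    "host_since": "datetime",
    "room_type": "category",
    "price": "string",
}


def split_column_types(spec: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Separate real pandas dtypes from columns that need date parsing."""
    dtypes = {col: kind for col, kind in spec.items() if kind != "datetime"}
    date_columns = [col for col, kind in spec.items() if kind == "datetime"]
    return dtypes, date_columns


dtypes, date_columns = split_column_types(column_data_types)
print(dtypes)
print(date_columns)
```

The two results could then be handed to pandas as `pd.read_csv(url, dtype=dtypes, parse_dates=date_columns)`, which by default infers gzip compression from the `.gz` extension.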


### Data Quality <a class="anchor" id="section-1-1-3"></a>