In [1]:
import pandas as pd
import numpy as np
import abydos
from abydos import distance

# General Instruction

- Cleanup dataset based on the information that is given:
You need to clean the dataset according to the information that is given to you. This means that there are problems with the dataset that need to be fixed, and you should use the information given to you to determine what those problems are and how to fix them.

- Each case has different data quality problems; hints and additional information are provided to help you understand each problem:
Each row in the dataset may have different data quality problems. There will be hints and additional information provided to help you understand what the specific problem is with each row.

- You can do any approach on cleaning the data, but you should clean the instructed column only:
You have the freedom to use any approach to clean the data, but you should only clean the instructed column. This means that you should not modify any other columns in the dataset, or add or remove any rows.

- Do not create or remove any column, and do not create or remove any row:
You are not allowed to create new columns or remove any columns from the dataset. You are also not allowed to add or remove any rows.

- Each column has a flag column named <column\_name>\_flag. Use it to flag a row that should not be passed unchanged to the downstream task. There are three categories:
safe_flag (0): this row is safe to use in downstream tasks
delete_flag (1): this row should be deleted and not used in downstream tasks
null_flag (2): this row can be included in downstream tasks but with null treatment.
You can also add a new category, but you need to provide a justification and an explanation for it. Note that the completeness of the dataset also matters, so try not to flag too many rows, and do your best to clean the values instead.

- For each data cleaning task, we have provided a function that represents the goal of the cleaning. For example, clean_duplicate_id(df) is the function for removing duplicate ID values. These functions take a DataFrame as input and return the cleaned version of the DataFrame.

    In each chunk of data cleaning task, you will see the following three parts:

    1. The clean_<name> function that performs the specific cleaning task.
    2. The execution of the cleaning function on the DataFrame.
    3. A checking part to help you evaluate the effectiveness of the cleaning.
    
  While you can create new cells and add additional code, the cleaning must be performed through the provided cleaning functions. You can adjust the order of the cleaning steps, but please try to move the whole chunks of code to avoid any errors.

The cleaning task is considered complete if this notebook can be run sequentially, from top to bottom, by executing "Restart and Run All".
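As a rough illustration of the three flag values described above, the sketch below shows one way a downstream step might consume a flag column. The toy DataFrame and the names SAFE_FLAG, DELETE_FLAG, NULL_FLAG and downstream are assumptions made for this example; the instructions do not specify how the downstream task reads the flags.

```python
import pandas as pd

# Flag values as described in the instructions (constant names are illustrative).
SAFE_FLAG, DELETE_FLAG, NULL_FLAG = 0, 1, 2

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "latitude": [41.9, 999.0, None],
    "latitude_flag": [SAFE_FLAG, DELETE_FLAG, NULL_FLAG],
})

# The cleaned dataset keeps every row; only the downstream view drops delete-flagged rows.
downstream = toy[toy["latitude_flag"] != DELETE_FLAG].copy()

# Null-flagged rows stay in the view, with the value treated as missing.
downstream.loc[downstream["latitude_flag"] == NULL_FLAG, "latitude"] = float("nan")

print(downstream)
```

The original frame is left untouched, which matches the rule that rows are flagged rather than removed.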




# Purpose
The purpose of this dataset is to conduct exploratory analysis of the listings and create a prediction model for listing price using some columns from the dataset. This means that the dataset is intended to be used to explore the characteristics and features of the listings, and to build a model that can predict the price of a listing based on certain variables in the dataset. The goal is to gain insights into the factors that influence the price of a listing and to develop a model that can accurately predict listing prices based on those factors.

# Columns and Dataset Description
- id: a unique identifier for each listing.
- name: the name or title of the listing, as provided by the host.
- host_id: a unique identifier for each host.
- host_name: the name of the host who listed the property.
- neighbourhood_group: the larger geographic area in which the listing is located (e.g. a borough or group of neighborhoods).
- neighbourhood: the specific neighborhood in which the listing is located.
- latitude: the latitude coordinate of the listing.
- longitude: the longitude coordinate of the listing.
- room_type: the type of space that is being listed (e.g. an entire apartment, a private room, a shared room).
- price: the nightly price of the listing, in the currency specified in the dataset.
- minimum_nights: the minimum number of nights that a guest must book the listing for.
- number_of_reviews: the total number of reviews that the listing has received.
- last_review: the date of the most recent review of the listing.
- reviews_per_month: the average number of reviews per month that the listing has received.
- calculated_host_listings_count: the total number of listings that the host has on Airbnb.
- availability_365: the number of days per year that the listing is available for booking.
- number_of_reviews_ltm: the total number of reviews that the listing has received in the last 12 months.
- license: a license number for the listing, if applicable (this column may not be present in all versions of the dataset).

Besides the columns above, there are columns pre-defined for flagging the rows based on particular data cleaning context:
- id_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the id column (duplicate).
- host_id_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the host_id column.
- neighbourhood_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the neighbourhood column.
- latitude_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the latitude column.
- longitude_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the longitude column.
- minimum_nights_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the minimum_nights column.
- number_of_reviews_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the number_of_reviews column.
- last_review_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the last_review column.
- room_type_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the room_type column.

# Load Data

In [2]:
airbnb_pd = pd.read_csv("chicago_vert_1_a.csv")

In [3]:
airbnb_pd.columns

Index(['Unnamed: 0', 'id', 'listing_url', 'scrape_id', 'last_scraped',
       'source', 'name', 'description', 'neighborhood_overview', 'picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights

In [4]:
airbnb_pd

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,reviews_per_month,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
0,7742,25879,https://www.airbnb.com/rooms/25879,20230319041143,2023-03-19,city scrape,2/1 One Block to Fullerton L Red Line Deck & ...,"Terrific 2 bed, 1 bath unit in an older brick ...","Safe, tree-lined neighborhood with easy beach ...",https://a0.muscache.com/pictures/730683/6335b6...,...,0.33,0,0,0,0,0,0,0,0,0
1,7744,37738,https://www.airbnb.com/rooms/37738,20230319041143,2023-03-19,previous scrape,Andersonville - Perfect location!,Comfort and style. Recently named as one of th...,We love this neighborhood and hope you will to...,https://a0.muscache.com/pictures/434306/c7f339...,...,1.62,0,0,0,0,0,0,0,0,0
2,7652,189821,https://www.airbnb.com/rooms/189821,20230319041143,2023-03-19,city scrape,"Best in Chicago, private, amazing garden space",We offer the highest standards of cleanliness....,Enjoy the unique green spaces at the heart of ...,https://a0.muscache.com/pictures/e6324b08-d6c8...,...,4.27,0,0,0,0,0,0,0,0,0
3,7653,207218,https://www.airbnb.com/rooms/207218,20230319041143,2023-03-19,city scrape,Historic Pullman Artist Flat - Artists & Explo...,Charming original 1880's Pullman workers flat ...,The entire town of Pullman was built in the 18...,https://a0.muscache.com/pictures/102976353/ff2...,...,2.13,0,0,0,0,0,0,0,0,0
4,7655,220333,https://www.airbnb.com/rooms/220333,20230319041143,2023-03-19,city scrape,Pullman School House Apartment - monthly rental,An unusual MONTHLY rental in the heart of Pull...,The entire neighborhood is part of the Pullman...,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.01,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349,7029,781186064034113431,https://www.airbnb.com/rooms/781186064034113431,20230319041143,2023-03-19,city scrape,2 Queen Bed Room,"Our hotel, located conveniently close to some ...",,https://a0.muscache.com/pictures/miso/Hosting-...,...,7.08,0,0,0,0,0,0,0,0,0
1350,7040,782373284583992853,https://www.airbnb.com/rooms/782373284583992853,20230319041143,2023-03-19,city scrape,"Logan Square 1br w/ pool, lounge & gym nr L",Show up and start living from day one in Chica...,This furnished apartment is located in the Log...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0
1351,7043,782476073065423478,https://www.airbnb.com/rooms/782476073065423478,20230319041143,2023-03-19,city scrape,Affordable apt near Wrigley,Charming unit located to the North of Boystown...,,https://a0.muscache.com/pictures/00ab76f4-3332...,...,1.96,0,0,0,0,0,0,0,0,0
1352,7054,783249596739848974,https://www.airbnb.com/rooms/783249596739848974,20230319041143,2023-03-19,city scrape,"South Loop 1br w/ gym & lounge, nr Grant Park","Discover the best of Chicago, with this one-be...",This furnished apartment is located in South L...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0


# cleanup duplicate id
The ID column must contain unique values. If there are any duplicate values in this column, you will need to take action to ensure that each ID is unique. You can do this by either fixing the duplicates (if you want to keep them) or by flagging them for removal (1) using the id_flag column.

In [5]:
def clean_duplicate_id(df):
    # duplicated() marks the second and later occurrences of an id as True;
    # the first occurrence stays safe (0) and the repeats are flagged for deletion (1).
    df['id_flag'] = df.duplicated(['id']).astype(int)
    return df
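The function above relies on which copies duplicated() marks. A small sketch on toy ids (not taken from the dataset) shows the difference between the default keep='first' and keep=False; the variable names are illustrative only.

```python
import pandas as pd

# Toy ids: the value 10 appears three times.
ids = pd.Series([10, 20, 10, 30, 10])

# Default keep='first': the first occurrence is kept, later copies are marked.
later_copies = ids.duplicated()

# keep=False: every member of a duplicated group is marked.
all_copies = ids.duplicated(keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```

With the default, one row per id survives unflagged, which keeps more of the dataset than flagging every copy.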

In [6]:
airbnb_pd = clean_duplicate_id(airbnb_pd)
airbnb_pd

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,reviews_per_month,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
0,7742,25879,https://www.airbnb.com/rooms/25879,20230319041143,2023-03-19,city scrape,2/1 One Block to Fullerton L Red Line Deck & ...,"Terrific 2 bed, 1 bath unit in an older brick ...","Safe, tree-lined neighborhood with easy beach ...",https://a0.muscache.com/pictures/730683/6335b6...,...,0.33,0,0,0,0,0,0,0,0,0
1,7744,37738,https://www.airbnb.com/rooms/37738,20230319041143,2023-03-19,previous scrape,Andersonville - Perfect location!,Comfort and style. Recently named as one of th...,We love this neighborhood and hope you will to...,https://a0.muscache.com/pictures/434306/c7f339...,...,1.62,0,0,0,0,0,0,0,0,0
2,7652,189821,https://www.airbnb.com/rooms/189821,20230319041143,2023-03-19,city scrape,"Best in Chicago, private, amazing garden space",We offer the highest standards of cleanliness....,Enjoy the unique green spaces at the heart of ...,https://a0.muscache.com/pictures/e6324b08-d6c8...,...,4.27,0,0,0,0,0,0,0,0,0
3,7653,207218,https://www.airbnb.com/rooms/207218,20230319041143,2023-03-19,city scrape,Historic Pullman Artist Flat - Artists & Explo...,Charming original 1880's Pullman workers flat ...,The entire town of Pullman was built in the 18...,https://a0.muscache.com/pictures/102976353/ff2...,...,2.13,0,0,0,0,0,0,0,0,0
4,7655,220333,https://www.airbnb.com/rooms/220333,20230319041143,2023-03-19,city scrape,Pullman School House Apartment - monthly rental,An unusual MONTHLY rental in the heart of Pull...,The entire neighborhood is part of the Pullman...,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.01,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349,7029,781186064034113431,https://www.airbnb.com/rooms/781186064034113431,20230319041143,2023-03-19,city scrape,2 Queen Bed Room,"Our hotel, located conveniently close to some ...",,https://a0.muscache.com/pictures/miso/Hosting-...,...,7.08,0,0,0,0,0,0,0,0,0
1350,7040,782373284583992853,https://www.airbnb.com/rooms/782373284583992853,20230319041143,2023-03-19,city scrape,"Logan Square 1br w/ pool, lounge & gym nr L",Show up and start living from day one in Chica...,This furnished apartment is located in the Log...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0
1351,7043,782476073065423478,https://www.airbnb.com/rooms/782476073065423478,20230319041143,2023-03-19,city scrape,Affordable apt near Wrigley,Charming unit located to the North of Boystown...,,https://a0.muscache.com/pictures/00ab76f4-3332...,...,1.96,0,0,0,0,0,0,0,0,0
1352,7054,783249596739848974,https://www.airbnb.com/rooms/783249596739848974,20230319041143,2023-03-19,city scrape,"South Loop 1br w/ gym & lounge, nr Grant Park","Discover the best of Chicago, with this one-be...",This furnished apartment is located in South L...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0


# Duplicate IDs checking
To ensure that all ID values in the dataset are unique, you should check for duplicate IDs. When you run the query to check for duplicates, there should be no rows returned, indicating that there are no duplicate ID values present in the dataset.

In [7]:
dup_ids = airbnb_pd[airbnb_pd.id_flag==0]
dup_ids = dup_ids.groupby("id").count()[["name"]].reset_index()
dup_ids = dup_ids[dup_ids.name>1]
dup_ids

Unnamed: 0,id,name


# cleanup inconsistent host id
Each host_id value in the dataset should be associated with only one host_name. However, there may be inconsistencies in the dataset where a host_id is associated with different host_name values.

To clean this up, you can either change the host_name value to a consistent value based on information in the dataset, or flag the host_id_flag column to indicate that the row should be removed from downstream tasks.

For example, if you find that a host_id is associated with multiple host_name values, you may want to investigate further to determine which host_name is correct. If one of the host_name values is clearly incorrect (e.g., a misspelling or a name that does not match the owner of the property), you could update the host_name value to the correct value.

Alternatively, if you cannot determine the correct host_name value, or if you want to exclude the row from downstream tasks for other reasons, you can flag the host_id_flag column with a value of 1 to indicate that the row should be removed.
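One possible way to repair a host_name rather than flag the row is to take the most frequent name recorded for each host_id. The sketch below uses toy data; the helper most_common and the tie-breaking rule (first value in sorted order) are assumptions for illustration, not a method the dataset description prescribes.

```python
import pandas as pd

# Toy data: host 1 has one misspelled name.
toy = pd.DataFrame({
    "host_id":   [1, 1, 1, 2, 2, 3],
    "host_name": ["Anna", "Anna", "Ana", "Bob", "Bob", "Cy"],
})

def most_common(names):
    # mode() returns the most frequent value(s) in sorted order; take the first on ties.
    modes = names.mode()
    return modes.iloc[0] if not modes.empty else None

# One majority name per host_id, then mapped back onto every row.
majority = toy.groupby("host_id")["host_name"].agg(most_common)
fixed = toy["host_id"].map(majority)

print(fixed.tolist())
```

When the counts are tied there is no clear majority, and flagging the rows may be the safer choice.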

In [8]:
def clean_host_id(df):
    # Count the distinct host_name values recorded for each host_id.
    name_counts = df.groupby('host_id')['host_name'].nunique()
    inconsistent_ids = name_counts[name_counts > 1].index
    # Flag every row of an inconsistent host_id for deletion (vectorised, no row loop).
    df.loc[df['host_id'].isin(inconsistent_ids), 'host_id_flag'] = 1
    return df

In [9]:
airbnb_pd = clean_host_id(airbnb_pd)

# Inconsistent Host ID checking 

This query should return zero rows once you implement the cleaning process

In [10]:
dup_host_id = airbnb_pd[airbnb_pd.host_id_flag==0]
dup_host_id = dup_host_id.groupby(["host_id","host_name"]).count()[["id"]].reset_index()
dup_host_id = dup_host_id.groupby("host_id").count()["id"].reset_index()
dup_host_id[dup_host_id["id"]>1]

Unnamed: 0,host_id,id


# cleanup neighbourhood
The neighbourhood column in the dataset should contain values that match the neighbourhoods defined in the official neighbourhood_list. However, there may be some values in the neighbourhood column that are incorrect due to errors or noise in the data.

To clean up the neighbourhood column, you can try to match each value in the column to a valid neighbourhood in the neighbourhood_list using a string distance function, for example one from the abydos library. If you can successfully match a value in the neighbourhood column to a neighbourhood in the neighbourhood_list, you can replace the value in the dataset with the correct neighbourhood name.

However, if you are unsure about how to clean up a particular value in the neighbourhood column, or if you cannot match the value to a valid neighbourhood in the neighbourhood_list, you can flag the row for deletion by setting the neighbourhood_flag column to a value of 1. If the value in the neighbourhood column is null and you cannot make a determination based on other information in the dataset, you can set the neighbourhood_flag column to a value of 2 to indicate that the row should be included but the neighbourhood value is null.

You can also use the latitude and longitude columns in the dataset to help match values in the neighbourhood column to valid neighbourhoods in the neighbourhood_list. However, you should be aware that the latitude and longitude values may also contain errors or noise, so you should exercise caution when using these columns to clean up the neighbourhood column.

In [11]:
neighbourhood_list = [ 'Hyde Park', 'West Town', 'Lincoln Park', 'Near West Side', 'Lake View',    'Dunning', 'Rogers Park', 'Logan Square', 'Uptown', 'Edgewater',    'North Center', 'Albany Park', 'West Ridge', 'Pullman', 'Irving Park',    'Beverly', 'Lower West Side', 'Near South Side', 'Near North Side',    'Grand Boulevard', 'Bridgeport', 'Humboldt Park', 'Chatham', 'Kenwood',    'Loop', 'West Lawn', 'Lincoln Square', 'Woodlawn', 'Avondale',    'Forest Glen', 'Portage Park', 'East Garfield Park', 'Washington Park',    'North Lawndale', 'Armour Square', 'South Lawndale', 'South Shore',    'Morgan Park', 'South Deering', 'West Garfield Park', 'Hermosa',    'Mckinley Park', 'Douglas', 'Hegewisch', 'West Elsdon', 'Norwood Park',    'Garfield Ridge', 'Austin', 'Belmont Cragin', 'Jefferson Park', 'Ashburn',    'Greater Grand Crossing', 'North Park', 'Oakland', 'Archer Heights',    'Edison Park', 'Englewood', 'Ohare', 'Brighton Park', 'Chicago Lawn',    'New City', 'South Chicago', 'Mount Greenwood', 'Montclare', 'Roseland',    'West Englewood', 'Calumet Heights', 'Auburn Gresham', 'Fuller Park',    'Avalon Park', 'Burnside', 'Clearing', 'Gage Park', 'West Pullman',    'Washington Heights', 'East Side']
print(neighbourhood_list)

['Hyde Park', 'West Town', 'Lincoln Park', 'Near West Side', 'Lake View', 'Dunning', 'Rogers Park', 'Logan Square', 'Uptown', 'Edgewater', 'North Center', 'Albany Park', 'West Ridge', 'Pullman', 'Irving Park', 'Beverly', 'Lower West Side', 'Near South Side', 'Near North Side', 'Grand Boulevard', 'Bridgeport', 'Humboldt Park', 'Chatham', 'Kenwood', 'Loop', 'West Lawn', 'Lincoln Square', 'Woodlawn', 'Avondale', 'Forest Glen', 'Portage Park', 'East Garfield Park', 'Washington Park', 'North Lawndale', 'Armour Square', 'South Lawndale', 'South Shore', 'Morgan Park', 'South Deering', 'West Garfield Park', 'Hermosa', 'Mckinley Park', 'Douglas', 'Hegewisch', 'West Elsdon', 'Norwood Park', 'Garfield Ridge', 'Austin', 'Belmont Cragin', 'Jefferson Park', 'Ashburn', 'Greater Grand Crossing', 'North Park', 'Oakland', 'Archer Heights', 'Edison Park', 'Englewood', 'Ohare', 'Brighton Park', 'Chicago Lawn', 'New City', 'South Chicago', 'Mount Greenwood', 'Montclare', 'Roseland', 'West Englewood', 'Calu

In [12]:
from Levenshtein import distance

In [13]:
airbnb_pd

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,reviews_per_month,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
0,7742,25879,https://www.airbnb.com/rooms/25879,20230319041143,2023-03-19,city scrape,2/1 One Block to Fullerton L Red Line Deck & ...,"Terrific 2 bed, 1 bath unit in an older brick ...","Safe, tree-lined neighborhood with easy beach ...",https://a0.muscache.com/pictures/730683/6335b6...,...,0.33,0,0,0,0,0,0,0,0,0
1,7744,37738,https://www.airbnb.com/rooms/37738,20230319041143,2023-03-19,previous scrape,Andersonville - Perfect location!,Comfort and style. Recently named as one of th...,We love this neighborhood and hope you will to...,https://a0.muscache.com/pictures/434306/c7f339...,...,1.62,0,0,0,0,0,0,0,0,0
2,7652,189821,https://www.airbnb.com/rooms/189821,20230319041143,2023-03-19,city scrape,"Best in Chicago, private, amazing garden space",We offer the highest standards of cleanliness....,Enjoy the unique green spaces at the heart of ...,https://a0.muscache.com/pictures/e6324b08-d6c8...,...,4.27,0,0,0,0,0,0,0,0,0
3,7653,207218,https://www.airbnb.com/rooms/207218,20230319041143,2023-03-19,city scrape,Historic Pullman Artist Flat - Artists & Explo...,Charming original 1880's Pullman workers flat ...,The entire town of Pullman was built in the 18...,https://a0.muscache.com/pictures/102976353/ff2...,...,2.13,0,0,0,0,0,0,0,0,0
4,7655,220333,https://www.airbnb.com/rooms/220333,20230319041143,2023-03-19,city scrape,Pullman School House Apartment - monthly rental,An unusual MONTHLY rental in the heart of Pull...,The entire neighborhood is part of the Pullman...,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.01,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349,7029,781186064034113431,https://www.airbnb.com/rooms/781186064034113431,20230319041143,2023-03-19,city scrape,2 Queen Bed Room,"Our hotel, located conveniently close to some ...",,https://a0.muscache.com/pictures/miso/Hosting-...,...,7.08,0,0,0,0,0,0,0,0,0
1350,7040,782373284583992853,https://www.airbnb.com/rooms/782373284583992853,20230319041143,2023-03-19,city scrape,"Logan Square 1br w/ pool, lounge & gym nr L",Show up and start living from day one in Chica...,This furnished apartment is located in the Log...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0
1351,7043,782476073065423478,https://www.airbnb.com/rooms/782476073065423478,20230319041143,2023-03-19,city scrape,Affordable apt near Wrigley,Charming unit located to the North of Boystown...,,https://a0.muscache.com/pictures/00ab76f4-3332...,...,1.96,0,1,0,0,0,0,0,0,0
1352,7054,783249596739848974,https://www.airbnb.com/rooms/783249596739848974,20230319041143,2023-03-19,city scrape,"South Loop 1br w/ gym & lounge, nr Grant Park","Discover the best of Chicago, with this one-be...",This furnished apartment is located in South L...,https://a0.muscache.com/pictures/prohost-api/H...,...,,0,0,0,0,0,0,0,0,0


In [14]:
def find_closest_match(string):
    # Nulls are passed through unchanged; they are flagged in clean_neighbourhood.
    if pd.isna(string):
        return string
    # Return the listed neighbourhood with the smallest Levenshtein distance
    # (the first one wins on ties). No distance threshold is applied.
    return min(neighbourhood_list, key=lambda term: distance(string, term))

def clean_neighbourhood(df):
    df['neighbourhood'] = df['neighbourhood'].apply(find_closest_match)
    # Safety net: anything still outside the list is flagged for deletion.
    df.loc[~df['neighbourhood'].isin(neighbourhood_list), 'neighbourhood_flag'] = 1
    # Null neighbourhoods are kept with the null flag.
    df.loc[df['neighbourhood'].isna(), 'neighbourhood_flag'] = 2
    return df
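find_closest_match returns the nearest listed name however far away it is, so a value that resembles nothing in the list is still replaced. A variant with a cut-off could leave such values unmatched so they can be flagged instead. The sketch below is an assumption-laden illustration: it uses a small pure-Python Levenshtein function instead of the library call, a shortened candidate list, and a hypothetical max_ratio of 0.3 that would need tuning on the real data.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_with_threshold(value, candidates, max_ratio=0.3):
    # Return the closest candidate, or None when even the best match is too far.
    if value is None:
        return None
    best = min(candidates, key=lambda c: levenshtein(value.lower(), c.lower()))
    dist = levenshtein(value.lower(), best.lower())
    return best if dist / max(len(value), len(best)) <= max_ratio else None

# Shortened list, for illustration only.
sample_list = ['Lincoln Park', 'Lincoln Square', 'Loop', 'Uptown']
print(match_with_threshold('Lincon Park', sample_list))
print(match_with_threshold('Narnia', sample_list))
```

Rows for which the function returns None could then receive neighbourhood_flag 1, as the instructions suggest for values that cannot be matched.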

In [15]:
airbnb_pd = clean_neighbourhood(airbnb_pd)

In [16]:
airbnb_pd[airbnb_pd['neighbourhood_flag']==2]

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,reviews_per_month,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
9,7683,668262,https://www.airbnb.com/rooms/668262,20230319041143,2023-03-19,previous scrape,Room in Lakeview East,Welcome! You will have a private room in this ...,,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.31,0,0,2,0,0,0,0,0,0
44,128,4840753,https://www.airbnb.com/rooms/4840753,20230319041143,2023-03-19,city scrape,1BD Suite @ The Guesthouse Hotel,Our suites are perfect for families or individ...,,https://a0.muscache.com/pictures/61182936/9a72...,...,0.35,0,0,2,0,0,0,0,0,0
54,160,5804791,https://www.airbnb.com/rooms/5804791,20230319041143,2023-03-19,city scrape,Single Bedroom for Med Students & Professionals,A private bedroom for one person only in a thr...,,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.24,0,1,2,0,0,0,0,0,0
58,176,6014128,https://www.airbnb.com/rooms/6014128,20230319041143,2023-03-19,city scrape,Extended Stay Business 2 bedroom 2 Bath Balcony,Here it all comes together to form a lifestyle...,,https://a0.muscache.com/pictures/de52fb83-31b9...,...,0.03,0,1,2,0,0,0,0,0,0
64,217,6588681,https://www.airbnb.com/rooms/6588681,20230319041143,2023-03-19,city scrape,Logan Square Studio,Garden studio in 2-flat with private entrance....,,https://a0.muscache.com/pictures/421160c6-a4ed...,...,4.52,0,0,2,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1344,6994,777034792047052261,https://www.airbnb.com/rooms/777034792047052261,20230319041143,2023-03-19,city scrape,Newly Renovated private room 2,Take it easy at this unique and tranquil getaway.,,https://a0.muscache.com/pictures/811d3eb0-9f72...,...,,0,1,2,0,0,0,0,0,0
1345,6995,777038375268689288,https://www.airbnb.com/rooms/777038375268689288,20230319041143,2023-03-19,city scrape,Newly Renovated Private Room 1,Bring the whole family to this great place wit...,,https://a0.muscache.com/pictures/miso/Hosting-...,...,,0,1,2,0,0,0,0,0,0
1347,7017,779953812225353399,https://www.airbnb.com/rooms/779953812225353399,20230319041143,2023-03-19,city scrape,Convenient fantastic north zone,"Keep Studio, 1 bathroom Lakeview, Ground level...",,https://a0.muscache.com/pictures/miso/Hosting-...,...,0.40,0,1,2,0,0,0,0,0,0
1349,7029,781186064034113431,https://www.airbnb.com/rooms/781186064034113431,20230319041143,2023-03-19,city scrape,2 Queen Bed Room,"Our hotel, located conveniently close to some ...",,https://a0.muscache.com/pictures/miso/Hosting-...,...,7.08,0,0,2,0,0,0,0,0,0


In [17]:
airbnb_pd[airbnb_pd['neighbourhood_flag']==1]

Unnamed: 0.1,Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,...,reviews_per_month,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag


# Neighbourhood checking

This query should return zero rows once you implement the cleaning process

In [57]:
neighbourhood_check = airbnb_pd[airbnb_pd.neighbourhood_flag==0]
neighbourhood_check = neighbourhood_check[neighbourhood_check.neighbourhood.apply(lambda x:x not in neighbourhood_list)]
neighbourhood_check[["id","neighbourhood"]]

Unnamed: 0,id,neighbourhood


# cleanup latitude and longitude
The latitude and longitude values in the dataset must fall within the range of -90 to +90 for latitude and -180 to +180 for longitude to ensure that they meet the criteria for analysis. We have provided the check_number function to validate the latitude and longitude columns. Any values outside of these ranges should be cleaned to meet the criteria.

If you are unsure what to do with a value, you can flag the row for deletion by setting latitude_flag or longitude_flag to 1. If the value is null, set the flag to 2 instead.
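As an alternative view of the same range rule, the sketch below validates a toy column with pd.to_numeric and Series.between instead of the provided check function. This is a swapped-in technique shown for illustration; the sample values are made up.

```python
import pandas as pd

# Toy latitude values: valid, malformed, out of range, null, valid.
raw = pd.Series(["41.88", "abc", "123.4", None, "-87.6"])

# Unparseable strings and nulls become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

# between() is inclusive on both ends and evaluates to False for NaN.
in_range = numeric.between(-90, 90)

print(in_range.tolist())
```

The rows where in_range is False are the candidates for repair or flagging.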

In [19]:
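def check_number(x,start=-90,end=90):
    # Return True only when x can be read as a number inside [start, end].
    # An explicit False (rather than None) keeps the "== False" checks below reliable.
    try:
        return start <= float(x) <= end
    except (TypeError, ValueError):
        return False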

In [20]:
import re
def clean_latitude(df):
    # Flag every value that fails the range check (malformed strings, out-of-range numbers, nulls).
    lat_mask = df['latitude'].apply(lambda x: not check_number(x, -90, 90))
    df.loc[lat_mask, 'latitude_flag'] = 1
    # Nulls are kept for downstream use with null treatment.
    df.loc[df['latitude'].isna(), 'latitude_flag'] = 2
    # Repair the flagged values by stripping stray characters (see clean_latitude_pls).
    to_fix = df['latitude_flag'] == 1
    df.loc[to_fix, 'latitude'] = df.loc[to_fix, 'latitude'].apply(clean_latitude_pls)
    # Repaired rows are safe again; values that could not be repaired are now NaN
    # and receive the null flag.
    df.loc[to_fix, 'latitude_flag'] = 0
    df.loc[df['latitude'].isna(), 'latitude_flag'] = 2
    return df

In [21]:
def clean_latitude_pls(latitude_string):
    # Keep only digits, the decimal point and the minus sign (latitude can be negative).
    latitude_string = re.sub(r"[^0-9.\-]", "", str(latitude_string))
    try:
        latitude = float(latitude_string)
    except ValueError:
        # Nothing numeric could be recovered.
        return float('nan')
    # Values outside the valid latitude range cannot be trusted; set them to NaN.
    if not -90 <= latitude <= 90:
        latitude = float('nan')
    return latitude

In [22]:
# apply the clean_latitude function to the DataFrame
airbnb_pd = clean_latitude(airbnb_pd)

In [23]:
airbnb_pd.loc[airbnb_pd['latitude_flag']==1, 'latitude']

Series([], Name: latitude, dtype: object)

# Latitude checking

This query should return zero rows once you implement the cleaning process

In [24]:
lat_check_pd = airbnb_pd[airbnb_pd.latitude_flag==0]
lat_check_pd = lat_check_pd[lat_check_pd.latitude.apply(lambda x:check_number(x,-90,90))==False]
lat_check_pd[["id","latitude"]]

Unnamed: 0,id,latitude


In [25]:
def clean_longitude(df):
    # Flag every value that fails the range check (malformed strings, out-of-range numbers, nulls).
    lon_mask = df['longitude'].apply(lambda x: not check_number(x, -180, 180))
    df.loc[lon_mask, 'longitude_flag'] = 1
    # Nulls are kept for downstream use with null treatment.
    df.loc[df['longitude'].isna(), 'longitude_flag'] = 2
    return df

In [26]:
def clean_longi_pls(longi_string):
    # Keep only digits, the decimal point and the minus sign (longitude can be negative).
    # A raw string avoids the invalid "\-" escape warning in newer Python versions.
    longi_string = re.sub(r"[^0-9.\-]", "", str(longi_string))
    return float(longi_string)

In [27]:
airbnb_pd = clean_longitude(airbnb_pd)
airbnb_pd[airbnb_pd['longitude_flag']==1]['longitude']

25                - B87.68N021
26                - B87.68N021
38                -87.6pn461S1
39                e-87.Y657q72
40                -XI87.y64941
                 ...          
1132    -pp87.7q5f34295W116394
1182                -8o7v.6712
1197              -87y.K7808x2
1253                -H87.666h4
1326              -uL87.654F16
Name: longitude, Length: 71, dtype: object

In [28]:
airbnb_pd.loc[airbnb_pd['longitude_flag']==1, 'longitude'] = airbnb_pd.loc[airbnb_pd['longitude_flag']==1, 'longitude'].apply(lambda x: clean_longi_pls(x))
airbnb_pd.loc[airbnb_pd['longitude_flag']==1, 'longitude']
# nikolaus change longitude_flag back to 0
airbnb_pd.loc[airbnb_pd['longitude_flag']==1, 'longitude_flag'] = 0

# Longitude checking

This query should return zero rows once you implement the cleaning process

In [29]:
lon_check_pd = airbnb_pd[airbnb_pd.longitude_flag==0]
lon_check_pd = lon_check_pd[lon_check_pd.longitude.apply(lambda x:check_number(x,-180,180))==False]
lon_check_pd[["id","longitude"]]

Unnamed: 0,id,longitude


# cleanup room type
The "room_type" column in the dataset should contain one of the values defined in the list of allowed_room_type provided by the authority: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']. Any value outside of this list needs to be adjusted to one of the allowed values.

If you are unsure about how to adjust the value or cannot find a suitable value, you can flag the row for deletion by setting the value of room_type_flag to 1. If the "room_type" column has a null value and you cannot decide on an appropriate value, you can set the value of room_type_flag to 2.
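
One possible way to adjust near-miss values is to look up the closest entry in the allowed list. The sketch below uses `difflib` from the standard library; `suggest_room_type` and the `cutoff` of 0.8 are assumptions made for illustration, not part of the assignment.

```python
import difflib

ALLOWED = ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

def suggest_room_type(value, allowed=ALLOWED, cutoff=0.8):
    """Return the closest allowed room type, or None if nothing is close enough."""
    if not isinstance(value, str):
        return None
    # compare case-insensitively, then map back to the canonical spelling
    lowered = {a.lower(): a for a in allowed}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

print(suggest_room_type('Entire home'))   # Entire home/apt
print(suggest_room_type('Castle'))        # None
```

A value for which the helper returns None is a candidate for room_type_flag 1, or 2 when the original value is null.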

In [30]:
allowed_room_type = ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

In [31]:
def clean_room_type(df):
    #raise Exception("not yet have implementation")
    # do something here
    df.loc[~df['room_type'].isin(allowed_room_type), 'room_type_flag'] = 1
    df.loc[df['room_type'].isna(), 'room_type_flag'] = 2 
    return df

In [32]:
airbnb_pd = clean_room_type(airbnb_pd)
res = airbnb_pd[airbnb_pd['room_type_flag']==1].groupby('room_type').count()
airbnb_pd['room_type'] = airbnb_pd['room_type'].replace({'Entire home': 'Entire home/apt'})
airbnb_pd[airbnb_pd['room_type_flag']==0]['room_type']
# airbnb_pd[airbnb_pd['room_type_flag']==1]['room_type'] = 'Entire home/apt'
# nikolaus change room_type_flag to 0, but only for rows whose value is now in the allowed list
airbnb_pd.loc[(airbnb_pd['room_type_flag']==1) & airbnb_pd['room_type'].isin(allowed_room_type), 'room_type_flag'] = 0

# room_type checking

This query should return zero rows once you implement the cleaning process

In [33]:
room_type_pd = airbnb_pd[airbnb_pd.room_type_flag==0]
room_type_pd = room_type_pd[room_type_pd.room_type.apply(lambda x: x not in allowed_room_type)]
room_type_pd[["id","room_type"]]

Unnamed: 0,id,room_type


# cleanup minimum_nights and number_of_reviews

The columns "minimum_nights" and "number_of_reviews" should both be integer values. "minimum_nights" should be a value between 1 and the number of days in a year (365), while "number_of_reviews" should be a value between 0 and 999999.

To check if these columns meet the criteria, we have provided a "check_integer" function. Any values that do not meet the criteria should be cleaned to meet the criteria for analysis.

If you are unsure how to clean a value, you can flag the row for deletion by setting "minimum_nights_flag" or "number_of_reviews_flag" to 1. If the value is null, set the flag to 2.
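
A minimal sketch of one way to clean such values: drop the stray letters, convert to an integer, and reject anything outside the allowed range. `parse_count` is a hypothetical helper, and it assumes that every non-digit character in a string value is noise.

```python
import numbers
import re

def parse_count(value, start, end):
    """Return value as an int within [start, end], or None if it cannot be recovered."""
    if isinstance(value, str):
        # drop stray letters such as the 'K' in '3K2'
        digits = re.sub(r"[^0-9]", "", value)
        if not digits:
            return None
        number = int(digits)
    elif (isinstance(value, numbers.Real) and not isinstance(value, bool)
          and float(value).is_integer()):
        # already numeric and whole; NaN and infinity fail is_integer()
        number = int(value)
    else:
        return None
    return number if start <= number <= end else None

print(parse_count('3K2', 1, 365))   # 32
print(parse_count('500', 1, 365))   # None
```

A None result marks a value that should stay flagged rather than be reset to 0.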

In [34]:
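def check_integer(x,start=0,end=365):
    try:
        # return an explicit bool: an out-of-range value used to fall through and return None,
        # which the "== False" checks below do not catch
        return start <= int(x) <= end
    except (TypeError, ValueError, OverflowError):
        return False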

In [35]:
def clean_minimum_nights(df):
    #raise Exception("not yet have implementation")
    # do something here
    nights_mask = df['minimum_nights'].apply(lambda x: not check_integer(x))
    df.loc[nights_mask, 'minimum_nights_flag'] = 1
    df.loc[df['minimum_nights'].isna(), 'minimum_nights_flag'] = 2
    return df

In [36]:
airbnb_pd = clean_minimum_nights(airbnb_pd)
airbnb_pd[airbnb_pd['minimum_nights_flag']==1]['minimum_nights']

4       3K2
92      K10
98      j32
117     t32
119     q32
170     b32
454     500
462     f32
476     3d2
486     3v2
537     B32
599     o32
722     s32
723     s32
763     3K5
768     p32
836     x32
946     N32
957     3i2
958     3i2
959     3o2
1067    3d2
1098    y29
1224    J32
1238    S32
1245    3b2
1265    o32
1281    3h2
1313    2b1
Name: minimum_nights, dtype: object

In [37]:
def clean_min_nights_pls(nights_string):
    # keep digits only: a leftover '.' would make int() fail
    nights_string = re.sub(r"[^0-9]", "", nights_string)
    nights = int(nights_string)
    return nights

In [38]:
flagged = airbnb_pd['minimum_nights_flag']==1
airbnb_pd.loc[flagged, 'minimum_nights'] = airbnb_pd.loc[flagged, 'minimum_nights'].apply(lambda x: clean_min_nights_pls(x))
# nikolaus change minimum_nights_flag 1 to 0, but only where the cleaned value is in range (500 stays flagged)
in_range = airbnb_pd['minimum_nights'].apply(lambda x: bool(check_integer(x, 0, 365)))
airbnb_pd.loc[flagged & in_range, 'minimum_nights_flag'] = 0


# Minimum nights checking

This query should return zero rows once you implement the cleaning process

In [39]:
min_check_pd = airbnb_pd[airbnb_pd.minimum_nights_flag==0]
min_check_pd = min_check_pd[min_check_pd.minimum_nights.apply(lambda x:check_integer(x,0,365))==False]
min_check_pd[["id","minimum_nights"]]

Unnamed: 0,id,minimum_nights


In [40]:
def clean_number_of_reviews(df):
    #raise Exception("not yet have implementation")
    # do something here
    reviews_mask = df['number_of_reviews'].apply(lambda x: not check_integer(x,0,999999))
    df.loc[reviews_mask, 'number_of_reviews_flag'] = 1
    df.loc[df['number_of_reviews'].isna(), 'number_of_reviews_flag']=2
    return df

In [41]:
airbnb_pd = clean_number_of_reviews(airbnb_pd)

In [42]:
airbnb_pd[airbnb_pd['number_of_reviews_flag']==1]['number_of_reviews']

0        5o1
15       2h7
23       K21
67       7v7
74       D35
85      n146
107     1S30
135      8f9
168      3E1
169     23c6
195     1g86
253      G32
283     Q202
284     Q202
290      f20
304      u57
315      T73
344     18L1
348     d191
361      9P1
390      7X8
391      7X8
436      a22
437      a22
455     A148
463      4x2
464      4x2
472      H36
474      3M3
487     1d94
488     1Q64
489      K14
490      J19
539      B63
614      2w0
668      q20
716      4M6
756      U51
764      4F6
777      5k1
778      5k1
808      K29
813      4f1
974      2p7
986      3C8
1007     a27
1008     a27
1166     1L4
1218     i14
1225     1Q5
1226     1Q5
1330     1S5
Name: number_of_reviews, dtype: object

In [43]:
airbnb_pd.loc[airbnb_pd['number_of_reviews_flag']==1, 'number_of_reviews'] = airbnb_pd.loc[airbnb_pd['number_of_reviews_flag']==1, 'number_of_reviews'].apply(lambda x: clean_min_nights_pls(x))
airbnb_pd.loc[airbnb_pd['number_of_reviews_flag']==1, 'number_of_reviews']
# nikolaus change number_of_reviews_flag to 0
airbnb_pd.loc[airbnb_pd['number_of_reviews_flag']==1, 'number_of_reviews_flag'] = 0

# Number of reviews checking

This query should return zero rows once you implement the cleaning process

In [44]:
reviews_check_pd = airbnb_pd[airbnb_pd.number_of_reviews_flag==0]
# the lower bound is 0: a listing with no reviews is valid
reviews_check_pd = reviews_check_pd[reviews_check_pd.number_of_reviews.apply(lambda x:check_integer(x,0,999999))==False]
reviews_check_pd[["id","number_of_reviews"]]

Unnamed: 0,id,number_of_reviews


# cleanup last_review

The "last_review" column should be in the format of ISO-date (yyyy-mm-dd). We have provided a "check_date" function to verify the date format.

If a value does not match the date format and you are unsure how to handle it, you can flag the row for deletion by setting "last_review_flag" to 1. If the value is null, set the flag to 2.
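
As a sketch of how non-ISO values might be normalised, the helper below tries a short list of known formats and returns an ISO string for the first one that parses. `to_iso_date` and `KNOWN_FORMATS` are hypothetical names. The second format, `%B %d, %Y` (for example "March 12, 2023"), is assumed to be the only other format in the data, and `%B` assumes English month names in the active locale.

```python
from datetime import datetime

# formats to try, in order; extend this tuple if other formats turn up
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y")

def to_iso_date(value, formats=KNOWN_FORMATS):
    """Return value as a yyyy-mm-dd string, or None if no known format matches."""
    if not isinstance(value, str):
        return None
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(to_iso_date("March 12, 2023"))   # 2023-03-12
print(to_iso_date("13/08/2022"))       # None
```

A value that comes back as None matches none of the listed formats and can keep its flag.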


In [45]:
from datetime import datetime
def check_date(x,fmt="%Y-%m-%d"):
    try:
        datetime.strptime(x,fmt)
        return True
    except (TypeError, ValueError):
        return False

In [46]:
def clean_last_reviews(df):
    #raise Exception("not yet have implementation")
    # do something here
    last_mask = df['last_review'].apply(lambda x: not check_date(x))
    df.loc[last_mask, 'last_review_flag'] = 1
    df.loc[df['last_review'].isna(), 'last_review_flag'] = 2
    return df

In [47]:
def clean_date(date_str):
    # Convert the date string to a datetime object
    dt = datetime.strptime(date_str, '%B %d, %Y')

    # Convert the datetime object to an ISO-format string
    iso_date_str = dt.date().isoformat()
    print(iso_date_str)
    return iso_date_str

In [48]:
airbnb_pd = clean_last_reviews(airbnb_pd)
airbnb_pd.loc[airbnb_pd['last_review_flag']==1,'last_review']=airbnb_pd.loc[airbnb_pd['last_review_flag']==1,'last_review'].apply(clean_date)
airbnb_pd[airbnb_pd['last_review_flag']==1]['last_review']
# nikolaus change last_review_flag back to 0
airbnb_pd.loc[airbnb_pd['last_review_flag']==1, 'last_review_flag'] = 0

2022-08-13
2023-03-12
2019-03-17
2023-03-05
2021-09-19
2022-11-08
2023-02-26
2022-11-20
2019-09-13
2022-08-26
2023-03-12
2023-03-12
2023-03-08
2023-03-08
2019-10-20
2022-06-19
2022-01-03
2020-06-28
2022-11-27
2023-01-19
2022-11-30
2023-03-04
2023-03-04
2022-10-03
2023-03-07
2022-12-20
2022-12-20
2022-03-07
2022-12-29
2023-03-13
2023-01-29
2023-03-18
2021-03-31
2021-03-31
2022-08-20
2022-08-20
2022-10-02
2022-08-01
2023-03-06
2023-03-12
2023-03-17
2023-03-16
2023-03-13
2023-03-15
2023-03-14
2023-03-15
2023-03-05
2022-12-28
2022-11-14
2023-03-06
2023-03-06
2022-11-06
2022-08-27
2022-07-17
2022-12-16
2023-02-26
2023-02-24
2022-10-02
2022-12-27
2023-03-12
2023-03-12
2022-12-07


# Last Review checking

This query should return zero rows once you implement the cleaning process

In [49]:
last_review_check_pd = airbnb_pd[airbnb_pd.last_review_flag==0]
last_review_check_pd = last_review_check_pd[last_review_check_pd.last_review.apply(lambda x:check_date(x))==False]
last_review_check_pd[["id","last_review"]]

Unnamed: 0,id,last_review


# save the dataset to csv

In [50]:
# index=False keeps the DataFrame index from being written as an extra unnamed column
airbnb_pd.to_csv("chicago_vert_dataset_cleaned_1.csv", index=False)

# columns that potentially will be used for analysis:
id,name,host_id,host_name,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,price

In [51]:
columns_used = ["id","name","host_id","host_name",
                "neighbourhood","latitude","longitude",
                "room_type","minimum_nights","number_of_reviews","last_review","price"]

In [52]:
airbnb_pd[columns_used]

Unnamed: 0,id,name,host_id,host_name,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,price
0,25879,2/1 One Block to Fullerton L Red Line Deck & ...,101521,Red,Chicago Lawn,41.92499,-87.65573,Entire home/apt,32,51,2022-12-20,$86.00
1,37738,Andersonville - Perfect location!,162364,Mat And Randy,Chicago Lawn,41.97303,-87.66567,Private room,3,250,2020-03-15,$110.00
2,189821,"Best in Chicago, private, amazing garden space",899757,Meighan,Chicago Lawn,41.92918,-87.70219,Entire home/apt,3,598,2023-02-25,$202.00
3,207218,Historic Pullman Artist Flat - Artists & Explo...,1019125,Jb,Chicago Lawn,41.68843,-87.60712,Entire home/apt,2,299,2023-02-12,$100.00
4,220333,Pullman School House Apartment - monthly rental,1019125,Jb,Chicago Lawn,41.68912,-87.60725,Entire home/apt,32,2,2022-08-13,$100.00
...,...,...,...,...,...,...,...,...,...,...,...,...
1349,781186064034113431,2 Queen Bed Room,431019163,Paul,,41.88413,-87.63025,Private room,1,17,2023-03-17,$192.00
1350,782373284583992853,"Logan Square 1br w/ pool, lounge & gym nr L",107434423,Blueground,Lower West Side,41.927101,-87.7043588,Entire home/apt,32,0,,$158.00
1351,782476073065423478,Affordable apt near Wrigley,252408635,Andie,,41.95693,-87.6497,Entire home/apt,1,6,2023-03-03,$139.00
1352,783249596739848974,"South Loop 1br w/ gym & lounge, nr Grant Park",107434423,Blueground,Chicago Lawn,41.8725288,-87.6308608,Entire home/apt,32,0,,$119.00


In [53]:
# save the csv
