In [1]:
import pandas as pd
import numpy as np
import abydos
from abydos import distance
import re
import math

# General Instruction

- Clean up the dataset based on the information that is given:
You need to clean the dataset according to the information that is given to you. This means that there are problems with the dataset that need to be fixed, and you should use the information given to you to determine what those problems are and how to fix them.

- Each case has different data quality problems; hints and additional information are provided to help you understand each problem:
Each row in the dataset may have different data quality problems. There will be hints and additional information provided to help you understand what the specific problem is with each row.

- You can take any approach to cleaning the data, but you should clean the instructed column only:
You have the freedom to use any approach to clean the data, but you should only clean the instructed column. This means that you should not modify any other columns in the dataset, or add or remove any rows.

- Do not create or remove any column. Also do not create or remove any row:
You are not allowed to create new columns or remove any columns from the dataset. You are also not allowed to add or remove any rows.

- Each column has a flag column named <column\_name>\_flag. Use it to flag a row that you do not want to include in the downstream task. There are three categories you can use:
safe_flag (0): this row is safe to use in downstream tasks.
delete_flag (1): this row should be deleted and not used in downstream tasks.
null_flag (2): this row can be included in downstream tasks but with null treatment.
You can also add a new category, but you need to provide a justification and an explanation for it. Note that the completeness of the dataset also matters, so try not to flag too many rows, and do your best to clean the values.
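
As an illustration of the flag convention above, the sketch below names the three values and keeps only rows that no flag column marks for deletion. The constant names and the helper `usable_rows` are assumptions made for this example; they are not part of the notebook, and the rows are made up.

```python
# Hypothetical names for the three flag values described above.
SAFE_FLAG = 0    # row is safe to use downstream
DELETE_FLAG = 1  # row should be dropped downstream
NULL_FLAG = 2    # row is kept, but the column value is treated as null

def usable_rows(rows, flag_columns):
    """Return the rows in which none of the listed flag columns is DELETE_FLAG."""
    return [
        row for row in rows
        if all(row.get(col, SAFE_FLAG) != DELETE_FLAG for col in flag_columns)
    ]

# Made-up rows, for illustration only.
rows = [
    {"id": 1, "id_flag": SAFE_FLAG, "latitude_flag": SAFE_FLAG},
    {"id": 2, "id_flag": DELETE_FLAG, "latitude_flag": SAFE_FLAG},
    {"id": 3, "id_flag": SAFE_FLAG, "latitude_flag": NULL_FLAG},
]
kept = usable_rows(rows, ["id_flag", "latitude_flag"])
print([row["id"] for row in kept])
```

Rows flagged with 2 stay in the result, which matches the intent that null-flagged rows remain available to the downstream task.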

- For each data cleaning task, we have provided a function that represents the goal of the cleaning. For example, clean_duplicate_id(df) is the function for removing duplicate ID values. These functions take a DataFrame as input and return the cleaned version of the DataFrame.

    In each chunk of data cleaning task, you will see the following three parts:

    1. The clean_<name> function that performs the specific cleaning task.
    2. The execution of the cleaning function on the DataFrame.
    3. A checking part to help you evaluate the effectiveness of the cleaning.
    
  While you can create new cells and add additional code, the cleaning must be performed through the provided cleaning functions. You can adjust the order of the cleaning steps, but please move whole chunks of code together to avoid errors.

The cleaning task will be considered complete if this notebook can be run sequentially by executing "Restart and Run All".




# Purpose
The purpose of this dataset is to conduct exploratory analysis of the listings and create a prediction model for listing price using some columns from the dataset. This means that the dataset is intended to be used to explore the characteristics and features of the listings, and to build a model that can predict the price of a listing based on certain variables in the dataset. The goal is to gain insights into the factors that influence the price of a listing and to develop a model that can accurately predict listing prices based on those factors.

# Columns and Dataset Description
- id: a unique identifier for each listing.
- name: the name or title of the listing, as provided by the host.
- host_id: a unique identifier for each host.
- host_name: the name of the host who listed the property.
- neighbourhood_group: the larger geographic area in which the listing is located (e.g. a borough or group of neighborhoods).
- neighbourhood: the specific neighborhood in which the listing is located.
- latitude: the latitude coordinate of the listing.
- longitude: the longitude coordinate of the listing.
- room_type: the type of space that is being listed (e.g. an entire apartment, a private room, a shared room).
- price: the nightly price of the listing, in the currency specified in the dataset.
- minimum_nights: the minimum number of nights that a guest must book the listing for.
- number_of_reviews: the total number of reviews that the listing has received.
- last_review: the date of the most recent review of the listing.
- reviews_per_month: the average number of reviews per month that the listing has received.
- calculated_host_listings_count: the total number of listings that the host has on Airbnb.
- availability_365: the number of days per year that the listing is available for booking.
- number_of_reviews_ltm: the total number of reviews that the listing has received in the last 12 months.
- license: a license number for the listing, if applicable (this column may not be present in all versions of the dataset).

Besides the columns above, there are columns pre-defined for flagging the rows based on particular data cleaning context:
- id_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the id column (duplicate).
- host_id_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the host_id column.
- neighbourhood_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the neighbourhood column.
- latitude_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the latitude column.
- longitude_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the longitude column.
- minimum_nights_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the minimum_nights column.
- number_of_reviews_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the number_of_reviews column.
- last_review_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the last_review column.
- room_type_flag: a flag column indicating whether a given row should be included in downstream analysis or not based on data quality issues related to the room_type column.

# Load Data

In [2]:
airbnb_pd = pd.read_csv("chicago_vert_dataset.csv")

In [3]:
airbnb_pd.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'license', 'id_flag',
       'host_id_flag', 'neighbourhood_flag', 'latitude_flag', 'longitude_flag',
       'minimum_nights_flag', 'number_of_reviews_flag', 'last_review_flag',
       'room_type_flag'],
      dtype='object')

In [5]:
airbnb_pd.neighbourhood.unique()

array(['Hyde Park', 'WeEst ToMPwn', 'LogalnQ SquUare', 'Uptown',
       'West Town', 'Edgewater', 'North Center', 'Logan Square',
       'Irving Park', 'Near South Side', 'AWesIt Ridg e',
       'Lhogan SqueRare', 'Bridgeport', 'Bridgeapoxrit', 'Wel st rTown',
       'Forest Glen', 'QixLogan Square', 'West Ridge',
       'East Garfield Park', 'Rogers Park', 'IRogxers sPark',
       'South Lawndale', 'UMTptoSwn', 'Beverly', 'Lower West Side',
       'Chatham', 'HysdCe Pasrk', 'Avondale', 'Humboldt Park',
       'Portage Park', 'EastuH GarfiOeld Park', 'Washington Park',
       'LoweNrU  West Side', 'Near SouQth CShide', 'Loop',
       'Nearx PSfouth Side', 'West Lawn', nan, 'PHdydee Park',
       'PEdgewatweOr', 'Grand Boulevard', 'MUjNptown',
       'OLoweri Wiest Side', 'WestK aTxown', 'UFptkohwn', 'AviondSalae',
       'IrvinQg JAPark', 'XLeogman Square', 'LRognan SqQuare',
       'South Deering', 'WLestS ToKwn', 'Wnhejst Town', 'Wenst ToDjwn',
       'WestK ToTCwn', 'WehshtH Town', 

# cleanup duplicate id
The ID column must contain unique values. If there are any duplicate values in this column, you will need to take action to ensure that each ID is unique. You can do this by either fixing the duplicates (if you want to keep them) or by flagging them for removal (1) using the id_flag column.

In [6]:
def clean_duplicate_id(df):
    # Among unflagged rows, find ids that occur more than once.
    dup_ids = df[df.id_flag==0]
    dup_ids = dup_ids.groupby("id").count()[["name"]].reset_index()
    dup_ids = dup_ids[dup_ids.name>1]
    
    for id_ in list(dup_ids["id"]):
        temp_df=df[df["id"]==id_]
        exact_dup_ls=list(temp_df[temp_df.duplicated()].index)
        
        if exact_dup_ls:
            # Case 1: two rows are completely identical, so flag the repeated one.
            df.loc[exact_dup_ls[0],"id_flag"]=1
            continue
        
        # Case 2: the rows differ. Combine the first two rows (nulls in the first
        # are filled from the second) and check whether the result equals one of them.
        index_ls=list(temp_df.index)
        temp_row_1=temp_df.loc[index_ls[0]]
        temp_row_2=temp_df.loc[index_ls[1]]
        row_comb=temp_row_1.combine_first(temp_row_2)
        if (temp_row_1.fillna(0)==row_comb.fillna(0)).all():
            df.loc[index_ls[1],"id_flag"]=1
        elif (temp_row_2.fillna(0)==row_comb.fillna(0)).all():
            df.loc[index_ls[0],"id_flag"]=1
        else:
            # Neither row matches the combination: mark both for null treatment.
            df.loc[index_ls[0],"id_flag"]=2
            df.loc[index_ls[1],"id_flag"]=2
    return df
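
The idea of keeping the more complete of two rows that share an id can be sketched in plain Python. This is a stricter variant than the function above, not a copy of it: `subsumes` also requires the non-missing values to agree. The helper names and the two example rows are assumptions for illustration, and both rows are assumed to have the same keys.

```python
import math

def is_missing(value):
    """True for None and float NaN, the two null markers assumed here."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def subsumes(full_row, partial_row):
    """True if every non-missing value in partial_row equals the value in full_row."""
    return all(
        is_missing(partial_row[key]) or partial_row[key] == full_row[key]
        for key in partial_row
    )

# Made-up rows sharing one id; row_b is row_a with the name missing.
row_a = {"id": 7126, "name": "Tiny Studio", "price": 85.0}
row_b = {"id": 7126, "name": None, "price": 85.0}

print(subsumes(row_a, row_b))  # row_a carries everything row_b has
print(subsumes(row_b, row_a))  # row_b lacks the name that row_a has
```

When one row subsumes the other, the less complete row is the natural candidate for `id_flag = 1`; when neither does, the rows genuinely conflict and need a separate decision.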

In [7]:
airbnb_pd = clean_duplicate_id(airbnb_pd)

# Duplicate IDs checking
To ensure that all ID values in the dataset are unique, you should check for duplicate IDs. When you run the query to check for duplicates, there should be no rows returned, indicating that there are no duplicate ID values present in the dataset.

In [8]:
dup_ids = airbnb_pd[airbnb_pd.id_flag==0]
dup_ids = dup_ids.groupby("id").count()[["name"]].reset_index()
dup_ids = dup_ids[dup_ids.name>1]
dup_ids

Unnamed: 0,id,name


# cleanup inconsistent host id
Each host_id value in the dataset should be associated with only one host_name. However, there may be inconsistencies in the dataset where a host_id is associated with different host_name values.

To clean this up, you can either change the host_name value to a consistent value based on information in the dataset, or flag the host_id_flag column to indicate that the row should be removed from downstream tasks.

For example, if you find that a host_id is associated with multiple host_name values, you may want to investigate further to determine which host_name is correct. If one of the host_name values is clearly incorrect (e.g., a misspelling or a name that does not match the owner of the property), you could update the host_name value to the correct value.

Alternatively, if you cannot determine the correct host_name value, or if you want to exclude the row from downstream tasks for other reasons, you can flag the host_id_flag column with a value of 1 to indicate that the row should be removed.

In [9]:
def clean_host_id(df):
    # Among unflagged rows, find host_ids linked to more than one host_name.
    name_counts = df[df.host_id_flag==0]
    name_counts = name_counts.groupby(["host_id","host_name"]).count()[["id"]].reset_index()
    name_counts = name_counts.groupby("host_id").count()["id"].reset_index()
    dup_host_id_ls=list(name_counts[name_counts["id"]>1]["host_id"])
    
    # Simple heuristic for a plausible name: one capital letter followed by
    # 3 to 19 lowercase letters, digits, underscores or hyphens.
    pattern = r"^[A-Z][a-z0-9_-]{3,19}$"
    
    for host_id in dup_host_id_ls:
        temp_df=df[df["host_id"]==host_id]
        temp_id_ls=list(temp_df.index)
        
        temp_match_ls=[]
        for id_ in temp_id_ls:
            string=temp_df.loc[id_,"host_name"]
            if re.match(pattern, string):
                temp_match_ls.append(string)
            else:
                # The name looks corrupted, so flag the row for deletion.
                # This is imperfect: every name under a host_id could be wrong.
                df.loc[id_,"host_id_flag"]=1
        
        # Unless exactly one distinct plausible name remains, the correct name
        # cannot be determined, so mark every row of this host for null treatment.
        if len(set(temp_match_ls))!=1:
            for id_ in temp_id_ls:
                df.loc[id_,"host_id_flag"]=2
    return df

In [10]:
airbnb_pd = clean_host_id(airbnb_pd)

# Inconsistent Host ID checking 

This query should return zero rows once you implement the cleaning process

In [11]:
dup_host_id = airbnb_pd[airbnb_pd.host_id_flag==0]
dup_host_id = dup_host_id.groupby(["host_id","host_name"]).count()[["id"]].reset_index()
dup_host_id = dup_host_id.groupby("host_id").count()["id"].reset_index()
dup_host_id[dup_host_id["id"]>1]

Unnamed: 0,host_id,id


In [12]:
## Other issues I have identified: 1) the two names can be very close to each other, which is very hard to detect; 2) the right name could be chosen based on frequency.
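
The frequency idea noted above could look roughly like the following sketch: for each host_id, take the host_name that occurs most often, and decline to choose when the top two names tie. The function name `most_common_names` and the sample pairs are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def most_common_names(pairs):
    """Map each host_id to its most frequent host_name, or None on a tie."""
    counts = defaultdict(Counter)
    for host_id, host_name in pairs:
        counts[host_id][host_name] += 1

    result = {}
    for host_id, counter in counts.items():
        ranked = counter.most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        result[host_id] = None if tie else ranked[0][0]
    return result

# Made-up (host_id, host_name) pairs with noisy spellings.
pairs = [
    (17928, "Sarah"), (17928, "Sarah"), (17928, "Saraxh"),
    (2613, "Rebecca"), (2613, "Rebeca"),
]
print(most_common_names(pairs))
```

A host_id that maps to `None` has no clear majority, so its rows would still need a flag rather than a corrected name.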

# cleanup neighbourhood
The neighbourhood column in the dataset should contain values that match the neighbourhoods defined in the official neighbourhood_list. However, there may be some values in the neighbourhood column that are incorrect due to errors or noise in the data.

To clean up the neighbourhood column, you can try to match each value in the column to a valid neighbourhood in the neighbourhood_list using a string distance function from a library such as abydos. If you can successfully match a value in the neighbourhood column to a neighbourhood in the neighbourhood_list, you can replace the value in the dataset with the correct neighbourhood name.

However, if you are unsure about how to clean up a particular value in the neighbourhood column, or if you cannot match the value to a valid neighbourhood in the neighbourhood_list, you can flag the row for deletion by setting the neighbourhood_flag column to a value of 1. If the value in the neighbourhood column is null and you cannot make a determination based on other information in the dataset, you can set the neighbourhood_flag column to a value of 2 to indicate that the row should be included but the neighbourhood value is null.

You can also use the latitude and longitude columns in the dataset to help match values in the neighbourhood column to valid neighbourhoods in the neighbourhood_list. However, you should be aware that the latitude and longitude values may also contain errors or noise, so you should exercise caution when using these columns to clean up the neighbourhood column.
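
The string-matching step described above can be tried out without abydos. The sketch below swaps in `difflib.get_close_matches` from the Python standard library, so its similarity scores will not be the same as those of an abydos measure. The function name `match_neighbourhood`, the shortened candidate list and the 0.6 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# A few valid names taken from neighbourhood_list, for illustration only.
valid_names = ["Hyde Park", "West Town", "Logan Square", "Uptown", "West Ridge"]

def match_neighbourhood(value, candidates, cutoff=0.6):
    """Return the closest candidate, or None if nothing is similar enough."""
    if not isinstance(value, str):
        return None  # null or non-text values cannot be matched
    matches = get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_neighbourhood("WeEst ToMPwn", valid_names))
print(match_neighbourhood("LogalnQ SquUare", valid_names))
print(match_neighbourhood(float("nan"), valid_names))
```

A `None` result separates the two cases in the instructions: a null input suggests `neighbourhood_flag = 2`, while a text value with no close match suggests `neighbourhood_flag = 1`.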

In [13]:
neighbourhood_list = [ 'Hyde Park', 'West Town', 'Lincoln Park', 'Near West Side', 'Lake View', 'Dunning', 'Rogers Park', 'Logan Square', 'Uptown', 'Edgewater',    'North Center', 'Albany Park', 'West Ridge', 'Pullman', 'Irving Park',    'Beverly', 'Lower West Side', 'Near South Side', 'Near North Side',    'Grand Boulevard', 'Bridgeport', 'Humboldt Park', 'Chatham', 'Kenwood',    'Loop', 'West Lawn', 'Lincoln Square', 'Woodlawn', 'Avondale',    'Forest Glen', 'Portage Park', 'East Garfield Park', 'Washington Park',    'North Lawndale', 'Armour Square', 'South Lawndale', 'South Shore',    'Morgan Park', 'South Deering', 'West Garfield Park', 'Hermosa',    'Mckinley Park', 'Douglas', 'Hegewisch', 'West Elsdon', 'Norwood Park',    'Garfield Ridge', 'Austin', 'Belmont Cragin', 'Jefferson Park', 'Ashburn',    'Greater Grand Crossing', 'North Park', 'Oakland', 'Archer Heights',    'Edison Park', 'Englewood', 'Ohare', 'Brighton Park', 'Chicago Lawn',    'New City', 'South Chicago', 'Mount Greenwood', 'Montclare', 'Roseland',    'West Englewood', 'Calumet Heights', 'Auburn Gresham', 'Fuller Park',    'Avalon Park', 'Burnside', 'Clearing', 'Gage Park', 'West Pullman',    'Washington Heights', 'East Side']
print(neighbourhood_list)

['Hyde Park', 'West Town', 'Lincoln Park', 'Near West Side', 'Lake View', 'Dunning', 'Rogers Park', 'Logan Square', 'Uptown', 'Edgewater', 'North Center', 'Albany Park', 'West Ridge', 'Pullman', 'Irving Park', 'Beverly', 'Lower West Side', 'Near South Side', 'Near North Side', 'Grand Boulevard', 'Bridgeport', 'Humboldt Park', 'Chatham', 'Kenwood', 'Loop', 'West Lawn', 'Lincoln Square', 'Woodlawn', 'Avondale', 'Forest Glen', 'Portage Park', 'East Garfield Park', 'Washington Park', 'North Lawndale', 'Armour Square', 'South Lawndale', 'South Shore', 'Morgan Park', 'South Deering', 'West Garfield Park', 'Hermosa', 'Mckinley Park', 'Douglas', 'Hegewisch', 'West Elsdon', 'Norwood Park', 'Garfield Ridge', 'Austin', 'Belmont Cragin', 'Jefferson Park', 'Ashburn', 'Greater Grand Crossing', 'North Park', 'Oakland', 'Archer Heights', 'Edison Park', 'Englewood', 'Ohare', 'Brighton Park', 'Chicago Lawn', 'New City', 'South Chicago', 'Mount Greenwood', 'Montclare', 'Roseland', 'West Englewood', 'Calu

In [14]:
def clean_neighbourhood(df):
    # Unflagged rows whose neighbourhood is not in the official list.
    neighbourhood_check = df[df.neighbourhood_flag==0]
    neighbourhood_check = neighbourhood_check[neighbourhood_check.neighbourhood.apply(lambda x:x not in neighbourhood_list)]
    neigh_ind_ls=list(neighbourhood_check.index)
    
    for neigh_ind in neigh_ind_ls:
        sus_neigh_ipt=df.loc[neigh_ind,"neighbourhood"]

        try:
            # Similarity of the suspicious value to every valid neighbourhood.
            temp_ls=[distance.sim(sus_neigh_ipt, neighbour) for neighbour in neighbourhood_list]
            if max(temp_ls)>0.5:
                # Close enough: replace with the most similar valid name.
                df.loc[neigh_ind,"neighbourhood"]=neighbourhood_list[temp_ls.index(max(temp_ls))]
            else:
                # No convincing match: flag the row for deletion.
                df.loc[neigh_ind,"neighbourhood_flag"]=1
        except Exception:
            # The value could not be compared (for example, it is null).
            df.loc[neigh_ind,"neighbourhood_flag"]=2
        
    return df

In [15]:
airbnb_pd = clean_neighbourhood(airbnb_pd)

# Neighbourhood checking

This query should return zero rows once you implement the cleaning process

In [16]:
neighbourhood_check = airbnb_pd[airbnb_pd.neighbourhood_flag==0]
neighbourhood_check = neighbourhood_check[neighbourhood_check.neighbourhood.apply(lambda x:x not in neighbourhood_list)]
neighbourhood_check[["id","neighbourhood"]]

Unnamed: 0,id,neighbourhood


In [17]:
airbnb_pd.groupby("neighbourhood_flag").count()

Unnamed: 0_level_0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,number_of_reviews_ltm,license,id_flag,host_id_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
neighbourhood_flag,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,4473,4458,4473,4473,0,4473,4457,4459,4457,4461,...,4473,3928,4473,4473,4473,4473,4473,4473,4473,4473
1,6,5,6,6,0,6,5,6,6,5,...,6,4,6,6,6,6,6,6,6,6
2,15,14,15,15,0,0,12,12,11,8,...,15,14,15,15,15,15,15,15,15,15


In [18]:
airbnb_pd[airbnb_pd.neighbourhood_flag=="0"]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,license,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag


# cleanup latitude and longitude
The latitude and longitude values in the dataset must fall within the range of -90 to +90 for latitude and -180 to +180 for longitude to ensure that they meet the criteria for analysis. We have provided a check_number function to validate the latitude and longitude columns. Any values outside of these ranges should be cleaned to meet the criteria.

If you are unsure what to do with a value, you can flag the row for deletion by setting latitude_flag or longitude_flag to 1. If the value is null, you can set the corresponding flag to 2.
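
One possible way to bring an out-of-range coordinate back into range is sketched below. It assumes the digits are correct and only the decimal point was lost (for example 4182741.0 standing for 41.82741), so it divides by 10 until the value fits. That assumption is a guess about the noise, and the function name `rescale_coordinate` is hypothetical.

```python
import math

def rescale_coordinate(value, limit):
    """Divide by 10 until |value| <= limit; return None if value is not a finite number."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(x) or math.isinf(x):
        return None
    # Assumption: only the decimal point is misplaced, the digits are right.
    while abs(x) > limit:
        x /= 10.0
    return x

print(rescale_coordinate(4182741.0, 90))      # latitude-like value
print(rescale_coordinate(-8768182.0, 180))    # longitude-like value
print(rescale_coordinate(float("nan"), 90))   # null stays unresolved
```

The heuristic is ambiguous for values that fit the range at more than one scale, so it may be worth comparing the rescaled result against other rows in the same neighbourhood before accepting it.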

In [19]:
def check_number(x,start=-90,end=90):
    # True only if x can be read as a number within [start, end]; NaN gives False.
    try:
        return start <= float(x) <= end
    except (TypeError, ValueError):
        return False

In [20]:
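def clean_latitude(df):
    # Unflagged rows whose latitude is missing, non-numeric or outside [-90, 90].
    lat_check_pd = df[df.latitude_flag==0]
    lat_check_pd = lat_check_pd[lat_check_pd.latitude.apply(lambda x:check_number(x,-90,90))==False]
    lat_ind_ls=list(lat_check_pd.index)

    for lat_id in lat_ind_ls:
        pre_lati=df.loc[lat_id,"latitude"]
        if pd.isna(pre_lati):
            # NaN never compares equal to itself, so test with pd.isna rather than ==.
            df.loc[lat_id,"latitude_flag"]=2
        else:
            try:
                # Keep only digits, the decimal point and the sign of a text value.
                revised=float(''.join(c for c in pre_lati if (c.isdigit() or c in '.-')))
                if check_number(revised):
                    df.loc[lat_id,"latitude"]=revised
                else:
                    df.loc[lat_id,"latitude_flag"]=1
            except (TypeError, ValueError):
                # Values that are not text (for example floats) cannot be filtered
                # character by character; report the row index and leave it unchanged.
                print(lat_id)
    return df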

In [21]:
airbnb_pd = clean_latitude(airbnb_pd)

15
46
61
62
73
...


# Latitude checking

This query should return zero rows once you implement the cleaning process

In [22]:
airbnb_pd.groupby("latitude_flag").count()

Unnamed: 0_level_0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,number_of_reviews_ltm,license,id_flag,host_id_flag,neighbourhood_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
latitude_flag,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,4494,4477,4494,4494,0,4479,4474,4477,4474,4474,...,4494,3946,4494,4494,4494,4494,4494,4494,4494,4494


In [23]:
lat_check_pd = airbnb_pd[airbnb_pd.latitude_flag==0]
lat_check_pd = lat_check_pd[lat_check_pd.latitude.apply(lambda x:check_number(x,-90,90))==False]
lat_check_pd[["id","latitude"]]

Unnamed: 0,id,latitude
15,312192,4182741.0
46,1027741,4185308.0
61,1668481,4189665.0
62,1668481,4189665.0
73,2396340,4189648.0
...,...,...
4407,768181481417991009,4201449.0
4416,770523073506526538,4181438.0
4422,771803500765098968,4195196.0
4425,771880753874742770,4199533.0


In [24]:
lat_check_pd = airbnb_pd[airbnb_pd.latitude_flag==0]
lat_check_pd

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,license,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
0,2384,Hyde Park - Walk to UChicago,2613,Rebecca,,Hyde Park,41.787900,-8.758780e+01,Private room,84.0,...,R17000015609,0,0,0,0,0,0,0,0,0
1,7126,Tiny Studio Apartment 94 Walk Score,17928,Sarah,,West Town,41.902890,-8.768182e+06,Entire home/apt,85.0,...,R21000075737,0,0,0,0,0,0,0,0,0
2,28749,Quirky Bucktown Loft w/ Parking NO PARTIES,27506,Lauri,,Logan Square,41.920820,-8.768012e+01,Entire home/apt,147.0,...,R22000079930,0,0,0,0,0,0,0,0,0
3,37738,Andersonville - Perfect location!,162364,Mat And Randy,,Uptown,41.972900,-8.766538e+01,Private room,110.0,...,R20000059426,0,0,0,0,0,0,0,0,0
4,71930,"Rest, Relax and Explore",334241,Michael And Veronica,,West Town,41.896150,-8.767934e+01,Private room,75.0,...,R17000013986,0,2,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4489,783222540979971927,Spacious Private Room w/Smart TV,118714865,cabaXi,,West Town,41.893030,-8.766548e+01,Private room,64.0,...,R21000061257,0,1,0,0,0,0,0,0,0
4490,783249596739848974,"South Loop 1br w/ gym & lounge, nr Grant Park",107434423,Blueground,,Loop,41.872529,-8.763086e+01,Entire home/apt,143.0,...,,0,0,0,0,0,0,0,0,0
4491,783477604930615583,2BR Charming Home in Rogers Park,459457994,Jean,,Rogers Park,41.999305,-8.766586e+01,Entire home/apt,54.0,...,R19000050250,0,2,0,0,0,0,0,0,0
4492,784266776843782956,Fast WiFi/Room with TV/2I*,186115134,Djamila,,West Town,41.891460,-8.767992e+01,Private room,31.0,...,R19000087222,0,0,0,0,0,0,0,0,0


In [25]:
lat_check_pd[lat_check_pd.latitude.apply(lambda x:(-90<float(x)<90)==False)]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,license,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
15,312192,Winter Specials. You'll love this apartment.,1505878,Bob,,Bridgeport,4182741.0,-8.763934e+01,Entire home/apt,146.0,...,2074660,0,0,0,0,0,0,0,0,0
46,1027741,Chicago Women's Residence Midler Rm,5654677,Urban Art,,South Lawndale,4185308.0,-8.770718e+01,Private room,45.0,...,R22000076746,0,0,0,0,0,0,0,0,0
61,1668481,Cozy 2BR Apt - Humboldt Park,531832,Evan,,West Town,4189665.0,-8.769719e+01,Entire home/apt,86.0,...,R18000026178,0,0,0,0,0,0,0,0,0
62,1668481,Cozy 2BR Apt - Humboldt Park,531832,Evan,,West Town,4189665.0,-8.769719e+01,Entire home/apt,86.0,...,R18000026178,1,0,0,0,0,0,0,0,0
73,2396340,"Unique, Handcrafted Art Gallery Condo in River...",5228189,Matthew,,West Town,4189648.0,-8.765651e+01,Entire home/apt,192.0,...,2385048,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4407,768181481417991009,"Comfortable 2-Bedroom, 1-Bathroom Rogers Park Apt",480338513,Lane,,Rogers Park,4201449.0,-8.766483e+01,Entire home/apt,52.0,...,R22000094683,0,2,0,0,0,0,0,0,0
4416,770523073506526538,Fun & Private Chicago Speakeasy!,55728526,Marcus,,Grand Boulevard,4181438.0,-8.761827e+06,Apartment,110.0,...,R22000092075,0,0,0,0,0,0,0,0,0
4422,771803500765098968,3BR Comfy & Bright Chicago Apartment,488113016,Alexis,,North Center,4195196.0,-8.767717e+01,Entire home/apt,67.0,...,R22000089510,0,0,0,0,0,0,0,0,0
4425,771880753874742770,Spacious 2BR Stylish Apartment in Francisco Ave,488113016,Alexis,,West Ridge,4199533.0,-8.770164e+01,Entire home/apt,66.0,...,R18000023286,0,0,0,0,0,0,0,0,0


In [27]:
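def clean_longitude(df):
    # Unflagged rows whose longitude is missing, non-numeric or outside [-180, 180].
    # Use the df argument rather than the global airbnb_pd.
    lon_check_pd = df[df.longitude_flag==0]
    lon_check_pd = lon_check_pd[lon_check_pd.longitude.apply(lambda x:check_number(x,-180,180))==False]
    long_ind_ls=list(lon_check_pd.index)

    for long_id in long_ind_ls:
        pre_long=df.loc[long_id,"longitude"]
        if pd.isna(pre_long):
            # NaN never compares equal to itself, so test with pd.isna rather than ==.
            df.loc[long_id,"longitude_flag"]=2
        else:
            try:
                # Keep only digits, the decimal point and the sign of a text value.
                revised=float(''.join(c for c in pre_long if (c.isdigit() or c in '.-')))
                if check_number(revised,-180,180):
                    df.loc[long_id,"longitude"]=revised
                else:
                    df.loc[long_id,"longitude_flag"]=1
            except (TypeError, ValueError):
                # Values that are not text (for example floats) cannot be filtered
                # character by character; report the row index and leave it unchanged.
                print(long_id)
        
    return df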

In [28]:
airbnb_pd = clean_longitude(airbnb_pd)

1
8
58
66
90
...

# Longitude checking

This query should return zero rows once you implement the cleaning process

In [29]:
lon_check_pd = airbnb_pd[airbnb_pd.longitude_flag==0]
lon_check_pd = lon_check_pd[lon_check_pd.longitude.apply(lambda x:check_number(x,-180,180))==False]
lon_check_pd[["id","longitude"]]

Unnamed: 0,id,longitude
1,7126,-8768182.0
8,189821,-8770219.0
58,1648786,
66,1921670,-8770783.0
90,2874486,-8767028.0
...,...,...
4475,780299080228475450,-8771377.0
4478,781159796541991716,-876549064.0
4480,781186064034113431,-8763016.0
4481,781186064034113431,-8763016.0


In [30]:
-180 <= float(-8.768182e+06) <= 180

False
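
The rows that remain look like a misplaced decimal point rather than random noise: listing 7126 shows -8768182.0 here, while the ground-truth file loaded later in this notebook has -87.681820 for the same id. Below is a minimal sketch of one possible repair. It assumes every such value is a valid longitude multiplied by a power of ten and that all listings fall inside a rough Chicago window. The helper name `rescale_longitude` and the bounds -88.5 to -87.0 are illustrative assumptions, not part of the assignment.

```python
import math

def rescale_longitude(value, low=-88.5, high=-87.0, max_steps=12):
    """Divide by 10 until the value lands in [low, high]; return None if it never does."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    if math.isnan(x):
        return None
    for _ in range(max_steps):
        if low <= x <= high:
            return x
        x /= 10.0
    return None

# Values taken from the check output above.
print(rescale_longitude(-8768182.0))
print(rescale_longitude(-876549064.0))
# Missing values are left for the caller to flag.
print(rescale_longitude(float("nan")))
```

A value that returns `None` would still need a flag (2 for missing, 1 for unrecoverable), so the sketch only covers the rescaling step.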

# cleanup room type
The "room_type" column in the dataset should contain one of the values defined in the list of allowed_room_type provided by the authority: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']. Any value outside of this list needs to be adjusted to one of the allowed values.

If you are unsure about how to adjust the value or cannot find a suitable value, you can flag the row for deletion by setting the value of room_type_flag to 1. If the "room_type" column has a null value and you cannot decide on an appropriate value, you can set the value of room_type_flag to 2.
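
As a small illustration of the adjustment rule, the sketch below maps a misspelled value to the closest allowed room type. It uses the standard-library `difflib` as a stand-in for a string-similarity library, so the scores will not match any other similarity measure exactly. The helper name `closest_room_type`, the 0.5 cutoff, and the sample inputs are assumptions made for the example.

```python
import difflib

allowed_room_type = ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

def closest_room_type(value, cutoff=0.5):
    """Return the best-matching allowed room type, or None when nothing is close enough."""
    if not isinstance(value, str):
        return None  # e.g. NaN: nothing to compare
    lowered = {room.lower(): room for room in allowed_room_type}
    matches = difflib.get_close_matches(
        value.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None

# Hypothetical dirty inputs, not taken from the dataset.
print(closest_room_type("Privat room"))   # Private room
print(closest_room_type("zzzz"))          # None
```

A `None` result corresponds to the two fallback cases in the text: flag 1 when no allowed value is similar enough, flag 2 when the input is null.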

In [31]:
allowed_room_type = ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']

In [32]:
def clean_room_type(df):
    # Rows still flagged as safe whose room_type is not an allowed value.
    room_type_pd = df[df.room_type_flag==0]
    room_type_pd = room_type_pd[room_type_pd.room_type.apply(lambda x: x not in allowed_room_type)]
    room_id_ls=list(room_type_pd.index)

    for room_ind in room_id_ls:
        sus_room_ipt=df.loc[room_ind,"room_type"]

        try:
            # Similarity of the suspicious value to each allowed room type.
            temp_ls=[distance.sim(sus_room_ipt, room) for room in allowed_room_type]
            if max(temp_ls)>0.5:
                # Close enough: replace with the most similar allowed value.
                df.loc[room_ind,"room_type"]=allowed_room_type[temp_ls.index(max(temp_ls))]
            else:
                # Nothing similar enough: flag for deletion.
                df.loc[room_ind,"room_type_flag"]=1
        except Exception:
            # The value could not be compared (presumably a null): flag for null treatment.
            df.loc[room_ind,"room_type_flag"]=2

    return df

In [33]:
airbnb_pd = clean_room_type(airbnb_pd)

# room_type checking

This query should return zero rows once you implement the cleaning process

In [34]:
room_type_pd = airbnb_pd[airbnb_pd.room_type_flag==0]
room_type_pd = room_type_pd[room_type_pd.room_type.apply(lambda x: x not in allowed_room_type)]
room_type_pd[["id","room_type"]]

Unnamed: 0,id,room_type


In [35]:
#airbnb_pd[airbnb_pd.room_type_flag==0].room_type.unique()

# cleanup minimum_nights and number_of_reviews

The columns "minimum_nights" and "number_of_reviews" should both be integer values. "minimum_nights" should be a value between 1 and the number of days in a year (365), while "number_of_reviews" should be a value between 0 and 999999.

To check if these columns meet the criteria, we have provided a "check_integer" function. Any values that do not meet the criteria should be cleaned to meet the criteria for analysis.

If you are unsure what to do with a value, flag the row for deletion by setting "minimum_nights_flag" or "number_of_reviews_flag" to 1. If the value is null, set the flag to 2 instead.

In [36]:
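def check_integer(x,start=1,end=365):
    # Returns True for an integer in [start, end] and False when int(x) fails.
    # NOTE: an integer outside the range falls through and returns None, not False,
    # so a test of the form `check_integer(x) == False` does not select it.
    try:
        temp_x = int(x)
        if start <= temp_x <= end:
            return True
    except:
        return False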
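
One detail of `check_integer` is easy to miss: when `int(x)` succeeds but the value is out of range, the function returns `None` rather than `False`. The sketch below reproduces the helper and compares it with a hypothetical strict variant (`check_integer_strict` is not part of the notebook) to show how this affects a `== False` test.

```python
def check_integer(x, start=1, end=365):
    # Copy of the provided helper: falls through (returns None) when out of range.
    try:
        temp_x = int(x)
        if start <= temp_x <= end:
            return True
    except:
        return False

def check_integer_strict(x, start=1, end=365):
    # Hypothetical variant that always returns a bool.
    try:
        return start <= int(x) <= end
    except (TypeError, ValueError):
        return False

print(check_integer(1000))                  # None
print(check_integer(1000) == False)         # False: the out-of-range value is not selected
print(check_integer_strict(1000) == False)  # True
print(check_integer("abc") == False)        # True: unparseable values are selected either way
```

With the provided helper, the checking queries therefore catch only values that cannot be parsed as integers, not integers outside the allowed range.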

In [37]:
def clean_minimum_nights(df):
    # Rows still flagged as safe whose minimum_nights fails the integer check.
    min_check_pd = df[df.minimum_nights_flag==0]
    min_check_pd = min_check_pd[min_check_pd.minimum_nights.apply(lambda x:check_integer(x,1,365))==False]
    ng_ind_ls=list(min_check_pd.index)

    for ng_id in ng_ind_ls:
        pre_ng=df.loc[ng_id,"minimum_nights"]

        if isinstance(pre_ng,str):
            # Keep only the digits of a text value and re-check the result.
            digits=''.join(c for c in pre_ng if c.isdigit())
            if not digits:
                # No digits to recover: report the index for manual review.
                print(ng_id)
            elif check_integer(int(digits),1,365):
                df.loc[ng_id,"minimum_nights"]=int(digits)
            else:
                df.loc[ng_id,"minimum_nights_flag"]=1
        elif pd.isna(pre_ng):
            # Missing value: flag for null treatment.
            df.loc[ng_id,"minimum_nights_flag"]=2

    return df

In [38]:
airbnb_pd = clean_minimum_nights(airbnb_pd)

# Minimum nights checking

This query should return zero rows once you implement the cleaning process

In [39]:
min_check_pd = airbnb_pd[airbnb_pd.minimum_nights_flag==0]
min_check_pd = min_check_pd[min_check_pd.minimum_nights.apply(lambda x:check_integer(x,1,365))==False]
min_check_pd[["id","minimum_nights"]]

Unnamed: 0,id,minimum_nights


In [40]:
def clean_number_of_reviews(df):
    # Rows still flagged as safe whose number_of_reviews fails the integer check.
    nr_check_pd = df[df.number_of_reviews_flag==0]
    nr_check_pd = nr_check_pd[nr_check_pd.number_of_reviews.apply(lambda x:check_integer(x,0,999999))==False]
    nr_ind_ls=list(nr_check_pd.index)

    for nr_id in nr_ind_ls:
        pre_nr=df.loc[nr_id,"number_of_reviews"]
        if isinstance(pre_nr,str):
            # Keep only the digits of a text value and re-check the result.
            digits=''.join(c for c in pre_nr if c.isdigit())
            if not digits:
                # No digits to recover: report the index for manual review.
                print(nr_id)
            # Re-check against the review range (0 to 999999), not the
            # minimum_nights default of 1 to 365.
            elif check_integer(int(digits),0,999999):
                df.loc[nr_id,"number_of_reviews"]=int(digits)
            else:
                df.loc[nr_id,"number_of_reviews_flag"]=1
        elif pd.isna(pre_nr):
            # Missing value: flag for null treatment.
            df.loc[nr_id,"number_of_reviews_flag"]=2
    return df

In [41]:
airbnb_pd = clean_number_of_reviews(airbnb_pd)

# Clean number of reviews checking

This query should return zero rows once you implement the cleaning process

In [42]:
min_check_pd = airbnb_pd[airbnb_pd.number_of_reviews_flag==0]
min_check_pd = min_check_pd[min_check_pd.number_of_reviews.apply(lambda x:check_integer(x,0,999999))==False]
min_check_pd[["id","number_of_reviews"]]

Unnamed: 0,id,number_of_reviews


# cleanup last_review

The "last_review" column should be in the format of ISO-date (yyyy-mm-dd). We have provided a "check_date" function to verify the date format.

If a value does not match the date format and you are unsure how to handle it, flag the row for deletion by setting "last_review_flag" to 1. If the value is null, set the flag to 2 instead.


In [43]:
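from datetime import datetime
def check_date(x,fmt="%Y-%m-%d"):
    # True if x is a string that parses with fmt; False for anything else, including NaN.
    # NOTE: strptime also accepts unpadded fields such as "2022-1-5".
    try:
        datetime.strptime(x,fmt)
        return True
    except (TypeError, ValueError):
        return False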
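
For values that fail `check_date`, one option is to parse them with the format they appear to be in and re-emit them as ISO dates. The sketch below assumes the malformed values are US-style `m/d/yy` strings. That format, the helper name `to_iso_date`, and the sample inputs are assumptions for illustration.

```python
from datetime import datetime

def to_iso_date(value, source_fmt="%m/%d/%y"):
    """Return value reformatted as yyyy-mm-dd, or None if it does not match source_fmt."""
    if not isinstance(value, str):
        return None  # e.g. NaN: leave for null treatment
    try:
        parsed = datetime.strptime(value.strip(), source_fmt)
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%d")

# Hypothetical inputs, not taken from the dataset.
print(to_iso_date("11/5/22"))      # 2022-11-05
print(to_iso_date("not a date"))   # None
```

Note that `%y` maps two-digit years 69 to 99 onto 1969 to 1999, which differs from simply prefixing "20". For recent review dates the two approaches agree.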

In [44]:
def clean_last_reviews(df):
    # Rows still flagged as safe whose last_review fails the date check.
    last_review_check_pd = df[df.last_review_flag==0]
    last_review_check_pd = last_review_check_pd[last_review_check_pd.last_review.apply(lambda x:check_date(x))==False]
    lr_ind_ls=list(last_review_check_pd.index)

    for lr_id in lr_ind_ls:
        pre_lr=df.loc[lr_id,"last_review"]
        if isinstance(pre_lr,str):
            # Assumes malformed values look like m/d/yy with a 21st-century year.
            parts=pre_lr.replace("/", " ").split()
            if len(parts)>=3:
                mon_ele,date_ele,year_ele=parts[:3]
                # Zero-pad month and day so the stored value is strict yyyy-mm-dd.
                revised_date="20"+year_ele+"-"+mon_ele.zfill(2)+"-"+date_ele.zfill(2)
            else:
                revised_date=None
            if check_date(revised_date):
                df.loc[lr_id,"last_review"]=revised_date
            else:
                df.loc[lr_id,"last_review_flag"]=1
        elif pd.isna(pre_lr):
            # Missing value: flag for null treatment.
            df.loc[lr_id,"last_review_flag"]=2
    return df

In [45]:
airbnb_pd = clean_last_reviews(airbnb_pd)

# Last Review checking

This query should return zero rows once you implement the cleaning process

In [46]:
last_review_check_pd = airbnb_pd[airbnb_pd.last_review_flag==0]
last_review_check_pd = last_review_check_pd[last_review_check_pd.last_review.apply(lambda x:check_date(x))==False]
last_review_check_pd[["id","last_review"]]

Unnamed: 0,id,last_review


# save the dataset to csv

In [47]:
airbnb_pd.to_csv("chicago_vert_dataset_cleaned.csv")

# columns that may be used for analysis:
id,name,host_id,host_name,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,price

In [48]:
columns_used = ["id","name","host_id","host_name",
                         "neighbourhood","latitude","longitude",
                         "room_type","minimum_nights","number_of_reviews","last_review","price"]

In [49]:
airbnb_pd[columns_used]

Unnamed: 0,id,name,host_id,host_name,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,price
0,2384,Hyde Park - Walk to UChicago,2613,Rebecca,Hyde Park,41.787900,-8.758780e+01,Private room,3,211,2022-11-18,84.0
1,7126,Tiny Studio Apartment 94 Walk Score,17928,Sarah,West Town,41.902890,-8.768182e+06,Entire home/apt,2,475,2022-12-05,85.0
2,28749,Quirky Bucktown Loft w/ Parking NO PARTIES,27506,Lauri,Logan Square,41.920820,-8.768012e+01,Entire home/apt,2,164,2022-12-04,147.0
3,37738,Andersonville - Perfect location!,162364,Mat And Randy,Uptown,41.972900,-8.766538e+01,Private room,3,250,2020-03-15,110.0
4,71930,"Rest, Relax and Explore",334241,Michael And Veronica,West Town,41.896150,-8.767934e+01,Private room,3,97,2022-11-16,75.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4489,783222540979971927,Spacious Private Room w/Smart TV,118714865,cabaXi,West Town,41.893030,-8.766548e+01,Private room,25,0,,64.0
4490,783249596739848974,"South Loop 1br w/ gym & lounge, nr Grant Park",107434423,Blueground,Loop,41.872529,-8.763086e+01,Entire home/apt,32,0,,143.0
4491,783477604930615583,2BR Charming Home in Rogers Park,459457994,Jean,Rogers Park,41.999305,-8.766586e+01,Entire home/apt,1,0,,54.0
4492,784266776843782956,Fast WiFi/Room with TV/2I*,186115134,Djamila,West Town,41.891460,-8.767992e+01,Private room,3,0,,31.0


# Checking and comparing the result against the ground truth

In [56]:
check_pd = pd.read_csv("/home/deck/projects/collaboration_simulation/airbnb_test_case/airbnb_sample2.csv")

In [57]:
airbnb_pd[airbnb_pd.id_flag==0]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,...,license,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
0,2384,Hyde Park - Walk to UChicago,2613,Rebecca,,Hyde Park,41.787900,-8.758780e+01,Private room,84.0,...,R17000015609,0,0,0,0,0,0,0,0,0
1,7126,Tiny Studio Apartment 94 Walk Score,17928,Sarah,,West Town,41.902890,-8.768182e+06,Entire home/apt,85.0,...,R21000075737,0,0,0,0,0,0,0,0,0
2,28749,Quirky Bucktown Loft w/ Parking NO PARTIES,27506,Lauri,,Logan Square,41.920820,-8.768012e+01,Entire home/apt,147.0,...,R22000079930,0,0,0,0,0,0,0,0,0
3,37738,Andersonville - Perfect location!,162364,Mat And Randy,,Uptown,41.972900,-8.766538e+01,Private room,110.0,...,R20000059426,0,0,0,0,0,0,0,0,0
4,71930,"Rest, Relax and Explore",334241,Michael And Veronica,,West Town,41.896150,-8.767934e+01,Private room,75.0,...,R17000013986,0,2,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4489,783222540979971927,Spacious Private Room w/Smart TV,118714865,cabaXi,,West Town,41.893030,-8.766548e+01,Private room,64.0,...,R21000061257,0,1,0,0,0,0,0,2,0
4490,783249596739848974,"South Loop 1br w/ gym & lounge, nr Grant Park",107434423,Blueground,,Loop,41.872529,-8.763086e+01,Entire home/apt,143.0,...,,0,0,0,0,0,0,0,2,0
4491,783477604930615583,2BR Charming Home in Rogers Park,459457994,Jean,,Rogers Park,41.999305,-8.766586e+01,Entire home/apt,54.0,...,R19000050250,0,2,0,0,0,0,0,2,0
4492,784266776843782956,Fast WiFi/Room with TV/2I*,186115134,Djamila,,West Town,41.891460,-8.767992e+01,Private room,31.0,...,R19000087222,0,0,0,0,0,0,0,2,0


In [58]:
check_pd

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,2384,Hyde Park - Walk to UChicago,2613,Rebecca,,Hyde Park,41.787900,-87.587800,Private room,84,3,211,2022-11-18,2.18,1,365,19,R17000015609
1,7126,Tiny Studio Apartment 94 Walk Score,17928,Sarah,,West Town,41.902890,-87.681820,Entire home/apt,85,2,475,2022-12-05,2.90,1,325,45,R21000075737
2,1195259,Wicker Park - Garden At The Heart of It All,2899484,Jason,,West Town,41.910950,-87.677830,Entire home/apt,120,6,205,2022-11-18,1.76,2,363,11,R19000037818
3,3453656,Private large room with your own bathroom,17405364,Jennifer,,Rogers Park,41.999930,-87.668440,Private room,122,14,39,2022-11-07,0.43,2,148,6,R21000073244
4,28749,Quirky Bucktown Loft w/ Parking NO PARTIES,27506,Lauri,,Logan Square,41.920820,-87.680120,Entire home/apt,147,2,164,2022-12-04,1.12,1,337,38,R22000079930
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4275,771860062577897747,Comfy & Artistic 2BR Home in Francisco Ave,488113016,Alexis,,West Ridge,41.996700,-87.701700,Entire home/apt,72,1,0,,,8,63,0,R21000062411
4276,771880753874742770,Spacious 2BR Stylish Apartment in Francisco Ave,488113016,Alexis,,West Ridge,41.995330,-87.701640,Entire home/apt,66,1,0,,,8,70,0,R18000023286
4277,771881329605160225,3BR Spacious Gorgeous Apt in Rogers Park,488113016,Alexis,,Rogers Park,42.011237,-87.672098,Entire home/apt,92,1,0,,,8,66,0,R22000080656
4278,780097517605219739,Luxurious Private King Room,316658141,Anastasia,,Loop,41.874874,-87.629512,Private room,143,4,0,,,1,58,0,R13243546901


In [59]:
list(filter(lambda x:x.endswith("flag"),airbnb_pd.columns))

['id_flag',
 'host_id_flag',
 'neighbourhood_flag',
 'latitude_flag',
 'longitude_flag',
 'minimum_nights_flag',
 'number_of_reviews_flag',
 'last_review_flag',
 'room_type_flag']

In [60]:
airbnb_output_list = airbnb_pd[(airbnb_pd.id_flag==0)&(airbnb_pd.host_id_flag==0)&
          (airbnb_pd.neighbourhood_flag==0)&(airbnb_pd.latitude_flag==0)&
         (airbnb_pd.longitude_flag==0)&(airbnb_pd.minimum_nights_flag==0)&
         (airbnb_pd.number_of_reviews_flag==0)&(airbnb_pd.last_review_flag==0)&
         (airbnb_pd.room_type_flag==0)
         ]
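
The nine-way conjunction above can also be written against the list of flag columns, which keeps the filter in step with the `*_flag` columns listed earlier. This is a sketch on a toy frame. The helper name `rows_with_all_flags_zero` and the toy data are illustrative, and it assumes every column ending in `_flag` should be 0 for a row to be kept.

```python
import pandas as pd

def rows_with_all_flags_zero(df):
    """Keep only rows where every *_flag column equals 0."""
    flag_cols = [c for c in df.columns if c.endswith("_flag")]
    return df[(df[flag_cols] == 0).all(axis=1)]

# Toy data standing in for airbnb_pd.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "id_flag": [0, 0, 1],
    "room_type_flag": [0, 2, 0],
})
print(rows_with_all_flags_zero(toy)["id"].tolist())  # [1]
```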

In [62]:
merge_check = check_pd.merge(airbnb_output_list,left_on="id",right_on="id")

In [63]:
set_col = set(check_pd.columns)
set_col = set_col  - {"id"}

In [64]:
#merge_check[f"{x}_x"].fillna("").apply(lambda x:str(x))

In [65]:
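for col in set_col:
    try:
        # Compare as strings, treating missing values on either side as "".
        # NOTE: this counts 84 and 84.0 as different, so an int column compared
        # with a float column is reported as mismatching on every row.
        left = merge_check[f"{col}_x"].fillna("").apply(str)
        right = merge_check[f"{col}_y"].fillna("").apply(str)
        print(col,sum(left != right))
    except Exception:
        print("failed",col)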

license 0
host_id 0
availability_365 0
room_type 0
name 2
latitude 149
longitude 134
neighbourhood 0
number_of_reviews_ltm 0
calculated_host_listings_count 0
minimum_nights 0
number_of_reviews 0
neighbourhood_group 0
price 2601
reviews_per_month 0
host_name 0
last_review 0


In [66]:
missing_check = check_pd.merge(airbnb_output_list,left_on="id",right_on="id",how="left")
#missing_check[missing_check.]
missing_check[missing_check.license_y.isnull()]

Unnamed: 0,id,name_x,host_id_x,host_name_x,neighbourhood_group_x,neighbourhood_x,latitude_x,longitude_x,room_type_x,price_x,...,license_y,id_flag,host_id_flag,neighbourhood_flag,latitude_flag,longitude_flag,minimum_nights_flag,number_of_reviews_flag,last_review_flag,room_type_flag
2,1195259,Wicker Park - Garden At The Heart of It All,2899484,Jason,,West Town,41.910950,-87.677830,Entire home/apt,120,...,,,,,,,,,,
3,3453656,Private large room with your own bathroom,17405364,Jennifer,,Rogers Park,41.999930,-87.668440,Private room,122,...,,,,,,,,,,
8,71930,"Rest, Relax and Explore",334241,Michael And Veronica,,West Town,41.896150,-87.679340,Private room,75,...,,,,,,,,,,
9,84042,The Explorer Room,334241,Michael And Veronica,,West Town,41.896150,-87.679340,Private room,75,...,,,,,,,,,,
13,3570616,1 Bedroom Victorian Beauty at Logan Square!,3920450,Vas,,Logan Square,41.929250,-87.718240,Entire home/apt,60,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4277,771860062577897747,Comfy & Artistic 2BR Home in Francisco Ave,488113016,Alexis,,West Ridge,41.996700,-87.701700,Entire home/apt,72,...,,,,,,,,,,
4278,771880753874742770,Spacious 2BR Stylish Apartment in Francisco Ave,488113016,Alexis,,West Ridge,41.995330,-87.701640,Entire home/apt,66,...,,,,,,,,,,
4279,771881329605160225,3BR Spacious Gorgeous Apt in Rogers Park,488113016,Alexis,,Rogers Park,42.011237,-87.672098,Entire home/apt,92,...,,,,,,,,,,
4280,780097517605219739,Luxurious Private King Room,316658141,Anastasia,,Loop,41.874874,-87.629512,Private room,143,...,,,,,,,,,,
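
Testing `license_y` for nulls finds ground-truth rows with no match, but it also picks up matched rows whose license is simply empty (row 4490 in the cleaned table above has none). A sketch of an alternative on toy data, using the `indicator` option of `merge`. The frames `truth` and `cleaned` are stand-ins for `check_pd` and `airbnb_output_list`.

```python
import pandas as pd

truth = pd.DataFrame({"id": [1, 2, 3], "license": ["A", None, "C"]})
cleaned = pd.DataFrame({"id": [1, 2], "license": ["A", None]})

# indicator=True adds a _merge column: "both" or "left_only" for a left join.
joined = truth.merge(cleaned, on="id", how="left", indicator=True)

missing = joined[joined["_merge"] == "left_only"]
by_null = joined[joined["license_y"].isnull()]

print(missing["id"].tolist())  # [3]
print(by_null["id"].tolist())  # [2, 3]
```

Here id 2 is present in both frames and is only reported by the null test because its license is empty.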


In [67]:
merge_check[["price_x","price_y"]]

Unnamed: 0,price_x,price_y
0,84,84.0
1,85,85.0
2,147,147.0
3,112,112.0
4,110,110.0
...,...,...
2596,192,192.0
2597,101,101.0
2598,150,150.0
2599,45,45.0


In [69]:
merge_check.price_x

0        84
1        85
2       147
3       112
4       110
       ... 
2596    192
2597    101
2598    150
2599     45
2600    549
Name: price_x, Length: 2601, dtype: int64
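
The price columns hold the same numbers, but `price_x` is `int64` and `price_y` is float, so their string forms ("84" and "84.0") never match. That accounts for the price count of 2601, which equals the number of merged rows. A minimal sketch of a comparison that treats numeric columns numerically follows. The helper name `count_mismatches`, the tolerance, and the toy data are assumptions.

```python
import pandas as pd

def count_mismatches(left, right, tol=1e-6):
    """Count rows where two columns differ, comparing numerically when both sides parse as numbers."""
    left_num = pd.to_numeric(left, errors="coerce")
    right_num = pd.to_numeric(right, errors="coerce")
    both_numeric = left_num.notna() & right_num.notna()
    both_missing = left.isna() & right.isna()
    numeric_diff = (left_num - right_num).abs() > tol
    text_diff = left.astype(str) != right.astype(str)
    # Use the numeric test where possible, the string test otherwise.
    differs = numeric_diff.where(both_numeric, text_diff) & ~both_missing
    return int(differs.sum())

# Toy columns mirroring the int-versus-float situation above.
toy = pd.DataFrame({"price_x": [84, 85, 147], "price_y": [84.0, 85.0, 150.0]})
print(count_mismatches(toy["price_x"], toy["price_y"]))  # 1
```

A plain string comparison on the same toy frame would report all three rows as different.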

In [None]:
merge_check[["room_type_x","room_type_y"]]

In [None]:
airbnb_output_list.groupby(["room_type","room_type_flag"]).count()