owner listings - exploratory data analysis and cleaning
===
# introduction
we start by getting to know the data: its shape and its hidden warts. then we apply data cleaning and preprocessing.


In [1]:
from datetime import datetime
import duckdb
import numpy as np
import pandas as pd
import ydata_profiling as yp

DATAFOLDER='../data/raw/'        # path to raw data files
DATABASEFILE='../data/sits.ddb'  # path to persistent duckdb database file

In [2]:
np.iinfo(np.int32).max

2147483647
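
the int32 ceiling above matters because the id and count columns are cast to fixed-width integers further down. a minimal sketch of a range check to run before such a cast, assuming a hypothetical helper `fits_dtype` (not part of this notebook) and toy values:

```python
import numpy as np
import pandas as pd

def fits_dtype(values: pd.Series, dtype) -> bool:
    """Return True if every non-missing value lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    clean = values.dropna()
    if clean.empty:
        return True
    return bool(clean.min() >= info.min and clean.max() <= info.max)

# toy data, not taken from listings.csv
toy = pd.Series([0, 12, 255, np.nan])
print(fits_dtype(toy, np.uint8))      # largest value is 255
print(fits_dtype(toy + 1, np.uint8))  # largest value is 256
```

a check like this could be looped over the entries of a dtype mapping before passing it to `read_csv`.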

# homeowner listings
in this data set, each row represents one homeowner's listing. the data dictionary describes the columns in detail.
many of the columns are boolean and some are datetimes. the rest are strings, which can either be treated as categorical variables or mined for one-hot encoded indicator variables.
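
a minimal sketch of mining indicator columns from a string column, assuming a hypothetical helper `add_indicators` and made-up toy values (the real vocabulary comes from the data dictionary). `regex=False` keeps tokens such as `18+` literal instead of treating `+` as a regex quantifier, and `na=False` maps missing strings to False:

```python
import pandas as pd

def add_indicators(df: pd.DataFrame, source: str, tokens: dict) -> pd.DataFrame:
    """tokens maps a new column name to the literal substring to look for."""
    flags = {
        name: df[source].str.contains(tok, regex=False, na=False)
        for name, tok in tokens.items()
    }
    return df.assign(**flags)

# toy data, not taken from listings.csv
toy = pd.DataFrame({"children_preferences": ["0-3, 18+", None, "under 18"]})
out = add_indicators(
    toy,
    "children_preferences",
    {"welcomes_baby": "0-3", "welcomes_young": "18+"},
)
print(out)
```

substring matching is the simplest option but it is greedy: a token like `cat` would also match inside a longer word, so exact token matching may be safer once the delimiter is known.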

## assumption
for each boolean variable, a missing answer is interpreted as FALSE, i.e. we read each flag as _confirmed to be true_ rather than simply _true_. this assumption will not hold in every case, but it should hold more often than not, and it lets us keep these columns as plain booleans with no missing values.
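
one pitfall when applying this assumption in pandas: casting to `bool` before filling turns missing values into True, because NaN is truthy. a small sketch on toy data (not the listings file) showing why the fill has to come before the cast:

```python
import numpy as np
import pandas as pd

# a boolean column with a missing answer, as read_csv would typically load it
raw = pd.Series([True, False, np.nan], dtype="object")

# cast first: NaN becomes True and there is nothing left to fill
cast_then_fill = raw.astype("bool").fillna(False)

# fill first: the missing answer becomes False, then the cast is safe
fill_then_cast = raw.fillna(False).astype("bool")

print(cast_then_fill.tolist())
print(fill_then_cast.tolist())
```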

In [3]:
date_columns=[
    'approved_dt',
    'modified_dt',
    'completed_dt',
]
boolean_columns = [
    'car_required',
    'car_included',
    'disabled_access',
    # 'wifi_available',
    'family_friendly',
    'pets_welcome',
    # 'suitable_for_home_working',
]
categorical_features=[
    'home_type',
    'photo_tags',
    'pets',
    'other_animals',
    'children_preferences',
    'family_preferences',
]

data_types={
    'id':np.int32,
    'nb_of_photos':np.uint8,
    'nb_of_pets':np.uint8,
    'nb_distinct_pets':np.uint8,
    'nb_assignments_filled':np.uint8,
    'nb_assignments_published':np.uint16,  
    'nb_unique_sitters':np.uint16,
    'nb_repeat_sitters':np.uint16,
    'nb_domestic_sitters':np.uint16,
    'avg_nb_apps_per_assg':np.uint16,
    'nb_invites':np.uint8,
    'minutes_pet_can_be_left_alone':np.uint16,
}
today = pd.Timestamp.now().floor('D') 

In [11]:
# temp= pd.read_csv(DATAFOLDER+'listings.csv')
# temp['pct_complete'].value_counts()

In [16]:
listings_df =  (
    pd.read_csv(
        DATAFOLDER+'listings.csv', 
        low_memory=False,
        parse_dates=date_columns,
        date_format='%Y-%m-%d %H:%M:%S.%f',
        # dtype=data_types
    )
    .query('pct_complete >= 90') # focus on profiles near completion and ignore partial profiles 
    .query('completed_dt.notnull()', engine='python') # drop noncompleted
    .assign(
        days_since_modified           = lambda x: (today - x['modified_dt']).dt.days.astype('Int16'),
        year_approved                 = lambda x: pd.Series((x['approved_dt'].dt.year),dtype='Int16'),
        avg_nb_apps_per_assg          = lambda x: pd.to_numeric(x['avg_nb_apps_per_assg'].fillna(0), errors='coerce', downcast='integer'),
        children_preferences          = lambda x: x['children_preferences'].fillna('No preference'),
        attraction_city               = lambda x: x['local_attractions'].str.contains('city').fillna(False),
        attraction_beach              = lambda x: x['local_attractions'].str.contains('beach').fillna(False),
        attraction_mountain           = lambda x: x['local_attractions'].str.contains('mountain').fillna(False),
        attraction_countryside        = lambda x: x['local_attractions'].str.contains('countryside').fillna(False),       
        pet_dog                       = lambda x: x['pets'].str.contains('dog').fillna(False),
        pet_cat                       = lambda x: x['pets'].str.contains('cat').fillna(False),
        pet_bird                      = lambda x: x['pets'].str.contains('bird').fillna(False),
        pet_fish                      = lambda x: x['pets'].str.contains('fish').fillna(False),
        pet_reptile                   = lambda x: x['pets'].str.contains('reptile').fillna(False),
        pet_poultry                   = lambda x: x['pets'].str.contains('poultry').fillna(False),
        pet_farm_animal               = lambda x: x['pets'].str.contains('farm-animal').fillna(False),
        photo_interior                = lambda x: x['photo_tags'].str.contains('interior').fillna(False),
        photo_exterior                = lambda x: x['photo_tags'].str.contains('exterior').fillna(False),
        photo_attraction              = lambda x: x['photo_tags'].str.contains('local-attraction').fillna(False),
        photo_garden                  = lambda x: x['photo_tags'].str.contains('garden').fillna(False),
        photo_pool                    = lambda x: x['photo_tags'].str.contains('pool').fillna(False),
        photo_view                    = lambda x: x['photo_tags'].str.contains('view').fillna(False),
        wish_to_video_call            = lambda x: x['wish_to_video_call'].fillna(False).astype('bool'),
        wish_to_meet_in_person        = lambda x: x['wish_to_meet_in_person'].fillna(False).astype('bool'),
        welcomes_single               = lambda x: x['family_preferences'].str.contains('single').fillna(False),
        welcomes_couple               = lambda x: x['family_preferences'].str.contains('couple').fillna(False),
        welcomes_family               = lambda x: x['family_preferences'].str.contains('family').fillna(False),
        welcomes_any_age              = lambda x: x['children_preferences'].str.contains('No preference').fillna(False),
        welcomes_baby                 = lambda x: x['children_preferences'].str.contains('0-3').fillna(False),
        welcomes_toddler              = lambda x: x['children_preferences'].str.contains('4-7').fillna(False),
        welcomes_child                = lambda x: x['children_preferences'].str.contains('8-12').fillna(False),
        welcomes_teen                 = lambda x: x['children_preferences'].str.contains('13-17').fillna(False),
        welcomes_young                = lambda x: x['children_preferences'].str.contains('18+', regex=False).fillna(False),
        home_type                     = lambda x: pd.Categorical(x['home_type'].fillna('unknown')),
        other_animals                 = lambda x: pd.Categorical(x['other_animals'].fillna('None')),
        id                            = lambda x: x['id'].fillna(0).astype('int32'),
        nb_assignments_filled         = lambda x: x['nb_assignments_filled'].fillna(0).astype('int32'),
        nb_assignments_published      = lambda x: x['nb_assignments_published'].fillna(0).astype('int32'),
        nb_distinct_pets              = lambda x: x['nb_distinct_pets'].fillna(0).astype('int32'),
        nb_domestic_sitters           = lambda x: x['nb_domestic_sitters'].fillna(0).astype('int32'),
        nb_invites                    = lambda x: x['nb_invites'].fillna(0).astype('int32'),
        nb_of_pets                    = lambda x: x['nb_of_pets'].fillna(0).astype('int32'),
        nb_of_photos                  = lambda x: x['nb_of_photos'].fillna(0).astype('int32'),
        nb_repeat_sitters             = lambda x: x['nb_repeat_sitters'].fillna(0).astype('int32'),
        nb_unique_sitters             = lambda x: x['nb_unique_sitters'].fillna(0).astype('int32'),
        minutes_pet_can_be_left_alone = lambda x: x['minutes_pet_can_be_left_alone'].fillna(120).astype('int16'),  # assumption: missing values default to 120 minutes
    )
    .drop(columns=[
        'modified_dt',          # won't need
        'latitude',             # don't know what to do with
        'longitude',            # don't know what to do with
        'local_attractions',    # already used
        'wifi_available',       # 100% available: offers no information
        'photo_tags',           # already used
        'pct_complete',         # filtered low completion. already used
        'family_preferences',   # already used
        'children_preferences', # already used
        'pets',                 # used up
        'user_id',              # corrupted data!
        'approved_dt',
        'completed_dt',
    ])
)
# for booleans, if the answer is missing, interpret it as 'no'.
# fill before casting: NaN cast to bool becomes True.
for col in boolean_columns:
    listings_df[col] = listings_df[col].fillna(False).astype('bool')

display(listings_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 61961 entries, 0 to 70859
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   accessible_by_public_transport  61961 non-null  bool    
 1   avg_nb_apps_per_assg            61961 non-null  int8    
 2   car_included                    61961 non-null  bool    
 3   car_required                    61961 non-null  bool    
 4   disabled_access                 61961 non-null  bool    
 5   family_friendly                 61961 non-null  bool    
 6   home_type                       61961 non-null  category
 7   id                              61961 non-null  int32   
 8   minutes_pet_can_be_left_alone   61961 non-null  int16   
 9   nb_assignments_filled           61961 non-null  int32   
 10  nb_assignments_published        61961 non-null  int32   
 11  nb_distinct_pets                61961 non-null  int32   
 12  nb_domestic_sitters    

None

In [6]:
listings_df['welcomes_any_age'].value_counts()

welcomes_any_age
True     59703
False     5546
Name: count, dtype: int64

In [7]:
listings_df.to_parquet('../data/listings_prep.parquet')

# save listing features in the persistent database file

In [8]:
with duckdb.connect(DATABASEFILE) as con:
    con.sql("DROP TABLE IF EXISTS listings_prep")
    con.sql("CREATE TABLE listings_prep AS SELECT * FROM listings_df")
    con.sql("SELECT COUNT(*) FROM listings_prep").show()  # print the row count; a bare con.sql() inside the block displays nothing

# exploratory data analysis reporting

In [9]:
profile = yp.ProfileReport(listings_df)
# profile.to_notebook_iframe()
profile.to_file('../notes/eda/eda_report_listings.html')

# lessons from exploratory analysis report
1. `user_id` is unusable and looks corrupted: only 17k records have a value, and those values are booleans rather than identifiers.
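
a finding like the one about `user_id` can be reproduced with a quick per-column check. a sketch with a hypothetical helper `column_health` on toy data (the real numbers come from the profiling report):

```python
import pandas as pd

def column_health(df: pd.DataFrame, col: str) -> dict:
    """Summarise how populated a column is and which values it holds."""
    s = df[col]
    present = s.dropna()
    return {
        "non_null": int(present.size),
        "share_non_null": float(present.size / len(s)) if len(s) else 0.0,
        "n_unique": int(present.nunique()),
        "values": sorted(present.unique().tolist(), key=str)[:5],
    }

# toy data: an "id" column that is mostly empty and holds booleans
toy = pd.DataFrame({"user_id": [True, None, False, None, True, None, None, None]})
report = column_health(toy, "user_id")
print(report)
```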