sitter profiles - exploratory data analysis and cleaning
===
# introduction
we'll start by getting to know the data: discovering its shape and hidden warts, then cleaning and preprocessing it.

the data tables are:

- profiles (housesitter profiles)
- listings (homeowner listings)
- pets (details of the pets belonging to a listing)
- users (user contact details)
- applications (requests by a sitter to do a sit, and invitations from homeowners to sitters)
- assignments (sitter + sits connections)
- feedback (feedback left by housesitters on the sit)
- reviews (feedback left by homeowners on the sitter)

In [1]:
from datetime import datetime
import duckdb
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
import ydata_profiling as yp

DATAFOLDER='../data/raw/'        # path to raw data files
DATABASEFILE='../data/sits.ddb'  # path to persistent duckdb database file

In [2]:
# np.iinfo(np.int16).max

# housesitter profiles
in this data set, each row represents a house sitter's profile. the data dictionary tells us a lot about the columns included.
many of the columns are boolean, some are datetimes, and others are strings that can either be interpreted as categorical variables or harvested for one-hot encoded variables.
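
a minimal sketch of how flags can be harvested from a multi-valued string by pattern matching. the sample strings and the helper name `has_pattern` are illustrative assumptions, not values from the data set. it highlights two pitfalls: `+` is a regex quantifier, and a short pattern such as `cat` also matches inside longer words such as `medication`.

```python
import re

def has_pattern(text, pattern):
    """Return True if the regex `pattern` occurs in `text`; missing text counts as False."""
    if text is None:
        return False
    return re.search(pattern, text) is not None

# '+' is a quantifier: '18+' means '1' followed by one or more '8',
# so it matches the digits '18' and never requires a literal plus sign.
plus_unescaped = has_pattern('age 18', '18+')    # matches although there is no '+'
plus_escaped = has_pattern('age 18', r'18\+')    # no match: a literal '+' is required

# a bare substring also matches inside longer words
cat_in_medication = has_pattern('medication', 'cat')
# a negative lookbehind rejects the 'cat' that sits inside 'medication'
cat_excluding_medi = has_pattern('medication', r'(?<!medi)cat')
# a genuine 'cat' entry still matches with the lookbehind in place
cat_plain = has_pattern('cat,dog', r'(?<!medi)cat')

# a missing answer is treated as False
missing = has_pattern(None, 'cat')
```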

## assumption
for each boolean variable, a missing answer is interpreted as FALSE, i.e. we take each flag to mean that it is _confirmed to be true_. this assumption will not hold in every case, but it should hold more often than not, and it increases the expressiveness of the data set.
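
the order of operations matters when applying this assumption in pandas. a minimal sketch on a made-up column (the values are illustrative, not from the data set): because `NaN` is truthy, casting to `bool` before filling turns a missing answer into `True`.

```python
import numpy as np
import pandas as pd

# illustrative answers: yes, no, and a missing answer
answers = pd.Series([True, False, np.nan], dtype='object')

# casting first: NaN is truthy, so the missing answer becomes True
cast_first = answers.astype('bool')

# filling first: the missing answer becomes False, as the assumption requires
filled_first = answers.fillna(False).astype('bool')
```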

In [3]:
bool_cols = [
    'with_children',
    'sitting_with_another',
    'prev_sitting_experience',
    'other_animals',              
    'has_dog_experience',
    'has_cat_experience',
    'has_reptile_experience',
    'has_horse_experience',
    'has_fish_experience',
    'has_poultry_experience',
    'has_farm_animal_experience',
    'has_bird_experience',
    'has_small_pet_experience',
    'prefers_all_countries',
    'happy_to_meet_in_person',
    'happy_to_video_call',
    'interested_in_remote_working',
    'interested_in_dogs',
    'interested_in_cats',
    'interested_in_reptiles',
    'interested_in_horses',
    'interested_in_fish',
    'interested_in_poultry',
    'interested_in_farm_animals',
    'interested_in_birds',
    'interested_in_small_pets',
]
date_columns = [
    'modified_dt',
    'birth_date',
    'partner_birth_date'
]
today = pd.Timestamp.now().floor('D')  # Floor to remove time component


category_cols=[
    'travelling_as',
    'occupation_type',
    'location_wish_list',
    'child_ages',
]
# numerical columns
data_types = {
    'id':np.uint32, 
    'user_id':np.uint32,
    'latitude':np.float32,
    'longitude':np.float32,
    'pct_complete':np.uint8,
    'nb_reviews':np.uint16,
    'nb_5s_reviews':np.uint16,
    'nb_applications':np.uint16,
    'nb_sits_completed':np.uint16,
    'nb_domestic_sits':np.uint16,
    'nb_local_sits':np.uint16,
    'nb_of_sitter_pets':np.uint16,
    'nb_sits_booked':np.uint16,
    'daily_minutes_willing_to_walk_dogs':np.float32,
    # 'years_of_experience':np.uint8
}
# the sex combinations repeat, so let's map them. don't see a reason to distinguish between sitter and partner. just consider the different pairings. order alphabetically.
mapping = {
    # e.g. we already have an 'FM' category for a pair of sitters, so if a profile is 'MF', we map it to 'FM'.
    'MF':'FM',
    'NF':'FN',
    'TF':'FT',
    'UF':'FU',
    'XF':'FX',
    'NM':'MN',
    'TM':'MT',
    'UM':'MU',
    'XM':'MX',
    'TN':'NT',
    'UN':'NU',
    'XN':'NX',
    'XU':'UX',
}
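
since the mapping above always puts the letters of a pair in alphabetical order, the same normalization can also be sketched as a small function instead of an explicit table. this is an alternative illustration, assuming every code is a short string of single-letter markers; `normalize_pair` is a hypothetical helper name.

```python
def normalize_pair(code):
    """Sort the letters of a code so that 'MF' and 'FM' fall into the same category."""
    return ''.join(sorted(code))

# a few entries from the explicit mapping, for comparison
mapping_sample = {'MF': 'FM', 'NF': 'FN', 'XM': 'MX', 'XU': 'UX'}
agrees = all(normalize_pair(key) == value for key, value in mapping_sample.items())

single = normalize_pair('M')           # single sitters are left unchanged
already_sorted = normalize_pair('FM')  # codes already in order are left unchanged
```

the explicit dictionary has the advantage of documenting exactly which codes occur in the data; the function would also reorder codes that nobody has looked at yet.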

In [4]:
profiles_df =  (
    pd.read_csv(
        DATAFOLDER+'profiles.csv', 
        low_memory=False,
        parse_dates=date_columns,
        date_format='%Y-%m-%d %H:%M:%S.%f',
        dtype=data_types
    )
    .query('pct_complete > 75') # focus on profiles near completion and ignore partial profiles 
    .assign(
        days_since_modified    = lambda x: (today - x['modified_dt']).dt.days.astype('Int16'),
        birth_decade           = lambda x: pd.Series(((x['birth_date'].dt.year//10)*10),dtype='Int16'),
        partner_birth_decade   = lambda x: pd.Series(((x['partner_birth_date'].dt.year//10)*10),dtype='Int16'),
        travelling_as          = lambda x: pd.Categorical(x['travelling_as'].replace(mapping)),
        occupation_type        = lambda x: pd.Categorical(x['occupation_type']),
        with_a_baby            = lambda x: x['child_ages'].str.contains('0-3').fillna(False),
        with_a_toddler         = lambda x: x['child_ages'].str.contains('4-7').fillna(False),
        with_a_child           = lambda x: x['child_ages'].str.contains('8-12').fillna(False),
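        with_a_teen            = lambda x: x['child_ages'].str.contains(r'13-17|18\+').fillna(False),  # escape '+' so that '18+' is matched literally
        dog_skills             = lambda x: x['skills'].str.contains('dog').fillna(False),
        # a bare 'cat' also matches inside 'medication', so exclude that case
        cat_skills             = lambda x: x['skills'].str.contains(r'(?<!medi)cat').fillna(False),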
        five_star_ratio        = lambda x: (x['nb_5s_reviews']/x['nb_reviews']).fillna(0).astype(np.float32),
        can_give_medicine      = lambda x: x['skills'].str.contains('medication').fillna(False),
        big_dog_preferences    = lambda x: x['dog_size_preferences'].str.contains('^L|XL|,L').fillna(False),
        small_dog_preferences  = lambda x: x['dog_size_preferences'].str.contains('^S|XS|,S').fillna(False),
        # missing years of experience. assume newbies leave blank?
        years_of_experience    = lambda x: pd.to_numeric(x['years_of_experience'].fillna(0), errors='coerce', downcast='integer'),
        wish_list_city         = lambda x: x['location_wish_list'].str.contains('city').fillna(False),
        wish_list_beach        = lambda x: x['location_wish_list'].str.contains('beach').fillna(False),
        wish_list_mountain     = lambda x: x['location_wish_list'].str.contains('mountain').fillna(False),
        wish_list_countryside  = lambda x: x['location_wish_list'].str.contains('countryside').fillna(False),
        daily_minutes_willing_to_walk_dogs = lambda x: pd.to_numeric(x['daily_minutes_willing_to_walk_dogs'].fillna(0), errors='coerce', downcast='integer'),
    )
    # drop columns that add no value or have been used
    .drop(columns=['modified_dt', 'latitude', 'longitude', 'child_ages', 'skills', 'birth_date', 'partner_birth_date', 'dog_size_preferences', 'location_wish_list', 'preferred_countries'])    
)

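# for booleans, if answer is missing, interpret it as 'no'.
# fill first: NaN is truthy, so astype('bool') on its own would turn every missing answer into True.
for col in bool_cols:
    profiles_df[col] = profiles_df[col].fillna(False).astype('bool')

display(profiles_df.sample(5))
profiles_df.info()  # info() prints its summary and returns None, so it needs no display()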

| | id | user_id | pct_complete | travelling_as | with_children | sitting_with_another | occupation_type | prev_sitting_experience | other_animals | has_dog_experience | ... | dog_skills | cat_skills | five_star_ratio | can_give_medicine | big_dog_preferences | small_dog_preferences | wish_list_city | wish_list_beach | wish_list_mountain | wish_list_countryside |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17093 | 1822397 | 2898044 | 84 | FM | False | True | taking-time-off | True | True | True | ... | False | False | 1.0 | False | False | False | True | True | True | True |
| 16897 | 2073011 | 2801897 | 94 | FM | False | True | working-while-travelling | True | True | True | ... | False | False | 1.0 | False | False | False | True | True | True | True |
| 6891 | 745156 | 916227 | 94 | FM | False | True | employed | True | True | True | ... | False | False | 1.0 | False | False | False | True | True | True | True |
| 16142 | 1905041 | 3070152 | 84 | M | False | False | working-while-travelling | True | True | True | ... | False | False | 0.875 | False | False | False | True | True | True | True |
| 12298 | 1486120 | 2296109 | 84 | M | False | False | employed | True | True | True | ... | False | False | 0.0 | False | False | False | True | True | True | True |


<class 'pandas.core.frame.DataFrame'>
Index: 33352 entries, 0 to 34915
Data columns (total 58 columns):
 #   Column                              Non-Null Count  Dtype   
---  ------                              --------------  -----   
 0   id                                  33352 non-null  uint32  
 1   user_id                             33352 non-null  uint32  
 2   pct_complete                        33352 non-null  uint8   
 3   travelling_as                       33352 non-null  category
 4   with_children                       33352 non-null  bool    
 5   sitting_with_another                33352 non-null  bool    
 6   occupation_type                     33262 non-null  category
 7   prev_sitting_experience             33352 non-null  bool    
 8   other_animals                       33352 non-null  bool    
 9   has_dog_experience                  33352 non-null  bool    
 10  has_cat_experience                  33352 non-null  bool    
 11  has_reptile_experience           


In [5]:
# profiles_df['birth_decade'].unique()
# profiles_df['five_star_ratio'].isna().sum()
# profiles_df['five_star_ratio'].value_counts()
# profiles_df[profiles_df['days_since_modified']>2000]
profiles_df['five_star_ratio'].max()
profiles_df[profiles_df['five_star_ratio']>0.99][['five_star_ratio', 'nb_5s_reviews', 'nb_reviews']]

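| | five_star_ratio | nb_5s_reviews | nb_reviews |
|---|---|---|---|
| 1 | 1.0 | 57 | 57 |
| 3 | 1.0 | 9 | 9 |
| 15 | 1.0 | 7 | 7 |
| 17 | 1.0 | 12 | 12 |
| 18 | 1.0 | 35 | 35 |
| ... | ... | ... | ... |
| 34909 | 1.0 | 1 | 1 |
| 34910 | 1.0 | 3 | 3 |
| 34913 | 1.0 | 4 | 4 |
| 34914 | 1.0 | 19 | 19 |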
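
the ratio divides two count columns, so a profile without any reviews produces 0/0. a minimal sketch on made-up counts (illustrative values, not taken from the data) of how that case ends up as 0:

```python
import pandas as pd

# illustrative review counts; the last profile has no reviews at all
counts = pd.DataFrame({
    'nb_5s_reviews': pd.Series([57, 7, 0, 0], dtype='uint16'),
    'nb_reviews':    pd.Series([57, 8, 3, 0], dtype='uint16'),
})

# 0/0 yields NaN, which fillna(0) turns into a ratio of 0
ratio = (
    (counts['nb_5s_reviews'] / counts['nb_reviews'])
    .fillna(0)
    .astype('float32')
)
```

note that a ratio of 0 is therefore shared by profiles with no reviews and profiles whose reviews contain no five-star rating.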


In [11]:
for col in bool_cols:
    display(profiles_df[col].value_counts(normalize=True))

with_children
False    0.936196
True     0.063804
Name: proportion, dtype: float64

sitting_with_another
False    0.555139
True     0.444861
Name: proportion, dtype: float64

prev_sitting_experience
True     0.937155
False    0.062845
Name: proportion, dtype: float64

other_animals
True    1.0
Name: proportion, dtype: float64

has_dog_experience
True     0.96495
False    0.03505
Name: proportion, dtype: float64

has_cat_experience
True     0.957904
False    0.042096
Name: proportion, dtype: float64

has_reptile_experience
False    0.76763
True     0.23237
Name: proportion, dtype: float64

has_horse_experience
False    0.755877
True     0.244123
Name: proportion, dtype: float64

has_fish_experience
True     0.656602
False    0.343398
Name: proportion, dtype: float64

has_poultry_experience
False    0.550672
True     0.449328
Name: proportion, dtype: float64

has_farm_animal_experience
False    0.710152
True     0.289848
Name: proportion, dtype: float64

has_bird_experience
False    0.618763
True     0.381237
Name: proportion, dtype: float64

has_small_pet_experience
True     0.640981
False    0.359019
Name: proportion, dtype: float64

prefers_all_countries
True     0.634984
False    0.365016
Name: proportion, dtype: float64

happy_to_meet_in_person
True     0.996162
False    0.003838
Name: proportion, dtype: float64

happy_to_video_call
True     0.99973
False    0.00027
Name: proportion, dtype: float64

interested_in_remote_working
True     0.955355
False    0.044645
Name: proportion, dtype: float64

interested_in_dogs
False    0.896918
True     0.103082
Name: proportion, dtype: float64

interested_in_cats
False    0.888103
True     0.111897
Name: proportion, dtype: float64

interested_in_reptiles
False    0.950018
True     0.049982
Name: proportion, dtype: float64

interested_in_horses
False    0.962971
True     0.037029
Name: proportion, dtype: float64

interested_in_fish
False    0.909601
True     0.090399
Name: proportion, dtype: float64

interested_in_poultry
False    0.92879
True     0.07121
Name: proportion, dtype: float64

interested_in_farm_animals
False    0.954755
True     0.045245
Name: proportion, dtype: float64

interested_in_birds
False    0.935566
True     0.064434
Name: proportion, dtype: float64

interested_in_small_pets
False    0.906662
True     0.093338
Name: proportion, dtype: float64

# lessons from exploratory analysis report
1. column `other_animals` carries no information: every row has the same value (`True`).
2. column `years_of_experience` has an oddly u-shaped distribution: long-term true believers and the new crowd, with no in-between cohort.
3. 
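
a quick way to spot columns like `other_animals` without reading the full report is to count the distinct values per column. a minimal sketch on a made-up frame (the values are illustrative, not from the data set):

```python
import pandas as pd

demo = pd.DataFrame({
    'other_animals':      [True, True, True, True],   # constant column
    'has_dog_experience': [True, False, True, True],
})

# a column with at most one distinct value cannot distinguish between profiles
constant_cols = [
    col for col in demo.columns
    if demo[col].nunique(dropna=False) <= 1
]
```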

# save profile features in the persistent database file

In [6]:
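with duckdb.connect(DATABASEFILE) as con:
    con.sql("DROP TABLE IF EXISTS profiles_prep")
    con.sql("CREATE TABLE profiles_prep AS SELECT * FROM profiles_df")
    # a bare relation inside a `with` block is never displayed, so print the row count explicitly
    print(con.sql("SELECT COUNT(*) FROM profiles_prep").fetchone()[0])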

# exploratory data analysis reporting

In [8]:
profile = yp.ProfileReport(profiles_df)
# profile.to_notebook_iframe()
profile.to_file('../notes/eda/eda_report_profiles.html')
