housesitter assignments - exploratory data analysis
===
# introduction
this notebook is solely for getting to know the data and discovering its shapes and secrets. data cleaning and preprocessing take place further downstream.


In [1]:
from datetime import datetime
import duckdb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import ydata_profiling as yp  # provides yp.ProfileReport

DATAFOLDER='../data/raw/'        # path to raw data files
DATABASEFILE='../data/sits.ddb'  # path to persistent duckdb database file

In [2]:
np.finfo(np.float16).max

65500.0

# housesitter assignments
in this data set, each row represents one housesitter's sit at a homeowner's property. the data dictionary tells us a lot about the columns included.

many of the columns are boolean and some are datetimes. others are strings, which can either be treated as categorical variables or expanded into one-hot encoded variables.
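
to make the two options for string columns concrete, here is a minimal sketch on a toy frame. the column names are borrowed from this data set, but the values are invented for illustration, and neither option is necessarily the treatment used downstream.

```python
import pandas as pd

# toy frame standing in for the location columns; values are illustrative only
toy = pd.DataFrame({
    'ocountry': ['canada', 'united-states', 'canada', 'united-kingdom'],
    'ocity': ['gatineau', 'oakland', 'gatineau', 'london'],
})

# option 1: keep a single column, stored as a pandas categorical
as_category = toy.astype({'ocountry': 'category'})

# option 2: expand the column into one indicator column per distinct value
one_hot = pd.get_dummies(toy, columns=['ocountry'], prefix='ocountry')

print(as_category.dtypes)
print(one_hot.columns.tolist())
```

a categorical keeps the frame narrow and suits high-cardinality columns such as cities, while one-hot columns are wider but can be fed directly to models that expect numeric input.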

## assumption
for each boolean variable, a missing answer is interpreted as FALSE, i.e. we take each column to record whether the statement is _confirmed to be true_ or not. this assumption will not hold in every case, but it should hold more often than not, and it increases the expressiveness of the data set.
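
one way this assumption could be applied is sketched below on a toy frame. the values are invented and `toy_bool_cols` is a hypothetical stand-in for the real column list. the nullable `boolean` dtype is used first so that the gaps can be counted before they are overwritten.

```python
import numpy as np
import pandas as pd

# toy frame; column names borrowed from this data set, values invented
toy = pd.DataFrame({
    'is_new': [True, np.nan, False],
    'reviewing': [np.nan, np.nan, True],
})
toy_bool_cols = ['is_new', 'reviewing']  # hypothetical stand-in list

# the nullable 'boolean' dtype keeps gaps as <NA>, so they can be counted
as_nullable = toy[toy_bool_cols].astype('boolean')
n_missing = as_nullable.isna().sum()

# missing answer -> False, then convert to plain numpy bool
filled = as_nullable.fillna(False).astype(bool)

print(n_missing)
print(filled)
```

counting the gaps first gives a rough sense of how much weight the assumption carries for each column.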

In [5]:
bool_cols = [
    'private',
    'is_new',
    'reviewing',
    'deleted',
]
date_columns = [
    'created_dt',
    'last_modified_dt',
    'start_dt',
    'end_dt',
    'welcome_guide_shared_ts',
]
today = pd.Timestamp.now().floor('D')  # Floor to remove time component


category_cols=[
    'ocountry',
    'oregion',
    'ocity',
    'scountry',
    'sregion',
    'scity',
]
# numerical columns
data_types = {
    'id': np.uint32,
    'listing_id': np.uint32,
    'owner_user_id': np.uint32,
    'profile_id': np.uint32,
}


In [6]:
assignments_df = (
    pd.read_csv(
        DATAFOLDER + 'assignments.csv',
        low_memory=False,
        parse_dates=date_columns,
        date_format='%Y-%m-%d %H:%M:%S.%f',
        dtype=data_types,
    )
    .drop(columns=[
        'modified_dt',          # won't need
    ])
)
display(assignments_df.sample(5))
display(assignments_df.info())

,id,created_dt,last_modified_dt,start_dt,end_dt,private,approximate_dates,sitter_found,listing_id,owner_user_id,...,ocity,profile_id,sitter_user_id,scountry,sregion,scity,is_new,reviewing,welcome_guide_shared_ts,deleted
144093,659199,2022-10-06,2022-10-10,2022-10-28,2022-10-31,False,False,True,213057,760161,...,gatineau,767892.0,946695.0,canada,quebec,gatineau,False,False,NaT,False
26902,524813,2022-03-02,2022-03-02,2022-04-21,2022-04-24,False,False,False,502383,2451308,...,midland,,,,,,False,False,NaT,False
118252,615553,2022-07-28,2022-07-29,2022-09-12,2022-09-24,False,False,False,780517,3568102,...,london,,,,,,True,False,NaT,True
45021,538591,2022-03-28,2022-04-10,2022-04-15,2022-04-17,False,False,True,394646,1628467,...,townville,1697967.0,2658374.0,united-states,north-carolina,asheville,False,False,2022-04-10 22:52:46,False
99308,599415,2022-07-04,2022-07-04,2022-07-17,2022-07-23,False,False,False,914354,4061342,...,oakland,,,,,,False,False,NaT,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198025 entries, 0 to 198024
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   id                       198025 non-null  uint32        
 1   created_dt               198025 non-null  datetime64[ns]
 2   last_modified_dt         198025 non-null  datetime64[ns]
 3   start_dt                 198025 non-null  datetime64[ns]
 4   end_dt                   198025 non-null  datetime64[ns]
 5   private                  198025 non-null  bool          
 6   approximate_dates        198025 non-null  bool          
 7   sitter_found             198025 non-null  bool          
 8   listing_id               198025 non-null  uint32        
 9   owner_user_id            198025 non-null  uint32        
 10  ocountry                 198025 non-null  object        
 11  oregion                  197409 non-null  object        
 12  ocity           

None

sitter_user_id,num_reviews,num_assignments,mean_score,tidiness_avg,organised_avg
2808091,9,9,3.333333,3.111111,3.111111
1441093,11,11,3.727273,3.545455,3.727273
3479858,12,12,4.166667,4.250000,4.083333
3114610,8,8,4.250000,4.375000,4.250000
1700238,8,8,4.375000,3.875000,4.250000
...,...,...,...,...,...
1620426,10,10,5.000000,5.000000,5.000000
1626664,10,10,5.000000,4.500000,4.500000
1628692,12,12,5.000000,5.000000,5.000000
1568938,10,10,5.000000,4.900000,5.000000


# save prepped data to database

In [None]:
with duckdb.connect(DATABASEFILE) as con:
    con.sql("DROP TABLE IF EXISTS assignment_prep")
    con.sql("CREATE TABLE assignment_prep AS SELECT * FROM assignments_df")
    con.sql("SELECT COUNT(*) FROM assignment_prep").show()  # print the row count as a sanity check

# exploratory data analysis reporting

In [8]:
profile = yp.ProfileReport(assignments_df)  # the dataframe that was saved as the assignment_prep table
profile.to_file('../notes/eda/eda_report_assignments.html')


# lessons from exploratory analysis report
