# Verify Tables in the STEDI Human Balance Analytics Project

This notebook verifies the row count of every table at each stage of the project.

The rubric requires the tables to have the following row counts at each stage:

- Landing
  - Customer: 956
  - Accelerometer: 81273
  - Step Trainer: 28680
- Trusted
  - Customer: 482
  - Accelerometer: 40981
  - Step Trainer: 14460
- Curated
  - Customer: 482
  - Machine Learning: 38403
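
The expected counts above can be kept in a small lookup so that each later check becomes a one-line comparison. This is only a minimal sketch: `EXPECTED_ROWS` and `check_rows` are hypothetical names introduced for illustration and are not part of the project.

```python
# Expected row counts copied from the rubric (stage -> table -> rows).
EXPECTED_ROWS = {
    "landing": {"customer": 956, "accelerometer": 81273, "step_trainer": 28680},
    "trusted": {"customer": 482, "accelerometer": 40981, "step_trainer": 14460},
    "curated": {"customer": 482, "machine_learning": 38403},
}


def check_rows(stage, table, actual):
    """Hypothetical helper: report whether `actual` matches the rubric count."""
    expected = EXPECTED_ROWS[stage][table]
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{stage}/{table}: expected {expected}, got {actual} -> {status}")
    return actual == expected
```

Once a table is loaded, a call such as `check_rows("landing", "customer", len(cdf))` would compare it against the rubric.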
 
Let's now check whether these row counts are reproduced when the same filters and joins are applied with pandas.

In [48]:
import os
import pandas as pd

cust_dir = "../final_data/customer/landing"
acc_dir = "../final_data/accelerometer/landing"
st_dir = "../final_data/step_trainer/landing"

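def load(path):
    """Load every JSON Lines file in `path` into a single DataFrame."""
    frames = []
    for filename in os.listdir(path):
        if filename.endswith('.json'):
            file_path = os.path.join(path, filename)
            frames.append(pd.read_json(file_path, lines=True))
    # Concatenate once rather than growing a DataFrame inside the loop.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

cdf = load(cust_dir)
adf = load(acc_dir)
sdf = load(st_dir)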

In [49]:
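# DataFrame.info() prints its summary and returns None,
# so display() also shows "None" after each summary.
display(cdf.info())
display(adf.info())
display(sdf.info())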

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 956 entries, 0 to 955
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customerName               956 non-null    object 
 1   email                      956 non-null    object 
 2   phone                      956 non-null    int64  
 3   birthDay                   956 non-null    object 
 4   serialNumber               956 non-null    object 
 5   registrationDate           956 non-null    int64  
 6   lastUpdateDate             956 non-null    int64  
 7   shareWithResearchAsOfDate  482 non-null    float64
 8   shareWithPublicAsOfDate    491 non-null    float64
 9   shareWithFriendsAsOfDate   508 non-null    float64
dtypes: float64(3), int64(3), object(4)
memory usage: 74.8+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81273 entries, 0 to 81272
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   user       81273 non-null  object        
 1   timestamp  81273 non-null  datetime64[ns]
 2   x          81273 non-null  int64         
 3   y          81273 non-null  int64         
 4   z          81273 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 3.1+ MB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28680 entries, 0 to 28679
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   sensorReadingTime   28680 non-null  int64 
 1   serialNumber        28680 non-null  object
 2   distanceFromObject  28680 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 672.3+ KB


None

In [50]:
# customer trusted: keep only rows with a non-null shareWithResearchAsOfDate
cdf_t = cdf[cdf['shareWithResearchAsOfDate'].notna()]
cdf_t.info()

<class 'pandas.core.frame.DataFrame'>
Index: 482 entries, 0 to 954
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customerName               482 non-null    object 
 1   email                      482 non-null    object 
 2   phone                      482 non-null    int64  
 3   birthDay                   482 non-null    object 
 4   serialNumber               482 non-null    object 
 5   registrationDate           482 non-null    int64  
 6   lastUpdateDate             482 non-null    int64  
 7   shareWithResearchAsOfDate  482 non-null    float64
 8   shareWithPublicAsOfDate    240 non-null    float64
 9   shareWithFriendsAsOfDate   270 non-null    float64
dtypes: float64(3), int64(3), object(4)
memory usage: 41.4+ KB


In [51]:
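# accelerometer trusted: keep readings whose user matches a trusted customer's email
adf_t = adf.merge(cdf_t, how='inner', left_on='user', right_on='email')
adf_t = adf_t[adf.columns]  # drop the customer columns added by the merge
adf_t.info()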

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40981 entries, 0 to 40980
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   user       40981 non-null  object        
 1   timestamp  40981 non-null  datetime64[ns]
 2   x          40981 non-null  int64         
 3   y          40981 non-null  int64         
 4   z          40981 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 1.6+ MB
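
The join above only preserves the accelerometer row count if each email appears once on the customer side. As a hedged sketch on made-up data (the frames below are toy stand-ins, not the project tables), the `validate` argument of `DataFrame.merge` can make that assumption explicit:

```python
import pandas as pd

# Toy stand-ins for the trusted customer and landing accelerometer tables.
customers = pd.DataFrame({"email": ["a@example.com", "b@example.com"]})
readings = pd.DataFrame(
    {"user": ["a@example.com", "a@example.com", "c@example.com"], "x": [1, 2, 3]}
)

# validate="many_to_one" makes pandas raise MergeError if an email
# appears more than once on the customer side of the join.
trusted = readings.merge(
    customers, how="inner", left_on="user", right_on="email", validate="many_to_one"
)
trusted = trusted[readings.columns]
print(len(trusted))  # 2: only the readings from a@example.com match
```

If the customer table did contain a repeated email, the merge would fail loudly instead of silently inflating the count.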


In [52]:
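# customer curated: trusted customers that have accelerometer data
# Note: this join returns one row per accelerometer reading, so the count below
# (40981) is not the number of distinct customers the rubric lists (482);
# the rows would need deduplicating (e.g. drop_duplicates()) to compare the two.
cdf_c = cdf_t.merge(adf_t, how='inner', left_on='email', right_on='user')
cdf_c = cdf_c[cdf.columns]
cdf_c.info()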

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40981 entries, 0 to 40980
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customerName               40981 non-null  object 
 1   email                      40981 non-null  object 
 2   phone                      40981 non-null  int64  
 3   birthDay                   40981 non-null  object 
 4   serialNumber               40981 non-null  object 
 5   registrationDate           40981 non-null  int64  
 6   lastUpdateDate             40981 non-null  int64  
 7   shareWithResearchAsOfDate  40981 non-null  float64
 8   shareWithPublicAsOfDate    20404 non-null  float64
 9   shareWithFriendsAsOfDate   22979 non-null  float64
dtypes: float64(3), int64(3), object(4)
memory usage: 3.1+ MB
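
The count reported above (40981) matches the accelerometer trusted count rather than the rubric's 482 for curated customers, which suggests the join repeats each customer once per reading. The toy example below illustrates that effect and how `drop_duplicates()` collapses it; the data is invented, and the same step has not been run against the project tables here.

```python
import pandas as pd

# Two made-up customers with three and two accelerometer readings respectively.
customers = pd.DataFrame(
    {"email": ["a@example.com", "b@example.com"], "serialNumber": ["s1", "s2"]}
)
readings = pd.DataFrame(
    {"user": ["a@example.com"] * 3 + ["b@example.com"] * 2, "x": [0, 1, 2, 3, 4]}
)

joined = customers.merge(readings, how="inner", left_on="email", right_on="user")
joined = joined[customers.columns]

print(len(joined))                    # 5: one row per reading
print(len(joined.drop_duplicates()))  # 2: one row per customer
```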


In [53]:
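# step trainer trusted: keep readings from devices that belong to curated customers
sdf_t = sdf.merge(cdf_c, how='inner', on='serialNumber')
sdf_t = sdf_t[sdf.columns]
# cdf_c repeats each customer once per accelerometer reading, so the merge
# repeats step trainer rows; drop_duplicates() collapses them again
sdf_t = sdf_t.drop_duplicates()
sdf_t.info()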

<class 'pandas.core.frame.DataFrame'>
Index: 14460 entries, 0 to 1229343
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   sensorReadingTime   14460 non-null  int64 
 1   serialNumber        14460 non-null  object
 2   distanceFromObject  14460 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 451.9+ KB
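
The index above runs up to 1229343, so the merge produced over a million intermediate rows before `drop_duplicates()` reduced them to 14460. One possible alternative, sketched here on invented data, is to filter with `Series.isin`, which never builds the repeated rows. Note the difference in behavior: `isin` keeps any duplicate rows already present in the step trainer table, whereas `drop_duplicates()` removes them, so the two are only interchangeable if that table has no exact duplicates.

```python
import pandas as pd

step = pd.DataFrame(
    {"sensorReadingTime": [10, 20, 30], "serialNumber": ["s1", "s1", "s9"]}
)
# Customer rows repeated once per accelerometer reading, as after an inner join.
curated = pd.DataFrame({"serialNumber": ["s1"] * 4})

# Merge route: every matching step row is repeated for each customer row.
via_merge = step.merge(curated, how="inner", on="serialNumber")
print(len(via_merge))  # 8: 2 matching step rows x 4 customer rows
print(len(via_merge[step.columns].drop_duplicates()))  # 2

# Filter route: a membership test keeps each matching step row once.
via_isin = step[step["serialNumber"].isin(curated["serialNumber"])]
print(len(via_isin))  # 2
```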


In [54]:
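# machine learning curated
# Convert timestamp from datetime64[ns] to integer epoch milliseconds
# so it can be joined against sensorReadingTime
adf_t['timestamp'] = (adf_t['timestamp'].astype('int64')/1000000).astype('int64')
mdf = sdf_t.merge(adf_t, how='inner', left_on='sensorReadingTime', right_on='timestamp')
mdf.info()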

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38403 entries, 0 to 38402
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   sensorReadingTime   38403 non-null  int64 
 1   serialNumber        38403 non-null  object
 2   distanceFromObject  38403 non-null  int64 
 3   user                38403 non-null  object
 4   timestamp           38403 non-null  int64 
 5   x                   38403 non-null  int64 
 6   y                   38403 non-null  int64 
 7   z                   38403 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 2.3+ MB
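
The join above depends on turning `datetime64[ns]` timestamps into integer epoch milliseconds. The sketch below shows the same conversion on two made-up values, using integer floor division so the nanosecond counts never pass through floating point; this is a variant of the cell's division, not the exact expression it uses.

```python
import pandas as pd

# Made-up epoch-millisecond values, parsed into nanosecond-resolution datetimes.
epoch_ms = [1655564444103, 1655564444598]
ts = pd.Series(pd.to_datetime(epoch_ms, unit="ms")).astype("datetime64[ns]")

# Nanoseconds since the epoch, floor-divided down to whole milliseconds.
back_to_ms = ts.astype("int64") // 1_000_000
print(back_to_ms.tolist() == epoch_ms)  # True
```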
