# Verify Processed Data

Now that the Glue jobs have processed the data, let's verify the output.

For example, the curated customer data produced by a Glue job has the correct number of rows when queried in AWS Athena, but when another Glue job processes it, only part of the data is carried over.

In [5]:
import os
import pandas as pd
import json

In [7]:
directory_path = '../processed_data/customer/curated'
# Every file in the directory is assumed to be a JSON Lines part file
all_files = os.listdir(directory_path)

dataframes_list = [pd.read_json(os.path.join(directory_path, file), lines=True) for file in all_files]

df = pd.concat(dataframes_list, ignore_index=True)
df

Unnamed: 0,serialNumber,birthDay,shareWithPublicAsOfDate,shareWithResearchAsOfDate,registrationDate,customerName,shareWithFriendsAsOfDate,email,lastUpdateDate,phone
0,44a10dbd-444c-427d-8e2a-b5b884d79eab,1271-01-01,1.655564e+12,1.655564e+12,1655564423958,Jacob Anderson,1.655564e+12,Jacob.Anderson@test.com,1655564423958,8015551212
1,c3ee6284-91b6-403d-965a-2b8a611b7612,1543-01-01,,1.655564e+12,1655564403664,John Smith,,John.Smith@test.com,1655564403664,8015551212
2,d2ff3040-3bec-4894-b51f-76a8df1ce1f4,1644-01-01,,1.655564e+12,1655564396106,Lyn Doshi,,Lyn.Doshi@test.com,1655564396106,8015551212
3,dc675019-d2cb-40ce-b3fd-4f95ff4e3d11,1276-01-01,,1.655564e+12,1655564423601,Dan Olson,1.655564e+12,Dan.Olson@test.com,1655564423601,8015551212
4,8eca1513-cac8-42eb-994f-f49c22ad3004,1268-01-01,,1.655564e+12,1655564135368,Neeraj Staples,1.655564e+12,Neeraj.Staples@test.com,1655564135368,8015551212
...,...,...,...,...,...,...,...,...,...,...
477,3689766a-2b96-4272-8b60-e57f0c067b93,1882-01-01,1.655563e+12,1.655563e+12,1655563197928,John Fibonnaci,1.655563e+12,John.Fibonnaci@test.com,1655563264021,8015551212
478,be0294fc-67e9-48dc-aa8a-3015fdcfefcb,1097-01-01,,1.655564e+12,1655564436771,Sarah Khatib,1.655564e+12,Sarah.Khatib@test.com,1655564436771,8015551212
479,05269382-df65-43f7-b6d5-dd1a5b029e5f,1655-01-01,1.655564e+12,1.655564e+12,1655564107530,Jason Anderson,,Jason.Anderson@test.com,1655564107530,8015551212
480,38a5324b-c29a-41b3-aa9d-e2ef921b93af,1926-01-01,1.655564e+12,1.655564e+12,1655564374012,Craig Habschied,1.655564e+12,Craig.Habschied@test.com,1655564374012,8015551212
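
As a cross-check that does not depend on pandas or Athena, the records in each part file can be counted directly. The sketch below assumes the curated output is JSON Lines (one record per line), which the `lines=True` argument above implies. `count_lines_per_file` is a hypothetical helper, not part of the original notebook.

```python
import os


def count_lines_per_file(directory_path):
    """Return {file_name: number of non-empty lines} for each regular file.

    Assumes one JSON record per line, so the line count equals the row count.
    """
    counts = {}
    for name in sorted(os.listdir(directory_path)):
        path = os.path.join(directory_path, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, 'r', encoding='utf-8') as fh:
            counts[name] = sum(1 for line in fh if line.strip())
    return counts
```

Calling `count_lines_per_file('../processed_data/customer/curated')` and summing the values should give the same total as `len(df)`. If it does not, some lines were probably dropped or merged during parsing.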


In [3]:
# Now check the step trainer trusted data. Somehow, the job produces more than 1 million rows.

directory_path = '../processed_data/step_trainer/trusted-maybe'
all_files = os.listdir(directory_path)

dataframes_list = [pd.read_json(os.path.join(directory_path, file), lines=True, encoding_errors='ignore') for file in all_files]

st_df = pd.concat(dataframes_list, ignore_index=True)
st_df

ValueError: Unmatched ''"' when when decoding 'string'
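
Note that `encoding_errors='ignore'` only affects how undecodable bytes are handled. It does not skip lines that are invalid JSON, so the parser still fails. One possible cause, not confirmed here, is a truncated or otherwise malformed line in one of the part files. A minimal sketch for locating such lines is shown below. It assumes every valid line is a JSON object, and `read_json_lines_tolerant` is a hypothetical helper name.

```python
import json

import pandas as pd


def read_json_lines_tolerant(path, max_rows=None):
    """Parse a JSON Lines file and collect malformed lines instead of failing.

    Returns (DataFrame of parsed records, list of (line_number, error message)).
    """
    records, bad_lines = [], []
    with open(path, 'r', encoding='utf-8', errors='replace') as fh:
        for line_number, line in enumerate(fh, start=1):
            if max_rows is not None and len(records) >= max_rows:
                break  # stop early so huge files are not read in full
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                bad_lines.append((line_number, str(exc)))
    return pd.DataFrame(records), bad_lines
```

The returned `bad_lines` list gives the line numbers to inspect in the offending file. Passing `max_rows` keeps memory use bounded when a part file is very large.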

In [6]:
# The files are too big to load in full, so read only the first 10 lines of one part file

num_lines_to_read = 10
lines = []

with open('../processed_data/step_trainer/trusted/run-1691491795804-part-r-00000', 'r') as file:
    for _ in range(num_lines_to_read):
        line = file.readline()
        if not line:  # reached end of file
            break
        lines.append(line)

# Now, we'll use pandas to convert the list of JSON strings into a DataFrame
st_df = pd.DataFrame([json.loads(line) for line in lines])

In [7]:
st_df

Unnamed: 0,serialNumber,birthDay,shareWithPublicAsOfDate,shareWithResearchAsOfDate,registrationDate,customerName,sensorReadingTime,shareWithFriendsAsOfDate,.serialNumber,distanceFromObject,lastUpdateDate
0,a5d63530-aa58-4a81-b872-ec3ee6ef8200,1677-01-01,,1655564000000.0,1655564106007,Jaya Aristotle,1655564444103,1655564000000.0,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564106007
1,da87fb99-8e30-4396-8834-ae9cb1ab7cc5,1727-01-01,,1655564000000.0,1655564389575,Spencer Howard,1655564444103,1655564000000.0,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564389575
2,6825097d-7e08-4fce-a71b-c28ee1aec084,1151-01-01,1655564000000.0,1655564000000.0,1655564432890,Suresh Jackson,1655564444103,,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564432890
3,d0f2de3d-79a4-4e35-aebe-1f4abd67d466,1072-01-01,,1655564000000.0,1655564438566,Bobby Olson,1655564444103,,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564438566
4,f6ab57be-a768-4eb3-a6c2-8b5833c1646a,1674-01-01,,1655564000000.0,1655564393876,Chris Gonzalez,1655564444103,,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564393876
5,651d1e45-f94a-4a53-8161-abff5442a6f1,1123-01-01,,1655564000000.0,1655564434907,Lyn Anandh,1655564444103,1655564000000.0,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564434907
6,d9e3ef45-d77c-4266-8c66-be438ed79ec9,1217-01-01,1655564000000.0,1655564000000.0,1655563974242,Dan Anandh,1655564444103,1655564000000.0,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655563974242
7,42c7a46e-b91a-4b21-aead-1cf230a0ed4d,1527-01-01,,1655564000000.0,1655563953289,Bobby Habschied,1655564444103,1655564000000.0,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655563953289
8,cf977dd3-df90-492a-b05d-15fd3898a45b,1033-01-01,1655564000000.0,1655564000000.0,1655564152526,Liz Aristotle,1655564444103,,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564152526
9,2f511a03-5105-4fa1-a8db-88fd1a7904b4,1428-01-01,1655564000000.0,1655564000000.0,1655564412261,Angie Olson,1655564444103,,50f7b4f3-7af5-4b07-a421-7b902c8d2b7c,218,1655564412261
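
In the ten rows above, `sensorReadingTime`, `.serialNumber`, and `distanceFromObject` are identical on every row while the customer changes. That pattern is consistent with a join that pairs each sensor reading with many customers, which would also explain the inflated row count, although this sample alone does not prove it. The sketch below counts how many distinct customers each reading is attached to. `readings_fan_out` is a hypothetical helper, and the choice of key columns is an assumption based on the columns shown.

```python
import pandas as pd


def readings_fan_out(df, reading_keys, customer_key):
    """Count the distinct customers paired with each distinct sensor reading.

    A count above 1 means the same reading was joined to several customers.
    """
    return (
        df.groupby(list(reading_keys))[customer_key]
        .nunique()
        .rename('customers_per_reading')
        .reset_index()
        .sort_values('customers_per_reading', ascending=False)
    )
```

For example, `readings_fan_out(st_df, ['sensorReadingTime', '.serialNumber'], 'serialNumber')` groups by the step trainer columns and counts customer serial numbers. If the join keys were correct, each reading would be expected to map to a single customer.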
