## Apply Schema on Lists Read from Files

Let us understand how to apply a schema while processing data from files.
* In many cases, data files do not contain metadata such as column names and data types.
* The metadata may be supplied in separate files. It is also common for metadata to be available via database tables or REST-based schema registries.
* We need to make sure that the metadata (schema) is applied to the data as part of data processing.
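
As a minimal sketch of the idea, the snippet below pairs each delimited value with a column name using `zip`. The column names, the sample records, and the helper name `apply_schema` are hypothetical and used only for illustration; they are not taken from the retail_db files.

```python
# Hypothetical column names and records, for illustration only
columns = ['order_id', 'order_date', 'order_status']
records = ['1,2024-01-01,CLOSED', '2,2024-01-02,PENDING']

def apply_schema(lines, column_names, sep=','):
    """Pair each delimited value with its column name."""
    return [dict(zip(column_names, line.split(sep))) for line in lines]

rows = apply_schema(records, columns)
print(rows[0])
```

A plain `split` does not handle quoted values that contain the delimiter, which is one reason to prefer the `csv` module for real files.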

In this case, the data files are available under **/data/retail_db**, and the JSON file with the metadata is available under **schemas/retail_db/retail.json**.

In [None]:
!ls -ltr /data/retail_db

In [None]:
!ls -ltr schemas/retail_db/retail.json

In [None]:
!cat schemas/retail_db/retail.json

In [None]:
!ls -ltr /data/retail_db/orders

In [None]:
# Read orders data into list of strings

orders_path = '/data/retail_db/orders/part-00000'
# Use a context manager so the file is closed after reading
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()

In [None]:
orders[:10]

In [None]:
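# Load schemas into dict using json

import json
# Use a context manager so the schema file is closed after loading
with open('schemas/retail_db/retail.json') as schema_file:
    retail_schemas = json.load(schema_file)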

In [None]:
retail_schemas

In [None]:
# Get the schema for relevant data set

retail_schemas['orders']

In [None]:
# Fetch the column names

columns = list(map(lambda col: col['column_name'], retail_schemas['orders']))

In [None]:
columns

In [None]:
import csv

In [None]:
csv.DictReader?

In [None]:
# Create DictReader object using list of strings and column names
# DictReader is an iterator; each row it yields is a dict keyed by the column names
csv_reader = csv.DictReader(orders, fieldnames=columns)

In [None]:
csv_reader

In [None]:
list(csv_reader)[:10]
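
One detail worth keeping in mind: a `csv.DictReader` is an iterator, so it can be consumed only once. The sketch below uses a small hypothetical list of strings (not the actual orders data) to show that a second pass over the same reader yields nothing.

```python
import csv

# Hypothetical sample lines and column names, for illustration only
lines = ['1,CLOSED', '2,PENDING']
reader = csv.DictReader(lines, fieldnames=['id', 'status'])

first_pass = list(reader)   # consumes every row
second_pass = list(reader)  # the reader is already exhausted

print(len(first_pass), len(second_pass))
```

If the rows are needed more than once, it is usually simpler to keep the list produced by the first pass than to recreate the reader.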

In [None]:
folder_name = '/data/retail_db/orders'

In [None]:
import os
file_names = os.listdir(folder_name)

In [None]:
file_names

In [None]:
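# Lists can be concatenated using +. The get_dicts function below relies on
# this to combine the records read from several files.
l1 = [1, 2, 3]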

In [None]:
l2 = [4, 5]

In [None]:
l1 + l2

In [None]:
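import os
import json
import csv

def get_dicts(base_folder, data_set_name, schema_file):
    """Read all files of a data set and return a list of dicts keyed by column name."""
    data_set_folder = os.path.join(base_folder, data_set_name)
    # Sort the file names so that records are read in a predictable order
    file_names = sorted(os.listdir(data_set_folder))
    with open(schema_file) as schema_fp:
        retail_schemas = json.load(schema_fp)
    columns = list(map(lambda col: col['column_name'], retail_schemas[data_set_name]))
    data = []
    for file_name in file_names:
        file_path = os.path.join(data_set_folder, file_name)
        # The csv module recommends opening files with newline=''
        with open(file_path, newline='') as data_fp:
            csv_reader = csv.DictReader(data_fp, fieldnames=columns)
            data += list(csv_reader)
    return data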

In [None]:
data = get_dicts('/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [None]:
len(data)

In [None]:
data[:10]

In [None]:
data = get_dicts('/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')

In [None]:
len(data)

In [None]:
data[:10]
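
The dicts returned by `csv.DictReader` hold every value as a string. If the metadata also carries data types, those could be applied as a further step. The sketch below is only an illustration under stated assumptions: the `data_type` key, the type names `integer`, `float` and `string`, and the helper `cast_row` are hypothetical and may not match the contents of retail.json.

```python
# Hypothetical schema entries; the 'data_type' key and its values are assumptions
schema = [
    {'column_name': 'order_id', 'data_type': 'integer'},
    {'column_name': 'order_status', 'data_type': 'string'},
]

# Map the assumed type names to Python callables
converters = {'integer': int, 'float': float, 'string': str}

def cast_row(row, schema, converters):
    """Return a new dict with each value converted as per the schema."""
    return {
        col['column_name']: converters[col['data_type']](row[col['column_name']])
        for col in schema
    }

typed = cast_row({'order_id': '1', 'order_status': 'CLOSED'}, schema, converters)
print(typed)
```

A row with a value that cannot be converted (for example an empty string for an integer column) would raise `ValueError`, so real data may need explicit handling for missing values.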