## Apply Schema on the dataframe from files

Let us understand how to apply a schema while creating a data frame.
* In many cases, data files do not contain metadata such as column names and data types.
* The metadata might be supplied in separate files. It is also common for metadata to be available via database tables or REST-based schema registries.
* We need to make sure that the metadata (schema) is applied to the data as part of data processing.

In this case, the data files are available under **/data/retail_db**, and the JSON file with the metadata is available under **schemas/retail_db/retail.json**.
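
As a minimal sketch of the idea, assume a schema shaped the way the later cells in this section use it: a dict keyed by data set name, where each entry is a list of dicts with a `column_name` key. The schema text and the two records below are made up for illustration and are not the contents of the actual files; the real **retail.json** may carry more keys per column.

```python
import io
import json

import pandas as pd

# Hypothetical schema text, shaped like the structure the later cells rely on
sample_schema_text = '''
{
    "orders": [
        {"column_name": "order_id"},
        {"column_name": "order_date"},
        {"column_name": "order_customer_id"},
        {"column_name": "order_status"}
    ]
}
'''

# Made-up headerless records standing in for a data file
sample_data_text = (
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

sample_schemas = json.loads(sample_schema_text)
sample_columns = [col['column_name'] for col in sample_schemas['orders']]

# names= supplies the column names, so the first record is treated as data
sample_df = pd.read_csv(io.StringIO(sample_data_text), names=sample_columns)
print(sample_df)
```

The rest of this section follows the same steps against the actual files.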

In [None]:
!ls -ltr /data/retail_db

In [None]:
!ls -ltr schemas/retail_db/retail.json

In [None]:
!cat schemas/retail_db/retail.json

In [None]:
!ls -ltr /data/retail_db/orders

In [None]:
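# Read orders data into a list of strings (one string per record)

orders_path = '/data/retail_db/orders/part-00000'
# The with block closes the file once the records are read
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()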

In [None]:
orders[:10]

In [None]:
# Load schemas into a dict using json

import json
with open('schemas/retail_db/retail.json') as schema_file:
    retail_schemas = json.load(schema_file)

In [None]:
retail_schemas

In [None]:
# Get the schema for the relevant data set

retail_schemas['orders']

In [None]:
# Fetch the column names

columns = list(map(lambda col: col['column_name'], retail_schemas['orders']))

In [None]:
columns

In [None]:
import pandas as pd

In [None]:
# Read the file with pandas, applying the column names from the schema
pd.read_csv('/data/retail_db/orders/part-00000', names=columns)

In [None]:
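# Alternatively, build the data frame from the list of strings read earlier
# (every value stays a string because the records are split by hand)
pd.DataFrame(map(lambda rec: rec.split(','), orders), columns=columns)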
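
The two approaches above do not produce identical data frames: `pd.read_csv` infers a data type for each column, while splitting the strings by hand leaves every value as a string. A minimal sketch with made-up records shows the difference and one way to convert a column afterwards. The `demo_` names and the two-column layout are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up records and column names, for illustration only
demo_lines = ['1,CLOSED', '2,COMPLETE']
demo_columns = ['order_id', 'order_status']

# read_csv parses the text and infers an integer type for order_id
inferred_df = pd.read_csv(io.StringIO('\n'.join(demo_lines)), names=demo_columns)

# Splitting by hand keeps every value as a string
split_df = pd.DataFrame([line.split(',') for line in demo_lines], columns=demo_columns)

print(inferred_df.dtypes)
print(split_df.dtypes)

# Convert explicitly when numeric behaviour is needed
split_df['order_id'] = split_df['order_id'].astype('int64')
print(split_df.dtypes)
```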

In [None]:
import os
import json
import pandas as pd

def get_df(base_folder, data_set_name, schema_file):
    # Sort the file names so that records are loaded in a predictable order
    file_names = sorted(os.listdir(f'{base_folder}/{data_set_name}'))
    with open(schema_file) as fp:
        retail_schemas = json.load(fp)
    columns = list(map(lambda col: col['column_name'], retail_schemas[data_set_name]))
    data = []
    for file_name in file_names:
        file_path = f'{base_folder}/{data_set_name}/{file_name}'
        # splitlines drops the trailing newline that would otherwise end up in the last column
        with open(file_path) as raw_data:
            data += raw_data.read().splitlines()
    return pd.DataFrame(map(lambda rec: rec.split(','), data), columns=columns)

In [None]:
orders = get_df('/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [None]:
order_items = get_df('/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')

In [None]:
customers = get_df('/data/retail_db', 'customers', 'schemas/retail_db/retail.json')
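
As an alternative sketch (not the approach used in `get_df`), the same kind of data frame can be assembled by letting `pd.read_csv` parse each part file and combining the pieces with `pd.concat`. The function name `get_df_with_read_csv` and the temporary sample files are hypothetical, created only so that the example runs on its own. Unlike `get_df`, this version lets pandas infer the column data types.

```python
import json
import os
import tempfile

import pandas as pd


def get_df_with_read_csv(base_folder, data_set_name, schema_file):
    # Hypothetical variant of get_df that delegates parsing to pandas
    with open(schema_file) as fp:
        schemas = json.load(fp)
    columns = [col['column_name'] for col in schemas[data_set_name]]
    data_set_folder = os.path.join(base_folder, data_set_name)
    frames = [
        pd.read_csv(os.path.join(data_set_folder, file_name), names=columns)
        for file_name in sorted(os.listdir(data_set_folder))
    ]
    # ignore_index=True renumbers the rows across all part files
    return pd.concat(frames, ignore_index=True)


# Build a tiny made-up data set in a temporary folder to exercise the function
with tempfile.TemporaryDirectory() as tmp_dir:
    orders_folder = os.path.join(tmp_dir, 'demo_db', 'orders')
    os.makedirs(orders_folder)
    sample_parts = [
        ('part-00000', '1,CLOSED\n2,COMPLETE\n'),
        ('part-00001', '3,PENDING\n'),
    ]
    for part_name, text in sample_parts:
        with open(os.path.join(orders_folder, part_name), 'w') as fp:
            fp.write(text)

    demo_schema_file = os.path.join(tmp_dir, 'demo.json')
    demo_schema = {'orders': [{'column_name': 'order_id'}, {'column_name': 'order_status'}]}
    with open(demo_schema_file, 'w') as fp:
        json.dump(demo_schema, fp)

    demo_orders = get_df_with_read_csv(
        os.path.join(tmp_dir, 'demo_db'), 'orders', demo_schema_file
    )

print(demo_orders)
```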