In [None]:
spark

## Convert column names to a common convention

Raw data can name its columns and fields in many different ways, but the data in our datalake should follow one common convention so that data users can work with it intuitively and without confusion.

The tables in raw use camel case for column names (for example `productLine`), whereas our datalake requires snake case (for example `product_line`).

![Data Models](https://www.mysqltutorial.org/wp-content/uploads/2009/12/MySQL-Sample-Database-Schema.png)

For example, the `productlines` table needs the following column changes:

- `productLine` changed to `product_line`
- `textDescription` changed to `text_description`
- `htmlDescription` changed to `html_description`
- `image` stays unchanged

We need to transform the column names of every table and store each transformed table at: `s3a://datalake/exercises/bronze/classicmodels/<table>.parquet`
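The target locations can be derived from the path pattern above. The sketch below is only an illustration: `BRONZE_ROOT`, `TABLES`, and `bronze_path` are hypothetical names, and the table list is taken from the section headings of this notebook.

```python
# Root of the bronze zone, copied from the path pattern in the text
BRONZE_ROOT = "s3a://datalake/exercises/bronze/classicmodels"

# Tables covered by the exercises in this notebook
TABLES = [
    "productlines", "products", "employees", "offices",
    "customers", "payments", "orders", "orderdetails",
]


def bronze_path(table):
    """Hypothetical helper: target location of one transformed table."""
    return f"{BRONZE_ROOT}/{table}.parquet"


for table in TABLES:
    print(bronze_path(table))

# Writing a transformed DataFrame `df` could then look like this
# (assumes an active Spark session and a DataFrame that is already renamed):
# df.write.mode("overwrite").parquet(bronze_path("productlines"))
```

Keeping the root in one constant means a change of bucket or zone only has to be made in one place.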

In [None]:
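def camel2snake(name):
    """Convert a camel-case name to snake-case, e.g. `productLine` -> `product_line`."""
    if len(name) < 2:
        # No adjacent pair to inspect; this also avoids reading `y` below
        # when the loop body never runs.
        return name.lower()
    result = []
    # Walk over adjacent character pairs (x, y)
    for x, y in zip(name[:-1], name[1:]):
        result.append(x.lower())
        # Insert "_" where a lowercase letter is followed by an uppercase letter or a digit
        if x.islower() and (y.isupper() or y.isdigit()):
            result.append("_")
    # The last character is never emitted as `x`, so append it here
    result.append(y.lower())
    return "".join(result)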
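As a quick check, the sketch below runs the helper over the `productlines` columns listed earlier and builds the list of renames that would be needed. `rename_plan` is a hypothetical helper name, not part of the exercise, and `camel2snake` is repeated (with a guard for very short names) so the cell runs on its own. The commented lines at the end show one way the plan could be applied with `DataFrame.withColumnRenamed`, assuming a DataFrame `df` has already been loaded from raw.

```python
def camel2snake(name):
    """Convert from camel-case to snake-case"""
    if len(name) < 2:
        return name.lower()
    result = []
    for x, y in zip(name[:-1], name[1:]):
        result.append(x.lower())
        if x.islower() and (y.isupper() or y.isdigit()):
            result.append("_")
    result.append(y.lower())
    return "".join(result)


def rename_plan(columns):
    """Hypothetical helper: return (old, new) pairs for columns whose name changes.

    Raises ValueError if two columns would end up with the same snake-case name.
    """
    new_names = [camel2snake(c) for c in columns]
    if len(set(new_names)) != len(new_names):
        raise ValueError(f"column name collision after conversion: {new_names}")
    return [(old, new) for old, new in zip(columns, new_names) if old != new]


# Column names of `productlines` as listed in the text
productlines_columns = ["productLine", "textDescription", "htmlDescription", "image"]
plan = rename_plan(productlines_columns)
print(plan)

# Applying the plan to a Spark DataFrame (assumes `df` was read from raw):
# for old, new in plan:
#     df = df.withColumnRenamed(old, new)
```

`image` does not appear in the plan because its name is already valid snake case. The collision check is a precaution: a table holding both `productLine` and `product_line` would otherwise end up with two identically named columns.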

## Productlines

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/productlines.parquet")

assert sorted(df.columns) == ['html_description', 'image', 'product_line', 'text_description']

## Products

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/products.parquet")

assert sorted(df.columns) == ['buy_price', 'msrp', 'product_code',
                              'product_description', 'product_line', 'product_name',
                              'product_scale', 'product_vendor', 'quantity_in_stock']

## Employees

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/employees.parquet")

assert sorted(df.columns) == ['email', 'employee_number', 'extension',
                              'first_name', 'job_title', 'last_name',
                              'office_code', 'reports_to']

## Offices

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/offices.parquet")

assert sorted(df.columns) == ['address_line_1', 'address_line_2', 'city',
                              'country', 'office_code', 'phone',
                              'postal_code', 'state', 'territory']

## Customers

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/customers.parquet")

assert sorted(df.columns) == ['address_line_1', 'address_line_2', 'city',
                              'contact_first_name', 'contact_last_name', 'country',
                              'credit_limit', 'customer_name', 'customer_number',
                              'phone', 'postal_code', 'sales_rep_employee_number',
                              'state']

## Payments

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/payments.parquet")

assert sorted(df.columns) == ['amount', 'check_number', 'customer_number', 'payment_date']

## Orders

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/orders.parquet")

assert sorted(df.columns) == ['comments', 'customer_number', 'order_date',
                              'order_number', 'required_date', 'shipped_date',
                              'status']

## Orderdetails

In [None]:
# CODE HERE

In [None]:
df = spark.read.format("parquet").load("s3a://datalake/exercises/bronze/classicmodels/orderdetails.parquet")

assert sorted(df.columns) == ['order_line_number', 'order_number', 'price_each', 'product_code', 'quantity_ordered']