# Ingesting Data Lab

Read in CSV files containing product data.

##### Tasks
1. Read with inferred schema
2. Read with user-defined schema
3. Read with schema as DDL formatted string
4. Write using Delta format

In [0]:
%run ../Includes/Classroom-Setup

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"


Validating the locally installed datasets...(3 seconds)

Creating & using the schema "da_lpalum_7163_asp"...(1 seconds)

Predefined tables in "da_lpalum_7163_asp":
  -none-

Predefined paths variables:
  DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
  DA.paths.user_db:     dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks/database.db
  DA.paths.working_dir: dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks
  DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks/_checkpoints
  DA.paths.sales:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03/ecommerce/sales/sales.delta
  DA.paths.users:       dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-d

### 1. Read with inferred schema
- View the first CSV file using DBUtils method **`fs.head`** with the filepath provided in the variable **`single_product_csv_file_path`**
- Create **`products_df`** by reading from CSV files located in the filepath provided in the variable **`products_csv_path`**
  - Configure options to use the first line as the header and to infer the schema

In [0]:
# ANSWER
single_product_csv_file_path = f"{DA.paths.datasets}/products/products.csv/part-00000-tid-1663954264736839188-daf30e86-5967-4173-b9ae-d1481d3506db-2367-1-c000.csv"
print(dbutils.fs.head(single_product_csv_file_path))

products_csv_path = f"{DA.paths.datasets}/products/products.csv"
products_df = (spark
               .read
               .option("header", True)
               .option("inferSchema", True)
               .csv(products_csv_path)
              )

products_df.printSchema()

item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0

root
 |-- item_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- price: double (nullable = true)
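
To build intuition for what the header and schema-inference options do, here is a minimal plain-Python sketch that applies a simplified inference rule to the sample rows printed above. It is not Spark's actual algorithm, which supports more types and options; the helper names `infer_type` and `infer_schema` are hypothetical and exist only for this illustration.

```python
import csv
import io

# Sample rows copied from the fs.head output in this lab.
SAMPLE = """item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
"""

def infer_type(values):
    """Return the first of int, double, string that fits every value (simplified rule)."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next, wider type
    return "string"

def infer_schema(text):
    """Treat the first line as the header and infer one type per column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

print(infer_schema(SAMPLE))
# {'item_id': 'string', 'name': 'string', 'price': 'double'}
```

The result agrees with the schema printed by `printSchema()` for these rows. Because inference has to scan the data before the real read, supplying a schema up front, as in the next sections, avoids that extra pass.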



**1.1: CHECK YOUR WORK**

In [0]:
assert products_df.count() == 12
print("All tests pass")

All tests pass


### 2. Read with user-defined schema
Define the schema by creating a **`StructType`** with column names and data types.

In [0]:
# ANSWER
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

user_defined_schema = StructType([
    StructField("item_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True)
])

products_df2 = (spark
                .read
                .option("header", True)
                .schema(user_defined_schema)
                .csv(products_csv_path)
               )

**2.1: CHECK YOUR WORK**

In [0]:
assert user_defined_schema.fieldNames() == ["item_id", "name", "price"]
print("All tests pass")

All tests pass


In [0]:
from pyspark.sql import Row

expected1 = Row(item_id="M_STAN_Q", name="Standard Queen Mattress", price=1045.0)
result1 = products_df2.first()

assert expected1 == result1
print("All tests pass")

All tests pass


### 3. Read with DDL formatted string

In [0]:
# ANSWER
ddl_schema = "`item_id` STRING,`name` STRING,`price` DOUBLE"

products_df3 = (spark
                .read
                .option("header", True)
                .schema(ddl_schema)
                .csv(products_csv_path)
               )

**3.1: CHECK YOUR WORK**

In [0]:
assert products_df3.count() == 12
print("All tests pass")

All tests pass
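
A DDL schema string is just a comma-separated list of column names and types, so it can be assembled from name and type pairs instead of being typed by hand. The sketch below is plain Python and does not call Spark; `to_ddl` is a hypothetical helper introduced only for this illustration.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string with backtick-quoted names."""
    return ",".join(f"`{name}` {dtype}" for name, dtype in columns)

# Same columns and types as the schema used in this lab.
ddl_schema = to_ddl([("item_id", "STRING"), ("name", "STRING"), ("price", "DOUBLE")])
print(ddl_schema)
# `item_id` STRING,`name` STRING,`price` DOUBLE
```

The resulting string is identical to the `ddl_schema` value passed to `.schema(...)` in the answer cell for this section.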


### 4. Write to Delta
Write **`products_df`** to the filepath provided in the variable **`products_output_path`**

In [0]:
# ANSWER
products_output_path = f"{DA.paths.working_dir}/delta/products"
(products_df
 .write
 .format("delta")
 .mode("overwrite")
 .save(products_output_path)
)

**4.1: CHECK YOUR WORK**

In [0]:
verify_files = dbutils.fs.ls(products_output_path)
verify_delta_format = False
verify_num_data_files = 0
for f in verify_files:
    if f.name == "_delta_log/":
        verify_delta_format = True
    elif f.name.endswith(".parquet"):
        verify_num_data_files += 1

assert verify_delta_format, "Data not written in Delta format"
assert verify_num_data_files > 0, "No data written"
del verify_files, verify_delta_format, verify_num_data_files
print("All tests pass")

All tests pass
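
The check above only inspects file names in the output directory. Another way to confirm the write is to load the Delta data back and count the rows. The following is a minimal sketch, assuming a live `spark` session and the `products_output_path` variable from this lab; `read_delta` is a hypothetical helper name, not part of the course materials.

```python
def read_delta(spark, path):
    """Load the Delta table stored at `path` using the given SparkSession."""
    return spark.read.format("delta").load(path)

# In the lab notebook this could be used as follows (needs a running cluster,
# so it is left commented out here):
# products_delta_df = read_delta(spark, products_output_path)
# assert products_delta_df.count() == 12
```

The expected count of 12 comes from the earlier check on `products_df`; it should hold as long as the write used `overwrite` mode and nothing else has written to the same path.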


### Clean up classroom

In [0]:
DA.cleanup()

Resetting the learning environment...
...dropping the schema "da_lpalum_7163_asp"...(0 seconds)
...removing the working directory "dbfs:/mnt/dbacademy-users/lpalum@ur.rochester.edu/apache-spark-programming-with-databricks"...(1 seconds)


Validating the locally installed datasets...(3 seconds)
