### Ingest circuits.csv

#### Step 1 - Read the CSV file using the Spark DataFrame reader

In [0]:
%run /Workspace/Users/nicolas97alonso@gmail.com/databricks-course/utils/vairables

In [0]:
circuits_path = f"{raw_root}/circuits.csv"

#### Why `inferSchema` should be avoided in production

Relying on `inferSchema` in production pipelines is risky because Spark must scan the entire dataset to guess column types. This adds unnecessary cost and makes the pipeline less predictable. Schema inference can also produce different results when new data arrives, which leads to inconsistent schemas across runs.

- **Performance impact** — Spark performs an extra full scan of all input files to infer types, which becomes expensive at scale.
- **Unstable behavior** — Type inference depends on the data itself. If new files contain different patterns, Spark may infer a different type (for example, `Integer` becoming `Long` or even `String`).
- **Silent schema drift** — Pipelines may continue running with a changed schema, causing downstream failures or corrupted tables.
- **Weak data contracts** — Production systems should treat the schema as a stable contract. Inference breaks that assumption.
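The unstable-behavior point can be illustrated without a cluster. The sketch below is a toy stand-in for type inference written in plain Python, not Spark's actual algorithm; the helpers `infer_value_type` and `infer_column_type` and the sample batches are hypothetical. It shows how the same column can end up with a different type depending on which values happen to be in the batch.

```python
# Toy illustration only: this mimics the *idea* of type inference.
# It is not how Spark implements inferSchema.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

# Wider types appear later in this list.
WIDENING_ORDER = ["int", "long", "double", "string"]


def infer_value_type(value: str) -> str:
    """Return the narrowest type name that can hold a single CSV value."""
    try:
        number = int(value)
    except ValueError:
        pass
    else:
        return "int" if INT_MIN <= number <= INT_MAX else "long"
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def infer_column_type(values):
    """Pick the widest type required by any value in the column."""
    return max((infer_value_type(v) for v in values), key=WIDENING_ORDER.index)


monday_batch = ["10", "144", "7"]
tuesday_batch = ["10", "3000000000", "7"]   # one value no longer fits in 32 bits
wednesday_batch = ["10", "N/A", "7"]        # one placeholder string

print(infer_column_type(monday_batch))     # int
print(infer_column_type(tuesday_batch))    # long
print(infer_column_type(wednesday_batch))  # string
```

A single unexpected value is enough to change the type of the whole column, which is why a declared schema is the safer contract.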

In [0]:
# Demonstration only: inferSchema makes Spark read the data to guess column types.
circuits_df = spark.read \
    .option('header', True) \
    .option('inferSchema', True) \
    .csv(circuits_path)
display(circuits_df)


#### *❗Now we continue without inferring the schema*

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

In [0]:
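# Explicit schema: StructField(column name, data type, nullable).
# Note: for file sources Spark may still report every column as nullable,
# so the False flag on circuitId documents intent rather than enforcing it.
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("circuitRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True),
    StructField("url", StringType(), True)
])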

In [0]:
# Read with the explicit schema, so Spark does not need an inference pass.
circuits_df = spark.read \
    .option('header', True) \
    .schema(circuits_schema) \
    .csv(circuits_path)

display(circuits_df)

In [0]:
circuits_df.printSchema()