# 2. Getting Started - Quiz

## 2.0 Imports

Import `polars`.

In [1]:
import polars as pl

## 2.1 Question 1: Creating DataFrame from Dictionary

Given the data dictionary about schoolchildren, create a `pl.DataFrame` and display it. What is the datatype of each column?

In [5]:
data_dictionary = {
    "first_name": ["danny", "stanny", "ranny"],
    "last_name": ["lang", "slang", "rang"],
    "age": [4.5, 4., 5.],
    "favorite_subject": ["math", "english", "art"],
}
#### YOUR CODE HERE
display(pl.DataFrame(data_dictionary))

| first_name | last_name | age | favorite_subject |
| --- | --- | --- | --- |
| str | str | f64 | str |
| "danny" | "lang" | 4.5 | "math" |
| "stanny" | "slang" | 4.0 | "english" |
| "ranny" | "rang" | 5.0 | "art" |


1. (`str`, `str`, `f64`, `str`) - correct - Columns 1, 2, and 4 are strings, but column 3 holds float values (`4.5`, `4.`, `5.`), so it is inferred as a float. See "Creating a Polars DataFrame."
2. (`str`, `str`, `f64`, `cat`) - incorrect - For the last column to be a categorical variable, you'd need to perform some additional typecasting on the column. See "Creating a Polars DataFrame."
3. (`str`, `str`, `i64`, `cat`) - incorrect - For the last column to be a categorical variable, you'd need to perform some additional typecasting on the column. Also, the `age` column would be inferred as `i64` only if all of its values were integers; here they are floats, so it is `f64`. See "Creating a Polars DataFrame."
4. (`str`, `i64`, `cat`) - incorrect - This option is missing a column. See "Creating a Polars DataFrame."

## 2.2 Question 2: Loading CSV with Schema Override

In the module, we loaded data from the CSV file, overriding the schema of the columns `tpep_pickup_datetime` and `tpep_dropoff_datetime` and loading them as a `pl.Datetime` datatype. Now, override the schema to load them as a `pl.Date` datatype. What happens?

In [14]:
df = pl.read_csv(
    "../data/yellow_tripdata_2024-03.csv",
    schema_overrides={
        #### YOUR CODE HERE
        "tpep_pickup_datetime": pl.Date,
        "tpep_dropoff_datetime": pl.Date,
    }
)
df.head()

ComputeError: could not parse `2024-03-27T05:51:45.000000000` as dtype `date` at column 'tpep_pickup_datetime' (column number 2)

The current offset in the file is 323447780 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `2024-03-27T05:51:45.000000000` to the `null_values` list.

Original error: ```could not convert pattern```

1. All the data is loaded, and the columns `tpep_pickup_datetime` and `tpep_dropoff_datetime` are loaded as a `str` datatype - incorrect - This would happen only if you didn't pass a `schema_overrides` argument to the function call. Perhaps you didn't enter the code correctly—try again. See "Reading Data from CSV with In-Memory Mode."
2. All the data is loaded, and the columns `tpep_pickup_datetime` and `tpep_dropoff_datetime` are loaded as a `datetime` datatype - incorrect - This would happen only if you passed `pl.Datetime` as the schema override. Here, you're trying to override with `pl.Date`. See "Reading Data from CSV with In-Memory Mode."
3. The data doesn't load - correct - Polars raises a `ComputeError` if you try to force a schema that it cannot conform the data to. See "Reading Data from CSV with In-Memory Mode."
4. All the data is loaded, and the columns `tpep_pickup_datetime` and `tpep_dropoff_datetime` are loaded as a `date` datatype - incorrect - It would be nice if this happened, but unfortunately Polars does not behave this way when you try to force a schema that it can't conform the data to. Try to run the code again. See "Reading Data from CSV with In-Memory Mode."

## 2.3 Question 3: LazyFrame vs DataFrame Selection Speed

In the module, you saw that selecting columns from a `LazyFrame` is about two or three times faster than selecting columns from a `DataFrame` when data is loaded from a CSV file. However, we only did this for CSV, not for Parquet. Which file type do you think would see a greater speedup when selecting on a `LazyFrame` instead of a `DataFrame`: CSV or Parquet, and why?

1. CSV, because it's a simpler file type - incorrect - Though CSV is a simpler file type, it's not optimized for column operations. See "Selecting Data: In-Memory Versus Lazy Mode Comparison."
2. CSV, because it's an older file format and thus the Polars code for interacting with it is better developed - incorrect - Though CSV is indeed an older file format, Parquet sees the greater speedup when selecting on a `LazyFrame`. See "Selecting Data: In-Memory Versus Lazy Mode Comparison."
3. Parquet, because both Polars and Parquet are built on the Apache Arrow memory model and thus the Polars development team has spent more time developing the functionality associated with Parquet, making its IO operations faster - incorrect - This may or may not be true, but either way it's not the reason why Parquet sees a greater speedup from lazy selection over in-memory selection. See "Selecting Data: In-Memory Versus Lazy Mode Comparison."
4. Parquet, because Parquet files store the data for each column together in the file and thus, when the select gets pushed down to the read operation of the `LazyFrame`, the reader can skip the unnecessary columns' data faster than it can for a CSV file - correct - Exactly! This is the main benefit of a columnar file format. See "Selecting Data: In-Memory Versus Lazy Mode Comparison."

## 2.4 Question 4: Highest Null Count in Dataset

Inspect the dataset with `df.describe()`. What is the highest `null_count` of any column?

In [16]:
#### YOUR CODE HERE

1. 3582628 - incorrect - This is the number of rows in the dataset. You may not be reading the correct row in the describe table. See "Reading Data from CSV with In-Memory Mode."
2. 426190 - correct - Exactly! There are a few columns with exactly this number of nulls. We'll discuss this in more depth in a later module. See "Reading Data from CSV with In-Memory Mode."
3. 0 - incorrect - We're looking for the highest `null_count` that any column has, not the lowest. See "Reading Data from CSV with In-Memory Mode."
4. 176836 - incorrect - If you chose this answer, you are probably not looking at the right place in the table. See "Reading Data from CSV with In-Memory Mode."