# 8. Data Manipulation VI - Interoperation and IO - Quiz

## 8.0 Import `polars` and Load Data

In [1]:
import polars as pl

In [2]:
zone_column_rename_mapping = {
    "LocationID": "location_id",
    "Borough": "borough",
    "Zone": "zone",
}
zones_df = (
    pl.read_parquet("../data/taxi_zone_lookup.parquet")
    .rename(zone_column_rename_mapping)
)

In [3]:
yellow_rides_column_rename_mapping = {
    "VendorID": "vendor_id",
    "RatecodeID": "ratecode_id",
    "PULocationID": "pu_location_id",
    "DOLocationID": "do_location_id",
    "Airport_fee": "airport_fee",
}

rides_df = (
    pl.read_parquet("../data/yellow_tripdata_2024-03.parquet")
    .rename(yellow_rides_column_rename_mapping)
    .join(
        zones_df.select(pl.all().name.prefix("pu_")),
        on="pu_location_id",
    )
    .join(
        zones_df.select(pl.all().name.prefix("do_")),
        on="do_location_id",
    )
)

## 8.1 Question 1: DataFrame Shape from List of Lists (Column Orientation)

Given the data in the form of a list of lists, create a DataFrame using `pl.from_records()` with a column orientation. What is the shape of that DataFrame?

In [8]:
data = [
    ["bonnie", "ronnie", "donny", "johnny", "scrawny"],
    list(range(5)),
    list(range(4, 9)),
]
result = (
    #### YOUR CODE HERE
    pl.from_records(data, orient="col")
    .shape
)
print(result)

(5, 3)


1. (3, 5) - incorrect - Make sure you're using the `"col"` orientation! See "Interoperating DataFrames with Native Python Objects."
2. (5, 3) - correct - Exactly! With column orientation, each inner list becomes a column, so three lists of five values give five rows and three columns. See "Interoperating DataFrames with Native Python Objects."

## 8.2 Question 2: DataFrame Shape from List of Lists (Row Orientation)

Given the data in the form of a list of lists, create a DataFrame using `pl.from_records()` with a row orientation. What is the shape of that DataFrame?

In [10]:
data = [
    ["bonnie", "ronnie", "donny", "johnny", "scrawny"],
    list(range(5)),
    list(range(4, 9)),
]
result = (
    #### YOUR CODE HERE
    pl.from_records(data, orient="row")
    .shape
)
print(result)

(3, 5)


1. (3, 5) - correct - Exactly! With row orientation, each inner list becomes a row, so three lists of five values give three rows and five columns. See "Interoperating DataFrames with Native Python Objects."
2. (5, 3) - incorrect - Make sure you're using the `"row"` orientation! See "Interoperating DataFrames with Native Python Objects."
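
The two orientation answers in Questions 1 and 2 can be sanity-checked without polars. The sketch below is plain Python and says nothing about how polars builds the DataFrame internally; `shape_for` is a hypothetical helper, introduced only for illustration, that computes the `(rows, columns)` shape each orientation implies for a rectangular list of lists.

```python
# Hypothetical helper (not part of polars): only the shape arithmetic.
def shape_for(records, orient):
    """Return (n_rows, n_columns) for a rectangular list of lists."""
    n_outer = len(records)
    n_inner = len(records[0]) if records else 0
    if orient == "col":
        # Each inner list is one column, so its length is the row count.
        return (n_inner, n_outer)
    if orient == "row":
        # Each inner list is one row, so its length is the column count.
        return (n_outer, n_inner)
    raise ValueError(f"unknown orient: {orient!r}")


data = [
    ["bonnie", "ronnie", "donny", "johnny", "scrawny"],
    list(range(5)),
    list(range(4, 9)),
]
col_shape = shape_for(data, "col")
row_shape = shape_for(data, "row")
print(col_shape, row_shape)
```

The same three lists of five values produce transposed shapes depending only on whether the inner lists are treated as columns or as rows.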

## 8.3 Question 3: Datatype Changes After CSV Write and Read

Write the DataFrame to a CSV file at `"./temp_file.csv"` with `.write_csv()`, then read it back in with `pl.read_csv()`. Have the datatypes changed?

In [19]:
save_path = "temp_file.csv"
df = pl.DataFrame([
    pl.Series(name="a", values=500 * ["derek", "carole"]),
    pl.Series(name="b", values=list(range(1000))),
    pl.Series(name="c", values=[i % 4 for i in range(1000)], dtype=pl.UInt8),
])
print(df.head(0))  # See schema before saving.
#### YOUR CODE HERE
df.write_csv(save_path)
print(pl.read_csv(save_path).head(0))

shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u8  │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘
shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘


1. Yes - correct - Exactly! CSV is plain text and stores no datatypes, so polars has to infer them on load. It defaults to `i64` for integers, which is why column `c` comes back as `i64` instead of `u8`. See "DataFrame IO."
2. No - incorrect - Make sure you're not passing any extra arguments to `pl.read_csv()`, so that polars infers the datatypes. See "DataFrame IO."
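
To see why a CSV reader has to infer types at all, here is a minimal sketch using only Python's standard `csv` module, not polars. The two-row sample and the variable names are made up for illustration. Every value comes back as text, so a library reading the file has to guess at the intended types.

```python
import csv
import io

# Illustrative sample rows; the integers are real Python ints before writing.
rows = [
    {"a": "derek", "b": 0, "c": 0},
    {"a": "carole", "b": 1, "c": 1},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["a", "b", "c"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back: the csv module returns every field as a string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
types_read = {name: type(value).__name__ for name, value in read_back[0].items()}
print(types_read)
```

Because the file holds only characters, nothing in it can tell a reader that column `c` was originally an unsigned 8-bit integer.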

## 8.4 Question 4: Datatype Changes After NDJSON Write and Read

Write the DataFrame to an NDJSON file at `"./temp_file.njson"` with `.write_ndjson()`, then read it back in with `pl.read_ndjson()`. Have the datatypes changed?

In [20]:
save_path = "temp_file.njson"
df = pl.DataFrame([
    pl.Series(name="a", values=500 * ["derek", "carole"]),
    pl.Series(name="b", values=list(range(1000))),
    pl.Series(name="c", values=[i % 4 for i in range(1000)], dtype=pl.UInt8),
])
print(df.head(0))  # See schema before saving.
#### YOUR CODE HERE
df.write_ndjson(save_path)
print(pl.read_ndjson(save_path).head(0))

shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u8  │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘
shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘


1. Yes - correct - Exactly! NDJSON distinguishes strings from numbers but records no integer width, so polars has to infer the datatypes on load. It defaults to `i64` for integers, which is why column `c` comes back as `i64` instead of `u8`. See "DataFrame IO."
2. No - incorrect - Make sure you're not passing any extra arguments to `pl.read_ndjson()`, so that polars infers the datatypes. See "DataFrame IO."
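
The NDJSON case can be sketched in the same way with Python's standard `json` module, again as an illustration and not a description of polars internals. The sample rows and variable names are made up. Each line is one JSON object. JSON keeps strings and numbers apart, but it has only one number type, so an integer width such as `u8` has nowhere to be stored.

```python
import json

# Illustrative sample rows, one JSON object per output line.
rows = [
    {"a": "derek", "b": 0, "c": 0},
    {"a": "carole", "b": 1, "c": 1},
]

# Newline-delimited JSON: serialize each row on its own line.
ndjson_text = "\n".join(json.dumps(row) for row in rows)

# Parse it back line by line.
read_back = [json.loads(line) for line in ndjson_text.splitlines()]
types_read = {name: type(value).__name__ for name, value in read_back[0].items()}
print(types_read)
```

Unlike the CSV sketch, the integers survive as integers here, but both `b` and `c` come back as the same generic integer type.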

## 8.5 Question 5: Datatype Changes After Parquet Write and Read

Write the DataFrame to a Parquet file at `"./temp_file.parquet"` with `.write_parquet()`, then read it back in with `pl.read_parquet()`. Have the datatypes changed?

In [21]:
save_path = "temp_file.parquet"
df = pl.DataFrame([
    pl.Series(name="a", values=500 * ["derek", "carole"]),
    pl.Series(name="b", values=list(range(1000))),
    pl.Series(name="c", values=[i % 4 for i in range(1000)], dtype=pl.UInt8),
])
print(df.head(0))  # See schema before saving.
#### YOUR CODE HERE
df.write_parquet(save_path)
print(pl.read_parquet(save_path).head(0))

shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u8  │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘
shape: (0, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u8  │
╞═════╪═════╪═════╡
└─────┴─────┴─────┘


1. Yes - incorrect - Make sure you're not transforming the DataFrame before saving it with `.write_parquet()`. See "DataFrame IO."
2. No - correct - Exactly! Parquet stores the schema alongside the data, so the DataFrame loads back with the same datatypes it was saved with, including `u8` for column `c`. See "DataFrame IO."