## Iterating through a DataFrame
By the end of this lecture you will be able to:
- iterate through a column row-by-row
- iterate through multiple columns row-by-row
- understand the performance effect of the different options

We introduce iteration methods here, but we should avoid iterating through a `DataFrame` whenever the same result can be computed with expressions, because expressions are much faster.

In [8]:
import polars as pl

In [9]:
csv_file = "../Files/Sample_Superstore.csv"

In [10]:
df = pl.read_csv(csv_file)
df.head(3)

Row_ID,Order_ID,Order_Date,Ship Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""11/8/2016""","""11/11/2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""6/12/2016""","""6/16/2016""","""Second Class""","""DV-13045""","""Darrin Van Huff""","""Corporate""","""United States""","""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


### Iterating over a single column
We can iterate over a single column just as we would with a pandas column or a NumPy array.

In [11]:
Profits = [Profit for Profit in df["Profit"]]
Profits[:3]

[41.9136, 219.582, 6.8714]

### Iterating over multiple columns
We can iterate over multiple columns using the `rows` method of a `DataFrame`.

In this example we create a list where each element is a tuple of the `Customer_Name` and `Profit` for one row.

In [12]:
# Customer_Name is at index 6 and Profit is at index 20 (zero-based)
Customer_Profit = [(row[6],row[20]) for row in df.rows()]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

Alternatively, we can do this with the `iter_rows` method.

In [13]:
# Customer_Name is at index 6 and Profit is at index 20 (zero-based)
Customer_Profit = [(row[6],row[20]) for row in df.iter_rows()]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

#### Difference between `rows` and `iter_rows`?
Both `rows` and `iter_rows` produce the same rows. The difference is when those rows are materialised:
- when we call `rows` the entire `DataFrame` is materialised as a list of Python tuples where each tuple is a row. We can then iterate over this list of tuples.
- when we call `iter_rows` Polars materialises each row as a Python tuple when we iterate over it rather than materialising the whole `DataFrame` at the outset.

Use `rows` if you are iterating through the full `DataFrame` and have enough memory to materialise the whole `DataFrame` as a list of tuples.

Use `iter_rows` if you want to reduce memory use by not materialising the whole `DataFrame` as a list of tuples.

### Iterating with named columns
In the examples with `rows` and `iter_rows` above we use positional indexing to select the column. We can instead look up values by column name by passing `named=True`, which returns a `dict` for each row.

In [14]:
Customer_Profit = [(row["Customer_Name"],row["Profit"]) for row in df.rows(named=True)]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

In [15]:
Customer_Profit = [(row["Customer_Name"],row["Profit"]) for row in df.iter_rows(named=True)]
Customer_Profit[:3]

[('Claire Gute', 41.9136),
 ('Claire Gute', 219.582),
 ('Darrin Van Huff', 6.8714)]

This approach with named values is easier to read but slower, as a `dict` must be created for each row.