# Python for Data Engineering

This notebook walks through five steps:

1. **Read raw text data from a file**.
2. **Convert raw text into a Python collection** (list of records).
3. **Inspect and access records** (indexing, slicing).
4. **Model the same record using different collections** (list, set, tuple, dict).
5. **Process records using `map`, `filter`, list comprehensions, and `sorted`**.


## Open the orders file

Open the file handle in **read mode (`'r'`)**. A file handle is *not* the file content; it is a Python object that lets us read the content.
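To make the handle-versus-content distinction concrete, here is a minimal sketch. It assumes a throwaway demo file (the `demo_path` name and its single record are hypothetical, created only so the cell runs anywhere) and uses a `with` block, which closes the handle automatically.

```python
import os
import tempfile

# Hypothetical demo file so the sketch runs anywhere; not the real orders file.
demo_path = os.path.join(tempfile.mkdtemp(), 'orders_demo.csv')
with open(demo_path, 'w') as f:
    f.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')

# The handle is a Python object; the content only appears once we call read().
with open(demo_path, 'r') as demo_file:
    handle_type = type(demo_file).__name__
    content = demo_file.read()

print(handle_type)       # type of the handle object, not str
print(type(content))     # the content itself is a str
print(demo_file.closed)  # the with block closed the handle on exit
```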

### Note for execution environments

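If `sample_data.csv` is not available in your environment, the next cell falls back to a **sample `orders` list**. The cells that open and read `orders.csv` directly still need that file; skip them if it is missing, since the list-processing examples only need `orders`.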

In [None]:
# Try the local file first; fall back to sample data if it is missing
try:
    # `with` closes the file handle as soon as the block ends
    with open('sample_data.csv', 'r') as sample_file:
        orders = sample_file.read().splitlines()
except FileNotFoundError:
    orders = [
        '1,2013-07-25 00:00:00.0,11599,CLOSED',
        '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
        '3,2013-07-25 00:00:00.0,12111,COMPLETE',
        '4,2013-07-25 00:00:00.0,8827,CLOSED',
        '5,2013-07-25 00:00:00.0,11318,COMPLETE',
        '6,2013-07-25 00:00:00.0,7130,COMPLETE',
        '7,2013-07-25 00:00:00.0,4530,COMPLETE',
        '8,2013-07-25 00:00:00.0,2911,PROCESSING',
        '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
        '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT',
        '11,2013-07-25 00:00:00.0,918,PAYMENT_REVIEW',
        '12,2013-07-25 00:00:00.0,1837,CLOSED',
        '13,2013-07-25 00:00:00.0,9149,PENDING_PAYMENT',
        '14,2013-07-25 00:00:00.0,9842,PROCESSING',
        '15,2013-07-25 00:00:00.0,2568,COMPLETE',
        '16,2013-07-25 00:00:00.0,7276,PENDING_PAYMENT',
        '17,2013-07-25 00:00:00.0,2667,COMPLETE',
        '18,2013-07-25 00:00:00.0,1205,CLOSED',
        '19,2013-07-25 00:00:00.0,9488,PENDING_PAYMENT',
        '20,2013-07-25 00:00:00.0,9198,PROCESSING'
    ]

In [None]:
orders_file = open('orders.csv', 'r')

## Read entire file content into a single string

`read()` loads the whole file into memory as one **string**. This is convenient for quick exploration, but for very large files it can be memory-heavy.

In [None]:
orders_str = orders_file.read()
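When a file is too large for `read()`, one common alternative is to iterate over the handle, which yields one line at a time. The sketch below assumes a small hypothetical demo file (`demo_path`) written on the spot; it is an illustration, not the approach this notebook uses later.

```python
import os
import tempfile

# Hypothetical two-record demo file, created only for this sketch.
demo_path = os.path.join(tempfile.mkdtemp(), 'orders_demo.csv')
with open(demo_path, 'w') as f:
    f.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n'
            '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')

line_count = 0
with open(demo_path, 'r') as demo_file:
    for line in demo_file:          # one line per iteration, not the whole file
        record = line.rstrip('\n')  # drop the trailing newline
        line_count += 1

print(line_count)
print(record)  # the last record that was read
```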

## Confirm the data type

We verify that the result of `read()` is a Python `str` (string).

In [None]:
type(orders_str)

## Convert the file content into a list of records

`splitlines()` splits the big string into a **list of lines**. Each line becomes one element (one order record).

In [None]:
orders = orders_str.splitlines()
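A short sketch of why `splitlines()` is usually preferred over `split('\n')` for file content: files typically end with a newline, and `split('\n')` then leaves an empty string at the end. The `raw` string here is a made-up two-record example.

```python
# Made-up content with a trailing newline, as most text files have.
raw = '1,CLOSED\n2,PENDING_PAYMENT\n'

lines_a = raw.splitlines()  # no empty trailing element
lines_b = raw.split('\n')   # keeps an empty string after the last newline

print(lines_a)
print(lines_b)
```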

## Count the number of records

`len(orders)` returns the number of lines (records) present in the file.

In [None]:
len(orders)

## Preview the first few records

Slicing (`[:10]`) returns a new list containing the first 10 records. This is a fast way to confirm the file format.

In [None]:
orders[:10]

## Preview the last few records

Negative slicing (`[-10:]`) returns the last 10 records.

In [None]:
orders[-10:]
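One detail worth knowing when previewing: slices never raise `IndexError`, even if the list is shorter than the requested range. A minimal sketch with a hypothetical three-element list:

```python
# Hypothetical short list standing in for a small file.
records = ['r1', 'r2', 'r3']

first_ten = records[:10]   # fewer than 10 elements: returns what exists
last_two = records[-2:]    # the last two elements

print(first_ten)
print(last_two)
print(first_ten is records)  # a slice is a new list, not the original object
```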

## Overview of Python Collections

- **`list`**: ordered collection; duplicates allowed.
- **`set`**: unordered collection; unique elements only.
- **`tuple`**: ordered collection; typically used for fixed-size, read-only records.
- **`dict`**: key-value mapping; keys are unique.

In Data Engineering, **lists** and **dicts** are extremely common for staging and transforming records.
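The four collection types can be compared side by side in a small sketch. The variable names and values below are illustrative only.

```python
# list: ordered, duplicates allowed
statuses_list = ['CLOSED', 'COMPLETE', 'CLOSED']

# set: unique elements only
statuses_set = set(statuses_list)

# tuple: fixed positions
record_tuple = (1, 'CLOSED')

# dict: named fields with unique keys
record_dict = {'order_id': 1, 'order_status': 'CLOSED'}

print(len(statuses_list))   # duplicates are counted
print(len(statuses_set))    # duplicates are collapsed
print(record_tuple[1])      # positional access
print(record_dict['order_status'])  # access by key
```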

## Create a Python list of order records

Here we define `orders` explicitly as a list of strings (each string is one record). In practice, this list often comes from file reads (`splitlines()`) or API pulls.

In [None]:
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
    '6,2013-07-25 00:00:00.0,7130,COMPLETE',
    '7,2013-07-25 00:00:00.0,4530,COMPLETE',
    '8,2013-07-25 00:00:00.0,2911,PROCESSING',
    '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
    '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT'
]

## Access a record by index

Lists are **0-indexed**. `orders[1]` returns the second record.

In [None]:
orders[1]

## Remove duplicates with a `set`

A `set` keeps only unique values. When we write a set literal with repeated numbers, the duplicates are dropped automatically.

In [None]:
s = {1, 1, 3, 2, 2, 5}
s

## Represent a record using a `tuple`

A tuple is useful when the **structure is fixed** (same positions always mean the same thing). Here we store `(order_id, order_date, order_customer_id, order_status)`.

In [None]:
ordert = (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
ordert
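Because the positions are fixed, a tuple can be unpacked into named variables in one statement, and any attempt to change a field fails. A minimal sketch using the same record:

```python
ordert = (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')

# Unpack the four positions into named variables.
order_id, order_date, order_customer_id, order_status = ordert
print(order_customer_id)

# Tuples are read-only: item assignment raises TypeError.
try:
    ordert[3] = 'COMPLETE'
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name)
```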

## Represent a record using a `dict`

A dict is useful when you want **named fields** (keys) instead of positional access. This reduces mistakes during transformations.

In [None]:
orderd = {
    'order_id': 1,
    'order_date': '2013-07-25 00:00:00.0',
    'order_customer_id': 11599,
    'order_status': 'CLOSED'
}
orderd
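In practice a dict like this is often built from a raw record rather than typed by hand. One possible sketch pairs field names with the split values using `zip`; the `field_names` tuple and the `int` conversions are assumptions based on the field layout used in this notebook.

```python
# Assumed field layout, matching the tuple example above.
field_names = ('order_id', 'order_date', 'order_customer_id', 'order_status')
record = '1,2013-07-25 00:00:00.0,11599,CLOSED'

# Pair each name with its value; every value is still a string here.
order_dict = dict(zip(field_names, record.split(',')))

# Convert the numeric fields explicitly.
order_dict['order_id'] = int(order_dict['order_id'])
order_dict['order_customer_id'] = int(order_dict['order_customer_id'])

print(order_dict)
```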

## Get dictionary keys and values

- `keys()` returns a *view* of all keys.
- `values()` returns a *view* of all values.

In [None]:
orderd.keys()

In [None]:
orderd.values()

## Access a dictionary value using `[]`

Using `orderd['order_customer_id']` returns the value for that key. If the key is missing, Python raises a `KeyError`.

In [None]:
orderd['order_customer_id']

## Access a dictionary value using `.get()`

Using `get()` is safer for exploratory pipelines:
- If the key exists, it returns the value.
- If the key does not exist, it returns `None` (or a default you provide) instead of raising `KeyError`.

In [None]:
orderd.get('order_customer_id')

## What happens if the key does not exist?

This cell demonstrates the difference between `[]` and `.get()` when the key is missing.

In [None]:
# This will raise KeyError (uncomment to see the exception)
# orderd['customer_id']

# This returns None (no exception)
orderd.get('customer_id')
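The default argument of `.get()` mentioned earlier can be sketched as follows. The `-1` sentinel is an arbitrary choice for illustration.

```python
orderd = {
    'order_id': 1,
    'order_date': '2013-07-25 00:00:00.0',
    'order_customer_id': 11599,
    'order_status': 'CLOSED'
}

missing = orderd.get('customer_id')        # no default: returns None
fallback = orderd.get('customer_id', -1)   # arbitrary sentinel as default
present = 'customer_id' in orderd          # membership test on keys

print(missing)
print(fallback)
print(present)
```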

## Processing Python lists

We will:
- filter records by `order_status`
- extract a specific field from each record
- find unique statuses
- sort records by `order_customer_id`

### Inspect a single order record and its status

Each record is a comma-separated string. `split(',')` converts it into a list of fields.

Index mapping (0-based):
- `[0]` → order_id
- `[1]` → order_date
- `[2]` → order_customer_id
- `[3]` → order_status

In [None]:
order = orders[2]
order

In [None]:
order.split(',')[3]

In [None]:
order.split(',')[3] == 'COMPLETE'

### Filter orders with `filter()`

`filter(predicate, iterable)` keeps only elements where the predicate returns `True`.

Here the predicate checks whether `order_status` equals `'COMPLETE'`.

In [None]:
list(filter(lambda order: order.split(',')[3] == 'COMPLETE', orders))
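The same filter can also be written as a list comprehension, which many Python readers find easier to scan. This sketch uses a shortened four-record sample so it runs on its own.

```python
# Shortened sample of the orders list.
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]

complete_filter = list(filter(lambda order: order.split(',')[3] == 'COMPLETE', orders))
complete_comp = [order for order in orders if order.split(',')[3] == 'COMPLETE']

print(complete_comp)
print(complete_filter == complete_comp)  # both forms give the same result
```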

### Filter orders for multiple statuses

The predicate checks whether `order_status` is in a tuple of allowed values (a set would work equally well).

In [None]:
list(filter(lambda order: order.split(',')[3] in ('COMPLETE', 'CLOSED'), orders))

### Extract `order_status` for every order using `map()`

This builds a list of statuses (one per record) by selecting field `[3]` from each record.

In [None]:
list(map(lambda order: order.split(',')[3], orders))
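As with `filter()`, the `map()` call has a list-comprehension equivalent. A sketch on a shortened sample:

```python
# Shortened sample of the orders list.
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
]

statuses_map = list(map(lambda order: order.split(',')[3], orders))
statuses_comp = [order.split(',')[3] for order in orders]

print(statuses_comp)
```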

### Count the extracted statuses

`len(list(map(...)))` confirms that we produced exactly one status per record.

In [None]:
len(list(map(lambda order: order.split(',')[3], orders)))

### Get unique statuses using `set()`

Wrapping the mapped statuses in `set()` removes duplicates and gives the distinct set of statuses in the dataset.

In [None]:
set(map(lambda order: order.split(',')[3], orders))
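A `set` tells us which statuses exist but not how often each occurs. If counts are also needed, one option (not used elsewhere in this notebook) is `collections.Counter` from the standard library, sketched here on a shortened sample:

```python
from collections import Counter

# Shortened sample of the orders list.
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]

# Count how many records carry each status.
status_counts = Counter(order.split(',')[3] for order in orders)

print(status_counts['CLOSED'])
print(set(status_counts))  # the distinct statuses, same as the set() approach
```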

### Default sorting of strings

`sorted(orders)` compares the records as strings, character by character. This is **not numeric sorting** on `order_id` or `order_customer_id`: the record starting with `10,` sorts before the one starting with `2,`.

In [None]:
sorted(orders)
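The effect of string comparison is easiest to see on a few records. In this sketch, three records from the list are given out of order; the result is ordered by characters, not by the numeric `order_id`.

```python
# Three records from the orders list, deliberately out of order.
orders = [
    '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
]

# Extract the leading order_id of each record after the default sort.
sorted_ids = [order.split(',')[0] for order in sorted(orders)]
print(sorted_ids)  # '10' lands between '1' and '2'
```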

### Build a numeric key from the record

To sort by `order_customer_id` we must:
1. Extract field `[2]` (customer id) using `split(',')`
2. Convert it to an integer using `int(...)`

This key function is then passed to `sorted(..., key=...)`.

In [None]:
order = orders[0]
int(order.split(',')[2])

### Sort records by `order_customer_id` (string key vs integer key)

- If we use `order.split(',')[2]` (string), the ids are compared character by character, so `'11599'` sorts before `'256'`.
- If we use `int(order.split(',')[2])`, the ids are compared as numbers, so `256` sorts before `11599`.

In [None]:
sorted(orders, key=lambda order: order.split(',')[2])

In [None]:
sorted(orders, key=lambda order: int(order.split(',')[2]))
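To compare the two keys side by side, this sketch sorts a shortened four-record sample both ways and prints only the `order_customer_id` of each result.

```python
# Shortened sample of the orders list.
orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
]

by_string = sorted(orders, key=lambda order: order.split(',')[2])
by_int = sorted(orders, key=lambda order: int(order.split(',')[2]))

# Show only the customer ids to make the ordering visible.
ids_by_string = [order.split(',')[2] for order in by_string]
ids_by_int = [order.split(',')[2] for order in by_int]

print(ids_by_string)  # character-by-character order
print(ids_by_int)     # numeric order
```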

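## Close the file handle

Close the handle opened earlier once the file content is no longer needed.

In [None]: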
orders_file.close()