# Lab 01: Data Collection and Preprocessing

This notebook demonstrates fundamental data collection and preprocessing techniques using Python and pandas.

---
## Step 1: Load the Dataset

In this step, we load the retail transactions CSV file using pandas and inspect the first few rows to understand the data structure.

In [None]:
# Import pandas library
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('../data/Retail_Transactions_Dataset.csv')

# Display the first 3 rows
df.head(3)
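Beyond `head()`, a few more checks give a fuller picture of the structure. The following is a minimal sketch, not part of the original lab: the inline rows are invented stand-ins for the real file, and the use of `parse_dates` assumes the `Date` column is stored in a format pandas can parse.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the real CSV; all values are invented.
sample_csv = io.StringIO(
    "Date,Product,City,Total_Items,Total_Cost\n"
    "2023-01-05,Milk,Boston,3,12.50\n"
    "2023-01-06,Bread,Chicago,1,4.25\n"
    "2023-01-07,Eggs,Boston,2,7.00\n"
)

# read_csv does not parse dates by default, so the Date column is requested explicitly
sample_df = pd.read_csv(sample_csv, parse_dates=["Date"])

# Number of rows and columns
n_rows, n_cols = sample_df.shape

# Data type pandas inferred for each column
column_types = sample_df.dtypes

# Count of missing values per column
missing_per_column = sample_df.isna().sum()

print(f"{n_rows} rows x {n_cols} columns")
print(column_types)
print(missing_per_column)
```

The same three checks can be run on `df` once the real file is loaded.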

---
## Step 2: Data Structure Choice

### Why pandas DataFrame?

We use a **pandas DataFrame** to store our retail transactions data for the following reasons:

| Criteria | DataFrame Advantage |
|----------|---------------------|
| **Tabular format** | CSV data is naturally row-column structured; DataFrames mirror this exactly |
| **Mixed data types** | Our dataset contains strings (`Product`, `City`), numbers (`Total_Cost`, `Total_Items`), and dates (`Date`); DataFrames handle heterogeneous columns efficiently |
| **Built-in methods** | Pandas provides optimized functions for filtering, grouping, aggregating, and cleaning data |
| **Memory efficiency** | Numeric columns are backed by typed NumPy arrays by default, which store values compactly and enable fast vectorized operations |
| **Ecosystem integration** | Seamless compatibility with visualization libraries (matplotlib, seaborn) and ML frameworks (scikit-learn) |
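The mixed-type and vectorization points in the table can be illustrated with a small sketch. The column names come from the dataset described above, but the rows are invented for illustration and do not reflect the real file.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the dataset
transactions = pd.DataFrame({
    "Product": ["Milk", "Bread", "Eggs"],
    "City": ["Boston", "Chicago", "Boston"],
    "Total_Items": [3, 1, 2],
    "Total_Cost": [12.50, 4.25, 7.00],
    "Date": pd.to_datetime(["2023-01-05", "2023-01-06", "2023-01-07"]),
})

# Each column keeps its own dtype (text, integer, float, datetime)
print(transactions.dtypes)

# Vectorized arithmetic: one expression over whole columns, no explicit loop
cost_per_item = transactions["Total_Cost"] / transactions["Total_Items"]

# Built-in filtering with a boolean mask
boston_rows = transactions[transactions["City"] == "Boston"]

print(cost_per_item)
print(boston_rows)
```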

### Alternatives Considered

- **Python lists/dicts**: No built-in support for column operations or missing data handling
- **NumPy arrays**: Require homogeneous data types; not ideal for mixed-type tabular data
- **SQL database**: Overkill for a single-file analysis; adds setup complexity
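To make the comparison concrete, here is a rough sketch of the same "total cost per city" task with each in-memory alternative. The rows are invented, and the variable names are illustrative only.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical rows; values are invented for illustration
rows = [
    {"City": "Boston", "Total_Cost": 12.50},
    {"City": "Chicago", "Total_Cost": 4.25},
    {"City": "Boston", "Total_Cost": 7.00},
]

# Lists/dicts: grouping has to be written by hand
totals_by_city = defaultdict(float)
for row in rows:
    totals_by_city[row["City"]] += row["Total_Cost"]

# NumPy: mixing text and numbers in one array coerces everything to strings
mixed = np.array(["Boston", 12.50])
print(mixed.dtype.kind)  # "U" means a Unicode string dtype

# pandas: the same grouping is a single expression
df_totals = pd.DataFrame(rows).groupby("City")["Total_Cost"].sum()

print(dict(totals_by_city))
print(df_totals)
```

The manual loop works for one aggregate, but each additional grouping or missing-value rule adds more hand-written code, which is the gap the DataFrame methods fill.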

**Conclusion**: For exploratory analysis and preprocessing of a single structured CSV file, a pandas DataFrame is the best fit among these options.