# Tuples in Python for Data Analysis — Intermediate to Advanced

This notebook focuses on practical and professional uses of **tuples** within data analysis workflows.

Contents:
- Tuples recap and immutability implications
- Packing / unpacking in real data scenarios
- Tuples as keys for grouping and aggregation
- MultiIndex in pandas built from tuples
- NamedTuple and structured records
- Performance and memory considerations
- Practical patterns: deduplication, sliding windows, caching
- Advanced tips and when *not* to use tuples


In [3]:
# Standard imports and configuration constants
import sys
import numpy as np
import pandas as pd
from collections import namedtuple, Counter
from functools import lru_cache

RANDOM_SEED = 42
SAMPLE_N = 12
np.random.seed(RANDOM_SEED)

print('Python version:', sys.version.split()[0])


Python version: 3.12.10


## 1. Quick recap and immutability implications

- Tuples are immutable ordered sequences: good for fixed-schema records.
- They can contain mutable objects (e.g., lists): the references stored in the tuple are fixed, but the objects they point to can still change.
- Tuples are hashable if all their elements are hashable, so they can be used as dictionary keys or set elements.


In [4]:
# Basic tuple examples
single = (1,)
record = ("user_001", "US", 34.5)
nested = (1, [2, 3], {"k": "v"})  # mutable element inside tuple

print('single:', single)
print('record:', record)
print('nested before modification:', nested)
nested[1].append(99)
print('nested after modification:', nested)

# Attempt to change tuple element (should raise TypeError)
try:
    record[0] = 'user_002'
except TypeError as err:
    print('Immutable error:', err)


single: (1,)
record: ('user_001', 'US', 34.5)
nested before modification: (1, [2, 3], {'k': 'v'})
nested after modification: (1, [2, 3, 99], {'k': 'v'})
Immutable error: 'tuple' object does not support item assignment
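
The hashability rule in the recap can be checked directly. The following is a minimal sketch; the names `flat`, `with_list`, and `is_hashable` are illustrative and not part of the original notebook.

```python
# A tuple of hashable items can be hashed and used as a dict key.
flat = ("US", 2020)
lookup = {flat: 150}

# A tuple that contains a list cannot be hashed,
# even though the tuple itself is immutable.
with_list = ("US", [2020, 2021])


def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print("flat hashable:", is_hashable(flat))
print("with_list hashable:", is_hashable(with_list))
```

Immutability and hashability are separate properties: a tuple qualifies as a key only when every element it holds is itself hashable.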


## 2. Packing / Unpacking in data pipelines

- Unpacking is especially useful when iterating over rows or when functions return multiple values.
- Use extended unpacking (`first, *rest, last`) to capture variable-length segments.


In [3]:
# Simulate a row-wise CSV read that yields tuples
rows = [
    ('id1', 'Alice', 34, 'NY'),
    ('id2', 'Bob', 28, 'CA'),
    ('id3', 'Carol', 41, 'TX'),
]

for uid, name, age, state in rows:
    print(f'User {uid}: {name} ({age}) from {state}')

# Function returning multiple values as tuple
def min_max(seq):
    return (min(seq), max(seq))

vals = [3, 7, 1, 9]
low, high = min_max(vals)
print('min, max ->', low, high)

# Extended unpacking example
a_tuple = (0, 1, 2, 3, 4, 5)
first, *middle, last = a_tuple
print('first:', first, 'middle:', middle, 'last:', last)


User id1: Alice (34) from NY
User id2: Bob (28) from CA
User id3: Carol (41) from TX
min, max -> 1 9
first: 0 middle: [1, 2, 3, 4] last: 5
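
Extended unpacking also works in a `for` loop header, which helps when rows share a fixed prefix but vary in length. The sketch below uses made-up data (`score_rows`) and is one possible pattern rather than a prescribed one.

```python
# Rows with a fixed prefix (id, name) and a variable number of trailing scores.
# The data here is invented for illustration.
score_rows = [
    ("id1", "Alice", 90, 85),
    ("id2", "Bob", 70),
    ("id3", "Carol", 88, 92, 79),
]

summaries = []
for uid, name, *scores in score_rows:
    # scores is always a list, even when only one value is left over
    summaries.append((uid, name, len(scores), sum(scores) / len(scores)))

# Nested unpacking: enumerate yields (index, row), and the row is unpacked in place.
for i, (uid, name, n_scores, mean_score) in enumerate(summaries, start=1):
    print(f"{i}. {name} ({uid}): {n_scores} scores, mean {mean_score:.1f}")
```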


## 3. Tuples as keys for grouping and aggregation

- When records have a natural composite key (e.g., (country, year)), tuples provide a simple hashable key.
- This is useful for dictionary-based aggregations and for building pandas MultiIndex structures.


In [4]:
# Example: aggregate sales by (region, year) using tuples as keys
sales_data = [
    ('US', 2020, 100),
    ('US', 2021, 150),
    ('CA', 2020, 80),
    ('US', 2020, 50),
    ('CA', 2021, 120),
]

agg = {}
for region, year, value in sales_data:
    key = (region, year)
    agg[key] = agg.get(key, 0) + value

print('Aggregated dict with tuple keys:')
for k, v in agg.items():
    print(k, '->', v)

# Convert to DataFrame with MultiIndex
index = pd.MultiIndex.from_tuples(list(agg.keys()), names=['region', 'year'])
df_agg = pd.DataFrame({'sales': list(agg.values())}, index=index)
print('\nPandas DataFrame with MultiIndex:')
print(df_agg)


Aggregated dict with tuple keys:
('US', 2020) -> 150
('US', 2021) -> 150
('CA', 2020) -> 80
('CA', 2021) -> 120

Pandas DataFrame with MultiIndex:
             sales
region year       
US     2020    150
       2021    150
CA     2020     80
       2021    120
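
Because tuples compare element by element, tuple keys also sort and filter naturally. The following sketch repeats the aggregation with `collections.defaultdict` (a swapped-in alternative to `dict.get`) and then sorts and filters by key; the names `totals`, `sorted_totals`, and `us_only` are illustrative.

```python
from collections import defaultdict

sales_data = [
    ("US", 2020, 100),
    ("US", 2021, 150),
    ("CA", 2020, 80),
    ("US", 2020, 50),
    ("CA", 2021, 120),
]

# defaultdict(int) removes the need for agg.get(key, 0)
totals = defaultdict(int)
for region, year, value in sales_data:
    totals[(region, year)] += value

# Tuples compare lexicographically, so this orders by region, then year.
sorted_totals = sorted(totals.items())
for key, value in sorted_totals:
    print(key, "->", value)

# Filter on one component of the composite key.
us_only = {key: v for key, v in totals.items() if key[0] == "US"}
print("US only:", us_only)
```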


## 4. MultiIndex operations (from tuples)

- MultiIndex enables powerful slicing and reshaping operations in pandas.
- Tuples are a natural representation for MultiIndex keys.


In [5]:
# Example DataFrame with MultiIndex built from tuples
tuples = [('US', 2020), ('US', 2021), ('CA', 2020), ('CA', 2021)]
data = [150, 200, 80, 120]
mi = pd.MultiIndex.from_tuples(tuples, names=['region', 'year'])
df = pd.DataFrame({'sales': data}, index=mi)

print('Full DataFrame:')
print(df)

# Select by first level (region)
print('\nSelect region US:')
print(df.loc['US'])

# Cross-section (xs) by second level
print('\nCross-section for year 2020:')
print(df.xs(2020, level='year'))

# Unstack to reshape
print('\nUnstacked (year as columns):')
print(df.unstack(level='year'))


Full DataFrame:
             sales
region year       
US     2020    150
       2021    200
CA     2020     80
       2021    120

Select region US:
      sales
year       
2020    150
2021    200

Cross-section for year 2020:
        sales
region       
US        150
CA         80

Unstacked (year as columns):
       sales     
year    2020 2021
region           
CA        80  120
US       150  200
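
A full tuple can also address a single row of a MultiIndex. The sketch below rebuilds the same small frame so it runs on its own, then shows a scalar lookup, a membership test, and the conversion of the index back to plain tuples; `us_2021` and `keys` are illustrative names.

```python
import pandas as pd

tuples = [("US", 2020), ("US", 2021), ("CA", 2020), ("CA", 2021)]
mi = pd.MultiIndex.from_tuples(tuples, names=["region", "year"])
df = pd.DataFrame({"sales": [150, 200, 80, 120]}, index=mi)

# A full tuple key selects one row; adding the column label returns a scalar.
us_2021 = df.loc[("US", 2021), "sales"]
print("US 2021 sales:", us_2021)

# Membership tests against the index also take tuples.
print("('CA', 2020) present:", ("CA", 2020) in df.index)
print("('MX', 2020) present:", ("MX", 2020) in df.index)

# The index converts back to a plain list of tuples.
keys = df.index.tolist()
print(keys)
```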


## 5. NamedTuple and Structured Records

- `collections.namedtuple` and `typing.NamedTuple` provide tuple-like records with named fields — readable and still lightweight.
- They convert easily to dictionaries or DataFrames for analysis.


In [6]:
# Using namedtuple for structured rows
User = namedtuple('User', ['user_id', 'country', 'age'])
users = [User('u1', 'US', 34), User('u2', 'CA', 28), User('u3', 'US', 41)]
print('First user:', users[0])
print('Access by attribute:', users[0].age)

# Convert namedtuple list to DataFrame
df_users = pd.DataFrame(users)
print('\nDataFrame from namedtuple list:')
print(df_users)

# typing.NamedTuple example (Python 3.6+)
from typing import NamedTuple

class Record(NamedTuple):
    id: str
    value: float

recs = [Record('r1', 1.2), Record('r2', 3.4)]
print('\nRecord NamedTuple:', recs[0])


First user: User(user_id='u1', country='US', age=34)
Access by attribute: 34

DataFrame from namedtuple list:
  user_id country  age
0      u1      US   34
1      u2      CA   28
2      u3      US   41

Record NamedTuple: Record(id='r1', value=1.2)
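
Named tuples come with a few helper members that cover the conversions mentioned above. A minimal sketch, reusing the `User` definition from this section; `as_dict` and `older` are illustrative names.

```python
from collections import namedtuple

User = namedtuple("User", ["user_id", "country", "age"])
u = User("u1", "US", 34)

# Field names as a tuple of strings
print(User._fields)

# Dictionary keyed by field name
as_dict = u._asdict()
print(as_dict)

# _replace returns a new instance; u itself is unchanged
older = u._replace(age=35)
print(older, "| original age:", u.age)

# Named tuples still unpack and compare like plain tuples.
user_id, country, age = u
print(u == ("u1", "US", 34))
```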


## 6. Deduplication and Set Operations using tuples

- Lists are unhashable, so a list of lists cannot be passed to `set()` directly; convert the inner lists to tuples to deduplicate with a set. Note that a set does not preserve the original row order.


In [7]:
# Deduplicate list-of-lists by converting to tuple
rows = [[1, 2], [1, 2], [2, 3], [3, 4]]
unique = list(map(list, set(tuple(r) for r in rows)))
print('Unique rows:', unique)

# Use Counter with tuple keys for frequency counts
freq = Counter(tuple(r) for r in rows)
print('Frequencies:', freq)


Unique rows: [[2, 3], [1, 2], [3, 4]]
Frequencies: Counter({(1, 2): 2, (2, 3): 1, (3, 4): 1})
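
The set-based approach returns rows in arbitrary order, as the output above shows. When first-seen order matters, one option is to use dictionary keys instead of a set. This is a sketch of that alternative, with `unique_in_order` as an illustrative name.

```python
rows = [[1, 2], [1, 2], [2, 3], [3, 4], [2, 3]]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7),
# so each row appears once, at the position where it was first seen.
unique_in_order = [list(t) for t in dict.fromkeys(tuple(r) for r in rows)]
print("Unique rows, original order:", unique_in_order)
```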


## 7. Sliding windows (pairwise / n-wise) returning tuples

- Useful for feature engineering in time series or sequential data (n-grams, rolling features).


In [8]:
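from itertools import islice

def sliding_window(seq, n=2):
    # One lazy iterator per offset; islice avoids the copies that seq[i:] would make.
    # seq must be a re-iterable sequence such as a list or tuple.
    iters = [islice(seq, i, None) for i in range(n)]
    return zip(*iters)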

data = [10, 20, 30, 40]
print('Pairwise windows (as tuples):')
for win in sliding_window(data, n=2):
    print(win)

# More flexible window using deque (memory efficient)
from collections import deque
def sliding(seq, n=3):
    dq = deque(maxlen=n)
    for item in seq:
        dq.append(item)
        if len(dq) == n:
            yield tuple(dq)

print('\nTri-grams (tuples):', list(sliding(data, n=3)))


Pairwise windows (as tuples):
(10, 20)
(20, 30)
(30, 40)

Tri-grams (tuples): [(10, 20, 30), (20, 30, 40)]
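
Window tuples feed directly into simple sequential features. The sketch below redefines the deque-based generator so it runs on its own, then derives step-to-step changes and a moving average; `series`, `deltas`, and `moving_avg` are illustrative names with made-up values.

```python
from collections import deque


def sliding(seq, n=3):
    """Yield consecutive length-n windows of seq as tuples."""
    window = deque(maxlen=n)
    for item in seq:
        window.append(item)
        if len(window) == n:
            yield tuple(window)


series = [10, 20, 30, 40]  # made-up sample values

# Step-to-step change, unpacking each pairwise window.
deltas = [curr - prev for prev, curr in sliding(series, n=2)]
print("deltas:", deltas)

# Moving average over windows of three.
moving_avg = [sum(w) / len(w) for w in sliding(series, n=3)]
print("moving average:", moving_avg)
```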


## 8. Tuples, caching, and memoization

- `lru_cache` builds its cache key from the call arguments, so every argument must be hashable. Lists are not; convert them to tuples before calling a cached function.
- Demonstration using `lru_cache` with tuple conversion.

In [9]:
@lru_cache(maxsize=128)
def expensive_operation(args_tuple):
    # args_tuple must be a tuple of hashable items
    print('computing for', args_tuple)
    return sum(args_tuple) * 0.5

print('First call:')
print(expensive_operation((1, 2, 3)))
print('Second call (cached):')
print(expensive_operation((1, 2, 3)))

# If you have a list, convert to tuple before caching
lst = [4, 5, 6]
print('From list via tuple:', expensive_operation(tuple(lst)))


First call:
computing for (1, 2, 3)
3.0
Second call (cached):
3.0
computing for (4, 5, 6)
From list via tuple: 7.5
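
To keep callers from having to remember the conversion, one option is a thin wrapper that converts to a tuple before delegating to the cached function. This is a sketch of that design; `_half_sum_cached` and `half_sum` are hypothetical names, not part of the original notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def _half_sum_cached(values):
    # values must be hashable, so it is expected to be a tuple
    return sum(values) * 0.5


def half_sum(values):
    """Accept any iterable; convert to a tuple so the cached function can hash it."""
    return _half_sum_cached(tuple(values))


half_sum([1, 2, 3])  # miss: computed
half_sum([1, 2, 3])  # hit: same tuple key
half_sum((1, 2, 3))  # hit: tuples and converted lists share a key

info = _half_sum_cached.cache_info()
print(info)
```

`cache_info()` reports hits and misses, which is a quick way to confirm that the conversion produces stable keys.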


## 9. Memory and performance considerations

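- In CPython, a tuple is slightly smaller than a list holding the same items: its fixed per-object overhead is lower and it never reserves spare capacity for growth. `sys.getsizeof` measures only the container, not the elements it references, so the saving matters mainly when you hold many small containers.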
- For large numeric collections, prefer NumPy arrays or pandas structures for vectorized operations.


In [16]:
import sys
LIST_SAMPLE = list(range(10000))
TUPLE_SAMPLE = tuple(LIST_SAMPLE)
print('size list:', sys.getsizeof(LIST_SAMPLE))
print('size tuple:', sys.getsizeof(TUPLE_SAMPLE))

# Timing element access (should be similar)
import timeit
list_access = timeit.timeit('LIST_SAMPLE[5000]', globals=globals(), number=100000)
tuple_access = timeit.timeit('TUPLE_SAMPLE[5000]', globals=globals(), number=100000)
print('list access time:', list_access)
print('tuple access time:', tuple_access)


size list: 80056
size tuple: 80040
list access time: 0.007001799996942282
tuple access time: 0.009608699940145016
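
The size gap can widen for lists that were grown incrementally, because a list may keep spare capacity after repeated `append` calls. The sketch below compares the three cases; exact byte counts are specific to the CPython build, so none are stated here, and the variable names are illustrative.

```python
import sys

n = 10_000
as_tuple = tuple(range(n))
exact_list = list(range(n))

# A list grown by append may hold spare capacity beyond its current length.
grown_list = []
for i in range(n):
    grown_list.append(i)

print("tuple:      ", sys.getsizeof(as_tuple))
print("list(range):", sys.getsizeof(exact_list))
print("appended:   ", sys.getsizeof(grown_list))
```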


## 10. When not to use tuples (guidance)

- If you need to mutate contents frequently, use lists or NumPy arrays.
- For labeled data with many columns and heavy operations, prefer pandas DataFrame or named structures.
- Tuples are best for small, fixed-size heterogeneous records, composite keys, and lightweight immutable records.


In [11]:
# Final practical example: building a MultiIndex DataFrame from raw tuple records
raw = [
    ('region1', 'productA', 2020, 100),
    ('region1', 'productA', 2021, 150),
    ('region1', 'productB', 2020, 80),
    ('region2', 'productA', 2020, 60),
]
# Create DataFrame
df_raw = pd.DataFrame(raw, columns=['region', 'product', 'year', 'sales'])

# Build a composite (region, product) tuple key and group on it
df_raw['key'] = list(zip(df_raw['region'], df_raw['product']))
grouped = df_raw.groupby('key')['sales'].sum()
print('Grouped by composite key (tuple):')
print(grouped)

# Convert grouping index (tuples) into MultiIndex DataFrame
mi = pd.MultiIndex.from_tuples(grouped.index, names=['region', 'product'])
df_grouped = pd.DataFrame({'sales': grouped.values}, index=mi)
print('\nMultiIndex DataFrame:')
print(df_grouped)


Grouped by composite key (tuple):
key
(region1, productA)    250
(region1, productB)     80
(region2, productA)     60
Name: sales, dtype: int64

MultiIndex DataFrame:
                  sales
region  product        
region1 productA    250
        productB     80
region2 productA     60
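
The explicit tuple column shows the mechanics, but pandas can also group on several columns at once and produce the MultiIndex directly. A sketch of that shorter route, using the same records:

```python
import pandas as pd

raw = [
    ("region1", "productA", 2020, 100),
    ("region1", "productA", 2021, 150),
    ("region1", "productB", 2020, 80),
    ("region2", "productA", 2020, 60),
]
df_raw = pd.DataFrame(raw, columns=["region", "product", "year", "sales"])

# Grouping by a list of columns yields a MultiIndex without a helper key column.
grouped = df_raw.groupby(["region", "product"])["sales"].sum()
print(grouped)

# Each index entry is still a (region, product) tuple.
print(grouped.index.tolist())
```

This avoids storing Python tuples in an object-dtype column, which is usually preferable once the data is already in a DataFrame.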


## Professional tips
- Use tuples for composite immutable keys (grouping, indexing).
- Convert to NumPy or pandas structures for heavy numeric workloads.
- Prefer NamedTuple for readable small records that benefit from attribute access.
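- JSON has no tuple type: `json.dumps` writes tuples as arrays, so they load back as lists, and tuple dictionary keys are rejected outright. Convert tuple keys to strings or flatten them into records before serializing.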
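
The JSON behaviour can be seen in a short round trip. The sketch below uses sample data and illustrative names (`agg`, `records`, `restored`); flattening into records is one possible encoding, not the only one.

```python
import json

agg = {("US", 2020): 150, ("CA", 2021): 120}  # sample data for illustration

# Tuples used as values serialize as JSON arrays and load back as lists.
round_tripped = json.loads(json.dumps(("US", 2020)))
print(round_tripped)

# Tuple keys are rejected by json.dumps, so flatten them into records first.
try:
    json.dumps(agg)
except TypeError as err:
    print("Cannot serialize tuple keys:", err)

records = [{"region": r, "year": y, "sales": v} for (r, y), v in agg.items()]
payload = json.dumps(records)

# Rebuild the tuple-keyed dict after loading.
restored = {(d["region"], d["year"]): d["sales"] for d in json.loads(payload)}
print(restored)
```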
