
# Advanced Python Sets for Data Analysis
This notebook explores **Python sets** from intermediate to professional level, with a strong focus on **data analysis** workflows.

Topics covered:
- Core operations and key methods
- Set algebra for data comparison
- Advanced techniques and idioms
- Professional tips for real-world data projects



## Introduction to Sets
A set is an **unordered collection of unique, hashable elements**.  
Sets are particularly useful in data analysis for tasks like deduplication, membership tests, and set algebra operations.


In [1]:
# Creating sets
numbers = {1, 2, 3, 4, 4, 5}
print("Unique set:", numbers)

# Adding and removing elements
numbers.add(6)
numbers.discard(2)
print("After add/discard:", numbers)

# Membership test
print("Is 3 in set?", 3 in numbers)
print("Is 10 not in set?", 10 not in numbers)

Unique set: {1, 2, 3, 4, 5}
After add/discard: {1, 3, 4, 5, 6}
Is 3 in set? True
Is 10 not in set? True
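
Two details of the basics above are easy to trip over: `{}` creates an empty dict rather than an empty set, and `remove()` raises `KeyError` for a missing element whereas `discard()` does not. A minimal sketch, with illustrative variable names:

```python
# {} is an empty dict; use set() to create an empty set
empty = set()
print(type({}).__name__, type(empty).__name__)

numbers = {1, 3, 4, 5, 6}

# discard() silently ignores an element that is not present
numbers.discard(99)

# remove() raises KeyError when the element is absent
try:
    numbers.remove(99)
    removed = True
except KeyError:
    removed = False

print("Removed 99?", removed)
```

Prefer `discard()` when absence is acceptable, and `remove()` when a missing element indicates a bug that should surface.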



## Key Set Methods
Sets provide methods for combining and comparing collections. Most return a new set; `update()` modifies the set in place.


In [3]:
data = {1, 2, 3}
other = {3, 4, 5}

# union() - combine elements from both sets
print("Union:", data.union(other))

# intersection() - elements common to both sets
print("Intersection:", data.intersection(other))

# difference() - elements in data but not in other
print("Difference:", data.difference(other))

# symmetric_difference() - elements in either set but not both
print("Symmetric difference:", data.symmetric_difference(other))

# update() - in-place union
data.update({6, 7})
print("After update:", data)

Union: {1, 2, 3, 4, 5}
Intersection: {3}
Difference: {1, 2}
Symmetric difference: {1, 2, 4, 5}
After update: {1, 2, 3, 6, 7}
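
The methods above also have operator forms, and sets support subset and disjointness tests that are often useful when comparing data. The sketch below reuses the `data` and `other` values from this section; the remaining names are illustrative.

```python
data = {1, 2, 3}
other = {3, 4, 5}

# Operator forms of the four core operations
print(data | other)   # union
print(data & other)   # intersection
print(data - other)   # difference
print(data ^ other)   # symmetric difference

# Methods accept any iterable; operators require sets on both sides
union_from_list = data.union([3, 4, 5])
try:
    data | [3, 4, 5]
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

# Subset, superset, and disjointness tests
is_subset = {1, 2} <= data          # same as {1, 2}.issubset(data)
is_superset = data >= {1, 2}        # same as data.issuperset({1, 2})
no_overlap = data.isdisjoint({8, 9})
print(is_subset, is_superset, no_overlap)
```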



## Sets in Data Analysis
Sets are powerful for comparing categories, cleaning duplicates, and performing quick lookups.


In [4]:
# Deduplication example
raw_values = [10, 20, 20, 30, 30, 30]
unique_values = set(raw_values)
print("Unique values:", unique_values)

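# Category comparison: expected columns vs. columns actually present
pandas_cols = {'Name', 'Age', 'Score', 'City'}      # columns the analysis expects
dataset_cols = {'Name', 'Age', 'Score', 'Country'}  # columns found in the dataset

# Expected but absent from the dataset
missing_cols = pandas_cols.difference(dataset_cols)
# Present in the dataset but not expected
extra_cols = dataset_cols.difference(pandas_cols)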

print("Missing columns:", missing_cols)
print("Extra columns:", extra_cols)

Unique values: {10, 20, 30}
Missing columns: {'City'}
Extra columns: {'Country'}
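
Deduplicating with `set()` discards the original order of the values. When first-seen order matters, one common alternative is `dict.fromkeys`, which relies on dict insertion order (guaranteed since Python 3.7). A small sketch with made-up values:

```python
raw_values = [30, 10, 20, 10, 30, 20]

# set() removes duplicates but makes no promise about order
unique_unordered = set(raw_values)

# dict keys are unique and keep first-seen insertion order
unique_ordered = list(dict.fromkeys(raw_values))
print("First-seen order:", unique_ordered)

# sorted() gives a deterministic order when original order is irrelevant
unique_sorted = sorted(unique_unordered)
print("Sorted:", unique_sorted)
```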



## Advanced Techniques
Set comprehensions, frozensets, and hash-based membership tests cover most high-performance uses of sets in data transformations.


In [None]:
# Set comprehension
even_squares = {x**2 for x in range(10) if x % 2 == 0}
print("Even squares:", even_squares)

# Frozen sets - immutable sets useful as dictionary keys
frozen_a = frozenset([1, 2, 3])
frozen_b = frozenset([3, 4, 5])
print("Frozen set union:", frozen_a | frozen_b)

# Using sets for fast membership in large datasets
large_list = list(range(10**6))
large_set = set(large_list)
print("Fast membership check:", 999999 in large_set)

Even squares: {0, 64, 4, 36, 16}
Frozen set union: frozenset({1, 2, 3, 4, 5})
Fast membership check: True
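
The cell above mentions frozensets as dictionary keys without showing one. The following sketch counts unordered item pairs by using `frozenset` keys in a `collections.Counter`; the `pairs` data is hypothetical.

```python
from collections import Counter

# Hypothetical data: item pairs where order carries no meaning
pairs = [("tea", "milk"), ("milk", "tea"), ("tea", "sugar")]

# frozenset is hashable, so ("tea", "milk") and ("milk", "tea") share one key
pair_counts = Counter(frozenset(pair) for pair in pairs)
tea_milk = pair_counts[frozenset({"tea", "milk"})]
print("tea+milk count:", tea_milk)

# A regular set is mutable and therefore unhashable
try:
    bad = {{"tea", "milk"}: 1}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
print("Regular set usable as key?", set_is_hashable)
```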



## Professional Tips
- Use sets when you need **fast membership tests**: lookups are O(1) on average, compared with O(n) for a list.
- When a set must be hashable, for example as a dictionary key or as an element of another set, use **frozenset**.
- Leverage set operations to simplify logic when comparing large categorical datasets.
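
To see the membership-test tip in practice, the rough sketch below times the same lookup against a list and a set with `timeit`. The size `n` and repetition count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # last element: the slowest case for a linear list scan

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

Building the set costs O(n) once, so the conversion pays off when the same collection is queried many times.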
