# Python Data Structures for Data Analysis - Part 2

## Week 1, Day 2 (Thursday) - April 10th, 2025

### Overview
Continuing from Part 1, this notebook focuses on dictionaries, sets, and practical applications of Python data structures for data analysis tasks.

## 3. Dictionaries in Python

Dictionaries are collections of key-value pairs that allow fast lookups by key. Before Python 3.7 the language did not guarantee any ordering of the keys; from Python 3.7 onward, dictionaries preserve insertion order.
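
As a small illustration of the ordering behaviour, here is a sketch that assumes Python 3.7 or later. The `inventory` dictionary and its contents are made up for this example.

```python
# Hypothetical example data; assumes Python 3.7+ where insertion order is guaranteed
inventory = {}
inventory["pears"] = 4
inventory["apples"] = 10
inventory["kiwis"] = 7

# Keys come back in the order they were inserted, not in sorted order
key_order = list(inventory)
print(key_order)

# Updating an existing key changes its value but keeps its position
inventory["pears"] = 5
order_after_update = list(inventory)
print(order_after_update)
```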

In [10]:
# Creating dictionaries
empty_dict = {}
ages = {"Alice": 25, "Bob": 30, "Charlie": 35}
mixed_dict = {"name": "John", "age": 28, "is_student": False, "courses": ["Math", "CS"]}


##### Data Structure Constructors
1. List - `list()`
2. Set - `set()`
3. Tuple - `tuple()`
4. Dictionary - `dict()`

An accessor is the syntax used to get a value out of an object.
The two most common accessors are `[]` (lookup by key or position) and `.` (attribute or method access).
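
To make the two accessors concrete, the sketch below uses `[]` to look up an item and `.` to reach a method on an object. The `point` and `labels` names are invented for illustration.

```python
# Hypothetical data for illustration
point = {"x": 3, "y": 4}
labels = ["low", "medium", "high"]

# [] (subscript) looks up an item by key or by position
x_value = point["x"]
last_label = labels[-1]

# . (dot) reaches an attribute or method that belongs to the object
key_names = list(point.keys())
shouted = last_label.upper()

print(x_value, last_label, key_names, shouted)
```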

In [15]:

# Alternative creation method
user_info = dict(name="Jane", email="jane@example.com", age=32)
user_info_edit = {'name':'Jane', 'email':'jane@example.com', 'age':32}
print("Empty dictionary:", empty_dict)
print("Ages dictionary:", ages)
print("Mixed dictionary:", mixed_dict)
print("User info dictionary:", user_info)
print("User info Edit dictionary:", user_info_edit)


Empty dictionary: {}
Ages dictionary: {'Alice': 25, 'Bob': 30, 'Charlie': 35}
Mixed dictionary: {'name': 'John', 'age': 28, 'is_student': False, 'courses': ['Math', 'CS']}
User info dictionary: {'name': 'Jane', 'email': 'jane@example.com', 'age': 32}
User info Edit dictionary: {'name': 'Jane', 'email': 'jane@example.com', 'age': 32}


### Dictionary Operations

In [18]:
# Access values by key
print(f"Alice's age: {ages['Alice']}")
print(f"John's courses: {mixed_dict['courses']}")


Alice's age: 25
John's courses: ['Math', 'CS']


In [20]:
ages


{'Alice': 25, 'Bob': 30, 'Charlie': 35}

In [21]:
ages["David"] = 40

In [25]:
ages.get('bom-boy',0)

0

In [26]:

# Adding and updating values
ages["David"] = 40  # Add new key-value pair
ages["Alice"] = 26  # Update existing value
print("Updated ages:", ages)

# Safely get values with default
print(f"Eve's age: {ages.get('Eve', 'Not found')}")

# Check if key exists
if "Bob" in ages:
    print(f"Bob's age is {ages['Bob']}")

# Removing items
removed_age = ages.pop("Charlie")  # Remove and return
print(f"Removed Charlie's age: {removed_age}")
print("Ages after removal:", ages)

# Clear all items
user_info.clear()
print("Cleared user info:", user_info)

Updated ages: {'Alice': 26, 'Bob': 30, 'Charlie': 35, 'David': 40}
Eve's age: Not found
Bob's age is 30
Removed Charlie's age: 35
Ages after removal: {'Alice': 26, 'Bob': 30, 'David': 40}
Cleared user info: {}
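
The operations above all assume the key exists, except for `get()`. A minimal sketch of what happens with a missing key, using a hypothetical `ages_demo` copy so the notebook's own `ages` is left alone:

```python
# Hypothetical copy of the ages data, kept separate from the notebook's variable
ages_demo = {"Alice": 26, "Bob": 30, "David": 40}

# [] raises KeyError when the key is missing
try:
    ages_demo["Eve"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# pop() accepts a default, so removing a missing key does not have to raise
removed = ages_demo.pop("Eve", None)

# setdefault() returns the existing value, or inserts the default if the key is absent
bob_age = ages_demo.setdefault("Bob", 0)
eve_age = ages_demo.setdefault("Eve", 22)

print(lookup_failed, removed, bob_age, eve_age)
print(ages_demo)
```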


### Dictionary Methods and Iteration

In [27]:
# Sample dictionary
student = {
    "name": "Alex Johnson",
    "id": "AJ1234",
    "courses": ["Data Science", "Machine Learning", "Python Programming"],
    "grades": {"Data Science": 95, "Machine Learning": 89, "Python Programming": 92},
    "graduation_year": 2026
}

# Get all keys
keys = student.keys()
print("Keys:", list(keys))

# Get all values
values = student.values()
print("Values:", list(values))

# Get all key-value pairs as tuples
items = student.items()
print("Items:", list(items))

# Iterate through keys
print("\nStudent information:")
for key in student:
    print(f"{key}: {student[key]}")

# Iterate through key-value pairs
print("\nGrades:")
for course, grade in student["grades"].items():
    print(f"{course}: {grade}")

# Update with another dictionary
student.update({"email": "alex.j@example.edu", "graduation_year": 2025})
print("\nUpdated student info:", student)

Keys: ['name', 'id', 'courses', 'grades', 'graduation_year']
Values: ['Alex Johnson', 'AJ1234', ['Data Science', 'Machine Learning', 'Python Programming'], {'Data Science': 95, 'Machine Learning': 89, 'Python Programming': 92}, 2026]
Items: [('name', 'Alex Johnson'), ('id', 'AJ1234'), ('courses', ['Data Science', 'Machine Learning', 'Python Programming']), ('grades', {'Data Science': 95, 'Machine Learning': 89, 'Python Programming': 92}), ('graduation_year', 2026)]

Student information:
name: Alex Johnson
id: AJ1234
courses: ['Data Science', 'Machine Learning', 'Python Programming']
grades: {'Data Science': 95, 'Machine Learning': 89, 'Python Programming': 92}
graduation_year: 2026

Grades:
Data Science: 95
Machine Learning: 89
Python Programming: 92

Updated student info: {'name': 'Alex Johnson', 'id': 'AJ1234', 'courses': ['Data Science', 'Machine Learning', 'Python Programming'], 'grades': {'Data Science': 95, 'Machine Learning': 89, 'Python Programming': 92}, 'graduation_year': 2025, 'email': 'alex.j@example.edu'}
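
Note that `update()` modifies the dictionary in place. When the original should stay untouched, one option is to build a merged copy instead. The sketch below uses made-up `base` and `changes` dictionaries to show two ways of doing that.

```python
# Hypothetical data for illustration
base = {"name": "Alex Johnson", "graduation_year": 2026}
changes = {"email": "alex.j@example.edu", "graduation_year": 2025}

# Unpacking builds a new dict; on key conflicts the right-hand value wins
merged = {**base, **changes}

# copy() followed by update() gives the same result in two steps
merged_copy = base.copy()
merged_copy.update(changes)

print(merged)
print(base)  # the original still has graduation_year 2026
```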

### Dictionary Comprehensions

Similar to list comprehensions, dictionary comprehensions provide a concise way to create dictionaries.
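
A comprehension is shorthand for a loop that fills a dictionary. The sketch below builds the same dictionary both ways; the `words` list is invented for this example.

```python
# Hypothetical input list
words = ["data", "set", "dictionary"]

# Loop version: add an entry for each word longer than three characters
lengths_loop = {}
for word in words:
    if len(word) > 3:
        lengths_loop[word] = len(word)

# Equivalent dictionary comprehension
lengths_comp = {word: len(word) for word in words if len(word) > 3}

print(lengths_loop)
print(lengths_comp)
```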

In [28]:
# Basic syntax: {key_expr: value_expr for item in iterable if condition}

# Squares dictionary {number: square}
squares = {x: x**2 for x in range(1, 11)}
print("Squares dictionary:", squares)


Squares dictionary: {1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}


In [34]:

# Filter by value
even_squares = {x: x**2 for x in range(1, 11) if x % 2 == 0}
print("Even squares:", even_squares)

# Create from two lists
names = ["Alice", "Bob", "Charlie", "David"]
scores = [95, 87, 92, 78]
student_scores = {name: score for name, score in zip(names, scores)}
print("Student scores:", student_scores)

# Transform keys and values
student_grades = {name.upper(): "A" if score >= 90 else "B" if score >= 80 else "C" 
                 for name, score in student_scores.items()}
print("Student grades:", student_grades)

# Word frequency counter
text = "python is powerful python is flexible python is fun"
word_count = {word: text.split().count(word) for word in set(text.split())}
print("Word frequency:", word_count)

Even squares: {2: 4, 4: 16, 6: 36, 8: 64, 10: 100}
Student scores: {'Alice': 95, 'Bob': 87, 'Charlie': 92, 'David': 78}
Student grades: {'ALICE': 'A', 'BOB': 'B', 'CHARLIE': 'A', 'DAVID': 'C'}
Word frequency: {'fun': 1, 'flexible': 1, 'is': 3, 'python': 3, 'powerful': 1}
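
The word frequency comprehension above calls `count()` once per unique word, so it rescans the word list each time. For larger texts, one alternative (not used in the cell above) is `collections.Counter` from the standard library, which tallies in a single pass. The variable names here are chosen for this sketch.

```python
from collections import Counter

sentence = "python is powerful python is flexible python is fun"

# Counter tallies every word in one pass over the list
counts = Counter(sentence.split())

# A missing word returns 0 instead of raising KeyError
missing = counts["java"]

# most_common(n) returns the n highest counts as (word, count) tuples
top_two = counts.most_common(2)

print(counts)
print(missing, top_two)
```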


In [33]:
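ant = [1, 2]
bee = [3, 4]

# Nested loops pair every element of ant with every element of bee (4 combinations)
for a in ant:
    for b in bee:
        print(f" Unzipped: {a*b}")

# zip pairs elements by position: (1, 3) and (2, 4)
for a, b in zip(ant, bee):
    print(f"Zipped: {a*b}")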

 Unzipped: 3
 Unzipped: 4
 Unzipped: 6
 Unzipped: 8
Zipped: 3
Zipped: 8
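
The difference between the two loops above can also be written with `itertools.product`, which generates every combination, alongside `zip`, which pairs by position. This is a sketch for comparison. It also shows that `zip` stops at the shorter input.

```python
from itertools import product

ant = [1, 2]
bee = [3, 4]

# product() yields every (a, b) combination, like the nested loops
all_pairs = [a * b for a, b in product(ant, bee)]

# zip() yields only the positionally matched pairs
matched_pairs = [a * b for a, b in zip(ant, bee)]

# zip() silently stops when the shorter iterable runs out
truncated = list(zip([1, 2, 3], ["a", "b"]))

print(all_pairs)
print(matched_pairs)
print(truncated)
```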


### Nested Dictionaries for Structured Data

Dictionaries can be nested to represent complex data structures.
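
With nested dictionaries, a chain of `[]` lookups raises `KeyError` as soon as any level is missing. One common pattern, sketched here with a hypothetical `record` dictionary, is to chain `get()` calls with empty-dictionary defaults.

```python
# Hypothetical nested record for illustration
record = {
    "C001": {"name": "Alice Brown", "orders": {"ORD123": {"total": 78.50}}}
}

# Chained [] works when every level exists
total = record["C001"]["orders"]["ORD123"]["total"]

# Chained get() with {} defaults returns None instead of raising for a missing customer
missing_total = record.get("C999", {}).get("orders", {}).get("ORD123", {}).get("total")

print(total, missing_total)
```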

In [36]:
# Customer order database
customers = {
    "C001": {
        "name": "Alice Brown",
        "email": "alice@example.com",
        "orders": {
            "ORD123": {
                "date": "2025-01-15",
                "items": ["Product A", "Product B"],
                "total": 78.50
            },
            "ORD127": {
                "date": "2025-02-20",
                "items": ["Product C"],
                "total": 39.99
            }
        }
    },
    "C002": {
        "name": "Bob Smith",
        "email": "bob@example.com",
        "orders": {
            "ORD125": {
                "date": "2025-01-20",
                "items": ["Product B", "Product D"],
                "total": 105.75
            }
        }
    }
}


In [38]:
customers['C001']['orders']['ORD123']['items'][0]

'Product A'

In [39]:

# Access nested data
print(f"Alice's email: {customers['C001']['email']}")
print(f"Bob's order total: {customers['C002']['orders']['ORD125']['total']}")

# Get all order totals for Alice
alice_totals = [order['total'] for order in customers['C001']['orders'].values()]
print(f"Alice's order totals: {alice_totals}")
print(f"Alice's total spending: ${sum(alice_totals):.2f}")

# Calculate total revenue across all customers
total_revenue = 0
for customer_id, customer_data in customers.items():
    for order_id, order_data in customer_data['orders'].items():
        total_revenue += order_data['total']

print(f"Total revenue: ${total_revenue:.2f}")

Alice's email: alice@example.com
Bob's order total: 105.75
Alice's order totals: [78.5, 39.99]
Alice's total spending: $118.49
Total revenue: $224.24
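
Nested data like this is often easier to analyze once it is flattened into one record per order. The sketch below does that with a nested comprehension. `customers_demo` is a trimmed copy of the structure above, and the `rows` layout is an assumption made for this example.

```python
# Trimmed, hypothetical copy of the customers structure
customers_demo = {
    "C001": {"orders": {"ORD123": {"total": 78.50}, "ORD127": {"total": 39.99}}},
    "C002": {"orders": {"ORD125": {"total": 105.75}}},
}

# One flat record per order: the outer loop walks customers, the inner loop walks orders
rows = [
    {"customer_id": cid, "order_id": oid, "total": order["total"]}
    for cid, cdata in customers_demo.items()
    for oid, order in cdata["orders"].items()
]

# Total spending per customer, rounded to cents
spend_per_customer = {
    cid: round(sum(order["total"] for order in cdata["orders"].values()), 2)
    for cid, cdata in customers_demo.items()
}

print(rows)
print(spend_per_customer)
```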


## 4. Sets in Python

Sets are unordered collections of unique elements. They are useful for membership testing, removing duplicates, and mathematical operations like union and intersection.
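
Two consequences of this definition are worth seeing in code: converting to a set loses the original order, and set elements must be hashable. The sketch below uses invented data. `dict.fromkeys` is shown as one way to remove duplicates while keeping first-seen order.

```python
# Hypothetical data with duplicates
readings = ["b", "a", "b", "c", "a"]

# A set removes duplicates but does not promise any particular order
unique_any_order = set(readings)

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(readings))

# Lists are unhashable, so they cannot be set elements
try:
    {[1, 2]}
    list_allowed = True
except TypeError:
    list_allowed = False

# Tuples are hashable (when their contents are), so they can be set elements
pair_set = {(1, 2), (1, 2), (3, 4)}

print(unique_in_order, list_allowed, len(pair_set))
```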

In [40]:
# Creating sets
empty_set = set()  # Note: {} creates an empty dict, not a set
fruits = {"apple", "banana", "cherry"}
numbers = set([1, 2, 2, 3, 3, 3, 4])  # Creating from a list with duplicates

print("Empty set:", empty_set)
print("Fruits set:", fruits)
print("Numbers set (duplicates removed):", numbers)

Empty set: set()
Fruits set: {'cherry', 'apple', 'banana'}
Numbers set (duplicates removed): {1, 2, 3, 4}


### Set Operations

In [None]:
# Adding and removing elements
fruits.add("orange")
print("After adding 'orange':", fruits)

fruits.remove("banana")  # Raises KeyError if not found
print("After removing 'banana':", fruits)

fruits.discard("pear")  # No error if not found
print("After discarding 'pear' (which wasn't in the set):", fruits)

popped = fruits.pop()  # Remove and return an arbitrary element
print(f"Popped element: {popped}, Set after pop: {fruits}")

# Membership testing
print("Is 'cherry' in fruits?", "cherry" in fruits)
print("Is 'banana' in fruits?", "banana" in fruits)
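
The comments in the cell above mention that `remove()` raises `KeyError` for a missing element while `discard()` does not. A minimal sketch of that difference, using a hypothetical `basket` set:

```python
# Hypothetical set for illustration
basket = {"apple", "cherry"}

# remove() raises KeyError when the element is absent
try:
    basket.remove("pear")
    remove_raised = False
except KeyError:
    remove_raised = True

# discard() does nothing when the element is absent
basket.discard("pear")

# pop() on an empty set also raises KeyError
leftovers = set()
try:
    leftovers.pop()
    pop_raised = False
except KeyError:
    pop_raised = True

print(remove_raised, pop_raised, basket)
```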

### Mathematical Set Operations

In [41]:
# Sample sets
set_a = {1, 2, 3, 4, 5}
set_b = {4, 5, 6, 7, 8}

# Union: elements in either set
union_result = set_a | set_b  # Alternative: set_a.union(set_b)
print("Union:", union_result)


Union: {1, 2, 3, 4, 5, 6, 7, 8}


In [42]:

# Intersection: elements in both sets
intersection_result = set_a & set_b  # Alternative: set_a.intersection(set_b)
print("Intersection:", intersection_result)


Intersection: {4, 5}


In [43]:

# Difference: elements in first set but not in second
difference_result = set_a - set_b  # Alternative: set_a.difference(set_b)
print("Difference (A - B):", difference_result)
print("Difference (B - A):", set_b - set_a)


Difference (A - B): {1, 2, 3}
Difference (B - A): {8, 6, 7}


In [44]:

# Symmetric difference: elements in either set but not in both
symmetric_difference = set_a ^ set_b  # Alternative: set_a.symmetric_difference(set_b)
print("Symmetric difference:", symmetric_difference)


Symmetric difference: {1, 2, 3, 6, 7, 8}


In [45]:

# Check if one set is a subset of another
set_c = {1, 2, 3}
print("Is C a subset of A?", set_c.issubset(set_a))  # Alternative: set_c <= set_a
print("Is A a superset of C?", set_a.issuperset(set_c))  # Alternative: set_a >= set_c

Is C a subset of A? True
Is A a superset of C? True
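
A few related checks are not shown above: proper subsets with `<`, `isdisjoint()`, and the fact that the method forms accept any iterable. The sketch below reuses the values of `set_a` and `set_c`. `set_d` is invented for this example.

```python
set_a = {1, 2, 3, 4, 5}
set_c = {1, 2, 3}
set_d = {9, 10}  # hypothetical set with no overlap

# < tests for a proper subset: a subset that is not equal to the other set
is_proper_subset = set_c < set_a
equal_is_subset = set_a <= set_a
equal_is_proper = set_a < set_a

# isdisjoint() is True when the sets share no elements
no_overlap = set_a.isdisjoint(set_d)

# Method forms such as union() accept any iterable, not only sets
union_with_list = set_a.union([6, 7])

print(is_proper_subset, equal_is_subset, equal_is_proper, no_overlap)
print(union_with_list)
```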


### Set Comprehensions

Similar to list and dictionary comprehensions, set comprehensions offer a concise way to create sets.

In [None]:
# Basic syntax: {expression for item in iterable if condition}

# Set of squares
square_set = {x**2 for x in range(1, 11)}
print("Set of squares:", square_set)

# Set of even numbers
even_set = {x for x in range(1, 21) if x % 2 == 0}
print("Set of even numbers:", even_set)

# Extract unique characters from a string
text = "python programming is fun"
unique_chars = {char for char in text if char.isalpha()}
print("Unique characters:", unique_chars)
print(f"Number of unique characters: {len(unique_chars)}")
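
Because sets are unordered, the printed order of `unique_chars` can differ between runs. When a reproducible display is needed, one option is to sort the set first, as in this sketch:

```python
text = "python programming is fun"

# Same comprehension as above: keep alphabetic characters only
unique_chars = {char for char in text if char.isalpha()}

# sorted() returns a list in a stable, alphabetical order
sorted_chars = sorted(unique_chars)

print("".join(sorted_chars))
print(len(sorted_chars))
```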

### Applications of Sets in Data Analysis

In [None]:
# Removing duplicates from data
transaction_ids = ["T1001", "T1002", "T1001", "T1003", "T1002", "T1004"]
unique_transactions = set(transaction_ids)
print(f"Original transactions: {transaction_ids}")
print(f"Unique transactions: {unique_transactions}")
print(f"Number of duplicate transactions: {len(transaction_ids) - len(unique_transactions)}")

# Finding common elements
customers_2024 = {"Alice", "Bob", "Charlie", "David", "Eve"}
customers_2025 = {"Bob", "Charlie", "Frank", "Grace", "Eve"}

# Returning customers (in both years)
returning_customers = customers_2024 & customers_2025
print(f"Returning customers: {returning_customers}")

# New customers in 2025
new_customers = customers_2025 - customers_2024
print(f"New customers in 2025: {new_customers}")

# Lost customers (in 2024 but not in 2025)
lost_customers = customers_2024 - customers_2025
print(f"Lost customers: {lost_customers}")

# Customer status analysis
customer_status = {
    "total_unique": len(customers_2024 | customers_2025),
    "returning": len(returning_customers),
    "new": len(new_customers),
    "lost": len(lost_customers)
}
print("Customer status analysis:", customer_status)
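
The counts above can be turned into rates. The sketch below computes a retention rate and a churn rate relative to the 2024 customer base. These two definitions are assumptions made for this example and are not part of the cell above.

```python
customers_2024 = {"Alice", "Bob", "Charlie", "David", "Eve"}
customers_2025 = {"Bob", "Charlie", "Frank", "Grace", "Eve"}

returning = customers_2024 & customers_2025
lost = customers_2024 - customers_2025

# Assumed definitions: shares of the 2024 customers who stayed or left
retention_rate = len(returning) / len(customers_2024)
churn_rate = len(lost) / len(customers_2024)

print(sorted(returning), sorted(lost))
print(f"Retention: {retention_rate:.0%}, churn: {churn_rate:.0%}")
```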

## 5. Practical Applications: Combining Data Structures

Now let's look at a more complex example that combines multiple data structures to solve a realistic data analysis problem.

In [None]:
# Sample sales data (each entry is a transaction)
sales_data = [
    {"date": "2025-01-15", "customer_id": "C001", "product_id": "P101", "quantity": 2, "price": 19.99},
    {"date": "2025-01-15", "customer_id": "C002", "product_id": "P105", "quantity": 1, "price": 29.99},
    {"date": "2025-01-16", "customer_id": "C001", "product_id": "P103", "quantity": 3, "price": 14.99},
    {"date": "2025-01-16", "customer_id": "C003", "product_id": "P101", "quantity": 1, "price": 19.99},
    {"date": "2025-01-17", "customer_id": "C002", "product_id": "P102", "quantity": 2, "price": 24.99},
    {"date": "2025-01-18", "customer_id": "C001", "product_id": "P101", "quantity": 1, "price": 19.99},
    {"date": "2025-01-18", "customer_id": "C003", "product_id": "P103", "quantity": 4, "price": 14.99},
    {"date": "2025-01-19", "customer_id": "C004", "product_id": "P105", "quantity": 2, "price": 29.99},
    {"date": "2025-01-19", "customer_id": "C002", "product_id": "P101", "quantity": 3, "price": 19.99}
]

product_info = {
    "P101": {"name": "Widget A", "category": "Tools", "weight": 0.5},
    "P102": {"name": "Widget B", "category": "Tools", "weight": 0.8},
    "P103": {"name": "Gadget X", "category": "Electronics", "weight": 0.3},
    "P105": {"name": "Gadget Y", "category": "Electronics", "weight": 1.2}
}

# Analysis Tasks:

# 1. Calculate total sales by product
product_sales = {}
for transaction in sales_data:
    product_id = transaction["product_id"]
    total = transaction["quantity"] * transaction["price"]
    
    if product_id not in product_sales:
        product_sales[product_id] = 0
    product_sales[product_id] += total

# Add product names to the results
product_sales_with_names = {
    product_info[product_id]["name"]: total 
    for product_id, total in product_sales.items()
}

print("Total sales by product:")
for product, total in product_sales_with_names.items():
    print(f"{product}: ${total:.2f}")

# 2. Find unique customers by product category
category_customers = {}
for transaction in sales_data:
    product_id = transaction["product_id"]
    category = product_info[product_id]["category"]
    customer_id = transaction["customer_id"]
    
    if category not in category_customers:
        category_customers[category] = set()
    category_customers[category].add(customer_id)

print("\nUnique customers by product category:")
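# Use customer_ids as the loop variable so it does not overwrite the customers dictionary defined earlier
for category, customer_ids in category_customers.items():
    print(f"{category}: {len(customer_ids)} customers {customer_ids}")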

# 3. Calculate daily sales
daily_sales = {}
for transaction in sales_data:
    date = transaction["date"]
    total = transaction["quantity"] * transaction["price"]
    
    if date not in daily_sales:
        daily_sales[date] = 0
    daily_sales[date] += total

print("\nDaily sales:")
for date in sorted(daily_sales.keys()):
    print(f"{date}: ${daily_sales[date]:.2f}")

# 4. Find the customer with the highest total spending
customer_spending = {}
for transaction in sales_data:
    customer_id = transaction["customer_id"]
    total = transaction["quantity"] * transaction["price"]
    
    if customer_id not in customer_spending:
        customer_spending[customer_id] = 0
    customer_spending[customer_id] += total

top_customer = max(customer_spending.items(), key=lambda x: x[1])
print(f"\nTop customer: {top_customer[0]} spent ${top_customer[1]:.2f}")

# 5. Average order value by category
category_totals = {}
category_counts = {}

for transaction in sales_data:
    product_id = transaction["product_id"]
    category = product_info[product_id]["category"]
    total = transaction["quantity"] * transaction["price"]
    
    if category not in category_totals:
        category_totals[category] = 0
        category_counts[category] = 0
    
    category_totals[category] += total
    category_counts[category] += 1

category_avg = {category: category_totals[category] / category_counts[category]
                for category in category_totals}

print("\nAverage order value by category:")
for category, avg in category_avg.items():
    print(f"{category}: ${avg:.2f}")
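
Each task above repeats the same "if the key is missing, initialize it" step. One alternative, not used in the cell above, is `collections.defaultdict`, which creates the starting value automatically. The sketch below uses a small invented `sales` list with round prices so the totals are easy to check by hand.

```python
from collections import defaultdict

# Hypothetical, simplified transactions
sales = [
    {"customer_id": "C001", "product_id": "P101", "quantity": 2, "price": 10.0},
    {"customer_id": "C002", "product_id": "P105", "quantity": 1, "price": 30.0},
    {"customer_id": "C001", "product_id": "P105", "quantity": 3, "price": 30.0},
]

# Missing keys start at 0.0 and at an empty set respectively
product_totals = defaultdict(float)
product_customers = defaultdict(set)

# One pass over the data fills both aggregations
for sale in sales:
    product_id = sale["product_id"]
    product_totals[product_id] += sale["quantity"] * sale["price"]
    product_customers[product_id].add(sale["customer_id"])

print(dict(product_totals))
print(dict(product_customers))
```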

## Summary

In these two notebooks, we've covered the essential Python data structures for data analysis:

1. **Lists** - Ordered, mutable collections for sequences of items
2. **Tuples** - Ordered, immutable collections for fixed data
3. **Dictionaries** - Key-value mappings for fast lookups and structured data
4. **Sets** - Unordered collections of unique elements for membership testing and mathematical operations

Understanding these structures and how to combine them allows you to efficiently organize, manipulate, and analyze data in Python. These fundamentals will form the basis for our work with the pandas library, which provides high-level data structures optimized for data analysis tasks.

### Key Takeaways

- Choose the right data structure for your specific needs:
  - Lists for ordered sequences that need to be modified
  - Tuples for fixed data or dictionary keys
  - Dictionaries for key-based lookups and structured data
  - Sets for unique values and set operations
  
- Use comprehensions for concise creation of data structures
- Combine different data structures to solve complex problems
- Leverage built-in methods to manipulate and transform data efficiently

In the next session, we'll learn how to use these data structures with real data files and begin our exploration of the pandas library.