In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Lecture 4

# Simulation, DataFrame Fundamentals

### EECS 398-003: Practical Data Science, Fall 2024

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/fa24">github.com/practicaldsc/fa24</a></small>
    
</div>

### Announcements 📣

- Homework 1 is due **tonight**, though note that you have 6 slip days to use during the semester, and you can use up to 2 slip days on any homework (see [here](https://practicaldsc.org/syllabus/#late-policy-slip-days-and-drops) for policy details).<br>
<small>Post on [Ed](https://edstem.org/us/courses/61012/discussion/) or
come to [Office Hours](https://practicaldsc.org/calendar) for help! We're using a queue for office hours now – access it from [practicaldsc.org/calendar](https://practicaldsc.org/calendar).</small>

- Homework 2 will be released tomorrow.<br><small>We'll make an Ed announcement anytime an assignment is released.</small>

- In discussion tomorrow, we'll cover past exam problems **on paper** related to this week's material.

- Check out the [Resources](https://practicaldsc.org/resources) tab on the course website, with links to lots of supplementary resources.<br><small>New link: [EECS 201: Computer Science Pragmatics](https://www.eecs.umich.edu/courses/eecs201/fa2024/schedule). Look here for help with Terminal commands, `git`, etc.</small>

### Agenda

- Randomness and simulation.
- Introduction to `pandas` DataFrames.
    - Selecting columns from a DataFrame.
    - Selecting rows from a DataFrame.

Remember to follow along in lecture by accessing the "blank" lecture notebook in our [public GitHub repository](https://github.com/practicaldsc/fa24).

<div class="alert alert-warning">
<h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
<small>Remember that you can always ask questions anonymously at the link above!</small>
    
    
When is your birthday?
        
    
    


## Randomness and simulation

We'll start by exploring a useful application of `numpy` in the field of probability and statistics: simulation!

---

### `np.random`

The submodule `np.random` contains various functions that produce **random** results.<br><small>These use [pseudo-random number generators](https://en.wikipedia.org/wiki/Pseudorandom_number_generator) to generate random-seeming sequences of results.</small>

In [None]:
# Run this cell multiple times!
# Returns a random integer between 1 and 6, inclusive.
...

In [None]:
# Returns a random real number between 0 and 1.
...

In [None]:
# Returns a randomly selected element from the provided list, 5 times.
...

In [None]:
# Returns the number of occurrences of each outcome
# in 12 trials of an experiment in which
# outcome 1 happens 60% of the time and
# outcome 2 happens 40% of the time.
...
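The four blank cells above are meant to be filled in live. As a hedged sketch, here is one plausible set of `np.random` calls matching those comments; the lecture itself may use different functions, and the list of choices is made up for illustration.

```python
import numpy as np

# A random integer between 1 and 6, inclusive.
# randint's upper bound is exclusive, hence the 7.
roll = np.random.randint(1, 7)

# A random real number in the half-open interval [0, 1).
u = np.random.random()

# 5 elements drawn at random from a list (with replacement, by default).
# The list itself is a made-up example.
picks = np.random.choice(['rock', 'paper', 'scissors'], 5)

# The number of occurrences of each outcome in 12 trials, where
# outcome 1 has probability 0.6 and outcome 2 has probability 0.4.
counts = np.random.multinomial(12, [0.6, 0.4])

print(roll, u, picks, counts)
```

Since the results are random, rerunning the block should give different values each time.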

### Simulations

- Often, we'll want to estimate the probability of an event, but it may not be possible – or we may not know how – to calculate the probability exactly.<br><small>e.g., the probability that I see between 40 and 50 heads when I flip a fair coin 100 times.</small>

- Or, we may have a theoretical answer, and want to validate it using another approach.

- In such cases, we can use the power of simulation. We can:
    1. Figure out how to **simulate** one run of the experiment.<br><small>e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.</small>
    1. Repeat the experiment many, many times.
    1. Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
    
    This is the basis of [Monte Carlo Methods](https://en.wikipedia.org/wiki/Monte_Carlo_method).

- Theory tells us that **the more repetitions we perform of our experiment, the closer our fraction will tend to be to the true probability of the event**!
<br><small>Specifically, the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) tells us this.</small>

### Example: Coin flipping

- **Question**: What is the probability that I see between 40 and 50 heads, inclusive, when I flip a fair coin 100 times?

- **Step 1**: Figure out how to **simulate** one run of the experiment.<br><small>e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.</small>

In [None]:
...

In [None]:
...

In [None]:
def num_heads():
    ...
num_heads()

- **Step 2**: Repeat the experiment many, many times.<br><small>In other words, run the cell above lots of times and store the results somewhere.</small>

In [None]:
outcomes = ...
for _ in range(10_000):
    ...

- **Step 3**: Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.

In [None]:
px.histogram(outcomes)

In [None]:
...

- This is remarkably close to the true, theoretical answer!

In [None]:
from scipy.stats import binom
binom.cdf(50, 100, 0.5) - binom.cdf(39, 100, 0.5)
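Putting the three steps together, here is a minimal sketch of the whole simulation. It assumes heads are encoded as 1 and tails as 0, which is a choice made for this sketch and not necessarily the encoding used in lecture. The exact answer is computed with `math.comb` so that the block needs nothing beyond `numpy`.

```python
from math import comb
import numpy as np

def num_heads():
    # Step 1: flip a fair coin 100 times (1 = heads, 0 = tails)
    # and count the number of heads.
    flips = np.random.choice([0, 1], 100)
    return flips.sum()

# Step 2: repeat the experiment 10,000 times and store each result.
outcomes = []
for _ in range(10_000):
    outcomes.append(num_heads())
outcomes = np.array(outcomes)

# Step 3: the fraction of experiments with between 40 and 50 heads, inclusive.
estimate = ((outcomes >= 40) & (outcomes <= 50)).mean()

# Exact binomial probability, for comparison.
exact = sum(comb(100, k) for k in range(40, 51)) / 2 ** 100

print(estimate, exact)
```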

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What questions do you have about our coin flipping simulation?
    
Can you think of a way to perform the same simulation without **any** `for`-loops?
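One possible loop-free approach, sketched under the same assumption that 1 means heads and 0 means tails: generate all 10,000 experiments at once as a 2D array, then sum across each row.

```python
import numpy as np

# Each row is one experiment of 100 flips (1 = heads, 0 = tails).
flips = np.random.choice([0, 1], size=(10_000, 100))

# Summing across each row gives the number of heads per experiment.
outcomes = flips.sum(axis=1)

# Fraction of experiments with between 40 and 50 heads, inclusive.
estimate = ((outcomes >= 40) & (outcomes <= 50)).mean()
print(estimate)
```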

### Example: The Birthday Paradox

- There are ~80 students in the room right now. What are the chances at least 2 students share the same birthday?

- In general, **how many people must be in a room for there to be a 50% chance that at least 2 of them share the same birthday?**

- Let's define a function, `estimated_probability`, which takes in a class size, `n`, and returns an estimate of the probability that in a class of `n` students, at least 2 students share the same birthday.

In [None]:
def simulate_classroom(n):
    # This helper function should take in a class size, n,
    # and return True if a simulated classroom of size n
    # has at least 2 students with the same birthday
    # and False otherwise.
    # This is not the most efficient solution, but works for now.
    ...
def estimated_probability(n):
    ...

In [None]:
...
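Here is a hedged sketch of how the two functions above might be implemented. It assumes 365 equally likely birthdays (no leap years), and the `repetitions` parameter is a hypothetical addition for this sketch, not part of the signature given in lecture.

```python
import numpy as np

def simulate_classroom(n):
    # Assumption: birthdays are numbered 1 to 365 and equally likely.
    birthdays = np.random.randint(1, 366, n)
    # Fewer unique birthdays than students means at least 2 students share one.
    return len(np.unique(birthdays)) < n

def estimated_probability(n, repetitions=10_000):
    # `repetitions` is a hypothetical parameter added for illustration.
    shared = [simulate_classroom(n) for _ in range(repetitions)]
    # The mean of a list of Booleans is the fraction that are True.
    return np.mean(shared)

print(estimated_probability(80))
```

As the comment in the helper notes, this is not the most efficient solution: each call simulates one classroom at a time, so evaluating many class sizes can take a few seconds.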

- With 80 students, it's almost certain that 2 share the same birthday!<br>What's the minimum class size we'd need for a 50% chance that at least 2 share the same birthday?

In [None]:
probs = [estimated_probability(n) for n in range(1, 51)]

In [None]:
(
    px
    .bar(x=range(1, 51), 
         y=probs,
         title='Probability that at least 2 students share the<br>same birthday in a class of n students')
    .update_xaxes(title='$n$')
    .update_yaxes(title='Probability')
)

- Lower than you might think!

## Introduction to `pandas` DataFrames

Let's finally start working with real datasets! 🎉

Note that we're going to cover a lot of code quickly. The point of lecture is to expose you to what's possible; you can look at the notebook later for the details.

---

### `pandas`

<center><img src='imgs/pandas.png' width=300></center>

- `pandas` is **the** Python library for tabular data manipulation.

- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.

- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.<br><small>Python is faster to develop in than Java or C++, and is more general-purpose than R.</small>

### Importing `pandas` and related libraries

`pandas` is almost always imported in conjunction with `numpy`.

In [None]:
import pandas as pd
import numpy as np

### `pandas` data structures

There are **three** key data structures at the core of `pandas`.

<center>
    <img src='imgs/df-anatomy.png' width=1000>
    <small>An example DataFrame.</small>
</center>

- **DataFrame**: 2-dimensional tables. These have rows and columns.

- **Series**: 1-dimensional array-like object, representing a row or column.<br><small>Like arrays, Series contain data of the same type. The plural of Series is also Series.</small>

- **Index**: Sequence of row or column labels. When we say "the index", we're referring to the sequence of **row labels**.<br><small>The index – `'lebronja'`, `'obammich'`, `'carpents'`, and `'timapplec'` in the example above – **is not** a column!<br>Column names – `'name'`, `'program'`, and `'year'` in the example above – are stored as strings, and the sequence of column names is also an index.<br></small>

### Example: Dog Breeds 🐶

- The dataset we'll work with comes from the American Kennel Club. Here's a [cool plot](https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/) made using our dataset.

<center>
    <img src="https://infobeautiful4.s3.amazonaws.com/2014/11/IIB_Best-In-Show_1276x2.png" width=650>
        
</center>

- We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."

- We can read in a CSV using `pd.read_csv(path)`. The path should be relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file (as a string).

- Today's dataset is stored in `'data/dogs43.csv'` – open it up and see what it looks like!

In [None]:
# The "cat" shell command shows you the contents of a file.
...

In [None]:
dogs = ...
dogs
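As a rough sketch of what reading a CSV looks like, the block below parses a few rows of in-memory CSV text so that it runs without the file. The rows and the column order are made up for illustration and are **not** the contents of `dogs43.csv`; only the column names are taken from this lecture.

```python
from io import StringIO
import pandas as pd

# Made-up stand-in for the contents of a CSV file.
csv_text = """breed,kind,lifetime_cost,longevity,size,weight,height
Pug,toy,18000.0,11.0,small,16.0,16.0
Irish Setter,sporting,20000.0,11.5,large,65.0,26.0
Newfoundland,working,19000.0,9.5,large,125.0,27.0
"""

# In a notebook, `!cat data/dogs43.csv` would print the real file, and
# pd.read_csv('data/dogs43.csv') would read it from disk.
# Here, StringIO lets read_csv parse the text above instead.
dogs = pd.read_csv(StringIO(csv_text))
print(dogs)
```

Note that the first line of the CSV becomes the column names, and the default index is 0, 1, 2, ...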

### Exploring our first DataFrame

- To extract the first or last few rows of a DataFrame, use the `head` or `tail` methods.<br><small>Like most DataFrame methods, `head` and `tail` **don't** modify the original DataFrame!</small>

In [None]:
...

In [None]:
...

In [None]:
dogs

- The `shape` attribute returns the DataFrame's number of rows and columns.<br><small>Sure, we can _see 👀_ that it says 43 rows x 7 columns above, but the `shape` attribute allows us to write code involving the number of rows/columns.</small>

In [None]:
# Note that the index – 0, 1, 2, ... – does **not** count as a column!
...
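A small sketch of `head`, `tail`, and `shape`, using a made-up four-row DataFrame in place of the real `dogs` (the values are illustrative only):

```python
import pandas as pd

# Made-up stand-in for dogs; not the real dataset.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'weight': [16.0, 25.0, 65.0, 125.0],
})

first_two = dogs.head(2)   # New DataFrame with the first 2 rows.
last_two = dogs.tail(2)    # New DataFrame with the last 2 rows.

# shape is an attribute, not a method, so there are no parentheses.
num_rows, num_cols = dogs.shape
print(num_rows, num_cols)
```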

### Sorting

- To sort by a column, use the `sort_values` method.<br><small>`ascending=False` is a **keyword argument**, meaning you need to specify the name of the argument to use it. You've seen some examples of this in the `plotly` part of Homework 1.</small> 

In [None]:
# Note that the index is no longer 0, 1, 2, ...!
...

- We can also sort by multiple columns!<br><small>This sorts by `'height'`, then breaks ties by `'longevity'`. Note the difference in the last three rows between this DataFrame and the one above.</small>

In [None]:
...
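To illustrate both kinds of sorting, here is a sketch on a made-up DataFrame in which two breeds are tied on `'height'`, so that the tie-breaking column matters. The values are invented for this example.

```python
import pandas as pd

# Made-up stand-in for dogs; two breeds share the same height.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Irish Setter', 'Labrador Retriever', 'Cocker Spaniel'],
    'longevity': [11.0, 11.5, 12.0, 12.5],
    'height': [16.0, 26.0, 26.0, 14.5],
})

# Sort by a single column, tallest first.
by_height = dogs.sort_values('height', ascending=False)

# Sort by 'height', breaking ties by 'longevity' (both descending).
by_both = dogs.sort_values(['height', 'longevity'], ascending=False)
print(by_both)
```

In `by_both`, the two 26-inch breeds appear with the longer-lived one first, and the index labels travel with their rows instead of being renumbered.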

### Setting the index

- Think of each row's index as its **unique identifier** or **name**. The default index when we create a DataFrame using `pd.read_csv` is 0, 1, 2, 3, ...
<br><small>Think of the index of a DataFrame like a "key" in a dictionary (Python) or map (C++).</small>

In [None]:
dogs

In [None]:
dogs.index

- Often, we like to set the index of a DataFrame to a unique identifier if we have one available. We can do so with the `set_index` method.<br><small>We'll see the real benefit of this shortly.</small>

In [None]:
...

In [None]:
# The above cell didn't involve an assignment statement, so dogs was unchanged.
dogs

In [None]:
# By reassigning dogs, our changes will persist.
# Note that we can't run this cell twice! Try it and see what happens.
dogs = dogs.set_index('breed')
dogs

In [None]:
# There used to be 7 columns, but now there are only 6!
# The index is **not** a column!
dogs.shape

In [None]:
dogs.index
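The behavior described in the cells above can be reproduced on a made-up DataFrame (illustrative values only). The sketch also shows why the reassignment cell can't be run twice: after the first run, `'breed'` is no longer a column.

```python
import pandas as pd

# Made-up stand-in for dogs.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'weight': [16.0, 25.0, 65.0, 125.0],
})

# set_index returns a new DataFrame; dogs itself is unchanged here.
indexed = dogs.set_index('breed')
shape_before = dogs.shape

# Reassigning makes the change persist.
dogs = dogs.set_index('breed')
shape_after = dogs.shape

# Running set_index('breed') again fails, since 'breed' is now the index.
try:
    dogs.set_index('breed')
    second_call_failed = False
except KeyError as e:
    second_call_failed = True
    print('KeyError:', e)
```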

<div class="alert alert-success">
<h3>Activity</h3>
    
Assign `tallest_breed` to the name, as a **string**, of the tallest breed in the dataset. Answer using `pandas` code, i.e. **don't** look at the dataset and hard-code the answer.

In [None]:
tallest_breed = ...
tallest_breed

<div class="alert alert-warning">
    <h3>Question 🤔 (Answer at <a style="text-decoration: none; color: #0066cc" href="https://docs.google.com/forms/d/e/1FAIpQLSd4oliiZYeNh76jWy-arfEtoAkCrVSsobZxPwxifWggo3EO0Q/viewform">practicaldsc.org/q</a>)</h3>
    
What are your thoughts on the activities in lecture?
    
- A. I really hate them, get rid of them and spend more time lecturing.
- B. I don't mind them.
- C. I'm neutral – I wouldn't be sad to see them go, but don't mind if they're there.
- D. I like them.
- E. I really love them, do more!

### 💡 Pro-Tip: Displaying more rows/columns

Sometimes, you just want `pandas` to display a lot of rows and columns. You can use this helper function to do that.

In [None]:
from IPython.display import display
def display_df(df, rows=pd.options.display.max_rows, cols=pd.options.display.max_columns):
    """Displays up to `rows` rows and `cols` columns of df."""
    with pd.option_context("display.max_rows", rows,
                           "display.max_columns", cols):
        display(df)

In [None]:
display_df(dogs.sort_values('weight', ascending=False), rows=43)

## Selecting columns from a DataFrame

In order to answer questions involving our data, we'll need to be able to access the values stored in individual columns.

---

### Selecting columns with `[]`

- The most common way to select a subset of the columns in a DataFrame is by using the `[]` operator.
<br><small>This is just like when we accessed values in a dictionary based on their key.</small>

- Specifying a column name returns the column as a Series.

- Specifying a list of column names returns a DataFrame.

In [None]:
dogs

In [None]:
# Returns a Series.
...

In [None]:
# Returns a DataFrame.
...

In [None]:
# 🤔
...

- As an aside: when you get an error message in Python, **the most informative part is usually at the bottom!**<br><small>So, if you're posting about your error on Ed, or debugging with us in office hours, show us the bottom first.</small>

In [None]:
# Breeds are stored in the index, which is not a column!
...

In [None]:
...
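Here is a sketch of the different results `[]` can give, using a made-up DataFrame indexed by `'breed'`. The values are invented, and the specific expressions are plausible fill-ins, not necessarily the ones used in lecture.

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'size': ['small', 'small', 'large', 'large'],
    'weight': [16.0, 25.0, 65.0, 125.0],
}).set_index('breed')

kinds = dogs['kind']               # A single column name gives a Series.
subset = dogs[['kind', 'size']]    # A list of column names gives a DataFrame.
one_col = dogs[['kind']]           # A one-element list still gives a DataFrame.

# 'breed' is stored in the index, so it can't be selected like a column.
try:
    dogs['breed']
    breed_is_column = True
except KeyError:
    breed_is_column = False

breeds = dogs.index                # The index is accessed with .index instead.
print(type(kinds), type(subset), type(one_col))
```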

### Useful Series methods

- A Series is like an array, but with an index.

- There are a variety of useful methods that work on Series. You can see the entire list [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). Many methods that work on a Series will also work on DataFrames, as we'll soon see.

In [None]:
dogs

In [None]:
# What are the unique kinds of dogs?
...

In [None]:
# How many unique kinds of dogs are there?
...

In [None]:
# What's the distribution of kinds?
# value_counts is super useful – and I love asking exam questions about it!
...

In [None]:
# What's the mean of the 'longevity' column?
...

In [None]:
# Tell me more about the 'weight' column.
...

In [None]:
# Sort the 'lifetime_cost' column. Note that here we're using sort_values on a Series, not a DataFrame!
...

In [None]:
# Gives us the index of the largest value, not the largest value itself.
# Note that this makes our Activity from a few slides ago way easier!
...
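The comments in the cells above suggest methods such as `unique`, `nunique`, `value_counts`, `mean`, `describe`, `sort_values`, and `idxmax`. The sketch below tries them on a made-up DataFrame; the values are illustrative, and applying `idxmax` to `'height'` is an assumption based on the earlier tallest-breed activity.

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'lifetime_cost': [18000.0, 24000.0, 20000.0, 19000.0],
    'longevity': [11.0, 12.5, 11.5, 9.5],
    'weight': [16.0, 25.0, 65.0, 125.0],
    'height': [16.0, 14.5, 26.0, 27.0],
}).set_index('breed')

unique_kinds = dogs['kind'].unique()           # Array of distinct values.
num_kinds = dogs['kind'].nunique()             # Number of distinct values.
kind_counts = dogs['kind'].value_counts()      # Series: value -> frequency.
mean_longevity = dogs['longevity'].mean()
weight_summary = dogs['weight'].describe()     # count, mean, std, min, ...
sorted_costs = dogs['lifetime_cost'].sort_values()

# idxmax returns the index label of the largest value, not the value itself.
tallest_breed = dogs['height'].idxmax()
print(kind_counts, tallest_breed)
```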

<div class="alert alert-success">
<h3>Activity</h3>
    
Complete the implementation of the function `average_heaviest`, which takes in a positive integer `n` and returns the **mean `'lifetime_cost'` of the top `n` heaviest breeds**. Example behavior is given below.
    
```python
>>> average_heaviest(5)
16142.8
    
>>> average_heaviest(1)
13581.0

```
    
**We won't have time to try this activity in lecture, but the answer is posted in `lec04-filled.ipynb` and in the "filled html" link on the course website.**

In [None]:
def average_heaviest(n):
    ...

In [None]:
...

In [None]:
...

### Series support vectorized operations

- Series operations are vectorized, just like with arrays.

In [None]:
dogs

- Example: If I adopt a puppy next year, until what year should I expect them to live?

In [None]:
...

- Example: What is the average cost per year to maintain each breed?

In [None]:
...

- Example: Which breed is the cheapest to own per year, on average? The most expensive?

In [None]:
...

In [None]:
...
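A sketch of the three examples above on made-up data. Two things here are assumptions: the adoption year of 2025 ("next year" relative to Fall 2024), and the invented costs and longevities.

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'lifetime_cost': [16500.0, 24000.0, 11500.0, 28500.0],
    'longevity': [11.0, 12.0, 11.5, 9.5],
}).set_index('breed')

# Adding a number to a Series adds it to every element.
# Assumption: the puppy is adopted in 2025.
expected_until = 2025 + dogs['longevity']

# Arithmetic between two Series is element-wise, matched by index label.
cost_per_year = dogs['lifetime_cost'] / dogs['longevity']

# idxmin and idxmax give the breed names, since breeds are the index.
cheapest = cost_per_year.idxmin()
most_expensive = cost_per_year.idxmax()
print(cheapest, most_expensive)
```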

<div class="alert alert-success">
<h3>Activity</h3>

Assign `bmis` to a **Series** containing the Body Mass Index (BMI) of each breed, using the following formula:    
    
$$ \text{BMI} = \frac{\text{weight in kg}}{[\text{height in m}]^2}$$
    
<br>
    
Note that in `dogs`, weights are measured in **pounds** and heights are measured in **inches**. Use the following conversion factors:
    
<center>1 kg = 2.2 pounds<br>1 inch = 2.54 cm = 0.0254 m</center>
    
Your solution can span multiple lines, and you can define intermediate variables if you need to (we did).

In [None]:
...

### Aside: Visualization

- We'll spend more time talking about when to create which types of visualizations in a few lectures.

- But for now, you can start exploring how the DataFrame `plot` method works!

In [None]:
dogs.plot(kind='scatter', x='weight', y='longevity')

In [None]:
# Hover over a point and see what happens!
(
    dogs
    .reset_index()
    .plot(kind='scatter', x='weight', y='longevity', color='size', hover_name='breed',
          title='Longevity vs. Weight for 43 Dog Breeds')
)

In [None]:
(
    dogs['kind']
    .value_counts()
    .sort_values()
    .plot(kind='barh', title='Distribution of Dog Kinds')
)

## Selecting slices of a DataFrame

Now that we know how to access specific columns in a dataset, how do we access specific rows? Or even individual values?

---

### Use `loc` to slice rows and columns using labels

- `loc` stands for "location".

- The `loc` indexer works similarly to slicing 2D arrays, but it uses row **labels** and column **labels**, not positions.<br><small>Remember, the "index" refers to the row labels.</small>

In [None]:
dogs

In [None]:
# The first argument is the row label, i.e. the index value.
#        ↓
dogs.loc['Pug', 'longevity']
#                  ↑
# The second argument is the column label.

- As an aside, `loc` is **not** a method – it's an indexer.

In [None]:
type(dogs.loc)

In [None]:
type(dogs.sort_values)

### `loc` is flexible 🧘

You can provide a sequence (list, array, Series) as either argument to `loc`.

In [None]:
dogs

In [None]:
...

In [None]:
...

In [None]:
# Note that the 'weight' column is included!
# loc, per the pandas documentation, is inclusive of both slicer endpoints.
...

In [None]:
...

In [None]:
# Shortcut for the line above.
...
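To make the flexibility of `loc` concrete, here is a sketch on a made-up DataFrame indexed by `'breed'`. The values and the column order are invented, and the specific selections are only examples of what the blank cells might contain.

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'longevity': [11.0, 12.5, 11.5, 9.5],
    'size': ['small', 'small', 'large', 'large'],
    'weight': [16.0, 25.0, 65.0, 125.0],
    'height': [16.0, 14.5, 26.0, 27.0],
}).set_index('breed')

# A list of row labels and a single column label gives a Series.
some_longevity = dogs.loc[['Pug', 'Newfoundland'], 'longevity']

# Lists for both arguments give a DataFrame.
some_cells = dogs.loc[['Pug', 'Newfoundland'], ['kind', 'weight']]

# Label-based slices include BOTH endpoints.
block = dogs.loc['Pug':'Irish Setter', 'longevity':'weight']

# Selecting whole rows: the ", :" can be left out.
rows_long = dogs.loc[['Pug', 'Newfoundland'], :]
rows_short = dogs.loc[['Pug', 'Newfoundland']]
print(block)
```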

### Use `iloc` to slice rows and columns using positions

- `iloc` stands for "integer location."

- `iloc` is like `loc`, but it selects rows and columns based on integer positions only, just like with 2D arrays.

In [None]:
dogs

In [None]:
# Try removing the iloc and see what happens!
...

- `iloc` is often most useful when we sort first. For instance, to find the weight of the longest-living breed in the dataset:

In [None]:
...

In [None]:
# Finding the breed itself involves sorting, but not iloc, since breeds are stored in the index.
...
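A sketch of position-based selection and of the sort-then-`iloc` pattern, on a made-up DataFrame (illustrative values; the particular slice is an arbitrary example):

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'longevity': [11.0, 12.5, 11.5, 9.5],
    'weight': [16.0, 25.0, 65.0, 125.0],
    'height': [16.0, 14.5, 26.0, 27.0],
}).set_index('breed')

# Rows at positions 1 and 2, and every column except the last.
# Unlike loc, integer slices in iloc exclude the right endpoint.
block = dogs.iloc[1:3, :-1]

# Weight of the longest-living breed: sort, then take position 0.
by_longevity = dogs.sort_values('longevity', ascending=False)
weight_of_longest_living = by_longevity['weight'].iloc[0]

# The breed itself lives in the index, so no iloc is needed.
longest_living_breed = by_longevity.index[0]
print(longest_living_breed, weight_of_longest_living)
```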

<div class="alert alert-success">
<h3>Activity</h3>
    
Among just the following four breeds, what is the height of the second tallest breed?
    
- Cocker Spaniel.
- Labrador Retriever.
- Irish Setter.
- Newfoundland.
    
Assign your answer to `second_tallest_height` below. Answer using `pandas` code, i.e. **don't** look at the dataset and hard-code the answer.
    
**We won't have time to try this activity in lecture, but the answer is posted in `lec04-filled.ipynb` and in the "filled html" link on the course website.**

In [None]:
...

## Querying

Okay, but what if we don't know anything about the position or index of a row we're looking for? How do we find rows that satisfy certain **conditions**?

---

### Reflection

- So far, all of the questions we've been able to answer involved all of the rows in the dataset.
<br><small>What's the weight of the longest-living breed? What's the average lifetime cost of all breeds? Which breed is third heaviest?</small>

- We don't yet have a mechanism to answer questions about a specific subset of the dataset.
<br><small>How many terriers are there? What's the average longevity of medium-sized breeds?</small>

### Querying

- Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).<br><small>We sometimes call this "filtering."</small>

- As we saw in Lecture 3, comparisons with arrays result in Boolean arrays. The same is true for Series – **make a comparison with a Series, and the result is a Boolean Series!**

- We can use comparisons along with the `loc` indexer to select specific rows from a DataFrame.

In [None]:
dogs

In [None]:
...

In [None]:
...

In [None]:
# This gives us the number of terriers in the dataset.
...

In [None]:
...

In [None]:
# .str.contains is very useful!
...

In [None]:
# Because querying is so common, there's a shortcut:
...

In [None]:
# Empty DataFrame – not an error!
...
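The querying patterns hinted at in the cells above can be sketched as follows. The DataFrame is made up and contains no terriers, so the sketch counts `'sporting'` breeds instead; the specific conditions are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for dogs, with 'breed' as the index.
dogs = pd.DataFrame({
    'breed': ['Pug', 'Cocker Spaniel', 'Irish Setter', 'Newfoundland'],
    'kind': ['toy', 'sporting', 'sporting', 'working'],
    'weight': [16.0, 25.0, 65.0, 125.0],
}).set_index('breed')

# Comparing a Series to a value gives a Boolean Series.
is_sporting = dogs['kind'] == 'sporting'

# Passing the Boolean Series to loc keeps only the rows where it's True.
sporting = dogs.loc[is_sporting]

# True counts as 1 and False as 0, so the sum is the number of matches.
num_sporting = is_sporting.sum()

# .str.contains works on the index too, since breeds are strings.
setters = dogs.loc[dogs.index.str.contains('Setter')]

# Shortcut: a Boolean Series inside [] also selects rows.
sporting_shortcut = dogs[is_sporting]

# No rows satisfy this condition: the result is an empty DataFrame, not an error.
empty = dogs[dogs['weight'] > 1000]
print(sporting)
```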

<div class="alert alert-success">
<h3>Activity</h3>
    
Assign `second_tallest` to the `'size'` of the second-tallest `'sporting'` breed, as a string. Answer using `pandas` code, i.e. **don't** look at the dataset and hard-code the answer.

In [None]:
...

### Lingering questions

- How do we use multiple conditions?<br><small>We actually covered the answer to this in Lecture 3 – try it out yourself!</small>

- There's a DataFrame `query` method – how does it work, and how is it different from what we've seen here?

- How do we find the average longevity of every `'kind'` without having to copy-paste a lot of code?

- We'll find out on Tuesday!