# Midterm Review
1. Table Methods
2. Functions
3. Iteration and Conditionals
4. Chance
5. Sampling
6. Hypothesis Testing
7. Data Visualizations

In [1]:
# These lines import the Numpy and Datascience modules after installation.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1.0 Foundational Table Methods
Below is an image of table methods:
<br />
<img src="./images/table_methods.png" width="700" height="700" />

The table above lists the input and output types of each table method, which matters when we chain calls to get exactly what we want. For example, `tbl.column("column_name")` or `tbl.column(column_index)` takes either a *string* column name or an *int* column index, and `tbl.column()` returns an *array*.

In [2]:
tvshows = Table().with_columns(
    "Name", make_array("Grey's Anatomy", "Suits", "House of Cards", "Scrubs", "Scandal", "How I Met Your Mother"),
    "Rating", make_array(7.7, 8.7, 9, 8.4, 7.9, 8.4),
    "# of Seasons", make_array(12, 5, 4, 8, 5, 9),
    "Genre", make_array("medical drama", "legal drama", "political drama", "sitcom", "political drama", "sitcom"),
    "Premiere Year", make_array(2005, 2011, 2013, 2001, 2012, 2005)
)

tvshows.show()

Name,Rating,# of Seasons,Genre,Premiere Year
Grey's Anatomy,7.7,12,medical drama,2005
Suits,8.7,5,legal drama,2011
House of Cards,9.0,4,political drama,2013
Scrubs,8.4,8,sitcom,2001
Scandal,7.9,5,political drama,2012
How I Met Your Mother,8.4,9,sitcom,2005


Let's take an example. Alice decides to filter the `tvshows` table to include only shows that have at least an `8.0` rating and at least `6` seasons. The table `tvshows` is shown above.

**Q1.0.1** Write an expression that returns a table such that only the rows satisfying both of the above conditions will be included.

In [3]:
# Multiple Expressions
filtered_tvshows = tvshows.where("Rating", are.above_or_equal_to(8)) # -> Table
filtered_tvshows = filtered_tvshows.where("# of Seasons", are.above_or_equal_to(6)) # -> Table
filtered_tvshows

# Single Expression
tvshows.where("Rating", are.above_or_equal_to(8)).where("# of Seasons", are.above_or_equal_to(6))

Name,Rating,# of Seasons,Genre,Premiere Year
Scrubs,8.4,8,sitcom,2001
How I Met Your Mother,8.4,9,sitcom,2005


**Q1.0.2** Write an expression that evaluates to the number of seasons of the show with the highest rating.

In [4]:
tvshows.sort("Rating", descending = True).column("# of Seasons").item(0)

4

### 1.1 `group()` vs. `pivot()`

Two of the most commonly confused table methods are `tbl.group()` and `tbl.pivot()`.

> `tbl.group(column_name, function)`
- Creates a row for each unique category in the column specified
- Uses the specified function to *collect* values in the other columns
- Default function is `count()`
- If function is inputted -> column names are changed to "column_name function_name" (e.g. "Price" -> "Price max")

> `tbl.pivot(col_names, row_names, values, function)`
- Allows us to compare values that overlap amongst unique categories in two separate columns

Use Cases for `group()` and `pivot()`:
1. `tbl.group()`
    - When we need to *count or summarize values* for each unique category in one column.
    - We're *aggregating* (summing, averaging, max/min, etc.) a numerical column based on a categorical column.
    - We're *reducing the number of rows* to one row per unique category.

For example, you are given a table `grades` with columns `"Course"`, `"Student"`, and `"Score"`. Compute and return the average score for each course.

```python
grades.group("Course", np.mean).column("Score mean")
```
**Why?** We want to aggregate scores within each course, so we group by `"Course"` and take the mean of the `"Score"` column.

If the problem asks for “average,” “count,” or “max/min” for a category -> use `group()`.

2. `tbl.pivot()`
    - When we need to compare values across two categorical columns.
    - We need to reshape the table so that one column’s unique values become separate columns.
    - We need to make a cross-tabulation.

For example, you have a table `sales` with columns `"Month"`, `"Product"`, and `"Revenue"`. Create a table where each row represents a month, and each product has its own column showing total revenue.

```python
sales.pivot("Product", "Month", "Revenue", sum)
```
**Why?** We want to structure the table so that each row corresponds to a month and each product becomes a separate column showing its revenue.

If the problem asks to “compare” across two categorical columns → use `pivot()`.

| Scenario                                                 | Use `group()` | Use `pivot()` |
|----------------------------------------------------------|--------------|--------------|
| Summarizing a numerical column based on a single categorical column | ✅ | ❌ |
| Counting the number of occurrences per category          | ✅ | ❌ |
| Comparing values across two categorical columns         | ❌ | ✅ |
| Reshaping data so that unique values in one column become separate columns | ❌ | ✅ |
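
To make the contrast concrete, here is a plain-Python sketch of what the two methods compute. It does not use the `datascience` library and is not how that library is implemented; the `sales_rows` data and the helper names `group_sum` and `pivot_sum` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (month, product, revenue); not from the original notebook.
sales_rows = [
    ("Jan", "Tea", 10),
    ("Jan", "Coffee", 20),
    ("Feb", "Tea", 5),
    ("Feb", "Tea", 15),
]

def group_sum(rows):
    """One entry per unique month, like grouping by "Month" and summing."""
    totals = defaultdict(int)
    for month, _product, revenue in rows:
        totals[month] += revenue
    return dict(totals)

def pivot_sum(rows):
    """One entry per month with one slot per product, like a pivot with sum."""
    products = sorted({product for _month, product, _revenue in rows})
    grid = {}
    for month, product, revenue in rows:
        # Combinations that never occur keep the default value 0.
        grid.setdefault(month, {p: 0 for p in products})
        grid[month][product] += revenue
    return grid

print(group_sum(sales_rows))  # {'Jan': 30, 'Feb': 20}
print(pivot_sum(sales_rows))
```

The grouped result has one number per month, while the pivoted result keeps one number per month *and* product, which is the cross-tabulation described above.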


### 1.2 `join()`

Another function that may be confusing is `tbl.join()`.

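> `tblA.join(colA, tblB, colB)`
- Joins `tblA` and `tblB` by matching values in `colA` with values in `colB`
- `colA` and `colB` should share values; rows without a match in the other table are dropped
- Returns a new table with the columns of `tblA` plus the other columns of `tblB` (the matched column keeps the name `colA`)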

Let's take an example with the `chocolates` table and the `nutrition` table below. We want to join the two tables based on the `"Color"` column in the `chocolates` table and the `"Type"` column in the `nutrition` table.

In [5]:
chocolates = Table().with_columns(
    "Color", make_array("Dark", "Milk", "White", "Dark", "Milk", "Milk"),
    "Shape", make_array("Round", "Rectangular", "Rectangular", "Round", "Rectangular", "Round"),
    "Amount", make_array(4, 6, 12, 7, 9, 2),
    "Price ($)", make_array(1.30, 1.20, 2.00, 1.75, 1.40, 1.00)
)

chocolates.show()

Color,Shape,Amount,Price ($)
Dark,Round,4,1.3
Milk,Rectangular,6,1.2
White,Rectangular,12,2.0
Dark,Round,7,1.75
Milk,Rectangular,9,1.4
Milk,Round,2,1.0


In [6]:
nutrition = Table().with_columns(
    "Type", make_array("Dark", "Milk", "White", "Ruby"),
    "Calories", make_array(120, 130, 115, 120)
)

nutrition.show()

Type,Calories
Dark,120
Milk,130
White,115
Ruby,120


In [7]:
chocolates.join("Color", nutrition, "Type")

Color,Shape,Amount,Price ($),Calories
Dark,Round,4,1.3,120
Dark,Round,7,1.75,120
Milk,Rectangular,6,1.2,130
Milk,Rectangular,9,1.4,130
Milk,Round,2,1.0,130
White,Rectangular,12,2.0,115


As shown above, `chocolates.join("Color", nutrition, "Type")` returns a `Table` with the same columns as the original `chocolates` table plus the `"Calories"` column from the `nutrition` table (e.g. `"Dark"` -> 120, so every `"Dark"` row in the new table has the value `120` in the new `"Calories"` column).

What happens when we call the reverse, `nutrition.join("Type", chocolates, "Color")`?

In [8]:
nutrition.join("Type", chocolates, "Color")

Type,Calories,Shape,Amount,Price ($)
Dark,120,Round,4,1.3
Dark,120,Round,7,1.75
Milk,130,Rectangular,6,1.2
Milk,130,Rectangular,9,1.4
Milk,130,Round,2,1.0
White,115,Rectangular,12,2.0
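
Notice that `"Ruby"` appears in neither result: no chocolate has that color, so there is nothing to match it with. A rough plain-Python sketch of this matching idea is below. It is only an illustration, not the `datascience` implementation; `join_rows` is a hypothetical helper, and it assumes each label appears once in the lookup (true for `nutrition`).

```python
# First three rows of `chocolates` (Color, Shape, Amount) and all of `nutrition`.
chocolate_rows = [
    ("Dark", "Round", 4),
    ("Milk", "Rectangular", 6),
    ("White", "Rectangular", 12),
]
calories_by_type = {"Dark": 120, "Milk": 130, "White": 115, "Ruby": 120}

def join_rows(left_rows, lookup):
    """Keep each left row whose label has a match and append the matched value."""
    joined = []
    for row in left_rows:
        label = row[0]
        if label in lookup:  # rows without a match are dropped
            joined.append(row + (lookup[label],))
    return joined

joined = join_rows(chocolate_rows, calories_by_type)
labels = [row[0] for row in joined]
print(joined[0])  # ('Dark', 'Round', 4, 120)
```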


**Q1.2.1** *(FA23 Final Q4)* Barbara and Jeanine were quarantined for a week with 13 other friends due to unforeseen circumstances. They are curious to understand what songs each friend listened to during the quarantine period.

To evaluate this, they randomly sample song "plays" by the 15 people during quarantine and put that into the `spotify` table. Here are the first few rows:

In [23]:
# Disclaimer: In real exam scenarios, a table will not be shown being made or imported. This is just for learning purposes within this environment.
spotify = Table().with_columns(
    "Username", make_array("barbz23", "jea9", "ronnieboi"),
    "Artist", make_array("Olivia Rodrigo", "The Weeknd", "Doja Cat"),
    "Song", make_array("Vampire", "Popular", "Paint the Town Red"),
    "Genre", make_array("Pop", "R&B", "Hip-Hop"),
    "Duration", make_array(3.14, 2.78, 3.05)
)

spotify.show()
"""... (328 rows omitted)"""

Username,Artist,Song,Genre,Duration
barbz23,Olivia Rodrigo,Vampire,Pop,3.14
jea9,The Weeknd,Popular,R&B,2.78
ronnieboi,Doja Cat,Paint the Town Red,Hip-Hop,3.05


'... (328 rows omitted)'

The table has the following columns:
- *Username*: (`string`) the spotify username of the person who played the song
- *Artist*: (`string`) the song's artist
- *Song*: (`string`) the song's name
- *Genre*: (`string`) the song's genre
- *Duration*: (`float`) the number of minutes the song was played on that occasion

*Note: There is a row for each time a song was played, so many rows will be repeated. For example, if Jeanine listened to the song Vampire 3 times, then there will be 3 rows in the table for those "plays".*

Write a Python expression that returns a table with more than 3 columns that displays the average play duration for each unique combination of artist and song.

In [10]:
spotify.pivot("Artist", "Song", "Duration", np.average)

Song,Doja Cat,Olivia Rodrigo,The Weeknd
Paint the Town Red,3.05,0.0,0.0
Popular,0.0,0.0,2.78
Vampire,0.0,3.14,0.0


So why *`pivot()`*?

As per [Section 1.1](#11-group-vs-pivot), we use `.pivot()` when we **compare** values across *two* categorical columns. Here, those columns are `"Artist"` and `"Song"`: the question asks for the average play duration for each combination of artist and song. Therefore, we want the columns to be made up of the unique values of the original `"Artist"` column, the rows to be made up of the unique values of the original `"Song"` column, and the values to be the average of the `"Duration"` column. So we call `pivot()` on the `spotify` table like so: `spotify.pivot("Artist", "Song", "Duration", np.average)`.

In [15]:
spotify

Username,Artist,Song,Genre,Duration
barbz23,Olivia Rodrigo,Vampire,Pop,3.14
jea9,The Weeknd,Popular,R&B,2.78
ronnieboi,Doja Cat,Paint the Town Red,Hip-Hop,3.05


Write a Python expression that returns the name of the artist that has the largest number of unique songs in the table.

In [None]:
spotify.group(make_array("Artist", "Song")).group("Artist").sort("count", descending = True).column(0).item(0)

'Doja Cat'

This is a complicated question. We use `.group()` twice. Why is that? When we go through it step by step:

1. `spotify.group(make_array("Artist", "Song"))`

This gives the table:
| Artist | Song | count |
| ------ | ---- | ----- |
| Doja Cat | Paint the Town Red | 1 |
| Olivia Rodrigo | Vampire | 1 |
| The Weeknd | Popular | 1 |

**Why?** When `.group()` takes in multiple columns (in the form of an array), it creates one row for each unique combination of values across those columns. If the original table had another Doja Cat song (let's say Say So), the table would look like:
| Artist | Song | count |
| ------ | ---- | ----- |
| Doja Cat | Paint the Town Red | 1 |
| Doja Cat | Say So | 1 |
| Olivia Rodrigo | Vampire | 1 |
| The Weeknd | Popular | 1 |

It essentially is a more specific "grouping": grouping by `"Artist"` and `"Song"`. We don't pass in a function because the default count is enough: all we need is one row per unique song per artist, and the values in the `count` column (how many times each song was played) aren't used in the later steps.

2. `spotify.group(make_array("Artist", "Song")).group("Artist")`

Now that we have grouped by every unique song per unique artist, we want to get the artist that has the largest number of unique songs, which we can do by using another `.group()`. Why not `.pivot()`? Because we are grouping by *ONE categorical column*.

This gives the table:
| Artist | count |
| ------ | ---- |
| Doja Cat | 1 |
| Olivia Rodrigo | 1 |
| The Weeknd | 1 |

Doja Cat has 1 unique song, Olivia Rodrigo has 1 unique song, The Weeknd has 1 unique song. *Obviously, for the point of the question, the table has many more rows; this is just an example using the first few rows given.* We had to group by `"Artist"` and `"Song"` to get to this point, since we first needed the unique songs each artist had. If we didn't need the `"Artist"`, we very much could've done `spotify.group("Song")` to get the count of unique songs, but the question asks for the name of the artist that has the largest number of unique songs.

3. `spotify.group(make_array("Artist", "Song")).group("Artist").sort("count", descending = True)`

We sort by the column `"count"`, and use `descending = True` as we want the column to be sorted from greatest -> least.

If the table looked something like:

| Artist | count |
| ------ | ---- |
| Doja Cat | 5 |
| Olivia Rodrigo | 2 |
| The Weeknd | 4 |

The resulting table after running the `.sort()` would be:

| Artist | count |
| ------ | ---- |
| Doja Cat | 5 |
| The Weeknd | 4 |
| Olivia Rodrigo | 2 |

4. `spotify.group(make_array("Artist", "Song")).group("Artist").sort("count", descending = True).column(0).item(0)`

We then get the column we are interested in after the sort: `"Artist"`. Since the rows are already sorted, we just need the first value of the `"Artist"` column to get the artist that has the largest amount of unique songs, hence `column(0).item(0)` or `column("Artist").item(0)`.
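
The same four steps can be sketched in plain Python to see why grouping only once would not be enough. The `plays` list below is hypothetical data (it is not the omitted rows of `spotify`), and the variable names are made up for illustration.

```python
# Hypothetical (artist, song) plays; a song played several times appears several times.
plays = [
    ("Doja Cat", "Paint the Town Red"),
    ("Doja Cat", "Say So"),
    ("Olivia Rodrigo", "Vampire"),
    ("The Weeknd", "Popular"),
    ("The Weeknd", "Popular"),
    ("The Weeknd", "Popular"),
]

# Step 1: one entry per unique (artist, song) pair, like the first group().
unique_pairs = set(plays)

# Step 2: count the unique pairs for each artist, like the second group().
songs_per_artist = {}
for artist, _song in unique_pairs:
    songs_per_artist[artist] = songs_per_artist.get(artist, 0) + 1

# Steps 3 and 4: sort by count, largest first, and take the first artist.
top_artist = sorted(songs_per_artist, key=songs_per_artist.get, reverse=True)[0]

# Counting plays instead of unique songs would give a different answer.
plays_per_artist = {}
for artist, _song in plays:
    plays_per_artist[artist] = plays_per_artist.get(artist, 0) + 1
most_played = max(plays_per_artist, key=plays_per_artist.get)

print(top_artist)   # Doja Cat (2 unique songs)
print(most_played)  # The Weeknd (3 plays of 1 song)
```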

**Q1.2.2** *(FA23 Final Q4)* While looking at a table of song plays is helpful, Barbara notices that the table doesn't contain the name of people who played the songs. She creates a separate table called `accounts` that contains their friends' Spotify accounts. The first few rows are shown here:

In [27]:
accounts = Table().with_columns(
    "Identifier", make_array("jmarsdenofficial", "margarita23", "ken_the_og"),
    "DisplayName", make_array("James Marsden", "Inez De Leon", "Ken Hyun")
)

accounts.show()
"""... (12 rows omitted)"""

Identifier,DisplayName
jmarsdenofficial,James Marsden
margarita23,Inez De Leon
ken_the_og,Ken Hyun


'... (12 rows omitted)'

The table has the following columns:
- *Identifier*: (string) the account's ID in Spotify's database
- *DisplayName*: (string) the account's display name (first and last name)

Barbara notices that one of the friends, `'Todd Gregory'`, tends to skip Pop songs after listening to them for just a few seconds. She writes the following partially completed code, which assigns `result` to an array containing the average play duration for every Pop song that Todd played.

```python
combined = ____(a)____
todd_pop_songs = ____(b)____
result = todd_pop_songs.____(c)____
```

Write a Python expression that fills in (a), (b), and (c).

In [None]:
combined = spotify.join("Username", accounts, "Identifier")
todd_pop_songs = combined.where("DisplayName", "Todd Gregory").where("Genre", "Pop")  # filter the joined table, which has "DisplayName"
result = todd_pop_songs.select("Song", "Duration").group("Song", np.average).column(1)

"""Results in an error in our environment since we don't have the original tables with the omitted rows."""

"Results in an error in our environment since we don't have the original tables with the omitted rows."

(a) = `spotify.join("Username", accounts, "Identifier")`

(b) = `combined.where("DisplayName", "Todd Gregory").where("Genre", "Pop")`

(c) = `select("Song", "Duration").group("Song", np.average).column(1)`

Our final goal is the average play duration for every *unique* Pop song that Todd played. Just from looking at our goal, we need to somehow combine the original `spotify` table and this `accounts` table so we can see the play durations for Todd (our minds should automatically go to `.join()`). Additionally, we know that we need to use either `.group()` or `.pivot()`, as we need one value per unique song.

We can rule out the possibility of `.pivot()` since we are not going to be comparing two categorical columns. So, we know we need to use `.join()` and `.group()`. The variable names give us a clear direction of where to go.

1. `combined = ____(a)____`

`combined` is most likely a table where we use `.join()` on the `spotify` table and the `accounts` table. The two columns that share similar/identical values are `"Identifier"` in `accounts` and `"Username"` in `spotify`. The question now is whether to use `spotify.join()` or `accounts.join()`. It doesn't really matter for this specific case, but for clarity purposes and for the reason that `spotify` has much more data, we want to use `spotify.join()`. As per [Section 1.2](#12-join), we use `.join()` -> `spotify.join("Username", accounts, "Identifier")`.

The resulting table would have this structure:
| Username | Artist | Song | Genre | Duration | DisplayName |
| -------- | ------ | ---- | ----- | -------- | ----------- |
| ...      | ...    | ...  | ...   | ...      | ...         |


2. `todd_pop_songs = ____(b)____`

Now that we have our `combined` table, we want to keep only the rows for `"Todd Gregory"` and the genre `"Pop"`, which are found in the columns `"DisplayName"` and `"Genre"`. For filtering rows, we use `.where()`. We want to keep (in no particular order) the rows with DisplayName `"Todd Gregory"` and Genre `"Pop"`. So we use `combined.where("DisplayName", "Todd Gregory").where("Genre", "Pop")`.

The resulting table would have this structure:
| Username | Artist | Song | Genre | Duration | DisplayName |
| -------- | ------ | ---- | ----- | -------- | ----------- |
| ...      | ...    | ...  | ...   | ...      | Todd Gregory         |

*Disclaimer: `combined.where("DisplayName", "Todd Gregory").where("Genre", "Pop")` can also be written interchangeably with `combined.where("Genre", "Pop").where("DisplayName", "Todd Gregory")`. It's essentially the same thing.*

3. `result = todd_pop_songs.____(c)____`

We know that `todd_pop_songs` is a table, therefore we must use table methods to get an array to assign to `result`. We haven't yet collapsed the rows to one per Pop song that Todd played. Therefore, we use `.group("Song", np.average)`, which groups by unique song and aggregates every other column with `np.average`. Selecting only `"Song"` and `"Duration"` first keeps the grouped table to two columns, so the aggregated column `"Duration average"` sits at index 1. We can then get it with `.column(1)` or, equivalently, `.column("Duration average")` (both return an array).


## 2.0 Functions

### 2.1 Function Syntax
```python
def <function name>(<zero or more parameters>):
    <body>
    return <some value> # optional
```

- Parameter names are variable names that stand in for the arguments (inputs) passed in when the function is called.
- Variables defined inside the function can only be accessed from within the function
- Only functions have `return` statements, but they are not necessary. For example, let's say we want a function `make_hist` that takes in a table and a column which makes a histogram of the table based on the column.

```python
def make_hist(table, column):
    table.hist(column)
    # We did not return anything / We didn't have the need to return anything
```
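
To see what "optional" means in practice, here is a small sketch comparing a function that returns a value with one that only prints. The function names `add_one` and `print_plus_one` are made up for this example.

```python
def add_one(x):
    return x + 1          # hands the value back to the caller

def print_plus_one(x):
    print(x + 1)          # displays the value but has no return statement

kept = add_one(2)         # kept is 3
lost = print_plus_one(2)  # prints 3, but lost is None

print(kept, lost)         # 3 None
```

A function without a `return` statement still gives back a value, `None`, so assigning its result to a name is rarely useful.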

**Q2.1.1** *(FA21 Final Q1)* Define a function `count_elem` that takes two arguments: an array `a` and a value `x`. It should return the number of times that `x` appears as an element of `a`. For example, `count_elem(make_array('cat', 'cat', 'dog'), 'cat')` should return 2.


In [32]:
def count_elem(a, x):
    return np.sum(a == x) # We use a == x to compare the values in the array a to x (shorthand to fit the solution into one line)

count_elem(make_array('cat', 'cat', 'dog'), 'cat')

2
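
The one-line solution works because comparing an array to a value happens element by element. A short sketch of the two steps, using `np.array` as a stand-in for `make_array` so that it runs without the `datascience` module:

```python
import numpy as np

a = np.array(['cat', 'cat', 'dog'])  # stand-in for make_array('cat', 'cat', 'dog')

matches = a == 'cat'      # elementwise comparison -> array of booleans
count = np.sum(matches)   # True counts as 1 and False counts as 0

print(matches)  # [ True  True False]
print(count)    # 2
```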

## 3.0 Iteration & Conditionals

### 3.1 Comparison Operators
Self-explanatory table:
| Comparison | Operator | returns `True` | returns `False` |
| ---------- | -------- | -------------- | -------------- |
| Less than | < | 2 < 3 | 2 < 2 |
| Greater than | > | 3 > 2 | 3 > 3 |
| Less than or equal to | <= | 2 <= 2 | 3 <= 2 |
| Greater than or equal to | >= | 3 >= 3 | 2 >= 3 |
| Equal | == | 3 == 3 | 3 == 2 |
| Not equal | != | 3 != 2 | 2 != 2 |

### 3.2 `for` Loops
Syntax:
```python
for <any_name> in <some_array>:
    <body>
```

For example, if we have an array `basket = make_array("orange", "kiwi", "apple")` and we want to print every item in `basket` with an "!":

In [33]:
basket = make_array("orange", "kiwi", "apple")

for fruit in basket:
    print(fruit + "!")

orange!
kiwi!
apple!


We can also use `np.arange()` to create a sort of makeshift array for the `for` loop to use.

`np.arange()` has "three syntaxes", to put it simply:
1. `np.arange(start, stop, step)`
where
- *start*: the number to start at (inclusive)
- *stop*: the number to stop at (non-inclusive)
- *step*: how many numbers to "step" over

For example, `np.arange(1, 14, 2)` would give `[1, 3, 5, 7, 9, 11, 13]`.

2. `np.arange(start, stop)`
where
- *start*: the number to start at (inclusive)
- *stop*: the number to stop at (non-inclusive)
- *step*: default `1`

For example, `np.arange(2, 10)` would give `[2, 3, 4, 5, 6, 7, 8, 9]`.

3. `np.arange(stop)`
where
- *start*: default `0`
- *stop*: the number to stop at (non-inclusive)
- *step*: default `1`

For example, `np.arange(10)` would give `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
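
The three forms can be checked side by side. This is just a quick sketch; the variable names are arbitrary.

```python
import numpy as np

three_args = np.arange(1, 14, 2)  # start, stop, step
two_args = np.arange(2, 10)       # start, stop (step defaults to 1)
one_arg = np.arange(10)           # stop only (start defaults to 0)

# The stop value is never included, so the lengths are easy to predict.
print(len(three_args), len(two_args), len(one_arg))  # 7 8 10
```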

**Q3.2.1** Let's say we want to create 10 stats to add to the `cool_stats` array:

In [44]:
def make_statistic() -> int:
    """Makes a random choice from the array `np.arange(1, 11)`."""
    return np.random.choice(make_array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

cool_stats = make_array()

for i in np.arange(10): # np.arange(10) -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -> for loop runs 10 times
    stat = make_statistic()
    cool_stats = np.append(cool_stats, stat)

cool_stats


array([ 7.,  9.,  3.,  3.,  9.,  2.,  5.,  9.,  5.,  1.])

### 3.3 Conditionals
Syntax:
```python
if <condition>:
    <body>
elif <condition>: # 0 or more elifs
    <body>
...
else: # 0 or 1 else at the end
    <body>
```

Truth Values
| `True` | `False` |
| ------ | ------- |
| `True` | `False` |
| Any `int` other than 0 | 0 |
| Any non-empty `str` | The empty string `""` |
| Any `float` other than 0.0 | 0.0 |
| Any true comparison (e.g. 4 == 4) | Any false comparison (e.g. 4 == 2) |
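
Python's built-in `bool()` reports the truth value that an `if` statement would use, so the rules can be checked directly. The example values below are arbitrary.

```python
truthy_values = [True, 7, -1, "hello", 0.5, 4 == 4]
falsy_values = [False, 0, "", 0.0, 4 == 2]

for value in truthy_values:
    assert bool(value) is True

for value in falsy_values:
    assert bool(value) is False

# A condition uses the truth value of whatever expression it is given.
label = "non-empty" if "kiwi" else "empty"
print(label)  # non-empty
```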

## 4.0 Chance

### 4.1 Probability Basics: Multiplication, Addition, Complement
Multiplication Rule: When two events must both happen
- P(two events both happen) = P(event one) * P(event two, given that event one happened)
- When you see **AND** -> think multiplication

Addition Rule: When an event can happen in multiple ways that cannot occur at the same time
- P(an event happens) = P(first possibility) + P(second possibility)
- When you see **OR** -> think addition

Complement Rule: The probability that an event doesn't happen is 1 minus the probability that it does
- P(an event doesn't happen) = 1 - P(the event happens)
- Most commonly seen when we want **at least one** of something
- It's often easier to calculate P(none) and subtract it from 1 than to calculate P(at least one) directly
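
Here is a small worked sketch of the three rules using a fair six-sided die. The die scenario is an illustration chosen here, not part of the original review, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

p_six = Fraction(1, 6)

# Multiplication: two rolls that are both sixes (one roll doesn't affect the other).
p_two_sixes = p_six * p_six                      # 1/36

# Addition: one roll that is a five OR a six (these can't both happen on one roll).
p_five_or_six = Fraction(1, 6) + Fraction(1, 6)  # 1/3

# Complement: at least one six in three rolls = 1 - P(no sixes in three rolls).
p_at_least_one_six = 1 - (1 - p_six) ** 3        # 91/216

print(p_two_sixes, p_five_or_six, p_at_least_one_six)
```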

### 4.2 With or Without Replacement
With Replacement: Each time we pick something, we put it back

Without Replacement: Each time we pick something, we do not put it back in / we take it out
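
The difference shows up in the second draw. As a sketch, assume a hypothetical collection of 20 pets containing 10 cats, and compare the chance of drawing two cats in a row under each scheme.

```python
from fractions import Fraction

cats, total = 10, 20  # hypothetical collection: 10 cats among 20 pets

# With replacement: the collection is unchanged for the second draw.
p_two_cats_with = Fraction(cats, total) * Fraction(cats, total)

# Without replacement: one cat is gone, so 9 cats remain among 19 pets.
p_two_cats_without = Fraction(cats, total) * Fraction(cats - 1, total - 1)

print(p_two_cats_with)     # 1/4
print(p_two_cats_without)  # 9/38
```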

**Q4.1.1** Each pet photo at the end of a lab is chosen from a collection of 20 pets with 10 cats, 9 dogs, and 1 bird. For each event below, choose the Python expression that evaluates to the probability of the event.

**a)** When one pet is chosen at random, the probability that it is either a cat or a bird.
- (9 / 20 ) ** 2
- (10 / 20) * (1 / 20)
- **(10 / 20) + (1 / 20)**
- 1 - (9 / 20) ** 2
- 1 - (10 / 20) * (1 / 20)
- 1 - ((10 / 20) + (1 / 20))

(10 / 20) + (1 / 20) is the correct answer because P(cat) = (10 / 20) and P(bird) = (1 / 20). A single pet can't be both a cat and a bird, and we want P(cat *OR* bird) -> (10 / 20) + (1 / 20).

**b)** When two pets are chosen at random with replacement, the probability that the first chases the second. Assume dogs only chase cats, cats only chase birds, and birds don't chase.
- (19 / 20) * (10 / 20)
- **(10 / 20) * (1 / 20) + (9 / 20) * (10 / 20)**
- 1 - ((9 / 20) * (1 / 20) + (10 / 20) * (9 / 20))
- 1 - ((10 / 20) ** 2 + (9 / 20) ** 2 + (1 / 20) ** 2)
- 1 - ((10 / 20) ** 2 + (9 / 20) ** 2 + (1 / 20))

(10 / 20) * (1 / 20) + (9 / 20) * (10 / 20) is the correct answer because 
```math
\begin{align}
P(\text{first chases the second}) &= P((\text{cat AND bird}) \text{ OR } (\text{dog AND cat})) \\
&= P(\text{cat AND bird}) + P(\text{dog AND cat}) \\
&= P(\text{cat}) * P(\text{bird}) + P(\text{dog}) * P(\text{cat}) \\
&= \frac{10}{20} * \frac{1}{20} + \frac{9}{20} * \frac{10}{20}
\end{align}
```
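
As a sanity check, the same probability can be found by listing every equally likely ordered pair of pets and counting the pairs where the first chases the second. This brute-force sketch is an illustration added here; the names `pets` and `chases` are arbitrary.

```python
from fractions import Fraction
from itertools import product

pets = ["cat"] * 10 + ["dog"] * 9 + ["bird"] * 1
chases = {("cat", "bird"), ("dog", "cat")}  # (chaser, chased) pairs from the question

# With replacement, all 20 * 20 ordered pairs are equally likely.
pairs = list(product(pets, repeat=2))
favorable = sum(1 for pair in pairs if pair in chases)

by_enumeration = Fraction(favorable, len(pairs))
by_formula = Fraction(10, 20) * Fraction(1, 20) + Fraction(9, 20) * Fraction(10, 20)

print(by_enumeration, by_formula)  # 1/4 1/4
```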

## 5.0 Sampling

### 5.1 Population vs. Samples
A population describes all individuals that belong to a certain group, while a sample is a subset of the population, typically selected at random.

### 5.2 Probability vs. Empirical Distribution
Probability: the theoretical distribution of a random event
<br />
<img src="./images/probability.png" width="full" height="500" />
<br />

Empirical: the observed distribution of a random event
<br />
<img src="./images/empirical.png" width="full" height="500" />
<br />

So, if a chance experiment is repeated independently and under identical conditions, then in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event. For example, as seen in the image below, as sample size increases, the empirical histogram more and more closely resembles the probability histogram.
<br />
<img src="./images/probability_to_empirical.png" width="full" height="500" />
<br />
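
A quick simulation sketch of this idea: roll a fair die more and more times and watch the proportion of sixes settle near the theoretical 1/6. The helper `proportion_of_sixes` and the roll counts are choices made for this illustration, and it uses Python's built-in `random` module rather than the course tools.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def proportion_of_sixes(num_rolls):
    """Roll a fair die num_rolls times and return the proportion of sixes."""
    rolls = [random.randint(1, 6) for _ in range(num_rolls)]
    return rolls.count(6) / num_rolls

for n in [10, 1_000, 100_000]:
    print(n, proportion_of_sixes(n))
```

With only 10 rolls the proportion can land far from 1/6, while with 100,000 rolls it should be very close.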

### 5.3 Functions to Sample
There are three functions that allow us to "sample" from a "population":
1. `tbl.sample(n, with_replacement)`
- Returns a table with `n` rows sampled from the original table `tbl`
- If `n` and `with_replacement` are not specified, then `n = the number of rows in the table` and `with_replacement = True`.

2. `np.random.choice(array, n, replace)`
- Returns an array with `n` values sampled from the original array `array`
- If `n` and `replace` are not specified, then `n = 1` (a single value returned) and `replace = True`.

3. `sample_proportions(n, model_proportions)`
- Samples `n` objects from the categorical distribution specified by an array `model_proportions`.
- Returns an array of `proportions`.
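
A short sketch of the array-based options is below. It only uses NumPy so that it runs without the `datascience` module; the last lines are an assumed stand-in for `sample_proportions(100, model)` built from `np.random.multinomial`, and the `model` array is made up for the example.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

population = np.arange(1, 11)

with_repl = np.random.choice(population, 5)                    # repeats are possible
without_repl = np.random.choice(population, 5, replace=False)  # no repeats

# Assumed stand-in for sample_proportions(100, model):
# draw category counts for 100 objects, then divide by 100 to get proportions.
model = np.array([0.5, 0.3, 0.2])
proportions = np.random.multinomial(100, model) / 100

print(with_repl, without_repl, proportions)
```

The proportions always add up to 1, and they will usually be close to, but not exactly equal to, the model proportions.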

**Q5.3.1** *(FA22 Final Q3)* There are 6,000 pieces of trash in the park (3,000 bottles, 2,000 boxes, 1,000 pieces of food). Adrian collects 20 pieces at random and observes that `A` of them are bottles. Bala collects 100 pieces at random and observes that `B` of them are bottles.

**i)** Which of the following are more probable than not? Choose all that apply.
- A is larger than B
- **B is larger than A**
- (A / 20) is larger than (B / 100)
- (B / 100) is larger than (A / 20)
- **abs(A / 20 - 0.5) is larger than abs(B / 100 - 0.5)**
- abs(B / 100 - 0.5) is larger than abs(A / 20 - 0.5)
- None of the above

1. **A is larger than B** ❌ *Not necessarily true*  
   - `A` and `B` are both random variables, but `B` is sampled from a larger set (100 pieces vs. 20 pieces).
   - Expected values:  
     - `E[A] = 20 * 0.5 = 10`
     - `E[B] = 100 * 0.5 = 50`
   - Since `B` is expected to be larger than `A`, it is **unlikely** that `A > B`.

2. **B is larger than A** ✅ *More probable than not*  
   - As shown above, `E[B] > E[A]`. In fact, `A` can be at most `20`, while `B` is expected to be around `50`, so `B > A` is by far the more probable outcome.

3. **(A / 20) is larger than (B / 100)** ❌ *Not necessarily true*  
   - `A / 20` and `B / 100` are both estimates of the proportion of bottles.
   - Since both are sampled from the same distribution (`50%` bottles), neither fraction is systematically expected to be larger than the other.

4. **(B / 100) is larger than (A / 20)** ❌ *Not necessarily true*  
   - For the same reasoning as above, neither fraction is more likely to be larger.

5. **abs(A / 20 - 0.5) is larger than abs(B / 100 - 0.5)** ✅ *More probable than not*  
   - `A / 20` and `B / 100` both estimate the true probability (`0.5`).
   - However, since `B` is based on a larger sample size (`100` vs. `20`), `B / 100` will have **less variance**.
   - **Larger sample sizes lead to more stable estimates**, so `A / 20` fluctuates more around `0.5`, making `abs(A/20 - 0.5)` more likely to be larger.

6. **abs(B / 100 - 0.5) is larger than abs(A / 20 - 0.5)** ❌ *Unlikely*  
   - Since `B / 100` has less variance due to the larger sample size, it is less likely to deviate from `0.5` compared to `A / 20`.

7. **None of the above** ❌ *Incorrect*  
   - Since `(B > A)` and `(abs(A / 20 - 0.5) > abs(B / 100 - 0.5))` are both more probable than not, this choice is incorrect.
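
These answers can be checked with a simulation sketch. It assumes both samples are drawn without replacement and uses NumPy's `hypergeometric` generator method to count bottles; the seed and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
trials = 10_000

# Bottles drawn without replacement from 3,000 bottles and 3,000 other pieces.
A = rng.hypergeometric(3000, 3000, 20, size=trials)
B = rng.hypergeometric(3000, 3000, 100, size=trials)

dev_A = np.abs(A / 20 - 0.5)
dev_B = np.abs(B / 100 - 0.5)

print(np.mean(B > A))             # share of trials where B is larger than A
print(np.mean(A / 20 > B / 100))  # share where Adrian's proportion is larger
print(np.mean(dev_A > dev_B))     # share where Adrian's proportion is further from 0.5
```

The first and third shares should come out above one half, while the second should hover near one half, matching the reasoning above.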

**ii)** The pygmy hippo is a small, reclusive (and cute) hippopotamid type that is native to the forests and swamps of West Africa. Two teams of zoologists set out to estimate the proportion that are male by sampling at random from the population. The first team samples 100 hippos and finds the proportion of males in their sample to be `A`. The second team samples 40 hippos and finds the proportion of males in their sample to be `B`. The full population has all 2,500 wild pygmy hippos; the proportion `P` of males in the population is 50% (but unknown to the zoologists).

Which of the following are more likely than not? Select **all** that apply.
- A is smaller than B.
- A is larger than B.
- **P is closer to A than B.**
- P is closer to B than A.
- None of these.

1. **A is smaller than B ❌ (Not necessarily true)**  
   - `A` and `B` are both sample proportions taken randomly from the population.
   - There is no inherent reason why `A` would be systematically smaller than `B`.

2. **A is larger than B ❌ (Not necessarily true)**  
   - Just like the previous option, `A` and `B` are both estimates of `P`, so there is no guarantee that one is consistently larger than the other.

3. **P is closer to A than B ✅ (More likely than not)**  
   - The sample size of `A` is **100**, while the sample size of `B` is **40**.
   - Larger sample sizes tend to produce **estimates closer to the true proportion `P`** due to reduced variance.
   - Since `A` is based on a larger sample, it is expected to be a **more accurate** estimate of `P` than `B`, meaning `P` is more likely to be closer to `A`.

4. **P is closer to B than A ❌ (Unlikely)**  
   - Since `B` is based on a smaller sample size (**40** vs. **100**), it has **higher variance**, meaning it is more likely to deviate from `P` compared to `A`.

5. **None of these ❌ (Incorrect)**  
   - Since **P is more likely to be closer to A than B**, at least one option is correct, so this cannot be the correct answer.




## 6.0 Hypothesis Testing

### 6.1 Structure of Hypothesis Test
- Motivation: We observe some event. We think this event isn't likely to happen with how we think the world works.

- Step 1: Determine Hypothesis
- Step 2: Test Statistic & Observed Test Statistic
- Step 3: Distribution of Test Statistic under Null Hypothesis
- Step 4: Conclusion of Test

For example, let's take the scenario where Rory claims that her coin has a 60% chance of landing on heads and a 40% chance of landing on tails. Marissa flips it 10 times and gets 9 heads; hm... maybe the coin isn't 60/40? We set up the hypothesis test by stating the null and alternative hypotheses.

1. Step 1: Hypothesis
Null Hypothesis: We simulate under this model. We are told exact probabilities, and thus test our hypotheses using those given probabilities. Generally alludes to the idea that *suspicious observations are only due to random chance*. 
- In this example: "The coin being flipped has the following distribution: 60% heads, and 40% tails. Any difference from this is due to random chance."

Alternative Hypothesis: Provides a potential "explanation" for the observed results or other beliefs, contrary to the null hypothesis. We don't need to provide exact probabilities.
- In this example: "The coin being flipped doesn't have the stated distribution (i.e. the coin doesn't have a 60% chance of landing on heads)."

How do we test these hypotheses? We simulate trials.
- A trial recreates the event that you observed
- We then run many, many trials (using the null hypothesis' given probabilities) to understand what is expected if the null hypothesis were true.
    - In this example: Example Trial = Marissa flipped a supposedly 60/40 coin 10 times, and she got 9 heads in total. Thus, each simulated trial will be flipping a 60/40 coin 10 times.

2. Step 2: Test Statistic
- Intuition: We need a way to describe/summarize the result of a trial.
    - Either low values or high values of the statistic (but not both) should favor the alternative, because "extreme" values in that direction are unlikely to happen under the null
    - Many test statistics require us to compare a value that is expected under the null hypothesis to a simulated value.

- Common Test Statistics:
    - *Count* -> ex) # of heads
    - *Difference* between expected & observed -> ex) # of heads - expected # of heads
    - *Absolute difference* between expected & observed -> ex) abs(# of heads - expected # of heads)
    - *Total Variation Distance (TVD)*: comparing the observed distribution of heads and tails to the expected distribution of heads and tails -> ex) (0.6, 0.4)

- Choosing a test statistic
    - Recall our hypotheses:
        - Null Hypothesis: The coin being flipped has the following distribution: 60% heads, 40% tails. Any difference from this is due to random chance.
        - Alternative Hypothesis: The coin being flipped doesn't have the stated distribution (i.e. the coin doesn't have a 60% chance of landing heads).
    - From our alternative, we know that we are trying to determine if the probability of landing on heads is **different** from 60%. So both unusually high and unusually low numbers of heads are evidence for the alternative, and we need a statistic that turns both into large values.

| Test Statistic | Range of Values | Appropriate? |
| -------------- | --------------- | ------------ |
| count (# of heads) | 0...6...10 | No |
| difference (# of heads - expected # of heads) | -6...0...4 | No |
| absolute difference (abs(# of heads - expected # of heads)) | 0...6 | Yes |
| TVD of observed distribution and [0.6, 0.4] distribution | 0...0.6 | Yes |

**Why?** The count and difference have both the high and low values that support the alternative. The absolute difference and TVD have only high values that support the alternative.
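
The ranges in the table can be checked by computing each candidate statistic for every possible number of heads. This is a minimal sketch in plain Python; the helper `tvd_from_null` and the variable names are assumptions made for illustration, not part of the course library.

```python
# Tabulate each candidate statistic for every possible number of heads in 10 flips.
n_flips = 10
expected_heads = 6  # 60% of 10 flips under the null

def tvd_from_null(heads):
    """TVD between the observed (heads, tails) proportions and (0.6, 0.4)."""
    observed = (heads / n_flips, 1 - heads / n_flips)
    expected = (0.6, 0.4)
    return sum(abs(o - e) for o, e in zip(observed, expected)) / 2

counts = list(range(n_flips + 1))
differences = [h - expected_heads for h in counts]
abs_differences = [abs(d) for d in differences]
tvds = [tvd_from_null(h) for h in counts]

# 0 heads and 10 heads are both evidence against a 60/40 coin.
# Only the last two statistics map both of them to large values.
for name, values in [("count", counts), ("difference", differences),
                     ("absolute difference", abs_differences), ("TVD", tvds)]:
    print(name, "| 0 heads:", values[0], "| 6 heads:", values[6], "| 10 heads:", values[10])
```

For the count and the difference, the two suspicious outcomes sit at opposite ends of the range, which is why they are marked "No" above.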

Now that we've decided on our test statistic, let's create a function `calculate_statistic` to simulate flipping a 60/40 coin 10 times and calculate our test statistic on the simulated result:
```python
def calculate_statistic():
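    """Computes 1 trial and returns the test statistic"""
    coin_probs = make_array(0.6, 0.4)
    one_simulation = sample_proportions(10, coin_probs)
    # Convert the simulated proportion of heads to a count, then compare to the expected 6 heads
    return np.abs(one_simulation.item(0) * 10 - 6)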
```

3. Distribution of Test Statistic under Null Hypothesis
- We first want to answer the question, “if the null hypothesis is true, what test statistics will we generally see?”
- How do we do this? *Simulation*

```python
def simulate(num_simulations):
    test_stats = make_array()
    for i in np.arange(num_simulations):
        one_test_stat = calculate_statistic()
        test_stats = np.append(test_stats, one_test_stat)
    return test_stats

simulate(1000) # >> Array([1, 2, 0, 3, 2, 1, 0, 0, 2, 2, 0…]) (1000 elements)
```
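
The same null distribution can also be simulated without a loop. The sketch below is a NumPy-only stand-in, not the course's `sample_proportions` approach: it draws the number of heads directly from a binomial distribution, which has the same distribution as counting heads in 10 flips of a 60/40 coin. The function name `simulate_vectorized` and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def simulate_vectorized(num_simulations, n_flips=10, p_heads=0.6):
    """Return abs(# heads - expected # heads) for many simulated trials at once."""
    # Each entry is the number of heads in one simulated trial of n_flips flips.
    heads = rng.binomial(n_flips, p_heads, size=num_simulations)
    return np.abs(heads - n_flips * p_heads)

sim_stats = simulate_vectorized(1000)
print(len(sim_stats), sim_stats.min(), sim_stats.max())
```

Appending to an array inside a loop is fine for review problems; the vectorized form mainly helps when the number of simulations gets large.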

4. Conclusion of Test
- Is our simulated data more consistent with the null or alternative hypothesis?
- p-value: the chance that a test statistic is equal to the observed test statistic or more extreme (in the direction of the alternative hypothesis) assuming that the null is true
    - It can be calculated by: (# of times test stat is equal to/more extreme) / (total # of events)
- To calculate the p-value: 
```python
test_stats = simulate(1000) # Array([1, 2, 0, 3, 2, 1, 0,...])
observed_ts = abs(9-6) # 3

p_val = np.count_nonzero(test_stats >= observed_ts) / len(test_stats)
p_val # >> 0.102
```
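
Because this example is small, the simulated p-value can be sanity-checked against an exact calculation. The sketch below sums binomial probabilities over every head count whose test statistic is at least as large as the observed one; the helper name `prob_heads` is an assumption made for illustration.

```python
from math import comb

n_flips, p_heads, expected_heads = 10, 0.6, 6
observed_ts = abs(9 - expected_heads)  # 3

def prob_heads(k):
    """Binomial probability of exactly k heads in n_flips flips under the null."""
    return comb(n_flips, k) * p_heads**k * (1 - p_heads)**(n_flips - k)

# Head counts at least 3 away from 6 are 0, 1, 2, 3, 9, and 10.
exact_p_val = sum(prob_heads(k) for k in range(n_flips + 1)
                  if abs(k - expected_heads) >= observed_ts)
print(round(exact_p_val, 4))  # 0.1011
```

A simulated value such as 0.102 will vary a little from run to run, but it should land close to this exact probability.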

So far, we have:
- Been told a probability distribution for the event: the coin is 60/40
- Observed a trial: Marissa flipped a coin 10 times and got 9 heads
- Stated a null hypothesis & alternative hypothesis
- Chosen a test statistic: abs(# heads - expected # of heads)
- Simulated many trials under the probabilities given by the null hypothesis
- Calculated a p-value by comparing our simulated test statistics to our observed test statistic

<br />
<img src="./images/p-value.png" width="full" height="500" />
<br />

So how do we know what is rare/low?
- P-value cutoff: 
    - This is the threshold below which we call a p-value “low”/rare, and therefore call our result statistically significant
    - Typically 0.05 or 0.01, but could vary. (Most of the time given in the question.)

1. If p-value < p-value cutoff:
- Seeing what you observed is unlikely under the null
- SO the data is more consistent with alternative
- THEREFORE we reject the null in favor of the alternative

2. If p-value > p-value cutoff:
- Seeing what you observed is not unlikely under the null
- SO the data is more consistent with null
- THEREFORE we fail to reject the null hypothesis

In our example, our p-value is 0.102 and we chose a p-value cutoff of 5%. Because 0.102 is greater than 0.05, we fail to reject the null hypothesis.
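
The decision rule above can be written as a small helper. This is only a sketch: the function name `conclusion` is hypothetical, and treating a p-value exactly equal to the cutoff as "fail to reject" is an assumption, since the notes only cover the strictly smaller and strictly larger cases.

```python
def conclusion(p_value, cutoff=0.05):
    """Apply the decision rule: reject the null only when p_value is below the cutoff."""
    if p_value < cutoff:
        return "reject the null in favor of the alternative"
    return "fail to reject the null"

print(conclusion(0.102))  # fail to reject the null
print(conclusion(0.01))   # reject the null in favor of the alternative
```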

### 6.2 Test Statistics Guide

| Test Statistic                | Definition | When to Use |
|--------------------------------|------------|-------------|
| **Absolute Difference** (`abs(A - B)`) | The absolute difference between two sample statistics. | When comparing two proportions or means and assessing if one group has a significantly different value from the other. Best when data is symmetrically distributed. |
| **Total Variation Distance (TVD)** | `0.5 * Σ abs(P(x) - Q(x))` for all categories `x`. Equals the largest difference between the probabilities that the two distributions assign to the same event. | When comparing entire probability distributions, especially for categorical data with multiple categories. Useful in hypothesis testing for goodness-of-fit. |
| **Chi-Square Statistic** (`Σ (O - E)^2 / E`) | Compares observed (`O`) and expected (`E`) frequencies in categorical data. | When checking for independence between two categorical variables or whether an observed distribution fits an expected one. |
| **t-Statistic** (`(X̄ - μ) / (s / √n)`) | Measures how far the sample mean (`X̄`) is from the hypothesized population mean (`μ`), in units of the estimated standard error (`s / √n`, where `s` is the sample standard deviation). | When testing differences in means for normally distributed data, such as in **t-tests** for small sample sizes (n < 30). |
| **z-Statistic** (`(X̄ - μ) / (σ / √n)`) | Similar to the t-statistic but used when population standard deviation (`σ`) is known. | When testing means with **large samples** (n ≥ 30) or when the population standard deviation is known. |
| **Kolmogorov-Smirnov (KS) Statistic** | The maximum difference between two cumulative distribution functions (CDFs). | When comparing two entire distributions, especially when data is not normally distributed. Useful for empirical vs. theoretical distributions. |
| **Permutation Test Statistic** | Computes a test statistic (e.g., difference in means) repeatedly under shuffled labels to estimate a p-value. | When running non-parametric hypothesis tests, particularly in randomized experiments where normality assumptions do not hold. |
| **Rank-Sum Statistic (Mann-Whitney U)** | Based on the ranks of data rather than raw values, it assesses whether one sample tends to have larger values than another. | When comparing two independent samples that are **not normally distributed** (non-parametric alternative to t-tests). |

**Choosing the Right Test Statistic**
- **Comparing Two Groups?**
  - Difference in proportions → **Absolute Difference** or **TVD**
  - Difference in means → **t-Statistic** (small sample) or **z-Statistic** (large sample)
  - Difference in distributions → **KS Statistic** or **Permutation Test**
- **Categorical Data?**
  - Frequency comparison → **Chi-Square**
  - Probability distribution comparison → **TVD**
- **Data is not normally distributed?**
  - Non-parametric alternatives → **Mann-Whitney U**, **Permutation Test**
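
As a concrete instance of the TVD formula from the table, here is a minimal sketch for a three-category variable. The function name `total_variation_distance` and both sets of proportions are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two categorical distributions."""
    dist_a, dist_b = np.asarray(dist_a), np.asarray(dist_b)
    return np.sum(np.abs(dist_a - dist_b)) / 2

expected = np.array([0.5, 0.3, 0.2])    # hypothetical expected proportions
observed = np.array([0.4, 0.35, 0.25])  # hypothetical observed proportions

# Absolute differences are 0.1, 0.05, and 0.05; half their sum is about 0.1.
print(total_variation_distance(expected, observed))
```

Both inputs should list the categories in the same order and each should sum to 1; the function does not check this.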

### 6.3 A/B Testing: Setup and Example
A/B testing is a statistical method used to compare two versions of a variable (such as a webpage, product, or marketing campaign) to determine which performs better. It is widely used in business, marketing, UX design, and data science to optimize decisions based on empirical evidence.

**How to Set Up an A/B Test**
1. **Define the Goal**: Determine what metric you want to improve (e.g., click-through rate, conversion rate, engagement time).
2. **Create Variants**:
   - **Control Group (A)**: The existing version (also called the "baseline").
   - **Treatment Group (B)**: The new version with changes (e.g., different UI, new call-to-action).
3. **Randomly Assign Users**: Ensure users are randomly split into two groups to minimize bias.
4. **Run the Experiment**: Show each group their respective versions and collect data.
5. **Measure Key Metrics**: Compare the performance of A vs. B based on the chosen metric.
6. **Statistical Analysis**: Use hypothesis testing (e.g., t-test, chi-square test) to determine if the difference is statistically significant.
7. **Make a Decision**: If version B significantly outperforms A, implement the change; otherwise, keep the existing version.

**Example: A/B Test for Website Conversion Rate**

**Scenario**

A company wants to increase the number of users who sign up for their newsletter. They decide to test two different versions of their sign-up button:

- **Version A (Control)**: The button says "Sign Up."
- **Version B (Treatment)**: The button says "Get Your Free Gift!"

**Experiment Setup**
- **Metric**: Conversion rate (percentage of visitors who sign up).
- **Sample Size**: 10,000 users randomly split into two groups.
  - 5,000 see Version A.
  - 5,000 see Version B.
- **Data Collected**:
  - Version A: 200 users sign up (4% conversion rate).
  - Version B: 300 users sign up (6% conversion rate).

**Analysis**
- **Absolute Difference**: `|6% - 4%| = 2%` (2 percentage points)
- **Hypothesis Testing**:
  - **Null Hypothesis (H₀)**: There is no difference between A and B.
  - **Alternative Hypothesis (H₁)**: B has a higher conversion rate than A.
  - **Statistical Test**: A **z-test** for proportions is performed.
  - **p-value**: If `p < 0.05`, we reject H₀ and conclude B is significantly better.

**Conclusion**
- If the p-value is **small (e.g., 0.01)** → We reject H₀ and choose Version B.
- If the p-value is **large (e.g., 0.2)** → We do not have enough evidence to prefer B over A.
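
For the numbers in this example, the z-test for two proportions can be sketched with the standard library alone. This uses the usual pooled-proportion form of the test and a one-sided p-value to match the alternative that B is higher; the variable names are assumptions made for illustration.

```python
from math import sqrt, erfc

signups_a, visitors_a = 200, 5000  # Version A (control)
signups_b, visitors_b = 300, 5000  # Version B (treatment)

rate_a = signups_a / visitors_a
rate_b = signups_b / visitors_b

# Under the null, both versions share one conversion rate, estimated by pooling.
pooled = (signups_a + signups_b) / (visitors_a + visitors_b)
std_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z = (rate_b - rate_a) / std_error
# One-sided p-value: chance of a z at least this large under a standard normal.
p_value = 0.5 * erfc(z / sqrt(2))

print(round(z, 2))     # 4.59
print(p_value < 0.05)  # True
```

With a z-statistic this large, the p-value falls far below 0.05, so for this data we would reject H₀ in favor of Version B.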

## 7.0 Data Visualization

### 7.1 Scatter Plots
- Relationship between **two numerical variables**.
- Each point represents an observation with values for both variables.

When to Use:
- To determine if two variables are **correlated**.
- To identify **trends** or **clusters** in data.
- Example: **Does study time affect exam scores?**
  - x-axis: Hours studied
  - y-axis: Exam score

### 7.2 Line Plots
- A continuous trend over time or an ordered sequence.

When to Use:
- When analyzing **time series data** or trends over an ordered variable.
- Example: **Stock market trends over time**
  - x-axis: Time (days, months, years)
  - y-axis: Stock price

### 7.3 Bar Charts
- Comparisons between **categories**.
- Bars represent counts, percentages, or other aggregate measures.

When to Use:
- When dealing with **categorical variables**.
- To compare **frequencies or proportions** of different groups.
- Example: **Favorite ice cream flavors among students**
  - x-axis: Ice cream flavors (Chocolate, Vanilla, Strawberry)
  - y-axis: Number of students who prefer each flavor


### 7.4 Histograms
- **Distribution of a single numerical variable**.
- Groups values into **bins** and shows how many data points fall into each bin.
  
**Key Components**
- **x-axis**: The numerical variable being measured.
  - Example: Monthly rent (in dollars).
- **y-axis**: Density, measured in **% per unit of x-axis**.
  - Example: % of apartments per $100 rent interval.
- **Area of a bin**: Represents the **percentage of data points in that bin**.
  - width * height = % of data points in that bin.

**Histogram Caveats**
1. **Area represents percentage, NOT number of individuals**
   - Unlike bar charts where height represents counts, in histograms, area matters.
2. **Bins are inclusive of the lower bound, exclusive of the upper bound**
   - Example: Bin **[10, 15)** includes **10, 11, 12, 13, 14** but **NOT 15**.
   - **Exception**: The last bin includes both bounds.
3. **We don't know the distribution inside a bin**
   - Example: If a bin is **[10, 15)**, all values could be clustered at 10, spread evenly, or mostly near 14.999.
4. **Bin widths do not have to be the same**
   - If bin widths vary, the heights adjust to maintain correct density.
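
The caveats above can be checked numerically. The sketch below uses `np.histogram` as a stand-in for the binning behind a histogram plot (an assumption; it is not the course's `hist` method) with hypothetical data and unequal bins. Like the rule described above, `np.histogram` treats every bin as half-open except the last, which includes both bounds.

```python
import numpy as np

values = np.array([2, 5, 10, 12, 14, 15, 20, 30])  # hypothetical data
bins = np.array([0, 10, 15, 30])                   # unequal widths: 10, 5, 15

counts, _ = np.histogram(values, bins=bins)
widths = np.diff(bins)
percents = 100 * counts / counts.sum()  # % of data points in each bin
heights = percents / widths             # density: % per unit of the x-axis
areas = heights * widths                # width * height recovers the percents

print(counts)   # [2 3 3]
print(heights)  # [2.5 7.5 2.5]
```

The value 10 lands in [10, 15) rather than [0, 10), and 30 is counted because the last bin is closed. The middle bin is the tallest even though it holds the same number of points as the last bin, because it is a third as wide.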

**When to Use Histograms**
- **When analyzing distributions** of numerical data.
- **To detect skewness**, **outliers**, or **modal behavior**.
- Example: **What is the distribution of rental prices in a city?**
  - x-axis: Rent ($)
  - y-axis: Density (% per $100)

Examples:

<br />
<img src="./images/hist_questions.png" width="full" height="500" />
<br />

<br />
<img src="./images/hist_q_1.png" width="full" height="500" />
<br />

<br />
<img src="./images/hist_q_2.png" width="full" height="500" />
<br />

<br />
<img src="./images/hist_q_3.png" width="full" height="500" />
<br />

---

### **Summary**
| Visualization Type | Used For | Example |
|-------------------|---------|---------|
| **Scatter Plot** | Relationship between **two numerical variables** | Study time vs. Exam score |
| **Line Plot** | Trends in **time-series data** | Stock prices over time |
| **Bar Chart** | Comparisons of **categorical variables** | Favorite ice cream flavors |
| **Histogram** | **Distribution of one numerical variable** | Rent prices in a city |
