<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
	Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
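
Before applying `pd.cut()` to the full dataset, the age boundaries from step 1 can be checked on a few hand-picked values. The following is a minimal sketch, assuming right-closed bins (`right=True`, the pandas default); the sample ages are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical sample ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 12, 12.5, 19, 20, 59, 60, None])

# With right=True the intervals are (0, 12], (12, 19], (19, 59], (59, inf)
groups = pd.cut(
    ages,
    bins=[0, 12, 19, 59, float("inf")],
    labels=["Child", "Teen", "Adult", "Senior"],
    right=True,
)

# Show each age next to the label it received
print(pd.DataFrame({"age": ages, "group": groups}))
```

Two things to notice under these assumptions: a fractional age such as 12.5 falls into Teen rather than Child, and a missing age stays missing instead of being assigned a group.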


In [None]:
import pandas as pd

def survival_demographics(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze Titanic survival patterns by class, sex, and age group.
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Step 1: Add AgeGroup column
    # With right=True the bins are (0, 12], (12, 19], (19, 59], (59, inf)
    age_bins = [0, 12, 19, 59, float("inf")]
    age_labels = ["Child", "Teen", "Adult", "Senior"]

    df["AgeGroup"] = pd.cut(
        df["Age"], bins=age_bins, labels=age_labels, right=True
    )

    # Steps 2 & 3: Group by class, sex, and age group
    # observed=False states explicitly that empty age categories are kept
    grouped = df.groupby(["Pclass", "Sex", "AgeGroup"], observed=False).agg(
        n_passengers=("Survived", "count"),
        n_survivors=("Survived", "sum")
    )
    # Groups with no passengers get NaN here (0 / 0)
    grouped["survival_rate"] = grouped["n_survivors"] / grouped["n_passengers"]

    # Step 4: Reindex so every class/sex/age-group combination is present
    all_combinations = pd.MultiIndex.from_product(
        [df["Pclass"].unique(),
         df["Sex"].unique(),
         df["AgeGroup"].cat.categories],
        names=["Pclass", "Sex", "AgeGroup"]
    )
    grouped = grouped.reindex(all_combinations, fill_value=0).reset_index()

    # Step 5: Sort results by class, then sex, then age group
    grouped = grouped.sort_values(by=["Pclass", "Sex", "AgeGroup"]).reset_index(drop=True)

    return grouped

# Read and display results
df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')
results = survival_demographics(df)
print(results.head(12))

    Pclass     Sex AgeGroup  n_passengers  n_survivors  survival_rate
0        1  female    Adult            68           66       0.970588
1        1  female    Child             1            0       0.000000
2        1  female   Senior             3            3       1.000000
3        1  female     Teen            13           13       1.000000
4        1    male    Adult            80           34       0.425000
5        1    male    Child             3            3       1.000000
6        1    male   Senior            14            2       0.142857
7        1    male     Teen             4            1       0.250000
8        2  female    Adult            58           52       0.896552
9        2  female    Child             8            8       1.000000
10       2  female   Senior             0            0            NaN
11       2  female     Teen             8            8       1.000000
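
In the output above, the age groups within each class and sex appear in alphabetical order (Adult, Child, Senior, Teen) rather than age order, which suggests that `AgeGroup` is a plain string column by the time it is sorted. One possible way to restore age order is to convert the column back to an ordered categorical before sorting. This is a sketch on a toy table with illustrative rows only, not a rerun of the analysis.

```python
import pandas as pd

age_labels = ["Child", "Teen", "Adult", "Senior"]

# Toy stand-in for the results table (rows are illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1],
    "Sex": ["female"] * 4,
    "AgeGroup": ["Adult", "Child", "Senior", "Teen"],
})

# An ordered categorical sorts by category position, not alphabetically
toy["AgeGroup"] = pd.Categorical(
    toy["AgeGroup"], categories=age_labels, ordered=True
)
toy_sorted = toy.sort_values(["Pclass", "Sex", "AgeGroup"]).reset_index(drop=True)

print(toy_sorted["AgeGroup"].tolist())
```

Filters such as `results["AgeGroup"] == "Child"` should still work on a categorical column, so this change would not require edits to the plotting code.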




In [None]:
import pandas as pd
import plotly.express as px
import streamlit as st

# Step 6: Question
st.write("Did children in third class have lower survival rates compared to children in higher classes?")

# Step 7: Fig 1
def visualize_demographic(results: pd.DataFrame):
    """
    Creates a Plotly visualization to compare survival rates of children
    across different passenger classes
    """
    
    children = results[results["AgeGroup"] == "Child"]

    # Plot survival rates by class and sex
    fig1 = px.bar(
        children,
        x="Pclass",
        y="survival_rate",
        color="Sex",
        barmode="group",
        labels={"survival_rate": "Survival Rate", "Pclass": "Passenger Class"},
        title="Survival Rates of Children Across Classes by Sex"
    )
    return fig1

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')
results = survival_demographics(df)

# Display Fig 1
fig1 = visualize_demographic(results)
st.plotly_chart(fig1)


## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.

In [None]:

import pandas as pd

def family_groups(df: pd.DataFrame) -> pd.DataFrame:
    """
    Explore the relationship between family size, passenger class, and ticket fare.
    """
    # Step 1: Create family_size column (siblings/spouses + parents/children + self)
    # assign() returns a new DataFrame, so the caller's data is not modified
    df = df.assign(family_size=df["SibSp"] + df["Parch"] + 1)

    # Step 2: Group and aggregate
    # Note: "count" counts the non-missing Fare values in each group
    grouped = df.groupby(["Pclass", "family_size"]).agg(
        n_passengers=("Fare", "count"),
        avg_fare=("Fare", "mean"),
        min_fare=("Fare", "min"),
        max_fare=("Fare", "max")
    ).reset_index()

    # Step 3: Sort by class, then family size
    grouped = grouped.sort_values(by=["Pclass", "family_size"]).reset_index(drop=True)

    return grouped


def last_names(df: pd.DataFrame) -> pd.Series:
    """
    Extract last names from the Titanic dataset and count their occurrences.
    """
    # Names look like "Last, Title. First"; keep the part before the comma
    last_name = df["Name"].str.split(",").str[0].str.strip().rename("LastName")

    # Count occurrences without adding a column to the caller's DataFrame
    return last_name.value_counts()


df = pd.read_csv("https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv")

# Family Groups
family_results = family_groups(df)
print(family_results.head(10))

# Last Names
surname_counts = last_names(df)
print(surname_counts.head(10))


   Pclass  family_size  n_passengers    avg_fare  min_fare  max_fare
0       1            1           109   63.672514    0.0000  512.3292
1       1            2            70   91.848039   29.7000  512.3292
2       1            3            24   95.681075   26.2833  211.5000
3       1            4             7  133.521429  120.0000  151.5500
4       1            5             2  262.375000  262.3750  262.3750
5       1            6             4  263.000000  263.0000  263.0000
6       2            1           104   14.066106    0.0000   73.5000
7       2            2            34   24.682962   11.5000   33.0000
8       2            3            31   31.693819   13.0000   73.5000
9       2            4            13   36.575969   11.5000   65.0000
LastName
Andersson    9
Sage         7
Skoog        6
Panula       6
Carter       6
Goodwin      6
Johnson      6
Rice         5
Fortune      4
Williams     4
Name: count, dtype: int64
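
Step 4 asks whether the last-name counts agree with the family-size table. One way to check is to compare, for each last name, how many passengers share it with the largest `family_size` recorded under that name. The sketch below uses a small made-up table with the same column names as the Titanic data; `surname_count`, `max_family_size`, and `mismatch` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up passengers: three Smiths, but only two of them travel together
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Jane", "Smith, Mr. Adam", "Lee, Miss. Ann"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
})
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1
toy["LastName"] = toy["Name"].str.split(",").str[0].str.strip()

# Per last name: how many passengers share it vs. the largest reported family
comparison = toy.groupby("LastName").agg(
    surname_count=("Name", "count"),
    max_family_size=("family_size", "max"),
)
comparison["mismatch"] = comparison["surname_count"] != comparison["max_family_size"]

print(comparison)
```

In this toy case the Smith row is flagged, because a shared last name does not guarantee a shared family. A mismatch in the other direction would also be plausible, for example when relatives have different last names or when some family members are not in the dataset.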


## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are not.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (True if the age is above the median, False if less than or equal to).
- Return the updated table in the function `determine_age_division()`

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.
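
The per-class median comparison can be written with a grouped `transform`, which returns one median per row aligned with the original index. This is a minimal sketch, assuming the columns are named `Pclass` and `Age` as in the dataset above; the toy rows are invented, and treating passengers with a missing age as `False` is an assumption rather than a stated requirement.

```python
import pandas as pd

def determine_age_division(df: pd.DataFrame) -> pd.DataFrame:
    """Add a Boolean older_passenger column based on the class median age."""
    out = df.copy()  # leave the caller's DataFrame untouched

    # Median age of each passenger's own class, repeated on every row
    class_median = out.groupby("Pclass")["Age"].transform("median")

    # Strictly above the median -> True; equal, below, or missing age -> False
    out["older_passenger"] = out["Age"] > class_median
    return out

# Toy data: class 1 median is 40, class 2 median is 20
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2],
    "Age": [20, 40, 60, 10, 30],
})
result = determine_age_division(toy)
print(result)
```

A `visualize_age_division()` function could then group on `older_passenger` alongside `Pclass` in the same way the earlier exercises group on age group or family size.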