<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
	Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
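The age boundaries in step 1 can be checked on a handful of made-up ages before they are applied to the full dataset. The sketch below assumes right-inclusive bins (so 12 counts as Child and 60 as Senior); the `ages` values are hypothetical and not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the category boundaries
ages = pd.Series([0.42, 12, 13, 19, 20, 59, 60, None])

# right=True makes each bin include its upper edge:
# (0, 12], (12, 19], (19, 59], (59, inf)
bins = [0, 12, 19, 59, float("inf")]
labels = ["Child", "Teen", "Adult", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels, right=True)

# Missing ages stay missing rather than being forced into a category
print(age_group.tolist())
```

Under these bins a fractional age such as 12.5 lands in Teen, so it is worth deciding whether that matches your reading of "up to 12".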


In [23]:
# importing pandas
import pandas as pd

import plotly.express as px

# loading the titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

In [24]:
df.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [25]:
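def survival_demographics():
    # Age bins are right-inclusive: (0, 12], (12, 19], (19, 59], (59, inf)
    bins = [0, 12, 19, 59, float('inf')]
    labels = ['Child', 'Teen', 'Adult', 'Senior']

    # pd.cut returns an ordered categorical; passengers with a missing age stay NaN
    df["AgeCategory"] = pd.cut(
        df["Age"],
        bins=bins,
        labels=labels,
        right=True
    )

    # observed=False keeps every class/sex/age-group combination, including
    # empty ones, and states the choice explicitly to avoid the pandas FutureWarning
    grouped = (
        df.groupby(["Pclass", "Sex", "AgeCategory"], observed=False)
        .agg(
            n_passengers = ("Survived", "size"),
            n_survivors = ("Survived", "sum"),
            survival_rate = ("Survived", "mean")
        )
        .reset_index()
    )

    # Make sure age groups sort from Child to Senior rather than alphabetically
    age_order = pd.CategoricalDtype(categories=labels, ordered=True)
    grouped["AgeCategory"] = grouped["AgeCategory"].astype(age_order)

    # Order rows by class, then sex, then age group
    grouped = grouped.sort_values(by=["Pclass", "Sex", "AgeCategory"]).reset_index(drop=True)

    return grouped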

In [26]:
survival_demographics()


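|  | Pclass | Sex | AgeCategory | n_passengers | n_survivors | survival_rate |
|---|---|---|---|---|---|---|
| 0 | 1 | female | Child | 1 | 0 | 0.0 |
| 1 | 1 | female | Teen | 13 | 13 | 1.0 |
| 2 | 1 | female | Adult | 68 | 66 | 0.970588 |
| 3 | 1 | female | Senior | 3 | 3 | 1.0 |
| 4 | 1 | male | Child | 3 | 3 | 1.0 |
| 5 | 1 | male | Teen | 4 | 1 | 0.25 |
| 6 | 1 | male | Adult | 80 | 34 | 0.425 |
| 7 | 1 | male | Senior | 14 | 2 | 0.142857 |
| 8 | 2 | female | Child | 8 | 8 | 1.0 |
| 9 | 2 | female | Teen | 8 | 8 | 1.0 |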
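Step 4 of the exercise asks for *all* combinations of class, sex, and age group, including ones with no passengers. One way to get that, sketched here on a small made-up frame, is to group on a categorical column with `observed=False`; pandas then expands the result to the full cross product of the grouping keys. The `toy` data and the `keys` and `spec` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "Pclass": [1, 1, 2],
    "Sex": ["female", "male", "female"],
    "Survived": [1, 0, 1],
})
labels = ["Child", "Teen", "Adult", "Senior"]
toy["AgeCategory"] = pd.Categorical(
    ["Child", "Adult", "Child"], categories=labels, ordered=True
)

keys = ["Pclass", "Sex", "AgeCategory"]
spec = dict(
    n_passengers=("Survived", "size"),
    n_survivors=("Survived", "sum"),
    survival_rate=("Survived", "mean"),
)

# observed=False: one row per class x sex x age-group combination
all_combos = toy.groupby(keys, observed=False).agg(**spec).reset_index()

# observed=True: only the combinations that actually occur in the data
seen_only = toy.groupby(keys, observed=True).agg(**spec).reset_index()

print(len(all_combos), len(seen_only))
```

Empty combinations have no survival rate to report, so their `survival_rate` is missing rather than zero; a chart built from this table may need to drop or label those rows.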


In [27]:
def visualize_demographic():
    # Compute survival rate by passenger class
    grouped = (
        df.groupby("Pclass")["Survived"]
        .agg(["mean", "count"])
        .reset_index()
        .rename(columns={"mean": "survival_rate", "count": "n_passengers"})
    )

    # Create bar chart
    fig = px.bar(
        grouped,
        x="Pclass",
        y="survival_rate",
        text="survival_rate",
        color="Pclass",
        color_continuous_scale="Blues",
        title="Survival Rate by Passenger Class",
        labels={"Pclass": "Passenger Class", "survival_rate": "Survival Rate"}
    )

    # Show the survival rate as a percentage on the bar labels and the y-axis
    fig.update_traces(
        texttemplate="%{y:.1%}",
        textposition="outside"
    )
    fig.update_yaxes(tickformat=".0%", title="Survival Rate")

    return fig

In [28]:
visualize_demographic()

## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.

In [37]:
def family_groups():
    # Create a new column for family size
    df["family_size"] = df["SibSp"] + df["Parch"] + 1  # +1 to include the passenger themselves

    # Summarize passenger counts and fares by class and family size
    grouped = (
        df.groupby(["Pclass", "family_size"])
        .agg(
            n_passengers = ("PassengerId", "size"),
            avg_fare = ("Fare", "mean"),
            min_fare = ("Fare", "min"),
            max_fare = ("Fare", "max")
        )
        .reset_index()
    )

    grouped = grouped.sort_values(["Pclass", "family_size"]).reset_index(drop=True)

    return grouped

In [38]:
family_groups()

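|  | Pclass | family_size | n_passengers | avg_fare | min_fare | max_fare |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 109 | 63.672514 | 0.0 | 512.3292 |
| 1 | 1 | 2 | 70 | 91.848039 | 29.7 | 512.3292 |
| 2 | 1 | 3 | 24 | 95.681075 | 26.2833 | 211.5 |
| 3 | 1 | 4 | 7 | 133.521429 | 120.0 | 151.55 |
| 4 | 1 | 5 | 2 | 262.375 | 262.375 | 262.375 |
| 5 | 1 | 6 | 4 | 263.0 | 263.0 | 263.0 |
| 6 | 2 | 1 | 104 | 14.066106 | 0.0 | 73.5 |
| 7 | 2 | 2 | 34 | 24.682962 | 11.5 | 33.0 |
| 8 | 2 | 3 | 31 | 31.693819 | 13.0 | 73.5 |
| 9 | 2 | 4 | 13 | 36.575969 | 11.5 | 65.0 |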


In [39]:
def last_names():
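    # Names are formatted "Last, Title. First", so the last name is the text before the first comma
    df["LastName"] = df["Name"].str.split(",").str[0].str.strip()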

    last_name_count = df["LastName"].value_counts()

    return last_name_count

In [40]:
last_names()

LastName
Andersson    9
Sage         7
Skoog        6
Panula       6
Carter       6
            ..
Nysveen      1
Young        1
Slayter      1
Danoff       1
Haas         1
Name: count, Length: 667, dtype: int64
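To approach the question of whether last-name counts agree with family sizes, one option is to put both numbers side by side for each passenger. The sketch below does this on a few invented names (hypothetical, not rows from the dataset); `surname_count` and `agrees` are illustrative column names.

```python
import pandas as pd

# Hypothetical passengers: two unrelated Smiths and one married couple
toy = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Miss. Anna",
        "Jones, Mr. Tom",
        "Jones, Mrs. Tom (Mary Brown)",
    ],
    "SibSp": [0, 0, 1, 1],
    "Parch": [0, 0, 0, 0],
})

# Family size as defined in the exercise: relatives aboard plus the passenger
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# Last name is the text before the first comma
toy["LastName"] = toy["Name"].str.split(",").str[0].str.strip()

# Attach to each passenger the number of passengers sharing that last name
toy["surname_count"] = toy["LastName"].map(toy["LastName"].value_counts())

# True where the shared-surname count equals the reported family size
toy["agrees"] = toy["surname_count"] == toy["family_size"]

print(toy[["LastName", "family_size", "surname_count", "agrees"]])
```

In this toy case the two Smiths share a name without being related, so the counts disagree. The same kind of mismatch is one plausible thing to look for in the real data, alongside relatives who travel under different last names.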

In [50]:
def visualize_families():
    # Get the grouped data
    grouped = family_groups()  # calls your existing function

    # Create line chart
    fig = px.line(
        grouped,
        x="family_size",
        y="avg_fare",
        color="Pclass",
        markers=True,  # show points on the line
        hover_data=["n_passengers", "min_fare", "max_fare"],
        title="Average Fare vs Family Size by Passenger Class",
        labels={
            "family_size": "Family Size",
            "avg_fare": "Average Fare",
            "Pclass": "Passenger Class"
        }
    )

    fig.update_layout(
        xaxis=dict(dtick=1),  # show each family size as a tick
        yaxis=dict(title="Average Fare"),  # no currency symbol: the dataset does not state a unit
        legend_title="Passenger Class"
    )

    return fig

In [51]:
visualize_families()

## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are the median age or younger.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (True if the age is above the median, False if less than or equal to).
- Return the updated table in the function `determine_age_division()`
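The requirements above can be sketched with a grouped `transform`, which returns each passenger's class median aligned to the original rows. This is a minimal sketch on made-up data; passing the frame in as a `data` parameter is an assumption made here so the example is self-contained, and is not something the exercise specifies.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age": [20.0, 40.0, 60.0, 10.0, 30.0, None],
})

def determine_age_division(data):
    # Work on a copy so the caller's frame is left unchanged
    out = data.copy()

    # Median age of each passenger's own class, one value per row
    class_median = out.groupby("Pclass")["Age"].transform("median")

    # Strictly above the class median -> True; equal, below, or missing -> False
    out["older_passenger"] = out["Age"] > class_median
    return out

result = determine_age_division(toy)
print(result)
```

A missing age compares as `False` here, so passengers with no recorded age are counted as not older; whether that is acceptable or they should be excluded is a choice to make explicitly.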

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.