<font color='darkorange'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file and *app\.py* file as instructed. *These exercises use the same [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) as the lab.*


## Exercise 1: Survival Patterns


For this exercise you will analyze survival patterns on the Titanic by looking at passenger class, sex, and age group. Name the function `survival_demographics()`.

1. Create a new column in the Titanic dataset that classifies passengers into age categories (i.e., a pandas `category` series). The categories should be:
    - Child (up to 12)
    - Teen (13–19)
    - Adult (20–59)
    - Senior (60+)  
  
	Hint: The `pd.cut()` function might come in handy here.

2. Group the passengers by class, sex, and age group.  

3. For each group, calculate:  
    - The total number of passengers, `n_passengers`
    - The number of survivors, `n_survivors`
    - The survival rate, `survival_rate`

4. Return a table that includes the results for *all* combinations of class, sex, and age group.  

5. Order the results so they are easy to interpret.  

6. Come up with a clear question that your results table makes you curious about (e.g., “Did women in first class have a higher survival rate than men in other classes?”). Write this question in your `app.py` file above the call to your visualization function, using `st.write("Your Question Here")`.
   
7. Create a Plotly visualization in a function named `visualize_demographic()` that directly addresses your question by returning a Plotly figure (e.g., `fig = px. ...`). You are free to choose the chart type that you think best communicates the findings. Be creative — try different approaches, compare them, and ensure that your chart clearly answers the question you posed.
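
For step 1, it may help to see how `pd.cut()` treats values on the bin edges. The sketch below uses a handful of made-up ages (not rows from the Titanic data), and the upper edge of 120 is an assumption chosen only to cover any plausible age. With the default `right=True`, each bin includes its right edge, so 12 is a Child and 12.5 already counts as a Teen.

```python
import pandas as pd

# Made-up ages that sit on or near the bin edges (illustrative only).
ages = pd.Series([0.5, 12, 12.5, 13, 19, 20, 59, 60, 80, None])

# With right=True (the default) the bins are (0, 12], (12, 19], (19, 59], (59, 120].
# The upper edge of 120 is an assumption, not part of the exercise text.
age_group = pd.cut(
    ages,
    bins=[0, 12, 19, 59, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
)

print(age_group.dtype)     # pd.cut returns a categorical Series
print(age_group.tolist())  # missing ages stay missing (NaN)
```

Because the result is already categorical, grouping on it with `observed=False` keeps empty categories in the output, which is one way to get *all* combinations in step 4.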


In [2]:
import pandas as pd
import plotly.express as px

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

In [3]:
df.columns


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [15]:
def survival_demographics():
    ### 1. Create a new column called 'age_group' based on the 'Age' column
    ### Child <= 12, Teenager 13-19, Adult 20-59, Senior 60+ (pd.cut bins include their right edge)
    df['age_group'] = pd.cut(df['Age'], bins=[0, 12, 19, 59, 100], labels=['Child', 'Teenager', 'Adult', 'Senior'])

    ### 2. Group by Pclass, Sex, age_group (observed=False keeps empty categories)
    group_cols = ['Pclass', 'Sex', 'age_group']

    ### 3. Calculate total passengers, survived, and survival rate for each group
    total_passengers = df.groupby(group_cols, observed=False).size().unstack(fill_value=0)
    survived = df[df['Survived'] == 1].groupby(group_cols, observed=False).size().unstack(fill_value=0)
    survival_rate = survived / total_passengers  # NaN where a group has no passengers

    ### 4. Create DF with total, survived, and survival rate for each group
    groups_df = pd.concat([total_passengers, survived, survival_rate], axis=1, keys=['Total', 'Survived', 'Survival Rate'])

    # 5. Order the Data
    groups_df = groups_df.sort_index()
    
    # 6. Return the DataFrame
    return groups_df

groups_df = survival_demographics()
print(groups_df)



              Total                       Survived                        \
age_group     Child Teenager Adult Senior    Child Teenager Adult Senior   
Pclass Sex                                                                 
1      female     1       13    68      3        0       13    66      3   
       male       3        4    80     14        3        1    34      2   
2      female     8        8    58      0        8        8    52      0   
       male       9       10    76      4        9        1     4      1   
3      female    23       22    56      1       11       13    22      1   
       male      25       38   186      4        9        3    26      0   

              Survival Rate                                
age_group             Child  Teenager     Adult    Senior  
Pclass Sex                                                 
1      female      0.000000  1.000000  0.970588  1.000000  
       male        1.000000  0.250000  0.425000  0.142857  
2      female  

In [None]:
def survival_demographics2():
    df['age_group'] = pd.cut(
        df['Age'],
        bins=[0, 12, 19, 59, 100],
        labels=['Child', 'Teenager', 'Adult', 'Senior']
    ).astype('category')  # Ensure categorical dtype

    # Create all combinations of Pclass, Sex, age_group
    all_combinations = pd.MultiIndex.from_product(
        [df['Pclass'].unique(), df['Sex'].unique(), df['age_group'].cat.categories],
        names=['Pclass', 'Sex', 'age_group']
    )

    # Group and count total passengers
    total = df.groupby(['Pclass', 'Sex', 'age_group'], observed=False).size().reindex(all_combinations, fill_value=0)

    # Group and count survivors
    survived = df[df['Survived'] == 1].groupby(['Pclass', 'Sex', 'age_group'], observed=False).size().reindex(all_combinations, fill_value=0)

    # Calculate survival rate
    rate = survived / total
    rate = rate.fillna(0)

    # Combine into a DataFrame
    result = pd.DataFrame({
        'Total': total,
        'Survived': survived,
        'Survival Rate': rate
    }).reset_index()

    return result

survival_demographics2()

   Pclass     Sex  age_group  Total  Survived  Survival Rate
0       3    male      Child     25         9       0.360000
1       3    male   Teenager     38         3       0.078947
2       3    male      Adult    186        26       0.139785
3       3    male     Senior      4         0       0.000000
4       3  female      Child     23        11       0.478261
5       3  female   Teenager     22        13       0.590909
6       3  female      Adult     56        22       0.392857
7       3  female     Senior      1         1       1.000000
8       1    male      Child      3         3       1.000000
9       1    male   Teenager      4         1       0.250000


In [None]:
## Save other code:
import plotly.express as px
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

### Exercise 1 ###
def survival_demographics():
    """
    This function takes the Titanic DF and groups by class, sex, and age group.
    It then calculates the total passengers, survived, and survival rate for each group combination.

    Arguments:
        None

    Returns:
        pandas DataFrame with total passengers, survived, and survival rate for each group combination.
    """
    # 1. Create a new column called 'age_group' based on the 'Age' column
    df['age_group'] = pd.cut(df['Age'], bins=[0, 12, 19, 59, 100], labels=['Child', 'Teenager', 'Adult', 'Senior'])

    # 2. Group by Pclass, Sex, age_group (observed=False keeps empty categories)
    group_cols = ['Pclass', 'Sex', 'age_group']

    # 3. Calculate total passengers, survived, and survival rate for each group
    total_passengers = df.groupby(group_cols, observed=False).size().unstack(fill_value=0)
    survived = df[df['Survived'] == 1].groupby(group_cols, observed=False).size().unstack(fill_value=0)
    survival_rate = survived / total_passengers  # NaN where a group has no passengers

    # 4. Create DF with total, survived, and survival rate for each group
    groups_df = pd.concat([total_passengers, survived, survival_rate], axis=1, keys=['Total', 'Survived', 'Survival Rate'])

    # 5. Order the Data
    groups_df = groups_df.sort_index()
    
    # 6. Return the DataFrame
    return groups_df





In [1]:
import plotly.express as px
import pandas as pd

# Load and prepare the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/titanic.csv')

# Globally rename the columns for the autograder
df.rename(columns={'Pclass': 'pclass', 'Sex': 'sex'}, inplace=True)

### Exercise 1 ###
def survival_demographics():
    """
    This function creates an age group column, and then groups the df by class, sex, and age group.
    Then, we calculate the total numbers, total survived, and survival rate for each group.

    Arguments:
        None

    Returns:
        pandas DataFrame: A DataFrame containing the survival demographics.
    """

    # Create age groups
    df['age_group'] = pd.cut(
        df['Age'],
        bins=[0, 12, 19, 59, 100],
        labels=['Child', 'Teenager', 'Adult', 'Senior']
    )

    # Ensure CategoricalDtype 
    df['age_group'] = df['age_group'].astype(pd.CategoricalDtype(
        categories=['Child', 'Teenager', 'Adult', 'Senior'],
        ordered=True
    ))

    # Create all combinations of pclass, sex, and age_group
    # (sorted so the returned table is ordered by class, then sex, then age group)
    all_combinations = pd.MultiIndex.from_product(
        [sorted(df['pclass'].unique()), sorted(df['sex'].unique()), df['age_group'].cat.categories],
        names=['pclass', 'sex', 'age_group']
    )

    # Create the grouped data
    total = df.groupby(['pclass', 'sex', 'age_group'], observed=False).size().reindex(all_combinations, fill_value=0)
    survived = df[df['Survived'] == 1].groupby(['pclass', 'sex', 'age_group'], observed=False).size().reindex(all_combinations, fill_value=0)
    rate = survived / total

    # Combine into a single DataFrame and return
    # Column names follow the exercise text: n_passengers, n_survivors, survival_rate
    result = pd.DataFrame({
        'n_passengers': total,
        'n_survivors': survived,
        'survival_rate': rate.fillna(0)  # empty groups get a rate of 0
    }).reset_index()

    return result



In [9]:
def visualize_demographic():
    df_grouped = survival_demographics()

    # Aggregate across gender
    df_agg = (
        df_grouped
        .groupby(['pclass', 'age_group'], observed=False)
        .agg({'n_passengers': 'sum', 'n_survivors': 'sum'})
        .reset_index()
    )

    # Recalculate survival rate from the aggregated counts
    df_agg['survival_rate'] = df_agg['n_survivors'] / df_agg['n_passengers']

    # Plot: class on x-axis, age group as color
    fig = px.bar(
        df_agg,
        x='pclass',
        y='survival_rate',
        color='age_group',
        barmode='group',
        title='Survival Rate by Class and Age Group (Aggregated Across Gender)',
        labels={
            'pclass': 'Passenger Class',
            'age_group': 'Age Group',
            'survival_rate': 'Survival Rate'
        }
    )

    return fig



visualize_demographic().show()

## Exercise 2: Family Size and Wealth

Using the Titanic dataset, write a function named `family_groups()` to explore the relationship between family size, passenger class, and ticket fare.  

1. Create a new column in the Titanic dataset that represents the total family size for each passenger, `family_size`. Family size is defined as the number of siblings/spouses aboard plus the number of parents/children aboard, plus the passenger themselves.

2. Group the passengers by family size and passenger class. For each group, calculate:  
   - The total number of passengers, `n_passengers`
   - The average ticket fare, `avg_fare`
   - The minimum and maximum ticket fares (to capture variation in wealth), `min_fare` and `max_fare`

3. Return a table with these results, sorted so that the values are clear and easy to interpret (for example, by class and then family size).

4. Write a function called `last_names()` that extracts the last name of each passenger from the `Name` column, and returns the count for each last name (i.e., a pandas series with last name as index, and count as value). Does this result agree with that of the data table above? Share your findings in your app using `st.write`.

5. Just like you did in Exercise 1, come up with a clear question that your results make you curious about. Write this question in your `app.py` file above the call to your visualization function. Then, create a Plotly visualization in a function named `visualize_families()` that directly addresses your question. As in Exercise 1, you are free to choose the chart type that you think best communicates the findings.
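
Steps 1 through 4 above could be sketched roughly as follows. This is only one possible shape, run on a hypothetical four-row table that mimics the Titanic column names; passing the table in as a `data` argument is an assumption made here so the example is self-contained, and your own functions may load the dataset differently.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the Titanic data.
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Anna", "Lee, Miss. Mei", "Khan, Mr. Omar"],
    "Pclass": [1, 1, 3, 3],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [80.0, 80.0, 7.5, 8.5],
})

def family_groups(data):
    # family_size = siblings/spouses + parents/children + the passenger themselves
    with_size = data.assign(family_size=data["SibSp"] + data["Parch"] + 1)
    return (
        with_size.groupby(["Pclass", "family_size"])
        .agg(
            n_passengers=("Fare", "size"),  # row count per group
            avg_fare=("Fare", "mean"),
            min_fare=("Fare", "min"),
            max_fare=("Fare", "max"),
        )
        .reset_index()
        .sort_values(["Pclass", "family_size"])
    )

def last_names(data):
    # Names look like "Last, Title. First", so take the text before the first comma.
    return data["Name"].str.split(",").str[0].str.strip().value_counts()

print(family_groups(toy))
print(last_names(toy))
```

Comparing the two outputs is the point of step 4: passengers who share a last name are not necessarily travelling together, so the last-name counts and `family_size` can disagree.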

## Bonus Question

Add a new column, `older_passenger`, to the Titanic dataset that indicates whether each passenger’s age is above the median age for *their* passenger class. So, suppose row $x$ is in passenger class 2. Then, a value of `True` at row $x$ would indicate that the passenger is older than 50% of class 2 passengers, and `False` would indicate that they are not.

- You should use pandas functions to accomplish this.
- The new column should contain Boolean values (`True` if the age is above the median, `False` if it is less than or equal to the median).
- Return the updated table in the function `determine_age_division()`
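
One way to compare each passenger with the median of their own class is `groupby(...).transform("median")`, which returns a Series aligned with the original rows. The sketch below is an illustration on made-up data, not the required implementation; the `data` argument is an assumption so the example runs on its own.

```python
import pandas as pd

# Made-up rows using the Titanic column names (illustrative only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2],
    "Age": [30.0, 40.0, 50.0, 10.0, 20.0, None],
})

def determine_age_division(data):
    # Median age per class, broadcast back to every row of that class.
    # Missing ages are skipped when the median is computed.
    class_median = data.groupby("Pclass")["Age"].transform("median")
    # Comparing NaN with a number yields False, so passengers with
    # unknown age are marked False here.
    return data.assign(older_passenger=data["Age"] > class_median)

print(determine_age_division(toy))
```

In this sketch a passenger with a missing age ends up as `False`; whether that is the right treatment for your analysis is a choice worth stating in your app.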

Once you’ve created this column, consider how this age division relates to your analysis above. Try to visualize this analysis in Plotly using the function name `visualize_age_division()`.