<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file. If you like, you're welcome to adjust the *app\.py* file, but it is not required.

## Notes on Recursion

A [recursive function](https://www.w3schools.com/python/gloss_python_function_recursion.asp) is one which calls itself.

1. When the function is called, your CPU runs through each line of code until the function needs to be called again.
2. At that point, all variables are saved in memory, and the function runs through each line of code again until the function is called (again, but with a different passed argument), and so on.
3. Eventually, this process will stop at the "bottom of the **stack**", where the function doesn't get a chance to call itself again (likely because of some condition un/met by the latest passed argument).
4. Then, your CPU will work its way back up the stack to the final result. For example, take a look at [this visual example](https://realpython.com/python-recursion/#calculate-factorial) of calculating 4!.
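
The four steps above can be made visible with a small trace. This is only a sketch: the `depth` parameter is a hypothetical addition used to indent the printout and is not part of any exercise.

```python
def factorial(n, depth=0):
    """Return n! recursively, printing each call and return (illustrative only)."""
    indent = "  " * depth
    print(f"{indent}factorial({n}) called")
    if n <= 1:
        # Bottom of the stack: no further recursive call is made.
        print(f"{indent}factorial({n}) returns 1")
        return 1
    # This call pauses here until the deeper call returns.
    result = n * factorial(n - 1, depth + 1)
    print(f"{indent}factorial({n}) returns {result}")
    return result


print(factorial(4))  # the last line printed is 24
```

The indentation grows while the calls descend toward the bottom of the stack and shrinks again as each paused call receives its result.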

When you write these functions, keep the following in mind:

- You will need a built-in stopping point (i.e., the "bottom"), where your function returns some result before it calls itself.
- **Don't think too hard about this.** Recursion can be perplexing to conceptualize when writing the code. So, when you call the function inside the function, think about it as a magical "hidden" function that has already done what you want it to do.
- [Python Tutor](https://pythontutor.com/) ([editor](https://pythontutor.com/visualize.html#mode=edit)) can be a helpful resource for this exercise!
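
As a rough illustration of both points, here is a sketch that sums a list recursively. The function name `sum_list` is made up for this example and is not one of the exercises.

```python
def sum_list(values):
    """Return the sum of a list of numbers using recursion."""
    # Stopping point: an empty list sums to 0, so no further call is needed.
    if not values:
        return 0
    # Trust that sum_list(values[1:]) already returns the right answer
    # for the rest of the list, then add the first element to it.
    return values[0] + sum_list(values[1:])


print(sum_list([3, 1, 4, 1, 5]))  # 14
```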

## Exercise 1

The Fibonacci Series starts with 0 and 1. Each subsequent number is the sum of the previous two numbers in the series:

`0 1 1 2 3 5 8 13 21 34 ...`

So, counting from `fib(0) = 0`, `fib(9) = 34`.

Write a recursive function (`fib`) that, given `n`, will return the `n`th number of the Fibonacci Series.

*Test your function using Google or any other tool that can calculate the Fibonacci Series.*

In [1]:
def fib(n):
    """
    Calculate the nth Fibonacci number using recursion.

    Parameters
    ----------
    n : int
        The position in the Fibonacci sequence (must be non-negative).

    Returns
    -------
    int
        The nth Fibonacci number.
    """
    # Guard against negative input, which would otherwise never reach a base case
    if n < 0:
        raise ValueError("n must be a non-negative integer.")

    # Base case: stop recursion when n is 0 or 1
    if n == 0:
        return 0
    elif n == 1:
        return 1

    # Recursive case
    return fib(n - 1) + fib(n - 2)

# Test cases
print(fib(0))  # Expected 0
print(fib(1))  # Expected 1
print(fib(5))  # Expected 5
print(fib(9))  # Expected 34

0
1
5
34
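
If no external tool is at hand, one possible check is to compare the recursive function against a plain loop. In this sketch `fib_iterative` is a hypothetical reference implementation, and a compact `fib` is repeated only so that the block runs on its own.

```python
def fib(n):
    """Recursive Fibonacci, repeated here so the block is self-contained."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


def fib_iterative(n):
    """Hypothetical reference implementation that uses a loop instead of recursion."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Compare the two implementations over a small range of inputs.
for n in range(15):
    assert fib(n) == fib_iterative(n), f"mismatch at n={n}"
print("fib matches the iterative version for n = 0..14")
```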



## Exercise 2

Write a (single) recursive function, `to_binary()`, that [converts](https://en.wikipedia.org/wiki/Binary_number#Conversion_to_and_from_other_numeral_systems) an integer into its [binary](https://en.wikipedia.org/wiki/Binary_number) representation. So, for example:

```python
to_binary(2)   -->  10
to_binary(12)  -->  1100
```

*Note: you can test your function with the built in `bin()` function.*
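
A short sketch of what that comparison might look like. `bin()` returns a string with a `0b` prefix, so the prefix has to be removed before comparing; `expected_binary` is a hypothetical helper name, and the sketch assumes non-negative input.

```python
def expected_binary(n):
    """Hypothetical helper: binary digits of a non-negative n according to bin()."""
    # bin() prefixes its result with '0b', e.g. bin(12) == '0b1100'.
    return bin(n)[2:]


print(bin(12))              # 0b1100
print(expected_binary(12))  # 1100
print(format(12, "b"))      # 1100, an equivalent built-in spelling
```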

In [6]:
def to_binary(n):
    """
    Convert a non-negative integer to its binary representation as a string.
    
    Parameters
    ----------
    n : int
        A non-negative integer (Must be >= 0).

    Returns
    -------
    str
        The binary representation of the integer as a string.

    Raises
    ------
    ValueError
        If the input is not a non-negative integer.
    """
    # Input validation
    if not isinstance(n, int):
        raise ValueError("Input must be an integer.")
    if n < 0:
        raise ValueError("Input must be a non-negative integer.")

    # Base case: stop recursion when n is 0 or 1
    if n == 0:
        return '0'
    elif n == 1:
        return '1'

    # Recursive case: divide n by 2 and concatenate the remainder
    return to_binary(n // 2) + str(n % 2)

# Test cases
print(to_binary(0))   # Expected '0'
print(to_binary(1))   # Expected '1'
print(to_binary(2))   # Expected '10'
print(to_binary(5))   # Expected '101'
print(to_binary(12))  # Expected '1100'

# Test cases for invalid input
try:
    print(to_binary(-1))  # Expected ValueError
except ValueError as e:
    print(f"Error: {e}") # Expected "Error: Input must be a non-negative integer."

try:
    print(to_binary(3.5))  # Expected ValueError
except ValueError as e:
    print(f"Error: {e}") # Expected "Error: Input must be an integer."

try:
    print(to_binary("10"))  # Expected ValueError
except ValueError as e:
    print(f"Error: {e}") # Expected "Error: Input must be an integer."

  

0
1
10
101
1100
Error: Input must be a non-negative integer.
Error: Input must be an integer.
Error: Input must be an integer.


## Exercise 3 

Use the raw Bellevue Almshouse Dataset (`df_bellevue`) extracted at the top of the lab (i.e., with `pd.read_csv ...`).

**Write a function for each of the following tasks. Name these functions `task_i()`** (i.e., without any input arguments).

1. Return a list of all column names, *sorted* such that the first column has the *fewest* missing values, and the last column has the *most* missing values (use the raw column names).
   - *Note: there is an issue with the `gender` column you'll need to remedy first ...*
2. Return a **data frame** with two columns:
   - the year (for each year in the data), `year`
   - the total number of entries (immigrant admissions) for each year, `total_admissions`
3. Return a **series** with:
   - Index: gender (for each gender in the data)
   - Values: the average age for the indexed gender.
4. Return a list of the 5 most common professions *in order of prevalence* (so, the most common is first).

For each of these, if there are messy data issues, use `print` to explain them.
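
The pandas idioms these tasks tend to rely on can be tried out on a tiny table first. The sketch below uses made-up data with hypothetical column names (`name`, `group`, `score`), not the Bellevue columns, and is not a solution to the tasks.

```python
import pandas as pd

# Toy data with made-up values; None marks a missing entry.
toy = pd.DataFrame({
    "name": ["ann", "bob", "cat", "dan"],
    "group": ["a", "b", None, "a"],
    "score": [1.0, None, None, 4.0],
})

# Task 1 pattern: count missing values per column, then sort ascending.
missing = toy.isna().sum().sort_values()
print(missing.index.tolist())

# Task 2 pattern: count rows per group and return a two-column DataFrame.
counts = toy.groupby("group").size().reset_index(name="n_rows")
print(counts)

# Task 3 pattern: mean of one column per group, returned as a Series.
means = toy.groupby("group")["score"].mean()
print(means)

# Task 4 pattern: most frequent values first.
print(toy["group"].value_counts().head(2).index.tolist())
```

Note that `groupby` and `value_counts` skip missing keys by default, so rows with a missing `group` do not appear in any of the grouped results.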


In [25]:
import pandas as pd

# Load the dataset
url = 'https://github.com/melaniewalsh/Intro-Cultural-Analytics/raw/master/book/data/bellevue_almshouse_modified.csv'
df_bellevue = pd.read_csv(url)

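# Clean up the gender column: normalize whitespace and case, expand abbreviations
df_bellevue['gender'] = df_bellevue['gender'].str.strip().str.lower()
df_bellevue['gender'] = df_bellevue['gender'].replace({
    'm': 'male',
    'w': 'female'
})
# Note: after this fill, 'gender' reports zero missing values in task_1
df_bellevue['gender'] = df_bellevue['gender'].fillna('unknown')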

def task_1():
    """ Task 1: Return a list of column names from the df_bellevue dataset,
    sorted by the number of missing values (least missing first)

    Returns
    -------
    list
        Column names ordered from least missing values to most missing values.
    """

    # Count missing values for each column
    missing_counts = df_bellevue.isnull().sum()

    # Sort columns by missing value counts
    sorted_columns = missing_counts.sort_values().index.tolist()
    return sorted_columns


print(task_1())  # Example usage



['date_in', 'last_name', 'gender', 'first_name', 'age', 'profession', 'disease', 'children']


In [24]:
def task_2():
    """ 
    Task 2: Return a DataFrame with the year and the total number of admissions per year.
    
    Returns
    -------
    pd.DataFrame
        DataFrame with two columns: 'year' and 'total_admissions', sorted by year in ascending order.
    """
    # Extract year from 'date_in' into a local Series (invalid dates become NaN),
    # so the shared df_bellevue is not modified
    years = pd.to_datetime(df_bellevue['date_in'], errors='coerce').dt.year.rename('year')

    # Check for missing years
    missing_years = years.isnull().sum()
    if missing_years > 0:
        print(f"Warning: {missing_years} records have invalid or missing 'date_in' values.")

    # Count total admissions per year (rows with a missing year are excluded)
    admissions_per_year = df_bellevue.groupby(years).size()

    # Convert to DataFrame
    admissions_df = admissions_per_year.reset_index(name='total_admissions').sort_values(by='year')
    return admissions_df

# Example usage
print(task_2().head())

   year  total_admissions
0  1846              3073
1  1847              6511


In [22]:

def task_3():
    """ 
    Task 3: Return a Pandas Series with gender as the index and the
    average age for each gender as the values.
    
    Returns
    -------
    pd.Series
        A Series where:
        - index = gender categories
        - values = average age for each gender
    """
    # Make a copy to avoid modifying the original dataset
    df_copy = df_bellevue.copy()

    # Check if 'age' column exists
    if 'age' not in df_copy.columns:
        print("Error: 'age' column is missing.")
        return None
    
    # Ensure 'age' is numeric
    df_copy['age'] = pd.to_numeric(df_copy['age'], errors='coerce')

    # Normalize gender labels and replace missing values with 'unknown'
    df_copy['gender'] = df_copy['gender'].str.strip().str.lower()
    df_copy['gender'] = df_copy['gender'].replace({
        'm': 'male',
        'w': 'female',
        '?': 'unknown'
    })
    df_copy['gender'] = df_copy['gender'].fillna('unknown')

    # Drop rows with missing 'age'
    df_copy = df_copy.dropna(subset=['age'])
    
    # Filter out invalid gender values
    valid_genders = ['male', 'female', 'unknown']
    df_copy = df_copy[df_copy['gender'].isin(valid_genders)]

    # Check if the DataFrame is empty after cleaning
    if df_copy.empty:
        print("No valid data to calculate average age.")
        return None

    # Group by gender and calculate average age
    average_age = df_copy.groupby('gender')['age'].mean()

    # Ensure all genders appear in the index, even if missing
    average_age = average_age.reindex(valid_genders)

    return average_age

# Test the function
print(task_3())


gender
male       31.813433
female     28.725162
unknown          NaN
Name: age, dtype: float64


In [21]:
import string

def task_4():
    """ 
    Task 4: Return a list of the 5 most common professions
    in order of prevalence (most common first).
    
    Returns
    -------
    list
        A list of the top 5 most frequent professions.
    """

    # Make a copy to avoid modifying the original dataset
    df_copy = df_bellevue.copy()

    # Check if 'profession' column exists
    if 'profession' not in df_copy.columns:
        print("Error: 'profession' column is missing.")
        return []
    
    # Clean profession column: lowercase, strip spaces, remove punctuation
    df_copy['profession'] = df_copy['profession'].str.lower().str.strip()
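    # str.translate avoids building a regex from string.punctuation, which contains regex metacharacters
    df_copy['profession'] = df_copy['profession'].str.translate(str.maketrans('', '', string.punctuation))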

    # Drop missing values
    missing_professions = df_copy['profession'].isnull().sum()
    if missing_professions > 0:
        print(f"Warning: {missing_professions} records have missing 'profession' values.")
    df_copy = df_copy.dropna(subset=['profession'])

    # Count frequencies and get the top 5
    top_professions = df_copy['profession'].value_counts().head(5).index.tolist()

    return top_professions

# Test the function
print(task_4())  # Example usage

 

['laborer', 'married', 'spinster', 'widow', 'shoemaker']
