<font color='darkred'> Unless otherwise noted, **this notebook will not be reviewed or autograded.**</font> You are welcome to use it for scratchwork, but **only the files listed in the exercises will be checked.**

---

# Exercises

For these exercises, add your functions to the *apputil\.py* file. If you like, you're welcome to adjust the *app\.py* file, but it is not required.

## Notes on Recursion

A [recursive function](https://www.w3schools.com/python/gloss_python_function_recursion.asp) is one which calls itself.

1. When the function is called, your CPU runs through each line of code until the function needs to be called again.
2. At that point, the current variables are saved in memory, and the function runs from its first line again with a different argument, until it calls itself once more, and so on.
3. Eventually, this process stops at the "bottom of the **stack**", where the function returns without calling itself again (typically because the latest argument meets some stopping condition).
4. Then, your CPU will work its way back up the stack to the final result. For example, take a look at [this visual example](https://realpython.com/python-recursion/#calculate-factorial) of calculating 4!.

When you write these functions, keep a few things in mind:

- You will need a built-in stopping point (i.e., the "bottom"), where your function returns a result without calling itself.
- **Don't think too hard about this.** Recursion can be perplexing to conceptualize when writing the code. So, when you call the function inside the function, think about it as a magical "hidden" function that has already done what you want it to do.
- [Python Tutor](https://pythontutor.com/) ([editor](https://pythontutor.com/visualize.html#mode=edit)) can be a helpful resource for this exercise!
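
As a small illustration of the two ingredients (a stopping point and a call to the "hidden" function), here is a minimal sketch of the factorial example linked above. The function name `factorial` is chosen for this illustration only.

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    # Stopping point: the "bottom" of the stack, no further call is made.
    if n <= 1:
        return 1
    # Recursive step: trust that factorial(n - 1) already does its job.
    return n * factorial(n - 1)


print(factorial(4))  # 24
```

Pasting this into Python Tutor shows the stack growing down to `factorial(1)` and then unwinding back up to the final result.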

## Exercise 1

The Fibonacci Series starts with 0 and 1. Each of the following numbers is the sum of the previous two numbers in the series:

`0 1 1 2 3 5 8 13 21 34 ...`

So, `fib(9) = 34`.

Write a recursive function (`fib`) that, given `n`, will return the `n`th number of the Fibonacci Series.

*Test your function using Google or any other tool that can calculate the Fibonacci Series.*

In [18]:
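def fib(n):
    # base cases: fib(0) = 0 and fib(1) = 1
    if n <= 1:
        return n
    # recursive case: the sum of the previous two numbers in the series
    return fib(n-1) + fib(n-2)

# note: plain recursion makes an exponential number of calls,
# so fib(80) does not finish in a reasonable time (this run was interrupted)
print(fib(80))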

KeyboardInterrupt: 
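
Plain recursion recomputes the same values over and over, which is why a call like `fib(80)` may never finish. One common workaround, sketched here, keeps the function recursive but caches results with `functools.lru_cache`. The name `fib_cached` is made up for this illustration and is not part of the exercise.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_cached(n):
    """Recursive Fibonacci that remembers results it has already computed."""
    # Base cases: fib(0) = 0 and fib(1) = 1.
    if n <= 1:
        return n
    # Each distinct n is computed only once; repeats come from the cache.
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(9))   # 34
print(fib_cached(80))  # 23416728348467685
```

With the cache, the number of real computations grows linearly with `n` rather than exponentially.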


## Exercise 2

Write a (single) recursive function, `to_binary()`, that [converts](https://en.wikipedia.org/wiki/Binary_number#Conversion_to_and_from_other_numeral_systems) an integer into its [binary](https://en.wikipedia.org/wiki/Binary_number) representation. So, for example:

```python
to_binary(2)   -->  10
to_binary(12)  -->  1100
```

*Note: you can test your function with the built in `bin()` function.*

In [None]:
def to_binary(n):
    # base case: the binary of 0 or 1 is the number itself
    if n <= 1:
        return n
    # n % 2 is the remainder (0 or 1), i.e. the rightmost binary digit, e.g. 5 % 2 = 1
    # n // 2 is integer division, e.g. 5 // 2 = 2; recursing on it gives the remaining digits,
    # and multiplying by 10 shifts them one place to the left
    return n % 2 + 10 * to_binary(n // 2)

print(to_binary(255))



11111111
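
Following the note about `bin()`, here is one possible way to check the function over a range of inputs. It assumes an integer-returning `to_binary` like the one above (repeated in compact form so the snippet runs on its own). `bin()` returns a string with a `0b` prefix, so the prefix is stripped before comparing.

```python
def to_binary(n):
    """Return the binary digits of a non-negative integer n, as an int."""
    if n <= 1:
        return n
    return n % 2 + 10 * to_binary(n // 2)


# Compare against the built-in for every n from 0 to 255.
mismatches = [n for n in range(256) if to_binary(n) != int(bin(n)[2:])]
print(mismatches)  # []
```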


## Exercise 3 

Use the raw Bellevue Almshouse Dataset (`df_bellevue`) extracted at the top of the lab (i.e., with `pd.read_csv ...`).

**Write a function for each of the following tasks. Name these functions `task_i()`** (i.e., without any input arguments).

1. Return a list of all column names, *sorted* such that the first column has the *least* missing values, and the last column has the *most* missing values (use the raw column names).
   - *Note: there is an issue with the `gender` column you'll need to remedy first ...*
2. Return a **data frame** with two columns:
   - the year (for each year in the data), `year`
   - the total number of entries (immigrant admissions) for each year, `total_admissions`
3. Return a **series** with:
   - Index: gender (for each gender in the data)
   - Values: the average age for the indexed gender.
4. Return a list of the 5 most common professions *in order of prevalence* (so, the most common is first).

For each of these, if there are messy data issues, use `print` to explain them.


In [20]:
import pandas as pd
url = 'https://github.com/melaniewalsh/Intro-Cultural-Analytics/raw/master/book/data/bellevue_almshouse_modified.csv'

df_bellevue = pd.read_csv(url)

df_bellevue.head()

Unnamed: 0,date_in,first_name,last_name,age,disease,profession,gender,children
0,1847-04-17,Mary,Gallagher,28.0,recent emigrant,married,w,Child Alana 10 days
1,1847-04-08,John,Sanin (?),19.0,recent emigrant,laborer,m,Catherine 2 mo
2,1847-04-17,Anthony,Clark,60.0,recent emigrant,laborer,m,Charles Riley afed 10 days
3,1847-04-08,Lawrence,Feeney,32.0,recent emigrant,laborer,m,Child
4,1847-04-13,Henry,Joyce,21.0,recent emigrant,,m,Child 1 mo


In [26]:
print(df_bellevue['gender'].value_counts(dropna=False))

gender
m    4958
w    4621
?       2
g       2
h       1
Name: count, dtype: int64


In [None]:
import numpy as np

#1. Return a list of all column names, *sorted* such that 
# the first column has the *least* missing values, 
# and the last column has the *most* missing values (use the raw column names).

def fix_gender():
    '''returns a copy of the bellevue dataframe with corrected gender markers (any value other than 'm' or 'w' is replaced with NaN)'''
    # first, fix the gender column to set the invalid values to NaN
    df_fixedgender = df_bellevue.copy()
    df_fixedgender['gender'] = df_fixedgender['gender'].where(df_fixedgender['gender'].isin(['m', 'w']), np.nan)
    return df_fixedgender


def task_1():
    '''returns column list sorted by number of missing values ascending'''
    # get a dataframe with the fixed genders from the fix_gender function
    df_fixedgender = fix_gender()

    # return the list of column names sorted by number of missing values
    return df_fixedgender.isnull().sum().sort_values(ascending=True).index.tolist()

# testing task_1()
print(task_1())

['date_in', 'last_name', 'year', 'first_name', 'gender', 'age', 'profession', 'disease', 'children']
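
To see what the `where`/`isin` cleanup and the missing-value sort do, here is a sketch on a small made-up data frame. The frame `toy` and its values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two invalid gender markers and one missing age.
toy = pd.DataFrame({
    'gender': ['m', 'w', '?', 'g', 'w'],
    'age': [30.0, np.nan, 41.0, 25.0, 19.0],
    'name': ['a', 'b', 'c', 'd', 'e'],
})

# Keep 'm' and 'w'; every other marker becomes NaN.
toy['gender'] = toy['gender'].where(toy['gender'].isin(['m', 'w']))

# Count missing values per column, fewest first.
missing = toy.isnull().sum().sort_values()
print(missing.index.tolist())  # ['name', 'age', 'gender']
```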


In [38]:
df_bellevue

Unnamed: 0,date_in,first_name,last_name,age,disease,profession,gender,children
0,1847-04-17,Mary,Gallagher,28.0,recent emigrant,married,w,Child Alana 10 days
1,1847-04-08,John,Sanin (?),19.0,recent emigrant,laborer,m,Catherine 2 mo
2,1847-04-17,Anthony,Clark,60.0,recent emigrant,laborer,m,Charles Riley afed 10 days
3,1847-04-08,Lawrence,Feeney,32.0,recent emigrant,laborer,m,Child
4,1847-04-13,Henry,Joyce,21.0,recent emigrant,,m,Child 1 mo
...,...,...,...,...,...,...,...,...
9579,1847-06-17,Mary,Smith,47.0,,,w,
9580,1847-06-22,Francis,Riley,29.0,lame,superintendent,m,
9581,1847-07-02,Martin,Dunn,4.0,,,m,
9582,1847-07-08,Elizabeth,Post,32.0,,,w,


In [None]:



df_bellevue.groupby('year').size().reset_index(name='total_admissions')


Unnamed: 0,year,total_admissions
0,1846,3073
1,1847,6511



In [41]:
#2. Return a **data frame** with two columns:
#   - the year (for each year in the data), `year`
#   - the total number of entries (immigrant admissions) for each year, `total_admissions`

def task_2():
    '''returns a dataframe with year and total_admissions columns'''
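    # add a new column 'year' extracted from the 'date_in' column
    # note: this modifies the global df_bellevue in place, which is why 'year'
    # shows up in the task_1() output once task_2() has been run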
    df_bellevue['year'] = pd.to_datetime(df_bellevue['date_in'], errors='coerce').dt.year

    # group by year and count the number of entries for each year
    df_yearly = df_bellevue.groupby('year').size().reset_index(name='total_admissions')
    return df_yearly
 

# testing task_2()
print(task_2())




   year  total_admissions
0  1846              3073
1  1847              6511


In [50]:
# 3. Return a **series** with:
#   - Index: gender (for each gender in the data)
#   - Values: the average age for the indexed gender.

def task_3():
    '''returns a series with average age per gender'''

    # first, get the gender-adjusted dataframe from fix_gender() function
    df_fixedgender = fix_gender()
    # first group by, then calculate the mean age for each
    return df_fixedgender.groupby('gender')['age'].mean()


# testing task_3()
print(task_3())
    

gender
m    31.813433
w    28.725162
Name: age, dtype: float64


In [54]:
# 4. Return a list of the 5 most common professions *in order of prevalence* (so, the most common is first).
def task_4():
    '''returns a list of the 5 most common professions in order of prevalence'''
    # group by profession, count the occurrences, sort descending, get the top 5, and return as a list
    return df_bellevue.groupby('profession').size().sort_values(ascending=False).head(5).index.tolist()

# testing task_4()
print(task_4())

['laborer', 'married', 'spinster', 'widow', 'shoemaker']
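
As a possible alternative to `groupby(...).size().sort_values(...)`, `value_counts()` counts and sorts in one step (most common first, missing values skipped by default). Here is a sketch on made-up data. The series `toy_professions` is hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing entry.
toy_professions = pd.Series(
    ['laborer', 'widow', 'laborer', None, 'tailor', 'widow', 'laborer']
)

# Most common values first; head(2) keeps the top two.
top_two = toy_professions.value_counts().head(2).index.tolist()
print(top_two)  # ['laborer', 'widow']
```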


In [52]:
df_bellevue.groupby('profession').size().sort_values(ascending=True).head(5).index.tolist()

['manow(?)', 'moulder', 'miniature painter', 'marketmab', 'mariner']