**Background**:

The demographic makeup of regions can offer crucial insights into various socio-economic factors. For policymakers, understanding age distributions can be particularly useful, as it can provide direction for initiatives ranging from educational policy to elderly care. In this section, we will work with a dataset detailing the age distribution across U.S. counties, broken down into specific age bins.

---

### Part 1: Analyzing Age Distribution Across U.S. Counties (6 points)

**Data Source**: 'county_age_dist.csv'

This dataset provides a breakdown of the age distribution in each county in the U.S., including territories. It's structured with the 'fips' column indicating the Federal Information Processing Standards code for each county. The age data is categorized into specific age bins (e.g., '0-17', '18-24', and so on).

In Part 1, we analyze the age distribution across U.S. counties using the 'county_age_dist.csv' dataset. The tasks span from standardizing data to understanding dominant age groups both at the county and national levels.

Please print the result for each subproblem.

---

__Exercise 1.1 (1 points).__ 

Ensure that the 'fips' column is standardized. Modify this column such that it is always a 5-digit string. If any 'fips' value has only 4 digits, prepend a '0' to it. For instance, Yolo county should be represented as '06113'. Directly call `county_age` to print the dataframe.

Hint: zfill

__Exercise 1.2 (2 points).__ 

Determine the total population of each county:

1. Generate a new column called 'population' that represents the total count of all age bins for each county.
2. For every age bin, create a corresponding column that indicates its proportion relative to the county's entire population. For example, the column '0-17_rat' should depict the ratio of individuals aged '0-17' to the total population of the respective county.
3. Identify the county with the highest proportion of individuals in the '25-34' age bin. What is its FIPS code?
4. Similarly, which county has the smallest proportion of residents aged 85 and above? Note down its FIPS code.

__Exercise 1.3 (1 points).__ 

Compute the overall age distribution for the entire U.S:

1. Calculate the aggregated count for each age bin across all counties.
2. Determine the proportion of each age bin relative to the entire U.S. population.

__Exercise 1.4 (2 points).__ 

Analyze the most common age group for each county:

1. Construct a new column named 'mode_age_bin'. This should capture the age bin with the highest proportion in each county. For example, if the most prevalent age group in a county with FIPS code '01001' is '0-17', the corresponding value in 'mode_age_bin' should be '0-17'.
2. After determining the mode age bin for each county, calculate the proportion of these modes for all U.S. counties. For instance, compute the percentage of counties where '0-17' is the most dominant age group.


<span style="color:#F00">Grading</span>
* 1.1
    * -1 empty / -0.5 error
* 1.2
    * create population column: 1 / create proportion column: 1
    * FIPS code with the largest prop of 25-34: 1 / FIPS code with the smallest prop of 85+: 1
* 1.3
    * -1 empty / -0.5 error
* 1.4
    * create a new column : - 0.5 incorrect/ -1 empty
    * print out the proportion : -0.5 incorrect/ -1 empty

In [14]:
import pandas as pd
import numpy as np

# Load data and standardize 'fips' column
county_age = pd.read_csv('county_age_dist.csv', dtype={'fips':str})
county_age['fips'] = county_age['fips'].str.zfill(5) # Standardizing to 5 digits
county_age = county_age.set_index('fips')
age_cols = list(county_age.columns)

# Problem 1.1
print("# Problem 1.1")
print(county_age)
print("\n")

# Exercise 1.2
# Population for each county
county_age['population'] = county_age[age_cols].sum(axis=1)

# Proportion columns for each age bin
for col in age_cols:
    county_age[col + '_rat'] = county_age[col] / county_age['population']

# County with highest proportion of '25-34'
max_25_34_fips = county_age['25-34_rat'].idxmax()

# County with smallest proportion of 85+
min_85_fips = county_age['85+_rat'].idxmin()

print("# Problem 1.2")
print("1 & 2. Population and Proportion Columns Added to Dataframe")
print(county_age)
print(f"3. County with highest proportion of '25-34': {max_25_34_fips}")
print(f"4. County with smallest proportion of '85+': {min_85_fips}")
print("\n")

# Exercise 1.3
# Aggregated count for each age bin across all counties
total_population = county_age['population'].sum()
us_age_distribution = county_age[age_cols].sum()
us_age_ratios = us_age_distribution / total_population

print("# Problem 1.3")
print("U.S. Age Distribution Counts:\n", us_age_distribution)
print("U.S. Age Distribution Ratios:\n", us_age_ratios)
print("\n")

# Exercise 1.4
# Most common age group for each county
county_age['mode_age_bin'] = county_age[age_cols].idxmax(axis=1)

# Proportion of mode age bins for all counties
mode_counts = county_age['mode_age_bin'].value_counts(normalize=True)

print("# Problem 1.4")
print("Mode Age Bin Column Added to Dataframe")
print(county_age)
print("Mode Age Bins Proportions:\n", mode_counts)



# Problem 1.1
        0-17  18-24  25-34  35-44  45-54  55-64  65-74  75-84   85+
fips                                                               
01001  25941  11422  12315  13828  14000  12697   9594   5430  1945
01003  86587  37568  44133  46730  49675  52405  43252  23262  8854
01005  11057   6162   6603   5907   6490   6377   5255   2795  1074
01007   9671   5241   5788   5472   6707   5563   4270   2555   638
01009  25671  11360  12635  13570  14737  14123  12106   6560  2022
...      ...    ...    ...    ...    ...    ...    ...    ...   ...
72145  27016  14455  14882  14168  15026  14450  11928   6139  3159
72147   4724   2727   2092   2356   2496   2972   2364   1498   479
72149  11353   6431   5521   5319   5788   6228   4631   2218   859
72151  16068   9025   8465   9199   9548   9805   7926   3784  2103
72153  17375   8974   9422   9457  10028  10672   8571   4620  2346

[3220 rows x 9 columns]


# Problem 1.2
1 & 2. Population and Proportion Columns Added to Dataframe
 

### Part 2: Incorporating Additional County Information (4 points)

**Data Source**: 'county_fips_master.csv'

In this section, we will enrich our analysis by integrating additional county-level information from the 'county_fips_master.csv' file. This dataset contains mapping of counties to their respective state and other metadata.

__Exercise 2.1 (2 points).__ 

Merge the 'county_fips_master.csv' data with the previously constructed age distribution dataframe. Ensure that each row corresponds to a unique county based on the information from the 'county_fips_master' file. Are there counties with missing demographic data? If yes, determine the number of such counties.

__Exercise 2.2 (2 points)__

Remove rows with incomplete data. For every state in the cleaned dataset, compute the age distribution proportions similar to the earlier national-level analysis. Identify the state with the highest proportion of individuals in the '18-24' age group. Report the state and the corresponding proportion for this age bin.

<span style="color:#F00">Grading</span>
* 2.1
    * Join the dataset: - 1 incorrect/ -2 empty
    * other questions: -1 incorrect / -2 missing
* 2.2
    * Drop rows: -0.5 incorrect / -1 missing
    * Give the proportions: -1 incorrect/ -2 missing
    * Final result: -1 incorrect/ -2 missing

In [17]:
# Load fips names
county_fips = pd.read_csv('county_fips_master.csv', dtype={'fips':str}, encoding='latin-1')

# reformat fips
county_fips['fips'] = county_fips['fips'].astype('str').str.zfill(5)
county_fips = county_fips.set_index('fips')

# Exercise 2.1: Merging with age distribution dataframe
county_data = county_fips.join(county_age)

# Display a sample of the merged data
print("# Exercise 2.1")
print(county_data.iloc[91:100,:])
print("\n")

# Check for missing data
missing_counties = county_data[county_data.isna().any(axis=1)]
num_missing = missing_counties.shape[0]

print(f"Number of counties with missing demographic data: {num_missing}")
if num_missing > 0:
    print("Missing data for counties:")
    print(missing_counties)
print("\n")

# Exercise 2.2: Computing state-level age distribution proportions
county_data.dropna(inplace=True)

state_data = county_data[['state_abbr'] + age_cols].groupby('state_abbr').sum()
state_data['total'] = state_data[age_cols].sum(axis=1)
state_data = state_data.join(state_data[age_cols].div(state_data['total'],axis=0), rsuffix='_rat')

# Find the state with the highest proportion of 18-24 age group
max_18_24_state = state_data['18-24_rat'].idxmax()
max_18_24_value = state_data['18-24_rat'].max()

print("# Exercise 2.2")
print(f"The state with the highest proportion of '18-24' age group is {max_18_24_state} with a proportion of {max_18_24_value:.3f}.")


# Exercise 2.1
                           county_name state_abbr state_name  \
fips                                                           
02240  Southeast Fairbanks Census Area         AK     Alaska   
02261       Valdez-Cordova Census Area         AK     Alaska   
02270         Wade Hampton Census Area         AK     Alaska   
02275        Wrangell City and Borough         AK     Alaska   
02282         Yakutat City and Borough         AK     Alaska   
02290        Yukon-Koyukuk Census Area         AK     Alaska   
04001                    Apache County         AZ    Arizona   
04003                   Cochise County         AZ    Arizona   
04005                  Coconino County         AZ    Arizona   

                                long_name  sumlev  region  division  state  \
fips                                                                         
02240  Southeast Fairbanks Census Area AK  50.000   4.000     9.000  2.000   
02261       Valdez-Cordova Census Area AK  50.

### Part 3: Analyzing COVID-19 Death Data

**Data Source**: 'time_series_covid19_deaths_US.csv'

In this section, we will work with the COVID-19 death data sourced from Johns Hopkins University, detailing cumulative death counts for each county in the US on different dates.


__Exercise 3.1 (2 points).__ 

We've obtained a dataset that chronicles the COVID-19 death counts in the US. The goal here is to transform this dataset into a 'tidy' format, ensuring clear columns for date, deaths, population, and county FIPS code. 

Hint: Consider using the `melt` function in pandas for this transformation.

__Exercise 3.2 (2 points).__ 

As of February 25th, 2021, which county, inclusive of its state, has witnessed the most significant proportional impact in terms of deaths relative to its population? Use the population metric provided within the COVID-19 dataset.

__Exercise 3.3 (2 points).__ 

Let's delve deeper and ascertain the incidence proportion, which denotes the number of deaths on a daily basis per 100,000 population for each county. This metric can be stored in a new dataset for further analyses.

__Exercise 3.4 (2 points).__ 

Instead of a daily perspective, it would be insightful to view the data from a weekly standpoint. Calculate the incidence proportion per week (aggregating over seven days) for each county. To determine the week bearing the heaviest toll, identify the one with the highest average incidence proportion across all counties.

Hint: Leverage 'Periods' in pandas to work with time intervals effectively.

__Exercise 3.5 (2 points).__ 

For each week in our dataset (approximately 40 weeks), compute the correlation between the senior population's proportion in a county and that week's incidence proportion for the county. Each county serves as an independent sample in this correlation calculation, and the Spearman method should be used. Finally, list down the correlation value for each of these weeks.

<span style="color:#F00">Grading</span>
* All subproblems:
    * - 1 incorrect/ -2 empty

In [34]:
import pandas as pd

# Load the data
covid_deaths = pd.read_csv('time_series_covid19_deaths_US.csv', dtype={'FIPS':str})

# Exercise 3.1: Tidying the dataframe
covid_deaths['FIPS'] = covid_deaths['FIPS'].astype('str').str.zfill(5)  # Standardizing FIPS code

tidy_df = covid_deaths.melt(id_vars=['FIPS', 'Admin2', 'Province_State', 'Population'], 
                            value_vars=covid_deaths.columns[12:], 
                            var_name='Date', 
                            value_name='Deaths')
tidy_df['Date'] = pd.to_datetime(tidy_df['Date'])

print(f"# Exercise 3.1")
print(tidy_df.head())
print("\n")

# Exercise 3.2: Highest death proportion county as of 2/25/21
df_0225 = tidy_df[tidy_df['Date'] == '2021-02-25'].copy()
df_0225['Death_Ratio'] = df_0225['Deaths'] / df_0225['Population']
max_death_county = df_0225.loc[df_0225['Death_Ratio'].idxmax()]

print(f"# Exercise 3.2")
print(f"County with highest death proportion as of 2/25/21: {max_death_county['Province_State']}")
print(df_0225.head())
print("\n")

# Exercise 3.3: Daily incidence proportion
tidy_df['Incidence_Proportion'] = (tidy_df['Deaths'] / tidy_df['Population']) * 100000

print(f"# Exercise 3.3")
print(tidy_df.head())
print("\n")

# Exercise 3.4: Weekly incidence proportion
tidy_df['Week'] = tidy_df['Date'].dt.to_period('W')
weekly_df = tidy_df.groupby(['FIPS', 'Week']).sum().reset_index()
weekly_df['Weekly_Incidence'] = (weekly_df['Deaths'] / weekly_df['Population']) * 100000
max_week = weekly_df.groupby('Week').mean()['Weekly_Incidence'].idxmax()

print(f"# Exercise 3.4")
print(f"Week with highest average incidence proportion: {max_week}")
print(weekly_df.head())
print("\n")

# Exercise 3.5: Spearman correlation for each week
correlations = []
weeks = tidy_df['Week'].unique()

for week in weeks:
    temp_df = tidy_df[tidy_df['Week'] == week]
    correlation = temp_df[['Incidence_Proportion', 'Population']].corr(method='spearman').iloc[0, 1]
    correlations.append((week, correlation))

print(f"# Exercise 3.5")
print("Spearman correlations for each week:")
for week, corr in correlations:
    print(f"Week {week}: {corr:.4f}")



# Exercise 3.1
     FIPS   Admin2 Province_State  Population       Date  Deaths
0  1001.0  Autauga        Alabama       55869 2020-01-22       0
1  1003.0  Baldwin        Alabama      223234 2020-01-22       0
2  1005.0  Barbour        Alabama       24686 2020-01-22       0
3  1007.0     Bibb        Alabama       22394 2020-01-22       0
4  1009.0   Blount        Alabama       57826 2020-01-22       0


# Exercise 3.2
County with highest death proportion as of 2/25/21: Unassigned, Connecticut
           FIPS   Admin2 Province_State  Population       Date  Deaths  \
1336000  1001.0  Autauga        Alabama       55869 2021-02-25      89   
1336001  1003.0  Baldwin        Alabama      223234 2021-02-25     274   
1336002  1005.0  Barbour        Alabama       24686 2021-02-25      51   
1336003  1007.0     Bibb        Alabama       22394 2021-02-25      60   
1336004  1009.0   Blount        Alabama       57826 2021-02-25     125   

         Death_Ratio  
1336000        0.002  
1336001    

### Recursion

Consider the sequcence defined by $a_1 = a_2 = \sqrt{3}$ and for $n>2$, 
$$
a_n = \frac{a_{n-1} + a_{n-2}}{1 - a_{n-1}  a_{n-2}}.
$$
Throughout this exercise you may assume that $F_n$ as $n$-th value of the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_sequence) is explicitly known. 


__Ex 1 a)__ Implement a function `recursive_bad` that takes the positive integer argument `n` and returns $a_n$ by recursively calling itself. For any other argument, `None` shall be returned. Introduce a global variable `count` that counts how often `recursive_bad` is called. This variable shall not be a function argument. 

Return the `[recursive_bad(i+1) for i in range(10)]`. 

__Test cases__
```
> count = 0
> recursive_bad(25)
1.7320508075574117
> count
150049
> recursive_bad('hello')
> recursive_bad(5)
-1.732050807568878
```

__b)__ How often is the function called? Denote $c_n$ as number of function calls for argument `n` and derive an *explicit* expression for $c_n$. 

Return $c_n$, $n\in\{1, \dots, 10\}$. 

*Hint: Relate $c_n$ to $F_n$.*


__c)__ Implement a function `dynamic` that takes the positive integer argument `n` and returns `a_n` by recursively calling itself. For any other argument, `None` shall be returned. The function shall append a list, using the implicit definition of $a_n$ above, until it has `n` entries and return the last one. 

Return the command `[dynamic(i+1) for i in range(10)]`. Using `import timeit` and `timeit.default_timer` ([docs](https://docs.python.org/3/library/timeit.html?#timeit.default_timer)) report how long it takes on your machine to calculate $a_n$ using `dynamic` versus `recursion_bad`. 

__d)__ Implement a function `recursion` that invokes a recursive function. `recursion`  shall take the positive integer argument `n` and return `a_n`. For any other argument, `None` shall be returned. Implement the global `count` as in a). 

Return the command `[recursion(i+1) for i in range(25)]`. 

__Test cases__
```
> count = 0
> recursion(10)
> count
9
```

*Hint: Evaluate the recursion on $n$, $a_n$ and $a_{n-1}$ simultaneously.*

__e)__ Implement a function `explicit` that explicitly computes $a_n$. The function takes the positive integer argument `n` as well as the Fibonacci sequence `F` (you may assume that `len(F) > n`) with default value and return `a_n`. 

You may use `import numpy` and `numpy.tan` as well as `numpy.pi`. Return the command `[explicit(i+1) for i in range(10)]`. 

*Hint: Recall that for $a, b\in\mathbb R\colon \tan(a + b) = (\tan(a) + \tan(b)) / (1 - \tan(a) \tan(b))$.*

__Test cases__
```
> explicit(1)
1.7320508075688767
> explicit(-1)
> explicit(8)
1.7320508075688528
```

In [35]:
# a) 
def recursive_bad(n):
   
    # count 
    global count
    count += 1
    
    # check for exceptions
    if not isinstance(n, int) or n <= 0:
        return None
    
    if n < 3:
        return 3**0.5
    else: 
        a_minus_1 = recursive_bad(n-1)
        a_minus_2 = recursive_bad(n-2)
        return (a_minus_1 + a_minus_2) / (1 - a_minus_1 * a_minus_2)

In [36]:
count = 0; 
recursive_bad(25)

1.7320508075574117

In [37]:
count

150049

In [4]:
recursive_bad('hello')

In [5]:
recursive_bad(5)

-1.732050807568878

In [6]:
[recursive_bad(i+1) for i in range(10)]

[1.7320508075688772,
 1.7320508075688772,
 -1.7320508075688776,
 -1.1102230246251565e-16,
 -1.732050807568878,
 -1.7320508075688785,
 1.7320508075688752,
 -8.326672684688677e-16,
 1.732050807568872,
 1.732050807568869]

__b)__ Let $c_n$ be the number of function calls at $n$. It holds that $c_1 = c_2 = 1$ and 
$$
c_n = c_{n-1} + c_{n-2} + 1.
$$
From the hint we know that we can relate this sequence to the Fibonacci numbers $F_n = F_{n-1} + F_{n-2}$. We can bring above equation in similar form up to a factor $\delta$ and compare coefficients: 
$$
\delta(c_n + 1) = \delta(c_{n-1}  + 1 ) + \delta(c_{n-2}  + 1)
$$
From the inital conditions we know that 
$$
\delta(1 + 1) = 1 \Leftrightarrow \delta = 1/2.
$$
Thus we established that $c_n/2 + 1/2 = F_n$. Solving for $c_n$ gives $c_n = 2F_n - 1$. 


In [7]:
def fib(n):
    if n == 1 or n == 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)
F = [fib(i+1) for i in range(20)]

In [8]:
F = [fib(i+1) for i in range(25)]
[2*F[i] - 1 for i in range(25)]

[1,
 1,
 3,
 5,
 9,
 15,
 25,
 41,
 67,
 109,
 177,
 287,
 465,
 753,
 1219,
 1973,
 3193,
 5167,
 8361,
 13529,
 21891,
 35421,
 57313,
 92735,
 150049]

In [9]:
# c) 
def dynamic(n): 
    
    # check for exceptions
    if not isinstance(n, int) or n <= 0:
        return None
    
    init = 3**0.5
    if n == 1: 
        return init
    x = [init, init]
    counter = 2
    
    while counter < n:
        x.append((x[counter-1] + x[counter-2]) / (1 - x[counter-1] * x[counter-2]))
        counter += 1
        
    return x[n-1]

In [10]:
[dynamic(i+1) for i in range(10)]

[1.7320508075688772,
 1.7320508075688772,
 -1.7320508075688776,
 -1.1102230246251565e-16,
 -1.732050807568878,
 -1.7320508075688785,
 1.7320508075688752,
 -8.326672684688677e-16,
 1.732050807568872,
 1.732050807568869]

In [11]:
import timeit 
start = timeit.default_timer()
dynamic(30)
timeit.default_timer() - start

0.00013041700000115952

In [12]:
start = timeit.default_timer()
recursive_bad(30)
timeit.default_timer() - start

0.4693446659999978

In [13]:
#d) 
def f(n, a_minus_1 = None, a_minus_2 = None): 
    
    # count 
    global count
    count += 1
    
    # check for exceptions
    if not isinstance(n, int) or n <= 0:
        return (None, None, None)
    
    if n < 3: 
        return (n + 1, 3**0.5, 3**0.5)
    
    elif a_minus_1 == None and a_minus_2 == None:
        _, a_minus_1, a_minus_2 = f(n - 1)
    
    if n != 2: 
        a_n = (a_minus_1 + a_minus_2) / (1 - a_minus_1 * a_minus_2)
        return (n + 1, a_n, a_minus_1)
    
def recursion(n): 
    return f(n)[1]

In [14]:
count = 0
recursion(10)
count

9

In [15]:
[recursion(i+1) for i in range(10)]

[1.7320508075688772,
 1.7320508075688772,
 -1.7320508075688776,
 -1.1102230246251565e-16,
 -1.732050807568878,
 -1.7320508075688785,
 1.7320508075688752,
 -8.326672684688677e-16,
 1.732050807568872,
 1.732050807568869]

__e)__ Using the hint and setting $\tan(b_n) = a_n$ we have that 
$$
\tan(b_{n}) = \tan(b_{n-1} + b_{n-2}). 
$$
As in b), this equates the Fibonacci sequence up to a factor $\delta$. With the initial conditions we have for $n=1$, $b_n = \tan^{-1}(\sqrt{3}) = \pi/3$, so that 
$$
\delta b_n = F_n \Leftrightarrow \delta \pi/3 = 1 \Leftrightarrow \delta = 3/\pi. 
$$
Rearranging and solving for $a_n$ gives: 
$$
a_n = \tan(\pi/3 F_n).
$$

In [30]:
import numpy
def explicit(n, F = F): 
    # check for exceptions
    if not isinstance(n, int) or n <= 0:
        return None
    else: 
        return numpy.tan(numpy.pi/3*F[n-1])

In [31]:
explicit(1)

1.7320508075688767

In [32]:
explicit(-1)

In [33]:
explicit(8)

-8.572527594031472e-16

In [34]:
[explicit(i+1) for i in range(10)]

[1.7320508075688767,
 1.7320508075688767,
 -1.7320508075688783,
 -1.2246467991473532e-16,
 -1.7320508075688805,
 -1.7320508075688812,
 1.7320508075688703,
 -8.572527594031472e-16,
 1.7320508075688528,
 1.7320508075688639]