# Assignment 09

## Due: See Date in Moodle

To receive **full credit** for this assignment, you must complete all exercises.

## This Week's Assignment

In this week's assignment, you'll learn how to:

- sample from a `pandas` dataframe.

- write a user-defined Python function.

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

## The Bootstrap

Bootstrapping is one of the simplest, yet most powerful methods in all of statistics because it allows us to estimate the sampling distribution of a statistic by resampling with replacement from the original data, without making strong assumptions about the underlying population. It provides an easy way to get a sense of what might happen if we could repeat an experiment several times. 

When we resample *with replacement*, each data point in the original dataset has an equal chance of being selected on every draw, so it may appear multiple times or not at all in each new sample. By creating many of these resampled datasets, we can calculate the statistic of interest for each one, generating a distribution of those statistics (the bootstrap distribution) that approximates the sampling distribution. This approach is particularly useful when the sample size is small or when the theoretical distribution is unknown. The method is powerful because it turns estimates into distributions that can be used to calculate a range of metrics, including standard errors, confidence intervals, and even $p$-values, helping us better understand the reliability of our results.
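
To make resampling with replacement concrete, here is a minimal sketch on a tiny made-up dataset. The values in `toy_data`, the seed `2024`, and the choice of 1,000 resamples are illustrative assumptions and are not part of the assignment data.

```python
import numpy as np

# Hypothetical toy data (not the Raleigh salaries)
toy_data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

# Seeded generator so the sketch is reproducible
rng = np.random.default_rng(2024)

# Draw 1,000 resamples with replacement, each the same size as toy_data,
# and record the mean of each resample
resample_means = np.array([
    rng.choice(toy_data, size=len(toy_data), replace=True).mean()
    for _ in range(1000)
])

# The spread of these means estimates the standard error of the sample mean
print('Sample mean:', toy_data.mean())
print('Bootstrap estimate of the standard error:', resample_means.std())
```

Each resample mean is one plausible value of the statistic, so the collection of means gives a rough picture of how much the sample mean could vary.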

Below is a video that explains the main ideas behind this computational technique.

In [None]:
from IPython.display import YouTubeVideo

# YouTube video ID
video_id = 'Xz0x-8-cgaQ'

# Embed the YouTube video
YouTubeVideo(video_id, width=800, height=400)

Most of the time when you're conducting research, it's impractical to collect data from the entire population. This can be due to budget, time constraints, or other factors. Instead, a subset of the population is taken and insight is gathered from that subset to learn more about the population.

Suppose we had data for the entire population, say the salaries of all City of Raleigh, NC employees. Before we load the data, import the `pandas` module as `pd`.

In [None]:
...

Run the cell below to load the data.

In [None]:
raleigh = ...

## Raleigh City Employees

According to indeed.com, the average salary of a City of Raleigh, NC employee ranges from approximately \\$39,645 per year for an Administrative Technician to \\$118,226 per year for a Director of Operations. Average City of Raleigh, NC hourly pay ranges from approximately \\$18.46 per hour for an Instructor to \\$31.62 per hour for a System Programmer.

_"Salary information comes from 6,163 data points collected directly from employees, users, and past and present job advertisements on Indeed in the past 36 months._

_Please note that all salary figures are approximations based upon third party submissions to Indeed. These figures are given to the Indeed users for the purpose of generalized comparison only. Minimum wage may differ by jurisdiction and you should consult the employer for actual salary figures."_ Indeed

For information on the positions and related salaries see https://www.indeed.com/cmp/City-of-Raleigh-Nc/salaries.

**Note:** Even though this information is public record the names have been removed for this exercise.

Let's look at information about the dataset.

**Question 1.** Use the `.info()` method to access information about the `raleigh` dataframe.

In [None]:
...

**Question 2.** Give a brief description of each variable and its data type.

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

Now let's get more details. 

**Example 1.** What are the different departments and how many employees does each department have?

In [None]:
...

**Question 3.** Which department has the most employees? Which department has the fewest employees? Is there a department that has more or fewer employees than you expected? Which one? Why?

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

Suppose we wanted to report the mean salary for a typical full-time employee of the City of Raleigh. Since we have all the salaries we can find the population mean. But before we can do that, we need to change the data type in 2 of the columns. The output from the `.info()` method was:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6163 entries, 0 to 6162
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   SALARY       3346 non-null   object
 1   HOURLY RATE  2817 non-null   object
 2   TITLE        6163 non-null   object
 3   DEPARTMENT   6163 non-null   object
dtypes: object(4)
memory usage: 192.7+ KB
```

So we need to do some wrangling with the `SALARY` and `HOURLY RATE` columns to convert the values from `object` to `float64`.

## Data Wrangling

There are two issues. The first is with the values in the `SALARY` column:

- The values are stored as strings.

- Some of the salary values are `NaN` (missing).

and the other is with the values in the `HOURLY RATE` column:

- Some of the employees are paid per hour, thus their yearly salary is missing.

If we want the mean salary we will need to convert the string salary values into numbers, then compute the yearly earnings for an hourly employee.

**Example 2.** Remove the dollar sign and comma from one of the `SALARY` values.

Run the cell below to see the second observation in the `raleigh` dataframe for `SALARY`.

**Note:** We use the index value `[1]` because Python index values start with 0.

In [None]:
raleigh.SALARY[1]

**Example 3.** Use the `.replace` method to remove the `$` and the `,`. 

**Note:** In Python, the `.replace()` method is used to replace parts of a string with another substring. It is commonly used for basic string replacements.

```python
str.replace(old, new[, count])
```

- `old`: The substring you want to replace.

- `new`: The substring you want to replace it with.

- `count` (optional): The maximum number of occurrences to replace. If omitted, it replaces all occurrences.
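
As a small illustration of the `count` argument described above, the sketch below uses a made-up string (`price` is a hypothetical value, not taken from the dataset).

```python
# Hypothetical currency string for illustration only
price = "$1,234,567.89"

# Replace every comma (count omitted)
print(price.replace(",", ""))      # $1234567.89

# Replace only the first comma (count=1)
print(price.replace(",", "", 1))   # $1234,567.89

# The original string is unchanged because .replace() returns a new string
print(price)                       # $1,234,567.89
```

Because `.replace()` returns a new string, calls can be chained one after another when more than one character needs to be removed.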


In [None]:
...

**Example 4.** Use the `float()` function to convert the output of the previous code cell from a string to a number.

In [None]:
...

Now that we know how to do one value, we can do this for all the values. To apply this to all the items in the column we can use a user-defined function.

In [None]:
def convert_to_number(currency):
    """
    Convert a currency string to a numerical float value by removing
    the dollar sign ('$') and commas.

    Parameters
    ----------
    currency (str): The currency string to convert (e.g., "$1,234.56").

    Returns
    -------
    float or str: The numerical value if the input is a valid string,
                  otherwise returns the original value if it's NaN.
    """
    
    ## Check if the input value is not NaN
    if pd.notna(currency):
        
        ## Remove the dollar sign ('$') and commas (',') from the string
        result = currency.replace('$', '').replace(',', '')
        
        ## Convert the cleaned string to a float
        result = float(result)
        
        
        ## Return the numerical value
        return result
    else:
        
        ## If the value is NaN, return it as-is
        return currency

Use the `.apply()` method and the `convert_to_number` function to change each value in the `SALARY` column from a string to a number. Save the output to an object named `ys` (yearly salary). Display the first 10 results.

**Note:** The `.apply()` method from `pandas` is used to apply a function to each element, row, or column of a `pandas` `DataFrame` or `Series`.

```python
DataFrame.apply(function, axis=0)
```

- `function`: A function that you want to apply.

- `axis` (for `DataFrame` only, optional):

    - `axis=0` (default): Apply the function to each column.

    - `axis=1`: Apply the function to each row.

- For a `Series`, `.apply()` applies the function to each element of the `Series`.

- For a `DataFrame`, you can control whether to apply the function to each column or row using the axis parameter.
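
The difference between applying a function to a `Series` and to a `DataFrame` can be sketched on a small made-up table. The dataframe `toy` and its column names `a` and `b` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Series.apply: the function receives one element at a time
doubled = toy['a'].apply(lambda value: value * 2)

# DataFrame.apply with axis=0 (default): the function receives each column
column_totals = toy.apply(lambda column: column.sum(), axis=0)

# DataFrame.apply with axis=1: the function receives each row
row_totals = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(doubled.tolist())        # [2, 4, 6]
print(column_totals.tolist())  # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
```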

In [None]:
## Apply the convert_to_number function to each element in the SALARY column
## and save the cleaned numerical values to a Series object named 'ys' (yearly salary)
ys = raleigh.SALARY.apply(convert_to_number)

## Display the first 10 results of the converted values
print(ys[:10])

**Example 5.** Apply the `convert_to_number` function to each element in the `HOURLY RATE` column and save the cleaned numerical values to a `Series` object named `hr` (hourly rate).

In [None]:
hr = raleigh['HOURLY RATE'].apply(...)
hr[:10]

**Question 4.** Take the `ys` and `hr` `Series` and combine them into a `DataFrame` named `df`, with columns named `yearly` and `hourly`.

In [None]:
df = ...
df.head(10)

**Question 5.** For the last step, we need to compute the yearly salary for an employee who is paid hourly. Here's where you get to make some choices. Assign a value to each of the variables `weeks_per_year` and `hours_per_week`.

Explain your choices in a comment in the code cell. You can remove the `___` and the `...` and complete each statement.

In [None]:
## Weeks worked in a year
weeks_per_year = ...

# I chose ___ weeks because ...

# Hours worked in a week
hours_per_week = ...

# I chose ___ hours because ...

Run the cell below to calculate the yearly earnings for an employee who earns \$18.00 per hour, based on your selected values for `weeks_per_year` and `hours_per_week`.

**Note:** This is the first observation in our `df` dataframe.

In [None]:
df.hourly[0] * weeks_per_year * hours_per_week

## `.notna()`

Some of the employees already have a yearly salary, so we don't need to do any calculations for them. We only need to do this for employees that have a value in the `hourly` column. If we were saying it out loud, it would go like this:

1. We need to check the `hourly` column to see if there is a value;

1. If there is a value, then we need to calculate the yearly earnings;

1. To calculate the yearly earnings, we can multiply the hourly rate by the number of weeks and the hours per week;

1. Finally, we need to do this for all the values in the `hourly` column.

Thankfully, the `.notna()` method makes it relatively easy to do all of this in one line of Python code. [Click here](https://pandas.pydata.org/docs/reference/api/pandas.Series.notna.html) to read about the `.notna()` `pandas` Series method.
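
As a quick sketch of what `.notna()` returns, the example below uses a small made-up `Series`. The name `toy_hourly` and its values are hypothetical and are not taken from the Raleigh data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly rates; NaN marks employees without an hourly rate
toy_hourly = pd.Series([18.0, np.nan, 25.5, np.nan])

# .notna() returns True where a value is present and False where it is missing
has_rate = toy_hourly.notna()
print(has_rate.tolist())              # [True, False, True, False]

# The Boolean Series can be used to keep only the non-missing values
print(toy_hourly[has_rate].tolist())  # [18.0, 25.5]
```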

**Example 6.** Use the `.notna()` method and the `.apply()` method to compute the yearly earnings from the hourly rate.

In [None]:
def calculate_yearly_salary(row):
    """
    Calculate the yearly salary for an employee based on their hourly wage.
    
    If the hourly column value is not NaN, the function calculates the yearly
    salary using the formula: hourly rate * weeks per year * hours per week.
    
    If the hourly column is NaN, it returns the value in the yearly column instead.
    
    Parameters
    ----------
    row (pd.Series): A row from a DataFrame containing the columns 'hourly' and 'yearly'.
    
    Returns
    -------
    float: The calculated yearly salary if hourly is provided, or the existing yearly value if not.
    """
    
    ## Check if the hourly value is not NaN
    if pd.notna(row.hourly):
        
        ## Calculate the yearly salary using the provided hourly rate
        result = row.hourly * weeks_per_year * hours_per_week
        
        ## Return the calculated yearly salary
        return result 
    else:
        
        ## If hourly is NaN, return the existing yearly salary value
        return row.yearly

Save the output to a Series named `salaries`.

In [None]:
salaries = df.apply(...)
salaries[:10]

Now we can find the mean and median salary for all City of Raleigh, NC employees.

In [None]:
## Find the median value in the salaries Series
median_salary = salaries.median()
print('The median salary is', median_salary)

## Find the mean value in the salaries Series
mean_salary = salaries.mean()
print('The mean salary is', mean_salary)

## Compute the difference between the mean and the median salary 
diff = mean_salary - median_salary
print('The difference between the mean salary and the median salary is', diff)

**Question 6.** Interpret the significance of the mean and median salaries within the context of the City of Raleigh data. What might explain the gap of just over \$4,000 between these two measures?

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

## A Sample

A random sample is a subset of data or individuals taken from a larger population or dataset in such a way that each member of the population has an equal and independent chance of being included in the sample. The goal of taking a random sample is to ensure that the sample is representative of the entire population, allowing for valid statistical inferences and generalizations to be made about the population as a whole.

**Example 7.** Sample one observation from the `salaries` Series.

**Note:** The `.sample()` method, by default, samples without replacement. This means that the same element cannot be chosen more than once in the random sample. If you want to sample with replacement (each element can be chosen more than once), you need to explicitly set the `replace` argument to `True`.
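
The difference between the two sampling modes can be sketched on a small made-up `Series`. The name `toy` and the `random_state` value are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy Series for illustration only
toy = pd.Series([10, 20, 30, 40, 50])

# Without replacement (the default): each element appears at most once
without = toy.sample(n=5, random_state=1)
print(sorted(without.tolist()))   # [10, 20, 30, 40, 50]

# With replacement: elements may repeat, so n can exceed len(toy)
with_repl = toy.sample(n=8, replace=True, random_state=1)
print(len(with_repl))             # 8

# Without replacement, asking for more elements than exist raises an error
try:
    toy.sample(n=8)
except ValueError as error:
    print('ValueError:', error)
```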

In [None]:
## One sampled observation
salaries.sample(n=1, replace=True)

If we re-run the code cell above, we would most likely get a different value. Try it and see.

In [None]:
## One sampled observation
salaries.sample(n=1, replace=True)

Our goal is to draw a sample large enough to support conclusions about the population without having to examine every single member of it. In this activity we actually have the whole population, but let's pretend that we don't.

To ensure that your work is reproducible we will set a seed. What does it mean to set a seed?

In the context of the Python programming language, setting a seed refers to initializing the random number generator with a specific value. This is important when you want to ensure reproducibility in your code, especially when generating random numbers.

In Python, the random number generator is used in functions that involve randomness, such as sampling or generating random numbers from distributions. When you set a seed, you are essentially starting the random number generator at a specific point, and if you use the same seed again, you should get the same sequence of random numbers.

We can use the `random_state` parameter to set the seed to a specific value.
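
A minimal sketch of this behavior, using a made-up `Series` and an arbitrary seed value of `1234` (both are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy Series for illustration only
toy = pd.Series(range(100))

# Two draws with the same random_state select the same rows
first_draw = toy.sample(n=5, random_state=1234)
second_draw = toy.sample(n=5, random_state=1234)

print(first_draw.equals(second_draw))   # True
```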

In the cell below, set the seed for this notebook using 4 digits from your birthdate, street address, phone number, etc. **Note:** The first digit cannot be 0.

In [None]:
seed = ...

**Example 8.** Sample 500 observations from the `salaries` Series.

In [None]:
s = 500
salaries_sample = salaries.sample(n=s, random_state=seed)

What is the mean annual salary in our sample? Is it the same value as the mean salary in the full dataset?

In [None]:
print('The mean salary in our sample of', s, 'is', salaries_sample.mean())
print('The mean salary in our population of', len(salaries), 'is', salaries.mean())

**Question 7.** Could we make a statement about the population based off of our sample of 500 observations? Would this be a good idea? Explain.

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

## A Bootstrap Sample Mean

Suppose we take a random sample from our sample (with replacement). Would that give us a better idea of the mean salary for a typical full-time City of Raleigh employee?

Let's try!

**Example 9.** Collect one bootstrap sample from the `salaries_sample` Series. Calculate the mean of that sample.

**Note:** We set `n=s` so that the bootstrap sample is the same size as the original sample (500).

In [None]:
## Perform a bootstrap sample
one_bootstrap_sample = salaries_sample.sample(n=s, replace=...)

## Calculate the mean of one bootstrap sample
print('The mean salary in our bootstrap of', len(salaries_sample), 'is', one_bootstrap_sample.mean())

## The mean of the population
print('The mean salary in our population of', len(salaries), 'is', salaries.mean())

If we took another bootstrap sample and calculated its mean, do you think we would get the same value?

**Question 8.** What if we generated an additional 10,000 bootstrap samples and recalculated the mean for each sample? Do you think we would ever get exactly the same mean? Explain your reasoning.

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

## Lots of Bootstrap Sample Means

To proceed we will use `NumPy`. Remember, `NumPy` is a numerical computing library for Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays. `NumPy` is a fundamental package for scientific computing in Python, and it forms the foundation for many other libraries and tools in the Python data science ecosystem.

Complete the code cell below to import `NumPy`.

In [None]:
...

**Example 10.** Let's generate 10,000 bootstrap sample means.

In [None]:
## Initialize an empty NumPy array to store the means of each bootstrap sample
bootstrap_sample_means = np.array([])

## Loop 10,000 times to generate bootstrap samples and calculate their means
for _ in range(10000):
    
    ## Draw a random sample of size 's' from 'salaries_sample' with replacement
    one_bootstrap_sample = salaries_sample.sample(n=s, replace=True)
    
    ## Calculate the mean of the current bootstrap sample
    sample_mean = one_bootstrap_sample.mean()
    
    ## Append the calculated mean to the 'bootstrap_sample_means' array
    bootstrap_sample_means = np.append(bootstrap_sample_means, sample_mean)
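
The loop above calls `.sample()` without a `random_state`, so its results change from run to run. As an alternative sketch (not the method used in this assignment), the same idea can be written with a seeded NumPy generator that draws all of the bootstrap samples at once. The array `toy_sample`, the seed `1234`, and the names `n_boot` and `toy_bootstrap_means` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the values in a sample of salaries
toy_sample = np.array([41000.0, 52000.0, 38500.0, 61000.0, 47250.0])

# Seeded generator so the bootstrap means are reproducible
rng = np.random.default_rng(1234)

# Each row is one bootstrap sample drawn with replacement
n_boot = 10000
resamples = rng.choice(toy_sample, size=(n_boot, len(toy_sample)), replace=True)

# One mean per bootstrap sample (one per row)
toy_bootstrap_means = resamples.mean(axis=1)

print(toy_bootstrap_means.shape)   # (10000,)
```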

Load `pyplot` from the `matplotlib` library so we can visualize the distribution of the bootstrap sample means.

In [None]:
...

**Example 11.** To analyze the distribution (i.e., frequency and pattern) of our bootstrap means, let's visualize our data using a histogram.

In [None]:
# Create a histogram of the bootstrap sample means
plt.hist(bootstrap_sample_means, edgecolor='black', linewidth=0.75)
plt.xlabel('bootstrap_means')
plt.ylabel('Frequency');

Based on the histogram above, do you think the true mean salary (or a value very close to it) appears more frequently than other values?

**Example 12.** Show the location of the mean of all the bootstrap sample means and the true average salary.

In [None]:
## Create a histogram of the bootstrap sample means
plt.hist(bootstrap_sample_means, edgecolor='black', linewidth=0.75)

## Plot a red marker for the true average salary
plt.plot(salaries.mean(), -55, marker='^', color='red', markersize=10)

## Plot a green marker for the mean salary of the bootstrapped means
plt.plot(bootstrap_sample_means.mean(), -55, marker='^', color='green', markersize=10)

## Add labels
plt.xlabel('bootstrap_means')
plt.ylabel('Frequency');

In [None]:
## Calculate the mean of the bootstrap distribution
print('The mean salary in our bootstrap distribution of', len(bootstrap_sample_means), 
      'bootstrapped sample means, each with size', len(salaries_sample),
      'is', bootstrap_sample_means.mean())

## The mean of the population
print('The mean salary in our population of', len(salaries), 'is', salaries.mean())

**Question 9.** If we repeated this process, do you think the mean of the bootstrapped sample means would get closer to the true mean salary, or further away? Why do you think that?

**Note:** There’s no need to stress about getting the _"right"_ answer. I’m more interested in your reasoning and perspective.

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**

**Example 13.** Show the location of the mean of all the bootstrap sample means, the true average salary, and a 95% confidence interval.

In [None]:
## Get lower and upper bound
lower_bound = np.percentile(bootstrap_sample_means, 2.5)
upper_bound = np.percentile(bootstrap_sample_means, 97.5)

## Create a histogram of the bootstrap sample means
plt.hist(bootstrap_sample_means, edgecolor='black', linewidth=0.75)

## Plot a 95% confidence interval
plt.hlines(25, lower_bound, upper_bound, color='orange', linewidth=5)

## Plot a red marker for the true average salary
plt.plot(salaries.mean(), -60, marker='^', color='red', markersize=10)

## Plot a green marker for the mean salary of the bootstrapped means
plt.plot(bootstrap_sample_means.mean(), -60, marker='^', color='green', markersize=10)

## Print statements
print("The 95% confidence interval is (", round(lower_bound, 2), ",",round(upper_bound, 2), ")")
print("The true mean is ", round(salaries.mean(), 2))
print("The mean of the bootstrapped samples is", round(bootstrap_sample_means.mean(), 2))
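
The percentile bounds used above follow from the confidence level: a 95% interval leaves 2.5% of the bootstrap means in each tail. The sketch below shows that arithmetic on made-up values. The array `toy_means` and the variable names `confidence` and `tail` are assumptions for illustration.

```python
import numpy as np

# Hypothetical values 0, 1, ..., 100 standing in for bootstrap sample means
toy_means = np.arange(101)

# A 95% interval leaves (100 - 95) / 2 = 2.5% in each tail
confidence = 95
tail = (100 - confidence) / 2

# Percentiles at 2.5 and 97.5 give the lower and upper bounds
lower, upper = np.percentile(toy_means, [tail, 100 - tail])

print(round(float(lower), 2), round(float(upper), 2))   # 2.5 97.5
```

Changing `confidence` to 90 gives tails of 5% each, which corresponds to the 5th and 95th percentiles.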

**Example 14.** Show the location of the mean of all the bootstrap sample means, the true average salary, and a 90% confidence interval.

In [None]:
## Get lower and upper bound
lower_bound = np.percentile(bootstrap_sample_means, 5)
upper_bound = np.percentile(bootstrap_sample_means, 95)

## Create a histogram of the bootstrap sample means
plt.hist(bootstrap_sample_means, edgecolor='black', linewidth=0.75)

## Plot a 90% confidence interval
plt.hlines(25, lower_bound, upper_bound, color='orange', linewidth=5)

## Plot a red marker for the true average salary
plt.plot(salaries.mean(), -60, marker='^', color='red', markersize=10)

## Plot a green marker for the mean salary of the bootstrapped means
plt.plot(bootstrap_sample_means.mean(), -60, marker='^', color='green', markersize=10)

## Print statements
print("The 90% confidence interval is (", round(lower_bound, 2), ",",round(upper_bound, 2), ")")
print("The true mean is ", round(salaries.mean(), 2))
print("The mean of the bootstrapped samples is", round(bootstrap_sample_means.mean(), 2))

**Question 10.** Does the true mean fall within the confidence interval? If not, do you think it would be possible to run a simulation where the true mean does fall inside the confidence interval? Conversely, if it does fall within the interval, do you think we could generate a simulation where it falls outside? Explain your reasoning.

**Note:** Don't worry about getting the _"right"_ answer. I'm more interested in your thought process and what you think.

**_CLICK HERE TO ENTER YOUR ANSWER, REPLACING THIS TEXT._**