# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# import numpy and pandas
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [3]:
# Run this code:
salaries = pd.read_csv('Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [None]:
# Your code here
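
One possible way to approach this cell is sketched below on a small stand-in DataFrame, so that it runs without the CSV file. Only the `Hourly Rate` column name comes from this lab; the other column names and all values are hypothetical and used purely for illustration.

```python
import pandas as pd

# Toy stand-in for the Chicago salaries file. Every column except
# 'Hourly Rate' is a hypothetical name, and all values are made up.
toy_salaries = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Salary or Hourly": ["Salary", "Hourly", "Salary", "Hourly", "Salary", "Hourly"],
    "Hourly Rate": [None, 19.5, None, 35.6, None, 28.0],
    "Annual Salary": [90000.0, None, 75000.0, None, 101000.0, None],
})

# head() returns the first 5 rows by default
preview = toy_salaries.head()

# Count missing values per column; salaried rows have no hourly rate
missing_per_column = toy_salaries.isna().sum()

print(preview)
print(missing_per_column)
```

On the real dataset the same two calls would be made on `salaries`. Checking missing values early is useful here because the later challenges drop the rows where `Hourly Rate` is NaN.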

# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the appropriate one-sample test function from SciPy and perform a two-sided hypothesis test at the 5% significance level (95% confidence level).

In [4]:
# Your code here

from scipy.stats import ttest_1samp

# Step 1: Filter out rows where 'Hourly Rate' is not null (NaN)
hourly_wages = salaries['Hourly Rate'].dropna()

# Step 2: Perform one-sample t-test
# Test if the mean hourly wage is significantly different from 30 USD
test_statistic, p_value = ttest_1samp(hourly_wages, 30)

# Step 3: Print the results
print(f"T-test Statistic: {test_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Step 4: Interpret the result
alpha = 0.05  # 5% significance level
if p_value < alpha:
    print("We reject the null hypothesis. The average hourly wage is significantly different from 30 USD.")
else:
    print("We fail to reject the null hypothesis. There is no significant difference between the average hourly wage and 30 USD.")


T-test Statistic: 20.6198
P-value: 0.0000
We reject the null hypothesis. The average hourly wage is significantly different from 30 USD.
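
To see what `ttest_1samp` computes, the sketch below rebuilds the statistic by hand as t = (sample mean - hypothesized mean) / standard error and compares it with SciPy's result. The data are synthetic (`toy_wages` is a made-up sample, not the Chicago dataset), so the numbers will differ from the output above.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic hourly wages, for illustration only
rng = np.random.default_rng(0)
toy_wages = rng.normal(loc=32.0, scale=12.0, size=200)
mu0 = 30.0  # hypothesized population mean

n = toy_wages.size
sample_mean = toy_wages.mean()
# Standard error uses the sample standard deviation (ddof=1)
standard_error = toy_wages.std(ddof=1) / np.sqrt(n)

# One-sample t statistic and two-sided p-value with n - 1 degrees of freedom
t_manual = (sample_mean - mu0) / standard_error
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

# SciPy's built-in test should agree with the manual calculation
t_scipy, p_scipy = ttest_1samp(toy_wages, mu0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.3e}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3e}")
```

Note that a p-value printed with `:.4f` is rounded, so `0.0000` means "smaller than 0.00005", not exactly zero. A format such as `:.3e` shows its order of magnitude.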


# Challenge 3 - Constructing Confidence Intervals

While a hypothesis test is one way to gather empirical evidence for or against a hypothesis, another is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. A 95% confidence interval is built by a procedure that, over repeated sampling, produces intervals containing the population mean about 95% of the time.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
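
The "95%" refers to how often the interval-building procedure captures the true mean across repeated samples. A minimal simulation sketch of that idea follows, assuming a normal population whose mean and standard deviation are chosen arbitrarily for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical population parameters, chosen only for this simulation
true_mean, true_sd = 30.0, 10.0
n, n_repeats = 50, 2000

# Draw n_repeats independent samples of size n (one sample per row)
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Critical value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of the intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the method.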


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [5]:
# Your code here
import numpy as np
from scipy import stats

# Step 1: Filter out rows with non-null 'Hourly Rate' values
hourly_wages = salaries['Hourly Rate'].dropna()

# Step 2: Calculate sample statistics
sample_mean = np.mean(hourly_wages)  # Sample mean
sample_size = len(hourly_wages)  # Sample size
sample_std = np.std(hourly_wages, ddof=1)  # Sample standard deviation (Bessel's correction with ddof=1)

# Step 3: Calculate standard error of the mean
standard_error = sample_std / np.sqrt(sample_size)

# Step 4: Calculate the 95% confidence interval using the t-distribution
confidence_level = 0.95
degrees_of_freedom = sample_size - 1  # Degrees of freedom for t-distribution

confidence_interval = stats.t.interval(confidence_level, df=degrees_of_freedom, loc=sample_mean, scale=standard_error)

# Display the results
print(f"Sample Mean: {sample_mean:.4f}")
print(f"Standard Error: {standard_error:.4f}")
print(f"95% Confidence Interval: {confidence_interval}")


Sample Mean: 32.7886
Standard Error: 0.1352
95% Confidence Interval: (32.52345834488425, 33.05365708767623)
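
As a cross-check, the same kind of interval can be built with `scipy.stats.sem` for the standard error and compared against the explicit formula mean ± t* × SE. The sketch below uses synthetic data (`toy_wages` is hypothetical), so its numbers will not match the output above.

```python
import numpy as np
from scipy import stats

# Synthetic hourly wages, for illustration only
rng = np.random.default_rng(1)
toy_wages = rng.normal(32.0, 12.0, size=300)

n = toy_wages.size
mean = toy_wages.mean()
# stats.sem uses ddof=1 by default, i.e. s / sqrt(n)
se = stats.sem(toy_wages)

# Interval from the t-distribution helper
ci_interval = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

# Same interval from the explicit formula: mean +/- t_crit * se
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_manual = (mean - t_crit * se, mean + t_crit * se)

print(f"t.interval: {ci_interval}")
print(f"manual:     {ci_manual}")
```

The interval printed for the Chicago data above does not contain 30, which agrees with the rejection of the null hypothesis in Challenge 2: a two-sided test at the 5% level rejects exactly when the hypothesized mean falls outside the 95% confidence interval.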


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a hypothesized fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [6]:
# Your code here

from statsmodels.stats.proportion import proportions_ztest

# Step 1: Count the number of hourly workers and total number of workers
count_hourly_workers = len(hourly_wages)  # Number of hourly workers
total_workers = len(salaries)  # Total number of workers in the dataset

# Step 2: Define the hypothesized proportion
hypothesized_proportion = 0.25

# Step 3: Perform a one-sample z-test for proportions
z_stat, p_value = proportions_ztest(count_hourly_workers, total_workers, value=hypothesized_proportion, alternative='two-sided')

# Step 4: Display the results
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Step 5: Interpretation
alpha = 0.05  # 5% significance level
if p_value < alpha:
    print(f"We reject the null hypothesis. The proportion of hourly workers is significantly different from {hypothesized_proportion*100}%.")
else:
    print(f"We fail to reject the null hypothesis. There is no significant difference in the proportion of hourly workers from {hypothesized_proportion*100}%.")


Z-statistic: -3.5100
P-value: 0.0004
We reject the null hypothesis. The proportion of hourly workers is significantly different from 25.0%.
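
For intuition, the z statistic can be computed by hand with only the standard library. The helper below, `one_sample_proportion_ztest`, is a hypothetical name written for this sketch, and the counts are made up. To our understanding, `proportions_ztest` by default uses the sample proportion in the variance term, whereas many textbooks use the hypothesized proportion; the sketch shows both so the difference is visible.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_ztest(count, nobs, p0, use_null_variance=False):
    """Two-sided one-sample z-test for a proportion (illustrative helper)."""
    p_hat = count / nobs
    # Variance based on the sample proportion, or on p0 if requested
    p_var = p0 if use_null_variance else p_hat
    se = sqrt(p_var * (1 - p_var) / nobs)
    z = (p_hat - p0) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: 240 hourly workers out of 1000 employees, tested against 25%
z_sample, p_sample = one_sample_proportion_ztest(240, 1000, 0.25)
z_null, p_null = one_sample_proportion_ztest(240, 1000, 0.25, use_null_variance=True)

print(f"sample-proportion variance: z = {z_sample:.4f}, p = {p_sample:.4f}")
print(f"null-proportion variance:   z = {z_null:.4f}, p = {p_null:.4f}")
```

With large samples the two conventions give very similar results, but it is worth knowing which one a library uses when comparing against a hand calculation.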
