# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# Import numpy, pandas, and the statistical helpers used in this lab
import numpy as np
import pandas as pd
from scipy.stats import (
    beta, f_oneway, gaussian_kde, mode, norm, pearsonr, sem, skew,
    spearmanr, t, trim_mean, ttest_ind,
)
from statsmodels.stats.weightstats import ztest



# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [5]:
# Run this code:
salaries = pd.read_csv(r"../data/Current_Employee_Names__Salaries__and_Position_Titles.csv")

Examine the `salaries` dataset using the `head` function below.

In [9]:
# Show the first five rows to get a feel for the columns and their types
print(salaries.head())


                  Name                              Job Titles   
0    AARON,  JEFFERY M                                SERGEANT  \
1      AARON,  KARINA   POLICE OFFICER (ASSIGNED AS DETECTIVE)   
2  AARON,  KIMBERLEI R                CHIEF CONTRACT EXPEDITER   
3  ABAD JR,  VICENTE M                       CIVIL ENGINEER IV   
4    ABASCAL,  REECE E             TRAFFIC CONTROL AIDE-HOURLY   

         Department Full or Part-Time Salary or Hourly  Typical Hours   
0            POLICE                 F           Salary            NaN  \
1            POLICE                 F           Salary            NaN   
2  GENERAL SERVICES                 F           Salary            NaN   
3       WATER MGMNT                 F           Salary            NaN   
4              OEMC                 P           Hourly           20.0   

   Annual Salary  Hourly Rate  
0       101442.0          NaN  
1        94122.0          NaN  
2       101592.0          NaN  
3       110064.0          NaN  
4   

# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from SciPy and perform a two-sided hypothesis test at the 95% confidence level.

In [13]:
# test whether the hourly wage of all hourly workers is significantly different from $30/hr

# Null hypothesis (H0): mean hourly rate = $30/hr.
# Alternate hypothesis (Ha): mean hourly rate != $30/hr.


# import one sample test function from scipy
from scipy.stats import ttest_1samp 


# filter for hourly workers and get their hourly rates
hourly_workers = salaries[salaries['Salary or Hourly'] == 'Hourly']
hourly_rates = hourly_workers['Hourly Rate'].dropna()  # Remove NaN values

#  one-sample t-test
test_statistic, p_value = ttest_1samp(hourly_rates, 30)

# Print the results
print("Test Statistic:", round(test_statistic, 4))

if p_value < 0.001:
    print("P-Value: < 0.001")
else:
    print("P-Value:", round(p_value, 4))

# Evaluate the result
alpha = 0.05  # 95% confidence level
if p_value < alpha:
    print("Reject the null hypothesis (H0), accept the alternate hypothesis (Ha): The mean hourly rate IS SIGNIFICANTLY DIFFERENT from $30/hr.")
else:
    print("DO NOT reject the null hypothesis: There is NO SIGNIFICANT DIFFERENCE from $30/hr.")

Test Statistic: 20.6198
P-Value: < 0.001
Reject the null hypothesis (H0), accept the alternate hypothesis (Ha): The mean hourly rate IS SIGNIFICANTLY DIFFERENT from $30/hr.
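
To see what `ttest_1samp` computes, the test statistic and the two-sided p-value can be rebuilt by hand. The sketch below is only an illustration: it uses a synthetic sample (the values, seed, and the names `sample`, `t_manual`, and `p_manual` are assumptions, not the Chicago data) and compares the manual result with SciPy's.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic stand-in for a column of hourly rates (hypothetical values,
# NOT the Chicago dataset).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=33.0, scale=12.0, size=500)
popmean = 30.0  # hypothesized mean under H0

n = sample.size
sample_mean = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(n)  # sample std dev / sqrt(n)

# t statistic: distance between sample mean and hypothesized mean,
# measured in standard errors
t_manual = (sample_mean - popmean) / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

# SciPy's one-sample t-test on the same data
t_scipy, p_scipy = ttest_1samp(sample, popmean)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both lines should print matching values, which shows that the test is just the standardized difference between the sample mean and the hypothesized mean, looked up in a t distribution with `n - 1` degrees of freedom.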


# Challenge 3 - Constructing Confidence Intervals

While testing our hypothesis is a great way to gather empirical evidence for or against the hypothesis, another way to gather evidence is by creating a confidence interval. A confidence interval gives us a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the true population mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
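
One way to make the 95% figure concrete is a small simulation: draw many samples from a population whose mean is known, build a 95% interval from each, and count how often the interval contains the true mean. The population parameters, seed, and variable names below are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.stats import sem, t

rng = np.random.default_rng(seed=42)

true_mean = 30.0    # hypothetical population mean
n_samples = 2000    # number of repeated experiments
sample_size = 50    # observations per experiment

covered = 0
for _ in range(n_samples):
    # Draw one sample from the (assumed normal) population
    sample = rng.normal(loc=true_mean, scale=10.0, size=sample_size)
    # 95% t-based confidence interval for the mean of this sample
    low, high = t.interval(
        0.95, df=sample_size - 1, loc=sample.mean(), scale=sem(sample)
    )
    # Count the interval if it contains the true mean
    covered += int(low <= true_mean <= high)

coverage = covered / n_samples
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the true mean or does not; the 95% describes how often the method succeeds over many samples.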


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [17]:
import numpy as np
from scipy.stats import t

# Calculate basic statistics
n = len(hourly_rates)
hourly_rates_mean = np.mean(hourly_rates)
hourly_rates_stdev = np.std(hourly_rates, ddof=1)  # Use ddof=1 for sample standard deviation
hourly_rates_sterror = hourly_rates_stdev / np.sqrt(n)

# 95% confidence interval
confidence_level = 0.95
degrees_of_freedom = n - 1

confidence_interval = t.interval( # Confidence interval with equal areas around the median.
    confidence_level, 
    df=degrees_of_freedom, 
    loc=hourly_rates_mean, 
    scale=hourly_rates_sterror
)

# Print results
print("Mean Hourly Rate: $", round(hourly_rates_mean, 2))
print("Standard Error Hourly Rate: $", round(hourly_rates_sterror, 2))
print("95% Confidence Interval:", (round(confidence_interval[0], 2), round(confidence_interval[1], 2)))


Mean Hourly Rate: $ 32.79
Standard Error Hourly Rate: $ 0.14
95% Confidence Interval: (32.52, 33.05)
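
The cell above computes the standard error by hand. As a cross-check, the sketch below shows that `scipy.stats.sem` gives the same value (it uses `ddof=1` by default) and that `t.interval` is equivalent to "mean plus or minus critical t value times standard error". The sample is synthetic and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import sem, t

# Synthetic stand-in for hourly rates (hypothetical values, NOT the real data)
rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=32.0, scale=12.0, size=400)

n = len(sample)
sample_mean = np.mean(sample)

# Standard error two ways: by hand and with scipy.stats.sem
manual_se = np.std(sample, ddof=1) / np.sqrt(n)
scipy_se = sem(sample)

# 95% interval from t.interval
interval = t.interval(0.95, df=n - 1, loc=sample_mean, scale=scipy_se)

# Same interval by hand: the critical value leaves 2.5% in each tail
t_crit = t.ppf(0.975, df=n - 1)
manual_interval = (
    sample_mean - t_crit * manual_se,
    sample_mean + t_crit * manual_se,
)

print("Standard error (manual, sem):", manual_se, scipy_se)
print("t.interval:     ", interval)
print("manual interval:", manual_interval)
```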


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [19]:
# Null hypothesis (H0): The proportion of hourly workers = 25% (p = 0.25)
# Alternative hypothesis (Ha): The proportion of hourly workers != 25% (p != 0.25)

from statsmodels.stats.proportion import proportions_ztest



# Total number of employees and hourly workers
total_employees = len(salaries)
hourly_workers_count = len(salaries[salaries['Salary or Hourly'] == 'Hourly'])

# Hypothesized proportion
hypothesized_proportion = 0.25

# Perform the z-test for proportions
test_statistic, p_value = proportions_ztest(
    count=hourly_workers_count, 
    nobs=total_employees, 
    value=hypothesized_proportion, 
    alternative='two-sided'  # two-tailed test; the 95% level is applied below via alpha
)

# Print the results
print("Test Statistic (z):", round(test_statistic, 4))

if p_value < 0.001:
    print("P-Value: < 0.001")
else:
    print("P-Value:", round(p_value, 4))


# Evaluate the result
alpha = 0.05  # 95% confidence level
if p_value < alpha:
    print("Reject the null hypothesis (H0), accept the alternate hypothesis (Ha): The proportion of hourly workers IS SIGNIFICANTLY DIFFERENT from 25%.")
else:
    print("DO NOT reject the null hypothesis: The proportion of hourly workers IS NOT SIGNIFICANTLY DIFFERENT from 25%.")

Test Statistic (z): -3.51
P-Value: < 0.001
Reject the null hypothesis (H0), accept the alternate hypothesis (Ha): The proportion of hourly workers IS SIGNIFICANTLY DIFFERENT from 25%.
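
The z statistic for a one-sample proportion test can also be written out by hand. The sketch below uses only NumPy and SciPy, with a hypothetical helper `one_sample_proportion_ztest` and made-up counts (not the Chicago figures). It estimates the variance from the sample proportion, which is what `proportions_ztest` does when its `prop_var` argument is left at the default; some textbooks use the hypothesized proportion in the standard error instead, which gives a slightly different statistic.

```python
import numpy as np
from scipy.stats import norm


def one_sample_proportion_ztest(count, nobs, value):
    """Two-sided z-test for a single proportion (hypothetical helper).

    count: number of "successes" in the sample
    nobs:  total number of observations
    value: hypothesized proportion under H0
    """
    p_hat = count / nobs
    # Standard error based on the sample proportion
    standard_error = np.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - value) / standard_error
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Made-up example: 240 hourly workers among 1000 employees, H0: p = 0.25
z_stat, p_val = one_sample_proportion_ztest(count=240, nobs=1000, value=0.25)
print("z statistic:", round(z_stat, 4))
print("p-value:", round(p_val, 4))
```

With these made-up counts the sample proportion (0.24) is within one standard error of 0.25, so the test would not reject the null hypothesis at the 5% significance level.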
