# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [3]:
# import numpy and pandas
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import sem

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [6]:
# Run this code:
salaries = pd.read_csv(r"C:\Users\igriz\Documents\BOOTCAMP 2024\WEEK 4\DAY2\lab-hypothesis-testing-en-main\lab-hypothesis-testing-en-main\data\Current_Employee_Names__Salaries__and_Position_Titles.csv")

Examine the `salaries` dataset using the `head` function below.

In [9]:
# head is a method, so it needs parentheses; without them pandas only prints the bound method
salaries.head()


                  Name                              Job Titles  \
0    AARON,  JEFFERY M                                SERGEANT   
1      AARON,  KARINA   POLICE OFFICER (ASSIGNED AS DETECTIVE)   
2  AARON,  KIMBERLEI R                CHIEF CONTRACT EXPEDITER   
3  ABAD JR,  VICENTE M                       CIVIL ENGINEER IV   
4    ABASCAL,  REECE E             TRAFFIC CONTROL AIDE-HOURLY   

             Department Full or Part-Time Salary or Hourly  Typical Hours  \
0                POLICE     

# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from SciPy and perform a two-sided hypothesis test at the 5% significance level (the counterpart of a 95% confidence level).

In [15]:
from scipy.stats import ttest_1samp

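# Keep only the rows that have an hourly rate (the hourly workers)
hourly_workers = salaries[salaries['Hourly Rate'].notnull()]

wage = hourly_workers['Hourly Rate']

# Two-sided one-sample t-test of H0: mean hourly wage == 30
tstat, pval = ttest_1samp(wage, 30)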

print(f'T-statistic: {tstat}')
print(f'P value: {pval}')

if pval < .05:
    print('Null hypothesis rejected')
else:
    print('Failed to reject null hypothesis')

T-statistic: 20.6198057854942
P value: 4.3230240486229894e-92
Null hypothesis rejected
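
To make the test above less of a black box, here is a minimal sketch that recomputes the one-sample t statistic and the two-sided p-value by hand and compares them with `ttest_1samp`. The wages are synthetic (the `sample` array and its parameters are assumptions made for illustration, not the Chicago data), so the numbers will not match the output above.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Synthetic hourly wages, used only for illustration (NOT the Chicago dataset)
rng = np.random.default_rng(0)
sample = rng.normal(loc=32.0, scale=12.0, size=500)

mu0 = 30  # hypothesized population mean under H0
n = sample.size

# t = (sample mean - mu0) / (sample standard deviation / sqrt(n))
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under H0
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)

# SciPy's one-sample t-test should agree with the manual computation
t_scipy, p_scipy = ttest_1samp(sample, mu0)

print(f'manual: t={t_manual:.4f}, p={p_manual:.4g}')
print(f'scipy:  t={t_scipy:.4f}, p={p_scipy:.4g}')
```

The manual and SciPy values should agree up to floating-point error, which shows that the test only depends on the sample mean, the sample standard deviation, and the sample size.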


# Challenge 3 - Constructing Confidence Intervals

While testing our hypothesis is a great way to gather empirical evidence for rejecting or failing to reject the hypothesis, another way to gather evidence is by creating a confidence interval. A confidence interval gives us a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the true population mean in about 95% of repeated samples; any single interval either contains the true mean or it does not.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
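
The "95%" in a confidence interval describes the procedure over many repeated samples. A small simulation can illustrate this. It is only a sketch: the population parameters (`true_mean`, `true_sd`) and the sample sizes are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)

# Hypothetical population and experiment sizes (assumptions for illustration)
true_mean, true_sd = 30.0, 10.0
n, n_experiments = 50, 2000

# Draw many independent samples, one per row
samples = rng.normal(true_mean, true_sd, size=(n_experiments, n))

# Build a 95% t-interval for the mean from each sample
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit = t.ppf(0.975, df=n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f'Fraction of intervals containing the true mean: {coverage:.3f}')
```

The printed fraction should land close to 0.95, with some variation from run to run if the seed is changed.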


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of rows minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [27]:
meanwage = np.mean(wage)
std_error = sem(wage)  # standard error of the mean (ddof=1 by default)

confidence = 0.95
degrees_of_freedom = len(wage) - 1

confidence_interval = t.interval(
    confidence,
    degrees_of_freedom,
    loc=meanwage,
    scale=std_error
)

print(f'The mean hourly wage is {meanwage:.2f}')
print(f'95% confidence interval: {confidence_interval}')

The mean hourly wage is 32.79
95% confidence interval: (32.52345834488425, 33.05365708767623)
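
Note that $30 lies below the interval reported above (about 32.52 to 33.05), which agrees with the rejection of the null hypothesis in Challenge 2. As a sketch of what `t.interval` does internally, the interval is the sample mean plus or minus a critical t value times the standard error. The data below are synthetic (an assumption for illustration), so the printed bounds will differ from the ones above.

```python
import numpy as np
from scipy.stats import t, sem

# Synthetic hourly wages, used only for illustration (NOT the Chicago dataset)
rng = np.random.default_rng(1)
sample = rng.normal(loc=32.0, scale=12.0, size=400)

confidence = 0.95
df = sample.size - 1
mean = sample.mean()
std_error = sem(sample)

# Critical value that leaves (1 - confidence) / 2 in each tail
t_crit = t.ppf(1 - (1 - confidence) / 2, df)
manual_interval = (mean - t_crit * std_error, mean + t_crit * std_error)

# Same interval from SciPy's helper
scipy_interval = t.interval(confidence, df, loc=mean, scale=std_error)

print(f'manual: {manual_interval}')
print(f'scipy:  {scipy_interval}')
```

Both intervals should be identical up to floating-point error.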


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 5% significance level (95% confidence level).

In [35]:
# Your code here
from statsmodels.stats.proportion import proportions_ztest

# Count the hourly workers and the total number of employees
hourly = salaries[salaries['Salary or Hourly'] == 'Hourly'].shape[0]
total = salaries.shape[0]

null = 0.25  # H0: 25% of workers are hourly
zstat, pval = proportions_ztest(hourly, total, null)


print(f'Total workers: {total}')
print(f'Total hourly workers: {hourly}')
print(f'Z-statistic: {zstat:.2f}')
print(f'P-value: {pval:.5f}')

# Compare against the 5% significance level
if pval < 0.05:
    print('Null hypothesis rejected, the proportion of hourly workers is significantly different from 25%')
else:
    print('Failed to reject the null hypothesis, the proportion of hourly workers is not significantly different from 25%')


Total workers: 33183
Total hourly workers: 8022
Z-statistic: -3.51
P-value: 0.00045
Null hypothesis rejected, the proportion of hourly workers is significantly different from 25%
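
The z statistic above can be checked by hand from the two counts printed in the output (8022 hourly workers out of 33183). The sketch below assumes the default behaviour of `proportions_ztest`, which estimates the standard error from the sample proportion; it uses only NumPy and SciPy.

```python
import numpy as np
from scipy.stats import norm

# Counts taken from the printed output above
hourly = 8022
total = 33183
p0 = 0.25  # proportion under the null hypothesis

p_hat = hourly / total  # observed proportion of hourly workers

# Standard error based on the sample proportion
# (assumed to match the default of statsmodels' proportions_ztest)
std_error = np.sqrt(p_hat * (1 - p_hat) / total)

z = (p_hat - p0) / std_error
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f'Observed proportion: {p_hat:.4f}')
print(f'Z-statistic: {z:.2f}')
print(f'P-value: {p_value:.5f}')
```

The observed proportion is about 24.2%, slightly below 25%. With more than 33,000 employees, even this small gap is statistically significant, which is a reminder that statistical significance is not the same as practical importance.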
