# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [58]:
# Import numpy, pandas, and the statistical functions used in this lab
import numpy as np
import pandas as pd
from scipy.stats import (
    beta, f_oneway, gaussian_kde, mode, norm, pearsonr, sem, skew,
    spearmanr, t, trim_mean, ttest_1samp, ttest_ind,
)
from statsmodels.stats.weightstats import ztest

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [61]:
# Run this code:
salaries = pd.read_csv('/Users/paolarivera/Documents/Ironhack/Week 4/Day 2/lab-hypothesis-testing-en-main/data/Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [64]:
print(salaries.head())

                  Name                              Job Titles  \
0    AARON,  JEFFERY M                                SERGEANT   
1      AARON,  KARINA   POLICE OFFICER (ASSIGNED AS DETECTIVE)   
2  AARON,  KIMBERLEI R                CHIEF CONTRACT EXPEDITER   
3  ABAD JR,  VICENTE M                       CIVIL ENGINEER IV   
4    ABASCAL,  REECE E             TRAFFIC CONTROL AIDE-HOURLY   

         Department Full or Part-Time Salary or Hourly  Typical Hours  \
0            POLICE                 F           Salary            NaN   
1            POLICE                 F           Salary            NaN   
2  GENERAL SERVICES                 F           Salary            NaN   
3       WATER MGMNT                 F           Salary            NaN   
4              OEMC                 P           Hourly           20.0   

   Annual Salary  Hourly Rate  
0       101442.0          NaN  
1        94122.0          NaN  
2       101592.0          NaN  
3       110064.0          NaN  
4   

In [66]:
salaries.columns

Index(['Name', 'Job Titles', 'Department', 'Full or Part-Time',
       'Salary or Hourly', 'Typical Hours', 'Annual Salary', 'Hourly Rate'],
      dtype='object')

In [91]:
print(salaries.isnull().sum())

Name                     0
Job Titles               0
Department               0
Full or Part-Time        0
Salary or Hourly         0
Typical Hours        25161
Annual Salary         8022
Hourly Rate          25161
dtype: int64
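
The null counts above suggest that the missing values follow the pay type: salaried rows have no hourly rate, and hourly rows have no annual salary. A minimal sketch of how to check this, using a small made-up DataFrame that only borrows the column names of the lab dataset (the values are illustrative, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Toy data with the same column names as the lab dataset; the values are made up.
toy = pd.DataFrame({
    "Salary or Hourly": ["Salary", "Salary", "Hourly", "Salary", "Hourly"],
    "Typical Hours": [np.nan, np.nan, 20.0, np.nan, 40.0],
    "Annual Salary": [101442.0, 94122.0, np.nan, 110064.0, np.nan],
    "Hourly Rate": [np.nan, np.nan, 19.5, np.nan, 35.0],
})

# Exact category labels (comparisons against them are case-sensitive).
label_counts = toy["Salary or Hourly"].value_counts()

# Count missing values per pay type for each numeric column.
numeric_cols = ["Typical Hours", "Annual Salary", "Hourly Rate"]
missing_by_type = toy[numeric_cols].isnull().groupby(toy["Salary or Hourly"]).sum()

print(label_counts)
print(missing_by_type)
```

Running the same two checks on `salaries` would show which rows carry the NaN values before any test is run on a single column.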


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the mean hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from SciPy and perform a two-sided hypothesis test at the 95% confidence level (alpha = 0.05).

In [69]:
# Extract the hourly rate column. It is NaN for salaried employees, and
# ttest_1samp returns nan for both the statistic and the p-value if any
# NaN is left in the sample, so drop the missing values first.
hourly_wage = salaries['Hourly Rate'].dropna()

# Perform the two-sided one-sample t-test against a mean of $30/hr
t_stat, p_value = ttest_1samp(hourly_wage, 30)

# Output the results
print(f'T-statistic: {t_stat}, P-value: {p_value}')

# Conclusion at the 5% significance level (alpha = 0.05)
if p_value < 0.05:
    print("Reject the null hypothesis: The hourly wage is significantly different from $30/hr.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference from $30/hr.")
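
The effect of missing values on the one-sample t-test can be sketched on a small made-up sample. The numbers below are illustrative only; the point is that `ttest_1samp` propagates NaN by default, and that on a cleaned sample its statistic matches the textbook formula t = (mean - mu0) / (s / sqrt(n)).

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up hourly rates; NaN stands in for salaried employees with no hourly rate.
rates = np.array([28.0, 35.5, np.nan, 19.5, 41.0, np.nan, 33.25, 30.0])

# With the NaNs left in, both the statistic and the p-value come back as nan.
raw_stat, raw_p = ttest_1samp(rates, 30)

# Drop the NaNs, then compute t = (mean - mu0) / (s / sqrt(n)) by hand.
clean = rates[~np.isnan(rates)]
n = clean.size
manual_t = (clean.mean() - 30) / (clean.std(ddof=1) / np.sqrt(n))

# SciPy gives the same statistic on the cleaned sample.
t_stat, p_value = ttest_1samp(clean, 30)

print(f"with NaN: t={raw_stat}, p={raw_p}")
print(f"cleaned:  t={t_stat:.4f} (manual {manual_t:.4f}), p={p_value:.4f}")
```

A nan p-value always fails the `p_value < 0.05` check, so an uncleaned sample silently lands in the "fail to reject" branch.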


# Challenge 3 - Constructing Confidence Intervals

Hypothesis testing is one way to gather empirical evidence for or against a hypothesis; another is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. A 95% confidence interval comes from a procedure that, over repeated sampling, produces intervals containing the true population mean 95% of the time.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of observations minus 1 for the degrees of freedom, the sample mean for the location parameter, and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [71]:
# Drop missing values (the hourly rate is NaN for salaried employees);
# otherwise the standard error, and with it the interval, comes out as nan
wages = salaries['Hourly Rate'].dropna()

# Calculate the mean and standard error
mean_wage = np.mean(wages)
std_error = sem(wages)

# Degrees of freedom (number of observations - 1)
df = len(wages) - 1

# Compute the 95% confidence interval
conf_interval = t.interval(0.95, df, loc=mean_wage, scale=std_error)

# Output the confidence interval
print(f'95% Confidence Interval: {conf_interval}')
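
As a rough check on what `t.interval` does with these arguments, the interval can also be built by hand as mean ± t_crit × standard error. The sample below is made up and contains no missing values; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import sem, t

# Made-up hourly rates with no missing values.
sample = np.array([28.0, 35.5, 19.5, 41.0, 33.25, 30.0])

n = sample.size
mean = sample.mean()
std_error = sem(sample)  # s / sqrt(n), using the sample standard deviation (ddof=1)
dof = n - 1

# Two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile of t.
t_crit = t.ppf(0.975, dof)
manual = (mean - t_crit * std_error, mean + t_crit * std_error)

# t.interval with loc and scale gives the same bounds.
lower, upper = t.interval(0.95, dof, loc=mean, scale=std_error)

print(f"manual:     {manual}")
print(f"t.interval: {(lower, upper)}")
```

The degrees of freedom should come from the number of observations actually used, so any NaN values need to be dropped before counting.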


# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a specified value.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [85]:
from statsmodels.stats.proportion import proportions_ztest

# Filter the dataset to get only hourly employees. The labels in this column
# are capitalized ('Hourly', 'Salary') and the comparison is case-sensitive;
# matching on 'hourly' selects no rows, which makes the z-statistic -inf.
hourly_employees = salaries[salaries['Salary or Hourly'] == 'Hourly']

# Number of hourly workers and total number of employees
hourly_worker_count = len(hourly_employees)  # Count of hourly workers
total_employee_count = len(salaries)  # Total count of employees

# Perform the proportions z-test
count = hourly_worker_count
nobs = total_employee_count
value = 0.25  # Null hypothesis proportion (25%)

stat, p_value = proportions_ztest(count, nobs, value)

# Output the results
print(f'Z-statistic: {stat}, P-value: {p_value}')

# Conclusion at the 5% significance level (alpha = 0.05)
if p_value < 0.05:
    print("Reject the null hypothesis: The proportion of hourly workers is significantly different from 25%.")
else:
    print("Fail to reject the null hypothesis: The proportion of hourly workers is not significantly different from 25%.")


In [87]:
# Same test with a guard against an empty selection. The label must be
# 'Hourly' (capitalized); a lowercase 'hourly' matches no rows.
hourly_employees = salaries[salaries['Salary or Hourly'] == 'Hourly']

# Number of hourly workers and total number of employees
hourly_worker_count = len(hourly_employees)  # Count of hourly workers
total_employee_count = len(salaries)  # Total count of employees

# Check if there are any hourly workers or total employees
if hourly_worker_count == 0 or total_employee_count == 0:
    print("Error: No hourly workers or total employees to perform the test.")
else:
    # Perform the proportions z-test
    count = hourly_worker_count
    nobs = total_employee_count
    value = 0.25  # Null hypothesis proportion (25%)

    stat, p_value = proportions_ztest(count, nobs, value)

    print(f'Z-statistic: {stat}, P-value: {p_value}')

    if p_value < 0.05:
        print("Reject the null hypothesis: The proportion of hourly workers is significantly different from 25%.")
    else:
        print("Fail to reject the null hypothesis: The proportion of hourly workers is not significantly different from 25%.")
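
To see why an empty selection breaks the test, it can help to write the one-sample proportion z-statistic out by hand. The sketch below assumes the variance is estimated from the sample proportion, which is my understanding of the default behavior of `proportions_ztest`; the helper `one_sample_prop_z` is hypothetical and not part of `statsmodels`, and the counts are made up.

```python
import numpy as np
from scipy.stats import norm


def one_sample_prop_z(count, nobs, value):
    """Hypothetical helper: two-sided one-sample z-test for a proportion.

    The standard error uses the sample proportion p_hat, not the null value.
    """
    p_hat = count / nobs
    std = np.sqrt(p_hat * (1 - p_hat) / nobs)
    if std == 0:
        # p_hat is 0 or 1, so (p_hat - value) / std would divide by zero.
        raise ValueError("sample proportion is 0 or 1; the z statistic is undefined")
    z = (p_hat - value) / std
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p_value


# Made-up counts: 30 hourly workers out of 100 employees, tested against 25%.
z, p = one_sample_prop_z(count=30, nobs=100, value=0.25)
print(f"z = {z:.4f}, p = {p:.4f}")
```

With `count=0` the sample proportion is 0, the standard error is 0, and the statistic is a division by zero, so a z-statistic of `-inf` is a sign that the filter matched no rows rather than a real result.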
