# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [50]:
# import numpy and pandas
import pandas as pd
import numpy as np
from scipy.stats import trim_mean, mode, skew, gaussian_kde, pearsonr, spearmanr, beta
from statsmodels.stats.weightstats import ztest

from scipy.stats import ttest_ind, norm, t
from scipy.stats import f_oneway
from scipy.stats import ttest_1samp as t1s
# general import of the whole stats module, used as stats.<function>
from scipy import stats

# Challenge 1 - Exploring the Data

In this challenge, we will examine all salaries of employees of the City of Chicago. We will start by loading the dataset and examining its contents.

In [35]:
# Run this code:
path= "../data/Current_Employee_Names__Salaries__and_Position_Titles.csv"
data= pd.read_csv(path)

salaries= pd.DataFrame(data)
#salaries = pd.read_csv('../Current_Employee_Names__Salaries__and_Position_Titles.csv')

Examine the `salaries` dataset using the `head` function below.

In [36]:
salaries.head(3)

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,
1,"AARON, KARINA",POLICE OFFICER (ASSIGNED AS DETECTIVE),POLICE,F,Salary,,94122.0,
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,F,Salary,,101592.0,


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the hourly wage of all hourly workers is significantly different from $30/hr. Import the correct one-sample test function from scipy and perform a two-sided hypothesis test at the 95% confidence level (significance level α = 0.05).

In [37]:
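# Note: this derives an hourly figure from "Annual Salary" (2080 = 40 hours x 52 weeks),
# so it appears to be NaN for hourly workers, whose pay sits in the "Hourly Rate" column.
salaries["Hourly Wage"]= salaries["Annual Salary"] / 2080
x= salaries["Hourly Wage"]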

In [38]:
stat, p_value= t1s(x, popmean= 30, alternative= "two-sided")
print(f"t-statistic: {stat}\np-value: {p_value}")

t-statistic: nan
p-value: nan


In [39]:
salaries.shape[0]

33183

In [40]:
salaries["Hourly Wage"].isnull().sum()

8022

In [41]:
x= salaries["Hourly Wage"].dropna()
x.isnull().sum()

0
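
The `nan` result above is what `ttest_1samp` returns when its input contains missing values and the default `nan_policy="propagate"` is in effect. A minimal sketch on made-up wages (the values are illustrative, not taken from the dataset) suggests that dropping the NaNs first and passing `nan_policy="omit"` lead to the same statistic:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up hourly wages with two missing entries (illustration only)
wages = pd.Series([28.5, np.nan, 31.0, 35.2, np.nan, 29.9, 33.4])

# Default behaviour: NaN values propagate into the result
res_propagate = stats.ttest_1samp(wages.to_numpy(), popmean=30)

# Option 1: drop missing values before testing
res_dropna = stats.ttest_1samp(wages.dropna().to_numpy(), popmean=30)

# Option 2: let SciPy skip missing values
res_omit = stats.ttest_1samp(wages.to_numpy(), popmean=30, nan_policy="omit")

print(f"propagate: {res_propagate.statistic}")
print(f"dropna:    {res_dropna.statistic:.4f}")
print(f"omit:      {res_omit.statistic:.4f}")
```

Either option works; dropping the NaNs explicitly, as the notebook does, keeps the sample size visible.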

In [46]:
stat, p_value= t1s(x, popmean= 30, alternative= "two-sided")
print(f"t-statistic: {stat:.4f}\np-value: {p_value:.10f}")

t-statistic: 183.8436
p-value: 0.0000000000


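**t-statistic**: 183.8436 means the sample mean lies about 184 standard errors above the hypothesized $30/hr
**p-value**: is effectively 0 (below the printed precision), so a sample mean this far from $30/hr would be extremely unlikely if the true mean were $30/hr

**H₀: μ = 30** → we reject the null hypothesis at the 5% significance level

**H₁: μ ≠ 30** → the data give strong evidence in favor of the alternative hypothesis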
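
To make the t-statistic less of a black box, here is a minimal sketch that computes it by hand as t = (x̄ − μ₀) / (s / √n) and compares it with `ttest_1samp`. The sample is synthetic (an assumption for illustration, not the Chicago data), and the names `sample`, `mu_0`, `t_manual` and `p_manual` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic wages, used only for illustration (not the Chicago data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=32.0, scale=5.0, size=200)
mu_0 = 30.0  # hypothesized population mean

# t = (sample mean - mu_0) / standard error, with ddof=1 for the sample std
n = sample.size
t_manual = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Same test through SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu_0)

print(f"manual: t={t_manual:.4f}, p={p_manual:.6g}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.6g}")
```

The two lines should agree, which shows that the statistic measures distance from μ₀ in units of standard error rather than in dollars.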

# Challenge 3 - Constructing Confidence Intervals

While testing our hypothesis is a great way to gather empirical evidence for rejecting or failing to reject the hypothesis, another way to gather evidence is by creating a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the true mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
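
One way to see what the 95% refers to is a small simulation: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval contains the true mean. The sketch below assumes a normal population with hypothetical parameters (`true_mean`, `true_sd`); it illustrates the idea and is not part of the lab.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, true_sd = 40.0, 10.0  # hypothetical population parameters
n, n_trials = 50, 2000           # sample size and number of repeated samples

covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, true_sd, size=n)
    # 95% t-interval around the sample mean
    low, high = stats.t.interval(
        0.95, df=n - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    covered += int(low <= true_mean <= high)

coverage = covered / n_trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, which is the sense in which the interval is "95% confident".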


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

The confidence interval is computed in SciPy using the `t.interval` function. You can read more about this function [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html).

To compute the confidence interval of the hourly wage, use 0.95 for the confidence level, the number of non-missing observations - 1 for the degrees of freedom, the mean of the sample for the location parameter and the standard error for the scale. The standard error can be computed using [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html) function in SciPy.

In [None]:
# Your code here
# The confidence interval (CI) needs: the confidence level, df (degrees of freedom, not a DataFrame), the sample mean and the standard error.
# We can either reuse 'x' (NaNs already dropped) or compute directly from the DataFrame with nan_policy="omit" to skip NaN values.

sample_mean= salaries["Hourly Wage"].mean()
sample_standard_error= stats.sem(salaries["Hourly Wage"], nan_policy= "omit")
# count() ignores NaN, so the degrees of freedom match the observations actually used
dof= salaries["Hourly Wage"].count() - 1

In [63]:
ci= stats.t.interval(
  confidence= 0.95,
  df= dof,
  loc= sample_mean,
  scale= sample_standard_error
)

print(f"Confidence Interval: min({ci[0]:.4f}) - max({ci[1]:.4f})")

Confidence Interval: min(41.5995) - max(41.8495)


We are 95% confident that the true mean of `Hourly Wage` is between approximately `$41.60` and `$41.85`.

>Since 30 lies far outside this interval, the interval gives further evidence for rejecting *H₀: μ = 30*
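
The interval returned by `t.interval` is the sample mean plus or minus a critical t value times the standard error. A minimal sketch on synthetic data (the `wages` array is made up for illustration) reproduces it by hand; the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy import stats

# Synthetic wages, for illustration only
rng = np.random.default_rng(1)
wages = rng.normal(loc=41.0, scale=10.0, size=500)

n = wages.size
mean = wages.mean()
se = stats.sem(wages)  # sample std (ddof=1) divided by sqrt(n)

# Critical value leaving 2.5% in each tail of the t distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (mean - t_crit * se, mean + t_crit * se)

# Same interval through SciPy
scipy_ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(f"manual: ({manual_ci[0]:.4f}, {manual_ci[1]:.4f})")
print(f"scipy:  ({scipy_ci[0]:.4f}, {scipy_ci[1]:.4f})")
```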

# Challenge 4 - Hypothesis Tests of Proportions

Another type of one-sample test is a hypothesis test of proportions. In this test, we examine whether the proportion of a group in our sample is significantly different from a given fraction.

You can read more about one sample proportion tests [here](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/SAS/SAS6-CategoricalData/SAS6-CategoricalData2.html).

In the cell below, use the `proportions_ztest` function from `statsmodels` to perform a hypothesis test that will determine whether the proportion of hourly workers in the City of Chicago is significantly different from 25% at the 95% confidence level.

In [73]:
salaries.head(1)

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate,Hourly Wage
0,"AARON, JEFFERY M",SERGEANT,POLICE,F,Salary,,101442.0,,48.770192


In [98]:
hourly_workers= salaries.dropna(subset=["Salary or Hourly"])
hourly_workers= (hourly_workers["Salary or Hourly"] == "Hourly").sum()

In [99]:
overall_workers_sample= salaries.dropna(subset=["Salary or Hourly"])
overall_workers_sample= len(overall_workers_sample)

In [100]:
print(hourly_workers)
print(overall_workers_sample)

8022
33183


In [101]:
# proportions_ztest from statsmodels needs: count (number of hourly workers), nobs (total sample size), value (hypothesized population proportion), alternative ("two-sided" in this case)
from statsmodels.stats.proportion import proportions_ztest

z_stat, p_value= proportions_ztest(
  count= hourly_workers,
  nobs= overall_workers_sample,
  value= 0.25,
  alternative= "two-sided"
)

print(f"z-stat: {z_stat:.4f}\np-value: {p_value:.10f}")

z-stat: -3.5100
p-value: 0.0004481127


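`z-stat` indicates how many standard errors the sample proportion lies from the hypothesized value. The minus sign `-` says it is below H₀ *(25%)*

`p-value` is well below 0.05, so we reject H₀: the proportion of Chicago employees paid hourly (8022 / 33183 ≈ 24.2%) is significantly different from *25%*, although the gap is small in absolute terms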
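
The z-statistic above can be checked by hand from the two counts printed earlier (8022 hourly workers out of 33183 employees). The sketch below assumes the default behaviour of `proportions_ztest`, which bases the standard error on the sample proportion rather than on the hypothesized one; the variable names are hypothetical.

```python
import math
from scipy.stats import norm

count, nobs = 8022, 33183  # counts printed earlier in the notebook
p_0 = 0.25                 # hypothesized proportion under H0

p_hat = count / nobs  # sample proportion of hourly workers

# Standard error based on the sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / nobs)

z = (p_hat - p_0) / se
p_value = 2 * norm.sf(abs(z))  # two-sided p-value from the standard normal

print(f"p_hat: {p_hat:.4f}")
print(f"z-stat: {z:.4f}")
print(f"p-value: {p_value:.10f}")
```

The sample proportion is only about 0.8 percentage points below 25%; with more than 33,000 observations even a gap this small is statistically significant.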