# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [1]:
# Import numpy and pandas

import numpy as np
import pandas as pd

# Challenge 1 - Exploring the Data

In this challenge, we will examine the salaries of all employees of the City of San Francisco. We will start by loading the dataset and examining its contents.

In [2]:
# Loading data

df = pd.read_csv('data/Salaries.csv')


Examine the `salaries` dataset using the `head` function below.

In [3]:
# Checking data head

df.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


The output of `head` shows that there is quite a bit of missing data. Get the number of missing values in every column.

In [4]:
# Checking df for nulls

df.isna().sum()

Id                       0
EmployeeName             0
JobTitle                 0
BasePay                605
OvertimePay              0
OtherPay                 0
Benefits             36159
TotalPay                 0
TotalPayBenefits         0
Year                     0
Notes               148654
Agency                   0
Status              110535
dtype: int64
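
As a minimal sketch on a made-up frame (the column values below are hypothetical, not the salaries data), the missing counts can also be expressed as percentages, which makes columns that are entirely empty easy to spot:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the salaries data
toy = pd.DataFrame({
    "BasePay": [100.0, np.nan, 300.0, 400.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan],
    "Status": ["FT", None, "PT", None],
})

missing_count = toy.isna().sum()        # number of missing values per column
missing_pct = toy.isna().mean() * 100   # share of missing values per column, in %

summary = pd.DataFrame({"missing": missing_count, "pct": missing_pct})
print(summary)

# Columns in which every value is missing are natural candidates for dropping
all_missing = missing_pct[missing_pct == 100].index.tolist()
print(all_missing)
```

Comparing the count of missing values with the number of rows gives the same information, which is what the next two cells do.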

Get the shape of the dataframe

In [5]:
# Checking df shape

df.shape

(148654, 13)

Given the output of the previous two cells, drop the column that contains only missing values and recompute the number of missing values per column.

In [6]:
# Since the Notes column only has missing values, we are going to drop it

df = df.drop(columns='Notes')
df.isna().sum()

Id                       0
EmployeeName             0
JobTitle                 0
BasePay                605
OvertimePay              0
OtherPay                 0
Benefits             36159
TotalPay                 0
TotalPayBenefits         0
Year                     0
Agency                   0
Status              110535
dtype: int64

Check the possible values of the "Status" column.

In [7]:
# Checking possible values of the column 'Status'

df['Status'].unique()

array([nan, 'PT', 'FT'], dtype=object)

Drop any row with missing values in the "Status" column and compute again the number of missing values.

In [8]:
# Dropping rows with nulls in the 'Status' column

df = df.dropna(subset=['Status'])
df.isna().sum()

Id                  0
EmployeeName        0
JobTitle            0
BasePay             0
OvertimePay         0
OtherPay            0
Benefits            0
TotalPay            0
TotalPayBenefits    0
Year                0
Agency              0
Status              0
dtype: int64

Check out the types of each column and see if they make sense.

In [9]:
# Checking out data types for each column

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38119 entries, 110531 to 148653
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                38119 non-null  int64  
 1   EmployeeName      38119 non-null  object 
 2   JobTitle          38119 non-null  object 
 3   BasePay           38119 non-null  object 
 4   OvertimePay       38119 non-null  object 
 5   OtherPay          38119 non-null  object 
 6   Benefits          38119 non-null  object 
 7   TotalPay          38119 non-null  float64
 8   TotalPayBenefits  38119 non-null  float64
 9   Year              38119 non-null  int64  
 10  Agency            38119 non-null  object 
 11  Status            38119 non-null  object 
dtypes: float64(2), int64(2), object(8)
memory usage: 3.8+ MB


Columns related to pay and benefits should be float data type instead of object.

In [10]:
# Checking df head

df.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Agency,Status
110531,110532,David Shinn,Deputy Chief 3,129150.01,0.0,342802.63,38780.04,471952.64,510732.68,2014,San Francisco,PT
110532,110533,Amy P Hart,Asst Med Examiner,318835.49,10712.95,60563.54,89540.23,390111.98,479652.21,2014,San Francisco,FT
110533,110534,William J Coaker Jr.,Chief Investment Officer,257340.0,0.0,82313.7,96570.66,339653.7,436224.36,2014,San Francisco,PT
110534,110535,Gregory P Suhr,Chief of Police,307450.04,0.0,19266.72,91302.46,326716.76,418019.22,2014,San Francisco,FT
110535,110536,Joanne M Hayes-White,"Chief, Fire Department",302068.0,0.0,24165.44,91201.66,326233.44,417435.1,2014,San Francisco,FT


Do any type conversions and reset the index.

In [11]:
# Creating list with object cols that should be float
cols_to_convert = ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']

# Casting to float
df[cols_to_convert] = df[cols_to_convert].astype(float)

df.dtypes

Id                    int64
EmployeeName         object
JobTitle             object
BasePay             float64
OvertimePay         float64
OtherPay            float64
Benefits            float64
TotalPay            float64
TotalPayBenefits    float64
Year                  int64
Agency               object
Status               object
dtype: object
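
`astype(float)` raises an error if a column contains a string that cannot be parsed as a number. A hedged alternative, sketched here on a hypothetical toy frame, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` instead:

```python
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, plus one non-numeric entry
toy = pd.DataFrame({
    "BasePay": ["100.5", "200", "not available"],
    "Benefits": ["10", "20.25", "30"],
})

cols_to_convert = ["BasePay", "Benefits"]

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
converted = toy[cols_to_convert].apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
print(converted)
```

Whether coercing is appropriate depends on the data: it silently creates missing values, so it is worth counting them afterwards.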

In [12]:
# Resetting df index

df = df.reset_index(drop=True)
df.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Agency,Status
0,110532,David Shinn,Deputy Chief 3,129150.01,0.0,342802.63,38780.04,471952.64,510732.68,2014,San Francisco,PT
1,110533,Amy P Hart,Asst Med Examiner,318835.49,10712.95,60563.54,89540.23,390111.98,479652.21,2014,San Francisco,FT
2,110534,William J Coaker Jr.,Chief Investment Officer,257340.0,0.0,82313.7,96570.66,339653.7,436224.36,2014,San Francisco,PT
3,110535,Gregory P Suhr,Chief of Police,307450.04,0.0,19266.72,91302.46,326716.76,418019.22,2014,San Francisco,FT
4,110536,Joanne M Hayes-White,"Chief, Fire Department",302068.0,0.0,24165.44,91201.66,326233.44,417435.1,2014,San Francisco,FT


Check out if "TotalPayBenefits" = "BasePay" + "OvertimePay" + "OtherPay" + "Benefits"

In [13]:
# Creating bool series to check if the assumption is True for each row

pay_col_checker = df['TotalPayBenefits'] == df['BasePay'] + df['OvertimePay'] + df['OtherPay'] + df['Benefits']
pay_col_checker

0         True
1        False
2         True
3        False
4         True
         ...  
38114     True
38115     True
38116     True
38117     True
38118     True
Length: 38119, dtype: bool

What is the percentage of employees for which the previous assumption is not True?

In [14]:
# Computing the % of employees for which the assumption is not True

pay_col_checker.value_counts() / len(pay_col_checker) * 100

True     73.708649
False    26.291351
Name: count, dtype: float64
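
Exact `==` comparison of floats can report `False` for sums that differ only by rounding error, so some of the mismatches may not be genuine. A minimal sketch on made-up numbers (not the salaries data), assuming a tolerance of one cent is acceptable, compares the exact check with `np.isclose`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values chosen so that one row differs only by floating-point rounding
toy = pd.DataFrame({
    "BasePay": [0.1, 100.0, 50.0],
    "OtherPay": [0.2, 25.5, 10.0],
    "TotalPay": [0.3, 125.5, 65.0],
})

components_sum = toy["BasePay"] + toy["OtherPay"]

exact_match = toy["TotalPay"] == components_sum
close_match = pd.Series(
    np.isclose(toy["TotalPay"], components_sum, atol=0.01),  # tolerance of one cent
    index=toy.index,
)

print(exact_match.tolist())
print(close_match.tolist())
```

The first row fails the exact check only because `0.1 + 0.2` is not exactly `0.3` in floating point; the last row is a real mismatch and fails both checks.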

There are different departments in the city. List all departments and the count of employees in each department.

In [15]:
# Calculating the number of employees in each department

grouped_departments = pd.DataFrame(df.groupby('JobTitle')['Id'].count())
grouped_departments

Unnamed: 0_level_0,Id
JobTitle,Unnamed: 1_level_1
"ACPO,JuvP, Juv Prob (SFERS)",1
ASR Senior Office Specialist,22
ASR-Office Assistant,15
Account Clerk,93
Accountant I,2
...,...
Wire Rope Cable Maint Sprv,1
Worker's Comp Supervisor 1,6
Worker's Compensation Adjuster,26
X-Ray Laboratory Aide,35
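
As an alternative sketch, `value_counts` produces the same counts sorted from the most to the least frequent value, which is convenient when the largest group is needed. The job titles below are hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"JobTitle": ["Clerk", "Nurse", "Clerk", "Clerk", "Nurse", "Pilot"]})

counts = toy["JobTitle"].value_counts()   # sorted from most to least frequent
print(counts)

most_common_title = counts.index[0]
print(most_common_title)
```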


# Challenge 2 - Hypothesis Tests

In this section of the lab, we will test whether the hourly wage of **all FT workers is significantly different from $75/hr**. First compute the hourly wage by dividing "TotalPayBenefits" by 50 weeks (assuming 10 working days of holiday) and by 40 hours (assuming a 40-hour week).

$$\text{Hourly Wage} = \frac{\text{TotalPayBenefits}}{1\ \text{year}} \cdot \frac{1\ \text{year}}{50\ \text{weeks}} \cdot \frac{1\ \text{week}}{40\ \text{hr}}$$

Import the appropriate one-sample test function from scipy and perform a two-sided hypothesis test at the 95% confidence level.

In [16]:
# Computing Hourly_Wage column

df['Hourly_Wage'] = df['TotalPayBenefits'] / 50 / 40
df.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Agency,Status,Hourly_Wage
0,110532,David Shinn,Deputy Chief 3,129150.01,0.0,342802.63,38780.04,471952.64,510732.68,2014,San Francisco,PT,255.36634
1,110533,Amy P Hart,Asst Med Examiner,318835.49,10712.95,60563.54,89540.23,390111.98,479652.21,2014,San Francisco,FT,239.826105
2,110534,William J Coaker Jr.,Chief Investment Officer,257340.0,0.0,82313.7,96570.66,339653.7,436224.36,2014,San Francisco,PT,218.11218
3,110535,Gregory P Suhr,Chief of Police,307450.04,0.0,19266.72,91302.46,326716.76,418019.22,2014,San Francisco,FT,209.00961
4,110536,Joanne M Hayes-White,"Chief, Fire Department",302068.0,0.0,24165.44,91201.66,326233.44,417435.1,2014,San Francisco,FT,208.71755


In [17]:
# Computing mean hourly wage for FT employees

df_ft = df[df['Status'] == 'FT']

ft_mean_wage = df_ft['Hourly_Wage'].mean()
ft_mean_wage

69.26420640727143

In [18]:
# Your code here: (compute the t_statistic). Take into account that this dataset is a sample of a real population.
# Remember that you only need to consider "FT" employees

import scipy.stats as st

degrees_freedom = len(df_ft) - 1
alpha = 0.05

# Calculating t statistic
t = np.mean(df_ft['Hourly_Wage'] - 75)/(np.std(df_ft['Hourly_Wage'], ddof=1)/np.sqrt(len(df_ft)))
print('t statistic: {:.3f}'.format(t))

t statistic: -35.806


In [19]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.

lower_cv = st.t.ppf(alpha/2, degrees_freedom)
upper_cv = st.t.ppf(1 - alpha/2, degrees_freedom)

print('Lower critical value: {:.3f}'.format(lower_cv))
print('Upper critical value: {:.3f}'.format(upper_cv))

Lower critical value: -1.960
Upper critical value: 1.960


#### The t statistic does not fall between the critical values, therefore we reject H0.

In [20]:
# Method 2: Use the p-value method.

p_val = 2 * st.t.cdf(-abs(t), degrees_freedom)  # two-sided test: probability in both tails
print('P-value: {:.3f}'.format(p_val))

P-value: 0.000


#### The p-value is practically zero, well below alpha = 0.05, therefore we reject H0

In [21]:
# Calculating using the ttest_1samp function from scipy

stat, p_val = st.ttest_1samp(df_ft['Hourly_Wage'], popmean=75, alternative='two-sided')
stat, p_val

(-35.80631941460526, 4.517831233839447e-273)

#### We arrive at the same conclusion with this statistic and p-value, since they match the ones we computed earlier.

Are all the methods in agreement?
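
One way to check the agreement is to run the three methods side by side. The sketch below does this for a two-sided test on synthetic data (the sample, its parameters and the variable names are assumptions for illustration, not the salaries data):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
sample = rng.normal(loc=70, scale=20, size=500)  # synthetic hourly wages
mu0 = 75
alpha = 0.05

n = len(sample)
dof = n - 1
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Method 1: two-sided critical values
lower_cv = st.t.ppf(alpha / 2, dof)
upper_cv = st.t.ppf(1 - alpha / 2, dof)
reject_by_cv = (t_stat < lower_cv) or (t_stat > upper_cv)

# Method 2: two-sided p-value (probability in both tails)
p_manual = 2 * st.t.sf(abs(t_stat), dof)
reject_by_p = p_manual < alpha

# Method 3: scipy's one-sample t-test (two-sided by default)
t_scipy, p_scipy = st.ttest_1samp(sample, popmean=mu0)

print(t_stat, t_scipy)
print(p_manual, p_scipy)
print(reject_by_cv, reject_by_p, p_scipy < alpha)
```

The three decisions coincide by construction: the critical values and the p-value are two views of the same t distribution.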

We are also curious about salaries in the police force. The chief of police in San Francisco claimed in a press briefing that salaries this year are **higher than last year's mean of $86000/year for all salaried employees** (use the column "TotalPayBenefits"). Test this hypothesis at the 95% confidence level.

Hint: Use apply and a lambda function to check if "Police" is in the "JobTitle" to get all the "Police" jobs.

In [22]:
# Compute the t_statistic. Take into account that this dataset is a sample of a real population.
# Remember that you only need to consider "Police" employees

# Filtering df
police_df = df[df['JobTitle'].apply(lambda x: 'Police' in x)]

degrees_freedom = len(police_df['JobTitle']) - 1
alpha = 0.05

# Calculating t
t = (np.mean(police_df['TotalPayBenefits']) - 86000) / (np.std(police_df['TotalPayBenefits'], ddof=1) / np.sqrt(len(police_df['TotalPayBenefits'])))
print('t statistic: {:.3f}'.format(t))

t statistic: 50.253


In [23]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.

critical_value = st.t.ppf(alpha, df=degrees_freedom)
print('Critical value: {:.3f}'.format(critical_value))

Critical value: -1.646


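#### critical_value < t: therefore we fail to reject H0

This is a left-tailed test (H0: mean >= 86000, H1: mean < 86000), so H0 would be rejected only if t fell below the critical value of -1.646. With t = 50.253, the data give no evidence that the mean is below 86000. Because t is large and positive, it also exceeds the right-tailed critical value of 1.646, so a test of H1: mean > 86000 rejects its null, which is the result that supports the chief's claim.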

In [24]:
# Method 2: Use the p-value method.

p_val = st.t.cdf(t, degrees_freedom)
print('p-value: {:.3f}'.format(p_val))

p-value: 1.000


#### alpha = 0.05 < p-value, therefore we fail to reject H0

In [25]:
# Method 3: Use the ttest_1samp function from scipy. 

statistic, p_val = st.ttest_1samp(police_df['TotalPayBenefits'], popmean=86000, alternative='less')
statistic, p_val

(50.252984742109945, 1.0)

#### We arrive at the same conclusion with this statistic and p-value, since they match the ones we computed earlier.

The workers from the "JobTitle" with the most employees have complained that their hourly wage is **less than $35/hour**. Using a one sample t-test, test this one-sided hypothesis at the 95% confidence level.

In [26]:
# Department with most employees

df['JobTitle'].value_counts().index[0]

'Transit Operator'

In [27]:
# Creating a variable of a pandas series of the Hourly_Wage column from a dataframe only containing rows with Transit Operator

df_operator = df[df['JobTitle'] == 'Transit Operator']

operator_wage = df_operator['Hourly_Wage']

In [28]:
# Computing the t statistic

degrees_freedom = len(operator_wage) - 1
alpha = 0.05

t = (np.mean(operator_wage) - 35) / (np.std(operator_wage, ddof=1) / np.sqrt(len(operator_wage)))
print('t statistic: {:.3f}'.format(t))

t statistic: 19.065


In [29]:
# Method 1: Critical value. Get the critical value and compare it against your statistic.

critical_value = st.t.ppf((1 - alpha), df=degrees_freedom)
print('Critical value: {:.3f}'.format(critical_value))

Critical value: 1.645


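#### t > critical_value : therefore we reject H0

The test here is right-tailed (H0: mean <= 35, H1: mean > 35). Rejecting H0 means the mean hourly wage of Transit Operators is significantly above $35/hour, so the data do not support the workers' complaint.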

In [30]:
# Method 2: Use the p-value method.

p_val = 1 - st.t.cdf(t, degrees_freedom)
print('p-value: {:.3f}'.format(p_val))

p-value: 0.000


#### p-value < alpha : therefore we reject H0

In [31]:
# Method 3: Use the ttest_1samp function from scipy. 

statistic, p_val = st.ttest_1samp(operator_wage, popmean=35, alternative='greater')
statistic, p_val

(19.064990324906383, 5.012771694204206e-76)

#### We arrive at the same conclusion with this statistic and p-value, since they match the ones we computed earlier.
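
In a one-sided test the direction matters: the critical value and the tail used for the p-value both depend on the alternative hypothesis. The helper below, `one_sided_t_test`, is a hypothetical name introduced for this sketch, and the sample is synthetic:

```python
import numpy as np
import scipy.stats as st

def one_sided_t_test(sample, mu0, alternative, alpha=0.05):
    """Return (t, critical value, p-value, reject H0?) for a one-sided one-sample t-test.

    alternative="greater" tests H1: mean > mu0; alternative="less" tests H1: mean < mu0.
    """
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    dof = n - 1
    t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
    if alternative == "greater":
        critical = st.t.ppf(1 - alpha, dof)   # reject when t is above this value
        p_val = st.t.sf(t_stat, dof)          # upper-tail probability
        reject = t_stat > critical
    elif alternative == "less":
        critical = st.t.ppf(alpha, dof)       # reject when t is below this value
        p_val = st.t.cdf(t_stat, dof)         # lower-tail probability
        reject = t_stat < critical
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return t_stat, critical, p_val, bool(reject)

rng = np.random.default_rng(1)
wages = rng.normal(loc=40, scale=10, size=200)  # synthetic data, not the salaries file

greater = one_sided_t_test(wages, mu0=35, alternative="greater")
less = one_sided_t_test(wages, mu0=35, alternative="less")
print(greater)
print(less)
```

The two p-values add up to 1 and the two critical values are mirror images, so at most one of the two directions can reject H0 for a given sample.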

# Challenge 3: To practice - Constructing Confidence Intervals

While testing a hypothesis is a good way to gather empirical evidence for rejecting or failing to reject it, another way to gather evidence is to construct a confidence interval. A confidence interval gives a range of plausible values for the true mean of the population. For a 95% confidence interval, the procedure used to build the interval captures the true mean in about 95% of repeated samples.

To read more about confidence intervals, click [here](https://en.wikipedia.org/wiki/Confidence_interval).
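
As a minimal sketch on synthetic data (the sample and its parameters are assumptions for illustration), the interval is the sample mean plus or minus the critical t value times the standard error, which is also what `st.t.interval` returns:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=15, size=400)  # synthetic hourly wages

confidence_level = 0.95
dof = len(sample) - 1
sample_mean = sample.mean()
standard_error = st.sem(sample)  # sample std (ddof=1) divided by sqrt(n)

# Half-width of the interval: critical t value times the standard error
t_crit = st.t.ppf(1 - (1 - confidence_level) / 2, dof)
margin = t_crit * standard_error
manual_interval = (sample_mean - margin, sample_mean + margin)

scipy_interval = st.t.interval(confidence_level, dof, loc=sample_mean, scale=standard_error)

print(manual_interval)
print(scipy_interval)
```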


In the cell below, we will construct a 95% confidence interval for the mean hourly wage of all hourly workers. 

To compute the confidence interval of the hourly wage, use 0.95 as the confidence level.

In [32]:
# Method 1: Get the critical values which correspond to a 95% confidence.

confidence_level = 0.95
degrees_freedom = len(df) - 1
sample_mean = np.mean(df['Hourly_Wage'])
sample_standard_error = st.sem(df['Hourly_Wage'])

confidence_interval = st.t.interval(confidence_level, degrees_freedom, sample_mean, sample_standard_error)

print('Sample mean: {:.2f}'.format(sample_mean))
print('Confidence_interval:', confidence_interval)

Sample mean: 50.13
Confidence_interval: (49.798255413084874, 50.46318325529572)


Now compute a 95% confidence interval for the hourly salary of all the Police employees.

In [33]:
police_df = df[df['JobTitle'].apply(lambda x: 'Police' in x)]
police_wage = police_df['Hourly_Wage']

confidence_level = 0.95
degrees_freedom = len(police_wage) - 1
sample_mean = np.mean(police_wage)
sample_standard_error = st.sem(police_wage)

confidence_interval = st.t.interval(confidence_level, degrees_freedom, sample_mean, sample_standard_error)

print('Sample mean: {:.2f}'.format(sample_mean))
print('Confidence_interval:', confidence_interval)

Sample mean: 74.17
Confidence_interval: (72.95787917364453, 75.39116295155273)


# Chi2 test

Now we want to know whether the proportion of full-time ("FT") and part-time ("PT") employees is the same across lawyers, medical staff, police, firemen and other departments.

Considering every job title in these groups of employees would be very time consuming. To simplify the process, first create a function that returns:

* "Policemen" if "Police" is found on "JobTitle"
* "Firemen" if "Fire" is found on "JobTitle"
* "Medical" if "Med" or "Nurse" is found on "JobTitle"
* "Lawyer" if "Attorney" is found on "JobTitle"
* "Other" in any other cases

Then, create a new column named "employee_group" that indicates which group each employee belongs to.

In [34]:
# Creating function to replace values in JobTitle

def replace_jobtitle(job_title):
    '''
    Replaces a string with another if it matches with generic job title words.
    '''
    if 'Police' in job_title:
        return 'Policemen'
    elif 'Fire' in job_title:
        return 'Firemen'
    elif 'Med' in job_title or 'Nurse' in job_title:
        return 'Medical'
    elif 'Attorney' in job_title:
        return 'Lawyer'
    else:
        return 'Other'

In [35]:
# Applying function to JobTitle

df['JobTitle'] = df['JobTitle'].apply(replace_jobtitle)
df['JobTitle'].unique()

array(['Other', 'Medical', 'Policemen', 'Firemen', 'Lawyer'], dtype=object)
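
For larger frames, a vectorized alternative to a row-by-row `apply` is `np.select` combined with `str.contains`. This is only a sketch: the job titles below are hypothetical, and the conditions mirror the order of the `if`/`elif` chain so that the first match wins:

```python
import numpy as np
import pandas as pd

# Hypothetical job titles for illustration
titles = pd.Series([
    "Police Officer 3", "Firefighter", "Registered Nurse",
    "Asst Med Examiner", "Deputy City Attorney", "Transit Operator",
])

# Conditions are checked in order; the first match wins, as in an if/elif chain
conditions = [
    titles.str.contains("Police", regex=False),
    titles.str.contains("Fire", regex=False),
    titles.str.contains("Med", regex=False) | titles.str.contains("Nurse", regex=False),
    titles.str.contains("Attorney", regex=False),
]
labels = ["Policemen", "Firemen", "Medical", "Lawyer"]

employee_group = pd.Series(np.select(conditions, labels, default="Other"), index=titles.index)
print(employee_group.tolist())
```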

Determine how many "PT" and "FT" employees there are in each employee group.

In [36]:
# Your code here: (Store the output dataframe into a new variable)

pivot = df.pivot_table(index='JobTitle', columns='Status', aggfunc='size')
pivot

Status,FT,PT
JobTitle,Unnamed: 1_level_1,Unnamed: 2_level_1
Firemen,1333,178
Lawyer,317,102
Medical,1028,2889
Other,18126,12245
Policemen,1530,371


Now try to compute the expected frequencies from the individual probabilities. Remember that the null hypothesis of the Chi2 test is that both variables (employee_group and FT/PT) are unrelated, that is, independent. To compute the expected frequencies you therefore need to compute the probability of each cell under independence and multiply it by the number of observations, i.e.:

$$\nu(x,y) = p(x,y) \cdot N = p(x) \cdot p(y) \cdot N$$

where "x" is the "employee_group" and "y" is the status (FT/PT).

Bear in mind that in general $p(x,y)\neq p(x) \cdot p(y)$; the equality holds only if x and y are independent. Independence is exactly what the null hypothesis assumes, which is why it can be used here to compute the expected frequencies.
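
The formula can be applied to every cell at once by taking the outer product of the two marginal distributions. The sketch below copies the observed counts from the pivot table above; the variable names are chosen here for illustration:

```python
import numpy as np
import pandas as pd

# Observed FT/PT counts per group, copied from the pivot table above
observed = pd.DataFrame(
    {"FT": [1333, 317, 1028, 18126, 1530], "PT": [178, 102, 2889, 12245, 371]},
    index=["Firemen", "Lawyer", "Medical", "Other", "Policemen"],
)

n = observed.to_numpy().sum()
p_group = observed.sum(axis=1) / n    # p(x): marginal probability of each group
p_status = observed.sum(axis=0) / n   # p(y): marginal probability of FT and PT

# Under independence p(x, y) = p(x) * p(y), so expected count = p(x) * p(y) * N
expected = pd.DataFrame(
    np.outer(p_group, p_status) * n,
    index=observed.index,
    columns=observed.columns,
)
print(expected)
```

The expected counts sum to the total number of observations, since the marginal probabilities each sum to 1.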

In [37]:
# Create an empty dataframe named "frequencies" to store the data.

frequencies = pd.DataFrame()

pivot['total'] = pivot['FT'] + pivot['PT']
pivot

Status,FT,PT,total
JobTitle,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Firemen,1333,178,1511
Lawyer,317,102,419
Medical,1028,2889,3917
Other,18126,12245,30371
Policemen,1530,371,1901


In [38]:
# Compute Expected frequency of being "Firemen" and "FT". Store the solution in a variable named "firemen_ft"

n = pivot['total'].sum()

p_firemen = pivot.loc['Firemen', 'total'] / n

p_ft = pivot['FT'].sum() / n

firemen_ft = p_firemen * p_ft * n

firemen_ft

885.2979878800596

In [39]:
# Compute Expected frequency of being "Firemen" and "PT". Store the solution in a variable named "firemen_pt"

p_pt = pivot['PT'].sum() / n

firemen_pt = p_firemen * p_pt * n

firemen_pt

625.7020121199403

In [40]:
# Compute Expected frequency of being "Lawyers" and "FT". Store the solution in a variable named "lawyers_ft"

p_lawyer = pivot.loc['Lawyer', 'total'] / n

lawyers_ft = p_lawyer * p_ft * n

lawyers_ft

245.49295626852748

In [41]:
# Compute Expected frequency of being "Lawyers" and "PT". Store the solution in a variable named "lawyers_pt"

lawyers_pt = p_lawyer * p_pt * n

lawyers_pt

173.5070437314725

In [42]:
# Compute Expected frequency of being "Medical" and "FT". Store the solution in a variable named "medical_ft"

p_medical = pivot.loc['Medical', 'total'] / n

medical_ft = p_medical * p_ft * n

medical_ft

2294.9783047823917

In [43]:
# Compute Expected frequency of being "Medical" and "PT". Store the solution in a variable named "medical_pt"

medical_pt = p_medical * p_pt * n

medical_pt

1622.021695217608

In [44]:
# Compute Expected frequency of being "Other" and "FT". Store the solution in a variable named "other_ft"

p_other = pivot.loc['Other', 'total'] / n

other_ft = p_other * p_ft * n

other_ft

17794.430966184842

In [45]:
# Compute Expected frequency of being "Other" and "PT". Store the solution in a variable named "other_pt"

other_pt = p_other * p_pt * n

other_pt

12576.569033815158

In [46]:
# Compute Expected frequency of being "Policemen" and "FT". Store the solution in a variable named "policemen_ft"

p_policemen = pivot.loc['Policemen', 'total'] / n

policemen_ft = p_policemen * p_ft * n

policemen_ft

1113.7997848841785

In [47]:
# Compute Expected frequency of being "Policemen" and "PT". Store the solution in a variable named "policemen_pt"

policemen_pt = p_policemen * p_pt * n

policemen_pt

787.2002151158215

* Store all the expected frequencies of "FT" employees in a list 
* Store all the expected frequencies of "PT" employees in another list
* Create a dictionary with "FT" and "PT" as keys and as the values use the previous lists
* Create a dataframe with this dictionary using pd.DataFrame()

In [48]:
# Storing FT frequencies in a list
ft_list = [firemen_ft, lawyers_ft, medical_ft, other_ft, policemen_ft]

# Storing PT frequencies in a list
pt_list = [firemen_pt, lawyers_pt, medical_pt, other_pt, policemen_pt]

# Creating dict
freq_dict = {'FT': ft_list, 'PT': pt_list}

# Creating df with the dict
freq_df = pd.DataFrame(freq_dict)

freq_df

Unnamed: 0,FT,PT
0,885.297988,625.702012
1,245.492956,173.507044
2,2294.978305,1622.021695
3,17794.430966,12576.569034
4,1113.799785,787.200215


Now use "st.chi2_contingency()" from scipy.stats ([documentation here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html)) to conduct a Chi2 test and determine whether the differences between employee groups are statistically significant at the 95% confidence level. Hint: pass the function a dataframe of the actual (observed) frequencies.

In [49]:
alpha = 0.05

# Creating crosstab data
data_crosstab = pd.crosstab(df['JobTitle'], df['Status'])

# Obtaining results from chi2 contingency
chi2, p_val, degrees_freedom, expected_freqs = st.chi2_contingency(data_crosstab)

print(f'chi2: {chi2}')
print(f'\np-value: {p_val}')
print(f'\ndegrees of freedom: {degrees_freedom}')

chi2: 2676.642333711905

p-value: 0.0

degrees of freedom: 4


In [50]:
# Obtaining critical value

critical_value = st.chi2.ppf(1 - alpha, df=degrees_freedom)
print("Critical value for alpha = 0.05 is: {:.2f}".format(critical_value))

Critical value for alpha = 0.05 is: 9.49


In [51]:
p_val < alpha

True

In [52]:
chi2  > critical_value

True

#### According to the chi2 test, we reject H0, which means employee group and FT/PT status are not independent.
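
The statistic itself can be reproduced by hand as a sanity check. This sketch uses the observed counts reported earlier and Pearson's formula, the sum over all cells of (observed - expected)^2 / expected; no continuity correction is involved because the table has more than one degree of freedom:

```python
import numpy as np
import scipy.stats as st

# Observed counts (rows: Firemen, Lawyer, Medical, Other, Policemen; columns: FT, PT)
observed = np.array([
    [1333, 178],
    [317, 102],
    [1028, 2889],
    [18126, 12245],
    [1530, 371],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson's statistic: squared deviations scaled by the expected counts
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = st.chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, dof_scipy, _ = st.chi2_contingency(observed)
print(chi2_manual, chi2_scipy)
print(dof, dof_scipy)
```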

Check if your expected frequencies agree with the ones obtained with the st.chi2_contingency() function

In [53]:
freq_df

Unnamed: 0,FT,PT
0,885.297988,625.702012
1,245.492956,173.507044
2,2294.978305,1622.021695
3,17794.430966,12576.569034
4,1113.799785,787.200215


In [54]:
print(f'\nexpected frequencies array:\n {expected_freqs}')


expected frequencies array:
 [[  885.29798788   625.70201212]
 [  245.49295627   173.50704373]
 [ 2294.97830478  1622.02169522]
 [17794.43096618 12576.56903382]
 [ 1113.79978488   787.20021512]]


#### The frequencies obtained manually are the same as the ones obtained through the chi2_contingency.