In [1]:
import pandas as pd
import numpy as np


In [2]:
data = pd.read_csv('data/age_self_empl.csv')
data.head(3)

Unnamed: 0,id,self_employment_type,branch,age,gender,period,self_empl_persons,avg_personal_income,avg_self_empl_income
0,14424,1st_category_group,"Agriculture, forestry and fishing",75 years or older,Male,2011,1.1,31.6,18.1
1,14425,1st_category_group,"Agriculture, forestry and fishing",75 years or older,Male,2012,1.1,33.0,18.8
2,14426,1st_category_group,"Agriculture, forestry and fishing",75 years or older,Male,2013,1.2,34.8,18.9


The most frequently chosen branch (by average number of people)

In [3]:
# Mean number of self-employed persons per branch, largest first
person_branch = (
    data.groupby('branch')
        .agg({'self_empl_persons': 'mean'})
        .sort_values(by='self_empl_persons', ascending=False)
)
person_branch

branch,self_empl_persons
Other specialised business services,7.532405
Construction,7.085764
Wholesale and retail trade,6.191845
Health and social work activities,4.543145
"Agriculture, forestry and fishing",3.767548
Other service activities,3.136364
Financial institutions,3.024765
"Culture, sports and recreation",2.495806
Accommodation and food serving,2.337804
Information and communication,2.304786


Average self-employment income in this branch

In [4]:
business_services = data[data['branch'] == 'Other specialised business services']

In [5]:
business_services['avg_self_empl_income'].mean()

39.56004398826986

Age groups in this dataset

In [6]:
data['age'].value_counts()

45 to 54 years           3591
55 to 64 years           3419
35 to 44 years           3407
25 to 34 years           3001
65 to 74 years           2373
Younger than 25 years    1732
75 years or older         767
Name: age, dtype: int64

$H_0:$ Women of my age (25 to 34) working in **Other specialised business services** have an average self-employment income equal to the branch-wide average self-employment income (**= 39.56**).

$H_1:$ Women of my age (25 to 34) working in **Other specialised business services** have an average self-employment income that differs from the branch-wide average self-employment income (**!= 39.56**).

Creating a dataframe for women aged 25 to 34 working in Other specialised business services

In [7]:
age_group = data[(data['age'] =='25 to 34 years') & (data['gender'] =='Female') & (data['branch'] =='Other specialised business services')]
age_group.head()

Unnamed: 0,id,self_employment_type,branch,age,gender,period,self_empl_persons,avg_personal_income,avg_self_empl_income
1182,33674,1st_category_group,Other specialised business services,25 to 34 years,Female,2011,7.2,29.5,21.2
1183,33675,1st_category_group,Other specialised business services,25 to 34 years,Female,2012,7.7,28.3,20.4
1184,33676,1st_category_group,Other specialised business services,25 to 34 years,Female,2013,7.9,27.4,19.6
1185,33677,1st_category_group,Other specialised business services,25 to 34 years,Female,2014,8.3,27.3,19.7
1186,33678,1st_category_group,Other specialised business services,25 to 34 years,Female,2015,9.2,28.5,20.3


In [8]:
from scipy.stats import ttest_1samp

stat, pval = ttest_1samp(age_group['avg_self_empl_income'], 39.56)

print('stat is  ', stat)
print('P-value for the two-tailed test is ', pval)

stat is   -10.101531504144939
P-value for the two-tailed test is  8.366296508814275e-18


**P-value** is less than 0.05, so we reject $H_0$
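
By default `ttest_1samp` runs a two-sided test, which matches $H_1$ as stated (not equal). As a hedged illustration of how a directional test would differ, the sketch below uses a small made-up sample (not taken from the dataset) and the `alternative` keyword, which assumes SciPy 1.6 or newer. When the t-statistic is negative, the one-sided p-value for "less" is half of the two-sided one.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical incomes for illustration only, NOT from age_self_empl.csv
sample = np.array([21.2, 20.4, 19.6, 19.7, 20.3, 35.0, 41.5, 28.9])
popmean = 39.56  # branch-wide average used in the notebook

# Two-sided test: H1 is "mean != popmean" (the default)
t_two, p_two = ttest_1samp(sample, popmean)

# One-sided test: H1 is "mean < popmean" (requires SciPy >= 1.6)
t_less, p_less = ttest_1samp(sample, popmean, alternative='less')

print('two-sided p-value:', p_two)
print('one-sided p-value:', p_less)
```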

A negative **Stat** (t-statistic) simply means that the sample mean lies to the left of the hypothesised mean, i.e. it is less than 39.56
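
To make the sign of the statistic concrete, here is a minimal sketch that rebuilds the t-statistic by hand as (sample mean - hypothesised mean) / standard error and compares it with `ttest_1samp`. The sample values are invented for illustration and are not from the dataset.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical incomes for illustration only, NOT from age_self_empl.csv
sample = np.array([21.2, 20.4, 19.6, 19.7, 20.3, 35.0, 41.5, 28.9])
popmean = 39.56  # hypothesised mean under H0

n = sample.size
sample_mean = sample.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(n)

# The sign is negative whenever sample_mean < popmean
t_manual = (sample_mean - popmean) / standard_error

t_scipy, _ = ttest_1samp(sample, popmean)
print('manual t:', t_manual)
print('scipy  t:', t_scipy)
```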

In [9]:
import scipy.stats

confidence_level = 0.95
degrees_freedom = len(age_group['avg_self_empl_income']) - 1  
sample_mean = np.mean(age_group['avg_self_empl_income'])

sample_standard_error = scipy.stats.sem(age_group['avg_self_empl_income'])

confidence_interval = scipy.stats.t.interval(confidence_level, 
                                             degrees_freedom, 
                                             sample_mean, 
                                             sample_standard_error)

print( 'Confidence interval is ', confidence_interval, '.' )

Confidence interval is  (24.981997482135263, 29.759465932498884) .
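
The interval returned by `scipy.stats.t.interval` is the sample mean plus or minus a critical t-value times the standard error. The sketch below checks that relationship on a made-up sample (not from the dataset); the variable names are illustrative.

```python
import numpy as np
import scipy.stats

# Hypothetical incomes for illustration only, NOT from age_self_empl.csv
sample = np.array([21.2, 20.4, 19.6, 19.7, 20.3, 35.0, 41.5, 28.9])

confidence_level = 0.95
degrees_freedom = sample.size - 1
sample_mean = sample.mean()
standard_error = scipy.stats.sem(sample)  # uses ddof=1 by default

# Critical value leaving (1 - confidence_level) / 2 in each tail
t_crit = scipy.stats.t.ppf((1 + confidence_level) / 2, degrees_freedom)
manual_interval = (sample_mean - t_crit * standard_error,
                   sample_mean + t_crit * standard_error)

scipy_interval = scipy.stats.t.interval(confidence_level, degrees_freedom,
                                        sample_mean, standard_error)

print('manual:', manual_interval)
print('scipy: ', scipy_interval)
```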


### Result:
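Based on the 95% confidence interval, the mean self-employment income of **women aged 25 to 34** working in **Other specialised business services** lies between roughly 25 000 and 30 000 per person per year. The whole interval sits below the branch-wide average of 39.56 (about 39 600), which is consistent with rejecting $H_0$.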