For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.
Specifically, in this project you should:
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

# 1. Happiness at Work 

### 1.1 Real-world Phenomenon with at least one-hundred data points across four variables. 

![employee happy face](https://www.incimages.com/uploaded_files/image/970x450/getty_175280847_9707219704500142_47976.jpg)

A common assumption is that being successful and well paid in your job leads to happiness. Happier employees are also thought to be more productive, so organizations try to improve employee happiness with the aim of achieving higher profitability and company value. In fact, a 2015 review of several global studies concluded that happier employees were up to 12% more productive. Using figures from a variety of workplace studies, I intend to investigate the factors that contribute to job satisfaction. 

### 1.2 Identify the Variables 

Factors affecting job satisfaction may include, but are not limited to: 

- Gender 
- Marital status 
- Education 
- Sector 
- Pay 
- Hours of Work 
- Age 
- Size of Office 

# 2. Variables 

### 2.1 Investigate variables and their likely distributions

<img src="http://blog.cloudera.com/wp-content/uploads/2015/12/distribution.png" alt="Drawing" style="width: 500px;"/>

#### Explicit/Named Variables

Gender 
- Value 1: Male
- Value 2: Female
- Assumes an even split between men and women
- Bernoulli distribution with p = 0.5
- Categorical variable with 2 possible values

Marital status 
- Value 1: Single (30%)
- Value 2: Partnered (70%)
- Arbitrary assumption
- Bernoulli distribution with p = 0.7 for Partnered
- Categorical variable with 2 possible values

Education 
- Value 1: Bachelors (40%)
- Value 2: No Bachelors (40%)
- Value 3: More than a Bachelors (20%)
- Assumption based on CSO figures for third-level qualifications, 1991-2016
- Categorical (single-trial multinomial) distribution, since there are more than two outcomes
- Categorical variable with 3 possible values

Sector 
- 35.7% of employees' companies are private sector and closely held
- 28.6% are private sector and publicly held
- 29.5% are public sector
- 6.2% are charities or non-profits
- Source: 2014 US Happiness @ Work survey
- Categorical (single-trial multinomial) distribution
- Categorical variable with 4 possible values

Size of Business 
- Micro <10 (20%)
- Small-Medium 10-250 (60%)
- Large 250+ (20%)
- Assumption based on an OECD statistic showing that SMEs account for 60-70% of the workforce
- Categorical (single-trial multinomial) distribution
- Categorical variable with 3 possible values. This could also have been treated as a sliding-scale variable with a normal distribution peaking somewhere in the middle of the SME range. 
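
The categorical variables listed above can all be drawn with a single weighted random choice. The following is a minimal sketch for Sector, assuming the survey shares quoted above, a sample of 200 and an arbitrary seed. The names `sector_labels`, `sector_probs` and `rng` are illustrative and do not appear elsewhere in this notebook, and `np.random.default_rng` is used here in place of `np.random.seed` purely as a matter of preference.

```python
import numpy as np

# Hypothetical names; the shares are the sector figures listed above.
sector_labels = ['Private (closely held)', 'Private (publicly held)',
                 'Public sector', 'Charity / non-profit']
sector_probs = [0.357, 0.286, 0.295, 0.062]  # must sum to 1

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
sector = rng.choice(sector_labels, size=200, p=sector_probs)

# Compare the simulated shares with the target probabilities
labels, counts = np.unique(sector, return_counts=True)
for label, count in zip(labels, counts):
    print(label, count / len(sector))
```

With only 200 draws the simulated shares will wander a few percentage points around the targets, which is worth keeping in mind for the smallest category.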

#### Sliding Scale Variables

Job Satisfaction (0-10)
- Mean: 7.308 
- Standard Deviation: 1.968
- Source: European Social Survey summary statistics
- Normal distribution
- non-negative real number with 3 decimal places

Pay (€)
- Minimum: €18,041 (minimum wage in Ireland)
- Mean: €45,611 is the mean wage based on 2016 CSO figures
- Standard Deviation: €15,000 (Assumption)
- Normal distribution
- non-negative real number with 2 decimal places

Hours of Work (hrs)
- Mean: 39 hours based on CSO statistic
- Maximum: 48 (Irish legislation)
- Normal distribution
- non-negative real number with 2 decimal places

Age (yrs)
- Range: 17-65 
- Median: 42.3 years, based on US workforce age demographics
- Combined uniform and triangular distribution
- non-negative real number with 2 decimal places
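
To see what the combined uniform and triangular distribution looks like in numbers, the sketch below draws a large sample and reports its range and median. It assumes an equal mix of the two distributions and an arbitrary seed; `age_sim` and `rng` are illustrative names. Note that 42.3 enters the triangular distribution as its mode (peak), not its median, so the simulated median need not equal 42.3 exactly.

```python
import numpy as np

rng = np.random.default_rng(12)  # arbitrary seed
n = 100_000  # deliberately large so the summary statistics are stable

# Half the sample is flat across the working-age range,
# half peaks at 42.3 and tapers towards 17 and 65.
age_uniform = rng.uniform(17, 65, n // 2)
age_triangular = rng.triangular(17, 42.3, 65, n // 2)  # (left, mode, right)
age_sim = np.concatenate((age_uniform, age_triangular))

print("min:", age_sim.min().round(1), "max:", age_sim.max().round(1))
print("median:", np.median(age_sim).round(1))
```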

#### For the purposes of the project, we will look at (i) Gender, (ii) Education, (iii) Pay, (iv) Hours of Work, (v) Age and (vi) Job Satisfaction

In [None]:
# Import numpy as np
import numpy as np

# import matplot
import matplotlib as mat
import matplotlib.pyplot as plt

#Use the 'inline' backend, so that the matplotlib graphs are included in notebook, next to the code
%matplotlib inline

# Import pandas
import pandas as pd

# import seaborn 
import seaborn as sns

In [None]:
# Gender 

gender_arr = ['Male', 'Female']

gender = (np.random.choice(gender_arr, 200))
gender


In [None]:
# check number of values so that generated figures for all variables will be equal to create the data set
len (gender)

In [None]:
plt.hist(gender) 
plt.ylabel('Count of 200') # labelling y axis
plt.xlabel('Gender') # labelling x axis 
plt.title('Gender Count') # Adding title

plt.show()

In [None]:
# Education 

# Set the seed to generate numbers from
np.random.seed(55) 

education_arr = ['Bachelors', 'No Bachelors', 'Post Bachelors']
education = np.random.choice(education_arr, 200, p= [0.4, 0.4, 0.2])

education

In [None]:
len (education)

In [None]:
plt.hist(education)
plt.ylabel('Count of 200') # labelling y axis
plt.title('Education') # Adding title
plt.show()

In [None]:
# Pay

# Set the seed to generate numbers from
np.random.seed(60) 

# (mean, standard deviation, size), rounded to two decimal places
pay = np.random.normal(45611, 15000, 200).round(2)
pay

In [None]:
len (pay)

In [None]:
plt.hist(pay)
plt.ylabel('Count of 200') # labelling y axis
plt.xlabel('Payscale') # labelling x axis 
plt.title('Salary') # Adding title
plt.show()

In [None]:
# Horizontal box plot of pay; passing the data as x works across seaborn versions
sns.boxplot(x=pay)

In [None]:
# Hours at Work

# Set the seed to generate numbers from
np.random.seed(70) 

# (mean, standard deviation, size), rounded to two decimal places
hours = np.random.normal(39, 4, 200).round(2) 
hours

In [None]:
len (hours)

In [None]:
plt.hist(hours)
plt.ylabel('Number of People') # labelling y axis
plt.xlabel('Number of Hours') # labelling x axis 
plt.title('Hours in a Week') # Adding title
plt.show()

In [None]:
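plt.title('Hours per Week')
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(hours, kde=True, stat="density")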

The next variable to simulate is age, a continuous numerical variable. The working population is assumed to range from 17 to 65 years, with a median of about 42.3 years (see the Age entry above). A uniform distribution lets me set both the lower and upper bounds of the age range, but on its own it would give every age the same weight. I therefore combine 100 draws from a uniform distribution with 100 draws from a triangular distribution that peaks at 42.3, so that the simulated ages stay within 17-65 while clustering around the middle of the range.

In [None]:
# Age

# Set the seed to generate numbers from
np.random.seed(12) 

# https://stackoverflow.com/questions/36537811/numpy-trapezoidal-distribution-for-age-distribution

agex = np.random.uniform(17,65,100) 
agey = np.random.triangular(17,42.3,65,100)
age = np.concatenate((agex,agey)) 

age.round(1)

In [None]:
len (age)

In [None]:
plt.hist(age)
plt.ylabel('Number of People') # labelling y axis
plt.xlabel('Age') # labelling x axis 
plt.title('Age Distribution') # Adding title
plt.show()

In [None]:
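plt.title('Age Distribution')
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(age, kde=True, stat="density")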

In [None]:
# Job Satisfaction 

# Set the seed to generate numbers from
np.random.seed(63) 

# (mean, standard deviation, size)
job = np.random.normal(7.308, 1.968, 200)
job

In [None]:
plt.hist(job)
plt.ylabel('Number of People') # labelling y axis
plt.xlabel('Job Satisfaction') # labelling x axis 
plt.title('Job Satisfaction Index') # Adding title
plt.show()

In [None]:
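plt.title('Job Satisfaction')
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(job, kde=True, stat="density")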

The job satisfaction index is marked out of 10, but some of the simulated values are greater than 10, so a bounded distribution may be a better fit than an unrestricted normal.
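
One way to keep the simulated index inside its 0-10 scale is to redraw any value that falls outside the range, which amounts to sampling from a truncated normal distribution. The sketch below assumes the same mean and standard deviation as the cell above; `bounded_normal`, `job_bounded` and `rng` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def bounded_normal(mean, sd, size, low, high, rng):
    """Draw `size` normal values, redrawing any that fall outside [low, high]."""
    values = rng.normal(mean, sd, size)
    outside = (values < low) | (values > high)
    while outside.any():
        # Replace only the out-of-range values with fresh draws
        values[outside] = rng.normal(mean, sd, outside.sum())
        outside = (values < low) | (values > high)
    return values

rng = np.random.default_rng(63)  # arbitrary seed
job_bounded = bounded_normal(7.308, 1.968, 200, 0, 10, rng).round(3)

print("min:", job_bounded.min(), "max:", job_bounded.max())
print("mean:", job_bounded.mean().round(3))
```

Because more of the original distribution lies above 10 than below 0, redrawing tends to pull the sample mean slightly below 7.308. Using `np.clip(job, 0, 10)` instead would preserve the sample size without redrawing, at the cost of piling values up at exactly 10.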

In [None]:
len (job)

### 2.2 Relationships with Each Other

- Gender and marital status may determine whether work-life balance is possible when starting a family
- A good education may lead to a better job
- Some sectors are high-pressure environments, which leads to stress
- Fewer hours and more pay should have a positive impact on happiness
- Work may get easier as you climb the ladder and get older
- A larger office may be able to offer more benefits to its employees, but may also feel less personal
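
Each variable in this notebook is drawn independently, so relationships like those listed above will not show up in the simulated data unless they are built in. The following is a rough sketch of one way to do that for Pay, Hours and Job Satisfaction, using a multivariate normal draw. The correlation values are assumptions chosen purely for illustration and are not taken from any survey, and the names `corr`, `sample`, `pay_c`, `hours_c` and `job_c` are hypothetical. The sample is larger than the 200 rows used elsewhere so that the printed correlations sit close to the assumed ones.

```python
import numpy as np

# Means and standard deviations from the variable descriptions above
# (the hours standard deviation of 4 is the value used in the simulation cell)
means = np.array([45611.0, 39.0, 7.308])   # pay, hours, job satisfaction
sds = np.array([15000.0, 4.0, 1.968])

# ASSUMED correlations: more pay -> happier, more hours -> less happy
corr = np.array([
    [1.0,  0.1,  0.3],
    [0.1,  1.0, -0.2],
    [0.3, -0.2,  1.0],
])

rng = np.random.default_rng(0)  # arbitrary seed
# Draw correlated standard-normal values, then rescale each column
z = rng.multivariate_normal(np.zeros(3), corr, size=5000)
sample = means + z * sds
pay_c, hours_c, job_c = sample.T

print(np.corrcoef(sample, rowvar=False).round(2))
```

The same values could then replace the independently generated `pay`, `hours` and `job` arrays when building the dataframe, though job satisfaction would still need to be bounded to the 0-10 scale.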

In [None]:
df = pd.DataFrame({'Job Satisfaction' : job, 'Gender': gender, 'Education': education, 'Pay' : pay, 'Hours': hours, 'Age' : age.round(0)})
sns.pairplot(df)

In [None]:
# pay vs hours, no correlation
plt.scatter(pay,hours)

In [None]:
# job satisfaction vs age, no correlation
plt.scatter(job,age)

In [None]:
# gender vs job satisfaction, no correlation
plt.scatter(gender,job)

In [None]:
# pay vs job satisfaction, with a least-squares line of best fit
plt.scatter(pay,job)
plt.plot(np.unique(pay), np.poly1d(np.polyfit(pay, job, 1))(np.unique(pay)))

ESRI: chart of probable job stress vs support (see References)

# 3. Simulate a Data Set 

### 3.1 Create a dataframe

In [None]:
#Create dataframe with all the data randomly generated
df = pd.DataFrame({'Job Satisfaction' : job, 'Gender': gender, 'Education': education, 'Pay' : pay, 'Hours': hours, 'Age' : age.round(0)})
df.head()

In [None]:
# Add a new column calculating the Hourly Rate: annual Pay divided by (weekly Hours x 52 weeks)
df['Hourly Rate'] = df['Pay'] / (df['Hours'] * 52)
df.head()

In [None]:
# Create a column to identify those paid below the minimum wage threshold of 9.55 per hour
df['Above or Below Minimum Wage'] = np.where(df['Hourly Rate'] < 9.55, 'Below', 'Above')
df.head()

In [None]:
# Relabel the values in the Education column (the column header itself is unchanged here)
# Assigning the result back avoids relying on inplace modification of a column selection
df['Education'] = df['Education'].replace({'Bachelors':'Grad', 'Post Bachelors':'Post Grad', 'No Bachelors' : 'None'})
df.head()

In [None]:
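# Preview the table with Education shown as Third Level (not assigned, so df keeps the Education column)
df.rename(columns={'Education': 'Third Level'}).head()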

In [None]:
df.groupby('Gender').describe()

In [None]:
# Look at one row of data
df.loc[25]

In [None]:
# Find out how many are being paid below Minimum Wage
df.loc[df['Above or Below Minimum Wage'] == 'Below']

In [None]:
# Give the figure for those paid below Minimum Wage
len(df.loc[df['Above or Below Minimum Wage'] == 'Below'])

In [None]:
# How many people have Third Level education
df['Education'].value_counts()

In [None]:
# Summary statistics for the numerical columns
df.describe()

Here, the average age is 41, not far off the existing statistics. 
There is also someone in the data who is clearly being taken advantage of, with an hourly wage of €1.70. 

I adjusted the pay standard deviation from €15,000 to €10,000, which improved the hourly rates so that only one person was below minimum wage. However, I then reset it to €15,000 to check for outliers. 
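
The effect of the pay standard deviation on the number of people below minimum wage can be checked directly. The sketch below is an illustration under stated assumptions rather than the procedure used in this notebook: it reuses the pay mean, the hours distribution and the 9.55 threshold from the cells above, but draws a much larger sample than 200 so that the two shares can be compared reliably. `share_below_minimum` and the constants are hypothetical names.

```python
import numpy as np

MIN_HOURLY_WAGE = 9.55   # threshold used in the notebook
WEEKS_PER_YEAR = 52

def share_below_minimum(pay_sd, n=100_000, seed=60):
    """Fraction of simulated workers whose hourly rate is below the threshold."""
    rng = np.random.default_rng(seed)
    pay = rng.normal(45611, pay_sd, n)      # annual pay
    hours = rng.normal(39, 4, n)            # weekly hours
    hourly_rate = pay / (hours * WEEKS_PER_YEAR)
    return np.mean(hourly_rate < MIN_HOURLY_WAGE)

for sd in (15000, 10000):
    print("pay sd", sd, "share below minimum wage:", round(share_below_minimum(sd), 4))
```

A narrower pay distribution leaves fewer people in the low tail, which is consistent with the drop observed when the standard deviation was reduced to €10,000.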


### 3.2 Research Analysis

In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2, 1, 2)

plt.title('Distribution of Pay')
# Box plot of Pay with the individual data points overlaid as a swarm plot
x = sns.boxplot(x=df["Pay"])
y = sns.swarmplot(x=df["Pay"], color="0")



In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2, 1, 2)

plt.title('Number of Hours Per Week')
# Box plot of Hours with the individual data points overlaid as a swarm plot
x = sns.boxplot(x=df["Hours"])
y = sns.swarmplot(x=df["Hours"], color="0")

In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2, 1, 2)

plt.title('Age Distribution')
# Box plot of Age with the individual data points overlaid as a swarm plot
x = sns.boxplot(x=df["Age"])
y = sns.swarmplot(x=df["Age"], color="0")

In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2, 1, 2)

plt.title('Job Satisfaction Index')
# Box plot of Job Satisfaction with the individual data points overlaid as a swarm plot
x = sns.boxplot(x=df["Job Satisfaction"])
y = sns.swarmplot(x=df["Job Satisfaction"], color="0")

In [None]:
# Encode the categorical columns as integers so they can be used in numerical plots
df['Education'] = df['Education'].replace({'Grad':1, 'Post Grad':2, 'None':0})
df['Gender'] = df['Gender'].replace({'Male':1, 'Female':0})
df['Above or Below Minimum Wage'] = df['Above or Below Minimum Wage'].replace({'Above':1, 'Below':0})
df.head()

In [None]:
sns.lineplot(x="Hours", y="Job Satisfaction", hue = "Gender",  data=df)

In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2, 1, 2)

sns.scatterplot(x="Hours", y="Pay", data=df, hue ="Job Satisfaction")



In [None]:
# Linear regression of Hours on Pay. The two variables were generated independently, so no strong relationship is expected.

sns.lmplot(x="Pay", y="Hours", data=df)

# K Nearest Neighbours

In [None]:
# sns.pairplot(df, hue="Job Satisfaction")

In [None]:
# Distribution of the simulated pay values
sns.histplot(pay, kde=True, stat="density")
plt.title('Distribution of Pay')
plt.xlabel('Pay (€)')
plt.ylabel('Density')
plt.show()
print("The mean simulated pay is", np.mean(pay))

In [None]:
# Create regression plots via Seaborn.

import warnings # Importing the Python warnings module to ignore a benign "ImportWarning" 
warnings.filterwarnings('ignore')

sns.set_style("whitegrid") # Adding a grid for better plot referencing 
sns.lmplot(x='Hours', y='Job Satisfaction', hue='Gender', data=df, palette="bright", height=5) 

# Let's also plot a joint plot to show the distributions:
sns.set_style("darkgrid")
sns.jointplot(x='Hours', y='Job Satisfaction', data=df, kind="reg")

warnings.resetwarnings() # Reset the previously ignored warning to its default state

In [None]:
# Pair plot of the simulated data set, with kernel density estimates on the diagonal
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha':1.0});

# References 

https://www.esri.ie/pubs/RS84.pdf
***
https://en.wikipedia.org/wiki/Simpson%27s_paradox
***
https://www.irishtimes.com/life-and-style/health-family/what-makes-us-happy-at-work-1.3347589
***
https://www.irishtimes.com/business/work/happiness-at-work-requires-good-relationships-purpose-and-vision-1.2025872
***
https://hbr.org/2017/09/happiness-traps
***
http://eureka.sbs.ox.ac.uk/6319/1/2017-07.pdf
***
https://s3.amazonaws.com/happiness-report/2018/WHR_web.pdf
***
https://drum.lib.umd.edu/bitstream/handle/1903/18191/Huang_umd_0117N_16920.pdf?sequence=1
***
https://stackoverflow.com/questions/36537811/numpy-trapezoidal-distribution-for-age-distribution
***
https://www.irishexaminer.com/breakingnews/ireland/this-is-the-average-full-time-wage-in-ireland-795670.html
***
https://www.cso.ie/en/releasesandpublications/er/elcq/earningsandlabourcostsq22018finalq32018preliminaryestimates/
***
https://www.cso.ie/en/releasesandpublications/ep/p-wamii/womenandmeninireland2016/employment/
***
http://www.governing.com/gov-data/ages-of-workforce-for-industries-average-medians.html
***
https://www.huffingtonpost.com/kristie-arslan/five-big-myths-about-amer_b_866118.html
***
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
***
https://www.thejournal.ie/ireland-european-education-rankings-facts-eoghan-murphy-cnbc-2992092-Sep2016/
***
https://books.google.ie/books?id=TPNFDwAAQBAJ&pg=PA82&lpg=PA82&dq=random+sample+python+male+female+dataframe&source=bl&ots=s696v_uf2o&sig=RtWn4gHinAg_CcHHrHfTEbVQ27Y&hl=en&sa=X&ved=2ahUKEwiq2M_h5Z3fAhXhRxUIHZOEB7sQ6AEwDHoECAoQAQ#v=onepage&q=random%20sample%20python%20male%20female%20dataframe&f=false
***
https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html

# End