# Programming for Data Analysis Project
For this project I will be synthesizing a dataset, to do so I will be using various functions of the numpy.random package to create interpretations of real world phenomenon.

An example to illustrate this (as provided by the lecturer), is to consider students enrolled in a module, based on this example it seems sensible to consider each student has a grade at the end of a module and as we know from the common misconception that lecturers grade to a bell curve, is in fact due to the how phenomenon such as this tends to follow a normal distribution.
Following on from this we can assert that there are other factors that will help to determine their grade, such as their level of education going into this module (it stands to reason that someone with a PhD will in general outperform someone who has a bachelors degree), the amount of hours a student studies is also probably effecting their grade as is perhaps whether the student is full time or part time.  

Because I lack creativity I am going to start with this example and work through it in the hopes it gives me some experience in synthesizing data and maybe sparks my own idea in the process.

Additionally at this point I don't want to devote too much time to research the data in question, instead I want to play around with how to generate the data, so I will be making some assumptions which I will call out in comments or markdown text as appropriate.

In [7]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

qualification = ['PhD', 'Masters', 'Bachelors', 'None'] #categories for the various students
qualification_prob = [0.2, 0.3, 0.4, 0.1] #I assume probabilities for these are that secondary is least likely and Bachelors most
grades = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'D1', 'D2', 'F'] #setting up a list to determine possible grades

num_students = 200

qual_array = rng.choice(a=qualification, p=qualification_prob, size = 200) #picking from the list of qualifications using the previously defined probabilities
df = pd.DataFrame(columns=['Qualification'], data=qual_array) #using the nunmpy array tocreate the first column for the data table

In [8]:
Hours = rng.normal(loc=4, scale=2,size=200) #taking what Brian said in project pdf, hours per week students are typically centered aroudn a mean of 4 with standard deviation of a quarter and normal distribution is acceptable here
df['Hours per week study'] = Hours #Now I have two columns in my dataframe, qualification and hours per week study
df.head()

Unnamed: 0,Qualification,Hours per week study
0,Bachelors,6.219276
1,,4.336212
2,PhD,5.096811
3,,1.869751
4,Masters,7.65686


## My Simulated Data Idea - An Adults Gender, Height, Weight and Age

My idea here is to simulate a dataset for adults based on their Gender, Height, Weight and age.

- Gender is an easy one to start with as the population can be reasonably divided into half female and half male.

- Age is another easy factor to consider as it is going to be independent of gender, height and weight as none of these will influence someones age.

- Height will be exclusively influenced by their gender.

- Finally weight will be influenced by all of the other factors, as someones gender, height and age will all factor into their weight.

### Exploring the variables in more detail

On the face of it gender appears to be an easy 50 50 chance but for the sake of accuracy and to check the veracity of that claim I took a look at the Central Statistics Office (CSO) Census data for Men and Women, handily the CSO page also includes information for their ages as well.
[Link to CSO page on Men and Women](https://www.cso.ie/en/releasesandpublications/ep/p-cp3oy/cp3/assr/)

Based on the most recent census data (from 2016) the ratio of men to women is not quite equal, there is a minor difference between the ratio of men to women between the ages 15-64 of 980 men to 1000 women. More interestingly however the ratio of men to women broadens significantly when looking at the 65+ age group with 871 men to 1000 women.
To start with I will just work with the 15-64 age group.

In [60]:
#Ratio of men to women for 2016 980 men to 1000 women
p_male_15to64 = round(980/1980, 3) #get the probabilty of being male given age is between 15-64, round to 3 decimal places
p_female_15to64 = 1-p_male_15to64
print('Probability an adult between 15 to 64 years of age is male:',p_male_15to64)
print('Probability an adult between 15 to 64 years of age is female:',p_female_15to64)

Probability an adult between 15 to 64 years of age is male: 0.495
Probability an adult between 15 to 64 years of age is female: 0.505


In [61]:
df = pd.DataFrame(data=rng.choice(['Female','Male'], size=1000, p=[p_female_15to64,p_male_15to64]), columns=['Gender'])

### Female and Male heights
This was somewhat tricky to find, which was surprising and I could not find any figures specific to Ireland so instead I am using the information available from https://ourworldindata.org/human-height

The above linked page gives several useful pieces of information for determining height
- Height is normally distributed
- Females have an average height of 164.7cm
    - Std. Deviation of 7.07cm
- Males have an average height of 178.4cm
    - Std. Deviation of 7.59cm

With the above information I can now start simulating data for adult Female and Male heights, however I would like to point out the few generalisations made at this point. The average height and std. deviations listed above are not necessarily the case for Ireland, in fact they probably aren't totally accurate, however in the absence of any truly accurate figures for Ireland these will make a sufficient approximation.
Additionally the data presented from the link above uses a relatively young group of adults, which may further skew how the average height and std. deviation would be in reality as potentially for adults born in the 1970's their average height could be shorter than adults born in 

In [69]:
#df.loc[df['Gender'] =='Male', 'Height'] = rng.normal(loc=178.4,scale=7.07, size=10)
for index, row in df.iterrows():
    if(row['Gender'] == 'Female'):
        df.at[index,'Height'] = rng.normal(loc=164.7,scale=7.59, size=1)
    elif(row['Gender'] == 'Male'):
        df.at[index,'Height']  = rng.normal(loc=178.4,scale=7.07, size=1)

In [72]:
df.describe()

Unnamed: 0,Height
count,1000.0
mean,171.652302
std,9.715568
min,144.875613
25%,164.543693
50%,172.091147
75%,178.718054
max,195.675357
