# Programming for Data Analysis

## Project 2019

## Problem statement
For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:

• Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.

• Investigate the types of variables involved, their likely distributions, and their
relationships with each other.

• Synthesise/simulate a data set as closely matching their properties as possible.

• Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.

## Project Plan

**The following are the steps I will take:**
* Investigate possible real-world phenomena that could be used to create a data set.
* Once I have identified a real-world phenomenon of interest to me, I will research it in order to pick the variables that are of most interest to me.
* I will ensure that at least 100 data points across at least 4 variables could be collected when dealing with this real-world phenomenon.
* Once I have picked the variables I would like to synthesise in the data set simulation, I will research the properties of these variables, their likely distributions and their relationships with each other.
* Use numpy.random to synthesise a data set that matches the properties I have identified as closely as possible.
* Detail the research in this Jupyter notebook and display the simulated data in an output cell within this notebook.

### Section 1 - Introduction to Data Simulation




### Data Simulation with numpy.random

The official documentation of numpy.random is here: https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html where you can read about all the different functions of numpy.random that are available to be used in data analysis.

numpy.random is a sub-package of NumPy, which is a Python library, and it contains many functions that can be used to generate random numbers. As we discovered in the lectures, computers cannot generate truly random numbers on their own, so packages such as numpy.random generate pseudo-random numbers instead.

The overall purpose of the package is to generate random numbers, and it contains many functions for doing so. The full list of functions is given on the official numpy.random documentation page: https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html#simple-random-data

**Within the numpy.random package there are 4 different sections:**

* **Simple Random Data**
    * Simple Random Data is essentially used to generate random numbers. It contains a number of different functions, and each function returns results from a particular distribution. These include the uniform distribution, the standard normal distribution and the continuous uniform distribution.
    
* **Permutations**
    * Within Permutations there are 2 functions: shuffle and permutation. The shuffle function shuffles the contents of an array in place; in other words, it changes the order of the elements of the array. The permutation function randomly permutes a range when given a single integer argument. For example, if the user passes the argument 15 to numpy.random.permutation, the integers 0 to 14 are returned in a random order.
    
* **Distributions**
    * A probability distribution describes the probability of each possible outcome of a random variable. The functions in this section draw random samples from specific distributions.
    
* **Random Generator**
    * Truly random numbers are nondeterministic, meaning they cannot be predetermined. However, machines are deterministic and therefore cannot generate truly random numbers on their own. A Pseudo Random Number Generator (PRNG) is an algorithm that generates sequences of numbers that appear random but are fully determined by an initial value known as the seed.
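
The four sections described above can be illustrated with a short sketch. This is only a minimal example and not part of the official documentation; the seed value, array sizes and distribution parameters are arbitrary choices made for illustration.

```python
import numpy as np

# Random generator: seed the PRNG so the results are repeatable.
# The seed value 42 is an arbitrary choice for this illustration.
np.random.seed(42)

# Simple random data
uniform_sample = np.random.rand(5)                # floats from the uniform distribution on [0, 1)
normal_sample = np.random.randn(5)                # floats from the standard normal distribution
integer_sample = np.random.randint(1, 7, size=5)  # integers from 1 to 6 (the upper bound is exclusive)

# Permutations
arr = np.arange(10)
np.random.shuffle(arr)                # reorders arr in place and returns None
permuted = np.random.permutation(15)  # returns a new array holding 0 to 14 in random order

# Distributions: 100 samples from a normal distribution.
# The mean (0.0) and standard deviation (2.0) are illustrative values only.
normal_values = np.random.normal(loc=0.0, scale=2.0, size=100)

# Reseeding with the same value reproduces the same sequence,
# which shows that the numbers are pseudo-random rather than truly random.
np.random.seed(42)
repeat_sample = np.random.rand(5)

print("Shuffled array:", arr)
print("Permutation of range(15):", permuted)
print("Same sequence after reseeding:", np.array_equal(uniform_sample, repeat_sample))
```

Because the generator is reseeded with the same value, `uniform_sample` and `repeat_sample` contain identical numbers.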


#### Section 1 References
* Official numpy documentation: https://numpy.org/
* Official NumPy tutorial: https://numpy.org/devdocs/user/quickstart.html
* What is NumPy: https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html
* Learn NumPy in 5 minutes tutorial: https://www.youtube.com/watch?v=xECXZ3tyONo
* Complete Python NumPy tutorial: https://www.youtube.com/watch?v=GB9ByFAIAH4
* NumPy Python image: https://i2.wp.com/www.simplifiedpython.net/wp-content/uploads/2018/11/Python-NumPy-14.png?resize=595%2C233&ssl=1
* numpy.random.permutation: https://docs.scipy.org/doc/numpy-1.16.0/reference/generated/numpy.random.permutation.html#numpy.random.permutation
* Difference between numpy.random.shuffle and numpy.random.permutation: https://stackoverflow.com/questions/15474159/shuffle-vs-permute-numpy
* numpy.random.permutation: https://docs.w3cub.com/numpy~1.14/generated/numpy.random.permutation/
* Pseudo Random Number Generator (PRNG): https://www.geeksforgeeks.org/pseudo-random-number-generator-prng/
* Introduction to Randomness and Random Numbers: https://www.random.org/randomness/
* What does numpy.random.seed do: https://stackoverflow.com/questions/21494489/what-does-numpy-random-seed0-do
* Khan Academy: https://www.khanacademy.org/computing/computer-science/cryptography/crypt/v/random-vs-pseudorandom-number-generators
* Random data generation in Python: https://realpython.com/lessons/random-data-generation-python/

## Section 2 - Investigation

**I first must identify a real-world phenomenon of interest to me that would be suitable for the purposes of this project.**

As I am from a farming background and have an interest in agriculture, I am going to investigate whether there are any interesting phenomena that I could work with in this area.

My first port of call is the Agriculture section of the Central Statistics Office (CSO) website, where I will conduct some research: https://www.cso.ie/en/statistics/agriculture/

I would be interested in some of the statistics surrounding the average income of farmers, farm size and the average age of farmers. These statistics are of particular importance in the face of Brexit as well as increasing pressure on farmers concerning climate change issues and the environment. 

Looking at the Farm Structure Survey 2016 on the CSO website, I can see that some interesting statistics are captured in this area: https://www.cso.ie/en/statistics/agriculture/

**Some questions I have looking at these statistics:**
* Are there many females involved in agriculture?
* Are there many young people involved in agriculture?
* What is the average farm size in Ireland today?
* What is the average income of farmers in Ireland today?
* What is the average age of farmers in Ireland today?
* How many hours per week do farmers work on average?
* How many unpaid hours does the average farmer work per week?
* What is the average direct payment amount received by farmers from the EU?
* Is there a relationship between income level and farm size?
* Is there a relationship between farm size and hours per week worked?
* Is there a relationship between direct payment amounts and farm size?

I cannot answer all of these questions, but I will continue the research and see if anything interesting comes up.


**The below image is a summary of the main findings of the Farm Structure Survey 2016:**

![survey.png](https://www.cso.ie/en/media/csoie/releasespublications/documents/ep/farmstructuresurvey/2016/Structure_of_farming_in_Ireland_2016_-_infographic.png)

Source: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/

**Some interesting stats from the above image:**
* 88% of the 137,500 farms are owned by males.
* 30% of farmholders are over the age of 65.
* Only 5% of farms are held by people under the age of 35.
* The average farm size is 32.4 hectares.
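
As a rough sketch of how the proportions listed above could later be fed into numpy.random, categorical variables can be drawn with `numpy.random.choice`. The sample size, seed and variable names are hypothetical. The 12% female share and the 65% share for the 35 to 65 bracket are simply the remainders implied by the figures above, and treating sex and age as independent is an assumption made only for this illustration.

```python
import numpy as np

np.random.seed(2016)  # arbitrary seed so the sketch is repeatable

n_farms = 1000  # hypothetical sample size, not a figure from the survey

# 88% of farms are owned by males; the remaining 12% is assumed to be female.
sex = np.random.choice(["Male", "Female"], size=n_farms, p=[0.88, 0.12])

# 5% of holders are under 35 and 30% are over 65;
# the remaining 65% is assumed to fall in the 35 to 65 bracket.
age_group = np.random.choice(
    ["Under 35", "35 to 65", "Over 65"],
    size=n_farms,
    p=[0.05, 0.65, 0.30],
)

# The simulated shares should be close to, but not exactly equal to,
# the target proportions because the draws are random.
print("Share male:", (sex == "Male").mean())
print("Share over 65:", (age_group == "Over 65").mean())
print("Share under 35:", (age_group == "Under 35").mean())
```

With a larger `n_farms`, the simulated shares would be expected to move closer to the survey proportions.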

**Some interesting statistics listed in the Farm Structure Survey 2016 Key findings section: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/kf/**

* In 2016 there were 137,500 farms in Ireland. More than half (52.7%) of all farms were located in the Border, Midland and Western (BMW) region.
* The average farm was 32.4 hectares.
* Farms in the Southern and Eastern (SE) region were 41.3% larger than those in the BMW region, with an average farm size of 38.3 hectares compared to 27.1 hectares.
* Almost one in five of all farms (18.0%) were 50 hectares or more in size while just over two in five farms (43.4%) had less than 20 hectares.
* Specialist Beef production continued to be the most common farm type or activity, accounting for over half of all farms in 2016 (78,300).
* Average Standard Output per farm was €45,855 in 2016. Standard output is the average monetary value of agricultural output at farm-gate prices.
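
The key findings above can be cross-checked with some simple arithmetic, and the same figures could serve as sampling probabilities. The sketch below is illustrative only: it assumes that every farm belongs to either the BMW or the SE region, so that the SE share is the remainder after the BMW share, and that the share of farms between 20 and 50 hectares is the remainder after the two stated size bands. The variable names, seed and sample size are hypothetical.

```python
import numpy as np

# Figures taken from the key findings listed above
total_farms = 137_500
bmw_share = 0.527        # share of farms in the BMW region
avg_bmw_ha = 27.1        # average farm size in the BMW region (hectares)
avg_se_ha = 38.3         # average farm size in the SE region (hectares)
share_under_20 = 0.434   # farms with less than 20 hectares
share_50_plus = 0.180    # farms with 50 hectares or more
beef_farms = 78_300      # Specialist Beef production farms

# How much larger are SE farms than BMW farms? About 41.3%.
pct_larger = (avg_se_ha / avg_bmw_ha - 1) * 100

# Weighted average of the two regional averages (assumes SE share = 1 - BMW share).
# This comes to about 32.4 hectares, which agrees with the reported national average.
weighted_avg_ha = bmw_share * avg_bmw_ha + (1 - bmw_share) * avg_se_ha

# Implied share of farms between 20 and 50 hectares: about 38.6%.
share_20_to_50 = 1 - share_under_20 - share_50_plus

# Specialist Beef farms as a share of all farms: about 56.9%, i.e. over half.
beef_share = beef_farms / total_farms

print(round(pct_larger, 1), round(weighted_avg_ha, 1),
      round(share_20_to_50, 3), round(beef_share, 3))

# Drawing region and size band for a hypothetical sample of 500 farms
np.random.seed(1)  # arbitrary seed
region = np.random.choice(["BMW", "SE"], size=500, p=[bmw_share, 1 - bmw_share])
size_band = np.random.choice(
    ["Under 20 ha", "20 to 50 ha", "50 ha or more"],
    size=500,
    p=[share_under_20, share_20_to_50, share_50_plus],
)
```

Note that this sketch draws region and size band independently, whereas the survey indicates that farms in the SE region tend to be larger, so a fuller simulation would need to model that relationship.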


**Standard Output (SO)**

The below explanation is given in the background notes & appendices section of the Farm Structure Survey:

_"The Standard Output (SO) of an agricultural product is defined as the average monetary value of the agricultural output at farm-gate prices. The SO does not take into account costs, direct payments, value added tax or taxes on products."_

Source: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/bgna/

#### Section 2 References
* CSO Agriculture section: https://www.cso.ie/en/statistics/agriculture/ 
* CSO Farm Structure Survey 2016: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/
* Farm Structure Survey 2016 Key Findings: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/kf/
* Background notes on the collection of data for the Farm Structure Survey 2016: https://www.cso.ie/en/releasesandpublications/ep/p-fss/farmstructuresurvey2016/bgna/