# Programming for Data Analysis Assignment

## Initial file creation, assignment to be specified at a later date

"For this project you must create a dataset by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish - you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the `numpy.random` package for this purpose.

Specifically, in this project you should:

1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables;

2. Investigate the types of variables involved, their likely distributions, and their relationships with each other;

3. Synthesise/simulate a dataset as closely matching their properties as possible;

4. Detail your research and implement the simulation in a Jupyter notebook - the dataset itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation - you must synthesise a dataset. Some students may already have some real-world datasets in their own files. It is okay to base your synthesised dataset on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised dataset."

For point two, it is not enough to end up with a totally random dataset; you need to work out which distribution is most appropriate for each variable. What are the likely data points that you collect from a phenomenon? How are the variables related to each other? What does the distribution look like? Is it a normal distribution? A Poisson distribution?

A pandas DataFrame is sufficient for the dataset; it doesn't need to be a CSV or XML file.

Brian: "So I'm not actually that interested in the actual data. What we're most interested in is the research investigations you do to create the dataset, because presumably there will be some random element in the generation of your data, following some sort of distribution. And I'll be able to re-run your notebook and get a different randomly-generated dataset with the same properties."
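A minimal sketch of what that re-run requirement means in practice, assuming a single normally distributed variable. The function name `simulate_scores` and the parameters (`loc=300`, `scale=90`) are placeholders for illustration, not values taken from any data. Leaving the generator unseeded gives different values on every run, while the distribution's properties stay the same.

```python
import numpy as np


def simulate_scores(rng, n=100):
    """Draw n values from a normal distribution (placeholder parameters)."""
    return rng.normal(loc=300, scale=90, size=n)


# Two unseeded generators: each run produces different values...
run_a = simulate_scores(np.random.default_rng())
run_b = simulate_scores(np.random.default_rng())
print(np.array_equal(run_a, run_b))

# ...but the summary statistics should be close to each other.
print(round(run_a.mean(), 1), round(run_b.mean(), 1))

# Passing a seed instead makes a run repeatable, which helps while debugging.
seeded_a = simulate_scores(np.random.default_rng(42))
seeded_b = simulate_scores(np.random.default_rng(42))
```

The seed would need to be removed again before submission if the notebook is meant to produce a fresh dataset each time.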

## Basic Plan

So you need to find a basic dataset that can be extrapolated, rather than something comprehensive. The starting points should probably be either Hitting Against the Spin or the Vollman hockey book. This could be something team-based or player-based, something that can be condensed into small enough chunks that it fits into a 100 row by 5 column (for the four variables) grid.

Remember that Brian's example was that he thinks there's a correlation between a student's grade and their level of degree, the amount of time spent studying and the number of commits they make.
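One way to build that kind of relationship into simulated data is to draw the variables jointly rather than one at a time. The sketch below assumes two variables (weekly study hours and grade) that are roughly normal with a positive correlation; every number in it, including the correlation `rho`, is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng()

# Placeholder means and standard deviations for [study hours, grade %].
means = np.array([10.0, 60.0])
sds = np.array([3.0, 12.0])
rho = 0.6  # assumed correlation between the two variables

# Covariance matrix built from the standard deviations and the correlation.
cov = np.array([
    [sds[0] ** 2, rho * sds[0] * sds[1]],
    [rho * sds[0] * sds[1], sds[1] ** 2],
])

sample = rng.multivariate_normal(means, cov, size=100)
hours, grade = sample[:, 0], sample[:, 1]

# The sample correlation should land near rho, with some random variation.
observed_rho = np.corrcoef(hours, grade)[0, 1]
print(round(observed_rho, 2))
```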

Could even be the number of wins gained from winning the toss? Variables would be: toss result, choice of batting/fielding first, match result, total runs in the match.

Vollman definitely tells you where he sources all of his data from, and I'm guessing it will be either NHL.com or hockey-reference.com. _Hitting Against the Spin_ should just be a matter of going to Cricinfo and seeing what we can dig out from their core metrics.

After this you'll need to show investigation of each variable, and plot them accordingly using Matplotlib or Seaborn to see if a distribution can be identified. Once these are identified, even just using [Wikipedia](https://en.wikipedia.org/wiki/Probability_distribution#External_links) or the [NumPy documentation](https://numpy.org/doc/stable/reference/random/generator.html), you'll need to use the latest notebooks from this module to simulate a dataset accordingly. So if you plot, say, expected goals, and you can see that the data is Poisson-distributed, you can use the `poisson` method of NumPy's random `Generator` to generate the data.
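As a rough numerical companion to the plot, a Poisson variable has equal mean and variance, so a variance-to-mean ratio near 1 is consistent with a Poisson model. The sketch below uses a stand-in array in place of real observations (the `lam=2.7` used to create it is arbitrary), then simulates new data using the sample mean as the rate.

```python
import numpy as np

rng = np.random.default_rng()

# Stand-in for an observed count variable; real data would be loaded instead.
observed = rng.poisson(lam=2.7, size=500)

# For Poisson data the variance is close to the mean, so this ratio is near 1.
sample_mean = observed.mean()
dispersion = observed.var(ddof=1) / sample_mean
print(round(sample_mean, 2), round(dispersion, 2))

# The sample mean is the usual estimate of the Poisson rate (lambda).
simulated = rng.poisson(lam=sample_mean, size=100)
```

A ratio well above 1 would suggest the counts are more spread out than a Poisson model allows, in which case a different distribution is worth investigating.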


Law: https://www.lords.org/mcc/the-laws-of-cricket/covering-the-pitch

History: https://www.espncricinfo.com/story/cricket-s-turning-points-covered-pitches-461172

History: https://www.espncricinfo.com/wisdenalmanack/content/story/152416.html

Leamon, N. and Jones, B. (2021) Hitting Against the Spin: How Cricket Really Works. London: Little, Brown Book Group.

## Rationale

Results in cricket matches between England and Australia since 1970, dependence on key variables:
- where is the match played?
- which team wins the toss?
- did that team decide to bat or field?
- what was the total first innings score?

Core of this started as the chapter "The Cat That Turned into a Fence" in Nathan Leamon and Ben Jones' _Hitting Against the Spin: How Cricket Really Works_ (2021, 172-197).

## Outline and Selection of Data

The dataset, taken from CricInfo's [Statsguru](https://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;filter=advanced;host=1;host=2;opposition=1;opposition=2;orderby=start;size=200;spanmin1=1+Jan+1970;spanval1=span;team=1;team=2;template=results;tournament_type=2;type=team;view=innings) query engine, details the match outcomes of 153 five-day Test matches played between England and Australia in England, Wales and Australia since 1 January 1970. All but six of these matches (all taking place between 1977 and 1988) were contested as part of The Ashes, a bilateral series that alternates between the two countries approximately every two years. I have chosen 1970 as my start point as this was the period when groundsmen at English cricket grounds were no longer permitted to leave pitches [uncovered](https://www.lords.org/mcc/the-laws-of-cricket/covering-the-pitch). Before this, pitches were not sheltered from adverse weather once a match started, which significantly affected the conditions of the match, pitch quality, and captains' decisions at the toss, with Leamon and Jones (2021, 174) asserting that:

    "The vast majority (89 per cent) of captains who won the toss chose to bat, and this resulted in side who won the toss having a markedly better chance of winning."

### may as well chuck in a photo of the urn for the craic

The variables I've chosen to include here are those that are often regarded as key points in the early stage of the match. In cricket, winning a coin toss offers a team's captain the choice of batting or fielding first, and captains will make this decision based on a number of factors, such as their team's strengths (or their opponent's weaknesses), the weather forecast, or the condition of the pitch. This choice is seen as an immediate advantage, and much pre-match discussion centres around what each captain's likely decision will be. For this reason, I've included the winner of the coin toss, as well as their choice, as variables. Of course, a coin toss is the classic example used for demonstrating the binomial distribution, but while the decision to bat or field is a binary choice, it is far from an even one. Leamon and Jones (2021, 175) state that:

    "In Tests played between 1980 and 2010, nearly twice as many captains have batted first than have chosen to bowl."

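The difference between the two variables can be sketched as two draws with different probabilities: a fair 50/50 draw for the toss and a weighted draw for the decision. The weight `p_bat = 2 / 3` is only a reading of "nearly twice as many" as a 2:1 ratio; the quote covers all Tests between 1980 and 2010, so the real proportion for England v Australia matches would need to be measured from the dataset.

```python
import numpy as np

rng = np.random.default_rng()
n = 153  # number of matches in the dataset

# Fair coin: 1 = England win the toss, 0 = Australia win the toss.
toss = rng.binomial(n=1, p=0.5, size=n)

# Weighted binary choice: assumed 2:1 in favour of batting first.
p_bat = 2 / 3
choice = rng.choice(["bat", "field"], size=n, p=[p_bat, 1 - p_bat])

toss_share = toss.mean()
bat_share = np.mean(choice == "bat")
print(round(toss_share, 2), round(bat_share, 2))
```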
Does this dataset need more quantitative data? Perhaps second innings? Could pull four innings purely to show how average innings decline, reference the Adelaide graph on p. 177.

In [7]:
# did you use all of these libraries?
import datetime as dt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as sk

Data: show a sample of data and explain each column and its dimensions and possibilities, only first innings (quote from HATS about average scoring rates per innings). Any declared scores have been changed to a simple innings total to preserve the numeric dimension of the data, and the declaration is listed as a separate variable. For the sake of cleanliness, games hosted in Cardiff list England as the host. Ties, an extremely rare occurrence in cricket, did not occur in a single one of the matches between the two countries in this dataset; indeed they have only occurred twice in the history of Test cricket since 1877, and are notable enough to have a separate [Wikipedia](https://en.wikipedia.org/wiki/Tied_Test) entry.
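The declaration clean-up could be done along these lines. This is a sketch that assumes the raw scores are strings such as `493/9d` (runs, wickets, and a trailing `d` for a declaration) or a bare total such as `184`; the helper name `split_declaration` is hypothetical, and the pattern would need adjusting if the raw export looks different.

```python
import re


def split_declaration(score):
    """Return (runs, declared) for a score string such as '493/9d' or '184'."""
    match = re.fullmatch(r"(\d+)(?:/(\d+))?(d)?", score.strip())
    if match is None:
        raise ValueError(f"Unrecognised score format: {score!r}")
    runs = int(match.group(1))              # innings total as a plain integer
    declared = match.group(3) is not None   # True only when a trailing 'd' is present
    return runs, declared


print(split_declaration("493/9d"))
print(split_declaration("184"))
```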

https://en.wikipedia.org/wiki/Result_(cricket)

[Australian tosses](https://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;filter=advanced;host=1;host=2;opposition=1;orderby=start;size=200;spanmin1=1+Jan+1970;spanval1=span;team=2;template=results;toss=1;tournament_type=2;type=team;view=innings)

[English tosses](https://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;filter=advanced;host=1;host=2;opposition=2;orderby=start;size=200;spanmin1=1+Jan+1970;spanval1=span;team=1;template=results;toss=1;tournament_type=2;type=team;view=innings)

In [6]:
ash = pd.read_csv("eng_aus_data.csv")
ash[0:10]

| Unnamed: 0 | Start Date | Host | Toss | Choice | Team | 1st Inns | Result | Ground | Declaration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27/11/1970 | Australia | Australia | | Australia | 433 | draw | Brisbane | False |
| 1 | 11/12/1970 | Australia | Australia | | England | 397 | draw | Perth | False |
| 2 | 09/01/1971 | Australia | England | | England | 332 | won | Sydney | False |
| 3 | 21/01/1971 | Australia | Australia | | Australia | 493 | draw | Melbourne | True |
| 4 | 29/01/1971 | Australia | England | | England | 470 | draw | Adelaide | False |
| 5 | 12/02/1971 | Australia | Australia | | England | 184 | won | Sydney | False |
| 6 | 08/06/1972 | England | England | | England | 249 | won | Manchester | False |
| 7 | 22/06/1972 | England | England | | England | 272 | lost | Lord's | False |
| 8 | 13/07/1972 | England | England | | Australia | 315 | draw | Nottingham | False |
| 9 | 27/07/1972 | England | Australia | | Australia | 146 | lost | Leeds | False |
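Two small loading tweaks might be worth considering, sketched here on the first four rows of the output above rather than the full file. The `Unnamed: 0` column suggests the CSV was saved with its index, which `index_col=0` would absorb, and the dates are day-first strings that can be parsed explicitly. The variable name `ash_sample` is used only for this illustration.

```python
import io

import pandas as pd

# First four rows of the dataset, copied from the output above.
sample_csv = """,Start Date,Host,Toss,Choice,Team,1st Inns,Result,Ground,Declaration
0,27/11/1970,Australia,Australia,,Australia,433,draw,Brisbane,False
1,11/12/1970,Australia,Australia,,England,397,draw,Perth,False
2,09/01/1971,Australia,England,,England,332,won,Sydney,False
3,21/01/1971,Australia,Australia,,Australia,493,draw,Melbourne,True
"""

# index_col=0 uses the saved index instead of creating an "Unnamed: 0" column.
ash_sample = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Dates are day/month/year, so state the format rather than letting pandas guess.
ash_sample["Start Date"] = pd.to_datetime(ash_sample["Start Date"], format="%d/%m/%Y")

print(ash_sample.dtypes)
print(ash_sample["1st Inns"].describe())
```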
