# Introduction

A distribution of data is a representation (or function) showing all possible values (or intervals) and how often those values occur.

For **categorical data**, we'll often see the percentage or exact count for each category.<br>
For **numerical data**, we'll see the data split into appropriately sized buckets ordered from smallest to largest.
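
As a minimal sketch of both cases, the snippet below builds a categorical and a numerical distribution. The two Series and the bucket edges are hypothetical toy values, not taken from the matches dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toss_decisions = pd.Series(["bat", "field", "field", "bat", "field"])
run_margins = pd.Series([5, 12, 33, 47, 58, 71, 14, 29])

# Categorical: exact count and proportion for each category
category_counts = toss_decisions.value_counts()
category_share = toss_decisions.value_counts(normalize=True)

# Numerical: split into equal-width buckets, then count per bucket
# in ascending bucket order
buckets = pd.cut(run_margins, bins=[0, 20, 40, 60, 80])
bucket_counts = buckets.value_counts().sort_index()

print(category_counts)
print(category_share)
print(bucket_counts)
```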

When a distribution is plotted as a graph, we can observe different shapes of the curve. Based on the shape and other attributes, there exist many types of distributions. A few statistical distributions are:
* Bernoulli Distribution
* Binomial Distribution
* Cumulative frequency distribution
* Bimodal distribution
* Gaussian distribution (Normal distribution)
* Uniform distribution

Let's set up our dataset:

In [0]:
import math
import numpy as np
import pandas as pd
from scipy import stats

matches    = pd.read_csv('../input/matches.csv')
deliveries = pd.read_csv('../input/deliveries.csv')

# Cumulative relative frequency graph

Let's take the `win_by_wickets` column of the `matches` dataset and plot a frequency distribution graph.

X-axis - Win by wickets (value from 1 to 10), Y-axis - Number of instances (or frequency) of win-by-wicket margin

In [0]:
# Keep only the matches that were won by wickets
win_by_wickets_data = matches[matches.win_by_wickets > 0].win_by_wickets
# sort_index() orders the counts by wicket margin instead of order of appearance
win_by_wickets_freq = win_by_wickets_data.value_counts(sort=False).sort_index()
print(win_by_wickets_freq)
ax = win_by_wickets_freq.plot.bar()
ax.set_title("Frequency distribution graph - Win by wickets")
ax.set_xlabel("Win by wickets")
ax.set_ylabel("Frequency")

Now, let's plot the **relative frequency distribution graph** for the same data. Here, on the **Y-axis**, instead of showing the frequency, we show the **proportion** of matches with each value (a fraction between 0 and 1; multiply by 100 for a percentage). We can use the `normalize=True` argument of the `pandas.Series.value_counts` method.

In [0]:
# normalize=True returns proportions (0 to 1); sort_index() orders them by wicket margin
win_by_wickets_rel_freq = win_by_wickets_data.value_counts(sort=False, normalize=True).sort_index()
print(win_by_wickets_rel_freq)
ax = win_by_wickets_rel_freq.plot.bar()
ax.set_title("Relative frequency distribution graph - Win by wickets")
ax.set_xlabel("Win by wickets")
ax.set_ylabel("Relative frequency")

From here, we can plot the **cumulative relative frequency graph** using `pandas.Series.cumsum`. The values must be ordered by wicket margin first, otherwise the running total is meaningless.

In [0]:
# Sort by wicket margin before cumsum() so each value is the share of wins at or below that margin
win_by_wickets_cumulative_freq = win_by_wickets_data.value_counts(sort=False, normalize=True).sort_index().cumsum()
print(win_by_wickets_cumulative_freq)
ax = win_by_wickets_cumulative_freq.plot.bar()
ax.set_title("Cumulative relative frequency distribution graph - Win by wickets")
ax.set_xlabel("Win by wickets")
ax.set_ylabel("Cumulative relative frequency")
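
To see how the three graphs relate, here is a small worked sketch on a hand-made Series. The `margins` values are hypothetical and chosen only so the arithmetic is easy to check by eye; they are not the real match data.

```python
import pandas as pd

# Hypothetical win margins (in wickets), used only to illustrate the three steps
margins = pd.Series([3, 5, 5, 6, 6, 6, 7, 7, 8, 10])

freq = margins.value_counts().sort_index()  # count per margin, ordered by margin
rel_freq = freq / freq.sum()                # proportions that sum to 1
cum_rel_freq = rel_freq.cumsum()            # running total of the proportions

summary = pd.DataFrame(
    {"frequency": freq, "relative": rel_freq, "cumulative": cum_rel_freq}
)
print(summary)
```

The last value of the cumulative column is always 1, since every observation is at or below the largest margin.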

What's the relevance of this representation?

Let's try to answer this &rarr; **What is the probability of winning a match by 6 wickets or less?**

Of course, we can calculate that directly from the data, but let's try to read it off the graph. Draw a line from the point above **"6"** to the Y-axis. We'll draw a **line graph** instead of a bar graph.

In [0]:
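ax = win_by_wickets_cumulative_freq.plot.line()
p_at_most_6 = win_by_wickets_cumulative_freq.loc[6]
# xmax and ymax are fractions of the axes (0 to 1), not data values, so the guide lines end only approximately at the curve
ax.axhline(y=p_at_most_6, xmax=5.5/10, linestyle='dashed')
ax.axvline(x=6, ymax=p_at_most_6, linestyle='dashed')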

Reading off the Y-axis, this value is roughly 0.54.

Thus, using the cumulative relative frequency graph, the **probability of winning a match by 6 wickets or less is approximately 54%** (among the matches that were won by wickets).
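
The same kind of figure can be computed directly rather than read off a graph. The sketch below uses a hypothetical `margins` Series so that it runs on its own; in the notebook, `win_by_wickets_data` would take its place, and the result would differ from the toy value here.

```python
import pandas as pd

# Hypothetical win margins (in wickets); substitute win_by_wickets_data in the notebook
margins = pd.Series([3, 5, 5, 6, 6, 6, 7, 7, 8, 10])

# The mean of a boolean mask is the share of True values, i.e. P(margin <= 6)
prob_at_most_6 = (margins <= 6).mean()
print(f"P(margin <= 6) = {prob_at_most_6:.2f}")
```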

We can also calculate the **percentile** of a value using the above graph: it is the cumulative relative frequency at that value, expressed as a percentage. For example, if a team wins by a margin of **4** wickets, the **percentile** of this match is around **16%**.
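
A percentile can also be computed without the graph. The sketch below uses `scipy.stats.percentileofscore` on a hypothetical `margins` Series (not the real match data). Which `kind` to use is an assumption on our part: `"weak"` counts values less than or equal to the score, which matches the cumulative relative frequency used above.

```python
import pandas as pd
from scipy import stats

# Hypothetical win margins (in wickets), used only for illustration
margins = pd.Series([3, 5, 5, 6, 6, 6, 7, 7, 8, 10])

# kind="weak": percentage of values <= 5 (matches the cumulative relative frequency)
pct_weak = stats.percentileofscore(margins, 5, kind="weak")
# kind="strict": percentage of values strictly below 5
pct_strict = stats.percentileofscore(margins, 5, kind="strict")

print(pct_weak, pct_strict)
```

The two conventions can differ noticeably when many observations share the same value, as they do with wicket margins, so it helps to state which one is meant.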