[View in Colaboratory](https://colab.research.google.com/github/nowke/notebooks/blob/master/Statistics_1_Summarizing_Quantitative_data.ipynb)

# Introduction
**Quantitative data** is information that can be measured and expressed as numbers. Examples include:
* Height of a person
* Speed of Tesla cars
* Runs scored by a batsman
* Wickets taken by a bowler

In this notebook, we'll explore various statistical concepts involved in **summarizing quantitative data** with the help of the **Indian Premier League (IPL)** dataset.

The data consists of two CSV files covering all IPL matches played from **2008 to 2018** (11 seasons):
* **`matches.csv`** - match-by-match data
* **`deliveries.csv`** - ball-by-ball data

Let's import the necessary libraries and set up `pandas` dataframes for the above files.

In [0]:
import numpy as np
import pandas as pd
import os

matches    = pd.read_csv('../input/matches.csv')
deliveries = pd.read_csv('../input/deliveries.csv')

Let's inspect the `matches` data before stepping into the concepts.

In [0]:
print(f'Number of rows    = {len(matches)}')
print(f'Number of columns = {len(matches.columns)}')
matches.head()

# Measuring center

The first step usually taught in [descriptive statistics](https://en.wikipedia.org/wiki/Descriptive_statistics) is to measure the center of the given data. There are various ways to measure the center. We'll go through some of them.

Let's get the data ready for our experiments.
* The **`win_by_runs`** column records the margin, in runs, by which the team batting first won the match.
* For example, if **`team1`** bats first and scores **200** runs and **`team2`** scores **150** runs, **`team1`** wins the match by **50 runs**.

Hence, we have to exclude all matches that were won by wickets, i.e. rows where **`win_by_runs = 0`**.

In [0]:
win_by_runs_data = matches[matches['win_by_runs'] > 0].win_by_runs
print(f'Number of rows = {len(win_by_runs_data)}')
win_by_runs_data.head()

Number of rows = 315


0     35
4     15
8     97
13    17
14    51
Name: win_by_runs, dtype: int64

We'll discuss three methods of measuring center: ***Mean***, ***Median*** and ***Mode***.

## Mean

**Mean** (usually referring to the **Arithmetic Mean**, also called the **Average**) is calculated by taking the **sum** of all numbers in the dataset and dividing it by the **total** number of values.

### Arithmetic Mean

\begin{align}
Arithmetic\,mean = {Sum\,of\,all\,numbers \over No.\,of\,values\,in\,the \,set}\,\,\,\,or\, 
\end{align}

\begin{align}
\bar{x} = {\sum_{i=1}^{n} x_{i} \over n}
\end{align}

Arithmetic mean of our data is calculated as,

`mean = (35 + 15 + 97 + 17 + ...) / 315`

Let's do that in code.

In [0]:
win_by_runs_rows = len(win_by_runs_data) # No. of values in the set (n)
win_by_runs_sum = sum(win_by_runs_data) # Sum of all numbers

print(f'Sum of all numbers = {win_by_runs_sum}, No. of values in the set = {win_by_runs_rows}')

win_by_runs_arithmetic_mean = win_by_runs_sum / win_by_runs_rows # Calculating arithmetic mean
print(f'Arithmetic mean = {win_by_runs_arithmetic_mean}')

We can verify the number with the help of the `mean()` method in `pandas`.

In [0]:
win_by_runs_arithmetic_mean_verify = win_by_runs_data.mean()
print(f'Arithmetic mean (verify) = {win_by_runs_arithmetic_mean_verify}')

Arithmetic mean (verify) = 29.76825396825397


### Geometric Mean

Another type of mean is the **geometric mean**. It is calculated as the **n-th root** of the **product** of all the numbers, where n is the total number of values in the dataset. All values in our data are positive, so the root is well defined.

\begin{align}
Geometric\,mean = \sqrt[n]{product\,of\,all\,numbers}
\end{align}

\begin{align}
\bar{x}_{geom} = \sqrt[n]{\prod_{i=1}^n x_i}
\end{align}

Geometric mean of our data is calculated as,

`geometric_mean = 315thRoot(35 x 15 x 97 x 17 x ...)`

In [0]:
from scipy.stats import gmean  # geometric mean of an array-like

win_by_runs_geo_mean = gmean(win_by_runs_data)
print(f'Geometric mean = {win_by_runs_geo_mean}')
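
The formula above multiplies every value together, and the product of a few hundred values can grow beyond what a floating-point number can hold. A common workaround is to average the logarithms and exponentiate the result, which is mathematically the same quantity. The sketch below compares both routes on the first ten entries of `win_by_runs_data`; the function names `geometric_mean_direct` and `geometric_mean_logs` are illustrative, not part of any library.

```python
import math

# First ten win margins from win_by_runs_data
sample = [35, 15, 97, 17, 51, 27, 5, 21, 15, 14]

def geometric_mean_direct(values):
    """n-th root of the product of all values (fine for short lists)."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_logs(values):
    """exp of the arithmetic mean of the logs; avoids building a huge product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

geo_direct = geometric_mean_direct(sample)
geo_logs = geometric_mean_logs(sample)
print(f'Direct = {geo_direct:.4f}, via logs = {geo_logs:.4f}')
```

Both routes should agree up to floating-point rounding. The logarithm version requires every value to be strictly positive, which holds here because rows with `win_by_runs = 0` were filtered out.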

## Median

**Median** is the middle value when the data is sorted in ascending order. Half of the data points are smaller and half of the data points are larger than the median.

As an example, let's take the first 10 entries of the data.

In [0]:
win_by_runs_10 = list(win_by_runs_data[:10])
print(win_by_runs_10)
print(sorted(win_by_runs_10))

[35, 15, 97, 17, 51, 27, 5, 21, 15, 14]
[5, 14, 15, 15, 17, 21, 27, 35, 51, 97]


To find the median:
* Sort the data from smallest to largest (ascending order).
* If there is an **odd** number of data points, the median is the *middle* data point.
* If there is an **even** number of data points, the median is the *average of the two* middle data points.

```
[5, 14, 15, 15, 17, 21, 27, 35, 51, 97]
                ^^  ^^  
           (middle numbers)
                                  
Median = (17 + 21)/2 = 19
```

Let's verify,

In [0]:
win_by_runs_10_median = win_by_runs_data[:10].median()
print(f'Median (first 10) = {win_by_runs_10_median}')

win_by_runs_median = win_by_runs_data.median()
print(f'Median = {win_by_runs_median}')

Median (first 10) = 19.0
Median = 22.0
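
The sorting steps described above can also be written out by hand. This is a minimal sketch, not how `pandas` computes the median internally; `manual_median` is a hypothetical helper name, and the sample is the first ten entries printed earlier.

```python
def manual_median(values):
    """Median by sorting: middle value (odd count) or mean of the two middle values (even count)."""
    if not values:
        raise ValueError('median of an empty list is undefined')
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: single middle point
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average of the two middle points

sample = [35, 15, 97, 17, 51, 27, 5, 21, 15, 14]
print(manual_median(sample))      # even count (10 values): (17 + 21) / 2 = 19.0
print(manual_median(sample[:9]))  # odd count (9 values): middle value is 21
```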


## Mode

**Mode** is the number occurring most often in the dataset.
* It is only meaningful if we have many repeated values in our dataset
* If no value is repeated, there is **no mode**
* A dataset can have ***one mode***, ***multiple modes*** or ***no mode***.

Let's try to retrieve mode for our dataset.

In [0]:
# Retrieve frequency (sorted, descending order)
win_by_runs_data.value_counts(sort=True, ascending=False).head()

4     11
14    11
10    10
15     9
13     9
Name: win_by_runs, dtype: int64

As we can observe, the values `4` and `14` each occur **11 times** in the dataset.

Hence, **Mode = [4, 14]**.

We can verify this using the `pandas.Series.mode` method, since `win_by_runs_data` is a `Series`.

In [0]:
win_by_runs_data_mode = win_by_runs_data.mode()
print(f'Mode = {list(win_by_runs_data_mode)}')

Mode = [4, 14]
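
The three cases listed earlier (one mode, multiple modes, no mode) can be sketched with a plain frequency count. The helper name `find_modes` is hypothetical, and returning an empty list when no value repeats is a choice made here to match the "no mode" convention in the text.

```python
from collections import Counter

def find_modes(values):
    """Return all values tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is repeated: treated as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)

sample = [35, 15, 97, 17, 51, 27, 5, 21, 15, 14]
print(find_modes(sample))              # one mode: [15]
print(find_modes([4, 4, 14, 14, 10]))  # multiple modes: [4, 14]
print(find_modes([1, 2, 3]))           # no mode: []
```

Note that `pandas` handles the last case differently: when no value repeats, `Series.mode()` returns every value, because all of them are tied for the highest count.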
