# Descriptive Statistics

Descriptive statistics summarizes and describes the main features of a dataset. Common measures include:

1. **Mean (Average):** The sum of the values divided by the number of values; a measure of the "typical" value.
2. **Median:** The middle value when the data is sorted.
3. **Mode:** The most frequent value.
4. **Range:** The difference between the highest and lowest values.
5. **Standard Deviation:** How far the values typically lie from the mean.

In short, it's all about summarizing data to make it easier to understand.
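
As a minimal sketch of the five measures listed above, the snippet below uses Python's built-in `statistics` module. The `sample` values are made up for illustration and are not data from this notebook. It uses the population standard deviation (`pstdev`); `statistics.stdev` would give the sample version instead.

```python
import statistics

# Hypothetical sample, chosen only for illustration
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(sample)       # arithmetic average
median_value = statistics.median(sample)   # middle value of the sorted data
mode_value = statistics.mode(sample)       # most frequent value
range_value = max(sample) - min(sample)    # highest minus lowest
std_value = statistics.pstdev(sample)      # population standard deviation

print(mean_value, median_value, mode_value, range_value, std_value)
```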

### Frequency Distribution

A frequency distribution is a way to organize data to show how often each value (or group of values) occurs. It's like a table or graph that shows the 'frequency' (or count) of each value in a dataset.

**E.g.** The data below shows the daily demand for hammers at a hardware store over the last 20 days. Develop a frequency distribution summarizing this data.

[2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]

A frequency distribution is a two-column table. In the left column, list each distinct value in the data set from least to greatest. Count the number of times each value appears and record those totals in the right column.

In [43]:
# Importing libraries
import pandas as pd
import numpy as np

In [44]:
# Generating dataset
daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
print(daily_demand)
print(type(daily_demand))

[2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
<class 'list'>


In [45]:
# Making dataframe
df1 = pd.DataFrame(
    daily_demand,
    columns=['DailyDemand']
)

# Shows top 5 records only
df1.head()

|   | DailyDemand |
|---|---|
| 0 | 2 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 1 |


In [46]:
# Summarizing data
summary1 = (
    df1
    .groupby('DailyDemand')
    .size()
    .reset_index(name='Frequency')
    .sort_values(by='DailyDemand')
)
summary1

|   | DailyDemand | Frequency |
|---|---|---|
| 0 | 0 | 4 |
| 1 | 1 | 2 |
| 2 | 2 | 7 |
| 3 | 3 | 4 |
| 4 | 4 | 3 |
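
The same counts can be cross-checked without pandas. The sketch below uses `collections.Counter` from the standard library; it is an alternative to the `groupby` approach above, and the name `frequency_table` is introduced here only for illustration.

```python
from collections import Counter

daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]

# Count how often each distinct value occurs
counts = Counter(daily_demand)

# List the distinct values from least to greatest with their frequencies
frequency_table = sorted(counts.items())
for value, frequency in frequency_table:
    print(value, frequency)
```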


### Relative Frequency

Relative frequency is the proportion or percentage of times a specific value occurs in a dataset, compared to the total number of values. It tells us how common a value is relative to the entire dataset.

$\text{Relative frequency} = \frac{\text{Frequency of a value}}{\text{Total number of values}}$

In [47]:
# Calculating relative frequency
summary1['RelativeFrequency'] = summary1['Frequency']/summary1['Frequency'].sum()
summary1

|   | DailyDemand | Frequency | RelativeFrequency |
|---|---|---|---|
| 0 | 0 | 4 | 0.2 |
| 1 | 1 | 2 | 0.1 |
| 2 | 2 | 7 | 0.35 |
| 3 | 3 | 4 | 0.2 |
| 4 | 4 | 3 | 0.15 |


The relative frequencies should always sum to 1.00 (100%).
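
This property makes a convenient sanity check. The sketch below recomputes the relative frequencies for the hammer data in plain Python and compares their sum to 1 with a tolerance, since floating-point sums can carry tiny rounding errors. The variable names are illustrative assumptions.

```python
import math
from collections import Counter

daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
n = len(daily_demand)

# Relative frequency = frequency of a value / total number of values
relative_frequency = {
    value: count / n
    for value, count in sorted(Counter(daily_demand).items())
}

total = sum(relative_frequency.values())

# Compare with a tolerance instead of ==, because floats may not sum exactly
print(relative_frequency)
print(math.isclose(total, 1.0))
```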

### Cumulative Relative Frequency

Cumulative relative frequency is the running total of the relative frequencies. For a given row, it equals that row's relative frequency plus the cumulative relative frequency of the previous row. The last cumulative relative frequency should always be 1.00 (100%).

In [48]:
# Calculating cumulative relative frequency
summary1['CumulativeRelativeFrequency'] = summary1['RelativeFrequency'].cumsum()
summary1

|   | DailyDemand | Frequency | RelativeFrequency | CumulativeRelativeFrequency |
|---|---|---|---|---|
| 0 | 0 | 4 | 0.2 | 0.2 |
| 1 | 1 | 2 | 0.1 | 0.3 |
| 2 | 2 | 7 | 0.35 | 0.65 |
| 3 | 3 | 4 | 0.2 | 0.85 |
| 4 | 4 | 3 | 0.15 | 1.0 |
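
One way to read the cumulative column: the entry for a value is the share of observations at or below that value. The sketch below rebuilds the running total with `itertools.accumulate` and cross-checks the entry for a demand of 2 against a direct count from the raw data. The variable names are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import accumulate

daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
n = len(daily_demand)

counts = Counter(daily_demand)
values = sorted(counts)                      # distinct values, least to greatest
relative = [counts[v] / n for v in values]   # relative frequency per value
cumulative = list(accumulate(relative))      # running total

# Cross-check: share of days with demand of 2 or fewer, taken from the raw data
share_at_most_2 = sum(1 for d in daily_demand if d <= 2) / n

print(dict(zip(values, cumulative)))
print(math.isclose(cumulative[values.index(2)], share_at_most_2))
```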


**E.g.** The dataset below shows the number of calls made per day from a cell phone over the past 30 days.

[4, 5, 1, 0, 7, 8, 3, 6, 8, 3, 0, 9, 2, 12, 14, 5, 5, 10, 7, 2, 11, 9, 4, 3, 1, 5, 7, 3, 5, 6]

Develop a frequency distribution summarizing the data.

Because this data has many possible outcomes, we should group the number of calls per day into groups, which are known as **classes**.

One option for choosing the number of classes is the $2^k \geq n$ rule, where $k$ is the number of classes and $n$ is the number of data points.

Given $n = 30$, the smallest value of $k$ that satisfies the rule is 5:

$2^k \geq n$

$2^k \geq 30$

$2^4 = 16 < 30$, so $k = 4$ is too small

$2^5 = 32 \geq 30$, so $k = 5$ works

Next, calculate the width $W$ of each class.

$W = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$

$W = \frac{14 - 0}{5} = 2.8 \approx 3$

Round the width up to 3 so that the five classes cover the whole range: 0 to 2, 3 to 5, 6 to 8, 9 to 11, and 12 to 14. List the classes in the left column of the frequency distribution. Count the number of values contained in each class and list those counts in the right column.
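
The steps above can be sketched in plain Python as a cross-check. The snippet assumes integer data, classes that start at the smallest value, and inclusive upper class limits; the variable names are illustrative and not part of the notebook's pandas code.

```python
import math
from collections import Counter

calls_per_day = [4, 5, 1, 0, 7, 8, 3, 6, 8, 3, 0, 9, 2, 12, 14,
                 5, 5, 10, 7, 2, 11, 9, 4, 3, 1, 5, 7, 3, 5, 6]
n = len(calls_per_day)

# Smallest k with 2**k >= n
k = 1
while 2 ** k < n:
    k += 1

# Class width, rounded up so the k classes cover the whole range
lowest = min(calls_per_day)
width = math.ceil((max(calls_per_day) - lowest) / k)

# Assign each value to a class index, then count the values per class
class_counts = Counter((value - lowest) // width for value in calls_per_day)

for i in range(k):
    lower = lowest + i * width
    upper = lower + width - 1   # inclusive upper limit for integer data
    print(f"{lower}-{upper}: {class_counts[i]}")
```

Integer division by the width maps each value to its class index, so no explicit boundary comparisons are needed for integer data.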

In [49]:
# Generating dataset
calls_per_day = [4, 5, 1, 0, 7, 8, 3, 6, 8, 3, 0, 9, 2, 12, 14, 5, 5, 10, 7, 2, 11, 9, 4, 3, 1, 5, 7, 3, 5, 6]

# Create a pandas DataFrame
df2 = pd.DataFrame(
    calls_per_day,
    columns=['CallsPerDay']
    )

# Shows top 5 records only
df2.head()

|   | CallsPerDay |
|---|---|
| 0 | 4 |
| 1 | 5 |
| 2 | 1 |
| 3 | 0 |
| 4 | 7 |


In [50]:
# Best number of classes based on the rule 2^k >= n (since 2^5 = 32 >= 30)
k = 5
n = len(df2)
min_value = df2['CallsPerDay'].min()
max_value = df2['CallsPerDay'].max()

# Class width (rounded up to the next whole number)
w = np.ceil((max_value - min_value)/k).astype(int)
w

3

In [68]:
# Creating the class boundaries: k classes of width w, starting at the smallest value
bins = min_value + w * np.arange(k + 1)
bins

array([ 0,  3,  6,  9, 12, 15])

In [70]:
# Summarizing data with bins; right=False makes each class include its lower
# boundary and exclude its upper one, e.g. [0, 3) holds 0, 1 and 2
df2['Class'] = pd.cut(df2['CallsPerDay'], bins=bins, right=False)
summary2 = (
    df2
    .groupby('Class', observed=True)['CallsPerDay']
    .count()
    .reset_index(name='Occurrence')
)
summary2

|   | Class | Occurrence |
|---|---|---|
| 0 | [0, 3) | 6 |
| 1 | [3, 6) | 11 |
| 2 | [6, 9) | 7 |
| 3 | [9, 12) | 4 |
| 4 | [12, 15) | 2 |


In [74]:
# Calculating relative and cumulative relative frequency
summary2['RelativeFrequency'] = summary2['Occurrence']/summary2['Occurrence'].sum()
summary2['CumulativeRelativeFrequency'] = summary2['RelativeFrequency'].cumsum()
summary2

|   | Class | Occurrence | RelativeFrequency | CumulativeRelativeFrequency |
|---|---|---|---|---|
| 0 | [0, 3) | 6 | 0.2 | 0.2 |
| 1 | [3, 6) | 11 | 0.366667 | 0.566667 |
| 2 | [6, 9) | 7 | 0.233333 | 0.8 |
| 3 | [9, 12) | 4 | 0.133333 | 0.933333 |
| 4 | [12, 15) | 2 | 0.066667 | 1.0 |
