<a href="https://colab.research.google.com/github/manujsinghwal/applied-statistics-in-python/blob/main/3.%20descriptive-statistics/intro_to_descriptive_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. Common measures include:

1. **Mean (Average):** The sum of the values divided by the number of values; a measure of the "typical" value.
2. **Median:** The middle value when the data is sorted (or the average of the two middle values when the count is even).
3. **Mode:** The most frequent value.
4. **Range:** The difference between the highest and lowest values.
5. **Standard Deviation:** How far the values typically lie from the mean.

In short, it's all about summarizing data to make it easier to understand.
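
The five measures listed above can be computed with Python's standard `statistics` module. The snippet below is a minimal sketch: the `values` list is a hypothetical sample chosen only for illustration, and both the sample and population versions of the standard deviation are shown because the text does not say which one is meant.

```python
import statistics

# Hypothetical sample, chosen only to illustrate the five measures.
values = [2, 4, 4, 5, 7, 9]

mean = statistics.mean(values)              # sum of values / number of values
median = statistics.median(values)          # middle of the sorted data (average of the two middle values here)
mode = statistics.mode(values)              # most frequent value
value_range = max(values) - min(values)     # highest value minus lowest value
sample_std = statistics.stdev(values)       # sample standard deviation (n - 1 in the denominator)
population_std = statistics.pstdev(values)  # population standard deviation (n in the denominator)

print("mean:", round(mean, 3))
print("median:", median)
print("mode:", mode)
print("range:", value_range)
print("sample std:", round(sample_std, 3))
print("population std:", round(population_std, 3))
```

The sample standard deviation is never smaller than the population version, since it divides by `n - 1` rather than `n`.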

### Frequency Distribution

A frequency distribution is a way to organize data to show how often each value (or group of values) occurs. It's like a table or graph that shows the 'frequency' (or count) of each value in a dataset.

Example: The data below records the daily demand for hammers at a hardware store over the last 20 days. Develop a frequency distribution to summarize this data.

[2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]

A frequency distribution is a two-column table. In the left column, list each distinct value in the data set from least to greatest. Count the number of times each value appears and record those totals in the right column.
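
The two-column procedure just described can be sketched in plain Python before turning to pandas. This is an illustrative cross-check using `collections.Counter`; the `frequency_table` name is an assumption made for this sketch.

```python
from collections import Counter

# Daily hammer demand over the last 20 days (from the example above).
daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]

# Count how many times each distinct value appears.
counts = Counter(daily_demand)

# Left column: distinct values from least to greatest; right column: their counts.
frequency_table = [(value, counts[value]) for value in sorted(counts)]

print("DailyDemand  Frequency")
for value, frequency in frequency_table:
    print(f"{value:>11}  {frequency:>9}")
```

The frequencies add up to the number of observations (20 here), which is a quick sanity check on any frequency distribution.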

In [1]:
# Importing libraries
import pandas as pd

In [9]:
# Generating dataset
daily_demand = [2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
print(daily_demand)
print(type(daily_demand))

[2, 1, 0, 2, 1, 3, 0, 2, 4, 0, 3, 2, 3, 4, 2, 2, 2, 4, 3, 0]
<class 'list'>


In [15]:
# Making dataframe
df = pd.DataFrame(
    daily_demand,
    columns=['DailyDemand']
)
df

| | DailyDemand |
|---|---|
| 0 | 2 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 1 |
| 5 | 3 |
| 6 | 0 |
| 7 | 2 |
| 8 | 4 |
| 9 | 0 |

(first 10 of 20 rows shown)


In [17]:
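# Summarizing data: count how many days each demand value occurred
summary = (
    df
    .groupby('DailyDemand')           # one group per distinct demand value
    .size()                           # number of rows (days) in each group
    .reset_index(name='Frequency')    # turn the result into a two-column DataFrame
    .sort_values(by='DailyDemand')    # list values from least to greatest
)
summary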

| | DailyDemand | Frequency |
|---|---|---|
| 0 | 0 | 4 |
| 1 | 1 | 2 |
| 2 | 2 | 7 |
| 3 | 3 | 4 |
| 4 | 4 | 3 |


### Relative Frequency

Relative frequency is the proportion or percentage of times a specific value occurs in a dataset, compared to the total number of values. It tells you how common a value is relative to the entire dataset.

$ \text{Relative Frequency} = \frac{\text{Frequency of a value}}{\text{Total number of values}} $
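
As a worked instance of this formula with the hammer data, a demand of 2 occurs on 7 of the 20 days, so:

$ \text{Relative Frequency}(2) = \frac{7}{20} = 0.35 $

That is, demand was exactly 2 hammers on 35% of the days.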

In [21]:
# Calculating relative frequency
summary['RelativeFrequency'] = summary['Frequency']/summary['Frequency'].sum()
summary

| | DailyDemand | Frequency | RelativeFrequency |
|---|---|---|---|
| 0 | 0 | 4 | 0.2 |
| 1 | 1 | 2 | 0.1 |
| 2 | 2 | 7 | 0.35 |
| 3 | 3 | 4 | 0.2 |
| 4 | 4 | 3 | 0.15 |


The relative frequencies always sum to 1.00 (100%), because every observation is counted exactly once.
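
This property can be checked directly. The sketch below hard-codes the frequencies from the hammer-demand table and compares an exact sum (using `fractions.Fraction`) with a floating-point sum, which may differ from 1.0 by a tiny rounding error and is therefore better tested with `math.isclose` than with `==`. The variable names are assumptions made for this illustration.

```python
from fractions import Fraction
import math

# Frequencies from the hammer-demand table: demand value -> number of days.
frequencies = {0: 4, 1: 2, 2: 7, 3: 4, 4: 3}
total = sum(frequencies.values())

# Exact relative frequencies as fractions, and their floating-point counterparts.
exact = {value: Fraction(count, total) for value, count in frequencies.items()}
approx = {value: count / total for value, count in frequencies.items()}

exact_sum = sum(exact.values())      # exactly 1 with rational arithmetic
approx_sum = sum(approx.values())    # may be off from 1.0 by a tiny rounding error

print("exact sum:", exact_sum)
print("float sum is close to 1:", math.isclose(approx_sum, 1.0))
```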

### Cumulative Relative Frequency

Cumulative relative frequency is the running total of the relative frequencies. The cumulative relative frequency for a particular row is the relative frequency for that row plus the cumulative relative frequency of the previous row. The last cumulative relative frequency is always 1.00 (100%).
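
The running-total rule can be sketched with `itertools.accumulate` from the standard library. The relative frequencies below are taken from the hammer-demand table; the list names are assumptions made for this sketch.

```python
from itertools import accumulate

# Relative frequencies from the hammer-demand table, ordered by demand value.
demand_values = [0, 1, 2, 3, 4]
relative_frequencies = [0.20, 0.10, 0.35, 0.20, 0.15]

# Each entry is the current relative frequency plus the previous running total.
cumulative = list(accumulate(relative_frequencies))

for value, cum in zip(demand_values, cumulative):
    print(f"share of days with demand <= {value}: {cum:.2f}")
```

Read this way, a cumulative relative frequency of 0.65 at a demand of 2 means that on 65% of the days the store sold 2 hammers or fewer.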

In [24]:
# Calculating cumulative relative frequency
summary['CumulativeRelativeFrequency'] = summary['RelativeFrequency'].cumsum()
summary

| | DailyDemand | Frequency | RelativeFrequency | CumulativeRelativeFrequency |
|---|---|---|---|---|
| 0 | 0 | 4 | 0.2 | 0.2 |
| 1 | 1 | 2 | 0.1 | 0.3 |
| 2 | 2 | 7 | 0.35 | 0.65 |
| 3 | 3 | 4 | 0.2 | 0.85 |
| 4 | 4 | 3 | 0.15 | 1.0 |
