## Introduction

This notebook covers statistics, with a primary focus on the branch called descriptive statistics.

Descriptive statistics summarizes and describes data through measures of central tendency, dispersion, and shape.

<b>Measure of Central Tendency:-</b> A measure of central tendency is a single value that represents the center of a dataset, that is, the typical value around which the data points cluster. The three common measures of central tendency are the mean, median, and mode.

<b>Measure of Dispersion:-</b> A measure of dispersion is a value that describes how far the data points are spread out around the center, that is, the variability within the data. Two commonly used measures of dispersion are the variance and the standard deviation.

## Required Dependencies

In [1]:
import statistics
from typing import Sequence

In [2]:
from warnings import filterwarnings
filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd

In [4]:
from matplotlib import pyplot as plt
import seaborn as sns

## Data

Data can be described as individual units of information, stored in either structured or unstructured form. Data organized into rows and columns, such as a CSV file, an Excel spreadsheet, or a database table, is structured data. Data without a predefined organization, such as free text, images, or audio, is unstructured data.

Data falls into two main types: qualitative data and quantitative data.

### Qualitative Data

Qualitative data is the categorical information in a dataset, and it comes in two types: ordinal and nominal. Ordinal data has a meaningful order among its categories, such as T-shirt sizes, where Small < Medium < Large. Nominal data has categories with no inherent order, such as countries (e.g., U.S.A., Canada, England, Scotland).
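
The difference between the two types shows up in what you can do with the values. The following is a minimal sketch using only the standard library; the names `SIZE_ORDER`, `shirt_sizes`, and `countries` are hypothetical and chosen just for this illustration.

```python
from collections import Counter

# Ordinal data: the categories have a meaningful order, stated explicitly here.
SIZE_ORDER = ["Small", "Medium", "Large"]  # hypothetical ordering for this example
shirt_sizes = ["Large", "Small", "Medium", "Small"]

# Sorting by position in SIZE_ORDER respects Small < Medium < Large.
sorted_sizes = sorted(shirt_sizes, key=SIZE_ORDER.index)
print(sorted_sizes)  # ['Small', 'Small', 'Medium', 'Large']

# Nominal data: no order exists, so counting occurrences is the natural summary.
countries = ["Canada", "U.S.A", "Canada", "Scotland"]
country_counts = Counter(countries)
print(country_counts.most_common(1))  # [('Canada', 2)]
```

Sorting the countries alphabetically would still work, but the resulting order carries no meaning about the data itself.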

### Quantitative Data

Quantitative data is the numerical information in a dataset. Here it is classified by how it is stored: as integers or as floating-point numbers. Floating-point values include a fractional part, such as 1.2, 5.5, and 3.2; an example is an amount paid in dollars and cents. Integer values are whole numbers, such as the number of apples purchased by different customers (3, 3, 5, and 1), since the store does not sell fractional amounts of apples like 4.5 or 3.2. In statistical terms, these roughly correspond to continuous and discrete data, respectively.

## Measure of Central Tendency

### Mean Calculation Algorithm and Code

In mathematical terms, the mean is defined as the sum of all observations divided by the total number of observations.
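
Written as a formula and applied to the ten sample values generated in the next cell (6, 8, 7, 8, 1, 6, 4, 4, 2, 4), the definition looks as follows. The symbols $\bar{x}$ (the mean), $x_i$ (the $i$-th observation), and $n$ (the number of observations) are notation introduced here for illustration.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{6+8+7+8+1+6+4+4+2+4}{10} = \frac{50}{10} = 5
$$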

#### Mean using traditional functional program

In [5]:
np.random.seed(52)
random_values = np.random.randint(low=1, high=10, size=10)
random_values

array([6, 8, 7, 8, 1, 6, 4, 4, 2, 4])

In [6]:
def calculate_mean(num_sequence: Sequence) -> float:
    """calculates mean function"""

    sum_of_values = 0
    number_of_values = len(num_sequence)
    
    for i in num_sequence:
        sum_of_values += i
    
    mean_of_data = sum_of_values/number_of_values
    return mean_of_data

In [7]:
calculate_mean(num_sequence=random_values)

5.0

#### Mean using numpy arrays

In [8]:
def calculate_mean_using_np_array(num_sequence: Sequence) -> float:
    """calculate mean using numpy module"""

    num_sequence = np.array(num_sequence)
    mean_value = np.mean(num_sequence)
    return mean_value

In [9]:
calculate_mean_using_np_array(num_sequence=random_values)

5.0

#### Mean using statistics module

In [10]:
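def calculate_mean_using_statistics(num_sequence: Sequence) -> float:
    """calculate mean using statistics module"""
    
    # fmean (Python 3.8+) always returns a Python float; statistics.mean can
    # coerce the result back to the input type (a NumPy integer here)
    mean_value = statistics.fmean(num_sequence)
    return mean_value

In [11]:
calculate_mean_using_statistics(num_sequence=random_values)

5.0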

### Median Calculation Algorithm and Code

The median is the value located at the middle position of a dataset when it is arranged in ascending order. With positions counted from 1 and n observations, an odd n puts the median at position (n+1)/2. When n is even, there is no single middle value, so the median is the average of the values at positions n/2 and (n/2)+1.
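
As a small worked trace of the definition, the sketch below walks through one odd-length and one even-length dataset step by step, using 1-based positions in the sorted data. The variable names are hypothetical and used only for this illustration.

```python
# Odd number of observations: one middle value at position (n + 1) / 2.
odd_values = sorted([7, 1, 4])                 # [1, 4, 7]
odd_position = (len(odd_values) + 1) // 2      # position 2 (1-based)
odd_median = odd_values[odd_position - 1]      # subtract 1 for 0-based indexing
print(odd_median)  # 4

# Even number of observations: average the values at positions n/2 and n/2 + 1.
even_values = sorted([6, 8, 7, 8, 1, 6, 4, 4, 2, 4])  # [1, 2, 4, 4, 4, 6, 6, 7, 8, 8]
lower_position = len(even_values) // 2         # position 5 holds 4
upper_position = lower_position + 1            # position 6 holds 6
even_median = (even_values[lower_position - 1] + even_values[upper_position - 1]) / 2
print(even_median)  # 5.0
```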

#### Median using traditional functional program 

In [12]:
def calculate_median(num_sequence: Sequence) -> float:
    """calculate median function"""

    num_sequence = sorted(list(num_sequence))
    
    if len(num_sequence) % 2 == 0:
        first_term = (len(num_sequence)//2) -1 
        second_term = (len(num_sequence)//2 + 1) - 1
        median = (num_sequence[first_term] + num_sequence[second_term])/2
    else:
        median_term = (len(num_sequence)//2)
        median = num_sequence[median_term]
    return median

In [13]:
calculate_median(num_sequence=random_values)

5.0

#### Median using numpy arrays

In [14]:
def calculate_median_using_np_array(num_sequence: Sequence) -> float:
    """calculate median using numpy module"""
    
    num_sequence = np.array(num_sequence)
    median_value = np.median(num_sequence)
    return median_value

In [15]:
calculate_median_using_np_array(random_values)

5.0

#### Median using statistics module

In [16]:
def calculate_median_using_statistics(num_sequence: Sequence) -> float:
    """calculate median using statistics module"""
    
    median = statistics.median(num_sequence)
    return median

In [17]:
calculate_median_using_statistics(num_sequence=random_values)

5.0

### Mode Calculation Algorithm and Code

The mode is the value that appears most frequently in a dataset. A dataset may have no mode (when every value occurs equally often), one mode (unimodal), two modes (bimodal), or more than two modes (multimodal).
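
To see how ties show up in practice, here is a minimal sketch using `statistics.multimode` (available in Python 3.8 and later), which returns every value that reaches the highest count. The three example lists are made up for illustration.

```python
import statistics

one_mode = [6, 8, 7, 8, 1, 6, 4, 4, 2, 4]   # 4 occurs three times
two_modes = [1, 1, 2, 3, 3]                 # 1 and 3 both occur twice
no_repeats = [1, 2, 3]                      # every value occurs once

single = statistics.multimode(one_mode)
tied = statistics.multimode(two_modes)
all_equal = statistics.multimode(no_repeats)

print(single)     # [4]
print(tied)       # [1, 3]
print(all_equal)  # [1, 2, 3]
```

When every value is returned, as in the last case, the dataset is usually described as having no mode.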

#### Mode using traditional functional program

In [18]:
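def calculate_mode(num_sequence: Sequence) -> float:
    """calculate mode function"""

    # count how many times each value occurs
    val_counts = {}
    for val in num_sequence:
        if val not in val_counts:
            val_counts[val] = 1
        else:
            val_counts[val] += 1

    # find the highest count once, after counting, and take the first value that reaches it
    max_count = max(val_counts.values())
    mode = [value for value, count in val_counts.items() if count == max_count][0]
    return mode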

In [19]:
calculate_mode(num_sequence=random_values)

4

#### Mode using numpy arrays

In [20]:
def calculate_mode_using_np_array(num_sequence: Sequence) -> float:
    """calculate mode using numpy module"""
    
    num_sequence = np.array(num_sequence)
    vals, counts = np.unique(num_sequence, return_counts=True)
    mode = vals[np.argmax(counts)]
    return mode

In [21]:
calculate_mode_using_np_array(random_values)

4

#### Mode using statistics module

In [22]:
def calculate_mode_using_statistics(num_sequence: Sequence) -> float:
    """calculate mode using statistics module"""
    
    mode = statistics.mode(num_sequence)
    return mode

In [23]:
calculate_mode_using_statistics(random_values)

4

## Measure of Dispersion

### Variance Calculation Algorithm and Code

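Variance is the average of the squared deviations of the data points from the mean, so it measures how widely the data points are scattered around the mean. Dividing the sum of squared deviations by n gives the population variance, while dividing by n-1 gives the sample variance. The hand-written and NumPy implementations that follow divide by n.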
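
Two denominators are in common use for this average: n (population variance) and n-1 (sample variance). The sketch below computes both by hand and checks them against `statistics.pvariance` and `statistics.variance`. It uses a plain Python list holding the same ten values as `random_values`; the variable names are hypothetical.

```python
import math
import statistics

values = [6.0, 8.0, 7.0, 8.0, 1.0, 6.0, 4.0, 4.0, 2.0, 4.0]
n = len(values)
mean = sum(values) / n                                # 5.0

# sum of squared deviations from the mean
sum_sq = sum((v - mean) ** 2 for v in values)         # 52.0

population_variance = sum_sq / n                      # divide by n
sample_variance = sum_sq / (n - 1)                    # divide by n - 1

print(population_variance)          # 5.2
print(round(sample_variance, 4))    # 5.7778

# the standard library agrees with both hand computations
assert math.isclose(population_variance, statistics.pvariance(values))
assert math.isclose(sample_variance, statistics.variance(values))
```

NumPy's `np.var` uses the population formula by default; passing `ddof=1` switches it to the sample formula.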

#### Variance using traditional functional program

In [25]:
def calculate_variance(num_sequence: Sequence) -> float:
    """calculate variance function"""

    mean = calculate_mean(num_sequence)
    variance = sum([(val-mean)**2 for val in num_sequence])/len(num_sequence)
    return variance

In [26]:
calculate_variance(random_values)

5.2

#### Variance using numpy arrays

In [27]:
def calculate_variance_using_np_array(num_sequence: Sequence) -> float:
    """calculate variance using numpy module"""
    
    num_sequence = np.array(num_sequence)
    variance = np.var(num_sequence)
    return variance

In [28]:
calculate_variance_using_np_array(random_values)

5.2

#### Variance using statistics module

In [29]:
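def calculate_variance_using_statistics(num_sequence: Sequence) -> float:
    """calculate variance using statistics module"""
    
    # pvariance divides by n, matching calculate_variance and np.var above
    # (statistics.variance divides by n - 1). Converting to plain floats keeps
    # the statistics module from coercing the result back to a NumPy integer.
    variance = statistics.pvariance([float(val) for val in num_sequence])
    return variance

In [31]:
calculate_variance_using_statistics(random_values)

5.2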

### Standard Deviation Calculation Algorithm and Code

The standard deviation is the square root of the variance. Because it is expressed in the same units as the data, it is often easier to interpret than the variance as a measure of how far the data points spread from the mean.
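
The relationship between the two measures can be checked directly. The following is a minimal sketch using the standard library and a plain Python list with the same ten values as `random_values`; it uses the population formulas (division by n), and the variable names are hypothetical.

```python
import math
import statistics

values = [6.0, 8.0, 7.0, 8.0, 1.0, 6.0, 4.0, 4.0, 2.0, 4.0]

population_variance = statistics.pvariance(values)   # 5.2
population_std = math.sqrt(population_variance)      # square root of the variance

print(round(population_std, 4))  # 2.2804

# squaring the standard deviation recovers the variance (up to rounding error)
assert math.isclose(population_std ** 2, population_variance)
# statistics.pstdev computes the same quantity in one call
assert math.isclose(population_std, statistics.pstdev(values))
```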

#### Standard Deviation using traditional functional program

In [33]:
def calculate_std(num_sequence: Sequence) -> float:
    """calculate standard deviation"""

    standard_deviation = np.sqrt(calculate_variance(num_sequence=num_sequence))
    return standard_deviation

In [34]:
calculate_std(random_values)

2.280350850198276

#### Standard Deviation using numpy module

In [37]:
def calculate_std_using_np_array(num_sequence: Sequence) -> float:
    """calculate standard deviation using numpy module"""
    
    num_sequence = np.array(num_sequence)
    standard_deviation = np.std(num_sequence)
    return standard_deviation

In [38]:
calculate_std_using_np_array(random_values)

2.280350850198276

#### Standard Deviation using statistics module

In [39]:
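def calculate_std_using_statistics(num_sequence: Sequence) -> float:
    """calculate standard deviation using statistics module"""
    
    # pstdev divides by n, matching calculate_std and np.std above
    # (statistics.stdev divides by n - 1). Converting to plain floats keeps
    # the statistics module from coercing intermediate results to NumPy integers.
    standard_deviation = statistics.pstdev([float(val) for val in num_sequence])
    return standard_deviation

In [40]:
calculate_std_using_statistics(random_values)

2.280350850198276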