Statistics

# Importing Necessary Libraries

import numpy as np
import pandas as pd
import scipy.stats as stats


# Function to Generate Gaussian Distribution Data

def gauss(size, sigma=1, mu=3):
    """Generates random data from a normal distribution."""
    return np.random.normal(mu, sigma, size)


# Generating Datasets

data_100 = gauss(100)
data_10000 = gauss(10000)


# Function to Calculate Descriptive Statistics

def calculate_statistics_final(data):
    """Calculates a variety of descriptive statistics for a given dataset."""
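    # Note: for continuous data every value is typically unique, so stats.mode
    # returns the smallest value (with a count of 1), not a meaningful mode.
    mode_result = stats.mode(data, keepdims=True)
    mode_value = mode_result.mode[0] if len(mode_result.mode) > 0 else np.nan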
    statistics = {
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Mode': mode_value,
        '1st Quartile (Q1)': np.percentile(data, 25),
        '3rd Quartile (Q3)': np.percentile(data, 75),
        'Range': np.max(data) - np.min(data),
        'IQR': stats.iqr(data),
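        # np.var and np.std default to ddof=0 (divide by n);
        # pass ddof=1 for the sample (n - 1) versions.
        'Variance': np.var(data),
        'Standard Deviation': np.std(data),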
        'Skewness': stats.skew(data),
        'Min': np.min(data),
        'Max': np.max(data)
    }
    return statistics


# Calculating Statistics for Both Datasets

stats_100_final = calculate_statistics_final(data_100)
stats_10000_final = calculate_statistics_final(data_10000)


# Creating DataFrame to Compare Statistics

df_stats_final = pd.DataFrame({
    'Statistic': stats_100_final.keys(),
    'Sample Size 100': stats_100_final.values(),
    'Sample Size 10,000': stats_10000_final.values()
})


# Displaying the DataFrame with the Statistics

df_stats_final

   Statistic           Sample Size 100  Sample Size 10,000
0  Mean                       3.348747            2.978463
1  Median                     3.346255            2.968966
2  Mode                       0.926071           -0.535762
3  1st Quartile (Q1)          2.605357            2.302967
4  3rd Quartile (Q3)          4.112184            3.649948
5  Range                       4.81176            7.461104
6  IQR                        1.506826            1.346981
7  Variance                   1.074192            0.977052
8  Standard Deviation         1.036432            0.988459
9  Skewness                    0.02157            0.021164
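
For reference, the theoretical values that these sample statistics estimate can be computed directly from the distribution. The following is a minimal sketch, assuming the same parameters as the defaults of gauss() (mu=3, sigma=1); the names dist and theoretical are introduced here for illustration only.

```python
import scipy.stats as stats

# Parameters assumed to match the defaults of gauss() above.
mu, sigma = 3, 1
dist = stats.norm(loc=mu, scale=sigma)

# Quartiles come from the inverse CDF (percent point function).
q1, q3 = dist.ppf(0.25), dist.ppf(0.75)

theoretical = {
    'Mean': dist.mean(),
    'Median': dist.median(),
    '1st Quartile (Q1)': q1,
    '3rd Quartile (Q3)': q3,
    'IQR': q3 - q1,
    'Variance': dist.var(),
    'Standard Deviation': dist.std(),
    'Skewness': float(dist.stats(moments='s')),
}

for name, value in theoretical.items():
    print(f"{name}: {value:.4f}")
```

Range and mode are left out on purpose: the range of a normal sample has no fixed theoretical value (it depends on the sample size), and the mode of the distribution simply equals its mean.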


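1. Mean and Median: In both samples the values are close to the theoretical mean of 3, which is also the median of a normal distribution. With the larger sample (10,000), the mean (2.978) and median (2.969) lie closer to 3 than with the smaller sample (3.349 and 3.346), as expected, since the standard error of the mean shrinks in proportion to 1/sqrt(n).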

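2. Mode: The mode is not informative here. The data are continuous, so every value is almost certainly unique, and stats.mode then returns the smallest value in the sample with a count of 1. The reported values (0.926 and -0.536) are therefore the sample minima, not estimates of the distribution's mode, which for a normal distribution equals the mean (3). Estimating a mode for continuous data requires binning or a density estimate.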

3. Quartiles (Q1 and Q3) and IQR: For a normal distribution with mean 3 and standard deviation 1, the theoretical values are Q1 ≈ 2.33, Q3 ≈ 3.67, and IQR ≈ 1.35. The larger sample matches these closely (2.303, 3.650, 1.347), while the smaller sample deviates more (2.605, 4.112, 1.507).

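4. Range: The range is wider for the larger sample (7.46 versus 4.81), not the smaller one. A larger sample has more chances to include values far out in the tails, so the range tends to grow with sample size instead of converging to a fixed value. This makes it a poor measure of spread for comparing samples of different sizes.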

5. Variance and Standard Deviation: Both are closer to the theoretical value of 1 in the larger sample (0.977 and 0.988) than in the smaller sample (1.074 and 1.036). np.var and np.std divide by n by default (ddof=0); the n - 1 versions (ddof=1) would differ only slightly at these sample sizes.

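6. Skewness: Skewness is close to zero in both samples (0.0216 and 0.0212), consistent with a symmetric distribution. In this run the two values are nearly identical, so the table does not show the smaller sample being more skewed. In general, though, sample skewness fluctuates more from run to run when the sample is small.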
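
Two of the points above, the behaviour of stats.mode on continuous data and the growth of the range with sample size, can be checked empirically. The sketch below is illustrative only: the seed, the 200 repetitions, and the helper sample_range are assumptions made for this example and are not part of the original analysis.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

def sample_range(n, mu=3, sigma=1):
    """Range (max - min) of one sample of size n drawn from N(mu, sigma)."""
    x = rng.normal(mu, sigma, n)
    return x.max() - x.min()

# 1) With continuous data the values are almost surely all distinct, so the
#    "mode" reported by stats.mode is the smallest value, with a count of 1.
x = rng.normal(3, 1, 100)
mode_result = stats.mode(x, keepdims=True)
mode_is_min = bool(mode_result.mode[0] == x.min())
mode_count = int(mode_result.count[0])
print("mode equals min:", mode_is_min, "| count:", mode_count)

# 2) Averaged over repeated samples, the range grows with the sample size.
mean_range_100 = np.mean([sample_range(100) for _ in range(200)])
mean_range_10000 = np.mean([sample_range(10_000) for _ in range(200)])
print(f"average range, n=100:    {mean_range_100:.2f}")
print(f"average range, n=10,000: {mean_range_10000:.2f}")
```

Averaging over many samples matters here: a single pair of samples, as in the table, shows the same direction but could in principle go either way by chance.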

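In conclusion, the larger sample (10,000) produces estimates closer to the theoretical values for the mean, median, quartiles, IQR, variance, and standard deviation, whereas the smaller sample (100) shows more sampling variability. The range is the exception, since it grows with sample size rather than converging, and the mode is not a meaningful statistic for continuous data such as these. A single pair of samples only illustrates this pattern; repeating the experiment would be needed to quantify it.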
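
One way to quantify the stability discussed above is to repeat the sampling many times and look at how much the sample mean varies between repetitions. This is a minimal sketch under assumed settings (500 repetitions, an arbitrary seed, the same mu=3 and sigma=1 as gauss()); the names results, observed, and predicted are hypothetical. It compares the observed spread of the sample means with the textbook standard error sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen arbitrarily for reproducibility
mu, sigma = 3, 1
n_repeats = 500  # number of independent samples per sample size (assumed)

results = {}
for n in (100, 10_000):
    # Each row is one independent sample of size n.
    samples = rng.normal(mu, sigma, size=(n_repeats, n))
    means = samples.mean(axis=1)
    observed = means.std(ddof=1)      # spread of the sample means
    predicted = sigma / np.sqrt(n)    # theoretical standard error of the mean
    results[n] = (observed, predicted)
    print(f"n={n:>6}: observed SD of means = {observed:.4f}, "
          f"sigma/sqrt(n) = {predicted:.4f}")
```

The same loop could be adapted to the median, the quartiles, or the skewness by replacing the call to mean with the corresponding statistic.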