Problem set 3: Statistical basics

Due: 11:59pm on Friday, September 20 by uploading to Brightspace

Your name:

1. Descriptive statistics and measures of variation. This section should be completed without generative AI, but with open access to static online resources like course material, stackoverflow, blogs with sample code, etc.

In [9]:
# Generate a sample dataset of at least 1,000 random numbers
import numpy as np
from scipy import stats
np.random.seed(42)
data = np.random.rand(1000)

# Calculate the mean, median, and mode of the dataset
mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data).mode
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")

# Calculate the range, variance, and standard deviation of the dataset
data_range = np.ptp(data)
variance = np.var(data)
std_dev = np.std(data)
print(f"Range: {data_range}, Variance: {variance}, Standard Deviation: {std_dev}")

# Perform a random sampling of 5% of the data
sample_size = int(0.05 * len(data))
random_sample = np.random.choice(data, size=sample_size, replace=False)

# Calculate the mean, median, and mode as well as the range, variance, and standard deviation of the sample
sample_mean = np.mean(random_sample)
sample_median = np.median(random_sample)
sample_mode = stats.mode(random_sample).mode
sample_range = np.ptp(random_sample)
sample_variance = np.var(random_sample)
sample_std_dev = np.std(random_sample)
print(f"Sample Mean: {sample_mean}, Sample Median: {sample_median}, Sample Mode: {sample_mode}")
print(f"Sample Range: {sample_range}, Sample Variance: {sample_variance}, Sample Standard Deviation: {sample_std_dev}")

# Compare the values from the random sample and the full dataset. Describe what you find
print("\nComparison between full dataset and sample: the mean, median, range, variance, and standard deviation of the sample are all close to those of the full dataset.\nThis is because a simple random sample is representative of the data it is drawn from. The mode differs because every value in continuous random data is unique, so the reported mode is just the smallest value, which changes from sample to sample.")


Mean: 0.4902565533201336, Median: 0.4968073765468109, Mode: 0.004632023004602859
Range: 0.9950856502815277, Variance: 0.08525889400739081, Standard Deviation: 0.291991256731072
Sample Mean: 0.4992356574741427, Sample Median: 0.4880753970047136, Sample Mode: 0.012154474689816341
Sample Range: 0.9734959794207844, Sample Variance: 0.08303729965895683, Sample Standard Deviation: 0.2881619330497296

Comparison between full dataset and sample: the mean, median, range, variance, and standard deviation of the sample are all close to those of the full dataset.
This is because a simple random sample is representative of the data it is drawn from. The mode differs because every value in continuous random data is unique, so the reported mode is just the smallest value, which changes from sample to sample.
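
The behavior of the mode on continuous data can be checked directly. The sketch below is an illustration, not part of the graded answer: it draws its own 1,000 uniform values (the seed and the use of `default_rng` are assumptions), confirms that no value repeats, and then estimates a binned mode with `np.histogram`. The choice of 10 bins is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
data = rng.random(1000)          # 1,000 uniform draws in [0, 1)

# Count how often each distinct value occurs.
values, counts = np.unique(data, return_counts=True)
all_unique = counts.max() == 1   # True when no value repeats

# With every count tied at 1, the "most frequent" value is decided by a
# tie-break; taking the first of the sorted unique values gives the minimum.
tie_break_mode = values[np.argmax(counts)]

# A binned mode is more stable: find the histogram bin holding the most points.
bin_counts, edges = np.histogram(data, bins=10, range=(0.0, 1.0))
top = np.argmax(bin_counts)

print(f"All values unique: {all_unique}")
print(f"Tie-break mode equals minimum: {tie_break_mode == data.min()}")
print(f"Modal bin: [{edges[top]:.1f}, {edges[top + 1]:.1f}) with {bin_counts[top]} points")
```

For uniform data even the modal bin is mostly noise, since every bin is equally likely; the binned approach is more informative for peaked distributions.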


2. Generative AI comparison. Use a reliable generative AI like ChatGPT, Claude, Bing, Jupyter AI, etc. to answer the same questions as Section 1. You should alter the prompt given to the AI to maximize the accuracy and efficiency of its output. Paste both your prompt and the generative AI's output below.

Generative AI prompt:

In [10]:
# Generative AI output
# Generate a sample dataset of at least 1,000 random numbers

# Calculate the mean, median, and mode of the dataset

# Calculate the range, variance, and standard deviation of the dataset

# Perform a random sampling of 5% of the data

# Calculate the mean, median, and mode as well as the range, variance, and standard deviation of the sample

# Compare the values from the random sample and the full dataset. Describe what you find


# --- Solution starts here ---
import numpy as np
import pandas as pd

# Reproducibility
np.random.seed(42)

# 1) Generate dataset (1,000+ random numbers in [0, 1))
N = 1000
data = np.random.rand(N)
s_data = pd.Series(data)

def stats(series: pd.Series, *, sample_variance=False):
    """Return a dict of summary stats for a numeric pandas Series.
       If sample_variance=True, use ddof=1; else ddof=0 (population)."""
    # Mode on continuous data is unstable; use pandas mode (may return multiple).
    modes = series.mode()
    mode_value = modes.iloc[0] if not modes.empty else np.nan
    ddof = 1 if sample_variance else 0
    mean = float(series.mean())
    median = float(series.median())
    variance = float(series.var(ddof=ddof))
    std = float(series.std(ddof=ddof))
    rng = float(series.max() - series.min())
    return {
        "mean": mean,
        "median": median,
        "mode": float(mode_value),
        "range": rng,
        "variance": variance,
        "std": std,
    }

def pretty(label, d):
    return (
        f"{label}\n"
        f"  Mean: {d['mean']}\n"
        f"  Median: {d['median']}\n"
        f"  Mode: {d['mode']}\n"
        f"  Range: {d['range']}\n"
        f"  Variance: {d['variance']}\n"
        f"  Standard Deviation: {d['std']}\n"
    )

# 2) Full dataset stats (population-style, ddof=0)
full = stats(s_data, sample_variance=False)

# 3) Random 5% sample (without replacement)
sample_idx_count = max(1, int(0.05 * N))
sample = s_data.sample(sample_idx_count, replace=False, random_state=42).reset_index(drop=True)

# 4) Sample stats (population-style for apples-to-apples comparison)
samp = stats(sample, sample_variance=False)

# Optionally, show sample-variance style too (ddof=1) if you want:
samp_samplevar = stats(sample, sample_variance=True)

# 5) Print results
print("Full dataset (population-style):")
print(pretty("", full))

print("Sample (population-style):")
print(pretty("", samp))

print("Sample (sample-variance style, ddof=1):")
print(pretty("", samp_samplevar))

# 6) Brief comparison
print("Comparison between full dataset and sample:")
print(
    f"- Means/medians are close (sample ≈ dataset), as expected for a random sample.\n"
    f"- Variability in the sample is often a bit smaller or larger just by chance.\n"
    f"- Mode differs a lot because for continuous uniforms the exact mode is unstable."
)


Full dataset (population-style):

  Mean: 0.4902565533201336
  Median: 0.4968073765468109
  Mode: 0.004632023004602859
  Range: 0.9950856502815277
  Variance: 0.08525889400739091
  Standard Deviation: 0.2919912567310722

Sample (population-style):

  Mean: 0.46579603848900497
  Median: 0.44387411030119067
  Mode: 0.018221825651549728
  Range: 0.9679889188280532
  Variance: 0.0930791846459737
  Standard Deviation: 0.3050888143573502

Sample (sample-variance style, ddof=1):

  Mean: 0.46579603848900497
  Median: 0.44387411030119067
  Mode: 0.018221825651549728
  Range: 0.9679889188280532
  Variance: 0.0949787598428303
  Standard Deviation: 0.30818624213749435

Comparison between full dataset and sample:
- Means/medians are close (sample ≈ dataset), as expected for a random sample.
- Variability in the sample is often a bit smaller or larger just by chance.
- Mode differs a lot because for continuous uniforms the exact mode is unstable.
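
The two sample variances printed above differ only by Bessel's correction: the `ddof=1` estimate equals the `ddof=0` estimate multiplied by n/(n-1). A quick check using the values reported above, assuming n = 50 (5% of 1,000 points):

```python
import math

n = 50                                   # 5% of the 1,000-point dataset
var_ddof0 = 0.0930791846459737           # "Sample (population-style)" variance above
var_ddof1_reported = 0.0949787598428303  # "Sample (sample-variance style)" variance above

# Bessel's correction rescales the population-style variance by n / (n - 1).
var_ddof1 = var_ddof0 * n / (n - 1)

print(f"Rescaled variance: {var_ddof1}")
print(f"Matches reported:  {math.isclose(var_ddof1, var_ddof1_reported, rel_tol=1e-9)}")
```

With n = 50 the correction is about 2%, which is why the two sets of sample statistics look nearly identical.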


3. Probability distributions and sampling. This section should be completed without generative AI, but with open access to static online resources like course material, stackoverflow, blogs with sample code, etc.

In [11]:
# Using the SciPy library, calculate the probability of a value between 45 and 60 in a normal distribution with mean 50 and std deviation 10
import numpy as np  # used below for sampling; repeated here so the cell runs on its own
from scipy.stats import norm

mean = 50
std_dev = 10
probability = norm.cdf(60, mean, std_dev) - norm.cdf(45, mean, std_dev)
print(f"Probability of a value between 45 and 60: {probability}")

# Simulate the Central Limit Theorem by repeatedly sampling and calculating sample means
sample_means = []
for _ in range(1000):
    sample = np.random.normal(mean, std_dev, 100)
    sample_means.append(np.mean(sample))

# Calculate the mean and standard deviation of the sample means
clt_mean = np.mean(sample_means)
clt_std_dev = np.std(sample_means)
print(f"CLT Mean: {clt_mean}, CLT Standard Deviation: {clt_std_dev}")

Probability of a value between 45 and 60: 0.532807207342556
CLT Mean: 50.01006968798436, CLT Standard Deviation: 1.0015993081187784
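
Both printed results can be cross-checked against theory with only the standard library. The sketch below is an independent check, not the method used in the cell above; the helper name `normal_cdf` is hypothetical. It writes the normal CDF in terms of the error function and computes the standard error σ/√n that the standard deviation of the sample means should approach.

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of Normal(mu, sigma), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 50.0, 10.0

# P(45 <= X <= 60) = Phi(1.0) - Phi(-0.5)
probability = normal_cdf(60.0, mu, sigma) - normal_cdf(45.0, mu, sigma)

# Standard error of the mean for samples of size 100: sigma / sqrt(n)
n = 100
standard_error = sigma / math.sqrt(n)

print(f"Probability between 45 and 60: {probability:.6f}")
print(f"Theoretical std of sample means: {standard_error}")
```

The simulated standard deviation of about 1.0016 sits close to the theoretical value of 1.0, as expected for 1,000 repeated samples.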


4. Generative AI comparison. Use a reliable generative AI like ChatGPT, Claude, Bing, Jupyter AI, etc. to answer the same questions as Section 3. You should alter the prompt given to the AI to maximize the accuracy and efficiency of its output. Paste both your prompt and the generative AI's output below.

In [12]:
# Using SciPy to compute a normal CDF probability, and simulating the CLT with simple code

from scipy.stats import norm
import numpy as np

# --- Part 1: SciPy probability for a Normal(μ=50, σ=10) between 45 and 60 ---
mu, sigma = 50, 10
a, b = 45, 60
p = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(f"Probability(45 ≤ X ≤ 60) for Normal(μ=50, σ=10): {p:.6f}")

# --- Part 2: Simulate the Central Limit Theorem (CLT) ---
# We'll sample from a non-normal distribution (Exponential with mean=50) 
# so the CLT effect is clear. The mean of Exponential(scale=50) is 50, std is also 50.
rng = np.random.default_rng(42)
scale = 50.0              # Exponential scale parameter (mean = scale)
n = 30                    # sample size per trial
trials = 10000            # number of repeated samples

# Draw (trials × n) samples, compute the mean of each row
samples = rng.exponential(scale=scale, size=(trials, n))
sample_means = samples.mean(axis=1)

# --- Part 3: Mean and standard deviation of the sample means ---
mean_of_means = sample_means.mean()
std_of_means = sample_means.std(ddof=1)  # unbiased estimate

# Theoretical comparison (for CLT): std of sample mean ≈ population std / sqrt(n)
theoretical_std = scale / np.sqrt(n)  # since std of Exp(scale) = scale

print(f"Mean of sample means (CLT): {mean_of_means:.4f}")
print(f"Std of sample means (CLT):  {std_of_means:.4f}")
print(f"Theoretical std (σ/√n):     {theoretical_std:.4f}")


Probability(45 ≤ X ≤ 60) for Normal(μ=50, σ=10): 0.532807
Mean of sample means (CLT): 49.9184
Std of sample means (CLT):  9.1271
Theoretical std (σ/√n):     9.1287
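
The σ/√n relationship can also be explored across several sample sizes. The following is a minimal sketch under assumed settings: the seed, the sample sizes (5, 30, 120), and the `results` dictionary are illustrative choices, not part of the generative AI output above.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, chosen arbitrarily
scale = 50.0                     # Exponential scale, so mean = std = 50
trials = 10_000                  # repeated samples per sample size

results = {}
for n in (5, 30, 120):           # illustrative sample sizes
    # Draw (trials x n) values and average each row to get one mean per trial.
    sample_means = rng.exponential(scale=scale, size=(trials, n)).mean(axis=1)
    simulated = sample_means.std(ddof=1)
    theoretical = scale / np.sqrt(n)
    results[n] = (simulated, theoretical)
    print(f"n={n:>3}: simulated std {simulated:.3f}, theoretical {theoretical:.3f}")
```

Quadrupling the sample size should roughly halve the spread of the sample means, even though the source distribution is strongly skewed.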


5. Generative AI evaluation. Compare the code you produced using static sources to that produced by the generative AI. Write 2-3 paragraphs evaluating the strengths and weaknesses of both approaches. In doing so, you should answer the following questions:
* Having now seen the code produced by Generative AI, how would you have written your code differently?
* Where does Generative AI excel, and where did it fall short?
* What was important in writing your prompt?

1) How I’d change my code  
- Use a non-normal source (e.g., exponential) to actually *show* the CLT.  
- Set a random seed for reproducibility.  
- Vectorize instead of looping for speed.  
- Print a quick theory check (σ/√n) next to the simulated std.
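
The vectorization and theory-check points can be sketched as follows, assuming the same Normal(50, 10) setup as Section 3 and an arbitrary seed. Both versions estimate the standard deviation of 1,000 sample means of size 100; the second draws all samples in a single call.

```python
import numpy as np

mu, sigma, n, trials = 50.0, 10.0, 100, 1000
rng = np.random.default_rng(42)  # assumed seed

# Loop version: one sample per iteration.
loop_means = [rng.normal(mu, sigma, n).mean() for _ in range(trials)]
loop_std = np.std(loop_means, ddof=1)

# Vectorized version: draw a (trials x n) array and average each row.
vec_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
vec_std = vec_means.std(ddof=1)

theoretical_std = sigma / np.sqrt(n)  # sigma / sqrt(n) = 1.0
print(f"Loop: {loop_std:.4f}  Vectorized: {vec_std:.4f}  Theory: {theoretical_std:.4f}")
```

The two estimates use different random draws, so they agree with each other and with the theoretical value only approximately.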

2) Where AI excelled vs. fell short  
- Excelled: clear structure, good comments and a useful theory comparison.  
- Fell short: a bit heavy for a simple homework assignment, with more explanation than the task needed.

3) What mattered in the prompt  
- Being specific about the distribution, sample size, trials, and whether I want a theory check.  
- Mentioning reproducibility (a fixed seed) and how simple the code should be.  
- Saying exactly what outputs I expect (probability, std of means).
