# Generating Synthetic Data with SciPy

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate now create a similar notebook using scipy

This notebook walks through generating synthetic data that resembles real-world observations, using the SciPy library in Python. It assumes foundational Python programming skills. The setup section covers configuring a Jupyter Notebook environment and installing the required packages, NumPy and SciPy, and the section after it imports them.

The notebook then discusses the main ways to generate synthetic data: sampling from the statistical distributions in `scipy.stats` and adding noise to a signal. A worked example draws random numbers from a chosen distribution, with code for each step. Later sections save the generated data to CSV and JSON files and visualize it with Matplotlib. For advanced users, the final section builds custom generators by combining or transforming existing SciPy functions.

## Setting Up Your Environment

```python
# Setting Up Your Environment
# ==============================================
# Ensure you have Jupyter Notebook installed.
# You can install it from a terminal using pip: pip install notebook

import importlib
import subprocess
import sys

# Check that SciPy is installed, and install it if it is missing
try:
    import scipy
except ImportError:
    print("SciPy not found. Installing now...")
    # Install into the environment of the running kernel
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scipy"])
    importlib.invalidate_caches()  # let the import system see the new package
    import scipy
    print("SciPy has been successfully installed.")

print("SciPy version:", scipy.__version__)

# Import the SciPy submodules only after the check above has passed
from scipy import constants, stats, optimize, integrate

# Displaying the setup message
print("\nYour Jupyter Notebook environment is set up to use SciPy for generating data similar to real-world observations.\n")
```
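Before moving on, it can help to record which package versions are actually installed, since results may differ between releases. The following is a minimal sketch using the standard-library `importlib.metadata` module. The list of package names is an assumption based on the libraries this notebook imports.

```python
from importlib import metadata

# Packages used in this notebook (assumed list; names as published on PyPI)
packages = ["numpy", "scipy", "matplotlib"]

versions = {}
for name in packages:
    try:
        versions[name] = metadata.version(name)
    except metadata.PackageNotFoundError:
        versions[name] = None  # not installed in the current environment

for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Unlike importing each package, this lookup does not fail when one of them is missing, so the report always covers the whole list.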

## Importing Required Libraries

```python
# Importing required libraries
import numpy as np  # Numerical operations and array manipulation
import scipy.stats as stats  # Probability distributions and statistical functions
```

- NumPy is essential for numerical computing in Python. It supports large multi-dimensional arrays and matrices and provides a wide range of high-level mathematical functions that operate on them.
- SciPy extends NumPy with more specialized functions for optimization, integration, interpolation, signal processing, linear algebra, and image manipulation, among other tasks. This notebook relies mainly on its submodule `scipy.stats`, which contains a large number of probability distributions that can be used to generate synthetic data similar to real-world observations.
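The distributions in `scipy.stats` share a common set of methods, which is what makes the submodule convenient for data generation. The sketch below uses a normal distribution with mean 10 and standard deviation 2 (illustrative values, not taken from this notebook) to show drawing samples with `rvs`, evaluating the density with `pdf`, and converting between values and probabilities with `cdf` and `ppf`.

```python
import numpy as np
from scipy import stats

# Freeze a distribution so its parameters are fixed once (values are illustrative)
dist = stats.norm(loc=10, scale=2)

# A local random generator keeps the draws reproducible
rng = np.random.default_rng(42)
samples = dist.rvs(size=5, random_state=rng)

density_at_mean = dist.pdf(10)  # height of the density curve at x = 10
prob_below_12 = dist.cdf(12)    # P(X <= 12)
median = dist.ppf(0.5)          # value below which half of the probability lies

print("Samples:", samples)
print("Density at the mean:", density_at_mean)
print("P(X <= 12):", prob_below_12)
print("Median:", median)
```

The same four calls work for other continuous distributions such as `stats.uniform` or `stats.expon`. Only the parameters passed when freezing the distribution change.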

## Generating Synthetic Data with SciPy

```python
# Import necessary libraries
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Set the random seed for reproducibility
# (scipy.stats draws from NumPy's global random state by default)
np.random.seed(0)

print("Generating synthetic data using statistical distributions and noise models.")

# Method 1: Using a Normal Distribution
mean = 0
std_dev = 1
size = 1000
data_normal = stats.norm.rvs(loc=mean, scale=std_dev, size=size)
plt.figure(figsize=(8, 4))
plt.hist(data_normal, bins=30, density=True, alpha=0.65, label='Normal Distribution')
plt.title('Histogram of Normal Distribution Data')
plt.xlabel('Value')
plt.ylabel('Probability Density')  # density=True normalizes the histogram
plt.legend()
plt.show()

# Method 2: Using a Uniform Distribution
low = 0
high = 1
# scipy.stats.uniform covers the interval [loc, loc + scale]
data_uniform = stats.uniform.rvs(loc=low, scale=high - low, size=size)
plt.figure(figsize=(8, 4))
plt.hist(data_uniform, bins=30, density=True, alpha=0.65, label='Uniform Distribution')
plt.title('Histogram of Uniform Distribution Data')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Method 3: Adding Noise to a Signal (e.g., using a Gaussian Noise Model)
signal = np.linspace(0, 10, size)
noise_std = 1
data_noisy = signal + stats.norm.rvs(loc=0, scale=noise_std, size=size)
plt.figure(figsize=(8, 4))
plt.scatter(range(size), data_noisy, color='red', s=5, label='Noisy Signal')
plt.plot(signal, color='black', label='Signal')  # drawn last so it stays visible
plt.title('Signal with Gaussian Noise')
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend()
plt.show()

# Method 4: Using a Poisson Distribution for Count Data
lambda_val = 5
data_poisson = stats.poisson.rvs(mu=lambda_val, size=size)
# Center one bin on each integer count so that no two counts share a bin
bins = np.arange(data_poisson.max() + 2) - 0.5
plt.figure(figsize=(8, 4))
plt.hist(data_poisson, bins=bins, density=True, alpha=0.65, label='Poisson Distribution')
plt.title('Histogram of Poisson Distribution Data')
plt.xlabel('Value')
plt.ylabel('Probability')  # bins have width 1, so bar height equals probability
plt.legend()
plt.show()
```
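Seeding with `np.random.seed` sets global state, so any other code that draws random numbers in between changes the result. One alternative is to create a local generator with `np.random.default_rng` and pass it to SciPy through the `random_state` argument. The sketch below applies this to the noisy-signal method. The helper name `make_noisy_signal` is hypothetical.

```python
import numpy as np
from scipy import stats

def make_noisy_signal(size, noise_std, seed):
    """Return a linear signal plus Gaussian noise drawn from a local generator."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    signal = np.linspace(0, 10, size)
    noise = stats.norm.rvs(loc=0, scale=noise_std, size=size, random_state=rng)
    return signal + noise

first = make_noisy_signal(size=1000, noise_std=1, seed=0)
second = make_noisy_signal(size=1000, noise_std=1, seed=0)
other = make_noisy_signal(size=1000, noise_std=1, seed=1)

print("Same seed gives identical data:", np.array_equal(first, second))
print("Different seed gives different data:", not np.array_equal(first, other))
```

Because each call builds its own generator, the output depends only on the arguments and not on what ran earlier in the notebook.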

## Example: Generating Random Numbers from a Distribution

```python
# Import necessary libraries
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

print("Generating random numbers from predefined distributions in SciPy")

# Step 1: Choose a distribution
distribution = 'normal'  # Supported distributions: 'normal', 'uniform', 'exponential'

# Step 2: Define parameters for the chosen distribution
if distribution == 'normal':
    mu, sigma = 0, 1  # Parameters for the normal distribution: mean and standard deviation
elif distribution == 'uniform':
    low, high = 0, 1  # Parameters for the uniform distribution: lower and upper bounds
elif distribution == 'exponential':
    scale = 1  # Parameter for the exponential distribution: scale (1 / rate)
else:
    raise ValueError("Unsupported distribution. Choose 'normal', 'uniform', or 'exponential'.")

# Step 3: Generate random numbers from the chosen distribution
num_samples = 1000  # Number of samples to generate
if distribution == 'normal':
    random_numbers = stats.norm.rvs(loc=mu, scale=sigma, size=num_samples)
elif distribution == 'uniform':
    random_numbers = stats.uniform.rvs(loc=low, scale=high - low, size=num_samples)
elif distribution == 'exponential':
    random_numbers = stats.expon.rvs(scale=scale, size=num_samples)

# Step 4: Plotting the generated random numbers
# (figure creation, labels, and plt.show() stay in one cell so they act on the same figure)
plt.figure(figsize=(10, 6))
if distribution == 'normal':
    plt.hist(random_numbers, bins=30, density=True, alpha=0.65, color='g', label='Generated samples')
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, num=100)
    plt.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma), 'r-', lw=2, label='Normal distribution')
elif distribution == 'uniform':
    plt.hist(random_numbers, bins=30, density=True, alpha=0.65, color='b', label='Generated samples')
    plt.ylim([0, 1.5 / (high - low)])  # headroom above the true density 1 / (high - low)
    plt.axvline(x=low, color='r', linestyle='--', label='Lower bound')
    plt.axvline(x=high, color='g', linestyle='--', label='Upper bound')
elif distribution == 'exponential':
    plt.hist(random_numbers, bins=30, density=True, alpha=0.65, color='m', label='Generated samples')
    x = np.linspace(0, random_numbers.max(), num=100)  # mu and sigma are not defined in this branch
    plt.plot(x, stats.expon.pdf(x, scale=scale), 'c-', lw=2, label='Exponential distribution')

# Adding labels and title
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title(f'Histogram of Generated Random Numbers ({distribution} distribution)')
plt.legend()
plt.grid(True)

# Display the plot
plt.show()

print("Random numbers generated successfully!")
```
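The repeated `if`/`elif` chains in the steps above can be condensed by storing one frozen distribution per name and looking it up when sampling. The following is a minimal sketch of that idea, not a replacement for the example itself. The names `DISTRIBUTIONS` and `draw_samples` are hypothetical, and the parameters mirror the defaults used in Step 2.

```python
import numpy as np
from scipy import stats

# Hypothetical lookup table: one frozen distribution per supported name
DISTRIBUTIONS = {
    'normal': stats.norm(loc=0, scale=1),
    'uniform': stats.uniform(loc=0, scale=1),  # interval [0, 1]
    'exponential': stats.expon(scale=1),
}

def draw_samples(name, num_samples, seed=None):
    """Draw num_samples values from the named distribution."""
    if name not in DISTRIBUTIONS:
        raise ValueError(
            f"Unsupported distribution {name!r}. Choose from {sorted(DISTRIBUTIONS)}."
        )
    rng = np.random.default_rng(seed)
    return DISTRIBUTIONS[name].rvs(size=num_samples, random_state=rng)

for name in DISTRIBUTIONS:
    values = draw_samples(name, num_samples=1000, seed=0)
    print(f"{name}: mean={values.mean():.3f}, std={values.std():.3f}")
```

Adding a distribution then means adding one dictionary entry, and the error message lists the supported names automatically.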

## Saving and Visualizing Generated Data

```python
import json

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Generate synthetic data using SciPy
mean = 0
std_dev = 1
num_samples = 1000
generated_data = norm.rvs(loc=mean, scale=std_dev, size=num_samples)

# Save generated data to a CSV file (one value per line)
csv_file_path = 'generated_data.csv'
np.savetxt(csv_file_path, generated_data, delimiter=',')
print(f"Generated data saved to {csv_file_path}")

# Save generated data to a JSON file
json_file_path = 'generated_data.json'
data_dict = {'data': generated_data.tolist()}  # NumPy arrays are not JSON serializable
with open(json_file_path, 'w') as f:
    json.dump(data_dict, f)
print(f"Generated data saved to {json_file_path}")

# Visualize the generated data using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(generated_data, bins=30, density=True, alpha=0.65, color='g', label='Generated Data')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, loc=mean, scale=std_dev)
plt.plot(x, p, 'k', linewidth=2, label='Normal Distribution')
plt.title('Histogram of Generated Data')
plt.xlabel('Value')
plt.ylabel('Probability Density')  # density=True normalizes the histogram
plt.legend()
plt.show()

# Basic data validation
mean_generated = np.mean(generated_data)
std_dev_generated = np.std(generated_data)
print(f"Mean of generated data: {mean_generated}")
print(f"Standard deviation of generated data: {std_dev_generated}")

# Check if the mean and standard deviation are close to the values used for generation
expected_mean = mean
expected_std_dev = std_dev
tolerance = 0.1
if abs(mean_generated - expected_mean) < tolerance and abs(std_dev_generated - expected_std_dev) < tolerance:
    print("Data validation passed: Mean and standard deviation are within the acceptable tolerance.")
else:
    print("Data validation failed: Mean or standard deviation is out of tolerance.")
```
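Comparing only the mean and standard deviation cannot tell whether the samples have the right shape, since many different distributions share those two values. One common complement is the Kolmogorov-Smirnov test, available as `scipy.stats.kstest`, which compares the whole empirical distribution with a reference distribution. The sketch below regenerates its own samples with a fixed seed so that it stands alone. The significance level of 0.05 is a conventional choice and an assumption here.

```python
import numpy as np
from scipy import stats

# Regenerate samples with a local, seeded generator
rng = np.random.default_rng(0)
samples = stats.norm.rvs(loc=0, scale=1, size=1000, random_state=rng)

# Compare the empirical distribution of the samples with N(0, 1)
result = stats.kstest(samples, 'norm', args=(0, 1))
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")

alpha = 0.05  # conventional significance level (an assumption, adjust as needed)
if result.pvalue > alpha:
    print("No evidence against the normal distribution at this level.")
else:
    print("The samples differ detectably from the normal distribution.")
```

A correctly generated sample still fails this test with probability `alpha`, so a single failure is a prompt to look closer rather than proof of a bug.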

## Advanced Topics: Custom Data Generation

```python
# Import necessary libraries
import numpy as np
from scipy.stats import multivariate_normal, expon, uniform

# Define a function to generate custom data based on a mixture of Gaussians
def generate_mixture_of_gaussians(num_samples, means, covs, weights):
    """
    Generates data from a mixture of Gaussian distributions.

    Parameters:
        num_samples (int): Number of samples to generate.
        means (list or array-like): Means of the Gaussian components.
        covs (list or array-like): Covariance matrices of the Gaussian components.
        weights (list or array-like): Weights for each Gaussian component (must sum to 1).

    Returns:
        np.ndarray: Generated data samples, one row per sample.
    """
    if len(means) != len(covs) or len(means) != len(weights):
        raise ValueError("The number of means, covariances, and weights must be the same.")
    if not np.isclose(np.sum(weights), 1.0):
        raise ValueError("The weights must sum to 1.")

    num_components = len(means)
    data = np.zeros((num_samples, len(means[0])))

    # Pick the component that each sample is drawn from
    component_ids = np.random.choice(num_components, size=num_samples, p=weights)

    # Draw all samples that belong to one component in a single call
    for comp in range(num_components):
        mask = component_ids == comp
        count = int(mask.sum())
        if count > 0:
            samples = multivariate_normal.rvs(mean=means[comp], cov=covs[comp], size=count)
            data[mask] = np.reshape(samples, (count, -1))

    return data

# Example usage: Generate 1000 samples from a mixture of two Gaussians
num_samples = 1000
means = [np.array([0, 0]), np.array([5, 5])]
covs = [np.eye(2), np.eye(2)]
weights = [0.4, 0.6]

mixture_data = generate_mixture_of_gaussians(num_samples, means, covs, weights)
print("Generated Mixture of Gaussians Data:\n", mixture_data[:10])

# Define a function to generate exponential data by inverse transform sampling
def generate_exponential_inverse_transform(num_samples, rate):
    """
    Generates exponentially distributed data by inverse transform sampling:
    uniform random numbers on [0, 1) are mapped through the inverse CDF
    (percent point function) of the exponential distribution.

    Parameters:
        num_samples (int): Number of samples to generate.
        rate (float): Rate parameter for the exponential distribution.

    Returns:
        np.ndarray: Generated data samples.
    """
    if rate <= 0:
        raise ValueError("Rate must be positive.")

    # Generate uniform random numbers on [0, 1)
    uniform_samples = uniform.rvs(size=num_samples)

    # Map them through the inverse CDF; SciPy parameterizes expon by scale = 1 / rate
    exponential_data = expon.ppf(uniform_samples, scale=1/rate)

    return exponential_data

# Example usage: Generate 500 samples from an exponential distribution with rate 2
num_samples = 500
rate = 2
exponential_data = generate_exponential_inverse_transform(num_samples, rate)
print("\nGenerated Exponential Data (inverse transform sampling):\n", exponential_data[:10])
```
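The second function above maps uniform random numbers through `expon.ppf`, a technique known as inverse transform sampling. One reason to write it out by hand rather than call `expon.rvs` is that the uniform draws can be restricted first, which yields samples from a truncated distribution. The sketch below illustrates this under stated assumptions: the function name `sample_truncated_exponential` and the bounds 0.5 and 2.0 are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import expon

def sample_truncated_exponential(num_samples, rate, lower, upper, seed=None):
    """Draw exponential samples restricted to [lower, upper] by inverse transform."""
    if rate <= 0:
        raise ValueError("Rate must be positive.")
    if not 0 <= lower < upper:
        raise ValueError("Bounds must satisfy 0 <= lower < upper.")

    rng = np.random.default_rng(seed)
    dist = expon(scale=1 / rate)  # SciPy parameterizes expon by scale = 1 / rate

    # Restrict the uniform draws to the CDF values of the two bounds
    u = rng.uniform(dist.cdf(lower), dist.cdf(upper), size=num_samples)

    # Map the restricted uniform draws back through the inverse CDF
    return dist.ppf(u)

truncated = sample_truncated_exponential(500, rate=2, lower=0.5, upper=2.0, seed=0)
print("Smallest sample:", truncated.min())
print("Largest sample:", truncated.max())
```

The same pattern applies to any `scipy.stats` distribution that provides `cdf` and `ppf`, which makes it a compact way to build custom generators from existing ones.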