# CPU-Only Data Analysis Example

This notebook walks through a small data analysis workflow that runs entirely on the CPU.
It uses pandas, NumPy, matplotlib, seaborn, and scikit-learn, none of which require a GPU.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Generate sample data for demonstration
np.random.seed(42)
n_samples = 1000

# Create synthetic dataset
data = {
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(5, 2, n_samples),
    'feature3': np.random.exponential(2, n_samples)
}

# Create target variable with some relationship to features
data['target'] = (data['feature1'] * 2 + 
                 data['feature2'] * 0.5 + 
                 data['feature3'] * 0.1 + 
                 np.random.normal(0, 0.5, n_samples))

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# Basic statistical analysis
print("Dataset Statistics:")
print(df.describe())

print("\nCorrelation Matrix:")
correlation_matrix = df.corr()
print(correlation_matrix)
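
`df.corr()` uses the Pearson coefficient by default. The sketch below recomputes one entry of the matrix by hand to show what that number means. It rebuilds the dataset with the same recipe and seed so that it runs on its own. The helper name `pearson` is introduced here for illustration and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset with the same recipe and seed as the notebook.
np.random.seed(42)
n_samples = 1000
data = {
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(5, 2, n_samples),
    'feature3': np.random.exponential(2, n_samples),
}
data['target'] = (data['feature1'] * 2
                  + data['feature2'] * 0.5
                  + data['feature3'] * 0.1
                  + np.random.normal(0, 0.5, n_samples))
df = pd.DataFrame(data)


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


manual_r = pearson(df['feature1'].to_numpy(), df['target'].to_numpy())
pandas_r = df.corr().loc['feature1', 'target']

print(f"Manual Pearson r: {manual_r:.6f}")
print(f"pandas corr():    {pandas_r:.6f}")
```

The two values should agree up to floating-point rounding. `feature1` is expected to show the strongest correlation with the target because it carries the largest weight in the formula that generates `target`.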

In [None]:
# Data visualization
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution plots. Series.hist forwards extra keywords to matplotlib's
# hist(), which has no `title` argument, so set each title on the Axes instead.
columns = ['feature1', 'feature2', 'feature3', 'target']
titles = ['Feature 1 Distribution', 'Feature 2 Distribution',
          'Feature 3 Distribution', 'Target Distribution']
for ax, column, title in zip(axes.ravel(), columns, titles):
    df[column].hist(bins=30, ax=ax)
    ax.set_title(title)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
# Simple machine learning with scikit-learn (CPU-optimized)
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
print(f"\nModel Coefficients: {model.coef_}")
print(f"Model Intercept: {model.intercept_:.4f}")
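
Because the target is generated from known weights (2, 0.5, and 0.1) plus Gaussian noise with standard deviation 0.5, the fitted model can be checked against that ground truth. The sketch below repeats the data generation and fit so that it runs on its own. The names `true_coefs` and `noise_std` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Ground truth used by the notebook's data-generating formula.
true_coefs = np.array([2.0, 0.5, 0.1])
noise_std = 0.5

# Rebuild the synthetic dataset with the same recipe and seed as the notebook.
np.random.seed(42)
n_samples = 1000
data = {
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(5, 2, n_samples),
    'feature3': np.random.exponential(2, n_samples),
}
data['target'] = (data['feature1'] * true_coefs[0]
                  + data['feature2'] * true_coefs[1]
                  + data['feature3'] * true_coefs[2]
                  + np.random.normal(0, noise_std, n_samples))
df = pd.DataFrame(data)

X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))

# Compare each learned weight with the weight used to generate the data.
for name, learned, true in zip(X.columns, model.coef_, true_coefs):
    print(f"{name}: learned={learned:.3f}, true={true:.3f}")

# The noise variance is the error no model can remove on this data.
print(f"Test MSE: {mse:.4f} (noise variance: {noise_std ** 2:.4f})")
```

The learned coefficients should land close to the true weights, and the test MSE should sit near the noise variance of 0.25, since that is the part of the target that no linear model can explain.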

In [None]:
# Prediction vs actual plot
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

## Summary

This notebook demonstrates a typical CPU-only data analysis workflow:

1. **Data Generation**: Building a synthetic dataset with NumPy and loading it into a pandas DataFrame
2. **Statistical Analysis**: Computing descriptive statistics and correlations
3. **Data Visualization**: Creating plots with matplotlib and seaborn
4. **Machine Learning**: Training a simple model with scikit-learn

All of these operations run on the CPU and do not require GPU acceleration.
At this dataset size (1,000 samples) and model complexity (linear regression), each step should complete quickly on typical hardware.
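
A rough way to check that this workload is light enough for a CPU is to time the model fit. The sketch below is a simplified stand-in, not the notebook's exact data: it assumes three standard-normal features drawn from a separate generator, and the measured time will vary with hardware and library versions.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Assumption: simplified standard-normal features, not the notebook's exact data.
rng = np.random.default_rng(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 3))
y = X @ np.array([2.0, 0.5, 0.1]) + rng.normal(0, 0.5, n_samples)

# Time only the fit, which is the most compute-heavy step in this workflow.
start = time.perf_counter()
model = LinearRegression().fit(X, y)
elapsed = time.perf_counter() - start

print(f"Fit time for {n_samples:,} samples: {elapsed * 1000:.2f} ms")
```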