# Creating a Real Data Analysis Script

Now that you've learned NumPy, Pandas, and data visualization, let's put it all together and create a **real, reusable data analysis script** that you can run from the command line or share with others.

Packaging an analysis this way is standard practice once it needs to be rerun, scheduled, or handed to someone else.

## Why Create Scripts Instead of Just Notebooks?

### Advantages of Python Scripts:
1. **Reusability**: Run the same analysis on different datasets
2. **Automation**: Schedule scripts to run automatically
3. **Sharing**: Easy to share with colleagues who don't use Jupyter
4. **Version Control**: Better for Git and collaboration
5. **Production Ready**: Can be integrated into larger systems
6. **Command Line Arguments**: Make scripts flexible with parameters

### When to Use Scripts vs Notebooks:
- **Notebooks**: Exploration, learning, presenting results
- **Scripts**: Automation, production, repeated analysis

## Project: Sales Data Analysis Script

We'll create a complete data analysis script that:
1. Loads data from a CSV file
2. Cleans and processes the data
3. Performs statistical analysis
4. Generates visualizations
5. Exports results to files
6. Accepts command-line arguments

This mimics real-world data analysis workflows!

## Step 1: Prepare Sample Data

First, let's create sample sales data to work with:

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Generate sample sales data
np.random.seed(42)

# Create dates for 6 months
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=x) for x in range(180)]

# Create sample data
n_records = 500
data = {
    'date': np.random.choice(dates, n_records),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'], n_records),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_records),
    'salesperson': np.random.choice(['Alice', 'Bob', 'Charlie', 'Diana', 'Edward'], n_records),
    'quantity': np.random.randint(1, 10, n_records),
    'unit_price': np.random.choice([299, 599, 799, 999, 1299], n_records)
}

df = pd.DataFrame(data)
df['total_sales'] = df['quantity'] * df['unit_price']

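# Add some missing values to make it realistic
# (replace=False guarantees exactly 10 and 5 distinct rows are affected)
df.loc[np.random.choice(df.index, 10, replace=False), 'quantity'] = np.nan
df.loc[np.random.choice(df.index, 5, replace=False), 'region'] = np.nan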

# Save to CSV
df.to_csv('sales_data.csv', index=False)
print("Sample data created: sales_data.csv")
print("\nFirst 5 rows:")
print(df.head())

## Step 2: Create the Analysis Script

Now let's create a complete Python script that analyzes this data. We'll write it in a code cell first, then save it as a `.py` file.

In [None]:
# This is the complete analysis script
# We'll save this as 'sales_analysis.py'

script_content = '''
"""
Sales Data Analysis Script
--------------------------
Analyzes sales data and generates reports and visualizations.

Usage:
    python sales_analysis.py sales_data.csv
    python sales_analysis.py sales_data.csv --output-dir results
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import argparse
import os
from datetime import datetime

# Set style for visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)


def load_data(filepath):
    """Load sales data from CSV file."""
    print(f"Loading data from {filepath}...")
    df = pd.read_csv(filepath, parse_dates=['date'])
    print(f"Loaded {len(df)} records")
    return df


def clean_data(df):
    """Clean and prepare data for analysis."""
    print("\\nCleaning data...")
    
    # Store original count
    original_count = len(df)
    
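    # Handle missing values
    # (.copy() gives an independent DataFrame, so the column assignments
    # below don't trigger pandas' SettingWithCopyWarning)
    df = df.dropna(subset=['quantity', 'unit_price']).copy()
    df['region'] = df['region'].fillna('Unknown')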
    
    # Remove duplicates
    df = df.drop_duplicates()
    
    # Recalculate total_sales to ensure consistency
    df['total_sales'] = df['quantity'] * df['unit_price']
    
    # Add derived columns
    df['month'] = df['date'].dt.to_period('M')
    df['day_of_week'] = df['date'].dt.day_name()
    
    removed = original_count - len(df)
    print(f"Removed {removed} incomplete or duplicate records")
    print(f"Clean dataset: {len(df)} records")
    
    return df


def analyze_data(df):
    """Perform statistical analysis on the data."""
    print("\\n" + "="*50)
    print("SALES ANALYSIS REPORT")
    print("="*50)
    
    # Overall statistics
    print("\\n1. OVERALL STATISTICS")
    print("-" * 50)
    print(f"Total Revenue: ${df['total_sales'].sum():,.2f}")
    print(f"Average Order Value: ${df['total_sales'].mean():,.2f}")
    print(f"Total Units Sold: {df['quantity'].sum():,.0f}")
    print(f"Number of Transactions: {len(df)}")
    
    # Product analysis
    print("\\n2. PRODUCT PERFORMANCE")
    print("-" * 50)
    product_stats = df.groupby('product').agg({
        'total_sales': 'sum',
        'quantity': 'sum'
    }).sort_values('total_sales', ascending=False)
    print(product_stats)
    
    # Regional analysis
    print("\\n3. REGIONAL PERFORMANCE")
    print("-" * 50)
    region_stats = df.groupby('region').agg({
        'total_sales': ['sum', 'mean', 'count']
    }).round(2)
    print(region_stats)
    
    # Salesperson performance
    print("\\n4. SALESPERSON PERFORMANCE")
    print("-" * 50)
    sales_by_person = df.groupby('salesperson')['total_sales'].sum().sort_values(ascending=False)
    print(sales_by_person)
    
    # Monthly trends
    print("\\n5. MONTHLY TRENDS")
    print("-" * 50)
    monthly_sales = df.groupby('month')['total_sales'].sum()
    print(monthly_sales)
    
    return {
        'product_stats': product_stats,
        'region_stats': region_stats,
        'sales_by_person': sales_by_person,
        'monthly_sales': monthly_sales
    }


def create_visualizations(df, stats, output_dir='outputs'):
    """Create and save visualizations."""
    print(f"\\nCreating visualizations in '{output_dir}' directory...")
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # 1. Sales by Product
    plt.figure(figsize=(12, 6))
    stats['product_stats']['total_sales'].plot(kind='bar', color='skyblue')
    plt.title('Total Sales by Product', fontsize=16, fontweight='bold')
    plt.xlabel('Product')
    plt.ylabel('Total Sales ($)')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig(f'{output_dir}/sales_by_product.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # 2. Sales by Region
    plt.figure(figsize=(10, 6))
    region_sales = df.groupby('region')['total_sales'].sum()
    plt.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=90)
    plt.title('Sales Distribution by Region', fontsize=16, fontweight='bold')
    plt.savefig(f'{output_dir}/sales_by_region.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # 3. Monthly Sales Trend
    plt.figure(figsize=(14, 6))
    monthly_data = df.groupby('month')['total_sales'].sum()
    monthly_data.plot(kind='line', marker='o', linewidth=2, markersize=8)
    plt.title('Monthly Sales Trend', fontsize=16, fontweight='bold')
    plt.xlabel('Month')
    plt.ylabel('Total Sales ($)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(f'{output_dir}/monthly_trend.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # 4. Salesperson Performance
    plt.figure(figsize=(12, 6))
    stats['sales_by_person'].plot(kind='barh', color='coral')
    plt.title('Sales Performance by Salesperson', fontsize=16, fontweight='bold')
    plt.xlabel('Total Sales ($)')
    plt.ylabel('Salesperson')
    plt.tight_layout()
    plt.savefig(f'{output_dir}/salesperson_performance.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    # 5. Heatmap: Product vs Region
    plt.figure(figsize=(10, 6))
    pivot_table = df.pivot_table(values='total_sales', index='product', 
                                  columns='region', aggfunc='sum', fill_value=0)
    sns.heatmap(pivot_table, annot=True, fmt='.0f', cmap='YlOrRd', cbar_kws={'label': 'Total Sales ($)'})
    plt.title('Sales Heatmap: Product vs Region', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.savefig(f'{output_dir}/product_region_heatmap.png', dpi=300, bbox_inches='tight')
    plt.close()
    
    print(f"✓ Saved 5 visualizations to '{output_dir}' directory")


def export_results(df, stats, output_dir='outputs'):
    """Export analysis results to CSV files."""
    print(f"\\nExporting results to '{output_dir}' directory...")
    
    os.makedirs(output_dir, exist_ok=True)
    
    # Export summary statistics
    stats['product_stats'].to_csv(f'{output_dir}/product_summary.csv')
    stats['sales_by_person'].to_csv(f'{output_dir}/salesperson_summary.csv')
    stats['monthly_sales'].to_csv(f'{output_dir}/monthly_summary.csv')
    
    # Export cleaned data
    df.to_csv(f'{output_dir}/cleaned_data.csv', index=False)
    
    print(f"✓ Exported 4 CSV files to '{output_dir}' directory")


def main():
    """Main function to run the analysis."""
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description='Analyze sales data and generate reports')
    parser.add_argument('input_file', help='Path to input CSV file')
    parser.add_argument('--output-dir', default='outputs', help='Output directory for results')
    
    args = parser.parse_args()
    
    # Run analysis pipeline
    print("\\n" + "="*50)
    print("SALES DATA ANALYSIS SCRIPT")
    print("="*50)
    
    # Load data
    df = load_data(args.input_file)
    
    # Clean data
    df = clean_data(df)
    
    # Analyze data
    stats = analyze_data(df)
    
    # Create visualizations
    create_visualizations(df, stats, args.output_dir)
    
    # Export results
    export_results(df, stats, args.output_dir)
    
    print("\\n" + "="*50)
    print("ANALYSIS COMPLETE!")
    print("="*50)
    print(f"\\nAll results saved to: {args.output_dir}/")


if __name__ == "__main__":
    main()
'''

# Save the script to a file
# The script contains non-ASCII characters (✓), so write it as UTF-8 explicitly
with open('sales_analysis.py', 'w', encoding='utf-8') as f:
    f.write(script_content)

print("✓ Created sales_analysis.py")
print("\nYou can now run this script from the command line!")

## Step 3: Run the Script

Now let's run our script! There are several ways to do this:

In [None]:
# Run the script from within Jupyter
import subprocess
import sys

# Method 1: Run with default output directory
result = subprocess.run([sys.executable, 'sales_analysis.py', 'sales_data.csv'], 
                       capture_output=True, text=True)
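print(result.stdout)
if result.returncode != 0:
    # A non-zero exit code means the script itself failed
    print("Script failed:\n", result.stderr)
elif result.stderr:
    # Libraries often write warnings to stderr even when the run succeeds
    print("Warnings:", result.stderr)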

## Running from Command Line

You can also run this script directly from your terminal:

```bash
# Basic usage
python sales_analysis.py sales_data.csv

# Specify custom output directory
python sales_analysis.py sales_data.csv --output-dir my_results

# Get help
python sales_analysis.py --help
```

This is how data analysts share and run analysis scripts in real projects!

## Step 4: View the Results

Let's check what our script generated:

In [None]:
import os
from IPython.display import Image, display

# List all generated files
print("Generated files in 'outputs' directory:")
print("-" * 50)
for file in os.listdir('outputs'):
    print(f"  ✓ {file}")

# Display one of the visualizations
print("\nSample visualization:")
display(Image('outputs/sales_by_product.png'))

In [None]:
# View the exported summary data
print("Product Summary:")
product_summary = pd.read_csv('outputs/product_summary.csv')
print(product_summary)

print("\n\nSalesperson Summary:")
salesperson_summary = pd.read_csv('outputs/salesperson_summary.csv')
print(salesperson_summary)

## Understanding the Script Structure

Let's break down the key components of our analysis script:

In [None]:
# Key components of a data analysis script:

print("""
1. IMPORTS
   - Import all necessary libraries at the top
   - pandas, numpy, matplotlib, etc.

2. CONFIGURATION
   - Set visualization styles
   - Define constants and parameters

3. FUNCTIONS
   - load_data(): Load and parse data
   - clean_data(): Handle missing values, duplicates
   - analyze_data(): Perform calculations and statistics
   - create_visualizations(): Generate charts and graphs
   - export_results(): Save outputs to files

4. COMMAND-LINE ARGUMENTS
   - argparse module for flexible inputs
   - Makes script reusable with different datasets

5. MAIN FUNCTION
   - Orchestrates the entire workflow
   - Calls functions in the right order
   - Natural place to add error handling (our script does not do this yet)

6. __name__ == "__main__"
   - Allows script to be imported or run directly
   - Professional Python practice
""")

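To make components 4 to 6 concrete, here is a minimal sketch of the same layout on a toy problem. The names `summarize` and the `argv` parameter are illustrative assumptions, not part of `sales_analysis.py`. Accepting an optional `argv` list is one common way to make `main()` callable from a notebook or a test as well as from the command line.

```python
import argparse


def summarize(values):
    """A stand-in for real analysis: return the total and the mean."""
    total = sum(values)
    return {"total": total, "mean": total / len(values)}


def main(argv=None):
    """Parse arguments and run the analysis.

    argv=None tells argparse to read sys.argv, which is what happens
    when the file is run from the command line.
    """
    parser = argparse.ArgumentParser(description="Summarize a list of numbers")
    parser.add_argument("values", nargs="+", type=float, help="Numbers to summarize")
    args = parser.parse_args(argv)

    result = summarize(args.values)
    print(f"Total: {result['total']:.2f}, Mean: {result['mean']:.2f}")
    return 0


if __name__ == "__main__":
    # A real script would call main() here so it reads sys.argv.
    # An explicit list is passed so this sketch also runs inside a notebook.
    main(["10", "20", "30"])
```

Because the work lives in functions, another module can `import` this file and call `summarize()` directly without triggering the command-line code.
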
## Making Scripts More Flexible

Let's enhance our script with more command-line options:

In [None]:
# Example of enhanced argument parser
enhanced_parser = '''
parser = argparse.ArgumentParser(
    description='Analyze sales data and generate reports',
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="""
Examples:
  python sales_analysis.py data.csv
  python sales_analysis.py data.csv --output-dir results
  python sales_analysis.py data.csv --no-plots
  python sales_analysis.py data.csv --min-sales 1000
    """
)

# Required arguments
parser.add_argument('input_file', help='Path to input CSV file')

# Optional arguments
parser.add_argument('--output-dir', default='outputs', 
                   help='Output directory (default: outputs)')
parser.add_argument('--no-plots', action='store_true',
                   help='Skip generating visualizations')
parser.add_argument('--min-sales', type=float, default=0,
                   help='Filter sales below this amount')
parser.add_argument('--export-format', choices=['csv', 'excel'], default='csv',
                   help='Export format for results')
'''

print("Enhanced argument parser:")
print(enhanced_parser)
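
The parser above only defines the options; the script still has to act on them. The sketch below shows one possible way to wire `--no-plots` and `--min-sales` into the pipeline. `build_parser` and `plan_steps` are hypothetical helper names introduced for illustration, and the steps are returned as plain strings so the example runs without any data.

```python
import argparse


def build_parser():
    """Build a parser with some of the options shown above."""
    parser = argparse.ArgumentParser(description="Analyze sales data and generate reports")
    parser.add_argument("input_file", help="Path to input CSV file")
    parser.add_argument("--output-dir", default="outputs",
                        help="Output directory (default: outputs)")
    parser.add_argument("--no-plots", action="store_true",
                        help="Skip generating visualizations")
    parser.add_argument("--min-sales", type=float, default=0,
                        help="Filter sales below this amount")
    return parser


def plan_steps(args):
    """Return the pipeline steps that would run for the parsed arguments."""
    steps = ["load", "clean"]
    if args.min_sales > 0:
        steps.append(f"filter total_sales >= {args.min_sales}")
    steps.append("analyze")
    if not args.no_plots:
        steps.append("visualize")
    steps.append("export")
    return steps


# Passing a list to parse_args() lets us try the parser without a terminal
args = build_parser().parse_args(["data.csv", "--no-plots", "--min-sales", "1000"])
print(plan_steps(args))
```

Note that argparse turns `--no-plots` into the attribute `args.no_plots` and `--min-sales` into `args.min_sales`: dashes in option names become underscores.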

## Best Practices for Data Analysis Scripts

### 1. Code Organization
- Use functions for modularity
- Add docstrings to explain what each function does
- Keep functions focused on one task

### 2. Error Handling
- Check if files exist before loading
- Handle missing or invalid data gracefully
- Provide clear error messages
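
As a rough illustration of these three points, the sketch below checks the input path before any loading happens and reports problems in plain language. `validate_input_file` and `run` are hypothetical names, not functions from `sales_analysis.py`, and the `.csv` extension check is an assumption about what the script should accept.

```python
import sys
from pathlib import Path


def validate_input_file(filepath):
    """Return the path as a Path object, or raise an error with a clear message."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.suffix.lower() != ".csv":
        raise ValueError(f"Expected a .csv file, got: {path.name}")
    return path


def run(filepath):
    """Return an exit code: 0 on success, 1 on a handled error."""
    try:
        path = validate_input_file(filepath)
    except (FileNotFoundError, ValueError) as exc:
        # Print to stderr so error text stays separate from normal output
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    print(f"Input looks valid: {path}")
    return 0
```

In a real script, `main()` could finish with `sys.exit(run(args.input_file))` so that schedulers and shell scripts can detect a failed run from the exit code.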

### 3. Documentation
- Add comments to explain complex logic
- Include usage examples in the docstring
- Create a README file for your project

### 4. Output Management
- Create output directories automatically
- Use timestamps in filenames to avoid overwriting
- Save both data and visualizations
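
One way to avoid overwriting earlier results is to give each run its own timestamped subdirectory. This is a minimal sketch: `make_run_dir` is a hypothetical helper, and the `now` parameter exists only so the example produces a predictable name.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(base_dir="outputs", now=None):
    """Create and return a timestamped subdirectory, e.g. outputs/20240115_093000."""
    now = now or datetime.now()
    run_dir = Path(base_dir) / now.strftime("%Y%m%d_%H%M%S")
    # parents=True also creates base_dir if it does not exist yet
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


# Demonstrate inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp, now=datetime(2024, 1, 15, 9, 30, 0))
    print(run_dir.name)  # 20240115_093000
```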

### 5. Reproducibility
- Set random seeds for consistent results
- Document dependencies (requirements.txt)
- Include sample data for testing

## Creating a requirements.txt File

For sharing your script with others, create a requirements.txt file:

In [None]:
# Create requirements.txt
requirements = """pandas>=2.0.0
numpy>=1.24.0
matplotlib>=3.7.0
seaborn>=0.12.0
"""

with open('requirements.txt', 'w') as f:
    f.write(requirements)

print("✓ Created requirements.txt")
print("\nOthers can install dependencies with:")
print("  pip install -r requirements.txt")

## Exercise: Customize the Script

Try modifying the script to add these features:

### Easy:
1. Add a new visualization (e.g., sales by day of week)
2. Calculate and print the top 3 products
3. Add a filter to analyze only a specific region

### Medium:
4. Add a `--start-date` and `--end-date` argument to analyze a date range
5. Create a summary text file with all statistics
6. Add exception handling for missing files

### Advanced:
7. Add an `--export-format` option to export results as Excel instead of CSV
8. Create an interactive HTML dashboard using Plotly
9. Add email functionality to send results automatically
10. Optimize the script for large datasets (millions of rows)

**Hint**: Start with copying `sales_analysis.py` to a new file and making small changes!

## Real-World Applications

This type of script is used in real data analytics projects for:

1. **Automated Reporting**: Run daily/weekly reports automatically
2. **Data Pipelines**: Part of ETL (Extract, Transform, Load) processes
3. **Batch Processing**: Analyze multiple files at once
4. **Production Systems**: Integrate with business applications
5. **Collaboration**: Share analysis workflows with team members
6. **Scheduling**: Use with cron (Linux) or Task Scheduler (Windows)

### Example Workflow:
```bash
# Monday morning: analyze last week's data
python sales_analysis.py weekly_sales.csv --output-dir reports/week_23

# Generate monthly report
python sales_analysis.py monthly_sales.csv --output-dir reports/january_2024

# Batch process multiple regions
for region in north south east west; do
    python sales_analysis.py data_${region}.csv --output-dir reports/${region}
done
```
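
The shell loop above works on Linux and macOS. A Python equivalent runs anywhere Python does. The sketch below is one possible version: `run_batch` is a hypothetical helper, the file names follow the `data_<region>.csv` pattern from the loop, and `dry_run=True` only builds the commands so the example runs even when those files don't exist.

```python
import subprocess
import sys


def run_batch(regions, script="sales_analysis.py", dry_run=True):
    """Build (and optionally run) one analysis command per region."""
    commands = []
    for region in regions:
        cmd = [
            sys.executable, script, f"data_{region}.csv",
            "--output-dir", f"reports/{region}",
        ]
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if the script fails
            subprocess.run(cmd, check=True)
    return commands


for cmd in run_batch(["north", "south", "east", "west"]):
    print(" ".join(cmd))
```

Call `run_batch(regions, dry_run=False)` once the input files are in place.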

## From Notebook to Script: Conversion Tips

### When to Convert:
- ✅ Analysis is working well in notebook
- ✅ Need to run analysis repeatedly
- ✅ Want to share with colleagues who don't use Jupyter
- ✅ Need to automate the process

### Conversion Steps:
1. **Extract code from notebooks**: Copy working cells
2. **Organize into functions**: Group related operations
3. **Add parameters**: Make hardcoded values configurable
4. **Add error handling**: Make script robust
5. **Test thoroughly**: Run with different inputs
6. **Document**: Add docstrings and comments
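
Steps 2 to 4 can be sketched on a single notebook line. Suppose a cell contains `df[df["total_sales"] > 1000]`, with the column and threshold hardcoded. The version below turns it into a function with parameters and a basic check. `filter_sales` and the sample values are illustrative assumptions, not part of the original script.

```python
import pandas as pd


def filter_sales(df, column="total_sales", min_value=0.0):
    """Return the rows where `column` is at least `min_value`."""
    if column not in df.columns:
        raise KeyError(f"Column not found: {column}")
    # .copy() returns an independent DataFrame that is safe to modify
    return df[df[column] >= min_value].copy()


# Small in-memory sample so the function can be tested without a CSV file
sample = pd.DataFrame({
    "product": ["Laptop", "Phone", "Tablet"],
    "total_sales": [2598, 599, 1598],
})
print(filter_sales(sample, min_value=1000))
```

The threshold can now come from a command-line argument instead of being edited by hand in a cell.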

### Tools to Help:
- `nbconvert`: Convert notebooks to Python scripts
- `papermill`: Run notebooks with parameters
- `jupytext`: Sync notebooks and Python files

## Summary

You've learned to:
- ✅ Create a complete data analysis script
- ✅ Use command-line arguments for flexibility
- ✅ Structure code with functions and modules
- ✅ Generate and save visualizations programmatically
- ✅ Export results as CSV summaries and PNG charts
- ✅ Share reusable analysis workflows

### Key Takeaways:
1. **Scripts are reusable**: Write once, run many times
2. **Automation saves time**: Let the computer do repetitive work
3. **Professional workflow**: This is how real data analysts work
4. **Shareable**: Easy to collaborate with team members
5. **Integration ready**: Can be part of larger systems

### Next Steps:
- Practice creating scripts for different types of analysis
- Learn about scheduling scripts to run automatically
- Explore building web dashboards with Streamlit or Dash
- Study data engineering and pipeline tools

**Remember**: Start with notebooks for exploration, then convert to scripts for production!