# EDA Automator Quick Start Guide

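This notebook walks through the basic functionality of the EDA Automator library: data quality checks, statistical summaries, univariate, bivariate, and multivariate analysis, report generation, and custom configuration.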

## 1. Installation

First, let's make sure we have EDA Automator installed:

In [None]:
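# Uncomment and run the next line if importing eda_automator raises ModuleNotFoundError.
# %pip installs into the environment of the running kernel.
# %pip install eda-automator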

## 2. Import Libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from eda_automator import EDAAutomator

# Set matplotlib to display plots inline
%matplotlib inline

ModuleNotFoundError: No module named 'eda_automator'

## 3. Create Sample Dataset

Let's create a sample dataset for demonstration purposes:

In [4]:
# Set random seed for reproducibility
np.random.seed(42)

# Create a sample dataset
data = {
    'age': np.random.normal(35, 10, 1000).astype(int),
    'income': np.random.lognormal(10.5, 0.5, 1000),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 1000),
    'satisfaction': np.random.choice([1, 2, 3, 4, 5], 1000),
    'customer_segment': np.random.choice(['A', 'B', 'C'], 1000),
    'churn': np.random.choice([0, 1], 1000, p=[0.8, 0.2]),
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Add some correlations
df['income'] = df['income'] + df['age'] * 500 + np.random.normal(0, 5000, 1000)
df['satisfaction'] = 5 - np.random.binomial(4, 0.3 + 0.4 * df['churn'])

# Add some missing values
mask = np.random.random(1000) < 0.05
df.loc[mask, 'income'] = np.nan

mask = np.random.random(1000) < 0.02
df.loc[mask, 'education'] = np.nan

# Display the first few rows
df.head()

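|   | age | income | education | satisfaction | customer_segment | churn |
|---|-----|--------|-----------|--------------|------------------|-------|
| 0 | 39 | 91941.767624 | Bachelor | 2 | B | 0 |
| 1 | 33 | 78405.888746 | Master | 3 | C | 0 |
| 2 | 41 | 59816.514241 | Bachelor | 4 | C | 0 |
| 3 | 50 | 48367.627055 | Master | 4 | B | 0 |
| 4 | 32 | 60592.929706 | Master | 4 | C | 0 |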
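
The `satisfaction` column is redefined as `5 - Binomial(4, 0.3 + 0.4 * churn)`, which ties it to `churn`. As a quick sanity check of that construction, the sketch below compares the analytic mean for each churn group with a simulated one. The helper `expected_satisfaction`, the seed, and the sample size are choices made for this illustration only and are not part of EDA Automator.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for this sketch only


def expected_satisfaction(churn: int) -> float:
    """Mean of 5 - Binomial(4, p), where p = 0.3 + 0.4 * churn."""
    p = 0.3 + 0.4 * churn
    return 5 - 4 * p


simulated_means = {}
for churn in (0, 1):
    p = 0.3 + 0.4 * churn
    # Same construction as the sample dataset, with a larger sample
    draws = 5 - rng.binomial(4, p, size=100_000)
    simulated_means[churn] = draws.mean()
    print(f"churn={churn}: expected {expected_satisfaction(churn):.1f}, "
          f"simulated {simulated_means[churn]:.2f}")
```

The analytic means are 3.8 for customers who stay and 2.2 for customers who churn, so the analysis in the following sections should find a clear negative association between `satisfaction` and `churn`.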


## 4. Basic EDA with EDA Automator

Now, let's initialize the EDA Automator with our sample dataset:

In [5]:
# Initialize EDA Automator with the dataset and specify the target variable
automator = EDAAutomator(df, target_variable='churn')

# Print the configuration
print("Configuration:")
automator.print_config()

NameError: name 'EDAAutomator' is not defined

### 4.1 Data Quality Assessment

Let's start by checking the quality of our data:

In [None]:
# Run data quality analysis
quality_results = automator.run_data_quality_analysis()

# Print a summary of the quality assessment
print(quality_results['summary'])

# Display missing values information
quality_results['missing_data']['missing_table']
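
To get a feel for what a missing-values table contains, here is a minimal plain-pandas sketch. It is not EDA Automator's implementation: the function name `missing_table`, its column names, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    table = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return table.sort_values("missing_count", ascending=False)


# Small toy frame with gaps in two columns
toy = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan],
    "education": ["Master", None, "PhD", "Bachelor"],
    "age": [39, 33, 41, 50],
})
print(missing_table(toy))
```

Calling `missing_table(df)` on the sample dataset should report gaps only in `income` and `education`, the two columns where missing values were injected.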

### 4.2 Statistical Analysis

Now, let's perform some statistical analysis:

In [None]:
# Run statistical analysis
stats_results = automator.run_statistical_analysis()

# Display descriptive statistics
stats_results['descriptive_stats']
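
For comparison, a similar summary can be built with pandas alone. The sketch below is an approximation under stated assumptions, not the library's own output: it combines `DataFrame.describe()` with `DataFrame.skew()` on a small toy frame invented for this example.

```python
import pandas as pd

# Toy data: 'age' is symmetric, 'income' has one large value on the right
toy = pd.DataFrame({
    "age": [30, 35, 40, 45, 50],
    "income": [40_000.0, 42_000.0, 45_000.0, 50_000.0, 120_000.0],
})

# describe() returns one column per variable; transpose for one row per variable
summary = toy.describe().T
summary["skew"] = toy.skew()  # aligned on the variable names

print(summary[["mean", "std", "min", "50%", "max", "skew"]])
```

Skewness is worth checking alongside the mean and standard deviation here, because `income` in the sample dataset is drawn from a log-normal distribution and is therefore right-skewed.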

### 4.3 Univariate Analysis

Let's examine each variable individually:

In [None]:
# Run univariate analysis
univariate_results = automator.run_univariate_analysis()

# Display the distributions of two numeric variables.
# Assumes each entry in 'numerical_plots' is a matplotlib Figure.
for column, title in [('age', 'Age Distribution'), ('income', 'Income Distribution')]:
    fig = univariate_results['numerical_plots'][column]
    fig.set_size_inches(8, 6)
    fig.suptitle(title)
    display(fig)  # display() is available by default in Jupyter

### 4.4 Bivariate Analysis

Now, let's look at relationships between variables:

In [None]:
# Run bivariate analysis
bivariate_results = automator.run_bivariate_analysis()

# Display correlation matrix
bivariate_results['correlation_plot']
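
When a target variable is set, a useful first question is which features correlate most strongly with it. Here is a minimal pandas sketch of that idea, assuming a toy dataset built the same way as the `churn` and `satisfaction` columns above plus an unrelated `noise` column. The data, seed, and variable names are illustrative and do not come from EDA Automator.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seed chosen for this sketch only
n = 2_000

churn = rng.choice([0, 1], size=n, p=[0.8, 0.2])
toy = pd.DataFrame({
    "churn": churn,
    # Same construction as the sample dataset: lower satisfaction when churn = 1
    "satisfaction": 5 - rng.binomial(4, 0.3 + 0.4 * churn),
    # A feature generated independently of churn
    "noise": rng.normal(size=n),
})

corr = toy.corr()  # Pearson correlation by default

# Correlation of each feature with the target, strongest first
target_corr = corr["churn"].drop("churn").sort_values(key=np.abs, ascending=False)
print(target_corr)
```

`satisfaction` should come out with a clearly negative correlation, while `noise` stays close to zero.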

### 4.5 Multivariate Analysis

Let's explore more complex relationships:

In [None]:
# Run multivariate analysis
multivariate_results = automator.run_multivariate_analysis()

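# Display PCA results if they are available.
# display() is required here: Jupyter does not render a bare expression inside an if block.
if 'pca_plot' in multivariate_results:
    display(multivariate_results['pca_plot'])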
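
For readers new to PCA, the sketch below shows the core computation with NumPy only: standardize the columns, take a singular value decomposition, and read off the share of variance each component explains. It is a generic illustration, not the code behind `pca_plot`. The function name `pca` and the toy data are assumptions.

```python
import numpy as np


def pca(X, n_components=2):
    """Project standardized data onto its first principal components.

    Returns the component scores and the explained variance ratios.
    Assumes no column is constant (its standard deviation would be zero).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)           # variance share per component
    scores = Z @ Vt[:n_components].T          # coordinates in component space
    return scores, explained[:n_components]


rng = np.random.default_rng(2)  # seed chosen for this sketch only
base = rng.normal(size=500)
X = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=500),  # nearly a multiple of column 0
    rng.normal(size=500),                        # independent column
])

scores, ratio = pca(X)
print("Explained variance ratio:", ratio.round(3))
```

Because two of the three columns carry almost the same information, the first component should capture roughly two thirds of the total variance and the first two components nearly all of it.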

## 5. Generate a Complete Report

Finally, let's generate a comprehensive report:

In [None]:
# Run full analysis and generate HTML report
automator.run_full_analysis()
report_path = "customer_data_analysis.html"
automator.generate_report(report_path, format="html")

print(f"Report generated: {report_path}")

## 6. Customizing the Analysis

You can customize various aspects of the analysis:

In [None]:
# Custom configuration
custom_config = {
    'missing_threshold': 0.1,            # Mark columns with >10% missing values
    'outlier_method': 'iqr',             # Use IQR for outlier detection
    'outlier_threshold': 1.5,            # Use 1.5*IQR as outlier threshold
    'sampling_threshold': 500,           # Sample data if rows exceed 500
    'correlation_method': 'spearman',    # Use Spearman correlation
    'palette_type': 'categorical'        # Use categorical color palette
}

# Initialize EDA Automator with custom configuration
custom_automator = EDAAutomator(df, target_variable='churn', **custom_config)

# Run analysis with custom configuration
custom_results = custom_automator.run_full_analysis()
custom_automator.generate_report("custom_analysis.html", format="html")

print("Custom analysis complete!")
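
To make the `'iqr'` method and the `1.5` threshold concrete, here is a minimal sketch of the standard interquartile range rule in pandas. How EDA Automator applies these settings internally is not shown in this notebook, so treat the function `iqr_outliers` and the toy incomes as assumptions for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - t * IQR, Q3 + t * IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    # Missing values compare as False, so they are never flagged
    return (values < lower) | (values > upper)


incomes = pd.Series([40_000, 42_000, 45_000, 47_000, 50_000, 52_000, 250_000])
mask = iqr_outliers(incomes, threshold=1.5)
print(incomes[mask])
```

With these numbers the quartiles are 43,500 and 51,000, so the bounds work out to 32,250 and 62,250 and only the 250,000 entry is flagged. A larger threshold widens the bounds and flags fewer points.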