# Data Analysis Agent

This project implements a comprehensive data analysis tool called DataAnalysisAgent that automates the process of exploring, visualizing, and deriving insights from datasets. It's built using OpenAI's Agent SDK, leveraging Plotly for interactive visualizations and OpenAI's GPT-4 for AI-powered insights. The agent is designed to make data analysis more accessible and efficient, especially for users who may not have extensive experience with data science libraries.

### Key Components
1. DataAnalysisAgent Class : The core class that handles all analysis operations, including:
   
   - Loading data from CSV or Excel files
   - Preprocessing data (handling missing values, duplicates)
   - Identifying key columns for analysis
   - Generating summary statistics
   - Creating intelligent visualizations
   - Providing AI-powered insights

2. Smart Column Detection : The agent automatically identifies:
   
   - Potential target variables based on column names and positions
   - Feature columns for analysis
   - Categorical columns with a reasonable number of unique values

3. Visualization System : Creates relevant visualizations based on data structure:
   
   - Scatter plots with trend lines for numeric relationships
   - 3D visualizations for multi-feature analysis
   - Distribution histograms with box plots
   - Categorical analysis with box plots and bar charts
   - Correlation heatmaps with highlighted strong correlations

4. AI-Powered Insights : Uses OpenAI's GPT-4 to generate human-readable insights about the data
5. User-Friendly Interface : Provides a simple wrapper function ( analyze_data ) that orchestrates the entire analysis process
### Use Case
This tool is particularly useful for:

- Quick exploratory data analysis
- Automated report generation
- Identifying key relationships in data
- Getting AI-assisted interpretations of data patterns
The project demonstrates how AI can be integrated into the data analysis workflow to make it more accessible and insightful, while still providing flexibility for customization.m

In [None]:
!pip install plotly



### 1. Import Libraries and Setup

In [8]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### 3. Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### 4. Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [30]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

In [36]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



In [None]:
!pip install plotly



### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

### 1. Import Libraries and Setup

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

### 2. Agent Class Initialization

This initialization method sets up a data analysis agent by:

1. Validating and loading a CSV or Excel file from the provided path
2. Creating a copy of the original data for manipulation
3. Setting up the OpenAI API key (either from the parameter or environment variable)
4. Initializing tracking attributes like **dropped_columns** and **analysis_report**
5. Identifying key columns in the dataset using the **_identify_key_columns()** method
6. Printing a confirmation message with the data dimensions

In [None]:
class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV or Excel file
        api_key : str, optional
            OpenAI API key
        """
        # Validate and load file
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
            
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # OpenAI setup
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Tracking attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded: {self.data.shape[0]} rows, {self.data.shape[1]} columns")

### Preprocessing Methods

This code contains two key methods of the DataAnalysisAgent class:

1. **_identify_key_columns()** : This is a helper method that analyzes the dataset to identify important columns for visualization and analysis. It:
   
   - Separates numeric and categorical columns
   - Identifies potential target variables by looking for columns with names containing keywords like "price", "sales", "target", etc.
   - Falls back to using the last numeric column as a target if no obvious targets are found
   - Identifies feature columns (numeric columns that aren't targets)
   - Selects categorical columns with 10 or fewer unique values
   - Returns a dictionary with these categorized columns
2. **preprocess_data()** : This method performs common data cleaning operations:
   
   - Removes specified columns
   - Optionally drops rows with missing values
   - Optionally removes duplicate rows
   - Tracks changes in the analysis report
   - Re-identifies key columns after preprocessing
   - Displays a summary of the preprocessing steps

In [None]:
def _identify_key_columns(self) -> Dict[str, List[str]]:
        """Identify key columns for visualization"""
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 
                           'profit', 'income', 'cost', 'amount', 'total', 'value']
        
        potential_targets = [
            col for col in numeric_columns 
            if any(keyword in col.lower() for keyword in target_keywords)
        ]
        
        # Fallback to last numeric column if no targets found
        if not potential_targets and numeric_columns:
            potential_targets = [numeric_columns[-1]]
        
        # Identify feature columns
        potential_features = [
            col for col in numeric_columns 
            if col not in potential_targets
        ]
        
        # Identify categorical columns
        key_categorical = [
            col for col in categorical_columns 
            if self.data[col].nunique() <= 10
        ]
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }

def preprocess_data(self, 
                       drop_na: bool = False, 
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        
        Parameters:
        -----------
        drop_na : bool, optional
            Drop rows with missing values
        drop_duplicates : bool, optional
            Remove duplicate rows
        columns_to_drop : List[str], optional
            Columns to remove from the dataset
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with preprocessed data
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self

### Summary Statistics

1. **generate_summary_statistics()** : This method generates comprehensive summary statistics for your dataset and returns the agent instance. It:
   
   - Separates columns into numeric and categorical types
   - Creates and displays descriptive statistics for numeric columns using pandas' describe()
   - For each categorical column, calculates and displays value counts and unique value counts
   - Attempts to generate AI-powered insights by calling the _generate_ai_insights() method
   - Stores all statistics in the agent's analysis report
   - Returns the agent instance for method chaining

2. **generate_ai_insights()** : This is a helper method that uses OpenAI's API to generate insights about the data. It:
   
   - Takes numeric and categorical summaries as input
   - Constructs a prompt that includes these summaries
   - Makes an API call to OpenAI's GPT-4 Turbo model
   - Returns the AI-generated insights as a dictionary
   - Uses a system prompt that positions the AI as a data analysis expert

These methods work together to provide both statistical summaries and AI-powered interpretations of your dataset, making it easier to understand the key characteristics and patterns in your data.

In [None]:
# Cell 4: Summary Statistics Methods
def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self

def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        
        Parameters:
        -----------
        numeric_summary : pd.DataFrame
            Summary of numeric columns
        categorical_summary : Dict
            Summary of categorical columns
        
        Returns:
        --------
        Dict[str, str]
            Dictionary containing AI-generated insights
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}

### Data Visualization Methods (Part 1)

This method creates data visualizations that focus on the relationships between key columns in your dataset. It:

1. Retrieves the key columns identified earlier (numeric targets, numeric features, and categorical columns)
2. Displays debug information about which columns were detected
3. Implements a fallback visualization if no clear target/feature columns were detected:
   - Creates a simple scatter plot between the first two numeric columns
4. If both target and feature columns are available:
   - Creates scatter plots with trend lines for each feature vs. the target variable
   - Limits to the first 3 features for clarity
5. If multiple feature columns are available:
   - Creates a 3D scatter plot showing the relationship between two features and the target
   - Colors points by a categorical variable if available

The method is designed to be intelligent about which visualizations to show based on the data structure, automatically selecting the most relevant plots. It returns the agent instance to allow for method chaining.

In [None]:
def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        
        Returns:
        --------
        DataAnalysisAgent
            The instance with generated visualizations
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()

#### Data Visualization Methods (Part2)


The **_continue_visualization** method is a helper method that extends the visualization capabilities of the **DataAnalysisAgent** class. Here's what it does:

This method focuses on two specific types of visualizations:

1. Distribution Analysis of Target Variables :
   
   - Creates a histogram with a box plot in the margin to show the distribution of the primary target variable
   - This helps identify the shape of the distribution (normal, skewed, bimodal, etc.) and potential outliers

2. Categorical Analysis with Target Variables :
   
   - Creates two visualizations that show the relationship between categorical and target variables:
     - A box plot showing how the target variable is distributed within each category
     - A bar chart showing the average value of the target variable for each category
   - These visualizations help identify how categorical variables influence the target variable

The method is designed to be called after the main visualize_data method, continuing the visualization process with more specialized plots. It retrieves the key columns identified earlier and creates visualizations only if the appropriate column types are available.

In [None]:
def _continue_visualization(self):
        """
        Continue visualization methods from Cell 5
        
        This method continues the visualization process for distribution and categorical analysis
        """
        numeric_targets = self.key_columns['numeric_targets']
        categorical_cols = self.key_columns['categorical']
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()

### Custom Visualization and Analysis Function

The **analyze_data** function serves as a high-level wrapper for the DataAnalysisAgent class, providing a streamlined way to analyze a dataset. Here's what it does:

This function takes a file path as input and:

1. Initializes a DataAnalysisAgent with the provided file path
2. Displays a sample of the data using **head()**
3. Prints preprocessing options that users can modify in their notebook:
   - Whether to drop rows with missing values
   - Whether to drop duplicate rows
   - Which columns to drop
4. Includes commented-out code for preprocessing that users can uncomment and customize
5. Generates summary statistics using the agent's **generate_summary_statistics()** method
6. Creates visualizations using the agent's **visualize_data()** method
7. Provides examples of custom visualizations (commented out)
8. Returns the configured agent for further use
9. Includes error handling to gracefully handle any issues during analysis

In [None]:
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    
    Returns:
    --------
    DataAnalysisAgent or None
        Configured data analysis agent or None if error occurs
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Optional preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # Example of custom visualization 
        print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None

### Data Visualization Method (Part 2) and Custom Visualization Method

In [None]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import openai
import plotly.io as pio
from typing import Dict, Any, List
from IPython.display import display, HTML, Markdown

# Set default plotly template
pio.templates.default = "plotly_white"

class DataAnalysisAgent:
    def __init__(self, csv_path: str, api_key: str = None):
        """
        Initialize the Data Analysis Agent for Jupyter
        
        Parameters:
        -----------
        csv_path : str
            Path to the CSV file to be analyzed
        api_key : str, optional
            OpenAI API key (can also be set via environment variable)
        """
        # Validate CSV file path
        if not os.path.exists(csv_path):
            raise FileNotFoundError(f"File not found at path: {csv_path}")
        
        # Load data
        try:
            file_extension = os.path.splitext(csv_path)[1].lower()
            if file_extension == '.csv':
                self.original_data = pd.read_csv(csv_path)
            elif file_extension in ['.xlsx', '.xls']:
                self.original_data = pd.read_excel(csv_path)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
                
            self.data = self.original_data.copy()
        except Exception as e:
            raise ValueError(f"Error reading file: {e}")
        
        # Initialize OpenAI client
        openai.api_key = api_key or os.getenv('OPENAI_API_KEY')
        
        # Preprocessing and analysis attributes
        self.dropped_columns = []
        self.analysis_report: Dict[str, Any] = {}
        
        # Identify key columns for visualization
        self.key_columns = self._identify_key_columns()
        
        print(f"Data loaded successfully: {self.data.shape[0]} rows, {self.data.shape[1]} columns")
    
    def _identify_key_columns(self) -> Dict[str, List[str]]:
        """
        Identify key columns for visualization based on data types and importance
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Identify potential target variables (often at the end of the dataset)
        potential_targets = []
        
        # Look for columns that might be targets based on common names
        target_keywords = ['target', 'label', 'class', 'price', 'sales', 'revenue', 'profit', 
                          'income', 'cost', 'amount', 'total', 'value']
        
        for col in numeric_columns:
            col_lower = col.lower()
            if any(keyword in col_lower for keyword in target_keywords):
                potential_targets.append(col)
        
        # If no targets found by name, use the last numeric column as a potential target
        if not potential_targets and numeric_columns:
            potential_targets.append(numeric_columns[-1])
        
        # Identify potential feature columns
        potential_features = [col for col in numeric_columns if col not in potential_targets]
        
        # Identify key categorical columns (limit to those with fewer unique values)
        key_categorical = []
        for col in categorical_columns:
            if self.data[col].nunique() <= 10:  # Only include categorical with reasonable number of categories
                key_categorical.append(col)
        
        return {
            'numeric_targets': potential_targets,
            'numeric_features': potential_features,
            'categorical': key_categorical
        }
    
    def preprocess_data(self, 
                       drop_na: bool = False, # specify which columns to drop
                       drop_duplicates: bool = True, 
                       columns_to_drop: List[str] = None) -> 'DataAnalysisAgent':
        """
        Preprocess the dataset with common cleaning operations
        """
        # Store initial dimensions
        initial_rows, initial_cols = self.data.shape
        
        # Drop specified columns
        if columns_to_drop:
            self.data.drop(columns=columns_to_drop, inplace=True)
            self.dropped_columns.extend(columns_to_drop)
        
        # Drop missing values if specified
        if drop_na:
            self.data.dropna(inplace=True)
        
        # Drop duplicate rows if specified
        if drop_duplicates:
            self.data.drop_duplicates(inplace=True)
        
        # Update analysis report
        self.analysis_report['preprocessing'] = {
            'initial_rows': initial_rows,
            'initial_cols': initial_cols,
            'final_rows': self.data.shape[0],
            'final_cols': self.data.shape[1],
            'dropped_columns': self.dropped_columns
        }
        
        # Re-identify key columns after preprocessing
        self.key_columns = self._identify_key_columns()
        
        # Display preprocessing summary
        print("### Preprocessing Summary")
        print(f"Initial Rows: {initial_rows}, Initial Columns: {initial_cols}")
        print(f"Final Rows: {self.data.shape[0]}, Final Columns: {self.data.shape[1]}")
        print(f"Dropped Columns: {self.dropped_columns}")
        
        return self
    
    def generate_summary_statistics(self) -> 'DataAnalysisAgent':
        """
        Generate comprehensive summary statistics
        """
        numeric_columns = self.data.select_dtypes(include=[np.number]).columns
        categorical_columns = self.data.select_dtypes(include=['object', 'category']).columns
        
        # Numeric summary
        print("### Numeric Columns Summary")
        numeric_summary = self.data[numeric_columns].describe().T
        display(numeric_summary)
        
        # Categorical summary
        print("### Categorical Columns Summary")
        categorical_summary = {}
        
        for col in categorical_columns:
            value_counts = self.data[col].value_counts()
            categorical_summary[col] = {
                'unique_count': self.data[col].nunique(),
                'top_categories': value_counts.head(5).to_dict()
            }
            
            # Display top categories for each categorical column
            print(f"**{col}** - Unique Values: {value_counts.shape[0]}")
            display(value_counts.head())
        
        # Generate AI insights
        try:
            ai_insights = self._generate_ai_insights(numeric_summary, categorical_summary)
            print("### AI Generated Insights")
            print(ai_insights.get('insights', 'No insights generated'))
        except Exception as e:
            print(f"Error generating AI insights: {e}")
            ai_insights = {"error": str(e)}
        
        self.analysis_report['summary_statistics'] = {
            'numeric': numeric_summary,
            'categorical': categorical_summary,
            'ai_insights': ai_insights
        }
        
        return self
    
    def _generate_ai_insights(self, numeric_summary: pd.DataFrame, categorical_summary: Dict) -> Dict[str, str]:
        """
        Generate AI-powered insights using OpenAI
        """
        # Prepare prompt for AI insight generation
        prompt = "Provide concise, insightful analysis of the following data summaries:\n\n"
        prompt += "Numeric Columns Summary:\n"
        prompt += str(numeric_summary) + "\n\n"
        prompt += "Categorical Columns Summary:\n"
        prompt += str(categorical_summary) + "\n\n"
        prompt += "Please provide key observations, potential patterns, and meaningful insights."
        
        # Generate insights using OpenAI
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[
                {"role": "system", "content": "You are a data analysis expert providing concise, actionable insights."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=300
        )
        
        return {"insights": response.choices[0].message.content}
    
    def visualize_data(self) -> 'DataAnalysisAgent':
        """
        Generate focused visualizations based on key column relationships
        """
        print("## Smart Data Visualizations")
        print("Showing the most relevant visualizations based on data analysis")
        
        # Get key columns identified earlier
        numeric_targets = self.key_columns['numeric_targets']
        numeric_features = self.key_columns['numeric_features']
        categorical_cols = self.key_columns['categorical']
        
        # Debug information
        print(f"Detected target columns: {numeric_targets}")
        print(f"Detected feature columns: {numeric_features}")
        print(f"Detected categorical columns: {categorical_cols}")
        
        # Check if we have enough data for visualizations
        all_numeric = self.data.select_dtypes(include=[np.number]).columns.tolist()
        
        # Fallback if no appropriate columns were detected
        if (len(numeric_features) == 0 or len(numeric_targets) == 0) and len(all_numeric) >= 2:
            print("### Basic Numeric Relationships")
            print("No clear target/feature columns detected. Showing basic relationships between numeric columns.")
            
            # Select the first two numeric columns for visualization
            x_col = all_numeric[0]
            y_col = all_numeric[1]
            
            # Create a simple scatter plot
            fig = px.scatter(self.data, x=x_col, y=y_col, 
                            title=f"{x_col} vs {y_col}",
                            template="plotly_white")
            fig.show()
        
        # 1. Relationship between key numeric variables
        if len(numeric_features) > 0 and len(numeric_targets) > 0:
            print("### Relationships Between Key Numeric Variables")
            
            # Use the first target and up to 3 features
            target_col = numeric_targets[0]
            feature_cols = numeric_features[:min(3, len(numeric_features))]
            
            print(f"Target variable: {target_col}")
            print(f"Feature variables: {feature_cols}")
            
            # Create scatter plots for each feature vs target
            for feature in feature_cols:
                fig = px.scatter(self.data, x=feature, y=target_col, 
                                title=f"{feature} vs {target_col}",
                                template="plotly_white", 
                                trendline="ols")  # Add trend line
                fig.show()
            
            # If we have multiple features, show a 3D plot
            if len(feature_cols) >= 2:
                print("### 3D Relationship Visualization")
                fig = px.scatter_3d(self.data, 
                                   x=feature_cols[0], 
                                   y=feature_cols[1], 
                                   z=target_col,
                                   color=categorical_cols[0] if categorical_cols else None,
                                   title=f"3D Relationship: {feature_cols[0]}, {feature_cols[1]} vs {target_col}")
                fig.show()
        
        # 2. Distribution of target variable
        if numeric_targets:
            print("### Distribution of Target Variable")
            target_col = numeric_targets[0]
            
            fig = px.histogram(self.data, x=target_col, 
                              marginal="box", 
                              title=f"Distribution of {target_col}",
                              template="plotly_white")
            fig.show()
        
        # 3. Categorical analysis with target
        if categorical_cols and numeric_targets:
            print("### Categorical Analysis")
            cat_col = categorical_cols[0]
            target_col = numeric_targets[0]
            
            print(f"Analyzing {target_col} by {cat_col}")
            
            # Box plot showing distribution of target by category
            fig = px.box(self.data, x=cat_col, y=target_col, 
                        title=f"{target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
            
            # Bar chart showing average target value by category
            avg_by_cat = self.data.groupby(cat_col)[target_col].mean().reset_index()
            fig = px.bar(avg_by_cat, x=cat_col, y=target_col, 
                        title=f"Average {target_col} by {cat_col}",
                        template="plotly_white")
            fig.show()
        
        # 4. Correlation heatmap for numeric variables
        print("### Correlation Analysis")
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns
        
        if len(numeric_cols) > 1:
            # Use up to 8 numeric columns for correlation analysis
            selected_cols = list(numeric_cols[:min(8, len(numeric_cols))])
            
            print(f"Showing correlations for: {selected_cols}")
            correlation_matrix = self.data[selected_cols].corr()
            
            fig = px.imshow(correlation_matrix, 
                           text_auto=True, 
                           color_continuous_scale='RdBu_r',
                           title="Correlation Heatmap",
                           template="plotly_white")
            fig.show()
            
            # Highlight strong correlations
            print("#### Strong Correlations")
            strong_corrs = []
            
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_value = correlation_matrix.iloc[i, j]
                    if abs(corr_value) > 0.5:  # Threshold for strong correlation
                        strong_corrs.append({
                            'variables': f"{correlation_matrix.columns[i]} & {correlation_matrix.columns[j]}",
                            'correlation': corr_value,
                            'strength': 'Strong Positive' if corr_value > 0 else 'Strong Negative'
                        })
            
            if strong_corrs:
                for corr in strong_corrs:
                    print(f"- {corr['variables']}: {corr['correlation']:.2f} ({corr['strength']})")
            else:
                print("No strong correlations detected")
        
        self.analysis_report['visualizations'] = "Smart visualizations displayed"
        return self
    
    def custom_visualization(self, viz_type: str, **kwargs):
        """
        Create custom visualizations based on user specifications
        
        Parameters:
        -----------
        viz_type : str
            Type of visualization ('scatter', 'bar', 'box', 'line', 'pie')
        **kwargs : 
            Additional parameters specific to each visualization type
        """
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()
        
        if viz_type.lower() == "scatter":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else numeric_cols[0])
            color_col = kwargs.get('color_col', None)
            
            if x_col and y_col:
                if color_col:
                    fig = px.scatter(self.data, x=x_col, y=y_col, color=color_col, 
                                    title=f"{x_col} vs {y_col} by {color_col}")
                else:
                    fig = px.scatter(self.data, x=x_col, y=y_col, title=f"{x_col} vs {y_col}")
                
                fig.show()
            else:
                print("Error: Not enough numeric columns for scatter plot")
        
        elif viz_type.lower() == "bar":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            agg_func = kwargs.get('agg_func', 'mean')
            
            if x_col and y_col:
                # Aggregate the data
                agg_data = self.data.groupby(x_col)[y_col].agg(agg_func).reset_index()
                
                fig = px.bar(agg_data, x=x_col, y=y_col, 
                            title=f"{agg_func.capitalize()} of {y_col} by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for bar chart")
        
        elif viz_type.lower() == "box":
            x_col = kwargs.get('x_col', categorical_cols[0] if categorical_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[0] if numeric_cols else None)
            
            if x_col and y_col:
                fig = px.box(self.data, x=x_col, y=y_col, title=f"{y_col} Distribution by {x_col}")
                fig.show()
            else:
                print("Error: Missing required columns for box plot")
        
        elif viz_type.lower() == "line":
            x_col = kwargs.get('x_col', numeric_cols[0] if numeric_cols else None)
            y_col = kwargs.get('y_col', numeric_cols[1] if len(numeric_cols) > 1 else None)
            
            if x_col and y_col:
                fig = px.line(self.data.sort_values(x_col), x=x_col, y=y_col, 
                             title=f"{y_col} vs {x_col}")
                fig.show()
            else:
                print("Error: Not enough numeric columns for line chart")
        
        elif viz_type.lower() == "pie":
            col = kwargs.get('col', categorical_cols[0] if categorical_cols else None)
            
            if col:
                # Limit to top categories if there are too many
                value_counts = self.data[col].value_counts().reset_index()
                value_counts.columns = [col, 'count']
                
                if len(value_counts) > 8:
                    other_count = value_counts.iloc[8:]['count'].sum()
                    value_counts = value_counts.head(8)
                    value_counts.loc[len(value_counts)] = ['Other', other_count]
                
                fig = px.pie(value_counts, names=col, values='count', 
                            title=f"Distribution of {col}")
                fig.show()
            else:
                print("Error: No categorical column available for pie chart")
        
        else:
            print(f"Unsupported visualization type: {viz_type}")
            print("Supported types: scatter, bar, box, line, pie")

# Example usage in a Jupyter notebook
def analyze_data(file_path):
    """
    Analyze a dataset using the DataAnalysisAgent
    
    Parameters:
    -----------
    file_path : str
        Path to the CSV or Excel file to analyze
    """
    try:
        # Initialize the agent
        agent = DataAnalysisAgent(file_path)
        
        # Display data sample
        print("### Data Sample")
        display(agent.data.head())
        
        # Preprocess data
        print("\n## Preprocessing Options")
        print("You can modify these options in your notebook:")
        print("1. drop_na: Drop rows with missing values (default: False)")
        print("2. drop_duplicates: Drop duplicate rows (default: True)")
        print("3. columns_to_drop: List of columns to drop (default: None)")
        
        # Example preprocessing (uncomment and modify as needed)
        # agent.preprocess_data(drop_na=True, drop_duplicates=True, columns_to_drop=['column_to_drop'])
        
        # Generate summary statistics
        print("\n## Summary Statistics")
        agent.generate_summary_statistics()
        
        # Generate visualizations
        print("\n## Data Visualizations")
        agent.visualize_data()
        
        # example of custom visualization 
        # print("\n## Custom Visualization Examples")
        # agent.custom_visualization("scatter", x_col="column1", y_col="column2")
        # agent.custom_visualization("bar", x_col="category_column", y_col="numeric_column", agg_func="mean")
        
        return agent
        
    except Exception as e:
        print(f"Error analyzing data: {e}")
        return None



In [None]:
agent = analyze_data("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Data Sample


Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75



## Preprocessing Options
You can modify these options in your notebook:
1. drop_na: Drop rows with missing values (default: False)
2. drop_duplicates: Drop duplicate rows (default: True)
3. columns_to_drop: List of columns to drop (default: None)

## Summary Statistics
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
### Key Observations and Insights from Data Summaries:

#### Numeric Columns Analysis:
1. **Score Distributions**:
   - **Math score** has a mean of 66.09, with a relatively wide spread indicated by a standard deviation (std) of 15.16 and minimum score of 0 which suggests potential outliers or extreme failures. Approximately 50% of the students scored between 57 and 77.
   - **Reading score** presents a higher average (69.17) compared to math, with a slightly tighter dispersion (std of 14.60). The minimum score of 17 also points to less extreme low values.
   - **Writing score** is closely aligned with reading in terms of average (68.05) and has a similar pattern of distribution (std of 15.20). Both reading and writing scores show better performance compared to math.
   
2. **Performance Trends**:
   - The scores for reading and writing are significantly correlated, suggested by their similar means, standard deviations, and interquartile ranges.
   - Math scor

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


In [None]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'

In [15]:
agent = analyze_dataset("C:/Users/Omkar/PROJECTS/Langchain_projects/AgenticAI/AgentSDK_Tutorials/advanced_agents/StudentsPerformance.csv")

Data loaded successfully: 1000 rows, 8 columns
### Preprocessing Summary
Initial Rows: 1000, Initial Columns: 8
Final Rows: 1000, Final Columns: 8
Dropped Columns: []
### Numeric Columns Summary


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
math score,1000.0,66.089,15.16308,0.0,57.0,66.0,77.0,100.0
reading score,1000.0,69.169,14.600192,17.0,59.0,70.0,79.0,100.0
writing score,1000.0,68.054,15.195657,10.0,57.75,69.0,79.0,100.0


### Categorical Columns Summary
**gender** - Unique Values: 2


gender
female    518
male      482
Name: count, dtype: int64

**race/ethnicity** - Unique Values: 5


race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

**parental level of education** - Unique Values: 6


parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
Name: count, dtype: int64

**lunch** - Unique Values: 2


lunch
standard        645
free/reduced    355
Name: count, dtype: int64

**test preparation course** - Unique Values: 2


test preparation course
none         642
completed    358
Name: count, dtype: int64

### AI Generated Insights
**Key Observations and Insights:**

1. **Performance in Assessments:**
   - The average scores are relatively close across subjects with math (66.089), reading (69.169), and writing (68.054). Reading scores have a slightly higher mean and lower variability compared to writing and math, indicating a generally better and more consistent performance in reading.
   - The minimum scores are strikingly low compared to the average for all subjects, especially in math (0) and writing (10), suggesting that a few students struggled significantly.

2. **Spread and Distribution of Scores:**
   - Math scores show the highest variability (std: 15.16). This might suggest differing levels of aptitude or preparation among students in mathematics compared to writing (std: 15.20) and reading (std: 14.60).
   - The maximum score in each category is 100, indicating top performances. However, the 75th percentile values vary (math: 77, reading: 79, writing: 79), suggesting a tighter

### 3D Relationship Visualization


### Distribution of Target Variable


### Categorical Analysis
Analyzing writing score by gender


### Correlation Analysis
Showing correlations for: ['math score', 'reading score', 'writing score']


#### Strong Correlations
- math score & reading score: 0.82 (Strong Positive)
- math score & writing score: 0.80 (Strong Positive)
- reading score & writing score: 0.95 (Strong Positive)


AttributeError: 'DataAnalysisAgent' object has no attribute 'generate_correlation_heatmap'