magistak/llm-data

DataGuy

DataGuy is a Python package that uses Large Language Models (LLMs) to simplify data science workflows. It provides tools for automated data wrangling, LLM-assisted analysis, and AI-assisted visualization, and is best suited to small-to-medium datasets.

Features

  • Automated Data Wrangling: Clean and preprocess your data with minimal effort using LLM-generated code.
  • AI-Powered Data Visualization: Generate insightful plots and visualizations based on natural language descriptions.
  • Intelligent Data Analysis: Perform descriptive and inferential analysis with the help of LLMs.
  • Customizable Workflows: Integrate with pandas, matplotlib, and other Python libraries for seamless data manipulation.
  • Safe Code Execution: Built-in safeguards to ensure only safe and trusted code is executed.
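As an illustration of the kind of safeguard the last feature refers to, here is a minimal sketch of a static check that could run on LLM-generated code before it is executed. This is not DataGuy's actual mechanism, which this README does not describe; the function name `find_unsafe_constructs` and both rule sets are hypothetical.

```python
import ast

# Hypothetical rules for illustration only; DataGuy's real checks may differ.
ALLOWED_IMPORTS = {"pandas", "numpy", "matplotlib", "sklearn"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}


def find_unsafe_constructs(source: str) -> list:
    """Return human-readable problems found in `source` (empty list if none)."""
    problems = []
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b` is judged by its top-level package `a`
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            # relative imports have module=None and are rejected as well
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import from '{node.module}' is not allowed")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # only direct calls by name are caught, e.g. eval(...)
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"call to '{node.func.id}' is not allowed")
    return problems
```

A static check like this only catches obvious patterns. It is not a sandbox, so generated code should still be reviewed before it runs on sensitive data.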

Installation

Install the package using pip:

pip install dataguy

Usage

Getting Started

  1. Set the Anthropic API key in your environment:

    import os
    os.environ["ANTHROPIC_API_KEY"] = "your_api_key_here"

    Replace your_api_key_here with your actual API key from Anthropic.

  2. Import the Package:

    from dataguy import DataGuy

  3. Initialize a DataGuy Instance:

    dg = DataGuy()

  4. Load Your Data:

    import pandas as pd
    data = pd.DataFrame({"age": [25, 30, None], "score": [88, 92, 75]})
    dg.set_data(data)

  5. Summarize Your Data:

    summary = dg.summarize_data()
    print(summary)

  6. Wrangle Your Data:

    cleaned_data = dg.wrangle_data()

  7. Visualize Your Data:

    dg.plot_data("age", "score")

  8. Analyze Your Data:

    results = dg.analyze_data()
    print(results)
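Step 1 writes the key directly into the script for brevity. If you prefer to export `ANTHROPIC_API_KEY` in your shell and keep it out of source files, a small guard like the one below can fail early with a clear message when the variable is missing. The helper `require_api_key` is a hypothetical name and is not part of DataGuy.

```python
import os


def require_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it in your shell before starting Python."
        )
    return key
```

Calling `require_api_key()` once before `DataGuy()` surfaces a missing key immediately rather than at the first LLM request.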

Example Workflow

from dataguy import DataGuy
import pandas as pd

# Initialize DataGuy
dg = DataGuy()

# Load data
data = pd.read_csv("path/to/data.csv")
dg.set_data(data)

# Summarize data
summary = dg.summarize_data()
print("Data Summary:", summary)

# Wrangle data
cleaned_data = dg.wrangle_data()

# Visualize data
dg.plot_data("column_x", "column_y")

# Analyze data
analysis_results = dg.analyze_data()
print("Analysis Results:", analysis_results)

Key Methods

  • set_data(obj): Load data into the DataGuy instance. Supports pandas DataFrames, dictionaries, lists, NumPy arrays, and CSV files.
  • summarize_data(): Generate a summary of the dataset, including shape, columns, missing values, and means.
  • wrangle_data(): Automatically clean and preprocess the dataset for analysis.
  • plot_data(column_x, column_y): Create a scatter plot of two columns using matplotlib.
  • analyze_data(): Perform an automated analysis of the dataset, returning descriptive statistics and insights.
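To make the contents of a summary concrete, the sketch below computes the four items listed for summarize_data() (shape, columns, missing values, means) with plain pandas. It is an approximation for illustration, not DataGuy's implementation; the function `basic_summary` and the dictionary keys are assumptions, and the real return format may differ.

```python
import pandas as pd


def basic_summary(df: pd.DataFrame) -> dict:
    """Collect shape, column names, missing-value counts, and numeric means."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        # number of missing (NaN/None) entries per column
        "missing_values": df.isna().sum().to_dict(),
        # means of numeric columns only; missing values are skipped
        "means": df.mean(numeric_only=True).to_dict(),
    }


# Same toy data as in the Getting Started steps
data = pd.DataFrame({"age": [25, 30, None], "score": [88, 92, 75]})
summary = basic_summary(data)
print(summary)
```

Comparing this output with the result of `dg.summarize_data()` on the same DataFrame is a quick way to sanity-check what the package reports.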

Requirements

  • Python 3.8 or higher
  • Dependencies:
    • pandas
    • numpy
    • matplotlib
    • scikit-learn
    • claudette
    • anthropic
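The requirements above can be checked from Python with the standard library. The sketch below assumes that the distribution names on PyPI match the names in the list; `check_environment` is a hypothetical helper, not something DataGuy provides.

```python
import sys
from importlib import metadata

# Names copied from the Requirements list; assumed to match the PyPI names.
REQUIRED = ["pandas", "numpy", "matplotlib", "scikit-learn", "claudette", "anthropic"]


def check_environment(packages=REQUIRED, min_python=(3, 8)):
    """Report whether Python is recent enough and which packages are installed."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        try:
            report[name] = metadata.version(name)  # installed version string
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed
    return report


print(check_environment())
```

Any entry reported as `None` is missing from the current environment; `pip install dataguy` would normally pull in the dependencies the package declares.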

Contributing

Contributions are welcome! Please submit issues or pull requests via the GitHub repository.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • István Magyary
  • Sára Viemann
  • Kristóf Bálint

For inquiries, contact: magistak@gmail.com
