In [1]:
import sys
from pathlib import Path

from openai.types.chat.chat_completion import ChatCompletion

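# Add the parent of the current working directory to sys.path
# so that the `config` package can be imported from the notebook.
sys.path.append(str(Path().resolve().parent))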

from config.openai_client import client

The user:
- Enters the feature's name
- Selects its role: target or feature
- Selects its type: binary, categorical, ordinal, discrete, or continuous
- Enters a description of the feature
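
The inputs listed above could be checked before they are sent to the model. The following is only a sketch under stated assumptions: the helper `validate_feature` and the constants `ALLOWED_ROLES`, `ALLOWED_TYPES`, and `REQUIRED_KEYS` are hypothetical names that do not come from the notebook, and the English labels mirror the values used in the `features` list.

```python
# Hypothetical constants: the allowed values mirror the options offered to the user.
ALLOWED_ROLES = {"target", "feature"}
ALLOWED_TYPES = {"binary", "categorical", "ordinal", "discrete", "continuous"}
REQUIRED_KEYS = ("name", "role", "type", "description")


def validate_feature(spec: dict[str, str]) -> dict[str, str]:
    """Return a whitespace-stripped copy of `spec` or raise ValueError if it is invalid."""
    # Every field must be present and non-empty.
    missing = [key for key in REQUIRED_KEYS if not spec.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing or empty fields: {', '.join(missing)}")

    cleaned = {key: spec[key].strip() for key in REQUIRED_KEYS}

    # Role and type must be one of the predefined options.
    if cleaned["role"] not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {cleaned['role']!r}")
    if cleaned["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {cleaned['type']!r}")
    return cleaned
```

Calling `validate_feature` on each entry before building the dataset overview would surface typos such as `"ordnal"` early, rather than leaving the model to guess what was meant.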

In [2]:
features: list[dict[str, str]] = [
    {"name": "diabetes", "role": "target", "type": "binary", "description": "whether a patient has diabetes"},
    {"name": "mother", "role": "feature", "type": "binary", "description": "whether a patient's mother has diabetes"},
    {"name": "father", "role": "feature", "type": "binary", "description": "whether a patient's father has diabetes"},
    {"name": "occupation", "role": "feature", "type": "categorical", "description": "a patient's occupation"},
    {"name": "education", "role": "feature", "type": "ordinal", "description": "a patient's level of education"},
    {"name": "age", "role": "feature", "type": "discrete", "description": "a patient's age, full years"},
    {"name": "height", "role": "feature", "type": "discrete", "description": "a patient's height, cm (rounded)"},
    {"name": "weight", "role": "feature", "type": "continuous", "description": "a patient's weight, kg"},
]

Generating new features:

In [3]:
prompt1: str = """
You are a virtual assistant for a data scientist.
The user provides you with an overview of a dataset.
Your task is to construct new features based solely on the columns present in the dataset.

In your response:
1. Enumerate the features you suggest adding.
2. For each feature, provide a formula using the existing columns of the dataset.
3. Explain why each feature would be beneficial for an ML model.

Important constraints:
- Use only the columns contained in the dataset.
- Do not suggest collecting additional data or adding anything that cannot be calculated from the existing columns.
- Respond with the list of new features, formulae, and explanations without any welcoming or accompanying text.
- You have not seen the data yet, so do not construct features based on concrete names of categories.
"""

In [4]:
dataset_overview: str = "\n\n".join(
    "\n".join(
        [
            f"{i}. `{params['name']}`:",
            f"Role: {params['role']}",
            f"Type: {params['type']}",
            f"Description: {params['description']}",
        ],
    )
    for i, params in enumerate(features, start=1)
)

print(dataset_overview)

1. `diabetes`:
Role: target
Type: binary
Description: whether a patient has diabetes

2. `mother`:
Role: feature
Type: binary
Description: whether a patient's mother has diabetes

3. `father`:
Role: feature
Type: binary
Description: whether a patient's father has diabetes

4. `occupation`:
Role: feature
Type: categorical
Description: a patient's occupation

5. `education`:
Role: feature
Type: ordinal
Description: a patient's level of education

6. `age`:
Role: feature
Type: discrete
Description: a patient's age, full years

7. `height`:
Role: feature
Type: discrete
Description: a patient's height, cm (rounded)

8. `weight`:
Role: feature
Type: continuous
Description: a patient's weight, kg


In [5]:
response1: ChatCompletion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt1},
        {"role": "user", "content": dataset_overview},
    ],
    max_tokens=4096,
    temperature=0.5,
)
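
Because `max_tokens` caps the length of the reply, a long answer can be cut off. In the Chat Completions API, a choice that stopped at the token limit reports `finish_reason == "length"`. A minimal sketch of such a check follows; the helper name `was_truncated` is hypothetical, and the `SimpleNamespace` object is only a stand-in that mimics the two attributes accessed, so no API call is made here.

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return completion.choices[0].finish_reason == "length"


# Stand-in object for illustration only; a real ChatCompletion exposes the same attributes.
fake_completion = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake_completion))
```

In the notebook the same check could be applied to `response1` before its content is reused as context for the next request.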

In [6]:
print(response1.choices[0].message.content)

1. **BMI (Body Mass Index)**
   - **Formula**: `weight / (height / 100)^2`
   - **Explanation**: BMI is a widely used indicator of body fatness and is correlated with the risk of various diseases, including diabetes. It can provide a more nuanced understanding of the patient's health status compared to height and weight alone.

2. **Age Group**
   - **Formula**: `age // 10 * 10` (for example, 30 for ages 30-39, 40 for ages 40-49, etc.)
   - **Explanation**: Grouping ages into decades can help the model capture trends and patterns related to age more effectively, reducing noise from individual age values.

3. **Parental Diabetes**
   - **Formula**: `mother + father` (resulting in values 0, 1, or 2)
   - **Explanation**: The sum of parental diabetes status can indicate the genetic predisposition to diabetes, which is a significant risk factor.

4. **Height-to-Age Ratio**
   - **Formula**: `height / age`
   - **Explanation**: This ratio can provide insights into growth patterns and overal

Preparing code for EDA:

In [7]:
prompt2: str = """
You are a virtual assistant for a data scientist.
The user has provided a dataset and wishes to perform exploratory data analysis (EDA).
You have already received an overview of the dataset and have suggested new features to construct.

Write Python code to perform the following EDA tasks:

1. Read the dataset from a CSV file.
2. Add the new features you previously suggested to the dataset.
3. Calculate appropriate statistics based on the data type of each feature.
4. Check for missing values, sparsity, and outliers.
5. Visualize the target variable's distribution using a plot relevant to its data type.
6. Visualize the distribution of each feature using subplots relevant to their data types.
7. Estimate the relevancy of each feature to the target variable using appropriate methods and statistical tests based on their data types.
8. Visualize the relationships between the target variable and each feature using subplots relevant to their data types.
9. Estimate feature redundancy using appropriate methods and statistical tests based on their data types.
10. Visualize pairwise relationships between features using subplots relevant to their data types.

Respond only with the Python code, including all necessary inline comments.
"""
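
The prompt leaves the choice of "appropriate methods and statistical tests" to the model. For reference, one possible mapping from feature type to a test against a binary target is sketched below. This mapping is an assumption made for illustration and is not prescribed by the notebook; the names `TEST_BY_TYPE` and `suggest_test` are hypothetical.

```python
# Hypothetical mapping: a common choice of test for each feature type
# when the target is binary. Other choices are equally defensible.
TEST_BY_TYPE: dict[str, str] = {
    "binary": "chi-squared test of independence",
    "categorical": "chi-squared test of independence",
    "ordinal": "Mann-Whitney U test",
    "discrete": "Mann-Whitney U test",
    "continuous": "Mann-Whitney U test",
}


def suggest_test(feature_type: str) -> str:
    """Return the name of a test for comparing a feature with a binary target."""
    try:
        return TEST_BY_TYPE[feature_type]
    except KeyError:
        raise ValueError(f"unknown feature type: {feature_type!r}") from None


for feature_type in ("binary", "ordinal", "continuous"):
    print(f"{feature_type}: {suggest_test(feature_type)}")
```

A table like this can serve as a checklist when reviewing the generated EDA code, since the model may pick a test that does not match the declared type of a column.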

In [8]:
response2: ChatCompletion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt1},
        {"role": "user", "content": dataset_overview},
        {"role": "assistant", "content": response1.choices[0].message.content},
        {"role": "system", "content": prompt2},
    ],
    max_tokens=4096,
    temperature=0.5,
)
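
Although `prompt2` asks for Python code only, the reply printed in the next cell opens with a Markdown code fence. If the generated code is to be saved or executed, the fence has to be stripped first. The sketch below shows one way to do that with the standard library; `extract_code` and `FENCE_RE` are hypothetical names, and the pattern assumes at most one fenced block, possibly left unclosed when the reply is cut off.

```python
import re

# Matches an opening fence with an optional "python" tag and captures everything
# up to the closing fence, or up to the end of the text if the fence is never closed.
FENCE_RE = re.compile(r"```(?:python)?[ \t]*\n(.*?)(?:```|\Z)", re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if there is no fence."""
    match = FENCE_RE.search(reply)
    return match.group(1).strip() if match else reply.strip()


example_reply = "```python\nimport pandas as pd\ndf = pd.read_csv('dataset.csv')\n```"
print(extract_code(example_reply))
```

In the notebook this could be applied as `extract_code(response2.choices[0].message.content)`. The extracted code should still be reviewed before it is run, since it may be incomplete or rely on guessed category names.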

In [9]:
print(response2.choices[0].message.content)

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, pearsonr, spearmanr

# 1. Read the dataset from a CSV file
df = pd.read_csv('dataset.csv')

# 2. Add the new features
# BMI (Body Mass Index)
df['BMI'] = df['weight'] / (df['height'] / 100) ** 2

# Age Group
df['Age_Group'] = (df['age'] // 10) * 10

# Parental Diabetes
df['Parental_Diabetes'] = df['mother'] + df['father']

# Height-to-Age Ratio
df['Height_to_Age'] = df['height'] / df['age']

# Weight-to-Age Ratio
df['Weight_to_Age'] = df['weight'] / df['age']

# Education Level Indicator (assuming education levels are 'Primary', 'Secondary', 'Tertiary')
education_mapping = {'Primary': 1, 'Secondary': 2, 'Tertiary': 3}
df['Education_Level'] = df['education'].map(education_mapping)

# Weight Category (assuming standard BMI categories)
df['Weight_Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 24.9, 29.9, np.inf], labels=['Underweight', 'Normal', '