In [1]:
import json
import pickle
import tqdm
import pandas as pd
from conversation import create_coder, create_reviewer, create_refiner, start_conversation

# RL Project - Which Prompt Is Best for Iterating on Code Generation?
Our RL project investigates which prompts work best for iterating on code generated by LLMs.

## How it works
We create a conversation with 3 participants: Coder, Reviewer, and Refiner.
Each participant is responsible for sending a prompt to the LLM and feeding the LLM's response into the
environment (here called the 'conversation').

### Coder
The Coder is responsible for writing the code.  
It sends the initial prompt, which describes the problem to be solved. All of our problems are
CSV data-cleaning tasks.  
The Coder takes part in the conversation only once (at the start), so we do not model it as an RL agent.
Instead, it is programmed to iterate over every prompt n times, and we later evaluate the results
of the conversations started with each initial prompt.

### Reviewer
The Reviewer is responsible for reviewing the code generated by the LLM.  
It sends a prompt asking for a review of the generated code. It can request this review
whenever the generated code does not score above the terminal grade.  
The Reviewer is an RL agent, and its goal is to maximize the grade of the code generated by the LLM.

### Refiner
The Refiner is responsible for refining the code generated by the LLM.  
It sends a prompt asking for an improved version of the generated code. It can request this improvement
after each review by the Reviewer.  
The Refiner is an RL agent, and its goal is to maximize the grade of the code generated by the LLM.
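
The text does not say which RL algorithm the Reviewer and Refiner use. As an illustration only, the sketch below assumes a simple epsilon-greedy bandit in which each prompt is an action and the code grade is the reward. The class name `EpsilonGreedyPromptAgent` and its methods are hypothetical and are not part of the project's `conversation` or `rl` modules.

```python
import random


class EpsilonGreedyPromptAgent:
    """Hypothetical bandit-style agent that picks one of n prompts."""

    def __init__(self, prompts, epsilon=0.1, seed=None):
        self.prompts = list(prompts)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.prompts)    # times each prompt was used
        self.values = [0.0] * len(self.prompts)  # running mean grade per prompt

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.prompts))
        return max(range(len(self.prompts)), key=lambda i: self.values[i])

    def update(self, index, grade):
        # Incremental update of the mean grade observed for this prompt.
        self.counts[index] += 1
        self.values[index] += (grade - self.values[index]) / self.counts[index]


agent = EpsilonGreedyPromptAgent(["prompt A", "prompt B", "prompt C"], epsilon=0.0)
agent.update(1, 80)
agent.update(1, 60)
```

With `epsilon=0.0` the agent always exploits, so after the two updates it keeps choosing the second prompt, whose estimated grade is the mean of the observed grades.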

### Prompt
For each participant, we generated prompts rated from 1 to $n$ on the **scales** of the following **properties**:
- Clarity;
- Length (`size` in the prompt files);
- Specificity; and
- Complexity.

This gave up to 20 different prompts per participant. To shrink the action space,
we opted for a simpler strategy:

- For each prompt (**scale length** x **number of properties**) of the **Coder**;
    - For each **property** of the **Reviewer**;
        - For each **property** of the **Refiner**;
            - We generate $m$ conversations in which:
                1. The Coder sends the prompt and adds the initial code;
                2. The code is graded (if the grade is not terminal, the conversation continues);
                3. The Reviewer picks one of the $n$ prompts for its **property** and adds the review;
                4. The Refiner picks one of the $n$ prompts for its **property** and adds the improvement;
                5. If the conversation length is not terminal, return to step 2.
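
The nested loops and the five conversation steps can be sketched as follows. This is a minimal illustration, not the project's implementation: `experiment_configs` and `run_conversation` are hypothetical names, the LLM calls are passed in as plain callables, and the scale length of 5 is an assumption inferred from 20 prompts over 4 properties.

```python
from itertools import product

PROPERTIES = ["clarity", "size", "specificity", "complexity"]


def experiment_configs(scale_length, conversations_per_config):
    """Yield (coder_prompt, reviewer_property, refiner_property, run) tuples."""
    # Each Coder prompt is identified by a (property, scale level) pair.
    coder_prompts = list(product(PROPERTIES, range(1, scale_length + 1)))
    for coder_prompt, reviewer_prop, refiner_prop in product(
        coder_prompts, PROPERTIES, PROPERTIES
    ):
        for run in range(conversations_per_config):
            yield coder_prompt, reviewer_prop, refiner_prop, run


def run_conversation(generate, evaluate, review, refine, max_turns, terminal_grade=95):
    """Run one conversation and return its history as (role, content) pairs."""
    code = generate()                      # step 1: Coder adds the initial code
    history = [("coder", code)]
    for _ in range(max_turns):             # step 5: stop at the maximum length
        grade = evaluate(code)             # step 2: grade the current code
        history.append(("grade", grade))
        if grade > terminal_grade:
            break
        feedback = review(code)            # step 3: Reviewer adds a review
        history.append(("reviewer", feedback))
        code = refine(code, feedback)      # step 4: Refiner adds improved code
        history.append(("refiner", code))
    return history


configs = list(experiment_configs(scale_length=5, conversations_per_config=2))
print(len(configs))  # 20 coder prompts * 4 * 4 properties * 2 runs = 640

# Stub callables stand in for the LLM-backed participants.
history = run_conversation(
    generate=lambda: "v0",
    evaluate=lambda code: 96 if code == "v1" else 50,
    review=lambda code: "handle missing values",
    refine=lambda code, feedback: "v1",
    max_turns=3,
)
```

In the stub run, the first version is graded 50, reviewed, and refined once; the refined version is graded 96 and the conversation ends.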

### Code Evaluation
The code is evaluated by an LLM using the `instructor` library. We ask for a grade
from 0 to 100 for both correctness and readability, along with a short explanation of the grade
(this comment is later added to the conversation).  
If the average grade is above 95, the conversation ends, since we consider the code good
enough.
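
The termination rule can be sketched with a small data structure. This is only an illustration under stated assumptions: `CodeGrade` is a hypothetical name, not the project's `CodeEvaluator` and not an `instructor` API, and "average grade" is read here as the mean of the correctness and readability grades.

```python
from dataclasses import dataclass

TERMINAL_GRADE = 95  # threshold stated in the text


@dataclass
class CodeGrade:
    correctness: int   # 0 to 100
    readability: int   # 0 to 100
    explanation: str

    def __post_init__(self):
        # Reject grades outside the 0 to 100 range.
        for name in ("correctness", "readability"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")

    @property
    def average(self):
        # Assumed definition: mean of the two grades.
        return (self.correctness + self.readability) / 2

    @property
    def is_terminal(self):
        # The conversation ends only when the average is strictly above 95.
        return self.average > TERMINAL_GRADE


grade = CodeGrade(correctness=80, readability=75, explanation="Sample grades")
print(grade.average, grade.is_terminal)  # 77.5 False
```

With grades of 80 and 75, the average is 77.5, so the conversation would continue to the Reviewer.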

In [2]:
# Lists of JSON file names, one list per participant
json_files_coder = [
    "json_files/prompts_clarity_coder.json",
    "json_files/prompts_size_coder.json",
    "json_files/prompts_specificity_coder.json",
    "json_files/prompts_complexity_coder.json"
]

json_files_reviewer = [
    "json_files/prompts_clarity_reviewer.json",
    "json_files/prompts_size_reviewer.json",
    "json_files/prompts_specificity_reviewer.json",
    "json_files/prompts_complexity_reviewer.json"
]

json_files_refiner = [
    "json_files/prompts_prop1_refiner.json",
    "json_files/prompts_prop2_refiner.json",
    "json_files/prompts_prop3_refiner.json",
    "json_files/prompts_prop4_refiner.json"
]

prompts_coder = []
for file_name in json_files_coder:
    with open(file_name, "r", encoding="utf-8") as file:
        data = json.load(file)
        for i, item in enumerate(data):
            item["index"] = i
        prompts_coder += data

reviewer_properties = {}    
for file_name in json_files_reviewer:
    with open(file_name, "r", encoding="utf-8") as file:
        data = json.load(file)
        for item in data:
            if item["propriedade"] not in reviewer_properties:
                reviewer_properties[item["propriedade"]] = []
            reviewer_properties[item["propriedade"]].append(item['prompt'])

refiner_properties = {}
# Group the Refiner prompts by property, reading from the Refiner files
for file_name in json_files_refiner:
    with open(file_name, "r", encoding="utf-8") as file:
        data = json.load(file)
        for item in data:
            if item["propriedade"] not in refiner_properties:
                refiner_properties[item["propriedade"]] = []
            refiner_properties[item["propriedade"]].append(item['prompt'])

In [3]:
import random

# Choose a random coder prompt
random_coder_prompt = random.choice(prompts_coder)

# Choose a random reviewer property and take its list of prompts
random_reviewer_prop = random.choice(list(reviewer_properties.keys()))
random_reviewer_prompt = reviewer_properties[random_reviewer_prop]

# Choose a random refiner property and take its list of prompts
random_refiner_prop = random.choice(list(refiner_properties.keys()))
random_refiner_prompts = refiner_properties[random_refiner_prop]

print("Random Coder Prompt Index:", random_coder_prompt["index"])
print("Random Reviewer Prop:", random_reviewer_prop)
print("Random Refiner Prop:", random_refiner_prop)

Random Coder Prompt Index: 1
Random Reviewer Prop: size
Random Refiner Prop: complexity


In [4]:
from rl.code_evaluator import CodeEvaluator


MAX_TURNS = 1
coder = create_coder(prompts_coder)

coder_prompt_dict = random_coder_prompt
reviewer = create_reviewer(random_reviewer_prompt)
refiner = create_refiner(random_refiner_prompts)
evaluator = CodeEvaluator(environment=None, prompt="Evaluate the code quality", name="Code Evaluator")           
            
environment = start_conversation(
    coder, 
    coder_prompt_dict, 
    reviewer, 
    refiner, 
    MAX_TURNS
)


**Initial prompt**: Can you provide me with Python code that cleans up a dirty CSV file? The code should use the 'pandas' and 'numpy' libraries. It would be helpful if it included some basic cleaning steps like handling missing values, converting formats to correct types, and checking for inconsistencies in the data. Please focus on creating a concise and well-structured script that showcases best practices.

Here is the `imdb_sample_10.csv` file content:

Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Meta_score,Director,No_of_Votes
The Shawshank Redemption,1994,,Drama,nan,nan,,nan
,1972,,"Crime, Drama",nan,nan,,nan
The Dark Knight,,,"Action, Crime, Drama",,,Christopher Nolan,
The Godfather: Part II,nan,,"Crime, Drama",9.0,90.0,Francis Ford Coppola,nan
,1957,,"Crime, Drama",9.0,96.0,Sidney Lumet,nan
The Lord of the Rings: The Return of the King,2003,201 min,,8.9,94.0,Peter Jackson,nan
,,154 min,"Crime, Drama",,,,
Schindler's List,,195 min,,,,,
Inception,,148 min,"Action, Adventure, Sci-Fi",,,Christopher Nolan,
Fight Club,nan,139 min,Drama,8.8,66.0,David Fincher,nan


**Coder**: ```python
import pandas as pd
import numpy as np

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv("imdb_sample_10.csv")

# Check for missing values and handle them appropriately
print(df.isnull().sum())  

# Convert columns to correct data types (example: replace 'nan' with NaN)
df['IMDB_Rating'] = df['IMDB_Rating'].fillna(pd.NA) 


# Handle inconsistencies in the data, such as format errors and inconsistent entries
# ... Add your custom cleaning code here ...


# Print the cleaned DataFrame to verify the results
print(df.head())  

```

**Explanation:**

1. **Import Libraries:** The first line imports the necessary libraries (`pandas` for data manipulation and `numpy` for numerical operations). 
2. **Load CSV File:**  `pd.read_csv("imdb_sample_10.csv")` reads the CSV file into a Pandas DataFrame, which is a more suitable format for data analysis than a plain list. 
3. **Check for Missing Values (Optional):** The code uses `df.isnull().sum()` to identify and count missing values in each column. 
4. **Convert Data Types:**  This example replaces 'nan' with the default NaN representation from Pandas (`pd.NA`). You might use other methods based on your data's context, such as:
    - Converting strings to numerical using `pd.to_numeric()` if you expect numeric values in a column. 
    - Replacing specific entries or converting to appropriate data types (e.g., dates).

5. **Handle Inconsistencies:** You can add your custom code here to address inconsistencies such as:
   - **Format Errors:** If there's inconsistent formatting, like strings being dates, numbers or different characters, you need to convert them to the correct format (using `str` functions) and check for any errors. 
   - **Inconsistency in Values:**  Check for values that are out of range (e.g., extremely high ratings) and determine if they should be replaced or adjusted based on your criteria. 


6. **Print Results:** The `df.head()` function shows the first few rows of the cleaned DataFrame, allowing you to confirm the results.

**Important Notes:**

* **Custom Cleaning:** This code provides a starting point for cleaning your CSV file. You'll need to customize it based on the specific issues in your data and the requirements of your project (e.g., what inconsistencies are causing problems).  
* **Data Analysis Libraries:** If you have more complex cleaning needs, consider using libraries like `numpy` or `scikit-learn`. 


Let me know if you want to learn about any specific aspect of the cleaning process or need help applying custom fixes to your data!



**Code Grade**: **Correctness:** 80

**Grade Explanation:** The code demonstrates a solid understanding of necessary preprocessing steps for a dataset. Notably, it utilizes both Pandas for file handling and its inherent functions  like `isnull()` and  `fillna()`. It clearly explains the importance of handling missing values  and inconsistent entries (though this could be enhanced by specific examples about how they are addressed). 

**Readability:** 75

**Grade Explanation:** The code is mostly readable, employing clear comments and indentation.  It demonstrates a focus on producing cleaned data for further analysis. However, the user can benefit from more robust explanations of techniques such as removing outliers or handling data based on specific rules.

**Reviewer**: ## Python Script Review: Cleaning a CSV Database

Here's an analysis of the provided code snippet for cleaning a CSV database, focusing on structure, logic, and potential areas for improvement. 


**Structure & Logic:**

* **Strengths:** The code is structured clearly, with imports, data loading, missing value handling, and data type conversion steps in separate blocks. This promotes readability and modularity.  
    * **Example:** Using `df.isnull().sum()`, the user can easily identify missing values, and its presence facilitates further data cleaning and analysis.  

* **Potential Improvements:** 
   -  **Code Organization:** While the structure is clear, it might benefit from a more structured approach, like using functions for specific tasks to make the code easier to understand and maintain.
   
**Data Processing Logic:**

* **Strengths:** The script effectively addresses common data cleaning needs:
    -  **Checking for missing values:**  `df.isnull().sum()` provides quick insights into missing value occurrences. 
    - **Missing Value Handling:** Replacing `nan` with `pd.NA` is a standard practice in data analysis, ensuring consistency and accuracy.
    
* **Potential Improvements:** 
   - **Handling Data Inconsistency:** The current code focuses on missing values. However, to efficiently deal with inconsistencies such as date format, inconsistent entry types, etc., you might benefit from adding checks and specific actions based on your dataset's context. For instance, consider: 
      *  Replacing dates with a uniform format using `pd.to_datetime()`
      *  Converting strings to numeric values if they represent quantities 

**Efficiency & Best Practices:**

* **Strengths:** The code is efficient and easy to understand. It employs Pandas' built-in functions, making it fast and reliable.

 * **Potential Improvements:** 
   - **Preprocessing before Cleaning:** Depending on the scope of your cleaning, you might benefit from pre-processing steps (e.g., header extraction or data standardization) before starting the actual cleaning process.  This helps to focus on the specific actions required for each step. 


**Clarity & User Experience:**

* **Strengths:**  The code is written in a clear and concise manner, making it easy to follow and understand. Use of `df.head()` provides quick visual feedback on progress.
   
* **Potential Improvements:**  Comments can provide further context about the cleaning steps performed for specific tasks. Adding detailed explanations of the `fillna()` method and data type conversion techniques would benefit users unfamiliar with these operations.

**Overall:** 


The code demonstrates a good understanding of basic data cleaning principles. However, incorporating more comprehensive functionalities like data type validation and custom handling of inconsistencies will increase its robustness and user experience. The use of functions for specific tasks would further improve readability and maintainability.  



**Recommendations:** 
* **Modularization:** Encapsulate specific cleaning steps into separate functions to make the code easier to read and maintain. 
* **Comments:** Explain the reasoning behind your decisions when implementing changes. 

By incorporating these recommendations, you can create a more comprehensive and efficient script for cleaning your CSV database!

**Code Refiner**: Here's a Python code snippet showcasing how to clean your CSV file using Pandas:

```python
import pandas as pd

# Load the CSV file into a DataFrame 
df = pd.read_csv("your_data.csv")  

# Handle Missing Values
# Example: Replace missing values in 'Age' column with the mean
df['Age'].fillna(df['Age'].mean(), inplace=True)  


# Remove Empty Rows
# Example: Remove rows with all NaN values
df.dropna(inplace=True) 

# Convert Data Types (Optional)
# Example: Convert 'Date' to datetime format
df['Date'] = pd.to_datetime(df['Date'])  

print(df.head())
```

**Explanation:**

1. **Importing Pandas:** `import pandas as pd` imports the Pandas library for data manipulation and analysis. 
2. **Loading Data:** `df = pd.read_csv("your_data.csv")` reads your CSV file into a Pandas DataFrame (a table-like structure). Replace `"your_data.csv"` with the actual path to your file.  

3. **Handling Missing Values:**
   * We use the `.fillna()` method: 
      * `df['Age'].fillna(df['Age'].mean(), inplace=True)` replaces missing values in the 'Age' column with the average age of the non-missing values (using the mean). You can replace this with strategies like:
         * **Forward Fill:** Filling with the next value if it exists 
         * **Backward Fill:** Filling with the previous value.
  
4. **Removing Empty Rows:**  `df.dropna(inplace=True)` removes rows that have all `NaN` (Not a Number) values in any column. This is useful to remove empty data entries.

5. **Converting Data Types:** The `pd.to_datetime()` function helps convert a string column representing dates or times into datetime format, allowing for easier analysis and manipulation of this information. 


**Example Usage:**  
Let's assume your CSV file has data like this:
```csv
Name,Age,Date
Alice,25,2023-10-26
Bob,30,None
Charlie,28,2023-10-27
David,None,2023-10-28
Emily,40,2023-10-29
```

After running the code, you would get a cleaner DataFrame (the output will look different depending on your specific data).



**Key Points:**  

* **Data Cleaning is Essential:** Before diving into analysis, cleaning up the raw data ensures accuracy and helps avoid bias.
* **Flexibility:** Pandas offers various techniques for handling missing values like `fillna()` or `interpolate()`. 


Let me know if you have a specific dataset and challenges, and I can provide more tailored code examples! 

