## Get the Data from the Database
Manually pick exercise IDs from the available exercises and set the `EXERCISE_IDS` variable accordingly.
The `fetch_data_from_db` function then loads the data for those exercise IDs from the database.

In [None]:
from service.db_service import fetch_data_from_db
from langid import classify

EXERCISE_IDS = [4066, 642, 544, 506]
data = fetch_data_from_db(EXERCISE_IDS)

## Data Preprocessing
The data preprocessing steps include:
- Dropping rows with missing or invalid data.
- Filtering out non-English submissions.

You can adapt the data preprocessing steps based on the requirements of your evaluation.

### Drop Rows with Missing or Invalid Data
Drops rows where `submission_text` or `result_score` is missing, then removes submissions whose text is empty or contains only whitespace.

In [None]:
data = data.dropna(subset=["submission_text", "result_score"])
data = data[data["submission_text"].str.strip() != ""]
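
To see what the two filters above remove, here is a minimal sketch on a toy DataFrame. The column names match the notebook, but the rows are made up for illustration and do not come from the database.

```python
import pandas as pd

# Toy frame: column names match the notebook, values are invented.
toy = pd.DataFrame(
    {
        "submission_text": ["A valid answer.", None, "   ", "Another answer."],
        "result_score": [80.0, 50.0, 70.0, None],
    }
)

# Step 1: drop rows where either column is missing (None/NaN).
cleaned = toy.dropna(subset=["submission_text", "result_score"])

# Step 2: drop rows whose text is empty or whitespace-only.
cleaned = cleaned[cleaned["submission_text"].str.strip() != ""]

print(cleaned)
```

Only the first row survives: the second has no text, the third is whitespace-only, and the fourth has no score.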

### Filter Out Non-English Submissions
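Keeps only the submissions that the `langid` library classifies as English (`en`). Each unique text is classified once, and the result is mapped back onto all rows with that text.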

In [None]:
unique_texts = data["submission_text"].unique()
classification_results = {text: classify(text)[0] == "en" for text in unique_texts}

# Keep only rows whose text was classified as English. Mapping directly into
# the filter avoids a temporary column and a possible SettingWithCopyWarning.
data = data[data["submission_text"].map(classification_results)]
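
The classify-once-then-map pattern used above can be sketched as a small reusable helper. The names `filter_by_language` and `fake_classify` are hypothetical and not part of the notebook. The stand-in classifier only exists so the sketch runs without `langid` installed; in the notebook you would pass `langid.classify`, which returns a `(language_code, score)` tuple.

```python
import pandas as pd


def filter_by_language(df, classify_fn, lang="en", text_col="submission_text"):
    """Keep rows whose text is classified as `lang`.

    `classify_fn` must behave like `langid.classify`:
    it takes a string and returns a (language_code, score) tuple.
    """
    # Classify each distinct text once, then reuse the verdict for duplicates.
    verdicts = {text: classify_fn(text)[0] == lang for text in df[text_col].unique()}
    mask = df[text_col].map(verdicts).astype(bool)
    return df[mask]


# Hypothetical stand-in classifier (an assumption for this sketch only).
GERMAN_TEXTS = {"Der Hund bellt."}


def fake_classify(text):
    return ("de", -1.0) if text in GERMAN_TEXTS else ("en", -1.0)


toy = pd.DataFrame(
    {"submission_text": ["The cat sat.", "Der Hund bellt.", "The cat sat."]}
)
english_only = filter_by_language(toy, fake_classify)
print(english_only)
```

Deduplicating before classification matters when many students submit identical text, since the classifier then runs once per distinct text rather than once per row.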

## Save the Sampled Exercises in a CSV File
Saves the sampled exercises to a CSV file for the next steps in the evaluation process.
To reuse a previously saved sample instead of querying the database again, load it with the pandas `read_csv` function (see the commented-out line).

In [None]:
data.to_csv("data/1_exercises/exercises.csv", index=False)
# To load an existing sample instead (requires `import pandas as pd`):
# data = pd.read_csv("data/1_exercises/exercises.csv")
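
One thing to watch for: `to_csv` does not create missing parent directories, so the save step fails if `data/1_exercises/` does not exist yet. The following is a minimal sketch of a save-and-reload round trip that creates the directory first. It uses a temporary directory and a made-up one-row frame so it can run anywhere; in the notebook you would apply the same `mkdir` call to the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up one-row frame standing in for the preprocessed data.
toy = pd.DataFrame({"submission_text": ["An answer."], "result_score": [80.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "1_exercises" / "exercises.csv"

    # Create the parent directories if they are missing.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the DataFrame index out of the file, so reloading
    # does not produce an extra unnamed column.
    toy.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored)
```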