Here we describe how to prepare and make a submission for Python users.  

Disclaimer: the data preprocessing and the model that we train here are very basic. You should not use them as an example of how to properly train a model (e.g. imputing missing values with a mean or mode is often a bad idea). The sole purpose of this script is to make the submission process clearer.  

In this example we assume that you are using [GitHub Desktop](https://docs.github.com/en/desktop), a convenient tool for updating your GitHub repository. [Here](https://github.com/eyra/fertility-prediction-challenge/wiki) you can find useful links on how to use GitHub Desktop to clone your repository and update the files in it.  

We also assume that you have completed all the prerequisite steps described [here](https://github.com/eyra/fertility-prediction-challenge?tab=readme-ov-file#prerequisites){target="_blank"}.

Let's imagine that you want to add one predictor to the model that is already in the repository: highest education level (variable "oplmet_2020", as found in the codebooks). You can edit the scripts that are already in the repository to preprocess the data, then train and save the new model.

Steps:  

1. Go to your cloned folder and open the "training.py" script in the environment that you normally use for Python. This is the script that you should update to produce a model.    

2. Copy and paste the "clean_df" function from the "submission.py" script. Add the variable that you want to include and impute its missing values.   
This is what the function looks like after the change:


In [None]:
# List your libraries and modules here. Don't forget to update environment.yml if you use packages that are not there!
import pandas as pd
from sklearn.linear_model import LogisticRegression
import joblib


def clean_df(df, background_df=None):
    """
    Preprocess the input dataframe to feed the model.
    # If no cleaning is done (e.g. if all the cleaning is done in a pipeline) leave only the "return df" command

    Parameters:
    df (pd.DataFrame): The input dataframe containing the raw data (e.g., from PreFer_train_data.csv or PreFer_fake_data.csv).
    background_df (pd.DataFrame): Optional input dataframe containing background data (e.g., from PreFer_train_background_data.csv or PreFer_fake_background_data.csv).

    Returns:
    pd.DataFrame: The cleaned dataframe with only the necessary columns and processed variables.
    """

    ## This script contains a bare minimum working example
    # Create new variable with age
    df["age"] = 2024 - df["birthyear_bg"]

    # Imputing missing values in age with the mean
    df["age"] = df["age"].fillna(df["age"].mean())
    
    # Imputing missing values in education (oplmet_2020) with the mode
    # mode() returns a Series, so take its first element to get a single value
    df["oplmet_2020"] = df["oplmet_2020"].fillna(df["oplmet_2020"].mode()[0])  # <----- that's what we added!

    # Selecting variables for modelling
    keepcols = [
        "nomem_encr",   # ID variable required for predictions
        "age",          # newly created variable
        "oplmet_2020",  # highest education level (newly added predictor)
    ]

    # Keeping data with variables selected
    df = df[keepcols]

    return df
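
One detail in the imputation step is easy to miss: `Series.mode()` returns a Series (there can be several modes), so passing it straight to `fillna` aligns on the index instead of filling every gap. The sketch below uses a made-up five-row dataframe (the values are invented for illustration and are not real PreFer data) to show the same steps as `clean_df`, imputing education with the first mode via `.mode()[0]`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "nomem_encr": [1, 2, 3, 4, 5],
    "birthyear_bg": [1990, 1985, np.nan, 2000, 1995],
    "oplmet_2020": [3.0, np.nan, 3.0, 5.0, np.nan],
})

# Age with mean imputation, as in clean_df
toy["age"] = 2024 - toy["birthyear_bg"]
toy["age"] = toy["age"].fillna(toy["age"].mean())

# mode() returns a Series; take its first element to get a single value
most_common = toy["oplmet_2020"].mode()[0]
toy["oplmet_2020"] = toy["oplmet_2020"].fillna(most_common)

# Keep only the ID and the predictors
cleaned = toy[["nomem_encr", "age", "oplmet_2020"]]
print(cleaned)
```

Running a quick check like this on the fake data before training is a cheap way to confirm that no missing values remain in the columns you keep.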

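Note: use the same package versions for training as the ones listed in environment.yml, and update environment.yml whenever you add a package or change a version. A model saved with one version of a package is not guaranteed to load correctly with another.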
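
To round off the example, here is a minimal sketch of how cleaned data could be used to fit and save a model with the libraries imported earlier, and how to print the package versions you would want to match in environment.yml. The outcome column name `new_child`, the file name `model.joblib`, and all toy values are assumptions made for illustration; check the `training.py` script in your repository for the names it actually expects.

```python
import os
import tempfile

import joblib
import pandas as pd
import sklearn
from sklearn.linear_model import LogisticRegression

# Toy cleaned data and outcome, invented for illustration only
cleaned = pd.DataFrame({
    "nomem_encr": [1, 2, 3, 4, 5, 6],
    "age": [34.0, 39.0, 31.5, 24.0, 29.0, 45.0],
    "oplmet_2020": [3.0, 3.0, 3.0, 5.0, 3.0, 6.0],
})
outcome = pd.DataFrame({
    "nomem_encr": [1, 2, 3, 4, 5, 6],
    "new_child": [0, 1, 0, 1, 0, 0],  # hypothetical outcome column name
})

# Join predictors and outcome on the ID, then leave the ID out of the features
model_df = cleaned.merge(outcome, on="nomem_encr")
X = model_df[["age", "oplmet_2020"]]
y = model_df["new_child"]

model = LogisticRegression()
model.fit(X, y)

# Save the fitted model, then reload it to confirm the file is usable
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # hypothetical file name
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
predictions = reloaded.predict(X)

# Versions to keep in sync with environment.yml
print("pandas", pd.__version__)
print("scikit-learn", sklearn.__version__)
print("joblib", joblib.__version__)
```

Reloading the saved file right after writing it is a simple sanity check; it does not replace testing the model in the environment defined by environment.yml.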