# Choose an appropriate machine learning algorithm

Prompt: Write code to predict the retention of students. Use an input file called predict.csv and the Logistic Regression machine learning algorithm with default parameters. Train the system on train.csv, which has categorical variables.


To predict student retention using Logistic Regression with default parameters, we will first train the model on a dataset called train.csv, which contains categorical variables. Then, we'll use this trained model to make predictions on the data from predict.csv.

In this example, I'll demonstrate how to handle categorical variables using one-hot encoding before training the model. This step is required because LogisticRegression in scikit-learn expects numeric input, so string-valued categorical columns must be encoded first.

In [8]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Load the training dataset
df_train = pd.read_csv('train.csv', sep=";")

# Separate the features and the target variable
# Replace 'Class' with your actual target column name
X_train = df_train.drop('Class', axis=1)
y_train = df_train['Class']

# Preprocessing and model pipeline
pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),  # One-hot encode categorical variables
    ('logreg', LogisticRegression())  # Logistic Regression model
])

# Train the model
pipeline.fit(X_train, y_train)

# Load the prediction dataset
df_predict = pd.read_csv('predict.csv', sep=";")
print(df_predict.head())

# Predict retention using only the feature columns the model was trained on
predictions = pipeline.predict(df_predict[X_train.columns])

# Output the predictions
# 'Student_ID' is printed only if predict.csv has such an identifier column
df_predict['Predicted_Retention'] = predictions
id_cols = [c for c in ['Student_ID'] if c in df_predict.columns]
print(df_predict[id_cols + ['Predicted_Retention']])

# Save the predictions to a CSV file if needed
df_predict.to_csv('retention_predictions.csv', index=False)


  Matric_Aggregate_Group FTEN Matric_Status_Code   Age_Group     Race Gender   
0              1201_1300    F               DEGR  24_25       African      F  \
1              0000_0000    T               DIPL  29_34       African      M   
2              1001_1050    F               DIPL  18_19       African      M   
3              0000_0000    T               DEGR  22_23       African      M   
4              1101_1150    T               DEGR  26_28        Indian      F   

    Home_Language Employment Magisterial_District         Region  
0  NORTHERN SOTHO    Student               GIYANI  North Eastern  
1        ISIXHOSA   Employed         ALIWAL NORTH   Eastern Cape  
2         ISIZULU   Employed             PINETOWN  KwaZulu-Natal  
3       AFRIKAANS    Student         JOHANNESBURG        Gauteng  
4        SETSWANA   Employed             PHOKWANI        Gauteng  
   Predicted_Retention
0                    1
1                    0
2                    0
3                    1
4 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
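The warning above indicates that the solver hit its iteration limit before converging (the default `max_iter` for `LogisticRegression` is 100). One possible response, sketched here on a tiny synthetic frame rather than on train.csv, is to raise `max_iter`. Note that this departs from the "default parameters" requested in the prompt, so treat it as an optional adjustment.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic stand-in for train.csv (illustrative values only)
X_demo = pd.DataFrame({
    "Gender": ["F", "M", "M", "F", "M", "F"],
    "Employment": ["Student", "Employed", "Student",
                   "Employed", "Employed", "Student"],
})
y_demo = [1, 0, 1, 0, 0, 1]

pipeline_more_iter = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("logreg", LogisticRegression(max_iter=1000)),  # default is 100
])
pipeline_more_iter.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
iterations_used = int(pipeline_more_iter.named_steps["logreg"].n_iter_[0])
print(iterations_used)
```

If `iterations_used` stays below the limit, the solver stopped because it converged, not because it ran out of iterations.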


Replace 'Class' with the actual name of your target variable in the train.csv dataset.

The script assumes that your predict.csv has the same structure (same features) as train.csv. If not, adjust the script to match your data.

If predict.csv includes a target column, make sure to drop it before making predictions.
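The two checks above can be sketched as a small helper. The function name `align_features` and the toy column names are hypothetical, and `'Class'` is assumed to be the target column as in the script.

```python
import pandas as pd

def align_features(df_predict, feature_columns, target="Class"):
    """Return df_predict restricted to the training feature columns, in order."""
    # Drop the target column if it happens to be present
    df = df_predict.drop(columns=[target], errors="ignore")
    # Fail early if a training feature is absent from the prediction data
    missing = [c for c in feature_columns if c not in df.columns]
    if missing:
        raise ValueError(f"predict data is missing columns: {missing}")
    return df[list(feature_columns)]

# Toy example: columns in a different order, plus a stray target column
train_cols = ["Gender", "Employment"]
demo = pd.DataFrame({"Employment": ["Student"], "Gender": ["F"], "Class": [1]})
aligned = align_features(demo, train_cols)
print(list(aligned.columns))
```

With the script above, this would be called as `align_features(df_predict, X_train.columns)` before `pipeline.predict`.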

The script uses a Pipeline from scikit-learn to streamline the process of encoding and model training.

The OneHotEncoder is set to ignore unknown categories that might appear in predict.csv but not in train.csv.
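A minimal sketch of what "ignore" means in practice, using made-up region values: a category that was not seen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative values only; "Limpopo" does not appear in the training frame
train_demo = pd.DataFrame({"Region": ["Gauteng", "Eastern Cape"]})
predict_demo = pd.DataFrame({"Region": ["Gauteng", "Limpopo"]})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_demo)
encoded_predict = encoder.transform(predict_demo).toarray()

# Columns are the sorted training categories: "Eastern Cape", "Gauteng"
print(encoded_predict)
```

The all-zero row means the model receives no information from that column for that student, so frequent unseen categories are worth checking for before trusting the predictions.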

Adjust 'Student_ID' to the actual identifier used in your predict.csv dataset.