# Project 2: DROP OUT CLASSIFIER
(by: Martin Marsal, Benedikt Allmendinger, Christian Diegmann; Heilbronn University, Germany, November 2024)


# Preprocessing Data

- Remove zeros (needs further discussion, since zeros are meaningful for some features)
- Round long floats to two decimal places (done)
- Normalize the data? This speeds up training without losing the relationships in the data
- Remove biased features? No, all of them are important
- Remove outliers
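
The outlier and normalization steps from the list above could be sketched as follows. This is only an illustration on toy data: the column names `grade` and `units`, the helper `iqr_mask`, the 1.5 × IQR rule, and min-max scaling are all assumptions, not decisions the project has made.

```python
import pandas as pd

# Toy data; the column names are hypothetical stand-ins for numeric features.
toy = pd.DataFrame({
    "grade": [12.0, 13.5, 11.0, 14.0, 12.5, 40.0],  # 40.0 is an obvious outlier
    "units": [5, 6, 5, 7, 6, 6],
})


def iqr_mask(series, k=1.5):
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Keep only rows whose grade is not flagged as an outlier
filtered = toy[iqr_mask(toy["grade"])].reset_index(drop=True)

# Min-max normalize every column to the range [0, 1]
normalized = (filtered - filtered.min()) / (filtered.max() - filtered.min())

print(normalized)
```

Scaling mainly matters for distance-based models, whereas decision trees are insensitive to monotonic rescaling of a feature. On the real data, the minimum and maximum would normally be computed on the training split only and then reused for the test split.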

In [None]:
import pandas as pd

In [None]:
# Convert csv file to a pandas DataFrame
df = pd.read_csv('student_data.csv')

# Strip any leading/trailing spaces from column names
df.columns = df.columns.str.strip()

# Rounding to two decimal places
df['Curricular units 2nd sem (grade)'] = df['Curricular units 2nd sem (grade)'].round(2)

# Mapping the target values to level of risk:
df['Target'] = df['Target'].map({'Dropout': 2, 'Enrolled': 1, 'Graduate': 0})

# Print the DataFrame
print(df)

In [None]:
# Shuffle the DataFrame and reset the index
shuffle_df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Calculate the size of the training set (80% of the data)
train_size = int(0.8 * len(shuffle_df))

# Split the DataFrame into training and test sets
train_df = shuffle_df.iloc[:train_size]
test_df = shuffle_df.iloc[train_size:]

# Print the training and test sets
print(train_df)
print(test_df)

# Train at least four machine learning algorithms

## Model 1 probabilistic

In [None]:
# Code Cell

## Model 2 tree based - B

In [None]:

# Identify categorical columns
categorical_cols = train_df.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_cols)

# Combine training and test sets for consistent encoding
n_train = len(train_df)  # remember the split point before train_df is reassigned
combined_df = pd.concat([train_df, test_df], axis=0)

# Apply one-hot encoding
combined_df = pd.get_dummies(combined_df, columns=categorical_cols)

# Split back into training and test sets
train_df = combined_df.iloc[:n_train, :]
test_df = combined_df.iloc[n_train:, :]

# Separate features and target
X_train = train_df.drop('Target', axis=1)
y_train = train_df['Target']
X_test = test_df.drop('Target', axis=1)
y_test = test_df['Target']


In [None]:
# Code Cell
from sklearn.tree import DecisionTreeClassifier

# Initialize the model with a random state for reproducibility
clf = DecisionTreeClassifier(random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Generate classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Generate confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Initialize the model with hyperparameters
clf = DecisionTreeClassifier(random_state=42, max_depth=5, min_samples_split=10)

# Retrain the model
clf.fit(X_train, y_train)

from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.2f}")


## Model 3 distance-based  - M

In [None]:
# Code Cell

## Model 4 Ensemble method - C

In [None]:
# Code Cell

## Discussion

Are all models equally well suited for this task? Discuss your conclusion.

In [None]:
# Code Cell

# Evaluation

Evaluate the four models using k-fold cross-validation and report at least the accuracy (mean and standard deviation) and
the confusion matrix for each trained model. Is one of the models significantly better than the others?
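
One possible shape for this evaluation is sketched below. It runs on synthetic data from `make_classification` so that it is self-contained. The four estimators are assumed stand-ins for the probabilistic, tree-based, distance-based, and ensemble slots, not the models the team has chosen, and the fold count of 5 is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the student data set (assumption)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# One assumed candidate per model family
models = {
    "probabilistic": GaussianNB(),
    "tree based": DecisionTreeClassifier(random_state=42),
    "distance-based": KNeighborsClassifier(),
    "ensemble": RandomForestClassifier(random_state=42),
}

# Same stratified folds for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Out-of-fold predictions give one confusion matrix over all samples
    y_oof = cross_val_predict(model, X, y, cv=cv)
    results[name] = (scores.mean(), scores.std(), confusion_matrix(y, y_oof))
    print(f"{name}: accuracy {scores.mean():.2f} +/- {scores.std():.2f}")
    print(results[name][2])
```

Comparing the gaps between the mean accuracies with the per-model standard deviations gives a first impression of whether a difference is meaningful. A formal significance test is not part of this sketch.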

In [None]:
# Code Cell

## Model 1 probabilistic

In [None]:
# Code Cell

## Model 2 tree based - B

In [None]:
# Code Cell

## Model 3 distance-based  - M

In [None]:
# Code Cell

## Model 4 Ensemble method - C

In [None]:
# Code Cell

# Pick your favorite model. 
Which features were most relevant for the students’ success?
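
If the favorite model is tree-based, its impurity-based `feature_importances_` attribute is one way to approach this question. The sketch below uses synthetic data and hypothetical feature names (`feature_0` to `feature_5`). A random forest is assumed purely for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative (assumption)
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each importance with its feature name and sort, most relevant first
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

For models without a `feature_importances_` attribute, `sklearn.inspection.permutation_importance` is a model-agnostic alternative.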


In [None]:
# Code Cell

# Save your favorite model as pickle-file with https://scikit-learn.org/stable/model_persistence.html. Call the file “best_model.pkl”.


The submission consists of two files:
1. A Jupyter Notebook containing the preprocessing, the training, and the evaluation of
your models.
2. A pickle-file “best_model.pkl”
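
Saving and reloading the model with Python's `pickle` module might look like the sketch below. The decision tree trained on synthetic data is only a placeholder for whichever model is finally chosen. The file name `best_model.pkl` follows the task description.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder model trained on synthetic data (assumption)
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
best_model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Serialize the fitted model to disk
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Load it again and check that the predictions are unchanged
with open("best_model.pkl", "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(X) == best_model.predict(X)).all())
print("Predictions identical after reload:", same)
```

A pickle file should be loaded with the same scikit-learn version that wrote it, since compatibility across versions is not guaranteed.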

In [None]:
# Code Cell