# Loading and Previewing the Dataset

This is the first step in our course recommender machine learning project. We use the `pandas` library, a widely used tool for data manipulation and analysis in Python, to load the dataset and prepare it for further analysis. We read a CSV file into a DataFrame, make a copy of the data so the original stays untouched, and preview the first few rows to get an initial understanding of its structure.

In [2]:
import pandas as pd  # Import the pandas library for data manipulation

# Load the dataset from a CSV file into a DataFrame
df1 = pd.read_csv("student-scores.csv")

# Create a copy of the DataFrame to preserve the original data
df = df1.copy()

# Display the first 5 rows of the DataFrame to understand the data structure
df.head()

Unnamed: 0,id,first_name,last_name,email,gender,extracurricular_activities,riasec,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score
0,1,Paul,Casey,paul.casey.1@gslingacademy.com,male,False,27,Bachelor of Statistics,73,81,93,97,63,80,87
1,2,Danielle,Sandoval,danielle.sandoval.2@gslingacademy.com,female,False,40,Bachelor of Supply Chain Management,90,86,96,100,90,88,90
2,3,Tina,Andrews,tina.andrews.3@gslingacademy.com,female,True,30,Bachelor of Corporate Communications,81,97,95,96,65,77,94
3,4,Tara,Clark,tara.clark.4@gslingacademy.com,female,False,40,Bachelor of Human Resouce Management,71,74,88,80,89,63,86
4,5,Anthony,Campos,anthony.campos.5@gslingacademy.com,male,False,25,Bachelor of Development Studies,84,77,65,65,80,74,76
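
The reason for working on a copy can be illustrated with a tiny stand-in dataset. The sketch below assumes a two-row CSV held in memory (the column names mirror the real file, the values are only illustrative) and shows that edits to the copy do not reach the original DataFrame.

```python
import io
import pandas as pd

# Illustrative stand-in for student-scores.csv (assumed values, not the real file)
csv_text = "id,math_score,history_score\n1,73,81\n2,90,86\n"

raw = pd.read_csv(io.StringIO(csv_text))  # original, left untouched
work = raw.copy()                         # independent working copy

# Modify only the working copy
work["math_score"] = work["math_score"] + 5

print(raw["math_score"].tolist())   # original values
print(work["math_score"].tolist())  # edited values
```

Keeping `raw` around makes it possible to restart the cleaning steps without reading the file again.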


# Understanding and Cleaning the Dataset

We inspect the column names to understand the dataset's structure, then clean the data by removing irrelevant columns. Identifier columns such as names and email addresses carry no signal for the recommendation, so dropping them lets us focus on the features that matter to the model.

In [24]:
# Display the column names of the DataFrame to understand its structure
df.columns

# Drop identifier columns that do not contribute to the course recommendation process.
# errors="ignore" keeps the cell safe to re-run: without it, running the cell a
# second time raises a KeyError because the columns are already gone.
df.drop(
    columns=["id", "first_name", "last_name", "email"], inplace=True, errors="ignore"
)

# Feature Engineering

## Calculating Total and Average Scores

This step in our course recommender project focuses on feature engineering, where we create new features from the existing data. Here, we calculate the total and average scores for each student across various subjects. These new features will be crucial in determining the student's overall performance, which can then be used to recommend appropriate courses.

In [4]:
# Calculate the total score by summing the scores from all subjects
df["total_score"] = (
    df["math_score"]
    + df["history_score"]
    + df["physics_score"]
    + df["chemistry_score"]
    + df["biology_score"]
    + df["english_score"]
    + df["geography_score"]
)

# Calculate the average score by dividing the total score by the number of subjects
df["average_score"] = df["total_score"] / 7

# Display the first 5 rows of the DataFrame to verify the new columns
df.head()

Unnamed: 0,gender,extracurricular_activities,riasec,career_aspiration,math_score,history_score,physics_score,chemistry_score,biology_score,english_score,geography_score,total_score,average_score
0,male,False,27,Bachelor of Statistics,73,81,93,97,63,80,87,574,82.0
1,female,False,40,Bachelor of Supply Chain Management,90,86,96,100,90,88,90,640,91.428571
2,female,True,30,Bachelor of Corporate Communications,81,97,95,96,65,77,94,605,86.428571
3,female,False,40,Bachelor of Human Resouce Management,71,74,88,80,89,63,86,551,78.714286
4,male,False,25,Bachelor of Development Studies,84,77,65,65,80,74,76,521,74.428571
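
The same totals can also be computed without spelling out each addition. The sketch below is one possible alternative, not the approach used in the cell above: it builds the list of score columns from an assumed list of subject names and uses `sum(axis=1)` and `mean(axis=1)`. The two rows are copied from the preview above.

```python
import pandas as pd

# Two illustrative rows copied from the preview above
scores = pd.DataFrame({
    "math_score": [73, 90],
    "history_score": [81, 86],
    "physics_score": [93, 96],
    "chemistry_score": [97, 100],
    "biology_score": [63, 90],
    "english_score": [80, 88],
    "geography_score": [87, 90],
})

# Explicit subject list, so derived columns are never summed by accident
subjects = ["math", "history", "physics", "chemistry", "biology", "english", "geography"]
score_cols = [f"{s}_score" for s in subjects]

scores["total_score"] = scores[score_cols].sum(axis=1)      # row-wise sum
scores["average_score"] = scores[score_cols].mean(axis=1)   # row-wise mean

print(scores[["total_score", "average_score"]])
```

Using `mean(axis=1)` avoids hard-coding the divisor 7, which would silently become wrong if a subject were added or removed.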


# Encoding Categorical Variables for Model Training

In this step, we encode categorical variables into numerical values, which is a necessary preprocessing step for most machine learning algorithms. Categorical variables such as gender, extracurricular activities, and career aspirations are mapped to integers, making the data suitable for the machine learning model to process.

In [5]:
# from sklearn.preprocessing import LabelEncoder

gender_map = {'male': 0, 'female': 1}
extracurricular_activities_map = {False: 0, True: 1}
career_aspiration_map = {
    "Bachelor of Statistics": 0,
    "Bachelor of Supply Chain Management": 1,
    "Bachelor of Corporate Communications": 2,
    "Bachelor of Human Resouce Management": 3,
    "Bachelor of Development Studies": 4,
    "Bachelor of Procurement and Contract Management": 5,
    "Bachelor of Project Management": 6,
    "Bachelor of Business Administration": 7,
    "Bachelor of Journalism": 8,
    "Bachelor of Business and Office Management": 9,
    "Bachelor of Economics and Statistics": 10,
    "Bachelor of Mass Communication": 11,
    "Bachelor of Commerce": 12,
    "Bachelor of Procurement and Logistics": 13,
    "Bachelor of Finance": 14,
    "Bachelor of Business Information Technology": 15,
    "Bachelor of Technology and Entrepreneurship Management": 16,
}
# Apply mapping to the DataFrame
df['gender'] = df['gender'].map(gender_map)
df['extracurricular_activities'] = df['extracurricular_activities'].map(extracurricular_activities_map)
df['career_aspiration'] = df['career_aspiration'].map(career_aspiration_map)
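
One thing to watch with dictionary-based encoding is that `Series.map` returns `NaN` for any value missing from the dictionary. The sketch below uses an illustrative column (assumed values) with a differently capitalized entry to show how such unmapped values could be detected before training.

```python
import pandas as pd

gender_map = {"male": 0, "female": 1}

# Illustrative column with one value the mapping does not cover
gender = pd.Series(["male", "female", "Female"])

encoded = gender.map(gender_map)

# .map() yields NaN for any value missing from the dictionary
unmapped = gender[encoded.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list is a quick confirmation that every category received a code.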

# Balance Dataset

Before applying any balancing technique, it is important to understand the distribution of the target classes. If certain career aspirations are underrepresented, the model may become biased toward the more frequent classes.

To address this, we use the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples for the minority classes to produce a more balanced dataset. With balanced classes, the model has more opportunity to learn the less common career aspirations instead of defaulting to the frequent ones.

In [6]:
# Display the unique values in the 'career_aspiration' column to understand the diversity of career goals
df["career_aspiration"].unique()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  9, 10, 11, 12, 15, 13, 16,  8, 14],
      dtype=int64)

Our dataset includes 17 unique career aspirations (encoded 0 through 16), each representing a distinct course. The recommendations generated by our model will therefore be drawn from these 17 courses.

In [7]:
# Count the number of instances for each unique value in 'career_aspiration' to identify any class imbalance
df["career_aspiration"].value_counts()

career_aspiration
5     315
7     309
4     223
10    169
0     138
12    126
1     119
14     83
8      73
13     68
3      67
16     63
2      61
6      59
15     56
9      39
11     32
Name: count, dtype: int64

The distribution of career aspirations in our dataset reveals a significant imbalance among the various categories. 

- Career Aspirations 5 and 7 are the most prevalent, with 315 and 309 occurrences, respectively. This suggests that these career paths are the most popular or preferred among students.

- Career Aspirations 11 and 9 are the least frequent, with only 32 and 39 occurrences each. This indicates that these career paths are less common and might be underrepresented in the dataset.

- The other career aspirations fall somewhere in between, with varying counts that further highlight the disparity. For instance, aspirations like 4, 10, and 0 have counts of 223, 169, and 138, respectively, while aspirations like 16, 15, and 2 have counts as low as 63, 56, and 61.

This imbalance could lead to challenges in model performance, as machine learning models tend to favor the majority classes, potentially neglecting the minority ones. To address this, it is crucial to apply balancing techniques, such as SMOTE, to ensure that each career aspiration is fairly represented in the training data. This will help in creating a model that can provide accurate and equitable recommendations across all career paths.
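
The size of the imbalance can be quantified directly from the counts. The sketch below copies the counts from the `value_counts()` output above and assumes that the resampler raises every class to the majority count, which is what SMOTE's default sampling strategy is understood to do.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series(
    [315, 309, 223, 169, 138, 126, 119, 83, 73, 68, 67, 63, 61, 59, 56, 39, 32],
    index=[5, 7, 4, 10, 0, 12, 1, 14, 8, 13, 3, 16, 2, 6, 15, 9, 11],
)

total_rows = int(counts.sum())
imbalance_ratio = counts.max() / counts.min()  # largest class vs. smallest class

# Assumption: every class is raised to the majority count
balanced_rows = int(counts.max()) * len(counts)
synthetic_rows = balanced_rows - total_rows

print(total_rows, round(imbalance_ratio, 2), balanced_rows, synthetic_rows)
```

Under that assumption the balanced dataset has 5,355 rows, and an 80% training share of 5,355 is 4,284, which agrees with the `X_train.shape` reported later.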

We will apply SMOTE to the career aspirations in the next cell.

In [8]:
# Import the SMOTE class from the imbalanced-learn library
from imblearn.over_sampling import SMOTE

# Create a SMOTE object with a set random state for reproducibility
smote = SMOTE(random_state=42)

# Separate the features (X) and the target variable (y)
X = df.drop(
    "career_aspiration", axis=1
)  # Features are all columns except 'career_aspiration'
y = df["career_aspiration"]  # Target variable is the 'career_aspiration' column

# Apply SMOTE to balance the dataset by oversampling the minority classes
X_resampled, y_resampled = smote.fit_resample(X, y)
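
To make concrete what a "synthetic sample" is, the sketch below reproduces the core interpolation idea in plain NumPy. It is a simplification and not imbalanced-learn's implementation: the helper `synthesize` is hypothetical, the points are illustrative, and it always uses the single nearest neighbor, whereas SMOTE picks at random among several nearest neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative minority-class points in a 2-D feature space (assumed values)
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbor."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the base point to every minority point
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbor = points[np.argmin(dists)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return base + lam * (neighbor - base)

new_point = synthesize(minority, rng)
print(new_point)
```

Because the new point lies on the segment between two real minority samples, it stays inside the region the minority class already occupies.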

In [9]:
print("Number of NaNs in y:", y.isna().sum())

Number of NaNs in y: 0


# Train-Test Split

We divide our balanced dataset into training and testing sets. The training set is used to train the model, while the testing set is used to assess how well the model generalizes to new, unseen examples.

In [10]:
# Import the train_test_split function from sklearn to split the dataset
from sklearn.model_selection import train_test_split

# Split the balanced dataset into training and testing sets
# 'test_size=0.2' means 20% of the data will be used for testing, and 80% for training
# 'random_state=42' ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

In [11]:
# Check the shape of the training feature set
X_train.shape

(4284, 12)
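
The split above relies on random sampling alone. As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps class proportions identical in both parts. The sketch below uses illustrative data (two balanced classes, assumed values) to show the effect; it is not part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative balanced data: 50 rows for each of two classes (assumed)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # per-class counts in the test set
```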

# Feature Scaling

Feature scaling standardizes the range of independent variables or features. This process ensures that each feature contributes equally to the model, improving convergence and performance, particularly for algorithms that are sensitive to feature scaling, like support vector machines or gradient-based methods.

In [12]:
# Import the StandardScaler class from sklearn
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the training data and transform both training and testing data
# 'fit_transform' calculates the mean and standard deviation from the training data and scales it
X_train_scaled = scaler.fit_transform(X_train)
# 'transform' scales the testing data based on the mean and standard deviation of the training data
X_test_scaled = scaler.transform(X_test)

In [13]:
X_train_scaled.shape

(4284, 12)
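
What `transform` does to the test data can be checked by hand. The sketch below uses small illustrative arrays (assumed values) and compares `StandardScaler` with the manual formula: subtract the training mean, then divide by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative train/test features (assumed values)
train = np.array([[60.0, 500.0], [80.0, 600.0], [100.0, 700.0]])
test = np.array([[80.0, 650.0]])

demo_scaler = StandardScaler().fit(train)  # statistics come from the training data only

# Manual equivalent using the training mean and (population) standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(demo_scaler.transform(test))
print(manual)
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.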

# Model Training and Evaluation

We will train and evaluate several machine learning models to determine which one best fits our course recommender system. The selected models are Logistic Regression, Support Vector Classifier (SVC), Random Forest Classifier, Gradient Boosting Classifier, and XGBoost Classifier. 

## Logistic Regression

### How It Works
Logistic Regression models the probability of a binary outcome based on one or more predictor variables. For multi-class classification, it extends to multinomial logistic regression.

#### Pros
- Simple and interpretable.
- Effective for linearly separable data.
- Provides probabilities for class memberships.

#### Cons
- Assumes linear relationships between features and the target variable.
- May not perform well with complex or non-linear data.
- Can be sensitive to feature scaling.
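
For the multi-class case described under "How It Works", one common formulation turns a linear score per class into probabilities with the softmax function. The sketch below shows only that step, with illustrative scores (assumed values); it is not scikit-learn's internal code.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Illustrative linear scores for three classes (assumed values)
class_scores = np.array([2.0, 1.0, 0.1])
probs = softmax(class_scores)

print(probs, probs.argmax())
```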

## Support Vector Classifier (SVC)

### How It Works
SVC finds the hyperplane that best separates the data into different classes. It uses kernel functions to handle non-linear data by transforming it into higher dimensions.

#### Pros
- Effective in high-dimensional spaces.
- Works well for both linear and non-linear problems (with the appropriate kernel).

#### Cons
- Computationally intensive, especially with large datasets.
- Requires careful tuning of parameters like the kernel and regularization.
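
As an illustration of a kernel function, the sketch below evaluates the RBF kernel, which is the default kernel of scikit-learn's `SVC`. The helper name `rbf_kernel_value` and the `gamma` value are arbitrary choices for this example.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([1.0, 2.0])
x3 = np.array([4.0, 6.0])

same = rbf_kernel_value(x1, x2)  # identical points
far = rbf_kernel_value(x1, x3)   # distant points
print(same, far)
```

The kernel acts as a similarity score: it equals 1 for identical points and decays toward 0 as points move apart, with `gamma` controlling how quickly.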

## Random Forest Classifier

### How It Works
 
Random Forest is an ensemble method that builds multiple decision trees and merges their predictions. It uses bagging (bootstrap aggregating) to improve performance and control overfitting.

#### Pros
- Handles both numerical and categorical features.
- Robust to overfitting and provides feature importance scores.
- Can manage imbalanced datasets effectively.

#### Cons
- Less interpretable due to the complexity of multiple trees.
- Can be computationally expensive with a large number of trees.
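
The two ingredients of bagging, bootstrap sampling and combining the trees' outputs, can be sketched in a few lines of NumPy. The votes below are illustrative (assumed values), and the hard majority vote is a simplification: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap sample: draw row indices with replacement, same size as the data
n_rows = 10
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Illustrative predictions from five trees for one student (assumed values)
tree_votes = np.array([3, 7, 3, 3, 5])

# Majority vote across the trees
labels, vote_counts = np.unique(tree_votes, return_counts=True)
majority = labels[np.argmax(vote_counts)]

print(bootstrap_idx, majority)
```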

## Gradient Boosting Classifier

### How It Works
 
Gradient Boosting builds models sequentially, with each new model correcting the errors of the previous ones. It uses gradient descent to minimize the loss function.

#### Pros
- Often achieves high predictive accuracy.
- Handles various types of data and relationships well.

#### Cons
- Prone to overfitting if not tuned properly.
- Computationally expensive and requires careful parameter tuning.
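
The sequential error-correcting idea is easiest to see with squared loss on a regression problem, where the negative gradient is simply the residual. The sketch below is a toy version under that assumption (illustrative data, depth-1 trees, an arbitrary learning rate), not the classifier used in this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression data (assumed values)
X_demo = np.linspace(0, 6, 40).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel()

learning_rate = 0.5
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
errors = [np.mean((y_demo - prediction) ** 2)]

for _ in range(5):
    residuals = y_demo - prediction  # negative gradient of the squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, residuals)
    prediction += learning_rate * stump.predict(X_demo)  # small corrective step
    errors.append(np.mean((y_demo - prediction) ** 2))

print([round(e, 4) for e in errors])
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks step by step.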

## XGBoost Classifier

### How It Works
XGBoost is a gradient boosting algorithm that incorporates advanced features like regularization and parallel processing. It builds an ensemble of trees in a boosting framework.

#### Pros
- Known for its speed and performance.
- Handles missing values and imbalanced data effectively.
- Offers various hyperparameters for fine-tuning.

#### Cons
- Requires careful tuning of hyperparameters.
- Complexity can make it less interpretable.

Each of these models has its strengths and weaknesses. By evaluating them using accuracy, classification reports, and confusion matrices, we can determine which model best balances performance and computational efficiency for our course recommender system. The goal is to choose a model that provides accurate and reliable recommendations, handling the complexity and diversity of student preferences effectively.
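
To show how the evaluation metrics are read, the sketch below computes accuracy and a confusion matrix for five illustrative predictions across three classes (assumed values, unrelated to the real results).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels for five students across three classes (assumed values)
y_true = [0, 1, 2, 2, 1]
y_hat = [0, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)  # rows: actual class, columns: predicted class

print(acc)
print(cm)
```

The off-diagonal entry in row 1, column 2 records the single mistake: one student whose actual class is 1 was predicted as class 2.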

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# Define models
models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classifier": SVC(),
    "Random Forest Classifier": RandomForestClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "XGBoost Classifier": XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
}

# Train and evaluate each model
for name, model in models.items():
    print("="*50)
    print("Model:", name)
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Predict on test set
    y_pred = model.predict(X_test_scaled)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Print metrics
    print("Accuracy:", accuracy)
    print("Classification Report:\n", classification_rep)
    print("Confusion Matrix:\n", conf_matrix)


Model: Logistic Regression
Accuracy: 0.34827264239028943
Classification Report:
               precision    recall  f1-score   support

           0       0.30      0.25      0.27        68
           1       0.40      0.71      0.51        72
           2       0.22      0.12      0.16        57
           3       0.24      0.16      0.19        58
           4       0.33      0.09      0.14        66
           5       0.29      0.25      0.27        76
           6       0.40      0.62      0.49        71
           7       0.47      0.46      0.46        61
           8       0.24      0.33      0.28        49
           9       0.33      0.37      0.35        63
          10       0.22      0.11      0.15        64
          11       0.35      0.66      0.46        50
          12       0.41      0.41      0.41        69
          13       0.35      0.56      0.43        55
          14       0.36      0.16      0.22        62
          15       0.12      0.05      0.07        65


# Summary of the Model Performance Results

## Logistic Regression

- Accuracy: 34.83%
- Classification Report:
  - Precision: Ranges from 0.12 to 0.47 across classes.
  - Recall: Ranges from 0.05 to 0.71 across classes.
  - F1-Score: Ranges from 0.07 to 0.54 across classes.
- Confusion Matrix: Indicates significant misclassification across most classes, with better performance for class 1 and class 6.

## Support Vector Classifier (SVC)
- Accuracy: 55.09%
- Classification Report:
  - Precision: Ranges from 0.36 to 0.90 across classes.
  - Recall: Ranges from 0.28 to 0.89 across classes.
  - F1-Score: Ranges from 0.32 to 0.74 across classes.
- Confusion Matrix: Shows improved performance over Logistic Regression, particularly for classes 1, 6, and 7.

## Random Forest Classifier
- Accuracy: 79.27%
- Classification Report:
  - Precision: Ranges from 0.47 to 0.98 across classes.
  - Recall: Ranges from 0.21 to 0.92 across classes.
  - F1-Score: Ranges from 0.29 to 0.90 across classes.
- Confusion Matrix: Demonstrates strong performance across most classes, especially for classes 1, 6, and 7.

## Gradient Boosting Classifier
- Accuracy: 66.85%
- Classification Report:
  - Precision: Ranges from 0.43 to 0.98 across classes.
  - Recall: Ranges from 0.15 to 0.89 across classes.
  - F1-Score: Ranges from 0.22 to 0.90 across classes.
- Confusion Matrix: Generally better than Logistic Regression and comparable to SVC, with particularly strong performance for classes 1, 7, and 9.

### Summary
- Random Forest Classifier shows the best overall accuracy (79.27%) and the most balanced performance across classes, making it the most robust model in this comparison.
- Gradient Boosting Classifier comes second at 66.85%, a substantial improvement over Logistic Regression.
- Support Vector Classifier reaches 55.09%, ahead of Logistic Regression but less consistent across classes than Random Forest and Gradient Boosting.

Given these results, the Random Forest Classifier is the most reliable model for our task. We will tune it further to see whether its performance can be improved.

In [25]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    # "auto" was removed in scikit-learn 1.3; None (all features) takes its slot
    "max_features": [None, "sqrt", "log2"],
}

rf = RandomForestClassifier()
grid_search = GridSearchCV(
    estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2
)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best parameters found:  {'max_depth': 30, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
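
The "324 candidates" and "1620 fits" in the log follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below checks this with scikit-learn's `ParameterGrid`; only the list lengths matter here, so the `max_features` entries are placeholders.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],  # placeholder values, three options
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 4 * 3 * 3 * 3
n_fits = n_candidates * 5                     # cv=5 folds per candidate

print(n_candidates, n_fits)
```

Because the count grows multiplicatively, adding one more value to any list increases the search time noticeably.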


In [42]:
# Create the model with the best parameters found by the grid search
model = RandomForestClassifier(
    max_depth=30,
    max_features="log2",
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=200,
)

# Train the model (unscaled features, matching the grid search above)
model.fit(X_train, y_train)

# Evaluate the model
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.70      0.75      0.72        68
           1       0.86      0.94      0.90        72
           2       0.89      0.86      0.88        57
           3       0.85      0.90      0.87        58
           4       0.74      0.39      0.51        66
           5       0.39      0.21      0.27        76
           6       0.82      0.92      0.87        71
           7       0.90      0.74      0.81        61
           8       0.78      0.94      0.85        49
           9       0.81      0.95      0.88        63
          10       0.63      0.70      0.67        64
          11       0.82      0.98      0.89        50
          12       0.86      0.72      0.79        69
          13       0.69      0.98      0.81        55
          14       0.83      0.77      0.80        62
          15       0.86      0.91      0.88        65
          16       0.77      0.85      0.81        65

    accuracy              

# Untuned Model

In [44]:
model = RandomForestClassifier()

model.fit(X_train_scaled, y_train)
# Predict on test set
y_pred = model.predict(X_test_scaled)

# Calculate metrics
print("Accuracy: ",accuracy_score(y_test, y_pred))
print("Report: ",classification_report(y_test, y_pred))
print("Confusion Matrix: ",confusion_matrix(y_test, y_pred))

Accuracy:  0.7889822595704948
Report:                precision    recall  f1-score   support

           0       0.75      0.78      0.76        68
           1       0.79      0.94      0.86        72
           2       0.84      0.84      0.84        57
           3       0.85      0.86      0.85        58
           4       0.73      0.41      0.52        66
           5       0.45      0.24      0.31        76
           6       0.85      0.93      0.89        71
           7       0.91      0.70      0.80        61
           8       0.78      0.96      0.86        49
           9       0.83      0.95      0.89        63
          10       0.66      0.75      0.70        64
          11       0.86      1.00      0.93        50
          12       0.89      0.71      0.79        69
          13       0.70      1.00      0.82        55
          14       0.83      0.79      0.81        62
          15       0.84      0.86      0.85        65
          16       0.79      0.89      0.8

# Single Input Predictions

In [47]:
# Test sample index
test_index = 10

# Print actual label and model prediction
print(f"Actual Label: {y_test.iloc[test_index]}")
prediction = model.predict(X_test_scaled[test_index].reshape(1, -1))[0]
print(f"Model Prediction: {prediction}")

# Check if prediction matches the actual label
if y_test.iloc[test_index] == prediction:
    print("The model's prediction is accurate.")
else:
    print("The model's prediction is inaccurate.")

Actual Label: 12
Model Prediction: 12
The model's prediction is accurate.


In [37]:
# Test 2
test_index = 300
prediction = model.predict(X_test_scaled[test_index].reshape(1, -1))[0]
print("Actual Label :", y_test.iloc[test_index])
print("Model Prediction :", prediction)
# Compare the label and the prediction for the same sample
if y_test.iloc[test_index] == prediction:
    print("The model's prediction is accurate.")
else:
    print("The model's prediction is inaccurate.")

Actual Label : 0
Model Prediction : 0
The model's prediction is accurate.


In [18]:
# Test 3
test_index = 23
prediction = model.predict(X_test_scaled[test_index].reshape(1, -1))[0]
print("Actual Label :", y_test.iloc[test_index])
print("Model Prediction :", prediction)
# Compare the label and the prediction for the same sample
if y_test.iloc[test_index] == prediction:
    print("The model's prediction is accurate.")
else:
    print("The model's prediction is inaccurate.")

Actual Label : 3
Model Prediction : 3
The model's prediction is accurate.


# Saving and Loading Files

We save the trained scaler and model using the `pickle` module so they can be loaded and reused for future predictions. We chose pickle because it serializes Python objects directly, preserving the fitted scaling statistics and the trained model parameters. This keeps predictions consistent, since the same scaling and model configuration are applied to any new data.

In [19]:

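import os
import pickle

# Make sure the output directory exists before writing to it
os.makedirs("Models", exist_ok=True)

# Save the scaler and the model
with open("Models/scaler.pkl", "wb") as scaler_file:
    pickle.dump(scaler, scaler_file)
with open("Models/model.pkl", "wb") as model_file:
    pickle.dump(model, model_file)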
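
A quick round-trip check can confirm that a restored object behaves like the one that was saved. The sketch below uses an illustrative scaler (assumed data) and serializes to bytes in memory with `pickle.dumps` and `pickle.loads`, so it does not touch the files written above.

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative fitted scaler (assumed data)
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
original = StandardScaler().fit(data)

# Serialize to bytes and restore, mirroring a save/load cycle without using disk
restored = pickle.loads(pickle.dumps(original))

same = np.allclose(original.transform(data), restored.transform(data))
print(same)
```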

# Recommendation System

This script loads the trained Random Forest model and the scaler, both saved using pickle, to recommend suitable university courses for a student based on the input features. The `Recommendations` function encodes the categorical inputs, scales the feature vector with the saved scaler, and uses the model to return the five most probable courses with their probabilities. Reusing the saved scaler keeps the inputs on the same scale the model saw during training. Because the ranking relies on class probabilities, the script first checks that the loaded model exposes `predict_proba`.


In [20]:
import pickle
import numpy as np

# Load the scaler and the model
with open("Models/scaler.pkl", "rb") as scaler_file:
    scaler = pickle.load(scaler_file)
with open("Models/model.pkl", "rb") as model_file:
    model = pickle.load(model_file)

# Verify the model type
print(
    type(model)
)  # Should output something like <class 'sklearn.ensemble._forest.RandomForestClassifier'>

# Check if model supports predict_proba
if hasattr(model, "predict_proba"):
    print("Model supports predict_proba")
else:
    print("Model does NOT support predict_proba")
    
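# Order must match the integer codes in career_aspiration_map
# (0 = Statistics, 1 = Supply Chain Management, ...)
class_names = [
    "Bachelor of Statistics",
    "Bachelor of Supply Chain Management",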
    "Bachelor of Corporate Communications",
    "Bachelor of Human Resouce Management",
    "Bachelor of Development Studies",
    "Bachelor of Procurement and Contract Management",
    "Bachelor of Project Management",
    "Bachelor of Business Administration",
    "Bachelor of Journalism",
    "Bachelor of Business and Office Management",
    "Bachelor of Economics and Statistics",
    "Bachelor of Mass Communication",
    "Bachelor of Commerce",
    "Bachelor of Procurement and Logistics",
    "Bachelor of Finance",
    "Bachelor of Business Information Technology",
    "Bachelor of Technology and Entrepreneurship Management",
]

def Recommendations(gender, extracurricular_activities,
                    riasec, math_score, history_score, physics_score,
                    chemistry_score, biology_score, english_score, geography_score,
                    total_score,average_score):
    
    # Encode categorical variables
    gender_encoded = 1 if gender.lower() == 'female' else 0
    extracurricular_activities_encoded = 1 if extracurricular_activities else 0
    
    # Create feature array
    feature_array = np.array([[gender_encoded, extracurricular_activities_encoded,
                               riasec, math_score, history_score, physics_score,
                               chemistry_score, biology_score, english_score, geography_score,total_score,average_score]])
    
    # Scale features
    scaled_features = scaler.transform(feature_array)
    
    # Predict using the model
    probabilities = model.predict_proba(scaled_features)
    
    # Get top five predicted classes along with their probabilities
    top_classes_idx = np.argsort(-probabilities[0])[:5]
    top_classes_names_probs = [(class_names[idx], probabilities[0][idx]) for idx in top_classes_idx]
    
    return top_classes_names_probs

<class 'sklearn.ensemble._forest.RandomForestClassifier'>
Model supports predict_proba
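
The ranking step inside `Recommendations` indexes `class_names` by column position, which works as long as the model's classes are exactly 0 through 16 in order. A slightly more defensive variant looks up each column's label through the model's `classes_` attribute. The sketch below shows the idea with illustrative values; the helper `top_k_labels` and the demo names are hypothetical, and `demo_classes` stands in for `model.classes_`.

```python
import numpy as np

def top_k_labels(probabilities, classes, names, k=5):
    """Return (name, probability) pairs for the k most probable classes."""
    order = np.argsort(-probabilities)[:k]  # column indices, most probable first
    return [(names[int(classes[i])], float(probabilities[i])) for i in order]

# Illustrative values (assumed): three classes encoded as 0, 1, 2
demo_probs = np.array([0.2, 0.5, 0.3])
demo_classes = np.array([0, 1, 2])  # stands in for model.classes_
demo_names = {0: "Course A", 1: "Course B", 2: "Course C"}

top_two = top_k_labels(demo_probs, demo_classes, demo_names, k=2)
print(top_two)
```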


In [48]:
# Example usage 1
final_recommendations = Recommendations(gender='female',
                                        extracurricular_activities=False,
                                        riasec=32,
                                        math_score=65,
                                        history_score=60,
                                        physics_score=97,
                                        chemistry_score=94,
                                        biology_score=71,
                                        english_score=81,
                                        geography_score=66,
                                        total_score=534,
                                        average_score=76.285714)

print("Top recommended studies with probabilities:")
print("="*50)
for class_name, probability in final_recommendations:
    print(f"{class_name} with probability {probability}")

Top recommended studies with probabilities:
Bachelor of Project Management with probability 0.54
Bachelor of Development Studies with probability 0.2
Bachelor of Business Administration with probability 0.09
Bachelor of Business Information Technology with probability 0.04
Bachelor of Finance with probability 0.03


In [22]:
# Example usage 2
final_recommendations = Recommendations(gender='female',
                                        extracurricular_activities=False,
                                        riasec=40,
                                        math_score=87,
                                        history_score=73,
                                        physics_score=67,
                                        chemistry_score=91,
                                        biology_score=79,
                                        english_score=60,
                                        geography_score=77,
                                        total_score=583,
                                        average_score=83.285714)

print("Top recommended studies with probabilities:")
print("="*50)
for class_name, probability in final_recommendations:
    print(f"{class_name} with probability {probability}")


Top recommended studies with probabilities:
Bachelor of Human Resouce Management with probability 0.44
Bachelor of Business Administration with probability 0.2
Bachelor of Finance with probability 0.1
Bachelor of Corporate Communications with probability 0.06
Bachelor of Project Management with probability 0.06


In [49]:
# scikit-learn version (for the PyCharm production environment)
import sklearn
print(sklearn.__version__)
# In the PyCharm environment, install:
# pip install scikit-learn==1.3.2

1.4.1.post1
