# Supervised Algorithms Group10 (career recommendations)

Our dataset involves career prediction based on historical job titles and skills. This is a classification problem where each data point (skills and job titles) is mapped to a future job title. The choice of Support Vector Machine (SVM) and Random Forest (RF) is justified based on the following factors:

## 1. Support Vector Machine (SVM)


#### -Handles High-Dimensional Text Data Well: 
Since we are using TF-IDF vectorization, the features are sparse and high-dimensional. SVM works well in such cases because it finds a hyperplane that maximizes class separation.
#### -Good for Small to Medium Datasets: 
Our dataset consists of 2000 rows, which makes it a small dataset. SVM is known to work well even with limited training data, so it is a suitable choice here.
#### -Robust to Overfitting (with Linear Kernel):
Since our features are textual representations, using a linear kernel reduces the risk of overfitting while keeping the model interpretable.
#### -Works Well with Imbalanced Data:
SVM can remain effective when there is class imbalance (some job titles appear much less frequently than others), because it focuses on the hardest-to-classify points near the decision boundary.
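As a minimal sketch of the points above, the following fits a linear-kernel SVM on TF-IDF features. The corpus, the labels, and the name `svm_demo` are hypothetical toy values rather than our dataset, and `class_weight="balanced"` is shown only as an option scikit-learn offers for imbalanced labels, not as a setting used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus: each document is a space-joined skill list.
docs = [
    "python sql statistics",
    "python machine learning statistics",
    "sql tableau reporting",
    "networking windows helpdesk",
    "helpdesk hardware windows",
    "networking hardware troubleshooting",
]
labels = ["data", "data", "data", "support", "support", "support"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one column per token

# class_weight="balanced" scales the misclassification penalty per class.
svm_demo = SVC(kernel="linear", class_weight="balanced", random_state=42)
svm_demo.fit(X, labels)

n_docs, n_features = X.shape
sparsity = 1.0 - X.nnz / (n_docs * n_features)  # share of zero entries
prediction = svm_demo.predict(vectorizer.transform(["python statistics"]))[0]
print(n_docs, n_features, round(sparsity, 2), prediction)
```

Even on six short documents most entries of the matrix are zero, which is the sparse, high-dimensional setting described above.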

## 2. Random Forest (RF)

#### -Handles Non-Linear Relationships Well: 
Unlike a linear-kernel SVM, which finds a linear decision boundary, Random Forest can capture complex relationships between job history, skills, and job title.
#### -Handles Categorical & Text Features Well: 
Our dataset has job titles, skills, and employer names, which can be encoded as categorical data or word embeddings.
#### -Resistant to Overfitting: 
Since Random Forest averages multiple decision trees, it reduces the risk of overfitting.
#### -Feature Importance: 
It helps interpret which skills, courses, or past jobs are most important in career recommendations.
#### -Works Well with Mixed Data: 
It handles both numerical (num_skills, num_past_jobs) and categorical (careerjunction_za_skills, careerjunction_za_historical_jobtitles) data efficiently.



-----------------------------------------------------------------------------------------------------------------------
Now, let's implement both algorithms and compare the results:

In [1]:
# Import necessary libraries
import pandas as pd
import ast
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from scipy.sparse import hstack

#### 1. LOAD AND CLEAN DATASET

In [2]:
# Load dataset
file_path = 'data_science_extract(in).csv'
df = pd.read_csv(file_path, encoding="latin1", usecols=[
    "careerjunction_za_future_jobtitles",
    "careerjunction_za_skills",
    "careerjunction_za_historical_jobtitles"
])

In [3]:
# Drop rows where future job title is missing or empty
df.dropna(subset=['careerjunction_za_future_jobtitles'], inplace=True)
df = df[df['careerjunction_za_future_jobtitles'].str.strip() != "[]"]


In [4]:
# Standardize target labels (strip whitespace, fix inconsistencies)
df['careerjunction_za_future_jobtitles'] = df['careerjunction_za_future_jobtitles'].str.strip()
df['careerjunction_za_future_jobtitles'] = df['careerjunction_za_future_jobtitles'].str.replace('\xa0', ' ', regex=False)


In [5]:
# Convert skills & job titles from string representations of lists to actual lists
df['careerjunction_za_skills'] = df['careerjunction_za_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else [])
df['careerjunction_za_historical_jobtitles'] = df['careerjunction_za_historical_jobtitles'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else [])


In [6]:
# Convert lists to text format
df['skills_text'] = df['careerjunction_za_skills'].apply(lambda skills: ' '.join(skills))
df['hist_text'] = df['careerjunction_za_historical_jobtitles'].apply(lambda jobs: ' '.join(jobs))


#### 2. MERGE RARE CLASSES INTO "OTHER"

In [7]:
# Count occurrences of each job title category
class_counts = df['careerjunction_za_future_jobtitles'].value_counts()

In [8]:
# Define threshold (minimum samples per job title category)
min_samples = 2

In [9]:
# Replace rare categories with "Other"
df['careerjunction_za_future_jobtitles'] = df['careerjunction_za_future_jobtitles'].apply(
    lambda x: x if class_counts[x] >= min_samples else "Other"
)
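The merging step above can be illustrated on a toy series. This is a sketch with made-up titles, using a vectorized `where` instead of `apply`. It should give the same result as the cell above under the same threshold.

```python
import pandas as pd

# Hypothetical toy labels: "Legal" appears only once.
titles = pd.Series(
    ["Project Management"] * 3 + ["Sales & Marketing"] * 2 + ["Legal"]
)

min_samples = 2
counts = titles.value_counts()
rare = counts[counts < min_samples].index  # titles below the threshold

# Keep frequent titles, replace rare ones with "Other".
merged = titles.where(~titles.isin(rare), "Other")
print(merged.value_counts().to_dict())
```

Note that `train_test_split` with `stratify` raises a `ValueError` when any class has fewer than two members, so if only a single row ends up in "Other", as in this toy case, the stratified split would fail.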


#### 3. SPLIT DATA INTO TRAIN & TEST SETS

In [10]:
# Define features (skills_text + hist_text) and target
X = df[['skills_text', 'hist_text']]
y = df['careerjunction_za_future_jobtitles']

In [11]:
# Stratified train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
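To show what `stratify=y` does, here is a small sketch on hypothetical labels (the classes "A", "B", "C" and the `row_id` column are invented for illustration). The class proportions in the train and test parts match those of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% A, 30% B, 10% C.
y = pd.Series(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
X = pd.DataFrame({"row_id": range(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Both parts keep the 60/30/10 class proportions.
print(y_tr.value_counts(normalize=True).round(2).to_dict())
print(y_te.value_counts(normalize=True).round(2).to_dict())
```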

#### 4. HANDLE CLASS IMBALANCE (OVERSAMPLING)

In [12]:
# Combine X_train and y_train to resample minority classes
train_data = X_train.copy()
train_data['future_jobtitle'] = y_train.values

In [13]:
# Find the max count of the most common class
max_count = train_data['future_jobtitle'].value_counts().max()

In [15]:
# Perform oversampling for minority classes
oversampled_train_parts = []
for category, group in train_data.groupby('future_jobtitle'):
    if len(group) < max_count:
        group_oversampled = resample(group, replace=True, n_samples=max_count, random_state=42)
        oversampled_train_parts.append(group_oversampled)
    else:
        oversampled_train_parts.append(group)


In [16]:
# Create new balanced training set
train_data_balanced = pd.concat(oversampled_train_parts).reset_index(drop=True)
train_data_balanced = train_data_balanced.sample(frac=1, random_state=42).reset_index(drop=True)


In [17]:
# Extract oversampled features and labels
X_train_balanced = train_data_balanced[['skills_text', 'hist_text']]
y_train_balanced = train_data_balanced['future_jobtitle']
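The oversampling steps above can be checked on a toy frame. The data below is hypothetical, and the sketch only confirms that every class ends up with `max_count` rows. Note that the extra rows are duplicates, so no new information is added for rare classes.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame: class C has a single row.
train = pd.DataFrame({
    "text": [f"doc {i}" for i in range(10)],
    "label": ["A"] * 6 + ["B"] * 3 + ["C"],
})

max_count = train["label"].value_counts().max()

# Sample each minority class with replacement up to the majority size.
parts = [
    group if len(group) == max_count
    else resample(group, replace=True, n_samples=max_count, random_state=42)
    for _, group in train.groupby("label")
]
balanced = pd.concat(parts).sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["label"].value_counts().to_dict())
print(balanced.loc[balanced["label"] == "C", "text"].nunique())  # distinct C rows
```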

#### 5. TEXT FEATURE EXTRACTION (TF-IDF)

In [18]:
# Initialize TF-IDF vectorizers for skills & job titles
skills_vectorizer = TfidfVectorizer(max_features=5000)
history_vectorizer = TfidfVectorizer(max_features=5000)

In [19]:
# Fit and transform on training data
X_train_skill_tfidf = skills_vectorizer.fit_transform(X_train_balanced['skills_text'])
X_train_hist_tfidf = history_vectorizer.fit_transform(X_train_balanced['hist_text'])


In [20]:
# Transform test data
X_test_skill_tfidf = skills_vectorizer.transform(X_test['skills_text'])
X_test_hist_tfidf = history_vectorizer.transform(X_test['hist_text'])


In [21]:
# Combine TF-IDF matrices
X_train_final = hstack([X_train_skill_tfidf, X_train_hist_tfidf])
X_test_final = hstack([X_test_skill_tfidf, X_test_hist_tfidf])
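As a small sketch of the feature combination above, the following stacks two TF-IDF matrices built from hypothetical toy inputs. Using one vectorizer per field keeps a token such as "data" in a job title separate from the same token in a skill list.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy inputs, one entry per candidate.
skills = ["python sql", "excel reporting", "python statistics"]
history = ["data analyst", "office administrator", "junior data scientist"]

skills_vec = TfidfVectorizer()
hist_vec = TfidfVectorizer()
X_skills = skills_vec.fit_transform(skills)  # 5 distinct tokens
X_hist = hist_vec.fit_transform(history)     # 6 distinct tokens

# hstack returns a COO matrix; CSR supports row slicing.
X_combined = hstack([X_skills, X_hist]).tocsr()
print(X_skills.shape, X_hist.shape, X_combined.shape)
```

The combined matrix has one row per candidate and the columns of both vectorizers side by side, skills first.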


#### 6. TRAIN & EVALUATE MACHINE LEARNING MODELS

In [22]:
# Train SVM Classifier
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_final, y_train_balanced)

SVC(kernel='linear', random_state=42)

In [23]:
# Train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_final, y_train_balanced)


RandomForestClassifier(random_state=42)

#### 7. MODEL PERFORMANCE EVALUATION

In [24]:
# Predict on the test set
y_pred_svm = svm_model.predict(X_test_final)
y_pred_rf = rf_model.predict(X_test_final)

In [25]:
# Print Accuracy Scores
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

SVM Accuracy: 0.4976635514018692
Random Forest Accuracy: 0.5397196261682243


In [26]:
# Print Classification Reports
print("\nSVM Classification Report:")
print(classification_report(y_test, y_pred_svm))


SVM Classification Report:
                                         precision    recall  f1-score   support

               Administrative & Support       0.00      0.00      0.00         1
         Consulting & Business Analysis       0.09      0.09      0.09        35
  Data Analysis & Business Intelligence       0.30      0.32      0.31        38
  Database Administration & Development       0.00      0.00      0.00        14
          Engineering & Technical Roles       0.00      0.00      0.00        10
                   Finance & Accounting       0.00      0.00      0.00         1
            IT Support & Administration       0.22      0.18      0.20        39
Learning & Development (L&D) / Training       0.00      0.00      0.00         1
                                  Other       0.00      0.00      0.00         4
                     Project Management       0.36      0.32      0.34        47
                      Sales & Marketing       0.36      0.16      0.22        25



In [27]:
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))


Random Forest Classification Report:
                                         precision    recall  f1-score   support

               Administrative & Support       0.00      0.00      0.00         1
         Consulting & Business Analysis       0.20      0.03      0.05        35
  Data Analysis & Business Intelligence       0.61      0.29      0.39        38
  Database Administration & Development       0.00      0.00      0.00        14
          Engineering & Technical Roles       0.00      0.00      0.00        10
                   Finance & Accounting       0.00      0.00      0.00         1
            IT Support & Administration       0.00      0.00      0.00        39
Learning & Development (L&D) / Training       0.00      0.00      0.00         1
                                  Other       0.00      0.00      0.00         4
                     Project Management       0.44      0.34      0.39        47
                      Sales & Marketing       0.43      0.24      0.31



Based on the results, Random Forest (53.97%) outperformed SVM (49.77%) in accuracy by about four percentage points, making it the better choice of the two for predicting future job titles from skills and historical job data. A likely reason is its ability to model non-linear relationships: career progression is often complex, and an ensemble of decision trees can capture such variation more readily than the single linear decision boundary of our linear-kernel SVM. Random Forest is also more robust to noise and outliers, because averaging across many trees reduces the impact of individual misclassified instances.

Accuracy alone does not tell the whole story, however. Oversampling only duplicates existing minority-class rows, so rare job categories remain underrepresented in terms of distinct examples, and the classification reports show that both models score zero precision and recall on several of them. Random Forest's higher accuracy also comes with zero recall on "IT Support & Administration", a class on which SVM reaches a recall of 0.18.

### Reasons for the Obtained Accuracy Values:
#### -Class Imbalance in the Dataset:
The dataset contains a disproportionate number of samples for different job categories. Some job titles, such as "Software Development," have significantly more occurrences compared to others like "Finance & Accounting" or "Learning & Development." This imbalance causes the model to favor predicting the majority classes while failing to correctly classify underrepresented job titles.

#### -Overlapping Job Titles and Skills:
Many job roles share similar skill sets, making it difficult for the model to distinguish between them. For example, "Data Analysis & Business Intelligence" and "Database Administration & Development" both require database knowledge, which confuses the model and leads to misclassifications.

#### -Limited Feature Representation Using TF-IDF:
The model relies on TF-IDF (Term Frequency-Inverse Document Frequency) to represent job skills in a numerical format. However, TF-IDF does not capture semantic relationships between words. For instance, it treats "Machine Learning" and "Artificial Intelligence" as completely separate terms, even though they are conceptually related. This limitation reduces the model’s ability to generalize well.
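This limitation can be demonstrated with a minimal sketch on three hypothetical skill strings. Cosine similarity between TF-IDF vectors is non-zero only when the texts share tokens, so related concepts with no words in common look unrelated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical skill strings.
docs = [
    "machine learning",
    "artificial intelligence",
    "machine learning engineer",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
sim = cosine_similarity(X)

# No shared tokens: similarity is zero despite the conceptual link.
print("ML vs AI:", sim[0, 1])
# Shared tokens "machine" and "learning": similarity is clearly positive.
print("ML vs ML engineer:", round(sim[0, 2], 2))
```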

#### -Insufficient Training Samples for Some Categories:
Certain job categories have very few samples in the dataset, leading to zero recall for those classes. Since the model does not encounter enough examples of rare job titles, it struggles to make correct predictions, which is evident in categories like "Administrative & Support" and "Finance & Accounting."

#### -SVM Struggles with Complex Class Boundaries:
Support Vector Machines (SVM) perform best when there is a clear separation between classes. However, due to overlapping skills and job roles in the dataset, SVM fails to define accurate decision boundaries, leading to lower accuracy.

#### -Random Forest Performs Better but is Still Biased:
Random Forest captures non-linear patterns better than SVM, which is why it achieves higher accuracy. However, it still struggles with underrepresented job categories and tends to favor dominant job titles, causing lower recall for less common roles.

#### -Low Macro Average F1-Score Due to Imbalance:
The overall performance of the model is negatively affected by the imbalance in job titles, leading to a low macro-average F1-score. The model performs well on frequently occurring job roles but fails to generalize well to less common job categories.
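The gap between accuracy and macro-average F1 can be sketched with hypothetical labels in which a classifier always predicts the majority class. The `zero_division=0` argument, available in recent scikit-learn versions, sets undefined precision to 0 and suppresses the `UndefinedMetricWarning` raised when a class receives no predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Hypothetical labels: 8 of 10 samples belong to class A.
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", acc)
print("macro F1:", round(macro_f1, 3))
print(classification_report(y_true, y_pred, zero_division=0))
```

Accuracy rewards the majority class, while macro F1 weights every class equally, so the two classes that are never predicted pull the macro score down sharply.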

------------------------------------------------------------------------------------------------------------------------


#### 8. FUNCTION FOR USER INPUT & PREDICTION

In [28]:
def predict_future_job(skills, historical_jobs, model_choice='svm'):
    """Predict future job title based on input skills and historical job titles."""

    # Convert input lists to text
    skills_text = ' '.join(skills)
    hist_text = ' '.join(historical_jobs)

    # Transform input using trained TF-IDF vectorizers
    skills_tfidf = skills_vectorizer.transform([skills_text])
    hist_tfidf = history_vectorizer.transform([hist_text])

    # Combine feature vectors
    input_features = hstack([skills_tfidf, hist_tfidf])

    # Predict using selected model
    if model_choice == 'svm':
        prediction = svm_model.predict(input_features)
    elif model_choice == 'rf':
        prediction = rf_model.predict(input_features)
    else:
        raise ValueError("Invalid model choice. Choose 'svm' or 'rf'.")

    return prediction[0]

In [29]:
print(predict_future_job(["Python", "Machine Learning"], ["Data Analyst"], model_choice='svm'))

Data Analysis & Business Intelligence
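One caveat when calling a function like this, sketched on a hypothetical two-document vocabulary: a fitted `TfidfVectorizer` silently drops tokens it did not see during fitting. An input made only of unseen skills becomes an all-zero vector, and the model will still return some label for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary learned from two short documents.
vec = TfidfVectorizer().fit(["python sql", "excel reporting"])

known = vec.transform(["python excel"])     # both tokens are in the vocabulary
unknown = vec.transform(["cobol fortran"])  # neither token was seen in fitting

# nnz counts the non-zero entries of each sparse row.
print(known.nnz, unknown.nnz)
```

Checking `nnz` on the transformed input before predicting is one possible way to flag such cases to the user.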
