**Input Features:**
- Years of Experience: The number of years of experience required for the job, converted to a numeric value.
- Role: The job role (e.g., Developer, Senior, Manager). This is one-hot encoded into multiple binary columns, one for each role type.
- Additional features (optional): Other columns in the dataset, such as the company or job title, can also be included. In this example, only Years of Experience and Role are used to train the model.
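
To make the two encodings concrete, here is a minimal sketch on three made-up rows. The `jobs` frame and its values are hypothetical, and `pd.get_dummies` is used here as a compact stand-in for the `OneHotEncoder` used in the notebook cell.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
jobs = pd.DataFrame({
    "Years of Experience": ["3 years", "5+ years", None],
    "Role": ["Developer", "Manager", "Developer"],
})

# Years of Experience: keep the first number found in the text, default to 0
years = jobs["Years of Experience"].str.extract(r"(\d+)", expand=False)
jobs["Years of Experience"] = pd.to_numeric(years, errors="coerce").fillna(0)

# Role: one binary column per role type
features = pd.get_dummies(jobs, columns=["Role"], dtype=int)
print(features)
```

Each row ends up with one numeric experience value and exactly one `Role_*` column set to 1.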

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load dataset
data = pd.read_csv("D:\\Work\\Projects\\Intro2DS-ITJobTrendAnalysis\\data\\processed\\11-2024\\merged_data.csv")

# Handle missing values
data.fillna({"Company": "Unknown", "Salary Range": "Negotiable"}, inplace=True)

# Split salary range
data[['min_salary', 'max_salary']] = data['Salary Range'].str.extract(r'(\d+\.?\d*)-(\d+\.?\d*)')
data[['min_salary', 'max_salary']] = data[['min_salary', 'max_salary']].apply(pd.to_numeric, errors='coerce')

# Handle missing salary values
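# Assign the result back instead of calling fillna(inplace=True) on a column selection
data['min_salary'] = data['min_salary'].fillna(0)
data['max_salary'] = data['max_salary'].fillna(data['min_salary'])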

# Process Years of Experience
data['Years of Experience'] = data['Years of Experience'].str.extract(r'(\d+)', expand=False)  # Extract the first number as a Series
data['Years of Experience'] = pd.to_numeric(data['Years of Experience'], errors='coerce')  # Convert to numeric
data['Years of Experience'] = data['Years of Experience'].fillna(0)  # Rows without a number default to 0

# One-hot encode 'Experience Level' if the column exists (e.g., 'Junior', 'Mid', 'Senior')
if 'Experience Level' in data.columns:
    experience_encoder = OneHotEncoder(sparse_output=False)
    encoded_experience = experience_encoder.fit_transform(data[['Experience Level']])
    experience_columns = experience_encoder.get_feature_names_out(['Experience Level'])
    encoded_experience_df = pd.DataFrame(encoded_experience, columns=experience_columns, index=data.index)
    # Drop the original string column so that only numeric columns reach the model
    data = pd.concat([data.drop(columns=['Experience Level']), encoded_experience_df], axis=1)

# One-Hot Encode 'Role' (if it's categorical)
role_encoder = OneHotEncoder(sparse_output=False)
encoded_roles = role_encoder.fit_transform(data[['Role']])
role_columns = role_encoder.get_feature_names_out(['Role'])
encoded_roles_df = pd.DataFrame(encoded_roles, columns=role_columns, index=data.index)
data = pd.concat([data, encoded_roles_df], axis=1)

# Drop original 'Role' column
data = data.drop(columns=['Role'])

# Normalize numerical features
scaler = StandardScaler()
data[['Years of Experience', 'min_salary', 'max_salary']] = scaler.fit_transform(data[['Years of Experience', 'min_salary', 'max_salary']])

# Prepare training and test sets
# Note: 'min_salary' is not in the drop list, so X still contains the target column
X = data.drop(columns=['Job Title', 'Company', 'Salary Range', 'Source Platform', 'max_salary', 'Level', 'Location', 'Required Skills'])  # Drop unused columns
y = data['min_salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model and predict on the test set
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['min_salary'].fillna(0, inplace=True)

Mean Squared Error: 0.0001790240431042898
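
This very small error should be read with care. The `X` built in the cell above still contains `min_salary`, which is also the target `y`, and the error is measured on standardized values rather than in salary units. The following is a minimal sketch of a leak-free setup, not the notebook's actual pipeline: it runs on synthetic data (the `toy` frame and its salary formula are made up for illustration), keeps the target out of `X`, leaves the target unscaled, and fits the preprocessing inside a `Pipeline` so that it only sees the training split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (values are made up for illustration)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Years of Experience": rng.integers(0, 10, size=n),
    "Role": rng.choice(["Developer", "Manager", "Data Analyst"], size=n),
})
toy["min_salary"] = 500 + 150 * toy["Years of Experience"] + rng.normal(0, 50, size=n)

# The target is kept out of the feature matrix
X = toy[["Years of Experience", "Role"]]
y = toy["min_salary"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing lives inside the pipeline, so it is fitted on the training split only
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Years of Experience"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Role"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=42)),
])
pipeline.fit(X_train, y_train)

# RMSE is expressed in the same units as min_salary
rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))
print(f"RMSE: {rmse:.2f}")
```

Because the pipeline stores the fitted scaler and encoder, `pipeline.predict` can later be called directly on raw rows with the same two columns.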


In [24]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Sample input (you can replace this with actual input data)
sample_data = pd.DataFrame({
    'Years of Experience': [3],
    'Experience Level_Entry': [0],  # Assuming you already have encoded values for experience level
    'Experience Level_Mid': [1],
    'Experience Level_Senior': [0],
    'Role_Data Analyst': [1],  # Assuming you have one-hot encoded Role
    # Add all the other columns that were used in the model's training process here
})

# Ensure the input features are scaled using the same scaler as the training data
scaler = StandardScaler()
sample_data_scaled = scaler.fit_transform(sample_data[['Years of Experience']])

# Make the prediction
predicted_salary = model.predict(sample_data_scaled)

# Output the prediction
print(f"Predicted Minimum Salary: {predicted_salary[0]}")




ValueError: X has 1 features, but RandomForestRegressor is expecting 38 features as input.
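
The error arises because only the single scaled column is passed to `model.predict`, while the model was trained on 38 columns. In addition, fitting a new `StandardScaler` on a one-row sample maps every value to 0, so the scaler fitted on the training data has to be reused. Below is a minimal sketch of one way to handle both issues, under stated assumptions: the `train` frame, its values, and the `feature_columns` list are hypothetical stand-ins for the real training data, and in the notebook the column list would come from the training `X`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Toy training frame (made-up values), already one-hot encoded
train = pd.DataFrame({
    "Years of Experience": [1, 3, 5, 7],
    "Role_Data Analyst": [1, 0, 1, 0],
    "Role_Developer": [0, 1, 0, 1],
    "min_salary": [600, 900, 1200, 1500],
})
feature_columns = ["Years of Experience", "Role_Data Analyst", "Role_Developer"]

# Fit the scaler once, on the training features only
scaler = StandardScaler()
X_train = train[feature_columns].astype(float)
X_train[["Years of Experience"]] = scaler.fit_transform(X_train[["Years of Experience"]])
model = RandomForestRegressor(random_state=42).fit(X_train, train["min_salary"])

# New input: only the known columns are filled in
sample = pd.DataFrame({"Years of Experience": [3], "Role_Data Analyst": [1]})

# Add any missing training columns as 0 and restore the training column order
sample = sample.reindex(columns=feature_columns, fill_value=0).astype(float)

# transform (not fit_transform) reuses the training mean and variance
sample[["Years of Experience"]] = scaler.transform(sample[["Years of Experience"]])

predicted = model.predict(sample)
print(f"Predicted minimum salary: {predicted[0]:.0f}")
```

The `reindex` call is what guarantees that the sample has the same number and order of columns as the training matrix, whichever one-hot columns the input happens to include.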