### **Salary prediction** 
Predicting salaries based on various features is crucial for informed decision-making in HR, finance, and economic planning. This assignment aims to develop a robust salary prediction model using unsupervised learning, supervised learning, and neural networks. Each method plays a specific role in achieving our goals.

**Feature selection and Supervised Learning**
    
- Achieve Predictive Accuracy: Use algorithms like linear regression and decision trees.
- Evaluate Models: Measure performance with metrics like MSE and cross-validation.
- Determine Feature Importance: Identify key predictors of salary.
- Prepare Data: Detect and handle anomalies.
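
The evaluation goal above (MSE combined with cross-validation) can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the data, the two model settings, and the `cv_mse` dictionary are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the salary data (assumption: 3 numeric features,
# target depends linearly on the first two plus a little noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # scikit-learn reports the negated MSE so that higher is always better;
    # flip the sign to get the usual MSE back.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -neg_mse.mean()

print(cv_mse)
```

Averaging over five folds gives a more stable estimate than a single train/test split, which matters when comparing models whose scores are close.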

**Unsupervised Learning**

- Identify Patterns and Clusters: Group similar job roles or experience levels.
- Reduce Dimensionality: Focus on significant features.


**Neural Networks**

- Model Complex Relationships: Capture intricate patterns and interactions.
- Ensure Flexibility and Scalability: Handle large datasets and many features.
- Leverage Large Data: Improve learning and generalization with extensive data.

In [1]:
import pandas as pd

# Import custom helper functions (see the ./functions folder).
from functions.wrangling import preprocessing, decode_column  
from functions.plotting import scatter_plot, bar_plot
from functions.ML import TSNE_reduction, DBSCAN_cluster, KMEANS_cluster, RF_train_test

In [2]:
## PRE-PROCESSING
df = pd.read_csv("./data/salary_data.csv").dropna()

df['DOJ'] = pd.to_datetime(df['DOJ'])
df['CURRENT DATE'] = pd.to_datetime(df['CURRENT DATE'])
df['DAYS ELAPSED'] = (df['CURRENT DATE'] - df['DOJ']).dt.days

# Returns preprocessed df and exposes tools to decode it 
preprocessed_df, encodings_cat, label_encoders, num_scaler = preprocessing(df) 

preprocessed_df

| | FIRST NAME | LAST NAME | SEX | DOJ | CURRENT DATE | DESIGNATION | AGE | SALARY | UNIT | LEAVES USED | LEAVES REMAINING | RATINGS | PAST EXP | DAYS ELAPSED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2203 | 2430 | 0 | 0.384075 | 0.0 | 0 | -1.000000 | -0.277197 | 0 | 0.285714 | -0.285714 | -0.5 | -0.5 | -0.384075 |
| 2 | 1766 | 1667 | 0 | 0.550351 | 0.0 | 0 | -1.000000 | -0.730006 | 0 | 0.142857 | -0.142857 | 0.0 | -0.5 | -0.550351 |
| 3 | 391 | 2131 | 0 | -0.576112 | 0.0 | 0 | -0.666667 | -0.154444 | 1 | 0.000000 | 0.000000 | 0.0 | -0.5 | 0.576112 |
| 6 | 704 | 578 | 1 | -0.220141 | 0.0 | 0 | -0.666667 | -0.807165 | 3 | -0.428571 | 0.428571 | 1.0 | -0.5 | 0.220141 |
| 8 | 1244 | 1284 | 0 | 0.482436 | 0.0 | 4 | 1.333333 | 2.091188 | 4 | -0.285714 | 0.285714 | 0.0 | 0.0 | -0.482436 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2634 | 1237 | 1274 | 0 | -2.086651 | 0.0 | 5 | 4.000000 | 17.435210 | 2 | -1.000000 | 1.000000 | 1.0 | 4.5 | 2.086651 |
| 2635 | 1447 | 1296 | 0 | 0.093677 | 0.0 | 0 | -0.333333 | -0.128390 | 1 | -0.714286 | 0.714286 | -0.5 | -0.5 | -0.093677 |
| 2636 | 1888 | 1393 | 0 | 0.114754 | 0.0 | 0 | -1.000000 | 0.066637 | 5 | 1.000000 | -1.000000 | 1.0 | -0.5 | -0.114754 |
| 2637 | 2164 | 1647 | 0 | 0.238876 | 0.0 | 0 | 0.000000 | -0.201791 | 5 | 0.142857 | -0.142857 | 0.0 | 0.0 | -0.238876 |
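
The `preprocessing` helper lives in `./functions` and is not shown in this notebook. As a rough idea of what such a function could look like, here is a minimal sketch that label-encodes text columns, scales numeric ones, and returns the fitted tools so the values can be decoded later. The name `preprocess_sketch`, the use of `LabelEncoder` and `RobustScaler`, and the toy data are all assumptions, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler


def preprocess_sketch(df):
    """Hypothetical helper: encode text columns, scale numeric columns,
    and return the fitted encoders/scalers keyed by column name."""
    out = df.copy()
    encoders, scalers = {}, {}
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # One scaler per column so each can be inverted on its own.
            scalers[col] = RobustScaler()
            out[col] = scalers[col].fit_transform(out[[col]]).ravel()
        else:
            encoders[col] = LabelEncoder()
            out[col] = encoders[col].fit_transform(out[col])
    return out, encoders, scalers


# Toy data standing in for the salary table (made-up values).
toy = pd.DataFrame({
    "DESIGNATION": ["Analyst", "Manager", "Analyst", "Director"],
    "SALARY": [45000.0, 90000.0, 47000.0, 150000.0],
})
encoded, encoders, scalers = preprocess_sketch(toy)

# Round trip: the stored tools recover the original values.
designations = encoders["DESIGNATION"].inverse_transform(encoded["DESIGNATION"])
salaries = scalers["SALARY"].inverse_transform(encoded[["SALARY"]]).ravel()
```

Keeping one fitted scaler per column is what makes a call such as `num_scaler["SALARY"].inverse_transform(...)` possible when predictions need to be reported in the original units.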


**Feature selection and Supervised Learning**

Some features have been removed at random; a Random Forest (RF) regressor is now used to determine which features are the most important.

A Random Forest regressor is trained to predict salary. Removing the least important features improves the score.

On average, the feature-selected model scores higher: it outperforms the model trained on all features in 4 of 5 iterations.
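
The `RF_train_test` helper is defined in `./functions/ML` and is not shown here. A minimal sketch of what such a function might do is given below: split the data, fit a `RandomForestRegressor`, and return the held-out score together with the feature importances. The name `rf_train_test_sketch`, the 80/20 split, the use of R² as the score, and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def rf_train_test_sketch(df, target, seed=None):
    """Hypothetical stand-in for RF_train_test: returns (score, importances, model)."""
    X = df.drop(columns=[target])
    y = df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)

    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)

    # For regressors, .score() is the coefficient of determination (R^2).
    score = model.score(X_te, y_te)

    importances = (
        pd.DataFrame({"Feature": X.columns, "Importance": model.feature_importances_})
        .sort_values("Importance", ascending=False)
        .reset_index(drop=True)
    )
    return score, importances, model


# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
toy["SALARY"] = 3.0 * toy["signal"] + rng.normal(scale=0.1, size=300)

score, importances, model = rf_train_test_sketch(toy, "SALARY", seed=0)
print(importances)
```

Because each call draws a new split and a new forest when no seed is fixed, scores vary from run to run, which is why the comparison in this section is repeated over several iterations.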


In [59]:
itrs = 5
dropped_features = ["FIRST NAME", "LAST NAME", "DOJ", "CURRENT DATE"]

# Feature importances from a model trained on all features
score_RF, feat_imp_RF, fitted_model_RF = RF_train_test(preprocessed_df, "SALARY")
bar_plot(feat_imp_RF, ["Importance","Feature"])

# Keep only the important features (does not change between iterations, so built once)
selected_df = preprocessed_df.drop(columns=dropped_features)

results = []
for itr in range(itrs):

    # With all features
    score_RF, feat_imp_RF, fitted_model_RF = RF_train_test(preprocessed_df, "SALARY")

    # With only important features
    score_RF_pr, feat_imp_RF_pr, fitted_model_RF_pr = RF_train_test(selected_df, "SALARY")

    # Collect results; the output df is built once after the loop
    results.append({"Itr": itr, "Not pruned": score_RF, "Pruned": score_RF_pr})

# Building the df from a list avoids concatenating onto an empty DataFrame
acc_scores = pd.DataFrame(results, columns=["Itr", "Not pruned", "Pruned"])

acc_scores





| | Itr | Not_pruned | Pruned |
|---|---|---|---|
| 0 | 0 | 0.954083 | 0.93884 |
| 1 | 1 | 0.944876 | 0.955438 |
| 2 | 2 | 0.942405 | 0.961174 |
| 3 | 3 | 0.942031 | 0.973503 |
| 4 | 4 | 0.927046 | 0.960101 |


**Dimensionality reduction and Clustering**

The "FIRST NAME", "LAST NAME", "DOJ", and "CURRENT DATE" features were removed.
They are highly specific and add a lot of noise to the data. How long a person has been at the company is captured instead by the number of days elapsed between the two dates ("DAYS ELAPSED").

This column was computed during preprocessing.
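
The `TSNE_reduction` helper used in the next cell is not shown in this notebook. A minimal sketch of what such a wrapper might look like is given below, assuming it maps the `rng` argument to scikit-learn's `random_state` and returns a DataFrame with `tsne1` and `tsne2` columns (the column names used by the plotting calls later on). The function name, the `perplexity` parameter, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE


def tsne_reduction_sketch(df, rng=0, perplexity=30.0):
    """Hypothetical stand-in for TSNE_reduction: 2-D t-SNE embedding of df."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=rng)
    embedding = tsne.fit_transform(df.to_numpy())
    # Keep the original index so the embedding can be joined back to the data.
    return pd.DataFrame(embedding, columns=["tsne1", "tsne2"], index=df.index)


# Toy numeric data (perplexity must stay below the number of samples).
rng_np = np.random.default_rng(0)
toy = pd.DataFrame(rng_np.normal(size=(60, 5)), columns=[f"f{i}" for i in range(5)])
tsne_toy = tsne_reduction_sketch(toy, rng=50, perplexity=10.0)
print(tsne_toy.head())
```

Preserving the index matters here because the preprocessed data has gaps in its index after `dropna()`, and later cells attach columns to the embedding by index alignment.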

In [42]:

tsne_df = TSNE_reduction(selected_df, rng=50)

**Unsupervised Findings**

A relationship between salary and designation can be seen by comparing the two plots. The clustering observed in the t-SNE embedding is further confirmed by the clustering algorithms.


- There are 2 main clusters of data analysts with a broad range of salaries.
- Senior analysts are paid more than data analysts.
- Associates make as much as senior analysts.
- Managers have the second highest salary.
- Directors make the most money.

The unsupervised clustering algorithms identify a split in the data analyst t-SNE cluster, although the cause could not be determined.

Interestingly, there isn't a significant difference in salary between senior and regular data scientists, suggesting that the best route to the highest earnings is to become a director or an associate. Also of note, one manager is paid less than the average.
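
One way to check whether two clustering algorithms agree on a grouping, as described above, is to compare their label assignments directly. The sketch below does this on synthetic blobs with DBSCAN and k-means, using the adjusted Rand index as the agreement measure. The data, the `eps` and `min_samples` values, and the choice of the adjusted Rand index are assumptions for illustration; the notebook's `DBSCAN_cluster` and `KMEANS_cluster` helpers are not shown and may be configured differently.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic groups standing in for job designations.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.4,
    random_state=0,
)

# DBSCAN finds the number of clusters itself; label -1 marks noise points.
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# k-means needs the number of clusters up front.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

n_db_clusters = len(set(db_labels) - {-1})
n_km_clusters = len(set(km_labels))

# Adjusted Rand index: 1.0 means identical groupings (up to label names).
agreement = adjusted_rand_score(db_labels, km_labels)
print(n_db_clusters, n_km_clusters, agreement)
```

The adjusted Rand index ignores how clusters are numbered, so it is suitable for comparing outputs of algorithms that assign labels arbitrarily.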

In [46]:
# Work on a copy so the original t-SNE output is not modified
tsne_df_des = tsne_df.copy()

# Decode the designation and build the df used for plotting. Clustering algorithms are applied as well.
tsne_df_des["DES"] = decode_column(preprocessed_df, "DESIGNATION", label_encoder=label_encoders["DESIGNATION"])
tsne_df_des["SAL"] = preprocessed_df["SALARY"]
dbscan_anon_clusters = DBSCAN_cluster(selected_df)
kmeans_anon_clusters = KMEANS_cluster(selected_df, clusters=8)


tsne_df_des["cluster_DB"] = dbscan_anon_clusters["cluster"]
tsne_df_des["cluster_K"] = kmeans_anon_clusters["cluster"]

# Plot scatterplots with different layers of information.
scatter_plot(tsne_df_des, ["tsne1", "tsne2"], "DES", title="Designation tsne")
scatter_plot(tsne_df_des, ["tsne1", "tsne2"], "SAL", title="Salary tsne")
scatter_plot(tsne_df_des, ["tsne1", "tsne2"], "cluster_DB", title="DBSCAN tsne")
#scatter_plot(tsne_df_des, ["tsne1", "tsne2"], "cluster_K", title="KMEANS tsne")

**Neural Network**
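A small feed-forward network is trained to predict the scaled salary from the 13 remaining columns of the preprocessed data (all columns except SALARY). It has two hidden layers (64 and 32 units, ReLU activation) with dropout and batch normalization, and is trained with the Adam optimizer using mean absolute error as the loss. The data are split into training, evaluation, and test sets.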

...

In [5]:
import logging
import os

# Set logging level to suppress certain TensorFlow warnings and errors
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Set TensorFlow logging to suppress INFO, WARNING, and ERROR messages

# Create a custom logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger to display INFO level messages

# Filter out specific TensorFlow CUDA-related warnings and errors
logging.getLogger('tensorflow').setLevel(logging.ERROR)
logging.getLogger('tensorflow.compiler').setLevel(logging.ERROR)
logging.getLogger('tensorflow.compiler.xla').setLevel(logging.ERROR)
logging.getLogger('tensorflow.compiler.tfrt').setLevel(logging.ERROR)
logging.getLogger('tensorflow.compiler.mlir').setLevel(logging.ERROR)
# Actual TensorFlow

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import sklearn
import sklearn.model_selection


# Split into training (80%) and holdout (20%) sets
X_train, X_holdout, y_train, y_holdout = sklearn.model_selection.train_test_split(
    preprocessed_df.drop(columns=["SALARY"]),
    preprocessed_df["SALARY"],
    test_size=0.2,
    random_state=42,
)

# Split the holdout set into evaluation (40%) and test (60%) sets
X_eval, X_test, y_eval, y_test = sklearn.model_selection.train_test_split(
    X_holdout, y_holdout, test_size=0.6, random_state=42
)
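
With `test_size=0.2` for the first split and `test_size=0.6` for the second, the overall proportions work out to roughly 80% training, 8% evaluation, and 12% test. The sketch below checks this arithmetic on a dummy array of 1000 rows; the array and the `sizes` dictionary are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 1000 rows, used only to count how many land in each split.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: 80% training, 20% holdout.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 60% of the holdout becomes the test set, 40% the evaluation set.
X_eval, X_test, y_eval, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.6, random_state=42
)

sizes = {"train": len(X_train), "eval": len(X_eval), "test": len(X_test)}
print(sizes)
```

The evaluation set is only used to monitor training, so keeping it smaller than the test set leaves more unseen data for the final metrics.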





In [8]:

# Model structure
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(13,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation='linear')  # Linear activation for regression
])

model.compile(optimizer='adam',
              loss='mae',  # Mean Absolute Error
              metrics=['mse'])   

model.summary()
model.fit(X_train, y_train, epochs=100, validation_data=(X_eval, y_eval))

model.save("./temp/model/salary_pred.keras")

Epoch 1/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 2.3095 - mse: 24.4229 - val_loss: 2.2348 - val_mse: 28.8294
Epoch 2/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1.8914 - mse: 21.8074 - val_loss: 2.2288 - val_mse: 29.0922
Epoch 3/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1.9109 - mse: 22.2656 - val_loss: 2.1906 - val_mse: 29.5173
Epoch 4/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1.8247 - mse: 21.1470 - val_loss: 2.2308 - val_mse: 30.4059
Epoch 5/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1.8195 - mse: 22.3159 - val_loss: 2.1890 - val_mse: 29.4414
Epoch 6/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 1.9175 - mse: 24.7802 - val_loss: 2.1870 - val_mse: 29.6002
Epoch 7/100
66/66 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/st

In [9]:
loaded_model = tf.keras.models.load_model("./temp/model/salary_pred.keras")

In [10]:
y_pred = loaded_model.predict(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared error: ",mse)
print("Mean Absolute error:", mae)
print(f"R2 score:{r2*100}% ")


10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
Mean Squared error:  2.9449099866864006
Mean Absolute error: 0.7291743008489989
R2 score:80.18091066793087% 


In [13]:
# Reverse the scaling and output the predicted salaries
print(num_scaler["SALARY"].inverse_transform(y_pred))

[[ 73455.805]
 [ 44491.52 ]
 [ 45916.566]
 [ 44554.336]
 [ 44744.812]
 [ 44153.89 ]
 [ 44285.918]
 [ 43715.812]
 [ 45728.305]
 [ 45859.562]
 [ 44659.08 ]
 [ 44386.324]
 [ 44593.93 ]
 [ 53813.22 ]
 [ 44945.594]
 [ 45849.703]
 [ 44115.82 ]
 [ 47438.5  ]
 [ 45154.418]
 [ 54227.72 ]
 [ 44146.918]
 [ 44895.19 ]
 [ 44638.27 ]
 [ 44155.566]
 [ 44863.438]
 [ 43522.062]
 [ 46277.035]
 [ 44171.1  ]
 [ 44584.58 ]
 [ 59339.945]
 [142295.97 ]
 [ 44251.26 ]
 [ 46098.613]
 [ 45155.766]
 [ 44412.086]
 [ 48325.375]
 [ 58436.402]
 [ 47606.82 ]
 [168054.56 ]
 [161978.4  ]
 [ 45804.633]
 [ 45362.703]
 [127895.3  ]
 [ 44261.86 ]
 [ 46067.367]
 [ 44738.637]
 [ 49225.605]
 [ 45442.566]
 [ 44738.92 ]
 [ 43698.3  ]
 [ 44121.72 ]
 [ 98093.53 ]
 [ 44138.32 ]
 [ 45910.1  ]
 [ 44875.8  ]
 [ 44675.03 ]
 [ 44278.027]
 [ 44908.547]
 [ 44308.56 ]
 [ 45363.785]
 [ 43890.68 ]
 [ 44653.59 ]
 [112997.61 ]
 [ 44104.547]
 [ 43919.996]
 [ 93355.94 ]
 [ 59858.797]
 [ 60842.227]
 [ 46221.426]
 [ 43097.008]
 [ 44857.555]
 [ 544