## Importing the Libraries

This cell imports pandas, NumPy, and matplotlib.pyplot so that they can be used to load, manipulate, and analyze the data from the CSV file.

In [472]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Reading the Dataset

`dataset` is created here by using pandas to read the CSV file into a single DataFrame.

In [473]:
dataset = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/CS 430 Machine Learning/InClass_Assignment1/HRDataset_v14.csv.xls")

## Exploring the Dataset

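`dataset.head()` shows the first 5 rows of the dataset. `dataset.info()` shows the total number of entries, the columns in the dataset, the datatype of each column, and how many entries are non-null. The outputs below come from cells `In [487]` and `In [488]`, which were run after the preprocessing steps in the following sections, so they already reflect the dropped and encoded columns.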

In [487]:
dataset.head()

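| | GenderID | EmpStatusID | DeptID | PerfScoreID | FromDiversityJobFairID | Salary | Termd | PositionID | Zip | EngagementSurvey | ... | Department_Admin Offices | Department_Executive Office | Department_IT/IS | Department_Production | Department_Sales | Department_Software Engineering | PerformanceScore_Exceeds | PerformanceScore_Fully Meets | PerformanceScore_Needs Improvement | PerformanceScore_PIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 4 | 0 | 62506 | 0 | 19 | 1960 | 4.6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 5 | 3 | 3 | 0 | 104437 | 1 | 27 | 2148 | 4.96 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 5 | 5 | 3 | 0 | 64955 | 1 | 20 | 1810 | 3.02 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 5 | 3 | 0 | 64991 | 0 | 19 | 1886 | 4.84 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 5 | 5 | 3 | 0 | 50825 | 1 | 19 | 2169 | 5.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |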


In [488]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Columns: 166 entries, GenderID to PerformanceScore_PIP
dtypes: float64(1), int64(13), uint8(152)
memory usage: 80.3 KB
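
As a small illustration of what these two calls report, here is a minimal sketch on a made-up DataFrame. The column values are hypothetical and are not taken from the HR dataset; `buf` is used only so the `info()` text can be captured in a variable.

```python
import io
import pandas as pd

# Hypothetical toy data, not taken from HRDataset_v14
toy = pd.DataFrame({
    "Salary": [50000, 60000, 70000, 80000, 90000, 100000],
    "Department": ["Sales", "IT/IS", "Sales", "Production", "IT/IS", "Sales"],
})

# head() returns the first 5 rows by default
first_rows = toy.head()

# info() prints its summary; passing buf captures the text instead
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()

print(first_rows)
print(summary)
```

The summary lists the entry count, each column's non-null count, and each column's dtype, which is the same layout as the `dataset.info()` output above.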


## Dropping Unnecessary Columns and Encoding Categorical Data

Columns that were not useful for predicting salary, along with some redundant categorical columns, were dropped so that there are fewer columns to train and test on. `dataset.info()` confirms that the dropped columns are no longer in the dataset. `get_dummies()` is then used to one-hot encode the remaining categorical columns into numerical data; this is encoding rather than imputation, since no missing values are filled in. `dataset.head()` shows that all the data in the dataset is now numerical.

In [476]:
dataset = dataset.drop(['Employee_Name', 'EmpID', 'MarriedID', 'MaritalStatusID', 'Position', 'DOB', 'Sex', 'MaritalDesc', 'CitizenDesc', 'DateofTermination', 'TermReason', 'ManagerName', 'ManagerID', 'RecruitmentSource', 'LastPerformanceReview_Date'], axis = 1)

In [477]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   GenderID                311 non-null    int64  
 1   EmpStatusID             311 non-null    int64  
 2   DeptID                  311 non-null    int64  
 3   PerfScoreID             311 non-null    int64  
 4   FromDiversityJobFairID  311 non-null    int64  
 5   Salary                  311 non-null    int64  
 6   Termd                   311 non-null    int64  
 7   PositionID              311 non-null    int64  
 8   State                   311 non-null    object 
 9   Zip                     311 non-null    int64  
 10  HispanicLatino          311 non-null    object 
 11  RaceDesc                311 non-null    object 
 12  DateofHire              311 non-null    object 
 13  EmploymentStatus        311 non-null    object 
 14  Department              311 non-null    ob

In [478]:
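# Pass the list as `columns`; the second positional argument of get_dummies is `prefix`, not `columns`
dataset = pd.get_dummies(dataset, columns = ['State', 'HispanicLatino', 'RaceDesc', 'DateofHire', 'EmploymentStatus', 'Department', 'PerformanceScore'])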

In [479]:
dataset.head()

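| | GenderID | EmpStatusID | DeptID | PerfScoreID | FromDiversityJobFairID | Salary | Termd | PositionID | Zip | EngagementSurvey | ... | Department_Admin Offices | Department_Executive Office | Department_IT/IS | Department_Production | Department_Sales | Department_Software Engineering | PerformanceScore_Exceeds | PerformanceScore_Fully Meets | PerformanceScore_Needs Improvement | PerformanceScore_PIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 5 | 4 | 0 | 62506 | 0 | 19 | 1960 | 4.6 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 5 | 3 | 3 | 0 | 104437 | 1 | 27 | 2148 | 4.96 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 5 | 5 | 3 | 0 | 64955 | 1 | 20 | 1810 | 3.02 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 5 | 3 | 0 | 64991 | 0 | 19 | 1886 | 4.84 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 5 | 5 | 3 | 0 | 50825 | 1 | 19 | 2169 | 5.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |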
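
To make the encoding step concrete, the sketch below one-hot encodes a single categorical column on a made-up DataFrame. The data is hypothetical; only the `Department` column name and its `Department_<value>` output naming mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from HRDataset_v14
toy = pd.DataFrame({
    "Salary": [50000, 60000, 70000],
    "Department": ["Sales", "IT/IS", "Sales"],
})

# Each distinct Department value becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["Department"])

print(encoded.columns.tolist())
```

Each row has exactly one indicator set among the new `Department_*` columns. A column with many distinct values, such as a date, produces one new column per value, which is probably why the encoded dataset above grows to 166 columns.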


## Splitting the Dataset

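`x` holds the features and is created by dropping the `Salary` column from the dataset; `y` holds the target and is created by selecting only the `Salary` column. `train_test_split` from `sklearn.model_selection` then splits the data 80/20 into four parts: `x_train`, `x_test`, `y_train`, and `y_test`.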

In [480]:
x = dataset.drop(['Salary'], axis = 1)
y = dataset[['Salary']]

In [481]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
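
For a sense of the resulting sizes, here is a minimal sketch that applies the same split settings to placeholder arrays with 311 rows, the row count reported by `dataset.info()`. The array contents are dummy values, not the HR data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (311)
x_demo = np.arange(311 * 2).reshape(311, 2)
y_demo = np.arange(311)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up: ceil(0.2 * 311) = 63, leaving 248 for training
print(x_tr.shape, x_te.shape)  # (248, 2) (63, 2)
```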

## Scaling the Dataset

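`StandardScaler()` from `sklearn.preprocessing` standardizes each column to zero mean and unit variance. Each scaler is fitted on the training set only (`fit_transform`) and then applied to the test set (`transform`), so no information from the test set leaks into the scaling. Scaling matters most for the SGD and SVR models used below, which are sensitive to feature magnitudes.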

In [482]:
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)
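
Because `y` is scaled as well, model outputs are in standardized units rather than dollars. The sketch below, on hypothetical salary values, shows the fit-on-train pattern and how `inverse_transform` maps scaled values back to the original units. The variable names here are illustrative and are not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical salary values, shaped (n_samples, 1) like y in the notebook
y_train_demo = np.array([[50000.0], [60000.0], [70000.0]])
y_test_demo = np.array([[65000.0]])

sc = StandardScaler()
y_train_scaled = sc.fit_transform(y_train_demo)  # learns mean and std from train
y_test_scaled = sc.transform(y_test_demo)        # reuses the train statistics

# Map scaled values back to the original salary units
restored = sc.inverse_transform(y_test_scaled)

print(y_train_scaled.ravel())
print(restored.ravel())
```

In the notebook, the same idea would apply to predictions: passing them through `sc_y.inverse_transform` should give salaries in the original units.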

## Randomized Search CV

`model_params` is a dictionary that holds five regressors and the parameter values to search for each: `LinearRegression` (no parameters), `SGDRegressor`, `Ridge`, `SVR`, and `DecisionTreeRegressor`. A for loop runs `RandomizedSearchCV` on each entry and appends the model name, best score, and best parameters to the `scores` list. `best_score_` is the mean cross-validated score of the best parameter combination, which for these regressors is R² by default. Printing `scores` lets us compare the best score and parameters of each model side by side. `DecisionTreeRegressor` has the highest score (about 0.385), making it the best of the five models for this dataset in this run.

In [483]:
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

model_params = {
    'LinearRegression': {
        'model': LinearRegression(), 
        'params': {
            
        }
    },
    'SGD': {
        'model': SGDRegressor(max_iter = 1000, tol = 1e-3),
        'params': {
            'alpha': [0.01, 0.1, 1],
            'max_iter': [100, 500, 1000]
        }
    },
    'Ridge': {
        'model': Ridge(),
        'params': {
            'alpha': [0.1, 1, 5, 10],
            'max_iter': [100, 500, 1000, 2000]
        }
    },
    'SVR': {
        'model': SVR(),
        'params': {
            'kernel': ['rbf', 'poly'],
            'tol': [0.01, 0.1, 1]
        }
    },
    'DecisionTree': {
        'model': DecisionTreeRegressor(),
        'params': {
            'criterion': ['squared_error', 'absolute_error']
        }
    }
}
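
As a rough check on how much of each search space a randomized search with `n_iter = 5` can cover, the sketch below counts the parameter combinations per model with `ParameterGrid`. The `search_spaces` dictionary is a copy of the `'params'` entries above, without the estimators, so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the 'params' entries from model_params (estimators omitted)
search_spaces = {
    'LinearRegression': {},
    'SGD': {'alpha': [0.01, 0.1, 1], 'max_iter': [100, 500, 1000]},
    'Ridge': {'alpha': [0.1, 1, 5, 10], 'max_iter': [100, 500, 1000, 2000]},
    'SVR': {'kernel': ['rbf', 'poly'], 'tol': [0.01, 0.1, 1]},
    'DecisionTree': {'criterion': ['squared_error', 'absolute_error']},
}

# Number of distinct parameter combinations for each model
grid_sizes = {name: len(ParameterGrid(space)) for name, space in search_spaces.items()}

print(grid_sizes)
```

With `n_iter = 5`, only a subset of the SGD, Ridge, and SVR combinations is tried, while the LinearRegression and DecisionTree spaces are smaller than 5 and are searched in full.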

In [484]:
from sklearn.model_selection import RandomizedSearchCV
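scores = []
for model_name, model_parameter in model_params.items():
    # Sample up to 5 parameter combinations per model, scored with 6-fold cross-validation
    clf = RandomizedSearchCV(model_parameter['model'], model_parameter['params'], cv = 6, return_train_score = False, n_iter = 5)
    # Flatten the scaled (n, 1) target into the 1-D array the regressors expect
    clf.fit(x_train, y_train.ravel())
    # Record the best mean cross-validated score and the parameters that produced it
    scores.append({
        'model': model_name,
        'best_scores': clf.best_score_,
        'best_params': clf.best_params_
    })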
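
The following is a minimal, self-contained sketch of the same search pattern on synthetic data, followed by a check on a held-out test set. The data from `make_regression`, the `max_depth` values, and the fixed `random_state` are all assumptions made for illustration; none of them come from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the HR dataset (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# random_state makes the sampled parameter combinations reproducible
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4, 6, 8]},  # hypothetical search space
    n_iter=3,
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_                           # mean cross-validated R^2
test_r2 = search.best_estimator_.score(X_te, y_te)   # R^2 on held-out data

print(search.best_params_, cv_r2, test_r2)
```

Comparing the cross-validated score with the held-out score is one way to check whether the selected model generalizes; the notebook's `x_test` and `y_test` could be used the same way.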



In [485]:
scores

[{'model': 'LinearRegression',
  'best_scores': -5.322690872163867e+28,
  'best_params': {}},
 {'model': 'SGD',
  'best_scores': 0.238483157625901,
  'best_params': {'max_iter': 500, 'alpha': 1}},
 {'model': 'Ridge',
  'best_scores': -0.04651206564753877,
  'best_params': {'max_iter': 1000, 'alpha': 5}},
 {'model': 'SVR',
  'best_scores': 0.19946809362464865,
  'best_params': {'tol': 0.1, 'kernel': 'rbf'}},
 {'model': 'DecisionTree',
  'best_scores': 0.3854808199514776,
  'best_params': {'criterion': 'squared_error'}}]

In [486]:
df = pd.DataFrame(scores)
df

| | model | best_scores | best_params |
|---|---|---|---|
| 0 | LinearRegression | -5.322691e+28 | {} |
| 1 | SGD | 0.2384832 | {'max_iter': 500, 'alpha': 1} |
| 2 | Ridge | -0.04651207 | {'max_iter': 1000, 'alpha': 5} |
| 3 | SVR | 0.1994681 | {'tol': 0.1, 'kernel': 'rbf'} |
| 4 | DecisionTree | 0.3854808 | {'criterion': 'squared_error'} |
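
Rather than reading the winner off the table by eye, the best row can be selected programmatically. This sketch rebuilds the results from the scores printed above and picks the row with the highest `best_scores` value; the names `scores_demo`, `results`, and `best_row` are illustrative.

```python
import pandas as pd

# Scores copied from the output above
scores_demo = [
    {'model': 'LinearRegression', 'best_scores': -5.322690872163867e+28, 'best_params': {}},
    {'model': 'SGD', 'best_scores': 0.238483157625901, 'best_params': {'max_iter': 500, 'alpha': 1}},
    {'model': 'Ridge', 'best_scores': -0.04651206564753877, 'best_params': {'max_iter': 1000, 'alpha': 5}},
    {'model': 'SVR', 'best_scores': 0.19946809362464865, 'best_params': {'tol': 0.1, 'kernel': 'rbf'}},
    {'model': 'DecisionTree', 'best_scores': 0.3854808199514776, 'best_params': {'criterion': 'squared_error'}},
]

results = pd.DataFrame(scores_demo)

# idxmax returns the index label of the highest score
best_row = results.loc[results['best_scores'].idxmax()]

print(best_row['model'], best_row['best_params'])
```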
