### After the preprocessing step, we will establish a baseline for our model. Since the target (salary) is a continuous value, this is a regression problem.

In [1]:
# importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

#__author__ = Monish Khambhati
#__email__ = monish.khambhati@gmail.com

In [2]:
# importing the train data
train_df = pd.read_csv('data/train_features.csv')

In [3]:
# importing the target features data
target_df = pd.read_csv('data/train_salaries.csv')

In [4]:
# importing the test features data
test_df = pd.read_csv('data/test_features.csv')

In [5]:
# Merging the train and target dataframe
train_df = pd.merge(left = train_df,right = target_df,how='inner', on='jobId')

From the EDA, we know that **jobId** is not related to the target variable, so I will start by dropping that feature from the train and test data. I will also keep only the rows in **train_df** whose salary is greater than 0.

In [6]:
# function to clean the dataframe
def clean_data(raw_df):
    '''remove rows that contain salary <= 0 or duplicate job IDs'''
    clean_df = raw_df.drop_duplicates(subset='jobId')
    clean_df = clean_df[clean_df.salary>0]
    return clean_df

In [7]:
train_df = clean_data(train_df)

In [8]:
test_df = test_df.drop('jobId', axis=1)

Before feeding the data to our model, we will write a function to perform **One Hot Encoding**.
 - One-hot encoding converts each categorical variable into a set of binary indicator columns (one per category), a numeric form that ML algorithms can use for prediction.
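As a small illustration of what the encoding does, here is a sketch on a made-up frame. The column names mirror this notebook, but the rows are invented for the example and `toy_df` is a hypothetical name.

```python
import pandas as pd

# Toy data: column names mirror the notebook, the values are made up
toy_df = pd.DataFrame({
    'degree': ['MASTERS', 'NONE', 'MASTERS'],
    'yearsExperience': [10, 3, 8],
})

# Each category of 'degree' becomes its own indicator column
encoded = pd.get_dummies(toy_df[['degree']])

# Numerical columns are passed through unchanged and joined back on
toy_encoded = pd.concat([encoded, toy_df[['yearsExperience']]], axis=1)
print(list(toy_encoded.columns))
```

The single `degree` column is replaced by `degree_MASTERS` and `degree_NONE`, while `yearsExperience` is kept as-is.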

In [9]:
def one_hot_encode_df(df, cat_vars=None, num_vars=None):
    '''performs one-hot encoding on all categorical variables and combines the result with the continuous variables'''
    cat_df = pd.get_dummies(df[cat_vars])
    num_df = df[num_vars].apply(pd.to_numeric)
    return pd.concat([cat_df, num_df], axis=1)

In [10]:
# defining categorical variable and numerical variables
categorical_vars = ['companyId','jobType', 'degree', 'major', 'industry']
numerical_vars = ['yearsExperience', 'milesFromMetropolis']
target_var = 'salary'

In [11]:
train_df = one_hot_encode_df(train_df,cat_vars=categorical_vars, num_vars=numerical_vars)

In [12]:
train_df.head()

Unnamed: 0,companyId_COMP0,companyId_COMP1,companyId_COMP10,companyId_COMP11,companyId_COMP12,companyId_COMP13,companyId_COMP14,companyId_COMP15,companyId_COMP16,companyId_COMP17,...,major_PHYSICS,industry_AUTO,industry_EDUCATION,industry_FINANCE,industry_HEALTH,industry_OIL,industry_SERVICE,industry_WEB,yearsExperience,milesFromMetropolis
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,10,83
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,3,73
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,10,38
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,8,17
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,8,16


## Splitting data into training and validation set and establishing the baseline
 - I will use sklearn to split the data into training and validation sets in an 80:20 ratio.
 - The metric we will use is Mean Squared Error (MSE). MSE is the average of the squared errors, i.e., the squared differences between the actual value (y) and the estimated value (ŷ).
 - The baseline model is a simple linear regression model; we will use its results to hypothesize better solutions.
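To make the metric concrete, the sketch below computes MSE by hand and compares it with `mean_squared_error` from sklearn. The three values are invented for illustration and are not taken from the salary data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 200.0])

# MSE = mean of (y - y_hat)^2 over all samples
mse_manual = np.mean((y_true - y_pred) ** 2)

# The sklearn helper should give the same number
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```

Because the errors are squared, MSE is expressed in squared salary units and penalizes large misses more heavily than small ones.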

In [13]:
# Keeping only the rows of train_salaries whose salary is greater than 0
target_df = clean_data(target_df)

In [14]:
target_df = target_df.drop('jobId',axis = 1)

In [15]:
# Splitting the data into train and validation set
X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(train_df, target_df, test_size=0.2, random_state=1)

In [16]:
test_df = one_hot_encode_df(test_df,cat_vars=categorical_vars, num_vars=numerical_vars)

### Linear Regression

In [None]:
# Creating a linear regression object
lr = LinearRegression()

# Fitting the model with the train data
lr.fit(X_train_data, y_train_data)

# Predicting on the validation data
y_predict = lr.predict(X_test_data)
print("The first 5 predicted salaries: ", y_predict[0:5])

In [None]:
# Evaluating our model using validation set by calculating mean squared error
mse = mean_squared_error(y_test_data, y_predict)

In [None]:
print(mse)

In [None]:
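# 5-fold cross-validation on the training split.
# cross_val_score expects (estimator, X, y); for a regressor the default score is R^2.
Rcross = cross_val_score(lr, X_train_data, y_train_data, cv = 5)

print("The 5-fold cross-validation R^2 (mean, std) is: ", (Rcross.mean(), Rcross.std()))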

## Hypothesize a solution
#### On the baseline simple regression model the MSE is 384.87. We will try other techniques, along with feature engineering and hyperparameter tuning, to improve the model on the test set.
 - The models we will use to try to achieve a lower MSE are:
     - Random Forest Regressor
     - Gradient Boosting Regressor

### Random Forest Regressor

In [None]:
rf = RandomForestRegressor(n_estimators = 60, max_depth = 25, 
                           min_samples_split = 20, n_jobs = 2, 
                           max_features = 30)

In [None]:
#Fitting the object to training data
# ravel() turns the single-column target frame into the 1-D array sklearn expects
rf.fit(X_train_data, y_train_data.values.ravel())

In [None]:
# Checking the model's R^2 score on the validation set
rf.score(X_test_data, y_test_data)

In [None]:
# Making predictions on the validation data
y_predict_rf = rf.predict(X_test_data)
print("First 5 predictions on the validation set: ", y_predict_rf[0:5])

In [None]:
# Calculating mean squared error on the validation set
mean_squared_error(y_test_data, y_predict_rf)

### Gradient Boosting Regressor

In [None]:
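# Creating Gradient Boosting Regressor object
# The loss argument is omitted: least squares is the default, and the 'ls' alias
# was removed from newer scikit-learn releases
gd = GradientBoostingRegressor(n_estimators = 60, max_depth = 5, verbose = 5)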

In [None]:
#Fitting object to data
gd.fit(X_train_data, y_train_data.values.ravel())

In [None]:
y_predict_gd = gd.predict(X_test_data)

In [None]:
mean_squared_error(y_test_data, y_predict_gd)

### After hypothesizing solutions and testing them on the 20% held-out validation data, we will develop a solution using an object-oriented approach and predict the salaries on the test data.
 - We will follow an object-oriented approach to automate the process.
 - We will perform k-fold cross-validation with 5 folds.
 - We will try to automate the pipeline using the best model.
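One possible shape for that plan is sketched below. The `ModelSelector` class, its method names, and the synthetic data are all assumptions made for illustration; this is not the final solution, only a minimal outline of scoring candidate models with 5-fold cross-validation and refitting the one with the lowest mean MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


class ModelSelector:
    """Hypothetical helper: scores candidate models with k-fold CV and keeps the best."""

    def __init__(self, models, n_splits=5, random_state=1):
        self.models = models
        self.cv = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        self.mean_mse = {}
        self.best_name = None

    def evaluate(self, X, y):
        for name, model in self.models.items():
            # sklearn reports negated MSE so that higher is better; flip the sign back
            scores = cross_val_score(model, X, y, cv=self.cv,
                                     scoring='neg_mean_squared_error')
            self.mean_mse[name] = -scores.mean()
        # The best model is the one with the lowest mean MSE across folds
        self.best_name = min(self.mean_mse, key=self.mean_mse.get)
        return self.mean_mse

    def fit_best(self, X, y):
        # Refit the winning model on all of the supplied data
        return self.models[self.best_name].fit(X, y)


# Synthetic stand-in data, not the salary dataset
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 50 + 30 * X_demo[:, 0] - 10 * X_demo[:, 1] + rng.normal(0, 1, 200)

selector = ModelSelector({
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=20, random_state=1),
})
cv_mse = selector.evaluate(X_demo, y_demo)
best_model = selector.fit_best(X_demo, y_demo)
print(cv_mse, selector.best_name)
```

With the real data, `X_demo` and `y_demo` would be replaced by the encoded training features and the salary column, and `best_model.predict` would then be called on the encoded test features.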