# Introduction to machine learning
## New Models: Random Forest

## Question

Train a random forest model. The test set accuracy should be at least 0.88.

**Hint**

Try `n_estimators` values from 1 to 10 and pick the one that gives the best accuracy on the validation set.
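The search described in the hint can be sketched as follows. This is a minimal illustration, not the notebook's own solution: the data comes from `make_classification` as a synthetic stand-in (an assumption), and the names `X_train`, `X_valid`, `best_n`, and `best_score` are chosen here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the real features and target.
X, y = make_classification(n_samples=2000, n_features=6, random_state=1234)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

best_n, best_score = None, 0.0
for n in range(1, 11):  # n_estimators from 1 to 10, as the hint suggests
    forest = RandomForestClassifier(n_estimators=n, random_state=1234)
    forest.fit(X_train, y_train)
    score = forest.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best_score:
        best_n, best_score = n, score

print(f"Best n_estimators: {best_n}, validation accuracy: {best_score:.4f}")
```

The important detail is that each candidate is scored on the validation split rather than on the training split, since training accuracy tends to keep rising as trees are added and says little about generalisation.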



In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression



In [None]:
# load the data and preview it
hrdataset_df = pd.read_csv('hrdataset.csv')
hrdataset_df.head()
hrdataset_df.sample(10)
hrdataset_df.shape

(54808, 14)

In [None]:
# check for missing values, duplicates, data types, and column names
hrdataset_df.isnull().sum()  # education and previous_year_rating have many missing values, so these columns are dropped later
hrdataset_df.duplicated().any()  # no duplicated observations
hrdataset_df.dtypes  # data types are appropriate for the values
hrdataset_df.nunique()  # unique values per column; 'is_promoted' is the classification label
hrdataset_df.columns  # column names start with a capital letter


In [None]:

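# standardise the column names: lower-case, strip whitespace, and remove '?', '>', '%' and spaces
# regex=False makes each replacement literal, so '?' is not read as a regex quantifier
hrdataset_df.columns = (
    hrdataset_df.columns.str.lower().str.strip()
    .str.replace('?', '', regex=False).str.replace('>', '', regex=False)
    .str.replace('%', '', regex=False).str.replace(' ', '', regex=False)
)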
hrdataset_df.columns

  


Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'kpis_met80', 'awards_won', 'avg_training_score',
       'is_promoted'],
      dtype='object')
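The column clean-up above can be tried in isolation. The sketch below is only illustrative: the raw header strings are hypothetical (the notebook does not show the original names), and the loop is just one way to apply several literal replacements.

```python
import pandas as pd

# Hypothetical raw headers (assumption): the original names are not shown in the notebook.
raw = pd.DataFrame(columns=["KPIs_met >80%", "awards_won?", " Age "])

cleaned = raw.columns.str.lower().str.strip()
for ch in ["?", ">", "%", " "]:
    # regex=False treats each character literally instead of as a regex pattern
    cleaned = cleaned.str.replace(ch, "", regex=False)

print(list(cleaned))  # ['kpis_met80', 'awards_won', 'age']
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, because the default for `regex` has changed over time.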

In [None]:
# drop employee_id, the columns with missing data, and the object-type columns
cleaned_df = hrdataset_df.drop(columns=['employee_id','department','gender','recruitment_channel','previous_year_rating', 'education','region'])

cleaned_df.head()


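|   | no_of_trainings | age | length_of_service | kpis_met80 | awards_won | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 35 | 8 | 1 | 0 | 49 | 0 |
| 1 | 1 | 30 | 4 | 0 | 0 | 60 | 0 |
| 2 | 1 | 34 | 7 | 0 | 0 | 50 | 0 |
| 3 | 2 | 39 | 10 | 0 | 0 | 50 | 0 |
| 4 | 1 | 45 | 2 | 0 | 0 | 73 | 0 |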


In [None]:
# data modelling
# there is no separate test dataset, so split the data into training and validation sets
# with the train_test_split function
#when using the valid_df no model achieved accuracy of >=0.85
train_df, valid_df = train_test_split(cleaned_df, test_size=0.25, random_state=1234)
print(train_df.shape)
print(valid_df.shape)


(41106, 7)
(13702, 7)
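The question asks for test set accuracy, while the split above produces only training and validation sets. If a held-out test set is also wanted, one possible approach is to call `train_test_split` twice. The sketch below uses a small stand-in frame and the names `demo_df`, `train_part`, `valid_part`, and `test_part`, all of which are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame (assumption): any DataFrame with a target column works the same way.
demo_df = pd.DataFrame({"feature": range(1000), "is_promoted": [0, 0, 0, 1] * 250})

# Hold out 20% as the test set, then split the remaining 80% into 75/25.
# Overall this gives 60% train, 20% validation, 20% test.
rest_part, test_part = train_test_split(demo_df, test_size=0.2, random_state=1234)
train_part, valid_part = train_test_split(rest_part, test_size=0.25, random_state=1234)

print(train_part.shape, valid_part.shape, test_part.shape)  # (600, 2) (200, 2) (200, 2)
```

With this layout the validation set is used to choose hyperparameters, and the test set is scored only once at the end with the chosen model.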


In [None]:
# create features and target for both the training and validation sets
features_train = train_df.drop(columns=['is_promoted'])
target_train = train_df['is_promoted']
features_valid = valid_df.drop(columns=['is_promoted'])
target_valid = valid_df['is_promoted']

# create models for decision tree, random forest, and logistic regression
# decision tree: try max_depth values from 1 to 10 and compare validation accuracy
for d in range(1, 11, 1):
  model = DecisionTreeClassifier(random_state=1234, max_depth=d)
  model.fit(features_train, target_train)  #train the model
  #check for accuracy
  print(f'Decision tree has accuracy of: {model.score(features_valid, target_valid)} for depth of: {d}')

# # random forest: find the best n_estimators value (scored on the validation set, not the training set)
# for n in range(1,20,1):
#   forest_model = RandomForestClassifier(random_state=1234, n_estimators=n)
#   forest_model.fit(features_train, target_train)
#   print(f'Random forest has accuracy of: {forest_model.score(features_valid, target_valid)} for n={n}')

# # logistic regression (scored on the validation set, not the training set)
# log_model = LogisticRegression(random_state=1234, solver='liblinear')
# log_model.fit(features_train, target_train)
# print(f'logistic regression has accuracy of: {log_model.score(features_valid, target_valid)}')




Decision tree has accuracy of: 0.9216172821485914 for depth of: 1
Decision tree has accuracy of: 0.9216172821485914 for depth of: 2
Decision tree has accuracy of: 0.9216172821485914 for depth of: 3
Decision tree has accuracy of: 0.9241716537731718 for depth of: 4
Decision tree has accuracy of: 0.923952707633922 for depth of: 5
Decision tree has accuracy of: 0.923441833309006 for depth of: 6
Decision tree has accuracy of: 0.9235877974018392 for depth of: 7
Decision tree has accuracy of: 0.9227849948912568 for depth of: 8
Decision tree has accuracy of: 0.9232228871697562 for depth of: 9
Decision tree has accuracy of: 0.9217632462414246 for depth of: 10


### Findings and Recommendations
*   The decision tree reaches a validation accuracy of about 0.92 at every depth from 1 to 10 (best: 0.9242 at depth 4), so tree depth has little effect here. This is above the 0.88 threshold from the question.
*   Given a new observation, the trained model can now predict whether that employee will be promoted.
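Whether an accuracy of 0.92 is good depends on how balanced the classes are, which the notebook does not report. One common sanity check is to compare the model against a baseline that always predicts the most frequent class. The sketch below uses synthetic, deliberately imbalanced data (an assumption, not the HR dataset), and the names `baseline_acc` and `tree_acc` are chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (assumption): roughly 90% of samples belong to class 0.
X, y = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=1234
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_valid, y_valid)

tree = DecisionTreeClassifier(max_depth=4, random_state=1234)
tree.fit(X_train, y_train)
tree_acc = tree.score(X_valid, y_valid)

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Decision tree accuracy: {tree_acc:.4f}")
```

If the tree's accuracy is only slightly above the baseline, accuracy alone says little about how well promotions are predicted, and a look at `target_train.value_counts(normalize=True)` on the real data would show which case applies.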






