## Steps

1. Import libraries and load data
2. Handle missing values, if present (train and test)
3. Convert categorical columns to numerical (train and test)
4. Feature engineering / feature selection (train and test)
5. Split the dataset into train and test data
6. Import sklearn and create an object of the ML algorithm
7. Fit and predict / training and testing (traindf)
8. Calculate the accuracy_score
9. Calculate the probability (if needed)
10. Repeat steps 7-9 for the test data (if needed)

## 1. Import libraries and load data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
traindf = pd.read_csv('train.csv')
testdf = pd.read_csv('test.csv')

In [None]:
traindf.head()

In [None]:
testdf.head()

## 2. Handling Missing values (if any)

You can use df.isnull() or its alias df.isna() to find missing values.
You can also use a heatmap to visualize null values.

In [None]:
traindf.isna().sum()

In [None]:
sns.heatmap(testdf.isnull(), yticklabels = False)

There are no null values in the train or test data.
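
If nulls had been present, they would need to be counted and then filled or dropped. The sketch below shows one possible approach on a hypothetical toy frame (the column names and values are made up for illustration, not taken from train.csv): count nulls per column, then fill numeric columns with the median and categorical columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Department": ["Sales", "R&D", None, "Sales"],
})

# Count missing values per column (isna and isnull are aliases)
missing_counts = toy.isna().sum()

# One common option: median for numeric columns, most frequent value for categorical ones
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Department"] = filled["Department"].fillna(filled["Department"].mode()[0])

print(missing_counts)
print(filled)
```

Whether to fill or drop depends on how many values are missing and on the column; this is only one reasonable default.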

## 3. Categorical to numerical

To convert categorical variables to numerical ones, you can use any of the following methods:
1. pd.get_dummies (one-hot encoding)
2. sklearn.preprocessing.LabelEncoder
3. Removing the categorical column
4. astype (casting the column to another dtype)
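
To make the difference between these options concrete, here is a minimal sketch on a hypothetical one-column frame (the `Gender` column and its values are illustrative assumptions). For `astype`, it assumes a cast to the pandas `category` dtype followed by reading the integer codes, which is one common reading of that option.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# 1. One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["Gender"], prefix="Gender")

# 2. LabelEncoder: integer codes assigned in sorted order of the labels
label_codes = LabelEncoder().fit_transform(toy["Gender"])

# 3. astype: cast to the category dtype and read the integer codes
category_codes = toy["Gender"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(label_codes.tolist())
print(category_codes.tolist())
```

One-hot encoding avoids implying an order between categories, at the cost of more columns; integer codes are compact and usually fine for tree-based models such as a random forest.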

#### LabelEncoder

In [None]:
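from sklearn.preprocessing import LabelEncoder

for column in traindf.columns:
    # Skip columns that are already numeric (comparing a dtype to np.number is unreliable)
    if pd.api.types.is_numeric_dtype(traindf[column]):
        continue
    # Replace each category with an integer code
    traindf[column] = LabelEncoder().fit_transform(traindf[column])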
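
The loop above encodes only traindf, while the model is later applied to testdf as well, so both frames need the same category-to-integer mapping. One way to get that, sketched here as an alternative and not as this notebook's method, is sklearn's `OrdinalEncoder` fitted on the train frame and reused on the test frame. The toy frames and their values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; "Legal" appears only in the test data
train_toy = pd.DataFrame({"Department": ["Sales", "R&D", "HR"], "Age": [30, 41, 25]})
test_toy = pd.DataFrame({"Department": ["R&D", "Legal"], "Age": [35, 29]})

# Pick the non-numeric columns once, from the train frame
cat_cols = train_toy.select_dtypes(exclude="number").columns

# Unseen categories in the test data are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_toy[cat_cols] = encoder.fit_transform(train_toy[cat_cols])
test_toy[cat_cols] = encoder.transform(test_toy[cat_cols])

print(train_toy)
print(test_toy)
```

Fitting on train and only transforming test keeps the integer codes consistent between the two frames.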

#### pd.get_dummies method

In [None]:
'''
# Disabled alternative: one-hot encode every non-numeric column of testdf in one call.
# Assigning pd.get_dummies(...) back to a single column fails when it returns several columns.
testdf = pd.get_dummies(testdf)
'''
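
When one-hot encoding train and test data separately, the two results can end up with different columns if a category is missing from one of them. A minimal sketch of one way to align them, using hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames; "HR" does not occur in the test data
train_toy = pd.DataFrame({"Dept": ["Sales", "R&D", "HR"], "Age": [30, 41, 25]})
test_toy = pd.DataFrame({"Dept": ["R&D", "Sales"], "Age": [35, 29]})

# One-hot encode the categorical column in each frame
train_encoded = pd.get_dummies(train_toy, columns=["Dept"])
test_encoded = pd.get_dummies(test_toy, columns=["Dept"])

# Give the test frame the same columns, in the same order, as the train frame;
# indicator columns missing from the test data are filled with 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.columns.tolist())
print(test_encoded)
```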

## 4. Feature Engineering / Feature Selection

#### Drop columns in the train and test datasets

In [None]:
cols = ['Id', 'EmployeeNumber', 'Behaviour']

# Drop the listed columns from both frames in place
traindf.drop(columns=cols, inplace=True)
testdf.drop(columns=cols, inplace=True)

In [None]:
plt.figure(figsize=(14,14))
sns.heatmap(traindf.corr(), annot=True, fmt='.0%')
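
A large heatmap can be hard to read, so it may help to look only at each feature's correlation with the target. The sketch below does this on a hypothetical toy frame; `OverTime`, `Noise`, and the 0.1 cutoff are illustrative assumptions, and only `Attrition` is a name taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data: OverTime mirrors the target, Noise is uncorrelated with it
toy = pd.DataFrame({
    "Attrition": [0, 1, 0, 1, 0, 1],
    "OverTime": [0, 1, 0, 1, 0, 1],
    "Noise": [3, 1, 2, 2, 1, 3],
})

# Absolute correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Attrition"]
    .drop("Attrition")
    .abs()
    .sort_values(ascending=False)
)

# Keep features above an arbitrary, illustrative cutoff
selected = target_corr[target_corr > 0.1].index.tolist()

print(target_corr)
print(selected)
```

Pearson correlation only captures linear relationships, so a low value here does not prove a feature is useless to a random forest.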

## 5. Split dataset into train and test data (traindf)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(traindf.drop('Attrition', axis=1),
                                                   traindf['Attrition'], test_size = 0.30)
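
The split above is random on every run. If reproducibility or class balance matters, `train_test_split` also accepts `random_state` and `stratify`. A small sketch on synthetic arrays (the data and the seed 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 20% positive labels
X = np.arange(100).reshape(50, 2)
y = np.array([1] * 10 + [0] * 40)

# stratify=y keeps the class ratio the same in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```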

## 6. Import sklearn and create an object of ml algorithm

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=250, random_state=79)

## 7. Fit and predict / training and testing (traindf)

In [None]:
model.fit(X_train, y_train)      # train the model on the training split
preds = model.predict(X_test)    # predict labels for the held-out split
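
A fitted random forest also exposes `feature_importances_`, which can feed back into the feature selection step. The sketch below uses synthetic data from `make_classification` and hypothetical feature names `f0` to `f4`; it is an illustration, not a result from this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features, 2 of which are informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(5)]  # hypothetical names

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; higher means the feature was more useful for splitting
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```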

## 8. Calculate the accuracy_score

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, preds)
accuracy
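
Accuracy is simply the fraction of predictions that match the true labels. A small worked example with made-up labels shows that `accuracy_score` agrees with the manual calculation, and uses `confusion_matrix` to show where the errors fall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 1])

# Fraction of matching positions: 4 of 6 are correct
manual_accuracy = (y_true == y_pred).mean()
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print(cm)
```

With imbalanced classes, a high accuracy can hide poor performance on the minority class, which is why the confusion matrix is worth a look.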

In [None]:
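from sklearn.metrics import roc_auc_score, roc_curve

# Use the positive-class probabilities for both the AUC and the curve,
# so the area shown in the legend matches the plotted curve
rf_probs = model.predict_proba(X_test)[:, 1]
rf_roc_auc = roc_auc_score(y_test, rf_probs)

rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf_probs)
plt.figure()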

plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('ROC')
plt.show()
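
The AUC depends on whether it is computed from probabilities or from hard 0/1 predictions. A tiny example with made-up scores illustrates the gap; the 0.5 cutoff mirrors what `predict` does for a binary classifier.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.9]

# Thresholding at 0.5 throws away the ranking information in the scores
hard_labels = [int(s >= 0.5) for s in scores]

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores)  # 1.0: every positive is ranked above every negative
print(auc_from_labels)  # 0.75: one positive is tied with the negatives
```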

## 9. Calculate the probability (if needed)

In [None]:
train_prob = model.predict_proba(X_test)[:, 1]
train_prob
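
The `[:, 1]` index picks the second column of `predict_proba`, whose columns follow the order of `model.classes_`. The sketch below, on synthetic data, looks up the positive-class column explicitly and applies a custom decision threshold; the 0.3 threshold is a hypothetical value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One column per class, in the order given by clf.classes_; each row sums to 1
proba = clf.predict_proba(X_demo)
positive_col = list(clf.classes_).index(1)
positive_proba = proba[:, positive_col]

# Hypothetical lower threshold: flags more rows as positive than the default 0.5
threshold = 0.3
custom_preds = (positive_proba >= threshold).astype(int)

print(clf.classes_)
print(positive_proba[:5])
print(custom_preds.sum(), clf.predict(X_demo).sum())
```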

## 10. Test data

### 10.1 Predict / test the trained model with test data

In [None]:
test_preds = model.predict(testdf)

### 10.2 Calculate the accuracy score 

In [None]:
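# NOTE: y_test comes from the split of traindf, not from test.csv, so its rows do not
# correspond to test_preds; a meaningful score needs the true labels for testdf.
test_acc = accuracy_score(y_test[:-19], test_preds)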
test_acc

### 10.3 Calculate the test probability

In [None]:
test_prob = model.predict_proba(testdf)[:, 1]
test_prob

#### ROC

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

# NOTE: y_test[:-19] holds labels from the traindf split, not the true labels for testdf
test_probs = model.predict_proba(testdf)[:, 1]
rf_roc_auc = roc_auc_score(y_test[:-19], test_probs)

rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test[:-19], test_probs)
plt.figure()

plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('ROC')
plt.show()