# Predict Employee Attrition

```python
# Libraries
import pandas as pd
import numpy as np
import sklearn

pd.set_option("display.max_columns", 101)
pd.set_option('display.max_colwidth', 100)
```

## Data Description

Column | Description
:---|:---
`emp_id` | Unique ID of the employee
`MonthlyIncome` | Monthly income of the employee
`EmployeeNumber` | Number of employees in the employee's division
`Age` | Age of the employee
`DistanceFromHome` | Distance from home to the office
`OverTime` | Whether the employee works overtime
`TotalWorkingYears` | Total working experience of the employee, in years
`StockOptionLevel` | Company stock option level given to the employee
`YearsAtCompany` | Number of years at the current company
`NumCompaniesWorked` | Number of companies the employee worked at before joining the current company
`YearsWithCurrManager` | Number of years with the current manager
`JobSatisfaction` | Job satisfaction of the employee (1 = lowest, 4 = highest)
`PercentSalaryHike` | Average annual salary hike, in percent
`Attrition` | Whether the employee left (0 = no, 1 = yes)

```python
# Load train and test data, following given naming conventions
data = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```

```python
data.dtypes
```




    emp_id                   int64
    MonthlyIncome            int64
    EmployeeNumber          object
    Age                      int64
    DistanceFromHome         int64
    OverTime                object
    TotalWorkingYears       object
    StockOptionLevel         int64
    YearsAtCompany          object
    NumCompaniesWorked       int64
    YearsWithCurrManager     int64
    JobSatisfaction          int64
    PercentSalaryHike        int64
    Attrition                int64
    dtype: object




```python
test.dtypes
```




    emp_id                   int64
    MonthlyIncome            int64
    EmployeeNumber          object
    Age                      int64
    DistanceFromHome         int64
    OverTime                object
    TotalWorkingYears       object
    StockOptionLevel         int64
    YearsAtCompany          object
    NumCompaniesWorked       int64
    YearsWithCurrManager     int64
    JobSatisfaction          int64
    PercentSalaryHike        int64
    dtype: object




```python
data.head()
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>emp_id</th>
      <th>MonthlyIncome</th>
      <th>EmployeeNumber</th>
      <th>Age</th>
      <th>DistanceFromHome</th>
      <th>OverTime</th>
      <th>TotalWorkingYears</th>
      <th>StockOptionLevel</th>
      <th>YearsAtCompany</th>
      <th>NumCompaniesWorked</th>
      <th>YearsWithCurrManager</th>
      <th>JobSatisfaction</th>
      <th>PercentSalaryHike</th>
      <th>Attrition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>19468</td>
      <td>972</td>
      <td>51</td>
      <td>2</td>
      <td>Yes</td>
      <td>24</td>
      <td>0</td>
      <td>12</td>
      <td>3</td>
      <td>6</td>
      <td>1</td>
      <td>14</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>3462</td>
      <td>443</td>
      <td>28</td>
      <td>2</td>
      <td>Yes</td>
      <td>5</td>
      <td>0</td>
      <td>3</td>
      <td>4</td>
      <td>2</td>
      <td>3</td>
      <td>12</td>
      <td>1</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2</td>
      <td>5295</td>
      <td>1654</td>
      <td>39</td>
      <td>12</td>
      <td>No</td>
      <td>7</td>
      <td>0</td>
      <td>5</td>
      <td>4</td>
      <td>0</td>
      <td>2</td>
      <td>21</td>
      <td>0</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3</td>
      <td>2073</td>
      <td>1592</td>
      <td>23</td>
      <td>10</td>
      <td>No</td>
      <td>4</td>
      <td>1</td>
      <td>2</td>
      <td>2</td>
      <td>2</td>
      <td>3</td>
      <td>16</td>
      <td>0</td>
    </tr>
    <tr>
      <th>4</th>
      <td>4</td>
      <td>-100</td>
      <td>106</td>
      <td>55</td>
      <td>1</td>
      <td>No</td>
      <td>24</td>
      <td>1</td>
      <td>1</td>
      <td>3</td>
      <td>0</td>
      <td>4</td>
      <td>14</td>
      <td>0</td>
    </tr>
  </tbody>
</table>
</div>



`MonthlyIncome` contains negative values (for example, -100 in row 4), which cannot be valid incomes. There are three ways to handle them: drop the affected records, replace the invalid values with a measure of central tendency such as the median, or set them to 0.

I chose to drop the records with negative values.
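
The three options can be compared on a small example before committing to one. The sketch below uses a made-up toy frame rather than `train.csv`, and the names `toy`, `dropped`, `imputed`, and `zeroed` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the training data (values are illustrative)
toy = pd.DataFrame({
    "MonthlyIncome": [19468, 3462, -100, 2073],
    "Age": [51, 28, 55, 23],
})

# Count negative entries per numeric column to see how much data is affected
negative_counts = (toy.select_dtypes(include="number") < 0).sum()
print(negative_counts)

valid = toy["MonthlyIncome"] >= 0

# Option 1: drop the affected rows
dropped = toy[valid]

# Option 2: replace invalid values with the median of the valid ones
median_income = toy.loc[valid, "MonthlyIncome"].median()
imputed = toy.assign(MonthlyIncome=toy["MonthlyIncome"].where(valid, median_income))

# Option 3: set invalid values to 0
zeroed = toy.assign(MonthlyIncome=toy["MonthlyIncome"].clip(lower=0))
```

Dropping is the simplest choice when only a few rows are affected; imputing keeps the row count intact at the cost of inventing a value.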

We also drop non-informative columns from the train and test data.


```python
# Drop EmployeeNumber column as it's irrelevant
data = data.drop(columns=['EmployeeNumber'])
test = test.drop(columns=['EmployeeNumber'])

# Encode OverTime as a binary 0/1 flag (Yes -> 1, No -> 0)
data['OverTime'] = data['OverTime'].map({'Yes': 1, 'No': 0})
test['OverTime'] = test['OverTime'].map({'Yes': 1, 'No': 0})

# Convert 'YearsAtCompany' and 'TotalWorkingYears' to numeric
data['YearsAtCompany'] = pd.to_numeric(data['YearsAtCompany'], errors='coerce')
test['YearsAtCompany'] = pd.to_numeric(test['YearsAtCompany'], errors='coerce')
data['TotalWorkingYears'] = pd.to_numeric(data['TotalWorkingYears'], errors='coerce')
test['TotalWorkingYears'] = pd.to_numeric(test['TotalWorkingYears'], errors='coerce')

# Drop rows containing NaN values
data = data.dropna()
test = test.dropna()

# Drop rows where any numerical column is negative
data = data[(data.select_dtypes(include=['number']) >= 0).all(axis=1)]
test = test[(test.select_dtypes(include=['number']) >= 0).all(axis=1)]

# Separate features and target from the training data
X = data.drop(columns=['emp_id', 'Attrition'])
y = data['Attrition']

# Save the emp_id from the test data for later use in the submission file
test_emp_id = test['emp_id']

# Drop the emp_id column from the test data to prepare for prediction
X_test = test.drop(columns=['emp_id'])
```
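
Note that the cell above also drops rows from `test`, so any `emp_id` removed there receives no prediction. If the submission needs a score for every test employee (an assumption; the task statement shown here does not say), one alternative is to keep all test rows and fill invalid entries with medians computed on the training data. A minimal sketch on made-up toy frames, with `train_toy`, `test_toy`, and `test_filled` as illustrative names:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the cleaned training features and the raw test data (made-up values)
train_toy = pd.DataFrame({
    "MonthlyIncome": [2000, 3000, 10000],
    "YearsAtCompany": [1.0, 5.0, 12.0],
})
test_toy = pd.DataFrame({
    "emp_id": [100, 101, 102],
    "MonthlyIncome": [-100, 4000, 2500],
    "YearsAtCompany": [3.0, np.nan, 7.0],
})

feature_cols = ["MonthlyIncome", "YearsAtCompany"]

# Medians come from the training data only, so no test information leaks into them
train_medians = train_toy[feature_cols].median()

# Treat negatives as missing, then fill every gap with the training median
features = test_toy[feature_cols]
test_filled = test_toy.copy()
test_filled[feature_cols] = features.mask(features < 0).fillna(train_medians)
print(test_filled)
```

Every `emp_id` survives this step, so the prediction file can cover the full test set.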

## Machine Learning

Build a machine learning model that can predict the attrition probability of an employee.
- **The model's performance will be evaluated on the basis of AUC ROC.**
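
For intuition about the metric: ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The toy labels and scores below are made up for illustration; the manual pairwise count is only a check against `roc_auc_score`, not how scikit-learn computes it internally.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for four employees
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Manual check: fraction of (positive, negative) pairs ranked correctly
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(auc, manual_auc)  # both are 0.75: three of the four pairs are ordered correctly
```

Because the metric depends only on how examples are ranked, the model should be scored on predicted probabilities (`predict_proba`) rather than hard class labels.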

## Logistic Regression


```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)  # Don't forget to scale the test data too!

# Create the logistic regression model
model = LogisticRegression(max_iter=10000)  # Raised from the default of 100 so the solver has room to converge

# Train the model
model.fit(X_train, y_train)

# Validate the model
y_val_pred = model.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f'Validation Accuracy: {val_accuracy}')
```

    Validation Accuracy: 0.6946107784431138



```python
from sklearn.metrics import roc_auc_score

# Predict the probabilities for the validation set
y_val_proba = model.predict_proba(X_val)[:, 1]  # Get the probabilities for the positive class

# Compute the ROC AUC score
roc_auc = roc_auc_score(y_val, y_val_proba)
print(f'ROC AUC Score: {roc_auc}')
```

    ROC AUC Score: 0.7078713968957872
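
A single 80/20 split gives a noisy estimate of ROC AUC. One way to get a steadier number is k-fold cross-validation with the scaler wrapped in a pipeline, so that it is refit on each training fold. The sketch below runs on synthetic data from `make_classification`, not on the attrition data, and the names `X_demo`, `y_demo`, and `cv_auc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling lives inside the pipeline, so each fold fits the scaler on its own training part
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {cv_auc.round(3)}")
print(f"Mean ROC AUC: {cv_auc.mean():.3f}")
```

The same pattern should apply to the real `X` and `y` once they are prepared, and the spread across folds indicates how much a single validation score can be trusted.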


The Random Forest model trained in the next section scores a higher ROC AUC than logistic regression, so we use it as our final model.

## Random Forest


```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split the data into training and validation sets (80% training, 20% validation)
X_train, X_val, y_train, y_val = tra