## Group Assignment: Machine Learning 2

This year's Group Assignment focuses on predicting employee attrition. The dataset is available on Kaggle: [Employee Attrition competition](https://www.kaggle.com/competitions/playground-series-s3e3/data)

* The goal is to predict whether an employee will leave the company or not (`Attrition` column, binary classification).
* The dataset is artificially generated, but it is based on real data: [original data](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset)
    * There you can find what each feature represents and its possible values.
* You're given a training set and a test set. 
    * Perform your analysis, experiments and model selection on the training set, and don't touch the test set until you're ready to submit your predictions.
    * You can save some data from the training set to use as a validation set, but you should not use the test set for this purpose.
    * Once you're comfortable with the performance of your model on the training set, you can use the test set to get a final estimate of the performance of your model.
    * The CSV file that you submit should contain your model's predictions on the test set. This means that the CSV should contain as many rows as the test set, and a single column with the predictions (0 or 1).
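
The workflow above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (a stand-in for `train.csv` and `test.csv`), the single `feature` column and the `LogisticRegression` model are placeholders, and the stratified split is one reasonable choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for train.csv / test.csv (hypothetical data, not the Kaggle files).
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"id": np.arange(200), "feature": rng.normal(size=200)})
# Label the top 20% of `feature` as attrition so both classes are present.
cutoff = train_df["feature"].quantile(0.8)
train_df["Attrition"] = (train_df["feature"] > cutoff).astype(int)
test_df = pd.DataFrame({"id": np.arange(200, 250), "feature": rng.normal(size=50)})

X = train_df[["feature"]]
y = train_df["Attrition"]

# Carve a validation set out of the training data; the test set is not touched here.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression().fit(X_fit, y_fit)
val_accuracy = clf.score(X_val, y_val)  # model selection uses the validation set only

# Only once the model is chosen: predict on the test set and build the submission.
submission = pd.DataFrame({"Attrition": clf.predict(test_df[["feature"]])})
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk

print(f"validation accuracy: {val_accuracy:.2f}")
print(f"submission rows: {len(submission)} (test rows: {len(test_df)})")
```

Stratifying on the target keeps the attrition rate the same in both parts of the split, which tends to matter when the positive class is small.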

### 1. Rules

* You should work within your team. If I see any signs of cheating between teams, that's an immediate fail for both groups.
  * If you have any question, post it on Slack and I will answer it as soon as possible.
* The grade will be the same for all the members of the team.
* Everyone within the team should contribute and all of you will have to either present or answer questions during the presentation.
  * If somebody in the team is not collaborating, let me know as soon as possible, since I will not accept any excuses at the end of the term.
  * Help the others in the team understand the code and the results, because they might be asked to explain them during the presentation.
* It's fine to explore and learn from outside sources.
  * I want to see that you are learning and trying to improve your skills. However, you should not just copy-paste code from other sources: tell me where it comes from and show me that it's useful for your assignment.
  * If you just copy and paste, I will know and I will ask you about it, so be prepared.
* The final submission should use any of the algorithms that we've seen during the course, so no neural networks or similar. There will be a time for that, but not now.


### 2. Submission

**1 ZIP file per group, named `submission_group_X.zip`, containing:**
  * A Jupyter Notebook with the code and the results. It must include the names of the team members.
  * A PDF file with the presentation.
  * A CSV file with the predictions for the test set.
  * Failing to submit any of the above in the required conditions will result in a 0 for the Group Assignment.
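
One way to assemble the ZIP file is sketched below. It uses only the Python standard library; the helper name `build_submission_zip`, the group number `7`, and the placeholder file names are all hypothetical, and the suffix check is just a convenience, not an official validation.

```python
import tempfile
import zipfile
from pathlib import Path


def build_submission_zip(group_number, files, out_dir):
    """Bundle the deliverables into submission_group_<N>.zip and return its path."""
    # The three deliverables listed above: notebook, presentation PDF, predictions CSV.
    required_suffixes = {".ipynb", ".pdf", ".csv"}
    found = {Path(f).suffix.lower() for f in files}
    missing = required_suffixes - found
    if missing:
        raise ValueError(f"missing deliverables: {sorted(missing)}")

    zip_path = Path(out_dir) / f"submission_group_{group_number}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store each file at the top level of the archive.
            zf.write(f, arcname=Path(f).name)
    return zip_path


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ("notebook.ipynb", "presentation.pdf", "predictions.csv"):
        p = Path(tmp) / name
        p.write_text("placeholder")
        paths.append(p)

    zip_path = build_submission_zip(7, paths, tmp)
    with zipfile.ZipFile(zip_path) as zf:
        archived = sorted(zf.namelist())
    zip_name = zip_path.name

print(zip_name, archived)
```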

### 3. Grade

The Group Assignment will account for 40% of the final grade. The grade will be based on the following criteria:
* **40%: PDF report and presentation**
    * The report should target an executive audience, so don't go too deep into the technical details of the algorithms.
    * It should cover a brief exploration of the data, the experiments done regarding feature engineering and the algorithms used, the results and the conclusions.
    * For the presentation, I will choose somebody from the team to present.
* **30%: Code in Jupyter notebook**
    * Comment why you're doing what you're doing.
    * Document your experiments and the results in the notebook.
    * Make sure that you do all your `import`s at the beginning of the notebook, so that I know what packages you're using. If your code doesn't run, I will not grade it fully.
* **30%: Answers to my questions during the presentation**
    * There will be questions about the data and its exploration, the experiments done with the features and with the algorithms, and obviously about the performance and results.

In [1]:
from collections import Counter

import numpy as np
import pandas as pd
import pandas_profiling
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, recall_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


In [2]:
train = pd.read_csv('/Users/mac/Desktop/group_assignment_ML2/data/train.csv', sep = ",")
test = pd.read_csv('/Users/mac/Desktop/group_assignment_ML2/data/test.csv', sep = ",")

## PREPROCESS

In [3]:
# Ordinal encoding for travel frequency.
travel_map = {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2}

# Raw columns that are replaced by the engineered features below (plus DistanceFromHome).
drop_cols = ["MonthlyIncome", "Age", "TotalWorkingYears", "JobInvolvement", "JobLevel",
             "EnvironmentSatisfaction", "RelationshipSatisfaction", "MonthlyRate",
             "HourlyRate", "DailyRate", "DistanceFromHome"]

# Apply exactly the same feature engineering to the training and test sets.
for df in (train, test):
    df['MonthlyIncome/Age'] = df['MonthlyIncome'] / df['Age']
    df['BusinessTravel'] = df['BusinessTravel'].replace(travel_map)
    df["Dedication"] = df["YearsAtCompany"] + df["YearsInCurrentRole"] + df["TotalWorkingYears"]
    df["JobSkill"] = df["JobInvolvement"] * df["JobLevel"]
    df["Satisfaction"] = df["EnvironmentSatisfaction"] * df["RelationshipSatisfaction"]
    df["MonthlyRateIncome"] = df["MonthlyIncome"] * df["MonthlyRate"]
    df["HourlyDailyRate"] = df["HourlyRate"] * df["DailyRate"]
    df.drop(columns=drop_cols, inplace=True)

In [4]:
train

|  | id | BusinessTravel | Department | Education | EducationField | EmployeeCount | Gender | JobRole | JobSatisfaction | MaritalStatus | ... | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition | MonthlyIncome/Age | Dedication | JobSkill | Satisfaction | MonthlyRateIncome | HourlyDailyRate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | Research & Development | 3 | Medical | 1 | Male | Laboratory Technician | 4 | Married | ... | 0 | 7 | 8 | 0 | 72.111111 | 20 | 3 | 8 | 13237004 | 25158 |
| 1 | 1 | 1 | Sales | 3 | Other | 1 | Male | Sales Representative | 1 | Married | ... | 2 | 0 | 3 | 0 | 82.828571 | 10 | 3 | 4 | 31245422 | 42366 |
| 2 | 2 | 1 | Sales | 3 | Marketing | 1 | Male | Sales Executive | 4 | Divorced | ... | 2 | 1 | 2 | 0 | 144.593750 | 9 | 6 | 12 | 76322365 | 57440 |
| 3 | 3 | 1 | Research & Development | 3 | Medical | 1 | Female | Healthcare Representative | 1 | Married | ... | 0 | 0 | 2 | 0 | 140.710526 | 21 | 6 | 9 | 71564248 | 59520 |
| 4 | 4 | 1 | Research & Development | 4 | Medical | 1 | Female | Manager | 1 | Single | ... | 14 | 4 | 10 | 1 | 380.660000 | 76 | 15 | 6 | 376948565 | 37629 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1672 | 1672 | 1 | Sales | 3 | Life Sciences | 1 | Female | Sales Executive | 3 | Single | ... | 0 | 0 | 8 | 0 | 290.733333 | 20 | 9 | 8 | 124332110 | 68985 |
| 1673 | 1673 | 1 | Research & Development | 3 | Life Sciences | 1 | Male | Research Scientist | 2 | Married | ... | 2 | 1 | 3 | 0 | 110.750000 | 16 | 3 | 4 | 56604768 | 62544 |
| 1674 | 1674 | 2 | Human Resources | 3 | Human Resources | 1 | Male | Human Resources | 1 | Married | ... | 0 | 0 | 0 | 1 | 96.689655 | 2 | 2 | 6 | 42962888 | 42624 |
| 1675 | 1675 | 1 | Sales | 2 | Marketing | 1 | Male | Sales Executive | 3 | Divorced | ... | 3 | 0 | 8 | 0 | 150.166667 | 23 | 8 | 6 | 21899706 | 21168 |


## MODEL

In [24]:
X = train.drop(["Attrition"], axis=1)
y = train["Attrition"]

# Hold out 20% of the training data for validation.
# X_test / y_test here are a validation split, not the Kaggle test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=99)

Sc = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore')

numeric_features = ["Education", "NumCompaniesWorked", "PercentSalaryHike", "MonthlyRateIncome",
                    "HourlyDailyRate", "TrainingTimesLastYear", "YearsSinceLastPromotion",
                    "MonthlyIncome/Age", "YearsWithCurrManager", "Dedication", "JobSkill"]
# NOTE: "EducationField" appears twice in this list; the second entry is redundant.
categorical_features = ["BusinessTravel", "Gender", "EducationField", "Department", "MaritalStatus",
                        "EducationField", "JobRole", "OverTime", "WorkLifeBalance", "Satisfaction"]

# Scale numeric columns and one-hot encode categorical ones; all other columns are dropped.
preprocess = ColumnTransformer(
    transformers=[
        ('num', Sc, numeric_features),
        ('cat', ohe, categorical_features),
    ])

# NOTE: in XGBoost, `alpha` is an alias of `reg_alpha` and `eta` is an alias of
# `learning_rate`, so the last two of those settings below conflict with the earlier ones.
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', xgb.XGBClassifier(
        random_state=99,
        learning_rate=0.035,
        n_estimators=300,
        max_depth=1,
        reg_lambda=0.6,
        reg_alpha=0.6,
        scale_pos_weight=3,
        subsample=0.6,
        colsample_bytree=0.7,
        objective='binary:logistic',
        max_delta_step=2,
        tree_method='hist',
        alpha=1,
        eta=0,
        eval_metric='auc',
        min_child_weight=1
    )),
])

model.fit(X_train, y_train)

# Evaluate on the validation split: AUC uses probabilities, the other metrics use hard labels.
y_pred = model.predict(X_test)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("ROC AUC: ", roc_auc)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1)

ROC AUC:  0.8948458948458948
Precision:  0.8421052631578947
Recall:  0.41025641025641024
F1 score:  0.5517241379310346


In [6]:
# Predicted probability of attrition (class 1) for each row of the Kaggle test set.
y_pred = model.predict_proba(test)[:, 1]

In [7]:
y_pred.sum()

281.68726

In [8]:
Final_Results = test['id']
Final_Results = pd.DataFrame(Final_Results)
Final_Results['Attrition'] = y_pred
Final_Results = Final_Results.set_index('id')

In [9]:
Final_Results

| id | Attrition |
|---|---|
| 1677 | 0.464022 |
| 1678 | 0.300385 |
| 1679 | 0.254460 |
| 1680 | 0.202768 |
| 1681 | 0.679185 |
| ... | ... |
| 2791 | 0.440612 |
| 2792 | 0.108720 |
| 2793 | 0.113407 |
| 2794 | 0.160512 |


In [10]:
#Final_Results.to_csv('Final_Results_submission-1.csv')