# Activity 7

* Download the Activity7 lab and upload it onto Google Colab.
* Answer the Activity7 questions in Canvas and populate the cells below.
* Submit the Activity7 questions on Canvas and upload the PDF version of this lab:
>* To submit this lab as a PDF, go to File, click Print, then save it as a PDF instead of printing it.

# Business Problem

Ted Cordaro is the Director of HR Analytics at a large multinational corporation. Turnover has been high in the company's Research & Development Department in recent years, and Ted has been tasked with building a model to predict employee attrition. Attrition occurs when an employee leaves the company. Ted also wants to know which factors are most important in predicting which employees will leave. Ted has information about past and present employees, including demographics, job characteristics, and HR survey response data.


First, you will help Ted train and evaluate a Logistic Regression classification model. Then, you will help Ted train and evaluate a Decision Tree classification model.

An overview of the **HRData** dataset is below:

| Variable   |   Description |
| ----------- | ----------- |
| Age | The age of the employee (in years)|
| BusinessTravel | The level of travel the employee does for work (0: no travel, 1: some travel, 2: frequent travel)|
| DistanceFromHome | The distance the employee commutes to work (in miles)|
| Education | The employee's education level (ranging from 1-5)|
| EnvironmentSatisfaction |The employee's satisfaction with their work environment (ranging from 1-4)|
| Married | Indicates if the employee is married (1) or not (0)|
| MonthlyIncome | The employee's monthly pre-tax salary (in USD)|
| OverTime | Indicates if the employee is eligible for overtime compensation (1) or not (0)|
| PerformanceRating | The performance rating the employee received from their direct supervisor (ranging from 3-4)|
| StockOptionLevel | The level of stock options the employee has (ranging from 0-3)|
| TrainingTimesLastYear | The number of times the employee participated in training last year|
| WorkLifeBalance | The level of employee's work-life balance (ranging from 1-4)|
| YearsAtCompany | The number of years that the employee has been employed with the company|
| Attrition | Indicates if the employee left the company (1) or not (0)|


# Import Packages

In [None]:
# do not manipulate this cell - just run it

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

set_config(transform_output = "pandas")

# Data Import

In [None]:
# do not manipulate this cell - just run it

data = pd.read_csv('https://raw.githubusercontent.com/CHill-MSU/INFO265_Data/refs/heads/main/HRData.csv')
data.head()

In [None]:
# do not manipulate this cell - just run it

data.info()

In [None]:
# do not manipulate this cell - just run it

data.describe()

In [None]:
# do not manipulate this cell - just run it

data['Attrition'].plot.hist()

In [None]:
# do not manipulate this cell - just run it

X = data.drop('Attrition', axis = 1)
y = data['Attrition']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 123)

# Part 1: Logistic Regression

## Q1

* Which of the following code lines should you use to train a logistic regression model to predict if an employee will leave the company? Use balanced class weights to address class imbalance.
* Select the right answer from Canvas, paste it below, and run the cell.
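
As background on what "balanced" weighting means, the sketch below reproduces the heuristic that scikit-learn documents for balanced class weights: each class is weighted by `n_samples / (n_classes * class_count)`, so the rarer class counts for more during training. The helper name `balanced_class_weights` and the 80/20 label split are hypothetical and are not taken from the HRData dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical target: 80 employees stayed (0) and 20 left (1)
example_y = [0] * 80 + [1] * 20
weights = balanced_class_weights(example_y)

# The minority class (1) receives the larger weight
print(weights)
```

With these weights, each class contributes the same total weight to the training loss even though the classes differ in size.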

In [None]:
# Copy and paste your answer from Canvas here



## Q2

* Run the code cell below to output the fitted coefficients for your logistic regression model.
* Based on the output, which of the following statements is true?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

print('Intercept              ', logr.intercept_)
print(pd.DataFrame(zip(logr.feature_names_in_, np.transpose(logr.coef_.squeeze())), columns = ['Variable', 'Coefficient']))
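
When reading the output, recall that a logistic regression coefficient is a change in the log-odds of the positive class (here, Attrition = 1) for a one-unit increase in the variable, holding the others fixed. Exponentiating a coefficient gives an odds ratio. The sketch below illustrates the conversion; the coefficient values are hypothetical and are not the fitted values from this lab.

```python
import math

# Hypothetical coefficients for illustration only (NOT the fitted values)
example_coefs = {"OverTime": 1.2, "YearsAtCompany": -0.05}

# exp(coefficient) is the multiplicative change in the odds of attrition
odds_ratios = {name: math.exp(beta) for name, beta in example_coefs.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100
    print(f"{name}: odds ratio = {ratio:.3f} "
          f"({pct_change:+.1f}% change in odds per one-unit increase)")
```

A positive coefficient yields an odds ratio above 1 (higher odds of leaving), and a negative coefficient yields an odds ratio below 1 (lower odds of leaving).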

## Q3

* No code needed.
* Select the right answer in Canvas.

## Q4

* Run the code cell below to obtain predictions and output the confusion matrix.
* Based on the output, the specificity of the model is _____?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

preds_lr = logr.predict(X = test_X)
ConfusionMatrixDisplay.from_predictions(test_y, preds_lr)
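
As a reminder of how to read the plot: with labels 0 and 1, scikit-learn places true labels on the rows and predicted labels on the columns, so the top-left cell holds true negatives (TN), the top-right false positives (FP), the bottom-left false negatives (FN), and the bottom-right true positives (TP). The sketch below computes specificity and sensitivity from such counts. The counts are hypothetical and are not the output of this lab.

```python
def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted as negative."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """True positive rate (recall): share of actual positives predicted as positive."""
    return tp / (tp + fn)


# Hypothetical confusion matrix counts for illustration only
tn, fp, fn, tp = 150, 50, 20, 30

spec = specificity(tn, fp)
sens = sensitivity(tp, fn)
print(f"specificity = {spec:.3f}, sensitivity = {sens:.3f}")
```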

## Q5

* Run the code cell below to obtain the classification report output for the Logistic Regression classification model predicting if an employee will leave the company.
* Based on the output, which of the following statements is true?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

print(classification_report(test_y, preds_lr))
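
The per-class precision, recall, and f1-score in the report can be reproduced from confusion matrix counts. The sketch below does this for the positive class (Attrition = 1). The helper name `precision_recall_f1` and the counts are hypothetical and are not taken from this lab's output.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # of those predicted positive, share that are positive
    recall = tp / (tp + fn)     # of those actually positive, share that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts for the positive class, for illustration only
precision, recall, f1 = precision_recall_f1(tp=30, fp=50, fn=20)
print(f"precision = {precision:.3f}, recall = {recall:.3f}, f1 = {f1:.3f}")
```

Recall for the positive class is the model's sensitivity, while recall for the negative class is its specificity.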

# Part 2: Decision Trees



## Q6

* Which of the following code lines should you use to train a Decision Tree classification model to predict if an employee will leave the company? Use a maximum depth of 2 and balanced class weights to address class imbalance.
* Select the right answer from Canvas, paste it below, and run the cell.

In [None]:
# Copy and paste your answer from Canvas here



## Q7

* Run the code cell below to output the fitted Decision Tree plot.
* Based on the output, which of the following statements is true?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

# increase plot size
fig = plt.figure(figsize=(12, 10))
# create and plot tree
tree_plot = plot_tree(decision_tree = dt, # use the Decision Tree fitted in Q6
                      feature_names = dt.feature_names_in_, # identify variable names
                      class_names=['0', '1']) # set target class levels
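
Assuming the tree uses scikit-learn's default Gini criterion, each node in the plot reports a `gini` value that measures how mixed the classes are at that node: 0 means the node is pure, and 0.5 is the maximum for two classes. The sketch below computes Gini impurity from class counts. The helper name `gini_impurity` and the counts are hypothetical, and when class weights are used the `value` entries shown in the plot may be weighted amounts or proportions rather than raw employee counts.

```python
def gini_impurity(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Hypothetical node compositions [class 0, class 1], for illustration only
mixed_node = gini_impurity([50, 50])    # evenly mixed node
skewed_node = gini_impurity([90, 10])   # mostly class 0
pure_node = gini_impurity([100, 0])     # only class 0

print(mixed_node, skewed_node, pure_node)
```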

## Q8

* No code needed.
* Select the right answer in Canvas.

## Q9

* Run the code cell below to output the variable importance information for the fitted Decision Tree classification model.
* Based on the output, which of the following statements is true?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

dtt_imp = pd.DataFrame({'variable': dt.feature_names_in_, 'importance': dt.feature_importances_})
dtt_imp.sort_values('importance', ascending = False)

## Q10

* Run the code cell below to make class predictions and output the confusion matrix and classification performance measures for the Decision Tree model predicting if an employee will leave the company.
* Based on the output, which of the following statements is true?
* Select the right answer from Canvas.

In [None]:
# do not manipulate this cell - just run it

preds_dt = dt.predict(X = test_X)
ConfusionMatrixDisplay.from_predictions(test_y, preds_dt)
print(classification_report(test_y, preds_dt))

## Q11 (Extra Credit)

* No code needed.
* Select the right answer in Canvas.