# Logistic Regression Assignment

- Run the cells below. If you have the data in a different directory, you'll need to change the URL.
- Complete all of the numbered questions. You may call any packages that we've used in class.  

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/log_reg/employee-turnover-balanced.csv')
df.head()

Unnamed: 0,left_company,age,frequency_of_travel,department,commuting_distance,education,satisfaction_with_environment,gender,seniority_level,position,satisfaction_with_job,married_or_single,last_raise_pct,last_performance_rating,total_years_working,years_at_company,years_in_current_job,years_since_last_promotion,years_with_current_supervisor
0,No,37,Travel_Rarely,Sales,16,4,4,Male,2,Sales Executive,3,Divorced,19,3,9,1,0,0,0
1,No,39,Travel_Rarely,Research & Development,3,2,3,Male,2,Laboratory Technician,3,Divorced,15,3,11,10,8,0,7
2,No,52,Travel_Frequently,Research & Development,25,4,3,Female,4,Manufacturing Director,4,Married,22,4,31,9,8,0,0
3,No,50,Non-Travel,Sales,1,3,4,Female,2,Sales Executive,3,Married,12,3,19,18,7,0,13
4,No,44,Travel_Rarely,Research & Development,4,3,4,Male,2,Healthcare Representative,2,Single,12,3,10,5,2,2,3


## Data Definitions
- `left_company`: Whether individual left the company or not. This is the target variable.  
- `age`: Age of individual. 
- `frequency_of_travel`: How often person travels for work.  
- `department`: Department the person works (or worked) in.  
- `commuting_distance`: Distance person lives from office.  
- `education`: Highest education category.  
- `satisfaction_with_environment`: Satisfaction with the work environment, on a Likert scale.  
- `gender`: Gender of individual.  
- `seniority_level`: Seniority level of individual.  
- `position`: Last position held at the company.  
- `satisfaction_with_job`: Satisfaction with their job, on a Likert scale.  
- `married_or_single`: Marital status of person.  
- `last_raise_pct`: Percent increase their last raise represented.  
- `last_performance_rating`: Most recent annual performance rating, on a Likert scale.  
- `total_years_working`: Number of years the individual has spent working in their career.  
- `years_at_company`: Number of years the individual has been at the company, regardless of position.  
- `years_in_current_job`: Number of years the individual has been in their current position.  
- `years_since_last_promotion`: Years since the person had their last promotion.  
- `years_with_current_supervisor`: Years the person has had their current supervisor.

# Question 1
- What is the distribution of the target (`left_company`)?  
- Do you have any concerns on class imbalances?

In [28]:
# insert code
df['left_company'].value_counts()

No     500
Yes    500
Name: left_company, dtype: int64

The target is perfectly balanced: 500 rows are labeled `No` (stayed) and 500 are labeled `Yes` (left). Therefore, there are no class-imbalance concerns in the initial dataset.
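The balance check can also be expressed as proportions and a single ratio. The sketch below is a minimal illustration on a toy series; `target`, `summary`, and `balance_ratio` are illustrative names, and the toy values stand in for `df['left_company']`.

```python
import pandas as pd

# Toy stand-in for df['left_company']; the values are illustrative only.
target = pd.Series(["No", "Yes", "No", "Yes", "No", "Yes"], name="left_company")

counts = target.value_counts()                # absolute counts per class
shares = target.value_counts(normalize=True)  # fraction of rows per class
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)

# Ratio of the smallest class to the largest; 1.0 means perfectly balanced.
balance_ratio = counts.min() / counts.max()
print(balance_ratio)
```

A ratio well below 1.0 would be a reason to look at stratified splits and metrics other than accuracy.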

# Question 2
- Create and print a list of the variables that you would treat as numerical and another list for the variables that you would treat as categorical.  
- Explain your choices.

In [29]:
numerical_vars = ['age', 'commuting_distance', 'education', 'satisfaction_with_environment', 'seniority_level', 'satisfaction_with_job', 'last_raise_pct', 'last_performance_rating', 'total_years_working', 'years_at_company', 'years_in_current_job', 'years_since_last_promotion', 'years_with_current_supervisor']
categorical_Vars = ['frequency_of_travel', 'department', 'gender', 'position' , 'married_or_single']
print(numerical_vars,categorical_Vars)

['age', 'commuting_distance', 'education', 'satisfaction_with_environment', 'seniority_level', 'satisfaction_with_job', 'last_raise_pct', 'last_performance_rating', 'total_years_working', 'years_at_company', 'years_in_current_job', 'years_since_last_promotion', 'years_with_current_supervisor'] ['frequency_of_travel', 'department', 'gender', 'position', 'married_or_single']


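Numerical variables take numeric values and can be either continuous or discrete, so the ages, distances, percentages, and year counts go in `numerical_vars`. Categorical variables take a limited number of values that represent unordered groups, so the five text-valued columns go in `categorical_Vars`. The integer-coded ordinal columns (`education`, `seniority_level`, `last_performance_rating`, and the two satisfaction ratings) are kept in the numerical list because their levels are ordered; this assumes the levels are roughly evenly spaced, and one-hot encoding them would be a reasonable alternative.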
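As a cross-check on hand-written lists, the column dtypes can suggest a first split. This is only a sketch on a toy frame with a few of the assignment's column names; `toy`, `auto_numerical`, and `auto_categorical` are illustrative names, and dtype alone cannot tell an ordinal rating from a true count, so the result still needs review.

```python
import pandas as pd

# Toy frame with a handful of the assignment's columns; values are illustrative.
toy = pd.DataFrame({
    "age": [37, 39, 52],
    "education": [4, 2, 4],
    "department": ["Sales", "Research & Development", "Sales"],
    "gender": ["Male", "Male", "Female"],
    "left_company": ["No", "No", "Yes"],
})

# Exclude the target before splitting the features by dtype.
features = toy.drop(columns="left_company")
auto_numerical = features.select_dtypes(include="number").columns.tolist()
auto_categorical = features.select_dtypes(exclude="number").columns.tolist()

print(auto_numerical)
print(auto_categorical)
```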

# Question 3
- Determine if any numerical variables risk multicollinearity.  
- Remove those variables (if any) from your numerical_vars list.  
- Why did you or did not remove any?

In [30]:
# insert code here
df[numerical_vars].corr()



Unnamed: 0,age,commuting_distance,education,satisfaction_with_environment,seniority_level,satisfaction_with_job,last_raise_pct,last_performance_rating,total_years_working,years_at_company,years_in_current_job,years_since_last_promotion,years_with_current_supervisor
age,1.0,0.012074,0.199138,0.001556,0.522604,0.095242,0.027851,0.003629,0.673804,0.38476,0.31001,0.242456,0.273679
commuting_distance,0.012074,1.0,0.033003,-0.019556,0.038915,0.023859,0.104421,0.089282,0.025593,0.023017,0.03189,0.047552,0.03152
education,0.199138,0.033003,1.0,-0.059586,0.080685,0.015148,0.013515,-0.014162,0.160822,0.091614,0.073181,0.077218,0.083453
satisfaction_with_environment,0.001556,-0.019556,-0.059586,1.0,0.009462,-0.00616,0.014812,0.006943,-0.027203,0.001339,0.023698,0.042132,0.021875
seniority_level,0.522604,0.038915,0.080685,0.009462,1.0,0.040606,-0.022683,-0.029956,0.779351,0.572724,0.478151,0.392935,0.430047
satisfaction_with_job,0.095242,0.023859,0.015148,-0.00616,0.040606,1.0,-0.037273,-0.08903,0.029119,0.07192,0.037591,0.038015,0.001472
last_raise_pct,0.027851,0.104421,0.013515,0.014812,-0.022683,-0.037273,1.0,0.792791,-0.004905,0.004435,0.039691,0.000615,0.060882
last_performance_rating,0.003629,0.089282,-0.014162,0.006943,-0.029956,-0.08903,0.792791,1.0,0.014877,0.022364,0.087038,0.030595,0.100502
total_years_working,0.673804,0.025593,0.160822,-0.027203,0.779351,0.029119,-0.004905,0.014877,1.0,0.685955,0.548494,0.423619,0.506007
years_at_company,0.38476,0.023017,0.091614,0.001339,0.572724,0.07192,0.004435,0.022364,0.685955,1.0,0.801423,0.630344,0.781147


In [31]:
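print("years_in_current_job and years_at_company are closely correlated (r = 0.80), hence dropping years_in_current_job")
# Pairs just under 0.80 in the table above (last_raise_pct/last_performance_rating,
# seniority_level/total_years_working, years_at_company/years_with_current_supervisor)
# are kept for now but are worth revisiting.
numerical_vars.remove('years_in_current_job')

years_in_current_job and years_at_company are closely correlated (r = 0.80), hence dropping years_in_current_job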
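Reading a full correlation matrix by eye is error-prone, so the check can be made systematic by listing every pair above a cutoff. The sketch below runs on synthetic data; the 0.75 threshold and the names `toy`, `pairs`, and `high_pairs` are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic data: column b is almost a copy of a, column c is independent.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of 1.0 values is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Assumed cutoff for "highly correlated"; adjust to taste.
high_pairs = pairs[pairs > 0.75].sort_values(ascending=False)
print(high_pairs)
```

With the assignment's data, the same pattern applied to `df[numerical_vars].corr()` would list each candidate pair alongside its correlation.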


# Question 4
- Split the data into training and test sets.  
- Use 20% of the data for test and a random state of 124.  

In [32]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('left_company', axis=1), df['left_company'], test_size=0.2, random_state=124)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (800, 18)
X_test shape: (200, 18)
y_train shape: (800,)
y_test shape: (200,)
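The split above relies on random sampling alone. One optional variation, which the question does not ask for, is to pass `stratify` so that both splits preserve the class proportions exactly. A minimal sketch on a toy frame (`toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced dataset: 50 "No" and 50 "Yes" rows.
toy = pd.DataFrame({"x": range(100), "left_company": ["No", "Yes"] * 50})
X = toy.drop(columns="left_company")
y = toy["left_company"]

# stratify=y keeps the No/Yes proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=124, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```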


# Question 5
- Create a pipeline to process the numerical data.  
- Create a pipeline to process the categorical data.  

Verify each pipeline contains the columns you would expect using a fit_transform on the training data, i.e., print the shapes of the fit_transforms for each pipeline.

In [33]:
# insert code here
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_vars),
    ('cat', cat_pipeline, categorical_Vars)
])
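The question also asks to verify each pipeline by printing the shape of its `fit_transform` output. The sketch below shows one way to do that on a small toy frame; `toy_train`, `toy_num`, `toy_cat`, `num_pipe`, and `cat_pipe` are illustrative stand-ins for `X_train`, `numerical_vars`, `categorical_Vars`, and the two pipelines above.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training frame; values are illustrative only.
toy_train = pd.DataFrame({
    "age": [37, 39, 52, 50],
    "commuting_distance": [16, 3, 25, 1],
    "gender": ["Male", "Male", "Female", "Female"],
    "frequency_of_travel": [
        "Travel_Rarely", "Travel_Rarely", "Travel_Frequently", "Non-Travel",
    ],
})
toy_num = ["age", "commuting_distance"]
toy_cat = ["gender", "frequency_of_travel"]

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical output keeps one column per input column.
num_out = num_pipe.fit_transform(toy_train[toy_num])
# Categorical output has one column per distinct category (2 + 3 here).
cat_out = cat_pipe.fit_transform(toy_train[toy_cat])

print("numerical shape:", num_out.shape)
print("categorical shape:", cat_out.shape)
```

On the real data, the numerical output should have as many columns as `numerical_vars` has entries, and the categorical output one column per distinct category across the five categorical variables.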

# Question 6
- Create a pipeline that combines the pre-processing and implements a logistic regression model.  
- Print the accuracy on the training set and the test set.
- Do you have any concerns of overfitting based on the differences between the two accuracy scores?

In [34]:
logreg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression())
])

# Fit on the training split created in Question 4
logreg_pipeline.fit(X_train, y_train)

train_preds = logreg_pipeline.predict(X_train)
test_preds = logreg_pipeline.predict(X_test)

train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

print(f"Training accuracy: {train_acc}")
print(f"Test accuracy: {test_acc}")

Training accuracy: 0.72625
Test accuracy: 0.655
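The training accuracy (0.726) is about seven points above the test accuracy (0.655), which hints at mild overfitting. A 200-row test set has a standard error of roughly three points on its own, though, so a single split is a noisy basis for that conclusion. Cross-validation gives a steadier estimate. The sketch below uses synthetic data from `make_classification` rather than the assignment's dataset; `X_demo`, `y_demo`, `model`, and `cv_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=124)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validated accuracy: each fold is held out once.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print("fold accuracies:", cv_scores)
print("mean:", cv_scores.mean(), "std:", cv_scores.std())
```

If the cross-validated mean on the real training data sits close to the test accuracy, the gap is more likely an optimistic training score than a fluke of the split.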


# Question 7
What would you recommend as potential next steps for continuing to develop and evaluate a model?

Compare the logistic regression baseline against other classifiers suited to a 1,000-row tabular dataset, such as a random forest, which can capture non-linear relationships between the input variables and the target.

Train each candidate on the preprocessed training dataset, reusing the same preprocessing pipeline so that the comparison is fair.

Evaluate each model with cross-validation or a separate validation dataset, using classification metrics such as accuracy, precision, recall, and F1 score. Regression metrics such as mean squared error or mean absolute error do not apply here because the target is binary.

Fine-tune the model by adjusting hyperparameters, such as the regularization strength of the logistic regression, to optimize its performance on the validation data.

Test the final model once on the separate testing dataset to ensure it generalizes well to new data.

Deploy the trained model in a production environment, such as a web or mobile app, and continuously monitor its performance over time.
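The tuning and evaluation steps above can be sketched with `GridSearchCV` and `classification_report`. This is a minimal illustration on synthetic data, not the assignment's dataset; the grid of `C` values, the `f1` scoring choice, and the names `X_demo`, `y_demo`, `pipe`, and `search` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, used only to make the sketch runnable.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=124)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=124, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Smaller C means stronger regularization; the grid values are assumed.
param_grid = {"logreg__C": [0.01, 0.1, 1.0, 10.0]}

# Cross-validated search on the training split only; the test split stays untouched.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print("best C:", search.best_params_["logreg__C"])

# Single final check on the held-out test split, with per-class metrics.
report = classification_report(y_te, search.predict(X_te))
print(report)
```

Because the search only sees the training split, the final report remains an honest estimate of performance on new data.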




