# **Optimizing Candidate Selection Using Recruitment Data - Preprocessing and Training Data Development - Hector Sanchez**

**Preprocessing and Training Data Development Plan**

1. Load and Review the Most Current Data

   Objective: In this step, we'll start by loading our 'final_cleaned_encoded_recruitment_data.csv' file/dataset. The goal is to ensure that we have a clear understanding of the dataset's current structure so that we know what steps need to be taken for further processing. We'll start by importing any necessary and useful libraries (such as pandas, numpy, scikit-learn) and then load the dataset. Next, we'll use .head() and .info() to view the first few rows of the dataset, and also to check the data types of each feature. Once we verify the data type of each feature, we can move on to the next step.
   
2. Create Dummy Variables if Necessary

   Objective: In this next step, we will ensure that all of the categorical features are ready for modeling. In the original dataset, the categorical features were 'Gender', 'EducationLevel', and 'RecruitmentStrategy'. We one-hot encoded 'RecruitmentStrategy' in our Data Wrangling notebook, and we encoded 'Gender' and 'EducationLevel' in our Exploratory Data Analysis notebook. If any columns aren't ready for modeling, we will use pd.get_dummies() to one-hot encode them. We will move on to the next step once we confirm that the data is fully encoded.
   
3. Scale Numeric Features

   Objective: In this step, we will focus on standardizing the numeric features in our dataset so that they are all on a similar scale. We will use a scaler from sklearn.preprocessing to scale the numeric features, and we'll fit the scaler on the entire dataset.
   
4. Perform a Train-Test Split

   Objective: We will focus on splitting our data into training and testing sets that we can use to evaluate our models' performance. We will need to define our target variable and the feature variables. We will also split our data into subsets of 80% training and 20% testing. 
   
5. Save the Preprocessed Dataset for Future Use

   Objective: We will save our preprocessed and split dataset for use in our modeling notebook. 

**Load Packages and Review Data**

In [2]:
# We'll start by importing the necessary libraries and packages

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [4]:
# Next, we'll load the fully cleaned and encoded dataset

file_path = 'C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Three/datasets/final_cleaned_encoded_recruitment_data.csv'
recruitment_data_cleaned_encoded = pd.read_csv(file_path)

In [6]:
# Utilize .head() to show the first few rows of the dataset

recruitment_data_cleaned_encoded.head()

,Age,ExperienceYears,PreviousCompanies,DistanceFromCompany,InterviewScore,SkillScore,PersonalityScore,HiringDecision,Strategy_Aggressive,Strategy_Moderate,Strategy_Conservative,Gender_Male,EducationLevel_Bachelor's Type 2,EducationLevel_Master's,EducationLevel_PhD
0,26,0,3,26.783828,48,78,91,1,True,False,False,False,True,False,False
1,39,12,3,25.862694,35,68,80,1,False,True,False,False,False,False,True
2,48,3,2,9.920805,20,67,13,0,False,True,False,True,True,False,False
3,34,5,2,6.407751,36,27,70,0,False,False,True,False,True,False,False
4,30,6,1,43.105343,23,52,85,0,False,True,False,True,False,False,False


In [8]:
# Utilize .info() to check on the data types 
# This will help identify numerical and categorical features

recruitment_data_cleaned_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Age                               1500 non-null   int64  
 1   ExperienceYears                   1500 non-null   int64  
 2   PreviousCompanies                 1500 non-null   int64  
 3   DistanceFromCompany               1500 non-null   float64
 4   InterviewScore                    1500 non-null   int64  
 5   SkillScore                        1500 non-null   int64  
 6   PersonalityScore                  1500 non-null   int64  
 7   HiringDecision                    1500 non-null   int64  
 8   Strategy_Aggressive               1500 non-null   bool   
 9   Strategy_Moderate                 1500 non-null   bool   
 10  Strategy_Conservative             1500 non-null   bool   
 11  Gender_Male                       1500 non-null   bool   
 12  Educat

In this step, we focused on loading and reviewing our dataset to ensure that we can move forward with preprocessing. We started by importing the necessary packages: pandas for data manipulation, train_test_split from sklearn.model_selection, and StandardScaler and MinMaxScaler from sklearn.preprocessing.

Our initial call of .head() helped us confirm that our work from the previous Wrangling and EDA steps was properly reflected in our dataset. Our call of .info() helped us verify that every feature has a data type suitable for modeling: the numeric features are int64 or float64, and the encoded features are bool.
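
The visual check with .info() can also be expressed as a small programmatic check. The sketch below is only illustrative: the `demo` DataFrame and its values are hypothetical stand-ins for the recruitment data, and the idea is simply to list any column that is not numeric (pandas treats bool as numeric here) and to count missing values.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the recruitment data (values are illustrative)
demo = pd.DataFrame({
    "Age": [26, 39],
    "DistanceFromCompany": [26.78, 25.86],
    "Gender_Male": [False, True],
})

# Columns that are neither numeric nor bool would still need encoding
non_numeric = [col for col in demo.columns if not is_numeric_dtype(demo[col])]

# Total number of missing values across the whole DataFrame
missing = int(demo.isna().sum().sum())

print("Non-numeric columns:", non_numeric)
print("Missing values:", missing)
```

An empty list and a count of zero would suggest the data is ready for the next steps.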

**Create Dummy Variables if Necessary**

In [10]:
# Utilize the following code to check for any remaining categorical columns

categorical_columns = recruitment_data_cleaned_encoded.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)

if len(categorical_columns) > 0:
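    # get_dummies is a pandas top-level function and returns a new DataFrame, so we reassign the result
    recruitment_data_cleaned_encoded = pd.get_dummies(recruitment_data_cleaned_encoded, columns=categorical_columns, drop_first=True)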
else:
    print("All categorical data is already encoded.")

Categorical columns: Index([], dtype='object')
All categorical data is already encoded.


While this step may seem redundant given the work that we did in our previous notebooks, we still ran the check above to confirm that no object or category columns remain and that our data is ready.
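
Because the encoding branch never ran on our data, here is a minimal sketch of what it would do if a raw categorical column were still present. The `toy` DataFrame and its values are hypothetical, and the column is explicitly cast to the category dtype so that the dtype check picks it up.

```python
import pandas as pd

# Hypothetical toy data with one column that is still a raw category
toy = pd.DataFrame({
    "Age": [26, 39, 48],
    "RecruitmentStrategy": ["Aggressive", "Moderate", "Conservative"],
})
toy["RecruitmentStrategy"] = toy["RecruitmentStrategy"].astype("category")

# Same detection logic as in the notebook cell
categorical_columns = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# One-hot encode; drop_first=True drops the first category level
# (here the alphabetically first one) to avoid a redundant column
encoded = pd.get_dummies(toy, columns=categorical_columns, drop_first=True)

print(encoded.columns.tolist())
```

With `drop_first=True`, a feature with three levels produces two indicator columns; the dropped level is implied when both indicators are False.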

**Scale Numeric Features**

In [21]:
# Start by identifying the numeric columns that need to be scaled

numeric_columns = recruitment_data_cleaned_encoded.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_columns.remove('HiringDecision')

In [None]:
# Initialize the Scaler

scaler = StandardScaler()

# Fit and transform the scaler on the numeric features

recruitment_data_cleaned_encoded[numeric_columns] = scaler.fit_transform(recruitment_data_cleaned_encoded[numeric_columns])

# Utilize .head() to show the first few rows of the dataset
# This will confirm the scaling has been applied

recruitment_data_cleaned_encoded.head()

In this step, I chose StandardScaler because standardizing to zero mean and unit variance is a reasonable default for many algorithms that are sensitive to feature scale. I could also use MinMaxScaler, but for now it is strictly an option.
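
Note that the cell above fits the scaler on the full dataset, so rows that later end up in the test set contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering; it is not the approach used in this notebook, and the `demo` DataFrame, its random values, and the `numeric_cols` list are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the recruitment dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.uniform(20, 50, size=100),
    "InterviewScore": rng.uniform(0, 100, size=100),
    "HiringDecision": rng.integers(0, 2, size=100),
})
numeric_cols = ["Age", "InterviewScore"]

X = demo.drop("HiringDecision", axis=1)
y = demo["HiringDecision"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = X_train.copy()
X_test = X_test.copy()

# Fit on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train[numeric_cols].mean().round(6))
```

With this ordering, the training features have zero mean and unit variance, while the test features are scaled with the training statistics and will usually be close to, but not exactly, standardized.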

**Perform a Train-Test Split**

In [26]:
# Define features for (X) and for our target variable (y)

X = recruitment_data_cleaned_encoded.drop('HiringDecision', axis=1)
y = recruitment_data_cleaned_encoded['HiringDecision']

# Split the data using an 80% train and 20% test ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the splits

print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)

Training set shape: (1200, 14) (1200,)
Testing set shape: (300, 14) (300,)


In this step, we performed a train-test split as the final step in this notebook (aside from saving the data). Our call to train_test_split divided the dataset into 80% training and 20% testing data, with random_state=42 so that the split is reproducible.
As a result, X_train has 1200 rows and 14 columns, and X_test has 300 rows and 14 columns.
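
If the two classes in 'HiringDecision' turn out to be imbalanced (the class balance is not checked in this notebook), passing `stratify=y` to train_test_split keeps the class proportions roughly equal in both subsets. The sketch below uses a hypothetical `demo` DataFrame with a 70/30 class split to illustrate the idea.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with a 70/30 split between the two classes
demo = pd.DataFrame({
    "feature": range(100),
    "HiringDecision": [0] * 70 + [1] * 30,
})
X = demo.drop("HiringDecision", axis=1)
y = demo["HiringDecision"]

# stratify=y preserves the class proportions in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Positive share in train:", y_train.mean())
print("Positive share in test:", y_test.mean())
```

Without stratification, a random split of a small or imbalanced dataset can leave the test set with a noticeably different class mix than the training set.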

**Save the Preprocessed Dataset for Future Use**

In [31]:
# Now we will save the training and testing sets

X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# We will also save the full preprocessed DataFrame for future reference

recruitment_data_cleaned_encoded.to_csv('final_preprocessed_data.csv', index=False)
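
When these files are loaded again in the modeling notebook, pd.read_csv returns a DataFrame even for the single-column target files, so it can help to squeeze the target back into a Series. The sketch below shows that round trip with small hypothetical data written to a temporary directory; the file names mirror the ones above, but `X_demo` and `y_demo` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for X_train and y_train
X_demo = pd.DataFrame({"Age": [0.5, -1.2, 0.7], "Gender_Male": [True, False, True]})
y_demo = pd.Series([1, 0, 1], name="HiringDecision")

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)

    # Save without the index, as in the cell above
    X_demo.to_csv(tmp_path / "X_train.csv", index=False)
    y_demo.to_csv(tmp_path / "y_train.csv", index=False)

    # Reload; squeeze("columns") turns the one-column DataFrame into a Series
    X_loaded = pd.read_csv(tmp_path / "X_train.csv")
    y_loaded = pd.read_csv(tmp_path / "y_train.csv").squeeze("columns")

print(type(y_loaded).__name__, y_loaded.name)
```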