<a href="https://colab.research.google.com/github/namozhdehi/Amazon-Fine-Food-Reviews/blob/main/04_Pre_Processing_Pathrise_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4 Data Preprocessing <a id='4_Preprocessing'></a>

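## 4.1 Contents <a id='4.1_Contents'></a>

* [4 Data Preprocessing](#4_Preprocessing)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Handle Missing Values](#4.5_Missing)
  * [4.6 Text Preprocessing](#4.6_Text_Cleaning)
  * [4.7 Feature Encoding](#4.7_Feature_Encoding)
  * [4.8 Feature Engineering](#4.8_Feature_Engineering)
  * [4.9 Train-Test Split](#4.9_Train_Test_Split)
  * [4.10 Summary](#4.10_Summary)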

## 4.2 Introduction <a id='4.2_Introduction'></a>

In this notebook, we will preprocess the Pathrise dataset to prepare it for further analysis and model building. This includes handling missing values, cleaning text data, encoding categorical features, and splitting the data into training and testing sets.

## 4.3 Imports <a id='4.3_Imports'></a>

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 4.4 Load Data <a id='4.4_Load_Data'></a>

In [2]:
# Load the cleaned data from CSV file
df = pd.read_csv('pathrise_cleaned_data.csv')

# Display the first few rows of the DataFrame
df.head()

| | id | pathrise_status | primary_track | cohort_tag | program_duration_days | placed | employment_status | highest_level_of_education | length_of_job_search | biggest_challenge_in_search | professional_experience | work_authorization_status | number_of_interviews | number_of_applications | gender | race | cleaned_biggest_challenge_in_search |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Closed Lost | Design | AUG19B | 0.0 | 0 | Employed Part-Time | Master's Degree | Less than one month | Figuring out which jobs to apply for | Less than one year | Citizen | 0.0 | 0 | Male | East Asian or Asian American | figuring jobs apply |
| 1 | 4 | Closed Lost | PSO | AUG19B | 0.0 | 0 | Contractor | Bachelor's Degree | Less than one month | Getting past final round interviews | Less than one year | Citizen | 5.0 | 25 | Male | Decline to Self Identify | getting past final round interviews |
| 2 | 5 | Placed | SWE | AUG19A | 89.0 | 1 | Unemployed | Bachelor's Degree | 1-2 months | Hearing back on my applications | 1-2 years | F1 Visa/OPT | 10.0 | 100 | Male | East Asian or Asian American | hearing back applications |
| 3 | 6 | Closed Lost | SWE | AUG19A | 0.0 | 0 | Employed Full-Time | Master's Degree | 1-2 months | Technical interviewing | 3-4 years | Green Card | 5.0 | 100 | Male | East Asian or Asian American | technical interviewing |
| 4 | 7 | Closed Lost | SWE | AUG19B | 0.0 | 0 | Employed Full-Time | Master's Degree | Less than one month | Getting past phone screens | 3-4 years | Green Card | 0.0 | 9 | Male | Black, Afro-Caribbean, or African American | getting past phone screens |


## 4.5 Handle Missing Values <a id='4.5_Missing'></a>

We will check each column for missing values and drop any rows that contain them.

In [3]:
# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df.dropna(inplace=True)

# Confirm there are no missing values
print(df.isnull().sum())

id                                     0
pathrise_status                        0
primary_track                          0
cohort_tag                             0
program_duration_days                  0
placed                                 0
employment_status                      0
highest_level_of_education             0
length_of_job_search                   0
biggest_challenge_in_search            0
professional_experience                0
work_authorization_status              0
number_of_interviews                   0
number_of_applications                 0
gender                                 0
race                                   0
cleaned_biggest_challenge_in_search    0
dtype: int64
id                                     0
pathrise_status                        0
primary_track                          0
cohort_tag                             0
program_duration_days                  0
placed                                 0
employment_status                      0
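highest_level_of_education             0
length_of_job_search                   0
biggest_challenge_in_search            0
professional_experience                0
work_authorization_status              0
number_of_interviews                   0
number_of_applications                 0
gender                                 0
race                                   0
cleaned_biggest_challenge_in_search    0
dtype: int64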
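
For comparison, here is a minimal sketch of dropping versus filling missing values. The toy DataFrame and the fill choices (median for the numeric column, an `"Unknown"` placeholder for the text column) are assumptions for illustration only; they are not taken from the Pathrise data or from this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; not taken from the Pathrise data.
toy = pd.DataFrame({
    "number_of_interviews": [0.0, np.nan, 10.0, 5.0],
    "gender": ["Male", "Female", None, "Male"],
})

# Per-column count of missing values, as in the cell above.
missing_counts = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill instead of dropping
# (median for the numeric column, a placeholder for the text column).
filled = toy.copy()
filled["number_of_interviews"] = filled["number_of_interviews"].fillna(
    filled["number_of_interviews"].median()
)
filled["gender"] = filled["gender"].fillna("Unknown")

print(missing_counts.to_dict())
print(len(toy), len(dropped), len(filled))
```

Dropping keeps only fully observed rows, while filling keeps every row at the cost of introducing imputed values. Which one is preferable depends on how many rows would be lost.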

## 4.6 Text Preprocessing <a id='4.6_Text_Cleaning'></a>

We will clean the text data in the biggest_challenge_in_search column by removing unwanted characters, converting the text to lowercase, and removing stopwords.

In [4]:
import re

# Build the stopword set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

# Define a function to clean the free-text challenge responses
def clean_text(text):
    if not isinstance(text, str):
        return ''
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters with spaces
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r'\s+', ' ', text)  # Collapse repeated whitespace
    tokens = word_tokenize(text)  # Tokenize the text
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return ' '.join(tokens)

# Apply the clean_text function to the 'biggest_challenge_in_search' column
df['cleaned_biggest_challenge_in_search'] = df['biggest_challenge_in_search'].apply(clean_text)

# Display the first few cleaned rows
df[['biggest_challenge_in_search', 'cleaned_biggest_challenge_in_search']].head()

| | biggest_challenge_in_search | cleaned_biggest_challenge_in_search |
|---|---|---|
| 0 | Figuring out which jobs to apply for | figuring jobs apply |
| 1 | Getting past final round interviews | getting past final round interviews |
| 2 | Hearing back on my applications | hearing back applications |
| 3 | Technical interviewing | technical interviewing |
| 4 | Getting past phone screens | getting past phone screens |
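
To see what each cleaning step does without downloading NLTK data, the same steps can be sketched in plain Python. The function name `clean_text_sketch` and the tiny `STOPWORDS` set are hypothetical stand-ins: NLTK's English stopword list is much longer, and a whitespace split only approximates `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
STOPWORDS = {"out", "which", "to", "for", "on", "my"}

def clean_text_sketch(text):
    """Lowercase, strip non-word characters, and drop stopwords."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"\W", " ", text)  # replace anything that is not a letter, digit, or underscore
    text = text.lower()
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean_text_sketch("Figuring out which jobs to apply for"))  # figuring jobs apply
print(clean_text_sketch("Hearing back on my applications!"))      # hearing back applications
print(repr(clean_text_sketch(None)))                              # ''
```

Non-string inputs such as `None` or `NaN` come back as an empty string, which mirrors the guard at the top of `clean_text`.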


## 4.7 Feature Encoding <a id='4.7_Feature_Encoding'></a>

We will encode the categorical features like primary_track and employment_status into numerical format using Label Encoding.

In [5]:
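# Use one LabelEncoder per column so each fitted mapping stays available
track_encoder = LabelEncoder()
status_encoder = LabelEncoder()

# Encode categorical features
df['primary_track_encoded'] = track_encoder.fit_transform(df['primary_track'])
df['employment_status_encoded'] = status_encoder.fit_transform(df['employment_status'])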

# Display the first few rows with encoded columns
df[['primary_track', 'primary_track_encoded', 'employment_status', 'employment_status_encoded']].head()

| | primary_track | primary_track_encoded | employment_status | employment_status_encoded |
|---|---|---|---|---|
| 0 | Design | 1 | Employed Part-Time | 2 |
| 1 | PSO | 2 | Contractor | 0 |
| 2 | SWE | 3 | Unemployed | 4 |
| 3 | SWE | 3 | Employed Full-Time | 1 |
| 4 | SWE | 3 | Employed Full-Time | 1 |
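
As a small illustration of how the integer codes are assigned, the sketch below fits a `LabelEncoder` on four sample track labels only. The variable names are hypothetical. The codes differ from the table above, presumably because the full column contains at least one label that sorts before "Design".

```python
from sklearn.preprocessing import LabelEncoder

# Sample labels taken from the rows shown above (not the full column).
tracks = ["Design", "PSO", "SWE", "SWE"]

track_encoder = LabelEncoder()
codes = track_encoder.fit_transform(tracks)

# classes_ holds the sorted unique labels; a label's code is its index in classes_.
print(list(track_encoder.classes_))                       # ['Design', 'PSO', 'SWE']
print(codes.tolist())                                     # [0, 1, 2, 2]

# A fitted encoder can map codes back to the original labels.
print(track_encoder.inverse_transform([2, 0]).tolist())   # ['SWE', 'Design']
```

The codes follow alphabetical order, not any meaningful ranking of the categories, so models that treat them as ordered numbers may read structure into them that is not there.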


## 4.8 Feature Engineering <a id='4.8_Feature_Engineering'></a>

In this section, we will engineer some new features that could help improve model performance. For example, we might create binary indicators or categorical groupings.

In [6]:
# Example Feature Engineering: create a binary feature based on 'placed'
# (in the rows shown, 'placed' is already 0/1, so this column mirrors it)
df['placed_binary'] = (df['placed'] == 1).astype(int)

# Another example: group job search lengths into fewer categories
# (any label containing 'month' is treated as 'short', everything else as 'long')
df['job_search_group'] = df['length_of_job_search'].apply(lambda x: 'short' if 'month' in x else 'long')

# Display the first few rows with new engineered features
df[['placed', 'placed_binary', 'length_of_job_search', 'job_search_group']].head()

| | placed | placed_binary | length_of_job_search | job_search_group |
|---|---|---|---|---|
| 0 | 0 | 0 | Less than one month | short |
| 1 | 0 | 0 | Less than one month | short |
| 2 | 1 | 1 | 1-2 months | short |
| 3 | 0 | 0 | 1-2 months | short |
| 4 | 0 | 0 | Less than one month | short |
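
The substring rule groups every label that contains "month" as 'short', whatever duration it describes. The sketch below contrasts it with an explicit lookup table. Only "Less than one month" and "1-2 months" appear in the output above; the other two labels, the `EXPLICIT_GROUPS` mapping, and the function names are hypothetical and would need to be checked against the actual values in `length_of_job_search`.

```python
def substring_group(label):
    """Same rule as the cell above: 'short' if the label mentions 'month'."""
    return "short" if "month" in label else "long"

# Explicit mapping from label to group.
EXPLICIT_GROUPS = {
    "Less than one month": "short",
    "1-2 months": "short",
    "6 months to a year": "long",  # hypothetical label
    "Over a year": "long",         # hypothetical label
}

def explicit_group(label):
    # Unknown labels are flagged rather than silently bucketed.
    return EXPLICIT_GROUPS.get(label, "unknown")

for label in EXPLICIT_GROUPS:
    print(f"{label!r}: substring={substring_group(label)}, explicit={explicit_group(label)}")
```

With the substring rule, a label such as "6 months to a year" would land in 'short'. An explicit mapping makes the intended boundary visible and surfaces unexpected labels.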


## 4.9 Train-Test Split <a id='4.9_Train_Test_Split'></a>

We will now split the dataset into training and testing sets for model development.

In [7]:
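# Select features and target variable.
# 'placed_binary' is left out because it duplicates the target and would leak it.
# 'job_search_group' is still a string and needs encoding before model fitting.
X = df[['primary_track_encoded', 'employment_status_encoded', 'number_of_applications', 'number_of_interviews', 'program_duration_days', 'job_search_group']]
y = df['placed']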

# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print sizes of the splits
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Training set size: 790
Test set size: 198
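
The reported sizes follow from the row count: 790 + 198 = 988 rows, and a 20% test fraction of 988 is 197.6, which scikit-learn rounds up to 198. The sketch below reproduces that arithmetic and then shows, on toy data, the optional `stratify` argument of `train_test_split`. Stratification is not used in the cell above; it is shown here only as an option, and the toy arrays are made up for illustration.

```python
import math
from sklearn.model_selection import train_test_split

# Size arithmetic for the split above.
n_rows = 790 + 198                 # 988 rows in total
n_test = math.ceil(0.2 * n_rows)   # fractional test sizes are rounded up
n_train = n_rows - n_test
print(n_train, n_test)             # 790 198

# Toy data (hypothetical): 100 samples, 20% positive class.
X_toy = [[i] for i in range(100)]
y_toy = [1] * 20 + [0] * 80

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

When the positive class is small, a stratified split avoids ending up with a test set that has very few positive examples by chance.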


## 4.10 Summary <a id='4.10_Summary'></a>

In this Data Preprocessing notebook, we prepared the Pathrise dataset for further analysis and model development. The missing-value check reported zero nulls in every column, so dropping incomplete rows left the dataset unchanged. We then preprocessed the text in the biggest_challenge_in_search column by removing special characters, converting the text to lowercase, and removing common English stopwords.

Next, we encoded the categorical features primary_track and employment_status as integers using label encoding. We also engineered two features: placed_binary, a 0/1 indicator of whether a participant was placed, and job_search_group, which collapses job search lengths into 'short' and 'long'.

Finally, we split the 988 rows into a training set of 790 rows (80%) and a test set of 198 rows (20%). The resulting data is the starting point for building and training machine learning models that provide career guidance based on the Pathrise dataset.