# Introduction

This is a step-by-step approach to the Predictive Insights competition.

Youth unemployment and under-employment are major concerns for any developing country and serve as important predictors of economic health and prosperity. Being able to predict and understand which young people will find employment, and which will require additional help, promotes evidence-based decision-making, supports economic empowerment, and allows young people to thrive in their chosen careers.

The objective of this challenge is to build a machine learning model that predicts youth employment, based on data from labour market surveys in South Africa.

This solution will help organisations like Predictive Insights achieve a baseline prediction of young people's employment outcomes, allowing them to design and test interventions to help youth make a transition into the labour market or to improve their earnings.

# The Data

The data for this challenge comes from four rounds of a survey of youth in the South African labour market, conducted at 6-month intervals. The survey contains numerical, categorical and free-form text responses. You will also receive additional demographic information such as age and information about school level and results.

Each person in the dataset was surveyed one year prior (the ‘baseline’ data) to the follow-up survey. We are interested in predicting whether a person is employed at the follow-up survey based on their labour market status and other characteristics during the baseline.

The training set consists of one row or observation per individual - information collected at baseline plus only the target outcome (whether they were employed or not) one year later. The test set consists of the data collected at baseline without the target outcome.

The objective of this challenge is to predict whether a young person will be employed, one year after the baseline survey, based on their demographic characteristics, previous and current labour market experience and education outcomes, and to deliver an easy-to-understand and insightful solution to the data team at Predictive Insights.


# Exploratory Data Analysis

## Load libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Load data

In [None]:
df_train = pd.read_csv("Train.csv")
df_test = pd.read_csv("Test.csv")

df_train.head()

In [None]:
df_train['Target'].unique()

In [None]:
df_train.columns

## Univariate Analysis

Let's have a look at some of the variables.

**sa_citizen**

In [None]:
df_train["Sa_citizen"].value_counts()

Rows where `sa_citizen` is 0 are heavily underrepresented. One option is to remove those rows, but that means losing data. Alternatively, one could consider removing the column altogether.
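
To quantify how underrepresented a category is, relative frequencies can be easier to read than raw counts. The sketch below uses a small made-up Series in place of `df_train["Sa_citizen"]`; the values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for df_train["Sa_citizen"]; the values are invented for illustration
sa_citizen = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 0], name="Sa_citizen")

# normalize=True returns proportions instead of counts
shares = sa_citizen.value_counts(normalize=True)
print(shares)

# Share of the rarest category; a very small value signals a near-constant column
minority_share = shares.min()
print(f"Minority share: {minority_share:.1%}")
```

A near-constant column carries little signal for most models, which is one argument for dropping the column rather than the rows.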

**geography**

In [None]:
df_train["Geography"].value_counts()

From this, we see that candidates come from three geographical categories: Rural, Suburb, and Urban. The majority come from urban areas.

**tenure**

In [None]:
# Generate a histogram of the tenure variable using matplotlib
plt.hist(df_train["Tenure"])
plt.xlabel("Tenure")
plt.ylabel("Frequency")
plt.title("Histogram of Tenure")
plt.show()

This histogram indicates that `tenure` has a skewed distribution, with a concentration of values towards the lower end and the presence of outliers.
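
One common way to tame a right-skewed variable is a log transform. The sketch below is only an illustration on invented numbers standing in for `df_train["Tenure"]`; whether the transform helps on the real data is something to check, not a given.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for df_train["Tenure"] (invented numbers)
tenure = pd.Series([0, 1, 2, 3, 5, 8, 13, 40, 120, 900], dtype=float)

# log1p(x) = log(1 + x) is defined at zero, unlike a plain log
tenure_log = np.log1p(tenure)

# Sample skewness before and after the transform
skew_raw = tenure.skew()
skew_log = tenure_log.skew()
print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```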

Next, we will look at the distribution of the `birthyear` variable.

**birthyear**

In [None]:
# Generate a boxplot of the birthyear variable using matplotlib

plt.boxplot(df_train['Birthyear'])
plt.title("Boxplot of Birth Year")
plt.ylabel("Birth Year")  # a default (vertical) boxplot shows the values on the y-axis

plt.show()

The many points below the lower whisker suggest a left-skewed distribution, with many outliers on the lower end.
To get more details, we can use the `pandas.Series.describe()` method.

In [None]:
# Get the key statistics of `birthyear` using describe()
df_train['Birthyear'].describe()

From this, we see that most candidates were born between 1995 and 2000.
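
The points a boxplot draws as outliers can also be listed explicitly with the 1.5 × IQR rule, which is what matplotlib's default whiskers use. A minimal sketch on invented birth years (not the competition data):

```python
import pandas as pd

# Toy birth years standing in for df_train["Birthyear"] (invented numbers)
birthyear = pd.Series([1972, 1985, 1995, 1996, 1997, 1998, 1998, 1999, 2000, 2001])

q1 = birthyear.quantile(0.25)
q3 = birthyear.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = birthyear[(birthyear < lower_fence) | (birthyear > upper_fence)]
print(outliers.tolist())
```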

## Bivariate Analysis

Now, let us look at the relationships between a few variables and the target variable.

In [None]:
sns.kdeplot(data=df_train, x="Birthyear", hue="Target", fill=True, alpha=0.5)
plt.xlabel("Birth Year")
plt.ylabel("Density")
plt.title("Density of Birth Year by Target")
plt.show()

The ages of candidates with a positive outcome and those with a negative outcome seem to follow a similar distribution.

We will now look at the percentage of candidates with a positive outcome in each province.

In [None]:
# Calculate the percentage of positive outcomes for each province

df_province = df_train.groupby('Province').agg(percentage=('Target', 'mean')).reset_index()
df_province["percentage"] = df_province["percentage"] * 100
df_province = df_province.sort_values('percentage', ascending=False).reset_index(drop=True)

In [None]:
# Generate a bar plot of the percentage of positive outcomes for each province

plt.figure(figsize=(10, 6))
sns.barplot(data=df_province, x='Province', y='percentage')
plt.xlabel('Province')
plt.ylabel('Percentage of Positive Outcome')
plt.title('Percentage of Positive Outcome by Province')

# Add labels to the bars (df_province is already sorted, so row.name is the bar position)
for index, row in df_province.iterrows():
    plt.text(row.name, row.percentage, f"{round(row.percentage, 1)}%", ha='center', va='bottom')

# Rotate x-axis labels
plt.xticks(rotation=90, ha='center')

plt.show()

In the training data, candidates from the Western Cape are the most likely to get a positive outcome, while those from the North West province are least likely.
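
A percentage computed from a handful of respondents is much noisier than one computed from thousands, so it can help to report the group size next to each rate. The sketch below uses an invented toy frame in place of `df_train`; the province names are real, but the counts and outcomes are made up.

```python
import pandas as pd

# Toy data standing in for df_train[["Province", "Target"]] (invented values)
toy = pd.DataFrame({
    "Province": ["Western Cape"] * 4 + ["Gauteng"] * 6 + ["North West"] * 2,
    "Target": [1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

# Group size (n) and positive-outcome percentage per province
summary = (
    toy.groupby("Province")
       .agg(n=("Target", "size"), percentage=("Target", "mean"))
       .assign(percentage=lambda d: d["percentage"] * 100)
       .sort_values("percentage", ascending=False)
)
print(summary)
```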

What about the `geography` variable?

In [None]:
# Calculate the percentage of positive outcomes for each `geography`

df_geography = df_train.groupby('Geography').agg(percentage=('Target', 'mean')).reset_index()
df_geography["percentage"] = df_geography["percentage"] * 100
df_geography = df_geography.sort_values('percentage', ascending=False).reset_index(drop=True)

In [None]:
# Generate a bar plot of the percentage of positive outcomes for each `geography`

plt.figure(figsize=(6, 4))
sns.barplot(data=df_geography, x='Geography', y='percentage')
plt.xlabel('Geography')
plt.ylabel('Percentage of Positive Outcome')
plt.title('Percentage of Positive Outcome by Geography')

# Add labels to the bars
for index, row in df_geography.iterrows():
    plt.text(row.name, row.percentage, f"{round(row.percentage, 1)}%", ha='center', va='bottom')

# Rotate x-axis labels
plt.xticks(rotation=90, ha='center')

plt.show()

We see that people from "Urban" areas are most likely to get a positive outcome.

In terms of gender, we see below that males in the data set are more likely to get a job after one year.

In [None]:
df_female = df_train.groupby('Female').agg(percentage=('Target', 'mean')).reset_index()
df_female["percentage"] = df_female["percentage"] * 100
df_female = df_female.sort_values('percentage', ascending=False).reset_index(drop=True)

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=df_female, x='Female', y='percentage')
plt.xlabel('Female')
plt.ylabel('Percentage of Positive Outcome')
plt.title('Percentage of Positive Outcome by Gender')

# Add labels to the bars
for index, row in df_female.iterrows():
    plt.text(row.name, row.percentage, f"{round(row.percentage, 1)}%", ha='center', va='bottom')

# Rotate x-axis labels
plt.xticks(rotation=90, ha='center')

plt.show()

# Feature Engineering

Feature engineering is the process of transforming raw data into meaningful features that may improve the performance of machine learning models. It involves selecting, creating, and transforming variables to capture relevant information and enhance the predictive power of the model.

Let's extract the year of the survey, then use it to calculate the age of each participant at the time of the survey.

In [None]:
df_train['Year_survey'] = pd.to_datetime(df_train['Survey_date']).dt.year
df_train['Age_survey'] = df_train['Year_survey'] - df_train['Birthyear']
df_train['Age_survey'].head()

Next, we create a variable that counts the number of subjects in which each participant obtained 70% or more.

In [None]:
# Count, per row, the cells whose text contains the "70 - 79 %" or "80 - 100 %" mark band
df_train['Subjects_over_70'] = df_train.apply(lambda row: row.str.contains("80 - 100 %|70 - 79 %").sum(), axis=1)
df_train['Subjects_over_70'].value_counts()

Feel free to explore these newly created variables and decide whether you'd like to discard them.
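
One quick way to explore an engineered variable is to compare the outcome rate across its values and to check its correlation with the target. A minimal sketch on an invented toy frame (the column names match the ones created above; the values do not come from the competition data):

```python
import pandas as pd

# Toy frame with the two engineered columns and the target (invented values)
toy = pd.DataFrame({
    "Age_survey": [19, 22, 25, 21, 30, 24],
    "Subjects_over_70": [0, 1, 0, 2, 0, 1],
    "Target": [0, 1, 0, 1, 0, 0],
})

# Group size and positive-outcome rate for each value of the engineered count
rate_by_count = toy.groupby("Subjects_over_70")["Target"].agg(["size", "mean"])
print(rate_by_count)

# Pearson correlation of each engineered feature with the 0/1 target
corr_with_target = toy[["Age_survey", "Subjects_over_70"]].corrwith(toy["Target"])
print(corr_with_target)
```

A feature whose rate barely changes across its values, or whose correlation is close to zero, is a candidate for discarding, although a linear correlation will miss non-linear effects.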

## Dummy variables

In this section, we convert our categorical variables into dummy variables.
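
As a toy illustration of what `pd.get_dummies` does with the `drop_first` and `dummy_na` options used in this notebook, consider a single invented column with one missing value:

```python
import numpy as np
import pandas as pd

# Toy column standing in for one categorical survey variable (invented values)
toy = pd.DataFrame({"Geography": ["Urban", "Rural", "Suburb", np.nan]})

# drop_first=True drops the first level ("Rural") so it acts as the baseline;
# dummy_na=True adds a separate indicator column for missing values
dummies = pd.get_dummies(toy, columns=["Geography"], drop_first=True, dummy_na=True)
print(dummies.columns.tolist())
print(dummies)
```

A row that belongs to the dropped baseline level ends up with zeros in every dummy column.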

In [None]:
# Create a list of categorical variables
selected_vars = ["Round", "Status", "Geography", "Province", "Schoolquintile",
                 "Math", "Mathlit", "Additional_lang", "Home_lang", "Science"]
# Remove variables we will not use
df_train_dummy = df_train.drop(["Person_id", "Survey_date"], axis=1)

# Convert categorical variables to dummy variables
df_train_dummy = pd.get_dummies(df_train_dummy, columns=selected_vars, drop_first=True, dummy_na=True)
df_train_dummy.columns

# Data cleaning

## Cleaning column names

The dummification process created some messy column names. Here, we clean them up.

In [None]:
# Clean column names
df_train_dummy.columns = df_train_dummy.columns.str.replace(' ', '_')  # Replace spaces with underscores
df_train_dummy.columns = df_train_dummy.columns.str.replace(r'[^\w\s]', '', regex=True)  # Remove special characters (raw string avoids an invalid-escape warning)
df_train_dummy.columns = df_train_dummy.columns.str.replace('_+', '_', regex=True)  # Replace consecutive underscores with a single underscore
df_train_dummy.columns = df_train_dummy.columns.str.rstrip('_')  # Remove trailing underscores at the end
df_train_dummy.columns

## Dealing with missing values

We will use a simplified method for replacing missing values: replacing them with zero.

In [None]:
df_train_dummy = df_train_dummy.fillna(0)
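
Filling with zero is simple, but for a numeric column where zero is itself a plausible value, the model cannot tell "missing" from "zero". One possible alternative, sketched here on invented data, is median imputation plus a flag that records which rows were missing. The assumption that a column such as `Tenure` has gaps is for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric column with gaps (invented values)
toy = pd.DataFrame({"Tenure": [12.0, np.nan, 30.0, 6.0, np.nan]})

# Median imputation plus a 0/1 flag recording which rows were originally missing
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# The column name "Tenure_missing" is a hypothetical label chosen for this sketch
toy_filled = pd.DataFrame(filled, columns=["Tenure", "Tenure_missing"])
print(toy_filled)
```

If this route is taken, the imputer should be fitted on the training data only and then applied to the test data with `transform`.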

# Logistic Regression Modeling

Logistic Regression is a statistical modeling technique used to predict binary outcomes or probabilities. It is commonly used when the dependent variable (target variable) is categorical and has two possible outcomes, such as yes/no, success/failure, or 0/1.

To perform logistic regression with 10-fold cross-validation using scikit-learn, you can use the following code:

In [None]:
# Import
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

In [None]:
# save to csv file
df_train_dummy.to_csv("clean_train.csv", index=False)

In [None]:
# Separate the features and target variable
X = df_train_dummy.drop('Target', axis=1)
y = df_train_dummy['Target']

# X.shape, y.shape

# Set up logistic regression model
model = LogisticRegression()

# Set up cross-validation strategy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Perform cross-validation and calculate ROC AUC
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)


In [None]:

# Print the mean ROC AUC score across folds
print('Mean ROC AUC:', scores.mean())
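
The mean alone hides how much the score varies between folds. Features on very different scales (birth year next to 0/1 dummies, for example) may also slow the solver's convergence. The sketch below shows one way to address both, using synthetic data from `make_classification` rather than the competition data. The pipeline and the `max_iter` value are illustrative choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the competition data)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling inside the pipeline is refitted on each training fold, which avoids leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores_toy = cross_val_score(pipe, X_toy, y_toy, scoring="roc_auc", cv=cv_toy)

# Report the spread across folds alongside the mean
print(f"ROC AUC: {scores_toy.mean():.3f} (std {scores_toy.std():.3f})")
```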

# Predict on the test set

In [None]:
# Test set preview
df_test.head()

## Pre-processing

We need to make sure the test data undergoes the same pre-processing steps as the training data did.
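
One pitfall is worth illustrating: if a category appears in the training set but not in the test set, `pd.get_dummies` produces different columns for the two, and the fitted model will not accept the test features. The sketch below shows the mismatch on invented toy frames and one way to align them. `make_dummies` is a hypothetical helper written for this example.

```python
import pandas as pd

def make_dummies(df, categorical_cols):
    """Hypothetical helper: one-hot encode the given columns."""
    return pd.get_dummies(df, columns=categorical_cols, drop_first=True, dummy_na=True)

# Toy train/test frames (invented values); "Suburb" never appears in the test set
train_toy = pd.DataFrame({"Geography": ["Rural", "Suburb", "Urban"], "Tenure": [1, 2, 3]})
test_toy = pd.DataFrame({"Geography": ["Rural", "Urban"], "Tenure": [4, 5]})

train_enc = make_dummies(train_toy, ["Geography"])
test_enc = make_dummies(test_toy, ["Geography"])

# Reindex to the training columns: missing dummies become 0, extra ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
print(test_aligned.columns.tolist())
```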

In [None]:
# Create "year_survey" column then
# Create "age_survey" column
df_test['Year_survey'] = pd.to_datetime(df_test['Survey_date']).dt.year
df_test['Age_survey'] = df_test['Year_survey'] - df_test['Birthyear']
df_test['Age_survey'].head()

In [None]:
df_test['Subjects_over_70'] = df_test.apply(lambda row: row.str.contains("80 - 100 %|70 - 79 %").sum(), axis=1)
df_test['Subjects_over_70'].value_counts()

In [None]:
# Remove variables we will not use
df_test_dummy = df_test.drop(["Person_id", "Survey_date"], axis=1)

# Convert categorical variables to dummy variables
df_test_dummy = pd.get_dummies(df_test_dummy, columns=selected_vars, drop_first=True, dummy_na=True)

# Clean column names
df_test_dummy.columns = df_test_dummy.columns.str.replace(' ', '_')  # Replace spaces with underscores
df_test_dummy.columns = df_test_dummy.columns.str.replace(r'[^\w\s]', '', regex=True)  # Remove special characters (raw string avoids an invalid-escape warning)
df_test_dummy.columns = df_test_dummy.columns.str.replace('_+', '_', regex=True)  # Replace consecutive underscores with a single underscore
df_test_dummy.columns = df_test_dummy.columns.str.rstrip('_')  # Remove trailing underscores at the end
df_test_dummy.columns

In [None]:
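# Dealing with missing values
df_test_dummy = df_test_dummy.fillna(0)
# Align columns with the training features (missing dummies become 0, unseen ones are dropped)
df_test_dummy = df_test_dummy.reindex(columns=X.columns, fill_value=0)
df_test_dummy.head()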

Now, let's predict!

In [None]:
# Fit the model on training set
model.fit(X, y)

In [None]:
# Test on test set

predictions = model.predict(df_test_dummy)
print(predictions[:6])
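
`predict` returns hard 0/1 labels. If submissions are scored on a ranking metric such as ROC AUC, predicted probabilities usually carry more information than hard labels. That scoring rule is an assumption here: the cross-validation above uses ROC AUC, but the competition rules should be checked. A minimal sketch of the difference, on synthetic data rather than the competition data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training and test features (not the competition data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy[:150], y_toy[:150])

# predict() returns hard 0/1 labels using a 0.5 probability cut-off
labels = clf.predict(X_toy[150:])

# predict_proba() returns one column per class; column 1 is P(class 1)
proba = clf.predict_proba(X_toy[150:])[:, 1]

print(labels[:6])
print(proba[:6].round(3))
```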

Now let's put our predictions in the format needed for submission. For every row in the dataset, the submission file should contain two columns: ID and Target.
Your submission file should look like this.

In [None]:
# Create a DataFrame df_submission with two columns "ID" and "Target"
df_submission = pd.DataFrame({"ID": df_test["Person_id"], "Target": predictions.astype(int)})
print(df_submission.head())

Save your submission as a CSV file.

In [None]:
df_submission.to_csv("submission.csv", index=False)

Et voilà! You are now ready to submit.

Predictive Insights is a leader in behavioural science and artificial intelligence to improve business efficiency and profitability. Through a combination of data science, machine learning and behavioural insights, we help customers to accurately predict sales, staffing and stock levels. Our solution improves sales forecasting on average by 50 percent. We operate in Africa as well as Europe, Middle East and India in the restaurant, food processing, retail and financial service sectors.
We are part of Alphawave, a specialised technology investment group supporting businesses seeking to do things that are complex to replicate.
