# Introduction

This is a step-by-step approach to the Predictive Insights competition.

Youth unemployment and under-employment are a major concern for any developing country and serve as important predictors of economic health and prosperity. Being able to predict and understand which young people will find employment, and which will require additional help, promotes evidence-based decision-making, supports economic empowerment, and allows young people to thrive in their chosen careers.

The objective of this challenge is to build a machine learning model that predicts youth employment, based on data from labour market surveys in South Africa.

This solution will help organisations like Predictive Insights achieve a baseline prediction of young people's employment outcomes, allowing them to design and test interventions that help youth transition into the labour market or improve their earnings.

# The Data

The data for this challenge comes from four rounds of a survey of youth in the South African labour market, conducted at 6-month intervals. The survey contains numerical, categorical and free-form text responses. You will also receive additional demographic information such as age and information about school level and results.

Each person in the dataset was surveyed one year prior (the ‘baseline’ data) to the follow-up survey. We are interested in predicting whether a person is employed at the follow-up survey based on their labour market status and other characteristics during the baseline.

The training set contains one row (observation) per individual: the information collected at baseline plus the target outcome (whether they were employed or not) one year later. The test set contains the data collected at baseline without the target outcome.

The objective of this challenge is to predict whether a young person will be employed, one year after the baseline survey, based on their demographic characteristics, previous and current labour market experience and education outcomes, and to deliver an easy-to-understand and insightful solution to the data team at Predictive Insights.


# Exploratory Data Analysis

## Load libraries

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

## Load data

In [None]:
df = pd.read_csv('smartwatches.csv')
df.drop_duplicates(inplace=True)

train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

# Reset index after sampling to avoid potential issues with indexing
train_df = train_df.reset_index()
test_df = test_df.reset_index()

print(train_df.shape, test_df.shape)

In [None]:
train_df.head()

Unnamed: 0,Person_id,Survey_date,Round,Status,Tenure,Geography,Province,Matric,Degree,Diploma,...,Math,Mathlit,Additional_lang,Home_lang,Science,Female,Sa_citizen,Birthyear,Birthmonth,Target
0,Id_eqz61wz7yn,2022-02-23,2,studying,,Rural,Mpumalanga,1.0,0.0,0.0,...,0 - 29 %,,50 - 59 %,,0 - 29 %,1,1,2000,5,0
1,Id_kj5k3g5wud,2023-02-06,4,unemployed,427.0,Suburb,North West,1.0,0.0,0.0,...,30 - 39 %,,40 - 49 %,,30 - 39 %,1,1,1989,4,1
2,Id_9h0isj38y4,2022-08-08,3,other,,Urban,Free State,1.0,0.0,0.0,...,30 - 39 %,,40 - 49 %,,30 - 39 %,0,1,1996,7,1
3,Id_5ch3zwpdef,2022-03-16,2,unemployed,810.0,Urban,Eastern Cape,,,,...,,,,,,0,1,2000,1,0
4,Id_g4elxibjej,2023-03-22,4,studying,,Urban,Limpopo,,,,...,,,,,,1,1,1998,12,0

In [None]:
train_df.drop(['index', 'Unnamed: 0'], axis=1, inplace=True)
test_df.drop(['index', 'Unnamed: 0'], axis=1, inplace=True)

In [None]:
train_df.info()

In [None]:
train_df.isnull().sum()

In [None]:
numerical_data = [feature for feature in train_df.columns if train_df[feature].dtype != 'object']

In [None]:
# Columns stored as strings (dtype 'object') are categorical, not continuous
categorical_data = [feature for feature in train_df.columns if train_df[feature].dtype == 'object']

In [None]:
numerical_data

In [None]:
categorical_data

## Univariate Analysis

Let's have a look at some of the variables.

**Brand**

In [None]:
train_df["Brand"].value_counts()

1    4018
0       2
Name: Sa_citizen, dtype: int64

Rows where `Sa_citizen` is 0 are heavily underrepresented: 2 out of 4,020. One option is to remove those rows, at the cost of losing data; another is to drop the column altogether, since it carries almost no variation.
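One way to make that decision less ad hoc is to measure how dominant the most frequent value is in each column and flag columns that are nearly constant. The sketch below is illustrative only: the helper name `near_constant_columns`, the 0.99 threshold, and the toy frame `demo` are assumptions, not part of the original notebook.

```python
import pandas as pd

def near_constant_columns(frame: pd.DataFrame, threshold: float = 0.99) -> list:
    """Return columns whose most frequent value covers at least `threshold` of rows."""
    flagged = []
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value
        shares = frame[col].value_counts(normalize=True, dropna=False)
        if not shares.empty and shares.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

# Toy frame reproducing the 4018 / 2 split reported for Sa_citizen
demo = pd.DataFrame({
    "Sa_citizen": [1] * 4018 + [0] * 2,
    "Female": [0, 1] * 2010,
})

to_drop = near_constant_columns(demo)
demo_reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Dropping the column keeps every row, which is usually the cheaper choice when only a handful of observations differ.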

**geography**

In [None]:
train_df["Strap Color"].value_counts()

Urban     2797
Rural      803
Suburb     420
Name: Geography, dtype: int64

From this, we see that candidates come from three geographical categories: Rural, Suburb, and Urban. The majority come from urban areas.

**tenure**

In [None]:
train_df["Strap Material"].value_counts()

In [None]:
train_df[numerical_data].describe()

In [None]:
train_df[numerical_data].corr()

In [None]:
train_df[numerical_data].skew()

From this, we see that most candidates were born between 1995 and 2000.

Histograms

In [None]:
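# Size the grid from the number of numeric columns so every column gets an axis
n_cols = 3
n_rows = -(-len(numerical_data) // n_cols)  # ceiling division
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(14, 10), squeeze=False)

for ax, column in zip(axes.flat, numerical_data):
    sns.histplot(data=train_df[column], ax=ax, kde=True)

plt.show()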

Density

In [None]:
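# Size the grid from the number of numeric columns so every column gets an axis
n_cols = 3
n_rows = -(-len(numerical_data) // n_cols)  # ceiling division
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(14, 10), squeeze=False)

for ax, column in zip(axes.flat, numerical_data):
    sns.kdeplot(data=train_df[column], ax=ax, fill=True)

plt.show()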

Box and Whisker

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(14, 10))

for i, column in enumerate(numerical_data):
  row = i // 3
  col = i % 3
  ax = axes[row, col]
  sns.boxplot(x=train_df[column], ax=ax)
  ax.set_xlabel(column)

fig.tight_layout()
plt.show()

## Multivariate Analysis

Now, let us look at the relationships between variables.

In [None]:
# Scatter Plot Matrix

sns.pairplot(train_df)

In [None]:
# Correlation Matrix

sns.heatmap(train_df[numerical_data].corr(), annot=True)
plt.show()

# Feature Engineering

Feature engineering is the process of transforming raw data into meaningful features that may improve the performance of machine learning models. It involves selecting, creating, and transforming variables to capture relevant information and enhance the predictive power of the model.

Let's extract the year of the survey then use it to calculate the age of each participant at the time of the survey.
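A minimal sketch of that step, mirroring the `Year_survey` and `Age_survey` calculation applied to the test set later in this notebook. The toy frame `demo` is an assumption that stands in for the training data; only the `Survey_date` and `Birthyear` columns from the data preview are used.

```python
import pandas as pd

# Toy rows using the Survey_date and Birthyear columns shown in the data preview
demo = pd.DataFrame({
    "Survey_date": ["2022-02-23", "2023-02-06", "2022-08-08"],
    "Birthyear": [2000, 1989, 1996],
})

# Year of the survey, parsed from the date string
demo["Year_survey"] = pd.to_datetime(demo["Survey_date"]).dt.year

# Approximate age at survey time (the birth month is ignored)
demo["Age_survey"] = demo["Year_survey"] - demo["Birthyear"]
print(demo[["Year_survey", "Age_survey"]])
```

Because the birth month is ignored, the result can be off by one year for respondents surveyed before their birthday.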

In [None]:
train_df['Display Size'].isna().sum()

In [None]:
train_df['Display Size'].value_counts().count()

In [None]:
# fillna returns a new Series; combining assignment with inplace=True would store None
train_df['Display Size'] = train_df['Display Size'].fillna('0.0 inches')

In [None]:
train_df['Display Size'] = train_df['Display Size'].apply(lambda x: float(x.split(' ')[0]))

In [None]:
train_df['Display Size'].head()

In [None]:
# Turn the 0.0 placeholder back into a missing value (no inplace=True when assigning)
train_df['Display Size'] = train_df['Display Size'].replace(0.0, np.nan)

In [None]:
train_df['Display Size'].isna().sum()

Next, we create a variable that indicates the number of subjects where the participants have obtained 70% or more.
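As a rough sketch of that feature: the test-set section later in this notebook counts matches of the bands "70 - 79 %" and "80 - 100 %" across each row. The variant below restricts the count to the subject columns visible in the data preview, which is an assumption made here to avoid matching unrelated text columns; the toy frame `demo` is also illustrative.

```python
import pandas as pd

# Subject columns from the data preview; mark bands at or above 70 %
subject_cols = ["Math", "Mathlit", "Additional_lang", "Home_lang", "Science"]
high_bands = ["70 - 79 %", "80 - 100 %"]

demo = pd.DataFrame({
    "Math": ["0 - 29 %", "80 - 100 %", None],
    "Mathlit": [None, None, "70 - 79 %"],
    "Additional_lang": ["50 - 59 %", "70 - 79 %", "60 - 69 %"],
    "Home_lang": [None, "80 - 100 %", None],
    "Science": ["0 - 29 %", "30 - 39 %", None],
})

# Count, per row, how many subjects fall in a high band (missing marks count as 0)
demo["Subjects_over_70"] = demo[subject_cols].isin(high_bands).sum(axis=1)
print(demo["Subjects_over_70"].tolist())
```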

In [None]:
train_df['Weight'].head()

In [None]:
train_df['Weight'].value_counts()

In [None]:
re.findall(r'\d+', '20 - 35 g ')

In [None]:
# Replace each weight band with the midpoint of its bounds
cal = sum(int(i) for i in re.findall(r'\d+', '20 - 35 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('20 - 35 g ', cal)

cal = sum(int(i) for i in re.findall(r'\d+', '35 - 50 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('35 - 50 g', cal)

cal = sum(int(i) for i in re.findall(r'\d+', '50 - 75 g')) / 2
train_df['Weight'] = train_df['Weight'].replace('50 - 75 g', cal)

# Open-ended bands keep their single bound
train_df['Weight'] = train_df['Weight'].replace('75g +', float(re.findall(r'\d+', '75g +')[0]))

train_df['Weight'] = train_df['Weight'].replace('<= 20 g', float(re.findall(r'\d+', '<= 20 g')[0]))

train_df['Weight'].value_counts()

Feel free to explore these newly created variables and decide whether you'd like to discard them.

In [None]:
train_df['Discount Price'] = (train_df['Original Price'] * (-train_df['Discount Percentage'])) / 100

In [None]:
train_df.drop(['Discount Percentage'], axis=1, inplace=True)

In [None]:
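# 'Discount Percentage' was dropped above, so keep only columns that still exist
numerical_data = [col for col in numerical_data if col in train_df.columns]
train_df[numerical_data].head()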

# Data cleaning

Removing outliers

In [None]:
def remove_outliers_IQR(data, col):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[col] > lower_bound) & (data[col] < upper_bound)]

In [None]:
import_col = ['Current Price', 'Original Price', 'Rating', 'Number OF Ratings', 'Display Size']

In [None]:
for col in import_col:
    train_df = remove_outliers_IQR(train_df, col)

In [None]:
for col in numerical_data:
    print(col)
    # Assign the result back; an in-place fillna on a column selection may not update train_df
    train_df[col] = train_df[col].fillna(train_df[col].median())

In [None]:
train_df.isna().sum()

## Dealing with missing values

We will use a simplified method for replacing missing values: replacing them with zero.
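For illustration, zero-filling can be sketched as below. The toy frame `demo` and the extra `Tenure_missing` indicator column are assumptions added here, not steps from the original notebook; the indicator is one way to keep track of which zeros were originally missing.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, loosely modelled on the Tenure and Matric columns
demo = pd.DataFrame({
    "Tenure": [np.nan, 427.0, np.nan, 810.0],
    "Matric": [1.0, 1.0, np.nan, np.nan],
})

# Optional: remember which values were missing before overwriting them
demo["Tenure_missing"] = demo["Tenure"].isna().astype(int)

# Replace every remaining missing value with zero
demo_filled = demo.fillna(0)
print(demo_filled)
```

Zero is a simple choice, but for a column like `Tenure` it is indistinguishable from a genuine value of zero unless an indicator of this kind is kept.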

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
data = scaler.fit_transform(train_df[numerical_data[:-1]])

In [None]:
data

In [None]:
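# Reuse train_df's index (rows were removed as outliers) so the later concat aligns row by row
data = pd.DataFrame(data, columns=numerical_data[:-1], index=train_df.index)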

In [None]:
data.head()

In [None]:
train_df.drop(numerical_data[:-1], axis=1, inplace=True)
train_df = pd.concat([train_df, data], axis=1)

# Logistic Regression Modeling

Logistic Regression is a statistical modeling technique used to predict binary outcomes or probabilities. It is commonly used when the dependent variable (target variable) is categorical and has two possible outcomes, such as yes/no, success/failure, or 0/1.
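The modelling code expects a dummy-encoded training frame named `df_train_dummy`, and the test-set section refers to a list named `selected_vars`; neither is defined in the cells shown. A minimal sketch, assuming both are built the same way as `df_test_dummy` in the test-set section. The contents of `selected_vars` and the toy rows (including the placeholder IDs) are assumptions for illustration only.

```python
import pandas as pd

# Assumed list of categorical columns to encode (not defined in the cells shown)
selected_vars = ["Status", "Geography"]

# Toy training rows with placeholder IDs
demo_train = pd.DataFrame({
    "Person_id": ["Id_a", "Id_b", "Id_c", "Id_d"],
    "Survey_date": ["2022-02-23", "2023-02-06", "2022-08-08", "2022-03-16"],
    "Status": ["studying", "unemployed", "other", "unemployed"],
    "Geography": ["Rural", "Suburb", "Urban", "Urban"],
    "Tenure": [None, 427.0, None, 810.0],
    "Target": [0, 1, 1, 0],
})

# Drop identifiers, then one-hot encode the categorical columns
df_train_dummy = demo_train.drop(["Person_id", "Survey_date"], axis=1)
df_train_dummy = pd.get_dummies(
    df_train_dummy, columns=selected_vars, drop_first=True, dummy_na=True
)

# Same column-name clean-up as the test-set cell
df_train_dummy.columns = df_train_dummy.columns.str.replace(" ", "_")
df_train_dummy.columns = df_train_dummy.columns.str.replace(r"[^\w\s]", "", regex=True)
df_train_dummy.columns = df_train_dummy.columns.str.replace("_+", "_", regex=True)
df_train_dummy.columns = df_train_dummy.columns.str.rstrip("_")

# Zero-fill missing values, as done for the test set
df_train_dummy = df_train_dummy.fillna(0)
print(df_train_dummy.columns.tolist())
```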

To perform logistic regression with 10-fold cross-validation using scikit-learn, you can use the following code:

In [None]:
# Import
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

In [None]:
# Separate the features and target variables
X = df_train_dummy.drop('Target', axis=1)
y = df_train_dummy['Target']

# Set up logistic regression model
model = LogisticRegression()

# Set up cross-validation strategy
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Perform cross-validation and calculate ROC AUC
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
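The warning above suggests two remedies: raising `max_iter` or scaling the features. A hedged sketch of both, using a scikit-learn pipeline so that scaling is fitted inside each cross-validation fold. The synthetic data from `make_classification` is an assumption standing in for `X` and `y`, and the names `scaled_model`, `cv_demo`, and `demo_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Put one feature on a much larger scale, similar to a year or a tenure in days
X_demo[:, 0] = X_demo[:, 0] * 1000 + 2000

# Standardise features, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
demo_scores = cross_val_score(scaled_model, X_demo, y_demo, scoring="roc_auc", cv=cv_demo)
print("Mean ROC AUC:", demo_scores.mean())
```

Wrapping the scaler in the pipeline keeps information from the validation fold out of the scaling statistics.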

In [None]:

# Print the mean ROC AUC score across folds
print('Mean ROC AUC:', scores.mean())

Mean ROC AUC: 0.7735122496672016


# Predict on the test set

In [None]:
# Test set preview
df_test.head()

Unnamed: 0,Person_id,Survey_date,Round,Status,Tenure,Geography,Province,Matric,Degree,Diploma,Schoolquintile,Math,Mathlit,Additional_lang,Home_lang,Science,Female,Sa_citizen,Birthyear,Birthmonth
0,Id_r90136smvl,2022-08-03,3,other,,Urban,KwaZulu-Natal,1.0,0.0,0.0,2.0,0 - 29 %,,50 - 59 %,,40 - 49 %,0,1,2002,12
1,Id_wawdqhmu6s,2023-03-16,4,unemployed,979.0,Urban,Western Cape,1.0,0.0,0.0,,,,40 - 49 %,,,1,1,1989,12
2,Id_ap2czff2bu,2023-03-14,4,unemployed,339.0,Urban,KwaZulu-Natal,0.0,0.0,0.0,1.0,,,,,,1,1,1989,12
3,Id_uhgink7iha,2023-02-16,4,studying,,Urban,Gauteng,1.0,0.0,0.0,1.0,,80 - 100 %,60 - 69 %,,,0,1,2002,11
4,Id_5j6bzk3k81,2023-03-23,4,unemployed,613.0,Urban,Gauteng,0.0,0.0,0.0,5.0,,,,,,1,1,1993,10


## Pre-processing

We need to make sure the test data undergoes the same pre-processing steps as the training data did.
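One practical consequence is that the encoded test frame must end up with exactly the columns the model was trained on, in the same order. A minimal sketch using `DataFrame.reindex`; the toy frames `train_features` and `test_features` are assumptions for illustration.

```python
import pandas as pd

# Columns the model was trained on (toy example)
train_features = pd.DataFrame({
    "Tenure": [0.0, 427.0],
    "Geography_Suburb": [0, 1],
    "Geography_Urban": [1, 0],
})

# Encoded test data: different order, one missing column, one unseen column
test_features = pd.DataFrame({
    "Geography_Urban": [1, 1],
    "Tenure": [979.0, 339.0],
    "Province_Gauteng": [0, 1],  # level never seen in the training sketch
})

# Keep exactly the training columns, in training order; absent dummies become 0
test_aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(test_aligned)
```

Columns that appear only in the test data are dropped, because the fitted model has no coefficient for them.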

In [None]:
# Create "year_survey" column then
# Create "age_survey" column
df_test['Year_survey'] = pd.to_datetime(df_test['Survey_date']).dt.year
df_test['Age_survey'] = df_test['Year_survey'] - df_test['Birthyear']
df_test['Age_survey'].head()

0    20
1    34
2    34
3    21
4    30
Name: Age_survey, dtype: int64

In [None]:
df_test['Subjects_over_70'] = df_test.apply(lambda row: row.str.contains("80 - 100 %|70 - 79 %").sum(), axis=1)
df_test['Subjects_over_70'].value_counts()

0    1817
1      90
2      19
3       8
Name: Subjects_over_70, dtype: int64

In [None]:
# Remove variables we will not use
df_test_dummy = df_test.drop(["Person_id", "Survey_date"], axis = 1)

# Convert character variables to dummy variables
df_test_dummy = pd.get_dummies(df_test_dummy, columns=selected_vars, drop_first=True, dummy_na=True)

# Clean column names
df_test_dummy.columns = df_test_dummy.columns.str.replace(' ', '_')  # Replace spaces with underscores
df_test_dummy.columns = df_test_dummy.columns.str.replace(r'[^\w\s]', '', regex=True)  # Remove special characters
df_test_dummy.columns = df_test_dummy.columns.str.replace('_+', '_', regex=True)  # Replace consecutive underscores with a single underscore
df_test_dummy.columns = df_test_dummy.columns.str.rstrip('_')  # Remove trailing underscores at the end
df_test_dummy.columns

Index(['Tenure', 'Matric', 'Degree', 'Diploma', 'Female', 'Sa_citizen',
       'Birthyear', 'Birthmonth', 'Year_survey', 'Age_survey',
       'Subjects_over_70', 'Round_20', 'Round_30', 'Round_40', 'Round_nan',
       'Status_other', 'Status_self_employed', 'Status_studying',
       'Status_unemployed', 'Status_wage_and_self_employed',
       'Status_wage_employed', 'Status_nan', 'Geography_Suburb',
       'Geography_Urban', 'Geography_nan', 'Province_Free_State',
       'Province_Gauteng', 'Province_KwaZuluNatal', 'Province_Limpopo',
       'Province_Mpumalanga', 'Province_North_West', 'Province_Northern_Cape',
       'Province_Western_Cape', 'Province_nan', 'Schoolquintile_10',
       'Schoolquintile_20', 'Schoolquintile_30', 'Schoolquintile_40',
       'Schoolquintile_50', 'Schoolquintile_nan', 'Math_30_39', 'Math_40_49',
       'Math_50_59', 'Math_60_69', 'Math_70_79', 'Math_80_100', 'Math_nan',
       'Mathlit_30_39', 'Mathlit_40_49', 'Mathlit_50_59', 'Mathlit_60_69',
       'Math

In [None]:
# Dealing with missing values
df_test_dummy = df_test_dummy.fillna(0)
df_test_dummy.head()

Unnamed: 0,Tenure,Matric,Degree,Diploma,Female,Sa_citizen,Birthyear,Birthmonth,Year_survey,Age_survey,...,Home_lang_70_79,Home_lang_80_100,Home_lang_nan,Science_30_39,Science_40_49,Science_50_59,Science_60_69,Science_70_79,Science_80_100,Science_nan
0,0.0,1.0,0.0,0.0,0,1,2002,12,2022,20,...,0,0,1,0,1,0,0,0,0,0
1,979.0,1.0,0.0,0.0,1,1,1989,12,2023,34,...,0,0,1,0,0,0,0,0,0,1
2,339.0,0.0,0.0,0.0,1,1,1989,12,2023,34,...,0,0,1,0,0,0,0,0,0,1
3,0.0,1.0,0.0,0.0,0,1,2002,11,2023,21,...,0,0,1,0,0,0,0,0,0,1
4,613.0,0.0,0.0,0.0,1,1,1993,10,2023,30,...,0,0,1,0,0,0,0,0,0,1


Now, let's predict!

In [None]:
# Fit the model on training set
model.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
# Test on test set

predictions = model.predict(df_test_dummy)
print(predictions[:6])

[1 0 0 0 0 0]


Now let's put our predictions in the format needed for submission. For every row in the dataset, the submission file should contain two columns: ID and Target.
Your submission file should look like this.

In [None]:
# Create a DataFrame df_submission with two columns "ID" and "Target"
df_submission = pd.DataFrame({"ID": df_test["Person_id"], "Target": predictions.astype(int)})
print(df_submission.head())

              ID  Target
0  Id_r90136smvl       1
1  Id_wawdqhmu6s       0
2  Id_ap2czff2bu       0
3  Id_uhgink7iha       0
4  Id_5j6bzk3k81       0


Save your submission as a CSV file.

In [None]:
df_submission.to_csv("submission.csv", index=False)



Et voilà! You are now ready to submit.