### AI Talent Workshop - Day 1 or 2: Data Pre-processing and Feature Engineering

**Optional Challenge: Advanced Data Merging (Asynchronous)**

**Goal:** To practice merging datasets that do not share a direct common identifier.

**Materials:**
* Jupyter Notebook: `advanced_merging.ipynb`
* Datasets: `candidate_details.csv` (large), `assessment_results.csv` (large)

**Task:** Merge `candidate_details.csv` and `assessment_results.csv`, which share no direct common identifier. The closest available keys are `CandidateName` in `candidate_details.csv` and `FullName` in `assessment_results.csv`. Exact string matching on these columns is fragile: differences in capitalization, extra spaces, and other mismatches between the two files will leave rows unmatched. Document your approach and every data cleaning step you take to make the merge work.
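One way to approach the cleanup is to build a normalized key column on each side and merge on that key. The sketch below is illustrative only: the toy rows, the `Score` column, the `name_key` column, and the `normalize_name` helper are all assumptions made for the example and are not taken from the workshop datasets.

```python
import pandas as pd


def normalize_name(names: pd.Series) -> pd.Series:
    """Trim, lowercase, and collapse repeated whitespace in a name column."""
    return (
        names.astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )


# Toy frames standing in for the real CSVs (all values are made up).
details_demo = pd.DataFrame({
    "CandidateName": ["  Ada Lovelace", "ALAN  TURING", "Grace Hopper"],
    "Degree": ["PhD", "PhD", "Master"],
})
results_demo = pd.DataFrame({
    "FullName": ["ada lovelace ", "Alan Turing", "Edsger Dijkstra"],
    "Score": [91, 88, 75],  # hypothetical column
})

# Build the same normalized key on both sides.
details_demo["name_key"] = normalize_name(details_demo["CandidateName"])
results_demo["name_key"] = normalize_name(results_demo["FullName"])

# An outer merge with indicator=True keeps unmatched rows and labels
# each row as 'both', 'left_only', or 'right_only' in the '_merge' column.
merged_demo = details_demo.merge(
    results_demo, on="name_key", how="outer", indicator=True
)
print(merged_demo["_merge"].value_counts())
```

Inspecting the `left_only` and `right_only` rows is a quick way to see which names still disagree after cleaning, before deciding whether an inner or left merge suits the final result.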

This is an optional activity for students who want a more complex data wrangling challenge.
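For those who want to push further, names that still fail to match after lowercasing and trimming can be compared approximately. The sketch below uses `difflib.get_close_matches` from the Python standard library. The `suggest_matches` helper, the 0.85 cutoff, and the sample names are assumptions chosen for illustration, and any suggested pair should be reviewed by hand, since two different people can have very similar names.

```python
import difflib


def suggest_matches(unmatched, candidates, cutoff=0.85):
    """Map each unmatched name to its closest candidate name.

    A name maps to None when no candidate reaches the similarity cutoff.
    """
    suggestions = {}
    for name in unmatched:
        close = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
        suggestions[name] = close[0] if close else None
    return suggestions


# Made-up leftovers from each side after an exact merge on cleaned names.
leftover_details = ["jon smith", "maria garcia"]
leftover_results = ["john smith", "wei chen"]

suggestions = suggest_matches(leftover_details, leftover_results)
print(suggestions)
```

A high cutoff keeps false matches rare at the cost of missing some true ones. Lowering it trades in the other direction, so it is worth documenting the value you choose and why.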


In [1]:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# --- Task 1: Load the datasets ---
try:
    details_df = pd.read_csv("./data/candidate_details.csv")
    print("Loaded candidate_details.csv")
    results_df = pd.read_csv("./data/assessment_results.csv")
    print("Loaded assessment_results.csv")
except FileNotFoundError as e:
    print(f"Error loading file: {e}")
    details_df = pd.DataFrame()
    results_df = pd.DataFrame()

Loaded candidate_details.csv
Loaded assessment_results.csv


In [2]:
if not details_df.empty and not results_df.empty:
    # --- Task 2: Initial Exploration ---
    print("\n--- Candidate Details ---")
    print(details_df.head())
    details_df.info()  # info() prints its own summary and returns None

    print("\n--- Assessment Results ---")
    print(results_df.head())
    results_df.info()  # info() prints its own summary and returns None

    # --- Task 3: The Challenge - Merging without a common ID ---
    print("\n--- Task 3: Merging Challenge ---")
    print("\nYour task is to merge 'candidate_details.csv' and 'assessment_results.csv'.")
    print("Notice that there is no direct common identifier. You will likely need to use the 'CandidateName' from 'candidate_details' and the 'FullName' from 'assessment_results'.")
    print("\nConsider potential issues like different naming conventions (e.g., capitalization, extra spaces).")
    print("\nThink about what type of merge would be appropriate and how you might handle discrepancies in names.")

    # --- Space for student solutions ---
    # Hint: merge on 'CandidateName' (details) and 'FullName' (results).
    # Clean both columns before merging (e.g., lowercase them and strip
    # whitespace) so that equivalent names compare equal.

    # Example of a potential first step:
    # merged_advanced = pd.merge(details_df, results_df, left_on='CandidateName', right_on='FullName', how='inner')
    # print("\nInitial merge attempt (may not be perfect):")
    # print(merged_advanced.head())

else:
    print("\nCould not load data for the advanced merging challenge.")


--- Candidate Details ---
       CandidateName  Years of Experience  \
0      Joseph Willis                  4.0   
1     Michael Carney                  9.0   
2    Cheryl Fletcher                  9.0   
3     Danielle Russo                  2.0   
4  Lawrence Richards                  0.0   

                                              Skills     Degree  \
0                   Python, TensorFlow, NLP, PyTorch        PhD   
1                            Cloud Computing, ML Ops  Associate   
2               Scikit-learn, Reinforcement Learning        PhD   
3              Deep Learning, Reinforcement Learning     Master   
4  R, Statistics, NLP, PyTorch, Java, Machine Lea...   Bachelor   

                  Major                      University  
0            Statistics  University of British Columbia  
1          Data Science           University of Alberta  
2                    AI              Queen's University  
3  Software Engineering              Queen's University  
4        