### AI Talent Workshop - Day 2: Data Pre-processing and Feature Engineering

**Goal:** Practice data cleaning, transformation, and basic feature engineering on the merged AI talent data.

**Instructional Technique:** Jigsaw, Team-Based Problem Solving

**Materials:**
1.  **Interactive Notes:** Google Colaboratory Notebook 2 (will be shared)
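2.  **Workshop:** Jupyter Notebook 2 (`AI_talent_workshop_part2.ipynb`) and the merged dataset (`merged_ai_data.csv`, if you saved it after merging; otherwise the notebook rebuilds it from the two source CSV files).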

**Steps:**

1.  **(20 minutes):** Review the merged DataFrame from Wednesday and introduce various data pre-processing and feature engineering techniques using Google Colab Notebook 2.
2.  **(60 minutes):** You will work on Workshop 2 using the `AI_talent_workshop_part2.ipynb` notebook. You will be divided into three initial groups:
    * **Group A:** Focuses on identifying and handling missing values in the merged data.
    * **Group B:** Focuses on examining data types and performing any necessary cleaning or conversions (e.g., standardizing experience levels if needed).
    * **Group C:** Focuses on basic feature engineering: creating new, relevant features from the existing data. For example, you might create a feature that combines skills or one that flags highly experienced candidates.
    Each group will work on their assigned task.
3.  **(20 minutes):** You will form new teams with at least one member from each of the initial groups. Each member will share their findings and the code they used. Your new team will then collaboratively discuss the impact of each pre-processing step and explore further feature engineering possibilities relevant to identifying top AI talent.
4.  **(10 minutes):** Final group discussion where each team shares their insights on the pre-processing and feature engineering steps, and we discuss how these steps prepare the data for potential machine learning models.
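
As a starting point for Groups A and B, the following is a minimal sketch on a small toy DataFrame, not the workshop solution. `CandidateID`, `Years of Experience`, and `Assessment Score` are column names used in the workshop code; the `Experience Level` column, the toy values, and the `preprocess` helper are assumptions made for illustration. Median imputation is only one of several reasonable strategies, and your group may prefer another.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the merged DataFrame.
# 'Experience Level' is a hypothetical column; all values are invented.
toy_df = pd.DataFrame({
    'CandidateID': [1, 2, 3, 4],
    'Years of Experience': ['5', '2', None, 'seven'],  # stored as text, one bad entry
    'Experience Level': ['Senior', ' junior', 'SENIOR ', None],
    'Assessment Score': [88.0, np.nan, 73.0, 91.0],
})

def preprocess(df):
    """Return a cleaned copy of df (hypothetical helper covering Group A and B steps)."""
    out = df.copy()

    # Group B: convert text to numbers; entries that cannot be parsed become NaN.
    out['Years of Experience'] = pd.to_numeric(out['Years of Experience'], errors='coerce')

    # Group B: standardize category labels (trim whitespace, unify capitalization).
    out['Experience Level'] = out['Experience Level'].str.strip().str.title()

    # Group A: record which scores were missing before filling them in.
    out['Score Missing'] = out['Assessment Score'].isna()

    # Group A: fill numeric gaps with the column median (one option among several).
    for col in ['Years of Experience', 'Assessment Score']:
        out[col] = out[col].fillna(out[col].median())

    # Group A: label missing categories explicitly instead of dropping the rows.
    out['Experience Level'] = out['Experience Level'].fillna('Unknown')
    return out

clean_df = preprocess(toy_df)
print(clean_df)
print("\nRemaining missing values:\n", clean_df.isna().sum())
```

Keeping an indicator column such as `Score Missing` lets your team discuss later whether imputation changed any conclusions.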


In [None]:
# -*- coding: utf-8 -*-
import pandas as pd

# Load the merged data (assuming it's been saved or re-loaded)
try:
    merged_df = pd.read_csv('merged_ai_data.csv') # Assuming you saved it after merging
except FileNotFoundError:
    try:
        profiles_df = pd.read_csv('candidate_profiles.csv')
        assessments_df = pd.read_csv('technical_assessments.csv')
        if not profiles_df.empty and not assessments_df.empty and 'CandidateID' in profiles_df.columns and 'CandidateID' in assessments_df.columns:
            merged_df = pd.merge(profiles_df, assessments_df, on='CandidateID', how='left')
            print("Data merged.")
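        else:
            # Report the problem instead of failing silently
            print("Error: Source data is empty or missing the 'CandidateID' column.")
            merged_df = pd.DataFrame()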
    except FileNotFoundError:
        print("Error: Could not load candidate profiles or technical assessments.")
        merged_df = pd.DataFrame()

if not merged_df.empty:
    # --- Task 1 (Group A): Handling Missing Values ---
    print("\n--- Task 1 (Group A): Handling Missing Values ---")
    print("\nInitial missing values:\n", merged_df.isnull().sum())
    # Students in Group A will implement their strategy here

    # --- Task 2 (Group B): Data Type Conversion and Cleaning ---
    print("\n--- Task 2 (Group B): Data Type Conversion and Cleaning ---")
    print("\nData types:\n", merged_df.dtypes)
    # Students in Group B will work on data type issues and potential cleaning (e.g., experience levels)

    # --- Task 3 (Group C): Feature Engineering ---
    print("\n--- Task 3 (Group C): Feature Engineering ---")
    # Students in Group C will create new features from existing ones

    # --- Collaborative Task (All Teams) ---
    print("\n--- Collaborative Task (All Teams) ---")
    # Teams will discuss the impact of the pre-processing steps and potentially engineer more features or analyze the data.

    # Example collaborative tasks (to be expanded by students):
    if 'Years of Experience' in merged_df.columns and 'Assessment Score' in merged_df.columns:
        # Example: Create a boolean feature for "Experienced" candidates (more than 3 years)
        # Note: rows with a missing 'Years of Experience' value evaluate to False here
        merged_df['Experienced'] = merged_df['Years of Experience'] > 3
        print("\nDataFrame with 'Experienced' feature:")
        print(merged_df[['CandidateID', 'Years of Experience', 'Experienced']].head())

        # Example: Analyze the relationship between experience and assessment score (briefly)
        print("\nCorrelation between Years of Experience and Assessment Score:")
        print(merged_df[['Years of Experience', 'Assessment Score']].corr())
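
Building on the `Experienced` example, Group C and the mixed teams might try features like the ones sketched below. This is a hedged illustration on toy data: the `Skills` column (assumed to hold semicolon-separated skill names), the `add_features` helper, and the thresholds `senior_years=5` and `top_score=85` are all assumptions, not part of the workshop dataset or a recommended cutoff.

```python
import pandas as pd

# Toy data; 'Skills' is a hypothetical column of semicolon-separated skill names.
toy_df = pd.DataFrame({
    'CandidateID': [101, 102, 103],
    'Years of Experience': [8, 1, 4],
    'Assessment Score': [92, 65, 80],
    'Skills': ['Python;PyTorch;SQL', 'Python', None],
})

def add_features(df, senior_years=5, top_score=85):
    """Return a copy of df with three illustrative engineered features."""
    out = df.copy()

    # Number of listed skills; a missing skill list counts as zero.
    out['Skill Count'] = out['Skills'].fillna('').apply(
        lambda s: len([skill for skill in s.split(';') if skill.strip()])
    )

    # Flag for one specific skill (plain substring match, case-insensitive).
    out['Knows PyTorch'] = out['Skills'].str.contains(
        'PyTorch', case=False, na=False, regex=False
    )

    # Combined flag: highly experienced AND strong assessment result.
    out['Top Candidate'] = (
        (out['Years of Experience'] >= senior_years)
        & (out['Assessment Score'] >= top_score)
    )
    return out

featured_df = add_features(toy_df)
print(featured_df[['CandidateID', 'Skill Count', 'Knows PyTorch', 'Top Candidate']])
```

Passing the thresholds as parameters makes it easy for teams to compare how different cutoffs change who is flagged.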