### AI Talent Workshop - Day 1: Data Gathering and Wrangling (Merging Data)

**Goal:** Learn to load data from multiple sources using Pandas and merge them into a unified DataFrame.

**Instructional Technique:** Think-Pair-Share, Group Discussion

**Materials:**
1.  **Interactive Notes:** Google Colaboratory Notebook 1 (will be shared)
2.  **Workshop:** Jupyter Notebook 1 (`AI_talent_workshop_part1.ipynb`), `candidate_profiles.csv`, and `technical_assessments.csv` (will be provided).

**Steps:**

1.  **(15 minutes):** Introduction to loading multiple datasets and the concept of merging DataFrames in Pandas using Google Colab Notebook 1. We will focus on identifying common keys.
2.  **(60 minutes):** You will work on Workshop 1 using the `AI_talent_workshop_part1.ipynb` notebook and the provided CSV files. The tasks involve:
    * Loading both `candidate_profiles.csv` and `technical_assessments.csv` into separate Pandas DataFrames.
    * Inspecting the DataFrames to understand their structure and identify a common column ('CandidateID').
    * Merging the two DataFrames on 'CandidateID'. We will start with a left merge, which keeps every row of the candidate profiles.

    For each task, you will do a Think-Pair-Share activity: think individually, discuss with a partner, and then join a class discussion.
3.  **(30 minutes):** Group discussion to review the merging process, discuss the different types of merges (left, right, inner, outer), and address any challenges. We will also preview Friday's session on pre-processing.
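
As a quick reference for the group discussion, the four merge types can be compared on a toy example. This is a minimal sketch: the two small DataFrames below are invented for illustration and are not the workshop CSV files.

```python
import pandas as pd

# Toy data, invented for illustration (not the workshop CSVs)
profiles = pd.DataFrame({
    "CandidateID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
assessments = pd.DataFrame({
    "CandidateID": [1, 3, 7],
    "Assessment Score": [85, 92, 90],
})

# Count the rows each merge type produces
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(profiles, assessments, on="CandidateID", how=how)
    row_counts[how] = len(merged)
    print(f"{how:>5}: {len(merged)} rows")
```

In this toy data, candidate 2 has no assessment and assessment candidate 7 has no profile, so `left` and `right` each keep one unmatched row, `inner` keeps only the matches, and `outer` keeps everything.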


In [2]:
# -*- coding: utf-8 -*-
import pandas as pd

# --- Task 1A: Loading the datasets ---
# Load 'candidate_profiles.csv'
try:
    profiles_df = pd.read_csv('./data/candidate_profiles.csv')
    print("\nFirst 5 rows of candidate_profiles:")
    print(profiles_df.head())
    print("\nShape of candidate_profiles:", profiles_df.shape)
except FileNotFoundError:
    print("\nError: 'candidate_profiles.csv' not found.")
    profiles_df = pd.DataFrame()


First 5 rows of candidate_profiles:
   CandidateID     Name  Years of Experience  \
0            1    Alice                    5   
1            2      Bob                    2   
2            3  Charlie                    8   
3            4    David                    3   
4            5      Eve                    6   

                                    Skills  
0                  Python, TensorFlow, NLP  
1                   Java, Machine Learning  
2   Python, Deep Learning, Computer Vision  
3             R, Statistics, Data Analysis  
4  Python, PyTorch, Reinforcement Learning  

Shape of candidate_profiles: (12, 4)


In [3]:
# --- Task 1B: Loading the datasets ---
# Load 'technical_assessments.csv'
try:
    assessments_df = pd.read_csv('./data/technical_assessments.csv')
    print("\nFirst 5 rows of technical_assessments:")
    print(assessments_df.head())
    print("\nShape of technical_assessments:", assessments_df.shape)
except FileNotFoundError:
    print("\nError: 'technical_assessments.csv' not found.")
    assessments_df = pd.DataFrame()


First 5 rows of technical_assessments:
  AssessmentID  CandidateID  Assessment Score             Topic
0         A101            1                85            Python
1         A102            3                92     Deep Learning
2         A103            5                88           PyTorch
3         A104            2                78  Machine Learning
4         A105            7                90            ML Ops

Shape of technical_assessments: (14, 4)
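
With both files loaded, the common key can also be found programmatically rather than by eye. The sketch below uses small stand-in DataFrames (`profiles_demo` and `assessments_demo` are hypothetical names with illustrative values); the same calls should work on `profiles_df` and `assessments_df`.

```python
import pandas as pd

# Small stand-ins for the two workshop DataFrames (values are illustrative)
profiles_demo = pd.DataFrame({
    "CandidateID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
assessments_demo = pd.DataFrame({
    "AssessmentID": ["A101", "A102", "A108"],
    "CandidateID": [1, 3, 1],
    "Assessment Score": [85, 92, 79],
})

# Columns present in both frames are candidate merge keys
common_columns = profiles_demo.columns.intersection(assessments_demo.columns).tolist()
print("Common columns:", common_columns)

# A key that is unique on one side but repeated on the other
# means the merge will be one-to-many
profiles_key_unique = profiles_demo["CandidateID"].is_unique
assessments_key_unique = assessments_demo["CandidateID"].is_unique
print("Unique in profiles:", profiles_key_unique)
print("Unique in assessments:", assessments_key_unique)
```

Checking uniqueness before merging helps predict whether the merged DataFrame will have more rows than the left-hand table.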


In [4]:
# --- Task 2: Initial inspection ---
# Display info and descriptive statistics for both DataFrames
print("\n--- Task 2: Initial inspection ---")
if not profiles_df.empty:
    print("\nInfo for candidate_profiles:")
    profiles_df.info()
    print("\nDescription for candidate_profiles:")
    print(profiles_df.describe())

if not assessments_df.empty:
    print("\nInfo for technical_assessments:")
    assessments_df.info()
    print("\nDescription for technical_assessments:")
    print(assessments_df.describe())



--- Task 2: Initial inspection ---

Info for candidate_profiles:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   CandidateID          12 non-null     int64 
 1   Name                 12 non-null     object
 2   Years of Experience  12 non-null     int64 
 3   Skills               12 non-null     object
dtypes: int64(2), object(2)
memory usage: 516.0+ bytes

Description for candidate_profiles:
       CandidateID  Years of Experience
count    12.000000            12.000000
mean      6.500000             4.583333
std       3.605551             2.539088
min       1.000000             1.000000
25%       3.750000             2.750000
50%       6.500000             4.500000
75%       9.250000             6.250000
max      12.000000             9.000000

Info for technical_assessments:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 

In [5]:
# --- Task 3: Identifying a common key and merging ---
# Identify the common column ('CandidateID') and merge the two DataFrames
print("\n--- Task 3: Identifying a common key and merging ---")
if (
    not profiles_df.empty
    and not assessments_df.empty
    and 'CandidateID' in profiles_df.columns
    and 'CandidateID' in assessments_df.columns
):
    # Start with a left merge: every row of profiles_df is kept
    merged_df = pd.merge(profiles_df, assessments_df, on='CandidateID', how='left')
    print("\nFirst 5 rows of the merged DataFrame:")
    print(merged_df.head())
    print("\nShape of the merged DataFrame:", merged_df.shape)
else:
    print("\nCould not perform merge. Ensure both DataFrames are loaded and have a 'CandidateID' column.")


--- Task 3: Identifying a common key and merging ---

First 5 rows of the merged DataFrame:
   CandidateID     Name  Years of Experience  \
0            1    Alice                    5   
1            1    Alice                    5   
2            2      Bob                    2   
3            3  Charlie                    8   
4            3  Charlie                    8   

                                   Skills AssessmentID  Assessment Score  \
0                 Python, TensorFlow, NLP         A101              85.0   
1                 Python, TensorFlow, NLP         A108              79.0   
2                  Java, Machine Learning         A104              78.0   
3  Python, Deep Learning, Computer Vision         A102              92.0   
4  Python, Deep Learning, Computer Vision         A110              90.0   

              Topic  
0            Python  
1        TensorFlow  
2  Machine Learning  
3     Deep Learning  
4   Computer Vision  

Shape of the merged DataFram
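
In the merged output above, Alice and Charlie each appear twice because each has more than one assessment. The scores are shown as floats (`85.0`), which is consistent with pandas converting an integer column to float when a left merge leaves some rows without a match. The sketch below reproduces both effects on invented demo data and uses the `validate` and `indicator` options of `pd.merge` to make them explicit. It is an optional check, not part of the workshop notebook.

```python
import pandas as pd

# Demo data, invented for illustration
profiles_demo = pd.DataFrame({
    "CandidateID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})
assessments_demo = pd.DataFrame({
    "CandidateID": [1, 1, 3],
    "Assessment Score": [85, 79, 92],
})

merged_demo = pd.merge(
    profiles_demo,
    assessments_demo,
    on="CandidateID",
    how="left",
    validate="one_to_many",  # raises MergeError if CandidateID repeats in profiles_demo
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)
print(merged_demo)

# Candidates with no matching assessment
unassessed = merged_demo.loc[merged_demo["_merge"] == "left_only", "Name"].tolist()
print("No assessment on record:", unassessed)
```

Here Bob has no assessment, so his row carries `NaN` in `Assessment Score` and `left_only` in `_merge`.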