# Inspect Tagged Skills File

This notebook inspects the "All Tagged Skills" parquet file in the s3_bucket/s3_output folder: its shape, column names, data types, null values, and any duplicate column names.

## 1. Import Required Libraries

Import pandas for data manipulation and pathlib for file handling.

In [1]:
import pandas as pd
from pathlib import Path

# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

## 2. Set File Path

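Define the path to the "All Tagged Skills" parquet file in the s3_output folder and confirm that it exists. If it does not, list the parquet files that are available in that folder.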

In [2]:
# Define the path to the tagged skills file
file_path = "s3_bucket/s3_output/All Tagged Skills for HR sector_output.parquet"

# Check if file exists
if Path(file_path).exists():
    print(f"✅ File found: {file_path}")
    print(f"📁 File size: {Path(file_path).stat().st_size / (1024*1024):.2f} MB")
else:
    print(f"❌ File not found: {file_path}")
    print("📂 Available files in s3_output:")
    output_dir = Path("s3_bucket/s3_output/")
    if output_dir.exists():
        for file in output_dir.glob("*.parquet"):
            print(f"   - {file.name}")
    else:
        print("   Directory not found")

✅ File found: s3_bucket/s3_output/All Tagged Skills for HR sector_output.parquet
📁 File size: 0.04 MB
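
The existence check above can be wrapped in small helpers if more than one output file needs inspecting. The following is a minimal sketch using only `pathlib`; the helper names `locate_parquet` and `list_parquet` are hypothetical and not part of this notebook.

```python
from pathlib import Path


def locate_parquet(directory, name):
    """Return the Path to `name` inside `directory`, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    target = Path(directory) / name
    return target if target.is_file() else None


def list_parquet(directory):
    """Return the sorted names of *.parquet files, or [] if the folder is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.parquet"))


def size_mb(path):
    """File size in mebibytes, matching the 1024*1024 divisor used above."""
    return Path(path).stat().st_size / (1024 * 1024)
```

A call such as `locate_parquet("s3_bucket/s3_output", "All Tagged Skills for HR sector_output.parquet")` would then return either a usable path or `None`, which keeps the "file missing" branch in one place.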


## 3. Load Tagged Skills File

Load the parquet file into a pandas DataFrame and report its shape and in-memory size.

In [3]:
try:
    # Load the parquet file
    df = pd.read_parquet(file_path, engine="auto")

    print("🎉 File loaded successfully!")
    print(f"📊 DataFrame shape: {df.shape}")
    print(f"📏 Rows: {df.shape[0]:,}, Columns: {df.shape[1]}")
    print(f"💾 Memory usage: {df.memory_usage(deep=True).sum() / (1024*1024):.2f} MB")

except Exception as e:
    print(f"❌ Error loading file: {e}")
    df = None

🎉 File loaded successfully!
📊 DataFrame shape: (33, 10)
📏 Rows: 33, Columns: 10
💾 Memory usage: 0.10 MB
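
The frame takes more memory (0.10 MB) than the file takes on disk (0.04 MB). That is common, since parquet files are usually compressed and encoded while text columns in memory are held as full strings. A per-column breakdown shows where the memory goes. The sketch below uses a made-up toy frame rather than the real file, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for the tagged-skills frame; the values are made up
toy = pd.DataFrame({
    "Course Title": ["Course A", "Course B", "Course C"],
    "proficiency_level": [4, 3, 2],
})

# deep=True also measures the string data itself rather than only the
# fixed-size references to it; index=False leaves the index out of the result
per_column = toy.memory_usage(deep=True, index=False).sort_values(ascending=False)
print(per_column)
```

Applied to the real `df`, the same call would typically show the long free-text columns dominating the total.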


## 4. Column Analysis

List the column names, check for duplicate names, and print the data type of each column.

In [4]:
if df is not None:
    print("📋 COLUMN ANALYSIS")
    print("=" * 50)

    # Get column names
    columns = list(df.columns)
    print(f"📝 All columns ({len(columns)}):")
    for i, col in enumerate(columns, 1):
        print(f"   {i:2d}. {col}")

    print("\n🔍 DUPLICATE COLUMN CHECK")
    print("=" * 50)

    # Check for duplicate columns
    duplicates = [col for col in columns if columns.count(col) > 1]

    if duplicates:
        print(f"❌ Duplicate columns found: {set(duplicates)}")
        for dup in set(duplicates):
            indices = [i for i, col in enumerate(columns) if col == dup]
            print(f"   '{dup}' appears at positions: {indices}")
    else:
        print("✅ No duplicate columns found")

    print("\n📊 BASIC INFO")
    print("=" * 50)
    print("Data types:")
    print(df.dtypes)
else:
    print("❌ Cannot analyze columns - DataFrame not loaded")

📋 COLUMN ANALYSIS
📝 All columns (10):
    1. Course Reference Number
    2. Course Title
    3. Skill Title
    4. About This Course
    5. What You'll Learn
    6. skill_lower
    7. Sector Relevance
    8. proficiency_level
    9. reason
   10. confidence

🔍 DUPLICATE COLUMN CHECK
✅ No duplicate columns found

📊 BASIC INFO
Data types:
Course Reference Number    object
Course Title               object
Skill Title                object
About This Course          object
What You'll Learn          object
skill_lower                object
Sector Relevance           object
proficiency_level           int64
reason                     object
confidence                 object
dtype: object
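
The duplicate check above counts names in a Python list. pandas can produce the same information directly with `Index.duplicated`. The sketch below assumes a toy frame with a deliberately repeated column name, since the real file has no duplicates.

```python
import pandas as pd

# Toy frame with a deliberately repeated column name (illustrative data)
toy = pd.DataFrame([[1, "a", 2], [3, "b", 4]], columns=["id", "skill", "id"])

# keep=False flags every occurrence of a repeated name, not only the later ones
dup_mask = toy.columns.duplicated(keep=False)
dup_names = sorted(set(toy.columns[dup_mask]))

# Map each duplicated name to the zero-based positions where it appears
dup_positions = {
    name: [i for i, col in enumerate(toy.columns) if col == name]
    for name in dup_names
}

print(dup_names)
print(dup_positions)
```

Here `dup_positions` comes out as `{'id': [0, 2]}`, the same zero-based positions the list-based check would report.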


## 5. Display Head of DataFrame

Show the first five rows of the tagged skills data, then the most frequent values in key columns and the null count per column.

In [5]:
if df is not None:
    print("📋 FIRST 5 ROWS")
    print("=" * 80)
    display(df.head())

    print("\n📊 SUMMARY STATISTICS")
    print("=" * 80)

    # Show value counts for key columns if they exist
    key_columns = ['Skill Title', 'proficiency_level', 'Sector Relevance']

    for col in key_columns:
        if col in df.columns:
            print(f"\n🔢 Value counts for '{col}':")
            print(df[col].value_counts().head())

    print("\n📈 NULL VALUES")
    print("=" * 50)
    null_counts = df.isnull().sum()
    if null_counts.sum() > 0:
        print("Columns with null values:")
        for col, count in null_counts[null_counts > 0].items():
            print(f"   {col}: {count} ({count/len(df)*100:.1f}%)")
    else:
        print("✅ No null values found")

else:
    print("❌ Cannot display head - DataFrame not loaded")

📋 FIRST 5 ROWS


| | Course Reference Number | Course Title | Skill Title | About This Course | What You'll Learn | skill_lower | Sector Relevance | proficiency_level | reason | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TGS-2023038550 | (eCornell) Applied Predictive Analytics in HR | Human Resource Analytics And Insights | Predictive analytics help organizations antici... | Diagnose your workforce using an HR analytics ... | human resource analytics and insights | In Sector | 4 | Learning activities demonstrate skill applicat... | medium |
| 1 | TGS-2023038405 | (eCornell) Autism at Work | Human Resource Policies And Legislation Framew... | In this course, you will explore the emerging ... | Develop a business case for making neurodivers... | human resource policies and legislation framew... | In Sector | 4 | Course prerequisites and outcomes match profic... | medium |
| 2 | TGS-2023038548 | (eCornell) Essentials of HR Analytics | Human Resource Analytics And Insights | Drawing on his experience and research, John H... | Frame questions and identify appropriate data ... | human resource analytics and insights | In Sector | 2 | Learning activities demonstrate skill applicat... | low |
| 3 | TGS-2023038531 | (eCornell) Getting Results Through Talent Mana... | Learning And Development Strategy | Cornell University Professor Brad Bell offers ... | How to assess an organization's approach to ma... | learning and development strategy | In Sector | 4 | Learning activities demonstrate skill applicat... | high |
| 4 | TGS-2023038549 | (eCornell) Strategic Talent Analytics | Human Resource Analytics And Insights | This course focuses on building analytical acu... | Align HR analytics to support larger organizat... | human resource analytics and insights | In Sector | 4 | Learning activities demonstrate skill applicat... | medium |



📊 SUMMARY STATISTICS

🔢 Value counts for 'Skill Title':
Skill Title
Human Resource Analytics And Insights                           9
Stakeholder Engagement And Management                           5
Human Resource Policies And Legislation Framework Management    2
Learning And Development Strategy                               2
Learning And Development Programme Management                   2
Name: count, dtype: int64

🔢 Value counts for 'proficiency_level':
proficiency_level
4    13
3    12
2     4
5     4
Name: count, dtype: int64

🔢 Value counts for 'Sector Relevance':
Sector Relevance
In Sector    33
Name: count, dtype: int64

📈 NULL VALUES
Columns with null values:
   Skill Title: 9 (27.3%)
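
The value counts for 'Skill Title' give no hint of the 9 null rows reported here, because `value_counts()` drops missing values by default. Passing `dropna=False` gives missing values their own row. The sketch below uses a small made-up Series rather than the real column.

```python
import pandas as pd

# Toy column with missing entries (illustrative values only)
skills = pd.Series(
    ["Analytics", None, "Analytics", "Strategy", None],
    name="Skill Title",
)

default_counts = skills.value_counts()           # missing rows are left out
full_counts = skills.value_counts(dropna=False)  # missing rows get their own entry

print(default_counts.sum(), "of", len(skills), "rows counted by default")
print(full_counts)
```

With `dropna=False` the counts add up to the full row count, which makes it easier to reconcile value counts against the null summary.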


## 6. Fix Duplicate Columns (If Any)

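If duplicate column names are found, this cell keeps the first occurrence of each name, drops the later ones, and saves the result to a separate `_FIXED` parquet file so the original is left untouched.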

In [None]:
if df is not None:
    # Check if there are duplicate columns
    columns = list(df.columns)
    duplicates = [col for col in columns if columns.count(col) > 1]

    if duplicates:
        print("🔧 FIXING DUPLICATE COLUMNS")
        print("=" * 50)
        print(f"Before fix - Total columns: {len(df.columns)}")
        print(f"Duplicate columns: {set(duplicates)}")

        # Remove duplicate columns (keep first occurrence)
        df_fixed = df.loc[:, ~df.columns.duplicated()]

        print(f"After fix - Total columns: {len(df_fixed.columns)}")
        print("✅ Duplicate columns removed")

        # Save the fixed version
        output_path = "s3_bucket/s3_output/All Tagged Skills for HR sector_output_FIXED.parquet"
        df_fixed.to_parquet(output_path, index=False, engine="auto")
        print(f"💾 Fixed file saved as: {output_path}")

        # Display the fixed DataFrame head
        print("\n📋 FIXED DATAFRAME HEAD")
        print("=" * 80)
        display(df_fixed.head())

    else:
        print("✅ No duplicate columns to fix")
else:
    print("❌ Cannot fix columns - DataFrame not loaded")
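
Because the real file has no duplicate columns, the fix above never runs on it. The sketch below shows the same keep-first mask on a made-up toy frame, along with a quick check of whether the duplicated columns actually hold the same data. That check is an added suggestion, not part of the cell above.

```python
import pandas as pd

# Toy frame where "id" appears twice with different values (illustrative data)
toy = pd.DataFrame([[1, "a", 99], [2, "b", 98]], columns=["id", "skill", "id"])

# Compare the two "id" columns by position before discarding one of them
same_data = toy.iloc[:, 0].equals(toy.iloc[:, 2])

# duplicated() marks the second and later occurrences; ~ inverts the mask,
# so only the first occurrence of each name is kept
deduped = toy.loc[:, ~toy.columns.duplicated()]

print(same_data)
print(list(deduped.columns))
print(deduped["id"].tolist())
```

In this toy case the two `id` columns differ, so keeping the first one silently discards the values 99 and 98. Comparing duplicated columns before dropping them helps avoid losing data that way.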