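# Sprint 1 - Dutch Industry Salary Analysis

**Goal**: Compute three key metrics for compensation across Dutch industries between 2010 and 2024:
- 🏆 **Growth Champion**: the industry with the fastest compensation growth
- 📉 **Decline King**: the industry with the largest compensation decline
- 📊 **Gap Multiplier**: the ratio between the highest and lowest industry compensation in 2024

**Data source**: CBS (Statistics Netherlands) open data
**Time range**: 2010-2024
**Created**: 2024.12.07, 23:45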

In [61]:
# Import necessary libraries - 2024.12.07, 23:45
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set Chinese font support for plots
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
plt.rcParams['axes.unicode_minus'] = False

print('📚 Libraries imported successfully')
print('🎯 Starting data analysis task')

📚 Libraries imported successfully
🎯 Starting data analysis task


## Phase 2: Data Loading and Preprocessing

In [62]:
# Task 2.1: Correctly Load JSON Data - 2024.12.07, 23:46

def load_typed_dataset(file_path):
    """
    Load the TypedDataSet.json file.
    Note: the file holds a list of records, not a dictionary.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            typed_data = json.load(f)  # Load the list directly
        df = pd.DataFrame(typed_data)
        print(f'✅ Loaded {file_path}: {len(df)} records')
        return df
    except Exception as e:
        print(f'❌ Load failed: {e}')
        return None

def load_auxiliary_data(file_path):
    """Load an auxiliary data file (sector or period lookup table)."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        df = pd.DataFrame(data)
        print(f'✅ Loaded {file_path}: {len(df)} records')
        return df
    except Exception as e:
        print(f'❌ Load failed: {e}')
        return None

# Set data file paths
data_dir = Path('../data_acquisition/raw_data')

# Load main dataset
print('🔄 Loading data files...')
typed_df = load_typed_dataset(data_dir / 'TypedDataSet.json')

# Load auxiliary data
sectors_df = load_auxiliary_data(data_dir / 'SectorBranchesSIC2008.json')
periods_df = load_auxiliary_data(data_dir / 'Periods.json')

print('📊 Data loading completed')

🔄 Loading data files...
✅ Loaded ../data_acquisition/raw_data/TypedDataSet.json: 3000 records
✅ Loaded ../data_acquisition/raw_data/SectorBranchesSIC2008.json: 100 records
✅ Loaded ../data_acquisition/raw_data/Periods.json: 30 records
📊 Data loading completed
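
The loader above expects each file to hold a bare list of records. As a minimal sketch, assuming a file might instead arrive wrapped in an OData-style `{"value": [...]}` object, a single helper could accept both shapes. The name `load_records`, the wrapper key, and the demo records are hypothetical and not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_records(file_path):
    """Load a JSON file holding either a bare list of records or a
    {"value": [...]} wrapper (the wrapper shape is an assumption)."""
    with open(file_path, "r", encoding="utf-8") as f:
        payload = json.load(f)
    if isinstance(payload, dict):
        # Unwrap; fall back to an empty list if the key is absent.
        payload = payload.get("value", [])
    return pd.DataFrame(payload)


# Self-contained demo with made-up records written to temporary files.
rows = [{"ID": 0, "Periods": "2010JJ00"}, {"ID": 1, "Periods": "2024JJ00"}]
with tempfile.TemporaryDirectory() as tmp:
    bare = Path(tmp) / "bare.json"
    wrapped = Path(tmp) / "wrapped.json"
    bare.write_text(json.dumps(rows), encoding="utf-8")
    wrapped.write_text(json.dumps({"value": rows}), encoding="utf-8")
    df_bare = load_records(bare)
    df_wrapped = load_records(wrapped)

# Both shapes should yield the same DataFrame.
print(df_bare.equals(df_wrapped))
```

Letting errors propagate, rather than printing them and returning `None`, keeps a failed load from surfacing later as a confusing error in a downstream cell.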


In [63]:
# Task 2.2: Data Structure Analysis and Cleaning - 2024.12.07, 23:47

def analyze_data_structure(df):
    """Print a structural overview and return the compensation-related columns."""
    print('📋 Data Structure Analysis')
    print(f'Data shape: {df.shape}')
    print(f'Column count: {len(df.columns)}')
    print('First 5 column names:')
    for i, col in enumerate(df.columns[:5]):
        print(f'  {i+1}. {col}')

    # Find compensation-related columns
    salary_cols = [col for col in df.columns if 'Compensation' in col or 'Wages' in col]
    print(f'💰 Found {len(salary_cols)} compensation-related columns:')
    for col in salary_cols[:3]:  # Show the first 3
        print(f'  - {col}')

    # Check null value distribution
    null_counts = df.isnull().sum()
    non_zero_nulls = null_counts[null_counts > 0]
    print(f'🔍 Columns with null values: {len(non_zero_nulls)}')

    return salary_cols

def clean_salary_data(df, salary_cols):
    """Convert the compensation columns to numeric types."""
    print('🧹 Starting data cleaning')

    # Convert compensation columns to numeric; unparseable values become NaN
    for col in salary_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    print(f'✅ Converted {len(salary_cols)} compensation columns to numeric')

    # Check data quality after conversion
    main_salary_col = 'CompensationOfEmployees_1'
    if main_salary_col in df.columns:
        valid_data = df[main_salary_col].notna()
        print(f'Main compensation column {main_salary_col} valid data: {valid_data.sum()}/{len(df)} ({valid_data.mean():.1%})')

    return df

# Execute data structure analysis
if typed_df is not None:
    salary_columns = analyze_data_structure(typed_df)
    typed_df_clean = clean_salary_data(typed_df.copy(), salary_columns)
    print('✨ Data cleaning completed')
else:
    print('❌ Cannot analyze data: the main dataset failed to load')

📋 Data Structure Analysis
Data shape: (3000, 39)
Column count: 39
First 5 column names:
  1. ID
  2. SectorBranchesSIC2008
  3. Periods
  4. CompensationOfEmployees_1
  5. WagesAndSalaries_2
💰 Found 12 compensation-related columns:
  - CompensationOfEmployees_1
  - WagesAndSalaries_2
  - CompensationOfEmployees_5
🔍 Columns with null values: 36
🧹 Starting data cleaning
✅ Converted 12 compensation columns to numeric
Main compensation column CompensationOfEmployees_1 valid data: 2940/3000 (98.0%)
✨ Data cleaning completed
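
The cleaning step relies on `pd.to_numeric(..., errors='coerce')`, which silently turns anything unparseable into `NaN`. The sketch below uses made-up raw values (not taken from the dataset) to show that behavior and one way to list the raw values that were lost.

```python
import pandas as pd

# Made-up raw values: two numeric strings, a placeholder dot, and a missing value.
raw = pd.Series(["1228", "744", ".", None])

# errors="coerce" converts unparseable entries to NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed.
lost = raw[converted.isna() & raw.notna()].tolist()

print(converted.notna().sum(), "valid values; unparseable raw values:", lost)
```

Inspecting `lost` before trusting the coerced column helps distinguish genuinely missing data from formatting problems such as placeholder symbols.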


In [64]:
# Task 2.3: Industry Mapping and Data Integration - 2024.12.07, 23:48

def map_industries(main_df, sector_df):
    """Map industry codes to industry names."""
    print('🏭 Starting industry mapping')

    # Check mapping fields
    print('Main dataset columns:', main_df.columns.tolist()[:5])
    print('Sector dataset columns:', sector_df.columns.tolist())

    # Create industry mapping dictionary
    sector_mapping = dict(zip(sector_df['Key'], sector_df['Title']))
    print(f'Created {len(sector_mapping)} industry mappings')

    # Apply mapping
    main_df['IndustryName'] = main_df['SectorBranchesSIC2008'].map(sector_mapping)

    # Check mapping results
    mapped_count = main_df['IndustryName'].notna().sum()
    print(f'Mapped {mapped_count}/{len(main_df)} records ({mapped_count/len(main_df):.1%})')

    return main_df

def map_periods(df, periods_df):
    """Map period keys to years."""
    print('📅 Starting period mapping')

    # Create period mapping dictionary
    period_mapping = dict(zip(periods_df['Key'], periods_df['Title']))
    print(f'Created {len(period_mapping)} period mappings')

    # Apply mapping
    df['Year'] = df['Periods'].map(period_mapping)

    # Convert year to a numeric type
    df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

    # Check mapping results
    valid_years = df['Year'].notna().sum()
    print(f'Mapped {valid_years}/{len(df)} years')

    # Show year range
    if valid_years > 0:
        year_range = f"{df['Year'].min():.0f} - {df['Year'].max():.0f}"
        print(f'Year range: {year_range}')

    return df

# Execute mapping operations
if all([typed_df_clean is not None, sectors_df is not None, periods_df is not None]):
    # Industry mapping
    typed_df_mapped = map_industries(typed_df_clean, sectors_df)

    # Period mapping
    typed_df_final = map_periods(typed_df_mapped, periods_df)

    print('🎉 Data integration completed')
    print(f'Final dataset shape: {typed_df_final.shape}')
else:
    print('❌ Cannot integrate data: a required dataset is missing')

🏭 Starting industry mapping
Main dataset columns: ['ID', 'SectorBranchesSIC2008', 'Periods', 'CompensationOfEmployees_1', 'WagesAndSalaries_2']
Sector dataset columns: ['Key', 'Title', 'Description', 'CategoryGroupID']
Created 100 industry mappings
Mapped 3000/3000 records (100.0%)
📅 Starting period mapping
Created 30 period mappings
Mapped 3000/3000 years
Year range: 1995 - 2024
🎉 Data integration completed
Final dataset shape: (3000, 41)
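
The run above mapped every record, but a dictionary-based `map` fails silently when keys differ only by padding or are missing from the lookup table. The following is a minimal sketch of a defensive variant, assuming keys may carry stray whitespace. The toy frames and titles are made up for illustration.

```python
import pandas as pd

# Made-up data: one key with trailing whitespace, one key absent from the lookup.
main = pd.DataFrame({"SectorBranchesSIC2008": ["A", "B ", "Z"]})
sectors = pd.DataFrame({"Key": ["A", "B"], "Title": ["Agriculture", "Mining"]})

# Normalize both sides before building and applying the mapping.
mapping = dict(zip(sectors["Key"].str.strip(), sectors["Title"]))
main["IndustryName"] = main["SectorBranchesSIC2008"].str.strip().map(mapping)

# Report the codes that still have no title, instead of only counting them.
unmapped = sorted(
    main.loc[main["IndustryName"].isna(), "SectorBranchesSIC2008"].unique()
)
print("Unmapped codes:", unmapped)
```

Listing the unmapped codes makes a partial mapping actionable, whereas a bare success percentage only signals that something is wrong.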


## Phase 3: Core Metric Calculation

In [65]:
# Task 3.1: Filter 2010 and 2024 Data (the two endpoint years) - 2024.12.07, 23:49

def filter_target_years(df, target_years=(2010, 2024)):
    """Keep only the rows for the target years."""
    print(f'🎯 Filtering target years: {list(target_years)}')

    # Filter target years
    filtered_df = df[df['Year'].isin(list(target_years))].copy()

    print(f'Filtered result: {len(filtered_df)} records')

    # Statistics by year
    year_counts = filtered_df['Year'].value_counts().sort_index()
    print('Records per year:')
    for year, count in year_counts.items():
        print(f'  {year}: {count} records')

    return filtered_df

def validate_year_data(filtered_df, salary_col='CompensationOfEmployees_1'):
    """Validate data quality per year and return industries present in both years."""
    print('🔍 Validating data quality')

    # Check valid compensation data per year
    for year in sorted(filtered_df['Year'].unique()):
        year_data = filtered_df[filtered_df['Year'] == year]
        valid_salary = year_data[salary_col].notna().sum()
        total_records = len(year_data)
        print(f'  {year}: {valid_salary}/{total_records} valid compensation records ({valid_salary/total_records:.1%})')

    # Find industries that appear in both years
    industries_2010 = set(filtered_df[filtered_df['Year'] == 2010]['IndustryName'].dropna())
    industries_2024 = set(filtered_df[filtered_df['Year'] == 2024]['IndustryName'].dropna())
    common_industries = industries_2010.intersection(industries_2024)

    print('📊 Data coverage:')
    print(f'  Industries in 2010: {len(industries_2010)}')
    print(f'  Industries in 2024: {len(industries_2024)}')
    print(f'  Common industries: {len(common_industries)}')

    return list(common_industries)

# Execute year filtering and validation
if 'typed_df_final' in locals() and typed_df_final is not None:
    target_data = filter_target_years(typed_df_final)
    common_industries = validate_year_data(target_data)
    print(f'✅ Found {len(common_industries)} comparable industries')
else:
    print('❌ Cannot filter years: data integration has not completed')

🎯 Filtering target years: [2010, 2024]
Filtered result: 200 records
Records per year:
  2010: 100 records
  2024: 100 records
🔍 Validating data quality
  2010: 98/100 valid compensation records (98.0%)
  2024: 98/100 valid compensation records (98.0%)
📊 Data coverage:
  Industries in 2010: 100
  Industries in 2024: 100
  Common industries: 100
✅ Found 100 comparable industries


In [66]:
# Task 3.2: Calculate Compensation Growth Rate - 2024.12.07, 23:50

def calculate_industry_growth(df, salary_col='CompensationOfEmployees_1'):
    """Calculate the 2010-to-2024 compensation growth rate per industry."""
    print('📈 Calculating compensation growth rate')

    # Create pivot table: one row per industry, one column per year
    pivot_df = df.pivot_table(
        index='IndustryName',
        columns='Year',
        values=salary_col,
        aggfunc='first'
    ).reset_index()

    # Rename the year columns and fail early if either year is missing
    pivot_df.columns.name = None
    pivot_df = pivot_df.rename(columns={2010: 'Salary_2010', 2024: 'Salary_2024'})
    missing = {'Salary_2010', 'Salary_2024'} - set(pivot_df.columns)
    if missing:
        raise KeyError(f'Missing year columns after pivot: {sorted(missing)}')

    # Keep industries with data in both years; copy to avoid writing to a slice
    valid_data = pivot_df.dropna(subset=['Salary_2010', 'Salary_2024']).copy()

    # Calculate growth metrics
    valid_data['Absolute_Growth'] = valid_data['Salary_2024'] - valid_data['Salary_2010']
    valid_data['Growth_Rate'] = valid_data['Absolute_Growth'] / valid_data['Salary_2010'] * 100

    print(f'Computed growth rates for {len(valid_data)} industries')

    return valid_data

def rank_growth_performance(growth_df):
    """Rank industries by growth rate, highest first."""
    print('🏆 Ranking growth performance')

    # Sort by growth rate
    ranked_df = growth_df.sort_values('Growth_Rate', ascending=False).reset_index(drop=True)

    # Show statistics
    print('Growth rate statistics:')
    print(f'  Max: {ranked_df["Growth_Rate"].max():.1f}%')
    print(f'  Min: {ranked_df["Growth_Rate"].min():.1f}%')
    print(f'  Mean: {ranked_df["Growth_Rate"].mean():.1f}%')
    print(f'  Median: {ranked_df["Growth_Rate"].median():.1f}%')

    # Show top 5 and bottom 5
    print('🔝 5 fastest-growing industries:')
    for i, row in ranked_df.head().iterrows():
        print(f'  {i+1}. {row["IndustryName"]}: {row["Growth_Rate"]:.1f}%')

    # Bottom 5 are numbered from the end of the ranking (1 = slowest)
    print('📉 5 slowest-growing industries:')
    for i, row in ranked_df.tail().iterrows():
        print(f'  {len(ranked_df)-i}. {row["IndustryName"]}: {row["Growth_Rate"]:.1f}%')

    return ranked_df

# Execute growth rate calculation
if 'target_data' in locals() and target_data is not None:
    growth_data = calculate_industry_growth(target_data)
    ranked_growth = rank_growth_performance(growth_data)
    print('✅ Growth rate calculation complete')
else:
    print('❌ Cannot calculate growth rates: target data is not ready')

📈 Calculating compensation growth rate
Computed growth rates for 98 industries
🏆 Ranking growth performance
Growth rate statistics:
  Max: 237.6%
  Min: -39.4%
  Mean: 64.9%
  Median: 64.6%
🔝 5 fastest-growing industries:
  1. 79 Travel agencies, tour operators etc: 237.6%
  2. T Activities of households: 165.3%
  3. 62-63 IT- and information services: 158.4%
  4. 28 Manufacture of machinery n.e.c.: 133.1%
  5. 77 Renting and leasing of tangible goods: 131.2%
📉 5 slowest-growing industries:
  5. B Mining and quarrying: 10.9%
  4. 65 Insurance and pension funding: 5.9%
  3. 16-18 Man. wood en paperprod., printing: 4.8%
  2. 58 Publishing: -3.1%
  1. 18 Printing and reproduction: -39.4%
✅ Growth rate calculation complete


In [67]:
# Task 3.3: Identify the Growth Champion and the Decline King - 2024.12.07, 23:51

def identify_growth_champion(growth_df):
    """Identify the industry with the highest growth rate."""
    champion = growth_df.loc[growth_df['Growth_Rate'].idxmax()]

    print('🏆 Growth Champion:')
    print(f'  Industry: {champion["IndustryName"]}')
    print(f'  2010 compensation: €{champion["Salary_2010"]:,.0f}')
    print(f'  2024 compensation: €{champion["Salary_2024"]:,.0f}')
    print(f'  Absolute growth: €{champion["Absolute_Growth"]:,.0f}')
    print(f'  Growth rate: {champion["Growth_Rate"]:.1f}%')

    return champion

def identify_decline_king(growth_df):
    """Identify the industry with the lowest growth rate."""
    decline_king = growth_df.loc[growth_df['Growth_Rate'].idxmin()]

    print('📉 Decline King:')
    print(f'  Industry: {decline_king["IndustryName"]}')
    print(f'  2010 compensation: €{decline_king["Salary_2010"]:,.0f}')
    print(f'  2024 compensation: €{decline_king["Salary_2024"]:,.0f}')
    print(f'  Absolute change: €{decline_king["Absolute_Growth"]:,.0f}')
    print(f'  Change rate: {decline_king["Growth_Rate"]:.1f}%')

    return decline_king

# Execute champion identification
if 'ranked_growth' in locals() and ranked_growth is not None:
    growth_champion = identify_growth_champion(ranked_growth)
    decline_king = identify_decline_king(ranked_growth)
    print('✅ Champion identification complete')
else:
    print('❌ Cannot identify champions: growth data is not ready')

🏆 Growth Champion:
  Industry: 79 Travel agencies, tour operators etc
  2010 compensation: €675
  2024 compensation: €2,279
  Absolute growth: €1,604
  Growth rate: 237.6%
📉 Decline King:
  Industry: 18 Printing and reproduction
  2010 compensation: €1,228
  2024 compensation: €744
  Absolute change: €-484
  Change rate: -39.4%
✅ Champion identification complete
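
As a quick arithmetic check, the growth-rate formula used in the notebook, `(end - start) / start * 100`, can be applied by hand to the rounded figures printed above. The helper name `growth_rate` is introduced here for illustration only.

```python
def growth_rate(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100


# Rounded 2010 and 2024 figures quoted from the output above.
champion = growth_rate(675, 2279)   # 79 Travel agencies, tour operators etc
decline = growth_rate(1228, 744)    # 18 Printing and reproduction

print(f"{champion:.1f}%  {decline:.1f}%")  # prints "237.6%  -39.4%"
```

Both values match the reported rates to one decimal place, which suggests the rounding in the printed euro amounts does not materially affect the headline numbers.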


In [68]:
# Task 3.4: Calculate the 2024 Salary Gap Multiplier - 2024.12.07, 23:52

def find_salary_extremes_2024(df, salary_col='CompensationOfEmployees_1'):
    """Find the rows with the highest and lowest 2024 compensation."""
    print('💰 Analyzing 2024 salary distribution')

    # Filter 2024 data
    data_2024 = df[df['Year'] == 2024].copy()

    # Remove null values
    # Note: aggregate rows such as 'A-U All economic activities' are not excluded here
    valid_2024 = data_2024.dropna(subset=[salary_col, 'IndustryName'])

    if len(valid_2024) == 0:
        print('❌ No valid 2024 compensation data found')
        return None, None

    # Find highest and lowest compensation
    highest_idx = valid_2024[salary_col].idxmax()
    lowest_idx = valid_2024[salary_col].idxmin()

    highest_salary_row = valid_2024.loc[highest_idx]
    lowest_salary_row = valid_2024.loc[lowest_idx]

    print(f'💎 Highest-compensation industry: {highest_salary_row["IndustryName"]}')
    print(f'    Compensation: €{highest_salary_row[salary_col]:,.0f}')

    print(f'💧 Lowest-compensation industry: {lowest_salary_row["IndustryName"]}')
    print(f'    Compensation: €{lowest_salary_row[salary_col]:,.0f}')

    return highest_salary_row, lowest_salary_row

def calculate_gap_multiplier(highest_row, lowest_row, salary_col='CompensationOfEmployees_1'):
    """Calculate the ratio of the highest to the lowest compensation."""
    if highest_row is None or lowest_row is None:
        return None

    highest_salary = highest_row[salary_col]
    lowest_salary = lowest_row[salary_col]

    if lowest_salary == 0:
        print('❌ Lowest compensation is 0; cannot compute the multiplier')
        return None

    gap_multiplier = highest_salary / lowest_salary

    print('📊 2024 Salary Gap Multiplier:')
    print(f'  Highest compensation: €{highest_salary:,.0f} ({highest_row["IndustryName"]})')
    print(f'  Lowest compensation: €{lowest_salary:,.0f} ({lowest_row["IndustryName"]})')
    print(f'  Gap multiplier: {gap_multiplier:.1f}x')
    print(f'  Interpretation: the highest-compensation industry is {gap_multiplier:.1f} times the lowest')

    return gap_multiplier

# Execute gap calculation
if 'target_data' in locals() and target_data is not None:
    highest_2024, lowest_2024 = find_salary_extremes_2024(target_data)
    gap_multiplier = calculate_gap_multiplier(highest_2024, lowest_2024)
    print('✅ Gap multiplier calculation complete')
else:
    print('❌ Cannot calculate the gap multiplier: target data is not ready')

💰 Analyzing 2024 salary distribution
💎 Highest-compensation industry: A-U All economic activities
    Compensation: €525,311
💧 Lowest-compensation industry: 03 Fishing and aquaculture
    Compensation: €116
📊 2024 Salary Gap Multiplier:
  Highest compensation: €525,311 (A-U All economic activities)
  Lowest compensation: €116 (03 Fishing and aquaculture)
  Gap multiplier: 4528.5x
  Interpretation: the highest-compensation industry is 4528.5 times the lowest
✅ Gap multiplier calculation complete
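
The highest row reported above is `A-U All economic activities`, which by its title is an aggregate over all sectors rather than a single industry, so the 4528.5x figure compares a total with one small sector. Below is a minimal sketch of filtering such rows before taking the extremes. It uses a toy frame built from four values quoted in the outputs above; the exclusion list `AGGREGATE_TITLES` and the helper name are hypothetical, and range-coded rows such as `16-18` may also need excluding after inspecting the sector table.

```python
import pandas as pd

# Toy 2024 frame: four rows quoted from the outputs above, not the full table.
data_2024 = pd.DataFrame({
    "IndustryName": [
        "A-U All economic activities",
        "79 Travel agencies, tour operators etc",
        "18 Printing and reproduction",
        "03 Fishing and aquaculture",
    ],
    "CompensationOfEmployees_1": [525311, 2279, 744, 116],
})

# Hypothetical exclusion list; extend it after inspecting the sector titles.
AGGREGATE_TITLES = ("A-U All economic activities",)


def gap_multiplier(df, col="CompensationOfEmployees_1", exclude=()):
    """Return (highest name, lowest name, ratio) after dropping excluded rows."""
    kept = df[~df["IndustryName"].isin(list(exclude))]
    top = kept.loc[kept[col].idxmax()]
    bottom = kept.loc[kept[col].idxmin()]
    return top["IndustryName"], bottom["IndustryName"], top[col] / bottom[col]


with_total = gap_multiplier(data_2024)
without_total = gap_multiplier(data_2024, exclude=AGGREGATE_TITLES)

print(f"with total row: {with_total[2]:.1f}x ({with_total[0]})")
print(f"without total row, the top entry is: {without_total[0]}")
```

Because the toy frame holds only four rows, the ratio without the total row is illustrative and should not be read as the real 2024 multiplier; that requires rerunning the filter on the full dataset.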


## 📋 Final Results Summary

### 🎯 Three Key Numbers

In [69]:
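# Final Results Summary - 2024.12.07, 23:53

def summarize_big_numbers():
    """Summarize the three key numbers."""
    print('🎉 === Dutch Industry Salary Analysis: Final Results === 🎉')
    print()

    # The results live in the notebook's global namespace, so check globals();
    # inside a function, locals() only contains the function's own names.
    if 'growth_champion' in globals() and growth_champion is not None:
        print('🏆 Growth Champion:')
        print(f'    {growth_champion["IndustryName"]}')
        print(f'    Growth rate: {growth_champion["Growth_Rate"]:.1f}%')
        print()

    if 'decline_king' in globals() and decline_king is not None:
        print('📉 Decline King:')
        print(f'    {decline_king["IndustryName"]}')
        print(f'    Change rate: {decline_king["Growth_Rate"]:.1f}%')
        print()

    if 'gap_multiplier' in globals() and gap_multiplier is not None:
        print('📊 Salary Gap Multiplier:')
        print(f'    {gap_multiplier:.1f}x')
        print(f'    (In 2024 the highest-compensation industry is {gap_multiplier:.1f} times the lowest)')
        print()

    print('=' * 60)
    print('✅ Data analysis complete! These numbers can be used for front-end display.')
    print('📊 Analysis based on CBS (Statistics Netherlands) data for 2010-2024')

# Execute final summary
summarize_big_numbers()

🎉 === Dutch Industry Salary Analysis: Final Results === 🎉

🏆 Growth Champion:
    79 Travel agencies, tour operators etc
    Growth rate: 237.6%

📉 Decline King:
    18 Printing and reproduction
    Change rate: -39.4%

📊 Salary Gap Multiplier:
    4528.5x
    (In 2024 the highest-compensation industry is 4528.5 times the lowest)

============================================================
✅ Data analysis complete! These numbers can be used for front-end display.
📊 Analysis based on CBS (Statistics Netherlands) data for 2010-2024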
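
A summary function that tests for notebook variables with `locals()` will not find them, because inside a function `locals()` holds only that function's own names. The sketch below illustrates the scoping rule and one alternative, passing the results in as arguments. The function names and the dictionary standing in for the champion row are illustrative assumptions.

```python
# Stand-in for a result computed in an earlier cell (values quoted from above).
growth_champion = {
    "IndustryName": "79 Travel agencies, tour operators etc",
    "Growth_Rate": 237.6,
}


def sees_via_locals():
    # locals() here is the function's own (empty) namespace.
    return "growth_champion" in locals()


def summarize(champion=None, multiplier=None):
    """Build summary lines from explicitly passed results."""
    lines = []
    if champion is not None:
        lines.append(f"{champion['IndustryName']}: {champion['Growth_Rate']:.1f}%")
    if multiplier is not None:
        lines.append(f"Gap multiplier: {multiplier:.1f}x")
    return lines


print(sees_via_locals())  # False: the name is not local to the function
for line in summarize(growth_champion, 4528.5):
    print(line)
```

Passing results explicitly also makes the summary testable in isolation and avoids silent omissions when an earlier cell has not been run.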
