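Missing values are filled with the mode (most frequent value) across each company's postings.
If that is not possible because the company has no posting with a known value, they are filled with the mode across all postings.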
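
The two-step fill described above can be sketched on toy data as follows. This is only an illustration under assumptions: the helper name `fill_with_group_mode`, the column names `company` and `sector`, and the sample rows are all hypothetical, and it uses `map` on a per-group mode table rather than a row-wise `apply`.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(df, group_col, target_col):
    """Fill NaN in target_col with the group's mode, then with the global mode.

    Hypothetical helper for illustration; not part of the notebook.
    """
    def first_mode(s):
        m = s.mode()  # mode() ignores NaN by default
        return m.iloc[0] if not m.empty else np.nan

    # Step 1: one mode per group, looked up for every row via map()
    mode_by_group = df.groupby(group_col)[target_col].agg(first_mode)
    filled = df[target_col].fillna(df[group_col].map(mode_by_group))

    # Step 2: whatever is still NaN gets the mode of the whole column
    global_mode = filled.mode()
    if not global_mode.empty:
        filled = filled.fillna(global_mode.iloc[0])
    return filled


# Toy data: company C has no known value at all, so it needs the fallback
toy = pd.DataFrame({
    'company': ['A', 'A', 'A', 'B', 'B', 'C'],
    'sector':  ['Tech', 'Tech', np.nan, 'Finance', np.nan, np.nan],
})
result = fill_with_group_mode(toy, 'company', 'sector')
print(result.tolist())
```

Company A and B rows pick up their own company's value, while company C falls back to the most frequent value overall.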

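Both Industry and Sector were candidates for binning,
but every row already carries both an Industry and a Sector value,

so the unique Sector values were reduced from 25 (including -1) to 6,

and only Sector was binned.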
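
When binning with a dictionary plus a catch-all default, it can help to list which raw values the dictionary does not cover before they are folded into the default. A minimal sketch, assuming a shortened hypothetical mapping (`toy_mapping`) and made-up sample values:

```python
import pandas as pd

# Hypothetical, shortened mapping used only for this illustration
toy_mapping = {
    'Finance': 'Finance & Consulting',
    'Information Technology': 'Technology',
    'Health Care': 'Healthcare & Education',
}

sectors = pd.Series(
    ['Finance', 'Information Technology', 'Health Care', 'Media', '-1']
)

# map() returns NaN for any value that is not a key of the mapping
binned = sectors.map(toy_mapping)

# Raw values that would silently end up in the catch-all bin
unmapped = sorted(sectors[binned.isna()].unique())
print("Not covered by the mapping:", unmapped)

# Only now collapse the uncovered values into the default bin
binned = binned.fillna('Other Services')
print(binned.value_counts())
```

Listing the uncovered values first makes it visible when a placeholder such as `'-1'` or a misspelled category lands in the default bin.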

In [16]:
import pandas as pd

df = pd.read_csv('glassdoor_jobs.csv')

# Look up the unique values of the Industry and Sector columns.
# Note: '-1' is a string placeholder, so dropna() does not remove it.
industry_uniques = df['Industry'].dropna().unique()
sector_uniques   = df['Sector'].dropna().unique()
sector_unique_count = df['Sector'].dropna().nunique()
print("Industry unique values:\n", industry_uniques)
print("\nSector unique values:\n", sector_uniques)
print("\nNumber of unique Sector values:\n", sector_unique_count)

Industry unique values:
 ['Aerospace & Defense' 'Health Care Services & Hospitals'
 'Security Services' 'Energy' 'Advertising & Marketing' 'Real Estate'
 'Banks & Credit Unions' 'Consulting' 'Internet' 'Other Retail Stores'
 'Research & Development' 'Department, Clothing, & Shoe Stores'
 'Biotech & Pharmaceuticals' 'Motion Picture Production & Distribution'
 'Enterprise Software & Network Solutions' 'Insurance Carriers'
 'Insurance Agencies & Brokerages' 'Logistics & Supply Chain'
 'Food & Beverage Manufacturing' 'Telecommunications Services'
 'IT Services' 'Computer Hardware & Software' '-1'
 'Consumer Products Manufacturing' 'Investment Banking & Asset Management'
 'Industrial Manufacturing' 'Staffing & Outsourcing' 'Metals Brokers'
 'Financial Transaction Processing' 'Sporting Goods Stores' 'Wholesale'
 'Mining' 'Financial Analytics & Research' 'Federal Agencies'
 'Education Training Services' 'Transportation Equipment Manufacturing'
 'Farm Support Services' 'Preschool & Child Care'
 'TV Br

In [7]:
import pandas as pd

df = pd.read_csv('glassdoor_jobs.csv')

# Count the missing values, which are marked as '-1'
missing_industry = (df['Industry'] == '-1').sum()
missing_sector   = (df['Sector']   == '-1').sum()

print(f"Industry column '-1' missing count: {missing_industry}")
print(f"Sector   column '-1' missing count: {missing_sector}")


Industry column '-1' missing count: 39
Sector   column '-1' missing count: 39


In [8]:
import pandas as pd

df = pd.read_csv('glassdoor_jobs.csv')

missing_industry = (df['Industry'] == '-1').sum()
missing_sector   = (df['Sector']   == '-1').sum()
total_rows       = len(df)

print(f"Industry missing: {missing_industry} ({missing_industry/total_rows:.2%})")
print(f"Sector   missing: {missing_sector} ({missing_sector/total_rows:.2%})")


Industry missing: 39 (4.08%)
Sector   missing: 39 (4.08%)
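
The counts above compare against the string `'-1'` because the placeholder is not a real NaN, so `isna()` alone would report zero. One alternative, sketched here on an inline toy CSV (the rows and company names are invented), is to have `pd.read_csv` convert the placeholder per column through its `na_values` parameter:

```python
import io

import pandas as pd

# Invented sample rows; the real file is not reproduced here
csv_text = """Company Name,Industry,Sector
Acme,Energy,"Oil, Gas, Energy & Utilities"
Globex,-1,-1
Initech,IT Services,Information Technology
"""

# A dict limits the '-1' -> NaN conversion to the listed columns only
df_toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'Industry': ['-1'], 'Sector': ['-1']},
)

missing = df_toy[['Industry', 'Sector']].isna().sum()
share = df_toy[['Industry', 'Sector']].isna().mean()
print(missing)
print(share.map('{:.2%}'.format))
```

Restricting the conversion to named columns avoids turning a legitimate `-1` in some other column into NaN.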


In [15]:
import pandas as pd
import numpy as np

# 1) Load the data
df = pd.read_csv('glassdoor_jobs.csv')

# 2) Convert the '-1' placeholder to NaN
df.replace({'Industry': {'-1': np.nan},
            'Sector'  : {'-1': np.nan}}, 
           inplace=True)

# 3) Compute the per-company mode: the most frequent value among each company's postings
ind_mode_by_co = (
    df.groupby('Company Name')['Industry']
      .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
)
sec_mode_by_co = (
    df.groupby('Company Name')['Sector']
      .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
)

# 4) Fill missing values with the per-company mode (overwrites the existing columns)
df['Industry'] = df.apply(
    lambda r: ind_mode_by_co[r['Company Name']] 
              if pd.isna(r['Industry']) else r['Industry'],
    axis=1
)
df['Sector'] = df.apply(
    lambda r: sec_mode_by_co[r['Company Name']] 
              if pd.isna(r['Sector']) else r['Sector'],
    axis=1
)

# 5) Fill the remaining NaN with the global mode; this covers companies that have no posting with a known value
global_ind_mode = df['Industry'].mode()[0]
global_sec_mode = df['Sector'].mode()[0]
df.fillna({'Industry': global_ind_mode,
           'Sector'  : global_sec_mode},
          inplace=True)

# 6) Map Sector to 6 categories (overwrites the existing column)
sector_mapping = {
    # 1. Finance & consulting
    'Finance': 'Finance & Consulting',
    'Insurance': 'Finance & Consulting',
    'Accounting & Legal': 'Finance & Consulting',
    'Business Services': 'Finance & Consulting',
    # 2. IT & tech
    'Information Technology': 'Technology',
    'Telecommunications': 'Technology',
    # 3. Healthcare & education
    'Health Care': 'Healthcare & Education',
    'Biotech & Pharmaceuticals': 'Healthcare & Education',
    'Education': 'Healthcare & Education',
    # 4. Consumer goods & retail
    'Retail': 'Consumer & Retail',
    'Consumer Services': 'Consumer & Retail',
    # 5. Industrial & energy
    'Manufacturing': 'Industrial & Energy',
    'Oil, Gas, Energy & Utilities': 'Industrial & Energy',
    'Mining & Metals': 'Industrial & Energy',
    'Construction, Repair & Maintenance': 'Industrial & Energy',
    'Aerospace & Defense': 'Industrial & Energy',
    'Transportation & Logistics': 'Industrial & Energy',
    # 6. Other services
    'Media': 'Other Services',
    'Real Estate': 'Other Services',
    'Government': 'Other Services',
    'Non-Profit': 'Other Services',
    'Arts, Entertainment & Recreation': 'Other Services',
    'Agriculture & Forestry': 'Other Services',
    'Travel & Tourism': 'Other Services'
}

df['Sector'] = df['Sector'].map(sector_mapping).fillna('Other Services')

# 7) Final check
print("Remaining missing (Industry):", df['Industry'].isna().sum())
print("Remaining missing (Sector):",   df['Sector'].isna().sum())
print("Sector unique values:", df['Sector'].unique())
print("\n")
print(df['Industry'])
print("\n")
print(df['Sector'])

Remaining missing (Industry): 0
Remaining missing (Sector): 0
Sector unique values: ['Industrial & Energy' 'Healthcare & Education' 'Finance & Consulting'
 'Other Services' 'Technology' 'Consumer & Retail']


0                   Aerospace & Defense
1      Health Care Services & Hospitals
2                     Security Services
3                                Energy
4               Advertising & Marketing
                     ...               
951                            Internet
952             Colleges & Universities
953              Staffing & Outsourcing
954                         IT Services
955                    Federal Agencies
Name: Industry, Length: 956, dtype: object


0         Industrial & Energy
1      Healthcare & Education
2        Finance & Consulting
3         Industrial & Energy
4        Finance & Consulting
                ...          
951                Technology
952    Healthcare & Education
953      Finance & Consulting
954                Technology
955            Other Services
Name: Se