MODULE 1: DATA STRUCTURE & NUMERIC HYGIENE
Dataset: sf_employee_compensation.csv

Context: We need to audit the city's payroll. The dataset is large, so structure and memory optimization are priorities before we calculate salary statistics.

Topics: Intro methods, Structure methods, Numeric methods.
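
The memory-optimization priority can be illustrated with a small sketch. The column values below are hypothetical stand-ins, not taken from the payroll file; the point is only that a low-cardinality text column usually shrinks when stored as a category.

```python
import pandas as pd

# Hypothetical toy column: a handful of distinct labels repeated many times
labels = pd.Series(["Police", "Fire", "Public Health", "Library"] * 25_000)

# deep=True counts the string payload, not just the pointers
as_object = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"text column:     {as_object:,} bytes")
print(f"category column: {as_category:,} bytes")
```

The saving depends on cardinality: a column where almost every value is unique gains little or nothing from the conversion.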

In [15]:
import pandas as pd
import numpy as np

# 1. INGESTION & ROBUST DATA PIPELINE
# Architect's Note: The raw file lacks 'total_benefits' and 'total_compensation' columns.
# We must engineer them explicitly in the pipeline to avoid KeyErrors.

sf_payroll = (
    pd.read_csv('../data/sf_employee_compensation.csv')
    # CRITICAL STEP: Standardize headers to match code references (e.g., "Health and Dental" -> "health_and_dental")
    .rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
    
    # FEATURE ENGINEERING (The "Atomic" Approach)
    # Improvement: Use .sum(axis=1) instead of (+) to treat NaNs as 0 (Safe Math)
    .assign(
        total_benefits=lambda df_: df_[[
            'retirement', 
            'health_and_dental', 
            'other_benefits'
        ]].sum(axis=1, min_count=0)
    )
    # Chaining: Calculate Total Comp using the newly created 'total_benefits'
    .assign(
        total_compensation=lambda df_: df_[[
            'salaries', 
            'overtime', 
            'other_salaries', 
            'total_benefits'
        ]].sum(axis=1, min_count=0),
        
        # Optimize Types
        organization_group=lambda df_: df_['organization_group'].astype('category'),
        job=lambda df_: df_['job'].astype('category')
    )
    .set_index('year')
)

# 2. STRUCTURAL AUDIT
print("--- Structure Audit ---")
sf_payroll.info(verbose=True, memory_usage='deep')
print(f"\nShape: {sf_payroll.shape}")

# 3. ANALYSIS PIPELINE
# Goal: Correlate salaries with benefits and identify top earners.

numeric_analysis = (
    sf_payroll
    .select_dtypes(include=['number'])
    # The source has no 'year_type' column, so nothing needs to be dropped here;
    # select_dtypes alone keeps only the numeric columns.
)

# CORRELATION
# Calculate Pearson correlation between base salaries and total benefits
correlation_val = numeric_analysis['salaries'].corr(numeric_analysis['total_benefits'])

# EXTREMES
# Identify the job role with the single highest total compensation package.
# The index is 'year' (non-unique), so idxmax() would return a year label
# (e.g., 2013) and .loc[label] would match every row for that year.
# A boolean mask on the maximum value selects the exact row(s) instead.
highest_earner_row = sf_payroll.loc[
    sf_payroll['total_compensation'] == sf_payroll['total_compensation'].max()
]

print("\n--- Correlation: Salaries vs Total Benefits ---")
print(f"Pearson r: {correlation_val:.4f}")

print("\n--- Highest Compensated Role ---")
print(highest_earner_row[['job', 'total_compensation']].to_string())

--- Structure Audit ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   year                50000 non-null  int64   
 1   organization_group  50000 non-null  category
 2   job                 50000 non-null  category
 3   salaries            50000 non-null  float64 
 4   overtime            50000 non-null  float64 
 5   other_salaries      50000 non-null  float64 
 6   retirement          50000 non-null  float64 
 7   health_and_dental   50000 non-null  float64 
 8   other_benefits      50000 non-null  float64 
 9   total_benefits      50000 non-null  float64 
 10  total_compensation  50000 non-null  float64 
dtypes: category(2), float64(8), int64(1)
memory usage: 3.7 MB

Shape: (50000, 11)

--- Correlation: Salaries vs Total Benefits ---
Pearson r: 0.9170

--- Highest Compensated Role ---
                    job  total_co
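
The "Safe Math" choice in this module (row-wise `.sum(axis=1)` rather than `+`) can be checked on a toy frame. The values are made up for illustration; only the column names come from the pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical benefit columns with gaps
toy = pd.DataFrame({
    "retirement": [100.0, np.nan, np.nan],
    "health_and_dental": [50.0, 25.0, np.nan],
})

# '+' propagates NaN: one missing component wipes out the whole total
plus_total = toy["retirement"] + toy["health_and_dental"]

# .sum(axis=1) skips NaN; min_count=0 is the default, so an all-NaN row becomes 0
sum_total = toy.sum(axis=1)

# min_count=1 keeps an all-NaN row as NaN instead of reporting a misleading 0
sum_strict = toy.sum(axis=1, min_count=1)

print(pd.DataFrame({"plus": plus_total, "sum": sum_total, "sum_strict": sum_strict}))
```

Whether an all-missing row should count as 0 or stay missing is an audit decision; `min_count=1` is the option to reach for when a silent zero would be misleading.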

MODULE 2: THE QUALITY FILTER (MISSING VALUES & RANKING)
Dataset: movie.csv

Context: We are curating a list of "Best Value" movies. The raw data is messy. We need to handle missing financial data, remove duplicates, and rank movies by Return on Investment (ROI).

Topics: Missing Value methods, Sorting, Ranking, Uniqueness, "More" methods.
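
Because the ranking topic hinges on how ties are treated, here is a small sketch comparing three `rank` methods on made-up ROI values. The labels and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical ROI values with one tie (b and c)
roi = pd.Series([3.0, 2.0, 2.0, 1.0], index=["a", "b", "c", "d"])

ranks = pd.DataFrame({
    "dense": roi.rank(method="dense", ascending=False),  # ties share a rank, no gap after
    "min": roi.rank(method="min", ascending=False),      # ties share a rank, gap after
    "average": roi.rank(ascending=False),                # default: ties get the mean rank
})
print(ranks)
```

With dense ranking the highest rank number equals the number of distinct values, which is why it suits leaderboard-style lists.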

In [32]:
movies_raw = pd.read_csv('../data/movie.csv')

# PRODUCTION PIPELINE: Cleaning & Ranking
curated_catalog = (
    movies_raw
    # 1. HANDLING MISSING VALUES (The Strategy)
    # Rule A: Drop rows where we can't calculate ROI (Budget or Gross is NaN)
    .dropna(subset=['gross', 'budget'])
    
    # Rule B: Impute categorical gaps (Content Rating)
    # Using .fillna() to categorize unknown movies
    .assign(
        content_rating=lambda df_: df_['content_rating'].fillna('Unrated')
    )
    
    # 2. UNIQUENESS
    # Rule: Remove duplicate movie titles, keeping the one with the highest gross
    .sort_values('gross', ascending=False)
    .drop_duplicates(subset=['title'], keep='first')
    
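    # 3. FEATURE ENGINEERING (for Ranking)
    # Note: 'roi' here is the gross-to-budget ratio (1.0 = break-even),
    # not the net return (gross - budget) / budget.
    .assign(roi=lambda df_: df_['gross'] / df_['budget'])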
    
    # 4. SORTING & RANKING
    # Sort by ROI descending
    .sort_values('roi', ascending=False)
    
    # Rank: Assign a dense rank (1, 2, 3...) based on ROI
    .assign(roi_rank=lambda df_: df_['roi'].rank(method='dense', ascending=False))
)

# 5. "MORE" DATAFRAME METHODS (Sampling & N-Extremes)
# Business Requirement: Pick 5 random movies for the "Staff Picks" carousel
staff_picks = curated_catalog.sample(n=5, random_state=42)

# Business Requirement: Get the top 3 and bottom 3 movies by Budget (nsmallest/nlargest)
budget_extremes = pd.concat([
    curated_catalog.nlargest(3, 'budget'),
    curated_catalog.nsmallest(3, 'budget')
])

print("--- Data Completeness (Post-Cleaning) ---")
# Verifying we dropped the critical missing data
print(curated_catalog[['gross', 'budget']].isna().sum())

--- Data Completeness (Post-Cleaning) ---
gross     0
budget    0
dtype: int64


In [36]:
staff_picks

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score,roi,roi_rank
1757,30 Minutes or Less,2011.0,Color,R,83.0,Ruben Fleischer,181.0,Bianca Kajlich,731.0,Dilshad Vadsaria,...,Action|Comedy|Crime,220.0,77935,bank heist|bank robbery|heist gone wrong|pizza...,English,Germany,28000000.0,6.1,1.323354,1579.0
1382,Space Chimps,2008.0,Color,G,81.0,Kirk De Micco,16.0,Cheryl Hines,541.0,Kenan Thompson,...,Adventure|Animation|Comedy|Family|Sci-Fi,85.0,8860,astronaut|attacked by a plant|planet|senator|s...,English,USA,37000000.0,4.5,0.813675,2256.0
3983,Jason Lives: Friday the 13th Part VI,1986.0,Color,R,86.0,Tom McLoughlin,41.0,Tony Goldwyn,956.0,Ron Palillo,...,Horror|Thriller,158.0,25332,actual animal killed|death|jason voorhees|slas...,English,USA,3000000.0,5.9,6.490686,262.0
598,The Monuments Men,2014.0,Color,PG-13,118.0,George Clooney,0.0,Bill Murray,13000.0,Matt Damon,...,Drama|War,371.0,102248,art|art expert|nazi stolen art|soldier|world w...,English,USA,70000000.0,6.1,1.114737,1836.0
4217,The Missing Person,2009.0,Color,Unrated,95.0,Noah Buschel,8.0,Merritt Wever,529.0,Amy Ryan,...,Drama,66.0,1268,cellphone|detective|missing person|private det...,English,USA,1500000.0,6.2,0.01172,3649.0
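
The uniqueness rule in the pipeline relies on sorting before `drop_duplicates`. A minimal sketch with hypothetical titles and grosses shows why the order of the two steps matters.

```python
import pandas as pd

# Hypothetical catalog: title "A" appears twice with different gross values
toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "gross": [10.0, 30.0, 20.0],
})

# Without sorting, keep='first' keeps whichever duplicate appears first in the file
unsorted_keep = toy.drop_duplicates(subset=["title"], keep="first")

# Sorting by gross first makes 'first' mean 'highest gross'
sorted_keep = (
    toy
    .sort_values("gross", ascending=False)
    .drop_duplicates(subset=["title"], keep="first")
)

print(unsorted_keep)
print(sorted_keep)
```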


MODULE 3: ASSIGNING SUBSETS (LOGICAL UPDATES)
Dataset: bikes.csv

Context: We found a glitch. Rides under 60 seconds are likely errors and should be free ($0 cost). Also, we need to flag "Commuter" rides based on the day of the week.

Topics: Assigning Subsets (.mask, .where), assign.

In [51]:
bikes = pd.read_csv('../data/bikes.csv')

# LOGIC:
# 1. If tripduration < 60, Adjusted Cost = 0.
# 2. If tripduration >= 60, Adjusted Cost = 1.50 (Standard Fee).
# 3. Create a label 'Ride_Type': Commuter (Mon-Fri) or Leisure (Sat-Sun).

# PRODUCTION CHAIN: Conditional Logic
processed_rides = (
    bikes
    .assign(
        # 1. DATETIME CONVERSION (Prerequisite)
        starttime=lambda df_: pd.to_datetime(df_['starttime']),
        
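        # 2. ASSIGNING SUBSETS (np.where as a vectorized if/else;
        #    Series.where/.mask are the pandas counterparts)
        # np.where(condition, value_if_true, value_if_false)
        ride_cost=lambda df_: np.where(
            df_['tripduration'] < 60, 
            0.00,   # Adjusted Cost = 0
            1.50    # Standard Fee = 1.50
        ),
        # Pass index=df_.index so the labels align with the frame even when
        # its index is not the default 0..n-1 range.
        ride_type=lambda df_: pd.Series(
            np.where(df_['starttime'].dt.dayofweek < 5, 'Commuter', 'Leisure'),
            index=df_.index
        ).astype('category')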
        
    )
    # Filter to show the result
    .loc[:, ['tripduration', 'ride_cost', 'ride_type', 'starttime']]
)

# VERIFICATION
print("--- subset Assignment Verification ---")
print(processed_rides.head(5))
print("\nCost Distribution:")
print(processed_rides['ride_cost'].value_counts())

--- subset Assignment Verification ---
   tripduration  ride_cost ride_type           starttime
0           993        1.5  Commuter 2013-06-28 19:01:00
1           623        1.5  Commuter 2013-06-28 22:53:00
2          1040        1.5   Leisure 2013-06-30 14:43:00
3           667        1.5  Commuter 2013-07-01 10:05:00
4           130        1.5  Commuter 2013-07-01 11:16:00

Cost Distribution:
ride_cost
1.5    50089
Name: count, dtype: int64
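
One thing to watch when wrapping `np.where` output in `pd.Series` inside `.assign`: the new Series gets a fresh 0..n-1 index, and `.assign` aligns on index labels. The sketch below uses a hypothetical frame whose index does not start at 0 to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is NOT the default range (e.g. after filtering)
toy = pd.DataFrame({"dayofweek": [0, 5, 6]}, index=[10, 11, 12])
labels = np.where(toy["dayofweek"] < 5, "Commuter", "Leisure")

# Fresh index 0..2 shares no labels with 10..12, so every value becomes NaN
misaligned = toy.assign(ride_type=pd.Series(labels))

# Reusing the frame's index keeps each label on its own row
aligned = toy.assign(ride_type=pd.Series(labels, index=toy.index))

print(misaligned)
print(aligned)
```

On a frame read straight from CSV the default index hides this problem, which is why it tends to surface only after a filter or sort is added upstream.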


In [46]:
processed_rides['ride_type'].value_counts(normalize=True)

ride_type
Commuter    0.803071
Leisure     0.196929
Name: proportion, dtype: float64
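
The topic list for this module names `.mask` and `.where`, while the cell uses `np.where`. As a rough sketch of the pandas equivalents, on hypothetical durations: `.mask` replaces values where the condition is True, and `.where` keeps values where the condition is True.

```python
import pandas as pd

# Hypothetical trip durations in seconds
duration = pd.Series([30, 90, 600], name="tripduration")
standard_fee = pd.Series(1.50, index=duration.index)

# .mask: overwrite the fee with 0.0 wherever the ride is shorter than 60 s
cost_mask = standard_fee.mask(duration < 60, 0.0)

# .where: keep the fee only where the ride is at least 60 s, else use 0.0
cost_where = standard_fee.where(duration >= 60, 0.0)

print(cost_mask.tolist(), cost_where.tolist())
```

Both forms return a Series that keeps the original index, so no manual index handling is needed.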

THE CURATOR ALGORITHM (Structure, Sorting & Ranking)
Dataset: data/movie.csv

Context: We are building a "Streaming Recommendation Engine." We need to restructure the messy source file, insert new metrics, and rank movies to surface hidden gems. Sorting must be multi-dimensional (e.g., by Year then by Score).

Topics: rename, drop, sort_values (multi-column), rank, nunique.

In [65]:
# PRODUCTION PIPELINE: Structure & Ranking
# -----------------------------------------------------------------------------
catalog_engine = (
    pd.read_csv('../data/movie.csv')
    
    # 1. STRUCTURE METHODS: RENAMING & DROPPING
    # Standard: Clean headers immediately to snake_case
    .rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
    
    # Standard: Drop columns that add noise (Vertical Filtering), then drop rows missing budget or gross
    .drop(columns=['color', 'plot_keywords'], errors='ignore')
    .dropna(subset=['gross', 'budget'])
    
    # 2. INSERTING DATA (Feature Engineering via .assign)
    # Goal: Create a "Profitability" metric
    .assign(
        profit_millions = lambda df_: (df_['gross'] - df_['budget']) / 1_000_000,
        # Goal: How does this movie compare to the whole catalog?
        # 'dense': No gaps in rank (1, 2, 2, 3). Good for "Top 10" lists.
        rank_profit = lambda df_: df_['profit_millions'].rank(method='dense', ascending=False),
        
        # 'pct': Percentile (0-1). Good for "Top 1% of Movies".
        rank_percentile = lambda df_: df_['imdb_score'].rank(pct=True)
    )
    
    # 3. MULTI-COLUMN SORTING
    # Goal: Order rows by director, then show each director's most profitable work first.
    # Logic: Sort by 'director_name' (Ascending A-Z), then 'profit_millions' (Descending)
    .sort_values(
        by=['director_name', 'profit_millions'], 
        ascending=[True, False]
    )
    
    
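    # 4. UNIQUENESS (Sanity Check)
    # Ensure we don't recommend duplicates.
    # keep='first' keeps the first row per title in the current sort order: the
    # alphabetically first director and, within that director, the highest profit.
    # The ranks above were computed before this step, so they still count dropped rows.
    .drop_duplicates(subset=['title'], keep='first')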
)

# OUTPUT: First 5 rows in sort order (director A-Z, then profit descending)
print("\n--- Catalog Ranking Sample ---")
cols = ['director_name', 'title','gross','budget', 'profit_millions', 'rank_profit', 'rank_percentile']
catalog_engine.loc[:, cols].head(5)


--- Catalog Ranking Sample ---


Unnamed: 0,director_name,title,gross,budget,profit_millions,rank_profit,rank_percentile
3430,Aaron Schneider,Get Low,9176553.0,7500000.0,1.676553,1820.0,0.7147
2156,Aaron Seltzer,Date Movie,48546578.0,20000000.0,28.546578,848.0,0.004355
2862,Abel Ferrara,The Funeral,1227324.0,12500000.0,-11.272676,2892.0,0.517023
4334,Adam Carolla,Road Hard,105943.0,1500000.0,-1.394057,2194.0,0.322776
4304,Adam Goldberg,I Love Your Work,2580.0,1650000.0,-1.64742,2226.0,0.143178
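
The sample above follows the director sort, so its rows are not the most profitable titles in the catalog. A small sketch with hypothetical directors and profits shows the difference between `head` on a sorted frame and `nlargest` on the metric itself.

```python
import pandas as pd

# Hypothetical catalog
toy = pd.DataFrame({
    "director_name": ["Zed", "Amy", "Amy", "Bob"],
    "title": ["W", "X", "Y", "Z"],
    "profit_millions": [50.0, -1.0, 5.0, 20.0],
})

by_director = toy.sort_values(
    by=["director_name", "profit_millions"],
    ascending=[True, False],
)

# head() follows the sort order: the first director's films, best first
first_rows = by_director.head(2)

# nlargest() ignores the current order and picks by the metric
top_profit = by_director.nlargest(2, "profit_millions")

print(first_rows["title"].tolist(), top_profit["title"].tolist())
```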


THE OPS DASHBOARD (Aggregation & Subsets)
Dataset: data/flights.csv

Context: We manage airport operations. We need to create a "Performance Report." This requires complex named aggregation (renaming columns while grouping) and assigning status labels based on logic (e.g., "Severe Delay").

In [78]:
import pandas as pd
import numpy as np

# 1. PRODUCTION PIPELINE: ROBUST FLIGHTS ANALYSIS
# Architect's Note: The raw 'flights.csv' contains component delays (carrier, weather, etc.) 
# but lacks a pre-computed 'arr_delay' or 'flight_id'. We engineer them safely.
ops_dashboard = (
    pd.read_csv('../data/flights.csv')
    
    # A. Feature Engineering (Vectorized & Atomic)
    .assign(
        # Calculate total arrival delay from components (Safe Sum treats NaNs as 0)
        arr_delay=lambda df_: (
            df_.loc[:, ['carrier_delay', 'weather_delay', 'nas_delay', 
                        'security_delay', 'late_aircraft_delay']]
               .sum(axis=1)
        ),
        
        # Cap negative delays at 0 (Logic: Early arrival is not a negative penalty)
        arr_delay_clean=lambda df_: df_['arr_delay'].clip(lower=0),
        
        # Status logic using np.select (The production standard for multiple conditions)
        flight_status=lambda df_: np.select(
            condlist=[
                df_['arr_delay'] > 60,
                df_['arr_delay'] > 15
            ],
            choicelist=['Severe', 'Delayed'],
            default='Normal'
        )
    )
)
    
# B. Named Aggregation (Explicit & Readable)
ops_dashboard_agg = (
    ops_dashboard.groupby('airline')
    .agg(
        total_flights   = ('date', 'count'),  # Use 'date' as a proxy for record count
        avg_delay_min   = ('arr_delay_clean', 'mean'),
        worst_delay_min = ('arr_delay_clean', 'max'),
        
        # Custom Metric: Percent of flights arriving > 15 mins late
        pct_delayed     = ('arr_delay_clean', lambda x: (x > 15).mean())
    )
    
    # C. Final Polish
    .sort_values('pct_delayed', ascending=False)
    .round(2)
)

# 2. OUTPUT (row-level preview; the per-airline aggregate is displayed in the next cell)
print("\n--- Airline Performance Dashboard (Production Output) ---")
ops_dashboard.head()


--- Airline Performance Dashboard (Production Output) ---


Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,arr_delay,arr_delay_clean,flight_status
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0,0,0,Normal
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0,0,0,Normal
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0,91,91,Severe
3,2018-01-01,B6,DTW,BOS,600,754,0,79.0,632.0,0,0,19,0,0,19,19,Delayed
4,2018-01-01,UA,LAS,EWR,600,1348,0,261.0,2227.0,0,0,0,0,0,0,0,Normal
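
The status logic depends on the order of conditions in `np.select`: the first matching condition wins. The sketch below reuses the three delay values visible in the preview (0, 19, 91) to show what happens if the conditions are listed in the wrong order.

```python
import numpy as np
import pandas as pd

# Delay values taken from the preview rows above
delay = pd.Series([0, 19, 91])

# Strictest condition first: 91 is caught by '> 60' before '> 15' is checked
correct = np.select(
    [delay > 60, delay > 15],
    ["Severe", "Delayed"],
    default="Normal",
)

# Loosest condition first: '> 15' also matches 91, so 'Severe' is never reached
swapped = np.select(
    [delay > 15, delay > 60],
    ["Delayed", "Severe"],
    default="Normal",
)

print(list(correct), list(swapped))
```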


In [79]:
ops_dashboard_agg

airline,total_flights,avg_delay_min,worst_delay_min,pct_delayed
F9,1141,20.69,494,0.3
EV,171,22.36,812,0.28
B6,3816,19.51,799,0.27
MQ,373,14.17,191,0.25
OH,257,16.46,337,0.23
OO,2085,20.24,1498,0.23
WN,4912,11.66,452,0.22
YV,729,14.3,429,0.21
VX,429,12.17,298,0.2
UA,11882,14.26,1190,0.2
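
Named aggregation can be tried on a toy frame to see how each output column is built. The airlines and delays below are hypothetical. The sketch also contrasts `'count'` (non-null values only) with `'size'` (all rows), which matters when a column is used as a proxy for the record count.

```python
import pandas as pd

# Hypothetical flights; one row has a missing date
toy = pd.DataFrame({
    "airline": ["AA", "AA", "BB", "BB", "BB", "BB"],
    "date": ["2018-01-01", None, "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"],
    "arr_delay_clean": [0, 30, 10, 12, 14, 20],
})

report = (
    toy.groupby("airline")
    .agg(
        dated_flights=("date", "count"),            # skips the missing date
        total_flights=("date", "size"),             # counts every row
        avg_delay_min=("arr_delay_clean", "mean"),
        # Mean of a boolean Series = share of True values
        pct_delayed=("arr_delay_clean", lambda x: (x > 15).mean()),
    )
)
print(report)
```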


In [72]:
pd.read_csv('../data/flights.csv')

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0
3,2018-01-01,B6,DTW,BOS,600,754,0,79.0,632.0,0,0,19,0,0
4,2018-01-01,UA,LAS,EWR,600,1348,0,261.0,2227.0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65918,2018-12-31,WN,ATL,BOS,2200,30,0,116.0,946.0,0,0,0,0,0
65919,2018-12-31,B6,SEA,JFK,2215,625,0,282.0,2422.0,0,0,0,0,0
65920,2018-12-31,B6,PHX,JFK,2234,509,0,233.0,2153.0,0,0,0,0,0
65921,2018-12-31,UA,SFO,EWR,2300,712,0,265.0,2565.0,20,0,0,0,0


In [83]:
import pandas as pd
import numpy as np

# 1. PRODUCTION-GRADE FINANCIAL PIPELINE
# Standard: We perform header standardization and datetime conversion immediately.
# Standard: We pick 'msft' as the primary price proxy (as 'price' is not in the source).
market_analysis = (
    pd.read_csv('../data/stocks/sample_missing.csv')
    # A. Ingestion & Standardization
    .rename(columns=lambda c: c.strip().lower())
    .assign(date=lambda df_: pd.to_datetime(df_['date']))
    .sort_values('date')
    .set_index('date')
    
    # B. Missing Data Restoration (Physics of Markets)
    # Using 'msft' as the target column found in the dataset.
    .assign(
        # Forward fill: "Price remains unchanged until the next transaction."
        price_ffill=lambda df_: df_['msft'].ffill(),
        
        # Linear Interpolation: "Estimates price movement across gaps."
        # limit_direction='both' ensures gaps at the file boundaries are handled.
        price_interp=lambda df_: df_['msft'].interpolate(method='linear', limit_direction='both')
    )
    
    # C. Financial Feature Engineering (Vectorized)
    .assign(
        # diff(): Absolute dollar change from previous period.
        dollar_change=lambda df_: df_['price_interp'].diff(),
        
        # pct_change(): Relative daily return (Standard for volatility/crash analysis).
        daily_return=lambda df_: df_['price_interp'].pct_change()
    )
)

# 2. EXTREME VALUE EXTRACTION
# Use .idxmin()/.idxmax() on the Series to get the index label (the date) of each extreme.
# Note: This assumes each date appears once; with duplicate labels the lookup below
# would no longer return a single scalar.
worst_crash_date = market_analysis['daily_return'].idxmin()
best_rally_date  = market_analysis['daily_return'].idxmax()

# 3. PRODUCTION REPORTING
print("--- Market Extremes (Source: MSFT) ---")
# Access scalar values with .at, whose syntax is df.at[row_label, column_label].
print(f"Worst Crash: {worst_crash_date.date()} ({market_analysis.at[worst_crash_date, 'daily_return']:.2%})")
print(f"Best Rally : {best_rally_date.date()}  ({market_analysis.at[best_rally_date, 'daily_return']:.2%})")

print("\n--- Interpolation Comparison Audit ---")
# Precision Selection: .loc with explicit column list
market_analysis.loc[:, ['msft', 'price_ffill', 'price_interp']].head(6)

--- Market Extremes (Source: MSFT) ---
Worst Crash: 2019-10-18 (-1.00%)
Best Rally : 2019-10-09  (1.26%)

--- Interpolation Comparison Audit ---


date,msft,price_ffill,price_interp
2019-10-08,135.67,135.67,135.67
2019-10-09,,135.67,137.385
2019-10-10,139.1,139.1,139.1
2019-10-11,,139.1,139.923333
2019-10-14,,139.1,140.746667
2019-10-15,141.57,141.57,141.57
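
In the table above, the gap from 2019-10-11 to 2019-10-14 spans a weekend, yet `method='linear'` treats every row as one equal step. As a sketch of the alternative, `method='time'` weights each step by elapsed time. The four rows are copied from the audit table; whether time weighting is appropriate for trading data is a modeling choice, not something the source prescribes.

```python
import numpy as np
import pandas as pd

# Rows copied from the audit table above
dates = pd.to_datetime(["2019-10-10", "2019-10-11", "2019-10-14", "2019-10-15"])
msft = pd.Series([139.10, np.nan, np.nan, 141.57], index=dates)

# Equal step per row: (141.57 - 139.10) / 3 for each missing row
linear = msft.interpolate(method="linear")

# Step proportional to elapsed days: 1 of 5 days, then 4 of 5 days
by_time = msft.interpolate(method="time")

print(pd.DataFrame({"linear": linear, "time": by_time}))
```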


In [84]:
market_analysis['daily_return'].idxmin()

Timestamp('2019-10-18 00:00:00')
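
The label-then-lookup pattern (`idxmin` followed by `.at`) can be traced on three hypothetical prices. The first return is NaN because `pct_change` has no previous row to compare against, and `idxmin` skips it.

```python
import pandas as pd

# Hypothetical closing prices
prices = pd.Series(
    [100.0, 102.0, 99.96],
    index=pd.to_datetime(["2019-10-16", "2019-10-17", "2019-10-18"]),
)

returns = prices.pct_change()    # NaN, then +2%, then about -2%
worst = returns.idxmin()         # index label (a Timestamp) of the lowest return
worst_value = returns.at[worst]  # scalar lookup by label

print(worst.date(), f"{worst_value:.2%}")
```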

THE HR AUDITOR

Focus: Structure, Sorting, and Advanced Ranking. 

Dataset: data/sf_employee_compensation.csv 

Business Goal: Audit the city's payroll. We need to restructure the messy raw file, insert categorization logic, and rank employees to find the top earners within each department.

In [92]:
import pandas as pd
import numpy as np

# 1. PRODUCTION INGESTION
# Architect's Note: The source has no 'year_type' or 'union_code' columns,
# so the pipeline needs no 'drop' step.
emp_raw = (
    pd.read_csv('../data/sf_employee_compensation.csv')
    .sample(1000, random_state=42)
    # Standardize Headers: 'organization group' -> 'organization_group'
    .rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
)

# 2. ANALYSIS PIPELINE
hr_audit = (
    emp_raw
    # FEATURE ENGINEERING: Constructing Missing Metrics (Atomic & Vectorized)
    # We must calculate totals before we can rank or sort by them.
    .assign(
        # Robust Summation: .sum(axis=1) treats NaNs as 0, avoiding data loss.
        total_benefits=lambda df_: df_[[
            'retirement', 'health_and_dental', 'other_benefits'
        ]].sum(axis=1),
        
        # Calculate Base Total Comp first
        total_compensation=lambda df_: df_[[
            'salaries', 'overtime', 'other_salaries'
        ]].sum(axis=1)
    )
    # Add Total Benefits to Total Comp (Chained for dependency)
    .assign(
        total_compensation=lambda df_: df_['total_compensation'] + df_['total_benefits'],
        total_benefits_k=lambda df_: df_['total_benefits'] / 1000
    )
    
    # TYPE OPTIMIZATION
    # Converting low-cardinality strings (Org, Job) to categories saves memory.
    # Note: Source column is 'job', not 'job_family'.
    .assign(
        organization_group=lambda df_: df_['organization_group'].astype('category'),
        job=lambda df_: df_['job'].astype('category')
    )
    
    # MULTI-COLUMN SORTING
    .sort_values(
        by=['organization_group', 'total_compensation'], 
        ascending=[True, False]
    )
    
    # ADVANCED RANKING (Vectorized)
    .assign(
        # Dense Rank: 1, 2, 2, 3 (No gaps, good for leaderboards)
        city_rank=lambda df_: df_['total_compensation'].rank(method='dense', ascending=False),
        
        # Percentile: 0.0 to 1.0 (Good for distribution analysis)
        comp_percentile=lambda df_: df_['total_compensation'].rank(pct=True),
        
        # Standard Rank: 1, 2, 2, 4 (Standard competition ranking)
        standard_rank=lambda df_: df_['total_compensation'].rank(method='min', ascending=False)
    )
)

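# 3. REPORTING
print("\n--- Payroll Leaderboard (Top 5) ---")
# Precision Selection: Using the correct column 'job' instead of 'job_family'.
# Note: Rows are sorted by organization_group first, so head(5) shows the top
# earners of the first group, not the city-wide top 5. A bare expression that is
# not the last line of a cell is not displayed; the table renders in the next cell.
cols_to_show = ['organization_group', 'job', 'total_compensation', 'city_rank', 'comp_percentile']
hr_audit.loc[:, cols_to_show].head(5)


# print("\n--- Structure Audit ---")
hr_audit.info(memory_usage='deep')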


--- Payroll Leaderboard (Top 5) ---
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 9951 to 11670
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   year                1000 non-null   int64   
 1   organization_group  1000 non-null   category
 2   job                 1000 non-null   category
 3   salaries            1000 non-null   float64 
 4   overtime            1000 non-null   float64 
 5   other_salaries      1000 non-null   float64 
 6   retirement          1000 non-null   float64 
 7   health_and_dental   1000 non-null   float64 
 8   other_benefits      1000 non-null   float64 
 9   total_benefits      1000 non-null   float64 
 10  total_compensation  1000 non-null   float64 
 11  total_benefits_k    1000 non-null   float64 
 12  city_rank           1000 non-null   float64 
 13  comp_percentile     1000 non-null   float64 
 14  standard_rank       1000 non-null   float64 
dtypes:

In [93]:
hr_audit.loc[:, cols_to_show].head(5)

Unnamed: 0,organization_group,job,total_compensation,city_rank,comp_percentile
9951,Community Health,Nurse Manager,307384.56,2.0,0.999
23111,Community Health,Senior Physician Specialist,298402.09,5.0,0.996
48224,Community Health,Clinical Nurse Specialist,272962.98,12.0,0.989
3156,Community Health,Diagnostic Imaging Tech III,262742.46,13.0,0.988
1941,Community Health,Senior Physician Specialist,256384.52,18.0,0.983
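
The business goal mentions top earners within each department, while the ranks above are city-wide. One possible way to add a per-department rank is `groupby(...).rank(...)`, sketched here on hypothetical data. The `dept_rank` column name is an assumption, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical payroll rows
toy = pd.DataFrame({
    "organization_group": ["Health", "Health", "Safety", "Safety", "Safety"],
    "total_compensation": [300.0, 250.0, 400.0, 400.0, 100.0],
})

ranked = toy.assign(
    # Rank restarts at 1 inside each organization group; ties share a rank
    dept_rank=lambda df_: (
        df_.groupby("organization_group")["total_compensation"]
        .rank(method="dense", ascending=False)
    )
)

# Top earner(s) per department, ties included
top_per_dept = ranked.loc[ranked["dept_rank"] == 1]
print(top_per_dept)
```

When the grouping column is a category dtype, as in the pipeline above, passing `observed=True` to `groupby` limits the groups to categories that actually occur in the data.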
