Exploring Data
-----------------

In this directory, there is a file called `employees.csv`.  Let's use pandas to load it into a dataframe:

In [2]:
import pandas as pd
from datetime import date

# Read in the data
df = pd.read_csv('employees.csv', index_col='employee_id')

# Show a few rows
df.head()

employee_id,gender,height,waist,salary,dob,death
4782566,m,1.82,1.11,74917.0,1961-11-18,1986-10-21
1427930,m,1.73,1.11,63012.0,1946-03-18,1972-09-13
8880433,f,1.68,0.93,86437.0,1978-08-14,2012-08-09
3129668,m,1.91,1.6,65603.0,1949-05-07,1975-02-15
7607672,m,1.86,1.17,60018.0,1981-07-09,2007-01-25
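
The effect of `index_col` can be checked without the real file. The sketch below is only an illustration: it assumes the CSV header matches the column names shown in the output above, and it uses an in-memory string (`csv_text`, a hypothetical stand-in) holding two rows copied from that output.

```python
import io

import pandas as pd

# Stand-in for employees.csv: an assumed header plus two rows copied from the output above
csv_text = """employee_id,gender,height,waist,salary,dob,death
4782566,m,1.82,1.11,74917.0,1961-11-18,1986-10-21
8880433,f,1.68,0.93,86437.0,1978-08-14,2012-08-09
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="employee_id")

# employee_id is now the index rather than a regular column
print(sample.index.name)
print(list(sample.columns))

# With the index set, .loc looks a row up by its employee_id
print(sample.loc[8880433, "salary"])
```

Using the ID as the index keeps it out of numeric summaries and makes lookups by employee straightforward.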


Get statistics for a series
---

In [3]:
mean_waist = df['waist'].mean()
print(f"The mean of the waist series is {mean_waist:.2f} meters.")

The mean of the waist series is 1.21 meters.


The `describe` method gathers several statistics at once:

In [4]:
df.salary.describe()

count      9930.000000
mean      63033.975227
std       20093.827794
min         297.000000
25%       49483.500000
50%       63078.500000
75%       76800.750000
max      140902.000000
Name: salary, dtype: float64
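
The result of `describe` is itself a series, so individual statistics can be pulled out by label. A minimal sketch, using a small made-up `salaries` series (hypothetical values, not taken from `employees.csv`):

```python
import pandas as pd

# Hypothetical salaries, not taken from employees.csv
salaries = pd.Series([40000.0, 50000.0, 60000.0, 70000.0, 80000.0], name="salary")

summary = salaries.describe()

# Individual statistics are looked up by their label
print(f"count: {summary['count']:.0f}")
print(f"mean:  {summary['mean']:.2f}")

# The interquartile range is the distance between the 75% and 25% quantiles
iqr = summary["75%"] - summary["25%"]
print(f"IQR:   {iqr:.2f}")
```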

Edit series (no loops)
---

In [5]:
# Convert ISO-format strings (YYYY-MM-DD) to date objects for dob and death
df['dob'] = df['dob'].apply(date.fromisoformat)
df['death'] = df['death'].apply(date.fromisoformat)

# Make a new column
df['final_age'] = df['death'] - df['dob']

# Show a few rows
df.head()

employee_id,gender,height,waist,salary,dob,death,final_age
4782566,m,1.82,1.11,74917.0,1961-11-18,1986-10-21,"9103 days, 0:00:00"
1427930,m,1.73,1.11,63012.0,1946-03-18,1972-09-13,"9676 days, 0:00:00"
8880433,f,1.68,0.93,86437.0,1978-08-14,2012-08-09,"12414 days, 0:00:00"
3129668,m,1.91,1.6,65603.0,1949-05-07,1975-02-15,"9415 days, 0:00:00"
7607672,m,1.86,1.17,60018.0,1981-07-09,2007-01-25,"9331 days, 0:00:00"
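
An alternative, shown here only as a sketch, is to let pandas parse the dates with `pd.to_datetime`. That gives native datetime columns, and the resulting timedelta column offers the `.dt` accessor. The frame `sample` and the column `final_age_years` are hypothetical names introduced for this example; the two rows are copied from the table above, and 365.25 days per year is an approximation.

```python
import pandas as pd

# Two rows copied from the table above; the rest of employees.csv is not needed here
sample = pd.DataFrame(
    {
        "dob": ["1961-11-18", "1946-03-18"],
        "death": ["1986-10-21", "1972-09-13"],
    },
    index=pd.Index([4782566, 1427930], name="employee_id"),
)

# pd.to_datetime converts a whole column at once and yields datetime64 values
sample["dob"] = pd.to_datetime(sample["dob"])
sample["death"] = pd.to_datetime(sample["death"])

# Subtracting two datetime64 columns yields a timedelta64 column
sample["final_age"] = sample["death"] - sample["dob"]

# .dt.days gives whole days; dividing by 365.25 approximates the age in years
sample["final_age_years"] = sample["final_age"].dt.days / 365.25

print(sample[["final_age", "final_age_years"]])
```

The day counts agree with the `final_age` values in the table above; only the storage type differs.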


Get info on categorical series
----

In [6]:
print("\n*** Gender ***")
series = df["gender"]
missing = series.isnull()
print(f"{missing.sum()} rows have no value for gender.")
series_counts = series.value_counts()
for value, count in series_counts.items():
    print(f'{count} employees are "{value}"')


*** Gender ***
82 rows have no value for gender.
4917 employees are "m"
4907 employees are "f"
36 employees are "F"
23 employees are "M"
19 employees are "male"
16 employees are "female"
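
The counts show the same two categories spelled several ways. One possible cleanup, sketched on a small made-up series rather than the real column, is to lowercase the labels with the vectorized `.str` accessor and then map the long forms onto the short ones. The mapping of "male" to "m" and "female" to "f" is an assumption about what the labels mean.

```python
import pandas as pd

# Hypothetical sample that mimics the spellings counted above, plus one missing value
gender = pd.Series(["m", "f", "F", "M", "male", "female", None], name="gender")

# Strip whitespace and lowercase every label; missing values stay missing
lowered = gender.str.strip().str.lower()

# Map the long spellings onto the short ones (assumed to be equivalent)
normalized = lowered.replace({"male": "m", "female": "f"})

print(normalized.value_counts(dropna=False))
```

Applied to `df["gender"]`, the same two steps would leave the rows with no value untouched, so they would still need to be handled separately.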
