# Series Essential Methods

This notebook covers the most common and simplest Series methods.

### Objectives
* Arithmetic operators
* Summary statistic methods
* **`str`** accessor for strings
* Boolean arithmetic
* Case Study: Do people with more experience make more money?


# Arithmetic Operators
Typically, the arithmetic operators `+, -, *, /, //, **` are applied to a single column of data rather than to a whole DataFrame, since most datasets mix numeric and string columns.
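Before turning to the real data, here is a minimal sketch of how these operators behave on a small, made-up Series. The values and the name `toy_salary` are hypothetical; the point is only that each operator is applied element by element and that missing values stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; the NaN stands in for a missing value
toy_salary = pd.Series([50000.0, 62500.0, np.nan, 41000.0])

raised = toy_salary + 1000             # add 1000 to every element
in_thousands = toy_salary / 1000       # true division
whole_thousands = toy_salary // 1000   # floor division
squared = toy_salary ** 2              # exponentiation

print(raised)
print(whole_thousands)
```

Each result is a new Series of the same length; `toy_salary` itself is not modified.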

In [1]:
import pandas as pd

### Read in the employee dataset and get some metadata on it

In [12]:
employee = pd.read_csv('data/employee.csv')
employee.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13
1,364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18
2,1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25
4,8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22


In [3]:
employee.shape

(2000, 10)

In [4]:
employee.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 10 columns):
UNIQUE_ID            2000 non-null int64
POSITION_TITLE       2000 non-null object
DEPARTMENT           2000 non-null object
BASE_SALARY          1886 non-null float64
RACE                 1965 non-null object
EMPLOYMENT_TYPE      2000 non-null object
GENDER               2000 non-null object
EMPLOYMENT_STATUS    2000 non-null object
HIRE_DATE            2000 non-null object
JOB_DATE             1997 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 156.3+ KB


# Select a Numeric Column for arithmetic operations

In [5]:
salary = employee['BASE_SALARY']
salary.head()

0    121862.0
1     26125.0
2     45279.0
3     63166.0
4     56347.0
Name: BASE_SALARY, dtype: float64

In [6]:
type(salary)

pandas.core.series.Series

In [7]:
salary + 100

0       121962.0
1        26225.0
2        45379.0
3        63266.0
4        56447.0
5        66714.0
6        71780.0
7        42490.0
8       108062.0
9        44716.0
10       52744.0
11      180516.0
12       30447.0
13       55369.0
14       77176.0
15           NaN
16           NaN
17       81339.0
18       40681.0
19       66714.0
20       61606.0
21       57915.0
22       43364.0
23       55537.0
24       52614.0
25       61743.0
26           NaN
27       66714.0
28       81339.0
29       29657.0
          ...   
1970     66714.0
1971     45379.0
1972     63266.0
1973     77425.0
1974     66714.0
1975     32257.0
1976     34566.0
1977     45379.0
1978     28014.0
1979     66714.0
1980         NaN
1981     70281.0
1982     55537.0
1983     55272.0
1984     62021.0
1985     30322.0
1986     77176.0
1987         NaN
1988    124215.0
1989     37311.0
1990     30447.0
1991     44529.0
1992     29386.0
1993     81339.0
1994    104555.0
1995     43543.0
1996     66623.0
1997     43543

In [8]:
(salary * 10).head()

0    1218620.0
1     261250.0
2     452790.0
3     631660.0
4     563470.0
Name: BASE_SALARY, dtype: float64

In [9]:
salary_10x = salary * 10

In [10]:
salary_10x.head()

0    1218620.0
1     261250.0
2     452790.0
3     631660.0
4     563470.0
Name: BASE_SALARY, dtype: float64

In [13]:
# typically we do not do arithmetic operations to entire dataframes
employee + 100

TypeError: Could not operate 100 with block values must be str, not int

# Statistical Summary methods

#### Basic
* `min, max, median, mean, mode, sum, var, std, corr, cov`

#### Accumulation
* `cumsum, cummin, cummax`
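The difference between the two groups can be sketched on a tiny, made-up Series (the name `toy` and its values are assumptions for illustration): basic methods collapse the Series to a single value, while accumulation methods return a Series of the same length.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# Basic methods reduce the Series to one value
total = toy.sum()
middle = toy.median()

# mode returns a Series because several values can tie
most_common = toy.mode()

# Accumulation methods keep the original length
running_total = toy.cumsum()
running_max = toy.cummax()

print(total, middle)
print(running_max)
```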

In [14]:
salary.max()

275000.0

In [15]:
# shift + tab + tab brings up the documentation
salary.mean()

55767.93160127253

In [18]:
salary.head(15)

0     121862.0
1      26125.0
2      45279.0
3      63166.0
4      56347.0
5      66614.0
6      71680.0
7      42390.0
8     107962.0
9      44616.0
10     52644.0
11    180416.0
12     30347.0
13     55269.0
14     77076.0
Name: BASE_SALARY, dtype: float64

In [16]:
salary.cummax().head(15)

0     121862.0
1     121862.0
2     121862.0
3     121862.0
4     121862.0
5     121862.0
6     121862.0
7     121862.0
8     121862.0
9     121862.0
10    121862.0
11    180416.0
12    180416.0
13    180416.0
14    180416.0
Name: BASE_SALARY, dtype: float64

In [19]:
(salary // 1000).mode()

0    66.0
dtype: float64

In [21]:
salary.value_counts()

66614.0     157
55461.0      68
81239.0      59
26125.0      39
62540.0      38
47650.0      37
70181.0      31
60347.0      30
66523.0      29
63166.0      29
61643.0      26
55437.0      24
40170.0      22
51194.0      22
61921.0      22
43528.0      19
30347.0      16
48190.0      16
42000.0      16
78355.0      15
61226.0      15
45279.0      14
77076.0      14
45791.0      14
52644.0      14
43443.0      13
50621.0      13
28024.0      13
89590.0      12
91181.0      11
           ... 
36150.0       1
83174.0       1
102297.0      1
34507.0       1
52000.0       1
80791.0       1
55939.0       1
59530.0       1
58040.0       1
44970.0       1
61589.0       1
40000.0       1
27102.0       1
57034.0       1
46675.0       1
32115.0       1
70565.0       1
49275.0       1
45906.0       1
49691.0       1
27414.0       1
126115.0      1
30493.0       1
53832.0       1
59613.0       1
46426.0       1
77000.0       1
60237.0       1
72932.0       1
31595.0       1
Name: BASE_SALARY, Lengt

In [26]:
((salary // 1000)*1000).value_counts()

66000.0     194
55000.0     106
61000.0      69
47000.0      65
81000.0      63
43000.0      57
62000.0      50
45000.0      48
30000.0      47
40000.0      46
26000.0      44
51000.0      44
60000.0      43
52000.0      41
70000.0      39
63000.0      38
35000.0      36
28000.0      35
32000.0      35
31000.0      33
36000.0      32
33000.0      31
48000.0      31
34000.0      30
42000.0      30
37000.0      30
57000.0      28
39000.0      27
38000.0      26
50000.0      26
           ... 
102000.0      2
97000.0       2
94000.0       2
150000.0      2
130000.0      2
82000.0       2
117000.0      2
107000.0      2
124000.0      2
146000.0      1
178000.0      1
115000.0      1
142000.0      1
85000.0       1
180000.0      1
24000.0       1
113000.0      1
210000.0      1
165000.0      1
122000.0      1
275000.0      1
121000.0      1
163000.0      1
141000.0      1
125000.0      1
186000.0      1
114000.0      1
140000.0      1
128000.0      1
199000.0      1
Name: BASE_SALARY, Lengt

# Select a string (object) column
Strings don't work well with arithmetic operators or statistical summary methods. The most valuable method for a Series of strings is **`value_counts`**.
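One detail worth knowing is how `value_counts` treats missing values. The sketch below uses a small, hypothetical Series (`toy_dept` is not part of the employee data) to show that missing values are left out by default and only counted when `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical department names with one missing value
toy_dept = pd.Series(['Police', 'Fire', 'Police', None, 'Police', 'Fire'])

counts = toy_dept.value_counts()                    # missing value excluded
with_missing = toy_dept.value_counts(dropna=False)  # missing value counted
shares = toy_dept.value_counts(normalize=True)      # proportions of non-missing values

print(counts)
print(shares)
```

Because the default drops missing values, normalized counts add up to 1 even when a column has gaps.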

In [28]:
dept = employee['DEPARTMENT']

In [29]:
dept.value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Solid Waste Management             43
Library                            36
Fleet Management Department        36
Admn. & Regulatory Affairs         29
Municipal Courts Department        28
Human Resources Dept.              24
Houston Emergency Center (HEC)     23
Housing and Community Devp.        22
General Services Department        22
Legal Department                   17
Dept of Neighborhoods (DON)        17
City Council                       11
Finance                            10
Houston Information Tech Svcs       9
Planning & Development              7
Mayor's Office                      5
City Controller's Office            5
Convention and Entertainment        1
Name: DEPARTMENT, dtype: int64

In [30]:
dept.value_counts(normalize=True)

Houston Police Department-HPD     0.3190
Houston Fire Department (HFD)     0.1920
Public Works & Engineering-PWE    0.1715
Health & Human Services           0.0550
Houston Airport System (HAS)      0.0530
Parks & Recreation                0.0370
Solid Waste Management            0.0215
Library                           0.0180
Fleet Management Department       0.0180
Admn. & Regulatory Affairs        0.0145
Municipal Courts Department       0.0140
Human Resources Dept.             0.0120
Houston Emergency Center (HEC)    0.0115
Housing and Community Devp.       0.0110
General Services Department       0.0110
Legal Department                  0.0085
Dept of Neighborhoods (DON)       0.0085
City Council                      0.0055
Finance                           0.0050
Houston Information Tech Svcs     0.0045
Planning & Development            0.0035
Mayor's Office                    0.0025
City Controller's Office          0.0025
Convention and Entertainment      0.0005
Name: DEPARTMENT

# The `str` accessor
Pandas provides the `str` accessor, which exposes a few dozen string-only methods. Many of them mirror the methods of regular Python strings.
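As a rough illustration, the sketch below applies a few `str` methods to a made-up Series (`toy_titles` and its values are assumptions). Each method runs on every element, and missing values stay missing unless a fill value is requested, as with `na=False` in `contains`.

```python
import pandas as pd

# Hypothetical job titles with one missing value
toy_titles = pd.Series(['Police Officer', 'library assistant', None])

upper = toy_titles.str.upper()   # same idea as the Python str.upper method
lengths = toy_titles.str.len()   # number of characters per element

# na=False treats the missing value as "does not contain"
has_officer = toy_titles.str.contains('Officer', na=False)

print(upper)
print(has_officer)
```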

In [31]:
# a list of all the string only methods
print([method for method in dir(dept.str) if method[0] != '_'])

['capitalize', 'cat', 'center', 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall', 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric', 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize', 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'wrap', 'zfill']


In [32]:
dept.head()

0      Municipal Courts Department
1                          Library
2    Houston Police Department-HPD
3    Houston Fire Department (HFD)
4      General Services Department
Name: DEPARTMENT, dtype: object

In [33]:
# count the appearances of the letter 'a' in each department name
dept.str.count('a').head()

0    2
1    1
2    1
3    1
4    2
Name: DEPARTMENT, dtype: int64

In [34]:
# find all the departments with the word 'Services' in them
dept.str.contains('Services').head()

0    False
1    False
2    False
3    False
4     True
Name: DEPARTMENT, dtype: bool

In [36]:
# can do boolean selection with Series
has_services = dept.str.contains('Services')
dept[has_services].value_counts()

Health & Human Services        110
General Services Department     22
Name: DEPARTMENT, dtype: int64

# Boolean Arithmetic
Boolean values True/False are treated as 1/0 in arithmetic, so summing a boolean Series counts its True values and taking its mean gives the proportion of True values.
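A minimal sketch of the idea on made-up numbers (the Series `toy_salary` and the 100,000 threshold are assumptions for illustration):

```python
import pandas as pd

# Hypothetical salaries
toy_salary = pd.Series([45000, 120000, 98000, 150000])

# A comparison produces a boolean Series
over_100k = toy_salary > 100000

n_over = over_100k.sum()       # number of True values
share_over = over_100k.mean()  # fraction of True values

print(n_over, share_over)
```

The mean is simply the sum divided by the length, which is why it reads as a proportion.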

In [37]:
has_services.sum()

132

In [38]:
has_services.mean()

0.066

# Case Study: Do people with more experience make more money?
To answer this question, the number of years of experience needs to be calculated from the column **HIRE_DATE**. **datetime** columns can be subtracted from one another. We will use the date on which the data was generated, which was around December 2016.

In [39]:
pull_date = pd.Timestamp('2016-12-1')
pull_date

Timestamp('2016-12-01 00:00:00')

## Reading in Dates
It is possible to read in dates correctly using **`read_csv`**. Use a list of the column names you would like to be dates as the argument for the **`parse_dates`** parameter.
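To see the effect of **`parse_dates`** without the employee file, the sketch below reads a tiny in-memory CSV twice. The CSV text and the names `csv_text`, `as_text` and `as_dates` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV held in memory
csv_text = "NAME,HIRE_DATE\nAna,2006-06-12\nBen,2015-02-03\n"

# Without parse_dates the date column is read as plain strings
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the listed columns become datetime64
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['HIRE_DATE'])

print(as_text.dtypes)
print(as_dates.dtypes)
```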

In [40]:
employee = pd.read_csv('data/employee.csv', parse_dates=['HIRE_DATE', 'JOB_DATE'])
employee.dtypes

UNIQUE_ID                     int64
POSITION_TITLE               object
DEPARTMENT                   object
BASE_SALARY                 float64
RACE                         object
EMPLOYMENT_TYPE              object
GENDER                       object
EMPLOYMENT_STATUS            object
HIRE_DATE            datetime64[ns]
JOB_DATE             datetime64[ns]
dtype: object

In [41]:
# subtract the hire date from the pull date to get the number of days of experience
experience = pull_date - employee['HIRE_DATE']

# print out head of series
experience.head()

0    3825 days
1    5979 days
2     667 days
3   12715 days
4   10027 days
Name: HIRE_DATE, dtype: timedelta64[ns]

### Converting to years
Notice that the data type is now **timedelta64**, which represents a duration (displayed here in days). To convert it to years, divide by a `Timedelta` that represents one year. [See here for more detail](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html#frequency-conversion)
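As a quick sanity check of this kind of conversion, the sketch below rebuilds two of the hire dates by hand and divides the resulting timedeltas by a `Timedelta` of 365.2425 days. That figure is the average length of a Gregorian year and is an assumption about what "one year" should mean here; the variable names are made up for illustration.

```python
import pandas as pd

pull = pd.Timestamp('2016-12-01')

# Two hire dates taken from the first rows of the employee data
hired = pd.Series(pd.to_datetime(['2006-06-12', '2015-02-03']))

elapsed = pull - hired   # timedelta64 Series
days = elapsed.dt.days   # whole days as integers

# Assumption: one year is the average Gregorian year of 365.2425 days
years = elapsed / pd.Timedelta(days=365.2425)

print(days)
print(years)
```

Dividing one timedelta by another gives a plain float, so `years` can be used like any other numeric Series.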

In [42]:
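# convert to years; recent pandas versions no longer accept the 'Y' unit,
# so divide by the average Gregorian year length (365.2425 days) instead
years_experience = experience / pd.Timedelta(days=365.2425)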

# inspect and check that it makes sense
years_experience.head()

0    10.472494
1    16.369946
2     1.826184
3    34.812488
4    27.452994
Name: HIRE_DATE, dtype: float64

In [46]:
# Make a new column
employee['YEARS_EXPERIENCE'] = years_experience

In [47]:
employee.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,YEARS_EXPERIENCE
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,10.472494
1,364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,16.369946
2,1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,1.826184
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,34.812488
4,8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,27.452994


### Creating categories for years of experience
It's possible to divide a numerical column into categories based on its values. The pandas **cut** function accepts a Series or an array and a list of the edges of the **bins**. Each category can be given a **label** as well. A Series of **categorical** type, a data type specific to pandas, is returned. [More on categorical data](http://pandas.pydata.org/pandas-docs/stable/categorical.html)
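How values that fall exactly on a bin edge are handled is easy to get wrong, so here is a small sketch on made-up numbers (`toy_years` and `outside` are hypothetical). By default each bin includes its right edge and excludes its left edge, and values outside every bin become missing.

```python
import pandas as pd

bins = [0, 5, 15, 100]
labels = ['Novice', 'Experienced', 'Senior']

# Hypothetical years of experience, including values on the bin edges
toy_years = pd.Series([0.5, 5.0, 5.1, 15.0, 30.0])
levels = pd.cut(toy_years, bins=bins, labels=labels)

# 0.0 sits on the excluded left edge and 150.0 is above the last bin
outside = pd.cut(pd.Series([0.0, 150.0]), bins=bins, labels=labels)

print(levels)
print(outside)
```

With these bins, exactly 5 years counts as Novice and exactly 15 years counts as Experienced.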

In [48]:
# create Series of categorical data
exp_categories = pd.cut(years_experience, bins=[0, 5, 15, 100], labels=['Novice', 'Experienced', 'Senior'])

In [49]:
# inspect Series
exp_categories.head(10)

0    Experienced
1         Senior
2         Novice
3         Senior
4         Senior
5         Senior
6         Novice
7         Novice
8         Senior
9         Novice
Name: HIRE_DATE, dtype: category
Categories (3, object): [Novice < Experienced < Senior]

In [50]:
# get some summary statistics
exp_categories.value_counts()

Senior         806
Experienced    663
Novice         531
Name: HIRE_DATE, dtype: int64

In [53]:
# Create new column
employee['EXPERIENCE_LEVEL'] = exp_categories

In [54]:
employee.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,10.472494,Experienced
1,364,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,16.369946,Senior
2,1286,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,1.826184,Novice
3,8789,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,34.812488,Senior
4,8542,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,27.452994,Senior


### Now find the average salary for each experience level

In [None]:
# your code here

In [72]:
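# boolean masks, one per experience level
exp_lvl_sr = employee['EXPERIENCE_LEVEL'] == 'Senior'
exp_lvl_nv = employee['EXPERIENCE_LEVEL'] == 'Novice'
exp_lvl_exp = employee['EXPERIENCE_LEVEL'] == 'Experienced'

# select the rows for each level
senior = employee[exp_lvl_sr]
novice = employee[exp_lvl_nv]
experienced = employee[exp_lvl_exp]

# a notebook cell displays only its last expression, so only the
# 'Experienced' average appears below; wrap the others in print() to see them
senior['BASE_SALARY'].mean()
novice['BASE_SALARY'].mean()
experienced['BASE_SALARY'].mean()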


55264.92867981791

### Problem 1
<span  style="color:green; font-size:16px">Find the percentage of each race</span>

In [75]:
# your code here

employee['RACE'].value_counts(normalize=True)

Black or African American            0.356234
White                                0.338422
Hispanic/Latino                      0.244275
Asian/Pacific Islander               0.054453
American Indian or Alaskan Native    0.005598
Others                               0.001018
Name: RACE, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Sum the percentages found in problem 1. Do they add to 1?</span>

In [76]:
# your code here
employee['RACE'].value_counts(normalize=True).sum()

1.0

### Problem 3
<span  style="color:green; font-size:16px">How many people make over $100,000?</span>

In [79]:
# your code here
ppl_list = employee['BASE_SALARY'] > 100000
ppl_list.sum()

57

### Problem 4
<span  style="color:green; font-size:16px">What percentage of those who make over $100,000 are women? Compare this ratio to that of the whole population</span>

In [100]:
# your code here
ppl_gndr = (employee['BASE_SALARY'] > 100000) & (employee['GENDER'] == 'Female')
ppl_gndr.sum() / ppl_list.sum()

0.3684210526315789

### Problem 5
<span  style="color:green; font-size:16px">Find the count of each position title that has the word **POLICE** in it.</span>

In [104]:
# your code here
police=employee['POSITION_TITLE'].str.contains('POLICE')
pt=employee['POSITION_TITLE']
pt[police].value_counts()

SENIOR POLICE OFFICER             220
POLICE OFFICER                    184
POLICE SERGEANT                    98
POLICE LIEUTENANT                  18
POLICE OFFICER,PROBATIONARY        15
POLICE CAPTAIN                      6
SENIOR POLICE SERVICE OFFICER       5
POLICE TRAINEE                      5
POLICE TELECOMMUNICATOR             4
POLICE SERVICE OFFICER              4
SENIOR POLICE TELECOMMUNICATOR      3
Name: POSITION_TITLE, dtype: int64

# Solutions

### Problem 1
<span  style="color:green; font-size:16px">Find the percentage of each race</span>

In [89]:
employee['RACE'].value_counts(normalize=True)

Black or African American            0.356234
White                                0.338422
Hispanic/Latino                      0.244275
Asian/Pacific Islander               0.054453
American Indian or Alaskan Native    0.005598
Others                               0.001018
Name: RACE, dtype: float64

### Problem 2
<span  style="color:green; font-size:16px">Sum the percentages found in problem 1. Do they add to 1?</span>

In [90]:
employee['RACE'].value_counts(normalize=True).sum()

1.0

### Problem 3
<span  style="color:green; font-size:16px">How many people make over $100,000?</span>

In [94]:
gt_100 = employee['BASE_SALARY'] > 100000
gt_100.sum()

57

### Problem 4
<span  style="color:green; font-size:16px">What percentage of those who make over $100,000 are women?</span>

In [92]:
gt_100 = employee['BASE_SALARY'] > 100000
df_gt_100 = employee[gt_100]
df_gt_100.head()

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,YEARS_EXPERIENCE,EXPERIENCE_LEVEL
0,5906,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,10.472494,Experienced
8,8172,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,Public Works & Engineering-PWE,107962.0,White,Full Time,Male,Active,1993-11-15,2013-01-05,23.044963,Senior
11,3347,"CHIEF PHYSICIAN,MD",Health & Human Services,180416.0,Black or African American,Full Time,Male,Active,1987-05-22,1999-08-28,29.531065,Senior
43,3982,ASSOCIATE EMS PHYSICIAN DIRECTOR,Houston Fire Department (HFD),165216.0,Hispanic/Latino,Full Time,Male,Active,2013-08-31,2013-08-31,3.252634,Novice
66,7369,"PUBLIC HEALTH DENTIST,DDS",Health & Human Services,100791.0,White,Full Time,Female,Active,2015-12-28,2015-12-28,0.92815,Novice


In [93]:
df_gt_100['GENDER'].value_counts(normalize=True)

Male      0.631579
Female    0.368421
Name: GENDER, dtype: float64

In [None]:
employee['GENDER'].value_counts(normalize=True)

### Problem 5
<span  style="color:green; font-size:16px">Find the count of each position title that has the word **POLICE** in it.</span>

In [101]:
pt = employee['POSITION_TITLE']
has_police = pt.str.contains('POLICE')
pt[has_police].value_counts()

SENIOR POLICE OFFICER             220
POLICE OFFICER                    184
POLICE SERGEANT                    98
POLICE LIEUTENANT                  18
POLICE OFFICER,PROBATIONARY        15
POLICE CAPTAIN                      6
SENIOR POLICE SERVICE OFFICER       5
POLICE TRAINEE                      5
POLICE TELECOMMUNICATOR             4
POLICE SERVICE OFFICER              4
SENIOR POLICE TELECOMMUNICATOR      3
Name: POSITION_TITLE, dtype: int64