In [3]:
import pandas as pd

## Reading data from files

- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 

What to check when creating a DataFrame:
- is there a column that can serve as the index of the DataFrame?
- are there any columns that contain dates, so that we can parse them into a proper date type?
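The two checks above map onto the `index_col` and `parse_dates` parameters of `pd.read_csv`. A minimal sketch, assuming a small in-memory CSV (the `csv_text` content and the `sample` name are made up for illustration, so the example does not depend on a network connection):

```python
import io
import pandas as pd

# Hypothetical sample data in a semicolon-separated layout.
csv_text = """employee_id;first_name;salary;hire_date
100;Steven;24000;1997-06-17
101;Neena;17000;1999-09-21
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',
    index_col='employee_id',    # use this column as the row index
    parse_dates=['hire_date'],  # convert this column to a datetime type
)

print(sample.index.name)
print(sample.dtypes)
```

Without `parse_dates`, the `hire_date` column would be read as plain text.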

In [4]:
url = 'https://raw.githubusercontent.com/piotrgradzinski/dap_20230114/main/day_6_pgg/emps.csv'
emps = pd.read_csv(url, sep=';', encoding='utf-8', index_col='employee_id', parse_dates=['hire_date'])
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


## Data exploration

What we should explore:
- what data type does each column have, and is it the right one?
- do we have `null` or `NaN` values in any column? If so, we will have to decide how to deal with them.
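One possible way to answer the second question is to count missing values per column with `isna()`. The sketch below uses a made-up three-row frame (`sample`; the row with index 999 is entirely hypothetical), not the real `emps` data:

```python
import pandas as pd

# Hypothetical miniature of the employees table with some missing values.
sample = pd.DataFrame(
    {
        'first_name': ['Steven', 'Susan', 'Alex'],
        'city': ['Seattle', 'London', None],
        'postal_code': ['98199', None, None],
    },
    index=pd.Index([100, 203, 999], name='employee_id'),
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows that have at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```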

In [5]:
type(emps)

pandas.core.frame.DataFrame

In [6]:
emps.dtypes

first_name                 object
last_name                  object
job_title                  object
salary                      int64
hire_date          datetime64[ns]
department_name            object
address                    object
postal_code                object
city                       object
country                    object
dtype: object

In [7]:
emps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 100 to 206
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   first_name       107 non-null    object        
 1   last_name        107 non-null    object        
 2   job_title        107 non-null    object        
 3   salary           107 non-null    int64         
 4   hire_date        107 non-null    datetime64[ns]
 5   department_name  106 non-null    object        
 6   address          106 non-null    object        
 7   postal_code      105 non-null    object        
 8   city             106 non-null    object        
 9   country          106 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 9.2+ KB


In [8]:
# The 25th percentile (25%) means that 25% of salaries are less than or equal to 3,100.
emps.describe()

Unnamed: 0,salary
count,107.0
mean,6461.682243
std,3909.365746
min,2100.0
25%,3100.0
50%,6200.0
75%,8900.0
max,24000.0


In [9]:
emps.describe(include='all')


Unnamed: 0,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
count,107,107,107,107.0,107,106,106,105.0,106,106
unique,91,102,19,,99,11,7,6.0,7,4
top,Peter,King,Sales Representative,,2004-06-07 00:00:00,Shipping,2011 Interiors Blvd,99236.0,South San Francisco,United States of America
freq,3,2,30,,4,45,45,45.0,45,68
first,,,,,1987-09-17 00:00:00,,,,,
last,,,,,2011-02-06 00:00:00,,,,,
mean,,,,6461.682243,,,,,,
std,,,,3909.365746,,,,,,
min,,,,2100.0,,,,,,
25%,,,,3100.0,,,,,,


In [10]:
emps.columns

Index(['first_name', 'last_name', 'job_title', 'salary', 'hire_date',
       'department_name', 'address', 'postal_code', 'city', 'country'],
      dtype='object')

In [11]:
emps.shape  # how many rows (.shape[0]) and columns (.shape[1])

(107, 10)

In [12]:
len(emps), emps.size

(107, 1070)

## How we can access data in a DataFrame

In [13]:
# dictionary-style (bracket) notation to access a column in a DataFrame
emps['last_name']

employee_id
100       King
101    Kochhar
102    De Haan
103     Hunold
104      Ernst
        ...   
202        Fay
203     Mavris
204       Baer
205    Higgins
206      Gietz
Name: last_name, Length: 107, dtype: object

In [None]:
# object (dot) notation to access a column in a DataFrame
# this works only if the column name is a valid Python identifier (no spaces or special characters)
# and does not clash with an existing DataFrame attribute or method
emps.salary

In [16]:
type(emps.salary)

pandas.core.series.Series
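The limits of object (dot) notation can be illustrated with a small made-up frame. The column names `job title` and `count` below are hypothetical and chosen only to trigger the two problem cases: a name with a space, and a name that is also a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another one
# ('count') has the same name as a DataFrame method.
sample = pd.DataFrame(
    {
        'salary': [24000, 17000],
        'job title': ['President', 'Programmer'],
        'count': [1, 2],
    }
)

print(sample.salary.tolist())        # dot notation works for a simple name
print(sample['job title'].tolist())  # a name with a space needs brackets
print(callable(sample.count))        # dot notation finds the method, not the column
print(sample['count'].tolist())      # brackets always reach the column
```

Dictionary notation works in every case, which is why it is the safer default.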

With dictionary notation we get a few additional features. For example, we can access several columns at once.

In [17]:
emps[['first_name', 'last_name', 'salary']]  # providing a list of columns

employee_id,first_name,last_name,salary
100,Steven,King,24000
101,Neena,Kochhar,17000
102,Lex,De Haan,17000
103,Alexander,Hunold,9000
104,Bruce,Ernst,6000
...,...,...,...
202,Pat,Fay,6000
203,Susan,Mavris,6500
204,Hermann,Baer,10000
205,Shelley,Higgins,12000


In [18]:
emps.salary.mean()

6461.682242990654

## Accessing data using `loc` and `iloc`

We can use `loc` and `iloc` to access a portion of the data: a particular cell, several cells, or a whole row:
- [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) - label-based index (a "business index"); column names can be used as well
- [`iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) - integer-based positional index, starting at 0
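The difference between the two indexers is easiest to see when the row labels do not start at 0. A minimal sketch on a made-up three-row frame (`sample` is illustrative, not the real `emps` data):

```python
import pandas as pd

sample = pd.DataFrame(
    {'first_name': ['Steven', 'Neena', 'Lex'], 'salary': [24000, 17000, 17000]},
    index=pd.Index([100, 101, 102], name='employee_id'),
)

by_label = sample.loc[101, 'first_name']  # row labelled 101, column 'first_name'
by_position = sample.iloc[1, 0]           # second row, first column
print(by_label, by_position)

# The label 0 does not exist in this index, so loc raises KeyError,
# while iloc[0] simply returns the first row.
try:
    sample.loc[0]
except KeyError:
    print('no row labelled 0')
print(sample.iloc[0]['first_name'])
```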

## `iloc`

In [24]:
emps.iloc[0]

first_name                           Steven
last_name                              King
job_title                         President
salary                                24000
hire_date               1997-06-17 00:00:00
department_name                   Executive
address                     2004 Charade Rd
postal_code                           98199
city                                Seattle
country            United States of America
Name: 100, dtype: object

In [26]:
emps.iloc[0, 1]

'King'

With `iloc` we can use the same operations as when accessing list elements:
- negative indexes
- range of elements - `start:stop:step`

We can do that for both dimensions (rows and columns).

In [28]:
emps.iloc[0:5]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America


In [30]:
emps.iloc[0:10:2]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
106,Valli,Pataballa,Programmer,4800,2008-02-05,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America


In [32]:
emps.iloc[-1]

first_name                          William
last_name                             Gietz
job_title                 Public Accountant
salary                                 8300
hire_date               2004-06-07 00:00:00
department_name                  Accounting
address                     2004 Charade Rd
postal_code                           98199
city                                Seattle
country            United States of America
Name: 206, dtype: object

In [33]:
emps.iloc[0:5, 2:4]

employee_id,job_title,salary
100,President,24000
101,Administration Vice President,17000
102,Administration Vice President,17000
103,Programmer,9000
104,Programmer,6000


In [34]:
emps.iloc[:10]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
105,David,Austin,Programmer,4800,2007-06-25,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
106,Valli,Pataballa,Programmer,4800,2008-02-05,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
107,Diana,Lorentz,Programmer,4200,2009-02-07,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America
109,Daniel,Faviet,Accountant,9000,2004-08-16,Finance,2004 Charade Rd,98199,Seattle,United States of America


In [36]:
emps.iloc[100:]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
200,Jennifer,Whalen,Administration Assistant,4400,1987-09-17,Administration,2004 Charade Rd,98199,Seattle,United States of America
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America
206,William,Gietz,Public Accountant,8300,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


Instead of a single index or a range, we can provide a list of the indexes we want to get, for rows, columns, or both.

In [38]:
emps.iloc[0:5, [0, 3, -2]]

employee_id,first_name,salary,city
100,Steven,24000,Seattle
101,Neena,17000,Seattle
102,Lex,17000,Seattle
103,Alexander,9000,Southlake
104,Bruce,6000,Southlake


## `loc`

In [41]:
emps.loc[100]

first_name                           Steven
last_name                              King
job_title                         President
salary                                24000
hire_date               1997-06-17 00:00:00
department_name                   Executive
address                     2004 Charade Rd
postal_code                           98199
city                                Seattle
country            United States of America
Name: 100, dtype: object

In [43]:
emps.loc[100, 'first_name'], emps.loc[100, ['first_name', 'last_name']]

('Steven',
 first_name    Steven
 last_name       King
 Name: 100, dtype: object)

In [45]:
emps.loc[100:105, ['first_name', 'last_name']]

employee_id,first_name,last_name
100,Steven,King
101,Neena,Kochhar
102,Lex,De Haan
103,Alexander,Hunold
104,Bruce,Ernst
105,David,Austin


We can use ranges with `loc` as well, but unlike `iloc` ranges they are closed on both sides: the end label is included in the result.
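A short side-by-side comparison of the two slicing rules, sketched on a made-up four-row frame (`sample` is illustrative only):

```python
import pandas as pd

sample = pd.DataFrame(
    {'last_name': ['King', 'Kochhar', 'De Haan', 'Hunold']},
    index=pd.Index([100, 101, 102, 103], name='employee_id'),
)

positional = sample.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = sample.loc[100:102]  # labels 100, 101 and 102; the end label is included

print(len(positional), len(by_label))
```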

In [46]:
emps.loc[100:110:2, ['first_name', 'last_name']]

employee_id,first_name,last_name
100,Steven,King
102,Lex,De Haan
104,Bruce,Ernst
106,Valli,Pataballa
108,Nancy,Greenberg
110,John,Chen


Ranges can work on a column level as well.

In [49]:
emps.loc[100:110:2, 'first_name':'address']

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd
106,Valli,Pataballa,Programmer,4800,2008-02-05,IT,2014 Jabberwocky Rd
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd
110,John,Chen,Accountant,8200,2007-09-28,Finance,2004 Charade Rd


In [50]:
emps.loc[100:110:2, 'first_name':'address':2]

employee_id,first_name,job_title,hire_date,address
100,Steven,President,1997-06-17,2004 Charade Rd
102,Lex,Administration Vice President,2003-01-13,2004 Charade Rd
104,Bruce,Programmer,2001-05-21,2014 Jabberwocky Rd
106,Valli,Programmer,2008-02-05,2014 Jabberwocky Rd
108,Nancy,Finance Manager,2004-08-17,2004 Charade Rd
110,John,Accountant,2007-09-28,2004 Charade Rd


## How we can iterate through a DataFrame using a for loop

By default, a `for` loop over a DataFrame iterates through its column names. To iterate by rows, we can use the [`.iterrows()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html) method of the DataFrame.

This approach is a good idea when we want to present the data in a custom way, different from simply displaying the DataFrame (or a selection of it). We shouldn't use it for calculations, because a Python-level loop is much slower than the built-in Pandas (or NumPy) methods.
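To make the last point concrete, here is a rough sketch that computes the same total twice: once with an `.iterrows()` loop and once with a vectorized expression. The three-row `sample` frame and the "yearly salary" calculation are assumptions made for illustration; on a frame this small the speed difference is not noticeable, but it grows with the number of rows.

```python
import pandas as pd

sample = pd.DataFrame(
    {'first_name': ['Steven', 'Neena', 'Lex'], 'salary': [24000, 17000, 17000]},
    index=pd.Index([100, 101, 102], name='employee_id'),
)

# Row-by-row loop: fine for custom printing, slow for arithmetic on large frames.
total_loop = 0
for emp_id, emp in sample.iterrows():
    total_loop += emp['salary'] * 12

# Vectorized equivalent: one expression over the whole column.
total_vectorized = (sample['salary'] * 12).sum()

print(total_loop, total_vectorized)
```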

In [51]:
for column_name in emps:
    print(column_name)

first_name
last_name
job_title
salary
hire_date
department_name
address
postal_code
city
country


In [54]:
for emp_id, emp in emps.iterrows():
    print(emp_id, emp['first_name'], emp['last_name'])

100 Steven King
101 Neena Kochhar
102 Lex De Haan
103 Alexander Hunold
104 Bruce Ernst
105 David Austin
106 Valli Pataballa
107 Diana Lorentz
108 Nancy Greenberg
109 Daniel Faviet
110 John Chen
111 Ismael Sciarra
112 Jose Manuel Urman
113 Luis Popp
114 Den Raphaely
115 Alexander Khoo
116 Shelli Baida
117 Sigal Tobias
118 Guy Himuro
119 Karen Colmenares
120 Matthew Weiss
121 Adam Fripp
122 Payam Kaufling
123 Shanta Vollman
124 Kevin Mourgos
125 Julia Nayer
126 Irene Mikkilineni
127 James Landry
128 Steven Markle
129 Laura Bissot
130 Mozhe Atkinson
131 James Marlow
132 TJ Olson
133 Jason Mallin
134 Michael Rogers
135 Ki Gee
136 Hazel Philtanker
137 Renske Ladwig
138 Stephen Stiles
139 John Seo
140 Joshua Patel
141 Trenna Rajs
142 Curtis Davies
143 Randall Matos
144 Peter Vargas
145 John Russell
146 Karen Partners
147 Alberto Errazuriz
148 Gerald Cambrault
149 Eleni Zlotkey
150 Peter Tucker
151 David Bernstein
152 Peter Hall
153 Christopher Olsen
154 Nanette Cambrault
155 Oliver Tuvault
...

In [55]:
for emp_id, emp in emps.iterrows():
    print(emp_id, emp.first_name, emp.last_name)

100 Steven King
101 Neena Kochhar
102 Lex De Haan
103 Alexander Hunold
104 Bruce Ernst
105 David Austin
106 Valli Pataballa
107 Diana Lorentz
108 Nancy Greenberg
109 Daniel Faviet
110 John Chen
111 Ismael Sciarra
112 Jose Manuel Urman
113 Luis Popp
114 Den Raphaely
115 Alexander Khoo
116 Shelli Baida
117 Sigal Tobias
118 Guy Himuro
119 Karen Colmenares
120 Matthew Weiss
121 Adam Fripp
122 Payam Kaufling
123 Shanta Vollman
124 Kevin Mourgos
125 Julia Nayer
126 Irene Mikkilineni
127 James Landry
128 Steven Markle
129 Laura Bissot
130 Mozhe Atkinson
131 James Marlow
132 TJ Olson
133 Jason Mallin
134 Michael Rogers
135 Ki Gee
136 Hazel Philtanker
137 Renske Ladwig
138 Stephen Stiles
139 John Seo
140 Joshua Patel
141 Trenna Rajs
142 Curtis Davies
143 Randall Matos
144 Peter Vargas
145 John Russell
146 Karen Partners
147 Alberto Errazuriz
148 Gerald Cambrault
149 Eleni Zlotkey
150 Peter Tucker
151 David Bernstein
152 Peter Hall
153 Christopher Olsen
154 Nanette Cambrault
155 Oliver Tuvault
...

## Filtering and logical conditions

On a Series we can use comparison operators. They return a mask: a Series of `True`/`False` values saying whether each value fulfills the condition.

Once we have a mask, we can use it to filter rows of the DataFrame with the indexing operator.

In [57]:
emps.salary > 10_000

employee_id
100     True
101     True
102     True
103    False
104    False
       ...  
202    False
203    False
204    False
205     True
206    False
Name: salary, Length: 107, dtype: bool

In [58]:
emps[emps.salary > 10_000]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
146,Karen,Partners,Sales Manager,13500,2007-01-05,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
147,Alberto,Errazuriz,Sales Manager,12000,2007-03-10,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
148,Gerald,Cambrault,Sales Manager,11000,2009-10-15,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
149,Eleni,Zlotkey,Sales Manager,10500,2000-01-29,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom


In [59]:
emps[emps['city'] == 'Oxford']

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
146,Karen,Partners,Sales Manager,13500,2007-01-05,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
147,Alberto,Errazuriz,Sales Manager,12000,2007-03-10,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
148,Gerald,Cambrault,Sales Manager,11000,2009-10-15,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
149,Eleni,Zlotkey,Sales Manager,10500,2000-01-29,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
150,Peter,Tucker,Sales Representative,10000,2007-01-30,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
151,David,Bernstein,Sales Representative,9500,2007-03-24,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
152,Peter,Hall,Sales Representative,9000,2007-08-20,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
153,Christopher,Olsen,Sales Representative,8000,2008-03-30,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
154,Nanette,Cambrault,Sales Representative,7500,2008-12-09,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom


In [60]:
emps[emps['city'] == 'Oxford'].salary.mean()

8955.882352941177

In [61]:
emps[emps['city'] == 'London'].salary.mean()

6500.0

Here we first take the `salary` column and then filter it by city. This is possible because all the Series share the same index (`employee_id`).

In [62]:
emps.salary[emps.city == 'Oxford'].mean()

8955.882352941177

In [64]:
emps.salary[emps.city == 'London'].mean()

6500.0

If we want to combine several conditions, we can't use Python's `and` and `or` operators; they do not work element-wise on Pandas objects. We have to use the so-called bitwise operators `&` (for `and`) and `|` (for `or`) instead. **Because `&` and `|` take precedence over the standard comparison operators, each condition has to be wrapped in `()` for the statement to work.**
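A minimal sketch of combining masks, on a made-up four-row frame (`sample` is illustrative only). The `~` operator for negation is not covered in the text above; it is included here as an addition because it follows the same rules as `&` and `|`.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        'city': ['Seattle', 'Oxford', 'Oxford', 'London'],
        'salary': [24000, 14000, 7500, 6500],
    },
    index=pd.Index([100, 145, 154, 203], name='employee_id'),
)

both = sample[(sample.city == 'Oxford') & (sample.salary >= 10_000)]    # and
either = sample[(sample.city == 'London') | (sample.salary >= 10_000)]  # or
negated = sample[~(sample.city == 'Oxford')]                            # not

# Python's `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess, so a ValueError is raised.
try:
    sample[(sample.city == 'Oxford') and (sample.salary >= 10_000)]
except ValueError as error:
    print('ValueError:', error)
```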

In [65]:
emps[(emps.city == 'Oxford') & (emps.salary >= 10_000)]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
146,Karen,Partners,Sales Manager,13500,2007-01-05,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
147,Alberto,Errazuriz,Sales Manager,12000,2007-03-10,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
148,Gerald,Cambrault,Sales Manager,11000,2009-10-15,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
149,Eleni,Zlotkey,Sales Manager,10500,2000-01-29,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
150,Peter,Tucker,Sales Representative,10000,2007-01-30,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
156,Janette,King,Sales Representative,10000,2006-01-30,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
162,Clara,Vishney,Sales Representative,10500,2007-11-11,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
168,Lisa,Ozer,Sales Representative,11500,2007-03-11,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
169,Harrison,Bloom,Sales Representative,10000,2008-03-23,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom


# Exercises

List employees from Seattle with the columns: first_name, last_name, salary and city.

In [70]:
# we select columns by their names (strings); writing emps.first_name instead would pass
# the whole column (a Series object) rather than a column name.
emps[emps.city == 'Seattle'][['first_name', 'last_name', 'salary', 'city']]

employee_id,first_name,last_name,salary,city
100,Steven,King,24000,Seattle
101,Neena,Kochhar,17000,Seattle
102,Lex,De Haan,17000,Seattle
108,Nancy,Greenberg,12000,Seattle
109,Daniel,Faviet,9000,Seattle
110,John,Chen,8200,Seattle
111,Ismael,Sciarra,7700,Seattle
112,Jose Manuel,Urman,7800,Seattle
113,Luis,Popp,6900,Seattle
114,Den,Raphaely,11000,Seattle


In [71]:
emps[emps.city == 'Seattle'].loc[:, ['first_name', 'last_name', 'salary', 'city']]

employee_id,first_name,last_name,salary,city
100,Steven,King,24000,Seattle
101,Neena,Kochhar,17000,Seattle
102,Lex,De Haan,17000,Seattle
108,Nancy,Greenberg,12000,Seattle
109,Daniel,Faviet,9000,Seattle
110,John,Chen,8200,Seattle
111,Ismael,Sciarra,7700,Seattle
112,Jose Manuel,Urman,7800,Seattle
113,Luis,Popp,6900,Seattle
114,Den,Raphaely,11000,Seattle


One interesting feature of `loc` is that we can provide a mask/condition for one of the dimensions.

In [73]:
emps.loc[emps.city == 'Seattle', ['first_name', 'last_name', 'salary', 'city']]

employee_id,first_name,last_name,salary,city
100,Steven,King,24000,Seattle
101,Neena,Kochhar,17000,Seattle
102,Lex,De Haan,17000,Seattle
108,Nancy,Greenberg,12000,Seattle
109,Daniel,Faviet,9000,Seattle
110,John,Chen,8200,Seattle
111,Ismael,Sciarra,7700,Seattle
112,Jose Manuel,Urman,7800,Seattle
113,Luis,Popp,6900,Seattle
114,Den,Raphaely,11000,Seattle


Using `iloc`, get the first 10 employees from Oxford.
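One possible approach, sketched on a made-up four-row frame so that it runs on its own: filter with a mask first, then take rows by position. Because `sample` has only three Oxford rows, the sketch takes the first two; with the full `emps` frame the same pattern would presumably be `emps[emps.city == 'Oxford'].iloc[:10]`.

```python
import pandas as pd

# Hypothetical miniature of the employees table.
sample = pd.DataFrame(
    {
        'first_name': ['Steven', 'John', 'Karen', 'Alberto'],
        'city': ['Seattle', 'Oxford', 'Oxford', 'Oxford'],
    },
    index=pd.Index([100, 145, 146, 147], name='employee_id'),
)

# Filter with a boolean mask first, then take rows by position.
first_two_from_oxford = sample[sample.city == 'Oxford'].iloc[:2]
print(first_two_from_oxford)
```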