In [1]:
import pandas as pd

## Reading data from files

- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 

Things to check when creating a DataFrame:
- Is there a column that can serve as the index of the DataFrame?
- Are there columns containing dates, so that we can parse them into a proper date type?
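The two questions above map to the `index_col` and `parse_dates` parameters of `pd.read_csv`. Here is a minimal sketch on a made-up two-row sample (the `csv_text` content is an assumption that imitates the layout of `emps.csv`; it is not the real file):

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample, not the real emps.csv.
csv_text = """employee_id;first_name;hire_date
100;Steven;1997-06-17
101;Neena;1999-09-21
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',
    index_col='employee_id',    # use this column as the row labels
    parse_dates=['hire_date'],  # convert the strings to a datetime type
)

print(sample.index.name)
print(sample.dtypes)
```

Without `parse_dates`, the `hire_date` column would be read as plain strings and the `.dt` accessor would not be available on it.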

In [9]:
url = 'https://raw.githubusercontent.com/piotrgradzinski/dap_20230114/main/day_6_pgg/emps.csv'
emps = pd.read_csv(url, sep=';', encoding='utf-8', index_col='employee_id', parse_dates=['hire_date'])
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


## Data exploration

What we should explore:
- What data type does each column have, and is it the right one?
- Are there `null`/`NaN` values in any column? If so, we will probably have to handle them somehow.
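One possible way to answer the second question is to count missing values per column with `.isna().sum()`. The sketch below uses a small made-up DataFrame (`sample` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values.
sample = pd.DataFrame({
    'city': ['Seattle', np.nan, 'London'],
    'postal_code': ['98199', np.nan, np.nan],
    'salary': [24000, 17000, 6500],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)
```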

In [7]:
type(emps)

pandas.core.frame.DataFrame

In [11]:
emps.dtypes

first_name                 object
last_name                  object
job_title                  object
salary                      int64
hire_date          datetime64[ns]
department_name            object
address                    object
postal_code                object
city                       object
country                    object
dtype: object

In [12]:
emps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 100 to 206
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   first_name       107 non-null    object        
 1   last_name        107 non-null    object        
 2   job_title        107 non-null    object        
 3   salary           107 non-null    int64         
 4   hire_date        107 non-null    datetime64[ns]
 5   department_name  106 non-null    object        
 6   address          106 non-null    object        
 7   postal_code      105 non-null    object        
 8   city             106 non-null    object        
 9   country          106 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 9.2+ KB


In [15]:
# The 25th percentile (25%) means that 25% of salaries are less than or equal to 3,100.
emps.describe()

,salary
count,107.0
mean,6461.682243
std,3909.365746
min,2100.0
25%,3100.0
50%,6200.0
75%,8900.0
max,24000.0
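The percentile rows of `.describe()` can also be computed directly with `.quantile()`. A small sketch on made-up salaries (the values are an assumption chosen so the quartiles fall exactly on data points):

```python
import pandas as pd

# Hypothetical salaries, already sorted for readability.
salaries = pd.Series([2100, 3100, 6200, 8900, 24000])

q1 = salaries.quantile(0.25)      # same value as the "25%" row of describe()
median = salaries.quantile(0.50)  # same value as the "50%" row

print(q1, median)
print(salaries.describe()['25%'])
```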


In [16]:
emps.describe(include='all')


,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
count,107,107,107,107.0,107,106,106,105.0,106,106
unique,91,102,19,,99,11,7,6.0,7,4
top,Peter,King,Sales Representative,,2004-06-07 00:00:00,Shipping,2011 Interiors Blvd,99236.0,South San Francisco,United States of America
freq,3,2,30,,4,45,45,45.0,45,68
first,,,,,1987-09-17 00:00:00,,,,,
last,,,,,2011-02-06 00:00:00,,,,,
mean,,,,6461.682243,,,,,,
std,,,,3909.365746,,,,,,
min,,,,2100.0,,,,,,
25%,,,,3100.0,,,,,,


In [17]:
emps.columns

Index(['first_name', 'last_name', 'job_title', 'salary', 'hire_date',
       'department_name', 'address', 'postal_code', 'city', 'country'],
      dtype='object')

In [18]:
emps.shape  # how many rows (.shape[0]) and columns (.shape[1])

(107, 10)

In [20]:
len(emps), emps.size

(107, 1070)
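The relationship between `.shape`, `len()` and `.size` can be checked on a tiny example (the DataFrame `df` is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = df.shape  # (number of rows, number of columns)
print(rows, cols)
print(len(df))         # number of rows only
print(df.size)         # total number of cells: rows * cols
```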

## Accessing data

### Columns

There are two ways to access a column in a DataFrame:
- "dictionary notation", using the indexing operator
- "object notation", using attribute access

In [22]:
emps['last_name']  # "dictionary notation" or using indexing operator

employee_id
100       King
101    Kochhar
102    De Haan
103     Hunold
104      Ernst
        ...   
202        Fay
203     Mavris
204       Baer
205    Higgins
206      Gietz
Name: last_name, Length: 107, dtype: object

In [24]:
emps.last_name  # "object notation", this will work if the column name does not contain spaces or any other special characters

employee_id
100       King
101    Kochhar
102    De Haan
103     Hunold
104      Ernst
        ...   
202        Fay
203     Mavris
204       Baer
205    Higgins
206      Gietz
Name: last_name, Length: 107, dtype: object
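To illustrate the limits of "object notation", here is a sketch with two awkward column names (both names are assumptions picked for the demonstration): one contains a space, the other collides with an existing DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({'last name': ['King', 'Kochhar'], 'count': [1, 2]})

# Dictionary notation works for any column name.
by_key = df['last name']
print(by_key.tolist())

# Object notation cannot express a name with a space at all, and
# df.count is the DataFrame.count method, not the 'count' column.
print(callable(df.count))
print(df['count'].tolist())
```

For this reason dictionary notation is the safer default, and it is the only option when creating a new column.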

Accessing a single column returns a `pandas.core.series.Series` object.

In [25]:
type(emps.last_name)

pandas.core.series.Series

In [27]:
emps.salary.mean(), emps.salary.min(), emps.salary.max() 

(6461.682242990654, 2100, 24000)

Quite often we want to take several columns from the DataFrame. To do that, we pass a list of column names to the indexing operator.

In [28]:
emps[['first_name', 'last_name', 'salary']]

employee_id,first_name,last_name,salary
100,Steven,King,24000
101,Neena,Kochhar,17000
102,Lex,De Haan,17000
103,Alexander,Hunold,9000
104,Bruce,Ernst,6000
...,...,...,...
202,Pat,Fay,6000
203,Susan,Mavris,6500
204,Hermann,Baer,10000
205,Shelley,Higgins,12000
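Note that a list of names always returns a DataFrame, even when the list has a single element, while a bare name returns a Series. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Steven', 'Neena'],
    'salary': [24000, 17000],
})

as_series = df['salary']   # single name: Series
as_frame = df[['salary']]  # list with one name: DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__, as_frame.shape)
```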


In [31]:
emps.job_title

employee_id
100                          President
101      Administration Vice President
102      Administration Vice President
103                         Programmer
104                         Programmer
                    ...               
202           Marketing Representative
203     Human Resources Representative
204    Public Relations Representative
205                 Accounting Manager
206                  Public Accountant
Name: job_title, Length: 107, dtype: object

In [33]:
emps.job_title.unique()

array(['President', 'Administration Vice President', 'Programmer',
       'Finance Manager', 'Accountant', 'Purchasing Manager',
       'Purchasing Clerk', 'Stock Manager', 'Stock Clerk',
       'Sales Manager', 'Sales Representative', 'Shipping Clerk',
       'Administration Assistant', 'Marketing Manager',
       'Marketing Representative', 'Human Resources Representative',
       'Public Relations Representative', 'Accounting Manager',
       'Public Accountant'], dtype=object)

In [35]:
emps.job_title.nunique()

19

### Rows

A DataFrame is composed of Series objects that share the same index.

We have two indexers for selecting rows:
- `loc` - label-based indexing. It uses the "business index" that we can set when creating the DataFrame; the labels can be numbers or strings. [Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
- `iloc` - integer-based indexing. It selects by position, starting at 0. [Docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)
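The difference is easiest to see when the labels do not start at 0. The sketch below uses a made-up four-row DataFrame; it also shows that a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like a Python list.

```python
import pandas as pd

df = pd.DataFrame(
    {'last_name': ['King', 'Kochhar', 'De Haan', 'Hunold']},
    index=pd.Index([100, 101, 102, 103], name='employee_id'),
)

by_label = df.loc[100, 'last_name']  # row labelled 100
by_position = df.iloc[0, 0]          # first row, first column

label_slice = df.loc[100:102]        # labels 100, 101 and 102 (end included)
position_slice = df.iloc[0:2]        # positions 0 and 1 (end excluded)

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```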

#### `iloc`

In [38]:
emps.iloc[0], type(emps.iloc[0])

(first_name                           Steven
 last_name                              King
 job_title                         President
 salary                                24000
 hire_date               1997-06-17 00:00:00
 department_name                   Executive
 address                     2004 Charade Rd
 postal_code                           98199
 city                                Seattle
 country            United States of America
 Name: 100, dtype: object,
 pandas.core.series.Series)

In [40]:
emps.iloc[0, 1]

'King'

We can use the slices we know from Python and NumPy.

In [43]:
emps.iloc[10:20:2, 0:3]

employee_id,first_name,last_name,job_title
110,John,Chen,Accountant
112,Jose Manuel,Urman,Accountant
114,Den,Raphaely,Purchasing Manager
116,Shelli,Baida,Purchasing Clerk
118,Guy,Himuro,Purchasing Clerk


In [46]:
emps.iloc[-1]  # negative indexes work exactly the same way as for lists

first_name                          William
last_name                             Gietz
job_title                 Public Accountant
salary                                 8300
hire_date               2004-06-07 00:00:00
department_name                  Accounting
address                     2004 Charade Rd
postal_code                           98199
city                                Seattle
country            United States of America
Name: 206, dtype: object

In [49]:
emps.iloc[-5:]  # last 5 rows; the slice syntax with ":" works exactly the same as the indexing operator for lists

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America
206,William,Gietz,Public Accountant,8300,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


In addition to the standard slicing syntax, we can pass lists of the row and/or column positions we want to get.

In [51]:
emps.iloc[-5:, [0, 1, 3]]

employee_id,first_name,last_name,salary
202,Pat,Fay,6000
203,Susan,Mavris,6500
204,Hermann,Baer,10000
205,Shelley,Higgins,12000
206,William,Gietz,8300


#### `loc`

Label-based indexing: for rows we use values from the column we set as the index, and for columns we use column names.

In [52]:
emps.loc[100, 'last_name']

'King'

In [53]:
emps.loc[100]

first_name                           Steven
last_name                              King
job_title                         President
salary                                24000
hire_date               1997-06-17 00:00:00
department_name                   Executive
address                     2004 Charade Rd
postal_code                           98199
city                                Seattle
country            United States of America
Name: 100, dtype: object

In [56]:
emps.loc[:, 'last_name']

employee_id
100       King
101    Kochhar
102    De Haan
103     Hunold
104      Ernst
        ...   
202        Fay
203     Mavris
204       Baer
205    Higgins
206      Gietz
Name: last_name, Length: 107, dtype: object

In [57]:
emps.loc[:, ['first_name', 'last_name']]

employee_id,first_name,last_name
100,Steven,King
101,Neena,Kochhar
102,Lex,De Haan
103,Alexander,Hunold
104,Bruce,Ernst
...,...,...
202,Pat,Fay
203,Susan,Mavris
204,Hermann,Baer
205,Shelley,Higgins


In [60]:
emps.loc[[100, 101], 'last_name']

employee_id
100       King
101    Kochhar
Name: last_name, dtype: object

In [61]:
emps.loc[[100, 101], ['first_name', 'last_name']]

employee_id,first_name,last_name
100,Steven,King
101,Neena,Kochhar


## Iterating through DataFrame

The following approach is possible, but we shouldn't use it: it's not a best practice and it's not convenient.

In [65]:
for column in emps:  # iterating over column names
    for val in emps[column]:
        print(column, val)

first_name Steven
first_name Neena
first_name Lex
first_name Alexander
first_name Bruce
first_name David
first_name Valli
first_name Diana
first_name Nancy
first_name Daniel
first_name John
first_name Ismael
first_name Jose Manuel
first_name Luis
first_name Den
first_name Alexander
first_name Shelli
first_name Sigal
first_name Guy
first_name Karen
first_name Matthew
first_name Adam
first_name Payam
first_name Shanta
first_name Kevin
first_name Julia
first_name Irene
first_name James
first_name Steven
first_name Laura
first_name Mozhe
first_name James
first_name TJ
first_name Jason
first_name Michael
first_name Ki
first_name Hazel
first_name Renske
first_name Stephen
first_name John
first_name Joshua
first_name Trenna
first_name Curtis
first_name Randall
first_name Peter
first_name John
first_name Karen
first_name Alberto
first_name Gerald
first_name Eleni
first_name Peter
first_name David
first_name Peter
first_name Christopher
first_name Nanette
first_name Oliver
first_name Janette
fi

A much better (and more convenient) approach is to use the [`.iterrows()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html) method.

In [67]:
for row_index, row_data in emps.iterrows():
    print(row_index, row_data['first_name'], row_data['last_name'], row_data['salary'])

100 Steven King 24000
101 Neena Kochhar 17000
102 Lex De Haan 17000
103 Alexander Hunold 9000
104 Bruce Ernst 6000
105 David Austin 4800
106 Valli Pataballa 4800
107 Diana Lorentz 4200
108 Nancy Greenberg 12000
109 Daniel Faviet 9000
110 John Chen 8200
111 Ismael Sciarra 7700
112 Jose Manuel Urman 7800
113 Luis Popp 6900
114 Den Raphaely 11000
115 Alexander Khoo 3100
116 Shelli Baida 2900
117 Sigal Tobias 2800
118 Guy Himuro 2600
119 Karen Colmenares 2500
120 Matthew Weiss 8000
121 Adam Fripp 8200
122 Payam Kaufling 7900
123 Shanta Vollman 6500
124 Kevin Mourgos 5800
125 Julia Nayer 3200
126 Irene Mikkilineni 2700
127 James Landry 2400
128 Steven Markle 2200
129 Laura Bissot 3300
130 Mozhe Atkinson 2800
131 James Marlow 2500
132 TJ Olson 2100
133 Jason Mallin 3300
134 Michael Rogers 2900
135 Ki Gee 2400
136 Hazel Philtanker 2200
137 Renske Ladwig 3600
138 Stephen Stiles 3200
139 John Seo 2700
140 Joshua Patel 2500
141 Trenna Rajs 3500
142 Curtis Davies 3100
143 Randall Matos 2600
144 P
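As a possible alternative to `.iterrows()`, the pandas documentation points to `.itertuples()`, which yields named tuples and is generally faster. When the goal is to compute something per row, a column-wise (vectorized) expression often avoids the loop entirely. A sketch on made-up data (the `full_names` column is an illustrative assumption):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'first_name': ['Steven', 'Neena'],
        'last_name': ['King', 'Kochhar'],
        'salary': [24000, 17000],
    },
    index=pd.Index([100, 101], name='employee_id'),
)

# Row-by-row iteration; the index label is available as row.Index.
for row in df.itertuples():
    print(row.Index, row.first_name, row.last_name, row.salary)

# Vectorized alternative: operate on whole columns, no explicit loop.
full_names = df['first_name'] + ' ' + df['last_name']
print(full_names.tolist())
```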

## Filtering

In [68]:
emps.salary > 10_000  # this expression, as with NumPy, returns a mask which we can apply to a DataFrame

employee_id
100     True
101     True
102     True
103    False
104    False
       ...  
202    False
203    False
204    False
205     True
206    False
Name: salary, Length: 107, dtype: bool

In [69]:
emps[emps.salary > 10_000]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
146,Karen,Partners,Sales Manager,13500,2007-01-05,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
147,Alberto,Errazuriz,Sales Manager,12000,2007-03-10,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
148,Gerald,Cambrault,Sales Manager,11000,2009-10-15,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
149,Eleni,Zlotkey,Sales Manager,10500,2000-01-29,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom


In [70]:
emps[emps.city == 'Seattle']

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America
109,Daniel,Faviet,Accountant,9000,2004-08-16,Finance,2004 Charade Rd,98199,Seattle,United States of America
110,John,Chen,Accountant,8200,2007-09-28,Finance,2004 Charade Rd,98199,Seattle,United States of America
111,Ismael,Sciarra,Accountant,7700,2007-09-30,Finance,2004 Charade Rd,98199,Seattle,United States of America
112,Jose Manuel,Urman,Accountant,7800,1998-03-07,Finance,2004 Charade Rd,98199,Seattle,United States of America
113,Luis,Popp,Accountant,6900,2009-12-07,Finance,2004 Charade Rd,98199,Seattle,United States of America
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199,Seattle,United States of America


We can combine several conditions when filtering (as in NumPy). Python's `and` and `or` keywords do not work element-wise on a Series, so, as with NumPy, we use these operators instead:
- `and` -> `&`
- `or` -> `|`

Also, each condition has to be wrapped in `()`, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.

In [71]:
emps[(emps.city == 'Seattle') & (emps.job_title == 'Purchasing Clerk')]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
115,Alexander,Khoo,Purchasing Clerk,3100,2005-05-18,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
116,Shelli,Baida,Purchasing Clerk,2900,2007-12-24,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
117,Sigal,Tobias,Purchasing Clerk,2800,2007-07-24,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
118,Guy,Himuro,Purchasing Clerk,2600,2008-11-15,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
119,Karen,Colmenares,Purchasing Clerk,2500,2009-08-10,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
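The combinations can be sketched on a small made-up DataFrame. Besides `&` and `|`, the example uses `~` for negation and `.isin()` as a shorthand for several `|`-joined equality checks; neither appears in the cells above, so treat them as optional extras.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Oxford', 'Toronto'],
    'salary': [24000, 2900, 14000, 6000],
})

# Both conditions must hold.
seattle_high = df[(df.city == 'Seattle') & (df.salary > 10_000)]

# At least one condition must hold.
seattle_or_high = df[(df.city == 'Seattle') | (df.salary > 10_000)]

# Negation of a condition.
not_seattle = df[~(df.city == 'Seattle')]

# Membership in a list of values.
in_cities = df[df.city.isin(['Oxford', 'Toronto'])]

print(len(seattle_high), len(seattle_or_high), len(not_seattle), len(in_cities))
```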


## Data modification

For example:
- changing existing data, like particular cells or whole columns
- adding new columns based on the data we have

In [73]:
emps['salary'] * 12

employee_id
100    288000
101    204000
102    204000
103    108000
104     72000
        ...  
202     72000
203     78000
204    120000
205    144000
206     99600
Name: salary, Length: 107, dtype: int64

In [77]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
emps['salary_per_year'] = emps['salary'] * 12
emps.sort_values('salary_per_year', ascending=False)

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America,288000
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom,168000
146,Karen,Partners,Sales Manager,13500,2007-01-05,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom,162000
...,...,...,...,...,...,...,...,...,...,...,...
127,James,Landry,Stock Clerk,2400,2009-01-14,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,28800
135,Ki,Gee,Stock Clerk,2400,2009-12-12,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,28800
128,Steven,Markle,Stock Clerk,2200,2010-03-08,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,26400
136,Hazel,Philtanker,Stock Clerk,2200,2011-02-06,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,26400


In [78]:
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America,288000
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,108000
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,72000
...,...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada,72000
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom,78000
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany,120000
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America,144000


We can modify particular rows or cells using the `iloc` and `loc` indexers.

In [82]:
emps.iloc[0, 0] = 'John'
emps.loc[100, 'last_name'] = 'Doe'

In [83]:
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
100,John,Doe,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America,288000
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,108000
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,72000
...,...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada,72000
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom,78000
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany,120000
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America,144000
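`loc` also accepts a boolean mask as the row selector, which allows changing many cells at once. The sketch below, on made-up data, raises every salary under 3000 to 3000 (the threshold is an arbitrary assumption for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {'last_name': ['King', 'Baida', 'Himuro'], 'salary': [24000, 2900, 2600]},
    index=pd.Index([100, 116, 118], name='employee_id'),
)

# Select rows with the mask and the column by name, then assign in one step.
df.loc[df['salary'] < 3000, 'salary'] = 3000

print(df['salary'].tolist())
```

Doing the selection and the assignment in a single `loc` call matters: a chained form such as `df[mask]['salary'] = ...` may assign to a temporary copy and leave `df` unchanged.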


To remove a column from a DataFrame we can use the [`.drop()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method. By default, `.drop()` returns a copy of the DataFrame without the given column or columns and leaves the original untouched. To modify the original DataFrame, use `.drop(..., inplace=True)`.

In [85]:
emps.drop(columns=['salary_per_year'])

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,John,Doe,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


In [86]:
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
100,John,Doe,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America,288000
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,108000
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,72000
...,...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada,72000
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom,78000
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany,120000
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America,144000
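The output above still contains `salary_per_year` because the result of `.drop()` was not stored. A common alternative to `inplace=True` is to assign the result back to the same name, sketched here on a one-row made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'salary': [24000], 'salary_per_year': [288000]})

# drop() returns a new DataFrame; df itself is not changed.
without_column = df.drop(columns=['salary_per_year'])
still_there = 'salary_per_year' in df.columns
print(still_there)

# Reassigning makes the change stick.
df = df.drop(columns=['salary_per_year'])
print(list(df.columns))
```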


### Accessors

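Accessors such as `.str` and `.dt` expose string and datetime methods on a whole Series. Example docs: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.upper.html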

In [88]:
emps.city.unique()

array(['Seattle', 'Southlake', 'South San Francisco', 'Oxford', nan,
       'Toronto', 'London', 'Munich'], dtype=object)

In [90]:
emps.city.str.upper()

employee_id
100      SEATTLE
101      SEATTLE
102      SEATTLE
103    SOUTHLAKE
104    SOUTHLAKE
         ...    
202      TORONTO
203       LONDON
204       MUNICH
205      SEATTLE
206      SEATTLE
Name: city, Length: 107, dtype: object

In [91]:
emps.city.str.lower()

employee_id
100      seattle
101      seattle
102      seattle
103    southlake
104    southlake
         ...    
202      toronto
203       london
204       munich
205      seattle
206      seattle
Name: city, Length: 107, dtype: object

In [95]:
emps.hire_date.dt.year

employee_id
100    1997
101    1999
102    2003
103    2000
104    2001
       ... 
202    2007
203    2004
204    2004
205    2004
206    2004
Name: hire_date, Length: 107, dtype: int64

In [97]:
emps[emps.hire_date.dt.year.isin([2004])]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199.0,Seattle,United States of America,144000
109,Daniel,Faviet,Accountant,9000,2004-08-16,Finance,2004 Charade Rd,98199.0,Seattle,United States of America,108000
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199.0,Seattle,United States of America,132000
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom,78000
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925.0,Munich,Germany,120000
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199.0,Seattle,United States of America,144000
206,William,Gietz,Public Accountant,8300,2004-06-07,Accounting,2004 Charade Rd,98199.0,Seattle,United States of America,99600


In [98]:
emps[
    (emps.hire_date.dt.year >= 2000) & (emps.hire_date.dt.year <= 2005)
]

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country,salary_per_year
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America,204000
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,108000
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America,72000
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America,144000
109,Daniel,Faviet,Accountant,9000,2004-08-16,Finance,2004 Charade Rd,98199,Seattle,United States of America,108000
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199,Seattle,United States of America,132000
115,Alexander,Khoo,Purchasing Clerk,3100,2005-05-18,Purchasing,2004 Charade Rd,98199,Seattle,United States of America,37200
122,Payam,Kaufling,Stock Manager,7900,2005-05-01,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,94800
137,Renske,Ladwig,Stock Clerk,3600,2005-07-14,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,43200
141,Trenna,Rajs,Stock Clerk,3500,2005-10-17,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America,42000
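Two details of the accessors are worth a short sketch on made-up data: `.str` methods pass missing values through as `NaN` instead of raising an error, and a year range like the one above can also be written with `Series.between()`, which includes both bounds by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city': ['Seattle', np.nan, 'Munich'],
    'hire_date': pd.to_datetime(['1997-06-17', '2004-06-07', '2005-05-18']),
})

# The missing city stays missing after the string operation.
upper = df.city.str.upper()
print(upper.tolist())

# Year of each hire date, then an inclusive range filter.
years = df.hire_date.dt.year
in_range = df[years.between(2000, 2005)]
print(years.tolist(), len(in_range))
```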
