# Additional practice

Use this notebook to read through and execute cells, testing what you've learnt in class with the other notebook and experimenting yourself with the data in the imported dataframes.

## Initial set up

Import relevant libraries:

In [65]:
import numpy as np
import pandas as pd
import os
import random

NOTE: The next three cells are only necessary in Colab.

Mount Drive content: 

In [66]:
#drive_loc = '/content/gdrive'
#files_loc = os.path.join(drive_loc, 'MyDrive', 'pdsfiles')

#from google.colab import drive
#drive.mount(drive_loc)

In [67]:
#!mkdir -p {files_loc}

# Using `iloc` and `loc` to select rows and columns in Pandas DataFrames

Remember, `ix` was deprecated in Pandas 0.20 and removed in Pandas 1.0, so we'll be using `loc` and `iloc`.

In [68]:
#!wget https://bit.ly/ks-pds-csv4 -O {files_loc}/uk_data.csv
#contents = !ls {files_loc}/*uk_data*
uk_data_file = 'uk_data.csv'

In [69]:
# read the data from a CSV file.
data = pd.read_csv(uk_data_file)
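# set a numeric id for use as an index for examples.
# NOTE: np.random.seed does not seed Python's `random` module used below,
# so these ids change on every run (and may contain duplicates).
np.random.seed(0)
data['id'] = [random.randint(0, 1000) for _ in range(data.shape[0])]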
 
data.head(5)

Unnamed: 0,first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Aleshia,Tomkiewicz,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,436
1,Evan,Zigomalas,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,973
2,France,Andrade,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,549
3,Ulysses,Mcwalters,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,755
4,Tyisha,Veness,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,352
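
If reproducible ids are wanted, the generator that actually produces them has to be seeded. The following is a minimal sketch of two options, assuming a hypothetical row count `n_rows` in place of `data.shape[0]`; neither is what produced the ids shown in this notebook.

```python
import random
import numpy as np

n_rows = 5  # hypothetical; stands in for data.shape[0]

# Option 1: seed the standard-library generator that random.randint uses.
random.seed(0)
ids_a = [random.randint(0, 1000) for _ in range(n_rows)]
random.seed(0)
ids_b = [random.randint(0, 1000) for _ in range(n_rows)]
print(ids_a == ids_b)  # same seed, same sequence

# Option 2: use a seeded NumPy generator (the upper bound is exclusive).
rng = np.random.default_rng(0)
ids_np = rng.integers(0, 1001, size=n_rows)
print(ids_np)
```

Either option makes the `id` column repeatable between runs; the values can still repeat within a run, since they are drawn with replacement.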


In [70]:
data.shape

(500, 12)

In [71]:
data.size

6000

In [72]:
len(data)

500

In [73]:
len(data.columns)

12

## Using `iloc`

Let's do single selections using iloc for dataframes, starting with the rows.

This selects the first row of the data frame (note the Series data type output):

In [74]:
data.iloc[0]

first_name                                   Aleshia
last_name                                 Tomkiewicz
company_name                 Alan D Rosenburg Cpa Pc
address                                 14 Taylor St
city                               St. Stephens Ward
county                                          Kent
postal                                       CT2 7PP
phone1                                  01835-703597
phone2                                  01944-369967
email                        atomkiewicz@hotmail.com
web             http://www.alandrosenburgcpapc.co.uk
id                                               436
Name: 0, dtype: object

Let's now select the second row of the data frame:

In [75]:
data.iloc[1]

first_name                                   Evan
last_name                               Zigomalas
company_name                   Cap Gemini America
address                               5 Binney St
city                                   Abbey Ward
county                            Buckinghamshire
postal                                   HP11 2AX
phone1                               01937-864715
phone2                               01714-737668
email                    evan.zigomalas@gmail.com
web             http://www.capgeminiamerica.co.uk
id                                            973
Name: 1, dtype: object

We can do the last row as well, using the familiar Python syntax for it:

In [76]:
data.iloc[-1]

first_name                                               Mi
last_name                                            Richan
company_name                 Nelson Wright Haworth Golf Crs
address                                     6 Norwood Grove
city                                      Tanworth-in-Arden
county                                         Warwickshire
postal                                              B94 5RZ
phone1                                         01451-785624
phone2                                         01202-738406
email                                        mi@hotmail.com
web             http://www.nelsonwrighthaworthgolfcrs.co.uk
id                                                      755
Name: 499, dtype: object

Let's do the same by columns:

In [77]:
data.iloc[:,0] # first column of data frame (first_name)

0          Aleshia
1             Evan
2           France
3          Ulysses
4           Tyisha
5             Eric
6             Marg
7          Laquita
8             Lura
9           Yuette
10        Fernanda
11     Charlesetta
12        Corrinne
13          Niesha
14          Rueben
15         Michell
16           Edgar
17          Dewitt
18        Charisse
19             Mee
20           Peter
21         Octavio
22          Martha
23         Tamesha
24            Tess
25         Leonard
26        Svetlana
27             Pok
28       Augustine
29           Karma
          ...     
470           Tony
471            Val
472            Mel
473       Isabella
474         Erasmo
475          Ivory
476         Nikita
477          Aleta
478           Owen
479        Pauline
480        Tijuana
481          Ahmad
482         Jamika
483        Derrick
484     Jacquelyne
485        Zachary
486         Sophia
487       Isabelle
488         Ronnie
489       Krystina
490         Rosita
491         

In [78]:
data.iloc[:,1] # second column of data frame (last_name)

0      Tomkiewicz
1       Zigomalas
2         Andrade
3       Mcwalters
4          Veness
5           Rampy
6        Grasmick
7           Hisaw
8        Manzella
9          Klapec
10         Writer
11            Erm
12          Jaret
13          Bruch
14      Gastellum
15      Throssell
16          Kanne
17          Julio
18       Spinello
19       Lapinski
20      Gutierres
21      Salvadore
22        Teplica
23         Veigel
24          Sitra
25         Kufner
26         Tauras
27       Molaison
28       Growcock
29         Quarto
          ...    
470    Diazdeleon
471        Villot
472      Picciuto
473    Piatkowski
474          Rhea
475       Lohrenz
476         Walka
477        Ligons
478       Jentzen
479         Fling
480      Machalek
481       Alsaqri
482        Conoly
483       Dolloff
484       Reibman
485    Freeburger
486       Gaucher
487          Kono
488       Brigman
489    Schlabaugh
490     Ausdemore
491       Stancil
492       Fiorino
493       Manciel
494       

In [79]:
data.iloc[:,-1] # last column of data frame (id)

0      436
1      973
2      549
3      755
4      352
5      352
6      959
7      710
8      994
9       47
10     196
11     813
12     380
13     633
14     569
15     726
16     403
17     409
18     376
19     716
20     824
21      99
22     953
23     959
24     394
25     149
26     520
27      29
28     698
29     547
      ... 
470    750
471     74
472    210
473    640
474    360
475    327
476    139
477    126
478    481
479    959
480    974
481     14
482    486
483    497
484     99
485    628
486    939
487    505
488      7
489    301
490    113
491    635
492    440
493    215
494    746
495    322
496    239
497    984
498    421
499    755
Name: id, Length: 500, dtype: int64

Multiple columns and rows can be selected together using the .iloc indexer and Python slices syntax:

In [80]:
data.iloc[0:5] # first five rows of dataframe

Unnamed: 0,first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Aleshia,Tomkiewicz,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,436
1,Evan,Zigomalas,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,973
2,France,Andrade,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,549
3,Ulysses,Mcwalters,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,755
4,Tyisha,Veness,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,352


In [81]:
data.iloc[:, 0:2] # first two columns of data frame with all rows

Unnamed: 0,first_name,last_name
0,Aleshia,Tomkiewicz
1,Evan,Zigomalas
2,France,Andrade
3,Ulysses,Mcwalters
4,Tyisha,Veness
5,Eric,Rampy
6,Marg,Grasmick
7,Laquita,Hisaw
8,Lura,Manzella
9,Yuette,Klapec


In [82]:
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.

Unnamed: 0,first_name,county,postal
0,Aleshia,Kent,CT2 7PP
3,Ulysses,Lincolnshire,DN36 5RP
6,Marg,Southampton,SO14 3TY
24,Tess,West Sussex,PO19 1RH


In [83]:
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).

Unnamed: 0,county,postal,phone1
0,Kent,CT2 7PP,01835-703597
1,Buckinghamshire,HP11 2AX,01937-864715
2,Bournemouth,BH6 3BE,01347-368222
3,Lincolnshire,DN36 5RP,01912-771311
4,West Midlands,B70 9DT,01547-429341


There are two things to consider when using `iloc` like this:

- `.iloc` returns:
  - a Pandas Series when a single row or a single column is selected with a single value (for example `data.iloc[0]` or `data.iloc[:, 0]`)
  - a Pandas DataFrame when multiple rows or columns are selected, or when the selector is passed as a list.
  
  If you require DataFrame output for a single row or column, pass a single-valued list.

  When using `.loc` or `.iloc`, you can control the output format by passing *lists* or *single values* to the selectors.

- When selecting multiple columns or multiple rows with a slice, remember that your selection will go from the first number to one minus the second number, as is usual in Python.

In practice, `.loc` is usually the better choice, because selecting by label keeps working when the order of rows or columns changes.
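
The return types described above can be checked on a small frame. This is a minimal sketch; `toy` and its values are made up for illustration and are not part of the notebook's `data`.

```python
import pandas as pd

# Hypothetical toy frame, not the notebook's `data`.
toy = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy", "Di"],
        "age": [31, 45, 27, 52],
        "city": ["Leeds", "York", "Bath", "Hull"],
    }
)

row_as_series = toy.iloc[0]       # single-value row selector -> Series
col_as_series = toy.iloc[:, 0]    # single-value column selector -> Series
row_as_frame = toy.iloc[[0]]      # one-element list -> DataFrame with 1 row
col_as_frame = toy.iloc[:, [0]]   # one-element list -> DataFrame with 1 column
block = toy.iloc[0:2, 1:3]        # rows 0-1 and columns 1-2; the stop is excluded

print(type(row_as_series).__name__, type(col_as_series).__name__)
print(row_as_frame.shape, col_as_frame.shape, block.shape)
```

The two single-value selections come back as Series, while the list and slice selections keep the two-dimensional DataFrame shape.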


In [84]:
data.iloc[:,[1]] # Single column as a DF instead of a Series.

Unnamed: 0,last_name
0,Tomkiewicz
1,Zigomalas
2,Andrade
3,Mcwalters
4,Veness
5,Rampy
6,Grasmick
7,Hisaw
8,Manzella
9,Klapec


## Using `loc`

The Pandas `loc` indexer can be used with DataFrames for two different use cases:

1. Selecting rows by label/index
2. Selecting rows with a boolean/conditional lookup

The `loc` indexer is used with the same syntax as iloc: `data.loc[<row selection>, <column selection>]`.



### Label-based / Index-based indexing using `loc`

Selections using the `loc` indexer are based on the index labels of the data frame. Where an index has been set on a DataFrame using `df.set_index()`, `.loc` selects rows directly by their index values; otherwise the labels are the default integers 0, 1, 2 and so on.

For example, setting the index of our test data frame to each person's `last_name`:

In [85]:
data.set_index('last_name', inplace=True)
data.head()

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,436
Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,973
Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,549
Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,755
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,352


Now with this new index, we can directly select rows for different "last_name" values using `.loc[<label>]`. For example:

In [86]:
data.loc['Andrade']

first_name                                France
company_name                 Elliott, John W Esq
address                             8 Moor Place
city              East Southbourne and Tuckton W
county                               Bournemouth
postal                                   BH6 3BE
phone1                              01347-368222
phone2                              01935-821636
email                 france.andrade@hotmail.com
web             http://www.elliottjohnwesq.co.uk
id                                           549
Name: Andrade, dtype: object

In [87]:
data.loc[['Andrade','Veness']]

Unnamed: 0_level_0,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,549
Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,352


Note that the first example returns a Series, and the second returns a DataFrame. You can get a single-row DataFrame by passing a single-element list to `.loc`.

Select columns with `.loc` using the names of the columns. In most data work the columns are named, so use these named selections.

In [88]:
data.loc[['Andrade','Veness'], ['first_name','address', 'city']]

Unnamed: 0_level_0,first_name,address,city
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Andrade,France,8 Moor Place,East Southbourne and Tuckton W
Veness,Tyisha,5396 Forth Street,Greets Green and Lyng Ward


You can also select ranges of index labels: the selection `data.loc['Bruch':'Julio']` will return all rows in the data frame from the index entry for "Bruch" to the entry for "Julio", with both ends included.
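
Label slices behave differently from positional slices at the stop value. The sketch below contrasts the two on a made-up frame (`toy` is illustrative only; the surnames are borrowed from the data, but the frame is not the notebook's `data`).

```python
import pandas as pd

# Hypothetical toy frame indexed by surname.
toy = pd.DataFrame(
    {"first_name": ["Niesha", "Rueben", "Michell", "Edgar", "Dewitt"]},
    index=["Bruch", "Gastellum", "Throssell", "Kanne", "Julio"],
)

by_label = toy.loc["Bruch":"Kanne"]   # label slice: "Kanne" is included
by_position = toy.iloc[0:3]           # positional slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice follows the order in which rows appear in the frame, not alphabetical order, so it returns every row from the first label through the last one.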

The following examples should now make sense:

In [89]:
# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']

Unnamed: 0_level_0,city,county,postal,phone1,phone2,email
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Andrade,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com
Veness,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com


In [90]:
# Select all rows from 'Andrade' to 'Veness' (inclusive), with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]

Unnamed: 0_level_0,first_name,address,city
last_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Andrade,France,8 Moor Place,East Southbourne and Tuckton W
Mcwalters,Ulysses,505 Exeter Rd,Hawerby cum Beesby
Veness,Tyisha,5396 Forth Street,Greets Green and Lyng Ward


Now we reset the index before setting a new one. If we don't, `set_index` discards the current `last_name` index by default (or, with `append=True`, builds a multiple index, a situation we don't want to deal with for now):

In [91]:
data.reset_index(inplace=True)
data.head()

Unnamed: 0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web,id
0,Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk,436
1,Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk,973
2,Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk,549
3,Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk,755
4,Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk,352


In [92]:
data.set_index('id', inplace=True)
data.head()

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
436,Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk
973,Zigomalas,Evan,Cap Gemini America,5 Binney St,Abbey Ward,Buckinghamshire,HP11 2AX,01937-864715,01714-737668,evan.zigomalas@gmail.com,http://www.capgeminiamerica.co.uk
549,Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk
755,Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk
352,Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk
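
To see why resetting first matters, here is a minimal sketch on a made-up two-row frame (`toy` and its values are illustrative, not the notebook's data). It compares setting a new index directly, appending to the existing one, and resetting before setting.

```python
import pandas as pd

# Hypothetical toy frame that starts out indexed by last_name.
toy = pd.DataFrame(
    {
        "last_name": ["Andrade", "Veness"],
        "id": [549, 352],
        "city": ["Bournemouth", "West Midlands"],
    }
).set_index("last_name")

replaced = toy.set_index("id")                # old index is discarded
appended = toy.set_index("id", append=True)   # old index kept -> MultiIndex
restored = toy.reset_index().set_index("id")  # old index kept as a column

print("last_name" in replaced.columns, replaced.index.name)
print(list(appended.index.names))
print(list(restored.columns))
```

Only the last variant keeps `last_name` available as an ordinary column while using `id` as a single-level index.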


In [93]:
# select the row with 'id' = 436
data.loc[436]

last_name                                 Tomkiewicz
first_name                                   Aleshia
company_name                 Alan D Rosenburg Cpa Pc
address                                 14 Taylor St
city                               St. Stephens Ward
county                                          Kent
postal                                       CT2 7PP
phone1                                  01835-703597
phone2                                  01944-369967
email                        atomkiewicz@hotmail.com
web             http://www.alandrosenburgcpapc.co.uk
Name: 436, dtype: object

Note that in the last example, `data.loc[436]` (the row with index value 436) **is not equal** to `data.iloc[436]` (the 437th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.
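
The difference is easiest to see on a tiny frame with an unordered integer index. The frames `toy` and `dupes` below are made up for illustration; the second one also shows what happens when index values repeat, which can occur with randomly generated ids.

```python
import pandas as pd

# Hypothetical frame whose integer labels are out of order.
toy = pd.DataFrame({"name": ["Ann", "Bob", "Cy"]}, index=[2, 0, 1])

by_label = toy.loc[0, "name"]     # row whose label is 0
by_position = toy.iloc[0, 0]      # row at position 0
print(by_label, by_position)

# Hypothetical frame with a repeated label.
dupes = pd.DataFrame({"name": ["Ann", "Bob", "Cy"]}, index=[7, 7, 9])
print(type(dupes.loc[7]).__name__)   # two rows share label 7
print(type(dupes.loc[9]).__name__)   # label 9 is unique
```

With a repeated label, `.loc` returns every matching row as a DataFrame, so code that expects a single Series should check the index for duplicates first (for example with `index.is_unique`).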

Remember that you can also recover the row in a DataFrame format if you specify a list instead of the raw element:

In [94]:
data.loc[[436]]

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
436,Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk


### Boolean indexing

Conditional selection with boolean arrays using `data.loc[<selection>]` is a common way to work with Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the `.loc` indexer to select the rows where your Series has True values.

In most use cases, you will make selections based on the values of different columns in your data set.

For example, the statement `data['first_name'] == 'Antonio'` produces a Pandas Series with a True/False value for every row in the `data` DataFrame, with `True` for the rows where the first_name is 'Antonio'. This type of boolean array can be passed directly to the `.loc` indexer like so:

In [95]:
data.loc[data['first_name'] == 'Antonio']

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
818,Villamarin,Antonio,Combs Sheetmetal,353 Standish St #8264,Little Parndon and Hare Street,Hertfordshire,CM20 2HT,01559-403415,01388-777812,antonio.villamarin@gmail.com,http://www.combssheetmetal.co.uk
565,Glasford,Antonio,Saint Thomas Creations,425 Howley St,Gaer Community,Newport,NP20 3DE,01463-409090,01242-318420,antonio_glasford@glasford.co.uk,http://www.saintthomascreations.co.uk
350,Heilig,Antonio,Radisson Suite Hotel,35 Elton St #3,Ipplepen,Devon,TQ12 5LL,01324-171614,01442-946357,antonio.heilig@gmail.com,http://www.radissonsuitehotel.co.uk


As before, a second argument can be passed to .loc to select particular columns out of the data frame.

Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation:

In [96]:
data.loc[data['first_name'] == 'Erasmo', ['company_name', 'email', 'phone1']]

Unnamed: 0_level_0,company_name,email,phone1
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
762,Active Air Systems,erasmo.talentino@hotmail.com,01492-454455
437,Pan Optx,egath@hotmail.com,01445-796544
360,Martin Morrissey,erasmo_rhea@hotmail.com,01507-386397


When only one column is selected by name, the `.loc` indexer returns a Series. For a single-column DataFrame, pass a one-element list to keep the DataFrame format, for example:


In [97]:
data.loc[data['first_name'] == 'Antonio', 'email']

id
818       antonio.villamarin@gmail.com
565    antonio_glasford@glasford.co.uk
350           antonio.heilig@gmail.com
Name: email, dtype: object

In [98]:
data.loc[data['first_name'] == 'Antonio', ['email']]

Unnamed: 0_level_0,email
id,Unnamed: 1_level_1
818,antonio.villamarin@gmail.com
565,antonio_glasford@glasford.co.uk
350,antonio.heilig@gmail.com


Selecting rows with first name 'Antonio' and all columns between 'city' and 'email':

In [99]:
data.loc[data['first_name'] == 'Antonio', 'city':'email']

Unnamed: 0_level_0,city,county,postal,phone1,phone2,email
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
818,Little Parndon and Hare Street,Hertfordshire,CM20 2HT,01559-403415,01388-777812,antonio.villamarin@gmail.com
565,Gaer Community,Newport,NP20 3DE,01463-409090,01242-318420,antonio_glasford@glasford.co.uk
350,Ipplepen,Devon,TQ12 5LL,01324-171614,01442-946357,antonio.heilig@gmail.com


Select rows where the email column ends with 'hotmail.com', include all columns:

In [100]:
data.loc[data['email'].str.endswith("hotmail.com")]

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
436,Tomkiewicz,Aleshia,Alan D Rosenburg Cpa Pc,14 Taylor St,St. Stephens Ward,Kent,CT2 7PP,01835-703597,01944-369967,atomkiewicz@hotmail.com,http://www.alandrosenburgcpapc.co.uk
549,Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk
755,Mcwalters,Ulysses,"Mcmahan, Ben L",505 Exeter Rd,Hawerby cum Beesby,Lincolnshire,DN36 5RP,01912-771311,01302-601380,ulysses@hotmail.com,http://www.mcmahanbenl.co.uk
352,Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk
959,Grasmick,Marg,Wrangle Hill Auto Auct & Slvg,7457 Cowl St #70,Bargate Ward,Southampton,SO14 3TY,01865-582516,01362-620532,marg@hotmail.com,http://www.wranglehillautoauctslvg.co.uk
994,Manzella,Lura,Bizerba Usa Inc,929 Augustine St,Staple Hill Ward,South Gloucestershire,BS16 4LL,01907-538509,01340-713951,lura@hotmail.com,http://www.bizerbausainc.co.uk
409,Julio,Dewitt,Rittenhouse Motor Co,7 Richmond St,Parkham,Devon,EX39 5DJ,01253-528327,01241-964675,dewitt.julio@hotmail.com,http://www.rittenhousemotorco.co.uk
394,Sitra,Tess,Smart Signs,61 Rossett St,Chichester,West Sussex,PO19 1RH,01473-229124,01848-116775,tess_sitra@hotmail.com,http://www.smartsigns.co.uk
718,Zelaya,German,Jackson & Heit Machine Co Inc,7 Shenstone St,Longhill Ward,"Yorkshire, East (North Humbers",HU8 9PZ,01400-269033,01366-210656,german@hotmail.com,http://www.jacksonheitmachinecoinc.co.uk
582,Ear,Luis,Wa Inst For Plcy Studies,2 Birchfield Rd,Whittington,Shropshire,SY11 4PH,01462-648669,01405-648623,luis@hotmail.com,http://www.wainstforplcystudies.co.uk


Select rows with first_name equal to one of several values, all columns:

In [101]:
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
549,Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk
352,Veness,Tyisha,Champagne Room,5396 Forth Street,Greets Green and Lyng Ward,West Midlands,B70 9DT,01547-429341,01290-367248,tyisha.veness@hotmail.com,http://www.champagneroom.co.uk
352,Rampy,Eric,"Thompson, Michael C Esq",9472 Lind St,Desborough,Northamptonshire,NN14 2GH,01969-886290,01545-817375,erampy@rampy.co.uk,http://www.thompsonmichaelcesq.co.uk


Select rows with first name 'Antonio' **AND** gmail email addresses:

In [102]:
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]

Unnamed: 0_level_0,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
818,Villamarin,Antonio,Combs Sheetmetal,353 Standish St #8264,Little Parndon and Hare Street,Hertfordshire,CM20 2HT,01559-403415,01388-777812,antonio.villamarin@gmail.com,http://www.combssheetmetal.co.uk
350,Heilig,Antonio,Radisson Suite Hotel,35 Elton St #3,Ipplepen,Devon,TQ12 5LL,01324-171614,01442-946357,antonio.heilig@gmail.com,http://www.radissonsuitehotel.co.uk
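
Conditions are combined element-wise with `&` (and), `|` (or) and `~` (not). When a comparison is written inline it needs its own parentheses, because `&` binds more tightly than `==`. A minimal sketch on a made-up frame (`toy` is illustrative, not the notebook's data):

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame(
    {
        "first_name": ["Antonio", "Antonio", "Erasmo", "Tess"],
        "email": ["a@gmail.com", "b@hotmail.com", "c@gmail.com", "d@hotmail.com"],
    }
)

# Naming each mask avoids precedence mistakes and keeps .loc readable.
is_antonio = toy["first_name"] == "Antonio"
is_gmail = toy["email"].str.endswith("gmail.com")

both = toy.loc[is_antonio & is_gmail]     # AND: both conditions hold
either = toy.loc[is_antonio | is_gmail]   # OR: at least one holds
not_gmail = toy.loc[~is_gmail]            # NOT: condition inverted

print(len(both), len(either), len(not_gmail))
```

Python's `and`, `or` and `not` keywords do not work here, since they try to reduce a whole Series to a single True/False value.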


Select rows with id greater than 100 and at most 200, and just return the 'postal' and 'web' columns. But let's first reset the index so `id` is a column:

In [103]:
data.reset_index(inplace=True)
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']] 

Unnamed: 0,postal,web
10,DA2 7PP,http://www.krassociatesinc.co.uk
25,HS1 2PZ,http://www.arcticstardistributinginc.co.uk
37,SL3 0PY,http://www.novakalanpaulesq.co.uk
51,NG8 2NB,http://www.marketinghorizonsinc.co.uk
56,PL14 5PA,http://www.kennedyscalesinc.co.uk
57,B34 7BP,http://www.barajasbustamantearchl.co.uk
59,NG6 8RG,http://www.bohswelldrillinginc.co.uk
61,GU21 5QL,http://www.reidcarletonbesq.co.uk
93,E14 5DR,http://www.rosatimarcdesq.co.uk
98,M33 4BP,http://www.taosvalleyresortassn.co.uk


A lambda function that yields True/False values can also be used. Let's use it to select rows where the company name has 4 words in it:

In [105]:
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)] 

Unnamed: 0,id,last_name,first_name,company_name,address,city,county,postal,phone1,phone2,email,web
2,549,Andrade,France,"Elliott, John W Esq",8 Moor Place,East Southbourne and Tuckton W,Bournemouth,BH6 3BE,01347-368222,01935-821636,france.andrade@hotmail.com,http://www.elliottjohnwesq.co.uk
5,352,Rampy,Eric,"Thompson, Michael C Esq",9472 Lind St,Desborough,Northamptonshire,NN14 2GH,01969-886290,01545-817375,erampy@rampy.co.uk,http://www.thompsonmichaelcesq.co.uk
11,813,Erm,Charlesetta,"Cain, John M Esq",5 Hygeia St,Loundsley Green Ward,Derbyshire,S40 4LY,01276-816806,01517-624517,charlesetta_erm@gmail.com,http://www.cainjohnmesq.co.uk
15,726,Throssell,Michell,Weiss Spirt & Guyer,89 Noon St,Carbrooke,Norfolk,IP25 6JQ,01967-580851,01672-496478,mthrossell@throssell.co.uk,http://www.weissspirtguyer.co.uk
16,403,Kanne,Edgar,"Crowan, Kenneth W Esq",99 Guthrie St,New Milton,Hampshire,BH25 5DF,01326-532337,01666-638176,edgar.kanne@yahoo.com,http://www.crowankennethwesq.co.uk
19,716,Lapinski,Mee,Galloway Electric Co Inc,9 Pengwern St,Marldon,Devon,TQ3 1SA,01578-287816,01939-815208,mee.lapinski@yahoo.com,http://www.gallowayelectriccoinc.co.uk
20,824,Gutierres,Peter,Niagara Custombuilt Mfg Co,4410 Tarlton St,Prestatyn Community,Denbighshire,LL19 9EG,01842-767201,01859-648598,peter_gutierres@yahoo.com,http://www.niagaracustombuiltmfgco.co.uk
22,953,Teplica,Martha,"Curtin, Patricia M Esq",148 Rembrandt St,Warlingham,Surrey,CR6 9SW,01677-684257,01583-287367,mteplica@teplica.co.uk,http://www.curtinpatriciamesq.co.uk
23,959,Veigel,Tamesha,"Wilhelm, James E Jr",2200 Nelson St #58,Newport,Isle of Wight,PO30 5AL,01217-342071,01280-786847,tveigel@veigel.co.uk,http://www.wilhelmjamesejr.co.uk
25,149,Kufner,Leonard,Arctic Star Distributing Inc,41 Canning St,Steornabhagh a Deas Ward,Western Isles,HS1 2PZ,01230-623547,01604-718601,lkufner@kufner.co.uk,http://www.arcticstardistributinginc.co.uk
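
The same word-count mask can also be built with Pandas' vectorized string methods instead of `apply` with a lambda. The sketch below uses a made-up Series `names` (a few company names plus one missing value, included as an assumption to show the difference in behaviour).

```python
import pandas as pd

# Hypothetical Series; the None stands for a missing company name.
names = pd.Series(
    ["Elliott, John W Esq", "Champagne Room", "Cap Gemini America", None]
)

# Split on single spaces, count the pieces, compare with 4.
four_words = names.str.split(" ").str.len() == 4
print(four_words.tolist())
```

The `.str` accessor skips missing values, whereas `lambda x: len(x.split(' '))` would raise an error on them because a missing value has no `split` method. On a column without missing values both approaches select the same rows.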


The boolean selection can also be built outside of the main `.loc` call for clarity. First, store the mask in a separate variable:

In [110]:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
idx

0      False
1      False
2       True
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11      True
12     False
13     False
14     False
15      True
16      True
17     False
18     False
19      True
20      True
21     False
22      True
23      True
24     False
25      True
26      True
27     False
28     False
29      True
       ...  
470    False
471    False
472    False
473     True
474    False
475    False
476    False
477    False
478    False
479    False
480     True
481     True
482    False
483    False
484     True
485    False
486    False
487     True
488    False
489    False
490    False
491    False
492     True
493    False
494    False
495    False
496    False
497    False
498     True
499    False
Name: company_name, Length: 500, dtype: bool

Then, select only the True values in `idx` and only the 3 columns specified:

In [111]:
data.loc[idx, ['email', 'first_name', 'company_name']]

Unnamed: 0,email,first_name,company_name
2,france.andrade@hotmail.com,France,"Elliott, John W Esq"
5,erampy@rampy.co.uk,Eric,"Thompson, Michael C Esq"
11,charlesetta_erm@gmail.com,Charlesetta,"Cain, John M Esq"
15,mthrossell@throssell.co.uk,Michell,Weiss Spirt & Guyer
16,edgar.kanne@yahoo.com,Edgar,"Crowan, Kenneth W Esq"
19,mee.lapinski@yahoo.com,Mee,Galloway Electric Co Inc
20,peter_gutierres@yahoo.com,Peter,Niagara Custombuilt Mfg Co
22,mteplica@teplica.co.uk,Martha,"Curtin, Patricia M Esq"
23,tveigel@veigel.co.uk,Tamesha,"Wilhelm, James E Jr"
25,lkufner@kufner.co.uk,Leonard,Arctic Star Distributing Inc


# Pandas `apply`, `applymap` and `map`

[Source](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)

Let's create a new Dataframe:

In [112]:
df = pd.DataFrame({
    'A': [1,2,3,4], 
    'B': [10,20,30,40],
    'C': [20,40,60,80]
    }, 
    index=['Row 1', 'Row 2', 'Row 3', 'Row 4'])
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80


## Using Apply

The Pandas `apply()` method applies a function along an axis of a DataFrame or to the values of a Series.

Let’s begin with a simple example, to sum each row and save the result to a new column "D":

In [113]:
def custom_sum(row):
    return row.sum()
    
df['D'] = df.apply(custom_sum, axis=1)
df

Unnamed: 0,A,B,C,D
Row 1,1,10,20,31
Row 2,2,20,40,62
Row 3,3,30,60,93
Row 4,4,40,80,124


Do you really understand what just happened?

Let's take a closer look at `df.apply(custom_sum, axis=1)`:

- The first parameter, `custom_sum`, is the function to apply.
- The second parameter, `axis`, specifies which axis the function is applied along: `0` applies the function to each column and `1` applies it to each row.

Put another way, `axis=1` tells Pandas to pass each row to the function. `custom_sum` is therefore called once per row, and `apply()` returns a new Series holding one result per row.

To sum each column instead, use `axis=0`. First, let's remove the column **D** we just added:

In [114]:
df.drop('D', axis=1, inplace=True)
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80


In [45]:
df.loc['Row 5'] = df.apply(custom_sum, axis=0)
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80
Row 5,10,100,200
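
To see the two axis settings side by side, here is a minimal sketch on a made-up two-by-two frame (`toy` is a hypothetical name, not part of the notebook's data). It shows which labels end up in the index of the result in each case.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["r1", "r2"])

# axis=0: the function receives each column; the result is indexed by column name.
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row; the result is indexed by row label.
row_sums = toy.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

This is why the column sums can be assigned as a new row with `df.loc[...]`, while the row sums fit naturally as a new column.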


So far, we have been talking about `apply()` on a DataFrame. Similarly, `apply()` can be used on the values of a Series. For example, after removing `Row 5`, multiply column **C** by 2 and save the result to a new column **D**:

In [46]:
df.drop('Row 5', inplace=True)
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80


In [47]:
def multiply_by_2(val):
    return val * 2

df['D'] = df['C'].apply(multiply_by_2)
df

Unnamed: 0,A,B,C,D
Row 1,1,10,20,40
Row 2,2,20,40,80
Row 3,3,30,60,120
Row 4,4,40,80,160


Notice that `df['C']` selects column **C**, and `apply()` is then called with `multiply_by_2` as its only argument. We don't need to specify an axis because a Series is one-dimensional. The return value is a Series, which is assigned to the new column **D** through `df['D']`.
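
As a point of comparison, the sketch below (on a hypothetical frame named `toy`) shows that `Series.apply()` calls the function once per element, and that for plain arithmetic like this the vectorized expression should give the same values.

```python
import pandas as pd

toy = pd.DataFrame({"C": [20, 40, 60]})

# Series.apply calls the function once per element and returns a new Series.
doubled_apply = toy["C"].apply(lambda v: v * 2)

# For simple arithmetic, operating on the whole Series gives the same values.
doubled_vector = toy["C"] * 2

print(doubled_apply.tolist())
print(doubled_vector.tolist())
```

`apply()` becomes more useful when the per-element logic cannot be written as a single arithmetic expression.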

Now, we can do exactly the same for a row, which `loc` also returns as a Series:

In [48]:
df.drop('D', axis=1, inplace=True)
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80


In [49]:
df.loc['Row 5'] = df.loc['Row 4'].apply(multiply_by_2)
df

Unnamed: 0,A,B,C
Row 1,1,10,20
Row 2,2,20,40
Row 3,3,30,60
Row 4,4,40,80
Row 5,8,80,160


### Using lambdas

As we saw in class, you can use the Pandas `apply()` function with lambdas.

The lambda equivalent for the sum of each row of a DataFrame that we used above is:


In [62]:
df

Unnamed: 0,A,B,C
Row 1,1.0,10.0,20.0
Row 2,2.0,20.0,40.0
Row 3,3.0,30.0,60.0
Row 4,4.0,40.0,80.0


In [63]:
df['D'] = df.apply(lambda x:x.sum(), axis=1)
df

Unnamed: 0,A,B,C,D
Row 1,1.0,10.0,20.0,31.0
Row 2,2.0,20.0,40.0,62.0
Row 3,3.0,30.0,60.0,93.0
Row 4,4.0,40.0,80.0,124.0


Or, the lambda equivalent for the sum of each column of a DataFrame:


In [64]:
df.loc['Row 5'] = df.apply(lambda x:x.sum(), axis=0)
df

Unnamed: 0,A,B,C,D
Row 1,1.0,10.0,20.0,31.0
Row 2,2.0,20.0,40.0,62.0
Row 3,3.0,30.0,60.0,93.0
Row 4,4.0,40.0,80.0,124.0
Row 5,10.0,100.0,200.0,310.0


And finally, the lambda equivalent for multiply by 2 on a Series:

In [65]:
df['D'] = df['C'].apply(lambda x:x*2)
df

Unnamed: 0,A,B,C,D
Row 1,1.0,10.0,20.0,40.0
Row 2,2.0,20.0,40.0,80.0
Row 3,3.0,30.0,60.0,120.0
Row 4,4.0,40.0,80.0,160.0
Row 5,10.0,100.0,200.0,400.0


### Using the `result_type` parameter
`result_type` is a parameter of `apply()` that can be set to `'expand'`, `'reduce'`, or `'broadcast'` to control the type of the result. It only takes effect when `axis=1`.

Taking the earlier row sum as an example, setting `result_type='broadcast'` returns a DataFrame of the original shape in which every cell of a row holds that row's `custom_sum` value:


In [66]:
df.apply(custom_sum, axis=1, result_type='broadcast')

Unnamed: 0,A,B,C,D
Row 1,71.0,71.0,71.0,71.0
Row 2,142.0,142.0,142.0,142.0
Row 3,213.0,213.0,213.0,213.0
Row 4,284.0,284.0,284.0,284.0
Row 5,710.0,710.0,710.0,710.0


You can see that the result is broadcast to the original shape of the frame, and the original index and columns are retained.
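
A minimal sketch of the difference between the default result and `broadcast`, using a made-up frame (`toy` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["r1", "r2"])

# Default: one scalar per row, returned as a Series.
as_series = toy.apply(lambda row: row.sum(), axis=1)

# broadcast: the scalar is repeated across every column of that row.
as_frame = toy.apply(lambda row: row.sum(), axis=1, result_type="broadcast")

print(as_series)
print(as_frame)
```

The broadcast result keeps the row labels `r1`, `r2` and the column labels `A`, `B` of the input.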



To understand `result_type`'s `expand` and `reduce`, you will first create a function that returns a list:

In [67]:
def cal_multi_col(row):
    return [row['A'] * 2, row['B'] * 3]

Let's apply this function to each row (`axis=1`) with `result_type` set to `'expand'`:

In [68]:
df.apply(cal_multi_col, axis=1, result_type='expand')

Unnamed: 0,0,1
Row 1,2.0,30.0
Row 2,4.0,60.0
Row 3,6.0,90.0
Row 4,8.0,120.0
Row 5,20.0,300.0


The output is a new DataFrame with column names 0 and 1.

To append these columns to the existing DataFrame, store the result in a variable so that its column names can be accessed through `resul.columns`:


In [71]:
resul = df.apply(cal_multi_col, axis=1, result_type='expand')
df[resul.columns] = resul

In [72]:
df

Unnamed: 0,A,B,C,D,0,1
Row 1,1.0,10.0,20.0,40.0,2.0,30.0
Row 2,2.0,20.0,40.0,80.0,4.0,60.0
Row 3,3.0,30.0,60.0,120.0,6.0,90.0
Row 4,4.0,40.0,80.0,160.0,8.0,120.0
Row 5,10.0,100.0,200.0,400.0,20.0,300.0
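
The integer column names `0` and `1` are not very descriptive. One possible variation, sketched here on a made-up frame, renames the expanded columns before joining them. The names `toy`, `A_x2` and `B_x3` are hypothetical and chosen only for readability.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["r1", "r2"])

def cal_multi_col(row):
    return [row["A"] * 2, row["B"] * 3]

expanded = toy.apply(cal_multi_col, axis=1, result_type="expand")

# Hypothetical column names, replacing the default 0 and 1.
expanded.columns = ["A_x2", "B_x3"]

# join aligns on the index and returns a new frame; toy itself is unchanged.
combined = toy.join(expanded)
print(combined)
```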


Finally, apply the function across axis 1 with `result_type='reduce'`. This is the opposite of `expand`: it returns a Series if possible rather than expanding list-like results into columns:


In [73]:
df['New'] = df.apply(cal_multi_col, axis=1, result_type='reduce')

In [74]:
df

Unnamed: 0,A,B,C,D,0,1,New
Row 1,1.0,10.0,20.0,40.0,2.0,30.0,"[2.0, 30.0]"
Row 2,2.0,20.0,40.0,80.0,4.0,60.0,"[4.0, 60.0]"
Row 3,3.0,30.0,60.0,120.0,6.0,90.0,"[6.0, 90.0]"
Row 4,4.0,40.0,80.0,160.0,8.0,120.0,"[8.0, 120.0]"
Row 5,10.0,100.0,200.0,400.0,20.0,300.0,"[20.0, 300.0]"
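
To make the shape of the `reduce` result explicit, this minimal sketch (on a hypothetical frame `toy`) checks that the return value is a single Series whose elements are the lists returned by the function:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [10, 20], "C": [5, 6]}, index=["r1", "r2"])

def cal_multi_col(row):
    return [row["A"] * 2, row["B"] * 3]

# reduce keeps each returned list as one element instead of spreading it out.
reduced = toy.apply(cal_multi_col, axis=1, result_type="reduce")

print(type(reduced).__name__)
print(reduced["r1"])
```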


## Using `applymap()`

`applymap()` applies a function element-wise across the whole DataFrame. In some cases it can be faster than `apply()`, but it is worth timing both on large operations.

To output a DataFrame with every number squared, the `applymap()` call would be:

In [75]:
df.applymap(np.square)

Unnamed: 0,A,B,C,D,0,1,New
Row 1,1.0,100.0,400.0,1600.0,4.0,900.0,"[4.0, 900.0]"
Row 2,4.0,400.0,1600.0,6400.0,16.0,3600.0,"[16.0, 3600.0]"
Row 3,9.0,900.0,3600.0,14400.0,36.0,8100.0,"[36.0, 8100.0]"
Row 4,16.0,1600.0,6400.0,25600.0,64.0,14400.0,"[64.0, 14400.0]"
Row 5,100.0,10000.0,40000.0,160000.0,400.0,90000.0,"[400.0, 90000.0]"
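
Note that from pandas 2.1 onwards `DataFrame.applymap` is deprecated in favour of `DataFrame.map`, which does the same element-wise job. The sketch below, on a made-up frame `toy`, picks whichever method the installed version offers. The version check is an assumption about how you might want to handle both cases, not something the notebook requires.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Assumption: use DataFrame.map where it exists, else fall back to applymap.
if hasattr(pd.DataFrame, "map"):
    squared = toy.map(lambda v: v ** 2)       # pandas 2.1 and later
else:
    squared = toy.applymap(lambda v: v ** 2)  # older pandas

print(squared)
```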


## Using `map()`

`map()` is a Series method used for substituting each value in a Series with a new one.

To see how `map()` works, let's create a Series:

In [76]:
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s

0       cat
1       dog
2       NaN
3    rabbit
dtype: object

`map()` accepts a dict or a Series as input. Values that are not found in the dict are converted to `NaN`, unless the dict has a default value (as is the case with `defaultdict`):

In [77]:
s.map({'cat': 'kitten', 'dog': 'puppy'})

0    kitten
1     puppy
2       NaN
3       NaN
dtype: object
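
The `defaultdict` case mentioned above can be sketched as follows. The Series `pets` and the placeholder `"unknown"` are made up for this illustration.

```python
from collections import defaultdict

import pandas as pd

pets = pd.Series(["cat", "dog", "rabbit"])
babies = {"cat": "kitten", "dog": "puppy"}

# Plain dict: keys that are missing from the dict become NaN.
plain = pets.map(babies)

# defaultdict: missing keys fall back to the default factory's value.
# "unknown" is an arbitrary placeholder chosen for this sketch.
with_default = pets.map(defaultdict(lambda: "unknown", babies))

print(plain.tolist())
print(with_default.tolist())
```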

`map()` also accepts a function as input:

In [78]:
s.map('I am a {}'.format)

0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

If you want to avoid applying the function to missing values, so that `NaN` is passed through unchanged, you can use `na_action='ignore'`:

In [79]:
s.map('I am a {}'.format, na_action='ignore')

0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object

# Filling up missing Data

[Source](https://www.geeksforgeeks.org/python-pandas-dataframe-fillna-to-replace-null-values-in-dataframe/)

Sometimes our data has null values, which are displayed as `NaN` in a DataFrame. Just as the pandas `dropna()` method removes null values from a DataFrame, `fillna()` lets the user replace `NaN` values with a value of their own.

[See the syntax in the Pandas Official Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

## Replacing NaN values with a static value

In [116]:
#!wget https://bit.ly/ks-pds-csv5 -O {files_loc}/nba.csv
#contents = !ls {files_loc}/*nba*
#nba_file = contents[0]

In [118]:
nba = pd.read_csv('nba.csv')
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


Here, all the null values in the `College` column are replaced with the string `'No College'`. The DataFrame was read from the CSV file above; `fillna()` is then called on it with a dict that maps the column name to its replacement value. This returns a new DataFrame and leaves `nba` unchanged:

In [119]:
nba.fillna({'College':'No College'})

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,No College,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,No College,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


The same replacement can also be done in place, by selecting the column and calling `fillna()` with `inplace=True`:

In [120]:
nba['College'].fillna('No College', inplace=True)
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,No College,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,No College,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


But look at what happens if we use this last syntax without `inplace=True`: the call returns a new Series rather than the modified DataFrame:

In [122]:
nba = pd.read_csv("nba.csv") # Rereading, we modified nba df before
nba_1 = nba['College'].fillna('No College')
nba_1

0                  Texas
1              Marquette
2      Boston University
3          Georgia State
4             No College
5             No College
6                    LSU
7                Gonzaga
8             Louisville
9         Oklahoma State
10            Ohio State
11            Washington
12            Ohio State
13              Kentucky
14        North Carolina
15            No College
16        Oklahoma State
17        North Carolina
18               Arizona
19          Georgia Tech
20            No College
21            Cincinnati
22            Miami (FL)
23              Stanford
24              Syracuse
25           Saint Louis
26                Kansas
27            Georgetown
28             Texas A&M
29          Georgia Tech
             ...        
428          Wake Forest
429           Notre Dame
430           California
431       North Carolina
432           St. John's
433                 Duke
434     Central Michigan
435             Illinois
436          Weber State


In [123]:
type(nba['College'].fillna('No College'))

pandas.core.series.Series

In [124]:
type(nba['College'].fillna('No College', inplace=True))

NoneType
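
The two `type()` calls show the difference: without `inplace=True`, `fillna()` returns a new Series, and with `inplace=True` it returns `None`. Recent pandas releases may also warn when an in-place method is called on a selected column like this. A common alternative is to assign the result back, sketched here on a made-up three-row frame (`players` and its values are illustrative, not taken from `nba.csv`).

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Name": ["P1", "P2", "P3"],
    "College": ["Texas", np.nan, "LSU"],
})

# fillna without inplace returns a new object, so keep the result.
filled = players.fillna({"College": "No College"})

# Alternatively, write the filled column back explicitly on a copy.
players_copy = players.copy()
players_copy["College"] = players_copy["College"].fillna("No College")

print(filled["College"].tolist())
print(players["College"].isna().sum())  # the original frame still has its NaN
```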

## Using the `method` parameter
Now, let's set `method` to `ffill` (forward fill), so that each null value is replaced by the last valid value above it in the same column. In this case 'Georgia State' replaces the null values in the `College` column of rows 4 and 5.

`bfill` (backward fill) can be used in the same way; `pad` and `backfill` are aliases for `ffill` and `bfill` respectively:

In [125]:
nba = pd.read_csv("nba.csv") # Reloading, we modified nba df before
nba['College'].fillna(method='ffill', inplace=True)
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Georgia State,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,Georgia State,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
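
Forward and backward fill are also available as the dedicated methods `ffill()` and `bfill()`, and since pandas 2.1 the `method=` argument of `fillna()` is deprecated in their favour. A minimal sketch of both directions, on a made-up Series named `college`:

```python
import numpy as np
import pandas as pd

college = pd.Series(["Georgia State", np.nan, np.nan, "LSU"])

# Forward fill: each NaN takes the last valid value above it.
forward = college.ffill()

# Backward fill: each NaN takes the next valid value below it.
backward = college.bfill()

print(forward.tolist())
print(backward.tolist())
```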


And now, let's do the same without modifying the original `nba` DataFrame. This involves building the new Series we want and passing it to the [`assign`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html) method, which returns a copy of the original DataFrame with the desired modification:

In [126]:
nba = pd.read_csv("nba.csv") # Reloading, we modified nba df before
new_college = nba['College'].fillna(method='ffill')
nba_2 = nba.assign(College=new_college)
nba_2

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Georgia State,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,Georgia State,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


In [127]:
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
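
The same behaviour can be checked on a small made-up frame (`players` is a hypothetical name): `assign` returns a new DataFrame, and the original keeps its `NaN`. This sketch uses `ffill()` for the fill step, which should be equivalent to `fillna(method='ffill')` here.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Name": ["P1", "P2", "P3"],
    "College": ["Texas", np.nan, "LSU"],
})

# assign returns a new DataFrame; the keyword names the column to replace.
players_filled = players.assign(College=players["College"].ffill())

print(players_filled["College"].tolist())
print(players["College"].isna().tolist())
```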


## Using `limit`
Let's set a `limit` of 1 in the `fillna()` call. `limit` is the maximum number of consecutive `NaN` values to fill, so only the first `NaN` after 'Georgia State' should be replaced:

In [128]:
nba['College'].fillna(method='ffill', limit=1, inplace=True)
nba

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Georgia State,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
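
Note that `limit` applies to each run of consecutive `NaN` values, not to the column as a whole. A minimal sketch on a made-up Series (`college` is a hypothetical name) with two separate gaps:

```python
import numpy as np
import pandas as pd

college = pd.Series(["Georgia State", np.nan, np.nan, "LSU", np.nan])

# limit=1: fill at most one NaN in each consecutive run.
limited = college.ffill(limit=1)

print(limited.tolist())
print(limited.isna().tolist())
```

The second `NaN` of the first gap stays empty, while the single `NaN` after 'LSU' is still filled.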
