This section focuses on data extraction: how to select one or more rows (or columns) from our data, how to target a specific cell, how to overwrite a given cell's value, how to create a new index, and so on.

Topics covered in this section:
- Use the .set_index() and .reset_index() methods to define a new DataFrame index.
- Retrieve rows by index label with the .loc[] accessor.
- Retrieve rows by index position with the .iloc[] accessor.
- Pass second arguments to the .loc[] and .iloc[] accessors.
- Set a new value for a specific cell or cells in a row.
- Set multiple values in a DataFrame.
- Rename index labels or columns in a DataFrame.
- Delete rows or columns from a DataFrame.
- Create a random sample with the .sample() method.
- Use the .nsmallest() / .nlargest() methods to get rows with the smallest / largest values.
- Filter a DataFrame with the .where() method.
- Filter a DataFrame with the .query() method.
- Review the .apply() method on a pandas Series object.
- Apply a function to every DataFrame row with the .apply() method.
- Create a copy of a DataFrame with the .copy() method.

In [3]:
import pandas as pd

In [2]:
pd.__version__

'1.1.3'

In [40]:
bond = pd.read_csv("jamesbond.csv")
bond.head(3)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


The set_index and reset_index Methods

In [8]:
# bond = pd.read_csv("jamesbond.csv", index_col = "Film") 

# This is one method to set the Film column as the index. 

In [41]:
# .set_index() parameters:
# keys : represents the column name(s) that we want to serve as the index. 
 
bond.set_index(keys = "Film", inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [48]:
# .reset_index() : resets the index back to the default ascending numeric index. 
# Parameters:
# drop = False : the default. The column that was previously set as the index is moved back into the DataFrame.
#                Set to True to discard the previous index instead (in our case, the Film column). 

bond.reset_index(drop = False, inplace = True)
bond.head(3)

# The DataFrame is back to its original shape. 

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [49]:
# With .set_index(), the current index is discarded by default and replaced by the chosen column. 
# To see this, we first assign the Film column as the index again. 

bond.set_index(keys = "Film", inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [50]:
bond.set_index(keys = "Year")

# Using .set_index() to change the index to Year discards the current Film index by default. 
# This differs from .reset_index(), which moves the previous index back into the DataFrame as a column. 

Unnamed: 0_level_0,Actor,Director,Box Office,Budget,Bond Actor Salary
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1962,Sean Connery,Terence Young,448.8,7.0,0.6
1963,Sean Connery,Terence Young,543.8,12.6,1.6
1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
1965,Sean Connery,Terence Young,848.1,41.9,4.7
1967,David Niven,Ken Hughes,315.0,85.0,
1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4
1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
1973,Roger Moore,Guy Hamilton,460.3,30.8,
1974,Roger Moore,Guy Hamilton,334.0,27.7,


In [53]:
# To keep the previous index column, we can combine the two methods. 

bond.reset_index(inplace = True)
bond.set_index(keys = "Year", inplace = True)
bond.head(3)

# First, we reset the index to the original DataFrame. This moves the Film column back into the DataFrame.  
# Then, we set the index to Year. Now, the Film column will still be present. 

Unnamed: 0_level_0,Film,Actor,Director,Box Office,Budget,Bond Actor Salary
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1962,Dr. No,Sean Connery,Terence Young,448.8,7.0,0.6
1963,From Russia with Love,Sean Connery,Terence Young,543.8,12.6,1.6
1964,Goldfinger,Sean Connery,Guy Hamilton,820.4,18.6,3.2
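
As a minimal sketch of the behaviour described above, the cell below rebuilds three rows of the dataset by hand and compares calling .set_index() directly with resetting the index first. The `films` variable and its reduced set of columns are assumptions made for illustration; they are not read from jamesbond.csv.

```python
import pandas as pd

# Hand-built stand-in for jamesbond.csv, using values shown in the outputs above.
films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger"],
    "Year": [1962, 1963, 1964],
    "Actor": ["Sean Connery", "Sean Connery", "Sean Connery"],
})

by_film = films.set_index(keys="Film")

# Setting a new index directly discards the current Film index.
film_lost = by_film.set_index(keys="Year")

# Resetting first moves Film back into the columns, so it is kept.
film_kept = by_film.reset_index().set_index(keys="Year")

print(list(film_lost.columns))
print(list(film_kept.columns))
```

Neither call uses inplace = True here, so `by_film` itself is left unchanged.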


Retrieve Rows by Index Label with .loc[] Accessor

In this module, we will take a look at how to extract one or more rows by their index labels using the .loc[] accessor. It is neither a regular attribute nor a method; it is a property that is used with square brackets. 

In [55]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

# We have imported the jamesbond.csv into the DataFrame and set the index column to the Film column. 
# We also sorted the index. Pandas can look up labels more efficiently in a sorted index, so this is good practice. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [58]:
# .loc[] : access data for a row. Provide the index label of the row you want to extract. 

bond.loc["Goldfinger"]

# Returns a Series of the row's values; the column names become the index of the Series. 

Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

In [59]:
bond.loc["GoldenEye"]

Year                            1995
Actor                 Pierce Brosnan
Director             Martin Campbell
Box Office                     518.5
Budget                          76.9
Bond Actor Salary                5.1
Name: GoldenEye, dtype: object

In [61]:
# A DataFrame can hold rows with duplicate index labels.  

bond.loc["Casino Royale"]

# Since there are two rows with the index label (Film) as Casino Royale, a DataFrame is returned of the two rows.  

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
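
The difference in return type can be checked directly. The sketch below uses a hand-built three-row frame (the `films` variable is an assumption for illustration, with values taken from the outputs above) to show that a unique label gives a Series while a repeated label gives a DataFrame.

```python
import pandas as pd

# Hand-built stand-in with one repeated index label.
films = pd.DataFrame(
    {"Year": [2006, 1967, 1964],
     "Actor": ["Daniel Craig", "David Niven", "Sean Connery"]},
    index=pd.Index(["Casino Royale", "Casino Royale", "Goldfinger"], name="Film"),
)

single = films.loc["Goldfinger"]        # unique label: one row, returned as a Series
repeated = films.loc["Casino Royale"]   # repeated label: two rows, returned as a DataFrame

print(type(single).__name__)
print(type(repeated).__name__)
print(films.index.is_unique)
```

Checking `films.index.is_unique` beforehand is one way to know which of the two shapes to expect.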


In [70]:
# The .loc[] syntax supports many of the slicing operations available for a regular Python list. 

bond.loc["Diamonds Are Forever": "From Russia with Love"]

# Extracts the five rows starting from Diamonds Are Forever to From Russia With Love, both inclusive. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6


In [72]:
bond.loc["Diamonds Are Forever": "From Russia with Love": 2]

# A step of 2 extracts a row, then jumps two rows forward to extract the next one, so we get every other movie in the slice. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6


In [75]:
bond.loc["GoldenEye": ]

# Extracts rows starting from GoldenEye to the end of the DataFrame. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5


In [79]:
bond.loc[["Die Another Day", "Octopussy"]]

# Extracts more than one row by passing a list of labels. Every label in the list must exist, otherwise a KeyError is raised. 
# To check whether an index label is in the DataFrame: "Gold Bond" in bond.index 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8
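
One possible way to avoid the error is to filter the list of labels before passing it to .loc[]. This is a sketch under stated assumptions: the `films` frame is hand-built from values shown above, and "Gold Bond" is the deliberately missing label from the comment in the previous cell.

```python
import pandas as pd

# Hand-built stand-in; "Gold Bond" is intentionally not in the index.
films = pd.DataFrame(
    {"Year": [2002, 1962, 1983]},
    index=pd.Index(["Die Another Day", "Dr. No", "Octopussy"], name="Film"),
)

wanted = ["Die Another Day", "Gold Bond", "Octopussy"]

# A list that contains a missing label raises a KeyError.
try:
    films.loc[wanted]
    raised = False
except KeyError:
    raised = True

# Keep only the labels that actually exist, then select them.
present = [label for label in wanted if label in films.index]
subset = films.loc[present]

print(raised)
print(present)
```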


Retrieve Rows by Index Position with the .iloc[] Accessor

.iloc[] is short for integer location: it retrieves rows by index position. 

In [81]:
bond = pd.read_csv("jamesbond.csv")
bond.head(3)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [82]:
bond.iloc[0]

# Note that the index position starts at 0, so if you want the 17th film, then input index position of 16. 

Film                        Dr. No
Year                          1962
Actor                 Sean Connery
Director             Terence Young
Box Office                   448.8
Budget                           7
Bond Actor Salary              0.6
Name: 0, dtype: object

In [83]:
bond.iloc[0:2]

# When slicing by index position, only the start point is inclusive; the end point is excluded. This is the key difference between the .loc[] and .iloc[] accessors. 

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
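
The inclusive and exclusive end points can be compared side by side. The sketch below assumes a hand-built three-row frame with the default numeric index, so that the same slice 0:2 can be passed to both accessors.

```python
import pandas as pd

# Hand-built stand-in with the default numeric index (labels 0, 1, 2).
films = pd.DataFrame({
    "Film": ["Dr. No", "From Russia with Love", "Goldfinger"],
    "Year": [1962, 1963, 1964],
})

by_label = films.loc[0:2]      # labels 0, 1 and 2: the end point is included
by_position = films.iloc[0:2]  # positions 0 and 1: the end point is excluded

print(len(by_label), len(by_position))
```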


In [84]:
# Important to note that index positions will still exist even if your index does not consist of the standard pandas numeric index. 
bond.set_index("Film", inplace = True)
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [86]:
# Despite setting Film as the index column, each row still has an index position. 
bond.iloc[[0, 5, 10]]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9


Second Arguments to .loc[] and .iloc[] Accessors

Passing second arguments into the accessors allows us to customize which columns to pull out of the DataFrame when we are extracting rows. 

In [90]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [92]:
# Extract the actor who played in Moonraker.
bond.loc["Moonraker", "Actor"]

'Roger Moore'

In [94]:
# Extract the Year and the Actor of the films GoldenEye and Moonraker. 
bond.loc[["GoldenEye", "Moonraker"], ["Year", "Actor"]]

Unnamed: 0_level_0,Year,Actor
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
GoldenEye,1995,Pierce Brosnan
Moonraker,1979,Roger Moore


In [98]:
# Extract the columns from Director to Budget for the movies from A View to a Kill to GoldenEye.
bond.loc["A View to a Kill": "GoldenEye", "Director": "Budget"]

Unnamed: 0_level_0,Director,Box Office,Budget
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,John Glen,275.2,54.5
Casino Royale,Martin Campbell,581.5,145.3
Casino Royale,Ken Hughes,315.0,85.0
Diamonds Are Forever,Guy Hamilton,442.5,34.7
Die Another Day,Lee Tamahori,465.4,154.2
Dr. No,Terence Young,448.8,7.0
For Your Eyes Only,John Glen,449.4,60.2
From Russia with Love,Terence Young,543.8,12.6
GoldenEye,Martin Campbell,518.5,76.9


In [102]:
bond.iloc[0:4, 0:3]

# Note that when using .iloc[], the end point is not inclusive. 

Unnamed: 0_level_0,Year,Actor,Director
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,1985,Roger Moore,John Glen
Casino Royale,2006,Daniel Craig,Martin Campbell
Casino Royale,1967,David Niven,Ken Hughes
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton
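
To tie the two accessors together, here is a small sketch that selects the same block of cells once by label and once by position. The `films` frame is a hand-built assumption with three rows and three columns taken from the outputs above.

```python
import pandas as pd

# Hand-built stand-in indexed by Film.
films = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964],
        "Actor": ["Sean Connery", "Sean Connery", "Sean Connery"],
        "Director": ["Terence Young", "Terence Young", "Guy Hamilton"],
    },
    index=pd.Index(["Dr. No", "From Russia with Love", "Goldfinger"], name="Film"),
)

# Row label and column label: a single cell.
one_cell = films.loc["Goldfinger", "Director"]

# Label slices include both end points.
by_label = films.loc["Dr. No":"From Russia with Love", "Year":"Actor"]

# Position slices exclude the end point, so 0:2 covers positions 0 and 1.
by_position = films.iloc[0:2, 0:2]

print(one_cell)
print(by_label.equals(by_position))
```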


Set New Value for a Specific Cell

Setting a new value for one or more cells in a given row in our DataFrame. 

In [103]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [106]:
# Change the actor of the film Dr. No to Sir Sean Connery.
bond.loc["Dr. No", "Actor"] = "Sir Sean Connery"
bond.loc["Dr. No"]

Year                             1962
Actor                Sir Sean Connery
Director                Terence Young
Box Office                      448.8
Budget                              7
Bond Actor Salary                 0.6
Name: Dr. No, dtype: object

In [108]:
# Overwriting multiple cell values in a given row.
bond.loc["Dr. No", ["Actor", "Director"]] = ["Sean Connery", "T. Young"]
bond.loc["Dr. No"]

Year                         1962
Actor                Sean Connery
Director                 T. Young
Box Office                  448.8
Budget                          7
Bond Actor Salary             0.6
Name: Dr. No, dtype: object

Set Multiple Values in a DataFrame

In the previous lesson, we changed the actor for a single film from Sean Connery to Sir Sean Connery. 
Now, we want to change Sean Connery to Sir Sean Connery for every occurrence of Sean Connery in the Actor column. 

In [119]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


In [131]:
# WRONG method

bond[bond["Actor"] == "Sean Connery"] = "Sir Sean Connery"
bond

# This is the wrong method: a boolean mask on its own selects whole rows, so every column in the matching rows is overwritten 
# with "Sir Sean Connery", not just the Actor column. 
# Next, we will look at the correct method using .loc[]. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315,85,
Diamonds Are Forever,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery,Sir Sean Connery


In [133]:
# CORRECT method

bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [139]:
# Steps:
# 1) Create a boolean Series that is True where the value in the Actor column is Sean Connery. Assign it to a variable. 
# 2) Pass the variable from 1) to bond.loc[] and add a second argument to select the Actor column. This returns the Actor
#    values for the rows where the actor is Sean Connery. 
# 3) Assign "Sir Sean Connery" to the selection from 2).

actor_is_sean_connery = bond["Actor"] == "Sean Connery" # This variable returns a series of boolean values.  
bond.loc[actor_is_sean_connery] # Returns rows where the actor is Sean Connery. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1


In [140]:
bond.loc[actor_is_sean_connery, "Actor"] = "Sir Sean Connery"
bond
# LHS selects the Actor values in the rows where the actor is Sean Connery. 
# RHS overwrites all of those values with Sir Sean Connery. 
# Since we assign through .loc[], the original bond DataFrame is modified. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sir Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sir Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sir Sean Connery,Guy Hamilton,820.4,18.6,3.2
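
The two approaches can be compared on a small frame. This is a sketch under stated assumptions: `films` is hand-built from values shown above and deliberately holds only text columns, so that overwriting whole rows with a string does not mix data types.

```python
import pandas as pd

# Hand-built stand-in with text columns only.
films = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Pierce Brosnan", "Sean Connery"],
        "Director": ["Terence Young", "Martin Campbell", "Guy Hamilton"],
    },
    index=pd.Index(["Dr. No", "GoldenEye", "Goldfinger"], name="Film"),
)
is_connery = films["Actor"] == "Sean Connery"

# Row mask only: every column in the matching rows is overwritten.
whole_rows = films.copy()
whole_rows[is_connery] = "Sir Sean Connery"

# Row mask plus column label: only the Actor cells are overwritten.
actor_only = films.copy()
actor_only.loc[is_connery, "Actor"] = "Sir Sean Connery"

print(whole_rows.loc["Dr. No", "Director"])
print(actor_only.loc["Dr. No", "Director"])
```

Working on copies keeps `films` intact, which makes it easy to inspect both results next to the original.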


Rename Index Labels or Columns in a DataFrame

In [168]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# .rename() : there are two ways to pass the parameters. 
# Option 1: Use the mapper and axis parameters.
# Option 2: Use the index parameter to change index labels (rows) and the columns parameter to change column names.  

In [150]:
# To change index labels: 
# Option 1 : mapper and axis = 0 (or "rows") to change index labels. 
# mapper parameter: takes a dictionary as its argument. The dictionary can hold multiple key-value pairs. 
# mapper = {"old index label": "new index label"}

bond.rename(mapper = {"GoldenEye": "Golden Eye", "Casino Royale": "CASINO ROYALE"}, axis = 0)

# We changed "GoldenEye" to "Golden Eye" and "Casino Royale" to "CASINO ROYALE". 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
CASINO ROYALE,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
CASINO ROYALE,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Golden Eye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [153]:
# Option 2 to changing index labels. 
# Directly use the index parameter only.

bond.rename(index = {"GoldenEye": "GOLDENEYE", "Casino Royale": "CASINO ROYALE"})

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
CASINO ROYALE,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
CASINO ROYALE,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GOLDENEYE,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [155]:
# Change column names:
# Option 1: Use mapper and axis parameter combination. 
# Set the axis parameter to axis = 1 or axis = "columns". 

bond.rename(mapper = {"Year": "YEAR", "Actor": "ACTOR"}, axis = 1)

# Changed the column names Year to YEAR and Actor to ACTOR. 

Unnamed: 0_level_0,YEAR,ACTOR,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [158]:
# Option 2 to change column names. 
# Use the columns parameter only.

bond.rename(columns = {"Actor": "Actress", "Year": "YEAR"})

# Changed the column names Actor to Actress and Year to YEAR. 

Unnamed: 0_level_0,YEAR,Actress,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
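
The two options are interchangeable, which the following sketch checks on a hand-built two-row frame (the `films` variable is an assumption for illustration). It also shows that .rename() returns a new DataFrame when inplace is not set.

```python
import pandas as pd

# Hand-built stand-in indexed by Film.
films = pd.DataFrame(
    {"Year": [1995, 1964], "Actor": ["Pierce Brosnan", "Sean Connery"]},
    index=pd.Index(["GoldenEye", "Goldfinger"], name="Film"),
)

# Index labels: mapper + axis and the index keyword give the same result.
rows_a = films.rename(mapper={"GoldenEye": "Golden Eye"}, axis=0)
rows_b = films.rename(index={"GoldenEye": "Golden Eye"})

# Column names: mapper + axis and the columns keyword give the same result.
cols_a = films.rename(mapper={"Year": "YEAR"}, axis=1)
cols_b = films.rename(columns={"Year": "YEAR"})

print(rows_a.equals(rows_b), cols_a.equals(cols_b))
print(list(films.index))  # the original labels are untouched
```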


In [170]:
# To change the index label of duplicated index labels: 
# .index.where(): keeps values where the condition is True and replaces them where it is False. 
# cond: condition to select the values on.
# other: replacement used where the condition is False. 

bond.index = bond.index.where(cond = ~bond.index.duplicated(), other = bond.index + "_dp") 
bond

# .duplicated() returns False for the first occurrence of a label and True for every later occurrence.
# .where() only replaces values where the condition is False, so we use ~ to flip True and False
# and make the later occurrences the ones that get replaced. 

# The second Casino Royale now has a "_dp" suffix on its label. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale_dp,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
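
The intermediate mask is easier to follow when it is stored in its own variable. The sketch below assumes a hand-built three-row frame with one repeated label; `is_repeat` is a name introduced here for illustration.

```python
import pandas as pd

# Hand-built stand-in with one repeated index label.
films = pd.DataFrame(
    {"Year": [2006, 1967, 1964]},
    index=pd.Index(["Casino Royale", "Casino Royale", "Goldfinger"], name="Film"),
)

# False for the first occurrence of each label, True for later repeats.
is_repeat = films.index.duplicated()

# Keep labels where ~is_repeat is True; replace the rest with the suffixed label.
films.index = films.index.where(~is_repeat, films.index + "_dp")

print(list(films.index))
```

After the replacement every label is unique, so .loc[] returns a single row for either Casino Royale entry.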


Delete Rows or Columns from a DataFrame

Three ways to delete:
- .drop() : removes rows or columns.
- .pop() : removes a single column (Series) and returns it, so we can assign it to a variable. 
- del : removes a single column (Series) from a DataFrame. 

In [207]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [177]:
# .drop() method. Parameters:
# labels: accepts a single label or a list of labels and removes them from the DataFrame. 
# axis = 0: the default, which means rows. Pandas will look for the labels among the row index labels. Set axis = 1 ("columns") for columns. 
            
# Remove "A View to a Kill" row 
bond.drop(labels = "A View to a Kill")

# If we have a numeric index rather than a string index, we pass in the numeric index label (which equals the position for the default index). 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9


In [174]:
# Remove multiple rows by providing a list of labels. 

bond.drop(labels = ["Die Another Day", "Dr. No"])

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,


In [179]:
# Remove columns: remove the Actor and Director columns.

bond.drop(labels = ["Actor", "Director"], axis = "columns")

Unnamed: 0_level_0,Year,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A View to a Kill,1985,275.2,54.5,9.1
Casino Royale,2006,581.5,145.3,3.3
Casino Royale,1967,315.0,85.0,
Diamonds Are Forever,1971,442.5,34.7,5.8
Die Another Day,2002,465.4,154.2,17.9
Dr. No,1962,448.8,7.0,0.6
For Your Eyes Only,1981,449.4,60.2,
From Russia with Love,1963,543.8,12.6,1.6
GoldenEye,1995,518.5,76.9,5.1
Goldfinger,1964,820.4,18.6,3.2
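
The row and column variants of .drop() can be sketched on a hand-built frame (the `films` variable is an assumption for illustration). Since inplace is not set, each call returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

# Hand-built stand-in indexed by Film.
films = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964],
        "Actor": ["Sean Connery", "Sean Connery", "Sean Connery"],
        "Director": ["Terence Young", "Terence Young", "Guy Hamilton"],
    },
    index=pd.Index(["Dr. No", "From Russia with Love", "Goldfinger"], name="Film"),
)

# axis = 0 is the default, so the labels are looked up among the rows.
fewer_rows = films.drop(labels=["Dr. No", "Goldfinger"])

# axis = "columns" looks the labels up among the column names instead.
fewer_cols = films.drop(labels=["Actor", "Director"], axis="columns")

print(list(fewer_rows.index))
print(list(fewer_cols.columns))
print(films.shape)  # unchanged, because inplace was not set
```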


In [197]:
# To drop the row for the second duplicated label. 
bond = bond[~bond.index.duplicated(keep = "first")]
bond

# Dropped the row of the second Casino Royale. 
# .duplicated(keep = "first") is False for the first occurrence of a label and True for later duplicates, so we use ~ to flip the mask and keep only the first occurrences. 

# Note: bond[bond["Year"] == 2012]: returns rows where bond["Year"] == 2012 (boolean series) is True. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9


In [202]:
# .pop() : another method to remove values. 

actor_removed = bond.pop("Actor")

# Removes the Actor column from our DataFrame and also returns it. We stored the removed column in actor_removed. 

In [203]:
bond

Unnamed: 0_level_0,Year,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,John Glen,275.2,54.5,9.1
Casino Royale,2006,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,John Glen,449.4,60.2,
From Russia with Love,1963,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Guy Hamilton,820.4,18.6,3.2


In [205]:
actor_removed 

Film
A View to a Kill                      Roger Moore
Casino Royale                        Daniel Craig
Casino Royale                         David Niven
Diamonds Are Forever                 Sean Connery
Die Another Day                    Pierce Brosnan
Dr. No                               Sean Connery
For Your Eyes Only                    Roger Moore
From Russia with Love                Sean Connery
GoldenEye                          Pierce Brosnan
Goldfinger                           Sean Connery
Licence to Kill                    Timothy Dalton
Live and Let Die                      Roger Moore
Moonraker                             Roger Moore
Never Say Never Again                Sean Connery
Octopussy                             Roger Moore
On Her Majesty's Secret Service    George Lazenby
Quantum of Solace                    Daniel Craig
Skyfall                              Daniel Craig
Spectre                              Daniel Craig
The Living Daylights               Timothy Da

In [211]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [217]:
# del statement: removes the column in place (del is a Python keyword rather than a pandas method). 
del bond["Year"]
bond.head(3)

Unnamed: 0_level_0,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A View to a Kill,John Glen,275.2,54.5,9.1
Casino Royale,Martin Campbell,581.5,145.3,3.3
Casino Royale,Ken Hughes,315.0,85.0,
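
The three removal approaches in this part (.drop(), .pop() and del) differ mainly in whether they change the DataFrame in place and what they return. The sketch below puts them side by side on a small stand-in DataFrame; `films` and `without_director` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Two-row stand-in with values taken from the first rows of the dataset.
films = pd.DataFrame(
    {
        "Year": [1962, 1963],
        "Actor": ["Sean Connery", "Sean Connery"],
        "Director": ["Terence Young", "Terence Young"],
    },
    index=pd.Index(["Dr. No", "From Russia with Love"], name="Film"),
)

# .drop() returns a new DataFrame; films itself is left unchanged.
without_director = films.drop(labels="Director", axis="columns")

# .pop() removes the column in place and returns it as a series.
actor_removed = films.pop("Actor")

# del removes the column in place and returns nothing.
del films["Year"]

print(list(without_director.columns))
print(list(films.columns))
```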


Create Random Sample

Extract a random sample of either rows or columns from our DataFrame. 

In [218]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [222]:
# Parameters
# n : the number of random rows (or columns) we want to extract. 
# frac : the fraction of the DataFrame we want to extract, used instead of n (e.g. 0.25 for 25% of the rows). 
# axis : the axis to sample from ("index" for random rows, "columns" for random columns). 

In [224]:
# Extract one row at random by default. 

bond.sample()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1


In [223]:
bond.sample(n = 5) # Returns 5 random rows.
bond.sample(frac = 0.25) # Returns a random sample of 25% of the rows in our DataFrame.
bond.sample(n = 2, axis = "columns") # Returns 2 random columns of our DataFrame. Only this last result is displayed below. 

Unnamed: 0_level_0,Year,Director
Film,Unnamed: 1_level_1,Unnamed: 2_level_1
A View to a Kill,1985,John Glen
Casino Royale,2006,Martin Campbell
Casino Royale,1967,Ken Hughes
Diamonds Are Forever,1971,Guy Hamilton
Die Another Day,2002,Lee Tamahori
Dr. No,1962,Terence Young
For Your Eyes Only,1981,John Glen
From Russia with Love,1963,Terence Young
GoldenEye,1995,Martin Campbell
Goldfinger,1964,Guy Hamilton
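
Because .sample() draws at random, each run of the cells above can return different rows. A minimal sketch of making the draw repeatable with the `random_state` parameter of .sample() follows. The `films` DataFrame is a small stand-in built from values in the dataset, and the seed values are arbitrary.

```python
import pandas as pd

# Four-row stand-in built from values that appear in the dataset.
films = pd.DataFrame(
    {
        "Year": [1962, 1963, 1964, 1965],
        "Box Office": [448.8, 543.8, 820.4, 848.1],
    },
    index=pd.Index(
        ["Dr. No", "From Russia with Love", "Goldfinger", "Thunderball"],
        name="Film",
    ),
)

# The same seed produces the same sample on every run.
first_draw = films.sample(n=2, random_state=42)
second_draw = films.sample(n=2, random_state=42)
print(first_draw.equals(second_draw))

# frac = 0.5 of four rows gives two rows.
half = films.sample(frac=0.5, random_state=0)
print(len(half))
```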


The .nsmallest() and .nlargest() Methods

Two convenience methods that extract the rows of a DataFrame containing the smallest or largest values in a specific column. 

In [225]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [233]:
# Parameters
# n : number of rows you want to return.
# columns : name of the column (or list of columns) whose values are ranked. 

bond.sort_values("Box Office") # For comparison: sorts every row by Box Office (this result is not displayed). 
bond.nsmallest(n = 3, columns = "Box Office")

# Returns the three rows with the lowest values in the Box Office column. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6


In [232]:
bond.nlargest(n = 5, columns = "Bond Actor Salary")

# Returns the five rows with the highest values in the Bond Actor Salary column. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1


In [234]:
bond["Box Office"].nlargest(2)

# We can also call these methods directly on a series (no columns argument is needed). 
# Returns a series with the two largest Box Office values, labelled by film. 

Film
Skyfall        943.5
Thunderball    848.1
Name: Box Office, dtype: float64
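
As a rough check of what these methods do, the sketch below compares .nsmallest() and .nlargest() with sorting followed by .head(). The `films` DataFrame is a stand-in holding five Box Office values copied from the outputs above.

```python
import pandas as pd

# Stand-in with Box Office values copied from the outputs above.
films = pd.DataFrame(
    {"Box Office": [250.9, 275.2, 291.5, 943.5, 848.1]},
    index=pd.Index(
        [
            "Licence to Kill",
            "A View to a Kill",
            "On Her Majesty's Secret Service",
            "Skyfall",
            "Thunderball",
        ],
        name="Film",
    ),
)

# The two lowest-grossing films, lowest first.
lowest = films.nsmallest(n=2, columns="Box Office")
via_sort = films.sort_values("Box Office").head(2)
print(lowest.equals(via_sort))

# The two highest-grossing films, highest first.
highest = films.nlargest(n=2, columns="Box Office")
print(list(highest.index))
```

The convenience methods avoid sorting the whole DataFrame when only a few rows are needed.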

Filtering with the where Method

The .where() method also filters a DataFrame, but it returns a slightly different result. 

Rather than removing the rows which do not meet the criteria, .where() keeps the whole DataFrame. The rows which do not meet the criteria are filled with NaN (missing) values instead. 

In [235]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [238]:
sean_connery = bond["Actor"] == "Sean Connery"
bond[sean_connery] # Returns DataFrame only consisting of rows where Actor is Sean Connery. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [241]:
bond.where(sean_connery)

# Returns the whole DataFrame; rows where the actor is not Sean Connery are filled with NaN values. 
# Year is displayed as a float (e.g. 1971.0) because NaN cannot be stored in an integer column. 

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,,,,,,
Casino Royale,,,,,,
Casino Royale,,,,,,
Diamonds Are Forever,1971.0,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,,,,,,
Dr. No,1962.0,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,,,,,,
From Russia with Love,1963.0,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,,,,,,
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [244]:
# Only show values for rows where the actor is Sean Connery and the box office is greater than 800.
con_2 = bond["Box Office"] > 800
bond.where(sean_connery & con_2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,,,,,,
Casino Royale,,,,,,
Casino Royale,,,,,,
Diamonds Are Forever,,,,,,
Die Another Day,,,,,,
Dr. No,,,,,,
For Your Eyes Only,,,,,,
From Russia with Love,,,,,,
GoldenEye,,,,,,
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
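
A short sketch of two follow-ups to .where(): dropping the all-NaN rows again, and replacing the non-matching values with something other than NaN through the `other` parameter. The `films` DataFrame is a three-row stand-in using values from the dataset, and the replacement value 0 is chosen only for illustration.

```python
import pandas as pd

# Three-row stand-in using values that appear in the dataset.
films = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Sean Connery", "Pierce Brosnan"],
        "Box Office": [448.8, 820.4, 518.5],
    },
    index=pd.Index(["Dr. No", "Goldfinger", "GoldenEye"], name="Film"),
)

is_connery = films["Actor"] == "Sean Connery"

# .where() keeps the shape; non-matching rows become NaN.
masked = films.where(is_connery)
print(masked.shape)

# Dropping rows that are entirely NaN recovers the ordinary boolean filter.
filtered = masked.dropna(how="all")
print(list(filtered.index))

# other = ... replaces the non-matching values instead of using NaN.
big_only = films["Box Office"].where(films["Box Office"] > 500, other=0)
print(list(big_only))
```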


Filter a DataFrame with the .query() Method

Two caveats to this method:
- The argument is a string. 
- Column names that contain spaces cannot be written directly in the query string. The approach used here is to replace the spaces in the column names with underscores first. (From pandas 0.25 onwards, such names can also be wrapped in backticks.) 

One advantage of this method is that it can be slightly faster, especially on larger data sets. A second advantage is that the query reads close to plain English. 
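
As an alternative to renaming the columns, pandas 0.25 and later accept backtick-quoted column names inside the query string. The sketch below assumes such a version (the notebook reports 1.1.3) and uses a small stand-in DataFrame named `films`.

```python
import pandas as pd

# Stand-in with a column name that contains a space.
films = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Sean Connery", "Roger Moore"],
        "Box Office": [448.8, 820.4, 275.2],
    },
    index=pd.Index(["Dr. No", "Goldfinger", "A View to a Kill"], name="Film"),
)

# Backticks let the query refer to "Box Office" without renaming it.
high_grossing = films.query('`Box Office` > 400')
print(list(high_grossing.index))

# Backticked and plain column names can be combined in one query.
connery_hits = films.query('Actor == "Sean Connery" and `Box Office` > 800')
print(list(connery_hits.index))
```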

In [5]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [None]:
# First, we add underscores to fill in the spaces in the column names. 

In [14]:
bond.columns = [column_name.replace(" ", "_") for column_name in bond.columns]
bond.head(3)

# column_name is just the loop variable of the list comprehension, so any name (like x) works. 

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [20]:
bond.query('Actor == "Sean Connery" ')

# The quotes around the value must differ from the quotes around the whole query string, e.g. single quotes outside and double quotes inside. Either order works. 

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [21]:
bond.query('Actor == "Roger Moore" and Director == "John Glen" ')

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8


In [24]:
bond.query('Actor in ["Timothy Dalton", "George Lazenby"]')
# Returns rows for movies where the actor is either Timothy Dalton or George Lazenby.

bond.query('Actor not in ["Sean Connery", "Roger Moore"]')
# Returns all of the rows for movies where the actor is not Sean Connery or Roger Moore. 

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Quantum of Solace,2008,Daniel Craig,Marc Forster,514.2,181.4,8.1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Spectre,2015,Daniel Craig,Sam Mendes,726.7,206.3,
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2
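
The list of names does not have to be typed inside the query string. A minimal sketch, assuming a Python list named `excluded` (an illustrative name), shows how .query() can refer to a local variable by prefixing it with @.

```python
import pandas as pd

# Stand-in with one film per actor, using values from the dataset.
films = pd.DataFrame(
    {
        "Actor": ["Sean Connery", "Roger Moore", "Daniel Craig", "Timothy Dalton"],
        "Year": [1962, 1985, 2006, 1989],
    },
    index=pd.Index(
        ["Dr. No", "A View to a Kill", "Casino Royale", "Licence to Kill"],
        name="Film",
    ),
)

# A regular Python list defined outside the query string.
excluded = ["Sean Connery", "Roger Moore"]

# @excluded tells .query() to look up the local variable named excluded.
others = films.query("Actor not in @excluded")
print(list(others["Actor"]))
```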


A Review of the .apply() Method on a Single Column. 

When called on a series, the .apply() method runs a function on every value within that series (similar to a broadcasting operation). It is most useful for custom operations that have no built-in pandas equivalent. 

In [25]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [41]:
# Convert the values in the Box Office, Budget and Bond Actor Salary columns to strings and add MILLION to the end. 

# First, define a function to perform this operation. 
def convert_str_add_m(number):
    return str(number) + "MILLION"

# Then apply this function to every value in the Box Office column using .apply(), and assign the result back to the Box Office column. 
# Note: each time this cell is re-run, MILLION is appended again, which is why the output below shows it three times. 
bond["Box Office"] = bond["Box Office"].apply(convert_str_add_m)

# Repeat the previous step with .apply() for the Budget and Bond Actor Salary columns. 

bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2MILLIONMILLIONMILLION,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5MILLIONMILLIONMILLION,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0MILLIONMILLIONMILLION,85.0,


In [42]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [51]:
# A more elegant way to use .apply() on multiple columns is to loop over a list of column names. Steps:
# 1) Define the function that converts a value to a string and adds MILLION at the end. 
# 2) Assign the list of columns we want to apply the function to, to a variable. 
# 3) Write a for loop that applies the function to every column in the list, so we don't have to apply it one by one. 
def convert_str_add_m(number):
    return str(number) + "MILLION"

columns = ["Box Office", "Budget", "Bond Actor Salary"]

for col in columns:
    bond[col] = bond[col].apply(convert_str_add_m)

bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2MILLION,54.5MILLION,9.1MILLION
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5MILLION,145.3MILLION,3.3MILLION
Casino Royale,1967,David Niven,Ken Hughes,315.0MILLION,85.0MILLION,nanMILLION
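
The last row above shows nanMILLION because str() also converts the missing value. One possible variation, sketched here under the assumption that missing values should stay missing, checks for them with pd.isna() before converting. The function name `convert_skip_missing` and the stand-in series are illustrative.

```python
import pandas as pd

# Stand-in for one numeric column with a missing value (values from the dataset).
salary = pd.Series(
    [9.1, 3.3, float("nan")],
    index=["A View to a Kill", "Casino Royale (2006)", "Casino Royale (1967)"],
    name="Bond Actor Salary",
)

def convert_skip_missing(number):
    # Leave missing values untouched instead of turning them into "nanMILLION".
    if pd.isna(number):
        return number
    return str(number) + "MILLION"

converted = salary.apply(convert_skip_missing)
print(converted)
```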


The .apply() Method with Row Values

In the previous lesson, we looked at how the .apply() method is used on a single column. 

In this lesson, we look at how to combine several values from the same row in a custom function. 

The goal is to iterate over every row of the DataFrame and apply a function that uses the values from multiple cells in that row. 

By default, .apply() on a DataFrame uses axis = 0 (also written "index" or "rows"), which passes each column to the function as a series. 

To pass each row to the function instead, we have to set axis = "columns" (or axis = 1). 
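
A minimal sketch of the difference between the two axis settings, using a small numeric stand-in DataFrame named `numbers` (an illustrative name). With axis = "index" the function receives one column at a time; with axis = "columns" it receives one row at a time.

```python
import pandas as pd

# Two-row numeric stand-in with values from the dataset.
numbers = pd.DataFrame(
    {"Box Office": [448.8, 543.8], "Budget": [7.0, 12.6]},
    index=pd.Index(["Dr. No", "From Russia with Love"], name="Film"),
)

# axis = "index" (the default): one result per column.
per_column = numbers.apply(lambda column: column.max(), axis="index")
print(per_column)

# axis = "columns": one result per row, combining values from that row.
per_row = numbers.apply(lambda row: row["Box Office"] - row["Budget"], axis="columns")
print(per_row)
```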

In [52]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [53]:
# Let's say we want to create a custom column that is based on the values in every single row. 
# We want to rank each movie by assigning one of three classifications - Best, Enjoyable or No Idea. 
# The ranking for each movie is based on a combination of the values in its row. 

In [55]:
# 1) Begin by creating a custom Python function. 
# When we call .apply() with axis = "columns", pandas passes each row to good_movie as a series whose index labels are the
# column names. For the first row, row["Year"] is 1985, row["Actor"] is Roger Moore and so on. This is repeated for every row. 
# The values can also be reached by position (Year is 0, Actor is 1, ...), but looking them up by column label is clearer and
# does not break if the column order changes. 

def good_movie(row):
    
    actor = row["Actor"]    # The value in the Actor column for the row currently being processed. 
    budget = row["Budget"]  # The value in the Budget column for the same row. 
    
    if actor == "Pierce Brosnan":   # Custom criteria for assigning a ranking to each row. 
        return "Best"
    elif actor == "Roger Moore" and budget > 40:
        return "Enjoyable"
    else:
        return "No Idea"


In [58]:
# Parameters for .apply()
# 1) func : the function you want to be applied.
# 2) axis : the axis you want to operate across. In our case, we want to apply the function to each row (horizontally) rather than to each
#              column. So (slightly confusingly) we set axis = "columns". Think of it like this: the function moves across
#              the columns as it reads the data in. Even though we go row by row, we move across the columns to obtain the values
#              that are fed to the function. 

bond.apply(good_movie, axis = "columns")

Film
A View to a Kill                   Enjoyable
Casino Royale                        No Idea
Casino Royale                        No Idea
Diamonds Are Forever                 No Idea
Die Another Day                         Best
Dr. No                               No Idea
For Your Eyes Only                 Enjoyable
From Russia with Love                No Idea
GoldenEye                               Best
Goldfinger                           No Idea
Licence to Kill                      No Idea
Live and Let Die                     No Idea
Moonraker                          Enjoyable
Never Say Never Again                No Idea
Octopussy                          Enjoyable
On Her Majesty's Secret Service      No Idea
Quantum of Solace                    No Idea
Skyfall                              No Idea
Spectre                              No Idea
The Living Daylights                 No Idea
The Man with the Golden Gun          No Idea
The Spy Who Loved Me               Enjoyable
The W
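
The call above only displays the rankings. To keep them as the custom column described at the start of this lesson, the returned series can be assigned to a new column. The sketch below does this on a three-row stand-in DataFrame; the column name "Rating" is a hypothetical choice, not one used in the original notebook.

```python
import pandas as pd

# Three-row stand-in with values from the dataset.
films = pd.DataFrame(
    {
        "Actor": ["Roger Moore", "Pierce Brosnan", "Sean Connery"],
        "Budget": [54.5, 154.2, 7.0],
    },
    index=pd.Index(["A View to a Kill", "Die Another Day", "Dr. No"], name="Film"),
)

def good_movie(row):
    # Each row arrives as a series, so values are looked up by column label.
    actor = row["Actor"]
    budget = row["Budget"]
    if actor == "Pierce Brosnan":
        return "Best"
    elif actor == "Roger Moore" and budget > 40:
        return "Enjoyable"
    else:
        return "No Idea"

# Store the result of the row-wise .apply() as a new column.
films["Rating"] = films.apply(good_movie, axis="columns")
print(films)
```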

The .copy() Method

Creates an exact copy of an existing pandas object, such as a DataFrame or series, and stores it separately in memory. 

Using the .copy() method lets us change a smaller object that was extracted from a larger one (for example, a single column of a DataFrame) without affecting the larger object, so the original DataFrame is preserved. 

In [59]:
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [68]:
# Change the director of A View to a Kill from John Glen to Mr. John Glen. 
# The first method is to assign the Director series to directors and make the change on that series. 
# However, in this pandas version (1.1.3) directors still refers to the data inside bond, so the value is also changed in the
# original bond DataFrame and pandas shows a SettingWithCopyWarning. 

directors = bond["Director"]
directors.head(3)

Film
A View to a Kill      Mr. John Glen
Casino Royale       Martin Campbell
Casino Royale            Ken Hughes
Name: Director, dtype: object

In [67]:
directors["A View to a Kill"] = "Mr. John Glen"
bond.head(3)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  directors["A View to a Kill"] = "Mr. John Glen"


Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,Mr. John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [69]:
# Method 2: change the values in the series without overwriting the original DataFrame. 
# Create a separate copy, instead of using a directors series that is both its own series and a chunk of the original DataFrame. 
# We take the series from our original DataFrame, make an exact copy and work with that copy. This way, whatever changes
# we make to the copied series will not affect the original DataFrame's values. 

bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [70]:
# Step 1) Pull the Director series from bond, create a copy of it and assign the copy to directors. This series no longer references
#            the Director column in the bond DataFrame; it is a brand new, separate series object. 
# Step 2) Make the changes to the directors series. Whatever changes we make to this copied series will not affect the bond DataFrame. 

directors = bond["Director"].copy()

directors["A View to a Kill"] = "Mr. John Glen"

directors.head(3)

Film
A View to a Kill      Mr. John Glen
Casino Royale       Martin Campbell
Casino Royale            Ken Hughes
Name: Director, dtype: object

In [71]:
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
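
The effect of .copy() can also be confirmed with a direct check rather than by reading the two outputs. This is a small sketch on a stand-in DataFrame named `films`, with director names taken from the dataset.

```python
import pandas as pd

# Two-row stand-in with director names from the dataset.
films = pd.DataFrame(
    {"Director": ["John Glen", "Martin Campbell"]},
    index=pd.Index(["A View to a Kill", "GoldenEye"], name="Film"),
)

# Work on an independent copy of the column.
directors = films["Director"].copy()
directors["A View to a Kill"] = "Mr. John Glen"

# The copy changed, the original DataFrame did not.
print(directors["A View to a Kill"])
print(films.loc["A View to a Kill", "Director"])
```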
