A `DataFrame` is a two-dimensional object, analogous to a table with rows and columns.

In [1]:
import pandas as pd

In [2]:
nba = pd.read_csv('pandas/nba.csv')

-----
## Shared attributes and methods between `Series` and `DataFrame`
### `.head(..)` and `.tail(..)` methods

In [3]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [4]:
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


### `.index` attribute

In [5]:
nba.index

RangeIndex(start=0, stop=458, step=1)

### `.values` attribute
Returns the data as a two-dimensional NumPy array
- Every row is returned as an inner array, nested in an outer array that represents the table

In [6]:
nba.values

array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

### `.shape` attribute

In [7]:
nba.shape

(458, 9)

### `.dtypes` attribute
Returns the data types of the columns of the `DataFrame` <br/>
**NOTE:** A `Series` is returned, in which the column names of the `DataFrame` form the index of the `Series`, and the data types form its values.

In [8]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

---
## Attributes and methods available for `DataFrame`

### `.columns` and `.axes` attributes

In [9]:
nba.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [10]:
nba.axes

[RangeIndex(start=0, stop=458, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

### `.info(..)` method
Provides a summary of the `DataFrame`

In [11]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     373 non-null object
Salary      446 non-null float64
dtypes: float64(4), object(5)
memory usage: 23.3+ KB


### `.get_dtype_counts(..)` method
- Returns the number of columns of each data type.
- For example, it might report that 5 columns are `int64` and 3 columns are `float32`
- Note that this information is already part of the `.info(..)` output
- This method was deprecated in pandas 0.25 and removed in pandas 1.0, so the call below only runs on older versions

In [12]:
nba.get_dtype_counts()

float64    4
object     5
dtype: int64
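
The same counts can be derived from the `.dtypes` attribute, which also works on pandas versions that no longer ship `.get_dtype_counts(..)`. A minimal sketch, assuming a small made-up frame (`demo` is hypothetical, not the NBA data):

```python
import pandas as pd

# Hypothetical stand-in for the NBA data, just to have mixed column types.
demo = pd.DataFrame({
    'Name': ['A', 'B'],
    'Team': ['X', 'Y'],
    'Age': [25.0, 27.0],
})

# .dtypes is a Series (column name -> dtype), so value_counts() tallies the dtypes.
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)
```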

---
## Different behavior of the methods shared between `Series` and `DataFrame`

In [13]:
rev = pd.read_csv('pandas/revenue.csv', index_col='Date')
rev.head(3)

Date,New York,Los Angeles,Miami
1/1/16,985,122,499
1/2/16,738,788,534
1/3/16,14,20,933


### Calling `.sum(..)` on a `Series` versus on a `DataFrame`

In [14]:
s = pd.Series([1,2,3,4])
s.sum()

10

In [15]:
rev.sum() # Equivalent to rev.sum(axis=0) or rev.sum(axis='rows')

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

**NOTE:** A `Series` is returned, in which the column names of the `DataFrame` become the index labels of the `Series`; the sum is computed for every column that holds numerical values.

But what if we want one sum per row (adding across the columns) instead of one sum per column? In that case, use the **`axis`** parameter.

In [16]:
rev.sum(axis=1) # Equivalent to rev.sum(axis = 'columns')

Date
1/1/16     1606
1/2/16     2060
1/3/16      967
1/4/16     2519
1/5/16      438
1/6/16     1935
1/7/16     1234
1/8/16     2313
1/9/16     2623
1/10/16     555
dtype: int64
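
To make the two directions concrete, here is a minimal sketch on a tiny made-up revenue table (the figures and the `demo` name are hypothetical):

```python
import pandas as pd

# Hypothetical revenue figures for two days and two cities.
demo = pd.DataFrame(
    {'New York': [10, 20], 'Miami': [1, 2]},
    index=['1/1/16', '1/2/16'],
)

per_city = demo.sum(axis=0)   # collapse the rows: one total per column
per_day = demo.sum(axis=1)    # collapse the columns: one total per row

print(per_city)
print(per_day)
```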

### Selecting a column from the `DataFrame`
The following method of simply writing the column name after the dot
- returns a `Series`
- is NOT the recommended way because it does not work if the column name contains spaces

In [17]:
nba.Name

0                Avery Bradley
1                  Jae Crowder
2                 John Holland
3                  R.J. Hunter
4                Jonas Jerebko
5                 Amir Johnson
6                Jordan Mickey
7                 Kelly Olynyk
8                 Terry Rozier
9                 Marcus Smart
10             Jared Sullinger
11               Isaiah Thomas
12                 Evan Turner
13                 James Young
14                Tyler Zeller
15            Bojan Bogdanovic
16                Markel Brown
17             Wayne Ellington
18     Rondae Hollis-Jefferson
19                Jarrett Jack
20              Sergey Karasev
21             Sean Kilpatrick
22                Shane Larkin
23                 Brook Lopez
24            Chris McCullough
25                 Willie Reed
26             Thomas Robinson
27                  Henry Sims
28                Donald Sloan
29              Thaddeus Young
                ...           
428            Al-Farouq Aminu
429     

#### Using the bracket syntax
Works even if the column names contain spaces

In [18]:
nba['Name'].head(3)

0    Avery Bradley
1      Jae Crowder
2     John Holland
Name: Name, dtype: object
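
A minimal sketch of the difference, assuming a made-up frame whose first column name contains a space:

```python
import pandas as pd

# Hypothetical frame with a space in one column name.
demo = pd.DataFrame({'First Name': ['Ann', 'Bob'], 'Age': [30, 40]})

ages = demo.Age                    # attribute access works for simple names
first_names = demo['First Name']   # bracket syntax is needed here

# Writing demo.First Name is not valid Python syntax, so the dot form cannot be used.
print(first_names.tolist())
```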

### Selecting two or more columns from a `DataFrame`
Another `DataFrame` is returned.

In [19]:
nba[['Salary', 'Name']].head(3) # Note: the columns appear in the same order in which their names are provided.

Unnamed: 0,Salary,Name
0,7730337.0,Avery Bradley
1,6796117.0,Jae Crowder
2,,John Holland


## Adding new columns to a `DataFrame`
### Method 1:

In [20]:
nba['Sport'] = 'Basketball'
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,Basketball


### Method 2: Using the `.insert(..)` method
The `.insert(..)` method modifies the `DataFrame` in place, so there is no need to reassign the result or use an `inplace` parameter.

In [21]:
nba.insert(loc=3,column='sport2', value='Basketball2')
nba.head(3)

Unnamed: 0,Name,Team,Number,sport2,Position,Age,Height,Weight,College,Salary,Sport
0,Avery Bradley,Boston Celtics,0.0,Basketball2,PG,25.0,6-2,180.0,Texas,7730337.0,Basketball
1,Jae Crowder,Boston Celtics,99.0,Basketball2,SF,25.0,6-6,235.0,Marquette,6796117.0,Basketball
2,John Holland,Boston Celtics,30.0,Basketball2,SG,27.0,6-5,205.0,Boston University,,Basketball
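
A minimal sketch on a made-up frame, showing that `.insert(..)` returns `None` and changes the frame it is called on:

```python
import pandas as pd

demo = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Age': [30, 40]})

# insert() places the new column at position `loc` and mutates `demo` directly.
result = demo.insert(loc=1, column='Sport', value='Basketball')

print(result)                 # None: there is nothing to assign back
print(demo.columns.tolist())
```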


---
## Broadcasting operations
Applying an operation to every value in a column with a single call

In [22]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


**Adding 5 to the age of all the players** <br/>
If the value is Null, then the addition will still result in a Null, and not an error.

In [23]:
nba['Age'].add(5).head(5)

0    30.0
1    30.0
2    32.0
3    27.0
4    34.0
Name: Age, dtype: float64

**Similarly, `.sub(..)`, `.mul(..)`, `.div(..)` etc. are also available.**
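
A minimal sketch of these methods on a made-up `Series` with one missing value, showing that they match the plain arithmetic operators and that the null stays null:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing.
ages = pd.Series([25.0, np.nan, 27.0])

plus_five = ages.add(5)    # same result as ages + 5
minus_five = ages.sub(5)   # same result as ages - 5
doubled = ages.mul(2)      # same result as ages * 2
halved = ages.div(2)       # same result as ages / 2

print(plus_five.tolist())
```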

---
## Dropping rows with null values

In [24]:
nba = pd.read_csv('pandas/nba.csv')
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


### `.dropna(..)` method
Calling `.dropna(..)` on a `DataFrame` with no arguments removes every row that has a null in one or more columns.

**NOTE:** The `inplace` parameter can be used to modify the `DataFrame` directly instead of returning a new one.

In [25]:
nba.dropna().tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


#### `.dropna(how='all')`
Drops only those rows that have **ALL** the columns as null.

#### `.dropna(how='any', axis=1)`
To drop the entire COLUMN (instead of row) that has one or more null values.
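
The `how` and `axis` variants can be compared side by side. A minimal sketch, assuming a small made-up frame with one fully empty row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one null, row 2 is entirely null.
demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', np.nan],
    'College': ['Texas', np.nan, np.nan],
    'Age': [30.0, 40.0, np.nan],
})

any_null_rows_dropped = demo.dropna()            # default how='any'
all_null_rows_dropped = demo.dropna(how='all')   # removes only the empty row

# Dropping columns instead of rows: 'College' still has a null, so it goes.
complete_columns = all_null_rows_dropped.dropna(how='any', axis=1)

print(complete_columns)
```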

#### `.dropna(.., subset = )` parameter
**Drop a row if the value in a particular column of that row is null.** <br/>
The below code statement drops any row that has a null in the `Salary` column. It does not matter if any other column has a null value or not.

In [26]:
nba.dropna(subset=['Salary']).head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0


---
## Filling in the null values with `.fillna(..)` method
It is **NOT** recommended to execute the `.fillna(..)` method on the entire `DataFrame`.

For example, if we execute `nba.fillna(0)`, all the `NaN` values from the `Salary` column will be replaced by 0.0 (which is intended). However, it will also replace the `NaN` values in the `College` column by **0**, which is undesirable as it does not make any sense to have a **0** in the `College` column. 

In [27]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [28]:
nba['Salary'] = nba['Salary'].fillna(value=0)  # assign back; in-place fills on a selected column may not update nba in newer pandas
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0


**Similarly looking at the `College` column.**

In [29]:
nba['College'].head(6)

0                Texas
1            Marquette
2    Boston University
3        Georgia State
4                  NaN
5                  NaN
Name: College, dtype: object

In [30]:
nba['College'] = nba['College'].fillna(value='Not available')  # assign the filled column back
nba['College'].head(6)

0                Texas
1            Marquette
2    Boston University
3        Georgia State
4        Not available
5        Not available
Name: College, dtype: object
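
Passing a dictionary to `.fillna(..)` is another way to give each column its own replacement value in a single call. A minimal sketch on made-up rows (the `demo` frame is hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'College': ['Texas', np.nan],
    'Salary': [7730337.0, np.nan],
})

# Keys are column names, values are the per-column replacements.
filled = demo.fillna({'Salary': 0, 'College': 'Not available'})
print(filled)
```

The original `demo` is left untouched because `.fillna(..)` returns a new object unless `inplace=True` is passed.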

## `.astype(..)` method to convert the data type of a `Series`
**NOTE:** To convert to an integer type, the `Series` must **NOT** contain any NULL values.

In [31]:
nba = pd.read_csv('pandas/nba.csv').dropna(how='all')
nba['Salary'] = nba['Salary'].fillna(0)
nba['College'] = nba['College'].fillna('None')

nba.head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0


Checking the current data types of all columns

In [32]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

We can also use the `.info(..)` method, which, in addition to giving the datatypes of the columns, also shows whether any column contains NULLs. <br/> Also note the current memory usage. Since we'll be converting some columns from float to int, we expect a reduction in the overall memory usage of the DataFrame.

In [33]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null float64
dtypes: float64(4), object(5)
memory usage: 26.8+ KB


In [34]:
nba['Salary'] = nba['Salary'].astype('int')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [35]:
nba[['Number', 'Age']] = nba[['Number', 'Age']].astype('int')

**TIP:** Whenever a column has only a small number of unique values, convert it to categorical, as this can reduce memory usage significantly.

For example, if we have columns such as Gender or Month in our DataFrame, consider converting them to categorical.

In [36]:
nba['Position'] = nba['Position'].astype('category')
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null int32
Position    457 non-null category
Age         457 non-null int32
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null int32
dtypes: category(1), float64(1), int32(3), object(4)
memory usage: 20.2+ KB


In [37]:
nba['Team'] = nba['Team'].astype('category')
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null category
Number      457 non-null int32
Position    457 non-null category
Age         457 non-null int32
Height      457 non-null object
Weight      457 non-null float64
College     457 non-null object
Salary      457 non-null int32
dtypes: category(2), float64(1), int32(3), object(3)
memory usage: 19.7+ KB
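
The saving can also be measured on a single column with `.memory_usage(..)`. A minimal sketch, assuming a made-up column with many rows but only three distinct values:

```python
import pandas as pd

# Hypothetical column: 3000 rows, 3 unique values.
positions = pd.Series(['PG', 'SF', 'C'] * 1000)
as_category = positions.astype('category')

# deep=True includes the memory held by the strings themselves.
bytes_before = positions.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)

print(bytes_before, bytes_after)
```

The exact byte counts depend on the pandas version and platform, so only the relative difference is meaningful.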


---
## Sorting a DataFrame
### `.sort_values(..)` method to sort the `DataFrame` by a column

In [38]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### Sort the values in the `DataFrame` by the `Name` column

In [39]:
nba.sort_values(by='Name').head(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6-0,161.0,Oregon,2250000.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6-9,220.0,Arizona,4171680.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6-6,210.0,Kentucky,525093.0
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6-10,237.0,Michigan State,1938840.0


#### Sort the values in the `DataFrame` by the `Age` column in descending order

In [40]:
nba.sort_values(by='Age', ascending=False).head(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
304,Andre Miller,San Antonio Spurs,24.0,PG,40.0,6-3,200.0,Utah,250750.0
400,Kevin Garnett,Minnesota Timberwolves,21.0,PF,40.0,6-11,240.0,,8500000.0
298,Tim Duncan,San Antonio Spurs,21.0,C,40.0,6-11,250.0,Wake Forest,5250000.0
261,Vince Carter,Memphis Grizzlies,15.0,SG,39.0,6-6,220.0,North Carolina,4088019.0


**NOTE: If there are any NaN values in the column that we are sorting the DataFrame by, they are placed at the end by default.**

In [41]:
nba.sort_values('Salary').tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
397,Axel Toupane,Denver Nuggets,6.0,SG,23.0,6-7,210.0,,
409,Greg Smith,Minnesota Timberwolves,4.0,PF,25.0,6-10,250.0,Fresno State,
457,,,,,,,,,


**However, if we want the null values to be at the beginning of the column being sorted, use the `na_position = 'first'` parameter**

In [42]:
nba.sort_values('Salary', na_position='first').tail(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500.0
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000.0


In [43]:
nba.sort_values('Salary', na_position='first').head(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
46,Elton Brand,Philadelphia 76ers,42.0,PF,37.0,6-9,254.0,Duke,


---
### `.sort_values(..)` method to sort the `DataFrame` by multiple columns

In [44]:
nba = pd.read_csv('pandas/nba.csv')
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


**First sort by `Team` and then by `Name`**

In [45]:
nba.sort_values(['Team', 'Name']).head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6-10,245.0,Florida,12000000.0
318,Dennis Schroder,Atlanta Hawks,17.0,PG,22.0,6-1,172.0,,1763400.0
323,Jeff Teague,Atlanta Hawks,0.0,PG,27.0,6-2,186.0,Wake Forest,8000000.0
309,Kent Bazemore,Atlanta Hawks,24.0,SF,26.0,6-5,201.0,Old Dominion,2000000.0
311,Kirk Hinrich,Atlanta Hawks,12.0,SG,35.0,6-4,190.0,Kansas,2854940.0
313,Kris Humphries,Atlanta Hawks,43.0,PF,31.0,6-9,235.0,Minnesota,1000000.0


**NOTE:** Both columns were sorted in ascending order by default.

However, if we want `Team` to be sorted in ascending order and `Name` in descending order...

In [46]:
nba.sort_values(['Team', 'Name'], ascending=[True, False]).head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
320,Thabo Sefolosha,Atlanta Hawks,25.0,SF,32.0,6-7,220.0,,4000000.0
315,Paul Millsap,Atlanta Hawks,4.0,PF,31.0,6-8,246.0,Louisiana Tech,18671659.0
319,Mike Scott,Atlanta Hawks,32.0,PF,27.0,6-8,237.0,Virginia,3333333.0


---
## Ranking values using the `.rank(..)` method
Ranking the rows based on the values in a particular column.

For example, ranking the players by their salaries, that is, the player with the highest salary gets a rank of 1 and so on.

- Reading the CSV
- Dropping the rows that are entirely null
- Converting columns to appropriate datatypes

In [47]:
nba = pd.read_csv('pandas/nba.csv').dropna(how='all')
nba['Salary'] = nba['Salary'].fillna(0).astype('int')

nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [48]:
nba['Salary'].rank(ascending=False).astype(int).head(5)

0     97
1    110
2    452
3    322
4    147
Name: Salary, dtype: int32

**Interpreting the result:** <br/>
Referring to the original DataFrame, the player at index position 0 has a rank of 97 based on our salary criterion. Similarly, the player at index position 1 has a rank of 110, and so on.

To verify, we first save the ranks as another column of the `DataFrame` and then sort the `DataFrame` by `Salary`. We expect the first row of the resulting `DataFrame` to have a salary rank of 1, and so on.

In [49]:
nba['Salary Rank'] = nba['Salary'].rank(ascending=False).astype(int)

In [50]:
nba.sort_values(by='Salary', ascending=False).head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary Rank
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000,1
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500,2
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000,3


**NOTE:** Equal values are assigned the same rank. In the example above, all players with the same salary receive the same rank from the `.rank(..)` method.
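
How tied values are ranked is controlled by the `method` parameter of `.rank(..)`. A minimal sketch on made-up salaries, comparing the default with two alternatives:

```python
import pandas as pd

# Hypothetical salaries with one tie.
salaries = pd.Series([300, 200, 200, 100])

# Default method='average': the two tied values share (2 + 3) / 2 = 2.5.
avg_rank = salaries.rank(ascending=False)
# method='min' gives ties the lowest rank in the group; 'dense' leaves no gaps.
min_rank = salaries.rank(ascending=False, method='min')
dense_rank = salaries.rank(ascending=False, method='dense')

print(avg_rank.tolist(), min_rank.tolist(), dense_rank.tolist())
```

With the default method, a later `.astype(int)` truncates fractional ranks such as 2.5, so `method='min'` may be the safer choice when integer ranks are wanted.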

---
---
# Extracting data from a DataFrame

In [51]:
import pandas as pd

**Reading the `DataFrame`**

In [52]:
df = pd.read_csv('pandas/employees.csv')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,8/6/1993,12:42 PM,97308,6.945,True,Marketing
1,Thomas,Male,3/31/1996,6:53 AM,61933,4.17,True,
2,Maria,Female,4/23/1993,11:17 AM,130590,11.858,False,Finance


**Calling the `.info(..)` method to see the datatypes of columns and change the datatypes if required**

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               855 non-null object
Start Date           1000 non-null object
Last Login Time      1000 non-null object
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    933 non-null object
Team                 957 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 39.1+ KB


In [54]:
df['Gender'] = df['Gender'].astype('category')
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['Last Login Time'] = pd.to_datetime(df['Last Login Time'])
df['Senior Management'] = df['Senior Management'].astype('bool')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
First Name           933 non-null object
Gender               855 non-null category
Start Date           1000 non-null datetime64[ns]
Last Login Time      1000 non-null datetime64[ns]
Salary               1000 non-null int64
Bonus %              1000 non-null float64
Senior Management    1000 non-null bool
Team                 957 non-null object
dtypes: bool(1), category(1), datetime64[ns](2), float64(1), int64(1), object(2)
memory usage: 41.1+ KB


**NOTE:** Since the column `Last Login Time` only has time and no date, converting to datetime will add the computer's current date.

In [55]:
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**However, there is a better way to convert to datetime than the one shown above. While reading the CSV, we can pass the `parse_dates` argument to convert the columns from string/object to datetime type as the `DataFrame` is read.**

In [56]:
employees = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date','Last Login Time'])
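
The effect of `parse_dates` can be checked without the original file. A minimal sketch, assuming a two-row CSV held in memory (the `csv_text` content is made up to mimic `employees.csv`):

```python
import io
import pandas as pd

# Hypothetical two-row CSV held in memory instead of pandas/employees.csv.
csv_text = "First Name,Start Date\nDouglas,8/6/1993\nThomas,3/31/1996\n"

demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['Start Date'])

# 'Start Date' is now a datetime column rather than plain strings.
print(demo.dtypes)
```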

---
## Filtering (or subsetting) data from a DataFrame based on some condition

In [57]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date', 'Last Login Time'])
df[['Gender','Senior Management','Team']] = df[['Gender','Senior Management','Team']].astype('category')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**Let's say we want all the male employees of the company, that is only those rows where the `Gender` column is `Male`**

In [58]:
df[df['Gender'] == 'Male'].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance


**Let's say we want all the employees in senior management, that is, only those rows where the `Senior Management` column is `True`.** <br/>
Since the `Senior Management` column holds boolean values, the desired result can be obtained in a straightforward way, without the equality operator.

However, take note of the datatype of the column. It should be bool and not categorical.

In [59]:
df['Senior Management'].dtype

CategoricalDtype(categories=[False, True], ordered=False)

**Converting to bool**

In [60]:
df['Senior Management'] = df['Senior Management'].astype('bool')

**Now extracting only those from the original `DataFrame` where `Senior Management` is `True`**

In [61]:
df[df['Senior Management']].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance


### Extracting the records using multiple filter conditions
**We can use the `&` and `|` operators to combine different filter conditions** <br/>
For example, extracting those records where `Gender` is `Male` **and** `Senior Management` is `True`

In [62]:
mask1 = df['Gender'] == 'Male'
mask2 = df['Senior Management'] == True
df[mask1 & mask2].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance


Now extracting only those records where the `Senior Management` is `True` **or** `Start Date` is before `1-Jan-2000`

In [63]:
mask1 = df['Senior Management'] == True
mask2 = df['Start Date'] < '2000-01-01'
df[mask1 | mask2].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**These operators can also be combined** <br/>
It is better to wrap the conditions in parentheses in the order we want them to be evaluated, so as to avoid confusion.

In [64]:
mask1 = df['First Name'] == 'Robert'
mask2 = df['Team'] == 'Client Services'
mask3 = df['Start Date'] > '2016-06-01'

df[(mask1 & mask2) | mask3]

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
15,Lillian,Female,2016-06-05,2018-06-08 06:09:00,59414,1.256,False,Product
98,Tina,Female,2016-06-16,2018-06-08 19:47:00,100705,16.961,True,Marketing
387,Robert,Male,1994-10-29,2018-06-08 04:26:00,123294,19.894,False,Client Services
451,Terry,,2016-07-15,2018-06-08 00:29:00,140002,19.49,True,Marketing


---
## The `.isin(..)` method

Let's say we want to extract records of only the employees from `Finance`, `Product` and `Sales` Teams

In [65]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date', 'Last Login Time'])
df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


### Method 1: Create masks based on the desired filter required and then combine them using boolean OR operator

In [66]:
mask1 = df['Team'] == 'Finance'
mask2 = df['Team'] == 'Product'
mask3 = df['Team'] == 'Sales'

df[mask1 | mask2 | mask3].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance
6,Ruby,Female,1987-08-17,2018-06-08 16:20:00,65476,10.012,True,Product
7,,Female,2015-07-20,2018-06-08 10:43:00,45906,11.598,True,Finance
13,Gary,Male,2008-01-27,2018-06-08 23:40:00,109831,5.831,False,Sales


### Method 2: Use `.isin(..)` method

In [67]:
mask = df['Team'].isin(['Finance', 'Sales', 'Product'])
df[mask].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance
6,Ruby,Female,1987-08-17,2018-06-08 16:20:00,65476,10.012,True,Product
7,,Female,2015-07-20,2018-06-08 10:43:00,45906,11.598,True,Finance
13,Gary,Male,2008-01-27,2018-06-08 23:40:00,109831,5.831,False,Sales
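
The mask returned by `.isin(..)` can be inverted with `~` to select everything outside the list. A minimal sketch on a made-up `Team` column that includes one missing value:

```python
import pandas as pd

# Hypothetical teams; the last entry is missing.
demo = pd.DataFrame({'Team': ['Finance', 'Legal', 'Sales', None]})

in_teams = demo['Team'].isin(['Finance', 'Sales'])
not_in_teams = ~in_teams   # ~ flips every True/False in the mask

print(demo[not_in_teams])
```

Note that the missing value is not in the list, so it ends up in the inverted selection.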


---
## The `.isnull(..)`,  `.notnull(..)` methods
Very useful to filter out the null values

In [68]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Last Login Time', 'Start Date'])
df['Gender'] = df['Gender'].astype('category')
df['Senior Management'] = df['Senior Management'].astype('bool')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**Pulling out all the records that have null, that is, `NaN` in the `Team` column**

In [69]:
df[ df['Team'].isnull() ].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
10,Louise,Female,1980-08-12,2018-06-08 09:01:00,63241,15.132,True,
23,,Male,2012-06-14,2018-06-08 16:19:00,125792,5.042,True,


**Only extracting the rows that DO NOT have a null in the `Gender` column**

In [70]:
df[ df['Gender'].notnull() ].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


---
## The `.between(..)` method
**NOTE:** The lower and upper bounds passed to the `.between(..)` method are both included in the results.

In [71]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date', 'Last Login Time'])
df['Gender'] = df['Gender'].astype('category')
df['Senior Management'] = df['Senior Management'].astype('bool')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**Let's say we want all the employees whose salaries lie BETWEEN 60,000 and 70,000 (both inclusive)**

In [72]:
mask = df['Salary'].between(60000,70000)
df[mask].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
6,Ruby,Female,1987-08-17,2018-06-08 16:20:00,65476,10.012,True,Product
10,Louise,Female,1980-08-12,2018-06-08 09:01:00,63241,15.132,True,
20,Lois,,1995-04-22,2018-06-08 19:18:00,64714,4.934,True,Legal
41,Christine,,2015-06-28,2018-06-08 01:08:00,66582,11.308,True,Business Development
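
The inclusive bounds can be checked on a small example. This is a minimal sketch: the `salaries` Series below is made up for illustration and is not taken from `employees.csv`. It shows that `.between(low, high)` gives the same mask as combining `>=` and `<=`, and that strict comparisons can be used when the endpoints should be left out.

```python
import pandas as pd

# Hypothetical salaries, chosen so that two values sit exactly on the bounds
salaries = pd.Series([59999, 60000, 65000, 70000, 70001])

# .between(..) includes both endpoints by default
inclusive_mask = salaries.between(60000, 70000)

# The equivalent mask written with comparison operators
manual_mask = (salaries >= 60000) & (salaries <= 70000)

# Strict comparisons exclude the endpoints
exclusive_mask = (salaries > 60000) & (salaries < 70000)

print(inclusive_mask.tolist())   # [False, True, True, True, False]
print(exclusive_mask.tolist())   # [False, False, True, False, False]
```

`.between(..)` also accepts an `inclusive` argument to change this behavior, but its accepted values differ between pandas versions, so it is worth checking the documentation of the installed version before relying on it.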


**As another example, let's say we want the employees whose `Bonus %` lies between 2% and 5%**

In [73]:
mask = df['Bonus %'].between(2,5)
df[mask].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
20,Lois,,1995-04-22,2018-06-08 19:18:00,64714,4.934,True,Legal
40,Michael,Male,2008-10-10,2018-06-08 11:25:00,99283,2.665,True,Distribution


**We can also use the `.between(..)` method when working with dates**

In [74]:
mask = df['Start Date'].between('1991-01-01', '1992-01-01')
df[mask].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
27,Scott,,1991-07-11,2018-06-08 18:58:00,122367,5.218,False,Legal
75,Bonnie,Female,1991-07-02,2018-06-08 01:27:00,104897,5.118,True,Human Resources
88,Donna,Female,1991-11-27,2018-06-08 13:59:00,64088,6.155,True,Legal


**Employees who last logged in between 8:30AM and 5:00PM**

In [75]:
mask = df['Last Login Time'].between('8:30:00', '17:00:00')
df[mask].head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance
3,Jerry,Male,2005-03-04,2018-06-08 13:00:00,138705,9.34,True,Finance
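
The bounds `'8:30:00'` and `'17:00:00'` carry no date, so the filter above presumably works only because pandas fills in the current date both when parsing the `Last Login Time` column and when parsing the bounds. A sketch of a date-independent alternative, using made-up timestamps rather than the real dataset, compares the time of day expressed as seconds since midnight:

```python
import pandas as pd

# Hypothetical login timestamps spread over different dates
logins = pd.Series(pd.to_datetime([
    '2018-06-08 12:42:00',
    '2018-06-07 06:53:00',
    '2018-06-06 17:00:00',
    '2018-06-05 17:00:01',
]))

# Time of day expressed as seconds since midnight
seconds = logins.dt.hour * 3600 + logins.dt.minute * 60 + logins.dt.second

# 8:30 AM and 5:00 PM as seconds since midnight (both bounds inclusive)
start, end = 8 * 3600 + 30 * 60, 17 * 3600
mask = seconds.between(start, end)

print(mask.tolist())   # [True, False, True, False]
```

Because only the hour, minute and second are used, the result does not depend on the date stored alongside each time.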


----
## The `.duplicated(..)` method
**This method flags duplicate values, which lets us find them or filter them out of our DataFrame**

In [76]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date','Last Login Time'])
df['Gender'] = df['Gender'].astype('category')
df['Senior Management'] = df['Senior Management'].astype('bool')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


**We are going to find duplicates in the `First Name` column**

In [77]:
# First sort the employees DataFrame by `First Name`
df.sort_values('First Name', inplace=True)

# Then flag the duplicated values
df['First Name'].duplicated().head(5)

101    False
327     True
440     True
937     True
137    False
Name: First Name, dtype: bool

See how this method returns `True` if a value is duplicated in the `Series`, else returns `False`.
The first occurrence is returned as `False`; subsequent occurrences are returned as `True`.

By default, the `keep` parameter in the `.duplicated(..)` method is set to `first`. This means that the first occurrence of a value in the Series will be returned as `False`, and subsequent occurrences of the same value are returned as `True`. <br/>
Changing this parameter to `last` marks every occurrence except the last one as a duplicate. A value that occurs only once is never marked as a duplicate.

In [78]:
df['First Name'].duplicated(keep='last').head(5)

101     True
327     True
440     True
937    False
137     True
Name: First Name, dtype: bool

In [79]:
df[ df['First Name'].duplicated(keep='last') ].head(5)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-06-08 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2018-06-08 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2018-06-08 14:53:00,52119,11.343,True,Client Services
137,Adam,Male,2011-05-21,2018-06-08 01:45:00,95327,15.12,False,Distribution
141,Adam,Male,1990-12-24,2018-06-08 20:57:00,110194,14.727,True,Product


**If we want to mark every value that occurs more than once as a duplicate, regardless of whether it occurs first or last, the `keep` parameter has to be set to `False`**

In [80]:
df['First Name'].duplicated(keep=False).head(5)

101    True
327    True
440    True
937    True
137    True
Name: First Name, dtype: bool
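
The three settings of `keep` can be compared side by side on a tiny example. This is only a sketch: the `names` Series is made up for illustration.

```python
import pandas as pd

names = pd.Series(['Aaron', 'Aaron', 'Aaron', 'Adam'])

first = names.duplicated()              # keep='first' is the default
last = names.duplicated(keep='last')
every = names.duplicated(keep=False)

print(first.tolist())   # [False, True, True, False]
print(last.tolist())    # [True, True, False, False]
print(every.tolist())   # [True, True, True, False]

# Negating the keep=False mask leaves only the values that occur exactly once
print(names[~every].tolist())   # ['Adam']
```

Negating the `keep=False` mask with `~` is a convenient way to extract the values that are truly unique in the column.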

---
## `.drop_duplicates(..)` method
Can be called on a `DataFrame` instead of just a `Series`.

In [81]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date','Last Login Time'])
df['Gender'] = df['Gender'].astype('category')
df['Senior Management'] = df['Senior Management'].astype('bool')
df.sort_values('First Name', inplace=True)
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-06-08 10:20:00,61602,11.849,True,Marketing
327,Aaron,Male,1994-01-29,2018-06-08 18:48:00,58755,5.097,True,Marketing
440,Aaron,Male,1990-07-22,2018-06-08 14:53:00,52119,11.343,True,Client Services


In [82]:
len(df)

1000

We currently have 1000 rows in our `DataFrame`.

In [83]:
len(df.drop_duplicates())

1000

Note that even after calling the `drop_duplicates` method, the total number of rows in our DataFrame remains the same, even though there are duplicates in the `First Name`, `Gender`, `Senior Management` and `Team` columns.

This is because, by default, a row is dropped only if its values are duplicated across ALL of the columns, and no two rows in this DataFrame are identical across all of the columns.

In [84]:
df.drop_duplicates(subset=['First Name'], keep='first').head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
101,Aaron,Male,2012-02-17,2018-06-08 10:20:00,61602,11.849,True,Marketing
137,Adam,Male,2011-05-21,2018-06-08 01:45:00,95327,15.12,False,Distribution
300,Alan,Male,1988-06-26,2018-06-08 03:54:00,111786,3.592,True,Engineering


**Just like while calling the `duplicated` method, the value of the `keep` parameter can be set to `last` or a boolean `False`**
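
As a minimal sketch of how `subset` and `keep` interact (the `staff` DataFrame below is made up for illustration), the same call is repeated with the three possible `keep` values:

```python
import pandas as pd

staff = pd.DataFrame({
    'First Name': ['Aaron', 'Aaron', 'Adam', 'Alan'],
    'Team': ['Marketing', 'Sales', 'Product', 'Legal'],
})

keep_first = staff.drop_duplicates(subset=['First Name'], keep='first')
keep_last = staff.drop_duplicates(subset=['First Name'], keep='last')
keep_none = staff.drop_duplicates(subset=['First Name'], keep=False)

print(keep_first['Team'].tolist())   # ['Marketing', 'Product', 'Legal']
print(keep_last['Team'].tolist())    # ['Sales', 'Product', 'Legal']
print(keep_none['Team'].tolist())    # ['Product', 'Legal']
```

With `keep=False`, every row whose `First Name` appears more than once is dropped, so neither `Aaron` row survives.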

----


## The `.unique(..)` and `.nunique(..)` methods
Both the above methods deal with unique values in a `Series`.

In [85]:
df = pd.read_csv('pandas/employees.csv', parse_dates=['Start Date','Last Login Time'])
df['Senior Management'] = df['Senior Management'].astype('bool')
df['Gender'] = df['Gender'].astype('category')
df.head(3)

Unnamed: 0,First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
0,Douglas,Male,1993-08-06,2018-06-08 12:42:00,97308,6.945,True,Marketing
1,Thomas,Male,1996-03-31,2018-06-08 06:53:00,61933,4.17,True,
2,Maria,Female,1993-04-23,2018-06-08 11:17:00,130590,11.858,False,Finance


In [86]:
df['Gender'].head(5)

0      Male
1      Male
2    Female
3      Male
4      Male
Name: Gender, dtype: category
Categories (2, object): [Female, Male]

In [87]:
df['Gender'].unique()

[Male, Female, NaN]
Categories (2, object): [Male, Female]

In [88]:
df['Team'].unique()

array(['Marketing', nan, 'Finance', 'Client Services', 'Legal', 'Product',
       'Engineering', 'Business Development', 'Human Resources', 'Sales',
       'Distribution'], dtype=object)

In [89]:
len(df['Team'].unique())

11

This means that the `Team` column holds 11 unique values, with `NaN` counted as one of them. We can also use the `.nunique()` method instead of calling `len()` on the result of `.unique(..)`.

In [90]:
df['Team'].nunique()

10

**NOTE:** By default, nunique does not count NaN.

In [91]:
df['Team'].nunique(dropna=False)

11
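
The difference between the three counts comes down to how `NaN` is treated. A small sketch with a made-up `teams` Series shows the same pattern:

```python
import numpy as np
import pandas as pd

teams = pd.Series(['Marketing', np.nan, 'Finance', 'Marketing', np.nan])

print(len(teams.unique()))           # 3 -> 'Marketing', NaN, 'Finance'
print(teams.nunique())               # 2 -> NaN is ignored
print(teams.nunique(dropna=False))   # 3 -> NaN is counted once
```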

---
# More on `DataFrames`

In [92]:
import pandas as pd

In [93]:
# Loading and previewing the dataset
bond = pd.read_csv('pandas/jamesbond.csv')
bond.head(3)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


### The `set_index(..)` and `reset_index(..)` methods

In [94]:
bond.set_index('Film').head(4)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


In [95]:
bond.reset_index().head(3) # We can use the inplace parameter

Unnamed: 0,index,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


**Note how we got another column of numeric indices. These can be avoided by using the `drop=True` parameter**

In [96]:
bond.reset_index(drop=True).head(3)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


**If we have one of the columns as index and if we wish to now set another column as index, it is important to first use the reset_index() method. Else, the previous index column is simply overwritten and data is lost.**
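
A minimal sketch of this pitfall, using a made-up two-row `films` DataFrame rather than the real dataset:

```python
import pandas as pd

films = pd.DataFrame({
    'Film': ['Dr. No', 'Goldfinger'],
    'Year': [1962, 1964],
    'Actor': ['Sean Connery', 'Sean Connery'],
})
indexed = films.set_index('Film')

# Setting a new index directly discards the old 'Film' index
lost = indexed.set_index('Year')
print(list(lost.columns))    # ['Actor']

# Resetting first moves 'Film' back into a regular column
kept = indexed.reset_index().set_index('Year')
print(list(kept.columns))    # ['Film', 'Actor']
```

In the first case the film titles are gone from the result; in the second they survive as an ordinary column.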

---
## Retrieving rows by index label using `.loc[]`

In [97]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


When `DataFrames` are very large, sorting the index first can speed up the label-based lookups performed afterwards.

In [98]:
bond.loc['Goldfinger']

Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

As in the above case, since only one `Goldfinger` exists in our `DataFrame`, a `Series` is returned.

All the column names from that particular row of the Dataframe become the indices of the Series and the values of columns from that row become the values in the Series being returned.

If the index name passed to the `loc[..]` does not exist in the DataFrame, a **`KeyError`** is raised.

If the index name passed to the `loc[..]` occurs more than once, a DataFrame is returned.

In [99]:
bond.loc['Casino Royale']

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
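
The three cases described above (a unique label, a repeated label and a missing label) can be reproduced on a small made-up DataFrame. This is only a sketch and does not use the real `bond` data:

```python
import pandas as pd

films = pd.DataFrame(
    {'Year': [2006, 1967, 1964]},
    index=['Casino Royale', 'Casino Royale', 'Goldfinger'],
)

single = films.loc['Goldfinger']        # unique label -> Series
repeated = films.loc['Casino Royale']   # repeated label -> DataFrame

print(type(single).__name__)     # Series
print(type(repeated).__name__)   # DataFrame

# A label that is not in the index raises KeyError
try:
    films.loc['Goldbond']
except KeyError:
    print('Goldbond is not in the index')
```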


**We can also extract a range of index labels using `:` (both endpoints are included)**

In [100]:
bond.loc['Diamonds Are Forever':'Moonraker']

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,1981,Roger Moore,John Glen,449.4,60.2,
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,1995,Pierce Brosnan,Martin Campbell,518.5,76.9,5.1
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
Live and Let Die,1973,Roger Moore,Guy Hamilton,460.3,30.8,
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,


**Extracting rows with only specific index names**

In [101]:
bond.loc[['Moonraker','Octopussy']]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Moonraker,1979,Roger Moore,Lewis Gilbert,535.0,91.5,
Octopussy,1983,Roger Moore,John Glen,373.8,53.9,7.8


**If we provide multiple index names, of which one or more does not exist in the DataFrame, the pandas version used here returns a row of `NaN` for each missing name and emits a warning; newer pandas versions raise a `KeyError` instead** <br/>
In the code below, 'Goldfinger' and 'Licence to Kill' exist. However, 'Goldbond' does not exist.

In [102]:
'Goldbond' in bond.index

False

In [103]:
bond.loc[['Goldfinger','Licence to Kill','Goldbond']]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Licence to Kill,1989.0,Timothy Dalton,John Glen,250.9,56.7,7.9
Goldbond,,,,,,
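
The warning suggests `.reindex()` as the alternative. The following is a sketch under stated assumptions: the `films` DataFrame is made up and has a unique index, which `.reindex(..)` needs in order to work. It shows two ways to handle a list that may contain missing labels:

```python
import pandas as pd

films = pd.DataFrame(
    {'Year': [1964, 1989]},
    index=['Goldfinger', 'Licence to Kill'],
)
wanted = ['Goldfinger', 'Licence to Kill', 'Goldbond']

# .reindex(..) keeps the missing label and fills its row with NaN
with_missing = films.reindex(wanted)
print(with_missing['Year'].isnull().tolist())   # [False, False, True]

# Alternatively, keep only the labels that actually exist before using .loc
present = [label for label in wanted if label in films.index]
only_present = films.loc[present]
print(only_present.index.tolist())   # ['Goldfinger', 'Licence to Kill']
```

Since the `bond` index contains `Casino Royale` twice, filtering the labels first is likely the safer choice there.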


The order in which the values are returned in the DataFrame is exactly the same as they were passed to the `.loc[..]`.

----
## Retrieve row(s) by index position with `iloc[..]`

In [104]:
bond =  pd.read_csv('pandas/jamesbond.csv')
bond.head(3)

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


In [105]:
bond.iloc[2]

Film                   Goldfinger
Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: 2, dtype: object

Since right now the indices are numerical starting from zero, the behavior of `.loc[..]` will be the same as that for `iloc[..]`

In [106]:
bond.loc[2]

Film                   Goldfinger
Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: 2, dtype: object
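
The two indexers only coincide here because the labels happen to equal the positions. A small sketch with a made-up DataFrame whose integer labels start at 10 makes the difference visible:

```python
import pandas as pd

films = pd.DataFrame(
    {'Film': ['Dr. No', 'Goldfinger', 'Thunderball']},
    index=[10, 20, 30],
)

print(films.loc[10, 'Film'])    # Dr. No  (label 10)
print(films.iloc[0, 0])         # Dr. No  (position 0)

# Label slices include the end label, position slices exclude the end position
print(len(films.loc[10:20]))    # 2
print(len(films.iloc[0:1]))     # 1
```

The slicing difference also applies to the default numeric index: `bond.loc[:3]` would return four rows, while `bond.iloc[:3]` returns three.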

**Multiple values using `iloc[..]`**

In [107]:
bond.iloc[[2,4,5]] # NOTE, another DataFrame is returned

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
4,Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
5,You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [108]:
bond.iloc[:3]

Unnamed: 0,Film,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
0,Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
1,From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
2,Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


**Reading the CSV again this time with `Film` column set as index**

In [109]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


**Using the `loc[..]` to obtain the specific row as demonstrated above**

In [110]:
bond.loc['Goldfinger']

Year                         1964
Actor                Sean Connery
Director             Guy Hamilton
Box Office                  820.4
Budget                       18.6
Bond Actor Salary             3.2
Name: Goldfinger, dtype: object

**Interesting** <br/>
Even though we have string indices now in our `bond` DataFrame, we can still use the `.iloc[..]` and obtain the records by passing numerical index values. 

In [111]:
bond.iloc[0]

Year                        1985
Actor                Roger Moore
Director               John Glen
Box Office                 275.2
Budget                      54.5
Bond Actor Salary            9.1
Name: A View to a Kill, dtype: object

In [112]:
bond.iloc[:5]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


---
## The catch-all `.ix[..]` indexer (Deprecated)
Combines the working of `.loc[..]` and `.iloc[..]`. It has been removed from newer pandas versions, so `.loc[..]` or `.iloc[..]` should be preferred.

In [113]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [114]:
bond.ix['GoldenEye']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


Year                            1995
Actor                 Pierce Brosnan
Director             Martin Campbell
Box Office                     518.5
Budget                          76.9
Bond Actor Salary                5.1
Name: GoldenEye, dtype: object

In [115]:
bond.ix[2]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


Year                        1967
Actor                David Niven
Director              Ken Hughes
Box Office                   315
Budget                        85
Bond Actor Salary            NaN
Name: Casino Royale, dtype: object

### Second parameter to the `.loc[..]`, `.iloc[..]` and `.ix[..]` methods

In [116]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [117]:
bond.loc['A View to a Kill', 'Actor']

'Roger Moore'

In [118]:
bond.iloc[1, 2]

'Martin Campbell'

In [119]:
bond.ix[1,'Actor']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


'Daniel Craig'
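
Since `.ix[..]` is deprecated, a lookup that mixes a row position with a column label needs another route. One possible approach, sketched here on a made-up two-row DataFrame, is to translate the column label into a position with `columns.get_loc(..)` and then use `.iloc[..]`:

```python
import pandas as pd

films = pd.DataFrame(
    {'Year': [1985, 2006], 'Actor': ['Roger Moore', 'Daniel Craig']},
    index=['A View to a Kill', 'Casino Royale'],
)

# Row by position, column by label: translate the label into a position
actor_position = films.columns.get_loc('Actor')
print(films.iloc[1, actor_position])   # Daniel Craig
```

Going through `.iloc[..]` avoids ambiguity when the index contains repeated labels, as the `bond` index does.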

---
## Set new values for specific rows or cells

In [120]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [121]:
bond.loc['Dr. No']

Year                          1962
Actor                 Sean Connery
Director             Terence Young
Box Office                   448.8
Budget                           7
Bond Actor Salary              0.6
Name: Dr. No, dtype: object

In [122]:
bond.loc['Dr. No','Actor']

'Sean Connery'

In [123]:
bond.loc['Dr. No','Actor'] = 'Sir Sean Connery'

In [124]:
bond.loc['Dr. No']

Year                             1962
Actor                Sir Sean Connery
Director                Terence Young
Box Office                      448.8
Budget                              7
Bond Actor Salary                 0.6
Name: Dr. No, dtype: object

### Changing multiple values

In [125]:
bond.loc['Dr. No']

Year                             1962
Actor                Sir Sean Connery
Director                Terence Young
Box Office                      448.8
Budget                              7
Bond Actor Salary                 0.6
Name: Dr. No, dtype: object

In [126]:
bond.loc['Dr. No',['Year', 'Budget']] = [1966, 9]

In [127]:
bond.loc['Dr. No']

Year                             1966
Actor                Sir Sean Connery
Director                Terence Young
Box Office                      448.8
Budget                              9
Bond Actor Salary                 0.6
Name: Dr. No, dtype: object

### Set values of a column across multiple rows in a DataFrame

In [128]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [129]:
bond['Actor'].head(3)

Film
A View to a Kill     Roger Moore
Casino Royale       Daniel Craig
Casino Royale        David Niven
Name: Actor, dtype: object

In [130]:
mask = bond['Actor'] == 'Sean Connery'
mask.head(5)

Film
A View to a Kill        False
Casino Royale           False
Casino Royale           False
Diamonds Are Forever     True
Die Another Day         False
Name: Actor, dtype: bool

In [131]:
bond.loc[mask].head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6


In [132]:
bond.loc[mask]['Actor'].head(3)

Film
Diamonds Are Forever     Sean Connery
Dr. No                   Sean Connery
From Russia with Love    Sean Connery
Name: Actor, dtype: object

In [133]:
bond.loc[mask, 'Actor'] = 'Sir Sean Connery'

In [134]:
bond.head(5)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sir Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
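
Note that the assignment goes through a single `.loc[mask, 'Actor']` call. The chained form `bond.loc[mask]['Actor'] = ...` is not reliable, because the first step may produce a temporary copy that is then modified instead of the original. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

films = pd.DataFrame({
    'Actor': ['Roger Moore', 'Sean Connery', 'Sean Connery'],
    'Year': [1985, 1971, 1962],
})

mask = films['Actor'] == 'Sean Connery'

# One .loc call with a row mask and a column label updates the original frame
films.loc[mask, 'Actor'] = 'Sir Sean Connery'
print(films['Actor'].tolist())
# ['Roger Moore', 'Sir Sean Connery', 'Sir Sean Connery']
```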


---
## Renaming index values or column values in a DataFrame

In [135]:
bond = pd.read_csv('pandas/jamesbond.csv',index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


### `.rename(..)` method to rename column headers
The rename method requires a Python dictionary to be passed to the `columns` parameter, where each key is an old name and the corresponding value is the new name to be assigned to that particular column.

In [136]:
bond.rename(columns={'Year':'Release Date', 'Box Office':'Revenue'}, inplace=True)
bond.head(2)

Unnamed: 0_level_0,Release Date,Actor,Director,Revenue,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


### `.rename(..)` method to rename index labels

In [137]:
bond.rename(index = {'Dr. No': 'Doctor No',
                     'GoldenEye': 'Golden Eye',
                     'The World Is Not Enough': 'Best Bond Movie Ever'}, inplace=True)

In [138]:
bond.loc['Best Bond Movie Ever']

Release Date                   1999
Actor                Pierce Brosnan
Director              Michael Apted
Revenue                       439.5
Budget                        158.3
Bond Actor Salary              13.5
Name: Best Bond Movie Ever, dtype: object

**We can also simply assign values to the `.columns` attribute of the DataFrame. However in this case, ALL of the column names will have to be provided as a Python List in their order of occurrence in the DataFrame.**

In [139]:
bond.columns

Index(['Release Date', 'Actor', 'Director', 'Revenue', 'Budget',
       'Bond Actor Salary'],
      dtype='object')
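
A short sketch of such an assignment, using a made-up one-row DataFrame with three columns:

```python
import pandas as pd

films = pd.DataFrame({'Year': [1962], 'Box Office': [448.8], 'Budget': [7.0]})

# Every column needs a name, in the existing left-to-right order
films.columns = ['Release Date', 'Revenue', 'Budget']
print(films.columns.tolist())   # ['Release Date', 'Revenue', 'Budget']
```

Passing a list whose length does not match the number of columns raises a `ValueError`, so unchanged names such as `Budget` still have to be repeated.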

---
# Deleting rows or columns from a DataFrame

In [140]:
bond = pd.read_csv('pandas/jamesbond.csv',index_col='Film')
bond.sort_index(inplace=True)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


## Removing rows using the `.drop(..)` method
- A new DataFrame is returned
- `inplace` parameter is also available

In [141]:
result = bond.drop(labels='A View to a Kill')
result.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8


**To delete multiple rows, a Python list of labels can be passed to the `.drop(..)` method**

In [142]:
bond.tail(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Tomorrow Never Dies,1997,Pierce Brosnan,Roger Spottiswoode,463.2,133.9,10.0
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


In [143]:
result = bond.drop(labels= ['You Only Live Twice', 'Tomorrow Never Dies'])
result.tail(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
The Spy Who Loved Me,1977,Roger Moore,Lewis Gilbert,533.0,45.1,
The World Is Not Enough,1999,Pierce Brosnan,Michael Apted,439.5,158.3,13.5
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


**If there are multiple occurrences of an index label, and `.drop(..)` method is called on that label, ALL the occurrences of that will be removed from the DataFrame.**

In [144]:
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


In [145]:
result = bond.drop(labels='Casino Royale')
result.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9


## Removing columns using the `.drop(..)` method
### Method 1: passing the column name to the `columns` parameter

In [146]:
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
Casino Royale,1967,David Niven,Ken Hughes,315.0,85.0,


**Removing the `Box Office` column**

In [147]:
result = bond.drop(columns='Box Office')
result.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,Roger Moore,John Glen,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,145.3,3.3


### Method 2: passing numerical value to the `axis` parameter
**`axis = 1` signifies columns**

In [148]:
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [149]:
result = bond.drop(labels='Box Office', axis=1)
result.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,Roger Moore,John Glen,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,145.3,3.3


### Method 3: passing `'columns'` to the `axis` parameter

In [150]:
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [151]:
result = bond.drop(labels='Box Office', axis='columns')
result.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,Roger Moore,John Glen,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,145.3,3.3


**To remove multiple columns, a Python List can be passed to the `drop(..)` method, with `axis = 1` parameter**
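
A minimal sketch of dropping several columns at once, on a made-up DataFrame; both spellings give the same result:

```python
import pandas as pd

films = pd.DataFrame({
    'Year': [1985, 2006],
    'Box Office': [275.2, 581.5],
    'Budget': [54.5, 145.3],
})

by_axis = films.drop(labels=['Box Office', 'Budget'], axis=1)
by_columns = films.drop(columns=['Box Office', 'Budget'])

print(by_axis.columns.tolist())      # ['Year']
print(by_columns.columns.tolist())   # ['Year']
```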

## Removing columns using the `.pop(..)` method
- This operation is permanent, hence, no need to use the `inplace = True` parameter.
- The popped out column is returned as a Series.

In [152]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [153]:
actor = bond.pop('Actor')

In [154]:
bond.head(2)

Unnamed: 0_level_0,Year,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,John Glen,275.2,54.5,9.1
Casino Royale,2006,Martin Campbell,581.5,145.3,3.3


In [155]:
actor

Film
A View to a Kill                      Roger Moore
Casino Royale                        Daniel Craig
Casino Royale                         David Niven
Diamonds Are Forever                 Sean Connery
Die Another Day                    Pierce Brosnan
Dr. No                               Sean Connery
For Your Eyes Only                    Roger Moore
From Russia with Love                Sean Connery
GoldenEye                          Pierce Brosnan
Goldfinger                           Sean Connery
Licence to Kill                    Timothy Dalton
Live and Let Die                      Roger Moore
Moonraker                             Roger Moore
Never Say Never Again                Sean Connery
Octopussy                             Roger Moore
On Her Majesty's Secret Service    George Lazenby
Quantum of Solace                    Daniel Craig
Skyfall                              Daniel Craig
Spectre                              Daniel Craig
The Living Daylights               Timothy Da

## Python `del` operator to delete columns

In [156]:
bond.head(2)

Unnamed: 0_level_0,Year,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A View to a Kill,1985,John Glen,275.2,54.5,9.1
Casino Royale,2006,Martin Campbell,581.5,145.3,3.3


In [157]:
del bond['Director']
bond.head(2)

Unnamed: 0_level_0,Year,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A View to a Kill,1985,275.2,54.5,9.1
Casino Royale,2006,581.5,145.3,3.3


---
# Creating a random sample from a DataFrame

In [158]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


### `.sample(..)` method by default returns one row from the DataFrame at random

In [159]:
bond.sample()

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6


#### Multiple random rows using the `n` parameter

In [160]:
bond.sample(n=3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Die Another Day,2002,Pierce Brosnan,Lee Tamahori,465.4,154.2,17.9
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


#### Multiple random rows using the `frac` parameter
Passing `frac=0.1` returns 10% of the rows in the DataFrame, rounded to a whole number of rows.

In [161]:
bond.sample(frac=0.1)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


### Random columns across all labels/indices

In [162]:
result = bond.sample(n=3, axis=1)
result.head(3)

Unnamed: 0_level_0,Year,Actor,Box Office
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,1985,Roger Moore,275.2
Casino Royale,2006,Daniel Craig,581.5
Casino Royale,1967,David Niven,315.0


In [163]:
result = bond.sample(n=3, axis='columns')
result.head(3)

Unnamed: 0_level_0,Box Office,Actor,Year
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A View to a Kill,275.2,Roger Moore,1985
Casino Royale,581.5,Daniel Craig,2006
Casino Royale,315.0,David Niven,1967
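
Because `.sample(..)` is random, the outputs above will differ from run to run. A minimal sketch of making a sample repeatable with the `random_state` parameter follows; the small inline DataFrame (a hand-copied subset of the rows shown in this notebook) and the seed value `42` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hand-copied subset of the bond data; the notebook itself loads the full CSV.
bond = pd.DataFrame(
    {
        'Year': [1985, 1962, 1964, 2012, 1965],
        'Actor': ['Roger Moore', 'Sean Connery', 'Sean Connery',
                  'Daniel Craig', 'Sean Connery'],
    },
    index=pd.Index(
        ['A View to a Kill', 'Dr. No', 'Goldfinger', 'Skyfall', 'Thunderball'],
        name='Film',
    ),
)

# The same seed yields the same rows on every call.
first = bond.sample(n=2, random_state=42)
second = bond.sample(n=2, random_state=42)
print(first.equals(second))
```
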


---
# `.nsmallest(..)` and `.nlargest(..)` methods
These methods extract the rows of a DataFrame that contain the smallest/largest values in a particular column.

In [164]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


**Extracting the 3 rows with the largest Box Office**

**METHOD-1**

In [165]:
bond.sort_values('Box Office', ascending=False).head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


**METHOD-2**

In [166]:
bond.nlargest(n=3,columns='Box Office')

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Skyfall,2012,Daniel Craig,Sam Mendes,943.5,170.2,14.5
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2


### The `.nsmallest(..)` method works in a similar way

In [167]:
bond.nsmallest(n=2, columns='Bond Actor Salary')

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
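
The two films above tie at a salary of 0.6. A minimal sketch of how the `keep` parameter treats such ties follows, using a hand-copied subset of the rows from this notebook (the inline data is an assumption made to keep the example self-contained).

```python
import pandas as pd

# Hand-copied subset of the bond data; two films share the lowest salary.
bond = pd.DataFrame(
    {
        'Actor': ['Sean Connery', 'George Lazenby', 'Sean Connery', 'Sean Connery'],
        'Bond Actor Salary': [0.6, 0.6, 1.6, 3.2],
    },
    index=pd.Index(
        ['Dr. No', "On Her Majesty's Secret Service",
         'From Russia with Love', 'Goldfinger'],
        name='Film',
    ),
)

# Default keep='first': exactly n rows, the first tied row wins.
lowest = bond.nsmallest(n=1, columns='Bond Actor Salary')
print(lowest.index.tolist())

# keep='all': every row tied with the n-th value is kept, so more than n rows may come back.
tied = bond.nsmallest(n=1, columns='Bond Actor Salary', keep='all')
print(len(tied))
```
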


---
# Filtering with the `.where(..)` method
`.where(..)` returns a DataFrame with the same shape as the original. Rows that match the condition keep their values; rows that do not match are still present, but every value in them is replaced by a null (`NaN`).

In [168]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


**Filtering where the `Actor` column is equal to `Sean Connery`**

**METHOD-1**

In [169]:
mask = bond['Actor'] == 'Sean Connery'
bond[mask]

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


**METHOD-2**

In [170]:
bond.where(mask)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,,,,,,
Casino Royale,,,,,,
Casino Royale,,,,,,
Diamonds Are Forever,1971.0,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Die Another Day,,,,,,
Dr. No,1962.0,Sean Connery,Terence Young,448.8,7.0,0.6
For Your Eyes Only,,,,,,
From Russia with Love,1963.0,Sean Connery,Terence Young,543.8,12.6,1.6
GoldenEye,,,,,,
Goldfinger,1964.0,Sean Connery,Guy Hamilton,820.4,18.6,3.2
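
To get from the `.where(..)` result back to only the matching rows, the all-null rows can be dropped afterwards. The sketch below assumes a hand-copied subset of the bond data and chains `.dropna(how='all')`, which is one possible follow-up step and not something the notebook itself does.

```python
import pandas as pd

# Hand-copied subset of the bond data; the notebook itself loads the full CSV.
bond = pd.DataFrame(
    {
        'Year': [1985, 1962, 1964],
        'Actor': ['Roger Moore', 'Sean Connery', 'Sean Connery'],
        'Box Office': [275.2, 448.8, 820.4],
    },
    index=pd.Index(['A View to a Kill', 'Dr. No', 'Goldfinger'], name='Film'),
)

mask = bond['Actor'] == 'Sean Connery'

# Same shape as bond; non-matching rows are filled with NaN.
masked = bond.where(mask)
print(masked.shape == bond.shape)

# Dropping rows that are entirely null leaves only the matching films.
filtered = masked.dropna(how='all')
print(filtered.index.tolist())
```

Note that integer columns such as `Year` come back as floats (`1964.0`) after `.where(..)`, as in the output above, because the column has to hold `NaN`.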


---
# Filtering data using `.query(..)` method
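- The argument to the `.query(..)` method must be a string containing the filter expression
- Column names that contain spaces cannot be written directly in the expression; either rename the columns first (as below) or, in pandas 0.25 and later, wrap the name in backticks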

In [171]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


**Changing the column names that have spaces in them**

In [172]:
bond.rename(columns={'Box Office':'Box_Office', 'Bond Actor Salary':'Bond_Actor_Salary'}, inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3
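
As an alternative to renaming, pandas 0.25 and later accept backtick-quoted column names inside the query string. A minimal sketch, assuming a reasonably recent pandas and a hand-copied subset of the bond data with the original column names:

```python
import pandas as pd

# Hand-copied subset of the bond data, keeping the column name with a space.
bond = pd.DataFrame(
    {
        'Actor': ['Roger Moore', 'Daniel Craig', 'Sean Connery'],
        'Box Office': [275.2, 581.5, 848.1],
    },
    index=pd.Index(['A View to a Kill', 'Casino Royale', 'Thunderball'], name='Film'),
)

# Backticks let the expression refer to a column whose name contains a space.
big_hits = bond.query('`Box Office` > 500')
print(big_hits.index.tolist())
```
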


**Using the `.query(..)` method**

In [173]:
bond.query('Actor == "Sean Connery"')

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Diamonds Are Forever,1971,Sean Connery,Guy Hamilton,442.5,34.7,5.8
Dr. No,1962,Sean Connery,Terence Young,448.8,7.0,0.6
From Russia with Love,1963,Sean Connery,Terence Young,543.8,12.6,1.6
Goldfinger,1964,Sean Connery,Guy Hamilton,820.4,18.6,3.2
Never Say Never Again,1983,Sean Connery,Irvin Kershner,380.0,86.0,
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7
You Only Live Twice,1967,Sean Connery,Lewis Gilbert,514.2,59.9,4.4


**When combining multiple conditions, the keywords `and` and `or` can be used in place of `&` and `|` respectively**

In [174]:
bond.query('Director == "Terence Young" and Box_Office > 600')

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thunderball,1965,Sean Connery,Terence Young,848.1,41.9,4.7


**We can also use the `in` and `not in` keywords**

In [175]:
bond.query('Actor in ["Timothy Dalton", "George Lazenby"]')

Unnamed: 0_level_0,Year,Actor,Director,Box_Office,Budget,Bond_Actor_Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Licence to Kill,1989,Timothy Dalton,John Glen,250.9,56.7,7.9
On Her Majesty's Secret Service,1969,George Lazenby,Peter R. Hunt,291.5,37.3,0.6
The Living Daylights,1987,Timothy Dalton,John Glen,313.5,68.8,5.2
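
A short sketch of the `not in` form, combined with the `@` prefix that lets a query string refer to a Python variable. The variable name `excluded` and the inline data (a hand-copied subset of the rows shown in this notebook) are assumptions for illustration.

```python
import pandas as pd

# Hand-copied subset of the bond data; the notebook itself loads the full CSV.
bond = pd.DataFrame(
    {
        'Actor': ['Timothy Dalton', 'George Lazenby', 'Sean Connery', 'Roger Moore'],
        'Year': [1989, 1969, 1962, 1985],
    },
    index=pd.Index(
        ['Licence to Kill', "On Her Majesty's Secret Service",
         'Dr. No', 'A View to a Kill'],
        name='Film',
    ),
)

# Hypothetical list of actors to leave out.
excluded = ['Sean Connery', 'Roger Moore']

# @excluded refers to the Python variable defined above.
others = bond.query('Actor not in @excluded')
print(others.index.tolist())
```
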


---
# `.apply(..)` method on a single column
This is the same as calling the `.apply(..)` method on a `Series`

In [176]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [177]:
def convert_to_string_and_add_millions(num):
    return str(num) + ' MILLIONS!'

In [178]:
bond['Box Office']

Film
A View to a Kill                   275.2
Casino Royale                      581.5
Casino Royale                      315.0
Diamonds Are Forever               442.5
Die Another Day                    465.4
Dr. No                             448.8
For Your Eyes Only                 449.4
From Russia with Love              543.8
GoldenEye                          518.5
Goldfinger                         820.4
Licence to Kill                    250.9
Live and Let Die                   460.3
Moonraker                          535.0
Never Say Never Again              380.0
Octopussy                          373.8
On Her Majesty's Secret Service    291.5
Quantum of Solace                  514.2
Skyfall                            943.5
Spectre                            726.7
The Living Daylights               313.5
The Man with the Golden Gun        334.0
The Spy Who Loved Me               533.0
The World Is Not Enough            439.5
Thunderball                        848.1
Tomorrow Ne

In [179]:
bond['Box Office'].apply(convert_to_string_and_add_millions)

Film
A View to a Kill                   275.2 MILLIONS!
Casino Royale                      581.5 MILLIONS!
Casino Royale                      315.0 MILLIONS!
Diamonds Are Forever               442.5 MILLIONS!
Die Another Day                    465.4 MILLIONS!
Dr. No                             448.8 MILLIONS!
For Your Eyes Only                 449.4 MILLIONS!
From Russia with Love              543.8 MILLIONS!
GoldenEye                          518.5 MILLIONS!
Goldfinger                         820.4 MILLIONS!
Licence to Kill                    250.9 MILLIONS!
Live and Let Die                   460.3 MILLIONS!
Moonraker                          535.0 MILLIONS!
Never Say Never Again              380.0 MILLIONS!
Octopussy                          373.8 MILLIONS!
On Her Majesty's Secret Service    291.5 MILLIONS!
Quantum of Solace                  514.2 MILLIONS!
Skyfall                            943.5 MILLIONS!
Spectre                            726.7 MILLIONS!
The Living Daylights      

### `.apply(..)` method on multiple columns

In [180]:
columns = ['Box Office', 'Budget', 'Bond Actor Salary']
for col in columns:
    bond[col] = bond[col].apply(convert_to_string_and_add_millions)
bond.head(3)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2 MILLIONS!,54.5 MILLIONS!,9.1 MILLIONS!
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5 MILLIONS!,145.3 MILLIONS!,3.3 MILLIONS!
Casino Royale,1967,David Niven,Ken Hughes,315.0 MILLIONS!,85.0 MILLIONS!,nan MILLIONS!
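
The loop above can also be written as a single expression by calling `.apply(..)` on the sub-DataFrame, which hands each selected column to a function as a Series. This is a sketch of one alternative, not the notebook's own approach, and the inline two-row DataFrame is an assumption used to keep it self-contained.

```python
import pandas as pd

def convert_to_string_and_add_millions(num):
    return str(num) + ' MILLIONS!'

# Two rows copied from the notebook output.
bond = pd.DataFrame(
    {
        'Year': [1985, 2006],
        'Box Office': [275.2, 581.5],
        'Budget': [54.5, 145.3],
        'Bond Actor Salary': [9.1, 3.3],
    },
    index=pd.Index(['A View to a Kill', 'Casino Royale'], name='Film'),
)

columns = ['Box Office', 'Budget', 'Bond Actor Salary']

# The outer apply receives each column; the inner apply transforms each value.
formatted = bond[columns].apply(
    lambda col: col.apply(convert_to_string_and_add_millions)
)
print(formatted.loc['Casino Royale', 'Budget'])
```

Unlike the loop, this leaves `bond` untouched and returns the converted columns as a new DataFrame.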


## `.apply(..)` method on row values
Assigning a category to each movie (row of the `bond` DataFrame)

In [181]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [182]:
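def good_movie(row):
    # Each row arrives as a Series indexed by column name, so look values up
    # by label rather than by integer position.
    actor = row['Actor']
    budget = row['Budget']

    if actor == 'Pierce Brosnan':
        return 'The Best'
    elif actor == 'Roger Moore' and budget > 40:
        return 'Enjoyable'
    else:
        return 'I have no clue'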

In [183]:
bond.apply(good_movie, axis='columns')

Film
A View to a Kill                        Enjoyable
Casino Royale                      I have no clue
Casino Royale                      I have no clue
Diamonds Are Forever               I have no clue
Die Another Day                          The Best
Dr. No                             I have no clue
For Your Eyes Only                      Enjoyable
From Russia with Love              I have no clue
GoldenEye                                The Best
Goldfinger                         I have no clue
Licence to Kill                    I have no clue
Live and Let Die                   I have no clue
Moonraker                               Enjoyable
Never Say Never Again              I have no clue
Octopussy                               Enjoyable
On Her Majesty's Secret Service    I have no clue
Quantum of Solace                  I have no clue
Skyfall                            I have no clue
Spectre                            I have no clue
The Living Daylights               I have no 
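
For larger DataFrames, the same categories can be computed column-wise without calling a Python function per row. The sketch below swaps in NumPy's `np.select`, which is a different technique from the row-wise `.apply(..)` shown above; the inline data is a hand-copied subset of the rows in this notebook.

```python
import numpy as np
import pandas as pd

# Hand-copied subset of the bond data; the notebook itself loads the full CSV.
bond = pd.DataFrame(
    {
        'Actor': ['Roger Moore', 'Daniel Craig', 'Pierce Brosnan', 'Sean Connery'],
        'Budget': [54.5, 145.3, 154.2, 7.0],
    },
    index=pd.Index(
        ['A View to a Kill', 'Casino Royale', 'Die Another Day', 'Dr. No'],
        name='Film',
    ),
)

# Conditions are checked in order; the first one that is True picks the label.
conditions = [
    (bond['Actor'] == 'Pierce Brosnan').to_numpy(),
    ((bond['Actor'] == 'Roger Moore') & (bond['Budget'] > 40)).to_numpy(),
]
choices = ['The Best', 'Enjoyable']

category = pd.Series(
    np.select(conditions, choices, default='I have no clue'),
    index=bond.index,
)
print(category.tolist())
```
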

---
# The `.copy(..)` method
This method creates a copy of an existing Pandas object such as a Series or a DataFrame.

In [184]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [185]:
directors = bond['Director']
directors.head(3)

Film
A View to a Kill          John Glen
Casino Royale       Martin Campbell
Casino Royale            Ken Hughes
Name: Director, dtype: object

In [186]:
directors['A View to a Kill']

'John Glen'

In [187]:
directors['A View to a Kill'] = 'Mister John Glen'

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [188]:
directors.head(3)

Film
A View to a Kill    Mister John Glen
Casino Royale        Martin Campbell
Casino Royale             Ken Hughes
Name: Director, dtype: object

In [189]:
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,Mister John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


**The original DataFrame has been changed, because `directors` shares its data with `bond` rather than being an independent copy.**
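
When the intent really is to change the original DataFrame, assigning through `.loc` on the DataFrame itself states that intent explicitly and does not go through an intermediate Series. A minimal sketch, assuming a two-row inline copy of the bond data with unique index labels:

```python
import pandas as pd

# Two rows copied from the notebook output.
bond = pd.DataFrame(
    {
        'Actor': ['Roger Moore', 'Daniel Craig'],
        'Director': ['John Glen', 'Martin Campbell'],
    },
    index=pd.Index(['A View to a Kill', 'Casino Royale'], name='Film'),
)

# Select the row label and column label in one step, then assign.
bond.loc['A View to a Kill', 'Director'] = 'Mister John Glen'
print(bond.loc['A View to a Kill', 'Director'])
```
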

In [190]:
bond = pd.read_csv('pandas/jamesbond.csv', index_col='Film')
bond.sort_index(inplace=True)
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


In [191]:
directors = bond['Director'].copy()
directors.head(3)

Film
A View to a Kill          John Glen
Casino Royale       Martin Campbell
Casino Royale            Ken Hughes
Name: Director, dtype: object

In [192]:
directors['A View to a Kill']

'John Glen'

In [193]:
directors['A View to a Kill'] = 'Mister John Glen'
directors.head(3)

Film
A View to a Kill    Mister John Glen
Casino Royale        Martin Campbell
Casino Royale             Ken Hughes
Name: Director, dtype: object

In [194]:
bond.head(2)

Unnamed: 0_level_0,Year,Actor,Director,Box Office,Budget,Bond Actor Salary
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A View to a Kill,1985,Roger Moore,John Glen,275.2,54.5,9.1
Casino Royale,2006,Daniel Craig,Martin Campbell,581.5,145.3,3.3


**NOTE: The original DataFrame has not been modified this time**