# C. Pandas DataFrame

A pandas DataFrame is a two-dimensional data structure with two axes (rows and columns), and you will often need to process data along one specific axis. Pandas is useful precisely because it provides tools for this, so it is worth becoming comfortable with axis-wise data processing.<br>
To build that confidence, we'll look at methods for indexing, slicing and subsetting DataFrames.

### _Objective_

1. **Data selection based on columns**: Understanding how to select specific data from a DataFrame based on columns

2. **Data selection based on rows**: Understanding how to select specific data from a DataFrame based on rows

In [3]:
import pandas as pd
import numpy as np

#### Example Data) Students' Report Cards

In [4]:
columns = ["class","l_name", "f_name", "history", "english", "math", "social_studies", "science"]
scores = [["1", "Smith", "John", 80, 92, 70, 65, 92],
          ["1", "Schafer", "Elise", 91, 75, 90, 68, 85],
          ["2", "Zimmermann", "Kate", 86, 76, 42, 72, 88],
          ["2", "Mendoza", "James", 77, 92, 52, 60, 80],
          ["3", "Park", "Jay", 75, 85, 85, 92, 95],
          ["3", "Randow", "Emma", 96, 90, 95, 81, 72],
          ["4", "Thompson", "Sarah", 91, 81, 92, 81, 73]]

df = pd.DataFrame(scores,columns=columns)
df

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
0,1,Smith,John,80,92,70,65,92
1,1,Schafer,Elise,91,75,90,68,85
2,2,Zimmermann,Kate,86,76,42,72,88
3,2,Mendoza,James,77,92,52,60,80
4,3,Park,Jay,75,85,85,92,95
5,3,Randow,Emma,96,90,95,81,72
6,4,Thompson,Sarah,91,81,92,81,73


# \[1. DataFrame Indexing, Slicing and Subsetting based on Columns\]

Let's look at how to select a subset of a DataFrame based on columns.

### (1) Selecting a single column

To select a single column, insert the column name into square brackets <code>[&nbsp;]</code> or use a dot operator as **`df_name.column_name`**.

*&nbsp;Note that `df` stands for DataFrame.

In [5]:
# Selecting the entire column for history scores
df['history'] 

0    80
1    91
2    86
3    77
4    75
5    96
6    91
Name: history, dtype: int64

In [6]:
# Selecting the entire column for history scores using a dot operator
df.history

0    80
1    91
2    86
3    77
4    75
5    96
6    91
Name: history, dtype: int64

To select multiple columns, pass a list of column names to the square brackets <code>[&nbsp;&nbsp;]</code>.

In [5]:
df[["history","english","math"]] 

Unnamed: 0,history,english,math
0,80,92,70
1,91,75,90
2,86,76,42
3,77,92,52
4,75,85,85
5,96,90,95
6,91,81,92
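The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. The small `df` below is a stand-in built from a few values of the report-card table; the variable names `single`, `one_col_df` and `subset` are only illustrative.

```python
import pandas as pd

# Stand-in for the report-card data (a few rows and columns only)
df = pd.DataFrame({
    "l_name": ["Smith", "Schafer", "Zimmermann"],
    "history": [80, 91, 86],
    "english": [92, 75, 76],
})

single = df["history"]               # one label -> Series
one_col_df = df[["history"]]         # list with one label -> DataFrame
subset = df[["history", "english"]]  # list of labels -> DataFrame

print(type(single).__name__)      # Series
print(type(one_col_df).__name__)  # DataFrame
print(subset.shape)               # (3, 2)
```

Bracket access works for any column name, whereas dot access only works when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.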


### (2) Searching for unique values

- `.unique()` returns an array of the unique values in the selected column.

- `.value_counts()` returns the frequency of each unique value in the selected column.


#### `.unique()`

In [6]:
df.loc[:, 'class'].unique() # checking the unique values from the `class` column

array(['1', '2', '3', '4'], dtype=object)

#### `.value_counts()`

In [7]:
df.loc[:, 'class'].value_counts() # checking how many times each unique value appeared in the `class` column

1    2
3    2
2    2
4    1
Name: class, dtype: int64
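As a small extension of the two methods above, the sketch below shows relative frequencies and label-ordered counts. It assumes a standalone Series named `classes` holding the same class labels as the example data.

```python
import pandas as pd

# Same class labels as in the report-card example
classes = pd.Series(["1", "1", "2", "2", "3", "3", "4"], name="class")

counts = classes.value_counts()                # frequency of each unique value
shares = classes.value_counts(normalize=True)  # relative frequency (sums to 1)
by_label = counts.sort_index()                 # order by label instead of by count

print(by_label.to_dict())  # {'1': 2, '2': 2, '3': 2, '4': 1}
```

Sorting the counts by label can help when ties in the counts make the default ordering hard to read.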

### (3) Sorting a DataFrame by column labels - **`.sort_index()`**

`.sort_index()` sorts a DataFrame by its labels along a given axis. Passing `axis=1` sorts the DataFrame by its column labels.

#### `.sort_index(axis=1)`

In [1]:
# Rearranging the student report cards by column labels in ascending order 
df.sort_index(axis=1)

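Unnamed: 0,class,english,f_name,history,l_name,math,science,social_studies
0,1,92,John,80,Smith,70,92,65
1,1,75,Elise,91,Schafer,90,85,68
2,2,76,Kate,86,Zimmermann,42,88,72
3,2,92,James,77,Mendoza,52,80,60
4,3,85,Jay,75,Park,85,95,92
5,3,90,Emma,96,Randow,95,72,81
6,4,81,Sarah,91,Thompson,92,73,81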

### (4) Sorting by values - **`.sort_values()`**

A DataFrame can be sorted not only by labels but also by values, using `.sort_values()`. By default, this method returns the DataFrame sorted in ascending order.

#### `.sort_values()`

In [9]:
df.sort_values('history') # DataFrame rows sorted in ascending order by the history score 

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
4,3,Park,Jay,75,85,85,92,95
3,2,Mendoza,James,77,92,52,60,80
0,1,Smith,John,80,92,70,65,92
2,2,Zimmermann,Kate,86,76,42,72,88
1,1,Schafer,Elise,91,75,90,68,85
6,4,Thompson,Sarah,91,81,92,81,73
5,3,Randow,Emma,96,90,95,81,72
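The `ascending` parameter and a list of sort keys are the two most common variations on the call above. The following is a minimal sketch on a four-row stand-in for the report-card data; `desc` and `multi` are illustrative names.

```python
import pandas as pd

# Stand-in for the first four students of the report-card data
df = pd.DataFrame({
    "class": ["1", "1", "2", "2"],
    "l_name": ["Smith", "Schafer", "Zimmermann", "Mendoza"],
    "history": [80, 91, 86, 77],
})

# Descending order by a single column
desc = df.sort_values("history", ascending=False)

# Sort by class (ascending), then by history score within each class (descending)
multi = df.sort_values(["class", "history"], ascending=[True, False])

print(desc["l_name"].tolist())   # ['Schafer', 'Zimmermann', 'Smith', 'Mendoza']
print(multi["l_name"].tolist())  # ['Schafer', 'Smith', 'Zimmermann', 'Mendoza']
```

Both calls return a new DataFrame and leave `df` itself in its original order.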


### (5) Iterating over DataFrame columns - **`.items()`**

`.items()` iterates over the DataFrame columns, yielding a tuple of the column name and the column content as a Series. The older `.iteritems()` alias was removed in pandas 2.0, so `.items()` is used here.

#### `.items()`

In [9]:
col_num = 0
for idx, col in df.items():
    if col_num == 2: # stops the iteration once two columns have been printed
        break 
    print("Column Name:", idx, "\n", "Column Number:", col_num)
    print(col)
    print("\n")
    col_num += 1
    

Column Name: class 
 Column Number: 0
0    1
1    1
2    2
3    2
4    3
5    3
6    4
Name: class, dtype: object


Column Name: l_name 
 Column Number: 1
0         Smith
1       Schafer
2    Zimmermann
3       Mendoza
4          Park
5        Randow
6      Thompson
Name: l_name, dtype: object
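Column iteration is mostly useful when each column needs its own treatment. The sketch below uses `.items()` (the column iterator in current pandas) to collect the mean of every numeric column; the `means` dictionary and the reduced `df` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the first four students of the report-card data
df = pd.DataFrame({
    "l_name": ["Smith", "Schafer", "Zimmermann", "Mendoza"],
    "history": [80, 91, 86, 77],
    "math": [70, 90, 42, 52],
})

means = {}
for name, col in df.items():
    # Skip text columns such as `l_name`; only numeric columns have a mean
    if pd.api.types.is_numeric_dtype(col):
        means[name] = col.mean()

print(means)
```

For a simple aggregate like this, a vectorized call such as `df.mean(numeric_only=True)` is usually preferable; the explicit loop is shown only to illustrate what each iteration receives.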




# \[2. DataFrame Indexing, Slicing and Subsetting based on Rows\]

Let's learn how you can select a subset of a DataFrame based on rows.

Before moving on to row selection, note that you can inspect the row labels of a DataFrame through `.index`, and rename them by assigning a list of new labels to it. The new list must have the same length as the number of rows in the DataFrame.


In [11]:
# Let's change row labels as follows.
df.index = ["a", "b", "c", "d", "e", "f", "g"]
df

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
a,1,Smith,John,80,92,70,65,92
b,1,Schafer,Elise,91,75,90,68,85
c,2,Zimmermann,Kate,86,76,42,72,88
d,2,Mendoza,James,77,92,52,60,80
e,3,Park,Jay,75,85,85,92,95
f,3,Randow,Emma,96,90,95,81,72
g,4,Thompson,Sarah,91,81,92,81,73


### (1) Selecting rows -  **`.loc[]`**  

**`loc`** is a label-based indexer. If you want to select rows by labels, use `.loc[]`.<br>

For a single-row selection, use **`df.loc[row_name]`** to get the entire row as a Series.

In [12]:
# Select row 'a'
df.loc["a"]

class                 1
l_name            Smith
f_name             John
history              80
english              92
math                 70
social_studies       65
science              92
Name: a, dtype: object

To select multiple rows, pass a list of row labels to **`df.loc[]`**. It returns the selected subset as a DataFrame.

In [13]:
# Get row 'a' and 'b'

df.loc[["a", "b"]]

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
a,1,Smith,John,80,92,70,65,92
b,1,Schafer,Elise,91,75,90,68,85


To select multiple consecutive rows, pass a slice of row labels. Unlike ordinary Python slices, a label slice in `loc` includes both endpoints.

In [14]:
# Get all rows from 'a' (inclusive) to 'c' (inclusive)
df.loc["a":"c"]

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
a,1,Smith,John,80,92,70,65,92
b,1,Schafer,Elise,91,75,90,68,85
c,2,Zimmermann,Kate,86,76,42,72,88


What if you want to select sets of data by both row and column labels?

You can simply pass the row and column labels of interest to the loc indexer.

In [15]:
df.loc[["a","b"], # selecting values in row a and b
       ["l_name","history","english","math"]] # of column `l_name`, `history`, `english` and `math`.

Unnamed: 0,l_name,history,english,math
a,Smith,80,92,70
b,Schafer,91,75,90


If you want everything along an axis instead of specific labels, use a colon (**`:`**) to indicate `all`.

In [10]:
df.loc[:,"history"] #  selecting values in the history column of all rows.

a    80
b    91
c    86
d    77
e    75
f    96
g    91
Name: history, dtype: int64

### (2) Accessing/Selecting rows by positional values - **`.iloc[]`**
`iloc` is a position-based indexer.

In [17]:
# Get the first row (position 0).
df.iloc[0]

class                 1
l_name            Smith
f_name             John
history              80
english              92
math                 70
social_studies       65
science              92
Name: a, dtype: object

Negative positions count backwards from the end of the object, starting at -1, so the last row of the DataFrame can be selected with index `-1`.

In [11]:
df.iloc[-1]

class                    4
l_name            Thompson
f_name               Sarah
history                 91
english                 81
math                    92
social_studies          81
science                 73
Name: g, dtype: object

To access multiple consecutive rows, pass a slice of positions.

In [12]:
# Access the rows at positions 1 and 2. Note that the stop position, 3 in this case, is excluded when slicing with `iloc`
df.iloc[1:3] 

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
b,1,Schafer,Elise,91,75,90,68,85
c,2,Zimmermann,Kate,86,76,42,72,88


If you want to select values in specific rows of specific columns with `iloc`, you can do the following.<br>
+ `df.iloc[row_numbers, column_numbers]`

In [20]:
df.iloc[1:3,0:2] # values in columns 0 and 1 for rows 1 and 2 (stop positions are excluded)

Unnamed: 0,class,l_name
b,1,Schafer
c,2,Zimmermann
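Because `loc` and `iloc` treat the end of a slice differently, it can help to see the two side by side. This is a minimal sketch on a four-row stand-in for the report-card data with the row labels `a` to `d`; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

# Stand-in for the first four students, with letter row labels
df = pd.DataFrame(
    {"l_name": ["Smith", "Schafer", "Zimmermann", "Mendoza"],
     "history": [80, 91, 86, 77]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b":"c"]   # label slice: both endpoints are included
by_position = df.iloc[1:3]   # positional slice: the stop position is excluded

print(by_label.index.tolist())     # ['b', 'c']
print(by_position.index.tolist())  # ['b', 'c']
```

Here both expressions select the same two rows, even though the label slice names its last row explicitly and the positional slice stops one position past it.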


### (3) Selecting by conditions (Boolean / Logical Indexing) - **`loc`**
Besides label-based and position-based indexing, you can also select data by its actual values. This is called **`Boolean indexing`**: a Boolean vector with one True/False entry per row is used to filter the data, and only the rows marked True are kept.

In [21]:
# Selecting students with a history score higher than 80
df.loc[df.history > 80]

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
b,1,Schafer,Elise,91,75,90,68,85
c,2,Zimmermann,Kate,86,76,42,72,88
f,3,Randow,Emma,96,90,95,81,72
g,4,Thompson,Sarah,91,81,92,81,73


In [22]:
# Last names and math scores of students with a history score higher than 80
df.loc[(df.history > 80),["l_name","math"]]

Unnamed: 0,l_name,math
b,Schafer,90
c,Zimmermann,42
f,Randow,95
g,Thompson,92
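Conditions can also be combined. The sketch below assumes a four-row stand-in for the report-card data and uses the element-wise operators `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Stand-in for four students of the report-card data
df = pd.DataFrame(
    {"l_name": ["Smith", "Schafer", "Zimmermann", "Randow"],
     "history": [80, 91, 86, 96],
     "math": [70, 90, 42, 95]},
    index=["a", "b", "c", "f"],
)

mask = df["history"] > 80  # Boolean Series, one entry per row

# history above 80 AND math at least 90
both = df.loc[(df["history"] > 80) & (df["math"] >= 90), "l_name"]

# history above 90 OR math below 50
either = df.loc[(df["history"] > 90) | (df["math"] < 50), "l_name"]

# rows where the mask is False
rest = df.loc[~mask, "l_name"]

print(both.tolist())    # ['Schafer', 'Randow']
print(either.tolist())  # ['Schafer', 'Zimmermann', 'Randow']
print(rest.tolist())    # ['Smith']
```

Python's `and`, `or` and `not` keywords do not work on a Series, which is why the symbolic operators are used here.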


### (4) Sorting a DataFrame by row labels - **`.sort_index()`** 

`.sort_index()` sorts a DataFrame by its labels along a given axis. By default it sorts by row labels (`axis=0`).

#### `.sort_index(axis=0)`

In [23]:
df.index = ['c', 'b', 'a', 'd', 'e', 'g', 'f'] # row labels are shuffled.
df.sort_index() # and sorted.

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
a,2,Zimmermann,Kate,86,76,42,72,88
b,1,Schafer,Elise,91,75,90,68,85
c,1,Smith,John,80,92,70,65,92
d,2,Mendoza,James,77,92,52,60,80
e,3,Park,Jay,75,85,85,92,95
f,4,Thompson,Sarah,91,81,92,81,73
g,3,Randow,Emma,96,90,95,81,72


In [24]:
# Sorting a DataFrame by row labels in descending order 
df.sort_index(ascending = False)

Unnamed: 0,class,l_name,f_name,history,english,math,social_studies,science
g,3,Randow,Emma,96,90,95,81,72
f,4,Thompson,Sarah,91,81,92,81,73
e,3,Park,Jay,75,85,85,92,95
d,2,Mendoza,James,77,92,52,60,80
c,1,Smith,John,80,92,70,65,92
b,1,Schafer,Elise,91,75,90,68,85
a,2,Zimmermann,Kate,86,76,42,72,88


### (5) Iterating over DataFrame rows - **`.iterrows()`**

`.iterrows()` iterates over the DataFrame rows, yielding a tuple of each row label and the row content as a Series.


#### `.iterrows()`

In [25]:
print(df,'\n')
for idx, row in df.iterrows():
    print("row name : ", idx)
    print(row)

  class      l_name f_name  history  english  math  social_studies  science
c     1       Smith   John       80       92    70              65       92
b     1     Schafer  Elise       91       75    90              68       85
a     2  Zimmermann   Kate       86       76    42              72       88
d     2     Mendoza  James       77       92    52              60       80
e     3        Park    Jay       75       85    85              92       95
g     3      Randow   Emma       96       90    95              81       72
f     4    Thompson  Sarah       91       81    92              81       73 

row name :  c
class                 1
l_name            Smith
f_name             John
history              80
english              92
math                 70
social_studies       65
science              92
Name: c, dtype: object
row name :  b
class                   1
l_name            Schafer
f_name              Elise
history                91
english                75
math             