# Unit 4 Lecture 2 -  ```Series``` and ```DataFrame``` Functions

ESI4628: Decision Support Systems for Industrial Engineers<br>
University of Central Florida
Dr. Ivan Garibay, Ramya Akula.
https://github.com/igaribay/DSSwithPython/blob/master/DSS-Week04/Notebook/DSS-Unit04-Lecture02.ipynb

## Notebook Learning Objectives
After studying this notebook students should be able to:
- Use common Pandas functions to manipulate <code>Series</code> and <code>DataFrame</code> data structures


# Overview

In this notebook we introduce the fundamental mechanisms to interact with data contained in <code>Series</code> and <code>DataFrame</code>, including:
- <code>.reindex</code>
- <code>.iloc</code> and <code>.loc</code>
- Arithmetic operators and data alignment: <code>add()</code>, <code>sub()</code>, <code>mul()</code> and <code>div()</code>.

# Reindex

In [1]:
import numpy as np
import pandas as pd

The ```reindex``` method conforms a ```Series``` or a ```DataFrame``` to a new index. It can reorder rows and columns, drop labels that are left out of the new index, and add labels that were not present before.
- The primary argument is the new index: the index of a Series, or the row index of a DataFrame.
- For a DataFrame, it also takes the argument <code>columns=</code> (a list of column labels) to reorganize columns.
- Secondary arguments include <code>fill_value=</code>, <code>method='ffill'</code>, and <code>method='bfill'</code>, which fill missing data with a set value, the value from the previous row with data, or the value from the next row with data, respectively.
- For a full description of the method use <code>pd.DataFrame.reindex?</code> (or <code>pd.Series.reindex?</code>).

__Note:__ <code>reindex</code> does not change the original <code>Series</code> or <code>DataFrame</code> at all. It just produces a _new object_ (a copy) with the desired reorganized rows/columns
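The note above can be checked with a small sketch. The names `original` and `reordered` and the toy values are made up for illustration and are not part of the lecture's dataset.

```python
import pandas as pd

# Hypothetical toy Series, used only for this illustration
original = pd.Series([10, 20, 30], index=['c', 'a', 'b'])

# reindex returns a new object; 'original' keeps its own order
reordered = original.reindex(['a', 'b', 'c'])

print(list(original.index))   # ['c', 'a', 'b']
print(list(reordered.index))  # ['a', 'b', 'c']
print(reordered is original)  # False
```

To keep the reorganized data, assign the result to a variable, as the cells below do.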

### Reindexing Series

In [2]:
# creating a Series with an arbitrary index
SeriesPixels = pd.Series(['blue','red','orange','yellow','black'], index=[15,3,0,6,10])
SeriesPixels

15      blue
3        red
0     orange
6     yellow
10     black
dtype: object

In [3]:
# using reindex method to reorganize the data so the index is in ascending order
SeriesPixels2=SeriesPixels.reindex([0,3,6,10,15])

In [4]:
print(range(15)) # range(15) is a lazy sequence of the integers 0 to 14; printing it shows range(0, 15)

# using reindex with range(18) to create a Series with all rows present from 0 to 17; note that reindex adds NaN for the new labels
SeriesPixels2.reindex(range(18))

range(0, 15)


0     orange
1        NaN
2        NaN
3        red
4        NaN
5        NaN
6     yellow
7        NaN
8        NaN
9        NaN
10     black
11       NaN
12       NaN
13       NaN
14       NaN
15      blue
16       NaN
17       NaN
dtype: object

In [5]:
# reindex can "fill" missing data with any value using "fill_value=" (label 15 is outside range(15), so 'blue' is dropped)
SeriesPixels2.reindex(range(15), fill_value="white")

0     orange
1      white
2      white
3        red
4      white
5      white
6     yellow
7      white
8      white
9      white
10     black
11     white
12     white
13     white
14     white
dtype: object

In [6]:
# reindex can also "pad" (carry values forward) or "backfill" (carry values backward)
SeriesPixels2.reindex(range(15), method='ffill')

0     orange
1     orange
2     orange
3        red
4        red
5        red
6     yellow
7     yellow
8     yellow
9     yellow
10     black
11     black
12     black
13     black
14     black
dtype: object
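The cell above shows forward filling only. As a complementary sketch, the same data can be backfilled with `method='bfill'`. The variable names `pixels` and `backfilled` are introduced here for illustration, and the Series is rebuilt so the example runs on its own.

```python
import pandas as pd

# Same values and labels as SeriesPixels2 above, rebuilt to be self-contained
pixels = pd.Series(['orange', 'red', 'yellow', 'black', 'blue'],
                   index=[0, 3, 6, 10, 15])

# method='bfill' gives each new label the value of the NEXT existing label
backfilled = pixels.reindex(range(8), method='bfill')
print(backfilled)
```

Labels 1 and 2 take the value at label 3, labels 4 and 5 take the value at label 6, and label 7 takes the value at label 10.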

### Reindexing a DataFrame

In [7]:
# Create DataFrame with arbitrary index

raw_data = {'first_name': ['Jason','Molly','Tina','Jake','Amy'], 'last_name': ['Miller','Jacobson','Alison','Milner','Cooze'],
            'age': [42,52,36,24,73], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame(raw_data, index=['e','a','b','d','f'])
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
e,Jason,Miller,42,4,25
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
d,Jake,Milner,24,2,62
f,Amy,Cooze,73,3,70


In [8]:
# reindex or change the order of rows, save resulting dataframe as df2

df2 = df.reindex (['a','b','d','e','f'])
df2

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
d,Jake,Milner,24,2,62
e,Jason,Miller,42,4,25
f,Amy,Cooze,73,3,70


Note: If we reindex a ```Series``` or ```DataFrame``` with a list containing a label that is not in the original index, the new row is filled with the null value NaN.

In [9]:
# reindex or change the order of rows with new inputs

df2.reindex (['a','b','c','d','e','f','g','h'])

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52.0,24.0,94.0
b,Tina,Alison,36.0,31.0,57.0
c,,,,,
d,Jake,Milner,24.0,2.0,62.0
e,Jason,Miller,42.0,4.0,25.0
f,Amy,Cooze,73.0,3.0,70.0
g,,,,,
h,,,,,


In [10]:
# reindex can also "fill" NaN with any desired value

df.reindex (['a','b','c','d','e','f','g','h'], fill_value="Joe")

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
c,Joe,Joe,Joe,Joe,Joe
d,Jake,Milner,24,2,62
e,Jason,Miller,42,4,25
f,Amy,Cooze,73,3,70
g,Joe,Joe,Joe,Joe,Joe
h,Joe,Joe,Joe,Joe,Joe


In [11]:
# reindex can also "pad" NaN with previous value

# code below produces an error... why? because index of "df" is NOT "monotonically increasing/decreasing", uncomment next line to test:
# df.reindex (['a','b','c','d','e','f','g','h'], method='ffill')

# but index for df2 is monotonically increasing, so works fine here:
df3 = df2.reindex (['a','b','c','d','e','f','g','h'], method='ffill')
df3

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
c,Tina,Alison,36,31,57
d,Jake,Milner,24,2,62
e,Jason,Miller,42,4,25
f,Amy,Cooze,73,3,70
g,Amy,Cooze,73,3,70
h,Amy,Cooze,73,3,70
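Whether `method='ffill'` will work can be checked in advance by inspecting the index. The sketch below uses two small hypothetical frames (`unsorted` and `ordered`) that mirror the `df` and `df2` cases. It uses `sort_index()` to obtain the sorted version, which is an alternative to the explicit `reindex` list used in the lecture.

```python
import pandas as pd

# Hypothetical small frames mirroring the unsorted (df) and sorted (df2) cases
unsorted = pd.DataFrame({'age': [42, 52, 36]}, index=['e', 'a', 'b'])
ordered = unsorted.sort_index()

print(unsorted.index.is_monotonic_increasing)  # False
print(ordered.index.is_monotonic_increasing)   # True

# Forward fill works once the index is monotonic
filled = ordered.reindex(['a', 'b', 'c', 'e'], method='ffill')
print(filled)
```

The new label `'c'` receives the values of the preceding label `'b'`.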


In [12]:
# reindex to change the order of columns, also selecting/adding columns

columnsTitles = ['first_name','last_name','age','phone']

df4=df3.reindex (columns = columnsTitles)
df4

Unnamed: 0,first_name,last_name,age,phone
a,Molly,Jacobson,52,
b,Tina,Alison,36,
c,Tina,Alison,36,
d,Jake,Milner,24,
e,Jason,Miller,42,
f,Amy,Cooze,73,
g,Amy,Cooze,73,
h,Amy,Cooze,73,


# How to select multiple rows and columns from a ```DataFrame```

- With the integer-position indexer ```.iloc``` and the label-based indexer ```.loc```, we can select multiple rows and columns from a ```DataFrame```.
- Selected data can be updated.
- ```.iloc``` and ```.loc``` are designed to avoid confusion about which type of index we are using to access data:
- (A) Integer positions, assigned by Pandas, always [0,1,2,...] in order (use ```.iloc``` for integer location)
- (B) Our row/column labels, created by us, which can take any values (use ```.loc``` for location by label)
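The difference between the two kinds of index is easiest to see when the labels are themselves integers. The following sketch rebuilds the Series from the Reindex section under the illustrative name `pixels`.

```python
import pandas as pd

# Same values and labels as SeriesPixels from the Reindex section
pixels = pd.Series(['blue', 'red', 'orange', 'yellow', 'black'],
                   index=[15, 3, 0, 6, 10])

print(pixels.iloc[0])  # position 0 -> 'blue'
print(pixels.loc[0])   # label 0    -> 'orange'
```

The same number selects different elements depending on whether it is read as a position or as a label.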

### ```.iloc``` function
```.iloc``` operates on "implicit" integer positions, similar to array indices = [0,1,2,...,N]

In [13]:
# Let's use the following DataFrame (df becomes another name for df2, not a copy)
df=df2
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
d,Jake,Milner,24,2,62
e,Jason,Miller,42,4,25
f,Amy,Cooze,73,3,70


In [14]:
# If we run this code, we will get a single row 
df.iloc[3]

first_name        Jason
last_name        Miller
age                  42
preTestScore          4
postTestScore        25
Name: e, dtype: object

To get the result as a DataFrame, pass the number inside a list:

In [15]:
df.iloc[[3]]

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
e,Jason,Miller,42,4,25


In [16]:
df.iloc[[-1]]

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
f,Amy,Cooze,73,3,70


To select a single data element:

In [17]:
df.iloc[3,2]

42

or using the DataFrame format:

In [18]:
df.iloc[[3],[2]]

Unnamed: 0,age
e,42


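We can use <code>.iloc</code> to update values. Here a string is assigned into the integer <code>age</code> column; as the output shows, Pandas accepted the mixed types when this notebook was run, but newer versions may warn about or reject such dtype-changing assignments: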

In [19]:
df.iloc[[3],[2]]="Garibay"
df.iloc[[3],[2]]

Unnamed: 0,age
e,Garibay


To select more than one row using <code>.iloc</code>:

In [20]:
#Selecting more than one row using .iloc 
df.iloc[[0,2]]

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
d,Jake,Milner,24,2,62


Everything to the left of the comma selects rows, and everything to the right of the comma selects columns.

In [21]:
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore
a,Molly,Jacobson,52,24,94
b,Tina,Alison,36,31,57
d,Jake,Milner,24,2,62
e,Jason,Miller,Garibay,4,25
f,Amy,Cooze,73,3,70


In [22]:
df.iloc[[0,2],[2,1]]

Unnamed: 0,age,last_name
a,52,Jacobson
d,24,Milner


Finally, we can also use "slice" format for <code>.iloc</code> (as with Python lists, the end position is excluded):

In [23]:
df.iloc[0:3,1:3]

Unnamed: 0,last_name,age
a,Jacobson,52
b,Alison,36
d,Milner,24


### ```.loc``` function

```.loc``` operates on the index labels we define for rows and columns

In [24]:
#example (introducing a data frame)

Score = {'student1' : pd.Series([100,93,87,100], index=['score1', 'score2', 'score3', 'score4']),
         'student2' : pd.Series([93,96,79,98], index=['score1', 'score2', 'score3', 'score4']),
         'student3' : pd.Series([100,99,96,89], index=['score1', 'score2', 'score3', 'score4'])}

df = pd.DataFrame(Score)
df

Unnamed: 0,student1,student2,student3
score1,100,93,100
score2,93,96,99
score3,87,79,96
score4,100,98,89


In [25]:
df.loc[['score3']]

Unnamed: 0,student1,student2,student3
score3,87,79,96


In [26]:
# everything to the left of the comma selects rows; everything to the right selects columns

df.loc[['score2','score3'],['student2']]

Unnamed: 0,student2
score2,96
score3,79


In [27]:
df.loc['score1':'score2','student2':'student3']

Unnamed: 0,student2,student3
score1,93,100
score2,96,99
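Slices behave differently under the two indexers: `.iloc` excludes the end position, while `.loc` includes the end label. A minimal sketch, using a hypothetical one-column frame named `scores`:

```python
import pandas as pd

# Hypothetical frame with string labels, for illustration only
scores = pd.DataFrame({'student1': [100, 93, 87, 100]},
                      index=['score1', 'score2', 'score3', 'score4'])

by_position = scores.iloc[0:2]            # positions 0 and 1; position 2 is excluded
by_label = scores.loc['score1':'score3']  # score1, score2 AND score3

print(list(by_position.index))
print(list(by_label.index))
```

This explains why the label slice in the cell above returns both `score1` and `score2`, and both `student2` and `student3`.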


## Arithmetic and Data Alignment

When doing arithmetic on Series and DataFrames, Pandas automatically attempts to:
   - Align data by row and column labels so operations make sense
   - Mark the result as missing (NaN) wherever a label is present in only one of the operands

The main arithmetic methods are ```add()```, ```sub()```, ```mul()```, and ```div()```.
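Alignment is easiest to see first on two short Series. The sketch below uses hypothetical Series `s1` and `s2` whose labels only partially overlap.

```python
import pandas as pd

# Hypothetical Series with partially overlapping labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label, not by position; same as s1.add(s2)
total = s1 + s2
print(total)
```

Only `'b'` and `'c'` appear in both Series, so only those rows hold a sum; `'a'` and `'d'` come out as NaN.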

In [28]:
# example DataFrame 1
df1 = pd.DataFrame(np.arange(9).reshape((3,3)), columns=['Orlando', 'Tampa', 'Miami'], index=['b','d','e'])
df1

Unnamed: 0,Orlando,Tampa,Miami
b,0,1,2
d,3,4,5
e,6,7,8


In [29]:
# example DataFrame 2
df2 = pd.DataFrame(np.arange(10,42,2).reshape((4,4)), columns=['Orlando', 'New York', 'Miami', 'San Francisco'], index=['a','b','c','d'])
df2

Unnamed: 0,Orlando,New York,Miami,San Francisco
a,10,12,14,16
b,18,20,22,24
c,26,28,30,32
d,34,36,38,40


To add two DataFrames, simply use the ```+``` operator:

In [30]:
df1 + df2  # equivalent to df1.add(df2)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,,,,,
b,24.0,,18.0,,
c,,,,,
d,43.0,,37.0,,
e,,,,,


Alternatively use:

In [31]:
df1.add(df2)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,,,,,
b,24.0,,18.0,,
c,,,,,
d,43.0,,37.0,,
e,,,,,


Note that the addition is successful only where both DataFrames have a number for the same row and column label. Everywhere else Pandas fills the result with "NaN" (Not a Number).
In some cases we may want to substitute a special value where one of the DataFrames has no data, for instance "0":

In [32]:
df1.add(df2, fill_value=0)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,14.0,12.0,10.0,16.0,
b,24.0,20.0,18.0,24.0,1.0
c,30.0,28.0,26.0,32.0,
d,43.0,36.0,37.0,40.0,4.0
e,8.0,,6.0,,7.0


In [33]:
df1.mul(df2, fill_value=0)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,0.0,0.0,0.0,0.0,
b,44.0,0.0,0.0,0.0,0.0
c,0.0,0.0,0.0,0.0,
d,190.0,0.0,102.0,0.0,0.0
e,0.0,,0.0,,0.0


In [34]:
df1.mul(df2, fill_value=1)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,14.0,12.0,10.0,16.0,
b,44.0,20.0,0.0,24.0,1.0
c,30.0,28.0,26.0,32.0,
d,190.0,36.0,102.0,40.0,4.0
e,8.0,,6.0,,7.0


In [35]:
df1.div(df2, fill_value=0)

Unnamed: 0,Miami,New York,Orlando,San Francisco,Tampa
a,0.0,0.0,0.0,0.0,
b,0.090909,0.0,0.0,0.0,inf
c,0.0,0.0,0.0,0.0,
d,0.131579,0.0,0.088235,0.0,inf
e,inf,,inf,,inf
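The outputs above still contain some NaN values even with `fill_value` set. As a sketch of why, the example below applies `sub()` (not shown in the cells above) to the same two DataFrames, rebuilt here so the block runs on its own. The name `diff` is introduced for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild df1 and df2 from the cells above so the sketch is self-contained
df1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                   columns=['Orlando', 'Tampa', 'Miami'], index=['b', 'd', 'e'])
df2 = pd.DataFrame(np.arange(10, 42, 2).reshape((4, 4)),
                   columns=['Orlando', 'New York', 'Miami', 'San Francisco'],
                   index=['a', 'b', 'c', 'd'])

diff = df1.sub(df2, fill_value=0)

print(diff.loc['b', 'Miami'])  # present in both: 2 - 22
print(diff.loc['a', 'Miami'])  # only in df2: 0 - 14
print(diff.loc['a', 'Tampa'])  # missing in both: stays NaN
```

`fill_value` replaces a missing operand only when the other operand exists. A cell that is absent from both DataFrames remains NaN, which matches the empty cells in the tables above.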


# References

- Series, http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html
- Data Frame, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
- Reindex, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
- Indexing and Selecting (.loc, .iloc), https://pandas.pydata.org/pandas-docs/stable/indexing.html

_Last updated on 10.13.2022 3:46pm<br>
(C) 2022 Complex Adaptive Systems Laboratory all rights reserved._