# Unit 4 Lecture 1 - Pandas

ESI4628: Decision Support Systems for Industrial Engineers<br>
University of Central Florida
Dr. Ivan Garibay, Ramya Akula, Mostafa Saeidi, Madeline Schiappa, and Brett Belcher. 
https://github.com/igaribay/DSSwithPython/blob/master/DSS-Week04/Notebook/DSS-Unit04-Lecture01.2018.ipynb

## Notebook Learning Objectives
After studying this notebook students should be able to:
- Use the Pandas Python package to execute basic data manipulation operations
- Use Pandas to create and manipulate Series and Data Frames

# Overview

Pandas is the fundamental package for data manipulation and analysis in Python:
- Useful for almost any data task, from simple Excel-style operations to complex SQL-style data manipulations
- Built on top of NumPy
- It is conventional to import Pandas as "pd"

For more information about Pandas:
https://pandas.pydata.org


# Data Structures

The two most versatile data structures in Pandas are ```Series``` and ```DataFrame```, both of which are built on top of NumPy. So, before starting, we need to import the ```NumPy``` and ```Pandas``` libraries.

In [67]:
# Import the NumPy and Pandas libraries, aliased as np and pd respectively

import numpy as np
import pandas as pd

In [69]:
pd.Series?

## Series

A series is a one-dimensional object that can be created from various inputs such as an ```Array```, a ```Dict```, or a ```Scalar value or constant```. By default, each value in a series receives an index from 0 to N-1, where N is the length of the data.

In [54]:
# example of creating a simple series
MySeries = pd.Series ([3.14, "python", -10, 'BC34'])
print (MySeries)

0      3.14
1    python
2       -10
3      BC34
dtype: object
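
The series above mixes floats, strings, and integers, so Pandas stores it with the generic `object` dtype. As a minimal sketch (the names `arr` and `num_series` and the data are illustrative, not from the original notebook), a series built from a homogeneous NumPy array keeps the array's numeric dtype and still receives the default 0 to N-1 index:

```python
import numpy as np
import pandas as pd

# A homogeneous NumPy array of floats (illustrative data)
arr = np.array([1.5, 2.5, 3.5])

# Build a Series from the array; the default index is 0..N-1
num_series = pd.Series(arr)

print(num_series.dtype)        # numeric dtype inherited from the array
print(list(num_series.index))  # default integer index
```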


You can assign an explicit index label to each value in the series, as shown below:

In [74]:
MySeries2 = pd.Series ( [3.14, "python", -10, 'BC34'], 
                 index=['A', 'B', 'C', 'D'])
print (MySeries2)

A      3.14
B    python
C       -10
D      BC34
dtype: object


In [36]:
MySeries2.values

array([3.14, 'python', -10, 'BC34'], dtype=object)

Use index labels to select values from a series.

In [52]:
MySeries2[['C','B']]   # Select values by their index labels

C       -10
B    python
dtype: object
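
As a small sketch of label-based selection (the variable names here are illustrative), a single label returns the stored value itself, while a list of labels returns a new series in the requested order. The `.loc` accessor does the same lookup and makes it explicit that labels, not positions, are being used:

```python
import pandas as pd

labeled = pd.Series([3.14, "python", -10, "BC34"], index=["A", "B", "C", "D"])

single = labeled["B"]                  # one label -> the stored value itself
subset = labeled[["C", "B"]]           # list of labels -> a new Series, in the requested order
same_subset = labeled.loc[["C", "B"]]  # .loc makes the label-based lookup explicit

print(single)
print(subset)
```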

### Creating a series by passing a dictionary

In [45]:
Data = {'Name': ['Bob', 'John', 'Mary'], 'Age': [15, 23, 17], 'Color': ['white', 'black', 'black']}

Sdata = pd.Series(Data)
print (Sdata)

Age               [15, 23, 17]
Color    [white, black, black]
Name         [Bob, John, Mary]
dtype: object


In this example, the dictionary's keys become the index labels of the series. You can therefore pass a list of keys as ```index``` to select and order the entries:

In [38]:
Features = ['Name', 'Age']

Sdata2 = pd.Series (Data, index = Features)
print (Sdata2)

Name    [Bob, John, Mary]
Age          [15, 23, 17]
dtype: object


In [50]:
Features = ['Name', 'Age', 'Color', 'Weigth']

Sdata2 = pd.Series (Data, index = Features)
print (Sdata2)

Name          [Bob, John, Mary]
Age                [15, 23, 17]
Color     [white, black, black]
Weigth                      NaN
dtype: object


Note: Since the 'Data' dictionary has no value for ```Weigth```, it appears as NaN. This kind of data is considered 'missing data' or 'NA values'.

When working with large data sets, detecting missing data is essential. The ```isnull``` and ```notnull``` functions are used for this purpose.

In [49]:
pd.isnull(Sdata2)

Name      False
Age       False
Color     False
Weigth     True
dtype: bool

In [9]:
pd.notnull(Sdata2)

Name       True
Age        True
Color      True
Weigth    False
dtype: bool
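
The boolean results above can be counted or used as a filter. The following is a minimal sketch with illustrative data (the names `measurements`, `missing_count`, `present_only`, and `dropped` are assumptions for this example); it relies only on the standard `isnull`, `notnull`, and `dropna` Series methods:

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing entry
measurements = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

missing_count = measurements.isnull().sum()          # each True counts as 1
present_only = measurements[measurements.notnull()]  # boolean-mask filtering
dropped = measurements.dropna()                      # same result via dropna()

print(missing_count)
print(present_only)
```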

In [56]:
# Retrieve some elements from a series

Ser = pd.Series ([10,20,30,40,50,60,70], index = ['a','b','c','d','e','f','g'])

print (Ser[1:5])
print (Ser[3:4])

b    20
c    30
d    40
e    50
dtype: int64
d    40
dtype: int64


In [57]:
print (Ser[-3:])

e    50
f    60
g    70
dtype: int64


In [58]:
#Retrieve data using index

print (Ser [['a','d','f','g']])

a    10
d    40
f    60
g    70
dtype: int64
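
The slices in the cells above use integer positions, so the end position is excluded. As a hedged sketch (variable names are illustrative), `.iloc` makes positional slicing explicit, while slicing by label with `.loc` includes both end labels:

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40, 50, 60, 70],
                    index=["a", "b", "c", "d", "e", "f", "g"])

by_position = letters.iloc[1:5]   # positions 1..4; the end position is excluded
by_label = letters.loc["b":"e"]   # labels 'b'..'e'; the end label is included

print(by_position)
print(by_label)
```

Both selections return the same four rows, even though the slice bounds look different.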


### Creating a series by passing a scalar

If the data is a scalar value, the value is repeated once for each index label. The index determines the length of the series, so one should be provided.

In [80]:
Ser = pd.Series (23 , index = [0,1,2,3,4])
print (Ser)

0    23
1    23
2    23
3    23
4    23
dtype: int64


We can use a function such as <code>range(8)</code> to define the index, and <code>dtype</code> to set the data type.

In [81]:
Ser = pd.Series (250, index = range(8), dtype=float)
print (Ser)

0    250.0
1    250.0
2    250.0
3    250.0
4    250.0
5    250.0
6    250.0
7    250.0
dtype: float64


## Basic Functionality in Series

```axes``` Returns a list of the row axis labels.

```dtype``` Returns the dtype of the object.

```empty``` Returns True if series is empty.

```ndim```  Returns the number of dimensions of the underlying data.

```size``` Returns the number of elements in the underlying data.

```values``` Returns the Series as ndarray.

```head()``` Returns the first n rows.

```tail()``` Returns the last n rows.



(Reference: www.tutorialspoint.com/python_pandas)

#### These attributes and methods are accessed using the following pattern:

#### NameSeries.```attribute``` or NameSeries.```method()```

In [105]:
# Some examples of using these attributes and methods on a Series:


Ser = pd.Series ([10,20,30,40,50,60,70, 80], index = ['a','b','c','d','e','f','g','h'])

print ("The axes are: ")
print (Ser.axes)

print ("The number of dimensions of the object is: ")
print (Ser.ndim)

print ("The size of the object is: ")
print (Ser.size)

print ("The data in the Series is: ")
print (Ser.values)

print ("The first 4 rows of the data series: ")
print (Ser.head(4))

print ("The last 2 rows of the data series: ")
print (Ser.tail(2))


The axes are: 
[Index([u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h'], dtype='object')]
The number of dimensions of the object is: 
1
The size of the object is: 
8
The data in the Series is: 
[10 20 30 40 50 60 70 80]
The first 4 rows of the data series: 
a    10
b    20
c    30
d    40
dtype: int64
The last 2 rows of the data series: 
g    70
h    80
dtype: int64
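
The cell above does not exercise ```empty```. A minimal sketch, using illustrative names (`scores`, `no_data`), shows how it behaves for a populated and an empty series; the explicit `dtype=float` on the empty series is a choice made here to avoid a warning that some Pandas versions emit:

```python
import pandas as pd

scores = pd.Series([10, 20, 30], index=["a", "b", "c"])
no_data = pd.Series([], dtype=float)  # explicit dtype for the empty case

print(scores.empty)    # the series holds three values
print(no_data.empty)   # no elements at all
print(scores.size, scores.ndim)
```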


## DataFrame

A ```DataFrame``` is a two-dimensional data structure resembling a table consisting of an ordered collection of columns, each of which could be of a different value type. A ```DataFrame``` has both row and column indexes.

A ```DataFrame``` can be created using various inputs like: ```List```, ```Dictionary```, ```Series```, and ```Numpy ndarrays```. 
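
The sections below cover lists, dictionaries, and Series. For the remaining input type, a NumPy ndarray, here is a short sketch; the array contents and the row and column labels are hypothetical values chosen for illustration:

```python
import numpy as np
import pandas as pd

# 2 rows x 3 columns of illustrative values: [[0, 1, 2], [3, 4, 5]]
grid = np.arange(6).reshape(2, 3)

# Column and row labels are hypothetical names chosen for this sketch
df_from_array = pd.DataFrame(grid, columns=["x", "y", "z"], index=["r1", "r2"])
print(df_from_array)
```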

### Creating a DataFrame by passing a list

In [103]:
Data = [100, 120, 130, 140, 150]

df = pd.DataFrame(Data)
df

|      | 0   |
|------|-----|
| 0    | 100 |
| 1    | 120 |
| 2    | 130 |
| 3    | 140 |
| 4    | 150 |


In [102]:
raw_data = [['Bruce','Banner',38,4,25],['Tony','Stark',42,24,94],['Hal','Jordan',25,31,57],['Bruce','Wayne',32,2,62],
            ['Clark','Kent',28,3,70]]
df = pd.DataFrame (raw_data, columns = ['first_name', 'last_name','age','preTestScore','postTestScore'])
df

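|      | first_name | last_name | age | preTestScore | postTestScore |
|------|------------|-----------|-----|--------------|---------------|
| 0    | Bruce      | Banner    | 38  | 4            | 25            |
| 1    | Tony       | Stark     | 42  | 24           | 94            |
| 2    | Hal        | Jordan    | 25  | 31           | 57            |
| 3    | Bruce      | Wayne     | 32  | 2            | 62            |
| 4    | Clark      | Kent      | 28  | 3            | 70            |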


### Creating a DataFrame by passing a dictionary

In [101]:
raw_data = {'first_name': ['Bruce','Tony','Hal','Bruce','Clark'], 'last_name': ['Banner','Stark','Jordan','Wayne','Kent'], 
           'age': [38,42,25,32,28], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}

df = pd.DataFrame (raw_data , index = ['rank1','rank2','rank3','rank4','rank5'])
df

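|       | age | first_name | last_name | postTestScore | preTestScore |
|-------|-----|------------|-----------|---------------|--------------|
| rank1 | 38  | Bruce      | Banner    | 25            | 4            |
| rank2 | 42  | Tony       | Stark     | 94            | 24           |
| rank3 | 25  | Hal        | Jordan    | 57            | 31           |
| rank4 | 32  | Bruce      | Wayne     | 62            | 2            |
| rank5 | 28  | Clark      | Kent      | 70            | 3            |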


In [108]:
# Create a DataFrame from list of dicts

Data = [{'first_attempt':12, 'second_attempt':10.78,}, {'first_attempt':14.1, 'second_attempt':13.2, 'third_attempt':12}]

df = pd.DataFrame (Data)
df

|      | first_attempt | second_attempt | third_attempt |
|------|---------------|----------------|---------------|
| 0    | 12.0          | 10.78          | NaN           |
| 1    | 14.1          | 13.2           | 12.0          |


In [109]:
# define index 

df = pd.DataFrame (Data, index = ['score1','score2'])
df

|        | first_attempt | second_attempt | third_attempt |
|--------|---------------|----------------|---------------|
| score1 | 12.0          | 10.78          | NaN           |
| score2 | 14.1          | 13.2           | 12.0          |


### Creating a DataFrame from Dict of Series

In [110]:

Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(Data)
df

|      | first | second |
|------|-------|--------|
| a    | 1.0   | 1      |
| b    | 2.0   | 2      |
| c    | 3.0   | 3      |
| d    | NaN   | 4      |
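
In the output above, the row index is the union of the two Series indexes. Column `first` has no value for label `d`, so Pandas fills that cell with NaN; because NaN is a floating-point value, the whole column is stored as floats, while `second` remains an integer column. A minimal sketch that checks this (the names `data` and `aligned` are illustrative):

```python
import pandas as pd

data = {"first": pd.Series([1, 2, 3], index=["a", "b", "c"]),
        "second": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])}
aligned = pd.DataFrame(data)

# The row index is the union of both Series indexes
print(list(aligned.index))

# 'first' gained a NaN, so it is stored as floats; 'second' is still integer
print(aligned.dtypes)
print(aligned["first"].isnull().sum())
```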


### Column Addition

In [113]:
raw_data = {'first_name': ['Bruce','Tony','Hal','Bruce','Clark'], 'last_name': ['Banner','Stark','Jordan','Wayne','Kent'], 
           'age': [38,42,25,32,28], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}
df = pd.DataFrame (raw_data , index = ['rank1','rank2','rank3','rank4','rank5'])

print ("Original data: ")

print (df)

# Add a new column to the existing DataFrame

date = [2017, 2018, 2017, np.nan, 2015]

df["date"] = date

print ("New DataFrame after inserting the 'date' column")

print (df)

Original data: 
       age first_name last_name  postTestScore  preTestScore
rank1   38      Bruce    Banner             25             4
rank2   42       Tony     Stark             94            24
rank3   25        Hal    Jordan             57            31
rank4   32      Bruce     Wayne             62             2
rank5   28      Clark      Kent             70             3
New DataFrame after inserting the 'date' column
       age first_name last_name  postTestScore  preTestScore    date
rank1   38      Bruce    Banner             25             4  2017.0
rank2   42       Tony     Stark             94            24  2018.0
rank3   25        Hal    Jordan             57            31  2017.0
rank4   32      Bruce     Wayne             62             2     NaN
rank5   28      Clark      Kent             70             3  2015.0


In [123]:
date = [2017, 2018,2017,np.nan,2015]
df["date"] = date
df

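|       | age | first_name | last_name | postTestScore | preTestScore | date   |
|-------|-----|------------|-----------|---------------|--------------|--------|
| rank1 | 38  | Bruce      | Banner    | 25            | 4            | 2017.0 |
| rank2 | 42  | Tony       | Stark     | 94            | 24           | 2018.0 |
| rank3 | 25  | Hal        | Jordan    | 57            | 31           | 2017.0 |
| rank4 | 32  | Bruce      | Wayne     | 62            | 2            | NaN    |
| rank5 | 28  | Clark      | Kent      | 70            | 3            | 2015.0 |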


In [129]:
Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df2 = pd.DataFrame(Data)
print (df2)

# Add a new column to the existing DataFrame

df2 ['third'] = pd.Series([100,200,300,400], index = ['a','b','c','d'])

print ("New DataFrame after inserting the 'third' column")

print (df2)

   first  second
a    1.0       1
b    2.0       2
c    3.0       3
d    NaN       4
New DataFrame after inserting the 'third' column
   first  second  third
a    1.0       1    100
b    2.0       2    200
c    3.0       3    300
d    NaN       4    400
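
When the new column is a Series, its values are matched to the DataFrame rows by index label rather than by position. The sketch below uses hypothetical data (the names `frame` and `partial` and the label `z` are illustrative) to show that rows missing from the Series receive NaN and that labels absent from the DataFrame are ignored:

```python
import pandas as pd

frame = pd.DataFrame({"second": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# The new Series covers only labels 'a' and 'c', plus 'z', which the frame lacks
partial = pd.Series([100, 300, 999], index=["a", "c", "z"])
frame["third"] = partial

print(frame)
```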


### Retrieving Columns and Rows as Series

In [132]:
raw_data = {'first_name': ['Bruce','Tony','Hal','Bruce','Clark'], 'last_name': ['Banner','Stark','Jordan','Wayne','Kent'], 
           'age': [38,42,25,32,28], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}
df = pd.DataFrame (raw_data , index = ['rank1','rank2','rank3','rank4','rank5'])
df

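|       | age | first_name | last_name | postTestScore | preTestScore |
|-------|-----|------------|-----------|---------------|--------------|
| rank1 | 38  | Bruce      | Banner    | 25            | 4            |
| rank2 | 42  | Tony       | Stark     | 94            | 24           |
| rank3 | 25  | Hal        | Jordan    | 57            | 31           |
| rank4 | 32  | Bruce      | Wayne     | 62            | 2            |
| rank5 | 28  | Clark      | Kent      | 70            | 3            |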


In [133]:
df["last_name"]

rank1    Banner
rank2     Stark
rank3    Jordan
rank4     Wayne
rank5      Kent
Name: last_name, dtype: object

In [136]:
df.loc["rank5"]

age                 28
first_name       Clark
last_name         Kent
postTestScore       70
preTestScore         3
Name: rank5, dtype: object
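
As a compact sketch of the retrieval options (a small illustrative frame; the names `people`, `age_column`, `first_row`, and `tony_age` are assumptions for this example): square brackets select a column, `.loc` selects by label, and `.iloc` selects by integer position. Passing both a row label and a column name to `.loc` returns a single cell:

```python
import pandas as pd

people = pd.DataFrame({"first_name": ["Bruce", "Tony"],
                       "age": [38, 42]},
                      index=["rank1", "rank2"])

age_column = people["age"]             # a column, returned as a Series
first_row = people.iloc[0]             # a row selected by integer position
tony_age = people.loc["rank2", "age"]  # one cell by row label and column name

print(age_column)
print(first_row)
print(tony_age)
```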

### Column and Row Deletion

In [154]:
raw_data = {'first_name': ['Bruce','Tony','Hal','Bruce','Clark'], 'last_name': ['Banner','Stark','Jordan','Wayne','Kent'], 
           'age': [38,42,25,32,28], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}
df = pd.DataFrame(raw_data)
df

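|      | age | first_name | last_name | postTestScore | preTestScore |
|------|-----|------------|-----------|---------------|--------------|
| 0    | 38  | Bruce      | Banner    | 25            | 4            |
| 1    | 42  | Tony       | Stark     | 94            | 24           |
| 2    | 25  | Hal        | Jordan    | 57            | 31           |
| 3    | 32  | Bruce      | Wayne     | 62            | 2            |
| 4    | 28  | Clark      | Kent      | 70            | 3            |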


In [161]:
df.drop('preTestScore', axis = 1)             # drop column "preTestScore", the argument axis=1 denotes column

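|      | age | first_name | last_name | postTestScore |
|------|-----|------------|-----------|---------------|
| 0    | 38  | Bruce      | Banner    | 25            |
| 1    | 42  | Tony       | Stark     | 94            |
| 2    | 25  | Hal        | Jordan    | 57            |
| 3    | 32  | Bruce      | Wayne     | 62            |
| 4    | 28  | Clark      | Kent      | 70            |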


In [166]:
df.drop(4)                                   # drop row 4, axis=0 denotes row (default)

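|      | age | first_name | last_name | postTestScore | preTestScore |
|------|-----|------------|-----------|---------------|--------------|
| 0    | 38  | Bruce      | Banner    | 25            | 4            |
| 1    | 42  | Tony       | Stark     | 94            | 24           |
| 2    | 25  | Hal        | Jordan    | 57            | 31           |
| 3    | 32  | Bruce      | Wayne     | 62            | 2            |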
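
Note that the second `drop` output above still shows the `preTestScore` column: by default `drop` returns a new DataFrame and leaves the original unchanged. A minimal sketch with illustrative data (the names `original`, `without_score`, and `without_row_0` are assumptions for this example):

```python
import pandas as pd

original = pd.DataFrame({"age": [38, 42, 25], "score": [4, 24, 31]})

without_score = original.drop("score", axis=1)  # new frame, column removed
without_row_0 = original.drop(0)                # new frame, row label 0 removed

# 'original' still has every row and column
print(original.shape, without_score.shape, without_row_0.shape)
```

To keep the result, assign it to a name, as done here.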


In [159]:
# This example shows that the del statement can also remove a column from a DataFrame

Data = {'first' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'second' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(Data)

# Unlike drop, del modifies the DataFrame in place

del df['first']
df

|      | second |
|------|--------|
| a    | 1      |
| b    | 2      |
| c    | 3      |
| d    | 4      |


## Basic Functionality in DataFrame

```T``` Transposes rows and columns.

```axes``` Returns a list of the row and column axis labels.

```dtypes``` Returns the dtype of each column.

```empty``` Returns True if the DataFrame is empty.

```ndim```  Returns the number of axes / array dimensions.

```shape``` Returns a tuple with the number of rows and columns.

```size``` Returns the number of elements in the underlying data.

```values``` Returns the data as a NumPy ndarray.

```head()``` Returns the first n rows.

```tail()``` Returns the last n rows.



(Reference: www.tutorialspoint.com/python_pandas)

In [167]:
raw_data = {'first_name': ['Bruce','Tony','Hal','Bruce','Clark'], 'last_name': ['Banner','Stark','Jordan','Wayne','Kent'], 
           'age': [38,42,25,32,28], 'preTestScore': [4,24,31,2,3], 'postTestScore': [25,94,57,62,70]}
df = pd.DataFrame (raw_data)

print (df)

# Transpose

print ("The transpose of the data series is: ")
print (df.T)

# Axes

print ("The row and column axis labels are: ")
print (df.axes)

# dtypes

print ("The data types of each column are: ")
print (df.dtypes)


# ndim

print ("The dimension is: ")
print (df.ndim)

# shape

print ("The shape is: ")
print (df.shape)

# size

print ("The total number of elements is: ")
print (df.size)

   age first_name last_name  postTestScore  preTestScore
0   38      Bruce    Banner             25             4
1   42       Tony     Stark             94            24
2   25        Hal    Jordan             57            31
3   32      Bruce     Wayne             62             2
4   28      Clark      Kent             70             3
The transpose of the data series is: 
                    0      1       2      3      4
age                38     42      25     32     28
first_name      Bruce   Tony     Hal  Bruce  Clark
last_name      Banner  Stark  Jordan  Wayne   Kent
postTestScore      25     94      57     62     70
preTestScore        4     24      31      2      3
The row and column axis labels are: 
[RangeIndex(start=0, stop=5, step=1), Index([u'age', u'first_name', u'last_name', u'postTestScore', u'preTestScore'], dtype='object')]
The data types of each column are: 
age               int64
first_name       object
last_name        object
postTestScore     int64
preTestSco
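
The `ndim`, `shape`, and `size` attributes used in the cell above are related: `shape` is the pair (rows, columns), `ndim` is the length of that pair, and `size` is the product of its entries. A minimal sketch that rebuilds the same five-row, five-column data (the names `raw` and `heroes` are illustrative):

```python
import pandas as pd

raw = {"first_name": ["Bruce", "Tony", "Hal", "Bruce", "Clark"],
       "last_name": ["Banner", "Stark", "Jordan", "Wayne", "Kent"],
       "age": [38, 42, 25, 32, 28],
       "preTestScore": [4, 24, 31, 2, 3],
       "postTestScore": [25, 94, 57, 62, 70]}
heroes = pd.DataFrame(raw)

print(heroes.ndim)    # number of axes (rows and columns)
print(heroes.shape)   # (number of rows, number of columns)
print(heroes.size)    # rows * columns
```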

# Exercises

__4.1__ Consider Table 1:

|      | City      |Capital of      | Population (in Millions)|
|------|-----------|----------------|-------------------------|
|   0  | Kabul     |Afghanistan     | 3.2                     |
|   1  | Beijing   |China           | 20.7                    |
|   2  | New Delhi |India           | 16.8                    |
|   3  | Tokyo     |Japan           | 13.2                    |
|   4  | Moscow    |Russia          | 11.5                    |
|   5  | Cairo     |Egypt           | 10.2                    |


And Table 2:

|      | City      |Capital of      | Population (in Millions)|
|------|-----------|----------------|-------------------------|
|   0  | Manila    |Philippines     | 12.0                    |
|   1  | Jakarta   |Indonesia       | 10.2                    |
|   2  | Kinshasa  |Congo           | 10.2                    |
|   3  | Seoul     |South Korea     | 9.9                     |
|   4  | Dhaka     |Bangladesh      | 8.9                     |
|   5  | Sao Paulo |NaN             | 12.1                    |
|   6  | New York  |NaN             | 8.5                     |
|   7  | Orlando   |NaN             | 0.3                     |


Please create: (A) a Series containing all cities listed in Table 1; (B) a Series containing row 3 (Seoul) from Table 2; (C) a Series containing all "Capitals" from Table 1 and Table 2.

__4.2__ Using the tables defined in 4.1: (A) create a Data Frame for each table; (B) create a Data Frame containing the information from both Table 1 and Table 2; (C) delete from the combined Data Frame all rows containing "NaN".

# Homework (not graded)
Please complete all the exercises in this notebook. Some exercises will be solved in class, but you should complete all remaining exercises at the end of each notebook after every class. If you cannot solve an exercise, please contact the class teaching assistant for help immediately.

# References
- Pandas, https://pandas.pydata.org
- Series, DataFrame, https://www.tutorialspoint.com/python_pandas/
- Wes McKinney (creator of Pandas), _Python for Data Analysis_, Chapter 5

_Last updated on 9.10.18 6:27pm<br>
(C) 2018 Complex Adaptive Systems Laboratory all rights reserved._