<a href="https://colab.research.google.com/github/jbpost2/ST-554-Big-Data-With-Python-Course-Notes/blob/main/01_Programming_in_python/15_Pandas_Data_Frames.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Data Frames

Justin Post

- Pandas data frames are a 2D data structure
    + Each column is a `Series` object
    + Columns can have different types (just like most common data sets!)


## Creating a `DataFrame`

- Most of the time we'll read data from a raw file directly into a `DataFrame`
- However, you can create one with the `pd.DataFrame()` function

In [107]:
import pandas as pd
import numpy as np

### Creating a Data Frame from Lists

- `zip()` lists of the same length together
- specify column names via `columns =` with a list of the appropriate length
- specify row names via `index =` with a list of the appropriate length (if you want!)

In [109]:
#populate some lists, each of equal length
name = ['Alice', 'Bob','Charlie','Dave','Eve','Francesca','Greg']
age = [20, 21, 22, 23, 22, 21, 22]
major = ['Statistics', 'History', 'Chemistry', 'English', 'Math', 'Civil Engineering','Statistics']

#create the data frame
my_df = pd.DataFrame(zip(name, age, major), columns = ["name", "age", "major"])
my_df

| | name | age | major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |
| 6 | Greg | 22 | Statistics |
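
The `index =` argument from the list above is not used in the cell, so here is a minimal sketch of it. The row labels in `ids` are hypothetical and made up purely for illustration.

```python
import pandas as pd

name = ['Alice', 'Bob', 'Charlie']
age = [20, 21, 22]
major = ['Statistics', 'History', 'Chemistry']

# hypothetical row labels, invented for this example
ids = ['s1', 's2', 's3']

labeled_df = pd.DataFrame(
    zip(name, age, major),
    columns=["name", "age", "major"],
    index=ids,  # one label per row, same length as the lists
)
print(labeled_df)
print(labeled_df.index)
```

If `index =` is left out, pandas falls back to the default labels 0, 1, 2, ... as seen in the output above.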


---

### Creating a Data Frame from a Dictionary

- The `pd.DataFrame()` function can create data frames from many objects
- For a dictionary, the keys become the column names (values **must** be of the same length)

In [110]:
people = {'Name': ['Alice', 'Bob','Charlie','Dave','Eve','Francesca','Greg'],
          'Age': [20, 21, 22, 23, 22, 21, 22],
          'Major': ['Statistics', 'History', 'Chemistry', 'English', 'Math', 'Civil Engineering','Statistics'],
         }
people

{'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve', 'Francesca', 'Greg'],
 'Age': [20, 21, 22, 23, 22, 21, 22],
 'Major': ['Statistics',
  'History',
  'Chemistry',
  'English',
  'Math',
  'Civil Engineering',
  'Statistics']}

In [111]:
my_df = pd.DataFrame(people)
my_df

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |
| 6 | Greg | 22 | Statistics |
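
To see why the values must be of the same length, the sketch below passes a deliberately broken dictionary (`bad_people` is a made-up example, not part of the notes) and catches the resulting error.

```python
import pandas as pd

# hypothetical dictionary whose lists have different lengths
bad_people = {'Name': ['Alice', 'Bob', 'Charlie'],
              'Age': [20, 21]}  # one value short

try:
    pd.DataFrame(bad_people)
    error_message = None
except ValueError as err:
    # pandas refuses to build a data frame from ragged columns
    error_message = str(err)

print(error_message)
```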


---

### Creating a Data Frame from a `NumPy` Array

- If you have a 2D `numpy` array, the conversion to a data frame is natural
- You can specify the column names with `columns = ` and the indices with `index =`

In [112]:
my_array = np.random.random((5,3))
print(my_array.shape)
my_array

(5, 3)


array([[0.29048793, 0.20192264, 0.13982855],
       [0.03380887, 0.82833162, 0.48653212],
       [0.21379655, 0.33554069, 0.80907267],
       [0.31462832, 0.87751129, 0.45655808],
       [0.15288996, 0.63591624, 0.88460099]])

In [113]:
my_df2 = pd.DataFrame(my_array, columns=["1st", "2nd", "3rd"], index=["a", "b", "c", "d", "e"])
my_df2

| | 1st | 2nd | 3rd |
|---|---|---|---|
| a | 0.290488 | 0.201923 | 0.139829 |
| b | 0.033809 | 0.828332 | 0.486532 |
| c | 0.213797 | 0.335541 | 0.809073 |
| d | 0.314628 | 0.877511 | 0.456558 |
| e | 0.152890 | 0.635916 | 0.884601 |


---

## Indexing a Data Frame

### Indexing Columns with `[]`

- DataFrames have a `.columns` attribute

In [114]:
my_df2.columns

Index(['1st', '2nd', '3rd'], dtype='object')

- Access the columns using the column names and 'selection brackets'

In [115]:
my_df2["1st"]

a    0.290488
b    0.033809
c    0.213797
d    0.314628
e    0.152890
Name: 1st, dtype: float64

- Note that what gets returned is just a `Series`!

In [116]:
type(my_df2["1st"])

pandas.core.series.Series

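- You can also return a column by name using attribute syntax (this works only when the name is a valid Python identifier that doesn't clash with an existing `DataFrame` attribute or method)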

In [117]:
my_df.Major

0           Statistics
1              History
2            Chemistry
3              English
4                 Math
5    Civil Engineering
6           Statistics
Name: Major, dtype: object

In [118]:
type(my_df.Major)

pandas.core.series.Series
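
Attribute syntax is convenient but has limits. The sketch below uses a small hypothetical data frame (`scores`, with column names chosen only to illustrate the problem) to show two cases where `[]` is the safer choice.

```python
import pandas as pd

# hypothetical columns chosen to show where attribute access breaks down
scores = pd.DataFrame({"count": [3, 5], "first name": ["Alice", "Bob"]})

# 'count' is also a DataFrame method, so the attribute is the method, not the column
print(callable(scores.count))

# brackets always refer to the column
print(scores["count"].tolist())

# a name containing a space can't be written as an attribute at all
print(scores["first name"].tolist())
```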

- Returning more than one column is easy
- You can give a list of the column names you want

In [119]:
my_df[['Name', 'Age']]

| | Name | Age |
|---|---|---|
| 0 | Alice | 20 |
| 1 | Bob | 21 |
| 2 | Charlie | 22 |
| 3 | Dave | 23 |
| 4 | Eve | 22 |
| 5 | Francesca | 21 |
| 6 | Greg | 22 |


- Note you can't use slicing for columns using just `[]` (we'll need to use `.iloc[]` or `.loc[]`)
- If you try to index with a slice, you get back the corresponding rows instead (see below)

### Indexing Rows by Slicing with `[]`

- You can also index the rows using `[]` if you supply a slice or a boolean array


In [120]:
my_df

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |
| 6 | Greg | 22 | Statistics |


In [121]:
my_df[3:5] #get the rows at positions 3 and 4 (the 4th and 5th rows)

| | Name | Age | Major |
|---|---|---|---|
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |


In [122]:
my_df2

| | 1st | 2nd | 3rd |
|---|---|---|---|
| a | 0.290488 | 0.201923 | 0.139829 |
| b | 0.033809 | 0.828332 | 0.486532 |
| c | 0.213797 | 0.335541 | 0.809073 |
| d | 0.314628 | 0.877511 | 0.456558 |
| e | 0.152890 | 0.635916 | 0.884601 |


In [123]:
my_df2[1:5]

| | 1st | 2nd | 3rd |
|---|---|---|---|
| b | 0.033809 | 0.828332 | 0.486532 |
| c | 0.213797 | 0.335541 | 0.809073 |
| d | 0.314628 | 0.877511 | 0.456558 |
| e | 0.152890 | 0.635916 | 0.884601 |


- Oddly, you can't return a single row with just a number (a single value inside `[]` is looked up as a column label)
- You can return it using slicing (recall that a `:` slice *usually* excludes the end value)

In [124]:
my_df2[1] #throws an error

KeyError: 1

In [126]:
my_df2[1:2] #return just one row

| | 1st | 2nd | 3rd |
|---|---|---|---|
| b | 0.033809 | 0.828332 | 0.486532 |
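
The difference between a one-row slice and a single row can be sketched as follows. The data frame `demo` is a small made-up example; `.iloc[]` is used only for comparison here.

```python
import numpy as np
import pandas as pd

# small hypothetical data frame with string row labels
demo = pd.DataFrame(np.arange(6).reshape(3, 2),
                    columns=["x", "y"], index=["a", "b", "c"])

one_row_frame = demo[1:2]       # slice -> DataFrame with one row
one_row_series = demo.iloc[1]   # integer position -> Series
first_col = demo["x"]           # a single label inside [] is looked up among the columns

print(type(one_row_frame).__name__, one_row_frame.shape)
print(type(one_row_series).__name__, one_row_series.shape)
```

A slice keeps the 2D structure, which is handy when later code expects a data frame rather than a `Series`.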


### Indexing Rows Using a Boolean Array with `[]`

- We often use a boolean `Series` to subset (rows with `True` are returned, rows with `False` are not)

In [127]:
my_df['Name'] == 'Alice' #create a boolean array

0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: Name, dtype: bool

In [128]:
my_df[my_df['Name'] == 'Alice'] #return just the True rows

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |


In [137]:
my_df[my_df['Age'] > 21] #return only rows that match

| | Name | Age | Major |
|---|---|---|---|
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 6 | Greg | 22 | Statistics |


---

#### Compound Logicals

- All the standard compound logical operators exist in element-wise form (Python's `and`, `or`, and `not` don't work on a `Series`)
- `&` (and), `|` (or), `~` (not), `^` (xor - exclusive or)


In [132]:
(my_df['Name'] == 'Alice')

0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: Name, dtype: bool

In [133]:
(my_df['Name'] == 'Greg')

0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: Name, dtype: bool

- Get either/or for these two booleans

In [134]:
(my_df['Name'] == 'Alice') | (my_df['Name'] == 'Greg')

0     True
1    False
2    False
3    False
4    False
5    False
6     True
Name: Name, dtype: bool

In [144]:
my_df[(my_df['Name'] == 'Alice') | (my_df['Name'] == 'Greg')]

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 6 | Greg | 22 | Statistics |


- When combining several logicals, use `()` around each comparison to keep things straight! (`&` and `|` are evaluated before comparisons such as `==` and `>`)

In [143]:
my_df[((my_df['Name'] == 'Alice') | (my_df['Name'] == 'Greg')) & (my_df['Age'] > 21)]

| | Name | Age | Major |
|---|---|---|---|
| 6 | Greg | 22 | Statistics |
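
The `~` and `^` operators are listed above but not demonstrated, so here is a minimal sketch on a small made-up data frame. It also shows what happens when the parentheses are forgotten.

```python
import pandas as pd

# small hypothetical data frame for illustration
df = pd.DataFrame({"Name": ["Alice", "Bob", "Greg"], "Age": [22, 21, 20]})

is_alice = df["Name"] == "Alice"
over_20 = df["Age"] > 20

not_alice = df[~is_alice]             # rows where the name is not Alice
exactly_one = df[is_alice ^ over_20]  # rows where exactly one condition holds

# Without parentheses, `&` is evaluated before `>` and `<`, so the line below
# is read as the chained comparison df["Age"] > (20 & df["Age"]) < 22 and fails
try:
    df[df["Age"] > 20 & df["Age"] < 22]
    failed = False
except ValueError:
    failed = True

print(not_alice)
print(exactly_one)
print(failed)
```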


---

## Indexing a Data Frame's Rows & Columns

- To index both rows and columns at once, we use the `.iloc[]` and `.loc[]` indexers

---

### Indexing Rows with `.iloc[]`

- Can access rows by their **integer location** using `.iloc[]`

In [145]:
my_df.iloc[0]

Name          Alice
Age              20
Major    Statistics
Name: 0, dtype: object

In [146]:
type(my_df.iloc[1])

pandas.core.series.Series

- The row is returned as a `Series` whose data type is as broad as it needs to be. Here it is returned with a data type of `object` (used for storing mixed [data types](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html))

In [147]:
my_df.iloc[1]

Name         Bob
Age           21
Major    History
Name: 1, dtype: object

- In our other data frame, `my_df2`, all elements in a row are floats, so that is the data type of the `Series` that is returned

In [148]:
my_df2.iloc[1].dtype

dtype('float64')

- You can return more than one row by passing a list (or similar type object, such as a `range()` call) of the numeric indices you want

In [149]:
my_df.iloc[[0,1]]

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |


In [150]:
my_df.iloc[2:5] #note this doesn't include the last value!

| | Name | Age | Major |
|---|---|---|---|
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |


In [151]:
my_df.iloc[range(0,3)]

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |


---

#### `.iloc[]` for Returning Rows and Columns

- `.iloc[]` allows for subsetting of columns by location too!
- Simply add a `,` to get the 2nd dimension (similar to subsetting a `numpy` array)

In [152]:
my_df.iloc[[0,1], [0, 2]] #rows [0,1], columns [0,2]

| | Name | Major |
|---|---|---|
| 0 | Alice | Statistics |
| 1 | Bob | History |


In [154]:
my_df.iloc[3:6, 0:2] #neither slice includes its end value

| | Name | Age |
|---|---|---|
| 3 | Dave | 23 |
| 4 | Eve | 22 |
| 5 | Francesca | 21 |
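
Because `.iloc[]` works purely by position, the usual Python position tricks apply. The following is a small sketch on a shortened, made-up version of the data.

```python
import pandas as pd

# shortened hypothetical version of the data used in these notes
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "Dave"],
                   "Age": [20, 21, 22, 23],
                   "Major": ["Statistics", "History", "Chemistry", "English"]})

last_row = df.iloc[-1]          # negative positions count from the end
every_other = df.iloc[::2, :]   # every second row, all columns
one_cell = df.iloc[2, 0]        # a single row and column position -> scalar value

print(last_row)
print(every_other)
print(one_cell)
```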


---

### Indexing Rows with `.loc[]`

- `.loc[]` is similar to `.iloc[]` but it allows for subsetting based on **labels** or **boolean arrays**
- Slicing has a slightly different behavior! The last value **is** included for `.loc[]`

In [155]:
my_df

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |
| 6 | Greg | 22 | Statistics |


In [156]:
my_df.loc[0] #0 is interpreted as a label, which exists for my_df

Name          Alice
Age              20
Major    Statistics
Name: 0, dtype: object

In [157]:
my_df2

| | 1st | 2nd | 3rd |
|---|---|---|---|
| a | 0.290488 | 0.201923 | 0.139829 |
| b | 0.033809 | 0.828332 | 0.486532 |
| c | 0.213797 | 0.335541 | 0.809073 |
| d | 0.314628 | 0.877511 | 0.456558 |
| e | 0.152890 | 0.635916 | 0.884601 |


In [158]:
my_df2.loc["b"]

1st    0.033809
2nd    0.828332
3rd    0.486532
Name: b, dtype: float64

- You can use slicing

In [159]:
my_df.loc[2:5] #note this includes the last value! (again interpreted as labels)

| | Name | Age | Major |
|---|---|---|---|
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |


In [160]:
my_df2.loc["b":"e"] #includes the last value!

| | 1st | 2nd | 3rd |
|---|---|---|---|
| b | 0.033809 | 0.828332 | 0.486532 |
| c | 0.213797 | 0.335541 | 0.809073 |
| d | 0.314628 | 0.877511 | 0.456558 |
| e | 0.152890 | 0.635916 | 0.884601 |
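
With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between `.loc[]` and `.iloc[]`. The sketch below uses a made-up data frame with integer labels 10, 20, 30, 40 to pull the two apart.

```python
import pandas as pd

# hypothetical data frame whose integer labels differ from the row positions
df = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

by_label = df.loc[10:30]      # labels 10 through 30, end included -> 3 rows
by_position = df.iloc[0:2]    # positions 0 and 1, end excluded -> 2 rows

# 0 is not a label in this index, so .loc[] can't find it
try:
    df.loc[0]
    zero_is_label = True
except KeyError:
    zero_is_label = False

print(by_label)
print(by_position)
print(zero_is_label)
```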


#### `.loc[]` for Returning Rows and Columns

- Just like with `.iloc[]` you can return both columns and rows if you put in a `,` for the dimensions (rows, columns)

In [161]:
my_df.loc[:3, ['Name', "Major"]]

| | Name | Major |
|---|---|---|
| 0 | Alice | Statistics |
| 1 | Bob | History |
| 2 | Charlie | Chemistry |
| 3 | Dave | English |


- You can use slicing on the column names too!

In [162]:
my_df.loc[:3, 'Name':"Major"]

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |


---

#### `.loc[]` Using a Boolean

- As with `[]`, we can use a boolean array to return only certain rows (and/or columns)
- The boolean array must have the same length as the dimension it subsets!

In [164]:
my_df['Age'] > 21 #create a boolean array

0    False
1    False
2     True
3     True
4     True
5    False
6     True
Name: Age, dtype: bool

In [166]:
my_df.loc[my_df['Age'] > 21]

| | Name | Age | Major |
|---|---|---|---|
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 6 | Greg | 22 | Statistics |


- Here we gain the advantage of being able to select columns of interest at the same time as subsetting the rows!

In [165]:
my_df.loc[my_df['Age'] > 21, ["Name", "Age"]] #still can return only selected columns

| | Name | Age |
|---|---|---|
| 2 | Charlie | 22 |
| 3 | Dave | 23 |
| 4 | Eve | 22 |
| 6 | Greg | 22 |


- You can use boolean arrays for both rows and columns
- Also, `.isin()` is a very convenient method for building them!


In [170]:
my_df.columns.isin(["Name", "Age"])

array([ True,  True, False])

In [171]:
my_df.loc[my_df["Age"] > 21, my_df.columns.isin(["Name", "Age"])]

| | Name | Age |
|---|---|---|
| 2 | Charlie | 22 |
| 3 | Dave | 23 |
| 4 | Eve | 22 |
| 6 | Greg | 22 |


---

## Operations on Data Frames

- The `.head()` and `.tail()` methods return the first and last few rows, respectively (five by default)

In [175]:
my_df.head()

| | Name | Age | Major |
|---|---|---|---|
| 0 | Alice | 20 | Statistics |
| 1 | Bob | 21 | History |
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |


In [176]:
my_df.tail()

| | Name | Age | Major |
|---|---|---|---|
| 2 | Charlie | 22 | Chemistry |
| 3 | Dave | 23 | English |
| 4 | Eve | 22 | Math |
| 5 | Francesca | 21 | Civil Engineering |
| 6 | Greg | 22 | Statistics |


- The `.shape` attribute contains the dimensions of the data frame as a tuple: (rows, columns)

In [177]:
my_df.shape

(7, 3)
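
Both `.head()` and `.tail()` accept the number of rows to return, and `.shape` can be unpacked like any tuple. Here is a minimal sketch on a made-up ten-row data frame.

```python
import pandas as pd

# hypothetical ten-row data frame
df = pd.DataFrame({"x": range(10)})

first_two = df.head(2)     # first 2 rows instead of the default 5
last_three = df.tail(3)    # last 3 rows
n_rows, n_cols = df.shape  # unpack the (rows, columns) tuple

print(first_two)
print(last_three)
print(n_rows, n_cols)
```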

- `.info()` method gives information about the data frame

In [179]:
my_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    7 non-null      object
 1   Age     7 non-null      int64 
 2   Major   7 non-null      object
dtypes: int64(1), object(2)
memory usage: 296.0+ bytes


In [180]:
my_df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   1st     5 non-null      float64
 1   2nd     5 non-null      float64
 2   3rd     5 non-null      float64
dtypes: float64(3)
memory usage: 332.0+ bytes


- Obtain a quick one-way frequency table with `.value_counts()` on a column

In [178]:
my_df["Major"].value_counts()

Statistics           2
History              1
Chemistry            1
English              1
Math                 1
Civil Engineering    1
Name: Major, dtype: int64
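
If proportions are more useful than raw counts, `.value_counts()` takes a `normalize` argument. As a sketch, using the majors from these notes in a standalone `Series`:

```python
import pandas as pd

major = pd.Series(["Statistics", "History", "Chemistry", "English", "Math",
                   "Civil Engineering", "Statistics"], name="Major")

counts = major.value_counts()               # raw counts, most frequent first
props = major.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(props)
```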

---

## Quick Video

This video shows the creation of a `DataFrame` and how to add a new column and reorder the rows (or columns) with `.sort_values()`.

We also check out a few other methods such as

+ `.dropna()`: removes rows with missing values (returns a new data frame; pass `inplace=True` to modify the original instead)
+ `.fillna()`: replaces missing values with a value you supply
+ `.describe()`: basic summary statistics
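
For readers without access to the video, the methods listed above can be sketched as follows. The data frame and the `Senior` column are made up for this illustration and are not taken from the video itself.

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing age
df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "Dave"],
                   "Age": [20, 21, np.nan, 23]})

df["Senior"] = df["Age"] >= 22                    # add a new column by assignment
by_age = df.sort_values("Age", ascending=False)   # missing values are placed last
no_missing = df.dropna()                          # new data frame without Charlie's row
filled = df.fillna({"Age": df["Age"].mean()})     # fill the missing age with the mean
summary = df["Age"].describe()                    # count, mean, std, min, quartiles, max

print(by_age)
print(no_missing)
print(filled)
print(summary)
```

Note that `df` itself still contains the missing value afterwards, since neither `.dropna()` nor `.fillna()` modifies the original unless asked to.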


---

## Recap

- Data Frames are great for storing a data set (2D)
    + Rows = observations, Columns = variables
    + Many ways to create them (from a dictionary, list, array, etc.)
    + Many ways to subset them!
    + `.info()`, `.head()` and other useful methods!
