# Pandas DataFrames
---

## 1. Pandas DataFrames - Definition:

In the previous unit we learnt about the concept of __Series__. Pandas __DataFrames__ are a natural continuation of it.

A __DataFrame__ is a 2-dimensional array of data, indexed by rows and columns. Each column in a __DataFrame__ corresponds to a __Series__ object. Put simply, a __DataFrame__ is the Pandas term for a table. Just like with any table, each cell in it is uniquely identified by its row and column index. In that sense:

- Each __DataFrame__ is a collection of __Series__ that share a common index, each having a single data type
- We can think of each pair (row_index, column_index) as a key of an element in the __DataFrame__
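
As a minimal sketch of the two points above, the table below (its labels and values are made up for illustration) shows that selecting a column gives back a Series, and that a (row label, column label) pair addresses exactly one cell:

```python
import pandas as pd

# Hypothetical example table; names and values are illustrative only
df_example = pd.DataFrame(
    {'price': [1.5, 2.0], 'quantity': [10, 20]},
    index=['apple', 'banana'],
)

# Each column is a Series with its own single data type
price_column = df_example['price']
print(type(price_column), price_column.dtype)

# A (row label, column label) pair identifies exactly one cell
cell = df_example.at['banana', 'quantity']
print(cell)
```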


In [None]:
# Imports
import pandas as pd
import numpy as np
import datetime as dt

---
## 2. Constructing a Pandas DataFrame:

A __DataFrame__ object can be created in many ways. To see why, recall that a simple __Series__ object can be constructed from a list as well as a dictionary. A DataFrame is not a 1- but a 2-dimensional object, so there are multiple ways to approach its construction: horizontally, as a collection of rows, or vertically, as a collection of columns.

### 2.1 Constructing a DataFrame with 1 Column:

A __DataFrame__ can have a single column. This looks very much like a Series; however, once initialised as a DataFrame object, Pandas will treat it as such: a table with only one column. A single-column DataFrame can be built in 2 ways:

- From a List:
    - `pd.DataFrame([item1, item2, ...], columns = ['column1'])` 
    
- From a Series:
    - `pd.DataFrame(series_name)`

In [None]:
# Constructing a dataframe from a simple list
df = pd.DataFrame([1,2,3,4], columns=['column1'])
display(df)

In [None]:
# Constructing a dataframe from a series object
s  = pd.Series({'a':123, 'b':456})
df = pd.DataFrame(s)
display(df)

# unless the column name is specified, it will be assigned a default value 0 - we can rename the column in this way
df = df.rename(columns = {0:'column1'})
display(df)
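
As an alternative to renaming afterwards, the column name can be supplied up front. The sketch below assumes the same small Series as above and shows two options: naming the Series itself, or using `Series.to_frame()`:

```python
import pandas as pd

# Giving the Series a name means the resulting column carries that name
s_named = pd.Series({'a': 123, 'b': 456}, name='column1')
df_named = pd.DataFrame(s_named)
print(df_named.columns.tolist())

# Series.to_frame() performs the same conversion and accepts a column name
df_from_to_frame = pd.Series({'a': 123, 'b': 456}).to_frame(name='column1')
print(df_from_to_frame.columns.tolist())
```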

### 2.2 Constructing a DataFrame with multiple columns:

Now let's explore the ways to build a multi-column __DataFrame__. Below is a brief syntax guide on the different ways to do it:

- From a __List of Lists__ - each inner list corresponds to a row:
    - `pd.DataFrame([list1, list2, list3, ...], columns = ['column1', ...])` where `list1 = [item1, item2, ...]`
    
    
- From a __List of Dictionaries__ - each dictionary corresponds to a row; each key in a dict corresponds to a column name
    - `pd.DataFrame([dict1, dict2, dict3, ...])` where `dict1 = {'column1': value1, 'column2': value2, ...}`
   
   
- From a __Dictionary of List Values__ - each key corresponds to a column name; each list corresponds to column values
    - `pd.DataFrame({'column1': list1, 'column2': list2, ...})` where `list1 = [item1, item2, ...]`
    
    
- From a __Dictionary of Series__ - each key corresponds to a column name; each Series corresponds to column values
    - `pd.DataFrame({'column1': series1, 'column2': series2, ...})` where `series1 = pd.Series(...)`

In [None]:
# Constructing a dataframe using a list of lists.
data = [[1,2], # row 1
        [3,4], # row 2
        [5,6]] # row 3

df = pd.DataFrame(data, columns=['col1', 'col2'])
display(df)

In [None]:
# From a list of dictionaries
data = [{'a':111, 'b':222}, # row 1
        {'a':333, 'b':444}, # row 2
        {'b':666, 'a':555}] # row 3 # Note, the order of the items in the dictionary doesn't matter, the keys do.

df = pd.DataFrame(data)
display(df)
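
Because the keys determine the columns, a dictionary does not need to contain every key. A small sketch with made-up rows shows what happens when one row lacks a key: the missing cell is filled with `NaN`, and the column becomes a float column as a result:

```python
import pandas as pd

rows = [{'a': 111, 'b': 222},  # row 1 has both keys
        {'a': 333}]            # row 2 has no 'b' key

df_missing = pd.DataFrame(rows)
print(df_missing)
print(df_missing.dtypes)
```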

In [None]:
# From a dictionary of list values
data = {'a':[1,2,3,4], 'b':[5,6,7,8]}

df = pd.DataFrame(data)
display(df)

In [None]:
# From a dictionary of series
s1 = pd.Series([1,2], index=['a','b'])
s2 = pd.Series([3,4], index=['a','b'])

data = {'series1':s1, 'series2':s2}

df = pd.DataFrame(data)
display(df)

In [None]:
# From a dictionary of series - showing index alignment
s1 = pd.Series([1,2], index=['a','b'])
s2 = pd.Series([3,4], index=['b','c'])
data = {'series1':s1, 'series2':s2}

df = pd.DataFrame(data)
display(df)
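
To make the alignment explicit, the sketch below rebuilds the same two Series and inspects the result: the row index is the union of both indexes, and any label missing from one Series appears as `NaN` in that Series' column:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])
df_aligned = pd.DataFrame({'series1': s1, 'series2': s2})

# The row index is the union of both Series indexes
print(df_aligned.index.tolist())

# True marks the cells that were filled with NaN during alignment
print(df_aligned.isna())
```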

---
## 3. Getting the Index, Columns and Values from a DataFrame:

The index and values of a DataFrame are obtained just as with a Series; a DataFrame additionally exposes its column labels via `.columns`:

In [None]:
# Getting the index
df.index

In [None]:
# Getting the columns
df.columns

In [None]:
# Getting the values
df.values # Returns an array of the values
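
As a small illustration of these three attributes, the sketch below uses a made-up DataFrame with one integer and one float column. Note that `.values` returns a single NumPy array, so the columns are converted to a common type (here, float):

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df_attrs = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]}, index=['x', 'y'])

print(df_attrs.index.tolist())
print(df_attrs.columns.tolist())

# .values is a 2-dimensional NumPy array with one common dtype
values = df_attrs.values
print(type(values), values.dtype, values.shape)
```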

---
## 4. DataFrame - Column Data Types:

Since a __DataFrame__ is really just a collection of __Series__, we can easily obtain the data types of all columns in the same way we did with Series. To cast a new data type onto a column however, we now have to specify the column name of interest:

In [None]:
# Getting datatypes for each column
df.dtypes

In [None]:
# Typecasting a dataframe column
df[['series1']] = df[['series1']].astype(str) # Note the interesting way we indexed the column, we'll talk more about this later.
df.dtypes
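
The difference between the two ways of selecting a column can be sketched as follows, using a made-up DataFrame: single brackets return a Series, while double brackets (a list of column names) return a DataFrame. Either form can be used for typecasting:

```python
import pandas as pd

df_types = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4.0, 5.0, 6.0]})

# Single brackets select one column as a Series
print(type(df_types['col1']))

# Double brackets select a DataFrame, even for a single column
print(type(df_types[['col1']]))

# Typecasting through single-bracket selection
df_types['col1'] = df_types['col1'].astype(float)
print(df_types.dtypes)
```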

---
## 5. DataFrame Shape:

As we know, the __Shape__ of an object returns information on its dimensions. With __Series__ we saw that the `.shape` attribute returned a 1-element tuple __(x,)__, where __x__ is the number of elements in the Series.


__DataFrames__ are however 2-dimensional, so we would expect to obtain information on 2 things - the number of rows and number of columns in it. 

Remember: The output of `df.shape` is a pair __(x,y)__ where:
- __x__ corresponds to the number of rows
- __y__ corresponds to the number of columns

In [None]:
# 2 dimensions - 3 rows, 2 columns.
df.shape
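
For a side-by-side comparison, here is a minimal sketch with a made-up Series and DataFrame. Since `.shape` is a plain tuple, it can be unpacked directly into the number of rows and columns:

```python
import pandas as pd

s_shape = pd.Series([10, 20, 30])
df_shape = pd.DataFrame({'col1': [1, 3, 5], 'col2': [2, 4, 6]})

print(s_shape.shape)   # 1-element tuple: number of elements
print(df_shape.shape)  # 2-element tuple: (rows, columns)

# The tuple unpacks directly; len() of a DataFrame gives the row count
n_rows, n_cols = df_shape.shape
print(n_rows, n_cols, len(df_shape))
```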

---
## 6. The Pandas Index Object:

We encountered the concept of __Index__ with both __Series__ and __DataFrames__. 

__Pandas Index__ is an immutable sequence used for indexing and alignment - the basic object, storing axis labels for all Pandas objects. Think of an index as an immutable list or tuple.

In the context of working with Series and DataFrames, unless explicitly specified, indexes will be automatically created. Below are a couple of ways to explicitly construct an __Index__ object via the `pd.Index()` constructor:

In [None]:
# Creating an integer index
idx = pd.Index([1,2,3])
idx

In [None]:
# Creating a datetime index
pd.Index([dt.datetime(2020,1,1), dt.datetime(2020,1,2)])
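
When given datetime values, `pd.Index` returns a specialised `DatetimeIndex`. For consecutive dates, `pd.date_range` is a convenient alternative to listing each date by hand. A short sketch, with dates chosen only for illustration:

```python
import datetime as dt
import pandas as pd

# Built from explicit datetime objects
idx_dt = pd.Index([dt.datetime(2020, 1, 1), dt.datetime(2020, 1, 2)])
print(type(idx_dt))

# Built from a start date and a number of daily periods
idx_range = pd.date_range(start='2020-01-01', periods=2, freq='D')
print(idx_range)
```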

In [None]:
# Pandas indexes are immutable objects -- we can't change their values.
idx = pd.Index([1,2,3])
idx[1] = 5 # Raises a TypeError
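
The cell above stops with an error. The sketch below catches that error instead, and then shows one way to "change" labels in practice: build a new Index and assign it to the DataFrame as a whole. The DataFrame here is made up for illustration:

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# Item assignment on an Index raises a TypeError
error_raised = False
try:
    idx[1] = 5
except TypeError as err:
    error_raised = True
    print('TypeError:', err)

# Replace the whole index with a new Index object instead
df_idx = pd.DataFrame({'col1': [10, 20, 30]}, index=idx)
df_idx.index = pd.Index([1, 5, 3])
print(df_idx.index.tolist())
```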

---
## 7. Summary:

- Pandas __DataFrame__ is a 2-dimensional array of data, indexed by rows and columns
- DataFrames can be constructed in multiple ways - via list of lists, list of dictionaries, dictionary of lists, etc.
- Obtaining the values, index, data types and shape of a DataFrame works much like it does for a Series; a DataFrame additionally has column labels, available via `.columns`
- Pandas __Index__ Object is an immutable sequence for indexing and alignment of Pandas objects

---
## 8. Concept Check:

1. What is a pandas DataFrame? How is it different to a Series?
2. What are some useful attributes of a DataFrame instance?
3. What is the data type for a column in a dataframe?
4. Construct the following dataframe:

|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2    |
| 1 | 3    | 4    |
| 2 | 5    | 6.0  |

Using:

   - a. A list of lists
   - b. A list of tuples
   - c. A dictionary of lists
   - d. A dictionary of pd.Series objects

5. Construct a pandas.Index object of length 20, consisting of the first 20 days of this month

In [None]:
# 1. dataframes are 2-dimensional, series are 1-dimensional
# 2. df.index, df.columns, df.values, df.shape
# 3. pd.Series object
# 4a.
my_data1 = [[1,2], [3,4], [5,6.0]]
df1 = pd.DataFrame(my_data1, columns=['col1', 'col2'])
display(df1)
# 4b.
my_data2 = [(1,2), (3,4), (5,6.0)]
df2 = pd.DataFrame(my_data2, columns=['col1', 'col2'])
display(df2)
# 4c.
my_data3 = {'col1': [1,3,5], 'col2': [2,4,6.0]}
df3 = pd.DataFrame(my_data3)
display(df3)
# 4d.
s1 = pd.Series([1,3,5])
s2 = pd.Series([2,4,6.0])
df4 = pd.DataFrame({'col1': s1, 'col2': s2})
display(df4)
# 5.
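today = dt.date.today() # use the current year and month rather than a hard-coded date
my_data4 = [dt.datetime(today.year, today.month, day) for day in range(1,21)]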
idx = pd.Index(my_data4)
display(idx)