# Pandas DataFrames
---

## 1. Pandas DataFrames - Definition:

In the previous unit we learnt about the concept of __Series__. Pandas __DataFrames__ are a natural continuation of it.

A __DataFrame__ is a 2-dimensional array of data, indexed by rows and columns. Each column in a __DataFrame__ corresponds to a __Series__ object. Put simply, a __DataFrame__ is the Pandas term for a table. And just like with any table, each cell in it is uniquely identified by its row and column index. In that sense:

- Each __DataFrame__ is a collection of __Series__ (its columns), and each of those Series has a single data type
- We can think of each pair (row_index, column_index) as a key of an element in the __DataFrame__
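
The two points above can be sketched with a small example. The data, the column names (`name`, `age`) and the row labels (`r1`, `r2`) are made up for illustration only:

```python
import pandas as pd

# Hypothetical data: two columns with different data types
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [31, 45]},
    index=["r1", "r2"],
)

# Selecting a single column returns a Series
age = df["age"]
print(type(age).__name__)   # Series

# A (row label, column label) pair identifies exactly one cell
cell = df.loc["r2", "age"]
print(cell)                 # 45
```

Here `.loc` is used for the label-based lookup; indexing is covered in more detail later.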


In [22]:
import pandas as pd
import datetime as dt

---
## 2. Constructing a Pandas DataFrame:

A __DataFrame__ object can be created in many ways. To see why, recall that a simple __Series__ object can be constructed from a list as well as a dictionary. Now we are handling not a 1- but a 2-dimensional object. In that sense, there are multiple ways to navigate the construction - horizontally via creating a collection of rows, or vertically - via a collection of columns!

### 2.1 Constructing a DataFrame with 1 Column:

A __DataFrame__ can have a single column. Yes - this really looks like a Series. However, once initialised as a DataFrame object, pandas will treat it as such - a table with only one column. A single-column DataFrame can be built in 2 ways:

- From a List:
    - `pd.DataFrame([item1, item2, ...], columns = ['column1'])` 
    
- From a Series:
    - `pd.DataFrame(series_name)`

In [4]:
df = pd.DataFrame([1, 2, 3, 4], columns=["column1"])
display(df)

|   | column1 |
|---|---------|
| 0 | 1       |
| 1 | 2       |
| 2 | 3       |
| 3 | 4       |


In [6]:
s = pd.Series({"a": 123, "b": 456})
df = pd.DataFrame(s)
display(df)

# unless the column name is specified, it will be assigned a default value
df = df.rename(columns={0:"column_1"})
df

|   | 0   |
|---|-----|
| a | 123 |
| b | 456 |


|   | column_1 |
|---|----------|
| a | 123      |
| b | 456      |
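
As an alternative to renaming the default column afterwards, the column name can be set up front. The following is a minimal sketch of two ways this might be done, reusing the same Series data as above:

```python
import pandas as pd

s = pd.Series({"a": 123, "b": 456})

# Option 1: name the column while converting the Series to a DataFrame
df_a = s.to_frame(name="column_1")

# Option 2: give the Series a name first; the DataFrame reuses it as the column name
named = pd.Series({"a": 123, "b": 456}, name="column_1")
df_b = pd.DataFrame(named)

print(list(df_a.columns))
print(df_a.equals(df_b))
```

Both options should produce the same single-column table as the `rename` approach.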


### 2.2 Constructing a DataFrame with multiple columns:

Now let's explore the ways to build a multi-column __DataFrame__. Below is a brief syntax guide on the different ways to do it:

- From a __List of Lists__ - each inner list corresponds to a row:
    - `pd.DataFrame([list1, list2, list3, ...], columns = ['column1', ...])` where `list1 = [item1, item2, ...]`
    
    
- From a __List of Dictionaries__ - each dictionary corresponds to a row; each key in a dict corresponds to a column name
    - `pd.DataFrame([dict1, dict2, dict3, ...])` where `dict1 = {'column1': value1, 'column2': value2, ...}`
   
   
- From a __Dictionary of List Values__ - each key corresponds to a column name; each list corresponds to column values
    - `pd.DataFrame({'column1': list1, 'column2': list2, ...})` where `list1 = [item1, item2, ...]`
    
    
- From a __Dictionary of Series__ - each key corresponds to a column name; each Series corresponds to column values
    - `pd.DataFrame({'column1': series1, 'column2': series2, ...})` where `series1 = pd.Series(...)`

In [7]:
# list of lists
data = [
    [1,2], # row 1
    [3,4], # row 2
    [5,6], # row 3
]
df = pd.DataFrame(data, columns=["col1", "col2"])
display(df)

|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2    |
| 1 | 3    | 4    |
| 2 | 5    | 6    |


In [8]:
# From a list of dictionaries
data = [
    {"a": 111, "b": 222}, # row 1
    {"a": 333, "b": 444}, # row 2
    {"b": 666, "a": 555}, # row 3, note the order of items in the dictionary doesn't matter, the keys do
]
df = pd.DataFrame(data)
df

|   | a   | b   |
|---|-----|-----|
| 0 | 111 | 222 |
| 1 | 333 | 444 |
| 2 | 555 | 666 |
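
Because the keys decide which column a value lands in, a dictionary that lacks one of the keys simply leaves that cell empty. A small sketch with made-up values:

```python
import pandas as pd

data = [
    {"a": 111, "b": 222},  # row 1
    {"a": 333},            # row 2 has no "b" key
]
df = pd.DataFrame(data)
print(df)

# The missing cell is filled with NaN
print(df["b"].isna().tolist())   # [False, True]
```

Note that column `b` is stored as floats here, since NaN is a floating-point value.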


In [9]:
# From a dictionary of list values
data = {
    "a": [1,2,3,4,5],
    "b": [5,6,7,8,9],
}
df = pd.DataFrame(data)
df

|   | a | b |
|---|---|---|
| 0 | 1 | 5 |
| 1 | 2 | 6 |
| 2 | 3 | 7 |
| 3 | 4 | 8 |
| 4 | 5 | 9 |


In [10]:
# From a dictionary of Series
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["a", "b"])
data = {"series1": s1, "series2": s2}
df = pd.DataFrame(data)
df

|   | series1 | series2 |
|---|---------|---------|
| a | 1       | 3       |
| b | 2       | 4       |


In [11]:
# From a dictionary of Series - showing index alignment
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "c"])
data = {"series1": s1, "series2": s2}
df = pd.DataFrame(data)
df # NaN (Not a Number) marks a missing value, similar to NULL in SQL

|   | series1 | series2 |
|---|---------|---------|
| a | 1.0     | NaN     |
| b | 2.0     | 3.0     |
| c | NaN     | 4.0     |


---
## 3. Getting the Index, Columns and Values from a DataFrame:

Obtaining information on values and the index of a DataFrame is identical to how we do it with Series:

In [12]:
# index
df.index

Index(['a', 'b', 'c'], dtype='object')

In [13]:
# columns
df.columns

Index(['series1', 'series2'], dtype='object')

In [14]:
# values
df.values # Returns a NumPy array of the values; the index and column labels are lost

array([[ 1., nan],
       [ 2.,  3.],
       [nan,  4.]])
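
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for getting the underlying array. A minimal sketch, rebuilding the same DataFrame as in the index-alignment example:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "c"])
df = pd.DataFrame({"series1": s1, "series2": s2})

arr = df.to_numpy()          # plain NumPy array, no labels attached
print(type(arr).__name__)    # ndarray
print(arr.shape)             # (3, 2)

# The labels live on the DataFrame, not on the array
print(df.index.tolist())     # ['a', 'b', 'c']
print(df.columns.tolist())   # ['series1', 'series2']
```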

---
## 4. DataFrame - Column Data Types:

Since a __DataFrame__ is really just a collection of __Series__, we can easily obtain the data types of all columns in the same way we did with Series. To cast a new data type onto a column however, we now have to specify the column name of interest:

In [16]:
# datatypes of each column
df.dtypes

series1    float64
series2    float64
dtype: object

In [17]:
# Typecasting a dataframe column
df[["series1"]] = df[["series1"]].astype(str) # Note the interesting way we indexed the column, we'll talk more about this later
df.dtypes

series1     object
series2    float64
dtype: object
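
Another way to cast selected columns is to pass a `{column name: type}` mapping to `DataFrame.astype`. The sketch below uses a small made-up DataFrame without missing values, since casting NaN to an integer type would raise an error:

```python
import pandas as pd

# Hypothetical data with no missing values
df = pd.DataFrame({"series1": [1.0, 2.0], "series2": [3.0, 4.0]})

# Cast one column by passing a {column name: type} mapping;
# astype returns a new DataFrame, so the result is assigned back
df = df.astype({"series1": "int64"})
print(df.dtypes)

# Equivalent single-column form
df["series2"] = df["series2"].astype("int64")
print(df.dtypes)
```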

---
## 5. DataFrame Shape:

As we know, the __shape__ of an object describes its dimensions. With __Series__ we saw that the `.shape` attribute returns a one-element tuple __(x,)__, where __x__ is the number of elements in the Series.


__DataFrames__ are however 2-dimensional, so we would expect to obtain information on 2 things - the number of rows and number of columns in it. 

Remember: The output of `df.shape` is a pair __(x,y)__ where:
- __x__ corresponds to the number of rows
- __y__ corresponds to the number of columns

In [20]:
display(df)
df.shape # 2 dimensional - 3 rows, 2 columns

|   | series1 | series2 |
|---|---------|---------|
| a | 1.0     | NaN     |
| b | 2.0     | 3.0     |
| c | nan     | 4.0     |


(3, 2)
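
Since `df.shape` is an ordinary tuple, it can be unpacked directly. A short sketch on a made-up DataFrame, also contrasting it with the shape of a single column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape   # tuple unpacking
print(n_rows, n_cols)       # 3 2

print(len(df))              # 3, len() counts rows
print(df["a"].shape)        # (3,), a Series has only one dimension
```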

---
## 6. The Pandas Index Object:

We encountered the concept of __Index__ with both __Series__ and __DataFrames__. 

__Pandas Index__ is an immutable sequence used for indexing and alignment - the basic object, storing axis labels for all Pandas objects. Think of an index as an immutable list or tuple.

In the context of working with Series and DataFrames, unless explicitly specified, indices will be automatically created. Below we are showing a couple of ways to explicitly construct an __Index__ object via the `pd.Index()` constructor:

In [21]:
# Create an integer index
idx = pd.Index([1,2,3])
idx

Int64Index([1, 2, 3], dtype='int64')

In [23]:
# Create a datetime index
# import datetime as dt
pd.Index([dt.datetime(2022,6,27), dt.datetime(2022,6,28)])

DatetimeIndex(['2022-06-27', '2022-06-28'], dtype='datetime64[ns]', freq=None)

In [25]:
# Pandas indices are immutable objects -- we can't change their values in place
idx = pd.Index([1,2,3])
idx[1] = 5 # We get an error

TypeError: Index does not support mutable operations
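
Immutability only prevents changing an existing Index in place. A DataFrame can still be given a different index. The sketch below, on made-up data, shows two common ways this might be done:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

try:
    df.index[1] = 5          # item assignment on an Index is not allowed
except TypeError as err:
    print("TypeError:", err)

# Replacing the whole index with a new Index object is allowed
df.index = pd.Index([1, 5, 3])
print(df.index.tolist())     # [1, 5, 3]

# rename returns a new DataFrame with the selected labels changed
renamed = df.rename(index={5: 50})
print(renamed.index.tolist())  # [1, 50, 3]
```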

---
## 7. Summary:

- Pandas __DataFrame__ is a 2-dimensional array of data, indexed by rows and columns
- DataFrames can be constructed in multiple ways - via list of lists, list of dictionaries, dictionary of lists, etc.
- Obtaining the values, index, column data types and shape of a DataFrame works the same way as with Series; DataFrames additionally expose their column labels via `.columns`
- Pandas __Index__ Object is an immutable sequence for indexing and alignment of Pandas objects

---
## 8. Concept Check:

1. What is a pandas DataFrame? How is it different to a Series?
2. What are some useful attributes of a DataFrame instance?
3. What is the data type for a column in a dataframe?
4. Construct the following dataframe:

|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2    |
| 1 | 3    | 4    |
| 2 | 5    | 6.0  |

Using:

   - a. A list of lists
   - b. A list of tuples
   - c. A dictionary of lists
   - d. A dictionary of pd.Series objects

5. Construct a pandas.Index object of length 20, consisting of the first 20 days of this month

In [26]:
# 1. dataframes are 2-dimensional, series are 1-dimensional
# 2. df.index, df.columns, df.values, df.shape
# 3. pd.Series object
# 4a.
my_data1 = [[1,2], [3,4], [5,6.0]]
df1 = pd.DataFrame(my_data1, columns=['col1', 'col2'])
display(df1)
# 4b.
my_data2 = [(1,2), (3,4), (5,6.0)]
df2 = pd.DataFrame(my_data2, columns=['col1', 'col2'])
display(df2)
# 4c.
my_data3 = {'col1': [1,3,5], 'col2': [2,4,6.0]}
df3 = pd.DataFrame(my_data3)
display(df3)
# 4d.
s1 = pd.Series([1,3,5])
s2 = pd.Series([2,4,6.0])
df4 = pd.DataFrame({'col1': s1, 'col2': s2})
display(df4)
# 5.
my_data4 = [dt.datetime(2022, 6, x) for x in range(1,21)]
idx = pd.Index(my_data4)
display(idx)

|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2.0  |
| 1 | 3    | 4.0  |
| 2 | 5    | 6.0  |


|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2.0  |
| 1 | 3    | 4.0  |
| 2 | 5    | 6.0  |


|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2.0  |
| 1 | 3    | 4.0  |
| 2 | 5    | 6.0  |


|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 2.0  |
| 1 | 3    | 4.0  |
| 2 | 5    | 6.0  |


DatetimeIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-04',
               '2022-06-05', '2022-06-06', '2022-06-07', '2022-06-08',
               '2022-06-09', '2022-06-10', '2022-06-11', '2022-06-12',
               '2022-06-13', '2022-06-14', '2022-06-15', '2022-06-16',
               '2022-06-17', '2022-06-18', '2022-06-19', '2022-06-20'],
              dtype='datetime64[ns]', freq=None)
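
The answer to question 5 above hardcodes June 2022. A possible variant that follows the current month instead could use `pd.date_range`. This is only a sketch, and the variable names `today` and `first_day` are chosen for illustration:

```python
import datetime as dt
import pandas as pd

today = dt.date.today()
first_day = today.replace(day=1)   # first day of the current month

# 20 consecutive daily timestamps starting on the first of the month
idx = pd.date_range(start=first_day, periods=20, freq="D")

print(len(idx))
print(idx[0], idx[-1])
```

Every month has at least 28 days, so the 20 dates always stay within the current month.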