# DataFrame
A DataFrame is an ordered collection of named columns, which makes it a natural fit for rectangular (tabular) data. Every DataFrame has both a row index and a column index. The following diagram describes the DataFrame created by the Python code below.

  <img alt="" src="./images/dataframe.png">

In [2]:
import pandas as pd

columns = {
    "Age" : [45,55,30,20,10],
    "Sex" : ["M", "M", "F", "M", "F"]
}

index = ["Bob", "Dave", "Anna", "John", "Sally"]

df = pd.DataFrame(columns, index)
df.index.name = "Name"
df.columns.name = "Attributes"
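
As a quick check of the two indexes described above, the sketch below rebuilds the same DataFrame and inspects its row labels, column labels, and shape. It passes `index` by keyword purely for readability; the variable names `row_labels`, `column_labels`, and `shape` are illustrative.

```python
import pandas as pd

columns = {
    "Age": [45, 55, 30, 20, 10],
    "Sex": ["M", "M", "F", "M", "F"],
}
index = ["Bob", "Dave", "Anna", "John", "Sally"]

df = pd.DataFrame(columns, index=index)
df.index.name = "Name"
df.columns.name = "Attributes"

row_labels = list(df.index)       # labels along the row index
column_labels = list(df.columns)  # labels along the column index
shape = df.shape                  # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(shape)
```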

## Columns

### Column Retrieval
#### Single Column
We can retrieve a single column from a DataFrame using dictionary-like square-bracket notation. The column is returned as a pandas Series whose index is the same as the DataFrame index, and whose name is the name of the column in the DataFrame. Whether setting values on that Series also changes the original DataFrame depends on the pandas version and on whether Copy-on-Write is enabled, so it is safer not to rely on it.

In [70]:
s = df["Age"]
s


Name
Bob      45
Dave     55
Anna     30
John     20
Sally    10
Name: Age, dtype: int64
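
The relationship between a column and the Series returned for it can be checked directly. The sketch below also uses `df.loc[row, column] = value` as one unambiguous way to write to the DataFrame itself, which does not depend on whether the retrieved Series shares data with the DataFrame. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [45, 55, 30, 20, 10], "Sex": ["M", "M", "F", "M", "F"]},
    index=["Bob", "Dave", "Anna", "John", "Sally"],
)

s = df["Age"]
name_matches = s.name == "Age"            # the Series name is the column label
index_matches = s.index.equals(df.index)  # the Series carries the same row labels

# Write through the DataFrame rather than through the retrieved Series.
df.loc["Bob", "Age"] = 46
bob_age = df.loc["Bob", "Age"]
```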

#### Multiple Column Retrieval
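We can retrieve multiple columns from a DataFrame by passing a list of column labels inside the square brackets. The result is a new DataFrame. In the example below `df1` has the default integer column labels 0 to 3, so `[1, 2]` selects columns by label.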

In [73]:
import numpy as np

df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['a','b','c','d'])
df1[[1,2]]

| | 1 | 2 |
|---|---|---|
| a | 1 | 2 |
| b | 5 | 6 |
| c | 9 | 10 |
| d | 13 | 14 |
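
The same list-of-labels selection works with named columns. The sketch below is a minimal illustration; the `Height` column and its values are hypothetical and only there to give the DataFrame a third column.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [45, 55, 30], "Sex": ["M", "M", "F"], "Height": [180, 175, 165]},
    index=["Bob", "Dave", "Anna"],
)

subset = df[["Sex", "Age"]]  # a list of labels returns a DataFrame, in the order given
one_col_frame = df[["Age"]]  # a one-item list still returns a DataFrame
one_col_series = df["Age"]   # a bare label returns a Series
```

Note the difference between `df[["Age"]]` and `df["Age"]`: the first keeps the two-dimensional shape, the second does not.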


### Column Setting
#### Scalar
Assigning a scalar sets every value in the column to that scalar.

In [12]:
df["Age"] = 35
df

| Attributes | Age | Sex |
|---|---|---|
| **Name** | | |
| Bob | 35 | M |
| Dave | 35 | M |
| Anna | 35 | F |
| John | 35 | M |
| Sally | 35 | F |


#### List
The length of the list must match the number of rows in the DataFrame.

In [17]:
df["Age"] = [5,3,25,5,35]
df

| Attributes | Age | Sex |
|---|---|---|
| **Name** | | |
| Bob | 5 | M |
| Dave | 3 | M |
| Anna | 25 | F |
| John | 5 | M |
| Sally | 35 | F |
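
To see what happens when the lengths do not match, the sketch below assigns a two-item list to a three-row DataFrame and captures the resulting `ValueError`. The exact wording of the message may differ between pandas versions, so it is printed rather than compared.

```python
import pandas as pd

df = pd.DataFrame({"Age": [45, 55, 30]}, index=["Bob", "Dave", "Anna"])

try:
    df["Age"] = [1, 2]  # two values for three rows
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```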


#### Series
When setting a column from a Series, the labels of the Series are matched to the row labels of the DataFrame. Any DataFrame labels that are missing from the Series are set to NaN, and Series labels that do not appear in the DataFrame (here "Derek") are ignored.

In [20]:
s = pd.Series([7,10,15], index=["Bob", "Dave", "Derek"])
df["Age"] = s
df

| Attributes | Age | Sex |
|---|---|---|
| **Name** | | |
| Bob | 7.0 | M |
| Dave | 10.0 | M |
| Anna | NaN | F |
| John | NaN | M |
| Sally | NaN | F |
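
Label alignment means the order of the values in the Series does not matter, only their labels. The sketch below contrasts an aligned assignment with a positional one obtained through `Series.to_numpy()`, which discards the labels. The column names `Aligned` and `Positional` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["M", "M", "F"]}, index=["Bob", "Dave", "Anna"])
s = pd.Series([7, 10, 15], index=["Dave", "Bob", "Derek"])

df["Aligned"] = s                 # matched by label: Bob gets 10, Dave gets 7, Anna gets NaN
df["Positional"] = s.to_numpy()   # labels discarded: values are assigned in order

print(df)
```

Because of the NaN, the `Aligned` column is stored as floating point, which is also why the ages above display as `7.0` and `10.0`.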


#### Setting non existing column
Setting a column that does not exist creates a new column.

In [21]:
df["Age2"] = 21
df

| Attributes | Age | Sex | Age2 |
|---|---|---|---|
| **Name** | | | |
| Bob | 7.0 | M | 21 |
| Dave | 10.0 | M | 21 |
| Anna | NaN | F | 21 |
| John | NaN | M | 21 |
| Sally | NaN | F | 21 |
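
Bracket assignment adds the column to the existing DataFrame. When the original should stay untouched, `DataFrame.assign` is one alternative: it returns a new DataFrame with the extra column. In the sketch below, `Age3` is a hypothetical column name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Age": [45, 55, 30]}, index=["Bob", "Dave", "Anna"])

with_new = df.assign(Age2=21)  # returns a new DataFrame; df is not modified
df["Age3"] = df["Age"] + 1     # bracket assignment adds the column in place
```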


#### Using boolean notation
A comparison on a column produces a boolean Series, which can be stored as a new column.

In [30]:
df["Age"] = [12,21,12,21,21]
df["Keys"] = df["Age"] == 21
df

| Attributes | Age | Sex | Age2 | Keys |
|---|---|---|---|---|
| **Name** | | | | |
| Bob | 12 | M | 21 | False |
| Dave | 21 | M | 21 | True |
| Anna | 12 | F | 21 | False |
| John | 21 | M | 21 | True |
| Sally | 21 | F | 21 | True |
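
A boolean Series like this is often used as a mask to filter rows rather than stored as a column. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [12, 21, 12, 21, 21], "Sex": ["M", "M", "F", "M", "F"]},
    index=["Bob", "Dave", "Anna", "John", "Sally"],
)

mask = df["Age"] == 21                    # boolean Series aligned with df's index
aged_21 = df[mask]                        # keep only the rows where mask is True
males_21 = df[mask & (df["Sex"] == "M")]  # combine conditions with & and parentheses
```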


### Column Deletion
The `del` statement removes a column from the DataFrame in place.

In [31]:
del df["Keys"]
df

| Attributes | Age | Sex | Age2 |
|---|---|---|---|
| **Name** | | | |
| Bob | 12 | M | 21 |
| Dave | 21 | M | 21 |
| Anna | 12 | F | 21 |
| John | 21 | M | 21 |
| Sally | 21 | F | 21 |
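
Two other ways to remove a column are `DataFrame.drop`, which returns a new DataFrame, and `DataFrame.pop`, which removes the column in place and returns it. The sketch below uses a small stand-in DataFrame rather than the one above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [12, 21], "Sex": ["M", "F"], "Keys": [False, True]},
    index=["Bob", "Dave"],
)

without_keys = df.drop(columns=["Keys"])  # new DataFrame; df itself is unchanged
sex = df.pop("Sex")                       # removes "Sex" from df and returns it as a Series
```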


## Rows
### Retrieving Rows
We can retrieve rows by integer position using the **iloc** indexer or by index label using the **loc** indexer. Note that the name of the returned Series is the row's index label.

In [9]:
df.iloc[1]

Attributes
Age    55
Sex     M
Name: Dave, dtype: object
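
The sketch below puts the two indexers side by side and adds slices, where they differ: a positional slice with `iloc` excludes its end, while a label slice with `loc` includes it. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [45, 55, 30, 20, 10], "Sex": ["M", "M", "F", "M", "F"]},
    index=["Bob", "Dave", "Anna", "John", "Sally"],
)

by_position = df.iloc[1]    # second row, returned as a Series named "Dave"
by_label = df.loc["Dave"]   # the same row, selected by its index label

first_two = df.iloc[0:2]            # positional slice, end excluded: Bob, Dave
bob_to_anna = df.loc["Bob":"Anna"]  # label slice, end included: Bob, Dave, Anna
```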

### Transpose - Swapping Rows and Columns
In this example the columns of the original DataFrame do not all share one type, so the transpose cannot keep per-column types: every column of the result has dtype `object`. Transposing back does not recover the original types either, so a transpose round trip loses the type information.

In [40]:
df

| Attributes | Age | Sex | Age2 | John |
|---|---|---|---|---|
| **Name** | | | | |
| Bob | 12 | M | 21 | 22 |
| Dave | 21 | M | 21 | 22 |
| Anna | 12 | F | 21 | 22 |
| John | 23 | M | 21 | 22 |
| Sally | 21 | F | 21 | 22 |


In [39]:
df.T

| Name | Bob | Dave | Anna | John | Sally |
|---|---|---|---|---|---|
| **Attributes** | | | | | |
| Age | 12 | 21 | 12 | 23 | 21 |
| Sex | M | M | F | M | F |
| Age2 | 21 | 21 | 21 | 21 | 21 |
| John | 22 | 22 | 22 | 22 | 22 |
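
The loss of type information can be observed through the column dtypes. The sketch below uses a smaller mixed-type DataFrame and shows `DataFrame.infer_objects` as one possible way to re-infer the types after a round trip; it is an illustration under these assumptions, not the only approach.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [12, 21, 12], "Sex": ["M", "M", "F"]},
    index=["Bob", "Dave", "Anna"],
)

original_dtype = df["Age"].dtype            # integer dtype
round_trip = df.T.T                         # same shape and labels as df
round_trip_dtype = round_trip["Age"].dtype  # object: the integer dtype was lost

restored = round_trip.infer_objects()       # re-infer dtypes of object columns
restored_dtype = restored["Age"].dtype      # integer dtype again

print(original_dtype, round_trip_dtype, restored_dtype)
```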


## Creation Options
#### 2-Dimensional ndarray
Notice that we optionally pass a row index and column labels.

In [97]:
pd.DataFrame(np.arange(16).reshape(4,4), index=["One", "Two", "Three", "Four"], columns=['a','b','c','d'])

| | a | b | c | d |
|---|---|---|---|---|
| One | 0 | 1 | 2 | 3 |
| Two | 4 | 5 | 6 | 7 |
| Three | 8 | 9 | 10 | 11 |
| Four | 12 | 13 | 14 | 15 |
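
When `index` and `columns` are omitted, pandas falls back to default integer labels on both axes. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

plain = pd.DataFrame(np.arange(6).reshape(2, 3))

row_labels = list(plain.index)       # [0, 1]
column_labels = list(plain.columns)  # [0, 1, 2]
```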


#### Dictionary of Sequences (Lists, arrays or tuples)
Each key becomes a column label. Note the optional index argument. All sequences must have the same length.

In [82]:
data = {
    "List" : [1,2,3],
    "Tuple" : (4,5,6),
    "Numpy" : np.array([7,8,9])
}

pd.DataFrame(data, index = ["One", "Two", "Three"])

| | List | Tuple | Numpy |
|---|---|---|---|
| One | 1 | 4 | 7 |
| Two | 2 | 5 | 8 |
| Three | 3 | 6 | 9 |
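
Unlike a dictionary of Series, a dictionary of plain sequences has no labels to align on, so sequences of different lengths are rejected. The sketch below captures the resulting `ValueError`; the keys `A` and `B` are illustrative, and the message is printed rather than compared because its wording may vary between versions.

```python
import pandas as pd

try:
    pd.DataFrame({"A": [1, 2, 3], "B": (4, 5)})  # three values versus two
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```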


#### Dictionary of Series
Note that in the absence of an explicit index, the union of the indices of the Series is used as the row index. Positions where a Series has no value for a label are filled with NaN.

In [96]:
columns = {
    "One" : pd.Series([1,2], ['a', 'b']),
    "Two" : pd.Series([1,2], ['a', 'c']),
    "Three" : pd.Series([1,2], ['b', 'c']),
}

pd.DataFrame(columns)

| | One | Two | Three |
|---|---|---|---|
| a | 1.0 | 1.0 | NaN |
| b | 2.0 | NaN | 1.0 |
| c | NaN | 2.0 | 2.0 |
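
Passing an explicit `index` restricts the result to the requested labels instead of the union. A minimal sketch using two of the Series from above:

```python
import pandas as pd

columns = {
    "One": pd.Series([1, 2], index=["a", "b"]),
    "Two": pd.Series([1, 2], index=["a", "c"]),
}

union = pd.DataFrame(columns)                         # rows a, b, c
restricted = pd.DataFrame(columns, index=["a", "b"])  # only the requested labels
```

In `restricted`, the value of `Two` at label `b` is still NaN, because that Series has no entry for `b`.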


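#### Dictionary of Dictionaries
The outer keys become the column labels and the inner keys become the row labels. Missing combinations are filled with NaN.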

In [95]:
columns = {
    "One" : {'a': 1, 'b' : 2},
    "Two" : {'b' : 9, 'c':12}
}

pd.DataFrame(columns)

| | One | Two |
|---|---|---|
| a | 1.0 | NaN |
| b | 2.0 | 9.0 |
| c | NaN | 12.0 |
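
If the outer keys should label rows instead of columns, `DataFrame.from_dict` with `orient="index"` is one way to express that. The sketch below compares the two orientations on the same nested dictionary; the variable names are illustrative.

```python
import pandas as pd

nested = {
    "One": {"a": 1, "b": 2},
    "Two": {"b": 9, "c": 12},
}

by_column = pd.DataFrame(nested)                         # outer keys become columns
by_row = pd.DataFrame.from_dict(nested, orient="index")  # outer keys become rows
```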


#### List of Dictionaries
Each dictionary becomes one row, and its keys become the column labels.

In [94]:
rows = [
    {'One' : 1.0, "Two" : 5.0},
    {'One' : 2.0, "Two" : 10.0},
]

pd.DataFrame(rows, index = ['a','b'])

| | One | Two |
|---|---|---|
| a | 1.0 | 5.0 |
| b | 2.0 | 10.0 |
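
The dictionaries do not need identical keys. In the sketch below the second row has no `Two` key, so that cell becomes NaN, and since no index is passed the rows get default integer labels.

```python
import pandas as pd

rows = [
    {"One": 1.0, "Two": 5.0},
    {"One": 2.0},  # no "Two" key in this row
]

frame = pd.DataFrame(rows)  # default integer row labels 0 and 1
```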


#### List of Lists or Tuples
Each inner list or tuple becomes one row.

In [101]:
data = [
    [1,2,3],
    (4,5,6)
]

pd.DataFrame(data, index=['One','Two'], columns=['a','b', 'c'])

| | a | b | c |
|---|---|---|---|
| One | 1 | 2 | 3 |
| Two | 4 | 5 | 6 |
