# DataFrame
A DataFrame consists of a named, ordered collection of columns, which makes it a natural fit for rectangular data. Each DataFrame has a row index and a column index. The following diagram describes the DataFrame created by the Python code below.

  <img alt="" src="./images/dataframe.png">

In [2]:
import pandas as pd

columns = {
    "Age" : [45,55,30,20,10],
    "Sex" : ["M", "M", "F", "M", "F"]
}

index = ["Bob", "Dave", "Anna", "John", "Sally"]

df = pd.DataFrame(columns, index)
df.index.name = "Name"
df.columns.name = "Attributes"

## Indexing
Indexing enables us to select a subset of rows and/or columns from a DataFrame. We consider the options in this subsection.
### Select Single Column (Array Notation)
We can retrieve a single column from a DataFrame using dictionary-like square bracket notation. The column is returned as a pandas Series whose index is the same as the DataFrame index, and the name of the Series is the name of the column in the DataFrame. Depending on the pandas version and its Copy-on-Write setting, the Series may share data with the DataFrame, so setting values on the Series can also change the original DataFrame; do not rely on this behavior.

In [185]:
import numpy as np

df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1['ColA']

RowOne       0
RowTwo       4
RowThree     8
RowFour     12
Name: ColA, dtype: int32
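
The properties described above can be checked directly. The following is a minimal sketch, not a definitive recipe: the variable names (`col`, `same_index`, `updated`) are illustrative assumptions. Because whether the returned Series shares data with the DataFrame depends on the pandas version, the sketch writes through the DataFrame itself with `loc`, which is unambiguous.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["RowOne", "RowTwo", "RowThree", "RowFour"],
    columns=["ColA", "ColB", "ColC", "ColD"],
)

# Square bracket notation with a single label returns a Series.
col = df1["ColA"]
is_series = isinstance(col, pd.Series)

# The Series is named after the column and keeps the DataFrame's row index.
col_name = col.name
same_index = col.index.equals(df1.index)

# To change a value in the DataFrame, assign through the DataFrame itself.
df1.loc["RowTwo", "ColA"] = 100
updated = df1.at["RowTwo", "ColA"]
```

Assigning via `df1.loc[...]` avoids depending on whether `col` is a view or a copy.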

### Select Multiple Columns (Array Notation)
We can pass a list of column labels inside the square brackets. The result is a DataFrame containing the specified subset of columns.

In [186]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1[['ColA','ColC']]

          ColA  ColC
RowOne       0     2
RowTwo       4     6
RowThree     8    10
RowFour     12    14


### Select Single Row By Label (loc)

In [187]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc['RowTwo']

ColA    4
ColB    5
ColC    6
ColD    7
Name: RowTwo, dtype: int32

### Select Multiple Rows By Label (loc)

In [189]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc[['RowTwo','RowFour']]

         ColA  ColB  ColC  ColD
RowTwo      4     5     6     7
RowFour    12    13    14    15


### Select Single Column By Label (loc)

In [199]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc[:,'ColA']

RowOne       0
RowTwo       4
RowThree     8
RowFour     12
Name: ColA, dtype: int32

### Select Subset Of Columns By Label (loc)

In [204]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc[:,['ColA', 'ColC']]

          ColA  ColC
RowOne       0     2
RowTwo       4     6
RowThree     8    10
RowFour     12    14


### Select Subset Of Rows and Columns by Label (loc)

In [193]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc[['RowTwo','RowFour'], ['ColB','ColD']]

         ColB  ColD
RowTwo      5     7
RowFour    13    15


### Select Subset Of Rows and Columns by Label Slice (loc)

In [195]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.loc[:'RowThree', :'ColB']

          ColA  ColB
RowOne       0     1
RowTwo       4     5
RowThree     8     9


### Select subset of rows by integer position (iloc)

In [206]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.iloc[[2,3]]

          ColA  ColB  ColC  ColD
RowThree     8     9    10    11
RowFour     12    13    14    15


### Select subset of rows and columns by integer position (iloc)

In [208]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.iloc[[2,3], [1,2]]

          ColB  ColC
RowThree     9    10
RowFour     13    14


### Select subset of columns by integer position (iloc)

In [210]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.iloc[:, [1,2]]

          ColB  ColC
RowOne       1     2
RowTwo       5     6
RowThree     9    10
RowFour     13    14


### Select subset of rows and columns by integer position (iloc) slice
**Note:** iloc and loc behave differently with slice notation. With iloc the stop index is exclusive, whereas with loc the stop label is inclusive.

In [213]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.iloc[1:3,1:3]

          ColB  ColC
RowTwo       5     6
RowThree     9    10
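
To make the slicing difference concrete, here is a short sketch that selects the same block once by position and once by label. The variable names are illustrative assumptions; the DataFrame is the one used throughout this section.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["RowOne", "RowTwo", "RowThree", "RowFour"],
    columns=["ColA", "ColB", "ColC", "ColD"],
)

# iloc: the stop position 3 is excluded, so positions 1 and 2 are selected.
by_position = df1.iloc[1:3, 1:3]

# loc: the stop labels are included, so the slice must end at the last
# label we want rather than one past it.
by_label = df1.loc["RowTwo":"RowThree", "ColB":"ColC"]

# Both selections cover the same rows and columns.
same_selection = by_position.equals(by_label)

# The same bounds therefore yield different row counts.
rows_iloc = len(df1.iloc[0:2])                # positions 0 and 1
rows_loc = len(df1.loc["RowOne":"RowThree"])  # three labels, stop included
```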


### Select single scalar by column and row label (at)

In [215]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.at['RowTwo','ColB']

5

### Select single scalar by column and row integer index (iat)

In [217]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index = ['RowOne','RowTwo','RowThree','RowFour'],columns=['ColA','ColB','ColC','ColD'] )
df1.iat[1,1]

5

## Deletion
### Column

In [31]:
del df["Keys"]
df

Attributes  Age  Sex  Age2
Name
Bob          12    M    21
Dave         21    M    21
Anna         12    F    21
John         21    M    21
Sally        21    F    21


## Rows
### Retrieving Rows
We can retrieve rows by integer position using the **iloc** indexer or by index label using the **loc** indexer. Note that the name of the returned Series is the index label of the row.

In [9]:
df.iloc[1]

Attributes
Age    55
Sex     M
Name: Dave, dtype: object
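
As a small sketch of the two access paths, the code below rebuilds the DataFrame from the start of this document and fetches the same row by position and by label. The variable names `by_position` and `by_label` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Age": [45, 55, 30, 20, 10],
        "Sex": ["M", "M", "F", "M", "F"],
    },
    index=["Bob", "Dave", "Anna", "John", "Sally"],
)

# Row at integer position 1 (the second row).
by_position = df.iloc[1]

# The same row, selected by its index label.
by_label = df.loc["Dave"]

# Both return a Series whose name is the row's index label and whose
# index is the DataFrame's column index.
row_name = by_position.name
row_fields = list(by_position.index)
```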

### Transpose - Swapping Rows and Columns
Note that in this example the columns of the original DataFrame do not all have the same type, so the column types are lost in the transpose: every column of the transposed DataFrame has the generic Python `object` dtype. In this situation, transposing and then transposing back does not restore the original type information.

In [40]:
df

Attributes  Age  Sex  Age2  John
Name
Bob          12    M    21    22
Dave         21    M    21    22
Anna         12    F    21    22
John         23    M    21    22
Sally        21    F    21    22


In [39]:
df.T

Name        Bob  Dave  Anna  John  Sally
Attributes
Age          12    21    12    23     21
Sex           M     M     F     M      F
Age2         21    21    21    21     21
John         22    22    22    22     22
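
The loss of type information can be observed on a smaller frame. The sketch below is an illustration under assumptions: the three-row DataFrame and the variable names are made up for the example, and `infer_objects()` is shown as one possible way to recover numeric dtypes afterwards, not as something the original notebook used.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [45, 55, 30], "Sex": ["M", "M", "F"]},
    index=["Bob", "Dave", "Anna"],
)

# Before transposing, "Age" is an integer column.
age_kind_before = df["Age"].dtype.kind  # "i" for integer

# Transposing mixes integers and strings in each new column, so the
# values are stored as generic Python objects.
round_trip = df.T.T
age_dtype_after = round_trip["Age"].dtype

# The values themselves survive; only the dtype is lost.
values_preserved = (round_trip["Age"] == df["Age"]).all()

# infer_objects() attempts to find a better dtype for object columns.
restored = round_trip.infer_objects()
age_kind_restored = restored["Age"].dtype.kind
```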


## Creation Options
#### 2 Dimensional ndarray
Notice that we can optionally pass row labels (`index`) and column labels (`columns`).

In [97]:
pd.DataFrame(np.arange(16).reshape(4,4), index=["One", "Two", "Three", "Four"], columns=['a','b','c','d'])

        a   b   c   d
One     0   1   2   3
Two     4   5   6   7
Three   8   9  10  11
Four   12  13  14  15


#### Dictionary of Sequences (Lists, arrays or tuples) 
Note the optional `index` argument.

In [82]:
data = {
    "List" : [1,2,3],
    "Tuple" : (4,5,6),
    "Numpy" : np.array([7,8,9])
}

pd.DataFrame(data, index = ["One", "Two", "Three"])

       List  Tuple  Numpy
One       1      4      7
Two       2      5      8
Three     3      6      9


#### Dictionary of Series
Note that in the absence of an explicit index, the union of the indices of the individual Series is used as the index. Positions with no value in a given Series are filled with NaN.

In [96]:
columns = {
    "One" : pd.Series([1,2], ['a', 'b']),
    "Two" : pd.Series([1,2], ['a', 'c']),
    "Three" : pd.Series([1,2], ['b', 'c']),
}

pd.DataFrame(columns)

   One  Two  Three
a  1.0  1.0    NaN
b  2.0  NaN    1.0
c  NaN  2.0    2.0


#### Dictionary of Dictionaries

In [95]:
columns = {
    "One" : {'a': 1, 'b' : 2},
    "Two" : {'b' : 9, 'c':12}
}

pd.DataFrame(columns)

   One   Two
a  1.0   NaN
b  2.0   9.0
c  NaN  12.0


#### List of Dictionaries

In [94]:
rows = [
    {'One' : 1.0, "Two" : 5.0},
    {'One' : 2.0, "Two" : 10.0},
]

pd.DataFrame(rows, index = ['a','b'])

   One   Two
a  1.0   5.0
b  2.0  10.0


#### List of Lists or Tuples


In [101]:
data = [
    [1,2,3],
    (4,5,6)
]

pd.DataFrame(data, index=['One','Two'], columns=['a','b', 'c'])

     a  b  c
One  1  2  3
Two  4  5  6


## Reindexing
### Rows
Creates a new DataFrame with the rows rearranged to match the new index. If the new index contains labels that are missing from the old index, rows filled with NaN are added for them. The following are all equivalent:

 * ```df1.reindex(['Five','Four', 'Three', 'Two', "One"])```
 * ```df1.reindex(index=['Five','Four', 'Three', 'Two', "One"])```
 * ```df1.reindex(['Five','Four', 'Three', 'Two', "One"], axis="index")```

In [105]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index=["One", "Two", "Three", "Four"], columns=['a','b','c','d'])
df1

        a   b   c   d
One     0   1   2   3
Two     4   5   6   7
Three   8   9  10  11
Four   12  13  14  15


In [109]:
df1.reindex(['Five','Four', 'Three', 'Two', "One"])

          a     b     c     d
Five    NaN   NaN   NaN   NaN
Four   12.0  13.0  14.0  15.0
Three   8.0   9.0  10.0  11.0
Two     4.0   5.0   6.0   7.0
One     0.0   1.0   2.0   3.0


In [110]:
df1.reindex(index=['Five','Four', 'Three', 'Two', "One"])

          a     b     c     d
Five    NaN   NaN   NaN   NaN
Four   12.0  13.0  14.0  15.0
Three   8.0   9.0  10.0  11.0
Two     4.0   5.0   6.0   7.0
One     0.0   1.0   2.0   3.0


In [113]:
df1.reindex(['Five','Four', 'Three', 'Two', "One"], axis="index")

          a     b     c     d
Five    NaN   NaN   NaN   NaN
Four   12.0  13.0  14.0  15.0
Three   8.0   9.0  10.0  11.0
Two     4.0   5.0   6.0   7.0
One     0.0   1.0   2.0   3.0
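
The equivalence of the three spellings, and the effect of the missing label, can be verified programmatically. This is a minimal sketch; the names `new_index`, `positional`, `keyword` and `with_axis` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["One", "Two", "Three", "Four"],
    columns=["a", "b", "c", "d"],
)
new_index = ["Five", "Four", "Three", "Two", "One"]

# Three spellings of the same row reindex.
positional = df1.reindex(new_index)
keyword = df1.reindex(index=new_index)
with_axis = df1.reindex(new_index, axis="index")
all_equal = positional.equals(keyword) and keyword.equals(with_axis)

# "Five" did not exist in the original index, so its row is all NaN.
new_row_is_empty = positional.loc["Five"].isna().all()

# Introducing NaN converts the integer columns to floating point,
# which is why the output above shows 12.0 rather than 12.
became_float = (positional.dtypes == "float64").all()

# reindex returns a new DataFrame; df1 keeps its original row order.
original_order = list(df1.index)
```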


### Columns
If the new column index contains a label that is missing from the original DataFrame's columns, a new column filled with NaN is added. If the new column index omits any of the original columns, those columns are left out of the result.

In [114]:
df1.reindex(columns=['d','c','b'])

        d   c   b
One     3   2   1
Two     7   6   5
Three  11  10   9
Four   15  14  13


In [116]:
df1.reindex(['d','c','b','f'], axis="columns")

        d   c   b    f
One     3   2   1  NaN
Two     7   6   5  NaN
Three  11  10   9  NaN
Four   15  14  13  NaN


## Dropping
### Rows
Creates a new DataFrame with the specified rows dropped; the original DataFrame is left unchanged. All four calls below are equivalent:
 
  * ```df1.drop(["Two", "Three"])```
  * ```df1.drop(["Two", "Three"], axis="index")```
  * ```df1.drop(index=["Two", "Three"])```
  * ```df1.drop(["Two", "Three"], axis=0)```
 

In [121]:
df1 = pd.DataFrame(np.arange(16).reshape(4,4), index=["One", "Two", "Three", "Four"], columns=['a','b','c','d'])
df1.drop(["Two", "Three"])

       a   b   c   d
One    0   1   2   3
Four  12  13  14  15



In [122]:
df1.drop(["Two", "Three"], axis="index")

       a   b   c   d
One    0   1   2   3
Four  12  13  14  15


In [123]:
df1.drop(index=["Two", "Three"])

       a   b   c   d
One    0   1   2   3
Four  12  13  14  15


In [129]:
df1.drop(["Two", "Three"], axis=0)

       a   b   c   d
One    0   1   2   3
Four  12  13  14  15
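
The following sketch confirms that the row-dropping calls shown above produce the same result and that `drop` does not modify the DataFrame it is called on. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["One", "Two", "Three", "Four"],
    columns=["a", "b", "c", "d"],
)
labels = ["Two", "Three"]

# Equivalent ways of dropping rows by label.
default_axis = df1.drop(labels)
axis_name = df1.drop(labels, axis="index")
index_keyword = df1.drop(index=labels)
axis_number = df1.drop(labels, axis=0)

results = [default_axis, axis_name, index_keyword, axis_number]
all_equal = all(default_axis.equals(other) for other in results)

# Only the rows that were not dropped remain.
remaining_rows = list(default_axis.index)

# drop returns a new DataFrame, so df1 still has all four rows.
original_rows = list(df1.index)
```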


### Columns
All three calls below are equivalent:

 * ```df1.drop(columns=["a", "d"])```
 * ```df1.drop(["a", "d"], axis="columns")```
 * ```df1.drop(["a", "d"], axis=1)```

In [126]:
df1.drop(columns=["a", "d"])

        b   c
One     1   2
Two     5   6
Three   9  10
Four   13  14


In [127]:
df1.drop(["a", "d"], axis="columns")

        b   c
One     1   2
Two     5   6
Three   9  10
Four   13  14


In [128]:
df1.drop(["a", "d"], axis=1)

        b   c
One     1   2
Two     5   6
Three   9  10
Four   13  14


## Indexing
### Columns