# Work with DataFrames and pandas

Before we start, we need to import the required modules. 

In [1]:
import pandas as pd
from pandas import DataFrame

First, I'd like to show you how to create a new DataFrame.

In [2]:
df = DataFrame([[1,2,3,4]], columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
0,1,2,3,4


The number of entries in the `columns` list must match the length of every row list in the first parameter; otherwise an exception is raised.
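To illustrate the point about mismatched lengths, here is a minimal sketch that passes four values per row but only three column labels. The variable name `mismatch_error` is only used for this illustration, and the exact exception type and message may differ between pandas versions, so both likely candidates are caught.

```python
from pandas import DataFrame

# Four values in the row, but only three column labels: the shapes do not match.
try:
    DataFrame([[1, 2, 3, 4]], columns=["A", "B", "C"])
    mismatch_error = None
except (ValueError, AssertionError) as exc:
    # The exception type and message depend on the pandas version.
    mismatch_error = exc

print(type(mismatch_error).__name__, mismatch_error)
```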

The following example shows how to create a DataFrame from a list of rows.

In [3]:
rows = []
for e in range(0,5):
    rows.append([e+1,e+2,e+3,e+4])
    
df = DataFrame(rows, columns=["A", "B", "C", "D"])
df

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8


A DataFrame is always organized in labeled columns and rows. Each row has an index value. In the previous example no index was given, so pandas generated a default integer index (0 to 4) for the rows built in the `for` loop. There are also ways to work with custom index values and/or multi-level index values, but this is beyond the scope of this notebook.

## Get information about the structure of the DataFrame

To get all index values from a DataFrame, use the `index` attribute:

In [4]:
df.index

RangeIndex(start=0, stop=5, step=1)

This attribute returns a `RangeIndex`. It is not a Python list, but it can be iterated over and indexed like one.

In [5]:
for e in df.index:
    print(e)

0
1
2
3
4
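The following short sketch shows what "like a list, but not a list" means in practice. The two-row DataFrame and the name `idx` are made up for this illustration.

```python
from pandas import DataFrame

df = DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=["A", "B", "C", "D"])
idx = df.index

print(isinstance(idx, list))  # False: it is a pandas index object
print(len(idx))               # 2: supports len() like a list
print(1 in idx)               # True: supports membership tests
print(list(idx))              # [0, 1]: can be converted explicitly
```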


To get a normal Python list, you can use the `tolist()` function.

In [6]:
df.index.tolist()

[0, 1, 2, 3, 4]

This function is available on many attributes within the DataFrame. We can use the following expressions to get a list with all column names from the DataFrame.

In [7]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [8]:
df.columns.tolist()

['A', 'B', 'C', 'D']

## Common functions in DataFrames

The following cells contain some useful functions for working with DataFrames (for example, viewing the first or last rows).

Some computation functions, like `sum()`, can be used on multiple levels within the DataFrame: on the whole DataFrame or on a single column.

In [9]:
df.head(2)

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,5


In [10]:
df.tail(2)

Unnamed: 0,A,B,C,D
3,4,5,6,7
4,5,6,7,8


In [11]:
df.sum()

A    15
B    20
C    25
D    30
dtype: int64

In [12]:
df.A.sum()

15

In [13]:
df.A.mean()

3.0

In [14]:
df.median()

A    3.0
B    4.0
C    5.0
D    6.0
dtype: float64

In [15]:
df.describe()

Unnamed: 0,A,B,C,D
count,5.0,5.0,5.0,5.0
mean,3.0,4.0,5.0,6.0
std,1.581139,1.581139,1.581139,1.581139
min,1.0,2.0,3.0,4.0
25%,2.0,3.0,4.0,5.0
50%,3.0,4.0,5.0,6.0
75%,4.0,5.0,6.0,7.0
max,5.0,6.0,7.0,8.0
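As a small addition to the cells above, this sketch shows `sum()` on three levels: per column, per row (using the `axis` parameter) and over the whole DataFrame. The DataFrame is rebuilt here so the example is self-contained; the variable names are only for illustration.

```python
from pandas import DataFrame

rows = [[e + 1, e + 2, e + 3, e + 4] for e in range(5)]
df = DataFrame(rows, columns=["A", "B", "C", "D"])

column_sums = df.sum()       # one value per column (the default, axis=0)
row_sums = df.sum(axis=1)    # one value per row
total = df.sum().sum()       # a single value for the whole DataFrame

print(column_sums.tolist())  # [15, 20, 25, 30]
print(row_sums.tolist())     # [10, 14, 18, 22, 26]
print(total)                 # 90
```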


## Delete rows

A row within a DataFrame can be deleted using the `drop()` function. The function expects the index label of the row that should be dropped.

The index values within the DataFrame are persistent. For example, when deleting the row with the index 2, the number of index values decreases, but the remaining rows keep the index values 0, 1, 3 and 4.

In [16]:
df

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8


In [17]:
df.index.tolist()

[0, 1, 2, 3, 4]

In [18]:
df.drop(2)

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,5
3,4,5,6,7
4,5,6,7,8


In [19]:
df.drop([0,3])

Unnamed: 0,A,B,C,D
1,2,3,4,5
2,3,4,5,6
4,5,6,7,8
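Note that row 2 is still present in the last output: `drop()` returns a new DataFrame and leaves the original untouched. The sketch below makes this explicit and also shows one possible way to renumber the rows afterwards with `reset_index(drop=True)`. The names `smaller` and `renumbered` are only for illustration.

```python
from pandas import DataFrame

rows = [[e + 1, e + 2, e + 3, e + 4] for e in range(5)]
df = DataFrame(rows, columns=["A", "B", "C", "D"])

smaller = df.drop(2)                 # new DataFrame without the row labeled 2
print(df.index.tolist())             # [0, 1, 2, 3, 4]: the original is unchanged
print(smaller.index.tolist())        # [0, 1, 3, 4]: the labels are persistent

# Optional: discard the old labels and generate a fresh 0..n-1 index.
renumbered = smaller.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2, 3]
```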


## Select and edit values

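There are multiple ways to retrieve data from a DataFrame. First, you can select a row by its index label using the `loc` indexer. (Older pandas versions also offered `ix` for this; it was deprecated in pandas 0.20 and removed in pandas 1.0.)

In [20]:
ser = df.loc[0]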
ser

A    1
B    2
C    3
D    4
Name: 0, dtype: int64

The resulting data type is a `Series`, which is a one-dimensional labeled array. You can access the data in a similar way as with the DataFrame.

In [21]:
type(ser)

pandas.core.series.Series

In [22]:
ser["A"]

1

In [23]:
ser.keys()

Index(['A', 'B', 'C', 'D'], dtype='object')

Within DataFrames and Series, you can work with multiple data types. They can be native to Python (like `int` or `str`) or complex objects, like an `ipaddress.IPv4Address`. The data type of an element and/or a row is stored in the `dtype` attribute:

In [24]:
df.loc[0, "A"].dtype

dtype('int64')
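To illustrate the statement about complex objects, here is a minimal sketch with a Series of `ipaddress.IPv4Address` values. The addresses are example values from a documentation range, and the variable names are made up. pandas stores arbitrary Python objects like these with the generic `object` dtype.

```python
import ipaddress

import pandas as pd

addresses = pd.Series([
    ipaddress.IPv4Address("192.0.2.1"),
    ipaddress.IPv4Address("192.0.2.2"),
])
numbers = pd.Series([1, 2, 3])

print(addresses.dtype)     # object: arbitrary Python objects
print(type(addresses[0]))  # the elements keep their original Python type
print(numbers.dtype)       # an integer dtype backed by numpy
```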

Behind the scenes, the data types of pandas are built on numpy. I won't go into more detail at this point. If you'd like to learn more about numpy, please take a look at the [numpy homepage](http://www.numpy.org/).

The `loc` indexer accepts either labels or boolean arrays. Given a row label and a column label, it selects a single value.

First, let's have a look at `loc` with labels.

In [25]:
df

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,3,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8


In [26]:
df.loc[1, "B"]

3

You can also modify values in the DataFrame with the `loc` function.

In [27]:
df.loc[1, "B"] = 100
df

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,100,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8
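The cells above only use labels. As a sketch of the second option, boolean arrays, the following example selects and modifies rows with a mask. The DataFrame is rebuilt with its initial values so the example is self-contained; `mask` and `selected` are illustrative names.

```python
from pandas import DataFrame

rows = [[e + 1, e + 2, e + 3, e + 4] for e in range(5)]
df = DataFrame(rows, columns=["A", "B", "C", "D"])

mask = df["A"] > 3                   # boolean Series, True for rows 3 and 4
selected = df.loc[mask, ["A", "B"]]  # rows where the mask is True, two columns
print(selected.index.tolist())       # [3, 4]

df.loc[mask, "D"] = 0                # assignment works with a mask as well
print(df["D"].tolist())              # [4, 5, 6, 0, 0]
```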


The `iterrows()` function can be used to iterate over the rows in a DataFrame. It yields the index value and the row as a `Series`:

In [28]:
for index, row in df.iterrows():
    print("ROW AT INDEX %d:\n%s\n\n" % (index, row))

ROW AT INDEX 0:
A    1
B    2
C    3
D    4
Name: 0, dtype: int64


ROW AT INDEX 1:
A      2
B    100
C      4
D      5
Name: 1, dtype: int64


ROW AT INDEX 2:
A    3
B    4
C    5
D    6
Name: 2, dtype: int64


ROW AT INDEX 3:
A    4
B    5
C    6
D    7
Name: 3, dtype: int64


ROW AT INDEX 4:
A    5
B    6
C    7
D    8
Name: 4, dtype: int64
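As a possible alternative that is not used in this notebook, `itertuples()` yields one named tuple per row instead of a `Series`. This is only a sketch with a small made-up DataFrame; the index value is available as `Index` and each column as an attribute.

```python
from pandas import DataFrame

df = DataFrame([[1, 2], [2, 100]], columns=["A", "B"])

pairs = []
for row in df.itertuples():
    # row.Index is the index value, row.A and row.B are the column values
    print("ROW AT INDEX %d: A=%d B=%d" % (row.Index, row.A, row.B))
    pairs.append((row.Index, row.A, row.B))
```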




## Filter DataFrame

At some point, you need only a subset of the data in a DataFrame. Filtering data is quite easy, but keep in mind that a separate DataFrame with the selected data is created during that process. Later changes to the original DataFrame are not reflected in it.

In [29]:
df2 = df[df.A == 4]
df2  # copy with the selected data

Unnamed: 0,A,B,C,D
3,4,5,6,7


In [30]:
df  # original DataFrame

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,100,4,5
2,3,4,5,6
3,4,5,6,7
4,5,6,7,8


In [31]:
df.loc[3, "C"] = 50
df

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,100,4,5
2,3,4,5,6
3,4,5,50,7
4,5,6,7,8


In [32]:
df2

Unnamed: 0,A,B,C,D
3,4,5,6,7
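Filters can also be combined. The sketch below, which is an addition to the cells above, uses `&` (and) with parentheses around each condition; `|` (or) and `~` (not) work the same way. The DataFrame is rebuilt with its initial values, and `subset` is an illustrative name.

```python
from pandas import DataFrame

rows = [[e + 1, e + 2, e + 3, e + 4] for e in range(5)]
df = DataFrame(rows, columns=["A", "B", "C", "D"])

# Each condition needs its own parentheses because & binds tighter than > and <.
subset = df[(df.A > 1) & (df.D < 8)]
print(subset.index.tolist())  # [1, 2, 3]
```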


In many cases, I need to work with text data. First, let's create a DataFrame with some text values.

In [33]:
df = DataFrame([
        ["Test", "Another Test"],
        ["String", "Another String"],
    ], columns=["A", "B"])
df

Unnamed: 0,A,B
0,Test,Another Test
1,String,Another String


You can filter based on string values using the `str` accessor of a column. We use `contains` in this example to receive all rows that contain the string `Test` in column A. The `contains` function is based on the `re.search` function from the standard regular expression library.

In [34]:
df[df.A.str.contains("Test")]

Unnamed: 0,A,B
0,Test,Another Test


There are also additional operations available, like `match` (relies on `re.match`), `startswith` and `endswith`.
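Here is a short sketch of these operations on the same text DataFrame. The search strings and variable names are chosen only for this illustration. The last line shows `contains` with `regex=False`, which treats the pattern as a literal string instead of a regular expression.

```python
from pandas import DataFrame

df = DataFrame([
    ["Test", "Another Test"],
    ["String", "Another String"],
], columns=["A", "B"])

starts = df[df.B.str.startswith("Another")]  # both rows
ends = df[df.B.str.endswith("String")]       # only row 1
matched = df[df.A.str.match("Str")]          # only row 1, matched from the start
literal = df[df.B.str.contains("Test", regex=False)]  # only row 0

print(starts.index.tolist(), ends.index.tolist())
print(matched.index.tolist(), literal.index.tolist())
```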

Finally, I'd like to show you how to convert a DataFrame to a dictionary.

In [35]:
# convert each column to a dictionary that maps the row index to the value
df2.to_dict()

{'A': {3: 4}, 'B': {3: 5}, 'C': {3: 6}, 'D': {3: 7}}

In [36]:
# convert each column to a flat list of values (the row index is dropped)
df2.to_dict(orient="list")

{'A': [4], 'B': [5], 'C': [6], 'D': [7]}
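Both outputs above are organized by column. If you need one dictionary per row instead, the `orient` parameter also accepts `"records"` and `"index"`. The following sketch rebuilds a DataFrame equivalent to `df2` by hand, which is an assumption made to keep the example self-contained.

```python
from pandas import DataFrame

# Equivalent to df2 from the filter example: one row with the index value 3.
df2 = DataFrame([[4, 5, 6, 7]], columns=["A", "B", "C", "D"], index=[3])

records = df2.to_dict(orient="records")  # list with one dictionary per row
by_index = df2.to_dict(orient="index")   # dictionary keyed by the row index

print(records)   # [{'A': 4, 'B': 5, 'C': 6, 'D': 7}]
print(by_index)  # {3: {'A': 4, 'B': 5, 'C': 6, 'D': 7}}
```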

## Create DataFrames from different sources

You can create DataFrames from different sources, e.g. CSV files, Excel files and even the local clipboard. A detailed explanation of all possibilities is available in the [pandas IO documentation](http://pandas.pydata.org/pandas-docs/stable/io.html).
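As a minimal sketch of the CSV case, the following example reads CSV data from an in-memory string instead of a file, so it runs without any external resources. The CSV content and the names `csv_text` and `csv_df` are made up for this illustration; with a real file you would pass the file path to `read_csv` instead.

```python
import io

import pandas as pd

# CSV content that would normally live in a file.
csv_text = "A,B\nTest,Another Test\nString,Another String\n"

# read_csv accepts a path or any file-like object, e.g. a StringIO buffer.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)
```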

In [37]:
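# copy the last table output within this notebook to the clipboard first;
# read_clipboard() splits on whitespace by default, so "Another Test" ends up in two columns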
clipboard_df = pd.read_clipboard()
clipboard_df

Unnamed: 0,Unnamed: 1,A,B
0,Test,Another,Test


It's also possible to extract tables from HTML pages, like the Cisco EoL notices. The statement below requires the following additional libraries:

* lxml
* BeautifulSoup4
* html5lib

The result is a list with all tables found in the HTML source code.

In [38]:
html_dfs = pd.read_html("http://www.cisco.com/c/en/us/products/collateral/switches/catalyst-2960-series-switches/eos-eol-notice-c51-730121.html")
type(html_dfs)

list

In [39]:
html_dfs[0]

Unnamed: 0,0,1,2
0,Milestone,Definition,Date
1,End-of-Life Announcement Date,The date the document that announces the end-o...,"October 31, 2013"
2,End-of-Sale Date,The last date to order the product through Cis...,"October 31, 2014"
3,Last Ship Date: HW,The last-possible ship date that can be reques...,"January 29, 2015"
4,End of SW Maintenance Releases Date: HW,The last date that Cisco Engineering may relea...,"October 31, 2015"
5,End of Vulnerability/Security Support: HW,The last date that Cisco Engineering may relea...,"October 30, 2017"
6,End of Routine Failure Analysis Date: HW,The last-possible date a routine failure analy...,"October 31, 2015"
7,End of New Service Attachment Date: HW,For equipment and software that is not covered...,"October 31, 2015"
8,End of Service Contract Renewal Date: HW,The last date to extend or renew a service con...,"January 29, 2019"
9,Last Date of Support: HW,The last date to receive applicable service an...,"October 31, 2019"


In [40]:
html_dfs[1]

Unnamed: 0,0,1,2,3,4
0,End-of-Sale Product Part Number,Product Description,Replacement Product Part Number,Replacement Product Description,Additional Information
1,WS-C2960-24-S,Catalyst 2960 24 10/100 LAN Lite Image,WS-C2960+24TC-S,Catalyst 2960 Plus 24 10/100 + 2 T/SFP LAN Lite,-
2,WS-C2960-24LC-S,Catalyst 2960 24 10/100 (8 PoE) + 2 T/SFP LAN ...,WS-C2960+24LC-S,Catalyst 2960 Plus 24 10/100 (8 PoE) + 2 T/SFP...,-
3,WS-C2960-24LT-L,Catalyst 2960 24 10/100 (8 PoE)+ 2 1000BT LAN ...,WS-C2960+24LC-L,Catalyst 2960 Plus 24 10/100 (8 PoE) + 2 T/SFP...,-
4,WS-C2960-24PC-L,Catalyst 2960 24 10/100 PoE + 2 T/SFP LAN Base...,WS-C2960+24PC-L,Catalyst 2960 Plus 24 10/100 PoE + 2 T/SFP LAN...,-
5,WS-C2960-24PC-S,Catalyst 2960 24 10/100 PoE + 2 T/SFP LAN Lite...,WS-C2960+24PC-S,Catalyst 2960 Plus 24 10/100 PoE + 2 T/SFP LAN...,-
6,WS-C2960-24TC-L,Catalyst 2960 24 10/100 + 2T/SFP LAN Base Image,WS-C2960+24TC-L,Catalyst 2960 Plus 24 10/100 + 2T/SFP LAN Base,-
7,WS-C2960-24TC-S,Catalyst 2960 24 10/100 + 2 T/SFP LAN Lite Image,WS-C2960+24TC-S,Catalyst 2960 Plus 24 10/100 + 2 T/SFP LAN Lite,-
8,WS-C2960-24TT-L,Catalyst 2960 24 10/100 + 2 1000BT LAN Base Image,WS-C2960+24TC-L,Catalyst 2960 Plus 24 10/100 + 2T/SFP LAN Base,-
9,WS-C2960-48PST-L,Catalyst 2960 48 10/100 PoE + 2 1000BT +2 SFP ...,WS-C2960+48PST-L,Catalyst 2960 Plus 48 10/100 PoE + 2 1000BT +2...,-
