# The *pandas* library: `Series` and `DataFrame`s

Python is a general-purpose scripting language. However, when working in a particular domain, such as the world of data, it can be useful to make use of code libraries that provide higher-level data structures and operations that are suited to that domain.

A library we will be drawing on heavily in this module is *pandas*. Data that is represented in an appropriate format is much easier to work with, and *pandas* provides several such formats. In this Notebook, you will learn about two key *pandas* data structures: `Series` and a tabular data structure referred to as a `DataFrame`.

To start with, we need to load the *pandas* library. By convention, we associate it with the convenient label `pd`.

In [None]:
import pandas as pd

Note that it is possible to set up notebooks to preload your most commonly used libraries. However, as a reminder that many of the libraries we are using are not part of the standard Python code base, we will include an `import` statement for the libraries we use.

This is also good practice as it allows a reader, or a newcomer to your code, to identify which libraries contain the functions or data structures you are using. Similarly, it is bad (although widespread) practice to import all the functionality of a library with the `*` notation. For example, if we were using both *pandas* and *numpy*, the numerical Python library that *pandas* is built on, we might see:

```python
from pandas import *
from numpy import *

# Where are isnull and nan defined?

if isnull(nan):
    # Do something
    pass
```

By labelling the libraries as they are imported, it is clear which library each operation belongs to:

```python
import pandas as pd
import numpy as np

# Clear where isnull and nan are defined:

if pd.isnull(np.nan):
    # Do something
    pass
```

## Python recap: `list`s and `dict`s

Python lists are flexible, mutable, data structures that can be used to represent an ordered list of objects.

In [None]:
simple_list = ['apples', 'oranges', 'bananas', 'pears']

Associated with each list member is a numerical *index* value, counting from zero, that identifies its position in the list. The *N*th list member has index value *N* - 1.

In [None]:
print(f"First list item (index value 1-1 = 0): simple_list[0] -> {simple_list[0]}")
print(f"Third list item (index value 3-1 = 2): simple_list[2] -> {simple_list[2]}")

Python also supports *associative arrays* in the form of `dict`s, which allow you to index a value by name (since Python 3.7, a `dict` also preserves the order in which its keys were inserted):

In [None]:
simple_dict = {'one':1, 'two':2, 'four':4, 'three':3}
print(f"Item with key (index) 'two' has value: {simple_dict['two']}")

We can inspect the keys and the values contained within a dict by converting them into a list:

In [None]:
print(list(simple_dict.keys()))
print(list(simple_dict.values()))

With a simple `list` or `dict`, we can use a `list` or `dict` comprehension to filter the contents of an object according to a test condition. For example, we can test against the value of the items in a `dict` and generate a `list` containing the associated `dict` keys:

In [None]:
[k for k in simple_dict if simple_dict[k] > 2]
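The paragraph above also mentions `dict` comprehensions. As a minimal sketch (the name `large_items` is ours, introduced only for illustration), the same test can be used to build a filtered `dict` rather than a `list`:

```python
simple_dict = {'one': 1, 'two': 2, 'four': 4, 'three': 3}

# Keep only the key/value pairs whose value is greater than 2
large_items = {k: v for k, v in simple_dict.items() if v > 2}

print(large_items)  # {'four': 4, 'three': 3}
```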

In [None]:
# The following statement makes a new dict by swapping the keys and values
#    contained in simple_dict and prints them out.
alternative_dict = dict(zip(simple_dict.values(), simple_dict.keys()))

print(list(alternative_dict.keys()))
print(list(alternative_dict.values()))

## *pandas* `Series` and `DataFrame`s

In the following sections you will be introduced to two powerful data representations supported by the pandas library: `Series` and `DataFrames`. The introduction provides a quick overview of some of the operations that are possible using these data structures. We will be revisiting many of the operations in more depth later in the module, so for now just try to get a feel for what's possible.

## Series

A *pandas* `Series` combines the idea of a list with an additional index column. By default, this is a numeric index starting at zero:

In [None]:
simple_series = pd.Series(['one', 'two', 'three', 'four'])

simple_series

We can index into a `Series` using the corresponding index value:

In [None]:
simple_series[1]

We can also grab several values at once if we pass the desired index values in as a list in the order we want them to be displayed:

In [None]:
simple_series[ [1, 0, 3] ]

In much the same way that we can inspect the keys used in a Python dict, we can inspect the index values used within a Series:

In [None]:
simple_series.index

We can also define our own index values:

In [None]:
myindex_series = pd.Series([1, 2, 3, 4],
                           index=['one', 'two', 'three', 'four'])
print(myindex_series.index)
myindex_series

Again, it's easy enough to pull out several values from the `Series` by providing several of our own index values in a `list`:

In [None]:
myindex_series[ ['two', 'four'] ]

`Series` can also be created from a simple `dict`, where the unique key values become index values:

In [None]:
pd.Series( {'Q1':'Spring', 'Q2':'Summer', 'Q3':'Autumn', 'Q4':'Winter'} )

In [None]:
pd.Series({'Q1':'Spring', 'Q2':'Summer', 'Q3':'Autumn', 'Q4':'Winter'}, 
       index=['Q4', 'Q3', 'Q2', 'Q1'] )
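In the second example, the `index` list selects and orders the entries taken from the `dict`. The sketch below illustrates what we would expect if the list names a key that is not in the `dict`: the entry is still created, but it holds a missing value. The key `'Q5'` and the variable name `seasons` are made up for this illustration.

```python
import pandas as pd

# 'Q5' is a hypothetical key that does not appear in the dict
seasons = pd.Series({'Q1': 'Spring', 'Q2': 'Summer'},
                    index=['Q2', 'Q1', 'Q5'])

print(seasons)

# isna() flags the entries that have no value
print(seasons.isna().tolist())  # [False, False, True]
```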

### Activity 1

Construct a `Series` containing the names of the days of the week and a day number index, ordering the days Monday to Sunday with day index values 1 to 7. Assign your `Series` to a variable `days_of_the_week`. 

In [None]:
# Enter your code in this cell. Add more cells if you need to.

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Here's our sample solution. The first thing to note is that simply defining a Series without managing the index starts the indexing at zero:

In [None]:
wrong_index_days_of_the_week = pd.Series( ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
                                           'Friday', 'Saturday', 'Sunday'] )
wrong_index_days_of_the_week

So we need to supply the index even though it's a simple integer run of values:

In [None]:
days_of_the_week = pd.Series( ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                               'Friday', 'Saturday', 'Sunday'],
                             index=[1, 2, 3, 4, 5, 6, 7] )
days_of_the_week

#### End of Activity 1

### 'Vector' operations with `Series`

In certain respects, we can think of a `Series` as a vector. For example: if we multiply a `Series` by a scalar value, each member of the `Series` is multiplied by that value:

In [None]:
5 * pd.Series([1, 1, 2, 3, 4])

If you have ever worked with a spreadsheet and used a formula to multiply the values in one column by the same fixed amount, you will have performed a similar operation.

Similarly, other operations can be applied to a `Series`, such as adding the same fixed amount to each item or subtracting it from each item:

In [None]:
mySeries = pd.Series([6, 7, 3, 9, 1])
mySeries, mySeries - 7.1

We can combine two `Series` with simple operations:

In [None]:
pd.Series([1, 1, 2, 3, 4]) + pd.Series([10, 10, 15, -15, -20])

But be careful if the two `Series` are not the same length: try it and see what happens.
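If you would like something to compare your own experiment against, here is a small sketch using two made-up `Series` of different lengths (the variable names are ours). Items are paired up by index value, so any index value that appears in only one `Series` has nothing to be added to:

```python
import pandas as pd

short_series = pd.Series([1, 2, 3])
long_series = pd.Series([10, 20, 30, 40, 50])

# Index values 3 and 4 only exist in long_series,
# so the corresponding sums are reported as NaN
total = short_series + long_series

print(total)
```

Note that the result is as long as the longer `Series`, and that the values are reported as floating point numbers because `NaN` is itself a floating point value.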

We can also apply a function to each item in a Series, returning a new Series of the same length as the original one:

In [None]:
def square(x):
    """Return the square of a provided value."""
    return x**2
    
intSeries = pd.Series( [1, -2, 3, -4])

intSeries.apply(square)

We can also apply a `lambda` function to the elements of a `Series`:

In [None]:
intSeries.apply(lambda x: x**3)
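For simple arithmetic such as this, the vector operations described earlier offer an alternative to `.apply()`. The following sketch (with variable names of our own choosing) suggests that both routes give the same values:

```python
import pandas as pd

int_series = pd.Series([1, -2, 3, -4])

# Element-by-element, using a function applied to each item
applied = int_series.apply(lambda x: x**2)

# The same calculation written as a vector operation on the whole Series
vectorised = int_series ** 2

print(applied.tolist())     # [1, 4, 9, 16]
print(vectorised.tolist())  # [1, 4, 9, 16]
```

The `.apply()` approach comes into its own when the function is more complicated than a single arithmetic expression.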

### Filtering `Series`

A very useful feature of `Series` is that we can filter their values by value. The *values* in the `Series` (*not* the index values) are tested against the condition and the `Series` elements that pass the test are returned, along with their index values.

Let's create another simple series using `range` to generate a short sequence of values:

In [None]:
range_series=pd.Series(range(100, 106))
range_series

We can create a new `Series` as a filtered set of values from the original `Series`. In this case, we keep the items where the value exceeds a specified minimum value:

In [None]:
range_series[ range_series > 102 ]

This takes a little bit of thinking about. Let's see what the index expression (in the square brackets, `[]`) is returning:

In [None]:
range_series > 102

The result is a `Series` of Boolean values with the same index as the original. When it is used inside the square brackets, the `Series` values whose index values are marked `True` are returned. In `range_series`, these are the values `103`, `104` and `105`.
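To make the two steps explicit, the sketch below stores the Boolean test in a variable of its own (we have called it `mask`, which is just our label for it) before using it to filter the `Series`:

```python
import pandas as pd

range_series = pd.Series(range(100, 106))

# Step 1: test every value, giving one True/False result per item
mask = range_series > 102
print(mask.tolist())  # [False, False, False, True, True, True]

# Step 2: use the True/False values to select items from the Series
filtered = range_series[mask]
print(filtered.tolist())  # [103, 104, 105]
```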

Note also that the index values of the new `Series` are the index values from the original `Series`:

In [None]:
range_series[range_series > 102]

We can reset the index values by applying the `.reset_index()` method to the filtered `Series`. The `drop=True` parameter ensures the original index values are then just thrown away...

In [None]:
range_series[ range_series > 102 ].reset_index(drop=True)

### Aligning data in `Series`

In many situations it can be useful to be able to add `Series` elements together where they share some common index values, even if the `Series` are presented in a different order. 

For example, imagine you have two separate `Series` containing the values of payments made to just the companies who worked on a particular project. Now suppose we want a `totalSpend` `Series` which is the sum of the two `Series` elements for each company.

In [None]:
# The total spend Series for the four companies starts at zero.
totalSpend = pd.Series({"Company A":0, "Company B":0, "Company C":0, "Company D":0})
totalSpend

The `project1` variable represents the company expenditures on one project; `project2` represents another project:

In [None]:
project1 = pd.Series([1000, 2000, 500], index=["Company A", "Company B", "Company C"])
project2 = pd.Series([800, 2000 ], index=["Company D", "Company A"])

What happens if we try adding various combinations of `totalSpend`, `project1` and `project2`, treating them as vectors?

In [None]:
totalSpend + project1

And another:

In [None]:
project1 + project2

If any index values don't match across both `Series`, then the sum for that index value returns *NaN* - which means "*N*ot *a* *N*umber". If particular row index values do match across Series, even if they are not presented in the same order, then the rows are aligned and the sum of similarly indexed values is calculated and returned.

Using the `Series.add()` method, we can force missing values to be treated as a particular value, such as 0, using a `fill_value` parameter.

In [None]:
totalSpend.add(project1, fill_value=0)

The result of the `add()` method is itself a `Series`, so we can chain the `add()` expressions:

In [None]:
totalSpend.add(project1, fill_value=0).add(project2, fill_value=0)
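As a self-contained sketch of the difference that `fill_value` makes, the following compares the plain `+` operator with the `add()` method for the two project `Series` (the names `plain_sum` and `filled_sum` are ours):

```python
import pandas as pd

project1 = pd.Series([1000, 2000, 500],
                     index=["Company A", "Company B", "Company C"])
project2 = pd.Series([800, 2000], index=["Company D", "Company A"])

# Only Company A appears in both Series, so every other sum is NaN
plain_sum = project1 + project2

# With fill_value=0, a company missing from one Series is treated as 0 there
filled_sum = project1.add(project2, fill_value=0)

print(plain_sum)
print(filled_sum)
```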

## `DataFrames`

DataFrames are two-dimensional data tables in which rows of data have values spread across one or more columns, much like a sheet in a spreadsheet. Each column behaves as if it is a `Series`; a `DataFrame` can therefore be thought of as a `dict` of `Series`, where `dict` keys correspond to column names. 

Calling the `pandas.DataFrame()` function with no arguments creates an empty `DataFrame`:

In [None]:
empty_df = pd.DataFrame()
empty_df

We can also check whether a `DataFrame` object is empty or not:

In [None]:
empty_df.empty

The `course_dict` object is a `dict`, with each key the name of a column, and each value a list of the values in that column:

In [None]:
course_dict = {'course_code': ['TM351', 'TU100', 'M269'],
              'points': [30, 60, 30],
              'study_level': ['3', '1', '2']
              }

course_dict

If we pass just such a `dict` to the `DataFrame()` function, we can create a `DataFrame` object from it:

In [None]:
course_df = pd.DataFrame(course_dict)
course_df

You will see that the notebook renders the `DataFrame` as a table. The bold font column on the left is the `DataFrame` index. The column names correspond to the `dict` keys.
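Earlier we suggested that a `DataFrame` can be thought of as a `dict` of `Series`. As a rough sketch of that idea, we can also build a `DataFrame` directly from such a `dict`. The variable names here are our own, and the course codes are used as a shared index purely for illustration:

```python
import pandas as pd

codes = ['TM351', 'TU100', 'M269']

# Two Series that share the same index values
points = pd.Series([30, 60, 30], index=codes)
levels = pd.Series([3, 1, 2], index=codes)

# Each dict key becomes a column name; the shared index labels the rows
by_code_df = pd.DataFrame({'points': points, 'study_level': levels})

print(by_code_df)
print(type(by_code_df['points']))
```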

We can check the object type too, and see whether it is empty:

In [None]:
type(course_df), course_df.empty

We can also create a `DataFrame` from a list of tuples - each tuple becomes a row, and we use the `columns` parameter to give a name to the columns:

In [None]:
pd.DataFrame([('a', 1, 2), ('b', 2, 3)], 
             columns=["alpha", 'num1', 'num2'])

We can pull out any required column as a Series by using the column name as a key value in a way that resembles using a `dict` key: 

In [None]:
course_df['course_code']

You can also use a dot notation to refer to the column as an attribute of the `DataFrame`, although this is perhaps not as clear as using the `[]` approach:

In [None]:
course_df.course_code

The `[]` approach also provides a level of indirection when accessing columns. For example, we can refer to a column via a variable:

In [None]:
myCol = 'course_code'
course_df[ myCol ]

To pull out several columns, provide a list of the column names you want to extract as index values:

In [None]:
course_df[ ['course_code', 'study_level'] ]

Again, we can refer to the columns via a variable:

In [None]:
myCols = ['course_code', 'study_level'] 
course_df[ myCols ]

We can force the ordering of columns in the `DataFrame` by means of the `columns` parameter. If a column name is specified that does not have a match in the keys of the source data `dict`, an empty column is created with all the values set to `NaN`:

In [None]:
course_df = pd.DataFrame(course_dict, columns=['study_level', 'course_code', 'title', 'points'])
course_df

We can populate a column with the same value in each cell using a single value assignment (for example, a string or an `int`):

In [None]:
course_df['title'] = 'Unknown'
course_df

We can also set column values from a `Series`:

In [None]:
course_df['title'] = pd.Series(['The data course', 'The foundation course', 'The algorithms course'])
course_df
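One detail worth knowing when setting a column from a `Series` is that the values are matched to rows by index value rather than by position. The sketch below uses a small stand-in `DataFrame` (`demo_df`, our own name) and a `Series` whose titles are deliberately listed in reverse order:

```python
import pandas as pd

demo_df = pd.DataFrame({'course_code': ['TM351', 'TU100', 'M269']})

# The titles are listed in reverse, but each is labelled with
# the index value of the row it belongs to
titles = pd.Series(['The algorithms course',
                    'The foundation course',
                    'The data course'],
                   index=[2, 1, 0])

demo_df['title'] = titles

print(demo_df)
```

Each title ends up alongside the course code that shares its index value, so `TM351` is paired with 'The data course'.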

And we can pull out the values of a column in a similar way:

In [None]:
course_df['title']

#### Column typing in `DataFrame`s

In order to perform certain operations on a column, the column needs to be correctly typed - that is, all elements need to be of a particular type.

You can check the type of each column in a `DataFrame` using the `.dtypes` attribute:

In [None]:
course_df.dtypes

Here we see that the `study_level`, `course_code` and `title` columns are classed as object types, but the `points` values have been identified as integers.

We can recast the values of a column to another datatype using the `.astype()` method:

In [None]:
course_df['study_level'] = course_df['study_level'].astype(int)
course_df.dtypes
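To see why the column type matters, consider what multiplication means for text compared with numbers. This is a minimal sketch using a stand-alone `Series` of our own; we set `dtype=object` explicitly so that the values are held as ordinary Python strings:

```python
import pandas as pd

# Study levels recorded as text rather than as numbers
levels_as_text = pd.Series(['3', '1', '2'], dtype=object)

# Multiplying a string by 2 repeats the string
print((levels_as_text * 2).tolist())  # ['33', '11', '22']

# After recasting to integers, multiplication is arithmetic
levels_as_int = levels_as_text.astype(int)
print((levels_as_int * 2).tolist())  # [6, 2, 4]
```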

#### Simple vector operations

As with a `Series`, we can perform vector style operations across one or more columns. For example, we can multiply or divide the values in a column vector by a scalar value:

In [None]:
course_df['weekly_study_hours'] = 10 * course_df['points'] / 30
course_df

We can also apply the same operation to multiple columns, returning a `DataFrame`:

In [None]:
10 * course_df[['points', 'weekly_study_hours']]

As with multiple `Series`, we can perform vector operations across columns:

In [None]:
course_df['level_hours'] = course_df['study_level'] * course_df['weekly_study_hours'] 
course_df

If we inspect the type of a single `DataFrame` column, we see that it is a `Series`. If we select multiple columns, we return a `DataFrame`:

In [None]:
type(course_df['study_level']), type( course_df[['study_level', 'points']] )

This means that we can apply a function to each element just as we did with the Series:

In [None]:
course_df['study_level'].apply(lambda x: x * 10)

We can also set these values as the values of a new column:

In [None]:
course_df['dummy_column'] = course_df['study_level'].apply(lambda x: x * 10)
course_df

We can also apply functions across multiple columns or rows in a `DataFrame`.

We use the `axis=1` parameter to reference rows, and `axis=0` to reference columns.

In the following case, each row of the two column dataframe is passed to the lambda function, which selects one of the elements:

In [None]:
course_df[['study_level', 'points']].apply(lambda x: x['points'], axis=1)

We can take a closer look at what's going on by displaying what is passed to the `lambda` function each time it is called.

*If we add a semi-colon (`;`) to the last line of the code cell, we suppress the display of the object returned from the last line of the cell.*

In [None]:
def show_me(row):
    """Display what is passed from a DataFrame apply statement."""
    print(f"I was passed a {type(row)}:")
    display(row)

# Set axis=1 (or axis='columns') to pass each row in turn to the function
course_df[['study_level', 'title']].apply(show_me, axis=1);

We can also pass columns using the `DataFrame.apply()` method:

In [None]:
# You may notice that this function is essentially exactly the same as the previous one
# in terms of functionality...

def show_me(column):
    """Display what is passed from a DataFrame apply statement."""
    print(f"I was passed a {type(column)}:")
    display(column)

# This time, set axis=0 (or axis='index') to pass each column in turn to the function
course_df[['study_level', 'title']].apply(show_me, axis=0);
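The following sketch shows both directions side by side on a small, purely numeric stand-in `DataFrame` (`demo_df` and the other variable names are ours). With `axis=1` the function receives one row at a time, and with `axis=0` it receives one column at a time:

```python
import pandas as pd

demo_df = pd.DataFrame({'study_level': [3, 1, 2],
                        'points': [30, 60, 30]})

# axis=1: the lambda is called once per row, so we get one result per row
row_totals = demo_df.apply(lambda row: row['study_level'] + row['points'],
                           axis=1)
print(row_totals.tolist())  # [33, 61, 32]

# axis=0: the lambda is called once per column, so we get one result per column
column_maxima = demo_df.apply(lambda col: col.max(), axis=0)
print(column_maxima.tolist())  # [3, 60]
```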

#### Working with the index

One of the reasons for using a library such as *pandas* is that it offers data structures that make working with data as easy as possible.  Sometimes, this requires a little careful thinking about how best to organise the data within a `DataFrame`.

For example, for the course information `DataFrame`, it may be most useful to use the course codes as the index values.

In [None]:
course_df = course_df.set_index('course_code')
course_df

The visual presentation of this is slightly misleading - it looks as if there may be a blank row with index value `course_code`. But the size of the transformed table is correctly shown: one of the columns in the original `DataFrame` has moved from being a column in its own right to being the index column.

We can pull out one or more rows by referencing the appropriate index element(s) and the columns we wish to extract. To extract rows based on an index label, use the `.loc` attribute:

In [None]:
course_df.loc[ ['TU100', 'TM351'], ['title', 'study_level'] ]

If we identify a particular column, we can use an index value to pull out the value from the correspondingly indexed row and the chosen column:

In [None]:
course_df['title']['TM351']
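The `.loc` attribute can also be used to pull out a single value by giving it one row label and one column name. Here is a brief sketch on a stand-in `DataFrame` (`demo_df` is our own name) that compares the two approaches:

```python
import pandas as pd

demo_df = pd.DataFrame({'title': ['The data course', 'The foundation course'],
                        'points': [30, 60]},
                       index=['TM351', 'TU100'])

# Select the column first, then the row
chained = demo_df['title']['TM351']

# Select the row and the column in a single step
direct = demo_df.loc['TM351', 'title']

print(chained)  # The data course
print(direct)   # The data course
```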

We can also filter the rows of a `DataFrame` based on the values of one or more columns. We will cover this powerful feature in more depth later in the module.

In [None]:
course_df[course_df['points']==30]
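As a taste of filtering on more than one column, the sketch below combines two tests with the `&` operator, which requires both conditions to hold. The stand-in `DataFrame` and variable names are ours, and each test needs its own pair of round brackets:

```python
import pandas as pd

demo_df = pd.DataFrame({'points': [30, 60, 30],
                        'study_level': [3, 1, 2]},
                       index=['TM351', 'TU100', 'M269'])

# Keep rows that are worth 30 points AND are at study level 3 or above
selected = demo_df[(demo_df['points'] == 30) & (demo_df['study_level'] >= 3)]

print(selected.index.tolist())  # ['TM351']
```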

You may recall that passing a column name in as the `DataFrame` key returns a `Series` containing the values of the column:

In [None]:
course_df['title']

Whereas passing a Boolean test returns a selection of rows, passing a column name returns a column. Paraphrasing the author of *pandas*, Wes McKinney, writing in his book _Python for Data Analysis_, '[this apparent inconsistency] in syntax arose out of practicality and nothing more'. Pragmatic programming, as you might say, FTW! (FTW is slang for 'for the win', a positive exclamation in gamer culture.)

#### Unique items in a column

We can pull out the unique values contained within a column very easily by applying the `unique()` method to the appropriate column:

In [None]:
course_df['points'].unique()

We can easily iterate through the items:

In [None]:
for item in course_df['points'].unique():
    print(item)

To get a list rather than an array, make use of the `tolist()` method:

In [None]:
course_df['points'].unique().tolist()

## Further information

The *pandas* `Series` and `DataFrame` objects provide many more useful and powerful data-manipulation methods than have been described here. You will meet many of them later in the module, as and when they are required. If you want to read some more about these core *pandas* data structures, you can find some additional information at [*pandas*: Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html).

If you would like to learn about *pandas* in more depth, a copy of the book written by pandas' original developer, Wes McKinney, is available as an ebook via the OU Library: [Python for Data Analysis](http://proquestcombo.safaribooksonline.com.libezproxy.open.ac.uk/book/programming/python/9781449323592)

## What next?

Having hopefully got to grips with the very powerful tabular `DataFrame` representation that *pandas* provides, you now need to start filling `DataFrame`s with data.

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move to the Notebook `02.2 Data file formats` to learn how to open files and parse their data contents into a *pandas* `DataFrame`.