# Basic Data Processing with Pandas

From Coursera: Intro to Data Science, Week 2  
Patricia Schuster, University of Michigan  
March 2017

# The `series` data structure

The series is one of the core data structures in pandas. It is a cross between a list and a dictionary: the items are stored in order, and each item has a label with which to retrieve it.

You can create a series by passing in a list of values. Pandas automatically assigns an index starting with zero and sets the name to None.

Start by importing pandas.

In [11]:
import numpy as np
import pandas as pd
# pop up documentation
pd.Series?

We can pass in anything array-like. Tell pandas about your favorite animals.

In [2]:
animals = ['Tiger','Bear','Moose']
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

Pandas automatically identified the type of the data being held in the list. We passed in a list of strings and pandas set the type to object.
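The defaults described above can be inspected directly on the series. The snippet below is a minimal sketch that reuses the animal list from the previous cell. The printed dtype is left uncommented because the name pandas reports for string data can differ between versions.

```python
import pandas as pd

animals = pd.Series(['Tiger', 'Bear', 'Moose'])

# The default index is a range of integer positions starting at zero
print(list(animals.index))   # [0, 1, 2]

# No name was supplied, so the name attribute is None
print(animals.name)          # None

# The dtype pandas inferred for the string values
print(animals.dtype)
```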

In [4]:
numbers = [1,2,3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In this case, pandas stored the series as ints. Underneath, pandas stores series values in a typed array using the NumPy library. This makes data processing much faster.
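To see the typed array underneath a series, you can ask for it with `to_numpy()`. This is a small sketch; the variable name `values` is only for illustration.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() returns the series values as a NumPy array
values = numbers.to_numpy()

print(type(values))    # <class 'numpy.ndarray'>
print(values.dtype)    # an integer dtype, matching the series dtype
```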

# Handling missing data

In Python, we have `None` to indicate a lack of data. But what happens when `None` appears in a typed array like the one inside a Series object?

In [5]:
animals = ['Tiger','Bear',None]
pd.Series(animals)

0    Tiger
1     Bear
2     None
dtype: object

Pandas inserts it as a None and uses the type object for the underlying array.

In [6]:
numbers = [1,2,None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Here, pandas inserts `NaN`, which means not a number. This is a pretty important point. `NaN` is not `None`. It does not equal `None`, and it does not even equal itself.

In [12]:
np.nan == None

False

In [13]:
np.nan == np.nan

False

You need to use special functions to test for the presence of `NaN`, such as NumPy's `isnan` function. We will revisit this later.
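As a quick sketch of what those special functions look like, `np.isnan` tests a single numeric value, while the `isna()` method on a series tests every element at once. The name `mask` is just for this example.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# Test a single value; an equality check against np.nan would always be False
print(np.isnan(numbers.iloc[2]))   # True

# Test the whole series at once; the result is a boolean series
mask = numbers.isna()
print(mask.tolist())               # [False, False, True]
```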

# Creating a pandas series

A series can be created from dictionary data. The index is automatically assigned to the keys of the dictionary.

When we create the series, we see that, since it was string data, pandas sets the data type of the series to object. 

In [16]:
sports = {'Archery': 'Bhutan',
          'Golf' : 'Scotland',
          'Sumo' : 'Japan',
          'Taekwondo' : 'South Korea'}

s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [17]:
s.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

You could also separate your index creation from the data by passing in the index as a list explicitly to the series. 

In [19]:
s = pd.Series(['Tiger','Bear','Moose'], index = ['India','America','Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object

# Querying a series

A pandas series can be queried by the index position or label. If you don't provide labels, the position and label are effectively the same thing.

* To query by numeric location, starting at zero, use the `iloc` attribute.
* To query by the index label, use the `loc` attribute.

Revisit the olympic sports example.

In [22]:
sports = {'Archery': 'Bhutan',
          'Golf' : 'Scotland',
          'Sumo' : 'Japan',
          'Taekwondo' : 'South Korea'}

s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

See the fourth country, or the country associated with golf.

In [26]:
s.iloc[3]

'South Korea'

In [27]:
s.loc['Golf']

'Scotland'

Keep in mind that `iloc` and `loc` are not methods; they are attributes. You query them with square brackets (the indexing operator), not parentheses.

Pandas makes code more readable by providing a smart syntax using the indexing operator directly on the series itself. 

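If you pass in an integer, the operator will behave as if you want it to query via the `iloc` attribute. If you pass in a label, it will query as if you want to use the `loc` attribute. Recent pandas releases deprecate this positional fallback for integers, so `s.iloc[3]` is the more durable spelling.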

In [28]:
s[3]

'South Korea'

In [29]:
s['Golf']

'Scotland'

What happens if your index is a list of integers? A key such as `s[0]` could mean position 0 or the label 0, and pandas treats it as a label, which may not be what you intended. The safer option is to specify `iloc` or `loc` explicitly every time.
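Here is a small sketch of the integer-index situation. The labels 10, 20, and 30 are made up for illustration; with them, position and label no longer coincide, so being explicit matters.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=[10, 20, 30])

by_label = s.loc[10]      # look up the label 10
by_position = s.iloc[0]   # look up the first position

print(by_label, by_position)   # Tiger Tiger

# There is no label 0, so a label lookup for 0 fails
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(label_zero_found)   # False
```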

# Working with the data 

## Vectorization

A typical strategy would be to iterate over all of the items in the series and invoke the operation you are interested in. For instance, we could create a series of prices and add them up.

In [32]:
s = pd.Series([100.00, 120.0, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

In [33]:
total = 0
for item in s:
    total += item
print(total)

324.0


This is simple, but it is slow. Pandas and NumPy support vectorization, which applies an operation to a whole array at once instead of one item at a time in a Python loop. Rewrite this last operation using NumPy's `sum()` function.

In [34]:
total = np.sum(s)
print(total)

324.0
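A series also has its own `sum()` method, which gives the same result as `np.sum` here. A brief sketch with the same prices:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.00, 120.0, 101.00, 3.00])

total_numpy = np.sum(s)    # NumPy function applied to the series
total_method = s.sum()     # equivalent method on the series itself

print(total_numpy, total_method)   # 324.0 324.0
```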


How can we compare which of these two methods is faster? Jupyter Notebook has a useful built-in magic function for timing code. Start by creating a big series of random numbers.

(The `head` method limits the amount of data printed out to the first five elements.)

In [37]:
s = pd.Series(np.random.randint(0,1000,10000))
s.head()

0    594
1    350
2    217
3    921
4     33
dtype: int32

In [38]:
len(s)

10000

Use a cell *magic* function (cell magics start with two `%` signs). `%%timeit` measures how long the code in a cell takes to run. The series has 10,000 items, and the `-n 100` option tells `%%timeit` to execute the cell 100 times per timing run.

In [39]:
%%timeit -n 100

summary = 0
for item in s:
    summary += item

100 loops, best of 3: 970 µs per loop


In [40]:
%%timeit -n 100
summary = np.sum(s)

100 loops, best of 3: 121 µs per loop
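Outside of a notebook, a similar comparison can be made with the standard library `timeit` module. This is only a sketch: the helper names `loop_sum` and `vector_sum` are invented for this example, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum():
    # Add the items one at a time in a Python loop
    total = 0
    for item in s:
        total += item
    return total

def vector_sum():
    # Let NumPy add the whole array at once
    return np.sum(s)

runs = 100
loop_time = timeit.timeit(loop_sum, number=runs)
vector_time = timeit.timeit(vector_sum, number=runs)

# Report the average time per call in microseconds
print(f"loop:       {loop_time / runs * 1e6:.0f} microseconds per call")
print(f"vectorized: {vector_time / runs * 1e6:.0f} microseconds per call")
```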


## Broadcasting

With broadcasting, you can apply an operation to every value in the series, changing the series.  

For example, add 2 to every value in `s`. The procedural way to do this would be to iterate through the whole series, but we can just use `+=`. 

In [41]:
s += 2
s.head()

0    596
1    352
2    219
3    923
4     35
dtype: int32

How does this compare to iterating through the series?

In [42]:
%%timeit -n 10

s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # iteritems() was removed in pandas 2.0; items() is the equivalent
    s.loc[label] = value+2

10 loops, best of 3: 622 ms per loop


In [43]:
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s += 2

10 loops, best of 3: 260 µs per loop
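Broadcasting works the same way with other arithmetic operators. The sketch below uses a short series of made-up values to show that `s + 2` and `s * 2` each build a new series, while `s += 2` updates `s` itself.

```python
import pandas as pd

s = pd.Series([594, 350, 217])

shifted = s + 2    # new series; s is not changed by this line
doubled = s * 2    # every value multiplied by 2

print(shifted.tolist())   # [596, 352, 219]
print(doubled.tolist())   # [1188, 700, 434]
print(s.tolist())         # [594, 350, 217]

s += 2                    # now s holds the shifted values
print(s.tolist())         # [596, 352, 219]
```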


# Adding entries

The `.loc` attribute allows you to modify data in place and add new data. If the value you pass in as the index doesn't exist, a new entry is created. 

In [44]:
s = pd.Series([1,2,3])
s

0    1
1    2
2    3
dtype: int64

In [45]:
s.loc['Animal'] = 'Bear'
s

0            1
1            2
2            3
Animal    Bear
dtype: object

We see here that mixed data types are no problem for pandas. When we add the last entry, pandas changes the data type of the whole series from `int64` to `object` so that it can hold both the integers and the string.
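The type change can be checked by looking at `dtype` before and after the new entry is added. A minimal sketch; `before` and `after` are names used only here.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
before = s.dtype            # integer dtype while all values are ints

s.loc['Animal'] = 'Bear'    # 'Animal' is not in the index, so a new entry is added
after = s.dtype             # general-purpose dtype once a string is present

print(before, after)        # int64 object
```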

# Non-unique index values

So far we have only looked at series whose index values are unique. Now we will consider an example where the index values are not unique. This is one way pandas data structures differ, conceptually, from a table in a relational database.

Revisit the countries and their favorite sports.

In [46]:
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf' : 'Scotland',
                             'Sumo' : 'Japan',
                             'Taekwondo' : 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'],
                                      index = ['Cricket',
                                               'Cricket',
                                               'Cricket',
                                               'Cricket'])

In [47]:
original_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [48]:
cricket_loving_countries

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

Append the cricket loving countries to the original sports series.

In [49]:
# Series.append was removed in pandas 2.0; pd.concat is the supported equivalent
all_countries = pd.concat([original_sports, cricket_loving_countries])
all_countries

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

In [50]:
original_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

Note that `original_sports` has not been changed. Combining two series returns a new series, unlike `list.append`, which modifies the original list in place.
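The contrast with a Python list can be sketched as follows. The example uses `pd.concat` to combine the series and shortened data for brevity. The names `original`, `extra`, and `countries` are illustrative.

```python
import pandas as pd

original = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
extra = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])

# Combining series builds a new object and leaves the inputs alone
combined = pd.concat([original, extra])
print(len(original), len(combined))   # 2 4

# list.append changes the list itself and returns None
countries = ['Bhutan']
result = countries.append('Scotland')
print(countries, result)              # ['Bhutan', 'Scotland'] None
```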

In [54]:
all_countries['Cricket']

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
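One consequence of a non-unique index is that the same kind of lookup can return different types. In this sketch, with a shortened version of the data, a unique label gives back a single value and a repeated label gives back a series. `index.is_unique` reports which case applies.

```python
import pandas as pd

all_countries = pd.Series(['Bhutan', 'Australia', 'England'],
                          index=['Archery', 'Cricket', 'Cricket'])

one = all_countries.loc['Archery']    # unique label: a single value
many = all_countries.loc['Cricket']   # repeated label: a series of matches

print(one)                              # Bhutan
print(len(many))                        # 2
print(all_countries.index.is_unique)    # False
```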