# Introduction to Pandas and Series Data

## The Series Data Structure

The series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the second is your actual data. It is important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data. 

In [1]:
import pandas as pd

In [2]:
students = ['Alice', 'Jack', 'Molly']
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
numbers = [1,2,3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
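
To make the index and the .name attribute described earlier concrete, here is a minimal sketch. The name 'grades' and the score values are made up for illustration and are not part of the original example.

```python
import pandas as pd

# 'grades' and the scores are hypothetical values chosen for this illustration.
scores = pd.Series([90, 80, 70], index=['Alice', 'Jack', 'Molly'], name='grades')

print(scores.index.tolist())  # the labels, similar to dictionary keys
print(scores.name)            # the label of the data column itself

# A series created without a name simply has its name set to None.
unnamed = pd.Series([1, 2, 3])
print(unnamed.name)
```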

How None is handled depends on the type of the rest of the data in the series. 

In [4]:
students = ['Alice', 'Jack', None]
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [5]:
numbers = [1,2,None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Notice two things here. First, None was converted to NaN, which is a different value. Second, pandas set the dtype of this series to floating point instead of object or int. 

It is important to realize that None and NaN might be used in the same way, but to pandas they are different. None is NOT equivalent to NaN. 

In [6]:
import numpy as np
np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True

So keep in mind when you see NaN, its meaning is similar to None, but it is a numeric value and treated differently for efficiency reasons. 
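
Because NaN never compares equal to anything, including itself, an equality test cannot be used to find missing values in a series. A short sketch of one way to detect them instead, using the isna method on the Series and the top-level pd.isna function:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# An equality check never matches, because NaN is not equal to itself.
matches = (numbers == np.nan)
print(matches.any())

# isna() flags the missing entries directly.
missing = numbers.isna()
print(missing.tolist())

# pd.isna treats both None and NaN as missing.
print(pd.isna(None), pd.isna(np.nan))
```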

Often you have labeled data that you want to manipulate, so creating a Series from a dictionary is common. The index is automatically assigned from the keys of the dictionary that you provided, rather than just incrementing integers. 

In [9]:
student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

Once the series has been created, we can get the index object using the index attribute.

In [10]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [11]:
students = [('Alice','Brown'),('Jack','White'),('Molly','Green')]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [12]:
# You can also separate your index creation from the data by passing in the index as a list
pd.Series(['Physics', 'Chemistry', 'English'], index = ['Alice', 'Jack', 'Molly'])

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

What happens if the list of values in your index object is not aligned with the keys in the dictionary used to create the series?

In [13]:
student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_scores, index = ['Alice', 'Molly', 'Sam'])
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object
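
The output above suggests that the explicit index wins: keys missing from the index list are dropped, and index labels with no matching key get a missing value. A small sketch that checks both effects (the variable name student_classes is chosen here for illustration):

```python
import pandas as pd

student_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_classes, index=['Alice', 'Molly', 'Sam'])

# Jack is in the dictionary but not in the requested index, so he is left out.
jack_present = 'Jack' in s.index
print(jack_present)

# Sam is in the index but not in the dictionary, so his value is missing.
print(s.isna().tolist())
```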

## Querying a Data Series

A pandas Series can be queried either by index position or by index label. If you don't give the series an index when you create it, the position and the label are effectively the same values. To query by numeric position, starting at zero, use the iloc attribute. To query by index label, use the loc attribute. 

In [14]:
import pandas as pd

In [15]:
students_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English', 'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

So for this series, if you wanted to see the fourth entry, you would use the iloc attribute with the parameter 3.

In [16]:
s.iloc[3]

'History'

If you wanted to see what class Molly has, you would use the loc attribute with a parameter of 'Molly'.

In [17]:
s.loc['Molly']

'English'

Keep in mind that iloc and loc are attributes, not methods, so you do not use parentheses to query them, but square brackets instead, which is called the indexing operator. 
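
Since iloc and loc use the indexing operator, they also accept a list of positions or labels to select several entries at once. The following is a brief sketch of that usage, reusing the student data from above:

```python
import pandas as pd

students_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English', 'Sam': 'History'}
s = pd.Series(students_classes)

by_position = s.iloc[[0, 3]]        # first and fourth entries, by position
by_label = s.loc[['Alice', 'Sam']]  # the same entries, selected by label

print(by_position.equals(by_label))
```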

Pandas tries to make our code a bit more readable and provides a smart syntax using the indexing operator directly on the series itself. For instance, if you pass in an integer parameter, the operator will behave as if you want it to query via the iloc attribute. And if you pass in an object, it will query as if you wanted to use the label-based loc attribute. Be aware that recent pandas releases deprecate this positional fallback for integers, so iloc is the more reliable choice for positional access. 

In [19]:
s[3]

'History'

In [20]:
s['Molly']

'English'

What happens if your index is a list of integers? This is a bit complicated, because pandas cannot determine automatically whether you intend to query by index position or by index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly. 

In [22]:
class_code = {99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101      English
102      History
dtype: object

If we try to call s[0], we will get a KeyError because there is no item in the class_code series with a label of zero. Instead, we have to call iloc explicitly if we want the first item. 

In [23]:
s.iloc[0]

'Physics'
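
To see the difference between position and label side by side on an integer index, here is a minimal sketch. It catches the KeyError so that the failed lookup can be inspected rather than stopping the cell.

```python
import pandas as pd

class_code = {99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'}
s = pd.Series(class_code)

first_by_position = s.iloc[0]  # position 0
first_by_label = s.loc[99]     # label 99, which happens to sit at position 0

try:
    s[0]  # treated as the label 0, which does not exist in this index
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, first_by_label, lookup_failed)
```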

Let's talk about working with the data. A common task is to consider all of the values inside a series and perform some sort of operation on them. This could be trying to find a certain number, summarizing the data, or transforming the data in some way. 

A typical programmatic approach to this would be to iterate over all the items in the series and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades and just try to get an average grade. 

In [25]:
grades = pd.Series([90,80,70,60])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


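This works, but it is slow. Pandas and numpy support a method of computation called vectorization, in which an operation is applied to a whole array at once rather than one item at a time in a Python loop. Vectorization works with most of the functions in the numpy library, including the sum function. 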

In [26]:
import numpy as np

total = np.sum(grades)
print(total/len(grades))

75.0
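
The Series object also exposes vectorized methods of its own, so the same average can be computed without calling numpy directly. A short sketch using the sum and mean methods:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

total = grades.sum()     # vectorized sum over all values
average = grades.mean()  # equivalent to total / len(grades)

print(total, average)
```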


In [37]:
# 10,000 random integers from 0 up to (but not including) 1000
numbers = pd.Series(np.random.randint(0,1000,10000))

numbers.head()

0    371
1    997
2    822
3    924
4    509
dtype: int32

In [38]:
len(numbers)

10000

In [39]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

668 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [40]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

36 µs ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
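
The %%timeit cell magic only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard library timeit module can be used. The helper names loop_mean and vectorized_mean are made up for this illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(series):
    """Average computed by iterating over every item in Python."""
    total = 0
    for number in series:
        total += number
    return total / len(series)

def vectorized_mean(series):
    """Average computed with a single vectorized numpy call."""
    return np.sum(series) / len(series)

# Time 100 runs of each approach, mirroring the -n 100 setting used above.
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=100)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=100)

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s for 100 runs each")
```

Both functions return the same average; only the time taken to compute it differs.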


A related feature in pandas and numpy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random value by 2, we could do so quickly using the += operator directly on the Series object. 

In [41]:
numbers.head()

0    371
1    997
2    822
3    924
4    509
dtype: int32

In [42]:
numbers += 2

In [43]:
numbers.head()

0    373
1    999
2    824
3    926
4    511
dtype: int32
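
The += operator changes the series in place. If you would rather keep the original values, the plain arithmetic operators broadcast in the same way but return a new series. A minimal sketch with a few hand-picked values:

```python
import pandas as pd

numbers = pd.Series([371, 997, 822])

shifted = numbers + 2  # new series; numbers itself is unchanged
doubled = numbers * 2  # other arithmetic operators broadcast the same way

print(shifted.tolist(), doubled.tolist())
print(numbers.tolist())
```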

The .loc attribute lets you not only modify data in place but also add new data. If the label you pass in as the index doesn't exist, then a new entry is added. Keep in mind that indices can have mixed types. 

In [44]:
s = pd.Series([1,2,3])
s.loc['History'] = 102
s

0            1
1            2
2            3
History    102
dtype: int64
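
The two behaviors of .loc assignment can be compared directly. In this sketch, assigning to an existing label overwrites the value, while assigning to a new label appends an entry and leaves the index holding a mix of integers and a string.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[0] = 10           # label 0 exists, so its value is modified in place
s.loc['History'] = 102  # label 'History' does not exist, so a new entry is added

print(s.tolist())
print(s.index.tolist())
print(s.index.dtype)    # object, because the index now mixes integers and a string
```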

# DataFrame

## DataFrame Data Structure