# Exploring the pandas Series Structure

The Series is one of the core data structures in pandas. Think of it as a cross between a list and a dictionary. The items are stored in order, and each has a label you can use to retrieve it. An easy way to visualize this is as two columns of data: the first column is the special index, similar to dictionary keys, while the second column is the actual data. It's important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.
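As a minimal sketch of that two-column picture, the index, the data, and the name can each be inspected separately. The student names, courses, and the name `course` here are made up for illustration.

```python
import pandas as pd

# Hypothetical data: student names as the index, courses as the values.
example = pd.Series(["Physics", "Chemistry"], index=["Alice", "Bob"], name="course")

labels = example.index.tolist()  # the index "column"
values = example.tolist()        # the data "column"
print(labels, values, example.name)
```

If no name is given when the Series is created, `.name` is simply None.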

In [2]:
import pandas as pd
import numpy as np 

In [3]:
# We can create a series by passing a list of values. When we do this, pandas automatically assigns an index starting with zero and sets the name of the series to None.

# One of the easiest ways to create a series is to use an array-like object, like a list. 

students = ["Alice", "Bob", "Jack"]

# Now we just call the Series constructor in pandas and pass in the students

pd.Series(students)

0    Alice
1      Bob
2     Jack
dtype: object

In [4]:
# We see here that pandas has automatically identified the type of data in this Series as "object" and set the dtype parameter as appropriate.

In [5]:
# If we pass in integers, we can see that pandas sets the dtype to int64. Underneath, pandas stores series values in a typed array using the NumPy library. This offers a significant speedup when processing the data versus traditional Python lists.

numbers = [1, 2, 3]

# And turn this into a Series

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [6]:
# There are some other typing details that exist for performance that are important to know. The most important is how NumPy, and thus pandas, handles missing data.
# If we create a list of strings and one element is None, pandas inserts it as None and uses the type object for the underlying array.

# Example

students = ["Alice", "Jack", None]

pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [7]:
# If we create a list of integers or floating-point numbers and put in a None, pandas converts the None to NaN (Not a Number). Because NaN is a floating-point value, pandas stores the numbers as floats and sets the dtype to float64.

num = [1,2,3, None]

pd.Series(num)

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [8]:
# NaN is not equivalent to None

import numpy as np 
np.nan == None

False

In [9]:
# It turns out that NaN does not even compare equal to itself
np.nan == np.nan

False

In [10]:
# Instead, we need to use a special function to test for the presence of not a number.

np.isnan(np.nan)

True
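Since np.isnan expects numeric input, a check that also covers None can be handy. As a small sketch (the `scores` data is made up for illustration), pandas provides pd.isna for single values and the isna() method for a whole Series:

```python
import numpy as np
import pandas as pd

# pd.isna treats both None and NaN as missing.
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)

# On a Series, isna() returns a boolean mask with one entry per element.
scores = pd.Series([1, 2, None])
missing_mask = scores.isna()
print(none_is_missing, nan_is_missing, missing_mask.tolist())
```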

In [11]:
# A series can be created directly from dictionary data. If we do this, the index is automatically set to the keys of the dictionary we provided, rather than to incrementing integers.

student_scores = {"Alice": "Physics", "Jack": "Chemistry", "Bob": "Math"}
s = pd.Series(student_scores)
s


Alice      Physics
Jack     Chemistry
Bob           Math
dtype: object

In [12]:
# Once the series has been created, we can get the index of the object using the index attribute

s.index

Index(['Alice', 'Jack', 'Bob'], dtype='object')

In [13]:
# The dtype object is not just for strings, but for arbitrary objects.
# For example, a list of tuples

students = [("Alice","Bob"),("Jack", "Mary"),("Chris","Jen")]
pd.Series(students)


0    (Alice, Bob)
1    (Jack, Mary)
2    (Chris, Jen)
dtype: object

In [14]:
# We can also separate the index creation from the data by passing in the index as a list explicitly to the series. 

s = pd.Series(["Physics","Chemistry", "Math"], index=["Jen","Mary","Chris"])
s

Jen        Physics
Mary     Chemistry
Chris         Math
dtype: object

In [15]:
# If the index values we pass do not align with the keys of the dictionary, pandas favors the index we provided: it keeps only, and all of, those index values. Dictionary keys that are not in our index are ignored, and any index value that is not a key in the dictionary gets NaN (or None) as its value.

# example

student_scores = {"Alice": "Math", "Bob": "Chem", "Mary": "English"}

s = pd.Series(student_scores, index=["Alice","Jack","Mary"])
s

Alice       Math
Jack         NaN
Mary     English
dtype: object

# Querying A Series

A pandas Series can be queried either by the index position or by the index label. If you don't give the series an index when creating it, the position and the label are effectively the same values.
To query by numeric location, starting at zero, use the ***iloc*** attribute.
To query by the index label, use the ***loc*** attribute.

In [16]:
students_classes = {"Alice": "Physics", "Bob": "Chem", "Jack": "English"}
s = pd.Series(students_classes)
s

Alice    Physics
Bob         Chem
Jack     English
dtype: object

In [17]:
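# If we wanted to see the third entry, we would use the iloc attribute. With a non-integer index, the [] operator has also accepted a position, but recent pandas versions deprecate that fallback, so iloc is the safer choice.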

print(s.iloc[2])
print(s[2])

English
English


In [18]:
# If we wanted to see what class Bob has, we use the loc attribute.

print(s.loc["Bob"])
print(s["Bob"])

Chem
Chem


In [19]:
# If our custom index is a list of integers, it is better to explicitly use the loc or iloc attributes.

# Here is an example using classes and class codes, where the classes are indexed by integer class codes.

classcode = {99: "Math", 100: "Java", 101: "Python"}
c = pd.Series(classcode)
c

99       Math
100      Java
101    Python
dtype: object

In [20]:
# If we just call c[0], pandas raises a KeyError because 0 is not one of the index labels, so it is best to explicitly use iloc or loc

c.iloc[0]

'Math'
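To make the difference concrete, here is a small sketch that repeats the class-code Series under the hypothetical name `codes` and tries all three kinds of lookup:

```python
import pandas as pd

codes = pd.Series({99: "Math", 100: "Java", 101: "Python"})

by_label = codes.loc[99]     # look up the label 99
by_position = codes.iloc[0]  # look up the first position

try:
    codes[0]                 # treated as a label, and no label 0 exists
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_label, by_position, lookup_failed)
```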

## Working with Data

In [21]:
# Suppose we create a series of student grades and want to find the average

grades = pd.Series([80,90,100])

total = 0

for grade in grades:
    total += grade
print(total/len(grades))

90.0


In [22]:
# This works, but it is slow. 
# Pandas and the underlying numpy libraries support a method of computation called vectorization. Vectorization works with most of the functions in the numpy library, including the sum function. 

total = np.sum(grades)
print(total/len(grades))

90.0


In [23]:
# Both methods work, but which one is actually faster? Jupyter has a magic function that can help with this.

# First let's create a big series with random numbers
numbers = pd.Series(np.random.randint(0,1000,10000))

# We can check the first n items in the series by using the head() function
print(numbers.head())

# then verify that the length is correct
print(len(numbers))

0    960
1    348
2    470
3    307
4    289
dtype: int32
10000


# Magic Functions

Ok, we're confident now that we have a big series. The IPython interpreter has something called
magic functions, which begin with a percentage sign. If we type this sign and then hit the Tab key, we
can see a list of the available magic functions. You could write your own magic functions too,
but that's a little bit outside of the scope of this course.

Here, we're actually going to use what's called a cell magic function. These start with two
percentage signs and wrap the code in the current Jupyter cell. The function we're going to use
is called timeit. This function will run our code a few times to determine, on average, how long
it takes.

Let's run timeit with our original iterative code. You can give timeit the number of loops that
you would like to run with the -n option. If you leave it out, timeit picks a suitable number of
loops itself. I'll ask timeit here to use 100 loops because we're recording this. Note that in
order to use a cell magic function, it has to be the first line in the cell.

In [24]:
%pwd

'c:\\Users\\chrst\\OneDrive\\Python\\Jupyter_Notebook'

In [25]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number

total/len(numbers)

2.65 ms ± 973 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [26]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

141 µs ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
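Outside a notebook, where cell magics are not available, a similar comparison could be sketched with Python's standard-library timeit module. The function names `loop_mean` and `vectorized_mean` are made up for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

values = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean():
    # Iterative approach: add up one element at a time.
    total = 0
    for value in values:
        total += value
    return total / len(values)

def vectorized_mean():
    # Vectorized approach: let NumPy do the summation.
    return np.sum(values) / len(values)

# number=100 plays the same role as "-n 100" in the cell magic.
loop_seconds = timeit.timeit(loop_mean, number=100)
vectorized_seconds = timeit.timeit(vectorized_mean, number=100)
print(f"loop: {loop_seconds / 100:.6f} s per call")
print(f"vectorized: {vectorized_seconds / 100:.6f} s per call")
```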


In [None]:
# Wow! This is a pretty shocking difference in the speed and demonstrates why one should be 
# aware of parallel computing features and start thinking in functional programming terms.
# Put more simply, vectorization is the ability for a computer to execute multiple instructions
# at once, and with high performance chips, especially graphics cards, you can get dramatic
# speedups. Modern graphics cards can run thousands of instructions in parallel.
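The Series object also offers vectorized reductions as methods of its own, so the average can be computed without an explicit loop or a separate NumPy call. A small sketch using the same three grades as before:

```python
import pandas as pd

grades = pd.Series([80, 90, 100])

total = grades.sum()     # vectorized sum
average = grades.mean()  # vectorized mean
highest = grades.max()   # vectorized maximum
print(total, average, highest)
```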

In [32]:
# A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can
# apply an operation to every value in the series, changing the series. For instance, if we
# wanted to increase every random value by 2, we could do so quickly using the += operator
# directly on the Series object.

# Let's look at the head of our series
numbers.head()

0    111
1    225
2    115
3    362
4    544
dtype: int32

In [33]:
# Now let's add 2

numbers += 2
numbers.head()

0    113
1    227
2    117
3    364
4    546
dtype: int32

In [None]:
# The procedural way of doing this would be to iterate through all of the items in the 
# series and increase the values directly. Pandas does support iterating through a series 
# much like a dictionary, allowing you to unpack values easily.

In [30]:
# We can use the items() function, which returns a label and a value. (It was formerly
# called iteritems(), which was removed in pandas 2.0.)
for label, value in numbers.items():
    # now, for the item which is returned, let's set the new value with at[]
    numbers.at[label] = value + 2
# And we can check the result of this computation
numbers.head()

In [None]:
# So the result is the same, though you may notice a warning depending upon the version of
# pandas being used. But if you find yourself iterating pretty much *any time* in pandas,
# you should question whether you're doing things in the best possible way.

In [None]:
# Let's take a look at some speed comparisons. First, let's try ten loops using the iterative approach

In [31]:
%%timeit -n 10
# we'll create a brand new series of items to deal with
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above, using items() since iteritems() was removed in pandas 2.0.
for label, value in s.items():
    s.loc[label] = value + 2

83.5 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
# Now let's try that using broadcasting

In [33]:
%%timeit -n 10
s= pd.Series(np.random.randint(0,1000,1000))

# And we just broadcast with +=

s += 2

The slowest run took 6.80 times longer than the fastest. This could mean that an intermediate result is being cached.
776 µs ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
# Amazing. Not only is it significantly faster, but it's more concise and even easier 
# to read too. The typical mathematical operations you would expect are vectorized, and the 
# NumPy documentation outlines what it takes to create vectorized functions of your own.
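As a brief sketch of those typical operations (the variable names are only for illustration), arithmetic and comparison operators apply to every element at once and return a new Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

doubled = s * 2           # multiply every element by 2
centered = s - s.mean()   # subtract the mean from every element
above_one = s > 1         # compare every element, giving booleans

print(doubled.tolist(), centered.tolist(), above_one.tolist())
```

Unlike the in-place += used earlier, these expressions leave `s` itself unchanged.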

In [None]:
# One last note on using the indexing operators to access series data. The .loc attribute lets 
# you not only modify data in place, but also add new data as well. If the value you pass in as 
# the index doesn't exist, then a new entry is added. And keep in mind, indices can have mixed types. 
# While it's important to be aware of the typing going on underneath, Pandas will automatically 
# change the underlying NumPy types as appropriate.

In [34]:
# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])

# We could add some new value, maybe a university course
s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64

In [None]:
# We see that mixed types for data values or index labels are no problem for Pandas. Since 
# "History" is not in the original list of indices, s.loc['History'] essentially creates a 
# new element in the series, with the index named "History", and the value of 102
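One way to watch the typing change underneath is to compare the dtype of the index before and after the new label is added. This is a small sketch that repeats the example above; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
index_dtype_before = s.index.dtype  # integer labels only

s.loc["History"] = 102              # the label does not exist yet, so it is added
index_dtype_after = s.index.dtype   # the index now holds integers and a string

print(index_dtype_before, index_dtype_after, len(s))
```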

In [35]:
# Up until now I've shown only examples of a series where the index values were unique. I want 
# to end this lecture by showing an example where index values are not unique, and this makes 
# pandas Series a little different conceptually than, for instance, a relational database.

# Let's create a Series with students and the courses which they have taken
students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [36]:
# Now let's create a Series just for some new student Kelly, which lists all of the courses
# she has taken. We'll set the index to Kelly, and the data to be the names of courses.
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [37]:
# Finally, we can combine the data in this new Series with the first using pd.concat().
# (Older code used the Series.append() method, which was removed in pandas 2.0.)
all_students_classes = pd.concat([students_classes, kelly_classes])

# This creates a series which has our original people in it as well as all of Kelly's courses
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [38]:
# There are a couple of important considerations when combining Series this way. First, pandas will
# take the series and try to infer the best data types to use. In this example, everything is a string,
# so there are no problems here. Second, the operation doesn't actually change the underlying Series
# objects; it instead returns a new series which is made up of the two joined together. This is
# a common pattern in pandas, returning a new object by default instead of modifying in place, and
# one you should come to expect. By printing the original series we can see that it hasn't
# changed.
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [39]:
# Finally, we see that when we query the appended series for Kelly, we don't get a single value, 
# but a series itself. 
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
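A short sketch of how this plays out, using a smaller made-up Series named `classes`: the index reports whether its labels are unique, and the type of the result from .loc depends on how many entries match the label.

```python
import pandas as pd

classes = pd.Series(
    ["Physics", "Philosophy", "Arts"],
    index=["Alice", "Kelly", "Kelly"],
)

unique_labels = classes.index.is_unique  # "Kelly" appears twice
alice = classes.loc["Alice"]             # one match: a single value
kelly = classes.loc["Kelly"]             # several matches: a Series

print(unique_labels, alice, kelly.tolist())
```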

In this lecture, we focused on one of the primary data types of the pandas library, the Series. You learned that the Series is an indexed data structure, how to query it with .loc and .iloc, how to combine two Series objects into a new one, and the importance of vectorization.

There are many more methods associated with the Series object that we haven't talked about. But with these basics down, we'll move on to talking about pandas' two-dimensional data structure, the DataFrame. The DataFrame is very similar to the Series object, but includes multiple columns of data, and is the structure that you'll spend the majority of your time working with when cleaning and aggregating data.