## **Introduction to Pandas & Series Data**

### **a. Pandas**

This week we're going to deepen our investigation into how Python can be used to manipulate, clean, and query data by looking at the Pandas data toolkit. Pandas was created by Wes McKinney in 2008, and is an open source project under a very permissive license. As an open source project it's got a strong community, with over one hundred software developers all committing code to help make it better. Before pandas existed we had only a hodgepodge of tools to use, such as NumPy, the Python core libraries, and some Python statistical tools. But pandas has quickly become the de facto library for representing relational data for data scientists.

I want to take a moment here to introduce the question answering site Stack Overflow. Stack Overflow is used broadly within the software development community to post questions about programming, programming languages, and programming toolkits. What's special about Stack Overflow is that it's heavily curated by the community. And the Pandas community, in particular, uses it as their number one resource for helping new members. It's quite possible, if you post a question to Stack Overflow and tag it as being Pandas and Python related, that a core Pandas developer will actually respond to your question. In addition to posting questions, Stack Overflow is a great place to go to see what issues people are having and how they can be solved. You can learn a lot from browsing Stack Overflow, and with pandas, this is where the developer community is.

A second resource you might want to consider is books. In 2012 Wes McKinney wrote the definitive Pandas reference book, Python for Data Analysis, published by O'Reilly, and it's recently been updated to a second edition. I consider this the go-to book for understanding how Pandas works. I also appreciate the briefer book "Learning the Pandas Library" by Matt Harrison. It's not a comprehensive book on data analysis and statistics. But if you just want to learn the basics of Pandas and want to do so quickly, I think it's a well laid out volume and it can be had for a good price.

The field of data science is rapidly changing. There are new toolkits and methods being created every day. It can be tough to stay on top of it all. Marco Rodriguez and Tim Golden maintain a wonderful blog aggregator site called Planet Python. You can visit the webpage at planetpython.org, subscribe with an RSS reader, or get the latest articles from the @PlanetPython Twitter feed. There are lots of regular Python data science contributors, and I highly recommend it if you follow RSS feeds.

Here's my last plug on how to deepen your learning. Kyle Polich runs an excellent podcast called Data Skeptic. It isn't Python based per se, but it's well produced and it has a wonderful mixture of interviews with experts in the field as well as short educational lessons. Much of the work he describes is specific to machine learning methods. But if that's something you are planning to explore in the specialization this course belongs to, I would really encourage you to subscribe to his podcast.

That's it for a little bit of an introduction to this week of the course. Next we're going to dive right into the pandas library and talk about the Series data structure.

### **b. The Series Data Structure**

##### **Notes**

In this lecture we're going to explore the pandas Series structure. By the end of this lecture you should be familiar with how to store and manipulate single dimensional indexed data in the Series object.

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the second is your actual data. It's important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data. We'll talk about that later on in the course.
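To make the two-column picture concrete, here is a minimal sketch. The student names, the scores, and the name `score` are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data: three students and their scores.
scores = pd.Series([90, 80, 70],
                   index=["Alice", "Jack", "Molly"],
                   name="score")

# The index column behaves a lot like the keys of a dictionary.
labels = scores.index.tolist()

# The data column keeps its values in order, like a list.
values = scores.tolist()

# The data column also carries a label of its own.
column_label = scores.name

print(labels, values, column_label)

# Values can be retrieved by label, as with a dictionary.
print(scores["Jack"])
```

The `.name` label has no counterpart in a plain dictionary, which is why it becomes useful once several Series are combined into one table.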

##### **Codes**

In [56]:
import pandas as pd

In [57]:
#Create a list
students=['Alice','Jack','Molly']

#Call the Series function
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [58]:
#Create a list of numbers
number=[1,2,3]

#Call the Series function
pd.Series(number)

0    1
1    2
2    3
dtype: int64

In [59]:
#Create a list with None
students=['Alice','Jack',None]

#Call the Series function
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [60]:
#Create a number list with None
number=[1,2,None]

#Call the Series function
pd.Series(number)

0    1.0
1    2.0
2    NaN
dtype: float64

NaN has a meaning similar to None, but it is a numeric (floating-point) value and pandas treats it differently for efficiency reasons. This is why the integers in the list above were converted to floats.

In [61]:
import numpy as np

np.nan==None

False

In [62]:
np.nan==np.nan

False

In [63]:
np.isnan(np.nan)

True
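Because NaN never compares equal to anything, including itself, an equality test cannot be used to find missing values. As a small sketch of one alternative, the `isna()` method on a Series flags missing entries whether they were entered as None or as NaN. The example data below is hypothetical.

```python
import numpy as np
import pandas as pd

# None in a numeric list is stored as NaN, so the dtype becomes float64.
numbers = pd.Series([1, 2, None])

# None in a list of strings is also treated as a missing value.
names = pd.Series(["Alice", "Jack", None])

# isna() returns a boolean Series marking the missing entries.
missing_numbers = numbers.isna()
missing_names = names.isna()

print(missing_numbers.tolist())
print(missing_names.tolist())

# Summing the boolean flags counts the missing values.
print(missing_numbers.sum())
```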

In [64]:
#Create data of students and the classes
students_classes={'Alice':'Physics',
                'Jack':'Chemistry',
                'Molly':'English'}
s=pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [65]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [66]:
s=pd.Series(['Physics','Chemistry','English'],index=['Alice','Jack','Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [67]:
students_classes={'Alice':'Physics',
                  'Jack':'Chemistry',
                  'Molly':'English'}
#Pass an index containing a name ('Sam') that is not a key in the dictionary above
s=pd.Series(students_classes,index=['Alice','Sam','Molly'])
s

Alice    Physics
Sam          NaN
Molly    English
dtype: object
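When both a dictionary and an explicit index are passed, pandas keeps only the labels listed in the index. The sketch below repeats the same data to check the two consequences: a requested label with no matching key gets a missing value, and a key that was not requested is left out.

```python
import pandas as pd

students_classes = {"Alice": "Physics",
                    "Jack": "Chemistry",
                    "Molly": "English"}

# Only the labels listed in index end up in the Series.
s = pd.Series(students_classes, index=["Alice", "Sam", "Molly"])

# 'Sam' is not a key in the dictionary, so the value is missing.
sam_is_missing = pd.isna(s.loc["Sam"])

# 'Jack' is a key in the dictionary but was not requested, so it is dropped.
jack_is_present = "Jack" in s.index

print(sam_is_missing, jack_is_present)
```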

### **c. Querying a Series**

In [68]:
students_classes={'Alice':'Physics',
                  'Jack':'Chemistry',
                  'Molly':'English',
                  'Sam':'History'}

s=pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [69]:
#Use the iloc attribute to query by position (position 3 is the fourth item)
s.iloc[3]

'History'

In [70]:
s.loc['Molly']

'English'

In [71]:
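#An integer key falls back to a position here because the index labels are strings;
#recent pandas versions deprecate this fallback, so prefer s.iloc[3]
s[3]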

'History'

In [72]:
s['Molly']

'English'

In [73]:
#A Series whose index labels are integers
class_code={99: 'Physics',
            100: 'Chemistry',
            101: 'English',
            102: 'History'}
s=pd.Series(class_code)

In [74]:
#s[0] would raise a KeyError, because 0 is not one of the index labels

In [75]:
s.iloc[0]

'Physics'

In [76]:
s.iloc[3]

'History'
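With integer labels, the difference between querying by label and querying by position matters most. Here is a short sketch using the same class codes: `loc` looks up the label, `iloc` looks up the position, and a label that does not exist raises a KeyError.

```python
import pandas as pd

class_code = pd.Series({99: "Physics",
                        100: "Chemistry",
                        101: "English",
                        102: "History"})

# loc treats 99 as an index label.
by_label = class_code.loc[99]

# iloc treats 0 as a position, counting from the start.
by_position = class_code.iloc[0]

# 0 is a valid position but not a valid label, so loc raises a KeyError.
try:
    class_code.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_label, by_position, label_zero_exists)
```

Being explicit with `loc` and `iloc` avoids having to guess which of the two interpretations pandas will pick for a bare `s[...]`.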

In [77]:
grades=pd.Series([90,80,70,60])

total=0
for grade in grades:
  total+=grade
print(total/len(grades))

75.0


In [78]:
#Calculation using NumPy
total=np.sum(grades)
print(total/len(grades))

75.0


In [79]:
#Create big series of random numbers
numbers=pd.Series(np.random.randint(0,1000,10000))
numbers.head()

0    831
1    413
2    665
3    825
4    642
dtype: int64

In [80]:
numbers.shape

(10000,)

In [81]:
numbers.size

10000

In [82]:
len(numbers)

10000

In [83]:
%%timeit -n 100
#Time the for-loop version of the summation (the %%timeit magic must be the first line of the cell)
total=0
for number in numbers:
  total+=number

total/len(numbers)

100 loops, best of 3: 1.1 ms per loop


In [84]:
%%timeit -n 100
total=np.sum(numbers)
total/len(numbers)

100 loops, best of 3: 60.6 µs per loop
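The `%%timeit` magic only works inside a notebook. As a rough sketch of the same comparison in a plain Python script, the standard library `timeit` module can be used instead. The helper names `loop_mean` and `vectorized_mean` are made up for this example, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean(series):
    """Average computed with an explicit Python for loop."""
    total = 0
    for value in series:
        total += value
    return total / len(series)


def vectorized_mean(series):
    """Average computed with NumPy's vectorized sum."""
    return np.sum(series) / len(series)


# Run each version 20 times and report the total elapsed seconds.
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=20)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=20)

print(f"for loop:   {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
```

Both functions return the same average. Only the time taken to compute it differs.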


In [85]:
numbers.head()

0    831
1    413
2    665
3    825
4    642
dtype: int64

In [86]:
#Increase every value by 2
numbers+=2
numbers.head()

0    833
1    415
2    667
3    827
4    644
dtype: int64

In [87]:
%%timeit -n 10
#Create a new series of random numbers to work with
s=pd.Series(np.random.randint(0,1000,1000))

#Series.iteritems() was removed in pandas 2.0; items() is the current equivalent
for label, value in s.items():
    s.loc[label]=value+2

10 loops, best of 3: 49.6 ms per loop


In [88]:
%%timeit -n 10
# We need to recreate a series
s=pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

10 loops, best of 3: 269 µs per loop
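Broadcasting applies an operation to every value in the Series at once. A minimal sketch with hypothetical numbers: an expression such as `original + 2` builds a new Series and leaves the original alone, while `+=` stores the result back under the original name.

```python
import pandas as pd

original = pd.Series([1, 2, 3])

# Each of these creates a new Series; original is untouched.
shifted = original + 2
doubled = original * 2

print(original.tolist())

# The augmented assignment updates original itself.
original += 2

print(original.tolist())
```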


In [89]:
#A Series of a few numbers
s=pd.Series([1,2,3])

s.loc['History']=102
s

0            1
1            2
2            3
History    102
dtype: int64

In [92]:
s.loc['A']=202
s

0            1
1            2
2            3
History    102
A          202
dtype: int64
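Assigning to a label that does not exist yet adds a new entry, and the label does not have to be of the same type as the existing ones. The sketch below checks what happens to the index in that case: it switches to the generic object dtype so that it can hold integers and strings side by side.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# The default index holds the integer positions 0, 1, 2.
index_dtype_before = str(s.index.dtype)

# Assigning to a new label appends an entry to the Series.
s.loc["History"] = 102

# The index now mixes integers and a string, so its dtype is object.
index_dtype_after = str(s.index.dtype)

print(index_dtype_before, index_dtype_after)

# Both kinds of label can still be used with loc.
print(s.loc[0], s.loc["History"])
```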

In [93]:
students_classes=pd.Series({'Alice':'Physics',
                            'Jack':'Chemistry',
                            'Molly':'English',
                            'Sam':'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [94]:
kelly_classes=pd.Series(['Philosophy','Arts','Math'],index=['Kelly','Kelly','Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [95]:
#Series.append() was removed in pandas 2.0; pd.concat() is the current equivalent
all_students_classes=pd.concat([students_classes,kelly_classes])

In [96]:
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [97]:
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [98]:
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
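Index labels do not have to be unique, and that changes what a query returns. The sketch below uses a shortened, hypothetical version of the data and combines the two Series with `pd.concat`, which works in current pandas releases. A unique label gives back a single value, while a repeated label gives back a Series.

```python
import pandas as pd

students_classes = pd.Series({"Alice": "Physics",
                              "Jack": "Chemistry"})
kelly_classes = pd.Series(["Philosophy", "Arts"],
                          index=["Kelly", "Kelly"])

# pd.concat returns a new Series; the two inputs are not modified.
combined = pd.concat([students_classes, kelly_classes])

# A label that appears once returns a single value.
single = combined.loc["Alice"]

# A label that appears several times returns a Series of all matches.
several = combined.loc["Kelly"]

print(type(single).__name__, type(several).__name__)

# is_unique reports whether any label is repeated.
print(combined.index.is_unique)
```

Checking `index.is_unique` before querying is one way to know in advance which of the two return types to expect.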