In [1]:
import pandas as pd
import numpy as np

In [2]:
students = ['Alice','Jack','Molly']

pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [3]:
numbers = [1,2,3]

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [4]:
students.append(None)

pd.Series(students)

0    Alice
1     Jack
2    Molly
3     None
dtype: object

In [5]:
numbers.append(None)

pd.Series(numbers)


0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

Pandas sets the dtype of this series to float64 instead of int64 because it represents the missing value as NaN, which is a floating-point number.

Data scientists often use None and NaN interchangeably. However, they are not equivalent to one another.

In [6]:
np.nan == None

False

Even when np.nan is compared with itself, the result is False, so an equality test cannot detect it; np.isnan is the reliable check.

In [7]:
np.nan == np.nan

False

In [8]:
np.isnan(np.nan)

True

Take note that although NaN and None might both be used to represent the same thing (a missing value), they are interpreted differently by the system.
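As a small sketch of how this plays out in practice, the snippet below uses `isna()`, which flags a missing entry whether it started out as None or as NaN. The series names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one missing entry in a string series and one in a numeric series
names = pd.Series(['Alice', 'Jack', None])
scores = pd.Series([1, 2, None])

# isna() treats None and NaN alike as "missing"
names_missing = names.isna().tolist()
scores_missing = scores.isna().tolist()

print(names_missing)
print(scores_missing)
print(scores.dtype)  # the numeric series was upcast to hold NaN
```

This avoids relying on `==`, which cannot detect NaN.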

In [9]:
student_subjects = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly':'English', 'Sam': 'History'}

s = pd.Series(student_subjects)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

When you pass an explicit index, pandas overrides the automatic index creation and uses only the index values you provided. Keys in the dictionary that are not part of the index are ignored, and index values with no matching key are given NaN.

For example:

In [10]:
s = pd.Series(student_subjects, index = ['Alice','Molly','Jamesa'])
s

Alice     Physics
Molly     English
Jamesa        NaN
dtype: object
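A further sketch of the same behaviour, using a hypothetical dictionary and index: the resulting series follows the order of the supplied index, drops the key that was not requested, and marks the unmatched label as missing.

```python
import pandas as pd

# Hypothetical data for illustration
subjects = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}

# 'Jack' is not requested, and 'Zoe' has no entry in the dictionary
picked = pd.Series(subjects, index=['Molly', 'Alice', 'Zoe'])

index_order = list(picked.index)        # follows the supplied index, not the dict
jack_kept = 'Jack' in picked.index      # the unrequested key is dropped
zoe_missing = pd.isna(picked['Zoe'])    # the unmatched label holds a missing value

print(picked)
```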

### 2. Querying from series

In this section, we will observe how to query and merge Series objects, and the importance of thinking about parallelisation when engaging in data science programming.

A Series is an **indexed** data structure that can be iterated over much like a dictionary.

#### 2.1 loc and iloc attribute

loc: Used to locate rows (and/or columns) with **particular labels**.  
iloc: Used to locate rows (and/or columns) at **integer** locations.  

Both loc and iloc take square brackets rather than parentheses, because they are indexing operators, not methods.

In [23]:
o = pd.Series(student_subjects)

In [26]:
# Example of loc
print(o.loc['Molly'])
print(o['Molly'])

English
English


In [27]:
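# Example of iloc
print(o.iloc[3])
# o[3] falls back to a positional lookup because the index holds strings;
# pandas 2.1+ deprecates this fallback, so o.iloc[3] is the safer form
print(o[3])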

History
History


In [14]:
class_code = {99: 'Physics',
                100: 'Chemistry',
                101: 'English',
                102: 'History'}

t = pd.Series(class_code)

In [17]:
# A KeyError is raised because no label in the index of t is equal to 0
# Instead, iloc has to be used explicitly if we want the first item

t[0]

KeyError: 0

In [20]:
t.iloc[0]

'Physics'
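The difference between label and position matters most when the labels are themselves integers. The following is a minimal sketch with a hypothetical series, keeping the two lookups explicit:

```python
import pandas as pd

# Hypothetical series whose labels are integers that do not start at 0
codes = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English'})

by_label = codes.loc[100]     # looks up the label 100
by_position = codes.iloc[0]   # looks up the first row, whatever its label

# There is no label 0, so a label-based lookup fails
try:
    codes.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```

Writing `.loc` or `.iloc` every time removes any doubt about which kind of lookup is intended.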

#### 2.2 Vectorisation

Vectorisation means applying one operation to a whole array of values at once instead of looping over them in Python. The work runs in optimised, compiled code that can take advantage of the processor's parallel (SIMD) instructions.

In [39]:
grades = pd.Series([90,80,70,60])


# Using a loop to sum all of the grades in the series, then taking the average
total = 0 
for grade in grades:
    total += grade

print(total/len(grades))

75.0


In [29]:
# Much more efficient method using numpy

total = np.sum(grades)
print(total/len(grades))

75.0


In [32]:
# Create a big series of random numbers

numbers = pd.Series(np.random.randint(0,1000,10000))

# The first 5 numbers in the series
numbers.head()

0    910
1    829
2    476
3    942
4    456
dtype: int32

In [33]:
numbers.shape

(10000,)

In [37]:
%%timeit -n 100

total = 0 
for number in numbers:
    total +=number

total/len(numbers)

716 µs ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [38]:
%%timeit -n 100

total = np.sum(numbers)
total/len(numbers)

41.4 µs ± 9.28 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Comparing the two snippets, we observe that the vectorised version is about 17 times faster than the loop (41.4 µs versus 716 µs).

This demonstrates the performance benefit of vectorisation.
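Outside a notebook, where `%%timeit` is not available, a similar comparison can be sketched with the standard `timeit` module. The function names below are hypothetical, and the timings will vary by machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded generator so that the data is reproducible
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(0, 1000, 10_000))


def loop_mean(series):
    """Average computed by iterating over every value in Python."""
    total = 0
    for value in series:
        total += value
    return total / len(series)


def vectorised_mean(series):
    """Average computed with a single vectorised NumPy call."""
    return np.sum(series) / len(series)


# Time 20 runs of each approach
loop_time = timeit.timeit(lambda: loop_mean(values), number=20)
vec_time = timeit.timeit(lambda: vectorised_mean(values), number=20)

print(f"loop: {loop_time:.4f} s, vectorised: {vec_time:.4f} s")
```

Both functions return the same average; only the time taken differs.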

#### 2.3. Broadcasting

Broadcasting applies an operation to every value in a series at once, changing the series.

In [41]:
# Take a look at the series
numbers.head()

0    910
1    829
2    476
3    942
4    456
dtype: int32

In [42]:
numbers += 5
numbers.head()

0    915
1    834
2    481
3    947
4    461
dtype: int32

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly. Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily.

In [52]:
numbers.at[0]

915

In [56]:
# We can use the items() function, which returns a label and value
# (it was called iteritems() before pandas 2.0, where that name was removed)

for label, value in numbers.items():
    # now, for the item which is returned, let's set the new value with .at
    numbers.at[label] = value+2

numbers.head()

0    917
1    836
2    483
3    949
4    463
dtype: int32

Comparing the efficiency of using a loop and broadcasting method

In [57]:
%%timeit -n 10

# Create a blank new series of items to deal with
s = pd.Series(np.random.randint(0,1000,1000))

# And we'll just rewrite our loop from above

for label, value in s.items():  # iteritems() before pandas 2.0
    # now, for the item which is returned, let's set the new value with .at
    s.at[label] = value+2

14.1 ms ± 804 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now using broadcasting method

In [58]:
%%timeit -n 10

s = pd.Series(np.random.randint(0,1000,1000))

s += 2


130 µs ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


We notice that broadcasting is about 100 times faster than the loop (130 µs versus 14.1 ms).
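Broadcasting is not limited to `+=`. As a short sketch with made-up grades, the same idea applies to other arithmetic and to comparisons, and the non-augmented form returns a new series rather than modifying the original:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Returns a new series; grades itself is left unchanged
curved = grades + 5

# Comparisons broadcast too, producing a boolean series
passed = grades >= 70

original_after = grades.tolist()

# The augmented form updates the series that the name refers to
grades += 5

print(curved.tolist(), passed.tolist(), grades.tolist())
```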

The .loc attribute lets you not only modify data in place, but also add new data. If the value you pass in as the index doesn't exist, a new entry is added. Keep in mind that indices can have mixed types. While it is important to be aware of the typing going on underneath, pandas will automatically change the underlying NumPy types as appropriate.

In [59]:
# Here is an example

s = pd.Series([1,2,3])

s.loc['History'] = 102

s

0            1
1            2
2            3
History    102
dtype: int64
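To make the two behaviours of `.loc` assignment explicit, here is a minimal sketch: assigning to an existing label overwrites its value, while assigning to an unknown label adds a new entry.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[0] = 10           # label 0 exists, so its value is overwritten
s.loc['History'] = 102  # label 'History' does not exist, so an entry is added

values = s.tolist()
labels = list(s.index)  # the index now mixes integers and a string

print(s)
```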

#### 2.4. Dealing with non unique indexes

In [60]:
student_classes = pd.Series({'Alice': 'Physics',
                            'Jack': 'Chemistry',
                            'Molly' : 'English',
                            'Sam': 'History'})

student_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [61]:
# Create a series for one student named Kelly, which lists all of the courses she has taken

kelly_classes = pd.Series(['Philosophy','Art','Math'], index = ['Kelly','Kelly','Kelly'])
kelly_classes

Kelly    Philosophy
Kelly           Art
Kelly          Math
dtype: object

In [63]:
# pd.concat replaces Series.append, which was removed in pandas 2.0
all_student_classes = pd.concat([student_classes, kelly_classes])

all_student_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly           Art
Kelly          Math
dtype: object

There are a couple of important considerations when using append (or pd.concat, which replaces it from pandas 2.0 onwards).

First, pandas will take the series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here.

Second, the append method doesn't change the underlying Series objects; it instead returns a new series made up of the two series appended together.
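The sketch below illustrates both points and shows what a query against a repeated label returns. It assumes `pd.concat` as the combining function and uses shortened, hypothetical data.

```python
import pandas as pd

student_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly_classes = pd.Series(['Philosophy', 'Art', 'Math'], index=['Kelly'] * 3)

# Combining returns a new series; the two inputs are left untouched
combined = pd.concat([student_classes, kelly_classes])

# A label that appears several times returns a Series of all matches
kelly_rows = combined.loc['Kelly']

# A label that appears once returns a single value
alice_row = combined.loc['Alice']

print(kelly_rows)
print(alice_row)
```

Because the return type depends on whether the label is unique, code that queries such a series may need to handle both a single value and a Series.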

### 3. DataFrame

In [64]:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)
obj3 = pd.isnull(obj2)

In [80]:
import pandas as pd
d = {'1': 'Alice','2': 'Bob','3': 'Rita','4': 'Molly','5': 'Ryan'}
d1 = {'Alice': 1,'Bob':2, 'Rita':3}
S = pd.Series(d1)

S.loc[2]

KeyError: 2

In [68]:
obj2['California']

nan

In [67]:
import math
math.isnan(obj2['California'])

True
