In [None]:
import pandas as pd
import numpy as np

In [None]:
students = ['irene', 'violeta', 'miguel']

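Now we pass the previous Python list to the `Series` constructor in pandas. The result is a `Series` object.

A Series is an indexed data structure: every value is paired with a label, and by default the labels are the integers 0, 1, 2, and so on.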

In [None]:
pd.Series(students)

0      irene
1    violeta
2     miguel
dtype: object

Important: `NaN` is not equal to `None`

In [None]:
np.nan == None

False

To test for `NaN`, use NumPy's **isnan()** instead of an equality check

In [None]:
np.isnan(np.nan)

True
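
The following is a minimal sketch of how this plays out inside a Series. It assumes a numeric list containing `None`; the names `numeric` and `missing_mask` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# None inside a list of numbers is stored as NaN, so the dtype becomes float64
numeric = pd.Series([1, 2, None])
print(numeric.dtype)          # float64

# NaN never compares equal, not even to itself
print(np.nan == np.nan)       # False

# pd.isna() detects both None and NaN and works element-wise on a Series
missing_mask = pd.isna(numeric)
print(missing_mask.tolist())  # [False, False, True]
```

`pd.isna` is usually the safer choice over `np.isnan` when the data may hold strings or `None`, because `np.isnan` only accepts numeric input.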

A Series can be created directly from a dictionary. Instead of integers, the index will be the dictionary keys.

---



In [None]:
student_scores = {
    'irene': 'maths',
    'violeta': 'english',
    'miguel': 'physics'
}

In [None]:
s = pd.Series(student_scores)

We can get the index object using the index attribute

In [None]:
s.index

Index(['irene', 'violeta', 'miguel'], dtype='object')

You can also pass a list of tuples:



In [None]:
old_students = [("pedro", "pepe"), ("merche", "lola"), ("puri", "encarna")]
s = pd.Series(old_students)
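
As a quick check of what that produces, here is a small sketch (the name `pairs` is illustrative): each tuple is stored as a single element, so the Series has one entry per tuple and dtype `object`.

```python
import pandas as pd

old_students = [("pedro", "pepe"), ("merche", "lola"), ("puri", "encarna")]
pairs = pd.Series(old_students)

print(len(pairs))      # 3, one element per tuple
print(pairs.dtype)     # object
print(pairs.iloc[0])   # ('pedro', 'pepe')
```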

You can also supply the index labels separately with the `index` parameter:

In [None]:
s = pd.Series(['Ana', 'Pepi', 'Marti'], index=['index1', 'index2', 'index3'])
s

index1      Ana
index2     Pepi
index3    Marti
dtype: object

In [None]:
my_dic = {
    'ana': 'B',
    'Pepi': 'A',
    'Marti': 'B'
    }
    
my_dic

{'Marti': 'B', 'Pepi': 'A', 'ana': 'B'}

In [None]:
s = pd.Series(my_dic, index=['ana', 'Pepi', 'Jose'])
s

ana       B
Pepi      A
Jose    NaN
dtype: object

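When we pass both a dictionary and an explicit `index`, only the labels listed in `index` are kept. A dictionary key that is not listed (here `'Marti'`) is dropped, and a listed label that is missing from the dictionary (here `'Jose'`) gets the value `NaN`.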
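
A short sketch of both effects, using the same data as above (the names `grades_by_name`, `selected` and `present` are illustrative):

```python
import pandas as pd

grades_by_name = {'ana': 'B', 'Pepi': 'A', 'Marti': 'B'}
selected = pd.Series(grades_by_name, index=['ana', 'Pepi', 'Jose'])

# 'Marti' is not in the requested index, so it does not appear in the Series
print('Marti' in selected.index)   # False

# 'Jose' has no entry in the dictionary, so its value is NaN
print(selected.isna().tolist())    # [False, False, True]

# dropna() returns a new Series without the missing entries
present = selected.dropna()
print(list(present.index))         # ['ana', 'Pepi']
```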

# Querying a series


In [None]:
import pandas as pd
import numpy as np

Let's see how to make queries from a Series object

In [None]:
my_dict = {
    'Sara': 'Maths',
    'Laura': 'Physics',
    'Tom': 'English',
    'Mark': 'Literature',
}

In [None]:
s = pd.Series(my_dict)
s

Sara          Maths
Laura       Physics
Tom         English
Mark     Literature
dtype: object

## loc & iloc

`loc` and `iloc` are attributes, not methods, so **don't use parentheses**! Use the index operator `[]` instead.
- **loc** is label-based: you select values by their index label (by row and column labels in a DataFrame).
- **iloc** is integer position-based: you select values by their 0-based integer position.

In [None]:
s.iloc[2] # You type the integer position (0-based)

'English'

In [None]:
s.loc['Laura'] # You type the label itself

'Physics'

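In older versions of pandas the plain index operator also accepts a position when the labels are not integers. This fallback has been deprecated since pandas 2.1 (it raises a `FutureWarning`), so prefer `iloc`:

In [None]:
s[2] # Positional fallback (deprecated); prefer s.iloc[2]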

'English'

The index operator also accepts labels, in which case it behaves like `loc`:

In [None]:
s['Laura'] # This is equivalent to s.loc["Laura"]

'Physics'

⚠️ Be careful: if your labels are numbers (for example, numeric subject codes), the plain index operator treats an integer as a label, not as a position. It is therefore good practice to always state `loc` or `iloc` explicitly.
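
To make the warning concrete, here is a minimal sketch with made-up subject codes as integer labels (the Series `subjects` and its codes are hypothetical):

```python
import pandas as pd

# Hypothetical subject codes used as integer labels
subjects = pd.Series(['Maths', 'Physics', 'English'], index=[101, 102, 0])

print(subjects.loc[0])    # English: the value whose label is 0
print(subjects.iloc[0])   # Maths: the value at the first position
print(subjects[0])        # English: with integer labels, [] looks up the label
```

Because `subjects[0]` and `subjects.iloc[0]` return different values here, spelling out `loc` or `iloc` removes the ambiguity.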

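## Vectorization
Vectorization is a way of computing in which an operation is applied to a whole array at once, instead of looping over its elements in Python.

Task: get the average of some grades.
A pure Python loop is slow on a large Series, so it is better to use vectorized functions such as `np.sum`.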

In [18]:
grades = pd.Series([9, 7, 8.5, 6.5])
grades

0    9.0
1    7.0
2    8.5
3    6.5
dtype: float64

In [20]:
# Then, using numpy sum
print(np.sum(grades)/len(grades))

7.75


In [27]:
# In contrast, this is the plain Python loop:
total = 0
for grade in grades:
    total += grade
print(total / len(grades))

7.75
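
As a side note, pandas also ships vectorized reductions as Series methods, so the same average can be written without calling NumPy directly. A small sketch using the same grades:

```python
import numpy as np
import pandas as pd

grades = pd.Series([9, 7, 8.5, 6.5])

# Three equivalent vectorized ways to get the average
print(grades.mean())                 # 7.75
print(grades.sum() / len(grades))    # 7.75
print(np.sum(grades) / len(grades))  # 7.75
```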


In [49]:
numbers = pd.Series(np.random.randint(0, 1000, 10000)) # Create 10000 random integers from 0 to 999 (the upper bound is exclusive)
# Verify the length
len(numbers)

10000

Let's look at the first 5 numbers using `head`

In [38]:
numbers.head()

0    416
1    996
2    177
3    912
4    899
dtype: int64

## Magic functions
IPython (and therefore Jupyter) has magic functions that begin with a **%** symbol; cell magics, which apply to a whole cell, begin with **%%**.
We are going to use the magic function **timeit** to **compare the time performance** of computing the average with a pure Python loop against computing it with `np.sum`.

With `-n 100`, timeit runs the cell body 100 times per measurement and reports the best of several measurements.

In [31]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number

total/len(numbers)

100 loops, best of 5: 1.45 ms per loop


Now let's check how long it takes with vectorization

In [32]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

100 loops, best of 5: 79.1 µs per loop
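
Outside a notebook the `%%timeit` magic is not available. The sketch below shows one way to run a similar comparison with the standard-library `timeit` module. The helper names `loop_mean` and `vectorized_mean` and the choice of 20 repetitions are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(series):
    # Plain Python loop: adds the values one at a time
    total = 0
    for number in series:
        total += number
    return total / len(series)

def vectorized_mean(series):
    # Vectorized: the sum is computed inside NumPy
    return np.sum(series) / len(series)

repetitions = 20
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=repetitions)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=repetitions)

print(f"loop:       {loop_seconds / repetitions:.6f} s per call")
print(f"vectorized: {vector_seconds / repetitions:.6f} s per call")
```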


## Broadcasting
With broadcasting you can apply an operation to every value of the Series at once. For example, increasing every random number by 2:

In [39]:
numbers += 2
numbers.head()

0    418
1    998
2    179
3    914
4    901
dtype: int64
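
Broadcasting is not limited to in-place addition. The sketch below, with illustrative names `scaled` and `passed`, applies a multiplication and a comparison to every element and leaves the original Series untouched:

```python
import pandas as pd

grades = pd.Series([9, 7, 8.5, 6.5])

# Arithmetic is applied to every element and returns a new Series
scaled = grades * 10
print(scaled.tolist())   # [90.0, 70.0, 85.0, 65.0]

# Comparisons broadcast too, producing a boolean Series
passed = grades >= 7
print(passed.tolist())   # [True, True, True, False]

# The original Series is unchanged because nothing was assigned back
print(grades.tolist())   # [9.0, 7.0, 8.5, 6.5]
```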

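`items` returns each label together with its value (older pandas versions called it `iteritems`, which was removed in pandas 2.0). Iterate only when it is really necessary, because it is much slower than broadcasting.


In [45]:
%%timeit -n 10
for label, value in numbers.items():  # iteritems() in pandas versions before 2.0
    # Write the incremented value back, one label at a time
    numbers.loc[label] = value + 2

numbers.head()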



10 loops, best of 5: 574 ms per loop


In [52]:
%%timeit -n 10
numbers = pd.Series(np.random.randint(0, 1000, 10000))
numbers+=2

10 loops, best of 5: 413 µs per loop


If you assign a value to a label that doesn't exist in the index, pandas adds a new entry with that label

In [55]:
s = pd.Series(['assd', 'fddf','ree'])
s

0    assd
1    fddf
2     ree
dtype: object

In [57]:
s['ff'] = 'lol'
s

0     assd
1     fddf
2      ree
ff     lol
dtype: object
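
The same assignment can be written with `loc`, which makes it explicit that the key is a label. A minimal sketch, reusing the Series above: a new label is appended, while an existing label is overwritten.

```python
import pandas as pd

s = pd.Series(['assd', 'fddf', 'ree'])

s.loc['ff'] = 'lol'       # new label: an entry is appended
s.loc[0] = 'changed'      # existing label: the value is overwritten

print(len(s))             # 4
print(s.loc[0])           # changed
print(s.index.tolist())   # [0, 1, 2, 'ff']
```

Note that the index now mixes integers and strings, which is allowed but makes later lookups easier to get wrong.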