
---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<img align="right" width="400" height="400"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/pandas-apps.png"  >

## _Overview of Pandas Series Data Structure.ipynb_

## Learning agenda of this notebook

1. Overview of Python Pandas library and its data structures
2. Creating a Series
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
3. Attributes of a Pandas Series
4. Understanding Index in a Series and its usage
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment

In [1]:
import pandas as pd
pd.__version__, pd.__path__

('2.0.3', ['C:\\Users\\FashN\\anaconda3\\Lib\\site-packages\\pandas'])

<img align="right" width="500" height="600"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/series-anatomy.png"  >

## 2. Creating a Series
> **A Series is a one-dimensional labelled array capable of holding a sequence of values of any data type (integers, floating-point numbers, strings, Python objects, etc.). By default its data labels are integers starting from zero. You can think of a Pandas Series as a single column in a spreadsheet or in a Pandas DataFrame.**
- To create a Series object, use the `pd.Series()` constructor.

**```pd.Series(data, index, dtype, name)```**
- Where,
   - `data`: can be a Python list, Python dictionary, NumPy array, or a scalar value.
   - `index`: If you do not pass the `index` argument, it defaults to a `RangeIndex` holding `0, 1, ..., n-1` (the same labels as `np.arange(n)`). Index labels must be hashable (for example numbers or strings), and when `data` is a list or array the index must have the same length as `data`. Non-unique index values are allowed. The index is used for three purposes:
       - Identification.
       - Selection.
       - Alignment.
   - `dtype`: Optionally, you can assign any valid NumPy datatype (for example `np.uint8` or `'float64'`) to the Series object. If not specified, it is inferred from `data`.
   - `name`: Optionally, you can assign a name to a Series, which becomes an attribute of the Series object. It also becomes the column name if that Series is later used to create a DataFrame.
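
To see how the four parameters fit together, the sketch below passes all of them in one call. The variable `temps`, its values, and its labels are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, used only to illustrate the four constructor arguments
temps = pd.Series(
    data=[21.5, 23.0, 19.8],        # the values
    index=['mon', 'tue', 'wed'],    # one label per value
    dtype=np.float32,               # explicit NumPy dtype instead of the inferred float64
    name='temperature',             # stored as temps.name
)
print(temps)

# The name is reused as the column label when the Series becomes a DataFrame
df = temps.to_frame()
print(df.columns.tolist())
```

Leaving out `dtype` here would let pandas infer `float64` from the data.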

### a. Creating a Series from Python List

In [4]:
import pandas as pd
import numpy as np

list1 = ['Godwin', 'Simon', 'Blessing', 'Gloria', '']
# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0      Godwin
1       Simon
2    Blessing
3      Gloria
4            
dtype: object

<class 'pandas.core.series.Series'>


>Observe that output is shown in two columns - the index is on the left and the data value is on the right. If we do not explicitly specify an index for the data values while creating a series, then by default indices range from 0 through N – 1. Here N is the number of data elements.

**You can explicitly specify the index for a Series object. The labels can be integers, floats, or strings, and there must be exactly as many labels as values; otherwise `pd.Series()` raises a `ValueError`.**

In [6]:
list1 = ['Godwin', 'Simon', 'Blessing', 'Gloria', '']
indices = ['FS01', 'FS02', 'FS03', 'FS04', 'FS05']

s = pd.Series(data=list1, index=indices)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
FS01      Godwin
FS02       Simon
FS03    Blessing
FS04      Gloria
FS05            
dtype: object

<class 'pandas.core.series.Series'>


In [7]:
s['FS03']

'Blessing'
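
The length requirement on the index can be checked with a short sketch. It deliberately passes three values with only two labels and catches the resulting exception; `names` and `error_message` are illustrative names, not part of the original notebook.

```python
import pandas as pd

names = ['Godwin', 'Simon', 'Blessing']
error_message = None

try:
    # Three values but only two labels: the constructor refuses this
    pd.Series(data=names, index=['FS01', 'FS02'])
except ValueError as err:
    error_message = str(err)
    print(f'ValueError: {error_message}')
```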

>Index labels can also be floating-point numbers, as in the next cell. Non-unique index labels are allowed as well.

In [11]:
list1 = ['Godwin', 'Simon', 'Blessing', ' ', 'Gloria']
indices = [2.1, 2.2, 2.3, 2.4, 2.5]

s = pd.Series(data=list1, index=indices)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
2.1      Godwin
2.2       Simon
2.3    Blessing
2.4            
2.5      Gloria
dtype: object

<class 'pandas.core.series.Series'>
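
Since non-unique labels are allowed, it may help to see what a lookup returns in that case. The following is a minimal sketch with a deliberately repeated label; `s_dup` is an illustrative name.

```python
import pandas as pd

# 'FS01' appears twice on purpose
s_dup = pd.Series(data=['Godwin', 'Simon', 'Blessing'],
                  index=['FS01', 'FS02', 'FS01'])

print(s_dup.index.is_unique)   # reports whether every label occurs once
print(s_dup['FS01'])           # a repeated label returns a Series with all matching rows
print(s_dup['FS02'])           # a label that occurs once returns a single value
```

A lookup on a repeated label therefore returns a Series rather than a scalar, which is worth keeping in mind when later code expects a single value.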


**You can create a Series with missing values using `np.nan`, the IEEE 754 floating-point representation of Not a Number. NaN acts as a placeholder for missing values in the array.**

In [12]:
list1 = ['Godwin', 'Simon', 'Blessing', np.nan, 'Gloria']
# indices = [2.1, 2.2, 2.3, 2.4, 2.5]

s = pd.Series(data=list1)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0      Godwin
1       Simon
2    Blessing
3         NaN
4      Gloria
dtype: object

<class 'pandas.core.series.Series'>


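>Also note that the `dtype` of this Series is inferred as `object`, not `float64`, because the list mixes strings with `np.nan`. A list containing only numbers and `np.nan` would be inferred as `float64`.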
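
The effect of `np.nan` on dtype inference can be compared side by side. This is a small sketch with made-up data; the dtype reported for the string case is the one produced by the pandas 2.0.3 used in this notebook and may differ in other releases.

```python
import numpy as np
import pandas as pd

# Numbers plus NaN: the integers are upcast to float so NaN can be stored
numeric_with_gap = pd.Series([33, 21, np.nan, 67])
print(numeric_with_gap.dtype)

# Strings plus NaN: reported as object under pandas 2.0.3
mixed_with_gap = pd.Series(['Godwin', np.nan, 'Gloria'])
print(mixed_with_gap.dtype)

# isna() marks the missing entry regardless of the dtype
print(mixed_with_gap.isna().tolist())
```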

**You can use the `dtype` argument to specify a datatype for the Series object.**

In [13]:
list1 = [33, 21, 34, 67]
# indices = [2.1, 2.2, 2.3, 2.4, 2.5]

s = pd.Series(data=list1, dtype=np.uint8)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0    33
1    21
2    34
3    67
dtype: uint8

<class 'pandas.core.series.Series'>


In [14]:
list1 = ['Godwin', 'Simon', 'Blessing', np.nan, 'Gloria']
indices = [2.1, 2.2, 2.3, 2.4, 2.5]

s = pd.Series(data=list1, index=indices, name="Family")
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
2.1      Godwin
2.2       Simon
2.3    Blessing
2.4         NaN
2.5      Gloria
Name: Family, dtype: object

<class 'pandas.core.series.Series'>


### b. Creating a Series from NumPy Array

In [17]:
s = pd.Series(data=np.random.randint(low=5, high=200, size=20))
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0     146
1      35
2      88
3      97
4      78
5      15
6      36
7      31
8      36
9     175
10     98
11     88
12     47
13    170
14     23
15     44
16    150
17     79
18    186
19     16
dtype: int32

<class 'pandas.core.series.Series'>


In [18]:
arr1 = np.array([22.3, 44.4,  5.9, 112.21])
s = pd.Series(data=arr1, dtype='float64')

print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0     22.30
1     44.40
2      5.90
3    112.21
dtype: float64

<class 'pandas.core.series.Series'>


### c. Creating a Series from Python Dictionary

In [19]:
details = {
    'name' : 'Okoeguale Godwin',
    'gender' : 'Male',
    'Role' : 'Junior Dev',
    'subject' : 'Data Science'
}

s = pd.Series(data=details)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
name       Okoeguale Godwin
gender                 Male
Role             Junior Dev
subject        Data Science
dtype: object

<class 'pandas.core.series.Series'>


**When you create a series from dictionary, it will automatically take the keys as index and the value as data**
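
When a dictionary is combined with an explicit `index`, pandas looks up each requested label among the keys instead of using all keys. The sketch below assumes a shortened version of the `details` dictionary; the label `'city'` is a made-up key that is deliberately absent.

```python
import pandas as pd

details = {'name': 'Okoeguale Godwin', 'gender': 'Male', 'subject': 'Data Science'}

# Labels are matched against the keys; 'city' is not a key, so its value is NaN,
# and 'gender' is not requested, so it is left out
s_sel = pd.Series(data=details, index=['subject', 'name', 'city'])
print(s_sel)
```

This is one case where the index does not have to match the length of `data`.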

### d. Creating a Series from Scalar value

In [22]:
s = pd.Series(data=2001, name='Year', dtype='uint16')
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0    2001
Name: Year, dtype: uint16

<class 'pandas.core.series.Series'>
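
A scalar becomes more useful when an `index` is supplied, because the value is then repeated once per label. A minimal sketch, with labels `'a'`, `'b'`, `'c'` chosen only for illustration:

```python
import pandas as pd

# The scalar 2001 is broadcast to every label in the index
s_year = pd.Series(data=2001, index=['a', 'b', 'c'], name='Year', dtype='uint16')
print(s_year)
```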


### e. Creating an Empty Series

In [23]:
# pandas 1.x emitted a warning here unless `dtype` was passed; pandas 2.x defaults to `object` without a warning
s = pd.Series()
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
Series([], dtype: object)

<class 'pandas.core.series.Series'>


## 3. Attributes of a Pandas Series
- We can access the properties of a Series, called attributes, by writing the attribute name after the Series variable using dot `.` notation.

In [25]:
lecturers = {0:'Simon', 1:np.nan, 2:'Godwin', 3:'Abimbola', 4:'Phillip'}
s = pd.Series(data=lecturers, name="LECTURERS")
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
0       Simon
1         NaN
2      Godwin
3    Abimbola
4     Phillip
Name: LECTURERS, dtype: object

<class 'pandas.core.series.Series'>


In [26]:
# The `name` attribute returns the name of the Series object
s.name

'LECTURERS'

In [27]:
# The `index` attribute returns the Index object holding the labels and their datatype
s.index

Index([0, 1, 2, 3, 4], dtype='int64')

In [33]:
# `nbytes` returns the number of bytes used by the underlying data (each `object` element is an 8-byte pointer)
# `shape` returns a tuple with the shape of the underlying data
# `dtype` returns the data type of the underlying data
# `hasnans` returns True if the data contains any NaN values
# `ndim` returns the number of dimensions of the underlying data
# `size` returns the number of elements in the underlying data
# `values` returns the underlying data as a NumPy array

print(f'dtype : {s.dtype}\nshape : {s.shape}\nbytes = {s.nbytes}\nvalues: {s.values}\nindex : {s.index}\nname : {s.name}\nndim : {s.ndim}\nsize : {s.size}')

dtype : object
shape : (5,)
bytes = 40
values: ['Simon' nan 'Godwin' 'Abimbola' 'Phillip']
index : Index([0, 1, 2, 3, 4], dtype='int64')
name : LECTURERS
ndim : 1
size : 5
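
The `hasnans` attribute is described in the comments above but is not part of the printed output, so here is a short sketch that checks it on the same data. The name `s_lect` is used to avoid overwriting `s`.

```python
import numpy as np
import pandas as pd

lecturers = {0: 'Simon', 1: np.nan, 2: 'Godwin', 3: 'Abimbola', 4: 'Phillip'}
s_lect = pd.Series(data=lecturers, name='LECTURERS')

print(s_lect.hasnans)         # whether any entry is missing
print(s_lect.isna().sum())    # how many entries are missing
print(s_lect.dropna().size)   # how many entries remain after dropping them
```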


<img align="right" width="500" height="500"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/series-anatomy.png"  >

## 4. Understanding Index in a Series
- Every Series object has an index label associated with each item. 
- A Pandas Series supports both integer-based (default) and label/string-based indexing and provides a host of methods for operations involving the index.
<br><br>
    - When the index is unique, Pandas uses a hash table to map each key to its value, so a lookup takes O(1) time. 
    - When the index is non-unique but sorted, Pandas uses binary search, which takes logarithmic time, O(log N).
    - When the index is non-unique and unsorted, a lookup takes linear time, O(N), because Pandas must check every key in the index.<br><br>
- Index in series object is used for three purposes:
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment <br><br>
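
Which of the three lookup cases applies depends on two properties that every index exposes: `is_unique` and `is_monotonic_increasing`. The sketch below builds one small, made-up index for each case and prints both flags; it does not measure lookup time.

```python
import pandas as pd

unique_idx = pd.Index(['FS03', 'FS01', 'FS02'])          # unique labels
sorted_dup_idx = pd.Index(['FS01', 'FS01', 'FS02'])      # repeated labels, sorted
unsorted_dup_idx = pd.Index(['FS02', 'FS01', 'FS02'])    # repeated labels, unsorted

for idx in (unique_idx, sorted_dup_idx, unsorted_dup_idx):
    print(list(idx), idx.is_unique, idx.is_monotonic_increasing)
```

If lookups on a large Series feel slow, checking these two flags is a reasonable first step; `sort_index()` turns the third case into the second.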

### a. Changing Index of a Series Object
- In the examples above, we have seen that
    - If we create a Series object from a dictionary, the keys of the dictionary become the index 
    - If we create a Series object from a list or NumPy array, the index defaults to the integers 0, 1, 2, ...
    - Last but not least, we can assign indices of our own choice, which can be integers, floats, or strings
- Let us see how we can change the indices of a Series object after creation

In [35]:
list1 = ['Godwin', 'Simon', 'Blessing', 'Gloria', '']
# indices = ['FS01', 'FS02', 'FS03', 'FS04', 'FS05']

s = pd.Series(data=list1)
print(f'Series : \n{s}\n\n{s.index}')

Series : 
0      Godwin
1       Simon
2    Blessing
3      Gloria
4            
dtype: object

RangeIndex(start=0, stop=5, step=1)


>The `index` attribute shows that the labels of this Series run from 0 to 4 with a step of 1 (`stop=5` is exclusive).

**Let us modify the index of this series object to some random integers by assigning a random array of integers to `index` attribute of this series object**

In [36]:
arr = np.random.randint(low=100, high=200, size=5)
s.index = arr

print(f'Series : \n{s}\n\n{s.index}')

Series : 
111      Godwin
129       Simon
189    Blessing
105      Gloria
168            
dtype: object

Index([111, 129, 189, 105, 168], dtype='int32')


In [38]:
s.index = [1,4,6,7,3.18]

print(f'Series : \n{s}\n\n{s.index}')

Series : 
1.00      Godwin
4.00       Simon
6.00    Blessing
7.00      Gloria
3.18            
dtype: object

Index([1.0, 4.0, 6.0, 7.0, 3.18], dtype='float64')


**Changing index of a series to a list of strings**

In [40]:
family = ['Godwin', 'Simon', 'Blessing', 'Gloria', 'Vincent']
s= pd.Series(data=family)

print(f'Series : \n{s}\n\n{s.index}')

Series : 
0      Godwin
1       Simon
2    Blessing
3      Gloria
4     Vincent
dtype: object

RangeIndex(start=0, stop=5, step=1)


In [41]:
indices = ['FAM1', 'FAM2', 'FAM3', 'FAM4', 'FAM5']
s.index = indices

print(f'Series : \n{s}\n\n{s.index}')

Series : 
FAM1      Godwin
FAM2       Simon
FAM3    Blessing
FAM4      Gloria
FAM5     Vincent
dtype: object

Index(['FAM1', 'FAM2', 'FAM3', 'FAM4', 'FAM5'], dtype='object')
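
Assigning to `s.index` changes the Series in place. As a possible alternative, `rename()` and `reset_index()` return a new Series and leave the original untouched. The sketch below uses a shortened family list, and the label `'HEAD'` is made up for illustration.

```python
import pandas as pd

family = ['Godwin', 'Simon', 'Blessing']
s_fam = pd.Series(data=family, index=['FAM1', 'FAM2', 'FAM3'])

# rename() maps selected old labels to new ones and returns a new Series
s_renamed = s_fam.rename(index={'FAM1': 'HEAD'})

# reset_index(drop=True) discards the labels and restores 0, 1, 2, ...
s_reset = s_fam.reset_index(drop=True)

print(s_renamed.index.tolist())
print(s_reset.index.tolist())
print(s_fam.index.tolist())   # the original labels are unchanged
```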


<img align="right" width="300" height="300"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/series-anatomy.png"  >

### b. First use of Index (Identification)
- Every data value of a Series object has an associated index label (integer or string), so we can use this label to identify or access data value(s).
- There are three ways to access elements of a Series:
    - Using the `s[]` operator and specifying the index label (integer/string)
    - Using the `s.loc[]` indexer and specifying the index label (integer/string)
    - Using the `s.iloc[]` indexer and specifying the position (an integer from 0 to length-1). It also supports negative indexing, so the last element can be accessed with a position of -1

In [42]:
family = ['Godwin', 'Simon', 'Blessing', 'Gloria', 'Vincent']
indices = ['FAM1', 'FAM2', 'FAM3', 'FAM4', 'FAM5']
s= pd.Series(data=family, index=indices)

print(f'Series : \n{s}\n\n{s.index}')

Series : 
FAM1      Godwin
FAM2       Simon
FAM3    Blessing
FAM4      Gloria
FAM5     Vincent
dtype: object

Index(['FAM1', 'FAM2', 'FAM3', 'FAM4', 'FAM5'], dtype='object')


In [43]:
# With string labels, pandas 2.0 lets s[0] fall back to position 0 instead of raising an error.
# That fallback is deprecated in later releases, so use s.iloc[0] for positional access.
# Pass an index label to the subscript operator
s['FAM3']

'Blessing'

In [44]:
# Pass an index label to the loc indexer
s.loc['FAM2']
# loc is label based and does not accept positions
#s.loc[0] # would raise a KeyError because the label 0 does not exist

'Simon'

In [47]:
# iloc is position based: it raises an IndexError for an out-of-range position
# and a TypeError if you pass a label such as 'FAM1'
#s.iloc[20] 
s.iloc[0]

'Godwin'
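
The commented-out calls in the cells above fail in different ways. The sketch below triggers each failure on a shortened Series and records the exception type; the dictionary `errors` is only a convenience for printing.

```python
import pandas as pd

s_fam = pd.Series(['Godwin', 'Simon', 'Blessing'], index=['FAM1', 'FAM2', 'FAM3'])

errors = {}

try:
    s_fam.iloc[20]        # position 20 is out of bounds for three elements
except IndexError as err:
    errors['iloc[20]'] = type(err).__name__

try:
    s_fam.iloc['FAM1']    # iloc accepts positions, not labels
except TypeError as err:
    errors["iloc['FAM1']"] = type(err).__name__

try:
    s_fam.loc[0]          # 0 is not one of the labels
except KeyError as err:
    errors['loc[0]'] = type(err).__name__

print(errors)
```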

**Fancy Indexing**

In [50]:
# Can access multiple values by specifying a list of indices
s[['FAM5', 'FAM3']]

FAM5     Vincent
FAM3    Blessing
dtype: object

In [52]:
# Can access multiple values by specifying a list of indices
s.loc[['FAM4', 'FAM2']]

FAM4    Gloria
FAM2     Simon
dtype: object

In [53]:
# Can access multiple values by specifying list of positions
s.iloc[[3,0]]

FAM4    Gloria
FAM1    Godwin
dtype: object

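**Negative indexing is reliable only with `iloc`. In the next cell, `s[-1]` works only because the index holds strings, so pandas 2.0 falls back to positional access.**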

In [55]:
s[-1]

'Vincent'

In [57]:
# s.loc[-1]  # would raise a KeyError because -1 is not a label in the index

In [58]:
s.iloc[-1]

'Vincent'

<img align="right" width="400" height="400"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/series-anatomy.png"  >

### c. Second use of Index (Selection)
- A Series can be sliced using the `:` symbol, which returns a subset of the Series object (values with their corresponding index labels).
- A slice has three arguments, `[[start]:[stop][:step]]`, and all are optional.

- A slice can be used in three ways with a Pandas Series object:
    - Using the `s[]` operator and specifying index labels or integer positions
    - Using the `s.loc[]` indexer and specifying index labels (integer/string)
    - Using the `s.iloc[]` indexer and specifying positions (integers from 0 to length-1). It also supports negative positions, so -1 refers to the last element
- Keep the following points in mind:
    - With `s[]`, integer slices are treated as positions, so `stop` is NOT inclusive, while string-label slices are inclusive of `stop`.
    - The `stop` argument is inclusive for `s.loc[]` for both integer and string labels.
    - The `stop` argument is NOT inclusive for `s.iloc[]`, which is position based.
  
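>**Note: In pandas 2.x with default settings, a slice of a Series can be a view of the original object, similar to a shallow copy, so modifying an element in one may also be visible in the other. With Copy-on-Write enabled (the default from pandas 3.0), a slice behaves like an independent copy. Call `.copy()` when you need a slice that is guaranteed to be independent.**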
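
Because view behaviour depends on the pandas version and settings, one defensive option is to copy a slice explicitly before modifying it. The sketch below shows that an explicit copy never writes back to the original; `s_orig` and `s_part` are illustrative names.

```python
import pandas as pd

s_orig = pd.Series(['Godwin', 'Simon', 'Blessing', 'Gloria'], index=[5, 10, 15, 20])

# .copy() gives the slice its own data, so edits cannot leak back
s_part = s_orig.iloc[1:3].copy()
s_part.iloc[0] = 'CHANGED'

print(s_part.tolist())
print(s_orig.tolist())   # the original values are untouched
```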

**Selection/Filtering/Subsetting of Series object having Integer indices**

In [69]:
list1 = ['Godwin', 'Simon', 'Blessing', 'Gloria', 'Vincent', 'Queen']
indices = [5, 10, 15, 20, 25, 30]

s = pd.Series(data=list1, index=indices)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
5       Godwin
10       Simon
15    Blessing
20      Gloria
25     Vincent
30       Queen
dtype: object

<class 'pandas.core.series.Series'>


In [70]:
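# Integer slices in s[] are positional: positions 6 to 14 do not exist
# in a 6-element Series, so the result is empty
s[6:15]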

Series([], dtype: object)

In [71]:
# The subscript operator considers the slice object as positional index and not as the actual indices 
# (if we have integer indices)
# The `stop` argument is NOT inclusive for `s[]` for integer indices
s[0:5]

5       Godwin
10       Simon
15    Blessing
20      Gloria
25     Vincent
dtype: object

In [72]:
# The loc[] indexer treats the slice bounds as index labels, not as positions
# The `stop` argument is inclusive for `s.loc[]` for both integer and string labels
s.loc[5:15]

5       Godwin
10       Simon
15    Blessing
dtype: object

In [73]:
# The iloc[] method considers the slice object as positional index and not as the actual indices
# The `stop` argument is NOT inclusive for `s.iloc[]` being position based
s.iloc[1:4]

10       Simon
15    Blessing
20      Gloria
dtype: object

In [74]:
list1 = ['Godwin', 'Simon', 'Blessing', 'Gloria', 'Vincent', 'Queen']
indices = ['FS01', 'FS02', 'FS03', 'FS04', 'FS05', 'FS06']

s = pd.Series(data=list1, index=indices)
print(f'Series : \n{s}\n\n{type(s)}')

Series : 
FS01      Godwin
FS02       Simon
FS03    Blessing
FS04      Gloria
FS05     Vincent
FS06       Queen
dtype: object

<class 'pandas.core.series.Series'>


**Understanding Step with Series object having String Indices**

In [75]:
s

FS01      Godwin
FS02       Simon
FS03    Blessing
FS04      Gloria
FS05     Vincent
FS06       Queen
dtype: object

In [76]:
# The step works fine with string indices as well
s['FS02': 'FS05': 1]

FS02       Simon
FS03    Blessing
FS04      Gloria
FS05     Vincent
dtype: object

In [77]:
s['FS01': 'FS06': 2]

FS01      Godwin
FS03    Blessing
FS05     Vincent
dtype: object

In [79]:
# With a negative step, start must come after stop; walking backwards from 'FS01' selects nothing
s['FS01': 'FS06': -1]

Series([], dtype: object)
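
The empty result above comes from the order of the bounds, not from the negative step itself. As a sketch, swapping the bounds so the slice starts at the last label walks the Series backwards; the same data is rebuilt here under the name `s_rev` so the cell is self-contained.

```python
import pandas as pd

names = ['Godwin', 'Simon', 'Blessing', 'Gloria', 'Vincent', 'Queen']
labels = ['FS01', 'FS02', 'FS03', 'FS04', 'FS05', 'FS06']
s_rev = pd.Series(data=names, index=labels)

# Start at the last label and step back to the first (both bounds inclusive)
backwards = s_rev.loc['FS06':'FS01':-1]

# The same reversal expressed with positions
by_position = s_rev.iloc[::-1]

print(backwards.index.tolist())
print(backwards.equals(by_position))
```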

<img align="right" width="300" height="300"  src="..\..\..\data_science\exercise\Section-3-Python-for-Data-Scientists/images/series-anatomy.png"  >

### d. Third use of Index (Alignment)
- We can perform basic arithmetic operations like addition, subtraction, multiplication, division, etc., on two Series objects, to produce a new Series instance.
- The operation is done on each corresponding pair of elements. This is done by matching the indices of the two series objects.

In [82]:
list_a = [1, 3, 5, 7, 9]
list_b = [2, 4, 6, 8, 10]

s1 = pd.Series(data=list_a)
s2 = pd.Series(data=list_b)

print(f'Series : \n{s1}\n\n{s1.index}\n{s2}\n\n{s2.index}')

Series : 
0    1
1    3
2    5
3    7
4    9
dtype: int64

RangeIndex(start=0, stop=5, step=1)
0     2
1     4
2     6
3     8
4    10
dtype: int64

RangeIndex(start=0, stop=5, step=1)


In [85]:
s3 = s1 + s2
print(f'Series : \n{s3}\n\n{s3.index}')

Series : 
0     3
1     7
2    11
3    15
4    19
dtype: int64

RangeIndex(start=0, stop=5, step=1)
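
In the example above both Series share the same default index, so matching by label and matching by position give the same answer. The sketch below uses made-up labels in a different order to show that it is the labels, not the positions, that are matched.

```python
import pandas as pd

s_a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s_b = pd.Series([10, 20, 30], index=['z', 'x', 'y'])   # same labels, shuffled

# Each value is paired with the value carrying the same label in the other Series
total = s_a + s_b
print(total)
```

Here `'x'` pairs 1 with 20, `'y'` pairs 2 with 30, and `'z'` pairs 3 with 10.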


**Example 2:** Adding two Series objects having different integer indices

In [88]:
list1 = [6,9,7,5]
index1 = [0,1,2,3]
list2 = [8,6,2,1]
index2 = [0,2,3,5]

s1 = pd.Series(data=list1, index=index1)
s2 = pd.Series(data=list2, index=index2)
print(f'Series : \n{s1}\n\n{s1.index}\n{s2}\n\n{s2.index}')

Series : 
0    6
1    9
2    7
3    5
dtype: int64

Index([0, 1, 2, 3], dtype='int64')
0    8
2    6
3    2
5    1
dtype: int64

Index([0, 2, 3, 5], dtype='int64')


In [89]:
s3 = s1 + s2
print(f'Series : \n{s3}\n\n{s3.index}')

Series : 
0    14.0
1     NaN
2    13.0
3     7.0
5     NaN
dtype: float64

Index([0, 1, 2, 3, 5], dtype='int64')


**Problem:** When a mathematical operation is performed on two Series with mismatched indices, every label that is missing from either Series produces NaN in the result by default.

**Solution:** To handle this problem, instead of using the operators (`+, -, *, /`), an explicit call to `s.add()`, `s.sub()`, `s.mul()` or `s.div()` is preferred. The `fill_value` argument of these methods replaces the missing values in either Series with a specific value, so the result holds a concrete number in place of NaN.

In [90]:
s1.add(s2, fill_value=0)

0    14.0
1     9.0
2    13.0
3     7.0
5     1.0
dtype: float64
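
The same `fill_value` idea carries over to the other arithmetic methods. The sketch below reuses the data of Example 2 under the names `s_left` and `s_right`; choosing 0 for subtraction and 1 for multiplication is an assumption that suits this data, since those values leave the existing operand unchanged.

```python
import pandas as pd

s_left = pd.Series([6, 9, 7, 5], index=[0, 1, 2, 3])
s_right = pd.Series([8, 6, 2, 1], index=[0, 2, 3, 5])

# A missing operand is treated as 0 for subtraction
diff = s_left.sub(s_right, fill_value=0)

# A missing operand is treated as 1 for multiplication
prod = s_left.mul(s_right, fill_value=1)

print(diff)
print(prod)
```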

**Example 3:** Adding two Series objects having different string indices

In [91]:
list1 = [6,9,7,5, 2]
labels1 = ['num1', 'num2', 'num3', 'num4', 'num5']

list2 = [8,6,2,3,6]
labels2 = ['num1', 'num2', 'num3', 'num8', 'num5']

s1 = pd.Series(data=list1, index=labels1)
s2 = pd.Series(data=list2, index=labels2)

print(f'Series : \n{s1}\n\n{s1.index}\n{s2}\n\n{s2.index}')

Series : 
num1    6
num2    9
num3    7
num4    5
num5    2
dtype: int64

Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')
num1    8
num2    6
num3    2
num8    3
num5    6
dtype: int64

Index(['num1', 'num2', 'num3', 'num8', 'num5'], dtype='object')


In [92]:
s3 = s1.add(s2, fill_value=5)
print(f'Series : \n{s3}\n\n{s3.index}')

Series : 
num1    14.0
num2    15.0
num3     9.0
num4    10.0
num5     8.0
num8     8.0
dtype: float64

Index(['num1', 'num2', 'num3', 'num4', 'num5', 'num8'], dtype='object')
