
## Content
* The `Series` data structure
* Dictionaries and series
* How `Series` resembles NumPy arrays
* `dtypes`

## What is a Series
* A one-dimensional data structure
* Imagine we want to represent this:

| index | songs |
|-------|-------|
|   0   |  145  |
|   1   |  142  |
|   2   |   38  |
|   3   |   13  |

We could use a built-in Python data structure, such as a dictionary:

In [1]:
series = {
    'index':    [0, 1, 2, 3],
    'data':     [145, 142, 38, 13],
    'name':     'songs'
}

Getting items by index

In [2]:
def get_items_by_index(series, idx):
    value_idx = series['index'].index(idx)
    return series['data'][value_idx]

In [3]:
get_items_by_index(series, 1)

142

The same function also works when the index labels are strings

In [4]:
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts'
}

In [5]:
get_items_by_index(songs, 'John')

142
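
One limitation of the dictionary-based version is that `list.index` scans the list from the start on every lookup. A minimal sketch, assuming the labels are unique, builds a label-to-value mapping once instead. `build_lookup` is a hypothetical helper introduced only for this illustration.

```python
# Same structure as the dictionary-based "series" above
songs = {
    'index': ['Paul', 'John', 'George', 'Ringo'],
    'data': [145, 142, 38, 13],
    'name': 'counts',
}

def build_lookup(series):
    """Hypothetical helper: map each label to its value (assumes unique labels)."""
    return dict(zip(series['index'], series['data']))

lookup = build_lookup(songs)
print(lookup['John'])  # 142
```

A pandas `Series` offers this kind of label-based access out of the box, which is one reason to prefer it over a hand-rolled structure.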

### The pandas series
* Let's take a look at how we could do the same job using a pandas `Series`

In [6]:
import pandas as pd
songs2 = pd.Series([145, 148, 38, 13], name='counts')
songs2

0    145
1    148
2     38
3     13
Name: counts, dtype: int64

In [7]:
songs2.index

RangeIndex(start=0, stop=4, step=1)

In [8]:
songs3 = pd.Series(
    [145 , 142 , 38 , 13],
    name='counts',
    index=['Paul' , 'John' , 'George', 'Ringo'])

In [9]:
songs3.index

Index(['Paul', 'John', 'George', 'Ringo'], dtype='object')


A `dtype` of `object` appears when:
1. String values are used (here, as index labels)
2. The values are heterogeneous (mixed types)

Avoid the `object` type when the data is numerical or date-like
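
As a small illustration of the `object` type, the sketch below compares the `dtype` pandas infers for purely numeric values with the one it infers for mixed values. The variable names are made up for this example.

```python
import pandas as pd

numeric = pd.Series([1, 2, 3])        # only integers
mixed = pd.Series([1, 'two', 3.0])    # integers, strings and floats together

print(numeric.dtype)  # an integer dtype
print(mixed.dtype)    # object
```

With `object`, each element is stored as a generic Python object, so the fast numeric code paths are not available.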

## The `NaN` value
* `NaN` stands for *Not a Number* and marks a missing value
* Skipped by default in numerical aggregations such as `sum` and `mean`
* Supported only by floating-point types such as `float64`, not by `int64`

### Note
Because `NaN` is a floating-point value, pandas converts (upcasts) integer data to `float64` when a `NaN` is present. See the example below

In [10]:
import numpy as np
nan_series = pd.Series([2, np.nan],
            index=['Ono', 'Clapton'])

In [11]:
nan_series

Ono        2.0
Clapton    NaN
dtype: float64

Notice that the `count` method returns the number of non-NA values in the series

In [12]:
nan_series.count()

1

The `size` property counts every value, including the NA ones

In [13]:
nan_series.size

2
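
To make the effect of `NaN` on numerical operations concrete, here is a minimal sketch using a made-up series. It relies on the `skipna` parameter of the aggregation methods, which defaults to `True`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

print(values.sum())               # 4.0, NaN is skipped
print(values.mean())              # 2.0, the mean of the two non-NA values
print(values.sum(skipna=False))   # nan, NaN propagates when skipping is disabled
```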

## Optional Integer Support for `NaN`
* `int64` doesn't support NA values, but the nullable integer type `dtype='Int64'` (note the capital `I`) does
* The conversion to `float64` won't happen!

### Note
* In general, missing data is cleaned up before analysis

In [14]:
nan_series2 = pd.Series([2, None], index=['Ono', 'Clapton'], dtype='Int64')
nan_series2

Ono           2
Clapton    <NA>
dtype: Int64

In [15]:
nan_series2.count()

1

In [16]:
nan_series2.size

2
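
The difference between the default behavior and the nullable integer type can be seen side by side. This is a small sketch with made-up variable names, not part of the original notebook.

```python
import pandas as pd

plain = pd.Series([2, None])                     # no dtype given
nullable = pd.Series([2, None], dtype='Int64')   # nullable integer type

print(plain.dtype)      # float64
print(nullable.dtype)   # Int64
print(nullable + 1)     # values stay integers; the missing entry stays <NA>
```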

## Similar to `Numpy`

In [17]:
import numpy as np

numpy_ser = np.array([145 , 142 , 38 , 13])

In [18]:
# songs3 has string labels, so use .iloc for positional access
songs3.iloc[1] == numpy_ser[1]

True
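
Unlike a NumPy array, a `Series` can be read both by label and by position. The sketch below rebuilds the same data under the hypothetical name `songs` and uses the explicit `.loc` (label) and `.iloc` (position) accessors.

```python
import pandas as pd

songs = pd.Series(
    [145, 142, 38, 13],
    name='counts',
    index=['Paul', 'John', 'George', 'Ringo'])

by_label = songs.loc['John']   # look up by index label
by_position = songs.iloc[1]    # look up by integer position

print(by_label, by_position)   # 142 142
```

Using the explicit accessors avoids ambiguity about whether an integer refers to a label or a position.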

Both `Series` and NumPy arrays can be filtered, since both support the *boolean array* (mask) concept

In [19]:
songs3.mean()

84.5

In [20]:
numpy_ser.mean()

84.5

In [21]:
mask = songs3 > songs3.mean()
mask

Paul       True
John       True
George    False
Ringo     False
Name: counts, dtype: bool

In [22]:
songs3[mask]

Paul    145
John    142
Name: counts, dtype: int64

In [23]:
numpy_ser[numpy_ser > np.median(numpy_ser)]

array([145, 142])
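
Boolean masks can also be combined. The following sketch, with made-up thresholds, uses `&` with parentheses around each condition, and the same expression works on the underlying NumPy array.

```python
import pandas as pd

counts = pd.Series(
    [145, 142, 38, 13],
    name='counts',
    index=['Paul', 'John', 'George', 'Ringo'])

# Keep values strictly between 20 and 145
mask = (counts > 20) & (counts < 145)
print(counts[mask].tolist())                    # [142, 38]

# The same filter applied to a NumPy array
arr = counts.to_numpy()
print(arr[(arr > 20) & (arr < 145)].tolist())   # [142, 38]
```

The parentheses matter because `&` binds more tightly than the comparison operators.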

## Categorical Data
* Uses less memory than the string type when values repeat
* Can improve performance
* Can have an ordering
* Can perform operations on categories
* Enforces membership: values outside the categories become missing
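
The memory claim can be checked with `memory_usage(deep=True)`. This is a rough sketch on made-up data with many repeated values; the exact numbers depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

# 50,000 strings drawn from only five distinct values
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'] * 10_000)
as_category = sizes.astype('category')

string_bytes = sizes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(category_bytes < string_bytes)  # True
```

The saving comes from storing each distinct value once and keeping only small integer codes per row.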

In [24]:
s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

In [25]:
s

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['l', 'm', 's', 'xl', 'xs']

Categories can be ordered or unordered. By default they are unordered

In [26]:
s.cat.ordered

False

Giving an order to the `Series`

In [27]:
s2 = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')
size_type = pd.api.types.CategoricalDtype(
    categories=['s', 'm', 'l'], ordered=True)

In [28]:
size_type

CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)

Creating the ordered categories. Values that are not listed in `categories` (here `xs` and `xl`) become `NaN`

In [29]:
s3 = s2.astype(size_type)

In [30]:
s3

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

Now we can make comparisons

In [31]:
s3 > 's'

0     True
1     True
2    False
3    False
4    False
dtype: bool
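
An ordering also affects sorting and the `min`/`max` methods. The sketch below uses a hypothetical five-level size type so that no value becomes missing.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(
    categories=['xs', 's', 'm', 'l', 'xl'], ordered=True)
sizes = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype=size_type)

print(sizes.sort_values().tolist())   # ['xs', 's', 'm', 'l', 'xl']
print(sizes.min(), sizes.max())       # xs xl
print((sizes >= 'm').tolist())        # [True, True, False, False, True]
```

Sorting follows the declared category order rather than alphabetical order.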

Reordering the categories with `reorder_categories`

### Note
* The new categories **must** contain exactly the same items as the existing ones!

In [32]:
s.cat.reorder_categories(['xs','s','m','l', 'xl'], ordered=True)

0     m
1     l
2    xs
3     s
4    xl
dtype: category
Categories (5, object): ['xs' < 's' < 'm' < 'l' < 'xl']

We get an error if the categories are different

In [33]:
try:
    s.cat.reorder_categories(['a','b','c','d', 'e'], ordered=True)
except ValueError as err:
    # Print the message raised by pandas instead of a hard-coded copy
    print(f"ValueError: {err}")

ValueError: items in new_categories are not the same as in old categories
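
When the set of categories really does need to change, `set_categories` is one option. The sketch below is an illustration, not part of the original notebook: values that are not in the new list become missing.

```python
import pandas as pd

s = pd.Series(['m', 'l', 'xs', 's', 'xl'], dtype='category')

# Keep only three sizes and give them an order
subset = s.cat.set_categories(['s', 'm', 'l'], ordered=True)

print(subset.cat.categories.tolist())   # ['s', 'm', 'l']
print(subset.isna().tolist())           # [False, False, True, False, True]
```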


---
## Exercises

1. Using Jupyter, create a series with the temperature values for the last seven days. Filter out the values below the mean.

In [34]:
weather_week = pd.Series(
    data=[21.0, 22.0, 22.0, 23.0, 21.0, 21.0, 22.0],
    index=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    dtype='float64'
)

In [35]:
weather_week

Sun    21.0
Mon    22.0
Tue    22.0
Wed    23.0
Thu    21.0
Fri    21.0
Sat    22.0
dtype: float64

In [36]:
weather_week.mean()

21.714285714285715

In [37]:
mask = weather_week > weather_week.mean()
mask

Sun    False
Mon     True
Tue     True
Wed     True
Thu    False
Fri    False
Sat     True
dtype: bool

In [38]:
weather_week[mask]

Mon    22.0
Tue    22.0
Wed    23.0
Sat    22.0
dtype: float64

2. Using Jupyter, create a series with your favorite colors. Use a categorical type.

In [39]:
colors = pd.Series(["Black", "White", "Blue", "Pink"], dtype='category')
colors_order = pd.api.types.CategoricalDtype(["Black", "White", "Blue", "Pink"], ordered=True)

In [40]:
colors

0    Black
1    White
2     Blue
3     Pink
dtype: category
Categories (4, object): ['Black', 'Blue', 'Pink', 'White']

In [41]:
colors_order

CategoricalDtype(categories=['Black', 'White', 'Blue', 'Pink'], ordered=True)

In [42]:
ordered_colors = colors.astype(colors_order)

In [43]:
ordered_colors

0    Black
1    White
2     Blue
3     Pink
dtype: category
Categories (4, object): ['Black' < 'White' < 'Blue' < 'Pink']