<a href="https://colab.research.google.com/github/EngComp-Henrique/Effective-Pandas/blob/main/Effective-Pandas-Chapter-5-6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Series Deep Dive
* This chapter takes a closer look at the `Series` data structure and some of its methods
* First, import the libraries

In [1]:
import pandas as pd
from io import StringIO

Next, store the URL in a variable. It points to the US Fuel Economy dataset, which covers the efficiency of makes and models of cars sold in the US since 1984.

In [2]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

Finally, we load the data

### Note
* The data comes from [Fuel Economy](https://www.fueleconomy.gov/feg/download.shtml)
* `read_csv` can load data from URLs and from zip archives (as long as the archive contains a single file)
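
The zip behavior in the note can be checked offline. The sketch below is a minimal example, not the notebook's own code: it builds a tiny archive with one CSV member (the file name `sample.csv.zip` and the two rows are made up for illustration) and reads it back with `read_csv`.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample; the real vehicles file is much larger.
csv_text = "city08,highway08\n19,25\n9,14\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "sample.csv.zip"
    # Write an archive that holds exactly one CSV member.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("sample.csv", csv_text)
    # The '.zip' suffix lets read_csv infer the compression.
    sample = pd.read_csv(zip_path)

print(sample)
```

The same call works with a URL in place of the local path, which is what the next cell does.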

In [3]:
df = pd.read_csv(url)



Let's investigate two columns of the dataset: *city08* and *highway08*. They report fuel economy in miles per gallon for city and highway driving, respectively.

In [4]:
city_mpg = df.city08
highway_mpg = df.highway08

In [5]:
city_mpg

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: int64

In [6]:
highway_mpg

0        25
1        14
2        33
3        12
4        23
         ..
41139    26
41140    28
41141    24
41142    24
41143    21
Name: highway08, Length: 41144, dtype: int64

## Series attributes

* Count how many attributes are available on a `Series`
* We will explore them in the next chapter!

In [7]:
len(dir(city_mpg))

419
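
The count above includes private and dunder names. As a rough sketch, using a three-value stand-in for `city_mpg`, the public names can be separated out by filtering on the leading underscore; the exact counts depend on the pandas version installed.

```python
import pandas as pd

# Stand-in for city_mpg so the example runs without the download.
s = pd.Series([19, 9, 23], name="city08")

all_names = dir(s)
# Names starting with an underscore are private or dunder attributes.
public_names = [name for name in all_names if not name.startswith("_")]

print(len(all_names), len(public_names))
print("mean" in public_names, "__add__" in all_names)
```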

----
## Exercises
1. Explore the documentation for five attributes of a series from Jupyter.

In [8]:
help(pd.Series.mean)

Help on function mean in module pandas.core.generic:

mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
    Return the mean of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    scalar or Series (if level specified)



In [9]:
help(pd.Series.add)

Help on function add in module pandas.core.ops:

add(self, other, level=None, fill_value=None, axis=0)
    Return Addition of series and other, element-wise (binary operator `add`).
    
    Equivalent to ``series + other``, but with support to substitute a fill_value for
    missing data in either one of the inputs.
    
    Parameters
    ----------
    other : Series or scalar value
    fill_value : None or float value, default None (NaN)
        Fill existing missing (NaN) values, and any new element needed for
        successful Series alignment, with this value before computation.
        If data in both corresponding Series locations is missing
        the result of filling (at that location) will be missing.
    level : int or name
        Broadcast across a level, matching Index values on the
        passed MultiIndex level.
    
    Returns
    -------
    Series
        The result of the operation.
    
    See Also
    --------
    Series.radd : Reverse of the Addition oper

In [10]:
help(pd.Series.loc)

Help on property:

    Access a group of rows and columns by label(s) or a boolean array.
    
    ``.loc[]`` is primarily label based, but may also be used with a
    boolean array.
    
    Allowed inputs are:
    
    - A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
      interpreted as a *label* of the index, and **never** as an
      integer position along the index).
    - A list or array of labels, e.g. ``['a', 'b', 'c']``.
    - A slice object with labels, e.g. ``'a':'f'``.
    
          start and the stop are included
    
    - A boolean array of the same length as the axis being sliced,
      e.g. ``[True, False, True]``.
    - An alignable boolean Series. The index of the key will be aligned before
      masking.
    - An alignable Index. The Index of the returned selection will be the input.
    - A ``callable`` function with one argument (the calling Series or
      DataFrame) and that returns valid output for indexing (one of the above)
    
    See more at 

In [11]:
help(pd.Series.unstack)

Help on function unstack in module pandas.core.series:

unstack(self, level=-1, fill_value=None) -> 'DataFrame'
    Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
    
    Parameters
    ----------
    level : int, str, or list of these, default last level
        Level(s) to unstack, can pass level name.
    fill_value : scalar value, default None
        Value to use when replacing NaN values.
    
    Returns
    -------
    DataFrame
        Unstacked Series.
    
    Examples
    --------
    >>> s = pd.Series([1, 2, 3, 4],
    ...               index=pd.MultiIndex.from_product([['one', 'two'],
    ...                                                 ['a', 'b']]))
    >>> s
    one  a    1
         b    2
    two  a    3
         b    4
    dtype: int64
    
    >>> s.unstack(level=-1)
         a  b
    one  1  2
    two  3  4
    
    >>> s.unstack(level=0)
       one  two
    a    1    3
    b    2    4



In [12]:
help(pd.Series.drop_duplicates)

Help on function drop_duplicates in module pandas.core.series:

drop_duplicates(self, keep='first', inplace=False) -> 'Series | None'
    Return Series with duplicate values removed.
    
    Parameters
    ----------
    keep : {'first', 'last', ``False``}, default 'first'
        Method to handle dropping duplicates:
    
        - 'first' : Drop duplicates except for the first occurrence.
        - 'last' : Drop duplicates except for the last occurrence.
        - ``False`` : Drop all duplicates.
    
    inplace : bool, default ``False``
        If ``True``, performs operation inplace and returns None.
    
    Returns
    -------
    Series or None
        Series with duplicates dropped or None if ``inplace=True``.
    
    See Also
    --------
    Index.drop_duplicates : Equivalent method on Index.
    DataFrame.drop_duplicates : Equivalent method on DataFrame.
    Series.duplicated : Related method on Series, indicating duplicate
        Series values.
    
    Examples
    ---

2. How many attributes are found on the .str attribute? Look at the documentation for three of them.

* [pd.Series.str docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html)

In [13]:
len(dir(pd.Series.str))

88

In [14]:
help(pd.Series.str.split)

Help on function split in module pandas.core.strings.accessor:

split(self, pat=None, n=-1, expand=False)
    Split strings around given separator/delimiter.
    
    Splits the string in the Series/Index from the beginning,
    at the specified delimiter string. Equivalent to :meth:`str.split`.
    
    Parameters
    ----------
    pat : str, optional
        String or regular expression to split on.
        If not specified, split on whitespace.
    n : int, default -1 (all)
        Limit number of splits in output.
        ``None``, 0 and -1 will be interpreted as return all splits.
    expand : bool, default False
        Expand the split strings into separate columns.
    
        * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
        * If ``False``, return Series/Index, containing lists of strings.
    
    Returns
    -------
    Series, Index, DataFrame or MultiIndex
        Type matches caller unless ``expand=True`` (see Notes).
    
    See Also
    ---

In [15]:
help(pd.Series.str.capitalize)

Help on function capitalize in module pandas.core.strings.accessor:

capitalize(self)
    Convert strings in the Series/Index to be capitalized.
    
    Equivalent to :meth:`str.capitalize`.
    
    Returns
    -------
    Series or Index of object
    
    See Also
    --------
    Series.str.lower : Converts all characters to lowercase.
    Series.str.upper : Converts all characters to uppercase.
    Series.str.title : Converts first character of each word to uppercase and
        remaining to lowercase.
    Series.str.capitalize : Converts first character to uppercase and
        remaining to lowercase.
    Series.str.swapcase : Converts uppercase to lowercase and lowercase to
        uppercase.
    Series.str.casefold: Removes all case distinctions in the string.
    
    Examples
    --------
    >>> s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
    >>> s
    0                 lower
    1              CAPITALS
    2    this is a sentence
    3           

In [16]:
help(pd.Series.str.get_dummies)

Help on function get_dummies in module pandas.core.strings.accessor:

get_dummies(self, sep='|')
    Return DataFrame of dummy/indicator variables for Series.
    
    Each string in Series is split by sep and returned as a DataFrame
    of dummy/indicator variables.
    
    Parameters
    ----------
    sep : str, default "|"
        String to split on.
    
    Returns
    -------
    DataFrame
        Dummy variables corresponding to values of the Series.
    
    See Also
    --------
    get_dummies : Convert categorical variable into dummy/indicator
        variables.
    
    Examples
    --------
    >>> pd.Series(['a|b', 'a', 'a|c']).str.get_dummies()
       a  b  c
    0  1  1  0
    1  1  0  0
    2  1  0  1
    
    >>> pd.Series(['a|b', np.nan, 'a|c']).str.get_dummies()
       a  b  c
    0  1  1  0
    1  0  0  0
    2  1  0  1
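
To see the three documented methods in action, here is a small sketch on made-up strings (the `makes` values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Hypothetical values for illustration.
makes = pd.Series(["alfa romeo", "dodge", "subaru"])

# Uppercase the first character of each string, lowercase the rest.
capitalized = makes.str.capitalize()

# Without expand, each element becomes a list of pieces.
parts = makes.str.split(" ")

# With expand=True, the pieces spread across DataFrame columns.
expanded = makes.str.split(" ", expand=True)

# One indicator column per distinct token, split on "|" by default.
dummies = pd.Series(["a|b", "a", "a|c"]).str.get_dummies()

print(capitalized)
print(expanded)
print(dummies)
```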



3. How many attributes are found on the .dt attribute? Look at the documentation for three of them.

In [17]:
len(dir(pd.Series.dt))

93

In [18]:
help(pd.Series.dt.date)

Help on property:

    Returns numpy array of python datetime.date objects (namely, the date
    part of Timestamps without timezone information).



In [19]:
help(pd.Series.dt.asfreq)

Help on function asfreq in module pandas.core.accessor:

asfreq(self, *args, **kwargs)
    Convert the PeriodArray to the specified frequency `freq`.
    
    Equivalent to applying :meth:`pandas.Period.asfreq` with the given arguments
    to each :class:`~pandas.Period` in this PeriodArray.
    
    Parameters
    ----------
    freq : str
        A frequency.
    how : str {'E', 'S'}, default 'E'
        Whether the elements should be aligned to the end
        or start within pa period.
    
        * 'E', 'END', or 'FINISH' for end,
        * 'S', 'START', or 'BEGIN' for start.
    
        January 31st ('END') vs. January 1st ('START') for example.
    
    Returns
    -------
    PeriodArray
        The transformed PeriodArray with the new frequency.
    
    See Also
    --------
    PeriodIndex.asfreq: Convert each Period in a PeriodIndex to the given frequency.
    Period.asfreq : Convert a :class:`~pandas.Period` object to the given frequency.
    
    Examples
    --------
 

In [20]:
help(pd.Series.dt.day_name)

Help on function day_name in module pandas.core.accessor:

day_name(self, *args, **kwargs)
    Return the day names of the DateTimeIndex with specified locale.
    
    Parameters
    ----------
    locale : str, optional
        Locale determining the language in which to return the day name.
        Default is English locale.
    
    Returns
    -------
    Index
        Index of day names.
    
    Examples
    --------
    >>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3)
    >>> idx
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
                  dtype='datetime64[ns]', freq='D')
    >>> idx.day_name()
    Index(['Monday', 'Tuesday', 'Wednesday'], dtype='object')
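
The docstring example uses a `DatetimeIndex`. A minimal sketch of the same call through the `.dt` accessor, on a series built from the same three dates, looks like this; through the accessor the result comes back as a `Series` rather than an `Index`.

```python
import pandas as pd

# Same three dates as the docstring example, held in a Series.
dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]))

names = dates.dt.day_name()  # day names, one per row
years = dates.dt.year        # integer year component

print(names)
print(years)
```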



# Operators (& Dunder Methods)
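* Dunder (double-underscore) methods are Python's "magic methods": an operator such as `+` calls the matching method (`__add__`) behind the scenes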

In [21]:
# Calculating the average of two series
(city_mpg + highway_mpg) / 2

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
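
The `+` and `/` operators used above dispatch to the dunder methods `__add__` and `__truediv__`. The sketch below, with three made-up values standing in for the two columns, calls them directly to show that the results match. Calling dunders by hand is only for illustration; the operators are the idiomatic form.

```python
import pandas as pd

# Small stand-ins for city_mpg and highway_mpg.
city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

with_operators = (city + highway) / 2
# The same computation, spelled out with the dunder methods.
with_dunders = city.__add__(highway).__truediv__(2)

print(with_operators.equals(with_dunders))
```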

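## Criteria for performing operations on Series
* Pandas aligns both series on their index labels before operating, so results are easiest to reason about when:
  * The indexes are unique (no duplicates)
  * The indexes are common to both series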

## Remember
* Pandas *broadcasts* math operations across every element
* These operations are optimized through *vectorization*
* A numeric pandas series is a block of memory, and modern CPUs leverage Single Instruction, Multiple Data (SIMD) to apply an operation to the whole block
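
As a minimal sketch of broadcasting, a scalar on one side of an operation is applied to every element of the series; the three values are made up for illustration.

```python
import pandas as pd

s = pd.Series([19, 9, 23])

plus_ten = s + 10   # the scalar 10 is applied to each element
doubled = s * 2
above_15 = s > 15   # comparisons broadcast too, giving a boolean series

print(plus_ten.tolist(), doubled.tolist(), above_15.tolist())
```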

In [22]:
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4], name='s2')

In [23]:
s1

1    10
2    20
2    30
dtype: int64

In [24]:
s2

2    35
2    44
4    53
Name: s2, dtype: int64

In [25]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

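Only values whose index labels appear in both series were added! Labels found in just one series (`1` and `4`) produce `NaN`, and the duplicated label `2` yields every pairing (2 × 2 = 4 rows).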
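
For contrast, here is a sketch with unique index labels (the series `a` and `b` are made up), plus a quick `is_unique` check that flags the duplicated labels in a series like `s1` before any arithmetic.

```python
import pandas as pd

# Unique labels: each label matches at most one row on the other side.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b  # "x" has no partner in b, so it becomes NaN

# Same shape of data as s1 and s2 in the cells above.
s1 = pd.Series([10, 20, 30], index=[1, 2, 2])
s2 = pd.Series([35, 44, 53], index=[2, 2, 4])

print(total)
print(s1.index.is_unique)  # duplicates are a warning sign
print(len(s1 + s2))        # more rows than either input
```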

## Iterations
* Avoid them! Looping over a series in Python loses the benefits of vectorization and of operating at the C level.
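
A small sketch of the two styles, using made-up values: both give the same numbers, but the loop handles one pair at a time in Python while the vectorized form hands the whole block to optimized code. The gap only becomes noticeable on large series such as the 41,144-row columns above.

```python
import pandas as pd

city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

# Python-level loop: works, but processes one pair per iteration.
looped = pd.Series([(c + h) / 2 for c, h in zip(city, highway)])

# Vectorized: one expression over the whole series.
vectorized = (city + highway) / 2

print(looped.equals(vectorized))
```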

## Operator methods
* Each operator has a method equivalent (e.g. `.add` for `+`), and passing parameters to the method changes its behavior

In [26]:
s1 + s2

1     NaN
2    55.0
2    64.0
2    65.0
2    74.0
4     NaN
dtype: float64

In [28]:
s1.add(s2, fill_value=0.0)

1    10.0
2    55.0
2    64.0
2    65.0
2    74.0
4    53.0
dtype: float64
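
One detail from the `.add` docstring is easy to miss: `fill_value` only fills a slot that is missing on one side. If the value is missing in both series, the result stays missing. A minimal sketch with made-up series:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, np.nan], index=["x", "y"])

filled = a.add(b, fill_value=0)
# "x": both present, 1.0 + 10.0
# "y": missing on both sides, stays NaN
# "z": missing only in b, treated as 3.0 + 0

print(filled)
```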

## Chaining
* Chaining is a stylistic choice: one method call per line makes the code easier to read

In [30]:
(city_mpg
    .add(highway_mpg)
    .div(2)
)

0        22.0
1        11.5
2        28.0
3        11.0
4        20.0
         ... 
41139    22.5
41140    24.0
41141    21.0
41142    21.0
41143    18.5
Length: 41144, dtype: float64
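
Chains extend naturally: each extra step is one more line. The sketch below adds a `.rename` step to the same average, on three made-up values; the name `avg_mpg` is an illustrative choice.

```python
import pandas as pd

city = pd.Series([19, 9, 23])
highway = pd.Series([25, 14, 33])

avg = (city
    .add(highway)
    .div(2)
    .rename("avg_mpg")  # label the result in the same chain
)

print(avg)
```

Commenting out a single line of the chain is an easy way to inspect the intermediate result.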

## Exercises
1. Add a numeric series to itself.

In [33]:
my_series = pd.Series(
    [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    index=list(range(11)),
    dtype='int64',
    name='fibonacci_sequence',
)
my_series

0      0
1      1
2      1
3      2
4      3
5      5
6      8
7     13
8     21
9     34
10    55
Name: fibonacci_sequence, dtype: int64

In [34]:
my_series.add(my_series)

0       0
1       2
2       2
3       4
4       6
5      10
6      16
7      26
8      42
9      68
10    110
Name: fibonacci_sequence, dtype: int64

2. Add 10 to a numeric series.

In [35]:
my_series.add(10)

0     10
1     11
2     11
3     12
4     13
5     15
6     18
7     23
8     31
9     44
10    65
Name: fibonacci_sequence, dtype: int64

3. Add a numeric series to itself using the .add method.

* Already done in exercise 1, which used `.add`

4. Read the documentation for the .add method.

In [39]:
from pprint import pprint
pprint(my_series.add.__doc__)

('\n'
 'Return Addition of series and other, element-wise (binary operator `add`).\n'
 '\n'
 'Equivalent to ``series + other``, but with support to substitute a '
 'fill_value for\n'
 'missing data in either one of the inputs.\n'
 '\n'
 'Parameters\n'
 '----------\n'
 'other : Series or scalar value\n'
 'fill_value : None or float value, default None (NaN)\n'
 '    Fill existing missing (NaN) values, and any new element needed for\n'
 '    successful Series alignment, with this value before computation.\n'
 '    If data in both corresponding Series locations is missing\n'
 '    the result of filling (at that location) will be missing.\n'
 'level : int or name\n'
 '    Broadcast across a level, matching Index values on the\n'
 '    passed MultiIndex level.\n'
 '\n'
 'Returns\n'
 '-------\n'
 'Series\n'
 '    The result of the operation.\n'
 '\n'
 'See Also\n'
 '--------\n'
 'Series.radd : Reverse of the Addition operator, see\n'
 '    `Python documentation\n'
 '    '
 '<https://docs.pyt