# An introduction to Pandas Series

A pandas Series models one-dimensional data. Besides its values, a Series object has an index and, optionally, a name.


|Artist| Data |
|------|------|
|  0   |  145 |
|  1   |  142 |
|  2   |  38  |
|  3   |  13  |




In [1]:
# Creating a Pandas Series

import pandas as pd

songs = pd.Series(
    [120, 142, 38, 130, 100, 120, 140, 120, 140, 120, 140, 120, 140, 120, 140],
    name = "counts"
)

songs

0     120
1     142
2      38
3     130
4     100
5     120
6     140
7     120
8     140
9     120
10    140
11    120
12    140
13    120
14    140
Name: counts, dtype: int64

The interpreter prints the series. Although the output looks two-dimensional, it is in fact one-dimensional. The leftmost column is the `index`; it holds the labels and is not part of the values. The generic name for an index is an axis, and the entries of the index (0, 1, 2, 3, ...) are called axis labels.

The data (120, 142, 38, ...) are called the `values`. In this case they are integers, as the `dtype: int64` line (64-bit integer) shows.

In [2]:
#  Check the index - monotonically increasing integers
songs.index

RangeIndex(start=0, stop=15, step=1)
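
To make the terms above concrete, here is a small sketch that pulls a series apart into its index, values, name, and dtype. It uses a shortened stand-in for `songs`; the name `songs_small` is made up for this example.

```python
import pandas as pd

# A shortened stand-in for the songs series (hypothetical name)
songs_small = pd.Series([120, 142, 38, 130], name="counts")

# The axis labels live in the index, separate from the data
labels = list(songs_small.index)

# The data itself, as a one-dimensional NumPy array
data = songs_small.to_numpy()

print(labels)
print(data)
print(songs_small.name, songs_small.dtype)
```

The index and the values are stored separately, which is why the printed output has two columns even though the data is one-dimensional.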

## Working with NaN

When pandas determines that a series holds numeric values but cannot find a number to represent an entry, it uses `NaN` ("Not a Number").

In [3]:
import numpy as np

nan_series = pd.Series([2, 5, 10, np.nan, 8], index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])

nan_series

Mon     2.0
Tue     5.0
Wed    10.0
Thu     NaN
Fri     8.0
dtype: float64

> One thing to note is that the type of this series is float64, not int64! The type is a float because float64 supports NaN, which int64 does not. When pandas sees numeric data alongside np.nan, it coerces the numeric data to float values.
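
The coercion can be seen by building the same numbers with and without a missing entry. The last part of this sketch goes beyond the text above: it assumes you are free to use the nullable `"Int64"` extension dtype, which is one way to keep integers alongside missing values.

```python
import numpy as np
import pandas as pd

ints = pd.Series([2, 5, 10, 8])
with_nan = pd.Series([2, 5, 10, np.nan, 8])

print(ints.dtype)      # int64
print(with_nan.dtype)  # float64

# Assumption beyond the text: the nullable "Int64" extension dtype
# stores missing entries as <NA> without converting to float
nullable = pd.Series([2, 5, 10, None, 8], dtype="Int64")
print(nullable.dtype)  # Int64
print(nullable.isna().sum())
```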

Below is an example of how pandas ignores NaN. The `.count` method, which counts the number of values in a series, disregards NaN. In this case the count of items in the series is 4, ignoring the missing value for "Thu".

In [4]:
# Pandas ignores NaN values in operations
nan_series.count()

4

In [5]:
# To inspect the number of entries (including missing values)

nan_series.size

5

> If you load data from a CSV file, an empty value in an otherwise numeric column becomes NaN. Methods such as .fillna and .dropna, covered later, deal with NaN values.
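
As a rough preview, the sketch below reads a tiny CSV from an in-memory string and then applies both methods. The column names and values are invented for illustration, and filling with 0 is only one possible choice.

```python
import io

import pandas as pd

# Hypothetical CSV text; the value for Tue is left empty
csv_text = "day,plays\nMon,2\nTue,\nWed,10\n"

df = pd.read_csv(io.StringIO(csv_text))
plays = df["plays"]
print(plays.dtype)  # float64, because the empty cell became NaN

# Replace missing entries with a chosen value
filled = plays.fillna(0)

# Or drop the missing entries altogether
dropped = plays.dropna()

print(filled.tolist())
print(dropped.tolist())
```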

## Similar to NumPy

The pandas Series behaves similarly to a NumPy array, and the two have methods in common.

In [7]:
numpy_ser = np.array([2, 5, 10, 8])

print(f"Index 1 of NumPy array --> {numpy_ser[1]}")
print(f"Mean of the NumPy array --> {numpy_ser.mean()}\n")

print(f"Index 1 of Pandas Series --> {songs[1]}")
print(f"Mean of the Pandas Series --> {songs.mean()}\n")

Index 1 of NumPy array --> 5
Mean of the NumPy array --> 6.25

Index 1 of Pandas Series --> 142
Mean of the Pandas Series --> 122.0



> Pandas Series and NumPy arrays also support boolean filtering. A boolean array is a series with the same index as the series you are working with, holding boolean values; it can be used as a mask to filter items. Plain Python lists do not support fancy indexing operations such as passing a list inside the index brackets.

In [9]:
# Boolean Filtering
mask = songs > songs.mean()

mask

0     False
1      True
2     False
3      True
4     False
5     False
6      True
7     False
8      True
9     False
10     True
11    False
12     True
13    False
14     True
Name: counts, dtype: bool

> Now, the mask can be passed in an index operation. If the mask has a True value for a given index, the value is kept. Otherwise the value is dropped.

In [10]:
# Return series with values > mean of songs
songs[mask]

1     142
3     130
6     140
8     140
10    140
12    140
14    140
Name: counts, dtype: int64
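
Masks can also be combined. The sketch below is an illustration using a shortened copy of the data (the name `short_songs` is made up here). It joins two conditions with `&` and shows the loop a plain Python list would need for the same result.

```python
import pandas as pd

# Shortened copy of the song counts (hypothetical name)
short_songs = pd.Series([120, 142, 38, 130, 100], name="counts")

# Each comparison yields a boolean series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
combined = (short_songs > 100) & (short_songs < 140)
print(short_songs[combined])

# A plain list needs an explicit loop (here a comprehension) instead
plain = [120, 142, 38, 130, 100]
print([v for v in plain if 100 < v < 140])
```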

NumPy also supports filtering by boolean arrays. Unlike a pandas Series, a NumPy array lacks a .median method; NumPy instead provides a median function in its namespace. The mean is available both as a method and as np.mean, so the NumPy equivalent of the filter above looks like this:

In [11]:
numpy_ser[numpy_ser > np.mean(numpy_ser)]

array([10,  8])
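
The difference around the median can be checked directly. This sketch reuses the four numbers from the NumPy example; the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

values = [2, 5, 10, 8]
arr = np.array(values)
ser = pd.Series(values)

# ndarray has no .median method; the function lives in the NumPy namespace
print(hasattr(arr, "median"))  # False
print(np.median(arr))          # 6.5

# A pandas Series offers the median as a method
print(ser.median())            # 6.5

# Filtering by the median in both libraries
print(arr[arr > np.median(arr)])
print(ser[ser > ser.median()])
```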

## Working with Categorical Data

When data is loaded as categorical, the data is limited to a small set of distinct values. Categorical data has the following benefits:
- Use less memory than strings
- Improve performance
- Can have ordering
- Can perform operations on categories
- Enforce membership on values
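
The memory benefit in the list above is easy to measure. The sketch below builds a long series of repeated strings, converts it with `.astype("category")`, and compares the footprint of the two. The data is invented, and exact byte counts vary across pandas versions, so none are shown.

```python
import pandas as pd

# 40,000 strings drawn from only four distinct values
as_text = pd.Series(["s", "m", "l", "xl"] * 10_000)
as_category = as_text.astype("category")

# deep=True also counts the memory held by the strings themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
print(category_bytes < text_bytes)
```

The saving comes from storing each distinct label once and keeping a small integer code per row.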

Categorical data is NOT just limited to strings; numbers or datetime values can be converted into categorical data.
To create a category, we pass dtype="category" into the Series constructor. Alternatively, we can
call the .astype("category") method on a series:

In [13]:
sizes = pd.Series(["m", "s", "m", "l", "s", "m", "xl", "s", "m", "xl", "s", "m", "l", "s"], dtype="category")

sizes

0      m
1      s
2      m
3      l
4      s
5      m
6     xl
7      s
8      m
9     xl
10     s
11     m
12     l
13     s
dtype: category
Categories (4, object): ['l', 'm', 's', 'xl']

In [14]:
sizes.cat.ordered

False
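
Two further properties can be inspected through the `.cat` accessor: the category labels and the integer codes that stand in for them. The sketch below also tries to assign a value outside the categories, which shows membership being enforced. The series `sizes_demo` is a made-up example. Catching both `TypeError` and `ValueError` is a defensive assumption, since the exception type has differed between pandas versions.

```python
import pandas as pd

sizes_demo = pd.Series(["m", "s", "l"], dtype="category")

# Distinct labels (sorted) and the integer code stored for each row
print(list(sizes_demo.cat.categories))
print(sizes_demo.cat.codes.tolist())

# Assigning a value that is not a known category is rejected
try:
    sizes_demo.iloc[0] = "xxl"
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print(f"rejected: {err}")
```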

To convert a non-categorical series to an ordered category, we can create a type with the
CategoricalDtype constructor and the appropriate parameters. Then we pass this type into the
.astype method:

In [16]:
news_series = pd.Series(['m', 'l', 'xs', 's', 'xl'])

# Convert the pandas series to a categorical data type
size_type = pd.api.types.CategoricalDtype(categories=['s', 'm', 'l'], ordered=True)

sizes_2 = news_series.astype(size_type)

# Preview the size data
sizes_2

0      m
1      l
2    NaN
3      s
4    NaN
dtype: category
Categories (3, object): ['s' < 'm' < 'l']

In this case, we limited the categories to just 's', 'm', and 'l', but the data had values that were
not in those categories. Converting the data to a category type replaces those extra values with NaN.
If we have ordered categories, we can do comparisons on them. Comparisons against NaN entries evaluate to False:

In [17]:
# Performing comparisons with ordinal data
sizes_2 > "s"

0     True
1     True
2    False
3    False
4    False
dtype: bool
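
Ordering also affects sorting and the minimum and maximum. The sketch below uses a small made-up series, `ordered_sizes`, with the same ordered type as above. Sorting follows the declared category order rather than alphabetical order.

```python
import pandas as pd

size_type = pd.api.types.CategoricalDtype(categories=["s", "m", "l"], ordered=True)
ordered_sizes = pd.Series(["m", "l", "s", "m"]).astype(size_type)

# min and max follow the declared order s < m < l
print(ordered_sizes.min(), ordered_sizes.max())

# Sorting uses the category order, not the alphabet
print(ordered_sizes.sort_values().tolist())
```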

The prior example created a new Series from existing data that was not categorical. We can also
add ordering information to categorical data. We just need to make sure that we specify all of the
members of the category, or pandas will raise a ValueError:

In [19]:
sizes.cat.reorder_categories(["s", "m", "l", "xl"], ordered=True)

0      m
1      s
2      m
3      l
4      s
5      m
6     xl
7      s
8      m
9     xl
10     s
11     m
12     l
13     s
dtype: category
Categories (4, object): ['s' < 'm' < 'l' < 'xl']
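
The sketch below illustrates both points on a small made-up series, `shirt_sizes`. Leaving out a category raises a ValueError, and the method returns a new series rather than changing the original, so the result needs to be assigned if you want to keep it.

```python
import pandas as pd

shirt_sizes = pd.Series(["m", "s", "xl", "l"], dtype="category")

# Omitting "xl" means the new categories do not match the old ones
try:
    shirt_sizes.cat.reorder_categories(["s", "m", "l"], ordered=True)
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")

# With every category listed, a new ordered series is returned
ordered = shirt_sizes.cat.reorder_categories(["s", "m", "l", "xl"], ordered=True)

print(shirt_sizes.cat.ordered)  # False: the original is unchanged
print(ordered.cat.ordered)      # True
```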