# Pandas Series

In this section, we will introduce Pandas **Series**, the Python equivalent of a column of data, and cover their basic properties, creation, manipulation, and useful functions for analysis.

## Goals for this lesson:
- Understand the relationship between Pandas Series and NumPy arrays
- Use the `.loc[]` and `.iloc[]` accessors to access Series data by label or by position
- Learn to sort, filter, and aggregate Pandas Series using methods and functions
- Apply custom functions using conditional logic to Pandas Series

# Introduction to Series

**Series** are Pandas data structures built on top of NumPy arrays
- Series also contain an **index** and have an **optional name**, in addition to the array of data
- They can be created from other data types, but are usually imported from external sources
- Two or more Series grouped together form a Pandas DataFrame

In [1]:
import numpy as np
import pandas as pd

In [2]:
sales = [0, 5, 155, 0, 518, 0, 1827, 616, 317, 325]

sales_series = pd.Series(sales, name="Sales")
sales_series

0       0
1       5
2     155
3       0
4     518
5       0
6    1827
7     616
8     317
9     325
Name: Sales, dtype: int64

> - The `pd.Series()` constructor converts Python lists and NumPy arrays into Pandas Series.
> - The `name` argument lets you specify a name for the Series.
> - The index is an array of integers starting at 0 by default, but it can be modified.
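
As a minimal sketch of the points above, the same Series can be built from a NumPy array with a custom index or from a dictionary, whose keys become the index. The item labels used here are illustrative only.

```python
import numpy as np
import pandas as pd

# From a NumPy array, replacing the default integer index with custom labels
array_series = pd.Series(
    np.array([0, 5, 155]),
    index=["coffee", "bananas", "tea"],
    name="Sales",
)

# From a dictionary: the keys become the index, the values become the data
dict_series = pd.Series({"coffee": 0, "bananas": 5, "tea": 155}, name="Sales")

print(array_series.index.tolist())
print(dict_series.index.tolist())
```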

# Series Properties

Pandas Series have these key properties:
- **values** - the data array in the Series
- **index**  - the index array in the Series
- **name**   - the optional name for the Series (*useful for accessing columns in a DataFrame*)
- **dtype**  - the data type of the elements in the values array


In [3]:
sales_series.values

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325],
      dtype=int64)

In [4]:
sales_series.index

RangeIndex(start=0, stop=10, step=1)

In [5]:
sales_series.name

'Sales'

In [6]:
sales_series.dtype

dtype('int64')

> You can **convert the datatype** in a Pandas Series by using the `.astype()` method and specifying the desired data type (if compatible)

In [7]:
sales_series

0       0
1       5
2     155
3       0
4     518
5       0
6    1827
7     616
8     317
9     325
Name: Sales, dtype: int64

In [8]:
sales_series.astype("float")

0       0.0
1       5.0
2     155.0
3       0.0
4     518.0
5       0.0
6    1827.0
7     616.0
8     317.0
9     325.0
Name: Sales, dtype: float64

In [9]:
sales_series.astype("bool")

0    False
1     True
2     True
3    False
4     True
5    False
6     True
7     True
8     True
9     True
Name: Sales, dtype: bool

In [10]:
sales_series.astype("datetime64[ns]")

0   1970-01-01 00:00:00.000000000
1   1970-01-01 00:00:00.000000005
2   1970-01-01 00:00:00.000000155
3   1970-01-01 00:00:00.000000000
4   1970-01-01 00:00:00.000000518
5   1970-01-01 00:00:00.000000000
6   1970-01-01 00:00:00.000001827
7   1970-01-01 00:00:00.000000616
8   1970-01-01 00:00:00.000000317
9   1970-01-01 00:00:00.000000325
Name: Sales, dtype: datetime64[ns]
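
The "if compatible" caveat matters: when a value cannot be interpreted as the target type, `.astype()` raises an error instead of converting. A small sketch with made-up string data:

```python
import pandas as pd

# Hypothetical labels: two numeric-looking strings and one that is not
labels = pd.Series(["10", "20", "thirty"], name="labels")

try:
    labels.astype("int")
    converted = True
except ValueError as error:
    # "thirty" cannot be parsed as an integer, so the whole conversion fails
    converted = False
    print(f"Conversion failed: {error}")

# The numeric-looking strings on their own convert without trouble
numeric_labels = labels.iloc[:2].astype("int")
print(numeric_labels.tolist())
```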

# EXERCISE: SERIES BASICS

#### NEW MESSAGE: 
- From: Rachel Revenue (Financial Analyst)
- Subject: Oil Price Series

`Hi there, glad to have you on the team!`

`I work in the finance department, and I’m working on an
analysis on the impact of oil prices on our sales.`

`Our last analyst read in oil data and created a NumPy array, can you
convert that to a Pandas Series and report back on properties
of the Series?`

`Make sure to include name, dtype, size, index, then take the
mean of the values array. Finally, convert the series to an
integer data type and recalculate the mean.`

`Thanks!`

In [11]:
# create a DataFrame from the oil file, drop missing values
oil = pd.read_csv("https://media.githubusercontent.com/media/apoorvpd/data_practice/master/oil.csv").dropna()

In [12]:
oil.head()

         date  dcoilwtico
1  2013-01-02       93.14
2  2013-01-03       92.97
3  2013-01-04       93.12
4  2013-01-07       93.20
5  2013-01-08       93.21


In [13]:
# Grab 100 rows of oil prices
oil_array = np.array(oil["dcoilwtico"].iloc[1000:1100])

oil_array

array([52.22, 51.44, 51.98, 52.01, 52.82, 54.01, 53.8 , 53.75, 52.36,
       53.26, 53.77, 53.98, 51.95, 50.82, 52.19, 53.01, 52.36, 52.45,
       51.12, 51.39, 52.33, 52.77, 52.38, 52.14, 53.24, 53.18, 52.63,
       52.75, 53.9 , 53.55, 53.81, 53.01, 52.19, 52.37, 52.99, 53.84,
       52.96, 53.21, 53.11, 53.41, 53.41, 54.02, 53.61, 54.48, 53.99,
       54.04, 54.  , 53.82, 52.63, 53.33, 53.19, 52.68, 49.83, 48.75,
       48.05, 47.95, 47.24, 48.34, 48.3 , 48.34, 47.79, 47.02, 47.29,
       47.  , 47.3 , 47.02, 48.36, 49.47, 50.3 , 50.54, 50.25, 50.99,
       51.14, 51.69, 52.25, 53.06, 53.38, 53.12, 53.19, 52.62, 52.46,
       50.49, 50.26, 49.64, 48.9 , 49.22, 49.22, 48.96, 49.31, 48.83,
       47.65, 47.79, 45.55, 46.23, 46.46, 45.84, 47.28, 47.81, 47.83,
       48.86])

In [14]:
oil_series = pd.Series(oil_array, name="oil_prices")

In [15]:
print(f"Name: {oil_series.name}")
print(f"dtype: {oil_series.dtype}")
print(f"size: {oil_series.size}")
print(f"index: {oil_series.index}")

Name: oil_prices
dtype: float64
size: 100
index: RangeIndex(start=0, stop=100, step=1)


In [16]:
oil_series.values.mean()

51.128299999999996

In [17]:
oil_series.astype('int').mean()

50.66
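
The integer mean is lower because `.astype("int")` truncates the decimals rather than rounding them. A short sketch using three prices taken from the array above:

```python
import pandas as pd

prices = pd.Series([52.22, 51.44, 48.86], name="oil_prices")

# Casting straight to int drops everything after the decimal point
truncated = prices.astype("int")

# Rounding first, then casting, keeps the nearest whole number
rounded = prices.round().astype("int")

print(truncated.tolist())
print(rounded.tolist())
```

If nearest-integer values are what you need, round before converting.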

# The Index

The **index** lets you easily access *"rows"* in a Pandas Series or DataFrame

In [18]:
sales = [0, 5, 155, 0, 518]

In [19]:
sales_series = pd.Series(sales, name="Sales")
sales_series

0      0
1      5
2    155
3      0
4    518
Name: Sales, dtype: int64

> - Here we are using the default integer index, which is preferred
> - You can **index** and **slice** Series like other sequence data types, but we will learn a better method

In [20]:
sales_series[2]

155

In [21]:
sales_series[2:4]

2    155
3      0
Name: Sales, dtype: int64

# CUSTOM INDICES

In some cases it makes sense to use a **custom index** for accessing rows

In [22]:
sales = [0, 5, 155, 0, 518]
items = ["coffee", "bananas", "tea", "coconut", "sugar"]

In [23]:
sales_series = pd.Series(sales, index=items, name="Sales")
sales_series

coffee       0
bananas      5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

> You can still **index** and **slice** to retrieve Series values using the custom indices

In [24]:
sales_series["tea"]

155

In [25]:
sales_series["bananas":"coconut"]

bananas      5
tea        155
coconut      0
Name: Sales, dtype: int64

# THE ILOC METHOD

The **.iloc[]** method is the preferred way to access values by their positional index
- This method works even when Series have a custom, non-integer index
- It is an optimized access method that the Pandas documentation recommends over plain `[]` indexing for production code

`df.iloc[row position, column position]`

Examples:
- `0 (single row)`
- `[5, 9] (multiple rows)`
- `[0:11] (range of rows)`

**Note**: We will use the column position argument once we start working with Pandas DataFrames

In [26]:
sales_series

coffee       0
bananas      5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [27]:
sales_series.iloc[2]

155

> This returns the value at position 2 (the 3rd element, since positions are *0-indexed*), even though the custom index for that value is "tea"

In [28]:
sales_series.iloc[2:4]

tea        155
coconut      0
Name: Sales, dtype: int64

> This returns the values at positions 2 and 3 (the 3rd and 4th elements); the stop position, 4, is not included
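
The "multiple rows" form listed earlier takes a list of positions, and negative positions count from the end. A brief sketch on the same data:

```python
import pandas as pd

sales_series = pd.Series(
    [0, 5, 155, 0, 518],
    index=["coffee", "bananas", "tea", "coconut", "sugar"],
    name="Sales",
)

# A list of positions returns just those rows, in the order given
first_and_last = sales_series.iloc[[0, 4]]

# Negative positions count back from the end of the Series
last_two = sales_series.iloc[-2:]

print(first_and_last)
print(last_two)
```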

# THE LOC METHOD

The **.loc[]** method is the preferred way to access values by their custom labels

`df.loc[row label, column label]`

Examples:
- `"pizza" (single row)`
- `["mike", "ike"] (multiple rows)`
- `["jan":"dec"] (range of rows)`

In [29]:
sales_series

coffee       0
bananas      5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [30]:
sales_series.loc["tea"]

155

In [31]:
sales_series.loc["bananas":"coconut"]

bananas      5
tea        155
coconut      0
Name: Sales, dtype: int64

> **Note**:
> - Slices are inclusive when using custom labels
> - The **.loc[]** method works even when the indices are integers, but if they are custom integers not ordered from 0 to n-1, the rows will be returned based on the labels themselves and NOT their numeric position
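
To make the second note concrete, here is a small sketch with a custom integer index. The labels 10 through 50 are made up for illustration.

```python
import pandas as pd

sales_series = pd.Series([0, 5, 155, 0, 518], index=[10, 20, 30, 40, 50], name="Sales")

by_label = sales_series.loc[20]      # the row labeled 20
by_position = sales_series.iloc[2]   # the row at position 2 (labeled 30)

label_slice = sales_series.loc[20:40]    # labels 20, 30 and 40 (stop is inclusive)
position_slice = sales_series.iloc[1:3]  # positions 1 and 2 (stop is excluded)

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```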

# DUPLICATE INDEX VALUES

It is possible to have **duplicate index values** in a Pandas Series or DataFrame
- Accessing these indices by their label using `.loc[]` returns all corresponding rows

In [32]:
sales = [0, 5, 155, 0, 518]
items = ["coffee", "coffee", "tea", "coconut", "sugar"]

In [33]:
sales_series = pd.Series(sales, index=items, name="Sales")
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [34]:
sales_series.loc["coffee"]

coffee    0
coffee    5
Name: Sales, dtype: int64

> Warning! Duplicate index values are **generally not advised**, but there are some edge cases where they are useful
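
If you are unsure whether an index contains duplicates, one way to check is with the `is_unique` property and the `.duplicated()` method of the index. A sketch on the same data:

```python
import pandas as pd

sales_series = pd.Series(
    [0, 5, 155, 0, 518],
    index=["coffee", "coffee", "tea", "coconut", "sugar"],
    name="Sales",
)

# False here, because "coffee" appears twice
has_unique_index = sales_series.index.is_unique

# Boolean array marking every repeat after the first occurrence of a label
duplicate_flags = sales_series.index.duplicated()

print(has_unique_index)
print(duplicate_flags)
```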

# RESETTING THE INDEX

You can **reset the index** in a Pandas Series or DataFrame back to the default range of integers by using the `.reset_index()` method.
- By default, the existing index will become a new column in a DataFrame

In [35]:
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [36]:
sales_series.reset_index()

     index  Sales
0   coffee      0
1   coffee      5
2      tea    155
3  coconut      0
4    sugar    518


In [37]:
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

> Use **drop=True** when resetting the index if you don't want the previous index values stored

In [38]:
sales_series.reset_index(drop=True)

0      0
1      5
2    155
3      0
4    518
Name: Sales, dtype: int64

# EXERCISE: ACCESSING SERIES DATA

#### NEW MESSAGE: 
- From: Rachel Revenue (Financial Analyst)
- Subject: Oil Price Series w/Dates

`Thanks for picking up this work, but this data isn’t really useful
without dates since I need to understand trends over time to
improve my forecasts.`

`Can you set the date series to be the index?`

`Then, take the mean of the first 10 and last 10 prices. After
that, can you grab all oil prices from January 1st, 2017 to
January 7th, 2017 and revert the index of this slice back to
integers?`

`Thanks!`

In [39]:
# create a DataFrame from the oil file, drop missing values
oil = pd.read_csv("https://media.githubusercontent.com/media/apoorvpd/data_practice/master/oil.csv").dropna()

In [40]:
oil.head()

         date  dcoilwtico
1  2013-01-02       93.14
2  2013-01-03       92.97
3  2013-01-04       93.12
4  2013-01-07       93.20
5  2013-01-08       93.21


In [41]:
# Grab 100 rows of oil prices
oil_array = np.array(oil["dcoilwtico"].iloc[1000:1100])

oil_array

array([52.22, 51.44, 51.98, 52.01, 52.82, 54.01, 53.8 , 53.75, 52.36,
       53.26, 53.77, 53.98, 51.95, 50.82, 52.19, 53.01, 52.36, 52.45,
       51.12, 51.39, 52.33, 52.77, 52.38, 52.14, 53.24, 53.18, 52.63,
       52.75, 53.9 , 53.55, 53.81, 53.01, 52.19, 52.37, 52.99, 53.84,
       52.96, 53.21, 53.11, 53.41, 53.41, 54.02, 53.61, 54.48, 53.99,
       54.04, 54.  , 53.82, 52.63, 53.33, 53.19, 52.68, 49.83, 48.75,
       48.05, 47.95, 47.24, 48.34, 48.3 , 48.34, 47.79, 47.02, 47.29,
       47.  , 47.3 , 47.02, 48.36, 49.47, 50.3 , 50.54, 50.25, 50.99,
       51.14, 51.69, 52.25, 53.06, 53.38, 53.12, 53.19, 52.62, 52.46,
       50.49, 50.26, 49.64, 48.9 , 49.22, 49.22, 48.96, 49.31, 48.83,
       47.65, 47.79, 45.55, 46.23, 46.46, 45.84, 47.28, 47.81, 47.83,
       48.86])

In [42]:
oil_series = pd.Series(oil_array, index=oil['date'].iloc[1000:1100], name='oil_prices')

In [43]:
oil_series

date
2016-12-20    52.22
2016-12-21    51.44
2016-12-22    51.98
2016-12-23    52.01
2016-12-27    52.82
              ...  
2017-05-09    45.84
2017-05-10    47.28
2017-05-11    47.81
2017-05-12    47.83
2017-05-15    48.86
Name: oil_prices, Length: 100, dtype: float64

In [44]:
oil_series.index # Sanity Check!

Index(['2016-12-20', '2016-12-21', '2016-12-22', '2016-12-23', '2016-12-27',
       '2016-12-28', '2016-12-29', '2016-12-30', '2017-01-03', '2017-01-04',
       '2017-01-05', '2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11',
       '2017-01-12', '2017-01-13', '2017-01-17', '2017-01-18', '2017-01-19',
       '2017-01-20', '2017-01-23', '2017-01-24', '2017-01-25', '2017-01-26',
       '2017-01-27', '2017-01-30', '2017-01-31', '2017-02-01', '2017-02-02',
       '2017-02-03', '2017-02-06', '2017-02-07', '2017-02-08', '2017-02-09',
       '2017-02-10', '2017-02-13', '2017-02-14', '2017-02-15', '2017-02-16',
       '2017-02-17', '2017-02-21', '2017-02-22', '2017-02-23', '2017-02-24',
       '2017-02-27', '2017-02-28', '2017-03-01', '2017-03-02', '2017-03-03',
       '2017-03-06', '2017-03-07', '2017-03-08', '2017-03-09', '2017-03-10',
       '2017-03-13', '2017-03-14', '2017-03-15', '2017-03-16', '2017-03-17',
       '2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23', '2017-03-24',

In [45]:
oil_series.iloc[:10].mean()

52.765

In [46]:
oil_series.iloc[-10:].mean()

47.13

In [47]:
oil_series.loc['2017-01-01': '2017-01-07']

date
2017-01-03    52.36
2017-01-04    53.26
2017-01-05    53.77
2017-01-06    53.98
Name: oil_prices, dtype: float64

In [48]:
oil_series.loc['2017-01-01': '2017-01-07'].reset_index(drop=True)

0    52.36
1    53.26
2    53.77
3    53.98
Name: oil_prices, dtype: float64

# Filtering Series

You can **filter a Series** by passing a logical test into the `.loc[]` accessor (*like arrays!*)

In [49]:
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [50]:
sales_series.loc[sales_series > 0] # This returns all rows from sales_series with a value greater than 0

coffee      5
tea       155
sugar     518
Name: Sales, dtype: int64

In [51]:
mask = (sales_series > 0) & (sales_series.index == "coffee")
sales_series.loc[mask]

coffee    5
Name: Sales, dtype: int64

> This uses a **mask** to store complex logic and returns all rows from `sales_series` with a value greater than 0 and an index equal to "coffee"

In [52]:
sales_series.index.isin(['coffee', 'tea'])

array([ True,  True,  True, False, False])

In [53]:
~sales_series.index.isin(['coffee', 'tea'])

array([False, False, False,  True,  True])

> The tilde `~` inverts Boolean values!
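
The Boolean arrays returned by `.isin()` can be passed to `.loc[]` like any other mask. A short sketch on the same data, keeping and then excluding the listed items:

```python
import pandas as pd

sales_series = pd.Series(
    [0, 5, 155, 0, 518],
    index=["coffee", "coffee", "tea", "coconut", "sugar"],
    name="Sales",
)

mask = sales_series.index.isin(["coffee", "tea"])

selected = sales_series.loc[mask]    # rows whose label is in the list
excluded = sales_series.loc[~mask]   # every other row

print(selected)
print(excluded)
```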

# Sorting Series

You can **sort Series** by their values or their index

1. The **.sort_values()** method sorts a Series by its values in ascending order
2. The **.sort_index()** method sorts a Series by its index in ascending order

In [54]:
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [55]:
sales_series.sort_values()

coffee       0
coconut      0
coffee       5
tea        155
sugar      518
Name: Sales, dtype: int64

In [56]:
sales_series.sort_values(ascending=False)

sugar      518
tea        155
coffee       5
coffee       0
coconut      0
Name: Sales, dtype: int64

In [57]:
sales_series.sort_index()

coconut      0
coffee       0
coffee       5
sugar      518
tea        155
Name: Sales, dtype: int64

In [58]:
sales_series.sort_index(ascending=False)

tea        155
sugar      518
coffee       0
coffee       5
coconut      0
Name: Sales, dtype: int64
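
Note that both methods return a new Series by default and leave the original untouched, which is why `sales_series` looks the same before each call above. To keep a sorted result, assign it to a variable, as in this sketch:

```python
import pandas as pd

sales_series = pd.Series(
    [0, 5, 155, 0, 518],
    index=["coffee", "coffee", "tea", "coconut", "sugar"],
    name="Sales",
)

# The sorted copy is stored under a new name
sorted_sales = sales_series.sort_values(ascending=False)

print(sorted_sales.tolist())   # sorted copy
print(sales_series.tolist())   # original order is unchanged
```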

# EXERCISE: SORTING & FILTERING SERIES

#### NEW MESSAGE: 
- From: Rachel Revenue (Financial Analyst)
- Subject: Oil Price Anomalies

`Hi again, your work has been super helpful already!`

`I need to look at this data from a few more angles.`

`First, can you get me the 10 lowest prices from the data,
sorted by date, starting with the most recent and ending with
the oldest?`

`After that, return to the original data. I’ve provided a list of
dates I want to narrow down to, and I also want to look only at
prices less than or equal to 50 dollars per barrel.`

`Thanks!`

In [59]:
dates = [
    "2016-12-22",
    "2017-05-03",
    "2017-01-06",
    "2017-03-05",
    "2017-02-12",
    "2017-03-21",
    "2017-04-14",
    "2017-04-15",
]

In [60]:
# create a DataFrame from the oil file, drop missing values
oil = pd.read_csv("https://media.githubusercontent.com/media/apoorvpd/data_practice/master/oil.csv").dropna()

# Grab 100 rows of oil prices
oil_array = np.array(oil["dcoilwtico"].iloc[1000:1100])

oil_array

array([52.22, 51.44, 51.98, 52.01, 52.82, 54.01, 53.8 , 53.75, 52.36,
       53.26, 53.77, 53.98, 51.95, 50.82, 52.19, 53.01, 52.36, 52.45,
       51.12, 51.39, 52.33, 52.77, 52.38, 52.14, 53.24, 53.18, 52.63,
       52.75, 53.9 , 53.55, 53.81, 53.01, 52.19, 52.37, 52.99, 53.84,
       52.96, 53.21, 53.11, 53.41, 53.41, 54.02, 53.61, 54.48, 53.99,
       54.04, 54.  , 53.82, 52.63, 53.33, 53.19, 52.68, 49.83, 48.75,
       48.05, 47.95, 47.24, 48.34, 48.3 , 48.34, 47.79, 47.02, 47.29,
       47.  , 47.3 , 47.02, 48.36, 49.47, 50.3 , 50.54, 50.25, 50.99,
       51.14, 51.69, 52.25, 53.06, 53.38, 53.12, 53.19, 52.62, 52.46,
       50.49, 50.26, 49.64, 48.9 , 49.22, 49.22, 48.96, 49.31, 48.83,
       47.65, 47.79, 45.55, 46.23, 46.46, 45.84, 47.28, 47.81, 47.83,
       48.86])

In [61]:
oil.head()

         date  dcoilwtico
1  2013-01-02       93.14
2  2013-01-03       92.97
3  2013-01-04       93.12
4  2013-01-07       93.20
5  2013-01-08       93.21


In [62]:
oil_series = pd.Series(oil_array, index=oil["date"].iloc[1000:1100], name="oil_series")

In [63]:
oil_series.sort_values().iloc[:10]

date
2017-05-04    45.55
2017-05-09    45.84
2017-05-05    46.23
2017-05-08    46.46
2017-03-23    47.00
2017-03-21    47.02
2017-03-27    47.02
2017-03-14    47.24
2017-05-10    47.28
2017-03-22    47.29
Name: oil_series, dtype: float64

In [64]:
oil_series.sort_values().iloc[:10].sort_index(ascending=False)

date
2017-05-10    47.28
2017-05-09    45.84
2017-05-08    46.46
2017-05-05    46.23
2017-05-04    45.55
2017-03-27    47.02
2017-03-23    47.00
2017-03-22    47.29
2017-03-21    47.02
2017-03-14    47.24
Name: oil_series, dtype: float64

In [65]:
mask = oil_series.index.isin(dates) & (oil_series <= 50)
oil_series.loc[mask]

date
2017-03-21    47.02
2017-05-03    47.79
Name: oil_series, dtype: float64

# String Methods

The Pandas str accessor lets you access many **string methods**

In [66]:
prices = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])
prices

0     3.99
1     5.99
2    22.99
3     7.99
4    33.99
dtype: float64

In [67]:
prices = "$" + prices.astype("float").astype("string")

In [68]:
prices

0     $3.99
1     $5.99
2    $22.99
3     $7.99
4    $33.99
dtype: string

In [69]:
prices.str.contains("3")  # The str accessor lets you access the string methods

0     True
1    False
2    False
3    False
4     True
dtype: boolean

In [70]:
clean = prices.str.strip("$").astype("float")  # This is removing the dollar sign, then converting to float
clean

0     3.99
1     5.99
2    22.99
3     7.99
4    33.99
dtype: float64
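
A few more methods available through the `.str` accessor, sketched on the same price strings. The methods chosen here are just a sample; many of Python's string methods have a `.str` counterpart.

```python
import pandas as pd

prices = pd.Series(["$3.99", "$5.99", "$22.99", "$7.99", "$33.99"])

# Replace a literal substring (regex=False treats "$" as plain text)
without_symbol = prices.str.replace("$", "", regex=False)

# Length of each string
lengths = prices.str.len()

# Boolean test on the end of each string
ends_in_99 = prices.str.endswith(".99")

print(without_symbol.tolist())
print(lengths.tolist())
print(ends_in_99.tolist())
```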

# EXERCISE: SERIES OPERATIONS

#### NEW MESSAGE: 
- From: Rachel Revenue (Financial Analyst)
- Subject: Sensitivity Analysis

`Hey there,`

`I’m doing some ‘stress testing’ on my models. I want to look at
the financial impact if oil prices were 10% higher and add an
additional two dollars per barrel on top of that.`

`Once you’ve done that, create a series that represents the
percent difference between each price and the max price.`

`Finally, extract the month from the string dates in the index,
and store them as an integer.`

`Thanks!`

In [71]:
# create a DataFrame from the oil file, drop missing values
oil = pd.read_csv("https://media.githubusercontent.com/media/apoorvpd/data_practice/master/oil.csv").dropna()

# Grab 100 rows of oil prices
oil_array = np.array(oil["dcoilwtico"].iloc[1000:1100])

oil_array

array([52.22, 51.44, 51.98, 52.01, 52.82, 54.01, 53.8 , 53.75, 52.36,
       53.26, 53.77, 53.98, 51.95, 50.82, 52.19, 53.01, 52.36, 52.45,
       51.12, 51.39, 52.33, 52.77, 52.38, 52.14, 53.24, 53.18, 52.63,
       52.75, 53.9 , 53.55, 53.81, 53.01, 52.19, 52.37, 52.99, 53.84,
       52.96, 53.21, 53.11, 53.41, 53.41, 54.02, 53.61, 54.48, 53.99,
       54.04, 54.  , 53.82, 52.63, 53.33, 53.19, 52.68, 49.83, 48.75,
       48.05, 47.95, 47.24, 48.34, 48.3 , 48.34, 47.79, 47.02, 47.29,
       47.  , 47.3 , 47.02, 48.36, 49.47, 50.3 , 50.54, 50.25, 50.99,
       51.14, 51.69, 52.25, 53.06, 53.38, 53.12, 53.19, 52.62, 52.46,
       50.49, 50.26, 49.64, 48.9 , 49.22, 49.22, 48.96, 49.31, 48.83,
       47.65, 47.79, 45.55, 46.23, 46.46, 45.84, 47.28, 47.81, 47.83,
       48.86])

In [72]:
oil_series = pd.Series(oil_array, index=oil["date"].iloc[1000:1100], name="oil_series")

In [73]:
oil_series * 1.1 + 2

date
2016-12-20    59.442
2016-12-21    58.584
2016-12-22    59.178
2016-12-23    59.211
2016-12-27    60.102
               ...  
2017-05-09    52.424
2017-05-10    54.008
2017-05-11    54.591
2017-05-12    54.613
2017-05-15    55.746
Name: oil_series, Length: 100, dtype: float64

In [74]:
max_price = oil_series.max()
max_price

54.48

In [75]:
max_price_differential = (oil_series - max_price) / max_price
max_price_differential

date
2016-12-20   -0.041483
2016-12-21   -0.055800
2016-12-22   -0.045888
2016-12-23   -0.045338
2016-12-27   -0.030470
                ...   
2017-05-09   -0.158590
2017-05-10   -0.132159
2017-05-11   -0.122430
2017-05-12   -0.122063
2017-05-15   -0.103157
Name: oil_series, Length: 100, dtype: float64

In [76]:
month = oil_series.index.str[5:7].astype('int')
month

Index([12, 12, 12, 12, 12, 12, 12, 12,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  4,  4,
        4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  5,
        5,  5,  5,  5,  5,  5,  5,  5,  5,  5],
      dtype='int32', name='date')
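
String slicing works here because the dates follow a fixed `YYYY-MM-DD` layout. As an alternative sketch (not part of the original solution), the index can be parsed with `pd.to_datetime()` and the month read from the result. The three dates below are taken from the oil data.

```python
import pandas as pd

prices = pd.Series(
    [52.22, 52.36, 48.86],
    index=["2016-12-20", "2017-01-03", "2017-05-15"],
    name="oil_series",
)

# Approach used above: slice characters 5 and 6 out of each date string
month_from_string = prices.index.str[5:7].astype("int")

# Alternative: parse the strings as dates, then read the month attribute
month_from_datetime = pd.to_datetime(prices.index).month

print(month_from_string.tolist())
print(month_from_datetime.tolist())
```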

# MISSING DATA

**Missing data** in Pandas is often represented by NumPy "NaN" values
- This is more efficient than Python's "None" data type
- Pandas treats NaN values as a float, which allows them to be used in vectorized operations

In [77]:
sales = [0, 5, 155, np.nan, 518]

sales_series = pd.Series(sales, name="Sales")
sales_series

0      0.0
1      5.0
2    155.0
3      NaN
4    518.0
Name: Sales, dtype: float64

> - **np.nan** creates a NaN value.
> - These are rarely created by hand, and typically appear when reading in data from external sources
> - If NaN was not present here, the data type would be int64

In [78]:
sales_series + 2

0      2.0
1      7.0
2    157.0
3      NaN
4    520.0
Name: Sales, dtype: float64

> Arithmetic operations performed on NaN values will return NaN

In [79]:
sales_series.add(2, fill_value=0)

0      2.0
1      7.0
2    157.0
3      2.0
4    520.0
Name: Sales, dtype: float64

> Most operation methods include a `fill_value` argument that lets you pass a value instead of NaN
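
For instance, `.sub()` and `.mul()` accept `fill_value` in the same way as `.add()`. A sketch on the same data, where the fill values are chosen purely for illustration:

```python
import numpy as np
import pandas as pd

sales_series = pd.Series([0, 5, 155, np.nan, 518], name="Sales")

# Treat the missing sale as 0 before subtracting 5
minus_five = sales_series.sub(5, fill_value=0)

# Treat the missing sale as 1 before doubling
doubled = sales_series.mul(2, fill_value=1)

print(minus_five.tolist())
print(doubled.tolist())
```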

Pandas introduced its own **missing data type**, `pd.NA`, in version 1.0 (released in January 2020)
- This allows missing values to be stored in integer Series (with a nullable data type such as `Int64`), instead of needing to convert to float
- This is still a relatively new feature, and most bugs you run into simply result in the data being converted to NumPy's NaN

In [80]:
sales = [0, 5, 155, pd.NA, 518]

sales_series = pd.Series(sales, name="Sales")
sales_series

0       0
1       5
2     155
3    <NA>
4     518
Name: Sales, dtype: object
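
Notice that the Series above has dtype `object`, because no data type was specified. A minimal sketch of keeping the values as integers, assuming you request Pandas' nullable `Int64` dtype explicitly:

```python
import pandas as pd

# dtype="Int64" (capital I) is the nullable integer type that supports pd.NA
sales_series = pd.Series([0, 5, 155, pd.NA, 518], name="Sales", dtype="Int64")

print(sales_series.dtype)
print(sales_series.isna().sum())  # number of missing values
print(sales_series.sum())         # NA is skipped by default
```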

# Identifying Missing Data

The `.isna()` and `.value_counts()` methods let you **identify missing data** in a Series
- The **`.isna()`** method returns True if a value is missing, and False otherwise
- The **`.value_counts()`** method returns unique values and their frequency

In [81]:
checklist = pd.Series(["COMPLETE", np.nan, np.nan, np.nan, "COMPLETE"], name="checklist")
checklist

0    COMPLETE
1         NaN
2         NaN
3         NaN
4    COMPLETE
Name: checklist, dtype: object

In [82]:
checklist.isna()

0    False
1     True
2     True
3     True
4    False
Name: checklist, dtype: bool

In [83]:
checklist.isna().sum()  # .isna().sum() returns the count of NaN values

3

In [84]:
checklist.value_counts()

checklist
COMPLETE    2
Name: count, dtype: int64

In [85]:
checklist.value_counts(dropna=False)

checklist
NaN         3
COMPLETE    2
Name: count, dtype: int64

> Most methods ignore NaN values, so you need to specify **`dropna=False`** to return the count of NaN values

# Handling Missing Data

- The `.dropna()` method removes NaN values from your Series or DataFrame
- The `.fillna(value)` method replaces NaN values with a specified value

In [86]:
checklist

0    COMPLETE
1         NaN
2         NaN
3         NaN
4    COMPLETE
Name: checklist, dtype: object

In [87]:
checklist.dropna()

0    COMPLETE
4    COMPLETE
Name: checklist, dtype: object

> **Note**: the index has gaps, so you can use `.reset_index(drop=True)` to restore the range of integers
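
A short sketch of chaining the two steps, so the remaining rows are renumbered from 0:

```python
import numpy as np
import pandas as pd

checklist = pd.Series(["COMPLETE", np.nan, np.nan, np.nan, "COMPLETE"], name="checklist")

# Drop the missing rows, then discard the old labels (0 and 4) for a fresh range
compact = checklist.dropna().reset_index(drop=True)

print(compact)
```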

In [88]:
checklist.fillna("INCOMPLETE")

0      COMPLETE
1    INCOMPLETE
2    INCOMPLETE
3    INCOMPLETE
4      COMPLETE
Name: checklist, dtype: object

It's important to be **thoughtful and deliberate** in how you handle missing data
- Do you **keep** them?
- Do you **remove** them?
- Do you **replace** them with zeros?
- Do you **impute** them with the mean?

**PRO TIP**: These operations can dramatically impact the results of an analysis, so make sure you understand these impacts and talk to a data SME to understand why data is missing
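
To see how much the choice can matter, this sketch compares the mean of a small sales Series under three of the strategies listed above. It uses the same five values as the earlier examples.

```python
import numpy as np
import pandas as pd

sales = pd.Series([0, 5, 155, np.nan, 518], name="Sales")

# Keep the NaN: .mean() skips it and averages the 4 known values
mean_keep = sales.mean()

# Replace with zero: the missing row now pulls the average down
mean_zero = sales.fillna(0).mean()

# Impute with the mean: the average stays where it was
mean_imputed = sales.fillna(sales.mean()).mean()

print(mean_keep, mean_zero, mean_imputed)
```

Filling with zero lowers the mean here, while imputing with the mean leaves it unchanged, so the right choice depends on why the value is missing.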

In [89]:
sales_series

0       0
1       5
2     155
3    <NA>
4     518
Name: Sales, dtype: object

In [90]:
sales_series.dropna()

0      0
1      5
2    155
4    518
Name: Sales, dtype: object

In [91]:
pd.set_option('future.no_silent_downcasting', True)

In [92]:
sales_series.fillna(0)

0      0
1      5
2    155
3      0
4    518
Name: Sales, dtype: object

In [93]:
sales_series.fillna(sales_series.mean())

0        0
1        5
2      155
3    169.5
4      518
Name: Sales, dtype: object

# EXERCISE: MISSING DATA

#### NEW MESSAGE: 
- From: Rachel Revenue (Financial Analyst)
- Subject: Erroneous Data

`Hey,`

`I just got a promotion thanks to the analysis you helped me
with. I owe you lunch!`

`I noticed that two prices (51.44, 47.83), were incorrect, so I
had them filled in with missing values. I’m not sure if I did this
correctly.`

`Can you confirm the number of missing values in
the price column? Once you’ve done that, fill the prices in with
the median of the oil price series.`

`Thanks!`

In [94]:
oil_series

date
2016-12-20    52.22
2016-12-21    51.44
2016-12-22    51.98
2016-12-23    52.01
2016-12-27    52.82
              ...  
2017-05-09    45.84
2017-05-10    47.28
2017-05-11    47.81
2017-05-12    47.83
2017-05-15    48.86
Name: oil_series, Length: 100, dtype: float64

In [97]:
oil_series = oil_series.where(~oil_series.isin([51.44, 47.83]), np.nan)

In [98]:
oil_series.isna().sum()

2

In [99]:
oil_series.fillna(oil_series.median())

date
2016-12-20    52.220
2016-12-21    52.205
2016-12-22    51.980
2016-12-23    52.010
2016-12-27    52.820
               ...  
2017-05-09    45.840
2017-05-10    47.280
2017-05-11    47.810
2017-05-12    52.205
2017-05-15    48.860
Name: oil_series, Length: 100, dtype: float64

# THE APPLY METHOD

The **`.apply()`** method lets you apply custom functions to Pandas Series


In [100]:
def discount(price):
    if price > 20:
        return round(price * 0.9, 2)
    return price

In [102]:
clean_wholesale = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])
clean_wholesale

0     3.99
1     5.99
2    22.99
3     7.99
4    33.99
dtype: float64

In [103]:
clean_wholesale.apply(discount)

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64

In [105]:
clean_wholesale.apply(lambda x: round(x * 0.9, 2) if x > 20 else x)

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64
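
If the function needs extra inputs, `.apply()` can forward them through its `args` parameter. In this sketch the `threshold` and `multiplier` parameters are hypothetical additions to the `discount` function above.

```python
import pandas as pd

def discount(price, threshold, multiplier):
    # Apply the multiplier only to prices above the threshold
    if price > threshold:
        return round(price * multiplier, 2)
    return price

clean_wholesale = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])

# Values in args are passed to discount() after each price
discounted = clean_wholesale.apply(discount, args=(20, 0.9))

print(discounted.tolist())
```

With a threshold of 20 and a multiplier of 0.9, this reproduces the 10% discount from the examples above.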