##### <b> Pandas Series </b></br> - Series is equivalent to a column of data

In [1]:
import numpy as np
import pandas as pd

##### <b> Series are Pandas data structures built on top of NumPy arrays</b></br> - Series contain index and optional name with array of data </br> - Can be created from other data types, usually imported </br> - Two or more series grouped together form Pandas Dataframe

In [2]:
# create a NumPy array of integers
sales = np.arange(5)

# convert to Pandas Series
sales_series = pd.Series(sales, name="Sales") # name gives the column a header when joining or appending to other dataframes/series

sales_series

0    0
1    1
2    2
3    3
4    4
Name: Sales, dtype: int32
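
The statement that two or more Series grouped together form a DataFrame can be sketched as follows. The `items` Series and its values are made up for illustration; `pd.concat` with `axis=1` is one of several ways to combine Series into columns.

```python
import pandas as pd

# Two hypothetical Series sharing the default integer index
sales = pd.Series([0, 5, 155], name="Sales")
items = pd.Series(["coffee", "bananas", "tea"], name="Item")

# Concatenating along axis=1 places each Series in its own column,
# using each Series' name as the column header
df = pd.concat([items, sales], axis=1)
print(df)
```

This is where the optional `name` becomes useful: without it, the columns would be labelled 0 and 1.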

##### <b> Pandas Series have these key properties</b></br> - values: data array in the series </br> - index: index array in the series </br> - name: optional name for the series (useful for accessing columns)</br> - dtype: datatype of elements in the values array

In [3]:
# accessing series values is accessing a numpy array
sales_series.values

array([0, 1, 2, 3, 4])

In [4]:
# can view index
sales_series.index

RangeIndex(start=0, stop=5, step=1)

In [5]:
# can change the index values using range or manually
sales_series.index = pd.RangeIndex(10, 51, 10)
sales_series

10    0
20    1
30    2
40    3
50    4
Name: Sales, dtype: int32

In [6]:
# aggregations can be performed
sales_series.mean()

2.0

In [7]:
# name can be updated
sales_series.name = "updatedSales"
sales_series

10    0
20    1
30    2
40    3
50    4
Name: updatedSales, dtype: int32

| Data Type         | Library | Description                    | Bit Size         |
|-------------------|---------|--------------------------------|------------------|
| bool              | NumPy   | Boolean True/False             | 8                |
| int64             | NumPy   | Whole Numbers                  | 8, 16, 32, 64    |
| float64           | NumPy   | Decimal Numbers                | 16, 32, 64       |
| object            | NumPy   | Any Python Object              | N/A              |
| boolean           | Pandas  | Nullable Boolean True/False    | 8                |
| Int64             | Pandas  | Nullable Whole Numbers         | 8, 16, 32, 64    |
| Float64           | Pandas  | Nullable Decimal Numbers       | 32, 64           |
| string            | Pandas  | Text/String Data               | N/A              |
| category          | Pandas  | Maps categorical data to a numeric array for efficiency | N/A |
| datetime64        | Pandas  | A single moment in time (January 4, 2015, 2:00:00 PM)   | 64  |
| timedelta         | Pandas  | Duration between two dates or times                     | 64  |
| period            | Pandas  | A span of time                                          | N/A |


##### </br> <b> Object/Text Data Types</b></br>  object - Any Python Object </br> string - only contains strings or text </br> category - Maps categorical data to a numeric array for efficiency 

##### </br> <b> Time Series </b></br> datetime - a single moment in time (January 4, 2015, 2:00:00 PM) </br> timedelta  - The duration between two dates or times (10 days, 3 seconds, etc...) </br> period - a span of time (a day, a week, etc...)
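
To illustrate the difference between the NumPy-backed types and the Pandas nullable types listed above, here is a minimal sketch with made-up values. It assumes a pandas version that supports the nullable `"Int64"` dtype.

```python
import numpy as np
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces floats
numpy_backed = pd.Series([1, 2, np.nan])
print(numpy_backed.dtype)

# The pandas nullable dtype "Int64" (capital I) keeps integers and marks the gap as <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```

The capital letter matters: `"int64"` requests the NumPy type, while `"Int64"` requests the nullable Pandas type.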

##### <b> Type Conversion </b>

In [8]:
# using the method .astype("<Data Type>") you can convert values if they are compatible
print(sales_series)
print(sales_series.astype("bool"))
print(sales_series.astype("float"))
print(sales_series.astype("object"))
print(sales_series.astype("string"))

10    0
20    1
30    2
40    3
50    4
Name: updatedSales, dtype: int32
10    False
20     True
30     True
40     True
50     True
Name: updatedSales, dtype: bool
10    0.0
20    1.0
30    2.0
40    3.0
50    4.0
Name: updatedSales, dtype: float64
10    0
20    1
30    2
40    3
50    4
Name: updatedSales, dtype: object
10    0
20    1
30    2
40    3
50    4
Name: updatedSales, dtype: string


In [9]:
print(sales_series.astype("bool").mean())

0.8


In [10]:
# "datetime64" without a unit raises a ValueError; pass a unit such as "datetime64[ns]"
# print(sales_series.astype("datetime64"))
# ValueError: The 'datetime64' dtype has no unit. Please pass in 'datetime64[ns]' instead.
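
When some values are not compatible with the target type, `.astype()` fails for the whole Series. One possible workaround, sketched here with a made-up `raw` Series, is `pd.to_numeric` with `errors="coerce"`, which converts what it can and leaves NaN elsewhere.

```python
import pandas as pd

raw = pd.Series(["10", "20", "n/a"])  # hypothetical messy input

# raw.astype("int64") would raise a ValueError because "n/a" is not a number.
# pd.to_numeric with errors="coerce" converts what it can and uses NaN elsewhere.
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned)
```

Because a NaN is introduced, the result is a float Series rather than an integer one.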


### <b> Indexing </b></br>

##### <b> The Index </b></br> Index lets you easily access 'rows' in Pandas Series or Dataframe

In [None]:
# create a NumPy array of integers
sales = np.arange(5)

# convert to Pandas Series
sales_series = pd.Series(sales, name="Sales") # name gives the column a header when joining or appending to other dataframes/series

sales_series

0    0
1    1
2    2
3    3
4    4
Name: Sales, dtype: int32

In [None]:
# a Series can be accessed and sliced with [] like base Python sequences
sales_series[2]
print(sales_series[2:4])
# however, the .iloc[] and .loc[] accessors are the preferred way

2    2
3    3
Name: Sales, dtype: int32
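
One reason plain `[]` can be confusing is that integer index labels and integer positions are easy to mix up. The following sketch uses a hypothetical Series whose labels are 10, 20, 30 and relies only on the explicit `.loc[]` (label) and `.iloc[]` (position) accessors.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that are not 0, 1, 2, ...
s = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s.loc[10]      # label 10 -> first row
by_position = s.iloc[0]   # position 0 -> first row
print(by_label, by_position)

# Label slices include the stop label; positional slices exclude the stop position
print(s.loc[10:20])   # two rows
print(s.iloc[0:1])    # one row
```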


##### <b> Custom Indices </b></br> There are cases to use a custom index for accessing rows

In [None]:
# a list can be assigned to the index of a series as long as the value count matches
# for data analysis, default integer index is best
sales = [0, 5, 155, 0, 518]
items = ['coffee', 'bananas', 'tea', 'coconut', 'sugar']
# assigning items to the indexed_sales index column values
indexed_sales = pd.Series(sales, index=items, name='Indexed_Sales')
indexed_sales

coffee       0
bananas      5
tea        155
coconut      0
sugar      518
Name: Indexed_Sales, dtype: int64

In [None]:
# can call row value based on the index string
indexed_sales['tea']

155

In [None]:
# when slicing by index label the stop point is included; when slicing by integer position the stop point is not included
print()
print('using Index name does include stop point which is coconut')
print(indexed_sales['bananas':'coconut'])
print()
print('using Index Integer does not include stop point which is coconut')
print(indexed_sales[1:3])


using Index name does include stop point which is coconut
bananas      5
tea        155
coconut      0
Name: Indexed_Sales, dtype: int64

using Index Integer does not include stop point which is coconut
bananas      5
tea        155
Name: Indexed_Sales, dtype: int64


##### <b> .iloc[] Method </b></br> Preferred method to access values by positional index (using numeric) </br> - Method works even when Series have custom non-integer index </br> - more efficient </br> &nbsp;&nbsp; Series: seriesname.iloc[row position] </br> &nbsp;&nbsp; Dataframe: dataframename.iloc[row position, column position]

In [None]:
# can call row value using the iloc[] accessor 
print()
print('Single Index iloc[] Call')
print(indexed_sales.iloc[2])
print()
print('Index slice iloc[] Call')
print(indexed_sales.iloc[2:4])
print()
print('specific Indices iloc[] Call which requires nested list')
print(indexed_sales.iloc[[0, 2, 4]]) # require nested list to work
print()
print('Last Index iloc[] Call')
print(indexed_sales.iloc[-1]) 
print()
print('Reverse series Index iloc[] Call')
print(indexed_sales.iloc[::-1])


Single Index iloc[] Call
155

Index slice iloc[] Call
tea        155
coconut      0
Name: Indexed_Sales, dtype: int64

specific Indices iloc[] Call which requires nested list
coffee      0
tea       155
sugar     518
Name: Indexed_Sales, dtype: int64

Last Index iloc[] Call
518

Reverse series Index iloc[] Call
sugar      518
coconut      0
tea        155
bananas      5
coffee       0
Name: Indexed_Sales, dtype: int64


##### <b> .loc[] Method </b></br> Preferred method to access values by their custom labels </br> &nbsp;&nbsp; seriesname.loc[row label] </br> &nbsp;&nbsp; dataframename.loc[row label, column label] </br> - If row indices are numeric and default, loc[] method can use index number with column label

In [None]:
# can call row value using the loc[] accessor 
print()
print('Single Index loc[] Call')
print(indexed_sales.loc['coconut'])
print()
print('Index slice loc[] Call')
print(indexed_sales.loc['coffee':'coconut'])


Single Index loc[] Call
0

Index slice loc[] Call
coffee       0
bananas      5
tea        155
coconut      0
Name: Indexed_Sales, dtype: int64


##### <b>  Duplicate Index Values </b></br> Possible to have duplicate Index values in Pandas Series/Dataframe </br>- DO NOT SET DUPLICATE INDEX VALUES - </br> - accessing these indices by label with .loc[] returns all corresponding rows

In [None]:
# assigning list of values which includes a duplicate value. 
#####################################
# DO NOT SET DUPLICATE INDEX VALUES
#####################################
sales1 = [0, 5, 155, 0, 518]
items1 = ['coffee', 'coffee', 'tea', 'coconut', 'sugar']
# assigning items1 as the index of duplicate_index
duplicate_index = pd.Series(sales1, index=items1, name='Duplicate_Index')
duplicate_index

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Duplicate_Index, dtype: int64
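
A short sketch of what a duplicated label does in practice, using a smaller made-up Series. `Index.is_unique` and `Index.duplicated()` are shown as possible checks, not as the only way to detect the problem.

```python
import pandas as pd

dup = pd.Series([0, 5, 155], index=["coffee", "coffee", "tea"])

# A duplicated label returns a Series with every matching row, not a scalar
coffee_rows = dup.loc["coffee"]
print(coffee_rows)

# Checks that can flag the problem before it causes surprises
print(dup.index.is_unique)      # False when any label repeats
print(dup.index.duplicated())   # marks the second and later occurrences
```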

##### <b>  Resetting Index Values </b></br> Can reset index back to default range of integers using .reset_index() method </br> - by default, existing index will become new column in dataframe

In [None]:
# for a series, .reset_index() returns a dataframe by default
duplicate_index.reset_index()

Unnamed: 0,index,Duplicate_Index
0,coffee,0
1,coffee,5
2,tea,155
3,coconut,0
4,sugar,518


In [None]:
# including drop=True, the index will reset and not include the previous index 
duplicate_index.reset_index(drop=True)

0      0
1      5
2    155
3      0
4    518
Name: Duplicate_Index, dtype: int64

In [None]:
# reset the index, then select rows by the new integer labels with .loc[]
duplicate_index.reset_index(drop=True).loc[2:4]


2    155
3      0
4    518
Name: Duplicate_Index, dtype: int64

In [None]:
duplicate_index.reset_index(drop=True, inplace=True)
duplicate_index

0      0
1      5
2    155
3      0
4    518
Name: Duplicate_Index, dtype: int64

### <b> Sorting and Filtering Series </b></br>

##### <b> Filtering Series </b></br> - Filter a series by passing a logical test in the .loc[] accessor

In [None]:
# a list can be assigned to the index of a series as long as the value count matches
# for data analysis, default integer index is best
sales = [0, 5, 155, 0, 518]
items = ['coffee', 'coffee', 'tea', 'coconut', 'sugar']
# assigning items as the index of sales_series
sales_series = pd.Series(sales, index=items, name='Sales')
sales_series

coffee       0
coffee       5
tea        155
coconut      0
sugar      518
Name: Sales, dtype: int64

In [None]:
# filtering a series using .loc[] accessor
sales_series.loc[sales_series > 0]

coffee      5
tea       155
sugar     518
Name: Sales, dtype: int64

In [None]:
# create a mask for multiple logical conditions
mask = (sales_series > 0) & (sales_series.index == 'coffee') 
sales_series[mask]

coffee    5
Name: Sales, dtype: int64

### **Operators and Methods to Create Boolean Filters for Logical Tests**

| Description                 | Python Operator | Pandas Method |
|-----------------------------|-----------------|---------------|
| Equal                       | `==`            | `.eq()`       |
| Not Equal                   | `!=`            | `.ne()`       |
| Less Than or Equal          | `<=`            | `.le()`       |
| Less Than                   | `<`             | `.lt()`       |
| Greater Than or Equal       | `>=`            | `.ge()`       |
| Greater Than                | `>`             | `.gt()`       |
| Membership Test             | `in`            | `.isin()`     |
| Inverse Membership Test     | `not in`        | `~.isin()`    |

##### .isin() method syntax: df[`column_name_to_be_searched`].isin([`list of EXACT search strings`]); for partial matches use .str.contains(`string characters`)
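
The operator and method forms in the table can be compared directly. The sketch below reuses the sales values from this section; the search string `"coco"` is made up to show the difference between exact and partial matching.

```python
import pandas as pd

s = pd.Series([0, 5, 155, 0, 518],
              index=["coffee", "coffee", "tea", "coconut", "sugar"])

# Operator form and method form build the same boolean mask
assert (s > 0).equals(s.gt(0))

# Methods chain without extra parentheses; & still combines the masks
mask = s.gt(0) & s.le(155)
print(s[mask])

# isin() matches whole values exactly; .str.contains() matches substrings
names = pd.Series(s.index)
exact = names.isin(["coco"])          # no exact match
partial = names.str.contains("coco")  # matches "coconut"
print(exact.sum(), partial.sum())
```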

In [None]:
# Membership Test
sales_series.index.isin(['coffee', 'tea'])

array([ True,  True,  True, False, False])

In [None]:
# Inverse Membership Test: use the tilde in front of the series/dataframe
~sales_series.index.isin(['coffee', 'tea'])

array([False, False, False,  True,  True])

In [None]:
# logical membership test using .loc accessor and boolean mask
sales_series.loc[sales_series.isin([0,5])]

coffee     0
coffee     5
coconut    0
Name: Sales, dtype: int64

In [None]:
# logical inverse membership test using .loc accessor and boolean mask
sales_series.loc[~sales_series.isin([0,5])]

tea      155
sugar    518
Name: Sales, dtype: int64

In [None]:
sales_series.loc[sales_series >= 155]

tea      155
sugar    518
Name: Sales, dtype: int64

In [None]:
# to do inverse of sales_series.loc[sales_series >= 155] can use (less than or equal) or tilde on entire logical condition
sales_series.loc[~(sales_series >= 155)]

coffee     0
coffee     5
coconut    0
Name: Sales, dtype: int64

##### <b> Sorting Series </b></br> Sort can be done by values or their index in ascending order by default </br> - sort values: .sort_values() </br> - sort index: .sort_index()

In [None]:
# default value sort (ascending)
sales_series.sort_values()

coffee       0
coconut      0
coffee       5
tea        155
sugar      518
Name: Sales, dtype: int64

In [None]:
# descending value sort
sales_series.sort_values(ascending=False)

sugar      518
tea        155
coffee       5
coffee       0
coconut      0
Name: Sales, dtype: int64

In [None]:
# default index sort (ascending)
sales_series.sort_index()

coconut      0
coffee       0
coffee       5
sugar      518
tea        155
Name: Sales, dtype: int64

In [None]:
# descending index sort
sales_series.sort_index(ascending=False)

tea        155
sugar      518
coffee       0
coffee       5
coconut      0
Name: Sales, dtype: int64

##### <b> Operations and Aggregations </b></br> - Series is equivalent to a column of data

##### Basic Python Operations and Pandas Methods
| Description            | Python Operator | Pandas Method |
|------------------------|-----------------|---------------|
| Addition               | `+`             | `.add()`      |
| Subtraction            | `-`             | `.sub()`      |
| Multiplication         | `*`             | `.mul()`      |
| Division               | `/`             | `.div()`      |
| Floor Division         | `//`            | `.floordiv()` |
| Modulo                 | `%`             | `.mod()`      |
| Exponentiation         | `**`            | `.pow()`      |
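
As a quick check of the table above, the following sketch confirms that a few operators and their method twins agree. The divisor 10 and exponent 2 are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([0, 5, 155, 0, 518])

# Each operator has a method twin that returns the same result
assert (s // 10).equals(s.floordiv(10))
assert (s % 10).equals(s.mod(10))
assert (s ** 2).equals(s.pow(2))

print(s.floordiv(10).tolist())  # whole tens
print(s.mod(10).tolist())       # remainders
```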



In [None]:
# creation of Series
sales = [0, 5, 155, 0, 518]
sales_series = pd.Series(sales)
sales_series

0      0
1      5
2    155
3      0
4    518
dtype: int64

In [None]:
# these are the same
print("+ 2")
print(sales_series +2)
print(".add(2)")
print(sales_series.add(2))

+ 2
0      2
1      7
2    157
3      2
4    520
dtype: int64
.add(2)
0      2
1      7
2    157
3      2
4    520
dtype: int64


In [None]:
# prefix each value in the series with $
# convert to float for the decimal point, then to string so it can be concatenated with '$'
'$' + sales_series.astype('float').astype('string')

0      $0.0
1      $5.0
2    $155.0
3      $0.0
4    $518.0
dtype: string

In [None]:
# creating series with null
my_series = pd.Series([1, np.nan, 2, 3, 4], index = ['day 0', 'day 1','day 2','day 3','day 4'])
my_series

day 0    1.0
day 1    NaN
day 2    2.0
day 3    3.0
day 4    4.0
dtype: float64

In [None]:
# using python operations with Null does not change value of null
my_series + 2

day 0    3.0
day 1    NaN
day 2    4.0
day 3    5.0
day 4    6.0
dtype: float64

In [None]:
# pandas arithmetic methods accept a fill_value that is substituted for nulls before the operation
print('the fill_value=<value> argument imputes a value for nulls before the operation is applied')
my_series.add(2, fill_value=0)

the fill_value=<value> argument imputes a value for nulls before the operation is applied


day 0    3.0
day 1    2.0
day 2    4.0
day 3    5.0
day 4    6.0
dtype: float64

In [None]:
# creating copy of my_series with no nulls
my_series2 = my_series.add(2, fill_value=0).copy()
# adding 2 series together and null values stay as null
my_series + my_series2

day 0     4.0
day 1     NaN
day 2     6.0
day 3     8.0
day 4    10.0
dtype: float64

In [None]:
# adding 2 series together with fill_value=0 substitutes 0 for nulls before adding
my_series2.add(my_series, fill_value=0)

day 0     4.0
day 1     2.0
day 2     6.0
day 3     8.0
day 4    10.0
dtype: float64
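
Nulls also appear when two Series do not share the same index labels, because arithmetic aligns on labels. A minimal sketch with two made-up Series:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["day 0", "day 1", "day 2"])
b = pd.Series([10, 20], index=["day 1", "day 2"])

# Arithmetic aligns on index labels; "day 0" has no partner in b, so it becomes NaN
plain = a + b

# fill_value substitutes 0 for the missing side before adding
filled = a.add(b, fill_value=0)
print(plain)
print(filled)
```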

##### <b>String Methods</b> </br> The Pandas .str accessor gives access to many string methods, and each returns a series </br> (.split with expand=True returns a dataframe)
| String Method               | Description                        |
|-----------------------------|------------------------------------|
| .strip(), .lstrip(), .rstrip() | Removes leading and/or trailing whitespace |
| .upper(), .lower()          | Converts text to upper/lower case  |
| .slice(start,stop,step)     | Applies slice to strings in series |
| .count('string')            | Count all instances of given string|
| .contains('string')         | Return true if string is found, false if not |
| .replace('a','b')           | Replace instances of string a with string b      |
| .split('delimiter', expand=True) | Splits strings on given delimiter string, returns dataframe with series for each split |
| .len()                      | Return length of each string in a series |
| .startswith('string')       | Return true if found, false if not |
| .endswith('string')         | Return true if found, false if not |
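
A few of the methods from the table that are not demonstrated in the cells below, sketched on a small made-up Series:

```python
import pandas as pd

labels = pd.Series(["day 0", "day 1", "day 10"])

print(labels.str.len().tolist())                   # characters per string
print(labels.str.startswith("day").tolist())       # True where the prefix matches
print(labels.str.slice(0, 3).tolist())             # first three characters
print(labels.str.replace("day", "week").tolist())  # swap one substring for another
print(labels.str.count("0").tolist())              # occurrences of "0" per string
```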

In [None]:
# creation of string series
string_series = pd.Series(['day 0','day 1','day 2','day 3','day 4'])
string_series

0    day 0
1    day 1
2    day 2
3    day 3
4    day 4
dtype: object

In [None]:
# when searching a specific string, it's better to change all strings to str.lower() or str.upper() so it is easier to identify

# assigning uppercase series to new series
upper_series = string_series.str.upper()
upper_series

0    DAY 0
1    DAY 1
2    DAY 2
3    DAY 3
4    DAY 4
dtype: object

In [None]:
# search within series for 'DAY 1'
upper_series.str.contains('DAY 1')

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [None]:
# can be done as a mask
mask = upper_series.str.contains('DAY 1')
# displays series row that has values DAY 1
upper_series[mask]

1    DAY 1
dtype: object

In [None]:
# can be done as a mask1
mask1 = upper_series.str.contains('DAY')
# displays all series because 'DAY' is in each series value
upper_series[mask1]

0    DAY 0
1    DAY 1
2    DAY 2
3    DAY 3
4    DAY 4
dtype: object

In [None]:
# strip the characters 'd', 'a', 'y' from both ends of each string, leaving ' 0', ' 1', ...
stripped = string_series.str.strip('day')
# strip the remaining leading whitespace (no argument strips whitespace)
stripped = stripped.str.strip()
stripped

0    0
1    1
2    2
3    3
4    4
dtype: object

In [None]:
# .str.split() splits on a delimiter; with expand=True each piece becomes its own column in a dataframe
string_dfsplit = string_series.str.split(' ', expand=True)
string_dfsplit

Unnamed: 0,0,1
0,day,0
1,day,1
2,day,2
3,day,3
4,day,4


In [None]:
# by default (expand=False), each string is split into a list on the delimiter
string_listsplit = string_series.str.split(' ')
string_listsplit

0    [day, 0]
1    [day, 1]
2    [day, 2]
3    [day, 3]
4    [day, 4]
dtype: object

##### Pandas Numerical Aggregation Functions

| Method                            | Description                                             |
|-----------------------------------|---------------------------------------------------------|
| `.count()`                        | Returns the number of items in the series or data frame column(s). |
| `.sum()`                          | Computes the sum of the series or data frame column(s). |
| `.prod()`                         | Computes the product of the series or data frame column(s). |
| `.first()`,`.last()`              | Returns the first or last value in each group (GroupBy aggregations). |
| `.min()`,`.max()`                 | Returns the minimum or maximum value in the series or data frame column(s). |
| `.argmin()`,`.argmax()`           | Returns the integer position of the smallest or largest value of the series. |
| `.mean()`,`.median()`             | Calculates the mean or median of the series or data frame column(s). |
| `.mad()`                          | Computes the mean absolute deviation of the series or data frame column(s) (removed in pandas 2.0). |
| `.std()`,`.var()`                 | Calculates the standard deviation or variance of the series or data frame column(s). |
| `.quantile(q)`                    | Returns the quantile of the series or data frame column(s); `q` should be between 0 and 1. |
| `.describe()`                     | Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution. |


**Note**: These functions can be applied to a Pandas Series or DataFrame. For DataFrames, these functions by default operate on each column, returning a Series of aggregated values.
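
Several of these aggregations on a small made-up Series. `.idxmax()` does not appear in the table; it is included here only as the label-based counterpart of `.argmax()`.

```python
import pandas as pd

s = pd.Series([5, 155, 0, 518], index=["bananas", "tea", "coconut", "sugar"])

print(s.count(), s.sum(), s.min(), s.max())
print(s.argmax())   # integer position of the largest value
print(s.idxmax())   # index label of the largest value
print(s.describe())
```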


In [None]:
# import data from transactions csv
transactions = pd.read_csv('Pandas Course Resources/retail/transactions.csv')
transactions

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922
...,...,...,...
83483,2017-08-15,50,2804
83484,2017-08-15,51,1573
83485,2017-08-15,52,2255
83486,2017-08-15,53,932


In [None]:
# create series from transactions column with name
transaction_series = pd.Series(transactions['transactions'], name='Transactions')
transaction_series

0         770
1        2111
2        2358
3        3487
4        1922
         ... 
83483    2804
83484    1573
83485    2255
83486     932
83487     802
Name: Transactions, Length: 83488, dtype: int64

In [None]:
# .count() method
transaction_series.count()

83488

In [None]:
# .sum() method
transaction_series.sum()

141478945

In [None]:
# .quantile(q) method which can take a list of quantile percent values
transaction_series.quantile([.25, .75, .90])

0.25    1046.0
0.75    2079.0
0.90    3071.0
Name: Transactions, dtype: float64

In [None]:
# with small datasets, .quantile(q, interpolation='nearest') returns the actual data point nearest to each quantile instead of an interpolated value
transaction_series.quantile([.25, .75, .90], interpolation='nearest') # makes no difference here because the dataset is large

0.25    1046
0.75    2079
0.90    3071
Name: Transactions, dtype: int64
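
On a small Series the choice of interpolation is visible. A sketch with five made-up values and an arbitrary quantile of 0.4:

```python
import pandas as pd

small = pd.Series([10, 20, 30, 40, 50])

# Default (linear) interpolation lands between 20 and 30
linear = small.quantile(0.4)

# "nearest" picks an actual data point instead
nearest = small.quantile(0.4, interpolation="nearest")
print(linear, nearest)
```

With the default setting the result is roughly 26, a value that does not occur in the data, whereas `"nearest"` returns 30.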

##### Pandas Categorical Aggregation Functions

| Method                            | Description                                             |
|-----------------------------------|---------------------------------------------------------|
| `.unique()`                       | Returns an array of unique items in the series or data frame column(s). |
| `.nunique()`                      | Returns the number of unique items in the series or data frame column(s). |
| `.value_counts()`                 | Returns a Series of unique items and their frequencies in the series or data frame column(s). |

In [None]:
# Create series of categorical data
items = pd.Series(['coffee', 'coffee', 'tea', 'coconut', 'sugar'])
items

0     coffee
1     coffee
2        tea
3    coconut
4      sugar
dtype: object

In [None]:
# count the frequency of the categories
items.value_counts() 

coffee     2
tea        1
coconut    1
sugar      1
Name: count, dtype: int64

In [None]:
# count the frequency of the categories; normalize=True displays each category's proportion of the total
items.value_counts(normalize=True)

coffee     0.4
tea        0.2
coconut    0.2
sugar      0.2
Name: proportion, dtype: float64

In [None]:
# .unique() method will display each unique value as array
items.unique()

array(['coffee', 'tea', 'coconut', 'sugar'], dtype=object)

In [None]:
# .nunique() method will display the count of unique categories
items.nunique()

4
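
The `category` dtype from the data type table fits this kind of data. The sketch below shows, under the assumption that the labels are plain strings, how converting the same items stores each unique label once and an integer code per row.

```python
import pandas as pd

items = pd.Series(["coffee", "coffee", "tea", "coconut", "sugar"])

# Converting to "category" stores each unique label once plus an integer code per row
as_cat = items.astype("category")
print(as_cat.cat.categories.tolist())
print(as_cat.cat.codes.tolist())

# The categorical aggregations still work on the converted Series
print(as_cat.nunique())
```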

##### <b> Missing Data </b></br> - Represented as NaN and treated as float so it can be used in vectorized operations

In [None]:
# depending on the analysis being performed, you may want NaN values filled so that operations produce a result other than NaN

In [None]:
# Creation of a series that has a NaN value
# if the series is numerical, introducing a NaN will make the datatype float.
sales = pd.Series([0, 5, 155, np.nan, 518])
sales

0      0.0
1      5.0
2    155.0
3      NaN
4    518.0
dtype: float64

In [None]:
# performing operations on a series with a NaN does not affect the NaN. It will stay as NaN
sales + 2

0      2.0
1      7.0
2    157.0
3      NaN
4    520.0
dtype: float64

In [None]:
# to avoid NaN in the result, pass fill_value= to the pandas method to substitute a value for NaNs before the operation
sales_nonull = sales.add(2, fill_value=0)
# this changes NaN to 0 and allows 2 to be added to it and can be assigned to a new series
sales_nonull

0      2.0
1      7.0
2    157.0
3      2.0
4    520.0
dtype: float64

##### <b> Identify Missing Data </b></br> The .isna() and .value_counts() methods can identify missing data </br> - value_counts by default does not include na/NaN (use dropna=False to include them)

In [None]:
# creation of Series with 3 NaNs
checklist = pd.Series(['COMPLETE', np.nan, np.nan, np.nan, 'COMPLETE'])
checklist

0    COMPLETE
1         NaN
2         NaN
3         NaN
4    COMPLETE
dtype: object

In [None]:
# .isna() will return True (Value 1) if na/NaN. 
checklist.isna()

# This allows .sum() or .mean() operations to be done on the na/NaN values 
checklist.isna().sum()

3

In [None]:
# this can also be used as a boolean mask
mask = checklist.isna()
bool_checklist = checklist[mask]
bool_checklist

1    NaN
2    NaN
3    NaN
dtype: object

In [None]:
# .value_counts() method by default does not include na/NaN
checklist.value_counts()

COMPLETE    2
Name: count, dtype: int64

In [None]:
# to include na/NaN use dropna=False
checklist.value_counts(dropna=False)

NaN         3
COMPLETE    2
Name: count, dtype: int64
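
Because `True` counts as 1, the same mask can give both the number and the share of missing rows. A minimal sketch on the same checklist values; `.notna()` is shown as the complement of `.isna()`.

```python
import numpy as np
import pandas as pd

checklist = pd.Series(["COMPLETE", np.nan, np.nan, np.nan, "COMPLETE"])

missing_count = checklist.isna().sum()   # True counts as 1
missing_share = checklist.isna().mean()  # fraction of rows that are missing
print(missing_count, missing_share)

# notna() is the complement and can be used to keep only populated rows
print(checklist[checklist.notna()])
```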

##### .dropna() removes NaN values from series or dataframe </br> when dropping rows, always use .reset_index() to fix index sequence

In [None]:
# drops na but only stays implemented if assigned to new series/dataframe 
checklist.dropna()

0    COMPLETE
4    COMPLETE
dtype: object

In [None]:
# drops na if assigned to new series/dataframe with .reset_index(drop=True) which drops previous index
checklist.dropna().reset_index(drop=True)

0    COMPLETE
1    COMPLETE
dtype: object

In [None]:
# fills na but only stays implemented if assigned to new series/dataframe 
# fills na with a specific value. For categorical data, values such as "missing" or "incomplete" can be used, depending on the dataset
checklist.fillna("INCOMPLETE")

0      COMPLETE
1    INCOMPLETE
2    INCOMPLETE
3    INCOMPLETE
4      COMPLETE
dtype: object

##### be thoughtful and deliberate with how to handle missing data

In [None]:
# create index labels
items = ['coffee', 'coffee', 'tea', 'coconut', 'sugar']
# assign index labels
sales.index = items
sales

coffee       0.0
coffee       5.0
tea        155.0
coconut      NaN
sugar      518.0
dtype: float64

##### Understanding why the data is missing can help decide what steps to take

In [None]:
# do you remove them?
sales.dropna()

coffee      0.0
coffee      5.0
tea       155.0
sugar     518.0
dtype: float64

In [None]:
# do you fill them with zero?
sales.fillna(0)

coffee       0.0
coffee       5.0
tea        155.0
coconut      0.0
sugar      518.0
dtype: float64

In [None]:
# do you impute them with the mean? -> Common in Machine Learning because it preserves the mean, though it reduces the variance
sales.fillna(sales.mean()) # Coconut NaN becomes 169.5

coffee       0.0
coffee       5.0
tea        155.0
coconut    169.5
sugar      518.0
dtype: float64
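
The effect of mean imputation on summary statistics can be checked directly. This sketch reuses the sales values from above and compares the mean and standard deviation before and after filling.

```python
import numpy as np
import pandas as pd

sales = pd.Series([0, 5, 155, np.nan, 518])
imputed = sales.fillna(sales.mean())

# The mean is unchanged, because the filled value equals the mean ...
print(sales.mean(), imputed.mean())

# ... but the spread shrinks, since the new point sits exactly at the centre
print(sales.std(), imputed.std())
```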

### **Apply Custom Functions**

##### <b> Apply Method </b></br> .apply() lets you apply custom functions to Pandas Series/Dataframes </br> - not as efficient as native functions because not vectorized

In [None]:
# creating a function and applying it to a series is less efficient than using pandas built-in methods
def discount(price):
    if price > 20:
        return round(price * .9, 2)
    return price

In [None]:
# generate price series
clean_wholesale = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])
clean_wholesale

0     3.99
1     5.99
2    22.99
3     7.99
4    33.99
dtype: float64

In [None]:
# use .apply() method to apply custom function
clean_wholesale.apply(discount)
# discount applied to index 2 and 4

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64

In [None]:
# .apply() is not ideal; for a one-off, a lambda function can be passed to .apply()
clean_wholesale.apply(lambda x: round(x * .9, 2) if x > 20 else x)

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64
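
For comparison, the same discount rule can be written without `.apply()`. This is one possible vectorized sketch using a boolean mask with `.loc[]`; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])

# Row-by-row version, mirroring the discount rule above
applied = prices.apply(lambda x: round(x * 0.9, 2) if x > 20 else x)

# Vectorized version: compute the discount only for rows that pass the test
discounted = prices.copy()
over_20 = discounted > 20
discounted.loc[over_20] = (discounted.loc[over_20] * 0.9).round(2)

print(np.allclose(applied, discounted))
```

Working on a copy leaves the original prices untouched.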

##### <b> Pandas Where Method </b></br> .where() method returns series/dataframe values based on logical condition </br> &nbsp;df.where(logical test, </br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; values to return if False, </br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inplace=False)

In [None]:
clean_wholesale

0     3.99
1     5.99
2    22.99
3     7.99
4    33.99
dtype: float64

In [None]:
# pandas .where keeps the original value where the logical test is True and substitutes the second argument where it is False
clean_wholesale.where(clean_wholesale <= 20, round(clean_wholesale * .9, 2))

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64

In [None]:
# can utilize the tilde to invert the boolean values and turn this into a value if true expression i.e. (~(clean_wholesale > 20)
clean_wholesale.where(~(clean_wholesale > 20), round(clean_wholesale * .9, 2))

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64
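
As an alternative to inverting the condition with a tilde, pandas also provides `Series.mask`, which replaces values where the condition is True. A brief sketch comparing the two forms on the same prices:

```python
import pandas as pd

prices = pd.Series([3.99, 5.99, 22.99, 7.99, 33.99])
discount = (prices * 0.9).round(2)

# where: keep values where the condition is True, replace the rest
via_where = prices.where(~(prices > 20), discount)

# mask: replace values where the condition is True, keep the rest
via_mask = prices.mask(prices > 20, discount)

print(via_where.equals(via_mask))
```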

In [None]:
# NumPy's where function is often more convenient than the pandas method; however, the output is a NumPy array, so convert it back into a series
pd.Series(np.where(clean_wholesale <= 20, clean_wholesale, round(clean_wholesale* .9, 2)))

0     3.99
1     5.99
2    20.69
3     7.99
4    30.59
dtype: float64
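
One detail to watch with `np.where` is that the returned array carries no index. With a custom index, the labels can be restored by passing the original index back in, as in this sketch with made-up labels:

```python
import numpy as np
import pandas as pd

prices = pd.Series([3.99, 22.99, 33.99], index=["tea", "coffee", "sugar"])

# np.where returns a plain array, so the labels are lost ...
raw = np.where(prices <= 20, prices, (prices * 0.9).round(2))

# ... unless the original index (and name, if any) is passed back in
restored = pd.Series(raw, index=prices.index, name=prices.name)
print(restored)
```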