##### <b>Operations and Aggregations</b><br>A Series is equivalent to a single column of data.

In [1]:
import numpy as np
import pandas as pd

##### Basic Python Operations and Pandas Methods
| Description            | Python Operator | Pandas Method |
|------------------------|-----------------|---------------|
| Addition               | `+`             | `.add()`      |
| Subtraction            | `-`             | `.sub()`      |
| Multiplication         | `*`             | `.mul()`      |
| Division               | `/`             | `.div()`      |
| Floor Division         | `//`            | `.floordiv()` |
| Modulo                 | `%`             | `.mod()`      |
| Exponentiation         | `**`            | `.pow()`      |



In [2]:
# creation of Series
sales = [0, 5, 155, 0, 518]
sales_series = pd.Series(sales)
sales_series

0      0
1      5
2    155
3      0
4    518
dtype: int64

In [3]:
# the + operator and the .add() method produce the same result
print("+ 2")
print(sales_series + 2)
print(".add(2)")
print(sales_series.add(2))

+ 2
0      2
1      7
2    157
3      2
4    520
dtype: int64
.add(2)
0      2
1      7
2    157
3      2
4    520
dtype: int64
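
The other rows of the operator table behave the same way. As a minimal sketch, the block below compares each operator with its method on a small made-up series (`s` and its values are illustrative, not from the notebook):

```python
import pandas as pd

# hypothetical example values, not taken from the notebook
s = pd.Series([7, 10, 15])

# each tuple holds (result from the operator, result from the method)
pairs = {
    "sub": (s - 2, s.sub(2)),
    "mul": (s * 2, s.mul(2)),
    "div": (s / 2, s.div(2)),
    "floordiv": (s // 2, s.floordiv(2)),
    "mod": (s % 2, s.mod(2)),
    "pow": (s ** 2, s.pow(2)),
}

# .equals() checks that both results hold the same values and dtype
for name, (with_operator, with_method) in pairs.items():
    print(name, with_operator.equals(with_method))
```

The methods are mainly worth reaching for when an extra argument such as `fill_value` is needed; otherwise the operators are shorter.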


In [4]:
# prefix each value with '$' using string concatenation
# cast to float to show a decimal, then to string so the values can be concatenated with '$'
'$' + sales_series.astype('float').astype('string')

0      $0.0
1      $5.0
2    $155.0
3      $0.0
4    $518.0
dtype: string
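
Concatenation keeps whatever decimals the float happens to print. If a fixed currency format is wanted, one possible alternative (a sketch, not the notebook's approach) is to map a Python format string over the values:

```python
import pandas as pd

# same values as sales_series in the notebook
sales_series = pd.Series([0, 5, 155, 0, 518])

# "${:,.2f}" adds a dollar sign, a thousands separator, and two decimals
formatted = sales_series.astype(float).map("${:,.2f}".format)
print(formatted)
```

The result is a Series of Python strings, so it is suited to display rather than further arithmetic.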

In [5]:
# creating a Series with a missing value (np.nan; the np.NAN alias was removed in NumPy 2.0)
my_series = pd.Series([1, np.nan, 2, 3, 4], index=['day 0', 'day 1', 'day 2', 'day 3', 'day 4'])
my_series

day 0    1.0
day 1    NaN
day 2    2.0
day 3    3.0
day 4    4.0
dtype: float64

In [6]:
# arithmetic with Python operators leaves missing values as NaN
my_series + 2

day 0    3.0
day 1    NaN
day 2    4.0
day 3    5.0
day 4    6.0
dtype: float64

In [7]:
# pandas arithmetic methods accept a fill_value argument that replaces NaN before the operation
print('fill_value=<value> substitutes <value> for NaN before the operation is applied')
my_series.add(2, fill_value=0)

fill_value=<value> substitutes <value> for NaN before the operation is applied


day 0    3.0
day 1    2.0
day 2    4.0
day 3    5.0
day 4    6.0
dtype: float64

In [8]:
# my_series2 is my_series + 2 with NaN treated as 0, so it has no nulls (.add() already returns a new Series)
my_series2 = my_series.add(2, fill_value=0)
# adding two Series with + leaves NaN wherever either value is missing
my_series + my_series2

day 0     4.0
day 1     NaN
day 2     6.0
day 3     8.0
day 4    10.0
dtype: float64

In [9]:
# adding two Series with .add(..., fill_value=0) treats a NaN on one side as 0 before adding
my_series2.add(my_series, fill_value=0)

day 0     4.0
day 1     2.0
day 2     6.0
day 3     8.0
day 4    10.0
dtype: float64
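
One detail worth knowing: `fill_value` only applies when exactly one of the two values is missing. If both Series are missing a value at the same label, the result stays NaN. A small sketch with made-up Series (`a` and `b` are illustrative names):

```python
import numpy as np
import pandas as pd

# hypothetical Series sharing the same index labels
a = pd.Series([1.0, np.nan, np.nan], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, np.nan], index=["x", "y", "z"])

result = a.add(b, fill_value=0)
# x: both present, 1 + 10
# y: missing only in a, so a is treated as 0 and the result is 20
# z: missing in both, so the result is still NaN
print(result)
```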

##### <b>String Methods</b><br>The pandas .str accessor exposes many string methods. Most return a Series;<br>.str.split(..., expand=True) returns a DataFrame with one column per split.
| String Method                          | Description                                                        |
|----------------------------------------|--------------------------------------------------------------------|
| `.strip()`, `.lstrip()`, `.rstrip()`   | Removes leading and/or trailing whitespace                         |
| `.upper()`, `.lower()`                 | Converts text to upper/lower case                                  |
| `.slice(start, stop, step)`            | Applies a slice to each string in the series                       |
| `.count('string')`                     | Counts all instances of the given string                           |
| `.contains('string')`                  | Returns True if the string is found, False if not                  |
| `.replace('a', 'b')`                   | Replaces instances of string a with string b                       |
| `.split('delimiter', expand=True)`     | Splits strings on the given delimiter and returns a DataFrame with one column per split |
| `.len()`                               | Returns the length of each string in the series                    |
| `.startswith('string')`                | Returns True if the string starts with the given text, False if not |
| `.endswith('string')`                  | Returns True if the string ends with the given text, False if not  |

In [10]:
# creation of string series
string_series = pd.Series(['day 0','day 1','day 2','day 3','day 4'])
string_series

0    day 0
1    day 1
2    day 2
3    day 3
4    day 4
dtype: object

In [11]:
# before searching for a specific string, normalize case with .str.lower() or .str.upper() so matches do not depend on capitalization

# assigning uppercase series to new series
upper_series = string_series.str.upper()
upper_series

0    DAY 0
1    DAY 1
2    DAY 2
3    DAY 3
4    DAY 4
dtype: object

In [12]:
# search within series for 'DAY 1'
upper_series.str.contains('DAY 1')

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [13]:
# the boolean result can be stored as a mask
mask = upper_series.str.contains('DAY 1')
# select only the rows whose value contains 'DAY 1'
upper_series[mask]

1    DAY 1
dtype: object

In [14]:
# a second mask, this time matching 'DAY'
mask1 = upper_series.str.contains('DAY')
# every row is returned because 'DAY' appears in every value
upper_series[mask1]

0    DAY 0
1    DAY 1
2    DAY 2
3    DAY 3
4    DAY 4
dtype: object
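
`.str.contains()` also has options that can replace the manual upper-casing step. The sketch below uses a made-up series (`labels` is illustrative) to show `case=False`, `na=False`, and `regex=False`:

```python
import pandas as pd

# hypothetical labels with mixed case, a missing value, and regex characters
labels = pd.Series(["Day 1", "DAY 10", None, "day (1)"])

# case=False ignores capitalization; na=False treats missing values as "no match"
mask = labels.str.contains("day 1", case=False, na=False)

# regex=False matches the text literally, so "(1)" is not read as a regex group
literal = labels.str.contains("(1)", regex=False, na=False)

print(labels[mask])
print(labels[literal])
```

Note that "DAY 10" also matches "day 1", because `.str.contains()` looks for the text anywhere in the value.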

In [15]:
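# .str.strip('day') removes any leading/trailing 'd', 'a', or 'y' characters (a set of characters, not a prefix)
stripped = string_series.str.strip('day')
# the values still start with a space (' 0', ' 1', ...); .str.strip('') would remove nothing,
# so call .str.strip() with no argument to remove the whitespace
stripped = stripped.str.strip()
stripped

0    0
1    1
2    2
3    3
4    4
dtype: object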
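
Because `.str.strip('day')` treats its argument as a set of characters rather than a word, it can remove more than intended. The following sketch, on made-up values, contrasts it with `.str.removeprefix()` (available in pandas 1.4 and later), which removes only an exact leading string:

```python
import pandas as pd

# hypothetical values chosen to show the difference
labels = pd.Series(["day 0", "day yard", "daddy 2"])

# removes any of the characters d, a, y from both ends of each string
stripped_chars = labels.str.strip("day")
print(stripped_chars.tolist())

# removes only the exact leading text "day "
without_prefix = labels.str.removeprefix("day ")
print(without_prefix.tolist())
```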

In [16]:
# .str.split(<delimiter>) splits each string on the given characters or spaces for further analysis
# with expand=True, each piece goes into its own column and the result is a DataFrame
string_dfsplit = string_series.str.split(' ', expand=True)
string_dfsplit

|   | 0   | 1 |
|---|-----|---|
| 0 | day | 0 |
| 1 | day | 1 |
| 2 | day | 2 |
| 3 | day | 3 |
| 4 | day | 4 |


In [17]:
# with the default (expand=False), each value becomes a list of the pieces between delimiters
string_listsplit = string_series.str.split(' ')
string_listsplit

0    [day, 0]
1    [day, 1]
2    [day, 2]
3    [day, 3]
4    [day, 4]
dtype: object
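
Either form of the split usually needs a follow-up step before analysis. As a rough sketch (the column names `unit` and `number` are made up for illustration), the expanded DataFrame can be given column names and a numeric type, and the list form can be indexed with `.str.get()`:

```python
import pandas as pd

# shortened version of string_series from the notebook
string_series = pd.Series(['day 0', 'day 1', 'day 2'])

# expand=True: one column per piece; the column names here are hypothetical
parts = string_series.str.split(' ', expand=True)
parts.columns = ['unit', 'number']
parts['number'] = parts['number'].astype(int)
print(parts)

# expand=False (default): pick one element from each list with .str.get()
numbers = string_series.str.split(' ').str.get(1)
print(numbers)
```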

##### Pandas Numerical Aggregation Functions

| Method                            | Description                                             |
|-----------------------------------|---------------------------------------------------------|
| `.count()`                        | Returns the number of non-null items in the series or data frame column(s). |
| `.sum()`                          | Computes the sum of the series or data frame column(s). |
| `.prod()`                         | Computes the product of the series or data frame column(s). |
| `.first()`,`.last()`              | Returns the first or last non-null value in each group when used after `.groupby()`; on a plain series, use `.iloc[0]` or `.iloc[-1]` instead. |
| `.min()`,`.max()`                 | Returns the minimum or maximum value in the series or data frame column(s). |
| `.argmin()`,`.argmax()`           | Returns the integer position of the smallest or largest value of a series; `.idxmin()` and `.idxmax()` return the index label instead and also work on data frames. |
| `.mean()`,`.median()`             | Calculates the mean or median of the series or data frame column(s). |
| `.mad()`                          | Computes the mean absolute deviation of the series or data frame column(s); deprecated in pandas 1.5 and removed in pandas 2.0. |
| `.std()`,`.var()`                 | Calculates the standard deviation or variance of the series or data frame column(s). |
| `.quantile(q)`                    | Returns the quantile of the series or data frame column(s); `q` should be between 0 and 1. |
| `.describe()`                     | Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution. |


**Note**: These functions can be applied to a Pandas Series or DataFrame. For DataFrames, these functions by default operate on each column, returning a Series of aggregated values.
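
To illustrate the note above, here is a small sketch with a made-up DataFrame (`df`, `store_a`, and `store_b` are hypothetical names) showing the default column-wise behavior, the row-wise alternative with `axis=1`, and several aggregations at once with `.agg()`:

```python
import pandas as pd

# hypothetical data, not from transactions.csv
df = pd.DataFrame({"store_a": [10, 20, 30], "store_b": [1, 2, 3]})

# default: aggregate each column, returning a Series indexed by column name
col_totals = df.sum()
print(col_totals)

# axis=1: aggregate across the columns of each row instead
row_totals = df.sum(axis=1)
print(row_totals)

# .agg() with a list applies several aggregations and returns a DataFrame
summary = df.agg(["min", "max", "mean"])
print(summary)
```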


In [26]:
# import data from transactions csv
transactions = pd.read_csv('Pandas Course Resources/retail/transactions.csv')
transactions

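|       | date       | store_nbr | transactions |
|-------|------------|-----------|--------------|
| 0     | 2013-01-01 | 25        | 770          |
| 1     | 2013-01-02 | 1         | 2111         |
| 2     | 2013-01-02 | 2         | 2358         |
| 3     | 2013-01-02 | 3         | 3487         |
| 4     | 2013-01-02 | 4         | 1922         |
| ...   | ...        | ...       | ...          |
| 83483 | 2017-08-15 | 50        | 2804         |
| 83484 | 2017-08-15 | 51        | 1573         |
| 83485 | 2017-08-15 | 52        | 2255         |
| 83486 | 2017-08-15 | 53        | 932          |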


In [28]:
# create series from transactions column with name
transaction_series = pd.Series(transactions['transactions'], name='Transactions')
transaction_series

0         770
1        2111
2        2358
3        3487
4        1922
         ... 
83483    2804
83484    1573
83485    2255
83486     932
83487     802
Name: Transactions, Length: 83488, dtype: int64

In [29]:
# .count() method
transaction_series.count()

83488

In [30]:
# .sum() method
transaction_series.sum()

141478945
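
Two rows of the aggregation table are easy to trip over: `.argmax()` returns an integer position while `.idxmax()` returns the index label, and `.mad()` is no longer available in pandas 2.x. The sketch below uses a made-up labelled series (`s` and its labels are illustrative) and computes the mean absolute deviation by hand:

```python
import pandas as pd

# hypothetical values with string labels so position and label differ
s = pd.Series([770, 2111, 3487, 1922], index=["d1", "d2", "d3", "d4"])

position = s.argmax()   # integer position of the largest value
label = s.idxmax()      # index label of the largest value
print(position, label)

# mean absolute deviation: average distance of each value from the mean
mean_abs_dev = (s - s.mean()).abs().mean()
print(mean_abs_dev)
```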

In [33]:
# .quantile(q) accepts a single quantile or a list of quantiles, each between 0 and 1
transaction_series.quantile([.25, .75, .90])

0.25    1046.0
0.75    2079.0
0.90    3071.0
Name: Transactions, dtype: float64

In [36]:
# with small datasets, .quantile(q, interpolation='nearest') may be preferable: it returns the actual data point nearest to the quantile instead of interpolating between points
transaction_series.quantile([.25, .75, .90], interpolation='nearest') # makes no difference here because the dataset is large

0.25    1046
0.75    2079
0.90    3071
Name: Transactions, dtype: int64
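
The effect of `interpolation` is easier to see on a handful of values. A minimal sketch with a made-up five-element series (`small` is an illustrative name):

```python
import pandas as pd

# hypothetical small dataset
small = pd.Series([10, 20, 30, 40, 50])

# the 0.9 quantile falls between 40 and 50 (60% of the way from 40 to 50)
linear = small.quantile(0.9)                            # default: interpolates between the two
nearest = small.quantile(0.9, interpolation='nearest')  # closest actual data point
lower = small.quantile(0.9, interpolation='lower')      # the data point just below
higher = small.quantile(0.9, interpolation='higher')    # the data point just above

print(linear, nearest, lower, higher)
```

With the default, the result is a value that does not appear in the data; the other options always return one of the original data points.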

##### Pandas Categorical Aggregation Functions

| Method                            | Description                                             |
|-----------------------------------|---------------------------------------------------------|
| `.unique()`                       | Returns an array of unique items in the series or data frame column(s). |
| `.nunique()`                      | Returns the number of unique items in the series or data frame column(s). |
| `.value_counts()`                 | Returns a Series of the unique items in the series or data frame column(s) and how often each occurs. |

In [37]:
# Create series of categorical data
items = pd.Series(['coffee', 'coffee', 'tea', 'coconut', 'sugar'])
items

0     coffee
1     coffee
2        tea
3    coconut
4      sugar
dtype: object

In [38]:
# count the frequency of the categories
items.value_counts() 

coffee     2
tea        1
coconut    1
sugar      1
Name: count, dtype: int64

In [39]:
# normalize=True returns each category's share of the total as a proportion between 0 and 1
items.value_counts(normalize=True)

coffee     0.4
tea        0.2
coconut    0.2
sugar      0.2
Name: proportion, dtype: float64

In [40]:
# .unique() method will display each unique value as array
items.unique()

array(['coffee', 'tea', 'coconut', 'sugar'], dtype=object)

In [41]:
# .nunique() method will display the count of unique categories
items.nunique()

4
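
These three methods treat missing values differently, which matters once the data has gaps. A short sketch with a made-up series containing one missing value (`items_with_gap` is an illustrative name):

```python
import pandas as pd

# hypothetical categorical data with one missing entry
items_with_gap = pd.Series(['coffee', 'coffee', 'tea', None])

# value_counts() and nunique() skip missing values by default
counts_default = items_with_gap.value_counts()
n_default = items_with_gap.nunique()

# dropna=False includes the missing value as its own category
counts_with_na = items_with_gap.value_counts(dropna=False)
n_with_na = items_with_gap.nunique(dropna=False)

# unique() always includes the missing value
unique_values = items_with_gap.unique()

print(counts_default.sum(), counts_with_na.sum())
print(n_default, n_with_na, len(unique_values))
```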