##### <b> DataFrames </b></br> The pandas equivalent of an Excel spreadsheet or SQL table </br> Each column in a DataFrame is a pandas Series </br> To improve code readability, give DataFrame variables a suffix such as `_df` or `_data` that signals what they hold

In [1]:
import pandas as pd
import numpy as np

##### <b> Axis Values </b></br> - axis=0: the row axis (the index) </br> `df.sum(axis=0)` adds values down the rows and returns one total per column </br> - axis=1: the column axis </br> `df.sum(axis=1)` adds values across the columns and returns one total per row
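The axis argument is easy to mix up, so a small sketch may help. The frame and the names `scores_df`, `column_totals` and `row_totals` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: two rows, two numeric columns
scores_df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 collapses the rows: one total per column
column_totals = scores_df.sum(axis=0)   # a -> 3, b -> 30

# axis=1 collapses the columns: one total per row
row_totals = scores_df.sum(axis=1)      # row 0 -> 11, row 1 -> 22

print(column_totals)
print(row_totals)
```

A way to remember it: the axis you pass is the one that disappears from the result.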

##### DataFrame Properties
| Properties| Description |
|----------|--------------|
| `shape`  | Number of rows and columns in a DataFrame (`the index is not counted as a column`)|
| `index`  | The row index in a DataFrame; by default it is a range of integers (`axis=0`)|
| `columns`| The column index in a DataFrame, represented by the Series names (`axis=1`)|
| `axes`   | The row and column indices in a DataFrame|
| `dtypes` | The data type for each Series in a DataFrame (`which can be different`)|



In [2]:
# load data from retail_2016_2017 csv
# there are a lot of functions that can be used when reading the csv into a DataFrame
# common practice is to create a path variable to call when reading in data
# instead of -> retail_df = pd.read_csv('Pandas Course Resources/retail/retail_2016_2017.csv')
# use
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0
1,1945945,2016-01-01,1,BABY CARE,0.000,0
2,1945946,2016-01-01,1,BEAUTY,0.000,0
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
4,1945948,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [3]:
# number of rows and columns
retail_df.shape

(1054944, 6)

In [4]:
# row index of DataFrame
retail_df.index

RangeIndex(start=0, stop=1054944, step=1)

In [5]:
# list of the DataFrame columns
retail_df.columns

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')

In [6]:
# this retrieves information of axis=0 and axis=1 at the same time
retail_df.axes

[RangeIndex(start=0, stop=1054944, step=1),
 Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')]

In [7]:
# the datatypes and these may need to be changed
retail_df.dtypes

id               int64
date            object
store_nbr        int64
family          object
sales          float64
onpromotion      int64
dtype: object

##### <b> Creating a DataFrame </b></br> Can create a DataFrame from a Python Dictionary {`Key`:`Value`} or NumPy array `np.array([])` using the pandas DataFrame() constructor </br> - pd.DataFrame(`Dictionary` or `np.array`)

In [8]:
# Creation using Python Dictionary (keys and values) - rarely done from dictionary as an analyst
# keys are columns names and values are the column values
pd.DataFrame(
    {'id': [1, 2],
    'store_nbr': [1, 2],
    'family': ['POULTRY', 'PRODUCE']        
    }
)

Unnamed: 0,id,store_nbr,family
0,1,1,POULTRY
1,2,2,PRODUCE
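The dictionary route is shown above. A minimal sketch of the NumPy route follows, assuming a small 2x2 array. The names `values` and `array_df` and the column labels are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 array; each inner list becomes one row
values = np.array([[1, 10], [2, 20]])

# An array carries no column names, so supply them with columns=
array_df = pd.DataFrame(values, columns=['id', 'store_nbr'])

print(array_df)
print(array_df.shape)  # (2, 2)
```

Without `columns=`, pandas labels the columns 0, 1, ... by position.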


In [16]:
# creating DataFrame from oil.csv
# common practice is to create a path variable to call when reading in data
# instead of -> oil_df = pd.read_csv('Pandas Course Resources/retail/oil.csv')
# use
oil_path = 'Pandas Course Resources/retail/oil.csv'
oil_df = pd.read_csv(oil_path)
oil_df

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [17]:
# relabeling columns based on column index order
# oil_df.columns = ['price_date', 'oil_price']
# better to use .rename and a dictionary so the result does not depend on column order
oil_df.rename(columns={'date': 'price_date', 'dcoilwtico': 'oil_price'}, inplace=True)
oil_df

Unnamed: 0,price_date,oil_price
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.20
...,...,...
1213,2017-08-25,47.65
1214,2017-08-28,46.40
1215,2017-08-29,46.46
1216,2017-08-30,45.96


In [11]:
# check oil_df axes
oil_df.axes

[RangeIndex(start=0, stop=1218, step=1),
 Index(['price_date', 'oil_price'], dtype='object')]

In [12]:
# check oil_df datatypes
oil_df.dtypes

price_date     object
oil_price     float64
dtype: object

##### <b> Exploring a DataFrame </b>
|Method|Descriptions|
|------|------------|
|`head`| Returns the first n rows of the DataFrame (`Default is 5`) - `df.head(nrows)`|
|`tail`| Returns the last n rows of the DataFrame (`Default is 5`) - `df.tail(nrows)`|
|`sample`| Returns n rows from a random sample (`Default is 1`) - `df.sample(nrows)`|
|`info`| Returns key details on DataFrame size, columns, and memory usage - `df.info()`|
|`describe`| Returns descriptive statistics for the columns in a DataFrame (`only numeric columns by default`; use the `include` argument to specify more columns) - `df.describe(include)`|

In [13]:
# tail and head are great for QA of data upon import
# great to verify if columns headers have been read in or if they need to be relabelled
retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [14]:
# tail is great for timeseries to look at the last date 
retail_df.tail()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
1054939,3000883,2017-08-15,9,POULTRY,438.133,0
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.0,8
1054943,3000887,2017-08-15,9,SEAFOOD,16.0,0


In [15]:
# random sample of 5 rows; sampling is built on NumPy's random number generator, and random_state= sets the seed so the sample is reproducible
retail_df.sample(5, random_state=616)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0


### <b> Exploring Dataframes </b>

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0
5,1945949,2016-01-01,1,BREAD/BAKERY,0.0,0
6,1945950,2016-01-01,1,CELEBRATION,0.0,0
7,1945951,2016-01-01,1,CLEANING,0.0,0
8,1945952,2016-01-01,1,DAIRY,0.0,0
9,1945953,2016-01-01,1,DELI,0.0,0


##### <b> .info() Method </b></br> This method returns details on DataFrame properties and memory usage

In [None]:
# .info() method for properties and memory usage
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 48.3+ MB


In [None]:
# the Non-Null Count is included by default
# above roughly 1.7 million rows, .info() suppresses the Non-Null Count; pass show_counts=True to include it
retail_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 48.3+ MB


##### <b> .describe() Method </b></br> This method returns key statistics on a DataFrame's columns

In [None]:
# .describe() summarizes only the numeric columns by default
# large values in the result may be displayed in scientific notation
retail_df.describe() 

Unnamed: 0,id,store_nbr,sales,onpromotion
count,1054944.0,1054944.0,1054944.0,1054944.0
mean,2473416.0,27.5,457.7225,5.937977
std,304536.2,15.58579,1317.155,18.08632
min,1945944.0,1.0,0.0,0.0
25%,2209680.0,14.0,2.0,0.0
50%,2473416.0,27.5,24.0,0.0
75%,2737151.0,41.0,262.0,3.0
max,3000887.0,54.0,124717.0,741.0


In [None]:
# one way to avoid scientific notation in the .describe() output is to chain .round()
retail_df.describe().round()

Unnamed: 0,id,store_nbr,sales,onpromotion
count,1054944.0,1054944.0,1054944.0,1054944.0
mean,2473416.0,28.0,458.0,6.0
std,304536.0,16.0,1317.0,18.0
min,1945944.0,1.0,0.0,0.0
25%,2209680.0,14.0,2.0,0.0
50%,2473416.0,28.0,24.0,0.0
75%,2737151.0,41.0,262.0,3.0
max,3000887.0,54.0,124717.0,741.0


In [None]:
# .describe method for all columns, not just numeric columns
# use the include='all' argument to add the non-numeric (object) columns
retail_df.describe(include='all').round()
# the categorical column 'family' has 33 unique categories; 'top' shows AUTOMOTIVE with a frequency of 31968
# every family appears 31968 times (33 x 31968 = 1054944), so 'top' is just one of the tied values

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
count,1054944.0,1054944,1054944.0,1054944,1054944.0,1054944.0
unique,,592,,33,,
top,,2016-01-01,,AUTOMOTIVE,,
freq,,1782,,31968,,
mean,2473416.0,,28.0,,458.0,6.0
std,304536.0,,16.0,,1317.0,18.0
min,1945944.0,,1.0,,0.0,0.0
25%,2209680.0,,14.0,,2.0,0.0
50%,2473416.0,,28.0,,24.0,0.0
75%,2737151.0,,41.0,,262.0,3.0


### <b> Accessing Dataframes </b></br> Can access by using bracket notation or dot notation

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0
5,1945949,2016-01-01,1,BREAD/BAKERY,0.0,0
6,1945950,2016-01-01,1,CELEBRATION,0.0,0
7,1945951,2016-01-01,1,CLEANING,0.0,0
8,1945952,2016-01-01,1,DAIRY,0.0,0
9,1945953,2016-01-01,1,DELI,0.0,0


##### <b> Bracket Notation </b></br>

In [None]:
# pass a column name to get a Series, or a list of column names [['', '']] to get a DataFrame
retail_df['family']

0                          AUTOMOTIVE
1                           BABY CARE
2                              BEAUTY
3                           BEVERAGES
4                               BOOKS
                      ...            
1054939                       POULTRY
1054940                PREPARED FOODS
1054941                       PRODUCE
1054942    SCHOOL AND OFFICE SUPPLIES
1054943                       SEAFOOD
Name: family, Length: 1054944, dtype: object

##### <b> Dot Notation </b></br> - Only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method

In [None]:
# dot notation
retail_df.family

0                          AUTOMOTIVE
1                           BABY CARE
2                              BEAUTY
3                           BEVERAGES
4                               BOOKS
                      ...            
1054939                       POULTRY
1054940                PREPARED FOODS
1054941                       PRODUCE
1054942    SCHOOL AND OFFICE SUPPLIES
1054943                       SEAFOOD
Name: family, Length: 1054944, dtype: object
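A short sketch of where dot notation breaks down, using a made-up frame. The column names `count` and `unit price` are hypothetical and chosen to trigger the two failure cases:

```python
import pandas as pd

# Hypothetical columns: 'count' clashes with DataFrame.count(),
# and 'unit price' contains a space
demo_df = pd.DataFrame({'family': ['DAIRY', 'DELI'],
                        'count': [3, 5],
                        'unit price': [2.5, 4.0]})

family_col = demo_df.family          # works: valid name, no clash
count_attr = demo_df.count           # the DataFrame.count method, not the column
count_col = demo_df['count']         # bracket notation returns the column
price_col = demo_df['unit price']    # names with spaces need brackets

print(callable(count_attr))  # True
print(count_col.tolist())    # [3, 5]
```

Bracket notation works in every case, which is one reason to prefer it in shared code.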

##### <b> Use Series Operations on DataFrame Columns </b></br>

In [None]:
# number of unique values in a column
retail_df['family'].nunique()

33

In [None]:
# list of unique values in a column as a Series, default would be an array
pd.Series(retail_df['family'].unique())

0                     AUTOMOTIVE
1                      BABY CARE
2                         BEAUTY
3                      BEVERAGES
4                          BOOKS
5                   BREAD/BAKERY
6                    CELEBRATION
7                       CLEANING
8                          DAIRY
9                           DELI
10                          EGGS
11                  FROZEN FOODS
12                     GROCERY I
13                    GROCERY II
14                      HARDWARE
15            HOME AND KITCHEN I
16           HOME AND KITCHEN II
17               HOME APPLIANCES
18                     HOME CARE
19                    LADIESWEAR
20               LAWN AND GARDEN
21                      LINGERIE
22              LIQUOR,WINE,BEER
23                     MAGAZINES
24                         MEATS
25                 PERSONAL CARE
26                  PET SUPPLIES
27       PLAYERS AND ELECTRONICS
28                       POULTRY
29                PREPARED FOODS
30        

In [None]:
# first 5 rows of the value counts for a column using iloc[:stop]
retail_df['family'].value_counts().iloc[:5]

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
Name: count, dtype: int64

In [None]:
# mean of the sales column
retail_df['sales'].mean().round(2)

457.72

In [None]:
# sum of all the values in the sales column (rounded)
retail_df['sales'].sum().round(2)

482871591.33

In [None]:
# select multiple columns using a list [['', '']] which is great for non-consecutive columns and output as a DataFrame
retail_df[['family', 'store_nbr']]

Unnamed: 0,family,store_nbr
0,AUTOMOTIVE,1
1,BABY CARE,1
2,BEAUTY,1
3,BEVERAGES,1
4,BOOKS,1
...,...,...
1054939,POULTRY,9
1054940,PREPARED FOODS,9
1054941,PRODUCE,9
1054942,SCHOOL AND OFFICE SUPPLIES,9


In [None]:
# Can also slice the multiple columns selection DataFrame using .iloc[]

retail_df[['family', 'store_nbr']].iloc[:5]

Unnamed: 0,family,store_nbr
0,AUTOMOTIVE,1
1,BABY CARE,1
2,BEAUTY,1
3,BEVERAGES,1
4,BOOKS,1


##### <b> Accessing Data with .iloc[] </b></br> Access DataFrames by the row and column indices </br> iloc[`row_start`:`row_stop`, `column_start`:`column_stop`]

In [None]:
# grab the first 5 rows (row positions 0 to 4) and the 2nd to 4th columns (column positions 1, 2, 3)

retail_df.iloc[:5, 1:4]

Unnamed: 0,date,store_nbr,family
0,2016-01-01,1,AUTOMOTIVE
1,2016-01-01,1,BABY CARE
2,2016-01-01,1,BEAUTY
3,2016-01-01,1,BEVERAGES
4,2016-01-01,1,BOOKS


##### <b> Accessing Data with .loc[] </b></br> Access DataFrames by the row and column labels </br> loc[`row_start`:`row_stop`, `column_start`:`column_stop`] </br> Unlike .iloc[], label slices include the stop label

In [None]:
# .loc with all rows and a single column name (no [] around it) returns a Series
retail_df.loc[:, 'date']

0          2016-01-01
1          2016-01-01
2          2016-01-01
3          2016-01-01
4          2016-01-01
              ...    
1054939    2017-08-15
1054940    2017-08-15
1054941    2017-08-15
1054942    2017-08-15
1054943    2017-08-15
Name: date, Length: 1054944, dtype: object

In [None]:
# .loc with all rows and a single column name wrapped in [] returns a DataFrame
retail_df.loc[:, ['date']]

Unnamed: 0,date
0,2016-01-01
1,2016-01-01
2,2016-01-01
3,2016-01-01
4,2016-01-01
...,...
1054939,2017-08-15
1054940,2017-08-15
1054941,2017-08-15
1054942,2017-08-15


In [None]:
# using .loc[] accessor to access all rows and a list of columns
retail_df.loc[:, ['date', 'sales']]

Unnamed: 0,date,sales
0,2016-01-01,0.000
1,2016-01-01,0.000
2,2016-01-01,0.000
3,2016-01-01,0.000
4,2016-01-01,0.000
...,...,...
1054939,2017-08-15,438.133
1054940,2017-08-15,154.553
1054941,2017-08-15,2419.729
1054942,2017-08-15,121.000


In [None]:
# using .loc[] accessor to access all rows and a slice of columns
retail_df.loc[:, 'date':'sales']

Unnamed: 0,date,store_nbr,family,sales
0,2016-01-01,1,AUTOMOTIVE,0.000
1,2016-01-01,1,BABY CARE,0.000
2,2016-01-01,1,BEAUTY,0.000
3,2016-01-01,1,BEVERAGES,0.000
4,2016-01-01,1,BOOKS,0.000
...,...,...,...,...
1054939,2017-08-15,9,POULTRY,438.133
1054940,2017-08-15,9,PREPARED FOODS,154.553
1054941,2017-08-15,9,PRODUCE,2419.729
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000


In [None]:
# .loc[] is stop inclusive, so :10 selects the rows labeled 0 through 10, together with a slice of columns
retail_df.loc[:10, 'date':'sales']

Unnamed: 0,date,store_nbr,family,sales
0,2016-01-01,1,AUTOMOTIVE,0.0
1,2016-01-01,1,BABY CARE,0.0
2,2016-01-01,1,BEAUTY,0.0
3,2016-01-01,1,BEVERAGES,0.0
4,2016-01-01,1,BOOKS,0.0
5,2016-01-01,1,BREAD/BAKERY,0.0
6,2016-01-01,1,CELEBRATION,0.0
7,2016-01-01,1,CLEANING,0.0
8,2016-01-01,1,DAIRY,0.0
9,2016-01-01,1,DELI,0.0
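The difference in stop handling between the two accessors can be checked on a small frame. This is a sketch on made-up data; `letters_df` is a hypothetical name:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..5
letters_df = pd.DataFrame({'letter': list('abcdef'), 'value': range(6)})

by_position = letters_df.iloc[:3]   # positions 0, 1, 2 (stop excluded)
by_label = letters_df.loc[:3]       # labels 0, 1, 2, 3 (stop included)

print(len(by_position), len(by_label))  # 3 4
```

With the default integer index the labels look like positions, so the two slices are easy to confuse. The row counts show the difference.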


##### <b> Dropping Rows and Columns </b></br> .drop() method drops rows and columns from a DataFrame </br> - `axis=0` to drop rows: typically drop rows via slicing or filtering instead of .drop() method </br> - `axis=1` to drop columns </br> - `inplace=True` permanently removes rows/columns from DataFrame: Typically better to save into a new DataFrame so the original DataFrame is still available


In [None]:
# list of column names
retail_df.columns

Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')

In [None]:
# returns the first 5 rows of the DataFrame without the 'id' column, which is redundant with the pandas index
retail_df.drop('id', axis=1).head()

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,0.0,0
1,2016-01-01,1,BABY CARE,0.0,0
2,2016-01-01,1,BEAUTY,0.0,0
3,2016-01-01,1,BEVERAGES,0.0,0
4,2016-01-01,1,BOOKS,0.0,0


In [None]:
# permanently remove 'id' and 'onpromotion' from DataFrame
retail_df.drop(['id', 'onpromotion'], axis=1, inplace=True)
retail_df.head()

Unnamed: 0,date,store_nbr,family,sales
0,2016-01-01,1,AUTOMOTIVE,0.0
1,2016-01-01,1,BABY CARE,0.0
2,2016-01-01,1,BEAUTY,0.0
3,2016-01-01,1,BEVERAGES,0.0
4,2016-01-01,1,BOOKS,0.0
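As noted above, saving the result to a new DataFrame keeps the original available. A minimal sketch, assuming a small stand-in for retail_df. The names `sales_df`, `no_id_df` and `no_first_row_df` are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for retail_df
sales_df = pd.DataFrame({'id': [1, 2, 3],
                         'family': ['BOOKS', 'DAIRY', 'DELI'],
                         'sales': [3.0, 10.0, 7.5]})

# columns= is equivalent to passing the labels with axis=1
no_id_df = sales_df.drop(columns=['id'])

# index= is equivalent to passing the row labels with axis=0
no_first_row_df = sales_df.drop(index=[0])

print(no_id_df.columns.tolist())        # ['family', 'sales']
print(no_first_row_df.index.tolist())   # [1, 2]
print(sales_df.shape)                   # (3, 3): the original is untouched
```

The `columns=` and `index=` keywords say what is being dropped without the reader having to recall which axis number means what.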


### **Blank and Duplicate Values**

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0
5,1945949,2016-01-01,1,BREAD/BAKERY,0.0,0
6,1945950,2016-01-01,1,CELEBRATION,0.0,0
7,1945951,2016-01-01,1,CLEANING,0.0,0
8,1945952,2016-01-01,1,DAIRY,0.0,0
9,1945953,2016-01-01,1,DELI,0.0,0


##### <b> Identifying Duplicate Rows </b></br> .duplicated() method identifies duplicate rows of data - a row is flagged when every column matches an earlier row exactly </br> - specify `subset=column(s)` to look for duplicates across a subset of columns (only the values in those column(s) are compared) </br> if the number of unique values (`.nunique()`) is less than the total number of rows, then that column contains duplicate values


In [None]:
# create duplicated value DataFrame using dictionary {key:values}
products_df = pd.DataFrame(
    {'product': ['Dairy', 'Dairy', 'Dairy', 'Vegetable', 'Fruits'],
    'price': [2.56, 2.56, 4.55, 2.74, 5.44]        
    }
)
products_df

Unnamed: 0,product,price
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# shape of DataFrame
products_df.shape

(5, 2)

In [None]:
# number of unique values in each DataFrame column
# the number of unique values (product: 3, price: 4) is less than the number of rows (5)
products_df.nunique()


product    3
price      4
dtype: int64

In [None]:
# the .duplicated() method returns True for the second row because it is an exact duplicate of the first row
products_df.duplicated()

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [None]:
# to find a duplicate value in a specific column use (subset=) argument
products_df.duplicated(subset='product')

0    False
1     True
2     True
3    False
4    False
dtype: bool

##### <b>Drop Duplicate Rows </b></br> .drop_duplicates() method drops duplicate rows (rows that are exact duplicates of each other) </br> - specify `subset=column(s)` to drop duplicate rows based on specific column(s)</br> Passable arguments: - `keep=`('first', 'last', False): 'first' is default, False drops all duplicates </br> - `inplace=`(True, False): False is default, True modifies the current DataFrame, False returns a new DataFrame with the duplicates removed </br> - `ignore_index=`(True, False): False is default, True relabels the resulting index 0, 1, ..., n-1


In [None]:
# remove duplicates in 'product' column, keep the last of the duplicates (not the first), ignore_index resets index, and this is done as a copy, not to the original DataFrame (that requires inplace=True)
products_df.drop_duplicates(subset='product', keep='last', ignore_index=True)

Unnamed: 0,product,price
0,Dairy,4.55
1,Vegetable,2.74
2,Fruits,5.44


In [None]:
# original DataFrame still contains all rows
products_df

Unnamed: 0,product,price
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44
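The `keep=False` option is described above but not demonstrated. A small sketch on the same toy values as products_df, rebuilt here under the hypothetical name `dupes_df` so the block runs on its own:

```python
import pandas as pd

# Same toy values as products_df, rebuilt so this block is self-contained
dupes_df = pd.DataFrame({'product': ['Dairy', 'Dairy', 'Dairy', 'Vegetable', 'Fruits'],
                         'price': [2.56, 2.56, 4.55, 2.74, 5.44]})

# keep=False flags every member of a duplicate group, not just the repeats
all_dupes = dupes_df.duplicated(keep=False)

# keep=False in drop_duplicates removes rows 0 and 1 entirely
no_dupes_df = dupes_df.drop_duplicates(keep=False)

print(all_dupes.tolist())           # [True, True, False, False, False]
print(no_dupes_df.index.tolist())   # [2, 3, 4]
```

Flagging with `keep=False` is handy for inspecting all copies of a duplicate side by side before deciding which one to keep.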


##### <b> Identifying Missing Data </b></br> Can identify by column using the `.isna()` and `.sum()` methods. The `.info()` can also identify null values

In [None]:
# create a DataFrame with missing values from a dictionary {key: values}, mixing pd.NA (pandas) and np.nan (NumPy)
# np.nan is the supported spelling; the np.NaN and np.NAN aliases were removed in NumPy 2.0
productsNA_df = pd.DataFrame(
    {'product': [pd.NA, 'Dairy', 'Dairy', np.nan, 'Fruits'],
    'price': [2.56, pd.NA, 4.55, 2.74, np.nan],
    'product_id': [1, 2, 3, 4, 5]
    }
)
productsNA_df

Unnamed: 0,product,price,product_id
0,,2.56,1
1,Dairy,,2
2,Dairy,4.55,3
3,,2.74,4
4,Fruits,,5


In [None]:
# isna identifies both pd.NA and np.nan in the DataFrame
# ideal to use
productsNA_df.isna().sum()

product       2
price         2
product_id    0
dtype: int64

In [None]:
# can use .info()
productsNA_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   product     3 non-null      object
 1   price       3 non-null      object
 2   product_id  5 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 252.0+ bytes


##### <b> Handling Missing Data </b></br> The `.dropna()` method removes rows with NA values and the `.fillna()` method replaces NA values with new values </br> Passing a single value to `.fillna()` fills the NAs in every column with that same value, which is a problem when columns hold different kinds of data </br> To replace values in specific columns, pass a dictionary to the .fillna() method - `.fillna({'key': 'value'})`

In [None]:
# using .fillna() method to fill values in specific columns using a dictionary {key:value} and can have multiple keys:values separated by commas. Does not alter original Dataframe unless assigning to itself or new one
productsNA_df.fillna({'product':'Unknown', 'price':0 })



Unnamed: 0,product,price,product_id
0,Unknown,2.56,1
1,Dairy,0.0,2
2,Dairy,4.55,3
3,Unknown,2.74,4
4,Fruits,0.0,5


In [None]:
new_products_df = productsNA_df.fillna({'product':'Unknown', 'price':0 })
new_products_df


Unnamed: 0,product,price,product_id
0,Unknown,2.56,1
1,Dairy,0.0,2
2,Dairy,4.55,3
3,Unknown,2.74,4
4,Fruits,0.0,5


In [None]:
# original na DataFrame is unchanged
productsNA_df

Unnamed: 0,product,price,product_id
0,,2.56,1
1,Dairy,,2
2,Dairy,4.55,3
3,,2.74,4
4,Fruits,,5
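`.dropna()` is mentioned above but only `.fillna()` is shown. A minimal sketch on made-up data; `gaps_df` and the result names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
gaps_df = pd.DataFrame({'product': ['Dairy', None, 'Fruits', 'Bakery'],
                        'price': [2.56, 4.55, np.nan, 3.10]})

# Default: drop every row that has at least one missing value
any_na_dropped = gaps_df.dropna()

# subset= only requires the listed columns to be present
price_na_dropped = gaps_df.dropna(subset=['price'])

print(any_na_dropped.index.tolist())    # [0, 3]
print(price_na_dropped.index.tolist())  # [0, 1, 3]
```

Like `.fillna()`, `.dropna()` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.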


### <b> Sorting and Filtering Dataframes </b></br> 

##### <b> Filtering DataFrames </b></br> Can filter rows in a DataFrame by passing a logical test into the .loc[] accessor (as with a Series or NumPy array) </br>

### **Operators and Methods to Create Boolean Filters for Logical Tests**

| Description                 | Python Operator | Pandas Method |
|-----------------------------|-----------------|---------------|
| Equal                       | `==`            | `.eq()`       |
| Not Equal                   | `!=`            | `.ne()`       |
| Less Than or Equal          | `<=`            | `.le()`       |
| Less Than                   | `<`             | `.lt()`       |
| Greater Than or Equal       | `>=`            | `.ge()`       |
| Greater Than                | `>`             | `.gt()`       |
| Membership Test             | `in`            | `.isin()`     |
| Inverse Membership Test     | `not in`        | `~.isin()`    |

##### .isin() method syntax: df[`column_name_to_be_searched`].isin([`list of EXACT search strings`]); for partial matches use .str.contains(`string characters`)
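A quick sketch showing that the operator and method forms in the table produce the same boolean mask. The frame `mini_df` is a made-up stand-in for retail_df:

```python
import pandas as pd

# Hypothetical stand-in for retail_df
mini_df = pd.DataFrame({'family': ['DAIRY', 'DELI', 'BOOKS', 'DAIRY'],
                        'sales': [10.0, 0.0, 3.0, 25.0]})

# Operator and method forms of "greater than or equal"
same_op = mini_df['sales'] >= 10
same_method = mini_df['sales'].ge(10)

# Membership test and its inverse (~ flips every True/False)
in_list = mini_df['family'].isin(['DAIRY', 'DELI'])
not_in_list = ~mini_df['family'].isin(['DAIRY', 'DELI'])

print(same_op.equals(same_method))                    # True
print(mini_df.loc[not_in_list, 'family'].tolist())    # ['BOOKS']
```

The method forms can be convenient in long method chains. The operator forms are usually easier to read in a standalone mask.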

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [None]:
# create logical condition mask to filter DataFrame and .loc returns all rows that pass the condition
mask = retail_df['date'] == '2016-10-28'
retail_df.loc[mask]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
536382,2482326,2016-10-28,1,AUTOMOTIVE,8.000,0
536383,2482327,2016-10-28,1,BABY CARE,0.000,0
536384,2482328,2016-10-28,1,BEAUTY,9.000,1
536385,2482329,2016-10-28,1,BEVERAGES,2576.000,38
536386,2482330,2016-10-28,1,BOOKS,0.000,0
...,...,...,...,...,...,...
538159,2484103,2016-10-28,9,POULTRY,391.292,24
538160,2484104,2016-10-28,9,PREPARED FOODS,78.769,1
538161,2484105,2016-10-28,9,PRODUCE,993.760,5
538162,2484106,2016-10-28,9,SCHOOL AND OFFICE SUPPLIES,0.000,0


In [None]:
# create a boolean mask for the rows, then pass it to .loc together with a list of columns: .loc[row_mask, columns]
date_mask = retail_df['date'] == '2016-10-28'
retail_df.loc[date_mask, ['date', 'sales']]

Unnamed: 0,date,sales
536382,2016-10-28,8.000
536383,2016-10-28,0.000
536384,2016-10-28,9.000
536385,2016-10-28,2576.000
536386,2016-10-28,0.000
...,...,...
538159,2016-10-28,391.292
538160,2016-10-28,78.769
538161,2016-10-28,993.760
538162,2016-10-28,0.000


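##### <b> Apply multiple Filters </b></br> Join logical tests with & (and) or | (or), and wrap each test in parentheses because & and | bind more tightly than comparison operators such as > and ==.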

In [None]:
# create a complex boolean mask for string character search using .str.contains('characters')
mask2 = (retail_df['family'].str.contains('AUTO')) | (retail_df['family'].str.contains('DAIR'))
retail_df.loc[mask2]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
8,1945952,2016-01-01,1,DAIRY,0.0,0
33,1945977,2016-01-01,10,AUTOMOTIVE,0.0,0
41,1945985,2016-01-01,10,DAIRY,0.0,0
66,1946010,2016-01-01,11,AUTOMOTIVE,0.0,0
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054878,3000822,2017-08-15,8,AUTOMOTIVE,4.0,0
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054911,3000855,2017-08-15,9,AUTOMOTIVE,15.0,0


In [None]:
# create a complex boolean mask using the membership test .isin([])
# the sales comparison needs its own parentheses because & binds more tightly than >
mask3 = (retail_df['family'].isin(['AUTOMOTIVE', 'CLEANING']) & 
        (retail_df['sales'] > 0))
retail_df.loc[mask3]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
561,1946505,2016-01-01,25,AUTOMOTIVE,4.0,0
568,1946512,2016-01-01,25,CLEANING,734.0,0
1782,1947726,2016-01-02,1,AUTOMOTIVE,7.0,0
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1815,1947759,2016-01-02,10,AUTOMOTIVE,1.0,0
...,...,...,...,...,...,...
1054852,3000796,2017-08-15,7,CLEANING,1139.0,9
1054878,3000822,2017-08-15,8,AUTOMOTIVE,4.0,0
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054911,3000855,2017-08-15,9,AUTOMOTIVE,15.0,0


In [None]:
# create boolean mask for year 2016 with sales greater than 500
mask2016 = ((retail_df['date'].str[:4] == '2016') 
            & (retail_df['sales'] > 500))
retail_df.loc[mask2016]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
564,1946508,2016-01-01,25,BEVERAGES,5104.000,1
566,1946510,2016-01-01,25,BREAD/BAKERY,680.952,0
568,1946512,2016-01-01,25,CLEANING,734.000,0
569,1946513,2016-01-01,25,DAIRY,1033.000,11
572,1946516,2016-01-01,25,FROZEN FOODS,596.125,0
...,...,...,...,...,...,...
650409,2596353,2016-12-31,9,GROCERY I,7657.226,147
650415,2596359,2016-12-31,9,HOME CARE,515.000,7
650422,2596366,2016-12-31,9,PERSONAL CARE,516.000,13
650425,2596369,2016-12-31,9,POULTRY,687.853,1
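One way to keep combined filters readable is to name each condition before joining them. A sketch on made-up data; `mini_df`, `is_family` and `has_sales` are hypothetical names:

```python
import pandas as pd

# Hypothetical stand-in for retail_df
mini_df = pd.DataFrame({'family': ['AUTOMOTIVE', 'CLEANING', 'DAIRY'],
                        'sales': [4.0, 0.0, 12.0]})

# Naming each condition first sidesteps operator-precedence mistakes
is_family = mini_df['family'].isin(['AUTOMOTIVE', 'CLEANING'])
has_sales = mini_df['sales'] > 0

both = is_family & has_sales         # and
either = is_family | has_sales       # or
neither = ~(is_family | has_sales)   # not (either)

print(both.tolist())     # [True, False, False]
print(either.tolist())   # [True, True, True]
print(neither.tolist())  # [False, False, False]
```

Named conditions can also be reused across several filters without repeating the expression.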


### **Query Method** </br> .query() Method uses SQL-like syntax to filter DataFrames </br> Whether to use this method can depend on team or company conventions </br> - create complex filters using the `and` & `or` keywords </br> - use the `in` keyword from base Python</br> df.query("`column_to_filter` in [`'list_of_search_text'`]"), adding an `and`/`or` condition if needed </br> Similar to SQL, `@` calls a variable inside the query statement: `@variable_name` </br> -- slicing cannot be used inside the .query() method, but date values can be parsed into their own columns first

In [None]:
# .query method wraps entire statement in "" and search text in ''
retail_df.query(
    "family in ['CLEANING', 'DAIRY'] and sales > 0"
    )

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
568,1946512,2016-01-01,25,CLEANING,734.0,0
569,1946513,2016-01-01,25,DAIRY,1033.0,11
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1790,1947734,2016-01-02,1,DAIRY,627.0,15
1822,1947766,2016-01-02,10,CLEANING,1216.0,4
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054918,3000862,2017-08-15,9,CLEANING,1439.0,25


In [None]:
# creation of average sales variable
avg_sales = retail_df['sales'].mean()
avg_sales

457.72248700136413

In [None]:
# .query method wraps entire statement in "" and search text in '' and uses @variable_name when calling created variables
retail_df.query(
    "family in ['CLEANING', 'DAIRY'] and sales > @avg_sales"
    )

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
568,1946512,2016-01-01,25,CLEANING,734.0,0
569,1946513,2016-01-01,25,DAIRY,1033.0,11
1789,1947733,2016-01-02,1,CLEANING,526.0,3
1790,1947734,2016-01-02,1,DAIRY,627.0,15
1822,1947766,2016-01-02,10,CLEANING,1216.0,4
...,...,...,...,...,...,...
1054853,3000797,2017-08-15,7,DAIRY,1279.0,25
1054885,3000829,2017-08-15,8,CLEANING,1198.0,13
1054886,3000830,2017-08-15,8,DAIRY,1330.0,24
1054918,3000862,2017-08-15,9,CLEANING,1439.0,25
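
The notes above on `@` variables and parsed date columns can be sketched on a small made-up frame. The `toy_df` data, the `year` column and the `min_sales` variable are illustrative assumptions, not part of the retail dataset.

```python
import pandas as pd

# hypothetical toy data, not the retail dataset
toy_df = pd.DataFrame({
    'date': ['2016-01-01', '2016-06-30', '2017-02-14', '2017-08-15'],
    'family': ['DAIRY', 'CLEANING', 'DAIRY', 'BOOKS'],
    'sales': [100.0, 600.0, 750.0, 900.0],
})

# .query() cannot slice strings, so parse the year into its own column first
toy_df['year'] = toy_df['date'].str[:4].astype('int')

# Python variable referenced with @ inside the query string
min_sales = 500

result_df = toy_df.query(
    "family in ['CLEANING', 'DAIRY'] and year == 2017 and sales > @min_sales"
)
print(result_df)
```

Only the 2017 DAIRY row passes all three conditions here.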


### **Sorting DataFrames by Indices** </br> Can sort a DataFrame by its indices using the `.sort_index()` method </br> - This sorts rows (`axis=0`) by default, but `axis=1` can be specified to sort columns

In [None]:
# create sample DataFrame by filtering rows for 3 product families
condition = retail_df['family'].isin(['BEVERAGES', 'DAIRY', 'DELI'])
retail_df[condition]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
3,1945947,2016-01-01,1,BEVERAGES,0.000,0
8,1945952,2016-01-01,1,DAIRY,0.000,0
9,1945953,2016-01-01,1,DELI,0.000,0
36,1945980,2016-01-01,10,BEVERAGES,0.000,0
41,1945985,2016-01-01,10,DAIRY,0.000,0
...,...,...,...,...,...,...
1054886,3000830,2017-08-15,8,DAIRY,1330.000,24
1054887,3000831,2017-08-15,8,DELI,276.639,8
1054914,3000858,2017-08-15,9,BEVERAGES,3530.000,26
1054919,3000863,2017-08-15,9,DAIRY,835.000,19


In [None]:
# grab 5 sample rows into sample_df
sample_df = retail_df[condition].sample(5, random_state=2021)

In [None]:
# sort sample_df by index ascending (default)
sample_df.sort_index()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
13506,1959450,2016-01-08,38,DELI,131.545,43
74292,2020236,2016-02-11,43,DELI,212.0,2
445008,2390952,2016-09-06,45,BEVERAGES,8339.0,19
495966,2441910,2016-10-05,25,DELI,0.0,0
882588,2828532,2017-05-11,23,BEVERAGES,1194.0,22


In [None]:
# sort sample_df by index descending 
sample_df.sort_index(ascending=False)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
882588,2828532,2017-05-11,23,BEVERAGES,1194.0,22
495966,2441910,2016-10-05,25,DELI,0.0,0
445008,2390952,2016-09-06,45,BEVERAGES,8339.0,19
74292,2020236,2016-02-11,43,DELI,212.0,2
13506,1959450,2016-01-08,38,DELI,131.545,43


In [None]:
# sort sample_df by column index (or column labels) ascending (default)
sample_df.sort_index(axis=1, inplace=True)
sample_df

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
74292,2016-02-11,DELI,2020236,2,212.0,43
13506,2016-01-08,DELI,1959450,43,131.545,38
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45
495966,2016-10-05,DELI,2441910,0,0.0,25
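
A minimal sketch of the two sort directions on a made-up frame (the `toy_df` values are assumptions for illustration). It also shows that, without `inplace=True`, `.sort_index()` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# hypothetical toy frame with an unordered row index and unordered column labels
toy_df = pd.DataFrame(
    {'sales': [10, 20, 30], 'family': ['DELI', 'DAIRY', 'BEVERAGES'], 'id': [7, 8, 9]},
    index=[30, 10, 20],
)

rows_sorted = toy_df.sort_index()        # axis=0 (default): sorts the row labels
cols_sorted = toy_df.sort_index(axis=1)  # axis=1: sorts the column labels alphabetically

print(rows_sorted.index.tolist())
print(cols_sorted.columns.tolist())
print(toy_df.index.tolist())  # original order is unchanged
```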


### **Sorting DataFrames by Values** </br> Can sort a DataFrame by its values using the `.sort_values()` method </br> - Can sort by a single column or multiple columns </br> - can specify ascending/descending for each column using .sort_values([`list of columns`], ascending=[`list of True or False for each column`])

In [None]:
# sort sample_df by 1 column 'store_nbr'
sample_df.sort_values('store_nbr')

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
495966,2016-10-05,DELI,2441910,0,0.0,25
13506,2016-01-08,DELI,1959450,43,131.545,38
74292,2016-02-11,DELI,2020236,2,212.0,43
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45


In [None]:
# sort sample_df by 2 columns 'family', 'sales' with 'family' sorted ascending and 'sales' sorted descending
sample_df.sort_values(['family', 'sales'], ascending=[True,False])

Unnamed: 0,date,family,id,onpromotion,sales,store_nbr
445008,2016-09-06,BEVERAGES,2390952,19,8339.0,45
882588,2017-05-11,BEVERAGES,2828532,22,1194.0,23
74292,2016-02-11,DELI,2020236,2,212.0,43
13506,2016-01-08,DELI,1959450,43,131.545,38
495966,2016-10-05,DELI,2441910,0,0.0,25
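
The per-column `ascending` list can be checked on a small made-up frame (values are assumptions). Each entry in `ascending` pairs with the column in the same position of the column list.

```python
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    'family': ['DELI', 'BEVERAGES', 'DELI', 'BEVERAGES'],
    'sales': [212.0, 1194.0, 131.5, 8339.0],
})

# 'family' ascending (True), then 'sales' descending (False) within each family
sorted_df = toy_df.sort_values(['family', 'sales'], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```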


### <b> Modifying Columns </b></br> 

##### <b> Renaming Columns </b></br> Can rename columns in place via assignment by using `.columns = ['list of column name(s)']` </br> Column names must be listed in the correct order for assignment </br> can programmatically change column case using `df.columns = [col.upper() for col in df.columns]` (list comprehension) </br> .rename() method </br> Uses a dictionary to map old names to new names </br>`df.rename(columns={'old_name':'new_name'})`

In [None]:
# create DataFrame using dictionary {key:values}
products_df = pd.DataFrame(
    {'product': ['Dairy', 'Dairy', 'Dairy', 'Vegetable', 'Fruits'],
    'price': [2.56, 2.56, 4.55, 2.74, 5.44]        
    }
)
products_df

Unnamed: 0,product,price
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# assign product_name and cost to columns
products_df.columns = ['product_name', 'cost']
products_df

Unnamed: 0,product_name,cost
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# change column labels to be uppercase using list comprehension
products_df.columns = [col.upper() for col in products_df.columns]
products_df

Unnamed: 0,PRODUCT_NAME,COST
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# recreate DataFrame using dictionary {key:values}
products_df = pd.DataFrame(
    {'product': ['Dairy', 'Dairy', 'Dairy', 'Vegetable', 'Fruits'],
    'price': [2.56, 2.56, 4.55, 2.74, 5.44]        
    }
)
products_df

Unnamed: 0,product,price
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# use .rename() method to change label names; it returns a new DataFrame that must be assigned to keep the change
products_df.rename(columns= {'product':'product_name', 'price':'cost'})

Unnamed: 0,product_name,cost
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44


In [None]:
# can use lambda function to clean/standardize column names, creates new DataFrame
products_df.rename(columns=lambda x: x.upper())


Unnamed: 0,PRODUCT,PRICE
0,Dairy,2.56
1,Dairy,2.56
2,Dairy,4.55
3,Vegetable,2.74
4,Fruits,5.44
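
As a sketch of the lambda approach, string methods can be chained to standardize labels. The column names below are made up for illustration. Because `.rename()` returns a new DataFrame, the original labels stay as they were until the result is assigned back.

```python
import pandas as pd

# hypothetical frame with inconsistent column labels
toy_df = pd.DataFrame({'Product Name': ['Dairy'], 'Unit Price': [2.56]})

# lowercase each label and replace spaces with underscores
clean_df = toy_df.rename(columns=lambda col: col.lower().replace(' ', '_'))

print(toy_df.columns.tolist())    # original labels, unchanged
print(clean_df.columns.tolist())  # standardized labels
```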


##### **Reorder Columns**</br> .reindex() method: use this when sorting won't suffice </br>
`df.reindex(labels=[list of columns in specified order], axis=1)`

In [None]:
products_df = pd.DataFrame(
    {'product': ['Dairy', 'Dairy', 'Dairy', 'Vegetable', 'Fruits'],
    'price': [2.56, 2.56, 4.55, 2.74, 5.44],
    'product_id': [1, 2, 3, 4, 5]     
    }
)
products_df

Unnamed: 0,product,price,product_id
0,Dairy,2.56,1
1,Dairy,2.56,2
2,Dairy,4.55,3
3,Vegetable,2.74,4
4,Fruits,5.44,5


In [None]:
# reindex the columns in the required order, returns a new DataFrame
products_df.reindex(labels = ['product_id', 'product', 'price'], axis=1)

Unnamed: 0,product_id,product,price
0,1,Dairy,2.56
1,2,Dairy,2.56
2,3,Dairy,4.55
3,4,Vegetable,2.74
4,5,Fruits,5.44
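
One behaviour worth knowing, sketched on made-up data: `.reindex()` keeps only the labels listed, and a listed label that does not exist becomes a column of `NaN`. The `in_stock` label is a hypothetical name used only to show this.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'product': ['Dairy', 'Fruits'],
    'price': [2.56, 5.44],
    'product_id': [1, 2],
})

# 'price' is omitted, so it is dropped; 'in_stock' does not exist, so it is filled with NaN
reordered_df = toy_df.reindex(labels=['product_id', 'product', 'in_stock'], axis=1)
print(reordered_df)
```

If the goal is only to reorder existing columns, selecting them with a list, as in `toy_df[['product_id', 'product', 'price']]`, raises a `KeyError` on a misspelled label instead of silently creating an empty column.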


##### **Arithmetic Column Creation** </br> Can create columns with arithmetic by assigning them Series operations </br> Specify new column name and assign operation required

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [None]:
# filter retail_df for 'BABY CARE' and BOOKS
mask = (
    retail_df['family'].isin(['BABY CARE', 'BOOKS']) &
    (retail_df['sales'] > 0)
)
baby_books = retail_df.loc[mask]
baby_books = baby_books.sample(5, random_state=2022)
baby_books

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0
587170,2533114,2016-11-25,34,BABY CARE,2.0,0
592618,2538562,2016-11-28,37,BOOKS,2.0,0
824377,2770321,2017-04-08,4,BOOKS,2.0,0
397321,2343265,2016-08-10,8,BABY CARE,1.0,0


In [None]:
# Create tax_amount column = sales columns * 0.05 (use .loc[:, new_column])
baby_books.loc[:,'tax_amount'] = baby_books['sales'] * 0.05
baby_books.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05


In [None]:
# Create total_amount column = tax_amount + sales columns (use .loc[:, new_column])
baby_books.loc[:,'total_amount'] = baby_books['tax_amount'] + baby_books['sales']
baby_books.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,total_amount
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05,1.05
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1,2.1
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1,2.1
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1,2.1
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05,1.05


##### **Boolean Column Creation** </br> Can create columns using a logical test

In [None]:
# create new columns taxable_category using logical test
baby_books.loc[:, 'taxable_category'] = baby_books['family'] != 'BABY CARE'
baby_books

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,total_amount,taxable_category
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05,1.05,False
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1,2.1,False
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1,2.1,True
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1,2.1,True
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05,1.05,False


##### **Column Creation based on Boolean Arithmetic** </br> Can create columns using a logical test that performs the arithmetic only where the test is True

In [None]:
# True acts as 1 and False as 0 in arithmetic:
# where baby_books['family'] != 'BABY CARE' is True the tax is multiplied by 1 (kept),
# where it is False the tax is multiplied by 0, so the column value is zero
baby_books.loc[:,'tax_amount_bool'] = baby_books['sales'] * 0.05 * (baby_books['family'] != 'BABY CARE')
baby_books

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,total_amount,taxable_category,tax_amount_bool
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05,1.05,False,0.0
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1,2.1,False,0.0
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1,2.1,True,0.1
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1,2.1,True,0.1
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05,1.05,False,0.0
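
For comparison, the same result can be written with `np.where(condition, value_if_true, value_if_false)`. This is a sketch on made-up rows; `tax_bool` and `tax_where` are illustrative column names.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    'family': ['BABY CARE', 'BOOKS', 'BOOKS'],
    'sales': [1.0, 2.0, 4.0],
})
is_taxable = toy_df['family'] != 'BABY CARE'

# boolean arithmetic: True acts as 1, False acts as 0
toy_df['tax_bool'] = toy_df['sales'] * 0.05 * is_taxable

# np.where: the tax where the condition is True, 0.0 otherwise
toy_df['tax_where'] = np.where(is_taxable, toy_df['sales'] * 0.05, 0.0)
print(toy_df)
```

Both columns hold the same values; `np.where` may read more clearly when the "false" value is something other than zero.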


In [None]:
# Create integer Date Columns based on the 'date' column
# change datatype of 'date' to datetime64 so it can be parsed
baby_books['date'] = baby_books['date'].astype('datetime64[ns]')

In [None]:
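# parse the month from the 'date' column
baby_books["month"] = baby_books["date"].dt.month
# parse the day from the 'date' column
# NOTE: despite the column name, .dt.day returns the day of the month (1-31); .dt.dayofweek returns the weekday (Monday=0)
baby_books["day_of_week"] = baby_books["date"].dt.day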

baby_books

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,total_amount,taxable_category,tax_amount_bool,month,day_of_week
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05,1.05,False,0.0,7,28
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1,2.1,False,0.0,11,25
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1,2.1,True,0.1,11,28
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1,2.1,True,0.1,4,8
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05,1.05,False,0.0,8,10
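
The `.dt` accessor has separate attributes for the day of the month and the day of the week. A small sketch using three of the dates from the sample (the column names in `date_parts` are illustrative):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2017-07-28', '2016-11-25', '2016-11-28']))

date_parts = pd.DataFrame({
    'date': dates,
    'day_of_month': dates.dt.day,       # 1-31
    'day_of_week': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    'day_name': dates.dt.day_name(),    # weekday as text
})
print(date_parts)
```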


##### **Advanced Column Creation with NumPy select() function** </br> Can create columns based on multiple conditions </br> More flexible than `np.where()` or the pandas `.where()` method </br> the output can be categories or calculations/values

In [None]:
# use np.select() to select from conditions and choices
conditions = [
    # condition 1 links with choices list 1
    (baby_books['date'] == '2017-07-28') & (baby_books['family'] == 'BABY CARE'),
    # condition 2 links with choices list 2
    (baby_books['date'] == '2016-11-28') & (baby_books['family'] == 'BOOKS'),
    # condition 3 links with choices list 3
    (baby_books['date'] == '2016-11-25') & (baby_books['store_nbr'] > 28),    
]

choices = ['Winter Clearance', 'Christmas Eve', 'New Store Special']

# Creation of new column using np.select from conditions and choices and if no match default value is output
baby_books['Sales_Name'] = np.select(conditions, choices, default='No Sale')

baby_books

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,total_amount,taxable_category,tax_amount_bool,month,day_of_week,Sales_Name
1021351,2967295,2017-07-28,17,BABY CARE,1.0,0,0.05,1.05,False,0.0,7,28,Winter Clearance
587170,2533114,2016-11-25,34,BABY CARE,2.0,0,0.1,2.1,False,0.0,11,25,New Store Special
592618,2538562,2016-11-28,37,BOOKS,2.0,0,0.1,2.1,True,0.1,11,28,Christmas Eve
824377,2770321,2017-04-08,4,BOOKS,2.0,0,0.1,2.1,True,0.1,4,8,No Sale
397321,2343265,2016-08-10,8,BABY CARE,1.0,0,0.05,1.05,False,0.0,8,10,No Sale


In [None]:
# counts the values in the new categorical column
baby_books['Sales_Name'].value_counts()

Sales_Name
No Sale              2
Winter Clearance     1
New Store Special    1
Christmas Eve        1
Name: count, dtype: int64
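
The choices do not have to be text. As a sketch with made-up discount rates, each choice below is a calculation. Conditions are checked in order, so a row that matches more than one condition takes the first matching choice, and rows that match none take `default`.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({
    'family': ['BOOKS', 'BABY CARE', 'DAIRY'],
    'sales': [200.0, 200.0, 100.0],
})

conditions = [
    toy_df['family'] == 'BOOKS',  # pairs with choices[0]
    toy_df['sales'] > 150,        # pairs with choices[1]
]
# hypothetical discount rates of 10% and 20%
choices = [toy_df['sales'] * 0.10, toy_df['sales'] * 0.20]

# the BOOKS row matches both conditions and takes the first; DAIRY matches none
toy_df['discount'] = np.select(conditions, choices, default=0.0)
print(toy_df)
```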

##### **Mapping Values to Columns** </br> Can map values to columns or an entire DataFrame </br> can pass a dictionary with existing values as the keys and new values as the values `{old_value:new_value}` </br> - may not be used often, but it is simpler than writing many conditions for np.select() </br> can also use a `lambda function` to map formatting to values </br> **`NOTE`** When re-categorizing smaller categories into major categories, any category value not included in the dictionary becomes `NaN`. Even so, map may be a better fit than select for this because fewer logical conditions need to be created

In [None]:
products_df

Unnamed: 0,product,price,product_id
0,Dairy,2.56,1
1,Dairy,2.56,2
2,Dairy,4.55,3
3,Vegetable,2.74,4
4,Fruits,5.44,5


In [None]:
# mapping values from existing values
#create mapped values dictionary
mapping_dict = {'Dairy':'Non-Vegan', 'Vegetable':'Vegan', 'Fruits':'Vegan'}

# create new column and use .map(dictionary_variable) on column of interest to map values to new column. .map() returns a new Series
products_df['Vegan?'] = products_df['product'].map(mapping_dict)
products_df

Unnamed: 0,product,price,product_id,Vegan?
0,Dairy,2.56,1,Non-Vegan
1,Dairy,2.56,2,Non-Vegan
2,Dairy,4.55,3,Non-Vegan
3,Vegetable,2.74,4,Vegan
4,Fruits,5.44,5,Vegan


In [None]:
# use lambda function to assign currency formatting to price values; this overwrites 'price' with strings (object dtype)
products_df['price'] = products_df['price'].map(lambda x: f'${x}')

products_df

Unnamed: 0,product,price,product_id,Vegan?
0,Dairy,$2.56,1,Non-Vegan
1,Dairy,$2.56,2,Non-Vegan
2,Dairy,$4.55,3,Non-Vegan
3,Vegetable,$2.74,4,Vegan
4,Fruits,$5.44,5,Vegan


In [None]:
retail_df['family'].value_counts()

family
AUTOMOTIVE                    31968
HOME APPLIANCES               31968
SCHOOL AND OFFICE SUPPLIES    31968
PRODUCE                       31968
PREPARED FOODS                31968
POULTRY                       31968
PLAYERS AND ELECTRONICS       31968
PET SUPPLIES                  31968
PERSONAL CARE                 31968
MEATS                         31968
MAGAZINES                     31968
LIQUOR,WINE,BEER              31968
LINGERIE                      31968
LAWN AND GARDEN               31968
LADIESWEAR                    31968
HOME CARE                     31968
HOME AND KITCHEN II           31968
BABY CARE                     31968
HOME AND KITCHEN I            31968
HARDWARE                      31968
GROCERY II                    31968
GROCERY I                     31968
FROZEN FOODS                  31968
EGGS                          31968
DELI                          31968
DAIRY                         31968
CLEANING                      31968
CELEBRATION          

In [None]:
# for analytics purposes, re-categorize sub-families into major families; anything not included in the dictionary will become NaN in the new column
family_category = {
    'PRODUCE':'Grocery',
    'PREPARED FOODS':'Grocery',
    'POULTRY':'Grocery',
    'MEATS':'Grocery',
    'GROCERY II':'Grocery',
    'GROCERY I':'Grocery',
    'FROZEN FOODS':'Grocery',
    'EGGS':'Grocery',
    'DELI':'Grocery',
    'DAIRY':'Grocery',
    'BREAD/BAKERY':'Grocery',
    'BEVERAGES':'Grocery',
    'SEAFOOD':'Grocery'
}

retail_df['product'] = retail_df['family'].map(family_category)
retail_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,product
0,1945944,2016-01-01,1,AUTOMOTIVE,0.000,0,
1,1945945,2016-01-01,1,BABY CARE,0.000,0,
2,1945946,2016-01-01,1,BEAUTY,0.000,0,
3,1945947,2016-01-01,1,BEVERAGES,0.000,0,Grocery
4,1945948,2016-01-01,1,BOOKS,0.000,0,
...,...,...,...,...,...,...,...
1054939,3000883,2017-08-15,9,POULTRY,438.133,0,Grocery
1054940,3000884,2017-08-15,9,PREPARED FOODS,154.553,1,Grocery
1054941,3000885,2017-08-15,9,PRODUCE,2419.729,148,Grocery
1054942,3000886,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8,
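
One way to handle the unmapped values, sketched on a short made-up Series: chain `.fillna()` after `.map()`. The `'Other'` label is an assumption; any catch-all name would do.

```python
import pandas as pd

families = pd.Series(['DAIRY', 'BOOKS', 'PRODUCE', 'AUTOMOTIVE'])
family_category = {'DAIRY': 'Grocery', 'PRODUCE': 'Grocery'}

# values missing from the dictionary (BOOKS, AUTOMOTIVE) become NaN
mapped = families.map(family_category)

# replace the NaN values with a hypothetical catch-all category
mapped_filled = mapped.fillna('Other')
print(mapped_filled)
```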


##### **Column Creation with Assign** </br> The `.assign()` method creates multiple columns at once and returns a new DataFrame that can be assigned</br> Can be chained together with other data processing methods </br> the new column name does not need to be in quotations </br> boolean, arithmetic, and map operations can be used within the .assign() method </br> can also create columns based on columns created in the same .assign() call but `MUST USE LAMBDA FUNCTION TO DO ARITHMETIC/BOOLEAN Logic` </br> can be chained with `.query()` at the end to filter on newly assigned columns

In [None]:
# Create random sample from retail_df
sample_df = retail_df.sample(10, random_state=2022)
sample_df = sample_df.drop(columns='product')
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
86825,2032769,2016-02-18,45,BEAUTY,7.0,0
487764,2433708,2016-09-30,44,MEATS,1718.515,54
358165,2304109,2016-07-19,9,HOME AND KITCHEN II,10.0,0
770952,2716896,2017-03-09,40,CELEBRATION,7.0,0
239109,2185053,2016-05-14,18,MEATS,116.755,0
550589,2496533,2016-11-04,8,HOME APPLIANCES,0.0,0
260289,2206233,2016-05-26,12,HOME CARE,122.0,3
393399,2339343,2016-08-08,47,CELEBRATION,23.0,0
311328,2257272,2016-06-23,44,CELEBRATION,30.0,0
869484,2815428,2017-05-03,6,AUTOMOTIVE,4.0,0


In [None]:
# using .assign() method new column does not need to be in quotes
sample_df.assign(tax_amount=sample_df['sales']*0.05).round(2)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount
86825,2032769,2016-02-18,45,BEAUTY,7.0,0,0.35
487764,2433708,2016-09-30,44,MEATS,1718.52,54,85.93
358165,2304109,2016-07-19,9,HOME AND KITCHEN II,10.0,0,0.5
770952,2716896,2017-03-09,40,CELEBRATION,7.0,0,0.35
239109,2185053,2016-05-14,18,MEATS,116.76,0,5.84
550589,2496533,2016-11-04,8,HOME APPLIANCES,0.0,0,0.0
260289,2206233,2016-05-26,12,HOME CARE,122.0,3,6.1
393399,2339343,2016-08-08,47,CELEBRATION,23.0,0,1.15
311328,2257272,2016-06-23,44,CELEBRATION,30.0,0,1.5
869484,2815428,2017-05-03,6,AUTOMOTIVE,4.0,0,0.2


In [None]:
# using .assign() method to create complex column creations 
sample_df.assign(
    tax_amount = (sample_df['sales']*0.05).round(2).map(lambda x: f'${x}'),
    on_promotion_flag = sample_df['onpromotion'] > 0,
    year = sample_df['date'].str[:4].astype('int'),
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,on_promotion_flag,year
86825,2032769,2016-02-18,45,BEAUTY,7.0,0,$0.35,False,2016
487764,2433708,2016-09-30,44,MEATS,1718.515,54,$85.93,True,2016
358165,2304109,2016-07-19,9,HOME AND KITCHEN II,10.0,0,$0.5,False,2016
770952,2716896,2017-03-09,40,CELEBRATION,7.0,0,$0.35,False,2017
239109,2185053,2016-05-14,18,MEATS,116.755,0,$5.84,False,2016
550589,2496533,2016-11-04,8,HOME APPLIANCES,0.0,0,$0.0,False,2016
260289,2206233,2016-05-26,12,HOME CARE,122.0,3,$6.1,True,2016
393399,2339343,2016-08-08,47,CELEBRATION,23.0,0,$1.15,False,2016
311328,2257272,2016-06-23,44,CELEBRATION,30.0,0,$1.5,False,2016
869484,2815428,2017-05-03,6,AUTOMOTIVE,4.0,0,$0.2,False,2017


In [None]:
# using .assign() method to create complex column creations chained with .query() method
sample_df.assign(
    tax_amount = (sample_df['sales']*0.05).round(2).map(lambda x: f'${x}'),
    on_promotion_flag = sample_df['onpromotion'] > 0,
    year = sample_df['date'].str[:4].astype('int'),
).query('on_promotion_flag == True')

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,on_promotion_flag,year
487764,2433708,2016-09-30,44,MEATS,1718.515,54,$85.93,True,2016
260289,2206233,2016-05-26,12,HOME CARE,122.0,3,$6.1,True,2016


##### **If a column created in .assign() is used to create another column in the same .assign() call </br> `a lambda function must be used, otherwise it will not work`**

In [None]:
# MUST USE LAMBDA FUNCTION TO APPLY ARITHMETIC AND BOOLEAN LOGIC IF CREATING COLUMNS BASED ON COLUMNS IN ASSIGN METHOD

# using .assign() method to create a column from another column created in the same call
sample_df.assign(
    tax_amount = (sample_df['sales']*0.05).round(2).map(lambda x: f'${x}'),
    on_promotion_flag = sample_df['onpromotion'] > 0,
    year = sample_df['date'].str[:4].astype('int'),
    # dividing by onpromotion == 0 yields inf (or NaN when sales is also 0)
    onpromotion_ratio = sample_df['sales'] / sample_df['onpromotion'],
    # this has to use lambda function to work with columns within .assign method
    sales_onpromo_target = lambda x: x['onpromotion_ratio'] > 100
)


Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,tax_amount,on_promotion_flag,year,onpromotion_ratio,sales_onpromo_target
86825,2032769,2016-02-18,45,BEAUTY,7.0,0,$0.35,False,2016,inf,True
487764,2433708,2016-09-30,44,MEATS,1718.515,54,$85.93,True,2016,31.824352,False
358165,2304109,2016-07-19,9,HOME AND KITCHEN II,10.0,0,$0.5,False,2016,inf,True
770952,2716896,2017-03-09,40,CELEBRATION,7.0,0,$0.35,False,2017,inf,True
239109,2185053,2016-05-14,18,MEATS,116.755,0,$5.84,False,2016,inf,True
550589,2496533,2016-11-04,8,HOME APPLIANCES,0.0,0,$0.0,False,2016,,False
260289,2206233,2016-05-26,12,HOME CARE,122.0,3,$6.1,True,2016,40.666667,False
393399,2339343,2016-08-08,47,CELEBRATION,23.0,0,$1.15,False,2016,inf,True
311328,2257272,2016-06-23,44,CELEBRATION,30.0,0,$1.5,False,2016,inf,True
869484,2815428,2017-05-03,6,AUTOMOTIVE,4.0,0,$0.2,False,2017,inf,True
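
The `inf` values above come from dividing by zero, and `inf > 100` evaluates to True. One possible way to avoid that, sketched on made-up rows, is to replace 0 with `NaN` before dividing. The `above_target` name and the threshold of 35 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy data
toy_df = pd.DataFrame({'sales': [1718.5, 122.0, 7.0], 'onpromotion': [54, 3, 0]})

result_df = (
    toy_df
    .assign(
        # replace 0 with NaN first so the ratio is NaN instead of inf
        onpromotion_ratio=lambda df: df['sales'] / df['onpromotion'].replace(0, np.nan),
        # the lambda receives the frame that already holds onpromotion_ratio
        above_target=lambda df: df['onpromotion_ratio'] > 35,
    )
    .query('above_target == True')
)
print(result_df)
```

Comparisons against `NaN` evaluate to False, so the row with no promotions is filtered out rather than flagged.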


### **Data Types**
| Data Type         | Library | Description                    | Bit Size         |
|-------------------|---------|--------------------------------|------------------|
| `bool`            | NumPy   | Boolean True/False             | 8                |
| `int64`           | NumPy   | Whole Numbers                  | 8, 16, 32, 64    |
| `float64`         | NumPy   | Decimal Numbers                | 16, 32, 64       |
| `object`          | NumPy   | Any Python Object              | N/A              |
| `boolean`         | Pandas  | Nullable Boolean True/False    | 8                |
| `Int64`           | Pandas  | Nullable Whole Numbers         | 8, 16, 32, 64    |
| `Float64`         | Pandas  | Nullable Decimal Numbers       | 32, 64           |
| `string`          | Pandas  | Text/String Data               | N/A              |
| `category`        | Pandas  | Maps categorical data to a numerical array for efficiency | N/A |
| `datetime64[ns]`  | Pandas  | A single moment in time (January 4, 2015, 2:00:00 PM) | 64 |
| `timedelta64[ns]` | Pandas  | Duration between 2 dates or times | 64            |
| `period`          | Pandas  | A span of time                 | N/A              |
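
A short sketch of the difference between the NumPy-backed and the nullable pandas integer types (variable names are illustrative). The nullable type is spelled with a capital `I`.

```python
import pandas as pd

values = [1, 2, None]

# NumPy-backed: the missing value forces the whole Series to float64
numpy_backed = pd.Series(values)

# pandas nullable integer: values stay integers and the gap is stored as <NA>
nullable_int = pd.Series(values, dtype='Int64')

print(numpy_backed.dtype)
print(nullable_int.dtype)
```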


##### Pandas categorical data type stores text data with repeated values efficiently </br> pandas maps each unique category to an integer to save space </br> only consider this data type when </br> `unique categories < (number_of_df_rows / 2)`


In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [None]:
# memory_usage='deep' provides a more accurate memory usage calculation by considering the memory usage of object dtype columns
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 167.8 MB


In [None]:
# converting retail_df DataFrame 'family' to category to see memory usage
retail_df = retail_df.astype({'family':'category'})
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype   
---  ------       --------------    -----   
 0   id           1054944 non-null  int64   
 1   date         1054944 non-null  object  
 2   store_nbr    1054944 non-null  int64   
 3   family       1054944 non-null  category
 4   sales        1054944 non-null  float64 
 5   onpromotion  1054944 non-null  int64   
dtypes: category(1), float64(1), int64(3), object(1)
memory usage: 100.6 MB


##### **Changing the `'family'` column from `object` to `category` saved 67.2 MB of memory (167.8 MB down to 100.6 MB) </br> This reduces the resources the DataFrame uses, which matters especially during data manipulation**

In [None]:
sample_df = retail_df.sample(10, random_state=616)
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
79237,2025181,2016-02-14,32,BOOKS,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1
615101,2561045,2016-12-11,18,HARDWARE,0.0,0


In [None]:
# when converting into category data type there are integers for the 'family' categorical values being stored in the background, but it is still displayed as text in dataframe. 
sample_df.astype({'family':'category'})

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
79237,2025181,2016-02-14,32,BOOKS,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1
615101,2561045,2016-12-11,18,HARDWARE,0.0,0
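
The integers stored in the background can be inspected through the `.cat` accessor. A sketch on a made-up Series (repeated 1000 times so that the memory difference is visible):

```python
import pandas as pd

# hypothetical column with only three unique values repeated many times
family = pd.Series(['PRODUCE', 'HARDWARE', 'BOOKS', 'PRODUCE'] * 1000)
family_cat = family.astype('category')

# the unique labels are stored once
print(family_cat.cat.categories.tolist())
# each row stores only an integer code pointing at a label
print(family_cat.cat.codes.head(4).tolist())

# compare the memory footprint in bytes
text_bytes = family.memory_usage(deep=True)
category_bytes = family_cat.memory_usage(deep=True)
print(text_bytes, category_bytes)
```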


##### **Type Conversion** </br> Can convert datatypes in DataFrames by using .astype() method and specifying desired data type (if compatible) </br> can use dictionary to pass multiple column data type conversions

In [None]:
# create new column with sales value as integer (astype('int') truncates the decimals, it does not round)
sample_df['sales_int'] = sample_df['sales'].astype('int')
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 399033 to 615101
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           10 non-null     int64   
 1   date         10 non-null     object  
 2   store_nbr    10 non-null     int64   
 3   family       10 non-null     category
 4   sales        10 non-null     float64 
 5   onpromotion  10 non-null     int64   
 6   sales_int    10 non-null     int32   
dtypes: category(1), float64(1), int32(1), int64(3), object(1)
memory usage: 1.8+ KB


In [None]:
# convert multiple column data types with a dictionary and reassign the result to sample_df
sample_df = sample_df.astype({'date':'datetime64[ns]', 'onpromotion':'float'})
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 399033 to 615101
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           10 non-null     int64         
 1   date         10 non-null     datetime64[ns]
 2   store_nbr    10 non-null     int64         
 3   family       10 non-null     category      
 4   sales        10 non-null     float64       
 5   onpromotion  10 non-null     float64       
 6   sales_int    10 non-null     int32         
dtypes: category(1), datetime64[ns](1), float64(2), int32(1), int64(2)
memory usage: 1.8 KB


In [None]:
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,487.239,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,0.0,0.0,0


In [None]:
# Changing Sales column values to be object to show how to clean after and convert back to float
sample_df['sales'] = sample_df['sales'].round(2).map(lambda x: f'${x}')
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,$487.24,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,$0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,$3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,$7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,$5212.62,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,$0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,$238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,$0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,$9679.14,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,$0.0,0.0,0


In [None]:
# cannot convert sales to float because it is an object due to $ included
# sample_df['sales'].astype('float')
# ValueError: could not convert string to float: '$487.24'

In [None]:
# need to clean the column first
# using .assign method -> can strip the '$' string from the column and chain convert .astype('float')
# assign method has column_name = without bracket notation!!!
sample_df.assign(sales = sample_df['sales'].str.strip('$').astype('float'))

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,487.24,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,5212.62,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.14,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,0.0,0.0,0
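
When a column may also contain values that cannot be converted at all, `pd.to_numeric(..., errors='coerce')` is an alternative to `.astype('float')`: it converts what it can and turns the rest into `NaN` instead of raising. The `'n/a'` entry below is a made-up example of such a value.

```python
import pandas as pd

# hypothetical raw values; 'n/a' cannot be parsed as a number
raw_sales = pd.Series(['$487.24', '$0.0', 'n/a'])

# strip the '$' first, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(raw_sales.str.strip('$'), errors='coerce')
print(cleaned)
```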


##### **Memory Optimization** </br> DataFrames are stored entirely in memory </br> `Large datasets can use an excessive amount of memory`

##### **Best Practices**
| Step | Description|
|---|----------------|
| 1 | Drop unnecessary Columns (or avoid reading them)|
| 2 | Convert object types to numeric or datetime where possible|
| 3 | Downcast numeric data to the smallest appropriate bit size|
| 4 | Use the categorical data type `'category'` if `number of unique values < (rows / 2)`|
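
Several of these steps can be applied while reading the file, so the large version never has to sit in memory. A rough sketch, assuming a small in-memory CSV (`csv_text` is hypothetical and only mimics the retail columns) and dtypes chosen to suit those toy values:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV that mimics the retail file's columns
csv_text = """id,date,store_nbr,family,sales,onpromotion
1,2016-01-01,1,AUTOMOTIVE,0.0,0
2,2016-01-01,1,BABY CARE,0.0,0
3,2016-01-02,2,AUTOMOTIVE,12.5,3
"""

small_df = pd.read_csv(
    io.StringIO(csv_text),
    # step 1: avoid reading 'id' at all
    usecols=["date", "store_nbr", "family", "sales", "onpromotion"],
    # step 2: parse the date strings into datetimes
    parse_dates=["date"],
    # steps 3 and 4: smaller numeric types and a categorical column
    dtype={"store_nbr": "int8", "family": "category",
           "sales": "float32", "onpromotion": "int16"},
)

print(small_df.dtypes)
```

The dtypes here suit the toy values; for real data, check each column's range first so that nothing overflows or loses precision.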

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [None]:
# assess memory usage in MB
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 167.8 MB


In [None]:
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036

175920036
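
Leaving off `.sum()` shows which columns are using the memory. A small sketch on a hypothetical `toy_df`, with the string column explicitly stored as `object` to match the dtypes shown above:

```python
import pandas as pd

# Hypothetical frame: one object (string) column and one float column
toy_df = pd.DataFrame({
    "family": pd.Series(["AUTOMOTIVE", "BABY CARE", "BEAUTY"] * 1000, dtype=object),
    "sales": [0.0, 1.5, 2.5] * 1000,
})

shallow = toy_df.memory_usage()        # object columns: counts only the pointers
deep = toy_df.memory_usage(deep=True)  # also counts the Python string objects

# Side-by-side view per column (bytes)
print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

Only the object column differs between the two measurements, which is why `deep=True` matters for string-heavy data.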

##### **Drop Columns**

In [None]:
# if the data has an id column that only duplicates the DataFrame index, drop it
# inplace=True modifies retail_df directly instead of returning a new DataFrame
retail_df.drop(columns='id', inplace=True)
retail_df

Unnamed: 0,date,store_nbr,family,sales,onpromotion
0,2016-01-01,1,AUTOMOTIVE,0.000,0
1,2016-01-01,1,BABY CARE,0.000,0
2,2016-01-01,1,BEAUTY,0.000,0
3,2016-01-01,1,BEVERAGES,0.000,0
4,2016-01-01,1,BOOKS,0.000,0
...,...,...,...,...,...
1054939,2017-08-15,9,POULTRY,438.133,0
1054940,2017-08-15,9,PREPARED FOODS,154.553,1
1054941,2017-08-15,9,PRODUCE,2419.729,148
1054942,2017-08-15,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


In [None]:
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop

167480484

##### **Convert Object Data Types**

In [None]:
# for reference: .info() output of the original DataFrame (before 'id' was dropped)
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1054944 entries, 0 to 1054943
# Data columns (total 6 columns):
#  #   Column       Non-Null Count    Dtype  
# ---  ------       --------------    -----  
#  0   id           1054944 non-null  int64  
#  1   date         1054944 non-null  object 
#  2   store_nbr    1054944 non-null  int64  
#  3   family       1054944 non-null  object 
#  4   sales        1054944 non-null  float64
#  5   onpromotion  1054944 non-null  int64  
# dtypes: float64(1), int64(3), object(2)
# memory usage: 167.8 MB

retail_df = retail_df.astype({'date': 'datetime64[ns]', 'family':'category'})
# assess memory usage in bytes
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop
# 34816592 after objects converted to datetime64, category data types

34816592
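
The `number of unique values < (rows / 2)` rule of thumb from the Best Practices table can be checked in code. A minimal sketch: `category_candidates` is a hypothetical helper (not a pandas function) and `toy_df` is made-up data.

```python
import pandas as pd

# Hypothetical frame: 'family' repeats 4 values, 'note' is unique on every row
toy_df = pd.DataFrame({
    "family": pd.Series(["AUTOMOTIVE", "BABY CARE", "BEAUTY", "BOOKS"] * 250, dtype=object),
    "note": pd.Series([f"row-{i}" for i in range(1000)], dtype=object),
})

def category_candidates(df):
    """Return object columns whose unique count is below half the row count."""
    object_cols = df.select_dtypes(include="object").columns
    return [col for col in object_cols if df[col].nunique() < len(df) / 2]

candidates = category_candidates(toy_df)

# Compare memory before and after converting only the candidate columns
before = toy_df.memory_usage(deep=True).sum()
after = toy_df.astype({col: "category" for col in candidates}).memory_usage(deep=True).sum()

print(candidates, before, after)
```

A column with mostly unique values (like `note`) stays as it is, since a categorical would then have to store nearly every value plus the codes.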

##### **Downcasting Numeric Data** </br> By default, pandas stores integers and floats as 64-bit so that it can handle a very wide range of values
|bit size|signed integer range|
|--------|------------|
|8-bits| -128 to 127|
|16-bits| -32,768 to 32,767|
|32-bits| -2,147,483,648 to 2,147,483,647|
|64-bits| -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807|
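
The ranges in this table do not need to be memorized; `np.iinfo` reports them for any integer type. A short sketch:

```python
import numpy as np

# Print the bit size and value range of each signed integer type
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.bits}-bit: {info.min} to {info.max}")
```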


In [None]:
retail_df['onpromotion'].value_counts()

onpromotion
0      643896
1       93647
2       44341
3       26985
4       20934
        ...  
600         1
306         1
672         1
474         1
425         1
Name: count, Length: 362, dtype: int64

In [None]:
retail_df['onpromotion'] = retail_df['onpromotion'].astype('int16')
retail_df.memory_usage(deep=True).sum()
# 175920036 before 'id' column drop
# 167480484 after 'id' column drop
# 34816592 after objects converted to datetime64, category data types
# 28486928 after onpromotion converted from int64 to int16

28486928
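
Before choosing a smaller type by hand, it helps to confirm the column's range fits, or to let pandas pick the type. A sketch using a hypothetical `onpromotion` Series with a handful of sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of promotion counts stored as the default int64
onpromotion = pd.Series([0, 1, 2, 148, 672], dtype="int64")

# Confirm the observed range fits inside the target type before casting
limits = np.iinfo(np.int16)
fits_int16 = limits.min <= onpromotion.min() and onpromotion.max() <= limits.max

# pd.to_numeric picks the smallest signed integer dtype that holds every value
downcast = pd.to_numeric(onpromotion, downcast="integer")

print(fits_int16, downcast.dtype)
```

Casting with `.astype` to a type that is too small does not necessarily raise an error, so the range check is worth doing first.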

##### **When doing data analysis, have an analysis path and optimize only if required** </br> Don't spend analysis time on memory optimization; do it afterwards, especially if the data is required for a pipeline