##### **Data Types** </br> Pandas data types mostly extend their base Python and NumPy equivalents </br>

In [22]:
import pandas as pd
import numpy as np

##### **Data Types**
| Data Type         | Library | Description                    | Bit Size         |
|-------------------|---------|--------------------------------|------------------|
| bool              | NumPy   | Boolean True/False             | 8                |
| int64             | NumPy   | Whole numbers                  | 8, 16, 32, 64    |
| float64           | NumPy   | Decimal numbers                | 16, 32, 64       |
| object            | NumPy   | Any Python object              | N/A              |
| boolean           | Pandas  | Nullable Boolean True/False    | 8                |
| Int64             | Pandas  | Nullable whole numbers         | 8, 16, 32, 64    |
| Float64           | Pandas  | Nullable decimal numbers       | 32, 64           |
| string            | Pandas  | Text/string data               | N/A              |
| category          | Pandas  | Maps categorical data to a numerical array for efficiency | N/A |
| datetime64[ns]    | Pandas  | A single moment in time (e.g. January 4, 2015, 2:00:00 PM) | 64 |
| timedelta64[ns]   | Pandas  | Duration between two dates or times | 64          |
| period            | Pandas  | A span of time                 | 64               |
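
The difference between the NumPy-backed and pandas nullable types in the table is easiest to see with a missing value. The sketch below uses made-up toy values (not the course data) and only shows the resulting dtypes; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# NumPy-backed integers cannot hold a missing value, so pandas upcasts to float64
numpy_backed = pd.Series([1, 2, None])

# The pandas nullable extension type keeps the integers and stores the gap as pd.NA
nullable_int = pd.Series([1, 2, None], dtype="Int64")

# A smaller bit size trades value range for memory (1 byte per value here)
small_ints = pd.Series([1, 2, 3], dtype="int8")

print(numpy_backed.dtype)  # float64
print(nullable_int.dtype)  # Int64
print(small_ints.dtype)    # int8
```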


##### The pandas categorical data type stores text data with repeated values efficiently </br> Pandas maps each unique category to an integer to save space </br> Only consider this data type when </br> `unique categories < (number_of_df_rows / 2)`
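
One way to apply this rule of thumb is a small check on each text column. The helper name `is_category_candidate` and the toy DataFrame are hypothetical, introduced only for illustration.

```python
import pandas as pd


def is_category_candidate(column: pd.Series) -> bool:
    """Hypothetical helper: True when unique values < half the row count."""
    return column.nunique() < len(column) / 2


# Toy data (assumed for illustration): 'family' repeats, 'ticket' never does
toy = pd.DataFrame({
    "family": ["BOOKS", "PRODUCE", "BOOKS", "PRODUCE", "BOOKS", "BOOKS"],
    "ticket": ["a1", "b2", "c3", "d4", "e5", "f6"],
})

print(is_category_candidate(toy["family"]))  # True: 2 unique values < 3
print(is_category_candidate(toy["ticket"]))  # False: 6 unique values
```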


In [23]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [24]:
# memory_usage='deep' inspects the actual contents of object dtype columns, giving a more accurate memory figure
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   id           1054944 non-null  int64  
 1   date         1054944 non-null  object 
 2   store_nbr    1054944 non-null  int64  
 3   family       1054944 non-null  object 
 4   sales        1054944 non-null  float64
 5   onpromotion  1054944 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 167.8 MB


In [25]:
# convert the 'family' column of retail_df to category and compare memory usage
retail_df = retail_df.astype({'family':'category'})
retail_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1054944 entries, 0 to 1054943
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype   
---  ------       --------------    -----   
 0   id           1054944 non-null  int64   
 1   date         1054944 non-null  object  
 2   store_nbr    1054944 non-null  int64   
 3   family       1054944 non-null  category
 4   sales        1054944 non-null  float64 
 5   onpromotion  1054944 non-null  int64   
dtypes: category(1), float64(1), int64(3), object(1)
memory usage: 100.6 MB


##### **Changing the `'family'` column from `object` to `category` saved 67.2 MB of memory (167.8 MB down to 100.6 MB) </br> This reduces the resources the DataFrame uses, especially during data manipulation**
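
To measure the saving for a single column rather than reading it off `.info()`, `Series.memory_usage(deep=True)` returns the size in bytes. The sketch below runs on a small synthetic column (an assumption, since the retail CSV may not be at hand), so the numbers differ from the ones above.

```python
import pandas as pd

# Synthetic stand-in for the 'family' column: 3 labels repeated 10,000 times
family_obj = pd.Series(["AUTOMOTIVE", "BABY CARE", "BEAUTY"] * 10_000, dtype="object")
family_cat = family_obj.astype("category")

bytes_as_object = family_obj.memory_usage(deep=True)
bytes_as_category = family_cat.memory_usage(deep=True)
saved_kb = (bytes_as_object - bytes_as_category) / 1024

print(f"object:   {bytes_as_object:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
print(f"saved:    {saved_kb:,.1f} KB")
```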

In [26]:
sample_df = retail_df.sample(10, random_state=616)
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
79237,2025181,2016-02-14,32,BOOKS,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1
615101,2561045,2016-12-11,18,HARDWARE,0.0,0


In [27]:
# after converting to category, pandas stores an integer code for each 'family' value behind the scenes, but the DataFrame still displays the text labels
sample_df.astype({'family':'category'})

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
79237,2025181,2016-02-14,32,BOOKS,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1
615101,2561045,2016-12-11,18,HARDWARE,0.0,0
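
The integer codes stored behind the text labels can be inspected through the `.cat` accessor. This is a small sketch on a toy Series (values borrowed from the sample above); the codes depend on the sorted order of the categories.

```python
import pandas as pd

family = pd.Series(["PRODUCE", "HARDWARE", "BOOKS", "PRODUCE"], dtype="category")

# Unique labels, stored once (sorted when inferred from the data)
print(family.cat.categories.tolist())  # ['BOOKS', 'HARDWARE', 'PRODUCE']

# One small integer per row, pointing into the categories above
print(family.cat.codes.tolist())       # [2, 1, 0, 2]
print(family.cat.codes.dtype)          # int8
```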


##### **Type Conversion** </br> Convert data types in a DataFrame with the `.astype()` method by specifying the desired data type (the values must be compatible with it) </br> Pass a dictionary of `{column: dtype}` to convert several columns at once

In [28]:
# create a new column with sales as integers (astype drops the decimal part rather than rounding)
sample_df['sales_int'] = sample_df['sales'].astype('int')
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 399033 to 615101
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           10 non-null     int64   
 1   date         10 non-null     object  
 2   store_nbr    10 non-null     int64   
 3   family       10 non-null     category
 4   sales        10 non-null     float64 
 5   onpromotion  10 non-null     int64   
 6   sales_int    10 non-null     int32   
dtypes: category(1), float64(1), int32(1), int64(3), object(1)
memory usage: 1.8+ KB
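
Converting floats with `.astype('int')` drops the fractional part instead of rounding, and the resulting bit size can depend on the platform (the output above shows `int32`). A minimal sketch with toy values, assuming rounding is what is wanted, rounds first and names the bit size explicitly:

```python
import pandas as pd

sales = pd.Series([487.239, 5212.624, 9679.143, -2.7])

# astype alone truncates toward zero
truncated = sales.astype("int64")

# round() first to get the nearest whole number
rounded = sales.round().astype("int64")

print(truncated.tolist())  # [487, 5212, 9679, -2]
print(rounded.tolist())    # [487, 5213, 9679, -3]
```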


In [29]:
# convert multiple column data types with a dictionary and reassign the result to sample_df
sample_df = sample_df.astype({'date':'datetime64[ns]', 'onpromotion':'float'})
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 399033 to 615101
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           10 non-null     int64         
 1   date         10 non-null     datetime64[ns]
 2   store_nbr    10 non-null     int64         
 3   family       10 non-null     category      
 4   sales        10 non-null     float64       
 5   onpromotion  10 non-null     float64       
 6   sales_int    10 non-null     int32         
dtypes: category(1), datetime64[ns](1), float64(2), int32(1), int64(2)
memory usage: 1.8 KB


In [30]:
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,487.239,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.143,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,0.0,0.0,0


In [31]:
# turn the sales values into '$'-prefixed strings (object dtype) to demonstrate cleaning them and converting back to float
sample_df['sales'] = sample_df['sales'].round(2).map(lambda x: f'${x}')
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,$487.24,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,$0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,$3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,$7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,$5212.62,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,$0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,$238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,$0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,$9679.14,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,$0.0,0.0,0


In [33]:
# sales cannot be converted to float directly because the strings contain a '$' character
# sample_df['sales'].astype('float')
# ValueError: could not convert string to float: '$487.24'
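
When a conversion like this fails on a larger column, it can help to find which values are the problem. One possible approach, sketched here on toy strings, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable values into `NaN` instead of raising:

```python
import pandas as pd

sales = pd.Series(["$487.24", "$0.0", "3.0"], dtype="object")

# astype raises on the first string it cannot parse
try:
    sales.astype("float")
except ValueError as err:
    print(f"astype failed: {err}")

# errors="coerce" converts what it can and marks the rest as NaN
parsed = pd.to_numeric(sales, errors="coerce")

# Use the NaN positions to list the offending original values
print(sales[parsed.isna()].tolist())  # ['$487.24', '$0.0']
```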

In [37]:
# clean the column first: strip the '$' from each string, then chain .astype('float')
# .assign() returns a new DataFrame, so sample_df itself is unchanged unless the result is reassigned
# in .assign() the column name is a keyword argument (sales=...), not a quoted string in brackets
sample_df.assign(sales = sample_df['sales'].str.strip('$').astype('float'))

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,sales_int
399033,2344977,2016-08-11,54,PRODUCE,487.24,1.0,487
579626,2525570,2016-11-21,22,HARDWARE,0.0,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0.0,3
534555,2480499,2016-10-26,8,LINGERIE,7.0,0.0,7
96159,2042103,2016-02-23,7,PRODUCE,5212.62,0.0,5212
79237,2025181,2016-02-14,32,BOOKS,0.0,0.0,0
830813,2776757,2017-04-12,20,BREAD/BAKERY,238.0,8.0,238
151831,2097775,2016-03-26,19,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,0
380850,2326794,2016-08-01,44,PRODUCE,9679.14,1.0,9679
615101,2561045,2016-12-11,18,HARDWARE,0.0,0.0,0
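
Two details of this pattern are worth a quick check: `.assign()` leaves the original DataFrame untouched, and real currency strings often carry thousands separators as well. The commas in the toy data below are an assumption for illustration; the course data above does not contain them.

```python
import pandas as pd

toy = pd.DataFrame({"sales": ["$487.24", "$1,200.50", "$0.0"]})

# Remove the literal '$' and ',' characters, then convert to float
cleaned = toy.assign(
    sales=toy["sales"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("float")
)

print(toy["sales"].tolist())      # original strings, unchanged
print(cleaned["sales"].tolist())  # [487.24, 1200.5, 0.0]
```

Reassigning (`toy = toy.assign(...)`) is what makes the cleaned column permanent.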
