This notebook is a guide to loading large datasets with Pandas. For a detailed explanation, please refer to this excellent article: https://www.dataquest.io/blog/pandas-big-data/. I follow its content closely and adapt the code into reusable functions. Let's begin by importing some useful modules.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
%matplotlib inline
import seaborn as sns # advanced visualization
import warnings # control warnings
warnings.filterwarnings('ignore')
import pprint # prettyprint module
import time # measure how much time to run some code

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Helper functions

In [None]:
def mem_usage(pds_obj):
    '''
    MEM_USAGE calculates how much memory used by loading in an object
    
    Inputs:
    - pds_obj: pandas dataframe or series
    
    Outputs:
    - a string showing how much memory already occupied by pds_obj
    '''
    if isinstance(pds_obj, pd.DataFrame):
        usage = pds_obj.memory_usage(deep=True).sum() # sum all memory usages in each column if a dataframe
    else: # a Series
        usage = pds_obj.memory_usage(deep=True)
    # Convert usage from bytes to megabytes
    usage = usage / 1024 ** 2 # traditional conversion, not mathematical conversion
    return "{:.2f} MB".format(usage)
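The `deep=True` flag used in `mem_usage` matters mostly for `object` columns: without it, pandas counts only the pointers stored in the column and not the Python strings they point to. Here is a minimal sketch on made-up data (the series below is illustrative and not taken from `train.csv`):

```python
import pandas as pd

# Illustrative data: 1,000 Python strings stored in an object column
s = pd.Series(['some fairly long string value'] * 1000, dtype='object')

shallow = s.memory_usage(deep=False)  # counts only the pointers (plus the index)
deep = s.memory_usage(deep=True)      # also counts the string payloads

print('shallow:', shallow, 'bytes')
print('deep:   ', deep, 'bytes')
print('deep in MB: {:.2f} MB'.format(deep / 1024 ** 2))
```

The gap between the two numbers is the memory that a shallow measurement would miss.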

# Example with a Toy Dataset
The main idea for memory efficiency is to specify the data types we want pandas to parse when reading in the csv files via the argument `dtype` of the `pd.read_csv` method. To this end, we'll play with a toy dataset, figure out the best data types for memory reduction, save the specifications in a dictionary, and finally pass this dictionary to the argument `dtype` while reading the main csv file.
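To make the idea concrete before touching the real file, here is a small sketch of the `dtype` argument at work. The CSV text, the column names, and the chosen types are all made up for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "id,price,color\n1,9.5,red\n2,3.25,blue\n3,7.0,red\n"

# Default parsing: numbers typically become int64 and float64
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data, but with the data types specified up front
dtypes = {'id': 'uint8', 'price': 'float32', 'color': 'category'}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(default_df.dtypes)
print(typed_df.dtypes)
```

The rest of the notebook is about finding a good `dtypes` dictionary automatically instead of writing it by hand.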

In [None]:
# Use the first 1m rows as toy data
start_time = time.time() # start timer
toy_data = pd.read_csv('../input/train.csv', nrows=1_000_000) # read in the first 1m rows (nrows is a row count, so pass an integer)
end_time = time.time() # end timer
print('Run time: %.2f seconds' %(end_time - start_time)) # total run time

Some readers may wonder why we read in exactly 1m rows rather than some other number. This question is answered at the end of the notebook.

In [None]:
# Memory usage for just one chunk
print('Memory usage for just one chunk:', mem_usage(toy_data))

Thus, the total memory usage for our training data is roughly `2238.98 MB * 9 = 20150.82 MB ~ 20 GB`, since the full file holds about nine times as many rows as the toy data. We definitely need a more efficient way to load the data. This can be done by downcasting `float` and `int` columns and converting `object` columns to `category`. Again, for a more detailed explanation, please refer to the link at the beginning of this notebook.

Indeed, `object` columns usually take up much more memory than the other two types, so converting them to a more compact type is essential for memory efficiency. The following demonstrates this point:

In [None]:
# Total memory usage by each data type in toy_data
data_types = ['int', 'float', 'object']
for dtype in data_types:
    selected_data = toy_data.select_dtypes(include=[dtype])
    print(dtype, 'takes up', mem_usage(selected_data), 'in memory space')

Downcasting data types is the key to memory reduction in pandas. Let's start with `int`.

# Downcasting "int" data types
`int` columns can be downcast to an `unsigned` type to save some memory. We don't have to worry about negative entries, since Pandas keeps the original data type of a column if it cannot be converted.

In [None]:
# Downcast int to unsigned
toy_data_int = toy_data.select_dtypes(include=['int'])
converted_int = toy_data_int.apply(pd.to_numeric, downcast='unsigned')

# Effect of downcasting on int
compared_int = pd.concat([toy_data_int.dtypes, converted_int.dtypes], axis=1)
compared_int.columns = ['before', 'after']
compared_int.apply(pd.Series.value_counts)

Observe that all `int64` columns, where each value takes up 64 bits (8 bytes) of memory, have been converted to `uint8` and `uint16`, where each value takes up only 8 or 16 bits (1 or 2 bytes), respectively. The following result shows that we have made some progress on memory efficiency, though the overall gain is small since we don't have that many `int64` columns:

In [None]:
# Compare memory reduction between toy_data_int and converted_int
# Note that we only consider the `int` data type. So although the result
# may indicate a significant reduction, it is insignificant overall
# since most of our memory space is taken up by (1) `object` and
# (2) `float` data types
print('toy_data_int takes up', mem_usage(toy_data_int))
print('converted_int takes up', mem_usage(converted_int))

In [None]:
# Finally, replace the `int` columns in toy_data with the more efficient columns from converted_int
optimized_toy_data = toy_data.copy()
optimized_toy_data[converted_int.columns] = converted_int
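To see how `downcast='unsigned'` picks a type, and that a column with negative entries is left alone, here is a small sketch on made-up series (the names `small`, `medium`, and `negative` are illustrative only):

```python
import numpy as np
import pandas as pd

small = pd.Series([1, 2, 3], dtype='int64')
medium = pd.Series([1, 2, 300], dtype='int64')     # 300 does not fit in uint8
negative = pd.Series([-1, 2, 300], dtype='int64')  # negative values cannot be unsigned

for name, s in [('small', small), ('medium', medium), ('negative', negative)]:
    converted = pd.to_numeric(s, downcast='unsigned')
    # itemsize is the number of bytes used per value
    print(name, ':', s.dtype, '->', converted.dtype,
          '|', converted.dtype.itemsize, 'byte(s) per value')
```

Pandas chooses the smallest unsigned type that can hold every value in the column, so a single large value is enough to push a column from `uint8` to `uint16`.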

# Downcasting "float" data types
When loading in data, Pandas parses columns of decimal numbers as `float64` by default. Pandas can attempt to downcast them to a type as small as `np.float32`, if it is appropriate to do so.

In [None]:
# Downcast float64 to just float
toy_data_float = toy_data.select_dtypes(include=['float'])
converted_float = toy_data_float.apply(pd.to_numeric, downcast='float')

# Effect of downcasting to just float
compared_float = pd.concat([toy_data_float.dtypes, converted_float.dtypes], axis=1)
compared_float.columns = ['before', 'after']
compared_float.apply(pd.Series.value_counts)

In [None]:
# Compare memory reduction between toy_data_float and converted_float.
# Similar note as in downcasting `int64` to 'unsigned'
print('toy_data_float takes up', mem_usage(toy_data_float))
print('converted_float takes up', mem_usage(converted_float))

In [None]:
# Replace `float64` columns of toy_data_float by new data types
optimized_toy_data[converted_float.columns] = converted_float
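As a quick check of what `downcast='float'` does, here is a sketch on a tiny made-up series. Keep in mind that `float32` holds roughly 7 significant decimal digits, so whether that precision is enough for a given column is an assumption to verify:

```python
import pandas as pd

# Illustrative values that are exactly representable in float32
prices = pd.Series([1.5, 2.25, 3.0], dtype='float64')
converted = pd.to_numeric(prices, downcast='float')

print(prices.dtype, '->', converted.dtype)
# index=False leaves out the index so that only the values are counted
print(prices.memory_usage(index=False), 'bytes ->',
      converted.memory_usage(index=False), 'bytes')
```

Each value shrinks from 8 bytes to 4, which is why the `float` columns take up about half the memory after conversion.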

Thus, so far we have achieved roughly `(1 - 1992.94 / 2238.98) * 100 ~ 11%` of memory reduction,

In [None]:
# Total memory reduction so far
print("toy_data takes up", mem_usage(toy_data), 'in memory space')
print("optimized_toy_data takes up", mem_usage(optimized_toy_data), 'in memory space')

# Downcasting "object" data types
Finally, we can work on converting the `object` data type into `category`. Due to some restrictions of the `category` type (see the link at the beginning of this notebook), we will only convert an `object` column into `category` if fewer than 50% of its values are unique. The downcasting process is slightly different from that of `int` and `float`, since we need to use the `astype` method.

In [None]:
# Isolate `object` data type
toy_data_obj = toy_data.select_dtypes(include=['object'])

In [None]:
# Downcast `object` to `category` if fewer than 50% of the values in a column are unique
converted_obj = pd.DataFrame()

for col in toy_data_obj.columns:
    num_unique_values = len(toy_data_obj[col].unique())
    total_num_values = len(toy_data_obj[col])
    # Only convert to `category` type if fewer than 50% of the values are unique
    if num_unique_values / total_num_values < 0.5:
        converted_obj[col] = toy_data_obj[col].astype('category')
    else:
        converted_obj[col] = toy_data_obj[col]
        
# Effect of downcasting `object` to `category`
compared_obj = pd.concat([toy_data_obj.dtypes, converted_obj.dtypes], axis=1)
compared_obj.columns = ['before', 'after']
compared_obj.apply(pd.Series.value_counts)

In [None]:
# Compare memory reduction between toy_data_obj and converted_obj.
print('toy_data_obj takes up', mem_usage(toy_data_obj))
print('converted_obj takes up', mem_usage(converted_obj))

Wow! We have achieved an amazing reduction. Let's look at the first few rows of `converted_obj`. Note that the entries in these columns appear exactly the same as before conversion, but internally, they have been optimized via encoding.

In [None]:
# Columns appear exactly the same as before conversion
converted_obj.head()

In [None]:
# Internal structure has been optimized for memory efficiency
converted_obj.dtypes
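The "encoding" mentioned above can be inspected directly: a `category` column stores each distinct value once and keeps a small integer code per row. Here is a sketch on a made-up column (not one of the columns in `train.csv`):

```python
import pandas as pd

# Illustrative column with only three distinct values
colors = pd.Series(['red', 'blue', 'red', 'green'] * 2500, dtype='object')
as_category = colors.astype('category')

print(list(as_category.cat.categories))        # the distinct values, stored once
print(as_category.cat.codes.head(4).tolist())  # one small integer code per row
print('object:  ', colors.memory_usage(deep=True), 'bytes')
print('category:', as_category.memory_usage(deep=True), 'bytes')
```

The saving comes from repetition. When most values in a column are unique there is little to share, which is the motivation for the 50% threshold used in this notebook.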

Let's finally replace some of the `object` data types in `optimized_toy_data` by `category` type and observe the significant reduction in memory usage.

In [None]:
optimized_toy_data[converted_obj.columns] = converted_obj

By optimizing the columns, we have achieved an impressive `(1 - 274.64 / 2238.98) * 100 ~ 88%` memory reduction!

In [None]:
# Final total memory reduction
print("toy_data takes up", mem_usage(toy_data), 'in memory space')
print("optimized_toy_data takes up", mem_usage(optimized_toy_data), 'in memory space')

# Reading the Entire Dataset
We will now load in the entire `train.csv`, specifying the data type of each column.

In [None]:
# Get data types of columns
dtypes = optimized_toy_data.dtypes

# Zip columns' names and their types accordingly
col_name = dtypes.index
col_type = [item.name for item in dtypes.values]
parsing_types = dict(zip(col_name, col_type))

# Print out nicely first 10 items with prettyprint
preview = {key: value for key, value in list(parsing_types.items())[:10]}
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(preview)

In [None]:
# Load in the entire dataset
start_time = time.time()
df = pd.read_csv('../input/train.csv', dtype=parsing_types)
end_time = time.time()
print('Run time: %.2f minutes' %((end_time - start_time) / 60)) # total run time
print('Optimized df takes up', mem_usage(df), 'in memory space')
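One caveat: the data types in `parsing_types` were inferred from the first 1m rows only, so they are an assumption about the remaining rows. A later row holding a larger value, a negative value, or a missing value may not fit an inferred integer type. The sketch below, on made-up data and with a hypothetical helper `fits_integer_dtype`, scans a single column in chunks and checks its range against `np.iinfo`:

```python
import io
import numpy as np
import pandas as pd

def fits_integer_dtype(csv_source, column, dtype, chunksize=100_000):
    '''
    Hypothetical helper: returns True if every value of `column`
    lies within the range of the integer type `dtype`.
    '''
    limits = np.iinfo(np.dtype(dtype))
    # Read only one column, a few rows at a time, to keep memory low
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        values = chunk[column]
        if values.isna().any():
            return False  # missing values cannot be stored in a NumPy integer column
        if values.min() < limits.min or values.max() > limits.max:
            return False
    return True

# Illustrative file content: the last row does not fit in uint8
csv_text = "count\n1\n2\n3\n300\n"
print(fits_integer_dtype(io.StringIO(csv_text), 'count', 'uint8', chunksize=2))
print(fits_integer_dtype(io.StringIO(csv_text), 'count', 'uint16', chunksize=2))
```

A check like this costs an extra pass over the file, so it is a trade-off rather than a required step.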

# Putting Pieces Together
The following functions form the pipeline for reading any large dataset in Pandas. `mem_usage`, `downcast_int`, `downcast_float`, `downcast_obj`, and `optimal_dtypes` are helper functions. `load_dataset` is the main function that optimizes your datasets. You may use it to read in `train.csv` to train your models and `test.csv` to test them.

In [None]:
def mem_usage(pds_obj):
    '''
    MEM_USAGE calculates how much memory used by loading in an object
    
    Inputs:
    - pds_obj: pandas dataframe or series
    
    Outputs:
    - a string showing how much memory already occupied by pds_obj
    '''
    if isinstance(pds_obj, pd.DataFrame):
        usage = pds_obj.memory_usage(deep=True).sum() # sum all memory usages in each column if a dataframe
    else: # a Series
        usage = pds_obj.memory_usage(deep=True)
    # Convert usage from bytes to megabytes
    usage = usage / 1024 ** 2 # traditional conversion, not mathematical conversion
    return "{:.2f} MB".format(usage)

In [None]:
def downcast_int(toydf):
    '''
    DOWNCAST_INT downcasts variables of int type to unsigned int type, whenever possible, for memory reduction.
    
    Inputs:
    - toydf: a dataframe
    
    Outputs:
    - converted_int: a new dataframe with some columns, possibly all, of unsigned int type
    '''
    # Isolate columns of `int` type
    toydf_int = toydf.select_dtypes(include=['int'])
    # Downcast each column to `unsigned` with apply function
    converted_int = toydf_int.apply(pd.to_numeric, downcast='unsigned')
    
    return converted_int

In [None]:
def downcast_float(toydf):
    '''
    DOWNCAST_FLOAT downcasts variables of float64 type to float type, whenever possible, for memory reduction.
    The smallest possible type is np.float32.
    
    Inputs:
    - toydf: a dataframe
    
    Outputs:
    - converted_float: a new dataframe with some columns, possibly all, of float type
    '''
    # Isolate columns of `float` type
    toydf_float = toydf.select_dtypes(include=['float'])
    # Downcast each column to `float` with apply function
    converted_float = toydf_float.apply(pd.to_numeric, downcast='float')
    
    return converted_float

In [None]:
def downcast_obj(toydf):
    '''
    DOWNCAST_OBJ downcasts variables of object type to category type, whenever possible, for memory reduction.
    
    Inputs:
    - toydf: a dataframe
    
    Outputs:
    - converted_obj: a new dataframe with some columns, possibly all, of category type.
      Because of how Pandas stores categoricals, only columns where fewer than 50% of the values are unique are converted to category type.
      See link `http://pandas.pydata.org/pandas-docs/stable/categorical.html#gotchas` for more info.
    '''
    # Isolate columns of `object` type
    toydf_obj = toydf.select_dtypes(include=['object'])
    # Initialize a dataframe that will hold converted object
    converted_obj = pd.DataFrame()
    # Downcast `object` type to `category` type
    for col in toydf_obj.columns:
        num_unique_values = len(toydf_obj[col].unique())
        total_num_values = len(toydf_obj[col])
        if num_unique_values / total_num_values < 0.5: # only downcast columns where fewer than 50% of the values are unique
            converted_obj[col] = toydf_obj[col].astype('category')
        else:
            converted_obj[col] = toydf_obj[col]
    
    return converted_obj

In [None]:
def optimal_dtypes(toydf):
    '''
    OPTIMAL_DTYPES finds the optimal data type of each column for memory reduction
    
    Inputs:
    - toydf: a dataframe
    
    Outputs:
    - dict_of_optimalDataTypes: dictionary of (key, value) = (column name, optimal data type) 
      and is used to parse data when loading in a dataset
    '''
    # Work on a copy of toydf so the original data is left untouched
    optimized_toydf = toydf.copy()
    
    # Optimize columns of `int` type
    converted_int = downcast_int(optimized_toydf)
    optimized_toydf[converted_int.columns] = converted_int
    
    # Optimize columns of `float` type
    converted_float = downcast_float(optimized_toydf)
    optimized_toydf[converted_float.columns] = converted_float
    
    # Optimize columns of `object` type
    converted_obj = downcast_obj(optimized_toydf)
    optimized_toydf[converted_obj.columns] = converted_obj
    
    # Create a dictionary where (key, value) = (column name, optimal data type)
    series_of_optimalDataTypes = optimized_toydf.dtypes # a pandas series
    col_names = series_of_optimalDataTypes.index
    col_types = [item.name for item in series_of_optimalDataTypes.values]
    dict_of_optimalDataTypes = dict(zip(col_names, col_types)) # dictionary to parse data when loading in a dataset
    
    # Returns
    return dict_of_optimalDataTypes

In [None]:
def load_dataset(df_name):
    '''
    LOAD_DATASET reads in the entire dataset whose columns have been optimized for memory reduction.
    This function also prints out total loading run time and memory usage of the dataset.
    
    Inputs:
    - df_name (str): name of file for reading
    
    Outputs:
    - df: dataframe with optimized columns
    '''
    # Start timer
    start_time = time.time()
    
    # Find optimal data types to parse the entire dataset
    path = '../input/' + df_name # path to the dataset
    toy_data = pd.read_csv(path, nrows=1_000_000) # get a sample dataset from original data to analyze
    parsing_types = optimal_dtypes(toy_data) # optimal data types
    
    # Load in the entire dataset with optimal data types
    df = pd.read_csv(path, dtype=parsing_types)
    
    # End timer
    end_time = time.time()
    
    # Report run time and memory usage, then return df
    print('Run time: %.2f minutes' %((end_time - start_time) / 60)) # total run time
    print('This df takes up', mem_usage(df), 'in memory space') # memory usage
    return df

In [None]:
# Let's test our main function, load_dataset
print("Training data")
df_train = load_dataset('train.csv')
print("========================================")
print("Test data")
df_test = load_dataset('test.csv')

# Final Thoughts
1. As promised above, here is why I chose to read in 1m rows rather than some other number. `train.csv` has roughly 9m observations, so 1m rows felt like a reasonable sample. I could have used 500 rows as toy data, but then the conversion from `object` type to `category` type would have been trickier, since so few rows cannot give a rough estimate of the number of unique values in each column of the entire dataset. On the other hand, I didn't want to use, say, 2m rows as my toy data, simply because it would take too much time. Recall that we are only at the stage of reading in data; no analysis has been done yet!
1. My kernel was inspired by [Theo](https://www.kaggle.com/theoviel/load-the-totality-of-the-data)'s :) I was just obsessed with a systematic way to work with **any** large dataset with pandas and that was how I googled to find the link at the beginning of my kernel. Thank you Theo very much for the inspiration!
1. If your dataset has `datetime` columns, you can isolate and treat them separately (though in a similar manner).
1. It would be nice to organize the above functions into a class for convenience. I'm not very good at this :( ... so I hope someone can help me with this task.
1. I love Pandas (whether you are talking about the programming library or the animals), but I think Dask will be a better tool for very large datasets. I've heard of some of its downsides compared to Pandas, e.g., the lack of certain methods for data analysis, but Dask keeps getting better.
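One possible way to organize the functions above into a class is sketched below. The class name `OptimizedCSVLoader`, its parameters, and the temporary toy file are all hypothetical; this is a starting point under those assumptions rather than a finished design.

```python
import os
import tempfile
import pandas as pd
from pandas.api import types as pdt


class OptimizedCSVLoader:
    '''
    Hypothetical wrapper around the pipeline above: infer compact data types
    from a sample of rows, then reuse them to read the whole file.
    '''
    def __init__(self, sample_rows=1_000_000, category_threshold=0.5):
        self.sample_rows = sample_rows
        self.category_threshold = category_threshold
        self.dtypes_ = None  # filled in by fit()

    def fit(self, path):
        '''Infer a {column name: data type} dictionary from the first rows.'''
        sample = pd.read_csv(path, nrows=self.sample_rows)
        dtypes = {}
        for col in sample.columns:
            column = sample[col]
            if pdt.is_integer_dtype(column):
                column = pd.to_numeric(column, downcast='unsigned')
            elif pdt.is_float_dtype(column):
                column = pd.to_numeric(column, downcast='float')
            elif pdt.is_object_dtype(column) or pdt.is_string_dtype(column):
                share_unique = column.nunique(dropna=False) / max(len(column), 1)
                if share_unique < self.category_threshold:
                    column = column.astype('category')
            dtypes[col] = column.dtype.name
        self.dtypes_ = dtypes
        return self

    def load(self, path):
        '''Read the whole file, inferring data types first if needed.'''
        if self.dtypes_ is None:
            self.fit(path)
        return pd.read_csv(path, dtype=self.dtypes_)


# Usage on a small made-up file written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    pd.DataFrame({
        'id': range(100),
        'price': [1.5, 2.25] * 50,
        'color': ['red', 'blue'] * 50,
    }).to_csv(path, index=False)

    loader = OptimizedCSVLoader(sample_rows=50)
    df = loader.load(path)

print(loader.dtypes_)
print(df.dtypes)
```

Keeping the inferred dictionary on the object (`dtypes_`) means the types found on `train.csv` could be reused for `test.csv` by calling `load` again on the same loader, assuming both files share the same columns.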