# Testing out DataTable
In this post, I am taking a look at the Python `datatable` package. I cannot express how much I __LOVE__ the R `data.table` package. It single-handedly made my R workflow. It's fast, memory efficient, has a fun community, and a pleasant API. Since I started using Python as my main data science language, I have longed for a tool as succinct and efficient for handling large data in Python as the `data.table` package is for R.  
Now obviously, when you mention tabular data and Python, the first thing you will likely think of is the `pandas` package. Pandas is a fabulous Python library; it's fun, has a huge ecosystem, and couples well with nearly every popular Python scientific computing tool. The first criticisms people raise about pandas are _"It's so slow! And inefficient"_ or _"It's so bloated! And has a lot of warts!"_. I think these statements are a little unfair. I have been using pandas every day for the past few years to process millions of records and perform data analysis. With a package as large and mature as pandas, it's very possible to find parts of the API that are less efficient than others. So while I would consider all parts of pandas usable, it sometimes takes a little trial and error to use it as efficiently as needed for a large-data workflow (files of 800K-4MM records).  

Enter Python's `datatable`... I was immediately excited about this package when I discovered it. It boasted an API and speed similar to the R `data.table` package. It had many similar functions, was largely built on a bare-metal language, and borrowed many of the data handling implementations from its R counterpart. However, the first no-go for me was that it was unusable on Windows. Whether I like it or not, my company heavily utilizes the Microsoft ecosystem, so I needed a tool that would build on Windows. This put my interest in the package on hold for a while. I checked in every once in a while, until that day came... issue [1114](https://github.com/h2oai/datatable/issues/1114) "Support datatable on windows" was closed! So here we are about a year later, and I am finally getting to testing it out.  

A few things I am hoping for from this package that exist in its R sibling package (these are largely compared to pandas, because as much as I just said I love pandas, if it had all these things... I wouldn't be looking into the datatable package):  

  - __Fast CSV IO__: CSV files are one of the most common file types in existence. Sadly, while it is an easy file format to grok (there are values... and they are separated by commas), there are a lot of tools out there that will write malformed CSV files, and, at the other extreme, an incredibly small number that can actually read most of the CSV data found in the wild. The R `data.table::fread` function is a __BEAST__! It rips through gigabytes of CSV files like a lightsaber through butter _(much more effective than a KNIFE through butter)_. This is sadly an area that pandas is not quite as proficient in. While the pandas `read_csv` function is great, and actually does a really good job with messy delimited files, it is _so slow_. Additionally, every other file it complains about encoding issues (hello `encoding="latin1"`), but maybe this comes down to a difference in how R and Python deal with encoding? I am not totally sure.
  - __Well supported missing values__: This is certainly an artifact of the R language itself. R supports missing values for character, integer and float data types. Pandas supports a less cohesive list. It has the numpy float `nan` value, and then sometimes uses `None` for strings, and then other times not... All that to say, while I think the project is converging on consistent missing values, it is not as cut and dried as the missing types in R are. The R `data.table` package, which uses native R types throughout, handles missing values just as consistently.
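
To make the missing-value point concrete, here is a minimal sketch of how pandas represents a missing entry depending on the column type. The series below are made up for illustration, and the nullable `Int64` dtype and `pd.NA` marker assume a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# Float columns use the numpy float NaN as the missing marker.
floats = pd.Series([1.5, None, 3.0])

# Integer input containing a missing value is silently upcast to float64.
ints = pd.Series([1, None, 3])

# Object (string) columns keep whichever marker they were given,
# so None and NaN can end up side by side in the same column.
strings = pd.Series(["a", None, np.nan], dtype=object)

# The nullable extension dtypes use a single marker, pd.NA.
nullable_ints = pd.Series([1, None, 3], dtype="Int64")

print(floats.dtype, ints.dtype, strings.dtype, nullable_ints.dtype)
print(strings.isna().tolist())
```

All of these count as missing under `isna()`, but the underlying values differ, which is the inconsistency I am grumbling about.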

Outside of these points, I am really just hoping for a fun, fast package that is straightforward to use _(come on `datatable`, I believe in you)_.  
Let's get started.

## Getting That Data In
First, let's take a look at the IO side of things. This is check box number 1, and one I feel quite positive about. The `datatable` package implements the `fread` function from its R counterpart, so it seems like it can easily win this one by reusing a lot of the same functionality.

In [1]:
import datatable as dt

In [5]:
tb = dt.fread("data/application_train.csv")

In [6]:
tb.shape

(307511, 122)

That seemed pretty fast... How does it stack up against the pandas CSV reader?

In [7]:
import pandas as pd

In [8]:
%%timeit
pd.read_csv("data/application_train.csv")

1.87 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%%timeit
dt.fread("data/application_train.csv")

178 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


So right away, we see that the `fread` function is roughly 10 times faster than the pandas equivalent (178 ms versus 1.87 s). That's a pretty nice bump in speed.
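
Outside a notebook, where `%%timeit` is not available, a similar comparison can be sketched with the standard library `timeit` module. The `best_time` helper and the small generated CSV are my own stand-ins for illustration, not part of either library, and the `datatable` timing is only taken if the package happens to be installed.

```python
import csv
import os
import tempfile
import timeit

import pandas as pd


def best_time(reader, path, repeat=3):
    """Return the fastest wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(lambda: reader(path), number=1, repeat=repeat))


# Build a small throwaway CSV so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows((i, i * 0.5) for i in range(10_000))
    path = f.name

timings = {"pandas.read_csv": best_time(pd.read_csv, path)}

try:
    import datatable as dt

    timings["datatable.fread"] = best_time(dt.fread, path)
except ImportError:
    pass  # datatable is not installed; only the pandas timing is reported

os.remove(path)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

On a file this small the numbers will look nothing like the ones above; point `path` at a real file to get a meaningful comparison.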

In [10]:
df = pd.read_csv("data/application_train.csv")

In [16]:
df.memory_usage(deep=True).sum()

562761929

In [17]:
import sys

In [18]:
sys.getsizeof(df)

562761945

In [19]:
sys.getsizeof(tb)

240272739
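
Taking `sys.getsizeof` at face value, the `datatable` frame comes in at well under half the size of the pandas one. The two pandas numbers also agree to within 16 bytes, which, as far as I understand, is because pandas reports its deep memory usage to `sys.getsizeof` and the interpreter adds a small overhead on top. Here is a minimal sketch of the shallow versus deep distinction on a toy frame (the column names and values are made up for illustration):

```python
import sys

import pandas as pd

# Toy frame: one integer column and one object column of Python strings.
df = pd.DataFrame({
    "id": range(1000),
    "label": pd.Series(["cash loans"] * 1000, dtype=object),
})

# Shallow: counts only the pointers stored in the object column.
shallow = int(df.memory_usage(deep=False).sum())

# Deep: also counts the Python string objects those pointers refer to.
deep = int(df.memory_usage(deep=True).sum())

# sys.getsizeof asks the frame for its own size and adds interpreter overhead.
reported = sys.getsizeof(df)

print(shallow, deep, reported - deep)
```

The gap between the shallow and deep figures is a hint at where pandas spends memory on string-heavy data.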