Lightweight DataFrame for data science.

DFLite: The Quick & Easy DataFrame

The data frame was a concept I first came across in R, where it is a fundamental component of data analysis. Never having done much data analysis in Python, I came across a situation where I needed a data frame but didn't know about the pandas implementation, so I went about writing my own DataFrame class. When I realized the pandas version existed, I immediately switched all my code over, only to find that for my application the pandas DataFrame was more than twice as slow as my lightweight DataFrame. So I spent some time making sure the interfaces were the same (at least for what I was doing) and made some nice 'this is what you would expect' modifications for future personal use.

In general, the pandas data frame performs quite well, especially with large datasets. I'm sure that this class has quite a few holes in it, but the idea of a lightweight DataFrame for Python is worth pursuing in the future. You can find the source code for dflite on GitHub, including a copy of this notebook. Here's some general usage.

Importing

Importing is easy: the only dependency is NumPy.

import dflite as df

Creating a DataFrame

Usually all I want to do is create a DataFrame from a CSV file, but there are a couple of other ways to construct the class in code. The DataFrame.from_records() method is probably the most useful, creating a DataFrame from an iterable grouped by record (items coming out of a Postgres database via psycopg2 are a good example). This is more or less equivalent to the pandas method of the same name. For now, we'll demo the class with the small CSV included in the directory.

data = df.read_csv("test.csv")
data.head()
   Time (UTC)           Latitude     Longitude
0  2016-03-02 17:50:18  45.10303743  -64.29103034
1  2016-03-02 17:50:19  45.10291441  -64.29095464
2  2016-03-02 17:50:20  45.10279595  -64.29089237
3  2016-03-02 17:50:21  45.1026838   -64.29084603
4  2016-03-02 17:50:22  45.10259138  -64.29080328
5  2016-03-02 17:50:23  45.10251977  -64.29080362

The read_csv() function is pretty much the same as the pandas version, at least for simple usage. You can also pass in a file-like object and a driver= parameter. Currently only CSV files are supported, but they're the most common, so hey.
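To make the file-like-object case concrete, here is a minimal sketch that uses only the standard library's csv module. It is not dflite's implementation: the helper name read_columns is hypothetical, and it keeps every value as a string instead of building NumPy arrays.

```python
import csv
import io


def read_columns(source):
    """Read CSV data into a dict of column name -> list of strings.

    `source` may be a filename or an open file-like object.
    Hypothetical helper for illustration; not part of dflite.
    """
    if isinstance(source, str):
        with open(source, newline="") as f:
            return read_columns(f)
    reader = csv.reader(source)
    header = next(reader)  # first line holds the column names
    columns = {name: [] for name in header}
    for record in reader:
        # Append each field of the record to its column.
        for name, value in zip(header, record):
            columns[name].append(value)
    return columns


# An in-memory file-like object standing in for test.csv.
sample = io.StringIO(
    "Time (UTC),Latitude,Longitude\n"
    "2016-03-02 17:50:18,45.10303743,-64.29103034\n"
    "2016-03-02 17:50:19,45.10291441,-64.29095464\n"
)
columns = read_columns(sample)
print(list(columns))
print(columns["Latitude"])
```

Storing the data by column rather than by record is what makes a lookup like data["col1"] cheap later on.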

The tail() method works much the same as the head() method.

data.tail(3)
   Time (UTC)           Latitude     Longitude
0  2016-03-02 17:50:54  45.09937809  -64.29696471
1  2016-03-02 17:50:55  45.09924766  -64.2972626
2  2016-03-02 17:50:56  45.09911724  -64.29754859

Notice here how the indices are 0, 1, and 2, where they should be the last few indices of the DataFrame. The pandas.DataFrame supports (I would argue is slightly obsessed with) the idea of indexes for rows/columns. For rows, I almost never have a reason to access them by anything other than an integer (certainly not a string), so I don't bother with them. Since tail() is basically just for checking the DataFrame, I didn't bother to re-number the rows: tail() is actually just shorthand for data.iloc[(len(data)-nrows):len(data)], so the result is really its own DataFrame object.
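The renumbering follows directly from tail() being a slice. A rough sketch with a plain Python list standing in for the DataFrame (the tail function and the 39 placeholder rows here are illustrative assumptions, not dflite code):

```python
def tail(rows, nrows):
    """Return the last `nrows` items using the slice described above."""
    start = max(len(rows) - nrows, 0)
    return rows[start:len(rows)]


# 39 placeholder rows, labelled with their original positions.
rows = ["row%d" % i for i in range(39)]
last = tail(rows, 3)

# The slice is a new sequence, so its positions start again at 0.
for position, value in enumerate(last):
    print(position, value)
```

The values come from original positions 36 to 38, but the new object only knows them as positions 0 to 2.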

Columns

Column names can be accessed and set just like the pandas version:

data.columns
['Time (UTC)', 'Latitude', 'Longitude']
data.columns = ("col1", "col2", "col3")
data.head()
   col1                 col2         col3
0  2016-03-02 17:50:18  45.10303743  -64.29103034
1  2016-03-02 17:50:19  45.10291441  -64.29095464
2  2016-03-02 17:50:20  45.10279595  -64.29089237
3  2016-03-02 17:50:21  45.1026838   -64.29084603
4  2016-03-02 17:50:22  45.10259138  -64.29080328
5  2016-03-02 17:50:23  45.10251977  -64.29080362

Column values can be accessed, added, and removed in a similar way:

data["col1"]
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
       '2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
       '2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
       '2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
       '2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
       '2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
       '2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
       '2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
       '2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
       '2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
       '2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
       '2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
       '2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'], 
      dtype='<U19')

Columns can also be accessed by integer position (plain brackets can't do this in pandas, where data[0] looks up a column labelled 0 and positional access goes through data.iloc[:, 0]):

data[0]
array(['2016-03-02 17:50:18', '2016-03-02 17:50:19', '2016-03-02 17:50:20',
       '2016-03-02 17:50:21', '2016-03-02 17:50:22', '2016-03-02 17:50:23',
       '2016-03-02 17:50:24', '2016-03-02 17:50:25', '2016-03-02 17:50:26',
       '2016-03-02 17:50:27', '2016-03-02 17:50:28', '2016-03-02 17:50:29',
       '2016-03-02 17:50:30', '2016-03-02 17:50:31', '2016-03-02 17:50:32',
       '2016-03-02 17:50:33', '2016-03-02 17:50:34', '2016-03-02 17:50:35',
       '2016-03-02 17:50:36', '2016-03-02 17:50:37', '2016-03-02 17:50:38',
       '2016-03-02 17:50:39', '2016-03-02 17:50:40', '2016-03-02 17:50:41',
       '2016-03-02 17:50:42', '2016-03-02 17:50:43', '2016-03-02 17:50:44',
       '2016-03-02 17:50:45', '2016-03-02 17:50:46', '2016-03-02 17:50:47',
       '2016-03-02 17:50:48', '2016-03-02 17:50:49', '2016-03-02 17:50:50',
       '2016-03-02 17:50:51', '2016-03-02 17:50:52', '2016-03-02 17:50:53',
       '2016-03-02 17:50:54', '2016-03-02 17:50:55', '2016-03-02 17:50:56'], 
      dtype='<U19')
data["newcol"] = 10
data.head()
   col1                 col2         col3          newcol
0  2016-03-02 17:50:18  45.10303743  -64.29103034  10
1  2016-03-02 17:50:19  45.10291441  -64.29095464  10
2  2016-03-02 17:50:20  45.10279595  -64.29089237  10
3  2016-03-02 17:50:21  45.1026838   -64.29084603  10
4  2016-03-02 17:50:22  45.10259138  -64.29080328  10
5  2016-03-02 17:50:23  45.10251977  -64.29080362  10
data["newcol"] = data["newcol"] + 4
data.head()
   col1                 col2         col3          newcol
0  2016-03-02 17:50:18  45.10303743  -64.29103034  14
1  2016-03-02 17:50:19  45.10291441  -64.29095464  14
2  2016-03-02 17:50:20  45.10279595  -64.29089237  14
3  2016-03-02 17:50:21  45.1026838   -64.29084603  14
4  2016-03-02 17:50:22  45.10259138  -64.29080328  14
5  2016-03-02 17:50:23  45.10251977  -64.29080362  14
del data["newcol"]
data.head()
   col1                 col2         col3
0  2016-03-02 17:50:18  45.10303743  -64.29103034
1  2016-03-02 17:50:19  45.10291441  -64.29095464
2  2016-03-02 17:50:20  45.10279595  -64.29089237
3  2016-03-02 17:50:21  45.1026838   -64.29084603
4  2016-03-02 17:50:22  45.10259138  -64.29080328
5  2016-03-02 17:50:23  45.10251977  -64.29080362

Rows

In pandas, rows are accessed through the iloc attribute, so after considerable changing of code, mine are too. Here, data.iloc[3] will give the fourth row (as a dict-like object), and data.iloc[3, :] will give a DataFrame with only one row. The pandas version also has a loc[] option where names can be specified, but in this implementation iloc and loc are identical, so you can pass more or less anything between the brackets and get a sensible result.

row = data.iloc[3]
row
col1                 col2        col3
2016-03-02 17:50:21  45.1026838  -64.29084603

Of course, I've made the nice _repr_html_() method so it displays nicely, but each row is actually a _DFRow object, which is a subclass of dict that keeps its values in order. This means you can index it by column name or by position.

row["col1"]
'2016-03-02 17:50:21'
row[0]
'2016-03-02 17:50:21'
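One way a dict subclass can support both kinds of lookup is sketched below. The class name OrderedRow is a hypothetical stand-in, not dflite's _DFRow, and the sketch relies on plain dicts preserving insertion order (Python 3.7+).

```python
class OrderedRow(dict):
    """dict subclass that also supports lookup by integer position.

    Hypothetical stand-in for illustration; not dflite's _DFRow.
    """

    def __getitem__(self, key):
        if isinstance(key, int):
            # Dicts keep insertion order, so position N is the
            # Nth key that was inserted.
            key = list(self.keys())[key]
        return super().__getitem__(key)


row = OrderedRow(
    [
        ("col1", "2016-03-02 17:50:21"),
        ("col2", 45.1026838),
        ("col3", -64.29084603),
    ]
)
print(row["col1"])  # lookup by column name
print(row[0])       # lookup by position, same value
```

The trade-off in this sketch is that an integer can never be used as a column name, because every integer key is treated as a position.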

Iterating through rows is done using the itertuples() method, which returns an iterator over the rows in the same way as the pandas version. Because pandas returns its row with the 0th item as the row number (or row index, if you believe in that kind of thing), this method does as well.

for row in data.head().itertuples():
    print(row[0], row["col2"], row[1])
0 45.10303743 2016-03-02 17:50:18
1 45.10291441 2016-03-02 17:50:19
2 45.10279595 2016-03-02 17:50:20
3 45.1026838 2016-03-02 17:50:21
4 45.10259138 2016-03-02 17:50:22
5 45.10251977 2016-03-02 17:50:23
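The "row number first" convention can be sketched with a small generator over a dict of column lists. This is an illustrative assumption about the shape of the data, not dflite's code, and unlike the rows shown above, the plain tuples it yields cannot be indexed by column name.

```python
def itertuples(columns):
    """Yield one tuple per row, with the row number as item 0.

    Hypothetical sketch over a dict of column name -> list of values.
    """
    names = list(columns)
    # zip(*...) regroups the column lists into one tuple per row.
    row_values = zip(*(columns[name] for name in names))
    for index, values in enumerate(row_values):
        yield (index,) + values


columns = {
    "col1": ["2016-03-02 17:50:18", "2016-03-02 17:50:19"],
    "col2": [45.10303743, 45.10291441],
}
rows = list(itertuples(columns))
for row in rows:
    print(row[0], row[2], row[1])
```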

Subsetting

Each column is a NumPy ndarray object, so it can be indexed like any other ndarray object (i.e. by a list of desired rows, by an ndarray of logicals, by a single index, or by a slice). Some of this notation is available through the iloc attribute as well, which returns a single value (if two ints are passed), a _DFRow (if only a single integer is passed), or a subsetted DataFrame (if some combination of slices/ints/lists is passed). See the following examples:

data.iloc[1:3]
   col1                 col2         col3
0  2016-03-02 17:50:19  45.10291441  -64.29095464
1  2016-03-02 17:50:20  45.10279595  -64.29089237
data.iloc[1:5, 0:2]
   col1                 col2
0  2016-03-02 17:50:19  45.10291441
1  2016-03-02 17:50:20  45.10279595
2  2016-03-02 17:50:21  45.1026838
3  2016-03-02 17:50:22  45.10259138
data.iloc[[2, 5, 5, 33], ("col1", "col3")]
   col1                 col3
0  2016-03-02 17:50:20  -64.29089237
1  2016-03-02 17:50:23  -64.29080362
2  2016-03-02 17:50:23  -64.29080362
3  2016-03-02 17:50:51  -64.29613321

Notice again how our original row numbers aren't preserved. You can work around this by making a column with your original row numbers. I get how this could be annoying, but including it was too complicated and wasn't necessary for what I was doing.

data["original_rows"] = list(range(len(data)))
data.iloc[[2, 5, 5, 33], ("original_rows", "col1", "col3")]
   original_rows  col1                 col3
0  2              2016-03-02 17:50:20  -64.29089237
1  5              2016-03-02 17:50:23  -64.29080362
2  5              2016-03-02 17:50:23  -64.29080362
3  33             2016-03-02 17:50:51  -64.29613321

All of the nice indexing things we can do with NumPy are also available in the 'rows' part of the index:

data.iloc[data["col2"] > 45.1022]
    col1                 col2         col3          original_rows
0   2016-03-02 17:50:18  45.10303743  -64.29103034  0
1   2016-03-02 17:50:19  45.10291441  -64.29095464  1
2   2016-03-02 17:50:20  45.10279595  -64.29089237  2
3   2016-03-02 17:50:21  45.1026838   -64.29084603  3
4   2016-03-02 17:50:22  45.10259138  -64.29080328  4
5   2016-03-02 17:50:23  45.10251977  -64.29080362  5
6   2016-03-02 17:50:24  45.10245523  -64.29083152  6
7   2016-03-02 17:50:25  45.10240112  -64.29086638  7
8   2016-03-02 17:50:26  45.10233343  -64.2909347   8
9   2016-03-02 17:50:27  45.10227411  -64.29102036  9
10  2016-03-02 17:50:28  45.1022154   -64.29111654  10
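The dispatch rules described in this section (two ints give a value, one int gives a row, anything else gives a subset) could be sketched roughly as follows. This is a pure-Python illustration over a dict of lists, with a hypothetical class name ILoc; dflite itself stores NumPy arrays and returns its own DataFrame and _DFRow types.

```python
class ILoc:
    """Positional indexer over a dict of equal-length column lists.

    Hypothetical sketch for illustration; not dflite's implementation.
    """

    def __init__(self, columns):
        self.columns = columns
        self.names = list(columns)

    def _row_positions(self, key, nrows):
        if isinstance(key, slice):
            return list(range(*key.indices(nrows)))
        key = list(key)
        if key and all(isinstance(k, bool) for k in key):
            # Boolean mask: keep positions where the mask is True.
            return [i for i, keep in enumerate(key) if keep]
        return key

    def _col_names(self, key):
        if isinstance(key, slice):
            return self.names[key]
        if isinstance(key, (int, str)):
            key = [key]
        # Accept a mix of column positions and column names.
        return [self.names[k] if isinstance(k, int) else k for k in key]

    def __getitem__(self, key):
        rows, cols = key if isinstance(key, tuple) else (key, slice(None))
        nrows = len(self.columns[self.names[0]])
        if isinstance(rows, int) and isinstance(cols, int):
            # Two ints: a single value.
            return self.columns[self.names[cols]][rows]
        if isinstance(rows, int) and not isinstance(key, tuple):
            # One int on its own: a single row as a name -> value dict.
            return {n: self.columns[n][rows] for n in self.names}
        if isinstance(rows, int):
            rows = [rows]
        positions = self._row_positions(rows, nrows)
        # Anything else: a subset, again as a dict of column lists.
        return {
            n: [self.columns[n][i] for i in positions]
            for n in self._col_names(cols)
        }


columns = {
    "col1": [
        "2016-03-02 17:50:18",
        "2016-03-02 17:50:19",
        "2016-03-02 17:50:20",
        "2016-03-02 17:50:21",
    ],
    "col2": [45.10303743, 45.10291441, 45.10279595, 45.1026838],
}
iloc = ILoc(columns)

value = iloc[1, 1]                   # single value
row = iloc[3]                        # single row
subset = iloc[1:3]                   # rows 1 and 2, all columns
picked = iloc[[0, 0, 3], ("col2",)]  # repeated rows, one column
mask = [v > 45.1028 for v in columns["col2"]]
filtered = iloc[mask]                # boolean mask over rows
```

Because every subset is built as a fresh object from the selected positions, the original row numbers are lost unless they are stored in a column, as shown earlier in this section.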

Exporting

Writing the DataFrame to a CSV is probably the easiest way to export, although TSV is also supported. The to_csv() method works more or less like the pandas version, and can take a file-like object as well as a filename.
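As a rough idea of how an export function can accept either a filename or a file-like object, here is a sketch built on the standard library's csv module. The helper name write_columns and its delimiter parameter are hypothetical; this is not dflite's to_csv() and makes no claim about its signature.

```python
import csv
import io


def write_columns(columns, target, delimiter=","):
    """Write a dict of column name -> list of values as delimited text.

    `target` may be a filename or a writable file-like object.
    Hypothetical helper for illustration; not dflite's to_csv().
    """
    if isinstance(target, str):
        with open(target, "w", newline="") as f:
            write_columns(columns, f, delimiter)
        return
    writer = csv.writer(target, delimiter=delimiter, lineterminator="\n")
    names = list(columns)
    writer.writerow(names)
    # zip(*...) turns the column lists back into one tuple per row.
    writer.writerows(zip(*(columns[name] for name in names)))


columns = {
    "Time (UTC)": ["2016-03-02 17:50:18", "2016-03-02 17:50:19"],
    "Latitude": [45.10303743, 45.10291441],
}

buffer = io.StringIO()
write_columns(columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# A tab delimiter gives TSV output from the same function.
tsv_buffer = io.StringIO()
write_columns(columns, tsv_buffer, delimiter="\t")
tsv_text = tsv_buffer.getvalue()
```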

Performance

As I mentioned earlier, running the pandas.DataFrame in production code that used quite a lot of DataFrames was quite slow. I have a feeling that there's a lot of overhead involved with the convenience of multiple indexing and built-in plotting support that slows the class down when there isn't a need for it. There's also probably a lot of work to be done on this class that can add convenience without compromising performance, but I'll leave that up to some folks with a bit more spare time than I have. Cheers!