Last article, I gave my basic reasons for pursuing a research project into the changing demographics of the slave trade in the Americas. That was very broad and very vague. Now I am about to plunge into locating the data I need to reconstruct that history, and I am already running into technical issues.

The archive I want to use was created 22 years ago, back when Microsoft was the preeminent provider of consumer computer products. Everything is saved as a .dbf database file, with .sav, .sps and other files used to port it to an SPSS application. I had never heard of SPSS before this project. It's a proprietary statistics package currently owned by IBM and used heavily in the social sciences. Opening this type of layered database file in a modern environment is complicated. Microsoft has a discontinued product, Visual FoxPro, and of course there are paid products. However, I want to use a modern Jupyter notebook with Python, pandas and other standard data science tools.

I turned to the great oracle, Google, to find out how to open a .dbf file in Jupyter. There is a fairly well known package, PySAL, that seemed ideal. Unfortunately, I am using Python 3.8. PySAL depends on a package called Rasterio, which in turn depends on a set of C libraries called GDAL. And here is where I ran into issues. GDAL does not, for some reason, edit its path environment variables on a Windows machine with Python 3.8. There are open issues with several development teams, but all are quite recent (the past year or so) and the issue is not yet fixed. While I'd love to use a Linux machine, at the moment I'm stuck with Windows.

So I am going to have to sacrifice the convenience of PySAL and turn to a more hacked-together solution. There is a small third-party library called dbfread, which can be installed with pip. It is going to take some work to get dbfread's output into Pandas, but hopefully not too much.

In [1]:
import pandas as pd
from dbfread import DBF

Above, I imported Pandas. Pandas is perhaps the most popular data science package for Python and is extremely useful. I've used some of its competitors, especially Turi, and Pandas is both better documented and better supported. It does have some strange quirks, such as its odd use of zero indexing, but overall it is a fun package to use.

Notice I'm also using dbfread. I'm using a script from the documentation at https://dbfread.readthedocs.io/en/latest/exporting_data.html. Please be nice to me, processor gods, I just want a dataframe.

In [2]:
dbf = DBF('SLAVE.DBF')

Let's try to convert the .dbf database into a Pandas dataframe. I'm going to use Pandas' built-in DataFrame constructor because it will continue to be updated with Pandas' development. It is important to note that dbfread is only a lightly maintained library. After reading the source code, I noticed a lot of development is still needed. This would come back to bite me as I worked on this project. Per dbfread's documentation, we also need to pass an iterable form of the dbf object we created above. Below is the single line of code that nearly destroyed my sanity.

In [3]:
df = pd.DataFrame(iter(dbf))

Originally, this errored. And it errored in the most bizarre way. It was throwing errors about float numbers and integers with full stops. I was very confused and Google didn't help much. In fact, most people just ended up editing their data file. I do not have the ability to open a dbf file, nor did I really want to open a file, edit it however many times, and close it again. So, I opened the source code of dbfread at the line that triggered the exception. Below I have a copy of the exception, lovingly recreated for you after I fixed it. Notice all the nonsense about integers and floats and b'.'.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
c:\users\casti\appdata\local\programs\python\python38\lib\site-packages\dbfread\field_parser.py in parseN(self, field, data)
    179         try:
--> 180             return int(data)
    181         except ValueError:

ValueError: invalid literal for int() with base 10: b'.'

***Edited for clarity***

c:\users\casti\appdata\local\programs\python\python38\lib\site-packages\dbfread\field_parser.py in parseN(self, field, data)
    184             else:
    185                 # Account for , in numeric fields
--> 186                 return float(data.replace(b',',b'.'))
    187                 '''
    188                 if isinstance(data, float) or len(data) >= 2:

ValueError: could not convert string to float: b'.'

As I said, Googling this error only turned up more people with the same issue. Since I couldn't use the package that everyone else seemed to prefer (PySAL!), I decided to wade into the source code. Perhaps somewhere in the field_parser.py file I'd find a clue to fixing my code.

I didn't. Instead, I found the reason the code errored in the first place. Here's some code taken directly out of field_parser.py:

    def parseN(self, field, data):
        """
        Parse numeric field (N)

        Returns int, float or None if the field is empty.
        """
        
        # In some files * is used for padding.
        data = data.strip().strip(b'*')
        try:
            return int(data)
        except ValueError:
            if not data.strip():
                return None
            else:
                # Account for , in numeric fields
                return float(data.replace(b',',b'.'))
                
Folks, that was the entirety of the exception handling. The default case was, simply, to replace all commas with full stops and then try to convert whatever monstrosity resulted into a float. It is understandable Python didn't really want to do this. By this logic, if the number 1,234.56 was passed to the parseN method, the result would be 1.234.56. This is, obviously, neither a float nor an integer. It is an unholy creation of man. I decided to fix this.
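
The failure is easy to reproduce without a .dbf file. Below is a minimal standalone copy of the logic above. The names original_parse_n and try_parse are made up for this illustration and are not part of dbfread.

```python
def original_parse_n(data: bytes):
    """Standalone copy of the fallback logic above, for experimenting."""
    data = data.strip().strip(b'*')
    try:
        return int(data)
    except ValueError:
        if not data.strip():
            return None
        # Every comma becomes a full stop, whatever the rest looks like.
        return float(data.replace(b',', b'.'))


def try_parse(data: bytes):
    """Return the parsed value, or the error message if parsing fails."""
    try:
        return original_parse_n(data)
    except ValueError as exc:
        return f"ValueError: {exc}"


# A clean integer, a decimal comma, a blank field, and two troublemakers.
for raw in (b'  42', b'3,5', b'   ', b'.', b'1,234.56'):
    print(raw, '->', try_parse(raw))
```

The first three values parse to 42, 3.5 and None. The last two both end in a ValueError, because neither b'.' nor b'1.234.56' is something float() can read.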

First, I decided the final case should be the float NaN. NaN stands for, in the surreally linear logic of computer scientists, *N*ot *A* *N*umber. Get it? NaN...Not A Number. This meant that, if the code was absolutely sure we were dealing with numbers, but it kept mangling whatever it was passed, it should just insert the Pythonic equivalent of a shrug. This is probably not the best design, but short of refactoring the whole dbfread library I thought it would work as a stopgap. The code now read:

    def parseN(self, field, data):
        """
        Parse numeric field (N)

        Returns int, float or None if the field is empty.
        """
        
        # In some files * is used for padding.
        
        data = data.strip().strip(b'*')
        try:
            return int(data)
        except ValueError:
            if not data.strip():
                return None
            else:
                # Account for , in numeric fields
                if isinstance(data, float):
                    return float(data.replace(b',', b'.'))
                return float(b"NaN")
                
A note on that b prefix, which confused me at first. The function is not returning anything binary. dbfread hands parseN the field exactly as it sits in the file, as raw bytes, and Python's float() accepts bytes as happily as it accepts strings. So float(b"NaN") is just an ordinary float whose value happens to be NaN.
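
A quick check in plain Python, with no dbfread involved, shows what float() does with a bytes literal like the one on that last line:

```python
import math

value = float(b"NaN")          # float() accepts bytes as well as str

print(type(value).__name__)    # float
print(math.isnan(value))       # True
print(value == value)          # False: NaN never equals anything, itself included
print(float(b"3.5"))           # 3.5
```

The last comparison is why math.isnan (or pandas' isna) is the safe way to look for these placeholder values later on.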

This didn't quite fix my problem. I still kept getting that error about the binary full stop from earlier. It had moved on a few lines (I was inserting counters at one point to trace where in the file the errors were), but I was still having problems with this function. So I started thinking about when a number might *look* like a float but was, in fact, never meant to be one. Suppose there was a field that said *0,*. This would obviously be corruption or an entry error, but hey, data is messy. If you called this parseN function on *0,*, it wouldn't convert to an integer, so it *would* raise an exception, but it would not trigger the replace method. Instead, it would turn into a float NaN. I didn't want that data loss, even if I don't understand the data. So I should really build a trap for such pieces of data.


    def parseN(self, field, data):
        """
        Parse numeric field (N)

        Returns int, float or None if the field is empty.
        """
        
        # In some files * is used for padding.
        data = data.strip().strip(b'*')

        try:
            return int(data)
        except ValueError:
            if not data.strip():
                return None
            else:
                # Account for , in numeric fields

                if isinstance(data, float) or len(data) >= 2:
                    return float(data.replace(b',',b'.'))
                return float(b'NaN')
                
See my clever little trap? I'm checking whether the data is at least 2 bytes long. Why? The isinstance check never actually fires, because parseN is always handed bytes, so the length check is doing all the work. A value like *0,0* or *0,* is long enough to go through the comma replacement and come out as a float. A lone *','* or *'.'* is not, so it becomes NaN instead of blowing up. I don't know enough about the program to dig through why it is trying to pass those stray characters as numbers, so I weed them out with the length check. One caveat: a number with both a thousands separator and a decimal point, such as *1,500.0*, still turns into *1.500.0* and still raises, so this is a stopgap rather than a complete fix.
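
A value such as 1,500.0 needs its comma removed rather than replaced, and dbfread's documentation describes a parserclass argument for plugging in a FieldParser subclass, which avoids editing files under site-packages. The sketch below combines the two ideas. The helper tolerant_number and the class name TolerantFieldParser are hypothetical, and the comma rules are assumptions about the data (comma plus full stop means a thousands separator, a comma on its own means a decimal comma), not something the archive documents.

```python
import math

try:
    from dbfread import DBF, FieldParser   # third-party: pip install dbfread
except ImportError:
    # Fallback so the parsing logic can be tried without dbfread installed.
    DBF = None
    FieldParser = object


def tolerant_number(data: bytes):
    """Hypothetical helper: raw numeric field -> int, float, None or NaN."""
    data = data.strip().strip(b'*')
    if not data:
        return None                         # empty field
    try:
        return int(data)
    except ValueError:
        pass
    if b',' in data and b'.' in data:
        # Assumption: the comma is a thousands separator, e.g. b'1,500.0'.
        data = data.replace(b',', b'')
    else:
        # Assumption: a comma on its own is a decimal comma, e.g. b'0,5'.
        data = data.replace(b',', b'.')
    try:
        return float(data)
    except ValueError:
        return math.nan                     # stray b'.', b',' and similar


class TolerantFieldParser(FieldParser):
    """Overrides only the numeric (N) parser; other field types are inherited."""

    def parseN(self, field, data):
        return tolerant_number(data)


# With dbfread installed and the file present, the table could then be opened as:
#     dbf = DBF('SLAVE.DBF', parserclass=TolerantFieldParser)
#     df = pd.DataFrame(iter(dbf))
```

The trade-off is that a value like 1,234 is read as 1.234 under the decimal-comma assumption, so it would be worth checking a few known records before trusting the numbers.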

I also had to make one other little edit. There was a similar problem with a datetime method. In this case, it looks like the original developer forgot to add a .strip() call they clearly meant to include. The comments indicate they intended to test for both empty spaces *and* spaces full of zeros in a datetime column. Nonetheless, they only strip off the zeros, not the spaces. I added a clause looking for the empty spaces and now it works.
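
The idea can be sketched as a standalone function. This is not the dbfread source or the exact edit described above. The function name parse_date_field is an assumption for illustration, and it relies on .dbf date fields being stored as eight bytes in YYYYMMDD order.

```python
import datetime


def parse_date_field(data: bytes):
    """Hypothetical sketch of a date (D) parser: YYYYMMDD bytes -> date or None."""
    # A blank date may be padded with spaces, zeros, or a mix of both,
    # so strip both characters before deciding the field is empty.
    if data.strip(b' 0') == b'':
        return None
    return datetime.date(int(data[:4]), int(data[4:6]), int(data[6:8]))


print(parse_date_field(b'18500612'))   # 1850-06-12
print(parse_date_field(b'        '))   # None
print(parse_date_field(b'00000000'))   # None
```

Anything that is neither blank nor a valid date still raises a ValueError here, which seems preferable to silently inventing a date.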

This is the end of the post. There's still a lot of work to do to make this data accessible and accurate enough for any sort of valid historical conclusions to be drawn. However, it is now loaded, the dbfread source code has been strengthened, and it's time to take a break. Somewhere, distantly, I hear a martini glass clinking.