# Data Serialization with Python

This course is aimed at the intermediate-level Python programmer.  We will assume you already have some familiarity with basic Python constructs like importing libraries, writing classes, variables, flow control, and similar topics an introductory course will have addressed.  However, we do not assume an advanced level of knowledge, nor do we assume experience with the specific tools and libraries these lessons will address.

# Why Serialization?

There are several key purposes for which data serialization, and correspondingly de-serialization, are important.  In one case, we simply wish to be able to persist data from a running Python program to resume computation later on, still utilizing the same data.  At other times, we would like to exchange data with other tools and programs, perhaps ones written in other programming languages altogether.  Data, of course, comes in many shapes and sizes, and can be utilized for many purposes.

## Pickles

<img src="https://user-images.githubusercontent.com/7065401/91055530-28a9f380-e5fb-11ea-9c4a-237ae8275b48.jpg" align="right" width="33%"/>In these lessons we will look first at the pickle module of the Python standard library.  For most purposes within the Python ecosystem itself, pickle is your go-to serialization technique.

# Nested Data

In the next several lessons, we will look at JSON, the JavaScript Object Notation, which is widely used as a means of exchanging data among various programming languages and tools, and is particularly associated with web services.

# Flat Data

After looking at the structured and hierarchical data often represented in JSON, we look at working with CSV and other delimited files, which are the most common means of representing tabular data and sharing it among different tools.  For the CSV lessons, we consider both the Python standard library and the popular third-party library Pandas.

## Document Data

<img src="https://user-images.githubusercontent.com/7065401/91055568-36f80f80-e5fb-11ea-80e7-31bbc6c3d03c.png" align="right" width="33%"/>Moving further into the world of typically document-oriented data, we look at several Python libraries for working with XML data sources.  A large share of all documents in the world live as XML, and Python contains excellent tools for accessing those.

<img src="https://user-images.githubusercontent.com/7065401/91055601-44ad9500-e5fb-11ea-82d8-77323dc0e4f0.jpg" align="right" width="33%"/>

# Numeric Datasets

In the final two lessons, we will look at formats specifically concerned with scientific and numeric data, often including very large datasets.  The NumPy array library and the HDF5 storage format are both discussed.

# Universal Python Serialization

The first tool you should always think about when serializing Python objects is the native pickle format. A pickle can serialize *almost any* Python object in a binary format.

More specialized protocols exist for serialization within cluster computing frameworks. Cloudpickle is widely used for this purpose, but is not specifically discussed in this training.  Later in this training we will look at a variety of formats that Python can work with, but that are not specific to Python objects.

Let us start out by loading a few Python standard library modules this lesson will utilize.

In [8]:
import pickle
import io
from pprint import pprint
from dataclasses import dataclass
from zipfile import ZipFile
from datetime import datetime

### A Byte Representation

Let us create a dictionary and use pickle to serialize it in a binary form.

In [9]:
my_data = dict(name="David", real_number=76.54, count=22,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

pkl = pickle.dumps(my_data)
pprint(pkl, width=50)

(b'\x80\x04\x95g\x00\x00\x00\x00\x00\x00\x00}'
 b'\x94(\x8c\x04name\x94\x8c\x05David'
 b'\x94\x8c\x0breal_number\x94G@S"\x8f\\(\xf5\xc3'
 b'\x8c\x05count\x94K\x16\x8c\x04pets\x94]\x94('
 b'\x8c\x08Astrophe\x94\x8c\x07Kachina'
 b'\x94\x8c\x07Jackson\x94\x8c\x05Rebel\x94eu.')
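
The byte string above is a small program for the unpickler rather than an opaque blob.  As a hedged illustration, the standard library module `pickletools` can list the opcodes it contains.  The sketch below rebuilds the same dictionary and requests protocol 4 explicitly, which is an assumption made here so that the listing does not depend on the interpreter's default protocol.

```python
import io
import pickle
import pickletools

my_data = dict(name="David", real_number=76.54, count=22,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

# Protocol 4 is requested explicitly (an assumption for this sketch) so the
# opcode listing is the same regardless of the interpreter's default
pkl = pickle.dumps(my_data, protocol=4)

# pickletools.dis() writes a human-readable opcode listing to `out`
listing = io.StringIO()
pickletools.dis(pkl, out=listing)
print(listing.getvalue())
```

Each line of the listing names one opcode, such as the one that pushes the short string `'name'` or the one that pushes the float `76.54`.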


## Same Values, Different Object

Unpickling a serialization will create an *equivalent* object, but not an identical object.  It should not be confused with a shared memory or concurrency mechanism (pickles are a building block for *some* concurrency models, however).

In [10]:
new_data = pickle.loads(pkl)
new_data

{'name': 'David',
 'real_number': 76.54,
 'count': 22,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}

In [11]:
print("Equality:", new_data == my_data)
print("Identity:", new_data is my_data)

Equality: True
Identity: False


## Pickling to Files

We can pickle to raw bytes, but for many or most purposes, it is useful to write these serializations to files.  The functions `dump()` and `load()` write to and read from file-like objects rather than creating or consuming byte strings.

In [13]:
with open('data/data.pkl', 'wb') as fh:
    pickle.dump(my_data, fh)

In [16]:
%%bash
hexdump -C data/data.pkl

00000000  80 04 95 67 00 00 00 00  00 00 00 7d 94 28 8c 04  |...g.......}.(..|
00000010  6e 61 6d 65 94 8c 05 44  61 76 69 64 94 8c 0b 72  |name...David...r|
00000020  65 61 6c 5f 6e 75 6d 62  65 72 94 47 40 53 22 8f  |eal_number.G@S".|
00000030  5c 28 f5 c3 8c 05 63 6f  75 6e 74 94 4b 16 8c 04  |\(....count.K...|
00000040  70 65 74 73 94 5d 94 28  8c 08 41 73 74 72 6f 70  |pets.].(..Astrop|
00000050  68 65 94 8c 07 4b 61 63  68 69 6e 61 94 8c 07 4a  |he...Kachina...J|
00000060  61 63 6b 73 6f 6e 94 8c  05 52 65 62 65 6c 94 65  |ackson...Rebel.e|
00000070  75 2e                                             |u.|
00000072


## Reading Objects from Files

Reading a pickle from a file (or from another file-like object) is symmetrical with writing it.  With Python's so-called duck-typing, any object with `.read()` and `.readline()` methods that produce bytes allows unpickling.  Symmetrically, any object with a `.write()` method accepting bytes is suitable for pickling.

In [35]:
pickle.load(open('data/data.pkl', 'rb'))

{'name': 'David',
 'real_number': 76.54,
 'count': 22,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}
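
To make the duck-typing point concrete, here is a minimal sketch with two hand-written classes that are not files at all.  The names `ByteCollector` and `ByteSource` are hypothetical, invented for this illustration; the only thing that matters is that one has a `.write()` method and the other has `.read()` and `.readline()` methods.

```python
import pickle


class ByteCollector:
    "Hypothetical sink: anything with .write() accepting bytes will do"
    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(bytes(chunk))
        return len(chunk)


class ByteSource:
    "Hypothetical source: offers .read() and .readline() returning bytes"
    def __init__(self, payload):
        self._payload = payload
        self._pos = 0

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._payload) - self._pos
        chunk = self._payload[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def readline(self):
        end = self._payload.find(b"\n", self._pos)
        end = len(self._payload) if end == -1 else end + 1
        chunk = self._payload[self._pos:end]
        self._pos = end
        return chunk


sink = ByteCollector()
pickle.dump({"name": "David", "count": 22}, sink)

source = ByteSource(b"".join(sink.chunks))
restored = pickle.load(source)
print(restored)
```

In practice `io.BytesIO` or a real file handle is the sensible choice; the sketch only shows how little the `pickle` module asks of the object it is handed.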

### File-Like Objects

A regular file on the local filesystem is a common location for pickles, but they might be available over a socket, or from a database connection with BLOB storage of pickles, or over an HTTP request, and so on.  For example, perhaps a zip file contains one or more pickles.

In [18]:
%%bash
cd data && zip data data.pkl

  adding: data.pkl (deflated 5%)


In [19]:
with ZipFile('data/data.zip') as zf:
    with zf.open('data.pkl') as zfile:
        pprint(pickle.load(zfile))

{'count': 22,
 'name': 'David',
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel'],
 'real_number': 76.54}


### Other File-Like Objects

Python uses *duck-typing* quite extensively; a great many things are file-like.  For example, we might use a memory IO buffer as the equivalent of a file.

In [20]:
memfile = io.BytesIO(pkl)
pickle.load(memfile)

{'name': 'David',
 'real_number': 76.54,
 'count': 22,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}

# Pickle Limitations

Most Python objects can be pickled and unpickled.  A simple dictionary, with some scalars and one nested list, was used in the examples earlier.  You are not limited to built-in data structures; however, there are a few limits.  If you pickle an instance of a class, the class itself needs to be available on the receiving system.  Often this is no problem, since the class is from a library installed at both ends.

### Round-Trip with a DataClass

Perhaps we wish to use a dataclass instead of a dictionary in a program.

In [21]:
@dataclass
class Trainer:
    name: str
    real_number: float
    count: int
    pets: list

In [22]:
my_instance = Trainer(name="David", real_number=76.54, count=22,
                      pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

In [23]:
my_instance

Trainer(name='David', real_number=76.54, count=22, pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

In [24]:
pickle.loads(pickle.dumps(my_instance))

Trainer(name='David', real_number=76.54, count=22, pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

### Round-Trip with Datetimes

Or we want to store and retrieve datetime values.

In [25]:
events = {'description': 'Developed Lesson',
          'start': datetime.fromisoformat('2020-05-22T12:11:10'),
          'end': datetime(2020, 5, 23, 9, 10, 11)}
events

{'description': 'Developed Lesson',
 'start': datetime.datetime(2020, 5, 22, 12, 11, 10),
 'end': datetime.datetime(2020, 5, 23, 9, 10, 11)}

In [26]:
pickle.loads(pickle.dumps(events))

{'description': 'Developed Lesson',
 'start': datetime.datetime(2020, 5, 22, 12, 11, 10),
 'end': datetime.datetime(2020, 5, 23, 9, 10, 11)}

### Missing Classes

If the class we want is not available, or even simply lives in the wrong namespace, we will not succeed in unpickling.  For example, a pickle file is available, but the code defining the class of the pickled instance is not on the local system.

> The data associated with this notebook can be found in the files associated with this course

In [27]:
try:
    with open('data/3dpoint.pkl', 'rb') as fh:
        print(pickle.load(fh))
except Exception as err:
    print(err)

[Errno 2] No such file or directory: '3dpoint.pkl'
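
The cell above depends on the course data file being present.  As a self-contained sketch of the failure being described, we can simulate a "receiving system" that lacks the class.  The module name `geometry_demo` and the class `Point3D` are hypothetical names invented for this illustration, and the throwaway module is an assumption used only so the example does not depend on how the notebook is run.

```python
import pickle
import sys
import types

# A throwaway module standing in for a library installed on the sender
geometry = types.ModuleType("geometry_demo")
sys.modules["geometry_demo"] = geometry


class Point3D:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


# Pretend the class lives in that module; pickle stores only this location
Point3D.__module__ = "geometry_demo"
Point3D.__qualname__ = "Point3D"
geometry.Point3D = Point3D

pkl_point = pickle.dumps(Point3D(1, 2, 3))

# Simulate a receiver where the module exists but the class does not
del geometry.Point3D

try:
    pickle.loads(pkl_point)
    failure = None
except (AttributeError, pickle.UnpicklingError) as err:
    failure = err
print(type(failure).__name__, failure)

# Clean up the throwaway module
del sys.modules["geometry_demo"]
```

The pickle holds the instance data plus the *name* of the class, not its code, so unpickling can only succeed where that name resolves to a class.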


### Transient State

Pickling is not directly possible for objects that are inherently impermanent.  For example, objects may represent file descriptors to the local filesystem or connections to a database.  They store the state of one particular computer at one particular time, and cannot be serialized.

In [29]:
hello = "¡Hola Mundo!"
num = 999
fname = 'data/test.data'
fd = open(fname, 'w')
fd.write(hello)
data = {'fd': fd, 'num': num, 'hello': hello}

In [30]:
try:
    pickle.dumps(data)
except Exception as err:
    print(err)

cannot pickle 'TextIOWrapper' instances
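
The same limitation applies to the database connections mentioned above.  As a hedged sketch using the standard library `sqlite3` module with an in-memory database, the connection itself refuses to be pickled, while the plain settings needed to reconnect serialize without trouble.  The `settings` dictionary is a convention assumed for this illustration only.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")

try:
    pickle.dumps(conn)
    conn_error = None
except (TypeError, pickle.PicklingError) as err:
    conn_error = err
print(type(conn_error).__name__, conn_error)

# What can be pickled is the information needed to connect again
settings = {"database": ":memory:", "timeout": 5.0}
restored = pickle.loads(pickle.dumps(settings))
new_conn = sqlite3.connect(restored["database"], timeout=restored["timeout"])

conn.close()
new_conn.close()
```

Note that the new connection is a fresh one; nothing about the old connection's open transactions or cursors survives.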


# Customizing Serialization

If you wish to serialize and deserialize classes you create yourself, you are free to specify which data is actually necessary and relevant for recreating *equivalent* instances.  This customization can allow you to initialize transient state in a manner that allows something close to round-tripping.  For example, a particular local file cannot be shared on a non-networked filesystem; however, unpickling might create a usable file local to the destination filesystem.

In [53]:
class HelloNumber:
    "Plain class that holds file descriptor"
    def __init__(self, fname, hello, num):
        self.fd = open(fname, 'w+')
        self.fd.write(hello)
        self.num = num

    def __str__(self):
        return (f"<{self.__class__.__name__} holding file "
                f"{self.fd.name}({self.fd.fileno()}) and num {self.num}>")

We can add the capability of serializing the most important information in an instance to a simple tuple (another structure would work also; e.g. a list, a dict, a namedtuple, etc).

In [54]:
class HelloNumber2(HelloNumber):
    "Add the ability to pickle the essential data"
    def __getstate__(self):
        pos = self.fd.tell()
        self.fd.flush()
        self.fd.seek(0)
        hello = self.fd.read()
        self.fd.seek(pos)
        data = (self.fd.name, # fname
                hello,   # file content
                self.num)
        print("Pickling tuple only...")
        return data

The `.__init__()` of a class is not called during unpickling.  By default its `.__dict__` is simply restored.  We can make our class do something different from that.

In [55]:
class HelloNumber3(HelloNumber2):
    "Add the ability to reconstruct local state"
    def __setstate__(self, data):
        self.fd = open(data[0], 'w+')  # use the pickled filename, not a global
        self.fd.write(data[1])
        self.num = data[2]

Let us create an instance then round-trip it.

In [56]:
hi = HelloNumber3(fname, hello, num)
print(hi)

<HelloNumber3 holding file test(81) and num 3>


In [57]:
pkl = pickle.dumps(hi)
pprint(pkl, width=60)

Pickling tuple only...
(b'\x80\x04\x954\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main_'
 b'_\x94\x8c\x0cHelloNumber3\x94\x93\x94)\x81\x94\x8c\x04'
 b'test\x94\x8c\x05Hello\x94K\x03\x87\x94b.')


In [42]:
new_hi = pickle.loads(pkl)
print(new_hi)



# Reading and Writing JSON

JavaScript Object Notation (JSON) is a widely used data exchange format.  As the name suggests, it is a format derived from JavaScript, but it is strictly language neutral. JSON is currently specified by Internet Engineering Task Force (IETF) RFC 8259.

JSON is supported by a great many programming languages, in their standard library, or as built-ins, or with widely available libraries for those languages.  Many JSON strings are also identical to valid Python expressions for some data structure or scalar.

In [59]:
import json
from pprint import pprint
from textwrap import fill
from dataclasses import dataclass, asdict
from datetime import datetime
from decimal import Decimal
from fractions import Fraction
from math import pi
import jsonpickle

## A String Representation

Let us create a dictionary and use the `json` module to serialize it in a string form. The examples in this lesson will largely follow those used in the lesson on Python pickles.

In [60]:
# Being still alive, lifespan is unknown & marked with NaN
my_data = dict(name="David", real_number=76.54, count=22, likes_python=True,
               lifespan=float('nan'), end_of_time=float('inf'),
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

jstr = json.dumps(my_data)
print(fill(jstr, width=65))

{"name": "David", "real_number": 76.54, "count": 22,
"likes_python": true, "lifespan": NaN, "end_of_time": Infinity,
"pets": ["Astrophe", "Kachina", "Jackson", "Rebel"]}


In [61]:
print(fill(str(my_data), width=65))

{'name': 'David', 'real_number': 76.54, 'count': 22,
'likes_python': True, 'lifespan': nan, 'end_of_time': inf,
'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}


## Almost Just Python

The JSON string representing the `my_data` dictionary is *almost* valid Python that we could copy-paste or `eval()`.  The main differences are the different spellings of `true` versus `True`, of `false` versus `False`, and of `null` versus `None`.

Another subtle issue occurred in the example, however.  The name `nan` is neither a Python keyword nor a built-in name, *nor* is it strictly part of the JSON spec.  This special class of floating-point values (Not-a-Number) is very useful for certain numeric purposes, so many JSON libraries add it as an informal extension.

The JSON version is spelled `NaN`.  In Python, we could import the name `nan` from the `math` or `numpy` modules, or we can build it using the `float()` constructor.  The constants `Infinity` and `-Infinity`, which represent the infinities of the IEEE-754 floating-point standard, are likewise often useful, but are not part of JSON narrowly.

In [62]:
try:
    json.dumps(my_data, allow_nan=False)
except Exception as err:
    print(err)

Out of range float values are not JSON compliant: nan
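
The `allow_nan=False` switch only governs writing.  On the reading side, a sketch of a similar strictness uses the `parse_constant` argument of `json.loads()`, which is called with the strings `'NaN'`, `'Infinity'`, or `'-Infinity'` whenever one is encountered.  The function name `reject_constant` is a hypothetical helper for this illustration.

```python
import json
import math

# By default, Python's decoder accepts the informal extensions
values = json.loads('[NaN, Infinity, -Infinity]')
print(values)


def reject_constant(name):
    "Hypothetical strict handler: refuse the non-standard constants"
    raise ValueError(f"Non-standard JSON constant: {name}")


try:
    json.loads('{"lifespan": NaN}', parse_constant=reject_constant)
    strict_error = None
except ValueError as err:
    strict_error = err
print(strict_error)
```

Whether to be strict is a judgment call; it mostly matters when the same documents must also be read by parsers that follow the specification narrowly.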


## Same Values, Different Object

Serialization and deserialization will create an *equivalent* object, but not an identical object.  It should not be confused with a shared memory or concurrency mechanism (but serialization is a building block for *some* concurrency models).

In [65]:
# Avoid the NaN issue
my_data = dict(name="David", likes_python=True, count=None,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

jstr = json.dumps(my_data)
new_data = json.loads(jstr)

In [66]:
print("Equality:", new_data == my_data)
print("Identity:", new_data is my_data)

Equality: True
Identity: False


## Serializing JSON to Files

The API of the `json` module generally matches that of `pickle`.  Along with the `dumps()` and `loads()`, the `json` module also has `dump()` and `load()`.  In all of these, the final 's' is a very compact way of expressing the idea that the function consumes or produces *strings* rather than files.  That naming convention has an old history; most likely newer methods that did not require backward compatibility would use more obvious names.

In contrast to pickle format, which is *usually* used to save files with serialized objects, JSON is *usually* used to create an in-memory string to send over various wire protocols.

In [67]:
with open('data/data.json', 'w') as fh:
    json.dump(my_data, fh)

!cat data/data.json

{"name": "David", "likes_python": true, "count": null, "pets": ["Astrophe", "Kachina", "Jackson", "Rebel"]}


## Reading Objects from Files

Reading JSON from a file (or from another file-like object) is symmetrical with writing it.  With Python's so-called duck-typing, anything with a `.read()` method producing a string or bytes can be passed to `json.load()`.  Symmetrically, any object with a `.write()` method accepting strings is suitable for `json.dump()`.  See examples in the previous lesson for use of several file-like objects. In this respect, `pickle` and `json` functions are the same, except that `json` writes text where `pickle` writes bytes.

In [68]:
json.load(open('data/data.json'))

{'name': 'David',
 'likes_python': True,
 'count': None,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}
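
As a brief sketch of the file-like idea applied to JSON, an in-memory text buffer can stand in for the file.  The variable names here are made up for the illustration.

```python
import io
import json

record = {"name": "David", "likes_python": True, "count": None}

# json.dump() writes str, so a text buffer is the natural in-memory "file"
text_buffer = io.StringIO()
json.dump(record, text_buffer)
text_buffer.seek(0)
restored = json.load(text_buffer)

# json.load() also accepts a binary file-like object holding encoded JSON
byte_buffer = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
restored_from_bytes = json.load(byte_buffer)

print(restored == record, restored_from_bytes == record)
```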

# JSON Limitations

Only basic Python collections and scalars can be directly represented in JSON; however, these collections *can* be nested indefinitely.  Specifically, JSON allows for dictionaries (called "objects" in the spec) and lists (called "arrays" in the spec); JSON does not have a way of representing tuples, sets, `collections.deque`, `collections.Counter`, NumPy arrays, or other collections you might use in Python.  The keys for JSON objects may only be strings, unlike Python dictionaries that can use any hashable object. For many purposes, casting another collection to a list suffices to transmit the data.

The scalars supported by JSON are exclusively: the three literal names `true`, `false`, and `null`, strings, and numbers.  Strings are surrounded by double quotes, and may contain escaped Unicode code points.  JSON itself only contains a generic "number" datatype.  By default, numbers without decimal points will be interpreted as Python ints.  Numbers with decimals will be interpreted as Python floats.  Python allows other number types, such as `decimal.Decimal`, `fractions.Fraction`, or NumPy values of specific bit lengths.
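
A short sketch may make these limits easier to see.  It round-trips a dictionary with an integer key and a tuple value, tries to serialize a set, and then reads a few numbers back to show which Python types they become.

```python
import json

# Integer keys become strings, and tuples become lists
original = {1: (10, 20), "tags": ["a", "b"]}
restored = json.loads(json.dumps(original))
print(restored)

# A set has no JSON equivalent at all
try:
    json.dumps({"unique": {1, 2, 3}})
    set_error = None
except TypeError as err:
    set_error = err
print(set_error)

# Casting to a list is often enough to transmit the data
as_list = json.dumps({"unique": sorted({3, 1, 2})})
print(as_list)

# A decimal point or an exponent decides between int and float
numbers = json.loads('[1, 1.0, 1e3]')
print([type(n).__name__ for n in numbers])
```

Notice that the round trip succeeds without complaint in the first case, but `restored` is no longer equal to `original`.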

## Serialization Failures

Any custom classes, including ones that represent special scalars, will fail by default in JSON serialization.

In [69]:
timestamp = datetime.fromisoformat('2020-05-24T00:55:10')
try:
    json.dumps(timestamp)
except Exception as err:
    print(err)

Object of type datetime is not JSON serializable


In [70]:
decnum = Decimal('3.1415')
try:
    json.dumps(decnum)
except Exception as err:
    print(err)

Object of type Decimal is not JSON serializable


## Forcing Serialization

We can customize how special Python datatypes are serialized and deserialized.  This should be done with caution, however, because it can also impact interoperability with other systems.  This might mean systems in other programming languages, or it might simply be other Python machines without the same customizations.  First let us handle extra serialization.

In [71]:
class ScalarEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, (Decimal, Fraction)):
            return float(o)
        elif isinstance(o, datetime):
            return datetime.isoformat(o)
        else:
            return super().default(o)

### Semi-Generic Types

Let us encode some data using the custom encoder we developed.

In [72]:
nums = [timestamp, 42, decnum, pi, Fraction(22, 7)]
jnums = json.dumps(nums, cls=ScalarEncoder)
pprint(jnums, width=55)

('["2020-05-24T00:55:10", 42, 3.1415, '
 '3.141592653589793, 3.142857142857143]')


This customization will not introduce much compatibility concern.  The same "number" can be represented in different systems.  However, notice that in the JSON representation absolutely nothing distinguishes the float, Decimal, and Fraction we started with as several approximations of the transcendental number pi.  The timestamp has simply become a string, but one that contains all the underlying information.

Reading back in this JSON serialization will work fine, but with all non-integral numbers as platform-native floats.  We can change the default deserialization type for floats and ints if we would like to. We impose just one type for each of float and int.

In [73]:
json.loads(jnums)

['2020-05-24T00:55:10', 42, 3.1415, 3.141592653589793, 3.142857142857143]

In [74]:
json.loads(jnums, parse_float=Decimal, parse_int=Fraction)

['2020-05-24T00:55:10',
 Fraction(42, 1),
 Decimal('3.1415'),
 Decimal('3.141592653589793'),
 Decimal('3.142857142857143')]
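
The `parse_float` and `parse_int` arguments only affect numbers; the timestamp stays a string.  One possible way to recover it, sketched here under an assumed convention, is the `object_hook` argument of `json.loads()`, which is called with each decoded JSON object.  The rule that keys ending in `_at` hold ISO 8601 timestamps is purely hypothetical and would need to be agreed on by both ends.

```python
import json
from datetime import datetime


def restore_timestamps(obj):
    "Hypothetical convention: keys ending in '_at' hold ISO 8601 strings"
    return {
        key: datetime.fromisoformat(value)
        if key.endswith("_at") and isinstance(value, str)
        else value
        for key, value in obj.items()
    }


doc = '{"description": "Developed Lesson", "started_at": "2020-05-22T12:11:10"}'
event = json.loads(doc, object_hook=restore_timestamps)
print(event)
```

The hook only sees JSON objects (dictionaries), so a bare list of values such as `jnums` would need a different approach.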

## Customizing Serialization

For complex objects, the `.__dict__` of the object often serves as a reasonable proxy for "the interesting data" inside the object. We saw a definition of a custom encoder and could enhance it to deal with additional types that way. However, this is about the point where you want to worry more about the actual utility of your serialization, especially if you will transmit it to other systems (e.g. ones running different programming languages).

In [75]:
class RobustEncoder(ScalarEncoder):
    def default(self, o):
        try:
            return super().default(o)
        except TypeError:  # raised by JSONEncoder.default() for unknown types
            return o.__dict__

In [76]:
@dataclass
class TestData:
    description: str
    timestamp: datetime
    numbers: list

In [77]:
test_data = TestData(description="Pi approximations",
                     timestamp=timestamp,
                     numbers=[decnum, pi, Fraction(22, 7)])
pprint(str(test_data), width=56)

("TestData(description='Pi approximations', "
 'timestamp=datetime.datetime(2020, 5, 24, 0, 55, 10), '
 "numbers=[Decimal('3.1415'), 3.141592653589793, "
 'Fraction(22, 7)])')


In [78]:
pprint(json.dumps(test_data, cls=RobustEncoder))

('{"description": "Pi approximations", "timestamp": "2020-05-24T00:55:10", '
 '"numbers": [3.1415, 3.141592653589793, 3.142857142857143]}')
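
For dataclasses specifically, an alternative to reaching into `.__dict__` is the standard library function `dataclasses.asdict()`, combined with the `default` argument of `json.dumps()`.  The sketch below uses a hypothetical `Reading` class and a hypothetical helper `encode_extra`; it is one possible arrangement, not a replacement for the encoder classes above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class Reading:
    label: str
    taken: datetime
    values: list


def encode_extra(o):
    "Hypothetical fallback: called only for objects json cannot handle"
    if isinstance(o, datetime):
        return o.isoformat()
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


reading = Reading("demo", datetime(2020, 5, 24, 0, 55, 10), [1.5, 2])

# asdict() converts the dataclass (and any nested dataclasses) to plain dicts
as_json = json.dumps(asdict(reading), default=encode_extra)
print(as_json)
```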


# JSON Pickles

If your concern for interoperability is low, and you only wish to exchange data between reasonably similarly configured Python systems (or only persist objects on the same system), the third-party module `jsonpickle` does this abstraction for you.  This achieves round-tripping, which is often useful.  Its capabilities and limitations are essentially identical to `pickle` itself.  However, the binary pickle format is considerably more compact than the JSON string format.

In [79]:
jpkl = jsonpickle.encode(test_data, indent=True)
new_data = jsonpickle.decode(jpkl)
pprint(str(new_data), width=56)

("TestData(description='Pi approximations', "
 'timestamp=datetime.datetime(2020, 5, 24, 0, 55, 10), '
 "numbers=[Decimal('3.1415'), 3.141592653589793, "
 'Fraction(22, 7)])')


## The Verbose Format

Above, I used the `indent=True` option to produce more human-readable (but somewhat larger) JSON output.  It only modifies semantically meaningless whitespace.  The same option exists in the standard library `json` module.  Let us look at what is contained in this specialized JSON format. We will use several slides to see the parts.

In [80]:
lines = jpkl.splitlines()
print('\n'.join(lines[:14]))

{
 "py/object": "__main__.TestData",
 "description": "Pi approximations",
 "timestamp": {
  "py/object": "datetime.datetime",
  "__reduce__": [
   {
    "py/type": "datetime.datetime"
   },
   [
    "B+QFGAA3CgAAAA=="
   ]
  ]
 },


In [81]:
print('\n'.join(lines[14:28]))

 "numbers": [
  {
   "py/reduce": [
    {
     "py/type": "decimal.Decimal"
    },
    {
     "py/tuple": [
      "3.1415"
     ]
    }
   ]
  },
  3.141592653589793,


In [None]:
print('\n'.join(lines[28:]))

# Sharing JSON Among Languages

JavaScript Object Notation (JSON) is designed as a data interchange format.  Specifically, it is probably used most commonly for RESTful (Representational State Transfer) web services.  While those might run in Python, there are numerous other programming languages and frameworks they might use; notably JavaScript is a prominent option.  Every widely used modern programming language has libraries supporting JSON.

For this lesson, we utilize an example Node.js server that is licensed as GPL v.3.0, and can be installed from Rob Kendal's GitHub repository at https://github.com/sereynha/ecommerce.  That repository is accompanied by an excellent introductory article that describes the steps of creating a simple Node.js webserver.  I have modified that code only in minor ways for this lesson.  I will show two snippets of the JavaScript code used for illustration, but the focus here is on talking to the server from Python, not learning JavaScript or Node.js.

In [84]:
import json
from http import HTTPStatus
import requests
!cp node-server/data/users-start.json node-server/data/users.json

cp: node-server/data/users-start.json: No such file or directory


## Making REST Requests

This lesson—and *microservices* very commonly—will consist of calling a webserver with a *payload* formatted as JSON, and receiving a response, also usually formatted as JSON.  This structure allows many servers to interact in a manner similar to function calls, with both computation and state distributed among the various servers.  An older approach to this same architecture was XMLRPC, which in fact has a current but legacy Python standard library module `xmlrpc` to support it.

The server in this lesson provides a simple key/value database of users.  All users must have a name and a password, but they may also optionally have other data associated with them.  This design is obviously terrible from a security perspective, since "passwords" are transmitted and stored without encryption (as is other data), but that concern is not for this lesson.

The third-party package `requests` is recommended for HTTP clients, even in the Python standard library documentation itself.  The standard library module `urllib.request` has a less intuitive API, but will perform the same tasks if the third-party package is not available.  In our server, we can query the data it contains by making a GET request to the endpoint `/api/products`.

A GET request does not pass any JSON body data; in principle it could pass URL parameters to communicate data, but that style is not used in this lesson.

In [108]:
import requests
import json

url = 'http://localhost:3031/api/products?page=1&limit=5'

login_url = 'http://localhost:3031/api/auth/login'
login_data = {
    'email': 'your_email@example.com',
    'password': 'your_password'
}

login_response = requests.post(login_url, json=login_data)
token = json.loads(login_response.text).get('token')

headers = {
    'Authorization': f'Bearer {token}'
}

response = requests.get(url, headers=headers)

# Show status code and load JSON body
print(response.status_code)
print(response.headers['Content-Type'])
data = json.loads(response.text)
print(data)

200
application/json; charset=utf-8
{'count': 3, 'data': [{'id': 3, 'name': 'Beer', 'description': 'A sweet Beer', 'stock': 0, 'price': '15', 'tags': 'Berr,Thai', 'categoryId': 1, 'createdAt': '2024-07-28T15:25:25.919Z', 'updatedAt': '2024-07-28T15:25:25.919Z'}, {'id': 2, 'name': 'Knitted T-Shirt', 'description': 'Knitted t-shirt featuring short sleeves, graphic print at the front and crew neckline.60% Polyester 40% Viscose', 'stock': 0, 'price': '8.69', 'tags': 'T-Shirt,clothing', 'categoryId': 1, 'createdAt': '2024-07-28T11:12:11.298Z', 'updatedAt': '2024-07-28T11:12:11.298Z'}, {'id': 1, 'name': 'Cross Training T-Shirt', 'description': 'Model is 173 cm tall / 65 kg weight and is wearing size M.', 'stock': 6, 'price': '8.69', 'tags': 'T-Shirt,clothing,updated', 'categoryId': 1, 'createdAt': '2024-07-27T08:10:21.196Z', 'updatedAt': '2025-07-16T12:30:35.787Z'}]}
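
For comparison, here is a hedged sketch of how similar requests could be prepared with the standard library alone.  Nothing is sent over the network; the code only builds `urllib.request.Request` objects and inspects them.  The token value is a placeholder, and the endpoints simply mirror the ones used in this lesson.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

base_url = "http://localhost:3031/api/products"
token = "example-token"  # placeholder, not a real credential

# A GET request carries its data as URL parameters, not as a JSON body
query = urlencode({"page": 1, "limit": 5})
get_request = Request(f"{base_url}?{query}",
                      headers={"Authorization": f"Bearer {token}"})

# A POST request carries a JSON body, encoded to bytes by hand
payload = json.dumps({"productId": 1, "quantity": 2}).encode("utf-8")
post_request = Request("http://localhost:3031/api/carts",
                       data=payload,
                       headers={"Content-Type": "application/json",
                                "Authorization": f"Bearer {token}"},
                       method="POST")

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)

# With a server running, urllib.request.urlopen(get_request) would send it
```

The extra steps, namely encoding the query string, encoding the body, and setting the content type by hand, are what `requests` does on our behalf with its `params=` and `json=` arguments.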


### Unsuccessful Requests

A well-behaved webserver will return a status code indicating the nature of the problem with a request. A very small support function will help us show the response details.

In [88]:
def phrase(response):
    for st in HTTPStatus:
        if st.value == response.status_code:
            return f"{st.value} {st.phrase}"

Trying a resource that simply does not exist.

In [100]:
url2 = 'http://localhost:3031/api/products/search/text?params=Beer'
response = requests.get(url2, headers=headers)
print(phrase(response))
try:
    json.loads(response.text)
except Exception as err:
    print(err)

200 OK


At times we might see a status code that is neither 200 nor 404.  A 404 may not have any body, but other status codes are likely to have a body that is encoded as plain text or in another manner.  We can use the `Content-Type` header as a clue to decide whether to JSON decode the body.

In [101]:
url3 = 'http://localhost:3031/api/orders'
response = requests.get(url3, headers=headers)
print(phrase(response))
print(response.headers['Content-Type'])
response.text

200 OK
application/json; charset=utf-8


'{"datas":[]}'
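
The `Content-Type` header printed above can drive the decision of whether to parse the body.  Here is a minimal sketch of that idea; the helper name `decode_body` is hypothetical, and it takes the header value and body as plain strings so that it runs without a server.

```python
import json


def decode_body(content_type, text):
    "Return parsed JSON when the header says JSON, else the raw text"
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json" and text:
        return json.loads(text)
    return text


parsed = decode_body("application/json; charset=utf-8", '{"datas":[]}')
plain = decode_body("text/plain", "Not Found")
print(parsed, plain)

# With a live response object, the call might look like:
# decode_body(response.headers.get('Content-Type', ''), response.text)
```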

## Pushing JSON

The way this server is configured, the same endpoint behaves differently if it receives a POST request rather than a GET request.  With a POST, a new record is added to the database.

In [104]:
url4 = "http://localhost:3031/api/carts"
cartData = {
  "productId": 1,
  "quantity": 2
}
response = requests.post(url4, json=cartData, headers=headers)
print(phrase(response))
response.text

201 Created


'{"message":"Create successful","success":true}'

Let us make sure the database has the contents we hope for.

In [109]:
response = requests.get(url, headers=headers)
json.loads(response.text)

{'count': 3,
 'data': [{'id': 3,
   'name': 'Beer',
   'description': 'A sweet Beer',
   'stock': 0,
   'price': '15',
   'tags': 'Berr,Thai',
   'categoryId': 1,
   'createdAt': '2024-07-28T15:25:25.919Z',
   'updatedAt': '2024-07-28T15:25:25.919Z'},
  {'id': 2,
   'name': 'Knitted T-Shirt',
   'description': 'Knitted t-shirt featuring short sleeves, graphic print at the front and crew neckline.60% Polyester 40% Viscose',
   'stock': 0,
   'price': '8.69',
   'tags': 'T-Shirt,clothing',
   'categoryId': 1,
   'createdAt': '2024-07-28T11:12:11.298Z',
   'updatedAt': '2024-07-28T11:12:11.298Z'},
  {'id': 1,
   'name': 'Cross Training T-Shirt',
   'description': 'Model is 173 cm tall / 65 kg weight and is wearing size M.',
   'stock': 6,
   'price': '8.69',
   'tags': 'T-Shirt,clothing,updated',
   'categoryId': 1,
   'createdAt': '2024-07-27T08:10:21.196Z',
   'updatedAt': '2025-07-16T12:30:35.787Z'}]}

The server may validate a POST request (or any request) in some manner, and return an appropriate status based on the JSON passed to it.

In [110]:
anon = {"password": "P4cC!^*8chWz8", "profession": "Hacker"}
response = requests.post(url, data=json.dumps(anon), headers=headers)
print(phrase(response))
response.text

401 Unauthorized


'{"message":"Unauthorized","errorCode":401}'

# What the Server is Doing

The Node.js server has a bit of scaffolding to implement a server.  A very similar webserver could be implemented in Python or any other programming language.  While you may not be familiar with JavaScript, the below code should not be difficult to understand in outline.  This is the code that handles a POST to the `/users` route.

Although the data file that stores the database is itself simply JSON, the server explicitly parses it as JSON to assure the format.  Setting the header immediately before the call to `res.send()` is redundant because the server can detect the type from the JSON object; I added it to illustrate that we are able to explicitly set it.  Very similar APIs are present in Python webservers.

# JSON Schema

The prior lesson demonstrated communicating between a RESTful web server and a client.  Recall that we sent HTTP POST messages with a JSON body to a server and received JSON responses from GET queries.  One thing that was not done in the example was any validation of the format of these messages.  Or rather, there was one element of ad-hoc validation in that the server required the field "name" to be present in a user record.

Using JSON Schema, we can more precisely specify all the elements that may be present in an acceptable JSON document, including which are required versus optional, and indicate datatypes and nesting of containers.  A JSON Schema can contain varying levels of detail.  We will look at some possible schemata to define a valid user with varying degrees of specificity.

Let us start out by loading Python standard library modules and the third-party `jsonschema` module.  We also create JSON strings for several users to validate.

In [111]:
import json
from jsonschema import validate, ValidationError

In [112]:
guido = json.loads("""{
  "name": "Guido van Rossum",
  "password": "unladenswallow",
  "details": {
    "profession": "ex-BDFL"
  }
}""")

In [113]:
david = json.loads("""{
  "name": "David Mertz",
  "password": "badpassword",
  "details": {
    "profession": "Data Scientist",
    "publisher": "INE"
  },
  "lucky_numbers": [12, 42, 55, 87]
}""")

In [114]:
intruder = json.loads("""{
  "password": "P4cC!^*8chWz8",
  "profession": "Hacker"
}""")

# Validation

A JSON Schema is itself a JSON document following certain specifications.  At the simplest, it needs to specify a type for the JSON being validated. The module `jsonschema` expects Python objects as both `instance` and `schema` arguments.  If you are beginning with JSON—which is, after all, the point of using it—you need to use the `json` module to convert both to Python objects first.
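
As a small illustration of that conversion step, the sketch below uses only the standard library.  The names `schema_text` and `instance_text`, and the JSON they contain, are invented for this example; the point is simply to show which Python types come out of `json.loads()`.

```python
import json

# Both the schema and the instance start life as JSON text
schema_text = '{"type": "object", "required": ["name"]}'
instance_text = (
    '{"name": "Guido van Rossum", "lucky_numbers": [7, 42], '
    '"active": true, "nickname": null}'
)

schema = json.loads(schema_text)
instance = json.loads(instance_text)

# JSON objects become dicts, arrays become lists,
# true/false become True/False, and null becomes None
print(type(schema).__name__)
print(type(instance["lucky_numbers"]).__name__)
print(instance["active"], instance["nickname"])
```

Objects produced this way are what `jsonschema` expects for its `instance` and `schema` arguments.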

The API the `jsonschema` module uses might be surprising.  It raises an exception on failure, but passes silently on success.  Let us look at a couple examples.

### Checking Scalars

In [115]:
try:
    validate(instance=99, schema={"type": "number"})
    print("99 is a number")
except ValidationError as err:
    print(err)

99 is a number


In [116]:
try:
    validate(99, {"type": "string"})
    print("99 is a string")
except ValidationError as err:
    print(err)

99 is not of type 'string'

Failed validating 'type' in schema:
    {'type': 'string'}

On instance:
    99


In [117]:
try:
    validate("99", {"type": "number"})
    print("'99' is a number")
except ValidationError as err:
    print(err)

'99' is not of type 'number'

Failed validating 'type' in schema:
    {'type': 'number'}

On instance:
    '99'


## A Test Function

I find it easier to wrap the exception raising API with a function that will return either the error description as a string or None as a sentinel for "no errors."

In [118]:
def not_valid(instance, schema):
    try:
        validate(instance, schema)
        return None
    except ValidationError as err:
        return str(err)

The following is the pattern we will use for the remaining examples.

In [119]:
# The "walrus operator" requires Python 3.8+
if msg := not_valid("Ooops", {"type": "array"}):
    print(msg)

'Ooops' is not of type 'array'

Failed validating 'type' in schema:
    {'type': 'array'}

On instance:
    'Ooops'


# Checking Users

The simple examples above do not check structured collections.  All user JSON records are what JavaScript calls "objects" but Python calls dicts.  For a JSON object, we need to define both the type and the properties we expect it to have.  We may specify keys as required, but validation will not prohibit inclusion of "cargo" in keys we have not specified.  Very often this is exactly the desired behavior; JSON often carries extra information that might be used by other consumers, but a particular consumer only needs to assure the parts it cares about are present.

In [120]:
schema = json.loads("""{
  "type" : "object",
  "required": ["name"],
  "properties" : {
    "name" : {"type" : "string"}
    }
}""")

In [121]:
for user in [guido, david]:
    if msg := not_valid(user, schema):
        print(msg, "\n--------------------")
    else:
        print(f"User {user['name']} validates correctly")

User Guido van Rossum validates correctly
User David Mertz validates correctly


The schema in this first pass suffices to check the constraint the server in the prior lesson imposed.  In fact, it checks slightly more in guaranteeing that the field "name" is a string.


In [122]:
barbara_feldon = json.loads("""{
  "name": 99,
  "details": {"profession": "CONTROL Agent"}
}""")

We have two not-quite-conformant user JSON documents to validate. Each fails in a different way.

In [123]:
for user in [barbara_feldon, intruder]:
    if msg := not_valid(user, schema):
        print(msg, "\n--------------------")
    else:
        print(f"User {user['name']} validates correctly")

99 is not of type 'string'

Failed validating 'type' in schema['properties']['name']:
    {'type': 'string'}

On instance['name']:
    99 
--------------------
'name' is a required property

Failed validating 'required' in schema:
    {'type': 'object',
     'required': ['name'],
     'properties': {'name': {'type': 'string'}}}

On instance:
    {'password': 'P4cC!^*8chWz8', 'profession': 'Hacker'} 
--------------------
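
When the permissive treatment of unspecified keys is not what you want, the standard JSON Schema keyword `additionalProperties` can forbid them.  The following is a minimal sketch rather than part of the lesson's running example; `strict_schema` and the helper `check()` are names made up here, and the helper returns only the short `message` attribute of the exception.

```python
from jsonschema import validate, ValidationError

# Hypothetical stricter variant of the first schema: unknown keys are rejected
strict_schema = {
    "type": "object",
    "required": ["name"],
    "properties": {"name": {"type": "string"}},
    "additionalProperties": False,
}

def check(instance, schema):
    """Return the short error message, or None if the instance validates."""
    try:
        validate(instance, schema)
        return None
    except ValidationError as err:
        return err.message

ok = check({"name": "Guido van Rossum"}, strict_schema)
extra = check(
    {"name": "Guido van Rossum", "password": "unladenswallow"},
    strict_schema,
)
print(ok)
print(extra)
```

The first record validates, while the second is rejected solely because it carries a key the schema does not list.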


## Nested Structure

A JSON Schema allows specification of nested structures, including type and cardinality, and also may optionally contain a number of annotations to describe the schema itself.  Let us add a few. In the expanded schema, we will require a password along with a name.  Notice that we describe several aspects of what the field "lucky_numbers" might look like, but we do not make it required.  Guido had none, but David did; both should validate.

In [124]:
schema = json.loads("""{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://example.com/user.schema.json",
  "title": "User",
  "description": "A User of Our Computer System",
  "type" : "object",
  "required": ["name", "password"],
  "properties" : {
     "name" : {"type" : "string"},
     "password": {
         "description": "Use special characters and mixed case",
         "type": "string"},
     "lucky_numbers": {
         "description": "Up to 6 favorite numbers 1-100",
         "type": "array",
         "items": {
           "type": "number",
           "minimum": 1,
           "maximum": 100
         },
         "uniqueItems": true,
         "minItems": 0,
         "maxItems": 6
    }
  }
}""")

In [125]:
for user in [guido, david]:
    if msg := not_valid(user, schema):
        print(msg, "\n--------------------")
    else:
        print(f"User {user['name']} validates correctly")

User Guido van Rossum validates correctly
User David Mertz validates correctly


There are a few ways that validation might fail with the expanded schema.  Obviously, "password" was added as a required field, but the pattern there is identical to that for "name".  The field "lucky_numbers" has more going on.  It might be omitted altogether for a valid user, but if it is included, it can only be an array (Python list) of numbers between 1 and 100; moreover, it can only have from zero to six numbers that must be distinct.

In [126]:
the_count = json.loads("""{
  "name": "Count von Count",
  "password": "fourbananas",
  "lucky_numbers": ["one", "two", "three"]
}""")

if msg := not_valid(the_count, schema):
    print(msg, "\n--------------------")
else:
    print(f"User {the_count['name']} validates correctly")

'three' is not of type 'number'

Failed validating 'type' in schema['properties']['lucky_numbers']['items']:
    {'type': 'number', 'minimum': 1, 'maximum': 100}

On instance['lucky_numbers'][2]:
    'three' 
--------------------


In [127]:
cantor = json.loads("""{
  "name": "Georg Cantor",
  "password": "omega_aleph",
  "lucky_numbers": [1, 2, 3, 4, 5, 6, 7, 8]
}""")

if msg := not_valid(cantor, schema):
    print(msg, "\n--------------------")
else:
    print(f"User {cantor['name']} validates correctly")

[1, 2, 3, 4, 5, 6, 7, 8] is too long

Failed validating 'maxItems' in schema['properties']['lucky_numbers']:
    {'description': 'Up to 6 favorite numbers 1-100',
     'type': 'array',
     'items': {'type': 'number', 'minimum': 1, 'maximum': 100},
     'uniqueItems': True,
     'minItems': 0,
     'maxItems': 6}

On instance['lucky_numbers']:
    [1, 2, 3, 4, 5, 6, 7, 8] 
--------------------


In [129]:
revolution_9 = json.loads("""{
  "name": "Yoko Ono",
  "password": "grapefruit",
  "lucky_numbers": [9, 9, 9]
}""")

if msg := not_valid(revolution_9, schema):
    print(msg, "\n--------------------")
else:
    print(f"User {revolution_9['name']} validates correctly")

[9, 9, 9] has non-unique elements

Failed validating 'uniqueItems' in schema['properties']['lucky_numbers']:
    {'description': 'Up to 6 favorite numbers 1-100',
     'type': 'array',
     'items': {'type': 'number', 'minimum': 1, 'maximum': 100},
     'uniqueItems': True,
     'minItems': 0,
     'maxItems': 6}

On instance['lucky_numbers']:
    [9, 9, 9] 
--------------------


In [130]:
go_big = json.loads("""{
  "name": "Leslie Knope",
  "password": "ilovepawnee",
  "lucky_numbers": [1000000, 200000]
}""")

if msg := not_valid(go_big, schema):
    print(msg, "\n--------------------")
else:
    print(f"User {go_big['name']} validates correctly")

200000 is greater than the maximum of 100

Failed validating 'maximum' in schema['properties']['lucky_numbers']['items']:
    {'type': 'number', 'minimum': 1, 'maximum': 100}

On instance['lucky_numbers'][1]:
    200000 
--------------------
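
Because `validate()` raises a single exception, each report above describes just one problem even when a document has several.  The validator classes in `jsonschema`, such as `Draft7Validator`, also offer an `iter_errors()` method that yields every violation.  The sketch below assumes a trimmed-down copy of the user schema (called `user_schema` here) and an invented record, `leslie`, that is missing its password as well as having out-of-range numbers.

```python
from jsonschema import Draft7Validator

# Trimmed-down copy of the user schema, for illustration only
user_schema = {
    "type": "object",
    "required": ["name", "password"],
    "properties": {
        "name": {"type": "string"},
        "password": {"type": "string"},
        "lucky_numbers": {
            "type": "array",
            "items": {"type": "number", "minimum": 1, "maximum": 100},
            "uniqueItems": True,
            "maxItems": 6,
        },
    },
}

# Invented record with three separate problems
leslie = {"name": "Leslie Knope", "lucky_numbers": [1000000, 200000]}

validator = Draft7Validator(user_schema)
# iter_errors() yields every violation rather than raising on one
messages = sorted(err.message for err in validator.iter_errors(leslie))
for message in messages:
    print(message)
```

Collecting all the messages at once can be friendlier to whoever has to repair the document, at the cost of slightly more code than the `not_valid()` wrapper.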


# Reading CSV with Standard Library

Python provides a module in its standard library for reading and writing CSV or other delimited files.  It can be tempting to create or read such files using only Python's powerful string manipulation functionality.  Indeed, the author of this tutorial has done so far more often than he wishes to admit; however, it is a mistake to eschew the `csv` module which simply deals with many edge cases that are easy to overlook in quick scripts.

Let us start out by loading a few Python standard library modules that this lesson will utilize.

In [1]:
import csv
from pprint import pprint
from collections import namedtuple
from decimal import Decimal

# Doing it Wrong

In Python, the string methods `.split()` and `.join()` do 90% of what we need when working with CSV.  The problem is, they do not do the other 10%.  Let's try a naive approach that goes bad.

In [10]:
fields = ["Name", "Evaluation", "Rating", "Age"]
data = [
    ["Mia Johnson", "The movie was excellent", 9.5, 25],
    ["Liam Lopez", "Didn't really like it", 3.0, 35],
    ["Isabella Lee", "Wow! That was great", 8.0, 45,]
]

This is unremarkable data about several movie evaluations.  Let us try to serialize it.

> The data associated with this notebook can be found in the files associated with this course

In [3]:
with open('data/movie.csv', 'w') as movie:
    try:
        print(",".join(fields), file=movie)
        for record in data:
            print(",".join(record), file=movie)
    except Exception as err:
        print(err)

sequence item 2: expected str instance, float found


It is easy to see what went wrong.  The `.join()` method needs only strings in the iterable argument.  We can fix that fairly easily.  Python knows how to *stringify* all its objects.

In [4]:
with open('data/movie.csv', 'w') as movie:
    try:
        print(",".join(fields), file=movie)
        for record in data:
            print(",".join(str(r) for r in record), file=movie)
    except Exception as err:
        print(err)

Success! At least for now. Perhaps we want to read it back as a list of dictionaries.

We need to read the header first to use as keys, then we can pull values from each corresponding position in later rows.

In [5]:
with open('data/movie.csv') as movie:
    newdata = []
    keys = next(movie).split(',') # Header
    for line in movie:
        newdata.append(dict(zip(keys, line.split(','))))

pprint(newdata)

[{'Age\n': '25\n',
  'Evaluation': 'The movie was excellent',
  'Name': 'Mia Johnson',
  'Rating': '9.5'},
 {'Age\n': '35\n',
  'Evaluation': "Didn't really like it",
  'Name': 'Liam Lopez',
  'Rating': '3.0'},
 {'Age\n': '45\n',
  'Evaluation': 'Wow! That was great',
  'Name': 'Isabella Lee',
  'Rating': '8.0'}]


We did *pretty well*.  However, the last field of the header and of each data row has a trailing newline character we do not really want.  We can strip that, but other problems still arise.

In [6]:
with open('data/movie.csv') as movie:
    newdata = []
    line = next(movie).rstrip()  # Header
    keys = line.split(',')
    for line in movie:
        line = line.rstrip()
        newdata.append(dict(zip(keys, line.split(','))))

pprint(newdata)

[{'Age': '25',
  'Evaluation': 'The movie was excellent',
  'Name': 'Mia Johnson',
  'Rating': '9.5'},
 {'Age': '35',
  'Evaluation': "Didn't really like it",
  'Name': 'Liam Lopez',
  'Rating': '3.0'},
 {'Age': '45',
  'Evaluation': 'Wow! That was great',
  'Name': 'Isabella Lee',
  'Rating': '8.0'}]


We can see that something is going to go wrong when a field can legitimately contain the delimiter.

In [8]:
!cat data/movie.csv

Name,Evaluation,Rating,Age
Mia Johnson,The movie was excellent,9.5,25
Liam Lopez,Didn't really like it,3.0,35
Isabella Lee,Wow! That was great,8.0,45


Let's use the identical ad hoc reader to read the data on disk again.

In [7]:
with open('data/movie.csv') as movie:
    newdata = []
    line = next(movie).rstrip()
    keys = line.split(',') # Header
    for line in movie:
        line = line.rstrip()
        newdata.append(dict(zip(keys, line.split(','))))

pprint(newdata)

[{'Age': '25',
  'Evaluation': 'The movie was excellent',
  'Name': 'Mia Johnson',
  'Rating': '9.5'},
 {'Age': '35',
  'Evaluation': "Didn't really like it",
  'Name': 'Liam Lopez',
  'Rating': '3.0'},
 {'Age': '45',
  'Evaluation': 'Wow! That was great',
  'Name': 'Isabella Lee',
  'Rating': '8.0'}]


As written, nothing crashed, and with this particular data every value landed in the right field.  Had any field contained a comma, however, the ad hoc reader would silently put data in the wrong fields.  Another likely problem is handling embedded newlines in strings; a few other edge cases also occur.  We could complicate matters with some additional code, and eventually get it right.  But the Python standard library does that for us.
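
The delimiter problem is easiest to see with a field that actually contains a comma.  The record below is invented for this sketch (it is not in `data/movie.csv`), and it is read from a string rather than from disk.

```python
import csv
import io

# Hypothetical record whose Evaluation field contains the delimiter
line = 'Liam Lopez,"Didn\'t like it, really",3.0,35'

# Naive approach: the comma inside the quoted field splits it in two
naive = line.split(",")
print(len(naive), naive)

# csv.reader honors the quoting and recovers the intended fields
proper = next(csv.reader(io.StringIO(line)))
print(len(proper), proper)
```

The naive split produces one field too many and leaves stray quote characters behind, so zipping it against the header would misalign the values.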

# The `csv` Module

In the basic case, using the `csv` module gives us a largely file-like interface.  It merely handles a few things that can go wrong automatically.

In [9]:
with open('data/movie.csv', 'w') as fh:
    movie = csv.writer(fh, quoting=csv.QUOTE_MINIMAL)
    for record in [fields]+data:
        movie.writerow(record)

!cat data/movie.csv

Name,Evaluation,Rating,Age
Mia Johnson,The movie was excellent,9.5,25
Liam Lopez,Didn't really like it,3.0,35
Isabella Lee,Wow! That was great,8.0,45


Reading the data back is similar, with quoting and escaping handled properly.

In [10]:
with open('data/movie.csv') as fh:
    movie = csv.reader(fh)
    for record in movie:
        print(record)

['Name', 'Evaluation', 'Rating', 'Age']
['Mia Johnson', 'The movie was excellent', '9.5', '25']
['Liam Lopez', "Didn't really like it", '3.0', '35']
['Isabella Lee', 'Wow! That was great', '8.0', '45']


## Data Typing

Unlike some other tools, the standard library `csv` module makes little attempt to impose datatypes.  During writing, it will, of course, stringify objects that are not strings.  It usually leaves the decision of casting to other types up to the programmer.

In [11]:
with open('data/movie.csv', 'w') as fh:
    movie = csv.writer(fh, quoting=csv.QUOTE_NONNUMERIC)
    for record in [fields]+data:
        movie.writerow(record)

!cat data/movie.csv

"Name","Evaluation","Rating","Age"
"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45


The `csv` module provides a limited option to quote all strings and to infer that anything unquoted is a number instead.  The numeric type used is always a floating point for this rule.  If you wish to read in an int, or a Decimal or Fraction, or another numeric type, you still need to write more custom code.

In [12]:
with open('data/movie.csv') as fh:
    movie = csv.reader(fh, quoting=csv.QUOTE_NONNUMERIC)
    for record in movie:
        print(record)

['Name', 'Evaluation', 'Rating', 'Age']
['Mia Johnson', 'The movie was excellent', 9.5, 25.0]
['Liam Lopez', "Didn't really like it", 3.0, 35.0]
['Isabella Lee', 'Wow! That was great', 8.0, 45.0]


Probably what we really want is to specify various data types for various columns.  In the example, Age is probably meant as an integer and Rating as a fractional number.  While we are customizing, perhaps a different collection type than a list is a more descriptive way to store records.

In [13]:
# Specify special types, string by default
types = {'Age': int, 'Rating': Decimal}

with open('data/movie.csv') as fh:
    newdata = []
    # Create a descriptive record for this data
    movie = csv.reader(fh)
    fields = next(movie)
    Movie = namedtuple("Movie", fields)
    for record in movie:
        # Cast each item to its needed datatype
        for pos, datum in enumerate(record):
            cast = types.get(fields[pos], str)
            record[pos] = cast(datum)
        newdata.append(Movie(*record))

pprint(newdata)

[Movie(Name='Mia Johnson', Evaluation='The movie was excellent', Rating=Decimal('9.5'), Age=25),
 Movie(Name='Liam Lopez', Evaluation="Didn't really like it", Rating=Decimal('3.0'), Age=35),
 Movie(Name='Isabella Lee', Evaluation='Wow! That was great', Rating=Decimal('8.0'), Age=45)]
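
A dataclass is another reasonable record type for the same job.  The following sketch is a variation on the cell above, not a replacement for it: the class name `Review`, the mapping `casts`, and the in-memory CSV text are all invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class Review:
    # Hypothetical record type; field names mirror the CSV header
    Name: str
    Evaluation: str
    Rating: Decimal
    Age: int

text = """Name,Evaluation,Rating,Age
Mia Johnson,The movie was excellent,9.5,25
Liam Lopez,Didn't really like it,3.0,35
"""

# Columns not listed here stay as strings
casts = {"Rating": Decimal, "Age": int}

reader = csv.reader(io.StringIO(text))
header = next(reader)
reviews = []
for record in reader:
    # Pair each datum with its column name and cast it
    typed = {name: casts.get(name, str)(datum)
             for name, datum in zip(header, record)}
    reviews.append(Review(**typed))

print(reviews[0])
```

Unlike a namedtuple, a dataclass instance is mutable by default, which may or may not be what you want for records read from a file.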


## Records as Dictionaries

Namedtuples and dataclasses are useful Python standard library types for structured records.  The built-in type for the same purpose is a dictionary.  These different types have pros and cons, but all are useful.  The `csv` module includes classes that make reading or writing dicts convenient, so similar code comes out slightly shorter.

In [14]:
with open('data/movie.csv') as fh:
    movie = csv.DictReader(fh, quoting=csv.QUOTE_NONNUMERIC)
    for record in movie:
        print(record)

{'Name': 'Mia Johnson', 'Evaluation': 'The movie was excellent', 'Rating': 9.5, 'Age': 25.0}
{'Name': 'Liam Lopez', 'Evaluation': "Didn't really like it", 'Rating': 3.0, 'Age': 35.0}
{'Name': 'Isabella Lee', 'Evaluation': 'Wow! That was great', 'Rating': 8.0, 'Age': 45.0}


Writing dictionaries back out to CSV is very similar.  Here we add a minor option, `newline=` in order to be able to write records with newlines in strings.  While we are doing that, let us also use a different delimiter to demonstrate that.

In [15]:
with open('data/movie.txt', 'w', newline='') as fh:
    fields = ['Name', 'Rating', 'Age']
    movie = csv.DictWriter(fh, fieldnames=fields, delimiter="|")
    movie.writeheader()
    movie.writerow({'Name': 'Mia\nJohnson', 'Rating': 9.5, 'Age': 25})
    movie.writerow({'Age': 35, 'Name': 'Liam Lopez'})
    movie.writerow({'Name': 'Isabella "Bella" Lee', 'Rating': 8.0, 'Age': 45})

!cat data/movie.txt

Name|Rating|Age
"Mia
Johnson"|9.5|25
Liam Lopez||35
"Isabella ""Bella"" Lee"|8.0|45


Despite the slightly surprising newline inside a field, this will round-trip perfectly fine because of the quoting.  The quotes inside one of the fields are also handled correctly.

In [16]:
with open('data/movie.txt', newline='') as fh:
    movie = csv.DictReader(fh, delimiter="|")
    for record in movie:
        print(record)

{'Name': 'Mia\nJohnson', 'Rating': '9.5', 'Age': '25'}
{'Name': 'Liam Lopez', 'Rating': '', 'Age': '35'}
{'Name': 'Isabella "Bella" Lee', 'Rating': '8.0', 'Age': '45'}
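
The same round trip can be checked without touching the disk.  This sketch substitutes an `io.StringIO` buffer for the file (an assumption made purely for convenience) and writes the same three records, then reads them back.

```python
import csv
import io

fields = ["Name", "Rating", "Age"]
rows = [
    {"Name": "Mia\nJohnson", "Rating": 9.5, "Age": 25},
    {"Age": 35, "Name": "Liam Lopez"},  # no Rating key
    {"Name": 'Isabella "Bella" Lee', "Rating": 8.0, "Age": 45},
]

# newline="" mirrors the argument passed to open() above
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="|")
writer.writeheader()
writer.writerows(rows)

# Rewind and read the delimited text back as dictionaries
buffer.seek(0)
restored = list(csv.DictReader(buffer, delimiter="|"))
for record in restored:
    print(record)
```

Notice that the values come back as strings, and that the missing rating is restored as an empty string rather than as a missing key.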


# Reading CSV with Pandas

If it is available in your environment, the `Pandas` package provides a versatile, flexible, and fast reader and writer of CSV and other delimited files.  Moreover, when read, delimited files are read into a flexible data structure called a DataFrame that has numerous useful methods.  The Pandas library can perform a great deal of work for data processing and data manipulation, but most of that is outside the scope of this lesson.

Let us start out by loading the Pandas and NumPy libraries, along with `datetime` from the standard library.  Pandas is conventionally loaded as the short name `pd`.  Similarly, `NumPy` is conventionally loaded as `np`.

In [13]:
import numpy as np
import pandas as pd
from datetime import datetime

# Basic Reading

In principle, Pandas provides a huge number of options for reading CSV or other delimited files.  In fact, it has readers for a huge number of entirely different data formats as well.  In the simple case, it could hardly be simpler.  Let us look at a CSV file then read it to a DataFrame.

> The data associated with this notebook can be found in the files associated with this course

In [2]:
!cat data/movie.csv

"Name","Evaluation","Rating","Age"
"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45


In [3]:
df = pd.read_csv('data/movie.csv')
df

           Name               Evaluation  Rating  Age
0   Mia Johnson  The movie was excellent     9.5   25
1    Liam Lopez    Didn't really like it     3.0   35
2  Isabella Lee      Wow! That was great     8.0   45


### Data Types
Of interest here especially is the type inference that was performed by Pandas.  Things that look like integers get converted to integers, things that look like floats get converted to floats.

In [4]:
df.dtypes

Name           object
Evaluation     object
Rating        float64
Age             int64
dtype: object

### Explicit Typing
Pandas lets you specify the type of the columns explicitly, inasmuch as datatypes make sense for a given column.  For the most part, this is useful only to encode in fewer bits or to explicitly use floats where a column might be inferred as integer.

In [5]:
df = pd.read_csv(
    'data/movie.csv',
    dtype={'Age': np.float16, 'Rating': np.float64}
)
df.dtypes

Name           object
Evaluation     object
Rating        float64
Age           float16
dtype: object

### Parsing Dates
Pandas goes further than the standard library `csv` module can in also optionally parsing dates.  The next example adds a date column to the movie data and writes it back to disk.  If parsing a column as a date is specified, Pandas will apply a collection of heuristic rules to guess at what format was intended.  Dates accompanied by a particular time, down to a fraction of a second, can be parsed as well.

In [17]:
!cat data/movie.csv

"Name","Evaluation","Rating","Age"
"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45


In [25]:
df["Date"] = datetime.today().strftime('%Y-%m-%d')
df.to_csv('data/movie.csv', index=False)
!cat data/movie.csv

Name,Evaluation,Rating,Age,date,Date
Mia Johnson,The movie was excellent,9.5,25.0,2025-07-17,2025-07-17
Liam Lopez,Didn't really like it,3.0,35.0,2025-07-17,2025-07-17
Isabella Lee,Wow! That was great,8.0,45.0,2025-07-17,2025-07-17


In [26]:
pd.read_csv('data/movie.csv', parse_dates=['Date'])

           Name               Evaluation  Rating   Age        date        Date
0   Mia Johnson  The movie was excellent     9.5  25.0  2025-07-17  2025-07-17
1    Liam Lopez    Didn't really like it     3.0  35.0  2025-07-17  2025-07-17
2  Isabella Lee      Wow! That was great     8.0  45.0  2025-07-17  2025-07-17
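
The printed table looks the same whether or not a column was parsed, so it helps to inspect the dtypes.  The sketch below reads a small invented CSV from a string (not the course data file) twice, once with and once without `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical in-memory data with one date column
text = """Name,Rating,Date
Mia Johnson,9.5,2025-07-17
Liam Lopez,3.0,2025-07-18
"""

as_text = pd.read_csv(io.StringIO(text))
as_dates = pd.read_csv(io.StringIO(text), parse_dates=["Date"])

# Without parse_dates the column holds plain strings
print(as_text["Date"].dtype)
# With parse_dates it becomes a datetime column
print(as_dates["Date"].dtype)
# Datetime columns expose the .dt accessor
print(as_dates["Date"].dt.day.tolist())
```

Only the parsed column supports date arithmetic and the `.dt` accessor; the unparsed one would need a separate conversion step first.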


# Format Variations

By choosing from various available parameters, the same `pd.read_csv()` function can read most delimited formats.  For example, we can read the pipe (`|`) delimited file created in the last lesson that also had embedded newlines and quotes. The special value `NaN` (Not a Number) is used to mark missing data.

In [1]:
!cat data/movie.txt

Name|Rating|Age
"Mia
Johnson"|9.5|25
Liam Lopez||35
"Isabella ""Bella"" Lee"|8.0|45


Some of the parameters used in the below example are simply their default values. They are shown to illustrate the range of options.

In [None]:
df = pd.read_csv('data/movie.txt', 
                 sep="|", 
                 nrows=100, 
                 skip_blank_lines=True, 
                 decimal='.', 
                 quotechar='"')
df
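
To see what such a read produces without relying on the file on disk, the following sketch feeds the same pipe-delimited text to `pd.read_csv()` from a string.  The variable names `text` and `df_pipe` are made up for this example.

```python
import io
import pandas as pd

# Same content as data/movie.txt, held in a string
text = (
    'Name|Rating|Age\n'
    '"Mia\nJohnson"|9.5|25\n'
    'Liam Lopez||35\n'
    '"Isabella ""Bella"" Lee"|8.0|45\n'
)

df_pipe = pd.read_csv(io.StringIO(text), sep="|", quotechar='"')

# The embedded newline and the doubled quotes survive intact
print(repr(df_pipe.loc[0, "Name"]))
print(repr(df_pipe.loc[2, "Name"]))
# The empty Rating field is read as NaN
print(df_pipe["Rating"].isna().tolist())
```

The missing rating becomes `NaN`, which is also why the Rating column is stored as floating point.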

Sometimes you will encounter CSV or other delimited files without headers.  A few options can handle that.  If we do not give parameters to indicate this, the DataFrame will be confused.

In [None]:
pd.read_csv('data/movie-noheader.csv')

In [None]:
pd.read_csv('data/movie-noheader.csv', 
             names=['Person', 'Description', 'Score', 'Age'])
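
The difference between these calls can be sketched with in-memory data.  The text below is an assumption about what a header-less version of the movie file might contain; it is not copied from `data/movie-noheader.csv`.

```python
import io
import pandas as pd

# Assumed header-less content, for illustration only
text = '''"Mia Johnson","The movie was excellent",9.5,25
"Liam Lopez","Didn't really like it",3.0,35
"Isabella Lee","Wow! That was great",8.0,45
'''

# Default behavior: the first data row is mistaken for the header
confused = pd.read_csv(io.StringIO(text))
# header=None keeps every row and numbers the columns
numbered = pd.read_csv(io.StringIO(text), header=None)
# names= keeps every row and supplies column labels
named = pd.read_csv(io.StringIO(text),
                    names=["Person", "Description", "Score", "Age"])

print(list(confused.columns), len(confused))
print(list(numbered.columns), len(numbered))
print(list(named.columns), len(named))
```

In the confused case one record is lost to the header, so the DataFrame has two rows rather than three.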

# Exporting to CSV

Once you *have* a Pandas DataFrame, whether constructed from scratch, read from any of numerous data formats, modified and filtered using Pandas methods, or whatever, it is easy to export it to a new CSV file.  This is not as completely general purpose as the Python `csv` module in that it is only a DataFrame that can do the writing, not arbitrary arrangements of data that you have manually programmed to write as records.  However, it is extremely straightforward, and allows generally the same numerous parameters as the reader.

In [None]:
# Notice automatic compression based on extension
df.to_csv('data/movie.tsv.gz', 
          sep='\t', 
          na_rep="N/A",
          quotechar="'")

In [None]:
!zcat data/movie.tsv.gz
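
Reading the compressed file back is symmetrical, since `pd.read_csv()` also infers compression from the extension.  The sketch below uses a temporary directory and a small invented DataFrame (`frame`) so that it does not disturb the course data; it omits the `quotechar` option and drops the index for simplicity.

```python
import gzip
import os
import tempfile

import numpy as np
import pandas as pd

# Small invented DataFrame with one missing value
frame = pd.DataFrame({
    "Name": ["Mia Johnson", "Liam Lopez"],
    "Rating": [9.5, np.nan],
    "Age": [25, 35],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.tsv.gz")
    frame.to_csv(path, sep="\t", na_rep="N/A", index=False)

    # The file on disk really is gzip-compressed text
    with gzip.open(path, "rt") as fh:
        raw = fh.read()

    # read_csv infers the compression from the extension as well
    restored = pd.read_csv(path, sep="\t")

print(raw)
print(restored)
```

The string "N/A" is among the markers `pd.read_csv()` treats as missing by default, so the missing rating comes back as `NaN`.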