# Ways for Storing Data

Central to the life of the data scientist is, well, data! At this point you are already well on your way to being an expert in manipulating data once you have loaded it into Python, but we haven't talked very much about the various formats in which data may be given to you as files. So in this reading we will talk about the three main families of data files you are likely to encounter, their pros and their cons, and how to work with them. These three families are:

1. Plaintext files: files that store data as text. These are files that you could easily open up in a text editor and read yourself, making them very flexible and robust.
2. Binary files: files that have processed your data prior to storage. Reading and writing data from these files tends to be faster, and these files will tend to take up less space on your computer, but you can't easily open them up and look at them, and you need the right software to access them.
3. Databases: databases aren't (usually) individual files, but rather software that manages a collection of datasets in one place and controls access to those datasets.

## Plaintext Files

For all the advancements that have been made in sophisticated data storage formats, plaintext files remain the most common data format you are likely to encounter. That's because plaintext files store data, as the name suggests, as plain text! As a result there is very little that can go wrong with data stored in this format: any computer that knows how to read a text file can open plaintext data, meaning no one has to worry about whether future data users or colleagues will have the right version of the right software to read the data.

Indeed, nearly all of the data that we've used in this specialization has been stored in plaintext files. The US Income Data we worked with in Class 2 Week 2, for example, came from a file called `us_household_incomes.txt`, where `.txt` is a file suffix that just tells the computer the file is a "text" file. If I open up the file in VS Code (instead of trying to read it into Python), it looks like this:

![US Household Incomes Opened As Text](img/us_household_incomes_as_text.png)

Note both that the contents are easily readable (each line is the income of a single household that, when read into numpy, becomes one entry in a vector) and also that VS Code recognizes the file as Plain Text and displays that in the bottom right.
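To make the "one line becomes one entry in a vector" idea concrete, here is a minimal sketch. It does not use the actual `us_household_incomes.txt` file; `fake_file` and the three values in it are made up for illustration, and `np.loadtxt` is just one of several ways you might load a file like this.

```python
import io

import numpy as np

# Hypothetical stand-in for a text file with one number per line.
# These values are invented; they are not from the real income data.
fake_file = io.StringIO("52000\n48500\n103250\n")

# Each line of text becomes one entry in a one-dimensional array.
incomes = np.loadtxt(fake_file)

print(incomes.shape)  # (3,)
print(incomes[0])     # 52000.0
```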

And what exactly does it mean that the file can be opened in any text editor and read? It means that at the level of the 1s and 0s that make up the file, numbers and letters are encoded using simple, commonly used encodings (like [ASCII](https://en.wikipedia.org/wiki/ASCII) or [Unicode](https://en.wikipedia.org/wiki/Unicode)). These files also do not contain anything complicated (pictures, media, etc.), and in fact don't even include information like fonts or formatting.
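If you are curious what those 1s and 0s look like, the short sketch below encodes a piece of text as ASCII and prints the underlying bytes. The string `"855,Africa"` is just an example fragment I picked for illustration; any plain text would behave the same way.

```python
# A small fragment of text, like part of one row in a plaintext data file.
text = "855,Africa"

# Encoding turns each character into a number (one byte each in ASCII).
raw = text.encode("ascii")
print(list(raw[:3]))  # [56, 53, 53] -> the characters '8', '5', '5'

# The same three bytes written out as the actual 1s and 0s.
bits = " ".join(f"{b:08b}" for b in raw[:3])
print(bits)  # 00111000 00110101 00110101

# Decoding reverses the process and recovers the original text.
print(raw.decode("ascii") == text)  # True
```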

This simplicity makes plaintext files (nearly) universally compatible and easy to work with, so they are a favorite of programmers. Indeed, any code you've ever written in a file has almost surely been saved as a plaintext file too!

When it comes to the type of tabular data that we are working with in this course (data organized into rows and columns), there are two main plaintext formats to be aware of:

- Comma Separated Values (CSVs): plaintext files that use the file suffix `.csv`. In these files, each row of text represents one row in the data, and columns are separated by commas. 
- Tab Separated Values (TSVs): plaintext files that usually use the file suffix `.txt` or, less commonly, `.tsv`. In these files, each row of text represents one row in the data, and columns are separated by tabs (the special character denoting an indentation). 
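The only real difference between the two formats is the separator character, which the sketch below tries to show. The two strings are tiny made-up examples held in memory (not real files), and `io.StringIO` is used only so the example runs without needing anything on disk. With a real file you would pass a path instead.

```python
import io

import pandas as pd

# The same two rows of data, once with commas and once with tabs ("\t").
csv_text = "country,region\nBrazil,S. America\nGermany,W. Europe\n"
tsv_text = "country\tregion\nBrazil\tS. America\nGermany\tW. Europe\n"

# read_csv assumes commas by default; sep="\t" tells it to split on tabs.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both produce the same table.
print(from_csv.equals(from_tsv))  # True
```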

Of these two, CSVs are by far the most used, in part I suspect because tabs are often an invisible character, sometimes making it hard to see where one column ends and the next begins when looking at the file as text. A small CSV, by contrast, can be pretty easy to read (or at least get a sense of). Here, for example, is what a small CSV file looks like when I open it in VS Code:

![Small CSV in VS Code](img/world_v_small.png)

Across the top are our column names, and each row below contains one row of data (one observation). Note that unlike in, say, Excel, the columns of a CSV won't necessarily line up (unless by chance all the entries in a column are of the same size). 

Moreover, in CSVs, you will notice that entries in columns that are meant to be read as text will often—though not always—be enclosed in quotation marks (in this CSV, the second column uses quotation marks but the first does not, despite both being text). In theory you don't need them, but if you have data—like names written `LAST NAME, FIRST NAME`—that contain commas in the data itself, the quotation marks are required so your computer knows which commas separate columns and which are data.
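Here is a small sketch of why those quotation marks matter. The names and the `raw` string are invented for illustration. The example compares Python's built-in `csv` module, which understands quoting, with a naive split on every comma.

```python
import csv
import io

# A made-up CSV where the name column itself contains commas.
raw = 'id,name\n1,"Lovelace, Ada"\n2,"Hopper, Grace"\n'

# A CSV-aware reader treats the quoted text as a single entry.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'Lovelace, Ada']

# Splitting on every comma breaks the name into two pieces.
naive = raw.splitlines()[1].split(",")
print(naive)  # ['1', '"Lovelace', ' Ada"']
```

`pd.read_csv` handles quoted entries the same way by default, so you rarely need to think about this unless a file was written without the quotes.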

By the way, to make CSVs easier to read, there's a great little extension in VS Code called [Rainbow CSV](https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv) that will assign a color to all data in each CSV column. These colors aren't in the file itself—VS Code is parsing the CSV and adding the colors after the file has been opened:

![Small CSV in VS Code with Rainbow Coloring](img/world_v_small_w_rainbow.png)

And indeed, if we wanted to read this into pandas, we could do so easily with `pd.read_csv`, and we'd get the table we expect!

```python
import pandas as pd

pd.read_csv("data/world-very-small.csv")
```

| Unnamed: 0 | country | region | gdppcap08 | polityIV |
|---|---|---|---|---|
| 0 | Brazil | S. America | 10296 | 18 |
| 1 | Germany | W. Europe | 35613 | 20 |
| 2 | Mexico | N. America | 14495 | 18 |
| 3 | Mozambique | Africa | 855 | 16 |
| 4 | Russia | C&E Europe | 16139 | 17 |
| 5 | Ukraine | C&E Europe | 7271 | 16 |
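You may have noticed the column called `Unnamed: 0` in that output. pandas typically gives that name to a first column with no header, which often happens when a table's row labels were saved along with the data. I am assuming that is what happened here; the sketch below uses a small in-memory imitation of such a file (not the real `world-very-small.csv`) to show one way to handle it, using the `index_col` argument of `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up imitation of a CSV whose first column has no name in the header.
raw = ",country,polityIV\n0,Brazil,18\n1,Germany,20\n"

# By default, pandas keeps the nameless column and labels it "Unnamed: 0".
default = pd.read_csv(io.StringIO(raw))
print(list(default.columns))  # ['Unnamed: 0', 'country', 'polityIV']

# index_col=0 tells pandas to use that first column as the row labels.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)
print(list(indexed.columns))  # ['country', 'polityIV']
```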


## Binary Data

Binary data files 