# Entry 0: What is a file?

We should have an OS level understanding of what a file is.

I'm basing this notebook off of how Linux manages files, because I personally run Linux and have half-an-idea where to find docs.

Files are more or less strings of bytes, with a bunch of abstractions that eventually translate into information stored in some kind of memory. This means users don't need to worry about where the file actually goes on the disk, physically, so long as there's space. Somewhere, the kernel or a file system module maps the file's path to some physical address, and manages allocating space and linking bits of space that aren't connected. Like a lot of low level computing, this is a world of pointers and linked lists.

Of course, there's also metadata. Files have names, as well as timestamps for when the file was created, modified, and/or last accessed. And on a multiuser system (really any system these days, even if it's your personal device), there are permissions to specify who can read, write, or execute that file.

And part of that metadata is information on what's inside the file. On Windows, that's traditionally been done with the extension after the `.`, usually three or more characters (although I imagine there's probably a bit more sophistication these days), while Unix-y systems tend to look at the contents of the file itself, as we'll see.

Here's a file, and now I have some questions. I downloaded it from Wikimedia, [so I'm welcome to share it](https://commons.wikimedia.org/wiki/File:Tapir_con_cr%C3%ADa.JPG).

<img src="tapir.jpg" width=400>

What exactly is going on here?

Let's figure it out in the terminal.

```
kim@kims-netbook ~/p/multitudes> du -h tapir.jpg
516K	tapir.jpg
```

Here, `du -h` tells us how much space the image uses, in human friendly units. It's about half a megabyte.

```
kim@kims-netbook ~/p/multitudes> ls -l tapir.jpg
-rw-r--r-- 1 kim kim 523886 Nov 13 14:04 tapir.jpg
```

`ls -l` shows the permissions first, in three sets: one for the owner, one for the group, and one for everyone else. Then there is the link count (how many hard links, that is, directory entries, point at this file; a symbolic link would not raise it). Then the owner (ya girl Kim), the group (my own private kim group), the size in bytes, and a timestamp for the last modification.
As the owner, I can read and write it. If somebody else were in the "kim" group (how dare they join my exclusive club?), they could only read it, and so could everyone else on the system. I can change these flags using the `chmod` command, and root could hand the file to another user using `chown`. Nobody can execute this file as is, which is sensible. A JPEG is not a script or an executable.
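
To make the permission string less mysterious, here is a minimal Python sketch that decodes a mode value the way `ls -l` prints it. The helper name `describe_mode` is my own invention for illustration, and the value `0o100644` is the octal `0644` from this file's permissions with the "regular file" type bits added on top.

```python
import stat

def describe_mode(mode: int) -> dict:
    """Split a st_mode value into the pieces `ls -l` shows (hypothetical helper)."""
    text = stat.filemode(mode)  # renders the mode as a 10-character string
    return {
        "type": text[0],        # '-' regular file, 'd' directory, 'l' symlink
        "owner": text[1:4],
        "group": text[4:7],
        "others": text[7:10],
    }

# 0o100000 marks a regular file; 0o644 is rw- for owner, r-- for group and others.
tapir_mode = 0o100644
print(stat.filemode(tapir_mode))
print(describe_mode(tapir_mode))
```

Each set of three characters is read, write, execute, in that order, and a `-` means the bit is off.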

```
kim@kims-netbook ~/p/multitudes> ls -i tapir.jpg
1583549 tapir.jpg
```

This is a bit different. Now we have the `-i` flag, which lists the inode number. An inode is a data structure in Unix-y file systems that holds much of the metadata we've been looking at, along with information about which blocks on the disk hold the data. Inodes describe every object in the file system, and that includes directories as well.
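
Here is a small sketch of how inodes and link counts relate, assuming a Linux-like system where the temporary directory supports both hard and symbolic links. The file names are made up for the demo. A hard link is a second name for the same inode, while a symbolic link is a separate object with its own inode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "tapir.jpg")
    with open(original, "wb") as f:
        f.write(b"not really a tapir")

    hard = os.path.join(d, "tapir-hardlink.jpg")
    soft = os.path.join(d, "tapir-symlink.jpg")
    os.link(original, hard)      # second directory entry for the same inode
    os.symlink(original, soft)   # new object that merely stores a path

    info = os.stat(original)
    same_inode = info.st_ino == os.stat(hard).st_ino
    link_count = info.st_nlink   # counts hard links only
    # lstat looks at the symlink itself instead of following it
    symlink_has_own_inode = os.lstat(soft).st_ino != info.st_ino

    print("hard link shares inode:", same_inode)
    print("link count:", link_count)
    print("symlink has its own inode:", symlink_has_own_inode)
```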

```
kim@kims-netbook ~/p/multitudes> stat tapir.jpg
  File: 'tapir.jpg'
  Size: 523886    	Blocks: 1032       IO Block: 4096   regular file
Device: 806h/2054d	Inode: 1583549     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/     kim)   Gid: ( 1000/     kim)
Access: 2017-11-13 14:06:16.567884038 -0500
Modify: 2017-11-13 14:04:25.727888980 -0500
Change: 2017-11-13 14:04:52.383887792 -0500
 Birth: -
```

`stat` finds and displays facts from that inode. We have the size again, plus the number of storage blocks allocated (1032 blocks of 512 bytes each, which is the 516K that `du` reported), the ID of the device the file lives on, and three timestamps. I just downloaded it, so they're close to each other. `stat` can also describe directory inodes:
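
The same inode fields are available from Python through `os.stat`. This is a rough sketch, assuming a Linux-like system where `st_blocks` counts 512-byte units; the `summarize` helper and the `BLOCK_UNIT` constant are names I made up for the example.

```python
import os
import time

BLOCK_UNIT = 512  # assumption: st_blocks is reported in 512-byte units, as on Linux

def summarize(path: str) -> dict:
    """Collect a few of the fields the stat command prints (hypothetical helper)."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "blocks": st.st_blocks,
        "allocated_bytes": st.st_blocks * BLOCK_UNIT,
        "device": st.st_dev,
        "inode": st.st_ino,
        "links": st.st_nlink,
        "modified": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime)),
    }

# Check the arithmetic with the numbers from the stat output for tapir.jpg.
allocated = 1032 * BLOCK_UNIT
print(allocated, "bytes =", allocated // 1024, "KiB")

# Try it on the current directory.
print(summarize("."))
```

The allocated space comes out a little larger than the 523886-byte file size, because storage is handed out in whole blocks.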

```
  File: '.'
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 806h/2054d	Inode: 2765386     Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 1000/     kim)   Gid: ( 1000/     kim)
Access: 2017-11-13 14:51:35.767762805 -0500
Modify: 2017-11-13 14:51:28.439763132 -0500
Change: 2017-11-13 14:51:28.439763132 -0500
 Birth: -
```

This isn't our file, so I don't much care. I'm sorry, but I'm myopic like that. For a bit of fun research, read about what these permissions mean for a directory. Execute, for one, doesn't mean you can run the directory. It means you can enter it (say, make it your working directory) and reach the files inside.

```
kim@kims-netbook ~/p/multitudes> file tapir.jpg
tapir.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=8, description=SONY DSC                     , manufacturer=SONY           , model=DSLR-A100, software=DSLR-A100 v1.02, datetime=2007:10:02 13:05:59], baseline, precision 8, 1535x1150, frames 3
```

According to Linux's `file` command, the file is a JPEG, and it has a good amount of metadata about the image. We even know the EXIF metadata, which tells us the photo was taken with a Sony A100 DSLR back in 2007. I wish I had a camera like that and got to see a tapir in 2007.

But life moves forward, and not backwards. So I really need to think of the cameras and odd-toed ungulates of the future, rather than focusing on regrets.

Besides, at this point, we've figured out a file as a series of bytes in storage, as a record that a user can access or modify conditionally, and as data on the application level. That's pretty okay! Things are fine.

All of those commands showed a bit of the story behind that file, but how did `file` do what it did?

From the `file` [manpage](https://linux.die.net/man/1/file):
> file tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests. The first test that succeeds causes the file type to be printed.
>
> [...]
>
> The filesystem tests are based on examining the return from a stat(2) system call. The program checks to see if the file is empty, or if it's some sort of special file. Any known file types appropriate to the system you are running on (sockets, symbolic links, or named pipes (FIFOs) on those systems that implement them) are intuited if they are defined in the system header file
>
> The magic tests are used to check for files with data in particular fixed formats. [...] These files have a 'magic number' stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof.
>
> [...]
>
> If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. [...]
>
> Any file that cannot be identified as having been written in any of the character sets listed above is simply said to be 'data'.

In our case, we looked at a format that has a "magic number" constant in the first few bytes. `file` consults its list of known magic numbers (the "magic file") and returns a match if it finds one. If it were a text file, `file` would have tried to determine the encoding (ASCII? UTF-8?), and if it were a link or directory, it would have found that, too.
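
The three stages the manpage describes can be sketched in Python. To be clear, this is a toy and not how `file` is actually implemented: the `classify` function, the three-entry `MAGIC` table, and the output labels are all my own simplifications.

```python
import os
import stat

# A tiny stand-in for the magic file: (bytes at offset 0, description).
MAGIC = [
    (b"\xff\xd8", "JPEG image data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (b"%PDF-", "PDF document"),
]

def classify(path: str) -> str:
    """Toy version of the filesystem, magic, and text tests (hypothetical helper)."""
    # 1. Filesystem tests: ask the inode what kind of object this is.
    st = os.lstat(path)  # lstat so a symlink is reported, not followed
    if stat.S_ISDIR(st.st_mode):
        return "directory"
    if stat.S_ISLNK(st.st_mode):
        return "symbolic link"
    if stat.S_ISFIFO(st.st_mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(st.st_mode):
        return "socket"
    if st.st_size == 0:
        return "empty"

    # 2. Magic tests: compare the first bytes against known constants.
    with open(path, "rb") as f:
        head = f.read(4096)
    for prefix, label in MAGIC:
        if head.startswith(prefix):
            return label

    # 3. Text tests: see whether the bytes decode in a known character set.
    # Simplification: a multi-byte character cut off at the 4096-byte
    # boundary would be misjudged here.
    for encoding, label in (("ascii", "ASCII text"), ("utf-8", "UTF-8 Unicode text")):
        try:
            head.decode(encoding)
            return label
        except UnicodeDecodeError:
            continue
    return "data"

print(classify("."))
```

The order matters: the first test that succeeds wins, which is why a directory never reaches the magic tests at all.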

I could not find a human-readable version of the magic file on my computer. I only had a compiled version, which is harder to parse. But [here's somebody else's magic file](https://www.garykessler.net/library/magic.html).

`ff d8 ff e0 00 10 4a 46` are the first eight bytes of our JPEG, according to `hexdump -C`.

According to that document, the entry for JPEG looks like so:
```
0	beshort		0xffd8		JPEG image data
>6	string		JFIF		\b, JFIF standard
```

Our file starts with 0xffd8, so that is how `file` knows it's a JPEG, even if I were to mangle the extension. I could name it .png, .gif, .txt, .zip, or .exe, and `file` would still identify it correctly, because the magic number lives in the contents of the file, not in its name.

```
kim@kims-netbook ~/p/multitudes> cp tapir.jpg tapir.mov
kim@kims-netbook ~/p/multitudes> file tapir.mov
tapir.mov: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=8, description=SONY DSC                     , manufacturer=SONY           , model=DSLR-A100, software=DSLR-A100 v1.02, datetime=2007:10:02 13:05:59], baseline, precision 8, 1535x1150, frames 3
```
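
For the curious, here is roughly how the two-line JPEG entry from the magic file could be evaluated by hand in Python. This is a sketch of my reading of the entry, not `file`'s real code: `matches_jpeg` is a made-up name, and I've appended the bytes `49 46` ("IF") to the eight from the hexdump on the assumption that they come next, since `file` reported a JFIF header.

```python
import struct
from typing import Optional

# The eight bytes from hexdump, plus the two assumed to follow ("IF").
header = bytes.fromhex("ff d8 ff e0 00 10 4a 46 49 46")

def matches_jpeg(data: bytes) -> Optional[str]:
    """Apply the JPEG magic entry to a byte string (hypothetical helper)."""
    if len(data) < 2:
        return None
    # "0 beshort 0xffd8": a big-endian 16-bit integer at offset 0.
    (first_short,) = struct.unpack_from(">H", data, 0)
    if first_short != 0xFFD8:
        return None
    description = "JPEG image data"
    # ">6 string JFIF": a follow-up test, only tried once the first line matched.
    if data[6:10] == b"JFIF":
        # As I understand it, the \b in the entry suppresses the space
        # that would otherwise precede the comma.
        description += ", JFIF standard"
    return description

print(matches_jpeg(header))
```

Notice that the function never sees a file name at all. That is the whole trick behind `tapir.mov` still being recognized.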

In conclusion, I hope I've given a reasonable overview of how the system understands files at higher and lower levels. I get a strong feeling my later articles will think of them just as byte strings, but that's an assumption I can make because of the abstractions and tools that hide the relatively sloppy world of writing bits to magnetic platters or memory cells, and the relatively outside-our-present-concerns world of OS permissions.