# Week 6 - Text Files

So far we've focused on learning how to do things totally within Python. In the real world, though, we need to interact with data outside of our programs. One of the most direct ways to interact with outside data is through files. But files are tricky, both in how they are represented and in how Python accesses them.

Topics

- Plain Text
- Text Encoding (ASCII, UTF-8)
- Comma-Separated Values (CSV)
- JavaScript Object Notation (JSON)
- Reading files in Python 
- Writing files in Python


## Plain Text

In the world of file formats, the most basic and simple are [plain text files](https://en.wikipedia.org/wiki/Plain_text). These are files whose "data" or "file contents" are only [characters](https://en.wikipedia.org/wiki/Character_(computing)), that is, letters and numbers, not Alf or Bilbo Baggins.

![Alf gif](https://media2.giphy.com/media/apvx5lPCPsjN6/giphy.gif?cid=ecf05e47tufp144mcn9wr4xplxcvkjxyzwm76e8igjhbehds&rid=giphy.gif&ct=g)

*Not the kind of character we are talking about*

To understand plain text, first we have to understand how computers represent human language as numbers.

## Text Encoding

Text encoding, or more technically *[character encoding](https://en.wikipedia.org/wiki/Character_encoding)*, is the way in which human-readable letters, numbers, and other things like emojis are mapped to numbers in the computer. Once these characters are mapped to numbers, they can be translated into binary (zeros and ones) and then stored, transmitted, and manipulated by the computer.

![Text to Binary](attachment:b0d79c04-b9c9-4bf7-a13d-837e93caae32.png)

We can see these mappings using Python!

In [None]:
# the ord function returns an integer number representing the unicode code point
ord('A')

In [None]:
# use the format function to convert the integer to its binary format
format(ord('A'),"08b")

In [None]:
# loop over every character in the string hello
for char in "Hello":
    # print the 8-bit binary representation of the character
    print(char, " - ", format(ord(char),"08b"))

In [None]:
# emojis have numbers too!
ord('👁')

So this raises the question: who determines which numbers correspond to which characters?

We (humans) do! With ENCODING STANDARDS!

![soylent green is people](https://media4.giphy.com/media/3oEjHMURe9Te9XQf3q/giphy.gif?cid=ecf05e47cplib43oelijp4w3gf0x1a282lc0msisyxkwp8b3&rid=giphy.gif&ct=g)



### ASCII

ASCII, or the American Standard Code for Information Interchange, was an early standardized character encoding for digital computers, first published in 1963. It was developed by a committee of the American Standards Association, the organization that later became the [American National Standards Institute](https://en.wikipedia.org/wiki/American_National_Standards_Institute) (ANSI).

This is the mapping of characters to numbers in ASCII: 
![ASCII table](https://www.asciitable.com/asciifull.gif)

If we look at the table, we can see that "A" is mapped to the "code point", i.e. the number, 65.

In [None]:
# what is the code point, the number, for the capital letter A
ord("A")

Neat! However, a big problem with ASCII is that it only allows for 7 bits' worth of numbers. How many is that? 128.

This means you can only represent a maximum of 128 different characters in ASCII. That is not enough for the richness of human expression.
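The 7-bit limit is easy to check in Python. The sketch below uses only built-ins: `2 ** 7` gives the number of available code points, and `str.encode("ascii")` refuses any character outside that range. The sample strings are just illustrations.

```python
# 7 bits can hold 2**7 distinct values: the code points 0 through 127
ascii_size = 2 ** 7
print(ascii_size)

# plain English letters fit in ASCII, one byte per character
hello_bytes = "Hello".encode("ascii")
print(list(hello_bytes))

# an accented letter has no ASCII code point, so encoding fails
try:
    "caf\u00e9".encode("ascii")  # "\u00e9" is the letter e with an acute accent
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)
```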

### Unicode

To address the limitations of ASCII, the Unicode standard was created to allow for a much larger mapping of characters to numbers.

The Unicode standard is maintained by a non-profit organization in California called the [Unicode Consortium](https://en.wikipedia.org/wiki/Unicode_Consortium), whose statement of purpose is:

> "To develop, extend and promote use of various standards, data, and open source software libraries which specify the representation of text in modern software[,] ... allowing data to be shared across multiple platforms, languages and countries without corruption"

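Unicode is backward compatible with ASCII: its first 128 code points are identical. As of version 14.0, the Unicode standard defines 144,697 characters covering 159 modern and historic scripts. UTF-8 is the most widely used encoding for storing those code points as bytes.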
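UTF-8 stores each Unicode code point as one to four bytes, and ASCII characters keep their single-byte values. A minimal sketch with four sample characters (chosen here only for illustration) shows the byte count growing as the code point gets larger:

```python
# sample characters written as escapes: A, e-acute, the euro sign, and the eye emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F441"]

byte_lengths = {}
for char in samples:
    # encode() turns the text into the raw bytes that would be stored in a file
    encoded = char.encode("utf-8")
    byte_lengths[char] = len(encoded)
    print(char, ord(char), list(encoded))

# decoding the bytes with the same encoding gives the original text back
round_trip = "A\u20ac".encode("utf-8").decode("utf-8")
print(round_trip)
```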

![image.png](attachment:5276a291-59e7-4022-8e31-87ef284e9d51.png)

New languages and characters, including new emoji, are constantly being added via a [formal proposal process](https://unicode.org/emoji/proposals.html). 

![XKCD Emoji Proposal](attachment:0608d052-b221-4ca1-bd50-a877bcce8ec6.png)

There have even been proposals to include [Klingon in Unicode](https://www.unicode.org/L2/L2020/20181-klingon.pdf). It was rejected.

![mad klingon](https://media2.giphy.com/media/2VWQ20reNgmgU/giphy.gif?cid=ecf05e47zc8nohqd8sslfvlk0wf5i3h0w2pvy06m6zdhex9m&rid=giphy.gif&ct=g)

### Plain Text vs. Formatted Text

Plain text has no formatting beyond the characters themselves. This means there is no **bold** or *italics* in the text itself.

When you write in a word processor like Microsoft Word or Google Docs, the text you type is *rendered* as fancier formatted text based on additional, hidden structure or metadata.

If you open an MS Word file as plain text, this is (part of) what you see:
```
PK     ! Ô¨Ç¬ß‚ÄúlZ      [Content_Types].xml ¬¢(‚Ä†                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ¬•√Æ√Än¬¨0EÀú√ØÀô√´‚àëUb√ã¬¢‚Ñ¢*√£>√±-R√à{VÀù√≠¬´¬∫Àõ√¶QU√´
l"%3ÀúÔ¨Å3V‚àÜ√â‚Äî‚ÅÑ√∂l	¬µw%√é=√±√Ö√¨^i7+≈∏‚óä‚Ä∞-d&¬∑√Æ0Ô¨ÅA‚Ä¶6√Ñl4¬∫Œ©L60#¬µ‚àö√≠√ïS
O√∫¬£√∫√âX¬Ø √©*√Ø√®V$z√ß3√ë¬∏3‚Ä°ÀúŒ©Ô¨Å√≥Ô¨Å%p)O¬µ^‚Ä†√¨‚â§‚óä5}nH"d≈∏s‚ÄùXg√ØL√ë`¬•√¢√ç|√à‚Äò√º√Æ|√≥P√™r‚Ç¨√âsÔ£ø√©?√≤PW√©√ètt4Q+¬ª‚àÜ"¬∂wa¬©√£√ò|T\yœÄ‚àû¬ß,N‚Ç¨‚Ä°√ôU‚Ä¢%¬•Àô‚ÅÑ-D/√´≈í‚Äπ√∂¬¢‚â†X¬∞‚Ä∫√ªÀá(¬∂√ß¬∫<E‚Äû‚Ç¨)√´‚Ä° ;√ÅN√ëL?√òF√í√Ä¬∫¬ß¬¢‚Äπ√¢√≤‚àè<Fk‚Ä∫	√´h¬∞yÀÜ≈ì√ä√ø‚ÅÑ√∫√§¬ß≈íq√ôi¬£‚Äû?‚àÜÔ¨Å√òl‚â†≈íi‚Ä° 1√à‚Äù]√µH√∑g≈ìƒ±m‚Ä†@¬ª√ä‚Ç¨Àöm¬Ø  ÀáÀá PK     ! √´‚àë√î   N   _rels/.rels ¬¢(‚Ä†                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ¬®√≠¬°j‚àö0@√î√âÀù√â‚ÄîŒ©Q‚ÅÑ¬°¬£N/c‚Äì‚Ç¨≈∏[IL‚Ç¨√øj‚óäÀõÀù<√ø√ø]√àaG√Ä‚Äú‚Äù√¨‚Äìzs√∫Fu‚Ä°√Æ]Ô£ø√±U
√§Œ©	√∑Àò^‚àö[Àö¬∫x √Ø√ñ¬∫‚Ä¢1x√∑p‚Äö√µ√äÀÜfÀù¬†#I)¬†√â√£Y√§≈ì√´¬Ø√†√≤√ï¬ø√Ç*DÀÜ√Ç√üi")≈ì‚Äòc$‚â•¬£√ªqU‚óäÀú√≤~3‚Ä†√¥1‚Äô√∑jH[{‚Ñ¢=E√¶√ú‚à´≈í~
f?¬±√≥3-√™√®¬¨Ô¨Å‚â§]∆íT√ç√¨‚àè2√ßj)ƒ±,l0/%√∫√´b¬®
Ô£ø¬∫‚Äî√çz¬£√∏√ü‚âà√¢√ñ,	¬∞	√¢/Àö|f\ZÀõ√Å√§√ä?6√î!Y¬•_¬∑o√∫]A√õ  ÀáÀá PK     ! √∑d‚â•Q√ô   1   word/_rels/document.xml.rels ¬¢(‚Ä†                                                                                                                                                                                                                                                                 ¬®√≠√Äj‚àö0EÀú√ñÀõ√â√≤}-;}PB‚Ä∞lJ!‚Ç¨√∑Àù E?¬Æ,	√ï√ô¬∑√∏√òHI√é‚Äì`‚à´Ô£ør√Ü√≤s≈ì√Ñ6‚Ç¨≈ì¬°√§w√•‚Äò{√ü‚Ä†¬ªr√ã√•√ò{‚óä*x¬©√ò√ìAkWk√é*√´`[^^l√ª‚ÄìjNK‚Äòƒ±√ÖD¬¢8R‚Äì1√°¬µ√Æd:4e>‚Ä†K/√ß√®√â√ä4‚àÜVm^u√£r√Ø√Åw2NP√ª0‚âà√ÜVwƒ±5√†j¬Ø‚àÇo√∂Ô¨Å‚Ä°√â7o:>S!?pÀá√•√É√à8JX[d√¨0KD√™√ÅEVK√§‚Äì√£c2√üP,‚Ñ¢¬ø¬£‚âà¬©¬øa√ª¬¥√∏]‚â§√ª‚Äù.Àõ‚àÇ‚àÜ√î‚àû√≤s‚àèY‚Äú¬∞√í√©+Œ©‚àë√®√º√ã(!O>zÀò  ÀáÀá PK     ! √Ñ¬∫	√î  
     word/document.xml¬ß√±‚Ä∫o‚Ç¨ ¬øÔ¨Ç'√å‚àû¬∏Ô¨Å`;√¢√µZM‚Ñ¢6U‚Ñ¢>T‚Ñ¢√±N{&¬´¬Æ‚àÜ  _Àö√éw¬Ø3√µ‚óä¬†q‚Ä¢‚àö¬°Àù8√©¬™‚àëwG√ª9{‚Ñ¢4Àò‚Äπƒ±G√ª√é‚Äì√∫√†√≤√Ç‚Ç¨œÄÀö√õmu5smp‚ÄûL‚Ä∞t√ì√ª¬Æv√îÔ¨Ç√∏‚Ä∫¬¢X√™√üœÄq √´√é√ã ‚Ä¶‚ÄπM√ß√´B√∂¬ß√Æc=‚Äö√•(¬∞EbFDp$√≠√ë√§B‚âà(Ô£ø|√ò√ãI%‚ÄôÀÜ[‚Äö|√®¬µ[¬∑¬ª¬±-V¬Ø ¬†8A$‚âà¬†‚Äìc√ÄÔ£ø/√úL‚Äî
√∂uA¬° √∫0Ô£ø¬™¬Æ√í‚âà¬ÆY¬¥:‚Ä†‚Ä¶ X‚Äô!M√°√´Àõs‚àèp)√ã√≠√Ü√°√´‚àÜ]‚Äúl¬©N¬∫‚Ä°B‚Äú&¬∞860T[∆í¬±zÔ¨Ç‚Ä¶+ Kl√ø√úe√É√∫√Ñ√à√ñ5‚â•¬∏}√ÑE‚Ä†‚Äô¬Ø8√¶√≤p√ß‚àè√†i6√©k√§√≤¬™;√ØG√ØÀõU¬£oM√®JÀù√çSk¬Æ>√Å/U¬¥‚ÄöP√∫)√∂√Ö/D√ÜS&√µ√ÅCi0√¥√∑√™Àùg√°√ø√õ¬®^w√™~≈ìtÀò¬Æ<=√±√Ül√Ö}√É√ò¬∏≈ì‚â•‚Äú√ö≈ì√¢√¶‚óä‚ÄûF,¬¢‚Äî√ãc¬¨Ô¨Ç{√∑√±p√†¬¨v‚ÄûA√Ü9s√ÜÔ¨Ç‚â•√Ñ‚Äò√Ñ‚Ä†	√åYÔ£øk‚àÜ¬®b ‚Äúf¬Æ√Ç‚àû√ª¬©
```
*Is this Klingon?*

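The style information in MS Word documents is encoded in Word's own file format (the `PK` at the start of the dump marks a compressed ZIP container). Files like this are very difficult to read without specialized software and are not friendly from a digital preservation perspective.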
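The `PK` that keeps appearing in the dump above is the signature of a ZIP archive: a modern `.docx` file is a ZIP container holding XML parts such as `word/document.xml`. The sketch below does not open a real Word file. It builds a tiny stand-in archive in memory (the XML content is made up) to show the same signature and how Python's `zipfile` module lists what is inside.

```python
import io
import zipfile

# build a small ZIP archive in memory as a stand-in for a .docx file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    # made-up content, for illustration only
    archive.writestr("word/document.xml", "<document>Hello</document>")

raw_bytes = buffer.getvalue()
# the first two bytes are the same "PK" seen in the Word dump
signature = raw_bytes[:2]
print(signature)

# reopen the bytes as an archive and list the files it contains
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
    names = archive.namelist()
print(names)
```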

### A quick aside on Lightweight Markup Languages


If you have ever edited a [Wiki](https://en.wikipedia.org/wiki/Help:Wikitext), or written something in a Markdown cell, you have worked with plain text. But how does that text get *stylized*? How is **this text, which is plain text Markdown, bold if Markdown is plain text!?**

These [*lightweight markup languages*](https://en.wikipedia.org/wiki/Lightweight_markup_language) encode the style using special plain text syntax, which gets interpreted by a program and transformed into stylized text (usually HTML).
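As a toy illustration of that interpretation step, the sketch below uses regular expressions to turn two Markdown conventions into HTML tags. The function name `render_inline` is made up for this example, and real Markdown renderers handle far more cases than this.

```python
import re

def render_inline(text):
    """Convert **bold** and *italic* markers to HTML tags (toy example)."""
    # handle the double asterisks first so they are not mistaken for italics
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text

html = render_inline("Plain text can be **bold** or *italic*.")
print(html)
```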


![Wikitext to Formatted Text](attachment:88b2b57b-e451-456f-9fbf-60be098d2c73.png)

## Comma-separated values (CSV)

CSV or [Comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) is a delimited plain text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

A CSV file has the following structure, most of which is specified by [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180):
1. Encoded as plain text, so a CSV file can be easily opened with a basic text editor like TextEdit or Notepad
2. Each line of the file represents a *record*
3. Records are divided into a set of *fields* separated by a *delimiter* (a comma in RFC 4180; tab-delimited variants are also common)
4. Every record has the same sequence and number of fields
5. An optional first line, the header, provides a label for each field



This is what a CSV file looks like when you look at it as plain text:
```
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
```
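Python's built-in `csv` module understands these quoting rules. A minimal sketch that parses part of the example above from a string (`io.StringIO` stands in for an open file, and the variable names are illustrative):

```python
import csv
import io

csv_text = '''Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
'''

# DictReader uses the header line as the keys for every record
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for record in records:
    # every value arrives as a string, so convert numbers explicitly
    print(record["Make"], record["Model"], float(record["Price"]))
```

Note that the comma inside `"ac, abs, moon"` stays part of a single field, and the doubled quotes collapse to one pair.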

Many people associate CSV files with Microsoft Excel because you typically open a CSV with Excel. However, while they might have the same data, they are not the same kind of file! `data.csv` != `data.xls` because Excel, like MS Word above, uses a proprietary format. CSV is kind of like a lightweight markup language, but for data!

## JavaScript Object Notation (JSON)

JSON or [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON) is a *[semi-structured data format](https://en.wikipedia.org/wiki/Semi-structured_data)* for structuring and storing non-tabular data in a machine readable format.

![json slide](json-slides/Slide2.png)
![json slide](json-slides/Slide3.png)
![json slide](json-slides/Slide4.png)
![json slide](json-slides/Slide5.png)
![json slide](json-slides/Slide6.png)
![json slide](json-slides/Slide7.png)
![json slide](json-slides/Slide8.png)
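Python's built-in `json` module converts between JSON text and Python dictionaries and lists. A minimal sketch, with a made-up record for illustration:

```python
import json

# JSON text: objects use curly braces, arrays use square brackets
json_text = '{"name": "Bilbo Baggins", "age": 111, "friends": ["Gandalf", "Thorin"]}'

# json.loads parses a string into Python dicts, lists, strings, and numbers
person = json.loads(json_text)
print(person["name"], person["age"])
print(person["friends"][0])

# json.dumps goes the other way, turning Python data back into JSON text
round_trip = json.dumps(person, indent=2)
print(round_trip)
```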

## Reading Files in Python

Text files, data files, and other kinds of files are stored "on disk." To work with these files we need to connect our Python process to the file and then read the contents into active memory. Once these files have been *read* then we can do stuff with them.

Python has a built-in function called [`open()`](https://docs.python.org/3/library/functions.html#open) for establishing a connection to files on disk and performing various operations on those files, namely reading their contents into memory.

In [None]:
# open the file and keep the resulting file handler in a variable
file_handler = open("files/test.txt")
file_handler

File handlers are pointers to a file, but they don't actually represent the contents of the file. To read a file into memory all at once, we use the `read()` method to slurp all of the text data into a variable.

In [None]:
# read the file contents into a variable called file_contents
file_contents = file_handler.read()
# display the contents of the variable file_contents
file_contents

One tricky thing about reading files in Python: you can only read a file in once each time you open it.

In [None]:
# try to read the file contents a second time
file_contents = file_handler.read()
# display the contents of the variable file_contents
file_contents

Empty! Where did it go!?

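The file handler remembers its position in the file, and after `read()` that position is at the very end. To read the contents again, close the file and re-open it (or rewind with `file_handler.seek(0)`).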

In [None]:
# close the file
file_handler.close()

Closing the connection to the file is good Python hygiene, but only close the connection when you are done working with the file.

In [None]:
# open a connection to the test file
file_handler = open("files/test.txt")

# read the file contents into a variable called file_contents
file_contents = file_handler.read()

# close the file
file_handler.close()

# display the contents of the variable file_contents
file_contents

Once you read a file into memory you can close it because all of the contents have been copied into Python.
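Two related tools are worth knowing. They are sketched below under the assumption that a small sample file can be written to a temporary directory (the file name `sample.txt` is made up). The `seek(0)` method moves the read position back to the start, so a file can be re-read without reopening it, and a `with` block closes the file automatically when the block ends.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical sample file

    # create the sample file so the example is self-contained
    with open(path, mode="w") as file_handler:
        file_handler.write("line one\nline two\n")

    # the with block closes the file for us, even if an error occurs
    with open(path) as file_handler:
        first_read = file_handler.read()
        second_read = file_handler.read()  # position is at the end: empty string
        file_handler.seek(0)               # rewind to the beginning
        third_read = file_handler.read()

    print(repr(second_read))
    print(file_handler.closed)
```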

### Looping over files

Sometimes the file you are working with is too big to read into memory all at once. Big Data. In this case, you want to loop over every line of the file and process each line.

We can use a `for` loop with the file handler and Python will automatically loop until it reaches the end of the file.

In [None]:
# establish a connection to the file
file_handler = open("files/test.txt")

# loop over each line in the file
# setting the variable line to the contents of the line
for line in file_handler:
    # print the current line (each line still ends with its own newline character)
    print(line)

# close the connection to the file
file_handler.close()


One final note: when you open files and/or loop over them line by line, Python will represent the contents as strings. If your text file contains numbers (like a CSV file), you will have to do the conversion yourself.
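A minimal sketch of that conversion, assuming a file with one number per line (the file name `prices.txt` and its contents are made up, and the file is written to a temporary directory so the example is self-contained):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "prices.txt")  # hypothetical sample file

    # write one number per line so there is something to read
    with open(path, mode="w") as file_handler:
        file_handler.write("3000.00\n4900.00\n5000.00\n")

    total = 0.0
    with open(path) as file_handler:
        for line in file_handler:
            # strip() removes the trailing newline; float() converts the text
            total += float(line.strip())

print(total)
```

Without the `float()` call, adding the lines together would concatenate the strings instead of summing the numbers.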

## Writing files in Python

The `open()` function does more than just read files. You can "open" a connection to a file that does not yet exist and then write text to that new file.

Let's look at the documentation for the `open()` function using the `help()` function.

In [None]:
# use the help function to display the documentation for the open function
help(open)

Keep an eye on the Jupyter file browser when you run the next cell.

In [None]:
# open a connection to a new file that does not yet exist
# set the mode to 'w' for write
fileHandle = open('lectureFile.txt', mode='w')

A wild FILE appeared!

In [None]:
# write some text to our file
fileHandle.write("This is a test file write")

In [None]:
# try reading the file
fileHandle.read()

The file was not open for "reading"! Let's close it and reopen it in read mode.

In [None]:
# close the file, which flushes any buffered text to disk
fileHandle.close()

In [None]:
# open the file in read mode
fileHandle = open('lectureFile.txt', mode='r')
# read file into memory
contents = fileHandle.read()
# hygiene!
fileHandle.close()

contents

Now we can see the contents of the file. 

But what happens if we try to add some new text?

In [None]:
# write new text to the file
fileHandle = open('lectureFile.txt', mode='w')
fileHandle.write("Careful with your modes!! the W is DESTRUCTIVE!!\n")
fileHandle.close()

# read the file into memory
fileHandle = open('lectureFile.txt', mode='r')
contents = fileHandle.read()
fileHandle.close()

# display contents
contents

In [None]:
# open the file in append mode and write some text
fileHandle = open('lectureFile.txt', mode='a')
fileHandle.write("Using mode 'a' is an append, it will add to the end of the file\n")
fileHandle.close()

# read the file into memory
fileHandle = open('lectureFile.txt', mode='r')
contents = fileHandle.read()
fileHandle.close()

# print the contents so the newlines show up as line breaks
print(contents)


If we run the code cell above a few times, we will keep adding new text to the file.
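The difference between the two modes can be checked in one place. This is a minimal sketch that works in a temporary directory (the file name `modes.txt` is made up) and uses `with` blocks so each file is closed automatically:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "modes.txt")  # hypothetical sample file

    # 'w' creates the file, or empties it first if it already exists
    with open(path, mode="w") as file_handle:
        file_handle.write("first\n")
    with open(path, mode="w") as file_handle:
        file_handle.write("second\n")

    # 'a' keeps the existing text and writes after it
    with open(path, mode="a") as file_handle:
        file_handle.write("third\n")

    with open(path, mode="r") as file_handle:
        contents = file_handle.read()

print(contents)
```

The line written by the first `'w'` call is gone, while the appended line sits after the text from the second call.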