<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

**Intermediate Python II**

**Description:** This notebook describes how to:
* Read and write files (.txt, .csv, .json)
* Use the tdm_client to read in metadata
* Use the tdm_client to read in data

This is part 2 of 3 in the series Intermediate Python that will prepare you to do text analysis using the Python programming language.

**Difficulty:** Intermediate

**Knowledge Required:** 
* [Getting Started with Jupyter Notebooks](../../0-beginner/getting-started-with-jupyter.ipynb)
* [Python Basics I](../../0-beginner/python-basic-skills/python-basics-1.ipynb)
* [Python Basics II](../../0-beginner/python-basic-skills/python-basics-2.ipynb)
* [Python Basics III](../../0-beginner/python-basic-skills/python-basics-3.ipynb)

**Knowledge Recommended:** None

**Completion Time:** 75 minutes

**Data Format:** None

**Libraries Used:** 
___

# Working with Files in Python
Working with files is an essential part of Python programming. When we execute code in Python, we manipulate data through the use of variables. When the program is closed, however, any data stored in those variables is erased. To save the information stored in variables, we must learn how to write it to a file.

At the same time, we may have notebooks for applying specific analyses, but we need a way to bring data into the notebook for analysis. Otherwise, we would have to type all the data into the program ourselves! Both reading data in from files and writing data out to files are important skills for data science and the digital humanities.

This section describes how to work with three kinds of common data files in Python:
* Plain Text Files (.txt)
* Comma-Separated Value files (.csv)
* JavaScript Object Notation files (.json)

Each of these file types is in wide use in data science, digital humanities, and general programming.

# Three Common Data File Types

## Plain Text Files (.txt)
A plain text file is one of the simplest kinds of computer files. Plain text files can be opened with a text editor like Notepad (Windows 10) or TextEdit (OS X). The file can contain only basic textual characters such as letters, numbers, spaces, and line breaks. Plain text files do not contain styling such as heading sizes, italic, bold, or specialized fonts. (To include styling in a text file, writers may use other file formats such as rich text format (.rtf) or markdown (.md).)

Plain text files (.txt) can be easily viewed and modified by humans by changing the text within. This is an important distinction from binary files such as images (.jpg), archives (.zip), audio (.wav), or video (.mp4). If a binary file is opened with a text editor, the content will be largely unreadable.

## Comma-Separated Value Files (.csv)
A comma-separated value file is also a text file that can easily be modified with a text editor. A CSV file is generally used to store data that fits in a series or table (like a list or spreadsheet). A spreadsheet application (like Microsoft Excel or Google Sheets) will allow you to view and edit CSV data in the form of a table.

Each row of a CSV file represents a single row of a table. The values in a CSV are separated by commas (with no space between), but other delimiters can be chosen such as a tab or pipe (|). A tab-separated value file is called a TSV file (.tsv). Using tabs or pipes may be preferable if the data being stored contains commas (since this could make it confusing whether a comma is part of a single entry or a delimiter between entries).
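
To make the delimiter question concrete, here is a minimal sketch using Python's built-in `csv` module. The sample rows and variable names are invented for illustration. It parses the same record twice: once comma-delimited, where the entry that contains a comma is wrapped in double quotes (a common CSV convention), and once pipe-delimited, where no quoting is needed.

```python
import csv
import io

# Hypothetical sample data: the first entry in the second row contains a comma
comma_text = 'Name,Title\n"Booker, Rachel",Editor\n'
pipe_text = 'Name|Title\nBooker, Rachel|Editor\n'

# csv.reader accepts any file-like object; io.StringIO wraps a string as one
comma_rows = list(csv.reader(io.StringIO(comma_text)))
pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter='|'))

# Both approaches recover the same two rows
print(comma_rows)
print(pipe_rows)
```

Either way, the entry `Booker, Rachel` is read back as a single value rather than being split in two.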

### The text contents of a sample CSV file
```
Username,Login email,Identifier,First name,Last name
booker12,rachel@example.com,9012,Rachel,Booker
grey07,,2070,Laura,Grey
johnson81,,4081,Craig,Johnson
jenkins46,mary@example.com,9346,Mary,Jenkins
smith79,jamie@example.com,5079,Jamie,Smith
```
### The same CSV file represented in Google Sheets:

![CSV table view in Google Sheets](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/csv_in_sheets.png)

## JavaScript Object Notation (.json)
A JavaScript Object Notation file is also a text file that can be modified with a text editor. A JSON file stores data in key/value pairs, very similar to a Python dictionary. One of the key benefits of JSON is its compactness, which makes it ideal for exchanging data between web browsers and servers.

While smaller JSON files can be opened in a text editor, larger files can be difficult to read. Viewing and editing JSON is easier in specialized editors, available online at sites like: 

* [JSON Formatter](http://jsonformatter.org)
* [JSON Editor Online](https://jsoneditoronline.org/)

A JSON file has a nested structure, where smaller concepts are grouped under larger ones. Like extensible markup language (.xml), a JSON file can be checked to determine that it is well-formed (follows the proper syntax for JSON) and/or valid (follows the structure defined in a specialized description, called a schema).

### The text contents of a sample JSON file

```
{
    "firstName": "Julia",
    "lastName": "Smith",
    "gender": "woman",
    "age": 57,
    "address": {
        "streetAddress": "11434",
        "city": "Detroit",
        "state": "Mi",
        "postalCode": "48202"
    },
    "phoneNumbers": [
        { "type": "home", "number": "7383627627" }
    ]
}
```
### The same JSON file represented in JSON Editor Online
![An image of the JSON file showing the structure](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/json_editor.png)
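
Because JSON maps so closely onto Python dictionaries and lists, the nested structure can be explored with ordinary indexing. The following is a minimal sketch using Python's built-in `json` module on a shortened copy of the sample above. The variable names are chosen for illustration, and the last step shows one simple way to check whether a piece of text follows JSON syntax at all.

```python
import json

# A shortened copy of the sample JSON, stored as a Python string
json_text = '''
{
    "firstName": "Julia",
    "lastName": "Smith",
    "age": 57,
    "address": {"city": "Detroit", "state": "Mi"},
    "phoneNumbers": [{"type": "home", "number": "7383627627"}]
}
'''

# json.loads turns JSON text into Python dictionaries and lists
person = json.loads(json_text)

# Nested values are reached one level at a time
city = person['address']['city']
home_number = person['phoneNumbers'][0]['number']
print(city, home_number)

# Text that breaks JSON syntax (here, a trailing comma) raises an error
try:
    json.loads('{"firstName": "Julia",}')
    parses_ok = True
except json.JSONDecodeError:
    parses_ok = False
print(parses_ok)
```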


# Opening, Reading, and Writing Text Files (.txt)

## Open the File

Before we can read or write to a text file, we must open the file. Normally, when we open a file, a new window appears where we can see the contents. In Python, opening a file simply means creating a *file object* that refers to the particular file. When the file has been opened, we can read or write to the file. Finally, we must close the file. Here are the three steps:

1. Use the open() function to create a file object
2. Use the .read(), .readlines(), or .write() method on the file object
3. Use the .close() method to close the file object

Let's practice on `sample.txt`, a sample text file.

In [None]:
# Open the text file `sample.txt` creating
# a file object called sample_file

sample_file = open('sample.txt', 'r')

We have created a file object called `sample_file`. The first argument (`'sample.txt'`) is a string containing the file name. You can see `sample.txt` in the same directory as this lesson. If your file were called reports.txt, you would replace that argument with `'reports.txt'`. The second argument (`'r'`) determines that we are opening the file in "read" mode, where the file can be read but not modified. There are three main modes that can be specified:

|Argument|Mode Name|Description|
|---|---|---|
|'r'|read|Reads file but no writing allowed (protects file from modification)|
|'w'|write|Writes to file, overwriting any information in the file (saves over the current version of the file)|
|'a'|append|Appends (or adds) information to the end of the file (new information is added below old information)|
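
The difference between the three modes is easiest to see side by side. Here is a minimal sketch that runs all three against a throwaway file in a temporary folder, so none of the lesson files are touched. The file name `modes_demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

# Work in a temporary folder so no lesson files are changed
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file (or overwrites it if it already exists)
demo_file = open(path, 'w')
demo_file.write('first\n')
demo_file.close()

# 'a' adds to the end, keeping what was already there
demo_file = open(path, 'a')
demo_file.write('second\n')
demo_file.close()

# 'r' reads the file without changing it
demo_file = open(path, 'r')
after_append = demo_file.read()
demo_file.close()

# Opening in 'w' again erases the earlier contents
demo_file = open(path, 'w')
demo_file.write('replaced\n')
demo_file.close()

demo_file = open(path, 'r')
after_write = demo_file.read()
demo_file.close()

print(repr(after_append))
print(repr(after_write))
```

After the append step the file holds both lines; after the second write it holds only the replacement line.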

## Read the File

### .read() method
Now that we have a file object `sample_file` opened in "read mode," let's read the contents with the `.read()` method. We will create a variable called `file_contents` to hold the data that we are reading in.

In [None]:
# Create a variable called `file_contents`
# that will hold the result of using the
# .read() method on our file object
file_contents = sample_file.read()
print(file_contents)

When we are finished with the file, we must close it using the `.close()` method on the file object. Always close a file when you are done with it: an unclosed file keeps using system resources, and data written to it may not be saved to disk until it is closed.

In [None]:
# Close the file by using the .close() method
# on the file object
sample_file.close()

### .readlines() method

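If we want to work with a file one line at a time, we can use the `.readlines()` method instead of the `.read()` method. Note that `.readlines()` still reads the whole file into memory at once. For a very large file, looping over the file object itself (`for line in sample_file:`) reads a single line at a time and avoids filling all of the available computer memory.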

In [None]:
sample_file = open('sample.txt', 'r')
file_contents = sample_file.readlines()
print(file_contents)

With the `.read()` method, we read in the whole text file as a single string. The `.readlines()` method gives us a Python list, where each item is a single string representing a single line of the text document. You may also notice that each line ends with `\n`, which represents a line break in a string. If we print the string, the `\n` appears as a line break in our output.

In [None]:
# Print the first item in the file_contents list
# Note the \n turns into a line break
print(file_contents[0])
sample_file.close()
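
For files too large to hold in memory comfortably, a file object can also be looped over directly, which hands back one line per pass through the loop. The following is a minimal sketch of that approach. It builds its own small throwaway file (the name `lines_demo.txt` is invented for the example) so that it runs on its own.

```python
import os
import tempfile

# Build a small throwaway file to read from
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')  # hypothetical name
demo_file = open(path, 'w')
demo_file.write('line one\nline two\nline three\n')
demo_file.close()

# Loop over the file object: each pass reads only one line
demo_file = open(path, 'r')
line_lengths = []
for line in demo_file:
    clean_line = line.rstrip('\n')  # remove the trailing line break
    line_lengths.append(len(clean_line))
    print(clean_line)
demo_file.close()

print(line_lengths)
```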

## Write to the File
To write to a file, we need to open it and create our file object in either write ('w') or append ('a') mode. 

### Append mode
Let's start with append mode which adds new data to the bottom of the file while leaving any previous information intact.

In [None]:
# Opening a file in append mode
# and creating a new file object 
# called sample_file
sample_file = open('sample.txt', 'a')

Now we can use the `.write()` method to append a string to the file. 

In [None]:
# Appending an eleventh line to the file
sample_file.write('\nThis is the eleventh line')
sample_file.close()

Can you read the file back in to see whether the `.write()` was successful?

In [None]:
# Open the file in read mode
# create a file object called `sample_file`
sample_file = 

# Use the .read() method on the file object
# Store the result in a variable `file_contents`
file_contents = 

# Print the contents
print(file_contents)

# Close the file
sample_file.close()

### Write mode
Opening a file in write mode is useful in two scenarios:
* Creating a new text file and writing data to it
* Overwriting all data in the file with new data

Here is an example:

In [None]:
# Creating a new file in write mode
new_sample_file = open('new_sample.txt', 'w')

# Define a string variable to add to the new file
string = 'Here is some data\nWith a second line'

# Using write method on the file object
new_sample_file.write(string)

# Close file object
new_sample_file.close()

## Open/Close Files using `with open`
The `with open` technique is commonly used in Python because it has two significant advantages:
* It is more compact 
* It automatically closes the file afterward

The basic form resembles a flow control statement, ending in a colon and then executing an indented block of code. After the block executes, the file is closed automatically.

In [None]:
with open('sample.txt', 'r') as f:
    print(f.read())
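
The same pattern works for write and append modes. Here is a minimal sketch, again using a throwaway file in a temporary folder (the name `with_demo.txt` is made up for the example), that writes, appends, and reads back without a single explicit `.close()` call.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')  # hypothetical name

# Write mode: create the file and add a first line
with open(path, 'w') as f:
    f.write('Here is some data\n')

# Append mode: add a second line below the first
with open(path, 'a') as f:
    f.write('With a second line\n')

# Read mode: read the lines back into a list
with open(path, 'r') as f:
    lines = f.readlines()

print(lines)

# The file object reports that it was closed when the block ended
print(f.closed)
```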

# Opening, Reading, and Writing CSV Files (.csv)
CSV file data can be easily opened, read, and written using the `pandas` library, which is flexible for viewing and editing tabular data and makes it very easy to import and export CSV data. For large CSV files (>500 MB), you may wish to use the `csv` library to read in a single row at a time to reduce the memory footprint.

In [1]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv('sample.csv')

In [2]:
# Display the dataframe
print(df)

    Username         Login email  Identifier First name Last name
0   booker12  rachel@example.com        9012     Rachel    Booker
1     grey07                 NaN        2070      Laura      Grey
2  johnson81                 NaN        4081      Craig   Johnson
3  jenkins46    mary@example.com        9346       Mary   Jenkins
4    smith79   jamie@example.com        5079      Jamie     Smith


After you've made any necessary changes in Pandas, write the dataframe out to a CSV file. Here we write to a new file so the original is preserved. (Remember to always back up your data before writing over a file.)

In [3]:
# Write data to new file
# Keeping the Header but removing the index
df.to_csv('new_sample.csv', header=True, index=False)
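
For the large-file case mentioned at the start of this section, the built-in `csv` library can process one row at a time. The following is a minimal sketch rather than a complete recipe. It first writes a small throwaway CSV (the file name `rows_demo.csv` and its rows are a shortened, illustrative version of the sample data) and then reads it back row by row with `csv.DictReader`, collecting the usernames that have no login email.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'rows_demo.csv')  # hypothetical name

# Write a small CSV file; newline='' lets the csv module manage line endings
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Username', 'Login email', 'Identifier'])
    writer.writerow(['booker12', 'rachel@example.com', '9012'])
    writer.writerow(['grey07', '', '2070'])

# Read one row at a time; each row is a dictionary keyed by the header names
missing_email = []
with open(path, 'r', newline='') as f:
    for row in csv.DictReader(f):
        if row['Login email'] == '':
            missing_email.append(row['Username'])

print(missing_email)
```

Only the current row is held in memory during the loop, which is what keeps the footprint small for very large files.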