# Introduction to Input/Output

Input/Output is simply reading data from sources, such as `.txt` or `.csv` files, manipulating the data, and then often writing the data to a different persistent source (e.g., a file). 



## Files and the Current Working Directory

Hopefully, you already understand what a **file** is and how to find it locally on your computer. For our purposes, we will restrict ourselves to files that are saved on your computer (which should be running the Microsoft Windows operating system), not in a cloud-based system (including Microsoft's OneDrive). A file is a form of persistent storage for data. The file resides in a folder or directory on your hard drive. You can use the *File Explorer* application on your Windows machine to navigate the directory structure. 

So, where are your Jupyter notebook files? You should already know the answer to this question. If you accepted all the defaults when you installed the Anaconda distribution, then when you open Jupyter notebook the starting working directory should be `C:\Users\user_name`, where `user_name` is the id you use to log in to your computer. When you start Jupyter notebook, the `Files` tab should confirm this location.

Understanding the **current working directory** is important as we begin our exploration of reading in files. Your **current working directory** is the directory or folder where you are currently working. In essence, it is the folder/directory where the currently running Jupyter notebook file is. Let's find the current working directory.

In [None]:
# One approach to seeing your current working directory
# Issue the `cd` Windows command from inside a code cell
!cd

We can also use the `os` module (short for *operating system*) to find the current working directory. First, you must import the module.

In [None]:
# import the os module/package
import os

In [None]:
# get the current working directory
cwd = os.getcwd()
print(f'The current working directory is: {cwd}')

In [None]:
# you can see a list of files in the cwd
print(os.listdir())

In [None]:
# you can change the cwd
# (the raw string r'...' keeps the backslash as a literal character)
os.chdir(r'C:\data')
print(os.getcwd())

In [None]:
# Notice that I saved our first cwd in a variable of that name
# That lets me go back to it easily
os.chdir(cwd)
print(os.getcwd())

We used an **absolute reference** to change the working directory above. In many scenarios you will want to change directories or read files from directories that are **relative** to your current working directory. For example, we will try to consistently use the convention that any data files that we want to use for our Python program will be in a subdirectory called `data`. A **subdirectory** is simply a directory inside of another directory. First, let's find what subdirectories exist and then look at the files residing in the subdirectories.

In [None]:
# Loop through the listing of files and directories
for file in os.listdir():
    # Check to see if the item is a directory
    if os.path.isdir(file):
        print(f'Got a directory: {file}')
        # So let's see all the files in that directory
        print(os.listdir(file))
        print()

In [None]:
# We are really only interested in the data subfolder
# To use relative reference you can simply do this:
print(os.listdir('data'))

In [None]:
# However, it is sometimes better to be more explicit
# and state that you are starting from the cwd.
# You do this with the `.`
print(os.listdir('./data'))

In [None]:
# If you forget the `.`, the path starts at the root of the drive:
print(os.listdir('/data'))

In [None]:
# What happens if the directory does not exist?
print(os.listdir('./fun_stuff'))

In [None]:
# To move up a single directory, use `..`
print(os.listdir('..'))

In [None]:
# You can go as many times as needed
print(os.listdir('../..'))
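
The relative references used above can also be turned into absolute paths to see exactly where they point. The following is a minimal sketch using `os.path.abspath()` and `os.path.isdir()`; it assumes nothing about your folders and simply reports whether a `data` subdirectory exists in the current working directory.

```python
import os

cwd = os.getcwd()

# abspath() shows the absolute path that a relative reference resolves to
data_path = os.path.abspath('data')
dot_data_path = os.path.abspath('./data')
parent_path = os.path.abspath('..')

print(data_path)
print(dot_data_path)
print(parent_path)

# Check before listing so that a missing directory does not raise an error
if os.path.isdir(data_path):
    print(os.listdir(data_path))
else:
    print(f'No directory found at {data_path}')
```

Both `'data'` and `'./data'` resolve to the same location, which is why the two listings above gave the same result.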

<hr style="border:1px solid gray">

## File Types

You have undoubtedly worked with various file types, some of which you probably deal with all day, every day. Perhaps those are Microsoft Excel, Word, or PowerPoint files. In most instances, a particular software package saves files with its own **filename extension**. For example, newer versions of MS Excel files have the filename extension `.xlsx`, Word files have the `.docx` extension, and PowerPoint files use the `.pptx` extension. You have now also had the pleasure of adding `.ipynb` files to your often-used file types.

Generally, the filename extension tells your computer which application should be used to open the file. When you double-click on a file with the extension `.xlsx`, your computer will automatically open it in MS Excel. You may have tried double-clicking on a `.ipynb` file and noticed that it will **not** automatically open up Jupyter notebook for you. One reason is that Jupyter notebook is actually a small webserver that must be started before you can open the `.ipynb` through its interface. 

For this module we are going to explore several different file types. We'll start with some `.txt` files which are text files. Text files are also sometimes called "flat files" and have been around since personal (or micro) computers were introduced. In essence, they can handle one-dimensional and two-dimensional data (such as rows and columns) very easily, but not much beyond that. 

In [None]:
# Look at the files in the data subdirectory
os.listdir('./data')

We see that we have `.txt`, `.csv`, and `.xlsx` files. Is there a way to count how many of each file type we have? I'll give you the pseudocode and leave it to you to implement as a student exercise.

>*Create an empty dictionary to hold extension:count  
>For each file in the subdirectory data:  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Find its extension  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If the extension already exists in the dictionary  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add 1 to the count  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Else  
>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add entry to dictionary with a value of 1  
>Print out the dictionary*

<font color='red' size = '5'> Student Exercise </font>

Operationalize the pseudocode given above to determine how many of each file type exist in the subdirectory `data`.
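
As a hint for the "find its extension" step, one option is `os.path.splitext()`, which splits a filename into a root and an extension. The sketch below uses made-up filenames purely for illustration; they are not the files in your `data` folder.

```python
import os

# Hypothetical filenames, used only to show how splitext() behaves
names = ['notes.txt', 'table.csv', 'archive.tar.gz', 'README']

extensions = []
for name in names:
    # splitext() returns a (root, extension) tuple
    root, ext = os.path.splitext(name)
    extensions.append(ext)
    print(f'{name!r} -> root={root!r}, extension={ext!r}')
```

Notice that only the last extension is returned for a name with several dots, and a name without a dot gives an empty string.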

In [None]:
# Implement pseudocode
# Create empty dictionary

# Loop over each file in subdirectory data

    # Find the extension of the file
    
    
    # If the extension is in the dictionary increment count
    
    # If not, then add it to dictionary
    
        
# Print out resulting dictionary


<hr style="border:1px solid gray">

## Reading Text Files

We will start by reading some text (`.txt`) files. These files may contain **structured** data, but they may also contain **unstructured** data that we want to analyze. A structured text file either has a delimiter (e.g., a comma) between "columns" or uses a fixed width for each "column" in every "row" of data. We often think of this as **tabular** data where each column represents an attribute for the observation (the row). You will often find the tabular format in `.csv` files or `.xlsx` files. We will discuss the idea of tabular data in much more detail in a future module. 

For unstructured data, we might want to parse emails, HTML (Hyper Text Markup Language), or JSON (JavaScript Object Notation) files. Each of these file types does have a structure, but it is often not **tabular** like the structure we would see in `.csv` and `.xlsx` files. In practice, this means we may need to manually examine a representative sample of the files that we want to parse before we can write code that automates the task of "reading" them.

There are several ways to read the contents of a file into memory. The most basic is the built-in function `open()`.

- [`open()`](https://docs.python.org/3/library/functions.html#open) returns a [file object](https://docs.python.org/3/glossary.html#term-file-object), and is most commonly used with two positional arguments: `open(filename, mode)`.
- It is good practice to use the [`with`](https://docs.python.org/3/reference/compound_stmts.html#with) keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point. Using `with` is also much shorter than writing equivalent `try-finally` blocks.

The modes you can use for the second positional argument of the `open()` function include:

- `'r'`: open for reading (default)
- `'w'`: open for writing, truncating the file first
- `'x'`: open for exclusive creation, failing if the file already exists
- `'a'`: open for writing, appending to the end of the file if it exists
- `'b'`: binary mode
- `'t'`: text mode (default)
- `'+'`: open for updating (reading and writing)
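
To make the differences between these modes concrete, here is a minimal sketch that exercises `'x'`, `'a'`, `'w'`, and `'rb'` on a throwaway file. The temporary directory and the filename `modes_demo.txt` are assumptions made for this illustration, so nothing in your `data` folder is touched.

```python
import os
import tempfile

# Work in a temporary directory so no real files are affected
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

# 'x' creates the file because it does not exist yet
with open(path, 'x') as f:
    f.write('one\n')

# A second 'x' fails because the file now exists
try:
    with open(path, 'x') as f:
        f.write('never written\n')
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# 'a' keeps the existing contents and adds to the end
with open(path, 'a') as f:
    f.write('two\n')
with open(path, 'r') as f:
    after_append = f.read()

# 'w' truncates the file before writing
with open(path, 'w') as f:
    f.write('three\n')
with open(path, 'r') as f:
    after_write = f.read()

# Adding 'b' returns bytes instead of str
with open(path, 'rb') as f:
    raw = f.read()

print(exclusive_failed)
print(repr(after_append))
print(repr(after_write))
print(type(raw))
```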

Once the file is open, you can call `read()`, which will try to read the entire file. You can also read one line at a time with `readline()`. Additionally, you can use `readlines()` to read all the lines; it returns a `list` where each element is one line of the file. Let's try it.

In [None]:
# Use open() to get contents from doc1.txt file
first_file = open('./data/doc1.txt', 'r')
contents = first_file.read()
first_file.close()

print('contents of file:')
print('==================')
print(contents)

In [None]:
# Use the with statement to get contents from doc2.txt
with open('./data/doc2.txt', 'r') as second_file:
    contents2 = second_file.read()
    
print(f'Is file closed? {second_file.closed}')
print('contents of file:')
print('==================')
print(contents2)

In [None]:
# Open with open() and print one line at a time
third_file = open('./data/doc3.txt', 'r')
for line in third_file:
    print(line, end='')

# We did not explicitly close the file ... 
print(f'\n\nFile closed? {third_file.closed}')

In [None]:
# f.readline() reads a single line
# Using a while loop to get all the lines
with open('./data/doc1.txt', 'r') as f:
    while True:
        line = f.readline()
        # readline() returns an empty string at the end of the file
        if line == '':
            break
        print(line, end='')

In [None]:
# f.readlines() returns a list
with open('./data/doc1.txt', 'r') as f:
    all_stuff = f.readlines()
    
print(all_stuff)

### Best Practice: Use the `with` Statement

The preferred method for opening and reading the contents of a file is by using the `with` statement. There are two reasons for this. First, it closes the file for us, even if an error occurs. Second, it forces us to think in the **context** of working with the file while it is open. That is, whatever code we put inside the `with` statement is executed with the file open, and an open file ties up system resources, so the block should contain only the code that actually needs the file.
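
To see what `with` saves us, the sketch below reads the same file twice: once with an explicit `try`/`finally` and once with `with`. It writes its own small file in a temporary directory (an assumption made so the example is self-contained) rather than relying on the files in `data`.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# Manual version: the finally clause guarantees the file gets closed
manual_file = open(path, 'r')
try:
    manual_contents = manual_file.read()
finally:
    manual_file.close()

# Equivalent version: with closes the file when the block ends
with open(path, 'r') as with_file:
    with_contents = with_file.read()

print(f'Both files closed? {manual_file.closed and with_file.closed}')
print(f'Same contents? {manual_contents == with_contents}')
```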

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

You have been given the `states.csv` file that contains the following attributes for each US state and the District of Columbia: `State`, `Population`, `ElectoralVotes`, `HighwayMiles`, and `SquareMiles`. Each of these attributes is separated by a comma, hence the `.csv` filename extension. Complete the following tasks:

1. Read the data from the `states.csv` file into a variable called `states_data`, where each row is an element of a list.
2. How many rows are in the file?
3. You can use the method `strip()` on a string to remove the beginning and ending whitespace from it and the method `split()` to break the string into a list. Using those two methods, print out each line of the file as a list.
-----

In [None]:
# 1. Read the data from the `states.csv` file into a variable called 
# `states_data`, where each row is an element of a list.


In [None]:
# 2. How many rows are in the file?


In [None]:
# 3. You can use the method `strip()` on a string to 
# remove the beginning and ending whitespace from it 
# and the method `split()` to break the string into 
# a list. Using those two methods, print out each 
# line of the file as a list.


<hr style="border:1px solid gray">

## Manipulating Text Files

The way we read and manipulate the contents of flat files (e.g., `.txt` and `.csv`) depends heavily on their structure or lack of it. If we have structured and tabular data in a `.csv` file, many of the manipulation tasks are straightforward. (We actually have another useful package for handling those cases that we discuss in the latter part of this module.) When the data is unstructured, we are often entering the "text analysis" realm. This is a large field unto itself. 

We will revisit the examples of both structured, tabular data and unstructured contents that we encountered above. For the structured data we had the file `states.csv`. The unstructured data we will explore will be in the file `doc2.txt`. 

### Structured Text Data

Let's begin by looking at a `.csv` file, which is simply a text file that uses commas as the column/field delimiter. Recall that the `states.csv` file contains a header row and 51 rows of data. Each row has the name of the US state, its population, its number of presidential electoral votes, its number of highway miles, and its land mass area measured in square miles.

Our goal is to read in the contents so that we have a list of lists where each sublist has the state name (as type `str`), the population (as type `int`), the number of electoral votes (as type `int`), the number of highway miles (as type `float`), and the number of square miles (as type `float`). Then, using that two-dimensional list, we want to sum up all of the numerical elements.

We have several options to read the contents of the file. Using `.read()` gives us back one large string object. If we instead use `.readlines()`, a list is returned. We might as well try both approaches.

In [None]:
# Start with .read() for one large string object
# Open the file `states.csv` and read in contents
with open('./data/states.csv', 'r') as f:
    data = f.read()
    
print(type(data))
print(data)

In [None]:
# Now use splitlines to break the string
# into a list, one for each line
split_data = data.splitlines()
print(type(split_data))
print(split_data)

In [None]:
# Each element of split_data is a string, but
# we want separate elements with data types indicated
#
# Loop over the split_data list, splitting each
# string and converting data types
# Create empty list
fixed_data = []
for i in range(len(split_data)):
    # First element is header row
    if i == 0:
        fixed_data.append(split_data[i].split(','))
    # real data element
    else:
        state, pop, ev, hm, sq = split_data[i].split(',')
        fixed_data.append([state,int(pop),int(ev),float(hm),float(sq)])
        
print(fixed_data)

In [None]:
# Now need to sum up numerical elements
sums = []
for i in range(1, len(fixed_data[0])):
    count = 0
    for x in fixed_data[1:]:
        count += x[i]
    sums.append(count)
    
print(sums)

In [None]:
# Now let's try with readlines()
with open('./data/states.csv', 'r') as f:
    s_data = f.readlines()
    
print(type(s_data))
print(s_data)

In [None]:
# Notice that each element of the list is a string
# and it has a newline character at the end of it
# Loop over the list, stripping off newline and 
# splitting the string into a list and doing conversions

# New empty list
states_data = []
for i in range(len(s_data)):
    new_row = s_data[i].strip().split(',')
    # First row is header
    if i == 0:
        states_data.append(new_row)
    else:
        new_row[1] = int(new_row[1])
        new_row[2] = int(new_row[2])
        new_row[3] = float(new_row[3])
        new_row[4] = float(new_row[4])
        states_data.append(new_row)
        
print(states_data)

In [None]:
# Now sum them up
sums2 = []
for i in range(1, len(states_data[0])):
    count = 0
    for x in states_data[1:]:
        count += x[i]
    sums2.append(count)
    
print(sums2)
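
For comparison, the same read, convert, and sum steps can be sketched with the standard library's `csv` module, which is a different technique from the manual splitting used above. The two data rows below are made-up placeholder values (not real state statistics), and `io.StringIO` stands in for an open file so the example runs on its own.

```python
import csv
import io

# Placeholder rows with the same column layout as states.csv (values are made up)
sample = io.StringIO(
    'State,Population,ElectoralVotes,HighwayMiles,SquareMiles\n'
    'StateA,100,3,10.5,20.0\n'
    'StateB,200,4,30.5,40.0\n'
)

reader = csv.reader(sample)
header = next(reader)  # the first row holds the column names

# Convert each field to the data type we want
rows = [[s, int(p), int(e), float(h), float(q)] for s, p, e, h, q in reader]

# zip(*rows) turns the rows into columns; skip the first column (names)
columns = list(zip(*rows))
totals = [sum(col) for col in columns[1:]]

print(dict(zip(header[1:], totals)))
```

With a real file, the `io.StringIO` object would be replaced by the file object from `open()`; the Python documentation recommends passing `newline=''` to `open()` when using the `csv` module.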

### Unstructured Text Data

There are times when you encounter unstructured text data, but still want to glean some insights from it. This idea is broadly referred to as **text analysis**. You may also hear the concept called **natural language processing** (NLP). One specific application that uses text analysis and NLP is [**sentiment analysis**][1], which is often used in marketing contexts to better understand a firm's customers.

We will barely touch this large topic here, but we can at least examine a few rudimentary tasks. We will be using the file `doc2.txt`. 

[1]: https://en.wikipedia.org/wiki/Sentiment_analysis

In [None]:
# read in the file and print it out
with open('./data/doc2.txt', 'r') as f:
    doc = f.read()
    
print(doc)

In [None]:
# How many characters are in the file?
print(f'There are {len(doc)} characters in the file')

In [None]:
# How many words are in the file?
print(f'There are approximately {len(doc.lower().split())} words in the file')

In [None]:
# What is the most frequently occurring word?
# Create empty dictionary
word_counts = {}

# put in lower case and loop over after putting in list
for word in doc.lower().split():
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
        
print(word_counts)

In [None]:
# We can also use Counter
from collections import Counter
counts = Counter(doc.lower().split())
print(counts)

You will notice that these approaches are not perfect. The "word" `it?` is considered different than the word `it`. You would probably say that the one with the question mark should be counted along with the one without the punctuation mark. There are ways to handle this situation that we can explore later. 
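
One simple way to handle trailing punctuation is to strip it from each word before counting. The sketch below is only one possible approach, using `string.punctuation` on a made-up sentence rather than on `doc2.txt`.

```python
import string
from collections import Counter

# A made-up sentence used only for illustration
text = 'Is it ready? Yes, it is. It works!'

# Strip leading and trailing punctuation from each lower-cased word
words = [w.strip(string.punctuation) for w in text.lower().split()]

# Drop anything that was punctuation only
words = [w for w in words if w]

clean_counts = Counter(words)
print(clean_counts)
```

This only removes punctuation at the ends of a word, so a contraction such as `don't` keeps its apostrophe.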

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

You have been given a `.txt` file, named `nyTimes.txt`, that contains an article from the *New York Times*. Complete the following tasks:

1. Read the data from the `.txt` file into a variable called `article`.
2. Approximately how many characters are there in the file?
3. Approximately how many words are there in the file? 
4. Which **five** words occur the most frequently?

-----

In [None]:
# 1. Read the data from the `.txt` file into a variable called `article`.


In [None]:
# 2. Approximately how many characters are there in the file?


In [None]:
# 3. Approximately how many words are there in the file? 


In [None]:
# 4. Which five words occur the most frequently?


<hr style="border:1px solid gray">

## Reading JSON Files

JSON (**J**ava**S**cript **O**bject **N**otation) is a lightweight, text-based, language-independent data interchange format. While it is based on a subset of the JavaScript programming language, it is language agnostic and has its own [standard][1]. It gained popularity because it is easy for humans to read and write while at the same time being easy for machines to parse and generate.

Python supports JSON natively via the `json` module. When you transform your data into a series of bytes, allowing for storage or transmission across a network, you are **serializing** the data. **Deserialization** is the process of decoding data that has been stored or delivered in the JSON standard.

Many APIs (application programming interfaces) interchange data using the JSON standard. For our example, I have used the [Google books API][2] with the search term "python" to find Python-related books. By default, this API returns the first 10 entries. I saved the results in a file for us and named it `python_books.json`.

If you open this file, you will notice that it looks a lot like a dictionary. Python data types have a fairly intuitive conversion to JSON as shown in the table below.

| Python | JSON |
| :----------- | -----------: |
| `dict` | object |
| `list`, `tuple` | array |
| `str` | string |
| `int`, `float` | number |
| `True` | true |
| `False` | false |
| `None` | null |
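
The conversions in the table can be checked with a quick round trip through `json.dumps()` (Python object to JSON string) and `json.loads()` (JSON string back to Python object). The dictionary below is a made-up record used only for illustration.

```python
import json

# A made-up record that covers each row of the table
record = {
    'title': 'Example',
    'pages': 10,
    'price': 9.5,
    'tags': ('a', 'b'),
    'in_print': True,
    'subtitle': None,
}

# Serialize: Python object -> JSON text
text = json.dumps(record)
print(text)

# Deserialize: JSON text -> Python object
restored = json.loads(text)
print(restored)
```

Notice that the round trip is not perfectly symmetric: the tuple is written as a JSON array, so it comes back as a `list`.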

Let's try a few things with our `python_books.json` file. To deserialize the `.json` file, we can use `load()`, which expects a file object, or `loads()`, which expects a string (hence the `s` at the end of the function name).

[1]: https://www.rfc-editor.org/rfc/rfc8259
[2]: https://www.googleapis.com/books/v1/volumes?q={python}

In [None]:
# import the json module
import json

In [None]:
# Our JSON data is in a file, so we should use load()
with open('./data/python_books.json', 'r') as f:
    books = json.load(f)
    
print(type(books))
print(books)

Suppose we wanted to get the **titles** for those 10 books. We will have to examine the structure of the dictionary to help us understand how to extract the titles. How many entries are in the resulting `books` dictionary?

In [None]:
# How many elements
print(f'our books dictionary has {len(books)} elements')
# print the keys
print(f'They are {books.keys()}')

The "meat" of the contents is the element with the key `items`. Perhaps you noticed above that the value for that key is a `list`. That means we can easily iterate over the list. The question now is what are we looking for in the list? Well, each item in that list is a dictionary. What key are we looking for? The key `volumeInfo` appears to have the title in it. So, let's get the titles for the 10 books in our JSON file.

In [None]:
# loop over the items 
for i in books['items']:
    # pull out the title from volumeInfo
    print(i['volumeInfo']['title'])
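
Data returned by an API does not always contain every key for every entry, and indexing a missing key raises a `KeyError`. The sketch below shows a more defensive version of the loop using `dict.get()`. The `sample_books` dictionary is a hypothetical miniature that only mimics the `items` / `volumeInfo` / `title` nesting described above; it is not the contents of `python_books.json`.

```python
# Hypothetical miniature of the nested structure (not the real file contents)
sample_books = {
    'kind': 'example',
    'items': [
        {'volumeInfo': {'title': 'First Example Title'}},
        {'volumeInfo': {}},  # an entry with no title key
    ],
}

titles = []
for item in sample_books['items']:
    # get() returns the default instead of raising KeyError
    info = item.get('volumeInfo', {})
    titles.append(info.get('title', 'title not available'))

print(titles)
```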

<hr style="border:1px solid gray">

<font color='red' size = '5'> Student Exercise </font>

Using the `python_books.json` file, complete the following tasks:

1. Find and print the authors of the 10 books.
2. Find and print the ISBN numbers of the 10 books.

-----

In [None]:
# 1. Find and print the authors of the 10 books.


In [None]:
# 2. Find and print the ISBN numbers of the 10 books.


<hr style="border:1px solid gray">

## What About Writing Files?

So far, we have restricted ourselves to reading files. We can also write data to files. One thing to remember is that opening an existing file in `'w'` mode erases its current contents before anything is written.

There is a file named `another_file_bak.txt` that we will be using. Let's first open it and print out its contents.

In [None]:
# open another_file_bak.txt and print contents
with open('./data/another_file_bak.txt', 'r') as f:
    contents = f.read()
    
print(contents)

Let's now make a copy of that file and use the copy to try writing, etc. Here, we will use the module `shutil` to make a copy of the file.

In [None]:
import shutil

shutil.copyfile('./data/another_file_bak.txt',
                './data/my_copy.txt')

In [None]:
# check to make sure the file is there with same contents
with open('./data/my_copy.txt', 'r') as f:
    print(f.read())

We want to try to write some data into the file `my_copy.txt`. 

In [None]:
# create a silly string to write out
ss = 'Here are some words\nover multiple\nlines.'

In [None]:
# Write ss to my_copy.txt
with open('./data/my_copy.txt', 'w') as f:
    f.write(ss)

In [None]:
# check to see what is in file now
with open('./data/my_copy.txt', 'r') as f:
    print(f.read())

What about appending data to the file? We'll make a copy of the original file to `my_copy.txt` again and verify that it worked. Then we will try appending the string `ss` to the file.

In [None]:
# Copy the "bak" file to my_copy.txt again
shutil.copyfile('./data/another_file_bak.txt',
                './data/my_copy.txt')

# check to make sure the file is there with same contents
with open('./data/my_copy.txt', 'r') as f:
    print(f.read())

In [None]:
# Open the file in append mode and write to it
with open('./data/my_copy.txt', 'a') as f:
    f.write(ss)

In [None]:
# see what is in the file now
with open('./data/my_copy.txt', 'r') as f:
    print(f.read())
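
Writing is not limited to a single string. As a minimal sketch, the example below writes a small list of lists out as comma-separated lines and reads the result back. The rows, the temporary directory, and the filename `out.csv` are all assumptions made for this illustration.

```python
import os
import tempfile

# Made-up rows: a header followed by two records
rows = [['Name', 'Score'], ['alpha', 1], ['beta', 2]]

# Write to a temporary directory so no existing file is overwritten
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, 'out.csv')

with open(out_path, 'w') as f:
    for row in rows:
        # write() needs a string, so convert each item and join with commas
        f.write(','.join(str(item) for item in row) + '\n')

# Read the file back to confirm what was written
with open(out_path, 'r') as f:
    written = f.read()

print(written)
```

Note that `write()` does not add a newline for you, which is why `'\n'` is appended to each line.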

<hr style="border:1px solid gray">

### Additional Resources

The following links point you to additional resources that you might find helpful in learning this material.


1. The official API reference for [`io`][1].
2. [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text][2].
3. The [tutorial for reading and writing files][3].
4. Introducing [JSON][4].
5. The Python [documentation for the `json` module][5].

-----

[1]: https://docs.python.org/3/library/io.html
[2]: https://kunststube.net/encoding/
[3]: https://docs.python.org/3/tutorial/inputoutput.html#tut-files
[4]: https://www.json.org/json-en.html
[5]: https://docs.python.org/3/library/json.html


**&copy; 2022 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**