# Reading and Writing TSV Files 

In [1]:
def read_tsv_lists(filename):
    with open(filename) as fp:
        for line in fp:
            # strip newline at end of the line
            line = line.rstrip("\n")
            yield line.split("\t")

- The string method str.rstrip removes the specified characters from the end of a string. Here it strips the newline character '\n' that marks the end of each line. Every line ends with a newline, except possibly the last one; the newline is not really data, just a separator between lines.
- The string method str.split breaks up the original string and returns the pieces in a list. The original string is not actually changed.
- Sometimes functions like this are written to yield tuples instead, to discourage accidental changes. This takes very little extra code: just wrap the yielded value in tuple(). However, tuples make changes such as parsing numbers from strings less convenient, so it's probably better to leave each row as a list.
- All the values in the returned rows will be strings. You will eventually want to convert them to other types.
- This code is unlikely to raise errors if the file exists and you can read it, but inconsistent columns can produce unexpected results.
    - This code does not check that every row has the same number of columns. Usually, you should at least check that each row's length matches the header (first) row.
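
The column-count check and the string-to-number conversion mentioned above could be sketched roughly as follows. The function name read_tsv_checked and its converters parameter are hypothetical (they are not part of the original code), and the sketch assumes the first row is a header.

```python
def read_tsv_checked(filename, converters=None):
    """Yield rows as lists, checking each row's length against the header row.

    converters is a hypothetical optional list of callables, one per column
    (for example [str, int, float]), applied to data rows only.
    """
    with open(filename) as fp:
        expected = None
        for line_number, line in enumerate(fp, start=1):
            row = line.rstrip("\n").split("\t")
            if expected is None:
                # The first row is assumed to be the header; it sets the expected width.
                expected = len(row)
                yield row
                continue
            if len(row) != expected:
                raise ValueError(
                    f"{filename}, line {line_number}: "
                    f"expected {expected} columns, got {len(row)}"
                )
            if converters is not None:
                # Convert each string value with the callable for its column.
                row = [convert(value) for convert, value in zip(converters, row)]
            yield row
```

Raising ValueError is only one possible response; depending on the data, skipping or logging the bad row may be more appropriate.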

# Reading a TSV File into Dictionaries 

In [2]:
def read_tsv_dictionaries(filename):
    with open(filename) as fp:
        def parse_line(line):
            return line.rstrip("\n").split("\t")
 
        header = parse_line(next(fp))
        for line in fp:
            line = parse_line(line)
            yield dict(zip(header, line))

- The nested function parse_line is a convenience that avoids repeating the logic for stripping newlines and splitting on tabs. It does not access any variables from the surrounding scope. Because it is defined inside read_tsv_dictionaries, only code inside that function can call it (the with statement itself does not create a new scope).
- The header variable is a list populated by parsing the first line of the file.
- This function uses three built-in functions that we haven't mentioned before.
    - The built-in function next returns the next output from an iterator. In this case, it was the first line from the file.
    - The built-in function dict takes in an iterable of pairs (sequences of length two) and populates a dictionary. The first value of each pair becomes a key, and the second value of each pair is that key's value.
    - The built-in function zip takes two or more iterables, repeatedly takes a value from each, and yields a tuple of those values. The name is an analogy to zipping up a zipper - two sides are becoming paired together. In this usage, the column names in the header variable are being paired with the values in the line variable.
- This code will raise an exception when next is called if the file is empty. Because read_tsv_dictionaries is a generator, the exception appears only when the caller starts iterating, and the StopIteration raised by next is converted into a RuntimeError (Python 3.7 and later).
    - This means that the file does not even have a header row, and your assumption that it is a valid data file was wrong. If you catch this exception, you should probably report which file caused the error, and then raise an exception again, since your process is unlikely to work with an empty file.
- If rows after the header have different numbers of values, this code will continue without an exception, but you may be surprised later when the dictionaries are not what you expected.
    - If there are fewer values than expected, they are matched to the leading column names in the header, and the dictionary will be missing the later keys. You will get a KeyError when you try to access those keys.
    - If there are more values than expected, then they will be silently dropped by the zip function.
    - If you want to catch these cases, you can add an explicit length comparison (len(header) != len(line)) and code your own response, or add strict=True to the zip call (Python 3.10 and later), which makes it raise a ValueError if the lengths do not match.
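
As a rough sketch of the checks described above, the variant below reports an empty file and mismatched row lengths explicitly. The name read_tsv_dictionaries_strict and the choice of ValueError with a line number in the message are assumptions made for illustration, not part of the original code.

```python
def read_tsv_dictionaries_strict(filename):
    """Like read_tsv_dictionaries, but fail loudly on an empty file or ragged rows."""
    with open(filename) as fp:
        def parse_line(line):
            return line.rstrip("\n").split("\t")

        # Passing a default to next() returns None instead of raising
        # StopIteration when the file has no lines at all.
        first_line = next(fp, None)
        if first_line is None:
            raise ValueError(f"{filename} is empty: no header row found")

        header = parse_line(first_line)
        # Data rows start on line 2 because the header was line 1.
        for line_number, line in enumerate(fp, start=2):
            values = parse_line(line)
            if len(values) != len(header):
                raise ValueError(
                    f"{filename}, line {line_number}: "
                    f"expected {len(header)} values, got {len(values)}"
                )
            yield dict(zip(header, values))
```

Because this is still a generator, these errors surface when the caller iterates over the result, not when the function is called. On Python 3.10 or later, zip(header, values, strict=True) would also reject mismatched rows, but the explicit comparison makes it possible to include the line number in the message.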

# Writing a TSV File in Python

In [4]:
def write_tsv(filename, column_names, rows):
    # "w" mode: the file is opened for writing, and any existing content is truncated (deleted)
    with open(filename, "w") as fp:
        def write_line(row):
            fp.write("\t".join(str(v) for v in row) + "\n")
 
        write_line(column_names)
        for row in rows:
            write_line(row)

- This function assumes that str is adequate to convert all of the values to strings.
- The nested function write_line writes one line to the file with its input row converted to strings.
    - Unlike the previous nested function example, it uses the variable fp from the surrounding scope.
    - The file object fp's method .write takes in a string and writes it to the file. Unlike the print function, it does not automatically add a newline at the end, so the newline needs to be included explicitly.
- The string method str.join takes in an iterable of strings and joins them together, using the string it is called on as the separator.
    - "\t".join(["a", "b", "c"]) returns "a\tb\tc"
- This function could be modified to handle dictionary rows by changing the write_line(row) call to pass a list comprehension or generator expression that fetches the values in the right order.
    - If you make that modification, you should "freeze" column_names as a list so that you can iterate over it multiple times, even if the caller passed in a one-shot iterator.
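
One way the dictionary-row modification might look is sketched below. The function name write_tsv_dictionaries is hypothetical, and letting a missing key raise KeyError is an assumption about the desired behavior rather than something the original code specifies.

```python
def write_tsv_dictionaries(filename, column_names, rows):
    """Write an iterable of dictionaries as TSV, one column per name in column_names."""
    # Freeze column_names so it can be iterated once for the header and
    # once per row, even if a generator was passed in.
    column_names = list(column_names)
    with open(filename, "w") as fp:
        def write_line(row):
            fp.write("\t".join(str(v) for v in row) + "\n")

        write_line(column_names)
        for row in rows:
            # Fetch values in header order; a missing key raises KeyError here.
            write_line(row[name] for name in column_names)
```

Keys in a row that are not listed in column_names are ignored. Like write_tsv, this sketch does not escape tab or newline characters inside values, so such values would break the column layout.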
