# Problem Set 1.3: Reading and viewing data

[Click here to open this notebook in your browser](https://leifwalsh.github.io/data-analysis-problem-sets/lab/index.html?path=1-foundations/1.3-reading-and-viewing-data/1.3-reading-and-viewing-data.ipynb)

Start exploring some simple data, at first with just Python built-ins.

## Reading files

You can read files in the same directory as the notebook in a couple of ways:

- With the `open()` builtin - this gives you a file object to read from
- With `pathlib.Path()` - a little easier

Let's try both.

First, let's look at the files that are nearby. `os.listdir()` is like `ls` on
the command line: it **list**s what's in a **dir**ectory.

In [None]:
# Import the `os` module, which stands for Operating System

import os

# Call the `listdir()` function in the `os` module

os.listdir()

It's a list of strings! We can see the notebook we're in, and some other files.

Let's open and read the `simple.csv` file, first with the builtin `open()`:

In [None]:
# The "r" means you want to **read** the file, so Python won't let you edit it
#
# You can also open for writing, or both, and there's a thing about "encoding"
# we're going to ignore
simple_csv_file = open("simple.csv", "r")

# Now we have a "file object"
simple_csv_file

In [None]:
# Try tab-completing after the dot to see some things you can do with a file:
simple_csv_file.

In [None]:
# We can read the whole thing by calling `read()`:
text = simple_csv_file.read()

# Now we have a string with the contents. Note the `\n` thing: this means
# "newline"
text

In [None]:
# The file object has a sort of cursor - it remembers where you are in the
# text. This means you can read a file a little bit at a time, but after
# reading the whole thing, now we're at the end. If we call `read()` again, we
# get an empty string because there's nothing left to read after the end:
simple_csv_file.read()

In [None]:
# But we still have our text, so we're done with the file and can close it:
simple_csv_file.close()
text
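
Files should be closed when you're done with them, and a `with` block does that for you. Here's a minimal sketch of that pattern, along with `seek(0)`, which moves the cursor back to the start. It writes its own small, made-up file first (`example.csv` and its contents are placeholders, not the real `simple.csv`):

```python
import os
import tempfile

# Make a throwaway directory and a tiny, made-up CSV file to practice on.
# (`example.csv` and its contents are placeholders, not the real `simple.csv`.)
tmp_dir = tempfile.TemporaryDirectory()
example_name = os.path.join(tmp_dir.name, "example.csv")

# "w" opens the file for writing
with open(example_name, "w") as example_file:
    example_file.write("Column A,Column B\n1,x\n2,y\n")

# `with` closes the file automatically when the indented block ends
with open(example_name, "r") as example_file:
    first_read = example_file.read()   # reads everything; cursor is at the end
    second_read = example_file.read()  # nothing left, so this is ""
    example_file.seek(0)               # move the cursor back to the start
    third_read = example_file.read()   # the whole text again

print(f"{first_read = }")
print(f"{second_read = }")
print(f"{third_read == first_read = }")
print(f"{example_file.closed = }")

# Delete the throwaway directory and the file inside it
tmp_dir.cleanup()
```

After the `with` block ends, `example_file.closed` is `True` without anyone calling `close()`.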

Now let's use `pathlib` to read the file again, then we'll get on to parsing:

In [None]:
# Import the `Path` class from the `pathlib` module:

from pathlib import Path

# Make a Path object referring to `simple.csv`
path = Path("simple.csv")
path

In [None]:
# We can do many things with a Path too:
print(f"{path.absolute() = }")
print(f"{path.parent = }")
print(f"{path.stem = }")

In [None]:
# Try tab-completing here too:
path.

In [None]:
# But most importantly, we can read it with one method on Path:
text = path.read_text()
text

In [None]:
# When you just evaluate `text`, it includes those special characters like
# newlines, but if you `print()` a string, it prints them like normal:
print(text)
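
What Jupyter shows when you just evaluate a string is its `repr()`: the string written the way you'd type it in Python code, quotes and escapes included. A small sketch with a made-up string (`sample` is only an illustration, not the contents of `simple.csv`):

```python
sample = "a,b\n1,2\n"

# `repr()` gives the version with quotes and escapes, like evaluating `sample`
print(repr(sample))

# `print()` writes the characters themselves, so each `\n` is a line break
print(sample)

# A newline is a single character, even though we type it as two
print(len("\n"))
```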

## Basic CSV Parsing

Now let's try to turn this string into something we can work with. Later we'll
see that `pandas` can do this for us, and handle a bunch of weird formatting,
but for now we're on our own.

In [None]:
# You can split a string into lines with `splitlines()`
lines = text.splitlines()
lines
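
Why `splitlines()` and not `split("\n")`? A short sketch with a made-up string (not the real file contents): when the text ends with a newline, as most files do, `split("\n")` leaves an empty string at the end of the list, and `splitlines()` doesn't.

```python
sample = "Column A,Column B\n1,x\n2,y\n"

# Splitting on "\n" leaves an empty string after the final newline
print(sample.split("\n"))

# `splitlines()` gives just the lines
print(sample.splitlines())
```

That stray empty string would otherwise show up later as a row with no values in it.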

In [None]:
# Let's just look at the header row:
header = lines[0]
header

In [None]:
# We can split the header into columns with the `split()` method - you have to
# tell it to split on the comma.
column_names = header.split(",")
column_names

In [None]:
# We can split all the other rows too, in a loop:

data_lines = lines[1:]  # This is the same as lines[1:len(lines)] - everything
                        # but the first line
data_lines

In [None]:
for line in data_lines:
    print(line.split(","))

In [None]:
# Now we can build a dictionary, mapping column names to the values in that
# column.

# First, let's make an empty one:

table = {}  # Make an empty dictionary
for column in column_names:
    table[column] = []  # Insert an empty list

table

In [None]:
# There's a shorter way to write this, called a "dict comprehension":

table = {                    # Make a dictionary
    col: []                  # by setting `col` to an empty list
    for col in column_names  # for each `col` in `column_names`
}
table

In [None]:
# You can do it all on one line too:
table = {col: [] for col in column_names}
table

In [None]:
# Then we can append every row's values to each column.
for line in data_lines:
    values = line.split(",")
    print(f"{values = }")
    # `enumerate()` pairs each item with its index number, starting at 0:
    for index, value in enumerate(values):
        print(f"{index = }, {value = }")
        # We can use `index` to get the column name for this value:
        col = column_names[index]
        print(f"{col = }")
        table[col].append(value)

In [None]:
# And we're done!
table
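
A variation on that loop, if you'd rather not juggle index numbers: `zip()` pairs up items from two lists, so each value arrives together with its column name. This is only a sketch, and it uses made-up stand-ins (`example_columns`, `example_lines`, `example_table`) instead of the real file:

```python
example_columns = ["Column A", "Column B"]
example_lines = ["1,x", "2,y", "3,z"]

example_table = {col: [] for col in example_columns}

for line in example_lines:
    values = line.split(",")
    # `zip()` walks both lists in step: first name with first value, and so on
    for col, value in zip(example_columns, values):
        example_table[col].append(value)

print(example_table)
```

One difference to be aware of: if a row has more values than there are column names, `zip()` silently drops the extras, where looking up `column_names[index]` would raise an `IndexError`.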

In [None]:
# Now, if we want to look something up, we specify the column we want, then the
# row number:
table["Column B"][2]

In [None]:
# Note that everything is still just strings. You can't sum the first column:
sum(table["Column A"])
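
That raises a `TypeError`. `sum()` starts from the number 0 and adds each item to it, and Python won't add a number to a string. A sketch with made-up values, catching the error so we can look at it:

```python
values = ["1", "2", "3"]  # strings, like the ones that come out of `split()`

try:
    sum(values)
except TypeError as error:
    print(f"{error = }")

# Adding two strings together joins them instead of doing arithmetic
print("1" + "2")
```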

In [None]:
# We can convert those to ints though:

# This is a "list comprehension", like the dict comprehension earlier:
table["Column A"] = [int(x) for x in table["Column A"]]
table

In [None]:
# Now we can sum it:
sum(table["Column A"])

## CSV files can get complicated!

What if there are commas inside one of your values? Suppose you spent way too
much time finding funny ball python color combinations on
https://www.worldofballpythons.com, and the `Genetics` column is a list itself?

Let's see what's in `morphs.csv`. Can you read the file and print the whole
thing as a string?

In [None]:
path = Path("morphs.csv")
morph_text = ...  # Fill this in
print(morph_text)

Let's try to parse it like before:

In [None]:
lines = morph_text.splitlines()
lines

In [None]:
column_names = lines[0].split(",")
table = {col: [] for col in column_names}
table

In [None]:
data_lines = lines[1:]
data_lines

In [None]:
for line in data_lines:
    values = line.split(",")
    print(f"{values = }")
    for index, value in enumerate(values):
        print(f"{index = }, {value = }")
        col = column_names[index]
        print(f"{col = }")
        table[col].append(value)

Oh no. `split()` doesn't know that the quotes around the last column are
important.

Can you fix it?

In [None]:
for line in data_lines:
    print(f"{line = }")

Think about this, and try to parse the table here before continuing:

In [None]:
for line in data_lines:
    ...

You did try to parse that before continuing, right?

In [None]:
# We can tell `split()` we only want to split two times:
for line in data_lines:
    values = line.split(",", 2)
    print(f"{values = }")

And maybe that's ok for this table:

In [None]:
# Reset the table
table = {col: [] for col in column_names}

# The number of splits should be one fewer than the number of columns we have
num_splits = len(table) - 1
print(f"{num_splits = }")

# Then the same, mostly:
for line in data_lines:
    values = line.split(",", num_splits)
    print(f"{values = }")
    for index, value in enumerate(values):
        print(f"{index = }, {value = }")
        col = column_names[index]
        print(f"{col = }")
        table[col].append(value)

In [None]:
# Ta-da?
table

But that's not going to work if there are commas in the other columns, and the
quote characters are still stuck in the values.
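
Python's standard library has a `csv` module that already understands quoted values. Here's a minimal sketch of the same column-by-column parse using `csv.reader`. The sample text is made up to resemble a file with a quoted `Genetics` column; the other column names and all of the values are assumptions, not the real contents of `morphs.csv`.

```python
import csv
import io

# Made-up sample: these names and values are placeholders
sample_text = (
    "Name,Year,Genetics\n"
    'Example One,2001,"Gene A, Gene B"\n'
    '"Two, Example",2002,"Gene C"\n'
)

# `csv.reader` wants something file-like, so wrap the string in `io.StringIO`
reader = csv.reader(io.StringIO(sample_text))

sample_columns = next(reader)  # the first row is the header
sample_table = {col: [] for col in sample_columns}

for values in reader:  # each row is already split into a list of strings
    for index, value in enumerate(values):
        col = sample_columns[index]
        sample_table[col].append(value)

print(sample_table)
```

The quotes are removed and the commas inside them are kept, in any column, not just the last one. With a real file you'd pass an open file object instead of `io.StringIO`; the `csv` documentation recommends opening it with `newline=""`.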

## Okay, fine! We'll try pandas.

In [None]:
import pandas

pandas.read_csv("morphs.csv")

Much easier, right?

Reading all kinds of CSV files with complicated, messy formats is arguably what
made `pandas` popular in the first place. In the chapters that follow, we'll
explore more of what pandas is doing and how to use a table like this.

In [None]:
# By the way, `read_csv` has a lot of different options to let you control how
# it deals with strangely formatted data.

# In Jupyter, you can use the question mark after a function to look at its
# documentation, like `--help` or `man` on the command line:
pandas.read_csv?