## Lists

A list is a type of value in Python that represents a sequence of values. The list is a very common and versatile data structure in Python and is used frequently to represent (among other things) tabular data. Here's how you write one out in Python:

In [1]:
[5, 10, 15, 20, 25, 30]

[5, 10, 15, 20, 25, 30]

That is: a left square bracket, followed by a series of comma-separated expressions, followed by a right square bracket. The items you write in a list don't have to be literal values; they can be more complex expressions as well. Python will evaluate those expressions and put the resulting values in the list.

In [2]:
[5, 2*5, 3*5, 4*5, 5*5, 6*5]

[5, 10, 15, 20, 25, 30]

Lists can have an arbitrary number of values. Here's a list with only one value in it:

In [3]:
[5]

[5]

And here's a list with no values in it:

In [4]:
[]

[]

Here's what happens when we ask Python what type of value a list is:

In [5]:
type([1, 2, 3])

list

It's a value of type `list`.

Like any other kind of Python value, you can assign a list to a variable:

In [7]:
my_numbers = [5, 10, 15, 20, 25, 30]

### Getting values out of lists

Once we have a list, we might want to get values *out* of the list. You can write a Python expression that evaluates to a particular value in a list using square brackets to the right of your list, with a number representing which value you want, numbered from the beginning (the left-hand side) of the list. Here's an example:

In [8]:
[5, 10, 15, 20][2]

15

If we were to say this expression out loud, it might read, "I have a list of four things: 5, 10, 15, 20. Give me back the second item in the list." Python evaluates that expression to `15`, the second item in the list.

Here's what it looks like to use this indexing notation on a list stored in a variable:

In [9]:
my_numbers[2]

15

#### The second item? Am I seeing things? 15 is clearly the third item in the list.

You're right---good catch. But for reasons too complicated to go into here, Python (along with many other programming languages!) starts list indexes at 0, instead of 1. So what looks like the third element of the list to human eyes is actually the second element to Python. The first element of the list is accessed using index 0, like so:

In [11]:
[5, 10, 15, 20][0]

5

The way I like to conceptualize this is to think of list indexes not as specifying the number of the item you want, but instead specifying how "far away" from the beginning of the list to look for that value.

If you attempt to use a value for the index of a list that is beyond the end of the list (i.e., the value you use is higher than the last index in the list), Python gives you an error:

In [12]:
my_numbers[47]

IndexError: list index out of range
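If you're not sure whether an index is in range, you can check it against the length of the list first, or let Python raise the error and catch it. Here's a minimal sketch of both approaches. It uses `if` and `try`/`except`, which these notes haven't covered, and the choice of `None` as a fallback value is just an assumption for the sake of the example:

```python
my_numbers = [5, 10, 15, 20, 25, 30]

# Valid (non-negative) indexes run from 0 up to len(my_numbers) - 1.
last_valid_index = len(my_numbers) - 1
print(last_valid_index)  # 5

index = 47

# Approach 1: check the index before using it.
if 0 <= index <= last_valid_index:
    value = my_numbers[index]
else:
    value = None  # fallback chosen for this sketch only
print(value)  # None

# Approach 2: try it, and catch the IndexError if it happens.
try:
    my_numbers[index]
    error_raised = False
except IndexError:
    error_raised = True
print(error_raised)  # True
```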

Note that while the type of a list is `list`, the type of an expression using index brackets to get an item out of the list is the type of whatever was in the list to begin with. To illustrate:

In [15]:
type(my_numbers)

list

In [17]:
type(my_numbers[0])

int

#### Indexes can be expressions too

The thing that goes inside of the index brackets doesn't have to be a number that you've just typed in there. Any Python expression that evaluates to an integer can go in there.

In [20]:
my_numbers[2 * 2]

25

In [22]:
x = 3
[5, 10, 15, 20][x]

20

### Other operations on lists

Because lists are so central to Python programming, Python includes a number of built-in functions that allow us to write expressions that evaluate to interesting facts about lists. For example, try putting a list between the parentheses of the `len()` function. It will evaluate to the number of items in the list:

In [24]:
len(my_numbers)

6

In [26]:
len([20])

1

In [184]:
len([])

0

The `in` operator checks to see if the value on the left-hand side is in the list on the right-hand side.

In [185]:
3 in my_numbers

False

In [186]:
15 in my_numbers

True

The `max()` function will evaluate to the highest value in the list:

In [34]:
readings = [9, 8, 42, 3, -17, 2]
max(readings)

42

... and the `min()` function will evaluate to the lowest value in the list:

In [35]:
min(readings)

-17

The `sum()` function evaluates to the sum of all values in the list.

In [32]:
sum([2, 4, 6, 8, 80])

100

Finally, the `sorted()` function evaluates to a copy of the list, sorted from smallest value to largest value:

In [36]:
sorted(readings)

[-17, 2, 3, 8, 9, 42]
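The word "copy" matters here: `sorted()` leaves the original list alone. A quick sketch to check that, using the same `readings` list (the `reverse=True` argument isn't mentioned above; it's an optional extra that sorts from largest to smallest):

```python
readings = [9, 8, 42, 3, -17, 2]

ordered = sorted(readings)
largest_first = sorted(readings, reverse=True)

print(ordered)        # [-17, 2, 3, 8, 9, 42]
print(largest_first)  # [42, 9, 8, 3, 2, -17]

# The original list is still in its original order.
print(readings)       # [9, 8, 42, 3, -17, 2]
```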

### Negative indexes

If you use `-1` as the value inside of the brackets, something interesting happens:

In [40]:
fib = [1, 1, 2, 3, 5]
fib[-1]

5

The expression evaluates to the *last* item in the list. This is essentially the same thing as the following code:

In [41]:
fib[len(fib) - 1]

5

... except easier to write. In fact, you can use any negative integer in the index brackets, and Python will count that many items from the end of the list, and evaluate the expression to that item.

In [42]:
fib[-3]

2

If the value in the brackets would "go past" the beginning of the list, Python will raise an error:

In [80]:
fib[-14]

IndexError: list index out of range
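To convince yourself that a negative index is just shorthand for counting back from `len()`, here's a small sketch that compares the two spellings for every valid negative index of `fib`. (It uses a `for` loop and `.append()`, which show up properly later on.)

```python
fib = [1, 1, 2, 3, 5]

# For k from 1 to len(fib), fib[-k] and fib[len(fib) - k]
# should name the same item.
pairs = []
for k in range(1, len(fib) + 1):
    pairs.append((fib[-k], fib[len(fib) - k]))

print(pairs)  # [(5, 5), (3, 3), (2, 2), (1, 1), (1, 1)]
```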

### Generating lists with `range()`

The expression `list(range(n))` returns a list from 0 up to (but not including) `n`. This is helpful when you just want numbers in a sequence:

In [81]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You can specify where the list should start and end by supplying two parameters to the call to `range`:

In [84]:
list(range(-10, 10))

[-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
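A couple of things worth checking for yourself: the end value is never included, so the list has "end minus start" items in it. The sketch below also shows an optional third parameter to `range` (the step), which isn't covered above but comes in handy when you want every second number, or want to count down:

```python
numbers = list(range(-10, 10))
print(len(numbers))   # 20
print(numbers[0])     # -10
print(numbers[-1])    # 9

# Optional third parameter: the step between values.
evens = list(range(0, 10, 2))
print(evens)          # [0, 2, 4, 6, 8]

countdown = list(range(5, 0, -1))
print(countdown)      # [5, 4, 3, 2, 1]
```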

## List slices

The index bracket syntax explained above allows you to write an expression that evaluates to a particular item in a list, based on its position in the list. Python also has a powerful way for you to write expressions that return a *section* of a list, starting from a particular index and ending with another index. In Python parlance we'll call this section a *slice*.

Writing an expression to get a slice of a list looks a lot like writing an expression to get a single value. The difference is that instead of putting one number between square brackets, we put *two* numbers, separated by a colon. The first number tells Python where to begin the slice, and the second number tells Python where to end it.

In [45]:
[4, 5, 6, 10, 12, 15][1:4]

[5, 6, 10]

Note that the value after the colon specifies at which index the slice should end, but the slice does *not* include the value at that index. (You can tell how long the slice will be by subtracting the value before the colon from the value after it.)

Also note that---as always!---any expression that evaluates to an integer can be used for either value in the brackets. For example:

In [46]:
x = 3
[4, 5, 6, 10, 12, 15][x:x+2]

[10, 12]

Finally, note that the type of a slice is `list`:

In [48]:
type(my_numbers)

list

In [49]:
type(my_numbers[1:4])

list
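Here's a short sketch that checks the slice rules from this section on a made-up list of letters. The last two lines show two things not mentioned above, which I'm confident about but which you should try for yourself: a slice that runs past the end of the list doesn't raise an error (unlike a single index), and slicing doesn't change the original list.

```python
letters = ["a", "b", "c", "d", "e", "f"]

start = 1
stop = 4
piece = letters[start:stop]

print(piece)                       # ['b', 'c', 'd']
print(len(piece) == stop - start)  # True

# A slice that goes past the end just stops at the end.
past_end = letters[4:100]
print(past_end)                    # ['e', 'f']

# The original list is untouched.
print(letters)                     # ['a', 'b', 'c', 'd', 'e', 'f']
```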

### Omitting slice values

Because it's so common to use the slice syntax to get a list that is either a slice starting at the beginning of the list or a slice ending at the end of the list, Python has a special shortcut. Instead of writing:

In [57]:
[4, 5, 6, 10, 12, 15][0:3]

[4, 5, 6]

You can leave out the `0` and write this instead:

In [58]:
[4, 5, 6, 10, 12, 15][:3]

[4, 5, 6]

Likewise, if you wanted a slice that starts at index 4 and goes to the end of the list, you might write:

In [60]:
[4, 5, 6, 10, 12, 15][4:]

[12, 15]

Getting the first two items in `my_numbers`:

In [56]:
my_numbers[:2]

[5, 10]

### Negative index values in slices

Now for some tricky stuff: You can use negative index values in slice brackets as well! For example, to get a slice of a list from the fourth-to-last element of the list up to (but not including) the second-to-last element of the list:

In [62]:
[4, 5, 6, 10, 12, 15][-4:-2]

[6, 10]

To get everything up to (but not including) the third-to-last element of the list:

In [63]:
[4, 5, 6, 10, 12, 15][:-3]

[4, 5, 6]

All items from `my_numbers` from the third item from the end of the list up to the end of the list:

In [64]:
my_numbers[-3:]

[20, 25, 30]
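The two patterns above ("the last few items" and "everything except the last few items") are easy to mix up, so here's a side-by-side sketch. The variable `n` is just a name I picked, and this assumes `n` is a positive number:

```python
my_numbers = [5, 10, 15, 20, 25, 30]
n = 2  # must be positive for this to work as described

last_n = my_numbers[-n:]          # the last n items
all_but_last_n = my_numbers[:-n]  # everything before the last n items

print(last_n)          # [25, 30]
print(all_but_last_n)  # [5, 10, 15, 20]

# Gluing the two pieces back together gives the original list.
print(all_but_last_n + last_n == my_numbers)  # True
```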

## Lists within lists

So far we've seen lists that contain integers and floating-point numbers. But we're not limited to just those types! Importantly, lists can themselves contain... other lists. Lists within lists is one of the ways Python represents a matrix of values, like a spreadsheet. Here's what it looks like when you have lists inside of a list:

In [65]:
[[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

[[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

Whew, that's a lot of brackets! This is a list that has three items, each of which is itself a list containing four items. One way to visualize this list is to think of it as a table or spreadsheet that looks like this:

col 0|col 1|col 2|col 3
-----|-----|-----|-----
1    | 2   | 3   | 4
5    | 10  | 15  | 20
100  | 200 | 300 | 400

Using the `len()` function on this list returns the number of items in the outer list, or the number of "rows" in the "table":

In [181]:
len([[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]])

3


It'll be clearer if we assign our list-of-lists to a variable first:

In [69]:
data = [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

The result of getting an item at an index in a list of lists is the list at that index (i.e., the row):

In [71]:
data[1]

[5, 10, 15, 20]

To get a single item from the resulting list all in the same expression, use a second pair of square brackets, like so:

In [73]:
data[1][3]

20

Whoa, weird! But it works. If you have a variable `x` that is a list of lists, you can get the value in column `col` in row `row` with the following expression:

    x[row][col]
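Here's that pattern spelled out with the `data` list from above. The names `row`, `col` and `selected_row` are just illustrative. Doing the lookup in two steps gives the same answer as doing it in one expression:

```python
data = [[1, 2, 3, 4], [5, 10, 15, 20], [100, 200, 300, 400]]

row = 2
col = 1

# One expression...
print(data[row][col])     # 200

# ...or two steps: first get the row, then get the item in that row.
selected_row = data[row]
print(selected_row)       # [100, 200, 300, 400]
print(selected_row[col])  # 200

# Number of rows, and number of columns in the first row.
print(len(data))          # 3
print(len(data[0]))       # 4
```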

## List comprehensions: Applying transformations to lists

A very common task in both data analysis and computer programming is applying some operation to every item in a list (e.g., scaling the numbers in a list by a fixed factor), or to create a copy of a list with only those items that match a particular criterion (e.g., eliminating values that fall below a certain threshold). Python has a succinct syntax, called a *list comprehension*, which allows you to easily write expressions that transform and filter lists.

A list comprehension has a few parts:

- a *source list*, or the list whose values will be transformed or filtered;
- a *predicate expression*, to be evaluated for every item in the list; 
- (optionally) a *membership expression* that determines whether or not an item in the source list will be included in the result of evaluating the list comprehension, based on whether the expression evaluates to `True` or `False`; and
- a *temporary variable name* by which each value from the source list will be known in the predicate expression and membership expression.

These parts are arranged like so:

> `[` *predicate expression* `for` *temporary variable name* `in` *source list* `if` *membership expression* `]`

The words `for`, `in`, and `if` are a part of the syntax of the expression. They don't mean anything in particular (and in fact, they do completely different things in other parts of the Python language). You just have to spell them right and put them in the right place in order for the list comprehension to work.

Here's an example, returning the squares of the integers from zero up to (but not including) ten:

In [74]:
[x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In the example above, `x*x` is the predicate expression; `x` is the temporary variable name; and `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` is the source list. There's no membership expression in this example, so we omit it (and the word `if`).

There's nothing special about the variable `x`; it's just a name that we chose. We could easily choose any other temporary variable name, as long as we use it in the predicate expression as well. Below, I use the name of one of my cats as the temporary variable name, and the expression evaluates the same way it did with `x`:

In [76]:
[shumai * shumai for shumai in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Notice that the type of the value that a list comprehension evaluates to is itself type `list`:

In [77]:
type([x * x for x in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

list

The expression we supply for the source list component of a list comprehension doesn't have to be a list that you've written out by hand. It can also be a variable:

In [78]:
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[x * x for x in numbers]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

... or it can be some other expression that evaluates to a sequence of values, like a call to `range()`:

In [79]:
[x * x for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

We've used the expression `x * x` as the predicate expression in the examples above, but you can use any expression you want. For example, to scale the values of a list by 0.5:

In [86]:
[x * 0.5 for x in range(10)]

[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]

In fact, the expression in the list comprehension can just be the temporary variable itself, in which case the list comprehension will simply evaluate to a copy of the original list: 

In [87]:
[x for x in range(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You don't technically even need to use the temporary variable in the predicate expression:

In [89]:
[42 for x in range(10)]

[42, 42, 42, 42, 42, 42, 42, 42, 42, 42]

> Bonus exercise: Write a list comprehension for the list `range(5)` that evaluates to a list where every value has been multiplied by two (i.e., the expression would evaluate to `[0, 2, 4, 6, 8]`). 

### The membership expression

As indicated above, you can include an expression at the end of the list comprehension to determine whether or not the item in the source list will be evaluated and included in the resulting list. One way, for example, of including only those values from the source list that are greater than or equal to five:

In [91]:
[x*x for x in range(10) if x >= 5]

[25, 36, 49, 64, 81]
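If it helps, you can think of a list comprehension as shorthand for a loop that builds up a new list one item at a time. The sketch below writes the same computation both ways. It leans on `for` loops, `if` statements and `.append()`, so treat it as a preview rather than something you need to understand fully right now:

```python
# The list comprehension from above.
squares = [x * x for x in range(10) if x >= 5]

# Roughly the same thing, written out the long way.
squares_loop = []
for x in range(10):                 # the source list and temporary variable
    if x >= 5:                      # the membership expression
        squares_loop.append(x * x)  # the predicate expression

print(squares_loop)             # [25, 36, 49, 64, 81]
print(squares == squares_loop)  # True
```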

## Making lists from other kinds of data

We've learned how lists work, and we've learned some basic techniques for getting data from lists and turning lists into other lists. But so far it's been pretty abstract—we've been working with lists of numbers we've been typing in by hand, instead of with, you know, actual data fetched from some source. The reason for this is that getting data from real-world sources is... well, it's *hard*, and learning how to do that constitutes much of the content of this course.

Data files from real world sources usually come in a series of bytes—basically, a long sequence of numbers that correlate to how data is stored on disk or transmitted over the network. Our job as data mungers is to figure out how to "parse" this data ("parse" is used here loosely, in its colloquial meaning), transforming it from its "raw" form into actual Python data structures, like integers and lists.

### Strings

One way of representing raw data in Python is with a data type called a *string*. A string is essentially a sequence of characters of arbitrary length. You can write one in Python using single quotes (`'`) or double quotes (`"`) surrounding whatever characters you want. (The rules are [a little more complicated than that](https://docs.python.org/2/tutorial/introduction.html#strings), but we're focusing on the simple stuff for now.) Here's an example of a string:

In [92]:
print("this is a string. I can put a bunch of characters in here.")

this is a string. I can put a bunch of characters in here.


Using `print` on a string causes Python simply to display the characters in the string. You can assign strings to variables as well, and the `len()` function will return the length of a string, just as it returns the length of a list:

In [93]:
x = "hi i'm a string"
print(x)
print(len(x))

hi i'm a string
15


Strings have their own data type, type `str`:

In [109]:
type("mother said there'd be days like these")

str

### Strings and numbers

Notably, a string that contains what looks like a number does *not* behave like an actual integer or floating point number does. For example, attempting to subtract one string containing a number from another string containing a number will cause an error to be raised:

In [95]:
"15" - "4"

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Attempting to add an integer or floating-point number to a string that has a number inside of it will raise a similar error:

In [96]:
16 + "8.9"

TypeError: unsupported operand type(s) for +: 'int' and 'str'

"TypeError: unsupported operand type(s)" translates from Python talk to "you gave me two values, and asked me to perform an operation on those values, but I don't know how to do that when the values belong to these types." In this case, Python has no idea how to "add" a number and a string of characters. Fortunately, there are built-in functions whose purpose is to convert from one type to another; notably, you can put a string inside the parentheses of the `int()` and `float()` functions, and it will evaluate to (what Python interprets as) the integer and floating-point values (respectively) of the string: 

In [97]:
type("17")

str

In [98]:
int("17")

17

In [99]:
type(int("17"))

int

In [100]:
type("3.14159")

str

In [101]:
float("3.14159")

3.14159

In [102]:
type(float("3.14159"))

float

If you give a string to one of these functions that Python can't interpret as an integer or floating-point number, Python will raise an error:

In [111]:
int("shumai")

ValueError: invalid literal for int() with base 10: 'shumai'
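Real-world data is full of strings that *almost* look like numbers, so it can be handy to attempt the conversion and fall back to something else when it fails. Here's one possible sketch. The function name `to_int_or_none` is made up for this example, and returning `None` on failure is just one choice among several:

```python
def to_int_or_none(text):
    """Hypothetical helper: convert text to an int, or give back None."""
    try:
        return int(text)
    except ValueError:
        return None

print(to_int_or_none("17"))       # 17
print(to_int_or_none("shumai"))   # None

# Note that int() won't accept a string with a decimal point in it...
print(to_int_or_none("3.14159"))  # None

# ...but you can go by way of float() if you want the whole-number part.
print(int(float("3.14159")))      # 3
```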

### Strings and lists

Strings and lists share a lot of similarities! The same square bracket slice and index syntax works on strings the same way it works on lists:

In [125]:
message = "importantly"

In [126]:
message[1]

'm'

In [127]:
message[-2]

'l'

In [128]:
message[-5:-2]

'ant'

Weirdly, `max()` and `min()` also work on strings... they just evaluate to the letter that comes latest and earliest in alphabetical order (respectively):

In [129]:
max(message)

'y'

In [130]:
min(message)

'a'

You can turn a string into a list of its component characters by passing it to `list()`:

In [141]:
list(message)

['i', 'm', 'p', 'o', 'r', 't', 'a', 'n', 't', 'l', 'y']

In [142]:
list("我爱猫！😻")

['我', '爱', '猫', '！', '😻']

The letters in a string in alphabetical order:

In [132]:
sorted(list(message))

['a', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 't', 't', 'y']

### Splitting strings

We'll be talking a LOT about what you can do with strings in this class. But for the purposes of this session, I just want to talk about one additional thing: the `split()` method. The `split()` method is a funny thing you can do with a string to transform it into a list. If you have an expression that evaluates to a string, you can put `.split()` right after it, and Python will evaluate the whole expression to mean "take this string, and 'split' it on white space, giving me a list of strings with the remaining parts." For example:

In [105]:
"this is a test".split()

['this', 'is', 'a', 'test']

Notably, while the `type` of a string is `str`, the type of the result of `split()` is `list`:

In [106]:
type("this is a test".split())

list

If the string in question has some delimiter in it other than whitespace that we want to use to separate the fields in the resulting list, we can put a string with that delimiter inside the parentheses of the `split()` method. Maybe you can tell where I'm going with this at this point!
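Here's a quick sketch of `split()` with and without a delimiter, using some made-up strings. One difference to watch for: splitting on whitespace ignores runs of extra spaces, but splitting on an explicit delimiter gives you an empty string wherever two delimiters sit side by side.

```python
by_comma = "a,b,c".split(",")
print(by_comma)      # ['a', 'b', 'c']

by_hyphen = "red-green-blue".split("-")
print(by_hyphen)     # ['red', 'green', 'blue']

# No argument: split on whitespace, collapsing runs of spaces.
by_space = "a   b".split()
print(by_space)      # ['a', 'b']

# Explicit delimiter: two commas in a row produce an empty string.
with_empty = "a,,b".split(",")
print(with_empty)    # ['a', '', 'b']
```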

### From string to list of numbers: an example

For example, I happen to have here a string that represents the total points scored by LeBron James in each of his NBA games in the 2013-2014 regular season.

> 17,25,26,25,35,18,25,33,39,30,13,21,22,35,28,27,26,23,21,21,24,17,25,30,24,18,38,19,33,26,26,15,30,32,32,36,25,21,34,30,29,27,18,34,30,24,31,13,37,36,42,33,31,20,61,22,19,17,23,19,21,24,43,15,25,32,38,17,13,32,17,34,38,29,37,36,27

You can either cut-and-paste this string from the notes, or see a file on GitHub with these values [here](https://gist.githubusercontent.com/aparrish/56ea528159c97b085a34/raw/8406bdd866101cf64347349558d9a806c82aceb7/scores.txt).

Now if I just cut-and-pasted this string into a variable and tried to call list functions on it, I wouldn't get very helpful responses:

In [133]:
raw_str = "17,25,26,25,35,18,25,33,39,30,13,21,22,35,28,27,26,23,21,21,24,17,25,30,24,18,38,19,33,26,26,15,30,32,32,36,25,21,34,30,29,27,18,34,30,24,31,13,37,36,42,33,31,20,61,22,19,17,23,19,21,24,43,15,25,32,38,17,13,32,17,34,38,29,37,36,27"
max(raw_str)

'9'

This is wrong—we know that LeBron James scored more than nine points in his highest scoring game. The `max()` function clearly does strange things when we give it a string instead of a list. The reason for this is that all Python knows about a string is that it's a *series of characters*. It's easy for a human to look at this string and think, "Hey, that's a list of numbers!" But Python doesn't know that. We have to explicitly "translate" that string into the kind of data we want Python to treat it as.

> Bonus advanced exercise: Take a guess as to why, specifically, Python evaluates `max(raw_str)` to `9`. Hint: what's the result of `type(max(raw_str))`?

What we want to do, then, is find some way to convert this string that *represents* integer values into an actual Python list of integer values. We'll start by splitting this string into a list, using the `split()` method, passing `","` as a parameter so it splits on commas instead of on whitespace:

In [134]:
str_list = raw_str.split(",")
str_list

['17',
 '25',
 '26',
 '25',
 '35',
 '18',
 '25',
 '33',
 '39',
 '30',
 '13',
 '21',
 '22',
 '35',
 '28',
 '27',
 '26',
 '23',
 '21',
 '21',
 '24',
 '17',
 '25',
 '30',
 '24',
 '18',
 '38',
 '19',
 '33',
 '26',
 '26',
 '15',
 '30',
 '32',
 '32',
 '36',
 '25',
 '21',
 '34',
 '30',
 '29',
 '27',
 '18',
 '34',
 '30',
 '24',
 '31',
 '13',
 '37',
 '36',
 '42',
 '33',
 '31',
 '20',
 '61',
 '22',
 '19',
 '17',
 '23',
 '19',
 '21',
 '24',
 '43',
 '15',
 '25',
 '32',
 '38',
 '17',
 '13',
 '32',
 '17',
 '34',
 '38',
 '29',
 '37',
 '36',
 '27']

Looks good so far. What does `max()` have to say about it?

In [137]:
max(str_list)

'61'

This... works. (But only by accident: Python is comparing these values as strings, character by character, not as numbers.) But what if we wanted to find the total number of points scored by LBJ? We should be able to do something like this:

In [139]:
sum(str_list)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

... but we get an error. Why this error? The reason lies in what kind of data is in our list. We can check the data type of an element of the list with the `type()` function:

In [140]:
type(str_list[0])

str

A-ha! The type is `str`. So the error message we got before (`unsupported operand type(s) for +: 'int' and 'str'`) is Python's way of telling us, "You gave me a list of strings and then asked me to add them all together. I'm not sure what I can do for you."
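The fact that the list is full of strings is also why the earlier `max(str_list)` result can't be trusted in general. Strings are compared one character at a time, so `'9'` counts as "bigger" than `'61'` because the character `9` comes after the character `6`. A small sketch with made-up values:

```python
# Compared as strings, '9' beats '61'...
print(max(["9", "61"]))  # 9
print("9" > "61")        # True

# ...but compared as integers, 61 beats 9.
print(max([9, 61]))      # 61

# Sorting strings that contain numbers gives a surprising order, too.
print(sorted(["100", "9", "61"]))  # ['100', '61', '9']
```

In the LeBron data, `'61'` happened to win only because no score in the list begins with a character later than `6`.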

So there's one step left in our process of "converting" our "raw" string, consisting of comma-separated numbers, into a list of numbers. What we have is a list of strings; what we want is a list of numbers. Fortunately, we know how to write an expression to transform one list into another list, applying an expression to each member of the list along the way—it's called a list comprehension. Equally fortunately, we know how to write an expression that converts a string representing an integer into an actual integer (`int()`). Here's how to write that expression:

In [144]:
[int(x) for x in str_list]

[17,
 25,
 26,
 25,
 35,
 18,
 25,
 33,
 39,
 30,
 13,
 21,
 22,
 35,
 28,
 27,
 26,
 23,
 21,
 21,
 24,
 17,
 25,
 30,
 24,
 18,
 38,
 19,
 33,
 26,
 26,
 15,
 30,
 32,
 32,
 36,
 25,
 21,
 34,
 30,
 29,
 27,
 18,
 34,
 30,
 24,
 31,
 13,
 37,
 36,
 42,
 33,
 31,
 20,
 61,
 22,
 19,
 17,
 23,
 19,
 21,
 24,
 43,
 15,
 25,
 32,
 38,
 17,
 13,
 32,
 17,
 34,
 38,
 29,
 37,
 36,
 27]

Let's double-check that the values in this list are, in fact, integers, by spot-checking the first item in the list:

In [145]:
type([int(x) for x in str_list][0])

int

Hey, voila! Now we'll assign that list to a variable, for the sake of convenience, and then check to see if `sum()` works how we expect it to.

In [209]:
int_list = [int(x) for x in str_list]
print(sum(int_list))

2089


Wow! 2089 points in one season! Good work, King James.
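The whole process (split the string, convert each piece, then do arithmetic) is short enough to wrap up in a function. Here's a sketch. The name `parse_int_list` and its `delimiter` parameter are my own inventions, and I'm using just the first five scores from the string above to keep the output small:

```python
def parse_int_list(raw, delimiter=","):
    """Hypothetical helper: turn a delimited string of integers into a list of ints."""
    return [int(field) for field in raw.split(delimiter)]

sample = "17,25,26,25,35"  # the first five scores from the string above
scores = parse_int_list(sample)

print(scores)                     # [17, 25, 26, 25, 35]
print(sum(scores))                # 128
print(max(scores))                # 35
print(sum(scores) / len(scores))  # 25.6
```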

## Comma-separated value files (CSVs)

A common task in this class will be to (1) take data from some source, (2) figure out what format that data is in, then (3) parse the data into Python data structures so we can (4) perform operations on it and synthesize useful information. The example in the section above—taking a string containing a comma-separated list of numbers, splitting it apart, converting it to integers, and then finding its sum—is a simple example of that task.

Step (3) above usually turns out to be the most difficult step in the process. Fortunately, Python makes available many "libraries" that know how to take apart data in particular formats and convert them to Python data structures that we can use in our notebooks. (A library is a piece of pre-existing code that you can incorporate into your program. Some libraries come pre-installed with Python; some are pre-installed with Python frameworks, like Anaconda; others still you may need to install by hand. We'll talk about that step when the time comes!)

One such library is called `csv`: it's a library for parsing comma-separated value files (CSVs). CSV is a common "exchange" format, often used when exporting data from a spreadsheet program. In the [plain-text example files](plaintext-example-files.zip) that you (hopefully!) downloaded earlier, there's a CSV file called `dogs-of-nyc.csv` which has data on domestic dogs in New York City. The contents of the file look like this:

    dog_name,gender,breed,birth,dominant_color,secondary_color,third_color,spayed_or_neutered,guard_or_trained,borough,zip_code
    Buddy,M,Afghan Hound,Jan-00,BRINDLE,BLACK,n/a,Yes,No,Manhattan,10003
    Nicole,F,Afghan Hound,Jul-00,BLACK,n/a,n/a,Yes,No,Manhattan,10021
    Abby,F,Afghan Hound,Nov-00,BLACK,TAN,n/a,Yes,No,Manhattan,10034
    Chloe,F,Afghan Hound,1/2/2017,WHITE,BLOND,n/a,Yes,No,Manhattan,10024
    Jazzle,F,Afghan Hound,10/2/2017,BLOND,WHITE,BLACK,Yes,No,Manhattan,10022
    Trouble,M,Afghan Hound,1/3/2017,BLOND,WHITE,BLACK,Yes,No,Bronx,10472
    Grace,F,Afghan Hound,6/3/2017,CREAM,n/a,n/a,Yes,No,Manhattan,10021
    Sisu,M,Afghan Hound,10/4/2017,BLACK,WHITE,GRAY,No,No,Manhattan,10023
    Jakie,M,Afghan Hound,2/5/2017,WHITE,n/a,n/a,No,No,Queens,11354

The gist of the CSV format is that it consists of a number of *records*, one per line, each of which consists of an equal number of *fields*. Fields are separated with a comma. Often, the first line of data in a CSV file gives some clue as to how to interpret the information in the corresponding fields in the following rows. We can surmise that, e.g., the data in the fields corresponding to `gender` represent the dog's gender, that `zip_code` is the zip code of the dog's registration, etc.

Knowing what we know already, we can sort of imagine what the code to parse a file in this format might look like. We'd need to put that whole chunk of data into a big string, then split that string (somehow?) into individual lines; that would give us a list of strings. We could then write an expression to split *those* strings into lists of individual fields. In the end, we'd end up with a *list of lists.*
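Here's a rough sketch of that do-it-yourself approach, using a tiny three-line chunk of CSV that I've trimmed down from the file above. It works for simple data, but the last part shows where it falls apart: a field that has a comma *inside* it (CSV files put such fields in quotes) gets chopped in two. The "Rex" line is made up to demonstrate the problem.

```python
text = """dog_name,gender,breed
Buddy,M,Afghan Hound
Nicole,F,Afghan Hound"""

# Step 1: split the big string into lines.
lines = text.split("\n")

# Step 2: split each line into fields, giving a list of lists.
rows = [line.split(",") for line in lines]

print(rows[0])     # ['dog_name', 'gender', 'breed']
print(rows[1][0])  # Buddy

# Where this breaks: a quoted field containing a comma (made-up example).
broken = 'Rex,M,"Hound, Afghan"'.split(",")
print(broken)      # ['Rex', 'M', '"Hound', ' Afghan"']
```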

### Using the `csv` library

Fortunately, we *don't* have to do that work for ourselves (if we don't want to). There's an existing library for Python called `csv` that will do the work of taking a string and turning it into a list of lists for us.

I'm going to show you how to use this library, so you can get started doing simple data tasks with CSV files you find on the web. In the examples below, there is going to be some code that you won't be prepared for just yet—but I'm going to try to be careful to show you which parts of the code you can change yourself, and which parts you need to leave alone.

Make sure you have the `dogs-of-nyc.csv` file from the plain text example files ZIP in the same directory as this notebook before you proceed.

Here's some code for loading an entire CSV into a list of lists:

In [2]:
import csv

dogs = list(csv.reader(open("dogs-of-nyc.csv")))

The `import csv` line at the top of the cell simply tells Python that we want to use the `csv` library for the rest of the notebook. The part of the code that does all the work is this one:

    dogs = list(csv.reader(open("dogs-of-nyc.csv")))
    
You can change the filename to whatever filename you want, as long as what's in that file is CSV data. Let's examine the `dogs` variable:

In [4]:
type(dogs)

list

It's a list! What's in the list?

In [6]:
type(dogs[0])

list

The first element of this list is... also a list! Looks like we've got a list of lists here. Let's see what's actually inside the list:

In [7]:
dogs[0]

['dog_name',
 'gender',
 'breed',
 'birth',
 'dominant_color',
 'secondary_color',
 'third_color',
 'spayed_or_neutered',
 'guard_or_trained',
 'borough',
 'zip_code']

Looks like the first element of the `dogs` list is a list of column headings. What's in the second element?

In [9]:
dogs[1]

['Buddy',
 'M',
 'Afghan Hound',
 'Jan-00',
 'BRINDLE',
 'BLACK',
 'n/a',
 'Yes',
 'No',
 'Manhattan',
 '10003']

Ah, okay, now we're finally getting some actual data. How many records do we have? We'll use the `len()` function to check, taking care to not include the first record in our count (since that's the column heading row, and doesn't itself represent a dog):

In [12]:
len(dogs[1:])

81542

Eighty-one thousand dogs! That's a lot of puppers.

We can access a particular item in a particular record by using the list indexing brackets twice. According to the column headings, the dog's secondary color is at index 5 (`secondary_color`; the dominant color is at index 4). So we can get the secondary color of the first dog in the list like so:

In [14]:
dogs[1][5]

'BLACK'
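If you don't have the file handy, you can still try out `csv.reader` on a string by wrapping the string in `io.StringIO`, which makes it behave like an open file. This is only a sketch: the sample data is trimmed down from the file above, and the "Rex" row is made up to show that the library copes with a quoted field that has a comma in it.

```python
import csv
import io

sample = io.StringIO(
    "dog_name,gender,breed\n"
    "Buddy,M,Afghan Hound\n"
    'Rex,M,"Hound, Afghan"\n'  # made-up row with a comma inside a field
)

rows = list(csv.reader(sample))

print(rows[0])        # ['dog_name', 'gender', 'breed']
print(rows[2])        # ['Rex', 'M', 'Hound, Afghan']
print(len(rows[1:]))  # 2

# For a real file on disk, the same idea would look something like this
# (the csv documentation recommends opening files with newline=""):
#
#     with open("dogs-of-nyc.csv", newline="") as f:
#         dogs = list(csv.reader(f))
```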

### Selecting a single column

Now we're in a position to do some interesting things with the data from our CSV file. Let's start by creating an expression that evaluates to *all* of the values in a particular column. We'll do this using a list comprehension. Here's what it looks like:

In [16]:
colors = [row[5] for row in dogs[1:]]

In [17]:
colors[:10] # just show the first ten, don't want to see all 81k!

['BLACK',
 'n/a',
 'TAN',
 'BLOND',
 'WHITE',
 'WHITE',
 'n/a',
 'WHITE',
 'n/a',
 'WHITE']

That's the value at index 5 (the secondary color) from each of the first ten records. Let's break down that list comprehension a bit.

* The *source list* is `dogs[1:]`. (Why `dogs[1:]` and not just `dogs`? Because we want to omit the column header row.)
* The *temporary variable name* is `row`. As mentioned above, this can be anything! I chose `row` to remind us of the fact that each element in the source list is itself a row in a table.
* The *predicate expression* is `row[5]`, which translates into English as "get the element at index 5 of the list called `row`."
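Counting columns by eye is error-prone, so one alternative is to ask the header row where a column lives. The sketch below uses the list method `.index()`, which isn't covered in these notes: it evaluates to the position of the first matching item. The `dogs` list here is a small stand-in that I've trimmed down from the first few rows and columns of the file.

```python
# A small stand-in for the real data (first six columns only).
dogs = [
    ["dog_name", "gender", "breed", "birth", "dominant_color", "secondary_color"],
    ["Buddy", "M", "Afghan Hound", "Jan-00", "BRINDLE", "BLACK"],
    ["Nicole", "F", "Afghan Hound", "Jul-00", "BLACK", "n/a"],
    ["Abby", "F", "Afghan Hound", "Nov-00", "BLACK", "TAN"],
]

header = dogs[0]
col = header.index("secondary_color")  # look up the column's position by name
print(col)     # 5

colors = [row[col] for row in dogs[1:]]
print(colors)  # ['BLACK', 'n/a', 'TAN']
```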

> Bonus exercise: Use the `Counter` object to find the most common color.

Here's one way to get started on that, using `Counter` from Python's `collections` module to count how many times each value appears in the column:

In [19]:
from collections import Counter
Counter([row[5] for row in dogs[1:]])

Counter({'APRICOT': 188,
         'BLACK': 8230,
         'BLOND': 1972,
         'BLUE': 193,
         'BLUE MERLE': 29,
         'BRINDLE': 979,
         'BROWN': 8085,
         'CHARCOAL': 53,
         'CHOCOLATE': 107,
         'CREAM': 598,
         'FAWN': 228,
         'GOLD': 962,
         'GRAY': 1687,
         'ORANGE': 317,
         'RED': 274,
         'RUST': 1117,
         'SILVER': 269,
         'TAN': 10320,
         'WHITE': 20406,
         'n/a': 25528})

Here's another example. Let's get a list of the names of all female dogs in Brooklyn whose secondary color (the value at index 5) is silver:

In [21]:
[row[0] for row in dogs[1:] if row[5] == "SILVER" and row[9] == "Brooklyn" and row[1] == "F"]

['Patti',
 'Lillipad',
 'Sasha',
 'Lily',
 'Trixie',
 'Lana',
 'Minna',
 'Salome',
 'Naomi',
 'Hunter',
 'Cherry',
 'Lexie',
 'Jane',
 'Bella',
 'Liska',
 'Shadow',
 'Lucy',
 'Serendipity',
 'Miniature',
 'Tosca',
 'Schmutzi',
 'Princess',
 'Bella',
 'Monkey',
 'Nanuk',
 'Hana',
 'Daisy',
 'Layni',
 'Molly',
 'Tish',
 'Bambi',
 'Lolli',
 'Cookie',
 'Phoebe',
 'Bella-Sarah',
 'Cassie',
 'Maggie',
 'Lucy']
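
A list comprehension with several conditions joined by `and` can be easier to read when it's spread over a few lines. The sketch below shows the same kind of filter on a small invented table; `sample_dogs`, its rows, and its four-column layout are assumptions for illustration, not the real data.

```python
# Hypothetical stand-in for `dogs`, with only four columns.
sample_dogs = [
    ["dog_name", "gender", "dominant_color", "borough"],
    ["Patti", "F", "SILVER", "Brooklyn"],
    ["Rex", "M", "SILVER", "Brooklyn"],
    ["Lily", "F", "SILVER", "Queens"],
    ["Sasha", "F", "SILVER", "Brooklyn"],
    ["Molly", "F", "TAN", "Brooklyn"],
]

# Keep a row's name only when all three conditions are true for that row.
silver_females = [
    row[0]
    for row in sample_dogs[1:]
    if row[2] == "SILVER" and row[3] == "Brooklyn" and row[1] == "F"
]
print(silver_females)
```

Python ignores line breaks inside square brackets, so this is the same comprehension as a one-line version, just laid out differently.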

## Making changes to lists

Often we'll want to make changes to a list after we've created it---for
example, we might want to append elements to the list, remove elements from
the list, or change the order of elements in the list. Python has a number
of methods for facilitating these operations.

The first method we'll talk about is `.append()`, which adds an item on to
the end of an existing list.

In [147]:
ingredients = ["flour", "milk", "eggs"]
ingredients.append("sugar")
ingredients

['flour', 'milk', 'eggs', 'sugar']

Notice that invoking the `.append()` method doesn't itself evaluate to
anything! (Technically, it evaluates to the special value `None`, whose type
is `NoneType`.) Unlike many of the methods and syntactic constructions we've
looked at so far, the `.append()` method changes the underlying value---it
doesn't return a new value that is a copy with changes applied.
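
A quick way to convince yourself of this is to store whatever `.append()` evaluates to in a variable and look at it. This is only a small sketch; the variable name `outcome` is made up for the example.

```python
ingredients = ["flour", "milk", "eggs"]

# Keep whatever .append() evaluates to, so we can look at it.
outcome = ingredients.append("sugar")

print(outcome)       # the special value None
print(ingredients)   # the list itself now has four items
```

This is why writing `ingredients = ingredients.append("sugar")` is a common mistake: it replaces your list with `None`.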

There are two methods to facilitate removing values from a list: `.pop()` and
`.remove()`. The `.remove()` method removes from the list the first value that
matches the value in the parentheses:

In [148]:
ingredients = ["flour", "milk", "eggs", "sugar"]
ingredients.remove("flour")
ingredients

['milk', 'eggs', 'sugar']

(Note that `.remove()`, like `.append()`, doesn't evaluate to anything---it
changes the list itself.)

The `.pop()` method works slightly differently: give it an expression that
evaluates to an integer, and it evaluates to the item at the index
named by the integer. But it also has a side effect: it *removes* that item
from the list:

In [149]:
ingredients = ["flour", "milk", "eggs", "sugar"]
ingredients.pop(1)
ingredients

['flour', 'eggs', 'sugar']
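
The cell above only displays the list afterwards, so the value that `.pop()` evaluates to never appears. Here's a small sketch that keeps hold of it; the variable name `removed` is just an illustrative choice.

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# .pop(1) evaluates to the item at index 1 and removes it from the list.
removed = ingredients.pop(1)

print(removed)       # the item that was taken out
print(ingredients)   # the remaining three items
```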

> EXERCISE: What happens when you try to `.pop()` a value from a list at an index that doesn't exist in the list? What happens when you try to `.remove()` an item from a list if that item isn't in that list to begin with?

> ANOTHER EXERCISE: Write an expression that `.pop()`s the second-to-last item from a list. SPOILER: <span style="background: black;">(Did you guess that you could use negative indexing with `.pop()`?)</span>

The `.sort()` and `.reverse()` methods do the same jobs as their function
counterparts `sorted()` and `reversed()`. The difference is that the methods
don't evaluate to anything, instead changing the list in place, while the
functions leave the original list alone and evaluate to a new value.

In [150]:
ingredients = ["flour", "milk", "eggs", "sugar"]
ingredients.sort()
ingredients

['eggs', 'flour', 'milk', 'sugar']

In [151]:
ingredients = ["flour", "milk", "eggs", "sugar"]
ingredients.reverse()
ingredients

['sugar', 'eggs', 'milk', 'flour']
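
To make the contrast between the functions and the methods concrete, here's a short side-by-side sketch. The variable names `alphabetical`, `backwards`, and `outcome` are invented for the example.

```python
ingredients = ["flour", "milk", "eggs", "sugar"]

# The functions leave the original list alone and hand back something new.
alphabetical = sorted(ingredients)        # a new, sorted list
backwards = list(reversed(ingredients))   # reversed() gives an iterator, so wrap it in list()
print(ingredients)                        # still in the original order

# The methods change the list itself and evaluate to None.
outcome = ingredients.sort()
print(outcome)
print(ingredients)                        # now in alphabetical order
```

As a rule of thumb, reach for the functions when you still need the original order later, and for the methods when you don't.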

## Lists and randomness

Python's `random` library provides several helpful functions for performing
chance operations on lists. The first is `shuffle`, which takes a list and
randomly shuffles its contents:

In [152]:
import random
ingredients = ["flour", "milk", "eggs", "sugar"]
random.shuffle(ingredients)
ingredients

['milk', 'eggs', 'sugar', 'flour']

The second is `choice`, which returns a single random element from a list.

In [153]:
import random
ingredients = ["flour", "milk", "eggs", "sugar"]
random.choice(ingredients)

'sugar'

Finally, the `sample` function returns a list of values, selected at random,
from a list. The `sample` function takes two parameters: the first is a list,
and the second is how many items should be in the resulting list of randomly
selected values:

In [155]:
import random
ingredients = ["flour", "milk", "eggs", "sugar"]
random.sample(ingredients, 2)

['flour', 'sugar']
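
One difference between these three functions is easy to miss: `shuffle` changes the list you give it, while `choice` and `sample` leave it alone. The sketch below shows all three together. The call to `random.seed()` and the seed value are additions for the example, there only so that repeated runs produce the same results; your own results will depend on the seed.

```python
import random

random.seed(2024)  # arbitrary seed, only so repeated runs match each other

ingredients = ["flour", "milk", "eggs", "sugar"]

# choice() and sample() pick from the list without changing it.
one_item = random.choice(ingredients)
two_items = random.sample(ingredients, 2)   # two different positions, never the same one twice
print(one_item, two_items)
print(ingredients)                          # still in the original order

# shuffle() rearranges the list itself and evaluates to None.
random.shuffle(ingredients)
print(ingredients)                          # same four items, rearranged
```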

## Iterating over lists with `for`

The list comprehension syntax discussed earlier is very powerful: it allows you to succinctly transform one list into another list by thinking in terms of filtering and modification. But sometimes your primary goal isn't to make a new list, but simply to perform a set of operations on an existing list.

Let's say that you want to print every string in a list. Here's a short text:

In [156]:
text = "it was the best of times, it was the worst of times"

We can make a list of all the words in the text by splitting on whitespace:

In [157]:
words = text.split()

Of course, we can see what's in the list simply by evaluating the variable:

In [158]:
words

['it',
 'was',
 'the',
 'best',
 'of',
 'times,',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

But let's say that we want to print out each word on a separate line, without any of Python's weird punctuation. In other words, I want the output to look like:

    it
    was
    the
    best
    of
    times,
    it
    was
    the
    worst
    of
    times
    
But how can this be accomplished? We know that the `print()` function can display an individual string in this manner:

In [161]:
print("hello")

hello


So what we need, clearly, is a way to call the `print()` function with every item of the list. We could do this by writing a series of `print()` statements, one for every item in the list:

In [165]:
print(words[0])
print(words[1])
print(words[2])
print(words[3])
print(words[4])
print(words[5])
print(words[6])
print(words[7])
print(words[8])
print(words[9])
print(words[10])
print(words[11])

it
was
the
best
of
times,
it
was
the
worst
of
times


Nice, but there are some problems with this approach:

1. It's kind of verbose---we're doing exactly the same thing multiple times, only with slightly different expressions. Surely there's an easier way to tell the computer to do this?
2. It doesn't scale. What if we wrote a program that needs to produce hundreds or thousands of lines? Would we really need to write a `print()` call for each of those expressions?
3. It requires us to know how many items are going to end up in the list to begin with.

Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. It's so common, in fact, that Python has some built-in syntax to make the task easy: the `for` loop.

Here's how a `for` loop looks:

    for tempvar in sourcelist:
        statements

The words `for` and `in` just have to be there---that's how Python knows it's
a `for` loop. Here's what each of those parts means.

* *tempvar*: A name for a variable. Inside of the for loop, this variable will contain the current item of the list.
* *sourcelist*: This can be any Python expression that evaluates to a list---a variable that contains a list, or a list slice, or even a list literal that you just type right in!
* *statements*: One or more Python statements. Everything tabbed over underneath the `for` will be executed once for each item in the list. The statements tabbed over underneath the `for` line are called the *body* of the loop.
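
As a small sketch of the point about *sourcelist*, here are two loops whose source is not a plain variable: one uses a list slice, the other a list literal. The names `word` and `number` and the shortened text are arbitrary choices for the example.

```python
words = "it was the best of times".split()

# A slice as the source list: the loop only sees the first three words.
for word in words[:3]:
    print(word)

# A list literal typed directly into the loop.
for number in [5, 10, 15]:
    print(number * 2)
```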

Here's what the `for` loop for printing out every item in a list might look like:

In [167]:
for item in words:
    print(item)

it
was
the
best
of
times,
it
was
the
worst
of
times


The variable name `item` is arbitrary. You can pick whatever variable name you like, as long as you're consistent about using the same variable name in the body of the loop. If you wrote out this loop in a long-hand fashion, it might look like this:

In [172]:
item = words[0]
print(item)
item = words[1]
print(item)
item = words[2]
print(item)
item = words[3]
print(item)
# etc.

it
was
the
best


Of course, the body of the loop can have more than one statement, and you can assign values to variables inside the loop:

In [174]:
for item in words:
    yelling = item.upper()
    print(yelling)

IT
WAS
THE
BEST
OF
TIMES,
IT
WAS
THE
WORST
OF
TIMES


You can also include other kinds of nested statements inside the `for` loop, like `if/else`:

In [176]:
for item in words:
    if len(item) == 2:
        print(item.upper())
    elif len(item) == 3:
        print("   " + item)
    else:
        print(item)

IT
   was
   the
best
OF
times,
IT
   was
   the
worst
OF
times


This structure is called a "loop" because when Python reaches the end of the statements in the body, it "loops" back to the beginning of the body, and executes the same statements again (this time with the next item in the list). 

Python programmers tend to use `for` loops most often when the problem would otherwise be too tricky or complicated to solve using a list comprehension. It's easy to paraphrase any list comprehension in `for` loop syntax. For example, this list comprehension, which evaluates to a list of the squares of even integers from 1 to 25:

In [178]:
[x * x for x in range(1, 26) if x % 2 == 0]

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]

You can rewrite this list comprehension as a `for` loop by starting out with an empty list, then appending an item to the list inside the loop. The source list remains the same:

In [180]:
result = []
for x in range(1, 26):
    if x % 2 == 0:
        result.append(x * x)
result

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]

## Join: Making strings from lists

Once we've created a list of words, it's a common task to want to take that
list and "glue" it back together, so it's a single string again, instead of
a list. So, for example:

In [181]:
element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
glue = ", and "
glue.join(element_list)

'hydrogen, and helium, and lithium, and beryllium, and boron'

The `.join()` method needs a "glue" string to the left of it---this is the
string that will be placed in between the list elements. In the parentheses
to the right, you need to put an expression that evaluates to a list. Very
frequently with `.join()`, programmers don't bother to assign the "glue"
string to a variable first, so you end up with code that looks like this:

In [182]:
words = ["this", "is", "a", "test"]
" ".join(words)

'this is a test'
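
One thing to watch for: `.join()` only works when every item in the list is a string. If your list holds numbers, Python raises a `TypeError`. A minimal sketch of one way around this is to convert each item with `str()` in a list comprehension first; the variable names here are invented for the example.

```python
numbers = [5, 10, 15, 20]

# ", ".join(numbers) would raise a TypeError, because the items are integers.
# Convert each item to a string first, then join the strings.
number_strings = [str(n) for n in numbers]
joined = ", ".join(number_strings)
print(joined)
```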

When we're working with `.split()` and `.join()`, our workflow usually looks
something like this:

1. Split a string to get a list of units (usually words).
2. Use some of the list operations discussed above to modify or slice the list.
3. Join that list back together into a string.
4. Do something with that string (e.g., print it out).

With this in mind, here's a program that splits a string into words, randomizes the order of the words, then prints out the results:

In [183]:
text = "it was a dark and stormy night"
words = text.split()
random.shuffle(words)
' '.join(words)

'night a dark was and it stormy'
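
The same four steps work with other list operations in the middle. As a sketch, here's a version that sorts the words instead of shuffling them, which has the advantage of giving the same result every time. The variable name `result` is just an illustrative choice.

```python
text = "it was a dark and stormy night"

words = text.split()        # 1. split the string into a list of words
words.sort()                # 2. modify the list (here: alphabetical order)
result = " ".join(words)    # 3. join the list back into a single string
print(result)               # 4. do something with the string
```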

> EXERCISE: Write a Python command-line program that prints out the lines of a text file in random order.

## Conclusion

We've put down the foundation today for you to become fluent in Python's very powerful and super-convenient syntax for lists. We've also done a bit of data parsing and analysis! Pretty good for day one.

Further resources:

* [Lists](http://openbookproject.net/thinkcs/python/english2e/ch09.html), from [How To Think Like A Computer Scientist: Learning with Python](http://openbookproject.net/thinkcs/python/english2e/)
* [Using Python as a calculator](https://docs.python.org/2.7/tutorial/introduction.html#using-python-as-a-calculator), from the official [Python tutorial](https://docs.python.org/2.7/tutorial/index.html)
* [Loops and lists](http://learnpythonthehardway.org/book/ex32.html) and [Accessing Elements of Lists](http://learnpythonthehardway.org/book/ex34.html) from [Learn Python The Hard Way](http://learnpythonthehardway.org/book/)
* [List comprehensions tutorial](http://www.secnetix.de/olli/Python/list_comprehensions.hawk)