## Advanced Use of Lists and Dictionaries



In [11]:
#Print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

## Reading in data from files

All the data we've worked with so far disappears when we turn off the computer.  So data is often stored in files to make it persistent.  We'll work with text files in this class.

View your file in Jupyter Lab by double clicking the file name in the file explorer.  We will start with hokeypokey.txt and then move on to states.txt.

Getting data from files into Python has three steps:
1.  open the file
2.  read data from file
3.  close the file

### Opening a file

To open a file, assign the result of `open("name_of_your_file", "r")` to a variable.  The call produces a *file object*.
* The variable name you choose is how you will refer to the file in your Python code.
* The `"r"` argument opens the file as read-only.  If you don't need to change the file, this is the safest choice because you cannot accidentally modify it.

In [3]:
## Assuming your file is in the folder you're working in...

f = open('hokeypokey.txt', 'r')  # opens a file to use.  The 'r' argument opens as read-only.  This is also the default.

### Reading in data from a file

We often work with files that are really large, and so it's best practice to read them in one line at a time, process the line, store what you need, and then move on.  This way the entire file is not stored in your computer's memory.
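As a rough illustration of the read-process-store pattern described above, the sketch below counts the lines in a file and keeps only the longest one.  The file name `sample_lines.txt` and its contents are made up for this example so that the code runs on its own; they are not one of the course files.

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
sample_path = os.path.join(tempfile.gettempdir(), "sample_lines.txt")
sample = open(sample_path, "w")
sample.write("You put your right foot in\nYou put your right foot out\n")
sample.close()

line_count = 0       # running count of lines seen so far
longest_line = ""    # the only line we keep around

f = open(sample_path, "r")
for line in f:       # only the current line is held in memory
    line_count = line_count + 1
    if len(line) > len(longest_line):
        longest_line = line
f.close()

print(line_count)
print(longest_line)
```

However large the file is, this loop only ever stores one counter and one line.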

Python naturally thinks of files as a collection of lines.  So if you loop over a file object in a `for` statement, each element will be one line of the file.

In [4]:
f = open('hokeypokey.txt', 'r')  # opens a file to use.  The 'r' argument opens as read-only.  This is also the default.

for line in f:
    line_list = line.split(" ")
    print(line_list)


['You', 'put', 'your', 'right', 'foot', 'in\n']
['You', 'put', 'your', 'right', 'foot', 'out\n']
['You', 'put', 'your', 'right', 'foot', 'in\n']
['And', 'you', 'shake', 'it', 'all', 'about\n']
['You', 'do', 'the', 'Hokey-Pokey\n']
['And', 'you', 'turn', 'yourself', 'around\n']
["That's", 'what', "it's", 'all', 'about']


### Closing a file

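After you are done extracting everything you need from the file (that is, after the whole `for` loop has finished, not after a single iteration), it's best practice to close the file.

Closing hands the file back to the operating system.  For a file opened for writing, it also makes sure that everything you wrote is actually saved.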
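Python also offers the `with` statement, which closes the file for you when the indented block ends.  The sketch below is one possible way to write the open/read/close steps; the file `with_demo.txt` and its contents are invented here so the example can run by itself.

```python
import os
import tempfile

# Hypothetical sample file so the example is self-contained.
demo_path = os.path.join(tempfile.gettempdir(), "with_demo.txt")
with open(demo_path, "w") as out:
    out.write("first line\nsecond line\n")

lines = []
with open(demo_path, "r") as f:
    for line in f:
        lines.append(line)

# Outside the with block the file has already been closed for us.
print(f.closed)
print(lines)
```

Calling `f.close()` yourself, as in the cells in this notebook, works just as well; `with` simply makes it harder to forget.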

In [5]:
f = open('hokeypokey.txt', 'r')  # opens a file to use.  The 'r' argument opens as read-only.  This is also the default.

for line in f:
    line_list = line.split(" ")
    print(line_list)
    
f.close()

['You', 'put', 'your', 'right', 'foot', 'in\n']
['You', 'put', 'your', 'right', 'foot', 'out\n']
['You', 'put', 'your', 'right', 'foot', 'in\n']
['And', 'you', 'shake', 'it', 'all', 'about\n']
['You', 'do', 'the', 'Hokey-Pokey\n']
['And', 'you', 'turn', 'yourself', 'around\n']
["That's", 'what', "it's", 'all', 'about']


## String trimming methods

Notice that when we print the lists above, the last item on most lines ends in a strange "\n".
* These are newline characters:  you don't see them in the file when you view it, but they're there.

Strings have methods to remove white space and special characters ("\t" indicates a tab, for example).

* `.rstrip()` removes white space (spaces, tabs, newlines) from the right end of a string.
* `.lstrip()` removes white space from the left end of a string.
* `.strip()` removes white space from both ends of a string.

Passing an argument to any of these methods tells it to strip only the characters you list.
* `.rstrip('\n')` removes newline characters only
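A few small cases may help show how the argument behaves.  The strings below are made up for illustration.  Note that the argument is treated as a set of characters to remove, not as a single substring.

```python
word = "xxhixx"
trimmed_word = word.strip("x")           # removes x from both ends
print(trimmed_word)

title = "--==title==--"
trimmed_title = title.strip("-=")        # removes any mix of - and = from both ends
print(trimmed_title)

row = "abc  \n"
no_newline = row.rstrip("\n")            # only the newline goes; the spaces stay
print(repr(no_newline))

fields = "Ohio,Columbus\n".rstrip("\n").split(",")
print(fields)
```

Characters in the middle of the string are never touched; stripping stops at the first character that is not in the set.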

In [6]:
mystring = "   abc  \t \n"  # \t is a tab character
mystring.rstrip()
mystring.lstrip()
mystring.strip()
mystring2 = mystring.rstrip("\n")
mystring2

'   abc'

'abc  \t \n'

'abc'

'   abc  \t '

### Problem 1
Adjust the code below so that extra spaces and special characters are stripped from each line before it is split into a list.

In [7]:
f = open('hokeypokey.txt', 'r')  # opens a file to use.  The 'r' argument opens as read-only.  This is also the default.

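for line in f:
    line = line.strip()            # remove the newline and surrounding white space first
    line_list = line.split(" ")    # then split the cleaned line into words
    print(line_list)
    
f.close()

['You', 'put', 'your', 'right', 'foot', 'in']
['You', 'put', 'your', 'right', 'foot', 'out']
['You', 'put', 'your', 'right', 'foot', 'in']
['And', 'you', 'shake', 'it', 'all', 'about']
['You', 'do', 'the', 'Hokey-Pokey']
['And', 'you', 'turn', 'yourself', 'around']
["That's", 'what', "it's", 'all', 'about']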


### Problem 2

Preview the file states.txt by double clicking on it in the File Explorer pane of Jupyter Lab.  Then write code to read in the file and store the data in a data structure that will let you look up the capital of a state by state name.  (You do not have to write code to look up capitals, though.)  You may want to refer to the exercises we did in the previous class to help with this.

In [70]:
f = open('states.txt', 'r')

states = {}

for line in f:
    line = line.rstrip()
    state, capital = line.split(",")
    states[state] = capital

f.close()
states['North Carolina']

'Raleigh'

## Getting data **into** complex data structures


Double click on states2.txt to view.
* List of states:  for each state we have the capital and the 2-letter abbreviation.

To think about the right data structure for your information you can think about two things:
* how you want to display the data
* how you want to access the data.

## Display state information
### Alabama
* AL
* Montgomery

### Alaska
* AK
* Juneau

## Access state information

Say you want to be able to enter a state name and access relevant information about that state.

## Both ways point to nested dictionaries

* If you display your data with headings, the headings are often naturally keys.
* If you want to look up something based on some value, that value is naturally a key.
* If you have more than one value per heading or lookup, then your value might naturally be another dictionary, or a list, or a combination.
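To make the idea concrete, here is a small hand-typed sketch of a nested dictionary for the two states shown in the display example above.  The variable name `states_demo` is chosen only for this illustration.

```python
# Outer keys are the headings (state names);
# each value is an inner dictionary holding the details for that state.
states_demo = {
    "Alabama": {"abbrev": "AL", "capital": "Montgomery"},
    "Alaska": {"abbrev": "AK", "capital": "Juneau"},
}

# Access: look up by state name first, then by the detail you want.
alaska_capital = states_demo["Alaska"]["capital"]
print(alaska_capital)

# Display: the outer key becomes the heading, the inner values the bullets.
for name in states_demo:
    print(name)
    print("*", states_demo[name]["abbrev"])
    print("*", states_demo[name]["capital"])
```

The same structure serves both purposes: the outer key is what you display as a heading and also what you look up.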


### Problem 3
Write code to read the data into a dictionary called *states* where each state name is a key, and each value is a dictionary with the keys "capital" and "abbrev".

In [83]:
f = open('states2.txt', 'r')

states = {}

for line in f:
    line = line.rstrip()
    state, capital, abbrev = line.split(",")
    states[state] = {'capital':capital , 'abbrev':abbrev}

f.close()
states['North Carolina']['capital']

'Raleigh'

### Problem 4
Now practice accessing elements from your dictionary.
Write code to print
1.  the abbreviation for Hawaii
2.  the capitals for all states
3.  a sentence for each state that says "The abbreviation for *state* is *abbrev*."

In [97]:
print(states['Hawaii']['abbrev'])

for state in states:
    print(states[state]['capital'])
    
for state in states:
    print(f"The abbreviation for {state} is {states[state]['abbrev']}")

HI
Montgomery
Juneau
Phoenix
Little Rock
Sacramento
Denver
Hartford
Dover
Tallahassee
Atlanta
Honolulu
Boise
Springfield
Indianapolis
Des Moines
Topeka
Frankfort
Baton Rouge
Augusta
Annapolis
Boston
Lansing
St. Paul
Jackson
Jefferson City
Helena
Lincoln
Carson City
Concord
Trenton
Santa Fe
Albany
Raleigh
Bismarck
Columbus
Oklahoma City
Salem
Harrisburg
Providence
Columbia
Pierre
Nashville
Austin
Salt Lake City
Montpelier
Richmond
Olympia
Charleston
Madison
Cheyenne
The abbreviation for Alabama is AL
The abbreviation for Alaska is AK
The abbreviation for Arizona is AZ
The abbreviation for Arkansas is AR
The abbreviation for California is CA
The abbreviation for Colorado is CO
The abbreviation for Connecticut is CT
The abbreviation for Delaware is DE
The abbreviation for Florida is FL
The abbreviation for Georgia is GA
The abbreviation for Hawaii is HI
The abbreviation for Idaho is ID
The abbreviation for Illinois is IL
The abbreviation for Indiana is IN
The abbreviation for Iowa is IA
T

## Store multiple pieces of information AND keep track of counts.

You might use a dictionary to keep track of multiple pieces of information including running counts or sums.
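A minimal sketch of that idea follows.  The `records` list is invented for illustration and is not taken from any of the course files; it stands in for lines you would read from a file.  The first time a state appears, its details are stored and the count starts at 1; every later appearance only increases the count.

```python
# Hypothetical (state, capital) records; some states appear more than once.
records = [
    ("Ohio", "Columbus"),
    ("Texas", "Austin"),
    ("Ohio", "Columbus"),
    ("Ohio", "Columbus"),
]

info = {}
for state, capital in records:
    if state not in info:
        # First sighting: store the fixed details and start the count.
        info[state] = {"capital": capital, "count": 1}
    else:
        # Seen before: only the running count changes.
        info[state]["count"] = info[state]["count"] + 1

for state in info:
    print(state, info[state]["capital"], info[state]["count"])
```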

### Problem 5

Let's now look at a related data set, president_states.txt.  It again lists states with their capitals and abbreviations, but not all 50 states are represented; instead there is one line per U.S. president.  Each line gives the president's name along with the state, state capital, and state abbreviation.

Your job now is to update the previous data structure such that the sub-dictionary that previously stored capital and abbreviation **also keeps track** of how many presidents have come from that state by adding a key called 'numpres' and the associated value.

Keep in mind that here you will read in info for a state more than once:  Virginia and Massachusetts, for example, were the birthplace of several presidents.  

When you're done making your dictionary use the keys and values to print a message saying that "The capital of \<state\> is \<capital\> and it has had \<numpres\> presidents."

**Hint:** You will need the skills from the previous class, where we used a dictionary to count votes, as well as those from the previous exercise.

In [23]:
f = open('president_states.txt', 'r')

states = {}

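for line in f:
    line = line.rstrip()
    # The fourth field is not a count; it is assumed here to be the president's name.
    state, capital, abbrev, president = line.split(",")
    
    if state not in states:
        # First time we see this state: store its details and start the count at 1.
        states[state] = {'capital':capital , 'abbrev':abbrev, 'numpres': 1}
    else:
        # State already stored: add one more president to its count.
        states[state]['numpres'] += 1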

f.close()
for state in states:
    print(f"The capital of {state} is {states[state]['capital']} and it has had {states[state]['numpres']} presidents")


The capital of Virginia is Richmond and it has had 8 presidents
The capital of Massachusetts is Boston and it has had 4 presidents
The capital of South Carolina is Columbia and it has had 1 presidents
The capital of New York is Albany and it has had 5 presidents
The capital of Pennsylvania is Harrisburg and it has had 2 presidents
The capital of North Carolina is Raleigh and it has had 2 presidents
The capital of New Hampshire is Concord and it has had 1 presidents
The capital of Kentucky is Frankfort and it has had 1 presidents
The capital of Ohio is Columbus and it has had 7 presidents
The capital of Vermont is Montpelier and it has had 2 presidents
The capital of New Jersey is Trenton and it has had 1 presidents
The capital of Iowa is Des Moines and it has had 1 presidents
The capital of Missouri is Jefferson City and it has had 1 presidents
The capital of Texas is Austin and it has had 2 presidents
The capital of Illinois is Springfield and it has had 1 presidents
The capital of Ca

## Optional practice problem

The problems above should be good preparation for PS 07, but if you'd like to practice with some variations on a theme, then check out the problem below.  It is not as challenging as the vote tallying problem.

Read in students1.txt, a file where each line represents a course taken by a student.  The first field is the ID, then the course ID, and last is the grade in the class.  Read the data into a data structure that will let you access each student's course history and look up the grade by course.  You may assume that no student will take the same class twice.  It may help you to plan out the data structure on a piece of scratch paper before you start coding.
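One possible shape for the data structure is a dictionary of dictionaries: student ID, then course ID, then grade.  The sketch below works on a few made-up lines instead of the real file, and it assumes the fields are separated by commas; check students1.txt to see what its lines actually look like before adapting it.

```python
# Hypothetical lines standing in for students1.txt (comma-separated is an assumption).
sample_lines = [
    "s001,MATH101,A\n",
    "s001,ENGL110,B\n",
    "s002,MATH101,C\n",
]

history = {}
for line in sample_lines:
    student, course, grade = line.rstrip().split(",")
    if student not in history:
        history[student] = {}          # start an empty course history
    history[student][course] = grade   # one grade per course per student

print(history["s001"])                 # the full course history for one student
print(history["s001"]["ENGL110"])      # the grade for one course
```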