<!-- dom:TITLE: Ch.6: Dictionaries and strings -->
# Ch.6: Dictionaries and strings
<!-- dom:AUTHOR: Hans Petter Langtangen at Simula Research Laboratory & University of Oslo, Dept. of Informatics -->
<!-- Author: --> **Hans Petter Langtangen**, Simula Research Laboratory and University of Oslo, Dept. of Informatics

Date: **Aug 15, 2015**

## Goals

 * Learn more about file reading

 * Store file data in a new object type: *dictionary*

 * Interpret content in files via string manipulation



The main focus in the course is on working with files, dictionaries and strings.
The book has additional material on how to utilize data from the Internet.




# Dictionaries

In [1]:
import os

figfiles = {'fig1.pdf': 81761, 'fig2.png': 8754}

filename = 'fig3.png'   # assumed to exist in the current directory
figfiles[filename] = os.path.getsize(filename)

for name in figfiles:
    print 'File size of %s is %d' % (name, figfiles[name])

<!-- dom:FIGURE: [fig-files/dictionary2.png, width=300 frac=0.3] -->
<!-- begin figure -->

<p></p>
<img src="fig-files/dictionary2.png" width=300>

<!-- end figure -->


## A dictionary is a generalization of a list

  * Features of lists:

   * store a *sequence* of elements in a single object (`[1,3,-1]`)

   * each element is a Python object

   * the elements are indexed by integers 0, 1, ...


  * Dictionaries can index objects in a collection via text
    (= "lists with text index")

  * A Python dictionary is called a hash, HashMap, or associative array in other languages



## The list index is sometimes unnatural for locating an element of a collection of objects

Suppose we need to store the temperatures in Oslo, London and Paris.

List solution:

In [2]:
temps = [13, 15.4, 17.5]
# temps[0]: Oslo
# temps[1]: London
# temps[2]: Paris
print 'The temperature in Oslo is', temps[0]

We can look up a temperature only by mapping the city to an index and the index to a float.

It would be more natural to write `temps['Oslo']`!



## Dictionaries map strings to objects

In [3]:
# Initialize dictionary
temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5}

# Applications
print 'The temperature in London is', temps['London']
print 'The temperature in Oslo is',   temps['Oslo']

Important:

 * The string index, like `Oslo`, is called *key*, while `temps['Oslo']`
   is the associated *value*

 * A dictionary is an *unordered* collection of key-value pairs
   (from Python 3.7 on, a dictionary preserves the insertion order of its keys)



## Initializing dictionaries

Two ways of initializing a collection of key-value pairs:

In [4]:
# General form: mydict = {'key1': value1, 'key2': value2, ...}
temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5}

# or: mydict = dict(key1=value1, key2=value2, ...)
temps = dict(Oslo=13, London=15.4, Paris=17.5)

Add a new element to a dict (dict = dictionary):

In [5]:
temps['Madrid'] = 26.0
print temps

## Looping (iterating) over a dict means looping over the keys

In [6]:
for key in dictionary:
    value = dictionary[key]
    print value

Example:

In [7]:
for city in temps:
    print 'The %s temperature is %g' % (city, temps[city])

Note: the sequence of keys is arbitrary! Use `sorted` if you need a
particular sequence:

In [8]:
for city in sorted(temps):   # alphabetic sort of keys
    value = temps[city]
    print value

## Can test for particular keys, delete elements, etc

Does the dict have a particular key?

In [9]:
if 'Berlin' in temps:
    print 'Berlin:', temps['Berlin']
else:
    print 'No temperature data for Berlin'

In [10]:
'Oslo' in temps     # standard boolean expression

Delete an element of a dict:

In [11]:
del temps['Oslo']   # remove Oslo key w/value
temps

In [12]:
len(temps)          # no of key-value pairs in dict.

## The keys and values can be reached as lists

Python version 2:

In [13]:
temps.keys()

In [14]:
temps.values()

Python version 3: `temps.keys()` and `temps.values()` are *view objects*
that can be iterated over, not lists! Wrap them in `list(...)` to get a list.

In [15]:
for city in temps.keys():  # the loop itself works in Py 2 and 3
    print city

## Caution: two variables can alter the same dictionary

In [16]:
t1 = temps
t1['Stockholm'] = 10.0    # change t1
temps                     # temps is also changed!

In [17]:
t2 = temps.copy()         # take a copy
t2['Paris'] = 16
t1['Paris']               # t1 was not changed

Recall the same for lists:

In [18]:
L = [1, 2, 3]
M = L
M[1] = 8
L[1]

In [19]:
M = L[:]  # take copy of L
M[2] = 0
L[2]
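
The aliasing described above can be checked directly with the `is` operator. The following is a minimal sketch with made-up temperature values; `print(...)` with a single argument is used so that the code runs under both Python 2 and 3.

```python
temps = {'Oslo': 13, 'London': 15.4, 'Paris': 17.5}

t1 = temps            # t1 is just another name for the same dict object
t2 = temps.copy()     # t2 is a new dict with the same key-value pairs

t1['Stockholm'] = 10.0   # visible through temps, not through t2
t2['Paris'] = 16         # visible only through t2

print(t1 is temps)            # same object
print(t2 is temps)            # different object
print('Stockholm' in temps)   # key added via t1
print(temps['Paris'])         # unchanged by the edit of t2
```

Note that `copy` makes a shallow copy: values that are themselves lists or dictionaries are still shared between the original and the copy.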

## Any constant object can be used as key

 * So far: key is text (string object)

 * Keys can be any *immutable* (constant) object (!)

In [20]:
d = {1: 34, 2: 67, 3: 0}   # key is int
d = {13: 'Oslo', 15.4: 'London'} # possible
d = {(0,0): 4, (1,-1): 5}  # key is tuple
d = {[0,0]: 4, [-1,1]: 5}  # TypeError: a list is mutable and cannot be a key
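
What happens when a mutable object is used as key can be sketched as follows. The variable names are chosen for illustration only, and converting the list to a tuple is one possible workaround, not the only one.

```python
d = {}
d[(0, 0)] = 4          # a tuple is immutable and works as key

try:
    d[[0, 0]] = 4      # a list is mutable and is rejected
except TypeError as e:
    error_message = str(e)
    print('TypeError: %s' % error_message)

# Workaround: convert the list to a tuple before using it as key
point = [1, -1]
d[tuple(point)] = 5
print(sorted(d))
```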

## Example: Polynomials represented by dictionaries

The information in the polynomial

$$
p(x)=-1 + x^2 + 3x^7
$$

can be represented by a dict with power as key (`int`) and
coefficient as value (`float`):

In [21]:
p = {0: -1, 2: 1, 7: 3}

Evaluate such a polynomial $\sum_{i\in I} c_ix^i$ for some $x$:

In [22]:
def eval_poly_dict(poly, x):
    sum = 0.0
    for power in poly:
        sum += poly[power]*x**power
    return sum

Short pro version:

In [23]:
def eval_poly_dict2(poly, x):
    # Python's sum can add elements of an iterator
    return sum(poly[power]*x**power for power in poly)

## Polynomials can also be represented by lists

The list index corresponds to the power, e.g.,
the data in $-1 + x^2 + 3x^7$ is represented as

In [24]:
p = [-1, 0, 1, 0, 0, 0, 0, 3]

The general polynomial $\sum_{i=0}^N c_ix^i$ is stored as
`[c0, c1, c2, ..., cN]`.



Evaluate such a polynomial $\sum_{i=0}^N c_ix^i$ for some $x$:

In [25]:
def eval_poly_list(poly, x):
    sum = 0
    for power in range(len(poly)):
        sum += poly[power]*x**power
    return sum

## What is best for polynomials: lists or dictionaries?

Dictionaries need only store the nonzero terms. Compare
dict vs list for the polynomial $1 - x^{200}$:

In [26]:
p = {0: 1, 200: -1}         # len(p) is 2
p = [1] + [0]*199 + [-1]    # len(p) is 201

Dictionaries can easily handle negative powers, e.g., ${1\over2}x^{-3} + 2x^4$

In [27]:
p = {-3: 0.5, 4: 2}
print eval_poly_dict(p, x=4)
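
To see that the two representations carry the same information, the sketch below converts a coefficient list to a dictionary and evaluates both. The helper `list_to_dict` is a hypothetical name introduced here, and both evaluation functions are rewritten in the compact `sum` style so that the block is self-contained.

```python
def eval_poly_dict(poly, x):
    """Evaluate a polynomial stored as {power: coefficient}."""
    return sum(poly[power]*x**power for power in poly)

def eval_poly_list(poly, x):
    """Evaluate a polynomial stored as [c0, c1, ..., cN]."""
    return sum(poly[power]*x**power for power in range(len(poly)))

def list_to_dict(poly_list):
    """Keep only the nonzero coefficients (hypothetical helper)."""
    return {power: c for power, c in enumerate(poly_list) if c != 0}

p_list = [-1, 0, 1, 0, 0, 0, 0, 3]     # -1 + x^2 + 3x^7
p_dict = list_to_dict(p_list)

print(p_dict == {0: -1, 2: 1, 7: 3})
print(eval_poly_list(p_list, 2) == eval_poly_dict(p_dict, 2))
```

For $x=2$ both calls compute $-1 + 4 + 3\cdot 128 = 387$.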

## Quick recap of file reading

In [28]:
infile  = open(filename, 'r') # open file for reading

line    = infile.readline()   # read the next line
filestr = infile.read()       # read rest of file into string
lines   = infile.readlines()  # read rest of file into list
for line in infile:           # read rest of file line by line
    pass                      # (process line here)

infile.close()                # remember to close!
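
An alternative to calling `close` explicitly is the `with` statement, which closes the file automatically. The sketch below is self-contained: it first writes a small temporary file (file name and content are assumptions made for the example) and then reads it back.

```python
import os
import tempfile

# Write a small sample file so the example is self-contained
fd, filename = tempfile.mkstemp(suffix='.dat')
with os.fdopen(fd, 'w') as outfile:
    outfile.write('line 1\nline 2\nline 3\nline 4\n')

with open(filename, 'r') as infile:
    first = infile.readline()      # read the first line only
    rest = []
    for line in infile:            # continues after the first line
        rest.append(line.strip())
# infile is closed here, even if an error occurred inside the block

os.remove(filename)
print(first.strip())
print(rest)
```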

## Example: Read file data into a dictionary

**Data file:**

In [29]:
Oslo:          21.8
London:        18.1
Berlin:        19
Paris:         23
Rome:          26
Helsinki:      17.8

Store in dict, with city names as keys and temperatures as values



**Program:**

In [30]:
infile = open('deg2.dat', 'r')
temps = {}                  # start with empty dict
for line in infile:
    city, temp = line.split()
    city = city[:-1]        # remove last char (:)
    temps[city] = float(temp)
infile.close()

## A tabular file can be read into a nested dictionary

**Data file `table.dat`:**

In [31]:
       A        B       C      D
1     11.7    0.035    2017    99.1
2      9.2    0.037    2019   101.2
3     12.2     no       no    105.2
4     10.1    0.031     no    102.1
5      9.1    0.033    2009   103.3
6      8.7    0.036    2015   101.9

Create a dict `data[p][i]` (dict of dict) to hold measurement no. `i`
(`1`, `2`, etc.) of property `p` (`'A'`, `'B'`, etc.)



## We must first develop the plan (algorithm) for doing this

 1. Examine the first line:

  1. split it into words

  2. initialize a dictionary with the property names as keys
     and empty dictionaries `{}` as values


 2. For each of the remaining lines:

  1. split line into words

  2. for each word after the first: if word is not `no`,
     convert to float and store


Good exercise: do this now!
(See the book for a complete implementation.)
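
One possible way to carry out the plan above is sketched here. It is not the book's implementation; to stay self-contained it reads the table from a string instead of from `table.dat`.

```python
table = """\
       A        B       C      D
1     11.7    0.035    2017    99.1
2      9.2    0.037    2019   101.2
3     12.2     no       no    105.2
4     10.1    0.031     no    102.1
5      9.1    0.033    2009   103.3
6      8.7    0.036    2015   101.9
"""

lines = table.splitlines()

# First line: property names become keys, each with an empty dict
properties = lines[0].split()
data = {p: {} for p in properties}

# Remaining lines: first word is the measurement number
for line in lines[1:]:
    words = line.split()
    i = int(words[0])
    for p, word in zip(properties, words[1:]):
        if word != 'no':
            data[p][i] = float(word)

print(data['A'][3])
print(3 in data['B'])
print(sorted(data['C']))
```

Missing measurements (`no`) are simply left out, so `3 in data['B']` tells whether measurement no. 3 of property `B` exists.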



## Example: Download data from the web and visualize

**Problem:**

  * Compare the stock prices of Microsoft, Apple, and Google over decades

  * <http://finance.yahoo.com/> offers such data in files with tabular form

        Date,Open,High,Low,Close,Volume,Adj Close
        2014-02-03,502.61,551.19,499.30,545.99,12244400,545.99
        2014-01-02,555.68,560.20,493.55,500.60,15698500,497.62
        2013-12-02,558.00,575.14,538.80,561.02,12382100,557.68
        2013-11-01,524.02,558.33,512.38,556.07,9898700,552.76
        2013-10-01,478.45,539.25,478.28,522.70,12598400,516.57
        ...
        1984-10-01,25.00,27.37,22.50,24.87,5654600,2.73
        1984-09-07,26.50,29.00,24.62,25.12,5328800,2.76


## We need to analyze the file format to find the algorithm for interpreting the content

        Date,Open,High,Low,Close,Volume,Adj Close
        2014-02-03,502.61,551.19,499.30,545.99,12244400,545.99
        2014-01-02,555.68,560.20,493.55,500.60,15698500,497.62
        2013-12-02,558.00,575.14,538.80,561.02,12382100,557.68
        2013-11-01,524.02,558.33,512.38,556.07,9898700,552.76
        2013-10-01,478.45,539.25,478.28,522.70,12598400,516.57
        ...
        1984-10-01,25.00,27.37,22.50,24.87,5654600,2.73
        1984-09-07,26.50,29.00,24.62,25.12,5328800,2.76


File format:

  * Columns are separated by comma

  * First column is the date, the final is the price of interest

  * The prices start at different dates



## We need algorithms before we can write code

**Algorithm for reading data:**

 1. skip first line

 2. read line by line

 3. split each line wrt. comma

 4. store first word (date) in a list of dates

 5. store final word (price) in a list of prices

 6. collect date and price list in a dictionary (key is company)

 7. make a function for reading one company's file
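
The reading algorithm above might be sketched as follows. This is an illustration under assumptions, not the book's code: the function name `read_prices`, the key `'Company X'`, and the three sample lines held in a string are chosen for the example, and the lists are reversed because the file lists the newest date first.

```python
def read_prices(lines):
    """Return (dates, prices) from lines in the Yahoo-style format."""
    dates = []
    prices = []
    for line in lines[1:]:               # skip the header line
        words = line.strip().split(',')
        if len(words) < 2:               # ignore blank lines
            continue
        dates.append(words[0])           # first column: date
        prices.append(float(words[-1]))  # last column: Adj Close
    # The file lists the newest date first; reverse to get increasing time
    dates.reverse()
    prices.reverse()
    return dates, prices

sample = """\
Date,Open,High,Low,Close,Volume,Adj Close
2014-02-03,502.61,551.19,499.30,545.99,12244400,545.99
2014-01-02,555.68,560.20,493.55,500.60,15698500,497.62
2013-12-02,558.00,575.14,538.80,561.02,12382100,557.68
""".splitlines()

# Key is the company name; 'Company X' is only a label for this sample
stocks = {}
dates, prices = read_prices(sample)
stocks['Company X'] = {'dates': dates, 'prices': prices}
print(stocks['Company X']['dates'][0])
print(stocks['Company X']['prices'][-1])
```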



**Plotting:**

 1. Convert year-month-day time specifications in strings
    into year coordinates along the x axis

 2. Note that the companies' price history starts at different years
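
The conversion of a date string to a coordinate along the x axis could, for instance, be done with the standard `datetime` module. The function name `date2year` is an assumption for this sketch, and measuring the fraction of the year in seconds is just one reasonable choice.

```python
from datetime import datetime

def date2year(date_string):
    """Convert 'YYYY-MM-DD' to a decimal year, e.g. 2014.09."""
    d = datetime.strptime(date_string, '%Y-%m-%d')
    start = datetime(d.year, 1, 1)
    end = datetime(d.year + 1, 1, 1)
    # Fraction of the year that has passed at the given date
    fraction = (d - start).total_seconds() / (end - start).total_seconds()
    return d.year + fraction

print('%.3f' % date2year('2014-01-01'))
print('%.3f' % date2year('1984-09-07'))
```

Since each company gets its own list of year coordinates, price histories that start in different years can be drawn in the same plot without further adjustment.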



## No code is presented here...

See the book for all details. If you understand this quite comprehensive
example, you know and understand a lot!



## Plot of normalized stock prices in logarithmic scale

Much computer history in this plot:

<!-- dom:FIGURE: [fig-files/stockprices1.png, width=700 frac=1] -->
<!-- begin figure -->

<p></p>
<img src="fig-files/stockprices1.png" width=700>

<!-- end figure -->




# String manipulation

In [32]:
s = 'This is a string'
s.split()

In [33]:
'This' in s

In [34]:
s.find('is')

In [35]:
', '.join(s.split())

<!-- dom:FIGURE: [fig-files/string_manipulation.jpg, width=400 frac=0.8] -->
<!-- begin figure -->

<p></p>
<img src="fig-files/string_manipulation.jpg" width=400>

<!-- end figure -->


## String manipulation is key to interpreting the content of files

  * Text in Python is represented as strings

  * Inspecting and manipulating strings is the way we can understand the contents of files

  * Plan: first show basic operations, then address real examples

Sample string used for illustrations: `s = 'Berlin: 18.4 C at 4 pm'`

Strings behave much like lists/tuples - they are a sequence of characters:

In [36]:
s[0]

In [37]:
s[1]

In [38]:
s[-1]

## Extracting substrings

Substrings are extracted just like slices of lists and arrays:

In [39]:
s

In [40]:
s[8:]     # from index 8 to the end of the string

In [41]:
s[8:12]   # index 8, 9, 10 and 11 (not 12!)

In [42]:
s[8:-1]

In [43]:
s[8:-8]

Find start of substring:

In [44]:
s.find('Berlin')  # where does 'Berlin' start?

In [45]:
s.find('pm')

In [46]:
s.find('Oslo')    # not found

## Checking if a substring is contained in a string

In [47]:
'Berlin' in s

In [48]:
'Oslo' in s

In [49]:
if 'C' in s:
    print 'C found'
else:
    print 'no C'

## Substituting a substring by another string

`s.replace(s1, s2)`: replace `s1` by `s2`

In [50]:
s.replace(' ', '__')

In [51]:
s.replace('Berlin', 'Bonn')

Example: replace the text before the first colon by `'Bonn'`

In [52]:
s

In [53]:
s.replace(s[:s.find(':')], 'Bonn')

1) `s.find(':')` returns 6, 2) `s[:6]` is `'Berlin'`, 3) `'Berlin'`
is replaced by `'Bonn'`



## Splitting a string into a list of substrings

`s.split(sep)`: split `s` into a list of substrings separated by `sep`
(no separator implies split wrt whitespace):

In [54]:
s

In [55]:
s.split(':')

In [56]:
s.split()

Try to understand this one:

In [57]:
s.split(':')[1].split()[0]

In [58]:
deg = float(_)  # _ represents the last result
deg
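
The chained expression is easier to follow when each step gets its own variable. The intermediate names below are introduced only for this illustration.

```python
s = 'Berlin: 18.4 C at 4 pm'

parts = s.split(':')         # ['Berlin', ' 18.4 C at 4 pm']
after_colon = parts[1]       # ' 18.4 C at 4 pm'
words = after_colon.split()  # ['18.4', 'C', 'at', '4', 'pm']
first_word = words[0]        # '18.4'
deg = float(first_word)      # 18.4

print(deg)
```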

## Splitting a string into lines

  * Very often, a string contains lots of text and we want to split the text into separate lines

  * Lines may be separated by different control characters on different platforms: `\n` on Unix/Linux/Mac, `\r\n` on Windows

In [59]:
t = '1st line\n2nd line\n3rd line'     # Unix line endings
print t

In [60]:
t.split('\n')

In [61]:
t.splitlines()

In [62]:
t = '1st line\r\n2nd line\r\n3rd line' # Windows
t.split('\n')

In [63]:
t.splitlines()                         # cross platform!
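
The difference between the two methods shows up in the results. A small sketch, with the variable names `unix` and `windows` chosen for the example:

```python
unix = '1st line\n2nd line\n3rd line'
windows = '1st line\r\n2nd line\r\n3rd line'

# split('\n') leaves a trailing '\r' on lines with Windows line endings
print(windows.split('\n'))

# splitlines() removes both kinds of line endings
print(unix.splitlines() == windows.splitlines())
```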

## Strings are constant - immutable - objects

You cannot change a string in-place (as you can with lists and arrays); every change to a string results in a new string.

In [64]:
s[18] = 5  # TypeError: strings are immutable

In [65]:
# build a new string by adding pieces of s:
s2 = s[:18] + '5' + s[19:]
s2

## Stripping off leading/trailing whitespace

In [66]:
s = '   text with leading/trailing space   \n'
s.strip()

In [67]:
s.lstrip()   # left strip

In [68]:
s.rstrip()   # right strip

## Some convenient string functions

In [69]:
'214'.isdigit()

In [70]:
'  214 '.isdigit()

In [71]:
'2.14'.isdigit()

In [72]:
s = 'Berlin: 18.4 C at 4 pm'   # back to the sample string
s.lower()

In [73]:
s.upper()

In [74]:
s.startswith('Berlin')

In [75]:
s.endswith('am')

In [76]:
'    '.isspace()   # blanks

In [77]:
'  \n'.isspace()   # newline

In [78]:
'  \t '.isspace()  # TAB

In [79]:
''.isspace()       # empty string

## Joining a list of substrings to a new string

We can put strings together with a delimiter in between:

In [80]:
strings = ['Newton', 'Secant', 'Bisection']
', '.join(strings)

These are inverse operations:

In [81]:
t = delimiter.join(stringlist)
stringlist = t.split(delimiter)
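
The round trip can be verified with concrete values, as in this small sketch. It also shows a caveat: the operations are inverse only as long as no element contains the delimiter itself.

```python
stringlist = ['Newton', 'Secant', 'Bisection']
delimiter = ', '

t = delimiter.join(stringlist)      # 'Newton, Secant, Bisection'
print(t.split(delimiter) == stringlist)

# Caveat: the round trip fails if an element contains the delimiter
tricky = ['a, b', 'c']
print(delimiter.join(tricky).split(delimiter))   # three elements, not two
```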

Split off the first two words on a line:

In [82]:
line = 'This is a line of words separated by space'
words = line.split()
line2 = ' '.join(words[2:])
line2

## Example: Read pairs of numbers (x,y) from a file

**Sample file:**

In [83]:
(1.3,0)    (-1,2)    (3,-1.5)
(0,1)      (1,0)     (1,1)
(0,-0.01)  (10.5,-1) (2.5,-2.5)

**Algorithm:**


1. Read line by line

2. For each line, split line into words

3. For each word, strip off the parentheses
   and split the rest wrt comma



## The code for reading pairs

In [84]:
lines = open('read_pairs.dat', 'r').readlines()

pairs = []   # list of (n1, n2) pairs of numbers
for line in lines:
    words = line.split()
    for word in words:
        word = word[1:-1]  # strip off parentheses
        n1, n2 = word.split(',')
        n1 = float(n1);  n2 = float(n2)
        pair = (n1, n2)
        pairs.append(pair)

## Output of a pretty print of the pairs list

In [85]:
[(1.3, 0.0),
 (-1.0, 2.0),
 (3.0, -1.5),
 (0.0, 1.0),
 (1.0, 0.0),
 (1.0, 1.0),
 (0.0, -0.01),
 (10.5, -1.0),
 (2.5, -2.5)]

## Alternative solution: Python syntax in file format

Suppose the file format

In [86]:
(1.3, 0)    (-1, 2)    (3, -1.5)
...

was slightly different:

In [87]:
[(1.3, 0),    (-1, 2),    (3, -1.5),
...
]

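Running `eval` on the perturbed format produces the desired list, so we can read the file, perturb the text with `replace`, and apply `eval` (only to files we trust, since `eval` executes arbitrary Python code):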

In [88]:
text = open('read_pairs2.dat', 'r').read()
text = '[' + text.replace(')', '),') + ']'
pairs = eval(text)
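
A more restrictive alternative to `eval` is `ast.literal_eval` from the standard library, which accepts only Python literals and therefore cannot run arbitrary code. The sketch below applies the same string transformation to a sample text held in a string; the second line of the sample is an assumption, since the file content is only partly shown.

```python
import ast

text = '(1.3, 0)    (-1, 2)    (3, -1.5)\n(0, 1)      (1, 0)     (1, 1)\n'

# Same string transformation as above: add commas and enclosing brackets
text = '[' + text.replace(')', '),') + ']'

# literal_eval accepts only Python literals (numbers, tuples, lists, ...)
pairs = ast.literal_eval(text)
print(pairs)
```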

## Web pages are nothing but text files

The text is a mix of HTML commands and the text displayed in
the browser:

        <html>
        <body bgcolor="orange">
        <h1>A Very Simple Web Page</h1> <!-- headline -->
        Ordinary text is written as ordinary text, but when we
        need headlines, lists,
        <ul>
        <li><em>emphasized words</em>, or
        <li> <b>boldfaced words</b>,
        </ul>
        we need to embed the text inside HTML tags. We can also
        insert GIF or PNG images, taken from other Internet sites,
        if desired.
        <hr> <!-- horizontal line -->
        <img src="http://www.simula.no/simula_logo.gif">
        </body>
        </html>


## The web page generated by HTML code from the previous slide

<!-- dom:FIGURE: [fig-files/simple_webpage.png, width=600 frac=1.0] -->
<!-- begin figure -->

<p></p>
<img src="fig-files/simple_webpage.png" width=600>

<!-- end figure -->


## Programs can extract data from web pages

 * A program can download a web page, as an HTML file, and extract data by interpreting the text in the file (using string operations).

 * Example: [climate data from the UK](http://www.metoffice.gov.uk/climate/uk/stationdata/)

Download `oxforddata.txt` to a local file `Oxford.txt`:

In [89]:
import urllib   # Python 2; in Python 3 the function is urllib.request.urlretrieve
baseurl = 'http://www.metoffice.gov.uk/climate/uk/stationdata'
filename = 'oxforddata.txt'
url = baseurl + '/' + filename
urllib.urlretrieve(url, filename='Oxford.txt')

## The structure of the Oxford.txt weather data file

In [90]:
Oxford
Location: 4509E 2072N, 63 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by  ---.
Sunshine data taken from an automatic ...
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1853   1    8.4     2.7       4    62.8     ---
   1853   2    3.2    -1.8      19    29.3     ---
   1853   3    7.7    -0.6      20    25.9     ---
   1853   4   12.6     4.5       0    60.1     ---
   1853   5   16.8     6.1       0    59.5     ---

...

   2010   5   17.6     7.3       0    28.6   207.4
   2010   6   23.0    11.1       0    34.5   230.5
   2010   7   23.3*   14.1*      0*   24.4*  184.4*  Provisional
   2010  10   14.6     7.4       2    43.5   128.8   Provisional

## Reading the climate data

**Algorithm:**

 1. Read the place and location in the file header

 2. Skip the next 5 (for us uninteresting) lines

 3. Read the column data and store in dictionary

 4. Test for numbers with special annotation, "provisional" column, etc.



**Program, part 1:**

In [91]:
local_file = 'Oxford.txt'
infile = open(local_file, 'r')
data = {}
data['place'] = infile.readline().strip()
data['location'] = infile.readline().strip()
# Skip the next 5 lines
for i in range(5):
    infile.readline()

## Reading the climate data - program, part 2

**Program, part 2:**

In [92]:
data['data'] = {}
for line in infile:
    columns = line.split()

    year = int(columns[0])
    month = int(columns[1])

    if columns[-1] == 'Provisional':
        del columns[-1]
    for i in range(2, len(columns)):
        if columns[i] == '---':
            columns[i] = None
        elif columns[i][-1] == '*' or columns[i][-1] == '#':
            # Strip off trailing character
            columns[i] = float(columns[i][:-1])
        else:
            columns[i] = float(columns[i])

## Reading the climate data - program, part 3

**Program, part 3:**

In [93]:
for line in infile:
    # ... (parse columns as in part 2)
    tmax, tmin, air_frost, rain, sun = columns[2:]

    if year not in data['data']:
        data['data'][year] = {}
    data['data'][year][month] = {'tmax': tmax,
                                 'tmin': tmin,
                                 'air frost': air_frost,
                                 'rain': rain,
                                 'sun': sun}
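
The three parts can be combined into one runnable sketch. To avoid depending on a downloaded file, it parses a shortened sample of the data held in a string; the helper `to_number` and the use of `setdefault` and `zip` are choices made for this illustration, not the book's code. All five data columns are stored.

```python
sample = """\
Oxford
Location: 4509E 2072N, 63 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by  ---.
Sunshine data taken from an automatic ...
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1853   1    8.4     2.7       4    62.8     ---
   2010   6   23.0    11.1       0    34.5   230.5
   2010   7   23.3*   14.1*      0*   24.4*  184.4*  Provisional
"""

def to_number(word):
    """Convert one column entry to float, or None if missing."""
    if word == '---':
        return None
    return float(word.rstrip('*#'))   # drop trailing annotation

lines = sample.splitlines()
data = {'place': lines[0].strip(),
        'location': lines[1].strip(),
        'data': {}}

names = ['tmax', 'tmin', 'air frost', 'rain', 'sun']
for line in lines[7:]:                # 2 header lines + 5 skipped lines
    columns = line.split()
    if not columns:
        continue
    if columns[-1] == 'Provisional':
        del columns[-1]
    year, month = int(columns[0]), int(columns[1])
    values = [to_number(word) for word in columns[2:]]
    data['data'].setdefault(year, {})[month] = dict(zip(names, values))

print(data['place'])
print(data['data'][2010][7]['tmax'])
print(data['data'][1853][1]['sun'])
```

A value is then looked up as `data['data'][year][month][name]`, and missing measurements appear as `None`.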

## Summary of dictionary functionality

<table border="1">
<thead>
<tr><th align="center">            Construction            </th> <th align="center">                 Meaning                  </th> </tr>
</thead>
<tbody>
<tr><td align="left">   <code>a = {}</code>                                </td> <td align="left">   initialize an empty dictionary                </td> </tr>
<tr><td align="left">   <code>a = {'point': [0,0.1], 'value': 7}</code>    </td> <td align="left">   initialize a dictionary                       </td> </tr>
<tr><td align="left">   <code>a = dict(point=[2,7], value=3)</code>        </td> <td align="left">   initialize a dictionary w/string keys         </td> </tr>
<tr><td align="left">   <code>a.update(b)</code>                           </td> <td align="left">   add/update key-value pairs from <code>b</code> in <code>a</code>    </td> </tr>
<tr><td align="left">   <code>a.update(key1=value1, key2=value2)</code>    </td> <td align="left">   add/update key-value pairs in <code>a</code>             </td> </tr>
<tr><td align="left">   <code>a['hide'] = True</code>                      </td> <td align="left">   add new key-value pair to <code>a</code>                 </td> </tr>
<tr><td align="left">   <code>a['point']</code>                            </td> <td align="left">   get value corresponding to key <code>point</code>        </td> </tr>
<tr><td align="left">   <code>for key in a:</code>                         </td> <td align="left">   loop over keys in unknown order               </td> </tr>
<tr><td align="left">   <code>for key in sorted(a):</code>                 </td> <td align="left">   loop over keys in alphabetic order            </td> </tr>
<tr><td align="left">   <code>'value' in a</code>                          </td> <td align="left">   <code>True</code> if string <code>value</code> is a key in <code>a</code>      </td> </tr>
<tr><td align="left">   <code>del a['point']</code>                        </td> <td align="left">   delete a key-value pair from <code>a</code>              </td> </tr>
<tr><td align="left">   <code>list(a.keys())</code>                        </td> <td align="left">   list of keys                                  </td> </tr>
<tr><td align="left">   <code>list(a.values())</code>                      </td> <td align="left">   list of values                                </td> </tr>
<tr><td align="left">   <code>len(a)</code>                                </td> <td align="left">   number of key-value pairs in <code>a</code>              </td> </tr>
<tr><td align="left">   <code>isinstance(a, dict)</code>                   </td> <td align="left">   is <code>True</code> if <code>a</code> is a dictionary              </td> </tr>
</tbody>
</table>

## Summary of some string operations

In [94]:
s = 'Berlin: 18.4 C at 4 pm'
s[8:17]          # extract substring
s.find(':')      # index where first ':' is found
s.split(':')     # split into substrings
s.split()        # split wrt whitespace
'Berlin' in s    # test if substring is in s
s.replace('18.4', '20')
s.lower()        # lower case letters only
s.upper()        # upper case letters only
s.split()[4].isdigit()
s.strip()        # remove leading/trailing blanks
', '.join(list_of_words)