# Text mining


## Notebook Content
The following examples summarize the idea from the lecture:

a. Express data as plain text. 

In [1]:
budget = "My airfare was 300.00. My hotel cost 200.00 for one night. My food cost 100.00."
budget

'My airfare was 300.00. My hotel cost 200.00 for one night. My food cost 100.00.'

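b. Split into words. `strip` removes leading and trailing whitespace, such as an unwanted \n at the end of a line. This string has none, so here it is only a safeguard.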

In [2]:
words = budget.strip().split(' ')  # strip leading/trailing whitespace (e.g. \n), then split on spaces
words

['My',
 'airfare',
 'was',
 '300.00.',
 'My',
 'hotel',
 'cost',
 '200.00',
 'for',
 'one',
 'night.',
 'My',
 'food',
 'cost',
 '100.00.']

c. Mine the numbers out of the text.

In [3]:
costs = []
for w in words: 
    try: 
        number = float(w)
        costs.append(number)
    except ValueError as e:  # float() raises ValueError for non-numeric text
        print(e)
costs

could not convert string to float: 'My'
could not convert string to float: 'airfare'
could not convert string to float: 'was'
could not convert string to float: '300.00.'
could not convert string to float: 'My'
could not convert string to float: 'hotel'
could not convert string to float: 'cost'
could not convert string to float: 'for'
could not convert string to float: 'one'
could not convert string to float: 'night.'
could not convert string to float: 'My'
could not convert string to float: 'food'
could not convert string to float: 'cost'
could not convert string to float: '100.00.'


[200.0]

1. **What's wrong with this? What does it miss?** 

___Your answer:___

Let's try again:

In [4]:
costs = []
for w in words:
    if w.endswith('.'): 
        w = w[:-1]
    try: 
        number = float(w)
        costs.append(number)
    except ValueError as e:  # float() raises ValueError for non-numeric text
        print(e)
costs

could not convert string to float: 'My'
could not convert string to float: 'airfare'
could not convert string to float: 'was'
could not convert string to float: 'My'
could not convert string to float: 'hotel'
could not convert string to float: 'cost'
could not convert string to float: 'for'
could not convert string to float: 'one'
could not convert string to float: 'night'
could not convert string to float: 'My'
could not convert string to float: 'food'
could not convert string to float: 'cost'


[300.0, 200.0, 100.0]

That's better. 
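The version above handles only a trailing period. As a sketch of one possible generalization (not the lecture's method), `str.rstrip` can remove any run of trailing punctuation characters; the particular set `'.,;:!?'` is an assumption chosen for illustration.

```python
words = ['My', 'airfare', 'was', '300.00.', 'My', 'hotel', 'cost', '200.00',
         'for', 'one', 'night.', 'My', 'food', 'cost', '100.00.']

costs = []
for w in words:
    cleaned = w.rstrip('.,;:!?')  # remove trailing punctuation, if any
    try:
        costs.append(float(cleaned))
    except ValueError:
        pass  # not a number; skip it without printing

print(costs)
```

Skipping silently keeps the output short; printing the exception, as in the cells above, is more useful while debugging.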

2. **What does w[:-1] actually do?** Look it up if necessary. 

___Your answer:___

d. Sum up the numbers. 

In [5]:
total = 0
for c in costs: 
    total += c
total   

600.0
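The loop above spells out the accumulation step by step. For comparison, Python's built-in `sum` does the same thing in one call; this sketch repeats the `costs` list so that it runs on its own.

```python
costs = [300.0, 200.0, 100.0]

total = sum(costs)  # adds the elements from left to right, starting at 0
print(total)
```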

# A challenge problem
How can we associate each number with its meaning?

3. Stare at the text. **Where is the word describing each cost in relation to the cost itself?** 

___Your answer:___ From the cost, look backward until we find the word "My", then take the word after it.

4. Can you express that word in terms of the list `words`? **If the cost is words[n], where is the needed word in relation to the cost?**

___Your answer:___ (1) Walk through the list `words`, storing in the variable `my_loc` each index where the word `My` occurs. `my_loc + 1` then gives the location of each description, which yields the list `description = ['airfare', 'hotel', 'food']`.
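Step (1) could be sketched as follows. This is one possible reading of the answer, in which `my_loc` is a list of indices rather than a single number; it assumes that every `My` is followed by at least one more word.

```python
words = ['My', 'airfare', 'was', '300.00.', 'My', 'hotel', 'cost', '200.00',
         'for', 'one', 'night.', 'My', 'food', 'cost', '100.00.']

# Indices at which the word 'My' occurs.
my_loc = [i for i, w in enumerate(words) if w == 'My']

# The description is the word right after each 'My'.
description = [words[i + 1] for i in my_loc]

print(my_loc)
print(description)
```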

5. Please use this to create a list of tuples describing each cost and its source. The output should be 
```
[('airfare', 300.0), ('hotel', 200.0), ('food', 100.0)]
```
___Your answer:___ (2) Use a list comprehension to arrive at the answer, something like 
```
expenses = [(description[i], costs[i]) for i in range(len(costs))]
```
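An index-based comprehension works, but pairing two lists element by element is what the built-in `zip` is for. The sketch below is an alternative to try, not the required answer; it repeats `description` and `costs` as literal lists so that it runs on its own.

```python
description = ['airfare', 'hotel', 'food']
costs = [300.0, 200.0, 100.0]

# zip pairs the i-th description with the i-th cost and stops at the shorter list.
expenses = list(zip(description, costs))
print(expenses)
```

Because the costs were produced by `float`, they appear as floats in the tuples.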

In [None]:
{ write your answer here , put result into variable expenses }

expenses  # print the result

# Afterword: text mining and scraping
Extracting data from text is a well-developed field. There are various tools we can study that do much more than this primitive setup. Among other things, they can parse the grammar of a text and use it to infer context. 
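As a small step in that direction, regular expressions from Python's standard `re` module can describe the shape of the text to look for. The pattern below is a sketch tailored to this one sentence shape ("My X was/cost N.NN"), which is an assumption; it does not parse grammar and would miss other phrasings.

```python
import re

budget = "My airfare was 300.00. My hotel cost 200.00 for one night. My food cost 100.00."

# Capture the word after 'My' and the decimal number after 'was' or 'cost'.
pattern = re.compile(r'My (\w+) (?:was|cost) (\d+\.\d+)')

expenses = [(what, float(amount)) for what, amount in pattern.findall(budget)]
print(expenses)
```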