# Practice exercises \#7 with answers

For the following exercises, you will use the text of Oliver Twist, which you can download from https://gutenberg.org/cache/epub/730/pg730.txt and then load as a file.

**Note:** For this set of exercises, you should remove all the punctuation prior to processing the text. To do this, you can use the following, where *text* is the variable holding the string from which you want to remove the punctuation:

```
import string

translator = str.maketrans('', '', string.punctuation)
text = text.translate(translator)
```
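
Strings are immutable, so `translate` returns a new string rather than changing *text* in place. The following is a minimal sketch on a made-up sentence; the names `sample` and `cleaned` are illustrative only. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes, if the file contains any, are kept.

```python
import string

# Translation table that deletes every ASCII punctuation character.
translator = str.maketrans('', '', string.punctuation)

sample = "Please, sir, I want some more."
cleaned = sample.translate(translator)  # returns a new string

print(cleaned)  # Please sir I want some more
print(sample)   # the original string is unchanged
```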

The following questions refer to the text of this book:

1. Count the number of words (of any length) that only contain letters within the a-g range in the alphabet.

In [1]:
import string

def is_ag(word):
    for letter in word:
        if letter < 'a' or letter > 'g':
            return False

    return True

translator = str.maketrans('', '', string.punctuation)
ccount = 0
tcount = 0
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        words = [w.translate(translator) for w in line.split()]
        
        for word in words:
            tcount += 1
            ccount += 1 if is_ag(word) else 0
            
print(ccount, 'out of', tcount, 'words match this criterion.')

4827 out of 161005 words match this criterion.


2. What is the most frequent of all the words that match the above criterion of only using letters within the range a-g?

In [2]:
import string

def is_ag(word):
    for letter in word:
        if letter < 'a' or letter > 'g':
            return False

    return True

translator = str.maketrans('', '', string.punctuation)
wordfreqs = {}
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        words = [w.translate(translator) for w in line.split()]
        
        for word in words:
            if is_ag(word):
                wordfreqs[word] = 1 + wordfreqs.get(word, 0)

topword = ''
topfreq = 0
for word, freq in wordfreqs.items():
    if freq > topfreq:
        topword = word
        topfreq = freq
    
print('The most frequent a-g word is', topword, 'with', topfreq, 'occurrences.')

# Alternatively also the following gets the key with max value: max(wordfreqs, key=wordfreqs.get)

The most frequent a-g word is a with 3605 occurrences.
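
As a further alternative, `collections.Counter` from the standard library can handle both the tallying and the lookup of the top entry. This is a small sketch on a made-up word list, not on the book itself:

```python
from collections import Counter

def is_ag(word):
    # True when every character lies in the range 'a'..'g'
    # (also True for an empty string, like the loop version above).
    return all('a' <= letter <= 'g' for letter in word)

# Hypothetical sample; in the exercise the words come from the book.
words = ['a', 'bad', 'egg', 'a', 'face', 'oliver', 'bad', 'a']

wordfreqs = Counter(w for w in words if is_ag(w))
topword, topfreq = wordfreqs.most_common(1)[0]
print(topword, topfreq)  # a 3
```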


3. How many lines of the book contain at least one number (i.e. at least one digit)?

In [3]:
# Option 1:
def has_numbers(text):
    return any(letter.isdigit() for letter in text)

havenumbers = 0
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        havenumbers += 1 if has_numbers(line) else 0
            
print(havenumbers)

56


In [4]:
# Option 2:
def has_numbers(text):
    for letter in text:
        if letter.isdigit():
            return True

    return False

havenumbers = 0
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        if has_numbers(line):
            havenumbers += 1
            
print(havenumbers)

56


For the following exercise, you will use the text of Romeo and Juliet (in addition to the Oliver Twist book, which you've already downloaded). You can get the text of Romeo and Juliet from https://gutenberg.org/cache/epub/1513/pg1513.txt.

4. With these two books, after lowercasing and stripping punctuation, identify the number of words that are used in both books, as well as the number of words that are exclusive to each book. You should count each word only once, regardless of its number of occurrences within a book.

In [3]:
# option 1: recommended
otwords = []
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        otwords.extend([w.translate(translator) for w in line.split()])
        
rjwords = []
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        rjwords.extend([w.translate(translator) for w in line.split()])

commonwords = len(set(otwords) & set(rjwords))
print('Common words: ' + str(commonwords))

otexclusive = len(set(otwords) - set(rjwords))
print('Exclusive to OT: ' + str(otexclusive))

rjexclusive = len(set(rjwords) - set(otwords))
print('Exclusive to R&J: ' + str(rjexclusive))

Common words: 2795
Exclusive to OT: 11550
Exclusive to R&J: 2141


In [4]:
# option 2: VERY SLOW AND INEFFICIENT, NOT RECOMMENDED, AS IT IS MUCH FASTER USING SETS AS ABOVE
otwords = []
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        otwords.extend([w.translate(translator) for w in line.split()])
        
rjwords = []
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        rjwords.extend([w.translate(translator) for w in line.split()])
        
onlyot = []
for word in otwords:
    if word not in rjwords and word not in onlyot:
        onlyot.append(word)
        
print(len(onlyot))

onlyrj = []
for word in rjwords:
    if word not in otwords and word not in onlyrj:
        onlyrj.append(word)
        
print(len(onlyrj))

common = []
for word in rjwords:
    if word in otwords and word not in common:
        common.append(word)
        
print(len(common))

11550
2141
2795
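
The set-based option is faster because a membership test on a set takes roughly constant time, whereas `word not in rjwords` scans a list from the start each time. The sketch below applies the set operators to two made-up sentences, lowercasing each word before comparison as the question asks. The helper name `normalise` and the sample texts are assumptions introduced for illustration.

```python
import string

translator = str.maketrans('', '', string.punctuation)

def normalise(text):
    # Lowercase, strip ASCII punctuation, and drop tokens that end up empty.
    cleaned = (w.translate(translator).lower() for w in text.split())
    return {w for w in cleaned if w}

book_a = normalise("The Artful Dodger and Oliver, the orphan -- !")
book_b = normalise("Romeo and Juliet; the Nurse.")

print(sorted(book_a & book_b))  # words used in both texts
print(sorted(book_a - book_b))  # words exclusive to the first text
print(sorted(book_b - book_a))  # words exclusive to the second text
```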


For the following exercises, you will use a dataset comprising food hygiene ratings for over 600k businesses across the UK. This is a real dataset obtained from the UK's Food Standards Agency, where each business is rated with a score from 0 (urgent improvement necessary) to 5 (very good). The dataset is provided in JSON format, in a file called 'food-hygiene-ratings.json'. Each entry in the dataset describes a business and, in addition to the hygiene rating score, provides information on the business type, its address, its local authority, etc.

The following questions 5 and 6 are to be answered by using this dataset:

5. If we group businesses by their local authority area (the 'LocalAuthorityName' field), which area has the lowest average rating across all its businesses?

In [40]:
import json

arearatings = {}
avgratings = {}
with open('food-hygiene-ratings.json', 'r') as fh:
    for line in fh:
        business = json.loads(line)
        
        # Only numeric ratings count towards the average
        if business['RatingValue'].isnumeric():
            area = business['LocalAuthorityName']
            arearatings.setdefault(area, []).append(int(business['RatingValue']))
            
for area, ratings in arearatings.items():
    avgratings[area] = sum(ratings) / len(ratings)
    
minarea = min(avgratings, key=avgratings.get)

print('The local authority area with the lowest average rating is \'' + minarea + '\' with an average rating of ' + str(avgratings[minarea]))

The local authority area with the lowest average rating is 'Waltham Forest' with an average rating of 3.874604847207587
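
Since only the average is needed, one option is to keep a running total and a count per area rather than a list of every rating. This is a minimal sketch on a few made-up records: the area names and rating values are invented, and only the field names follow the dataset.

```python
from collections import defaultdict

# Hypothetical records standing in for parsed lines of the JSON file.
businesses = [
    {'LocalAuthorityName': 'Area A', 'RatingValue': '5'},
    {'LocalAuthorityName': 'Area A', 'RatingValue': '3'},
    {'LocalAuthorityName': 'Area B', 'RatingValue': '4'},
    {'LocalAuthorityName': 'Area B', 'RatingValue': 'Exempt'},
    {'LocalAuthorityName': 'Area B', 'RatingValue': '2'},
]

totals = defaultdict(int)  # sum of numeric ratings per area
counts = defaultdict(int)  # number of numeric ratings per area
for business in businesses:
    rating = business['RatingValue']
    if rating.isnumeric():  # skip non-numeric values
        area = business['LocalAuthorityName']
        totals[area] += int(rating)
        counts[area] += 1

avgratings = {area: totals[area] / counts[area] for area in totals}
minarea = min(avgratings, key=avgratings.get)
print(minarea, avgratings[minarea])  # Area B 3.0
```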


6. The 'RatingDate' field indicates the date on which the Food Standards Agency last inspected a business. Using the information in this field for all businesses, rank the months by the number of inspections the Food Standards Agency conducted in them (from most to least).

In [41]:
import json

months = {}
with open('food-hygiene-ratings.json', 'r') as fh:
    for line in fh:
        business = json.loads(line)
        
        if isinstance(business['RatingDate'], str):
            month = business['RatingDate'].split('-')[1]
            months[month] = 1 + months.get(month, 0)

months = {k: v for k, v in sorted(months.items(), key=lambda item: item[1], reverse=True)}

for month in months:
    print(month + ': ' + str(months[month]) + ' inspections.')

03: 61960 inspections.
02: 54813 inspections.
06: 52720 inspections.
05: 50352 inspections.
07: 50268 inspections.
01: 50238 inspections.
11: 47918 inspections.
10: 42278 inspections.
04: 40843 inspections.
09: 39172 inspections.
08: 38151 inspections.
12: 28710 inspections.
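
The same ranking can be sketched with `collections.Counter`, whose `most_common()` method returns entries sorted from most to least frequent. The dates below are made up and assume a `YYYY-MM-DD` layout, consistent with the `split('-')[1]` call above.

```python
from collections import Counter

# Hypothetical 'RatingDate' values; None stands for a missing date.
dates = ['2021-03-15', '2020-03-02', '2019-12-24', None, '2022-06-30', '2021-03-09']

months = Counter(
    date.split('-')[1]          # second field is the month
    for date in dates
    if isinstance(date, str)    # skip missing values
)

for month, count in months.most_common():
    print(month + ': ' + str(count) + ' inspections.')
```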


7. We have been tasked with simply counting the number of lines in each of our books (Romeo and Juliet and Oliver Twist). We have written the following code. We are getting seemingly correct answers, but is there anything that we could have done better?

In [1]:
# Original
content_ot = ''
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        content_ot += line
        
content_rj = ''
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        content_rj += line
        
print('Lines in Oliver Twist:', len(content_ot.split('\n')))

print('Lines in Romeo and Juliet:', len(content_rj.split('\n')))

Lines in Oliver Twist: 19188
Lines in Romeo and Juliet: 5649


In [2]:
# You may write your improved solution here
lines_ot = 0
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        lines_ot += 1
        
lines_rj = 0
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        lines_rj += 1
        
print('Lines in Oliver Twist:', lines_ot)

print('Lines in Romeo and Juliet:', lines_rj)

Lines in Oliver Twist: 19187
Lines in Romeo and Juliet: 5648


**Explanation:** the original solution gave an almost correct answer, off by just one line. Let's look at the main problem first, though.

In the original code, we were storing the entire content of the files into variables, to then count how many lines we had. This causes two problems:
* We are storing the whole content of the files into our computer's memory, when we are not interested in the content. When files are large, this will be problematic and we may run out of memory. As we are only interested in counting lines, there is no need to store the entire content.
* Storing the content then means we need a further step of splitting the content by line using .split() to count how many lines we have. This is an unnecessary step of processing a long string, especially problematic if the files are very large and the strings become very lengthy and difficult to process.

By simply adding 1 for every line we read, we get what we need and discard each line as soon as we have read it, so there is no need to store everything.
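
The counting loop can also be written as a single generator expression, which likewise never holds more than one line at a time. The sketch below uses `io.StringIO` as a stand-in for an open file so that it runs without the book; the helper name `count_lines` is illustrative.

```python
import io

def count_lines(fh):
    # Consume the file object line by line, keeping only the running count.
    return sum(1 for _ in fh)

# In-memory stand-in for open('files/pg730.txt', 'r').
fake_file = io.StringIO("first line\nsecond line\nthird line\n")
print(count_lines(fake_file))  # 3
```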

**Clarification on the marginal difference:** each file ends with a line break, so splitting the stored content on '\n' produces an extra empty string after the last line, which adds one to the count. This is a minor issue; the point of this exercise was to avoid storing all the content.
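
The off-by-one can be reproduced on a tiny string, assuming (as the counts above suggest) that the text ends with a line break:

```python
content = "line one\nline two\n"  # two lines, ending with a line break

print(content.split('\n'))        # ['line one', 'line two', '']
print(len(content.split('\n')))   # 3
print(len(content.splitlines()))  # 2
```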

8. We have been tasked with identifying, in both books, the lines where the word 'English' appears, and with counting the total number of words in those lines. We have written the following code and we are getting seemingly correct answers, but is there anything we could have done better?

In [5]:
# Original
content_ot = ''
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        content_ot += line
        
content_rj = ''
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        content_rj += line
        
wordcount_ot = 0
for line in content_ot.split('\n'):
    if 'English' in line:
        wordcount_ot += len(line.split(' '))
        
wordcount_rj = 0
for line in content_rj.split('\n'):
    if 'English' in line:
        wordcount_rj += len(line.split(' '))
        
print('Oliver Twist - Words in lines with the word \'English\':', wordcount_ot)
print('Romeo and Juliet - Words in lines with the word \'English\':', wordcount_rj)

Oliver Twist - Words in lines with the word 'English': 62
Romeo and Juliet - Words in lines with the word 'English': 2


In [6]:
# You may write your improved solution here
wordcount_ot = 0
with open('files/pg730.txt', 'r') as fh:
    for line in fh:
        if 'English' in line:
            wordcount_ot += len(line.split(' '))
        
wordcount_rj = 0
with open('files/pg1513.txt', 'r') as fh:
    for line in fh:
        if 'English' in line:
            wordcount_rj += len(line.split(' '))
    
print('Oliver Twist - Words in lines with the word \'English\':', wordcount_ot)
print('Romeo and Juliet - Words in lines with the word \'English\':', wordcount_rj)

Oliver Twist - Words in lines with the word 'English': 62
Romeo and Juliet - Words in lines with the word 'English': 2


**Explanation:** the problem in this case is very similar to the previous one. We stored the entire content of the books and only then checked which lines contain the word 'English'. There was no need to store any content: as we read through the lines, we can check whether each line contains the word 'English', do the necessary calculations, discard that line and continue with the remainder of the file.

Our files are relatively small in this case, but if we need to reuse our code with files that are a few GB in size, we will likely run out of memory.
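
One way to avoid repeating the same loop for each book is to wrap it in a small function that accepts any iterable of lines. This is a sketch under assumptions: the function name `words_in_matching_lines` is invented here, and it uses `split()` without arguments, which splits on any run of whitespace and so may give slightly different counts from `split(' ')`.

```python
import io

def words_in_matching_lines(lines, needle):
    # Stream over the lines; only the current line is held in memory.
    return sum(len(line.split()) for line in lines if needle in line)

# In-memory stand-in for an open book file.
sample = io.StringIO(
    "He spoke plain English to the boy.\n"
    "Nobody answered.\n"
    "The English weather was dreadful.\n"
)
print(words_in_matching_lines(sample, 'English'))  # 12
```

Because an open file object is itself an iterable of lines, the same function could be called as `words_in_matching_lines(fh, 'English')` inside a `with open(...) as fh:` block.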