kmp

Closely based on 

**Ben Lauwens & Allen Downey "Think Julia: How to Think Like a Computer Scientist"**

https://benlauwens.github.io/ThinkJulia.jl/latest/book.html

Resources:

Julia webpage https://julialang.org/ 

Julia documentation https://docs.julialang.org/en/v1/


## Chapter 09 -- Word processing

https://benlauwens.github.io/ThinkJulia.jl/latest/book.html#chap09

### Reading word lists

We need a list of English words. There are lots of word lists available on the Web, but the one most suitable for our purpose is one of the word lists collected and contributed to the public domain by Grady Ward as part of the Moby lexicon project (see https://wikipedia.org/wiki/Moby_Project). 

It is a list of 113,809 official crossword words; that is, words that are considered valid in crossword puzzles and other word games. In the Moby collection, the filename is 113809of.fic; you can download a copy, with the simpler name **words.txt**, from https://github.com/BenLauwens/ThinkJulia.jl/blob/master/data/words.txt.

This file is in plain text, so you can open it with a text editor, but you can also read it from Julia. The built-in function **open** takes the name of the file as a parameter and returns a file stream you can use to read the file.

```Julia
	# this works if words.txt is in the current working directory (see pwd());
	# otherwise pass the full path to the file
	
	julia> fin = open("words.txt") 
		IOStream(<file words.txt>)
```

**fin** is a **`file stream object`** used for input. When it is no longer needed, it should be closed with **close(fin)**.

Julia provides several functions for reading from the file stream, including **readline**, which reads characters from the file until it gets to a `NEWLINE` character **`'\n'`** and returns the result as a string:

```Julia
	julia> readline(fin)
		"aa"
```

The first word in this particular list is “aa”, which is a kind of lava. 

**The file stream object keeps track of where it is in the file**, so if you call readline again, it will read the next line and you get the next word:

```Julia
	julia> readline(fin)
		"aah"
```

The next word is “aah”. You can also use a file as part of a **`for loop`**. This program reads **`words.txt`** and prints each word, one per line:

```Julia
	for line in eachline("words.txt")
		println(line)
	end
```
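If you do need the stream itself, the do-block form of `open` closes the file for you when the block finishes. The following is a minimal sketch, not part of the book's text: the temporary file made with `mktemp` is only a hypothetical stand-in for words.txt so the example runs without the download, and `count_words` is a name introduced here for illustration.

```Julia
# Hypothetical stand-in for words.txt: a temporary file with three words.
path, io = mktemp()
write(io, "aa\naah\naahed\n")
close(io)

# Count the lines of a file; the do-block closes the stream when it finishes.
function count_words(path)
    open(path) do fin
        n = 0
        for line in eachline(fin)
            n += 1
        end
        n   # value of the do-block, returned by open
    end
end

println(count_words(path))   # 3

# readlines loads every line into an array of strings.
words = readlines(path)
println(words[2])            # aah
```

Reading with `eachline` keeps only one line in memory at a time, while `readlines` is convenient when the whole list is needed more than once.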

## Exercises

### Exercise 9-1

Write a program that reads `words.txt` and prints only the words with more than 20 characters (not counting whitespace).

In [7]:
filename = "E:\\aaa-Julia-course-2023\\lectures-1.9\\words.txt"

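function longWords(fin, len=20)
    for str in eachline(fin)
        # drop whitespace so it does not count toward the length
        str_no_wspace = filter(chr -> !isspace(chr), str)

        # length counts characters; sizeof would count bytes
        if length(str_no_wspace) > len
            println(str)
        end
    end
end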


fileIn = open(filename)
longWords(fileIn)
close(fileIn)

counterdemonstrations
hyperaggressivenesses
microminiaturizations


### Exercise 9-2

In 1939 Ernest Vincent Wright published a 50,000-word novel called Gadsby that does not contain the letter e. Since e is the most common letter in English, that’s not easy to do.

Write a function called hasno_e that returns true if the given word does not have the letter e in it.

Modify your program from the previous section to print only the words that have no e and compute the percentage of the words in the list that have no e.

In [8]:
function hasno_e(word)
    
    return !('e' in word)

end

hasno_e("abcd")


true

In [9]:
function no_e(fin)
    cnt = 0      # words without an e
    cnttot = 0   # all words

    for str in eachline(fin)
        cnttot += 1
        if hasno_e(str)
            cnt += 1
            #println(str)
        end
    end
    # fraction between 0 and 1; multiply by 100 for a percentage
    return cnt/cnttot
end

filename = "E:\\aaa-Julia-course-2023\\lectures-1.9\\words.txt"

begin
    fileIn = open(filename)
    percentage = no_e(fileIn)
    close(fileIn)
end

percentage

0.3307383423103621

### Exercise 9-3

Write a function named avoids that takes a word and a string of forbidden letters, and that returns true if the word does not use any of the forbidden letters.

Modify your program to prompt the user to enter a string of forbidden letters and then print the number of words that do not contain any of them. Can you find a combination of 5 forbidden letters that excludes the smallest number of words?

### Exercise 9-4

Write a function named usesonly that takes a word and a string of letters, and that returns true if the word contains only letters in the list. Can you make a sentence using only the letters acefhlo? Other than "Hoe alfalfa?"

In [10]:
function usesonly(word, legal)
    for letter in word
        if letter ∉ legal
            return false
        end
    end
    return true
end

legal = "acefhlo"
word = "alfgalfa"
usesonly(word, legal)

false

### Exercise 9-5

Write a function named usesall that takes a word and a string of required letters, and that returns true if the word uses all the required letters at least once. How many words are there that use all the vowels aeiou? How about aeiouy?

### Exercise 9-6

Write a function called isabecedarian that returns true if the letters in a word appear in alphabetical order (double letters are ok). How many abecedarian words are there?

In [11]:
function isabecedarian(word)
    i = firstindex(word)
    j = nextind(word, i)
    
    while j <= sizeof(word)
        if word[j] < word[i]
            return false
        end
        i = j
        j = nextind(word, i)

    end
    #@show j
    #@show sizeof(word)
    return  true
end

word = "abcdefghij"
isabecedarian(word)

true

In [12]:

function abecedaria(fin)
    cnt = 0

    for word in eachline(fin)
        if isabecedarian(word)
            cnt += 1
        end
    end
    return cnt
end

filename = "E:\\aaa-Julia-course-2023\\lectures-1.9\\words.txt"

begin
    fileIn = open(filename)
    nmbr = abecedaria(fileIn)
    close(fileIn)
end

nmbr

596

### Search

All of the exercises in the previous section have something in common; they can be solved with the **search pattern**. The simplest example is:

```Julia
	function hasno_e(word)
		for letter in word
			if letter == 'e'
				return false
			end
		end
		return true
	end
```

The for loop traverses the characters in word. If we find the letter e, we can immediately return false; otherwise we have to go to the next letter. If we exit the loop normally, that means we did not find an e, so we return true.

You could write this function more concisely using the **∉ (\notin<tab>) operator**. `avoids` is a more general version of `hasno_e` but it has the same structure:

```Julia
	function avoids(word, forbidden)
		for letter in word
			if letter ∈ forbidden
				return false
			end
		end
		return true
	end
```

We can return false as soon as we find a forbidden letter; if we get to the end of the loop, we return true. `usesonly` is similar except that the sense of the condition is reversed:

```Julia
	function usesonly(word, available)
		for letter in word
			if letter ∉ available
				return false
			end
		end
		true
	end
```

Instead of a string of forbidden letters, we have a string of available letters. If we find a letter in word that is not in available, we can return false. `usesall` is similar except that we reverse the role of the word and the string of letters:

```Julia
	function usesall(word, required)
		for letter in required
			if letter ∉ word
				return false
			end
		end
		true
	end
```

Instead of traversing the letters in word, the loop traverses the required letters. If any of the required letters do not appear in the word, we can return false.
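The same search pattern can also be expressed with the built-in functions `any` and `all`. The sketch below is one possible formulation, not the book's: the `_short` names are hypothetical, chosen only to avoid clashing with the loop versions above, and `in(collection)` builds a one-argument predicate that tests membership in `collection`.

```Julia
# Concise variants of the search pattern (names are illustrative only).
hasno_e_short(word) = 'e' ∉ word

# true if no letter of word is in forbidden
avoids_short(word, forbidden) = !any(in(forbidden), word)

# true if every letter of word is in available
usesonly_short(word, available) = all(in(available), word)

# true if every required letter is in word
usesall_short(word, required) = all(in(word), required)

println(hasno_e_short("lava"))                  # true
println(avoids_short("hello", "lmn"))           # false
println(usesonly_short("alfalfa", "acefhlo"))   # true
println(usesall_short("sequoia", "aeiou"))      # true
```

Like the explicit loops, `any` and `all` stop at the first letter that decides the answer.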

If you were really thinking like a computer scientist, you would have recognized that `usesall` was an instance of a previously solved problem, and you would have written:

```Julia
	function usesall(word, required)
		usesonly(required, word)
	end
```

This is an example of a program development plan called **reduction to a previously solved problem**, which means that you recognize the problem you are working on as an instance of a solved problem and apply an existing solution.
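As a quick check of the reduction, the following sketch repeats the loop-based `usesonly` and the reduced `usesall` so that it runs on its own, then tries two sample words. The sample words are assumptions chosen for illustration, not taken from words.txt.

```Julia
# Loop-based usesonly, as in the text.
function usesonly(word, available)
    for letter in word
        if letter ∉ available
            return false
        end
    end
    return true
end

# usesall reduced to usesonly by swapping the arguments:
# "word uses all required letters" means "required uses only letters of word".
usesall(word, required) = usesonly(required, word)

println(usesall("sequoia", "aeiou"))   # true
println(usesall("julia", "aeiou"))     # false (no e, no o)
```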

### Looping with indices

For `isabecedarian` we have to compare adjacent letters, which is a little tricky with a for loop:

```Julia
	function isabecedarian(word)
		i = firstindex(word)
		previous = word[i]
		j = nextind(word, i)
		for c in word[j:end]
			if c < previous
				return false
			end
			previous = c
		end
		true
	end
```

An alternative is to use recursion:

```Julia
	function isabecedarian(word)
		if length(word) <= 1
			return true
		end
		i = firstindex(word)
		j = nextind(word, i)
		if word[i] > word[j]
			return false
		end
		isabecedarian(word[j:end])
	end
```

Another option is to use a while loop:

```Julia
	function isabecedarian(word)
		i = firstindex(word)
		j = nextind(word, 1)
		while j <= sizeof(word)
			if word[j] < word[i]
				return false
			end
			i = j
			j = nextind(word, i)
		end
		true
	end
```

The loop starts at i = 1 and j = nextind(word, 1) and ends when j > sizeof(word). Each time through the loop, it compares the character at index i (which you can think of as the current character) to the character at index j (which you can think of as the next). If the next character is less than (alphabetically before) the current one, then we have discovered a break in the abecedarian trend, and we return false. If we get to the end of the loop without finding a fault, then the word passes the test.

To convince yourself that the loop ends correctly, consider an example like "flossy".
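One way to do that is to record the pairs of characters the loop compares. The helper below, `adjacent_pairs`, is a hypothetical name introduced only for this illustration; it follows the same index logic as the while loop but collects the pairs instead of comparing them, and it adds a guard for the empty string.

```Julia
# Collect the (current, next) pairs that the while loop would compare.
function adjacent_pairs(word)
    pairs = Tuple{Char,Char}[]
    isempty(word) && return pairs      # nothing to compare
    i = firstindex(word)
    j = nextind(word, i)
    while j <= sizeof(word)
        push!(pairs, (word[i], word[j]))
        i = j
        j = nextind(word, i)
    end
    return pairs
end

for (current, next) in adjacent_pairs("flossy")
    println(current, " ", next)
end
# f l
# l o
# o s
# s s
# s y
```

The last pair is ('s', 'y'): after that comparison, j moves past sizeof(word) and the loop stops. Because the indices advance with `nextind`, the same loop also works for words with multi-byte characters, where valid indices are not consecutive.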

Another way to write a traversal is a for loop that uses the string as an **iterable** datatype, so that no indices are needed.
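For `isabecedarian`, one possible sketch of this idea pairs each character with its successor by zipping the string with a copy of itself that skips the first character. The name `isabecedarian_zip` is hypothetical, and using `zip` with `Iterators.drop` is a choice made here for illustration, not the book's version.

```Julia
# Iterate over (current, next) character pairs without any index arithmetic.
function isabecedarian_zip(word)
    for (current, next) in zip(word, Iterators.drop(word, 1))
        if next < current
            return false
        end
    end
    return true
end

println(isabecedarian_zip("flossy"))   # true
println(isabecedarian_zip("banana"))   # false
println(isabecedarian_zip(""))         # true (no pairs to compare)
```

Since no indices are computed, the empty string needs no special handling in this version.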

Here is a version of `ispalindrome` that uses two indices; one starts at the beginning and goes up; the other starts at the end and goes down.

```Julia
	# this code does not cover all cases
	function ispalindrome(word)
		i = firstindex(word)
		j = lastindex(word)
		while i<j
			if word[i] != word[j]
				return false
			end
			i = nextind(word, i)
			j = prevind(word, j)
		end
		return true
	end
```

Or we could reduce to a previously solved problem:

In [13]:
function ispalindrome_1(word)
    lowercase(word) == lowercase(reverse(word))
end

word = "abctCBA"
ispalindrome_1(word)

true

In [14]:
# this code does not cover all cases
function ispalindrome_2(word)
    i = firstindex(word)
    j = lastindex(word)
    while i<j
        if word[i] != word[j]
            return false
        end
        i = nextind(word, i)
        j = prevind(word, j)
    end
    return true
end

ispalindrome_2(lowercase(word))

true

### Debugging

Testing programs is hard. The functions in this chapter are relatively easy to test because you can check the results by hand. Even so, it is somewhere between difficult and impossible to choose a set of words that test for all possible errors.

Taking hasno_e as an example, there are two obvious cases to check: words that have an e should return false, and words that do not should return true. You should have no trouble coming up with one of each. Within each case, there are some less obvious subcases. 

Among the words that have an “e”, you should test words with an “e” at the beginning, the end, and somewhere in the middle. You should test long words, short words, and very short words, like the empty string. 

The empty string is an example of a **special case**, which is one of the non-obvious cases where errors often lurk. In addition to the test cases you generate, you can also test your program with a word list like words.txt. 

By scanning the output, you might be able to catch errors, but be careful: you might catch one kind of error (words that should not be included, but are) and not another (words that should be included, but are not).
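The hand-picked cases described above can be written down so they are easy to rerun. The following is a minimal sketch using Julia's `Test` standard library; the concise definition of `hasno_e` and the sample words are assumptions chosen for illustration.

```Julia
using Test

hasno_e(word) = 'e' ∉ word

@testset "hasno_e" begin
    # words that contain an e: beginning, middle, end
    @test !hasno_e("echo")
    @test !hasno_e("when")
    @test !hasno_e("apple")

    # words without an e: long, short, very short
    @test hasno_e("floccinaucinihilipilification")
    @test hasno_e("lava")
    @test hasno_e("a")

    # special case: the empty string contains no e
    @test hasno_e("")
end
```

A passing test set only shows that these particular cases behave as expected; it does not rule out errors in cases that were not listed.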

In general, **testing code** can help you find bugs, but it is not easy to generate a good set of test cases, and even if you do, you cannot be sure your program is correct. According to Edsger W. Dijkstra, a legendary computer scientist: **Program testing can be used to show the presence of bugs, but never to show their absence.**