# Morphology tool
Use Whoosh to illustrate English morphology with examples from a given corpus. For instance, the rules of morphology dictate verbs can take different forms or be used to form nouns, adjectives and such, like:

- `render`: 'renders', 'rendered', 'rendering'
- `think`: 'thinks', 'thought', 'thinking', 'thinker', 'thinkers', 'thinkable'
- `put`: 'puts', 'putting', 'putter'
- `do`: 'does', 'did', 'done', 'doing', 'doings', 'doer', 'doers'

Forms like `think`, `put` and `do` illustrate that you cannot approach this problem in a mechanical or brute-force way. It is not as simple as adding 'ed', 'ing', etcetera to the verbs. Sometimes consonants are doubled, sometimes the verb stem changes (in the case of strong verbs), and so on.
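To see why plain suffixing falls short, here is a minimal sketch. `naive_forms` is a hypothetical helper written for illustration only (it is not part of Whoosh); it simply glues a few common endings onto the verb.

```python
def naive_forms(verb):
    """Glue common endings onto a verb without applying any spelling rules."""
    suffixes = ["s", "ed", "ing", "er"]
    return [verb + suffix for suffix in suffixes]


for verb in ["render", "put", "think"]:
    print(verb, "->", naive_forms(verb))
```

This happens to work for `render`, but for `put` it produces 'puted' and 'puting' (no consonant doubling), and for `think` it produces 'thinked' instead of 'thought'.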

Whoosh has a particular feature to deal with this. Look through the documentation and you'll find it easily.

Use it to build a Python application that takes a word as input and returns a list of sentences from the British fiction corpus that contain this word to illustrate its usage. Think about building the index first, so you can then reuse it (without having to rebuild it) for additional searches.

Also, try to display the results nicely, i.e. without the markup tags and whitespace (line breaks, etc.) we saw in the above example. Maybe you can even print the matched word in bold?
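One possible way to tidy up a result, sketched with the standard library only. It assumes the markup consists of HTML/XML-style `<...>` tags and that the output goes to a terminal or notebook that understands ANSI escape codes; `clean` and `embolden` are hypothetical helper names.

```python
import re

BOLD, RESET = "\033[1m", "\033[0m"  # ANSI escape codes for bold on / off


def clean(text):
    """Drop markup tags and collapse all whitespace runs to single spaces."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def embolden(sentence, word):
    """Wrap whole-word, case-insensitive matches of `word` in ANSI bold codes."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda match: BOLD + match.group(0) + RESET, sentence)


raw = "<p>He  was\n<i>thinking</i> of home.</p>"
print(embolden(clean(raw), "thinking"))
```

Replacing each tag with a space keeps neighbouring words from being fused together, at the cost of an occasional stray space before punctuation.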

Illustrate English morphology with examples from a corpus:
- the root of a word can take different forms, which are either conjugations of the same verb or derived nouns, adjectives, etc.
- Whoosh: can predict the variant forms of a given word
- take the corpus that we have used and make an application that accepts a word
- look up the results in the index that we have built with Whoosh

https://docs.red-dove.com/whoosh/stemming.html

In [1]:
import os.path
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in, open_dir, exists_in

schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True),
                path=ID(stored=True))

# Create the index only if it does not exist yet; calling create_in()
# on an existing index would wipe it, so later runs just reopen it.
if not os.path.exists("index"):
    os.mkdir("index")
if exists_in("index"):
    ix = open_dir("index")
else:
    ix = create_in("index", schema)

In [2]:
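# Uncomment to fill the index. Build the paths with os.path.join so the
# files are opened from the corpus directory on any platform.
#corpus_dir = os.path.join('InformationScience', 'course', 'data', 'corpus_of_british_fiction')
#writer = ix.writer()
#for document in os.listdir(corpus_dir):
    #with open(os.path.join(corpus_dir, document), 'r', encoding='utf-8') as text:
        #writer.add_document(title=document, content=text.read(),
                            #path=document)
#writer.commit()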

In [1]:
from whoosh.lang.porter import stem
stem('rendering')

'render'

In [2]:
from whoosh.lang.morph_en import variations
# variations("rendered")

In [None]:
def query_search():
    query = []
    while True:
        look_up = input("Look up a word or type q to quit.   ").strip()
        # Compare with == so that words containing a 'q' (e.g. 'quick') are kept.
        if look_up == 'q':
            break
        query.append(look_up)
    # After quitting, print the variant forms of every word that was entered.
    for word in query:
        print(word, "-", variations(word))

In [None]:
def split(to_split):
    return to_split.split(' ')

def make_word_index(corpus):
    index = {}
    for title in corpus:
        words = split(title)
        for word in words:
            index.setdefault(word,[])
            index[word].append(title)
    return index

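def search_with_index(index, word):
    # Assumed completion of the unfinished stub: look the word up in the
    # dictionary built by make_word_index and return the matching titles.
    return index.get(word, [])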

`if __name__ == "__main__"`: what does this do?

Imagine you're writing a file (script) named `test.py` in which you define a function named test:

```python
def test():
    return "hello world"

print(test())
# prints: hello world
```

If you then make a second file and want to use the code that you wrote in the first file:

```python
import test
print(test.test())
# prints:
# hello world
# hello world
```
`hello world` is printed twice. Importing `test` executes the whole file, including the `print(test())` at the bottom, and then your own `print(test.test())` prints it a second time. This is something that you don't want:
- you want to be able to call those functions from your own file without running the rest of the imported script
- you do this by putting the script's top-level code under `if __name__ == "__main__":`; that code is then skipped on import, so 'hello world' is printed only once
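A minimal sketch of the guarded version of the first file (the file name `test.py` is an assumption implied by `import test`):

```python
# test.py
def test():
    return "hello world"


if __name__ == "__main__":
    # Runs only when this file is executed directly (python test.py),
    # not when another file does `import test`.
    print(test())
```

With this guard in place, the second file's `print(test.test())` is the only thing that prints.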

## Generators
(code on github)
- a generator is a function that produces the same items as a list you would build yourself, but instead of returning the whole list it yields the items one at a time
- you can iterate over it just as you iterate over a list, without having to keep track of where you are
- the next time you go to the generator, it resumes where it left off and gives you the next number of that range
- so you don't yield a list; you yield it number by number
- if you were to make a list of 10 million items/numbers, all of them would have to sit in memory at once, which can take up a large share of your computer's memory
- a generator only holds one number at a time
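The points above can be sketched as follows. `numbers_up_to` is a hypothetical name chosen for illustration; it behaves much like the built-in `range`.

```python
def numbers_up_to(limit):
    """Yield 0, 1, ..., limit - 1 one at a time instead of building a list."""
    n = 0
    while n < limit:
        yield n
        n += 1


gen = numbers_up_to(10_000_000)  # nothing is computed yet
print(next(gen))  # 0
print(next(gen))  # 1, the generator resumed where it left off

# A generator can be consumed by anything that iterates, e.g. sum().
total = sum(numbers_up_to(1000))
print(total)  # 499500
```

Unlike a list, a generator can only be walked through once and cannot be indexed.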

In our code: 
- we don't want to fill up our memory
- we want to yield hit by hit and get those sentences back
- define a generator for this
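A rough sketch of such a generator, using only the standard library. It assumes that `texts` would in practice be the stored `content` of the Whoosh hits and that `forms` would come from `variations(word)`; here both are replaced by small hand-written examples, and `sentences_with` is a hypothetical name.

```python
import re


def sentences_with(texts, forms):
    """Yield each sentence from `texts` that contains one of the word `forms`."""
    wanted = {form.lower() for form in forms}
    for text in texts:
        # Rough sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = set(re.findall(r"[a-z']+", sentence.lower()))
            if words & wanted:
                yield sentence.strip()


texts = ["She rendered him speechless. He left.", "Rendering it took hours!"]
for sentence in sentences_with(texts, ["render", "rendered", "rendering"]):
    print(sentence)
```

Because the sentences are yielded one by one, the caller can stop after the first few without the rest of the corpus ever being processed.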

1. Try to do time-intensive operations like building an index only once in your code
2. Try to write code that is cross-platform, or at least does not break when executed on a different platform than yours
3. Especially for data science: account for large data structures
4. In Python everything is an object: every call we made into the Whoosh library returned an object with its own specific methods. Understanding this OOP design of Python will help a lot when reading documentation for external libraries
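For point 2, one common way to avoid platform-specific path separators is `pathlib`. This is a sketch only: the corpus location is copied from the indexing cell above and may differ on your machine, and `corpus_files` is a hypothetical helper.

```python
from pathlib import Path

# Joining with / lets pathlib pick the right separator for the platform.
corpus_dir = Path("InformationScience") / "course" / "data" / "corpus_of_british_fiction"


def corpus_files(directory):
    """Yield the files in `directory` in sorted order, one path at a time."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            yield path
```

Each yielded path can be passed straight to `open(path, encoding="utf-8")`, which also ties in with point 3, since the file list is never held in memory as a whole beyond the directory listing itself.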