# Now You Code 2: Information Extraction

How do we make computers seem intelligent? One approach is to use *term extraction*. Term extraction is a type of information extraction where we attempt to find relevant terms in text. The relevant terms come from a *corpus*, a set of plausible terms we want to extract.

For example, suppose we have the text:

`One day I would like to visit Syracuse`

We, as smart humans, can be fairly confident that `Syracuse` is a place, more specifically a `city`.

A rudimentary method to make the computer interpret `Syracuse` as a place is to provide a corpus of cities and have the computer look up `Syracuse` in that corpus. 

In this code exercise we will do just that. Let's first write a function to read cities from the file `NYC2-cities.txt` into a corpus of cities, which will be represented in Python as a list.

Then write a main program loop that inputs some text and splits it into a list of words. If any of the words match a city in the corpus list, the program outputs that the word is a city.

The program should handle upper / lower case matching. A good approach is to title case the input. 
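
As a small illustration of why title casing helps, the sketch below normalizes a few spellings of the same word before comparing them with a corpus entry. The `sample_corpus` list is a made-up stand-in for the contents of `NYC2-cities.txt`, and it assumes the cities in the file are stored in title case.

```python
# Hypothetical three-city corpus, standing in for the real file contents.
sample_corpus = ["Syracuse", "Rochester", "Austin"]

spellings = ["syracuse", "SYRACUSE", "sYrAcUsE"]

# str.title() upper-cases the first letter of each word and lower-cases the rest.
normalized = [word.title() for word in spellings]

for raw, fixed in zip(spellings, normalized):
    print(raw, "->", fixed, "| in corpus:", fixed in sample_corpus)
```

All three spellings normalize to the same string, so a single corpus entry covers them. A name with interior capitals would not survive this step, since `"mcallen".title()` gives `"Mcallen"`.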

IMPORTANT: Please note that our program will ONLY work for one-word cities like `Syracuse`, and will not work for multiple-word cities like `San Diego`. Don't worry about that now.

SAMPLE RUN

```
Enter some text (or ENTER to quit): one day I would like to visit syracuse and rochester
Syracuse is a city
Rochester is a city
Enter some text (or ENTER to quit): austin is in texas
Austin is a city
Enter some text (or ENTER to quit): 
Quitting...
```

Once again we will solve this problem using the problem simplification approach. First, we will write the `load_city_corpus` function to build our city list. Second, we will write the `is_a_city` function which, given a word and a city list, returns `True` when the word is a city. Finally, we conclude with the main program, which finds cities in our text as demonstrated in our sample run.

## Step 1: Problem Analysis for `load_city_corpus`

Inputs: None (reads from a file)

Outputs: a Python list of cities

Algorithm (Steps in Program):

1. Define `load_city_corpus()`
    2. Start with an empty list, `city_list`
    3. Store the file name `"NYC2-cities.txt"` in a variable
    4. Open the file
        5. For each line in the file
            6. Strip the line and append it to `city_list`
    7. Return `city_list`

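## Step 2: Write the definition for the `load_city_corpus` function

```python
def load_city_corpus():
    """Read NYC2-cities.txt and return its cities as a list of strings."""
    city_list = []
    filename = "NYC2-cities.txt"
    with open(filename) as f:
        for line in f:
            # strip() removes the trailing newline and any surrounding spaces
            city_list.append(line.strip())
    return city_list
```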
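
One way to check this logic without depending on the real data file is to point it at a small temporary file. The sketch below assumes a variant that takes the file name as a parameter (`load_corpus_from` is a hypothetical name, not part of the assignment) and that the file holds one city per line. Skipping blank lines is also an assumption about the file format.

```python
import os
import tempfile


def load_corpus_from(filename):
    """Hypothetical variant of load_city_corpus that accepts a file name."""
    city_list = []
    with open(filename) as f:
        for line in f:
            city = line.strip()
            if city:  # skip blank lines (assumed not to be meaningful)
                city_list.append(city)
    return city_list


# Write a tiny stand-in corpus to a temporary file, one city per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Syracuse\nRochester\n\nAustin\n")
    temp_path = tmp.name

sample_cities = load_corpus_from(temp_path)
os.remove(temp_path)  # clean up the temporary file
print(sample_cities)
```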



## Step 3: Problem Analysis for `is_a_city`

Inputs: a string word and a Python list of cities

Outputs: `True` when the word is in the list of cities, otherwise `False`.

Algorithm (Steps in Program):

1. Define `is_a_city(city, city_list)`
    2. Try
        3. Look up the index of `city` in `city_list` and return `True` if it is found
    4. Except `ValueError` (the city is not in the list)
        5. Return `False`



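## Step 4: Write the definition for the `is_a_city` function

```python
def is_a_city(city, city_list):
    """Return True when city appears in city_list, otherwise False."""
    try:
        # list.index() raises ValueError when the value is not in the list
        city_list.index(city)
        return True
    except ValueError:
        return False
```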
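
The same check can also be written with Python's `in` operator, which evaluates to `True` or `False` directly and needs no `try`/`except`. The sketch below shows that alternative. `is_a_city_in` and the sample list are hypothetical names used only for illustration.

```python
def is_a_city_in(city, city_list):
    """Alternative sketch: membership test with the `in` operator."""
    return city in city_list


# Made-up sample corpus for demonstration.
sample_cities = ["Syracuse", "Rochester", "Austin"]

print(is_a_city_in("Syracuse", sample_cities))
print(is_a_city_in("Texas", sample_cities))
```

For a large corpus, converting the list to a `set` once would make each lookup faster on average, at the cost of losing the original ordering.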

## Step 5: Problem Analysis for entire program

Inputs: A sample of text that may contain a city, or `quit` to end the program

Outputs: For each city found, the program prints "(item) is a city"

Algorithm (Steps in Program): (make sure to use the two functions we created)

1. Call `load_city_corpus()` to read the file and store the cities in the list `city_list`
2. Loop the program indefinitely
    3. Input text, title case it, and assign it to the variable `city`
    4. Split `city` into a list of words and assign it to the variable `city2`
    5. For each item in `city2`
        6. If `is_a_city(item, city_list)` returns `True`
            7. Print "(item) is a city"
    8. If the input is "quit"
        9. Print "Ending..."
        10. Break out of the loop
        

## Step 6: Write complete program, making sure to use your two functions.

```python
city_list = load_city_corpus()
while True:
    # Title case the input so it matches the capitalization used in the corpus
    city = input("Enter some text (or type quit to quit):").title()
    city2 = city.split()  # split on any run of whitespace
    for item in city2:
        if is_a_city(item, city_list):
            print(item, "is a city ")
    # "quit" becomes "Quit" after title casing
    if city == "Quit":
        print('Ending...')
        break
```


```
Enter some text (or type quit to quit):i love syracuse
Syracuse is a city 
Enter some text (or type quit to quit):memphis looks fun
Memphis is a city 
Enter some text (or type quit to quit):dallas and austin are in texas
Dallas is a city 
Austin is a city 
Enter some text (or type quit to quit):quit
Ending...
```
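
To make the matching logic easier to test apart from the `input()` loop, it could be pulled into its own function. The following is a sketch under that assumption: `find_cities` is a hypothetical helper that is not part of the assignment, and the three-city list stands in for the real corpus.

```python
def find_cities(text, city_list):
    """Return the title-cased words in text that appear in city_list."""
    # split() with no argument also copes with repeated spaces
    words = text.title().split()
    return [word for word in words if word in city_list]


# Made-up sample corpus for demonstration.
sample_cities = ["Syracuse", "Dallas", "Austin"]

for city in find_cities("dallas and austin are in texas", sample_cities):
    print(city, "is a city")
```

Because the words are split only on whitespace, a word with punctuation attached (for example `syracuse!`) would not match its corpus entry in this sketch.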


## Step 7: Questions

1. Explain your approach to solving this problem for cities with 2 words, like `New York` or `Los Angeles`.

To solve this problem, I would use specific if statements to catch words that commonly begin two-word city names, such as "New", "San", and "Los". If the if statement detects one of these words in the list, the program combines that word and the word after it into a single list item instead of two. That single item is then run against the `is_a_city(city, city_list)` function to determine whether it is in our list `city_list`. For example, if the list we input contained `['I','Love','New','York']`, the program would recognize 'New' as one of the words named in the if statement and pair it with the following word in the list, in this case 'York'. Our updated list would look like `['I','Love','New York']`.
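
The answer above could be sketched roughly as follows. This is only an illustration of the described idea, not a complete solution: `TWO_WORD_PREFIXES` and `merge_two_word_cities` are hypothetical names, and the prefix set is an assumed, non-exhaustive sample.

```python
# Assumed prefixes that often start two-word city names (illustrative only).
TWO_WORD_PREFIXES = {"New", "San", "Los"}


def merge_two_word_cities(words):
    """Join each prefix word with the word that follows it."""
    merged = []
    i = 0
    while i < len(words):
        if words[i] in TWO_WORD_PREFIXES and i + 1 < len(words):
            merged.append(words[i] + " " + words[i + 1])
            i += 2  # skip the word that was just absorbed
        else:
            merged.append(words[i])
            i += 1
    return merged


print(merge_two_word_cities(["I", "Love", "New", "York"]))
```

A phrase such as "New Shoes" would be merged as well, but the merged item would simply fail the corpus lookup. Cities whose first word is not in the prefix set would still be missed.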



2. How would you solve the problem where you enter a city name which is not in the corpus?

If the city name is not in the corpus, we would have to create a separate function that compares each item in the list `city2` to each city listed in `city_list`. If every item comes back as not being on the list, the program would print "Please enter different input."
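
A minimal sketch of that idea is shown below. `report_cities` and `NO_MATCH_MESSAGE` are hypothetical names, the message wording is made up for illustration, and the sample list stands in for the real corpus.

```python
# Hypothetical message shown when none of the words is in the corpus.
NO_MATCH_MESSAGE = "No city found. Please try different text."


def report_cities(words, city_list):
    """Return the lines to print for a list of title-cased words."""
    matches = [word for word in words if word in city_list]
    if not matches:
        return [NO_MATCH_MESSAGE]
    return [city + " is a city" for city in matches]


# Made-up sample corpus for demonstration.
sample_cities = ["Syracuse", "Rochester", "Austin"]

for line in report_cities("I Love Boston".split(), sample_cities):
    print(line)
```

Note that this sketch cannot tell a city missing from the corpus apart from text that mentions no city at all. Both cases produce the same message.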


## Reminder of Evaluation Criteria

1. Was the problem attempted (analysis, code, and answered questions)?
2. Was the problem analysis thought out? (does the program match the plan?)
3. Does the code execute without syntax error?
4. Does the code solve the intended problem?
5. Is the code well written? (easy to understand, modular, and self-documenting, handles errors)
