
#### **Introduction to Programming for Data Science using Python**

#**Project 4: Text Analysis - Finding Characters**

One of the goals of the Cliff Note Generator was to generate a list of characters in a novel. We can actually use our current skill set and include the techniques discussed in the ngrams lesson to extract (with a good level of accuracy) the main characters of a novel.

We will also make some improvements with some of the parsing, cleaning, and preparation of the data. It would be best to read this entire lesson before doing any coding. Also note that this lesson is a bit different in that you will be responsible for more of the code writing. What is being specified is a minimum. We highly recommend that you decompose any complex processes into multiple functions.

###**Preparation**
Before doing anything, read through the entire set of directions first. You will get a sense of the restrictions and overall goals.

###**Step 1**
Fill in the functions from the previous lessons (ngrams and stopwords).

In [None]:
#
# from Ngrams and Stopwords Lessons
#

# Copy & paste your code from the ngrams lesson
import re
import collections
def read_text(filename):
  with open(filename,'r') as fd:
    txt = fd.read()

  return txt

def split_text_into_tokens(text):
    f_list = []
    pattern = r"['A-Za-z]+-?['A-Za-z0-9]+"

    words = re.findall(pattern, text)

    for word in words:
        # Strip the quotes from tokens wrapped in apostrophes, e.g. 'dogs' -> dogs
        if word[0] == "'" and word[-1] == "'":
            f_list.append(word[1:-1])
        else:
            f_list.append(word)
    return f_list
def split_text_into_tokens_2(text):
    f_list = []
    pattern = r"['A-Za-z]+-?['A-Za-z0-9]+"

    words = re.findall(pattern, text)

    for word in words:
        # normalize_token removes every apostrophe, so no quote stripping is
        # needed afterwards; skip tokens that were nothing but apostrophes
        word = normalize_token(word)
        if word:
            f_list.append(word)
    return f_list
def bi_grams(tokens):

  i = 1
  l_1 = []
  while i < len(tokens):
    f_tup = (tokens[i - 1], tokens[i])
    l_1.append(f_tup)
    i += 1

  return l_1
def normalize_token(token):

  cleaned_token = token.strip()
  X = cleaned_token.replace("'","")
  return X
def top_n(tokens, n):
  cnt = collections.Counter(tokens)

  return cnt.most_common(n)

# From stopwords lesson
def remove_stop_words(tokens, stoplist):
  f_list= []
  for i in tokens:
    if i.lower() not in stoplist:
      f_list.append(i)
  return f_list

def load_stop_words(filename):
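  f_list = []
  txt = read_text(filename)
  txt = txt.split("\n")
  for i in txt:
    if not i:
      # Skip blank lines (e.g. after a trailing newline); indexing an
      # empty string with i[0] would raise an IndexError
      continue
    if i[0] == "'" or i[-1] == "'":
      f_list.append(i.replace("'", ""))
    else:
      f_list.append(i)
  return f_list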


###**Step 2: Test your code**
The following should now work:

In [None]:
def demo_test():

   text = read_text('lupin.txt')
   stop = load_stop_words('stopwords.txt')

   tokens  = split_text_into_tokens(text)
   cleaned = remove_stop_words(tokens, stop)
   grams = bi_grams(cleaned)

   print(top_n(grams, 10))

demo_test()

[(('said', 'Duke'), 403), (('said', 'Guerchard'), 302), (('said', 'Formery'), 126), (('said', 'Germaine'), 123), (('said', 'Lupin'), 119), (('said', 'Sonia'), 74), (('said', 'Victoire'), 55), (('Mademoiselle', 'Kritchnoff'), 45), (('said', 'millionaire'), 44), (('Duke', 'Charmerace'), 43)]


You should see the following:
```
[(('said', 'Duke'), 403), (('said', 'Guerchard'), 302), (('said', 'Formery'), 126), (('said', 'Germaine'), 123), (('said', 'Lupin'), 119), (('said', 'Sonia'), 74), (('said', 'Victoire'), 55), (('Mademoiselle', 'Kritchnoff'), 45), (('said', 'millionaire'), 44), (('Duke', 'Charmerace'), 43)]
```

Note how this compares to the output when we didn't account for the case of the stopwords.
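
To see why the case of the stop words matters, here is a minimal sketch on invented data. The names toy_tokens and toy_stoplist are made up for this illustration and are not part of the assignment:

```python
# Toy data, invented for illustration only
toy_tokens = ["The", "Duke", "said", "the", "word"]
toy_stoplist = ["the", "a", "of"]

# Case-sensitive check: "The" slips through because only "the" is listed
case_sensitive = [t for t in toy_tokens if t not in toy_stoplist]

# Case-insensitive check (the approach remove_stop_words takes):
# compare the lowercased token against the stoplist
case_insensitive = [t for t in toy_tokens if t.lower() not in toy_stoplist]

print(case_sensitive)    # ['The', 'Duke', 'said', 'word']
print(case_insensitive)  # ['Duke', 'said', 'word']
```

The token itself is kept in its original case; only the comparison is lowercased, which is what lets us test for capitalization later.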

###**Finding the Characters**
With this machinery in place, we are ready to find characters in a novel (I hope you are reading this with great anticipation) using different strategies. Each of the strategies has a function to implement that strategy.

###**Method #1**
One attribute (or feature) of the text we are analyzing is that proper nouns are capitalized. Let’s capitalize on this and find every single word in the text that starts with an uppercase letter and is **not** a stop word.

Create and define the function find_characters_v1(text, stoplist, top):

* Tokenize and clean the text using the function split_text_into_tokens
* Filter the tokens so that they contain no stop words (regardless of case). The parameter stoplist is the array returned from load_stop_words
* Create a new list of tokens (keep the order) of words that are capitalized. You can test the first character of the token.
* Return the top words as a list of tuples (the first element is the word, the second is the count)

In [None]:
def find_characters_v1(text, stoplist, top):
  tokens = split_text_into_tokens_2(text)
  count  = collections.Counter()
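  for i in tokens:
    # Keep tokens that are not stop words (in any case) and whose
    # first character is an uppercase letter
    if i.lower() not in stoplist and i[0].isupper():
      count[i] += 1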

  return count.most_common(top)

text = read_text('lupin.txt')
stop = load_stop_words('stopwords.txt')
v1 = find_characters_v1(text,stop, 15)
print(v1)

[('Duke', 727), ('Guerchard', 642), ('Lupin', 331), ('Formery', 258), ('Germaine', 228), ('Sonia', 226), ('Oh', 189), ('Yes', 164), ('Victoire', 160), ('Im', 127), ('Charolais', 124), ('Well', 103), ('Gournay-Martin', 96), ('Charmerace', 84), ('Ive', 82)]


For Lupin, run the following:
```
text = read_text('lupin.txt')
stop = load_stop_words('stopwords.txt')
v1 = find_characters_v1(text,stop, 15)
print(v1)
```
You should see (the output is formatted for clarity):
```
('Duke', 727),
('Guerchard', 642),
('Lupin', 331),
('Formery', 258),
('Germaine', 228),
('Sonia', 226),
('Oh', 189),
('Yes', 164),
('Victoire', 160),
('Charolais', 124)
```

Notice that with this very simple method we found 11 characters in the top 15. You also found an Oh and a Yes. You might be inclined to start fiddling with the stop words. The ones you could add are 'Duke' and 'Well' (the interjection), since we know those words do not provide much content in this context. But as we mentioned in the stop words lesson, that's a dangerous game, since other novels might include some of these:

![](https://drive.google.com/uc?export=view&id=1BztBBMwk5FSLxTy37naTUzeODVwg12kd)

Link to the above [image](https://drive.google.com/file/d/1BztBBMwk5FSLxTy37naTUzeODVwg12kd/view?usp=sharing)

###**Method #2**
Another feature of characters in a novel is that some of them have two names (for example, Arsène Lupin).

Create and define the following function:
```
find_characters_v2(text, stoplist, top)
```
* Tokenize and clean the text using the function split_text_into_tokens
* Convert the list of tokens into a list of bigrams (using your bi_grams method)
* Filter the bigrams to keep only the ones where both words are capitalized (test just the first character)
* Neither word (in either lower or upper case) should be in stoplist
* Remember stoplist could be an empty list
* Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count


Note that we are **not** removing the stopwords from the text. We are now using the stopwords to make decisions on the text. The stopwords lesson has more details on this as well.
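
The difference between the two approaches can be sketched on a toy token list. Everything here (toy_tokens, toy_stoplist, toy_bigrams) is invented for illustration; toy_bigrams is only a stand-in for your own bi_grams function:

```python
# Toy data, invented for illustration only
toy_tokens = ["Duke", "of", "Charmerace", "met", "Sonia", "Kritchnoff"]
toy_stoplist = ["of", "met"]

def toy_bigrams(tokens):
    # Pair each token with the token that follows it
    return list(zip(tokens, tokens[1:]))

# Approach A: remove the stop words first, then build bigrams.
# Words that were never adjacent in the text end up paired.
kept = [t for t in toy_tokens if t.lower() not in toy_stoplist]
removed_first = toy_bigrams(kept)

# Approach B: build bigrams from the untouched text, then discard
# every bigram that contains a stop word.
filtered_after = [
    (a, b) for a, b in toy_bigrams(toy_tokens)
    if a.lower() not in toy_stoplist and b.lower() not in toy_stoplist
]

print(removed_first)
# [('Duke', 'Charmerace'), ('Charmerace', 'Sonia'), ('Sonia', 'Kritchnoff')]
print(filtered_after)
# [('Sonia', 'Kritchnoff')]
```

Approach A is how ('Duke', 'Charmerace') appeared in the Step 2 output; approach B only keeps pairs that were truly next to each other.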

In [None]:
def find_characters_v2(text, stoplist, top):
  # Split the text into tokens, then pair adjacent tokens into bigrams
  split_words = split_text_into_tokens(text)
  ngrams = bi_grams(split_words)
  count = collections.Counter()
  for first, second in ngrams:
    # Keep the bigram only if both words start with an uppercase letter
    # and neither word (lowercased) is in the stoplist
    if (first[0].isupper() and second[0].isupper()
        and first.lower() not in stoplist and second.lower() not in stoplist):
      count[(first, second)] += 1

  return count.most_common(top)




text = read_text('lupin.txt')
stop = load_stop_words('stopwords.txt')
v2 = find_characters_v2(text,stop, 15)
print(v2)

[(('Mademoiselle', 'Kritchnoff'), 45), (('Duke', 'Yes'), 16), (('Duke', 'Oh'), 15), (('Formery', 'Yes'), 11), (('Guerchard', 'Oh'), 11), (('Guerchard', 'Yes'), 11), (('Mademoiselle', 'Gournay-Martin'), 10), (('South', 'Pole'), 8), (('Firmin', 'Firmin'), 8), (('Mademoiselle', 'Germaine'), 7), (('Mademoiselle', 'Sonia'), 7), (('Chief-Inspector', 'Guerchard'), 7), (('University', 'Street'), 6), (('Du', 'Buits'), 6), (('ARS', 'NE'), 6)]



With the text of Lupin, the following is the expected output of v2:

```
(('Mademoiselle', 'Kritchnoff'), 45),
(('Duke', 'Yes'), 16),
(('Duke', 'Oh'), 15),
(('Formery', 'Yes'), 11),
(('Guerchard', 'Oh'), 11),
(('Guerchard', 'Yes'), 11),
(('Mademoiselle', 'Gournay-Martin'), 10),
(('South', 'Pole'), 8),
(('Du', 'Buits'), 8),
(('Firmin', 'Firmin'), 8)
```

For this book, this method is not very useful: few characters are referred to by two capitalized names, and most of the top pairs, such as ('Duke', 'Yes'), are accidental matches rather than names. This is a clear example of how capitalization can mislead simple algorithms.
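
A likely cause of pairs such as ('Duke', 'Yes') is that tokenizing discards punctuation, so the last word of one sentence sits next to the capitalized first word of the next. The sketch below shows this on an invented sentence, using a simplified letters-only pattern as a stand-in for split_text_into_tokens (an assumption, not the tokenizer from the lesson):

```python
import re

# Invented sample text, for illustration only
toy_text = '"It is late," said the Duke. "Yes, it is," said Sonia.'

# Simplified tokenizer: runs of letters only, punctuation is dropped
toy_tokens = re.findall(r"[A-Za-z]+", toy_text)

# Pair each token with its successor
toy_pairs = list(zip(toy_tokens, toy_tokens[1:]))

# Keep pairs where both words start with an uppercase letter
capitalized_pairs = [
    (a, b) for a, b in toy_pairs
    if a[0].isupper() and b[0].isupper()
]

print(capitalized_pairs)  # [('Duke', 'Yes')]
```

The sentence boundary between "Duke." and "Yes" is invisible once the period and quotation marks are gone, so the pair looks like a two-word name.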

Note: in order to match these outputs, use the collections.Counter class. Otherwise, your own sorting may order tuples with equal counts differently.
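
As a small sketch of the tie-ordering issue, the toy data below (invented for illustration) gives three bigrams the same count. Counter.most_common lists equal counts in the order they were first encountered, while a hand-written sort with a different tie-break produces another order:

```python
import collections

# Toy bigrams, invented for illustration; each appears exactly twice
toy_bigrams = [
    ("South", "Pole"), ("Du", "Buits"), ("Firmin", "Firmin"),
    ("Du", "Buits"), ("South", "Pole"), ("Firmin", "Firmin"),
]

counts = collections.Counter(toy_bigrams)

# Ties come back in first-encountered order
first_seen = counts.most_common(3)

# A custom sort that breaks ties alphabetically gives a different order
alphabetical = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

print(first_seen)
# [(('South', 'Pole'), 2), (('Du', 'Buits'), 2), (('Firmin', 'Firmin'), 2)]
print(alphabetical)
# [(('Du', 'Buits'), 2), (('Firmin', 'Firmin'), 2), (('South', 'Pole'), 2)]
```

Both results are valid rankings; they simply disagree on how to order the ties, which is enough to make an output comparison fail.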

###**Titles, a short diversion**
Another feature of characters is that many of them are preceded by a title, also called an honorific (Dr., Mr., Mrs., Miss, Ms., Rev., Prof., Sir, etc.). We will look for bigrams that contain these titles. However, we will **not** hard code the titles (we won't specify which titles to look for). We will let the data tell us what the 'titles' are.

Here's the process to use to self discover titles:

* Let's define a title as a capital letter followed by 1 to 4 lower case letters followed by a period. This is not perfect, but it captures a good majority of possible titles.
* Create a list named title_tokens containing the tokens that match the above criteria (hint: use regular expressions), for example: title_tokens = regex1.findall(text)
* Now we need to remove words that merely ended a sentence and happen to have the same characteristics (e.g. Tom. Bill. Pat. etc.). These names could have appeared in a sentence like "Please go Tom." Tom is **not** a title, but it would have been found by our definition.
* Use the same definition for titles (above), but instead of ending with a period, the token must end with whitespace. The idea is that the same name will hopefully appear somewhere in the text without a period. It’s very likely that you would encounter 'Tom' somewhere in the text without a period, but it’s unlikely that Mr., Mrs., Dr., etc. would appear without one. Let's call this list pseudo_titles, for example: pseudo_titles = regex2.findall(text)
* The set of titles is essentially the first list of tokens (title_tokens) with all the tokens in the second list (pseudo_titles) removed. For example, the first list might have 'Dr.', 'Tom.' and 'Mr.' in it and the second list might have 'Tom' and 'Ted' in it. The final title list would be ['Dr', 'Mr'].
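
The process above can be sketched on a short invented text. The patterns behind regex1 and regex2 are one possible reading of the title definition, and the toy_ names are made up for this illustration; this is not the full get_titles function:

```python
import re

# Invented sample text, for illustration only
toy_text = "Dr. Watson met Tom and Ted today. Mr. Holmes said, Please go Tom."

# One capital letter, 1 to 4 lowercase letters, then a period
regex1 = re.compile(r"\b[A-Z][a-z]{1,4}\.")
# Same shape, but followed by whitespace instead of a period
regex2 = re.compile(r"\b[A-Z][a-z]{1,4}\s")

toy_title_tokens = regex1.findall(toy_text)   # ['Dr.', 'Mr.', 'Tom.']
toy_pseudo_titles = regex2.findall(toy_text)  # ['Tom ', 'Ted ']

# Drop the trailing period or whitespace so the two lists are comparable
candidates = {token[:-1] for token in toy_title_tokens}
ordinary_words = {token[:-1] for token in toy_pseudo_titles}

# Titles are the candidates never seen without a period
toy_titles = sorted(candidates - ordinary_words)

print(toy_titles)  # ['Dr', 'Mr']
```

'Tom.' is discarded because 'Tom' also occurs followed by a space, while 'Dr' and 'Mr' only ever occur with a period.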

Name your function get_titles. It should encapsulate the above logic and return a list of titles; expect only a few valid titles in that list:

In [None]:
import re
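def get_titles(txt):
    # A capital letter, 1 to 4 lowercase letters, then a period (candidate titles)
    title_pattern = r'\b[A-Z][a-z]{1,4}\.'
    # The same shape, but followed by whitespace instead of a period
    pseudo_pattern = r'\b[A-Z][a-z]{1,4}\s'
    title_tokens = re.findall(title_pattern, txt)
    # A set makes the membership test below fast
    pseudo_titles = set(re.findall(pseudo_pattern, txt))

    f_list = []
    for i in title_tokens:
      # Swap the period for a space so the candidate can be compared
      # with the pseudo titles, which keep their trailing whitespace
      i = i.replace(".", " ")
      if i not in pseudo_titles:
        f_list.append(i.strip())

    return sorted(set(f_list))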



In [None]:
text = read_text('lupin.txt')
titles = get_titles(text)
print(titles)

['Dyck', 'May', 'Mlle', 'Mr', 'Star', 'Ste']


Once you have get_titles working, the following should work:
```
text = read_text('lupin.txt')
titles = get_titles(text)
print(sorted(titles))
```
You should get 6 computed titles in Lupin, again with only a few actual titles:
```
['Dyck', 'May', 'Mlle', 'Mr', 'Star', 'Ste']
```

Do not move forward until this is working.

###**Method #3**
Create and define the following function

find_characters_v3(text, stoplist, top)

* Tokenize and clean the text
* Convert the list of tokens into a list of bigrams
* Keep only the bigrams in which the first word is a title (hint: use the output of get_titles), the second word is capitalized, **and** the second word (in either lower or upper case) is not in stoplist
* Return the top bigrams as a list of tuples: the first element is the bigram tuple, the second is the count

In [None]:
def find_characters_v3(text, stoplist, top):
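  count = collections.Counter()
  # Discover the titles from the text instead of hard-coding them
  titles = get_titles(text)
  split = split_text_into_tokens(text)
  ngrams = bi_grams(split)
  for i in ngrams:
    # First word is a title; second word is capitalized and not a stop word
    if i[0] in titles and i[1][0].isupper() and i[1].lower() not in stoplist:
      count[i] += 1
  return count.most_common(top)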


text = read_text('lupin.txt')
stop = load_stop_words('stopwords.txt')
v3 = find_characters_v3(text, stop, 5)
print(v3)

[(('Mlle', 'Germaine'), 2), (('Mlle', 'Kritchnoff'), 2), (('Mr', 'Inspector'), 2), (('Mlle', 'Gournay-Martin'), 1), (('Mlle', "Germaine's"), 1)]


For Lupin, you should get the following:
```
(('Mlle', 'Germaine'), 2),
(('Mlle', 'Kritchnoff'), 2),
(('Mr', 'Inspector'), 2),
(('Mlle', 'Gournay-Martin'), 1),
(('Mlle', "Germaine's"), 1),
(('Ste', 'Clotilde'), 1)
```

While this doesn't yield a lot of solid information, it does give you a basic understanding of the fundamentals of finding these kinds of specific words and the process behind them.

###**Machine Learning?**
You may have heard of (and used) NLTK, a popular Python library for processing text. We will use both the NLTK and spaCy NLP libraries to do something similar in another lesson. These libraries include models that were built by processing large amounts of text and that can extract entities (a task called NER, for named entity recognition). These entities include organizations, people, places, and money.

The models that were built essentially learned what features (like capitalization or title words) were important when analyzing text and came up with a model that attempts to do the same thing we did here. However, we hard coded the rules (use bigrams, remove stop words, look for capital letters, etc). This is sometimes referred to as a rule-based system. The analysis is built on manually crafted rules.

In machine learning (sometimes referred to as an automatic system), some of the algorithms essentially learn which features are important (or how much weight to apply to each feature) to build a model, and then use the model to classify tokens as named entities. The biggest issue is that these models could have been built from a very different text source (e.g. journal articles or Twitter feeds) than the one you are processing. Also, the models themselves require a large set of resources (memory, CPU) that you may not have available. What you built in this lesson is efficient, fast and fairly accurate.

In the follow-on course, you'll be able to build your own text-based models.


##**Submission**

After implementing all the functions and testing them, please download the notebook as "info407_project_text_analysis_finding_characters.py" and submit it to Gradescope under the "Project - Text Analysis - Finding Characters" assignment tab.

**NOTES**

* Be sure to use the function names and parameter names as given.
* DO NOT use your own function or parameter names.
* Your file MUST be named "info407_project_text_analysis_finding_characters.py".
* Comment out any lines of code and/or function calls to those functions that produce errors.
* Grading cannot be performed if any of these are violated.