Reading and searching text
=======

_Practical Python for Linguistics and the Humanities -- Alexis
Dimitriadis_

## Contents


**[1. New python features](#1.-New-python-features)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.1 The `import` command](#1.1-The-import-command)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.2 Features for interactive use](#1.2-Features-for-interactive-use)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.3 Custom help text](#1.3-Custom-help-text)  

**[2. More about strings](#2.-More-about-strings)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.1 The `repr()` function](#2.1-The-repr%28%29-function)  

**[3. Reading text from files](#3.-Reading-text-from-files)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.1 Opening a file](#3.1-Opening-a-file)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.2 Scenario 1: Processing text line by line](#3.2-Scenario-1:-Processing-text-line-by-line)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.3 Searching with regular expressions](#3.3-Searching-with-regular-expressions)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.4 Putting it all together](#3.4-Putting-it-all-together)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.5 Building and using lists](#3.5-Building-and-using-lists)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.6 List comprehensions](#3.6-List-comprehensions)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.7 Scenario 2: Processing text word by word](#3.7-Scenario-2:-Processing-text-word-by-word)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.8 Scenario 3: Searching for sentences](#3.8-Scenario-3:-Searching-for-sentences)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.9 Case-insensitive search](#3.9-Case-insensitive-search)  

**[4. Search tasks](#4.-Search-tasks)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.1 Preparation: Download a story](#4.1-Preparation:-Download-a-story)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.2 Search for words in your story](#4.2-Search-for-words-in-your-story)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.3 Search for lines](#4.3-Search-for-lines)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [4.4 Other searches](#4.4-Other-searches)  

**[5. What we learned](#5.-What-we-learned)**  


Handling text involves several stages:


1. "Reading" the text into python.

2. Separating it into suitably-sized pieces (lines, words etc.):
The size _depends on our goals_.

3. Recognizing the pieces we are interested in.
  Regular expressions are a very powerful tool for this.

4. Collecting the pieces that match our search, and/or doing something
with them.

We'll start with some of the features python provides to help us
manage complex tasks like this, then look at some recipes for
searching.

## 1. New python features

### 1.1 The `import` command

"Importing" means we load a _module_ of new python commands,
usually function definitions. They expand what python can do.

The new commands (functions or variables) are in the "namespace" of
the module. To use them, specify the module name before the command,
e.g., `math.factorial(...)`:

In [1]:
import math
print("Ten factorial (10!) is", math.factorial(10))

Ten factorial (10!) is 3628800


There is another version of importing: With `from ... import`, we can
import a single name from a module.
The imported name can then be used without the module prefix.

In [2]:
from math import factorial
print("20! =", factorial(20))

20! = 2432902008176640000


### 1.2 Features for interactive use

As our programs get more complicated, it is sometimes hard to get a
handle on what's going on. Two useful commands for interactive use
(not for completed scripts) are `dir()` and `help()`. `dir()` lists
all variables we have defined so far. This is most useful in IDLE,
python's interactive editor and command console. In a Notebook, we'll
first see a bunch of internal commands defined by the notebook. Most
of their names start with `_` (underscore). But down near the end of
the list you'll see `math` and `factorial`, the two variables we have
defined so far. (Yes, a module name is a variable too.)

In [3]:
dir()

['In',
 'Out',
 '_',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__session__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i2',
 '_i3',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'exit',
 'factorial',
 'get_ipython',
 'math',
 'open',
 'quit']

The command `help()` can show us the internal documentation of
functions, modules or types. When we only need a brief description,
it's often easier than googling for the same information.

In [4]:
help(factorial)

Help on built-in function factorial in module math:

factorial(n, /)
    Find n!.

    Raise a ValueError if x is negative or non-integral.



If we call `help()` without arguments, it will launch a special mode
that reads commands and tries to find their documentation. (If you
enter it, type "q" or Control-D to exit.)
And because python will not let us use a keyword like "if" as a
function argument, `help()` will also accept a string containing a
python keyword:

In [5]:
help("if")

The "if" statement
******************

The "if" statement is used for conditional execution:

   if_stmt ::= "if" assignment_expression ":" suite
               ("elif" assignment_expression ":" suite)*
               ["else" ":" suite]

It selects exactly one of the suites by evaluating the expressions one
by one until one is found to be true (see section Boolean operations
for the definition of true and false); then that suite is executed
(and no other part of the "if" statement is executed or evaluated).
If all expressions are false, the suite of the "else" clause, if
present, is executed.

Related help topics: TRUTHVALUE



### Your turn:

Import the module `re`, python's regular expression module. Print out
the help text for the function `re.split()`.

In [9]:
# YOUR CODE:
import re
help(re.split)

Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.



Modules can have help text too. Print out the help for the module
`re`; you don't need to study it carefully, but note that the general
description is followed by a list of the functions in this module. The
function list is automatically generated, and includes brief
documentation for each function.

In [10]:
# YOUR CODE:

help(re)

Help on package re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.12/library/re.html

    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.

    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last matc

For more complete information, look up the official python documentation by googling, e.g., "python re". Near the top of the results you'll find a link to the manual page on http://docs.python.org. (At the top of the manual page there's a box you can use to switch to your version of python.)

### 1.3 Custom help text

Our own functions can have help text! If the very first thing in
a function is a string, it will become the help text (the so-called
"docstring").

In [11]:
def cheer(times=3):
    """
    Cheer wildly.
    For wilder cheering, specify a larger number.
    """
    for n in range(times):
        print("hieperdepiep hoeraaaaa!")

help(cheer)

Help on function cheer in module __main__:

cheer(times=3)
    Cheer wildly.
    For wilder cheering, specify a larger number.



From now on, you should include a docstring when defining your own
functions. Note that **a comment is not a docstring!** Comments are
invisible to `help()`.
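
To see the difference between a comment and a docstring for ourselves, here is a minimal sketch. The function names `commented` and `documented` are made up for this illustration. It relies on the fact that python stores a function's docstring in its `__doc__` attribute, which is where `help()` finds it.

```python
def commented():
    # This comment explains the code, but help() cannot see it.
    return None

def documented():
    """This docstring is what help() displays."""
    return None

# A function's docstring is stored in its __doc__ attribute.
print(repr(commented.__doc__))    # None: the comment was ignored
print(repr(documented.__doc__))
```

Calling `help(commented)` would show only the function's name and arguments, with no description.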

### Your turn:

Define a function `lazy()` that takes no arguments and does nothing.
It should have a docstring that says "This function does nothing".
After defining it, use `help()` to view its help string.

In [12]:
# YOUR CODE:

def lazy():
    """
    This function does nothing
    """

help(lazy)

Help on function lazy in module __main__:

lazy()
    This function does nothing



## 2. More about strings

Strings can contain special "escape sequences", which represent
characters that are hard to see or type.
They all start with a backslash: `\n` is a newline, for example.

In [13]:
twolines = "First line...\nSecond line"
print(twolines)

First line...
Second line


An escape sequence is a notational device: The resulting string
contains a real newline, not the two characters "`\`" and "`n`". We
can confirm this by checking the length of a simple string.

In [14]:
print(len("\n"))

1


When we _don't_ want the special meanings, we can prefix the string
definition with an _r_ (for "raw string"). In raw strings, the backslash
does not have a special meaning; it is just a character.

In [15]:
oneline = r"This is all \n one \n line"
print(oneline)

This is all \n one \n line


### 2.1 The `repr()` function

Who knew a literal string could be so complicated? To see what's
really going on, it's often helpful to ask python for a clear picture
of what our strings contain. One way to get this is with the function
`repr()` which shows us the "representation" of a string (or other
objects) in an unambiguous way.

In [16]:
print(repr(twolines))

'First line...\nSecond line'


`repr` has surrounded our string with single quotes (the better to see
spaces at the beginning or end), and shows the newline as `\n`. Let's
do the same with a string that contains actual backslashes:

In [17]:
print(repr(r"a backslash \ here"))

'a backslash \\ here'


One way to prevent a backslash from having a special meaning is to add
another backslash before it. This is what `repr` did here.
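
As a quick check, the sketch below writes the same text twice, once with a doubled backslash and once as a raw string, and compares the results. The variable names `escaped` and `raw` are chosen just for this example.

```python
escaped = "a backslash \\ here"   # regular string: the backslash is doubled
raw = r"a backslash \ here"       # raw string: a single backslash is enough

print(escaped == raw)   # True: both spellings produce the same string
print(len("\\"))        # 1: the doubled backslash is one character
print(escaped)
```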

### Your turn:

Construct a raw string named `ab` containing the text `A\nB`. Answer the following questions, then write code to show the answers.

1. What is the length of `ab`?
2. What will `print(ab[2])` print?

In [19]:
# YOUR CODE:
ab = r"A\nB"
print(len(ab))
print(ab[2])

4
n


## 3. Reading text from files

Most real-world programs will work with data that
was not typed (or pasted) into the program.
Often, the necessary data is read in from a file. We now examine in
more detail how to read text from files.

**Before continuing, ensure you have downloaded the file
`RedCircle.txt` and saved it in the same directory as this Notebook.**

### 3.1 Opening a file

To work with the contents of a file, we must "open" it. Opening a file does not directly produce its contents, but gives us a special "file
object": a kind of connection to the disk file, from which we can "read" text. The file object itself is just the means of access.

In [20]:
connection = open("RedCircle.txt")
print(connection)

<_io.TextIOWrapper name='RedCircle.txt' mode='r' encoding='cp1252'>


There are numerous ways to read text from a file. We can read
one line at a time (though usually we will not: this is for
illustration).

In [21]:
first_line = connection.readline()
second_line = connection.readline()
print("First:", first_line)
print("Second:", second_line)

First: Project Gutenberg's The Adventure of the Red Circle, by Arthur Conan Doyle

Second: 



**Your turn:**
Do you notice a lot of blank space in the above output? Find out
what's going on by printing an unambiguous representation of each line
we read, with `repr`.

In [22]:
# EDIT TO ADD YOUR CODE:

print(repr(first_line))
print(repr(second_line))

"Project Gutenberg's The Adventure of the Red Circle, by Arthur Conan Doyle\n"
'\n'


You should be able to see that each line read from the file ends with
a newline, and that the second line contains this newline and nothing
else. `print` will _always_ add an extra newline, except when told not
to (with `end=...`). Because we printed strings that already ended
with a newline, we ended up with too many.
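
Here is a small sketch of two ways to avoid the doubled newline. The first uses `end=""` as mentioned above. The second uses the string method `rstrip()`, which has not come up in this notebook so far: it returns a copy of the string with the given characters removed from the end. The text is typed in by hand here, so the example does not depend on the file.

```python
line = "Project Gutenberg's The Adventure of the Red Circle, by Arthur Conan Doyle\n"

print(line, end="")            # tell print not to add its own newline

stripped = line.rstrip("\n")   # or remove the newline from the string first
print(stripped)

print("This line follows directly, with no blank line in between.")
```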

Reading a file advances through it, like playing a music file or video.
Each `readline()` reads the next line in the file.
Once we have read to the end, there is nothing more to read: if we try,
we get the empty string. To read from the start again, the simplest way is to
re-open the file (i.e., make a new connection).
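
The following sketch shows this behaviour without touching `RedCircle.txt`. It assumes a stand-in: `io.StringIO` creates an in-memory object that can be read like an open text file. The name `demo` is made up for the illustration. The last step uses the method `seek(0)`, which moves back to the start; it is an alternative to re-opening that this notebook does not otherwise use.

```python
import io

# An in-memory stand-in for an open text file with two lines.
demo = io.StringIO("first line\nsecond line\n")

print(repr(demo.readline()))   # 'first line\n'
print(repr(demo.readline()))   # 'second line\n'

at_end = demo.readline()
print(repr(at_end))            # '': nothing is left to read

demo.seek(0)                   # go back to the start of the text
again = demo.readline()
print(repr(again))             # 'first line\n'

demo.close()
```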

The method `read()` will read all the way to the end of the file.
If called before anything else was read, it will read the entire file,
but if we call it now it will simply read what hasn't been read yet:

In [23]:
the_rest = connection.read()
readagain = connection.readline()

**Your turn:** What does the variable `readagain` contain after running
the above? Do you understand why? Examine it to find out.

In [25]:
# YOUR CODE:
print(repr(readagain))

''



When we are done with reading from a file, it is good practice to
"close" the connection, i.e., shut it down. Attempting to read from a
closed connection will trigger a python error. (A common oversight:
Don't forget the parentheses after `close()`, otherwise you are not
actually calling the function!) 

In [26]:
connection.close()

Closing a file is especially important 
when we are _writing_ to files, which we will see in a later notebook. 
Forgetting to close a file can prevent the text we wrote from being saved. 
Get in the habit now of cleaning up after your open files, to 
avoid problems later.

Another way to make sure you close your connection is by embedding your connection in a `with` statement. Once Python is done with the `with` block, it closes the connection.

In [27]:
with open("RedCircle.txt") as connection:
    first_line = connection.readline()
    second_line = connection.readline()
    

In [28]:
print("First:", first_line)
print("Second:", second_line)

First: Project Gutenberg's The Adventure of the Red Circle, by Arthur Conan Doyle

Second: 



Trying to read from the connection now raises an error, since it is closed:

In [29]:
connection.readline()

ValueError: I/O operation on closed file.
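
If we are unsure whether a connection is still open, we can ask it. The sketch below again assumes an in-memory stand-in (`io.StringIO`, under the made-up name `demo`) rather than the real file. It checks the `closed` attribute that file objects provide, and uses `try`/`except` to catch the error instead of letting it stop the program.

```python
import io

with io.StringIO("first line\nsecond line\n") as demo:
    first = demo.readline()

print(demo.closed)   # True: the with block closed the connection for us

try:
    demo.readline()
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```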

### 3.2 Scenario 1: Processing text line by line

We are often interested in ***finding _lines_ that match a certain
condition.*** To do so, read in the
file as a single, very long string and split it into a list of lines.
(I.e., a list of strings where each string is a whole line.) 
Here I print a sample line for demonstration.

In [30]:
with open("RedCircle.txt") as connection:
    alltext = connection.read()

lines = alltext.splitlines()

print(lines[51])

"You arranged an affair for a lodger of mine last year," she said--"Mr.


**Your turn:** Using loops, print out some lines and ranges of lines
from different parts of the file: The first ten lines, a block of
lines in the middle, etc. You may use a `while` loop, but `for` loops are particularly convenient for examining lists of strings; refresh your memory of how to 
use them from the notebook on lists.

If you look carefully, you'll see that `splitlines()` removes the
newlines at the end of each line (just like `split()` discards the
spaces between words). This is very convenient, not least because
Windows and OS X do not end lines in exactly the same way.
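
A short sketch makes the difference visible. The two sample strings are invented for the example: one uses the Windows line ending `\r\n`, the other the plain `\n` used on OS X and Linux.

```python
windows_text = "first line\r\nsecond line\r\n"   # Windows line endings
unix_text = "first line\nsecond line\n"          # OS X / Linux line endings

windows_lines = windows_text.splitlines()
unix_lines = unix_text.splitlines()
print(windows_lines)   # ['first line', 'second line']
print(unix_lines)      # ['first line', 'second line']

# Splitting on "\n" by hand leaves an empty string after the final newline.
by_hand = unix_text.split("\n")
print(by_hand)         # ['first line', 'second line', '']
```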

In [47]:
# YOUR CODE:
with open("RedCircle.txt") as connection:
    alltext = connection.read()
    
lines = alltext.splitlines()

n = 5    # number of blocks to print
a = 0    # start of the current block
b = 10   # end of the current block (not included)
for i in range(n):
    print(lines[a:b])
    a += 20
    b += 20

["Project Gutenberg's The Adventure of the Red Circle, by Arthur Conan Doyle", '', 'This eBook is for the use of anyone anywhere at no cost and with', 'almost no restrictions whatsoever.  You may copy it, give it away or', 're-use it under the terms of the Project Gutenberg License included', 'with this eBook or online at www.gutenberg.net', '', '', 'Title: The Adventure of the Red Circle', '']
['', '', '', '', 'Produced by David Brannan.  HTML version by Al Haines.', '', '', '', '', '']
['', '', '"Well, Mrs. Warren, I cannot see that you have any particular cause for', 'uneasiness, nor do I understand why I, whose time is of some value,', 'should interfere in the matter.  I really have other things to engage', 'me."  So spoke Sherlock Holmes and turned back to the great scrapbook', 'in which he was arranging and indexing some of his recent material.', '', 'But the landlady had the pertinacity and also the cunning of her sex.', 'She held her ground firmly.']
['', 'Holmes was accessible

Remember that when you "split" a string into a list of lines, the original string remains intact! The full content of the file is still available in the variable `alltext`.

After reading a whole file at once, it is not necessary to read it in again: Just reuse the variable where the text was originally stored, or the list of lines derived from it. 

### 3.3 Searching with regular expressions

We can examine our list of strings with the help of string indexing,
string methods like `.startswith()`, etc. But such methods are
very limited. We can do far more with python's _regular expressions_,
which use the popular and powerful "perl-style" regexp syntax.
To use them, import the module `re`. To check if a regular expression
matches some part of a string, use the function `re.search()`.

Here is a recipe for searching the lines of a file. Can you tell what
it is searching for?

In [48]:
import re
for line in lines:
    if re.search(r"\bshe\b", line):
        print(line)

"You arranged an affair for a lodger of mine last year," she said--"Mr.
The landlady drew an envelope from her bag; from it she shook out two
the lady who fainted on Brixton bus'--she does not interest me.  'Every
"It's a police matter, Mr. Holmes!" she cried.  "I'll have no more of
It was an excellent hiding-place which she had arranged.  The mirror
"Well, she saw something to alarm her.  That is certain.  The general
peering across.  He wants to be sure that she is on the lookout.  Now
she advanced, her face pale and drawn with a frightful apprehension,
"You have killed him!" she muttered.  "Oh, Dio mio, you have killed
him!"  Then I heard a sudden sharp intake of her breath, and she sprang
into the air with a cry of joy.  Round and round the room she danced,
a sight.  Suddenly she stopped and gazed at us all with a questioning
"But where, then, is Gennaro?" she asked.  "He is my husband, Gennaro
"I do not understand how you know these things," she said. "Giuseppe
lady's sleeve with 

Note the use of the "raw string" for the regular expression. The
symbol `\b` in a regexp specifies a word boundary, so that our regexp
will not match words like "sheep". But `"\b"` is one of the sequences
that have special meaning for strings as well. If we don't use
a raw string, python will convert the `\b` to a certain non-printing
character (a "backspace"), and the regular expression engine will
never see it.

**Always write your regular expressions as raw strings.** It's not
always a problem to use a regular string, but it is common enough, and
a source of very mysterious bugs.
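
To see the problem in action, the sketch below runs the same search twice on a made-up sample line, once with a raw string and once with a regular string. The variable names are chosen for this illustration only.

```python
import re

line = "and then she said nothing"

raw_match = re.search(r"\bshe\b", line)    # \b reaches the regexp engine as a word boundary
plain_match = re.search("\bshe\b", line)   # \b has already become a backspace character

print(raw_match is not None)     # True
print(plain_match is not None)   # False: the line contains no backspace characters

sheep_match = re.search(r"\bshe\b", "a flock of sheep")
print(sheep_match is not None)   # False: "sheep" is not the word "she"
```

Note that python reports no error for the second search; it simply never matches, which is what makes this kind of bug hard to find.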

### 3.4 Putting it all together

Here's a complete script that combines the above bits of code into a
working program:

In [49]:
import re

# Read text and turn it into a list of lines
with open("RedCircle.txt") as connection:
    alltext = connection.read()

lines = alltext.splitlines()

# Search, and print the lines that match the regexp
for line in lines:
    if re.search(r"\bshe\b", line):
        print(line)

"You arranged an affair for a lodger of mine last year," she said--"Mr.
The landlady drew an envelope from her bag; from it she shook out two
the lady who fainted on Brixton bus'--she does not interest me.  'Every
"It's a police matter, Mr. Holmes!" she cried.  "I'll have no more of
It was an excellent hiding-place which she had arranged.  The mirror
"Well, she saw something to alarm her.  That is certain.  The general
peering across.  He wants to be sure that she is on the lookout.  Now
she advanced, her face pale and drawn with a frightful apprehension,
"You have killed him!" she muttered.  "Oh, Dio mio, you have killed
him!"  Then I heard a sudden sharp intake of her breath, and she sprang
into the air with a cry of joy.  Round and round the room she danced,
a sight.  Suddenly she stopped and gazed at us all with a questioning
"But where, then, is Gennaro?" she asked.  "He is my husband, Gennaro
"I do not understand how you know these things," she said. "Giuseppe
lady's sleeve with 

**Your turn:** Write a loop that prints out lines containing the
pronoun `they` or `They`. Ensure you have already run the previous
block of code, and don't redefine or read anything unnecessarily.

In [50]:
# YOUR CODE:

# `re` and `lines` were already defined by the previous block,
# so there is no need to import or read anything again
for line in lines:
    if re.search(r"\b[tT]hey\b", line):
        print(line)

"Well, sir, they were that he was to have a key of the house. That was
"They were on his tray this morning.  I brought them because I had
beside the curb.  They drove him an hour, and then opened the door and
discovering their mistake they released him. What they would have done
so that they could not fail to catch the eye.
upon Mr. Warren further shows that the enemy, whoever they are, are
small flame across the window as the signals were renewed. They came
"Yes, from that window.  They broke off in the middle.  We came over to
should they refuse the money. It seems that Castalotte, our dear friend
system to punish those whom they feared or hated by injuring not only
their own persons but those whom they loved, and it was the knowledge
stories of his dreadful powers.  If ever they were exerted it would be
certain window, but when the signals came they were nothing but


### 3.5 Building and using lists

Often we need to collect what we find, not just print it out.
We already know how to create a new list, containing just the lines
(or words) we want.

### Your turn:

Use a loop or list comprehension to create a list, `pronouns`, of all
lines containing the pronouns "he" or "she".

In [53]:
# YOUR CODE:
pronouns = []

# Collect (rather than print) the lines that match the regexp;
# `re` and `lines` are already defined by the earlier blocks
for line in lines:
    if re.search(r"\b([Hh]e|[Ss]he)\b", line):
        pronouns.append(line)

print(pronouns)

['in which he was arranging and indexing some of his recent material.', 'She held her ground firmly.', '"You arranged an affair for a lodger of mine last year," she said--"Mr.', '"But he would never cease talking of it--your kindness, sir, and the', 'nervous over it as I am, but he is out at his work all day, while I get', 'no rest from it.  What is he hiding for?  What has he done?  Except for', 'shoulder.  He had an almost hypnotic power of soothing when he wished.', 'into their usual commonplace.  She sat down in the chair which he had', '"If I take it up I must understand every detail," said he.  "Take time', '"He asked my terms, sir.  I said fifty shillings a week.  There is a', '"He said, \'I\'ll pay you five pounds a week if I can have it on my own', 'money meant much to me.  He took out a ten-pound note, and he held it', "long time to come if you keep the terms,' he said.  'If not, I'll have", '"Well, sir, they were that he was to have a key of the house. That was', 'all right.

Let's check your results: There should be either 134 or 166 lines in the list, depending on whether you included capitalized pronouns in your search.

In [54]:
print("I found pronouns on", len(pronouns), "lines")

I found pronouns on 166 lines


**Your turn:**
Build a list of all lines in `RedCircle.txt` that contain the word
"some" (or "Some"). Then print each line found, with an increasing
number (0, 1, 2, ...) in front.

In [71]:
# YOUR CODE:

sommige = []

# Collect the lines that match the regexp
# (`re` and `lines` are already defined by the earlier blocks)
for line in lines:
    if re.search(r"\b[Ss]ome\b", line):
        sommige.append(line)

# Print each line found, with an increasing number in front.
# The counter must be set to 0 before the loop, not inside it.
getal = 0
for line in sommige:
    print(getal, line)
    getal += 1

0 uneasiness, nor do I understand why I, whose time is of some value,
1 in which he was arranging and indexing some of his recent material.
2 "Exactly.  There was evidently some mark, some thumbprint, something
3 until we have some reason to think that there is a guilty reason for
4 "There are certainly some points of interest in this case, Watson," he
5 little more possible.  Listen to this:  'Be patient.  Will find some
6 which told of some new and momentous development.
7 It is clear now that some danger is threatening your lodger.  It is
8 danger is the rigour of their precautions. The man, who has some work
9 "This is serious, Watson," he cried.  "There is some devilry going
10 "We must define the situation a little more clearly.  It may bear some
11 York, and I've been close to him for a week in London, waiting some
12 street, or in some way come to understand how close the danger w

### 3.6 List comprehensions

A "list comprehension" is a very powerful alternative to a
list-building loop: essentially, a loop inside the new list generates
all its elements, without using temporary variables or calls to the
`append()` method.

In [72]:
squares = [ n*n for n in range(10) ]
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


The above makes a list with the squares of the numbers up to 9. It is equivalent to the following regular loop:

In [73]:
squares = [ ]
for n in range(10):
    squares.append(n*n)
print(squares)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


Comprehensions can include a test. Only elements that satisfy the `if`
clause are added to the list.

In [75]:
pronouns = [x for x in lines if re.search(r"\b(he|she)\b", x)]

The above one-liner is equivalent to our earlier list-building loop:

In [76]:
pronouns = []
for x in lines:
    if re.search(r"\b(he|she)\b", x):
        pronouns.append(x)

You don't _have_ to use comprehensions, but they make life easier
and the NLTK book really likes them. Learn to read them, but use them
only when you feel ready to.

**Your turn:** Use a list comprehension to build a list of all lines in _The Red Circle_ that are less than 20 characters long, but **not empty**. Print a message saying how many such lines you found. 

Check your work by printing the first 10 lines you collected.

In [85]:
# YOUR CODE

# Keep the lines that are shorter than 20 characters, skipping empty ones
# (`lines` is already defined by the earlier blocks)
samen = [x for x in lines if x != "" and len(x) < 20]

# Check: the first 10 lines collected
tien = samen[0:10]
print(tien)

['Language: English', 'By', 'Fairdale Hobbs."', 'only would."', 'nerves can stand."', 'indicated.', 'lodging?"', 'house."', '"Well?"', '"But his meals?"']


### 3.7 Scenario 2: Processing text word by word

To examine the individual words in our file, we can split up
the file text into a list of words. We already know how to split a
string: with the method `split()`.

In [83]:
with open("RedCircle.txt") as connection:
    alltext = connection.read()  # A single string

textwords = alltext.split() # A list of (short) strings
print(textwords[0:20])

['Project', "Gutenberg's", 'The', 'Adventure', 'of', 'the', 'Red', 'Circle,', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone']


This method is useful for counting or searching for
individual words. But as you know, the string method `split()` will
break up our string at white spaces. If a word is followed by a comma,
parentheses, etc., these will remain as part of the "word". We can
fine-tune the division into words by using the more powerful function
`re.split()`. Take a quick look at its docstring:

In [86]:
help(re.split)

Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.



A "pattern" in this documentation is a regular expression.
`re.split()` will break a string wherever the regular expression
matches.

Because `split()` normally removes the separator characters that it
splits on, we can split on a regexp that matches all space or
punctuation, that is, everything that is not part of a word. This
discards everything except the words. (Note that the apostrophe in
"It's" counts as a word separator.) Make sure you know what this regexp
matches:

In [87]:
example = "Some text (containing punctuation). It's an 'example'..."
simple = example.split()
advanced = re.split(r"\W+", example)

print("Simple split:", simple)
print("Regexp split:", advanced)

Simple split: ['Some', 'text', '(containing', 'punctuation).', "It's", 'an', "'example'..."]
Regexp split: ['Some', 'text', 'containing', 'punctuation', 'It', 's', 'an', 'example', '']


There is a wrinkle to the list returned by `re.split()`: Notice the
empty string at the end. This is because our example ended with
punctuation, and `split()` always returns "results" for both sides of
any separator it finds. If our sentence started with space or
punctuation, we'd have gotten an empty string at the start too. We can
check for empty strings and delete them from the list, but usually
they do no harm.
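
When the empty strings do get in the way, one possible way to drop them is a list comprehension with a test, as sketched below on the same example sentence. The second half shows an alternative that is not otherwise covered in this notebook: the function `re.findall()` collects the pieces that match a regexp, instead of splitting on the gaps between them.

```python
import re

example = "Some text (containing punctuation). It's an 'example'..."

# Keep only the non-empty strings returned by re.split()
words = [w for w in re.split(r"\W+", example) if w]
print(words)

# Alternative: collect every run of word characters directly
found = re.findall(r"\w+", example)
print(found)
```

Both lists contain the same eight words, with no empty string at the end.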

### Your turn:

Use the regular expression `r"\W+"` to split the string `alltext` into
a list of words, discarding all punctuation. Save your list of words
as the variable `textwords`.

In [89]:
# YOUR CODE:

textwords = re.split(r"\W+", alltext)
print(textwords)



What if we also want to be able to see the punctuation? What if we
don't want to break up words like "It's" and "don't", that contain an
apostrophe? No problem, regular expressions can do all this. But it
gets complicated. Here's an expression that will break up a string
into tokens, where a token is either a sequence of punctuation or a
word that may contain _internal_ apostrophes (but not at the start or
end). Yeah, it's complicated. Usually it's enough to split on
whitespace, or with the regexp `r"\W+"`.

In [90]:
tokens = [w for w in re.split(r"\s+|((?:\w|(?<=\w)'(?=\w))+)", example) if w]
for tok in tokens:
    print(tok, end="  ")

Some  text  (  containing  punctuation  ).  It's  an  '  example  '...  

### Your turn:

Scan the list `textwords` and print all words that end with `ings`.
Separate words with spaces, not newlines.

In [None]:
# YOUR CODE:



Because we have tokenized our text so carefully, there is no
punctuation stuck to the end of words and we can search with simple
string methods. But solutions that use regexps are also correct.
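
As an illustration of the string-method approach, here is a sketch on a short made-up word list (`sample_words` is not taken from the story). It uses the string method `endswith()` and prints the matches separated by spaces; the list `found` is kept only so that we can inspect the result afterwards.

```python
sample_words = ["The", "lodgings", "were", "quiet", "evenings", "passed", "things", "ring"]

found = []
for w in sample_words:
    if w.endswith("ings"):
        found.append(w)
        print(w, end=" ")   # a space instead of a newline after each word
print()                     # finish the output line

print(len(found), "words end in 'ings'")
```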

### 3.8 Scenario 3: Searching for sentences

As linguists, we are rarely interested in lines per se; lines are a
convenient way to see a word used in context, but what we are really
after are sentences. But splitting up a text into sentences is
complicated: A variety of punctuation marks could end a sentence, and
worse, periods don't always terminate a sentence.

> "Well, well, Mrs. Warren, let us hear about it, then.  You don't
> object to tobacco, I take it?  Thank you, Watson--the matches! You are
> uneasy, as I understand, because your new lodger remains in his rooms
> and you cannot see him.  Why, bless you, Mrs. Warren, if I were your
> lodger you often would not see me for weeks on end."
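
To see the difficulty concretely, here is a deliberately naive splitter that breaks the text after every period, question mark or exclamation mark that is followed by whitespace. It is only an illustration of what goes wrong, not a recommended method; the names `fragment` and `naive` are made up for this sketch.

```python
import re

# The start of the passage quoted above, joined into one string.
fragment = ('"Well, well, Mrs. Warren, let us hear about it, then.  '
            "You don't object to tobacco, I take it?  "
            'Thank you, Watson--the matches!"')

# Naive rule: a sentence ends at ".", "?" or "!" followed by whitespace.
naive = re.split(r"(?<=[.?!])\s+", fragment)

for piece in naive:
    print(repr(piece))
```

The first "sentence" comes out as `'"Well, well, Mrs.'`: the period of the abbreviation was taken for the end of a sentence.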

The problem is actually quite hard. We'll borrow a sentence-splitting
function from the Natural Language Toolkit (nltk). (The NLTK is not a
standard part of Python, but it is included with Anaconda.) It uses
statistical methods to decide, with reasonable success, where
sentences end. Before we use it on our computer for the first time, we
must download the statistical "model" that the sentence-splitter
relies on. (This is a ONE-TIME thing; the model is then available
until you remove it or change computers.)

In [None]:
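import nltk
# One-time download of the sentence-splitting model.
# If sent_tokenize() later reports a missing resource, download the one named
# in the error message instead (recent NLTK releases use "punkt_tab").
nltk.download("punkt")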

We can now call the function `sent_tokenize()`, which processes a
string and gives us a list of strings, each containing a single
sentence.

In [None]:
import nltk
sentences = nltk.sent_tokenize(alltext)
# Here is our sample text fragment
for s in sentences[15:20]:
    print(s)
    print("****")

The splitting was successful, but our sentences still contain newlines where the file's lines ended! To
simplify printing and searching with regular expressions, we can
replace each newline with a space.

In [None]:
cleansents = [ s.replace("\n", " ") for s in sentences ]

Note that we used a list comprehension to call each string's `replace()` method, and collect the results.

The string method `replace()` will replace all instances of its first argument with a copy of the second. An optional third argument can be used to specify a maximum number of replacements, e.g., `s.replace("\n", " ", 1)` will only replace one newline in `s`.
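
A small sketch of the difference the third argument makes. The string `s` here is a made-up three-line example, and `repr()` is used so that any remaining newline shows up as `\n`.

```python
# Made-up sample with two newlines in it.
s = "first line\nsecond line\nthird line"

all_replaced = s.replace("\n", " ")       # every newline becomes a space
one_replaced = s.replace("\n", " ", 1)    # only the first newline is replaced

print(repr(all_replaced))
print(repr(one_replaced))
```

Note that `replace()` returns a new string; `s` itself is unchanged.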

Once we have a list of sentences, we can use it just like our list of
lines before: It's just another list of multi-word strings.
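
For instance, searching a list of sentences looks exactly like searching a list of lines. The sketch below uses a short hand-made list, `demo_sents`, standing in for `cleansents` (the sentences were split by hand from the passage quoted earlier, so this is an assumption about what the splitter would produce); `hits` is also a made-up name.

```python
import re

# Hand-made stand-in for the list of cleaned-up sentences.
demo_sents = [
    '"Well, well, Mrs. Warren, let us hear about it, then.',
    "You don't object to tobacco, I take it?",
    "Thank you, Watson--the matches!",
    "You are uneasy, as I understand, because your new lodger remains "
    "in his rooms and you cannot see him.",
    'Why, bless you, Mrs. Warren, if I were your lodger you often would '
    'not see me for weeks on end."',
]

# Collect every sentence that contains the word "lodger".
hits = [s for s in demo_sents if re.search(r"\blodger\b", s)]

for s in hits:
    print(s)
    print("****")
```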

### 3.9 Case-insensitive search

Matching words regardless of capitalization is a common task. We have practiced writing character ranges, e.g. to match the words `him` and `Him` you might use the regexp `/[Hh]im/`. This is tiresome and error-prone: it is easy to forget to do it, and it won't even find words like `HIM` that are capitalized in their entirety. Often it is preferable to make our entire search ignore case.

The simplest way to do that is to force our strings to lower case. Instead of matching against a string `line`, search a lowercased version of it:

In [None]:
line = '"LET HIM GO!", I said.'

if re.search(r"\bhim\b", line.lower()):
    print(line)

        
It is also possible to tell the regexp engine to ignore case entirely, by passing the flag `re.I` (or its long version, `re.IGNORECASE`) as a third argument to `re.search`:

In [None]:
if re.search(r"\bhim\b", line, re.IGNORECASE):
    print(line)

Instead of adding the flag, we can embed an "ignore-case" directive, `(?i)`, at the very start of the regexp string itself. This is sometimes necessary, for example if we want to write a loop where a single statement executes a whole list of regexps, only some of which should ignore case.

In [None]:
if re.search(r"(?i)\bhim\b", line):
    print(line)
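
Here is a sketch of the kind of loop where the embedded directive pays off. The list `patterns` is hypothetical: only its first regexp asks to ignore case, yet the same `re.search()` call handles all of them. The name `matched` is also made up for the illustration.

```python
import re

line = '"LET HIM GO!", I said.'

# Hypothetical search list: only the first pattern ignores case.
patterns = [r"(?i)\bhim\b", r"\bgo\b", r"\bsaid\b"]

matched = []
for pat in patterns:
    # One statement runs every pattern; no flag argument is needed.
    if re.search(pat, line):
        matched.append(pat)

print(matched)
```

The second pattern finds nothing, because without the directive `go` does not match `GO`.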

## 4. Search tasks

### 4.1 Preparation: Download a story

1. Go to [www.gutenberg.org](http://www.gutenberg.org) and choose an English-language text you want to work with. (But don't get carried away and spend too long on this step! If you can't decide, try any novel by Jane Austen.)

2. Follow the link "More Files..." and find the non-UTF plain text version, which ends in `.txt`.

3. Download the file and save it in the folder that contains this Notebook.

4. From your computer's file manager, open the folder where you saved the file, open the file in a plain text editor like Notepad, and ensure that you can read it. You can edit out the Project Gutenberg license if you wish. Now you can close it.


### 4.2 Search for words in your story

Write code to carry out the following tasks. Use a different, MEANINGFUL
variable name for each list you create. 

**Task 1:** Read in your file

1. Open your text file from python and read it in as a single string.
   Close the file and **do not read it again** for the rest of the
   tasks.

2. Split it up as a list of lines, and as a list of words.

For the rest of the tasks in this section you can decide which of the two lists to
use, and whether to use regular expressions or another (possibly
simpler) method. (If the "words" were generated with `split()`, results may be slightly different.)

In [None]:
# YOUR CODE:



**Task 2:** Print all words that:

    a) begin with "un"
    b) end in "ions"
    c) contain "ck" anywhere in the word
    d) have exactly four letters (including any attached punctuation)

   Print a message before each search, saying what you'll look for.
   For better readability, print each group of words on one line.

In [None]:
# YOUR CODE:



**Task 3 (Challenge):**  Find and print all words that contain "ck" _inside_ the word, not at the end. (There are no words that begin with "ck".)

In [None]:
# YOUR CODE:



# The rest of this notebook is optional

### 4.3 Search for lines

Word matching tasks are often equally easy to do with regular expressions or with ordinary string methods. 
Searching in lines, however, is best approached with the help of regular expressions. Use regular expressions to solve the following tasks.

**Task 4:** Print all lines that contain a word ending in "ion".

In [None]:
# YOUR CODE:



**Task 5:** Print all lines containing any form of the verb go: _go,
goes, went,_ etc. (Take a minute to make a list of such forms.)
   Careful not to match partial words! ("Long ago" etc.) Use just one
loop to do this.

In [None]:
# YOUR CODE:



**Task 6:** Choose the names of 2-3 main characters from your story.
   Count and report how many times each of the names is used.

In [None]:
# YOUR CODE:



### 4.4 Other searches

**Task 7:**  Open the file `RedCircle.txt` and **build a list** of the
lines that contain both the pronoun "he" and the word "sir" on the
same line. (There are eight of them-- don't forget capitalized words.) 
Then print out each line of this list.

Recall that an `if`-test can consist of several independent tests
combined with `and` or `or`. You might find this simpler than building
a single regular expression.
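
As a reminder of what such a combined test looks like, here is a sketch on made-up data with different words than the task asks for. The list `demo_lines` and the words "cat" and "dog" are invented for the illustration; adapt the idea to the task yourself.

```python
import re

# Made-up lines, not taken from any story.
demo_lines = [
    "The cat sat on the mat.",
    "A dog barked at the cat.",
    "Dogs and cats disagree.",
    "The DOG ignored the Cat.",
]

both = []
for line in demo_lines:
    # Two independent tests joined with "and": both must succeed.
    if re.search(r"(?i)\bcat\b", line) and re.search(r"(?i)\bdog\b", line):
        both.append(line)

for line in both:
    print(line)
```

The third line is not selected: `\b` requires a word boundary, so "Dogs" and "cats" do not count as the words "dog" and "cat".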

In [None]:
# YOUR CODE:



**Task 8:** Count the number of *lines* that contain the pronoun "he",
   lines that contain the pronoun "she", and lines that contain both pronouns.
   Don't forget about capitalization and punctuation.

In [None]:
# YOUR CODE:



## 5. What we learned

### What you'll need to know by heart

Practice with this notebook until you can do the following without having to look anything up:

- Use `import` to import optional Python modules.
- Display the help text for a function or module.
- Open a file, read its contents, and split it into a list of words or lines.
- Search for words or lines using a regular expression, and collect the results in a list.
- Print out a list of results "nicely".
- Use the `repr()` function to inspect a string or other variable.

### What you should remember you saw

Here are some things you will not need as often; make sure you
remember that they exist, and where you saw them, and you can always
google them or come back to this notebook when you need them.

- How to provide help text for your own functions
- The list comprehension syntax
- There are ways to split a long text into sentences