# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do some of these things, specifically calculating word frequencies and showing key-words-in-context (concordances). We'll do this for individual files and then you'll work together to write Python code which does this for a larger corpus of texts.

In [1]:
import os
import re
import string
from collections import Counter, OrderedDict

__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [2]:
path = os.path.join("..", "data", "Dickens_Expectations_1861.txt")

We then need to load the file that we want to work with.

There are a number of ways to do this in Python, but the following should be considered "best practice".
The ```with``` statement closes the file automatically once the block ends, and stating the mode and encoding explicitly is useful when working with many different files.

In [15]:
with open(path, "r", encoding = "utf-8-sig") as file: #r stands for read, so we open the file in read mode and assign it to the variable file
    text = file.read() #read the file and assign it to the variable text
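
Why ```utf-8-sig``` rather than plain ```utf-8```? Some text files begin with an invisible byte order mark (BOM). The sketch below writes a made-up temporary file (```demo.txt``` is not part of the course data) and suggests what difference the two encodings make when such a mark is present.

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (BOM).
# The file name and contents are invented for this demonstration.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbfGreat Expectations")

    # Plain "utf-8" keeps the BOM as an invisible first character
    with open(demo_path, "r", encoding="utf-8") as f:
        with_bom = f.read()

    # "utf-8-sig" drops the BOM if there is one
    with open(demo_path, "r", encoding="utf-8-sig") as f:
        without_bom = f.read()

print(len(with_bom), len(without_bom))  # the first string is one character longer
```

If a file has no BOM, ```utf-8-sig``` reads it just like ```utf-8```, so it is a safe default for reading.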

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [16]:
print(text[:300]) # print the first 300 characters of the text
text[:300] # without print(), Jupyter displays the string's repr instead: wrapped in quotes, with newlines shown as \n

REAT EXPECTATIONS
 1867 Edition 
by Charles Dickens
Chapter I
My father's family name being Pirrip, and my Christian name Philip, my
infant tongue could make of both names nothing longer or more explicit
than Pip. So, I called myself Pip, and came to be called Pip.
I give Pirrip as my father's famil


"REAT EXPECTATIONS\n 1867 Edition \nby Charles Dickens\nChapter I\nMy father's family name being Pirrip, and my Christian name Philip, my\ninfant tongue could make of both names nothing longer or more explicit\nthan Pip. So, I called myself Pip, and came to be called Pip.\nI give Pirrip as my father's famil"

You can see that the formatting is a little messy: the text is full of newline characters (```\n```).

We can get rid of those by using the ```.replace()``` method on strings.

In [17]:
text = text.replace("\n", " ") #replace all newlines (\n) with spaces
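
Note that ```.replace()``` swaps each newline for a space but leaves any neighbouring spaces in place, so some double spaces remain. If that matters, one possible alternative (a sketch, not the only way) is to split on whitespace and join the pieces again. The sample string is taken from the start of the output above.

```python
sample = "REAT EXPECTATIONS\n 1867 Edition \nby Charles Dickens"

# Replacing each newline with a space keeps the neighbouring spaces too
replaced = sample.replace("\n", " ")

# split() with no argument splits on any run of whitespace,
# so joining the pieces with a single space collapses the runs
collapsed = " ".join(sample.split())

print(repr(replaced))
print(repr(collapsed))
```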

__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [19]:
tokens = text.split() # with no argument, split() splits on any run of whitespace and returns a list of words
tokens[:10] # show the first 10 tokens

['REAT',
 'EXPECTATIONS',
 '1867',
 'Edition',
 'by',
 'Charles',
 'Dickens',
 'Chapter',
 'I',
 'My']
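
Splitting on whitespace leaves punctuation attached to the tokens. As a rough alternative, here is a small sketch that pulls out word-like tokens with ```re.findall()```. The function name ```simple_tokenize``` and the pattern are our own choices for illustration: the pattern only covers ASCII letters and digits, and it splits hyphenated words in two.

```python
import re

def simple_tokenize(text):
    """Return lowercase word tokens, keeping apostrophes inside words."""
    # Letters or digits, optionally followed by 'letters (father's, don't)
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

sample = "My father's family name being Pirrip, and my Christian name Philip."
print(simple_tokenize(sample))
```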

We could also split the text into rough sentences by naively splitting on periods.

In [21]:
sentences = text.split(".") # split the text on periods into a list of rough sentences
sentences[:10] # show the first 10 sentences


__Get sentences with regex__

We can use similar logic to split the data into sentences more carefully.

This time we use a bit of ```regex``` to do the string splitting, so that we split on ```!``` and ```?``` as well as ```.```.

In [24]:
sentences = re.split(r'[.!?]\s*', text) # split on ., ! or ? plus any whitespace that follows (\s* matches zero or more spaces, tabs, or newlines)
sentences[:3]

["REAT EXPECTATIONS  1867 Edition  by Charles Dickens Chapter I My father's family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip",
 'So, I called myself Pip, and came to be called Pip',
 "I give Pirrip as my father's family name, on the authority of his tombstone and my sister, - Mrs"]

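As the output shows, the simple pattern treats the period in "Mrs." as the end of a sentence. One way to soften this, sketched below under the assumption that a short hand-made abbreviation list is acceptable, is to scan for sentence-final punctuation and skip boundaries that follow a known abbreviation. The names ```ABBREVIATIONS``` and ```split_sentences``` and the sample text are made up for illustration.

```python
import re
import string

# Hypothetical, deliberately short list; extend it for a real corpus
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "st"}

def split_sentences(text):
    """Split after . ! ? followed by whitespace, except after known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        # Look at the word immediately before the punctuation
        words = text[start:match.start()].split()
        last_word = words[-1].strip(string.punctuation).lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # not a sentence boundary, keep scanning
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever is left after the last boundary
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "So, I called myself Pip. My sister, Mrs. Joe, was not pleased! Was she ever?"
for sentence in split_sentences(sample):
    print(sentence)
```

Unlike ```re.split()```, this version keeps the closing punctuation with each sentence. It will still fail on abbreviations missing from the list.
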
## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and keeping a running count.

Alternatively, we can let ```Counter``` from Python's built-in ```collections``` module do the counting for us. Both approaches are shown in the next cell.

In [42]:
# One method: count the raw tokens (capitalization and punctuation are left as they are)
def CountWordFrequency(text, n = 10):
    tokens = text.split()
    word_counts = Counter(tokens)
    return word_counts.most_common(n) # most_common() returns a list of tuples, where each tuple is a word and its count

print(CountWordFrequency(text, 5))

# Another method (Ross method)
counter = 0
keyword = "love"

# For every token in the list of tokens
for token in tokens:
    # Strip punctuation
    stripped = token.strip(string.punctuation)
    # Convert to lowercase
    lowered = stripped.lower()
    # If the token is the keyword
    if lowered == keyword:
        # if yes, add 1 to the counter
        counter += 1
print(f'Number of times "{keyword}" occurs: {counter}')

[('the', 7753), ('and', 6566), ('I', 5761), ('to', 4978), ('of', 4349)]
Number of times "love" occurs: 60


In [41]:
cleaned = []
for token in tokens:
    # Strip punctuation
    stripped = token.strip(string.punctuation)
    # Convert to lowercase
    lowered = stripped.lower()
    # add to new list
    cleaned.append(lowered)

Counter(cleaned).most_common(5)

[('the', 8143), ('and', 7078), ('i', 6484), ('to', 5079), ('of', 4431)]
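
The counts differ between the two cells because ```.strip(string.punctuation)``` removes punctuation from both ends of a token (but not from the middle), and ```.lower()``` merges capitalized and lowercase forms. A toy example with made-up tokens:

```python
import string
from collections import Counter

# Invented tokens, similar to what text.split() can produce
raw_tokens = ["Love,", "love", '"love!"', "don't", "Pip."]

# Strip punctuation from both ends, then lowercase
cleaned_tokens = [tok.strip(string.punctuation).lower() for tok in raw_tokens]
print(cleaned_tokens)

# Compare the count for "love" before and after cleaning
raw_count = Counter(raw_tokens)["love"]
cleaned_count = Counter(cleaned_tokens)["love"]
print(raw_count, cleaned_count)
```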

We can use similar logic to find all sentences in which a certain keyword appears, and to count how many sentences that is.

In [73]:
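def wordsinsentences(text, word):
    # count the sentences in which word occurs
    sentences = re.split(r'[.!?]\s*', text)
    counter = 0
    for sentence in sentences:
        # note: "in" is a case-sensitive substring test, so "love" also matches "loved" and "gloves"
        if word in sentence:
            counter += 1
    return counter

#print(wordsinsentences(text, "love"))

# the following function prints each sentence in which the word appears, numbered in order of appearance
def SentencesWordAppears(text, word):
    sentences = re.split(r'[.!?]\s*', text.lower()) # lowercase first so that the match ignores case
    counter = 0
    for sentence in sentences:
        if word in sentence: # substring test again
            counter += 1
            print(counter, sentence)

SentencesWordAppears(text, "love") # the function prints as it goes and returns None, so we don't wrap the call in print()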

1 but i loved joe, - perhaps for no better reason in those early days than because the dear fellow let me love him, - and, as to him, my inner self was not so easily composed
2 she had not quite finished dressing, for she had but one shoe on, - the other was on the table near her hand, - her veil was but half arranged, her watch and chain were not put on, and some lace for her bosom lay with those trinkets, and with her handkerchief, and gloves, and some flowers, and a prayer-book all confusedly heaped about the looking-glass
3 " "cousin raymond," observed another lady, "we are to love our neighbor
4 chokings and nervous jerkings, however, are nothing new to me when i think with anxiety of those i love
5 i have taken to the sofa with my staylace cut, and have lain there hours insensible, with my head over the side, and my hair all down, and my feet i don't  know where - "  "much higher than your head, my love," said mr
6 "you see, my dear," added miss sarah pocket  a blandly vicious pe
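
Because ```in``` tests for substrings, the second match above comes from the word "gloves". If we only want whole words, one option is a regex with word boundaries. This is a sketch; ```sentences_with_word``` is a hypothetical helper name and the sample text is invented.

```python
import re

def sentences_with_word(text, word):
    """Return the sentences that contain `word` as a whole word, ignoring case."""
    sentences = re.split(r"[.!?]\s*", text)
    # \b marks a word boundary; re.escape guards against special characters in `word`
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sample = "I loved Joe. Her gloves lay on the table. We are to love our neighbor."
print(sentences_with_word(sample, "love"))
```

The trade-off is that inflected forms such as "loved" are no longer counted, which may or may not be what we want.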

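Python's built-in tools, such as the ```Counter``` used above, make it easy to count how many times a token appears in a list.

There are some problems, though! The counts depend on how the tokens were cleaned: compare the raw counts with the lowercased, punctuation-stripped counts above.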

## Viewing keywords in context (KWIC, concordancing)

In [74]:
cleaned = []
for token in tokens:
    # Strip punctuation
    stripped = token.strip(string.punctuation)
    # Convert to lowercase
    lowered = stripped.lower()
    # add to new list
    cleaned.append(lowered)

keyword = "love"

# for every token
for idx, token in enumerate(cleaned): # enumerate() yields each token together with its index
    if token == keyword: # if the token is the keyword
        before = ' '.join(cleaned[max(0, idx-5):idx]) # join the 5 words before the keyword; max() stops the slice start from going negative near the start of the text
        after = ' '.join(cleaned[idx+1:idx+6]) # join the 5 words after the keyword
        full = [before, token, after] # collect the left context, the keyword, and the right context
        print("{:50} {:10} {:50}".format(*full)) # pad the fields to 50, 10, and 50 characters (left-aligned) so that the keywords line up in a column

the dear fellow let me                             love       him  and as to                                    
another lady we are to                             love       our neighbor sarah pocket returned                
with anxiety of those i                            love       if i could be less                                
higher than your head my                           love       said mr camilla i have                            
expect to thank you my                             love       without expecting any thanks or                   
seen the object of one's                           love       and duty for even so                              
get myself to fall in                              love       with you  you don't                               
a certain man who made                             love       to miss havisham i never                          
haughty and too much in                            love       to be advised by any              
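
Concordances are often printed with the left context pushed up against the keyword, which can make them easier to scan. Python's format specification supports this with alignment flags. A small sketch (the helper ```format_kwic_line``` and its ```width``` parameter are our own illustration, not part of the code above):

```python
def format_kwic_line(before, keyword, after, width=30):
    """Format one concordance line with the left context right-aligned."""
    # > right-aligns, ^ centres, < left-aligns within the given width
    return f"{before:>{width}} {keyword:^10} {after:<{width}}"

line = format_kwic_line("the dear fellow let me", "love", "him and as to")
print(line)
```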

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus. 
  - Bonus: Your output should match the results above, with an additional column showing the filename
  - Bonus: Write your function in such a way that a user can define the context window size to display.