# Working with strings

It is possible to extract quite a lot of interesting, structured information from text data simply by using string processing techniques.

In this session, we'll see how to do two of these things: calculating word frequencies and showing keywords in context (concordances). We'll do this for individual files and then you'll work together to write Python code which does the same for a larger corpus of texts.

In [183]:
import os
import re
import string
from collections import Counter, OrderedDict

__Loading text files__

We start by defining a filepath using ```os.path.join()``` like we saw last week.

In [184]:
path = os.path.join('..', 'data', 'Dickens_Expectations_1861.txt')

We then need to load the file that we want to work with.

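There are a number of ways to do this in Python, but the following is generally considered best practice: the ```with``` statement closes the file automatically once the block finishes, and the encoding is stated explicitly instead of relying on the platform default.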

In [185]:
with open(path, 'r', encoding='utf-8-sig') as f:
    text = f.read()
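
To see what the ```'utf-8-sig'``` encoding does differently from plain ```'utf-8'```, here is a minimal sketch. The temporary file, its name and its contents are made up for this illustration and are not part of the course data.

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (BOM).
# The file name and contents are invented for this example.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'bom_demo.txt')
with open(demo_path, 'wb') as f:
    f.write(b'\xef\xbb\xbfGREAT EXPECTATIONS')

# 'utf-8' keeps the BOM as an invisible first character ...
with open(demo_path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# ... while 'utf-8-sig' drops it if it is present.
with open(demo_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

print(repr(with_bom[:6]))
print(repr(without_bom[:6]))
```

If a file has no BOM, both encodings should give the same string, so ```'utf-8-sig'``` is a reasonably safe default for UTF-8 text files of unknown origin.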

When we load the text file, we just have a simple string object which can be indexed and sliced.

In [186]:
print(text[0:100])

REAT EXPECTATIONS
 1867 Edition 
by Charles Dickens
Chapter I
My father's family name being Pirrip, 


You can see that the formatting is a little messy, with lots of newline characters breaking up the text.

We can get rid of those by using the ```.replace()``` method on strings.

In [187]:
text = text.replace('\n', ' ')
print(text[0:100])

REAT EXPECTATIONS  1867 Edition  by Charles Dickens Chapter I My father's family name being Pirrip, 


__Tokenize text__

So far, we have one long string of characters. But we want to be able to work with individual words. To do that, we have to *tokenize* our data - in other words, to split it into individual tokens (or words).

In [188]:
words = text.split(' ')
words = [w for w in words if w != '']
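
As a small aside, calling ```.split()``` with no argument splits on any run of whitespace, which may save the extra filtering step. The sample string below is made up for illustration.

```python
# A made-up sample with a double space, a newline and a tab
sample = "My  father's family\nname\tbeing Pirrip"

# Splitting on a single space produces empty strings (filtered out here)
# and leaves other whitespace characters inside the tokens
by_space = [w for w in sample.split(' ') if w != '']

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

In the notebook the newlines have already been replaced with spaces, so both approaches should behave much the same there; the difference only shows when tabs or other whitespace remain in the text.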

__Get sentences with regex__

We can use similar logic to split the data into separate sentences.

This time we use a regular expression (```regex```) to split the string wherever sentence-final punctuation appears.

In [189]:
sentences = re.split(r'[.?!]\s*', text)
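
To get a feel for what this pattern does, here is a minimal sketch on a made-up sample string. It also shows two things to watch out for: abbreviations such as "Mr." are treated as sentence ends, and a trailing empty string appears when the text ends with punctuation.

```python
import re

# A made-up sample, not taken from the novel
sample = 'I was afraid. Was he there? Yes! Mr. Pumblechook came in.'

# Split on '.', '?' or '!' followed by any amount of whitespace
parts = re.split(r'[.?!]\s*', sample)
print(parts)

# Drop the empty string left over after the final full stop
demo_sentences = [p for p in parts if p]
print(len(demo_sentences))
```

For a first pass at a concordance this simple splitter is probably good enough, but the "Mr." case shows why it should not be treated as a reliable sentence segmenter.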

## Find word frequencies

We can count how many times an individual word appears manually, simply by iterating over the list of tokens and using a counter. 

To do this, we loop over the tokens, strip any surrounding punctuation with ```.strip(string.punctuation)```, and compare the lowercased result with the keyword.

In [190]:
# DIY <33
counter = 0
keyword = 'love'

for word in words:
    stripped = word.strip(string.punctuation)
    if stripped.lower() == keyword:
        counter += 1

print(counter)

60


In [191]:
# using Counter
def clean_word(word):
    return word.strip(string.punctuation).lower()

cleaned = [clean_word(w) for w in words]

counter = Counter(cleaned)
counter['love']

60
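
A ```Counter``` can do more than look up a single word. The sketch below uses a made-up sentence to show ```.most_common()``` and what happens when we ask for a word that never occurs.

```python
import string
from collections import Counter

# A made-up sample, not taken from the novel
sample = 'I love her. She said, "Love me!" and I said I would.'

# Same cleaning idea as above: strip punctuation and lowercase
tokens = [w.strip(string.punctuation).lower() for w in sample.split()]
tokens = [t for t in tokens if t]  # drop tokens that were only punctuation

counts = Counter(tokens)

# The three most frequent tokens with their counts
print(counts.most_common(3))

# A missing key gives 0 instead of raising a KeyError
print(counts['hate'])
```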

We can use a similar logic to find all sentences where a certain keyword appears.

In [192]:
keyword_sentences = []

for sentence in sentences:
    sentence = sentence.lower()

    # Require a non-alphanumeric character on both sides of the keyword so that
    # it is not matched inside another word. Note that this also misses a
    # keyword at the very start or end of a sentence.
    if re.search(pattern=f'[^A-Za-z0-9]{keyword}[^A-Za-z0-9]', string=sentence):
        keyword_sentences.append(sentence)

print(len(keyword_sentences))

44
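
An alternative worth trying is the regex word boundary ```\b```, which matches at the edge of a word without needing a character on either side. The sketch below compares the two patterns on a few made-up sentences; it is an illustration, not a rerun of the count above.

```python
import re

keyword = 'love'

# Made-up, already lowercased sample sentences
sample_sentences = [
    'love is blind',          # keyword at the start
    'i fell in love',         # keyword at the end
    'my gloves are lovely',   # keyword only inside other words
    'to love him',            # keyword in the middle
]

# Pattern from the cell above: needs a character on both sides
strict = re.compile(f'[^A-Za-z0-9]{keyword}[^A-Za-z0-9]')

# Word-boundary pattern; re.escape guards against special characters
boundary = re.compile(rf'\b{re.escape(keyword)}\b')

strict_hits = [s for s in sample_sentences if strict.search(s)]
boundary_hits = [s for s in sample_sentences if boundary.search(s)]

print(strict_hits)
print(boundary_hits)
```

One small difference to keep in mind: ```\b``` treats the underscore as part of a word, whereas the character-class pattern does not.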


Python also has built-in tools for counting how many times a token appears in a list, such as ```collections.Counter``` used above or the ```list.count()``` method.

There are some problems, though! 

## Viewing keywords in context (KWIC, concordancing)

In [214]:
# define keyword
keyword = 'love'

# for every token 
for idx, token in enumerate(cleaned):
    # checks if token is the keyword
    if token == keyword:
        # get up to 5 words before the keyword (max() stops the start index going negative)
        before = ' '.join(cleaned[max(0, idx-5):idx])
        # get the 5 words after the keyword
        after = ' '.join(cleaned[idx+1:idx+6])

        full = [before, token, after]
        print('{:50} {:20} {:50}'.format(*full))

the dear fellow let me                             love                 him  and as to                                    
another lady we are to                             love                 our neighbor sarah pocket returned                
with anxiety of those i                            love                 if i could be less                                
higher than your head my                           love                 said mr camilla i have                            
expect to thank you my                             love                 without expecting any thanks or                   
seen the object of one's                           love                 and duty for even so                              
get myself to fall in                              love                 with you  you don't                               
a certain man who made                             love                 to miss havisham i never                          
haughty and too 
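
One edge case to be aware of with slice-based context windows: a negative start index counts from the end of the list, so a keyword near the very beginning of the text can lose its left-hand context. A minimal sketch with made-up tokens:

```python
# Made-up tokens with the keyword close to the start of the list
tokens = ['my', 'dear', 'love', 'said', 'mr', 'camilla', 'i', 'have']
idx = tokens.index('love')
window = 5

# idx - window is negative here, so the slice starts near the END of the list
naive_before = tokens[idx - window:idx]

# Clamping the start at 0 keeps whatever context is actually available
safe_before = tokens[max(0, idx - window):idx]

# Slicing past the end of a list is safe and simply returns what is there
after = tokens[idx + 1:idx + 1 + window]

print(naive_before)
print(safe_before)
print(after)
```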

## Exercises

In groups, work on the following exercises in class. 

I've left these somewhat underspecified, so you're welcome to solve them in whatever way you please, and to save the results in whatever format you think works best.

- Write some code which searches through *all* of the novels in the folder called *100 English Novels* and shows how many times a given keyword appears in each novel.
   - Save your results in a way which 
- Turn the KWIC code above into a function which can be used to show *all* occurrences of a keyword in the corpus.
  - Bonus: Your output should match the results above, with an additional column showing the filename.
  - Bonus: Write your function in such a way that a user can define the context window size to display.

In [194]:
novel_path = os.path.join('..', 'data', '100_novels', 'corpus')
novel_files = os.listdir(novel_path)
novel_files = [f for f in novel_files if f.endswith('.txt')] # skip non-text files such as .DS_Store

novels = []

for f in novel_files:
    with open(os.path.join(novel_path, f), 'r', encoding='utf-8') as novel:
        novels.append(novel.read())
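
From here, one possible way to count a keyword in each novel is sketched below. The helper ```count_keyword``` and the stand-in file names and texts are hypothetical, made up so that the example runs without the corpus; with the real data you would pair the file names with the texts loaded above instead.

```python
import string
from collections import Counter

def count_keyword(text, keyword):
    """Count case-insensitive whole-token matches of keyword in text."""
    tokens = (w.strip(string.punctuation).lower() for w in text.split())
    return Counter(tokens)[keyword.lower()]

# Stand-in data: invented file names and texts, not the real corpus
demo_files = ['novel_a.txt', 'novel_b.txt']
demo_novels = ['Love and duty. I love him!', 'No such word here.']

# Map each file name to the number of times the keyword appears in it
keyword_counts = {
    name: count_keyword(text, 'love')
    for name, text in zip(demo_files, demo_novels)
}

print(keyword_counts)
```

A dictionary keyed by filename is only one option; it has the advantage of being easy to write out as CSV or JSON later.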