# Assignment 05
#### Python Basics V - Text Processing

This tutorial was written by Terry L. Ruas (University of Göttingen). External sources that this material was adapted from or inspired by are credited in the Acknowledgments section at the end of the document.

This notebook will cover the following tasks:

1. Text Pre-Processing
2. Simple Text Analysis

## Task 01 – Text Pre-Processing
A computational analysis of natural language text typically requires several pre-processing steps: excluding irrelevant parts of the text; separating the text into words, phrases, or sentences, depending on the use case; removing so-called stop words, i.e., words that carry little to no semantic meaning; and normalizing the text, e.g., by removing punctuation and capitalization.
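
As a rough illustration of these steps on a single sentence, the sketch below lower-cases the text, strips punctuation, splits it into words, and drops stop words. The names `preprocess` and `TOY_STOP_WORDS`, as well as the tiny stop-word set itself, are made up for this example and are not part of the assignment.

```python
import string

# Hypothetical, deliberately tiny stop-word set used only for this illustration.
TOY_STOP_WORDS = {"the", "a", "is", "and", "of", "to"}


def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lower-case, strip ASCII punctuation, tokenize, and drop stop words."""
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in stop_words]


print(preprocess("The quality of mercy is not strained!"))
```

Note that `string.punctuation` also contains the hyphen and the apostrophe, so a real text may need a more selective character list.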

Use the *download_file()* function developed in the previous assignments to download the plain-text versions of Shakespeare’s play [Macbeth](https://ia802707.us.archive.org/1/items/macbeth02264gut/0ws3410.txt) and Bacon’s [New Atlantis](https://ia801309.us.archive.org/24/items/newatlantis02434gut/nwatl10.txt). If you chose not to implement assignment 4, task 6, download the files manually. We will also provide some txt files.

Inspect these real-world texts manually to get an idea of what needs to be done to clean and prepare the texts for computational analysis. Implement the following functions to perform common pre-processing steps on the texts:
1. *get_speaker_text()* – returns only the text spoken by the characters in the plays and removes all other text in the files, such as:
    - Information about the file, acknowledgements, copyright notices etc.
    - Headings indicating the act and scene
    - Stage instructions
    - Character names
2. *normalize_text()*
    - converts all text to lower case
    - removes all punctuation from the texts
3. *remove_stopwords()* – eliminates all words that appear in a list of English stop words (we provide two lists of stop words; try both and compare the results)
4. *tokenize_text()* – splits the cleaned text into words

This task is a prerequisite for the next one.
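
To make the idea behind *get_speaker_text()* more concrete, here is a minimal sketch on a made-up excerpt. It assumes that a speech starts with a line indented by two spaces followed by a speaker label ending in a period, that continuation lines are not indented, and that a blank line ends the speech; the real files may deviate from this layout. The names `toy_speaker_text`, `SPEECH_START`, and `PARENTHETICAL` are hypothetical.

```python
import re

# Made-up excerpt in the assumed layout described above.
SAMPLE = """Actus Primus. Scoena Prima.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done, (aside) When the Battaile's lost, and wonne.
"""

SPEECH_START = re.compile(r"^  [^.]+\.(.*)$")  # indented "Speaker." prefix
PARENTHETICAL = re.compile(r"\([^)]*\)")       # "(...)" within a single line


def toy_speaker_text(text):
    """Collect spoken lines, dropping speaker labels and one-line parentheticals."""
    spoken, in_speech = [], False
    for line in text.splitlines():
        match = SPEECH_START.match(line)
        if match:                    # new speech: keep the text after the label
            in_speech = True
            line = match.group(1)
        elif not line.strip():       # a blank line ends the current speech
            in_speech = False
            continue
        elif not in_speech:          # headings and other text outside a speech
            continue
        # Remove parentheticals and collapse the leftover whitespace.
        spoken.append(" ".join(PARENTHETICAL.sub(" ", line).split()))
    return spoken


for spoken_line in toy_speaker_text(SAMPLE):
    print(spoken_line)
```

This sketch only handles parentheticals that open and close on the same line; passages that span several lines need extra state.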

In [18]:
import re

def correct_ocr_errors(text):
    #  vowels {V}
    vowel = {"a", "e", "i", "o", "y"}
    #  consonants {C}
    cons = {"q", "w", "r", "t", "p", "s", "d", "f", "h", "j", "k", "l", "c", "x", "b", "n", "m"}

    # 'd -> ed
    text = text.replace("'d", "ed")

    chars = list(text)
    last = len(chars) - 1
    for i, char in enumerate(chars):
        # v{C} -> u{C}
        if char == "v" and i < last and chars[i + 1] in cons:
            chars[i] = "u"
        # u{V} -> v{V}, unless the u follows a q
        elif char == "u" and i < last and chars[i + 1] in vowel and (i == 0 or chars[i - 1] != "q"):
            chars[i] = "v"
        # {C}ne at the end of a word -> {C}n
        elif char == " " and i >= 3 and chars[i - 1] == "e" and chars[i - 2] == "n" and chars[i - 3] in cons:
            chars[i - 1] = ""
    text = "".join(chars)
    # -esse -> -ess at the end of a word
    text = re.sub("esse ", "ess ", text)
    # -ie -> -y at the end of a word (any trailing whitespace becomes a space)
    text = re.sub(r"ie\s", "y ", text)
    # iu- -> ju- at the beginning of a word
    text = re.sub(" iu", " ju", text)
    return text

In [26]:
import os
from urllib import request
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


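# os.path.join avoids invalid escape sequences such as "\M" and works on every platform.
path = os.path.join(os.getcwd(), "Macbeth.txt")
path2 = os.path.join(os.getcwd(), "eng_stop_words.txt")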


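def download_file(url, path):
    """Download a .txt file from url to path and return True on success."""
    if ".txt" not in url:
        print("Download unsuccessful: the URL does not point to a .txt file.")
        return False
    try:
        request.urlretrieve(url, path)
    except (HTTPError, URLError) as err:
        print(f"Download unsuccessful: {err}")
        return False
    print("Download successful!")
    return True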

    
def get_speaker_text(path):
    """Return the spoken lines of the play, without speaker names or parenthesised passages."""
    # Read the whole file as a list of lines.
    with open(path, 'r') as f:
        raw_text = f.readlines()

    # Collect the spoken text: an indented line starts a speech (everything up to the
    # first '.' is dropped as the speaker name) and the following lines continue it.
    spoken_text = []
    in_speech = False
    for line in raw_text:
        if raw_text.index(line) == 1:
            # Lines identical to the second line of the file end the current speech.
            in_speech = False
        elif line.startswith('  '):
            in_speech = True
            splitted_line = line.split('.')
            spoken_text.append(''.join(splitted_line[1:]))
        elif in_speech:
            spoken_text.append(line)
                
    # Keep only the lines from the first spoken line of the play onwards.
    # Slicing keeps repeated lines, which a per-line index() lookup would drop.
    start_index = spoken_text.index(' When shall we three meet againe?\n')
    drama_text = spoken_text[start_index:]
           
    # Remove parenthesised passages, including ones that span several lines.
    speaker_text = []
    counter = 0
    for line in drama_text:
        if counter == 1 and not ')'in line:
            continue 
            
        elif '('in line and ')'in line:
            words_in_line=line.split()
            conter = 0
            words =[]
            for word in words_in_line:
                if conter ==1 and ')' not in word:
                    continue
                elif '(' in word:
                    conter=1
                elif ')' in word:
                    conter=0    
                else:
                    words.append(word)
            words.append('\n')        
            speaker_text.append(' '.join(words)) 
            
        elif '(' in line:
            counter = 1
            words_in_line=line.split()
            conter = 0
            words =[]
            for word in words_in_line:
                if '(' in word:
                    conter=1
                elif conter ==1:
                    continue
                else:
                    words.append(word)        
            speaker_text.append(' '.join(words))
                        
        elif ')' in line:
            counter = 0
            words_in_line=line.split()
            conter = 1
            words =[]
            for word in words_in_line:
                if ')' in word:
                    conter=0
                elif conter ==1:
                    continue
                else:
                    words.append(word)
            words.append('\n')         
            speaker_text.append(' '.join(words))
        else:
            speaker_text.append(line)  
    return speaker_text

def normalize_text(input_text):
    """Lower-case each line, strip punctuation, and join the lines into one string."""
    # Apostrophes become spaces; the other listed characters and newlines are deleted.
    # Hyphens are not removed, so compounds such as 'hurley-burley' stay intact.
    table = str.maketrans({"'": " ", **dict.fromkeys(",.?!:&;[]\n")})
    return ' '.join(line.lower().translate(table) for line in input_text)

def remove_stopwords(input_text, path2):
    """Remove every word listed in the stop-word file (one word per line) from input_text."""
    with open(path2, 'r') as f:
        stop_words = {line.strip() for line in f if line.strip()}
    # Filter word by word: replacing ' word ' substrings misses consecutive stop words
    # and stop words at the very start or end of the text.
    return ' '.join(word for word in input_text.split() if word not in stop_words)

def tokenize_text(input_string):
    output = input_string.split()
    
    return output
    
speaker_text = get_speaker_text(path)
normalized_text = normalize_text(speaker_text)
corrected_text = correct_ocr_errors(normalized_text)
clear_text = remove_stopwords(corrected_text, path2)
final_text = tokenize_text(clear_text)
print(final_text)

['shall', 'three', 'meet', 'againe', 'hurley-burley', 'done', 'ere', 'set', 'sunn', 'place', 'upon', 'heath', 'meet', 'macbeth', 'come', 'gray-malkin', 'padock', 'calls', 'anon', 'faire', 'foule', 'foule', 'faire', 'bloody', 'man', 'report', 'serieant', 'doubtfull', 'stood', 'valiant', 'cousin', 'worthy', 'gentleman', 'whence', 'sunn', 'gins', 'reflection', 'dismay', 'captaines', 'macbeth', 'yes', 'sparrowes', 'eagles', 'well', 'thy', 'words', 'become', 'thee', 'thy', 'wounds', 'worthy', 'thane', 'rosse', 'haste', 'lookes', 'eyes', 'god', 'save', 'king', 'whence', 'cam', 'st', 'thou', 'worthy', 'thane', 'fiffe', 'great', 'king', 'great', 'happiness', 'sweno', 'norwayes', 'king', 'thane', 'cawdor', 'shall', 'deceive', 'ile', 'see', 'done', 'hath', 'lost', 'noble', 'macbeth', 'hath', 'wonn', 'hast', 'thou', 'beene', 'sister', 'killing', 'swine', 'sister', 'thou', 'saylors', 'wife', 'chestnuts', 'lappe', 'ile', 'give', 'thee', 'winde', 'th', 'art', 'kinde', 'another', 'selfe', 'shew', 'sh

In [16]:
import os
from urllib import request
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
import re


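# os.path.join avoids invalid escape sequences such as "\e" and works on every platform.
path_new = os.path.join(os.getcwd(), "NewAtlantis.txt")
path2_new = os.path.join(os.getcwd(), "eng_stop_words.txt")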

def get_speaker_text_Atlantis(path):
    """Return the lines of New Atlantis from the first line of the narrative onwards."""
    with open(path, 'r') as f:
        raw_text = f.readlines()

    # Everything before this line is treated as front matter and discarded.
    # Slicing keeps repeated lines, which a per-line index() lookup would drop.
    start_index = raw_text.index('We sailed from Peru, (where we had continued for the space of one\n')
    return raw_text[start_index:]

def normalize_text_Atlantis(input_text):
    """Lower-case each line and strip punctuation; returns a list of lines."""
    # Apostrophes become spaces; the other listed characters and newlines are deleted.
    # Compared with normalize_text, parentheses and double quotes are deleted as well.
    table = str.maketrans({"'": " ", **dict.fromkeys(",.?!:&;[]()\"\n")})
    return [line.lower().translate(table) for line in input_text]

def remove_stopwords_Atlantis(input_text, path2):
    """Join the normalized lines and remove every word listed in the stop-word file."""
    with open(path2, 'r') as f:
        stop_words = {line.strip() for line in f if line.strip()}
    # Filter word by word so that consecutive stop words are all removed.
    words = ' '.join(input_text).split()
    return ' '.join(word for word in words if word not in stop_words)

def tokenize_text_Atlantis(input_string):
    output = input_string.split()
    
    return output

speaker_text_new = get_speaker_text_Atlantis(path_new)
normalized_text_new = normalize_text_Atlantis(speaker_text_new)
stopword_rmv = remove_stopwords_Atlantis(normalized_text_new, path2_new)
final_text_Atlantis = tokenize_text_Atlantis(stopword_rmv)
print(final_text_Atlantis)



## Task 02 – Simple Text Analysis
The [Baconian theory](https://en.wikipedia.org/wiki/Baconian_theory_of_Shakespeare_authorship) holds that Sir Francis Bacon is the author of Shakespeare’s plays. We want to perform a very simple stylistic comparison of Shakespeare’s play Macbeth and Bacon’s New Atlantis. We check for words that occur frequently in both documents to see whether the texts share characteristic words, which might lend some support to the theory.

Your Task:
1. Download and pre-process the texts as follows:
    - New Atlantis
        1. *get_speaker_text()*
        2. *normalize_text()*
        3. *remove_stopwords()*
        4. *tokenize_text()*
    - Macbeth
        1. *get_speaker_text()*
        2. *normalize_text()*
            1. *utils_ocr.correct_ocr_errors()* – we will provide a function to deal with OCR errors.
        3. *remove_stopwords()*
        4. *tokenize_text()*
2. For the pre-processed texts, compute the list of word co-occurrence frequencies, i.e. which words occur in both documents and how often. Use the format:  
[term , frequency_doc1 , frequency_doc2 , sum_of_frequencies]  
Sort the list according to the sum of the frequencies in descending order.
3. Use the csv library to store the ordered word co-occurrence frequency list in a CSV file. **You can zip the csv and upload it to GitHub.**
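
Steps 2 and 3 could be sketched as follows on two toy token lists. This version keeps only terms that occur in both lists, which is one reading of the task. The function names `cooccurrence_table` and `save_rows_as_zipped_csv`, the file names, and the sample tokens are all assumptions made for illustration.

```python
import csv
import os
import tempfile
import zipfile
from collections import Counter


def cooccurrence_table(tokens_doc1, tokens_doc2):
    """Return [term, freq_doc1, freq_doc2, sum] rows for terms found in both lists."""
    counts1, counts2 = Counter(tokens_doc1), Counter(tokens_doc2)
    shared = counts1.keys() & counts2.keys()  # terms present in both documents
    rows = [[t, counts1[t], counts2[t], counts1[t] + counts2[t]] for t in shared]
    # Highest combined frequency first; ties are broken alphabetically.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


def save_rows_as_zipped_csv(rows, csv_path, zip_path):
    """Write rows to csv_path with a header row, then compress the file into zip_path."""
    # newline="" lets the csv module control the line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["term", "frequency_doc1", "frequency_doc2", "sum_of_frequencies"])
        writer.writerows(rows)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname=os.path.basename(csv_path))


# Toy token lists standing in for the two pre-processed texts.
doc1 = ["king", "thane", "king", "night", "great"]
doc2 = ["king", "great", "island", "great", "great"]
rows = cooccurrence_table(doc1, doc2)
print(rows)

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    save_rows_as_zipped_csv(
        rows,
        os.path.join(tmp, "cooccurrence.csv"),
        os.path.join(tmp, "cooccurrence.zip"),
    )
```

Counting with `Counter` touches each token once, whereas calling `list.count()` for every distinct word rescans the whole list each time.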

In [47]:
from collections import Counter
from operator import itemgetter
import csv


macbeth = final_text
new_atlantis = final_text_Atlantis

# Count every word once per text instead of calling list.count() for each word.
count_m = Counter(macbeth)
count_a = Counter(new_atlantis)

# One row per distinct word: [term, frequency Macbeth, frequency New Atlantis, sum].
# Words that occur in only one text are kept, with a frequency of 0 for the other.
output_list = [
    [word, count_m[word], count_a[word], count_m[word] + count_a[word]]
    for word in dict.fromkeys(macbeth + new_atlantis)
]

output_list = sorted(output_list, key=itemgetter(3), reverse=True)
#print(output_list)

# newline='' stops the csv module from writing blank rows between records on Windows.
with open('baconian_theory.csv', mode='w', newline='') as file:
    file_writer = csv.writer(file, delimiter=' ')
    file_writer.writerow(['word', 'frequencyM', 'frequencyNA', 'sum'])
    file_writer.writerows(output_list)
    