## Ex 3: Cleaning the data

2p

In the 2-gram list above, we see many uninformative bigrams in which one member is a special character. One common preprocessing step in NLP is to clean the data. There is no perfect way to clean the data, since each project considers slightly different features of the data irrelevant. For example, some projects may want to keep numbers in their data and others may not.

a) Remove numbers and special characters (keep spaces) from the variable `all_summaries`, and save the result to a new variable `all_summaries_cleaned`. There are many ways to do this. For example, you can iterate over every character of the string and check whether the character is a letter or a space. Alternatively, you can use regular expressions. Google and read about the options! Some useful resources [here](https://www.kite.com/python/answers/how-to-remove-special-characters-from-a-string-in-python) and [here](https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string).

In [3]:
import re

import wikipedia

all_summaries = wikipedia.summary("Helsinki") + wikipedia.summary("Turku") + wikipedia.summary("Tampere")

all_summaries_cleaned = re.sub(r'\d', '', all_summaries)  # remove digits
all_summaries_cleaned = re.sub(r'[.,?!;:\-\[\]()ˈː"\'–]', '', all_summaries_cleaned)  # remove punctuation and IPA stress/length marks
all_summaries_cleaned = re.sub(r'[ŋбæ]', '', all_summaries_cleaned)  # remove IPA characters

print(all_summaries_cleaned)

Helsinki  HELsinkee or  listen helSINKee Finnish helsiki listen Swedish Helsingfors Finland Swedish helsifors listen Latin Helsingia is the capital primate and most populous city of Finland Located on the shore of the Gulf of Finland it is the seat of the region of Uusimaa in southern Finland and has a population of  The citys urban area has a population of  making it by far the most populous urban area in Finland as well as the countrys most important center for politics education finance culture and research while Tampere in the Pirkanmaa region located  kilometres  mi to the north from Helsinki is the second largest urban area in Finland Helsinki is located  kilometres  mi north of Tallinn Estonia  km  mi east of Stockholm Sweden and  km  mi west of Saint Petersburg Russia It has close historical ties with these three cities
Together with the cities of Espoo Vantaa and Kauniainen and surrounding commuter towns Helsinki forms the Greater Helsinki metropolitan area which has a populat

b) Which problem or problems still exist in the data? 

What exactly counts as a special character is debatable. For me even ä is one, because my language does not have it, but it is essential for Finnish, so I decided to leave such letters in the text. I decided to leave the Russian letters (Турку) in the text, too. I removed the IPA characters (ŋ, æ, б), but the remaining fragments of the transcriptions (for example, helsiki) no longer make sense as words.
Because the punctuation was deleted without inserting a space in its place, some words are now joined together as one.
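
The joined-words problem described above can be avoided by replacing unwanted characters with a space instead of deleting them. The following is only a minimal sketch of the character-by-character approach mentioned in part a); the function name `clean_text` and the sample string are made up for illustration. It relies on `str.isalpha()`, which is Unicode-aware, so letters such as ä and Cyrillic letters are kept. IPA symbols are not treated specially here, and some of them may count as letters.

```python
import re


def clean_text(text):
    """Keep letters and whitespace; replace every other character with a space."""
    # Replacing (rather than deleting) keeps neighbouring words apart
    kept = "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)
    # Collapse the runs of whitespace that the replacement leaves behind
    return re.sub(r"\s+", " ", kept).strip()


# Hypothetical sample text, not taken from the Wikipedia summaries
sample = "Hello,world! 123 ä Турку"
print(clean_text(sample))
```

Whether digits or non-Latin letters should be kept still depends on the purpose of the project, so the condition inside the generator expression is the part to adapt.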

## Ex 4: Reading different types of files

3p

In this exercise we will practice reading different file types. In digital humanities projects the data can come in many forms, so researchers should be able to read and handle multiple data types.

a) Take a look at [Kafka's Metamorphosis HTML page](https://www.gutenberg.org/files/5200/5200-h/5200-h.htm) and install [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), a common library for reading and manipulating HTML data.

b) Download the Metamorphosis file and create a BeautifulSoup object from it (see section 'Quick Start').

In [100]:
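from bs4 import BeautifulSoup

# Open the downloaded file (the page declares charset=utf-8);
# the with-block closes the file automatically
with open('metamorphosis.html', 'r', encoding='utf-8') as html_doc:
    soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())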

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="text/css" http-equiv="Content-Style-Type"/>
  <title>
   The Project Gutenberg eBook of Metamorphosis, by Franz Kafka
  </title>
  <style type="text/css">
   body { margin-left: 20%;
       margin-right: 20%;
       text-align: justify; }

h1, h2, h3, h4, h5 {text-align: center; font-style: normal; font-weight:
normal; line-height: 1.5; margin-top: .5em; margin-bottom: .5em;}

h1 {font-size: 300%;
    margin-top: 0.6em;
    margin-bottom: 0.6em;
    letter-spacing: 0.12em;
    word-spacing: 0.2em;
    text-indent: 0em;}
h2 {font-size: 150%; margin-top: 2em; margin-bottom: 1em;}
h3 {font-size: 130%; margin-top: 1em;}
h4 {font-size: 120%;}
h5 {font-size: 110%;}

.no-break {page-break-before: avoid;} /* for epubs */

di

c) Print only the text with the function `soup.get_text()`.

In [101]:
# I could print the text directly with print(soup.get_text()), but I store it in a variable
# because the cells below reuse it. From here on, `soup` is a plain string, not a BeautifulSoup object.
soup = soup.get_text()
print(soup)





The Project Gutenberg eBook of Metamorphosis, by Franz Kafka



The Project Gutenberg eBook of Metamorphosis, by Franz Kafka

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online
at www.gutenberg.org. If you
are not located in the United States, you will have to check the laws of the
country where you are located before using this eBook.


*** This is a COPYRIGHTED Project Gutenberg eBook. Details Below.
Please follow the copyright guidelines in this file. ***

Title: Metamorphosis
Author: Franz Kafka
Translator: David Wyllie
Release Date: May 13, 2002 [eBook #5200]
[Most recently updated: May 20, 2012]
Language: English
Character set encoding: UTF-8
Copyright (C) 2002 by David Wyllie.
*** START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***
Metamorphosis

d) Implement a preprocessing algorithm for the text you printed above. You should first think about which information in the text is irrelevant for you. In this exercise there is no right or wrong answer, because the final purpose of the preprocessing isn't fixed either. You can name a purpose for the preprocessing (for example, 'I want to calculate a bag-of-words representation of the text') and justify your preprocessing decisions on that basis.

Code the preprocessing of the text and justify your decisions with code comments (# Comment). 

In [None]:
# I want to calculate the type-token ratio. First, I delete the Project Gutenberg header and footer
# because I don't want them to distort my results. Then I remove punctuation because I am interested
# in words only. I replace '—' with spaces to separate words, but I leave '-' as it is, so words such as
# 'armour-like' stay connected. I also remove chapter numbers (I, II, III). Then I lowercase all words
# and tokenize the whole document. Now I have a list of tokens, so I can calculate the type-token ratio.

In [102]:
# Delete the Project Gutenberg header and footer:
# keep only the text between the START and END markers instead of hard-coded offsets
header = re.search(r'\*\*\* START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS \*\*\*', soup)
footer = re.search(r'\*\*\* END OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS \*\*\*', soup)
soup = soup[header.end():footer.start()]

In [103]:
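# Remove punctuation (keep '-' so that words such as 'armour-like' stay connected)
soup = re.sub(r'[.?:"\'!;()\[\]“”,’]', '', soup)
# Replace em dashes with spaces so the words on either side stay separate
soup = re.sub(r'—', ' ', soup)
# Remove chapter numbers (I, II, III) that stand alone on a line, keeping one line break
soup = re.sub(r'\nI{1,3}\n', '\n', soup)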

In [104]:
# Lowercase the document and tokenize it
import nltk

soup = soup.lower()
tokens = nltk.word_tokenize(soup)

print(tokens)
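
The stated goal of the preprocessing is the type-token ratio, which the cell above stops short of computing. A minimal sketch of that last step could look like the following; the helper name `type_token_ratio` and the sample token list are made up for illustration and are not part of the original answer. The ratio is the number of distinct tokens (types) divided by the total number of tokens.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        # Avoid division by zero for an empty document
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample; in the notebook this would be the `tokens` list from the cell above
sample_tokens = ["the", "room", "was", "the", "same", "room"]
print(round(type_token_ratio(sample_tokens), 3))
```

Note that the ratio depends on text length, so values are only comparable between texts of similar size.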



e) Install a package called [pdftotext](https://pypi.org/project/pdftotext/). There are many libraries for PDF extraction, and you will possibly use other libraries later in your career.

Download a research article by Matres, Oiva and Tolonen: [In Between Research Cultures – The State of Digital Humanities in Finland](https://www.researchgate.net/publication/326228905_In_Between_Research_Cultures_-_The_State_of_Digital_Humanities_in_Finland)

Read the file by following [these instructions](https://pypi.org/project/pdftotext/) and print the page strings.

In [None]:
# I wanted to try all the examples in the instructions; I hope that is not a problem

In [1]:
import pdftotext

In [2]:
# Open PDF and store it in a variable
with open('MatresOivaTolonenInBetweenResearchCultures2018.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)

In [3]:
# How many pages does the document have
print(len(pdf))

26


In [4]:
# Read all pages
for page in pdf:
    print(page)

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/326228905



In Between Research Cultures – The State of Digital Humanities in Finland

Article in Informaatiotutkimus · June 2018
DOI: 10.23978/inf.71160




CITATIONS                                                                                              READS

2                                                                                                      220


3 authors:

             Ines Matres                                                                                          Mila Oiva
             University of Helsinki                                                                               University of Turku
             8 PUBLICATIONS 2 CITATIONS                                                                           9 PUBLICATIONS 4 CITATIONS

                 SEE PROFILE                                                                        

In [5]:
# Read all the text into one string
print("\n\n".join(pdf))

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/326228905



In Between Research Cultures – The State of Digital Humanities in Finland

Article in Informaatiotutkimus · June 2018
DOI: 10.23978/inf.71160




CITATIONS                                                                                              READS

2                                                                                                      220


3 authors:

             Ines Matres                                                                                          Mila Oiva
             University of Helsinki                                                                               University of Turku
             8 PUBLICATIONS 2 CITATIONS                                                                           9 PUBLICATIONS 4 CITATIONS

                 SEE PROFILE                                                                        

In [7]:
# Read only the first page
print(pdf[0])

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/326228905



In Between Research Cultures – The State of Digital Humanities in Finland

Article in Informaatiotutkimus · June 2018
DOI: 10.23978/inf.71160




CITATIONS                                                                                              READS

2                                                                                                      220


3 authors:

             Ines Matres                                                                                          Mila Oiva
             University of Helsinki                                                                               University of Turku
             8 PUBLICATIONS 2 CITATIONS                                                                           9 PUBLICATIONS 4 CITATIONS

                 SEE PROFILE                                                                        
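
The page strings printed above keep the PDF's column layout as long runs of spaces and empty lines. If the text is to be preprocessed further, one possible first step is to collapse that layout whitespace. The following is only a sketch under assumptions: the helper `normalize_page` is hypothetical, and the `pages` list is a shortened stand-in built from the output above rather than the real `pdf` object.

```python
import re


def normalize_page(page):
    """Collapse layout whitespace in one extracted page string."""
    lines = []
    for line in page.splitlines():
        # Replace runs of spaces and tabs inside the line with a single space
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:  # skip lines that are empty after stripping
            lines.append(line)
    return "\n".join(lines)


# Shortened stand-in for the page strings (assumption, not real pdftotext output)
pages = [
    "CITATIONS            READS\n\n2              220\n",
    "   3 authors:\n\n      Ines Matres        Mila Oiva\n",
]
print("\n\n".join(normalize_page(page) for page in pages))
```

This loses the information about which column a piece of text came from, so it only suits purposes where the layout is irrelevant.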