# Exercise Sheet 1.0 - Text Processing with Python


## Learning Objectives

The motivation of this exercise is to gain familiarity with the Python programming language. We are going to do some basic text processing and analysis on a plaintext corpus. If you are not with familiar Python or Jupyter notebooks, it is recommended to start with the Python Tutorial notebook before attempting this exercise.

---


## Exercise 0

For this exercise, we are going to count the 25 most frequent words in **Alice’s Adventures in Wonderland** by Lewis Carroll. You are free to use any other piece of text of your choice for this exercise. This notebook contains step by step instructions (with some hints) and you are required to fill in the code blocks based on the material covered in the Python Tutorial notebook.

### 0. Download the text file.
Run the cell below to download the book **Alice’s Adventures in Wonderland** as a text file from [Project Gutenberg](http://www.gutenberg.org), and save into a file called `alice.txt`.

In [93]:
!curl https://www.gutenberg.org/files/11/11-0.txt > alice.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  170k  100  170k    0     0   217k      0 --:--:-- --:--:-- --:--:--  217k


---
### 1. Read text from file.
Open the text file `alice.txt` and read all the lines into a list.

In [94]:
#####################################################
# TO DO
# Read lines from alice.txt and,
# save the lines in a list.
#####################################################
# Type your code below this line
lines = []
with open('alice.txt',mode = "r", encoding = "utf-8") as f:
    lines = f.readlines()
    

In [95]:
print(len(lines))

3761


#### Hint:

The `open()` function can be used to read the file.

---
### 2. Filter out the metadata.
The text file contains some metadata about the book which is not relevant for our analysis. Discard this information by removing the first 54 lines from the beginning and the last 356 lines from the end.

In [96]:
#############################################################
# TO DO
# From the text, remove the first 54 lines from the beginning.
# Remove the last 356 lines at the end.
#############################################################
# Type your code below this line
line_54 = []
line_54 = lines[54:len(lines)-356]


In [97]:
print(len(line_54))

3351


#### Hint:

Use index slicing to select the required lines.

---
### 3. Remove leading and trailing spaces from each line in the list.
Each line contains a newline character `\n` at the end while some lines also contain leading and trailing spaces. This formatting is done for presentation purposes and not relevant for our analysis.

In [98]:
#############################################################
# TO DO
# store the lines in this list after removing the leading and
# trailing spaces
#############################################################
# Type your code below this line
line_54_lst = []
for line in line_54:
    lines_54_s = line.strip()
    line_54_lst.append(lines_54_s)

#### Hint:

In [99]:
print(len(line_54_lst))

3351


The `strip()` function can be used to remove leading and trailing spaces.

---
### 4. Remove empty lines from the list.
After removing the newline character `\n` from each line in the list, some strings are now empty and can be discarded safely.

In [100]:
#############################################################
# TO DO
# store non empty lines in this list
#############################################################
# Type your code below this line
new_lst = []
for line in trailed_list:
    if line != '' or line != "":
        new_lst.append(line)


In [101]:
print(len(new_lst))

1077


#### Hint:

An empty string in Python is represented by `''` or `""`.

---
### 5. Join all the non empty lines into a single string.
Now that we have cleaned the corpus by removing some editorial details and formatting, we can focus on the actual text. Create a single string which contains all the lines from the text.



In [102]:
#############################################################
# TO DO
# Join all the lines into a string
#############################################################
# Type your code below this line
new_string = ' '.join(filter(None, new_lst))

#### Hint:

In [103]:
print(len(new_string))

62555


The `join` function can be used to join a list of strings into a single string.

---
### 6. Convert to lowercase
To keep the word counts consistent, we are going to covert everything lowercase. If we don't do this, the words `the`, `The` and `THE` would be considered distinct.  

In [104]:
#############################################################
# TO DO
# Convert the text to lower case.
#############################################################
# Type your code below this line
# non_empty_str.upper()
new_string.lower()

'top of the house!” (which was very likely true.) down, down, down. would the fall _never_ come to an end? “i wonder how getting somewhere near the centre of the earth. let me see: that would several things of this sort in her lessons in the schoolroom, and knowledge, as there was no one to listen to her, still it was good then i wonder what latitude or longitude i’ve got to?” (alice had no grand words to say.) presently she began again. “i wonder if i shall fall right _through_ with their heads downward! the antipathies, i think—” (she was rather the right word) “—but i shall have to ask them what the name of the (and she tried to curtsey as she spoke—fancy _curtseying_ as you’re an ignorant little girl she’ll think me for asking! no, it’ll never do talking again. “dinah’ll miss me very much to-night, i should think!” tea-time. dinah my dear! i wish you were down here with me! there are very like a mouse, you know. but do cats eat bats, i wonder?” and here dreamy sort of way, “do cats

#### Hint:

Use the `lower()` function.

---
### 7. Get a list of all the words in the text.

In [105]:
#############################################################
# TO DO
# Get a list of all the words in the text.
#############################################################
# Type your code below this line
new_string_lst = new_string.split()


In [106]:
len(new_string_lst)

11584

#### Hint:

The `split()` can be used to get a list of words from a string.

### 8. Remove punctuation

For a machine, character sequences `rabbit`, `rabbit,` and `rabbit!` are diferrent words, although we as humans understand that this is the same word with/without punctuation marks after it. To avoid this confusion, we can remove punctuation, because it is unnecessary for our task.

In [112]:
#############################################################
# TO DO
# Remove the punctuations in the words contained in the list.
#############################################################
# Type your code below this line
import string

string_lst2 = []
strPunchText = string.punctuation
for wrd in new_string_lst:
    wrd_res = wrd.strip(strPunchText)
    string_lst2.append(wrd_res)

#### Hint:

In [113]:
print(len(string_lst2))

11584


Use `strip()` function. List comprehensions may also come in handy!

---
### 9. How many total words are there in the text?

Individuals elements in a text (usually words, but not only) are called **tokens** in NLP.

In [114]:
#############################################################
# TO DO
# # Count the words in the list.
#############################################################
# Type your code below this line
count_lst = print(len(string_lst2))

11584


#### Hint:

This can be found by finding the length of the `words` list.

---
### 10. How many unique words are there in the text?

Unique words are also called **types** in NLP.

In [119]:
#############################################################
# TO DO
# # Count the unique words in the list.
#############################################################
# Type your code below this line
unq_wrd = set(string_lst2)
    

In [122]:
print(unq_wrd)

{'', 'thistle', 'this?”', 'certainly', 'surprise', 'you’re', 'sense', 'bats?”', 'to-day', 'week', 'Geography', 'seems,”', 'feet', 'White', 'executed', 'baby', 'prize', 'whole', 'newspapers', 'slipped', 'confusing.”', 'furrow', 'sounded', 'flustered', 'And', 'had', 'line', 'half', 'shrinking', 'fire', 'spell', 'can’t', 'Dodo', 'ennyworth', 'letter', 'pieces', 'us', 'saw', 'or', 'followed', 'goose', 'fun!”', 'size,”', '“Nobody', 'Shark', 'loudly', 'bad', 'effect', 'cup,”', 'least', 'be,”', 'hate—C', 'Stop', '“That’s', 'wildly', 'music', 'cup', '“’Tis', 'patriotic', 'knuckles', 'remember,”', 'bat', 'gone', 'Was', 'best', '“Nonsense!”', 'already', 'result', 'sleep', 'When', 'bowed', 'three', 'stupid?”', 'hard', 'moving', 'cheerfully', '“After', 'creatures,”', 'Pat', 'way—never', 'executioner’s', 'precious', 'sing?”', 'against', 'better;”', 'liked', 'do,”', 'why', 'remained', '“Why,”', 'four', 'full', 'Down', 'wonder', '“Shall', 'learned', 'Half-past', 'she’d', 'is—oh', '“or', 'either', 'lo

#### Hint:

The `set` data type can be used to find unique values.

In [130]:

common_dict = {}
for common_wrd in string_lst2:
    if common_wrd not in common_dict:
        common_dict[common_wrd] = 0
    common_dict[common_wrd]+=1

In [135]:
print(common_dict, end = '')

{'top': 5, 'of': 215, 'the': 624, 'house!”': 1, 'Which': 3, 'was': 167, 'very': 63, 'likely': 1, 'true': 1, 'Down': 1, 'down': 36, 'Would': 2, 'fall': 4, 'never': 20, 'come': 15, 'to': 295, 'an': 19, 'end': 6, '“I': 50, 'wonder': 8, 'how': 15, 'getting': 12, 'somewhere': 1, 'near': 8, 'centre': 1, 'earth': 3, 'Let': 3, 'me': 24, 'see': 27, 'that': 125, 'would': 25, 'several': 3, 'things': 15, 'this': 49, 'sort': 10, 'in': 155, 'her': 102, 'lessons': 2, 'schoolroom': 1, 'and': 322, 'knowledge': 3, 'as': 99, 'there': 19, 'no': 33, 'one': 35, 'listen': 3, 'still': 6, 'it': 197, 'good': 9, 'then': 23, 'I': 128, 'what': 48, 'Latitude': 1, 'or': 21, 'Longitude': 1, 'I’ve': 12, 'got': 20, 'to?”': 1, 'Alice': 172, 'had': 63, 'grand': 2, 'words': 10, 'say': 19, 'Presently': 1, 'she': 232, 'began': 32, 'again': 30, 'if': 27, 'shall': 11, 'right': 13, 'through': 7, 'with': 65, 'their': 22, 'heads': 4, 'downward': 1, 'The': 36, 'Antipathies': 1, 'think—”': 2, 'rather': 20, 'word': 5, '“—but': 1, '

In [None]:
print 

[('the', 136),
 ('she', 107),
 ('to', 93),
 ('and', 91),
 ('it', 65),
 ('was', 64),
 ('a', 63),
 ('of', 56),
 ('i', 42),
 ('in', 37),
 ('her', 36),
 ('alice', 33),
 ('that', 29),
 ('down', 27),
 ('for', 26),
 ('very', 25),
 ('on', 24),
 ('as', 24),
 ('but', 23),
 ('you', 23),
 ('little', 22),
 ('had', 20),
 ('be', 20),
 ('all', 19),
 ('so', 18)]

---
### 11. What are the 25 most frequent words?

In [136]:
#############################################################
# TO DO
# # Get the most frequent words in the list.
#############################################################
# Type your code below this line
from collections import Counter
Counter = Counter(string_lst2)
txt_list=Counter.most_common(25)

In [139]:
print(txt_list, end = '')

[('the', 624), ('and', 322), ('to', 295), ('a', 274), ('she', 232), ('of', 215), ('it', 197), ('said', 183), ('Alice', 172), ('was', 167), ('in', 155), ('you', 136), ('I', 128), ('that', 125), ('her', 102), ('as', 99), ('at', 96), ('on', 74), ('little', 67), ('be', 67), ('with', 65), ('all', 65), ('very', 63), ('had', 63), ('out', 53)]

#### Hints:

* Python >= 3.6 supports ordered dictionaries, so there is no need to convert to a list of tuples before sorting.
* Look up the `Counter` container in the `collections` module in the [Python docs](https://docs.python.org/3/library/collections.html#collections.Counter).


---
#### This is the end of this notebook


