## Simple text processing with Python

Outline:
- Working with strings
- f-strings
- Reading/writing files
- Working with dictionaries
- Other tools

## Strings
Anything within quotes is a string!


In [None]:
s = ' This is a string '
s = " This too! "
s = """ This one too! """
s = ''' And one more! '''

### Strings
Why so many?


In [None]:
s = ' "Do or do not.  No try." said Yoda.'
s = " ' is a mighty lonely quote."
# The triple quoted ones can span multiple lines!

In [None]:
s = """ The quick brown
fox jumped over
    the lazy dingbat.
"""

### Accessing part of strings


In [None]:
w = "hello"
print(w[0], w[1], w[-1])

In [None]:
len(w)
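
Beyond single characters, a slice picks out a range of characters. The sketch below is a small illustration; the variable names are chosen only for this example.

```python
w = "hello"

# A slice w[start:stop] returns a new string; stop is excluded.
first_three = w[0:3]   # characters at indices 0, 1, 2
last_two = w[-2:]      # from the second-last character to the end
every_other = w[::2]   # every second character
reversed_w = w[::-1]   # a negative step walks backwards

print(first_three, last_two, every_other, reversed_w)
```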

### Strings are immutable

In [None]:
w[0] = 'H'  # raises TypeError: a string cannot be changed in place

### String operations


In [None]:
s = 'Hello'
p = 'World'
s + p

In [None]:
s * 4

In [None]:
s * s  # raises TypeError: a string can only be repeated by an integer

### String methods


In [None]:
a = 'Hello World'
a.startswith('Hell')

In [None]:
a.endswith('ld')

In [None]:
a.upper()

In [None]:
a.lower()

In [None]:
a = '  Hello World  '
b = a.strip()
b

In [None]:
b.index('ll')

In [None]:
b.find('lz')

In [None]:
b.replace('Hello', 'Goodbye')

## Strings: `split` & `join`


In [None]:
chars = 'a b c'
chars.split()

In [None]:
' '.join(['a', 'b', 'c'])

In [None]:
alpha = ', '.join(['a', 'b', 'c'])
alpha

In [None]:
alpha.split(', ')

### String formatting


In [None]:
x, y = 1, 1.234
'x is %s, y is %s' % (x, y)

- Other conversion specifiers such as `%d` (integer) and `%f` (float) are also available.
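
As a small sketch of these specifiers, the example below formats the same `x` and `y` with `%d`, `%f`, and a precision such as `%.2f`. The name `msg` is used only for this illustration.

```python
x, y = 1, 1.234

# %d formats an integer, %f a float (six decimal places by default),
# and %.2f a float rounded to two decimal places.
msg = 'x is %d, y is %f, y rounded is %.2f' % (x, y, y)
print(msg)
```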

### f-strings

f-strings are easier to read and write than `%` formatting. Let us see some examples:

In [None]:
name = 'Ram'
age = 25
wt = 60
print(f'{name} is {age} and weighs {wt} kgs')

In [None]:
def f(x):
    print(f'x is {x}; name is {name}')
f(1.0)

In [None]:
print(r'\n hello')  # raw string: the backslash is kept as-is, not treated as an escape

- Notice the `f` in front of the string; `F` also works.
- You can do more: any expression can go inside the braces.

In [None]:
f'{name.upper()} is {age + 1} and weighs {wt + 0.5} kgs.'


- Format specifiers control how each value is displayed, e.g. `:d` for an integer and `:.1f` for one decimal place.

In [None]:
f'{name} is {age:d} and weighs {wt:.1f} kgs.'

- A width such as `:10` pads the value to at least that many characters.

In [None]:
f'{name:10} is {age:d} and weighs {wt:.1f} kgs.'

### More documentation on f-strings

- [f-string docs](https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals)
- [Format specification](https://docs.python.org/3/library/string.html#formatspec)

### String containment


In [None]:
fruits = 'apple, banana, pear'
'apple' in fruits

In [None]:

'potato' in fruits

## Exercise 1
Given a 2-digit integer `x`, find the digits of the number.

* For example, let us say `x = 38`
* Find a way to get `a = 3` and `b = 8` using `x`.


### Possible Solution


In [None]:
x = 38
a = x//10
b = x % 10
a*10 + b == x

### Another Solution


In [None]:
sx = str(x)
a = int(sx[0])
b = int(sx[1])
a*10 + b == x

### Exercise 2
Given an arbitrary integer, count the number of digits it has.


### Possible solution


In [None]:
x = 12345678
len(str(x))  # Sneaky solution!

## Reading/writing files

- This is a high-level and limited introduction.
- Start by reading a simple text file, `data.txt`.

In [None]:
f = open('../data/data.txt')
data = f.read()
f.close()
type(data), len(data)

In [None]:
# Same as:
f = open('../data/data.txt', 'r')  # mode defaults to 'r'
data = f.read()
f.close()

In [None]:
# Can also read the file as a list of lines
f = open('../data/data.txt')
data = f.readlines()
f.close()
type(data), len(data)


- We must always close a file once we are done with it.
- The OS limits the number of files that can be open at once.
- Other programs may be unable to use a file while it is open for writing.

Python provides a convenient syntax for this kind of thing.

In [None]:
with open('../data/data.txt') as f:
    data = f.readlines()
print(len(data))
data[0]


- `with` introduces a new block.
- Notice the `as f` syntax carefully: it binds the open file to the name `f`.
- The file is closed automatically when the block exits.
- Every line keeps its line-ending character; see the `'\n'` at the end of
  `data[0]`.
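
If the trailing newlines are not wanted, they can be stripped after reading. The sketch below writes a small throwaway file first so that it runs on its own; the file name `sample.txt` and the variable names are assumptions made for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

with open(path) as f:
    raw = f.readlines()

# rstrip('\n') removes only the trailing newline from each line.
lines = [line.rstrip('\n') for line in raw]
print(raw)
print(lines)
```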

### Writing files
- Writing files is easy too.
- Set the mode argument to `'w'`.
- **Beware: this overwrites any existing file of the same name.**

In [None]:
with open('/tmp/junk.txt', 'w') as f:
    f.write('Hello world!\n')

In [None]:
# Can also do
with open('/tmp/junk.txt', 'w') as f:
    f.writelines(data)

- Note that `f.writelines` does not add newlines; each string must already end with `'\n'` if it should be on its own line.
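
To see this behaviour, the sketch below writes a list of strings that have no newlines of their own and adds `'\n'` explicitly. The temporary path and the sample list are assumptions for this illustration.

```python
import os
import tempfile

lines = ['alpha', 'beta', 'gamma']
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

with open(path, 'w') as f:
    # Append '\n' to each item ourselves; writelines() writes them as-is.
    f.writelines(line + '\n' for line in lines)

with open(path) as f:
    content = f.read()
print(repr(content))
```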

### Exercise
- Read the data file, `../data/data.txt`
- Write every alternate line to a new data file called `junk.txt`

In [None]:
# Solution
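
One possible approach, sketched under some assumptions: "every alternate line" is taken to mean lines 1, 3, 5 and so on, and a small temporary file stands in for `../data/data.txt` so that the code runs anywhere. Replace `src` and `dst` with the real paths when trying it on the actual data.

```python
import os
import tempfile

# Stand-in for ../data/data.txt so the sketch runs anywhere.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'data.txt')
dst = os.path.join(workdir, 'junk.txt')
with open(src, 'w') as f:
    f.writelines(f'line {i}\n' for i in range(6))

with open(src) as f:
    lines = f.readlines()

# lines[::2] keeps the lines at indices 0, 2, 4, ...; each already ends with '\n'.
with open(dst, 'w') as f:
    f.writelines(lines[::2])

with open(dst) as f:
    result = f.read()
print(result)
```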

### Counting words in a string

Consider the following paragraph from a famous novel, "Alice in
Wonderland" by Lewis Carroll.

In [None]:
para = """
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her
sister was reading, but it had no pictures or conversations in it, 'and what
is the use of a book,' thought Alice 'without pictures or conversations?'
"""


We now want to count the words in this paragraph and find, for example,
the most frequent word. How would we do this?

If we know the specific word, this is easy. For example, how often do
the words "Alice", "the", and "her" occur?

In [None]:
para.lower().count('alice')

In [None]:
para.lower().count('the')
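
One caveat worth keeping in mind: `str.count` counts substring matches, so a short word is also counted when it appears inside a longer one. The sketch below uses a made-up sentence (not from the novel) to show the difference from counting whole words.

```python
text = 'the other theory'

# str.count counts substring matches, including inside longer words.
substring_count = text.count('the')

# Splitting into words first counts only whole-word matches.
word_count = text.split().count('the')

print(substring_count, word_count)
```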


We want more: we want to see all the words and count them all! We can do
this easily with a more powerful data type called the dictionary.

## Working with dictionaries
- Think of a dictionary as a container with keys and values.
- Keys can be strings, integers, floats, etc.: basically any hashable (usually immutable) object.
- Values can be anything.

Here is a simple example:

In [None]:
phonebook = {"mom": 1234, "dad": 5678}
phonebook

In [None]:
type(phonebook)

In [None]:
phonebook['mom']

In [None]:
# Setting values
phonebook['mom'] = 4567
phonebook

In [None]:
phonebook["work"] =  1022
phonebook

In [None]:
# No need for fixed types:
d = {1: 25, 'a': 1.23, 'b': 'hello', 1.23: 'a'}
d
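
The restriction on keys can be seen in a short sketch: a tuple works as a key, while a list does not. The dictionary `points` and the other names are made up for this illustration.

```python
# Tuples are immutable and hashable, so they work as keys.
points = {(0, 0): 'origin', (1, 0): 'unit x'}
print(points[(0, 0)])

# Lists are mutable and unhashable, so using one as a key fails.
error = None
try:
    bad = {[0, 0]: 'origin'}
except TypeError as exc:
    error = str(exc)
    print('TypeError:', error)
```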

In [None]:
empty = {}
# or
empty = dict()

In [None]:
# Can also do:
x = dict(x=1, y=2.1, zeta='hello')
x

In [None]:
# Iteration is easy
for key in d:
    print(key, d[key])

In [None]:
# Also
for key, value in d.items():
    print(key, value)

In [None]:
d.keys()

In [None]:
d.values()

In [None]:
# Deletion
del d['b']
d

In [None]:
d.get('a', -1)

### Other methods

- `d.clear()`
- `d.copy()`
- `d.fromkeys(iterable, value=None)`
- `d.get(key)`
- `d.pop()`/`d.popitem()`
- `d.setdefault(key, default=None)`
- `d.update(...)`: Update with a dict/iterable
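
A brief sketch of a few of the methods listed above, using a small made-up dictionary:

```python
d = {'a': 1, 'b': 2}

# setdefault returns the existing value, or inserts the default if missing.
d.setdefault('a', 100)   # 'a' exists, so d['a'] stays 1
d.setdefault('c', 3)     # 'c' is missing, so d['c'] becomes 3

# update merges another dict in, overwriting existing keys.
d.update({'b': 20, 'e': 5})

# pop removes a key and returns its value.
removed = d.pop('e')

# copy makes a shallow copy; clearing it leaves d untouched.
backup = d.copy()
backup.clear()

print(d, removed, backup)
```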

In [None]:
dict.fromkeys(range(5), 0)

Returning to our task of counting the words in the string `para`:


In [None]:
# Try this as an exercise!

In [None]:
# Solution
data = {}
for word in para.lower().split():
    if word in data:
        data[word] += 1
    else:
        data[word] = 1
data
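
The same counting loop can be written more compactly. The sketch below shows two variants on a short made-up sentence: one using `dict.get` with a default, and one using `collections.Counter` from the standard library, which is a different tool from the plain dictionary used above.

```python
from collections import Counter

text = 'the cat and the hat and the bat'

# Variant 1: dict.get supplies 0 for words not seen yet.
counts = {}
for word in text.split():
    counts[word] = counts.get(word, 0) + 1

# Variant 2: Counter does the same bookkeeping and can rank the results.
counter = Counter(text.split())
top_two = counter.most_common(2)

print(counts)
print(top_two)
```

`most_common(n)` returns the `n` most frequent items as `(word, count)` pairs, which is handy for finding the most frequent word.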

There is a problem with the punctuation marks!
- We want to delete any punctuation
- We want to remove additional spaces

We can use some string methods to do this.
- Specifically, we can use `para.translate`
- It takes a translation table (built from a dictionary) that maps each character to its replacement, or to `None` to delete it

In [None]:
# Using para.translate: characters mapped to None are deleted
table = {':': None, ',': None, '?': None}
t = para.maketrans(table)
para.translate(t)

- But we want to remove all punctuation
- The `string` module is very convenient

In [None]:
import string

In [None]:
string.ascii_letters

In [None]:
string.punctuation

In [None]:
string.whitespace

In [None]:
# Now we are good to go:
table = dict.fromkeys(string.punctuation)
para.translate(para.maketrans(table))
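
Removing punctuation still leaves extra spaces and newlines behind. A minimal sketch of one way to handle both steps, using a made-up string rather than `para`:

```python
import string

text = "Hello,   world!\nIs this  'clean'?"

# Map every punctuation character to None, which deletes it.
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punct = text.translate(table)

# split() with no arguments breaks on any run of whitespace,
# so joining with ' ' leaves exactly one space between words.
normalized = ' '.join(no_punct.split())
print(normalized)
```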

### Exercise: Counting words in Alice in Wonderland

- Take the text file provided in `../data/alice.txt`
- Read the entire novel and clean up the data of punctuation etc.
- Normalize the whitespace to a single space
- Do a full word count
- Show the top 10 words in the text
- Optionally, plot a histogram of the top words
- What is the longest word in the book?

## Other tools

Regular expressions are extremely useful for more complex text processing.
You can learn more about them from:
- https://docs.python.org/3/howto/regex.html
- https://docs.python.org/3/library/re.html

If you do decide to learn more about regular expressions, this website can
be extremely helpful when you are learning, writing, and debugging
regular expressions: https://pythex.org/
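
As a small taste, the sketch below uses `re.findall` to pull the words out of a line from the paragraph above in one step. The pattern `[a-z]+` is just one simple choice; it treats anything that is not a lowercase letter as a separator.

```python
import re

text = "'and what is the use of a book,' thought Alice"

# [a-z]+ matches each run of one or more lowercase letters, so
# punctuation and whitespace act as separators.
words = re.findall(r'[a-z]+', text.lower())
print(words)
```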