# A Gentle Introduction to Parsing in Python

Python is a language that is very good at parsing text.

## Splitting on spaces
The first tool you should be familiar with is [`str.split()`](https://docs.python.org/3/library/stdtypes.html#str.split).

In [1]:
# with no arguments, split() breaks a string on runs of whitespace (spaces, tabs, newlines) and returns the pieces as a list
a_string = " this is a string    "
a_string.split()

['this', 'is', 'a', 'string']

In [2]:
a_string = " this is a string  \nhello and a new      line\n  "
a_string.split()

['this', 'is', 'a', 'string', 'hello', 'and', 'a', 'new', 'line']

You can also split on other separators, such as commas or nearly any other string:

In [3]:
a_string = "if i need to, I will go to the store, but not before I eat"
a_string.split(",")

['if i need to', ' I will go to the store', ' but not before I eat']
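
Notice that the pieces above keep the space that followed each comma. A minimal sketch of one common cleanup is to call `str.strip()` on each piece (the name `clauses` is only illustrative):

```python
a_string = "if i need to, I will go to the store, but not before I eat"

# split on commas, then strip the surrounding whitespace from each piece
clauses = [piece.strip() for piece in a_string.split(",")]
print(clauses)
```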

In [4]:
a_string = "aagttcacgtaaagcctctaag"
a_string.split("g")

['aa', 'ttcac', 'taaa', 'cctctaa', '']
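
Note the empty string at the end of that last result: when you pass an explicit separator, `split()` keeps empty pieces, whereas the no-argument form drops them. A small sketch of the difference (the variable names are illustrative):

```python
spaced = "a  b "  # two spaces between a and b, one trailing space

# no argument: a run of whitespace acts as one separator, no empty strings
no_arg = spaced.split()

# explicit separator: every single space splits, so empty strings appear
with_sep = spaced.split(" ")

# one way to discard the empty pieces afterwards
non_empty = [piece for piece in with_sep if piece]

print(no_arg)
print(with_sep)
print(non_empty)
```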

## Determining Word Case

Determining word case is rather simple in Python.

We will look at two ways:

* one uses the built-in [`str.isupper()`](https://docs.python.org/3/library/stdtypes.html?highlight=isupper#str.isupper).
* the other uses regular expressions, which are handled in Python with the [`re` library](https://docs.python.org/3/library/re.html).

Let's jump into it.

Say we want to take a string and determine if it begins with a capital character.

### Using `str.isupper()`

In [5]:
def is_capitalized(s):
    return s[0].isupper()

In [6]:
is_capitalized("A")

True

In [7]:
is_capitalized("funny")

False

In [8]:
is_capitalized("U.S.")

True
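
One edge case worth knowing: `is_capitalized` indexes `s[0]`, so an empty string raises `IndexError`. If empty input is possible, one hedged variant slices instead of indexing (the name `is_capitalized_safe` is hypothetical, introduced here only for illustration):

```python
def is_capitalized_safe(s):
    # s[:1] is "" for an empty string, and "".isupper() is False
    return s[:1].isupper()

print(is_capitalized_safe(""))
print(is_capitalized_safe("U.S."))
print(is_capitalized_safe("funny"))
```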

### Using Regular Expressions
We can see that `str.isupper()` is simple, elegant and effective, but not very generalizable.

This is where we will use `re` and try something a little different.

Here is the shape of the solution:

* in the Python `re` library, a word character (a letter, digit, or underscore) is designated with `\w`; this is called a "character class"
* square brackets let you define your own set of characters to match in a regular expression
    * the expression `[A-Z]` matches any uppercase ASCII letter
    * the expression `[a-z]` matches any lowercase ASCII letter
* combining these, we can create a function that performs the same check as `str.isupper()` in the language of regular expressions

For example:

In [9]:
import re

# \w* (rather than \w+) lets single-letter words such as "a" or "I" match too
pat_lower = re.compile(r'[a-z]\w*') # words that start with a lowercase letter
pat_upper = re.compile(r'[A-Z]\w*') # words that start with an uppercase letter

In [10]:
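def process_string(s):
    # match() only tries to match at the start of the string
    m = pat_lower.match(s)
    if m:
        print(f"'{m[0]}' is lowercase")
    else:
        m = pat_upper.match(s)
        if m:
            print(f"'{m[0]}' is uppercase")
        else:
            # avoid indexing None when neither pattern matches
            print(f"'{s}' matched neither pattern")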

In [11]:
process_string("hello.")

'hello' is lowercase


In [12]:
process_string("Hello.")

'Hello' is uppercase
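
To mirror `is_capitalized` more directly, a regular-expression version can return a boolean instead of printing. This is a minimal sketch rather than a definitive implementation; the names `pat_capital` and `is_capitalized_re` are assumptions made for illustration. Because `[A-Z]` only covers ASCII capitals, it is stricter than `str.isupper()` for accented letters.

```python
import re

# match() anchors at the start of the string, so only the first character is tested
pat_capital = re.compile(r'[A-Z]')

def is_capitalized_re(s):
    return pat_capital.match(s) is not None

print(is_capitalized_re("A"))
print(is_capitalized_re("funny"))
print(is_capitalized_re("U.S."))
print(is_capitalized_re("Élan"))  # str.isupper() would accept the É; [A-Z] does not
```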


This approach could be put to use in your homework solution.

Spend some time studying the `re` library.  It is a crucial tool for many programming tasks with strings.