# DSCI 511: Data acquisition and pre-processing <br>Chapter 4: Pre-processing considerations: foresight for downstream needs
Preprocessing encapsulates the set of operations needed in a project simply to prepare the data. Before you can start using the data that you have acquired to build what your project demands – be it visualizations, modeling, or otherwise – you need to get this data into a shape that will let you work with it. More often than not, acquired data will be in formats and structures that aren't useful for what you're trying to do. This is where preprocessing comes in. Since the data science workflow is often non-linear, you might find yourself coming back to preprocessing even after you thought you were done with it. Getting the data into the right shape can take up more time, effort, and resources than you think.

## 4.1 Converting Between File Types
The two major structures of data that you will work with in Python are tables and dictionaries. Tabular data is usually stored in text files meant to be read line-by-line with each line containing values separated by a delimiter. Most commonly, when the delimiter is a comma, this means storing data in a CSV file. When you're working with dictionary-structured data, it's very convenient to store it in JSON. Often, the data you acquire will not be in the format you need it to be. In other cases, you might need the same data to be stored in multiple formats for different types of interaction. Here, we'll demonstrate some rudimentary conversion techniques between tabular and dictionary data.

### 4.1.1 CSV to JSON
First, we'll talk about taking data from a CSV file and transforming it to store it in JSON. We have an example CSV file, `colors.csv`, that contains some data about colors. When we load it and print out the contents, we can see that each row corresponds to a color and contains an ID, the name of the color, the hexadecimal code for the color, and the red, green, and blue values of the color for the RGB code.

In [2]:
import csv, json
from pprint import pprint

# open the file in a `with` block so it is closed once the rows are read
with open("data/colors.csv", "r", newline="") as f:
    color_lists = list(csv.reader(f))

pprint(color_lists[:10])

[['air_force_blue_raf', 'Air Force Blue (Raf)', '#5d8aa8', '93', '138', '168'],
 ['air_force_blue_usaf', 'Air Force Blue (Usaf)', '#00308f', '0', '48', '143'],
 ['air_superiority_blue',
  'Air Superiority Blue',
  '#72a0c1',
  '114',
  '160',
  '193'],
 ['alabama_crimson', 'Alabama Crimson', '#a32638', '163', '38', '56'],
 ['alice_blue', 'Alice Blue', '#f0f8ff', '240', '248', '255'],
 ['alizarin_crimson', 'Alizarin Crimson', '#e32636', '227', '38', '54'],
 ['alloy_orange', 'Alloy Orange', '#c46210', '196', '98', '16'],
 ['almond', 'Almond', '#efdecd', '239', '222', '205'],
 ['amaranth', 'Amaranth', '#e52b50', '229', '43', '80'],
 ['amber', 'Amber', '#ffbf00', '255', '191', '0']]


Now, we can iterate over this list of colors and create dictionaries to store the values associated with each color. We can put these dictionaries in another list:

In [3]:
color_dicts = []

for row in color_lists:
    
    color = {
        "id" : row[0],
        "name" : row[1],
        "hex_value" : row[2],
        "r" : row[3],
        "g" : row[4],
        "b" : row[5]
    }
    
    color_dicts.append(color)
    
pprint(color_dicts[:10])

[{'b': '168',
  'g': '138',
  'hex_value': '#5d8aa8',
  'id': 'air_force_blue_raf',
  'name': 'Air Force Blue (Raf)',
  'r': '93'},
 {'b': '143',
  'g': '48',
  'hex_value': '#00308f',
  'id': 'air_force_blue_usaf',
  'name': 'Air Force Blue (Usaf)',
  'r': '0'},
 {'b': '193',
  'g': '160',
  'hex_value': '#72a0c1',
  'id': 'air_superiority_blue',
  'name': 'Air Superiority Blue',
  'r': '114'},
 {'b': '56',
  'g': '38',
  'hex_value': '#a32638',
  'id': 'alabama_crimson',
  'name': 'Alabama Crimson',
  'r': '163'},
 {'b': '255',
  'g': '248',
  'hex_value': '#f0f8ff',
  'id': 'alice_blue',
  'name': 'Alice Blue',
  'r': '240'},
 {'b': '54',
  'g': '38',
  'hex_value': '#e32636',
  'id': 'alizarin_crimson',
  'name': 'Alizarin Crimson',
  'r': '227'},
 {'b': '16',
  'g': '98',
  'hex_value': '#c46210',
  'id': 'alloy_orange',
  'name': 'Alloy Orange',
  'r': '196'},
 {'b': '205',
  'g': '222',
  'hex_value': '#efdecd',
  'id': 'almond',
  'name': 'Almond',
  'r': '239'},
 {'b': '80',
 

After this, it is easy to store this list of dictionaries as a JSON file:

In [4]:
# the `with` block makes sure the file is flushed and closed after writing
with open("data/color-dicts.json", "w") as f:
    json.dump(color_dicts, f)
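
The same conversion can also be sketched with `csv.DictReader`, which builds the dictionaries for us. This is only an illustration under a few assumptions: the two rows below are a small in-memory stand-in for `data/colors.csv`, the field names are the ones chosen in the loop above, and converting the RGB values to integers is an optional extra step that the original cells do not perform.

```python
import csv
import io
import json

# A two-row stand-in for data/colors.csv (same column order, no header row).
csv_text = (
    "alice_blue,Alice Blue,#f0f8ff,240,248,255\n"
    "amber,Amber,#ffbf00,255,191,0\n"
)

# Field names are supplied by hand because the file has no header line.
fieldnames = ["id", "name", "hex_value", "r", "g", "b"]

records = []
for row in csv.DictReader(io.StringIO(csv_text), fieldnames=fieldnames):
    # csv always yields strings; convert the channel values to integers
    for channel in ("r", "g", "b"):
        row[channel] = int(row[channel])
    records.append(row)

# Serialize to a JSON string; json.dump() would write to a file instead.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Whether the channel values should be stored as numbers or strings depends on what the downstream code expects, so it is worth deciding at conversion time.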

#### 4.1.1.1 Exercise: CSV to JSON conversion
Read the `cities.csv` file and look at its contents. It should have a header (the first line of the file) that tells you which fields contain what data. Next, take the data for only the cities that have their population listed and store this in JSON format.

In [None]:
# code goes here

### 4.1.2 JSON to CSV
Now we'll explore converting JSON data to tabular formats. In `nobel-laureates.json`, we have data about more than 900 recipients of the Nobel Prize. Let's load this in and take a look:

In [5]:
# the `with` block closes the file as soon as the data is loaded
with open("data/nobel-laureates.json", "r") as f:
    nobel_laureates = json.load(f)

pprint(nobel_laureates["laureates"][:10])

[{'born': '1845-03-27',
  'bornCity': 'Lennep (now Remscheid)',
  'bornCountry': 'Prussia (now Germany)',
  'bornCountryCode': 'DE',
  'died': '1923-02-10',
  'diedCity': 'Munich',
  'diedCountry': 'Germany',
  'diedCountryCode': 'DE',
  'firstname': 'Wilhelm Conrad',
  'gender': 'male',
  'id': '1',
  'prizes': [{'affiliations': [{'city': 'Munich',
                                'country': 'Germany',
                                'name': 'Munich University'}],
              'category': 'physics',
              'motivation': '"in recognition of the extraordinary services he '
                            'has rendered by the discovery of the remarkable '
                            'rays subsequently named after him"',
              'share': '1',
              'year': '1901'}],
  'surname': 'Röntgen'},
 {'born': '1853-07-18',
  'bornCity': 'Arnhem',
  'bornCountry': 'the Netherlands',
  'bornCountryCode': 'NL',
  'died': '1928-02-04',
  'diedCountry': 'the Netherlands',
  'diedCountr

We want to take some of this data and put it in a table. We're interested in the first name, surname, and the year of receiving the prize for each laureate. However, if we inspect the data we can see that there are, inexplicably, some entries that are empty of any information and don't even list a name. So, we need to build this condition into our code: a valid laureate must have a first name. Sometimes, the Prize is given to organizations, and organizations have no surnames. In these cases, we'll want to leave the surname empty. Some laureates (for example Marie Curie) may have won multiple Nobel Prizes in different years. For these cases, we'll want to build the `year` string so that it contains the different years with a space character in between. Keeping all these considerations in mind, we can convert the data into a tabular format:

In [6]:
laureates_table = []

for laureate in nobel_laureates["laureates"]:

    # skip the empty entries: a valid laureate must have a first name
    if "firstname" in laureate:

        # organizations have no surname, so fall back to an empty string
        surname = laureate.get("surname", "")

        # collect the year of every prize this laureate received
        years = [prize["year"] for prize in laureate["prizes"]]

        # a single year stays as-is; multiple years are joined by spaces
        years = " ".join(years)

        firstname = laureate["firstname"]

        row = [surname, firstname, years]

        laureates_table.append(row)

Some laureates' names might actually contain commas. So in this case, a comma would not be a good delimiter for the data. We can instead use a tab character `\t` as the delimiter and save this data to a "tab-separated values" or ".tsv" file.

In [7]:
with open("data/nobel-laureates-info.tsv", "w") as f:
    for laureate in laureates_table:
        f.write("\t".join(laureate) + "\n")
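
If a field could itself contain a tab, joining with `"\t"` would silently corrupt the row. One possible safeguard, sketched here as an assumption rather than as part of the original workflow, is to let `csv.writer` handle the delimiter and quoting. The rows below are hand-written stand-ins (the third is made up to force the edge case), and an in-memory buffer replaces the real file.

```python
import csv
import io

rows = [
    ["Röntgen", "Wilhelm Conrad", "1901"],
    ["Curie", "Marie", "1903 1911"],
    ["", "Example\tOrganization", "1999"],  # made-up row with an embedded tab
]

# Write tab-separated values; fields containing the delimiter get quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)

# Read the data back with the same delimiter to check the round trip.
buffer.seek(0)
round_trip = list(csv.reader(buffer, delimiter="\t"))
print(round_trip == rows)
```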

#### 4.1.2.1 Exercise: JSON to CSV conversion
Load the data in the `american-movies.json` file. We only want the movies that were made from 1990 to 1999 (it was a truly glorious decade for American cinema). Your task is to take the title and year of each of these movies and put them in a tab-separated values file.

In [None]:
# code goes here

#### 4.1.2.2 Can we always go both ways?
One thing to keep in mind is that CSV-like formats can't really handle nesting. Because of this, JSON files that contain nested structures will not translate directly into tabular formats. The utility of tabular formats is that they are simply text files that require minimal additional processing to read and write, so very often the same data will be faster to access and store in tabular formats than in JSON. File size will generally be smaller, too. However, more complexly structured data isn't compatible with tables.
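
When nesting cannot be avoided, one common workaround is to flatten it by repeating the parent fields once per nested item. The sketch below assumes a tiny hand-written list shaped like the laureate records (it is not loaded from the course file) and produces one row per prize instead of one row per laureate.

```python
# Hand-written stand-in records, shaped like entries in nobel-laureates.json.
laureates = [
    {"firstname": "Marie", "surname": "Curie",
     "prizes": [{"year": "1903", "category": "physics"},
                {"year": "1911", "category": "chemistry"}]},
    {"firstname": "Wilhelm Conrad", "surname": "Röntgen",
     "prizes": [{"year": "1901", "category": "physics"}]},
]

flat_rows = []
for laureate in laureates:
    # Repeat the laureate's name on every row, one row per prize.
    for prize in laureate["prizes"]:
        flat_rows.append([
            laureate["surname"],
            laureate["firstname"],
            prize["year"],
            prize["category"],
        ])

for row in flat_rows:
    print("\t".join(row))
```

The cost of this design is redundancy: the name is stored once per prize, and the one-record-per-laureate structure has to be rebuilt by grouping rows if it is needed again.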

In general, the structure of a JSON file is that of an _associative array_, meaning data is organized in key-value pairs. Associative arrays are great for data that isn't sequentially organized. Low-level, in-memory implementations of associative arrays make searching for values fast and reliable. Also, as we've mentioned, associative arrays make storing more complex structures possible. Tabular formats are analogous to _ordered arrays_ (think lists in Python). In an ordered array, the data is sequentially arranged. This means that, rather than being accessed using keys, the data is accessed using indices. However, since there are no keys, performing search operations is computationally costly. Ordered arrays are good for storing data that doesn't need to be searched very often and needs to be accessed sequentially.
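
The difference between the two access patterns can be sketched as follows. The three rows are a hand-picked subset of the color data, and `find_hex_in_rows` is a hypothetical helper written only for this comparison.

```python
# Ordered array: a list of rows, accessed by position.
color_rows = [
    ["alice_blue", "Alice Blue", "#f0f8ff"],
    ["almond", "Almond", "#efdecd"],
    ["amber", "Amber", "#ffbf00"],
]

def find_hex_in_rows(rows, color_id):
    """Scan the rows one by one until the ID matches (a linear search)."""
    for row in rows:
        if row[0] == color_id:
            return row[2]
    return None

# Associative array: build a dictionary keyed by ID once, then look up directly.
hex_by_id = {row[0]: row[2] for row in color_rows}

print(find_hex_in_rows(color_rows, "amber"))  # has to walk the list
print(hex_by_id["amber"])                     # goes straight to the key
```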

#### 4.1.2.3 Is the JSON file scalable?
You might've noticed that the top-level structure of the `'nobel-laureates.json'` file is a dictionary, even though it only has one key: `'laureates'`, underneath which is a list of all of the data _records_. In this example, a completely ordered `.csv` structure was not practical, since the value of each record's `'prizes'` field is a variable-length list (this is why the prize years were joined together as a space-delimited string in the conversion). Despite a lack of flexibility, a major benefit of tabular serializations is their convenience for line-by-line file reading, specifically when each row is an independent record. Reading a very large file line-by-line prevents your computer's memory from being overloaded and is a basic requirement of Hadoop and other disk-based implementations of the map-reduce programming pattern.
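
As a rough illustration of line-by-line reading, the sketch below streams tab-separated rows through a generator so that only one row is held in memory at a time. The `io.StringIO` object is an in-memory stand-in for a large file on disk, and `iter_records` is a hypothetical helper name.

```python
import io

# Stand-in for a large file; with a real file you would pass an open
# file handle instead of this in-memory buffer.
fake_file = io.StringIO(
    "Röntgen\tWilhelm Conrad\t1901\n"
    "Curie\tMarie\t1903 1911\n"
)

def iter_records(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    for line in handle:
        yield line.rstrip("\n").split("\t")

# Count prizes without ever building a list of all rows.
prize_count = 0
for surname, firstname, years in iter_records(fake_file):
    prize_count += len(years.split(" "))

print(prize_count)
```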

#### 4.1.2.4 Exercise: Making JSON file reading scalable
Create a specialized JSON serialization of the data in `'nobel-laureates.json'`. Specifically, create a file called `'data/nobel-laureates-lines.json'` that has each laureate's record serialized separately as a JSON object, with newlines `'\n'` in between as delimiters. As a follow-up, combine the line-by-line file reading syntax introduced in Section 1.4.1.5 with the string-based JSON functions (the `json.dumps()` serialization function from Section 1.4.2.2 and its parsing counterpart, `json.loads()`) to _read only the first ten lines_. As you read these lines, load each from JSON and print the laureate's list of prizes.

## 4.2 The 4 V's of Big Data
The 4 V's of Big Data refer to 4 important considerations to keep in mind when working with large data sets, and all of these are important when dealing with the preprocessing phase of a project.

- __Volume__: This refers to the overall size of a data set, whether measured in bytes (KB, MB, GB, TB, etc.), in number of records/rows, or in represented individuals. In the case of volume, big data refers to the capacity for _quantity_ to create processing challenges, which if overcome might lead to game-changing opportunities.
- __Velocity__: This refers to the _rate_ at which data are being produced. Sometimes, prolonged collection of high-velocity data will result in a data set that is big for reasons of volume. However, velocity can create big data challenges and opportunities independent of volume. Rapid rates of data generation can create opportunities for real-time analysis, where we know what's going on right now in the world.
- __Variety__: The whole may be greater than the sum of its parts. Having more dimensions through which to view an individual or data object can really enrich the outcome of an analysis. However, variety can be challenging, too: working with high-variety data with many linked dimensions and/or mediums might make it unclear where an analysis should start, or how those dimensions fit together into a bigger picture. A project working with high-variety data will probably spend a lot of time in the weeds with exploratory data analysis just trying to figure out which data are actually useful.
- __Veracity__: This refers to the _reliability_ of data. While it's hard to imagine when data veracity can be a helpful feature (unlike volume, velocity, or variety), it's pretty easy to encapsulate veracity as a big data problem: can you trust the data that you're working with? Here's perhaps one way that veracity might be beneficial: in unstructured data types, like text, a phenomenon is recorded as it operates, and without the imposition of structure by a data collector. This allows for the full _variety_ of behaviors that might occur to occur, at the expense of a reliability mess.

In particular, annotating data reliably is a major issue. Different types of annotation include:

- __Sensed labels__: Some devices can provide precise data labels, like geographic annotations via GPS and latitude/longitude pairs.
- __Unsupervised cues__: Some researchers will use observable features as proxies for a desired annotation, e.g., using keywords in text to label document topics.
- __User-provided tags__: Some data-production platforms regularize symbols for users to apply to their content. Just like webpage meta tags embed annotations into webpages for search engines, hashtags are data annotations on Twitter that cue topics and link content. Here, it's up to the user to apply the right annotations to their data.
- __Third-party reviews__: This is probably the most common annotation strategy for researchers, and generally occurs after data are generated. Most commonly, a high-quality and representative sample (see __Sec. 1.1.4__) of data are passed out to individuals who are trained to annotate or know about the phenomenon of interest. This might mean doing it yourself, asking/paying graduate students to complete an annotation task, or creating a large survey and distributing it electronically.
    - __Amazon's Mechanical Turk__: [This](https://www.mturk.com/) is an online distributed survey service provided by Amazon that connects survey writers (for a fee) to a very large network of survey takers. The survey takers generally come from across the globe and sometimes even live off of these Mechanical Turk "HITs" (surveys) as a form of primary income. For researchers, this has become a quick and (relatively) easy way to annotate large quantities of data.
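
To make the "unsupervised cues" strategy above a little more concrete, here is a minimal sketch of keyword-based topic labeling. The topics, keyword sets, and the `label_document` helper are all invented for illustration; labels produced this way are noisy proxies and carry exactly the veracity concerns discussed above.

```python
# Hypothetical topics and cue words, chosen only for this example.
topic_keywords = {
    "sports": {"game", "score", "team"},
    "weather": {"rain", "storm", "forecast"},
}

def label_document(text):
    """Return every topic whose keyword set overlaps the document's words."""
    words = set(text.lower().split())
    labels = [
        topic
        for topic, keywords in topic_keywords.items()
        if words & keywords  # non-empty intersection means a cue was found
    ]
    return sorted(labels)

print(label_document("The team won the game despite the rain"))
print(label_document("Nothing relevant here"))
```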

## 4.3 Randomness

An important module we often need to use is the `random` module. In particular, it is very useful for taking random samples from large volumes of data. For example, we could use the `random.sample()` function to take a sample from `color_lists`.

In [8]:
import random

color_sample = random.sample(color_lists, 5)

pprint(color_sample)

[['raw_umber', 'Raw Umber', '#826644', '130', '102', '68'],
 ['pale_violet_red', 'Pale Violet-Red', '#db7093', '219', '112', '147'],
 ['orange_web_color', 'Orange (Web Color)', '#ffa500', '255', '165', '0'],
 ['cherry', 'Cherry', '#de3163', '222', '49', '99'],
 ['light_crimson', 'Light Crimson', '#f56991', '245', '105', '145']]


In [9]:
color_sample = random.sample(color_lists, 5)

pprint(color_sample)

[['red_pigment', 'Red (Pigment)', '#ed1c24', '237', '28', '36'],
 ['medium_champagne', 'Medium Champagne', '#f3e5ab', '243', '229', '171'],
 ['light_slate_gray', 'Light Slate Gray', '#789', '119', '136', '153'],
 ['midnight_green_eagle_green',
  'Midnight Green (Eagle Green)',
  '#004953',
  '0',
  '73',
  '83'],
 ['coral_red', 'Coral Red', '#ff4040', '255', '64', '64']]


In [10]:
color_sample = random.sample(color_lists, 5)

pprint(color_sample)

[['violet', 'Violet', '#8f00ff', '143', '0', '255'],
 ['saddle_brown', 'Saddle Brown', '#8b4513', '139', '69', '19'],
 ['vivid_auburn', 'Vivid Auburn', '#922724', '146', '39', '36'],
 ['dark_sienna', 'Dark Sienna', '#3c1414', '60', '20', '20'],
 ['lavender_purple', 'Lavender Purple', '#967bb6', '150', '123', '182']]


It is also possible to perform _reproducible_ random operations by calling `random.seed()` with a fixed value before sampling.

In [11]:
random.seed(42)

color_sample = random.sample(color_lists, 5)

pprint(color_sample)

[['raspberry_glace', 'Raspberry Glace', '#915f6d', '145', '95', '109'],
 ['cadmium_yellow', 'Cadmium Yellow', '#fff600', '255', '246', '0'],
 ['arsenic', 'Arsenic', '#3b444b', '59', '68', '75'],
 ['straw', 'Straw', '#e4d96f', '228', '217', '111'],
 ['electric_crimson', 'Electric Crimson', '#ff003f', '255', '0', '63']]


In [12]:
random.seed(42)

color_sample = random.sample(color_lists, 5)

pprint(color_sample)

[['raspberry_glace', 'Raspberry Glace', '#915f6d', '145', '95', '109'],
 ['cadmium_yellow', 'Cadmium Yellow', '#fff600', '255', '246', '0'],
 ['arsenic', 'Arsenic', '#3b444b', '59', '68', '75'],
 ['straw', 'Straw', '#e4d96f', '228', '217', '111'],
 ['electric_crimson', 'Electric Crimson', '#ff003f', '255', '0', '63']]
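
Calling `random.seed()` resets the module-wide generator, which affects every later random call in the notebook. An alternative, sketched here as an optional design choice, is to create a dedicated `random.Random` instance with its own seed. The list of integers is just a stand-in for `color_lists`.

```python
import random

items = list(range(100))  # stand-in for a larger data set

# Two independent generators seeded with the same value.
rng_a = random.Random(42)
rng_b = random.Random(42)

sample_a = rng_a.sample(items, 5)
sample_b = rng_b.sample(items, 5)

# Same seed, same sequence of draws, so the samples are identical.
print(sample_a == sample_b)
```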


##  4.4 Pre-processing text data
Text data is abundant, and very often you will find yourself working with it. In this section, we'll discuss some powerful tools and techniques for text data manipulation that can come in very handy in preprocessing.

### 4.4.1 Regular expressions

Regular expressions, or regex, are "sequences of characters that define a search pattern", according to Wikipedia. These patterns can be used to search for, find, replace, and do a great deal more with strings.

Regular expression patterns are constructed with both ordinary and special characters. The simplest regular expressions are simply ordinary characters like "A", or "5", or "status". These patterns only match themselves, allowing you to search for exact patterns of characters. Some characters are "special" for regex, like "|" or "\[" or "^". These characters can be used to construct regex that is more powerful than straightforward matching.

#### 4.4.1.1 Basics

Python's included `re` module can be used to construct and use regular expressions. It comes with many useful functions. The most basic of these is `re.search()`, which returns a match object if the pattern matched somewhere in the string and a `None` value if it didn't. This means `re.search()` outputs can be used with conditional statements (like `if` statements).

In [13]:
import re 

silly_string = "one fish two fish red fish blue fish"

print(re.search("fish", silly_string))

<re.Match object; span=(4, 8), match='fish'>


In [14]:
print(re.search("salmon", silly_string))

None


In [15]:
if re.search("fish", silly_string):
    print("Fish were found.")
else:
    print("There were no fish.")

Fish were found.


Another useful function is `re.sub()`, which takes a pattern, a replacement string, and the string to search as input, and replaces every match of the pattern with the replacement.

In [16]:
silly_cats = re.sub("fish", "cat", silly_string)

print(silly_cats)

one cat two cat red cat blue cat


`re.findall()` will return all matches of a pattern in a string:

In [17]:
print(re.findall("cat", silly_cats))

['cat', 'cat', 'cat', 'cat']



#### 4.4.1.2 A few useful character classes and other means for flexibility

- `.` __(wild card)__ In the default mode, this matches any character except a newline.
- `[...]` __(character class)__ Used to indicate flexible matching across a specified set of characters.
- `[^...]` __(complementary character class)__ Used to indicate flexible matching across _everything but_ a specified set of characters.
- `[a-z]` __(lowercase range)__ Used to indicate flexible matching across lowercase letter ranges.
- `[A-Z]` __(uppercase range)__ Used to indicate flexible matching across uppercase letter ranges.
- `[0-9]` __(numeric range)__ Used to indicate flexible matching across numeric ranges.
- `|` __(or)__ `A|B` creates a regular expression that will match either A or B.
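
A quick way to get a feel for these is to try each one on the same string. The sentence below is made up for the purpose, and each call is annotated with what the pattern is looking for.

```python
import re

text = "Room B12 costs $45; room c7 costs $30."

# [0-9]: every single digit
digits = re.findall("[0-9]", text)

# [A-Z][0-9]: an uppercase letter immediately followed by a digit
upper_then_digit = re.findall("[A-Z][0-9]", text)

# [^...]: anything that is NOT a letter, a digit, or a space
leftovers = re.findall("[^a-zA-Z0-9 ]", text)

# A|B: either of two alternatives
rooms = re.findall("B12|c7", text)

# . (wild card): "c", then any one character, then "s"
wild = re.findall("c.s", text)

for result in (digits, upper_then_digit, leftovers, rooms, wild):
    print(result)
```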

Like the `string.split()` method, the `re` module has an `re.split()` function that accepts regex patterns. We could combine this with a character class:

In [19]:
not_a_silly_string = "Oftentimes, different punctuation characters are used; these indicate different types of stops."

## split a string by several types of punctuation
clauses = re.split("[,;.]", not_a_silly_string)
print(clauses)

['Oftentimes', ' different punctuation characters are used', ' these indicate different types of stops', '']


If we have some text that we suspect contains Philadelphia area ZIP codes, we could use character classes to extract these.

In [20]:
text = "Drexel's University City campus falls in 19104, while the College of Nursing is in 19102 and the Philadelphia City Hall is in 19107."

# Philly ZIP codes go from 19102 to 19154; this pattern is slightly looser and matches 19100 through 19159
zipcodes = re.findall("191[0-5][0-9]", text)
print(zipcodes)

['19104', '19102', '19107']


#### 4.4.1.3 Exercise: Regex phone numbers
Read the file `phone-numbers.txt`. It contains a phone number in each line. \[Hint: use something like `lines = open("file.txt", "r").readlines()`\] Store only the phone numbers with the area code "215" in a list and print it out. Use regex-based pattern matching, not any other methods which occur to you.

In [None]:
# code goes here

#### 4.4.1.4 Grouping, numbered groups and extensions
Grouping is a great way to modify and extend strings, without simply replacing them. With grouping, you can use the matched content in a substitute string. It's great for re-formatting text. Groups can also serve extended functions if they are initiated by an unescaped question mark.
- `(...)` __(group)__ Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\1`, `\2`, etc., special sequences, described below.
- `\1`, `\2`, etc. __(captured groups)__ Matched groups are captured and held in order: low to high from left to right, and in the case of nested groups, from outside to inside.
- `(?:...)` __(non-capturing group)__ Matches `...` as in the parentheses, but does not capture it in a group. This becomes especially important when applying multipliers.
- `(?=...)` __(lookahead)__ Matches if `...` matches next, but doesn’t consume any of the string.
- `(?!...)` __(negative look ahead)__ Matches if `...` doesn’t match next.
- `(?<=...)` __(positive look behind)__ Matches if the current position in the string is preceded by a match for `...` that ends at the current position. 
- `(?<!...)` __(negative look behind)__ Matches if the current position in the string is not preceded by a match for `...`.

In [21]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's capture Jenny's phone number and insert the area code
modified_tommy_two_tone = re.sub(r"([0-9][0-9][0-9]-[0-9][0-9][0-9][0-9])",r"1-800-\1", tommy_two_tone)

print(modified_tommy_two_tone)

Apparently, 1-800-867-5307 is Jenny's phone number, but I'm not sure what her area code is.
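
Two of the constructs listed above, backreferences and look-behind, can be sketched on made-up strings as follows. In the first pattern, `\1` must repeat whatever the first group captured, which finds doubled words; in the second, the look-behind requires a dollar sign before the number without including it in the match.

```python
import re

# Backreference: find a word that is immediately repeated.
typo_text = "This sentence has has a repeated word, and and so does this clause."
repeated = re.findall(r"\b(\w+) \1\b", typo_text)

# The captured group can also be reused in the replacement string.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", typo_text)

# Positive look-behind: digits preceded by "$", without capturing the "$".
price_text = "Muffins cost $3.88 and coffee costs $2.50."
prices = re.findall(r"(?<=\$)[0-9]+\.[0-9][0-9]", price_text)

print(repeated)
print(fixed)
print(prices)
```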


#### 4.4.1.5 Multipliers (quantifiers)
It was a little bit of overkill to use the numeric character class so many times in a row in the last expression. This is an example of where multipliers can come in really handy.
- `*` __(zero or more)__ Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. 
- `+` __(one or more)__ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
- `?` __(zero or one)__ Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
- `{m}` __(exactly m times)__ Specifies that exactly m copies of the previous RE should be matched.
- `{m,n}` __(m through n times)__ Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.

In [22]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's get all of the phone numbers in a string
numbers = re.findall("[0-9]{3}-[0-9]{4}", tommy_two_tone)
print(numbers)

['867-5307']


In [23]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## capture the word that appears before a lookahead: "'s phone number" 
## by matching one or more non-space characters before 
## along with the number itself with flexible "|" matching
whos_number = re.findall("([^ ]+)(?='s phone number)|([0-9]{3}-[0-9]{4})", tommy_two_tone)

print(whos_number)

[('', '867-5307'), ('Jenny', '')]


In [24]:
## We can even get a bit more flexible with our area-code handling!
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."
my_contact_information = "If you need my office line, it's 215-895-2185"

## By grouping and using a `{1,2}` flexible match, we can get full and partial numbers
## Note: we have to use a non-capturing group in order to make sure we get the full expression
## without capturing the first three digits, only.
numbers =  re.findall("(?:[0-9]{3}-){1,2}[0-9]{4}", tommy_two_tone)
print(numbers)
numbers =  re.findall("(?:[0-9]{3}-){1,2}[0-9]{4}", my_contact_information)
print(numbers)

['867-5307']
['215-895-2185']
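
By default these multipliers are _greedy_: they take as many repetitions as possible. Appending a `?` to a multiplier (as in `*?` or `+?`) makes it _non-greedy_, so it takes as few as possible. The short sketch below, on a made-up string, contrasts the two and also shows `?` used in its plain "zero or one" role.

```python
import re

html_like = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the LAST ">" it can reach, giving one long match.
greedy = re.findall("<.*>", html_like)

# Non-greedy: ".*?" stops at the FIRST ">" it can reach, giving each tag.
lazy = re.findall("<.*?>", html_like)

# "?" as zero-or-one: the "u" is optional.
spellings = re.findall("colou?r", "color colour")

print(greedy)
print(lazy)
print(spellings)
```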


#### 4.4.1.6 Escapes and special sequences
As it turns out, some character classes are so common that they have their own special sequences. So, our phone-number example could be even more concise with the `\d` special sequence.
- `\` __(escape)__ Either escapes special characters (permitting you to match characters like `*`, `?`, and so forth), or signals a special sequence.
- `\d` __(digits)__ Matches any Unicode decimal digit. This includes `[0-9]`, and also many other digit characters.
- `\D` __(non-digits)__ Matches any Unicode non-digit.
- `\s` __(whitespace)__ Matches Unicode whitespace characters, including `[\t\n\r]` and space.
- `\w` __(word characters)__ Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
- `\W` __(non-word characters)__ Matches Unicode non-word characters;
- `\t` __(tab)__ Matches a tab character.
- `\n` __(newline)__ matches a newline character.
- `\r` __(carriage return)__ matches a carriage return character.

In [25]:
tommy_two_tone = "Apparently, 867-5307 is Jenny's phone number, but I'm not sure what her area code is."

## let's get all of the phone numbers in a string
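## Note: the raw string (r"...") keeps Python from interpreting the backslashes,
## and the hyphen sits inside the group so that an optional area code matches, too
numbers = re.findall(r"(?:\d{3}-){1,2}\d{4}", tommy_two_tone)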
print(numbers)

['867-5307']
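
The other special sequences can be tried out the same way. In this sketch the messy string is invented, and the patterns are written as raw strings (`r"..."`) so that Python passes the backslashes through to `re` untouched.

```python
import re

messy = "id:\t42\nname:  Ada_L\r\n"

# \d+ : runs of digits
numbers_found = re.findall(r"\d+", messy)

# \w+ : runs of word characters (letters, digits, and the underscore)
words_found = re.findall(r"\w+", messy)

# \s+ : split on any run of whitespace (tabs, newlines, spaces, ...)
pieces = re.split(r"\s+", messy.strip())

print(numbers_found)
print(words_found)
print(pieces)
```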


#### 4.4.1.7 Anchors
Anchors allow you to fix the positions of matches relative to the overall string. These become especially handy if you are pre-processing semi-structured text, like a screenplay, stenographer's court record, or the index of a book.
- `^` __(start anchor)__ Matches the start of the string.
- `$` __(end anchor)__ Matches the end of the string or just before the newline at the end of the string.

In [26]:
## an example of some semi-structured text
macbeth = "First Witch: When shall we three meet again? In thunder, lightning, or in rain?\nSecond Witch: When the hurlyburly's done, when the battle's lost and won."
print(macbeth)
print("")

## make some empty lists for our data
speakers = []
speeches = []

## split the document into the lines of the play
lines = macbeth.split("\n")

## loop over the lines
for line in lines:
    
    ## retrieve the matched groups
    ## Note: if we simply split by a colon 
    ## we might mess up what people are saying in the text!
    ## Also note: the non-greedy ".*?" matches ANYTHING, zero or more times,
    ## but as few characters as possible, so the first group stops at the first ": ".
    ## This comes in very handy when you want loosely anything
    ## that happens to be surrounded by some specified structure
    speaker, speech = re.search("^(.*?): (.*?)$", line).groups()

    ## Grow the lists
    speakers.append(speaker)
    speeches.append(speech)

print(speakers)
print("")
print(speeches)
print("")

First Witch: When shall we three meet again? In thunder, lightning, or in rain?
Second Witch: When the hurlyburly's done, when the battle's lost and won.

['First Witch', 'Second Witch']

['When shall we three meet again? In thunder, lightning, or in rain?', "When the hurlyburly's done, when the battle's lost and won."]
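
The loop above splits the text into lines first because, by default, `^` and `$` only anchor to the start and end of the whole string. As an alternative sketch (not used in the original cell), the `re.MULTILINE` flag makes both anchors apply at every line boundary, so the speakers can be pulled out in a single call. The shortened two-line string here is a stand-in for the `macbeth` variable.

```python
import re

script = (
    "First Witch: When shall we three meet again?\n"
    "Second Witch: When the hurlyburly's done."
)

# Default mode: "^" matches only at the very start of the string.
first_only = re.findall(r"^(.*?): ", script)

# MULTILINE mode: "^" also matches right after every newline.
speaker_names = re.findall(r"^(.*?): ", script, flags=re.MULTILINE)

print(first_only)
print(speaker_names)
```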



#### 4.4.1.8 Exercise: Names of the gods
In the cell below is some text. It's an extract from [A Clash of Kings](https://www.goodreads.com/book/show/10572.A_Clash_of_Kings), specifically, about a character's prayer to some fictional gods. Use regex to extract the names of these gods. Your output should be a list that looks something like `["the Father", "the Mother", "the Warrior"]`.

In [111]:
text = 'Lost and weary, Catelyn Stark gave herself over to her gods. She knelt before the Smith, who fixed things that were broken, and asked that he give her sweet Bran his protection. She went to the Maid and beseeched her to lend her courage to Arya and Sansa, to guard them in their innocence. To the Father, she prayed for justice, the strength to seek it and the wisdom to know it, and she asked the Warrior to keep Robb strong and shield him in his battles. Lastly she turned to the Crone, whose statues often showed her with a lamp in one hand. "Guide me, wise lady," she prayed. "Show me the path I must walk, and do not let me stumble in the dark places that lie ahead."'

# code goes here

### 4.4.2 Tokenization
Tokenization is the process of breaking up text into smaller units. Usually, this means breaking a string up into words. The simplest possible tokenization would be to use the `string.split()` method:

In [27]:
sentence = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

words = sentence.split()

print(words)

['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']


The problem with this is, punctuation has been captured as part of some words. For a more advanced tokenizer, we'll use one of the most well-known Python modules for natural language processing, the Natural Language Toolkit (`nltk`). (Install it with `pip3 install nltk`, then import it with `import nltk` and run `nltk.download()`, which will open up a graphical window and allow you to download the data NLTK needs to perform many tasks.) ([Docs](https://www.nltk.org/))

In [34]:
from nltk.tokenize import word_tokenize

words = word_tokenize(sentence)

print(words)

['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']


A newer set of tools can be found in the `spacy` module (`pip3 install spacy`). ([Docs](https://spacy.io/usage))

In [28]:
import spacy

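# Note: spaCy v3 removed the "en" shortcut; this assumes the small English model
# has been installed with `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")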
doc = nlp(sentence)

# spacy creates "token" objects which have quite a few properties. Check the documentation out if you're interested in learning more.

words = []

for token in doc:
    words.append(token.text)
    
print(words)

['Good', 'muffins', 'cost', '$', '3.88', '\n', 'in', 'New', 'York', '.', ' ', 'Please', 'buy', 'me', '\n', 'two', 'of', 'them', '.', '\n\n', 'Thanks', '.']


As you may have noticed, tokenizers differ: `spacy`, for example, keeps newlines and extra spaces as tokens of their own, while `nltk` discards them. These variations can produce different outcomes further down the line, so the choice of tokenizer can make or break a particular application.
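
To make the point concrete that a tokenizer is ultimately a set of choices, here is a minimal regex-based sketch that uses only the standard library. The pattern and the function name `regex_tokenize` are assumptions made for illustration; this is not how `nltk` or `spacy` work internally, and a different pattern would yield different tokens.

```python
import re

sentence = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Illustrative pattern (an assumption, not a library default). In order of preference:
#   1. a number with an optional decimal part, e.g. 3.88
#   2. a run of word characters, e.g. muffins
#   3. any single character that is neither a word character nor whitespace, e.g. $ or .
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def regex_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)


words = regex_tokenize(sentence)
print(words)
```

Changing the order of the alternatives, or dropping the numeric branch, changes how `$3.88` is split, which is exactly the kind of variation that separates one tokenizer from another.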

### 4.4.3 Formatting Issues in Text Data

As we've already seen in the previous section, delimitation in text can be an issue. While it is usually a good idea to go with a reliable tokenizer, sometimes you will have to deal with delimitation and whitespace issues at a more granular level. Simply counting the words in a text makes a good example.

In [29]:
import re
from collections import Counter

document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"
print(document)
print()
word_counts = Counter()
words = re.split(" ", document)
for word in words:
    word_counts[word] += 1

print(word_counts.most_common())

	You might have an easy time reading this, 
but the computer has some extra spaces tabs and 
newlines to deal with.  After all, two spaces after 
a stop isn't strange!

[('spaces', 2), ('\tYou', 1), ('might', 1), ('have', 1), ('an', 1), ('easy', 1), ('time', 1), ('reading', 1), ('this,', 1), ('\nbut', 1), ('the', 1), ('computer', 1), ('has', 1), ('some', 1), ('extra', 1), ('tabs', 1), ('and', 1), ('\nnewlines', 1), ('to', 1), ('deal', 1), ('with.', 1), ('', 1), ('After', 1), ('all,', 1), ('two', 1), ('after', 1), ('\na', 1), ('stop', 1), ("isn't", 1), ('strange!', 1)]


Since there was irregular whitespace in the text, it ended up attached to some of the words, and we also ended up counting an empty string. This is a result of using `re.split()` and specifying a single space character as the delimiter. Using the `string.split()` method with no arguments, which splits on any run of whitespace, is a possible solution, as is splitting on the more general `\W` character class (non-word characters). We can also apply the `string.strip()` method to remove extra whitespace.

In [30]:
document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"

word_counts = Counter()
words = re.split(r"\W", document)  # raw string avoids an invalid-escape warning for \W
for word in words:
    word = word.strip()
    if word:
        word_counts[word] += 1
    
print(word_counts.most_common())

[('spaces', 2), ('You', 1), ('might', 1), ('have', 1), ('an', 1), ('easy', 1), ('time', 1), ('reading', 1), ('this', 1), ('but', 1), ('the', 1), ('computer', 1), ('has', 1), ('some', 1), ('extra', 1), ('tabs', 1), ('and', 1), ('newlines', 1), ('to', 1), ('deal', 1), ('with', 1), ('After', 1), ('all', 1), ('two', 1), ('after', 1), ('a', 1), ('stop', 1), ('isn', 1), ('t', 1), ('strange', 1)]
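
The whitespace-only alternative mentioned above can be sketched as follows. This is a minimal illustration, assuming that stripping leading and trailing punctuation with `string.punctuation` is acceptable for the text at hand; it is one possible approach, not the only one.

```python
import string
from collections import Counter

document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"

word_counts = Counter()
# split() with no arguments splits on any run of whitespace (spaces, tabs,
# newlines) and never produces empty strings.
for word in document.split():
    # Remove punctuation from the ends only, so the apostrophe inside "isn't" survives.
    word = word.strip(string.punctuation)
    if word:
        word_counts[word] += 1

print(word_counts.most_common())
```

Unlike splitting on `\W`, this keeps "isn't" as a single token, because only punctuation at the edges of each word is removed.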


Another issue here is capitalization: is "After" a different word from "after"? We could solve this by lowercasing all words before counting them:

In [31]:
document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"

word_counts = Counter()
words = re.split(r"\W", document)  # raw string avoids an invalid-escape warning for \W
for word in words:
    word = word.strip().lower()
    if word:
        word_counts[word] += 1
    
print(word_counts.most_common())

[('spaces', 2), ('after', 2), ('you', 1), ('might', 1), ('have', 1), ('an', 1), ('easy', 1), ('time', 1), ('reading', 1), ('this', 1), ('but', 1), ('the', 1), ('computer', 1), ('has', 1), ('some', 1), ('extra', 1), ('tabs', 1), ('and', 1), ('newlines', 1), ('to', 1), ('deal', 1), ('with', 1), ('all', 1), ('two', 1), ('a', 1), ('stop', 1), ('isn', 1), ('t', 1), ('strange', 1)]


Again, this might still not be good enough. The contraction "isn't" here is broken up into "isn" and "t", but possibly the best way to tokenize this would be "is" and "n't", since it is a contraction of those two words. Delimitation issues can become problematic very easily, and there is usually no one catch-all way to fix these problems. In general, attention to detail and good use of regular expressions are really the only way to solve these kinds of problems.
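
As a hedged sketch of what such a regular-expression fix might look like, the example below keeps words with internal apostrophes together and then splits a trailing "n't" into its own token. The pattern and the helper `split_contraction` are hypothetical names introduced for illustration, and the rule is deliberately naive: it would turn "can't" into "ca" and "n't", and it ignores other contractions such as "we'll".

```python
import re
from collections import Counter

document = "\tYou might have an easy time reading this, \nbut the computer has some extra spaces tabs and \nnewlines to deal with.  After all, two spaces after \na stop isn't strange!"

# Assumed pattern: runs of lowercase letters, optionally joined by internal
# apostrophes, so "isn't" is matched whole rather than as "isn" and "t".
WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def split_contraction(word):
    """Split a trailing "n't" off a word; return any other word unchanged."""
    if word.endswith("n't") and len(word) > 3:
        return [word[:-3], "n't"]
    return [word]


word_counts = Counter()
# Lowercase first so that "After" and "after" are counted together.
for word in WORD_PATTERN.findall(document.lower()):
    for piece in split_contraction(word):
        word_counts[piece] += 1

print(word_counts.most_common())
```

Every special case handled this way is another rule to maintain, which is why a well-tested tokenizer is usually the better starting point.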

### 4.4.4 Datetime parsing
Date-time information is often found in text, and parsing (extracting and interpreting) this information can be a significant preprocessing task. We can use the `dateutil` module to simplify this (`pip3 install python-dateutil`). The `parse()` function in its `parser` submodule can handle a variety of datetime formats.

In [32]:
import dateutil.parser as dateparser

calendar = {
    "Term start date" : "September 24, 2018",
    "Weekly meeting" : "Thursday, 12:30pm",
    "Thanksgiving holiday begins" : "11/20/2018 10pm",
    "Christmas day" : "12/25/2018"
}

for event in calendar:
    print(event)
    print(dateparser.parse(calendar[event]))
    print()

Term start date
2018-09-24 00:00:00

Weekly meeting
2018-09-20 12:30:00

Thanksgiving holiday begins
2018-11-20 22:00:00

Christmas day
2018-12-25 00:00:00



#### 4.4.4.1 The datetime module
While `dateutil.parser` knows some common date-time string patterns, these can be specified exactly for parsing using the `datetime` module:
- https://docs.python.org/3/library/datetime.html

`datetime` defines Python's temporal objects and offers a large number of utilities for working with time, such as an easy way to get the current time of execution (see below). In general, with the `datetime` module, dates can be converted to objects that support arithmetic, so calculations can be performed easily and intuitively. Here's an example calculating _timedeltas_, i.e., differences in time between two datetime objects:

In [33]:
from datetime import datetime

current_time = datetime.now()

thanksgiving = dateparser.parse(calendar["Thanksgiving holiday begins"])

print("Time to thanksgiving holidays: ")
print((thanksgiving - current_time))
print()

# if we wanted just the days
print("Days until Thanksgiving holidays: ")
print((thanksgiving - current_time).days)
print()

Time to thanksgiving holidays: 
64 days, 4:08:42.530795

Days until Thanksgiving holidays: 
64
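
Specifying a format exactly, as described at the start of this section, can be sketched with `datetime.strptime`. The date string below is a made-up example in day-first order, chosen because a string such as "11/12/2018" is ambiguous without an explicit format. Both dates are fixed, so the result does not depend on when the code is run.

```python
from datetime import datetime

# Made-up example string in day/month/year order with a 24-hour time.
raw = "20/11/2018 22:00"

# %d = day, %m = month, %Y = four-digit year, %H = hour (24-hour clock), %M = minute
holiday_start = datetime.strptime(raw, "%d/%m/%Y %H:%M")
print(holiday_start)  # 2018-11-20 22:00:00

# Subtracting two datetime objects gives a timedelta.
christmas = datetime(2018, 12, 25)
remaining = christmas - holiday_start
print(remaining)       # 34 days, 2:00:00
print(remaining.days)  # 34
```

If the string does not match the format, `strptime` raises a `ValueError` instead of guessing, which is often what you want when preprocessing dates in a known layout.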



#### 4.4.4.2 Exercise: Calculate your exact age
Calculate your own age using datetime parsing! Can you come up with a datetime format for your birthday that `dateutil.parser` doesn't recognize or recognizes incorrectly? If so, use the `datetime` module to specify the format exactly. Hint: review these docs:
- https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
- https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [None]:
birthday = ""

# code goes here