Why use Regex in the command line
Your text editor (Sublime) can only handle files up to a certain size. The command line is THE place to go when you have one huge file (more than about 450,000 lines) or one or more directories full of huge files.
- ex1. Jeb Bush emails
- ex2. Flint water emails in the
- ex3. One batch of Hillary Clinton's emails
Once the script is finished you will have installed the command-line tool that we will be using:
Command line 102
You already know a lot of command-line utilities!
cd ls touch mkdir rm mv git
The following utilities come with your terminal.
command input > output
> takes the output of the command on its left and writes it to a file with the name on its right.
command input | another-command | yet-another-command
| passes the results of the first command to the second command as its input, and so on down the chain.
We will use these in a second.
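As a minimal sketch of both operators, the block below builds a small made-up file (fruits.txt is a hypothetical name, not one of the workshop files), redirects a sorted copy into a new file, and then pipes the sorted output into head.

```shell
# Create a small, made-up sample file (hypothetical name: fruits.txt).
printf 'pear\napple\nfig\n' > fruits.txt

# ">" writes the command's output to a file instead of the screen.
sort fruits.txt > sorted-fruits.txt

# "|" hands the output of one command to the next command as its input.
sort fruits.txt | head -n 1   # prints: apple
```

Nothing appears on screen for the first sort because its output went into sorted-fruits.txt.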
Practice the following with files in
sort -r alphabet.txt
sort -r alphabet.txt > reverse-alphabet.txt
sort -n numbers.txt
sort -t "," -k 3 lead-request-zipcode.csv
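The flags above are easier to tell apart on a tiny example. This sketch uses made-up data (sample-numbers.txt and sample.csv are hypothetical names, not workshop files) to show what -n, -r, -t and -k change.

```shell
# Made-up sample data; the file names are hypothetical.
printf '10\n9\n100\n' > sample-numbers.txt
printf 'a,x,30\nb,y,4\nc,z,200\n' > sample.csv

# Default sort compares lines as text, so "100" lands before "9".
sort sample-numbers.txt

# -n compares lines as numbers: 9, 10, 100.
sort -n sample-numbers.txt

# -r reverses whichever order is in effect: 100, 10, 9.
sort -n -r sample-numbers.txt

# -t "," sets the field separator and -k 3 sorts on the third field onward;
# adding -n makes that field compare as a number.
sort -t "," -k 3 -n sample.csv > sample-sorted.csv
```

Without -n, a column of numbers is ordered character by character, which is rarely what you want for zip codes or counts of different lengths.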
How many types of lead kit does the city record?
sort lead-request-type.csv | uniq
How many incidents under each type?
sort lead-request-type.csv | uniq -c > lead-type-count.csv
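To see what these two pipelines produce, here is a sketch on a made-up file (sample-types.txt is a hypothetical stand-in for lead-request-type.csv, and its values are invented). Counting the lines that come out of uniq answers the first question; uniq -c answers the second.

```shell
# Made-up request types, one per line (hypothetical data).
printf 'kit\nvisit\nkit\nkit\nvisit\n' > sample-types.txt

# One line per distinct type.
sort sample-types.txt | uniq

# Count those lines to get the number of distinct types.
sort sample-types.txt | uniq | wc -l

# -c prefixes each distinct line with how many times it occurred.
sort sample-types.txt | uniq -c > sample-type-count.txt
cat sample-type-count.txt
```

The sort comes first because uniq only compares each line with the one directly before it.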
ack is a command-line file pattern searcher that is fast and optimized in a lot of ways. See why ack.
Remember how you tried to get all the phone numbers from Jeb's email? You did something like this in Sublime:
- Search for all the phone numbers in a document with a Regex.
- Copy all the phone numbers.
- Create a new document.
- Paste the phone numbers to that new document.
- Save the new document with a name.
Now these five steps can be done with ack and redirection in one line. The basic ack usage looks like this:
ack "pattern" somefile
This gives you a preview of the search in the command line. However, to store the search results in a new file, you will use redirection.
ack "pattern" somefile > resultfile
Exercise 1: Jeb's emails
Get the phone numbers.
ack "your pattern" jeb-bush-telephone-number-challenge.txt > number-only.txt
Refine: how to get the numbers without the sentences around them?
Use the -o flag to print only the part of each line that matches, instead of the entire line.
ack -o "your pattern" ex1/jeb-bush-telephone-number-challenge.txt > ex1/number-only.txt
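Here is a sketch of what -o plus redirection does, run on a made-up file (sample-emails.txt and the phone numbers in it are invented, and the pattern is only one possible answer). Assumption: the fallback to grep -oE is there only so the sketch runs on machines without ack; it is not part of the workshop setup.

```shell
# Made-up sample text (hypothetical file name and phone numbers).
printf 'Call me at 555-123-4567 tomorrow.\nNo number here.\nFax: 555-987-6543\n' > sample-emails.txt

# Assumption: fall back to grep -oE if ack is not installed;
# this simple pattern means the same thing in both tools.
if command -v ack >/dev/null 2>&1; then
  ack -o '[0-9]{3}-[0-9]{3}-[0-9]{4}' sample-emails.txt > sample-numbers-only.txt
else
  grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}' sample-emails.txt > sample-numbers-only.txt
fi

# The result holds the two matched numbers, not the sentences around them.
cat sample-numbers-only.txt
```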
Note: if you redirect output to a file whose name already exists, you will overwrite the existing file. Name things with care and/or make copies of files that are important to you.
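The overwrite is silent, which a short sketch makes visible (the file names here are hypothetical). Copying a file with cp before reusing its name is one simple way to keep the earlier result.

```shell
# First result goes into a new file.
printf 'first result\n' > sample-result.txt

# Keep a copy before reusing the name.
cp sample-result.txt sample-result-backup.txt

# ">" replaces the old contents entirely, with no warning.
printf 'second result\n' > sample-result.txt

cat sample-result.txt          # prints: second result
cat sample-result-backup.txt   # prints: first result
```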
Investigate more: how to get unique numbers?
ack -o "your pattern" ex1/jeb-bush-telephone-number-challenge.txt | sort | uniq -c > ex1/numbers-only-uniq.txt
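One detail matters here: uniq only merges duplicate lines that sit next to each other, so the matches need to be sorted before they are counted. A sketch on made-up numbers (the file name is hypothetical) shows the difference.

```shell
# Made-up list in which the repeated number is not on adjacent lines.
printf '555-123-4567\n555-987-6543\n555-123-4567\n' > sample-numbers-list.txt

# Without sort, the two 555-123-4567 lines are not adjacent,
# so uniq reports three separate entries.
uniq -c sample-numbers-list.txt

# With sort first, duplicates sit next to each other and are counted together.
sort sample-numbers-list.txt | uniq -c
```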
Exercise 2: Flint emails
We installed something called pdftotext to, well, turn pdf files into text files.
Try it with the pdf in
pdftotext input.pdf output.txt
ack -o '^From:+(.+)Date' ex2/snyder-flint-water-emails-demo.txt > ex2/flint-email-recipients.txt
Exercise 3: Hillary emails
We are going to investigate a batch of Hillary Clinton's emails.
As you see, we are dealing with 293 individual files instead of one single file.
ack -h -m 1 -A 3 '^From: +(.+)' ex3/
Notice that for this exercise we are searching a whole directory rather than a single file. Here, ack is no longer an alternative, but the only tool suitable for the job (since there are many files with a large total size).
- The -A NUM flag specifies the number of lines to print after each matching line.
- The -m NUM flag stops searching in each file after NUM matches.
- What do you think the -h flag does? Try the command without it.
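To experiment with these flags without touching the real batch, here is a sketch on two made-up messages (the sample-ex3/ directory, file names and addresses are all hypothetical). Assumption: the fallback to grep is only so the sketch runs where ack is not installed; grep accepts the same -h, -m and -A flags but needs -r to search a directory.

```shell
# Made-up messages in a hypothetical directory.
mkdir -p sample-ex3
printf 'From: alice@example.com\nTo: dana@example.com\nSubject: Hi\nDate: Monday\n\nbody\nFrom: bob@example.com\n' > sample-ex3/mail1.txt
printf 'From: carol@example.com\nTo: erin@example.com\nSubject: Re\nDate: Tuesday\n' > sample-ex3/mail2.txt

# -m 1 keeps only the first "From:" line per file, so the later
# bob@example.com line in mail1.txt is never reported.
if command -v ack >/dev/null 2>&1; then
  ack -h -m 1 -A 3 '^From: +(.+)' sample-ex3/ > sample-senders.txt
else
  grep -rhE -m 1 -A 3 '^From: +(.+)' sample-ex3/ > sample-senders.txt
fi

cat sample-senders.txt
```

Changing -m 1 to -m 2, or dropping -h, and rerunning is a quick way to see what each flag contributes.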
Find SECRET or CONFIDENTIAL files
Use the -l flag to print only the names of the files that contain a match, instead of the matching text.
ack -l '\b(SECRET|CONFIDENTIAL)\b' ex3/
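The \b word boundaries keep the pattern from matching longer words such as SECRETARY. A sketch on made-up files (the sample-flags/ directory and its contents are hypothetical) shows which names get listed. Assumption: where ack is missing, the sketch falls back to grep with -w, which also restricts matches to whole words.

```shell
# Made-up files; only a.txt and c.txt carry a marking.
mkdir -p sample-flags
printf 'Classification: SECRET\nbody\n' > sample-flags/a.txt
printf 'The SECRETARY called\n' > sample-flags/b.txt
printf 'CONFIDENTIAL memo\n' > sample-flags/c.txt

# Assumption: grep -rlwE stands in for ack -l when ack is not installed.
if command -v ack >/dev/null 2>&1; then
  ack -l '\b(SECRET|CONFIDENTIAL)\b' sample-flags/ > sample-flagged-files.txt
else
  grep -rlwE '(SECRET|CONFIDENTIAL)' sample-flags/ > sample-flagged-files.txt
fi

# Lists a.txt and c.txt; b.txt is skipped because SECRETARY is a different word.
cat sample-flagged-files.txt
```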
Sort Hillary's email recipients by frequency
Building on the previous exercise:
ack -h -m 1 -A 3 '^From: +(.+)' ex3/ > ex3/recipients.txt
Here's the challenge:
Now find all the email addresses in the new file, sort them, count how often each address appears, and sort the recipients by frequency.
Use the pipe
| in your command!
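As a hint rather than a full answer, the counting and ranking part can be sketched on a made-up list of addresses (sample-addresses.txt is a hypothetical file that already holds one address per line). Extracting the addresses from recipients.txt is left to you.

```shell
# Made-up addresses, one per line.
printf 'alice@example.com\nbob@example.com\nalice@example.com\ncarol@example.com\nalice@example.com\nbob@example.com\n' > sample-addresses.txt

# sort groups identical lines, uniq -c counts each group,
# and sort -rn ranks the counts from highest to lowest.
sort sample-addresses.txt | uniq -c | sort -rn > sample-ranked.txt

cat sample-ranked.txt
```

The second sort needs -n so the counts compare as numbers, and -r so the most frequent address comes first.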
Submit your command here: http://goo.gl/forms/hKgssuslfE
Regex is widely used for form validation. You don't really have to write it from scratch. Two ways to validate: