## Character Sets


For the next couple of examples we’re going to need some text data beyond the names of the states. Let’s just create a short text file from the console:

<pre>
touch small.txt
echo "abcdefghijklmnopqrstuvwxyz" >> small.txt
echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" >> small.txt
echo "0123456789" >> small.txt
echo "aa bb cc" >> small.txt
echo "rhythms" >> small.txt
echo "xyz" >> small.txt
echo "abc" >> small.txt
echo "tragedy + time = humor" >> small.txt
echo "http://www.jhsph.edu/" >> small.txt
echo "#%&-=***=-&%#" >> small.txt
</pre>

In addition to quantifiers, there are also regular expressions for describing sets of characters:

- `\w`: all “word” characters
- `\d`: all “digit” characters
- `\s`: all “space” characters

<pre>
egrep "\w" small.txt
# abcdefghijklmnopqrstuvwxyz
# ABCDEFGHIJKLMNOPQRSTUVWXYZ
# 0123456789
# aa bb cc
# rhythms
# xyz
# abc
# tragedy + time = humor
# http://www.jhsph.edu/
</pre>

> The `\w` metacharacter matches all letters, digits, and even the underscore character (`_`).
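To see the underscore case in isolation, here is a minimal sketch that feeds two one-character lines to grep. It uses `grep -E`, the equivalent long form of `egrep`, and assumes a grep that understands `\w`, as the examples in this section do. The variable name is only for illustration.

```shell
# Two one-character lines: an underscore and a hyphen.
# Only the underscore counts as a "word" character.
underscore_match=$(printf '%s\n' "_" "-" | grep -E "\w")
echo "$underscore_match"
# _
```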

<pre>
egrep "\d" small.txt
# 0123456789
</pre>

<pre>
egrep "\s" small.txt
# aa bb cc
# tragedy + time = humor
</pre>
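Not every grep understands these shorthands. GNU grep, for example, does not treat `\d` as a digit class in extended mode, so the `\d` command above may print different lines (or a warning) there. A more portable sketch uses the POSIX bracket classes `[[:digit:]]` and `[[:space:]]`, which are not part of the original examples. The scratch file and variable names below are made up for illustration, and `grep -E` is the equivalent long form of `egrep`.

```shell
# Recreate a few lines of small.txt in a scratch file (hypothetical name)
scratch=$(mktemp)
printf '%s\n' "abcdefghijklmnopqrstuvwxyz" "0123456789" "aa bb cc" "#%&-=***=-&%#" > "$scratch"

# [[:digit:]] stands in for \d, [[:space:]] stands in for \s
digit_lines=$(grep -E "[[:digit:]]" "$scratch")
space_lines=$(grep -E "[[:space:]]" "$scratch")

echo "$digit_lines"
# 0123456789
echo "$space_lines"
# aa bb cc

rm -f "$scratch"
```

If `\d` behaves unexpectedly on your system, swapping in `[[:digit:]]` or `[0-9]` should give the results shown in this section.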

The `-v` flag (which stands for invert match) makes grep return all of the lines not matched by the regular expression. Adding `-v` to the `\w` command from above returns the complement of its result:

<pre>
egrep -v "\w" small.txt
# #%&-=***=-&%#
</pre>

Note that the character sets for regular expressions also have their inverse sets: 

- `\W` for non-words
- `\D` for non-digits
- `\S` for non-spaces

<pre>
egrep "\W" small.txt

## aa bb cc
## tragedy + time = humor
## http://www.jhsph.edu/
## #%&-=***=-&%#
</pre>

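The returned lines all contain at least one non-word character. Note the difference between the invert flag `-v` and **an inverse set regular expression**: `egrep -v "\w"` returns only lines with no word characters at all, while `egrep "\W"` returns every line that has at least one non-word character, even if it also contains word characters.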
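The contrast can be checked side by side. This is a small sketch on a four-line scratch file (a hypothetical stand-in for small.txt); it uses `grep -E` in place of `egrep` and the `-c` flag, not covered above, which prints the number of matching lines instead of the lines themselves.

```shell
scratch=$(mktemp)
printf '%s\n' "abc" "aa bb cc" "http://www.jhsph.edu/" "#%&-=***=-&%#" > "$scratch"

# -v "\w": lines containing no word characters at all
no_word_chars=$(grep -E -v "\w" "$scratch")
# "\W" with -c: count of lines containing at least one non-word character
some_non_word=$(grep -E -c "\W" "$scratch")

echo "$no_word_chars"
# #%&-=***=-&%#
echo "$some_non_word"
# 3

rm -f "$scratch"
```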

In addition to general character sets we can also create specific character sets by listing the characters we wish to match inside square brackets (`[ ]`). For example, the regular expression for the set of lowercase vowels is `[aeiou]`. You can also create a regular expression for the complement of a set by putting a caret (`^`) at the beginning of the set. For example, the regular expression `[^aeiou]` matches any character that is not a lowercase vowel. Let’s test both on small.txt:

<pre>
egrep "[aeiou]" small.txt

## abcdefghijklmnopqrstuvwxyz
## aa bb cc
## abc
## tragedy + time = humor
## http://www.jhsph.edu/
</pre>

<pre>
egrep "[^aeiou]" small.txt

## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## 0123456789
## aa bb cc
## rhythms
## xyz
## abc
## tragedy + time = humor
## http://www.jhsph.edu/
## #%&-=***=-&%#
</pre>

Every line in the file is printed, because every line contains at least one non-vowel! 
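If the goal is instead to find lines that contain no vowels at all, one option is to combine the `[aeiou]` set with the `-v` flag from earlier. The sketch below assumes a small scratch file with made-up variable names and uses `grep -E` in place of `egrep`.

```shell
scratch=$(mktemp)
printf '%s\n' "abc" "rhythms" "aa bb cc" "xyz" > "$scratch"

# Keep only the lines that contain no lowercase vowel anywhere
vowel_free=$(grep -E -v "[aeiou]" "$scratch")
echo "$vowel_free"
# rhythms
# xyz

rm -f "$scratch"
```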

If you want to specify a range of characters you can use a hyphen (`-`) inside the square brackets. For example, the regular expression `[e-q]` matches all of the lowercase letters from “e” through “q”, inclusive. Case matters when you’re specifying character sets, so if you wanted to match only uppercase characters you’d need to use `[E-Q]`. To ignore the case of your match you could combine both ranges in one set with the `[e-qE-Q]` regex (short for regular expression), or you could use the `-i` flag with grep to ignore case. Note that the `-i` flag works with any regular expression, not just character sets. Let’s take a look at some examples using the regular expressions that we just described:

<pre>
egrep "[e-q]" small.txt

## abcdefghijklmnopqrstuvwxyz
## rhythms
## tragedy + time = humor
## http://www.jhsph.edu/

egrep "[E-Q]" small.txt

## ABCDEFGHIJKLMNOPQRSTUVWXYZ

egrep "[e-qE-Q]" small.txt

## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## rhythms
## tragedy + time = humor
## http://www.jhsph.edu/

egrep -i "[E-Q]" small.txt

## abcdefghijklmnopqrstuvwxyz
## ABCDEFGHIJKLMNOPQRSTUVWXYZ
## rhythms
## tragedy + time = humor
## http://www.jhsph.edu/
</pre>
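It may not be obvious why a line such as “rhythms” matches `[e-q]`. As a rough sketch, the `-o` flag (not used above) makes grep print each match on its own line rather than the whole line, which shows exactly which characters fall inside the range. The variable name is only for illustration, and `grep -E` stands in for `egrep`.

```shell
# -o prints every individual match instead of the full matching line
range_hits=$(echo "rhythms" | grep -E -o "[e-q]")
echo "$range_hits"
# h
# h
# m
```

Only “h” and “m” fall between “e” and “q”, and a single matching character is enough for grep to return the whole line.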