The cell below shows one way to extract a customer ID from a string like the ones you might find in a homework assignment.  I would call this a "brute force" method because it relies on direct string manipulation.

In [63]:
# create a test string
test_str="<customer_feature customer_id=01234 feature_id=0>-0.57</customer_feature>"

# find the index of the '='
index = test_str.find("=")

# increment the index one past the equals
# we are now pointing at the first digit in the customer ID
index = index + 1

#create an empty list to store the digits
cust_id_list = []

# while index is inside the string and the character at index is a digit
while index < len(test_str) and test_str[index].isdigit():
    # append the digit to the customer ID list
    cust_id_list.append(test_str[index])

    # advance to the next digit in the number,
    # or to the character after the last digit
    index = index + 1

# convert the list of characters to a string
str_cust_number = ''.join(cust_id_list)

# convert the string to an integer
int_cust_number = int(str_cust_number)

print("int_cust_number:", int_cust_number)

int_cust_number: 1234


## The history of grep

A more elegant way to extract data from input text is to use regular expressions.  According to Wikipedia, a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) is a sequence of characters that defines a search pattern.  The Unix grep program was one of the first widely used programs built around regular expressions.  In his book Mastering Regular Expressions, Jeffrey Friedl traces grep's origins to an early text editor called "ed".  The ed editor had a "global regular expression print" command that searched through the lines of a text file using a regular expression and printed the lines that matched.  That command proved so useful and popular that it was ported from ed into a standalone Unix program called grep.  The ed editor is rarely used today, but grep lives on.

## Introduction To grep

The following cells show some simple examples of how to use regular expressions in Python.  Python's regular expression syntax is arguably easier to learn than that of Unix grep.  Every program that implements regular expressions has its own syntax variations, but the dialects have a lot in common.  If you understand Python regular expressions, you will be able to follow Unix grep patterns, although some details of the syntax differ between the two.

Here is a [regex tutorial](https://www.guru99.com/python-regular-expressions-complete-tutorial.html), and another [regex tutorial](https://www.tutorialspoint.com/python/python_reg_expressions.htm).  

In [48]:
# import the regular expression package
import re

# create a list containing strings
string_list = ["A string with the word ist718", "Another string with the word ist718", "A string without our class name"]

# Find all strings in string list that contain ist718
for line in string_list:
    match = re.search("ist718", line)
    if match:
        print("line:", line)

line: A string with the word ist718
line: Another string with the word ist718
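
The value returned by `re.search` is a match object, not just a true/false flag. As a minimal sketch (the variable names `matched_text`, `start`, and `end` are only for illustration), the cell below shows how to read the matched text and its position from that object.

```python
import re

line = "A string with the word ist718"
match = re.search("ist718", line)

if match:
    # group(0) is the text that matched
    matched_text = match.group(0)
    # start() and end() give the position of the match inside line
    start, end = match.start(), match.end()
    print("matched:", matched_text)
    print("slice:  ", line[start:end])
```

Slicing the original string with `start` and `end` gives back the same text as `match.group(0)`.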


The following cells introduce the notion of metacharacters.  Metacharacters are used to perform more flexible matches.  The next cell introduces the dot metacharacter.  The dot metacharacter matches any one character.

In [64]:
# import the regular expression package
import re

# create a list containing strings with our class name.  
string_list = ["A string with the word ist718", "A string with our class name isq718 mispelled", 
               "Another string with our class name is718 mispelled", "A string without our class name"]

# Find all strings in string list that contain 'is', any one character, then '718'
for line in string_list:
    # The '.' means match any one character
    match = re.search("is.718", line)
    if match:
        print("line:", line)

line: A string with the word ist718
line: A string with our class name isq718 mispelled
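
Two details about the dot are easy to miss. The sketch below, which uses made-up test strings, suggests that by default the dot does not match a newline unless the `re.DOTALL` flag is passed, and that a backslash turns the dot into a literal period.

```python
import re

# the dot matches any single character, including punctuation
dash_match = re.search(r"is.718", "is-718")

# by default the dot does not match a newline character
newline_match = re.search(r"is.718", "is\n718")

# with re.DOTALL the dot matches a newline as well
dotall_match = re.search(r"is.718", "is\n718", re.DOTALL)

# an escaped dot only matches a literal period
literal_match = re.search(r"is\.718", "ist718")

print(dash_match is not None)     # True
print(newline_match is not None)  # False
print(dotall_match is not None)   # True
print(literal_match is not None)  # False
```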


The cell below introduces the star operator.  The star operator matches zero or more repetitions of the preceding item.  For example, dot followed by star means match any character any number of times, including 0 times.

In [50]:
# import the regular expression package
import re

# create a list containing strings with our class name.  
string_list = ["A string with the word istpr718", "A string with our class name isqsplmR1_718 mispelled", 
               "Another string with our class name is718 mispelled", "A string without our class name"]

# Find all strings in string list that contain 'is', any number of characters, then '718'
for line in string_list:
    # The '.*' means match any character zero or more times
    match = re.search("is.*718", line)
    if match:
        print("line:", line)

line: A string with the word istpr718
line: A string with our class name isqsplmR1_718 mispelled
line: Another string with our class name is718 mispelled
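
The star is greedy: it grabs as many characters as it can while still letting the rest of the pattern match. The sketch below uses a made-up string with two candidate endings and compares `.*` with the non-greedy form `.*?`, which is not used elsewhere in this notebook.

```python
import re

text = "is_a_718 and is_b_718"

# greedy: '.*' extends to the last '718' in the string
greedy = re.search(r"is.*718", text)

# non-greedy: '.*?' stops at the first '718' it can reach
lazy = re.search(r"is.*?718", text)

print("greedy:", greedy.group(0))
print("lazy:  ", lazy.group(0))
```

Printing `match.group(0)` like this is a quick way to see exactly which part of a line a pattern consumed.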


The following cell introduces the question mark metacharacter which is used to match zero or one occurrence of a character.

In [52]:
# import the regular expression package
import re

# create a list containing strings with our class name.  
string_list = ["A string with the word is718", "A string with our class name ist718 mispelled", 
               "Another string with our class name istt718 mispelled", "A string without our class name"]

# Find all strings in string list that contain 'is', an optional 't', then '718'
for line in string_list:
    match = re.search("ist?718", line)
    if match:
        print("line:", line)

line: A string with the word is718
line: A string with our class name ist718 mispelled
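
A common use of the question mark is to accept two spellings of the same word. The following sketch uses a made-up word list, not data from the homework, to show one optional letter.

```python
import re

words = ["color", "colour", "colouur"]

# 'u?' makes the letter u optional: zero or one occurrence
matched_words = [word for word in words if re.search(r"colou?r", word)]

print(matched_words)
```

Only the first two words match, because the third has two u characters between the o and the r.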


The cell below shows how to detect numbers using the \d metacharacter, which matches any single digit.

In [65]:
# import the regular expression package
import re

# create a list containing strings with our class name.  
string_list = ["A string with the word ist718", "A string with our class name ist818 mispelled", 
               "Another string with our class name ist918 mispelled", "A string without our class name"]

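# Find all strings in string list that contain ist followed by a digit
for line in string_list:
    # the r prefix makes a raw string, so the backslash in \d reaches the regex engine unchanged
    match = re.search(r"ist\d", line)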
    if match:
        print("line:", line)

line: A string with the word ist718
line: A string with our class name ist818 mispelled
line: Another string with our class name ist918 mispelled


In Python, curly braces specify how many times the preceding item should be matched.  The first number in the curly braces is the minimum number of repetitions, and the second number is the maximum.  For example, {1,5} specifies a minimum of one and a maximum of 5 repetitions.  If the minimum is omitted, it defaults to 0, and if the maximum is omitted, there is no upper limit.  The cell below shows an example with digits.  You might be surprised that the expression below matches the 'ist718' line, because we specify a maximum of 2 digits.  However, `re.search` only needs to find the pattern somewhere in the line, and 'ist71' is found inside 'ist718'.  This is typical of the kind of unexpected results you might get and illustrates why experimentation is important.

In [54]:
# import the regular expression package
import re

# create a list containing strings with our class name.  
string_list = ["A string with the word ist", "A string with our class name ist7 mispelled", 
               "Another string with our class name ist71 mispelled", "A string with ist718"]

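# Find all strings in string list that contain ist followed by one or two digits
for line in string_list:
    # the r prefix makes a raw string, so the backslash in \d reaches the regex engine unchanged
    match = re.search(r"ist\d{1,2}", line)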
    if match:
        print("line:", line)

line: A string with our class name ist7 mispelled
line: Another string with our class name ist71 mispelled
line: A string with ist718
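
To see why the 'ist718' line matched, it helps to print the text that was actually matched. The sketch below also shows one possible way to reject the partial match: the `\b` word-boundary metacharacter, which is not introduced elsewhere in this notebook.

```python
import re

line = "A string with ist718"

# the search succeeds because 'ist71' is found inside 'ist718'
partial = re.search(r"ist\d{1,2}", line)
print("matched text:", partial.group(0))

# \b requires a word boundary after the digits, so 'ist718' no longer matches
bounded = re.search(r"ist\d{1,2}\b", line)
print("with boundary:", bounded)

# a line where the digits really do end after two characters still matches
two_digits = re.search(r"ist\d{1,2}\b", "Another string with our class name ist71 mispelled")
print("two digits:", two_digits.group(0))
```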


The plus metacharacter matches one or more repetitions of the preceding item.  In the example below, ist followed by one or more digits is matched.

In [55]:
# import the regular expression package
import re

# create a list containing strings with our class name.
string_list = ["A string with the word ist", "A string with our class name ist7 mispelled", 
               "Another string with our class name ist71 mispelled", "A string with ist718"]

# Find all strings in string list that contain ist followed by one or more digits
for line in string_list:
    # the r prefix makes a raw string, so the backslash in \d reaches the regex engine unchanged
    match = re.search(r"ist\d+", line)
    if match:
        print("line:", line)

line: A string with our class name ist7 mispelled
line: Another string with our class name ist71 mispelled
line: A string with ist718
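
The star, plus, question mark, and curly brace forms are all repetition counts applied to the preceding item. As a rough side-by-side comparison, the sketch below tests each one with `re.fullmatch`, which requires the whole string to match. That function and the `samples` and `patterns` names are choices made for this illustration.

```python
import re

samples = ["ist", "ist7", "ist718"]

# each pattern is 'ist' followed by a different repetition count on \d
patterns = {
    "*": r"ist\d*",        # zero or more digits
    "+": r"ist\d+",        # one or more digits
    "?": r"ist\d?",        # zero or one digit
    "{1,}": r"ist\d{1,}",  # one or more digits, same meaning as +
}

results = {}
for name, pattern in patterns.items():
    # fullmatch succeeds only if the entire sample matches the pattern
    results[name] = [s for s in samples if re.fullmatch(pattern, s)]
    print(name, results[name])
```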


Square brackets define a character set: a set or range of characters, such as letters or digits, any one of which may match at that position.

In [56]:
# import the regular expression package
import re

# create three test strings
test_string_1 = "1234a56789"
test_string_2 = "1234B56789"
test_string_3 = "123456789"

match = re.search("[a-zA-Z]", test_string_1)
if match:
    print("found lower case letter")
    
match = re.search("[a-zA-Z]", test_string_2)
if match:
    print("found upper case letter")
    
match = re.search("[a-zA-Z]", test_string_3)
if match:
    print("found a match")
else:
    print("no match found")

found lower case letter
found upper case letter
no match found
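
Square brackets can also be negated with a leading caret, and they combine with the repetition operators shown earlier. The sketch below reuses one of the test strings above; the variable names are only illustrative.

```python
import re

test_string = "1234a56789"

# [^0-9] matches any one character that is NOT a digit
non_digit = re.search(r"[^0-9]", test_string)
print("first non-digit:", non_digit.group(0), "at index", non_digit.start())

# [0-9]+ matches a run of one or more digits; findall returns every run
digit_runs = re.findall(r"[0-9]+", test_string)
print("digit runs:", digit_runs)
```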


The cell below shows how to capture matched items in groups.  A group is designated with a pair of parentheses.  Text matched inside the parentheses is saved and can be extracted from the match object later.  The groups are numbered in the order of their opening parentheses, starting at 1; group 0 is the entire match.

In [57]:
# import the regular expression package
import re

# create a test string containing three tagged numbers

test_string = "tag1 25 tag2 2298 tag3 22"

match = re.search(r"tag1 (\d+) tag2 (\d+) tag3 (\d+)", test_string)

if match:
    print("match.groups():", match.groups())
    print("match.group(0):", match.group(0))
    print("match.group(1):", match.group(1))
    print("match.group(2):", match.group(2))
    print("match.group(3):", match.group(3))
else:
    print("No match")

match.groups(): ('25', '2298', '22')
match.group(0): tag1 25 tag2 2298 tag3 22
match.group(1): 25
match.group(2): 2298
match.group(3): 22
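
When a pattern has several groups, numbering them by position can get hard to follow. Python also supports named groups with the `(?P<name>...)` syntax. The sketch below applies it to the same test string; the group names `first`, `second`, and `third` are made up for this example.

```python
import re

test_string = "tag1 25 tag2 2298 tag3 22"

# (?P<name>...) captures like ordinary parentheses but also gives the group a name
pattern = r"tag1 (?P<first>\d+) tag2 (?P<second>\d+) tag3 (?P<third>\d+)"
match = re.search(pattern, test_string)

if match:
    # a named group can be read by name or by its number
    print("second by name:  ", match.group("second"))
    print("second by number:", match.group(2))

    # groupdict() returns all named groups; convert the captured text to integers
    values = {name: int(text) for name, text in match.groupdict().items()}
    print("values:", values)
```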


The following cell shows how to extract the customer ID out of a string from your homework using regular expressions.  Can you add code to match and access the '-0.57'?

In [61]:
# import the regular expression package
import re

# create a test string
test_str="<customer_feature customer_id=1 feature_id=1030010>-0.57</customer_feature>"

# create a regex match pattern for the customer ID, the feature ID, and the feature value
# The dot star '.*' means match any number of characters
# The '.*customer_id=' means match any number of characters followed by customer_id=
# In '(\d{1,})', the \d means match a digit.  The {1,} following \d means match one or more digits.
# The enclosing parentheses in '(\d{1,})' store the match in a group.  Since this is the first
# set of parentheses, this will be group 1.
# The third group, '(-?\d{0,}\.\d{1,})', is one possible answer to the exercise above: it captures
# an optional minus sign, zero or more digits, a literal period, and one or more digits.
# The r prefix makes a raw string, so the backslashes reach the regex engine unchanged.
match_str = r".*customer_id=(\d{1,}).*feature_id=(\d{1,})>(-?\d{0,}\.\d{1,})"
match = re.search(match_str, test_str)

# if the match was successful
if match:
    # the entire match is always group 0
    print("group 0:\n", match.group(0))

    # the match stored in the first set of parentheses
    print("group 1:\n", match.group(1))

    # the match stored in the 2nd set of parentheses
    print("group 2:\n", match.group(2))

    # the match stored in the 3rd set of parentheses (the feature value)
    print("group 3:\n", match.group(3))
else:
    print("No match found")

group 0:
 <customer_feature customer_id=1 feature_id=1030010>-0.57
group 1:
 1
group 2:
 1030010
group 3:
 -0.57
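
The captured groups are strings, so a typical next step is converting them to numbers. The sketch below wraps the idea in a helper function. The name `parse_customer_feature`, the constant `FEATURE_PATTERN`, and the exact pattern are assumptions made for this illustration and may need adjusting for the real homework data.

```python
import re

# Hypothetical pattern for lines shaped like the test string above:
# customer_id=<digits>, whitespace, feature_id=<digits>, '>', a float, then '<'
FEATURE_PATTERN = re.compile(r"customer_id=(\d+)\s+feature_id=(\d+)>(-?\d*\.\d+)<")


def parse_customer_feature(line):
    """Return (customer_id, feature_id, value), or None if the line does not match."""
    match = FEATURE_PATTERN.search(line)
    if match is None:
        return None
    customer_id, feature_id, value = match.groups()
    # the groups are text, so convert them to numeric types
    return int(customer_id), int(feature_id), float(value)


test_str = "<customer_feature customer_id=1 feature_id=1030010>-0.57</customer_feature>"
parsed = parse_customer_feature(test_str)
print(parsed)
```

Returning `None` for lines that do not match lets the caller skip malformed input instead of crashing on `match.group(...)`.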


Up until now we have been using the `re.search` function, which matches anywhere in the string being searched.  The cell below shows how to extract customer feature numbers using the `re.findall` function, which returns a list of all substrings that match the regular expression.  Note that there is also a `re.match` function, which only matches at the very beginning of the string.

In [61]:
# import the regular expression package
import re

# create a test string - same test string as above
test_str="<customer_feature customer_id=1 feature_id=1030010>-0.57</customer_feature>"

# the following expression uses the OR operator which is the pipe character ('|').
# It searches for the word customer OR a floating point number
# Note that the period '.' is a special operator in regular expressions and matches ANY character.
# So, to look for a period, we search for '\.'.  The backslash in front of the period tells findall
# to look for a literal period instead of interpreting it as a special operator.
# findall returns every match found: 'customer' occurs three times in the test string, so it is
# returned three times.  Only the first three matches are printed below.
# The string '-?\d{0,}\.\d{1,}' says look for an optional negative sign (-?), followed by zero or more
# digits (\d{0,}), followed by a period ('\.'), followed by one or more digits.
# Can you modify the regex below to look for integers and floats?
numbers = re.findall(r'customer|-?\d{0,}\.\d{1,}', test_str)

# findall returns a list, so check that it holds the three matches we print
if len(numbers) >= 3:
    print("numbers[0]:", numbers[0])
    print("numbers[1]:", numbers[1])
    print("numbers[2]:", numbers[2])
else:
    print("Fewer than three matches found")

numbers[0]: customer
numbers[1]: customer
numbers[2]: -0.57
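
The difference between `re.match`, `re.search`, and `re.findall` is easiest to see on the same string. The sketch below also includes one possible pattern for integers and floats. The `number_pattern` shown here is only an illustration; it will not match a float written without a leading digit, such as `.57`.

```python
import re

test_str = "<customer_feature customer_id=1 feature_id=1030010>-0.57</customer_feature>"

# re.match only succeeds at the start of the string, which begins with '<'
at_start = re.match(r"customer", test_str)
print("re.match: ", at_start)

# re.search scans forward and returns the first occurrence
anywhere = re.search(r"customer", test_str)
print("re.search:", anywhere.group(0), "at index", anywhere.start())

# One possible integer-or-float pattern: optional sign, digits, optional fraction.
# (?:...) groups without capturing, so findall returns each whole number.
number_pattern = r"-?\d+(?:\.\d+)?"
all_numbers = re.findall(number_pattern, test_str)
print("re.findall:", all_numbers)
```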


Here is a useful [stackoverflow post](https://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string) on extracting floats and ints from data.