<a href="https://colab.research.google.com/github/keskinus/Data-Analysis-/blob/main/String_operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this Colab notebook we look at string manipulation using Python's built-in string methods and regular expressions.

## String split

In [None]:
import pandas as pd

### Split in python

In [None]:
s = "This is a very long sentence that we will break up. This one has a    lot   of  spaces  .    "

In [None]:
# split() with no argument splits on any run of whitespace and drops empty strings.
s.split()

['This',
 'is',
 'a',
 'very',
 'long',
 'sentence',
 'that',
 'we',
 'will',
 'break',
 'up.',
 'This',
 'one',
 'has',
 'a',
 'lot',
 'of',
 'spaces',
 '.']
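A minimal sketch of the difference between `split()` with no argument and `split(" ")` with an explicit space. The sample string `text` is made up for illustration.

```python
# Hypothetical sample with runs of spaces and a trailing space.
text = "a  b   c "

# No argument: split on runs of whitespace, no empty strings in the result.
no_arg = text.split()
print(no_arg)     # ['a', 'b', 'c']

# Explicit " ": split on every single space, so empty strings appear.
on_space = text.split(" ")
print(on_space)   # ['a', '', 'b', '', '', 'c', '']
```

This is why the output above has no empty entries even though the sentence contains long runs of spaces.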

In [None]:
# You can split with any character.
s.split(".")

['This is a very long sentence that we will break up',
 ' This one has a    lot   of  spaces  ',
 '    ']

In [None]:
# notice that the character used to split is not in the result.
s.split("a")

['This is ',
 ' very long sentence th',
 't we will bre',
 'k up. This one h',
 's ',
 '    lot   of  sp',
 'ces  .    ']

### String split in Pandas

In [None]:
s = pd.Series(["Three word sentence",
               "Another such sentence",
               "Third sentence too"])

In [None]:
s.str.split()

0      [Three, word, sentence]
1    [Another, such, sentence]
2       [Third, sentence, too]
dtype: object

In [None]:
split_s = s.str.split()
print(type(split_s))


<class 'pandas.core.series.Series'>


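How do we retrieve only the last element of each list?

`split_s[1]` returns the element of the series at index 1, which is the whole list for that row. It does not "broadcast" the `[1]` to the elements of each list. To index into every list, go through the `.str` accessor, as the later cells do.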

In [None]:
split_s[1]

['Another', 'such', 'sentence']

In [None]:
print(type(split_s[1]))

<class 'list'>


In [None]:
split_s

0      [Three, word, sentence]
1    [Another, such, sentence]
2       [Third, sentence, too]
dtype: object

In [None]:
split_s.str[-2]

0        word
1        such
2    sentence
dtype: object

In [None]:
split_s.str.get(-1)

0    sentence
1    sentence
2         too
dtype: object

In [None]:
split_s.str[-1]

0    sentence
1    sentence
2         too
dtype: object

**Exercise**

Split a series of date values in DD/MM/YYYY format into day, month and year.  Retrieve only the year.

In [None]:
s1 = pd.Series(['01/01/2001', '03/12/2020', '10/10/2012', '11/9/2017'])

In [None]:
date = s1.str.split('/')
date.str.get(-1)




0    2001
1    2020
2    2012
3    2017
dtype: object

**Exercise**

In the `EVENTDT` column in the Berkeley crime dataset, what are the time values?  Can we remove them and get only the date component?


## Join strings


In [None]:
".".join(['a', 'b', 'c'])

'a.b.c'

In [None]:
"  ".join(['a', 'b', 'c'])

'a  b  c'
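Splitting and joining are often combined. A small sketch, assuming the goal is to collapse repeated spaces into one; the variable names `messy` and `clean` are illustrative.

```python
# Part of the long sentence used earlier, with irregular spacing.
messy = "a    lot   of  spaces  .    "

# split() drops all runs of whitespace; join() puts a single space back.
clean = " ".join(messy.split())
print(repr(clean))   # 'a lot of spaces .'
```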

# Regular expressions

Let us look at one entry for block address in the Berkeley dataset. 
```
2100 SHATTUCK AVE
Berkeley, CA
(37.871167, -122.268285)
```

- We want to extract the latitude and longitude.
- We want to discard "Berkeley, CA".

A simple `s.split()` is not good enough:
- It leaves the opening parenthesis and the comma attached to the latitude.
- It leaves the closing parenthesis attached to the longitude.
- We do not know how many words precede "Berkeley, CA".

In [None]:
s = '2100 SHATTUCK AVE\nBerkeley, CA\n(37.871167, -122.268285)'
print(s)

2100 SHATTUCK AVE
Berkeley, CA
(37.871167, -122.268285)


In [None]:
s.split()

['2100', 'SHATTUCK', 'AVE', 'Berkeley,', 'CA', '(37.871167,', '-122.268285)']

### Why Regular Expressions?

Suppose we have a bunch of files
```
['Intro to Data Science.ipynb', \
'Data Science syllabus.pdf', \
'data science example 3.ipynb', \
'Data science class 4.ipynb', \
'statistics.ipynb', \
'stats.pdf']
```

We want to get only the "data science" Python notebooks.  If the data were "clean", every occurrence would be written the same way and a simple string match would find them all. Here, however, the name appears in several forms: uppercase or lowercase D, uppercase or lowercase S, and possibly with or without a space in between.  Listing every combination by hand quickly becomes unmanageable.  A regular expression describes all of them concisely.

Pictorial representation:
https://www.python-course.eu/re.php

Basic patterns:
- Exact match `abc`
- Choice representation `[Aa]bc`
- Match any single character `a.c`
- Match one or more of any character `a.+c`
- Match zero or more of any character `a.*c`
- Any combination of the above `[aA].*c+`
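As a rough sketch, the basic patterns above can be combined to pick the "data science" notebooks out of the file list. It uses `re.search`, which is covered later in this notebook. The pattern is one possible choice, not the only one, and the names `files`, `pattern` and `notebooks` are illustrative.

```python
import re

files = ['Intro to Data Science.ipynb',
         'Data Science syllabus.pdf',
         'data science example 3.ipynb',
         'Data science class 4.ipynb',
         'statistics.ipynb',
         'stats.pdf']

# [Dd] and [Ss] accept either case; .* skips whatever comes before "ipynb".
pattern = r"[Dd]ata [Ss]cience.*ipynb"

# Keep a file name if the pattern is found anywhere in it.
notebooks = [f for f in files if re.search(pattern, f)]
print(notebooks)
# ['Intro to Data Science.ipynb', 'data science example 3.ipynb', 'Data science class 4.ipynb']
```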

In [None]:
import re

# re.findall( pattern, string )

`re.findall` returns a list of all non-overlapping matches in the string. If the pattern contains capture groups, the list holds the captured text instead of the full matches. If nothing matches, it returns an empty list.



In [None]:
txt = "I have a pet in petaluma"
x = re.findall("pet", txt)
print(x)

['pet', 'pet']


## More symbols:

- `[]` Choice:  Example: Match vowels `[aeiou]`.
- `-` Range: example `[a-z]` matches one of the lower case characters.
- `^`  Beginning of the string.
- `$`  End of the string.
- `{m}` Exactly `m` repeats of the preceding item.
- `{m,n}` Between `m` and `n` repeats of the preceding item.
- `?` Zero or one occurrence.
- `*` Zero or more occurrences.
- `+` One or more occurrences.

Characters such as `*`, `.`, `+` and `-` (inside `[]`) have special meaning. To match one of them literally, escape it with a `\`.
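A short sketch of the repetition symbols, the anchors and escaping in action. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text.
text = "Call 555-1234 or 5551234. Price: 3.50"

# {3} and {4} fix the number of digits; -? makes the dash optional.
phones = re.findall(r"[0-9]{3}-?[0-9]{4}", text)
print(phones)    # ['555-1234', '5551234']

# \. matches a literal dot, so only the decimal number is found.
prices = re.findall(r"[0-9]+\.[0-9]+", text)
print(prices)    # ['3.50']

# ^ anchors at the beginning of the string, $ at the end.
first = re.findall(r"^[a-z]+", "hello world")
last = re.findall(r"[a-z]+$", "hello world")
print(first, last)   # ['hello'] ['world']
```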

## Special groups

- `\d`: digit. matches `[0-9]`
- `\w`: word character. matches `[a-zA-Z0-9_]`
- `\W`: non-word character. opposite of `\w`. matches `[^a-zA-Z0-9_]`
- `\s`: whitespace. matches `[ \t\n\r\f\v]`
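The following sketch applies each special group to one made-up line of text, assuming plain ASCII input. The variable names are illustrative.

```python
import re

# Hypothetical sample: an identifier, a tab, then a city, state and zip code.
line = "id_42\tBerkeley, CA 94704"

digits = re.findall(r"\d+", line)    # runs of digits
words = re.findall(r"\w+", line)     # runs of word characters (includes _)
fields = re.split(r"\s+", line)      # split on runs of whitespace
others = re.findall(r"\W", line)     # single non-word characters

print(digits)   # ['42', '94704']
print(words)    # ['id_42', 'Berkeley', 'CA', '94704']
print(fields)   # ['id_42', 'Berkeley,', 'CA', '94704']
print(others)   # ['\t', ',', ' ', ' ']
```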


In [None]:
zip_code = "Davis, CA 95616"

In [None]:
re.findall(r'\d{5}', zip_code)

['95616']

## Capturing parentheses

How do we capture a long string with a regex but return only a part we are interested in?

Say, we want to get the first five digits of the zipcode from an address.  We will match the 5-4 digit representation of the zip code and then get only the first five digits.

In [None]:
zip_code = "12345 5th street, Davis, CA 95616-1234"

In [None]:
re.findall(r'(\d{5})-\d{4}', zip_code)

['95616']
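For comparison, here is a small sketch of how the result of `re.findall` changes with the number of capture groups, using the same address. The variable names are illustrative.

```python
import re

address = "12345 5th street, Davis, CA 95616-1234"

# No groups: the full match is returned.
full = re.findall(r"\d{5}-\d{4}", address)
print(full)    # ['95616-1234']

# Two groups: each match becomes a tuple with one entry per group.
parts = re.findall(r"(\d{5})-(\d{4})", address)
print(parts)   # [('95616', '1234')]
```

The leading `12345` is not returned in either case because it is not followed by a dash and four digits.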

# re.search( pattern, string )

This function searches for the first occurrence of the pattern within the string. It returns a match object on success and `None` on failure. The match object's `group()` method returns the matched text, and `span()` returns its start and end positions.



In [None]:
ex = "There is a Spade near the sand"
r = re.search(r"\s", ex)  # raw string, so the backslash reaches the regex engine unchanged
print(r)

<re.Match object; span=(5, 6), match=' '>


In [None]:
ex[5:6]

' '

In [None]:
ex1 = "There is a Spade near the sand"
x = re.search("near", ex1)
print(x)

<re.Match object; span=(17, 21), match='near'>


In [None]:
ex1[17:21]

'near'
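Instead of slicing by hand, the matched text and its position can be read from the match object. A minimal sketch using the same sentence; the pattern `S\w+` (a capital S followed by word characters) is chosen only for illustration.

```python
import re

ex = "There is a Spade near the sand"

m = re.search(r"S\w+", ex)
# re.search returns None when nothing matches, so check before using the result.
if m is not None:
    word = m.group()      # the matched text
    position = m.span()   # (start, end) indices into ex
    print(word, position)   # Spade (11, 16)

# A pattern that does not occur anywhere in the string gives None.
missing = re.search(r"xyz", ex)
print(missing)   # None
```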

# re.sub( pattern, replacement, string)

Every match of the pattern in the string is replaced by the replacement string. The optional `count` argument limits how many matches are replaced; the example below replaces only the first two.


In [None]:
import re

s = "The rain in Spain"
x1 = re.sub(r"\s", "\t", s, count=2)  # replace only the first two whitespace characters
print(x1)


The	rain	in Spain
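A short sketch contrasting `re.sub` with and without `count`. The input strings and variable names are made up for illustration.

```python
import re

# Without count, every match is replaced: here, each run of whitespace.
collapsed = re.sub(r"\s+", " ", "The   rain  in Spain")
print(collapsed)    # The rain in Spain

# With count=1, only the first match is replaced.
first_only = re.sub(r"\s", "-", "The rain in Spain", count=1)
print(first_only)   # The-rain in Spain
```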


### re.split()

Split a string wherever a regular expression matches.

In [None]:
# Get only the tokens for the mathematical expression.
s = "a+b*c-d^e/f,g"        

In [None]:
re.split(r'[+,\-\*\/\^]',s)

['a', 'b', 'c', 'd', 'e', 'f', 'g']
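If the operators are needed as well, wrapping the pattern in capturing parentheses makes `re.split` keep the separators in the result. A minimal sketch on the same expression; the variable names are illustrative.

```python
import re

expr = "a+b*c-d^e/f,g"

# The parentheses form a capture group, so each separator is kept in the output.
tokens = re.split(r"([+\-*/^,])", expr)
print(tokens)
# ['a', '+', 'b', '*', 'c', '-', 'd', '^', 'e', '/', 'f', ',', 'g']
```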

## Getting latitude and longitude in the Berkeley dataset

`Block_Location` column in the Berkeley dataset has entries that look like this.
```
2100 SHATTUCK AVE
Berkeley, CA
(37.871167, -122.268285)
```
How do we extract the latitude and longitude from this string?

In [None]:
# `\n` denotes the newline character. It is useful for printing neatly.
address = "500 Buchanan street\nBerkeley CA\n95011"

In [None]:
print(address)

500 Buchanan street
Berkeley CA
95011


Here we provide a *regular expression* to be matched for our splitting.  Split whenever you find a `(` or a `)`.

In [None]:
s = '2100 SHATTUCK AVE\nBerkeley, CA\n(37.871167, -122.268285)'
print(s)

2100 SHATTUCK AVE
Berkeley, CA
(37.871167, -122.268285)


In [None]:
a = re.split(r'[()]', s)
a

['2100 SHATTUCK AVE\nBerkeley, CA\n', '37.871167, -122.268285', '']

In [None]:
a[0]

'2100 SHATTUCK AVE\nBerkeley, CA\n'

In [None]:
a[1]


'37.871167, -122.268285'

In [None]:

l = re.split(r',', a[1])

In [None]:
l[0]

'37.871167'

In [None]:
l[1]

' -122.268285'
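Note that the longitude still carries a leading space. As an alternative sketch, a single pattern with two capture groups can pull out both numbers directly and convert them to floats. It assumes the coordinates always appear as `(number, number)` with a decimal point; the pattern and the names `coord_pattern`, `lat` and `lon` are illustrative.

```python
import re

s = '2100 SHATTUCK AVE\nBerkeley, CA\n(37.871167, -122.268285)'

# \( and \) match literal parentheses; each group captures an optionally
# negative decimal number; \s* absorbs the space after the comma.
coord_pattern = r"\((-?\d+\.\d+),\s*(-?\d+\.\d+)\)"

m = re.search(coord_pattern, s)
if m is not None:
    lat = float(m.group(1))
    lon = float(m.group(2))
    print(lat, lon)   # 37.871167 -122.268285
```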