# String Manipulation


Python is widely used in part because of how easy it makes string manipulation.

In [1]:
val = 'a,b, guido'
val.split(',')

['a', 'b', ' guido']

In [2]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [4]:
first, second, third = pieces
print(f'{first}, {second}, {third}')

a, b, guido


In [None]:
'::'.join(pieces)  # join concatenates the pieces, using the string as the delimiter

In [None]:
'guido' in val

In [5]:
val.index(',')

1

In [6]:
val.find(':')

-1

Note the difference between the last two: `index` raises an error (`ValueError`) when it does not find the value, whereas `find` returns -1.
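As a minimal sketch of that difference, the snippet below looks up a substring that is absent from `val` with both methods. The variable names `missing_pos` and `index_raised` are illustrative only.

```python
val = 'a,b, guido'

# find signals "not found" through its return value
missing_pos = val.find(':')

# index signals "not found" by raising, so it needs a try/except
try:
    val.index(':')
    index_raised = False
except ValueError:
    index_raised = True

print(missing_pos, index_raised)
```

`find` is convenient when a missing substring is an expected case, while `index` suits code where a missing substring should stop execution.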

## Python String Methods

| Method | Description |
|---|---|
| `count(substring)` | Returns the number of non-overlapping occurrences of `substring` in the string. |
| `endswith(suffix)` | Returns `True` if the string ends with `suffix`. |
| `startswith(prefix)` | Returns `True` if the string starts with `prefix`. |
| `join(iterable)` | Uses the string as a delimiter to concatenate a sequence of other strings. |
| `index(substring)` | Returns the position of the first character in `substring` if found in the string; raises `ValueError` if not found. |
| `find(substring)` | Returns the position of the first character of the first occurrence of `substring` in the string; like `index`, but returns -1 if not found. |
| `rfind(substring)` | Returns the position of the first character of the last occurrence of `substring` in the string; returns -1 if not found. |
| `replace(old, new)` | Replaces occurrences of `old` with `new`. |
| `strip()`, `rstrip()`, `lstrip()` | Trim whitespace, including newlines, from both sides, the right side, or the left side, respectively. |
| `split(delimiter)` | Breaks the string into a list of substrings using the passed delimiter. |
| `lower()` | Converts alphabet characters to lowercase. |
| `upper()` | Converts alphabet characters to uppercase. |
| `casefold()` | Converts characters to lowercase, and converts any region-specific variable character combinations to a common comparable form. |
| `ljust(width, fillchar=' ')`, `rjust(width, fillchar=' ')` | Left justify or right justify, respectively; pad the opposite side of the string with spaces (or some other fill character) to return a string with a minimum width. |
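A few of the methods in the table, applied to the same `val` string used earlier. This is only a sketch; the variable names are illustrative.

```python
val = 'a,b, guido'

comma_count = val.count(',')        # number of commas in the string
last_comma = val.rfind(',')         # position of the last comma
replaced = val.replace(',', '::')   # swap every comma for '::'
padded = 'guido'.rjust(8, '.')      # right-justify to width 8, padding with '.'

# casefold is meant for case-insensitive comparisons
same_word = 'GUIDO'.casefold() == 'guido'.casefold()

print(comma_count, last_comma, replaced, padded, same_word)
```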


# Regular Expressions

Regular expressions are strings written in a conventional syntax, used to find sequences that follow certain patterns within a text. Python supports regular expressions through the built-in `re` module.

In [7]:
import re

In [9]:
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [11]:
# compile stores the regex object, which can save processing time and resources when the expression is used multiple times
regex = re.compile(r'\s+')

In [12]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [13]:
regex.findall(text)

[' ', '\t ', ' \t']

The power of a regex is easier to appreciate when the text is not "clean".

In [14]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'


In [15]:
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)


In [16]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [17]:
m = regex.search(text)

In [18]:
text[m.start():m.end()]

'dave@google.com'

Regular expressions let us split each matched pattern into groups by using parentheses `()`.

In [20]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [21]:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [22]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In a replacement string, the groups of a regex are accessed with the special symbols `\1`, `\2`, `\3`, etc.

In [23]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



## Regular Expression Methods

| Method | Description |
|---|---|
| `findall(pattern, string)` | Returns all non-overlapping matching patterns in a string as a list. |
| `finditer(pattern, string)` | Like `findall`, but returns an iterator. |
| `match(pattern, string)` | Matches the pattern at the start of the string and optionally segments pattern components into groups; if the pattern matches, returns a match object, and otherwise `None`. |
| `search(pattern, string)` | Scans the string for a match to the pattern, returning a match object if one is found; unlike `match`, the match can be anywhere in the string as opposed to only at the beginning. |
| `split(pattern, string)` | Breaks the string into pieces at each occurrence of the pattern. |
| `sub(pattern, repl, string)` | Replaces all occurrences of the pattern in the string with `repl`. |
| `subn(pattern, repl, string, count=0)` | Like `sub`, but returns a tuple `(new_string, number_of_substitutions)`. `count` caps the number of replacements; if `count` is 0, all occurrences are replaced. |
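The sketch below contrasts several of these methods on a shortened version of the email text, using a compiled pattern. The replacement text `'REDACTED'` and the variable names are assumptions chosen for illustration.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\n"
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)

# match only succeeds at the start of the string, which here is "Dave ", not an address
at_start = regex.match(text)

# search scans the whole string and stops at the first match
first = regex.search(text)

# finditer yields match objects lazily; here we collect each (start, end) span
spans = [m.span() for m in regex.finditer(text)]

# subn returns the new string together with the number of substitutions made
redacted, n_subs = regex.subn('REDACTED', text)

print(at_start, first.group(), len(spans), n_subs)
```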


# Vectorized Regular Expressions

In Python, regular expressions can also be applied in a vectorized way (here, over a pandas Series) to improve execution times. This can be done either with element-wise functions such as `map` or through the `str` accessor. The difference is that `map` does not handle null values, whereas `str` does, so the standard practice is to use `str` instead of `map`.
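A minimal sketch of the null-handling difference, assuming pandas and NumPy are installed. The Series `emails` and the flag `map_failed` are illustrative names, and the `map` call uses a plain string function with no explicit NaN handling.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Wes': np.nan})

# str methods skip missing values and leave them as NaN
upper_str = emails.str.upper()

# map calls the function on every element, including the float NaN,
# which has no upper() method
try:
    emails.map(lambda x: x.upper())
    map_failed = False
except AttributeError:
    map_failed = True

print(upper_str)
print(map_failed)
```

`map` can be made to work by checking for nulls inside the function, but the `str` accessor avoids that extra code.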

In [24]:
import pandas as pd 
import numpy as np

In [25]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com','Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)

In [27]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [34]:
matches = data.str.match(pattern, flags=re.IGNORECASE)

In [37]:
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

## Vectorized String Methods

| Method | Description |
|---|---|
| `cat(others=None, sep=None)` | Concatenate strings element-wise with an optional delimiter. |
| `contains(pat, regex=True)` | Return a boolean array indicating whether each string contains the pattern or regular expression. |
| `count(pat, flags=0)` | Count occurrences of the regular expression pattern in each string. |
| `extract(pattern)` | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group. |
| `endswith(pattern)` | Equivalent to `x.endswith(pattern)` for each element. |
| `startswith(pattern)` | Equivalent to `x.startswith(pattern)` for each element. |
| `findall(pat, flags=0)` | Compute a list of all occurrences of the regular expression for each string. |
| `get(i)` | Index into each element (retrieve the i-th element). |
| `isalnum()` | Equivalent to the built-in `str.isalnum()`. |
| `isalpha()` | Equivalent to the built-in `str.isalpha()`. |
| `isdecimal()` | Equivalent to the built-in `str.isdecimal()`. |
| `isdigit()` | Equivalent to the built-in `str.isdigit()`. |
| `islower()` | Equivalent to the built-in `str.islower()`. |
| `isnumeric()` | Equivalent to the built-in `str.isnumeric()`. |
| `isupper()` | Equivalent to the built-in `str.isupper()`. |
| `join(sep)` | Join strings in each element of the Series with the passed separator. |
| `len()` | Compute the length of each string. |
| `lower(), upper()` | Convert cases; equivalent to `x.lower()` or `x.upper()` for each element. |
| `match(pat, flags=0)` | Use `re.match` with the passed regular expression on each element, returning `True` or `False` depending on whether it matches. |
| `pad(width, side='left', fillchar=' ')` | Add whitespace (or another fill character) to the left, right, or both sides of strings. |
| `center(width, fillchar=' ')` | Equivalent to `pad(side='both')`. |
| `repeat(repeats)` | Duplicate values (e.g., `s.str.repeat(3)` is equivalent to `x * 3` for each string). |
| `replace(pat, repl, regex=False)` | Replace occurrences of the pattern or regular expression with some other string. |
| `slice(start=None, stop=None, step=None)` | Slice each string in the Series. |
| `split(pat, n=-1, expand=False)` | Split strings on the delimiter or regular expression. |
| `strip(to_strip=None)` | Trim whitespace from both sides, including newlines. |
| `rstrip(to_strip=None)` | Trim whitespace on the right side. |
| `lstrip(to_strip=None)` | Trim whitespace on the left side. |
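To illustrate a few entries from the table, the sketch below reuses the email Series and the grouped pattern from above. It assumes pandas and NumPy are installed; the names `parts`, `is_gmail`, and `first_five` are illustrative.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# extract returns a DataFrame with one column per capture group
parts = data.str.extract(pattern, flags=re.IGNORECASE)

# contains tests each element for the given pattern
is_gmail = data.str.contains('gmail')

# the str accessor also supports element-wise slicing
first_five = data.str[:5]

print(parts)
print(is_gmail)
print(first_five)
```

Because `extract` produces one column per group, it is often a more convenient starting point than `findall` when the pieces of each match are needed as separate columns.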
