# Strings and Regular Expressions
- 1. Working with Regex
- 2. Selecting Data in String Columns



#### Context
In day-to-day data science work, you may need to search a dataset for string patterns.
You can use Python's `re` library to search for such patterns, for example:
- Names that start with a specific character
- Searching for a pattern within a DataFrame column
- Extracting specific kinds of data from text (such as dates).
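
As a minimal sketch of the last item, the snippet below pulls dates out of free text with `re.findall`. The sample sentence and the `YYYY-MM-DD` format are assumptions made only for illustration; real data may need a different pattern.

```python
import re

# Hypothetical sample text; the dates and wording are made up for illustration.
text = "Order placed on 2021-03-15 and delivered on 2021-03-22."

# \d{4}-\d{2}-\d{2}: four digits, a dash, two digits, a dash, two digits.
date_pattern = r"\d{4}-\d{2}-\d{2}"

# findall returns every non-overlapping match as a list of strings.
dates = re.findall(date_pattern, text)
print(dates)  # ['2021-03-15', '2021-03-22']
```

Note that this pattern only checks the shape of the text; it would also accept an impossible date such as `2021-99-99`.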

Several pandas string methods accept regular expressions, including `Series.str.contains` and `Series.str.replace`. The links below cover them in more detail.

- Important links
    - https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/
    - https://docs.python.org/3/howto/regex.html

#### Outline
- 1. Working with Regex
- 2. Selecting Data by String

In [2]:
##### libraries used
import pandas as pd
import re

## 1. Working with Regex

- Matching characters that are not digits with \D
- Checking whether an email is valid
- Checking data using Series
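
The `\D` class from the first item can be sketched as follows. The phone-number string is a hypothetical input chosen for illustration.

```python
import re

# Hypothetical phone number string, used only for illustration.
raw_phone = "(11) 98765-4321"

# \D matches any single character that is not a digit.
non_digits = re.findall(r"\D", raw_phone)
print(non_digits)  # ['(', ')', ' ', '-']

# Replacing every non-digit with an empty string keeps only the digits.
digits_only = re.sub(r"\D", "", raw_phone)
print(digits_only)  # 11987654321
```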

**Check whether the text starts with a given word**

In [3]:
txt = "The rain in Brazil"
# ^ anchors the pattern to the start of the text
pattern = "^rain"

if re.search(pattern, txt):
    print('Valid text')
else:
    print('Invalid text')

Invalid text
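
The check above fails because `^` requires the match to begin at the first character, and the text begins with "The". The sketch below compares a few variations on the same sentence; the extra patterns are illustrative and not part of the original notebook.

```python
import re

txt = "The rain in Brazil"

# Without ^, re.search looks for "rain" anywhere in the text.
found_anywhere = bool(re.search("rain", txt))

# ^The requires the text to start with "The".
starts_with_the = bool(re.search("^The", txt))

# ^The.*Brazil$ requires "The" at the start and "Brazil" at the end,
# with any characters (.*) in between.
starts_and_ends = bool(re.search("^The.*Brazil$", txt))

print(found_anywhere, starts_with_the, starts_and_ends)  # True True True
```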


**Validating email**
- General patterns
    - Uppercase - [A-Z]
    - Lowercase - [a-z]
    - Numbers - [0-9]

In [4]:
# ^[A-Z] requires the first character to be an uppercase letter
mascara_digitos = '^[A-Z]'
nome = 'nicksson'

if re.search(mascara_digitos, nome):
    print('Valid name')
else:
    print('Invalid name')

Invalid name


In [5]:
# ^[0-9] requires the first character to be a digit
mascara_numeros = '^[0-9]'
digitos = 'a98989'

if re.search(mascara_numeros, digitos):
    print('Valid digits')
else:
    print('Invalid digits')

Invalid digits
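
A pattern such as `^[0-9]` only inspects the first character. If the goal is to require that the whole string is made of digits, one option is to add the `+` quantifier and the `$` anchor. This is a sketch with made-up inputs:

```python
import re

# ^[0-9] only checks the first character; ^[0-9]+$ checks the whole string.
starts_with_digit = r"^[0-9]"
only_digits = r"^[0-9]+$"

results = {}
for value in ["98989", "9a898", "a98989"]:
    results[value] = (
        bool(re.search(starts_with_digit, value)),
        bool(re.search(only_digits, value)),
    )
    print(value, results[value])
# 98989 (True, True)
# 9a898 (True, False)
# a98989 (False, False)
```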


In [6]:
# Informal shape of an email address: "letters and digits" @ "letters" . ("com")

In [7]:
# Raw string so that \. reaches the regex engine as an escaped (literal) dot
mascara_email = r"[a-zA-Z0-9]+@[a-zA-Z]+\.(com|edu|net)"

email = "nicksson9898@hotmail.com"

if re.search(mascara_email, email):
    print('Valid email')
else:
    print('Invalid email')

Valid email
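
Because `re.search` accepts a match anywhere in the string, the check above would also pass for inputs that merely contain something email-like. A possible refinement, shown here as a sketch with hypothetical inputs, is `re.fullmatch`, which requires the whole string to fit the pattern. The pattern itself remains a simplification and rejects many valid addresses (for example, ones with dots in the local part).

```python
import re

# Same pattern as in the cell above, written as a raw string.
mascara_email = r"[a-zA-Z0-9]+@[a-zA-Z]+\.(com|edu|net)"

# Hypothetical inputs for illustration.
candidates = [
    "nicksson9898@hotmail.com",
    "nicksson9898@hotmail.community",
    "@@nicksson@hotmail.com",
]

results = {}
for candidate in candidates:
    found_anywhere = bool(re.search(mascara_email, candidate))
    whole_string = bool(re.fullmatch(mascara_email, candidate))
    results[candidate] = (found_anywhere, whole_string)
    print(candidate, found_anywhere, whole_string)
# nicksson9898@hotmail.com True True
# nicksson9898@hotmail.community True False
# @@nicksson@hotmail.com True False
```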


## 2. Selecting Data by String

In [8]:
df = pd.read_csv('data/titanic.csv')

### Data Selection
- Selection using loc and iloc
- Selection by condition
- Selection by data type
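
The three kinds of selection listed above can be sketched on a small frame. `df_demo` is a hypothetical stand-in with a few Titanic-like columns, not the dataset loaded in this notebook.

```python
import pandas as pd

# Small hypothetical frame that mimics a few Titanic columns.
df_demo = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Age": [22.0, 26.0, 35.0],
    "Pclass": [3, 3, 3],
})

# loc selects by label: rows 0 to 1 (inclusive) and the Name column.
by_label = df_demo.loc[0:1, "Name"]

# iloc selects by position: first two rows, first two columns.
by_position = df_demo.iloc[0:2, 0:2]

# Conditional selection with a boolean mask.
older_than_25 = df_demo[df_demo["Age"] > 25]

# Selection by dtype: keep only the numeric columns.
numeric_only = df_demo.select_dtypes(include="number")

print(by_label.tolist())
print(by_position.shape)
print(older_than_25["Age"].tolist())
print(list(numeric_only.columns))
```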

In [9]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


In [11]:
# Note: the unescaped . matches any character, so 'Mr.' also matches "Mrs"
df[df['Name'].str.contains('Mr.')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1299,,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
1302,1303,,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
1304,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
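
The result above includes "Mrs." rows because an unescaped `.` matches any character. If only "Mr." is wanted, the dot can be escaped. The sketch below uses a small hypothetical Series in the same name format rather than the full dataset.

```python
import pandas as pd

# Hypothetical sample of names in the same format as the Titanic data.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Unescaped dot: "Mr." also matches "Mrs" because . accepts any character.
loose = names.str.contains("Mr.")

# Escaped dot: only a literal "Mr." matches.
strict = names.str.contains(r"Mr\.")

print(loose.tolist())   # [True, True, False]
print(strict.tolist())  # [True, False, False]
```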


In [12]:
# | means OR: names containing 'Mr.' (any character after "Mr") or 'Miss'
df[df['Name'].str.contains('Mr.|Miss')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1302,1303,,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
1303,1304,,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.7750,,S
1304,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1306,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S


In [13]:
# 'Allen' can appear anywhere in the name
df[df['Name'].str.contains('Allen')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
730,731,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1069,1070,,2,"Becker, Mrs. Allen Oliver (Nellie E Baumgardner)",female,36.0,0,3,230136,39.0,F4,S


In [14]:
# ^ anchors the match to the start of the name
df[df['Name'].str.contains('^Allen')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
730,731,1.0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S


In [15]:
# $ anchors the match to the end of the name
df[df['Name'].str.contains('Henry$')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
12,13,0.0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
159,160,0.0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,,S
209,210,1.0,1,"Blank, Mr. Henry",male,40.0,0,0,112277,31.0,A31,C
212,213,0.0,3,"Perkin, Mr. John Henry",male,22.0,0,0,A/5 21174,7.25,,S
222,223,0.0,3,"Green, Mr. George Henry",male,51.0,0,0,21440,8.05,,S
239,240,0.0,2,"Hunt, Mr. George Henry",male,33.0,0,0,SCO/W 1585,12.275,,S
271,272,1.0,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
385,386,0.0,2,"Davies, Mr. Charles Henry",male,18.0,0,0,S.O.C. 14879,73.5,,S
411,412,0.0,3,"Hart, Mr. Henry",male,,0,0,394140,6.8583,,Q


In [16]:
# Starts with 'Allen', ends with 'Henry', with anything (.*) in between
df[df['Name'].str.contains('^Allen.*Henry$')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
## Filter names that start with "A" OR end with "y"
df[df['Name'].str.contains('^A|y$')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
12,13,0.0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
13,14,0.0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.2750,,S
25,26,1.0,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,1,5,347077,31.3875,,S
40,41,0.0,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.4750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1281,1282,,1,"Payne, Mr. Vivian Ponsonby",male,23.0,0,0,12749,93.5000,B24,S
1283,1284,,3,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.2500,,S
1290,1291,,3,"Conlon, Mr. Thomas Henry",male,31.0,0,0,21332,7.7333,,Q
1292,1293,,2,"Gale, Mr. Harry",male,38.0,1,0,28664,21.0000,,S


In [18]:
## All tickets that start with a digit
df[df['Ticket'].str.contains('^[0-9]')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0.0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0.0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0.0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1301,1302,,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
1302,1303,,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
1303,1304,,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.7750,,S
1307,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [19]:
## Tickets that start with an uppercase letter (A-Z)
df[df['Ticket'].str.contains('^[A-Z]')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
10,11,1.0,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,13,0.0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1296,1297,,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20.0,0,0,SC/PARIS 2166,13.8625,D38,C
1300,1301,,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,,S
1304,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1305,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C


In [20]:
## Tickets that contain no digits
df[~df['Ticket'].str.contains('[0-9]')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
179,180,0.0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0,,S
271,272,1.0,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
302,303,0.0,3,"Johnson, Mr. William Cahoone Jr",male,19.0,0,0,LINE,0.0,,S
597,598,0.0,3,"Johnson, Mr. Alfred",male,49.0,0,0,LINE,0.0,,S


### Changing string values

In [21]:
df[df['Name'].str.contains('Henry$')].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
12,13,0.0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
159,160,0.0,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,,S
209,210,1.0,1,"Blank, Mr. Henry",male,40.0,0,0,112277,31.0,A31,C
212,213,0.0,3,"Perkin, Mr. John Henry",male,22.0,0,0,A/5 21174,7.25,,S


In [22]:
# Row labels of the names that end with 'Henry'
index = df[df['Name'].str.contains('Henry$')].index

In [23]:
df.loc[index, 'Name'].str.replace('Henry', 'Freitas')

4              Allen, Mr. William Freitas
12       Saundercock, Mr. William Freitas
159          Sage, Master. Thomas Freitas
209                    Blank, Mr. Freitas
212              Perkin, Mr. John Freitas
222             Green, Mr. George Freitas
239              Hunt, Mr. George Freitas
271        Tornquist, Mr. William Freitas
385           Davies, Mr. Charles Freitas
411                     Hart, Mr. Freitas
476             Renouf, Mr. Peter Freitas
482            Rouse, Mr. Richard Freitas
594             Chapman, Mr. John Freitas
695          Chapman, Mr. Charles Freitas
722        Gillespie, Mr. William Freitas
949           Davison, Mr. Thomas Freitas
990        Nancarrow, Mr. William Freitas
1037        Hilliard, Mr. Herbert Freitas
1068    Stengel, Mr. Charles Emil Freitas
1251        Sage, Master. William Freitas
1290           Conlon, Mr. Thomas Freitas
Name: Name, dtype: object

In [24]:
# str.replace returns a new Series, so assign it back to update df
df.loc[index, 'Name'] = df.loc[index, 'Name'].str.replace('Henry', 'Freitas')
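
The replacement above swaps every occurrence of "Henry" in the selected names, not only the one at the end. If a name could contain "Henry" more than once, one option is to pass a regex with the `$` anchor and `regex=True`. The names below are hypothetical and chosen to show the difference.

```python
import pandas as pd

# Hypothetical names; the first one contains "Henry" twice.
names = pd.Series(["Henry, Mr. John Henry", "Blank, Mr. Henry"])

# Literal replacement changes every occurrence of "Henry".
all_occurrences = names.str.replace("Henry", "Freitas", regex=False)

# With regex=True, the $ anchor restricts the replacement to the end of the string.
only_suffix = names.str.replace("Henry$", "Freitas", regex=True)

print(all_occurrences.tolist())  # ['Freitas, Mr. John Freitas', 'Blank, Mr. Freitas']
print(only_suffix.tolist())      # ['Henry, Mr. John Freitas', 'Blank, Mr. Freitas']
```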

In [25]:
df[df['Name'].str.contains('Freitas$')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0.0,3,"Allen, Mr. William Freitas",male,35.0,0,0,373450,8.05,,S
12,13,0.0,3,"Saundercock, Mr. William Freitas",male,20.0,0,0,A/5. 2151,8.05,,S
159,160,0.0,3,"Sage, Master. Thomas Freitas",male,,8,2,CA. 2343,69.55,,S
209,210,1.0,1,"Blank, Mr. Freitas",male,40.0,0,0,112277,31.0,A31,C
212,213,0.0,3,"Perkin, Mr. John Freitas",male,22.0,0,0,A/5 21174,7.25,,S
222,223,0.0,3,"Green, Mr. George Freitas",male,51.0,0,0,21440,8.05,,S
239,240,0.0,2,"Hunt, Mr. George Freitas",male,33.0,0,0,SCO/W 1585,12.275,,S
271,272,1.0,3,"Tornquist, Mr. William Freitas",male,25.0,0,0,LINE,0.0,,S
385,386,0.0,2,"Davies, Mr. Charles Freitas",male,18.0,0,0,S.O.C. 14879,73.5,,S
411,412,0.0,3,"Hart, Mr. Freitas",male,,0,0,394140,6.8583,,Q


## Practice

- https://www.w3schools.com/python/python_regex.asp
- Filter people whose name contains "Mr"
- Filter names that start with the word "Chapman"
- Filter names that end with the word