[source1](https://www.dataquest.io/blog/regex-cheatsheet/)
[source2](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

#Special Characters

>^ | Matches the expression to its right at the start of a string. In multi-line mode, it also matches such an instance immediately after each \n in the string.

>$ | Matches the expression to its left at the end of a string (or just before a final \n). In multi-line mode, it also matches such an instance before each \n in the string.

>. | Matches any character except line terminators like \n.

>\ | Escapes special characters or denotes character classes.

>A|B | Matches expression A or B. If A is matched first, B is left untried.

>+ | Greedily matches the expression to its left 1 or more times.

>* | Greedily matches the expression to its left 0 or more times.

>? | Greedily matches the expression to its left 0 or 1 times. When ? is added after a quantifier (+, *, ? itself, or {m,n}), that quantifier matches in a non-greedy manner.

>{m} | Matches the expression to its left exactly m times.

>{m,n} | Greedily matches the expression to its left from m to n times: at least m, at most n, and as many as possible.

>{m,n}? | Non-greedy form of {m,n}: matches the expression to its left from m to n times, but as few times as possible. See ? above.

> \n |New line
\
>\r |Carriage return
\
>\t |Tab
\
>\v |Vertical tab
\
>\f |Form feed
\
>\xxx |Octal character xxx
\
>\xhh |Hex character hh
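
A minimal sketch of how the anchors and the greedy and non-greedy quantifiers above behave in Python's `re` module. The sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line"

# By default ^ matches only at the very start of the string
starts_default = re.findall(r"^\w+", text)
# With re.MULTILINE it also matches right after each newline
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

html = "<b>bold</b>"
greedy = re.findall(r"<.+>", html)  # + takes as much as it can
lazy = re.findall(r"<.+?>", html)   # +? takes as little as it can

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(greedy)            # ['<b>bold</b>']
print(lazy)              # ['<b>', '</b>']
```
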
#Character Classes (a.k.a. Special Sequences)
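>\w | Matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _. For str patterns in Python 3 it also matches letters and digits from other scripts, unless the ASCII flag is set.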

>\d | Matches digits, which means 0-9.

>\D | Matches any non-digits.

>\s | Matches whitespace characters, which include the space, \t, \n, \r, \f, and \v characters.

>\S | Matches non-whitespace characters.

>\b | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W.

>\B | Matches where \b does not, that is, at any position that is not a word boundary (between two \w characters or between two \W characters).

>\A | Matches the expression to its right at the absolute start of a string whether in single or multi-line mode.

>\Z | Matches the expression to its left at the absolute end of a string whether in single or multi-line mode.
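
The difference between \b and \B is easiest to see on a small example. The sketch below uses an invented sample string.

```python
import re

text = "cat concat cats 42"

# \bcat\b: 'cat' as a whole word only
whole_word = re.findall(r"\bcat\b", text)
# \Bcat: 'cat' that does NOT start at a word boundary (here, inside 'concat')
inside_word = re.findall(r"\Bcat", text)
# \d+: runs of digits
digits = re.findall(r"\d+", text)

print(whole_word)   # ['cat']
print(inside_word)  # ['cat']
print(digits)       # ['42']
```
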

#Sets
>[ ] | Contains a set of characters to match.

>[amk] | Matches either a, m, or k. It does not match amk.

>[a-z] | Matches any lowercase letter from a to z.

>[a\-z] | Matches a, -, or z. It matches - because \ escapes it.

>[a-] | Matches a or -, because - is not being used to indicate a series of characters.

>[-a] | As above, matches a or -.

>[a-z0-9] | Matches characters from a to z and also from 0 to 9.

>[(+*)] | Special characters become literal inside a set, so this matches (, +, *, and ).

>[^ab5] | Adding ^ excludes any character in the set. Here, it matches characters that are not a, b, or 5.
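
A short sketch of the set rules above, run against an invented string that contains a, b, 5, a hyphen, and some special characters.

```python
import re

text = "ab5-xyz(+)"

negated = re.findall(r"[^ab5]", text)   # everything except a, b, 5
literals = re.findall(r"[(+*)]", text)  # special characters are literal in a set
escaped = re.findall(r"[a\-z]", text)   # a, -, or z (not a range)

print(negated)   # ['-', 'x', 'y', 'z', '(', '+', ')']
print(literals)  # ['(', '+', ')']
print(escaped)   # ['a', '-', 'z']
```
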

#Groups
>( ) | Matches the expression inside the parentheses and groups it.

>(? ) | Inside parentheses like this, ? acts as an extension notation. Its meaning depends on the character immediately to its right.

>`(?P<name>AB)` | Matches the expression AB, and it can be accessed with the group name.

>(?aiLmsux) | Here, a, i, L, m, s, u, and x are flags:

>a — Matches ASCII only
\
>i — Ignore case
\
>L — Locale dependent
\
>m — Multi-line
\
>s — Dot matches all (. also matches \n)
\
>u — Unicode matching (the default for str patterns in Python 3)
\
>x — Verbose
\
>(?:A) | Matches the expression as represented by A, but unlike a named group such as `(?P<name>AB)`, it cannot be retrieved afterwards.

>(?#...) | A comment. Contents are for us to read, not for matching.

>A(?=B) | Lookahead assertion. This matches the expression A only if it is followed by B.

>A(?!B) | Negative lookahead assertion. This matches the expression A only if it is not followed by B.

>(?<=B)A | Positive lookbehind assertion. This matches the expression A only if B is immediately to its left. B must be a fixed-length expression.

>(?<!B)A | Negative lookbehind assertion. This matches the expression A only if B is not immediately to its left. B must be a fixed-length expression.

>(?P=name) | Matches the expression matched by an earlier group named “name”.

>(...)\1 | The number 1 refers to the first group in the pattern. \1 matches the same text that the group matched, so a repeated occurrence can be matched by number instead of writing out the whole expression again. Groups can be referenced this way with numbers from 1 up to 99.
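
The group constructs above can be sketched as follows. The sentences are invented, and the names `user` and `domain` are chosen only for this example.

```python
import re

sentence = "write to pallen@enron.com today"

# Named groups: retrieve each part by name
m = re.search(r"(?P<user>\w+)@(?P<domain>\w+)\.com", sentence)
user, domain = m.group("user"), m.group("domain")

# Lookahead: word characters followed by @ (the @ is not part of the match)
before_at = re.findall(r"\w+(?=@)", sentence)
# Lookbehind: word characters preceded by @
after_at = re.findall(r"(?<=@)\w+", sentence)

# Backreference: \1 must repeat the exact text captured by group 1
doubled = re.findall(r"\b(\w+) \1\b", "it is is fine")

print(user, domain)  # pallen enron
print(before_at)     # ['pallen']
print(after_at)      # ['enron']
print(doubled)       # ['is']
```
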

#Popular Python re Module Functions
* re.findall(A, B) | Matches all instances of an expression A in a string B and returns them in a list.

* re.search(A, B) | Matches the first instance of an expression A in a string B, and returns it as a re match object (or None if there is no match).

* re.split(A, B) | Split a string B into a list using the delimiter A.

* re.sub(A, B, C) | Replace A with B in the string C.
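
A quick sketch of the four functions on one invented string, to show what each one returns.

```python
import re

text = "apples, pears;  plums"

words = re.findall(r"\w+", text)      # list of all matches
first_p = re.search(r"\bp\w+", text)  # match object for the first match, or None
parts = re.split(r"[,;]\s*", text)    # split on , or ; plus any following spaces
squeezed = re.sub(r"\s+", " ", text)  # collapse runs of whitespace

print(words)            # ['apples', 'pears', 'plums']
print(first_p.group())  # pears
print(parts)            # ['apples', 'pears', 'plums']
print(squeezed)         # apples, pears; plums
```
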

#Email Patterns



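```
import re

# Stricter pattern: lowercase letters/digits, a letters-only domain, and a 2-8 letter ending
pattern = r'[a-z0-9]+@[a-z]+\.[a-z]{2,8}'

# Looser pattern; `text` is the list of message bodies built in the exercise below
emails = re.findall(r'\S+@\w+\.\w+', str(text))
```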
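
Neither pattern is a full email validator. The sketch below runs both on an invented sentence to show where each one is too loose or too strict; the addresses are hypothetical.

```python
import re

sample = "Contact: <Bob.Smith@example.com>, or jane99@mail.org."

strict_pattern = r'[a-z0-9]+@[a-z]+\.[a-z]{2,8}'
loose_pattern = r'\S+@\w+\.\w+'

# The strict pattern only knows lowercase, so lowercase the text first
strict = re.findall(strict_pattern, sample.lower())
loose = re.findall(loose_pattern, sample)

print(strict)  # ['smith@example.com', 'jane99@mail.org']
print(loose)   # ['<Bob.Smith@example.com', 'jane99@mail.org']
```

The strict pattern loses `bob.` because `.` is not in `[a-z0-9]`, while the loose pattern keeps the leading `<` because `\S` accepts any non-whitespace character.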



##Email Exercise

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_excel('email_re.xlsx')

In [None]:
df['content']

0                                    Here is our forecast
1       Traveling to have a business meeting takes the...
2                           test successful. way to go!!!
3       Randy, Can you send me a schedule of the salar...
4                       Let's shoot for Tuesday at 11:45.
                              ...                        
1230    loan servicing-jessica weeber 800-393-5626 jwe...
1231                              exit mccollough off 410
1232    >From: "Greg Thorse" >To: >CC: "Phillip Allen"...
1233    This request has been pending your approval fo...
1234    If you cannot read this email, please click he...
Name: content, Length: 1235, dtype: object

In [None]:
text= df['content'].tolist()

In [None]:
import re

In [None]:
emails = re.findall(r'\S+@\w+\.\w+', str(text))

##Another method!

In [None]:
lista = df['content'].tolist()
lista= [str(i) for i in lista]
text = ''.join(lista).lower()
text

In [None]:
pattern = r'[a-z0-9]+@[a-z]+\.[a-z]{2,3}' # instead of 3 I could use, for example, 8, because the final part of the address can be from a minimum of 2 to a maximum of 8 letters

# [] defines the set of allowed characters; the + after it means one or more of them
email = re.findall(pattern, text)
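
One thing to watch with this approach: joining the messages with an empty string can glue the end of one message to the start of the next. A minimal sketch with two invented messages and the 2-to-8 variant of the pattern:

```python
import re

# Hypothetical messages, standing in for df['content']
messages = ["mail me at pallen@enron.com", "thanks"]
pattern = r'[a-z0-9]+@[a-z]+\.[a-z]{2,8}'

glued = re.findall(pattern, ''.join(messages).lower())
spaced = re.findall(pattern, ' '.join(messages).lower())

print(glued)   # ['pallen@enron.comthank']
print(spaced)  # ['pallen@enron.com']
```

Joining with `' '` (or `'\n'`) keeps message boundaries intact, at the cost of nothing more than one extra character per message.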

#Counter

## Method 1 with Counter

In [None]:
from collections import Counter

words = ['a', 'b', 'c', 'a']

print(Counter(words).keys()) # the unique elements, like set(words)
print(Counter(words).values()) # each element's frequency

dict_keys(['a', 'b', 'c'])
dict_values([2, 1, 1])


In [None]:
from collections import Counter
counts = Counter(email) # build the Counter once and reuse it
print(counts.keys()) # the unique addresses, like set(email)
print(counts.values()) # how many times each address occurs

k = list(counts.keys())
v = list(counts.values())

dict_keys(['pallen@enron.com', 'grigsby@enron.com', 'kholst@enron.com', 'buckner@honeywell.com', 'cbpres@austin.rr', 'retwell@mail.sanmarco', 'clclegal2@aol.com', 'stone@yahoo.com', 'jeff@freeyellow.com', 'allen@enron.com', 'invest@bga.com', 'stagecoachmama@hotmail.com', 'admin@fsddatasvc.com', '2000@enron.com', 'smith@lrinet.com', 'dnowak@enron.com', 'debe@fsddatasvc.com', 'mark@intelligencepress.com', 'lucci@enron.com', 'arsystem@ect.enron', 'shankman@enron.com', 'ermis@enron.com', 'ss4@skpstnhouston.com', 'mmiller3@enron.com', 'young@enron.com', 'benson@enron.com', 'kean@enron.com', 'shapiro@enron.com', 'nicolay@enron.com', 'cantrell@enron.com', 'jsteele@pira.com', 'rahal@acnpower.com', 'maryrichards7@hotmail.com', 'mlenhart@mail.ev', 'mmitchm@msn.com', 'enorman@living.com', 'ben@living.com', 'stephanie@living.com', 'designadvice@living.com', 'dexter@intelligencepress.com', 'lkuch@mh.com', 'horton@enron.com', 'dmccarty@enron.com', 'jhershey@sempratrading.com', 'postmaster@caprock.ne

In [None]:
tot = pd.DataFrame(list(zip(k, v)), 
                 columns =['email', 'count']) 
tot = tot.sort_values('count',ascending=False)
tot

Unnamed: 0,email,count
0,pallen@enron.com,106
9,allen@enron.com,90
4,cbpres@austin.rr,74
48,pallen70@hotmail.com,18
49,llewter@austin.rr,18
...,...,...
191,kenj@chelanpud.org,1
190,dearing@chelanpud.org,1
189,gtillitson@caiso.com,1
188,tmsnodgrass@bpa.gov,1
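
As an alternative sketch, `Counter.most_common()` returns the (element, count) pairs already sorted by descending count, so the separate key and value lists are not needed. The addresses below are invented.

```python
from collections import Counter

# Hypothetical sample, standing in for the `email` list
email_sample = ['a@x.com', 'b@y.org', 'a@x.com', 'c@z.net', 'a@x.com', 'b@y.org']

counts = Counter(email_sample)
top_two = counts.most_common(2)  # the two most frequent addresses
ranked = counts.most_common()    # all addresses, most frequent first

print(top_two)  # [('a@x.com', 3), ('b@y.org', 2)]
print(ranked)   # [('a@x.com', 3), ('b@y.org', 2), ('c@z.net', 1)]
```

Passing `ranked` to `pd.DataFrame(ranked, columns=['email', 'count'])` should give an equivalent table that is already sorted.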


## Method 2 with sets

In [None]:
set_ = {"mala", "banana", "arancia","arancia"}
print(set_)

{'mala', 'arancia', 'banana'}


In [None]:
mylist = ['aa', 'bb', 'bb', 'aa', 'c', 'd', 'e']
myset = set(mylist)
myset

{'aa', 'bb', 'c', 'd', 'e'}

In [None]:
email_final = set(email) # a set keeps only the unique values
len(email_final)

246

In [None]:
pallen = re.findall('pallen@enron.com',text) # re.findall needs a string, not a list; note the unescaped . matches any character
len(pallen)

107

In [None]:
retwell = re.findall('retwell@mail.san',text) # re.findall needs a string, not a list; note the unescaped . matches any character
len(retwell)

18
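
Because an unescaped `.` matches any character, counting a literal address with `re.findall` can overcount. A small sketch on an invented string, comparing it with `re.escape` and `str.count`:

```python
import re

# Hypothetical text: the second token has 'x' where the dot should be
sample = "pallen@enron.com pallen@enronxcom"
address = 'pallen@enron.com'

unescaped = len(re.findall(address, sample))           # . matches 'x' too
escaped = len(re.findall(re.escape(address), sample))  # literal dot only
plain = sample.count(address)                          # no regex needed

print(unescaped, escaped, plain)  # 2 1 1
```
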

In [None]:
with open('result.txt', 'w') as f:
    for i in email_final:
        f.write("%s\n" % i)

In [None]:
tot.to_csv('result.csv',index=False)
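
Sets have no defined order, so the lines in result.txt can come out in a different order on each run. A minimal sketch that sorts before writing; it uses a temporary directory and a small sample set so that it runs on its own.

```python
import os
import tempfile

# Small sample, standing in for the full `email_final` set
email_final = {"pallen@enron.com", "allen@enron.com", "kholst@enron.com"}

path = os.path.join(tempfile.mkdtemp(), "result.txt")
with open(path, "w") as f:
    for address in sorted(email_final):  # sorted() gives a stable order
        f.write(f"{address}\n")

# Read the file back to check what was saved
with open(path) as f:
    saved = f.read().splitlines()

print(saved)  # ['allen@enron.com', 'kholst@enron.com', 'pallen@enron.com']
```
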

#Saving a file 

#Drop some words from a column

In [None]:
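s = df['nomecolonna'] # placeholder: use the name of your column
stopwords = ['parolechevogliotogliere', 'parolechevogliotogliere'] # placeholders: one string per word to remove
def clean_stopwords(x):
    # keep only the words that are not in the stopword list
    return ' '.join([w for w in x.split(' ') if w not in stopwords])
s = s.apply(clean_stopwords)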
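
Since this notebook is about regular expressions, the same cleanup can also be sketched with `re.sub` and word boundaries. The function name `clean_stopwords_re` and the stopword list are made up for this example; the function works on a plain string, so it could be passed to `Series.apply` in the same way.

```python
import re

stopwords = ["the", "a"]  # hypothetical words to remove

# \b...\b matches whole words only; re.escape guards against special characters
stopword_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b\s*"
)

def clean_stopwords_re(x):
    # delete each stopword together with the whitespace that follows it
    return stopword_pattern.sub("", x).strip()

cleaned = clean_stopwords_re("the cat ate a fish")
print(cleaned)  # cat ate fish
```
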