# Experimenting with regular expressions

### Importing modules

In [1]:
import re

### Defining sample strings 

In [2]:
lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz"
uppercase_alphabet = lowercase_alphabet.upper()
numbers = "1234567890"
sentence = "The Quick Brown Fox Jumps Over The Lazy Dog"
website = "www.medium.com"
phone_numbers = """123-456-7890
                    987.654.321
                    234-567-8901
                    654.321.987
                    345-678-9012
                    321.654.978
                    456-789-0123
                """
special_characters = r"[\^$.|?*+()"  # raw string: "\^" is not a valid escape in a normal string literal

### Match explicit character(s)

In [3]:
re.findall("abc", lowercase_alphabet)

['abc']

In [4]:
re.findall("ABC", uppercase_alphabet)

['ABC']

In [5]:
re.findall("abc", uppercase_alphabet)

[]
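
The empty result above shows that matching is case-sensitive by default. As a minimal sketch (the variable names `case_sensitive` and `case_insensitive` are illustrative and not part of the original notebook), the standard `re.IGNORECASE` flag relaxes this:

```python
import re

uppercase_alphabet = "abcdefghijklmnopqrstuvwxyz".upper()

# Without a flag the match is case-sensitive, so nothing is found.
case_sensitive = re.findall("abc", uppercase_alphabet)

# re.IGNORECASE makes the pattern match regardless of letter case.
case_insensitive = re.findall("abc", uppercase_alphabet, flags=re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['ABC']
```

Note that the returned text keeps the casing of the searched string, not of the pattern.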

### Match with special character(s)

In [6]:
re.findall(r"www\.medium\.com", website)

['www.medium.com']

In [7]:
re.findall(r"\$", special_characters)

['$']

In [8]:
re.findall(r"\|", special_characters)

['|']
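
Escaping each metacharacter by hand becomes error-prone for longer literals. The sketch below, which is not from the original notebook, uses the standard `re.escape` helper to build the escaped pattern, and contrasts it with an unescaped `.` acting as a wildcard. The names `literal_pattern`, `escaped_match`, and `wildcard_match` are illustrative.

```python
import re

special_characters = r"[\^$.|?*+()"

# re.escape backslash-escapes every regex metacharacter in a literal string.
literal_pattern = re.escape("$.|")
escaped_match = re.findall(literal_pattern, special_characters)

# Unescaped, "." is a wildcard that matches any character except a newline,
# so it matches each of the 11 characters in the sample string.
wildcard_match = re.findall(".", special_characters)

print(escaped_match)        # ['$.|']
print(len(wildcard_match))  # 11
```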

### Match by pattern

In [13]:
re.findall(r"\w{1,}", sentence)

['The', 'Quick', 'Brown', 'Fox', 'Jumps', 'Over', 'The', 'Lazy', 'Dog']

In [14]:
re.findall(r"\d{3}-\d{3}-\d{4}", phone_numbers)

['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123']
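
The pattern above skips the dot-separated entries. As a hedged sketch, a character class can accept either separator. It uses a shortened copy of the sample data, and because the dotted entries in that data end in only three digits, the last group is written as `\d{3,4}`. The name `pattern` is illustrative.

```python
import re

phone_numbers = """123-456-7890
987.654.321
234-567-8901
654.321.987
"""

# [-.] accepts either separator; inside a character class "." is literal
# and a leading "-" needs no escaping.
pattern = r"\d{3}[-.]\d{3}[-.]\d{3,4}"
matches = re.findall(pattern, phone_numbers)

print(matches)
# ['123-456-7890', '987.654.321', '234-567-8901', '654.321.987']
```

This version also accepts mixed separators such as `123-456.7890`. If that matters, a backreference to the first separator would be one way to tighten it.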

### Web Scraping (Data Collection)

Data collection is a common part of a data scientist’s work, and the web is a relatively easy place to find data. Sites such as Wikipedia can be scraped, but scraped data is usually messy and full of noise. Suppose the following is the HTML you need to process:

In [15]:
html = """<table class="vertical-navbox nowraplinks" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f9f9f9;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%"><tbody><tr><th style="padding:0.2em 0.4em 0.2em;font-size:145%;line-height:1.2em"><a href="/wiki/Machine_learning" title="Machine learning">Machine learning</a> and<br /><a href="/wiki/Data_mining" title="Data mining">data mining</a></th></tr><tr><td style="padding:0.2em 0 0.4em;padding:0.25em 0.25em 0.75em;"><a href="/wiki/File:Kernel_Machine.svg" class="image"><img alt="Kernel Machine.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/220px-Kernel_Machine.svg.png" decoding="async" width="220" height="100" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/330px-Kernel_Machine.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/440px-Kernel_Machine.svg.png 2x" data-file-width="512" data-file-height="233" /></a></td></tr><tr><td style="padding:0 0.1em 0.4em">
<div class="NavFrame collapsed" style="border:none;padding:0"><div class="NavHead" style="font-size:105%;background:transparent;text-align:left">Problems</div><div class="NavContent" style="font-size:105%;padding:0.2em 0 0.4em;text-align:center"><div class="hlist">
<ul><li><a href="/wiki/Statistical_classification" title="Statistical classification">Classification</a></li>
<li><a href="/wiki/Cluster_analysis" title="Cluster analysis">Clustering</a></li>
<li><a href="/wiki/Regression_analysis" title="Regression analysis">Regression</a></li>
<li><a href="/wiki/Anomaly_detection" title="Anomaly detection">Anomaly detection</a></li>
<li><a href="/wiki/Automated_machine_learning" title="Automated machine learning">AutoML</a></li>
<li><a href="/wiki/Association_rule_learning" title="Association rule learning">Association rules</a></li>
<li><a href="/wiki/Reinforcement_learning" title="Reinforcement learning">Reinforcement learning</a></li>
<li><a href="/wiki/Structured_prediction" title="Structured prediction">Structured prediction</a></li>
<li><a href="/wiki/Feature_engineering" title="Feature engineering">Feature engineering</a></li>
<li><a href="/wiki/Feature_learning" title="Feature learning">Feature learning</a></li>
<li><a href="/wiki/Online_machine_learning" title="Online machine learning">Online learning</a></li>
<li><a href="/wiki/Semi-supervised_learning" title="Semi-supervised learning">Semi-supervised learning</a></li>
<li><a href="/wiki/Unsupervised_learning" title="Unsupervised learning">Unsupervised learning</a></li>
<li><a href="/wiki/Learning_to_rank" title="Learning to rank">Learning to rank</a></li>
<li><a href="/wiki/Grammar_induction" title="Grammar induction">Grammar induction</a></li></ul>
</div></div></div></td>
</tr><tr><td style="padding:0 0.1em 0.4em">
<div class="NavFrame collapsed" style="border:none;padding:0"><div class="NavHead" style="font-size:105%;background:transparent;text-align:left"><div style="padding:0.1em 0;line-height:1.2em;"><a href="/wiki/Supervised_learning" title="Supervised learning">Supervised learning</a><br /><style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style><span class="nobold"><span style="font-size:85%;">(<b><a href="/wiki/Statistical_classification" title="Statistical classification">classification</a></b>&#160;&#8226;&#32;<b><a href="/wiki/Regression_analysis" title="Regression analysis">regression</a></b>)</span></span> </div></div><div class="NavContent" style="font-size:105%;padding:0.2em 0 0.4em;text-align:center"><div class="hlist">
<ul><li><a href="/wiki/Decision_tree_learning" title="Decision tree learning">Decision trees</a></li>
<li><a href="/wiki/Ensemble_learning" title="Ensemble learning">Ensembles</a>
<ul><li><a href="/wiki/Bootstrap_aggregating" title="Bootstrap aggregating">Bagging</a></li>
<li><a href="/wiki/Boosting_(machine_learning)" title="Boosting (machine learning)">Boosting</a></li>
<li><a href="/wiki/Random_forest" title="Random forest">Random forest</a></li></ul></li>
<li><a href="/wiki/K-nearest_neighbors_algorithm" title="K-nearest neighbors algorithm"><i>k</i>-NN</a></li>
<li><a href="/wiki/Linear_regression" title="Linear regression">Linear regression</a></li>
<li><a href="/wiki/Naive_Bayes_classifier" title="Naive Bayes classifier">Naive Bayes</a></li>
<li><a href="/wiki/Artificial_neural_network" title="Artificial neural network">Artificial neural networks</a></li>
<li><a href="/wiki/Logistic_regression" title="Logistic regression">Logistic regression</a></li>
<li><a href="/wiki/Perceptron" title="Perceptron">Perceptron</a></li>
<li><a href="/wiki/Relevance_vector_machine" title="Relevance vector machine">Relevance vector machine (RVM)</a></li>
<li><a href="/wiki/Support-vector_machine" title="Support-vector machine">Support vector machine (SVM)</a></li></ul>
</div></div></div></td></table>"""


This snippet comes from a Wikipedia page and links to various other Wikipedia pages. A first step is to list the topics it contains by capturing the text of each link:

In [16]:
re.findall(r">([\w\s()]*?)</a>", html)

['Machine learning',
 'data mining',
 '',
 'Classification',
 'Clustering',
 'Regression',
 'Anomaly detection',
 'AutoML',
 'Association rules',
 'Reinforcement learning',
 'Structured prediction',
 'Feature engineering',
 'Feature learning',
 'Online learning',
 'Unsupervised learning',
 'Learning to rank',
 'Grammar induction',
 'Supervised learning',
 'classification',
 'regression',
 'Decision trees',
 'Ensembles',
 'Bagging',
 'Boosting',
 'Random forest',
 'Linear regression',
 'Naive Bayes',
 'Artificial neural networks',
 'Logistic regression',
 'Perceptron',
 'Relevance vector machine (RVM)',
 'Support vector machine (SVM)']
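
The result contains an empty string (from the image link, whose body is an `<img>` tag) and omits "Semi-supervised learning" and "k-NN", because the character class `[\w\s()]` does not allow hyphens or nested tags. One possible workaround, sketched here on a small hand-made excerpt rather than the full `html` string, is to capture the whole anchor body and strip any inner tags afterwards. The names `sample_html`, `anchor_bodies`, and `topics` are illustrative.

```python
import re

# A shortened, hand-made excerpt with the problematic cases.
sample_html = (
    '<li><a href="/wiki/Online_machine_learning">Online learning</a></li>'
    '<li><a href="/wiki/Semi-supervised_learning">Semi-supervised learning</a></li>'
    '<a href="/wiki/File:Kernel_Machine.svg"><img alt="Kernel Machine.svg" /></a>'
    '<li><a href="/wiki/K-nearest_neighbors_algorithm"><i>k</i>-NN</a></li>'
)

# Capture everything between an opening <a ...> tag and its closing </a>.
anchor_bodies = re.findall(r"<a\b[^>]*>(.*?)</a>", sample_html)

# Strip nested tags such as <i> or <img>, then drop empty results.
topics = [re.sub(r"<[^>]+>", "", body).strip() for body in anchor_bodies]
topics = [topic for topic in topics if topic]

print(topics)
# ['Online learning', 'Semi-supervised learning', 'k-NN']
```

Regular expressions are fine for quick exploration like this, but for real pages a dedicated HTML parser is usually the more robust choice.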

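Now let us extract the links to those pages. Note that `[\w-]*` stops at the first character outside the class, so links containing `:` or `(` are cut short: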

In [17]:
re.findall(r"/wiki/[\w-]*", html)

['/wiki/Machine_learning',
 '/wiki/Data_mining',
 '/wiki/File',
 '/wiki/Statistical_classification',
 '/wiki/Cluster_analysis',
 '/wiki/Regression_analysis',
 '/wiki/Anomaly_detection',
 '/wiki/Automated_machine_learning',
 '/wiki/Association_rule_learning',
 '/wiki/Reinforcement_learning',
 '/wiki/Structured_prediction',
 '/wiki/Feature_engineering',
 '/wiki/Feature_learning',
 '/wiki/Online_machine_learning',
 '/wiki/Semi-supervised_learning',
 '/wiki/Unsupervised_learning',
 '/wiki/Learning_to_rank',
 '/wiki/Grammar_induction',
 '/wiki/Supervised_learning',
 '/wiki/Statistical_classification',
 '/wiki/Regression_analysis',
 '/wiki/Decision_tree_learning',
 '/wiki/Ensemble_learning',
 '/wiki/Bootstrap_aggregating',
 '/wiki/Boosting_',
 '/wiki/Random_forest',
 '/wiki/K-nearest_neighbors_algorithm',
 '/wiki/Linear_regression',
 '/wiki/Naive_Bayes_classifier',
 '/wiki/Artificial_neural_network',
 '/wiki/Logistic_regression',
 '/wiki/Perceptron',
 '/wiki/Relevance_vector_machine',
 '/wik