# Demo 11 - Regular Expressions and Web Scraping

In this notebook we look at the basics of the `requests` library, how to use regular expressions in Python, and how to grab information from the web using Beautiful Soup!

In [1]:
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps3160.git
%cd /content/cmps3160/_demos
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
plt.style.use('fivethirtyeight')
# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42

/content
Cloning into 'cmps3160'...
remote: Enumerating objects: 1647, done.
remote: Counting objects: 100% (537/537), done.
remote: Compressing objects: 100% (229/229), done.
remote: Total 1647 (delta 319), reused 445 (delta 261), pack-reused 1110
Receiving objects: 100% (1647/1647), 45.27 MiB | 22.54 MiB/s, done.
Resolving deltas: 100% (920/920), done.
/content/cmps3160/_demos


In [2]:
# Note you may have to install requests!  pip3 install requests
import requests

## Simple Webpage Call with Requests Library

It may be good to look at the reference documentation for the [requests library](https://2.python-requests.org/en/master/user/quickstart/).

First, let's have a look at [PolitWoops](https://projects.propublica.org/politwoops/).

Or even [Prof. Culotta's Website](https://cs.tulane.edu/~aculotta/)

In [3]:
r = requests.get('https://cs.tulane.edu/~aculotta/', timeout=10)
r.status_code

200

In [4]:
r.headers['content-type']

'text/html; charset=UTF-8'
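The `content-type` header bundles a media type with an optional charset, and the charset is roughly what lets a client turn the raw bytes in `r.content` into the string in `r.text`. As a minimal offline sketch, the header value above can be split with the standard library. Using `email.message.Message` as the header parser is an assumption of this sketch, and the helper name `split_content_type` is hypothetical:

```python
from email.message import Message

def split_content_type(header_value):
    """Return (media_type, charset) parsed from a Content-Type header value."""
    msg = Message()
    msg["content-type"] = header_value
    # Both helpers normalise their result to lower case.
    return msg.get_content_type(), msg.get_content_charset()

media_type, charset = split_content_type("text/html; charset=UTF-8")
print(media_type, charset)

# Decoding bytes with the advertised charset, as a client would.
raw = "naïve café".encode(charset)
print(raw.decode(charset))
```

If the header carries no charset, the second element is `None` and the client has to guess the encoding.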

In [5]:
r.url

'https://cs.tulane.edu/~aculotta/'

In [6]:
# Note that this is the same HTML we would see if we just went to the page in a browser!
r.content[:5000]



**Point:** A really great resource is [What happens when you type google.com into the address bar](https://github.com/alex/what-happens-when), which goes through the whole stack!

In [7]:
r = requests.get('https://projects.propublica.org/politwoops/', timeout=10)
r.status_code

200

In [8]:
r.headers['content-type']

'text/html; charset=utf-8'

In [9]:
r.url

'https://projects.propublica.org/politwoops/'

In [None]:
r.content[:5000]

## Looking at HTTP Requests

We'll try to get some data from Google.  Note that this is kind of against the TOS and we **should not do it this way in general -- Google has very [specific rules on their site](https://developers.google.com/custom-search/v1/).**

In [11]:
params = {'q':'Tulane University'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}
r = requests.get('http://www.google.com/search', params = params, headers=headers, timeout=10)
r.status_code

200

In [12]:
r.url

'https://www.google.com/search?q=Tulane+University&gws_rd=ssl'
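The final URL shows how the `params` dictionary was serialised into a query string, plus a `gws_rd=ssl` parameter that was not in our `params` and so was presumably added on the server side during the redirect from `http` to `https`. A quick way to inspect such a URL, sketched here with the standard library's `urllib.parse`:

```python
from urllib.parse import urlsplit, parse_qs

final_url = "https://www.google.com/search?q=Tulane+University&gws_rd=ssl"

parts = urlsplit(final_url)
query = parse_qs(parts.query)  # '+' is decoded back to a space

print(parts.scheme)   # https
print(parts.netloc)   # www.google.com
print(query)          # {'q': ['Tulane University'], 'gws_rd': ['ssl']}
```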

In [13]:
r.headers['content-type']

'text/html; charset=UTF-8'

In [14]:
r.text[:5000]

'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="zh-TW"><head><meta charset="UTF-8"><meta content="origin" name="referrer"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Tulane University - Google 搜尋</title><script nonce="kNTI83Xap_yCMqM4r4t89Q">(function(){var b=window.addEventListener;window.addEventListener=function(a,c,d){"unload"!==a&&b(a,c,d)};}).call(this);(function(){var _g={kEI:\'0l7OZIaDK5ng2roP1Om26Ag\',kEXPI:\'31\',kBL:\'XHw5\',kOPI:89978449};(function(){var a;(null==(a=window.google)?0:a.stvsc)?google.kEI=_g.kEI:window.google=_g;}).call(this);})();(function(){google.sn=\'web\';google.kHL=\'zh-TW\';})();(function(){\nvar h=this||self;function l(){return void 0!==window.google&&void 0!==window.google.kOPI&&0!==window.google.kOPI?window.google.kOPI:null};var m,n=[];function p(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||m}function q(a){for(va

In [15]:
# This is a bit messy, let's use Beautiful Soup (we'll see this more later) to get just the text information.
from bs4 import BeautifulSoup
soup = BeautifulSoup( r.content )
print(soup.prettify()[:5000])
print("\n\nText only: \n\n")
print(soup.get_text().split()[:50])

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="zh-TW">
 <head>
  <meta charset="utf-8"/>
  <meta content="origin" name="referrer"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Tulane University - Google 搜尋
  </title>
  <script nonce="kNTI83Xap_yCMqM4r4t89Q">
   (function(){var b=window.addEventListener;window.addEventListener=function(a,c,d){"unload"!==a&&b(a,c,d)};}).call(this);(function(){var _g={kEI:'0l7OZIaDK5ng2roP1Om26Ag',kEXPI:'31',kBL:'XHw5',kOPI:89978449};(function(){var a;(null==(a=window.google)?0:a.stvsc)?google.kEI=_g.kEI:window.google=_g;}).call(this);})();(function(){google.sn='web';google.kHL='zh-TW';})();(function(){
var h=this||self;function l(){return void 0!==window.google&&void 0!==window.google.kOPI&&0!==window.google.kOPI?window.google.kOPI:null};var m,n=[];function p(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||m}

In [16]:
params = {'q':'Tulane University'}
r = requests.get('https://duckduckgo.com/', params = params, timeout=10)
r.status_code

200

In [17]:
r.url

'https://duckduckgo.com/static/duckduckgo/418.html?bno=2162&is_tor=0&is_ar=0'

In [18]:
r.headers['content-type']

'text/html; charset=UTF-8'

In [19]:
r.text

'<!DOCTYPE html>\n<!--[if IEMobile 7 ]> <html lang="en-US" class="no-js iem7"> <![endif]-->\n<!--[if lt IE 7]> <html class="ie6 lt-ie10 lt-ie9 lt-ie8 lt-ie7 no-js" lang="en-US"> <![endif]-->\n<!--[if IE 7]>    <html class="ie7 lt-ie10 lt-ie9 lt-ie8 no-js" lang="en-US"> <![endif]-->\n<!--[if IE 8]>    <html class="ie8 lt-ie10 lt-ie9 no-js" lang="en-US"> <![endif]-->\n<!--[if IE 9]>    <html class="ie9 lt-ie10 no-js" lang="en-US"> <![endif]-->\n<!--[if (gte IE 9)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!--><html class="no-js" lang="en-US" data-ntp-features="tracker-stats-widget:off"><!--<![endif]-->\n\n<head>\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n<meta http-equiv="content-type" content="text/html; charset=UTF-8;charset=utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=1" />\n<meta name="HandheldFriendly" content="true"/>\n<meta name="darkreader-lock" />\n\n<link rel="canonical" href="https://duckduckgo.com/418.html">\n\n<link rel

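Well, that's lame: instead of search results we were redirected to a static DuckDuckGo page (`418.html`, see `r.url` above), most likely because the request looks automated :-)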
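A status code of 200 alone does not tell us that we received the page we asked for. One hedged way to catch this case is to compare the location we requested with the location we ended up at. The helper name `landed_elsewhere` is hypothetical, and the URLs are the ones from the cells above:

```python
from urllib.parse import urlsplit

def landed_elsewhere(requested_url, final_url):
    """True if the final URL's host or path differs from the requested one."""
    req, fin = urlsplit(requested_url), urlsplit(final_url)
    # Treat an empty path and "/" as the same location; ignore scheme and query.
    return (req.netloc, req.path or "/") != (fin.netloc, fin.path or "/")

requested = "https://duckduckgo.com/"
final = "https://duckduckgo.com/static/duckduckgo/418.html?bno=2162&is_tor=0&is_ar=0"
print(landed_elsewhere(requested, final))  # True
```

With a live response, `requested` would be the URL passed to `requests.get` and `final` would be `r.url`; `r.history` also lists any intermediate redirect responses.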

## Simple API Call with Requests Library

It may be good to look at the reference documentation for the [requests library](https://2.python-requests.org/en/master/user/quickstart/).

First, let's have a look at the [GitHub API](https://developer.github.com/v3/).

In [20]:
r = requests.get('https://api.github.com/users/nmattei', timeout=10)
r.status_code

200

In [21]:
r.headers['content-type']

'application/json; charset=utf-8'

In [22]:
r.url

'https://api.github.com/users/nmattei'

In [23]:
r.content

b'{"login":"nmattei","id":1206578,"node_id":"MDQ6VXNlcjEyMDY1Nzg=","avatar_url":"https://avatars.githubusercontent.com/u/1206578?v=4","gravatar_id":"","url":"https://api.github.com/users/nmattei","html_url":"https://github.com/nmattei","followers_url":"https://api.github.com/users/nmattei/followers","following_url":"https://api.github.com/users/nmattei/following{/other_user}","gists_url":"https://api.github.com/users/nmattei/gists{/gist_id}","starred_url":"https://api.github.com/users/nmattei/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/nmattei/subscriptions","organizations_url":"https://api.github.com/users/nmattei/orgs","repos_url":"https://api.github.com/users/nmattei/repos","events_url":"https://api.github.com/users/nmattei/events{/privacy}","received_events_url":"https://api.github.com/users/nmattei/received_events","type":"User","site_admin":false,"name":"Nicholas Mattei","company":"Tulane University","blog":"http://www.nickmattei.net","location":null

In [24]:
r.json()

{'login': 'nmattei',
 'id': 1206578,
 'node_id': 'MDQ6VXNlcjEyMDY1Nzg=',
 'avatar_url': 'https://avatars.githubusercontent.com/u/1206578?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/nmattei',
 'html_url': 'https://github.com/nmattei',
 'followers_url': 'https://api.github.com/users/nmattei/followers',
 'following_url': 'https://api.github.com/users/nmattei/following{/other_user}',
 'gists_url': 'https://api.github.com/users/nmattei/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/nmattei/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/nmattei/subscriptions',
 'organizations_url': 'https://api.github.com/users/nmattei/orgs',
 'repos_url': 'https://api.github.com/users/nmattei/repos',
 'events_url': 'https://api.github.com/users/nmattei/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/nmattei/received_events',
 'type': 'User',
 'site_admin': False,
 'name': 'Nicholas Mattei',
 'company': 'Tulane Univer

## More Complicated with Parameters

We'll look for some information from the [Apple iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/).

In [25]:
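# Note: requests URL-encodes parameter values for us, so the literal '+' below is sent as %2B (see r.url).
params = {'term' : "the+meters"}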
r = requests.get('https://itunes.apple.com/search', params=params, timeout=10)
r.status_code

200

In [26]:
r.url

'https://itunes.apple.com/search?term=the%2Bmeters'
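The `%2B` in this URL is the percent-encoding of a literal `+`. The value in `params` is encoded for us, so writing the plus sign by hand asks the API for the string `the+meters` and not `the meters`. The standard library's `urlencode` shows the same encoding offline (a sketch to illustrate the encoding only, it makes no request):

```python
from urllib.parse import urlencode

# A hand-written '+' is escaped to %2B ...
print(urlencode({"term": "the+meters"}))   # term=the%2Bmeters
# ... while a plain space is encoded as '+'.
print(urlencode({"term": "the meters"}))   # term=the+meters
# Several parameters are joined with '&'.
print(urlencode({"term": "the meters", "entity": "album"}))   # term=the+meters&entity=album
```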

In [None]:
r.json()

We can pass several parameters in the same dictionary, as described in the [requests quickstart](https://2.python-requests.org/en/master/user/quickstart/).

In [28]:
params = {'term' : "the+meters", 'entity' : 'album'}
r = requests.get('https://itunes.apple.com/search', params=params, timeout=10)
r.status_code


200

In [29]:
r.url

'https://itunes.apple.com/search?term=the%2Bmeters&entity=album'

In [None]:
r.json()

In [31]:
x = r.json()

In [32]:
type(x['results'][0])

dict

## Converting the returned JSON to an object!

In [33]:
import json

In [34]:
data = json.loads(r.content)
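`json.loads` accepts `bytes` as well as `str`, which is why `r.content` can be passed in directly; for a response that carries valid JSON this should give the same structure as `r.json()`. A small offline sketch with a made-up payload (the field values are illustrative, not real API output):

```python
import json

# Hypothetical payload, shaped like a search response but not real data.
payload = b'{"results": [{"artistName": "A", "trackCount": 12}, {"artistName": "B", "trackCount": 16}]}'

data = json.loads(payload)                 # bytes are decoded automatically
print(type(data))                          # <class 'dict'>
print(type(data["results"]))               # <class 'list'>
print(data["results"][1]["trackCount"])    # 16
```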

In [None]:
data.keys()
data['results']

In [36]:
type(data['results'])

list

In [37]:
type(data['results'][1])

dict

In [38]:
data['results'][1]

{'wrapperType': 'collection',
 'collectionType': 'Album',
 'artistId': 7314214,
 'collectionId': 213532006,
 'amgArtistId': 4907,
 'artistName': 'The Meters',
 'collectionName': 'The Very Best of The Meters',
 'collectionCensoredName': 'The Very Best of The Meters',
 'artistViewUrl': 'https://music.apple.com/us/artist/the-meters/7314214?uo=4',
 'collectionViewUrl': 'https://music.apple.com/us/album/the-very-best-of-the-meters/213532006?uo=4',
 'artworkUrl60': 'https://is1-ssl.mzstatic.com/image/thumb/Music/ed/62/d2/mzi.bupngrkr.jpg/60x60bb.jpg',
 'artworkUrl100': 'https://is1-ssl.mzstatic.com/image/thumb/Music/ed/62/d2/mzi.bupngrkr.jpg/100x100bb.jpg',
 'collectionPrice': 9.99,
 'collectionExplicitness': 'notExplicit',
 'trackCount': 16,
 'copyright': '℗ 2005 Warner Strategic Marketing',
 'country': 'USA',
 'currency': 'USD',
 'releaseDate': '2005-03-29T08:00:00Z',
 'primaryGenreName': 'R&B/Soul'}

In [39]:
data['results'][1].keys()

dict_keys(['wrapperType', 'collectionType', 'artistId', 'collectionId', 'amgArtistId', 'artistName', 'collectionName', 'collectionCensoredName', 'artistViewUrl', 'collectionViewUrl', 'artworkUrl60', 'artworkUrl100', 'collectionPrice', 'collectionExplicitness', 'trackCount', 'copyright', 'country', 'currency', 'releaseDate', 'primaryGenreName'])

So that works really well to get a dict, but more importantly pandas will convert the list of results to a DataFrame for us! The cell below uses `pd.DataFrame.from_dict`; for reading JSON directly from a string or file, see the [read_json() function](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html).

In [40]:
df_t = pd.DataFrame.from_dict(data["results"])
df_t

Unnamed: 0,wrapperType,collectionType,artistId,collectionId,amgArtistId,artistName,collectionName,collectionCensoredName,artistViewUrl,collectionViewUrl,...,artworkUrl100,collectionPrice,collectionExplicitness,trackCount,copyright,country,currency,releaseDate,primaryGenreName,contentAdvisoryRating
0,collection,Album,7314214,59401239,4907.0,The Meters,The Meters,The Meters,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/the-meters/59...,...,https://is2-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,12,℗ 2004 Warner Records Inc. Manufactured and Ma...,USA,USD,2005-02-08T08:00:00Z,R&B/Soul,
1,collection,Album,7314214,213532006,4907.0,The Meters,The Very Best of The Meters,The Very Best of The Meters,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/the-very-best...,...,https://is1-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,16,℗ 2005 Warner Strategic Marketing,USA,USD,2005-03-29T08:00:00Z,R&B/Soul,
2,collection,Album,7314214,56763785,4907.0,The Meters,Funkify Your Life: The Meters Anthology,Funkify Your Life: The Meters Anthology,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/funkify-your-...,...,https://is3-ssl.mzstatic.com/image/thumb/Music...,29.99,notExplicit,43,℗ 2005 Atlantic Recording Corp. Manufactured &...,USA,USD,2005-03-29T08:00:00Z,R&B/Soul,
3,collection,Album,7314214,59401178,4907.0,The Meters,Rejuvenation,Rejuvenation,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/rejuvenation/...,...,https://is1-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,9,℗ 2004 Atlantic Recording Corp. Manufactured &...,USA,USD,2005-02-08T08:00:00Z,R&B/Soul,
4,collection,Album,7314214,59401414,4907.0,The Meters,Fire On the Bayou,Fire On the Bayou,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/fire-on-the-b...,...,https://is5-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,11,℗ 2004 Atlantic Recording Corp. Manufactured &...,USA,USD,2005-02-08T08:00:00Z,R&B/Soul,
5,collection,Album,7314214,159362708,4907.0,The Meters,Look-Ka Py Py,Look-Ka Py Py,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/look-ka-py-py...,...,https://is3-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,12,℗ 2004 Warner Strategic Marketing,USA,USD,2005-02-08T08:00:00Z,R&B/Soul,
6,collection,Album,7314214,80003971,4907.0,The Meters,Rhino Hi-Five: The Meters - EP,Rhino Hi-Five: The Meters - EP,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/rhino-hi-five...,...,https://is5-ssl.mzstatic.com/image/thumb/Music...,3.99,notExplicit,5,℗ 2005 Warner Strategic Marketing Inc.,USA,USD,2005-09-20T07:00:00Z,R&B/Soul,
7,collection,Album,7314214,59401344,4907.0,The Meters,Cabbage Alley,Cabbage Alley,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/cabbage-alley...,...,https://is5-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,10,℗ 2004 Warner Records Inc. Manufactured & Mark...,USA,USD,2005-02-08T08:00:00Z,R&B/Soul,
8,collection,Album,7314214,257194137,4907.0,The Meters,The Essentials: The Meters,The Essentials: The Meters,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/the-essential...,...,https://is4-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,12,℗ 2005 Warner Records Inc. Manufactured and Ma...,USA,USD,2005-03-29T08:00:00Z,R&B/Soul,
9,collection,Album,7314214,80014990,4907.0,The Meters,Kickback,Kickback,https://music.apple.com/us/artist/the-meters/7...,https://music.apple.com/us/album/kickback/8001...,...,https://is3-ssl.mzstatic.com/image/thumb/Music...,9.99,notExplicit,14,℗ 2005 Warner Records Inc. Manufactured & Mark...,USA,USD,2005-09-20T07:00:00Z,R&B/Soul,


## Using Beautiful Soup to Parse a Webpage.

See the [beautifulsoup4 documentation](https://www.crummy.com/software/BeautifulSoup/) for reference.

In [41]:
# Grab Prof. Culotta's homepage again.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://cs.tulane.edu/~aculotta/', timeout=10)

soup = BeautifulSoup( r.content )

In [42]:
r.content[:5000]



In [43]:
soup.prettify()[:5000]

'<html>\n <head>\n  <link href="style.css" rel="stylesheet"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n </head>\n <body>\n  <div class="main">\n   aron culotta\n   <br/>\n   associate professor of computer science\n   <br/>\n   tulane university\n   <br/>\n   <script language="javascript" type="text/javascript">\n    // Email obfuscator script 2.1 by Tim Williams, University of Arizona\n\t\t\t{ coded = "LzVZhppL@pVZLJW.WlV"\n  \t\t\t\tkey = "fID9HXVvy5dwPqtn0FjROcgms4kr1l7xKpJeaSQ3zoA8NUEYZubiMBWLTGC6h2"\n\t  \t\t\tshift=coded.length\n\t\t\t\t  link=""\n\t\t\t\t  for (i=0; i<coded.length; i++) {\n\t\t\t\t    if (key.indexOf(coded.charAt(i))==-1) {\n\t\t\t\t      ltr = coded.charAt(i)\n\t\t\t\t      link += (ltr)\n\t\t\t\t    }\n\t\t\t\t    else { \n\t\t\t\t      ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length\n\t\t\t\t      link += (key.charAt(ltr))\n\t\t\t\t    }\n\t\t\t\t  }\n\t\t\t\tdocument.write(link)\n\t\t\t\t}\n\t\t\t\t//-->\n 

In [44]:
soup.find("table")

<table border="0" class="teaching">
<tr>
<td>    </td>
<td>tulane</td>
<td>    </td>
<td><a href="https://github.com/tulane-cmps6730/main">natural language processing</a>       </td>
<td><a href="https://github.com/tulane-cmps2200/slides/tree/2020-fall">intro to algorithms</a></td>
</tr>
<tr>
<td>    </td>
<td>illinois tech</td>
<td>    </td>
<td><a href="https://github.com/iit-cs579/main">online social network analysis</a></td>
<td><a href="https://github.com/iit-cs429/main">information retrieval</a></td>
</tr>
</table>

In [None]:
# The above gets the first table, but there could be a lot more!
# (find_all is the current name for the older findAll alias.)
soup.find_all("table")

In [46]:
# Find all links!

soup.find("table").find_all("a")

[<a href="https://github.com/tulane-cmps6730/main">natural language processing</a>,
 <a href="https://github.com/tulane-cmps2200/slides/tree/2020-fall">intro to algorithms</a>,
 <a href="https://github.com/iit-cs579/main">online social network analysis</a>,
 <a href="https://github.com/iit-cs429/main">information retrieval</a>]
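Each `<a>` tag carries its target in the `href` attribute, which is usually what we want to keep from a list of links. As a dependency-free sketch of that idea, the block below uses the standard library's `html.parser` and not Beautiful Soup; the class name `LinkCollector` is hypothetical, and the HTML is two cells copied from the table above:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '''<table><tr>
<td><a href="https://github.com/iit-cs579/main">online social network analysis</a></td>
<td><a href="https://github.com/iit-cs429/main">information retrieval</a></td>
</tr></table>'''

collector = LinkCollector()
collector.feed(html)
for href, text in collector.links:
    print(text, "->", href)
```

With Beautiful Soup, the same information is available from each tag in the list above via `a['href']` and `a.get_text()`.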

So we can use Pandas and BS4 together as well -- we'll see a lot more of this in the lab this week!

In [47]:
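from io import StringIO

df_tables = []
for t in soup.find_all("table"):
    # Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string.
    df_t = pd.read_html(StringIO(str(t)))
    df_tables.append(df_t[0])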

for t in df_tables:
    display(t)

Unnamed: 0,0,1,2,3,4
0,,tulane,,natural language processing,intro to algorithms
1,,illinois tech,,online social network analysis,information retrieval


Unnamed: 0,0
0,Online Reviews Are Leading Indicators of Chang...
1,Forecasting COVID-19 Vaccination Rates using S...
2,Safety Reviews on Airbnb: An Information Tale ...
3,Reducing Cross-Topic Political Homogenization ...
4,Identifying Hurricane Evacuation Intent on Twi...
...,...
72,Dependency tree kernels for relation extractio...
73,Interactive information extraction with constr...
74,Confidence estimation for information extracti...
75,Extracting social networks and contact informa...


## Trying out some Regular Expressions.

In [48]:
import re
# Find the index in the raw HTML where we first mention CMPS 3160.

# Note that we write patterns with the r prefix (a raw string) so that
# backslashes and other special characters reach the regex engine unchanged.

r = requests.get('https://nmattei.github.io/cmps3160/syllabus/', timeout=10)


In [49]:
# Let's see what we got.
r.text[:5000]

'<!DOCTYPE html>\n<html lang="en">\n  <!-- Beautiful Jekyll | MIT license | Copyright Dean Attali 2016 -->\n  <head>\n  <meta charset="utf-8" />\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, viewport-fit=cover">\n\n  <title>Fall 2022 Syllabus</title>\n\n  <meta name="author" content="Nicholas Mattei" />\n\n  \n\n  <link rel="alternate" type="application/rss+xml" title="CMPS 3160 Intro. to Data Science - Intro to Data Science - Fall 2020" href="https://nmattei.github.io/cmps3160/feed.xml" />\n\n  \n\n  \n\n  \n\n\n  \n    \n      \n  <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css" />\n\n\n    \n  \n\n  \n    \n      <link rel="stylesheet" href="/cmps3160/css/bootstrap.min.css" />\n    \n      <link rel="stylesheet" href="/cmps3160/css/bootstrap-social.css" />\n    \n      <link rel="stylesheet" href="/cmps3160/css/main.css" />\n    \n

In [50]:
match = re.search(r'CMPS 3160', r.text)
print(match.start())

460


In [51]:
r.text[390:500]

'ei" />\n\n  \n\n  <link rel="alternate" type="application/rss+xml" title="CMPS 3160 Intro. to Data Science - Intro'

In [52]:
# Does the start match? re.match only looks at the very beginning of the string.
match = re.match(r'CMPS 3160', r.text)
print(match)

None
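`re.match` only succeeds when the pattern matches at position 0, whereas `re.search` scans the whole string; that is why the same pattern gives `None` here but a hit at index 460 above. A tiny offline sketch of the difference (the sample string is made up):

```python
import re

sample = "Welcome to CMPS 3160"

print(re.match(r"CMPS 3160", sample))            # None: the string starts with "Welcome"
print(re.search(r"CMPS 3160", sample).start())   # 11
print(re.match(r"Welcome", sample).group())      # Welcome
# fullmatch requires the pattern to cover the entire string.
print(re.fullmatch(r"CMPS 3160", "CMPS 3160") is not None)   # True
```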


In [53]:
# Iterate over all occurrences and print a few characters.
for m in re.finditer(r'CMPS 3160', r.text):
    print(r.text[m.start()-50:m.start()+50])


rel="alternate" type="application/rss+xml" title="CMPS 3160 Intro. to Data Science - Intro to Data S
-brand" href="https://nmattei.github.io/cmps3160">CMPS 3160 Intro. to Data Science</a></div>

    <d
the <a href="https://github.com/nmattei/cmps3160">CMPS 3160 Github</a>.</p>
  </li>
  <li>
    <p>Al
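One caveat with the slice in the loop above: if a match starts fewer than 50 characters into the text, `m.start()-50` becomes negative and Python counts from the end of the string, which usually gives an empty or misleading snippet. A hedged variant that clamps the window (the helper name `context` and the sample string are hypothetical):

```python
import re

def context(text, match, width=50):
    """Return up to `width` characters on either side of the match start."""
    start = max(0, match.start() - width)   # never let the index go negative
    return text[start:match.start() + width]

sample = "CMPS 3160 Intro. to Data Science " + "x" * 100 + " CMPS 3160 Github"
snippets = [context(sample, m, width=20) for m in re.finditer(r"CMPS 3160", sample)]
for s in snippets:
    print(repr(s))

# The unclamped slice for the first match (start 0) comes back empty.
print(repr(sample[0 - 20:0 + 20]))
```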


In [54]:
# Find them all, plus the word right after.
match = re.findall(r'CMPS 3160\s\w*', r.text)
print(match)

['CMPS 3160 Intro', 'CMPS 3160 Intro', 'CMPS 3160 Github']


In [55]:
# Can we find all the email addresses?
text = ''' This is a list that has an @ symbol in it.
            But we want to find Nick's address nsmattei@tulane.edu
            But also maybe someone else's eli@gmail.com....
            How would we write a regex for that?


            Also there is more text, and can't like
            phil123@school.edu also be able to be caught?



'''

# Need to test on a few first..
# What rules do we need?
regex = r'\D\w*@\w+\.\w{3}'
match = re.findall(regex, text)
print(match)


[' nsmattei@tulane.edu', ' eli@gmail.com', ' phil123@school.edu']


In [56]:
### ANSWER for full email (the dot is escaped; a bare . would match any character)
regex = r'\w+@\w+\.\w{3}'
match = re.findall(regex, text)
print(match)

['nsmattei@tulane.edu', 'eli@gmail.com', 'phil123@school.edu']
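The pattern above assumes a three-letter top-level domain and a local part made only of word characters. A slightly more permissive sketch, which is still far from a full address validator, allows dots, hyphens and plus signs in the local part and any number of dot-separated domain labels (the sample addresses are made up):

```python
import re

# local part: word chars, dots, plus, hyphen; domain: labels joined by dots
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

sample = "Write to first.last@cs.example.edu or bob+lists@example.io, not to @ alone."
found = EMAIL.findall(sample)
print(found)   # ['first.last@cs.example.edu', 'bob+lists@example.io']
```

Because every domain label must contain at least one character after its dot, trailing sentence punctuation such as `eli@gmail.com....` is left out of the match.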


In [57]:
### Only names, no domains...
regex = r'\w+@'
match = re.findall(regex, text)
print(match)

['nsmattei@', 'eli@', 'phil123@']


In [58]:
## Eli's more complicated answer with lookaheads (letters only, so phil123 comes back as phil)
regex = r"[A-Za-z]+(?=[^A-Za-z\s]*@)"
match = re.findall(regex, text)
print(match)

['nsmattei', 'eli', 'phil']
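Note that the lookahead version returns `phil` and not `phil123`, because the character class only admits letters. If the whole local part is wanted, a capture group is a simpler alternative: when a pattern contains exactly one group, `re.findall` returns just that group. A minimal sketch on a one-line sample built from the addresses above:

```python
import re

sample = "Contacts: nsmattei@tulane.edu, eli@gmail.com and phil123@school.edu"

# The parentheses mark what to return; the '@' must follow but is not included.
with_group = re.findall(r"(\w+)@", sample)
print(with_group)        # ['nsmattei', 'eli', 'phil123']

# The same idea with a lookahead instead of a group.
with_lookahead = re.findall(r"\w+(?=@)", sample)
print(with_lookahead)    # ['nsmattei', 'eli', 'phil123']
```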


In [59]:
# Now we can use this on the webpage! The dot is escaped so it only matches a literal '.'.
regex = r'\w+@\w+\.\w{3}'
match = re.findall(regex, r.text)
print(match)

['kshvaram@tulane.edu', 'kshvaram@tulane.edu', 'skellum@tulane.edu', 'skellum@tulane.edu', 'nsmattei@tulane.edu', 'nsmattei@tulane.edu', 'aculotta@tulane.edu', 'aculotta@tulane.edu', 'goldman@tulane.edu', 'srss@tulane.edu', 'srss@tulane.edu', 'msmith76@tulane.edu', 'msmith76@tulane.edu']


In [60]:
# More complicated RegExes - Groups
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(\w{3,})'

text = ''' The university of kentucky is the best
            basketball team and an ok university. and University of North CC
            The University Of Kentucky can be put in
            some weird capitalization and University of Ken spelled wrong'''
m = re.search( regex, text)
print(m.groups())

('university', 'of', 'kentucky')


In [61]:
# Find all
print(re.findall(regex, text))

[('university', 'of', 'kentucky'), ('University', 'of', 'North'), ('University', 'Of', 'Kentucky'), ('University', 'of', 'Ken')]


In [62]:
# Named Groups.
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(?P<school>\w{3,})'
text = ''' The university of kentucky is the best University of Louisiana
            basketball team and an ok university.
            The University Of Kentucky can be put in
            some weird capitalization'''
m = re.search( regex, text)
print(m.groupdict())


{'school': 'kentucky'}


In [63]:
# Find all named groups

# Named Groups.
regex = r'\s*([Uu]niversity)\s([Oo]f)\s(?P<school>\w{3,})'
text = ''' The university of kentucky is the best
            basketball team and an ok university.
            The University Of Kentucky can be put in
            some weird capitalization.  And Kentucky is much better than
            the University of Mississippi.'''
for m in re.finditer(regex, text):
    print(m.groupdict())


{'school': 'kentucky'}
{'school': 'Kentucky'}
{'school': 'Mississippi'}
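Spelling out `[Uu]` and `[Oo]` handles two capitalizations per word but would still miss `UNIVERSITY OF`. The `re.IGNORECASE` flag is a more general option; the sketch below combines it with the named group from the cell above (the sample sentence is made up):

```python
import re

pattern = re.compile(r"university\s+of\s+(?P<school>\w{3,})", re.IGNORECASE)

sample = "The university of kentucky, the UNIVERSITY OF Mississippi and University Of Ken."
schools = [m.group("school") for m in pattern.finditer(sample)]
print(schools)   # ['kentucky', 'Mississippi', 'Ken']
```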


In [64]:
'abcabcabc'.replace('a', 'X')

'XbcXbcXbc'

In [65]:
text = 'I love Introduction to Data Science'
re.sub(r'Data Science', r'Schmada Schmience', text)

'I love Introduction to Schmada Schmience'

In [66]:
re.sub(r'(\w+)\s([Ss]cience)', r'\2 \1hmience', text)


'I love Introduction to Science Datahmience'
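In the replacement string, `\1` and `\2` refer to the first and second captured groups, which is how the two words get swapped. Groups can also be named, and when the replacement needs real logic `re.sub` accepts a function that receives each match object. A short sketch (the function name `shout` is hypothetical):

```python
import re

text = "I love Introduction to Data Science"

# \g<name> is the named-group form of a backreference.
swapped = re.sub(r"(?P<first>\w+)\s(?P<second>[Ss]cience)", r"\g<second> \g<first>", text)
print(swapped)   # I love Introduction to Science Data

def shout(match):
    """Upper-case whatever the pattern matched."""
    return match.group(0).upper()

shouted = re.sub(r"Data Science", shout, text)
print(shouted)   # I love Introduction to DATA SCIENCE
```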

In [67]:
# Let's use it to parse part of a CSV?
text = '12,15,22,36,78,33,77,33,45'

# Use Regex split command
print(re.split(',', text))

# Use string split command
print(text.split(","))

# Use Regex to split into groups... note the comma lands inside each group and the final 45 is missed (no trailing comma).
regex = r'(?P<data>\d*,)'
for m in re.finditer(regex, text):
    print(m.groupdict())


['12', '15', '22', '36', '78', '33', '77', '33', '45']
['12', '15', '22', '36', '78', '33', '77', '33', '45']
{'data': '12,'}
{'data': '15,'}
{'data': '22,'}
{'data': '36,'}
{'data': '78,'}
{'data': '33,'}
{'data': '77,'}
{'data': '33,'}
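The named-group pattern above keeps the comma inside each match and never reports the final `45`, because no comma follows it. Matching the digits themselves avoids both issues; a minimal sketch:

```python
import re

text = "12,15,22,36,78,33,77,33,45"

# Match runs of digits; the separators are simply skipped.
values = [int(m.group("data")) for m in re.finditer(r"(?P<data>\d+)", text)]
print(values)        # [12, 15, 22, 36, 78, 33, 77, 33, 45]
print(sum(values))   # 351
```

For real CSV files, with quoting and embedded commas, the standard `csv` module or `pd.read_csv` is the safer tool.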
