## The data collection process

You will typically get data in one of four ways:
1. Directly download a data file (or files) manually
2. Query data from a database
3. Query an API (usually web-based, these days)
4. Scrape data from a webpage

The vast majority of automated data queries you will run will use HTTP requests (HTTP has become the dominant protocol for much more than just querying web pages).

In Python, we do this with the requests library. Some common request calls and response fields are shown in the cells below:

In [216]:
import requests  # import the requests library
response = requests.get("http://www.datasciencecourse.org")

# some relevant fields are shown in the cells below (the last one is response.headers['Content-Type'])
response.status_code


200

In [217]:
response.content # raw bytes; response.text gives the decoded string. BeautifulSoup(response.content) prints this more readably

b'<!DOCTYPE html>\n<html lang="en">\n  <!-- Beautiful Jekyll | MIT license | Copyright Dean Attali 2016 -->\n  <head>\n    <meta charset="UTF-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta name="theme-color" content="#157878" />\n    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />\n\n    <title>Practical Data Science</title>\n\n    <meta name="author" content="Practical Data Science" />\n    \n\n    <link rel="alternate" type="application/rss+xml" title="Practical Data Science - CMU 15-388/688 Spring 2019" href="/feed.xml" />\n  \n    \n      \n        <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.6.0/css/font-awesome.min.css" />\n      \n    \n  \n    \n      \n        <link rel="stylesheet" href="/css/bootstrap.min.css" />\n      \n        <link rel="stylesheet" href="/css/bootstrap-social.css" />\n      \n        <link r

In [213]:
response.headers

{'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Wed, 08 Apr 2020 20:03:15 GMT', 'ETag': 'W/"5e8e2e03-14e9c"', 'Access-Control-Allow-Origin': '*', 'Expires': 'Sat, 11 Apr 2020 01:06:41 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'X-Proxy-Cache': 'MISS', 'X-GitHub-Request-Id': '8D58:7582:54902:6CACA:5E9115C2', 'Content-Length': '10292', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 11 Apr 2020 00:56:41 GMT', 'Via': '1.1 varnish', 'Age': '0', 'Connection': 'keep-alive', 'X-Served-By': 'cache-lcy19221-LCY', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1586566602.703988,VS0,VE85', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '9c691cc92e553838e89a550ab5df2faa04c30450'}

In [214]:
response.headers['Content-Type']

'text/html; charset=utf-8'

In [215]:
response = requests.get("http://www.cmu.edu")
print("Status Code:", response.status_code)
print("Headers:", response.headers)

Status Code: 200
Headers: {'Date': 'Sat, 11 Apr 2020 00:57:37 GMT', 'Server': 'Apache', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'x-frame-options': 'SAMEORIGIN', 'Vary': 'Referer', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=7200, must-revalidate', 'Expires': 'Sat, 11 Apr 2020 02:57:37 GMT', 'Keep-Alive': 'timeout=5, max=500', 'Connection': 'Keep-Alive', 'Transfer-Encoding': 'chunked', 'Content-Type': 'text/html'}


In [195]:
print(response.text[:480]) # print the first 480 characters of the response text

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8"/>
    <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
    <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
    <title>Homepage -     CMU - Carnegie Mellon University</title>    
    <meta content="CMU is a global research university known for its world-class, interdisciplinary programs: arts, business, computing, engineering, humanities, policy and science." name="description"/>
  


#### HTTP Request Basics

Consider the URL https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&cad=rja&uact=8… The part before the question mark (?) is the address you want to query, and the odd-looking statements after it are parameters, which you can pass through the requests library as shown in the cell below.

The parameters map naturally onto a Python dictionary:
- in each pair, the text before the (=) sign is the key and the text after it is the value, and the (&) sign separates pairs much as the (,) does in a dictionary, e.g. sa=t&rct=j => "sa":"t", "rct":"j".

In [130]:
param = {"sa":"t", "rct":"j", "q":"", "esrc":"s", "source":"web", "cd":"9", "cad":"rja", "uact":"8"}
response = requests.get("http://www.google.com/url", params=param)
response.headers['Content-Type']

'text/html; charset=ISO-8859-1'

Another example: https://www.google.com/search?q=python+download+url+content&source=chrome

- If you’ve seen URLs before, you’ve noticed that a lot of content needs to be encoded in these parameters, such as spaces replaced with the code “%20” (the Google URL above also accepts the “+” character, but “%20” is the actual encoding of a space).
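The encoding rules above can be tried out without sending any request. The following is a minimal sketch using Python's standard-library urllib.parse module (not the requests library used elsewhere in these notes); the query text is taken from the Google search URL above.

```python
from urllib.parse import quote, quote_plus, urlencode

query = "python download url content"

# quote() percent-encodes each space as %20
encoded_percent = quote(query)

# quote_plus() uses the "+" shorthand for spaces instead
encoded_plus = quote_plus(query)

# urlencode() builds a whole query string from a dictionary of parameters
query_string = urlencode({"q": query, "source": "chrome"})

print(encoded_percent)
print(encoded_plus)
print(query_string)
```

When you pass a dictionary through the params argument of requests.get(), the library takes care of this encoding for you.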

HTTP GET is the most common method, but there are also PUT, POST, DELETE methods that change some state on the server:

- response = requests.put(...)
- response = requests.post(...)
- response = requests.delete(...)

## RESTful APIs

When we query web APIs, we will most likely encounter REST (Representational State Transfer) APIs. REST is more of a design architecture, but a few key points apply:

1. Uses the standard HTTP interface and methods (GET, PUT, POST, DELETE). You will probably see GET and POST used most frequently.
2. Stateless – REST servers don’t store state (the server doesn’t remember what you were doing). This means that each time you issue a request, you need to include all relevant information like your account key, etc.
3. REST calls will usually return information in a nice format, typically JSON. The requests library can parse it for you: calling response.json() returns a Python dictionary (or list) with the relevant data.

**Rule of thumb:** if you’re sending your account key along with each API call, you’re probably using a REST API.

### Querying a RESTful API

You query a REST API much like a standard HTTP request, but you will almost always need to include parameters. This means the request method takes two arguments, the URL and the params, e.g. requests.get("http://www.google.com/url", params=params).

The example in the cell below queries a GitHub account. You can get an access token for GitHub at https://github.com/settings/tokens/new or under your account > Settings > Developer settings > Generate new token.

The GitHub API uses GET/PUT/DELETE to let you query or update elements in your GitHub account automatically.
This is an example of REST: the server doesn’t remember your last queries, so you always need to include your access token when using it this way:

In [7]:
token = "" # paste the token generated from your account between the quotes; be careful not to share it publicly
response = requests.get("https://api.github.com/user", params={"access_token":token})
print(response.headers)

{'Date': 'Thu, 09 Apr 2020 19:04:45 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Content-Length': '125', 'Server': 'GitHub.com', 'Status': '401 Unauthorized', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '59', 'X-RateLimit-Reset': '1586462685', 'X-GitHub-Media-Type': 'github.v3; format=json', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Vary': 'Accept-Encoding, Accept, X-Requested-With', 'X-GitHub-Request-Id': 'AF16:3F380:3029718:39

In [8]:
print(response.status_code)

401


In [9]:
print(response.headers['content-type'])

application/json; charset=utf-8
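Passing the token as a URL parameter, as in the cell above, puts it in the address itself. A common alternative is to send it in an Authorization header on every call. The sketch below only builds such a request with the standard-library urllib.request module and does not send it. The token value is a placeholder, and the "token ..." header scheme is an assumption to check against the current GitHub documentation.

```python
from urllib.request import Request

token = "YOUR_TOKEN_HERE"  # placeholder, not a real token

# Every request carries its own credentials, because the server keeps no session state
request = Request(
    "https://api.github.com/user",
    headers={"Authorization": "token " + token},  # header scheme is an assumption
)

print(request.get_method())
print(request.full_url)
print(request.get_header("Authorization"))
```

With the requests library, the same header can be passed through the headers argument of requests.get().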


## Authentication
Basic authentication has traditionally been the most common approach to access control for web pages. Most APIs have since replaced it with some form of OAuth, so expect an authentication procedure that is more involved than the example below.

With the requests library, you use Basic Authentication by simply passing the login and password as the auth argument to the relevant calls, as below.

In [202]:
response = requests.get("https://api.github.com/user", auth=('drgreatwonder', 'passwd'))
print(response)

<Response [401]>
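Under the hood, Basic Authentication adds an Authorization header containing the Base64 encoding of "login:password". The following sketch rebuilds that header by hand with the standard library, using the same placeholder credentials as the cell above, to show that the password is only encoded, not encrypted.

```python
import base64

login, password = "drgreatwonder", "passwd"  # placeholder credentials from the cell above

# Basic auth joins login and password with a colon and Base64-encodes the result
credentials = f"{login}:{password}".encode("utf-8")
header_value = "Basic " + base64.b64encode(credentials).decode("ascii")
print(header_value)

# Anyone who sees the header can reverse the encoding, so only use it over HTTPS
decoded = base64.b64decode(header_value.split(" ")[1]).decode("utf-8")
print(decoded)
```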


#### The 401 response
The HTTP 401 Unauthorized status code above is a client error: it indicates that the request was not applied because it lacks valid authentication credentials for the target resource.

In other words, the page you were trying to access cannot be loaded until you log in with a valid user ID and password. If you have just logged in and received a 401 Unauthorized error, the credentials you entered were invalid for some reason.

## Common data formats and handling

#### Data formats
The three most common formats:
1. CSV (comma-separated values) files
2. JSON (JavaScript Object Notation) files and strings
3. HTML/XML (HyperText Markup Language / Extensible Markup Language) files and strings


1. CSV Files
The term refers to any delimited text file (not always separated by commas). If values themselves contain commas, you can enclose them in quotes (many files quote every value just to be safe).

Example from the slides:

import pandas as pd

dataframe = pd.read_csv("CourseRoster_F16_15688_B_08.30.2016.csv", delimiter=',', quotechar='"')


**result:** "Semester","Course","Section","Lecture","Mini","Last Name","Preferred/First Name","MI","Andrew ID","Email","College","Department","Class","Units","Grade Option","QPA Scale","Mid-Semester Grade","Final Grade","Default Grade","Added By","Added On","Confirmed","Waitlist Position","Waitlist Rank","Waitlisted By","Waitlisted On","Dropped By","Dropped On","Roster As Of Date"
"F16","15688","B","Y","N","Kolter","Zico","","zkolter","zkolter@andrew.cmu.edu","SCS","CS","50","12.0","L","4+"," "," ","","reg","1 Jun 2016","Y","","","","","","","30 Aug 2016 4:34"


In [419]:
#import pandas as pd
#dataframe = pd.read_csv("kpit_weather.csv")
#print(dataframe)
#header=None

import pandas as pd
dataframe = pd.read_csv("kpit_weather.csv", delimiter=",", quotechar='"')
#dataframe = pd.read_csv("kpit_weather.csv", delimiter=".", quotechar='"') # gave a diff result
dataframe.head()


Unnamed: 0,ZTime,Time,OAT,DT,SLP,WD,WS,SKY,PPT,PPT6,Plsr.Event,Plsr.Source
0,20170820040000,20170820000000,178,172,10171,0,0,0,0,-9999,,
1,20170820050000,20170820010000,178,172,10177,0,0,0,0,-9999,,
2,20170820060000,20170820020000,167,161,10181,0,0,0,0,-9999,,
3,20170820070000,20170820030000,161,161,10182,0,0,4,0,-9999,,
4,20170820080000,20170820040000,156,156,10186,180,15,-9999,0,-9999,,


We don’t actually need the delimiter or quotechar arguments here, because the default delimiter is already a comma (which is what this CSV file uses), but you can pass a different value to read files that use another delimiter.

One issue that can come up is when a value itself contains the delimiter; to get around this, you can surround the value with the quotechar character. Many CSV files simply put quotes around every entry by default. Our file here doesn’t contain quotes, so it is not an issue, but it is a common occurrence when handling CSV files.

One final thing to note is that, by default, the first row of the file is treated as a header row that lists the name of each column. If the file has no header row, you can load the data with the additional header=None argument.
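To make the roles of the delimiter, the quote character, and the missing header row concrete, here is a small sketch that uses Python's built-in csv module on an in-memory string. The notes use pandas; the standard library is used here only so the example runs without a data file. The sample rows are made up for illustration.

```python
import csv
import io

# A made-up, semicolon-delimited file with no header row.
# The second field of the first row contains the delimiter, so it is quoted.
raw = 'Ada;"Pittsburgh; PA";12.0\nGrace;Boston;9.0\n'

reader = csv.reader(io.StringIO(raw), delimiter=";", quotechar='"')
rows = list(reader)

for row in rows:
    print(row)
```

With pandas, the corresponding call would pass delimiter=";", quotechar='"' and header=None to pd.read_csv().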

## JSON files / strings
JSON originated as a way of encapsulating JavaScript objects. A number of different data types can be represented:

1. Number: 1.0 (always floating point in JavaScript; Python’s json module parses 1 as an int and 1.0 as a float)
2. String: "string"
3. Boolean: true or false
4. List (Array): [item1, item2, item3,…]
5. Dictionary (Object in JavaScript): {"key":value}
6. Lists and Dictionaries can be embedded within each other: [{"key":[value1, [value2, value3]]}]


**N/B:** In JSON, keys have to be strings. To parse JSON held in a string, use the json.loads() function; if you have the data as a file (i.e., as a file object opened with the Python open() command), you can use the json.load() function instead.
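As a quick illustration of how these JSON types map onto Python types, the sketch below parses a made-up JSON string with the built-in json module. An io.StringIO object stands in for a file opened with open().

```python
import io
import json

# A made-up JSON document that uses each of the basic types
raw = '{"name": "data science", "units": 12.0, "count": 3, "open": true, "tags": ["a", "b"], "extra": null}'

# json.loads() parses JSON held in a string
data = json.loads(raw)
print(type(data["units"]), type(data["count"]), data["open"], data["extra"])

# json.load() does the same for a file object; StringIO stands in for an open file
same_data = json.load(io.StringIO(raw))
print(same_data == data)

# Non-string dictionary keys are converted to strings when writing JSON
key_as_string = json.dumps({1: "one"})
print(key_as_string)
```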

#### Parsing JSON in Python

Python has a built-in library, json, for reading and writing Python objects from and to JSON files.


In [422]:
import json
# load json from a REST API call
token = "<your-access-token>" # placeholder; never paste a real token into a shared notebook
response = requests.get("https://api.github.com/user", params={"access_token":token})
data = json.loads(response.content) # or response.json() to parse the JSON straight from the response
print(data, "\n")
print(response.json().keys()) # or use .values() in place of .keys() to get the values
#json.load(file) # load json from file
json.dumps(data) # converts a Python dictionary to a JSON string; strings in the output always use double quotes
#json.dump(obj, file) # write json to file

{'message': 'Bad credentials', 'documentation_url': 'https://developer.github.com/v3'} 

dict_keys(['message', 'documentation_url'])


'{"message": "Bad credentials", "documentation_url": "https://developer.github.com/v3"}'

### XML/HTML files

XML and HTML are the main formats for the web (though XML seems to be losing a bit of popularity to JSON for use in APIs and file formats). XML files contain hierarchical content delineated by tags; the BeautifulSoup cell further below parses a small example.


#### Parsing XML/HTML in Python

There are a number of XML/HTML parsers for Python, but a nice one for data science is the BeautifulSoup library (specifically focused on getting data out of XML/HTML files).

The BeautifulSoup() call creates the object to parse, where the second argument specifies the parser (“lxml-xml” indicates that it is actually XML data, whereas “lxml” is the more common parser for parsing HTML files). 

As illustrated below, when the hierarchical layout of the data is fairly simple (here a “tag” containing a “subtag” or an “openclosetag”), you can access the various parts of the hierarchy through attribute-style access on the BeautifulSoup object. When several child tags share a name, this returns the first one.

Where this gets trickier is when there are multiple tags with the same name at the same level of the hierarchy, as with the two “subtag” tags. In this case, you can use the *find_all* function, which returns a list of all the subtags.

The nice thing about the *find_all* function is that you can call it at a higher level in the tree, and it will recurse down the whole document. So we could just as easily have called root.find_all("subtag") instead of root.tag.find_all("subtag").

In [424]:
from bs4 import BeautifulSoup

root = BeautifulSoup("""
<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
    <subtag>
        Second one
    </subtag>
</tag>
""", "lxml-xml")

print(root, "\n", "\n", "\n")
print(root.tag.subtag, "\n", "\n", "\n")
print(root.tag.openclosetag.attrs, "\n", "\n", "\n")
print(root.tag.find_all("subtag")) # or print(root.find_all("subtag"))

<?xml version="1.0" encoding="utf-8"?>
<tag attribute="value">
<subtag>
        Some content for the subtag
    </subtag>
<openclosetag attribute="value2"/>
<subtag>
        Second one
    </subtag>
</tag> 
 
 

<subtag>
        Some content for the subtag
    </subtag> 
 
 

{'attribute': 'value2'} 
 
 

[<subtag>
        Some content for the subtag
    </subtag>, <subtag>
        Second one
    </subtag>]
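For comparison, the same small document can be walked with Python's built-in xml.etree.ElementTree module. This is a different library from BeautifulSoup, shown only as a sketch for well-formed XML; it is not meant for messy real-world HTML.

```python
import xml.etree.ElementTree as ET

document = """
<tag attribute="value">
    <subtag>
        Some content for the subtag
    </subtag>
    <openclosetag attribute="value2"/>
    <subtag>
        Second one
    </subtag>
</tag>
""".strip()

tag = ET.fromstring(document)

# find() returns the first matching child, like root.tag.subtag above
first = tag.find("subtag")
print(first.text.strip())

# attrib holds the attributes of an element as a dictionary
attributes = tag.find("openclosetag").attrib
print(attributes)

# iter() walks the whole tree, similar in spirit to find_all()
subtags = [element.text.strip() for element in tag.iter("subtag")]
print(subtags)
```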


In [292]:
response = requests.get("http://www.cmu.edu")
root = BeautifulSoup(response.content, "lxml") # if you omit "lxml", BeautifulSoup picks an installed parser itself (and warns about it)
#root.find_all("div", class_="events")
for div in root.find_all("div", class_="events"):
    for li in div.find_all("li"):
        print(li.text.strip())


Apr 14
                                Weekly Online Gratitude Session
Apr 15
                                Webinar: Using the Code Editor in Alice
Apr 16 - Apr 19
                                Virtual Spring Carnival & Reunion Weekend
May 7
                                Webinar: Protecting Your IP for Startups


In [459]:
# fetch and parse the W3Schools HTML tables page
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.w3schools.com/html/html_tables.asp")
root = BeautifulSoup(response.content)
print(root)

<!DOCTYPE html>
<html lang="en-US">
<head>
<title>HTML Tables</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="HTML,CSS,JavaScript,SQL,PHP,jQuery,XML,DOM,Bootstrap,Python,Java,Web development,W3C,tutorials,programming,training,learning,quiz,primer,lessons,references,examples,exercises,source code,colors,demos,tips" name="Keywords"/>
<meta content="Well organized and easy to understand Web building tutorials with lots of examples of how to use HTML, CSS, JavaScript, SQL, PHP, Python, Bootstrap, Java and XML." name="Description"/>
<link href="/favicon.ico" rel="icon" type="image/x-icon"/>
<link href="/w3css/4/w3.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Source Code Pro" rel="stylesheet"/>
<style>
a:hover,a:active{color:#4CAF50}
table.w3-table-all{margin:20px 0}
/*OPPSETT AV TOP, TOPNAV, SIDENAV, MAIN, RIGHT OG FOOTER:*/
.top {
position:relative;
background-color:#ffffff;
height:68

In [460]:
print(root.find("div", {"class" : 'w3-example'}))
## the attrs-dictionary form ("div", {"class" : 'w3-example'}) is equivalent to the keyword form
## ("div", class_='w3-example'); class is a reserved word in Python, so the keyword is spelled class_

<div class="w3-example">
<h3>HTML Table Example</h3>
<div class="w3-white w3-padding notranslate w3-padding-16">
<table id="customers">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Ernst Handel</td>
<td>Roland Mendel</td>
<td>Austria</td>
</tr>
<tr>
<td>Island Trading</td>
<td>Helen Bennett</td>
<td>UK</td>
</tr>
<tr>
<td>Laughing Bacchus Winecellars</td>
<td>Yoshi Tannamuri</td>
<td>Canada</td>
</tr>
<tr>
<td>Magazzini Alimentari Riuniti</td>
<td>Giovanni Rovelli</td>
<td>Italy</td>
</tr>
</table>
</div>
<a class="w3-btn w3-margin-top w3-margin-bottom" href="tryit.asp?filename=tryhtml_table_intro" target="_blank">Try it Yourself »</a>
</div>


In [485]:
#print(root.find("table",id="customers")\
      #.find_all("tr").find("th").find("tr")
root.find("table",id="customers")\
.find("tr").find_all("th")#.findAll("th")

[<th>Company</th>, <th>Contact</th>, <th>Country</th>]

In [462]:
root.find("div", {"class" : "w3-sidebar w3-collapse"}).find_all("a")
#<div class="w3-sidebar w3-collapse" id="sidenav">
#<div id="leftmenuinner">
#<div class="w3-light-grey" id="leftmenuinnerinner">

[<a href="default.asp" target="_top">HTML HOME</a>,
 <a href="html_intro.asp" target="_top">HTML Introduction</a>,
 <a href="html_editors.asp" target="_top">HTML Editors</a>,
 <a href="html_basic.asp" target="_top">HTML Basic</a>,
 <a href="html_elements.asp" target="_top">HTML Elements</a>,
 <a href="html_attributes.asp" target="_top">HTML Attributes</a>,
 <a href="html_headings.asp" target="_top">HTML Headings</a>,
 <a href="html_paragraphs.asp" target="_top">HTML Paragraphs</a>,
 <a href="html_styles.asp" target="_top">HTML Styles</a>,
 <a href="html_formatting.asp" target="_top">HTML Formatting</a>,
 <a href="html_quotation_elements.asp" target="_top">HTML Quotations</a>,
 <a href="html_comments.asp" target="_top">HTML Comments</a>,
 <a href="html_colors.asp" target="_top">HTML Colors</a>,
 <a href="html_colors.asp" target="_top">Colors</a>,
 <a href="html_colors_rgb.asp" target="_top">RGB</a>,
 <a href="html_colors_hex.asp" target="_top">HEX</a>,
 <a href="html_colors_hsl.asp" tar

# Regular expressions and parsing

Regular expressions are invaluable when parsing any type of unstructured data, if you’re trying to quickly find or extract some text from a long string, and even if you’re writing a more complex parser.

Once you have loaded data (or if you need to build a parser to load some other data format), you will often need to search for specific elements within the data, e.g., find the first occurrence of the string “data science”:
1. First we import the regex library (re)
2. Then we declare the text we want to search
3. We then call the regex function we want and store the result in a variable (or use it in a conditional or loop)
4. Finally, we print the result

In [386]:
import re
text = "This course will introduce the basics of data science" # this text is used throughout the examples
match = re.search(r"data science", text)  # find the first match anywhere in text, or None if there is none
print(match.start()) # character index where the match starts (spaces are counted)

41


In the example above, the important element is the *re.search(r"data science", text)* call. It searches text for the string *“data science”* and returns a regular expression *“match”* object that contains information about where the match was found.

For instance, we can find the character index (in text) where the match begins using the match.start() call.

1. **re.match():** Match the regular expression starting at the beginning of the text string. It won't match if the string does not start with text that fits the pattern. By contrast, the **search()** function searches the whole string and returns a Match object if there is a match (irrespective of its position). If there is more than one match, only the first occurrence is returned.


**e.g:**

         tnt = "His dream is to study life science at the university"
         compare = re.match(r"\w+\s+science", tnt)
         print(compare)
         # this returns None because "life science" is not at the start of the string

         tnt = "His dream is to study life science at the university"
         compare = re.search(r"\w+\s+science", tnt)
         print(compare)
         # this returns the match object <re.Match object; span=(22, 34), match='life science'>

         


2. **re.finditer():** Find all matches in the text, returning an iterator over match objects.


3. **re.findall():** Find all matches in the text, returning a list of the matched text only (not match objects), in the order the matches are found.


4. **re.compile():** You can also “compile” a regular expression and then make all the same calls on this compiled object.


5. **re.split():**   Returns a list where the string has been split at each match.


6. **re.sub():**     Replaces one or many matches with a string.

For a simpler walkthrough of Python regex, [W3Schools](https://www.w3schools.com/python/python_regex.asp) has a great resource.

In [494]:
tnt = "His dream is to study life science at the university"
compare = re.match(r"\w+\s+science",tnt)
#compare = re.search(r"\w+\s+science",tnt)
print(compare)

None


In [336]:
regex = re.compile(r"data science")
regex.search(text) # it provides the start (41) and end (53) positions

<re.Match object; span=(41, 53), match='data science'>

A Match object contains information about the search and its result. The example above did a search that returned a Match object, which has the following properties and methods for retrieving that information:

* span() returns a tuple containing the start and end positions of the match
* string returns the string passed into the function
* group() returns the part of the string where there was a match

**N/B:** If there is no match, the value *None* will be returned, instead of the Match Object.
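Because a failed search returns None, calling a method such as .start() on the result directly would raise an AttributeError. The sketch below shows one way to guard against that; find_span is a hypothetical helper name introduced only for this illustration.

```python
import re

text = "This course will introduce the basics of data science"

def find_span(pattern, string):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, string)
    if match is None:
        return None
    return match.span()

found = find_span(r"data science", text)
missing = find_span(r"machine learning", text)
print(found)
print(missing)
```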

Let's look at examples of Match objects and their methods.

In [327]:
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) # "ai" occurs twice here; re.search() reports only the first occurrence, with its start and end positions

<re.Match object; span=(5, 7), match='ai'>


In [333]:
# Print the position (start and end) of the first match.
# The regular expression looks for the first word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


In [334]:
# Print the string passed into the function:

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


In [438]:
# Print the part of the string where there was a match.
# The regular expression looks for the first word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group()) # group() and groups() differ (see the section on Grouping, just before Substitutions)

Spain


In [442]:
# re.findall() already returns every match as a list of strings, so no loop over re.finditer() is needed here
all_matches = re.findall(r"data science", text)
print(all_matches)

['data science']


In [307]:
# We can use re.finditer() to list the location of every occurrence of a pattern in the string, here the letter "c":
for match in re.finditer(r"c", text):
    print(match.start())

5
24
35
47
51


In [308]:
#re.findall() just returns a list of the matched strings, with no additional info such as where they occurred:

re.findall(r"i", text)

['i', 'i', 'i', 'i', 'i']

In [446]:
# the example below splits the string at every whitespace character
# txt = "The rain in Spain"

x = re.split(r"\s", text)
#x = re.split("s", text)  # you can split on any pattern, e.g. the letter "s"
print(x)

['This', 'course', 'will', 'introduce', 'the', 'basics', 'of', 'data', 'science']


In [352]:
# We can also control the number of splits by specifying the maxsplit parameter.
# The example below splits the string only at the first whitespace character:

x = re.split(r"\s", text, maxsplit=1)
print(x)

['This', 'course will introduce the basics of data science']


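Regular expressions can also match classes of characters rather than exact text: [Dd] matches either “D” or “d”, and \s matches any whitespace character. As an example, the following regular expression will match “data science” whether or not each word is capitalized, and with any single whitespace character between the two words.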

In [358]:
print(re.search(r"[Dd]ata\s[Ss]cience", text))
# print(re.search(r"[D]ata\s[S]cience", text)) try this, it'll return None

<re.Match object; span=(41, 53), match='data science'>


Regular expressions can also match repeated characters, i.e. one or more instances of a character (or set of characters).
Some common modifiers:
+ Match character ‘a’ exactly once: a
+ Match character ‘a’ zero or one time: a?
+ Match character ‘a’ zero or more times: a*
+ Match character ‘a’ one or more times: a+
+ Match character ‘a’ exactly n times: a{n}
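The modifiers above can be checked against a few short strings. This sketch uses re.fullmatch(), which succeeds only when the entire string fits the pattern, so each line shows exactly which candidates a modifier accepts; the candidate strings are chosen just for illustration.

```python
import re

# re.fullmatch() succeeds only if the whole string matches the pattern
patterns = ["a", "a?", "a*", "a+", "a{3}"]
candidates = ["", "a", "aaa"]

results = {}
for pattern in patterns:
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(repr(pattern), "accepts", results[pattern])
```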

These rules can of course be combined with the character-matching rules to build potentially very complicated expressions:
+ For instance, if we want to match the text “something science”, where something is any run of alphanumeric characters and there can be any amount of whitespace of any kind between something and the word “science”, we could use the expression r"\w+\s+science".

* In the question below, we match all instances of “<something> science”, where <something> is an alphanumeric string with at least one character: \w+\s+science

Which strings would be matched (i.e., by calling re.match()) by the following regular expression?

r"\w+\s+science"

1. “life science”
2. “life sciences”
3. “life. Science”
4. “this data science problem”

**Answer:** 1 and 2. “life science” matches in full, and “life sciences” matches on its first twelve characters, because re.match() only requires the pattern to match at the start of the string. “life. Science” fails because “.” is neither alphanumeric nor whitespace (and “Science” is capitalized). “this data science problem” fails because re.match() anchors at the start of the string, where “this” is followed by “data” rather than “science”; with re.search() it would match.

Let's look at more examples (in the cell below)

In [499]:
tnt = "life science"
tnts = "life sciences"
word = "life. Science"
words = "this data science problem"
#compare = re.match(r"\w+\s+science",tnt) # returns a match object: the string starts with a match for the pattern

#compare = re.match(r"\w+\s+science",tnts) # also matches: the pattern fits at the start of the string, and the
# trailing "s" of "sciences" is simply left outside the match

#compare = re.match(r"\w+\s+science",word)
# the above returns None: \w matches alphanumeric characters and \s matches whitespace, so neither matches the (.),
# and "Science" is capitalized. The (+) in both cases means one or more times, i.e. one or more alphanumeric
# characters followed by one or more whitespace characters.

compare = re.match(r"\w+\s+science",words) # returns None because the string does not begin with a match for the pattern
print(compare)

None
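The four candidate strings from the question above can be checked in one pass. This sketch compares re.match(), which only looks at the start of each string, with re.search(), which scans the whole string.

```python
import re

pattern = r"\w+\s+science"
candidates = ["life science", "life sciences", "life. Science", "this data science problem"]

# re.match() anchors at the start of the string; re.search() scans the whole string
matched_at_start = [s for s in candidates if re.match(pattern, s)]
matched_anywhere = [s for s in candidates if re.search(pattern, s)]

print(matched_at_start)
print(matched_anywhere)
```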


In [364]:
print(re.match(r"\w+\s+science", "data science")) # the match starts at 0 and ends at 12 (spaces are counted)
print(re.match(r"\w+\s+science", "life science")) # the match starts at 0 and ends at 12 (spaces are counted)
print(re.match(r"\w+\s+science", "0123_abcd science")) # the match starts at 0 and ends at 17; \w also matches digits and "_"

<re.Match object; span=(0, 12), match='data science'>
<re.Match object; span=(0, 12), match='life science'>
<re.Match object; span=(0, 17), match='0123_abcd science'>


**NOTE:** One thing you may notice is the *r" "* format of the regular expressions (quotes with an ‘r’ preceding them). You can actually use any string as a regular expression, but these *raw strings* are quite handy for the following reason.

+ In a typical Python string, backslash characters denote escaped characters, so for instance "\\" really just encodes a single backslash. But backslashes are also used within regular expressions.

+ So if we want the regular expression \\ (that is, match a single backslash) represented as an ordinary string, we’d need to write "\\\\". This gets tedious quickly. The r" " notation turns off Python’s handling of backslashes, which makes writing regular expressions much simpler.

In [406]:
print("\\")
#print("\\\") # SyntaxError: EOL while scanning string literal
print(r"\\") # with r, you get the backslashes as you want them

\
\\
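To connect the two layers of escaping, the following sketch compares a raw string with its ordinary-string equivalent and then uses it as a pattern to find a literal backslash. The sample path is made up for illustration.

```python
import re

# Without the r prefix, each backslash in the pattern must be written twice
plain = "\\\\"   # an ordinary string holding two backslash characters
raw = r"\\"      # the same two characters, written as a raw string
print(plain == raw, len(raw))

# As a regular expression, those two characters match one literal backslash
path = "C:\\temp"  # the seven-character string C:\temp
match = re.search(raw, path)
print(match.start())
```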


### Grouping
Beyond the ability to just match strings, regular expressions also let you easily find specific sub-elements of the matched strings. The basic syntax is the following: if we want to “remember” different portions of the matched expression, we just surround those portions of the regular expression in parentheses. For example, the regular expression r"(\w+)\s([Ss]cience)" would store whatever element is matched to the \w+ and [Ss]cience portions in the groups() object in the returned match.

We often want to obtain more information than just whether we found a match or not (for instance, we may want to know what text matched).

In [501]:
# (\w+)\s([Ss]cience)
match = re.search(r"(\w+)\s([Ss]cience)", text)
print(match.start(), match.groups(), "\n") # .groups() returns a tuple of the captured groups, while .group() returns
# the part of the string where there is a match
# Why the ‘r’ before the string? It avoids the need to double-escape backslashes

print(match.group(), "\n") # recall the Match object members: span(), string, group()
print(match.span())
#print(match.string()) # TypeError: 'str' object is not callable. string is an attribute, not a method,
# so write match.string without parentheses (as in the Spain example)

41 ('data', 'science') 

data science 

(41, 53)


The *.group(i)* notation also lets you easily find just individual groups, *.group(0)* being the entire text.

In [391]:
match = re.search(r"(\w+)\s([Ss]cience)", text)
print(match.group(0))
print(match.group(1))
print(match.group(2))

data science
data
science


### Substitutions

Regular expressions can also be used to automatically substitute one expression for another within a string. This is done using the re.sub() call.

It returns a string with all instances of the first regular expression replaced by the second argument. For example, to replace all occurrences of “data science” with “schmada science”, we could use the following code:

In [392]:
better_text = re.sub(r"data science", r"schmada science", text) #takes in 3 parameters
print(better_text)

This course will introduce the basics of schmada science


Where this gets really powerful is when we use groups in the first regular expression. These groups can then be backreferenced using the \1, \2, etc. notation in the second one (more generally, you can use these backreferences within a single regular expression too, outside the context of substitutions).

So if we have the regular expression r"(\w+) ([Ss])cience" to match “something science” (where science can be capitalized or not), we can replace it with “something schmience”, keeping the something the same and preserving the capitalization of science, using the code:

In [400]:
better_text = re.sub(r"(\w+)\s([Ss])cience", r"\1 \2hmience", text)
print(better_text) 
# \1 refers to group 1, (\w+), which matched "data"
# \2 refers to group 2, ([Ss]), which matched only the letter "s", so \2hmience becomes "shmience"
# \s matches a single whitespace character

another_text = re.sub(r"(\w+) ([Ss])cience", r"\1 \2chmience", "Life Science")
print(another_text)

# \1 is group 1, (\w+), which matched "Life"
# \2 is group 2, ([Ss]), which matched "S", so \2chmience becomes "Schmience"

This course will introduce the basics of data shmience
Life Schmience
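
Backreferences can also appear inside the pattern itself. As a minimal sketch (the sentence and the variable names are made up for illustration), the pattern below finds accidentally doubled words and collapses each pair to a single word:

```python
import re

sentence = "This is is a test of the the parser"   # hypothetical input
doubled = r"\b(\w+)\s+\1\b"   # \1 must repeat exactly what group 1 captured

print(re.findall(doubled, sentence))         # the words that were doubled
cleaned = re.sub(doubled, r"\1", sentence)   # keep one copy of each
print(cleaned)
```

The \b word boundaries keep the pattern from matching inside longer words, such as the "is" at the end of "This".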


In [405]:
# You can control the number of replacements by specifying the count parameter
#txt = "The rain in Spain"
x = re.sub(r"\s", "-", text, count=2) # count limits the number of replacements
print(x)

#You can replace anything with anything you want, eg:
x = re.sub("the", ",", text, count=2) # only one "the" exists, so only one replacement happens
print(x)

This-course-will introduce the basics of data science
This course will introduce , basics of data science


### Ordering and greedy matching

#### Order of operations
The first point concerns the order of operations for regular expressions. The | character in regular expressions is like an “or” clause: the expression matches either the regular expression to its left or the one to its right. For example, the regular expression r"abc|def" would match the string “abc” or “def”.

In [407]:
print(re.match(r"abc|def", "abc"))
print(re.match(r"abc|def", "def"))

<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 3), match='def'>


But what if we want to match the string “ab(c or d)ef”? We can capture this in a regular expression by placing parentheses around the portion we want evaluated first.

In [453]:
print(re.match(r"abc|def", "abdef")) # we don't have abc
# print(re.match(r"abc||def", "abdef")) matches the empty string, because the middle alternative is empty
print(re.match(r"ab(c|d)ef", "abdef")) # we don't have c but we have d

None
<re.Match object; span=(0, 5), match='abdef'>
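
The same precedence rule matters with longer alternatives. The following sketch uses made-up strings and re.fullmatch (which requires the whole string to match) to show how the parentheses change what the | applies to:

```python
import re

# Without parentheses the two alternatives are "cat" and "dog food"
loose = re.fullmatch(r"cat|dog food", "cat food")
print(loose)

# With parentheses the alternatives are "cat" and "dog", each followed by " food"
grouped = re.fullmatch(r"(cat|dog) food", "cat food")
print(grouped)
```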


Since parentheses are also used to specify groups, we can group part of an expression without capturing it by using the non-capturing syntax (?:...), for example *a(?:bc|de)f*.

In [455]:
print(re.match(r"ab(c|d)ef", "abdef").groups())
print(re.match(r"ab(?:c|d)ef", "abdef").groups()) # (?:...) groups without capturing; this ? is not the "0 or 1 times" quantifier

('d',)
()


#### Greedy matching
By default, regular expressions try to capture as much text as possible (greedy matching). Appending ? to a quantifier, as in .*?, makes it match as little text as possible instead (lazy matching).

In [412]:
print(re.match(r"<.*>", "<tag>hello</tag>"))

<re.Match object; span=(0, 16), match='<tag>hello</tag>'>


In [413]:
print(re.match(r"<.*?>", "<tag>hello</tag>"))

<re.Match object; span=(0, 5), match='<tag>'>
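
The difference is easier to see with re.findall, which returns every match. This sketch reuses the string from the cells above and adds a third pattern, a negated character class, as one common alternative to the lazy quantifier:

```python
import re

html = "<tag>hello</tag>"

greedy = re.findall(r"<.*>", html)      # .* grabs as much as it can
lazy = re.findall(r"<.*?>", html)       # .*? stops at the first possible ">"
negated = re.findall(r"<[^>]*>", html)  # [^>]* can never run past a ">"

print(greedy)
print(lazy)
print(negated)
```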


**Crazy Construct**
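r".?|(..+?)\1+" can be used to [test whether a number of characters is prime](https://iluxonchik.github.io/regular-expression-check-if-number-is-prime/). Applied to the whole string, it matches only when the length is not prime (0, 1, or a composite number), so a prime length produces no match.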
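
A minimal sketch of how this construct could be used follows. The helper name `is_prime` is hypothetical, and re.fullmatch stands in for the ^ and $ anchors. The pattern matches strings of length 0 or 1 (the `.?` branch) or strings made of a block of two or more characters repeated at least twice (a composite length), so a prime length is the case where nothing matches:

```python
import re

# Matches when the length is 0, 1, or composite; the backreference is a single \1
non_prime = re.compile(r".?|(..+?)\1+")

def is_prime(n):
    """Return True if a string of n identical characters is NOT matched."""
    return non_prime.fullmatch("1" * n) is None

primes = [n for n in range(20) if is_prime(n)]
print(primes)
```

This is a curiosity rather than a practical primality test, since the backtracking cost grows quickly with n.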

### Additional features
There are other elements in regex such as start/end-of-line anchors, lookaheads, named groups, etc. For more on these, see the [Python regex HOWTO](https://docs.python.org/3/howto/regex.html) and the [re module documentation](https://docs.python.org/3/library/re.html).
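
As a brief taste of these features, here is a sketch on made-up input strings; the group names `label`, `amount`, and `currency` are chosen purely for illustration:

```python
import re

line = "price: 42 USD"   # hypothetical input

# ^ and $ anchor the pattern to the start and end of the string;
# (?P<name>...) creates a group that can be looked up by name
m = re.search(r"^(?P<label>\w+):\s(?P<amount>\d+)\s(?P<currency>[A-Z]{3})$", line)
print(m.group("label"), m.group("amount"))
print(m.groupdict())

# (?=...) is a lookahead: " USD" must follow the digits but is not part of the match
usd_amounts = re.findall(r"\d+(?= USD)", "42 USD, 17 EUR, 8 USD")
print(usd_amounts)
```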

+ [Pandas Library](https://pandas.pydata.org/docs/getting_started/index.html)
+ [JSON Library](https://docs.python.org/3/library/json.html)
+ [Beautiful Soup Library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)