## Announcements

Day 14 will be a free cut. Instead, please go through the cells below, as they will be important for the remaining topics in this course.


## Software Design Patterns

Source: [Wikipedia](https://www.tutorialspoint.com/python_design_patterns/python_design_patterns_adapter.htm)
    
In software engineering, a software design pattern is a general, reusable solution to a commonly occurring problem within a given context in software design. It is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Design patterns are formalized best practices that the programmer can use to solve common problems when designing an application or system.

Object-oriented design patterns typically show relationships and interactions between classes or objects, without specifying the final application classes or objects that are involved. Patterns that imply mutable state may be unsuited for functional programming languages, some patterns can be rendered unnecessary in languages that have built-in support for solving the problem they are trying to solve, and object-oriented patterns are not necessarily suitable for non-object-oriented languages.

Design patterns may be viewed as a structured approach to computer programming intermediate between the levels of a programming paradigm and a concrete algorithm.

Structure of a **design pattern**:

* Pattern Name
* Intent/Motive
* Applicability
* Participants and Consequences

For Python, we need to include:
* Aliases
* Motivation
* Constraints
* Sample Code

### Benefits of using patterns

* Patterns provide developers with a selection of tried and tested solutions for specific problems.
* Design patterns are largely language neutral.
* Patterns help developers communicate and keep designs well documented.
* Patterns come with a track record that reduces technical risk to the project.
* Design patterns are highly flexible to use and easy to understand.

### Patterns which may be covered in this class (ones in bold are illustrated in this notebook):
* Model View Controller Pattern
* Singleton pattern
* Factory pattern
* Builder Pattern
* Prototype Pattern
* Facade Pattern
* Command Pattern
* **Adapter Pattern** 
* Decorator Pattern
* **Proxy Pattern**
* Chain of Responsibility Pattern
* Observer Pattern
* State Pattern
* Strategy Pattern
* Template Pattern
* Flyweight Pattern
* Abstract Factory Pattern
* Object Oriented Pattern

### Adapter Pattern

(source: [Wikipedia](https://www.google.com/search?q=adapter+pattern&rlz=1C5CHFA_enPH504PH504&oq=Adapter+pattern&aqs=chrome.0.0l2j69i60j0l3.2925j0j1&sourceid=chrome&ie=UTF-8))

In software engineering, the adapter pattern is a software design pattern (also known as wrapper, an alternative naming shared with the decorator pattern) that allows the interface of an existing class to be used as another interface.

The adapter design pattern solves problems like:

* How can a class be reused that does not have an interface that a client requires?
* How can classes that have incompatible interfaces work together?
* How can an alternative interface be provided for a class?

The adapter design pattern describes how to solve such problems:

* Define a separate adapter class that converts the (incompatible) interface of a class (adaptee) into another interface (target) that clients require.
* Work through an adapter to work with (reuse) classes that do not have the required interface.

The key idea in this pattern is to work through a separate adapter that adapts the interface of an (already existing) class without changing it.

Clients don't know whether they work with a target class directly or through an adapter with a class that does not have the target interface.



In [2]:
# provide sample code here

# Say you want to have the flexibility of writing sales transactions as either CSV or JSON. 
# How would you go about it?

import csv
import json

class SalesTransactionWriter():
    def write(self,sales_order):
        print("Sales order logged successfully.")
    
class JSONSalesTransactionWriter(SalesTransactionWriter):
    def write(self,sales_order):
        
        print("{}".format(json.dumps(sales_order)))
    
class CSVSalesTransactionWriter(SalesTransactionWriter):
    def write(self,sales_order):
        print("{},{}".format(sales_order["customer"],sales_order["amount"]))        
        
def log_sales_order(sales_order):
    sales_logger = CSVSalesTransactionWriter()
    sales_logger.write(sales_order)


In [5]:
# We will just print out results in this cell to simplify the illustration

log_sales_order({"customer":"Joe","amount":150})

Joe,150


In [6]:
# Change the sales log implementation to JSON.
# Likewise, we will just print out results to simplify the illustration

def log_sales_order(sales_order):
    sales_logger = JSONSalesTransactionWriter()
    sales_logger.write(sales_order)

In [4]:
log_sales_order({"customer":"Joe","amount":150})

{"customer": "Joe", "amount": 150}
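
The cells above swap one writer subclass for another. As a complementary sketch, the snippet below wraps an existing class whose interface does not match what the client expects, which is the situation the adapter pattern is meant for. The names `LegacyXMLExporter`, `export`, and `XMLSalesTransactionAdapter` are hypothetical and exist only for illustration; the writer returns a string instead of printing so the result is easy to inspect.

```python
class SalesTransactionWriter:
    """Target interface: what client code expects to call."""

    def write(self, sales_order):
        raise NotImplementedError


class LegacyXMLExporter:
    """Hypothetical existing class (the adaptee) with an incompatible interface."""

    def export(self, customer, amount):
        return "<sale><customer>{}</customer><amount>{}</amount></sale>".format(
            customer, amount
        )


class XMLSalesTransactionAdapter(SalesTransactionWriter):
    """Adapter: exposes write(sales_order) and delegates to the adaptee."""

    def __init__(self, adaptee):
        self._adaptee = adaptee

    def write(self, sales_order):
        # Translate the dict-based call into the adaptee's positional call.
        return self._adaptee.export(sales_order["customer"], sales_order["amount"])


def log_sales_order(sales_order, writer):
    # The client only knows about write(); it cannot tell an adapter from a
    # class that implements the target interface directly.
    return writer.write(sales_order)


xml_line = log_sales_order(
    {"customer": "Joe", "amount": 150},
    XMLSalesTransactionAdapter(LegacyXMLExporter()),
)
print(xml_line)
```

Note that `LegacyXMLExporter` itself is left untouched; only the adapter knows about both interfaces.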


### Proxy Pattern

(source: [Wikipedia](https://en.wikipedia.org/wiki/Proxy_pattern))

In computer programming, the proxy pattern is a software design pattern. A proxy, in its most general form, is a class functioning as an interface to something else. The proxy could interface to anything: a network connection, a large object in memory, a file, or some other resource that is expensive or impossible to duplicate. In short, a proxy is a wrapper or agent object that is being called by the client to access the real serving object behind the scenes. Use of the proxy can simply be forwarding to the real object, or can provide additional logic. In the proxy, extra functionality can be provided, for example caching when operations on the real object are resource intensive, or checking preconditions before operations on the real object are invoked. For the client, usage of a proxy object is similar to using the real object, because both implement the same interface.

#### What problems does the Proxy Pattern solve?

* The access to an object should be controlled.
* Additional functionality should be provided when accessing an object.
* When accessing sensitive objects, for example, it should be possible to check that clients have the needed access rights.

#### Oversimplified version of a financial services application

In [8]:
# provide sample code here

joben_account = {"account":"joben","balance":100000,"is_active":False}
joe_account = {"account":"joe","balance":5000,"is_active":True}

print("Before transfer: ")
print(joben_account)
print(joe_account)

def transfer_funds(source, destination, amount):
    source["balance"] -= amount
    destination["balance"] += amount
    
transfer_funds(joben_account, joe_account, 5000)

print("After transfer: ")
print(joben_account)
print(joe_account)
    

Before transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
After transfer: 
{'account': 'joben', 'balance': 95000, 'is_active': False}
{'account': 'joe', 'balance': 10000, 'is_active': True}


But what if we need to check if an account is active (i.e. not suspended) before allowing the transaction to proceed?

In [6]:
joben_account = {"account":"joben","balance":100000,"is_active":False}
joe_account = {"account":"joe","balance":5000,"is_active":True}

print("Before transfer: ")
print(joben_account)
print(joe_account)

def transfer_funds(source, destination, amount):
    # insert validation code here
    if(source["is_active"]):
        source["balance"] -= amount
        destination["balance"] += amount
    else:
        print("Transaction Failed")
        print("Reason: Source Account is not active.")
    
transfer_funds(joben_account, joe_account, 1000)

print("After transfer: ")
print(joben_account)
print(joe_account)

Before transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
Transaction Failed
Reason: Source Account is not active.
After transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}


The problem is that we need several more checks before allowing the transaction to proceed. Financial transactions of this nature are not as simple as adding and subtracting amounts:

- Is the transaction valid?
- Are the accounts active?
- Does the sender have sufficient balance in the account?
- Is the party performing the transaction, whether human or automated (e.g., an ATM or mobile banking app), authorized?

It would be very cumbersome to keep changing the code of transfer_funds(...) each time.

One solution is to implement a separate validator function that transfer_funds(...) calls before moving any funds.
 

In [7]:
joben_account = {"account":"joben","balance":100000,"is_active":False}
joe_account = {"account":"joe","balance":5000,"is_active":True}

print("Before transfer: ")
print(joben_account)
print(joe_account)

def transaction_validator(source, destination, amount):
    validation_status = {
        "is_valid": False,
        "reason": ""
    }

    if(source["is_active"]):
        validation_status["is_valid"] = True
    else:
        validation_status["is_valid"] = False # redundant for sure, but placing here for readability
        validation_status["reason"] = "Source account is not valid."
    return validation_status
    

def transfer_funds(source, destination, amount):
    validation_status = transaction_validator(source, destination, amount)
    
    if(validation_status["is_valid"]):
        source["balance"] -= amount
        destination["balance"] += amount
    else:
        print(validation_status["reason"])
    
transfer_funds(joben_account, joe_account, 1000)

print("After transfer: ")
print(joben_account)
print(joe_account)

Before transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
Source account is not valid.
After transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
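
The validator above performs a single check. One possible way to grow it without rewriting it each time is to keep a list of small check functions and stop at the first failure. This is only a sketch: the check names, the `DEFAULT_CHECKS` list, and the reason messages are assumptions made for illustration.

```python
# Each check returns None when it passes, or a reason string when it fails.
def check_amount_is_positive(source, destination, amount):
    return None if amount > 0 else "Amount must be positive."


def check_source_is_active(source, destination, amount):
    return None if source["is_active"] else "Source account is not active."


def check_sufficient_balance(source, destination, amount):
    return None if source["balance"] >= amount else "Insufficient balance."


# Hypothetical default list; adding a new rule means appending a function here.
DEFAULT_CHECKS = [
    check_amount_is_positive,
    check_source_is_active,
    check_sufficient_balance,
]


def transaction_validator(source, destination, amount, checks=DEFAULT_CHECKS):
    for check in checks:
        reason = check(source, destination, amount)
        if reason is not None:
            return {"is_valid": False, "reason": reason}
    return {"is_valid": True, "reason": ""}


joben_account = {"account": "joben", "balance": 100000, "is_active": False}
joe_account = {"account": "joe", "balance": 5000, "is_active": True}

status_inactive = transaction_validator(joben_account, joe_account, 1000)
status_short = transaction_validator(joe_account, joben_account, 9000)
status_ok = transaction_validator(joe_account, joben_account, 1000)

print(status_inactive)
print(status_short)
print(status_ok)
```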


Another solution is to implement the **Proxy** pattern to separate the concerns of transferring funds and validating transactions.
 

In [8]:
joben_account = {"account":"joben","balance":100000,"is_active":False}
joe_account = {"account":"joe","balance":5000,"is_active":True}

print("Before transfer: ")
print(joben_account)
print(joe_account)

def transaction_validator(source, destination, amount):
    validation_status = {
        "is_valid": False,
        "reason": ""
    }

    if(source["is_active"]):
        validation_status["is_valid"] = True
    else:
        validation_status["is_valid"] = False # redundant for sure, but placing here for readability
        validation_status["reason"] = "Source account is not valid."
    return validation_status

def transfer_funds_secure_proxy(source, destination, amount):
    # call transaction_validator first
    validation_status = transaction_validator(source, destination, amount)
    if(validation_status["is_valid"]):
        # proceed with calling transfer_funds(...)
        transfer_funds(source, destination, amount)
    else:
        print(validation_status["reason"])
    
# this was our original, bare-bones transfer_funds(...) function implementation. short and sweet.    
def transfer_funds(source, destination, amount):
    source["balance"] -= amount
    destination["balance"] += amount

# Note that we are now calling the proxy, not the main transaction handler
transfer_funds_secure_proxy(joben_account, joe_account, 1000)

print("After transfer: ")
print(joben_account)
print(joe_account)

Before transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
Source account is not valid.
After transfer: 
{'account': 'joben', 'balance': 100000, 'is_active': False}
{'account': 'joe', 'balance': 5000, 'is_active': True}
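
The same idea can be expressed with classes, which makes the "same interface" property of a proxy more visible. The sketch below is an assumption-laden illustration, not part of the original notebook: `FundsTransfer`, `ValidatingTransferProxy`, and the `transfer(...)` method are hypothetical names, and only two checks are included.

```python
class FundsTransfer:
    """The real subject: moves money, nothing else."""

    def transfer(self, source, destination, amount):
        source["balance"] -= amount
        destination["balance"] += amount
        return True


class ValidatingTransferProxy:
    """Proxy: same transfer(...) interface, but validates before delegating."""

    def __init__(self, real_transfer):
        self._real_transfer = real_transfer

    def transfer(self, source, destination, amount):
        if not source["is_active"]:
            print("Source account is not active.")
            return False
        if source["balance"] < amount:
            print("Insufficient balance.")
            return False
        # All checks passed: forward the call to the real object.
        return self._real_transfer.transfer(source, destination, amount)


joben_account = {"account": "joben", "balance": 100000, "is_active": False}
joe_account = {"account": "joe", "balance": 5000, "is_active": True}

service = ValidatingTransferProxy(FundsTransfer())

blocked = service.transfer(joben_account, joe_account, 1000)  # source is inactive
allowed = service.transfer(joe_account, joben_account, 1000)  # source is active

print(blocked, allowed)
print(joben_account)
print(joe_account)
```

Client code that holds `service` calls `transfer(...)` exactly as it would on `FundsTransfer`, so the proxy can be added or removed without touching the callers.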


#### What's the difference between Adapter and Proxy?

The role of an adapter is to change the interface of its adaptee (i.e., adapt it to a new interface), whereas the role of a proxy is to control access to an object (which might be located on another system) without changing its interface.

### *End of patterns discussions for now*

## Regular Expression (Regex)

**Source:** [RegExOne](https://regexone.com/)

Regular expressions are extremely useful in extracting information from text such as code, log files, spreadsheets, or even documents. And while there is a lot of theory behind formal languages, the following lessons and examples will explore the more practical uses of regular expressions so that you can use them as quickly as possible.

The best way to dive into **regex** is to go through examples.

In [9]:
import re

In [10]:
some_string = "The quick brown fox jumps over the lazy dogs."

re.findall("THE",some_string.upper())

['THE', 'THE']

In [48]:
# next few examples courtesy of DataCamp
pattern = r"Cookie" # raw string literal
sequence = "Cookie"
if re.match(pattern, sequence):
  print("Match!")
else: print("Not a match!")

Match!
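
The cell above uses `re.match`, while most of the following cells use `re.search`. A short sketch of the difference, using a made-up sentence: `re.match` only tries to match at the beginning of the string, whereas `re.search` scans the whole string for the first match.

```python
import re

sequence = "Cake and cookie"

# re.match anchors at position 0, so "cookie" is not found here.
match_result = re.match(r"cookie", sequence)

# re.search scans forward and finds "cookie" later in the string.
search_result = re.search(r"cookie", sequence)

print(match_result)
print(search_result.group(), search_result.start())
```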


In [52]:
# Wildcard
# . - A period. Matches any single character except newline character.
re.search(r'Co.k.e', 'Cookie').group()

# The group() function returns the string matched by the re.

'Cookie'

In [29]:
pattern = r'C.....'
print(re.search(pattern,"Cookie").group())
print(re.search(pattern,"Cuckoo").group())
print(re.search(pattern,"Cockroach").group())

Cookie
Cuckoo
Cockro


In [55]:
# \w - Lowercase w. Matches any single letter, digit or underscore.
pattern = r'Co\wk\we'
print(re.search(pattern, 'Cookie').group())
print(re.search(pattern, 'Coffee'))
print(re.search(pattern, 'Co!kee'))

Cookie
None
None


In [58]:
# \W - Uppercase w. Matches any character not part of \w (lowercase w).
pattern = r'C\Wke'
print(re.search(pattern, 'C*ke').group())

C*ke


In [71]:
# \s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.
print(re.search(r'Eat\scake', 'Eat cake').group())
print(re.search(r'M\sE\s', 'M E ').group())

Eat cake
M E 


In [17]:
# \S - Uppercase s. Matches any character not part of \s (lowercase s).
print(re.search(r'Eats\Sshoots\sand\sleaves', 'Eats,shoots and leaves').group())

Eats,shoots and leaves


In [18]:
#\n - Lowercase n. Matches newline.
#
#\r - Lowercase r. Matches return.
#
#\d - Lowercase d. Matches decimal digit 0-9.

re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

In [74]:
# ^ - Caret. Matches a pattern at the start of the string.
re.search(r'^Eat', 'Eat cake').group()

'Eat'

In [78]:
# $ - Matches a pattern at the end of string.
re.search(r'cake$', 'Throw cake').group()

'cake'

In [85]:
re.search(r'\d$' ,'my mobile number is 0917-9999999').group()

'9'

In [90]:
# [abc] - Matches a or b or c.
# [12] - Matches 1 or 2
#
# [a-zA-Z0-9] - Matches any letter from (a to z) or (A to Z) or (0 to 9). 
# Characters that are not within a range can be matched by complementing the set. 
# If the first character of the set is ^, all the characters that are not in the set will be matched.
re.search(r'Number: [0-6]', 'Number: 6').group()


'Number: 6'
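
The comments above describe complementing a set with `^`, but the cell does not exercise it. A minimal sketch with made-up input strings:

```python
import re

# [^aeiou\s] matches any single character that is NOT a lowercase vowel or whitespace.
consonants = re.findall(r"[^aeiou\s]", "cookie jar")

# [^0-6] rejects the digits 0 to 6, so "Number: 6" no longer matches.
rejected = re.search(r"Number: [^0-6]", "Number: 6")
accepted = re.search(r"Number: [^0-6]", "Number: 9")

print(consonants)
print(rejected)
print(accepted.group())
```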

In [103]:
re.search(r'^09[129][78][-* ]9999999', '0998 9999999').group()

'0998 9999999'

In [108]:
# \A - Uppercase a. Matches only at the very start of the string, even in re.MULTILINE mode (unlike ^, which then also matches after each newline).
pattern = r'\A[A-Za-e]ookie'

print(re.search(pattern, 'Cookie').group())
print(re.search(pattern, 'Bookie').group())
print(re.search(pattern, 'cookie').group())
print(re.search(pattern, 'Zookie').group())

Cookie
Bookie
cookie
Zookie
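
To see how `\A` differs from `^`, the sketch below runs both against a two-line string (made up for illustration) with the `re.MULTILINE` flag turned on.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ matches at the start of every line.
caret_matches = re.findall(r"^\w+", text, re.MULTILINE)

# \A still matches only at the start of the whole string.
start_matches = re.findall(r"\A\w+", text, re.MULTILINE)

print(caret_matches)
print(start_matches)
```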


In [117]:
# \b - Lowercase b. Matches only the beginning or end of the word.

re.search(r'\b[A-Ka-c]ookie', 'bookie').group()

'bookie'
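
A word boundary is easier to appreciate when a word also appears inside a longer word. In this sketch (the sentence is made up), `\b` on both sides keeps the pattern from matching the "cat" inside "concatenate".

```python
import re

sentence = "cat concatenate cat."

# Without boundaries, the "cat" inside "concatenate" is matched too.
loose = re.findall(r"cat", sentence)

# \b requires a word boundary on each side of "cat".
whole_words = re.findall(r"\bcat\b", sentence)

print(len(loose), loose)
print(len(whole_words), whole_words)
```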

In [119]:
# Warm-up

some_string_2 = "abc123 another123"

print(re.findall("123",some_string_2))
print(re.findall("abc",some_string_2))
print(re.findall("abc123",some_string_2))
print(re.findall("xyz",some_string_2))
print(re.findall("[a-z]",some_string_2)) # find all occurrences of lower case letters (a-z)
print(re.findall("[0-9]",some_string_2)) # find all occurrences of numeric digits (0-9)

['123', '123']
['abc']
['abc123']
[]
['a', 'b', 'c', 'a', 'n', 'o', 't', 'h', 'e', 'r']
['1', '2', '3', '1', '2', '3']


#### Repetitions

It becomes quite tedious if you are looking to find long patterns in a sequence. Fortunately, the re module handles repetitions using the following special characters:

`+` - Matches one or more repetitions of the element to its left.

In [125]:
re.search(r'[A-Z].+kie', 'Bo0000okie').group()

'Bo0000okie'

`*` - Matches zero or more repetitions of the element to its left.

In [136]:
# Checks for any occurrence of a or o or both in the given sequence
re.search(r'Ca*o*kie', 'Caaaaokie').group()

'Caaaaokie'

In [27]:
re.search(r'Ca*o*kie', 'Caaaaookie').group()

'Caaaaookie'

`?` - Matches zero or one occurrence of the element to its left.

In [138]:
# Matches "Color" or "Colour": the u is optional (zero or one occurrence)
re.search(r'Colou?r', 'Colour').group()

'Colour'

In [29]:
contact_info = "Tony Stark 0917-9003000, 0917-9003001, 0917-9003002 tony@example.com tony@ateneo.edu"

# Extract possible phone number(s) from text
print(re.findall(r"[0-9]{4}-[0-9]{7}", contact_info)) # find all possible phone numbers in the text
# Extract possible email address(es); the dot is escaped (\.) so it matches a literal period
print(re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+", contact_info))

['0917-9003000', '0917-9003001', '0917-9003002']
['tony@example.com', 'tony@ateneo.edu']


But what if you want to check for an exact number of repetitions?

For example, checking the validity of a phone number in an application. The re module handles this gracefully as well, using the following quantifiers:

`{x}` - Repeat exactly x times.

`{x,}` - Repeat at least x times.

`{x,y}` - Repeat at least x times but no more than y times (written without a space after the comma).

In [141]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'

In [145]:
re.search(r'\d{4}[- ]\d{7}', '0987 6543210').group()

'0987 6543210'
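
The three forms can be compared side by side on the same input. The sketch below also uses `re.fullmatch`, which succeeds only if the entire string matches, as one possible way to validate a phone number rather than merely find one inside a longer string. The sample numbers are made up.

```python
import re

digits = "12345"

exactly_two = re.search(r"\d{2}", digits).group()     # exactly 2 digits
at_least_two = re.search(r"\d{2,}", digits).group()   # 2 or more digits
two_to_three = re.search(r"\d{2,3}", digits).group()  # between 2 and 3 digits

# fullmatch requires the whole string to fit the pattern.
phone_pattern = r"\d{4}-\d{7}"
is_valid = re.fullmatch(phone_pattern, "0917-9003000") is not None
is_too_long = re.fullmatch(phone_pattern, "0917-90030001") is not None

print(exactly_two, at_least_two, two_to_three)
print(is_valid, is_too_long)
```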

The `+` and `*` quantifiers are said to be greedy: they match as much text as possible.

#### Groups and Groupings

Suppose that you are validating email addresses and want to check the user name and host separately.

This is when the group feature of regular expressions comes in handy. It allows you to pick out parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses `()` are called **groups**. The parentheses do not change what the expression matches, but rather form groups within the matched sequence. You have been using the `group()` function all along in this tutorial's examples. The plain `match.group()` without any argument still returns the whole matched text, as usual.

In [150]:
email_address = 'Please contact us at: support.coffee@obf.ateneo.edu'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
if match:
  print(match.group()) # The whole matched text
  print(match.group(1)) # The username (group 1)
  print(match.group(2)) # The host (group 2)

support.coffee@obf.ateneo.edu
support.coffee
obf.ateneo.edu
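
Numbered groups become hard to track as patterns grow. As a variation on the cell above, Python's `(?P<name>...)` syntax lets you name each group; the names `user` and `host` here are chosen only for illustration.

```python
import re

email_address = "Please contact us at: support.coffee@obf.ateneo.edu"

# Same pattern as before, but each group now has a name.
match = re.search(r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+)", email_address)

user = match.group("user")
host = match.group("host")
parts = match.groupdict()  # all named groups as a dictionary

print(user)
print(host)
print(parts)
```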


#### Greedy vs. Non-Greedy Matching

When a special character matches as much of the search sequence (string) as possible, it is said to be a "Greedy Match". It is the normal behavior of a regular expression but sometimes this behavior is not desired:

In [32]:
heading  = r'<h1>TITLE</h1>'
# .* is greedy: it runs all the way to the last > in the string
re.match(r'<.*>', heading).group()

'<h1>TITLE</h1>'

However, if you only wanted to match the first `<h1>` tag, you could have used the non-greedy quantifier `*?`, which matches as little text as possible.

Adding `?` after a quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. When you run `<.*?>`, you will only get a match with `<h1>`.

In [33]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'
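
The difference is also visible with `re.findall`, which returns every non-overlapping match. A small sketch on the same heading string:

```python
import re

heading = "<h1>TITLE</h1>"

# Greedy: one match that spans from the first < to the last >.
greedy_tags = re.findall(r"<.*>", heading)

# Non-greedy: each tag is matched separately.
lazy_tags = re.findall(r"<.*?>", heading)

print(greedy_tags)
print(lazy_tags)
```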

#### Replace text


In [157]:
contact_info = "Tony Stark 0917-9003000, 0917-9003001, 0917-9003002 tony@example.com tony@ateneo.edu"

re.sub(r"(09[0-9][0-9])-",r"(\1) ",contact_info)

# (09[0-9][0-9]) refers to group 1 containing the mobile number prefix
# \1 refers to group 1

'Tony Stark (0917) 9003000, (0917) 9003001, (0917) 9003002 tony@example.com tony@ateneo.edu'

In [173]:
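brgy = "Sto.                  Nino   "
brgy = re.sub(r"\s+"," ",brgy)        # collapse each run of whitespace into a single space
brgy = re.sub(r"Sto\.?","Santo",brgy) # \. matches a literal period; ? makes the period optional
brgy = re.sub(r"Nino\s+","Niño",brgy) # also consumes the trailing whitespace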


In [172]:
brgy

'Santo Niño'

In [37]:
brgy = "Sto. Nino"
brgy = re.sub(r"Sto\.","Santo",brgy) # escape the dot; an unescaped . would match any character

In [174]:
dirty_location_list = [
    "Paranaque",
    "Pnque",
    "Parañaque",
]

clean_location_list = [re.sub(r'(Paranaque|Pnque|Parañaque)',r'PARAÑAQUE',loc) for loc in dirty_location_list]

In [175]:
print(dirty_location_list)
print(clean_location_list)

['Paranaque', 'Pnque', 'Parañaque']
['PARAÑAQUE', 'PARAÑAQUE', 'PARAÑAQUE']
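
Real data often varies in letter case as well as spelling. One possible extension of the cell above, sketched with made-up inputs, is to compile the pattern once with `re.IGNORECASE` so that each spelling variant is matched regardless of case.

```python
import re

# Compile once and reuse; IGNORECASE makes the alternatives case-insensitive.
variants = re.compile(r"paranaque|pnque|parañaque", re.IGNORECASE)

locations = ["paranaque", "PNQUE", "Parañaque City"]
cleaned = [variants.sub("PARAÑAQUE", loc) for loc in locations]

print(cleaned)
```

Only the matched portion is replaced, so any surrounding text such as "City" is left as it was.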


### IMPORTANT

Please go through the tutorials and sample puzzles [here](https://regexcrossword.com/). We won't have time for a full-blown lecture on regex, but the topic is too important for you not to cover through self-study because future assignments and tests will depend heavily on it.

## Introduction to Data Cleansing (or Data Cleaning)

Source: [Techopedia](https://www.techopedia.com/definition/1174/data-cleansing)

Data cleansing is the process of altering data in a given storage resource to make sure that it is accurate and correct. There are many ways to pursue data cleansing in various software and data storage architectures; most of them center on the careful review of data sets and the protocols associated with any particular data storage technology.

Data cleansing is also known as data cleaning or data scrubbing.

### Data Cleaning Exercise

We have remnants of a Google Forms submission where students were asked to specify their courses. Unfortunately, the form designer was not able to restrict entries to a few valid choices, so the free-form text yielded a messy collection shown below. How would you go about cleaning the data?

In [177]:
dirty_course_list = [
    "BS MGT HONORS",
    "BS Management Honors",
    "BS Management Engineering",
    "BS M.E.",
    "M.E.",
    "BS M.E.",
    "BS ME",
    "BS Mgt Eng",
    "BS Mgt. Eng."  # no comma after this string, so Python joins it with the next one into "BS Mgt. Eng.BSME"
    "BSME",
    "BS ME",
    "BS-ME",
    "BMH",
    "BM-H",
    "MH",
    "BS MH",
    "MGTH",
    "Mgt Honors",
    "ME"
]


### Candidate Solution

This is by no means a recipe for general data cleansing. Rather, it is a simple approach to the problem above. To simplify the illustration, we won't be using regex just yet. (We will go through a more general data cleansing process flow in a future lecture.)

#### 1) Find unique names

In [178]:
# Find unique names

unique_names = set(dirty_course_list)

unique_names

{'BM-H',
 'BMH',
 'BS M.E.',
 'BS ME',
 'BS MGT HONORS',
 'BS MH',
 'BS Management Engineering',
 'BS Management Honors',
 'BS Mgt Eng',
 'BS Mgt. Eng.BSME',
 'BS-ME',
 'M.E.',
 'ME',
 'MGTH',
 'MH',
 'Mgt Honors'}

In [179]:
[n for n in unique_names]

['BS Mgt Eng',
 'BS Mgt. Eng.BSME',
 'MH',
 'BS Management Honors',
 'MGTH',
 'BS-ME',
 'BS Management Engineering',
 'BS M.E.',
 'BS MH',
 'M.E.',
 'Mgt Honors',
 'BS MGT HONORS',
 'ME',
 'BM-H',
 'BMH',
 'BS ME']

#### 2) Cluster names and converge to one name per cluster

In [180]:
# add unique_names to dictionary

unique_names_dictionary = {name:None for name in unique_names}

In [181]:
unique_names_dictionary

{'BS Mgt Eng': None,
 'BS Mgt. Eng.BSME': None,
 'MH': None,
 'BS Management Honors': None,
 'MGTH': None,
 'BS-ME': None,
 'BS Management Engineering': None,
 'BS M.E.': None,
 'BS MH': None,
 'M.E.': None,
 'Mgt Honors': None,
 'BS MGT HONORS': None,
 'ME': None,
 'BM-H': None,
 'BMH': None,
 'BS ME': None}

#### 3) Assign an official name per unique name

Oftentimes, this is a tedious, manual process. However, the effort is mostly at the start, and succeeding efforts may be automated.

In [182]:
# Go through each unique name and assign an official name

for i in unique_names_dictionary:
    final_name = input(i+": "+"Enter name: ")
    unique_names_dictionary[i]=final_name

BS Mgt Eng: Enter name: BS Management Engineering
BS Mgt. Eng.BSME: Enter name: BS Management Engineering
MH: Enter name: BS Management Honors
BS Management Honors: Enter name: BS Management Honors
MGTH: Enter name: BS Management Honors
BS-ME: Enter name: BS Management Engineering
BS Management Engineering: Enter name: BS Management Engineering
BS M.E.: Enter name: BS Management Engineering
BS MH: Enter name: BS Management Honors
M.E.: Enter name: BS Management Engineering
Mgt Honors: Enter name: BS Management Honors
BS MGT HONORS: Enter name: BS Management Honors
ME: Enter name: BS Management Engineering
BM-H: Enter name: BS Management Honors
BMH: Enter name: BS Management Honors
BS ME: Enter name: BS Management Engineering


Note: another approach is to dump the unique names to a CSV file, open it in Excel (where you can manually assign the official names in another column), export it back to CSV, then import it into Python for further processing.
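
The CSV round trip in the note can be sketched with the standard `csv` module. Everything specific here is an assumption for illustration: the file name, the column names `raw_name` and `official_name`, and the three sample entries. The manual spreadsheet step is simulated by rewriting the file with the second column filled in.

```python
import csv
import os
import tempfile

unique_names = ["BS ME", "BMH", "Mgt Honors"]
csv_path = os.path.join(tempfile.mkdtemp(), "course_names.csv")

# 1) Dump the unique names with an empty column for the official name.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["raw_name", "official_name"])
    for name in sorted(unique_names):
        writer.writerow([name, ""])

# 2) Stand-in for the manual step: in practice, someone fills in the
#    official_name column in a spreadsheet and saves the file as CSV again.
filled_rows = [
    ("BMH", "BS Management Honors"),
    ("BS ME", "BS Management Engineering"),
    ("Mgt Honors", "BS Management Honors"),
]
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["raw_name", "official_name"])
    writer.writerows(filled_rows)

# 3) Read the file back into a lookup dictionary.
with open(csv_path, newline="", encoding="utf-8") as f:
    name_mapping = {row["raw_name"]: row["official_name"] for row in csv.DictReader(f)}

print(name_mapping)
```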

In [183]:
## Verify the new list
unique_names_dictionary

{'BS Mgt Eng': 'BS Management Engineering',
 'BS Mgt. Eng.BSME': 'BS Management Engineering',
 'MH': 'BS Management Honors',
 'BS Management Honors': 'BS Management Honors',
 'MGTH': 'BS Management Honors',
 'BS-ME': 'BS Management Engineering',
 'BS Management Engineering': 'BS Management Engineering',
 'BS M.E.': 'BS Management Engineering',
 'BS MH': 'BS Management Honors',
 'M.E.': 'BS Management Engineering',
 'Mgt Honors': 'BS Management Honors',
 'BS MGT HONORS': 'BS Management Honors',
 'ME': 'BS Management Engineering',
 'BM-H': 'BS Management Honors',
 'BMH': 'BS Management Honors',
 'BS ME': 'BS Management Engineering'}

#### 4) Clean up the dirty course list

In [184]:
# Clean up the dirty course list
clean_course_list = [unique_names_dictionary[course] for course in dirty_course_list]

In [185]:
clean_course_list

['BS Management Honors',
 'BS Management Honors',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Engineering',
 'BS Management Honors',
 'BS Management Honors',
 'BS Management Honors',
 'BS Management Honors',
 'BS Management Honors',
 'BS Management Honors',
 'BS Management Engineering']

In [161]:
dirty_course_list

['BS MGT HONORS',
 'BS Management Honors',
 'BS Management Engineering',
 'BS M.E.',
 'M.E.',
 'BS M.E.',
 'BS ME',
 'BS Mgt Eng',
 'BS Mgt. Eng.BSME',
 'BS ME',
 'BS-ME',
 'BMH',
 'BM-H',
 'MH',
 'BS MH',
 'MGTH',
 'Mgt Honors',
 'ME']
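
As a preview of how regex might reduce the manual effort, the sketch below guesses the official name from the raw entry. The rules are assumptions tailored only to the entries in this particular list (for example, treating any entry that ends in "h" as Honors) and would misclassify other course names, so treat it as an illustration rather than a general solution. The function name `guess_official_name` is hypothetical.

```python
import re


def guess_official_name(raw_name):
    # Normalize: lowercase and drop everything that is not a letter,
    # so "BS M.E." and "BS-ME" both become "bsme".
    key = re.sub(r"[^a-z]", "", raw_name.lower())

    # Assumed rule: entries ending in "honors" or "h" are Management Honors.
    if re.search(r"(honors|h)$", key):
        return "BS Management Honors"
    # Assumed rule: entries ending in "engineering", "eng", or "e" are Management Engineering.
    if re.search(r"(engineering|eng|e)$", key):
        return "BS Management Engineering"
    # Anything else is left for manual review.
    return None


samples = ["BS MGT HONORS", "BM-H", "BS M.E.", "BS Mgt Eng", "Mgt Honors", "Accounting"]
guesses = [guess_official_name(name) for name in samples]

for name, guess in zip(samples, guesses):
    print(name, "->", guess)
```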

#### End of Introduction to Data Cleansing Discussion

I will be giving an exercise involving a dirty dataset in CSV format and will ask you to prepare a new, clean CSV using the data cleansing techniques discussed here. The dataset can only be cleaned easily if you have mastered **regex**.

### Miscellaneous Assignments for Day 15

Install PyPDF2 using the following command (**Terminal** on MacOS, **PowerShell**, **Command Prompt** or equivalent on Windows):
```conda install -c conda-forge pypdf2```
