# Learning Objectives

- [ ] 3.3.1 Determine the attributes of a database: table, record and field.
- [ ] 3.3.2 Explain the purpose of and use primary, secondary, composite and foreign keys in tables.
- [ ] 3.3.3 Explain with examples, the concept of data redundancy and data dependency.
- [ ] 3.3.4 Reduce data redundancy to third normal form (3NF).
- [ ] 3.3.5 Draw entity-relationship (ER) diagrams to show the relationship between tables.
- [ ] 3.3.6 *Understand how NoSQL database management system addresses the shortcomings of relational database management system (SQL). (NoSQL will be addressed in later chapter)
- [ ] 3.3.7 *Explain the applications of SQL and NoSQL. (NoSQL will be addressed in later chapter)
- [ ] 3.3.8 Use a programming language to work with both SQL and NoSQL databases. (NoSQL will be addressed in later chapter)
- [ ] 3.3.9 Understand the need for privacy and integrity of data.
- [ ] 3.3.10 Describe methods to protect data.
- [ ] 3.3.11 Explain the difference between backup and archive.
- [ ] 3.3.12 Describe the need for version control and naming convention.
- [ ] 3.3.13 Explain how data in Singapore is protected under the Personal Data Protection Act to govern the collection, use and disclosure of personal data. 

# References

1. Leadbetter, C., Blackford, R., & Piper, T. (2012). Cambridge international AS and A level computing coursebook. Cambridge: Cambridge University Press.
2. https://www.sparknotes.com/cs/sorting/bubble/section1/#:~:text=The%20total%20number%20of%20comparisons,since%20no%20swaps%20were%20made.
3. https://visualgo.net/en
4. https://www.youtube.com/watch?v=o9nW0uBqvEo
5. Six-Step Relational Database Design™ by Fidel A. Captain

A **database** is a collection of related data stored in an organised, logical manner, in which all records of the same kind share the same structure.

# 13.1 Flat Files Database
When referred to as a medium for storing data, a **flat file** is usually a plain text file or spreadsheet document in which records follow a uniform format, but there are no structures for indexing or recognising relationships between records. For example, consider a text file with the following content:

>```
Name, Gender, Age
Alex, M, 25
Ben, M, 29
Cindy, F, eighteen
Damian, M, 22
Erica, F, 23
Fanny, F, don't know
Gopal, M, 29
Damian, M, 22
>```

Storing data in flat files has the following limitations:
- data isolation: the information to be retrieved is stored across different files, which makes it hard to combine.
- data duplication: the same data item is stored more than once (e.g. the record for Damian above). Duplication is wasteful as it costs time and money: data has to be entered more than once, which takes up user time and storage space. Duplication is also likely to lead to a loss of data integrity and to inconsistent data (copies of a data item which should be the same but are not).
- data dependence: programs are written for a specific data format, so they may have to be changed whenever the format of the flat files changes.
- difficulty in changing application programs.
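As a rough illustration of these limitations, the sketch below loads the sample flat file with Python's standard `csv` module and looks for two of its problems: ages that are not numbers and records that are repeated. The helper names `load_records` and `find_problems` are made up for this example; nothing in a flat file performs these checks for us, which is the point.

```python
import csv
import io

# The sample flat file from above, held in a string for convenience.
FLAT_FILE = """\
Name, Gender, Age
Alex, M, 25
Ben, M, 29
Cindy, F, eighteen
Damian, M, 22
Erica, F, 23
Fanny, F, don't know
Gopal, M, 29
Damian, M, 22
"""


def load_records(text):
    """Parse the text into a list of dictionaries, one per record."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


def find_problems(records):
    """Return (names with a non-numeric age, names of repeated records)."""
    seen = set()
    bad_ages = []
    duplicates = []
    for rec in records:
        key = (rec["Name"], rec["Gender"], rec["Age"])
        if key in seen:
            duplicates.append(rec["Name"])
            continue
        seen.add(key)
        if not rec["Age"].isdigit():
            bad_ages.append(rec["Name"])
    return bad_ages, duplicates


records = load_records(FLAT_FILE)
bad_ages, duplicates = find_problems(records)
print("Non-numeric ages:", bad_ages)
print("Repeated records:", duplicates)
```

Every program that reads this file would need its own copy of such checks, whereas a database can enforce them once, centrally.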

# 13.2 Relational Database

A **table** (also called **relation** in relational database) is a two-dimensional representation of data stored in rows and columns. A table stores data about an **entity** – i.e. some “thing” about which data are stored, for example, a customer or a product.

**Relational database** is a database where data are organised in one or more tables with relationships between them, i.e. a collection of relational tables.

In each table, a complete set of data about a single item is called a **record**, i.e. it's a row in a table.

On the other hand, a column in a table is called a **field**. An **attribute** is a characteristic or property of the entity; it names the column and applies to every cell in that column.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

## Example 1

The following table has 5 records and 3 fields, and the attributes are `Colour`, `Price` and `Stock`.

<center>

| `Colour` | `Price` | `Stock` |
|-|-|-|
| Red | 0.50 | 30 |
| Green | 0.50 | 18 |
| Yellow | 0.80 | 43 |
| Blue | 0.90 | 66 |
| White | 0.85 | 39 |

</center>

## Example 2

Attributes can be used to describe a table. The following table has the following description:

>```
Student (RegNo, Name, Gender, MobileNo)
>```

<center>

| RegNo | Name | Gender | MobileNo |
|-|-|-|-|
| 1 | Adam | M | 92313291 |
| 2 | Adrian | M | 92585955 |
| 3 | Agnes | F | 83324112 |
| 4 | Aisha | F | 88851896 |
| 5 | Ajay | M | 94191061 |
| 6 | Alex | M | 98671715 |
| 7 | Alice | F | 95029176 |
| 8 | Amy | F | 98640883 |
| 9 | Andrew | M | 95172444 |
| 10 | Andy | M | 95888639 |

</center>

In general, a table in a relational database can be described as:

>```
TABLE_NAME(ATTRIBUTE_1, ATTRIBUTE_2, ATTRIBUTE_3, ATTRIBUTE_4,....)
>```

Usually, the description of the entity is used for `TABLE_NAME` as well.

## Exercise 3

Provide the table description of the following table on number of balloons sold and in stock.

<center>

| `Colour` | `Price` | `AmountSold`| `Stock` |
|-|-|-|-|
| Red | 0.50 |40| 30 |
| Green | 0.50 |17| 18 |
| Yellow | 0.80 |57| 43 |
| Blue | 0.90 |24| 66 |
| White | 0.85 |36| 39 |

</center>

In [None]:
#YOUR_ANSWER_HERE

## 13.2.1 Properties of a table

A table in a relational database must fulfil the following conditions:

- Values are **atomic**, i.e. for each record, each entry contains only one piece of information, e.g. in Example 2, a student cannot have two mobile phone numbers in a single entry.
- All values in a column are of the same kind.
- Rows are unique, i.e. there are no repeated rows.
- The order of columns is insignificant.
- Each column must have a unique name.

## 13.2.2 Key Fields
When we consider a database, it is important to be able to identify each record in a table given some information in the fields, e.g. being able to identify the name of a person given a phone number.

A **key field**, or **key** in short, is either a column or a combination of columns in a database that uniquely identifies the specific record in question.

There are different types of keys.
- A **candidate key** is a **minimal** set of fields which can uniquely identify each record in a table. A candidate key should never be NULL or empty.
- A **primary key** is the candidate key chosen as the main key for a table, so it uniquely identifies each record (row) in the table. In the table description, the primary key is denoted with an underline on the attribute, e.g. $$\text{Student}\left(\underline{\text{MatricNo}},\text{ Name, Gender, CivicsClass}\right)$$
- A **secondary key** is a candidate key that is not chosen as the primary key, i.e. an alternative to the primary key. A user often wants to search the database using a secondary key, and it is up to the designer of the database to decide which attributes will form the secondary keys. Setting up these secondary keys is called *indexing*.
- A **composite key** is a combination of two or more fields in a table that can be used to uniquely identify each record in a table. Uniqueness is only guaranteed when the fields are combined.
- A **foreign key** is an attribute (field) in one table that refers to the primary key in another table, i.e. it links to a primary key in a second table and so forms a relationship between the two tables. Foreign keys are indicated by a dashed underline.
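The sketch below shows how these kinds of key behave in practice, using Python's built-in `sqlite3` module with an in-memory database. The `Enrolment` table and its `Subject` column are hypothetical and exist only for this illustration. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are not enforced by default

# MatricNo is the primary key of Student.
conn.execute("CREATE TABLE Student (MatricNo INTEGER PRIMARY KEY, Name TEXT)")

# Hypothetical table: (MatricNo, Subject) is a composite primary key,
# and MatricNo is also a foreign key that refers to Student.
conn.execute("""
    CREATE TABLE Enrolment (
        MatricNo INTEGER REFERENCES Student(MatricNo),
        Subject  TEXT,
        PRIMARY KEY (MatricNo, Subject)
    )
""")

conn.execute("INSERT INTO Student VALUES (1, 'Adam')")
conn.execute("INSERT INTO Enrolment VALUES (1, 'Computing')")


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    try_insert("INSERT INTO Student VALUES (1, 'Adrian')"),         # repeated primary key
    try_insert("INSERT INTO Enrolment VALUES (1, 'Computing')"),    # repeated composite key
    try_insert("INSERT INTO Enrolment VALUES (99, 'Computing')"),   # no student 99 exists
    try_insert("INSERT INTO Enrolment VALUES (1, 'Mathematics')"),  # new combination
]
print(results)  # ['rejected', 'rejected', 'rejected', 'ok']
```

The last insert succeeds because only the combination of `MatricNo` and `Subject` has to be unique, not each field on its own.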

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

## Exercise 4

Consider the following table.

<center>

| RegNo | Name | Gender | MobileNo |
|-|-|-|-|
| 1 | Adam | M | 92313291 |
| 2 | Adrian | M | 92585955 |
| 3 | Agnes | F | 83324112 |
| 4 | Aisha | F | 88851896 |
| 5 | Ajay | M | 94191061 |
| 6 | Alex | M | 98671715 |
| 7 | Alice | F | 95029176 |
| 8 | Amy | F | 98640883 |
| 9 | Andrew | M | 95172444 |
| 10 | Andy | M | 95888639 |

</center>

- What is/are the candidate key(s)?
- What is the primary key?

In [None]:
#YOUR_ANSWER_HERE

## Exercise 5

Consider the following table.

<center>

| RegNo | Name | Gender | CivicsClass |
|-|-|-|-|
| 1 | Adam | M | 18S12 |
| 2 | Adrian | M | 18S12 |
| 3 | Agnes | F | 18S12 |
| 4 | Aisha | F | 18S12 |
| 5 | Ajay | M | 18S12 |
| 6 | Alex | M | 18S12 |
| 7 | Alice | F | 18S12 |
| 8 | Amy | F | 18S12 |
| 9 | Andrew | M | 18S12 |
| 10 | Andy | M | 18S12 |
| 1 | Adam | M | 18A10 |
| 2 | bala | M | 18A10 |
| 3 | Bee Lay | F | 18A10 |
| 4 | Ben | M | 18A10 |
| 5 | Boon Kiat | M | 18A10 |
| 6 | Boon Lim | M | 18A10 |
| 7 | Charles | M | 18A10 |
| 8 | Chee Seng | M | 18A10 |
| 9 | Cher Leng | F | 18A10 |
| 10 | Choo Tuan | M | 18A10 |

</center>

- What is/are the composite key(s)?

In [None]:
#YOUR_ANSWER_HERE

## Example 6

Consider the following tables `Student` and `ClassInfo` respectively.

<center>

| RegNo | Name | Gender| CivicsClass |
|-|-|-|-|
| 1 | Adam | M | 18S12 |
| 2 | Adrian | M | 18S12 |

</center>

<center>

| CivicsClass | CivicsTutor | HomeRoom|
|-|-|-|
| 18S12 | Mr Tan | CR1 | 
| 18A10 | Ms Aishya | CR2 | 

</center>

- What is/are the primary key(s) in each table?
- What is the attribute in the table `ClassInfo` that is the foreign key in the `Student` table?

In [None]:
#YOUR_ANSWER_HERE

# 13.3 Designing Relational Database

Consider the following example of an `ORDER` table.

<center>

| Num | CustName | City | Country | ProdID | Description|
|-|-|-|-|-|-|
| 005 | Bill Jones | London | England | 1| Table|
| 005 | Bill Jones | London | England | 2| Desk|
| 005 | Bill Jones | London | England | 3| Chair|
| 008 | Amber Arif | Lahore | Pakistan | 2| Desk|
| 008 | Amber Arif | Lahore | Pakistan | 7| Cupboard|
| 014 | M. Ali | Kathmandu | Nepal | 5| Cabinet|
| 002 | Omar Norton | Cairo | Egypt | 7| Cupboard|
| 002 | Omar Norton | Cairo | Egypt | 1| Table|
| 002 | Omar Norton | Cairo | Egypt | 2| Desk|

</center>

In the `ORDER` table above, as in Exercise 5, we see the same data being stored more than once. This repetition of entries in a database is termed **data redundancy**.

## 13.3.1 Normalisation
**Normalisation** is the process of organising the tables in a database to reduce data redundancy and prevent inconsistent data. During normalisation, a table is usually split into two or more tables, which remain linked to each other via keys. The resulting tables can be in one of the following normal forms:

### 13.3.1.1 First Normal Form (1NF)

For a table to be in 1NF:
- all columns must be atomic, i.e. entities (objects of interest, e.g. person, item, place) in the database do not contain repeated groups of attributes.
- columns must not hold a collection such as an array or another table. This means the information in each column cannot be broken down further.

We remove the repeating groups by:
- moving some of the attributes to a new table
- linking the new table to the original table with a foreign key.

### Example 7
Using the `ORDER` table as an example, the following tables `ORDER(1NF)` and `ORDER-PRODUCTS` are in first normal form.
<center>

| Num | CustName | City | Country | 
|-|-|-|-|
| 005 | Bill Jones | London | England |
| 008 | Amber Arif | Lahore | Pakistan | 
| 014 | M. Ali | Kathmandu | Nepal |
| 002 | Omar Norton | Cairo | Egypt | 


</center>

<center>

| Num | ProdID | Description |
|-|-|-|
| 005 | 1 | Table | 
| 005 | 2 | Desk | 
| 005 | 3 | Chair | 
| 008 | 2 | Desk | 
| 008 | 7 | Cupboard | 
| 014 | 5 | Cabinet | 
| 002 | 1 | Table | 
| 002 | 2 | Desk | 
| 002 | 7 | Cupboard | 
|...| | |

</center>

The primary key in `ORDER(1NF)` is **Num**, while the primary key in `ORDER-PRODUCTS` is the composite key **(Num, ProdID)**. In addition, **Num** is a foreign key in the `ORDER-PRODUCTS` table, linking it to `ORDER(1NF)`.

### Example 8

Are the following tables in 1NF?

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

In [None]:
#YOUR_ANSWER_HERE

### 13.3.1.2 Second Normal Form (2NF)

To continue with our process of normalisation, we first introduce the following definitions.

Let $x,y$ be attributes in a table. We say that attribute $y$ is **functionally dependent** on attribute $x$ (usually the primary key), if for every valid instance of $x$, the value of $x$ **uniquely determines** the value of $y$ ($x \rightarrow y$). 

Let $y$ be an attribute and $S$ be a set of attributes of a table. $y$ is **fully dependent** on $S$ if all the attributes in $S$ are required to **uniquely determine** the value of $y$. If not all the attributes are required, we say that $y$ is **partially dependent** on $S$.

### Example 9
In the `ORDER(1NF)` table,
- $\text{Num}\rightarrow \text{CustName}$
- $\text{City}\rightarrow \text{Country}$.

In the `ORDER-PRODUCTS` table, $\text{Description}$ is *partially dependent* on the primary key $(\text{Num},\text{ProdID})$, because $\text{ProdID}$ alone determines $\text{Description}$.
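A functional dependency can be tested against sample data with a few lines of Python. The sketch below is only an illustration: the function name `determines` is made up, and a result of `True` means the sample rows are consistent with the dependency, not that it holds for all possible data.

```python
# Rows of the ORDER-PRODUCTS table from Example 7.
ORDER_PRODUCTS = [
    {"Num": "005", "ProdID": 1, "Description": "Table"},
    {"Num": "005", "ProdID": 2, "Description": "Desk"},
    {"Num": "005", "ProdID": 3, "Description": "Chair"},
    {"Num": "008", "ProdID": 2, "Description": "Desk"},
    {"Num": "008", "ProdID": 7, "Description": "Cupboard"},
    {"Num": "014", "ProdID": 5, "Description": "Cabinet"},
    {"Num": "002", "ProdID": 1, "Description": "Table"},
    {"Num": "002", "ProdID": 2, "Description": "Desk"},
    {"Num": "002", "ProdID": 7, "Description": "Cupboard"},
]


def determines(rows, x, y):
    """Return True if, in these rows, the attributes in x uniquely determine y."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in x)
        # setdefault stores the first y value seen for this key and returns it;
        # a different y value for the same key breaks the dependency.
        if seen.setdefault(key, row[y]) != row[y]:
            return False
    return True


print(determines(ORDER_PRODUCTS, ["Num", "ProdID"], "Description"))  # True
print(determines(ORDER_PRODUCTS, ["ProdID"], "Description"))         # True
print(determines(ORDER_PRODUCTS, ["Num"], "Description"))            # False
```

Since `ProdID` on its own already determines `Description`, the dependency on the full key `(Num, ProdID)` is only partial.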

For a table to be in 2NF:
- it has to be in 1NF
- every non-key attribute must be **fully** dependent on **all** of the primary key. This means no attribute can depend on part of the primary key only

We remove the partial dependencies by:
- moving the partially dependent attribute to a new table
- linking the new table to the table with a foreign key.

### Example 10
From the definitions above, the `ORDER-PRODUCTS` table is not in 2NF. To make it 2NF, we move the attribute $\text{Description}$ to a new table to get the following tables `ORDER-PRODUCTS(2NF)` and `PRODUCT`. Furthermore, $\text{ProdID}$ is the primary key of the `PRODUCT` table and a foreign key in the `ORDER-PRODUCTS(2NF)` table.

<center>

| Num | ProdID | 
|-|-|
| 005 | 1 | 
| 005 | 2 | 
| 005 | 3 | 
| 008 | 2 | 
| 008 | 7 | 
| 014 | 5 | 
| 002 | 1 | 
| 002 | 2 | 
| 002 | 7 | 
|...|   |

</center>

<center>

| ProdID | Description |
|-|-|
| 1 | Table | 
| 2 | Desk | 
| 3 | Chair | 
| 5 | Cabinet | 
| 7 | Cupboard | 
|...|   |

</center>

### Exercise 11

Reduce the following table into 2NF tables.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>


### 13.3.1.3 Third Normal Form (3NF)

For the last form, we will introduce yet another definition.

Let $x,y,z$ be attributes in a table. A functional dependency $x\rightarrow z$ is said to be **transitive** if there exists an attribute $y$ such that $x\rightarrow y$ and $y\rightarrow z$. 

Note that $x\rightarrow y$ **does not** necessarily imply the converse $y \rightarrow x$.

For a table to be in 3NF:
- it has to be in 2NF
- The table should not have transitive dependencies between the non-key attributes.

#### Example 14
 Consider the `ORDER(1NF)` table.
 
 <center>

| Num | CustName | City | Country | 
|-|-|-|-|
| 005 | Bill Jones | London | England |
| 008 | Amber Arif | Lahore | Pakistan | 
| 014 | M. Ali | Kathmandu | Nepal |
| 002 | Omar Norton | Cairo | Egypt | 

</center>

Note that the table is in 2NF but not in 3NF: $\text{Num}\rightarrow\text{City}$ and $\text{City}\rightarrow\text{Country}$, so the non-key attribute Country depends on the primary key only transitively, through the non-key attribute City.

To make it 3NF, we break the table down further into the following tables `ORDER(3NF)` and `CITY-COUNTRIES`.

<center>

| Num | CustName | City | 
|-|-|-|
| 005 | Bill Jones | London | 
| 008 | Amber Arif | Lahore | 
| 014 | M. Ali | Kathmandu | 
| 002 | Omar Norton | Cairo | 
|...|||

</center>

<center>

| City | Country | 
|-|-|
| London | England |
| Lahore | Pakistan | 
| Kathmandu | Nepal |
| Cairo | Egypt | 
|...| |

</center>

To summarise, the normalisation process leaves us with more tables, but each fact is stored in only one place. We can still retrieve the information we want by following the keys that link the tables, and in doing so we avoid data redundancy.
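To check that nothing is lost by normalising, the sketch below splits the rows of the original `ORDER` table into the four 3NF tables from the examples and then joins them back together. The variable names are chosen for this illustration only; the data is taken from the tables above.

```python
# Rows of the original ORDER table: (Num, CustName, City, Country, ProdID, Description).
ORDER_ROWS = [
    ("005", "Bill Jones", "London", "England", 1, "Table"),
    ("005", "Bill Jones", "London", "England", 2, "Desk"),
    ("005", "Bill Jones", "London", "England", 3, "Chair"),
    ("008", "Amber Arif", "Lahore", "Pakistan", 2, "Desk"),
    ("008", "Amber Arif", "Lahore", "Pakistan", 7, "Cupboard"),
    ("014", "M. Ali", "Kathmandu", "Nepal", 5, "Cabinet"),
    ("002", "Omar Norton", "Cairo", "Egypt", 7, "Cupboard"),
    ("002", "Omar Norton", "Cairo", "Egypt", 1, "Table"),
    ("002", "Omar Norton", "Cairo", "Egypt", 2, "Desk"),
]

# Building each table as a set removes the repeated rows.
orders = {(num, cust, city) for num, cust, city, _, _, _ in ORDER_ROWS}
city_countries = {(city, country) for _, _, city, country, _, _ in ORDER_ROWS}
order_products = {(num, prod) for num, _, _, _, prod, _ in ORDER_ROWS}
products = {(prod, desc) for _, _, _, _, prod, desc in ORDER_ROWS}

# Look-up dictionaries play the role of following a key to another table.
order_info = {num: (cust, city) for num, cust, city in orders}
country_of = dict(city_countries)
description_of = dict(products)

# Join the tables back together, one row per (Num, ProdID) pair.
rebuilt = sorted(
    (num, *order_info[num], country_of[order_info[num][1]], prod, description_of[prod])
    for num, prod in order_products
)

print(len(orders), len(city_countries), len(order_products), len(products))  # 4 4 9 5
print(rebuilt == sorted(ORDER_ROWS))  # True
```

Each customer name and each product description is now stored once, yet the original nine rows can be recovered exactly.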


# 13.4 Entity-Relationship Diagram

Recall that entities are objects about which data are stored in the database. To illustrate the relationship between entities, an **entity–relationship diagram (E–R diagram)** can be used.

In an E-R diagram, 
- entities are represented as rectangles,
- relationships between two entities are represented by lines connecting the rectangles. There are 3 types of relationships:
    - **one to one** : when a single instance of an entity is associated with a single instance of another entity, e.g. a person (legally) has only one NRIC number. Represented by a line with single ends between the entities.
    <center>
    <img src="images/mario_2.jpg" width="250" align="center"/>
    </center>
    
    - **one to many** : when a single instance of an entity is associated with more than one instance of another entity, e.g. a school has many students, but a student cannot be enrolled in multiple schools at the same time. Represented by a line with a single end on the entity with a single instance and a "crow's foot" on the entity with multiple instances.
    <center>
    <img src="images/mario_2.jpg" width="250" align="center"/>
    </center>
    
    - **many to many** : when more than one instance of an entity is associated with more than one instance of another entity, e.g. a student can be assigned to many projects and a project can be assigned to many students. Represented by a line with a "crow's foot" on both ends.
    <center>
    <img src="images/mario_2.jpg" width="250" align="center"/>
    </center>

#### Example 15

A small library wants to keep track of its <u>collections</u> (e.g., fiction, non-fiction, journals, etc.), the <u>items</u> in those collections, the physical <u>location</u> in the library of these collections, the <u>members</u> of the library and the <u>items</u> they borrow from the various collections.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

# 13.5 Advantages of Relational Database Over Flat Files

There are two main areas where using a relational database is more advantageous than using flat files.
## 13.5.1 Data Storage

<center>

| Flat Files | Relational Database | 
|-|-|
| Data are stored in a number of files. | Data are contained in a single software application – the relational database or DBMS software. |
| Data are highly likely to be duplicated and may become inconsistent – it can never be certain that all copies of a piece of data have been updated. | Duplication of data is minimised and so the chance of data inconsistency is reduced. As long as there is a link to the table storing the data, they can always be accessed via the link rather than repeating the data. Good database design avoids data duplication. | 
| Because of data duplication, the volume of data stored is large. | Because data duplication is minimised, the volume of data is reduced, leading to faster searching and sorting of data. |

</center>

## 13.5.2 Program-data independence

<center>

| Flat Files | Relational Database | 
|-|-|
| When data structures need to be altered, the software must be re-written. | Data structures remain the same even when the tables are altered. Existing programs do not need to be altered when a table design is changed.|
| Views of the data are governed by the different files used to control the data and produced by individual departments. All views of the data have to be programmed and this is very time-consuming. |  Queries and reports can be set up with simple “point and click” features or using the data manipulation language. A novice user can write queries quickly. |
</center>


# 13.6 SQLite Database

**Structured Query Language** (SQL) is a standard computer language for the operation and management of relational databases. It is a language used to query, insert, update and delete data.

There are many variants of SQL Engines, e.g. MySQL, Microsoft SQL, SQLite, PostgreSQL etc. However, for your syllabus, you are required to be able to work with **SQLite Databases**.

Each value stored in an SQLite database (or manipulated by the database engine) has one of the following storage classes:
- `NULL`: The value is a NULL value.
- `INTEGER`: used for a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
- `REAL`: used for a floating point value, stored as an 8-byte IEEE floating point number.
- `TEXT`: used for a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
- `BLOB`: used for large binary data, such as images or multimedia in a database.

For the most part, "storage class" is indistinguishable from "datatype" and the two terms can be used interchangeably.

SQLite supports the concept of type affinity on columns/fields. Type affinity refers to the preferred data type for values stored in a column. The affinity is a recommendation rather than a rule: a column can still store any type of data, because the declared type is not enforced.

Each column in a SQLite table is assigned one of the following type affinities:
- `INTEGER` 
- `TEXT` 
- `REAL` 
- `BLOB` 
- `NUMERIC` : A column with `NUMERIC` affinity may contain values using all five storage classes mentioned previously. In short, it tries to accommodate the values entered by making a guess at the type of each value when it is stored in the database.
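Type affinity can be observed with SQLite's built-in `typeof()` function. The sketch below uses Python's `sqlite3` module and a throwaway table named `Demo`, which is made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Demo (i INTEGER, t TEXT)")

# Row 1: a numeric-looking string goes into the INTEGER column,
#        and an integer goes into the TEXT column.
conn.execute("INSERT INTO Demo VALUES (?, ?)", ("123", 456))
# Row 2: a string that cannot be read as a number goes into the INTEGER column.
conn.execute("INSERT INTO Demo VALUES (?, ?)", ("abc", "xyz"))

rows = conn.execute(
    "SELECT typeof(i), typeof(t) FROM Demo ORDER BY rowid"
).fetchall()
print(rows)  # [('integer', 'text'), ('text', 'text')]
```

In the first row the values were converted to match each column's affinity. In the second row `'abc'` could not be converted, so it was stored as text in the `INTEGER` column without any error being raised.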

To work with databases, it is imperative to get familiar with CRUD operations,
- <u>C</u>reate
- <u>R</u>ead(Retrieve)
- <u>U</u>pdate(Modify)
- <u>D</u>elete(Destroy)
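As a minimal sketch, the four CRUD operations can be carried out from Python with the built-in `sqlite3` module. The example borrows the `Borrower` table used later in this section; the column types and the replacement contact number are assumptions made for illustration, and an in-memory database is used so that no file is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a filename such as "library.db" to keep the data

# Create: define the table, then insert records.
conn.execute("""
    CREATE TABLE Borrower (
        ID        INTEGER PRIMARY KEY,
        FirstName TEXT,
        Surname   TEXT,
        Contact   TEXT
    )
""")
conn.executemany(
    "INSERT INTO Borrower (FirstName, Surname, Contact) VALUES (?, ?, ?)",
    [("Peter", "Tan", "999"), ("Sarah", "Lee", "81111123")],
)

# Read: retrieve records with SELECT.
names = conn.execute("SELECT ID, FirstName FROM Borrower ORDER BY ID").fetchall()
print(names)  # [(1, 'Peter'), (2, 'Sarah')]

# Update: modify an existing record (the new number is made up).
conn.execute("UPDATE Borrower SET Contact = ? WHERE ID = ?", ("91234567", 1))

# Delete: remove a record.
conn.execute("DELETE FROM Borrower WHERE ID = ?", (2,))

conn.commit()  # make the changes permanent
remaining = conn.execute("SELECT * FROM Borrower").fetchall()
print(remaining)  # [(1, 'Peter', 'Tan', '91234567')]
```

The `?` placeholders let `sqlite3` insert the values safely, which is preferable to building the SQL text by joining strings.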

## 13.6.1 DBBrowser for SQLite
[DB Browser for SQLite (DB4S)](https://sqlitebrowser.org/) is a simple, easy-to-use Graphical User Interface (GUI) based software for creating and editing database files compatible with SQLite. It hides the details of complex SQL commands while providing an easy-to-use interface for performing the same database operations.

We will illustrate how we can do the CRUD operations in DB4S.

### Example 16

A library contains books that can be on loan to borrowers where:
- A borrower can take one or many loans.
- Each loan record belongs to only one borrower.
- A book can be loaned many times.
- A publisher publishes one or many books.
- A book can be published by zero or one publisher.

So, the ER diagram looks like

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

## 13.6.1.1 Creating Database with DBBrowser
1.	Create a folder called `DBTASK`. You will save all your files inside this folder.
2.	Open `DBBrowser for SQLite`.
3.	Click `File`, then `New Database`.
4.	Save and name your database file as `library`. The default extension is `.db`. Note: other database file extensions are `sqlite/sqlite3/db3`
5.	Create a table called `Borrower` with the fields and constraints listed above.
6.	Click `Write Changes` or `CTRL+S` to save changes to the database.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

# 13.7 SQLite in Python

## 13.6.1.2 Inserting Records [<u>C</u>reate]
1. Under the `Browse Data` tab, click `New Record`.
2. Click on the `FirstName` cell of the first record.
3. Under `Edit Database Cell`, type the value for `FirstName`. Click `Apply`.
4. Repeat the above two steps for `Surname` and `Contact`. If the record has been entered correctly, you should see the following in the table:
5. Click `New Record` to enter values for the next few records.

<center>

Borrower

| ID | FirstName | Surname | Contact |
|-|-|-|-|
| 1 | Peter | Tan | 999 |
| 2 | Sarah | Lee | 81111123 |
| 3 | Kumara | Ravi | 94456677 |
| 4 | Some | User | 11111111 |

</center>

6.	Write changes to the database.




Assume that the following steps each take constant time:
- mathematical operations
- comparisons
- assignments
- accessing objects in memory

We can then count the number of operations executed as a function of the size of the input.

In [None]:
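# About 3 operations (multiply, divide, add), regardless of the value of c
def c_to_f(c):
    return c * 9.0 / 5 + 32

# About 1 + 3x operations: one assignment, then a loop whose body runs
# once per value of i, so the count grows linearly with x
def mysum(x):
    total = 0
    for i in range(x + 1):
        total += i
    return total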

- best case : minimum running time over all possible inputs of a given size
- average case : average running time over all possible inputs of a given size
- worst case : maximum running time over all possible inputs of a given size

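To express an operation count in Big-O notation:
- ignore additive constants
- ignore multiplicative constants
- focus on dominant terms

For example:
- $n^2+2n+2$ is $O(n^2)$
- $n^2+10000n+3^{10000}$ is $O(n^2)$
- $\log(n)+n+4$ is $O(n)$
- $0.0001\,n\log(n)+300n$ is $O(n \log n)$
- $2n^{30}+3^n$ is $O(3^n)$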

Law of addition for $O()$
- used with *sequential* statements

In [None]:
# O(n) + O(n^2) = O(n^2): the loops run one after another, so their costs add
n = 3  # example input size
for i in range(n):
    print('a')
for j in range(n * n):
    print('b')

Law of multiplication for $O()$
- used with *nested* statements/loops

In [None]:
# O(n) * O(n) = O(n^2): the inner loop runs n times for each outer iteration
n = 3  # example input size
for i in range(n):
    for j in range(n):
        print('a')
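To see the two laws at work, the sketch below counts how many times the loop bodies run instead of printing. The function names `sequential_steps` and `nested_steps` are made up for this illustration.

```python
def sequential_steps(n):
    """Two loops one after another: n steps, then n * n steps."""
    steps = 0
    for _ in range(n):
        steps += 1
    for _ in range(n * n):
        steps += 1
    return steps


def nested_steps(n):
    """One loop inside another: n steps for each of n outer iterations."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


for n in (10, 100):
    print(n, sequential_steps(n), nested_steps(n))
# 10 110 100
# 100 10100 10000
```

As $n$ grows, the extra $n$ steps of the sequential version become negligible next to $n^2$, which is why both are $O(n^2)$.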

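Common complexity classes, from slowest-growing to fastest-growing (where $c$ is a constant greater than 1):
- $O(1)$: constant
- $O(\log n)$: logarithmic
- $O(n)$: linear
- $O(n \log n)$: log-linear
- $O(n^c)$: polynomial
- $O(c^n)$: exponential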

# 10.1 Search Algorithm

A search algorithm is an algorithm to retrieve information from some data structure. Some examples include:
- Finding the maximum or minimum value in a list or array
- Checking to see if a given value is present in a set of values
- Retrieving a record from a database

## 10.1.1 Linear Search

A **linear search**, also called a **serial** or **sequential** search, looks for an item in a given array by checking the elements one after another until the item is found or the end of the collection is reached. It does not require the data to be in any particular order.

To find the position of a particular value involves looking at each value in turn – starting with the first – and comparing it with the value you are looking for. When the value is found, you need to note its position. You must also be able to report the special case that a value has not been found. This last part only becomes apparent when the search has reached the final data item without finding the required value.

### Example

In this example, you have the array `[10,14,19,26,27,31,33,35,42,44]` and you are looking for the value `33` in the array.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

The pseudocode for linear search function is given below. It returns the index of the searched value in the array if it exists. In the case that the value is not in the array, the function returns `-1`.

In [None]:
FUNCTION LINEARSEARCH(A: ARRAY of INTEGER, t: INTEGER) RETURNS INTEGER
    DECLARE index: INTEGER
	index ← -1
	FOR i = 1 TO A.SIZE
		IF A[i] = t THEN
			index ← i
			BREAK
		ENDIF
	ENDFOR
	RETURN index
ENDFUNCTION

### Exercise

Implement a function `linear_search(array, val)` which searches the list `array` for a value `val` using the linear search algorithm.

Test your function with the following list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`

with the values `9` and `2`. What do you observe for the latter value?

In [None]:
#YOUR_CODE_HERE

In linear search, all items are searched one-by-one to find the required item.

If the array has $n$ elements to be compared to,

- The best-case lookup to find an item is $1$ comparison, i.e., the item is at the head of the array.
- The worst-case lookup to find an item is $n$ comparisons, i.e. the item is at the end of the array.
- The average lookup to find an item is approximately $\frac{n}{2}$ comparisons. 

Clearly, if $n$ is large,  this can be a very large number of comparisons and the serial search algorithm can take a long time.

Consequently, we have for serial search,
- Advantages:
    - the algorithm is straightforward and easy to implement,
    - data need not be in any particular order,
    - works well if there is a small number of data items.
- Disadvantage:
    - search can take a long time if the value of $n$ is large, i.e. it is inefficient if there is a large number of data items.
- Variations:
    - The search target must satisfy a different criterion (not just object existence).
    - Must find all instances of the target.
    - Must find a particular instance of the target (first, last, etc.).
    - Must find the object just greater/smaller than the target.
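A few of these variations are sketched below in Python. The function names are made up for this illustration, and each function still examines the items one by one, just like the basic linear search.

```python
def find_all(array, val):
    """Return the indices of every occurrence of val."""
    return [i for i, item in enumerate(array) if item == val]


def find_last(array, val):
    """Return the index of the last occurrence of val, or -1 if absent."""
    for i in range(len(array) - 1, -1, -1):  # walk from the end to the start
        if array[i] == val:
            return i
    return -1


def find_first_where(array, condition):
    """Return the index of the first item satisfying condition, or -1."""
    for i, item in enumerate(array):
        if condition(item):
            return i
    return -1


data = [39, 96, 51, 20, 42, 42, 74]
print(find_all(data, 42))                        # [4, 5]
print(find_last(data, 42))                       # 5
print(find_first_where(data, lambda x: x > 50))  # 1
```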


## 10.1.2 Binary Search

In the previous section, we looked at linear search where the data is not required to be stored in any particular order. On the other hand, if we know that the data is stored in ascending order, we can use another algorithm called **binary search**.

Workings of binary search algorithm:
- First check the MIDDLE element in the list.
- If it is the value we want, we can stop.
- If it is HIGHER than the value we want, we repeat the search process with the portion of the list BEFORE the middle element.
- If it is LOWER than the value we want, we repeat the search process with the portion of the list AFTER the middle element.

Note that the middle position is found by dividing by two. If the result is a whole number, we split the array there; otherwise we take the integer part of the result (integer division, `DIV`), as an array index must be an integer.

The pseudocode for binary search function is given below. It returns the index of the searched value in the array if it exists. In the case that the value is not in the array, the function returns `-1`.

In [None]:
FUNCTION BinarySearch(A: ARRAY of INTEGER, t: INTEGER) RETURNS INTEGER
	DECLARE start, mid, end: INTEGER
	start ← 1
	end ← A.SIZE
	WHILE start <= end DO
		mid ← (start + end) DIV 2
		IF t = A[mid] THEN
			RETURN mid
		ENDIF
		IF t < A[mid] THEN
			end ← mid - 1
		ELSE
			start ← mid + 1
		ENDIF
	ENDWHILE
	RETURN -1
ENDFUNCTION

### Exercise

Implement a function `binary_search(array, val)` which searches the list `array` for a value `val` using the binary search algorithm described above.

Test your function with the following list (sort it first, since binary search only works on data in ascending order)
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`

with the values `9` and `2`.

In [None]:
#YOUR_CODE_HERE

If the array has $n$ elements to be compared to,

- The best-case lookup to find an item is $1$ comparison, i.e., the item is at the middle of the array.
- The worst-case lookup to find an item is approximately $\log_2{n}$ comparisons.
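Python's standard library also offers binary search through the `bisect` module. The sketch below wraps `bisect_left` in a helper, `index_of`, whose name is made up for this illustration. It is a library-based alternative and not a replacement for writing the algorithm yourself. Note that Python lists are indexed from 0, unlike the pseudocode above, which starts at 1.

```python
import math
from bisect import bisect_left


def index_of(sorted_list, val):
    """Return the index of the leftmost val in sorted_list, or -1 if absent."""
    i = bisect_left(sorted_list, val)  # position where val would be inserted
    if i < len(sorted_list) and sorted_list[i] == val:
        return i
    return -1


data = sorted([39, 96, 51, 20, 42, 42, 74, 28])
print(data)                # [20, 28, 39, 42, 42, 51, 74, 96]
print(index_of(data, 42))  # 3
print(index_of(data, 50))  # -1

# Rough worst-case number of halvings for a million items.
print(math.ceil(math.log2(1_000_000)))  # 20
```

A linear search over a million items could need up to a million comparisons, whereas binary search needs only about 20.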

In [None]:
#YOUR_CODE_HERE

Jupyter Notebook provides the magic functions `%timeit` and `%%timeit` to time code execution.
* `%timeit` is used to time a single-line statement.
* `%%timeit` is used to time all the code in a cell. `%%timeit` must be placed on the first line of the cell.
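The magic functions only work inside Jupyter or IPython. In an ordinary Python script, a similar measurement can be made with the standard `timeit` module, as in the sketch below. The statement being timed, `sum(data)`, is just a placeholder.

```python
import timeit

data = list(range(1000))

# Total time, in seconds, taken by 1000 runs of the statement.
total = timeit.timeit(lambda: sum(data), number=1000)
print(f"average per run: {total / 1000:.2e} s")

# repeat() performs the whole measurement several times; the minimum is
# usually the figure least disturbed by other activity on the machine.
best = min(timeit.repeat(lambda: sum(data), number=1000, repeat=3))
print(f"best of 3: {best / 1000:.2e} s per run")
```

The timings depend on the machine, so the printed values will differ from run to run.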

### Exercise 
Use `%timeit` to time the code executions for both the functions:
- `linear_search`,
- `binary_search`

that you have coded in the previous exercises, using the list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`

with the search value `9`.

In [None]:
#YOUR_CODE_HERE

# 10.2 Sorting Algorithms

Sorting refers to arranging a fixed set of data in a particular order. Sorting orders could be numerical (`1`,`2`, `3`, ...), lexicographical/dictionary (`AA`, `AB`, `AC`, ...) or custom ('Mon', 'Tue', 'Wed', ...).

Sorting algorithms specify ways to arrange data in particular ways to put the data in order. In this section, it is assumed that the sorted data is in ascending order.

## 10.2.1 Insertion Sort

In the insertion sort algorithm, we compare each element, termed the `key` element, in turn with the elements before it in the array. We then insert the `key` element into its correct position in the array.

### Example

In this example, the array `[6,5,3,1,8,7,2,4]` is sorted with insertion sort.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

The pseudocode for insertion sort function for an array containing integer elements is given below:

In [None]:
FUNCTION InsertionSort(A: ARRAY of INTEGER) RETURNS ARRAY of INTEGER
    DECLARE j, temp: INTEGER
    FOR i = 2 TO A.SIZE
        j ← i
        WHILE j > 1 AND A[j] < A[j - 1] DO
            temp ← A[j]
            A[j] ← A[j - 1]
            A[j - 1] ← temp
            j ← j - 1
        ENDWHILE
    ENDFOR
    RETURN A
ENDFUNCTION

### Exercise

Implement a function `insertion_sort(array)` which sorts the list `array` in ascending order according to the insertion sort algorithm given above.

Test your function with the following list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`.

In [None]:
#YOUR_CODE_HERE

Note:
- The outer for-loop in the insertion sort function always iterates $n-1$ times.
- The inner while-loop makes $1 + 2 + 3 + \dots + (n-1)=\frac{n(n-1)}{2}$ comparisons in total in the worst case, which occurs when the array is in descending order.
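The comparison counts in the note can be checked with a small experiment. The sketch below is one possible Python rendering of the pseudocode with a counter added. The name `insertion_sort_count` and the choice of $n = 8$ are assumptions made for illustration.

```python
def insertion_sort_count(array):
    """Sort a copy of `array`; return (sorted_list, comparisons)."""
    items = list(array)
    comparisons = 0
    for i in range(1, len(items)):  # outer loop runs n - 1 times
        j = i
        while j > 0:
            comparisons += 1  # compare items[j] with items[j - 1]
            if items[j] >= items[j - 1]:
                break  # key element is already in position
            items[j], items[j - 1] = items[j - 1], items[j]
            j -= 1
    return items, comparisons


n = 8
print(insertion_sort_count(range(n, 0, -1))[1])  # descending input: 28
print(insertion_sort_count(range(1, n + 1))[1])  # ascending input: 7
```

For $n = 8$, the descending input needs $\frac{8 \times 7}{2} = 28$ comparisons, while the already sorted input needs only $n - 1 = 7$.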

## 10.2.2 Bubble Sort

The next sorting algorithm iterates over an array multiple times. 
* In each iteration, it compares each pair of consecutive elements.
* It swaps them if needed, so that the smaller value is on the left and the larger value is on the right.
* It repeats until the larger elements "bubble up" to the end of the list and the smaller elements move to the "bottom". This is the reason for the name of the algorithm.
* After each iteration, the right-hand side of the array is sorted.

### Example

In this example, the array `[6,5,3,1,8,7,2,4]` is sorted with bubble sort.

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

We see that
* In the 1st iteration, we make $n-1$ comparisons. This brings the largest value to the extreme right.
* In the 2nd iteration, we make $n-2$ comparisons. This brings the 2nd largest value to the 2nd position from the right.
* And so on...

Consequently, we need nested loops to make multiple iterations. 

The pseudocode for bubble sort function for an array containing integer elements is given below:

In [None]:
FUNCTION BubbleSort(A: ARRAY of INTEGER) RETURNS ARRAY of INTEGER
    DECLARE swap: BOOLEAN
    DECLARE temp: INTEGER
    FOR i = 1 to (A.SIZE - 1)
        swap ← FALSE
        FOR j = 1 to (A.SIZE - i)
            IF A[j] > A[j + 1] THEN
                temp ← A[j]
                A[j] ← A[j + 1]
                A[j + 1] ← temp
                swap ← TRUE
            ENDIF
        ENDFOR
        IF NOT swap THEN
            BREAK
        ENDIF
    ENDFOR
    RETURN A
ENDFUNCTION


### Exercise

Implement a function `bubble_sort(array)` which sorts the list `array` in ascending order according to the bubble sort algorithm given above.

Test your function with the following list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`.

In [None]:
#YOUR_CODE_HERE

Note:
- The number of comparisons in the bubble sort algorithm is at most $(n - 1) + (n - 2) + \dots + 1=\frac{n(n-1)}{2}$.
- The best case is when the array is already sorted: bubble sort terminates after the first iteration. 
- Bubble sort is also efficient when one new element needs to be inserted into an already sorted array, provided that the new element is placed at the beginning and not at the end. A single iteration then carries it to its correct position, and the next iteration finds no swaps.
- The worst case for bubble sort is when the smallest element is at the end of the array. Each iteration puts only the largest unsorted element in its proper location and moves the smallest element one position to the left, so it does not reach the front of the list until all $n-1$ iterations have occurred.
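The cases in the note can be compared by counting iterations and comparisons. The sketch below is one possible Python rendering of the pseudocode with counters added. The name `bubble_sort_count` and the three 8-element test lists are assumptions made for illustration.

```python
def bubble_sort_count(array):
    """Sort a copy of `array`; return (sorted_list, passes, comparisons)."""
    items = list(array)
    n = len(items)
    passes = comparisons = 0
    for i in range(n - 1):
        swapped = False
        passes += 1
        for j in range(n - 1 - i):  # the last i elements are already sorted
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:
            break  # no swaps in this pass, so the list is sorted
    return items, passes, comparisons


print(bubble_sort_count([1, 2, 3, 4, 5, 6, 7, 8])[1:])  # (1, 7)
print(bubble_sort_count([5, 1, 2, 3, 4, 6, 7, 8])[1:])  # (2, 13)
print(bubble_sort_count([2, 3, 4, 5, 6, 7, 8, 1])[1:])  # (7, 28)
```

The sorted list stops after one pass, the list with a new element at the front stops after two, and the list with the smallest element at the end needs all $n - 1 = 7$ passes and $\frac{8 \times 7}{2} = 28$ comparisons.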

## 10.2.3 Quicksort

Quicksort is a sorting technique based on the divide-and-conquer technique. Quicksort first selects an element, termed the `pivot`, and partitions the array around the pivot, putting every smaller element into a low array and every larger element into a high array. 

* The `pivot` element can be selected randomly, but one common choice is the element in the middle of the array.
* The first pass partitions data into 3 sub-arrays, `lesser` (less than pivot), `equal` (equal to pivot) and `greater` (greater than pivot).
* The process repeats for `lesser` array and `greater` array.
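The three steps above can be sketched directly in Python. This is a minimal illustration that builds new lists at every level, not necessarily the implementation the exercise expects. The name `quicksort_lists` is an assumption, and the middle element is used as the pivot.

```python
def quicksort_lists(array):
    """Return a new sorted list using lesser/equal/greater partitions."""
    if len(array) <= 1:
        return list(array)  # zero or one element is already sorted
    pivot = array[len(array) // 2]  # middle element as the pivot
    lesser = [x for x in array if x < pivot]
    equal = [x for x in array if x == pivot]
    greater = [x for x in array if x > pivot]
    # Only the lesser and greater parts need further sorting.
    return quicksort_lists(lesser) + equal + quicksort_lists(greater)


print(quicksort_lists([6, 5, 3, 1, 8, 7, 2, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Keeping the `equal` list separate means that duplicates of the pivot are never passed to a recursive call, so the recursion always works on strictly smaller lists.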

<center>
<img src="images/mario_2.jpg" width="250" align="center"/>
</center>

The pseudocode for quicksort function for an array containing $N$ elements is given below:

In [None]:
Quicksort(A,p,r) {
    if (p < r) {
       q <- Partition(A,p,r)
       Quicksort(A,p,q)
       Quicksort(A,q+1,r)
    }
}



Partition(A,p,r) {
    x <- A[p]
    i <- p-1
    j <- r+1
    while (True) {
        repeat
            j <- j-1
        until (A[j] <= x)
        repeat
            i <- i+1
        until (A[i] >= x)
        if (i < j)
            exchange A[i] <-> A[j]
        else 
            return(j)
    }
}

In [None]:
PROCEDURE QuickSort(MyList, LB, UB)
    IF LB < UB THEN
    #there is more than one element in MyList
        LeftP ← LB #Left pointer
        RightP ← UB #Right pointer
        REPEAT
            #use <= so that equal values do not cause endless swapping
            WHILE LeftP <> RightP AND MyList[LeftP] <= MyList[RightP] DO
            #move right pointer left
                RightP ← RightP - 1
            ENDWHILE
            IF LeftP <> RightP THEN 
                swap MyList[LeftP] and MyList[RightP]
            ENDIF
            WHILE LeftP <> RightP AND MyList[LeftP] <= MyList[RightP] DO
            #move left pointer right
                LeftP ← LeftP + 1
            ENDWHILE
            IF LeftP <> RightP THEN 
                swap MyList[LeftP] and MyList[RightP]
            ENDIF
        UNTIL LeftP = RightP
        #value now in correct position so sort left sub-list
        QuickSort(MyList, LB, LeftP - 1)
        #now sort right sub-list
        QuickSort(MyList, LeftP + 1, UB)
    ENDIF
ENDPROCEDURE

### Exercise

Implement a function `quicksort(array)` which sorts the list `array` in ascending order according to the quicksort algorithm given above.

Test your function with the following list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`.

In [None]:
#YOUR_CODE_HERE

Note: 
- The worst-case scenario is when the smallest or largest element is always selected as the pivot. This creates partitions of size $n-1$ and causes $n-1$ levels of recursive calls, giving a complexity of $O(n^2)$.
- With a good pivot, each level of recursion partitions the data in linear time, $O(n)$, and there are about $\log_2{n}$ levels of recursion on average. 
- This leads to an average complexity of $O(n \log_2 n)$.
- An implementation that builds new `lesser`, `equal` and `greater` arrays does not sort “in place” and has a high space complexity. To overcome this, use a variant that rearranges elements within the original array instead of creating new lists for the elements less than or greater than the pivot.
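An in-place variant can be sketched by following the `Partition` pseudocode given earlier, which swaps elements inside the array instead of building new lists. The names `hoare_partition` and `quicksort_in_place` are assumptions. The sketch uses 0-based indices, unlike the pseudocode, and takes the first element of each range as the pivot.

```python
def hoare_partition(items, p, r):
    """Partition items[p..r] around items[p]; return the split index."""
    pivot = items[p]
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while items[j] > pivot:   # repeat ... until items[j] <= pivot
            j -= 1
        i += 1
        while items[i] < pivot:   # repeat ... until items[i] >= pivot
            i += 1
        if i < j:
            items[i], items[j] = items[j], items[i]
        else:
            return j


def quicksort_in_place(items, p=0, r=None):
    """Sort the list `items` in place between indices p and r inclusive."""
    if r is None:
        r = len(items) - 1
    if p < r:
        q = hoare_partition(items, p, r)
        quicksort_in_place(items, p, q)
        quicksort_in_place(items, q + 1, r)


numbers = [6, 5, 3, 1, 8, 7, 2, 4]
quicksort_in_place(numbers)
print(numbers)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The only extra memory used is the recursion stack, since every exchange happens inside the original list.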

## 10.2.4 Merge Sort

Like quicksort, merge sort is a sorting technique based on the divide-and-conquer technique. It first divides the array into two halves of (nearly) equal size and then combines them in a sorted manner.
- If there is only one element in the list, it is already sorted, so return.
- Divide the list recursively into two halves until it can no longer be divided.
- Merge the smaller lists into a new list in sorted order.

The important subroutine is `merge`: given two sorted arrays, $A$ and $B$, of size $n_1$ and $n_2$, it combines them into a single sorted array of size $n_1 + n_2$.
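A minimal sketch of such a `merge` subroutine is shown below, assuming both inputs are Python lists already sorted in ascending order. The variable names are illustrative.

```python
def merge(left, right):
    """Merge two sorted lists into one new sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge([1, 4, 9], [2, 3, 10, 11]))  # [1, 2, 3, 4, 9, 10, 11]
```

Each element is appended exactly once, so merging takes about $n_1 + n_2$ steps.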

The pseudocode for merge sort function for an array containing $N$ elements is given below:

In [None]:
function merge_sort(list m) is
    // Base case. A list of zero or one elements is sorted, by definition.
    if length of m ≤ 1 then
        return m

    // Recursive case. First, divide the list into equal-sized sublists
    // consisting of the first half and second half of the list.
    // This assumes lists start at index 0.
    var left := empty list
    var right := empty list
    for each x with index i in m do
        if i < (length of m)/2 then
            add x to left
        else
            add x to right

    // Recursively sort both sublists.
    left := merge_sort(left)
    right := merge_sort(right)

    // Then merge the now-sorted sublists.
    return merge(left, right)

Note: 
- In sorting $n$ objects, merge sort has an average and worst-case performance of $O(n \log n)$. If the running time of merge sort for a list of length $n$ is $T(n)$, then the recurrence relation $T(n) = 2T(n/2) + n$ follows from the definition of the algorithm: apply the algorithm to two lists of half the size of the original list, and add the $n$ steps taken to merge the resulting two lists. The closed form follows from the master theorem for divide-and-conquer recurrences.

https://en.wikipedia.org/wiki/Master_theorem_(analysis_of_algorithms)
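As an informal alternative to the master theorem, the recurrence can be unrolled by hand. The working below assumes that $n$ is a power of 2 and that $T(1)$ is a constant.

$$
\begin{aligned}
T(n) &= 2T(n/2) + n \\
     &= 4T(n/4) + 2n \\
     &= 8T(n/8) + 3n \\
     &\;\;\vdots \\
     &= 2^k\,T(n/2^k) + kn
\end{aligned}
$$

Setting $k = \log_2{n}$ gives $T(n) = n\,T(1) + n\log_2{n}$, which is $O(n \log n)$.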

### Exercise

Implement a function `merge_sort(array)` which sorts the list `array` in ascending order according to the merge sort algorithm given above.

Test your function with the following list
> `
[39, 96, 51, 20, 42, 42, 74, 28, 66, 16, 10, 86, 6, 43, 67, 98, 32, 73, 99, 7, 80, 88, 57, 83, 1, 64, 33, 38, 38, 8, 68, 38, 42, 80, 71, 82, 25, 29, 2, 85, 2, 96, 34, 14, 9, 65, 50, 63, 99, 94, 5, 93, 84, 46, 64, 22, 59, 31, 74, 13, 93, 13, 98, 93]`.

In [None]:
#YOUR_CODE_HERE