#### What is data processing
    - We have data, but we need a way to process it before we can gain information from it.
* In big data
    - Volume is huge.
    - Data keeps growing at a staggering rate (velocity).
    - Data comes from a variety of sources and in a variety of formats.
* Scalable data processing
    - Allows a data processing system to cope with volume, velocity and variety.

#### Examples of data processing systems
    - RDBMS
    - NoSQL
    - Hadoop, Spark

#### What is a database?
    - A very large, integrated collection of data.
    - Models real-world entities (students, courses) and relationships (Purvil is taking CS101).

#### What is a DBMS?
    - We need a system to store data, retrieve data and manage (insert, update) data.
    - Benefits of a DBMS
        - Data independence: applications do not need to know how data is stored or organized. They only deal with the logical view; the DBMS takes care of the rest. The implementation of the database can also be changed to be more efficient without affecting the user interface.
        - Efficient data access: no need to handle the indexing of data yourself.
        - Concurrent access and crash recovery
        - Data consistency, integrity and security
        - Controlled redundancy
        - Reduced application development time: focus on the logic of your own application.
        - Multiple user support
        - Backup and recovery

#### Data model
    - The structure or format of a database.
    - A collection of concepts that describe data.
    - Examples:
        - Relational model (models data using relations, i.e. tables)
        - Entity-relationship model
        - Hierarchical model (used by XML databases)
    - A data model has several aspects:
        - Data structure (schema, column types)
        - Constraints (NOT NULL, UNIQUE, range of values)
        - Allowed operations (changes to and retrieval of data)
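
The three aspects of a data model (structure, constraints, operations) can be sketched with Python's built-in `sqlite3` module. The `Students` columns and sample rows below are assumptions for illustration, and SQLite's type names stand in for the `CHAR`/`REAL` types used elsewhere in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure: column names and types. Constraints: NOT NULL, UNIQUE, CHECK (range).
conn.execute("""
    CREATE TABLE Students (
        sid   TEXT PRIMARY KEY,
        name  TEXT NOT NULL,
        login TEXT UNIQUE,
        gpa   REAL CHECK (gpa BETWEEN 0.0 AND 4.0)
    )
""")

# Operations: changing data (INSERT) and retrieving data (SELECT).
conn.execute("INSERT INTO Students VALUES ('53688', 'Smith', 'smith@ee', 3.2)")


def try_insert(row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Students VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


null_name_ok = try_insert(("53650", None, "jones@cs", 3.4))       # breaks NOT NULL
dup_login_ok = try_insert(("53666", "Jones", "smith@ee", 3.4))    # breaks UNIQUE
bad_gpa_ok = try_insert(("53700", "Guldu", "guldu@music", 7.5))   # breaks CHECK
rows = conn.execute("SELECT sid, name FROM Students").fetchall()
```

Only the first row survives, because each of the other three violates one declared constraint.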

#### Schema
- A description of a particular collection of data, using a given data model.

#### Levels of abstraction
    - **Physical schema** : Describes how data is physically stored and organized in the database, e.g. a B+-tree index on some column to retrieve it fast.
    - **Logical schema (conceptual schema)** : How data is logically represented: which tables exist and what their columns are.
    - **Views** : How users see the data (external schema). There can be multiple views.
        - A view is only a view; it has no physical existence in the database.
![](images/data_abstraction.jpg)
    - Example: University database
![](images/university_database.jpg)
    - Logical data independence protects applications from changes in the logical structure of data. It is a measure of how much the conceptual schema can change without affecting application programs.
    - Physical data independence protects applications from changes in the physical structure of data. It is a measure of how much the internal schema can change without affecting application programs.
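
A minimal sketch of views and data independence, using Python's `sqlite3`. The `HonorRoll` view, the index name and the sample rows are hypothetical. The same query against the view keeps returning the same answer after a physical change (adding an index) and a logical change (adding a column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, age INTEGER, gpa REAL);
    INSERT INTO Students VALUES ('53688', 'Smith', 18, 3.2);
    INSERT INTO Students VALUES ('53650', 'Jones', 19, 3.8);
    -- External schema: a view that exposes only part of the logical schema.
    CREATE VIEW HonorRoll AS SELECT sid, name FROM Students WHERE gpa >= 3.5;
""")
before = conn.execute("SELECT name FROM HonorRoll").fetchall()

# Physical change: add an index. The query text does not change.
conn.execute("CREATE INDEX idx_students_gpa ON Students(gpa)")
# Logical change: add a column. The view still exposes the same columns.
conn.execute("ALTER TABLE Students ADD COLUMN first_year INTEGER")

after = conn.execute("SELECT name FROM HonorRoll").fetchall()
```

The application that reads `HonorRoll` needs no change in either case, which is the point of the two kinds of independence.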

#### Integrity
    - Does the database reflect reality well?
#### Consistency
    - A database without internal conflicts.
#### Metadata
    - The DBMS stores metadata such as where data came from, how data was changed, how data is stored, who owns the data, who can access it, data usage history and data usage statistics.

--------------------

### Entity-Relationship model
* Requirement analysis: What user expects from database
* Conceptual database design : Build ER diagram
* Logical database design : Convert ER diagram to relational database schema.

#### Conceptual design (ER-model)
* Contains a set of entities and the relationships between them.
* An entity is an object in the real world (user, student, employee).
* An entity is described by a set of attributes.
* Entity set: a collection of similar entities, e.g. all employees. All entities in an entity set have the same set of attributes. Each entity set has a key. Each attribute has a domain.
* A relationship is an association among 2 or more entities. Ex. Purvil works in Software.
* A relationship set is a collection of similar relationships.
* A multi-valued property, e.g. interest/hobby, is represented by a double ellipse.
![](images/ER.jpg)

* To translate a relationship set to a relation, the attributes of the relation must include
    - The keys of each participating entity set, as foreign keys
    - All descriptive attributes of the relationship

```
CREATE TABLE Works_In(
    ssn CHAR(11),
    did INTEGER,
    since DATE,
    PRIMARY KEY (ssn, did),
    FOREIGN KEY (ssn) REFERENCES Employee,
    FOREIGN KEY (did) REFERENCES Departments)
```

* A relationship is uniquely identified by its participating entities.
* Key constraint
    - In Works_In, an employee can work in many departments and a department can have many employees.
    - An employee can manage only 1 department.
    - A department can be managed by only 1 employee.
    - The arrow from Departments to Manages means that each department can have only 1 manager.
    
![](images/manages.jpg)
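
One common way to enforce the department side of this key constraint is to make `did` alone the key of the `Manages` relation. The sketch below uses Python's `sqlite3`; the column names and sample values are assumptions, and `PRAGMA foreign_keys = ON` is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE Employee (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Departments (did INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE Manages (
        ssn   TEXT NOT NULL,
        did   INTEGER,
        since TEXT,
        PRIMARY KEY (did),                       -- at most one Manages row per department
        FOREIGN KEY (ssn) REFERENCES Employee(ssn),
        FOREIGN KEY (did) REFERENCES Departments(did)
    );
    INSERT INTO Employee VALUES ('111-11-1111', 'Purvil'), ('222-22-2222', 'Joe');
    INSERT INTO Departments VALUES (1, 'Software');
    INSERT INTO Manages VALUES ('111-11-1111', 1, '2020-01-01');
""")

# A second manager for the same department violates the key on did.
try:
    conn.execute("INSERT INTO Manages VALUES ('222-22-2222', 1, '2021-01-01')")
    second_manager_accepted = True
except sqlite3.IntegrityError:
    second_manager_accepted = False

manager_count = conn.execute("SELECT COUNT(*) FROM Manages WHERE did = 1").fetchone()[0]
```

Compare this with `Works_In`, whose key is `(ssn, did)`: there the same department may appear in many rows.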

#### Participation constraints
    - Does every department have a manager? If so, the participation of Departments in Manages is said to be total (vs. partial).
    - Total participation is shown by a bold line (a bold arrow when it is combined with a key constraint).
    - Every Departments entity must appear in an instance of the Manages relationship.
    - Every employee must appear in the Works_In relationship, and there cannot be a department without an employee working in it.
![](images/participation.jpg)

#### Weak entity
    - The existence of a weak entity depends on another, stronger (owner) entity.
    - A weak entity can be identified uniquely only by also considering the primary key of its owner entity.
    - The owner entity set and the weak entity set must participate in a one-to-many relationship set: one owner, many weak entities.
    - The weak entity set must have total participation in this identifying relationship.
![](images/weak_entity.jpg)
![](images/weak_entity_sql.jpg)
![](images/partial_identifier.jpg)
- There can be many status updates at a specific time, so we need the email to tell them apart. Without an email / RegularUser, a StatusUpdate entity cannot exist. So StatusUpdate is a weak entity, and DateAndTime is its partial identifier.
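
A minimal sketch of how a weak entity set is often mapped to a table, using Python's `sqlite3`. The key combines the owner's key (`email`) with the partial identifier, and `ON DELETE CASCADE` removes the weak entities together with their owner. The column names `date_and_time` and `body` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE RegularUser (email TEXT PRIMARY KEY);
    CREATE TABLE StatusUpdate (
        email         TEXT NOT NULL,   -- key of the owner entity
        date_and_time TEXT NOT NULL,   -- partial identifier
        body          TEXT,
        PRIMARY KEY (email, date_and_time),
        FOREIGN KEY (email) REFERENCES RegularUser(email) ON DELETE CASCADE
    );
    INSERT INTO RegularUser VALUES ('a@example.com'), ('b@example.com');
    -- Same timestamp, different owners: both rows are allowed.
    INSERT INTO StatusUpdate VALUES ('a@example.com', '2024-01-01 10:00', 'hello');
    INSERT INTO StatusUpdate VALUES ('b@example.com', '2024-01-01 10:00', 'hi');
""")

# Deleting the owner removes its status updates as well.
conn.execute("DELETE FROM RegularUser WHERE email = 'a@example.com'")
remaining = conn.execute("SELECT email FROM StatusUpdate").fetchall()
```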

#### Class hierarchies
    - If we declare A ISA B, every A entity is also considered to be a B entity.
![](images/ISA.jpg)
    - Overlap constraint:
        - Can Joe be an Hourly_emp as well as a Contract_emp entity?
    - Covering constraint:
        - Does every Employee entity have to be either an Hourly_emp or a Contract_emp?
    - Using ISA we can specialize an entity.
    - A disjoint constraint says that 2 subtypes cannot overlap.
![](images/isa_2.jpg)
#### Aggregation
    - Used when we have to model a relationship that involves a relationship set.
    - Allows us to treat a relationship set as an entity set for the purpose of participating in other relationships.
 
![](images/aggregation.jpg)

#### Binary vs Ternary relationship
* Example: What if an employee can work in a given department for more than 1 period?
    - Can we add attributes such as from and to to the relationship? No, because a relationship is identified by its participating entities, i.e. employee and department. Multiple entries for the same employee and department are NOT allowed.
    - We can add another entity set called Duration, which makes the relationship ternary.
![](images/ternary.jpg)

    - An employee can work in a department for one duration, and the same employee can work in the same department for another duration.
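
The difference can be sketched with Python's `sqlite3` by comparing the keys of the two designs. The table names and the `from_date`/`to_date` columns are hypothetical; the duration is folded into the key rather than stored in a separate Duration table, to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Binary relationship: identified by (employee, department) only.
    CREATE TABLE Works_In_Binary (
        ssn TEXT, did INTEGER, from_date TEXT, to_date TEXT,
        PRIMARY KEY (ssn, did)
    );
    -- Ternary relationship: the duration is part of the identity.
    CREATE TABLE Works_In_Ternary (
        ssn TEXT, did INTEGER, from_date TEXT, to_date TEXT,
        PRIMARY KEY (ssn, did, from_date, to_date)
    );
""")

# The same employee works in the same department during two periods.
periods = [
    ("111-11-1111", 1, "2019-01-01", "2019-12-31"),
    ("111-11-1111", 1, "2021-01-01", "2021-12-31"),
]


def load(table):
    """Insert both periods and return how many rows the table accepted."""
    accepted = 0
    for row in periods:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass
    return accepted


binary_rows = load("Works_In_Binary")    # second period collides on (ssn, did)
ternary_rows = load("Works_In_Ternary")  # both periods fit
```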

#### relationship partial function
![](images/1-1_partial.jpg)
* It is 1-1 because each male user maps to only 1 female user. It is partial because some instances are not mapped.
![](images/1-manay_partial.jpg)
![](images/1-manay_total.jpg)
* As we saw, a total mapping is shown as a bold line.

# ER to Relational
* In a relational database, we have a set of relations (tables).
* A relation is made up of 2 parts
    - Instance: a table with rows and columns. The number of rows is the cardinality and the number of fields/columns is the degree of the relation.
    - Schema: specifies the name of the relation and the name and type of each column.
    
 ![](images/mapping.jpg)
 
-----------------

 ![](images/mapping1.jpg)
 
-----------------
 ![](images/mapping2.jpg)
 
-----------------
 ![](images/mapping3.jpg)
 
-----------------
 ![](images/manages4.jpg)
 
-----------------
 ![](images/manages5.jpg)
 
-----------------
 ![](images/manages6.jpg)
 
-----------------
 ![](images/mapping6.jpg)
 
-----------------
 ![](images/mapping7.jpg)
 
-----------------
 ![](images/mapping8.jpg)
 
-----------------
 ![](images/mapping9.jpg)
 
-----------------
 ![](images/mapping10.jpg)

* Instance of student relation
![](images/instance.jpg)

* Create relation in SQL

```
CREATE TABLE Students
    (sid CHAR(20),
    name CHAR(20),
    login CHAR(10),
    age INTEGER,
    gpa REAL);

CREATE TABLE Enrolled
    (sid CHAR(20),
    cid CHAR(20),
    grade CHAR(2));
```

* Destroying a relation deletes the schema as well as all rows.
```
DROP TABLE Students
```

* Altering a relation
    - Every row in the current instance will have a `null` value in the new field.

```
ALTER TABLE Students ADD COLUMN first_year INTEGER
```
* Adding a new row
```
INSERT INTO Students(sid, name, login, age, gpa) VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)
```

* Delete all tuples satisfying some condition
```
DELETE FROM Students S WHERE S.name='Smith'
```

* Integrity constraints
    - A condition that must be true for any instance of the database.
    - Specified when the schema is defined.
    - Checked when a relation is modified.

* Primary key constraints:
    - The value of the key field(s) in each row is unique across the entire relation.
    - A set of fields is a key for a relation if
        - No two distinct tuples can have the same values in all key fields, and
        - This is not true for any proper subset of the key.
    - If only the first condition holds, the set of fields is a superkey.
    - `sid` is a key for Students; the set `{sid, gpa}` is a superkey.
```
CREATE TABLE Enrolled
    (sid CHAR(20),
    cid CHAR(20),
    grade CHAR(2),
    PRIMARY KEY (sid, cid))
```
* There may be many candidate keys (specified using UNIQUE), one of which is chosen as the primary key.
* Foreign key, referential integrity
    - A set of fields in one relation that is used to refer to a tuple in another relation (it must correspond to the primary key of the other relation).
![](images/Foreign_key.jpg)

    - Each sid in Enrolled must have a corresponding value in the Students table. It works like a logical pointer.
    - What if a Students tuple is deleted?
        - Also delete all the Enrolled tuples that refer to it.
        - Disallow deletion of a Students tuple that is referred to.
        - Set sid in the Enrolled tuples that refer to it to a default sid (`SET DEFAULT`) or to null (`SET NULL`).
        - The default action in SQL is `ON DELETE NO ACTION`, which rejects the deletion.
        - `ON DELETE CASCADE` : also delete all tuples that refer to the deleted tuple.
        - `ON UPDATE CASCADE` : also update the referencing attributes.
        - `ON UPDATE SET DEFAULT` : set the foreign key of the referencing tuple to its default value.
        
```
CREATE TABLE Enrolled (
    sid CHAR(20),
    cid CHAR(20),
    grade CHAR(2),
    PRIMARY KEY (sid,cid),
    FOREIGN KEY (sid) REFERENCES Students ON DELETE CASCADE ON UPDATE SET DEFAULT) 
```
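
The effect of the delete actions can be checked with a small experiment in Python's `sqlite3`. This is a sketch under SQLite's rules (foreign key enforcement must be switched on per connection); the helper name `delete_student` and the sample rows are hypothetical.

```python
import sqlite3


def delete_student(on_delete):
    """Delete a referenced Students row under the given ON DELETE action.

    Returns (was the delete accepted?, number of Enrolled rows left).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(f"""
        CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Enrolled (
            sid TEXT, cid TEXT, grade TEXT,
            PRIMARY KEY (sid, cid),
            FOREIGN KEY (sid) REFERENCES Students(sid) ON DELETE {on_delete}
        );
        INSERT INTO Students VALUES ('53688', 'Smith');
        INSERT INTO Enrolled VALUES ('53688', 'CS101', 'A');
    """)
    try:
        conn.execute("DELETE FROM Students WHERE sid = '53688'")
        deleted = True
    except sqlite3.IntegrityError:
        deleted = False
    enrolled_left = conn.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]
    return deleted, enrolled_left


no_action = delete_student("NO ACTION")  # the delete is rejected
cascade = delete_student("CASCADE")      # the referencing row is deleted too
```

With `NO ACTION` the Students row cannot be removed while an Enrolled row points at it; with `CASCADE` both rows disappear.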

### Normalization
* We have a relation and its functional dependencies.
* Given email, can we know birthyear, currentcity and salary?
* Given email and interest, can we know sinceage?
* Given birthyear, can we know salary?
* How do we normalize a relation, meaning decompose it into smaller relations without losing information and without losing functional dependencies?
* No redundancy of facts
* No cluttering of facts
* Must preserve information
* Must preserve functional dependencies
* A data structure that is not a relation is in non-first normal form.
![](images/non_first_normal_form.jpg)
* We can solve this by
![](images/temp1.jpg)

* Now we have redundancy, which can lead to an inconsistent state.
* Insertion anomaly: if we add a new user who has no interests, we have to enter NULL values in the interest and sinceage fields.
* Deletion anomaly: if we delete some row, we can lose valuable information.
* Update anomaly: when we want to update the current city, we have to update all records for that email.

* To overcome these problems we can decompose the table into multiple tables. Later we can join them to get the original multi-row data back when needed.
* To join the relations, the tables need some common attributes.
* Decomposing a table can lead to information loss and dependency loss.
![](images/temp2.jpg)
* Making the join field a key in one of the relations avoids information loss.
* If the join field is not a key, joining the tables creates additional (spurious) rows, which is information loss.
* After this decomposition we are no longer able to find salary from email, or salary from birth year. This is dependency loss.
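
Information loss through spurious rows can be reproduced in a few lines of plain Python. This is a sketch with made-up data: two users share a birth year, and the relation is decomposed once on `email` (a key) and once on `birthyear` (not a key). The helpers `project` and `natural_join` are illustrative, not part of any library.

```python
def project(rows, attrs):
    """Project a list of dict rows onto attrs, dropping duplicate rows."""
    return {tuple((a, r[a]) for a in attrs) for r in rows}


def natural_join(left, right):
    """Join two projected relations on the attribute names they share."""
    out = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                out.add(tuple(sorted({**ld, **rd}.items())))
    return out


users = [
    {"email": "a@example.com", "birthyear": 1985, "curcity": "Atlanta"},
    {"email": "b@example.com", "birthyear": 1985, "curcity": "Boston"},
]
original = {tuple(sorted(r.items())) for r in users}

# Join field email is a key in both parts: the join gives back the original.
lossless = natural_join(project(users, ["email", "curcity"]),
                        project(users, ["email", "birthyear"]))

# Join field birthyear is not a key: the join invents extra rows.
lossy = natural_join(project(users, ["email", "birthyear"]),
                     project(users, ["birthyear", "curcity"]))
```

The lossy join contains the two original rows plus two spurious ones, so we can no longer tell which user lives in which city.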

#### Functional dependencies
* Let X and Y be sets of attributes in R. Y is functionally dependent on X in R iff for each value x of X there is precisely one value y of Y.
![](images/functional_dependent.jpg)
* curcity is functionally dependent on email.
* sinceage is functionally dependent on email and interest combined.
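
The definition translates directly into a check over one relation instance. The sketch below uses made-up rows and a hypothetical helper `fd_holds`. Note that it only tests whether the given rows violate a dependency; a functional dependency is a statement about every legal instance, so passing the check on sample data does not prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for r in rows:
        x = tuple(r[a] for a in lhs)
        y = tuple(r[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True


rows = [
    {"email": "a@example.com", "interest": "music", "curcity": "Atlanta", "sinceage": 10},
    {"email": "a@example.com", "interest": "chess", "curcity": "Atlanta", "sinceage": 15},
    {"email": "b@example.com", "interest": "music", "curcity": "Boston", "sinceage": 12},
]

email_to_city = fd_holds(rows, ["email"], ["curcity"])
email_to_since = fd_holds(rows, ["email"], ["sinceage"])
pair_to_since = fd_holds(rows, ["email", "interest"], ["sinceage"])
```

Here `email` alone does not determine `sinceage` (the same email appears with 10 and 15), while `email` together with `interest` does.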

#### Full functional dependency
* Let X and Y be sets of attributes in R. Y is fully functionally dependent on X in R iff Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X.
* We use keys to enforce full functional dependencies. In a relation the values of the key are unique, and this enforces a functional dependency.
* Making email and interest combined the key enforces that sinceage is functionally dependent on them. Each key value is unique, so it identifies exactly one sinceage.
* Make email the key of the relation used to find birth year and current city.
* To find salary from birth year, make birth year the key of that relation.
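
A minimal sketch of this idea with Python's `sqlite3`: each functional dependency gets its own relation, and the determinant is declared as the key. The table names `RegularUser`, `UserInterests` and `YearSalary` and all sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- email -> birthyear, curcity
    CREATE TABLE RegularUser (email TEXT PRIMARY KEY, birthyear INTEGER, curcity TEXT);
    -- (email, interest) -> sinceage
    CREATE TABLE UserInterests (
        email TEXT, interest TEXT, sinceage INTEGER,
        PRIMARY KEY (email, interest)
    );
    -- birthyear -> salary
    CREATE TABLE YearSalary (birthyear INTEGER PRIMARY KEY, salary INTEGER);

    INSERT INTO RegularUser VALUES ('a@example.com', 1985, 'Atlanta');
    INSERT INTO UserInterests VALUES ('a@example.com', 'music', 10);
    INSERT INTO YearSalary VALUES (1985, 27000);
""")


def violates_key(sql):
    """Return True if the statement is rejected by a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A second city for the same email would break email -> curcity.
second_city = violates_key(
    "INSERT INTO RegularUser VALUES ('a@example.com', 1985, 'Boston')")
# A second sinceage for the same (email, interest) would break that dependency.
second_since = violates_key(
    "INSERT INTO UserInterests VALUES ('a@example.com', 'music', 12)")
# A different interest for the same email is a new key value, so it is fine.
other_interest = violates_key(
    "INSERT INTO UserInterests VALUES ('a@example.com', 'chess', 15)")

# Salary is still reachable from email by joining on birthyear.
salary = conn.execute("""
    SELECT salary FROM RegularUser JOIN YearSalary USING (birthyear)
    WHERE email = 'a@example.com'
""").fetchone()[0]
```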

### Normal forms
* NF2 : non-first normal form
* 1NF : R is in 1NF iff all domain values are atomic.
* 2NF : R is in 2NF iff R is in 1NF and every non-key attribute is fully dependent on the key.
* 3NF : R is in 3NF iff R is in 2NF and every non-key attribute is non-transitively dependent on the key.
* BCNF (Boyce-Codd normal form) : R is in BCNF iff every determinant is a candidate key.
* A determinant is a set of attributes on which some other attribute is fully functionally dependent.
![](images/NF.jpg)

![](images/transitivity.jpg)

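* When a relation has overlapping candidate keys, we may not be able to normalize it to BCNF while staying both lossless and dependency preserving. We can still get a lossless, dependency-preserving decomposition in 3NF. This case almost never happens in practice.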

### Indexing
### Physical database design
* A file is a logical collection of data, physically stored as a set of pages.
* File organization is the method of storing a file of records on external storage.
* Heap files
    - Records are in random order.
    - Data pages are connected to each other like the nodes of a linked list.
    - New data is inserted into a page that has free space.
    - Data can easily be bulk loaded. For small data, where an index and sorting are not necessary, a heap is good. It is also fine for queries that fetch a large portion of the records.
    - Not efficient for selective queries.
    - Sorting is time consuming.
* Sorted files
    - Records are kept sorted.
* Indexes
    - Sorted data entries that point to the data.
    - An index leads us directly to the location of our data instead of checking every record one by one.
    - B+ tree index
    ![](images/b+.jpg)
    - The blue nodes are index entries.
    - The leaf pages (yellow) are sorted by the key on which they are indexed.
    - With an index on age, we can easily search for records by age. So the key is age.
    ![](images/b+1.jpg)
* Hash-based indexes
    - Good for equality selections. Ex. the student with id = 5, or the record for transaction id 12345.
    - We have a hash function; as its key we can use anything we want to index on.
    - The hash function determines where the data is stored.
    - Just like a hash table.
* Clustered vs unclustered indexes
    - In a clustered index, the data entries in the index are sorted in pretty much the same order as the actual data records.
    - Data entries have pointers to the actual data.
    - In an unclustered index, the order of the data entries does not match the order of the data records.
    - A table can have only 1 clustered index but many unclustered indexes.
    ![](images/clustered.jpg)

* What kind of index should we use?
    - If most of the workload is reads and some table is used most of the time, we can create an index for that relation.
    - How selective is the query? If around 1% of the data is selected, it is a good candidate for an index.
    - An index has storage and maintenance overhead.
    - If there are lots of updates, think twice before creating an index.
    - The WHERE clause gives us an idea for the key of the index.
    ![](images/clustered_index.jpg)
    - Sometimes we get the answer from the index directly (when the SELECT statement only uses the key of the index) without checking the actual data records.
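
The effect of an index on a selective query can be observed with SQLite's `EXPLAIN QUERY PLAN`, here driven from Python's `sqlite3`. This is a sketch: the generated rows and the index name `idx_students_age` are made up, SQLite's indexes are B-trees, and the exact wording of the plan text differs between SQLite versions, so the code only looks for the index name and the phrase `COVERING INDEX`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [(str(i), f"student{i}", 18 + i % 10, 2.0 + (i % 20) / 10) for i in range(1000)],
)


def plan(sql):
    """Return SQLite's plan description for sql as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Without an index on age, the whole table has to be scanned.
before = plan("SELECT name FROM Students WHERE age = 20")

# The WHERE clause suggests age as the index key.
conn.execute("CREATE INDEX idx_students_age ON Students(age)")
with_index = plan("SELECT name FROM Students WHERE age = 20")

# Only the index key is selected, so the index alone can answer the query.
index_only = plan("SELECT age FROM Students WHERE age = 20")

uses_index = "idx_students_age" in with_index
answered_from_index = "COVERING INDEX" in index_only
```

Printing `before`, `with_index` and `index_only` shows the plan moving from a table scan, to an index search followed by a record lookup, to an index-only answer.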