# Data Modeling


Data modeling is a critical aspect of data engineering, where a data model is created to represent the data and its relationships. It involves defining the data structure, data types, and relationships between the data entities. The goal of data modeling is to ensure that the data is accurate, consistent, and meaningful.<br>

In practice, a data model is developed at three levels of detail, moving from a conceptual model to a logical model and finally to a physical model:

<b>Conceptual Data Model:</b> The conceptual data model is the first step in the data modeling process. It defines the high-level view of the data, without going into much detail. The conceptual model identifies the entities, their attributes, and the relationships between them.<br>

<img src="https://i.ibb.co/HVWTPY1/Screenshot-2023-03-10-at-4-09-21-PM.png" height = "800" width = "800"><br><br>

<b>Logical Data Model:</b> The logical data model is a more detailed view of the data, where the focus is on the structure and relationships between the entities. The logical model defines the tables, columns, keys, and relationships between them.<br>


<img src="https://i.ibb.co/Bj8crVb/Screenshot-2023-03-10-at-4-07-35-PM.png" height = "800" width = "800"><br><br>


<b>Physical Data Model:</b> The physical data model implements the logical data model as a database schema. It defines the actual structure of the database, including data types, indexes, and constraints.<br>

Data modeling is an iterative process that involves collaborating with stakeholders to ensure that the model meets their requirements. The data model must be flexible, scalable, and easy to maintain, because the data may change over time.


In a physical data model, the entities and relationships are expressed in code, for example as SQL <code>CREATE TABLE</code> statements:


<div class="sql">
    <pre><code><font color = 'indigo'>
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(50),
    address VARCHAR(255)
);

-- Product is created before Orders so that the foreign key below can reference it.
CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    name VARCHAR(100),
    description VARCHAR(255),
    price DECIMAL(10,2),
    category VARCHAR(50)
);

-- ORDER is a reserved word in SQL, so the table is named Orders.
CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    order_date DATE,
    status VARCHAR(20),
    FOREIGN KEY (customer_id) REFERENCES Customer(customer_id),
    FOREIGN KEY (product_id) REFERENCES Product(product_id)
);
    </font></code></pre>
</div>
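
To see what constraints in a physical model do at runtime, the following is a minimal sketch using Python's built-in `sqlite3` module. It assumes a trimmed-down version of the schema above (fewer columns, SQLite type names, and the order table named `Orders` because `ORDER` is a reserved word); the index name `idx_orders_customer` and the helper `try_insert_orphan_order` are made up for this illustration.

```python
import sqlite3

# In-memory SQLite database, used here only for illustration.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Customer (
    customer_id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE Orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
    order_date TEXT NOT NULL
);
-- An index is part of the physical model: it changes performance, not meaning.
CREATE INDEX idx_orders_customer ON Orders(customer_id);
""")

conn.execute("INSERT INTO Customer VALUES (1, 'alice@example.com')")
conn.execute("INSERT INTO Orders VALUES (10, 1, '2023-03-10')")


def try_insert_orphan_order(connection):
    """Return True if the database rejects an order for a missing customer."""
    try:
        connection.execute("INSERT INTO Orders VALUES (11, 999, '2023-03-11')")
    except sqlite3.IntegrityError:
        return True
    return False


rejected = try_insert_orphan_order(conn)
order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(rejected, order_count)
```

The order that points at a non-existent customer is refused by the database itself, so this integrity rule does not have to be re-implemented in application code.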


In summary, data modeling is the process of creating a representation of the data and its relationships in a structured manner. It is a crucial step in data engineering that helps ensure the accuracy, consistency, and meaningfulness of the data.
<br>&nbsp;

<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />


## Introduction to NoSQL databases

1. Understanding the differences between NoSQL and SQL databases

<ul>
  <li>SQL databases are based on a relational model with a fixed schema and follow ACID (Atomicity, Consistency, Isolation, Durability) properties.</li>
  <li>In contrast, NoSQL databases are non-relational, usually have a flexible or optional schema, and often prioritize BASE (Basically Available, Soft state, Eventual consistency) properties, although some also offer transactions or tunable consistency.</li>
  <li>SQL databases are typically more suited for structured data, while NoSQL databases handle structured, semi-structured, and unstructured data.</li>
  <li>SQL databases use SQL (Structured Query Language) for querying and managing data, whereas NoSQL databases use various query languages depending on the database type.</li>
</ul>
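
The atomicity part of ACID can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the `account` table, the `transfer` helper, and the balances are all hypothetical. The connection's context manager commits when the block succeeds and rolls back when it raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with connection:  # commit on success, roll back on any exception
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This update violates the CHECK constraint if src would go negative.
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # alice only has 100
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, balances)
```

Even though the credit to `bob` ran first, it is undone when the debit fails, so no partial transfer is ever visible.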


2. Reasons to choose NoSQL databases for specific use cases

<ul>
    <li><b>Scalability:</b> NoSQL databases are often designed to scale horizontally, making them suitable for handling large amounts of data and high write loads.</li>
    <li><b>Flexibility:</b> NoSQL databases allow for more flexible data models, which can be beneficial when dealing with complex, evolving data structures or when the schema is not well-defined.</li>
    <li><b>Performance:</b> NoSQL databases can offer better performance for certain types of queries, such as key-value lookups or graph traversals, due to their specialized data models.</li>
</ul>
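
Horizontal scaling usually means spreading keys across several nodes. The sketch below shows the basic idea with simple hash-based partitioning over in-memory dictionaries; `shard_for` and `ShardedStore` are hypothetical names, and each "shard" stands in for what would be a separate server.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that partitions its keys across several dicts."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])  # keys per shard
```

Plain modulo partitioning moves most keys when the number of shards changes, which is why production systems often use schemes such as consistent hashing instead.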


3. Types of NoSQL databases: key-value, document, column-family, and graph databases

<ul>
    <li><b>Key-Value Databases:</b>
    <ul>
      <li>Stores data as key-value pairs</li>
      <li>Example: Redis</li>
      <li>Usage:</li>
<pre><code class="language-python"><font color = "indigo">
import</font> redis
r = redis.Redis()
r.<font color = "orange">set</font>(<font color = "green">"key", "value"</font>)
value = r.get(<font color = "green">"key"</font>)  # returns bytes (b"value") unless the client is created with decode_responses=True
</code></pre>
    </ul>
  </li>
    <li><b>Document Databases:</b>
        <ul>
            <li>Stores data as documents, typically in formats like JSON or BSON</li>
            <li>Example: MongoDB</li>
            <li>Usage:</li>
    <pre><code class="language-python">
    <font color="indigo">from</font> pymongo <font color="indigo">import</font> MongoClient
    client = MongoClient()
    db = client.<font color="purple">example_db</font>
    collection = db.<font color="purple">example_collection</font>
    document = {<font color="orange">"name"</font>: <font color="green">"John"</font>, <font color="orange">"age"</font>: <font color="purple">30</font>, <font color="orange">"city"</font>: <font color="green">"New York"</font>}
    collection.<font color="orange">insert_one</font>(document)
    </code></pre>
        </ul></li>
    <li><b>Column-Family Databases:</b>
    <ul>
        <li>Stores data as columns grouped into column families, optimized for write-heavy workloads</li>
        <li>Example: Cassandra</li>
        <li>Usage:</li>
<pre><code class="language-python">
  <font color="indigo">from</font> cassandra.cluster <font color="indigo">import</font> Cluster
  cluster = Cluster()
  session = cluster.connect()
  session.execute("""
      <font color="green">CREATE KEYSPACE</font> example_keyspace
      <font color="green">WITH replication</font> = {'<font color="purple">class</font>': '<font color="red">SimpleStrategy</font>', '<font color="purple">replication_factor</font>': '<font color="red">1</font>'}
  """)
  session.set_keyspace('example_keyspace')
  session.execute("""
      <font color="green">CREATE TABLE</font> example_table (
          <font color="orange">id</font> UUID <font color="purple">PRIMARY KEY</font>,
          <font color="orange">name</font> TEXT,
          <font color="orange">age</font> INT
      )
  """)
</code></pre></ul></li>
    <li><b>Graph Databases:</b>
         <ul>
             <li>Stores data as nodes and edges in a graph, optimized for querying relationships</li>
             <li>Example: Neo4j</li>
             <li>Usage:</li>
 <pre><code class="language-python">
      <font color="indigo">from</font> neo4j <font color="indigo">import</font> GraphDatabase
      driver = GraphDatabase.driver("<font color="green">bolt://localhost:7687</font>", auth=(<font color="green">"neo4j"</font>, <font color="green">"password"</font>))
      <font color="indigo">with</font> driver.session() <font color="indigo">as</font> session:
          session.run("<font color="green">CREATE</font> (a:<font color="orange">Person</font> {<font color="orange">name</font>: '<font color="green">John</font>', <font color="orange">age</font>: <font color="purple">30</font>})")
      driver.close()  # release the driver's connections when done
   </code></pre></ul></li>
</ul>&nbsp;

<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />&nbsp;

## Entity-Relationship (ER) Diagrams

An Entity-Relationship (ER) diagram is a visual representation of the major entities within a system, along with their relationships and attributes. ER diagrams are often used in database design to help model the structure and organization of data. There are three main components in an ER diagram:

<ol>
  <li><strong>Entities:</strong> These are objects or concepts that represent something significant in the system. Entities are usually represented by rectangles in an ER diagram.</li>&nbsp;
  <li><strong>Attributes:</strong> These are properties or characteristics that describe entities. Attributes are represented by ovals connected to the corresponding entity.</li>&nbsp;
  <li><strong>Relationships:</strong> These define how entities are related to each other. Relationships are represented by diamonds connecting the related entities.</li>&nbsp;
</ol>
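
Before drawing a diagram, it can help to write the three components down as plain data. The following is a rough sketch using Python dataclasses; the `Entity` and `Relationship` classes, the `describe` helper, and the sample customer/order entities are assumptions made for illustration, not a standard notation.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A rectangle in the diagram, with its attributes (the ovals)."""
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class Relationship:
    """A diamond connecting two entities."""
    name: str
    source: Entity
    target: Entity
    cardinality: str  # for example "1:N" for one-to-many


customer = Entity("Customer", ["customer_id", "name", "email"])
order = Entity("Order", ["order_id", "order_date"])
places = Relationship("places", customer, order, "1:N")


def describe(rel: Relationship) -> str:
    """Render a relationship as a one-line summary."""
    return f"{rel.source.name} {rel.name} {rel.target.name} ({rel.cardinality})"


summary = describe(places)
print(summary)
```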

To create an ER diagram, you can use tools such as <a href="https://app.diagrams.net/">draw.io</a> or <a href="https://www.lucidchart.com/pages/">Lucidchart</a>.

Example - ![ER Diagram](er_diagram.png)

<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />&nbsp;

<h2>Normalization</h2>

<p>Normalization is the process of organizing the data in a database to reduce redundancy and improve data integrity. It involves decomposing a table into smaller tables and defining relationships between them. The main goal of normalization is to ensure that each piece of data is stored in only one place. There are several normal forms, each with its own rules; the three most commonly applied are:</p>

<ol>
  <li><strong>First Normal Form (1NF)</strong>: A table is in 1NF if it contains no repeating groups or arrays, and each column contains atomic values. This means that each cell should contain only one value.</li>
  <li><strong>Second Normal Form (2NF)</strong>: A table is in 2NF if it is in 1NF and every non-key attribute depends on the whole primary key. This means that there are no partial dependencies, where an attribute depends on only part of a composite key.</li>
  <li><strong>Third Normal Form (3NF)</strong>: A table is in 3NF if it is in 2NF and every non-key attribute depends directly on the primary key. This means that there are no transitive dependencies, where a non-key attribute depends on another non-key attribute.</li>
</ol>
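
As a small illustration of the 1NF rule, the sketch below splits a cell that holds several values into one row per value. The `raw_rows` data and the `to_first_normal_form` helper are hypothetical and assume the values are comma-separated.

```python
# Each "products" cell holds a list of values, which violates 1NF.
raw_rows = [
    {"order_id": 1, "products": "Widget, Gadget"},
    {"order_id": 2, "products": "Widget"},
]


def to_first_normal_form(rows):
    """Return one row per (order_id, product) so every cell is atomic."""
    atomic = []
    for row in rows:
        for product in row["products"].split(","):
            atomic.append({"order_id": row["order_id"], "product": product.strip()})
    return atomic


atomic_rows = to_first_normal_form(raw_rows)
for row in atomic_rows:
    print(row)
```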

<p>As an example, consider a table called <code>orders</code> with the following columns: <code>order_id</code>, <code>customer_id</code>, <code>customer_name</code>, <code>product_id</code>, and <code>product_name</code>. Here's a sample of the data:</p>

<table border="1" cellpadding="5">
  <tr>
    <th>order_id</th>
    <th>customer_id</th>
    <th>customer_name</th>
    <th>product_id</th>
    <th>product_name</th>
  </tr>
  <tr>
    <td>1</td>
    <td>101</td>
    <td>Alice</td>
    <td>1001</td>
    <td>Widget</td>
  </tr>
  <tr>
    <td>2</td>
    <td>101</td>
    <td>Alice</td>
    <td>1002</td>
    <td>Gadget</td>
  </tr>
  <tr>
    <td>3</td>
    <td>102</td>
    <td>Bob</td>
    <td>1001</td>
    <td>Widget</td>
  </tr>
</table>

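<p>In this table, <code>customer_name</code> depends on <code>customer_id</code> and <code>product_name</code> depends on <code>product_id</code>, rather than directly on the key <code>order_id</code>. These transitive dependencies are why "Alice" and "Widget" each appear twice. To normalize the table, we can decompose it into the following three tables:</p>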

<ol>
  <li><strong>customers</strong>: <code>customer_id</code>, <code>customer_name</code></li>
  <li><strong>products</strong>: <code>product_id</code>, <code>product_name</code></li>
  <li><strong>orders</strong>: <code>order_id</code>, <code>customer_id</code>, <code>product_id</code></li>
</ol>

<p>This example demonstrates normalization up to the third normal form (3NF). The new table structure reduces redundancy and ensures data integrity by storing each piece of information in only one place.</p>
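
The decomposition can be checked with a few lines of Python. This sketch uses the three sample rows from the table above and plain dictionaries and lists in place of database tables; the variable names are chosen for illustration.

```python
# The original, unnormalized rows:
# (order_id, customer_id, customer_name, product_id, product_name)
orders_flat = [
    (1, 101, "Alice", 1001, "Widget"),
    (2, 101, "Alice", 1002, "Gadget"),
    (3, 102, "Bob", 1001, "Widget"),
]

# Decompose: each customer and product name is now stored exactly once.
customers = {cid: cname for _, cid, cname, _, _ in orders_flat}
products = {pid: pname for _, _, _, pid, pname in orders_flat}
orders = [(oid, cid, pid) for oid, cid, _, pid, _ in orders_flat]

# Joining the three tables back together reproduces the original rows,
# so no information was lost by the decomposition.
rejoined = [
    (oid, cid, customers[cid], pid, products[pid]) for oid, cid, pid in orders
]

print(customers)
print(products)
print(rejoined == orders_flat)
```

Renaming a customer now means changing a single entry in `customers` instead of updating every order row that mentions them.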
