# Data Modeling <br><br>


Data modeling is a core part of data engineering: it produces a model that represents the data and the relationships within it. It involves defining the data structures, data types, and relationships between data entities. The goal is to keep the data accurate, consistent, and meaningful.<br>

In data engineering, data modeling is the process of creating a conceptual, logical, and physical model of the data. Here are the three main types of data models:

<b>Conceptual Data Model:</b> The conceptual data model is the first step in the data modeling process. It defines the high-level view of the data, without going into much detail. The conceptual model identifies the entities, their attributes, and the relationships between them.<br>

<img src="https://i.ibb.co/HVWTPY1/Screenshot-2023-03-10-at-4-09-21-PM.png" height = "800" width = "800"><br><br>

<b>Logical Data Model:</b> The logical data model is a more detailed view of the data, where the focus is on the structure and relationships between the entities. The logical model defines the tables, columns, keys, and relationships between them.<br>


<img src="https://i.ibb.co/Bj8crVb/Screenshot-2023-03-10-at-4-07-35-PM.png" height = "800" width = "800"><br><br>


<b>Physical Data Model:</b> The physical data model is a representation of the logical data model in the form of a database schema. It defines the actual structure of the database, including data types, indexes, and constraints.

Data modeling is an iterative process, and it involves collaborating with various stakeholders to ensure that the model meets the requirements. The data model must be flexible, scalable, and easy to maintain, as the data may change over time.


A physical data model is the concrete implementation of the entities and relationships in code, for example as SQL table definitions:


<div class="sql">
    <pre><code><font color = 'indigo'>
CREATE TABLE Customer (
    customer_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(50),
    address VARCHAR(255)
);

CREATE TABLE Product (
    product_id INT PRIMARY KEY,
    name VARCHAR(100),
    description VARCHAR(255),
    price DECIMAL(10,2),
    category VARCHAR(50)
);

-- ORDER is a reserved word in SQL, so the table is named Orders.
-- It is created last because it references Customer and Product.
CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    product_id INT,
    order_date DATE,
    status VARCHAR(20),
    FOREIGN KEY (customer_id) REFERENCES Customer(customer_id),
    FOREIGN KEY (product_id) REFERENCES Product(product_id)
);
    </font></code></pre>
</div>
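
The physical model can also be exercised from Python. The following is a minimal sketch, assuming SQLite through the standard-library `sqlite3` module rather than any particular production database. The lowercase table names (including `orders`, chosen because `ORDER` is a reserved word), the simplified column types, and the sample rows are illustrative assumptions. It shows a foreign key constraint rejecting an order that points at a non-existent customer.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    email       TEXT
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    price      NUMERIC
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    order_date  TEXT,
    status      TEXT
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO customer VALUES (1, 'Alice', 'Smith', 'alice@example.com')")
conn.execute("INSERT INTO product VALUES (10, 'Keyboard', 49.99)")
conn.execute("INSERT INTO orders VALUES (100, 1, 10, '2023-03-10', 'shipped')")


def try_insert_orphan_order(connection):
    """Return True if an order for a non-existent customer is rejected."""
    try:
        connection.execute(
            "INSERT INTO orders VALUES (101, 999, 10, '2023-03-11', 'new')"
        )
    except sqlite3.IntegrityError:
        return True
    return False


orphan_rejected = try_insert_orphan_order(conn)
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print("orphan order rejected:", orphan_rejected)
print("orders stored:", order_count)
```

The constraint is what turns the relationship drawn in the logical model into something the database actually enforces.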


In summary, data modeling is the process of creating a representation of the data and its relationships in a structured manner. It is a crucial step in data engineering that helps ensure the accuracy, consistency, and meaningfulness of the data.
<br>&nbsp;

<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />


## Introduction to NoSQL databases

1. Understanding the differences between NoSQL and SQL databases

<ul>
  <li>SQL databases are based on a relational model with a fixed schema and follow ACID (Atomicity, Consistency, Isolation, Durability) properties.</li>
  <li>In contrast, NoSQL databases are non-relational, usually schema-flexible rather than bound to a fixed schema, and often prioritize BASE (Basically Available, Soft state, Eventual consistency) properties over strict ACID guarantees.</li>
  <li>SQL databases are typically better suited to structured data, while NoSQL databases handle structured, semi-structured, and unstructured data.</li>
  <li>SQL databases use SQL (Structured Query Language) for querying and managing data, whereas NoSQL databases use various query languages depending on the database type.</li>
</ul>
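
The schema difference in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not how any specific NoSQL product stores data: SQLite stands in for a relational database, and a plain Python list of JSON strings stands in for a document collection. The `users` table and the `nickname` field are hypothetical.

```python
import json
import sqlite3

# Relational side: the table's columns are fixed when it is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Alice')")

try:
    # 'nickname' is not a column of the table, so the statement fails.
    conn.execute(
        "INSERT INTO users (id, name, nickname) VALUES (2, 'Bob', 'bobby')"
    )
    relational_accepted_extra_field = True
except sqlite3.OperationalError:
    relational_accepted_extra_field = False

# Document side: each record is a self-describing JSON document,
# so two records in the same collection may carry different fields.
collection = []
collection.append(json.dumps({"id": 1, "name": "Alice"}))
collection.append(json.dumps({"id": 2, "name": "Bob", "nickname": "bobby"}))

field_sets = [sorted(json.loads(doc).keys()) for doc in collection]
print("relational insert with extra field accepted:", relational_accepted_extra_field)
print("fields per document:", field_sets)
```

In the relational case the schema would have to be altered before the new field could be stored; in the document case the application is responsible for coping with records of different shapes.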


2. Reasons to choose NoSQL databases for specific use cases

<ul>
    <li><b>Scalability:</b> NoSQL databases are often designed to scale horizontally, making them suitable for handling large amounts of data and high write loads.</li>
    <li><b>Flexibility:</b> NoSQL databases allow for more flexible data models, which can be beneficial when dealing with complex, evolving data structures or when the schema is not well-defined.</li>
    <li><b>Performance:</b> NoSQL databases can offer better performance for certain types of queries, such as key-value lookups or graph traversals, due to their specialized data models.</li>
</ul>
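
Horizontal scaling usually relies on spreading keys across nodes. The sketch below assumes the simplest possible scheme, hashing a key and taking the result modulo the number of nodes. The function name `shard_for` and the sample keys are hypothetical, and real systems typically use more elaborate placement.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Hypothetical keys.
keys = [f"user:{i}" for i in range(1000)]

# Distribution of keys over a 3-node and a 4-node cluster.
placement_3 = Counter(shard_for(k, 3) for k in keys)
placement_4 = Counter(shard_for(k, 4) for k in keys)

# Keys whose owning node changes when a fourth node is added.
moved = sum(1 for k in keys if shard_for(k, 3) != shard_for(k, 4))

print("3 nodes:", dict(placement_3))
print("4 nodes:", dict(placement_4))
print("keys that moved:", moved)
```

With plain modulo placement, adding a node reassigns a large share of the keys. Distributed databases commonly use consistent hashing or a similar scheme to limit that movement.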


3. Types of NoSQL databases: key-value, document, column-family, and graph databases

<ul>
    <li><b>Key-Value Databases:</b>
    <ul>
      <li>Stores data as key-value pairs</li>
      <li>Example: Redis</li>
      <li>Usage:</li>
<pre><code class="language-python"><font color = "indigo">
import</font> redis
r = redis.Redis()
r.<font color = "orange">set</font>(<font color = "green">"key", "value"</font>)
value = r.get(<font color = "green">"key"</font>)
</code></pre>
    </ul>
  </li>
    <li><b>Document Databases:</b>
        <ul>
            <li>Stores data as documents, typically in formats like JSON or BSON</li>
            <li>Example: MongoDB</li>
            <li>Usage:</li>
    <pre><code class="language-python">
    <font color="indigo">from</font> pymongo <font color="indigo">import</font> MongoClient
    client = MongoClient()
    db = client.<font color="purple">example_db</font>
    collection = db.<font color="purple">example_collection</font>
    document = {<font color="orange">"name"</font>: <font color="green">"John"</font>, <font color="orange">"age"</font>: <font color="purple">30</font>, <font color="orange">"city"</font>: <font color="green">"New York"</font>}
    collection.<font color="orange">insert_one</font>(document)
    </code></pre>
        </ul></li>
    <li><b>Column-Family Databases:</b>
    <ul>
        <li>Stores data as columns grouped into column families, optimized for write-heavy workloads</li>
        <li>Example: Cassandra</li>
        <li>Usage:</li>
<pre><code class="language-python">
  <font color="indigo">from</font> cassandra.cluster <font color="indigo">import</font> Cluster
  cluster = Cluster()
  session = cluster.connect()
  session.execute("""
      <font color="green">CREATE KEYSPACE</font> example_keyspace
      <font color="green">WITH replication</font> = {'<font color="purple">class</font>': '<font color="red">SimpleStrategy</font>', '<font color="purple">replication_factor</font>': '<font color="red">1</font>'}
  """)
  session.set_keyspace('example_keyspace')
  session.execute("""
      <font color="green">CREATE TABLE</font> example_table (
          <font color="orange">id</font> UUID <font color="purple">PRIMARY KEY</font>,
          <font color="orange">name</font> TEXT,
          <font color="orange">age</font> INT
      )
  """)
</code></pre></ul></li>
    <li><b>Graph Databases:</b>
         <ul>
             <li>Stores data as nodes and edges in a graph, optimized for querying relationships</li>
             <li>Example: Neo4j</li>
             <li>Usage:</li>
 <pre><code class="language-python">
      <font color="indigo">from</font> neo4j <font color="indigo">import</font> GraphDatabase
      driver = GraphDatabase.driver("<font color="green">bolt://localhost:7687</font>", auth=(<font color="green">"neo4j"</font>, <font color="green">"password"</font>))
      <font color="indigo">with</font> driver.session() <font color="indigo">as</font> session:
          session.run("<font color="green">CREATE</font> (a:<font color="orange">Person</font> {<font color="orange">name</font>: '<font color="green">John</font>', <font color="orange">age</font>: <font color="purple">30</font>})")
   </code></pre></ul></li>
</ul>&nbsp;
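
To compare the four models side by side without running any database server, the same small fact ("John, 30, lives in New York") can be laid out in plain Python structures. This is only a conceptual sketch: the dictionaries stand in for the stores, and every name in it (`users_by_id`, `LIVES_IN`, `city_of`) is a hypothetical chosen for illustration.

```python
# Key-value: an opaque value looked up by a single key.
key_value_store = {"user:1": '{"name": "John", "age": 30}'}

# Document: a nested, self-describing record.
document_store = {
    "users": [
        {"_id": 1, "name": "John", "age": 30, "address": {"city": "New York"}},
    ]
}

# Column-family: rows located by a partition key, each row a set of named columns.
column_family_store = {
    "users_by_id": {
        1: {"name": "John", "age": 30, "city": "New York"},
    }
}

# Graph: nodes and edges are both first-class records.
graph_store = {
    "nodes": {
        1: {"label": "Person", "name": "John"},
        2: {"label": "City", "name": "New York"},
    },
    "edges": [(1, "LIVES_IN", 2)],
}


def city_of(person_id: int) -> str:
    """Follow the LIVES_IN edge from a person node to a city node."""
    for source, relation, target in graph_store["edges"]:
        if source == person_id and relation == "LIVES_IN":
            return graph_store["nodes"][target]["name"]
    raise KeyError(person_id)


print(document_store["users"][0]["address"]["city"])
print(column_family_store["users_by_id"][1]["city"])
print(city_of(1))
```

The key-value store can only return the whole value for `user:1`, while the other three expose the city through nesting, a named column, or an edge traversal.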
<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />&nbsp;

## Data modeling in NoSQL databases


<br>
<body>
	<ul>
		<li>a. Principles of NoSQL data modeling: denormalization, data duplication, and data aggregation
			<ul>
				<li><b>Denormalization:</b></li>
				<ul>
					<li>In NoSQL databases, data is often denormalized to reduce the need for joins or complex queries, which can be less performant in these systems.</li>
					<li>Denormalization involves storing related data together, such as embedding an address document within a user document in a document-oriented database like MongoDB.</li>
					<li>The trade-off is increased storage space and potential update anomalies, but improved query performance.</li>
				</ul>
                <li><b>Data Duplication:</b></li>
				<ul>
					<li>Data duplication is sometimes used in NoSQL databases to optimize read performance by replicating data across multiple structures or partitions.</li>
					<li>For example, in a graph database like Neo4j, a node's properties may be duplicated across multiple relationships to avoid extra traversal steps when querying the data.</li>
					<li>This approach can improve read performance but may increase complexity when updating or maintaining the data.</li>
				</ul>
                <li><b>Data Aggregation:</b></li>
				<ul>
					<li>NoSQL databases often use data aggregation techniques, such as precomputing aggregated values or creating materialized views, to optimize read-heavy workloads.</li>
					<li>For instance, in a column-family database like Cassandra, precomputed aggregates can be stored as additional columns or separate tables to speed up read operations.</li>
					<li>Aggregating data can reduce query complexity and improve performance, but may increase storage requirements and update complexity.</li>
				</ul>
			</ul>
		</li><br>&nbsp;&nbsp;&nbsp;
        <li>b. Key considerations when modeling data in NoSQL databases, such as scalability, performance, and flexibility
            <ul>
                <li><b>Scalability:</b>
                    <ul>
                        <li>Data models in NoSQL databases should be designed with horizontal scalability in mind, enabling the system to handle increased data volume and read/write loads by adding more nodes to the cluster.</li>
                        <li>Considerations include partitioning strategies, sharding, and replication factors.</li>
                    </ul>
                </li>
                <li><b>Performance:</b>
                    <ul>
                        <li>NoSQL data models should be optimized for the most common query patterns and workloads, taking into account the specific database type and its strengths.</li>
                        <li>Techniques include denormalizing or embedding related data, indexing frequently queried fields, and precomputing aggregates.</li>
                    </ul>
                </li>
                <li><b>Flexibility:</b>
                    <ul>
                        <li>NoSQL data models should be designed with flexibility in mind, allowing for easy adaptation to evolving data structures and requirements.</li>
                        <li>Techniques include schema-less designs, embedding related data, and using polymorphic data structures.</li>
                    </ul>
                </li>
            </ul>
        </li><br>
        <li>c. Comparing data modeling approaches across different NoSQL database types
            <ul>
                <li><b>Key-Value Databases:</b>
                    <ul>
                        <li>Data modeling in key-value databases is relatively simple, as data is stored as key-value pairs.</li>
                    <li>Focus on creating a suitable key design, considering factors such as key length, structure, and hashing to ensure efficient storage and retrieval.</li>
                    <li> Example - Assuming you have a dataset of users and their associated metadata, you can store this information in Redis using hashes.</li>
                    <pre><code class="language-python">
<font color = "indigo">import</font> redis
r = redis.Redis()

<font color = "gray"># Storing user data as a hash</font>
user_data = {
    <font color = "green">"id"</font>: <font color = "green">"1"</font>,
    <font color = "green">"name"</font>: <font color = "green">"Alice"</font>,
    <font color = "green">"email"</font>: <font color = "green">"alice@example.com"</font>,
    <font color = "green">"age"</font>: <font color = "purple">30</font>
}

<font color = "gray"># hset with mapping= replaces the deprecated hmset</font>
r.hset(<font color = "orange">f"user:{user_data['id']}"</font>, mapping=user_data)

<font color = "gray"># Retrieving user data</font>
retrieved_user_data = r.hgetall(<font color = "green">"user:1"</font>)
                    </code></pre>
                </ul>
            </li>
            <li><b>Document Databases:</b>
                <ul>
                    <li>Data modeling in document databases involves structuring data as hierarchical documents, often in formats like JSON or BSON.</li>
                    <li>Considerations include embedding related data within documents, denormalization, and designing appropriate document structures to support common query patterns.</li>
                    <li>Example - Consider a dataset of blog posts and associated comments. You can store this information in MongoDB as embedded documents.</li>
                    <pre><code class="language-python">
<font color="indigo">from</font> pymongo <font color="indigo">import</font> MongoClient

client = MongoClient()
db = client.<font color="orange">blog_db</font>
posts = db.<font color="orange">posts</font>

post = {
    <font color="orange">"title"</font>: <font color="green">"My First Blog Post"</font>,
    <font color="orange">"content"</font>: <font color="green">"This is my first blog post."</font>,
    <font color="orange">"author"</font>: <font color="green">"Alice"</font>,
    <font color="orange">"comments"</font>: [
        {
            <font color="orange">"author"</font>: <font color="green">"Bob"</font>,
            <font color="orange">"text"</font>: <font color="green">"Great post!"</font>
        },
        {
            <font color="orange">"author"</font>: <font color="green">"Charlie"</font>,
            <font color="orange">"text"</font>: <font color="green">"Interesting read."</font>
        }
    ]
}

post_id = posts.insert_one(post).inserted_id

<font color="indigo"># Querying a post with its comments</font>
post_with_comments = posts.find_one({"_id": post_id})</code></pre>
                </ul>
             </li>
                <li><b>Column-Family Databases:</b>
                    <ul>
                        <li>Data modeling in column-family databases requires designing tables (column families) whose rows are grouped into partitions and sorted within each partition.</li>
                        <li>Focus on defining efficient partition keys, clustering columns, and data types to optimize storage and query performance.</li>
                        <li>Example - Suppose you have a dataset of user activities on a website. You can store this information in Cassandra using a time-series data model.</li>
                        <pre><code class="language-python">
<font color="indigo">from</font> cassandra.cluster <font color="indigo">import</font> Cluster
<font color="indigo">from</font> uuid <font color="indigo">import</font> uuid1

cluster = Cluster()
session = cluster.connect()
session.execute(<font color="orange">"CREATE KEYSPACE IF NOT EXISTS webapp WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"</font>)
session.set_keyspace(<font color="orange">"webapp"</font>)

session.execute(<font color="orange">"""</font>
    CREATE TABLE IF NOT EXISTS user_activity (
        user_id UUID,
        activity_time TIMESTAMP,
        activity_id UUID,
        activity_type TEXT,
        PRIMARY KEY (user_id, activity_time)
    ) WITH CLUSTERING ORDER BY (activity_time DESC)
<font color="orange">"""</font>)

user_id = uuid1()
activity = {
    <font color="orange">"user_id"</font>: user_id,
    <font color="orange">"activity_time"</font>: <font color="green">"2023-04-08 12:00:00"</font>,
    <font color="orange">"activity_id"</font>: uuid1(),
    <font color="orange">"activity_type"</font>: <font color="green">"login"</font>
}

session.execute(
    <font color="orange">"INSERT INTO user_activity (user_id, activity_time, activity_id, activity_type) VALUES (%s, %s, %s, %s)"</font>,
    (activity[<font color="orange">"user_id"</font>], activity[<font color="orange">"activity_time"</font>], activity[<font color="orange">"activity_id"</font>], activity[<font color="orange">"activity_type"</font>])
)

<font color="indigo"># Querying user activities</font>
rows = session.execute(<font color="orange">"SELECT * FROM user_activity WHERE user_id=%s"</font>, (user_id,))

</code></pre>
                    </ul>
                </li>
                <li><b>Graph Databases:</b>
                    <ul>
                        <li>Data modeling in graph databases involves designing nodes, edges, and properties to represent connected data.</li>
                        <li>Considerations include designing efficient traversal patterns, indexing, and denormalization to optimize graph queries and traversals.</li>
                        <li>Example - Assume you have a dataset of people and their friendships. You can store this information in Neo4j using nodes and relationships.</li>
                        <pre><code class="language-python">
<font color="indigo">from</font> neo4j <font color="indigo">import</font> GraphDatabase

driver = GraphDatabase.driver(<font color="green">"bolt://localhost:7687"</font>, auth=(<font color="green">"neo4j"</font>, <font color="green">"password"</font>))

<font color="indigo">def</font> create_friendship(tx, name1, name2):
    tx.run(<font color="green">"MERGE (a:Person {name: $name1}) "
           "MERGE (b:Person {name: $name2}) "
           "MERGE (a)-[:FRIENDS]->(b)"</font>, name1=name1, name2=name2)

<font color="indigo">with</font> driver.session() <font color="indigo">as</font> session:
    <font color="indigo"># execute_write replaces write_transaction, which is deprecated in driver 5.x</font>
    session.execute_write(create_friendship, <font color="green">"Alice"</font>, <font color="green">"Bob"</font>)
    session.execute_write(create_friendship, <font color="green">"Alice"</font>, <font color="green">"Charlie"</font>)
    session.execute_write(create_friendship, <font color="green">"Bob"</font>, <font color="green">"David"</font>)

<font color="indigo"># Querying friends of a person</font>
<font color="indigo">def</font> get_friends_of_person(tx, name):
    result = tx.run(<font color="green">"MATCH (p:Person {name: $name})-[:FRIENDS]->(friend) RETURN friend.name"</font>, name=name)
    <font color="indigo">return</font> [record["friend.name"] <font color="indigo">for</font> record <font color="indigo">in</font> result]

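<font color="indigo">with</font> driver.session() <font color="indigo">as</font> session:
    <font color="indigo"># execute_read replaces read_transaction, which is deprecated in driver 5.x</font>
    friends_of_alice = session.execute_read(get_friends_of_person, <font color="green">"Alice"</font>)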

<font color="indigo">print</font>(<font color="green">"Friends of Alice:"</font>, friends_of_alice)
</code></pre>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
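
The denormalization trade-off described under principle (a) can be sketched with plain Python structures standing in for stored records. This is an illustration under assumed data, not the behavior of any particular database. The record layouts and the function names are hypothetical.

```python
# Normalized layout: each comment references its author by id.
users = {1: {"name": "Bob"}}
comments_normalized = [
    {"post": "p1", "author_id": 1, "text": "Great post!"},
    {"post": "p2", "author_id": 1, "text": "Interesting read."},
]

# Denormalized layout: the author's name is copied into every comment.
comments_denormalized = [
    {"post": "p1", "author_name": "Bob", "text": "Great post!"},
    {"post": "p2", "author_name": "Bob", "text": "Interesting read."},
]


def author_of_normalized(comment):
    """Reading needs a second lookup (the equivalent of a join)."""
    return users[comment["author_id"]]["name"]


def author_of_denormalized(comment):
    """Reading needs no extra lookup; the name is already embedded."""
    return comment["author_name"]


def rename_user_normalized(user_id, new_name):
    """One write, because the name is stored in exactly one place."""
    users[user_id]["name"] = new_name
    return 1


def rename_user_denormalized(old_name, new_name):
    """One write per copy; missing a copy would leave stale data."""
    writes = 0
    for comment in comments_denormalized:
        if comment["author_name"] == old_name:
            comment["author_name"] = new_name
            writes += 1
    return writes


writes_normalized = rename_user_normalized(1, "Robert")
writes_denormalized = rename_user_denormalized("Bob", "Robert")
print("writes (normalized):", writes_normalized)
print("writes (denormalized):", writes_denormalized)
```

Reads become cheaper in the denormalized layout, while a rename costs one write per embedded copy. That is the update-anomaly risk the text refers to.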
</body>  
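
The data aggregation principle can be sketched in the same spirit. The example below assumes an in-memory event log and a counter that is updated on every write, which is one simple way to precompute an aggregate. The names `record_event`, `logins_per_user`, and the sample users are hypothetical.

```python
from collections import defaultdict

events = []                         # raw activity log
logins_per_user = defaultdict(int)  # precomputed aggregate, maintained on write


def record_event(user_id, activity_type):
    """Append the raw event and keep the aggregate in step with it."""
    events.append({"user_id": user_id, "activity_type": activity_type})
    if activity_type == "login":
        logins_per_user[user_id] += 1


def count_logins_by_scan(user_id):
    """Compute the same number at read time by scanning every event."""
    return sum(
        1
        for event in events
        if event["user_id"] == user_id and event["activity_type"] == "login"
    )


record_event("alice", "login")
record_event("alice", "view_page")
record_event("bob", "login")
record_event("alice", "login")

print("precomputed:", logins_per_user["alice"])
print("scanned:", count_logins_by_scan("alice"))
```

The precomputed counter answers the query with a single lookup, at the cost of extra storage and an extra write step that must never be skipped.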

<hr style="background: linear-gradient(to right, #f00, #00f); height: 5px; border: none;" />