# Schema Modeling

Schema modeling is an important aspect of data engineering, as it helps structure your data in a way that is efficient for querying and easy to understand. There are different types of schema modeling approaches, such as star schema, snowflake schema, and hybrid schema (which combines elements of both star and snowflake). 

Let's discuss each of these.

### Star Schema

In a `star` schema, a central fact table is connected to one or more dimension tables via foreign key relationships. The fact table contains quantitative data (e.g., sales, revenue) and keys to join with the dimension tables. Dimension tables store descriptive attributes (e.g., customer information, product details) and are usually denormalized, meaning they contain redundant data to minimize the number of joins required during querying.

Star schema is a popular choice for data warehouse design because of its simplicity and query performance advantages. Here's an example of a star schema in the context of a retail sales data warehouse:

Imagine a retail company wants to analyze its sales data. A typical star schema for this scenario would have a central fact table (fact_sales) and several dimension tables, such as dim_date, dim_product, dim_customer, and dim_store.

1. fact_sales (fact table): This table contains quantitative data (measures) about each sale, such as:
- sales_id (primary key)
- date_id (foreign key to dim_date)
- product_id (foreign key to dim_product)
- customer_id (foreign key to dim_customer)
- store_id (foreign key to dim_store)
- quantity_sold
- total_price

2. dim_date (dimension table): This table stores information about dates, such as:
- date_id (primary key)
- date
- day_of_week
- day_of_month
- month
- quarter
- year


3. dim_product (dimension table): This table contains information about products, such as:
- product_id (primary key)
- product_name
- category
- subcategory
- brand
- color
- size


4. dim_customer (dimension table): This table stores information about customers, such as:
- customer_id (primary key)
- first_name
- last_name
- email
- phone
- address
- city
- state
- country


5. dim_store (dimension table): This table holds information about store locations, such as:
- store_id (primary key)
- store_name
- address
- city
- state
- country
- region


<b> Star Schema Model: </b><br><br>
<img src="https://i.ibb.co/QHCFGrX/star-schema-model.png" height = "600" width = "800"><br><br>
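
As a rough illustration of the star layout described above, the sketch below builds a trimmed-down version of `fact_sales`, `dim_date`, and `dim_product` in an in-memory SQLite database and finds the best-selling products for one month. The column subset, the sample rows, and the choice of SQLite are assumptions made for the example, not part of the model itself.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    date TEXT, month INTEGER, year INTEGER
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT, category TEXT, brand TEXT
);
CREATE TABLE fact_sales (
    sales_id INTEGER PRIMARY KEY,
    date_id INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity_sold INTEGER,
    total_price REAL
);
-- Sample rows are invented for illustration.
INSERT INTO dim_date VALUES (1, '2024-01-15', 1, 2024), (2, '2024-02-03', 2, 2024);
INSERT INTO dim_product VALUES
    (10, 'Trail Shoe', 'Footwear', 'Acme'),
    (11, 'Rain Jacket', 'Apparel', 'Acme');
INSERT INTO fact_sales VALUES
    (100, 1, 10, 2, 160.0),
    (101, 1, 11, 1, 90.0),
    (102, 1, 10, 3, 240.0),
    (103, 2, 11, 4, 360.0);
""")

# Each dimension is exactly one join away from the fact table.
best_sellers = conn.execute("""
    SELECT p.product_name, SUM(f.quantity_sold) AS units
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    WHERE d.year = 2024 AND d.month = 1
    GROUP BY p.product_name
    ORDER BY units DESC
""").fetchall()

print(best_sellers)  # [('Trail Shoe', 5), ('Rain Jacket', 1)]
```

Because every descriptive attribute lives directly in a dimension table, the query needs only one join per dimension it touches.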


### Snowflake Schema

In a `snowflake` schema, the central fact table is surrounded by multiple dimension tables, which can further be connected to other dimension tables, forming a structure resembling a snowflake. Unlike the star schema, snowflake schema normalizes data, reducing data redundancy at the cost of more complex queries due to an increased number of joins.

The snowflake schema is favored for its efficient data storage since it minimizes data duplication by separating data into related tables. Here's an example of a snowflake schema in the context of a retail sales data warehouse:

Imagine a retail company wishes to analyze its sales data. A typical snowflake schema for this scenario would consist of a central fact table (fact_sales) and several dimension tables, including dim_customer, dim_contact, dim_address, dim_country, dim_date, dim_day_details, dim_store, dim_product, dim_brand, and dim_category.


1. fact_sales (fact table): This table contains data about each sale, such as:
- sales_id (primary key)
- customer_id (foreign key to dim_customer)
- product_id (foreign key to dim_product)
- date_id (foreign key to dim_date)
- store_id (foreign key to dim_store)
- sold_date
- quantity_sold
- total_sale_amount

2. dim_customer (dimension table): This table stores information about customers, with details like:
- customer_id (primary key)
- first_name
- last_name
- contact_id (foreign key to dim_contact)
- address_id (foreign key to dim_address)

3. dim_contact (dimension table): This table holds contact information with columns such as:
- contact_id (primary key)
- email
- phone

4. dim_address (dimension table): This table holds address information with columns such as:
- address_id (primary key)
- address
- city
- state
- country_id (foreign key to dim_country)

5. dim_country (dimension table): This table maintains information about various countries, encapsulating details like:
- country_id (primary key)
- country

6. dim_date (dimension table): This table holds temporal data that can be referenced in various other tables, with attributes such as:
- date_id (primary key)
- date
- month
- year
- quarter
- day_details_id (foreign key to dim_day_details)

7. dim_day_details (dimension table): This table provides a deeper dive into the details associated with particular days, including:
- day_details_id (primary key)
- day_of_the_week
- day_of_the_month

8. dim_store (dimension table): This table keeps track of the store details where the sales are happening, encapsulating elements like:
- store_id (primary key)
- store_name
- address_id (foreign key to dim_address)
- region

9. dim_product (dimension table): This table houses information on the various products available for sale, with columns including:
- product_id (primary key)
- product_name
- brand_id (foreign key to dim_brand)
- category_id (foreign key to dim_category)
- price

10. dim_brand (dimension table): This table maintains details about the different brands of the products, encapsulating attributes such as:
- brand_id (primary key)
- brand

11. dim_category (dimension table): This table keeps a record of product categories and subcategories, with columns like:
- category_id (primary key)
- category
- subcategory

<b> Snowflake Schema Model: </b><br><br>
<img src="https://i.postimg.cc/CMGV50Lm/Screenshot-2024-02-22-at-11-50-31-AM.png" height = "800" width = "1200"><br><br>
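
To make the cost of normalization concrete, the following sketch totals revenue by country using the snowflake tables described above. It assumes a reduced set of columns and invented sample rows, and it uses an in-memory SQLite database purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_address (
    address_id INTEGER PRIMARY KEY,
    city TEXT,
    country_id INTEGER REFERENCES dim_country(country_id)
);
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    first_name TEXT,
    address_id INTEGER REFERENCES dim_address(address_id)
);
CREATE TABLE fact_sales (
    sales_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    total_sale_amount REAL
);
-- Sample rows are invented for illustration.
INSERT INTO dim_country VALUES (1, 'Canada'), (2, 'Japan');
INSERT INTO dim_address VALUES (10, 'Toronto', 1), (11, 'Osaka', 2), (12, 'Ottawa', 1);
INSERT INTO dim_customer VALUES (100, 'Ana', 10), (101, 'Ken', 11), (102, 'Lee', 12);
INSERT INTO fact_sales VALUES (1, 100, 50.0), (2, 101, 70.0), (3, 102, 25.0), (4, 101, 30.0);
""")

# Reaching the country name takes three joins:
# fact_sales -> dim_customer -> dim_address -> dim_country.
sales_by_country = conn.execute("""
    SELECT co.country, SUM(f.total_sale_amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_customer AS cu ON cu.customer_id = f.customer_id
    JOIN dim_address AS a ON a.address_id = cu.address_id
    JOIN dim_country AS co ON co.country_id = a.country_id
    GROUP BY co.country
    ORDER BY co.country
""").fetchall()

print(sales_by_country)  # [('Canada', 75.0), ('Japan', 100.0)]
```

In the star schema the same question would need a single join to dim_customer, since `country` is stored there directly. The snowflake version stores each country name once but pays for it with a longer join chain.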


### Hybrid Schema

A hybrid schema combines elements of both star and snowflake schemas, aiming to use the strengths of each. It features a central fact table connected to a mixture of normalized and denormalized dimension tables. This offers a balanced approach, improving storage efficiency while maintaining reasonable query performance.

In the context of a retail sales data warehouse, a hybrid schema might look like the following:

Imagine a retail company seeks to scrutinize its sales data. The hybrid schema for this scenario could consist of a central fact table (fact_sales) and several dimension tables such as dim_customer, dim_contact, dim_address, dim_date, dim_store, and dim_product.

1. fact_sales (fact table): This table stores transactional data, encompassing details like:
- sales_id (primary key)
- customer_id (foreign key to dim_customer)
- product_id (foreign key to dim_product)
- date_id (foreign key to dim_date)
- store_id (foreign key to dim_store)
- sold_date
- quantity_sold
- total_sale_amount

2. dim_customer (dimension table): This table holds customer details, with columns such as:
- customer_id (primary key)
- first_name
- last_name
- contact_id (foreign key to dim_contact)
- address_id (foreign key to dim_address)

3. dim_contact (dimension table): This table contains contact details, featuring columns like:
- contact_id (primary key)
- email
- phone

4. dim_address (dimension table): This table stores customer address details (store addresses are kept directly in dim_store in this design), including:
- address_id (primary key)
- address
- city
- state
- country

5. dim_date (dimension table): This table encompasses date-related information, with columns such as:
- date_id (primary key)
- date
- month
- year
- quarter
- day_of_the_week
- day_of_the_month

6. dim_store (dimension table): This table maintains information about various store locations, with details like:
- store_id (primary key)
- store_name
- address
- city
- state
- country
- region

7. dim_product (dimension table): This table harbors product details, including:
- product_id (primary key)
- product_name
- brand
- category
- subcategory
- price

In this hybrid schema, certain dimension tables (like dim_contact and dim_address) are normalized to save storage space, while others (like dim_product and dim_store) are denormalized to facilitate simpler and faster queries. This schema allows for a flexible approach in data warehouse design, accommodating varying degrees of complexity and storage efficiency based on the specific requirements of the data analysis tasks at hand.


<b> Hybrid Schema Model: </b><br><br>
<img src="https://i.postimg.cc/gj0hrW6j/Screenshot-2024-02-22-at-11-22-38-AM.png" height = "800" width = "1000"><br><br>
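
One way to arrive at a denormalized dimension such as the hybrid dim_product is to flatten its normalized counterpart once, at load time. The sketch below shows that idea with SQLite's `CREATE TABLE ... AS SELECT`. The table name `dim_product_normalized` and the sample rows are hypothetical, introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (snowflake-style) product dimension.
CREATE TABLE dim_brand (brand_id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, category TEXT, subcategory TEXT
);
CREATE TABLE dim_product_normalized (  -- hypothetical staging name
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_id INTEGER REFERENCES dim_brand(brand_id),
    category_id INTEGER REFERENCES dim_category(category_id),
    price REAL
);
INSERT INTO dim_brand VALUES (1, 'Acme'), (2, 'Zenith');
INSERT INTO dim_category VALUES
    (1, 'Footwear', 'Running'),
    (2, 'Apparel', 'Outerwear');
INSERT INTO dim_product_normalized VALUES
    (10, 'Trail Shoe', 1, 1, 80.0),
    (11, 'Rain Jacket', 2, 2, 90.0),
    (12, 'Road Shoe', 1, 1, 95.0);

-- Denormalized (star-style) dimension: the joins run once here,
-- so analytical queries no longer need them.
CREATE TABLE dim_product AS
SELECT p.product_id, p.product_name, b.brand, c.category, c.subcategory, p.price
FROM dim_product_normalized AS p
JOIN dim_brand AS b ON b.brand_id = p.brand_id
JOIN dim_category AS c ON c.category_id = p.category_id;
""")

rows = conn.execute("""
    SELECT product_name, brand, category, subcategory
    FROM dim_product
    ORDER BY product_id
""").fetchall()

for row in rows:
    print(row)
```

Note that 'Acme', 'Footwear', and 'Running' now appear on two rows each. That repetition is the storage cost accepted in exchange for join-free reads on this dimension.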


Below is a comparison of the three schemas.
<br><br>

| Aspect                | Star Schema                                                         | Snowflake Schema                                                      | Hybrid Schema                                                             | Scenario Where One is Better                                              | Scenario Where It Doesn't Matter                                          |
|-----------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------|
| Data Organization     | Denormalized: Fewer tables with redundant data                      | Normalized: More tables with minimized redundancy                     | Combination of normalized and denormalized tables, optimizing storage and query complexity  | Star: Simplicity and fast queries; Snowflake: Storage efficiency; Hybrid: Balance of query performance and storage efficiency | When emphasis is on other factors like data security or specific analysis requirements |
| Query Performance     | Faster due to fewer joins required                                  | Potentially slower due to increased number of joins                   | Balanced: Mix of complex and simple queries, aiming for an optimum performance               | Star: Real-time analytics; Snowflake: Detailed analytical processing; Hybrid: A balance of both, accommodating diverse query requirements | In scenarios where performance variation between the schemas is marginal |
| Storage Efficiency    | Requires more space due to data redundancy                          | More efficient due to reduced data redundancy                         | Moderately efficient: Aims to optimize storage without significantly affecting query performance | Snowflake: When minimizing storage cost is a priority; Hybrid: When seeking a balanced approach to storage efficiency | When storage considerations are not a priority and focus is on other attributes of data management |
| Complexity            | Simpler structure, easier to understand and maintain                | More complex structure that requires a deeper understanding of data relationships | Moderately complex: Combines elements of both star and snowflake schemas to manage complexity | Star: Smaller data warehouses; Snowflake: Large, detailed data warehouses; Hybrid: When a balanced complexity level is desired | When the structure's complexity is not a significant factor in the data warehouse design |
| Flexibility in Analysis | Limited to simpler, direct analyses                                | Facilitates detailed, multifaceted analyses                           | Offers a balanced flexibility, accommodating a wide range of analysis requirements           | Snowflake: For complex, multifaceted analyses; Hybrid: For a balanced approach to analysis flexibility | When the analysis requirements are not intricately detailed, allowing for a flexible schema choice |

<br><br>
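
The storage-efficiency row can be illustrated with a deliberately crude count: how many descriptive strings each layout has to store for the same product data. The rows below are invented, and real storage use also depends on the database engine, compression, and key sizes, so treat this only as a sketch of the redundancy argument.

```python
# Hypothetical product rows in a denormalized (star-style) dimension:
# (product_name, brand, category)
denormalized = [
    ("Trail Shoe", "Acme", "Footwear"),
    ("Road Shoe", "Acme", "Footwear"),
    ("Rain Jacket", "Acme", "Apparel"),
    ("Wind Jacket", "Zenith", "Apparel"),
]

# Normalizing moves repeated strings into lookup tables
# and leaves integer keys behind in the product rows.
brands = sorted({brand for _, brand, _ in denormalized})
categories = sorted({category for _, _, category in denormalized})
brand_ids = {name: i for i, name in enumerate(brands, start=1)}
category_ids = {name: i for i, name in enumerate(categories, start=1)}
normalized = [
    (name, brand_ids[brand], category_ids[category])
    for name, brand, category in denormalized
]

# Strings stored: every cell in the denormalized table, versus one per
# product name plus one per distinct brand and category when normalized.
denormalized_strings = sum(len(row) for row in denormalized)
normalized_strings = len(normalized) + len(brands) + len(categories)

print(denormalized_strings, normalized_strings)  # 12 8
```

The gap widens as the number of products per brand or category grows, which is why the snowflake layout tends to pay off on large dimensions with few distinct attribute values.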

-------------------------------------------------------------------------------------------------------------------
Lastly, here is a summary of selection criteria, with some real-world scenarios to help choose between the models.

1. <b>Star Schema:</b>

- <b>Scenario:</b> A small e-commerce startup wants to set up a data warehouse to analyze sales data to make informed decisions on inventory management and marketing strategies.
- <b>Criteria:</b> Given that the startup is small, the amount of data is not enormous. A star schema would allow for simple, straightforward queries with faster response times, making it easier for the startup to quickly gain insights without getting entangled in complex data relationships.
- <b>Example Query:</b> Determining the best-selling products in a particular month, which can be quickly found by performing a few joins between the fact and dimension tables.

2. <b>Snowflake Schema: </b>

- <b>Scenario:</b> A multinational corporation wants to analyze global sales and customer data, which includes a vast amount of details such as customer demographics, sales transactions, and product information distributed across various countries.
- <b>Criteria:</b> Given the large scale of data and the necessity to store data efficiently, a snowflake schema is chosen. It minimizes data redundancy, thus saving storage costs. The corporation has the resources to manage the complexity of the schema and can afford slightly slower query times in exchange for more detailed and multifaceted analyses.
- <b>Example Query:</b> Analyzing the buying patterns of customers in different countries, which requires detailed data from several normalized tables, facilitating a deeper analysis of the global customer base.

3. <b>Hybrid Schema:</b>

- <b>Scenario:</b> A healthcare research organization wants to build a data warehouse to store and analyze varied data including patient records, treatment histories, and research data.
- <b>Criteria:</b> The organization deals with a diverse set of data, where some aspects require detailed analyses (like research data) while others need quicker query times (like patient records). A hybrid schema is chosen to balance the storage efficiency and query performance, catering to the diverse data analysis needs without compromising on storage space or query speed.
- <b>Example Query:</b> Conducting a study on the effectiveness of a treatment method, which would involve querying both denormalized tables (for quick retrieval of patient records) and normalized tables (for detailed analysis of research data).

## Schema Modeling Exercise

A telecommunications company is planning to revamp its data warehouse to integrate and analyze data from various departments including sales, customer service, network operations, and marketing. The data encompasses a wide variety of information such as customer profiles, call detail records, network traffic data, service usage patterns, marketing campaign data, and customer feedback.

The company aims to achieve the following objectives:

1. Analyze customer usage patterns to offer personalized service packages and promotions.
2. Monitor and optimize network performance by analyzing traffic data and identifying potential issues before they affect customers.
3. Evaluate the effectiveness of marketing campaigns by analyzing customer responses and feedback.
4. Enhance customer service by integrating customer profiles with feedback and service usage data to provide personalized assistance.
5. Comply with regulatory requirements by ensuring secure and well-organized storage of customer data.

The company is faced with the challenge of managing a vast amount of diverse data, which needs to be stored efficiently while still allowing for complex analyses and reporting. The data warehouse must support quick query response times for real-time monitoring of network operations and flexible, detailed analyses for marketing and customer service strategies.

<b>Considerations:</b>
1. The data volume is enormous, given the detailed call records and network traffic data.
2. The requirement for real-time monitoring necessitates quick query response times.
3. Marketing and customer service strategies require detailed, multifaceted analyses.
4. Regulatory compliance demands secure and well-organized data storage.

Based on this scenario, readers should analyze the following aspects to choose the most suitable schema:

- Which schema(s) would allow for efficient storage of the vast amount of diverse data?
- Which schema(s) could facilitate quick queries for real-time monitoring of network operations?
- Which schema(s) would support detailed analyses for developing marketing and customer service strategies?
- Which schema(s) can ensure regulatory compliance through well-organized and secure data storage? 

<details>
<summary><b>Click here to reveal the answers</b></summary>

1. **Hybrid Schema**: Given the diverse requirements, a hybrid schema might be the most suitable as it can balance the need for quick queries (from denormalized tables) and detailed analyses (from normalized tables).
2. **Star Schema**: For real-time monitoring of network operations, a star schema with denormalized tables can provide quicker query response times.
3. **Snowflake Schema**: For detailed analyses for marketing and customer service strategies, a snowflake schema can offer more depth and detail in the analyses, though at the cost of slower query times.
4. **Hybrid/Star Schema**: To ensure regulatory compliance, a hybrid schema can offer a balanced approach, while a star schema might offer simpler data organization and thus potentially simpler compliance processes.

</details>