### Intro to Big Data
- Learning Objectives
    - Define Big Data and its various types.
    - Explain the concept of the Internet of Things (IoT).
    - List the characteristics of Big Data and the 7Vs associated with it.
    - Memorize the Big Data life cycle.
    - Differentiate between Big Data file formats.
    - Identify real-world applications of Big Data in various industries.
    - Explain how the term "Big Data" has evolved over time.

#### Big Data Overview
- Big Data refers to massive volumes of structured and unstructured data that are difficult to process with typical database and software techniques. It is a high-volume, high-velocity, and high-variety information asset, and it grows exponentially over time.
- Due to its vastness and complexity, traditional data management technologies cannot store or process this information efficiently.
- Big Data typically describes data that is too large, arrives too fast, or otherwise exceeds current processing capacity. These datasets come from various sources, including social media, sensors, devices, and business transactions.
- Big Data techniques are used when conventional data mining and handling approaches cannot uncover the underlying insights and meaning. Relational database engines struggle to process unstructured, time-sensitive, or very large datasets. Big Data processing instead uses massive parallelism on readily available hardware to manage this type of data.


#### Overview of Structured Data
- Structured data is typically categorized as quantitative data and is well-organized and machine-readable.
- Example: Structured data is stored in relational databases (RDBMS), where fields contain information like names, dates, addresses, credit card numbers, phone numbers, social security numbers, and ZIP codes.
- Structured Query Language (SQL) is used to manage and query structured data.
- Example: Imagine a spreadsheet where data is organized into rows and columns. Specific elements are defined by certain variables, making the information easy to retrieve and analyze.
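
The points above about rows, columns, and SQL can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `customers` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT, signup_date TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, zip_code, signup_date) VALUES (?, ?, ?)",
    [
        ("Ada", "10001", "2023-01-15"),
        ("Grace", "94105", "2023-02-20"),
        ("Alan", "10001", "2023-03-05"),
    ],
)

# Every row has the same fields, so a query can filter on one field directly.
rows = conn.execute(
    "SELECT name FROM customers WHERE zip_code = ? ORDER BY name", ("10001",)
).fetchall()
names = [row[0] for row in rows]
print(names)  # ['Ada', 'Alan']
conn.close()
```

Because the schema is fixed up front, retrieving "all customers in one ZIP code" is a single declarative query rather than custom parsing logic.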

#### Overview of Unstructured Data
- Unstructured data is often referred to as qualitative data; it cannot be processed or analyzed using standard data tools and methods.
- Because unstructured data lacks a predefined data model, it does not fit well in relational databases. Instead, non-relational (NoSQL) databases are better suited to managing unstructured data.
- Another method of managing unstructured data is to allow it to flow into a data lake in its raw, unstructured state.

#### Overview of Semi-Structured Data
- Semi-structured data combines characteristics of both structured and unstructured data. Although it appears structured, it does not follow a fixed schema the way a table in a relational database management system does.
- Metadata: Semi-structured data typically contains metadata, such as tags, attributes, or keys, which provide context and organization to the data elements.
- The “structure” in semi-structured data comes from using tags, markers, metadata, or hierarchies to separate and define elements within the data. This makes it more flexible, simpler to store, and easier to analyze than unstructured data.
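
As a rough illustration of how keys and nesting give semi-structured data its shape, the sketch below parses a small JSON document with Python's standard `json` module. The records and the `get_city` helper are made up for this example; note that the two records do not share the same set of fields.

```python
import json

# Two hypothetical records: same general shape, but different keys and nesting.
raw = """
[
  {"id": 1, "name": "Ada", "tags": ["admin"], "address": {"city": "London"}},
  {"id": 2, "name": "Grace", "email": "grace@example.com"}
]
"""
records = json.loads(raw)

def get_city(record):
    # The "address" key is optional, so fall back to an empty dict when absent.
    return record.get("address", {}).get("city")

cities = [get_city(r) for r in records]
keys_per_record = [sorted(r.keys()) for r in records]

print(cities)           # ['London', None]
print(keys_per_record)  # [['address', 'id', 'name', 'tags'], ['email', 'id', 'name']]
```

The keys act as the metadata described above: they label each value, but nothing forces every record to carry the same fields, which is what separates this from a relational table.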

#### Overview of Big Data File Formats
- Big Data file formats are designed to optimize data storage, retrieval, and processing efficiency. Some formats are general-purpose, while others are tailored for specific use cases or designed to handle unique data characteristics.
- Choosing an appropriate file format can provide significant benefits:
    1. Faster read times.
    1. Faster write times.
    1. Splittable files.
    1. Schema evolution support.
    1. Advanced compression support.
- Common File Formats
    - CSV (Comma-Separated Values) – Simple, human-readable, widely used but lacks advanced optimizations.
    - JSON (JavaScript Object Notation) – Flexible, self-descriptive format often used in NoSQL and web applications.
    - Avro – Binary format with schema evolution support, optimized for row-based storage.
    - Parquet – Columnar storage format, ideal for analytical workloads with fast read performance.
    - ORC (Optimized Row Columnar) – Optimized for Hive and other big data processing frameworks.
    - SequenceFile – Binary key-value storage format used in Hadoop.
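
To make the difference between two of these formats concrete, here is a small sketch that round-trips the same records through CSV and through line-delimited JSON using only Python's standard library. The sensor records are invented for illustration. Avro, Parquet, and ORC are not shown because they require third-party libraries.

```python
import csv
import io
import json

# Hypothetical sensor readings used only for this comparison.
records = [
    {"sensor_id": 1, "temperature": 21.5, "ok": True},
    {"sensor_id": 2, "temperature": 19.0, "ok": False},
]

# CSV round trip: values are written as plain text with no type information.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["sensor_id", "temperature", "ok"])
writer.writeheader()
writer.writerows(records)
csv_buffer.seek(0)
from_csv = [dict(row) for row in csv.DictReader(csv_buffer)]

# JSON round trip (one object per line): numbers and booleans keep their types.
json_lines = "\n".join(json.dumps(r) for r in records)
from_json = [json.loads(line) for line in json_lines.splitlines()]

print(from_csv[0])   # {'sensor_id': '1', 'temperature': '21.5', 'ok': 'True'}
print(from_json[0])  # {'sensor_id': 1, 'temperature': 21.5, 'ok': True}
```

After the CSV round trip every value is a string, so the reader has to know the intended types from somewhere else. The JSON version carries field names and basic types with each record, at the cost of repeating the keys on every line.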


### Summary
- Big Data is everywhere; it is collected and used to drive business decisions and influence people's lives. It is the digital trace generated across the entire digital ecosystem and is a high-volume, high-velocity, and high-variety information asset:
    - Personal assistants (e.g., Siri or Alexa) use Big Data to devise answers to inquiries, and Internet of Things (IoT) devices continually generate massive volumes of data. Big Data analytics helps companies gain insights from the data collected by IoT devices.
    - "Embarrassingly parallel" calculations are workloads that can easily be divided and run independently of one another. If any single process fails, the failure has no impact on the other processes, and the failed process can simply be re-run. Open-source projects, which are free and completely transparent, run the world of Big Data and include the Hadoop project and big data tools like Apache Hive and Apache Spark.
    - The Big Data tool ecosystem includes the following six main tooling categories: data technologies, analytics and visualization, business intelligence, cloud providers, NoSQL databases, and programming tools.
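
The "embarrassingly parallel" idea from the summary can be sketched in a few lines of Python. This is a toy illustration, not how Hadoop or Spark are implemented: it uses a local thread pool as a stand-in for a cluster, and the `count_words` and `run_with_retry` helpers and the sample chunks are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Each chunk is processed with no knowledge of any other chunk.
    return len(chunk.split())

def run_with_retry(chunk, attempts=2):
    # If one task fails, only that task is re-run; the others are unaffected.
    last_error = None
    for _ in range(attempts):
        try:
            return count_words(chunk)
        except Exception as exc:
            last_error = exc
    raise last_error

# Hypothetical input, already split into independent pieces.
chunks = [
    "big data is a high volume asset",
    "sensors generate data continually",
    "each chunk is independent",
]

# Threads stand in for worker machines; map() returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(run_with_retry, chunks))

total = sum(partial_counts)
print(partial_counts, total)  # [7, 4, 4] 15
```

The only coordination needed is the final `sum`, which is why this kind of workload scales out so easily across many machines.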
