### Lesson 346.1 Intro to Big Data
- Learning Objectives
    - Define Big Data and its various types.
        - a term that describes large volumes of high velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information
        - Types:
            - Structured
            - Unstructured
            - Semi-structured
    - Explain the concept of the Internet of Things (IoT).
        - connects physical devices to the internet, enabling them to collect, share, and analyze data, making them smarter. For example:
            - Today, we have smart air conditioners, televisions, and other connected devices.
            - A smart air conditioner continuously monitors both indoor and outdoor temperatures and automatically adjusts the room temperature accordingly.
            - To achieve this, it gathers outdoor temperature data from the internet and collects real-time sensor data from within the room.
        - an Internet-enabled connected network of smart devices such as sensors, processors, embedded devices and communication hardware that collect and transfer massive amounts of data over the Internet without any manual intervention by using embedded technologies
    - List the characteristics of Big Data and the 7Vs associated with it.
    - Memorize the Big Data life cycle.
    - Differentiate between Big Data file formats.
    - Identify real-world applications of Big Data in various industries.
    - Explain the historical evolution of the term "Big Data" and how it has evolved over time.

#### Big Data Overview
- Massive volumes of both structured and unstructured data, which are challenging to process using typical database and software techniques, are known as Big Data. Big Data grows exponentially over time. Big Data is a high-volume, high-velocity, and high-variety information asset.
- Due to its vastness and complexity, traditional data management technologies cannot store or process this information efficiently.
- Most enterprise data is either too large, too fast, or exceeds current processing capacity. These datasets come from various sources, including social media, sensors, devices, and business transactions.
- Big Data is used when conventional data mining and handling approaches cannot uncover the underlying insights and meaning. Relational database engines cannot process unstructured, time-sensitive, or large datasets. Big Data processing uses massive parallelism on readily available hardware to manage this type of data.


![BidDataTypes.png](attachment:BidDataTypes.png)
- POS (Point of Sale):
    - Refers to the physical location or terminal where a retail transaction is completed, often involving payment processing. 
- POL (Platform Orchestration Layer):
    - A layer in a system architecture that manages and coordinates different components and services within a platform. 
- IR (Intermediate Representation):
    - A representation of code or data used by a compiler or virtual machine during processing, allowing for optimization and translation. 
- IMS (IP Multimedia Subsystem):
    - A standardized architecture for providing multimedia services over IP networks, such as VoIP and video conferencing. 
- MSA (Measurement System Analysis):
    - A systematic process used to evaluate the accuracy and stability of measurement systems, determining their suitability for specific applications. 
![Structured_vs_Unstructured.png](attachment:Structured_vs_Unstructured.png)
![Structured_vs_Unstructured_Examples_UseCases.png](attachment:Structured_vs_Unstructured_Examples_UseCases.png)

#### Overview of Structured Data
- Structured data is typically categorized as quantitative data and is well-organized and machine-readable.
- Example: Structured data is stored in relational databases (RDBMS), where fields contain information like names, dates, addresses, credit card numbers, phone numbers, social security numbers, and ZIP codes.
- Structured Query Language (SQL) is used to manage and query structured data.
- Example: Imagine a spreadsheet where data is organized into rows and columns. Specific elements are defined by certain variables, making the information easy to retrieve and analyze.
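
As a minimal sketch of what "rows and columns queried with SQL" looks like in practice, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT, signup_date TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, zip_code, signup_date) VALUES (?, ?, ?)",
    [
        ("Ada Lopez", "94103", "2023-01-15"),
        ("Ben Okafor", "10001", "2023-03-02"),
        ("Chen Wei", "94103", "2024-07-21"),
    ],
)

# Every row has the same named columns, so a query can filter on one field directly.
rows = conn.execute(
    "SELECT name, signup_date FROM customers WHERE zip_code = ? ORDER BY signup_date",
    ("94103",),
).fetchall()
print(rows)
conn.close()
```

Because the schema is fixed up front, the query needs no parsing or guessing about where the ZIP code lives in each record.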

#### Overview of Unstructured Data
- Unstructured data is frequently referred to as qualitative data because it cannot be handled or evaluated using standard data tools and procedures.
- Because unstructured data lacks a predetermined data model, it cannot be arranged in relational databases. Instead, non-relational or NoSQL databases are best suited for unstructured data management.
- Another method of managing unstructured data is to allow it to flow into a data lake in its raw, unstructured state.

#### Overview of Semi-Structured Data
- Semi-structured data combines characteristics of structured and unstructured data. Although it appears structured, it is not defined by a fixed schema the way a table in a relational database management system is.
- Metadata: Semi-structured data typically contains metadata, such as tags, attributes, or keys, which provide context and organization to the data elements.
- The “structure” in semi-structured data comes from using tags, markers, metadata, or hierarchies to separate and define elements within the data. This makes it more flexible, simpler to store, and easier to analyze than unstructured data.
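
The sketch below illustrates how keys and nesting give semi-structured data its "structure". It assumes two hypothetical JSON records whose fields differ, and flattens them into a table-like form with the standard `json` module. The `flatten` helper and the dotted column names are illustrative choices, not a standard API.

```python
import json

# Hypothetical records: both are valid JSON, but their keys differ.
raw_records = [
    '{"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}}',
    '{"id": 2, "name": "Ben", "tags": ["vip", "newsletter"]}',
]

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists are kept as values."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat

rows = [flatten(json.loads(line)) for line in raw_records]

# The union of keys gives a table-like header; missing fields become None.
columns = sorted({column for row in rows for column in row})
table = [[row.get(column) for column in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

The keys act as the metadata described above: they let a program locate each element even though no two records are required to share the same fields.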

### Knowledge Check
1. What are the key characteristics of big data?
1. Where does Big Data come from?
1. What type of data do web applications produce? Explain why.
1. How would you transform unstructured data into structured data?
1. Explain how healthcare companies have changed their operations and services since the introduction of Big Data.

#### Characteristics of Big Data

![7Vs_BigData.png](attachment:7Vs_BigData.png)

- <b>Volume:</b>
    - Example: Twenty years' worth of medical records from an insurance company is best described as “volume.”
    - Example: 100 terabytes of data are uploaded daily to Facebook, and Akamai analyses 75 million events a day to target online ads. Walmart handles 1 million customer transactions every single hour. Ninety percent of all data created was generated in the past two years.
    - Scale is certainly a part of what makes Big Data big. The internet-mobile revolution, bringing with it a torrent of social media updates, sensor data from devices, and an explosion of e-commerce, means that every industry is swamped with data, which can be incredibly valuable if you know how to use it.
- <b>Variety:</b>
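    - Example: A large census data set that combines records from different sources in different formats, such as database tables, scanned paper forms, and free-text responses. (If those sources are incomplete or conflict with one another, that is a Veracity problem.)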
- <b>Velocity:</b>
    - Example: In 1999, Walmart’s data warehouse stored 1,000 terabytes (1,000,000 gigabytes) of data. In 2012, it had access to over 2.5 petabytes (2,500,000 gigabytes) of data.
    - Every minute of every day, we upload 100 hours of video to YouTube and send over 200 million emails and 300,000 tweets.
    - Google alone processes, on average, more than 40,000 search queries every second, which translates to roughly 3.5 billion searches per day.
- <b>Veracity:</b>
    - The simplest example is contacts that enter your marketing automation system with false names and inaccurate contact information. How many times have you seen Mickey Mouse in your database? It is the classic “garbage in, garbage out” challenge.
    - Although there is widespread agreement about the potential value of Big Data, the data is virtually worthless if it is not accurate. This is particularly true in programs that involve automated decision-making or feed the data into an unsupervised machine learning algorithm. The results of such programs are only as good as the data they are working with.
- <b>Variability:</b>
    - Example: A company was trying to gauge sentiment towards a cafe using these tweets:
        - “Delicious muesli from the @imaginarycafe - what a great way to start the day!”
        - “Greatly disappointed that the local Imaginary Cafe stopped stocking BLTs.”
        - “Had to wait in line for 45 minutes at the Imaginary Cafe today. Great, well there’s my lunch break gone…”
    - Evidently, “great” on its own is not a sufficient signifier of positive sentiment. Instead, companies have to develop sophisticated programs, which can understand context, and decode the precise meaning of words.
- <b>Visualization:</b>
    - After processing, data needs to be presented in a legible and accessible manner; this is where visualization comes in. One of the challenges of Big Data is presenting information in a way that makes the findings evident.
    - Current Big Data visualization technologies are limited by in-memory constraints and by inadequate scalability, functionality, and response time. To plot a billion data points, you need to use multiple methods, including data clustering, tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees.
- <b>Value:</b>
    - Example: 500 zettabytes of DNA data that, if analyzed, could potentially lead to a cure for cancer best illustrates the value of Big Data.
    - The potential value of Big Data is huge. Speaking about new Big Data initiatives in the U.S. healthcare system last year, McKinsey estimated that if these initiatives were rolled out system-wide, they “could account for $300 billion to $450 billion in reduced healthcare spending, or 12 to 17 percent of the $2.6 trillion baseline in U.S. health-care costs.” However, the cost of poor data is also huge: it is estimated to cost U.S. businesses $3.1 trillion a year. In essence, data on its own is virtually worthless. The value lies in rigorous analysis of accurate data and the information and insights it provides.
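
To make the Variability example above concrete, here is a deliberately naive sketch that labels a tweet as positive whenever it contains a word starting with "great". The `naive_sentiment` function is a hypothetical illustration of the flawed approach, not a recommended sentiment method.

```python
import re

# The three tweets from the Variability example above.
tweets = [
    "Delicious muesli from the @imaginarycafe - what a great way to start the day!",
    "Greatly disappointed that the local Imaginary Cafe stopped stocking BLTs.",
    "Had to wait in line for 45 minutes at the Imaginary Cafe today. "
    "Great, well there's my lunch break gone...",
]

def naive_sentiment(text):
    """Label a text 'positive' if it contains a word starting with 'great'."""
    if re.search(r"\bgreat", text, flags=re.IGNORECASE):
        return "positive"
    return "neutral"

labels = [naive_sentiment(tweet) for tweet in tweets]
print(labels)
```

All three tweets come back as positive, although only the first one is. This is the gap that context-aware analysis has to close.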

#### Challenges of Big Data
- `Data Growth` - Managing terabyte-sized datasets can be difficult. Data storage becomes difficult and costly as datasets expand in size. To combat this, businesses are now focusing on data compression and deduplication. 
- `Data Security` - Data security is generally a low priority in Big Data workflows, which can backfire. With so much data being collected, security issues are going to arise. Securing sensitive data, generating fake data, and implementing encryption are some of the issues firms encounter when using Big Data approaches.
- `Data Integration` - Data comes from many sources (social media applications, emails, customer verification documents, survey forms, etc.). All of this data is often difficult to combine and reconcile. Several Big Data solution providers offer Extract, Transform, Load (ETL) services and data integration solutions to businesses that struggle with data integration.
- `Lack of Skilled Staff` - Big Data has taken off fairly recently, and as such, many companies report difficulty in finding staff with the skills necessary to manage it. In fact, that was the top challenge listed in TDWI’s report, cited by 40 percent of respondents. One strategy to combat this problem and grow the talent pool internally is to offer employees Big Data training.
- `Data Governance Issues` - With so much data available, it becomes even more critical to have a framework in place for deciding what data belongs in the system. However, just 30 percent of the companies surveyed by TDWI responded that data governance teams were heavily involved in Big Data management.
- `Organizational Readiness` - As with business intelligence, successfully analyzing Big Data takes more than just installing software and other tools. The entire organization needs to be on the same page, and there must be a clearly articulated strategy built around actual business goals.
- `Dark Data` - Dark data is the information assets that organizations collect, process, and store during regular business activities but generally fail to use for other purposes (for example, analytics, business relationships, and direct monetizing). Similar to dark matter in physics, dark data often comprises most organizations’ universe of information assets. Organizations often retain dark data for compliance purposes only, and storing and securing it typically incurs more expense (and sometimes greater risk) than value.
- According to grow.com, 99.5% of collected data remains unused, primarily due to a lack of infrastructure, resources, and management. Here are a few of the most common ways businesses leave data in the dark.
    - No strategy.
    - Stakeholder misalignment.
    - Lack of resources.
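
The `Data Growth` item above mentions compression and deduplication. The following is a minimal sketch of both ideas using only the standard library. The sample records are hypothetical, and the `deduplicate` helper handles exact duplicates only, which is simpler than the block-level deduplication that storage systems typically perform.

```python
import gzip
import hashlib

# Hypothetical sensor log lines with many exact duplicates.
records = [
    b"2024-01-01,sensor-1,21.5",
    b"2024-01-01,sensor-2,19.0",
    b"2024-01-01,sensor-1,21.5",  # duplicate
    b"2024-01-01,sensor-2,19.0",  # duplicate
] * 250

def deduplicate(items):
    """Keep the first occurrence of each record, identified by its SHA-256 digest."""
    seen = set()
    unique = []
    for item in items:
        digest = hashlib.sha256(item).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item)
    return unique

unique_records = deduplicate(records)
raw_bytes = b"\n".join(records)
deduped_bytes = b"\n".join(unique_records)
compressed_bytes = gzip.compress(raw_bytes)

print(len(records), "records ->", len(unique_records), "unique")
print(len(raw_bytes), "raw bytes ->", len(deduped_bytes), "bytes after deduplication")
print(len(raw_bytes), "raw bytes ->", len(compressed_bytes), "bytes after gzip")
```

Deduplication discards repeated records, whereas compression is lossless: `gzip.decompress(compressed_bytes)` returns the original bytes.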


#### Overview of Big Data File Formats
- Big Data file formats are designed to optimize data storage, retrieval, and processing efficiency. Some formats are general-purpose, while others are tailored for specific use cases or designed to handle unique data characteristics.
- Choosing an appropriate file format can have some significant benefits:
    1. Faster read times.
    1. Faster write times.
    1. Splittable files.
    1. Schema evolution support.
    1. Advanced compression support.
- Common File Formats
    - `CSV` (Comma-Separated Values) – Simple, human-readable, widely used but lacks advanced optimizations.
    - `JSON` (JavaScript Object Notation) – Flexible, self-descriptive format often used in NoSQL and web applications.
    - `Avro` – Binary format with schema evolution support, optimized for row-based storage.
    - `Parquet` – Columnar storage format, ideal for analytical workloads with fast read performance.
    - `ORC (Optimized Row Columnar)` – Optimized for Hive and other big data processing frameworks.
    - `SequenceFile` – Binary key-value storage format used in Hadoop.
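
The sketch below contrasts the layouts behind the formats listed above, using only the standard library and three hypothetical records. The `columns` dictionary illustrates the idea of column-oriented storage that Parquet and ORC build on. It is not their actual on-disk encoding, and reading or writing those formats requires a dedicated library.

```python
import csv
import io
import json

# Hypothetical records used for all three layouts.
records = [
    {"id": 1, "city": "Austin", "amount": 12.5},
    {"id": 2, "city": "Boston", "amount": 7.0},
    {"id": 3, "city": "Austin", "amount": 3.25},
]

# CSV: one header line, then one row per record; values are read back as text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "city", "amount"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON Lines: each record repeats its keys, which makes it self-describing.
json_lines = "\n".join(json.dumps(record) for record in records)

# Column-oriented layout: the values of one field are stored together.
columns = {key: [record[key] for record in records] for key in records[0]}

# An aggregate over one field touches only that column.
total_amount = sum(columns["amount"])

print(csv_text)
print(json_lines)
print(columns)
print(total_amount)
```

In the row-oriented layouts, summing `amount` means reading every full record. In the column-oriented layout, only one list is read, which is why columnar formats suit analytical workloads.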


### Knowledge Check
1. In choosing the correct file format, describe the importance of each criterion: schema, splittability, and compression.
1. Explain the steps to be followed to deploy a Big Data solution.
1. What is applicable for cluster computing?
1. What are the seven characteristics of Big Data?
1. What are the key steps in Big Data Solutions?
1. Search the Internet for Big Data Platforms. Choose only two, and list their advantages and disadvantages.

### Summary
- Big Data is everywhere and is collected and used to drive business decisions and influence people's lives. Big Data is the digital trace that gets generated through the entire digital ecosystem and is a high-volume, high-velocity, and high-variety information asset:
    - Personal assistants (e.g., Siri or Alexa) use Big Data to devise answers to inquiries, and the Internet of Things (IoT) devices continually generate massive volumes of data. Big Data analytics help companies gain insights from the data collected by the IoT devices.
    - “Embarrassingly parallel” calculations are the kinds of workloads that can easily be divided and run independently of one another. If any single process fails, that process has no impact on the other processes and can simply be re-run. Open-source projects, which are free and completely transparent, run the world of Big Data and include the Hadoop project and big data tools like Apache Hive and Apache Spark.
    - The Big Data tool ecosystem includes the following six main tooling categories: data technologies, analytics and visualization, business intelligence, cloud providers, NoSQL databases, and programming tools.
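
As a small illustration of an embarrassingly parallel workload, the sketch below counts words in independent text chunks and merges the partial results. The chunks and the `count_words` and `parallel_word_count` functions are hypothetical. A thread pool is used only to keep the example self-contained, on the assumption that real workloads would apply the same pattern with separate processes or a cluster framework such as Apache Spark.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical text chunks; each can be processed without looking at the others.
chunks = [
    "big data big insights",
    "data lakes store raw data",
    "spark and hive process big data",
]

def count_words(chunk):
    """Count the words in one chunk, independently of every other chunk."""
    return Counter(chunk.split())

def parallel_word_count(text_chunks, max_workers=3):
    """Run count_words on each chunk in a worker pool, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_words, text_chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

totals = parallel_word_count(chunks)
print(totals.most_common(2))
```

Because no chunk depends on another, a failed chunk could be re-run on its own without repeating the rest of the job.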
