## What is data discovery? ## 
Before you architect and deploy a data analytics system, answer the following questions:

- Which data should be analyzed? What is its value to the business or organization?
- Who owns the data? Where is it located?
- Is the data usable in its current state? What transformations are required?
- Who needs to see the data?
- After the data is curated and ready for consumption, how should it be presented?


## Define business value ##
Conduct data discovery kick-off workshops with stakeholders to understand business goals, prioritize use cases, and identify potential data sources. 



The following are example questions that define the business opportunity:

- How would getting insight into data provide value to the business?
- Are you looking to create a new revenue stream from your data?
- What are the challenges with your current approach and tools?
- Would your business benefit from fraud detection, predictive maintenance, or root-cause analysis that reduces mean time to detection and mean time to recovery?
- How are you continually innovating on behalf of your customers and improving their user experience?

## Identify your data sources ##

Research the different internal and external sources where data is generated and stored. This includes databases, applications, and files.

---

### Data can be categorized along three lines
- **Data Type**
- **Data Source**
- **Ingest Mode**

---

#### Data types

| Data type            | Examples                                                               |
|----------------------|------------------------------------------------------------------------|
| Structured data      | Relational databases, spreadsheets, CSV files                          |
| Semi-structured data | Non-relational databases, JSON, XML, log files, IoT sensor data        |
| Unstructured data    | Text documents, images, audio files, video files                       |

---

#### Data sources

| Data source           | Examples                                                                    |
|-----------------------|-----------------------------------------------------------------------------|
| Databases             | CRM applications, ERP applications, CMS applications                        |
| Files                 | On-premises file servers, document libraries, archives                      |
| Logs                  | Application logs, device logs                                               |
| IoT devices           | Sensor data, device metadata, time-series data                              |
| Mobile devices        | Social media, messaging apps                                                |
| Video                 | Media and entertainment services, surveillance cameras, video libraries     |
| SaaS apps             | User activity logs, transactional data, marketing analytics, e-commerce data|
| Datasets              | Demographic data, weather data, geospatial mapping, transportation data     |

---

#### Ingest modes

| Ingest mode | Example data and workloads                                                |
|-------------|---------------------------------------------------------------------------|
| Streaming   | Sensors, social media platforms, media and entertainment services, IoT devices |
| Micro-batch | Sensors, website logs, graphics and video rendering                       |
| Batch       | Medical imagery, genomic data, financial records, usage data              |

---
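
The three categorization lines above can be recorded per source in a simple inventory. The following is a minimal sketch, assuming a Python-based inventory; the `DataSourceRecord` fields, the source names, and the owners are hypothetical and only illustrate how data type, data source, and ingest mode might be recorded side by side.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class IngestMode(Enum):
    STREAMING = "streaming"
    MICRO_BATCH = "micro-batch"
    BATCH = "batch"


@dataclass(frozen=True)
class DataSourceRecord:
    name: str                # hypothetical source name
    source_kind: str         # row label from the data sources table, e.g. "Logs"
    data_type: DataType
    ingest_mode: IngestMode
    owner: str               # hypothetical owning team


def summarize_by_ingest_mode(records):
    """Count how many inventoried sources fall under each ingest mode."""
    return Counter(record.ingest_mode for record in records)


# Hypothetical inventory entries, for illustration only.
inventory = [
    DataSourceRecord("orders_db", "Databases", DataType.STRUCTURED,
                     IngestMode.BATCH, "sales-ops"),
    DataSourceRecord("web_logs", "Logs", DataType.SEMI_STRUCTURED,
                     IngestMode.MICRO_BATCH, "platform"),
    DataSourceRecord("plant_sensors", "IoT devices", DataType.SEMI_STRUCTURED,
                     IngestMode.STREAMING, "manufacturing"),
]

summary = summarize_by_ingest_mode(inventory)
for mode in IngestMode:
    print(f"{mode.value}: {summary[mode]}")
```

Using enumerations for the data type and ingest mode keeps the inventory limited to the categories in the tables, which makes later counts and filters less error-prone than free-text labels.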

### Example questions to identify data types, sources, and ingest modes

- How many data sources do you have to support?
- Where and how is the data generated?
- What are the different types of data?
- What are the different formats of data?
- Does your data originate on premises, from a third-party vendor, or in the cloud?
- Is the data source streaming, batch, or micro-batch?
- What are the velocity and volume of ingestion?
- What is the ingestion interface?
- How does your team onboard new data sources?

## Define your storage, catalog, and data access needs ##
Determine the best storage for specific data types. Assess data quality to determine processing needs. Catalog and register details about data sources.
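
A data quality assessment can start with something as small as measuring how often each column is empty. The following is a minimal sketch using only the Python standard library; the sample CSV, its column names, and the `missing_value_ratio` helper are hypothetical, and a real assessment would read from the actual source and cover more checks (types, ranges, duplicates).

```python
import csv
import io


def missing_value_ratio(rows, columns):
    """Return the fraction of rows in which each column is empty or absent."""
    missing = {column: 0 for column in columns}
    row_count = 0
    for row in rows:
        row_count += 1
        for column in columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                missing[column] += 1
    if row_count == 0:
        return {column: 0.0 for column in columns}
    return {column: missing[column] / row_count for column in columns}


# Hypothetical sample data, for illustration only.
SAMPLE_CSV = """customer_id,email,country
1,a@example.com,US
2,,DE
3,c@example.com,
4,,FR
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
ratios = missing_value_ratio(reader, reader.fieldnames)
for column, ratio in ratios.items():
    print(f"{column}: {ratio:.0%} missing")
```

Columns with a high missing ratio are candidates for cleansing or enrichment during processing, or for exclusion from the initial use case.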



The following are example questions to identify your data storage and data access requirements:

- Which data stores do you have?
- What is the purpose of each data store?
- Why are you using that storage method? (for example, files, SQL, NoSQL, or a data warehouse)
- How do you currently organize your data? (for example, data tiering or partitioning)
- How much data are you storing now, and how much do you expect to store in the future? (for example, 18 months from now)
- How do you manage data governance? 
- Which regulatory and governance compliance standards are applicable to you? 
- What is your disaster recovery (DR) strategy?
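
To answer the question about future storage volume, a rough projection is often enough at the discovery stage. The following is a minimal sketch that assumes a constant compound monthly growth rate; the function name, the 10 TB starting size, and the 5% growth rate are hypothetical values chosen for illustration, and real growth is rarely this regular.

```python
def project_storage_tb(current_tb, monthly_growth_rate, months):
    """Project storage size, assuming constant compound monthly growth."""
    if current_tb < 0 or months < 0:
        raise ValueError("current_tb and months must be non-negative")
    return current_tb * (1 + monthly_growth_rate) ** months


# Hypothetical inputs: 10 TB today, growing 5% per month, 18-month horizon.
projected = project_storage_tb(current_tb=10.0, monthly_growth_rate=0.05, months=18)
print(f"Projected size in 18 months: {projected:.1f} TB")
```

Running the projection with a low and a high growth rate gives a range that can inform storage tiering and cost estimates.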

## Define your data processing requirements ##
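Extract relevant data from sources like databases, data lakes, and CRM systems. Tools such as AWS Glue crawlers can discover the schemas of those sources and register them in the AWS Glue Data Catalog, and AWS Glue jobs or custom scripts can then extract the data.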

Curate and transform the raw data as needed using services like AWS Glue and Amazon EMR.
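
As an illustration of the AWS Glue crawler step mentioned above, the following sketch creates a crawler for a single Amazon S3 path and starts it. It assumes a client that behaves like `boto3.client("glue")` and uses that client's `create_crawler` and `start_crawler` operations. The crawler name, role ARN, database name, and S3 path are hypothetical placeholders, and `DryRunGlueClient` is a stand-in written for this example so that the code runs without AWS credentials.

```python
def register_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Create a crawler for one S3 path and start it.

    glue_client is expected to behave like boto3.client("glue").
    """
    glue_client.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue_client.start_crawler(Name=name)
    return name


class DryRunGlueClient:
    """Stand-in that records calls instead of contacting AWS (illustration only)."""

    def __init__(self):
        self.calls = []

    def create_crawler(self, **kwargs):
        self.calls.append(("create_crawler", kwargs))

    def start_crawler(self, **kwargs):
        self.calls.append(("start_crawler", kwargs))


# All identifiers below are hypothetical placeholders.
client = DryRunGlueClient()
register_and_start_crawler(
    client,
    name="raw-orders-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="discovery_raw",
    s3_path="s3://example-bucket/raw/orders/",
)
for operation, arguments in client.calls:
    print(operation, arguments)
```

To run this against a real account, pass `boto3.client("glue")` in place of the stand-in. That requires valid AWS credentials and an IAM role that AWS Glue can assume with access to the target path.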



The following example questions can help identify your data processing requirements:

- Do you have to transform or enrich the data before you consume it?
- Which tools do you use for transforming your data?
- Do you have a visual editor for the transformation code? 
- What is the frequency of your data transformation? (for example, real time, micro-batching, or overnight batch)
- Are there any constraints with your current tool of choice?
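
The transformation frequency question distinguishes real time, micro-batching, and overnight batch. As a rough illustration of what micro-batching means in practice, the following sketch groups timestamped events into fixed time windows so that each window can be transformed as a small batch. The event tuples, the 60-second window, and the `group_into_micro_batches` helper are hypothetical; stream processing services provide their own windowing features.

```python
from collections import defaultdict


def group_into_micro_batches(events, window_seconds):
    """Group (timestamp_seconds, payload) events into fixed-size time windows."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Align each event to the start of the window that contains it.
        window_start = int(timestamp // window_seconds) * window_seconds
        batches[window_start].append(payload)
    return dict(sorted(batches.items()))


# Hypothetical events: (seconds since start, payload).
events = [(0.5, "a"), (3.2, "b"), (61.0, "c"), (119.9, "d"), (120.0, "e")]
batches = group_into_micro_batches(events, window_seconds=60)
for window_start, payloads in batches.items():
    print(f"window starting at {window_start}s: {payloads}")
```

A shorter window moves the pipeline toward real-time behavior at the cost of more frequent, smaller transformation runs, and a longer window moves it toward conventional batch processing.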