**Data Acquisition**

Process of gathering information from a variety of sources to store, process, and analyze it

*Process*
- Sensing with sensors or instruments to measure physical or digital data
- Signal conditioning; refinement of the raw signal, such as amplifying or filtering, so it is ready to be digitized
- Converting analog data to digital data
- Logging, storing, analyzing, processing, and visualizing to extract insights
- One challenge is the many types of data that might be collected: numerical, textual, multimodal
    - Data may be unstructured, requiring manual cleaning (or unsupervised learning) to prepare it for analysis
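
The conditioning, conversion, and logging steps above can be sketched in a few lines. This is a minimal illustration, not a model of any particular hardware: the gain, the moving-average window, the 0 to 5 V range, the 8-bit resolution, and the sample voltages are all assumed values.

```python
def condition(samples, gain=2.0, window=3):
    """Amplify, then smooth with a trailing moving average (a basic low-pass filter)."""
    amplified = [gain * s for s in samples]
    smoothed = []
    for i in range(len(amplified)):
        start = max(0, i - window + 1)
        chunk = amplified[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def quantize(samples, v_min=0.0, v_max=5.0, bits=8):
    """Map each analog value onto an integer code, as an idealized ADC would."""
    levels = 2 ** bits - 1
    codes = []
    for v in samples:
        clipped = min(max(v, v_min), v_max)  # values outside the range saturate
        codes.append(round((clipped - v_min) / (v_max - v_min) * levels))
    return codes


raw = [0.5, 0.6, 2.4, 0.55, 0.5]  # hypothetical sensor voltages
conditioned = condition(raw)
codes = quantize(conditioned)

# Logging: keep each digitized sample as a structured record.
log = [{"sample": i, "code": c} for i, c in enumerate(codes)]
```

The moving average damps the spike at the third sample before it is digitized, which is the point of conditioning ahead of conversion.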

*Retrieval Methods*
- File-based; structured files such as .csv, .xlsx, and .json
- Database; SQL and NoSQL query commands
- APIs; structured, programmatic access to data
- Web scraping; Automated extraction from websites
- Data marketplaces; Paid access to certain data 
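
As a rough sketch of the file-based and database methods, the snippet below uses only the Python standard library. The city and temperature values, the `readings` table, and the use of `io.StringIO` in place of a real file are assumptions made to keep the example self-contained.

```python
import csv
import io
import json
import sqlite3

# File-based: parse CSV text (io.StringIO stands in for an open file).
csv_text = "city,temp_c\nOslo,4.5\nCairo,29.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))  # values arrive as strings

# File-based: parse a JSON document into Python dicts and lists.
json_text = '{"city": "Lima", "temp_c": 19.2}'
record = json.loads(json_text)

# Database: load the records into an in-memory SQLite table and query it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(r["city"], float(r["temp_c"])) for r in rows]
    + [(record["city"], record["temp_c"])],
)
warm = conn.execute(
    "SELECT city FROM readings WHERE temp_c > ? ORDER BY city", (10,)
).fetchall()
conn.close()
```

Note that CSV fields are read as strings and need explicit conversion, while JSON preserves numeric types.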

*Challenges*
- Data quality;
    - Is the data accurate and complete?
    - Is the data biased? 
- Data volume - Storing, analyzing, and computing a massive amount of data
- Data variety - Related data sourced from a variety of sources and modalities
- Integration - Combining data held in different silos
- Costs - Sensors, equipment, software, personnel
- Scalability - Marginal cost of storing additional data 
- Governance - Who owns data? Who manages the lifecycle and stores it?
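
The data quality questions above can be given a first, rough answer in code. The sketch below measures completeness and label balance on a toy dataset; the function names, field names, and records are hypothetical, and a skewed label share is only a hint of possible bias, not proof of it.

```python
from collections import Counter


def completeness(records, fields):
    """Fraction of records in which every listed field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in fields) for r in records
    )
    return complete / len(records)


def label_shares(records, field):
    """Share of each value of `field`, as a first look at imbalance."""
    counts = Counter(r[field] for r in records if r.get(field) not in (None, ""))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


records = [
    {"text": "great", "label": "pos"},
    {"text": "bad", "label": "neg"},
    {"text": "", "label": "pos"},      # incomplete: empty text
    {"text": "fine", "label": "pos"},
]

complete_share = completeness(records, ["text", "label"])
shares = label_shares(records, "label")
```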

**API and Web Scraping**
- API; standardized interface from which to collect data, which allows developers to directly access a structured dataset in a machine-readable format
- Web scraping; Technique for extracting information from HTML/human readable formats. Fragile, requires parsing, and often manual cleaning
    - All websites are different and require custom work
    - Websites are constantly changing, and so web scraping apps must be constantly updated as well
    - Net effect: fragile pipelines, manual upkeep, and unstructured output

*Beautiful Soup*
- Library for HTML and XML parsing; tackles poorly structured HTML sites
- Add Selenium to handle dynamic, JavaScript-rendered pages
- Practical applications;
    - Competitor analysis
    - Sentiment analysis
    - Market research
    - Price monitoring
    - Content aggregation
- Can be used with Pandas, lxml, and Selenium
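
A minimal price-monitoring sketch with Beautiful Soup follows. It assumes the `beautifulsoup4` package is installed and parses an inline HTML string rather than a live site; the markup, class names, and prices are invented for illustration, and a real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing; a real scraper would download this HTML first.
html = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$24.50</span></div>
</body></html>
"""

# "html.parser" is the parser bundled with Python, so lxml is not required.
soup = BeautifulSoup(html, "html.parser")

products = []
for card in soup.find_all("div", class_="product"):
    name = card.find("h2").get_text(strip=True)
    price_text = card.find("span", class_="price").get_text(strip=True)
    products.append({"name": name, "price": float(price_text.lstrip("$"))})
```

The resulting list of dicts can be handed directly to `pandas.DataFrame(products)` for further analysis.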

*API*
- Send request; receive structured response
- APIs can be structured in different ways; many are REST, but there are other options
    - Some require authentication; private APIs might be locked behind an account or paywall 
- Python has specialized libraries for different kinds of APIs; the `requests` library is a straightforward choice for most
- An API key is a unique code issued when you sign up for an API; it controls access and tracks usage
    - Keep it private; do not commit it to source control
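
The request and response cycle with an API key might look like the sketch below. To stay dependency-free it uses the standard library's `urllib` in place of `requests`. The endpoint `https://api.example.com/v1/prices`, the query parameters, the `EXAMPLE_API_KEY` environment variable, and the Bearer header scheme are all assumptions; each real API documents its own URL and how it expects the key to be sent.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, api_key):
    """Build (but do not send) a GET request that carries the key in a header."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={
        "Authorization": f"Bearer {api_key}",
        "Accept": "application/json",
    })


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Read the key from the environment so it never appears in the source file.
api_key = os.environ.get("EXAMPLE_API_KEY", "demo-key")
request = build_request(
    "https://api.example.com/v1/prices",  # hypothetical endpoint
    {"symbol": "ABC", "limit": 5},
    api_key,
)
# data = fetch_json(request)  # uncomment when pointing at a real endpoint
```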

*Formatting*
- JSON; organized as key-value pairs, where values can be strings, numbers, booleans, null, arrays, or nested objects
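
A short sketch of how a JSON response maps onto Python objects, using the standard `json` module. The payload is made up for illustration.

```python
import json

# Hypothetical API response body.
payload = '{"user": {"name": "Ada", "tags": ["admin", "dev"]}, "active": true}'

data = json.loads(payload)            # JSON object -> Python dict
name = data["user"]["name"]           # nested objects become nested dicts
first_tag = data["user"]["tags"][0]   # JSON arrays become Python lists
is_active = data["active"]            # JSON true becomes Python True

# Serialize back to text, for example before writing to a file.
round_trip = json.dumps(data, sort_keys=True)
```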

**Ensuring High-Quality Data**
- Curate sources and clean data
- Audit for bias
- Balance content to remove bias
- Validate outputs by checking for hallucinations
- Document process with transparency