# Data Collection

Data Collection is the process of gathering raw information from different sources so it can later be analyzed.

In Data Science and Machine Learning, models do not work on logic alone — they work on data.

So the first step of any AI system is:

Collect → Clean → Analyze → Model

This notebook demonstrates how real-world data is collected using:
- APIs
- Web Scraping

## Types of Data

Data mainly exists in two forms:

### Structured Data
Organized and machine-friendly.

Examples:
- Tables (rows & columns)
- CSV files
- Databases
- JSON APIs

Easy to analyze using Pandas / SQL.

### Unstructured Data
Not organized in a fixed format.

Examples:
- Webpages (HTML)
- Text documents
- Images
- Social media posts

Requires extraction before analysis.
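To make the contrast concrete, here is a minimal sketch that reads the same fact from a structured and an unstructured source. Both input strings are hypothetical samples written for this illustration, and the HTML is read with the standard library's `html.parser` only to keep the sketch dependency-free.

```python
import json
from html.parser import HTMLParser

# Hypothetical inputs carrying the same fact in two forms.
structured = '{"name": "Andorra", "population": 84000}'
unstructured = (
    '<div class="country"><h3>Andorra</h3>'
    '<span class="country-population">84000</span></div>'
)

# Structured: a single call yields typed values.
record = json.loads(structured)


class PopulationParser(HTMLParser):
    """Collects the text inside <span class="country-population">."""

    def __init__(self):
        super().__init__()
        self.in_target = False
        self.population = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "country-population") in attrs:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.population = int(data.strip())


# Unstructured: the value has to be located and converted by hand.
parser = PopulationParser()
parser.feed(unstructured)

print(record["population"], parser.population)  # 84000 84000
```

The JSON path needs one line, while the HTML path needs code that knows where the value sits in the markup.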

## APIs vs Web Scraping

### API (Application Programming Interface)
A server directly provides structured data.

Advantages:
- Clean data
- Reliable
- Fast
- No parsing needed

We simply request data and receive JSON.

---

### Web Scraping
Data is extracted from webpages (HTML).

Advantages:
- Works when no API is available

Disadvantages:
- Requires parsing
- Website structure may change
- Slower than API

In this notebook we will collect data using both methods
and understand their differences.

# Collecting Data using API

An API returns structured data directly from the server.

We send an HTTP request and receive JSON.

Step 1 — Send Request

In [12]:
import requests

API = "https://stephen-king-api.onrender.com/api/books"

# A timeout stops the call from hanging indefinitely if the server does not respond.
response = requests.get(API, timeout=10)

print("Status Code:", response.status_code)

Status Code: 200


Step 2 — Inspect Response Content

In [13]:
print(response.text[:500])

{"data":[{"id":1,"Year":1974,"Title":"Carrie","handle":"carrie","Publisher":"Doubleday","ISBN":"978-0-385-08695-0","Pages":199,"Notes":[""],"created_at":"2023-11-13T23:48:47.848Z","villains":[{"name":"Tina Blake","url":"https://stephen-king-api.onrender.com/api/villain/4"},{"name":"Cindi","url":"https://stephen-king-api.onrender.com/api/villain/14"},{"name":"Myra Crewes","url":"https://stephen-king-api.onrender.com/api/villain/16"},{"name":"Billy deLois","url":"https://stephen-king-api.onrender.


Step 3 — Convert JSON to Python Object

In [14]:
data = response.json()

print(type(data))
print(len(data))

<class 'dict'>
1
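A length of 1 can look surprising for a list of books. The sketch below shows why, using a trimmed, hand-built dict modeled on the response printed above (`sample` is an illustration, not the full API payload).

```python
# Trimmed stand-in for the parsed response; only the shape matters here.
sample = {
    "data": [
        {"id": 1, "Year": 1974, "Title": "Carrie"},
        {"id": 2, "Year": 1975, "Title": "Salem's Lot"},
    ]
}

# len() on a dict counts top-level keys, not the records nested inside.
print(len(sample))          # 1
print(list(sample.keys()))  # ['data']

# The records live in the list stored under the "data" key.
print(len(sample["data"]))  # 2
```

This is why the later steps iterate over `data["data"]` rather than over `data` itself.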


Preview first record

In [21]:
data["data"][0]

{'id': 2,
 'Year': 1975,
 'Title': "Salem's Lot",
 'handle': 'salem-s-lot',
 'Publisher': 'Doubleday',
 'ISBN': '978-0-385-00751-1',
 'Pages': 439,
 'Notes': ['Nominee, World Fantasy Award, 1976[2]'],
 'created_at': '2023-11-13T23:48:48.098Z',
 'villains': [{'name': 'Kurt Barlow',
   'url': 'https://stephen-king-api.onrender.com/api/villain/2'},
  {'name': 'Richard Straker',
   'url': 'https://stephen-king-api.onrender.com/api/villain/98'}]}

Step 4 — Extract Useful Fields

We keep only the important fields instead of the full response.

In [23]:
dataset = []

for book in data["data"]:
    row = {
        "title": book.get("Title"),
        "year": book.get("Year"),
        "publisher": book.get("Publisher"),
        "pages": book.get("Pages")
    }
    dataset.append(row)

dataset[:5]

[{'title': 'Carrie', 'year': 1974, 'publisher': 'Doubleday', 'pages': 199},
 {'title': "Salem's Lot",
  'year': 1975,
  'publisher': 'Doubleday',
  'pages': 439},
 {'title': 'The Shining',
  'year': 1977,
  'publisher': 'Doubleday',
  'pages': 447},
 {'title': 'Rage', 'year': 1977, 'publisher': 'Signet Books', 'pages': 211},
 {'title': 'The Stand', 'year': 1978, 'publisher': 'Doubleday', 'pages': 823}]

Step 5 — Clean Data<br>
Convert numeric fields<br>
Handle missing values

In [25]:
for row in dataset:
    try:
        row["year"] = int(row["year"])
    except (TypeError, ValueError):
        row["year"] = None

    try:
        row["pages"] = int(row["pages"])
    except (TypeError, ValueError):
        row["pages"] = None

dataset[:5]

[{'title': 'Carrie', 'year': 1974, 'publisher': 'Doubleday', 'pages': 199},
 {'title': "Salem's Lot",
  'year': 1975,
  'publisher': 'Doubleday',
  'pages': 439},
 {'title': 'The Shining',
  'year': 1977,
  'publisher': 'Doubleday',
  'pages': 447},
 {'title': 'Rage', 'year': 1977, 'publisher': 'Signet Books', 'pages': 211},
 {'title': 'The Stand', 'year': 1978, 'publisher': 'Doubleday', 'pages': 823}]

We now have a structured dataset ready for analysis.
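The repeated try/except pattern in the cleaning step could be folded into one helper. This is a sketch only: `to_int` and the sample `rows` are hypothetical names and values introduced here, not part of the notebook.

```python
def to_int(value):
    """Return value as an int, or None when it is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Hypothetical rows, including a missing and a malformed value.
rows = [
    {"title": "Carrie", "year": 1974, "pages": "199"},
    {"title": "Unknown", "year": None, "pages": "n/a"},
]

for row in rows:
    row["year"] = to_int(row["year"])
    row["pages"] = to_int(row["pages"])

print(rows)
```

`TypeError` covers `None`, and `ValueError` covers strings such as `"n/a"`, so both kinds of bad input end up as `None`.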

# Collecting Data using Web Scraping

Unlike APIs, webpages return HTML instead of ready-to-use data.

We must:
1) Download page
2) Parse HTML
3) Extract useful information

Step 1 — Download Webpage

In [29]:
from bs4 import BeautifulSoup

websiteURL = "https://www.scrapethissite.com/pages/simple/"

headers = {
    "User-Agent": "Mozilla/5.0"
}

response = requests.get(websiteURL, headers=headers, timeout=10)

print("Status Code:", response.status_code)

Status Code: 200


Step 2 — Parse HTML

In [30]:
html = response.text

soup = BeautifulSoup(html, "html.parser")

print(type(soup))

<class 'bs4.BeautifulSoup'>


Step 3 — Inspect Structure

Each country is stored inside a block:

`<div class="country">`

In [31]:
countries = soup.find_all("div", class_="country")

print("Total Countries:", len(countries))

Total Countries: 250


Preview first block to understand tags

In [32]:
print(countries[0].prettify()[:700])

<div class="col-md-4 country">
 <h3 class="country-name">
  <i class="flag-icon flag-icon-ad">
  </i>
  Andorra
 </h3>
 <div class="country-info">
  <strong>
   Capital:
  </strong>
  <span class="country-capital">
   Andorra la Vella
  </span>
  <br/>
  <strong>
   Population:
  </strong>
  <span class="country-population">
   84000
  </span>
  <br/>
  <strong>
   Area (km
   <sup>
    2
   </sup>
   ):
  </strong>
  <span class="country-area">
   468.0
  </span>
  <br/>
 </div>
</div>



Step 4 — Extract Fields

We collect:
- name
- capital
- population
- area

In [33]:
raw_data = []

for c in countries:
    name = c.find("h3").text.strip()
    capital = c.find("span", class_="country-capital").text.strip()
    population = c.find("span", class_="country-population").text.strip()
    area = c.find("span", class_="country-area").text.strip()

    raw_data.append({
        "name": name,
        "capital": capital,
        "population": population,
        "area": area
    })

raw_data[:5]

[{'name': 'Andorra',
  'capital': 'Andorra la Vella',
  'population': '84000',
  'area': '468.0'},
 {'name': 'United Arab Emirates',
  'capital': 'Abu Dhabi',
  'population': '4975593',
  'area': '82880.0'},
 {'name': 'Afghanistan',
  'capital': 'Kabul',
  'population': '29121286',
  'area': '647500.0'},
 {'name': 'Antigua and Barbuda',
  'capital': "St. John's",
  'population': '86754',
  'area': '443.0'},
 {'name': 'Anguilla',
  'capital': 'The Valley',
  'population': '13254',
  'area': '102.0'}]
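The extraction loop assumes every block contains all four tags. `find()` returns `None` when nothing matches, so calling `.text` on a missing tag raises `AttributeError`. A small guard, sketched here with a hypothetical helper `text_or_none` and simple stand-in objects instead of real parsed tags, avoids that:

```python
from types import SimpleNamespace


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    if tag is None:
        return None
    return tag.text.strip()


# Stand-ins for what find() returns: an object with .text, or None.
found = SimpleNamespace(text="\n   Andorra la Vella\n  ")
missing = None

print(text_or_none(found))    # Andorra la Vella
print(text_or_none(missing))  # None
```

In the loop above it would be used as `capital = text_or_none(c.find("span", class_="country-capital"))`.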

Step 5 — Clean Numeric Columns

In [34]:
dataset_scraped = []

for row in raw_data:
    clean_row = {
        "name": row["name"],
        "capital": row["capital"],
        "population": int(row["population"].replace(",", "")),
        "area": float(row["area"])
    }
    dataset_scraped.append(clean_row)

dataset_scraped[:5]

[{'name': 'Andorra',
  'capital': 'Andorra la Vella',
  'population': 84000,
  'area': 468.0},
 {'name': 'United Arab Emirates',
  'capital': 'Abu Dhabi',
  'population': 4975593,
  'area': 82880.0},
 {'name': 'Afghanistan',
  'capital': 'Kabul',
  'population': 29121286,
  'area': 647500.0},
 {'name': 'Antigua and Barbuda',
  'capital': "St. John's",
  'population': 86754,
  'area': 443.0},
 {'name': 'Anguilla',
  'capital': 'The Valley',
  'population': 13254,
  'area': 102.0}]

The webpage data is now converted into a structured dataset.

# API vs Web Scraping

We collected data using two different techniques:

API → structured JSON  
Web Scraping → HTML extraction

Both achieve the same goal but behave very differently.

## Key Differences

| Feature        | API                     | Web Scraping                     |
|---------------|--------------------------|----------------------------------|
| Data Format   | Structured (JSON)        | Unstructured (HTML)              |
| Reliability   | High                     | Medium                           |
| Speed         | Fast                     | Slower                           |
| Maintenance   | Stable                   | Breaks when website changes      |
| Legal Safety  | Subject to API terms     | Sometimes restricted             |
| Complexity    | Easy                     | Moderate                         |

## Reliability Comparison

API:<br>
Server is designed to provide data → predictable output

Scraping:<br>
Website designed for humans → structure may change

So scraping code can stop working at any time.
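One way to notice such breakage early is to check the scraped rows before using them. The sketch below is an assumption about how that check might look; `check_rows` and the sample rows are hypothetical.

```python
def check_rows(rows, required_fields):
    """Raise ValueError if the scrape returned nothing or a field is empty."""
    if not rows:
        raise ValueError("no rows extracted; the page structure may have changed")
    for index, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                raise ValueError(f"row {index} is missing '{field}'")
    return len(rows)


good = [{"name": "Andorra", "capital": "Andorra la Vella"}]
broken = [{"name": "Andorra", "capital": ""}]

print(check_rows(good, ["name", "capital"]))  # 1

try:
    check_rows(broken, ["name", "capital"])
except ValueError as error:
    print("Scrape check failed:", error)
```

A failed check signals that the selectors need updating, so an empty or partial dataset is not saved unnoticed.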

## When to Use API

Use an API when one is available:

- official data access
- stable format
- faster processing
- scalable pipelines

## When to Use Web Scraping

Use scraping only when:

- no API exists
- data is visible only on a webpage
- one-time dataset collection

## Mental Rule

Always try API first  
Scrape only if necessary

# Saving Collected Data

We store collected datasets so they can be used later in analysis.

API data → JSON (structured hierarchical data)  
Scraped data → CSV (tabular data)

To avoid cluttering the project, files are saved in a separate output folder (`../output_data`), which is created at run time if it does not already exist.

In [40]:
import os

output_dir = "../output_data"

# create the folder if it does not already exist
os.makedirs(output_dir, exist_ok=True)

print("Saving files in:", os.path.abspath(output_dir))

Saving files in: d:\Workspace\AI_ML\ai-engineering-handbook\06-Data-Collection\output_data


## Save API Dataset as JSON

In [41]:
import json

api_file = os.path.join(output_dir, "books_api.json")

with open(api_file, "w", encoding="utf-8") as f:
    json.dump(dataset, f, indent=4)

print("API data saved:", api_file)

API data saved: ../output_data\books_api.json
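To confirm that a file written this way can be read back unchanged, a round-trip check can help. This sketch writes to a throwaway temporary directory rather than the notebook's output folder, and the `books` list is a one-row sample.

```python
import json
import os
import tempfile

books = [{"title": "Carrie", "year": 1974, "publisher": "Doubleday", "pages": 199}]

# A throwaway directory keeps this sketch away from the real output folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books_api.json")

    with open(path, "w", encoding="utf-8") as f:
        json.dump(books, f, indent=4)

    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# JSON keeps numbers, strings, lists and dicts, so the data round-trips.
print(loaded == books)  # True
```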


## Save Scraped Dataset as CSV

In [42]:
import csv

csv_file = os.path.join(output_dir, "countries_scraped.csv")

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(dataset_scraped[0].keys()))
    writer.writeheader()
    writer.writerows(dataset_scraped)

print("Scraped data saved:", csv_file)

Scraped data saved: ../output_data\countries_scraped.csv
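Unlike JSON, CSV stores every value as text, so numeric columns need converting again after loading. The sketch below illustrates this with a one-row sample and a temporary directory; both are assumptions made for the example.

```python
import csv
import os
import tempfile

countries = [{"name": "Andorra", "capital": "Andorra la Vella",
              "population": 84000, "area": 468.0}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "countries_scraped.csv")

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(countries[0].keys()))
        writer.writeheader()
        writer.writerows(countries)

    with open(path, newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))

# Every value comes back as a string, including the numbers.
raw_population = loaded[0]["population"]
print(type(raw_population).__name__)  # str

# Restore the numeric types after loading.
for row in loaded:
    row["population"] = int(row["population"])
    row["area"] = float(row["area"])

print(loaded == countries)  # True
```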


# Data Collection Pipeline Summary

In this notebook we collected data using two different approaches:

API → Structured JSON data  
Web Scraping → Extracted HTML data

Both were converted into structured datasets ready for analysis.

## Steps Followed

1. Sent request to server
2. Received response
3. Parsed data
4. Extracted useful fields
5. Cleaned values
6. Stored dataset
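Steps 3 to 6 can be sketched as small functions chained together. This is an offline illustration: `response_text` is a hypothetical stand-in for the body a real request would return, the function names are chosen for this example, and an in-memory buffer replaces a file on disk.

```python
import io
import json

# Stand-in for response.text; a real run would get this from requests.get().
response_text = (
    '{"data": [{"Title": "Carrie", "Year": "1974", '
    '"Pages": 199, "ISBN": "978-0-385-08695-0"}]}'
)


def parse(text):
    """Step 3: turn the raw response body into Python objects."""
    return json.loads(text)


def extract(payload):
    """Step 4: keep only the fields needed for analysis."""
    return [
        {"title": b.get("Title"), "year": b.get("Year"), "pages": b.get("Pages")}
        for b in payload["data"]
    ]


def clean(rows):
    """Step 5: coerce numeric fields, using None for unusable values."""
    for row in rows:
        for field in ("year", "pages"):
            try:
                row[field] = int(row[field])
            except (TypeError, ValueError):
                row[field] = None
    return rows


def store(rows, file_obj):
    """Step 6: write the dataset as JSON to any writable file object."""
    json.dump(rows, file_obj, indent=4)


rows = clean(extract(parse(response_text)))
buffer = io.StringIO()  # in-memory file, so nothing is written to disk
store(rows, buffer)

print(rows)  # [{'title': 'Carrie', 'year': 1974, 'pages': 199}]
```

Keeping each step in its own function makes it easier to swap the source (API or scraped page) without touching the cleaning and storage code.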

## What We Learned

API collection:
Reliable and preferred method for data pipelines.

Web scraping:
Fallback technique when no API is available.

Both methods ultimately produce structured data suitable for Pandas.

## Real World Perspective

In production systems:

Databases → primary source  
APIs → public/third-party data  
Scraping → rare edge cases

Data collection and preparation often take significant effort before any modeling begins.

This notebook demonstrates the first step of any data science workflow:

Collect → Clean → Analyze → Model