# Data Hunting and Gathering (Part 1)

![Web Scraping](http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg)

Welcome to the first part of our journey into the world of web scraping. Web scraping, also known as web harvesting or web data extraction, is a technique used for extracting data from websites. This process involves fetching the web page and then extracting data from it.

## Why Learn Web Scraping?
Understanding how to scrape data from the web is a valuable skill for any data professional. In the digital era, data is the new gold, and web scraping is the mining equipment. Here's why it's essential:

- **Data Availability**: The internet is a vast source of data for all kinds of analyses, from market trends to academic research.
- **Automation**: Web scraping can automate the process of collecting data, saving time and effort.
- **Competitive Advantage**: In many fields, having timely and relevant data can be a game-changer.

## Real-World Applications
- **Market Research**: Analyzing competitors, understanding customer sentiments, and identifying market trends.
- **Price Comparison**: Aggregating pricing data from various websites for comparison shopping.
- **Social Media Analysis**: Gathering data from social networks for sentiment analysis or trend spotting.

## Ethical Considerations in Web Scraping

Web scraping, while a powerful technique for data extraction, comes with significant ethical and legal responsibilities. As budding data scientists and web scrapers, it's crucial to navigate this landscape with a deep understanding and respect for these considerations.

### Respecting Website Policies and Laws

- **Adhering to Terms of Service**: Every website has its own set of rules, usually outlined in its Terms of Service (ToS). It's important to read and understand these rules before scraping, as violating them can have legal implications.

- **Following Copyright Laws**: The data you scrape is often copyrighted. Ensure that your use of scraped data complies with copyright laws and respects intellectual property rights.

- **Privacy Concerns**: Be mindful of personal data. Scraping and using personal information without consent can breach privacy laws and ethical standards.

### Example: Understanding Google's `robots.txt`

Google's `robots.txt` file is an excellent example of how websites communicate their scraping policies. Accessible at [Google's robots.txt](https://www.google.com/robots.txt), this file provides directives to web crawlers about which pages they can or cannot scrape.

#### Implications of Google's `robots.txt`

- **Selective Access**: Google allows certain parts of its site to be crawled while restricting others. For instance, crawling the search results pages is generally disallowed.

- **Dynamic Nature**: The content of `robots.txt` files can change, reflecting the website's evolving stance on web scraping. Regular checks are necessary for compliance.

- **Respecting the Limits**: Even if a `robots.txt` file allows scraping of some pages, it does not automatically mean all scraping activities are legally or ethically acceptable. It's a guideline, not a blanket permission.
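
As a minimal sketch of how such directives can be checked programmatically, Python's standard library includes `urllib.robotparser`. The rules below are a hand-written sample (not Google's actual file) and `www.example.com` is a placeholder host; for a live site you would instead call `set_url()` with the site's `robots.txt` address, followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hand-written sample rules (an assumption for illustration, not a real site's file)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines; for a live site use set_url(...) followed by read()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the sample rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://www.example.com/about"))
print(is_allowed("https://www.example.com/search?q=python"))
```

A crawler could call a helper like the hypothetical `is_allowed` before every request and skip any URL for which it returns `False`.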

# 1. Introduction to Data Hunting in the Digital Age

## The Evolution of Data Sourcing

In this course, we focus on data as our foundational element. Traditionally, data has been sourced from structured formats like spreadsheets from scientific experiments or records in relational databases within organizations. But with the digital revolution, particularly the advent of the internet, our approach to data collection must evolve. The internet is a vast reservoir of unstructured data, presenting both challenges and opportunities for data retrieval and analysis.

## Understanding the Landscape of Web Data

When seeking data from the internet, it's essential to first consider how the website in question provides access to its data. Many large-scale websites like Google, Facebook, and Twitter offer an **Application Programming Interface (API)**. APIs are designed to facilitate easy access to a website's data in a structured format, simplifying the process of data extraction.

### The Role of APIs

- **APIs as a Primary Tool**: An API acts as a bridge between the data seeker and the website's database, allowing for streamlined data retrieval.
- **Limitations**: However, not all websites provide an API. Additionally, even when an API is available, it may not grant access to all the data a user might need.

### The Need for Web Scraping

In cases where an API is absent or insufficient, we turn to **web scraping**. Web scraping involves extracting raw data directly from a website's frontend - essentially, the same information presented to users in their web browsers.

#### Diving into Scraping

- **Dealing with Unstructured Data**: Scraping requires us to interact with unstructured data, necessitating custom coding and data parsing techniques.
- **Legal and Ethical Considerations**: It's crucial to approach web scraping with an awareness of the legal and ethical implications, respecting website policies and user privacy.

<img style="border-radius:20px;" src="./files/big_picture.jpg">

### Starting Our Journey

Our first practical step in this journey will be to explore how to connect to the internet and retrieve a basic webpage. We'll begin by using Python's `urllib.request` module, a powerful tool for interacting with URLs and handling web requests.

Join us as we embark on this exciting journey to master the art of data hunting in the digital era, where we'll navigate the complexities of APIs, web scraping, and the ethical considerations that come with them.

In [9]:
# Import the 'urlopen' function from the 'urllib.request' module.
# This function is used for opening URLs, which is the first step in web scraping.
from urllib.request import urlopen

# Use the 'urlopen' function to open the URL 'http://www.google.com/'.
# The function returns a response object which can be used to read the content of the page.
# Here, 'source' is a variable that holds the response object from the URL.
source = urlopen("http://www.google.com/")

# Print the response object.
# This command does not print the content of the webpage.
# Instead, it prints a representation of the response object, 
# which includes information like the URL, HTTP response status, headers, etc.
print(source)

<http.client.HTTPResponse object at 0x000001FFC97A3C10>


This code snippet demonstrates the basic usage of the `urlopen` function for accessing a webpage. However, it is important to note that `print(source)` will not display the HTML content of the webpage but rather the HTTP response object's representation. To view the actual content of the page, you would need to read from the `source` object using methods like `source.read()`.

# Exploring the Content Retrieved by `urlopen`

After opening a URL using the `urlopen` function from the `urllib.request` module, we typically want to access the actual content of the webpage. This is where `source.read()` comes into play.

## Understanding `source.read()`

When you call `urlopen`, it returns an HTTPResponse object. This object, which we've named `source` in our example, holds various data and metadata about the webpage. To extract the actual HTML content of the page, we use the `read` method on this object.

### What Does `source.read()` Do?

- **Retrieves Webpage Content**: `source.read()` reads the entire content of the webpage to which the URL points. This content is usually in HTML format, which is the standard language for creating webpages.

- **Binary Format**: `read()` returns a `bytes` object, not a string. To work with it as text in Python, decode it with the page's character encoding, for example `.decode('utf-8')`.

- **One-time Operation**: It's important to note that you can read the content of the response only once. After `source.read()` is executed, the response is exhausted and further calls to `read()` return an empty bytes object (`b''`). If you need to access the content again, either keep the result in a variable or reopen the URL.

Here's a simple example to illustrate this:

In [None]:
# Let us check what the response contains
something = source.read()
print(something)

In [None]:
# Check type
type(something)
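
The bytes-versus-string distinction and the one-time read can be illustrated without touching the network. The sketch below assumes a `data:` URL (which `urlopen` also accepts) as a stand-in for a real webpage; the HTML string is made up for the example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Made-up page content, embedded in a data: URL so that no network access is needed
page = "<html><body><h1>Café Python</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(page)

response = urlopen(url)
raw = response.read()    # first read: the full content, as bytes
again = response.read()  # second read: the response is already exhausted

# The Content-Type header may declare the charset to use for decoding
charset = response.headers.get_content_charset() or "utf-8"
text = raw.decode(charset)

print(type(raw), type(text))
print(again)
print(text)
```

Storing the result of the first `read()` in a variable, as done with `raw` here, avoids having to reopen the URL.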

# Warm-Up Exercises

Let's get hands-on with a few initial exercises to warm up with web scraping!

## Exercises

1. **Python.org Content Check**: Does [https://www.python.org](https://www.python.org) contain the word `Python`?  
   _Hint: You can use the `in` keyword to check._

2. **Google.com Image Search**: Does [http://google.com](http://google.com) contain an image?  
   _Hint: Look for the `<img>` tag._

3. **First Characters of Python.org**: What are the first ten characters of [https://www.python.org](https://www.python.org)?

4. **Keyword Check in Pyladies.com**: Is there the word 'python' in [https://pyladies.com](https://pyladies.com)?

In [None]:
# EX1: Check if 'Python' is in the content of http://www.python.org/

# Import the urlopen function from the urllib.request module
# This function is used to open a URL and retrieve its contents
from urllib.request import urlopen

# Use the urlopen function to access the webpage at http://www.python.org/
# The function returns an HTTPResponse object which is stored in the variable 'source'
source = urlopen("http://www.python.org/")

# Read the content of the response object using the read() method
# The read() method retrieves the content of the webpage in binary format
# The binary content is then decoded to a string using the 'latin-1' encoding
# The decoded string is stored in the variable 'something'
something = source.read().decode('latin-1')

# Check if the word "Python" is in the decoded string
# This is done using the 'in' keyword, which checks for the presence of a substring in a string
# The result is a boolean value: True if "Python" is found, False otherwise
"Python" in something

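# Note: 'latin-1' never raises a decoding error, but it can garble non-ASCII characters
# when the page actually uses a different encoding. 'utf-8' is the most common encoding
# for webpages and is often the better choice: something = source.read().decode('utf-8')
# (use it instead of the line above, because the response can only be read once)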

In [None]:
# EX2: Check if 'https://www.google.com/' contains an image tag ("<img>")

# Import the urlopen function from the urllib.request module.
# This function is used to open a URL and retrieve its contents.
from urllib.request import urlopen

# Use the urlopen function to open the webpage at 'https://www.google.com/'.
# The function returns an HTTPResponse object, which we store in the variable 'source'.
source = urlopen("https://www.google.com/")

# Read the content of the response object using the read() method.
# The read() method retrieves the content of the webpage in binary format.
# After reading, the content is in bytes, which is not human-readable.
# We then decode this binary content into a string using the 'latin-1' encoding.
# The resulting string, which contains the HTML of the page, is stored in 'something'.
something = source.read().decode('latin-1')

# Check if the string "<img" is in the decoded HTML content.
# Searching for "<img" (the start of an image tag) is more precise than searching
# for "img" alone, which would also match other words or attribute values containing "img".
# The result will be True if an <img ...> tag is found (indicating the presence of an image),
# and False if not.
"<img" in something

# Note: Decoding with 'latin-1' might not be suitable for all websites,
# especially if the website uses a different character set.
# 'utf-8' is a more commonly used encoding and is often a better choice.
# For instance: something = source.read().decode('utf-8')
# (use it instead of the line above, because the response can only be read once)
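
A substring test cannot tell an actual tag from text that merely contains the same characters. As a hedged alternative, the sketch below counts real `<img>` tags with the standard library's `html.parser`; the class name `ImageCounter` and the helper `count_images` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Count the <img> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase; self-closing <img ... /> also ends up here
        if tag == "img":
            self.count += 1


def count_images(html):
    """Return the number of <img> tags in an HTML string."""
    parser = ImageCounter()
    parser.feed(html)
    return parser.count


print(count_images('<p class="imgbox">no image here</p>'))
print(count_images('<img src="a.png"><img src="b.png"/>'))
```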

In [None]:
# Now it is your turn for EX3 and EX4

# Using `urlopen` vs. `Request` in Web Scraping

When performing web scraping tasks in Python, you have the option to use either the `urlopen` function from the `urllib.request` module or the `Request` object in combination with `urlopen`. Here, we'll explain why you might choose one approach over the other.

## Using `urlopen` Directly

**Advantages**:

- **Simplicity**: It's a straightforward way to access a webpage and retrieve its content without the need for additional objects or customization.
  
- **Default Behavior**: `urlopen` uses default settings for the HTTP request, which is suitable for many common use cases.

- **Convenience**: For simple web scraping tasks, it provides a concise and readable solution.

## Using `Request` with `urlopen`

**Advantages**:

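- **Customization**: A `Request` object lets you set custom headers (such as `User-Agent`), send a request body, and choose the HTTP method (e.g., POST, PUT). Timeouts are passed to `urlopen` itself (`timeout=`), while cookies and redirect handling are configured through opener handlers such as `HTTPCookieProcessor`.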

- **Fine-Grained Control**: It offers greater flexibility for handling complex scenarios.

In summary, the choice between using `urlopen` directly and creating a `Request` object depends on the complexity of your web scraping task. For simple tasks like fetching webpage content, `urlopen` is often sufficient and more straightforward. However, if you need to customize headers, use non-GET HTTP methods, or handle advanced scenarios, creating a `Request` object allows for fine-grained control over your HTTP requests.
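
A minimal sketch of the `Request` approach follows. The URL and the `User-Agent` string are placeholders chosen for illustration, and `fetch` is a hypothetical helper; building the `Request` objects does not contact any server.

```python
from urllib.request import Request, urlopen

# Placeholder URL and User-Agent string (assumptions for illustration)
URL = "https://www.example.com/"
HEADERS = {"User-Agent": "my-course-scraper/0.1"}

# A Request with custom headers; without a body the method defaults to GET
req = Request(URL, headers=HEADERS)

# Supplying a body (as bytes) switches the default method to POST
post_req = Request(URL, data=b"q=python", headers=HEADERS)


def fetch(request, timeout=10):
    """Send the request and return the decoded body; the timeout belongs to urlopen."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(req.get_method(), req.header_items())
print(post_req.get_method())
# html = fetch(req)  # uncomment to actually send the request
```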


In [None]:
# Solution exercise 4
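# import urllib.request  # 'import urllib' alone does not load the request submodule
# url ='https://www.pyladies.com'
# req = urllib.request.Request(url, headers = {'User-Agent': 'Magic Browser'})
# con = urllib.request.urlopen(req)
# html = con.read().decode()

# 'python' in html.lower()  # case-insensitive check for the word 'python'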

# Crawling and Scraping: Unveiling the Web's Secrets

Crawling and scraping are two fundamental techniques in the world of web data acquisition. They form the backbone of many data-driven applications and are crucial skills for data analysts and web developers.

## Crawling: Navigating the Web

Crawling, often referred to as web crawling or spidering, is the process of systematically navigating the World Wide Web to retrieve web pages. Think of it as a web robot or spider, tirelessly traversing the internet to discover and index web content. This technique is at the heart of search engines like Google and Bing.

### Why Do We Crawl?

Crawling serves several important purposes:

- **Indexing**: It allows search engines to index and catalog web pages, making them searchable by users.
  
- **Link Discovery**: Crawlers extract links from web pages, helping build a vast network of interconnected web resources. This link structure is crucial for understanding the web's architecture.
  
- **Data Retrieval**: Crawlers may scrape or extract data from web pages, but their primary goal is to discover and navigate to other web pages.

## Scraping: Harvesting Data

Scraping is the process of extracting specific data or information from a single web page. Unlike crawling, which focuses on navigating the web, scraping zooms in on a single webpage to harvest valuable data.

### Use Cases of Scraping

Scraping is used for a variety of purposes, such as:

- **Data Extraction**: It allows us to extract structured data like product prices, news headlines, or stock market information from websites.

- **Content Monitoring**: Scraping can be employed to track changes in content on specific web pages, such as monitoring price changes on e-commerce sites or tracking news updates.

- **Competitor Analysis**: Businesses often use scraping to gather data on competitors, such as pricing strategies or product listings.

- **Research and Analysis**: Data analysts and researchers use scraping to collect data for studies, reports, and data-driven insights.

## Crawling and Scraping Synergy

In practice, crawling and scraping often work together. Crawlers traverse the web to find new pages, and once they reach a page of interest, scraping techniques are applied to extract valuable data. This synergy is what powers search engines, news aggregators, and data-driven applications on the internet.

## Conclusion

Understanding the concepts of crawling and scraping is essential for anyone looking to work with web data. Whether you want to build a search engine, gather market research, or simply automate data collection, these techniques are your gateway to unlocking the wealth of information available on the web.

**WARM-UP PROJECT: Building a Simple Web Spider**

In this warm-up project, we'll delve into the world of web spiders or crawlers. These are specialized programs designed to systematically explore the World Wide Web, retrieving web pages and their contents. Web spiders play a crucial role in various applications, including indexing web pages for search engines, data extraction from websites, and more. In this project, we'll focus on constructing a basic web spider.

---

**Project Overview:**

A web spider, also known as a web crawler, is essentially a digital agent that navigates the vast landscape of the internet. Its primary mission is to:

- Explore the web by following links from one webpage to another.
- Retrieve web page content.
- Store valuable data for analysis or other purposes.

Think of it as a robotic explorer, tirelessly traversing the web to gather information. In our project, we aim to create a simplified version of such a web spider.

**Key Tasks:**

1. **Identifying Links**: The initial challenge for our crawler is to identify which links it should follow and explore further. Consider how you would instruct the spider to locate and track these links within a web page.

2. **Creating the Spider Class**: To bring our spider to life, we'll start by crafting a Python class aptly named "Spider." This class will serve as the core engine of our web crawler, and its constructor will accept three crucial parameters:
   - `starting_url`: The initial URL from which our spider embarks on its journey.
   - `crawl_domain`: A domain restriction to ensure that only relevant links are considered for crawling.
   - `max_iter`: A limit on the maximum number of web items the spider will collect.

3. **Main Method: Spider.run()**: To set our spider in motion, we'll implement a method called `run` within the Spider class. This method will orchestrate the spider's actions, and it's here that we'll outline the core functionalities or building blocks that empower our crawler.

Through this project, you'll gain hands-on experience in creating a simplified web spider, providing a foundational understanding of web crawling techniques.    

### Web Scraping/Crawling Project Workflow

Web scraping and crawling involve a series of steps to access, retrieve, and process web data efficiently and responsibly. The typical workflow for such a project includes the following stages:

1. **Accessing the Web (`Acceder web`)**:
   - The initial step is to access the target website(s) from which data needs to be scraped.
   - This involves sending an HTTP request and receiving the response from the web server.

2. **Downloading Web Content (`Bajar web`)**:
   - Once access is granted, the next step is to download the content of the webpage.
   - This may include HTML, CSS, JavaScript, and media files which make up the webpage.

3. **Searching for Links (`Buscar enlaces`)**:
   - This step involves parsing the downloaded web content to search for hyperlinks.
   - Hyperlinks are identified by the `<a href="...">` HTML tag and are pointers to other webpages.

4. **Storing Web Content (`Almacenar web`)**:
   - The retrieved web content is then stored locally for processing.
   - This storage can be in the form of raw files, or more structured formats like databases.

5. **Storing Extracted Links (`Almacenar enlaces`)**:
   - Extracted hyperlinks are also stored.
   - This forms the basis of the crawling process, where each link can be followed to retrieve more content.

6. **Verifying Quality of Links (`Verificar enlaces de saldo`)**:
   - Not all links may be relevant or functional.
   - This step ensures that the stored links are valid and lead to the necessary content.

In [None]:
# Importing necessary libraries
from urllib.request import urlopen  # For opening URLs
from urllib.error import HTTPError  # To handle HTTP errors
import time  # To implement delays if needed

In [22]:
# Function to extract links from HTML content
def getLinks(html, max_links=10):
    url = []  # List to store the extracted URLs
    cursor = 0  # Cursor to track position in HTML content
    nlinks = 0  # Counter for number of links extracted

    # Loop to extract links until the maximum is reached or no more links are found
    while nlinks < max_links:
        start_link = html.find("a href", cursor)  # Find the start of a link
        if start_link == -1:  # If no more links are found, stop searching
            break
        start_quote = html.find('"', start_link)  # Find the opening quote of the URL
        if start_quote == -1:  # No quoted URL follows: stop searching
            break
        end_quote = html.find('"', start_quote + 1)  # Find the closing quote of the URL
        if end_quote == -1:  # Unterminated quote: stop searching
            break
        url.append(html[start_quote + 1: end_quote])  # Extract and append the URL to the list
        cursor = end_quote + 1  # Move the cursor past this URL
        nlinks += 1  # Increment the link counter

    return url  # Return the list of URLs

# Example usage:
# Suppose you have some HTML content stored in a variable `html_content`
# You would call the function like this:
# links = getLinks(html_content)
# This would return a list of URLs extracted from `html_content`

# Expected Output:
# The output will be a list containing up to `max_links` number of URLs extracted from the given HTML content.
# If the HTML content has fewer than `max_links` URLs, all found URLs will be included in the list.
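
Searching for the literal text `a href` misses links written with single quotes or with other attributes before `href`. As a hedged alternative (not the approach used in this notebook), the sketch below extracts links with the standard library's `html.parser`; `LinkParser` and `get_links_parsed` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with quotes already removed
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def get_links_parsed(html, max_links=10):
    """Return up to max_links href values found in the HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links[:max_links]


sample_html = (
    "<a href='/one'>1</a> "
    '<a class="x" href="two.html">2</a> '
    "<p>plain text that mentions a href without being a tag</p>"
)
print(get_links_parsed(sample_html))
```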

In [26]:
# Define the Spider class for web crawling
class Spider:
    # Initializer or constructor for the Spider class
    def __init__(self, starting_url, crawl_domain, max_iter):
        self.crawl_domain = crawl_domain  # The domain within which the spider will crawl
        self.max_iter = max_iter  # The maximum number of pages to crawl
        self.links_to_crawl = [starting_url]  # Queue of links to crawl
        self.links_visited = []  # List to keep track of visited links
        self.collection = []  # List to store the collected data

    # Method to retrieve HTML content from a URL
    def retrieveHtml(self):
        try:
            # Open the URL and read the response
            socket = urlopen(self.url)
            # Decode the response using 'latin-1' encoding
            self.html = socket.read().decode('latin-1')
            return 0  # Return 0 if successful
        except HTTPError as e:
            # If an HTTP error occurs, print the error and return -1
            print(f"HTTP Error encountered: {e}")
            return -1

    # Main method to control the crawling process
    def run(self):
        # Continue to crawl while there are links to crawl and the max_iter is not reached
        while self.links_to_crawl and len(self.collection) < self.max_iter:
            # Get the next link to crawl
            self.url = self.links_to_crawl.pop(0)
            print(f"Currently crawling: {self.url}")
            # Add the link to the list of visited links
            self.links_visited.append(self.url)
            # If HTML retrieval is successful, store the HTML and find new links
            if self.retrieveHtml() >= 0:
                self.storeHtml()
                self.retrieveAndValidateLinks()

    # Method to retrieve and validate links in the HTML content
    def retrieveAndValidateLinks(self):
        # Get a list of links from the current HTML content
        items = getLinks(self.html)
        # Temporary list to store valid links
        tmpList = []

        # Iterate over the found links
        for item in items:
            item = item.strip('"')  # Remove any extra quotes
        
            # Check if the link is an absolute URL that contains the crawl domain
            if self.crawl_domain in item and item.startswith('http'):
                tmpList.append(item)
            # Handle relative links
            elif item.startswith('/'):
                # Construct the full URL using the crawl domain and relative link
                tmpList.append('https://' + self.crawl_domain + item)
            # Handle potential relative links without a leading slash (assuming they are not absolute URLs)
            elif not item.startswith('http'):
                # Construct the full URL assuming it is a relative link
                tmpList.append('https://' + self.crawl_domain + '/' + item)

        # Add valid, unvisited links to the crawl queue
        for item in tmpList:
            if item not in self.links_visited and item not in self.links_to_crawl:
                self.links_to_crawl.append(item)
                print(f'Adding to crawl queue: {item}')


    # Method to store HTML content and associated metadata
    def storeHtml(self):
        # Create a dictionary to represent the document
        doc = {
            'url': self.url,  # URL of the page
            'date': time.strftime("%d/%m/%Y"),  # Current date
            'html': self.html  # HTML content of the page
        }
        # Add the document to the collection
        self.collection.append(doc)
        print(f"Stored HTML from: {self.url}")

# Example usage of the Spider class:
# Initialize the spider with the starting URL, domain to crawl, and the maximum number of iterations.
# my_spider = Spider("http://www.example.com", "example.com", 20)

# Start the crawling process.
# my_spider.run()

# After running, my_spider.collection will contain up to 20 pages' HTML from 'example.com'.
# Each page's data includes the URL, the date when it was scraped, and the HTML content.


# Summary of the Spider Class for Web Crawling

The `Spider` class is designed for the web crawling process, which systematically browses the web to collect data. Below is an overview of its key functionalities:

## Initialization
- The class initializes with a `starting_url`, the domain to crawl within (`crawl_domain`), and a maximum number of pages to crawl (`max_iter`).

## Crawling Process
- The spider maintains a queue of links (`links_to_crawl`) to visit and a list of links already visited (`links_visited`).
- The `run` method processes each link in the queue, continuing until the queue is empty or the `max_iter` limit is reached.

## HTML Content Retrieval
- The `retrieveHtml` method opens each link, reads its content, and decodes it into a string format. It handles success and error cases during this process.

## Link Extraction and Validation
- `retrieveAndValidateLinks` extracts new links from the current page's HTML, validates them (ensuring they belong to the specified domain), and adds unvisited, valid links to the crawl queue.

## Data Storage
- The `storeHtml` method saves the HTML content of each visited page, along with the page's URL and the current date, to a collection for later analysis or processing.

This class allows for the automated collection of data from a series of web pages within a specific domain, efficiently managing the discovery of new pages to visit based on the links in each page.
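
One limitation of the link validation described above is that relative links are always attached to the domain root, and values such as `#top` or `mailto:` addresses are treated as pages. A possible refinement, sketched here under the assumption that the URL of the current page is available, resolves each link against that page with `urllib.parse.urljoin`. The function name `normalize_link` is hypothetical and not part of the `Spider` class.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_link(page_url, href, crawl_domain):
    """Resolve href against page_url; return an absolute URL inside crawl_domain, or None."""
    # urljoin handles absolute URLs, '/path', 'page.html' and '../page.html' alike;
    # urldefrag drops any '#fragment' so the same page is not queued twice
    absolute = urldefrag(urljoin(page_url, href)).url
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, etc. are not crawlable pages
    host = parts.hostname or ""
    if host == crawl_domain or host.endswith("." + crawl_domain):
        return absolute
    return None


page = "https://books.toscrape.com/catalogue/page-2.html"
for href in ["../index.html", "page-3.html", "#top", "mailto:info@example.com"]:
    print(href, "->", normalize_link(page, href, "books.toscrape.com"))
```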

Let us validate the crawler with the following code: 

In [None]:
# Assuming the Spider class is defined as before with getLinks function properly defined

# Example usage of the Spider class:

# Instantiate the Spider with the starting URL, the domain to crawl within, and the maximum number of iterations.
# The crawl domain is typically the base domain from which the crawler should not deviate.
spider = Spider('https://books.toscrape.com/', 'books.toscrape.com', 20)

# Start the crawling process.
spider.run()

# After running, `spider.collection` will contain the HTML content of up to 20 pages from 'books.toscrape.com'.
# Each entry in the collection will include the URL, the date when it was scraped, and the HTML content.

In [None]:
# How many elements does our collection have?
len(spider.collection)

In [None]:
spider.collection[0]

In [None]:
# Enumerate the URLs retrieved
[doc['url'] for doc in spider.collection]

It seems that the simple crawler more or less works as expected. There are still many functionalities to work on, such as checking for valid domains and valid URLs. One important issue to consider is **persistence**, or how to store the data retrieved for further analysis.
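
As one possible approach to persistence, the sketch below writes a collection to a JSON Lines file (one JSON document per line) and reads it back. The helper names, the file name, and the sample document are assumptions made for illustration; with the crawler above you would pass `spider.collection` instead of `sample_collection`.

```python
import json
import os
import tempfile


def save_collection(collection, path):
    """Write each document as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for doc in collection:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")


def load_collection(path):
    """Read the documents back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up document with the same keys that storeHtml() produces
sample_collection = [
    {"url": "https://books.toscrape.com/", "date": "01/01/2024", "html": "<html>...</html>"},
]

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "collection.jsonl")
    save_collection(sample_collection, path)
    restored = load_collection(path)

print(restored == sample_collection)
```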

# Small Business Challenge: Extracting Product Data from a Fake Online Store

## Objective
Your task is to perform a product analysis by collecting data from a simulated online store, such as "Fake Store API" (https://fakestoreapi.com/). This website is designed for practice and offers a safe environment for web scraping.

## Steps
1. **Data Collection**:
   - Use the `Spider` class to crawl the "Fake Store API" website.
   - Collect data on products, including names, categories, prices, and descriptions.

2. **Data Storage**:
   - Store the scraped data in a structured format, such as a CSV file or a database.

3. **Analysis**:
   - Analyze the collected data to understand product distribution across different categories, price ranges, and other relevant metrics.

4. **Report**:
   - Prepare a report summarizing your findings, including insights on product trends, pricing strategies, and category popularity.

## Tasks for Students
- Work in pairs to plan and execute the web scraping process.
- Ensure ethical scraping practices are followed, including adhering to `robots.txt` guidelines and rate limiting requests.
- Conduct a thorough analysis of the collected data and collaborate to create a comprehensive report.
- Present your findings in class, highlighting key insights and methodologies used.
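
Rate limiting can be as simple as pausing between consecutive requests. The sketch below is one way to do it; `fetch_politely` and `fake_fetch` are hypothetical names, the URLs are placeholders, and the fake fetcher stands in for a real download function so that the example runs offline.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real download function (no network access)."""
    return f"<html>content of {url}</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(list(pages))
```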

## Expected Outcome
Gain hands-on experience in web scraping, data analysis, and presenting findings in a business context. This project will also enhance your understanding of online retail market dynamics.


In [None]:
# Your code here

## 2. Using APIs for Data Retrieval

### Understanding the Big Picture

When aiming to retrieve specific data from a website, it's crucial to first check if the website offers a programmatic interface for querying data. Such interfaces, known as Application Programming Interfaces (APIs), provide a more efficient and structured way of accessing data compared to web scraping.

### The Advantage of APIs

APIs, particularly RESTful APIs, offer a well-defined method for interacting with web services. They are built on a set of rules and standards that allow for predictable and straightforward data retrieval. Here's what characterizes a RESTful API:

- **Base URI**: Every RESTful API has a base Uniform Resource Identifier (URI), which serves as the entry point for the API. For example, `http://example.com/resources/` could be a base URI.

- **Internet Media Type**: RESTful APIs often return data in a specific format, such as JSON (JavaScript Object Notation), which is widely used due to its simplicity and readability. However, other formats like XML, Atom, or even images can be used.

- **Standard HTTP Methods**: These APIs leverage standard HTTP methods for operations:
    - `GET`: Retrieve data from the server (e.g., a list of products).
    - `PUT`: Update existing data, or create it if it doesn't exist. `PUT` is idempotent: repeating the same request leaves the server in the same state.
    - `POST`: Create new data or update existing data. `POST` is not idempotent: repeating the request may, for example, create duplicate entries.
    - `DELETE`: Remove data.

- **Hypertext Links for State and Navigation**: RESTful APIs often use hypertext links (URLs) to represent the current state of an application or to navigate between related resources.
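
To make these characteristics concrete, the sketch below builds (but does not send) requests against the example base URI mentioned above. The `products` resource, the query parameter, and the JSON payload are invented for illustration, and `build_request` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request

BASE_URI = "http://example.com/resources/"  # example base URI from the text


def build_request(method, resource="", params=None, payload=None):
    """Build a Request for a resource under BASE_URI, with an optional JSON body."""
    url = urljoin(BASE_URI, resource)
    if params:
        url += "?" + urlencode(params)
    headers = {"Accept": "application/json"}  # ask for JSON as the media type
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


list_req = build_request("GET", "products", params={"limit": 5})
update_req = build_request("PUT", "products/7", payload={"price": 9.5})
delete_req = build_request("DELETE", "products/7")

for req in (list_req, update_req, delete_req):
    print(req.get_method(), req.full_url)
```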

### Using APIs with Authentication

Many RESTful APIs require authentication for security reasons. This is typically done by sending a token or key with your API request, which verifies your identity and authorizes your access to the API. The process of obtaining and using authentication tokens varies between APIs, so it's essential to refer to the specific API's documentation for guidance.

### Summary

Leveraging APIs for data retrieval not only aligns with ethical web practices but also ensures a more stable and efficient way of accessing data. When an API is available, it's usually the preferred method over web scraping.

## Example: Fetching Weather Data Using OpenWeatherMap API in Python

This example demonstrates how to use the OpenWeatherMap API to fetch current weather data for a specific city using Python.

### Prerequisites
- An API key from OpenWeatherMap.
- Python's `requests` library installed. (Install via `pip install requests` if needed.)

### Steps to Follow
1. **Sign Up for OpenWeatherMap API**:
   - Register for an account at [OpenWeatherMap](https://openweathermap.org/api).
   - Obtain your free API key (note that there might be an activation delay).

2. **Python Script for Weather Data Retrieval**:
   - The script uses the `requests` library to make an API call.
   - Replace `'YOUR_API_KEY'` with your actual OpenWeatherMap API key.
   - Replace `'CITY_NAME'` with your desired city name.

In [None]:
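import requests

def get_weather(api_key, city):
    # Current-weather endpoint of the OpenWeatherMap API
    base_url = "https://api.openweathermap.org/data/2.5/weather"
    # Let requests build and URL-encode the query string (city names may contain spaces)
    params = {"appid": api_key, "q": city}
    # A timeout prevents the call from hanging indefinitely if the server does not answer
    response = requests.get(base_url, params=params, timeout=10)
    # Parse the JSON body into a Python dictionary (error responses are JSON as well)
    return response.json()

# Replace 'YOUR_API_KEY' with your actual API key and 'CITY_NAME' with your city
api_key = 'YOUR_API_KEY'
city_name = 'CITY_NAME'
weather_data = get_weather(api_key, city_name)

print(f"Weather in {city_name}:")
print(weather_data)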

### Expected Output
The script prints the current weather data as a Python dictionary parsed from the API's JSON response. It includes the temperature (in Kelvin unless a `units` parameter is specified), humidity, a weather description, etc.

### Note
- Ensure you replace `'YOUR_API_KEY'` and `'CITY_NAME'` with your actual API key and desired city.
- The OpenWeatherMap API provides data in various formats and details. You might want to explore their documentation for more specific use cases.
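
The returned dictionary is nested, so a small helper can pull out the fields of interest. The sketch below assumes the field names `name`, `main.temp`, `main.humidity` and `weather[0].description` used by the current-weather response; the sample values are made up, not real measurements, and `summarize_weather` is a hypothetical helper.

```python
def summarize_weather(data):
    """Extract a few fields from a current-weather response dictionary."""
    main = data.get("main", {})
    conditions = data.get("weather") or [{}]
    temp_kelvin = main.get("temp")
    return {
        "city": data.get("name"),
        # Temperatures are in Kelvin unless a `units` parameter was sent with the request
        "temp_celsius": None if temp_kelvin is None else round(temp_kelvin - 273.15, 1),
        "humidity_percent": main.get("humidity"),
        "description": conditions[0].get("description"),
    }


# Made-up sample in the assumed response shape (not real data)
sample_response = {
    "name": "Barcelona",
    "main": {"temp": 293.15, "humidity": 60},
    "weather": [{"description": "clear sky"}],
}

summary = summarize_weather(sample_response)
print(summary)
```

Using `.get()` throughout means that an error response (for example, one returned for an invalid API key) yields `None` values instead of raising a `KeyError`.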


# Challenge: Analyzing Instagram Hashtag Trends with Instaloader

## Objective
Leverage `Instaloader`, a Python library, to download posts associated with a specific hashtag on Instagram. Analyze the collected data to identify trends, popular content, and user engagement.

## Steps

### 1. Install Instaloader
- Ensure Python is installed on your system.
- Install `Instaloader` using pip: `pip install instaloader`


### 2. Data Collection
- Choose a hashtag relevant to a topic of interest (e.g., #nature, #travel, #food).
- Use `Instaloader` to download posts tagged with the chosen hashtag. Consider limitations like the number of posts to avoid overwhelming the API.

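```python
import itertools

import instaloader

# Create an Instaloader instance and look up the posts for the chosen hashtag
L = instaloader.Instaloader()
posts = instaloader.Hashtag.from_name(L.context, 'YOUR_HASHTAG').get_posts()

# Limit the number of posts to keep the request volume low
for post in itertools.islice(posts, 50):
    # Add code to process and store post details, for example:
    print(post.shortcode, post.likes, post.comments)
```

### 3. Data Analysis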
Analyze the downloaded data for:
- Popular trends in the hashtag.
- Common themes or subjects in images or captions.
- Levels of user engagement (likes, comments).
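
Once post details are stored, a first pass at this analysis needs only the standard library. The sketch below assumes each post was saved as a dictionary with `caption`, `likes` and `comments` keys; that storage format, the sample posts, and both helper names are assumptions made for illustration.

```python
import re
from collections import Counter

# Made-up posts in an assumed storage format (not real Instagram data)
sample_posts = [
    {"caption": "Sunrise hike #nature #travel", "likes": 120, "comments": 8},
    {"caption": "Street food tour #food #travel", "likes": 95, "comments": 12},
    {"caption": None, "likes": 10, "comments": 0},
]


def top_hashtags(posts, n=3):
    """Return the n most common hashtags found in the captions."""
    counts = Counter()
    for post in posts:
        caption = post.get("caption") or ""  # captions may be missing
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", caption))
    return counts.most_common(n)


def average_engagement(posts):
    """Return the mean of likes plus comments per post."""
    if not posts:
        return 0.0
    return sum(post["likes"] + post["comments"] for post in posts) / len(posts)


print(top_hashtags(sample_posts, n=1))
print(round(average_engagement(sample_posts), 2))
```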

### 4. Reporting
- Compile your findings into a report.
- Include visual representations (graphs, word clouds) to illustrate key trends.

### Important Notes
- Respect Instagram's terms of service and ethical guidelines in data scraping.
- Be mindful of privacy and consent, especially with user-generated content.
- The scope of data collection should be limited for educational purposes.

### Expected Outcome
This challenge aims to provide practical experience with Instaloader, develop data analysis skills, and offer insights into social media trends and user behavior.

In [None]:
# your code here