## Python for Data Science, AI & Development

### APIs and Data Collection

---

Some status code examples are shown in the table below; the prefix (first digit) indicates the class. The classes are shown in yellow, with the actual status codes shown in white. See the <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01">MDN reference</a> for more descriptions.

<div class="alert alert-block alert-info" style="margin-top: 20px">
         <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/images/status_code.png" width="300" align="center">
</div>
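
As a rough illustration of how the first digit determines the class, here is a minimal sketch that uses only the standard library. The `status_class` helper and the `CLASSES` mapping are hypothetical names introduced for this example; `HTTPStatus` comes from Python's `http` module.

```python
from http import HTTPStatus

# Hypothetical mapping from the first digit of a status code to its class.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class name for a status code based on its first digit."""
    return CLASSES.get(code // 100, "unknown")


# Print the standard reason phrase next to the derived class.
for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```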

### Libraries 

We need the following libraries:

- **BeautifulSoup**
    - BeautifulSoup is a Python library for web scraping that pulls data out of HTML and XML files. It builds a parse tree from the page source, which can be used to extract data in a hierarchical and more readable manner.

- **Scrapy**
    - Scrapy is an open-source, collaborative web crawling framework for Python. It is used to extract data from websites.

- **Selenium**
    - Selenium is a tool for controlling web browsers programmatically and automating browser tasks.

In [2]:
import pandas as pd
import numpy as np
import scrapy # For web scraping
import requests # To make HTTP requests
from bs4 import BeautifulSoup # To parse HTML content
from selenium import webdriver # For browser automation

---

### Fetching and parsing HTML

To start web scraping, you need to fetch the HTML content of a webpage and parse it using Beautiful Soup. Here's a step-by-step example:

In [24]:
def response_status(status: int) -> str:
    if 100 <= status < 200:
        return "informational"
    if 200 <= status < 300:
        return "parse - success"
    if 300 <= status < 400:
        return "redirect"
    if status in {429, 500, 502, 503, 504}:
        return "retry"
    if 400 <= status < 500:
        return "fail_fast"
    if status >= 500:
        return "retry_and_log"
    return "ignore"

In [25]:

# Specify the URL of the webpage you want to scrape
url = 'https://finance.yahoo.com/'

# To avoid being blocked, we can set a User-Agent header
headers = {
    "User-Agent": "JoseMelchor/1.0 (ITESO, academic use)"
}

# Send an HTTP GET request to the webpage (the timeout avoids waiting indefinitely)
response = requests.get(url, headers=headers, timeout=10)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])
print("\n---\n")
print(response_status(response.status_code))
print("\n---\n")
print(response.headers.get("Content-Type"))


<!doctype html>
<html lang="en-US" theme="auto" data-color-theme-enabled="true" data-color-scheme="auto" class="desktop neo-green dock-upscale">
    <head>
        <meta charset="utf-8" />
        <meta name="oath:guce:consent-host" content="guce.yahoo.com" />
        <link rel="preconnect" href="//s.yimg.com" crossorigin="anonymous"><link rel="preconnect" href="//geo.yahoo.com"/><link rel="preconnect" href="//query1.finance.yahoo.com"/><link rel="preconnect" href="//consent.cmp.oath.com"/><link

---

parse - success

---

text/html; charset=utf-8


The request returned status code 200, which falls in the success class, so the HTML content can be parsed.
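
The `response_status` function defined earlier labels some codes as `retry`. One possible way to act on that label is sketched below, assuming a hypothetical `fetch_with_retry` helper that receives a callable returning a status code. The server responses are simulated so the example runs without network access, and the backoff delay is set to zero for the demonstration.

```python
import time
from typing import Callable

# Same retryable codes as in response_status.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retry(fetch: Callable[[], int], attempts: int = 3,
                     base_delay: float = 0.0) -> int:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    status = fetch()
    for attempt in range(1, attempts):
        if status not in RETRYABLE:
            break
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        time.sleep(base_delay * 2 ** (attempt - 1))
        status = fetch()
    return status


# Simulated server: two 503 responses followed by a 200.
responses = iter([503, 503, 200])
print(fetch_with_retry(lambda: next(responses)))
```

In a real scraper, `fetch` could wrap `requests.get(...)` and return `response.status_code`, with a non-zero `base_delay` to avoid hammering the server.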

---

### Navigating the HTML structure

In [35]:
links = soup.find_all('a')
for link in links:
    print(link.text)

Skip to navigation  
Skip to main content  
Skip to right column  
News
Today's news
US
Politics
2025 Election
World
Weather
Climate change
Health
Wellness 
Mental health
Sexual health
Dermatology
Oral health
Hair loss
Foot health
Nutrition 
Healthy eating
Meal delivery
Weight loss
Vitamins and supplements
Fitness 
Equipment
Exercise
Women's health 
Sleep 
Healthy aging 
Hearing
Mobility
Science
Originals
The 360 
Newsletters
Games
Life
Health
Wellness 
Nutrition
Fitness
Healthy aging
Sleep
Your body 
Children's health
Dermatology
Foot health
Hair loss
Oral health
Sexual health
Weight management
Women's health
Conditions 
Cardiovascular health
Digestive health
Endocrine system
Hearing
Mental health
Parenting
Family health 
So mini ways 
Style and beauty
It Figures 
Unapologetically 
Horoscopes
Shopping
Style 
Accessories
Clothing
Luggage
Shoes
Beauty 
Fragrance
Hair
Makeup
Nails
Skincare
Sunscreen
Health 
Dental
Fitness
Hair loss
Hearing
Mental health
Mobility
Nutrition
Personal care
S
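
The loop above prints only the visible text of each anchor. A minimal sketch of collecting the link target as well is shown below. It assumes a small, hypothetical HTML fragment in place of the live page so that it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
html = """
<nav>
  <a href="/news">News</a>
  <a href="/markets"> Markets </a>
  <a>No link target</a>
</nav>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (text, href) pairs, skipping anchors that have no href attribute.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", href=True)
]
print(links)
```

Passing `href=True` to `find_all` keeps only tags that carry that attribute, and `get_text(strip=True)` removes the surrounding whitespace that `link.text` would keep.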

---

### Web Scraping 

#### Importance of Web Scraping in Data Science
In the field of data science, web scraping plays an integral role. It is used for various purposes such as:

- **Data Collection:** Web scraping is a primary method of collecting data from the internet. This data can be used for analysis, research, etc.

- **Real-time Application:** Web scraping is used for real-time applications like weather updates, price comparison, etc.

- **Machine Learning:** Web scraping provides the data needed to train machine learning models.

#### Applications of Web Scraping

Web scraping is used in various fields and has many applications:

- **Price Comparison:** Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.

- **Email address gathering:** Many companies that rely on email marketing use web scraping to collect email addresses and then send bulk emails.

- **Social Media Scraping:** Web scraping is used to collect data from social media websites such as Twitter to find out what's trending.

---

### Get a table from Wikipedia

In [None]:
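from io import StringIO # To pass the HTML text to pandas as a file-like object

URL = "https://es.wikipedia.org/wiki/Copa_M%C3%A9xico"

headers = {
    "User-Agent": "JoseMelchor/1.0 (ITESO, academic use)"
}

response = requests.get(URL, headers=headers, timeout=10)

# Passing a literal HTML string to read_html is deprecated, so wrap it in StringIO
tables = pd.read_html(StringIO(response.text)) # This will return a list of DataFrames

tables[2] # Display the third table on the page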



| | Edición | Campeón[23] | Resultado | Subcampeón | D.T. Campeón[24][25] | Sede final | Nota(s) |
|---|---|---|---|---|---|---|---|
| | Copa México | Copa México | Copa México | Copa México | Copa México | Copa México | Copa México |
| 0 | 1932-33 | C. Necaxa | 3-1 | Germania F. V. | Alfred C. Crowle | Parque Necaxa | Época amateur |
| 1 | 1933-34 | C. F. Asturias | 3-0 | C. Necaxa | --------- | Parque España | Época amateur |
| 2 | 1934-35 | No se disputó | No se disputó | No se disputó | No se disputó | No se disputó | Época amateur |
| 3 | 1935-36 | C. Necaxa | 2-1 | C. F. Asturias | Ernesto Pauler | Parque Asturias | Época amateur |
| 4 | 1936-37 | C. F. Asturias | 5-3 | C. América | | Parque Necaxa | Época amateur |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 69 | CLA. 2018 | C. Necaxa | 1-0 | C. D. Toluca | Ignacio Ambriz | Victoria | |
| 70 | APR. 2018 | Cruz Azul F. C. | 2-0 | C. F. Monterrey | Pedro Caixinha | BBVA Bancomer | |
| 71 | CLA. 2019 | C. América | 1-0 | F. C. Juárez | Miguel Herrera | Olímpico Benito Juárez | |
| 72 | 2019-20 | C. F. Monterrey | 1-0 / 1-1 | C. Tijuana | Antonio Mohamed | BBVA | |
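
The output above has a two-level header (the second level repeats "Copa México") and footnote markers such as `[23]` in the column names. One possible cleanup is sketched below. It rebuilds a small DataFrame from two rows and three columns of the output above rather than fetching the page, so the variable names are illustrative only.

```python
import pandas as pd

# Two rows copied from the output above, with the same two-level header.
columns = pd.MultiIndex.from_tuples([
    ("Edición", "Copa México"),
    ("Campeón[23]", "Copa México"),
    ("Resultado", "Copa México"),
])
df = pd.DataFrame(
    [["1932-33", "C. Necaxa", "3-1"], ["1933-34", "C. F. Asturias", "3-0"]],
    columns=columns,
)

# Keep only the first header level.
df.columns = df.columns.get_level_values(0)

# Strip footnote markers such as "[23]" from the column names.
df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True).str.strip()

print(df.columns.tolist())
```

The same two steps could be applied to `tables[2]`, assuming the page still returns the table in that position.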


---