
## INTRO

![Web Scraping](http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg)

Welcome to the first part of our journey into the world of web scraping. Web scraping, also known as web harvesting or web data extraction, is a technique used for extracting data from websites. This process involves fetching the web page and then extracting data from it.

### Why Learn Web Scraping?
Understanding how to scrape data from the web is a valuable skill for any data professional. In the digital era, data is the new gold, and web scraping is the mining equipment. Here's why it's essential:

- **Data Availability**: The internet is a vast source of data for all kinds of analyses, from market trends to academic research.
- **Automation**: Web scraping can automate the process of collecting data, saving time and effort.
- **Competitive Advantage**: In many fields, having timely and relevant data can be a game-changer.

### Real-World Applications
- **Market Research**: Analyzing competitors, understanding customer sentiments, and identifying market trends.
- **Price Comparison**: Aggregating pricing data from various websites for comparison shopping.
- **Social Media Analysis**: Gathering data from social networks for sentiment analysis or trend spotting.

### Ethical Considerations in Web Scraping

Web scraping, while a powerful technique for data extraction, comes with significant ethical and legal responsibilities. As budding data scientists and web scrapers, it's crucial to navigate this landscape with a deep understanding and respect for these considerations.

### Respecting Website Policies and Laws

- **Adhering to Terms of Service**: Every website has its own set of rules, usually outlined in its Terms of Service (ToS). It's important to read and understand these rules before scraping, as violating them can have legal implications.

- **Following Copyright Laws**: The data you scrape is often copyrighted. Ensure that your use of scraped data complies with copyright laws and respects intellectual property rights.

- **Privacy Concerns**: Be mindful of personal data. Scraping and using personal information without consent can breach privacy laws and ethical standards.

### Example: Understanding Google's `robots.txt`

Google's `robots.txt` file is a good example of how websites communicate their crawling policies. Accessible at [Google's robots.txt](https://www.google.com/robots.txt), this file gives web crawlers directives about which paths they may or may not request.

#### Implications of Google's `robots.txt`

- **Selective Access**: Google allows certain parts of its site to be crawled while restricting others. For instance, crawling the search results pages is generally disallowed.

- **Dynamic Nature**: The content of `robots.txt` files can change, reflecting the website's evolving stance on web scraping. Regular checks are necessary for compliance.

- **Respecting the Limits**: Even if a `robots.txt` file allows scraping of some pages, it does not automatically mean all scraping activities are legally or ethically acceptable. It's a guideline, not a blanket permission.
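
Checking a URL against `robots.txt` rules can be automated with Python's standard-library `urllib.robotparser`. The sketch below parses a small, made-up rule set (not Google's actual file) so that it runs offline; the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; a real file would be loaded with
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
rules = """\
User-agent: *
Disallow: /search
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

search_allowed = parser.can_fetch("*", "https://example.com/search?q=python")
about_allowed = parser.can_fetch("*", "https://example.com/about")

print(search_allowed)  # False: the path starts with the disallowed /search prefix
print(about_allowed)   # True: no Disallow rule matches /about
```

A `True` result only reflects the file's directives; it does not by itself settle the legal or ethical questions discussed in this section.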

### 1. Introduction to Data Hunting in the Digital Age

#### The Evolution of Data Sourcing

In this course, we focus on data as our foundational element. Traditionally, data has been sourced from structured formats like spreadsheets from scientific experiments or records in relational databases within organizations. But with the digital revolution, particularly the advent of the internet, our approach to data collection must evolve. The internet is a vast reservoir of unstructured data, presenting both challenges and opportunities for data retrieval and analysis.

#### Understanding the Landscape of Web Data

When seeking data from the internet, it's essential to first consider how the website in question provides access to its data. Many large-scale websites like Google, Facebook, and Twitter offer an **Application Programming Interface (API)**. APIs are designed to facilitate easy access to a website's data in a structured format, simplifying the process of data extraction.

##### The Role of APIs

- **APIs as a Primary Tool**: An API acts as a bridge between the data seeker and the website's database, allowing for streamlined data retrieval.
- **Limitations**: However, not all websites provide an API. Additionally, even when an API is available, it may not grant access to all the data a user might need.

##### The Need for Web Scraping

In cases where an API is absent or insufficient, we turn to **web scraping**. Web scraping involves extracting raw data directly from a website's frontend - essentially, the same information presented to users in their web browsers.

###### Diving into Scraping

- **Dealing with Unstructured Data**: Scraping requires us to interact with unstructured data, necessitating custom coding and data parsing techniques.
- **Legal and Ethical Considerations**: It's crucial to approach web scraping with an awareness of the legal and ethical implications, respecting website policies and user privacy.

## Starting Our Journey

Our first practical step in this journey will be to explore how to connect to the internet and retrieve a basic webpage. We'll begin by using Python's `urllib.request` module, a powerful tool for interacting with URLs and handling web requests.

Join us as we embark on this exciting journey to master the art of data hunting in the digital era, where we'll navigate the complexities of APIs, web scraping, and the ethical considerations that come with them.

In [None]:
# Import the 'urlopen' function from the 'urllib.request' module.
# This function is used for opening URLs, which is the first step in web scraping.
from urllib.request import urlopen

# Use the 'urlopen' function to open the URL 'http://www.google.com/'.
# The function returns a response object which can be used to read the content of the page.
# Here, 'source' is a variable that holds the response object from the URL.
source = urlopen("http://www.google.com/")

# Print the response object.
# This command does not print the content of the webpage.
# Instead, it prints a representation of the response object,
# which includes information like the URL, HTTP response status, headers, etc.
print(source)

<http.client.HTTPResponse object at 0x79f697026a40>


## Exploring the Content Retrieved by `urlopen`

This code snippet demonstrates the basic usage of the `urlopen` function for accessing a webpage. However, it is important to note that `print(source)` will not display the HTML content of the webpage but rather the HTTP response object's representation. To view the actual content of the page, you would need to read from the `source` object using methods like `source.read()`.

After opening a URL using the `urlopen` function from the `urllib.request` module, we typically want to access the actual content of the webpage. This is where `source.read()` comes into play.

### Understanding `source.read()`

When you call `urlopen`, it returns an HTTPResponse object. This object, which we've named `source` in our example, holds various data and metadata about the webpage. To extract the actual HTML content of the page, we use the `read` method on this object.

### What Does `source.read()` Do?

- **Retrieves Webpage Content**: `source.read()` reads the entire content of the webpage to which the URL points. This content is usually in HTML format, which is the standard language for creating webpages.

- **Binary Format**: `read()` returns a `bytes` object, not a string. To work with it as text in Python, decode it, for example with `.decode('utf-8')`.

- **One-time Operation**: The content of a response can be read only once. After `source.read()` has consumed the body, further calls return an empty bytes object (`b''`). To access the content again, store the result in a variable or reopen the URL.

Here's a simple example to illustrate this:

In [None]:
# Let us check what the response contains
something = source.read()
print(something)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp, " name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="evFP8Fx28ryHN3fOMU0Csw">(function(){var _g={kEI:\'9U_3aPjpGKjewN4P1ZHdmQM\',kEXPI:\'0,18168,184686,2,3947057,90132,48791,46127,344796,238459,5281877,12216,36811816,25228681,123988,28395,4558,52567,5382,2663,3431,27198,9139,4599,328,6225,1116,49460,3614,9975,15049,8210,3286,4134,30380,28333,42889,5407,5917,352,10731,7739,410,5870,3856,3851,7,5773,24950,1,3,2658,4719,11805,7580,2126,2863,1,2491,2,3477,4649,2361,7395,2352,5683,3604,594,6734,10437,7,1,2729,12082,4105,364,15320,5
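
The `b'...'` prefix in the output above shows that `read()` returned bytes. The following sketch illustrates decoding and the read-once behavior without touching the network: an `io.BytesIO` object (an assumption made for the example) stands in for the response returned by `urlopen`, since both are file-like.

```python
import io

# Stand-in for an HTTP response body; io.BytesIO is file-like, so read()
# behaves here the way it does on the object returned by urlopen.
fake_response = io.BytesIO("<title>Café</title>".encode("utf-8"))

raw = fake_response.read()      # bytes, as with source.read()
text = raw.decode("utf-8")      # str, ready for string operations
second = fake_response.read()   # the stream is already exhausted

print(type(raw).__name__, type(text).__name__)  # bytes str
print("Café" in text)                           # True
print(second)                                   # b''
```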

## DEMO

Let's get hands-on with some initial exercises to warm up with web scraping!

### Exercises

1. **Python.org Content Check**: Does [https://www.python.org](https://www.python.org) contain the word `Python`?  
   _Hint: You can use the `in` keyword to check._

2. **Google.com Image Search**: Does [http://google.com](http://google.com) contain an image?  
   _Hint: Look for the `<img>` tag._

3. **First Characters of Python.org**: What are the first ten characters of [https://www.python.org](https://www.python.org)?

4. **Keyword Check in Pyladies.com**: Is there the word 'python' in [https://pyladies.com](https://pyladies.com)?

In [None]:
# EX1: Check if 'Python' is in the content of http://www.python.org/

# Import the urlopen function from the urllib.request module
# This function is used to open a URL and retrieve its contents
from urllib.request import urlopen
import gzip # Import the gzip module for decompression
import io   # Import the io module for working with binary data

# Use the urlopen function to access the webpage at http://www.python.org/
# The function returns an HTTPResponse object which is stored in the variable 'source'
source = urlopen("http://www.python.org/")

# Read the content of the response object
content = source.read()

# Check for Content-Encoding header and decompress if necessary
if source.info().get('Content-Encoding') == 'gzip':
    # Wrap the content in a BytesIO object to make it file-like
    buf = io.BytesIO(content)
    # Decompress the content
    f = gzip.GzipFile(fileobj=buf)
    # Read the decompressed content
    content = f.read()

# Decode the content to a string using 'utf-8'. Use errors='ignore' or 'replace'
# if there are still occasional decoding issues with valid characters.
# You could also try to infer the encoding from the headers or HTML meta tags,
# but utf-8 is common.
something = content.decode('utf-8')

# Check if the word "Python" is in the decoded string
# This is done using the 'in' keyword, which checks for the presence of a substring in a string
# The result is a boolean value: True if "Python" is found, False otherwise
print("Python" in something)

# Note: 'utf-8' is the most common encoding for webpages, which is why it is used above.
# Other codecs such as 'latin-1' decode any byte sequence without raising an error,
# but they can silently garble non-ASCII characters if the page is not in that encoding.

True
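
The comments in the cell above mention inferring the encoding from the response headers instead of assuming UTF-8. A minimal sketch of that idea follows. With `urlopen`, `source.headers` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so `get_content_charset()` is available on it. To keep the example offline, a hand-built `Message` with a made-up `Content-Type` value stands in for real headers, and `decode_body` is a hypothetical helper name.

```python
from email.message import Message

def decode_body(body, headers, default="utf-8"):
    """Decode bytes using the charset declared in Content-Type, if any."""
    charset = headers.get_content_charset() or default
    return body.decode(charset, errors="replace")

# Hypothetical headers and body standing in for source.headers / source.read().
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
body = "café".encode("latin-1")

text = decode_body(body, headers)
print(text)  # café

# With no Content-Type header, the helper falls back to the default codec.
fallback = decode_body(b"abc", Message())
```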


In [None]:
print(something)

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <script defer data-domain="python.org" src="https://analytics.python.org/js/script.outbound-links.js"></script>

    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
    <meta name="apple-mobile-web-app-title" content="Python.org">
    <meta name="apple-mobile-web-

## Definitions: Request, Crawling and Scraping

### Using `urlopen` vs. `Request` in Web Scraping

When performing web scraping tasks in Python, you have the option to use either the `urlopen` function from the `urllib.request` module or the `Request` object in combination with `urlopen`. Here, we'll explain why you might choose one approach over the other.

### Using `urlopen` Directly

**Advantages**:

- **Simplicity**: It's a straightforward way to access a webpage and retrieve its content without the need for additional objects or customization.
  
- **Default Behavior**: `urlopen` uses default settings for the HTTP request, which is suitable for many common use cases.

- **Convenience**: For simple web scraping tasks, it provides a concise and readable solution.

### Using `Request` with `urlopen`

**Advantages**:

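- **Customization**: A `Request` object lets you set custom headers, attach a request body, and choose the HTTP method (e.g., POST, PUT). Timeouts are passed to `urlopen` itself through its `timeout` argument, while cookie and redirect handling is configured with opener objects built by `urllib.request.build_opener`.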

- **Fine-Grained Control**: It offers greater flexibility for handling complex scenarios.

In summary, the choice between using `urlopen` directly and creating a `Request` object depends on the complexity of your web scraping task. For simple tasks like fetching webpage content, `urlopen` is often sufficient and more straightforward. However, if you need to customize headers, use non-GET HTTP methods, or handle advanced scenarios, creating a `Request` object allows for fine-grained control over your HTTP requests.
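
To make the difference concrete, here is a small sketch of a customized `Request`. The URL, request body, and User-Agent string are placeholders chosen for illustration. Building the object does not contact the server; that only happens once it is passed to `urlopen`.

```python
from urllib.request import Request

# Placeholder URL and User-Agent; nothing is sent until the request is
# passed to urlopen, e.g. urlopen(req, timeout=10).
req = Request(
    "https://www.example.com/api",
    data=b"query=python",
    headers={"User-Agent": "my-scraper/0.1 (learning project)"},
    method="POST",
)

print(req.get_method())              # POST
print(req.full_url)                  # https://www.example.com/api
print(req.get_header("User-agent"))  # Request stores header names capitalized
```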


### Crawling and Scraping: Unveiling the Web's Secrets

Crawling and scraping are two fundamental techniques in the world of web data acquisition. They form the backbone of many data-driven applications and are crucial skills for data analysts and web developers.

### Crawling: Navigating the Web

Crawling, also called web crawling or spidering, is the process of systematically navigating the World Wide Web to retrieve web pages. Think of a crawler as a web robot or spider that follows links from page to page to discover and index web content. This technique is at the heart of search engines like Google and Bing.

### Why Do We Crawl?

Crawling serves several important purposes:

- **Indexing**: It allows search engines to index and catalog web pages, making them searchable by users.
  
- **Link Discovery**: Crawlers extract links from web pages, helping build a vast network of interconnected web resources. This link structure is crucial for understanding the web's architecture.
  
- **Data Retrieval**: Crawlers may scrape or extract data from web pages, but their primary goal is to discover and navigate to other web pages.

### Scraping: Harvesting Data

Scraping is the process of extracting specific data or information from a single web page. Unlike crawling, which focuses on navigating the web, scraping zooms in on a single webpage to harvest valuable data.

### Use Cases of Scraping

Scraping is used for a variety of purposes, such as:

- **Data Extraction**: It allows us to extract structured data like product prices, news headlines, or stock market information from websites.

- **Content Monitoring**: Scraping can be employed to track changes in content on specific web pages, such as monitoring price changes on e-commerce sites or tracking news updates.

- **Competitor Analysis**: Businesses often use scraping to gather data on competitors, such as pricing strategies or product listings.

- **Research and Analysis**: Data analysts and researchers use scraping to collect data for studies, reports, and data-driven insights.

### Crawling and Scraping Synergy

In practice, crawling and scraping often work together. Crawlers traverse the web to find new pages, and once they reach a page of interest, scraping techniques are applied to extract valuable data. This synergy is what powers search engines, news aggregators, and data-driven applications on the internet.
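
The interplay described above can be sketched with the standard library alone. The `PAGES` dictionary below is a made-up three-page site held in memory (an assumption so that the example needs no network access). The crawler follows links breadth-first, and on each page a scraping step extracts the title.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML.
PAGES = {
    "/": '<title>Home</title><a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}

class PageParser(HTMLParser):
    """Collects outgoing links (for crawling) and the title (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start):
    """Visit every page reachable from `start` once and return its title."""
    seen = {start}
    queue = deque([start])
    titles = {}
    while queue:
        path = queue.popleft()
        parser = PageParser()
        parser.feed(PAGES[path])
        parser.close()
        titles[path] = parser.title          # scraping step
        for link in parser.links:            # crawling step
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles

titles = crawl("/")
print(titles)
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.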

### Conclusion

Understanding the concepts of crawling and scraping is essential for anyone looking to work with web data. Whether you want to build a search engine, gather market research, or simply automate data collection, these techniques are your gateway to unlocking the wealth of information available on the web.

## Requests vs urllib

### urllib

In [None]:
import urllib.request

# Define the URL to scrape
url = 'https://www.pyladies.com'

# Set up the request with a custom user-agent header
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

# Open the URL and retrieve the HTML content
con = urllib.request.urlopen(req)
html = con.read().decode()

# Check if 'Python' is in the HTML content
print('Python' in html)


True


### Requests

In [None]:
# the main library you will need for web scraping is called Beautiful Soup
from bs4 import BeautifulSoup
# the second package we need, requests, is one we already know
import requests


url = "https://en.wikipedia.org/wiki/Marie_Curie"

response = requests.get(url)
response

<Response [403]>

![HTTPStatus](https://www.whatismyip.com/static/51e6afd43d8a39f7a6e03805c1328e11/https-codes.webp)

In [None]:
# Explore the response's attributes and methods (try tab completion on `response.`)
response.content

# The raw bytes are not very easy to analyze...

b'Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.\n'
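
The 403 status and the message above indicate that the server rejected the request because no User-Agent header was set. A minimal sketch of how such a header can be supplied with `requests` follows. The User-Agent string and the `fetch` helper are placeholders invented for this example, and the request is only prepared here, not sent, so the block runs offline.

```python
import requests

# Placeholder identification string; use one that describes your own project.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: you@example.com)"}

def fetch(url, timeout=10):
    """Send a GET request that identifies itself; raise on 4xx/5xx statuses."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response

# Build (but do not send) the request to inspect what would go over the wire.
prepared = requests.Request(
    "GET", "https://en.wikipedia.org/wiki/Marie_Curie", headers=HEADERS
).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `fetch(url)` in place of `requests.get(url)` would send the header with the request; whether the server then accepts it also depends on following its robot policy.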

### Beautiful Soup

In [None]:
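# turning the response into a Beautiful Soup object; naming the parser explicitly
# (lxml here) avoids the warning bs4 emits when it has to guess one
soup = BeautifulSoup(response.content, 'lxml')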
# prettify the soup to then copy it to a text editor and study its structure
print(soup.prettify())

<html>
 <body>
  <p>
   Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
  </p>
 </body>
</html>



In [None]:
"Marie Curie" in soup.text

False

In [None]:
"Marie Curie" in soup.prettify()

False

## BREAK: Html

### 1. "Making Your Own API": Web Scraping

### Understanding Web Scraping
Web scraping becomes essential when data is available on the web but isn't accessible through an API, or the existing API lacks certain functionalities or has restrictive terms of service. In such scenarios, **Web Scraping** is the technique that enables automated extraction of this data, replicating the access a human would have visually.

### Why Web Scraping?
- **Data Accessibility**: Sometimes, the only way to access certain data is directly from the web pages where it is displayed.
- **Flexibility**: Web scraping allows you to tailor data extraction to specific needs, bypassing limitations of existing APIs.

### Preparing for Web Scraping: Understanding Web Page Structure
Before delving into scraping, it's crucial to have a basic understanding of web page structure and how data is stored and presented. This session covers:

#### Basic HTML and CSS Static Pages
- **HTML (HyperText Markup Language)**: The standard markup language used to create web pages. Understanding HTML is key to identifying the data you want to scrape.
- **CSS (Cascading Style Sheets)**: Used for describing the presentation of a document written in HTML. Knowing CSS helps in pinpointing specific elements on a page.

#### Dynamic HTML
- **Basic JavaScript Example Using JQuery**: Websites often use JavaScript to load data dynamically. Understanding how this works is crucial for scraping data from such dynamic pages.

### Understanding the Foundation of Web Pages

The most fundamental web pages are constructed using HTML and CSS. These technologies serve two primary purposes: **HTML (Hypertext Markup Language)** structures and stores the content, making it the primary target for web scraping, while **CSS (Cascading Style Sheets)** formats and styles the content, highlighting visual elements like fonts, colors, borders, and layout.

#### HTML: The Structure of the Web
HTML is a markup language typically rendered by web browsers. It uses 'tags' to define elements on a web page. A typical tag format includes a tag name, attributes (if any), and the content between opening and closing tags.

#### Key Components of an HTML File

- **DOCTYPE Declaration**:
  - Begins with `<!DOCTYPE html>`, indicating the use of HTML5.
  - Earlier HTML versions had different DOCTYPEs.

- **HTML Tag**:
  - The `html` tag (and its closing `/html` tag) encloses the entire web page content.

- **Head and Body**:
  - The `head` section often includes the `title` tag, defining the webpage's name, links to CSS stylesheets, and JavaScript files for dynamic behavior.
  - The `body` contains the visible webpage content.

- **Common HTML Elements**:
  - **Headings and Paragraphs**: Use `h#` (where # is a number) for headings and `p` for paragraphs.
  - **Hyperlinks**: Defined with the `href` attribute in `a` (anchor) tags.
  - **Images**: Embedded using `img` tags with the `src` attribute. Note: `img` is a void element, so it has no closing tag.



## Back to Requests and Beautiful Soup

### Titles, Paragraphs and Tables

In [None]:
# now that we have the html code inside a soup object -> we can explore its attributes
# I can call the title tag of the webpage -> this brings the tag and the content
soup.title

In [None]:
# imagine you only wanted the content
soup.title.string

AttributeError: 'NoneType' object has no attribute 'string'
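
The `AttributeError` above arises because `soup.title` is `None` when the parsed document has no `<title>` tag, which is the case for a short error page. The sketch below shows one way to guard against that. Both HTML strings are invented for illustration, and `page_title` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up documents: one without a <title>, one with.
error_page = "<html><body><p>Please set a user-agent.</p></body></html>"
full_page = (
    "<html><head><title>Marie Curie - Wikipedia</title></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

def page_title(html):
    """Return the title text, or None when the page has no <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title is not None else None

missing = page_title(error_page)
found = page_title(full_page)
print(missing)  # None
print(found)    # Marie Curie - Wikipedia

# find_all returns every matching tag, not just the first one.
soup = BeautifulSoup(full_page, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```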

In [None]:
# imagine I want paragraphs (p tag)
soup.p
# soup.p returns only the first p tag, but there are usually many; find_all returns all of them

paragraphs = soup.find_all('p')
paragraphs

for element in paragraphs:
  print(element.text)

In [None]:
# you can search by tag and also by attributes, such as the class name
tables = soup.find_all('table', attrs= {'class' : 'infobox biography vcard'})

# this is very helpful for identifying boxes that share the same CSS styling, since they share a class attribute

# take the first matching table
table = tables[0]
table

In [None]:

# inside the first level of my table, there are still many many tags
# you can find more tags within your table

# the table is a Tag object -> it supports the same search methods as the soup
for line in table.find_all('li'):
  print(line.text)
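
Searching by class can be tried on a small inline document before running it against a real page. The HTML below is invented for illustration and only loosely mimics an infobox. It contrasts matching the exact class string with matching a single class through the `class_` keyword.

```python
from bs4 import BeautifulSoup

# Made-up markup: one infobox-style table and one unrelated table.
html = """
<table class="infobox biography vcard">
  <tr><th>Fields</th><td><ul><li>Physics</li><li>Chemistry</li></ul></td></tr>
</table>
<table class="wikitable">
  <tr><td><ul><li>Unrelated</li></ul></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class attribute string.
exact = soup.find_all("table", attrs={"class": "infobox biography vcard"})
# Match any table that carries the class "infobox", whatever else it has.
by_single_class = soup.find_all("table", class_="infobox")

# A Tag supports the same search methods as the soup itself.
items = [li.get_text(strip=True) for li in exact[0].find_all("li")]
print(len(exact), len(by_single_class))
print(items)
```

Matching on a single class is usually more robust, because the exact-string form stops matching as soon as the page adds or reorders a class name.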


In [None]:
# do it yourself:
# find all the bio field category names for Marie Curie



### Web Scraping Exercise: Extracting News Headlines from BBC Technology

#### Objective
Write a Python script to scrape headlines from BBC's Technology news section and categorize them based on keywords.

#### Task Details

1. **Website to Scrape**:
   - Target the BBC's 'Technology' section: [BBC Technology News](https://www.bbc.co.uk/news/technology).

2. **Scraping Requirement**:
   - Scrape the main headlines from the page, typically found in `h3` tags or a specific class.

3. **Categorization**:
   - Categorize the headlines based on predefined keywords like 'Apple', 'Microsoft', 'Google', etc.
   - Count the number of headlines that fall into each category.

4. **Output**:
   - Print each headline along with its respective category.
   - Summarize with the count of headlines in each category.

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC technology news section
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Define categories and associated keywords
categories = {
    'Apple': ['Apple', 'iPhone', 'iPad'],
    'Microsoft': ['Microsoft', 'Windows', 'Bill Gates'],
    'Google': ['Google', 'Android', 'Alphabet']
    # Add more categories as needed
}

# Function to determine the category of a headline
def categorize_headline(headline):
    # Logic to determine the category based on keywords
    # Return the category name if a keyword is found, else return 'Other'
    pass

# Scrape and process the headlines
# Look for 'h3' tags or other relevant tags
# Use the categorize_headline function to categorize each headline
# Print each headline and its category

# Print the count of headlines in each category
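
One possible way to fill in `categorize_headline` is sketched below. The sample headlines are invented for illustration, not scraped from the BBC, and matching whole words case-insensitively is a design assumption rather than part of the exercise statement.

```python
import re
from collections import Counter

categories = {
    "Apple": ["Apple", "iPhone", "iPad"],
    "Microsoft": ["Microsoft", "Windows", "Bill Gates"],
    "Google": ["Google", "Android", "Alphabet"],
}

def categorize_headline(headline):
    """Return the first category with a keyword in the headline, else 'Other'."""
    for category, keywords in categories.items():
        for keyword in keywords:
            # \b restricts matches to whole words, so "Pineapple" is not "Apple".
            pattern = rf"\b{re.escape(keyword)}\b"
            if re.search(pattern, headline, flags=re.IGNORECASE):
                return category
    return "Other"

# Hypothetical headlines standing in for the scraped ones.
sample_headlines = [
    "Apple unveils new iPad",
    "Windows update delayed",
    "Android gains new privacy controls",
    "Chip shortage eases",
]

counts = Counter()
for headline in sample_headlines:
    category = categorize_headline(headline)
    counts[category] += 1
    print(f"{category}: {headline}")

print(dict(counts))
```

A headline that mentions several companies is assigned to the first matching category in dictionary order; returning a list of all matches would be a reasonable alternative.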