In this notebook, we'll cover basic web scraping and data extraction techniques in Python. Specifically, we'll target the SVG badges used in GitHub README files.

External image links and badges can become inaccessible, for example when the service that hosts them changes or goes down. Hosting the files in our own repository keeps them available, so we'll scrape the badges from an external `README` and download them for internal hosting.

Let's get started!


**Import Necessary Libraries**

- `os`: This module provides a portable way of using operating system-dependent functionality like reading or writing to the file system.
- `requests`: Allows us to send HTTP requests using Python. We'll use this to fetch the content of web pages.
- `BeautifulSoup`: A library (imported from the `bs4` package) for pulling data out of HTML and XML files, which makes it well suited to web scraping.

In [43]:
import os
import requests
from bs4 import BeautifulSoup

**Load the README file**

Here, we'll read the content of the `README_external_badges.md` file, which contains the badges we aim to scrape and download.

In [54]:
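# Read the whole README into one string; the context manager closes the file for us.
# Note: `soup` holds the raw text here; it is parsed in the next step.
with open("README_external_badges.md", mode='r', encoding='utf8') as readme:
    soup = readme.read()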
# print(soup)

**Parse the README Content**

Transform the raw README content into a structured `BeautifulSoup` object to make extraction tasks easier.


In [55]:
html = BeautifulSoup(soup, 'html.parser')
html

<!DOCTYPE html>

<html>
<head>
</head>
<body>
<h1>Welcome to Our GitHub Profile! 👋</h1>
<div align="center">
<img src="https://user-images.githubusercontent.com/74038190/235224431-e8c8c12e-6826-47f1-89fb-2ddad83b3abf.gif" width="300"/>
</div>
<h2>Who We Are</h2>
<p>Hello! We are <strong>Hit the Code Labs</strong>, a tech company specializing in data science and process automation solutions. We pride ourselves on delivering cutting-edge solutions that span from academia to industry, utilizing the latest technologies and methodologies. 🛠️📊</p>
<h2>Expertise</h2>
<h3>Academic Area</h3>
<ul>
<li><strong>Thesis &amp; Papers</strong>: Prolific in contributing academic content and research.</li>
<li><strong>Artificial Intelligence Projects</strong>: Specializing in both undergraduate and postgraduate AI implementations.</li>
<li><strong>Computer Vision in Medicine</strong>: Leveraging computer vision technologies for medical applications.</li>
<li><strong>Blockchain</strong>: Focused on suppl

**Find All Image Tags**

Extract all image (`<img>`) tags from the parsed content. These tags represent the badges or any other images in the README.


In [56]:
img_tags = html.find_all("img")
img_tags[0]  # Show the first <img> tag in the list.

<img src="https://user-images.githubusercontent.com/74038190/235224431-e8c8c12e-6826-47f1-89fb-2ddad83b3abf.gif" width="300"/>
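
`find_all("img")` only sees HTML `<img>` tags. If a README also embeds images with Markdown syntax (`![alt](url)`), those would be missed. The sketch below shows one possible way to collect such URLs with a regular expression. The helper name `markdown_image_urls` and the sample text are hypothetical, and the pattern is deliberately simple: it assumes inline images only and URLs without spaces or parentheses.

```python
import re

# Matches inline Markdown images such as ![alt](https://example.com/a.svg "title").
# Assumption: the URL contains no whitespace or parentheses.
MD_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^\s()]+)[^)]*\)')


def markdown_image_urls(text):
    """Return the URLs of inline Markdown images found in text (hypothetical helper)."""
    return MD_IMAGE.findall(text)


# Hypothetical sample mixing Markdown images with an HTML <img> tag.
sample = (
    "# Demo\n"
    "![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge)\n"
    '<img src="https://img.shields.io/badge/Kaggle-035a7d.svg"/>\n'
    '![R](https://img.shields.io/badge/r-%23276DC3.svg "R badge")\n'
)

for url in markdown_image_urls(sample):
    print(url)
```

The HTML tag in the sample is intentionally ignored here, since `BeautifulSoup` already handles it; the two result lists could be concatenated before filtering.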

**Filter and List URLs for Download**

Loop over the image tags, skip any GIFs, strip the query string from each remaining URL, and collect the URLs of the badges we want to download.


In [57]:
to_download = []

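for img in img_tags:
    link = img.get('src')  # .get() returns None instead of raising if src is missing.
    # print(link)  # Uncomment to view individual URLs.
    # Substring check: skips any URL containing "gif", not only files ending in .gif.
    if link and "gif" not in link:
        link_2 = link.split("?")[0]  # Drop the query string.
        print(link_2)
        to_download.append(link_2)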

https://img.shields.io/badge/python-3670A0
https://img.shields.io/badge/javascript-%23323330.svg
https://img.shields.io/badge/php-%23777BB4.svg
https://img.shields.io/badge/r-%23276DC3.svg
https://img.shields.io/badge/typescript-%23007ACC.svg
https://img.shields.io/badge/Colab-F9AB00
https://img.shields.io/badge/Kaggle-035a7d.svg
https://img.shields.io/badge/Apache%20Spark-FDEE21.svg
https://img.shields.io/badge/numpy-%23013243.svg
https://img.shields.io/badge/pandas-%23150458.svg
https://img.shields.io/badge/Matplotlib-%23ffffff.svg
https://img.shields.io/badge/Plotly-%233F4F75.svg
https://img.shields.io/badge/SciPy-%230C55A5.svg
https://img.shields.io/badge/TensorFlow-%23FF6F00.svg
https://img.shields.io/badge/Keras-%23D00000.svg
https://img.shields.io/badge/PyTorch-%23EE4C2C.svg
https://img.shields.io/badge/scikit--learn-%23F7931E.svg
https://img.shields.io/badge/opencv-%23white.svg
https://img.shields.io/badge/-RaspberryPi-C51A4A
https://img.shields.io/badge/html5-%23E34F26.svg
htt
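
The substring test `"gif" not in link` also drops any URL that merely contains the letters "gif". A stricter variant, sketched here with the standard library's `urllib.parse`, checks only the file extension of the URL path. The helper names `is_gif` and `strip_query`, the query strings, and the `gifski` badge URL are assumptions made up for illustration.

```python
from urllib.parse import urlsplit


def is_gif(url):
    """True when the URL path ends in .gif, ignoring the query string and letter case."""
    return urlsplit(url).path.lower().endswith(".gif")


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


links = [
    "https://user-images.githubusercontent.com/74038190/235224431-e8c8c12e-6826-47f1-89fb-2ddad83b3abf.gif",
    "https://img.shields.io/badge/python-3670A0?style=for-the-badge",  # query string is assumed
    "https://img.shields.io/badge/gifski-%23323330.svg?logo=x",  # hypothetical badge containing "gif"
]

kept = [strip_query(link) for link in links if not is_gif(link)]
for link in kept:
    print(link)
```

With this check the animated GIF is still skipped, while the hypothetical `gifski` badge is kept.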

**Inspect the First Badge URL**

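Let's take a look at the first URL in our download list. Note that the output shown here comes from an earlier run of the cell (its execution count is lower than that of the filtering cell), so it differs from the first URL printed above.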


In [30]:
# Show the first URL in the list.
to_download[0]

'https://img.shields.io/badge/php-%23777BB4.svg'

**Define the Download Function**

Here, we create a function `download_badges()` that:
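1. Takes a URL and an optional folder name as arguments.
2. Creates the folder if it does not exist and derives a file name from the badge label in the URL.
3. Downloads the SVG from the given URL.
4. Saves the SVG into the specified folder (default is `"badges"`) when the request succeeds; otherwise it prints the HTTP status code.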


In [52]:
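def download_badges(url, folder='badges'):
    """Download one badge and save it as <folder>/<label>.svg."""
    os.makedirs(folder, exist_ok=True)
    # Remove a leading "-" from the last path segment *before* splitting on "/",
    # then keep the label (the text before the first "-") and drop any dots.
    name_svg = url.replace("/-", "/").split("/")[-1].split("-")[0].replace(".", "")
    filename = f'{folder}/{name_svg}.svg'
    response = requests.get(url, timeout=10)  # Timeout so a stalled request cannot hang the loop.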

    # Ensure the request was successful
    if response.status_code == 200:
        # Write the image to a file
        with open(filename, 'wb') as f:
            f.write(response.content)
        print(f"Image: {filename} downloaded successfully!")
    else:
        print(f"Failed to download the image. HTTP status code: {response.status_code}")
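
The file name is derived from the badge label in the URL. As a possible refinement, the sketch below isolates that step in a hypothetical helper, `badge_name`, which also decodes percent-encoded characters (so `Apache%20Spark` becomes `Apache_Spark`) and keeps doubled dashes as a literal dash. The doubled-dash handling assumes shields.io's convention that `--` in a label stands for a single `-`; treat it as an assumption rather than a guarantee.

```python
from urllib.parse import unquote


def badge_name(url):
    """Derive a file stem from a shields.io-style badge URL (hypothetical helper)."""
    last = url.split("/")[-1]                     # e.g. "Apache%20Spark-FDEE21.svg"
    # Protect "--" (assumed to mean a literal dash), then drop a leading "-".
    last = last.replace("--", "\0").lstrip("-")
    label = last.split("-")[0].replace("\0", "-")  # text before the first separator dash
    label = unquote(label)                         # "%20" -> " "
    if label.lower().endswith(".svg"):
        label = label[:-4]                         # label-only URLs: strip the extension
    return label.replace(" ", "_")


for example in (
    "https://img.shields.io/badge/Apache%20Spark-FDEE21.svg",
    "https://img.shields.io/badge/-RaspberryPi-C51A4A",
    "https://img.shields.io/badge/scikit--learn-%23F7931E.svg",
):
    print(badge_name(example))
```

Keeping the naming logic in its own function makes it easy to test without touching the network.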

**Download All Badges**

Use the `download_badges()` function to download all the SVG badges from the previously prepared list.


In [53]:
for url in to_download:
    download_badges(url=url, folder='badges')

Image: badges/python.svg downloaded successfully!
Image: badges/javascript.svg downloaded successfully!
Image: badges/php.svg downloaded successfully!
Image: badges/r.svg downloaded successfully!
Image: badges/typescript.svg downloaded successfully!
Image: badges/Colab.svg downloaded successfully!
Image: badges/Kaggle.svg downloaded successfully!
Image: badges/Apache%20Spark.svg downloaded successfully!
Image: badges/numpy.svg downloaded successfully!
Image: badges/pandas.svg downloaded successfully!
Image: badges/Matplotlib.svg downloaded successfully!
Image: badges/Plotly.svg downloaded successfully!
Image: badges/SciPy.svg downloaded successfully!
Image: badges/TensorFlow.svg downloaded successfully!
Image: badges/Keras.svg downloaded successfully!
Image: badges/PyTorch.svg downloaded successfully!
Image: badges/scikit.svg downloaded successfully!
Image: badges/opencv.svg downloaded successfully!
Image: badges/RaspberryPi.svg downloaded successfully!
Image: badges/html5.svg download
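
Because each file is named after the text before the first dash in its URL, two different badges can map to the same file, and the later download silently overwrites the earlier one. The sketch below is one way to spot such collisions before downloading. The helpers `stem` and `find_collisions` are hypothetical, `stem` is a simplified stand-in for the notebook's naming rule, and the `scikit--image` URL is invented for the example.

```python
from collections import defaultdict


def stem(url):
    """Simplified naming rule: text before the first dash of the last path segment."""
    return url.split("/")[-1].lstrip("-").split("-")[0].replace(".", "")


def find_collisions(urls):
    """Map each file stem to its URLs, keeping only stems shared by more than one URL."""
    by_stem = defaultdict(list)
    for url in urls:
        by_stem[stem(url)].append(url)
    return {name: group for name, group in by_stem.items() if len(group) > 1}


urls = [
    "https://img.shields.io/badge/scikit--learn-%23F7931E.svg",
    "https://img.shields.io/badge/scikit--image-%23F7931E.svg",  # hypothetical second badge
    "https://img.shields.io/badge/numpy-%23013243.svg",
]

for name, group in find_collisions(urls).items():
    print(f"{name}.svg would be written {len(group)} times")
```

An empty result means every URL maps to a distinct file name under this rule.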

### Conclusion

All the SVG badges from the external README have now been successfully downloaded and saved locally. This ensures that even if the external sources become unavailable in the future, the badges will remain accessible from within our repository.