What You Will Learn

Gain an understanding of the basics of web scraping using Python libraries.
Learn how to use the requests library to send HTTP requests and receive responses.
Understand how to use BeautifulSoup to parse HTML content and extract data.
Acquire skills in extracting specific information, such as topic titles and descriptions, by targeting HTML tags and classes.
Learn how to manipulate and store extracted data using Pandas, a powerful data analysis tool in Python.


What You Will Create

A Python script to scrape the GitHub Topics webpage.
A BeautifulSoup parse of the saved HTML content of the GitHub Topics page.
A pandas DataFrame, built from a dictionary of the extracted data, for easier manipulation and analysis.


Web Scraping:

Use this website: GitHub Topics (https://github.com/topics).
Write a Python script using the requests library to fetch the HTML content of the chosen website.
Print the status code of the response using .status_code to confirm that the request succeeded; it should be 200.
Print the first 100 characters of the HTML content to verify the response.
Save the HTML content to a file named webpage.html. Ensure you handle the text encoding correctly.
Use BeautifulSoup to parse the saved HTML content.
Identify two distinct pieces of information on the webpage to extract (e.g., titles of the topics and their descriptions).
Write code to extract these pieces of information. Ensure you identify the correct HTML tags and classes used for these elements on the webpage.
Print the length and content of each extracted list to verify the extraction process.
Create a Python dictionary to structure the extracted data, with keys representing the type of information (e.g., ‘title’ and ‘description’).
Convert this dictionary into a pandas DataFrame.
Print the DataFrame to confirm its structure and contents.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

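url = 'https://github.com/topics'
# Set a timeout so the request cannot hang indefinitely.
response = requests.get(url, timeout=10)

# A status code of 200 means the request succeeded.
print(f'Status Code: {response.status_code}')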

Status Code: 200
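
Printing the status code confirms success by eye, but a script can also fail loudly when the request does not succeed. The following is a minimal sketch of that idea. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original exercise.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page's HTML text, raising on HTTP errors."""
    # The timeout value is an assumption; tune it to your network conditions.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text
```

In the notebook this could replace the manual check, for example html = fetch_html('https://github.com/topics'), followed by print(html[:100]).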


In [2]:
print(response.text[:100])



<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-t


In [3]:
with open('webpage.html', 'w', encoding='utf-8') as file:
    file.write(response.text)
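
Passing encoding='utf-8' explicitly matters because, without it, open() may fall back to a platform-dependent default that cannot represent every character in the page. The sketch below illustrates the round trip with a small made-up HTML string and a temporary directory; both are assumptions for demonstration only.

```python
import os
import tempfile

# Made-up sample containing non-ASCII characters (not taken from the real page).
sample_html = "<p>Café, naïve, 日本語</p>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "webpage.html")

    # Write and read with the same explicit encoding.
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample_html)
    with open(path, "r", encoding="utf-8") as file:
        round_tripped = file.read()

# The text read back should be identical to what was written.
print(round_tripped == sample_html)
```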

In [4]:
with open('webpage.html', 'r', encoding='utf-8') as file:
    content = file.read()

soup = BeautifulSoup(content, 'html.parser')

In [15]:
# Topic titles sit in <p> tags whose class attribute exactly matches this string.
topic_title_tags = soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
topic_titles = [tag.text.strip() for tag in topic_title_tags]

# Topic descriptions are <p> tags with a different class string.
topic_desc_tags = soup.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
topic_descriptions = [tag.text.strip() for tag in topic_desc_tags]

print(f'Number of Topics Found: {len(topic_titles)}')
print(topic_titles)

print(f'Number of Descriptions Found: {len(topic_descriptions)}')
print(topic_descriptions)



Number of Topics Found: 30
['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']
Number of Descriptions Found: 30
['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and su
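
Matching on the full class string works here, but it depends on the classes appearing in exactly that order, and collecting titles and descriptions in two separate passes can misalign them if a topic has no description. One possible alternative, sketched below, uses CSS selectors and extracts both fields from each topic's container. The sample markup and the topic-box class are hypothetical stand-ins; the real GitHub Topics page is structured differently and may change.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; "topic-box" is an assumed container class.
sample_html = """
<div class="topic-box">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D graphics and modeling.</p>
</div>
<div class="topic-box">
  <p class="mt-1 f3 Link--primary lh-condensed mb-0">Ajax</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

topics = []
for box in soup.select("div.topic-box"):
    # CSS class selectors match regardless of the order of classes.
    title_tag = box.select_one("p.f3.Link--primary")
    desc_tag = box.select_one("p.f5.color-fg-muted")
    topics.append({
        "Title": title_tag.get_text(strip=True) if title_tag else None,
        "Description": desc_tag.get_text(strip=True) if desc_tag else None,
    })

print(topics)
```

Because each record is built from a single container, a missing description shows up as None in that row instead of shifting every later description up by one.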

In [16]:
data = {
    'Title': topic_titles,
    'Description': topic_descriptions
}

df = pd.DataFrame(data)

print(df)

                     Title                                        Description
0                       3D  3D refers to the use of three-dimensional grap...
1                     Ajax  Ajax is a technique for creating interactive w...
2                Algorithm  Algorithms are self-contained sequences that c...
3                      Amp  Amp is a non-blocking concurrency library for ...
4                  Android  Android is an operating system built by Google...
5                  Angular  Angular is an open source web application plat...
6                  Ansible  Ansible is a simple and powerful automation en...
7                      API  An API (Application Programming Interface) is ...
8                  Arduino  Arduino is an open source platform for buildin...
9                  ASP.NET  ASP.NET is a web framework for building modern...
10           Awesome Lists  An awesome list is a list of awesome things cu...
11     Amazon Web Services  Amazon Web Services provides on-dema
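
pd.DataFrame raises a ValueError when the lists in the dictionary have different lengths, so it can help to check the counts first and report them clearly. The sketch below shows one way to do that. The helper name build_topic_table and the sample lists are assumptions made for illustration.

```python
def build_topic_table(titles, descriptions):
    """Hypothetical helper: return equal-length columns or raise if counts differ."""
    if len(titles) != len(descriptions):
        raise ValueError(
            f"Found {len(titles)} titles but {len(descriptions)} descriptions; "
            "check the selectors before building the DataFrame."
        )
    return {"Title": list(titles), "Description": list(descriptions)}


# Small made-up sample standing in for the scraped lists.
table = build_topic_table(["3D", "Ajax"], ["About 3D.", "About Ajax."])
print(table)
```

In the notebook the result would be passed straight to pandas, for example df = pd.DataFrame(build_topic_table(topic_titles, topic_descriptions)).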