# Web Scraping with BeautifulSoup and Requests: Extract text and links from web pages

## Created by:
**Khaled Ashraf**  
AI Engineer

## Purpose:
This Jupyter Notebook demonstrates web scraping techniques using the `requests` and `BeautifulSoup` libraries. The notebook is divided into three sections:
1. Extracting text content from a Wikipedia page (Arabic).
2. Extracting links from dynamically generated URLs.
3. Extracting text content from a specific webpage, in this case, the Wikipedia page for Zamalek Club.

## Libraries Used:
- **requests**: Used to make HTTP requests to fetch webpage content.
- **BeautifulSoup** (from `bs4`): Used to parse HTML content and extract specific elements like text and links.

## Steps:
### 1. Extracting Text from a Wikipedia Page (Arabic)
This section demonstrates how to fetch content from the Wikipedia page for "Nadi El Zamalek" (Arabic), extract the text, and save it as a plain text file.

### 2. Extracting Links from Dynamically Generated URLs
In this section, URLs are dynamically constructed from a predefined list (`T`), and the script extracts all the hyperlinks (`<a>` tags) from each page.

### 3. Combining Text and Link Extraction
This section combines the previous two techniques, allowing you to extract both plain text and links from different pages in the same notebook.

## Output:
- The first part of the notebook will generate a file named `Zamalek_Club.txt` containing the extracted text from the Wikipedia page.
- The second part will print a list of all the links (`<a>` tags) found on the dynamically generated pages.

## Notes:
- You can customize the list `T` to scrape different pages.
- This notebook demonstrates two types of scraping: extracting plain text from a webpage and extracting links from a list of pages.


# Web Scraping with BeautifulSoup and Requests: Extract text and links



In [1]:
pip install requests


Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install beautifulsoup4


Note: you may need to restart the kernel to use updated packages.


In [3]:
import requests
from bs4 import BeautifulSoup

def GetPage(Link, FileName):
    page = requests.get(Link)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Print the number of paragraphs
    paragraphs = soup.find_all('p')
    print(f'Number of paragraphs is : {len(paragraphs)}')
    
    # Print the page title if it exists
    try:
        title = soup.find(id="firstHeading")
        print(f'Page title is : {title.string}')
    except AttributeError:
        pass  # Ignore if there is no title

    # If there are no paragraphs, exit the function
    if len(paragraphs) == 0:
        return None
    
    # Save the paragraphs to a file
    with open(FileName, 'w', encoding='utf-8') as f:
        for para in paragraphs:
            f.write(para.get_text())
            f.write('\n')
    print(f"Text has been saved to {FileName}")


In [4]:
GetPage('https://en.wikipedia.org/wiki/Albert_Einstein','BSAlbert.txt')

Number of paragraphs is : 137
Page title is : Albert Einstein
Text has been saved to BSAlbert.txt


In [5]:
GetPage("https://ar.wikipedia.org/wiki/%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%B2%D9%85%D8%A7%D9%84%D9%83","zamalek.txt")

Number of paragraphs is : 122
Page title is : نادي الزمالك
Text has been saved to zamalek.txt
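The title lookup and paragraph collection in `GetPage` can be tried offline on a small HTML string. The sketch below is only an illustration under assumptions: `summarize_page` and both sample snippets are hypothetical and not part of the notebook, and it swaps the `try`/`except` for an explicit `None` check and `title.string` for `get_text()`.

```python
from bs4 import BeautifulSoup


def summarize_page(html):
    """Return (title, paragraphs) for an HTML string; title is None if absent."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when no element has this id, so check before reading text
    heading = soup.find(id="firstHeading")
    title = heading.get_text(strip=True) if heading is not None else None

    # One string per <p> tag, in document order
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    return title, paragraphs


# Hypothetical sample pages, one with a Wikipedia-style heading and one without
with_heading = '<h1 id="firstHeading"><span>Sample page</span></h1><p>One.</p><p>Two.</p>'
without_heading = "<p>Only paragraph.</p>"

print(summarize_page(with_heading))
print(summarize_page(without_heading))
```

Returning the values instead of printing them makes the "no title" and "no paragraphs" cases easy to check before anything is written to disk.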


In [6]:
import urllib.request

url = 'https://raw.githubusercontent.com/HeshamAsem/NLTK/master/Files/HardTimes.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')
print(data)


Thomas Gradgrind, sir.  A man of realities.  A man of facts and calculations.  A man who proceeds upon the principle that two and two are four, and nothing over, and who is not to be talked into allowing for anything over.  Thomas Gradgrind, sir—peremptorily Thomas—Thomas Gradgrind.  With a rule and a pair of scales, and the multiplication table always in his pocket, sir, ready to weigh and measure any parcel of human nature, and tell you exactly what it comes to.  It is a mere question of figures, a case of simple arithmetic.  You might hope to get some other nonsensical belief into the head of George Gradgrind, or Augustus Gradgrind, or John Gradgrind, or Joseph Gradgrind (all supposititious, non-existent persons), but into the head of Thomas Gradgrind—no, sir!

In such terms Mr. Gradgrind always mentally introduced himself, whether to his private circle of acquaintance, or to the public in general.  In such terms, no doubt, substituting the words ‘boys and girls,’ for ‘sir,’ Thoma

## urllib.request


In [7]:
import urllib.request

# URL of the file
url = 'https://raw.githubusercontent.com/HeshamAsem/NLTK/master/Files/HardTimes.txt'

# Open the URL and read the data
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Name of the file the data will be saved to
file_name = 'HardTimes.txt'

# Save the data to the file
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(data)

print(f"Data has been saved to {file_name}")


Data has been saved to HardTimes.txt
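A possible variation on the cell above, shown as a sketch rather than the notebook's method: open the response in a `with` block so the connection is closed, and read the charset from the response headers instead of hard-coding `utf8`. The helper name `fetch_text` and the `data:` demo URL are assumptions made so the example runs without network access; in the notebook you would pass the GitHub URL instead.

```python
import urllib.request


def fetch_text(url, default_encoding="utf-8"):
    """Download a URL and decode it with the declared charset, if any."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the server declares no charset
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset)


# A data: URL stands in for a remote text file so the sketch works offline
demo_url = "data:text/plain;charset=utf-8,Thomas%20Gradgrind%2C%20sir."
text = fetch_text(demo_url)
print(text)
```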


In [8]:
import urllib.request
from bs4 import BeautifulSoup

# URL of the page
url = 'https://ar.wikipedia.org/wiki/%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%B2%D9%85%D8%A7%D9%84%D9%83'

# Open the URL and read the data
response = urllib.request.urlopen(url)
data = response.read().decode('utf-8')

# Use BeautifulSoup to parse the page and extract the text only
soup = BeautifulSoup(data, 'html.parser')

# Extract all the text on the page (not only the paragraphs)
cleaned_data = soup.get_text()

# Name of the file the data will be saved to
file_name = 'Zamalek_Club.txt'

# Save the data to the file, making sure the encoding is 'utf-8'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(cleaned_data)

print(f"Data has been saved to {file_name}")


Data has been saved to Zamalek_Club.txt
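Note that `soup.get_text()` in the cell above collects text from the whole page, while `GetPage` earlier writes only the `<p>` elements. The following sketch compares the two on a small HTML string; the sample markup and the variable names are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a navigation link, a heading, and two paragraphs
html = """
<html><body>
<div id="nav"><a href="/">Home</a></div>
<h1 id="firstHeading">Sample</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Whole-page text: every string in the document, one per line
all_text = soup.get_text(separator="\n", strip=True)

# Paragraph-only text: just the contents of the <p> tags
para_text = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(all_text)
print("---")
print(para_text)
```

If the saved file should hold only article prose, the paragraph-only form is likely the better fit; the whole-page form also keeps menus, headings, and link labels.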


## WebTeb

In [9]:
T = 'دجحخهعغفقثصضذشسيبلاتنمكطظزوةىلارؤءئ'
T = list(set(T))
len(T)

33
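Each letter in `T` becomes the last path segment of a WebTeb listing URL in the next cell. As a rough sketch of that construction, the example below percent-encodes each letter explicitly and sorts the unique letters so the order is reproducible (the order of `list(set(T))` is not guaranteed). The helper `build_list_urls` and the three-letter sample are assumptions for illustration; `requests` normally percent-encodes non-ASCII URLs itself, so this step is optional.

```python
from urllib.parse import quote

# Base URL taken from the notebook's link-extraction cell
BASE_URL = "https://www.webteb.com/drug/list/"


def build_list_urls(letters):
    """Return one percent-encoded listing URL per unique letter, in sorted order."""
    return [BASE_URL + quote(letter) for letter in sorted(set(letters))]


# Small sample with a repeated letter to show the de-duplication
list_urls = build_list_urls("ابب")
print(list_urls)
```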

In [10]:
import requests
from bs4 import BeautifulSoup
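urls = []
for t in T:
    # Build the listing URL for this letter and download the page
    url = 'https://www.webteb.com/drug/list/' + t
    reqs = requests.get(url)
    soup = BeautifulSoup(reqs.text, 'html.parser')

    # Collect the href of every <a> tag on the page
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
urls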

['/',
 'https://webteb.miavitals.com/',
 'https://webteb.miavitals.com/',
 '/',
 'https://twitter.com/WebTeb_com',
 'https://www.facebook.com/Webteb.net',
 'https://www.instagram.com/webteb/',
 '/medical',
 '/lifestyle',
 'https://baby.webteb.com',
 '/diseases',
 '/drug',
 '/testyourself',
 'https://webteb.miavitals.com/',
 'https://www.webteb.com/medical',
 'https://www.webteb.com/body-organs',
 'https://www.webteb.com/dental-health',
 'https://www.webteb.com/heart',
 'https://www.webteb.com/alternative-medicine',
 'https://www.webteb.com/woman-health',
 'https://www.webteb.com/cancer',
 'https://www.webteb.com/eye-health',
 'https://www.webteb.com/sex-education',
 'https://www.webteb.com/mental-health',
 'https://www.webteb.com/symptoms',
 'https://www.webteb.com/diabetes',
 'https://www.webteb.com/medical-technology',
 'https://news.webteb.com',
 'https://baby.webteb.com',
 'https://baby.webteb.com/حاسبة-الحمل-وموعد-الولادة',
 'https://baby.webteb.com/baby-names',
 'https://baby.web

In [11]:
len(urls)

8443

# I used this code to retrieve only the URLs of all drugs from the WebTeb website, 
# focusing on collecting a list of available drugs without any additional data.



In [12]:
U = [i for i in urls if i and 'https://www.webteb.com/drug' in i]  # skip <a> tags without an href
len(U)

3153

In [13]:
len(set(U))

700
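The filter above matches only absolute URLs, and `set` discards the original order. One alternative, sketched here under assumptions, resolves relative links such as `/drug/...` with `urljoin`, keeps URLs under the drug path, and removes duplicates while preserving first-seen order. The helper `clean_drug_links`, the two constants, and the `example-a`/`example-b` paths are hypothetical; the stricter prefix also drops the bare `/drug` index page, which the notebook's substring test would keep.

```python
from urllib.parse import urljoin

# Assumed page the hrefs were read from, and the prefix that marks a drug URL
PAGE_URL = "https://www.webteb.com/drug/list/"
DRUG_PREFIX = "https://www.webteb.com/drug/"


def clean_drug_links(hrefs, page_url=PAGE_URL):
    """Resolve hrefs against page_url, keep drug URLs, drop duplicates in order."""
    # Skip None and empty hrefs, then turn relative links into absolute ones
    absolute = (urljoin(page_url, h) for h in hrefs if h)
    drug_links = (u for u in absolute if u.startswith(DRUG_PREFIX))
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(drug_links))


# Hypothetical sample mixing relative, absolute, missing, and off-site links
sample = [
    "/",
    None,
    "/drug/example-a",
    "https://www.webteb.com/drug/example-a",
    "https://www.webteb.com/drug/example-b",
    "https://news.webteb.com",
]
drug_urls = clean_drug_links(sample)
print(drug_urls)
```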

In [14]:
U = list(set(U))
U

['https://www.webteb.com/drug/كروتاميتون',
 'https://www.webteb.com/drug/ديفلوكورتولون',
 'https://www.webteb.com/drug/زولميتريبتان',
 'https://www.webteb.com/drug/كابازيتاكسيل',
 'https://www.webteb.com/drug/كوليستين-سولفوميثات',
 'https://www.webteb.com/drug/ليوبروليد',
 'https://www.webteb.com/drug/اوكساليبلاتين',
 'https://www.webteb.com/drug/مابروتيلين',
 'https://www.webteb.com/drug/سيرترالين',
 'https://www.webteb.com/drug/اميكاسين',
 'https://www.webteb.com/drug/فينلفورامين',
 'https://www.webteb.com/drug/جليميبرايد',
 'https://www.webteb.com/drug/اميلوريد',
 'https://www.webteb.com/drug/ديكلوفيناك',
 'https://www.webteb.com/drug/بنسلامين',
 'https://www.webteb.com/drug/ترولامين',
 'https://www.webteb.com/drug/كيتوكونازول',
 'https://www.webteb.com/drug/ايزوسوربيد',
 'https://www.webteb.com/drug/اميغلوسيريز',
 'https://www.webteb.com/drug/احادي-الكسيروتين',
 'https://www.webteb.com/drug/الانترفيرون',
 'https://www.webteb.com/drug/فيراباميل',
 'https://www.webteb.com/drug/ميجيست