# Wikipedia Arabic articles scraping

In this project I build a scraping tool for Wikipedia. The scraper extracts the paragraphs of a given article, collects all hyperlinks to other articles found in the article body, and stores them in a DataFrame. In a future version I will add the ability to scrape the collected links, to build a dataset of many articles related to the starting article.

Please use this code responsibly, and review Wikipedia's rules on scraping to check the legality and the impact of this method before use.
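
One concrete way to act on this is to consult the site's `robots.txt` before fetching a page. The sketch below uses the standard library's `urllib.robotparser`. The rules are a made-up sample parsed from a string so the example runs offline (a live file would be loaded with `set_url` and `read`), and the user agent name `my-scraper` is a placeholder. A `robots.txt` check does not replace reading Wikipedia's own policies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample rules for illustration, NOT Wikipedia's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
Allow: /wiki/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules given as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
USER_AGENT = "my-scraper"  # placeholder name

# Ask whether this user agent may fetch each URL under the sample rules
article_allowed = parser.can_fetch(USER_AGENT, "https://ar.wikipedia.org/wiki/Example")
script_allowed = parser.can_fetch(USER_AGENT, "https://ar.wikipedia.org/w/index.php")
print(article_allowed, script_allowed)
```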

## Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Access the article using URL

In [2]:
url = "https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%A3%D9%87%D9%84%D9%8A_(%D9%85%D8%B5%D8%B1)"
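response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')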
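
The long `%D8%A7...` sequence in the URL is the Arabic article title, UTF-8 encoded and percent-escaped. As a small illustration using only the standard library, the sketch below decodes the page name from the URL and rebuilds the URL from it. `WIKI_PREFIX`, `title_from_url` and `url_from_title` are names introduced here for illustration.

```python
from urllib.parse import quote, unquote

WIKI_PREFIX = "https://ar.wikipedia.org/wiki/"

def title_from_url(url: str) -> str:
    """Decode the percent-escaped page name that follows /wiki/."""
    if not url.startswith(WIKI_PREFIX):
        raise ValueError("not an Arabic Wikipedia article URL")
    return unquote(url[len(WIKI_PREFIX):])

def url_from_title(title: str) -> str:
    """Percent-escape a page name; spaces become underscores, parentheses are kept."""
    return WIKI_PREFIX + quote(title.replace(" ", "_"), safe="()")

url = "https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%A3%D9%87%D9%84%D9%8A_(%D9%85%D8%B5%D8%B1)"
page_name = title_from_url(url)   # readable Arabic page name with underscores
rebuilt = url_from_title(page_name)
print(page_name)
```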

In [3]:
print(soup.prettify()[:15000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="rtl" lang="ar">
 <head>
  <meta charset="utf-8"/>
  <title>
   النادي الأهلي (مصر) - ويكيبيديا
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-fea

## Extract information from BeautifulSoup object

### Find title of article

In [4]:
title = soup.find('h1', {'id' : 'firstHeading'})
print('article title: {}.'.format(title.get_text()))

article title: النادي الأهلي (مصر).


### Filter page body and get text

In [5]:
# Keep only the page body, since I am only interested in the information inside it
body = soup.find(class_ = 'mw-content-rtl mw-parser-output')

In [6]:
# Filter the body to only keep paragraphs in the main text of the article
unwanted = [
    body.find(class_='infobox'),  # Side info box
    body.find(class_='infobox infobox_v2'),  # Side info box (v2 variant)
    body.find(class_='navbox'),  # Navigation elements
    body.find(class_='hatnote navigation-not-searchable'),  # Hat note
    body.find(class_='wikitable'),  # Tables
]
for element in unwanted:
    # find() returns None when nothing matches, so skip missing elements
    if element is not None:
        element.decompose()
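
Because `find` returns only the first match, an article with several tables or navigation boxes would still keep the later ones. A possible alternative, sketched here on a tiny made-up HTML snippet, is to remove every match of a CSS selector. The helper name `remove_all` and the sample markup are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def remove_all(root, selectors):
    """Remove every element under root matching any CSS selector; return the count."""
    removed = 0
    for selector in selectors:
        # Re-query after each removal so nested matches are never touched twice
        element = root.select_one(selector)
        while element is not None:
            element.decompose()
            removed += 1
            element = root.select_one(selector)
    return removed

# Made-up markup that mimics the structure of an article body
sample_html = """
<div class="mw-parser-output">
  <table class="infobox"><tr><td>side box</td></tr></table>
  <p>First paragraph.</p>
  <table class="wikitable"><tr><td>table one</td></tr></table>
  <p>Second paragraph.</p>
  <table class="wikitable"><tr><td>table two</td></tr></table>
</div>
"""
sample_body = BeautifulSoup(sample_html, "html.parser").find(class_="mw-parser-output")
removed_count = remove_all(sample_body, [".infobox", ".navbox", ".hatnote", ".wikitable"])
remaining_text = " ".join(sample_body.get_text().split())
print(removed_count, remaining_text)
```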


In [7]:
# Get all text in the article body 
text_body = body.get_text()

In [8]:
# Create a txt file containing the article's body.
# An explicit UTF-8 encoding keeps the Arabic text intact on every platform.
with open('test.txt', 'w', encoding='utf-8') as file:
    file.write(text_body.strip())
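
The raw output of `get_text()` usually still contains citation markers such as `[1]` and runs of blank lines. A minimal cleanup sketch follows, assuming those markers are bracketed numbers. The function name `clean_article_text` and the sample string are made up for illustration.

```python
import re

# In str patterns, \d also matches Arabic-Indic digits
CITATION_MARKER = re.compile(r"\[\s*\d+\s*\]")

def clean_article_text(text: str) -> str:
    """Drop bracketed citation numbers, collapse spaces, and remove empty lines."""
    text = CITATION_MARKER.sub("", text)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

sample_text = "First sentence.[1]\n\n\n  Second   sentence.[ 23 ]\n"
cleaned = clean_article_text(sample_text)
print(cleaned)
```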

### Get links from page to scrape them

In [9]:
a_tags = body.find_all('a')

#### Filter links
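Now we go through the links to skip disambiguation links, citation notes, and links that carry a `class` attribute (which typically marks image, external, or other special links), and build full URLs from the remaining `href` values.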

In [10]:
a_tags[10]

<a href="/wiki/%D8%A7%D9%84%D9%86%D8%A7%D8%AF%D9%8A_%D8%A7%D9%84%D8%A3%D9%87%D9%84%D9%8A_%D8%A7%D9%84%D9%85%D8%B5%D8%B1%D9%8A_%D9%84%D9%83%D8%B1%D8%A9_%D8%A7%D9%84%D9%85%D8%A7%D8%A1" title="النادي الأهلي المصري لكرة الماء"><img class="mw-file-element" data-file-height="300" data-file-width="300" decoding="async" height="50" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Water_polo_pictogram.svg/50px-Water_polo_pictogram.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Water_polo_pictogram.svg/75px-Water_polo_pictogram.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Water_polo_pictogram.svg/100px-Water_polo_pictogram.svg.png 2x" width="50"/></a>

In [11]:
articles_path = list() # Stores (title, URL) pairs for the other Wikipedia articles linked from the current one

for path in a_tags:
    link_title = path.get('title')
    href = path.get('href')
    # Skip disambiguation links ('توضيح' is Arabic for "disambiguation")
    if link_title is not None and 'توضيح' in link_title:
        continue
    # Keep plain links only: they have an href, no class attribute, and are not citation notes
    if href is not None and path.get('class') is None and 'cite_note' not in href:
        full_url = 'https://ar.wikipedia.org' + href
        value = (link_title, full_url)
        if value not in articles_path: # Check if this article exists in the list of links
            articles_path.append(value)

print(len(articles_path))

1759
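
The filter above prepends the host to every `href`, which yields broken URLs for in-page anchors (`#...`) or absolute external links if any slip through. As one possible hardening, the sketch below resolves each `href` with `urllib.parse.urljoin` and keeps only same-host `/wiki/` pages outside special namespaces. The function name `normalize_article_href` and the rule that a colon in the page name marks a namespace are simplifying assumptions (article titles that contain a colon would be dropped too). Applying it to the real page would change the link count shown above.

```python
from urllib.parse import unquote, urljoin, urlsplit

BASE_URL = "https://ar.wikipedia.org"

def normalize_article_href(href, base_url=BASE_URL):
    """Return an absolute article URL without fragment, or None if href is not an article link."""
    if not href or href.startswith("#"):
        return None  # missing href or in-page anchor
    parts = urlsplit(urljoin(base_url, href))
    if parts.netloc != urlsplit(base_url).netloc:
        return None  # external site
    if not parts.path.startswith("/wiki/"):
        return None  # e.g. /w/index.php edit links
    page_name = unquote(parts.path[len("/wiki/"):])
    if not page_name or ":" in page_name:
        return None  # assumed namespace page such as File: or Category:
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

# Made-up href values covering the cases handled above
sample_hrefs = [
    "/wiki/Cairo",
    "/wiki/Cairo#History",
    "#cite_note-1",
    "https://example.org/page",
    "/wiki/File:Logo.svg",
    "/w/index.php?title=Cairo&action=edit",
]
candidates = (normalize_article_href(h) for h in sample_hrefs)
# dict.fromkeys removes duplicates while keeping first-seen order
unique_urls = list(dict.fromkeys(u for u in candidates if u is not None))
print(unique_urls)
```

Using `dict.fromkeys` for deduplication also avoids the repeated list scan of `value not in articles_path`, which grows slow when a page has thousands of links.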


In [12]:
articles_linkes_df = pd.DataFrame.from_records(articles_path, columns=['name', 'URL'])
articles_linkes_df.head()


| | name | URL |
|---|---|---|
| 0 | النادي الأهلي (مصر) | https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%... |
| 1 | النادي الأهلي للكرة الطائرة (مصر) | https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%... |
| 2 | النادي الأهلي للكرة الطائرة للسيدات | https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%... |
| 3 | النادي الأهلي المصري لكرة السلة | https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%... |
| 4 | النادي الأهلي لكرة السلة للسيدات | https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%... |
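
The follow-up step mentioned in the introduction, visiting the collected links to build a larger dataset, could look roughly like the breadth-first sketch below. It is only an outline under stated assumptions: `crawl` and `get_links` are hypothetical names, `get_links` stands in for the fetch-and-filter steps of this notebook, and the toy link graph replaces real network requests. A real run should also pause between requests.

```python
from collections import deque

def crawl(start_url, get_links, max_pages=10):
    """Visit pages breadth-first from start_url, at most max_pages, each URL once."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        current = queue.popleft()
        visited.append(current)
        for link in get_links(current):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real pages (hypothetical data)
toy_graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}

def toy_get_links(page):
    return toy_graph.get(page, [])

limited_order = crawl("A", toy_get_links, max_pages=3)
full_order = crawl("A", toy_get_links)
print(limited_order, full_order)
```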
