## <p style="text-align:center" color="red"><span style="color:red">Python Web Scraping with Selenium: Quran Karim, English Version</span></p>


<table align="center">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/labrijisaad/Python-Web-Scraping-Quran-Karim-with-Selenium"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

- 🎯 In this notebook, I wrote a script that uses Selenium to scrape the Quran Karim in English and save it to a CSV file.
- 🧹 The notebook also cleans the dataset before saving it (the cleaning step removes unnecessary numbers, punctuation, and excess whitespace).

-  📍  This is the website we used for our scraping process → **[quranful](https://www.quranful.com/)**
- 📫 Feel free to contact me if anything is wrong or if anything needs to be changed 😎!  **labrijisaad@gmail.com**

### Installing the necessary dependencies

In [1]:
%%capture
# # install chromedriver and selenium
# !pip install chromedriver-py==103.0.5060.53
# !pip install selenium
# !pip install matplotlib

### Importing the necessary libraries

In [2]:
# importing necessary packages (duplicate imports removed)
from __future__ import unicode_literals
import os
import re
import time
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from chromedriver_py import binary_path
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline



### Setting up the selenium environment

In [3]:
# build a Chrome Options object from a list of command-line arguments
def add_driver_options(options):
    """
    Return a Chrome Options object with every argument in `options` added.
    """
    chrome_options = Options()
    for opt in options:
        chrome_options.add_argument(opt)
    return chrome_options

# initialize the driver and return it
from selenium.webdriver.chrome.service import Service

def initialize_driver():
    """
    Create an instance of the Chrome web driver with the options below.
    """
    driver_config = {
        "executable_path": binary_path,
        "options": [
            "--headless",
            "--no-sandbox",
            "--start-fullscreen",
            "--allow-insecure-localhost",
            "--disable-dev-shm-usage",
        ],
    }
    options = add_driver_options(driver_config["options"])
    # Selenium 4 takes the driver path through a Service object;
    # the executable_path keyword was removed in Selenium 4.10.
    driver = webdriver.Chrome(
        service=Service(executable_path=driver_config["executable_path"]), options=options
    )
    return driver

In [4]:
%%capture
driver = initialize_driver() # create a driver with initialize_driver method
driver.get("https://www.quranful.com/")

### Testing the scraping process

In [5]:
versets = driver.find_elements(By.XPATH,"//body/div[1]/div[2]/div[3]/div/div[2]")[1:]
for verset in versets:
    print(verset.text)
print(len(versets))

1  In the name of God, the Gracious, the Merciful.
2  Praise be to God, Lord of the Worlds.
3  The Most Gracious, the Most Merciful.
4  Master of the Day of Judgment.
5  It is You we worship, and upon You we call for help.
6  Guide us to the straight path.
7  The path of those You have blessed, not of those against whom there is anger, nor of those who are misguided.
7


In [6]:
sourat_dict = {}
sourat_verses_list = []

sourats = driver.find_elements(By.TAG_NAME, "option")
for sourat in tqdm(sourats[:114]):
    sourat.click()
    time.sleep(4)
    versets = driver.find_elements(By.XPATH,"//body/div[1]/div[2]/div[3]/div/div[2]")[1:]
    for verset in versets:
        sourat_verses_list.append(verset.text)
    sourat_dict[sourat.get_attribute("value")]=sourat_verses_list   
    sourat_verses_list=[]
    sourat.click()

100%|████████████████████████████████████████████████████████████████████████████████| 114/114 [09:08<00:00,  4.81s/it]
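
The loop above waits a fixed four seconds after every click, which adds up to roughly nine minutes for 114 suras. As a rough sketch of an alternative, the snippet below polls a condition until it is met or a timeout expires. `wait_until` and `fake_find_verses` are hypothetical names used only for illustration; they are not Selenium APIs. Selenium's own `WebDriverWait(driver, timeout).until(...)` is built on the same idea.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper for illustration; not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose verses only show up on the third check.
calls = {"count": 0}


def fake_find_verses():
    calls["count"] += 1
    if calls["count"] >= 3:
        return ["1  In the name of God, the Gracious, the Merciful."]
    return []


verses = wait_until(fake_find_verses, timeout=5.0, poll_interval=0.01)
print(len(verses), "verse(s) found after", calls["count"], "checks")
```

With a real driver, the condition would wrap the `find_elements` call. It would also have to confirm that the page content changed since the previous sura, otherwise the old verses satisfy the condition immediately.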


Everything we scraped is stored in the dictionary: **`sourat_dict`**

### Turning the dictionary into a pandas dataframe

In [7]:
import pandas as pd

# collect one dict per verse, then build the DataFrame in a single call
# (DataFrame.append was removed in pandas 2.0)
rows = []
for sourat, verses in sourat_dict.items():
    index_sourat = ''.join([n for n in sourat if n.isdigit()])
    for verse in verses:
        index_verse = ''.join([n for n in verse if n.isdigit()])
        rows.append({'index sourat':index_sourat, 'index verse':index_verse, 'name sourat':sourat, 'verse in english':verse})
df = pd.DataFrame(rows)
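
The cell above joins every digit found in the verse string, so a verse whose text itself contains a number would get a wrong index. Here is a minimal sketch that reads only the leading number instead. `split_verse` is a hypothetical helper, and the second sample string is made up purely to show the difference; it is not scraped data.

```python
import re

# Leading verse number, some whitespace, then the verse text.
VERSE_PATTERN = re.compile(r"^\s*(\d+)\s+(.*)$", re.DOTALL)


def split_verse(raw):
    """Return (verse_number, verse_text) for a string like '6  Guide us ...'."""
    match = VERSE_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"no leading verse number in {raw!r}")
    return int(match.group(1)), match.group(2).strip()


# A verse taken from the test output earlier in this notebook.
print(split_verse("6  Guide us to the straight path."))
# (6, 'Guide us to the straight path.')

# Made-up string with digits inside the text: joining all digits would give '12300'.
print(split_verse("12  They stayed 300 years."))
# (12, 'They stayed 300 years.')
```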

In [8]:
df

Unnamed: 0,index sourat,index verse,name sourat,verse in english
0,1,1,1 al-Fatihah الفاتحة,"1 In the name of God, the Gracious, the Merci..."
1,1,2,1 al-Fatihah الفاتحة,"2 Praise be to God, Lord of the Worlds."
2,1,3,1 al-Fatihah الفاتحة,"3 The Most Gracious, the Most Merciful."
3,1,4,1 al-Fatihah الفاتحة,4 Master of the Day of Judgment.
4,1,5,1 al-Fatihah الفاتحة,"5 It is You we worship, and upon You we call ..."
...,...,...,...,...
6230,114,2,114 an-Nas الـناس,2 The King of mankind.
6231,114,3,114 an-Nas الـناس,3 The God of mankind.
6232,114,4,114 an-Nas الـناس,4 From the evil of the sneaky whisperer.
6233,114,5,114 an-Nas الـناس,5 Who whispers into the hearts of people.


In [9]:
sourat_lenSourat = [(1, 7), (2, 286), (3, 200), (4, 176), (5, 120), (6, 165), (7, 206), (8, 75), (9, 129), (10, 109), (11, 123), (12, 111), (13, 43), (14, 52), (15, 99), (16, 128), (17, 111), (18, 110), (19, 98), (20, 135), (21, 112), (22, 78), (23, 118), (24, 64), (25, 77), (26, 227), (27, 93), (28, 88), (29, 69), (30, 60), (31, 34), (32, 30), (33, 73), (34, 54), (35, 45), (36, 83), (37, 182), (38, 88), (39, 75), (40, 85), (41, 54), (42, 53), (43, 89), (44, 59), (45, 37), (46, 35), (47, 38), (48, 29), (49, 18), (50, 45), (51, 60), (52, 49), (53, 62), (54, 55), (55, 78), (56, 96), (57, 29), (58, 22), (59, 24), (60, 13), (61, 14), (62, 11), (63, 11), (64, 18), (65, 12), (66, 12), (67, 30), (68, 52), (69, 52), (70, 44), (71, 28), (72, 28), (73, 20), (74, 56), (75, 40), (76, 31), (77, 50), (78, 40), (79, 46), (80, 42), (81, 29), (82, 19), (83, 36), (84, 25), (85, 22), (86, 17), (87, 19), (88, 26), (89, 30), (90, 20), (91, 15), (92, 21), (93, 11), (94, 8), (95, 8), (96, 19), (97, 5), (98, 8), (99, 8), (100, 11), (101, 11), (102, 8), (103, 3), (104, 9), (105, 5), (106, 4), (107, 7), (108, 3), (109, 6), (110, 3), (111, 5), (112, 4), (113, 5), (114, 6)]  

**`sourat_lenSourat`** is a list of tuples, each holding the index of a sura and the number of verses it should contain. We use it to check the scraping result and to catch suras that were scraped with the wrong number of verses.

In [10]:
for tpl in sourat_lenSourat:    
    if tpl[1] != len(df[df["index sourat"]==str(tpl[0])]): 
        print(False)
        print(tpl)

False
(9, 129)
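
The check above reports only that a sura has the wrong length. As a small sketch, the function below also reports which verse numbers are absent. `missing_verses` is a hypothetical helper, and the input is toy data (a 129-verse sura with one number skipped), not the scraped DataFrame.

```python
def missing_verses(found_numbers, expected_count):
    """Return the verse numbers in 1..expected_count that are absent from found_numbers."""
    found = set(found_numbers)
    return [n for n in range(1, expected_count + 1) if n not in found]


# Toy data: verse numbers 2..129, i.e. a 129-verse sura scraped without verse 1.
scraped_numbers = list(range(2, 130))
print(missing_verses(scraped_numbers, 129))  # [1]
```

On the real data, `found_numbers` could be the `index verse` column of one sura converted to integers.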


The check shows that the sura **9** we scraped does not contain **129** verses. To investigate, we display the verses of sura **9**:

In [11]:
df[df["index sourat"]=="9"]

Unnamed: 0,index sourat,index verse,name sourat,verse in english
1235,9,2,9 at-Tawbah التوبة,"2 So travel the land for four months, and kno..."
1236,9,3,9 at-Tawbah التوبة,3 And a proclamation from God and His Messeng...
1237,9,4,9 at-Tawbah التوبة,4 Except for those among the polytheists with...
1238,9,5,9 at-Tawbah التوبة,"5 When the Sacred Months have passed, kill th..."
1239,9,6,9 at-Tawbah التوبة,6 And if anyone of the polytheists asks you f...
...,...,...,...,...
1358,9,125,9 at-Tawbah التوبة,125 But as for those in whose hearts is sickn...
1359,9,126,9 at-Tawbah التوبة,126 Do they not see that they are tested once...
1360,9,127,9 at-Tawbah التوبة,"127 And whenever a chapter is revealed, they ..."
1361,9,128,9 at-Tawbah التوبة,128 There has come to you a messenger from am...


As expected, the sura **9** we scraped contains **128** verses instead of **129**, because verse **1** is missing. To fix this, we add the verse manually: we split the DataFrame at the row where the verse belongs and insert it there.

In [12]:
sourat_9_verset_1 = "1  A declaration of immunity from God and His Messenger to the polytheists with whom you had made a treaty."
row = {'index sourat':"9", 'index verse':"1", 'name sourat':"9  at-Tawbah  التوبة", 'verse in english':sourat_9_verset_1} 
# rows 0..1234 hold suras 1-8 (1235 verses), so sura 9 starts at position 1235;
# splitting at 1234 would place the new row before the last verse of sura 8
line = pd.DataFrame(row, index=[1235])
df = pd.concat([df.iloc[:1235], line, df.iloc[1235:]]).reset_index(drop=True)

In [13]:
df[df["index sourat"]=="9"]

Unnamed: 0,index sourat,index verse,name sourat,verse in english
1235,9,1,9 at-Tawbah التوبة,1 A declaration of immunity from God and His ...
1236,9,2,9 at-Tawbah التوبة,"2 So travel the land for four months, and kno..."
1237,9,3,9 at-Tawbah التوبة,3 And a proclamation from God and His Messeng...
1238,9,4,9 at-Tawbah التوبة,4 Except for those among the polytheists with...
1239,9,5,9 at-Tawbah التوبة,"5 When the Sacred Months have passed, kill th..."
...,...,...,...,...
1359,9,125,9 at-Tawbah التوبة,125 But as for those in whose hearts is sickn...
1360,9,126,9 at-Tawbah التوبة,126 Do they not see that they are tested once...
1361,9,127,9 at-Tawbah التوبة,"127 And whenever a chapter is revealed, they ..."
1362,9,128,9 at-Tawbah التوبة,128 There has come to you a messenger from am...
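
The position at which a sura starts follows from the verse counts of the suras before it, so it can be computed rather than hard-coded. The sketch below uses plain lists instead of the DataFrame. `first_row_position` is a hypothetical helper, and the counts are copied from `sourat_lenSourat`.

```python
# Verse counts of the first nine suras, copied from sourat_lenSourat.
counts = [(1, 7), (2, 286), (3, 200), (4, 176), (5, 120),
          (6, 165), (7, 206), (8, 75), (9, 129)]


def first_row_position(sura, counts):
    """0-based row position where `sura` starts, assuming all earlier suras are complete."""
    return sum(n for index, n in counts if index < sura)


position = first_row_position(9, counts)
print(position)  # 1235

# Toy rows as (sura, verse) pairs: suras 1-8 complete, sura 9 without verse 1.
rows = [(s, v) for s, n in counts[:8] for v in range(1, n + 1)]
rows += [(9, v) for v in range(2, 130)]
rows.insert(position, (9, 1))

# The new row sits between the last verse of sura 8 and verse 2 of sura 9.
print(rows[position - 1], rows[position], rows[position + 1])
# (8, 75) (9, 1) (9, 2)
```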


### Cleaning the dataset

In [14]:
def cleaning_numbers(data): # remove every run of digits
    return re.sub('[0-9]+', '', data)

df['verse in english'] = df['verse in english'].apply(cleaning_numbers)
df['name sourat'] = df['name sourat'].apply(cleaning_numbers)
df['name sourat'] = df['name sourat'].apply(lambda text: re.sub(r"\s\s+", " ", text)) # remove excess whitespace (raw string avoids an invalid-escape warning)
df

Unnamed: 0,index sourat,index verse,name sourat,verse in english
0,1,1,al-Fatihah الفاتحة,"In the name of God, the Gracious, the Merciful."
1,1,2,al-Fatihah الفاتحة,"Praise be to God, Lord of the Worlds."
2,1,3,al-Fatihah الفاتحة,"The Most Gracious, the Most Merciful."
3,1,4,al-Fatihah الفاتحة,Master of the Day of Judgment.
4,1,5,al-Fatihah الفاتحة,"It is You we worship, and upon You we call f..."
...,...,...,...,...
6231,114,2,an-Nas الـناس,The King of mankind.
6232,114,3,an-Nas الـناس,The God of mankind.
6233,114,4,an-Nas الـناس,From the evil of the sneaky whisperer.
6234,114,5,an-Nas الـناس,Who whispers into the hearts of people.


In [15]:
name_sourat = df['name sourat'].str.split(" ", expand = True)
name_sourat

Unnamed: 0,0,1,2,3,4
0,,al-Fatihah,الفاتحة,,
1,,al-Fatihah,الفاتحة,,
2,,al-Fatihah,الفاتحة,,
3,,al-Fatihah,الفاتحة,,
4,,al-Fatihah,الفاتحة,,
...,...,...,...,...,...
6231,,an-Nas,الـناس,,
6232,,an-Nas,الـناس,,
6233,,an-Nas,الـناس,,
6234,,an-Nas,الـناس,,


In [16]:
# Split the sura name into separate English and Arabic columns, then drop the combined column
df['name sourat in english'] = name_sourat[1]
df['name sourat in arabic'] = name_sourat[2]
df.drop(columns=['name sourat'], inplace = True)
df

Unnamed: 0,index sourat,index verse,verse in english,name sourat in english,name sourat in arabic
0,1,1,"In the name of God, the Gracious, the Merciful.",al-Fatihah,الفاتحة
1,1,2,"Praise be to God, Lord of the Worlds.",al-Fatihah,الفاتحة
2,1,3,"The Most Gracious, the Most Merciful.",al-Fatihah,الفاتحة
3,1,4,Master of the Day of Judgment.,al-Fatihah,الفاتحة
4,1,5,"It is You we worship, and upon You we call f...",al-Fatihah,الفاتحة
...,...,...,...,...,...
6231,114,2,The King of mankind.,an-Nas,الـناس
6232,114,3,The God of mankind.,an-Nas,الـناس
6233,114,4,From the evil of the sneaky whisperer.,an-Nas,الـناس
6234,114,5,Who whispers into the hearts of people.,an-Nas,الـناس
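
Splitting on a single space produces an empty column 0 in the table above, which suggests the cleaned name still starts with a space. As a hedged sketch, splitting on any whitespace with `str.split()` drops such empty pieces. The sample value is an assumption about what a cleaned name looks like, not a value read from the DataFrame.

```python
# Assumed shape of a cleaned name: digits removed, one leading space left over.
raw_name = " al-Fatihah الفاتحة"

print(raw_name.split(" "))  # ['', 'al-Fatihah', 'الفاتحة']
print(raw_name.split())     # ['al-Fatihah', 'الفاتحة']

english_name, arabic_name = raw_name.split()
```

This only covers names made of exactly two tokens. Columns 3 and 4 in the table indicate that some rows split into more pieces, and those would need separate handling.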


In [17]:
def remove_ponctuation(text):
    # replace each punctuation character with a space
    # (raw string so that the backslash is not read as an escape sequence)
    for ele in text:
        if ele in r'''!()-[]{}»«;:'"\,<>./?@#$%^&*_~''':
            text = text.replace(ele, " ")
    return text

df['verse in english'] = df['verse in english'].apply(remove_ponctuation)
df['name sourat in english'] = df['name sourat in english'].apply(remove_ponctuation)

In [18]:
# convert to lowercase
df['verse in english'] = df['verse in english'].str.lower()
df['name sourat in english'] = df['name sourat in english'].str.lower()

In [19]:
# remove excess whitespace 
df['verse in english'] = df['verse in english'].apply(lambda text: re.sub(r"\s\s+", " ", text))
df['name sourat in english'] = df['name sourat in english'].apply(lambda text: re.sub(r"\s\s+", " ", text))
df

Unnamed: 0,index sourat,index verse,verse in english,name sourat in english,name sourat in arabic
0,1,1,in the name of god the gracious the merciful,al fatihah,الفاتحة
1,1,2,praise be to god lord of the worlds,al fatihah,الفاتحة
2,1,3,the most gracious the most merciful,al fatihah,الفاتحة
3,1,4,master of the day of judgment,al fatihah,الفاتحة
4,1,5,it is you we worship and upon you we call for...,al fatihah,الفاتحة
...,...,...,...,...,...
6231,114,2,the king of mankind,an nas,الـناس
6232,114,3,the god of mankind,an nas,الـناس
6233,114,4,from the evil of the sneaky whisperer,an nas,الـناس
6234,114,5,who whispers into the hearts of people,an nas,الـناس
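
The cleaning steps of this section can also be written as one plain-Python function, which makes them easy to test on a single string. This is a sketch under assumptions, not the exact notebook logic: `clean_text` is a hypothetical name, it uses `string.punctuation` (a slightly larger character set than the one in `remove_ponctuation`), and it trims leading and trailing spaces, which the cells above do not do.

```python
import re
import string

# Map every punctuation character to a space; « and » are added
# because the notebook's own punctuation list contains them.
PUNCTUATION_TABLE = str.maketrans({ch: " " for ch in string.punctuation + "»«"})


def clean_text(text):
    """Remove digits, turn punctuation into spaces, lowercase, collapse whitespace."""
    text = re.sub(r"[0-9]+", "", text)
    text = text.translate(PUNCTUATION_TABLE)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("1  In the name of God, the Gracious, the Merciful."))
# in the name of god the gracious the merciful
```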


### Saving the dataframe

In [None]:
df.to_csv("Coran english processed.csv", encoding="utf-8-sig", index=False)

### Reading the dataset again

In [21]:
df = pd.read_csv("Coran english processed.csv", encoding='utf-8-sig') # same encoding as used when saving
df

Unnamed: 0,index sourat,index verse,verse in english,name sourat in english,name sourat in arabic
0,1,1,in the name of god the gracious the merciful,al fatihah,الفاتحة
1,1,2,praise be to god lord of the worlds,al fatihah,الفاتحة
2,1,3,the most gracious the most merciful,al fatihah,الفاتحة
3,1,4,master of the day of judgment,al fatihah,الفاتحة
4,1,5,it is you we worship and upon you we call for...,al fatihah,الفاتحة
...,...,...,...,...,...
6231,114,2,the king of mankind,an nas,الـناس
6232,114,3,the god of mankind,an nas,الـناس
6233,114,4,from the evil of the sneaky whisperer,an nas,الـناس
6234,114,5,who whispers into the hearts of people,an nas,الـناس
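
The file is saved with `utf-8-sig`, which writes a byte order mark (BOM) at the start of the file. The sketch below uses only the standard library to show what that mark does to the first column name when the bytes are decoded as plain `utf-8`. Whether a particular CSV reader strips the mark by itself depends on the library, so decoding with `utf-8-sig` is the cautious choice. The tiny in-memory CSV reuses the first row shown above.

```python
import csv
import io

header = ["index sourat", "index verse", "verse in english"]
row = ["1", "1", "in the name of god the gracious the merciful"]

# Write a tiny CSV to memory, then encode it the way the notebook saves its file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows([header, row])
data = buffer.getvalue().encode("utf-8-sig")

# Decode the same bytes in two ways and read back the header line.
plain = next(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
with_sig = next(csv.reader(io.StringIO(data.decode("utf-8-sig"), newline="")))

print(repr(plain[0]))     # '\ufeffindex sourat'
print(repr(with_sig[0]))  # 'index sourat'
```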


> - 🙌 Notebook made by [@labriji_saad](https://github.com/labrijisaad)
> - 🔗 LinkedIn [@labriji_saad](https://www.linkedin.com/in/labrijisaad/)