## Getting Started

Links to the pages used for data scraping:

[english ahram business / economy](https://english.ahram.org.eg/AllCategory/3/12/Business/Economy/0.aspx)

[english ahram sports / world](https://english.ahram.org.eg/AllCategory/6/55/Sports/World/0.aspx)

[english ahram arts&culture / visualart](https://english.ahram.org.eg/AllCategory/5/25/Arts%20&%20Culture/Visual%20Art/0.aspx)

[english ahram arts&culture / film](https://english.ahram.org.eg/AllCategory/5/32/Arts%20&%20Culture/Film/0.aspx)

[english ahram arts&culture / music](https://english.ahram.org.eg/AllCategory/5/33/Arts%20&%20Culture/Music/0.aspx)

[english ahram arts&culture / stage&street](https://english.ahram.org.eg/AllCategory/5/35/Arts%20&%20Culture/Stage%20&%20Street/0.aspx)

* These links point to the first page of each category: economy, world sports, visual art, film, music, and stage & street.

* Each page contains 20 articles (shown in Logic Testing).

* We can use a for loop to access different pages and scrape article titles from them.
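
The loop described in the last bullet can be sketched as a small reusable helper. This is a minimal sketch, not the code this notebook actually runs: `scrape_category`, `fetch_titles`, and `fake_fetch` are hypothetical names, and it assumes that a page past the end of a category yields no articles, so the loop can stop early instead of relying on a hard-coded upper bound.

```python
from typing import Callable, List

PAGE_SIZE = 20  # articles per page, as noted in the bullets above


def scrape_category(base_url: str,
                    fetch_titles: Callable[[str], List[str]],
                    max_pages: int = 500) -> List[str]:
    """Collect titles page by page until an empty page or max_pages is reached."""
    titles: List[str] = []
    for page in range(max_pages):
        offset = page * PAGE_SIZE
        url = f"{base_url}{offset}.aspx"
        page_titles = fetch_titles(url)
        if not page_titles:
            # assumption: an empty page means we ran past the last page
            break
        titles.extend(page_titles)
    return titles


def fake_fetch(url: str) -> List[str]:
    """Stand-in for a real request: pretends the category has 45 articles."""
    last_part = url.rsplit("/", 1)[1]
    offset = int(last_part[:-len(".aspx")])
    remaining = max(0, 45 - offset)
    return [f"article {offset + k}" for k in range(min(PAGE_SIZE, remaining))]


demo = scrape_category("https://example.invalid/Category/", fake_fetch)
print(len(demo))  # 45
```

In real use, `fetch_titles` would wrap the `requests` and `BeautifulSoup` calls shown in the cells of this notebook; the fake version only makes the stopping logic easy to check without touching the network.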

___ 
One thing you will notice is that **the Ahram site** indexes pages in multiples of 20, starting from 0.

**That is**, page 1: 0, page 2: 20, page 3: 40, page 4: 60, and so on.

You can check this by switching from page 1 to 2 to 3 and looking at the number at the end of the URL.
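
The indexing rule above can be written as a pair of conversions. This is a minimal sketch: `page_to_offset`, `offset_to_page`, and `page_url` are hypothetical helper names, and the base URL is the economy link from the list above.

```python
PAGE_SIZE = 20  # the number at the end of the URL grows by 20 per page

ECONOMY_BASE = "https://english.ahram.org.eg/AllCategory/3/12/Business/Economy/"


def page_to_offset(page: int) -> int:
    """Map a 1-based page number to the number used at the end of the URL."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * PAGE_SIZE


def offset_to_page(offset: int) -> int:
    """Map a URL offset (0, 20, 40, ...) back to its 1-based page number."""
    if offset < 0 or offset % PAGE_SIZE != 0:
        raise ValueError("offset must be a non-negative multiple of 20")
    return offset // PAGE_SIZE + 1


def page_url(base_url: str, page: int) -> str:
    """Build the URL of a given page of a category."""
    return f"{base_url}{page_to_offset(page)}.aspx"


print([page_to_offset(p) for p in (1, 2, 3, 4)])  # [0, 20, 40, 60]
print(page_url(ECONOMY_BASE, 3))
```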

In [1]:
# 1st step : import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Increase width to see the article_title clearly
pd.set_option('max_colwidth', None)

### Logic Testing


In [2]:
# 2nd step: use requests to fetch the url
response = requests.get("https://english.ahram.org.eg/AllCategory/3/12/Business/Economy/0.aspx")
# 3rd step: save page content/markup
src = response.content
# 4th step: create a beautifulsoup object to parse content
soup = BeautifulSoup(src, 'lxml')

In [3]:
#print(src)
#print()
#print(soup)

In [4]:
# 5th step: find the elements containing info we need --> article title
article_title = []
articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})

# iterate over the matched elements directly and keep their text
for block in articles_titles:
    article_title.append(block.text)

print(article_title[2])
print(f"Number of articles per page = {len(article_title)}")



Remittance flows to Egypt expected to grow by 8% in 2022: World Bank



Remittance flows to Egypt are expected to inch up by eight percent in 2022 despite the repercussions of the global economic challenges, the World Bank stated.




Number of articles per page = 20
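
The printed element above holds both the headline and a one-sentence summary, separated by blank lines. If only the headline is needed, the text could be split as in the following sketch. `split_title_summary` is a hypothetical helper that is not used elsewhere in this notebook, and it assumes the first non-empty line is always the headline.

```python
from typing import Tuple


def split_title_summary(text: str) -> Tuple[str, str]:
    """Split a scraped block into (headline, summary) using its non-empty lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "", ""
    # assumption: first non-empty line is the headline, the rest is the summary
    return lines[0], " ".join(lines[1:])


sample = (
    "\n\n\nRemittance flows to Egypt expected to grow by 8% in 2022: World Bank"
    "\n\n\n\nRemittance flows to Egypt are expected to inch up by eight percent "
    "in 2022 despite the repercussions of the global economic challenges, "
    "the World Bank stated.\n\n\n\n"
)
headline, summary = split_title_summary(sample)
print(headline)
```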


## Economy

In [None]:
%%time
title_economy = []
for i in range(0, 10000, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/3/12/Business/Economy/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_economy
    for block in articles_titles:
        title_economy.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

In [None]:
# testing
print(title_economy[3])

In [None]:
economy_articles = pd.DataFrame({"article_title": title_economy})
economy_articles['category'] = "economy"
# remove leading and trailing whitespace
economy_articles['article_title'] = economy_articles['article_title'].str.strip()

In [None]:
display(economy_articles.head())
print()
print(economy_articles.shape)

In [None]:
economy_articles.to_csv("economy_2.csv")

## Sports

In [None]:
%%time
title_sports = []
for i in range(0, 10000, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/6/55/Sports/World/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_sports
    for block in articles_titles:
        title_sports.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

In [None]:
# testing
print(title_sports[2])

In [None]:
sports_articles = pd.DataFrame({"article_title": title_sports})
sports_articles['category'] = "sports"
# remove leading and trailing whitespace
sports_articles['article_title'] = sports_articles['article_title'].str.strip()

In [None]:
display(sports_articles.head())
print()
print(sports_articles.shape)

In [None]:
sports_articles.to_csv("sports_2.csv")

The economy and sports data (10000 and 9880 records) are close in size, but the visual art data (1775 records) is much smaller in comparison.

We'll need to collect more data from other arts categories.

I checked the pages of the other categories and found that:

* The film category has 187 pages --> (3740) records.

* The music category has 140 pages --> (2800) records.

* The stage&street category has 115 pages --> (2300) records.

So, together with the 1775 visual art records, art sums up to about (10615) records.
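
The estimate above is just pages times 20, plus the visual art records already collected. A short sketch of the arithmetic, using only the page counts listed above (the variable names are illustrative):

```python
PAGE_SIZE = 20  # articles per page

visual_art_records = 1775
pages_per_category = {"film": 187, "music": 140, "stage&street": 115}

# estimated records per category = number of pages * articles per page
estimated = {name: pages * PAGE_SIZE for name, pages in pages_per_category.items()}
total_art = visual_art_records + sum(estimated.values())

print(estimated)
print(total_art)  # 10615
```

This is only an estimate, since the actual count depends on how many articles each category holds when the scrape runs.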

## Arts

### VisualArt

In [5]:
%%time
title_arts = []
for i in range(0, 1870, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/5/25/Arts%20&%20Culture/Visual%20Art/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_arts
    for block in articles_titles:
        title_arts.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

iteration 0 is done
iteration 1 is done
iteration 2 is done
iteration 3 is done
iteration 4 is done
iteration 5 is done
iteration 6 is done
iteration 7 is done
iteration 8 is done
iteration 9 is done
iteration 10 is done
iteration 11 is done
iteration 12 is done
iteration 13 is done
iteration 14 is done
iteration 15 is done
iteration 16 is done
iteration 17 is done
iteration 18 is done
iteration 19 is done
iteration 20 is done
iteration 21 is done
iteration 22 is done
iteration 23 is done
iteration 24 is done
iteration 25 is done
iteration 26 is done
iteration 27 is done
iteration 28 is done
iteration 29 is done
iteration 30 is done
iteration 31 is done
iteration 32 is done
iteration 33 is done
iteration 34 is done
iteration 35 is done
iteration 36 is done
iteration 37 is done
iteration 38 is done
iteration 39 is done
iteration 40 is done
iteration 41 is done
iteration 42 is done
iteration 43 is done
iteration 44 is done
iteration 45 is done
iteration 46 is done
iteration 47 is done
it

In [6]:
print("len() After appending visualart:",len(title_arts))

len() After appending visualart: 1775


### Film

In [7]:
%%time
for i in range(0, 3840, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/5/32/Arts%20&%20Culture/Film/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_arts
    for block in articles_titles:
        title_arts.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

iteration 0 is done
iteration 1 is done
iteration 2 is done
iteration 3 is done
iteration 4 is done
iteration 5 is done
iteration 6 is done
iteration 7 is done
iteration 8 is done
iteration 9 is done
iteration 10 is done
iteration 11 is done
iteration 12 is done
iteration 13 is done
iteration 14 is done
iteration 15 is done
iteration 16 is done
iteration 17 is done
iteration 18 is done
iteration 19 is done
iteration 20 is done
iteration 21 is done
iteration 22 is done
iteration 23 is done
iteration 24 is done
iteration 25 is done
iteration 26 is done
iteration 27 is done
iteration 28 is done
iteration 29 is done
iteration 30 is done
iteration 31 is done
iteration 32 is done
iteration 33 is done
iteration 34 is done
iteration 35 is done
iteration 36 is done
iteration 37 is done
iteration 38 is done
iteration 39 is done
iteration 40 is done
iteration 41 is done
iteration 42 is done
iteration 43 is done
iteration 44 is done
iteration 45 is done
iteration 46 is done
iteration 47 is done
it

In [8]:
print("len() After appending film:", len(title_arts))

len() After appending film: 5521


### Music

In [9]:
%%time
for i in range(0, 2900, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/5/33/Arts%20&%20Culture/Music/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_arts
    for block in articles_titles:
        title_arts.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

iteration 0 is done
iteration 1 is done
iteration 2 is done
iteration 3 is done
iteration 4 is done
iteration 5 is done
iteration 6 is done
iteration 7 is done
iteration 8 is done
iteration 9 is done
iteration 10 is done
iteration 11 is done
iteration 12 is done
iteration 13 is done
iteration 14 is done
iteration 15 is done
iteration 16 is done
iteration 17 is done
iteration 18 is done
iteration 19 is done
iteration 20 is done
iteration 21 is done
iteration 22 is done
iteration 23 is done
iteration 24 is done
iteration 25 is done
iteration 26 is done
iteration 27 is done
iteration 28 is done
iteration 29 is done
iteration 30 is done
iteration 31 is done
iteration 32 is done
iteration 33 is done
iteration 34 is done
iteration 35 is done
iteration 36 is done
iteration 37 is done
iteration 38 is done
iteration 39 is done
iteration 40 is done
iteration 41 is done
iteration 42 is done
iteration 43 is done
iteration 44 is done
iteration 45 is done
iteration 46 is done
iteration 47 is done
it

In [10]:
print("len() After appending music:", len(title_arts))

len() After appending music: 8319


### Stage&Street

In [11]:
%%time
for i in range(0, 2400, 20):
    response = requests.get("https://english.ahram.org.eg/AllCategory/5/35/Arts%20&%20Culture/Stage%20&%20Street/"+str(i)+".aspx")
    src = response.content
    soup = BeautifulSoup(src, 'lxml')
    articles_titles = soup.find_all("div", {"class":"col-md-12 col-lg-12 mar-top-outer"})
    # append the articles on each page to title_arts
    for block in articles_titles:
        title_arts.append(block.text)
    print(f"iteration {i // 20} is done")
print("DONE !")

iteration 0 is done
iteration 1 is done
iteration 2 is done
iteration 3 is done
iteration 4 is done
iteration 5 is done
iteration 6 is done
iteration 7 is done
iteration 8 is done
iteration 9 is done
iteration 10 is done
iteration 11 is done
iteration 12 is done
iteration 13 is done
iteration 14 is done
iteration 15 is done
iteration 16 is done
iteration 17 is done
iteration 18 is done
iteration 19 is done
iteration 20 is done
iteration 21 is done
iteration 22 is done
iteration 23 is done
iteration 24 is done
iteration 25 is done
iteration 26 is done
iteration 27 is done
iteration 28 is done
iteration 29 is done
iteration 30 is done
iteration 31 is done
iteration 32 is done
iteration 33 is done
iteration 34 is done
iteration 35 is done
iteration 36 is done
iteration 37 is done
iteration 38 is done
iteration 39 is done
iteration 40 is done
iteration 41 is done
iteration 42 is done
iteration 43 is done
iteration 44 is done
iteration 45 is done
iteration 46 is done
iteration 47 is done
it

In [12]:
# testing
print(title_arts[2])



Lebanese artist Haig Aivazian's video installation to be presented during 72nd Berlin Int'l Film Festival



The video installation All of Your Stars are But Dust on My Shoes by the Lebanese artist will be presented within the Forum Expanded programme of the upcoming Berlinale.






In [13]:
art_articles = pd.DataFrame({"article_title": title_arts})
art_articles['category'] = "art"
# remove leading and trailing whitespace
art_articles['article_title'] = art_articles['article_title'].str.strip()

In [14]:
display(art_articles.head(1))
print()
display(art_articles.tail(1))
print(art_articles.shape)

Unnamed: 0,article_title,category
0,The first Art Exhibition for Marginalised Children in Egypt to take place in February\n\n\n\nThe first ‘Dream’ Art Exhibition for Marginalised Children will take place from 14 to 19 February 2022 in Cairo at the Townhouse Gallery to shed light and create social awareness on vital issues facing marginalised children in Egypt.,art





Unnamed: 0,article_title,category
10644,"A Humourless Night\n\n\n\nOn 12 November a stand-up comedy show with Akram Hosny took place at El Sakia culturewheel, with its theme - the current education system. The material was adapted from the satirical book Awel Mokarer by Haitham Dabbour.",art


(10645, 2)


In [15]:
art_articles.to_csv("arts_2.csv")

## Final Thoughts

I didn't run the economy and sports sections in this notebook, as I already have their CSV files full of records.

So I only ran the arts section.

In the next notebook, we will combine these 3 CSV files into one dataframe and do some further data cleaning before beginning our NLP pipeline.