# Articles Collection

This notebook scrapes the [archive](https://thescipub.com/jcs/archive) of the Journal of Computer Science, published by [Science Publications](https://thescipub.com/), and downloads every archived article as a PDF file.

## Web-Scraping: Retrieve Issue URLs

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# url of Science Publications web page
main_url = 'https://thescipub.com'

### Retrieve Issue URLs for all Journal Issues

In [2]:
years = []       # year of publication
volumes = []     # volume number
issues = []      # issue number in a given Volume
urls = []        # url of Journal's issue

# grab contents from the archive page for Journal of Computer Science
page = requests.get(main_url + '/jcs/archive', timeout=30)
page.raise_for_status()  # stop early if the archive page could not be fetched

# create BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')

# get all divs with class 'pkp_block'
components = soup.find_all('div', class_='pkp_block')

# iterate through every div with class 'pkp_block'
for c in components:
    # get the text for h2 tag: yyyy - VolumeNumber
    volume_text = c.find('h2').get_text().split(" ")
    
    # get all Issue urls for a given Volume
    links = c.find_all('a', href=True)
    
    # iterate through the urls
    # and append data to the lists (years, volumes, issues, urls) for data frame creation
    for a in links:
        years.append(int(volume_text[0]))
        volumes.append(int(volume_text[-1]))
        issues.append(int(a.get_text().split(' ')[-1]))
        urls.append(main_url + a['href'])

# create a Pandas data frame
journal_archive = pd.DataFrame({'Year': years, 'Volume': volumes, 'Issue#': issues, 'urls': urls})

# display the first 5 rows of data frame
journal_archive.head()

Unnamed: 0,Year,Volume,Issue#,urls
0,2021,17,1,https://thescipub.com/jcs/issue/1273
1,2021,17,2,https://thescipub.com/jcs/issue/1288
2,2021,17,3,https://thescipub.com/jcs/issue/1292
3,2021,17,4,https://thescipub.com/jcs/issue/1299
4,2021,17,5,https://thescipub.com/jcs/issue/1303
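
The loop above assumes that specific space-separated tokens of each heading and link text are numbers, and it raises an exception as soon as one is not. As a minimal sketch of a more defensive version of that step, the hypothetical helper `int_token` below returns `None` when the text has an unexpected shape. The sample strings are assumptions that only mimic the `yyyy - VolumeNumber` pattern; the live page may word its headings differently.

```python
def int_token(text, index):
    """Return the whitespace-separated token at `index` as an int, or None.

    Hypothetical helper: None signals that the text did not have the
    expected shape, so the caller can skip the block instead of crashing.
    """
    tokens = text.split()  # split() without arguments ignores repeated whitespace
    try:
        return int(tokens[index])
    except (IndexError, ValueError):
        return None


# Sample strings are assumptions, not copied from the live archive page.
print(int_token('2021 - Volume 17', 0))   # 2021
print(int_token('2021 - Volume 17', -1))  # 17
print(int_token('Issue 3', -1))           # 3
print(int_token('Archive', -1))           # None
```

In the scraping loop, a `None` result could be used to `continue` past a block that is not a volume listing.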


In [3]:
# save to csv (create the 'data' folder first if it does not exist)
import os
os.makedirs('data', exist_ok=True)
journal_archive.to_csv('data/journal_archive.csv', index=False)

## Web-Scraping: Download Articles

In [4]:
from tqdm import tqdm

titles = []             # article titles
article_urls = []       # article's url to download pdf file
file_names = []         # name of article's pdf file saved in local machine

# iterate through every Journal Issue's url
for i, issue_url in enumerate(urls):
    # grab contents from the issue's webpage
    page = requests.get(issue_url, timeout=30)
    page.raise_for_status()  # stop if the issue page could not be fetched
    
    # create BeautifulSoup object
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # get all divs with class 'obj_article_summary'
    components = soup.find_all('div', class_='obj_article_summary')
    
    # iterate through every div with class 'obj_article_summary'
    # and append data to the lists (titles, article_urls, file_names) for data frame creation
    for c in tqdm(components):
        # get url of pdf file; skip entries without a PDF (galley) link
        # before appending anything, so the three lists stay the same length
        galley = c.find('div', class_='galley_link')
        link = galley.find('a', href=True) if galley else None
        if link is None:
            continue
        url = main_url + link['href']
        
        # get article's title
        titles.append(c.find('div', class_='title').get_text())
        
        # name of article's pdf saved in local machine
        filename = f'{years[i]}_{volumes[i]}_{issues[i]}_' + url.split('/')[-1]
        file_names.append(filename)
        article_urls.append(url)
        
        # download article and save to local machine under folder 'articles'
        # (the folder must already exist); fail loudly on HTTP errors
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        with open('articles/' + filename, 'wb') as pdf:
            pdf.write(response.content)

100%|██████████| 5/5 [00:02<00:00,  2.16it/s]
100%|██████████| 7/7 [00:03<00:00,  2.09it/s]
100%|██████████| 16/16 [00:07<00:00,  2.03it/s]
100%|██████████| 6/6 [00:03<00:00,  1.90it/s]
100%|██████████| 6/6 [00:03<00:00,  1.95it/s]
100%|██████████| 5/5 [00:03<00:00,  1.41it/s]
100%|██████████| 5/5 [00:02<00:00,  1.93it/s]
100%|██████████| 4/4 [00:01<00:00,  2.08it/s]
100%|██████████| 2/2 [00:01<00:00,  1.82it/s]
100%|██████████| 10/10 [00:04<00:00,  2.30it/s]
100%|██████████| 14/14 [00:05<00:00,  2.46it/s]
100%|██████████| 12/12 [00:04<00:00,  2.68it/s]
100%|██████████| 16/16 [00:06<00:00,  2.43it/s]
100%|██████████| 13/13 [00:05<00:00,  2.34it/s]
100%|██████████| 11/11 [00:04<00:00,  2.22it/s]
100%|██████████| 17/17 [00:07<00:00,  2.15it/s]
100%|██████████| 10/10 [00:04<00:00,  2.18it/s]
100%|██████████| 13/13 [00:05<00:00,  2.20it/s]
100%|██████████| 14/14 [00:06<00:00,  2.25it/s]
100%|██████████| 14/14 [00:08<00:00,  1.69it/s]
100%|██████████| 12/12 [00:06<00:00,  1.74it/s]
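
Re-running the download cell fetches every PDF again. Below is a minimal sketch of a resumable variant, assuming it is acceptable to treat any non-empty file already present in `articles/` as complete. The hypothetical helper `save_if_missing` takes the destination path and a zero-argument callable that returns the file's bytes, so the network request only happens when the file is absent.

```python
from pathlib import Path


def save_if_missing(dest, fetch_bytes):
    """Write fetch_bytes() to dest unless dest already exists and is non-empty.

    Returns True when the file was written, False when it was skipped.
    Hypothetical helper; `fetch_bytes` is any zero-argument callable
    that returns the file content as bytes.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False

    dest.parent.mkdir(parents=True, exist_ok=True)
    data = fetch_bytes()

    # Write under a temporary name first, so an interrupted run never leaves
    # a half-written PDF that a later run would mistake for a finished one.
    tmp = dest.with_name(dest.name + '.part')
    tmp.write_bytes(data)
    tmp.replace(dest)
    return True


# Possible use inside the download loop (requests is only needed here):
#     def fetch():
#         r = requests.get(url, timeout=30)
#         r.raise_for_status()
#         return r.content
#     save_if_missing('articles/' + filename, fetch)
```

Passing the fetch step in as a callable keeps the helper independent of `requests`, which also makes it easy to try out without network access.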

In [5]:
# create a data frame for articles
articles = pd.DataFrame({'Title': titles, 'File Name': file_names, 'URL': article_urls})
articles.head()

Unnamed: 0,Title,File Name,URL
0,A Systematic Literature Review on English and ...,2021_17_1_jcssp.2021.1.18.pdf,https://thescipub.com/pdf/jcssp.2021.1.18.pdf
1,DAD: A Detailed Arabic Dataset for Online Text...,2021_17_1_jcssp.2021.19.32.pdf,https://thescipub.com/pdf/jcssp.2021.19.32.pdf
2,Collision Avoidance Modelling in Airline Traff...,2021_17_1_jcssp.2021.33.43.pdf,https://thescipub.com/pdf/jcssp.2021.33.43.pdf
3,Fine-Tuned MobileNet Classifier for Classifica...,2021_17_1_jcssp.2021.44.54.pdf,https://thescipub.com/pdf/jcssp.2021.44.54.pdf
4,A Content Filtering from Spam Posts on Social ...,2021_17_1_jcssp.2021.55.66.pdf,https://thescipub.com/pdf/jcssp.2021.55.66.pdf
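
The `File Name` column above combines year, volume, and issue number with the last path segment of the PDF URL. As a sketch, the same rule can be written as a small function. `local_pdf_name` is a hypothetical name, and parsing the URL with `urllib.parse` instead of splitting on `/` is a design choice made here so that a query string, should one ever appear, does not end up in the file name.

```python
import posixpath
from urllib.parse import urlsplit


def local_pdf_name(year, volume, issue, pdf_url):
    """Build '<year>_<volume>_<issue>_<last path segment of pdf_url>'."""
    # urlsplit(...).path drops the scheme, host, query string and fragment.
    basename = posixpath.basename(urlsplit(pdf_url).path)
    return f'{year}_{volume}_{issue}_{basename}'


# Reproduces the first row of the table above.
print(local_pdf_name(2021, 17, 1, 'https://thescipub.com/pdf/jcssp.2021.1.18.pdf'))
# 2021_17_1_jcssp.2021.1.18.pdf
```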


In [6]:
# save articles as csv file
articles.to_csv('data/articles.csv', index=False)
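
Once the CSV file is written, a quick consistency check can confirm that every file it lists actually exists on disk. The helper below is a hypothetical sketch that uses only the standard library; the column name `File Name` and the folder `articles` are taken from the cells above.

```python
import csv
from pathlib import Path


def missing_files(csv_path, folder, column='File Name'):
    """Return the names listed in csv_path[column] that are not files in folder."""
    folder = Path(folder)
    with open(csv_path, newline='', encoding='utf-8') as f:
        return [row[column] for row in csv.DictReader(f)
                if not (folder / row[column]).is_file()]


# Possible use after the cells above have run:
#     missing = missing_files('data/articles.csv', 'articles')
#     print(len(missing), 'files are missing')
```

An empty list means every listed article was saved; any names returned are candidates for a second download attempt.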