# Scraping of journals on renewable energy, sustainability, and environment
   ![](https://i.imgur.com/IqJFN17.jpg)

### Introduction

While working at Vizag Steel as an industrial engineer, I got first-hand experience of the polluting impact of the steel and power industries. I designed simulations and wrote programs to reduce the plant’s overall carbon emissions. What stuck with me is that the steel industry emits 3 tons of CO2 for every ton of crude steel produced. At the same time, I am excited by the potential of those same CO2 emissions to be ‘electrolyzed’ into fuels and petrochemicals, which would reduce the CO2 released into the environment. So I decided to scrape the ranking of journals in the field of renewable energy, sustainability and the environment from this [website](https://www.scimagojr.com/journalrank.php?category=2105). These journals are a good way to follow the latest research in the field of sustainability.

### Web scraping

Web scraping is the process of extracting information from a website in a structured form. Web pages are generally written in HTML (HyperText Markup Language), where the data is unstructured. Web scraping can be done manually, with bots, or with an automated program. Some applications of web scraping include price analysis, market research and sentiment analysis. Here I scrape the data using Python, its built-in library and some external libraries. For more information on web scraping, [click here](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/).

### Objective and Outline

Here is the outline of steps I will follow:
1. Download the web page using requests library.
2. Parse the HTML source code using beautifulsoup4 to get journal title, journal url, SJR Index, H Index and country.
3. Extract the required information from page.
4. Compile extracted information into python list.
5. Save the extracted information to a csv file.

The final csv file should look like this:

![](https://i.imgur.com/nuaEmrE.png)

### Installing required packages

First of all, I will install the required packages using pip, the package manager for Python, and then import them. The requests library is used to download the webpage to be parsed, BeautifulSoup4 is used to parse the webpage, and pandas is used to arrange the information in a structured form (dataframes) and to write csv files.

In [1]:
# Install the library
!pip install requests --upgrade --quiet
import requests

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

!pip install pandas --upgrade --quiet
import pandas as pd

### Download the webpage using requests

After installing and importing the required packages, I will download the webpage to be parsed and check the response status code to see whether the download succeeded. A successful download returns a status code in the range 200-299. The full list of status codes can be checked [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

In [2]:
url = 'https://www.scimagojr.com/journalrank.php?category=2105&page=1&total_size=203'

In [3]:
response = requests.get(url)
print(response.status_code)

# we can also check type of response
print(type(response))

200
<class 'requests.models.Response'>
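
The 200-299 rule mentioned above can be written as a small check. This is only a sketch: `is_success` is a hypothetical helper name, not part of requests, and the status codes in the loop are sample values chosen for illustration. `HTTPStatus` comes from the Python standard library.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True when the status code is in the 200-299 range (hypothetical helper)."""
    return 200 <= status_code <= 299


# Sample status codes chosen for illustration only
for code in (200, 204, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, is_success(code))
```

In the notebook this could be called as `is_success(response.status_code)` right after `requests.get(url)`.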


The contents of the web page can be accessed using the .text property of the response, and their length can be checked using the len function. Since the page is long, I view only the first 500 characters using slice notation.

In [4]:
page_contents = response.text
print(len(page_contents))

93871


In [5]:
page_contents[:500]

'\n\n<html>\n  <head>\n \n  <title>Journal Rankings on Renewable Energy, Sustainability and the Environment</title>\n  <meta name="viewport" content="width=device-width, user-scalable=no" />\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n  <meta http-equiv="keywords" content="scientific information, scopus, journals, authors, citation, impact factor" />\n  <meta http-equiv="description" content="International Scientific Journal &amp; Country Ranking" />\n  <meta http-equiv="DC.De'

Now I can save the webpage as an html file.

In [6]:
# saving the downloaded webpage 
with open('RenewableEnergySustainabillity.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)

In [7]:
# reading the saved file back
with open('RenewableEnergySustainabillity.html', 'r', encoding="utf-8") as f:
    html_link = f.read()

On opening the saved file, I found that it looks similar to the original webpage.

![](https://i.imgur.com/QFnvSpf.png)

### Parsing the webpage using BeautifulSoup

Now I create a BeautifulSoup object from the html file to parse the webpage for the required information, and check the type of this object.

In [8]:
doc = BeautifulSoup(html_link, 'html.parser')
print(type(doc))

<class 'bs4.BeautifulSoup'>


Now I can summarise the work up to this point in the form of a function. The function below returns a document of class 'bs4.BeautifulSoup'.

In [9]:
# defining a function to get the page and then converting it into doc 

import requests
from bs4 import BeautifulSoup

def get_doc(url):
    
    # Get the HTML page content using requests
    response = requests.get(url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + url)
    
    # Construct a beautiful soup document
    docs = BeautifulSoup(response.text,'html.parser')

    return docs

Further, the page title can be accessed using the .title property of the BeautifulSoup object.

In [10]:
# The title of the page is contained within the <title> tag. We can access the title tag using doc.title

title = doc.title
print(title)
print(title.text)

<title>Journal Rankings on Renewable Energy, Sustainability and the Environment</title>
Journal Rankings on Renewable Energy, Sustainability and the Environment


In [11]:
# accessing first image 
doc.find('img')

<img alt="" src="./img/siricon.png"/>

### Extracting journal name, journal url, type, SJR Index, H Index and country from doc file

Now, I can inspect the downloaded webpage to get required information.

![](https://i.imgur.com/TnYITpC.png)

On inspecting the webpage, I found that the required information is contained in td tags, which are nested inside tr tags, which are in turn nested inside the tbody tag, and the tbody tag itself is nested inside the table tag. So first we access the table tag, then the tbody tag and then the tr tags. To access these tags we can use the find or find_all methods of BeautifulSoup.

In [12]:
# Accessing table tags which contains tbody tag

table_tags = doc.find_all('table')
print(len(table_tags))
print(table_tags)

1
[<table>
<thead>
<tr>
<th></th>
<th class="tit" title="Journal title">Title</th>
<th title="Type of publication">Type</th>
<th title="SCImago Journal Rank indicator. It is a measure of journal's impact, influence or prestige. It expresses the average number of weighted citations received in the selected year by the documents published in the journal in the three previous years"><a href="journalrank.php?category=2105&amp;order=sjr&amp;ord=asc" style="color:rgb(51,51,51)"><img src="img/sorted_down.png"/>SJR</a></th>
<th title="Journal's number of articles (h) that have received at least h citations over the whole period"><a href="journalrank.php?category=2105&amp;order=h&amp;ord=desc">H index</a></th>
<th title="Journal's published articles in 2021 All type of documents are considered"><a href="journalrank.php?category=2105&amp;order=item&amp;ord=desc">Total Docs. (2021)</a></th>
<th title="Journal's published articles in 2020, 2019 and 2018. All type of documents are considered"><a hr

In [13]:
# Accessing tbody tags which contains tr tag

tbody_tags = table_tags[0].find_all('tbody')
print(len(tbody_tags))
tbody_tags

1


[<tbody>
 <tr><td>1</td><td class="tit"><a href="journalsearch.php?q=21100812579&amp;tip=sid&amp;clean=0" title="view journal details">Nature Energy</a></td><td>journal</td><td class="orde">16.736 <span class="q1">Q1</span></td><td>160</td><td>204</td><td>632</td><td>6619</td><td>17706</td><td>377</td><td>25.73</td><td>32.45</td><td><img alt="US" src="banderas/us.png?v=1.2" title="United States"/></td></tr><tr><td>2</td><td class="tit"><a href="journalsearch.php?q=17500155114&amp;tip=sid&amp;clean=0" title="view journal details">Energy and Environmental Science</a></td><td>journal</td><td class="orde">11.558 <span class="q1">Q1</span></td><td>376</td><td>385</td><td>910</td><td>41592</td><td>33154</td><td>903</td><td>34.54</td><td>108.03</td><td><img alt="GB" src="banderas/gb.png?v=1.2" title="United Kingdom"/></td></tr><tr><td>3</td><td class="tit"><a href="journalsearch.php?q=21100199127&amp;tip=sid&amp;clean=0" title="view journal details">Advanced Energy Materials</a></td><td>journ

In [14]:
# Accessing tr tags which contains td tag

all_tr_tags = tbody_tags[0].find_all('tr')
print(len(all_tr_tags))
print(all_tr_tags)

50
[<tr><td>1</td><td class="tit"><a href="journalsearch.php?q=21100812579&amp;tip=sid&amp;clean=0" title="view journal details">Nature Energy</a></td><td>journal</td><td class="orde">16.736 <span class="q1">Q1</span></td><td>160</td><td>204</td><td>632</td><td>6619</td><td>17706</td><td>377</td><td>25.73</td><td>32.45</td><td><img alt="US" src="banderas/us.png?v=1.2" title="United States"/></td></tr>, <tr><td>2</td><td class="tit"><a href="journalsearch.php?q=17500155114&amp;tip=sid&amp;clean=0" title="view journal details">Energy and Environmental Science</a></td><td>journal</td><td class="orde">11.558 <span class="q1">Q1</span></td><td>376</td><td>385</td><td>910</td><td>41592</td><td>33154</td><td>903</td><td>34.54</td><td>108.03</td><td><img alt="GB" src="banderas/gb.png?v=1.2" title="United Kingdom"/></td></tr>, <tr><td>3</td><td class="tit"><a href="journalsearch.php?q=21100199127&amp;tip=sid&amp;clean=0" title="view journal details">Advanced Energy Materials</a></td><td>journal

At this point I can define a function that returns all tr tags from the document.

In [15]:
def parse_tags(docs):
    table_tags = docs.find_all('table')
    tbody_tags = table_tags[0].find_all('tbody')
    tr_tags = tbody_tags[0].find_all('tr')
    return tr_tags
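
The table, tbody, tr nesting can be tried offline on a small hand-written table. The sketch below is illustrative only: `SAMPLE_HTML` is made-up markup in the same shape as the ranking table, not a copy of the real page. It also shows that a CSS selector passed to `select` reaches the same rows in one call. Going through tbody skips the header row that sits in thead.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the same shape as the ranking table (illustrative only)
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rank</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Nature Energy</td></tr>
    <tr><td>2</td><td>Energy and Environmental Science</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Step-by-step traversal: table -> tbody -> tr
rows_by_steps = soup.find('table').find('tbody').find_all('tr')

# The same rows through a CSS selector
rows_by_selector = soup.select('table tbody tr')

for row in rows_by_steps:
    print([td.text for td in row.find_all('td')])
```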

Now I have all tr tags in the form of a list of length 50. To extract the information for the first journal, we take the td tags of the first tr tag.

In [16]:
# Accessing the td tags of the first tr tag

first_td_tag = all_tr_tags[0].find_all('td')

In [17]:
# Here we can check the contents and the number of td tags.

print(len(first_td_tag))
first_td_tag

13


[<td>1</td>,
 <td class="tit"><a href="journalsearch.php?q=21100812579&amp;tip=sid&amp;clean=0" title="view journal details">Nature Energy</a></td>,
 <td>journal</td>,
 <td class="orde">16.736 <span class="q1">Q1</span></td>,
 <td>160</td>,
 <td>204</td>,
 <td>632</td>,
 <td>6619</td>,
 <td>17706</td>,
 <td>377</td>,
 <td>25.73</td>,
 <td>32.45</td>,
 <td><img alt="US" src="banderas/us.png?v=1.2" title="United States"/></td>]

Now I can extract the journal name, url, type, SJR Index, H Index and country by indexing into these td tags.

In [18]:
# Accessing Journal Title

title = first_td_tag[1]
Title = title.text
print(Title)

# Accessing journal url

base_url = 'https://www.scimagojr.com/'
journal_url = first_td_tag[1].find("a")['href']
Journal_url =base_url + journal_url
print(Journal_url)

# Accessing type 

jtype = first_td_tag[2]
Type = jtype.text
print(Type)

# Accessing SJR

sjr_index = first_td_tag[3]
SJR_index = sjr_index.text
print(SJR_index)

# Accessing H index

h_index = first_td_tag[4]
H_index  = h_index.text
print(H_index)

# Accessing country name through image tag

country = first_td_tag[12].find('img')['alt']
print(country)

Nature Energy
https://www.scimagojr.com/journalsearch.php?q=21100812579&tip=sid&clean=0
journal
16.736 Q1
160
US
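
The SJR value is printed as '16.736 Q1' because the quartile label sits in a span inside the same td tag as the score. If the two parts are needed separately, for example to sort journals by score, the text can be split. A minimal sketch, where `split_sjr` is a hypothetical helper and not part of the notebook:

```python
def split_sjr(cell_text):
    """Split text such as '16.736 Q1' into a float score and a quartile label."""
    score_text, _, quartile = cell_text.strip().partition(' ')
    # The quartile is None when the cell holds only a score
    return float(score_text), quartile or None


score, quartile = split_sjr('16.736 Q1')
print(score, quartile)
```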


From the td tags of the first row I got all the required information for the first journal. Now I can extend this to all journals by collecting the td tags of every row.

In [19]:
# Getting all td tags

all_td_tags = []
for j in range (0,len(all_tr_tags)): 
    td_tag = all_tr_tags[j].find_all('td')
    all_td_tags.append(td_tag)
print(all_td_tags)
print(len(all_td_tags))


[[<td>1</td>, <td class="tit"><a href="journalsearch.php?q=21100812579&amp;tip=sid&amp;clean=0" title="view journal details">Nature Energy</a></td>, <td>journal</td>, <td class="orde">16.736 <span class="q1">Q1</span></td>, <td>160</td>, <td>204</td>, <td>632</td>, <td>6619</td>, <td>17706</td>, <td>377</td>, <td>25.73</td>, <td>32.45</td>, <td><img alt="US" src="banderas/us.png?v=1.2" title="United States"/></td>], [<td>2</td>, <td class="tit"><a href="journalsearch.php?q=17500155114&amp;tip=sid&amp;clean=0" title="view journal details">Energy and Environmental Science</a></td>, <td>journal</td>, <td class="orde">11.558 <span class="q1">Q1</span></td>, <td>376</td>, <td>385</td>, <td>910</td>, <td>41592</td>, <td>33154</td>, <td>903</td>, <td>34.54</td>, <td>108.03</td>, <td><img alt="GB" src="banderas/gb.png?v=1.2" title="United Kingdom"/></td>], [<td>3</td>, <td class="tit"><a href="journalsearch.php?q=21100199127&amp;tip=sid&amp;clean=0" title="view journal details">Advanced Energy


50


At this point I define a function to get all td tags:

In [20]:
# defining a function to get the td tags of every tr tag

def td_tags(tr_tags):
    # the local list has its own name so that it does not shadow the function name
    all_tags = []
    for j in range (0,len(tr_tags)):
        td_tag = tr_tags[j].find_all('td')
        all_tags.append(td_tag)
    return all_tags


Now I define functions to extract the journal title, url, type, SJR Index, H Index and country.

In [21]:
# function to get journal title

def parse_title(td_tags):
    title = []
    # td_tags is a list of lists, so iterating over each inner list
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        title.append(td_tag[1].text)
    return title

In [22]:
# parsing journal link

def parse_link(td_tags):
    journal_url = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        base_url = 'https://www.scimagojr.com/'
        journal_url.append((base_url + td_tag[1].find("a")['href']))
    return journal_url
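
Joining the base address and the href with `+` works here because the base ends with a slash and the href is relative. As a hedged alternative, `urljoin` from the standard library resolves relative links the way a browser would. In this sketch `absolute_url` is a hypothetical helper name; the two hrefs are taken from the page output shown earlier.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.scimagojr.com/'


def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the site address (hypothetical helper)."""
    return urljoin(base, href)


print(absolute_url('journalsearch.php?q=21100812579&tip=sid&clean=0'))
print(absolute_url('./img/siricon.png'))
```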

In [23]:
# Parsing journal type

def parse_jtype(td_tags):
    jtype = []
    for k in range(0, len(td_tags)):
        td_tag = td_tags[k]
        jtype.append(td_tag[2].text)
    return jtype        

In [24]:
# parsing SJR Index

def parse_sjr(td_tags):
    sjr_index = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        sjr_index.append(td_tag[3].text)
    return sjr_index

In [25]:
# parsing H Index

def parse_hindex(td_tags):
    h_index = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        h_index.append(td_tag[4].text)
    return h_index

In [26]:
# parsing country info

def parse_country(td_tags):
    country = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        country.append(td_tag[12].find('img')['alt'])
    return country

Now I define a dictionary and a function to collect all the required information in one place.

In [27]:
# creating final journal dictionary (one list per column)
journal_rank_dict = {
                    'Title' : [],
                    'Journal Link' : [],
                    'Type' : [],
                    'SJR Index' : [],
                    'H Index' : [],
                    'Country' : [],
                     }

# adding information in journal lists
def add_in_journal(title, journal_url, jtype, sjr_index, h_index, country):
    journal_rank_dict['Title'].extend(title)
    journal_rank_dict['Journal Link'].extend(journal_url)
    journal_rank_dict['Type'].extend(jtype)
    journal_rank_dict['SJR Index'].extend(sjr_index)
    journal_rank_dict['H Index'].extend(h_index)
    journal_rank_dict['Country'].extend(country)
    
    return journal_rank_dict
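
The six parse functions each walk over the same rows. As a possible variation, not the approach used in this notebook, one row can be turned into one dictionary in a single pass. In the sketch below `parse_row` is a hypothetical helper, and `ROW_HTML` is the first table row copied from the output shown earlier so that the example runs offline.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://www.scimagojr.com/'

# First table row, copied from the page output shown earlier
ROW_HTML = (
    '<tr><td>1</td>'
    '<td class="tit"><a href="journalsearch.php?q=21100812579&amp;tip=sid&amp;clean=0" '
    'title="view journal details">Nature Energy</a></td>'
    '<td>journal</td>'
    '<td class="orde">16.736 <span class="q1">Q1</span></td>'
    '<td>160</td><td>204</td><td>632</td><td>6619</td><td>17706</td>'
    '<td>377</td><td>25.73</td><td>32.45</td>'
    '<td><img alt="US" src="banderas/us.png?v=1.2" title="United States"/></td></tr>'
)


def parse_row(tr_tag):
    """Return one journal as a dictionary (hypothetical helper)."""
    cells = tr_tag.find_all('td')
    return {
        'Title': cells[1].text,
        'Journal Link': BASE_URL + cells[1].find('a')['href'],
        'Type': cells[2].text,
        'SJR Index': cells[3].text,
        'H Index': cells[4].text,
        'Country': cells[12].find('img')['alt'],
    }


row = BeautifulSoup(ROW_HTML, 'html.parser').find('tr')
record = parse_row(row)
print(record)
```

The column positions (1, 2, 3, 4 and 12) are the same ones used by the parse functions above.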

Now I can repeat the above steps for multiple pages, calling each function in turn to get the required data. The loop below reads pages 1 to 4 of the ranking.

In [28]:
# extending over multiple pages

for i in range(1,5):
    url = 'https://www.scimagojr.com/journalrank.php?category=2105&page={}&total_size=203'.format(i)
    
    # getting doc file 
    docs =get_doc(url)
    
    # parsing tags
    tr_tags = parse_tags(docs)
    
    # parsing td tags
    tdtags = td_tags(tr_tags)
    
    # parsing content from td tags
    title = parse_title(tdtags)
    journal_url = parse_link(tdtags)
    jtype = parse_jtype(tdtags)
    sjr_index = parse_sjr(tdtags)
    h_index = parse_hindex(tdtags)
    country = parse_country(tdtags)

    final = add_in_journal(title, journal_url, jtype, sjr_index, h_index, country)
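
The page range can also be computed instead of hard-coded. The sketch below rests on two assumptions that the site does not confirm here: that total_size in the url is the number of journals, and that every page holds up to 50 rows, as the first page did. `page_urls` is a hypothetical helper name.

```python
import math

URL_TEMPLATE = 'https://www.scimagojr.com/journalrank.php?category=2105&page={}&total_size={}'


def page_urls(total_size, page_size=50):
    """Build one url per page, assuming page_size rows per page (hypothetical helper)."""
    n_pages = math.ceil(total_size / page_size)
    return [URL_TEMPLATE.format(page, total_size) for page in range(1, n_pages + 1)]


urls = page_urls(203)
print(len(urls))
for page_url in urls:
    print(page_url)
```

Under those assumptions 203 entries span 5 pages, one more than the loop above reads, so it may be worth checking whether a fifth page exists.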

In [29]:
# Converting dictionary into dataframe using pandas

journalrank_df = pd.DataFrame(journal_rank_dict)

In [30]:
journalrank_df

|  | Title | Journal Link | Type | SJR Index | H Index | Country |
|---|---|---|---|---|---|---|
| 0 | Nature Energy | https://www.scimagojr.com/journalsearch.php?q=... | journal | 16.736 Q1 | 160 | US |
| 1 | Energy and Environmental Science | https://www.scimagojr.com/journalsearch.php?q=... | journal | 11.558 Q1 | 376 | GB |
| 2 | Advanced Energy Materials | https://www.scimagojr.com/journalsearch.php?q=... | journal | 8.226 Q1 | 255 | DE |
| 3 | ACS Energy Letters | https://www.scimagojr.com/journalsearch.php?q=... | journal | 7.362 Q1 | 134 | US |
| 4 | Nature Sustainability | https://www.scimagojr.com/journalsearch.php?q=... | journal | 5.789 Q1 | 62 | GB |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | Renewable Resources Journal | https://www.scimagojr.com/journalsearch.php?q=... | journal | 0.109 Q4 | 9 | US |
| 196 | Materials and Energy | https://www.scimagojr.com/journalsearch.php?q=... | book series | 0.108 Q4 | 3 | SG |
| 197 | Earth | https://www.scimagojr.com/journalsearch.php?q=... | journal | 0.103 Q4 | 14 | US |
| 198 | International Journal of Environmental, Cultur... | https://www.scimagojr.com/journalsearch.php?q=... | book series | 0.102 Q4 | 6 | US |


In [31]:
# Converting dataframe into csv file

journalrank_df.to_csv('Renewable Energy Sustainability Journal Ranking.csv', index=False,header=True)

In [32]:
# checking the created csv file 
with open('Renewable Energy Sustainability Journal Ranking.csv', 'r', encoding="utf-8") as f:   
    csv_link = f.read()

In [36]:
csv_link

'Title,Journal Link,Type,SJR Index,H Index,Country\nNature Energy,https://www.scimagojr.com/journalsearch.php?q=21100812579&tip=sid&clean=0,journal,16.736 Q1,160,US\nEnergy and Environmental Science,https://www.scimagojr.com/journalsearch.php?q=17500155114&tip=sid&clean=0,journal,11.558 Q1,376,GB\nAdvanced Energy Materials,https://www.scimagojr.com/journalsearch.php?q=21100199127&tip=sid&clean=0,journal,8.226 Q1,255,DE\nACS Energy Letters,https://www.scimagojr.com/journalsearch.php?q=21100832985&tip=sid&clean=0,journal,7.362 Q1,134,US\nNature Sustainability,https://www.scimagojr.com/journalsearch.php?q=21100873499&tip=sid&clean=0,journal,5.789 Q1,62,GB\nEnergy Storage Materials,https://www.scimagojr.com/journalsearch.php?q=21100420314&tip=sid&clean=0,journal,4.865 Q1,103,NL\nNano Energy,https://www.scimagojr.com/journalsearch.php?q=21100197947&tip=sid&clean=0,journal,4.684 Q1,200,NL\nCarbon Energy,https://www.scimagojr.com/journalsearch.php?q=21101045745&tip=sid&clean=0,journal,4.527 Q
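
Reading the file as one long string confirms that it was written, but the rows are hard to inspect that way. A minimal sketch of a row-by-row check with the standard csv module follows. `SAMPLE_CSV` holds the header and the first two rows from the output above, so the example runs without the saved file.

```python
import csv
import io

# Header and first two rows, copied from the csv output shown above
SAMPLE_CSV = (
    'Title,Journal Link,Type,SJR Index,H Index,Country\n'
    'Nature Energy,https://www.scimagojr.com/journalsearch.php?q=21100812579&tip=sid&clean=0,'
    'journal,16.736 Q1,160,US\n'
    'Energy and Environmental Science,https://www.scimagojr.com/journalsearch.php?q=17500155114&tip=sid&clean=0,'
    'journal,11.558 Q1,376,GB\n'
)

# DictReader maps every row to the column names from the header line
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print(len(rows))
for row in rows:
    print(row['Title'], '|', row['SJR Index'], '|', row['Country'])
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by the file object from `open('Renewable Energy Sustainability Journal Ranking.csv', 'r', newline='', encoding='utf-8')`.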

## Summary:

In this project I have gathered the ranked journals in the field of renewable energy, sustainability and environment from this [website](https://www.scimagojr.com/journalrank.php?category=2105). The gathered data shows the journal title, journal link (url), type of journal, SJR Index, H Index and country. For this I have covered the following steps in this project:

1. Downloaded the web page using requests library.
2. Parsed the HTML source code using beautifulsoup4 to get journal title, journal url, SJR Index, H Index and country.
3. Extracted the required information from page.
4. Defined functions whenever necessary.
5. Saved the extracted information to a csv file.

At last, I am summarising all the steps and functions used in this project:

In [34]:
# defining a function to get the page and then converting it into doc 

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_doc(url):
    
    # Get the HTML page content using requests
    response = requests.get(url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + url)
    
    # Construct a beautiful soup document
    docs = BeautifulSoup(response.text,'html.parser')

    return docs

# defining a function to parse docs and to get 'tr' tags

def parse_tags(docs):
    table_tags = docs.find_all('table')
    tbody_tags = table_tags[0].find_all('tbody')
    tr_tags = tbody_tags[0].find_all('tr')
    return tr_tags

# defining a function to get 'td' tags

def td_tags(tr_tags):
    td_tags = []
    for j in range (0,len(tr_tags)):
        td_tag = tr_tags[j].find_all('td')
        td_tags.append(td_tag)
    return td_tags

# function to get journal title

def parse_title(td_tags):
    title = []
    # td_tags is a list of lists, so iterating over each inner list
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        title.append(td_tag[1].text)
    return title

# parsing journal link

def parse_link(td_tags):
    journal_url = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        base_url = 'https://www.scimagojr.com/'
        journal_url.append((base_url + td_tag[1].find("a")['href']))
    return journal_url

# Parsing journal type

def parse_jtype(td_tags):
    jtype = []
    for k in range(0, len(td_tags)):
        td_tag = td_tags[k]
        jtype.append(td_tag[2].text)
    return jtype

# parsing SJR Index

def parse_sjr(td_tags):
    sjr_index = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        sjr_index.append(td_tag[3].text)
    return sjr_index

# parsing H Index

def parse_hindex(td_tags):
    h_index = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        h_index.append(td_tag[4].text)
    return h_index

# parsing country info

def parse_country(td_tags):
    country = []
    for k in range (0, len(td_tags)):
        td_tag = td_tags[k]
        country.append(td_tag[12].find('img')['alt'])
    return country


# creating final journal list
journal_rank_dict = {
                    'Title' : [],
                    'Journal Link' : [],
                    'Type' : [],
                    'SJR Index' : [],
                    'H Index' : [],
                    'Country' : [],
                     }

# adding information in journal lists
def add_in_journal(title, journal_url, jtype, sjr_index, h_index, country):
    journal_rank_dict['Title'].extend(title)
    journal_rank_dict['Journal Link'].extend(journal_url)
    journal_rank_dict['Type'].extend(jtype)
    journal_rank_dict['SJR Index'].extend(sjr_index)
    journal_rank_dict['H Index'].extend(h_index)
    journal_rank_dict['Country'].extend(country)
    
    return journal_rank_dict

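# extending over multiple pages and collecting the data

for i in range(1,5):
    url = 'https://www.scimagojr.com/journalrank.php?category=2105&page={}&total_size=203'.format(i)
    tdtags = td_tags(parse_tags(get_doc(url)))
    add_in_journal(parse_title(tdtags), parse_link(tdtags), parse_jtype(tdtags),
                   parse_sjr(tdtags), parse_hindex(tdtags), parse_country(tdtags))

# Converting dictionary into dataframe using pandas

journalrank_df = pd.DataFrame(journal_rank_dict)

# converting into csv file

journalrank_df.to_csv('Renewable Energy Sustainability Journal Ranking.csv', index=False,header=True)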


In [None]:
jovian.commit(file=['Renewable Energy Sustainability Journal Ranking.csv'],project='webscraping-project', outputs = ['Renewable Energy Sustainability Journal Ranking.csv'] )


# Future work

The current web-scraping project collects the ranked journals in the field of renewable energy, sustainability and environment. Further, each journal can be scraped to learn details such as:
-  The latest papers, for example by scraping the journal "Sustainable Materials and Technologies".
-  The latest technological developments, for example carbon capture and storage (CCS), which is quite a popular technology in the field of sustainability.

# References

1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
2. https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/
3. https://www.w3schools.com/html/
4. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
5. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status