# Scraping Electricity Production of Different Countries and Indian States using Python
![](https://i.imgur.com/y1o5APh.png)


Wikipedia is a free, multilingual online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration on a wiki-based editing system. It is one of the largest and most-read reference works in history and has consistently ranked among the 15 most popular websites according to Alexa; as of 2021, it was ranked 13th. Information on almost any topic can be found on Wikipedia.

The page https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production lists countries by electricity production. The page https://en.wikipedia.org/wiki/Electricity_sector_in_India, whose URL we will extract from the country list, discusses electricity generation in India and includes a state-wise table of installed power capacity. In this project we'll retrieve information from these two pages using _web scraping_: the process of extracting and parsing data from websites in an automated fashion using a computer program. We'll use the Python libraries [requests](https://docs.python-requests.org/en/latest/), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [pandas](https://pandas.pydata.org/) to scrape data from these pages.

Here's an outline of the steps we'll follow:

1. Download the web pages using `requests`
2. Parse the HTML source code using `beautifulsoup4`
3. Extract the required information from the page
4. Compile the extracted information into Python lists and dictionaries
5. Save the extracted information to a CSV file

By the end of the project, we'll create two csv files.

The first CSV file has the following format:
```
Country_name, Country_electricity(GWh),Production_Year,country_url
China,"7,503,400",2020,https://en.wikipedia.org/wiki/Electricity_sector_in_China
United States,"4,286,600",2020,https://en.wikipedia.org/wiki/Electricity_sector_in_the_United_States
India,"1,560,900",2020,https://en.wikipedia.org/wiki/Electricity_sector_in_India
.....
```
The second CSV file looks like this:
```
State/Union Territory,Coal,Lignite,Gas,Diesel,Sub-TotalThermal,Nuclear,Hydel,OtherRenewable,Sub-TotalRenewable,Total, % of National total, % Renewable
Western Region,85156,1540,10806,-,97502,1840,7392,30367,37759,137101,35.69%,27.54%
Maharashtra,24966,-,3207,-,28173,1400,3047,10383,13430,43003,11.20%,31.23%
....
```

## Importance of the project
Electricity is one of the most important contributions of science to everyday life, and modern life is hard to imagine without it. At home it powers lighting, fans and appliances such as electric stoves and air conditioners. In factories it drives large machines, and essentials such as food, cloth and paper are produced with its help. It has transformed transportation and communication: electric trains and battery-powered cars are fast means of travel. Radio, television and cinema, the most popular forms of entertainment, all depend on it, as do computers and robots. It also plays a pivotal role in medicine and surgery, for example in X-ray and ECG machines.

The use of electricity is increasing day by day, and growth in the electricity sector is important for sustaining a country's economic output. Electricity is not freely available in nature, so it must be produced. The main sources for [electricity generation](https://en.wikipedia.org/wiki/Electricity_generation) include coal, lignite, gas, diesel, nuclear, solar, wind and hydro. The world is moving towards renewable sources such as wind, solar and biomass, which provide reliable power supplies and fuel diversification.

Given how widely electricity is used, data on its production is well worth scraping and analyzing.

This project is divided into two sections:
* Section 1: This section demonstrates the scraping process for the page https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production
* Section 2: This section demonstrates generating a CSV file of Indian states by installed power capacity from the page https://en.wikipedia.org/wiki/Electricity_sector_in_India


  ### How to run the code
  You can execute the code using the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebook to [Jovian](https://jovian.ai) by executing the following cells.

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="webscraping-project-final")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "prasanthi-vvit/webscraping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/prasanthi-vvit/webscraping-project-final


'https://jovian.ai/prasanthi-vvit/webscraping-project-final'

## Installing the required packages

Since we want to scrape two differently formatted tables from two different pages, it is a good idea to install and import the required packages once, up front.

In this project, we'll use `requests` to download the web pages, `beautifulsoup4` to parse the web content and `pandas` to create dataframes and csv files.

Let us install these libraries.

In [4]:
!pip install requests --upgrade --quiet
import requests

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

!pip install pandas --upgrade --quiet
import pandas as pd

## Section 1: Scraping List of Countries by Electricity Production
This section provides the step-by-step process for scraping the list of countries by electricity production from page https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production.
![list of countries](https://i.imgur.com/u5rEDXI.png)

We'll assign the page url to a variable.

In [5]:
topic_url='https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production'

### Download the web page using `requests`
The first step in scraping is to download the page. We'll use the `requests.get` function to get the page.

In [6]:
response=requests.get(topic_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `status_code` property can be used to check whether the request was successful. A successful response has an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.
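The 2xx range check can be written as a small helper. This is a minimal sketch; `is_success` is a hypothetical name introduced here for illustration and is not part of `requests`.

```python
def is_success(status_code):
    """Return True when an HTTP status code falls in the 2xx (success) range."""
    return 200 <= status_code <= 299

# Try a few representative codes: only the 2xx ones count as success
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

In this notebook it could be called as `is_success(response.status_code)` once the page has been requested.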


In [7]:
response.status_code

200

The request was successful! We can get the contents of the page using `response.text`.

In [8]:
page_contents=response.text


Let's check the number of characters on the page.

In [9]:
len(page_contents)

325454

The page contains over 325,000 characters!

Here are the first 500 characters of the page:

In [10]:
page_contents[:500]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of countries by electricity production - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9f8d5c57-'

What we are looking at above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page.

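We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [11]:
# An explicit encoding avoids errors on systems whose default encoding is not UTF-8
with open('List_of_countries_by_electricity_production.html','w',encoding='utf-8') as f:
    f.write(page_contents)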

The page looks similar to the original page, but none of the links work.
![](https://i.imgur.com/rMIwBla.png)

We have successfully downloaded the webpage. 


### Parse the HTML source code using beautiful soup
The next step after downloading the page is to parse the HTML code. We'll use the `beautifulsoup4` library for this.

Let us create a `BeautifulSoup` object to parse the content.
 

In [13]:
doc=BeautifulSoup(page_contents, 'html.parser')

The HTML source code of the web page has been converted into a `BeautifulSoup` object.

Passing `html.parser` tells Beautiful Soup to use Python's built-in HTML parser, which lets us work with the page as a tree of tags instead of one long string.

Let us check the type of the `doc` object.

In [14]:
type(doc)

bs4.BeautifulSoup



A `BeautifulSoup` object provides methods and properties to extract data from its document. In this project we'll use the `.find` and `.find_all` methods to access specific tags and attributes.

Here are some simple ways to navigate the document:

In [15]:
doc.find('title')

<title>List of countries by electricity production - Wikipedia</title>

This has given the title of the web page

In [16]:
doc.find('img')

<img alt="Ambox current red Americas.svg" data-file-height="290" data-file-width="360" decoding="async" height="34" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ambox_current_red_Americas.svg/42px-Ambox_current_red_Americas.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ambox_current_red_Americas.svg/63px-Ambox_current_red_Americas.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ambox_current_red_Americas.svg/84px-Ambox_current_red_Americas.svg.png 2x" width="42"/>

This has returned the first image of the page.
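The difference between `.find` and `.find_all` is easiest to see on a small document. The sketch below uses a made-up HTML snippet (`sample_html` is an assumption for illustration, not content from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only to illustrate the two methods
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <table>
    <tr><th>Country</th><th>GWh</th></tr>
    <tr><th><a href="/wiki/A">A</a></th><td>1,200</td></tr>
    <tr><th><a href="/wiki/B">B</a></th><td>300</td></tr>
  </table>
</body></html>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

first_link = sample_doc.find("a")     # first matching tag only
all_rows = sample_doc.find_all("tr")  # every matching tag, as a list-like result
missing = sample_doc.find("img")      # None when nothing matches

print(first_link["href"], len(all_rows), missing)
```

Because `.find` returns `None` when there is no match, it is worth checking the result before reading attributes from it.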


We have successfully parsed the HTML source code of the web page using beautifulsoup.


### Extract country_name, country_electricity, production_year and country_URL from the page 
The third step in scraping is to extract the information by inspecting the HTML source code.

Let us now inspect the parsed HTML source code to identify the unique tags and attributes which will give us the details we need.

#### Inspecting HTML code in the Browser

You can view the source code of any webpage right within your browser by right-clicking anywhere on a page and selecting the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what it looks like on the Chrome browser:

![](https://i.imgur.com/rj9v52C.png)

Upon inspecting the HTML code above, we find that the page contains two tables that can be accessed using the `table` tag with the class "wikitable sortable static-row-numbers plainrowheaders srn-white-background". Of these two tables, the first contains the required data. Going deeper into that table, we find `tr` tags, the first of which carry the class "static-row-header", as marked in the above image.


In [18]:
#Access table tags using .find_all method and assign it to a variable
table_tags=doc.find_all('table',class_='wikitable sortable static-row-numbers plainrowheaders srn-white-background')


In [19]:
#It was found that there are 2 table tags 
len(table_tags)

2

The `table` tags are accessed using the `.find_all` method.
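One detail of this lookup is worth illustrating: when `class_` is given a string containing spaces, Beautiful Soup compares it against the whole `class` attribute, so the match breaks if a class is added or the order changes. The sketch below, on made-up HTML, contrasts this with a CSS selector via `.select`, which only requires that the named classes are present.

```python
from bs4 import BeautifulSoup

# Three made-up tables whose class attributes differ slightly
sample_html = """
<table class="wikitable sortable"></table>
<table class="wikitable sortable static-row-numbers"></table>
<table class="sortable wikitable"></table>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")

# A class_ string with spaces must equal the full class attribute
exact_matches = sample_doc.find_all("table", class_="wikitable sortable")

# A CSS selector matches any table that has both classes, in any order
css_matches = sample_doc.select("table.wikitable.sortable")

print(len(exact_matches), len(css_matches))
```

The exact-string form used in this notebook is fine as long as the page keeps the same class attribute; the selector form is a more tolerant alternative if it changes.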

Next we'll select the first table and access its `tr` tags.

In [20]:
#Access tr_tags from first table tag
tr_tags=table_tags[0].find_all('tr')

The tr_tags were accessed using `.find_all` method.

At this point we'll create a helper function called `get_tr_tags`, which takes the topic URL as input and returns the `tr` tags.

Inside this function:
* The response returned by `requests.get` for the topic URL is assigned to the local variable `response`.
* A valid response has a status code of 200. If the status code is anything else, an exception is raised stating that the page could not be fetched.
* The response text is passed to `BeautifulSoup`, and the resulting document is assigned to the `doc` variable.
* `table_tags` is obtained by calling `doc.find_all` with the `table` tag and its class.
* The `tr` tags extracted from the first element of `table_tags` are assigned to `tr_tags`, which is returned.

In [21]:
def get_tr_tags(topic_url):
    #download webpage using 'requests'
    response=requests.get(topic_url)
    #checking status code
    if response.status_code !=200:
        raise Exception('Unable to fetch data from {}'.format(topic_url))
    #if status code is 200, parse the page with beautifulsoup and Python's built-in parser
    doc=BeautifulSoup(response.text,'html.parser')
    #extract table tags from beautiful doc
    table_tags=doc.find_all('table',class_='wikitable sortable static-row-numbers plainrowheaders srn-white-background')
    #extract tr_tags from table_tags[0]
    tr_tags=table_tags[0].find_all('tr')
    return tr_tags

Let us call the function `get_tr_tags` to test its functionality.

In [22]:
get_tr_tags(topic_url)[0:2]

[<tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
 <th>Country/region</th>
 <th style="max-width:5em">Electricity production<br/><span style="font-size:85%;"><style data-mw-deduplicate="TemplateStyles:r886047488">.mw-parser-output .nobold{font-weight:normal}</style><span class="nobold">(GWh)</span></span></th>
 <th>Year
 </th></tr>,
 <tr class="static-row-header" style="font-weight:bold;">
 <th scope="row" style="text-align:left"> <span class="flagicon" style="padding-left:25px;"> </span><b>World</b>
 </th>
 <td><span data-sort-value="7007268232000000000♠">26,823,200</span></td>
 <td><sup class="reference" id="cite_ref-BP2019_1-0"><a href="#cite_note-BP2019-1">[1]</a></sup><span data-sort-value="000000002020-01-01-0000" style="white-space:nowrap">2020</span>
 </td></tr>]

Only the first two `tr` tags are displayed in the output to keep it short.

When these `tr` tags are inspected, we find that each `tr` tag has one `th` tag and two `td` tags, except the first `tr` tag.

Here we can see the first `tr` tag.
![](https://i.imgur.com/kYAB9kS.png)

It has three `th` tags and these represent table headers.

Here we can see the third `tr` tag.
![](https://i.imgur.com/1qIfRjQ.png)

When the 'tr' tags are expanded, we can find the information we are looking for- country name, country electricity generation, year of production, country url. Since the second `tr` tag contains the total world generation details, we'll extract country details from third `tr` tag onwards.

These are the findings when the 'tr' tags are expanded:
* Country name is in the 'th' tag
* Country Electricity production is in the first 'td' tag
* Production year is in the `span` tag of the second 'td' tag
* Country url is inside the 'a' tag in the 'th' tag
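These findings also suggest a way to pick out country rows without relying on their position in the table. The sketch below is an illustration on made-up HTML that mirrors the structure described above; it assumes that header rows have no `td` cells and that the World row has no link in its `th` cell.

```python
from bs4 import BeautifulSoup

# Made-up rows mirroring the header row, the World row and one country row
sample_html = """
<table>
  <tr><th>Country/region</th><th>Electricity production</th><th>Year</th></tr>
  <tr><th><b>World</b></th><td>26,823,200</td><td><span>2020</span></td></tr>
  <tr><th><a href="/wiki/Electricity_sector_in_China">China</a></th>
      <td>7,503,400</td><td><span>2020</span></td></tr>
</table>
"""
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

country_rows = []
for row in sample_rows:
    th = row.find("th")
    # Keep rows that have data cells and whose header cell links to a page
    if row.find("td") is not None and th is not None and th.find("a") is not None:
        country_rows.append(row)

print([row.find("th").text.strip() for row in country_rows])
```

The notebook itself skips the first two rows by index, which is simpler and works for the page as it was scraped.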

In [23]:
# to look at the `th` tag in a `tr` tag
th_tag=tr_tags[2].find('th')
th_tag

<th style="text-align:left"><span class="flagicon" style="display:inline-block;width:25px;text-align:left;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/Electricity_sector_in_China" title="Electricity sector in China">China *</a>
</th>

In [24]:
# The country name is the text of th_tag, with surrounding whitespace and the trailing ' *' marker removed.
country_name=th_tag.text.strip().replace('\u202f*','')
country_name

'China'

In [25]:
# The country URL is built from the href attribute of the 'a' tag inside th_tag.
country_path=th_tag.find('a')['href']
country_url='https://en.wikipedia.org'+country_path
country_url

'https://en.wikipedia.org/wiki/Electricity_sector_in_China'
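String concatenation works here because every `href` starts with `/wiki/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative links against a base URL and also copes with protocol-relative links such as the image `src` values seen earlier. The image path below is a made-up placeholder.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org"

# Root-relative path, as found in the href of a country link
country_link = urljoin(base_url, "/wiki/Electricity_sector_in_China")

# Protocol-relative URL (placeholder path), as found in image src attributes
image_link = urljoin(base_url, "//upload.wikimedia.org/example.png")

print(country_link)
print(image_link)
```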

In [26]:
# to look at the 'td' tag in a 'tr' tag
td_tags=tr_tags[2].find_all('td')
td_tags

[<td><span data-sort-value="7006750340000000000♠">7,503,400</span></td>,
 <td><sup class="reference" id="cite_ref-BP2019_1-1"><a href="#cite_note-BP2019-1">[1]</a></sup><span data-sort-value="000000002020-01-01-0000" style="white-space:nowrap">2020</span>
 </td>]

In [27]:
# The country's electricity production is the text of the first td tag.

country_electricity =td_tags[0].text
country_electricity

'7,503,400'
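The value is extracted as text with thousands separators. If a number is needed for later analysis, the commas can be removed before converting. This is a small sketch; `to_int` is a hypothetical helper that the notebook does not use.

```python
def to_int(number_text):
    """Convert text such as '7,503,400' to the integer 7503400."""
    return int(number_text.strip().replace(",", ""))

print(to_int("7,503,400"))
```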

In [28]:
# The production year is the text of the span tag inside the second td tag.
production_year=td_tags[1].find('span').text
production_year

'2020'

In [29]:
print('Country_name: ', country_name)
print('Country_Electricity: ', country_electricity)
print('Production_year: ', production_year)
print('Country_URL:', country_url)

Country_name:  China
Country_Electricity:  7,503,400
Production_year:  2020
Country_URL: https://en.wikipedia.org/wiki/Electricity_sector_in_China


Let us define a helper function called `get_country` to extract a country's details as shown above.

In [30]:
def get_country(tr_tag):
    #extract th tag from a tr_tag
    th_tag=tr_tag.find('th')
    #obtain country name which is in the 'th' tag using .text
    country_name=th_tag.text.strip().replace('\u202f*','')
    #specifying base url to get country url
    base_url='https://en.wikipedia.org'
    #country url is concatenation of base url with the href in the th tag
    country_url=base_url+th_tag.find('a')['href'].strip()
    #access td tags from tr tag
    td_tags=tr_tag.find_all('td')
    #obtain country electricity from first td tag using .text
    country_electricity =td_tags[0].text.strip()
    #obtain production year from span tag inside the second td tag using .text
    span_tag=td_tags[1].find('span')
    if span_tag is None:
        production_year=td_tags[1].text.replace('[4]','').strip()
    else:
        production_year=span_tag.text
    #return country full details 
    return country_name, country_electricity,production_year,country_url  
    

The function `get_country` takes a `tr` tag as input.

Inside this function, we define the following local variables:

`country_name`: the name of the country, taken from the text of the `th` tag.

`country_electricity`: the electricity generation of the country, taken from the text of the first `td` tag.

`production_year`: the year of production, taken from the text of the `span` tag inside the second `td` tag. If there is no `span` tag, the text of the `td` tag itself is used.

`country_url`: the URL of the country's page on Wikipedia, built by appending the `href` attribute of the `a` tag inside the `th` tag to `base_url`.

The function returns a tuple containing these four values.
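The fallback branch in `get_country` removes the citation marker `[4]` by its exact text, which would miss any other citation number. A more general option, sketched here under the assumption that citation markers always look like a number in square brackets, is a regular expression. `strip_citations` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches citation markers such as [1], [4] or [12]
CITATION_PATTERN = re.compile(r"\[\d+\]")

def strip_citations(cell_text):
    """Remove bracketed citation numbers and surrounding whitespace."""
    return CITATION_PATTERN.sub("", cell_text).strip()

print(strip_citations("[4]2015\n"))
```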



In [31]:
get_country(tr_tags[88])

('Dominican Republic',
 '14,367',
 '2015',
 'https://en.wikipedia.org/wiki/Electricity_sector_in_the_Dominican_Republic')


### Compile extracted information into python lists

In the previous step we created a helper function to extract a single country's data from the `tr` tags. The next step is to compile all the information into a Python dictionary.

Let us define a function to create a python dictionary of lists.

In [33]:
def get_all_countries(tr_tags):
    #extract all the countries details
    topic_countries_dict={'Country_name':[],'Country_elctricity':[],'Production_year':[],'Country_URL':[]}
    for i in range(2,len(tr_tags)):
        country_info=get_country(tr_tags[i])
        topic_countries_dict['Country_name'].append(country_info[0])
        topic_countries_dict['Country_elctricity'].append(country_info[1])
        topic_countries_dict['Production_year'].append(country_info[2])
        topic_countries_dict['Country_URL'].append(country_info[3])
    return topic_countries_dict

In [34]:
get_all_countries(tr_tags)['Production_year'][80:85]

['2011', '2011', '2020', '2011', '2012']

The function `get_country` gives the information of a single country as a tuple. Hence, we have created `get_all_countries` function to get the information of all the countries as a dictionary of lists. This function takes 'tr_tags' as an input and gives the output as a dictionary of lists.
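A dictionary of lists is one of several shapes that pandas accepts. As an illustrative alternative, the tuples returned by `get_country` could be collected in a plain list and passed to `pd.DataFrame` together with the column names. The rows below are copied from the sample output earlier in this notebook, and the column names are chosen for this sketch.

```python
import pandas as pd

# Tuples in the same order that get_country returns them
sample_rows = [
    ("China", "7,503,400", "2020",
     "https://en.wikipedia.org/wiki/Electricity_sector_in_China"),
    ("India", "1,560,900", "2020",
     "https://en.wikipedia.org/wiki/Electricity_sector_in_India"),
]
sample_columns = ["Country_name", "Country_electricity", "Production_year", "Country_URL"]

# Each tuple becomes one row; the column names are supplied separately
sample_df = pd.DataFrame(sample_rows, columns=sample_columns)
print(sample_df.shape)
```

Both approaches produce the same kind of dataframe; the list-of-tuples form avoids appending to four separate lists.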



### Save the extracted information to a csv file.

Once all the countries are parsed, the information needs to be saved in a CSV file so that it can be reused.
To do this, we first create a dataframe from the above dictionary of lists using pandas and then write that dataframe to a CSV file.

Let us create a dataframe using `pd.DataFrame` function.

In [35]:
topic_countries_dict=get_all_countries(tr_tags)
countries_df=pd.DataFrame(topic_countries_dict)
countries_df

| | Country_name | Country_elctricity | Production_year | Country_URL |
|---|---|---|---|---|
| 0 | China | 7503400 | 2020 | https://en.wikipedia.org/wiki/Electricity_sect... |
| 1 | United States | 4286600 | 2020 | https://en.wikipedia.org/wiki/Electricity_sect... |
| 2 | India | 1560900 | 2020 | https://en.wikipedia.org/wiki/Electricity_sect... |
| 3 | Russia | 1085400 | 2020 | https://en.wikipedia.org/wiki/Electricity_sect... |
| 4 | Japan | 1004800 | 2020 | https://en.wikipedia.org/wiki/Electricity_sect... |
| ... | ... | ... | ... | ... |
| 204 | Kiribati | 29 | 2016 | https://en.wikipedia.org/wiki/Kiribati |
| 205 | Montserrat | 24 | 2016 | https://en.wikipedia.org/wiki/Montserrat |
| 206 | Falkland Islands | 19 | 2016 | https://en.wikipedia.org/wiki/Falkland_Islands |
| 207 | Saint Helena | 7 | 2016 | https://en.wikipedia.org/wiki/Saint_Helena |


Now we use the `countries_df.to_csv` method to write a CSV file.

In [36]:
countries_df.to_csv('List_of_countries_by_electricity_production.csv',index=False)

We can check that the file has been created by looking for it within Jupyter using "File > Open".
![](https://i.imgur.com/hjWoxJX.png)

The data inside the csv file is shown here:
![](https://i.imgur.com/wgiblJ9.png)
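When the CSV is read back, values such as "7,503,400" are loaded as text unless pandas is told how to interpret the commas. The sketch below reads a small in-memory CSV in the same format, using the `thousands` parameter of `pd.read_csv`; the two rows are copied from the sample shown at the start of this notebook.

```python
import io
import pandas as pd

# A small in-memory CSV in the same format as the generated file
csv_text = '''Country_name,Country_elctricity,Production_year
China,"7,503,400",2020
United States,"4,286,600",2020
'''

# thousands=',' tells pandas to treat commas inside numbers as separators
check_df = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(check_df.dtypes)
```

For the real file, the path `'List_of_countries_by_electricity_production.csv'` would be passed in place of the `io.StringIO` object.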

Let us code a helper function to create a csv file.

In [37]:
def get_csv(topic_dict,path):
    topic_country_df=pd.DataFrame(topic_dict)
    topic_country_df.to_csv(path,index=False)

Let us write a function that takes the topic URL and a file path as inputs and generates a CSV file.

In [38]:
def scrape_countries(topic_url,path):
    #extracting tr_tags 
    tr_tags=get_tr_tags(topic_url)
    #creating python dictionary of lists
    topic_country_dict=get_all_countries(tr_tags)
    #Writing extracted data to a csv file
    get_csv(topic_country_dict,path)
    print('The csv file is created with the name {}'.format(path))

We'll test the function `scrape_countries` 

In [39]:
scrape_countries(topic_url, 'List_of_countries_by_electricity_production.csv')

The csv file is created with the name List_of_countries_by_electricity_production.csv


### Putting all the functions in a single cell

In [40]:
!pip install requests --upgrade --quiet
import requests

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

!pip install pandas --upgrade --quiet
import pandas as pd

def scrape_countries(topic_url,path):
    #extracting tr_tags 
    tr_tags=get_tr_tags(topic_url)
    #creating python dictionary of lists
    topic_country_dict=get_all_countries(tr_tags)
    #Writing extracted data to a csv file
    get_csv(topic_country_dict,path)
    print('The csv file is created with the name {}'.format(path))
def get_tr_tags(topic_url):
    #download webpage using 'requests'
    response=requests.get(topic_url)
    #checking status code
    if response.status_code !=200:
        raise Exception('Unable to fetch data from {}'.format(topic_url))
    #if status code is 200, parse the page with beautifulsoup and Python's built-in parser
    doc=BeautifulSoup(response.text,'html.parser')
    #extract table tags from beautiful doc
    table_tags=doc.find_all('table',class_='wikitable sortable static-row-numbers plainrowheaders srn-white-background')
    #extract tr_tags from table_tag[0]
    tr_tags=table_tags[0].find_all('tr')
    return tr_tags
def get_country(tr_tag):
    #extract th tag from a tr_tag
    th_tag=tr_tag.find('th')
    #obtain country name which is in the 'th' tag using .text
    country_name=th_tag.text.strip().replace('\u202f*','')
    #specifying base url to get country url
    base_url='https://en.wikipedia.org'
    #country url is concatenation of base url with the href in the th tag
    country_url=base_url+th_tag.find('a')['href'].strip()
    #access td tags from tr tag
    td_tags=tr_tag.find_all('td')
    #obtain country electricity from first td tag using .text
    country_electricity =td_tags[0].text.strip()
    #obtain production year from span tag inside the second td tag using .text
    span_tag=td_tags[1].find('span')
    if span_tag is None:
        production_year=td_tags[1].text.replace('[4]','').strip()
    else:
        production_year=span_tag.text
    #return country full details 
    return country_name, country_electricity,production_year,country_url 
    
def get_all_countries(tr_tags):
    #extract all the countries details
    topic_countries_dict={'Country_name':[],'Country_elctricity':[],'Production_year':[],'Country_URL':[]}
    for i in range(2,len(tr_tags)):
        country_info=get_country(tr_tags[i])
        topic_countries_dict['Country_name'].append(country_info[0])
        topic_countries_dict['Country_elctricity'].append(country_info[1])
        topic_countries_dict['Production_year'].append(country_info[2])
        topic_countries_dict['Country_URL'].append(country_info[3])
    return topic_countries_dict
def get_csv(topic_dict,path):
    topic_country_df=pd.DataFrame(topic_dict)
    topic_country_df.to_csv(path,index=False)

In [41]:
scrape_countries('https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production','List_of_countries_by_electricity_production.csv')

The csv file is created with the name List_of_countries_by_electricity_production.csv



## Section 2: Scraping States of India by Installed Power Capacity

This section covers scraping the installed power capacity of the Indian states from the page https://en.wikipedia.org/wiki/Electricity_sector_in_India.

![List of states](https://i.imgur.com/QW4kN01.png)

We'll follow the same outline as in Section 1. Instead of explaining each step in detail again, I have given the functions directly with brief explanations, since the previous section covers the underlying ideas.

We'll get India's URL from the output of the `get_all_countries` function and assign it to a variable.

In [44]:
topic2_url=get_all_countries(tr_tags)['Country_URL'][2]
topic2_url

'https://en.wikipedia.org/wiki/Electricity_sector_in_India'

### We'll create a helper function called `get_table2_rows` to do the following:
* download the page,
* parse the data, 
* and extract the table rows.

In [45]:
def get_table2_rows(topic2_url):
    # downloads the web page
    response2=requests.get(topic2_url)
    # checks the status code
    if response2.status_code !=200:
        print('status code is',response2.status_code)
        raise Exception('Failed to fetch web page')
    # creates a beautiful doc using Python's built-in parser
    doc2=BeautifulSoup(response2.text,'html.parser')
    # access the table tags
    table2_tag=doc2.find_all('table',class_="wikitable sortable")
    # extract table rows from the fourth matching table
    tr_tags2=table2_tag[3].find_all('tr')
    return tr_tags2

In [46]:
tr_tags2=get_table2_rows(topic2_url)
tr_tags2[3]

<tr>
<td><a href="/wiki/Maharashtra" title="Maharashtra">Maharashtra</a></td>
<td align="right">24,966</td>
<td align="center">-</td>
<td align="right">3,207</td>
<td align="center">-</td>
<td align="right" style="background-color: #ddf;">28,173</td>
<td align="right">1,400</td>
<td align="right">3,047</td>
<td align="right">10,383</td>
<td align="right" style="background-color: #ddf;">13,430</td>
<td align="right" style="background-color: #dfd;">43,003</td>
<td align="right" style="background-color: #dfd;">11.20%</td>
<td style="background-color: #dfd;">31.23%
</td></tr>
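`get_table2_rows` picks the table by position (`table2_tag[3]`), which depends on how many matching tables appear earlier on the page. A more defensive option, sketched here on made-up HTML, is to look for a table that contains a known column header. `find_table_with_header` is a hypothetical helper, and "Lignite" is used only because it appears among the headers of the state-wise table.

```python
from bs4 import BeautifulSoup

def find_table_with_header(doc, header_text):
    """Return the first table with a th cell whose text equals header_text, else None."""
    for table in doc.find_all("table"):
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        if header_text in headers:
            return table
    return None

# Two made-up tables; only the second one has a 'Lignite' column
sample_html = """
<table class="wikitable sortable"><tr><th>Year</th><th>Population</th></tr></table>
<table class="wikitable sortable">
  <tr><th>State</th><th>Coal</th><th>Lignite</th></tr>
  <tr><td>Maharashtra</td><td>24,966</td><td>-</td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, "html.parser")
capacity_table = find_table_with_header(sample_doc, "Lignite")
print(len(capacity_table.find_all("tr")))
```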

### The function `get_table2_headers` returns the table headers:
Since many columns have to be extracted, it is easier to take the column headers from the page itself. Upon inspection, I found that the table headers are in the first two rows of the table. The function below extracts the headers from the first two `tr` tags.

In [47]:
def get_table2_headers(tr_tags2):
    headers2=[]
    #Obtain header 1 from rows[0]
    th_row1=tr_tags2[0].find_all('th')
    #access a tags in column_1
    a_tags=th_row1[0].find_all('a')
    headers2.append(a_tags[0].text+'/'+a_tags[1].text)
    #obtain remaining headers from row1
    th_tags=tr_tags2[1].find_all('th')
    for j in range(len(th_tags)):
        headers2.append(th_tags[j].text.strip())
    #The missing columns are inserted 
    headers2.insert(6,'Nuclear')
    headers2[5]='Sub-TotalThermal'
    # append last three headers from first row
    for i in range(4,7):
        headers2.append(th_row1[i].text.strip())
    return headers2


In [48]:
table2_headers=get_table2_headers(tr_tags2)
table2_headers

['State/Union Territory',
 'Coal',
 'Lignite',
 'Gas',
 'Diesel',
 'Sub-TotalThermal',
 'Nuclear',
 'Hydel',
 'OtherRenewable',
 'Sub-TotalRenewable',
 'Total(in MW)',
 '% of National Total',
 '% Renewable']

### `get_table2_values` returns the values of the table as a list of lists:
Each element of the outer list is itself a list representing one row of the table.
The table values start from the third row, after the two header rows.

In [49]:
def get_table2_values(tr_tags2):
    table2_values=[]
    for i in range(2,len(tr_tags2)):
        values1=[]
        #access th_tags from  third tr tags onwards
        th_tags2=tr_tags2[i].find_all('th')
        # access td_tags and append its values
        if len(th_tags2)==0:
            td_tags2=tr_tags2[i].find_all('td')
            for j in range(len(td_tags2)):
                values1.append(td_tags2[j].text.strip().replace(',',''))
        else:
            # append th_tags text to the list
            for th_tag2 in th_tags2:
                values1.append(th_tag2.text.strip().replace(',',''))
        # append each row to the main list
        table2_values.append(values1)
    return table2_values

In [50]:
table2_values=get_table2_values(tr_tags2)

table2_values[0]

['Western Region',
 '85156',
 '1540',
 '10806',
 '-',
 '97502',
 '1840',
 '7392',
 '30367',
 '37759',
 '137101',
 '35.69%',
 '27.54%']

Let us look at the dataframe 

In [63]:
states_df=pd.DataFrame(table2_values,columns=table2_headers)
states_df

| | State/Union Territory | Coal | Lignite | Gas | Diesel | Sub-TotalThermal | Nuclear | Hydel | OtherRenewable | Sub-TotalRenewable | Total(in MW) | % of National Total | % Renewable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Western Region | 85156 | 1540 | 10806 | - | 97502 | 1840 | 7392 | 30367 | 37759 | 137101 | 35.69% | 27.54% |
| 1 | Maharashtra | 24966 | - | 3207 | - | 28173 | 1400 | 3047 | 10383 | 13430 | 43003 | 11.20% | 31.23% |
| 2 | Gujarat | 14692 | 1400 | 7551 | - | 23643 | 440 | 1990 | 14050 | 16040 | 40123 | 10.45% | 39.98% |
| 3 | Madhya Pradesh | 21950 | - | - | - | 21950 | - | 2235 | 5282 | 7516 | 29466 | 7.67% | 25.51% |
| 4 | Chhattisgarh | 23688 | - | - | - | 23688 | - | 120 | 598 | 718 | 24406 | 6.35% | 2.94% |
| 5 | Goa | - | - | 48 | - | 48 | - | - | 8 | 8 | 55 | 0.014% | 14.02% |
| 6 | Daman & Diu | - | - | - | - | - | - | - | 41 | 41 | 40 | 0.01% | 100% |
| 7 | Dadra & Nagar Haveli | - | - | - | - | - | - | - | 5 | 5 | 5 | 0.001% | 100% |
| 8 | Southern Region | 37622 | 3140 | 6491 | 433 | 47688 | 3320 | 11694 | 43452 | 55146 | 108338 | 28.95% | 51.95% |
| 9 | Tamil Nadu | 9520 | 3140 | 1027 | 211 | 13899 | 2440 | 2178 | 14831 | 17009 | 33348 | 8.91% | 51.01% |

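All the scraped values are still strings, and empty cells appear as `-`. Before any analysis they would need to be converted to numbers. The following is a minimal sketch of one way to do that; the helper name `to_number` is hypothetical and is not part of the notebook, and treating `-` as a missing value (`None`) is an assumption.

```python
def to_number(text):
    """Convert a scraped cell such as '85156', '35.69%' or '-' to a float.

    Returns None for the '-' placeholder or an empty cell (an assumption:
    '-' is treated as a missing value rather than as zero).
    """
    cleaned = text.strip().replace(',', '').rstrip('%')
    if cleaned in ('', '-'):
        return None
    return float(cleaned)


# First row of the table shown above; the region name stays a string
sample_row = ['Western Region', '85156', '1540', '10806', '-', '97502',
              '1840', '7392', '30367', '37759', '137101', '35.69%', '27.54%']

numeric_row = [sample_row[0]] + [to_number(cell) for cell in sample_row[1:]]
print(numeric_row)
```

The same helper could be applied to every row of `table2_values` before the dataframe is created, if numeric columns are wanted.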

### The function `get_csv2` generates a csv file:
This function takes the headers, the values, and a file path as inputs. It creates a dataframe from the list of lists and then writes the dataframe to a csv file using pandas.

In [64]:
def get_csv2(headers,values,path):
    #creates a dataframe using pandas from list of lists
    topic2_df=pd.DataFrame(values,columns=headers)
    #generates a csv file from dataframe
    topic2_df.to_csv(path,index=False)
    print('The csv file is created with the name {}'.format(path))

In [65]:
get_csv2(table2_headers,table2_values,'States_of_India_by_installed_power_capacity.csv')

The csv file is created with the name States_of_India_by_installed_power_capacity.csv


We can confirm that the file was created by checking the file list in Jupyter using "File > Open".
![](https://i.imgur.com/uvGzFTy.png)
The contents of the csv file are shown below:
![](https://i.imgur.com/QB7NrHe.png)
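
The file can also be checked from code instead of through the Jupyter file browser. The sketch below uses Python's standard `csv` module. The helper `preview_csv` and the small demo file are hypothetical, added only so that the example runs on its own.

```python
import csv
import itertools
import os
import tempfile


def preview_csv(path, n_rows=3):
    """Return the header and up to n_rows data rows of a csv file."""
    if not os.path.exists(path):
        raise FileNotFoundError('No csv file found at {}'.format(path))
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(itertools.islice(reader, n_rows))
    return header, rows


# Demonstration on a small temporary file (illustrative data only)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(demo_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['State/Union Territory', 'Coal'])
    writer.writerow(['Maharashtra', '24966'])
    writer.writerow(['Gujarat', '14692'])

header, rows = preview_csv(demo_path)
print(header)
print(rows)
```

In the notebook, the same check would be `preview_csv('States_of_India_by_installed_power_capacity.csv')`.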

### The function `scrape_states` uses all the above functions to generate a csv file.
This function takes the topic URL and a file path as inputs and generates the second csv file, as shown below.

In [55]:
def scrape_states(topic2_url,path):
    # Extract table rows
    tr_tags2=get_table2_rows(topic2_url)
    # Extract headers from table rows
    headers=get_table2_headers(tr_tags2)
    # Extract table values from table rows
    values=get_table2_values(tr_tags2)
    # generates csv file
    get_csv2(headers,values,path)
    

In [56]:
scrape_states(topic2_url,'States_of_India_by_installed_power_capacity.csv')

The csv file is created with the name States_of_India_by_installed_power_capacity.csv


### Putting all the functions defined in this section into a single cell

In [57]:
def scrape_states(topic2_url,path):
    # Extract table rows
    tr_tags2=get_table2_rows(topic2_url)
    # Extract headers from table rows
    headers=get_table2_headers(tr_tags2)
    # Extract table values from table rows
    values=get_table2_values(tr_tags2)
    # generates csv file
    get_csv2(headers,values,path)
    
def get_table2_rows(topic2_url):
    # downloads the web page
    response2=requests.get(topic2_url)
    # checks the status code
    if response2.status_code !=200:
        print('status code is',response2.status_code)
        raise Exception('Failed to fetch web page')
    # parses the HTML into a BeautifulSoup document
    doc2=BeautifulSoup(response2.text,'html.parser')
    # access the table tags
    table2_tag=doc2.find_all('table',class_="wikitable sortable")
    # extract the rows of the fourth matching table
    tr_tags2=table2_tag[3].find_all('tr')
    return tr_tags2

def get_table2_headers(tr_tags2):
    headers2=[]
    #Obtain header 1 from rows[0]
    th_row1=tr_tags2[0].find_all('th')
    #access a tags in column_1
    a_tags=th_row1[0].find_all('a')
    headers2.append(a_tags[0].text+'/'+a_tags[1].text)
    #obtain remaining headers from row1
    th_tags=tr_tags2[1].find_all('th')
    for j in range(len(th_tags)):
        headers2.append(th_tags[j].text.strip())
    #The missing columns are inserted 
    headers2.insert(6,'Nuclear')
    headers2[5]='Sub-TotalThermal'
    # append last three headers from first row
    for i in range(4,7):
        headers2.append(th_row1[i].text.strip())
    return headers2

def get_table2_values(tr_tags2):
    table2_values=[]
    for i in range(2,len(tr_tags2)):
        values1=[]
        #access th_tags from  third tr tags onwards
        th_tags2=tr_tags2[i].find_all('th')
        # access td_tags and append its values
        if len(th_tags2)==0:
            td_tags2=tr_tags2[i].find_all('td')
            for j in range(len(td_tags2)):
                values1.append(td_tags2[j].text.strip().replace(',',''))
        else:
            # append th_tags text to the list
            for th_tag2 in th_tags2:
                values1.append(th_tag2.text.strip().replace(',',''))
        # append each row to the main list
        table2_values.append(values1)
    return table2_values

def get_csv2(headers,values,path):
    #creates a dataframe using pandas from list of lists
    topic2_df=pd.DataFrame(values,columns=headers)
    #generates a csv file from dataframe
    topic2_df.to_csv(path,index=False)
    print('The csv file is created with the name {}'.format(path))

In [58]:
scrape_states(topic2_url,'States_of_India_by_installed_power_capacity.csv')

The csv file is created with the name States_of_India_by_installed_power_capacity.csv



## Summary
We have scraped two different web pages and now have two csv files ready for data analysis.

Here are the steps we followed from beginning to end:
1. The project was divided into two sections:
    * Section 1: Scraped country-level electricity data from the page https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production.
    * Section 2: Fetched the electricity data of the Indian states from the page https://en.wikipedia.org/wiki/Electricity_sector_in_India.
2. In section 1, we performed the following steps:
    * Downloaded the web page using `requests`.
    * Parsed the HTML code using `beautifulsoup4`.
    * Extracted the required data by inspecting the HTML code.
    * Stored the data in a Python dictionary.
    * Created the csv file from the dictionary using `pandas`.
3. In section 2, we created the following functions to scrape the data:
    * `get_table2_rows`: Downloads the web page using `requests`, parses it using `beautifulsoup4`, and returns the table rows.
    * `get_table2_headers`: Extracts the table headers.
    * `get_table2_values`: Extracts the table values.
    * `get_csv2`: Saves the information to a csv file.
    * `scrape_states`: Uses all the above functions to generate a csv file, given `topic2_url` and an output path as inputs.


## Future Work
Here are some ways in which the project can be extended.

1. We can follow each country's URL to fetch detailed information on that country's electricity generation.
2. We can also analyze the extracted data to gain more insights, such as identifying the countries that generate the most and the least electricity.
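
As a starting point for the second idea, here is a minimal sketch that finds the highest and lowest producers with pandas. It assumes the column names from the first csv file and uses only the two sample rows shown at the beginning of this notebook. The variable names are illustrative.

```python
import pandas as pd

# Two sample rows in the format of the first csv file (illustrative subset)
countries_df = pd.DataFrame({
    'Country_name': ['China', 'United States'],
    'Country_electricity(GWh)': ['7,503,400', '4,286,600'],
})

col = 'Country_electricity(GWh)'
# The scraped values are strings with thousands separators, so convert them first
countries_df[col] = countries_df[col].str.replace(',', '', regex=False).astype(int)

# idxmax / idxmin return the index labels of the largest and smallest values
highest = countries_df.loc[countries_df[col].idxmax()]
lowest = countries_df.loc[countries_df[col].idxmin()]

print('Highest:', highest['Country_name'], highest[col])
print('Lowest:', lowest['Country_name'], lowest[col])
```

With the full csv file, the dataframe would instead be loaded using `pd.read_csv`, and the rest of the sketch would stay the same.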


## References
* Web Scraping and REST APIs - https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
* Documentation and Storytelling - https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/documentation-and-storytelling
* Project 1 - Web Scraping with Python - https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python
* Python Official Documentation - https://docs.python.org/3/tutorial/index.html
* Tutorial on HTML - https://www.htmldog.com/guides/html/
* Beautiful Soup Documentation - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Requests Documentation - https://docs.python-requests.org/en/latest/
* Pandas Documentation - https://pandas.pydata.org/
* List of Countries by Electricity Production - https://en.wikipedia.org/wiki/List_of_countries_by_electricity_production
* States of India by Installed Power Capacity - https://en.wikipedia.org/wiki/Electricity_sector_in_India

