# Scraping data from Wikipedia using the wikipedia library

![](https://www.webharvy.com/images/web%20scraping.png)

# Why use Web Scraping?
> Suppose you want some information from a website, say a paragraph on Donald Trump.<br>
> What do you do? Well, you can copy and paste the information from Wikipedia into your own file.<br>
> But what if you want to get large amounts of information from a website as quickly as possible?<br>
> For example, large amounts of data to train a Machine Learning algorithm?<br>
> In such a situation, copying and pasting will not work, and that is when you need Web Scraping.<br>
> Unlike the long and mind-numbing process of collecting data by hand, web scraping uses automated methods to gather thousands or even millions of records in far less time.<br>
> So let's look at what web scraping is and how to use it to obtain data from websites.<br>

![](https://roboticsandautomationnews.com/wp-content/uploads/2020/04/web-scraping-2.png)

# What is Web Scraping?
> Web scraping is an automated method of obtaining large amounts of data from websites. <br> <br>
> Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.<br><br>
> There are many different ways to perform web scraping. <br><br>
> These include using online services, dedicated APIs, or writing your own scraping code from scratch.<br><br>
> Many large websites, like Google, Twitter, Facebook, StackOverflow, etc. have APIs that allow you to access their data in a structured format. <br><br>
> An API is the best option when one exists, but some sites do not offer access to large amounts of data in a structured form, or simply lack the technical infrastructure. In that situation, web scraping is the way to get the data.<br><br>
> Web scraping requires two parts: the crawler and the scraper. <br><br>
> The crawler is a program that browses the web by following links from page to page to find the data required. <br><br>
> The scraper, on the other hand, is a tool built to extract the data from each page.<br><br>
> The design of the scraper can vary greatly with the complexity and scope of the project, so that it can extract the data quickly and accurately.<br><br>

![](https://miro.medium.com/max/1024/1*uRnYym2aZd_MEgSY--iNxw.png)
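
To make the crawler/scraper split concrete, here is a minimal sketch using only the Python standard library. The `SITE` dictionary is a hypothetical stand-in for a real website (a real crawler would download each page over HTTP), and the names `PageScraper` and `crawl` are chosen here for illustration only.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML.
# A real crawler would fetch each page over HTTP instead.
SITE = {
    "/index": '<title>Home</title><a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/index">Home</a>',
}


class PageScraper(HTMLParser):
    """Scraper: extracts the <title> text and all link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start, site):
    """Crawler: follows links breadth-first and returns structured rows."""
    seen, queue, rows = {start}, deque([start]), []
    while queue:
        path = queue.popleft()
        scraper = PageScraper()
        scraper.feed(site[path])
        rows.append({"path": path, "title": scraper.title})
        for link in scraper.links:
            # Only follow links we know about and have not visited yet.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return rows


rows = crawl("/index", SITE)
print(rows)
```

The crawler decides which pages to visit, while the scraper turns each page's HTML into a structured row that could go into a spreadsheet or database.
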

# Demo:

In [1]:
# First, install the wikipedia library with the commands below

from IPython.display import clear_output
! pip install wikipedia
! pip install mediawikiapi
clear_output()

# Getting Started
Getting the summary of any title

The summary of any title can be obtained with the summary() method.

- Syntax : wikipedia.summary(title, sentences)
- Argument : Title of the topic
- Optional argument : number of sentences to return.
- Return : Returns the summary as a string.

In [2]:
# importing the module
import wikipedia

# finding result for the search
# sentences = 2 limits the summary to about two sentences
result = wikipedia.summary("Nikola Tesla", sentences = 2)

# printing the result
print(result)


Nikola Tesla ( TESS-lə; Serbian Cyrillic: Никола Тесла, pronounced [nǐkola têsla]; 10 July [O.S. 28 June] 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system.Born and raised in the Austrian Empire, Tesla studied engineering and physics in the 1870s without receiving a degree, gaining practical experience in the early 1880s working in telephony and at Continental Edison in the new electric power industry. In 1884 he emigrated to the United States, where he became a naturalized citizen.
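
A lookup can fail when a title is ambiguous or does not exist. The sketch below wraps `summary()` and catches the library's `DisambiguationError` and `PageError` exceptions. The helper name `safe_summary` and its `wiki` parameter are assumptions made here: the parameter only exists so that a stand-in object can be passed when trying the function offline.

```python
def safe_summary(title, sentences=2, wiki=None):
    """Return a summary string, or None if the title cannot be resolved."""
    if wiki is None:
        # Import the real library lazily so the function can also be
        # exercised offline with a stand-in object.
        import wikipedia as wiki

    try:
        return wiki.summary(title, sentences=sentences)
    except wiki.exceptions.DisambiguationError as err:
        # err.options holds the candidate titles for an ambiguous query.
        print(f"{title!r} is ambiguous; candidates: {err.options[:5]}")
        return None
    except wiki.exceptions.PageError:
        print(f"No page found for {title!r}")
        return None
```

With the library installed and network access available, `safe_summary("Nikola Tesla")` would be called just like `wikipedia.summary`, but it returns `None` instead of raising for these two error cases.
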


# Searching title and suggestions
Matching titles and suggestions can be obtained with the search() method.

- Syntax : wikipedia.search(title, results)
- Argument : Title of the topic
- Optional argument : maximum number of results to return.
- Return : Returns a list of titles.

In [3]:
# importing the module
import wikipedia

# getting suggestions
result = wikipedia.search("Samsung", results = 5)

# printing the result
print(result)


['Samsung', 'Samsung Electronics', 'Samsung Galaxy', 'Samsung Galaxy S series', 'Samsung Galaxy Watch series']
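
The returned list can be post-processed locally. As a rough sketch, the standard library's `difflib` can pick the title closest to a possibly misspelled query. `best_match` is a hypothetical helper name, and the titles are copied from the output above.

```python
import difflib


def best_match(query, titles, cutoff=0.6):
    """Return the title most similar to query, or None if nothing is close."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {title.lower(): title for title in titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


titles = ['Samsung', 'Samsung Electronics', 'Samsung Galaxy',
          'Samsung Galaxy S series', 'Samsung Galaxy Watch series']
print(best_match("samsung galxy", titles))
```
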


# Getting Full Wikipedia Page Data
The page() method is used to get the contents, categories, coordinates, images, links and other metadata of a Wikipedia page.

- Syntax : wikipedia.page(title)
- Argument : Title of the topic.
- Return : Returns a WikipediaPage object.

In [4]:
# importing the module
import wikipedia

# wikipedia page object is created
page_object = wikipedia.page("Languages of India")

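# html is a method, not a property: without parentheses this prints the
# bound method object; call page_object.html() to get the page's HTML
print(page_object.html)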

# printing title
print(page_object.original_title)

# printing links on that page object
print(page_object.links[0:10])


<bound method WikipediaPage.html of <WikipediaPage 'Languages of India'>>
Languages of India
['1951 Census of India', '1961 Census of India', '1991 Census of India', '2001 Census of India', '2011 Census of India', '2011 census of India', '8th Schedule', 'Abahatta', 'Adi language', 'Administrative language']


# Changing language of Wikipedia page
Results can be fetched in another language, provided the page exists in that language. The set_lang() method is used for this.

- Syntax : wikipedia.set_lang(language)
- Argument : the language prefix, e.g. ar for Arabic or hi for Hindi.
- Action performed : Subsequent queries go to the Wikipedia edition in that language. The default language is English.

In [5]:
# importing the module
import wikipedia

# setting language to hindi
wikipedia.set_lang("hi")

# printing the summary
print(wikipedia.summary("India"))


भारत (आधिकारिक नाम: भारत गणराज्य, अंग्रेज़ी: Republic of India) दक्षिण एशिया में स्थित भारतीय उपमहाद्वीप का सबसे बड़ा देश है। भारत भौगोलिक दृष्टि से विश्व का सातवाँ सबसे बड़ा देश है, जबकि जनसंख्या के दृष्टिकोण से चीन के बाद दूसरा सबसे बड़ा देश है। भारत के पश्चिम में पाकिस्तान, उत्तर-पूर्व में चीन(तिब्बत), नेपाल और भूटान, पूर्व में बांग्लादेश और म्यान्मार स्थित हैं। हिन्द महासागर में इसके दक्षिण पश्चिम में मालदीव, दक्षिण में श्रीलंका और दक्षिण-पूर्व में इंडोनेशिया से भारत की सामुद्रिक सीमा लगती है। इसके उत्तर में हिमालय पर्वत तथा दक्षिण में हिन्द महासागर स्थित है। दक्षिण-पूर्व में बंगाल की खाड़ी तथा पश्चिम में अरब सागर है।
आधुनिक मानव या होमो सेपियन्स अफ्रीका से भारतीय उपमहाद्वीप में 55,000 साल पहले आये थे।  1,000 वर्ष पहले ये सिंधु नदी के पश्चिमी हिस्से की तरफ बसे हुए थे जहाँ से इन्होने धीरे धीरे पलायन किया और सिंधु घाटी सभ्यता के रूप में विकसित हुए।  1,200 ईसा पूर्व संस्कृत भाषा सम्पूर्ण भारतीय उपमहाद्वीप में फैली हुए थी और तब तक यहाँ पर हिन्दू धर्म का उद्धव हो चुका था और ऋग्वेद की रच

In [6]:
# importing the module
import wikipedia

# setting language to spanish
wikipedia.set_lang("es")

# printing the summary
print(wikipedia.summary("carne de vaca"))

La carne de vacuno, carne de res o carne de buey es la carne obtenida de res(Bos taurus).
Una de las primeras razas domésticas que pudieron abastecer al hombre de sus necesidades cárnicas pudo haber sido el uro (Bos primigenius) que se extendió a lo largo de Eurasia. En el siglo XVII, algunos ganaderos de Europa empezaron a seleccionar diversas razas bovinas para mejorar ciertas cualidades como su leche, la capacidad y resistencia ante el trabajo agrícola, la calidad de la carne, etc. De esta forma existen hoy en día razas como la francesa Charolesa y Limousin, la italiana Chianina (de tamaño inmenso), las inglesas de Hereford y Shorthorn o la Rubia gallega. En Estados Unidos existen razas autóctonas que proporcionan una carne con sebo entrevetado (en inglés se denomina 'marbling') y que suelen proceder de animales sacrificados a la edad de 15 a 24 meses, este tipo de carne es considerada de buena calidad por el consumidor medio estadounidense. En Japón existen razas como la wagyu de c

In [7]:
# importing the module
import wikipedia

# setting language to english
wikipedia.set_lang("en")

# printing the summary
print(wikipedia.summary("Royal Family"))


A royal family is the immediate family of kings/queens, emirs/emiras, sultans/sultanas, or raja/rani and sometimes their extended family. The term imperial family appropriately describes the family of an emperor or empress, and the term papal family describes the family of a pope, while the terms baronial family, comital family, ducal family, archducal family, grand ducal family, or princely family are more appropriate to describe, respectively, the relatives of a reigning baron, count/earl, duke, archduke, grand duke, or prince. However, in common parlance members of any family which reigns by hereditary right are often referred to as royalty or "royals". It is also customary in some circles to refer to the extended relations of a deposed monarch and their descendants as a royal family. A dynasty is sometimes referred to as the "House of ...". In July 2013 there were 26 active sovereign dynasties in the world that ruled or reigned over 43 monarchies.As of 2021, while there are several
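
Because `set_lang()` changes a global setting, it is easy to forget to switch back. One possible pattern, sketched below, is a small context manager that restores a language afterwards. The name `wiki_language` is made up for this example, and the assumption is that restoring to `"en"` is what you want, since that is the default language.

```python
from contextlib import contextmanager


@contextmanager
def wiki_language(wiki, lang, restore="en"):
    """Temporarily switch the language of the given wikipedia module."""
    wiki.set_lang(lang)
    try:
        yield wiki
    finally:
        # Always switch back, even if the query inside the block fails.
        wiki.set_lang(restore)


# Usage (requires the wikipedia package and network access):
# import wikipedia
# with wiki_language(wikipedia, "hi"):
#     print(wikipedia.summary("India"))
# # queries after the block go to English Wikipedia again
```
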

# Page Content:

To extract the full text of an article, use the page() method and read its content property.

- Syntax: wikipedia.page("Enter Query").content

In [8]:
wikipedia.page("Python (programming language)").content

'Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small- and large-scale projects.Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with ear
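
The content comes back as one long string. Assuming section headings appear in that text as lines of the form `== Heading ==` (which is how the plain-text extract usually marks them), a sketch like the following splits it into sections. `split_sections` is a hypothetical helper, and the sample text is invented for the demonstration.

```python
import re

# A heading line looks like "== History ==" or "=== Early years ===".
HEADING = re.compile(r"(={2,})\s*(.+?)\s*\1")


def split_sections(content, lead_name="Summary"):
    """Split plain article text into {heading: text}, keeping document order."""
    sections = {lead_name: []}
    current = lead_name
    for line in content.splitlines():
        match = HEADING.fullmatch(line.strip())
        if match:
            current = match.group(2)
            sections[current] = []
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


sample = "Intro text.\n\n\n== History ==\nOld stuff.\n\n\n=== Early years ===\nVery old.\n"
sections = split_sections(sample)
print(list(sections))
```

With the library installed, `split_sections(wikipedia.page("Python (programming language)").content)` would be the corresponding call on a real article.
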

# Extract image URLs from a Wikipedia page:

Use page() method and images property.

- Syntax: wikipedia.page("Enter Query").images

In [9]:
wikipedia.page("Python (programming language)").images

['https://upload.wikimedia.org/wikipedia/commons/3/31/Free_and_open-source_software_logo_%282009%29.svg',
 'https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png',
 'https://upload.wikimedia.org/wikipedia/commons/6/6f/Octicons-terminal.svg',
 'https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg',
 'https://upload.wikimedia.org/wikipedia/commons/1/10/Python_3._The_standard_type_hierarchy.png',
 'https://upload.wikimedia.org/wikipedia/commons/b/bd/Python_Powered.png',
 'https://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.svg',
 'https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikibooks-logo.svg',
 'https://upload.wikimedia.org/wikipedia/commons/f/ff/Wikidata-logo.svg',
 'https://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg',
 'https://upload.wikimedia.org/wikipedia/commons/0/0b/Wikiversity_logo_2017.svg',
 'https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg',
 'https
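
As the output shows, the list mixes photos with logos and icons, many of them SVG files. A simple way to narrow it down, sketched here with the standard library, is to filter by file extension. `filter_images` is a hypothetical helper, and the default extension list is only an assumption about what counts as a photo.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filter_images(urls, extensions=(".png", ".jpg", ".jpeg")):
    """Keep only the URLs whose file extension is in extensions."""
    keep = []
    for url in urls:
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in extensions:
            keep.append(url)
    return keep


# URLs copied from the output above
urls = [
    'https://upload.wikimedia.org/wikipedia/commons/9/94/Guido_van_Rossum_OSCON_2006_cropped.png',
    'https://upload.wikimedia.org/wikipedia/commons/6/6f/Octicons-terminal.svg',
    'https://upload.wikimedia.org/wikipedia/commons/b/bd/Python_Powered.png',
]
print(filter_images(urls))
```
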

# Extract current Page URL:

Use page() method and url property.

- Syntax: wikipedia.page("Enter Query").url

In [10]:
wikipedia.page('"Hello, World!" program').url

'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'

# Get the list of categories of an article:

Use page() method and categories property.

- Syntax: wikipedia.page("Enter Query").categories

In [11]:
wikipedia.page('"Hello, World!" program').categories

['Articles with example code',
 'Articles with short description',
 'Commons category link is on Wikidata',
 'Computer programming folklore',
 'Short description is different from Wikidata',
 'Test items in computer languages',
 'Webarchive template wayback links',
 'Wikipedia semi-protected pages']
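
Several of these are maintenance categories rather than topics. If the categories are meant to be used as labels later, one rough heuristic is to drop entries by prefix. The prefix list below is an assumption based only on the output above, and `topical_categories` is a hypothetical helper name.

```python
# Prefixes that look like maintenance categories in the output above
# (a heuristic, not an official list).
MAINTENANCE_PREFIXES = (
    "Articles with",
    "Commons category",
    "Short description",
    "Webarchive",
    "Wikipedia ",
)


def topical_categories(categories, skip_prefixes=MAINTENANCE_PREFIXES):
    """Return the categories that do not start with a maintenance prefix."""
    return [c for c in categories if not c.startswith(skip_prefixes)]


categories = ['Articles with example code',
              'Articles with short description',
              'Commons category link is on Wikidata',
              'Computer programming folklore',
              'Short description is different from Wikidata',
              'Test items in computer languages',
              'Webarchive template wayback links',
              'Wikipedia semi-protected pages']
print(topical_categories(categories))
```
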

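# Get the list of all links in an article:

Use page() method and links property. The result is a list of the titles of the Wikipedia pages that the article links to.

- Syntax: wikipedia.page("Enter Query").links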

In [12]:
wikipedia.page('"Hello, World!" program').links

['.deb',
 '1951 USAF resolution test chart',
 '3D computer graphics',
 '99 Bottles of Beer',
 'ASCII',
 'Acid3',
 'Application programming interface',
 'Artificial intelligence',
 'BASIC',
 'BCPL',
 'B (programming language)',
 'Bad Apple!!',
 'Ballerina (programming language)',
 'Bash (Unix shell)',
 'Batch file',
 'Bell Labs',
 'Brian Kernighan',
 'Brian W. Kernighan',
 'C++',
 'COBOL',
 'C (programming language)',
 'C Sharp (programming language)',
 'Calgary corpus',
 'Canterbury corpus',
 'Catchphrase',
 'Chinese room',
 'Clojure',
 'Complex programmable logic device',
 'Computer',
 'Computer graphics',
 'Computer language',
 'Cornell box',
 'D (programming language)',
 'Dart (programming language)',
 'Data compression',
 'Debian',
 'Dennis M. Ritchie',
 'EICAR test file',
 'ETP-1',
 'EURion constellation',
 'Elixir (programming language)',
 'Embedded system',
 'Englewood Cliffs, NJ',
 'Entry point',
 'Esoteric programming language',
 'Ezhil (programming language)',
 'Factorial',
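
The links are page titles, not URLs. A title can be turned into a URL by replacing spaces with underscores and percent-encoding the rest. The sketch below uses `urllib.parse.quote`; `title_to_url` is a hypothetical helper, and the set of characters left unescaped is an assumption chosen to match the URL shown earlier for the "Hello, World!" page.

```python
from urllib.parse import quote


def title_to_url(title, lang="en"):
    """Build a Wikipedia article URL from a page title."""
    # Spaces become underscores; commas, exclamation marks and
    # parentheses are left as-is, everything else is percent-encoded.
    path = quote(title.replace(" ", "_"), safe=",!()")
    return f"https://{lang}.wikipedia.org/wiki/{path}"


print(title_to_url('"Hello, World!" program'))
print(title_to_url("C (programming language)"))
```
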
 

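# Get data in different languages:

To fetch data in another language, call the set_lang() method before the query. The setting is global and stays in effect until set_lang() is called again. The title also has to exist in the target language: in the output below, Hindi Wikipedia returned the summary of its article on Java rather than a "Hello, World!" page, most likely because the library fell back to the closest search match.

- Syntax: wikipedia.set_lang("Enter Language Type")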

In [13]:
wikipedia.set_lang("hi")
wikipedia.summary('"Hello, World!" program')

'जावा एक प्रोग्रामिंग भाषा है जिसे मूलतः सन माइक्रोसिस्टम्स के जेम्स गोसलिंग द्वारा विकसित किया गया तथा 1995 में इसे सन माइक्रोसिस्टम्स के जावा प्लेटफ़ार्म के एक मुख्य अवयव के रूप में रिलीज़ किया गया। भाषा अपना अधिकांश वाक्य विन्यास (सिंटेक्स) C (सी) और C++ से प्राप्त करती है लेकिन इसके पास एक सरल ऑब्जेक्ट मॉडल और कुछ निम्न स्तर की सुविधायें मौजूद हैं। जावा के प्रयोगों को विशिष्ट रूप से बाईटकोड (क्लास फाइल) के लिए संकलित किया जाता है जिसे किसी भी कंप्यूटर आर्किटेक्चर वाले किसी भी जावा वर्चुअल मशीन (JVM) पर चालू किया जा सकता है।\n1995 से सन द्वारा मूल तथा सन्दर्भ कार्यान्वयन जावा संकलकों (कम्पाइलरों), वर्चुअल मशीनों और क्लास लाइब्रेरियों को विकसित किया गया। मई 2007 तक, जावा कम्युनिटी प्रोसेस के विशेष उल्लेखपूर्वक अनुमति में सन ने अपने अधिकांश जावा प्रोद्योगिकियों को GNU जनरल पब्लिक लाइसेन्स के अर्न्तगत मुफ्त सॉफ्टवेयर के रूप में उपलब्ध कराया. दूसरों ने भी सन की इन प्रोद्योगिकियों के वैकल्पिक कार्यान्वयनों को विकसित किया, जैसे कि GNU क्लासपाथ और जावा के लिए GNU कम्पाइलर.'

# Conclusion:
- Here we covered the basics of the wikipedia module: getting a summary, image URLs, page URLs, search suggestions for a topic, and more.
- So far we have mostly not scraped the data from Wikipedia ourselves, since the library fetched it for us, but we can do that too.
- There are a number of ways to experiment with this library for collecting data.
- We can also scrape exactly the data we want in a few steps, but that will be covered in a separate notebook.

# Next Work:
- [Scraping the custom data from Wikipedia](https://www.kaggle.com/meetnagadia/scrap-data-using-wikipedia)
- Preprocessing the Data.
- Converting data to csv format.
- Performing Classification on our Corpus.

# Links to Understand Web Scraping in detail:
[realpython](https://realpython.com/beautiful-soup-web-scraper-python/)<br>
[freecodecamp](https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/)<br>
[edureka](https://www.edureka.co/blog/web-scraping-with-python/)<br>
[towardsdatascience](https://towardsdatascience.com/step-by-step-tutorial-web-scraping-wikipedia-with-beautifulsoup-48d7f2dfa52d)<br>
[pypi](https://pypi.org/project/wikipedia/)<br>
[alanhylands](https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/)
