[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kjmazidi/NLP/blob/master/Part_4-Documents/Chapter_12_Corpora/Wikipedia.ipynb)

###### Code accompanies *Natural Language Processing* by KJG Mazidi, all rights reserved.

### Downloading Data with the Wikipedia Library

This notebook demonstrates how to download data from Wikipedia with the `wikipedia` library. Install it with pip or pip3:

``` 
pip install wikipedia
```

The `wikipedia` library is a wrapper around the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page), designed to simplify downloading data such as article summaries and links. For heavy downloading, use one of the more robust API wrappers listed here: https://en.wikipedia.org/wiki/Help:Creating_a_bot#Python

Read the docs for this package here: https://wikipedia.readthedocs.io/en/latest/

In [24]:
# run this code block if you are on Colab

!pip install wikipedia


In [1]:
import wikipedia



In [2]:
print(wikipedia.summary('Austin, Texas'))

Austin (  AW-stin) is the capital of the U.S. state of Texas and the seat and most populous city of Travis County, with portions extending into Hays and Williamson counties. Incorporated on December 27, 1839, it is the 28th-largest metropolitan area in the United States, the 11th-most populous city in the United States, the fourth-most populous city in the state after Houston, San Antonio, and Dallas, and the second-most populous state capital city after Phoenix, the capital of Arizona. It has been one of the fastest growing large cities in the United States since 2010. Downtown Austin and Downtown San Antonio are approximately 80 miles (129 km) apart, and both fall along the Interstate 35 corridor. This combined metropolitan region of San Antonio–Austin has approximately 5 million people. Austin is the southernmost state capital in the contiguous United States and is considered a Gamma + level global city as categorized by the Globalization and World Cities Research Network.
As of 202

In [3]:
# for a given search term there are likely many matches

wikipedia.search('Austin')

['Austin',
 'Austin, Texas',
 'Austin Butler',
 'Austin Austin',
 'Mary Austin',
 'University of Texas at Austin',
 'Stone Cold Steve Austin',
 'Austin Wells',
 'Austin Powers',
 'Chris Austin']

In [4]:
# load the article as a page object named 'austin'

austin = wikipedia.page('Austin, Texas')
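
A title passed to `wikipedia.page` can be ambiguous, as the search results for 'Austin' suggest. In that case the library raises `wikipedia.exceptions.DisambiguationError`, whose `options` attribute lists candidate titles. The sketch below shows one possible way to handle this. The names `resolve_page` and `fetch` are hypothetical helpers introduced here for illustration and are not part of the library.

```python
def resolve_page(title, fetch=None):
    """Return (page, None) on success, or (None, options) if the title is ambiguous.

    `fetch` is a hypothetical hook: any callable that takes a title and
    returns a page. It defaults to wikipedia.page.
    """
    if fetch is None:
        import wikipedia  # imported lazily so the helper can be tried offline
        fetch = wikipedia.page
    try:
        return fetch(title), None
    except Exception as err:
        # DisambiguationError carries the candidate titles in `.options`;
        # any error without that attribute is re-raised unchanged.
        options = getattr(err, "options", None)
        if options is None:
            raise
        return None, list(options)
```

A call such as `page, options = resolve_page('Austin')` then returns either a page or a list of titles to choose from, instead of stopping the notebook with a traceback.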

In [5]:
austin.title

'Austin, Texas'

In [6]:
austin.content[:100]

'Austin (  AW-stin) is the capital of the U.S. state of Texas and the seat and most populous city of '

In [7]:
# same result as the cell above, but keeps the full text in a variable

austin_text = austin.content
austin_text[:100]

'Austin (  AW-stin) is the capital of the U.S. state of Texas and the seat and most populous city of '
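
Since `content` is an ordinary Python string, it can be written to disk to start a small corpus. The following is a minimal sketch under that assumption. The helper name `save_article` and the file-naming scheme are illustrative choices, not something the library provides.

```python
from pathlib import Path


def save_article(text, directory, title):
    """Write article text to <directory>/<title>.txt and return the path.

    Spaces become underscores and slashes become hyphens so the title
    is usable as a file name (a simplification; other characters may
    still need handling on some file systems).
    """
    safe_name = title.replace(" ", "_").replace("/", "-") + ".txt"
    path = Path(directory) / safe_name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")  # UTF-8 preserves non-ASCII text
    return path
```

For example, `save_article(austin_text, 'corpus', austin.title)` would write the article to `corpus/Austin,_Texas.txt`.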

In [8]:
austin.url

'https://en.wikipedia.org/wiki/Austin,_Texas'

In [9]:
austin_links = austin.links
austin_links[:5]  # get the first 5 links

['1836 Republic of Texas presidential election',
 '1838 Republic of Texas presidential election',
 '1841 Republic of Texas presidential election',
 '1844 Republic of Texas presidential election',
 '1850 United States census']
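
The `links` attribute is a plain list of title strings, so ordinary list operations apply to it. As a small illustration, the sketch below filters titles by keyword. The function `links_containing` is a hypothetical helper, and the sample list simply repeats the five titles printed above so the example runs without a network connection.

```python
def links_containing(links, keyword):
    """Return the link titles that contain `keyword`, ignoring case."""
    keyword = keyword.lower()
    return [title for title in links if keyword in title.lower()]


# the first five links shown in the output above
sample_links = [
    '1836 Republic of Texas presidential election',
    '1838 Republic of Texas presidential election',
    '1841 Republic of Texas presidential election',
    '1844 Republic of Texas presidential election',
    '1850 United States census',
]

print(links_containing(sample_links, 'census'))
# ['1850 United States census']
```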

In [10]:
# set_lang switches the language for all subsequent calls
# see list of languages here: https://meta.wikimedia.org/wiki/List_of_Wikipedias
wikipedia.set_lang('es')

In [11]:
wikipedia.summary('Austin, Texas', sentences=1)

'Austin es la capital del estado estadounidense de Texas y del condado de Travis.'
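
Because `set_lang` changes the language for every later call, it is easy to forget that the notebook is no longer querying English Wikipedia. One possible pattern, sketched below, is to switch the language only for a single lookup and then restore a default. The helper `summary_in`, its `default_lang` value of `'en'`, and the `api` parameter are assumptions made for this example. The library itself only supplies `set_lang` and `summary`.

```python
def summary_in(lang, title, sentences=1, default_lang="en", api=None):
    """Return a summary in `lang`, then switch back to `default_lang`.

    `api` is a hypothetical hook for any object with `set_lang` and
    `summary`; it defaults to the wikipedia module.
    """
    if api is None:
        import wikipedia as api  # imported lazily so the helper can be tried offline
    api.set_lang(lang)
    try:
        return api.summary(title, sentences=sentences)
    finally:
        # runs even if the lookup fails, so later cells use the default language
        api.set_lang(default_lang)
```

With this helper, `summary_in('es', 'Austin, Texas')` would return the Spanish summary while leaving later cells on the default language. Note that an article's title may differ between language editions.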