# Wikipedia Data Collection

## Introduction

In this lab, we will familiarize ourselves with two Python wrappers for the Wikipedia API: [wikipedia](https://wikipedia.readthedocs.io/en/latest/) and [wikipedia-api](https://wikipedia-api.readthedocs.io/en/latest/README.html). By completing this lab, you should be able to extract information from Wikipedia pages and collect page information from larger categories.

Note: These are Python wrappers for the API, which means that interactions are much simpler but also more limited. 
More advanced API interactions like [creating a bot](https://en.wikipedia.org/wiki/Help:Creating_a_bot#Python) require either the underlying [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) or more specialized wrappers. This lab deals primarily with basic extraction of information to get you comfortable with working with APIs.

### Pre-requisites
- Install the `wikipedia` and `wikipedia-api` Python wrappers (uncomment and run the cell below, or run the same `pip install` commands in your terminal)

In [None]:
#%pip install wikipedia
#%pip install wikipedia-api

In [73]:
import wikipedia as wikisearch
import wikipediaapi
import pandas as pd

`wikipediaapi` requires you to create an instance with a specified user agent and language. 
Fill in a project name and your email (no need to have an account) in the variables. 
Feel free to change the language as needed; it's currently set to English (`'en'`).

In [3]:
proj_name = 'INFO492Lab'
email = 'hayad03@uw.edu'
wiki = wikipediaapi.Wikipedia(f'{proj_name} ({email})', 'en')

## Searching for a Page

The most basic function of Wikipedia interaction is collecting information from a single page. Let's say that you don't know the exact wording of the page title. Using the `wikipedia` wrapper, we can get some suggestions based on our query - similar to the function of a search bar. 

Remember that we've named the `wikipedia` wrapper `wikisearch`. This just helps differentiate the multiple wrappers we're using.

In [7]:
wikisearch.search("Twitter") # Try it out with your own words!

['Twitter',
 'Twitter, Inc.',
 'Twitter Files',
 'List of most-followed Twitter accounts',
 'Twitter verification',
 'Stan Twitter',
 'Twitter under Elon Musk',
 'Censorship of Twitter',
 'Black Twitter',
 'Timeline of Twitter']

Great! Now we have a list of page titles that we might be able to get information on. That extra step helps avoid errors when querying for a Wikipedia page whose title might have multiple meanings.
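One way to use the suggestion list is to prefer an exact (case-insensitive) match and otherwise fall back to the first suggestion. The sketch below is only an illustration: `pick_title` is a hypothetical helper, not part of either wrapper, and the sample list is copied from the search output above so the cell runs offline.

```python
def pick_title(results, query):
    """Return the exact case-insensitive match if present, else the first result, else None."""
    for title in results:
        if title.lower() == query.lower():
            return title
    return results[0] if results else None


# Offline demo using part of the result list shown above.
sample_results = ["Twitter", "Twitter, Inc.", "Twitter Files"]
exact = pick_title(sample_results, "twitter files")  # exact match exists
fallback = pick_title(sample_results, "tweet")       # no exact match, first result used
print(exact, "|", fallback)
```

In the notebook you could pass the live results instead, for example `pick_title(wikisearch.search("Twitter"), "Twitter")`.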

>>To get information on an actual page, I find it best to use `wikipediaapi`, for two reasons: the `wikipedia` wrapper (although much more popular) hasn't been updated since 2014, and it doesn't give you as much metadata on the page itself. You can use either wrapper for this, though; it's just a matter of preference.

In [None]:
page = wiki.page("Twitter")
dir(page) # This shows all the attributes in an object so you can find out what's available to get info on

`wiki.page` returns a page object with a few different attributes, such as `summary`, `text`, `title`, `links`, `sections`, and `categories`. 
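The raw `dir()` listing also includes many names that start with an underscore, which can bury the attributes you care about. A minimal sketch for filtering them out follows; `public_attributes` and `DemoPage` are hypothetical names used only so the example runs without a network call.

```python
def public_attributes(obj):
    """List attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class DemoPage:
    """Stand-in for a page object (hypothetical, for offline illustration only)."""
    title = "Twitter"
    summary = "Placeholder summary text."


demo_attrs = public_attributes(DemoPage())
print(demo_attrs)
```

With a real page you would call `public_attributes(page)` on the object returned by `wiki.page(...)`.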

### Get Page Info -- Try it Yourself

When working with a lot of different queries, it's easiest to turn repeated processes like this into functions. Try to create a function that accepts a page name and returns a dict with the:
* Page Title
* Page Summary
* Categories a page is in
* Page Sections
* and Page Text (first 500 words after summary)

**HINT:** You'll notice that getting page sections only returns top-level sections. Take a look at [the documentation](https://wikipedia-api.readthedocs.io/en/latest/README.html) to see if there's a better way to get and display a page's sections. You might need to adapt what's been provided to fit your needs. 
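As a starting point for the first hint, nested sections can be walked recursively. This is a sketch rather than a full solution: it assumes each section object exposes `.title` and `.sections`, as the `wikipedia-api` section objects do, and `flatten_sections` plus the demo section titles are made up so the cell runs offline.

```python
from types import SimpleNamespace


def flatten_sections(sections, level=0):
    """Return (level, title) pairs for every section and nested subsection."""
    flat = []
    for section in sections:
        flat.append((level, section.title))
        # Recurse into subsections, one level deeper.
        flat.extend(flatten_sections(section.sections, level + 1))
    return flat


# Offline demo: stand-in objects with hypothetical section titles.
demo_sections = [
    SimpleNamespace(title="History", sections=[
        SimpleNamespace(title="Founding", sections=[]),
    ]),
    SimpleNamespace(title="Features", sections=[]),
]
outline = flatten_sections(demo_sections)
for level, title in outline:
    print("  " * level + title)
```

On a real page you would pass `page.sections` instead of `demo_sections`.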

**HINT 2:** The page text includes the summary. See if you can get the text without any overlap.
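For the second hint, one possible approach is plain string handling: drop the summary from the front of the text, then keep the first N words. The sketch assumes the text starts with the summary verbatim (if it does not, the text is left unchanged). `body_excerpt` and the demo strings are hypothetical.

```python
def body_excerpt(text, summary, max_words=500):
    """Return up to max_words words of text, skipping a leading copy of summary."""
    if summary and text.startswith(summary):
        body = text[len(summary):]
    else:
        body = text
    # split() with no arguments also collapses newlines and repeated spaces.
    return " ".join(body.split()[:max_words])


# Offline demo with made-up strings.
demo_summary = "Alpha is a letter."
demo_text = "Alpha is a letter.\n\nHistory\nIt came from Phoenician aleph."
excerpt = body_excerpt(demo_text, demo_summary, max_words=3)
print(excerpt)
```

Note that joining on spaces discards paragraph breaks, and section headings that appear in the text count as words; whether that is acceptable depends on what you plan to do with the excerpt.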

In [80]:
# Your Solution

In [87]:
# Test your code here

### Getting Pages by Category

When we're getting info on a lot of pages at once, we might want to see if Wikipedia has already created a grouping of pages within our area of interest. For example, we can see that Twitter is part of a category called 'American social networking mobile apps', so we could grab a list of all the Wikipedia pages under that same category. The cell below does this for the Physics category; note the `Category:` prefix in the page title.

In [34]:
wiki.page("Category:Physics").categorymembers

{'Physics': Physics (id: ??, ns: 0),
 'Portal:Physics': Portal:Physics (id: ??, ns: 100),
 'Action principles': Action principles (id: ??, ns: 0),
 'Charge based boundary element fast multipole method': Charge based boundary element fast multipole method (id: ??, ns: 0),
 'Computational chemistry': Computational chemistry (id: ??, ns: 0),
 'Dynamic toroidal dipole': Dynamic toroidal dipole (id: ??, ns: 0),
 'Talk:Dynamic toroidal dipole': Talk:Dynamic toroidal dipole (id: ??, ns: 1),
 'Edge states': Edge states (id: ??, ns: 0),
 'Force control': Force control (id: ??, ns: 0),
 'Isoelectric (electric potential)': Isoelectric (electric potential) (id: ??, ns: 0),
 'Laser cooling': Laser cooling (id: ??, ns: 0),
 'Neutral atom quantum computer': Neutral atom quantum computer (id: ??, ns: 0),
 'Olsen cycle': Olsen cycle (id: ??, ns: 0),
 'Overlap fermion': Overlap fermion (id: ??, ns: 0),
 'Talk:Paul Harry Roberts': Talk:Paul Harry Roberts (id: ??, ns: 1),
 'Quasi-isodynamic stellarator': 

### Try it yourself

This result is difficult to read. Using the documentation or adapting your earlier solution, create a function that returns the category members in a more readable way. 
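One readable layout, among many, is to group member titles by the namespace number that appears as `ns` in the output above. The sketch assumes `categorymembers` behaves like a dict mapping titles to objects with an `.ns` attribute, which matches that output; `group_by_namespace` is a hypothetical helper, and the demo entries are copied from the Physics listing.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_namespace(members):
    """Map each namespace number to a sorted list of member titles."""
    groups = defaultdict(list)
    for title, page in members.items():
        groups[page.ns].append(title)
    return {ns: sorted(titles) for ns, titles in sorted(groups.items())}


# Offline demo: stand-ins for a few entries from the output above.
demo_members = {
    "Physics": SimpleNamespace(ns=0),
    "Portal:Physics": SimpleNamespace(ns=100),
    "Laser cooling": SimpleNamespace(ns=0),
    "Talk:Dynamic toroidal dipole": SimpleNamespace(ns=1),
}
grouped = group_by_namespace(demo_members)
for ns, titles in grouped.items():
    print(f"ns {ns}: {', '.join(titles)}")
```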

In [86]:
# Your Solution

In [85]:
# Test your code here

## **BONUS:** Getting Page Info by Category
Combine your earlier work to create a function that automatically builds a database (for example, a pandas `DataFrame`) of page information from a category search. As before, the function should accept the name of a category and return a database with page information including the title, summary, sections, categories, and first 500 words of text. 

**NOTE:** Some of the items returned by `.categorymembers` are themselves categories or lists. Try to exclude those from the search.
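A possible filter for the note above keeps only members in the article namespace and skips titles that look like lists. Treat it as a sketch under stated assumptions: MediaWiki uses namespace 0 for articles and 14 for categories, the "List of" prefix check is only a heuristic, and `article_titles` and the last two demo entries are made up for illustration.

```python
from types import SimpleNamespace

ARTICLE_NS = 0  # shown as "ns: 0" in the category output above


def article_titles(members):
    """Return titles of members that are articles and do not look like lists."""
    return [
        title
        for title, page in members.items()
        if page.ns == ARTICLE_NS
        and not title.startswith(("List of", "Lists of"))
    ]


# Offline demo; the last two entries are hypothetical examples.
demo_members = {
    "Physics": SimpleNamespace(ns=0),
    "Portal:Physics": SimpleNamespace(ns=100),
    "Laser cooling": SimpleNamespace(ns=0),
    "Category:Subfields of physics": SimpleNamespace(ns=14),
    "List of physicists": SimpleNamespace(ns=0),
}
kept = article_titles(demo_members)
print(kept)
```

The surviving titles can then be fed one at a time into your page-info function, and the resulting list of dicts passed to `pd.DataFrame` if you want a table.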

In [83]:
# Your Solution


In [84]:
# Test your code here. Depending on the category you search, this might take a while to run