# Wikipedia Data Collection - SOLUTION

## Introduction

In this lab, we will familiarize ourselves with two Python wrappers for the Wikipedia API: [wikipedia](https://wikipedia.readthedocs.io/en/latest/) and [wikipedia-api](https://wikipedia-api.readthedocs.io/en/latest/README.html). By completing this lab, you should be able to extract information from Wikipedia pages and collect page information from larger categories.

Note: These are Python wrappers for the API, which makes interactions much simpler but also more limited.
More advanced API interactions, like [creating a bot](https://en.wikipedia.org/wiki/Help:Creating_a_bot#Python), require either the original [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) or other specialized wrappers. This lab deals primarily with basic extraction of information to get you comfortable working with APIs.

### Pre-requisites
- Install the `wikipedia` and `wikipedia-api` Python wrappers in your terminal (code below)

In [1]:
#!pip install wikipedia
#!pip install wikipedia-api

In [2]:
import wikipedia as wikisearch
import wikipediaapi
import pandas as pd

`wikipediaapi` requires you to create an instance with a specified user agent and language.
Fill in a project name and your email (no need to have an account) in the variables.
Feel free to change the language as needed; it's currently set to English.

In [3]:
proj_name = 'INFO492Lab'
email = 'hayad03@uw.edu'
wiki = wikipediaapi.Wikipedia(f'{proj_name} ({email})', 'en')

## Searching for a Page

The most basic function of Wikipedia interaction is collecting information from a single page. Let's say that you don't know the exact wording of the page title. Using the `wikipedia` wrapper, we can get some suggestions based on our query - similar to the function of a search bar. 

Remember that we've named the `wikipedia` wrapper `wikisearch`. This just helps differentiate the multiple wrappers we're using.

In [4]:
wikisearch.search("Twitter") # Try it out with your own words!

['Twitter',
 'Twitter, Inc.',
 'Twitter Files',
 'List of most-followed Twitter accounts',
 'Twitter verification',
 'Stan Twitter',
 'Twitter under Elon Musk',
 'Black Twitter',
 'Acquisition of Twitter by Elon Musk',
 'Censorship of Twitter']

Great! Now we have a list of page titles that we might be able to get information on. That extra step helps avoid errors when querying for a Wikipedia page whose title might have multiple meanings.
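
As a minimal sketch of how that list might be used, the hypothetical helper `pick_title` below (not part of either wrapper) prefers an exact, case-insensitive match and otherwise falls back to the first suggestion. The candidate list is hard-coded from the output above so the snippet runs offline; in the lab it would come from `wikisearch.search(...)`.

```python
def pick_title(query, candidates):
    """Pick a page title from a list of search suggestions (hypothetical helper)."""
    for title in candidates:
        if title.casefold() == query.casefold():
            return title  # exact match, ignoring case
    # No exact match: fall back to the top suggestion, if there is one
    return candidates[0] if candidates else None


# In the lab: candidates = wikisearch.search("twitter")
candidates = ["Twitter", "Twitter, Inc.", "Twitter Files"]
print(pick_title("twitter", candidates))        # Twitter
print(pick_title("twitter files", candidates))  # Twitter Files
```

Falling back to the first suggestion is only one possible policy; for ambiguous queries it may be safer to show the whole list and choose by hand.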

>>To get information on an actual page, I find it best to use `wikipediaapi`, for two main reasons: the `wikipedia` wrapper (although much more popular) hasn't been updated since 2014, and it doesn't give you as much metadata on the page itself. You can use either wrapper for this, though; it's just a matter of preference.

In [5]:
page = wiki.page("Twitter")
dir(page) # This shows all the attributes in an object so you can find out what's available to get info on

['ATTRIBUTES_MAPPING',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_attributes',
 '_backlinks',
 '_called',
 '_categories',
 '_categorymembers',
 '_fetch',
 '_langlinks',
 '_links',
 '_section',
 '_section_mapping',
 '_summary',
 'backlinks',
 'categories',
 'categorymembers',
 'exists',
 'langlinks',
 'language',
 'links',
 'namespace',
 'section_by_title',
 'sections',
 'sections_by_title',
 'summary',
 'text',
 'title',
 'wiki']

The page object has several useful attributes, including `summary`, `text`, `title`, `links`, `sections`, and `categories`.
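
Before reading those attributes, it can help to confirm that the page exists, since `wiki.page(...)` hands back an object even when no article has that title. The sketch below relies on the `exists()` method listed in the `dir(page)` output above; `preview_page` and the `SimpleNamespace` stand-ins are hypothetical and are used only so the snippet runs without a network call.

```python
from types import SimpleNamespace


def preview_page(page, max_chars=80):
    """Return the title and a short summary preview, or None if the page is missing."""
    if not page.exists():
        return None
    return {"title": page.title, "preview": page.summary[:max_chars]}


# Stand-ins that mimic the attributes used above; in the lab, pass wiki.page("Twitter")
real = SimpleNamespace(
    title="Twitter",
    summary="X, commonly referred to by its former name Twitter",
    exists=lambda: True,
)
missing = SimpleNamespace(title="No such page", summary="", exists=lambda: False)

print(preview_page(real))
print(preview_page(missing))  # None
```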

### Get Page Info -- Try it Yourself

It's easiest when working with a lot of different queries to turn repeated processes like this into functions. Try to create a function that accepts a page name and returns a dict with the:
* Page Title
* Page Summary
* Categories a page is in
* Page Sections
* and Page Text (first 500 characters after the summary)

**HINT:** You'll notice that getting page sections only returns high level sections. Take a look at [the documentation](https://wikipedia-api.readthedocs.io/en/latest/README.html) to see if there's a better way to get and display a page's sections. You might need to adapt what's been provided to fit your needs. 

**HINT 2:** The page text includes the summary. See if you can get the text without any overlap.

In [6]:
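def get_page_info(name):
    """Return a dict of basic information about the named Wikipedia page."""
    page = wiki.page(name)
    summary_length = len(page.summary)
    return {
        "title": page.title,
        "summary": page.summary,
        "sections": format_sections(page.sections),
        "categories": page.categories,
        # page.text starts with the summary, so skip past it and keep the next 500 characters
        "text": page.text[summary_length:summary_length + 500]
    }

def format_sections(sections, level=0, section_list=None):
    """Recursively collect the titles of all sections and subsections into one flat list."""
    # Create the list on the first call; recursive calls share the same list
    if section_list is None:
        section_list = []

    for s in sections:
        section_list.append(s.title)
        format_sections(s.sections, level + 1, section_list)
    return section_list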
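
If the nesting matters, the same recursion can record each section's depth. The sketch below is one possible variation rather than the lab's required answer: `outline_sections` is a hypothetical name, and the `SimpleNamespace` objects (with made-up titles) stand in for the section objects in `page.sections`, which are assumed to expose `title` and `sections` as used above.

```python
from types import SimpleNamespace


def outline_sections(sections, level=0):
    """Return section titles as strings indented by nesting depth."""
    lines = []
    for s in sections:
        lines.append("  " * level + s.title)
        lines.extend(outline_sections(s.sections, level + 1))
    return lines


# Made-up stand-in section tree; in the lab, pass wiki.page("Twitter").sections
history = SimpleNamespace(
    title="History",
    sections=[SimpleNamespace(title="Early years", sections=[])],
)
references = SimpleNamespace(title="References", sections=[])

print("\n".join(outline_sections([history, references])))
```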

In [7]:
get_page_info('Twitter')

{'title': 'Twitter',
 'summary': 'X, commonly referred to by its former name Twitter, is a social media website based in the United States. With over 500 million users, it is one of the world\'s largest social networks and the fifth-most visited website in the world. Users can share text messages, images, and videos as "tweets". X (Twitter) also includes direct messaging, video and audio calling, bookmarks, lists and communities, and Spaces, a social audio feature. Users can vote on context added by approved users using the Community Notes feature.\nThe service is owned by the American company X Corp., the successor of Twitter, Inc. Twitter was created in March 2006 by Jack Dorsey, Noah Glass, Biz Stone, and Evan Williams, and launched in July of that year. Twitter grew quickly, and by 2012, more than 100 million users produced 340 million tweets per day. Twitter, Inc., was based in San Francisco, California, and had more than 25 offices around the world. A signature characteristic of 
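
The slice in `get_page_info` assumes that `page.text` begins with exactly the characters of `page.summary`. A slightly more defensive sketch checks that assumption first. `text_after_summary` is a hypothetical helper, and the strings are made-up stand-ins so the snippet runs offline.

```python
def text_after_summary(text, summary, max_chars=500):
    """Return up to max_chars of the text that follows the summary."""
    if text.startswith(summary):
        body = text[len(summary):]
    else:
        body = text  # summary not found at the start, so keep the full text
    # Drop the blank lines that separate the summary from the first section
    return body.lstrip()[:max_chars]


text = "Intro paragraph.\n\nHistory\nFounded long ago."
print(repr(text_after_summary(text, "Intro paragraph.", max_chars=10)))
```

Unlike the solution above, this version strips the leading newlines, which is a matter of taste.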

### Getting Pages by Category

When we're getting info on a lot of pages at once, we might want to see if Wikipedia has already created a grouping of pages within our area of interest. For example, we can see that Twitter is part of a category called 'American social networking mobile apps'. The example below uses a similar category, 'American social networking websites', and grabs a list of all the Wikipedia pages under it.
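
`page.categories` is a dictionary keyed by titles of the form `'Category:...'`, as the `categories` column in this lab's final output shows. A minimal sketch for turning those keys into a plain, sorted list of names follows; `category_names` is a hypothetical helper and `fake_categories` is a stand-in for the real dictionary.

```python
def category_names(categories):
    """Strip the 'Category:' prefix from each key and return a sorted list."""
    prefix = "Category:"
    names = []
    for title in categories:
        names.append(title[len(prefix):] if title.startswith(prefix) else title)
    return sorted(names)


# Stand-in for page.categories; in the real wrapper the values are page objects
fake_categories = {
    "Category:American social networking mobile apps": None,
    "Category:2000s neologisms": None,
}
print(category_names(fake_categories))
```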

In [13]:
wiki.page("Category:American_social_networking_websites").categorymembers

{'8tracks.com': 8tracks.com (id: ??, ns: 0),
 '43 Things': 43 Things (id: ??, ns: 0),
 'About.me': About.me (id: ??, ns: 0),
 'Academia.edu': Academia.edu (id: ??, ns: 0),
 'All Partners Access Network': All Partners Access Network (id: ??, ns: 0),
 'Ally Energy': Ally Energy (id: ??, ns: 0),
 'App.net': App.net (id: ??, ns: 0),
 'Athlinks': Athlinks (id: ??, ns: 0),
 'Bolt (website)': Bolt (website) (id: ??, ns: 0),
 'BookCrossing': BookCrossing (id: ??, ns: 0),
 'Boredat': Boredat (id: ??, ns: 0),
 'Bottlenose (company)': Bottlenose (company) (id: ??, ns: 0),
 'Brainly': Brainly (id: ??, ns: 0),
 'Bring Light': Bring Light (id: ??, ns: 0),
 'Broadband Sports': Broadband Sports (id: ??, ns: 0),
 'Buzznet': Buzznet (id: ??, ns: 0),
 'Cake Financial': Cake Financial (id: ??, ns: 0),
 'Care2': Care2 (id: ??, ns: 0),
 'Cellufun': Cellufun (id: ??, ns: 0),
 'Chess.com': Chess.com (id: ??, ns: 0),
 'Chictopia': Chictopia (id: ??, ns: 0),
 'City-Data': City-Data (id: ??, ns: 0),
 'Classical 

### Try it yourself

This result is very repetitive. Using [the documentation](https://wikipedia-api.readthedocs.io/en/latest/README.html) or adapting your earlier solution, create a function that returns the category members in a more readable way.

In [23]:
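def get_categorymembers(categorymembers, level=0, max_level=1, withSubcategories=True, categoryList=None):
    """Return the titles of all members of a category, optionally descending into subcategories."""
    # Create the list on the first call; recursive calls share the same list
    if categoryList is None:
        categoryList = []
    for c in categorymembers.values():
        categoryList.append(c.title)
        # Recurse into subcategories until max_level is reached
        if withSubcategories and c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level:
            get_categorymembers(c.categorymembers, level + 1, max_level, categoryList=categoryList)
    return categoryList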
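
Because the function also walks into subcategories, a page that belongs to both a category and one of its subcategories may be appended twice. One way to drop repeats while keeping first-seen order is sketched below. `unique_titles` is a hypothetical helper, not part of `wikipediaapi`, and the title list is a made-up stand-in.

```python
def unique_titles(titles):
    """Remove duplicate titles while preserving the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(titles))


titles = ["Facebook", "Category:Twitter", "Twitter", "Facebook"]
print(unique_titles(titles))  # ['Facebook', 'Category:Twitter', 'Twitter']
```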

In [26]:
cat = wiki.page("Category:American_social_networking_websites")
print("Category members: Category:American_social_networking_websites")
get_categorymembers(cat.categorymembers)

Category members: Category:American_social_networking_websites


['8tracks.com',
 '43 Things',
 'About.me',
 'Academia.edu',
 'All Partners Access Network',
 'Ally Energy',
 'App.net',
 'Athlinks',
 'Bolt (website)',
 'BookCrossing',
 'Boredat',
 'Bottlenose (company)',
 'Brainly',
 'Bring Light',
 'Broadband Sports',
 'Buzznet',
 'Cake Financial',
 'Care2',
 'Cellufun',
 'Chess.com',
 'Chictopia',
 'City-Data',
 'Classical Lounge',
 'Classmates.com',
 'Coffee Meets Bagel',
 'Cohost',
 'CouchSurfing',
 'Craftster',
 'CUNY Academic Commons',
 'DataLounge',
 'Diabetes Hands Foundation',
 'Dogster',
 'Doximity',
 'Dreamwidth',
 'DYP (app)',
 'Elixio',
 'Ello (social network)',
 'Engage With Grace',
 'Eons.com',
 'Essembly',
 'Everloop',
 'Everything2',
 'Experts Exchange',
 'Facebook',
 'Faves.com',
 'FieldLevel',
 'Firefly (website)',
 'Fitocracy',
 'Flickchart',
 'Flixster',
 'FlyLady',
 'Focus.com',
 'Foodily.com',
 'Fotolog',
 'Foursquare City Guide',
 'The Freecycle Network',
 'Friendster',
 'Fyuse',
 'Gab (social network)',
 'Gaia Online',
 'Game

## **BONUS:** Getting Page Info by Category
Combine your earlier work to create a function that automatically builds a table of page information from a category search. Like before, the function should accept the name of a category and return a pandas DataFrame with each page's title, summary, sections, categories, and the first 500 characters of text after the summary.

**NOTE:** Some of the items returned by .categorymembers are themselves categories or lists. Try to exclude those from the search.
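
One alternative to matching on title text is to filter on each member's namespace number. In MediaWiki, namespace 0 holds ordinary articles and namespace 14 holds categories, which matches the `ns: 0` shown in the earlier output. The sketch below assumes members expose `ns` and `title` as they do in `get_categorymembers`; `article_titles`, the constant, and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace

MAIN_NAMESPACE = 0  # MediaWiki namespace number for ordinary articles


def article_titles(categorymembers):
    """Keep article-namespace members whose titles do not start with 'List of'."""
    return [
        member.title
        for member in categorymembers.values()
        if member.ns == MAIN_NAMESPACE and not member.title.startswith("List of")
    ]


# Stand-in for page.categorymembers (title -> page object)
members = {
    "Facebook": SimpleNamespace(title="Facebook", ns=0),
    "Category:Twitter": SimpleNamespace(title="Category:Twitter", ns=14),
    "List of most-followed Twitter accounts": SimpleNamespace(
        title="List of most-followed Twitter accounts", ns=0
    ),
}
print(article_titles(members))  # ['Facebook']
```

This version looks at a single level only; descending into subcategories would still need the recursion from the previous exercise.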

In [19]:
def get_all_in_category(category):
    """Return a DataFrame of page information for every article in the category."""
    titles = get_categorymembers(wiki.page(category).categorymembers)
    # Skip subcategories and list pages
    titles = [title for title in titles if 'Category:' not in title and 'List of' not in title]

    # Collect the information for each remaining page
    page_data = [get_page_info(title) for title in titles]

    return pd.DataFrame(page_data)


In [20]:
get_all_in_category("Category:American_social_networking_websites") # Depending on the category you search, this might take a while to run

| | title | summary | sections | categories | text |
|---|---|---|---|---|---|
| 0 | 8tracks.com | 8tracks.com is an internet radio and social ne... | [History, Website and App Usage, Partnerships ... | {'Category:American music websites': Category:... | \n\nHistory\nOne of Porter's significant influ... |
| 1 | 43 Things | 43 Things was a social networking service esta... | [History, Critique, Awards, References, Extern... | {'Category:Amazon (company)': Category:Amazon ... | \n\nHistory\n43 Things was launched on January... |
| 2 | About.me | about.me is a personal web hosting service co-... | [References, External links] | {'Category:All articles with dead external lin... | \n\nReferences\nExternal links\nOfficial website |
| 3 | Academia.edu | Academia.edu is a for-profit open repository o... | [History, Competitors, Criticism, References, ... | {'Category:Aggregation-based digital libraries... | \n\nHistory\nAcademia.edu was founded by Richa... |
| 4 | All Partners Access Network | All Partners Access Network (APAN), formerly c... | [History, 1997–2004, 2005–2009, 2010–2015, 201... | {'Category:American social networking websites... | \n\nHistory\n1997–2004\nThe origins of APAN ca... |
| ... | ... | ... | ... | ... | ... |
| 428 | Twitterature | Twitterature (a portmanteau of Twitter and lit... | [Genres, Aphorism, Poetry, Fiction, Literary c... | {'Category:2000s neologisms': Category:2000s n... | \n\nGenres\nAphorism\nAphorisms are popular be... |
| 429 | Unknown Number | "Unknown Number" is a science fiction short st... | [Synopsis, Reception, References, External links] | {'Category:2021 LGBT-related literary works': ... | \n\nSynopsis\nRather than being a standard nar... |
| 430 | Weird Twitter | Weird Twitter is a loose genre of Internet hum... | [References, External links] | {'Category:Articles with short description': C... | \n\nReferences\nExternal links\nKatie Notopoul... |
| 431 | WhyIStayed/WhyILeft | #WhyIStayed became a trending hashtag in Novem... | [Timeline, 2014, Ray Rice Assault and the Nati... | {'Category:2014 establishments in the United S... | \n\nTimeline\n2014\nRay Rice Assault and the N... |
