# Getting Data from Wikipedia
 
This short notebook shows how to use the **mwclient** library to explore the content of Wikipedia pages. It's an example of an API wrapper: a library that makes it easier to work with an underlying API, in this case the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page).

Once you learn how to use the library, you will complete a task that collects revisions for a particular Wikipedia page to be used later.

**Table of Contents**
1. [Installing the `mwclient` module](#sec1)
2. [Connecting to a wiki site and getting pages](#sec2)
3. [The Wellesley College page](#sec3)
4. [Exploring the Wellesley page revisions](#sec4)
5. **[Tasks for you: Dobbs v. Jackson](#sec5)**

<a id="sec1"></a>

## 1. Installing the `mwclient` module

In this notebook I've used the module [mwclient](http://mwclient.readthedocs.io/en/latest/index.html). The name stands for "MediaWiki client". It's a Python library for accessing Wikipedia pages.

Since this is your first time using this library, you have to install it first.  
**Note:** Older notebook versions might require an exclamation mark before `pip` (that is, `!pip install mwclient`).

In [None]:
pip install mwclient

In [None]:
# check if module is installed
import mwclient
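
If the import above fails, the module is not installed in the environment your notebook is running in. As a small sketch (the helper name `is_installed` is my own, not part of any library), we can also check for a module without importing it:

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Prints True or False depending on your environment.
print("mwclient installed:", is_installed("mwclient"))
```

This only tells us whether Python can find the module; it does not check which version is installed.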

<a id="sec2"></a>

## 2. Connecting to a wiki site and getting pages

Many wiki sites can be accessed through the MediaWiki API. We need to provide the host name of the one we will work with, in our case the English Wikipedia.

In [None]:
from mwclient import Site
site = Site('en.wikipedia.org')

We can retrieve a page from the `site` by its title. Strictly speaking, this is a lookup by title rather than a full-text search:

In [None]:
page = site.pages['Wellesley']
page

And we can read the text of the page, which in this case appears to be a disambiguation page with links to many pages that contain the word "Wellesley":

In [None]:
page.text()

### 2.a What else does a page contain?

Let's look at some properties that the page contains:

**Categories:** Most pages in Wikipedia are assigned categories, which we can access:

In [None]:
for cat in page.categories():
    print(cat)

**Links:** A page has many links to other Wikipedia pages, and we can access them too:

In [None]:
links = [l for l in page.links()]
links[:5]

**IMPORTANT - Lazy behavior:** Simply calling the method `links` on the page object will not give us the list of links:

In [None]:
page.links()

We need to iterate over this object to get the links, which themselves are objects pointing to the pages.
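
To see what "lazy" means in practice, here is a minimal sketch that uses an ordinary Python generator as a stand-in for `page.links()`. The function `fake_links` is hypothetical; I'm assuming only that the real object behaves like a regular Python iterator, so that `itertools.islice` can take the first few items without producing the rest.

```python
from itertools import islice

def fake_links():
    """Hypothetical stand-in for page.links(): yields one item at a time."""
    for i in range(1, 1001):
        yield f"Page {i}"

lazy = fake_links()                        # nothing has been produced yet
sample_first_five = list(islice(lazy, 5))  # consume only the first five items
print(sample_first_five)
```

The remaining items are only generated if we keep iterating, which is why a lazy object is cheap to create even when the full list is long.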

### 2.b What's in a page object?

As we saw above, each link shows up as a page object in the list of links. This is because these are all Wikipedia articles. Let's verify again that each page is an object:

In [None]:
type(page)

We can use the Python built-in function `dir` to find out what properties or methods we can call on this object:

In [None]:
onePage = links[10]
print(dir(onePage))
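
The output of `dir` includes many names that start with an underscore, which are internal. A small sketch for filtering them out follows; `FakePage` is a made-up class used only so the example runs on its own, and `public_names` is my own helper name.

```python
class FakePage:
    """Hypothetical stand-in for a page object, for illustration only."""
    def __init__(self):
        self.name = "Example"
        self.length = 1234

    def text(self):
        return "..."

def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

sample_page = FakePage()
sample_names = public_names(sample_page)
# Separate methods (callable) from plain properties.
sample_methods = [n for n in sample_names if callable(getattr(sample_page, n))]
print(sample_names)
print(sample_methods)
```

You could call `public_names(onePage)` in the same way to get a shorter list to explore.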

Let's try out some of the properties:

In [None]:
# length of page in characters
onePage.length

In [None]:
# name of the page
onePage.name

In [None]:
# timestamp of the last time the page was touched (edited or re-rendered)
onePage.touched

**Note:** Notice the type `time.struct_time`, which mwclient uses to represent timestamps. Recall that you learned about the `time` library in the Week 3 tasks.
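
As a quick refresher, here is a sketch of how a `struct_time` can be inspected and formatted with the standard `time` module. The timestamp string is made up for illustration; I'm assuming it has the same shape as the values the library returns.

```python
import time

# Hypothetical timestamp, parsed into a struct_time.
sample_ts = time.strptime("2022-06-24T14:10:00Z", "%Y-%m-%dT%H:%M:%SZ")

# Individual fields are available as attributes.
print(sample_ts.tm_year, sample_ts.tm_mon, sample_ts.tm_mday)

# strftime turns the struct_time back into a readable string.
sample_formatted = time.strftime("%Y-%m-%d %H:%M:%S", sample_ts)
print(sample_formatted)
```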

<a id="sec3"></a>

## 3. The Wellesley College page

Let's get the Wellesley College page and look at its properties. Instead of looking up "Wellesley", let's look up "Wellesley College":

In [None]:
wcp = site.pages['Wellesley College']
wcp.name

Is it a protected page? That is, can anyone edit it, or are there restrictions in place? In Wikipedia, some pages are protected to prevent vandalism.

In [None]:
wcp.protection

No, it's not. But we can easily find a page that is protected:

In [None]:
hcp = site.pages['Hillary Clinton']
hcp.name

In [None]:
hcp.protection

Notice that the result this time looks different from that of the Wellesley page. For example, only **autoconfirmed** users can edit the page. You can learn more about levels of user access on Wikipedia [on this page](https://en.wikipedia.org/wiki/Wikipedia:User_access_levels).
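
To make the two results easier to compare, here is a minimal sketch of a helper that summarizes a protection value. I'm assuming the value is a dictionary that maps an action (such as `edit`) to a `(level, expiry)` pair, and that an empty dictionary means no restrictions; the sample dictionaries and the name `describe_protection` are made up for illustration.

```python
def describe_protection(protection):
    """Summarize a protection dict of the assumed form {action: (level, expiry)}."""
    if not protection:
        return "unprotected: no restrictions are listed"
    parts = [
        f"{action} requires '{level}' (expires: {expiry})"
        for action, (level, expiry) in sorted(protection.items())
    ]
    return "; ".join(parts)

# Hypothetical values, shaped like the two outputs above.
sample_open = {}
sample_protected = {"edit": ("autoconfirmed", "infinity"),
                    "move": ("sysop", "infinity")}

print(describe_protection(sample_open))
print(describe_protection(sample_protected))
```

If the assumption about the shape holds, `describe_protection(hcp.protection)` would give a one-line summary for the protected page.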

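When was the last time that Wellesley's page was changed? Note that the `touched` timestamp is updated by edits, and also by other changes that force the page to be re-rendered (for example, a change to a template that the page uses):

In [None]:
wcp.touched # last time the page was touched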

What is the length of the page in characters?

In [None]:
wcp.length # length of page in characters

Get external links from this page (links that go outside Wikipedia):

In [None]:
extlinks = [el for el in wcp.extlinks()]
len(extlinks)

In [None]:
extlinks[:10]

Find all Wikipedia pages that link to Wellesley College. These are known as **backlinks**:

In [None]:
backlinks = [el for el in wcp.backlinks()]
len(backlinks)

That is a lot of backlinks that point to the Wellesley College page from other Wikipedia pages!

In [None]:
backlinks[:10]

Finally, look at the links from this page to other Wikipedia pages:

In [None]:
links = [el for el in wcp.links()]
len(links)

In [None]:
links[:10]

Comparing the two counts, we can say that more pages link to Wellesley College than it links to.

**IMPORTANT:** The links to other pages are useful for finding related topics. Even better are reciprocal links: pages that point to each other.
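
One way to find reciprocal links is to intersect the set of page names a page links to with the set of page names that link back to it. The sketch below uses made-up page names so that it runs on its own; with real data you could build the two lists from the `name` property of each page object.

```python
def reciprocal_links(link_names, backlink_names):
    """Return the names that appear both as outgoing links and as backlinks."""
    return sorted(set(link_names) & set(backlink_names))

# Hypothetical page names, for illustration only.
sample_links = ["Page A", "Page B", "Page C", "Page D"]
sample_backlinks = ["Page C", "Page E", "Page A", "Page F"]

sample_reciprocal = reciprocal_links(sample_links, sample_backlinks)
print(sample_reciprocal)
```

Using sets makes the comparison fast even when there are thousands of backlinks.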

<a id="sec4"></a>
## 4. Exploring the Wellesley page revisions

The output of `dir` showed that a page object has two similarly named attributes, `revision` and `revisions`. Let's look at them:

In [None]:
wcp.revision

In [None]:
wcp.revisions

The output shows that `revisions` is a method, not a property, so we need parentheses to call it:

In [None]:
wcp.revisions()

We can see the pattern now: most methods return **lazy objects**, because the user might not be interested in everything.

**Get all revisions:** We can get all the revisions by looping through this iterator. This might take a few seconds.

In [None]:
revisions = [rev for rev in wcp.revisions()]
len(revisions)

In [None]:
revisions[:3]

### 4.a Find users in revisions

Each revision is stored as a Python dictionary, so we can easily extract the users:

In [None]:
users = [rev['user'] for rev in revisions]
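
The expression `rev['user']` raises a `KeyError` if a revision has no `user` key. I'm assuming this can happen, for example when a username has been hidden from the page history. If you run into that error, a defensive sketch using `dict.get` looks like this (the sample revisions and the placeholder `<hidden>` are made up):

```python
# Hypothetical revisions; the second one has no 'user' key.
sample_revisions = [
    {"revid": 3, "user": "Alice"},
    {"revid": 2},
    {"revid": 1, "user": "Bob"},
]

# dict.get returns the fallback value instead of raising KeyError.
sample_users = [rev.get("user", "<hidden>") for rev in sample_revisions]
print(sample_users)
```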

We'll use `Counter` to create a dict of users with their edit counts, and then list the users by number of edits, with the most active editors at the top:

In [None]:
from collections import Counter

usersDct = Counter(users)
usersDct.most_common(10)

**Note:** When I taught this class in 2017, CS 234 students edited the Wellesley College page on Wikipedia.

Let's check the count of edits for some of the CS 234 editors of the page:

In [None]:
usersDct['Imanh19']

In [None]:
usersDct['Angelinahli']

How many unique users have edited this page?

In [None]:
print(f"{len(usersDct)} unique users have made {len(revisions)} edits to the {wcp.name} Wikipedia page.")

### 4.b Working with timestamps

Each revision contains a timestamp. Let's convert it to a datetime object to make it easier to work with.  
**NOTE:** To make sense of this part, you need to have completed the notebook on working with date & time objects in Week 3 tasks.

In [None]:
ts = revisions[0]['timestamp']
ts

In [None]:
type(ts)

The following modules will work together to make the conversion from `struct_time` to `datetime`:

In [None]:
from time import mktime
from datetime import datetime

# turn an object from type struct_time to datetime
datetime.fromtimestamp(mktime(ts))
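
One caveat about this conversion: `mktime` interprets the `struct_time` as local time, while the MediaWiki API reports timestamps in UTC. The round trip through `fromtimestamp` normally gives back the same clock values, but it may be off by an hour around daylight-saving transitions. As an alternative sketch (not the approach from the Week 3 tasks), we can build the datetime directly from the first six fields and label it as UTC. The timestamp below is made up for illustration.

```python
import time
from datetime import datetime, timezone

# Hypothetical timestamp with the same shape as a revision's 'timestamp' value.
sample_ts = time.strptime("2022-06-24T14:10:00Z", "%Y-%m-%dT%H:%M:%SZ")

# The first six fields are year, month, day, hour, minute, second.
sample_dt_utc = datetime(*sample_ts[:6], tzinfo=timezone.utc)
print(sample_dt_utc.isoformat())
```

The result is a timezone-aware datetime, so it can be compared safely with other aware datetimes.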

Now that we have a datetime object, we can do many things:

1. group number of revisions by day
2. group number of revisions by month or year
3. group the number of revisions per user per day
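
The three groupings in this list can be sketched with `Counter` on a few made-up revisions. The sample data and the helper `to_datetime` are hypothetical; the helper builds the datetime from the first six fields of the `struct_time`, which is one of several ways to do the conversion.

```python
import time
from collections import Counter
from datetime import datetime

def parse(ts_string):
    """Parse a timestamp string into a struct_time (sample data helper)."""
    return time.strptime(ts_string, "%Y-%m-%dT%H:%M:%SZ")

# Hypothetical revisions, shaped like the dictionaries in `revisions`.
sample_revisions = [
    {"user": "Alice", "timestamp": parse("2022-06-24T14:10:00Z")},
    {"user": "Bob",   "timestamp": parse("2022-06-24T15:00:00Z")},
    {"user": "Alice", "timestamp": parse("2022-06-24T16:30:00Z")},
    {"user": "Alice", "timestamp": parse("2022-07-01T09:00:00Z")},
]

def to_datetime(ts):
    """Convert a struct_time to a datetime using its first six fields."""
    return datetime(*ts[:6])

# 1. revisions per day, 2. revisions per month, 3. revisions per user per day
per_day = Counter(to_datetime(r["timestamp"]).strftime("%Y-%m-%d")
                  for r in sample_revisions)
per_month = Counter(to_datetime(r["timestamp"]).strftime("%Y-%m")
                    for r in sample_revisions)
per_user_day = Counter((r["user"], to_datetime(r["timestamp"]).strftime("%Y-%m-%d"))
                       for r in sample_revisions)

print(per_day.most_common())
print(per_month.most_common())
print(per_user_day.most_common())
```

The key passed to `Counter` decides the grouping: a day string, a month string, or a `(user, day)` tuple.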

A reminder that a datetime object has properties to access values such as year and month:

In [None]:
dt = datetime.fromtimestamp(mktime(ts))
print(dt.year)
print(dt.month)
print(dt.day)

As well as a useful method to return only the date (without the time portion):

In [None]:
print(dt.date())

This is especially useful in the following example.

**Example: What was the day with most revisions?**

First, we convert all timestamps into date strings, because strings are easier to store than datetime objects.

In [None]:
def createDateTime(timestamp):
    """Convert a time.struct_time to a datetime."""
    return datetime.fromtimestamp(mktime(timestamp))

dates = [str(createDateTime(rev['timestamp']).date()) for rev in revisions]

# what does str(createDateTime(rev['timestamp']).date()) do?
# 1. it calls the function createDateTime with each revision's timestamp object
# 2. then it applies the method date() to the returned datetime object, to get a date object
# 3. it converts the date object into a string

dates[:10]

**Find days with most edits**

We can do this in the same way we found the users with most edits, using the `Counter` constructor:

In [None]:
datesDct = Counter(dates)
datesDct.most_common(10)

<a id='sec5'></a>
## 5. Tasks for you

In this task you will accomplish the following goals:

* Get the revisions of the Wiki page on [Dobbs v. Jackson Women's Health Organization](https://en.wikipedia.org/wiki/Dobbs_v._Jackson_Women%27s_Health_Organization), in a similar way as you got the revisions for the Wellesley College page.
* Find the number of revisions contributed by each user. Create a dataframe that contains two columns: username and revision count.
* Study the first plot on [this Plotly page](https://plotly.com/python/ecdf-plots/), known as the empirical cumulative distribution function (eCDF) plot, and create one for the revision count column.
* Interpret the plot: What is the eCDF plot telling us about users and Wikipedia revisions? If you have never encountered an eCDF, read the [Wiki page](https://en.wikipedia.org/wiki/Empirical_distribution_function) and check out what the eCDF looks like for a normal distribution.
* Open the file `mcgowan_timestamps.json` in your folder to study its structure. It's basically a list of lists, each of them with two items: name of the editor and timestamp of their revision. Then create a similar JSON file, titled `dobbsVJaksonRevisions.json`, to use in the TimeSeries task. It should contain only the usernames and the timestamps (as datetime strings) of their revisions.
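
If the eCDF is new to you, the idea can be sketched by hand before you plot anything: for each value x, the eCDF gives the fraction of observations that are less than or equal to x. The revision counts below are made up for illustration and are not real data; the function name `ecdf` is my own.

```python
def ecdf(values):
    """Return the distinct sorted values and, for each, the fraction of
    observations that are less than or equal to it."""
    n = len(values)
    xs, ys = [], []
    for i, v in enumerate(sorted(values), start=1):
        if xs and xs[-1] == v:
            ys[-1] = i / n      # repeated value: update its cumulative fraction
        else:
            xs.append(v)
            ys.append(i / n)
    return xs, ys

# Hypothetical revision counts for six users.
sample_counts = [1, 1, 1, 2, 5, 10]
sample_xs, sample_ys = ecdf(sample_counts)

for x, y in zip(sample_xs, sample_ys):
    print(f"{y:.0%} of users made {x} or fewer revisions")
```

Reading the output line by line is the same as reading the eCDF plot from left to right.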