# Getting Data from Wikipedia
 
This short notebook shows how to work with the **mwclient** library to explore the content of Wikipedia pages. It's an example of an API wrapper: a library that makes it easier to work with the underlying API, in this case the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page).

Once you learn how to use the library, you will complete a task that collects revisions for a particular Wikipedia page to be used later.

**Table of Contents**
1. [Installing the `mwclient` module](#sec1)
2. [Connecting to a wiki site and getting pages](#sec2)
3. [The Wellesley College page](#sec3)
4. [Exploring the Wellesley page revisions](#sec4)
5. **[Tasks for you: Dobbs v. Jackson](#sec5)**

<a id="sec1"></a>

## 1. Installing the `mwclient` module

In this notebook I've used the module [mwclient](http://mwclient.readthedocs.io/en/latest/index.html). The name stands for "MediaWiki Client". It's a library for accessing Wikipedia pages through Python.

Since this is your first time using this library, you have to install it first.  
**Note:** Older notebook versions might require the use of the exclamation mark symbol before `pip`.

In [1]:
pip install mwclient

Collecting mwclient
  Downloading mwclient-0.10.1-py2.py3-none-any.whl (27 kB)
Collecting requests-oauthlib (from mwclient)
  Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl (24 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib->mwclient)
  Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
Installing collected packages: oauthlib, requests-oauthlib, mwclient
Successfully installed mwclient-0.10.1 oauthlib-3.2.2 requests-oauthlib-2.0.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
# check if module is installed
import mwclient

<a id="sec2"></a>

## 2. Connecting to a wiki site and getting pages

Many wiki websites can be accessed through the MediaWiki API. We need to provide the URL of the one we will work with, in our case the English Wikipedia.

In [3]:
from mwclient import Site
site = Site('en.wikipedia.org')

We can look up a page by its title through the `pages` property of `site`:

In [4]:
page = site.pages['Wellesley']
page

<Page object 'b'Wellesley'' for <Site object 'en.wikipedia.org/w/'>>

And we can read the text of the page, which in this case appears to be a disambiguation page with links to many pages that contain the word "Wellesley":

In [5]:
page.text()

"'''Wellesley''' may refer to:\n{{TOC right}}\n\n* \n\n== People ==\n===Dukes of Wellington===\n* [[Arthur Wellesley, 1st Duke of Wellington]] (1769–1852), British soldier, statesman, and Prime Minister of the United Kingdom\n* [[Arthur Wellesley, 2nd Duke of Wellington]] (1807–1884), British politician\n* [[Henry Wellesley, 3rd Duke of Wellington]] (1846–1900), British soldier and politician\n* [[Arthur Wellesley, 4th Duke of Wellington]] (1849–1934), British soldier\n* [[Arthur Wellesley, 5th Duke of Wellington]] (1876–1941), British soldier\n* [[Henry Wellesley, 6th Duke of Wellington]] (1912–1943), British soldier\n* [[Gerald Wellesley, 7th Duke of Wellington]] (1885–1972), British soldier and diplomat\n* [[Valerian Wellesley, 8th Duke of Wellington]] (1915–2014), British soldier\n* [[Charles Wellesley, 9th Duke of Wellington]] (born 1945), British politician and businessman\n\n==Barons Cowley (1828)==\n* [[Henry Wellesley, 1st Baron Cowley]] (1773–1847)\n* [[Henry Wellesley, 1st E

### 2.a What else does a page contain?

Let's look at some properties that the page contains:

**Categories:** Most pages in Wikipedia are assigned categories, which we can access:

In [6]:
for cat in page.categories():
    print(cat)

<Category object 'b'Category:All article disambiguation pages'' for <Site object 'en.wikipedia.org/w/'>>
<Category object 'b'Category:All disambiguation pages'' for <Site object 'en.wikipedia.org/w/'>>
<Category object 'b'Category:Disambiguation pages'' for <Site object 'en.wikipedia.org/w/'>>
<Category object 'b'Category:Disambiguation pages with surname-holder lists'' for <Site object 'en.wikipedia.org/w/'>>
<Category object 'b'Category:Place name disambiguation pages'' for <Site object 'en.wikipedia.org/w/'>>
<Category object 'b'Category:Short description is different from Wikidata'' for <Site object 'en.wikipedia.org/w/'>>


**Links:** A page has many links to other Wikipedia pages; we can access them too:

In [7]:
links = [l for l in page.links()]
links[:5]

[<Page object 'b'Denis Wellesley, 5th Earl Cowley'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Arthur Wellesley, 1st Duke of Wellington'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Arthur Wellesley, 2nd Duke of Wellington'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Arthur Wellesley, 4th Duke of Wellington'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Arthur Wellesley, 5th Duke of Wellington'' for <Site object 'en.wikipedia.org/w/'>>]

**IMPORTANT - Lazy behavior:** Simply calling the method `links` on the page object will not give us the list of links:

In [8]:
page.links()

<List object 'links' for <Site object 'en.wikipedia.org/w/'>>

We need to iterate over this object to get the links, which themselves are objects pointing to the pages.
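Because the object is lazy, we don't have to consume all of it. The sketch below uses `itertools.islice` to take only the first few items. Note that `fake_links` is a hypothetical stand-in generator (it is not part of mwclient), used here so the example runs without a network connection.

```python
from itertools import islice

def fake_links():
    """Hypothetical stand-in for a lazy list such as page.links()."""
    for i in range(1, 1001):
        # Each item is produced only when the consumer asks for it
        yield f"Page {i}"

# Take the first five items without iterating over the remaining 995
first_five = list(islice(fake_links(), 5))
print(first_five)
```

Assuming the object returned by `page.links()` behaves like any other Python iterable, `list(islice(page.links(), 5))` should give the first five links without looping over the rest.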

### 2.b What's in a page object?

As we saw above, each link shows up as a page object in the list of links. This is because these are all Wikipedia articles. Let's verify again that each page is an object:

In [9]:
type(page)

mwclient.page.Page

We can use the Python built-in function `dir` to find out what properties or methods we can call on this object:

In [10]:
onePage = links[10]
print(dir(onePage))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_edit', '_info', '_textcache', 'append', 'backlinks', 'base_name', 'base_title', 'can', 'categories', 'contentmodel', 'delete', 'edit', 'edit_time', 'embeddedin', 'exists', 'extlinks', 'get_token', 'handle_edit_error', 'images', 'iwlinks', 'langlinks', 'last_rev_time', 'length', 'links', 'move', 'name', 'namespace', 'normalize_title', 'page_title', 'pageid', 'pagelanguage', 'prepend', 'protection', 'purge', 'redirect', 'redirects_to', 'resolve_redirect', 'restrictiontypes', 'revision', 'revisions', 'save', 'site', 'strip_namespace', 'templates', 'text', 'touch', 'touched']


Let's try out some of the properties:

In [11]:
# length of page in characters
onePage.length

5075

In [12]:
# name of the page
onePage.name

'Elizabeth Wellesley, Duchess of Wellington'

In [13]:
# timestamp of when the page was changed the last time
onePage.touched

time.struct_time(tm_year=2024, tm_mon=5, tm_mday=3, tm_hour=20, tm_min=12, tm_sec=12, tm_wday=4, tm_yday=124, tm_isdst=-1)

**Note:** Notice the type `time.struct_time` that mwclient uses to represent timestamps. Recall that you learned about the `time` library in Week 3 tasks.
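As a quick refresher, a `struct_time` can be turned into a readable string with `time.strftime`. This is a minimal sketch; the `ts` value is built by hand with the same field values as the output above, rather than fetched from Wikipedia.

```python
import time

# A struct_time with the same field values as the output above
ts = time.struct_time((2024, 5, 3, 20, 12, 12, 4, 124, -1))

# Format it as a readable date-and-time string
readable = time.strftime("%Y-%m-%d %H:%M:%S", ts)
print(readable)
```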

<a id="sec3"></a>

## 3. The Wellesley College page

Let's get the Wellesley College page and look at its properties. Instead of searching for "Wellesley", let's search for "Wellesley College":

In [14]:
wcp = site.pages['Wellesley College']
wcp.name

'Wellesley College'

Is it a protected page? Meaning, can anyone edit it, or are there some restrictions in place? In Wikipedia, some pages are protected to prevent vandalism.

In [15]:
wcp.protection

{}

No, it's not. But, we can easily find a page that is protected:

In [16]:
hcp = site.pages['Hillary Clinton']
hcp.name

'Hillary Clinton'

In [17]:
hcp.protection

{'edit': ('autoconfirmed', 'infinity'), 'move': ('sysop', 'infinity')}

Notice that the result this time looks different from that of the Wellesley page. For example, only **autoconfirmed** users can edit the page. You can learn more about levels of user access on Wikipedia [in this page](https://en.wikipedia.org/wiki/Wikipedia:User_access_levels).
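To make the two results easier to compare, here is a small sketch that turns a protection dictionary into readable lines. The helper `describe_protection` is hypothetical (not part of mwclient), and it assumes the dictionary has the shape shown in the outputs above: an action name mapped to a `(level, expiry)` pair.

```python
def describe_protection(protection):
    """Summarize a protection dict shaped like the outputs above."""
    if not protection:
        return ["no restrictions"]
    lines = []
    for action, (level, expiry) in sorted(protection.items()):
        lines.append(f"{action}: requires '{level}' (expires: {expiry})")
    return lines

# Values copied from the two outputs above
unprotected = {}
protected = {'edit': ('autoconfirmed', 'infinity'), 'move': ('sysop', 'infinity')}

print(describe_protection(unprotected))
print(describe_protection(protected))
```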

When was the last time that Wellesley's page was edited?

In [18]:
wcp.touched # last time it was edited

time.struct_time(tm_year=2024, tm_mon=5, tm_mday=7, tm_hour=10, tm_min=15, tm_sec=1, tm_wday=1, tm_yday=128, tm_isdst=-1)

What is the length of the page in characters?

In [19]:
wcp.length # length of page in characters

71335

Get external links from this page (links that go outside Wikipedia):

In [20]:
extlinks = [el for el in wcp.extlinks()]
len(extlinks)

188

In [21]:
extlinks[:10]

['http://www.wellesley.edu/sfs/UnderstandingFinAid.html',
 'http://www.wellesley.edu/PublicAffairs/Releases/2009/042509.html',
 'http://www.wellesley.edu/Welcome/Traditions/hooprolling.html',
 'http://www.wellesley.edu/Welcome/HistoricalMaps/maps_main.html',
 'http://new.wellesley.edu/admission/esp/nontraditional',
 'http://www.wellesleyblue.com/',
 'http://www.wellesley.edu/PublicAffairs/About/briefhistory.html',
 'http://www.travelandleisure.com/articles/americas-most-beautiful-college-campuses/23',
 'http://chronicle.com/free/v52/i38/38a04001.htm',
 'http://www.virtualvermont.com/history/hgreen.html']

Find all Wikipedia pages that link to Wellesley College; these are known as **backlinks**:

In [22]:
backlinks = [el for el in wcp.backlinks()]
len(backlinks)

2632

That is a lot of backlinks that point to the Wellesley College page from other Wikipedia pages!

In [23]:
backlinks[:10]

[<Page object 'b'America the Beautiful'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Basketball'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Brown University'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Barnard College'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'California Institute of Technology'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Columbia University'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Colonna family'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'City University of New York'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Dartmouth College'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Grinnell College'' for <Site object 'en.wikipedia.org/w/'>>]

Finally, look at the links from this page to other Wikipedia pages:

In [24]:
links = [el for el in wcp.links()]
len(links)

1021

In [25]:
links[:10]

[<Page object 'b'Judy Atterbury'' for <Site object 'en.wikipedia.org/w/'>>,
 <Image object 'b'File:Wellesley college panorama-red.jpg'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'ACT (test)'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Ada Howard'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Adaline Emerson Thompson'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Adrian Piper'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Agnes Scott College'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Alan Schechter'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Albion College'' for <Site object 'en.wikipedia.org/w/'>>,
 <Page object 'b'Albright College'' for <Site object 'en.wikipedia.org/w/'>>]

With 2632 backlinks against 1021 outgoing links, more pages link to Wellesley College than the other way around.

**IMPORTANT:** The links to other pages are useful for finding related things. Even better are reciprocal links: pages that point to each other.
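One way to find reciprocal links is a set intersection of page titles. The sketch below uses a few hand-picked titles as hypothetical sample data. With the real lists, the two sets could be built as `{p.name for p in links}` and `{p.name for p in backlinks}`.

```python
# Hypothetical sample titles standing in for the real link lists
link_titles = {"Barnard College", "Ada Howard", "Agnes Scott College", "ACT (test)"}
backlink_titles = {"Barnard College", "Basketball", "Agnes Scott College", "Brown University"}

# Titles present in both sets are reciprocal links
reciprocal = sorted(link_titles & backlink_titles)
print(reciprocal)
```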

<a id="sec4"></a>
## 4. Exploring the Wellesley page revisions

The `dir` output above shows that a page object has two similarly named attributes, `revision` and `revisions`. Let's look at them:

In [26]:
wcp.revision

1222684612

In [27]:
wcp.revisions

<bound method Page.revisions of <Page object 'b'Wellesley College'' for <Site object 'en.wikipedia.org/w/'>>>

The message shows that `revisions` is a method, not a property, so we need parentheses to call it:

In [28]:
wcp.revisions()

<List object 'revisions' for <Site object 'en.wikipedia.org/w/'>>

We can see the pattern now: most methods return **lazy objects**, because the user might not be interested in everything.

**Get all revisions:** We can get all the revisions by looping through the lazy list. This might take a few seconds.

In [29]:
revisions = [rev for rev in wcp.revisions()]
len(revisions)

2287

In [30]:
revisions[:3]

[OrderedDict([('revid', 1222684612),
              ('parentid', 1222451585),
              ('user', '2600:100E:B070:43E0:4116:EB00:C123:26BC'),
              ('anon', ''),
              ('timestamp',
               time.struct_time(tm_year=2024, tm_mon=5, tm_mday=7, tm_hour=10, tm_min=15, tm_sec=1, tm_wday=1, tm_yday=128, tm_isdst=-1)),
              ('comment', '/* Notable alumnae */')]),
 OrderedDict([('revid', 1222451585),
              ('parentid', 1222451429),
              ('user', '173.77.203.132'),
              ('anon', ''),
              ('timestamp',
               time.struct_time(tm_year=2024, tm_mon=5, tm_mday=6, tm_hour=1, tm_min=8, tm_sec=41, tm_wday=0, tm_yday=127, tm_isdst=-1)),
              ('comment', '/* Athletics */')]),
 OrderedDict([('revid', 1222451429),
              ('parentid', 1222451297),
              ('user', '173.77.203.132'),
              ('anon', ''),
              ('timestamp',
               time.struct_time(tm_year=2024, tm_mon=5, tm_mday=6, tm_h

### 4.a Find users in revisions

Each revision is stored as a Python dictionary, so we can easily extract the users:

In [31]:
users = [rev['user'] for rev in revisions]

We'll use `Counter` to create a dict of users with their edit counts and then print the users sorted by number of edits, with the most active editors at the top:

In [32]:
from collections import Counter

usersDct = Counter(users)
usersDct.most_common(10)

[('Contributor321', 66),
 ('ElKevbo', 59),
 ('Interestingstuffadder', 46),
 ('Classicfilms', 27),
 ('Catamorphism', 25),
 ('Cellmesellme', 20),
 ('Rjensen', 19),
 ('Vadalium92', 18),
 ('RegentsPark', 17),
 ('GuardianH', 17)]

**Note:** When I taught this class in 2017, CS 234 students edited the Wellesley College page on Wikipedia. 


Let's check the count of edits for some of the CS 234 editors of the page:

In [33]:
usersDct['Imanh19']

3

In [34]:
usersDct['Angelinahli']

6

How many unique users have edited this page?

In [35]:
print(f"{len(usersDct)} unique users have edited {len(revisions)} times the {wcp.name} Wikipedia page.")

1186 unique users have edited 2287 times the Wellesley College Wikipedia page.


### 4.b Working with timestamps

Each revision contains a timestamp. Let's convert it to a datetime object to make it easier to work with.  
**NOTE:** To make sense of this part, you need to have completed the notebook on working with date & time objects in Week 3 tasks.

In [36]:
ts = revisions[0]['timestamp']
ts

time.struct_time(tm_year=2024, tm_mon=5, tm_mday=7, tm_hour=10, tm_min=15, tm_sec=1, tm_wday=1, tm_yday=128, tm_isdst=-1)

In [37]:
type(ts)

time.struct_time

The following modules will work together to make the conversion from `struct_time` to `datetime`:

In [38]:
from time import mktime
from datetime import datetime

# turn an object from type struct_time to datetime
datetime.fromtimestamp(mktime(ts))

datetime.datetime(2024, 5, 7, 10, 15, 1)
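One caveat: `mktime` interprets the `struct_time` as local time, while the timestamps returned by the MediaWiki API are in UTC. The round trip through `fromtimestamp` usually gives back the same clock values, but the result carries no timezone information. As an alternative, here is a minimal sketch that keeps the UTC meaning explicit by using `calendar.timegm`; the `ts` value is built by hand to match the output above.

```python
import calendar
import time
from datetime import datetime, timezone

# Same field values as the revision timestamp shown above
ts = time.struct_time((2024, 5, 7, 10, 15, 1, 1, 128, -1))

# timegm treats the struct_time as UTC (mktime would treat it as local time)
seconds = calendar.timegm(ts)

# Build a timezone-aware datetime in UTC
dt_utc = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(dt_utc)
```

For the counting done in the rest of this notebook either approach works; the timezone-aware version matters mostly when comparing timestamps across sources.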

Now that we have a datetime object, we can do many things:

1. group number of revisions by day
2. group number of revisions by month or year
3. group number of revisions by user per day

A reminder that a datetime object has properties to access values such as year and month:

In [39]:
dt = datetime.fromtimestamp(mktime(ts))
print(dt.year)
print(dt.month)
print(dt.day)

2024
5
7


As well as a useful method to return only the date (without the time portion):

In [40]:
print(dt.date())

2024-05-07


This is especially useful in the following example.

**Example: What was the day with most revisions?**

First, we convert all timestamps into string dates, just because it is easier to store them than datetime objects.

In [41]:
def createDateTime(timestamp):
    """convert a struct_time to datetime"""
    return datetime.fromtimestamp(mktime(timestamp))

dates = [str(createDateTime(rev['timestamp']).date()) for rev in revisions]

# what does str(createDateTime(rev['timestamp']).date()) do?
# 1. it calls the function createDateTime with each revision's timestamp object
# 2. then it applies the method date() on the returned datetime object, to get a date object
# 3. finally, it converts the date object into a string

dates[:10]

['2024-05-07',
 '2024-05-06',
 '2024-05-06',
 '2024-05-06',
 '2024-05-06',
 '2024-04-24',
 '2024-04-21',
 '2024-04-17',
 '2024-04-17',
 '2024-04-16']

**Find days with most edits**

We can do this in the same way we found the users with most edits, using the `Counter` constructor:

In [42]:
datesDct = Counter(dates)
datesDct.most_common(10)

[('2023-11-14', 19),
 ('2008-07-10', 16),
 ('2008-07-23', 14),
 ('2006-08-09', 14),
 ('2017-09-29', 13),
 ('2010-06-29', 13),
 ('2009-04-05', 13),
 ('2020-06-24', 12),
 ('2014-07-08', 12),
 ('2012-07-16', 12)]
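The same counting approach extends to the other groupings listed earlier, such as revisions per month or per year. Here is a minimal sketch, using a handful of hypothetical date strings in the same `YYYY-MM-DD` format as `dates`, that relies on string slicing to extract the month and year.

```python
from collections import Counter

# Hypothetical sample in the same 'YYYY-MM-DD' format as the dates list
sample_dates = ['2024-05-07', '2024-05-06', '2024-05-06', '2024-04-24', '2023-11-14']

# The first seven characters are 'YYYY-MM', the first four are 'YYYY'
months = Counter(d[:7] for d in sample_dates)
years = Counter(d[:4] for d in sample_dates)

print(months.most_common(2))
print(years.most_common(1))
```

Replacing `sample_dates` with the real `dates` list should give the busiest months and years for the page.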

<a id='sec5'></a>
## 5. Tasks for you

In this task you will accomplish the following goals:

* Get the revisions of the Wiki page on [Dobbs v. Jackson Women's Health Organization](https://en.wikipedia.org/wiki/Dobbs_v._Jackson_Women%27s_Health_Organization), in a similar way as you got the revisions for the Wellesley College page.
* Find the number of revisions contributed by each user. Create a dataframe that contains two columns: username and revision count.
* Study the first plot in [this Plotly page](https://plotly.com/python/ecdf-plots/), known as the empirical cumulative distribution function (eCDF) plot, and create one for the revision count column.
* Interpret the plot: What is the eCDF plot telling us about users and Wikipedia revisions? If you have never encountered an eCDF, read the [Wiki page](https://en.wikipedia.org/wiki/Empirical_distribution_function) and check out what the eCDF looks like for a normal distribution.
* Open the file `mcgowan_timestamps.json` in your folder to study its structure. It's basically a list of lists, each of them with two items: name of the editor and timestamp of their revision. Then create a similar JSON file, titled `dobbsVJaksonRevisions.json`, to use in the TimeSeries task. It should contain only the usernames and the timestamps (as datetime strings) of their revisions.

In [43]:
dobbs_page = site.pages["Dobbs v. Jackson Women's Health Organization"]
dobbs_revisions = [rev for rev in dobbs_page.revisions()]


In [44]:
dobbs_users = [rev['user'] for rev in dobbs_revisions]
dobbs_users_count = Counter(dobbs_users)


In [45]:
# Create a DataFrame
import pandas as pd
dobbs_df = pd.DataFrame(dobbs_users_count.items(), columns=['username', 'revision_count'])

In [46]:
dobbs_df.head()


| | username | revision_count |
|---|---|---|
| 0 | K1ausMouse | 1 |
| 1 | Hinnk | 1 |
| 2 | SilverLocust | 12 |
| 3 | Citation bot | 44 |
| 4 | AlsoWukai | 117 |


In [49]:
pip install --upgrade plotly

Collecting plotly
  Downloading plotly-5.22.0-py3-none-any.whl (16.4 MB)
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.9.0
    Uninstalling plotly-5.9.0:
      Successfully uninstalled plotly-5.9.0
Successfully installed plotly-5.22.0
Note: you may need to restart the kernel to use updated packages.


In [51]:

import plotly.express as px

# Bar chart of the number of revisions contributed by each user
fig = px.bar(dobbs_df, x='username', y='revision_count', title='Revision Count by User')

# Show the plot
fig.show()
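The task asks for an eCDF rather than a bar chart. The linked Plotly page draws it with `px.ecdf`, so something like `px.ecdf(dobbs_df, x='revision_count')` is the likely plotting call. To make clear what that plot computes, here is a minimal sketch in plain Python. The helper `ecdf_points` and the sample counts are hypothetical, not taken from the Dobbs data.

```python
from collections import Counter

def ecdf_points(values):
    """Return (x, y) pairs where y is the fraction of values <= x."""
    n = len(values)
    counts = Counter(values)
    points = []
    cumulative = 0
    for x in sorted(counts):
        cumulative += counts[x]
        points.append((x, cumulative / n))
    return points

# Hypothetical revision counts for eight users
sample_counts = [1, 1, 1, 1, 2, 3, 12, 44]

for x, y in ecdf_points(sample_counts):
    print(f"{y:.0%} of users made at most {x} revision(s)")
```

Reading the curve this way helps with the interpretation step: a steep rise at small values of x means most users contribute only a few revisions.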







In [53]:
# dobbs_df only holds per-user counts and has no 'timestamp' column,
# so build the list of [username, timestamp] pairs from the revisions instead.
# Timestamps are stored as datetime strings, e.g. '2024-05-07 10:15:01'.
dobbs_data = [[rev['user'], str(createDateTime(rev['timestamp']))]
              for rev in dobbs_revisions]

In [None]:
import json

# Write data to JSON file
with open('dobbsVJaksonRevisions.json', 'w') as file:
    json.dump(dobbs_data, file)
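After writing the file, it is worth reading it back to confirm that the structure is a list of two-item lists. The sketch below does this round trip with hypothetical sample data and a temporary file, so it does not touch the real `dobbsVJaksonRevisions.json`.

```python
import json
import os
import tempfile

# Hypothetical sample in the [username, timestamp string] layout
sample_data = [
    ["AlsoWukai", "2024-05-07 10:15:01"],
    ["Citation bot", "2024-05-06 01:08:41"],
]

# Write to a temporary location instead of the real output file
path = os.path.join(tempfile.mkdtemp(), "sample_revisions.json")
with open(path, "w") as f:
    json.dump(sample_data, f)

# Read it back and check that nothing changed
with open(path) as f:
    loaded = json.load(f)

print(loaded == sample_data)
print(all(len(item) == 2 for item in loaded))
```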