# LAB 9 Web APIs [Total: 3 points]

In this assignment you will practice collecting data from real-world Web APIs through a coding exercise. Complete the work in this notebook, using it to document each step of the exercise and to answer all questions.

## Required skills

This lab will let you practice the following APIs:

- [MediaWiki &ldquo;Action&rdquo; API](https://www.mediawiki.org/wiki/API:Main_page), in particular the API for the English Wikipedia
  + Endpoint: `https://en.wikipedia.org/w/api.php`
  + Sandbox: https://en.wikipedia.org/wiki/Special:ApiSandbox
- [Wikimedia REST API](https://wikimedia.org/api/rest_v1/#), in particular the Pageviews API
  + Endpoint: `/metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}`
  + Note that a sandbox is available at the main link above.
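
As a rough illustration of how the Pageviews path template gets filled in, the sketch below builds a request URL from its parts. The helper name `pageviews_url` and its default arguments are hypothetical, chosen only for this example. Replacing spaces with underscores and percent-encoding the title is an assumption about how the API expects article names to be written. The base URL is the one from the REST API link above.

```python
from urllib.parse import quote

# Base URL of the Wikimedia REST API, plus the Pageviews path template listed above
BASE = "https://wikimedia.org/api/rest_v1"
TEMPLATE = (
    "/metrics/pageviews/per-article/"
    "{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"
)


def pageviews_url(article, start, end, project="en.wikipedia",
                  access="all-access", agent="user", granularity="daily"):
    """Hypothetical helper: fill the path template for one article."""
    # Assumption: titles use underscores for spaces and are percent-encoded
    encoded_title = quote(article.replace(" ", "_"), safe="")
    return BASE + TEMPLATE.format(
        project=project, access=access, agent=agent,
        article=encoded_title, granularity=granularity,
        start=start, end=end,
    )


example_url = pageviews_url("Dog", "20230826", "20231030")
print(example_url)
```

Keeping the template in one place makes it easier to see which of the seven path segments changes from one request to the next; in this lab only the article title does.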

## Important
* Run the two cells below before running any others. Together they download the required files, install the necessary packages, and initialize the grader. If you restart the kernel or your runtime session (in Colab), rerun these cells before running any others.
* Google Colab is recommended for this assignment. If you are using an Anaconda Jupyter notebook/lab instead, make sure that **this notebook is kept in a new folder**, because the first cell **deletes all files with the extensions .csv, .py, .html, and .txt** in that folder before downloading the required files.

In [1]:
# Installing Otter-Grader and downloading required files
required_files = "https://github.com/mainuddin-rony/inst447-fall2024/raw/main/assignment/lab/lab9/required_files.zip"
! rm -rf tests
! rm -f required_files.zip *.csv *.py ._*.csv *.html *.txt
! wget $required_files && unzip -j required_files.zip
! mkdir tests && mv *.py tests
! pip install otter-grader==6.0.4

--2024-11-22 20:20:29--  https://github.com/mainuddin-rony/inst447-fall2024/raw/main/assignment/lab/lab9/required_files.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mainuddin-rony/inst447-fall2024/main/assignment/lab/lab9/required_files.zip [following]
--2024-11-22 20:20:29--  https://raw.githubusercontent.com/mainuddin-rony/inst447-fall2024/main/assignment/lab/lab9/required_files.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5940 (5.8K) [application/zip]
Saving to: ‘required_files.zip’


2024-11-22 20:20:29 (55.4 MB/s) - ‘required_files.zip’ saved [5940/5940]

Archive:  required_fi

In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

In [3]:
import pandas as pd
import numpy as np
import requests
import json

## Q1

**Points**: 1

Write a function called `dogrevisions` that will download data about a fixed set of revisions of the [Wikipedia article on dogs](https://en.wikipedia.org/wiki/Dog).

Your function should take a single parameter &ndash; the list of revisions (provided in code cell below). It should query the Wikipedia API and it should return a data frame with 10 entries.

Your data frame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user</th>
      <th>timestamp</th>
      <th>comment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Johnj1995</td>
      <td>2023-10-10T22:48:48Z</td>
      <td>Undid revision 1179555499 by [[Special:Contributions/Readytowriteyay12345|Readytowriteyay12345]] ([[User talk:Readytowriteyay12345|talk]]) Unsourced</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Graph8389</td>
      <td>2023-10-11T21:16:48Z</td>
      <td>/* Breeds */</td>
    </tr>
    <tr>
      <th>2</th>
      <td>WikiCleanerBot</td>
      <td>2023-10-19T04:47:16Z</td>
      <td>v2.05b - [[User:WikiCleanerBot#T20|Bot T20 CW#61]] - Fix errors for [[WP:WCW|CW project]] (Reference before punctuation)</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Citation bot</td>
      <td>2023-10-25T16:44:10Z</td>
      <td>Add: pmc, pmid. | [[:en:WP:UCB|Use this bot]]. [[:en:WP:DBUG|Report bugs]]. | Suggested by Abductive | [[Category:Wolves]] | #UCB_Category 11/45</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Halfsentientsnail</td>
      <td>2023-10-30T06:37:11Z</td>
      <td></td>
    </tr>
    <tr>
      <th>5</th>
      <td>Justlettersandnumbers</td>
      <td>2023-10-30T09:55:00Z</td>
      <td>Restored revision 1181852947 by [[Special:Contributions/Citation bot|Citation bot]] ([[User talk:Citation bot|talk]]): Thanks, but too many mistakes (grammar, [[MOS:OL]], low-grade sources etc); perhaps try making one edit at a time?</td>
    </tr>
    <tr>
      <th>6</th>
      <td>Citation bot</td>
      <td>2023-11-04T04:20:08Z</td>
      <td>Alter: chapter. | [[:en:WP:UCB|Use this bot]]. [[:en:WP:DBUG|Report bugs]]. | Suggested by Лисан аль-Гаиб | #UCB_webform 2/102</td>
    </tr>
    <tr>
      <th>7</th>
      <td>The Herald</td>
      <td>2023-11-05T13:17:10Z</td>
      <td>Cleaned up using [[WP:AutoEd|AutoEd]]</td>
    </tr>
    <tr>
      <th>8</th>
      <td>Graham87</td>
      <td>2023-11-06T07:43:18Z</td>
      <td>/* See also */ undo edits by [[Special:Contributions/Alemedicen|Alemedicen]], tangential, promotional, paid editor</td>
    </tr>
    <tr>
      <th>9</th>
      <td>Graham87</td>
      <td>2023-11-06T08:12:49Z</td>
      <td>/* Domestication */ rm duplicate text ... probably caused by a bad cut-and-paste in [[Special:Diff/1172721550|this edit]] by [[User:Hemiauchenia|Hemiauchenia]]</td>
    </tr>
  </tbody>
</table>

### Hints
For this question you should keep the timestamp column as a string (i.e. not converted to Pandas Timestamp).

For autograding purposes, your function needs to return data about the exact set of revisions below. These revisions are not the latest ones, so instead of passing an integer to the `rvlimit` parameter, you will need to use the `revids` parameter, which takes a pipe-separated list of revision IDs (e.g. `revids=123|456`).
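
To make the pipe-separated format concrete, here is a minimal sketch that joins a list of IDs and shows how the resulting query string is encoded. The IDs `123` and `456` are the placeholders from the hint, not real revisions of the Dog article, and the explicit `rvprop` value is an optional addition that limits the response to the three fields the data frame needs. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Placeholder IDs from the hint above (not real revisions of the Dog article)
revids = ["123", "456"]

params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    # Optional: ask only for the fields used in the data frame
    "rvprop": "user|timestamp|comment",
    # The list must be collapsed into a single pipe-separated string
    "revids": "|".join(revids),
}

# Encode the parameters the way they would appear in the request URL
query_string = urlencode(params)

print(params["revids"])
print(query_string)
```

The pipe characters appear as `%7C` in the encoded query string. When the dictionary is passed to `requests.get(..., params=params)`, that encoding is handled for you.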

In [4]:
revisions = [
    '1183754588',
    '1183750837',
    '1183617457',
    '1183413301',
    '1182612592',
    '1182593497',
    '1181852947',
    '1180839131',
    '1179700159',
    '1179559162'
]

def dogrevisions(revisions):
    # MediaWiki Action API endpoint for the English Wikipedia
    target = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        # The API expects the revision IDs as one pipe-separated string
        "revids": "|".join(revisions)
    }

    res = requests.get(target, params=params)
    res.raise_for_status()  # stop early on an HTTP error
    wiki_data = res.json()

    dog_revisions = []

    # Results are grouped by page; skip any page that has no revisions
    for page in wiki_data["query"]["pages"].values():
        for revised in page.get("revisions", []):
            dog_revisions.append({
                "user": revised.get("user"),
                "timestamp": revised.get("timestamp"),
                "comment": revised.get("comment")
            })

    df = pd.DataFrame(dog_revisions)
    return df

Run this cell to test your function.

In [5]:
dogrevisions(revisions)

,user,timestamp,comment
0,Johnj1995,2023-10-10T22:48:48Z,Undid revision 1179555499 by [[Special:Contrib...
1,Graph8389,2023-10-11T21:16:48Z,/* Breeds */
2,WikiCleanerBot,2023-10-19T04:47:16Z,v2.05b - [[User:WikiCleanerBot#T20|Bot T20 CW#...
3,Citation bot,2023-10-25T16:44:10Z,"Add: pmc, pmid. | [[:en:WP:UCB|Use this bot]]...."
4,Halfsentientsnail,2023-10-30T06:37:11Z,
5,Justlettersandnumbers,2023-10-30T09:55:00Z,Restored revision 1181852947 by [[Special:Cont...
6,Citation bot,2023-11-04T04:20:08Z,Alter: chapter. | [[:en:WP:UCB|Use this bot]]....
7,Benison,2023-11-05T13:17:10Z,Cleaned up using [[WP:AutoEd|AutoEd]]
8,Graham87,2023-11-06T07:43:18Z,/* See also */ undo edits by [[Special:Contrib...
9,Graham87,2023-11-06T08:12:49Z,/* Domestication */ rm duplicate text ... prob...


When you're ready, run the cell below to get feedback on your answer.

In [6]:
grader.check("q1")

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Q2

**Points**: 1

Write a function called `viewsofpage` that queries the Wikimedia REST API for the daily number of pageviews (granularity `daily`) by non-automated users only (agent `user`) to a given Wikipedia page. Your function should accept a single argument, the title of a Wikipedia article, and it should return a data frame with the views from August 26, 2023 (`20230826`) at midnight until October 30, 2023 (`20231030`) at midnight, both days included. (Your data frame should be 66 rows long.)
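
The expected row count can be checked with a little date arithmetic. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from datetime import date

start = date(2023, 8, 26)
end = date(2023, 10, 30)

# Both endpoints are included, so add one to the difference in days
n_days = (end - start).days + 1
print(n_days)  # 66
```

If your data frame has 65 rows instead, a likely cause is that one of the two endpoint dates was dropped from the request.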

For example, if the article is the `Dog` one, then the first 5 rows of your data frame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>project</th>
      <th>article</th>
      <th>granularity</th>
      <th>timestamp</th>
      <th>access</th>
      <th>agent</th>
      <th>views</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023082600</td>
      <td>all-access</td>
      <td>user</td>
      <td>9391</td>
    </tr>
    <tr>
      <th>1</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023082700</td>
      <td>all-access</td>
      <td>user</td>
      <td>9956</td>
    </tr>
    <tr>
      <th>2</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023082800</td>
      <td>all-access</td>
      <td>user</td>
      <td>11052</td>
    </tr>
    <tr>
      <th>3</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023082900</td>
      <td>all-access</td>
      <td>user</td>
      <td>11307</td>
    </tr>
    <tr>
      <th>4</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023083000</td>
      <td>all-access</td>
      <td>user</td>
      <td>11731</td>
    </tr>
  </tbody>
</table>

and the last 5 rows should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>project</th>
      <th>article</th>
      <th>granularity</th>
      <th>timestamp</th>
      <th>access</th>
      <th>agent</th>
      <th>views</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>61</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023102600</td>
      <td>all-access</td>
      <td>user</td>
      <td>11937</td>
    </tr>
    <tr>
      <th>62</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023102700</td>
      <td>all-access</td>
      <td>user</td>
      <td>10713</td>
    </tr>
    <tr>
      <th>63</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023102800</td>
      <td>all-access</td>
      <td>user</td>
      <td>8687</td>
    </tr>
    <tr>
      <th>64</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023102900</td>
      <td>all-access</td>
      <td>user</td>
      <td>9469</td>
    </tr>
    <tr>
      <th>65</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>2023103000</td>
      <td>all-access</td>
      <td>user</td>
      <td>11501</td>
    </tr>
  </tbody>
</table>

### Hints

For this question you should keep the timestamp column as a string (i.e. not converted to Pandas Timestamp).

The Wikimedia REST API expects you to include a User-Agent string in the headers of your request; otherwise it will return a 401 error. To do so, you can use the following UA string, taken from the popular cURL utility:

```python
# Set request headers
headers = {
    'user-agent': 'curl/7.81.0'
}

# Send the API request
response = requests.get(url, headers=headers)
```
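
If you want to confirm that the header is attached before calling the API, one option is to build the request without sending it. The sketch below uses `requests.Request(...).prepare()` for that; the URL is the `Dog` example for this question, and nothing is sent over the network.

```python
import requests

# URL for the Dog article over the date range used in this question
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/user/Dog/daily/20230826/20231030"
)
headers = {"user-agent": "curl/7.81.0"}

# Build the request object without sending it
prepared = requests.Request("GET", url, headers=headers).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])  # header lookup is case-insensitive
```

Inspecting the prepared request is only a debugging aid; the actual call is still `requests.get(url, headers=headers)` as shown in the hint.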

In [7]:
def viewsofpage(article_name):
    # Path segments after the base: project/access/agent/article/granularity/start/end
    target = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    url = f"{target}en.wikipedia/all-access/user/{article_name}/daily/20230826/20231030"
    # The REST API expects a User-Agent header on every request
    headers = {
        'user-agent': 'curl/7.81.0'
    }

    res = requests.get(url, headers=headers)
    res.raise_for_status()  # stop early on an HTTP error
    wiki_data = res.json()

    views_of_page = []

    # One item per day; the timestamp stays a string (YYYYMMDDHH)
    for cols in wiki_data["items"]:
        views_of_page.append({
            "project": cols["project"],
            "article": cols["article"],
            "granularity": cols["granularity"],
            "timestamp": cols["timestamp"],
            "access": cols["access"],
            "agent": cols["agent"],
            "views": cols["views"]
        })

    df = pd.DataFrame(views_of_page)
    return df

Call the function with the article name, e.g., `"Dog"`. Try different Wikipedia entries as well.

In [8]:
article_name = "Dog"
viewsofpage(article_name)

,project,article,granularity,timestamp,access,agent,views
0,en.wikipedia,Dog,daily,2023082600,all-access,user,9391
1,en.wikipedia,Dog,daily,2023082700,all-access,user,9956
2,en.wikipedia,Dog,daily,2023082800,all-access,user,11052
3,en.wikipedia,Dog,daily,2023082900,all-access,user,11307
4,en.wikipedia,Dog,daily,2023083000,all-access,user,11731
...,...,...,...,...,...,...,...
61,en.wikipedia,Dog,daily,2023102600,all-access,user,11937
62,en.wikipedia,Dog,daily,2023102700,all-access,user,10713
63,en.wikipedia,Dog,daily,2023102800,all-access,user,8687
64,en.wikipedia,Dog,daily,2023102900,all-access,user,9469


When you're ready, run the cell below to get feedback on your answer.

In [9]:
grader.check("q2")

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Q3

**Points**: 1

Write a function called `viewsofmanypages` that returns the number of views to a set of Wikipedia articles. Your function should take a single parameter, a list of Wikipedia article titles, and it should return a data frame with the views from August 26, 2023 (`20230826`) until October 30, 2023 (`20231030`), indexed by timestamp (as a pandas `DatetimeIndex`), with one row per article per day and an `article` column identifying the title.

For example, if the titles are `['Dog', 'Cat', 'Parrot', 'Rabbit']` then your function should return a data frame with exactly 264 rows. The top 5 rows of the data frame should look like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>project</th>
      <th>article</th>
      <th>granularity</th>
      <th>access</th>
      <th>agent</th>
      <th>view</th>
    </tr>
    <tr>
      <th>timestamp</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2023-08-26</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>9391</td>
    </tr>
    <tr>
      <th>2023-08-27</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>9956</td>
    </tr>
    <tr>
      <th>2023-08-28</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>11052</td>
    </tr>
    <tr>
      <th>2023-08-29</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>11307</td>
    </tr>
    <tr>
      <th>2023-08-30</th>
      <td>en.wikipedia</td>
      <td>Dog</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>11731</td>
    </tr>
  </tbody>
</table>

and the bottom 5 like this:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>project</th>
      <th>article</th>
      <th>granularity</th>
      <th>access</th>
      <th>agent</th>
      <th>view</th>
    </tr>
    <tr>
      <th>timestamp</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2023-10-26</th>
      <td>en.wikipedia</td>
      <td>Rabbit</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>3751</td>
    </tr>
    <tr>
      <th>2023-10-27</th>
      <td>en.wikipedia</td>
      <td>Rabbit</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>3495</td>
    </tr>
    <tr>
      <th>2023-10-28</th>
      <td>en.wikipedia</td>
      <td>Rabbit</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>3426</td>
    </tr>
    <tr>
      <th>2023-10-29</th>
      <td>en.wikipedia</td>
      <td>Rabbit</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>3537</td>
    </tr>
    <tr>
      <th>2023-10-30</th>
      <td>en.wikipedia</td>
      <td>Rabbit</td>
      <td>daily</td>
      <td>all-access</td>
      <td>user</td>
      <td>3839</td>
    </tr>
  </tbody>
</table>

### Hints
The hint from the previous question about supplying a User-Agent header in your requests applies here too.

Unlike the previous question, for this question you will need to convert the `timestamp` column into a pandas Timestamp and use the converted column as a DatetimeIndex for the data frame.
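
As a minimal sketch of that conversion, the example below parses a few `YYYYMMDDHH` strings and moves them into the index. The three sample rows are copied from the Q2 example table and the column name `views` follows that table; the real function works on the full API response instead.

```python
import pandas as pd

# Sample values copied from the Q2 example table
df = pd.DataFrame({
    "timestamp": ["2023082600", "2023082700", "2023082800"],
    "views": [9391, 9956, 11052],
})

# Parse the YYYYMMDDHH strings into pandas Timestamps
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y%m%d%H")

# Move the parsed column into the index, which becomes a DatetimeIndex
df = df.set_index("timestamp")

print(type(df.index).__name__)  # DatetimeIndex
print(df.index[0])              # 2023-08-26 00:00:00
```

Passing an explicit `format` keeps the parsing unambiguous, since a ten-digit string could otherwise be read in more than one way.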

In [10]:
def viewsofmanypages(article_titles):
    target = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    # The REST API expects a User-Agent header on every request
    headers = {
        'user-agent': 'curl/7.81.0'
    }

    views_of_many_pages = []

    # One request per article; rows from all articles go into the same list
    for title in article_titles:
        url = f"{target}en.wikipedia/all-access/user/{title}/daily/20230826/20231030"
        res = requests.get(url, headers=headers)
        res.raise_for_status()  # stop early on an HTTP error
        wiki_data = res.json()

        for cols in wiki_data["items"]:
            views_of_many_pages.append({
                # Parse the YYYYMMDDHH string into a pandas Timestamp
                "timestamp": pd.to_datetime(cols["timestamp"], format="%Y%m%d%H"),
                "project": cols["project"],
                "article": cols["article"],
                "granularity": cols["granularity"],
                "access": cols["access"],
                "agent": cols["agent"],
                "view": cols["views"]
            })

    df = pd.DataFrame(views_of_many_pages)
    # Use the parsed timestamps as a DatetimeIndex
    df.set_index("timestamp", inplace=True)

    return df

Call the function with a list of article names, e.g., `['Dog', 'Cat', 'Parrot', 'Rabbit']`. Try different sets of Wikipedia entries as well. (For example, different sports teams, or different computer brands, etc.)

In [11]:
viewsofmanypages(['Dog', 'Cat', 'Parrot', 'Rabbit'])

,project,article,granularity,access,agent,view
timestamp,,,,,,
2023-08-26,en.wikipedia,Dog,daily,all-access,user,9391
2023-08-27,en.wikipedia,Dog,daily,all-access,user,9956
2023-08-28,en.wikipedia,Dog,daily,all-access,user,11052
2023-08-29,en.wikipedia,Dog,daily,all-access,user,11307
2023-08-30,en.wikipedia,Dog,daily,all-access,user,11731
...,...,...,...,...,...,...
2023-10-26,en.wikipedia,Rabbit,daily,all-access,user,3751
2023-10-27,en.wikipedia,Rabbit,daily,all-access,user,3495
2023-10-28,en.wikipedia,Rabbit,daily,all-access,user,3426
2023-10-29,en.wikipedia,Rabbit,daily,all-access,user,3537


When you're ready, run the cell below to get feedback on your answer.

In [12]:
grader.check("q3")

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Submission

Don't forget to run all cells in your notebook and then save it. To save, click on *File*, then select *Save/Save Notebook*. After that, download the notebook by going to *File --> Download* (for Anaconda Notebook) or *File --> Download .ipynb* (for Colab). Finally, submit the notebook on Gradescope using the link found on ELMS.