# **Implementing the Wikipedia API**

Copyright 2024, Denis Rothman

[Wikipedia API documentation](https://pypi.org/project/Wikipedia-API/)

The citations for the Wikipedia pages are in the `Chapter10/citations` directory of the repository.

For more on Wikipedia citations: [Citations](https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Mark_Twain&id=1231834317&wpFormIdentifier=titleform)

Source code: [Wikipedia API](https://github.com/martin-majlis/Wikipedia-API)


# Installing the environment

In [1]:
# try:
#   import wikipediaapi
# except ImportError:
#   !pip install Wikipedia-API==0.6.0
#   import wikipediaapi

## Defining the tokenization function

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the necessary NLTK resource downloaded
nltk.download('punkt')

def nb_tokens(text):
    # More sophisticated tokenization which includes punctuation
    tokens = word_tokenize(text)
    return len(tokens)

[nltk_data] Downloading package punkt to /Users/mjack6/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
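
When the NLTK resources cannot be downloaded (for example, in an offline environment), a rough count can still be obtained with the standard library. The sketch below is an assumption for illustration, not NLTK's algorithm: `approx_nb_tokens` and its regular expression are hypothetical names, and the counts will not always match `word_tokenize`.

```python
import re

# Hypothetical pattern, not part of NLTK: a word (optionally containing
# internal apostrophes or hyphens) is one token, and every other
# non-space character is its own token.
_TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def approx_nb_tokens(text):
    """Return a rough token count without downloading any NLTK resource."""
    return len(_TOKEN_PATTERN.findall(text))

sample = "Products can be marketed to other businesses (B2B)."
print("Approximate number of tokens:", approx_nb_tokens(sample))
```

Like `nb_tokens`, this counts punctuation marks as tokens, so the result is closer to the NLTK count than a plain `text.split()` would be.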


# Retrieving Wikipedia Data and Metadata

## Creating an instance

In [4]:
import wikipediaapi
# Create an instance of the Wikipedia API with a detailed user agent
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='Knowledge/1.0 ([YOUR EMAIL])'
)

## Defining root page

Check the [Wikipedia rate limits](https://api.wikimedia.org/wiki/Rate_limits#:~:text=User%2Dauthenticated%20requests,requests%20per%20hour%20per%20user.) before making API requests.
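
One simple way to stay under a rate limit is to enforce a minimum delay between consecutive requests. The following is a minimal sketch under stated assumptions: the `Throttle` class is a hypothetical helper (not part of Wikipedia-API), and the interval value is illustrative, not an official limit.

```python
import time

class Throttle:
    """Hypothetical helper that enforces a minimum delay between calls."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval_s - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

throttle = Throttle(min_interval_s=0.05)  # illustrative value only
start = time.monotonic()
for title in ["Marketing", "Advertising", "Advocacy"]:
    throttle.wait()
    # A real loop would request the page for `title` here.
elapsed = time.monotonic() - start
print(f"3 throttled iterations took {elapsed:.2f}s")
```

Calling `throttle.wait()` at the top of any loop that requests pages spaces the requests out without changing the rest of the loop.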


In [5]:
topic="Marketing"     # topic
filename="Marketing"  # filename for saving the outputs
maxl=100              # maximum number of links to retrieve; set to 100 for the URL dataset

## Root page summary

In [6]:
import textwrap # to wrap the text for display
nltk.download('punkt_tab')
page=wiki.page(topic)

if page.exists():
  print("Page - Exists: %s" % page.exists())
  summary=page.summary
  # number of tokens
  nbt=nb_tokens(summary)
  print("Number of tokens: ",nbt)
  # Use textwrap to wrap the summary text to a specified width, here 60 characters
  wrapped_text = textwrap.fill(summary, width=60)
  # Print the wrapped summary text
  print(wrapped_text)
else:
  print("Page does not exist")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mjack6/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Page - Exists: True
Number of tokens:  229
Marketing is the act of satisfying and retaining customers.
It is one of the primary components of business management
and commerce. Marketing is usually conducted by the seller,
typically a retailer or manufacturer. Products can be
marketed to other businesses (B2B) or directly to consumers
(B2C). Sometimes tasks are contracted to dedicated marketing
firms, like a media, market research, or advertising agency.
Sometimes, a trade association or government agency (such as
the Agricultural Marketing Service) advertises on behalf of
an entire industry or locality, often a specific type of
food (e.g. Got Milk?), food from a specific area, or a city
or region as a tourism destination. Market orientations are
philosophies concerning the factors that should go into
market planning. The marketing mix, which outlines the
specifics of the product and how it will be sold, including
the channels that will be used to advertise the product, is
affected by t

## URLs and Citations

In [7]:
print(page.fullurl)

https://en.wikipedia.org/wiki/Marketing


## Links in the page

In [8]:
# prompt: read the program up to this cell. Then retrieve all the links for this page: print the link and a summary of each link

# Get all the links on the page
links = page.links

# Print the link and a summary of each link
urls = []
counter=0
for link in links:
  linked_page = wiki.page(link)
  # Skip pages that don't exist
  if not linked_page.exists():
    continue
  counter+=1
  print(f"Link {counter}: {link}")
  print(f"Link: {link}")
  print(linked_page.fullurl)
  urls.append(linked_page.fullurl)
  print(f"Summary: {linked_page.summary}")
  if counter>=maxl:
    break

print(counter)
print(urls)

Link 1: 24-hour news cycle
Link: 24-hour news cycle
https://en.wikipedia.org/wiki/24-hour_news_cycle
Summary: The 24-hour news cycle (or 24/7 news cycle) is the 24-hour investigation and reporting of news, concomitant with fast-paced lifestyles. The vast news resources available in recent decades have increased competition for audience and advertiser attention, prompting media providers to deliver the latest news in the most compelling manner in order to remain ahead of competitors. Television, radio, print, online and mobile app news media all have many suppliers that want to be relevant to their audiences and deliver news first.
A complete news cycle consists of the media reporting on some event, followed by the media reporting on public and other reactions to the earlier reports. The advent of 24-hour cable and satellite television news channels and, in more recent times, of news sources on the World Wide Web (including blogs), considerably shortened this process.
Link 2: Account-ba

In [9]:
page.fullurl

'https://en.wikipedia.org/wiki/Marketing'

## Writing the citations page and collecting the URLs

In [10]:
from datetime import datetime

# Get all the links on the page
links = page.links

# Prepare a file to store the outputs
fname = filename+"_citations.txt"
with open(fname, "w", encoding="utf-8") as file:  # summaries may contain non-ASCII characters
    # Write the citation header
    file.write(f"Citation. In Wikipedia, The Free Encyclopedia. Pages retrieved from the following Wikipedia contributors on {datetime.now()}\n")
    file.write("Root page: " + page.fullurl + "\n")
    counter = 0
    urls = []
    urls.append(page.fullurl)
    # Loop through the links and collect summaries
    for link in links:
        page_detail = wiki.page(link)
        # Skip pages that don't exist
        if not page_detail.exists():
            continue
        counter += 1
        summary = page_detail.summary

        # Write details to the file
        file.write(f"Link {counter}: {link}\n")
        file.write(f"Link: {link}\n")
        file.write(f"{page_detail.fullurl}\n")
        urls.append(page_detail.fullurl)
        file.write(f"Summary: {summary}\n")

        # Limit to maxl pages to avoid excessive scraping
        if counter >= maxl:
            break

    # Write the total counts and URLs at the end
    file.write(f"Total links processed: {counter}\n")
    file.write("URLs:\n")
    file.write("\n".join(urls))

# Note: Ensure the topic you specify corresponds to a valid Wikipedia article.

In [11]:
urls

['https://en.wikipedia.org/wiki/Marketing',
 'https://en.wikipedia.org/wiki/24-hour_news_cycle',
 'https://en.wikipedia.org/wiki/Account-based_marketing',
 'https://en.wikipedia.org/wiki/Activism',
 'https://en.wikipedia.org/wiki/Adam_Smith',
 'https://en.wikipedia.org/wiki/Adam_Smith_Institute',
 'https://en.wikipedia.org/wiki/Advertising',
 'https://en.wikipedia.org/wiki/Advertising',
 'https://en.wikipedia.org/wiki/Advertising_agency',
 'https://en.wikipedia.org/wiki/Advertising_mail',
 'https://en.wikipedia.org/wiki/Advertising_management',
 'https://en.wikipedia.org/wiki/Advertising_slogan',
 'https://en.wikipedia.org/wiki/Advocacy',
 'https://en.wikipedia.org/wiki/Advocacy_group',
 'https://en.wikipedia.org/wiki/Affinity_marketing',
 'https://en.wikipedia.org/wiki/Agenda-setting_theory',
 'https://en.wikipedia.org/wiki/Agile_marketing',
 'https://en.wikipedia.org/wiki/Agricultural_Marketing_Service',
 'https://en.wikipedia.org/wiki/Agricultural_marketing',
 'https://en.wikipedia.
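
The list above contains the same URL more than once (`Advertising` appears twice), presumably because two different link titles resolve to the same article. If duplicates are unwanted downstream, they can be dropped while keeping the original order. The sketch below is one possible approach; `dedupe_keep_order` is a hypothetical helper name and the sample list is a shortened copy of the output above.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))

sample_urls = [
    "https://en.wikipedia.org/wiki/Marketing",
    "https://en.wikipedia.org/wiki/Advertising",
    "https://en.wikipedia.org/wiki/Advertising",
    "https://en.wikipedia.org/wiki/Advertising_agency",
]
unique_urls = dedupe_keep_order(sample_urls)
print(len(sample_urls), "->", len(unique_urls))
```

Applying `urls = dedupe_keep_order(urls)` before writing the URL file would keep each page only once.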

## Writing the URL file

In [12]:
# Write URLs to a file
ufname = filename+"_urls.txt"
with open(ufname, 'w') as file:
    for url in urls:
        file.write(url + '\n')

print(f"URLs have been written to {ufname}")

URLs have been written to Marketing_urls.txt


In [13]:
# Read URLs from the file
with open(ufname, 'r') as file:
    urls = [line.strip() for line in file]

# Display the URLs
print("Read URLs:")
for url in urls:
    print(url)

Read URLs:
https://en.wikipedia.org/wiki/Marketing
https://en.wikipedia.org/wiki/24-hour_news_cycle
https://en.wikipedia.org/wiki/Account-based_marketing
https://en.wikipedia.org/wiki/Activism
https://en.wikipedia.org/wiki/Adam_Smith
https://en.wikipedia.org/wiki/Adam_Smith_Institute
https://en.wikipedia.org/wiki/Advertising
https://en.wikipedia.org/wiki/Advertising
https://en.wikipedia.org/wiki/Advertising_agency
https://en.wikipedia.org/wiki/Advertising_mail
https://en.wikipedia.org/wiki/Advertising_management
https://en.wikipedia.org/wiki/Advertising_slogan
https://en.wikipedia.org/wiki/Advocacy
https://en.wikipedia.org/wiki/Advocacy_group
https://en.wikipedia.org/wiki/Affinity_marketing
https://en.wikipedia.org/wiki/Agenda-setting_theory
https://en.wikipedia.org/wiki/Agile_marketing
https://en.wikipedia.org/wiki/Agricultural_Marketing_Service
https://en.wikipedia.org/wiki/Agricultural_marketing
https://en.wikipedia.org/wiki/Airborne_leaflet_propaganda
https://en.wikipedia.org/wiki/
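
Before passing the URLs read from the file to a later processing step, it can help to check that each line really is a Wikipedia article URL. The following is a minimal sketch using only the standard library; `is_wikipedia_article_url` is a hypothetical helper, and the default host is an assumption based on the `language='en'` setting used earlier.

```python
from urllib.parse import urlparse

def is_wikipedia_article_url(url, host="en.wikipedia.org"):
    """Return True for https URLs of the form https://<host>/wiki/<title>."""
    parsed = urlparse(url)
    prefix = "/wiki/"
    # The title is whatever follows /wiki/; an empty title is rejected.
    title = parsed.path[len(prefix):] if parsed.path.startswith(prefix) else ""
    return parsed.scheme == "https" and parsed.netloc == host and bool(title)

candidate_urls = [
    "https://en.wikipedia.org/wiki/Marketing",
    "https://en.wikipedia.org/wiki/",      # no article title
    "http://example.com/wiki/Marketing",   # wrong scheme and host
]
valid_urls = [u for u in candidate_urls if is_wikipedia_article_url(u)]
print(valid_urls)
```

This only checks the shape of the URL; it does not confirm that the article exists.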