# Create tool for informative infographics from structured information from Wikimedia projects - Task A

This notebook provides information on accessing and visualizing article pageviews in the Portuguese Wikipedia.

The objectives are:
1. to read and understand the Wikimedia API documentation;
2. to complete the gaps in the code below, coding functions about the most viewed articles in the Portuguese Wikipedia;
3. to create an animated visualization of the data you gathered.

In [111]:
# TODO: execute this row and type your Wikimedia username below
username = input("Type your Wikimedia username:")

Type your Wikimedia username: Udonels


### Wikimedia API documentation

Here you can find the documentation for the Wikimedia API: https://wikimedia.org/api/rest_v1/#/.

You do not need to read every function there, but focus on the *PageViews data* section.

**Objective:** *(1) to read and understand the documentation.*
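
As a quick orientation to the *PageViews data* section, the sketch below assembles the URL of the "top" endpoint for either a whole month or a single day. The helper name `build_top_url` and its parameters are assumptions made for illustration; only the URL layout comes from the documentation linked above.

```python
from typing import Optional

# Endpoint prefix from the REST API documentation linked above.
TOP_ENDPOINT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def build_top_url(project: str, year: int, month: int,
                  day: Optional[int] = None,
                  access: str = "all-access") -> str:
    """Return the URL of the 'top' pageviews endpoint (hypothetical helper).

    Passing day=None asks for the whole month through the 'all-days' keyword.
    """
    day_part = "all-days" if day is None else f"{day:02d}"
    return f"{TOP_ENDPOINT}/{project}/{access}/{year}/{month:02d}/{day_part}"


# Whole month of January 2024, then the single day 2024-02-29
print(build_top_url("pt.wikipedia", 2024, 1))
print(build_top_url("pt.wikipedia", 2024, 2, 29))
```

Building the URL in one place keeps the zero-padding of months and days consistent across requests.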

### Completing the gaps

Based on your knowledge of Python and the documentation for the Wikimedia API, please implement below:
* A function to get a list of the most viewed articles in the Portuguese Wikipedia for the month of January, ordered from the most viewed to the least viewed;
  * Be aware that there are false positives, so you will need to remove rows of data from projects other than the Portuguese Wikipedia;
* A function to get a dataframe of the most viewed articles in the Portuguese Wikipedia for the period from January 1st, 2024 to February 29th, 2024;
  * The resulting dataframe should have the following structure:
    * Each row represents an article (A);
    * Each column header (besides the article names) represents a day of this period (D);
    * Each cell stores the number of views of the article A on the date D.
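
The target layout described in the list can be pictured with plain Python before any data is fetched. This is only a sketch: `date_columns` is a hypothetical helper, and `Article_A` and `Article_B` are placeholder titles, not real results.

```python
from datetime import date, timedelta
from typing import List


def date_columns(start: date, end: date) -> List[str]:
    """Return one ISO-formatted label per day from start to end, inclusive."""
    number_of_days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(number_of_days)]


columns = date_columns(date(2024, 1, 1), date(2024, 2, 29))

# One row per article (A), one column per day (D); cells start empty (None).
wide_table = {title: dict.fromkeys(columns) for title in ["Article_A", "Article_B"]}

print(len(columns))             # 60 days: 31 in January plus 29 in February 2024
print(columns[0], columns[-1])  # 2024-01-01 2024-02-29
```

A nested dictionary of this shape can be handed to `pd.DataFrame.from_dict(wide_table, orient="index")` to obtain the article-by-day dataframe.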
 

**Objective:** *(2) to complete the gaps in the code below, coding functions about the most viewed articles in the Portuguese Wikipedia.*

### ADDRESSING THE FEEDBACK RECEIVED

I made the changes requested by my mentor, Eder Porto. They are as follows:

**4.3: You need to filter out pages that are not articles.**

As guided by my mentor, Eder Porto, I checked and updated my article list to exclude false positives, following the Wikimedia definition of an article (https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F): an article belongs to the main namespace and its title has no prefix ending in ':'.

**4.5: The numbers in your dataframe and video seem off. Please review that. Facebook, for example, had 350 thousand in all of January**

I updated my dataframe (for the chart race) of the most viewed articles in pt_wiki from 20240101 to 20240229 so that it holds the correct views per timestamp. Following the guidance of my mentor, Eder Porto, I understood that the *views required user-agents and not all-agents,* and that articles belong to the main namespace.
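
To illustrate the difference between user and all-agents traffic, the sketch below builds a URL for the per-article pageviews endpoint, which takes the agent type as a path segment. This endpoint is not the one used in the cells of this notebook, and `build_per_article_url` is a hypothetical helper written for illustration only.

```python
from urllib.parse import quote

PER_ARTICLE_ENDPOINT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_per_article_url(article: str, start: str, end: str,
                          project: str = "pt.wikipedia",
                          access: str = "all-access",
                          agent: str = "user",
                          granularity: str = "daily") -> str:
    """Return a per-article pageviews URL (hypothetical helper).

    agent="user" excludes spiders and automated traffic; "all-agents" includes them.
    start and end are dates written as YYYYMMDD.
    """
    encoded_title = quote(article, safe="")  # percent-encode accents and slashes
    return (f"{PER_ARTICLE_ENDPOINT}/{project}/{access}/{agent}/"
            f"{encoded_title}/{granularity}/{start}/{end}")


print(build_per_article_url("Facebook", "20240101", "20240131"))
print(build_per_article_url("Cleópatra", "20240101", "20240131", agent="all-agents"))
```

Comparing the totals returned for `agent="user"` and `agent="all-agents"` is one way to check whether automated traffic inflates an article's numbers.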

**I like the async approach, because it saves a lot of time!**

Lastly, after Eder Porto's invaluable feedback and acknowledgment of my async approach, I reduced the time taken to make the many requests to ~5 seconds.

In [2]:
# TODO: add libraries as necessary
import pandas as pd
import asyncio
import concurrent.futures
import httpx
import nest_asyncio
import requests
import time
import warnings
import json
import logging
from urllib.parse import quote
from datetime import datetime, timedelta
from IPython.display import Video
from typing import List

pd.set_option("display.max_rows", 100)
print('done')

done


## Get a list of the most viewed articles in the Portuguese Wikipedia for any month.

### a. Function to get the most viewed pt_wiki per month

In [3]:
# Map each month name to its two-digit month number
MONTH_NAME_TO_NUMBER = {
    "January": "01",
    "February": "02",
    # Add other months as needed
}

def most_viewed_ptwiki(months: List[str]) -> List[str]:
    """
    This function retrieves the most viewed articles on Portuguese Wikipedia for a given list of months.
    
    Parameters:
    - months (List[str]): A list of month names for which to fetch article views.
    
    Returns:
    - List[str]: A list of article titles sorted by views in descending order.
    """

    # Define user-agents to allow batch download
    user_agent = "nelson.ifechukwu@gmail.com"
    headers = {"User-Agent": user_agent}

    # Define a base URL for interacting with the Wikimedia API
    BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/pt.wikipedia.org/all-access/2024"

    # Define a dictionary to store {articles : views}
    all_articles_and_views = {}

    # Go through each month passed in the parameters
    for each_month in months:
        # Get article views for the specified month
        month_number = MONTH_NAME_TO_NUMBER.get(each_month)
        if month_number is not None:
            url = f"{BASE_URL}/{month_number}/all-days"

            # Send the request and check that it succeeded; connection errors are caught too
            try:
                response = requests.get(url, headers=headers)
                response.raise_for_status()
                data = response.json()
            except requests.RequestException as e:
                raise ValueError(f"Error in fetching data for {each_month}: {e}") from e

            # Get all articles and corresponding views
            items = data.get('items', [])
            if items:
                articles = items[0].get('articles', [])
                for article_info in articles:
                    article_title = article_info.get('article', '')
                    views = article_info.get('views', 0)
                    # Add up the views of an article that appears in more than one month
                    all_articles_and_views[article_title] = all_articles_and_views.get(article_title, 0) + views
            else:
                raise ValueError(f"No data available for {each_month}")
        else:
            raise ValueError(f"{each_month} is not a month")

    # Sort all articles by views in descending order
    all_sorted_articles_and_views = sorted(
        all_articles_and_views.items(), key=lambda items: items[1], reverse=True)

    # Return only the sorted articles in the dictionary
    all_sorted_articles = [element[0] for element in all_sorted_articles_and_views]
    return all_sorted_articles
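
The merging and sorting step of the function above can be tried without network access. The sketch below runs the same idea on two small hand-written payloads; the titles and view counts are made up, and `rank_articles` is a hypothetical name. Only the `items` / `articles` / `views` layout mirrors what the function above reads.

```python
from collections import Counter
from typing import Dict, List


def rank_articles(monthly_payloads: List[Dict]) -> List[str]:
    """Sum views per article across payloads and return titles, most viewed first."""
    totals: Counter = Counter()
    for payload in monthly_payloads:
        for item in payload.get("items", []):
            for entry in item.get("articles", []):
                totals[entry["article"]] += entry["views"]
    return [title for title, _ in totals.most_common()]


# Made-up sample payloads, one per month
january = {"items": [{"project": "pt.wikipedia", "articles": [
    {"article": "Article_A", "views": 300},
    {"article": "Article_B", "views": 500},
]}]}
february = {"items": [{"project": "pt.wikipedia", "articles": [
    {"article": "Article_A", "views": 400},
    {"article": "Article_C", "views": 100},
]}]}

print(rank_articles([january, february]))
# ['Article_A', 'Article_B', 'Article_C']  (totals: 700, 500, 100)
```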


### b. Function to remove false positives

In [4]:
# According to the definition of an article at https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F,
# an article lives in the main namespace and its title has no prefix ending in ':'.
# Hence, we remove titles with ':' in them.
def remove_false_positives(article_list: List[str]) -> List[str]:
    """
    Remove every title that contains ':' from a list of article titles.

    Parameters:
    - article_list (List[str]): The article titles to filter.

    Returns:
    - List[str]: A new list with the remaining titles, in their original order.
    """
    # Build a new list instead of removing items from the list while iterating over it
    return [each_article for each_article in article_list if ':' not in each_article]
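
Dropping every title that contains ':' also drops any main-namespace article whose own title has a colon. A possible refinement, sketched below, is to match known namespace prefixes only. The prefix tuple is an illustrative and incomplete assumption about ptwiki namespaces, `drop_namespaced_titles` is a hypothetical name, and the sample titles are made up for the demonstration.

```python
from typing import Iterable, List, Tuple

# Illustrative, incomplete tuple of namespace prefixes (an assumption, not an official list).
NAMESPACE_PREFIXES: Tuple[str, ...] = (
    "Especial:", "Wikipédia:", "Ficheiro:", "Categoria:",
    "Predefinição:", "Ajuda:", "Portal:",
)


def drop_namespaced_titles(titles: Iterable[str],
                           prefixes: Tuple[str, ...] = NAMESPACE_PREFIXES) -> List[str]:
    """Keep only the titles that do not start with one of the given prefixes."""
    return [title for title in titles if not title.startswith(prefixes)]


sample = ["Facebook", "Especial:Pesquisar", "Wikipédia:Página_principal",
          "Star_Wars:_Episódio_IV"]
print(drop_namespaced_titles(sample))
# ['Facebook', 'Star_Wars:_Episódio_IV']
```

The trade-off is that a prefix list must be kept complete, whereas the ':' test needs no maintenance.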

### c. Function to get the most viewed pt_wiki articles for January

In [5]:
# Get a sorted list of the most viewed articles in Portuguese Wiki in January
def most_viewed_ptwiki_jan() -> List[str]:
    sorted_articles_jan = most_viewed_ptwiki(["January"])

    # Remove false positives
    sorted_articles_jan = remove_false_positives(sorted_articles_jan)

    return sorted_articles_jan

### c (i). showing the most viewed pt_wiki articles for January

In [6]:
# TODO: add parameters as necessary and execute this block
top_viewed_list = most_viewed_ptwiki_jan()
top_viewed_list

['XXx',
 'Fotos_dos_Mamonas_Assassinas_mortos',
 'Voo_Força_Aérea_Uruguaia_571',
 'Facebook',
 'Zagallo',
 'Porno_Graffitti',
 'Renascer',
 'ChatGPT',
 'Yasmin_Brunet',
 'Cleópatra',
 'Griselda_Blanco',
 'AMBEV',
 'Renascer_(2024)',
 'YouTube',
 'Copa_São_Paulo_de_Futebol_Júnior',
 'Napoleão_Bonaparte',
 'Sony_Channel',
 'Rodriguinho_(cantor)',
 'Brasil',
 'Twitter',
 'Ano-novo',
 'João_Carreiro_&_Capataz',
 'TV_Globo',
 'Canal_Brasil',
 'Jeffrey_Epstein',
 'Domingos_Brazão',
 'Cristiano_Ronaldo',
 'Instagram',
 'Mamonas_Assassinas',
 'Louis_Joseph_César_Ducornet',
 'Big_Brother_Brasil_24',
 'Campeonato_Africano_das_Nações',
 'Copa_São_Paulo_de_Futebol_Júnior_de_2024',
 'Franz_Beckenbauer',
 'Carlos_Alberto_Parreira',
 'Dorival_Júnior',
 'Thiago_Carpini',
 'Marcinho_VP',
 'Robert_Oppenheimer',
 'Fernando_Parrado',
 'Wanessa_Camargo',
 'Vanessa_Lopes',
 'Campeonato_Paulista_de_Futebol',
 'Mortes_em_janeiro_de_2024',
 'Jogo_do_bicho',
 'São_Paulo',
 'Roberto_Canessa',
 'Portugal',
 'Pabl

### d. Function that returns a Dataframe containing the views of the most viewed pt_wiki articles for January & February

### d (i). Showing the dataframe containing the views of the most viewed pt_wiki articles for January & February

In [9]:
async def fetch(date, url, article_views_df, sorted_articles_jan_feb):
    """
    Async function to fetch data from the given URL and update the DataFrame.
    
    NB: This function is asynchronous so that the many requests needed to cover
        every day in January and February run concurrently instead of
        blocking one another.
        
        Computation time was reduced from >2hrs to ~5secs
    """
    user_agent = "nelson.ifechukwu@gmail.com"
    headers = {"User-Agent": user_agent}

    # Make an asynchronous request to the specified URL
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(url, headers=headers)
            data = response.json()

            # Check if the response contains the expected 'items' key and has non-empty data
            if 'items' in data and data['items']:
                info = data['items'][0]
                
                # Check if the project is 'pt.wikipedia' (Portuguese Wikipedia)
                if info['project'] == 'pt.wikipedia':
                    
                    # Iterate through each article in the response data per the timestamp
                    for article_info in info['articles']:
                        
                        # Check if the article is among the most viewed pt_wiki articles ( => top 1000)
                        if article_info['article'] in sorted_articles_jan_feb:      
                            
                            # Update the DataFrame with the number of views for the article on the specified date
                            article_views_df.at[article_info['article'], date.strftime('%Y-%m-%d')] = article_info['views']
            else:
                # Log a warning if the expected 'items' key is not found in the response
                logging.warning(f"No 'items' key found in response: {data}, Status code: {response.status_code}")
    
    except httpx.RequestError as e:
        # Log an error if there is an issue with the HTTP request
        logging.error(f"Error fetching url-{url}: {e}")

async def most_viewed_ptwiki_jan_feb_per_day() -> pd.DataFrame:
    """
    Get the most viewed articles in the Portuguese Wikipedia
    for the specified months, organized by day.
    
    Returns:
    - pd.DataFrame: A DataFrame containing the articles as the row index and dates
      as the column headers with each cell containing the number of views of the article
      on the specified date.
    """ 

    # Define formatted date range to fetch from using the url
    start_date = datetime(2024, 1, 1)
    end_date = datetime(2024, 2, 29)

    # Get a sorted list of the most viewed articles in Portuguese Wiki in January & February
    sorted_articles_jan_feb = most_viewed_ptwiki(["January", "February"])

    #remove false positives
    sorted_articles_jan_feb = remove_false_positives(sorted_articles_jan_feb)

    # Define a date-range for the DataFrame column headers
    date_range = [(start_date + timedelta(days=x)).strftime('%Y-%m-%d') for x in range((end_date - start_date).days + 1)]
    
    # Initialize the DataFrame
    article_views_df = pd.DataFrame(index=sorted_articles_jan_feb, columns=date_range)

    # Build one URL per day from start_date to end_date (each response lists that day's most viewed articles)
    urls = []
    current_date = start_date
    while current_date <= end_date:
        urls.append((current_date, f"https://wikimedia.org/api/rest_v1/metrics/pageviews/top/pt.wikipedia/all-access/{current_date.strftime('%Y/%m/%d')}"))
        current_date += timedelta(days=1)
        
    # Create one fetch task per day and run them all concurrently
    tasks = [fetch(url_detail[0], url_detail[1], article_views_df, sorted_articles_jan_feb) for url_detail in urls]
    await asyncio.gather(*tasks)

    return article_views_df
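
The cell above starts one request per day at the same time. If the number of simultaneous requests ever needs a ceiling, one option is an `asyncio.Semaphore`. The sketch below is an assumption about how that could look, not the notebook's implementation: `fake_fetch` stands in for the HTTP call and only sleeps, and `gather_bounded` and `limit` are hypothetical names.

```python
import asyncio
from typing import List


async def fake_fetch(day: int) -> int:
    """Stand-in for an HTTP request: waits briefly, then returns its input."""
    await asyncio.sleep(0.01)
    return day


async def gather_bounded(days: List[int], limit: int = 5) -> List[int]:
    """Run fake_fetch for every day, with at most `limit` calls in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(day: int) -> int:
        async with semaphore:  # wait here while `limit` calls are already running
            return await fake_fetch(day)

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(guarded(day) for day in days))


results = asyncio.run(gather_bounded(list(range(1, 61))))
print(len(results))
```

Inside a Jupyter cell, `asyncio.run` needs `nest_asyncio.apply()` first, as done in this notebook; alternatively the coroutine can be awaited directly.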


In [173]:
# allow the async event loop to run in the Jupyter cell event loop
nest_asyncio.apply()

# call most_viewed_ptwiki_jan_feb_per_day asynchronously 
top_viewed_dataframe = asyncio.run(most_viewed_ptwiki_jan_feb_per_day())
top_viewed_dataframe

Unnamed: 0,2024-01-01,2024-01-02,2024-01-03,2024-01-04,2024-01-05,2024-01-06,2024-01-07,2024-01-08,2024-01-09,2024-01-10,...,2024-02-20,2024-02-21,2024-02-22,2024-02-23,2024-02-24,2024-02-25,2024-02-26,2024-02-27,2024-02-28,2024-02-29
Facebook,13195,14024,14480,14609,13584,12304,12509,14081,15977,14702,...,15625,15662,14481,13224,12502,11664,15940,17258,15484,15548
Fotos_dos_Mamonas_Assassinas_mortos,,,12913,15280,11216,46426,32110,26610,17609,14837,...,4998,4538,4762,5351,7864,5264,3340,2732,2665,3136
Cleópatra,3432,3420,3542,3642,3922,3924,3801,4229,4250,4168,...,13758,14087,14285,14373,14599,14139,14269,14999,14999,14375
Porno_Graffitti,,,,,,,,,,,...,,,,,,,,,,
ChatGPT,1274,2484,3258,2657,2408,1989,2178,7203,9411,9114,...,11161,11377,11023,10056,6763,6857,12084,13414,13736,13466
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Banana,,,,,,,,,,,...,,,,,,,,,,
Ana_de_Armas,,,,,,,747,,,,...,,,,,,,,,,
Idade_Média,,,,,,,,,,,...,651,704,697,538,,,749,752,675,624
Chico_Buarque,552,,,,,,,,,,...,,597,565,540,,,,,666,


### Data visualization

Here you can find the documentation for the Bar Chart Race library: https://pypi.org/project/bar-chart-race.

Read and understand the documentation and use this library to create a function that displays an animated race chart of the dataframe you produced in the previous section (*top_viewed_dataframe*).

**Objective:** *(3) to create an animated visualization of the data you gathered.*

### e. Visualize the data in a chart-race

In [174]:
# Install the bar-chart-race library to visualize the article views
!pip install bar_chart_race



In [175]:
# Download a static FFmpeg build and add it to PATH for the bar_chart_race to execute correctly
import bar_chart_race as bcr
%run 'util/load-ffmpeg.ipynb'
print('Done!')

./ffmpeg-6.1-amd64-static/ffmpeg
Done!


In [178]:

def dataframe_to_race_chart(df):
    """
    Plot a bar chart race from the given DataFrame and save it as 'chart-race.mp4'.

    Parameters:
    - df (pd.DataFrame): The DataFrame to plot, with articles as rows and dates as columns
      (for example, top_viewed_dataframe).

    Returns:
    - None

    This function also cleans, filters, and prepares the DataFrame as 'wide' data according to the bar_chart_race documentation.
    """
    # Transpose the DataFrame to make it suitable for bar_chart_race
    df = df.T
    
    # Fill NaN values with 0
    df = df.fillna(0)
    
    # Convert all values to numeric type
    df = df.apply(pd.to_numeric, errors='coerce')
    
    # Apply cumulation to show accumulating views over time
    c_df = df.cumsum(axis=0) 

    # Plot the chart race using bar_chart_race
    bcr.bar_chart_race(
        c_df,
        n_bars=8,  # Define the number of bars to be displayed
        period_length=500,  # Duration of each period in milliseconds
        steps_per_period=10,  # Number of frame steps per period
        title='Most viewed articles',  # Title of the chart
        shared_fontdict={'family': 'DejaVu Sans', 'color': '.1'},  # Define font type in the chart 
        scale='linear',
        writer=None,
        bar_label_size=7,  #Define the size of the bar labels
        tick_label_size=7,  #Define the size of the bar ticks (numbers)
        filename="chart-race.mp4",  #Define the name of the file output
        fixed_order=False,
        fixed_max=True,
        label_bars=True,
        bar_size=.95,  #define how large the bars are
        period_label={'x': .99, 'y': .25, 'ha': 'right', 'va': 'center'},  # Position and alignment of the period label
        perpendicular_bar_func='median',  # Draw a perpendicular reference bar at the median value
        dpi=144,  # Dots per inch
        cmap='dark2',  # Colormap for coloring bars
        fig=None,
        bar_kwargs={'alpha': .7},  # Keyword arguments for bars
        filter_column_colors=False  # Whether to filter column colors
    )
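
The preparation steps in the function above (treat missing days as 0, then accumulate views over time) can be checked on a tiny example in plain Python. The sketch below is only an illustration with made-up numbers; `cumulative_views` is a hypothetical name, and `None` plays the role of a missing cell.

```python
from itertools import accumulate
from typing import Dict, List, Optional


def cumulative_views(daily: Dict[str, List[Optional[int]]]) -> Dict[str, List[int]]:
    """Replace missing daily values with 0 and return running totals per article."""
    return {
        title: list(accumulate(value or 0 for value in views))
        for title, views in daily.items()
    }


# Made-up daily views for three consecutive days; None marks a missing cell
sample = {"Article_A": [10, None, 5], "Article_B": [None, 7, 7]}
print(cumulative_views(sample))
# {'Article_A': [10, 10, 15], 'Article_B': [0, 7, 14]}
```

Because the totals only ever grow, an article that is missing on some day keeps its previous total on the chart instead of dropping to zero.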


In [162]:
# Suppress the specific warnings when executing the bar_chart_race function 
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Call the function that builds the chart race
dataframe_to_race_chart(df=top_viewed_dataframe)

In [179]:
# Display the chart race video
Video("chart-race.mp4", embed=True)