# Loading Data via the Wikipedia API

## Overview of the Prepared Dataset

- The planned target variable is the **Click-Through Rate (CTR)** of selected Wikipedia pages.  
- Data will be collected for **5,000–10,000 Wikipedia articles** on similar topics (e.g., science), so that the sample stays topically homogeneous and the predictions are easier to compare across articles.  
- The script will retrieve the following **features** for each article:  
  - topic,  
  - summary,  
  - number and list of categories,  
  - article length (in words),  
  - number of links,  
  - number of links in the first section,
  - number of edits and editors,  
  - clicks in,  
  - clicks out,  
  - number of page views over the last 30 days.  
- Based on these data, **derived features** will be created, for example:  
  - one-hot encoding of selected keywords (based on a bag-of-words approach),  
  - title and/or summary embeddings,  
  - ratio of links to total words.  
- Additionally, **external metadata** related to web traffic will be retrieved (e.g., Google Trends data for article titles).  
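
The derived features listed above can be illustrated with a short sketch. This is not the project's implementation: the function name `derive_features`, the keyword list, and the sample record are hypothetical, and the field names (`summary`, `word_count`, `num_links_internal`) simply mirror the columns collected later in this notebook.

```python
import re


def derive_features(article, keywords):
    """Return derived features for one article record (a plain dict)."""
    # Tokenise the summary into lowercase words (a simple bag-of-words view).
    tokens = set(re.findall(r"[a-z]+", article["summary"].lower()))
    features = {}
    # One-hot flag per selected keyword: 1 if the keyword occurs in the summary.
    for kw in keywords:
        features[f"kw_{kw}"] = int(kw in tokens)
    # Ratio of links to total words, guarding against empty articles.
    words = article["word_count"]
    features["links_per_word"] = article["num_links_internal"] / words if words else 0.0
    return features


# Made-up sample record, for illustration only.
sample = {
    "summary": "Critical Mass is a form of direct action.",
    "word_count": 8,
    "num_links_internal": 2,
}
result = derive_features(sample, ["action", "bicycle"])
print(result)  # {'kw_action': 1, 'kw_bicycle': 0, 'links_per_word': 0.25}
```

Embeddings of the title or summary would need an external model and are left out of this sketch.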


In [2]:
%cd C:\Users\piecz\PycharmProjects\pythonProject2\WdAN_projekt

C:\Users\piecz\PycharmProjects\pythonProject2\WdAN_projekt


In [None]:
!pip install wikipedia-api

In [76]:
!pip install pytrends

Collecting pytrends
  Obtaining dependency information for pytrends from https://files.pythonhosted.org/packages/68/ba/7a24a3723c790000faf880505ff1cc46f4d29f46dd353037938a070c4d23/pytrends-4.9.2-py3-none-any.whl.metadata
  Downloading pytrends-4.9.2-py3-none-any.whl.metadata (13 kB)
Collecting lxml (from pytrends)
  Obtaining dependency information for lxml from https://files.pythonhosted.org/packages/f7/d7/0cdfb6c3e30893463fb3d1e52bc5f5f99684a03c29a0b6b605cfae879cd5/lxml-6.0.2-cp312-cp312-win_amd64.whl.metadata
  Downloading lxml-6.0.2-cp312-cp312-win_amd64.whl.metadata (3.7 kB)
Downloading pytrends-4.9.2-py3-none-any.whl (15 kB)
Downloading lxml-6.0.2-cp312-cp312-win_amd64.whl (4.0 MB)


In [48]:
import pandas as pd
import numpy as np
import wikipediaapi
import functions.wikipedia_use as my_wiki
import sqlite3
import os
import requests
from IPython.display import display
import time
import random
from requests.exceptions import RequestException
from json import JSONDecodeError


In [55]:
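import calendar
from urllib.parse import quote


def get_pageviews(title, lang="en", year=2025, month=9):
    """Return the total page views of `title` for one calendar month (0 on failure)."""
    session = requests.Session()
    session.headers.update({"User-Agent": "StudentProject/1.0"})

    # Titles use underscores in URLs; percent-encode characters such as "/" or "?".
    title_api = quote(title.replace(" ", "_"), safe="")
    # Use the real last day of the month instead of a hard-coded 30.
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year}{month:02d}01"
    end = f"{year}{month:02d}{last_day:02d}"

    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{lang}.wikipedia/all-access/all-agents/{title_api}/daily/{start}/{end}"
    )

    try:
        resp = session.get(url, timeout=10)
        if resp.status_code != 200:
            return 0
        data = resp.json()
        return sum(item.get("views", 0) for item in data.get("items", []))
    except (RequestException, ValueError):
        # Network failure or malformed JSON: treat as "no data".
        return 0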



def get_article_metadata(title, lang="en", delay=0.1, max_retries=3):
    S = requests.Session()
    S.headers.update({"User-Agent": "StudentProject/1.0 (estera.maria03@gmail.com)"})
    URL = f"https://{lang}.wikipedia.org/w/api.php"
    
    PARAMS = {
        "action": "query",
        "titles": title,
        "prop": "revisions|links|categories|images|extracts",
        "rvprop": "user|timestamp",  # also request each revision's timestamp
        "rvlimit": "max",
        "rvdir": "newer",            # return the oldest revision first
        "explaintext": 1,
        "format": "json",
        "pllimit": "max",
        "cllimit": "max",
        "ilimit": "max"
    }

    backoff = 0.5
    for attempt in range(1, max_retries + 1):
        try:
            resp = S.get(URL, params=PARAMS, timeout=10)
            if resp.status_code != 200:
                time.sleep(delay + backoff)
                backoff *= 2
                continue
            try:
                data = resp.json()
            except (JSONDecodeError, ValueError):
                time.sleep(delay + backoff)
                backoff *= 2
                continue

            pages = data.get("query", {}).get("pages", {})
            for _, page_data in pages.items():
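                # No "exintro" parameter is set, so the extract is the plain text
                # of the whole article (it is also stored under "summary" below).
                extract = page_data.get("extract", "")
                word_count = len(extract.split())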

                links = page_data.get("links", [])
                num_links_internal = len(links)

                categories = [c["title"] for c in page_data.get("categories", [])]
                num_categories = len(categories)

                images = [i["title"] for i in page_data.get("images", [])]
                num_images = len(images)

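                # Only the first API batch is read (no "continue" handling), so
                # these counts are capped at the per-request limit; num_edits
                # tops out at 500 in the output below.
                revisions = page_data.get("revisions", [])
                num_edits = len(revisions)
                editors = set(rev.get("user") for rev in revisions)
                num_editors = len(editors)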

                creation_date = revisions[0]["timestamp"] if revisions else None
                # Query page views on the same language edition as the article.
                page_views = get_pageviews(page_data.get("title"), lang=lang)
                

                return {
                    "title": page_data.get("title"),
                    "word_count": word_count,
                    "num_links_internal": num_links_internal,
                    "num_categories": num_categories,
                    "categories": categories,
                    "num_images": num_images,
                    "image_titles": images,
                    "num_edits": num_edits,
                    "num_editors": num_editors,
                    "summary": extract,
                    "creation_date": creation_date,
                    "mo_page_views": page_views,
                }
            return None
        except RequestException:
            time.sleep(delay + backoff)
            backoff *= 2
            continue
    return None


def get_articles_from_category(category_name, depth=0):
    wiki = wikipediaapi.Wikipedia(language="en", user_agent="StudentProject/1.0")
    if not category_name.startswith("Category:"):
        category_name = "Category:" + category_name
    category = wiki.page(category_name)
    if not category.exists():
        print(f"Category '{category_name}' not found.")
        return pd.DataFrame()

    articles_data = []

    def add_articles(cat, current_depth=0):
        for page in cat.categorymembers.values():
            if page.ns == 0:
                try:
                    meta = get_article_metadata(page.title)
                    if meta:
                        articles_data.append(meta)
                    time.sleep(random.uniform(0, 0.1))
                except Exception as e:
                    print(f"Error while fetching {page.title}: {e}")
            elif page.ns == 14 and current_depth < depth: 
                add_articles(page, current_depth + 1)

    add_articles(category)
    df = pd.DataFrame(articles_data)
    return df
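

`get_article_metadata` reads a single API response, and a `num_edits` of exactly 500 for "Critical Mass (cycling)" in the output below suggests that the revision count hit the per-request limit. The MediaWiki API signals further batches through a top-level `continue` object whose entries are merged into the next request. The sketch below shows that loop under stated assumptions: `count_revisions` and the `fetch` callable are hypothetical names, and `fetch` stands in for something like `session.get(URL, params=params).json()`, so the example can run offline.

```python
def count_revisions(fetch, title):
    """Count all revisions and distinct editors of one page.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response (a hypothetical hook, injected so the loop is testable).
    """
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "revisions",
        "rvprop": "user",
        "rvlimit": "max",
    }
    num_edits = 0
    editors = set()
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                num_edits += 1
                editors.add(rev.get("user"))
        # A "continue" object means more batches exist; merge it into the
        # next request. Its absence means the result set is complete.
        if "continue" not in data:
            break
        params = {**params, **data["continue"]}
    return num_edits, len(editors)


# Offline demonstration with two fake response batches.
def fake_fetch(params):
    if "rvcontinue" not in params:
        return {
            "continue": {"rvcontinue": "token-1", "continue": "||"},
            "query": {"pages": {"1": {"revisions": [{"user": "A"}, {"user": "B"}]}}},
        }
    return {"query": {"pages": {"1": {"revisions": [{"user": "A"}]}}}}


edits, n_editors = count_revisions(fake_fetch, "Example")
print(edits, n_editors)  # 3 2
```

The same pattern would apply to `links`, `categories`, and `images`, each of which continues independently; requesting one property per loop keeps the bookkeeping simple.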

In [56]:

df_articles = get_articles_from_category("Category:Cycling activism", depth=0)
print(df_articles.head())
print(f"\nTotal articles collected: {len(df_articles)}")


In [57]:
display(df_articles.head())

| | title | word_count | num_links_internal | num_categories | categories | num_images | image_titles | num_edits | num_editors | summary | creation_date | mo_page_views |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cycling advocacy | 1106 | 174 | 13 | [Category:Advocacy groups, Category:Articles w... | 5 | [File:A Short History of Traffic Engineering.p... | 229 | 139 | Cycling advocacy consists of activities that c... | 2005-09-16T21:50:36Z | 1067 |
| 1 | Biketober | 219 | 100 | 5 | [Category:Articles with short description, Cat... | 2 | [File:Biketober promotional material.png, File... | 6 | 5 | Biketober is a month-long festival that celebr... | 2024-09-13T03:11:54Z | 75 |
| 2 | Critical Mass (cycling) | 1869 | 152 | 15 | [Category:Articles containing video clips, Cat... | 9 | [File:Commons-logo.svg, File:Critical Mass, Sa... | 500 | 255 | Critical Mass is a form of direct action in wh... | 2003-01-28T07:28:37Z | 5197 |
| 3 | Cycling Action Network | 459 | 23 | 9 | [Category:All articles containing potentially ... | 4 | [File:Can-logo.png, File:Flag of New Zealand.s... | 95 | 34 | Cycling Action Network (CAN) is a national cyc... | 2008-03-28T21:12:19Z | 114 |
| 4 | Dutch Cycling Embassy | 112 | 7 | 4 | [Category:All stub articles, Category:Articles... | 1 | [File:Flag of the Netherlands.svg] | 4 | 2 | The Dutch Cycling Embassy is a public-private ... | 2025-07-19T18:16:14Z | 63 |


In [67]:
path = "C:\\Users\\piecz\\PycharmProjects\\pythonProject2\\WdAN_projekt\\data\\clickstream-enwiki-2024-09.tsv.gz"
articles_of_interest = [x.replace(" ", "_") for x in df_articles.title.tolist()]

clicks_in = pd.Series(0, index=articles_of_interest)
clicks_out = pd.Series(0, index=articles_of_interest)

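chunksize = 500_000
columns = ["source", "target", "type", "count"]

# Stream the dump in chunks so the whole file never has to fit in memory.
# quoting=3 (csv.QUOTE_NONE) keeps titles that contain double quotes intact.
for chunk in pd.read_csv(path, sep="\t", header=None, names=columns,
                         quoting=3, chunksize=chunksize):
    # Clicks leaving an article of interest (the article is the source).
    out_chunk = chunk[chunk["source"].isin(articles_of_interest)]
    out_sum = out_chunk.groupby("source")["count"].sum()
    clicks_out[out_sum.index] += out_sum

    # Clicks arriving at an article of interest (the article is the target).
    in_chunk = chunk[chunk["target"].isin(articles_of_interest)]
    in_sum = in_chunk.groupby("target")["count"].sum()
    clicks_in[in_sum.index] += in_sum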

summary = pd.DataFrame({
    "article": articles_of_interest,
    "clicks_in": clicks_in.values,
    "clicks_out": clicks_out.values
})

print(summary)

                        article  clicks_in  clicks_out
0              Cycling_advocacy        290          15
1                     Biketober         20           0
2       Critical_Mass_(cycling)       3259         265
3        Cycling_Action_Network         11           0
4         Dutch_Cycling_Embassy          0           0
5                    Ghost_bike       3133         378
6                 I_BIKE_Dublin         50           0
7                  Kidical_Mass        253          24
8                Le_Tour_Entier          0           0
9                Ovarian_Psycos         69           0
10                      Quaxing          0           0
11  Vision_Zero_(New_York_City)        135           0


In [75]:
summary.article = summary.article.str.replace("_", " ")

df_articles_summary = df_articles.merge(summary, left_on = "title", right_on = "article")
df_articles_summary = df_articles_summary.drop("article",axis =1)

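# Click-through rate: outgoing clicks per page view (0 where no views exist).
# Caution: clicks_out comes from the September 2024 clickstream dump, while
# mo_page_views uses get_pageviews' default month (September 2025); the two
# periods should be aligned before this ratio is interpreted.
df_articles_summary["clicks_per_view"] = np.where(
    df_articles_summary["mo_page_views"] != 0,
    df_articles_summary["clicks_out"] / df_articles_summary["mo_page_views"],
    0
)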

df_articles_summary.head()

| | title | word_count | num_links_internal | num_categories | categories | num_images | image_titles | num_edits | num_editors | summary | creation_date | mo_page_views | clicks_in | clicks_out | clicks_per_view |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cycling advocacy | 1106 | 174 | 13 | [Category:Advocacy groups, Category:Articles w... | 5 | [File:A Short History of Traffic Engineering.p... | 229 | 139 | Cycling advocacy consists of activities that c... | 2005-09-16T21:50:36Z | 1067 | 290 | 15 | 0.014058 |
| 1 | Biketober | 219 | 100 | 5 | [Category:Articles with short description, Cat... | 2 | [File:Biketober promotional material.png, File... | 6 | 5 | Biketober is a month-long festival that celebr... | 2024-09-13T03:11:54Z | 75 | 20 | 0 | 0.0 |
| 2 | Critical Mass (cycling) | 1869 | 152 | 15 | [Category:Articles containing video clips, Cat... | 9 | [File:Commons-logo.svg, File:Critical Mass, Sa... | 500 | 255 | Critical Mass is a form of direct action in wh... | 2003-01-28T07:28:37Z | 5197 | 3259 | 265 | 0.050991 |
| 3 | Cycling Action Network | 459 | 23 | 9 | [Category:All articles containing potentially ... | 4 | [File:Can-logo.png, File:Flag of New Zealand.s... | 95 | 34 | Cycling Action Network (CAN) is a national cyc... | 2008-03-28T21:12:19Z | 114 | 11 | 0 | 0.0 |
| 4 | Dutch Cycling Embassy | 112 | 7 | 4 | [Category:All stub articles, Category:Articles... | 1 | [File:Flag of the Netherlands.svg] | 4 | 2 | The Dutch Cycling Embassy is a public-private ... | 2025-07-19T18:16:14Z | 63 | 0 | 0 | 0.0 |


Error for batch ['Cycling advocacy']: The request failed: Google returned a response with code 429
Error for batch ['Biketober']: The request failed: Google returned a response with code 429
Error for batch ['Critical Mass (cycling)']: The request failed: Google returned a response with code 429
Error for batch ['Cycling Action Network']: The request failed: Google returned a response with code 429
Error for batch ['Dutch Cycling Embassy']: The request failed: Google returned a response with code 429
Error for batch ['Ghost bike']: The request failed: Google returned a response with code 429
Error for batch ['I BIKE Dublin']: The request failed: Google returned a response with code 429
Error for batch ['Kidical Mass']: The request failed: Google returned a response with code 429
Error for batch ['Le Tour Entier']: The request failed: Google returned a response with code 429
Error for batch ['Ovarian Psycos']: The request failed: Google returned a response with code 429
Error for batch ['Quaxing']
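

HTTP status 429 means "Too Many Requests", so the failures above point to rate limiting on the Google Trends side. A common mitigation is to retry with exponentially growing pauses plus some random jitter. The helper below is a generic sketch, not the code that produced the log: `with_backoff` and its parameters are hypothetical, and the exception type is passed in because the cell that calls Google Trends is not shown here.

```python
import random
import time


def with_backoff(func, retry_on=Exception, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on `retry_on` errors wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential backoff with up to 10% random jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))


# Offline demonstration: a function that fails twice with a simulated 429.
calls = {"n": 0}
waits = []


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("response with code 429")
    return "ok"


# `sleep=waits.append` records the pauses instead of actually waiting.
outcome = with_backoff(flaky_request, retry_on=RuntimeError, sleep=waits.append)
print(outcome, calls["n"], len(waits))  # ok 3 2
```

In the notebook, `func` would wrap one Google Trends request per batch. Larger base delays (tens of seconds) and fewer requests per session may be needed in practice, since this service is known to throttle aggressively.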