# Loading Data via the Wikipedia API

## Overview of the Prepared Dataset

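- The planned target variable is the **Click-Through Rate (CTR)** of selected Wikipedia pages, computed later in this notebook as outgoing clicks divided by monthly page views.  
- Data will be collected for **5,000–10,000 Wikipedia articles** on similar topics (e.g., science); keeping the sample thematically homogeneous is intended to improve prediction accuracy.  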
- The script will retrieve the following **features** for each article:  
  - topic,  
  - summary,  
  - number and list of categories,  
  - article length (in words),  
  - number of links,  
  - number of links in the first section,
  - number of edits and editors,  
  - incoming clicks (`clicks_in`),  
  - outgoing clicks (`clicks_out`),  
  - number of page views over the last 30 days.  
- Based on these data, **derived features** will be created, for example:  
  - one-hot encoding of selected keywords (based on a bag-of-words approach),  
  - title and/or summary embeddings,  
  - ratio of links to total words.  
- Additionally, **external metadata** related to web traffic will be retrieved (e.g., Google Trends data for article titles).  
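
Two of the derived features listed above could be sketched as follows. This is a minimal illustration on toy rows, assuming the column names `title`, `summary`, `word_count`, and `num_links_internal` from the collected table. The keyword list and the `kw_` column prefix are hypothetical choices, and embeddings are left out.

```python
import pandas as pd

# Toy rows that mimic the collected columns (values are illustrative only).
df = pd.DataFrame({
    "title": ["List of manifolds", "Weyl sequence"],
    "summary": ["This is a list of particular manifolds.",
                "A Weyl sequence is a sequence of numbers."],
    "word_count": [342, 0],
    "num_links_internal": [208, 5],
})

# Ratio of links to total words; articles with zero words get 0 instead of inf.
words = df["word_count"]
df["links_per_word"] = (df["num_links_internal"] / words.where(words > 0)).fillna(0.0)

# One-hot keyword flags from a bag-of-words view of the summary.
keywords = ["list", "sequence"]  # hypothetical keyword selection
tokens = df["summary"].str.lower().str.findall(r"[a-z]+")
for kw in keywords:
    df[f"kw_{kw}"] = tokens.apply(lambda summary_words, kw=kw: int(kw in summary_words))

print(df[["title", "links_per_word", "kw_list", "kw_sequence"]])
```

Matching whole tokens rather than substrings keeps a keyword such as `list` from firing on words like `listed`.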


In [32]:
%cd C:\Users\piecz\PycharmProjects\WdAD_projekt_wikipedia

C:\Users\piecz\PycharmProjects\WdAD_projekt_wikipedia


In [None]:
!pip install wikipedia-api

In [8]:
import pandas as pd
import numpy as np
import wikipediaapi
import sqlite3
import os
import requests
from IPython.display import display
import time
import random
from requests.exceptions import RequestException
from json import JSONDecodeError
import functions.wiki_db as wiki_db
import importlib
importlib.reload(wiki_db)

<module 'functions.wiki_db' from 'C:\\Users\\piecz\\PycharmProjects\\WdAD_projekt_wikipedia\\functions\\wiki_db.py'>

## Loading articles from the given categories

In [36]:
categories_list = ["Mathematics"]  
df_list = []

for category in categories_list:
    df = wiki_db.get_articles_from_category(category, depth=1)
    df_list.append(df)
    print(f"\nNumber of articles from {category}: {len(df)}")


Number of articles from Mathematics: 1434
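
The helper `wiki_db.get_articles_from_category` is not shown in this notebook. As an assumption about what `depth` might mean, the sketch below walks a category tree and collects article titles, descending at most `depth` levels into subcategories. It relies only on the `ns` and `categorymembers` attributes that `wikipediaapi` page objects expose. The names `collect_titles` and `fake_page` and the toy tree are hypothetical.

```python
from types import SimpleNamespace

MAIN_NS = 0       # MediaWiki namespace number for articles
CATEGORY_NS = 14  # MediaWiki namespace number for categories


def collect_titles(category_page, depth=1, _seen=None):
    """Return the set of article titles in a category, descending `depth` levels into subcategories."""
    seen = set() if _seen is None else _seen
    for member in category_page.categorymembers.values():
        if member.ns == MAIN_NS:
            seen.add(member.title)
        elif member.ns == CATEGORY_NS and depth > 0:
            collect_titles(member, depth=depth - 1, _seen=seen)
    return seen


def fake_page(title, ns, members=()):
    """Build a stand-in for a wikipediaapi page so the sketch runs offline."""
    return SimpleNamespace(title=title, ns=ns,
                           categorymembers={m.title: m for m in members})


sub = fake_page("Category:Algebra", CATEGORY_NS, [fake_page("Limit group", MAIN_NS)])
root = fake_page("Category:Mathematics", CATEGORY_NS,
                 [fake_page("Mathematics", MAIN_NS), sub])

print(sorted(collect_titles(root, depth=0)))  # direct members only
print(sorted(collect_titles(root, depth=1)))  # plus one level of subcategories
```

Collecting into a set means an article that appears in several subcategories is counted once. With the real library, `root` would presumably come from something like `wikipediaapi.Wikipedia(user_agent=..., language="en").page("Category:Mathematics")`.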


In [37]:
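# ignore_index=True gives a unique row index when several categories are combined.
df_articles = pd.concat(df_list, ignore_index=True)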
display(df_articles.sample(5))
print(f"\nTotal articles collected: {len(df_articles)}")

Unnamed: 0,title,word_count,num_links_internal,num_categories,categories,num_images,image_titles,num_edits,num_editors,summary,creation_date,mo_page_views
745,Sphuṭacandrāpti,153,131,13,[Category:All Wikipedia articles written in In...,3,"[File:045r b.jpg, File:Arithmetic symbols.svg,...",14,8,Sphuṭacandrāpti (Computation of True Moon) is ...,2017-03-22T00:55:07Z,107
290,List of manifolds,342,208,4,"[Category:Articles with short description, Cat...",1,[File:Wikibooks-logo-en-noslogan.svg],76,37,"This is a list of particular manifolds, by Wik...",2004-09-10T14:44:49Z,842
439,Continuum (measurement),448,45,10,[Category:All articles needing additional refe...,2,"[File:Question book-new.svg, File:Split-arrows...",124,94,Continuum (pl.: continua or continuums) theori...,2006-12-06T19:19:22Z,1864
707,History of mathematical notation,8594,500,20,[Category:All articles with unsourced statemen...,10,"[File:Chounumerals.svg, File:Death of Archimed...",500,107,The history of mathematical notation covers th...,2006-07-27T01:27:11Z,3225
648,Eudemus of Rhodes,1089,251,26,"[Category:300s BC deaths, Category:370s BC bir...",6,"[File:Commons-logo.svg, File:Eudemos von Rhodo...",123,88,Eudemus of Rhodes (Ancient Greek: Εὔδημος; c. ...,2005-12-17T22:10:14Z,1276



Total articles collected: 1434


### Annotating articles with click counts

In [45]:
df_articles = df_articles.drop_duplicates(subset="title")

In [46]:
path = "C:\\Users\\piecz\\PycharmProjects\\pythonProject2\\WdAN_projekt\\data\\clickstream-enwiki-2024-09.tsv.gz"
articles_of_interest = [x.replace(" ", "_") for x in df_articles.title.tolist()]

clicks_in = pd.Series(0, index=articles_of_interest)
clicks_out = pd.Series(0, index=articles_of_interest)

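import csv

chunksize = 500_000
columns = ["source", "target", "type", "count"]

# The dump is unquoted TSV: QUOTE_NONE keeps literal quote characters in titles
# from being parsed as field delimiters, and na_filter=False keeps titles such
# as "NaN" as plain strings.
reader = pd.read_csv(
    path, sep="\t", header=None, names=columns,
    quoting=csv.QUOTE_NONE, na_filter=False, chunksize=chunksize,
)

for chunk in reader:
    # Outgoing clicks: rows whose source is one of our articles.
    out_chunk = chunk[chunk["source"].isin(articles_of_interest)]
    out_sum = out_chunk.groupby("source")["count"].sum()
    clicks_out[out_sum.index] += out_sum

    # Incoming clicks: rows whose target is one of our articles.
    in_chunk = chunk[chunk["target"].isin(articles_of_interest)]
    in_sum = in_chunk.groupby("target")["count"].sum()
    clicks_in[in_sum.index] += in_sum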

summary = pd.DataFrame({
    "article": articles_of_interest,
    "clicks_in": clicks_in.values,
    "clicks_out": clicks_out.values
})

print(summary)

                              article  clicks_in  clicks_out
0                         Mathematics     117924       54025
1                                  −2         82          16
2                        Chang_Thokpa          0           0
3             Language_of_mathematics       3967         540
4                         Limit_group          0           0
...                               ...        ...         ...
1314  Weierstrass–Mandelbrot_function          0           0
1315                 Weisner's_method         19           0
1316                    Weyl_sequence        355          75
1317    Whittaker–Henderson_smoothing          0           0
1318                            WIRIS         74           0

[1319 rows x 3 columns]
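
The clickstream dump is a tab-separated file with four columns: source, target, link type, and count. To make the chunked aggregation easier to follow, here is a minimal sketch that runs the same idea on a tiny in-memory sample. The rows are made up, `aggregate_clicks` is a hypothetical helper, and `Series.add(fill_value=0)` is swapped in for the in-place `+=` used above.

```python
import csv
import io

import pandas as pd

# Made-up rows in the clickstream layout: source, target, link type, count.
sample_tsv = (
    "other-search\tMathematics\texternal\t100\n"
    "Mathematics\tAlgebra\tlink\t40\n"
    "Mathematics\tWeyl_sequence\tlink\t15\n"
    "Algebra\tMathematics\tlink\t25\n"
)


def aggregate_clicks(buffer, titles, chunksize=2):
    """Sum incoming and outgoing clicks for `titles`, reading the TSV in chunks."""
    clicks_in = pd.Series(0, index=titles, dtype="int64")
    clicks_out = pd.Series(0, index=titles, dtype="int64")
    reader = pd.read_csv(
        buffer, sep="\t", header=None,
        names=["source", "target", "type", "count"],
        quoting=csv.QUOTE_NONE,  # titles may contain literal quote characters
        na_filter=False,         # keep titles such as "NaN" as plain strings
        chunksize=chunksize,
    )
    for chunk in reader:
        out_sum = chunk[chunk["source"].isin(titles)].groupby("source")["count"].sum()
        in_sum = chunk[chunk["target"].isin(titles)].groupby("target")["count"].sum()
        # add() aligns on the index; titles absent from this chunk contribute 0.
        clicks_out = clicks_out.add(out_sum, fill_value=0).astype("int64")
        clicks_in = clicks_in.add(in_sum, fill_value=0).astype("int64")
    return pd.DataFrame({"clicks_in": clicks_in, "clicks_out": clicks_out})


summary = aggregate_clicks(io.StringIO(sample_tsv), ["Mathematics", "Weyl_sequence"])
print(summary)
```

Note that incoming clicks include external referrers such as `other-search`, so `clicks_in` is not limited to clicks from other articles.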


In [47]:
# Clickstream titles use underscores; convert back to spaces to match df_articles.title.
summary["article"] = summary["article"].str.replace("_", " ")

df_articles_summary = df_articles.merge(summary, left_on="title", right_on="article")
df_articles_summary = df_articles_summary.drop(columns="article")

# Outgoing clicks per page view; articles with zero views get a ratio of 0.
df_articles_summary["clicks_per_view"] = np.where(
    df_articles_summary["mo_page_views"] != 0,
    df_articles_summary["clicks_out"] / df_articles_summary["mo_page_views"],
    0
)

df_articles_summary.head()

Unnamed: 0,title,word_count,num_links_internal,num_categories,categories,num_images,image_titles,num_edits,num_editors,summary,creation_date,mo_page_views,clicks_in,clicks_out,clicks_per_view
0,Mathematics,8041,500,23,[Category:All Wikipedia articles written in Am...,10,"[File:Arithmetic symbols.svg, File:Bakhshali n...",500,284,Mathematics is a field of study that discovers...,2001-11-08T15:31:38Z,160941,117924,54025,0.335682
1,−2,1178,500,11,"[Category:2 (number), Category:Articles with s...",0,[],48,17,"In mathematics, negative two or minus two is a...",2025-08-22T04:03:57Z,4212,82,16,0.003799
2,Chang Thokpa,582,23,7,"[Category:Articles with short description, Cat...",4,[File:Classical Meitei odd numbers - related t...,9,3,The concept of Chang Thokpa (ꯆꯪ ꯊꯣꯛꯄ) is a cen...,2023-10-17T14:07:02Z,80,0,0,0.0
3,Language of mathematics,797,83,6,[Category:All articles needing additional refe...,1,[File:Question book-new.svg],307,169,The language of mathematics or mathematical la...,2003-11-20T14:31:23Z,3135,3967,540,0.172249
4,Limit group,1077,43,8,"[Category:Algebra, Category:Articles with shor...",0,[],28,7,"In mathematics, specifically in group theory a...",2025-01-21T04:14:53Z,237,0,0,0.0
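
Because `clicks_per_view` combines two separate sources (the clickstream dump and the page view counts), a quick sanity check may be worthwhile before modelling. The sketch below uses toy rows with the same column names. `flag_suspicious` and the threshold of 1.0 are hypothetical, and the interpretation of zero counts rests on the assumption that the public dump omits rarely used (source, target) pairs.

```python
import pandas as pd

# Toy rows reusing the notebook's column names (values are illustrative only).
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "mo_page_views": [1000, 80, 50],
    "clicks_out": [250, 0, 75],
})
df["clicks_per_view"] = df["clicks_out"] / df["mo_page_views"]


def flag_suspicious(frame, max_ratio=1.0):
    """Mark rows whose ratio is implausible or uninformative."""
    out = frame.copy()
    # More clicks than views hints that the two sources cover different periods.
    out["ratio_above_max"] = out["clicks_per_view"] > max_ratio
    # Viewed pages with zero recorded clicks may just be missing from the dump.
    out["zero_clicks_with_views"] = (out["clicks_out"] == 0) & (out["mo_page_views"] > 0)
    return out


checked = flag_suspicious(df)
print(checked[["title", "clicks_per_view", "ratio_above_max", "zero_clicks_with_views"]])
```

Flagging rather than dropping rows leaves the decision about how to treat them to the modelling step.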


### Saving results

In [49]:
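path = "data/math_19_10_2025.csv"
# index=False avoids writing the row index, which would reappear as an "Unnamed: 0" column on reload.
df_articles_summary.to_csv(path, index=False)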
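
When the saved file is loaded again, list-valued columns such as `categories` and `image_titles` come back as plain strings, because CSV has no list type. The sketch below shows one possible way to restore them with `ast.literal_eval`. It round-trips a toy frame through an in-memory buffer rather than the real file, and the sample row is illustrative only.

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "title": ["Limit group"],
    "categories": [["Category:Algebra", "Category:Group theory"]],
})

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(type(restored.loc[0, "categories"]))  # the list came back as a string

# literal_eval safely parses the string representation back into a list.
restored["categories"] = restored["categories"].apply(ast.literal_eval)
print(restored.loc[0, "categories"])
```

If the list columns matter downstream, a format that preserves nested types (for example Parquet or JSON lines) would avoid this parsing step altogether.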