# WikiPDA Categories
This notebook processes Turkish Wikipedia text data locally with WikiPDA and generates datasets relating articles to their corresponding ORES Topic categories.
First, we process the wikitext and apply WikiPDA to get the following data:
* ```wikipda_top_topics.csv```: Contains the page_id of all retrieved articles and the ORES category that has the highest probability.
* ```wikipda_topics.csv```: Contains the page_id of all retrieved articles and the probability of each ORES category that WikiPDA predicts for that article.

Then we use these data to compute the daily number of edits and the revert rate for each predicted category.
Three more DataFrames are then created:
* ```thresholded_topics.csv```: Contains page_id, page_title and the associated topic for an article. An article may have more than one topic. An article is said to belong to a topic if WikiPDA predicts that topic with a probability of at least 0.7.
* ```daily_edits_by_topic.csv```: Daily number of non-bot edits for each topic (topics assigned with the threshold method).
* ```revert_rate_by_topic.csv```: Daily revert rate for each topic (topics assigned with the threshold method).


In [None]:
import pandas as pd  # needed here because this cell comes before the main import cell

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 50)

In [1]:
import pyspark
import pyspark.sql
from pyspark.sql import *
from pyspark.sql.functions import *
import urllib
import argparse
from pyspark.ml.feature import CountVectorizerModel
from pyspark.ml.clustering import LDA
from wikipda.article import Preprocessor, Bagger, fetch_article_data
from wikipda.model import WikiPDAModel, ORESClassifier
import pickle
import numpy as np
import pandas as pd

## I. Process data and get WikiPDA topics

In [2]:
# Initialize spark
conf = pyspark.SparkConf().setMaster("local[*]").setAll([
                                   ('spark.driver.memory','32g'),
                                   ('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.8.0'),
                                   ('spark.driver.maxResultSize', '32G'),
                                   ('spark.local.dir', '/scratch/tmp/'),
                                   ('spark.yarn.stagingDir', '/scratch/tmp/'),
                                   ('spark.sql.warehouse.dir', '/scratch/tmp/')
                                  ])
# create the session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# create the context
sc = spark.sparkContext

In [None]:
# Wikitext data 
WIKIPEDIA_DUMP = '/dlabdata1/turkish_wiki/trwiki-20210201-pages-articles-multistream.xml.bz2'

# Read Articles that are in namespace 0
wikipedia_all = spark.read.format('com.databricks.spark.xml') \
    .options(rowTag='page').load(WIKIPEDIA_DUMP) \
    .filter("ns = '0'") \
    .filter("revision.text._VALUE is not null") \
    .filter("length(revision.text._VALUE) > 0")

# Get articles
wikipedia_articles = wikipedia_all.where("redirect is NULL")\
                    .selectExpr("id", "revision.id revision_id", "revision.text._VALUE wikicode", "title")

In [6]:
p = Preprocessor('tr')

In [7]:
# Collect the wikitext of articles in memory for processing
wikitext = wikipedia_articles.select("wikicode").rdd.flatMap(lambda x: x).collect()

In [8]:
articles = []
error_idx = []

In [None]:
# Process all articles and add to articles and error_idx lists
for idx, text in enumerate(wikitext):
    try:
        article = p.load([text], enrich=True)
        articles.extend(article)
    except Exception:  # record the failure but let KeyboardInterrupt through
        error_idx.append(idx)
        articles.extend([None])
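
The loop above appends a `None` placeholder on failure so that `articles` stays aligned with the input order, which the later `zip` with `wiki_ids` relies on. A minimal sketch of that pattern follows. The helper `process_all` is hypothetical, and `int` is only a stand-in for the real per-article call:

```python
def process_all(items, process_one):
    """Apply process_one to every item, keeping positions aligned.

    Returns (results, failed): results has one entry per input item
    (None where processing raised), failed lists the failing indexes.
    """
    results, failed = [], []
    for idx, item in enumerate(items):
        try:
            results.append(process_one(item))
        except Exception:
            failed.append(idx)
            results.append(None)  # placeholder keeps results aligned with items
    return results, failed


# Toy usage: int() stands in for the real processing step.
ids = [101, 102, 103]
results, failed = process_all(["1", "not a number", "3"], int)

# Pair ids with results and drop the failed entries afterwards.
pairs = [(i, r) for i, r in zip(ids, results) if r is not None]
```

Because every input yields exactly one output slot, filtering on `None` after zipping cannot shift ids onto the wrong article.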

In [23]:
# Dump articles and error indexes, closing each file once it is written
with open("/dlabdata1/turkish_wiki/error_idx.p", "wb") as f:
    pickle.dump(error_idx, f)
with open("/dlabdata1/turkish_wiki/articles.p", "wb") as f:
    pickle.dump(articles, f)

In [28]:
wiki_ids = wikipedia_articles.select("id").rdd.flatMap(lambda x: x).collect()

In [39]:
zipped_articles = list(zip(wiki_ids, articles))

In [44]:
zipped_articles = [elem for elem in zipped_articles if elem[1] is not None]

In [47]:
# Get the bag-of-links representations
bols = Bagger().bag([elem[1] for elem in zipped_articles])
# Get topics distribution
model = WikiPDAModel(k=300)
topics_distribution = model.get_distribution(bols)
classifier = ORESClassifier()

# Predict categories of articles
text_categories = classifier.predict_category(topics_distribution)

INFO:loading LdaMulticore object from /home/ira/wikipda_data/LDA_models/300/lda.model
INFO:loading expElogbeta from /home/ira/wikipda_data/LDA_models/300/lda.model.expElogbeta.npy with mmap=None
INFO:setting ignored attribute state to None
INFO:setting ignored attribute id2word to None
INFO:setting ignored attribute dispatcher to None
INFO:LdaMulticore lifecycle event {'fname': '/home/ira/wikipda_data/LDA_models/300/lda.model', 'datetime': '2021-05-12T00:16:51.399950', 'gensim': '4.0.1', 'python': '3.7.7 (default, Mar 23 2020, 22:36:06) \n[GCC 7.3.0]', 'platform': 'Linux-4.15.0-91-generic-x86_64-with-debian-buster-sid', 'event': 'loaded'}
INFO:loading LdaState object from /home/ira/wikipda_data/LDA_models/300/lda.model.state
INFO:loading sstats from /home/ira/wikipda_data/LDA_models/300/lda.model.state.sstats.npy with mmap=None
INFO:LdaState lifecycle event {'fname': '/home/ira/wikipda_data/LDA_models/300/lda.model.state', 'datetime': '2021-05-12T00:16:54.679446', 'gensim': '4.0.1', 'p

In [50]:
text_categories.shape

(389054,)

In [58]:
zipped_articles = np.array(zipped_articles)

In [73]:
article_cats = np.c_[zipped_articles[:, 0], text_categories]

In [74]:
with open("/dlabdata1/turkish_wiki/article_cats.p", "wb") as f:
    pickle.dump(article_cats, f)

In [75]:
# Gets probability of each topic for the articles
category_probas = classifier.predict_proba_labeled(topics_distribution)


In [81]:
# DataFrame containing the probability that each article belongs to each topic
topic_df = pd.DataFrame(category_probas)

In [83]:
topic_df['page_id'] = zipped_articles[:, 0]

In [87]:
topic_df = topic_df[['page_id', 'Culture.Media.Entertainment', 'STEM.Space', 'STEM.Mathematics',
       'Geography.Regions.Africa.Central Africa',
       'Geography.Regions.Americas.North America',
       'History and Society.Society', 'Geography.Regions.Oceania',
       'STEM.Engineering', 'STEM.Libraries _ Information',
       'History and Society.Politics and government', 'STEM.Biology',
       'Culture.Media.Music', 'Geography.Regions.Asia.West Asia',
       'Geography.Regions.Asia.Asia_',
       'Geography.Regions.Americas.Central America',
       'Geography.Regions.Europe.Southern Europe',
       'Geography.Regions.Africa.Africa_',
       'Geography.Regions.Asia.Central Asia',
       'History and Society.Business and economics', 'STEM.STEM_',
       'Culture.Media.Video games', 'Culture.Media.Software',
       'Geography.Regions.Americas.South America',
       'Culture.Biography.Biography_', 'Culture.Visual arts.Comics and Anime',
       'Geography.Regions.Africa.Western Africa',
       'Geography.Regions.Africa.Southern Africa', 'Culture.Performing arts',
       'STEM.Physics', 'Culture.Linguistics', 'Culture.Internet culture',
       'Culture.Biography.Women', 'STEM.Technology', 'STEM.Medicine _ Health',
       'Culture.Media.Television', 'Culture.Philosophy and religion',
       'Culture.Visual arts.Fashion',
       'Geography.Regions.Europe.Western Europe',
       'Geography.Regions.Asia.Southeast Asia', 'Culture.Media.Radio',
       'Culture.Media.Books', 'Culture.Literature',
       'Geography.Regions.Asia.South Asia', 'STEM.Computing',
       'Culture.Food and drink', 'Geography.Geographical',
       'Culture.Visual arts.Architecture',
       'Geography.Regions.Africa.Eastern Africa',
       'Geography.Regions.Asia.East Asia', 'STEM.Earth and environment',
       'History and Society.Transportation', 'STEM.Chemistry',
       'Culture.Media.Films', 'History and Society.History',
       'History and Society.Military and warfare', 'Culture.Sports',
       'Geography.Regions.Europe.Eastern Europe',
       'Culture.Visual arts.Visual arts_', 'Geography.Regions.Asia.North Asia',
       'Culture.Media.Media_', 'History and Society.Education',
       'Geography.Regions.Africa.Northern Africa',
       'Geography.Regions.Europe.Northern Europe',
       'Geography.Regions.Europe.Europe_']]

In [93]:
topic_df.to_csv('/dlabdata1/turkish_wiki/processed_data/wikipda_topics.csv', index = False)

In [125]:
top_cats = pd.DataFrame(article_cats)

In [126]:
top_cats.columns = ['page_id', 'category']

In [127]:
top_cats.to_csv('/dlabdata1/turkish_wiki/processed_data/wikipda_top_topics.csv', index = False)
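
The introduction describes `wikipda_top_topics.csv` as holding the category with the highest probability for each article. On a toy probability table, that relation can be sketched with `idxmax`. The values below are made up, and this is an illustration of the idea rather than how `predict_category` is implemented:

```python
import pandas as pd

# Toy per-category probabilities for two hypothetical pages.
probas = pd.DataFrame(
    {"page_id": [10, 11], "History": [0.8, 0.2], "Sports": [0.3, 0.9]}
).set_index("page_id")

# For every row, pick the column name holding the largest probability.
top = probas.idxmax(axis=1).rename("category").reset_index()
```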

## II. Get edits and revert rates by category.

### 1) Get thresholded topics
It is hard to assign a Wikipedia page to a single topic. When looking at the distributions, I found that topics such as West Asia are heavily biased: because they encode information about Turkey, they tend to appear as the top topic of an article even when another topic describes it far more intuitively. I therefore threshold the probabilities, which allows an article to have several topics. After some inspection, I settled on a threshold of 0.7: an article belongs to a topic if the probability estimated by WikiPDA is at least 0.7. With this method the median number of topics per article is 5, which retains information that a single top topic would discard.
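
As a small illustration of the thresholding idea, the sketch below applies the 0.7 cut-off to a toy table. The probabilities and the helper `threshold_topics` are made up for the example, and it uses `melt` instead of the `where`/`stack` approach taken in the cells that follow:

```python
import pandas as pd

# Toy probabilities for two hypothetical pages (values are made up).
probas = pd.DataFrame(
    {
        "page_id": [1, 2],
        "History": [0.91, 0.10],
        "West Asia": [0.88, 0.75],
        "Sports": [0.05, 0.69],
    }
)

THRESHOLD = 0.7  # same cut-off as in this notebook


def threshold_topics(df, threshold=THRESHOLD):
    """Return one (page_id, topic) row per topic with probability >= threshold."""
    long_df = df.melt(id_vars="page_id", var_name="topic", value_name="proba")
    kept = long_df[long_df["proba"] >= threshold]
    return (
        kept[["page_id", "topic"]]
        .sort_values(["page_id", "topic"])
        .reset_index(drop=True)
    )


page_topics = threshold_topics(probas)
# Number of topics retained for each page.
topics_per_page = page_topics.groupby("page_id")["topic"].count()
```

Page 1 keeps two topics and page 2 keeps one; `Sports` at 0.69 falls just below the cut-off.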


In [102]:
# Read relevant DataFrames
topic_df = pd.read_csv('/dlabdata1/turkish_wiki/processed_data/wikipda_topics.csv')
pages = pd.read_csv('/dlabdata1/turkish_wiki/processed_data/page.csv')
daily_edits = pd.read_csv('/dlabdata1/turkish_wiki/processed_data/edits.csv')

In [103]:
pages = pages[['page_id', 'page_title']]

In [104]:
topic_df.page_id = topic_df.page_id.astype(str)

In [105]:
# Merge DataFrames to get the page_title
topic_df = pd.merge(pages, topic_df)

In [107]:
# Get daily number of edits by non-bots by page_id
daily_edits = daily_edits[daily_edits['user_kind'] != 'bot'].groupby(['date', 'page_id'])[['event_user_id']].sum()

In [108]:
daily_edits = daily_edits.reset_index()

In [110]:
topics = ['Culture.Media.Entertainment', 'STEM.Space',
       'STEM.Mathematics', 'Geography.Regions.Africa.Central Africa',
       'Geography.Regions.Americas.North America',
       'History and Society.Society', 'Geography.Regions.Oceania',
       'STEM.Engineering', 'STEM.Libraries _ Information',
       'History and Society.Politics and government', 'STEM.Biology',
       'Culture.Media.Music', 'Geography.Regions.Asia.West Asia',
       'Geography.Regions.Asia.Asia_',
       'Geography.Regions.Americas.Central America',
       'Geography.Regions.Europe.Southern Europe',
       'Geography.Regions.Africa.Africa_',
       'Geography.Regions.Asia.Central Asia',
       'History and Society.Business and economics', 'STEM.STEM_',
       'Culture.Media.Video games', 'Culture.Media.Software',
       'Geography.Regions.Americas.South America',
       'Culture.Biography.Biography_', 'Culture.Visual arts.Comics and Anime',
       'Geography.Regions.Africa.Western Africa',
       'Geography.Regions.Africa.Southern Africa', 'Culture.Performing arts',
       'STEM.Physics', 'Culture.Linguistics', 'Culture.Internet culture',
       'Culture.Biography.Women', 'STEM.Technology', 'STEM.Medicine _ Health',
       'Culture.Media.Television', 'Culture.Philosophy and religion',
       'Culture.Visual arts.Fashion',
       'Geography.Regions.Europe.Western Europe',
       'Geography.Regions.Asia.Southeast Asia', 'Culture.Media.Radio',
       'Culture.Media.Books', 'Culture.Literature',
       'Geography.Regions.Asia.South Asia', 'STEM.Computing',
       'Culture.Food and drink', 'Geography.Geographical',
       'Culture.Visual arts.Architecture',
       'Geography.Regions.Africa.Eastern Africa',
       'Geography.Regions.Asia.East Asia', 'STEM.Earth and environment',
       'History and Society.Transportation', 'STEM.Chemistry',
       'Culture.Media.Films', 'History and Society.History',
       'History and Society.Military and warfare', 'Culture.Sports',
       'Geography.Regions.Europe.Eastern Europe',
       'Culture.Visual arts.Visual arts_', 'Geography.Regions.Asia.North Asia',
       'Culture.Media.Media_', 'History and Society.Education',
       'Geography.Regions.Africa.Northern Africa',
       'Geography.Regions.Europe.Northern Europe',
       'Geography.Regions.Europe.Europe_']

In [111]:
# Strip the category prefix from the column names to make them easier to read
topic_df.columns  = [x.split(".")[-1] for x in topic_df.columns]
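
Keeping only the last dotted component is safe only if no two columns end up with the same short name. A minimal sketch of that check on a few of the names above (the helper `strip_prefix` is illustrative and not part of the notebook):

```python
# A few of the full column names used in this notebook.
full_names = [
    "page_id",
    "Culture.Media.Media_",
    "STEM.STEM_",
    "Geography.Regions.Asia.West Asia",
]


def strip_prefix(name):
    """Keep only the text after the last dot (the whole name if there is no dot)."""
    return name.split(".")[-1]


short_names = [strip_prefix(name) for name in full_names]

# True if two different full names collapse to the same short name.
has_collision = len(set(short_names)) != len(short_names)
```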

In [112]:
topics = ['Entertainment', 'Space', 'Mathematics',
       'Central Africa', 'North America', 'Society', 'Oceania', 'Engineering',
       'Libraries _ Information', 'Politics and government', 'Biology',
       'Music', 'West Asia', 'Asia_', 'Central America', 'Southern Europe',
       'Africa_', 'Central Asia', 'Business and economics', 'STEM_',
       'Video games', 'Software', 'South America', 'Biography_',
       'Comics and Anime', 'Western Africa', 'Southern Africa',
       'Performing arts', 'Physics', 'Linguistics', 'Internet culture',
       'Women', 'Technology', 'Medicine _ Health', 'Television',
       'Philosophy and religion', 'Fashion', 'Western Europe',
       'Southeast Asia', 'Radio', 'Books', 'Literature', 'South Asia',
       'Computing', 'Food and drink', 'Geographical', 'Architecture',
       'Eastern Africa', 'East Asia', 'Earth and environment',
       'Transportation', 'Chemistry', 'Films', 'History',
       'Military and warfare', 'Sports', 'Eastern Europe', 'Visual arts_',
       'North Asia', 'Media_', 'Education', 'Northern Africa',
       'Northern Europe', 'Europe_']

In [120]:
# Keep only the topics whose probability is at least 0.7
topic_df[topics] = topic_df[topics].where(topic_df[topics] >= 0.7, np.nan)
topic_df[topics] = topic_df[topics].where(topic_df[topics].isna(), topics)
topic_df = topic_df.set_index(['page_id', 'page_title'])

In [122]:
# Stack the DataFrame so that each page appears in one row per associated topic
topic_df = topic_df.stack().reset_index()

topic_df = topic_df[['page_id', 'page_title', 'level_2']]
topic_df.columns = ['page_id', 'page_title', 'topic']

The topics assigned to Genghis Khan (Cengiz_Han) are shown below:

In [123]:
topic_df.head(8)

| | page_id | page_title | topic |
|---|---|---|---|
| 0 | 10 | Cengiz_Han | Society |
| 1 | 10 | Cengiz_Han | West Asia |
| 2 | 10 | Cengiz_Han | Asia_ |
| 3 | 10 | Cengiz_Han | Central Asia |
| 4 | 10 | Cengiz_Han | East Asia |
| 5 | 10 | Cengiz_Han | History |
| 6 | 10 | Cengiz_Han | Military and warfare |
| 7 | 10 | Cengiz_Han | North Asia |


In [231]:
# Save the DataFrame
topic_df.to_csv('/dlabdata1/turkish_wiki/processed_data/thresholded_topics.csv', index = False)

In [124]:
topic_df.page_id = topic_df.page_id.astype(int)

### 2) Get edits by topic
Compute the daily number of edits for each topic.

DataFrame saved at ```../processed_data/daily_edits_by_topic.csv```

In [125]:
daily_edits_by_topic = pd.merge(daily_edits, topic_df)

In [126]:
daily_edits_by_topic = daily_edits_by_topic.groupby(['date', 'topic'])['event_user_id'].sum()

In [127]:
daily_edits_by_topic = daily_edits_by_topic.reset_index()

In [128]:
daily_edits_by_topic.columns = ['date', 'topic', 'number_of_edits']

In [129]:
daily_edits_by_topic['date'] = pd.to_datetime(daily_edits_by_topic['date'], utc=True)

In [130]:
daily_edits_by_topic = daily_edits_by_topic.set_index(['date', 'topic'])
idx = pd.date_range(daily_edits_by_topic.index.levels[0].min(), daily_edits_by_topic.index.levels[0].max())

daily_edits_by_topic = daily_edits_by_topic.reindex(
        pd.MultiIndex.from_product([idx, daily_edits_by_topic.index.levels[1]], 
                                   names=['date', 'topic']), fill_value=0)
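
The reindexing step fills in every (date, topic) pair that has no edits, so that each topic has one row per day. A toy sketch of the same idea, with made-up counts:

```python
import pandas as pd

# Toy counts: "History" has no row on 2021-01-02, "Sports" only has that day.
counts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-02"], utc=True),
        "topic": ["History", "History", "Sports"],
        "number_of_edits": [4, 2, 7],
    }
).set_index(["date", "topic"])

# Every calendar day between the first and last observed date.
all_days = pd.date_range(
    counts.index.get_level_values("date").min(),
    counts.index.get_level_values("date").max(),
    freq="D",
)
all_topics = counts.index.get_level_values("topic").unique()

# Cartesian product of days and topics; missing pairs get a count of 0.
full_index = pd.MultiIndex.from_product([all_days, all_topics], names=["date", "topic"])
dense = counts.reindex(full_index, fill_value=0)
```

The result has 3 days x 2 topics = 6 rows, and the total number of edits is unchanged.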


In [99]:
daily_edits_by_topic.to_csv('/dlabdata1/turkish_wiki/processed_data/daily_edits_by_topic.csv')

### 3) Get revert rate by topic
Compute the daily revert rate for each topic, aggregated over all of its articles.
The DataFrame is saved at ```../processed_data/revert_rate_by_topic.csv```

In [68]:
daily_edits = pd.read_csv('/dlabdata1/turkish_wiki/processed_data/edits.csv')

In [54]:
df_reverts = pd.read_csv('/dlabdata1/turkish_wiki/processed_data/df_reverts_by_pageid.csv')

In [72]:
daily_edits = daily_edits[daily_edits['user_kind'] != 'bot'].groupby(['date', 'page_id'])[['event_user_id']].sum().reset_index()

In [73]:
df_reverts = df_reverts.groupby(['date', 'page_id'])['revision_is_identity_revert'].sum().reset_index()

In [74]:
revert_rate = pd.merge(df_reverts, daily_edits, on=['date', 'page_id'], how= 'outer')

In [75]:
revert_rate = revert_rate.fillna(0)

In [76]:
revert_rate = pd.merge(revert_rate, topic_df)

In [77]:
revert_rate = revert_rate.groupby(['date', 'topic'])[['revision_is_identity_revert', 'event_user_id']].sum()

In [78]:
revert_rate['revert_rate'] = revert_rate['revision_is_identity_revert']/revert_rate['event_user_id']
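
One thing to keep in mind with this division: after the outer merge and `fillna(0)`, a (date, topic) group can have a non-bot edit count of zero, and pandas then returns `inf` (or `NaN` for 0/0) instead of raising. Whether such groups occur in the real data is not checked here. A minimal sketch of a guarded division on toy totals, where `safe_revert_rate` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Toy per-(date, topic) totals; the column names mirror this notebook.
totals = pd.DataFrame(
    {
        "revision_is_identity_revert": [2.0, 0.0, 1.0],
        "event_user_id": [8.0, 0.0, 0.0],  # non-bot edit counts
    }
)


def safe_revert_rate(reverts, edits):
    """Divide reverts by edits, giving NaN wherever the edit count is zero."""
    return reverts / edits.where(edits > 0)


totals["revert_rate"] = safe_revert_rate(
    totals["revision_is_identity_revert"], totals["event_user_id"]
)

# No infinite values remain; undefined rates are NaN instead.
has_inf = bool(np.isinf(totals["revert_rate"]).any())
```

Leaving the undefined rates as `NaN` keeps them distinguishable from a genuine rate of 0; they can still be filled later if a downstream plot needs a value.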

In [79]:
revert_rate = revert_rate.reset_index()

In [80]:
revert_rate = revert_rate[['date', 'topic', 'revert_rate']]

In [81]:
revert_rate['date'] = pd.to_datetime(revert_rate['date'], utc=True)

In [82]:
revert_rate = revert_rate.set_index(['date', 'topic'])
idx = pd.date_range(revert_rate.index.levels[0].min(), revert_rate.index.levels[0].max())

revert_rate = revert_rate.reindex(
        pd.MultiIndex.from_product([idx, revert_rate.index.levels[1]], 
                                   names=['date', 'topic']), fill_value=0)


In [84]:
revert_rate.to_csv('/dlabdata1/turkish_wiki/processed_data/revert_rate_by_topic.csv')