# 🐱Wikicat


# Goal and Purpose
Wikicat is a module that extracts all pages of a given Wikipedia category and writes them to a JSON file. You only need to pass a standard language code and a valid category name.

## Prerequisites
You may need to install the tqdm and Wikipedia-API libraries first.
The commands below install them from a notebook cell (drop the leading `!` to run them in a terminal):

In [1]:
!pip install Wikipedia-API
!pip install tqdm



# Libraries
Now it's time to import the wikicat library, along with the random and json libraries.
We will use the latter two to pick a random JSON object and print it.


In [2]:
import wikicat
import random
import json

# Functions
In this section we will explore each function of this package one by one:
*   set_lang
*   get_category_data
*   create_category_file
*   get_duplicate_elements

## set_lang
Let's start with the ***set_lang*** function.

This function sets the language edition of Wikipedia to work with.
Check the [language editions of Wikipedia](https://en.wikipedia.org/wiki/List_of_Wikipedias) to find the code you should use.

It is also used in the ***create_category_file*** function.

This function takes:

*   *lang_code*: a standard language code (e.g. en: English, fa: Farsi)

And returns:

*   a Wikipedia object for that language, for further use.

As an example, we use the code *'fa'*, which stands for *Farsi*. You can use any valid code.

In [10]:
lang_code = 'fa'
fa_wiki = wikicat.set_lang(lang_code)

*fa_wiki* can be used to get a single page's title, URL, summary, full text, sections, categories, etc.
For more information, see the [Wikipedia-API documentation](https://pypi.org/project/Wikipedia-API/).


## get_category_data
This function takes:

*   category_name: name of a valid category in the given language
*   categorymembers: list of members of this category
*   min_delay: (default value: 1s) minimum delay in seconds to wait between requests to Wikipedia
*   max_delay: (default value: 5s) maximum delay in seconds to wait between requests to Wikipedia
*   level: (default value: 0) level of the page being processed
*   max_level: (default value: 20) maximum number of levels to traverse

And returns:
*   a list of JSON objects, each serialized as a string. The list may contain duplicate elements. Each page has the keys:
    - title: title of the Wikipedia page
    - main category: the main category whose data we are extracting
    - all categories: all categories this page belongs to
    - content: content of the page (it usually needs preprocessing)
    - url: the URL of the page
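
To make the record layout concrete, here is a minimal sketch of one such entry. It assumes each element is a JSON-encoded string with the keys listed above, which is consistent with the samples in this notebook being passed through `json.loads` before printing. The helper name `make_page_record` and the sample values are hypothetical and not part of wikicat.

```python
import json

def make_page_record(title, main_category, all_categories, content, url):
    """Serialize one page into a JSON string with the keys described above.

    Hypothetical helper for illustration; wikicat's own code may differ.
    """
    record = {
        "title": title,
        "main category": main_category,
        "all categories": list(all_categories),
        "content": content,
        "url": url,
    }
    # ensure_ascii=False keeps non-Latin text (e.g. Farsi) human-readable.
    return json.dumps(record, ensure_ascii=False)

record = make_page_record(
    title="Example page",
    main_category="Example category",
    all_categories=["Category:A", "Category:B"],
    content="Raw page text...",
    url="https://example.org/wiki/Example_page",
)
parsed = json.loads(record)  # back to a dict, as done for the printed samples
print(parsed["title"])
```

Storing each page as a string makes the entries hashable, which is convenient for spotting repeated pages later on.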

The implementation is recursive.
It traverses all subcategories (branch nodes)
and their pages (leaf nodes) depth-first to collect all data related to the given main category.
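
The traversal described above can be sketched on a toy category tree. This is only an illustration of depth-first recursion with a depth limit: the function `collect_pages`, the dictionary layout, and the exact meaning of `max_level` here are assumptions, not wikicat's actual code, and no requests are sent.

```python
def collect_pages(tree, name, level=0, max_level=20):
    """Depth-first walk over a toy category tree.

    `tree` maps a category name to a list of members. A member that is
    itself a key of `tree` is a subcategory (branch node); anything else
    is a page (leaf node).
    """
    pages = []
    for member in tree.get(name, []):
        if member in tree:
            # Branch node: recurse unless the depth limit is reached.
            if level < max_level:
                pages.extend(collect_pages(tree, member, level + 1, max_level))
        else:
            # Leaf node: an ordinary page.
            pages.append(member)
    return pages

toy_tree = {
    "Prose": ["Page A", "Essays", "Page B"],
    "Essays": ["Page C", "Page A"],  # "Page A" is reachable twice
}
print(collect_pages(toy_tree, "Prose"))
# ['Page A', 'Page C', 'Page A', 'Page B']
```

Because a page can belong to several subcategories, a walk like this naturally yields repeated entries, which is why the returned list may contain duplicates.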

We use the "نثر فارسی" (Persian prose) category to extract its data.
As mentioned, this function takes 6 arguments, of which the last 4 are optional. If you don't pass them, the default values are used.




In [4]:
category_name = "نثر فارسی"
category = fa_wiki.page(f"Category:{category_name}")
category_members = category.categorymembers
min_delay = 1
max_delay = 10
level = 0
max_level = 10
category_data = wikicat.get_category_data(category_name, category_members, min_delay, max_delay, level, max_level)

 88%|████████▊ | 15/17 [01:15<00:11,  5.66s/it]
  0%|          | 0/8 [00:00<?, ?it/s]
 12%|█▎        | 1/8 [00:06<00:43,  6.17s/it]
 25%|██▌       | 2/8 [00:16<00:51,  8.53s/it]
 38%|███▊      | 3/8 [00:21<00:34,  6.99s/it]
 50%|█████     | 4/8 [00:29<00:29,  7.46s/it]
 62%|██████▎   | 5/8 [00:33<00:18,  6.28s/it]
 75%|███████▌  | 6/8 [00:36<00:09,  4.88s/it]
 88%|████████▊ | 7/8 [00:38<00:04,  4.00s/it]
100%|██████████| 8/8 [00:42<00:00,  5.30s/it]
 94%|█████████▍| 16/17 [02:08<00:19, 19.78s/it]
  0%|          | 0/49 [00:00<?, ?it/s]
  2%|▏         | 1/49 [00:06<04:55,  6.16s/it]
  4%|▍         | 2/49 [00:15<06:13,  7.95s/it]
  6%|▌         | 3/49 [00:21<05:28,  7.14s/it]
  8%|▊         | 4/49 [00:27<05:04,  6.77s/it]
 10%|█         | 5/49 [00:32<04:32,  6.19s/it]
 12%|█▏        | 6/49 [00:42<05:09

Here is a random page selected from category_data:


In [8]:
sample = random.choice(category_data)
parsed = json.loads(sample)
print(json.dumps(parsed, indent=4, ensure_ascii=False))

{
    "title": "منشآت قائم مقام فراهانی",
    "main category": "نثر فارسی",
    "all categories": [
        "رده:صفحه‌های دارای ارجاع با پارامتر پشتیبانی‌نشده",
        "رده:مقاله‌های خرد ادبیات",
        "رده:نثر فارسی",
        "رده:همه مقاله‌های خرد",
        "رده:کتاب‌های ادبی",
        "رده:کتاب‌های سده ۱۳"
    ],
    "content": "منشآت قائم مقام فراهانی عنوان کتابی است مشتمل بر برخی نوشته‌های قائم مقام فراهانی که به دستور و تدبیر شاگرد او، حاج فرهاد میرزا معتمدالدوله قاجار در سال ۱۲۸۰ گردآوری و در سال ۱۲۹۴ برای نخستین بار چاپ شد. این اثر دارای نثری زیبا و ادبی است و در دوره‌ای منتشر شد که نهضت بازگشت ادب

As you can see, there is a trade-off between speed and avoiding errors caused by sending too many requests. You can set both min_delay and max_delay to zero to speed up the code, but errors then become more likely.
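
As a rough illustration of this trade-off, the sketch below waits a random amount of time between two bounds before each request. It assumes the delay is drawn uniformly from [min_delay, max_delay]; the helper name `polite_pause` is hypothetical, and how wikicat actually picks its delay is not shown in this notebook.

```python
import random
import time

def polite_pause(min_delay=1.0, max_delay=5.0):
    """Sleep for a random duration between min_delay and max_delay seconds.

    Returns the chosen delay so callers can log it.
    """
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay

# Tiny bounds so the demo finishes quickly; real runs would use seconds.
waited = polite_pause(0.0, 0.01)
print(f"waited {waited:.4f}s")
```

With both bounds set to zero the pause disappears entirely, which is the fast but riskier configuration mentioned above.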


## create_category_file
This function wraps the previous ones into a single call and is the most important part of wikicat. Its main purpose is to create a JSON file, named after the category, that contains all data of the given category.

Judging by the call below, this function takes:
*   lang_code: a standard language code, as passed to **set_lang**
*   category_name: name of a valid category in the given language
*   min_delay: (default value: 1s) minimum delay in seconds to wait between requests to Wikipedia
*   max_delay: (default value: 5s) maximum delay in seconds to wait between requests to Wikipedia
*   max_level: (default value: 20) maximum number of levels to traverse

It then creates the file and returns:
*   the deduplicated JSON objects, in which every element is unique.
*   the list of all JSON objects, including repeated ones.
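
The deduplicate-and-save step can be sketched as follows. This is a minimal illustration under stated assumptions: records are JSON strings, first occurrences are kept in order, and the output is a single JSON array. The function `write_unique` and the file name are hypothetical; wikicat's real file layout may differ (the notebook wraps the first return value in `list()` before sampling, which suggests it may be a set).

```python
import json
import os
import tempfile

def write_unique(records, path):
    """Drop repeated JSON strings (keeping first occurrences) and save them."""
    # dict.fromkeys preserves insertion order, so this is an ordered dedup.
    unique = list(dict.fromkeys(records))
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin text readable in the file.
        json.dump([json.loads(r) for r in unique], f, ensure_ascii=False, indent=4)
    return unique

records = ['{"title": "A"}', '{"title": "B"}', '{"title": "A"}']
path = os.path.join(tempfile.mkdtemp(), "example_category.json")
unique = write_unique(records, path)
print(len(records), len(unique))
# 3 2
```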

We continue our "نثر فارسی" example. The code below writes all unique data to a file named نثرفارسی.json.
All the arguments passed were initialized in the previous parts.



In [12]:
deduplicated_data, data = wikicat.create_category_file(lang_code, category_name, min_delay, max_delay, max_level)


 88%|████████▊ | 15/17 [01:06<00:08,  4.19s/it]
  0%|          | 0/8 [00:00<?, ?it/s]
 12%|█▎        | 1/8 [00:07<00:50,  7.17s/it]
 25%|██▌       | 2/8 [00:15<00:46,  7.76s/it]
 38%|███▊      | 3/8 [00:17<00:26,  5.21s/it]
 50%|█████     | 4/8 [00:19<00:16,  4.01s/it]
 62%|██████▎   | 5/8 [00:29<00:18,  6.23s/it]
 75%|███████▌  | 6/8 [00:35<00:11,  5.87s/it]
 88%|████████▊ | 7/8 [00:45<00:07,  7.28s/it]
100%|██████████| 8/8 [00:47<00:00,  5.93s/it]
 94%|█████████▍| 16/17 [01:55<00:17, 17.55s/it]
  0%|          | 0/49 [00:00<?, ?it/s]
  2%|▏         | 1/49 [00:07<05:44,  7.17s/it]
  4%|▍         | 2/49 [00:09<03:19,  4.23s/it]
  6%|▌         | 3/49 [00:17<04:37,  6.03s/it]
  8%|▊         | 4/49 [00:27<05:45,  7.68s/it]
 10%|█         | 5/49 [00:33<05:13,  7.13s/it]
 12%|█▏        | 6/49 [00:39<04:38

Here is a random page selected from deduplicated_data:

In [14]:
sample = random.choice(list(deduplicated_data))
parsed = json.loads(sample)
print(json.dumps(parsed, indent = 4, ensure_ascii=False))

{
    "title": "بدایع‌الوقایع",
    "main category": "نثر فارسی",
    "all categories": [
        "رده:ادبیات",
        "رده:ادبیات ایران",
        "رده:ادبیات فارسی",
        "رده:مقاله‌های خرد ادبیات",
        "رده:مقاله‌هایی که تجمیع ارجاع در آن‌ها ممنوع است",
        "رده:نثر فارسی",
        "رده:همه مقاله‌های خرد"
    ],
    "content": "بدایع‌الوقایع یا وديعةالحقايق اثر زین‌الدین واصفی هروی شاعر روزگار تیموری و صفوی است. این شرح رویدادهایی از گذشته و مطالبی از دیده‌ها و شنیده‌های خود او را شامل می‌شود. واصفی هروی بخشی از قصیده‌های خود در مدح ازبکان را در این کتاب آورده‌است. این کتاب در سال ۱۹۶۱ م در دو جلد

## get_duplicate_elements
Our last function takes a list and returns its duplicate elements.
This function takes:
*   cat_data: any list (in this case, the pages of a category)

And returns:
*   the duplicate elements of the list (in this case, duplicate pages)
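
One common way to find repeated elements in a list is to count occurrences, as in the sketch below. It assumes each duplicate should be reported once, in first-seen order; the name `find_duplicates` is hypothetical, and wikicat's own function may report duplicates differently.

```python
from collections import Counter

def find_duplicates(items):
    """Return each element that occurs more than once, in first-seen order."""
    counts = Counter(items)  # Counter keeps insertion order of its keys
    return [item for item, n in counts.items() if n > 1]

pages = ["Page A", "Page C", "Page A", "Page B", "Page C"]
print(find_duplicates(pages))
# ['Page A', 'Page C']
```

This works for any hashable elements, including the JSON strings used for pages in this notebook.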

As noted above, the second output of the create_category_file function may contain duplicate values. The code below extracts these repeated pages.


In [15]:
duplicate_pages = wikicat.get_duplicate_elements(data)

Here is a random page selected from duplicate_pages:

In [16]:
sample = random.choice(duplicate_pages)
parsed = json.loads(sample)
print(json.dumps(parsed, indent=4, ensure_ascii=False))

{
    "title": "تذکرةالاولیاء",
    "main category": "نثر فارسی",
    "all categories": [
        "رده:ادبیات تصوف",
        "رده:ادبیات فارسی",
        "رده:الگوهای درگاه با درگاه‌های ناموجود",
        "رده:تذکره‌های فارسی",
        "رده:عطار نیشابوری",
        "رده:فرهنگ‌های اعلام",
        "رده:نثر فارسی دوره تکوین",
        "رده:پیوندهای وی‌بک الگوی بایگانی اینترنت",
        "رده:کتاب‌های عرفانی"
    ],
    "content": "تذکرةالاولیاء کتابی عرفانی است به نثر ساده و در قسمت‌هایی مسجع، که در شرح احوال بزرگان اولیاء و مشایخ صوفیه توسط فریدالدین عطار نیشابوری به فارسی نوشته شده‌است.\n\nساختار و درون‌مایهٔ کتاب\nاین کتاب مش