# Data Collection

We will use Wikipedia articles as our 'corpus of documents', collecting them with the 'wikipedia' and 'wikipediaapi' modules.

* wikipedia: a Python library that supports search queries on Wikipedia.
* wikipediaapi: a Python library that supports extracting text _or any other page section_ from a Wikipedia page.

So, the plan is:

   1. Search Wikipedia with 10 search words, requesting up to 500 results per search word (using wikipedia).

   2. Get the summary of every article in the search results (using wikipediaapi).

   3. Store the collected data in a dictionary, documents_file = { Title: Summary }.
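The three steps can also be sketched offline, which is handy for checking the logic without network access. In this minimal sketch the search and summary lookups are passed in as functions: `collect_summaries`, `search_fn`, `summary_fn` and the stand-in data are hypothetical names introduced here for illustration, not part of either library.

```python
from typing import Callable, Dict, Iterable, List


def collect_summaries(
    search_words: Iterable[str],
    search_fn: Callable[[str], List[str]],
    summary_fn: Callable[[str], str],
) -> Dict[str, str]:
    """Build {title: summary} from every title returned for every search word."""
    documents = {}
    for word in search_words:
        for title in search_fn(word):
            # A title returned for several search words is stored only once,
            # because dictionary keys are unique.
            documents[title] = summary_fn(title)
    return documents


# Hypothetical stand-in data, used here instead of live Wikipedia calls.
fake_results = {"ball": ["Ball", "Football"], "win": ["Football", "Win"]}
fake_summaries = {
    "Ball": "A round object.",
    "Football": "A team sport.",
    "Win": "A victory.",
}

demo = collect_summaries(
    ["ball", "win"], fake_results.__getitem__, fake_summaries.__getitem__
)
print(len(demo))  # 3: 'Football' appears under both search words but is kept once
```

Because titles are dictionary keys, the final document count can be lower than the number of search words multiplied by the number of results per word.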

In [1]:
# Import modules
import wikipedia
import wikipediaapi

# Set search words
search_words = ['ball', 'history', 'science', 'show', 'place', 'fun', 'formula', 'war', 'win', 'food']
documents_file = {}

# Initialize a Wikipedia object from the wikipediaapi module, passing the language and extract format
# (assumption: recent wikipediaapi releases may also require a user_agent argument)
wiki = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.WIKI)

# Query each search word
for word in search_words:
    # Get the summary of every page in the search results
    for page in wikipedia.search(word, results=500):
        documents_file[page] = wiki.page(page).summary    # use .text instead for the page's full text
        
# Print number of collected wikipedia documents
print("Number of collected documents = ", len(documents_file))

Number of collected documents =  4972
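Before using the corpus, it may be worth checking for entries whose summary came back empty. Whether any such entries exist in this collection is an assumption (the output above does not show it). The helper `drop_empty` and the sample data below are hypothetical, a minimal sketch rather than part of the original pipeline.

```python
def drop_empty(documents):
    """Return a copy of the dictionary without blank or whitespace-only summaries."""
    return {title: text for title, text in documents.items() if text and text.strip()}


# Hypothetical sample standing in for documents_file
sample = {
    "Ball": "A ball is a round object.",
    "Hypothetical missing page": "",
    "Hypothetical blank page": "  \n",
}

cleaned = drop_empty(sample)
print(list(cleaned))
```

Applied to the real data, this would be `documents_file = drop_empty(documents_file)`.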


In [2]:
# Print all documents' titles
list(documents_file.keys())

['Ball',
 'BALL',
 'Dragon Ball',
 'Biggest ball of twine',
 'Volleyball',
 'Basketball',
 'Lucille Ball',
 'Dragon Ball Z',
 'Football',
 'Lonzo Ball',
 'Take the Ball Pass the Ball',
 'Goku',
 'LaMelo Ball',
 'Table tennis',
 'Baller',
 'On the ball',
 'Baseball',
 'Eight-ball',
 'Association football',
 'Ball turret',
 'Softball',
 'Handball',
 'Eight-ball (disambiguation)',
 'Ballance',
 'Cycle ball',
 'Mirror ball',
 'Cock and ball torture',
 'Masquerade ball',
 'Basketball (ball)',
 'Tape ball',
 'Adidas Jabulani',
 'No-ball',
 'Adidas Al Rihla',
 'List of ball games',
 'Four-ball golf',
 'Takeoff (rapper)',
 'Nine-ball',
 'Ball python',
 'Dragon Ball Super: Super Hero',
 'Ball tampering',
 'Koosh ball',
 'Bouncy ball',
 'Ballers',
 'Ten-ball',
 'Ball and beam',
 'Debutante ball',
 'Solder ball',
 'LiAngelo Ball',
 'Ball and chain',
 'Ball boy',
 'Cricket ball',
 'Zoe Ball',
 'Football (ball)',
 'Golden Ball',
 'Rolling ball sculpture',
 'Michael Ball',
 'Ball Corporation',
 'Bal

In [3]:
# Print our documents
documents_file

{'Ball': 'A ball is a round object (usually spherical, but can sometimes be ovoid) with several uses. It is used in ball games, where the play of the game follows the state of the ball as it is hit, kicked or thrown by players. Balls can also be used for simpler activities, such as catch or juggling. Balls made from hard-wearing materials are used in engineering applications to provide very low friction bearings, known as ball bearings. Black-powder weapons use stone and metal balls as projectiles.\nAlthough many types of balls are today made from rubber, this form was unknown outside the Americas until after the voyages of Columbus. The Spanish were the first Europeans to see the bouncing rubber balls (although solid and not inflated) which were employed most notably in the Mesoamerican ballgame. Balls used in various sports in other parts of the world prior to Columbus were made from other materials such as animal bladders or skins, stuffed with various materials.\nAs balls are one o

In [4]:
# Save documents for later use
import json

with open('documents-file.json', 'w') as f:
    json.dump(documents_file, f, indent=4)
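To use the saved documents later, the file can be read back with `json.load`. The sketch below runs a small round trip on hypothetical sample data in a temporary directory, so it does not touch the real 'documents-file.json'. The explicit `encoding='utf-8'` is an addition for portability, not something the cell above uses.

```python
import json
import os
import tempfile

# Hypothetical sample standing in for documents_file
sample_documents = {
    "Ball": "A ball is a round object.",
    "Café": "Non-ASCII titles survive the round trip.",
}

# Write to a temporary location instead of the real output file
path = os.path.join(tempfile.mkdtemp(), "documents-file.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_documents, f, indent=4)

# Read the file back into a dictionary
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == sample_documents)
```

For the real file, replacing `path` with 'documents-file.json' in the second `with` block should be enough.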