## Wikipedia information fetcher

In this notebook I retrieve the content of one Wikipedia article, given a topic and its domain (the domain helps avoid disambiguation pages).
The content of the article is then summarized and displayed.
Related topics are also displayed, along with their URLs.

In [4]:
import networkx as nx
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
import os
import openai
from collections import Counter
import re
import pandas as pd
import requests
import pytrends
from pytrends.request import TrendReq
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


In [5]:
# input topic and domain

string = input("What scientific field or term would you like to know more about?: ")
domain = input("Which domain do you mean?: ")

What scientific field or term would you like to know more about?: nlp
Which domain do you mean?: deep learning


### Retrieve Wikipedia article from Google search (given topic name and its domain)

In [6]:

import re

import requests
from bs4 import BeautifulSoup


def retrieve_wikipedia(search, domain):
    """
    Find the English Wikipedia article for `search` within `domain` via a
    Google search, then return the cleaned article text and its URL.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
    }
    query = search + " " + domain + " site:en.wikipedia.org"
    # Pass the query via `params` so that spaces and colons are URL-encoded.
    page = requests.get("https://www.google.com/search", params={"q": query}, headers=headers).text
    soup = BeautifulSoup(page, "lxml")

    # Take the link of the first search result block (class "yuRUbf").
    url = soup.find_all("div", "yuRUbf")[0].find("a")["href"]

    page = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page, "lxml")
    # Concatenate all paragraphs of the article, dropping newlines.
    text = ""
    for paragraph in soup.find_all("p"):
        text += paragraph.text.replace("\n", "")
    # Strip parenthesized and bracketed spans such as citation markers.
    return re.sub(r"[\(\[].*?[\)\]]", "", text), url

text, url = retrieve_wikipedia(string, domain)


In [7]:
print(text)

Natural language processing  is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articulated as a 
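The two string-handling steps in `retrieve_wikipedia` can be tried offline, without any network request. The sketch below is only an illustration: `build_search_url` and `strip_bracketed` are hypothetical helper names that are not part of the notebook, and the sample sentence is made up. It shows how the search query can be URL-encoded and what the bracket-stripping regex does to a sentence.

```python
import re
from urllib.parse import urlencode


def build_search_url(search, domain):
    """Hypothetical helper: URL-encoded Google query restricted to en.wikipedia.org."""
    query = f"{search} {domain} site:en.wikipedia.org"
    return "https://www.google.com/search?" + urlencode({"q": query})


def strip_bracketed(text):
    """Remove parenthesized and square-bracketed spans (same regex as retrieve_wikipedia)."""
    return re.sub(r"[\(\[].*?[\)\]]", "", text)


print(build_search_url("nlp", "deep learning"))
# The removed "(NLP)" leaves a double space behind, as in the printed article text.
print(strip_bracketed("Natural language processing (NLP) is a subfield[1] of linguistics."))
```

The pattern is non-greedy and stops at the first closing bracket it meets, so nested brackets leave a stray closer behind: `strip_bracketed("f(a(b)c)")` gives `"fc)"`.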

### Summarize the whole Wikipedia article with Hugging Face's default summarization pipeline


In [8]:
pipe = pipeline("summarization", max_length=64)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)

In [9]:
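def summarize_article(text, chunk_size=10):
    """Summarize `text` in chunks of `chunk_size` sentences and join the results."""
    sentences = text.split(".")

    summary = ""
    for start in range(0, len(sentences), chunk_size):
        # Rejoin with "." so the model still sees the sentence boundaries.
        chunk = ".".join(sentences[start:start + chunk_size])
        if not chunk.strip():
            continue  # skip an empty trailing chunk
        summary += pipe(chunk)[0]["summary_text"]
    return summary


print(summarize_article(text))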
  

 Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with interactions between computers and human language . The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them . The technology can then extract information and insights contained Representation learning and deep neural network-style machine learning methods became widespread in natural language processing in the 2010s . Such techniques can achieve state-of-the-art results in many natural language tasks, eg, in language modeling and parsing . This is increasingly important in medicine and healthcare, where NLP Many different machine-learning algorithms have been applied to natural-language-processing tasks . These algorithms take as input a large set of "features" that are generated from the input data . Increasingly, research has focused on statistical models, which make soft, probabilisti
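The article is summarized in groups of ten sentences because the model only accepts a limited amount of input at once. The grouping step can be checked on its own, without loading a model. This is a minimal sketch: `chunk_sentences` is a hypothetical helper name, and splitting on `"."` is as naive here as in the notebook (abbreviations and decimals are split too).

```python
def chunk_sentences(text, chunk_size=10):
    """Split text on "." and regroup it into chunks of at most chunk_size sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[start:start + chunk_size]) + "."
        for start in range(0, len(sentences), chunk_size)
    ]


# Made-up sample text with five short sentences.
sample = "One. Two. Three. Four. Five."
for chunk in chunk_sentences(sample, chunk_size=2):
    print(chunk)
```

Each chunk could then be passed to `pipe` in turn and the partial summaries concatenated.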

### Retrieve related topics 

In [10]:


def related_wikipedia(url):
    """Return the entries of the first multi-column list ("div-col") on the page."""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = soup.find("div", "div-col").text
    return [i.strip() for i in text.split("\n") if i != ""]



related = related_wikipedia(url)

print(related)

['1 the Road', 'Automated essay scoring', 'Biomedical text mining', 'Compound term processing', 'Computational linguistics', 'Computer-assisted reviewing', 'Controlled natural language', 'Deep learning', 'Deep linguistic processing', 'Distributional semantics', 'Foreign language reading aid', 'Foreign language writing aid', 'Information extraction', 'Information retrieval', 'Language and Communication Technologies', 'Language technology', 'Latent semantic indexing', 'Native-language identification', 'Natural-language programming', 'Natural-language understanding', 'Natural-language search', 'Outline of natural language processing', 'Query expansion', 'Query understanding', 'Reification (linguistics)', 'Speech processing', 'Spoken dialogue systems', 'Text-proofing', 'Text simplification', 'Transformer (machine learning model)', 'Truecasing', 'Question answering', 'Word2vec']


### Retrieve category topics and their URLs

In [19]:

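def categories_wikipedia(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = soup.find("div", "mw-normal-catlinks").text
    # The category names run together in the raw text; insert ", " before each
    # capital letter that directly follows a word character.
    categories = re.sub(r"(\w)([A-Z])", r"\1, \2", text)
    # Drop the leading label before ":" and split the rest into a list.
    return [i.strip() for i in categories.split(":")[1].split(",")]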

categories = categories_wikipedia(url)

print(categories)

['Natural language processing', 'Artificial intelligence', 'Computational fields of study', 'Computational linguistics', 'Speech recognition']
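The regex used to separate the categories can be tested on a plain string. The sketch below assumes the raw text of the category box looks like the `raw` value shown, with the names run together; that string is made up for illustration, and `split_categories` is a hypothetical helper name.

```python
import re


def split_categories(raw):
    """Split a run-together category string into a list of category names."""
    # Insert ", " wherever a capital letter directly follows a word character.
    separated = re.sub(r"(\w)([A-Z])", r"\1, \2", raw)
    # Drop the label before ":" and split on the inserted commas.
    return [name.strip() for name in separated.split(":")[1].split(",")]


# Assumed shape of the raw text, not copied from a real page.
raw = "Categories: Natural language processingArtificial intelligenceSpeech recognition"
print(split_categories(raw))
```

This heuristic also splits inside names that contain consecutive capitals or a capital after a lowercase letter (an acronym, for example), so reading the individual link texts from the HTML would be more robust.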


In [21]:

def urls_wikipedia(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = soup.find("div", "mw-normal-catlinks")
    # The hrefs already start with "/", so the base URL must not end with one.
    return ["https://en.wikipedia.org" + i.split('"')[1] for i in str(text).split(" ") if "href" in i]

urls = urls_wikipedia(url)

print(urls)

['https://en.wikipedia.org/wiki/Help:Category', 'https://en.wikipedia.org/wiki/Category:Natural_language_processing', 'https://en.wikipedia.org/wiki/Category:Artificial_intelligence', 'https://en.wikipedia.org/wiki/Category:Computational_fields_of_study', 'https://en.wikipedia.org/wiki/Category:Computational_linguistics', 'https://en.wikipedia.org/wiki/Category:Speech_recognition']
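Relative links such as `/wiki/Category:...` can also be collected by parsing the anchor tags and resolving them with `urljoin`, instead of splitting the HTML string on spaces and quotes. This is a sketch using only the standard library: `HrefCollector`, `category_urls`, and the HTML snippet are made up for illustration, and the filter that keeps only `/wiki/Category:` links (dropping the `Help:Category` link) is my own assumption about what is wanted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def category_urls(html):
    """Return absolute URLs for the /wiki/Category: links found in html."""
    collector = HrefCollector()
    collector.feed(html)
    return [urljoin(BASE_URL, h) for h in collector.hrefs if h.startswith("/wiki/Category:")]


# Made-up snippet imitating the shape of a category box; not real page markup.
snippet = (
    '<div class="mw-normal-catlinks">'
    '<a href="/wiki/Help:Category">Categories</a>: '
    '<ul><li><a href="/wiki/Category:Artificial_intelligence">Artificial intelligence</a></li>'
    '<li><a href="/wiki/Category:Speech_recognition">Speech recognition</a></li></ul></div>'
)
print(category_urls(snippet))
```

`urljoin` resolves the leading `/` of the href against the base, so the result has exactly one slash between host and path.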


### Retrieve subtopics and their URLs

In [9]:

def subtopics_wikipedia(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = (soup.find("td", "navbox-list navbox-odd hlist").text).split("\n")
    return [i for i in text if i != ""]




subtopics = subtopics_wikipedia(url)
print(subtopics)


In [26]:

def sub_urls_wikipedia(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = soup.find("td", "navbox-list navbox-odd hlist")
    # The hrefs already start with "/", so the base URL must not end with one.
    return ["https://en.wikipedia.org" + i.split('"')[1] for i in str(text).split(" ") if "href" in i]

urls = sub_urls_wikipedia("https://en.wikipedia.org/wiki/Cognitive_behavioral_therapy")

print(urls)

['https://en.wikipedia.org/wiki/Acceptance_and_commitment_therapy', 'https://en.wikipedia.org/wiki/Behavior_therapy', 'https://en.wikipedia.org/wiki/Behavioral_activation', 'https://en.wikipedia.org/wiki/Cognitive_analytic_therapy', 'https://en.wikipedia.org/wiki/Cognitive_behavioral_analysis_system_of_psychotherapy', 'https://en.wikipedia.org/wiki/Cognitive_therapy', 'https://en.wikipedia.org/wiki/Community_reinforcement_approach_and_family_training', 'https://en.wikipedia.org/wiki/Compassion-focused_therapy', 'https://en.wikipedia.org/wiki/Contingency_management', 'https://en.wikipedia.org/wiki/Desensitization_(psychology)', 'https://en.wikipedia.org/wiki/Exposure_therapy', 'https://en.wikipedia.org/wiki/Direct_therapeutic_exposure', 'https://en.wikipedia.org/wiki/Exposure_and_response_prevention', 'https://en.wikipedia.org/wiki/Flooding_(psychology)', 'https://en.wikipedia.org/wiki/Prolonged_exposure_therapy', 'https://en.wikipedia.org/wiki/Systematic_desensitization