In [1]:
from pprint import pprint

import newspaper
import pandas as pd

# Newspaper3k - info
[docs](https://newspaper.readthedocs.io/en/latest/)  
[quickstart](https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html)  
[advanced](https://newspaper.readthedocs.io/en/latest/user_guide/advanced.html) 
(e.g. quickly download articles concurrently with multi-threading)  
  
[pypi](https://pypi.org/project/newspaper3k/)  
[github](https://github.com/codelucas/newspaper)

In [2]:
help(newspaper)

Help on package newspaper:

NAME
    newspaper - Wherever smart people work, doors are unlocked. -- Steve Wozniak

PACKAGE CONTENTS
    api
    article
    cleaners
    configuration
    extractors
    images
    mthreading
    network
    nlp
    outputformatters
    parsers
    settings
    source
    text
    urls
    utils
    version
    videos (package)

DATA
    __copyright__ = 'Copyright 2014, Lucas Ou-Yang'
    __license__ = 'MIT'
    __title__ = 'newspaper'
    news_pool = <newspaper.mthreading.NewsPool object>

VERSION
    0.2.8

AUTHOR
    Lucas Ou-Yang

FILE
    c:\users\gosia\anaconda3\lib\site-packages\newspaper\__init__.py




# Scraping single article

In [3]:
from newspaper import Article

### Download

In [4]:
url = 'https://www.hookedgamers.com/pc/forspoken/review/article-2402.html'
article = Article(url, keep_article_html=True)
article.download()
print(article.html)

<!DOCTYPE html>
<html lang="en" xmlns="https://www.w3.org/1999/xhtml">
<head>
 	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<meta property="fb:app_id" content="337785116266465"/>
    <meta name="viewport" content="width=device-width" />
    <meta name="application-name" content="Hooked Gamers" />
    <meta name="description" content="We played Forspoken - It frequently is so bad, it's almost good. Awful optimization, cringeworthy dialogue, forgettable narrative, and mediocre gameplay">
    <meta name="keywords" content="Forspoken, Action-Adventure game, Square Enix, Luminous Productions" />
	<title>Forspoken PC review - "Forspoken should be forgotten" | Hooked Gamers</title>
	<meta property="og:image" content="https://www.hookedgamers.com/images/7017/forspoken/reviews/header_2402_forspoken.jpg" />
	<meta property="og:image:type" content="image/jpeg" /> 	

	<meta property="fb:app_id"       content="337785116266465" />	
	<link href='//fonts.googleapis.com/css?fam

### Parse

In [5]:
# Parse the downloaded HTML, then dump every attribute of the Article object
article.parse()
pprint(vars(article))

{'additional_data': {},
 'article_html': '<div><h2>Aims low, lands lower.</h2><br>\r\n'
                 "Under the mire of <i>Forspoken</i>'s many flaws, there is "
                 'the individual fragments of what could have been a highly '
                 'compelling, or at least passable, narrative. '
                 '<i>Forspoken</i> follows the character of Frey, who grew up '
                 'with no family, no prospects, and forced to rely on street '
                 'gangs in order to scratch out a living. After facing the '
                 'prospect of prison time for her criminal offenses, Frey '
                 'instead has her sentence commuted by the judge, who gives '
                 'her one final chance to set things right. However, things go '
                 'wrong from that point on, and after a series of '
                 'misadventures, Frey finds herself transported to a magical '
                 "world, with no clear way home. Everything about Frey's 

In [6]:
article.title

'Forspoken PC review - "Forspoken should be forgotten"'

In [7]:
article.authors

[]

In [8]:
article.text

"Aims low, lands lower.\n\nAn open world devoid of life...\n\nTechnical difficulties ahead...\n\nFinal Thoughts:\n\nUnder the mire of's many flaws, there is the individual fragments of what could have been a highly compelling, or at least passable, narrative.follows the character of Frey, who grew up with no family, no prospects, and forced to rely on street gangs in order to scratch out a living. After facing the prospect of prison time for her criminal offenses, Frey instead has her sentence commuted by the judge, who gives her one final chance to set things right. However, things go wrong from that point on, and after a series of misadventures, Frey finds herself transported to a magical world, with no clear way home. Everything about Frey's origin story plaintively makes it clear that the player is expected to find her a sympathetic, or even relatable, character. The problem though is that this message definitely wasn't conveyed to whoever was in charge of writing the dialogue.Thro

In [9]:
article.article_html

'<div><h2>Aims low, lands lower.</h2><br>\r\nUnder the mire of <i>Forspoken</i>\'s many flaws, there is the individual fragments of what could have been a highly compelling, or at least passable, narrative. <i>Forspoken</i> follows the character of Frey, who grew up with no family, no prospects, and forced to rely on street gangs in order to scratch out a living. After facing the prospect of prison time for her criminal offenses, Frey instead has her sentence commuted by the judge, who gives her one final chance to set things right. However, things go wrong from that point on, and after a series of misadventures, Frey finds herself transported to a magical world, with no clear way home. Everything about Frey\'s origin story plaintively makes it clear that the player is expected to find her a sympathetic, or even relatable, character. The problem though is that this message definitely wasn\'t conveyed to whoever was in charge of writing the dialogue.<br>\r\n<br>\r\nThrough both the writ

### NLP

In [10]:
# article.nlp() needs NLTK's 'punkt' tokenizer. If it fails with a missing-resource error
# (see https://github.com/delip/PyTorchNLPBook/issues/14), run this once:
# import nltk
# nltk.download('punkt')

article.nlp()

In [11]:
print(article.summary)

However, things go wrong from that point on, and after a series of misadventures, Frey finds herself transported to a magical world, with no clear way home.
Everything about Frey's origin story plaintively makes it clear that the player is expected to find her a sympathetic, or even relatable, character.
If you were hoping that, over time, the player might find themselves connecting more with Frey, you probably shouldn't get your hopes up.
Similar tomakes the critical mistake of assuming random out-of-context swearing and an abrasive attitude are the two sole factors in a protagonist's personality.
The desktop boasted a Nvidia 2070 GTX gpu, and the laptop featured a Nvidia 3070 RTX gpu.


In [12]:
article.keywords

['review',
 'forspoken',
 'clear',
 'player',
 'gpu',
 'rtx',
 'game',
 'free',
 'minimum',
 'nvidia',
 'course',
 'forgotten',
 'pc',
 'frey']

# Test several articles

In [13]:
urls = [
    'https://www.pcinvasion.com/forspoken-pc-review/',
    'https://www.somosxbox.com/analisis-de-forspoken-pc/981286',
    'https://www.gamestar.de/artikel/forspoken-test-review,3389075.html',
    'https://www.pcgamer.com/forspoken-review/',
    'https://xboxera.com/2023/01/25/review-forspoken/',
    'https://www.hookedgamers.com/pc/forspoken/review/article-2402.html'
]

In [14]:
def parse_article(url: str) -> None:
    """Download and parse one article, then print its main fields for inspection."""
    article = Article(url, keep_article_html=True)
    article.download()
    article.parse()
    
    print(f"url:\t\t {url}")
    print(f"title:\t\t {article.title}")
    print(f"authors:\t {article.authors}")
    print(f"language:\t {article.meta_lang}")
    print(f"publish_date:\t {article.publish_date}")
    print(f"text:\t\t {article.text[:200]}")
    print(f"article_html:\t {article.article_html[:200]}")
    print("---------------------------------------------------------------------------------------------------------\n")

In [15]:
for url in urls:
    parse_article(url)

url:		 https://www.pcinvasion.com/forspoken-pc-review/
title:		 Forspoken PC review — Magic and mayhem
authors:	 ['Jason Rodriguez', 'Jason Rodriguez Is A Guides Writer. Most Of His Work Can Be Found On Pc Invasion', 'Around', "Published Articles . He'S Also Written For Ign", 'Gamespot', 'Polygon', 'Techraptor', 'Gameskinny', "More. He'S Also One Of Only Five Games Journalists The Philippines. Just Kidding. There Are Definitely More Around", "But He Doesn'T Know Anyone. Mabuhay"]
language:	 en
publish_date:	 2023-01-31 17:00:36+00:00
text:		 Forspoken is an open-world role-playing game (RPG) from Luminous Productions and Square Enix. I’ve been looking forward to this for quite a while now, as open-world adventures tend to be one of my fav
article_html:	 <div><p>Forspoken is an <a href="https://store.steampowered.com/app/1680880/Forspoken/" target="_blank" rel="noopener">open-world role-playing game (RPG)</a> from Luminous Productions and Square Enix.
-----------------------------------
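
`parse_article` only prints its findings. If the fields need to be compared side by side later, they could be collected as records instead. The sketch below is one way to do that, not part of the original notebook: `article_to_record` and `FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for a parsed `Article` so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any, Dict

# The same attributes that parse_article prints.
FIELDS = ("title", "authors", "meta_lang", "publish_date", "text", "article_html")


def article_to_record(url: str, article: Any) -> Dict[str, Any]:
    """Copy the inspected fields of a parsed article into a plain dict.

    Hypothetical helper: missing attributes become None instead of raising.
    """
    record: Dict[str, Any] = {"url": url}
    for field in FIELDS:
        record[field] = getattr(article, field, None)
    return record


# Stand-in for a parsed Article (assumption: a real one exposes these
# attribute names, as used in parse_article above).
fake_article = SimpleNamespace(
    title="Forspoken PC review",
    authors=[],
    meta_lang="en",
    publish_date=None,
    text="...",
    article_html="<div>...</div>",
)
record = article_to_record("https://example.com/review", fake_article)
print(record["title"], record["meta_lang"])
```

A list of such records can be passed to `pd.DataFrame(...)` to get one row per review.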

# Verdict

- `title`: mostly fine (it comes from the `<title>` tag but is not copied verbatim, e.g. the " | Hooked Gamers" suffix is dropped)
- `authors`: unreliable (the pcinvasion result mixes the author's name with fragments of the bio), but we don't really need it
- `language`: works fine, extracted from the html tag
- `publish_date`: mostly fine, but it was not found for the last URL in the list
- `text`: sometimes in the wrong order, e.g.:
  - conclusion at the beginning (pcgamer review, although Instapaper does not display it correctly either)
  - headers before the body text (hookedgamers review; the correct order is still present in `article.html`, so we can fix it if needed. It is not yet clear whether the order matters)
  - it would also be nice to distinguish headers from body text; this is not possible with `text` alone, but the information is in `article.html`, so we can add it
- NLP (`summary`, `keywords`): we should ignore it

✅ Overall, I think we can use it for now (maybe fork the repo and make some changes).
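
The header/text distinction mentioned in the verdict could be recovered from the extracted HTML. The sketch below is one possible approach using only the standard library `html.parser`, not anything built into Newspaper3k: `BlockSplitter` and `split_blocks` are hypothetical names, and `SAMPLE` is a shortened stand-in modelled on the `article.article_html` output shown earlier. It walks the markup in document order, so headers stay next to the text they introduce, and inline tags such as `<i>` are kept as part of the sentence.

```python
from html.parser import HTMLParser
from typing import List, Tuple

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BREAK_TAGS = {"br", "p", "div"}  # tags treated as block boundaries


class BlockSplitter(HTMLParser):
    """Collect ('header' | 'text', content) blocks in document order."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: List[Tuple[str, str]] = []
        self._buffer: List[str] = []
        self._in_header = False

    def _flush(self) -> None:
        # Join buffered pieces and collapse whitespace (including \r\n).
        content = " ".join("".join(self._buffer).split())
        if content:
            kind = "header" if self._in_header else "text"
            self.blocks.append((kind, content))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._flush()
            self._in_header = True
        elif tag in BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS:
            self._flush()
            self._in_header = False
        elif tag in BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def split_blocks(article_html: str) -> List[Tuple[str, str]]:
    parser = BlockSplitter()
    parser.feed(article_html)
    parser.close()
    return parser.blocks


# Shortened stand-in input, not the full article.
SAMPLE = (
    "<div><h2>Aims low, lands lower.</h2><br>\r\n"
    "Under the mire of <i>Forspoken</i>'s many flaws.<br>\r\n<br>\r\n"
    "<h2>Final Thoughts:</h2><br>\r\nSecond paragraph placeholder.</div>"
)

for kind, content in split_blocks(SAMPLE):
    print(f"{kind}:\t{content}")
```

For a real article this would be called as `split_blocks(article.article_html)`, which requires `keep_article_html=True` as in the cells above. Which tags count as block boundaries is a guess based on the hookedgamers markup and would likely need tuning per site.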

# Languages

In [16]:
newspaper.languages()


Your available languages are:

input code		full name
  ar			  Arabic
  be			  Belarusian
  bg			  Bulgarian
  da			  Danish
  de			  German
  el			  Greek
  en			  English
  es			  Spanish
  et			  Estonian
  fa			  Persian
  fi			  Finnish
  fr			  French
  he			  Hebrew
  hi			  Hindi
  hr			  Croatian
  hu			  Hungarian
  id			  Indonesian
  it			  Italian
  ja			  Japanese
  ko			  Korean
  mk			  Macedonian
  nb			  Norwegian (Bokmål)
  nl			  Dutch
  no			  Norwegian
  pl			  Polish
  pt			  Portuguese
  ro			  Romanian
  ru			  Russian
  sl			  Slovenian
  sr			  Serbian
  sv			  Swedish
  sw			  Swahili
  tr			  Turkish
  uk			  Ukrainian
  vi			  Vietnamese
  zh			  Chinese



# Building a news source

❌ This doesn't help us: `newspaper.build` finds no articles on the Metacritic game page (size 0), only category URLs.

In [17]:
url_game = 'https://www.metacritic.com/game/pc/forspoken'
metacritic_game = newspaper.build(url_game)

metacritic_game.size()

0

In [18]:
for article in metacritic_game.articles:
    print(article.url)

In [19]:
for category in metacritic_game.category_urls():
    print(category)

https://www.metacritic.com/game/pc/forspoken
https://www.metacritic.com
https://www.metacritic.com/game
https://www.metacritic.com/tv
https://www.metacritic.com/features
https://www.metacritic.com/feature
https://www.metacritic.com/movie
https://www.metacritic.com/music
