In [1]:
import requests
from bs4 import BeautifulSoup
import re

The `requests` library is the go-to and perferred means for making HTTP requests in python. It is the most straight forward and low effort means to retrieve information from the web programmatically. It abstracts the complexities of making requests behind a simple API so that you can focus on interacting with services and consuming data in your application.

In [3]:
response = requests.get("https://www.pophorror.com/category/news/page/3/")

<h2> Status Codes</h2>
The first bit of information that you can gather from `Response` is the `status code`:
<b>A status code informs you of the status of the request.</b>

A respond of <u>200 OK</u> indicates that your request was successful</u>, whereas a <u>404 NOT FOUND</u> status means that the resource you were looking for was not found. There are many other possible status codes as well to give you specific insights into what happened with your request.

By accessing .status_code, you can see the status code that the server returned:

Sometimes, you might want to use this information to make decisions in your code:

if response.status_code == 200:<br>
 &nbsp;&nbsp; print('Success')<br>
elif response.status_code == 404:<br>
&nbsp;&nbsp;  print('Not Found.')<br>
elif reponse.status_code == 406:<br>
 &nbsp;&nbsp; print("Did Not Specify \'content-type\'")<br>
<br>

In [4]:
page.status_code

406

 .status_code returned a 406, which means your request was unsuccessful and is due to the fact the host requires content-type and what to accept as a return to be specified in the header

## Specifying Headers and Header parameters

In [5]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'accept': '*/*',
};

page = requests.get("https://www.pophorror.com/category/news/page/3/",headers=headers)

In [6]:
page.status_code

200

 .status_code returned a 200, which means your request was successful and the server responded with the data you were requesting.

# Content
The response of a `GET` request often has some valuable information, known as a payload, in the message body. Using the attributes and methods ofResponse, you can view the payload in a variety of different formats.

To see the response’s content in bytes, you use .content:



In [7]:
page.content

b'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#">\n<head>\n<meta charset="UTF-8" />\n<link rel="profile" href="https://gmpg.org/xfn/11" />\n<link rel="pingback" href="https://www.pophorror.com/xmlrpc.php" />\n<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n\n\t<!-- This site is optimized with the Yoast SEO plugin v17.9 - https://yoast.com/wordpress/plugins/seo/ -->\n\t<title>News Archives - Page 3 of 636 - PopHorror</title>\n\t<link rel="canonical" href="https://www.pophorror.com/category/news/page/3/" />\n\t<link rel="prev" href="https://www.pophorror.com/category/news/page/2/" />\n\t<link rel="next" href="https://www.pophorror.com/category/news/page/4/" />\n\t<meta property="og:locale" content="en_US" />\n\t<meta property="og:type" content="article" />\n\t<meta property="og:title" content="News Archives - Page 3 of 636 - PopHorror" />\n\t<meta property="og:url" content="https://www.pophorror.com/ca

<b>NOTE: While .content gives you access to the raw bytes of the response payload, you will often want to convert them into a string using a character encoding such as UTF-8. response will do that for you when you access .text:</b>



In [8]:
page_text = page.text

In [9]:
print(page_text)

<!DOCTYPE html>
<html lang="en-US" prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8" />
<link rel="profile" href="https://gmpg.org/xfn/11" />
<link rel="pingback" href="https://www.pophorror.com/xmlrpc.php" />
<meta name='robots' content='index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1' />

	<!-- This site is optimized with the Yoast SEO plugin v17.9 - https://yoast.com/wordpress/plugins/seo/ -->
	<title>News Archives - Page 3 of 636 - PopHorror</title>
	<link rel="canonical" href="https://www.pophorror.com/category/news/page/3/" />
	<link rel="prev" href="https://www.pophorror.com/category/news/page/2/" />
	<link rel="next" href="https://www.pophorror.com/category/news/page/4/" />
	<meta property="og:locale" content="en_US" />
	<meta property="og:type" content="article" />
	<meta property="og:title" content="News Archives - Page 3 of 636 - PopHorror" />
	<meta property="og:url" content="https://www.pophorror.com/category/news/" />
	<meta propert

Because the decoding of bytes to a str requires an encoding scheme, requests will try to guess the encoding based on the response’s headers if you do not specify one. You can provide an explicit encoding by setting .encoding before accessing .text:

In [36]:
page.encoding = 'utf-8' # Optional: requests infers this internally
page.text

'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'

# Using BeautifulSoup with HTML Response

In [11]:
soup = BeautifulSoup(page_text,'html.parser')

In [12]:
type(soup)

bs4.BeautifulSoup

In [13]:
help(soup)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n

In [14]:
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'DEFAULT_INTERESTING_STRING_TYPES',
 'ROOT_TAG_NAME',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_check_markup_is_url',
 '_decode_markup',
 '_feed',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_most_recent_element',
 '_namespaces',
 '_popToTag',
 '_should_pretty_print',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'cdata_list_attributes',
 'childGener

In [20]:
returned = soup.find_all('h2', class_="post-box-title")

In [23]:
returned[0]

<h2 class="post-box-title">
<a href="https://www.pophorror.com/slapface-review-2022/">SLAPFACE (2022): A Dark, Brooding, Emotional Monster Tale – Movie Review</a>
</h2>

In [28]:
urls=[]
for i in returned:
    url = re.search(r"href=\"(.*?)\""," ".join(str(i).split("\n")),re.DOTALL)
    urls.append(url[1])

In [29]:
for i in urls:
    print(i)

https://www.pophorror.com/slapface-review-2022/
https://www.pophorror.com/even-20-years-later-the-genre-bending-brotherhood-of-the-wolf-2002-still-excites-retro-review/
https://www.pophorror.com/linda-miller-stars-in-upcoming-documentary-the-legend-of-king-kong/
https://www.pophorror.com/new-film-announced-from-makers-of-the-great-buddha-arrival-ghost-cat-rhapsody/
https://www.pophorror.com/renegade-film-festival-2022-official-selections/
https://www.pophorror.com/full-lineup-for-final-girls-berlin-film-festival-2022/
https://www.pophorror.com/martin-strange-hansens-on-my-mind-shortlisted-for-an-oscar/
https://www.pophorror.com/coming-soon-to-theaters-dracula-the-original-living-vampire/
https://www.pophorror.com/coming-soon-to-digital-blake-ridders-help/
https://www.pophorror.com/whaddayagot-productions-to-release-cult-classic-short-film-astral-plane-drifter/


In [30]:
page2= " ".join(requests.get("https://www.pophorror.com/tailgate-coming-to-dvd/",headers=headers).text.split("\n"))

In [31]:
soup = BeautifulSoup(page2,'html.parser')

In [32]:
content = soup.find('div',class_="content")

In [33]:
content = str(content) 

In [34]:
print(content)

<div class="content"> <nav id="crumbs"><a href="https://www.pophorror.com/"><span aria-hidden="true" class="fa fa-home"></span> Home</a><span class="delimiter">-</span><a href="https://www.pophorror.com/category/news/">News</a><span class="delimiter">-</span><span class="current">Lodewijk Crijns’ Thriller ‘Tailgate’ (2020) Coming To DVD</span></nav> <div class="e3lan e3lan-post"> <script async="" src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <!-- PopHorror ADS1 --> <ins class="adsbygoogle" data-ad-client="ca-pub-3789268088863536" data-ad-slot="3645557203" style="display:inline-block;width:320px;height:100px"></ins> <script> (adsbygoogle = window.adsbygoogle || []).push({}); </script> </div> <article class="post-listing post-110316 post type-post status-publish format-standard has-post-thumbnail category-news tag-dvd tag-frightfest tag-horror tag-jeroen-spitzenberger tag-liz-vergeer tag-lodewijk-crijns tag-road-rage tag-sitges-film-festival tag-tailgate tag-

In [35]:
soup = BeautifulSoup(content,'html.parser')

In [36]:
paras = soup.find_all('p')

In [37]:
empty_str=""
for i in paras:
    empty_str += i.text + " "

In [38]:
print(empty_str)

 Kenn Hoekstra  July 17, 2021  Lodewijk Crijns’ new film, Tailgate, is heading to DVD!  The road rage thriller, also known as Bumperkleef, tells the story of a road-raging family man who finds himself terrorized by a vengeful van driver he chooses to tailgate. Check out the trailer below, then read on for the details!  Hans, his wife, and two young children hit the highway on a trip to visit family. After getting stuck behind a slow-moving van, he recklessly starts to antagonize the eerily stoic driver, blaring the horn and riding his bumper. Little does he realize that he’s just crossed the wrong motorist – a deranged madman who sets out to teach Hans a lesson he’ll not soon forget. Lured into an alarming game of vehicular cat and mouse, a simple family road trip turns into a deadly obstacle course in this nerve-wracking, pulse-pounding thriller, an Official Selection at the Sitges Film Festival and FrightFest. The film stars Jeroen Spitzenberger (Suskind), Anniek Pheifer (Taart), Roo

In [39]:
title= re.search(r"<title.*?>(.*?)<",page2,re.DOTALL)

In [40]:
import html
title = re.search(r"<title.*?>(.*?)<",page_text,re.DOTALL)
html.unescape(title[1])

'News Archives - Page 3 of 636 - PopHorror'

In [41]:
full_dict={}
counter = 0
pages=0
for i in range(1,11):
    url = "https://www.pophorror.com/category/news/page/"+str(i)
    page_text=" ".join(requests.get(url).text.split("\n"))
    soup = BeautifulSoup(page_text,'html.parser')
    segments = soup.find_all('article', class_="item-list")
    urls=[]
    for j in segments:
        url2 = re.search(r"href=\"(.*?)\""," ".join(str(j).split("\n")),re.DOTALL)
        urls.append(url2[1])
    for l in urls:
        innerpage= " ".join(requests.get(l).text.split("\n"))
        title = html.unescape(re.search(r"<title.*?>(.*?)<",innerpage,re.DOTALL)[1])
        soup2 = BeautifulSoup(innerpage,'html.parser')
        content = str(soup2.find('div',class_="content"))
        soup2 = BeautifulSoup(content,'html.parser')
        paras = soup.find_all('p')
        empty_str=""
        for k in paras:
            empty_str += k.text + " "
        full_dict[l] = {"content":empty_str,"title":title}
        counter+=1
        print(counter)
    pages+=1
    print("\n\n",pages,"\n\n")



 1 




 2 




 3 




 4 




 5 




 6 




 7 




 8 




 9 




 10 




{}

In [42]:
full_dict['https://www.pophorror.com/movie-review-the-sleepless-unrest-the-real-conjuring-home-2021/']

KeyError: 'https://www.pophorror.com/movie-review-the-sleepless-unrest-the-real-conjuring-home-2021/'

In [None]:
full_dict['https://www.pophorror.com/movie-review-the-sleepless-unrest-the-real-conjuring-home-2021/']['title']

In [25]:
len(full_dict)

0