# Collecting critic reviews of _The Host_ from Nexis Uni

In this short notebook, I collect the text of critic reviews from Nexis Uni (LexisNexis). The search query I used to gather these reviews was: **title('the host') and movie and review and bong**

I used `title` to find news articles, magazine articles, and blog posts that are primarily about 'the host'; `movie` so that the results would be about films; `review` so that the results would be actual reviews rather than summaries or listicles of top movies; and `bong` so that the results would be about the correct movie (directed by Bong Joon Ho).

This query returned 27 results, far fewer than I got when I ran the same query with the title replaced by 'parasite'. This suggests, though it does not prove, that _Parasite_ was more widely discussed and better known in mainstream media than _The Host_.

Once I have collected, cleaned, and organized the raw text from the Nexis Uni search results, I will carry out textual analysis (frequency and sentiment analysis) on the data in a separate notebook in this folder, titled `analysis_critic_nexis`.

In [28]:
# Setting up my notebook:
from collections import Counter
import string
import os
import json

In [29]:
%run functions.ipynb

In [30]:
chars_to_strip = ',.!?;""'

In [31]:
# Opening the raw text file of documents downloaded from Nexis Uni
# (the with-block closes the file once it has been read):
with open('../data/critic_reviews/nexis_host_critic.txt') as infile:
    raw_text = infile.read()

In [32]:
# How many characters in our corpus:
len(raw_text)

141365

In [33]:
# Get a sneak peek of the text data:
print(raw_text[:5000])

Movie Review: The Host
Blogcritics.org Video
March 8, 2007 Thursday 8:18 PM EST

 Copyright 2007 Newstex LLC
All Rights Reserved
Newstex Web Blogs
Copyright 2007 Blogcritics.org Video
Length: 1073 words
Byline: Steve Carlson
Body




  Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- 
 If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.

Still with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidable talents to bring respect to

## Processing the text:

Now I will work on processing the text! My goal is to clean each document and organize the data so that I can save a new JSON file to use for frequency and sentiment analysis.

The first step is to split the raw text into individual documents:

In [34]:
# Splitting the docs:
docs = raw_text.split('End of Document')

In [35]:
len(docs)

28

I should've downloaded 27 documents, so let's see why we have one extra:

In [36]:
# First item in the list of docs:
docs[0]

'Movie Review: The Host\nBlogcritics.org Video\nMarch 8, 2007 Thursday 8:18 PM EST\n\n\u2028Copyright 2007 Newstex LLC\nAll Rights Reserved\nNewstex Web Blogs\nCopyright 2007 Blogcritics.org Video\nLength:\xa01073 words\nByline:\xa0Steve Carlson\nBody\n\n\n\n\n  Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- \n If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.\n\nStill with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidab

In [37]:
# Last item in the list:
docs[-1]

'\n'

Aha! The last item is not a real document: it is just the newline left over after the final `End of Document` marker, so we'll need to remove it from our list:

In [38]:
docs = docs[:-1]

In [39]:
len(docs)

27
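
The extra item comes from how `str.split` treats a delimiter at the very end of a string: whatever follows the last delimiter becomes one more piece, even if it is only whitespace. Here is a minimal sketch on a made-up string (`sample` and `docs_sample` are hypothetical, not part of the corpus) that shows this behaviour and one way to drop empty pieces without assuming they always come last:

```python
# Hypothetical stand-in for the Nexis Uni export: two documents,
# each followed by the 'End of Document' delimiter
sample = 'first review\nEnd of Document\nsecond review\nEnd of Document\n'

pieces = sample.split('End of Document')
print(len(pieces))        # 3: two documents plus the leftover after the final delimiter
print(repr(pieces[-1]))   # '\n'

# Keep only the pieces that contain something other than whitespace
docs_sample = [piece for piece in pieces if piece.strip()]
print(len(docs_sample))   # 2
```

Filtering on `piece.strip()` gives the same result as `docs[:-1]` here, but it would also cope with an export that has blank pieces elsewhere.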

In [40]:
docs[-1]

"\n\n\nThe Host\nVideo Business\nJune 11, 2007\n\n\u2028Copyright 2007 Reed Business Information, US, a division of Reed Elsevier Inc. All Rights Reserved\n\u2028\nSection:\xa0TIPSHEET; Horror; Pg. 6\nLength:\xa0294 words\nByline:\xa0By Irv Slifkin\nBody\n\n\n\n\nMAGNOLIA\nStreet: July 24\nPrebook: June 26\n> Unusual foreign mix of monster movie and family dramedy.\nThe most successful and expensive Korean film ever produced received terrific reviews and brisk arthouse business in the U.S. And it's easy to see why, for this is not only a dazzling technical achievement but a surprisingly complex affair spiked with political satire and pathos that also happens to be quite funny. Think Jaws  meets Little Miss Sunshine , and you get an idea of what to expect from a movie wherein a teenage girl is wrested away from her family snack shop near the Han River by a gigantic amphibious creature. The girl's fractured family-which includes a slacker father and an aunt who happens to be an archery e

There we go! We have removed the last item, so our `docs` list now matches the number of documents I downloaded. Before analyzing these critic reviews of _The Host_, I am going to write each document in the list to a separate file:

In [20]:
# check to see if the folder to hold the files exists and create if it doesn't
if not os.path.exists('../data/critic_reviews/sep_docs'):
    os.makedirs('../data/critic_reviews/sep_docs')
    
    
# loop over the list of documents and write each to a separate text file
for idx, doc in enumerate(docs, 1):
    
    filepath = '../data/critic_reviews/sep_docs/nexis_host_critic{:0>3}.txt'.format(idx)
    print('Creating', filepath)
    
    with open(filepath, 'w') as out:
        out.write(doc)

Creating ../data/critic_reviews/sep_docs/nexis_host_critic001.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic002.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic003.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic004.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic005.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic006.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic007.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic008.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic009.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic010.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic011.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic012.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic013.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic014.txt
Creating ../data/critic_reviews/sep_docs/nexis_host_critic015.txt
Creating .
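
The same step could also be written with `pathlib`. This is only a sketch under stated assumptions: it writes two made-up documents (`sample_docs`) to a temporary folder instead of `../data/critic_reviews/sep_docs`, so it can be run without touching the real data.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-ins: a temporary folder and two short documents
out_dir = Path(tempfile.mkdtemp()) / 'sep_docs'
sample_docs = ['first review', 'second review']

# exist_ok=True makes the call safe whether or not the folder already exists
out_dir.mkdir(parents=True, exist_ok=True)

written = []
for idx, doc in enumerate(sample_docs, 1):
    filepath = out_dir / f'nexis_host_critic{idx:03d}.txt'
    # An explicit encoding avoids depending on the platform's default encoding
    filepath.write_text(doc, encoding='utf-8')
    written.append(filepath.name)

print(written)
```

The `{idx:03d}` format pads the counter to three digits, just like `{:0>3}` in the cell above, so the files sort in document order.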

In [41]:
len(docs)

27

In [42]:
docs[0]

'Movie Review: The Host\nBlogcritics.org Video\nMarch 8, 2007 Thursday 8:18 PM EST\n\n\u2028Copyright 2007 Newstex LLC\nAll Rights Reserved\nNewstex Web Blogs\nCopyright 2007 Blogcritics.org Video\nLength:\xa01073 words\nByline:\xa0Steve Carlson\nBody\n\n\n\n\n  Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- \n If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.\n\nStill with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidab

Now we have a separate text file for each document in the export!

The next step in processing our data is to extract only the body text from each document:

In [14]:
# Looking at the doc structure to see how we can extract body text:
print(docs[0])

Movie Review: The Host
Blogcritics.org Video
March 8, 2007 Thursday 8:18 PM EST

 Copyright 2007 Newstex LLC
All Rights Reserved
Newstex Web Blogs
Copyright 2007 Blogcritics.org Video
Length: 1073 words
Byline: Steve Carlson
Body




  Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- 
 If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.

Still with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidable talents to bring respect to

It looks like each document's body text starts right after the word `Body`, which closes the header at the beginning of the document, and ends just before `Load-Date:` (further down than the excerpt above shows). Knowing this, we can extract the body text by slicing between those two positions:

In [17]:
# Index where the 'Body' marker starts:
docs[0].index('Body')

225

In [18]:
# Index where the 'Load-Date:' marker starts (the body ends just before it):
docs[0].index('Load-Date:')

8080
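
`str.index` returns the position where a marker *begins*, so the slice has to start a few characters later: `'Body'` is 4 characters long, which is why the body of the first document starts at 225 + 4 = 229. A minimal sketch on a made-up miniature document (`toy` is hypothetical and only mimics the layout of the export):

```python
# Hypothetical miniature document that follows the same layout as the export
toy = 'Title\nByline: Someone\nBody\n\nThe review text.\n\nLoad-Date: March 8, 2007\n'

marker = 'Body'
start = toy.index(marker) + len(marker)   # first character after the marker
end = toy.index('Load-Date:')             # first character of the closing marker

print(repr(toy[start:end]))     # '\n\nThe review text.\n\n'
print(toy[start:end].strip())   # The review text.
```

Because a slice excludes its end index, `toy[start:end]` stops right before `Load-Date:`, and `.strip()` removes the blank lines on either side.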

In [19]:
# This is the slice we want (225 + 4 skips past the word 'Body'), but the indices are hard-coded.
# We don't want to look them up by hand for every document, so next we compute them in code:
docs[0][229:8080]

'\n\n\n\n\n  Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- \n If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.\n\nStill with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidable talents to bring respect to a disreputable genre. And thanks to his sure hand, somehow it all comes together.\n\nThe genesis of this giant mutant fish-monster comes in the year 2000 prologue, when an American government official, over the objectio

Rather than finding these indices by hand for each of the 27 documents, we can compute them in code.

First, extracting the body text from a single document:

In [20]:
# Extracting body text:

body_start = docs[0].index('Body')+4
body_end = docs[0].index('Load-Date:')

body_text = docs[0][body_start:body_end].strip()

In [21]:
print(body_text)

Mar. 8, 2007 (Blogcritics.org delivered by Newstex) -- 
 If there.s one film that will plant South Korean cinema into the mind of the American public, The Host is it. Bong Joon-ho.s lively feature provides all the thrills and sensation of the average American summer spectacle, except that it does so while still remaining a good movie. It.s got heart, hilarity, triumph and tragedy as it gives us a fractured family that finally sets aside their differences and unites towards a common goal. It.s also got a giant mutant fish-monster that tries to eat everything in its path.

Still with me? That.s the kind of film that The Host is - able to shift tones on a moment.s notice (often within the same scene), Bong uses all his formidable talents to bring respect to a disreputable genre. And thanks to his sure hand, somehow it all comes together.

The genesis of this giant mutant fish-monster comes in the year 2000 prologue, when an American government official, over the objections of his Korean s

In [22]:
# Extracting the body text of all docs with a for-loop:

body_text_only = []

for doc in docs:
    body_start = doc.index('Body')+4
    body_end = doc.index('Load-Date:')
    
    body_text = doc[body_start:body_end].strip()
    body_text_only.append(body_text)
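
The loop above works for this corpus, but `doc.index` raises a `ValueError` if a marker is missing, and it matches the first `Body` anywhere in the document, including in a headline. The following is a slightly more defensive sketch, not the method used in this notebook: `extract_body` is a hypothetical helper, and it assumes `Body` always sits on its own line, as it does in the documents printed above.

```python
def extract_body(doc, start_marker='\nBody\n', end_marker='Load-Date:'):
    """Return the text between the two markers, or None if either is missing."""
    start = doc.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Search for the closing marker only after the body begins
    end = doc.find(end_marker, start)
    if end == -1:
        return None
    return doc[start:end].strip()


# Hypothetical documents: one complete, one missing its 'Load-Date:' line
good = 'Body Snatchers Review\nByline: Someone\nBody\n\nA fine film.\n\nLoad-Date: June 11, 2007\n'
bad = 'Some Title\nBody\n\nText without an end marker.\n'

print(extract_body(good))   # A fine film.
print(extract_body(bad))    # None
```

Returning `None` instead of raising lets the caller decide whether to skip a malformed document or inspect it by hand.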

In [23]:
body_text_only[10]

"Last year saw the UK release of Gojira (Godzilla) in its original 1954 Japanese cut, revealing a subtextual depth and philosophical melancholy absent from the bastardised US version of this seminal creature-feature. The ghost of Gojira hangs heavy over The Host , in which South Korean wunderkind Bong Joon-ho unleashes a rampaging monster from the polluted waters of the River Han in the capital, Seoul.\nLike its thematic predecessor, The Host was inspired by a real-life news story - the case of Albert McFarland, a US forces mortuary attendant in Korea, who was reported to have dumped toxic waste down a drain leading to the Han in July 2000. In Bong's typically off-kilter nightmare, a mutant is duly spawned which attacks local people, providing much eye-popping, monster-munching fun, alongside the usual baffling blend of tragicomic human interest.\nAccording to Bong, the inspiration for The Host ranged from a childhood fascination with the Loch Ness monster to an admiration for M Night 

In [24]:
body_text_only[-1]

"MAGNOLIA\nStreet: July 24\nPrebook: June 26\n> Unusual foreign mix of monster movie and family dramedy.\nThe most successful and expensive Korean film ever produced received terrific reviews and brisk arthouse business in the U.S. And it's easy to see why, for this is not only a dazzling technical achievement but a surprisingly complex affair spiked with political satire and pathos that also happens to be quite funny. Think Jaws  meets Little Miss Sunshine , and you get an idea of what to expect from a movie wherein a teenage girl is wrested away from her family snack shop near the Han River by a gigantic amphibious creature. The girl's fractured family-which includes a slacker father and an aunt who happens to be an archery expert-has to put aside their differences to track her down before it's too late. At the same time, the government claims the beast is the product of a mysterious virus and begins putting citizens into quarantine, which puts a big crimp in the clan's rescue plans.

In [25]:
len(body_text_only)

27

In [48]:
# Making a list of dictionaries for critic reviews of The Host:
critic_reviews = []

for rev in body_text_only:
    rev_dict = {
        'text' : rev
    }
    critic_reviews.append(rev_dict)

In [50]:
critic_reviews[1]

{'text': 'Mar. 9, 2007 (Blogcritics.org delivered by Newstex) -- \n A nifty little monster movie with post-modernist touches that both add and detract from its effectiveness, writer-director Bong Joon-ho\'s The Host gets right to the good stuff. After a quick introduction to Gang-du, who works at his family\'s food stand (sort of a mini 7-Eleven), and his spunky young daughter Hyun-seo, the movie shifts immediately to a strange sight nearby, drawing a crowd to the bank of Seoul\'s Han River: something is hanging off a bridge right in the middle of its span. \n Suddenly, it drops into the water, swimming, and the excited crowd watches its approach. They start to throw food -- and cans of beer -- at the shape in the river. But when that shadow comes to the surface, the playful tone shifts, and the film quickly gets scary as hell: the shape is not the least bit friendly, and it immediately starts chasing, and eating, humans.\n\nThe Korean title translates as Creature, and indeed the creat
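
Each dictionary holds only the review text for now, but the same structure leaves room for header fields such as the headline or byline. As a sketch only: the keys `title`, `byline`, and `length` and the helper `header_field` are hypothetical, and the sample `doc` copies the header layout of the first document with a shortened body.

```python
import re

# Header copied from the layout of the first document shown earlier (body shortened)
doc = ('Movie Review: The Host\nBlogcritics.org Video\n'
       'March 8, 2007 Thursday 8:18 PM EST\n\n'
       'Length:\xa01073 words\nByline:\xa0Steve Carlson\nBody\n\nReview text.\n')

def header_field(doc, label):
    """Return the value after 'label:' on its line, or None if absent."""
    # The export puts a non-breaking space (\xa0) after the colon, so allow it here
    match = re.search(rf'^{re.escape(label)}:[ \t\xa0]*(.+)$', doc, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

rev_dict = {
    'title': doc.lstrip().split('\n', 1)[0],   # first non-empty line of the document
    'byline': header_field(doc, 'Byline'),
    'length': header_field(doc, 'Length'),
    'text': 'Review text.',
}
print(rev_dict)
```

Keeping the metadata next to the text would make it possible to group or filter reviews by source or author during the later analysis.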

In [51]:
# Saving this out as a json file:
with open('../data/critic_reviews/nexis_host_critic.json','w', encoding='UTF-8') as out:
    out.write(json.dumps(critic_reviews))
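
A quick way to confirm that a file like this was written correctly is to load it back and compare it with the original list. A minimal sketch, using a made-up one-item list (`sample_reviews`) and a temporary file rather than the real output path:

```python
import json
import os
import tempfile

# Hypothetical stand-in for the critic_reviews list built above
sample_reviews = [{'text': 'It\u2019s a monster movie with heart.'}]

path = os.path.join(tempfile.mkdtemp(), 'sample_reviews.json')

# json.dump writes straight to the file object
with open(path, 'w', encoding='UTF-8') as out:
    json.dump(sample_reviews, out)

# Read the file back and check that nothing changed in the round trip
with open(path, encoding='UTF-8') as infile:
    loaded = json.load(infile)

print(loaded == sample_reviews)   # True
```

`json.dump(obj, out)` is equivalent to `out.write(json.dumps(obj))`, so either form works for saving the reviews.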

## Conclusion

Now that I have finished cleaning, organizing, and extracting the text from these critic reviews, let's move on to doing the same for the reviews of _Parasite_! That happens in the next notebook, titled `nexis_parasite_critic`. See you there!