# Collecting a corpus of critic reviews of _Parasite_ from Nexis Uni

In this notebook, which follows the structure of the `nexis_host_critic` notebook, I clean and organize the text files of _Parasite_ critic reviews that I gathered by searching LexisNexis. The search query I used to gather these critic reviews was: **title('parasite') and movie and review and bong**

I used `title` to find news articles, magazine articles, and blog posts that are primarily about 'parasite'; `movie` so that the results would be about the film; `review` so that the results would be actual reviews rather than summaries or listicles of top movies; and `bong` so that the results would be about the correct movie (directed by Bong Joon Ho).

This search returned 268 results, far more than the 27 results I got for critic reviews of _The Host_ through Nexis Uni.

Once I collect, clean, and organize the raw text data from the Nexis Uni search results, I will carry out textual analysis (frequency, sentiment analysis) on the data in a separate notebook in this folder, titled `analysis_critic_nexis`. 

In [48]:
from collections import Counter
import string
import os
import json

In [49]:
%run functions.ipynb

In [50]:
chars_to_strip = '.,!?"";'

In [51]:
# Opening up the raw text files with downloaded documents from LexisNexis:
raw_text1 = open('../data/critic_reviews/nexis_parasite_critic1.txt').read()
raw_text2 = open('../data/critic_reviews/nexis_parasite_critic2.txt').read()
raw_text3 = open('../data/critic_reviews/nexis_parasite_critic3.txt').read()

In [52]:
# How many total characters in our raw text files:

len(raw_text1) + len(raw_text2) + len(raw_text3)

1633950

In [53]:
# Get a sneak peek of the text data:
print(raw_text1[:5000])

Oscars 2020: What Movies Has #8216;Parasite' Director Bong Joon-ho Made?
Newstex Blogs 
The Cheat Sheet
January 27, 2020 Monday 12:05 AM EST

 Copyright 2020 Newstex LLC All Rights Reserved
Length: 1178 words
Byline: Robert Yaniz Jr.
Body




Jan 27, 2020( The Cheat Sheet: http://www.cheatsheet.com/ Delivered by Newstex)  Parasite[1] is taking Hollywood by storm[2] this award season. In fact, the South Korean release is gearing up as a Best Picture frontrunner[3] heading into Oscars 2020. Director Bong Joon-ho's latest film defies genre to tell a complex story about two interconnected families and, by extension, delivers some crucial social commentary[4].With Parasite#8216;s success, we're sure moviegoers will be interested to see some of Bong's previous work. Prior to his Oscar-nominated hit[5], the director had made six films. Here's a quick breakdown of the features he has directed throughout his career. 
 Bong Joon-ho and the cast of #8216;Parasite' at the Screen ActorsGuild Awards

## Processing the text:

Now I will work on processing the text! Right now, the 3 text files hold the 261 documents I downloaded from Nexis Uni. I want to split the text of each file into its individual documents, then combine the docs from all 3 files into one list of reviews.

My goal here is to clean each document and organize the data so that I can save out a new JSON file to work with for frequency and sentiment analysis!

The first step is to pre-process the text by splitting it into individual documents:

In [54]:
# Splitting the docs:
docs1 = raw_text1.split('End of Document')
docs2 = raw_text2.split('End of Document')
docs3 = raw_text3.split('End of Document')

In [55]:
len(docs1) + len(docs2) + len(docs3)

264

In [56]:
docs1[-1]

'\n'

In [57]:
docs2[-1]

'\n'

In [58]:
docs3[-1]

'\n'

In [59]:
docs1 = docs1[:-1]
docs2 = docs2[:-1]
docs3 = docs3[:-1]
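
Each file ends with the `End of Document` marker, so `str.split` leaves one whitespace-only piece at the end of each list, which is why the last element is `'\n'` and gets sliced off above. As a minimal sketch (the helper name `split_docs` and the sample string are hypothetical, not part of this notebook), the split and the clean-up could be combined by dropping any piece that is empty once stripped:

```python
def split_docs(raw_text, marker='End of Document'):
    """Split a raw export into documents, dropping whitespace-only pieces."""
    # Hypothetical helper: keeps only pieces that contain some real text.
    return [piece for piece in raw_text.split(marker) if piece.strip()]


# Tiny made-up sample standing in for one downloaded file.
sample = 'First review\nEnd of Document\nSecond review\nEnd of Document\n'
sample_docs = split_docs(sample)
print(len(sample_docs))  # prints 2
```

This variant would also drop empty pieces in the middle of a file, which the `[:-1]` slice does not, so it is a slightly different technique rather than a drop-in equivalent.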

In [60]:
# Combining the three lists into one list:
docs = docs1 + docs2 + docs3

In [61]:
# Making sure I have 261 docs (264 split pieces minus the 3 trailing '\n' entries):
len(docs)

261

In [44]:
# Looking at the first doc in my list:
docs[0]

"Oscars 2020: What Movies Has #8216;Parasite' Director Bong Joon-ho Made?\nNewstex Blogs \nThe Cheat Sheet\nJanuary 27, 2020 Monday 12:05 AM EST\n\n\u2028Copyright 2020 Newstex LLC All Rights Reserved\nLength:\xa01178 words\nByline:\xa0Robert Yaniz Jr.\nBody\n\n\n\n\nJan 27, 2020( The Cheat Sheet: http://www.cheatsheet.com/ Delivered by Newstex)  Parasite[1] is taking Hollywood by storm[2] this award season. In fact, the South Korean release is gearing up as a Best Picture frontrunner[3] heading into Oscars 2020. Director Bong Joon-ho's latest film defies genre to tell a complex story about two interconnected families and, by extension, delivers some crucial social commentary[4].With Parasite#8216;s success, we're sure moviegoers will be interested to see some of Bong's previous work. Prior to his Oscar-nominated hit[5], the director had made six films. Here's a quick breakdown of the features he has directed throughout his career. \n Bong Joon-ho and the cast of #8216;Parasite' at the

We have successfully made a list of the documents I downloaded. Before analyzing these critic reviews of _Parasite_, I am going to write each document in the list to a separate file:

In [62]:
# Create the folder to hold the files if it doesn't already exist
out_dir = '../data/critic_reviews/sep_docs'
os.makedirs(out_dir, exist_ok=True)

# Loop over the list of documents and write each to a separate text file
for idx, doc in enumerate(docs, 1):

    filepath = '{}/nexis_parasite_critic{:0>3}.txt'.format(out_dir, idx)
    print('Creating', filepath)

    # Write as UTF-8 so non-ASCII characters in the reviews are saved consistently
    with open(filepath, 'w', encoding='utf-8') as out:
        out.write(doc)

Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic001.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic002.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic003.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic004.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic005.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic006.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic007.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic008.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic009.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic010.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic011.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic012.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic013.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic014.txt
Creating ../data/cri

Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic123.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic124.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic125.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic126.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic127.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic128.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic129.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic130.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic131.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic132.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic133.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic134.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic135.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic136.txt
Creating ../data/cri

Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic241.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic242.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic243.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic244.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic245.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic246.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic247.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic248.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic249.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic250.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic251.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic252.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic253.txt
Creating ../data/critic_reviews/sep_docs/nexis_parasite_critic254.txt
Creating ../data/cri
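
The `{:0>3}` format spec in the file path right-aligns the index in a field of width 3 and pads it with zeros, so the file names sort in numeric order. A small sketch of that behaviour, using arbitrary example indices:

```python
# Right-align each index in a 3-character field, padding with '0'.
template = 'nexis_parasite_critic{:0>3}.txt'
names = [template.format(idx) for idx in (1, 14, 261)]
print(names)
```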

In [63]:
len(docs)

261

In [64]:
# Looking at the doc structure to see how we can extract body text:
print(docs[0])

Oscars 2020: What Movies Has #8216;Parasite' Director Bong Joon-ho Made?
Newstex Blogs 
The Cheat Sheet
January 27, 2020 Monday 12:05 AM EST

 Copyright 2020 Newstex LLC All Rights Reserved
Length: 1178 words
Byline: Robert Yaniz Jr.
Body




Jan 27, 2020( The Cheat Sheet: http://www.cheatsheet.com/ Delivered by Newstex)  Parasite[1] is taking Hollywood by storm[2] this award season. In fact, the South Korean release is gearing up as a Best Picture frontrunner[3] heading into Oscars 2020. Director Bong Joon-ho's latest film defies genre to tell a complex story about two interconnected families and, by extension, delivers some crucial social commentary[4].With Parasite#8216;s success, we're sure moviegoers will be interested to see some of Bong's previous work. Prior to his Oscar-nominated hit[5], the director had made six films. Here's a quick breakdown of the features he has directed throughout his career. 
 Bong Joon-ho and the cast of #8216;Parasite' at the Screen ActorsGuild Awards

In [65]:
# Extracting body text: everything between the 'Body' heading and the 'Load-Date:' line

body_start = docs[0].index('Body')+4
body_end = docs[0].index('Load-Date:')

body_text = docs[0][body_start:body_end].strip()

In [66]:
body_text

"Jan 27, 2020( The Cheat Sheet: http://www.cheatsheet.com/ Delivered by Newstex)  Parasite[1] is taking Hollywood by storm[2] this award season. In fact, the South Korean release is gearing up as a Best Picture frontrunner[3] heading into Oscars 2020. Director Bong Joon-ho's latest film defies genre to tell a complex story about two interconnected families and, by extension, delivers some crucial social commentary[4].With Parasite#8216;s success, we're sure moviegoers will be interested to see some of Bong's previous work. Prior to his Oscar-nominated hit[5], the director had made six films. Here's a quick breakdown of the features he has directed throughout his career. \n Bong Joon-ho and the cast of #8216;Parasite' at the Screen ActorsGuild Awards | John Sciulli/Getty Images for Turner#8216;Barking Dogs Never Bite' (2000)Barking Dogs Never Bite (Trailer)[6]In his 2000 directorial debut, Bong already aimed to combine genres in an innovative way. This dark comedy-drama follows an unemp

In [67]:
body_text_only = []

for doc in docs:
    body_start = doc.index('Body')+4
    body_end = doc.index('Load-Date:')
    
    body_text = doc[body_start:body_end].strip()
    body_text_only.append(body_text)
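
The loop above relies on every document containing both the `Body` and `Load-Date:` markers, because `str.index` raises a `ValueError` when a marker is missing. It ran cleanly on this data, but as a hedged sketch for other exports, a more defensive variant could use `str.find` and return `None` instead. The helper name `extract_body` and the sample documents below are hypothetical, and unlike the loop above this version only looks for the end marker after the start marker:

```python
def extract_body(doc, start_marker='Body', end_marker='Load-Date:'):
    """Return the text between the two markers, or None if either is missing."""
    # Hypothetical helper: str.find returns -1 instead of raising ValueError.
    start = doc.find(start_marker)
    if start == -1:
        return None
    body_start = start + len(start_marker)
    end = doc.find(end_marker, body_start)
    if end == -1:
        return None
    return doc[body_start:end].strip()


# Made-up documents: one well-formed, one missing the closing marker.
samples = [
    'Title\nByline: A. Critic\nBody\n\nA sharp satire.\nLoad-Date: January 1, 2020\n',
    'Title\nBody\n\nTruncated export',
]
bodies = [extract_body(d) for d in samples]
print(bodies)  # prints ['A sharp satire.', None]
```

Documents that come back as `None` could then be inspected by hand rather than stopping the whole loop.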

In [69]:
body_text_only[10]

'ABSTRACT\nFor anyone not yet familiar with Bong, Parasite is ready and eager to infect your consciousness and soul, quickly turning you into an eager host for all things Bong\nFULL TEXT\nEvery week, the film industry offers a fresh slap to the face. Between the increasing dominance of mindless franchise product and the erosion of substantive films made for discerning audiences, it is difficult to feel hopeful about stepping into the dark of a cinema - as wondrous and transformative an experience as modern life affords.\nBut this past weekend, salvation reared its head in New York and Los Angeles, where a new film smashed box-office records. The unlikely culprit: Parasite, a small-scale South Korean drama with no special effects to speak of and zero sequel or spinoff potential to tempt skeptical audiences. Thanks to audience word-of-mouth and https://www.theglobeandmail.com/arts/film/tiff/article-tiff-2019-from-hustlers-to-joker-to-parasite-make-way-in-the/ culled       \xa0    https:/

In [70]:
# Making a list of dictionaries for critic reviews of Parasite:
critic_reviews = []

for rev in body_text_only:
    rev_dict = {
        'text' : rev
    }
    critic_reviews.append(rev_dict)

In [71]:
critic_reviews[1]

{'text': "How does one get rid of a parasite?\nBong Joon Ho sets the tone for the Oscar-nominated, Cannes Palme d'Or 2019-winner and Golden Globes 2020 Best Foreign Language Film-winner Parasite in the first scene itself. A family of four, living in a semi-basement level home, scouts the corners of their house to find WiFi signal. The neighbour must have changed the password, let's try 123456789 or in reverse. Still not working? Oops, no WhatsApp then.\nKim Ki-taek (played by Song Kang-ho) leads an unambitious family with wife Chung-sook (played by Chang Hyae-jin), daughter Kim Ki-jeong (played by Park So-dam) and son Kim Ki-woo (played by Choi Woo-shik). Their current source of income is folding pizza boxes for a pizza delivery joint, and it doesn't take someone solid knowledge in World Cinema to know that that money isn't going to be enough to feed four grown mouths. But, they have a plan, as Kim Ki-taek elaborates - it is to not have a plan at all. For then, the risk of anything goi

In [72]:
# Saving this out as a JSON file:
with open('../data/critic_reviews/nexis_parasite_critic.json', 'w', encoding='UTF-8') as out:
    json.dump(critic_reviews, out)
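
As a quick sanity check on the saved format, a list of dictionaries written as JSON should load back unchanged. The sketch below illustrates that round trip with made-up sample data and a temporary file, so it does not touch the real data folder; the variable and file names are hypothetical:

```python
import json
import os
import tempfile

# Made-up stand-in for the critic_reviews list built above.
sample_reviews = [{'text': 'First review'}, {'text': 'Second review'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_reviews.json')

    # Write the list of dictionaries out as JSON.
    with open(path, 'w', encoding='utf-8') as out:
        json.dump(sample_reviews, out)

    # Read it back in, as an analysis notebook would.
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded == sample_reviews)  # prints True
```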

## Conclusion

Awesome! In this notebook, I successfully cleaned, organized, and extracted the body text from the critic review documents of _Parasite_ that I downloaded from Nexis Uni. Having gone through the same process for critic reviews of _The Host_, I now have JSON files that contain all the critic reviews of each movie.

Now, I can move on to analyzing these reviews. This will occur in two notebooks. The first, titled `analysis_critic_nexis`, uses VADER for sentiment analysis and runs word and n-gram frequency analysis to find which words or phrases occur more often in either list of critic reviews. The second, titled `trying_TextBlob_SA`, tries a different tool, TextBlob, rather than VADER, for sentiment analysis. Please refer to those notebooks to see the analysis I carried out with these critic reviews.

Thank you!