### Justin Ehly
#### HW5 - NLP
#### Answer explanations and discussion are after the code for each section.


---
##### Exercise 
1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.   
- a.	It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
- b.	Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
- c.	Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
- d.	Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  
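
The robots.txt note in (a) can also be checked in code before crawling. Below is a minimal sketch using the standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration and are not IMDb's actual rules; `build_parser` and `is_allowed` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (NOT IMDb's real rules).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
review_ok = is_allowed(parser, "https://example.com/review/rw0000001/")
private_ok = is_allowed(parser, "https://example.com/private/page")
print(review_ok, private_ok)
```

To check a live site instead of sample text, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file.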


In [1]:
import numpy as np
import requests
import re
import nltk
from bs4 import BeautifulSoup




In [2]:

class scraper():
    '''
        Build web scraper to gather the top 50 comedies from imdb.com
        - gather the titles and individual comedy movie page urls
        - gather the main talent in each movie, mainly director(s) and main star(s)
        - gather the permalinks to <user selected> user comments from each movie's page
        - remove movies that don't have any comments
    '''
    
    def __init__(self, reviews):

        # set imdb header
        self.imdb_header = 'https://www.imdb.com'
        
        # set imdb reviews footer
        self.review_footer = 'reviews?ref_=tt_urv'        
        
        # number of movie reviews per movie
        self.movie_reviews = reviews
    

    # get a list of the 50 comedies listed on the top comedies page on imdb along with the link to each movie's individual page    
    def get_comedies(self):
        # download 50 comedy movie titles and links from imdb
        URL = 'https://www.imdb.com/search/title/?genres=comedy&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=E9DKJDX8CMCD24P485G8&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_1'
        response = requests.get(URL)
        soup = BeautifulSoup(response.text, 'lxml')

        # grab page contents with movie links and titles
        movies = soup.select('h3.lister-item-header')    

        # get links to each movie page
        movie_links = [link.attrs.get('href') for link in soup.select('h3.lister-item-header a')]
        movie_links = [self.imdb_header + link for link in movie_links]
        movie_links = [link + self.review_footer for link in movie_links]
        self.movie_links = movie_links
        
        # extract the titles
        comedy_titles = []
        for idx, movie in enumerate(movies, start=1):   # ranks on the page start at 1
            ms = movie.get_text()
            mov = (' '.join(ms.split()).replace('.',''))
            mov = mov.split(' (')[0]
            mov = mov[len(str(idx))+1:]   # drop the leading rank number and the space after it
            comedy_titles.append(mov)
        self.comedy_titles = comedy_titles
        
        # get the main directors and actors
        talent_blocks = soup.find_all('p', {'class':''})
        all_talent = []
        for block in talent_blocks:  # each block is a chunk of html with director and star names plus their titles
            ms = block.text      # just grab the text from the html
            names = ' '.join(ms.split()).replace('.','')    # collapse whitespace and drop periods
            # split up the list by titles ('Director(s)', 'Star(s)'); split() returns a one-item list when there is no '|'
            names = names.split('|')
            talent = []
            for n in names:
                talent.append(n.split(': ')[1])   # drop the title from each list

            talent = [t.split(', ') for t in talent]
            talent = [item for sublist in talent for item in sublist]

            # clean up spaces at the beginning or end in case they escaped
            talent = [i.strip() for i in talent]
            all_talent.append(talent)
        
        all_talent = [n for m in all_talent for n in m]  # unpack the list of names
        all_talent = set(all_talent) # remove duplicates!
        
        self.talent_names = all_talent
        
        

    # grab the permalinks from each comedy's movie page on imdb    
    def get_permalinks(self):
        self.get_comedies()
        movie_links = self.movie_links
        permalinks = []
        for link in movie_links:
            response = requests.get(link)
            soup = BeautifulSoup(response.text, 'lxml')
            plinks = [self.imdb_header + link.attrs.get('href') for link in soup.select('div.actions.text-muted a') if link.attrs.get('href')[1:7] == 'review']
            plinks = plinks[ 0 : self.movie_reviews ]
            permalinks.append(plinks)
        
        
        self.permalinks = permalinks
        
        self.clean_up()
        
        return self

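    # remove the movies with fewer reviews than requested
    def clean_up(self):
        # build filtered copies instead of popping while iterating, which skips the entry after each removal
        keep = [len(p) >= self.movie_reviews for p in self.permalinks]
        self.permalinks = [p for p, k in zip(self.permalinks, keep) if k]
        self.comedy_titles = [t for t, k in zip(self.comedy_titles, keep) if k]
        self.movie_links = [l for l, k in zip(self.movie_links, keep) if k]
        return self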
    
       
# pass how many reviews per movie
imdb = scraper(5).get_permalinks()                
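
Removing items from a list while enumerating it can skip the entry that follows each removal, which matters for the clean-up step that drops movies with too few reviews. The sketch below contrasts that pattern with filtering the parallel lists together. The data is made up, and `pop_while_iterating` and `filter_parallel` are hypothetical helper names used only for this illustration.

```python
def pop_while_iterating(permalinks, titles, min_reviews):
    """Pattern to avoid: popping during enumerate shifts later items left."""
    permalinks, titles = list(permalinks), list(titles)  # work on copies
    for x, p in enumerate(permalinks):
        if len(p) < min_reviews:
            permalinks.pop(x)
            titles.pop(x)
    return titles


def filter_parallel(permalinks, titles, min_reviews):
    """Keep only the (permalinks, title) pairs that have enough reviews."""
    kept = [(p, t) for p, t in zip(permalinks, titles) if len(p) >= min_reviews]
    return [p for p, _ in kept], [t for _, t in kept]


# Made-up data: movies B and C each have too few reviews.
sample_links = [['a1', 'a2'], ['b1'], ['c1'], ['d1', 'd2']]
sample_titles = ['A', 'B', 'C', 'D']

skipped = pop_while_iterating(sample_links, sample_titles, min_reviews=2)
filtered_links, filtered_titles = filter_parallel(sample_links, sample_titles, min_reviews=2)
print(skipped)          # movie C survives because it was skipped
print(filtered_titles)  # only movies with enough reviews remain
```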



In [3]:
count_permalinks = len([item for sublist in imdb.permalinks for item in sublist])
print(f'There are {count_permalinks} permalinks for {len(imdb.comedy_titles)} comedies with a total of {len(imdb.talent_names)} unique directors and actors from the imdb website')


There are 245 permalinks for 49 comedies with a total of 213 unique directors and actors from the imdb website


Comments: End #1

---
Exercise #2: Extract noun phrase (NP) chunks from your reviews using the following procedure:
- a.	In Python, use BeautifulSoup to grab the main review text from each link.  
- b.	Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
- c.	You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.





In [5]:
# Use Beautiful Soup to extract the reviews from each review page within imdb.
links = imdb.permalinks[0]

class movie_reviews():
    def __init__(self, permalinks_list):
        self.permalinks_list = permalinks_list
        
    # cycle through all movies and return a list of lists of movie reviews for each movie
    def get_movie_reviews(self):
        reviews = []
        for single_movie_permalinks in self.permalinks_list:
            single_comedy_reviews = self.movie_review_page_cycling(single_movie_permalinks)
            reviews.append(single_comedy_reviews)
        
        self.reviews = reviews
        return self

    # cycle through each movie's pages of reviews
    def movie_review_page_cycling(self, links):
        reviews = []
        for link in links:
            review = self.get_review(link)
            reviews.append(review)
        
        return reviews

    # get single review
    def get_review(self, link):
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'lxml')
         
        review =  soup.find('div', {'class':'text show-more__control'}).text
        return review


#reviews = movie_review_page_cycling(links)
#print(f'Reviews for {imdb.comedy_titles[0]}')
#for review in reviews:
#    print(f'{review}\n')

# only pass the first 100 permalinks from the first 20 comedies
reviews = movie_reviews(imdb.permalinks[:20]).get_movie_reviews()

    

In [6]:
# unpack the movie reviews into a list of 100 reviews!
movie_reviews = [item for sublist in reviews.reviews for item in sublist]
len(movie_reviews)

100

In [None]:
movie_reviews[0]

"From the world of dc comics, john cena is chris smith, self proclaimed peacemaker. He's strong enough to bail from the hospital, so he's back on the job. This opens with a quick summary of what came before. His mission, his jail sentence. And of course, the chip in his head that they can pop if he doesn't obey. So much cussing. And the violence. Not for the kiddies. When a one night stand goes very wrong, he realizes he's already caught up in a new mission. With harcourt and the crew to back him up, he's off to fight the butterflies. Lots of silly banter. Showing on hbomax. Good stuff! A new team member too. Written (and some episodes directed) by james gunn. Also did a whole lot of guardians of the galaxy."

---

- b.	Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 


In [7]:

# preprocess each review: tokenize into sentences, then into words, then POS-tag each word
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk import pos_tag

class prepare():
    def __init__(self):
        pass  # no state to initialize
        
    def tokenized_single_review(self, review):
        sentences = nltk.sent_tokenize(review)
        sentences = [nltk.word_tokenize(sent) for sent in sentences]
        sentences = [nltk.pos_tag(sent) for sent in sentences]
        return sentences

    def token_reviews(self, reviews):
        tokenized_reviews = [self.tokenized_single_review(review) for review in reviews]
        return tokenized_reviews
        
prep = prepare()
tok_reviews = prep.token_reviews(movie_reviews)


In [8]:
tok_reviews[0][0]

[('From', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('of', 'IN'),
 ('dc', 'NN'),
 ('comics', 'NNS'),
 (',', ','),
 ('john', 'NN'),
 ('cena', 'NN'),
 ('is', 'VBZ'),
 ('chris', 'JJ'),
 ('smith', 'NN'),
 (',', ','),
 ('self', 'NN'),
 ('proclaimed', 'VBD'),
 ('peacemaker', 'NN'),
 ('.', '.')]

In [43]:
# let's see how many of each pos we have
from collections import Counter

       
s = unpacked_pos_words = [pos for sent in [sent for review in tok_reviews for sent in review] for pos in sent]
print(Counter([tag for word, tag in s]))

Counter({'NN': 3700, 'DT': 2735, 'IN': 2665, 'NNP': 2003, 'JJ': 1934, 'RB': 1531, 'PRP': 1352, 'VBZ': 1293, ',': 1237, '.': 1186, 'CC': 1027, 'NNS': 1015, 'VB': 781, 'VBP': 573, 'TO': 550, 'VBD': 532, 'VBG': 508, 'VBN': 471, 'PRP$': 326, 'CD': 261, 'MD': 250, '(': 183, ')': 182, ':': 175, 'POS': 151, 'WP': 147, 'WRB': 139, 'WDT': 133, 'RP': 129, "''": 98, '``': 92, 'JJR': 78, 'JJS': 71, 'EX': 64, 'RBR': 39, 'PDT': 25, 'RBS': 19, 'NNPS': 14, 'UH': 11, 'FW': 8, '$': 4, 'WP$': 3, '#': 1})


In [51]:
print(set([x for i,x in s if x not in ['#', '$', "''", '(', ')', ',', '.', ':','``']]))

{'WDT', 'JJ', 'NNP', 'UH', 'NNPS', 'PDT', 'NNS', 'TO', 'CC', 'VBN', 'VB', 'RBR', 'POS', 'FW', 'VBD', 'RB', 'VBG', 'PRP', 'NN', 'VBP', 'CD', 'JJS', 'DT', 'WP', 'WRB', 'JJR', 'MD', 'RBS', 'WP$', 'IN', 'PRP$', 'RP', 'VBZ', 'EX'}


In [207]:
from nltk.chunk.regexp import RegexpParser


def chunk_my_reviews(reviews):
    #grammar = """
    #    NP: {<DT>?<JJ>?<NN.*>}
    #    ADJP: {<JJ>}
    #    ADVB: {<RB.*>}
    #    PP: {<IN>}
    #    VP: {<MD>?<VB.*>+}
    #   """
    grammar = """
        NP: {<DT>?<JJ>?<NN.*>}
    """
    
    chunker = RegexpParser(grammar, loop=1)
    chunked_reviews = [[chunker.parse(sentence) for sentence in review] for review in reviews]   # use the argument, not the global tok_reviews
    return chunked_reviews

chunked_reviews = chunk_my_reviews(tok_reviews)
 
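# example: every chunked sentence of the first review

chunked_reviews[0]    
# what we see here is that the chunker does pick up the names, but the tagger labels the lowercase names as NN (or JJ for 'chris') instead of NNP, so 'john' and 'cena' end up in separate one-word NP chunks.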

[Tree('S', [('From', 'IN'), Tree('NP', [('the', 'DT'), ('world', 'NN')]), ('of', 'IN'), Tree('NP', [('dc', 'NN')]), Tree('NP', [('comics', 'NNS')]), (',', ','), Tree('NP', [('john', 'NN')]), Tree('NP', [('cena', 'NN')]), ('is', 'VBZ'), Tree('NP', [('chris', 'JJ'), ('smith', 'NN')]), (',', ','), Tree('NP', [('self', 'NN')]), ('proclaimed', 'VBD'), Tree('NP', [('peacemaker', 'NN')]), ('.', '.')]),
 Tree('S', [('He', 'PRP'), ("'s", 'VBZ'), ('strong', 'JJ'), ('enough', 'RB'), ('to', 'TO'), ('bail', 'VB'), ('from', 'IN'), Tree('NP', [('the', 'DT'), ('hospital', 'NN')]), (',', ','), ('so', 'IN'), ('he', 'PRP'), ("'s", 'VBZ'), ('back', 'RB'), ('on', 'IN'), Tree('NP', [('the', 'DT'), ('job', 'NN')]), ('.', '.')]),
 Tree('S', [('This', 'DT'), ('opens', 'VBZ'), ('with', 'IN'), Tree('NP', [('a', 'DT'), ('quick', 'JJ'), ('summary', 'NN')]), ('of', 'IN'), ('what', 'WP'), ('came', 'VBD'), ('before', 'RB'), ('.', '.')]),
 Tree('S', [('His', 'PRP$'), Tree('NP', [('mission', 'NN')]), (',', ','), ('his'

- c.	You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.

In [208]:
# now we need the actor and director names from the movie pages; these were already collected in the scraper class as talent_names, where a set removes any duplicates from the extraction process.
tal = [name for name in imdb.talent_names]        

# separate the fnames, mnames, lnames, etc
tal = [name.split(' ') for name in tal]
# break apart the name lists and remove duplicates
tal = list(set([name for fullname in tal for name in fullname]))
tal.sort()
#tal

In [209]:
# set up the talent names with tags
tnames = [nltk.word_tokenize(name) for name in tal]
tnames = [nltk.pos_tag(name) for name in tnames]


In [210]:
count_tal_tags = [pos for tag in tnames for pos in tag]
Counter([tag for name, tag in count_tal_tags])

Counter({'NN': 264,
         'NNP': 95,
         'FW': 2,
         'NNS': 21,
         'VB': 4,
         'VBG': 1,
         'JJ': 9,
         'VBN': 2,
         'RB': 1,
         'IN': 1,
         'MD': 1})

In [212]:
# after much trial and error (3 days of work) and a lot of deleted code, I could not figure out how to manually update the POS tags for re-chunking.
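
One possible way to approach this step is to overwrite the tag of any token that matches a scraped name part before chunking. This is a sketch under assumptions, not something that was run in this notebook. `retag_names` is a hypothetical helper, and `known_names` is a two-item stand-in for the `tal` list built above.

```python
def retag_names(tagged_sentence, name_tokens, tag="NNP"):
    """Return a copy of a POS-tagged sentence with known names forced to `tag`.

    The match is case-insensitive because many reviews are written in lowercase.
    """
    lookup = {name.lower() for name in name_tokens}
    return [(word, tag if word.lower() in lookup else old_tag)
            for word, old_tag in tagged_sentence]


# First sentence of the first review, as tagged by pos_tag above (shortened).
tagged = [('From', 'IN'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'),
          ('dc', 'NN'), ('comics', 'NNS'), (',', ','), ('john', 'NN'),
          ('cena', 'NN'), ('is', 'VBZ'), ('chris', 'JJ'), ('smith', 'NN'),
          ('.', '.')]

known_names = ['John', 'Cena']  # stand-in for the scraped name parts in `tal`
retagged = retag_names(tagged, known_names)
print(retagged)
```

The re-tagged sentences can then be passed to a `RegexpParser` in the same way as the original tagged sentences. One caveat: matching single tokens will also re-tag ordinary words that happen to be name parts, so a stricter variant might only match full first-name/last-name pairs.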

---

3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).
<br>
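
Exercise 3 asks for the chunks as a single list per review. A minimal sketch of pulling just the NP chunk strings out of the chunk trees follows. It assumes each review is a list of `nltk.tree.Tree` sentences shaped like the `RegexpParser` output above; `np_phrases` is a hypothetical helper name, and the sample trees are copied from the first review's output.

```python
from nltk.tree import Tree


def np_phrases(review_trees):
    """Collect the words of every NP chunk in one review as a flat list of strings."""
    phrases = []
    for sentence_tree in review_trees:
        for subtree in sentence_tree.subtrees(filter=lambda t: t.label() == 'NP'):
            phrases.append(' '.join(word for word, tag in subtree.leaves()))
    return phrases


# Two chunked sentences from the first review.
sample_review = [
    Tree('S', [('He', 'PRP'), ("'s", 'VBZ'), ('strong', 'JJ'), ('enough', 'RB'),
               ('to', 'TO'), ('bail', 'VB'), ('from', 'IN'),
               Tree('NP', [('the', 'DT'), ('hospital', 'NN')]), (',', ','),
               ('so', 'IN'), ('he', 'PRP'), ("'s", 'VBZ'), ('back', 'RB'),
               ('on', 'IN'), Tree('NP', [('the', 'DT'), ('job', 'NN')]),
               ('.', '.')]),
    Tree('S', [('This', 'DT'), ('opens', 'VBZ'), ('with', 'IN'),
               Tree('NP', [('a', 'DT'), ('quick', 'JJ'), ('summary', 'NN')]),
               ('of', 'IN'), ('what', 'WP'), ('came', 'VBD'), ('before', 'RB'),
               ('.', '.')]),
]

chunks = np_phrases(sample_review)
print(chunks)
```

Applied to every review, this would give one flat list of noun phrases per review instead of the full nested trees.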

#### Summary

##### IMDB Site Scraping
I started the assignment by using Beautiful Soup to scrape the list of the top 50 comedies from imdb.com, collecting each comedy's title, the link to its movie page, its main talent (directors and actors), and the permalinks to its individual reviews. Since the assignment only requires 100 reviews, I used the first 5 reviews of each of the first 20 movies, after ensuring that each movie had at least 5 reviews. This was done using a combination of list comprehensions within methods of a class.

##### NP-chunk
In this section I used the permalinks to grab the individual reviews for each of the first 20 movies that had at least 5 reviews. (Of the 50, only 1 comedy did not have 5 reviews, because it has not been released yet; it stars Sandra Bullock.)

I then used the NLTK `sent_tokenize`, `word_tokenize` and `pos_tag` functions to break down each movie review, using a combination of list comprehensions within methods of a class. I then inspected the first sentence of the first review out of curiosity.

Next I checked the tags by using `Counter` from the `collections` package to count how many of each POS tag existed in my corpus. To my surprise, there were 182 tokens with a very strange label. On further inspection my lookup returned an empty list, so I first suspected some blank space lurking; the `Counter` output, though, shows that the label is `)`, the tag `pos_tag` gave to closing parentheses (alongside 183 `(` tags).

Next I chunked only for noun phrases, using the `RegexpParser` and a hand-written grammar of `NP: {<DT>?<JJ>?<NN.*>}`.
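
The grammar above matches at most one adjective and exactly one noun per chunk, which is why "dc comics" and "john cena" come out as separate one-word chunks. A hedged variant that allows several adjectives and runs of consecutive nouns is sketched below. `VARIANT_GRAMMAR` is an assumption for illustration and was not used for any output in this notebook.

```python
from nltk.chunk.regexp import RegexpParser

# Variant grammar (an assumption, not the one used for the notebook output):
# optional determiner, any number of adjectives, then one or more nouns.
VARIANT_GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"

# First sentence of the first review, with the tags pos_tag produced above.
tagged_sentence = [('From', 'IN'), ('the', 'DT'), ('world', 'NN'), ('of', 'IN'),
                   ('dc', 'NN'), ('comics', 'NNS'), (',', ','), ('john', 'NN'),
                   ('cena', 'NN'), ('is', 'VBZ'), ('chris', 'JJ'),
                   ('smith', 'NN'), (',', ','), ('self', 'NN'),
                   ('proclaimed', 'VBD'), ('peacemaker', 'NN'), ('.', '.')]

variant_chunker = RegexpParser(VARIANT_GRAMMAR)
tree = variant_chunker.parse(tagged_sentence)

# Join the words of each NP subtree into one string.
noun_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(noun_phrases)
```

With this variant, adjacent nouns are grouped into a single chunk, so multi-word names stay together even when the tagger labels them NN rather than NNP.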

##### Wrap it up
After about 3 days of trial and error (well, just error), I could not figure out how to add anything to the lexicon or manually update the talent-name POS tags in order to re-chunk. I did check how every name I scraped from the IMDB website was POS-tagged; only 95 of the 401 name tokens were correctly tagged as NNP, so there is a chance that quite a bit of information was lost in this process. 
<br> Below is how `pos_tag` tagged the name tokens:
<br> 'NN': 264,
<br> 'NNP': 95,
<br> 'FW': 2,
<br> 'NNS': 21,
<br> 'VB': 4,
<br> 'VBG': 1,
<br> 'JJ': 9,
<br> 'VBN': 2,
<br> 'RB': 1,
<br> 'IN': 1,
<br> 'MD': 1




In [218]:
for i, chunk in enumerate(chunked_reviews):
    if i % 5 == 0:   # 5 reviews per movie, so print a title before each group of 5
        print(f'{imdb.comedy_titles[i // 5]} Movie Reviews Chunked:')
    print(f'{chunk}\n')
    

Peacemaker Movie Reviews Chunked:
[Tree('S', [('From', 'IN'), Tree('NP', [('the', 'DT'), ('world', 'NN')]), ('of', 'IN'), Tree('NP', [('dc', 'NN')]), Tree('NP', [('comics', 'NNS')]), (',', ','), Tree('NP', [('john', 'NN')]), Tree('NP', [('cena', 'NN')]), ('is', 'VBZ'), Tree('NP', [('chris', 'JJ'), ('smith', 'NN')]), (',', ','), Tree('NP', [('self', 'NN')]), ('proclaimed', 'VBD'), Tree('NP', [('peacemaker', 'NN')]), ('.', '.')]), Tree('S', [('He', 'PRP'), ("'s", 'VBZ'), ('strong', 'JJ'), ('enough', 'RB'), ('to', 'TO'), ('bail', 'VB'), ('from', 'IN'), Tree('NP', [('the', 'DT'), ('hospital', 'NN')]), (',', ','), ('so', 'IN'), ('he', 'PRP'), ("'s", 'VBZ'), ('back', 'RB'), ('on', 'IN'), Tree('NP', [('the', 'DT'), ('job', 'NN')]), ('.', '.')]), Tree('S', [('This', 'DT'), ('opens', 'VBZ'), ('with', 'IN'), Tree('NP', [('a', 'DT'), ('quick', 'JJ'), ('summary', 'NN')]), ('of', 'IN'), ('what', 'WP'), ('came', 'VBD'), ('before', 'RB'), ('.', '.')]), Tree('S', [('His', 'PRP$'), Tree('NP', [('missio