In this post, we will do some web scraping. The goal is to find the movies that share many actors with one of my favorite movies: La La Land. In total, I wrote three parsing methods in the class ImdbSpider: parse, parse_full_credits, and parse_actor_page. Let's look at them one by one.

### Method 1. Parse

Our main parsing method takes us from the movie page to the page that lists the full cast. The code is shown below:

In [None]:
def parse(self, response):
    '''
    This method assumes that we start on a movie page
    and navigates to the Cast & Crew page.
    '''
    #use the browser dev tools to find where the link is located, then get its href
    #(the hashed class name below was copied from the page and may change over time)
    url = (response.css("div.SubNav__SubNavContainer-sc-11106ua-1.hDUKxp")
           .css("li.ipc-inline-list__item")[0]
           .css("a").attrib["href"])
    #join the relative url with the main site url
    url = response.urljoin(url)
    #yield the request; the Cast & Crew page is handled by parse_full_credits
    yield scrapy.Request(url, callback = self.parse_full_credits)

With this first method, we can navigate to the Cast & Crew page, where we find a list of all the actors.
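
The parse method passes the scraped href through response.urljoin before requesting it. Scrapy documents response.urljoin as a wrapper around the standard library's urllib.parse.urljoin that uses the response's own URL as the base, so the effect can be sketched without Scrapy. The helper name absolute_url, the page URL, and the hrefs below are made-up placeholders for illustration, not values taken from IMDb.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical movie page URL (placeholder id, not a real title)
page_url = "https://www.imdb.com/title/tt0000000/"

# A path-relative href is appended to the current page's path
print(absolute_url(page_url, "fullcredits/"))
# https://www.imdb.com/title/tt0000000/fullcredits/

# A root-relative href replaces the whole path but keeps the host
print(absolute_url(page_url, "/name/nm0000000/"))
# https://www.imdb.com/name/nm0000000/
```

This is why the spider can yield requests directly from the href attributes it scrapes, whether they are relative to the page or to the site root.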

### Method 2. Parse_full_credits

This second method is relatively simple: it collects the URLs of the pages of each actor in the cast and yields a request for each one. These requests are handled by our method 3.

In [None]:
def parse_full_credits(self, response):
    """
    This method assumes that we start on the Cast & Crew page.
    It yields a scrapy.Request for the page of each actor
    listed on the Cast & Crew page.
    """
    #list of actor urls, taken from the links around the cast photos
    actor_urls = [a.attrib["href"] for a in response.css("td.primary_photo a")]
    #request each actor page and handle it with parse_actor_page
    for url in actor_urls:
        url = response.urljoin(url)
        yield scrapy.Request(url, callback = self.parse_actor_page)

### Method 3. Parse_actor_page

This is our most important method. The goal is to find all the movies or TV series in which the actor played a role, and to record the names of those works. The difficulty here is that other sections sit directly below the "Actors" section, so we have to use the ids of the rows to tell the sections apart.

In [None]:
def parse_actor_page(self, response):
    '''
    This method assumes that we start on the page of an actor.
    It yields a dictionary with two key-value pairs for each
    movie or TV show that the actor played in.
    '''
    #get the name of the actor at the top of the page
    actor_name = response.xpath('//h1[@class="header"]/span/text()')\
        .extract_first()
    #get a list of the movie names by specifying that the id
    #starts with the word "actor"
    movie_name = response.css('div.filmo-row[id^="actor"] b a::text')\
        .extract()

    #loop over all the movie names and create a dictionary for each,
    #while the name of the actor is the same
    for name in movie_name:
        yield {
            "actor": actor_name,
            "movie_or_TV_name": name
        }
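
The [id^="actor"] part of the selector is a CSS attribute selector that keeps only elements whose id begins with the given prefix. To make that filtering rule concrete, here is a minimal sketch that mimics it with the standard library's html.parser instead of Scrapy. The class ActorRowParser, the helper actor_titles, and the sample HTML are hypothetical and only imitate the structure the selector assumes; they are not copied from a real actor page.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the rows that the selector targets
SAMPLE_HTML = """
<div class="filmo-row odd" id="actor-tt0000001">
  <b><a href="/title/tt0000001/">Example Film</a></b>
</div>
<div class="filmo-row even" id="actor-tt0000002">
  <b><a href="/title/tt0000002/">Example Series</a></b>
  <div class="filmo-episodes"><a href="/title/tt0000009/">Some Episode</a></div>
</div>
<div class="filmo-row odd" id="producer-tt0000003">
  <b><a href="/title/tt0000003/">Produced Film</a></b>
</div>
"""


class ActorRowParser(HTMLParser):
    """Roughly mimics: div.filmo-row[id^="actor"] b a::text"""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.row_depth = 0    # > 0 while inside a matching div
        self.in_bold = False  # inside <b> within a matching div
        self.in_link = False  # inside <a> within that <b>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.row_depth > 0:
                # nested div inside a matching row
                self.row_depth += 1
            else:
                classes = (attrs.get("class") or "").split()
                row_id = attrs.get("id") or ""
                # the prefix test is what id^="actor" expresses
                if "filmo-row" in classes and row_id.startswith("actor"):
                    self.row_depth = 1
        elif self.row_depth > 0 and tag == "b":
            self.in_bold = True
        elif self.in_bold and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "div" and self.row_depth > 0:
            self.row_depth -= 1
        elif tag == "b":
            self.in_bold = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.titles.append(data.strip())


def actor_titles(html):
    """Return the link texts from rows whose id starts with 'actor'."""
    parser = ActorRowParser()
    parser.feed(html)
    return parser.titles


print(actor_titles(SAMPLE_HTML))
# ['Example Film', 'Example Series']
```

The producer row is skipped because its id does not start with "actor", and the episode link is skipped because it is not inside a b tag. Since the match is a literal prefix, an id such as actress-tt0000004 would not match either; whether the real pages use ids like that is worth checking in the dev tools.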

It took me a long time to figure out the simple "id^" part. Anyway, I made it! After writing all three methods, we run **scrapy crawl imdb_spider -o movies.csv** on the command line to produce a csv file containing the (actor, movie) pairs. To visualize the result a bit, we import the file into a Jupyter notebook and rank the movies and TV shows by the number of actors they share with La La Land.

In [None]:
import pandas as pd

In [None]:
#import the csv file as a dataframe
movies = pd.read_csv("movies.csv")
movies.head()

In [None]:
#group by movie name and count the number of actors
movies = movies.groupby('movie_or_TV_name').count()

In [None]:
#sort in descending order, reset the index, rename the count column
movies = movies.sort_values(by = ['actor'],ascending = False).reset_index()
movies = movies.rename(columns = {"actor":"number of shared actors"})

In [None]:
movies[:10]
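
The groupby, count, and sort steps above boil down to counting how many rows mention each title. As a cross-check that needs no pandas, here is a minimal sketch using only the standard library. The function name count_shared_actors is hypothetical, and the sample rows are invented placeholders in the same two-column layout as movies.csv, not real scraped data.

```python
import csv
import io
from collections import Counter

# Invented rows in the same (actor, movie_or_TV_name) layout as movies.csv
SAMPLE_CSV = """actor,movie_or_TV_name
Actor A,Film X
Actor B,Film X
Actor A,Film Y
Actor C,Film X
Actor B,Film Y
Actor C,Film Z
"""


def count_shared_actors(csv_text):
    """Return (title, number of rows) pairs, most frequent title first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # one row per (actor, title) pair, so counting rows counts actors
    counts = Counter(row["movie_or_TV_name"] for row in reader)
    return counts.most_common()


for title, shared in count_shared_actors(SAMPLE_CSV):
    print(title, shared)
# Film X 3
# Film Y 2
# Film Z 1
```

Like the pandas version, this counts rows rather than distinct actors, so it relies on each actor appearing at most once per title in the csv file.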

The result for my movie of choice isn't as impressive as the professor's result for Star Trek, but it still makes some reasonable suggestions for movies I will probably like. This is the end of our Blog Post 3!