# Scrapy with Everything Everywhere All At Once
---

>**What movies or TV shows share actors with your favorite movie or show?**

We will answer this question by building a simple "recommender system" that counts the number of actors shared between two movies/shows.

This post is organized as follows:

1. We'll build a web scraper with Scrapy to scrape [TMDB](https://www.themoviedb.org/).
2. We will sort our scraped results and output an interesting visualization to help answer the question.

## 1. Setup
---

### a. Pick a Movie/Show

First, we need a movie or show to work with, along with its TMDB link. I'm going to choose *Everything Everywhere All At Once*, one of my favorite movies. Its TMDB page is:

https://www.themoviedb.org/movie/545611-everything-everywhere-all-at-once/

This project is easy to replicate with a different movie or show: just swap in the appropriate links!

### b. Initialize Scrapy Project

Now we can create our Scrapy project! Open a terminal, navigate to the directory where the project should live, and run:

```
scrapy startproject TMDB_scraper
cd TMDB_scraper
```

### c. Tweak Settings

Before scraping, we need to change one setting in `settings.py` to avoid `403` errors. A `403` (Forbidden) response means the server refused our request, which commonly happens when a website decides that we are a bot. To get around this, we set `USER_AGENT` to a browser-like string such as `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"`.
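As a rough sketch, the relevant line in `settings.py` could look like the following. `USER_AGENT` is Scrapy's setting name; whether this particular string keeps working against TMDB is an assumption, not a guarantee.

```python
# settings.py (excerpt)
# Assumption: a browser-like user agent is enough to avoid the 403 responses.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/58.0.3029.110 Safari/537.36"
)
```

The parentheses only split the string across lines for readability; Python joins the adjacent string literals into one value.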

## 2. Scraping
---

Now, we can finally create our spider `tmdb_spider.py` inside the `spiders` directory. We can add these lines:
```python
import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    
    start_urls = ['https://www.themoviedb.org/movie/545611-everything-everywhere-all-at-once/']
```

Here, we import `scrapy`, define our spider with the name `tmdb_spider`, and set `start_urls` to the page for *Everything Everywhere All At Once*.

Now, we need to write 3 different parsing methods to get the data we want from TMDB.

### a. parse()
---

This method will navigate our spider from the main page for the movie to the *Full Cast & Crew* page.

```python
    def parse(self, response):
        '''
        Start on a movie page, then navigate to the Full Cast & Crew page,
        which is just movie_url/cast.
        '''
        # URL for the movie's cast page
        url = 'https://www.themoviedb.org/movie/545611-everything-everywhere-all-at-once/cast'
        # Request the cast page and hand the response to parse_full_credits
        yield scrapy.Request(url, callback=self.parse_full_credits)
```

Here, the URL is just the start URL followed by `cast`, so to use a different movie, this URL has to be changed as well. We issue a `scrapy.Request` for that page and pass `parse_full_credits(self, response)` as the callback method, which is detailed below!
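One way to avoid editing the URL in two places is to derive the cast URL from the movie URL. The snippet below is a minimal sketch of that idea; `cast_url` is a hypothetical helper, not part of Scrapy, and it assumes the cast page always lives at the movie URL plus `/cast`.

```python
def cast_url(movie_url: str) -> str:
    """Return the Full Cast & Crew URL for a TMDB movie page URL."""
    # Strip any trailing slash first so we never produce a double slash.
    return movie_url.rstrip("/") + "/cast"


# Example with the movie page used in this post
movie = "https://www.themoviedb.org/movie/545611-everything-everywhere-all-at-once/"
print(cast_url(movie))
```

Inside `parse()`, one could then build the request URL with `cast_url(response.url)`, assuming `response.url` is still the movie page URL.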

### b. parse_full_credits()
---

This method will start from the *Full Cast & Crew* page and go through each actor on that page.

```python
    def parse_full_credits(self, response):
        '''
        Start on the Full Cast & Crew page. Yield a scrapy.Request for the page
        of each actor listed on the page, not including crew members.
        '''
        # Links to the actors' pages; :not(.crew) excludes the crew lists
        actors = response.css('ol.people.credits:not(.crew) a::attr(href)').getall()
        # Follow each link and parse the actor's page
        for actor in actors:
            yield response.follow(actor, callback=self.parse_actor_page)
```

In this code, `actors` is a list of links to the actors' profile pages. Note that we include `:not(.crew)` so that we only get the *actors* and not the crew members for the movie/show. We then iterate through the list and, for each link, yield the request that `response.follow` creates for that page. Our callback method is `parse_actor_page()`, which is detailed below.

### c. parse_actor_page()
---

This method will start from an actor's profile page and yield one dictionary for each movie/TV show the actor has worked on. Each output dictionary will be in the form of:
`{"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}`

```python
    def parse_actor_page(self, response):
        '''
        Start on the page of an actor. Yield a dictionary in the format of
        {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name},
        one for each of the movies or TV shows on which that actor worked.
        '''
        # The actor's name is the text of the link inside the h2 heading
        actor_name = response.css("h2 a::text").get()
        print(actor_name)  # debug output: shows which actor is being scraped
        # Iterate through all titles in the actor's credit list and yield one dictionary per credit
        for movie_or_TV_name in response.css("div.credits_list bdi::text").getall():
            yield {
                "actor": actor_name,
                "movie_or_TV_name": movie_or_TV_name
            }
```

Here, we pick out the actor name and the movie/show titles with CSS selectors.

We are now done with defining our spider!

## 3. Recommendations
---

Let's finally get some results! Run this command:
```
scrapy crawl tmdb_spider -o results.csv
```
and we will end up with a CSV file called `results.csv` that contains all of our actors and the movies/TV shows they worked on. Let's use `pandas` to read the CSV file into a dataframe:

```python
import pandas as pd
df = pd.read_csv("results.csv")
df
```

| | actor | movie_or_TV_name |
|---|---|---|
| 0 | Michelle Yeoh | The Brothers Sun |
| 1 | Michelle Yeoh | ARK: The Animated Series |
| 2 | Michelle Yeoh | Rich People Problems |
| 3 | Michelle Yeoh | Star Trek: Section 31 |
| 4 | Michelle Yeoh | The Legend of Nezha |
| ... | ... | ... |
| 1608 | Stephanie Hsu | The Late Show with Stephen Colbert |
| 1609 | Stephanie Hsu | Unbreakable Kimmy Schmidt |
| 1610 | Stephanie Hsu | Girl Code |
| 1611 | Stephanie Hsu | The Oscars |


Now that we have our data, we can finally answer the question we posed at the beginning! Let's find which movies/TV shows share the most actors with *Everything Everywhere All At Once*. To do this, we just look at the titles that come up most frequently in the dataframe, since each row pairs one actor with one title. A great way to visualize this is with a `Plotly` pie chart!

```python
from plotly import express as px
import numpy as np
# set the renderer so the Plotly figure can be embedded in the blog post
import plotly.io as pio
pio.renderers.default = "iframe"
```
```python
# Count how often each title appears. The first entry is skipped because it is
# Everything Everywhere All At Once itself; positions 1 to 30 give the top 30 other titles.
top30 = df["movie_or_TV_name"].value_counts().iloc[1:31].reset_index()
top30.columns = ["movie_or_TV_name", "#_of_shared_actors"]  # rename columns

fig = px.pie(top30, values="#_of_shared_actors", names="movie_or_TV_name",
             title="Top Recommendations based off of Everything Everywhere All At Once's cast")
fig.update_traces(textposition="inside", text=top30["#_of_shared_actors"])  # show the counts inside the slices
fig.update_layout(height=800, width=1000)  # height and width of chart
fig.show()
```

Looks like we got what we wanted! Maybe Westworld would be a good show to watch then!
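
The counting step behind the chart can also be sketched in plain Python. The rows below are made-up placeholders, not scraped data, and the deduplication step assumes that the same title can show up more than once for one actor (for example under two different roles), which may or may not happen on TMDB.

```python
from collections import Counter

# Hypothetical (actor, title) rows standing in for results.csv; not real data.
rows = [
    ("Actor A", "Show X"),
    ("Actor A", "Show X"),  # repeated credit for the same actor and title
    ("Actor B", "Show X"),
    ("Actor C", "Show X"),
    ("Actor A", "Film Y"),
    ("Actor C", "Film Y"),
    ("Actor B", "Film Z"),
]

# Keep each (actor, title) pair once so a repeated credit is not double-counted.
unique_pairs = set(rows)

# Number of distinct actors per title.
shared = Counter(title for _, title in unique_pairs)

top = shared.most_common(2)
print(top)
```

With `pandas`, the same deduplication could be done by calling `df.drop_duplicates()` before `value_counts()`.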