# Scraping Movie Details from YIFY Movies

![banner-image](https://i.imgur.com/V1XCM6l.jpg)


#### Web Scraping 
- Web Scraping is a way to extract information (or simply data) from webpages using various tools and techniques. You can read more about web scraping [here](https://en.wikipedia.org/wiki/Web_scraping). 
#### Why Web Scraping?
- Many websites contain data that may prove invaluable for day-to-day needs, academic research, industry use, business, etc.

- [Stock-rates](https://www.moneycontrol.com/),  [product details](https://www.amazon.in/Chitralekha-Bhagwaticharan-Verma/dp/8126715855/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=),  [sports stats](https://www.skysports.com/football/tables),  [weather forecasts](https://www.accuweather.com/en/in/ghaziabad/206683/september-weather/206683),  [movie-ratings](https://yts.rs/movie/the-protege-2021)  and what not. 

#### YIFY Movies:
- [YIFY Movies](https://yts.rs/) is a website that offers free-to-download movie torrent links and has an enormous database of movies and documentaries.
- We would like to extract movie details (such as title, year, genre, rating, movie link, synopsis and number of times downloaded) for our project.


#### Tools
- [Python 3.7](https://www.python.org/downloads/) and above, along with [Jupyter Notebooks](https://jupyter.org/install.html).
- [Pandas library](https://pandas.pydata.org/docs/) to create a dataframe and save the output to a [.csv](https://en.wikipedia.org/wiki/Comma-separated_values) file.
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing the HTML page.

#### Outline:
Here's an outline of the steps we'll follow:

1. Download the webpage using `requests`
2. Parse the HTML source code using Beautiful Soup
3. Search for the tags containing data for movie title, year, genre, rating, movie URL, synopsis and number of times downloaded.
4. Scrape multiple pages (in our case 20 pages) and compile the information into Python lists and dictionaries.
5. Save the extracted information to a CSV file.

By the time we finish our project, we will have created a CSV file in the following format:

````````
Movie,Year,Genre,Ratings,Url,Synopsis,Downloaded
Whale Hunting,1984,Drama,6.5 / 10,https://yts.rs/movie/whale-hunting-1984," A disillusioned student meets a eccentric beggar and a mute prostitute he falls in love with. Together, without money, they cross South Korea to help the girl go home. "," Downloaded 101 times  Sep 27, 2021 at 09:08 PM"
........
````````

#### How to Run the Code

You can execute the code using the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebooks to [Jovian](https://jovian.ai) by executing the following code cells:

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="scraping-yify-movies-lists")

[jovian] Updating notebook "omrahulpandey/scraping-yify-movies-lists" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/omrahulpandey/scraping-yify-movies-lists


'https://jovian.ai/omrahulpandey/scraping-yify-movies-lists'

#### Download the webpage using `requests`

We'll use the `requests` library to download the web page.

The library can be installed using `pip`.

In [4]:
!pip install requests --upgrade --quiet

In [5]:
#import `requests` library
import requests

The library is now installed and imported.

To download a web page, we'll use the `get` function from `requests`.

In [6]:
# Store the URL of the website to be scraped in a variable (here, site_url)
site_url = 'https://yts.rs/browse-movies'

# requests.get() downloads the web page at site_url and returns a response object
response = requests.get(site_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the request was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#successful_responses) between 200 and 299.
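
The 2xx rule described above can be written as a small helper. This is only a sketch: `is_success` is a hypothetical name used for illustration and is not part of `requests`.

```python
def is_success(status_code):
    """Return True when the HTTP status code is in the 2xx (successful) range."""
    return 200 <= status_code <= 299


print(is_success(200))  # True
print(is_success(204))  # True
print(is_success(404))  # False
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, if you prefer not to compare status codes by hand.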

In [7]:
response.status_code

200

The request was successful! We can get the contents of the page using `response.text`

In [8]:
page_contents = response.text

Let's check the number of characters on the downloaded page.

In [9]:
len(page_contents)

110008

The page contains over 110,000 characters!

Here are the first 600 characters of the page:

In [10]:
page_contents[:600]

'<!DOCTYPE html><html><head><script type="application/ld+json">\n            {\n                "@context": "https://schema.org",\n                "@type": "Organization",\n                "url": "https://yts.rs",\n                "logo": "/images/og_yts_logo.png"\n            }</script><script async="" src="https://www.googletagmanager.com/gtag/js?id=G-H6FV1F987B"></script><script>\n                                window.dataLayer = window.dataLayer || [];\n                                function gtag(){dataLayer.push(arguments);}\n                                gtag(\'js\', new Date());\n              '

The characters presented above are nothing but part of the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page. 

We can also save the `page_contents` to a file and view the page locally within Jupyter using "File>Open" 

In [11]:
with open('yify_webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

The page preview looks similar to the original page, but none of the links work. Note that the site's content changes frequently, so your preview may not match the screenshot below exactly.

![page_contents](https://i.imgur.com/g9b3vMs.jpg)

We have successfully downloaded the web page using `requests`.

#### Parse the HTML source code using BeautifulSoup



In [12]:
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc  = BeautifulSoup(page_contents, 'html.parser')

In [14]:
type(doc)

bs4.BeautifulSoup

In [15]:
doc.find('title')

<title>Search and Browse YIFY Movies Torrent Downloads - YTS</title>

In [16]:
doc.find('img')

<img alt="logo" class="header__logo-image" src="/images/logo-YTS.svg"/>

`doc.find('title')` and `doc.find('img')` return the first `title` and `img` tags on the page, along with their contents.


Let us now define a function `get_doc(url)` to create a BeautifulSoup `doc` and return it, for each url received.

In [17]:
def get_doc(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the page
    response = requests.get(url)
    # Check if download was successful
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Create a bs4 doc    
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [18]:
doc = get_doc(site_url)

In [19]:
doc.find('title')

<title>Search and Browse YIFY Movies Torrent Downloads - YTS</title>

We can now use the function `get_doc` to download any web page and create a BeautifulSoup `doc`.
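
Because the later steps call `get_doc` for every listing page and every movie page, it can be considerate to pause between requests. The following is a minimal sketch under assumptions: `with_delay` is a hypothetical wrapper, and `fake_get_doc` is a stand-in for `get_doc` so that the example runs without network access.

```python
import time


def with_delay(fetch, delay_seconds=1.0):
    """Wrap fetch so that consecutive calls start at least delay_seconds apart."""
    last_call = None  # time of the previous call, None before the first one

    def wrapper(url):
        nonlocal last_call
        if last_call is not None:
            remaining = delay_seconds - (time.monotonic() - last_call)
            if remaining > 0:
                time.sleep(remaining)
        last_call = time.monotonic()
        return fetch(url)

    return wrapper


# Stand-in for get_doc (hypothetical), used only to keep the sketch offline
def fake_get_doc(url):
    return 'doc for ' + url


polite_get_doc = with_delay(fake_get_doc, delay_seconds=0.1)
results = [polite_get_doc('https://yts.rs/browse-movies?page={}'.format(i))
           for i in range(1, 4)]
print(results[0])
```

In the notebook itself, the same wrapper could be applied to the real function, for example `get_doc = with_delay(get_doc)`.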

#### Searching "tags" containing movie data

We'll inspect the [tags](https://www.w3schools.com/TAGS/default.ASP) within source code to find and get the following data:
- Movie
- Year
- Genre
- Rating
- Url
- Synopsis
- Downloads

![inspect_page](https://i.imgur.com/pWvMqnN.jpg)

Right-click on the part of the web page you wish to inspect.

As can be seen, the `<a>` tag contains the title of the movie.


In [20]:
""" Finding the title of the movie from the <a> tag"""

movie_title_tags = doc.find_all('a', class_ ='text--bold palewhite title')

In [21]:
movie_title_tags[0].text

'Gunda'

Movie title is successfully extracted! 



Let's define a function `get_movie_titles` to get a list of movie titles.

In [22]:
def get_movie_titles(doc):
    # get all the <a> tags with a unique class
    movie_title_tags = doc.find_all('a', class_ ='text--bold palewhite title')
    # create an empty list
    movie_titles = []
    for tag in movie_title_tags:
        # append the title text of each tag to the list
        movie_titles.append(tag.text)
    # return list
    return movie_titles

The `get_movie_titles()` function successfully returns a list of movie titles.

In [23]:
""" Finding the Year from <span> tag inside the <a> tag """

movie_year_tags = doc.find_all('span', class_ = 'text--gray year')

In [24]:
movie_year_tags[0].text

'2020'

Movie release year is successfully extracted!

Let's now define a function `get_movie_years` to get a list of movie release years.

In [25]:
def get_movie_years(doc):
    # get all the <span> tags with a unique class
    movie_year_tags = doc.find_all('span', class_ = 'text--gray year')
    # create an empty list
    movie_years =[]
    for tag in movie_year_tags:
        # append the year text of each tag to the list
        movie_years.append(tag.text)
    return movie_years

The `get_movie_years()` function successfully returns a list of movie years.

In [26]:
""" Finding movie genre from <h4> tag that has a unique class genre """

genre_tags = doc.find_all('h4', class_ = 'genre')

In [27]:
genre_tags[0].text

'Documentary'

Movie genre is successfully extracted!


Let's now define a function `get_movie_genres` to get a list of movie genres.

In [28]:
def get_movie_genres(doc):
    # get all the <h4> tags with a unique class
    genre_tags = doc.find_all('h4', class_ = 'genre')
    # create an empty list
    movie_genres = []
    for tag in genre_tags:
        # append the genre text of each tag to the list
        movie_genres.append(tag.text)
    return movie_genres

The `get_movie_genres()` function successfully returns a list of movie genres.
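
One thing to watch: `tag.text` joins the text of all child nodes with no separator, so a movie with two genres can come out as a single word such as `ActionAdventure`. A possible workaround is `get_text()` with a separator. The markup below is an assumption made for illustration; the real genre tag may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes each genre sits in its own child element
html = '<h4 class="genre"><span>Action</span><span>Adventure</span></h4>'
tag = BeautifulSoup(html, 'html.parser').find('h4', class_='genre')

# .text concatenates the pieces directly
joined = tag.text
# get_text() can insert a separator between the pieces
separated = tag.get_text(separator=' / ', strip=True)

print(joined)     # ActionAdventure
print(separated)  # Action / Adventure
```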

In [29]:
""" Finding movie rating from <h4> tag that has a unique class rating """

rating_tags = doc.find_all('h4', class_ = 'rating')

In [30]:
rating_tags[0].text

'7.4 / 10'

Movie rating is successfully extracted!

Let's now define a function `get_movie_ratings` to get a list of movie ratings.

In [31]:
def get_movie_ratings(doc):
    # get all the <h4> tags with a unique class
    rating_tags = doc.find_all('h4', class_ = 'rating')
    # create an empty list
    movie_ratings = []
    for tag in rating_tags:
        # append the rating text of each tag to the list
        movie_ratings.append(tag.text)
    return movie_ratings

The `get_movie_ratings()` function successfully returns a list of movie ratings.

In [32]:
""" Finding movie url from <a> tag that has a unique `class text--bold palewhite title` """

movie_url_tags = doc.find_all('a', class_ ='text--bold palewhite title')

In [33]:
movie_url_tags[0]['href']

'/movie/gunda-2020'

The url is successfully extracted!

However, the URL extracted here is only partial. To get the full URL, we need to prepend the base URL to it.

In [34]:
movie0_url = 'https://yts.rs' + movie_url_tags[0]['href']
print(movie0_url)

https://yts.rs/movie/gunda-2020


We have now successfully extracted the full URL of the movie.
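
String concatenation works here because every `href` starts with `/`. As an alternative sketch (not what this notebook uses), the standard library's `urljoin` handles both relative and already-absolute links:

```python
from urllib.parse import urljoin

base_url = 'https://yts.rs'

# A relative href is resolved against the base URL
full_url = urljoin(base_url, '/movie/gunda-2020')
# An href that is already absolute is returned unchanged
same_url = urljoin(base_url, 'https://yts.rs/movie/gunda-2020')

print(full_url)  # https://yts.rs/movie/gunda-2020
print(same_url)  # https://yts.rs/movie/gunda-2020
```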


Let's now define a function `get_movie_urls` to get a list of full URLs for each movie. 

In [35]:
def get_movie_urls(doc):
    # get all the <a> tags with a unique class
    movie_url_tags = doc.find_all('a', class_ ='text--bold palewhite title')
    # create an empty list
    movie_urls = []
    # the base url for the website
    base_url = 'https://yts.rs'
    for tag in movie_url_tags:
        # prepend the base_url to the href of each tag and append the result to the list
        movie_urls.append(base_url + tag['href'])
    return movie_urls

The `get_movie_urls()` function successfully returns a list of movie urls.


Similarly, we define functions `get_synopsis` and `get_downloaded` to get a list of movie synopses and a list of download counts. 

In [36]:
def get_synopsis(doc):
    # create an empty list
    synopses =[]
    # get all the movie urls from the page
    urls = get_movie_urls(doc)
    for url in urls:
        # for each url (page) get the beautiful soup doc object
        movie_doc = get_doc(url)
        # get all the <div> tags with a unique class
        div_tag = movie_doc.find_all('div', class_ = 'synopsis col-sm-10 col-md-13 col-lg-12')
        # get all the <p> tags inside the first <div> tag
        p_tags = div_tag[0].find_all('p')
        # the text (i.e. the synopsis) of the first <p> tag is extracted using the .text attribute
        synopsis = p_tags[0].text
        # the synopsis is appended to the list synopses
        synopses.append(synopsis)
    return synopses

In [37]:
# import re to perform regular expression operations
import re

In [38]:
def get_downloaded(doc):
    # create an empty list
    downloadeds = []
    # get all the movie urls on page
    urls = get_movie_urls(doc)
    for url in urls:
        # for each url(page) create a beautiful soup doc object
        movie_doc = get_doc(url)
        # get all the <div> tags with unique class
        div_tag = movie_doc.find_all('div', class_ = 'synopsis col-sm-10 col-md-13 col-lg-12')
        # get all the <p> tags inside the first <div> tag
        p_tags = div_tag[0].find_all('p')
        # get all the <em> tags inside the second <p> tag
        em_tag = p_tags[1].find_all('em')
        # extract the text from the <em> tag using .text
        download = em_tag[0].text
        # use a regular expression to remove every non-digit character from the text
        regex = re.compile('[^0-9]')
        downloaded = regex.sub('',download)
        # append the resulting string of digits to the list downloadeds
        downloadeds.append(downloaded)
    return downloadeds

The `get_synopsis()` and `get_downloaded()` functions successfully return the lists `synopses` and `downloadeds`, with one entry for each movie.
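
Removing every non-digit character only gives the download count if the text contains no other numbers. A slightly more defensive sketch is shown below. It assumes the text follows the "Downloaded N times" pattern seen in the CSV preview at the start of this notebook, and that large counts might contain comma separators (an assumption, not something confirmed by the page).

```python
import re

# Sample text taken from the CSV preview at the start of this notebook
text = ' Downloaded 101 times  Sep 27, 2021 at 09:08 PM'

# Removing every non-digit also keeps the digits of the date and time
all_digits = re.sub('[^0-9]', '', text)

# Match only the number between "Downloaded" and "times"
match = re.search(r'Downloaded\s+([\d,]+)\s+times', text)
downloads = int(match.group(1).replace(',', '')) if match else None

print(all_digits)  # 1012720210908
print(downloads)   # 101
```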

Let us now define a function `scrap_page` to get lists of details such as `movies`, `years`, `genres`, `ratings`, `urls`, `synopses` and `downloadeds` from a web page. 

In [39]:
def scrap_page(url):
    # get beautiful soup doc object for url
    doc = get_doc(url)
    # create 7 empty lists for each field
    movies,years,genres,ratings,urls,synopses,downloadeds=[],[],[],[],[],[],[]
    
    # get list of movie titles
    movies = get_movie_titles(doc)
    # get list of years
    years = get_movie_years(doc)
    # get list of genres
    genres = get_movie_genres(doc)
    # get list of ratings
    ratings = get_movie_ratings(doc)
    # get list of urls
    urls = get_movie_urls(doc)
    # get list of synopsis
    synopses = get_synopsis(doc)
    # get list of downloads
    downloadeds = get_downloaded(doc)
    
    return movies,years,genres,ratings,urls,synopses,downloadeds

The `scrap_page()` function successfully returns the lists `movies`, `years`, `genres`, `ratings`, `urls`, `synopses` and `downloadeds` for a web page.
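
Note that `scrap_page` calls both `get_synopsis(doc)` and `get_downloaded(doc)`, and each of them downloads every movie page, so each movie page is requested twice. A possible refinement is to read both values from one parsed page. The sketch below is an illustration only: `parse_movie_page` is a hypothetical helper, and the sample HTML is an assumption modeled on the tag structure described in the comments of the two functions above.

```python
import re
from bs4 import BeautifulSoup

SYNOPSIS_CLASS = 'synopsis col-sm-10 col-md-13 col-lg-12'


def parse_movie_page(movie_doc):
    """Return (synopsis, downloads) from one already-downloaded movie page."""
    div_tag = movie_doc.find('div', class_=SYNOPSIS_CLASS)
    p_tags = div_tag.find_all('p')
    # first <p> holds the synopsis
    synopsis = p_tags[0].text.strip()
    # first <em> inside the second <p> holds the download count text
    em_text = p_tags[1].find('em').text
    downloads = re.sub('[^0-9]', '', em_text)
    return synopsis, downloads


# Assumed page structure, for demonstration without network access
sample_html = """
<div class="synopsis col-sm-10 col-md-13 col-lg-12">
  <p> Example synopsis text. </p>
  <p><em>Downloaded 101 times</em> <em>Sep 27, 2021 at 09:08 PM</em></p>
</div>
"""
synopsis, downloads = parse_movie_page(BeautifulSoup(sample_html, 'html.parser'))
print(synopsis)   # Example synopsis text.
print(downloads)  # 101
```

With a helper like this, a single loop over `get_movie_urls(doc)` could call `get_doc(url)` once per movie and fill both lists.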



Let us now define a function `website_scrap` to get all the information required from multiple pages and create a dictionary `movies_dict` for it.

In [40]:
def website_scrap():
    # create 7 empty lists, one per field, to collect the lists returned for each page
    all_movies,all_years,all_genres,all_ratings,all_urls,all_synopses,all_downloadeds = [],[],[],[],[],[],[]
    for i in range(1,21):
        url = 'https://yts.rs/browse-movies?page={}'.format(i)
        # get the lists of movie field details and append them to the final lists
        movies,years,genres,ratings,urls,synopses,downloadeds = scrap_page(url)
        all_movies += movies
        all_years += years
        all_genres += genres
        all_ratings += ratings
        all_urls += urls
        all_synopses += synopses
        all_downloadeds += downloadeds
        
    # create a dictionary from the final list attained for each 'key' as movie detail    
    movies_dict = {
        'Movie': all_movies,
        'Year': all_years,
        'Genre': all_genres,
        'Rating': all_ratings,
        'Url': all_urls,
        'Synopsis': all_synopses,
        'Downloads': all_downloadeds
    }
    return movies_dict

The `website_scrap()` function above gets the lists of details (`movies`, `years`, `genres`, `ratings`, `urls`, `synopses` and `downloadeds`) from multiple pages and adds each one to the corresponding larger list (`all_movies`, `all_years`, `all_genres`, `all_ratings`, `all_urls`, `all_synopses` and `all_downloadeds`).

A dictionary `movies_dict` is then created from these larger lists. 
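
Each field is collected by a separate `find_all` call, so if a single movie lacks one of the tags, the lists can end up with different lengths. `pd.DataFrame` raises a `ValueError` for a dictionary of lists with unequal lengths, so it can help to check first. The helper below is a hypothetical sketch, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in a dict of columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return True


good = {'Movie': ['Gunda', 'It Boy'], 'Year': ['2020', '2013']}
bad = {'Movie': ['Gunda', 'It Boy'], 'Year': ['2020']}

print(check_equal_lengths(good))  # True
try:
    check_equal_lengths(bad)
except ValueError as error:
    print(error)
```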

#### Combining all of them together into a single cell

In [41]:
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

def get_movie_titles(doc):
    movie_title_tags = doc.find_all('a', class_ ='text--bold palewhite title')
    movie_titles = []
    for tag in movie_title_tags:
        movie_titles.append(tag.text)
    return movie_titles


def get_movie_years(doc):
    movie_year_tags = doc.find_all('span', class_ = 'text--gray year')
    movie_years =[]
    for tag in movie_year_tags:
        movie_years.append(tag.text)
    return movie_years


def get_movie_genres(doc):
    genre_tags = doc.find_all('h4', class_ = 'genre')
    movie_genres = []
    for tag in genre_tags:
        movie_genres.append(tag.text)
    return movie_genres


def get_movie_ratings(doc):
    rating_tags= doc.find_all('h4', class_ = 'rating')
    movie_ratings = []
    for tag in rating_tags:
        movie_ratings.append(tag.text)
    return movie_ratings


def get_movie_urls(doc):
    movie_url_tags = doc.find_all('a', class_ ='text--bold palewhite title')
    movie_urls = []
    base_url = 'https://yts.rs'
    for tag in movie_url_tags:
        movie_urls.append(base_url + tag['href'])
    return movie_urls    
    


def get_synopsis(doc):
    synopses =[]
    urls = get_movie_urls(doc)
    for url in urls:
        movie_doc = get_doc(url)
        div_tag = movie_doc.find_all('div', class_ = 'synopsis col-sm-10 col-md-13 col-lg-12')
        p_tags = div_tag[0].find_all('p')
        synopsis = p_tags[0].text
        synopses.append(synopsis)
    return synopses      
    
    


def get_downloaded(doc):
    downloadeds = []
    urls = get_movie_urls(doc)
    for url in urls:
        movie_doc = get_doc(url)
        div_tag = movie_doc.find_all('div', class_ = 'synopsis col-sm-10 col-md-13 col-lg-12')
        p_tags = div_tag[0].find_all('p')
        em_tag = p_tags[1].find_all('em')
        download = em_tag[0].text
        regex = re.compile('[^0-9]')
        downloaded = regex.sub('',download)
        downloadeds.append(downloaded)
    return downloadeds
    
    
def get_doc(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc


    
def scrap_page(url):
    doc = get_doc(url)
    movies,years,genres,ratings,urls,synopses,downloadeds=[],[],[],[],[],[],[]
    
    movies = get_movie_titles(doc)
    years = get_movie_years(doc)
    genres = get_movie_genres(doc)
    ratings = get_movie_ratings(doc)
    urls = get_movie_urls(doc)
    synopses = get_synopsis(doc)
    downloadeds = get_downloaded(doc)
    
    return movies,years,genres,ratings,urls,synopses,downloadeds
    


    
def website_scrap():
    all_movies,all_years,all_genres,all_ratings,all_urls,all_synopses,all_downloadeds = [],[],[],[],[],[],[]
    for i in range(1,21):
        url = 'https://yts.rs/browse-movies?page={}'.format(i)
        movies,years,genres,ratings,urls,synopses,downloadeds = scrap_page(url)
        all_movies += movies
        all_years += years
        all_genres += genres
        all_ratings += ratings
        all_urls += urls
        all_synopses += synopses
        all_downloadeds += downloadeds
        
    movies_dict = {
        'Movie': all_movies,
        'Year': all_years,
        'Genre': all_genres,
        'Rating': all_ratings,
        'Url': all_urls,
        'Synopsis': all_synopses,
        'Downloads': all_downloadeds
    }
    
    movies_df = pd.DataFrame(movies_dict) # Creates a DataFrame from the dictionary and stores it in 'movies_df'
    movies_df.to_csv('movies_data.csv', index=False) # Saves 'movies_df' as a .csv file, without the index column
    return movies_df 


In [42]:
website_scrap()

Unnamed: 0,Movie,Year,Genre,Rating,Url,Synopsis,Downloads
0,Gunda,2020,Documentary,7.4 / 10,https://yts.rs/movie/gunda-2020,Documentary looks at the daily life of a pig ...,202
1,Bright: Samurai Soul,2021,ActionAdventure,5.6 / 10,https://yts.rs/movie/bright-samurai-soul-2021,Set in Japan during the end of the Shogunate ...,2323
2,It Boy,2013,ComedyRomance,6.4 / 10,https://yts.rs/movie/it-boy-2013,The thirty-eight year-old ambitious and worka...,3434
3,The Bingo Long Traveling All-Stars & Motor Kings,1976,ComedySport,6.8 / 10,https://yts.rs/movie/the-bingo-long-traveling-...,Tired of being treated like a slave by team o...,202
4,The Merry Heirs,1933,ComedyRomance,6.1 / 10,https://yts.rs/movie/the-merry-heirs-1933,A young salesman may inherit a wine-estate on...,101
...,...,...,...,...,...,...,...
395,Lift,2021,DramaHorror,7.5 / 10,https://yts.rs/movie/lift-2021,A usual working day turns unusual for Guru an...,24139
396,2 or 3 Things I Know About Her,1967,ComedyDrama,6.8 / 10,https://yts.rs/movie/2-or-3-things-i-know-abou...,"In this film, 'Her' refers to both Paris, the...",3838
397,Under Wraps,2021,FamilyFantasy,4.7 / 10,https://yts.rs/movie/under-wraps-2021,"Friends Marshall, Gilbert, and Amy accidental...",34239
398,None's A Ton: A Turkuaz Live Concert Film,2020,Music,0 / 10,https://yts.rs/movie/nones-a-ton-a-turkuaz-liv...,A live concert film from Brooklyn-based power...,3939
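
The `Rating` and `Downloads` columns are stored as strings. For later analysis it may be handy to convert them to numbers. A minimal sketch, using hypothetical helper names and values copied from the output above:

```python
def parse_rating(rating_text):
    """Convert a string such as '7.4 / 10' to the float 7.4."""
    return float(rating_text.split('/')[0])


def parse_downloads(download_text):
    """Convert a digit string such as '202' to an integer."""
    return int(download_text)


print(parse_rating('7.4 / 10'))  # 7.4
print(parse_rating('0 / 10'))    # 0.0
print(parse_downloads('24139'))  # 24139
```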


### Summary

Here's what we have done so far:

1. Downloaded a web page using `requests.get()` and the url of the web page.
2. Parsed the HTML source code of the page using BeautifulSoup and created an object `doc` of type `BeautifulSoup`.
3. Defined a function to create a `doc` object for each URL.
4. Defined functions to extract movie details such as `movies`, `years`, `genres`, `ratings`, `urls`, `synopses` and `downloadeds` from each page.
5. Compiled the extracted information into Python lists and dictionaries.
6. Created a pandas DataFrame object to show the extracted information in tabular form.
7. Converted and saved the DataFrame to a .csv file.

### Future Work

1. Analyze the data collected in the .csv file to gain insights and create visualizations.
2. Maybe scrape an e-commerce or stock prices website.