## A Brief Guide to Web Scraping Using Python - IMDb

#### Problem Statement: We want to fetch the 'Movie Title', 'Year' and 'Runtime' of each movie listed on [this](https://www.imdb.com/search/title/?sort=num_votes,desc&title_type=feature&year=1950,2012) webpage.

#### Solution Steps:
##### 1. Import useful libraries and classes:
        urllib.request, BeautifulSoup
##### 2. Steps:
        a. Download the HTML
        b. Parse the HTML
        c. Extract the data from the webpage
        d. Write the data to the required file format (CSV in this case).

In [53]:
# Import useful libraries
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

In [54]:
# Download the HTML
# Store the URL in a variable
my_url = "https://www.imdb.com/search/title/?sort=num_votes,desc&title_type=feature&year=1950,2012"

#Opening the URL and storing the content of it in a variable
uClient = uReq(my_url)
page_html = uClient.read()

#Close the connection after reading the information
uClient.close()
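
Some sites reject requests that carry the default `Python-urllib` user agent. Whether IMDb does so at any given time is an assumption here, not something the cell above shows. A minimal sketch of the same download with an explicit `User-Agent` header and a `with` block that closes the connection automatically; the helper names `build_request` and `fetch_html` and the user-agent string are illustrative:

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string, chosen only for illustration
DEFAULT_AGENT = "Mozilla/5.0 (compatible; imdb-tutorial)"


def build_request(url, user_agent=DEFAULT_AGENT):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes.

    The with block closes the connection even if read() raises.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


my_url = "https://www.imdb.com/search/title/?sort=num_votes,desc&title_type=feature&year=1950,2012"
req = build_request(my_url)

# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

Calling `fetch_html(my_url)` would then return the same kind of bytes object as `page_html` above.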

In [55]:
# Parse the HTML with BeautifulSoup

# The parser turns the raw HTML into a navigable tree
page_soup = soup(page_html, "html.parser")

##### Look for the tag and class you are interested in on the webpage. In our case it is a div with the class 'lister-item mode-advanced'.
##### We will store the matching elements in a list-like result set and use it to fetch the required details
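
To see how the class lookup behaves, here is a small sketch on a made-up HTML snippet (the titles and links are placeholders, not IMDb data). Passing the full string `"lister-item mode-advanced"` matches the class attribute exactly as written, while a CSS selector matches the two classes in any order:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure of the results page
html = """
<div class="lister-item mode-advanced"><h3><a href="/title/tt0000001/">First Film</a></h3></div>
<div class="mode-advanced lister-item"><h3><a href="/title/tt0000002/">Second Film</a></h3></div>
<div class="other"><h3><a href="/title/tt0000003/">Not a result</a></h3></div>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# Exact match on the class attribute string: class order matters
exact = sample_soup.find_all("div", {"class": "lister-item mode-advanced"})

# CSS selector: matches any div carrying both classes, in any order
by_selector = sample_soup.select("div.lister-item.mode-advanced")

print(len(exact), len(by_selector))
print([div.a.text for div in by_selector])
```

On the real page every result uses the same class string, so both forms should return the same containers.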

In [56]:
containers = page_soup.find_all("div", {"class": "lister-item mode-advanced"})
print("Container Length: ", len(containers))
print(containers[0])

Container Length:  50
<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0111161"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" class="loadlate" data-tconst="tt0111161" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
<p class="text-muted">
<span class="certificate">A</span>
<span class="ghost">|</span>
<span class="runtime">142 min</span>
<span class="ghost">|</span>
<sp

##### Now that we have each container, we will fetch the required information from these containers and store the data in a CSV file

In [57]:
import csv

# Destination file for the data
file = "C:\\Users\\iampu\\Documents\\Temporary files\\imdb_movies.csv"

# Open the file in write mode; newline="" lets the csv module control line endings
with open(file, "w", newline="") as f:
    writer = csv.writer(f)

    # Write the column names
    writer.writerow(["Name", "Year", "Runtime"])

    # Iterate over the containers to pull the required information
    for container in containers:
        # The alt text of the poster image holds the movie title
        name = container.img["alt"]

        # Search inside the container for the tag that holds the year of the movie
        year_movie = container.find_all("span", {"class": "lister-item-year text-muted unbold"})
        # year_movie holds the complete tags; .text keeps only the text inside the first one
        year = year_movie[0].text

        # Extract the runtime in the same way
        runtime_movie = container.find_all("span", {"class": "runtime"})
        runtime = runtime_movie[0].text

        # Write one row per movie; csv.writer quotes any field that contains a comma
        writer.writerow([name, year, runtime])

# The with block closes the file automatically
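
A title that contains a comma would spill into an extra column if the fields were joined by hand. The sketch below writes to an in-memory buffer instead of a file to show how the standard `csv` module quotes such a field. The second row is a hypothetical title invented for the demonstration:

```python
import csv
import io

rows = [
    ["The Shawshank Redemption", "(1994)", "142 min"],   # taken from the output above
    ["Hypothetical Title, Part Two", "(2000)", "100 min"],  # made-up row with a comma
]

# Joining by hand: the comma inside the title creates a fourth field
naive_line = ",".join(rows[1])
naive_fields = naive_line.split(",")

# csv.writer wraps the title in quotes so it stays one field
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Name", "Year", "Runtime"])
writer.writerows(rows)

# Reading the text back recovers the original three fields per row
read_back = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))

print(len(naive_fields))
print(read_back[2])
```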

#### Verify that the contents of the file were stored as expected using a pandas DataFrame

In [58]:
# Import pandas
import pandas as pd

# Read the file
imdb = pd.read_csv(file, encoding = "latin1")

# Display the first five rows of the file
imdb.head()

|   | Name | Year | Runtime |
|---|------|------|---------|
| 0 | The Shawshank Redemption | (1994) | 142 min |
| 1 | The Dark Knight | (2008) | 152 min |
| 2 | Inception | (2010) | 148 min |
| 3 | Fight Club | (1999) | 139 min |
| 4 | Pulp Fiction | (1994) | 154 min |
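
The Year and Runtime values are still strings such as `(1994)` and `142 min`. A possible follow-up step, sketched here on an inline sample built from the first two rows above rather than on the real file, is to pull out the digits and convert both columns to integers. The column names assume the header `Name,Year,Runtime` with no spaces after the commas:

```python
import io

import pandas as pd

# Inline sample that stands in for the CSV file written earlier
sample_csv = (
    "Name,Year,Runtime\n"
    "The Shawshank Redemption,(1994),142 min\n"
    "The Dark Knight,(2008),152 min\n"
)

df = pd.read_csv(io.StringIO(sample_csv))

# Keep the four-digit year and drop the parentheses
df["Year"] = df["Year"].str.extract(r"(\d{4})", expand=False).astype(int)

# Keep the leading number of minutes and drop the " min" suffix
df["Runtime"] = df["Runtime"].str.extract(r"(\d+)", expand=False).astype(int)

print(df.dtypes)
print(df)
```

With numeric columns, operations such as sorting by runtime or filtering by year work without further string handling.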
