# Web Crawler

The main goal of this project is to build a web crawler that scrapes the text related to American movies released between 2000 and 2016. Along with the text, we will also download the movie posters. The major deliverables of this project are:

1. Text (or plot) of American movies released between 2000 and 2016
2. Movie release posters

All the data will be obtained from Wikipedia.


## Design

Wikipedia maintains a list of the movies released in each year. The list of American movies released in a given year is available at https://en.wikipedia.org/wiki/List_of_American_films_of_XXXX, where XXXX is the year. For each year from 2000 to 2016 (XXXX = 2000 to 2016), we visit that year's URL in turn to obtain the movie URLs, along with movie details such as cast, director, and genre. Once the URLs of all the movies are obtained, the web crawler visits each movie URL to scrape the plot of the movie.
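As a small illustration of the URL scheme described above, the yearly list pages can be generated from a template. The names `LIST_URL_TEMPLATE` and `year_list_urls` are hypothetical and do not appear in the notebook's own code; only the URL pattern comes from the text.

```python
# Hypothetical helper: only the URL pattern itself is taken from the text above.
LIST_URL_TEMPLATE = "https://en.wikipedia.org/wiki/List_of_American_films_of_{year}"

def year_list_urls(first_year=2000, last_year=2016):
    """Return one list-page URL per year, with both end years included."""
    return [LIST_URL_TEMPLATE.format(year=y)
            for y in range(first_year, last_year + 1)]

urls = year_list_urls()
print(len(urls))   # number of list pages to visit
print(urls[0])     # first list page
print(urls[-1])    # last list page
```

Note that `range()` excludes its upper bound, which is why the sketch adds 1 to `last_year`.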

The following flowchart will provide an overview of the web crawling process:

<img src="crawler_logic.png">

 **Figure-1: Web crawling process**

The whole web crawling process is divided into two steps. In STEP-1, we visit Wikipedia to obtain a list of the URLs of all American movies released between 2000 and 2016. In STEP-2, we visit each of the URLs obtained in STEP-1 to download the movie text and release poster. There must be a delay of 2 to 3 seconds between successive requests to the Wikipedia website.
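One possible way to implement the 2 to 3 second pause is a small helper that picks a random delay in that range. This is only a sketch: the notebook itself uses a fixed `time.sleep(3)`, and the name `polite_delay` and its `sleep` parameter are assumptions introduced here for illustration.

```python
import random
import time

def polite_delay(min_seconds=2.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random time between min_seconds and max_seconds.

    The sleep argument is a hypothetical hook so the pause can be
    replaced (for example in a dry run); it defaults to time.sleep.
    Returns the number of seconds that was requested.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Dry run: record the requested delay instead of actually waiting.
requested = []
polite_delay(sleep=requested.append)
print(requested)
```

In a crawl loop the helper would be called once after each request, with the default `sleep` left in place.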

The output of STEP-1 is a comma separated file (Movie_Details.csv), with the following details: 

**Movie** - Movie Name

**URL** - Wikipedia web page for the movie

**Year** - Year of release

**Director** - Director of the movie

**Cast** - Cast of the movie

**Genre** - Movie's genre

**Movie_ID** - Unique key to distinguish each movie

The output of STEP-2 will be a set of text files and a set of image files, each named using the Movie_ID. Naming the files by Movie_ID lets us uniquely and easily identify the text and image files that belong to a movie.
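The naming rule can be sketched as a function from a Movie_ID to the two output paths. The directory names `data` and `images` follow the ones used later in this notebook; the function name `output_paths` is hypothetical.

```python
import os

def output_paths(movie_id, text_dir="data", image_dir="images"):
    """Return (text_path, image_path) for one movie, named by its Movie_ID."""
    text_path = os.path.join(text_dir, "{}.txt".format(movie_id))
    image_path = os.path.join(image_dir, "{}.jpg".format(movie_id))
    return text_path, image_path

text_path, image_path = output_paths(42)
print(text_path)
print(image_path)
```

Because the ID is the only varying part of the name, the text and poster of a movie can always be paired again by stripping the extension.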

# Implementation

### Import the required packages
Let us import all the required packages:

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display # Allows the use of display() for DataFrames
import time
import pickle #To save the objects that were created using webscraping
import pprint
from lxml import html
import requests
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from urllib.request import urlopen
from bs4 import BeautifulSoup

import urllib
import os
#import io

### STEP-1 Implementation

In STEP-1 of the web crawling process, we obtain the list of all movies released between 2000 and 2016, along with their URLs.

The challenge is the format of the HTML pages. Wikipedia uses one table format for the movies released between 2000 and 2013, and another for the movies released in 2014-2016. So we sub-divide STEP-1 further: phase 1 scrapes the list of movies for 2000-2013 and phase 2 scrapes it for 2014-2016.

In [60]:
##PHASE-1: Get the movies and URLs for the years 2000-2013
#Define the lists to hold the details of the movies
URL = list()
Movie_Name = list()
Director = list()
Cast = list()
Genre = list()
year = list()

#Iterate over the years 2000 to 2013 (range() excludes the upper bound 2014).
for y in range(2000,2014):
    
    #Prepare the URL String and open the URL
    url = "https://en.wikipedia.org/wiki/List_of_American_films_of_"+str(y)
    #Use the name "page" so that the lxml "html" module imported above is not shadowed
    page = urlopen(url)
    
    #Mandatory wait of 3 seconds
    time.sleep(3)
    
    #Parse the web page into a Beautiful Soup object
    #(name the parser explicitly; lxml is imported at the top of the notebook)
    bs = BeautifulSoup(page, "lxml")
    
    #Parse and get the required data
    for table in bs.find_all('table', {"class":"wikitable"}):
        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 4:
                Movie_Name.append(columns[0].get_text())
                Director.append(columns[1].get_text())
                Cast.append(columns[2].get_text())
                Genre.append(columns[3].get_text())
                year.append(y)
                
                #Handle exceptions, so that the process continues
                try:
                    a = columns[0].find('a',href=True)['href']
                    URL.append("https://en.wikipedia.org"+a)
                except:
                    URL.append("NA")
                    continue
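The link handling in the loop above (prefix the relative `href` with the Wikipedia host, or store "NA" when the title cell has no link) can also be written as a small standalone function. This sketch uses `urllib.parse.urljoin` from the standard library instead of string concatenation, which is a substitution and not what the notebook does; the names `WIKI_ROOT` and `movie_url` are hypothetical.

```python
from urllib.parse import urljoin

WIKI_ROOT = "https://en.wikipedia.org"

def movie_url(href):
    """Turn a relative Wikipedia href into an absolute URL.

    Returns "NA" when href is None or empty, mirroring the
    placeholder used in the scraping loop.
    """
    if not href:
        return "NA"
    return urljoin(WIKI_ROOT, href)

print(movie_url("/wiki/102_Dalmatians"))
print(movie_url(None))
```

One difference from plain concatenation is that `urljoin` leaves an already absolute `href` unchanged.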


                

In [33]:
##PHASE-2: Get the movies details for the years 2014 to 2016
#Declare the lists
URL1 = list()
Movie_Name1 = list()
Director1 = list()
Cast1 = list()
Genre1 = list()
year1 = list()

#For the years between 2014 and 2016
for y in range(2014,2017):          
    
    #Prepare wiki URL
    url = "https://en.wikipedia.org/wiki/List_of_American_films_of_"+str(y)
    
    #Exception handling to ignore the failures and continue processing
    try: 
        page = urlopen(url)
    except Exception:
        print("Problem with the following URL, continuing:")
        print(url)
        continue
    #Sleep for 3 secs    
    time.sleep(3)
    
    #Parse the web page into a Beautiful Soup object
    #(name the parser explicitly; lxml is imported at the top of the notebook)
    bs = BeautifulSoup(page, "lxml")
    for table in bs.find_all('table', {"class":"wikitable"}):
        for row in table.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 3: #To make sure that we are accessing the movies tables only
                if len(columns) == 6:
                    #print(columns[0].get_text())
                    Movie_Name1.append(columns[0].get_text())
                    Director1.append(columns[1].get_text())
                    Cast1.append(columns[2].get_text())
                    Genre1.append(columns[3].get_text())
                    year1.append(y)
                    try:
                        a=columns[0].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue
                
                if len(columns) == 7:
                    #print(columns[1].get_text())
                    Movie_Name1.append(columns[1].get_text())
                    Director1.append(columns[2].get_text())
                    Cast1.append(columns[3].get_text())
                    Genre1.append(columns[4].get_text())
                    year1.append(y)
                    
                    try:
                        a=columns[1].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue

                if len(columns) > 7:
                    #print("col len:{}".format(len(columns)))
                    #print(columns[2].get_text())
                    Movie_Name1.append(columns[2].get_text())
                    Director1.append(columns[3].get_text())
                    Cast1.append(columns[4].get_text())
                    Genre1.append(columns[5].get_text())
                    year1.append(y)                    
                    try:
                        a=columns[2].find('a',href=True)['href']
                        URL1.append("https://en.wikipedia.org"+a)
                    except:
                        URL1.append("NA")
                        continue
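The three branches above differ only in where the title column starts: index 0 for rows with 6 cells, index 1 for 7 cells, and index 2 for more than 7 cells. A compact sketch of that rule, working on plain lists of cell texts, is shown below. The function names `title_column` and `extract_fields` are hypothetical, and the sample row is made up for illustration.

```python
def title_column(n_columns):
    """Return the index of the title cell for a row with n_columns cells.

    Mirrors the branches of the phase-2 loop; returns None for rows
    that the loop skips.
    """
    if n_columns == 6:
        return 0
    if n_columns == 7:
        return 1
    if n_columns > 7:
        return 2
    return None

def extract_fields(cells):
    """Pick title, director, cast and genre from a list of cell texts."""
    start = title_column(len(cells))
    if start is None:
        return None
    movie, director, cast, genre = cells[start:start + 4]
    return {"Movie": movie, "Director": director, "Cast": cast, "Genre": genre}

# Illustrative 7-cell row: one leading cell precedes the title.
row = ["x", "Some Film", "Some Director", "Some Cast", "Drama", "", ""]
print(extract_fields(row))
```

Expressing the offset once keeps the three cases from drifting apart if the table layout changes again.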
                    

## Save the results
We will save the movie details as a CSV file named Movie_Details.csv, so that STEP-1 does not have to be run again.

In [None]:
#Create a data frame:
df = pd.DataFrame(list(zip(Movie_Name+Movie_Name1,URL+URL1,year+year1,Director+Director1,Cast+Cast1,Genre+Genre1)),\
                  columns=["Movie","URL","Year","Director","Cast","Genre"])

#Remove the rows which do not have URL information
df = df[df["URL"] != "NA"]

df["Movie_ID"] = df.index + 1

#Write the file (the name must match the one read back in the next cell)
df.to_csv("Movie_Details.csv",encoding='utf-8',index=False)

## Read the saved file to a data frame.
We will read the saved file back into a data frame.

In [2]:
URL = pd.read_csv("Movie_Details.csv")

display(URL.head())
display(URL.tail())

URL.shape


Unnamed: 0,Movie,URL,Year,Director,Cast,Genre,Movie_ID
0,102 Dalmatians,https://en.wikipedia.org/wiki/102_Dalmatians,2000,Kevin Lima,"Glenn Close, Gérard Depardieu, Alice Evans","Comedy, family",1
1,28 Days,https://en.wikipedia.org/wiki/28_Days_(film),2000,Betty Thomas,"Sandra Bullock, Viggo Mortensen",Drama,2
2,3 Strikes,https://en.wikipedia.org/wiki/3_Strikes_(film),2000,DJ Pooh,"Brian Hooks, N'Bushe Wright",Comedy,3
3,The 6th Day,https://en.wikipedia.org/wiki/The_6th_Day,2000,Roger Spottiswoode,"Arnold Schwarzenegger, Robert Duvall",Science fiction,4
4,Across the Line,https://en.wikipedia.org/wiki/Across_the_Line_...,2000,Martin Spottl,"Brad Johnson, Adrienne Barbeau, Brian Bloom",Thriller,5


Unnamed: 0,Movie,URL,Year,Director,Cast,Genre,Movie_ID
4040,Inferno,https://en.wikipedia.org/wiki/Inferno_(2016_film),2016,Ron Howard,,,4041
4041,Friend Request,https://en.wikipedia.org/wiki/Friend_Request,2016,Simon Verhoeven,Alycia Debnam-Carey,Horror,4042
4042,Rogue One: A Star Wars Story (film),https://en.wikipedia.org/wiki/Rogue_One,2016,Felicity Jones,Diego Luna,Sci-Fi,4043
4043,The Founder,https://en.wikipedia.org/wiki/The_Founder_(film),2016,John Lee Hancock,Michael Keaton,,4044
4044,Rings,https://en.wikipedia.org/wiki/Rings_(2016_film),2016,F. Javier Gutiérrez,,,4045


(4045, 7)

There are 4045 movie URLs to be scraped from Wikipedia. Let us do this in batches. Our goal is to scrape the poster image of each movie (if one exists), along with the plot and the introductory text.

### STEP-2 Implementation

In STEP-2 we will use the file Movie_Details.csv, which has the list of all movie URLs, along with some other details. Each movie URL will be crawled to extract the movie text and the poster image. In STEP-2, we will build a set of functions to perform the web crawling. These functions are explained below:

## Functions
We will code the following functions to obtain the plot of each movie, along with its release poster image.

* **Open_URL(url)** Gets the HTML content, prepares a Beautiful Soup object, and returns it. The _url_ parameter is the complete URL of the web page. Returns -1 if the URL cannot be opened and -2 if the Beautiful Soup object cannot be created.

* **Get_Plot(bs)** Takes a Beautiful Soup object as input and returns the introductory paragraph followed by the text of the _Plot_ section. If an error occurs, a negative code is returned instead:
    * Return code = -1: an error occurred while extracting the first paragraph of the HTML document
    * Return code = -2: there is NO _Plot_ section in the document
    * Return code = -3: an error occurred while joining the first paragraph and the plot text

* **Get_All_Text(bs)** This function is called only when **Get_Plot(bs)** returns -2. It takes a Beautiful Soup object as input and returns all the text present in < p > tags. It returns -1 if any error occurs.

* **Save_Text_File(text,text_file_name)** Saves the text string (_text_) as a text file with the name contained in *text_file_name*. The file is saved into the _data_ directory. Returns 0 if successfully saved, -1 when changing to the _data_ sub-directory fails, and -2 when changing back to the parent directory from _data_ fails.

* **Get_And_Save_Image(bs,image_file_name)** Gets the movie poster (image file) and saves it in the _images_ directory. It takes a Beautiful Soup object as input and extracts the image URL. The image is downloaded and saved to the _images_ directory under the file name given in *image_file_name*. Returns 0 if successfully downloaded and saved, -1 if the image is not found, and -2 if an error occurs when saving the image.

* **Write_Error(url,msg,file)** Appends an error/warning message (contained in the parameter _msg_) that was raised while parsing the URL (present in the _url_ parameter). The _file_ parameter contains the name of the error file.

The source code of these functions is given below:

In [3]:
def Open_URL(url):
    '''
    Gets the HTML content, prepares Beautiful Soup object and 
    returns the Beautiful Soup object. 
    The url parameter represents the complete URL of the webpage.
    Returns -1 if the URL cannot be opened and -2 if the
    Beautiful Soup object cannot be created.
    '''
    try:
        page = urlopen(url)
    except Exception:
        return -1
    try:
        #Name the parser explicitly; lxml is imported at the top of the notebook
        bs = BeautifulSoup(page, "lxml")
        return bs
    except Exception:
        return -2

def Get_Plot(bs):
    """
    Takes a beautiful soup object as input.
    Extracts the introductory text, and the text in the section plot.
    Returns the extracted text (if NO error, else returns a negative error code).
    -1: If an error occurred while extracting the first paragraph in HTML doc
    -2: If there is NO Plot section in the document
        (the caller then falls back to Get_All_Text)
    -3: If an error occurred while joining the first paragraph and the plot text
    """
    try:
        p = bs.find("p")
        initial_paragraph = p.getText()
    except:
        return -1

    # collect plot in this list
    plot = []
    
    # find the node with id of "Plot"
    try:
        mark = bs.find(id="Plot")
        # walk through the siblings of the parent (H2) node 
        # until we reach the next H2 node
        # (next_siblings is the current name of nextSiblingGenerator)
        for elt in mark.parent.next_siblings:
            if elt.name == "h2":
                break
            if hasattr(elt, "text"):
                plot.append(elt.text)
    except Exception:
        return -2
    
    try:
        plot="".join(plot)
        text = initial_paragraph + plot
        return text
    except:
        return -3    
    

    
def Get_All_Text(bs):
    try:
        p = bs.find_all("p")
        l = list()
        for i in p:
            l.append(i.getText())
        return " ".join(l)
    except:
        return -1

def Save_Text_File(text,text_file_name):
    try:
        os.chdir("./data")
    except:
        return -1
    
    with open(text_file_name, 'w',encoding='utf-8') as f:
        f.write(text)
    
    try:
        os.chdir("..")
        return 0
    except:
        return -2

def Get_And_Save_Image(bs,image_file_name):    
    try: 
        img=bs.findAll("img",{"class":"thumbborder"})
        img_URL="https:"+img[0]['src']
    except:
        return -1
     
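    try:
        #Save by path instead of os.chdir(), so that a failed download
        #cannot leave the process stranded inside ./images
        urllib.request.urlretrieve(img_URL,os.path.join("images",image_file_name))
        return 0
    except Exception:
        return -2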

def Write_Error(url,msg,file):
    with open(file,'a') as f:
        f.write("\n"+msg)
        f.write("\n"+url)
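As an alternative to changing the working directory before each write, a file can be saved by building its full path. The sketch below illustrates that approach; it is not the notebook's `Save_Text_File`, its return codes are simplified to 0 and -1, and the name `save_text` and the `directory` parameter are hypothetical.

```python
import os

def save_text(text, file_name, directory="data"):
    """Write text to directory/file_name without calling os.chdir().

    Creates the directory if needed. Returns 0 on success and -1 if
    the file cannot be written.
    """
    try:
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, file_name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return 0
    except OSError:
        return -1
```

A call such as `save_text(text, "12.txt")` would write `data/12.txt`. Since the working directory never changes, a failure in one save cannot affect the paths used by later saves.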

## Beginning the crawling process

The following code block crawls the movie text and posters from Wikipedia. It ran for approximately 7 hours, so do NOT execute it unless you really want to start the download process. Its output is a series of text files and image files: the text files are saved to the _data_ directory and the images to the _images_ directory. You can also download these files directly from the location: https://goo.gl/RDhVtf

In [57]:
tracker = 0
term=1
start = time.time() # Get start time
print("Beginning the files download...")
#k = 1
for movie, url, year,Movie_ID in zip(list(URL["Movie"]),list(URL["URL"]),list(URL["Year"]),list(URL["Movie_ID"])):
    #if k < 30:
    #    k = k+1
    #    continue
    #print("{},{},{}".format(movie,url,year))
    #Open the URL
    
    bs=Open_URL(url)

    if bs == -1:
        Write_Error(url,"Error in opening the URL","error.txt")
        #print("Error in opening the URL: {}".format(url))
        continue
        

    if bs == -2:
        Write_Error(url,"Error in the creation of bs object for the URL","error.txt")
        #print("Error in the creation of bs object for the URL: {}".format(url))
        continue

    time.sleep(3)

    #create a name for the files
    #image_file_name = str(year)+"_"+movie.strip()+".jpg"
    #text_file_name =  str(year)+"_"+movie.strip()+".txt"
    image_file_name = str(Movie_ID)+".jpg"
    text_file_name =  str(Movie_ID)+".txt"

    text=Get_Plot(bs)
    
    if text == -1:
        Write_Error(url,"No paragraphs are found","error.txt")
        #print("No paragraphs are found")
        #print(url)
        continue

    if text == -2:
        Write_Error(url,"Warning: No Plot ID found","error.txt")
        #print("Warning: No Plot ID found")
        #print(url)
        
        text = Get_All_Text(bs)
        if text == -1:
            Write_Error(url,"No paragraphs are found","error.txt")
            #print("No paragraphs are found")
            #print(url)
            continue
        
    if text == -3:
        Write_Error(url,"Error while appending the main plot with the initial paragraph","error.txt")
        #print("Error while appending the main plot with the initial paragraph")
        #print(url)
        continue
    
    status = Save_Text_File(text,text_file_name)
    
    if status == -1:
        Write_Error(url,"Not able to change the directory to ./data","error.txt")
        #print("Not able to change the directory to ./data")
        #print(url)
        continue
    
    if status == -2:
        Write_Error(url,"Not able to change the directory to .. (parent directory) from ./data","error.txt")
        #print("Not able to change the directory to .. (parent directory) from ./data")
        #print(url)
        continue
        
    #Downloading Image files    
    status = Get_And_Save_Image(bs,image_file_name)
    
    if status == -1:
        Write_Error(url,"Not able to find the image","error.txt")
        #print("Not able to find the image")
        #print(url)
        continue

    if status == -2:
        Write_Error(url,"Not able to save the image","error.txt")
        #print("Not able to save the image")
        #print(url)
        continue

    #Check the status of the webbot    
    tracker = tracker + 1
    if (tracker % 100 == 0):
        print("Processed {} URLs".format(tracker))
        end = time.time() # Get end time
        elapsed_time = end - start
        print("Elapsed time to process 100 URLs:{} secs".format(elapsed_time))
        start = time.time() # Reset the start time for the next batch of 100
        #break

    #if term == 1:
    #    break

Beginning the files download...
Processed 100 URLs
Elapsed time to process 100 URLs:446.38204622268677 secs
Processed 200 URLs
Elapsed time to process 100 URLs:433.5029966831207 secs
Processed 300 URLs
Elapsed time to process 100 URLs:461.3205850124359 secs
Processed 400 URLs
Elapsed time to process 100 URLs:458.53809428215027 secs
Processed 500 URLs
Elapsed time to process 100 URLs:476.4255542755127 secs
Processed 600 URLs
Elapsed time to process 100 URLs:473.6112816333771 secs
Processed 700 URLs
Elapsed time to process 100 URLs:458.1378722190857 secs
Processed 800 URLs
Elapsed time to process 100 URLs:434.91878509521484 secs
Processed 900 URLs
Elapsed time to process 100 URLs:569.373973608017 secs
Processed 1000 URLs
Elapsed time to process 100 URLs:533.2878279685974 secs
Processed 1100 URLs
Elapsed time to process 100 URLs:493.94636368751526 secs
Processed 1200 URLs
Elapsed time to process 100 URLs:560.7728536128998 secs
Processed 1300 URLs
Elapsed time to process 100 URLs:519.74306

Out of 4045 movies, we obtained the text of 4037 movies successfully. However, we were able to obtain only 3749 images, since posters were not available for some of the movies.
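Because the crawl runs for hours, it can be useful to restart it without downloading again what is already on disk. The sketch below, which is not part of the notebook, lists the Movie_IDs whose text file is still missing from the _data_ directory; the function name `pending_movie_ids` is hypothetical.

```python
import os

def pending_movie_ids(movie_ids, text_dir="data"):
    """Return the IDs from movie_ids that have no <Movie_ID>.txt file yet."""
    if os.path.isdir(text_dir):
        done = set(os.listdir(text_dir))
    else:
        done = set()  # nothing downloaded yet
    return [m for m in movie_ids if "{}.txt".format(m) not in done]
```

Filtering the rows of Movie_Details.csv with the returned IDs before starting the loop would let an interrupted run continue where it stopped.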

# Data Cleansing
Now that we have downloaded the data, let us clean it to make it ready for text analytics algorithms. For the purpose of data cleaning, we created the following 3 functions:

* **Read_File(p)** Opens the input file, reads the text, converts all the text to lower case, removes the punctuation (if any), and returns a list of tokens.

* **Remove_Stop_Words(tokens)** Removes all stop words from the list of input tokens. Returns a refined list of tokens with no stop words.

* **Clean_Text(tokens)** Cleans the text by removing any text in square brackets "[...]", parentheses "(" and ")", commas, colons, apostrophes etc. Returns a single space-separated string that contains only the alphanumeric tokens.

In [4]:
def Read_File(p):
    with open(p, 'r',encoding='utf-8') as f:
        text = f.read()
    #Convert all the text to lower case
    lowers = text.lower()
    #Remove the punctuation: the third argument of str.maketrans lists
    #the characters that translate() should delete
    no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(no_punctuation)
    return tokens

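def Remove_Stop_Words(tokens):
    #Build the stop word set once instead of re-reading the list for every token
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop_words]
    return filtered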



def Clean_Text(tokens):
    text = " ".join(tokens)
    #Remove punctuation marks, text in [], (, ), :
    #[^\]]* keeps each match inside a single pair of square brackets
    filtered1 = re.sub(r'\.|\`|\'|\[[^\]]*\]|\(|\)|,|:', " ",text)
    
    #Remove any single characters
    filtered1 = re.sub('(^| ).( |$)', " ",filtered1)
    #Remove any contiguous spaces    
    filtered1 = re.sub(' +'," ",filtered1)
    
    #Include only alpha numeric tokens (fullmatch rejects tokens with other characters)
    filtered1=" ".join([i for i in filtered1.split() if re.fullmatch('[0-9a-z]+',i)])
    return filtered1
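The three cleaning ideas used by these functions (deleting punctuation, dropping bracketed text, and keeping only alphanumeric tokens) can be tried in isolation with the standard library alone. The sketch below is a simplified illustration under that assumption: it leaves out NLTK tokenization and stop word removal, and the function names are hypothetical.

```python
import re
import string

# Translation table whose third argument lists the characters to delete.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lower-case the text and delete every ASCII punctuation character."""
    return text.lower().translate(_DELETE_PUNCT)

def drop_bracketed(text):
    """Replace each [...] group (for example a citation marker) with a space."""
    return re.sub(r"\[[^\]]*\]", " ", text)

def keep_alphanumeric(tokens):
    """Keep only tokens made up entirely of digits and lower-case letters."""
    return [t for t in tokens if re.fullmatch(r"[0-9a-z]+", t)]

print(strip_punctuation("Hello, World!"))
print(drop_bracketed("released[1] in 2000[2]."))
print(keep_alphanumeric(["film", "2000", "it's", ""]))
```

Two details matter here. `str.translate` needs a table built by `str.maketrans`; passing `string.punctuation` directly does not delete anything. And `re.fullmatch` requires the whole token to match, whereas `re.search` with a pattern that can match the empty string accepts every token.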

The following code block will use the above functions to clean the text. Do NOT run this code, unless you want to test it, since it will run for some time. To save time, we saved the results.

In [None]:
import re
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords

file_names = os.listdir("./data")
#Keep only the files named <Movie_ID>.txt
file_names = [i for i in file_names if re.fullmatch(r'[0-9]+\.txt',i)]
y = list()
x = list()
k = 0
for i in file_names:
    y.append(int(i.split(".")[0]))
    #print(y)
    tokens = Read_File("./data/"+i)
    tokens = Remove_Stop_Words(tokens)
    cleaned_text = Clean_Text(tokens)
    x.append(cleaned_text)
    k = k+1
    if(k%100 == 0):
        print("Processed {} files".format(k))
    

Processed 100 files
Processed 200 files
Processed 300 files
Processed 400 files
Processed 500 files
Processed 600 files
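When selecting the `<Movie_ID>.txt` files, it helps to match the whole file name and to sort numerically, so that every selected name can be converted to an integer ID and the files are processed in ID order. The following is a minimal sketch under that assumption; the function name `movie_text_files` is hypothetical.

```python
import re

def movie_text_files(names):
    """Return the names of the form <digits>.txt, sorted by numeric ID."""
    matches = [n for n in names if re.fullmatch(r"[0-9]+\.txt", n)]
    return sorted(matches, key=lambda n: int(n.split(".")[0]))

print(movie_text_files(["10.txt", "2.txt", "error.txt", "3.jpg"]))
```

The result of `os.listdir("./data")` could be passed to this function directly. A plain alphabetical sort would place `10.txt` before `2.txt`, which is why the sort key converts the ID to an integer.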
