# Film Remake Data
## Description
This notebook uses the Python libraries pandas, BeautifulSoup, and urllib to scrape raw data from Wikipedia and create clean files for use in the database notebook.

### Import Dependencies & Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import urllib.request
from bs4 import BeautifulSoup
from pprint import pprint

### Specify the URLs
Because the list is so long, Wikipedia splits its list of film remakes across two web pages: titles starting with A-M are on one page and titles starting with N-Z are on the other.


In [2]:
wiki_A_M = "https://en.wikipedia.org/wiki/List_of_film_remakes_(A%E2%80%93M)"
wiki_N_Z = "https://en.wikipedia.org/wiki/List_of_film_remakes_(N%E2%80%93Z)"
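
The `%E2%80%93` sequence in both URLs is the percent-encoded form of the en dash in the page titles. As a small illustration (the variable names here are only for this sketch and are not used elsewhere in the notebook), the standard library can decode it:

```python
from urllib.parse import quote, unquote

# Last path segment of the A-M URL, exactly as it appears above
encoded_title = "List_of_film_remakes_(A%E2%80%93M)"

# unquote() turns the percent-encoded UTF-8 bytes back into characters
decoded_title = unquote(encoded_title)
print(decoded_title)

# quote() goes the other way for the en dash (U+2013)
encoded_dash = quote("\u2013")
print(encoded_dash)
```

Decoding the title this way makes it easier to confirm that the URL points at the intended page before requesting it.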

### Scrape the A-M list first

In [3]:
#Request the A-M page and parse the returned HTML with the lxml parser
page_A_M = urllib.request.urlopen(wiki_A_M)
soup_A_M = BeautifulSoup(page_A_M, "lxml")

### Inspect the underlying HTML

In [4]:
print(soup_A_M.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of film remakes (A–M) - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XhY1SQpAIEIAABcUzgUAAABF","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_film_remakes_(A–M)","wgTitle":"List of film remakes (A–M)","wgCurRevisionId":929777349,"wgRevisionId":929777349,"wgArticleId":6963455,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short descrip
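
Printing the whole prettified page produces a lot of output. A lighter way to confirm that the right page was parsed might look like the sketch below. It runs on a trimmed, hand-made HTML sample rather than the live page, and it assumes the built-in `html.parser` instead of `lxml` so that it has no extra dependency; `sample_html`, `sample_soup`, and the other names are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the real page, based on the output above
sample_html = """<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of film remakes (A–M) - Wikipedia
  </title>
 </head>
 <body>
  <table class="wikitable"></table>
 </body>
</html>"""

# Assumption: html.parser is swapped in for lxml to keep the sketch standalone
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The page title is usually enough to confirm which page was fetched
page_title = sample_soup.title.get_text(strip=True)

# Count the tables before printing any of them
table_count = len(sample_soup.find_all("table"))

# Print only the first few hundred characters of the prettified markup
preview = sample_soup.prettify()[:200]

print(page_title)
print(table_count)
print(preview)
```

The same three checks should work unchanged on `soup_A_M`.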

### Search for desired HTML table
Use BeautifulSoup to retrieve every table element on the page.

In [5]:
#Find all instances with <table> tag
all_tables_A_M = soup_A_M.find_all("table")

#check data type
print(type(all_tables_A_M))

all_tables_A_M

<class 'bs4.element.ResultSet'>


[<table class="wikitable">
 <tbody><tr>
 <th style="width:450px;">Remakes</th>
 <th style="width:450px;">Original version
 </th></tr>
 <tr>
 <td><i><a href="/wiki/13_(2010_film)" title="13 (2010 film)">13</a></i> (2010) dir. <a href="/wiki/G%C3%A9la_Babluani" title="Géla Babluani">Géla Babluani</a>
 </td>
 <td><i><a href="/wiki/13_Tzameti" title="13 Tzameti">13 Tzameti</a></i> (2005) dir. <a href="/wiki/G%C3%A9la_Babluani" title="Géla Babluani">Géla Babluani</a>
 </td></tr>
 <tr>
 <td><i><a href="/wiki/The_13th_Letter" title="The 13th Letter">The 13th Letter</a></i> (1951) dir. <a href="/wiki/Otto_Preminger" title="Otto Preminger">Otto Preminger</a>
 </td>
 <td><i><a href="/wiki/Le_Corbeau" title="Le Corbeau">Le Corbeau</a></i> (1943) dir. <a href="/wiki/Henri-Georges_Clouzot" title="Henri-Georges Clouzot">Henri-Georges Clouzot</a>
 </td></tr>
 <tr>
 <td><i><a href="/wiki/101_Dalmatians_(1996_film)" title="101 Dalmatians (1996 film)">101 Dalmatians</a></i> (1996) dir. <a href="/wiki/St

In [6]:
#Isolate the chosen table by its class attribute

right_tables_A_M = soup_A_M.find_all("table", class_="wikitable")

#check data type
print(type(right_tables_A_M))

right_tables_A_M

<class 'bs4.element.ResultSet'>


[<table class="wikitable">
 <tbody><tr>
 <th style="width:450px;">Remakes</th>
 <th style="width:450px;">Original version
 </th></tr>
 <tr>
 <td><i><a href="/wiki/13_(2010_film)" title="13 (2010 film)">13</a></i> (2010) dir. <a href="/wiki/G%C3%A9la_Babluani" title="Géla Babluani">Géla Babluani</a>
 </td>
 <td><i><a href="/wiki/13_Tzameti" title="13 Tzameti">13 Tzameti</a></i> (2005) dir. <a href="/wiki/G%C3%A9la_Babluani" title="Géla Babluani">Géla Babluani</a>
 </td></tr>
 <tr>
 <td><i><a href="/wiki/The_13th_Letter" title="The 13th Letter">The 13th Letter</a></i> (1951) dir. <a href="/wiki/Otto_Preminger" title="Otto Preminger">Otto Preminger</a>
 </td>
 <td><i><a href="/wiki/Le_Corbeau" title="Le Corbeau">Le Corbeau</a></i> (1943) dir. <a href="/wiki/Henri-Georges_Clouzot" title="Henri-Georges Clouzot">Henri-Georges Clouzot</a>
 </td></tr>
 <tr>
 <td><i><a href="/wiki/101_Dalmatians_(1996_film)" title="101 Dalmatians (1996 film)">101 Dalmatians</a></i> (1996) dir. <a href="/wiki/St
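
To make the difference between the two searches easier to see, here is a minimal sketch on a hand-made sample instead of the live page. The second table and its `navbox` class are invented for illustration, and `html.parser` is used in place of `lxml` only to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one data table and one unrelated table
sample_html = """
<table class="wikitable sortable">
 <tr><th>Remakes</th><th>Original version</th></tr>
</table>
<table class="navbox">
 <tr><td>Navigation links</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Without a filter, every <table> on the page is returned
all_tables = soup.find_all("table")

# class_ matches any one of the element's CSS classes,
# so "wikitable sortable" still matches "wikitable"
wikitables = soup.find_all("table", class_="wikitable")

# A ResultSet behaves like a list: index it to get a single Tag
first_table = wikitables[0]
header_cells = [th.get_text(strip=True) for th in first_table.find_all("th")]

print(len(all_tables), len(wikitables))
print(header_cells)
```

Because the result is list-like, methods such as `find_all()` have to be called on its individual elements, not on the ResultSet itself.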

### Loop through the table rows to extract the data

In [7]:
remake = []
original = []

#find_all() returns a ResultSet (a list of tables), so iterate over it directly.
#Calling find_all() on the ResultSet itself raised:
#AttributeError: ResultSet object has no attribute 'find_all'
for table in right_tables_A_M:
    for row in table.find_all("tr"):
        #Collect the data cells of this row; header rows contain only <th> cells
        cells = row.find_all("td")
        if len(cells) == 2:
            #Keep the first text node of each cell, which is typically the film title
            remake.append(cells[0].find(text=True))
            original.append(cells[1].find(text=True))
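
The row loop can be checked offline before running it against the full page. The sketch below is one possible version and not necessarily the approach the notebook ends up using. It assumes a two-row sample trimmed from the output above, the built-in `html.parser` in place of `lxml`, and a hypothetical helper named `extract_pairs`. It iterates over the ResultSet instead of calling `find_all()` on it.

```python
from bs4 import BeautifulSoup

# Two rows trimmed from the wikitable output shown earlier
SAMPLE_HTML = """
<table class="wikitable">
 <tbody><tr>
 <th>Remakes</th>
 <th>Original version</th></tr>
 <tr>
 <td><i><a href="/wiki/13_(2010_film)">13</a></i> (2010) dir. Géla Babluani</td>
 <td><i><a href="/wiki/13_Tzameti">13 Tzameti</a></i> (2005) dir. Géla Babluani</td></tr>
 <tr>
 <td><i><a href="/wiki/The_13th_Letter">The 13th Letter</a></i> (1951) dir. Otto Preminger</td>
 <td><i><a href="/wiki/Le_Corbeau">Le Corbeau</a></i> (1943) dir. Henri-Georges Clouzot</td></tr>
 </tbody>
</table>
"""


def extract_pairs(html):
    """Return (remake, original) title pairs from every wikitable in html."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Iterate over the ResultSet; each item is a single table Tag
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            # Header rows have no <td> cells, so they are skipped here
            if len(cells) == 2:
                # string= is the current name for the older text= argument;
                # find() returns the first text node, here the linked title
                remake_title = str(cells[0].find(string=True))
                original_title = str(cells[1].find(string=True))
                pairs.append((remake_title, original_title))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for remake_title, original_title in pairs:
    print(remake_title, "<-", original_title)
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=["remake", "original"])` if a table is more convenient for the later cleaning steps.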