## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult aspect of this project is the first: scraping the page on the BBC and, using beautiful soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with through Thursday. The HTML page on the BBC site poses a number of challenges. While the layout is relatively simple and consistent--the simplicity actually makes it a little bit harder, because there's not that many HTML tags to help you isolate each unit of data--you can use beautiful soup to isolate the line that contains all the information for the critic, and you can isolate each group of top 10 movies as well. You need to, and this is a bit harder, use beautiful soup find the critic--as well as the list of movies that immediately follow them. (Using beautiful soup to do that is challenging--I have instructions on how to figure it out, but if you can't figure it out--just DM me on Slack and I will help you!)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, Slack me and I will help  get your code working so you can move on to the next step.


### Getting started: Data Architecture

The central challenge of this project it's figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be and each row? How can you set it up so that you have the most useful table possible. 

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 

You will eventually want to bring this into pandas so you want to keep your table simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.

### Interpretive Architecture
**REMEMBER: secondary source** Part of the steps this week, is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own--it will be hard for you to find a single page that has a list of every single director. But see what you can find. In the end, you don't have to have a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point-of-view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point-of-view to this subject? You can certainly narrow your focus to a specific country, the group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.

### Ready to code?

The first thing you need to do is import beautiful soup & requests like we did in the homework, and scrape the page. http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films-who-voted


One thing I should note there are two inconsistencies (actual errors in the HTML) that will cause you to lose a couple entries (which is okay but may be frustrating). I have posted a version of the exact same page with those inconsistencies fixed, if you want to scrape from that page: 

http://floatingmedia.com/columbia/BBC.html

It's up to you. Okay let's begin!

STEP 1:


In [8]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import requests
from bs4 import BeautifulSoup
import re 

In [9]:
# read the URL, and put the HTML page into beautiful soup
url = 'https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted'
raw_html = requests.get(url).content 
soup_doc = BeautifulSoup(raw_html, "html.parser")
print(type(soup_doc))
print(soup_doc.prettify())


<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html class="no-js b-pw-1280 b-reith-sans-font b-reith-serif-font b-reith-serif-loaded b-reith-sans-loaded" lang="en">
 <head>
  <meta content="IE=edge" data-rh="true" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-rh="true"/>
  <meta content="width=device-width, initial-scale=1" data-rh="true" name="viewport"/>
  <meta content="I0_h0oRDIOewanIX2SQEN0-dHGdH6uBQT7V9l_WqRY8" data-rh="true" name="google-site-verification"/>
  <meta content="culture,ARTICLE,story,the-100-greatest-films-of-the-21st-century" data-rh="true" name="keywords"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." data-rh="true" name="description"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" data-rh="true" property="og:title"/>
  <meta content="article" data-rh="true" property="og:type"/>
  <meta content="https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted" d

In [10]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 

all_info = soup_doc.find(class_='body-text-card')

**STEP 2** Here is where it begins to get tricky: obviously at this point everything we want is surrounded in `<p>` tags. Use a beautiful soup find_all to get a list of every thing in `<p>` tag. Make a variable that contains that list (you could call it all_p or something)


In [11]:
#find_all



find_p = soup_doc.find(class_='body-text-card').find_all('p')[1:-3]

**STEP THREE** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (loop through the list you just got from the find_all()) and pullout critics, and list of movies. 

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches to your `all_p` list:

if it has a `<strong>` tag then 
critic_info = p_line.strong.string
movie_info = p_line.next_sibling

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page--it might not be helpful to look at at this point. See how you do step-by-step and if you get stuck at a step Slack me with your code!



In [12]:
##Write your loop for STEP 3 here
#I started this for you,
#Because you only want it to search starting with each critic
#   if line.strong is not None: does that for you
for lines in find_p:
    if lines.strong is not None:
        critic_info = lines.strong
        movie_info = lines.next_sibling
        for critic in critic_info:
            print(critic)
            print(movie_info.find_all(string=True))
        
   
        

        





Simon Abrams – Freelance film critic (US)
['1. Mulholland Drive (David Lynch, 2001)', '2. In the Mood for Love (Wong Kar-wai, 2000)', '3. The Tree of Life (Terrence Malick, 2011)', '4. Yi Yi: A One and a Two (Edward Yang, 2000)', '5. Goodbye to Language (Jean-Luc Godard, 2014)', '6. The White Meadows (Mohammad Rasoulof, 2009)', '7. Night Across the Street (Raoul Ruiz, 2012)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. Sparrow (Johnnie To, 2008)', '10. Fados (Carlos Saura, 2007)']
Sam Adams – Freelance film critic (US)
['1. In the Mood for Love (Wong Kar-wai, 2000)', '2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)', '3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)', '4. Spirited Away (Hayao Miyazaki, 2001)', '5. The Act of Killing (Joshua Oppenheimer, 2012)', '6. The Grand Budapest Hotel (Wes Anderson, 2014)', '7. The New World (Terrence Malick, 2004)', '8. Certified Copy (Abbas Kiarostami, 2010)', '9. The World (Jia Zhangke, 2004)', '10. Elephant (Gu

Larushka Ivan-Zadeh – Metro (UK)
['1. The Act of Killing (Joshua Oppenheimer, 2012)', '2. Caché (Michael Haneke, 2005)', '3. Synecdoche, New York (Charlie Kaufman, 2008)', '4. Spirited Away (Hayao Miyazaki, 2001)', '5. WALL-E (Andrew Stanton, 2008)', '6. Uncle Boonmee Who Can Recall His Past Lives (Apichatpong Weerasethakul, 2010)', '7. Lost in Translation (Sofia Coppola, 2003)', '8. Talk to Her (Pedro Almodóvar, 2002)', '9. The Wrestler (Darren Aronofsky, 2008)', '10. Boyhood (Richard Linklater, 2014)']
Nick James – Sight & Sound (UK)
['1. In the Mood for Love (Wong Kar-wai, 2000)', '2. Caché (Michael Haneke, 2005)', '3. Yi Yi: A One and a Two (Edward Yang, 2000)', '4. Zero Dark Thirty (Kathryn Bigelow, 2012)', '5. Once Upon a Time in Anatolia (Nuri Bilge Ceylan, 2011)', '6. White Material (Claire Denis, 2009)', '7. There Will Be Blood (Paul Thomas Anderson, 2007)', '8. The Look of Silence (Joshua Oppenheimer, 2014)', '9. The Turin Horse (Béla Tarr and Ágnes Hranitzky, 2011)', '10. Th

**STEP 4**
If your loop is successfully isolating those two lines: now it's time to parse each line with regular expressions. This needs to happen inside the loop--for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critics name, organization, and country.**

Inside the loop--once you have critic_info -- make a regular expression that pulls out the name of the critic--make a variable called critic_name

`critic_name = findall(regex,critic_info)`

Do the same thing for critic_org and critic_cn

As you go print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critics line and test regular expressions on it--to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [13]:
#Practice/Build your regular expressions here
crit_sample ="Jean-Philippe Guerand – L'Avant-Scène Cinéma (France)"
regex_for_name = r"(.*)\s–"
regex_for_org = r"–\s(.*)\s[(]"
regex_for_cn = r"\W[\s?\b\w+\b]*\W$"
name = re.findall(regex_for_org,crit_sample)
name[0]



"L'Avant-Scène Cinéma"

In [14]:
for lines in find_p:
    if lines.strong is not None:
        critic_info = lines.strong.string
        critic_name= re.findall(regex_for_name, critic_info)[0]
        critic_org = re.findall(regex_for_org, critic_info)[0]
        critic_cn = re.findall(regex_for_cn, critic_info)[0]
        print(critic_name)
        print(critic_org)
        print(critic_cn)
        print('------')

Simon Abrams
Freelance film critic
(US)
------
Sam Adams
Freelance film critic
(US)
------
Thelma Adams
Freelance film critic
(US)
------
Arturo Aguilar
Rolling Stone Mexico
(Mexico)
------
Matthew Anderson
BBC Culture
(UK)
------
Tim Appelo
The Wrap
(US)
------
Adriano Aprà
Film historian
(Italy)
------
Michael Arbeiter
Nerdist
(US)
------
Ali Arikan
Dipnot TV
(Turkey)
------
Michael Atkinson
The Village Voice
(US)
------
Ana Maria Bahiana
Freelance film critic
(Brazil)
------
Cameron Bailey
Toronto Film Festival
(Canada)
------
Lindsay Baker
BBC Culture
(UK)
------
Miriam Bale
Freelance film critic
(US)
------
Nicholas Barber
BBC Culture
(UK)
------
Diego Batlle
La Nacion
(Argentina)
------
NT Binh
Positif
(France)
------
Lizelle Bisschoff
University of Glasgow
(UK)
------
Christian Blauvelt
BBC Culture
(US)
------
Mahen Bonetti
African Film Festival Inc
(US)
------
Andreas Borcholte
Spiegel Online
(Germany)
------
Utpal Borpujari
Freelance film critic
(India)
------
Richard Brody
Th

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`. Which will contain a string for each movie. Like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop, that goes to each movie and prints out each movie.


In [15]:
for lines in find_p:
    if lines.strong is not None:
        each_movie = movie_info.find_all(string=True)
        for movie in each_movie:
                print(movie)
            


1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The Hours (Stephen Daldry, 2002)
3. The Sun Also Rises (Jiang Wen, 2007)
4. A Separation (Asghar Farhadi, 2011)
5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henckel von Donnersmarck, 2006)
7. Still Life (Jia Zhangke, 2006)
8. Birdman (Alejandro González Iñárritu, 2014)
9. Infernal Affairs (Andrew Lau and Alan Mak, 2002)
10. City of God (Fernando Meirelles and Kátia Lund, 2002)
1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The Hours (Stephen Daldry, 2002)
3. The Sun Also Rises (Jiang Wen, 2007)
4. A Separation (Asghar Farhadi, 2011)
5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henckel von Donnersmarck, 2006)
7. Still Life (Jia Zhangke, 2006)
8. Birdman (Alejandro González Iñárritu, 2014)
9. Infernal Affairs (Andrew Lau and Alan Mak, 2002)
10. City of God (Fernando Meirelles and Kátia Lund, 2002)
1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The H

5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henckel von Donnersmarck, 2006)
7. Still Life (Jia Zhangke, 2006)
8. Birdman (Alejandro González Iñárritu, 2014)
9. Infernal Affairs (Andrew Lau and Alan Mak, 2002)
10. City of God (Fernando Meirelles and Kátia Lund, 2002)
1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The Hours (Stephen Daldry, 2002)
3. The Sun Also Rises (Jiang Wen, 2007)
4. A Separation (Asghar Farhadi, 2011)
5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henckel von Donnersmarck, 2006)
7. Still Life (Jia Zhangke, 2006)
8. Birdman (Alejandro González Iñárritu, 2014)
9. Infernal Affairs (Andrew Lau and Alan Mak, 2002)
10. City of God (Fernando Meirelles and Kátia Lund, 2002)
1. Spring, Summer, Fall, Winter…and Spring (Kim Ki-duk, 2003)
2. The Hours (Stephen Daldry, 2002)
3. The Sun Also Rises (Jiang Wen, 2007)
4. A Separation (Asghar Farhadi, 2011)
5. Lust, Caution (Ang Lee, 2007)
6. The Lives of Others (Florian Henck

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [16]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "5. Lust, Caution (Ang Lee, 2007)"
regex_for_mname = r"\s(.*)\s[(]"
movie_name = re.findall(regex_for_mname,movie_harder)
regex_for_director= r"\s[(](.*)[,]"
regex_for_myear = r"(\d\d\d\d)[)]$"
movie_director = re.findall(regex_for_director,movie_harder)
movie_year = re.findall(regex_for_myear,movie_harder)
regex_for_mrank = r"^\d\d?."
movie_rank = re.findall(regex_for_mrank, movie_sample)
movie_rank[0]



'1.'

**STEP 6**
You're almost there!!! Now that you have a working regular expression put that in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements:
-critic_name
-critic_org
-critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
-rank (this is actually optional, but maybe helpful to keep)
-movie_name
-director
-year

Build this loop using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [17]:
imdb_movies = []
for lines in find_p:
    if lines.strong is not None:
        movie_info = lines.next_sibling
        each_movie = movie_info.find_all(string=True)
        critic_info = lines.strong.string
        critic_name= re.findall(regex_for_name, critic_info)[0]
        critic_org = re.findall(regex_for_org, critic_info)[0]
        critic_cn = re.findall(regex_for_cn, critic_info)[0]
        for movies in each_movie:
            try:
                this_movie = []
                movie_rank = re.findall(regex_for_mrank,movies)[0]
                movie_name = re.findall(regex_for_mname,movies)[0]
                movie_director = re.findall(regex_for_director, movies)[0]
                movie_year = re.findall(regex_for_myear, movies)[0]
                #this_movie.append(movie_rank)
                this_movie.append(movie_name)
                #this_movie.append(movie_director)
                #this_movie.append(movie_year)
                #this_movie.append(critic_name)
                #this_movie.append(critic_org)
                #this_movie.append(critic_cn)
                imdb_movies.append(this_movie)
            except:
                pass
       
        
print(imdb_movies)








[['Mulholland Drive'], ['In the Mood for Love'], ['The Tree of Life'], ['Yi Yi: A One and a Two'], ['Goodbye to Language'], ['The White Meadows'], ['Night Across the Street'], ['Certified Copy'], ['Sparrow'], ['Fados'], ['In the Mood for Love'], ['Eternal Sunshine of the Spotless Mind'], ['Syndromes and a Century'], ['Spirited Away'], ['The Act of Killing'], ['The Grand Budapest Hotel'], ['The New World'], ['Certified Copy'], ['The World'], ['Elephant'], ['Zero Dark Thirty'], ['A History of Violence'], ['The Grand Budapest Hotel'], ['Stories We Tell'], ['Casino Royale'], ['Eternal Sunshine of the Spotless Mind'], ['Tabu'], ['Snow White'], ['Frozen River'], ['Gosford Park'], ['In the Mood for Love'], ['Mulholland Drive'], ['Inception'], ["Pan's Labyrinth"], ['Caché'], ['Grizzly Man'], ['4 Months, 3 Weeks & 2 Days'], ['Holy Motors'], ['The Last of the Unjust'], ['There Will Be Blood'], ['The Piano Teacher'], ['Margaret'], ['American Psycho'], ['4 Months, 3 Weeks & 2 Days'], ['Caché'], ['

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need have a loop that gets everything out--but you also need to figure out **how  you want to organize what you're pulling out.** What should a row look like in your table?


In the cell below, I give you a final architecture you need to use to get this most challenging list of lists.

In [18]:
single_list=[]
for movie in imdb_movies:
    movies=movie[0]
    print(movies)
    single_list.append(movies)


Mulholland Drive
In the Mood for Love
The Tree of Life
Yi Yi: A One and a Two
Goodbye to Language
The White Meadows
Night Across the Street
Certified Copy
Sparrow
Fados
In the Mood for Love
Eternal Sunshine of the Spotless Mind
Syndromes and a Century
Spirited Away
The Act of Killing
The Grand Budapest Hotel
The New World
Certified Copy
The World
Elephant
Zero Dark Thirty
A History of Violence
The Grand Budapest Hotel
Stories We Tell
Casino Royale
Eternal Sunshine of the Spotless Mind
Tabu
Snow White
Frozen River
Gosford Park
In the Mood for Love
Mulholland Drive
Inception
Pan's Labyrinth
Caché
Grizzly Man
4 Months, 3 Weeks & 2 Days
Holy Motors
The Last of the Unjust
There Will Be Blood
The Piano Teacher
Margaret
American Psycho
4 Months, 3 Weeks & 2 Days
Caché
Mulholland Drive
Lourdes
Red Road
Boyhood
Tony Manero
No Country For Old Men
Spirited Away
A Separation
Pan's Labyrinth
Finding Nemo
Hero
The Wolf of Wall Street
Mother
The Bourne Ultimatum
Traffic
These Encounters of Theirs
Vin

Bridesmaids
WALL-E
Brooklyn
The Look of Silence
Punch-Drunk Love
12 Years a Slave
Eternal Sunshine of the Spotless Mind
We Need to Talk About Kevin
Far From Heaven
Chuck & Buck
Gladiator
The Century of the Self
Moulin Rouge!
Munich
Casino Royale
Lilya 4-Ever
Sideways
Inglourious Basterds
Mulholland Drive
In the Mood for Love
The New World
Certified Copy
Femme Fatale
Margaret
Under the Skin
This Is Not a Film
Two Lovers
Heart of a Dog
In the Mood for Love
4 Months, 3 Weeks & 2 Days
A Separation
Son of Saul
Elephant
A Prophet
There Will Be Blood
The Lives of Others
Frances Ha
Eternal Sunshine of the Spotless Mind
Werckmeister Harmonies
The Act of Killing
White Material
Even If She Had Been a Criminal...
Spring Breakers
In the Mood for Love
Stranger by the Lake
Colossal Youth
Story of My Death
Aurora
Dogville
Zodiac
There Will Be Blood
Inside Llewyn Davis
Yi Yi: A One and a Two
The Turin Horse
The Master
It's Such a Beautiful Day
Boyhood
Shutter Island
Saraband
Memento
Blue Is the Warmest

Holy Motors
Mulholland Drive
Memento
The Master
Inside Llewyn Davis
The Diving Bell and the Butterfly
My Winnipeg
Inside Out
Step Brothers
Bright Star
Mad Max: Fury Road
Caché
Zodiac
Mr Turner
Amour
The Master
Only Lovers Left Alive
Inside Llewyn Davis
Tabu
Memento
Requiem for a Dream
The Lord of the Rings: The Return of the King
Drive
Holy Motors
Mad Max: Fury Road
Kill List
Toy Story 3
Kiss Kiss Bang Bang
Inside Llewyn Davis
Mulholland Drive
No Country For Old Men
Amélie
Memento
In the Mood for Love
Talk to Her
Memories of Murder
WALL-E
Bowling for Columbine
Children of Men
Mulholland Drive
Son of Saul
Ida
Eternal Sunshine of the Spotless Mind
The White Ribbon
Still Walking
The Act of Killing
Dogville
Birdman
Grizzly Man
Amour
Distant
A Separation
Samson & Delilah
Leviathan
Still Walking
Talk to Her
Million Dollar Baby
No Country For Old Men
The Man Without A Past
Mysteries of Lisbon
Margaret
The New World
Secret Things
La Ciénaga
Toni Erdmann
In the Family
Tabu
Gerry
Tropical Malady

In [19]:
print(single_list)

['Mulholland Drive', 'In the Mood for Love', 'The Tree of Life', 'Yi Yi: A One and a Two', 'Goodbye to Language', 'The White Meadows', 'Night Across the Street', 'Certified Copy', 'Sparrow', 'Fados', 'In the Mood for Love', 'Eternal Sunshine of the Spotless Mind', 'Syndromes and a Century', 'Spirited Away', 'The Act of Killing', 'The Grand Budapest Hotel', 'The New World', 'Certified Copy', 'The World', 'Elephant', 'Zero Dark Thirty', 'A History of Violence', 'The Grand Budapest Hotel', 'Stories We Tell', 'Casino Royale', 'Eternal Sunshine of the Spotless Mind', 'Tabu', 'Snow White', 'Frozen River', 'Gosford Park', 'In the Mood for Love', 'Mulholland Drive', 'Inception', "Pan's Labyrinth", 'Caché', 'Grizzly Man', '4 Months, 3 Weeks & 2 Days', 'Holy Motors', 'The Last of the Unjust', 'There Will Be Blood', 'The Piano Teacher', 'Margaret', 'American Psycho', '4 Months, 3 Weeks & 2 Days', 'Caché', 'Mulholland Drive', 'Lourdes', 'Red Road', 'Boyhood', 'Tony Manero', 'No Country For Old Men

In [20]:
url = 'https://www.imdb.com/find?q='
raw_html = requests.get(url).content 
soup_doc = BeautifulSoup(raw_html, "html.parser")
    

In [21]:
imdb_titles = []
for title in single_list:
    correct_title = title.replace(' ', '+')
    imdb_url = f"('https://www.imdb.com/find?q={correct_title}')"
    print(imdb_url)
    imdb_titles.append(imdb_url)

('https://www.imdb.com/find?q=Mulholland+Drive')
('https://www.imdb.com/find?q=In+the+Mood+for+Love')
('https://www.imdb.com/find?q=The+Tree+of+Life')
('https://www.imdb.com/find?q=Yi+Yi:+A+One+and+a+Two')
('https://www.imdb.com/find?q=Goodbye+to+Language')
('https://www.imdb.com/find?q=The+White+Meadows')
('https://www.imdb.com/find?q=Night+Across+the+Street')
('https://www.imdb.com/find?q=Certified+Copy')
('https://www.imdb.com/find?q=Sparrow')
('https://www.imdb.com/find?q=Fados')
('https://www.imdb.com/find?q=In+the+Mood+for+Love')
('https://www.imdb.com/find?q=Eternal+Sunshine+of+the+Spotless+Mind')
('https://www.imdb.com/find?q=Syndromes+and+a+Century')
('https://www.imdb.com/find?q=Spirited+Away')
('https://www.imdb.com/find?q=The+Act+of+Killing')
('https://www.imdb.com/find?q=The+Grand+Budapest+Hotel')
('https://www.imdb.com/find?q=The+New+World')
('https://www.imdb.com/find?q=Certified+Copy')
('https://www.imdb.com/find?q=The+World')
('https://www.imdb.com/find?q=Elephant')
('

('https://www.imdb.com/find?q=Finding+Nemo')
('https://www.imdb.com/find?q=Whiplash')
('https://www.imdb.com/find?q=Shame')
('https://www.imdb.com/find?q=Revolutionary+Road')
('https://www.imdb.com/find?q=Inside+Llewyn+Davis')
('https://www.imdb.com/find?q=Weekend')
('https://www.imdb.com/find?q=Persepolis')
('https://www.imdb.com/find?q=In+the+Mood+for+Love')
('https://www.imdb.com/find?q=Son+of+Saul')
('https://www.imdb.com/find?q=The+Hurt+Locker')
('https://www.imdb.com/find?q=The+Beat+That+My+Heart+Skipped')
('https://www.imdb.com/find?q=The+Edge+of+Heaven')
('https://www.imdb.com/find?q=Waltz+with+Bashir')
('https://www.imdb.com/find?q=Oldboy')
('https://www.imdb.com/find?q=Good+Bye+Lenin!')
('https://www.imdb.com/find?q=Ida')
('https://www.imdb.com/find?q=Incendies')
('https://www.imdb.com/find?q=Mulholland+Drive')
('https://www.imdb.com/find?q=Synecdoche,+New+York')
('https://www.imdb.com/find?q=Birth')
('https://www.imdb.com/find?q=Elena')
('https://www.imdb.com/find?q=Carol')


In [22]:
real_movie_list=[]
for title in single_list:
    plus = title.replace(' ', '+')
    real_movie_list.append(plus)

In [23]:
len(real_movie_list)

1760

In [24]:
import time 

In [25]:
fake_movie_list=('Mulholland+Drive', 'Spirited+Away')

In [26]:
titles_list = []
for movie in real_movie_list:
    try:
        url= 'https://www.imdb.com/find?q='+movie
        raw_html = requests.get(url).content
        soup_doc = BeautifulSoup(raw_html, "html.parser")
        title = soup_doc.find('tr').find('td').find('a')['href']
        #print(title)
        titles_list.append(title)
        time.sleep(.5)
    except:
        print("No title here")
        
#DO NOT RUN AGAIN. RUN CELL BENEATH IF NECESSARY

No title here
No title here
No title here
No title here
No title here
No title here
No title here


In [27]:
print(len(titles_list))

1753


In [None]:
unwanted = {'/name/nm0662127/', '/name/nm1176985/','/name/nm3692520/','/name/nm3587864/','/name/nm0423111/','/name/nm3692520/',
           '/name/nm1176985/', '/name/nm1176985/', '/name/nm0395353/', '/name/nm0124930/'}
updated_titles_list = [e for e in titles_list if e not in unwanted]







In [29]:
print(updated_titles_list))

1735


In [35]:
updated_titles_list = list(dict.fromkeys(updated_titles_list))
print(updated_titles_list)

['/title/tt0166924/', '/title/tt0118694/', '/title/tt0478304/', '/title/tt0244316/', '/title/tt2400275/', '/title/tt1509132/', '/title/tt1876360/', '/title/tt1020773/', '/title/tt1056422/', '/title/tt0338013/', '/title/tt0477731/', '/title/tt0245429/', '/title/tt2375605/', '/title/tt2278388/', '/title/tt0402399/', '/title/tt0423176/', '/title/tt0363589/', '/title/tt1790885/', '/title/tt0399146/', '/title/tt2366450/', '/title/tt0381061/', '/title/tt3647998/', '/title/tt1735898/', '/title/tt0978759/', '/title/tt0280707/', '/title/tt1375666/', '/title/tt0457430/', '/title/tt0387898/', '/title/tt0427312/', '/title/tt1032846/', '/title/tt2076220/', '/title/tt2340784/', '/title/tt0469494/', '/title/tt0254686/', '/title/tt0466893/', '/title/tt0144084/', '/title/tt1405809/', '/title/tt0471030/', '/title/tt1065073/', '/title/tt1223975/', '/title/tt0477348/', '/title/tt1832382/', '/title/tt0266543/', '/title/tt0299977/', '/title/tt0993846/', '/title/tt5109784/', '/title/tt0440963/', '/title/tt01

If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.

In [31]:
for item in updated_titles_list:
    url= 'https://www.imdb.com/find?q='
    raw_html = requests.get(url).content
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    

In [32]:
fake_title_list = ['/title/tt0166924/','/title/tt0118694/']

In [33]:
for item in fake_title_list:
    url= 'https://www.imdb.com/'+item
    raw_html = requests.get(url).content
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    details_box=soup_doc.find('div',class_="article", id='titleDetails')
    txt_boxes= details_box.find_all('div', _class="txt-block")
    for box in txt_boxes:
        print(txt_boxes.find('h4'))
        

In [34]:
for thing in box:
    print(thing)

NameError: name 'box' is not defined

In [None]:
country = soup_doc.find_all('h4')[11].next_sibling.next_sibling.text