## BBC project: process, hints, and recipes

The major challenge of the BBC project is to transform the list of critics and movies into searchable Python lists and/or dictionaries. The most difficult aspect of this project is the first: scraping the page on the BBC site and, using Beautiful Soup and regular expressions, building a data set that will work.

Once you have the data set, you will be in good shape going forward--the goal after that will be to search for interesting patterns (top movies by country/critic/director/year)--this is the conceptual work you need to be thinking about while you struggle through wrangling your data.

So, how do I wrangle this data? That is the central challenge that you'll be dealing with this week. The HTML page on the BBC site (mirrored on my site) poses a number of challenges. The layout is relatively simple and consistent, but that simplicity actually makes it a little bit harder, because there aren't that many HTML tags to help you isolate each unit of data. You can use Beautiful Soup to isolate the line that contains all the information for the critic, and you can isolate each group of top 10 movies as well. What you need to do, and this is a bit harder, is use Beautiful Soup to find each critic along with the list of movies that immediately follows them. (Using Beautiful Soup to do that is challenging--I have instructions on how to figure it out, but if you can't figure it out, just DM me on Slack and I will help you!)

Yes, that is how this process will work--below I have step-by-step instructions so you can try to write the code yourself. Do your best--and if you can't get there, Slack me and I will help get your code working so you can move on to the next step.


### Getting started: Data Architecture

The central challenge of this project is figuring out how you are going to set up your table or tables from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: the main categories of analysis that are possible include movie, director, critic, critic's country, year, and whatever else you bring to this. Try to design a schema that will give you a table that you can run solid queries on. 

You will eventually want to bring this into pandas, so you want to keep your table as simple and structured as possible. Try to think about how you can transform the main source into one large table that can be aggregated and grouped.
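One possible shape, sketched below, treats each row as a single vote: one critic placing one movie on their list. The `Vote` class and its field names are illustrative assumptions, not a required schema, and the two rows are typed in by hand from the first critic on the page.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    """One row = one critic placing one movie on their top-10 list."""
    movie: str
    director: str
    year: int
    rank: int
    critic: str
    critic_org: str
    critic_country: str


# Two hand-typed rows taken from the first critic on the page.
votes = [
    Vote("Mulholland Drive", "David Lynch", 2001, 1,
         "Simon Abrams", "Freelance film critic", "US"),
    Vote("In the Mood for Love", "Wong Kar-wai", 2000, 2,
         "Simon Abrams", "Freelance film critic", "US"),
]

# Because every row carries both the critic and the movie,
# simple filters can answer questions about either one.
films_from_2000 = [v.movie for v in votes if v.year == 2000]
critics_for_lynch = {v.critic for v in votes if v.director == "David Lynch"}
print(films_from_2000)
print(critics_for_lynch)
```

A table in this "one row per vote" shape repeats the critic's details on every row, which looks redundant but is what makes grouping by movie, director, year, or critic country straightforward later.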

### Interpretive Architecture
**REMEMBER: secondary source** Part of this week's work is to find a source you can use to get the country of origin for each director. This is something you need to search for on your own--it will be hard for you to find a single page that has a list of every single director. But see what you can find. In the end, you don't have to have a complete database of every single director, but do your best to get as many as you can.

You don't necessarily have to go in the direction of directors' origin. You can certainly try to think of other categories of interpretation that you can join to this initial dataset. This is how you bring your point-of-view to a relatively large data set that seeks to frame the past 15 years of cinema. How can you bring a different point-of-view to this subject? You can certainly narrow your focus to a specific country, a group of countries, or a region. Either way, think about other data that might bring different types of insight to this list.

### Ready to code?

The first thing you need to do is import beautiful soup & requests like we did in the homework, and scrape the page. 

http://floatingmedia.com/columbia/BBC.html

Okay let's begin!

**STEP 1**


In [1]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
from bs4 import BeautifulSoup
import requests
import pandas as pd

import time
import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager


In [2]:
# read the URL, and put the HTML page into beautiful soup

response = requests.get("http://floatingmedia.com/columbia/BBC.html")
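# name the parser explicitly so Beautiful Soup doesn't warn or pick a different one per machine
doc = BeautifulSoup(response.text, "html.parser")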

In [3]:
#Using beautiful soup find the div tag that contains 
#the entire list of critics and movies
#Make a variable (like all_info) that holds all that information 
# div class="body-content"



all_info = doc.find(class_ = "body-content")
all_info

<div class="body-content">
<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p><p><strong>Simon Abrams – Freelance film critic (US)</strong></p><p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Ra

**STEP 2** Here is where it begins to get tricky: at this point everything we want is wrapped in `<p>` tags. Use Beautiful Soup's find_all to get a list of everything in a `<p>` tag. Make a variable that contains that list (you could call it all_p or something).


In [4]:
#find_all
all_p = all_info.find_all('p')
all_p

[<p>Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.</p>,
 <p><strong>Simon Abrams – Freelance film critic (US)</strong></p>,
 <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7.

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the `<p>` elements (loop through the list you just got from the find_all()) and pull out each critic and their list of movies.

Critics should not be too hard--every critic entry is embedded in `<strong>` tags. But in order to get the movies attached to that critic--you need to find the `<p>` tag immediately following each `<p><strong>` -- you can do this using next_sibling.

So, you need to build a loop that searches through your `all_p` list:

if it has a `<strong>` tag then 
critic_info = p_line.strong.string
movie_info = p_line.next_sibling

As you go through this loop print(critic_info, movie_info) and see what comes out. If you're getting the critic string followed by movie line's HTML--you've got it!

I give you the beginning of the loop below, and then you can build it piece by piece. If you want to see the overall architecture of the final loop, I have a commented example at the end of the page--it might not be helpful to look at it at this point. See how you do step-by-step, and if you get stuck at a step, Slack me with your code!



In [5]:
##Write your loop for STEP 3 here
#I started this for you,
#Because you only want it to search starting with each critic
#   if line.strong is not None: does that for you

for line in all_p:
    if line.strong is not None:
        critic_info = line.strong.string
        movie_info = line.next_sibling
        print(critic_info,movie_info)
   
        





Simon Abrams – Freelance film critic (US) <p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)<br/>3. The Tree of Life (Terrence Malick, 2011)<br/>4. Yi Yi: A One and a Two (Edward Yang, 2000)<br/>5. Goodbye to Language (Jean-Luc Godard, 2014)<br/>6. The White Meadows (Mohammad Rasoulof, 2009)<br/>7. Night Across the Street (Raoul Ruiz, 2012)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. Sparrow (Johnnie To, 2008)<br/>10. Fados (Carlos Saura, 2007)</p>
Sam Adams – Freelance film critic (US) <p>1. In the Mood for Love (Wong Kar-wai, 2000)<br/>2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)<br/>3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)<br/>4. Spirited Away (Hayao Miyazaki, 2001)<br/>5. The Act of Killing (Joshua Oppenheimer, 2012)<br/>6. The Grand Budapest Hotel (Wes Anderson, 2014)<br/>7. The New World (Terrence Malick, 2004)<br/>8. Certified Copy (Abbas Kiarostami, 2010)<br/>9. The World (Jia Zhangke, 2004
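A note on `next_sibling`: it returns whatever node comes next, and that can be a whitespace string rather than a tag when the HTML has line breaks between `<p>` elements. On this page the `<p>` tags sit back to back, so `next_sibling` works, but `find_next_sibling('p')` is a more defensive option. The sketch below uses a small hand-written HTML snippet (an assumption for illustration, not the real page) to show the difference.

```python
from bs4 import BeautifulSoup

# Hand-written sample with newlines between the <p> tags (illustrative only).
sample_html = """<div class="body-content">
<p><strong>Simon Abrams – Freelance film critic (US)</strong></p>
<p>1. Mulholland Drive (David Lynch, 2001)<br/>2. In the Mood for Love (Wong Kar-wai, 2000)</p>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for line in soup.find_all("p"):
    if line.strong is None:
        continue  # not a critic line
    # Here next_sibling is the newline between the tags, not the movie <p>.
    raw_sibling = line.next_sibling
    # find_next_sibling skips text nodes and returns the next <p> tag.
    movie_p = line.find_next_sibling("p")
    movies = [str(m) for m in movie_p.find_all(string=True)]
    pairs.append((str(line.strong.string), movies))

print(repr(raw_sibling))
print(pairs)
```

If your own loop ever prints a blank line where the movie list should be, this is a likely cause.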

**STEP 4**
If your loop is successfully isolating those two lines, now it's time to parse each line with regular expressions. This needs to happen inside the loop--for every critic, and then (in STEP 5) for every movie. Here just **focus on getting the critic's name, organization, and country.**

Inside the loop--once you have critic_info -- make a regular expression that pulls out the name of the critic--make a variable called critic_name

`critic_name = re.findall(regex, critic_info)`

Do the same thing for critic_org and critic_cn

As you go, print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. It might help, before you do all these regular expressions in a loop, to just grab one critic's line and test regular expressions on it--to make sure that you're getting the right thing. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [6]:
#Practice/Build your regular expressions here
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^([^\W].+) – "
regex_for_org = r"– ([^\W].+) \("
regex_for_cn = r"\(([^\W].+)\)$"
name = re.findall(regex_for_name,crit_sample)
org = re.findall(regex_for_org,crit_sample)
cn = re.findall(regex_for_cn,crit_sample)


print(name[0])
print(org[0])
print(cn[0])

Arturo Aguilar
Rolling Stone Mexico
Mexico
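As an alternative to three separate patterns, the same line can be parsed with one pattern that uses named groups. This is only a sketch: `CRITIC_RE` and `parse_critic` are names made up for illustration, and the pattern assumes every critic line follows the "name – organization (country)" layout seen above. One practical difference is that `re.findall(...)[0]` raises an IndexError on a line that doesn't match, while this version returns None.

```python
import re

# One pattern with named groups instead of three separate ones.
# The separator is an en dash with a space on each side, as on the BBC page.
CRITIC_RE = re.compile(r"^(?P<name>.+?) – (?P<org>.+) \((?P<country>[^()]+)\)$")


def parse_critic(line):
    """Return a dict with name/org/country, or None if the line doesn't fit."""
    match = CRITIC_RE.match(line.strip())
    return match.groupdict() if match else None


print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
print(parse_critic("We present their votes here in alphabetical order."))
```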


In [7]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it

for line in all_p[:-3]:
    if line.strong is not None:
        critic_info = line.strong.string
        regex_for_name = r"^([^\W].+) – "
        regex_for_org = r"– ([^\W].+) \("
        regex_for_cn = r"\(([^\W].+)\)$"
        critic_name = re.findall(regex_for_name,critic_info)
        critic_org = re.findall(regex_for_org,critic_info)
        critic_cn = re.findall(regex_for_cn,critic_info)
        print(critic_name[0])
#         print(critic_org[0])
#         print(critic_cn[0])

Simon Abrams
Sam Adams
Thelma Adams
Arturo Aguilar
Matthew Anderson
Tim Appelo
Adriano Aprà
Michael Arbeiter
Ali Arikan
Michael Atkinson
Ana Maria Bahiana
Cameron Bailey
Lindsay Baker
Miriam Bale
Nicholas Barber
Diego Batlle
NT Binh
Lizelle Bisschoff
Christian Blauvelt
Mahen Bonetti
Andreas Borcholte
Utpal Borpujari
Richard Brody
Hannah Brown
Luke Buckmaster
Luciano Castillo
Monica Castillo
Samuel Castro
Justin Chang
Enrico Chiesa
Cho Seongyong
Robbie Collin
Scott Collura
Colin Covert
Oggs Cruz
Ken Dancyger
Erik Davis
Peter Debruge
Fernand Denis
Lindiwe Dovey
Alonso Duralde
Bilge Ebiri
David Ehrlich
Kate Erbland
Mario Espinosa
Joseph Fahim
Devin Faraci
David Fear
Scott Feinberg
Javier Porta Fouz
Kenji Fujishima
Ernesto Garratt
Steven Gaydos
Noah Gittell
Owen Gleiberman
Ed Gonzalez
Juan Carlos González
Carmen Gray
Tim Grierson
Jean-Philippe Guerand
Antoine Guillot
Tom Gunning
Shubhra Gupta
Hauvick Habechian
Angie Han
Aisha Harris
Tina Hassannia
Shiguehiko Hasumi
Katarina Hedrén
Jordan H

**STEP 5**
Now you need to get your **movie names**--this is the trickiest part. You want to use the same loop you have been working on, and get the name of each movie along with the critic information.

To do this you need to search the movie_info variable -- which is each movie followed by a `<BR>` tag. I showed you this in class, but I'll just tell you again how to do this. To get a list of everything that is not a `<BR>` tag, use this method:

`each_movie = movie_info.find_all(string=True)`

This will give you a list called `each_movie`, which will contain a string for each movie, like this:

`1. Zero Dark Thirty (Kathryn Bigelow, 2012)`

Build a loop inside the main loop that goes through each movie and prints it out.


In [8]:
## Take your working loop and add the find_all for each_movie
# and the inner loop that loops through each_movie

for line in all_p[:-3]:
    if line.strong is not None:
        movie_info = line.next_sibling
        each_movie = movie_info.find_all(string=True)
        for movie in each_movie:
            print(movie)

1. Mulholland Drive (David Lynch, 2001)
2. In the Mood for Love (Wong Kar-wai, 2000)
3. The Tree of Life (Terrence Malick, 2011)
4. Yi Yi: A One and a Two (Edward Yang, 2000)
5. Goodbye to Language (Jean-Luc Godard, 2014)
6. The White Meadows (Mohammad Rasoulof, 2009)
7. Night Across the Street (Raoul Ruiz, 2012)
8. Certified Copy (Abbas Kiarostami, 2010)
9. Sparrow (Johnnie To, 2008)
10. Fados (Carlos Saura, 2007)
1. In the Mood for Love (Wong Kar-wai, 2000)
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
4. Spirited Away (Hayao Miyazaki, 2001)
5. The Act of Killing (Joshua Oppenheimer, 2012)
6. The Grand Budapest Hotel (Wes Anderson, 2014)
7. The New World (Terrence Malick, 2004)
8. Certified Copy (Abbas Kiarostami, 2010)
9. The World (Jia Zhangke, 2004)
10. Elephant (Gus Van Sant, 2003)
1. Zero Dark Thirty (Kathryn Bigelow, 2012)
2. A History of Violence (David Cronenberg, 2005)
3. The Grand Budapest Hotel (

Now that you have that loop working, you need to use regular expressions to get out the name of the movie. First practice getting a regular expression that gets you the name of the movie.


In [9]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "71. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"\d+?\. ([^\W].+) \("
regex_dir = r"\(([^\W].+),"
regex_year = r"(\d\d\d\d)\)"
regex_rank = r"(\d+?)\."


movie_name = re.findall(regex_for_mname,movie_sample)
harder_name = re.findall(regex_for_mname,movie_harder)
movie_dir = re.findall(regex_dir,movie_sample)
movie_year = re.findall(regex_year,movie_sample)
movie_rank = re.findall(regex_rank,movie_sample)
harder_rank = re.findall(regex_rank,movie_harder)



movie_name
# harder_name[0]
movie_dir[0]
# movie_year
movie_rank
harder_rank

['71']
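The four movie patterns can also be folded into a single pattern with named groups. The sketch below is one way to do that under the assumption that every movie line looks like "rank. title (director, year)"; `MOVIE_RE` and `parse_movie` are illustrative names, not part of the assignment. It also converts rank and year to integers on the way out.

```python
import re

# rank, then a literal dot, then the title, then "(director, year)" at the end.
MOVIE_RE = re.compile(
    r"^(?P<rank>\d+)\.\s+(?P<title>.+) \((?P<director>.+), (?P<year>\d{4})\)$"
)


def parse_movie(line):
    """Return a dict for one movie line, or None if the line doesn't match."""
    match = MOVIE_RE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    row["rank"] = int(row["rank"])
    row["year"] = int(row["year"])
    return row


print(parse_movie("1. Zero Dark Thirty (Kathryn Bigelow, 2012)"))
print(parse_movie("71. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
```

The second sample is the harder case from the practice cell: the title itself starts with a digit and contains commas, so the pattern anchors the rank at the start of the line and the year at the end.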

**STEP 6**
You're almost there!!! Now that you have a working regular expression put that in your inner loop to get the movie name.

So now the entire loop should be getting you 13 elements:
-critic_name
-critic_org
-critic_cn

And an inner loop that will run 10 times (for the 10 movies) and give you 10 instances of:
-rank (this is actually optional, but maybe helpful to keep)
-movie_name
-director
-year

Build this loop using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [10]:
#Get that loop working here
mytable =[]
for line in all_p[:-3]:
    if line.strong is not None:
        critic_info = line.strong.string
        regex_for_name = r"^([^\W].+) – "
        regex_for_org = r"– ([^\W].+) \("
        regex_for_cn = r"\(([^\W].+)\)$"
        critic_name = re.findall(regex_for_name,critic_info)[0]
        critic_org = re.findall(regex_for_org,critic_info)[0]
        critic_cn = re.findall(regex_for_cn,critic_info)[0]
        critic_info = critic_name, critic_org, critic_cn

        movie_info = line.next_sibling
        each_movie = movie_info.find_all(string=True)


        for movie in each_movie:
            
            movie_row = []

            # escape the dot so it matches the literal "." after the rank
            regex_for_mname = r"\d+?\. ([^\W].+) \("
            regex_dir = r"\(([^\W].+),"
            regex_year = r"(\d\d\d\d)\)"
            regex_rank = r"(\d+?)\."
            m_names= re.findall(regex_for_mname,movie)[0]
            m_dirs = re.findall(regex_dir,movie)[0]
            m_years = re.findall(regex_year,movie)[0]
            critic_rank =re.findall(regex_rank,movie)[0]
            movie_row = [m_names, m_dirs, m_years, critic_rank, critic_name, critic_org, critic_cn]
#             print(movie_row)
            mytable.append(movie_row)
mytable
            
    

[['Mulholland Drive',
  'David Lynch',
  '2001',
  '1',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['In the Mood for Love',
  'Wong Kar-wai',
  '2000',
  '2',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['The Tree of Life',
  'Terrence Malick',
  '2011',
  '3',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Yi Yi: A One and a Two',
  'Edward Yang',
  '2000',
  '4',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Goodbye to Language',
  'Jean-Luc Godard',
  '2014',
  '5',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['The White Meadows',
  'Mohammad Rasoulof',
  '2009',
  '6',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Night Across the Street',
  'Raoul Ruiz',
  '2012',
  '7',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Certified Copy',
  'Abbas Kiarostami',
  '2010',
  '8',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Sparrow',
  'Johnnie To',
  '2008',
  '9',
  'Simon Abrams',
  'Freelance film 

**STEP 7**
This is the final step of the hardest part! If you make it all the way to the end of this let me know and we can discuss what to do next. If you've made it just following instructions, you are in great shape for the rest of this project--if not, don't worry! I will get you through by midweek.

The final step is building a list of lists of all this information.

So you need to have a loop that gets everything out--but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?




In [13]:
##Take a peek at your final lists of lists
mytable

col_names = ['movie', 'director', 'm_year', 'crit_rank','critic','crit_org','crit_cn']
df = pd.DataFrame(mytable, columns = col_names)
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1770 entries, 0 to 1769
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movie      1770 non-null   object
 1   director   1770 non-null   object
 2   m_year     1770 non-null   object
 3   crit_rank  1770 non-null   object
 4   critic     1770 non-null   object
 5   crit_org   1770 non-null   object
 6   crit_cn    1770 non-null   object
dtypes: object(7)
memory usage: 96.9+ KB
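The `df.info()` output shows every column as `object`, which means `m_year` and `crit_rank` are still strings at this point. A minimal sketch of converting them, using two sample rows shaped like `mytable` (the `sample_df` name is made up for illustration):

```python
import pandas as pd

# Two rows shaped like mytable, where every value is still a string.
col_names = ['movie', 'director', 'm_year', 'crit_rank', 'critic', 'crit_org', 'crit_cn']
sample_rows = [
    ['Mulholland Drive', 'David Lynch', '2001', '1',
     'Simon Abrams', 'Freelance film critic', 'US'],
    ['In the Mood for Love', 'Wong Kar-wai', '2000', '2',
     'Simon Abrams', 'Freelance film critic', 'US'],
]
sample_df = pd.DataFrame(sample_rows, columns=col_names)

# Convert the two numeric columns so sorting and arithmetic behave numerically.
for col in ['m_year', 'crit_rank']:
    sample_df[col] = sample_df[col].astype(int)

print(sample_df.dtypes)
print(sample_df['m_year'].min())
```

With string values, comparisons are alphabetical (so '10' sorts before '2'), which matters for the rank column in particular.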


If you made it this far, congratulations!

You can go ahead and try to build the list of movies and/or the list of directors on your own--they will use similar logic, but they will not be nearly as complicated as this one.

In [14]:
df.to_csv('bbc_movie_2.csv', index = False)

In [15]:
df = pd.read_csv('bbc_movie_2.csv')
df.head()

Unnamed: 0,movie,director,m_year,crit_rank,critic,crit_org,crit_cn
0,Mulholland Drive,David Lynch,2001,1,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,Wong Kar-wai,2000,2,Simon Abrams,Freelance film critic,US
2,The Tree of Life,Terrence Malick,2011,3,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,Edward Yang,2000,4,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,Jean-Luc Godard,2014,5,Simon Abrams,Freelance film critic,US


In [16]:
# Which movie was chosen by critics most frequently?

df.movie.value_counts().head(10)

# there are 598 movies in the rankings
# In the Mood for Love 花様年華 was chosen most frequently

In the Mood for Love                     49
Mulholland Drive                         47
There Will Be Blood                      35
Spirited Away                            34
Boyhood                                  30
Eternal Sunshine of the Spotless Mind    29
A Separation                             28
The Tree of Life                         23
Yi Yi: A One and a Two                   22
No Country For Old Men                   21
Name: movie, dtype: int64
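The same counting idea extends to the "top movies by country" question from the introduction. The sketch below uses a handful of made-up (movie, critic country) pairs rather than the real table, and plain Python counters so the grouping logic is easy to see.

```python
from collections import Counter, defaultdict

# A few (movie, critic country) pairs; a made-up sample, not the full table.
votes = [
    ("Mulholland Drive", "US"),
    ("In the Mood for Love", "US"),
    ("Mulholland Drive", "US"),
    ("In the Mood for Love", "China"),
    ("Still Life", "China"),
    ("In the Mood for Love", "China"),
]

# One Counter per country, filled one vote at a time.
counts_by_country = defaultdict(Counter)
for movie, country in votes:
    counts_by_country[country][movie] += 1

# Keep the single most-picked movie (and its count) for each country.
top_by_country = {
    country: counter.most_common(1)[0]
    for country, counter in counts_by_country.items()
}
print(top_by_country)
```

On the real DataFrame, `df.groupby('crit_cn')['movie'].value_counts()` should give the equivalent per-country counts.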

In [22]:
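# Search for a director by name, e.g. Bong Joon-ho or David Lynch.
# Assumes `driver` is a Selenium WebDriver created in an earlier cell (not shown here)
# and already pointed at the page with the search box.
# Put the director's name inside the quotes.
driver.find_element(By.XPATH, '//*[@id="suggestion-search"]').send_keys("")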


In [23]:
# click
# //*[@id="suggestion-search-button"]/svg
driver.find_element(By.ID, "suggestion-search-button").click()


In [24]:
# click the first result
# //*[@id="main"]/div/div[2]/table/tbody/tr/td[2]/a
# //*[@id="main"]/div/div[3]/table/tbody/tr[1]/td[2]/a

driver.find_element(By.XPATH, '//*[@id="main"]/div/div[2]/table/tbody/tr/td[2]/a').click()

In [25]:
doc =  BeautifulSoup(driver.page_source, "html.parser")
# the last <a> inside the born-info block holds the birthplace
dir_doc = doc.find(id =  "name-born-info" ).find_all('a')[-1].string
dir_doc
# re_dir_cn1 = r".*, ([^\W].+)$"


# dir_cn1 = re.findall(re_dir_cn1, dir_doc)
# dir_cn1

'Daegu, South Korea'

In [85]:
#parse from the "name" table
directors = df.director.tolist()
director_links = []

for names in directors:
    string = names.replace(" ","+").lower()
    hrefs = "https://www.imdb.com/find?q=" + string
    raw_html = requests.get(hrefs).content
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    headers = soup_doc.find_all('h3')
    for head in headers:
        # only the "Names" section lists people; other sections are skipped
        if (re.search("Names", head.text)):
            td_result = head.parent.find(class_ = "result_text")
            director_info ={}
            director_info['name'] = names
            director_info['link']  = td_result.a["href"]
            director_links.append(director_info)
director_links

[{'name': 'David Lynch', 'link': '/name/nm0000186/'},
 {'name': 'Wong Kar-wai', 'link': '/name/nm0939182/'},
 {'name': 'Terrence Malick', 'link': '/name/nm0000517/'},
 {'name': 'Edward Yang', 'link': '/name/nm0945981/'},
 {'name': 'Jean-Luc Godard', 'link': '/name/nm0000419/'},
 {'name': 'Mohammad Rasoulof', 'link': '/name/nm1488024/'},
 {'name': 'Raoul Ruiz', 'link': '/name/nm0749914/'},
 {'name': 'Abbas Kiarostami', 'link': '/name/nm0452102/'},
 {'name': 'Johnnie To', 'link': '/name/nm0864775/'},
 {'name': 'Carlos Saura', 'link': '/name/nm0767022/'},
 {'name': 'Wong Kar-wai', 'link': '/name/nm0939182/'},
 {'name': 'Michel Gondry', 'link': '/name/nm0327273/'},
 {'name': 'Apichatpong Weerasethakul', 'link': '/name/nm0917405/'},
 {'name': 'Hayao Miyazaki', 'link': '/name/nm0594503/'},
 {'name': 'Joshua Oppenheimer', 'link': '/name/nm1484791/'},
 {'name': 'Wes Anderson', 'link': '/name/nm0027572/'},
 {'name': 'Terrence Malick', 'link': '/name/nm0000517/'},
 {'name': 'Abbas Kiarostami', '
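The loop above makes one request per row of `df.director`, so a director who appears on several lists (Wong Kar-wai and Terrence Malick already repeat in the output) is looked up again each time. A minimal sketch of looking each name up once is below. `unique_in_order`, `look_up_once`, and `fake_fetch` are hypothetical helpers; `fake_fetch` stands in for the real request so the example runs offline and returns placeholder strings, not real links.

```python
def unique_in_order(names):
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(names))


def look_up_once(names, fetch):
    """Call fetch(name) a single time per distinct name and keep the result."""
    cache = {}
    for name in unique_in_order(names):
        cache[name] = fetch(name)
    return cache


# Stand-in for the real request so the sketch runs offline (hypothetical helper).
calls = []


def fake_fetch(name):
    calls.append(name)
    return "/name/placeholder-for-" + name.replace(" ", "-").lower() + "/"


directors = ["David Lynch", "Wong Kar-wai", "Terrence Malick",
             "Wong Kar-wai", "Terrence Malick"]
links = look_up_once(directors, fake_fetch)
print(len(directors), "rows ->", len(calls), "requests")
```

In the real loop, the already-imported `time` module could add a short `time.sleep(...)` between requests, and `df.director.unique()` is the pandas way to get the distinct names.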

In [91]:
df2 = pd.DataFrame(director_links)
df2.head()
df2.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1661 entries, 0 to 1660
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    1661 non-null   object
 1   link    1661 non-null   object
dtypes: object(2)
memory usage: 26.1+ KB


In [94]:
df2.to_csv('bbc_dir_link.csv', index = False)

In [96]:
dir_df = pd.read_csv('bbc_dir_link.csv')
dir_df

Unnamed: 0,name,link
0,David Lynch,/name/nm0000186/
1,Wong Kar-wai,/name/nm0939182/
2,Terrence Malick,/name/nm0000517/
3,Edward Yang,/name/nm0945981/
4,Jean-Luc Godard,/name/nm0000419/
...,...,...
1656,Asghar Farhadi,/name/nm1410815/
1657,Ang Lee,/name/nm0000487/
1658,Florian Henckel von Donnersmarck,/name/nm0003697/
1659,Jia Zhangke,/name/nm0422605/


In [161]:
dir_df[dir_df.name.str.contains('Pippo')]

Unnamed: 0,name,link
64,Pippo Delbono,/name/nm2647632/


In [155]:
# use a name other than `dict` so the built-in type isn't shadowed
director_record ={}
url = "https://www.imdb.com/name/nm0073047/"
raw_html = requests.get(url).content
doc = BeautifulSoup(raw_html, "html.parser")
headers = doc.find(id="name-born-info")
dir_cn = headers.find_all('a')[-1].string
director_record['name'] = "paolo"
director_record['dir_cn'] = dir_cn
dir_country.append(director_record)
director_record

#in dir_df, the link for "Paolo Benvenuti" is wrong (a different Paolo Benvenuti)
# 61	Paolo Benvenuti	/name/nm2379049/
#/name/nm0073047/ is correct
dir_df.at[61,'link'] = "/name/nm0073047/"


In [173]:
links = dir_df.link.tolist()
dir_country = []
for link in links:
    url = "https://www.imdb.com" + link
    raw_html = requests.get(url).content
    doc = BeautifulSoup(raw_html, "html.parser")
    headers = doc.find(id="name-born-info")
    try:
        # the last <a> in the born-info block is the birthplace
        dir_cn = headers.find_all('a')[-1].string
        dir_country.append(dir_cn)
        print(dir_cn)
    except Exception:
        # no born-info block (headers is None) or no links inside it
        dir_country.append("NaN")
        print("NaN")




Missoula, Montana, USA
Shanghai, China
Ottawa, Illinois, USA
Shanghai, China
Paris, France
Shiraz, Iran
Puerto Montt, Chile
Tehran, Iran
Hong Kong
Huesca, Aragón, Spain
Shanghai, China
Versailles, Seine-et-Oise [now Yvelines], France
Bangkok, Thailand
Tokyo, Japan
Texas, USA
Houston, Texas, USA
Ottawa, Illinois, USA
Tehran, Iran
Fenyang, Shanxi, China
Louisville, Kentucky, USA
San Carlos, California, USA
Toronto, Ontario, Canada
Houston, Texas, USA
Toronto, Ontario, Canada
Hastings, New Zealand
Versailles, Seine-et-Oise [now Yvelines], France
Lisbon, Portugal
Bilbao, Vizcaya, País Vasco, Spain
Memphis, Tennessee, USA
Kansas City, Missouri, USA
Shanghai, China
Missoula, Montana, USA
London, England, UK
Guadalajara, Jalisco, Mexico
Munich, Bavaria, Germany
Munich, Bavaria, Germany
Iasi, Romania
Suresnes, Seine [now Hauts-de-Seine], France
Bois-Colombes, Hauts-de-Seine, France
Studio City, California, USA
Munich, Bavaria, Germany
New York City, New York, USA
Bracebridge, Ontario, Canada
I

Houston, Texas, USA
London, England, UK
Studio City, California, USA
London, England, UK
Boston, Massachusetts, USA
Chinchilla, Queensland, Australia
Knoxville, Tennessee, USA
Mexico City, Distrito Federal, Mexico
Ottawa, Illinois, USA
Shanghai, China
Daegu, South Korea
Versailles, Seine-et-Oise [now Yvelines], France
Munich, Bavaria, Germany
Palermo, Sicily, Italy
Budapest, Hungary
Boston, Massachusetts, USA
Brooklyn, New York City, New York, USA
Shanghai, China
Warsaw, Mazowieckie, Poland
Houston, Texas, USA
Budapest, Hungary
Tokyo, Japan
Calzada de Calatrava, Ciudad Real, Castilla-La Mancha, Spain
Mexico City, Distrito Federal, Mexico
Ramos Mejía, Buenos Aires, Argentina
Moresnet-Chapelle, Belgium
Montreal, Quebec, Canada
London, England, UK
Bangkok, Thailand
NaN
Kiffa, Mauritania
NaN
Ziguenchor, Casamance, Senegal
Jiangxi Province, China
New York City, New York, USA
Pingtung, Taiwan
Harrogate, North Yorkshire, England, UK
Iasi, Romania
Tokyo, Japan
NaN
Calzada de Calatrava, Ciudad 

Bad Aussee, Styria, Austria
Studio City, California, USA
Bremen, West Germany [now Germany]
Los Angeles, California, USA
NaN
Daegu, South Korea
Karlsruhe, Baden-Württemberg, Germany
Fenyang, Shanxi, China
Salta, Argentina
NaN
Houston, Texas, USA
Suresnes, Seine [now Hauts-de-Seine], France
Ottawa, Illinois, USA
Shanghai, China
Tehran, Iran
London, England, UK
Mexico City, Distrito Federal, Mexico
Missoula, Montana, USA
Toronto, Ontario, Canada
Lexington, Kentucky, USA
Queens, New York City, New York, USA
Warsaw, Mazowieckie, Poland
Khomeyni Shahr, Isfahan, Iran
Munich, Bavaria, Germany
Texas, USA
Chicago, Illinois, USA
Taree, New South Wales, Australia
Beirut, Lebanon
London, England, UK
London, England, UK
Xi'an, Shaanxi, China
Knoxville, Tennessee, USA
Queens, New York City, New York, USA
NaN
Shanghai, China
NaN
Cologne, North Rhine-Westphalia, Germany
Paris, France
Akron, Ohio, USA
Bangkok, Thailand
Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]
Roanne, Loire, Rhône-Al

Philadelphia, Pennsylvania, USA
Bronx, New York City, New York, USA
Paris, France
Denver, Colorado, USA
Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]
Missoula, Montana, USA
Wellington, New Zealand
Kalispell, Montana, USA
Mexico City, Distrito Federal, Mexico
Houston, Texas, USA
Munich, Bavaria, Germany
Pingtung, Taiwan
Vannes, Morbihan, France
Brussels, Belgium
Tulle, Corrèze, France
Oporto, Portugal
Ferrara, Emilia-Romagna, Italy
Vicksburg, Mississippi, USA
Ottawa, Illinois, USA
Meixian, Guangdong, China
Bronx, New York City, New York, USA
Knoxville, Tennessee, USA
Versailles, Seine-et-Oise [now Yvelines], France
Pingtung, Taiwan
Suresnes, Seine [now Hauts-de-Seine], France
Buenos Aires, Argentina
El Paso, Texas, USA
Copenhagen, Denmark
Chinchilla, Queensland, Australia
Pukerua Bay, North Island, New Zealand
Munich, Bavaria, Germany
Studio City, California, USA
Missoula, Montana, USA
Copenhagen, Denmark
Studio City, California, USA
London, England, UK
Bressuire, Deux-Sè

Mianeh, Azarbaijan Province, Iran
Beersheba, Israel
Bethlehem, Palestine
North Carolina, USA
New York City, New York, USA
Paris, France
Pingtung, Taiwan
Khomeyni Shahr, Isfahan, Iran
Denver, Colorado, USA
Missoula, Montana, USA
Shanghai, China
Shanghai, China
Houston, Texas, USA
Bracebridge, Ontario, Canada
Malmö, Skåne län, Sweden
Copenhagen, Denmark
Atlanta, Georgia, USA
Ottawa, Illinois, USA
Mexico City, Distrito Federal, Mexico
Podorvikha, Irkutskaya oblast, RSFSR, USSR [now Russia]
Studio City, California, USA
San Francisco, California, USA
Bucharest, Romania
Nazareth, Israel
Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]
Gothenburg, Västra Götalands län, Sweden
Roanne, Loire, Rhône-Alpes, France
Atlanta, Georgia, USA
Palm Springs, California, USA
New York City, New York, USA
Studio City, California, USA
Denver, Colorado, USA
Versailles, Seine-et-Oise [now Yvelines], France
Denver, Colorado, USA
Beech Grove, Indiana, USA
New Providence, New Jersey, USA
Shanghai, Chin

Pingtung, Taiwan
Calzada de Calatrava, Ciudad Real, Castilla-La Mancha, Spain
Missoula, Montana, USA
Versailles, Seine-et-Oise [now Yvelines], France
Munich, Bavaria, Germany
Bonghwa, South Korea
Dorset, England, UK
Tangshan, China
Khomeyni Shahr, Isfahan, Iran
Pingtung, Taiwan
Cologne, North Rhine-Westphalia, Germany
Fenyang, Shanxi, China
Mexico City, Distrito Federal, Mexico


In [174]:
dir_df['dir_cn'] = dir_country

dir_df.head()

Unnamed: 0,name,link,dir_cn
0,David Lynch,/name/nm0000186/,"Missoula, Montana, USA"
1,Wong Kar-wai,/name/nm0939182/,"Shanghai, China"
2,Terrence Malick,/name/nm0000517/,"Ottawa, Illinois, USA"
3,Edward Yang,/name/nm0945981/,"Shanghai, China"
4,Jean-Luc Godard,/name/nm0000419/,"Paris, France"
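The `dir_cn` values are full birthplace strings, so a country still has to be pulled out of them. The sketch below is one possible rule, not the only one: take the text after the last comma, and if that piece ends with a "[now ...]" note, use the name inside the note. `birthplace_to_country` and `NOW_NOTE` are names invented for this example.

```python
import re

# Matches a trailing note such as "[now Russia]" and captures "Russia".
NOW_NOTE = re.compile(r"\[now ([^\]]+)\]$")


def birthplace_to_country(place):
    """Guess a country from a birthplace string like 'Paris, France'."""
    # Missing values arrive as the string "NaN" or as a float NaN after read_csv.
    if not isinstance(place, str) or place.strip() in ("", "NaN"):
        return None
    last_part = place.split(",")[-1].strip()
    note = NOW_NOTE.search(last_part)
    if note:
        return note.group(1).strip()
    return last_part


for sample in ["Missoula, Montana, USA",
               "Hong Kong",
               "Versailles, Seine-et-Oise [now Yvelines], France",
               "Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]"]:
    print(sample, "->", birthplace_to_country(sample))
```

Whether a historical country such as the USSR should map to its present-day name is an interpretive choice; this sketch follows the "[now ...]" note, but keeping the original name is just as defensible.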


In [175]:
dir_df.to_csv('bbc_dir_link.csv', index = False)

In [180]:
dir_df = pd.read_csv('bbc_dir_link.csv')
dir_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1661 entries, 0 to 1660
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    1661 non-null   object
 1   link    1661 non-null   object
 2   dir_cn  1618 non-null   object
dtypes: object(3)
memory usage: 39.1+ KB
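
The `info()` output above shows 1618 non-null `dir_cn` values out of 1661 rows, so 43 rows have no birthplace. A minimal sketch for listing which directors those are, using a small stand-in frame rather than the real `bbc_dir_link.csv` ("Director A" and its link are hypothetical placeholders):

```python
import pandas as pd

# Stand-in for dir_df; "Director A" is a hypothetical row with no birthplace found
sample_dir_df = pd.DataFrame({
    "name": ["David Lynch", "Director A", "Edward Yang", "Director A"],
    "link": ["/name/nm0000186/", "/name/nm0000000/", "/name/nm0945981/", "/name/nm0000000/"],
    "dir_cn": ["Missoula, Montana, USA", None, "Shanghai, China", None],
})

# Rows whose birthplace is missing, reduced to one entry per director
missing = sample_dir_df[sample_dir_df["dir_cn"].isna()]
missing_names = missing["name"].drop_duplicates().tolist()
print(missing_names)
```

Running the same filter on the real `dir_df` would give the list of directors to look up by hand.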


In [206]:
#merge dir_df with original df

original_df = pd.read_csv('bbc_movie_2.csv')
original_df

Unnamed: 0,movie,director,m_year,crit_rank,critic,crit_org,crit_cn
0,Mulholland Drive,David Lynch,2001,1,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,Wong Kar-wai,2000,2,Simon Abrams,Freelance film critic,US
2,The Tree of Life,Terrence Malick,2011,3,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,Edward Yang,2000,4,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,Jean-Luc Godard,2014,5,Simon Abrams,Freelance film critic,US
...,...,...,...,...,...,...,...
1765,The Lives of Others,Florian Henckel von Donnersmarck,2006,6,Raymond Zhou,China Daily,China
1766,Still Life,Jia Zhangke,2006,7,Raymond Zhou,China Daily,China
1767,Birdman,Alejandro González Iñárritu,2014,8,Raymond Zhou,China Daily,China
1768,Infernal Affairs,Andrew Lau and Alan Mak,2002,9,Raymond Zhou,China Daily,China


In [201]:
dir_df = pd.read_csv('bbc_dir_link.csv')

dir_df = dir_df.rename(columns={'name': 'director'})
dir_df.head()

Unnamed: 0,director,link,dir_cn
0,David Lynch,/name/nm0000186/,"Missoula, Montana, USA"
1,Wong Kar-wai,/name/nm0939182/,"Shanghai, China"
2,Terrence Malick,/name/nm0000517/,"Ottawa, Illinois, USA"
3,Edward Yang,/name/nm0945981/,"Shanghai, China"
4,Jean-Luc Godard,/name/nm0000419/,"Paris, France"


In [212]:
merged_df = pd.merge(original_df, dir_df, how='left', on='director').drop_duplicates().reset_index(drop=True)
merged_df

Unnamed: 0,movie,director,m_year,crit_rank,critic,crit_org,crit_cn,link,dir_cn
0,Mulholland Drive,David Lynch,2001,1,Simon Abrams,Freelance film critic,US,/name/nm0000186/,"Missoula, Montana, USA"
1,In the Mood for Love,Wong Kar-wai,2000,2,Simon Abrams,Freelance film critic,US,/name/nm0939182/,"Shanghai, China"
2,The Tree of Life,Terrence Malick,2011,3,Simon Abrams,Freelance film critic,US,/name/nm0000517/,"Ottawa, Illinois, USA"
3,Yi Yi: A One and a Two,Edward Yang,2000,4,Simon Abrams,Freelance film critic,US,/name/nm0945981/,"Shanghai, China"
4,Goodbye to Language,Jean-Luc Godard,2014,5,Simon Abrams,Freelance film critic,US,/name/nm0000419/,"Paris, France"
...,...,...,...,...,...,...,...,...,...
1765,The Lives of Others,Florian Henckel von Donnersmarck,2006,6,Raymond Zhou,China Daily,China,/name/nm0003697/,"Cologne, North Rhine-Westphalia, Germany"
1766,Still Life,Jia Zhangke,2006,7,Raymond Zhou,China Daily,China,/name/nm0422605/,"Fenyang, Shanxi, China"
1767,Birdman,Alejandro González Iñárritu,2014,8,Raymond Zhou,China Daily,China,/name/nm0327944/,"Mexico City, Distrito Federal, Mexico"
1768,Infernal Affairs,Andrew Lau and Alan Mak,2002,9,Raymond Zhou,China Daily,China,,
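
Since `dir_df` has 1661 rows, directors who appear on several critics' lists are presumably repeated in it, and a merge on `director` then multiplies rows before `drop_duplicates()` collapses them again. One possible alternative, sketched here on hypothetical miniature frames ("Critic A" and "Critic B" are placeholders) and assuming each director has a single birthplace, is to deduplicate the lookup table first and let pandas check the relationship with `validate`:

```python
import pandas as pd

# Hypothetical miniature versions of original_df and dir_df
movies = pd.DataFrame({
    "movie": ["Mulholland Drive", "Mulholland Drive", "Yi Yi: A One and a Two"],
    "director": ["David Lynch", "David Lynch", "Edward Yang"],
    "critic": ["Critic A", "Critic B", "Critic A"],
})
directors = pd.DataFrame({
    "director": ["David Lynch", "David Lynch", "Edward Yang"],
    "dir_cn": ["Missoula, Montana, USA", "Missoula, Montana, USA", "Shanghai, China"],
})

# Keep one row per director so each movie row matches at most one lookup row
directors_unique = directors.drop_duplicates(subset="director")

# validate raises MergeError if a director still appears more than once on the right
merged = movies.merge(directors_unique, how="left", on="director",
                      validate="many_to_one")
print(len(merged), len(movies))
```

With this approach the merged frame keeps exactly one row per row of the left frame, so the row count is easy to sanity-check.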


In [213]:
# save
merged_df.to_csv('bbc_movie_3.csv', index = False)

In [208]:
!pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Using legacy 'setup.py install' for future, since package 'wheel' is not installed.
Installing collected packages: ratelim, future, geocoder
    Running setup.py install for future: started
    Running setup.py install for future: finished with status 'done'
Successfully installed future-0.18.2 geocoder-1.38.1 ratelim-0.1.6


In [264]:
# These dir_cn values are hard to convert to lat/lng
merged_df[merged_df.dir_cn.str.contains("now ", na = False)].dir_cn.value_counts()


Versailles, Seine-et-Oise [now Yvelines], France                                  30
Suresnes, Seine [now Hauts-de-Seine], France                                      16
Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]                      15
Podorvikha, Irkutskaya oblast, RSFSR, USSR [now Russia]                            5
Rouen, Seine-Inférieure [now Seine-Maritime], France                               2
Leningrad, Russian SFSR, USSR [now St. Petersburg, Russia]                         2
Sol-Iletsk, Sol-Iletskiy rayon, Orenburgskaya oblast, RSFSR, USSR [now Russia]     2
Bremen, West Germany [now Germany]                                                 1
Lwów, Lwowskie, Poland [now Lviv, Ukraine]                                         1
Name: dir_cn, dtype: int64

In [329]:
# Raw strings (r'...') keep the \[ and \] escapes intact for the regex engine
# "RSFSR, USSR [now Russia]" to "Russia"
merged_df.dir_cn = merged_df.dir_cn.replace(r'RSFSR, USSR \[now Russia\]', 'Russia', regex = True)
# "Leningrad, Russian SFSR, USSR [now St. Petersburg, Russia]" to "St. Petersburg, Russia"
# (match the whole value so no "Leningrad, Russian" prefix is left behind)
merged_df.dir_cn = merged_df.dir_cn.replace(r"Leningrad, Russian SFSR, USSR \[now St. Petersburg, Russia\]", "St. Petersburg, Russia", regex = True)
# "Seine-et-Oise [now Yvelines]" to "Yvelines"
merged_df.dir_cn = merged_df.dir_cn.replace(r"Seine-et-Oise \[now Yvelines\]", "Yvelines", regex = True)
# "Seine [now Hauts-de-Seine]" to "Hauts-de-Seine"
merged_df.dir_cn = merged_df.dir_cn.replace(r"Seine \[now Hauts-de-Seine\]", "Hauts-de-Seine", regex = True)
# "Seine-Inférieure [now Seine-Maritime]" to "Seine-Maritime"
merged_df.dir_cn = merged_df.dir_cn.replace(r"Seine-Inférieure \[now Seine-Maritime\]", "Seine-Maritime", regex = True)
# "West Germany [now Germany]" to "Germany"
merged_df.dir_cn = merged_df.dir_cn.replace(r"West Germany \[now Germany\]", "Germany", regex = True)
# "Lwów, Lwowskie, Poland [now Lviv, Ukraine]" to "Lviv, Ukraine"
# (match the whole value so the old "Lwów, Lwowskie" prefix is dropped too)
merged_df.dir_cn = merged_df.dir_cn.replace(r"Lwów, Lwowskie, Poland \[now Lviv, Ukraine\]", "Lviv, Ukraine", regex = True)
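
The replacements in the cell above can also be driven from a dictionary, which keeps the old and new names side by side and avoids escaping the brackets. This is only a sketch on a few sample values taken from the `value_counts()` output; the names `renames` and `places` are introduced here for illustration and are not part of the notebook:

```python
import pandas as pd

# Old place-name fragments mapped to their modern equivalents
renames = {
    "RSFSR, USSR [now Russia]": "Russia",
    "Seine-et-Oise [now Yvelines]": "Yvelines",
    "West Germany [now Germany]": "Germany",
}

# A few dir_cn values, plus one that needs no change
places = pd.Series([
    "Novosibirsk, Novosibirskaya oblast, RSFSR, USSR [now Russia]",
    "Versailles, Seine-et-Oise [now Yvelines], France",
    "Bremen, West Germany [now Germany]",
    "Shanghai, China",
])

# regex=False treats the square brackets literally, so no backslashes are needed
for old, new in renames.items():
    places = places.str.replace(old, new, regex=False)

print(places.tolist())
```

Adding another historical name then only means adding one more dictionary entry.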




In [334]:
# find geocodes of each dir_cn
import geocoder
from tqdm.auto import tqdm

dir_cn_list = merged_df.dir_cn.tolist()



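dir_cn_latlng = []
for cn in tqdm(dir_cn_list):
    # Missing birthplaces are float NaN in pandas, not the string "NaN"
    if pd.isna(cn):
        dir_cn_latlng.append(None)
        continue
    try:
        ret = geocoder.osm(cn, timeout=6.0)
        dir_cn_latlng.append(ret.latlng)
    except Exception:
        # Append a placeholder so the list stays the same length as merged_df
        dir_cn_latlng.append(None)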

dir_cn_latlng

  0%|          | 0/1770 [00:00<?, ?it/s]

[[46.8701049, -113.995267],
 [31.2322758, 121.4692071],
 [41.3516628, -88.845436],
 [31.2322758, 121.4692071],
 [48.8588897, 2.3200410217200766],
 [29.6060218, 52.5378041],
 [-41.4718121, -72.939621],
 [35.6892523, 51.3896004],
 [22.2793278, 114.1628131],
 [42.13606145, -0.029802662719165485],
 [31.2322758, 121.4692071],
 [48.8035403, 2.1266886],
 [13.7525438, 100.4934734],
 [35.6828387, 139.7594549],
 [31.168570000000003, -99.68300099546674],
 [29.7589382, -95.3676974],
 [41.3516628, -88.845436],
 [35.6892523, 51.3896004],
 [37.2683004, 111.7830029],
 [38.2542376, -85.759407],
 [37.504936, -122.261823],
 [43.6534817, -79.3839347],
 [29.7589382, -95.3676974],
 [43.6534817, -79.3839347],
 [-39.6417678, 176.8430781],
 [48.8035403, 2.1266886],
 [38.7077507, -9.1365919],
 [43.2630018, -2.9350039],
 [35.1490215, -90.0516285],
 [39.100105, -94.5781416],
 [31.2322758, 121.4692071],
 [46.8701049, -113.995267],
 [51.5073219, -0.1276474],
 [20.6720375, -103.338396],
 [48.1371079, 11.5753822],
 [
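
The output above repeats the same coordinates many times (Shanghai alone appears four times in the first rows), because the loop geocodes every row. A possible refinement, sketched here under the assumption that identical `dir_cn` strings should always get the same coordinates, is to look up each distinct place once and map the results back. The helper `geocode_unique` and the offline `fake_lookup` are hypothetical names used only for this illustration:

```python
import pandas as pd

def geocode_unique(places, lookup):
    """Call lookup once per distinct non-missing place, then map back to every row."""
    cache = {}
    for place in places.dropna().unique():
        try:
            cache[place] = lookup(place)
        except Exception:
            cache[place] = None  # keep going if one lookup fails
    # Missing or failed places come back as None
    return places.map(lambda p: cache.get(p))

# Stand-in lookup so the sketch runs offline; with the geocoder package it could be
# lambda place: geocoder.osm(place, timeout=6.0).latlng
calls = []
def fake_lookup(place):
    calls.append(place)
    return [0.0, 0.0]

sample = pd.Series(["Shanghai, China", "Shanghai, China", None, "Paris, France"])
result = geocode_unique(sample, fake_lookup)
print(len(calls), len(result))
```

Besides being faster, fewer requests are also gentler on the geocoding service.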

In [337]:
merged_df['dir_cn_geocode'] = dir_cn_latlng
merged_df.head()

Unnamed: 0,movie,director,m_year,crit_rank,critic,crit_org,crit_cn,link,dir_cn,dir_cn_geocode
0,Mulholland Drive,David Lynch,2001,1,Simon Abrams,Freelance film critic,US,/name/nm0000186/,"Missoula, Montana, USA","[46.8701049, -113.995267]"
1,In the Mood for Love,Wong Kar-wai,2000,2,Simon Abrams,Freelance film critic,US,/name/nm0939182/,"Shanghai, China","[31.2322758, 121.4692071]"
2,The Tree of Life,Terrence Malick,2011,3,Simon Abrams,Freelance film critic,US,/name/nm0000517/,"Ottawa, Illinois, USA","[41.3516628, -88.845436]"
3,Yi Yi: A One and a Two,Edward Yang,2000,4,Simon Abrams,Freelance film critic,US,/name/nm0945981/,"Shanghai, China","[31.2322758, 121.4692071]"
4,Goodbye to Language,Jean-Luc Godard,2014,5,Simon Abrams,Freelance film critic,US,/name/nm0000419/,"Paris, France","[48.8588897, 2.3200410217200766]"


In [339]:
# save
merged_df.to_csv('bbc_movie_4.csv', index = False)
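
One thing to watch when saving: `to_csv` writes each `[lat, lng]` list as a string such as `"[46.8701049, -113.995267]"`, which has to be parsed again after `read_csv`. A minimal sketch of one way around that, splitting the pair into two numeric columns before saving; the column names `dir_lat` and `dir_lng` and the miniature frame are assumptions for illustration:

```python
import pandas as pd

# Hypothetical miniature of merged_df with the geocode column from the cells above
sample = pd.DataFrame({
    "dir_cn": ["Missoula, Montana, USA", "Shanghai, China", None],
    "dir_cn_geocode": [[46.8701049, -113.995267], [31.2322758, 121.4692071], None],
})

# Pull latitude and longitude out of each pair; rows without a geocode stay empty
geocodes = sample["dir_cn_geocode"]
sample["dir_lat"] = [g[0] if isinstance(g, (list, tuple)) else None for g in geocodes]
sample["dir_lng"] = [g[1] if isinstance(g, (list, tuple)) else None for g in geocodes]

print(sample[["dir_cn", "dir_lat", "dir_lng"]])
```

Plain numeric columns survive the round trip through CSV and can be passed directly to most mapping libraries.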