<h2><u>Situation </u></h2>

It's the end of the decade, which means we get a buttload of "top X of the decade" lists! I love movies and wanted to see which ones I may have missed over the last decade, so I decided to scrape every top-movies-of-the-decade list I could get my hands on. With that list in hand, I want to:

1. Analyze the database for interesting trends 
2. Create a TOP top list of the decade using all the lists
3. Deploy a recommender system that suggests movies from this top movie list.

<h2><u> Task </u></h2>

The first thing I had to get done was data collection and that's what this notebook is all about.

<h2><u> Action </u></h2>

I defined a couple of functions that scrape a website given a link and the CSS selector of the element you want to extract. Over the course of scraping 35 websites, I wrote a few more functions to clean the list of movies as required. After each successful scrape, I manually added the titles to a CSV file and did any additional cleanup of the data if required.

In [527]:
import re
import requests
from scrapy.http import TextResponse
from bs4 import BeautifulSoup


#parses web link and returns matches found for given tag
def parser(link, tag):
    res = requests.get(link)
    response = TextResponse(res.url, body=res.text, encoding='utf-8')
    return response.css(tag).getall()
    
#function to remove html tags
def remove_html(movies):
    sol = []
    for movie in movies:
        s = BeautifulSoup(movie, "lxml").text
        if re.search('[a-zA-Z]', s):
            sol.append(s)
    return sol

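#function to remove dates and special chars
#note: only letters, spaces and colons are kept, so digits, hyphens and apostrophes inside a title are dropped too
def remove_non_chars(movies):
    #same character class as '[^a-z : - A-Z]', written without the redundant space-to-space range
    regex = re.compile('[^a-zA-Z :]')
    #cleans the list in place and returns it
    for idx, movie in enumerate(movies):
        movies[idx] = regex.sub('', movie)
    return movies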

#function to find movie titles using regex in the more stubborn to parse websites
def find_movies(movies, expr):
    res = []
    for title in movies:
        f = re.search(expr, title)
        if f: #if a movie title is found
            res.append(f.group(1))
    return res


<h2> No Film School </h2>

In [308]:
link = "https://nofilmschool.com/best-movies-of-the-decade"
tag = 'strong, strong em::text'
movies = parser(link, tag)
movies

['<strong>Honorable Mentions</strong>',
 '<strong><em>Bridesmaids</em>,\xa0<em>Frozen</em>,\xa0<em>Gone Girl, Birdman</em>,\xa0<em>Manchester by the Sea</em>,\xa0<em>The Master</em>, <em>12 Years a Slave</em>,\xa0<em>Nightcrawler</em>,\xa0<em>Black Swan</em>,\xa0<em>Lady Bird</em>,\xa0<em>Before Midnight</em>,\xa0<em>It Follows</em>, <em>The Avengers,\xa0</em><em>Guardians of the</em>\xa0<em>Galaxy</em>,\xa0<em>Thor: Ragnarok</em>,\xa0<em>The Post</em>, <em>Spider-Man: Into the Spider-Verse</em>,\xa0<em>A Star Is Born</em>, <em>Mission: Impossible - Fallout</em>,\xa0<em>Brooklyn</em>,\xa0<em>Avengers: Endgame</em>, <em>Once Upon a Time In Hollywood</em>,\xa0<em>Jojo Rabbit,\xa0</em>and <em>Zero Dark Thirty</em>.</strong>',
 'Bridesmaids',
 'Frozen',
 'Gone Girl, Birdman',
 'Manchester by the Sea',
 'The Master',
 '12 Years a Slave',
 'Nightcrawler',
 'Black Swan',
 'Lady Bird',
 'Before Midnight',
 'It Follows',
 'The Avengers,\xa0',
 'Guardians of the',
 'Galaxy',
 'Thor: Ragnarok',
 

In [311]:
import re
sol = []
for title in movies:
    if re.match('<strong>', title):
        sol.append(title)

In [312]:
for title in sol:
    if re.match('<strong>', title):
        print(title)

<strong>Honorable Mentions</strong>
<strong><em>Bridesmaids</em>, <em>Frozen</em>, <em>Gone Girl, Birdman</em>, <em>Manchester by the Sea</em>, <em>The Master</em>, <em>12 Years a Slave</em>, <em>Nightcrawler</em>, <em>Black Swan</em>, <em>Lady Bird</em>, <em>Before Midnight</em>, <em>It Follows</em>, <em>The Avengers, </em><em>Guardians of the</em> <em>Galaxy</em>, <em>Thor: Ragnarok</em>, <em>The Post</em>, <em>Spider-Man: Into the Spider-Verse</em>, <em>A Star Is Born</em>, <em>Mission: Impossible - Fallout</em>, <em>Brooklyn</em>, <em>Avengers: Endgame</em>, <em>Once Upon a Time In Hollywood</em>, <em>Jojo Rabbit, </em>and <em>Zero Dark Thirty</em>.</strong>
<strong><span id="docs-internal-guid-3f7fa7ee-7fff-ad99-f999-20b3e8547765"><em>Get Out</em> (2017) </span></strong>
<strong><em>The Tree of Life</em></strong>
<strong>(2011)</strong>
<strong><em>Spotlight </em>(2015)</strong>
<strong><em>Stories We Tell</em> (2013)</strong>
<strong><em>Mad Max: Fury Road</em> (2015)</strong>
<s

In [319]:
remove_non_chars(remove_html(sol))[2:]

['Get Out ',
 'The Tree of Life',
 'Spotlight',
 'Stories We Tell ',
 'Mad Max: Fury Road',
 'Hereditary',
 'Paddington ',
 'Moneyball',
 'Black Panther',
 'Madelines Madeline ',
 'Citizen Four ',
 'Boyhood',
 'The Wolf of Wall Street',
 'Tangerine',
 'Francis Ha',
 'Ex Machina',
 'Spring Breakers',
 'Her',
 'Moonlight',
 'Parasite',
 'Melancholia ',
 'Inside Out',
 'Thunder Road',
 'Swiss Army Man',
 'The Art of Killing',
 'Cameraperson',
 'The Revenant ',
 'Inception',
 'Roma ',
 'Foxtrot',
 'Django Unchained',
 'Inside Llewyn Davis',
 'Drive',
 'Skyfall',
 'The Social Network ',
 'Moana',
 'The Witch']

<h2> Insider </h2>

In [320]:
link = "https://www.insider.com/best-films-of-the-decade-2010-2019-11#1-get-out-director-jordan-peele-2017-100"
tag = '.slide-title-text'
movies = parser(link, tag)
movies

['<h2 class="slide-title-text">100. "Nocturnal Animals" (Director: Tom Ford, 2016)</h2>',
 '<h2 class="slide-title-text">99. "Free Solo" (Directors: Jimmy Chin and Elizabeth Chai Vasarhelyi, 2018)</h2>',
 '<h2 class="slide-title-text">98. "Creed" (Director: Ryan Coogler, 2015)</h2>',
 '<h2 class="slide-title-text">97. "Deadpool" (Director: Tim Miller, 2016)</h2>',
 '<h2 class="slide-title-text">96. "Interstellar" (Director: Christopher Nolan, 2014)</h2>',
 '<h2 class="slide-title-text">95. "Take Shelter" (Director: Jeff Nichols, 2011)</h2>',
 '<h2 class="slide-title-text">94. "Guardians of the Galaxy" (Director: James Gunn, 2014)</h2>',
 '<h2 class="slide-title-text">93. "Django Unchained" (Director: Quentin Tarantino, 2012)</h2>',
 '<h2 class="slide-title-text">92. "Phoenix" (Director: Christian Petzold, 2014)</h2>',
 '<h2 class="slide-title-text">91. "The Revenant" (Director: Alejandro González Iñárritu, 2016)</h2>',
 '<h2 class="slide-title-text">90. "Booksmart" (Director: Olivia Wi

In [321]:
#finds quoted movies
for index, title in enumerate(movies):
    movies[index] = re.findall('"([^"]*)"',  title)

In [323]:
sol = []
for movie in movies:
    sol.append(movie[1])

In [324]:
sol

['Nocturnal Animals',
 'Free Solo',
 'Creed',
 'Deadpool',
 'Interstellar',
 'Take Shelter',
 'Guardians of the Galaxy',
 'Django Unchained',
 'Phoenix',
 'The Revenant',
 'Booksmart',
 'Scott Pilgrim vs. the World',
 'John Wick',
 'Gravity',
 'Zero Dark Thirty',
 'The Conjuring',
 'Your Name',
 'What We Do in the Shadows',
 'Snowpiercer',
 'Spring Breakers',
 'Room',
 'Skyfall',
 'Toy Story 3',
 'The Farewell',
 'Train to Busan',
 'Coco',
 'Sing Street',
 'Avengers: Endgame',
 'The Babadook',
 'Selma',
 'Amour',
 'Shoplifters',
 'Sorry to Bother You',
 'Logan',
 'Spotlight',
 'The Favourite',
 'A Quiet Place',
 'Bridesmaids',
 'Prisoners',
 'The Great Beauty',
 'A Separation',
 'Star Wars: The Last Jedi',
 'First Man',
 'If Beale Street Could Talk',
 'The Big Short',
 'Tangerine',
 'Minding the Gap',
 'Upstream Color',
 'Short Term 12',
 'Carol',
 'A Star Is Born',
 'The Shape of Water',
 'Burning',
 'Brooklyn',
 'Gone Girl',
 '12 Years a Slave',
 'Once Upon a Time...in Hollywood',
 '
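The indexing with `movie[1]` above works because the first quoted string in each heading is the CSS class, not the title. A minimal sketch of that behaviour on one heading copied from the output above, plus an alternative pattern (my own assumption, not what the notebook ran) that anchors on the rank so only the title is captured:

```python
import re

# One heading copied from the scraped output above.
heading = '<h2 class="slide-title-text">100. "Nocturnal Animals" (Director: Tom Ford, 2016)</h2>'

# The notebook's pattern returns every quoted string, including the class attribute.
all_quoted = re.findall('"([^"]*)"', heading)

# Hypothetical alternative: require the "100. " rank before the opening quote.
TITLE_PATTERN = re.compile(r'\d+\.\s+"([^"]*)"')
match = TITLE_PATTERN.search(heading)
title = match.group(1) if match else None

print(all_quoted)
print(title)
```

Anchoring on the rank avoids relying on the position of the title within the list of quoted strings, which would shift if the markup gained another attribute.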

<h2> Games Radar </h2>

In [325]:
link = "https://www.gamesradar.com/decade-best-movies-2010-2019/"
tag = 'strong , h2'
movies = parser(link, tag)
movies

['<strong>Best of the Decade</strong>',
 '<strong>100 Best TV shows of the decade</strong>',
 '<strong>100 Best games of the decade</strong>',
 '<h2 id="100-once-upon-a-time-x2026-in-hollywood">100. Once Upon A time… In Hollywood</h2>',
 '<strong>Once Upon A Time... In Hollywood<br>\nYear:</strong>',
 '<strong>Director:</strong>',
 '<strong>Jamie Graham\xa0</strong>',
 '<strong>99. A Girl Walks Home Alone At Night<br>\nYear:</strong>',
 '<strong>Director:</strong>',
 '<strong>Jamie Graham</strong>',
 '<strong>98. The Shape of Water<br>\nYear:</strong>',
 '<strong>Director:</strong>',
 '<strong>Matt Maytum</strong>',
 '<strong>97. Booksmart<br>\nYear:</strong>',
 '<strong>Director:</strong>',
 '<strong>Chris Schilling</strong>',
 '<strong>96. Senna<br>\nYear:</strong>',
 '<strong>Director:</strong>',
 '<strong>Neil Smith</strong>',
 '<h2 id="95-attack-the-block">95. Attack The Block</h2>',
 '<strong>Year:</strong>',
 '<strong>Director:</strong>',
 '<strong>James Mottram</strong>',
 '<st

In [448]:
link = "https://www.gamesradar.com/decade-best-movies-2010-2019/"
links = [link + str(i) for i in range(1,5)]
tag = 'strong , h2'
sol = []
for link in links:
    movies = parser(link, tag)
    sol.append(find_movies(movies, r'\. (.*?)<'))
sol

[['Once Upon A time… In Hollywood',
  'In Hollywood',
  'A Girl Walks Home Alone At Night',
  'The Shape of Water',
  'Booksmart',
  'Senna',
  'Attack The Block',
  'BPM',
  'Scott Pilgrim Vs. The World',
  'Dredd',
  'Mother',
  'Youth',
  'The Wailing',
  'Sicario',
  'Brooklyn',
  'Mission: Impossible – Fallout',
  'Star Wars: The Last Jedi',
  'Paddington 2',
  'Leave No trace',
  'Burning',
  'Inherent Vice',
  'Once Upon A Time In Anatolia',
  'The Turin Horse',
  'A Quiet Place',
  'Before Midnight',
  'The Wolf of Wall Street',
  'Monsters',
  'Ida',
  'The Great Beauty',
  'Beasts of the Southern Wild',
  'Cold War'],
 ['Thor: Ragnarok',
  'Roma',
  'Coco',
  '12 Years A Slave',
  'Uncle Boonmee Who Can Recall His Past Lives',
  'Guardians of the Galaxy',
  'The Favourite',
  'Birdman',
  'The Artist',
  'Inside Llewyn Davis',
  'The Lobster',
  'The Lost City of Z',
  'Captain America: The Winter Soldier',
  'You Were Never Really Here',
  'The Master',
  'Spider-Man: Into t

In [449]:
#unwrap list of lists for ease of copy paste
flat_sol = [item for sublist in sol for item in sublist]
flat_sol

['Once Upon A time… In Hollywood',
 'In Hollywood',
 'A Girl Walks Home Alone At Night',
 'The Shape of Water',
 'Booksmart',
 'Senna',
 'Attack The Block',
 'BPM',
 'Scott Pilgrim Vs. The World',
 'Dredd',
 'Mother',
 'Youth',
 'The Wailing',
 'Sicario',
 'Brooklyn',
 'Mission: Impossible – Fallout',
 'Star Wars: The Last Jedi',
 'Paddington 2',
 'Leave No trace',
 'Burning',
 'Inherent Vice',
 'Once Upon A Time In Anatolia',
 'The Turin Horse',
 'A Quiet Place',
 'Before Midnight',
 'The Wolf of Wall Street',
 'Monsters',
 'Ida',
 'The Great Beauty',
 'Beasts of the Southern Wild',
 'Cold War',
 'Thor: Ragnarok',
 'Roma',
 'Coco',
 '12 Years A Slave',
 'Uncle Boonmee Who Can Recall His Past Lives',
 'Guardians of the Galaxy',
 'The Favourite',
 'Birdman',
 'The Artist',
 'Inside Llewyn Davis',
 'The Lobster',
 'The Lost City of Z',
 'Captain America: The Winter Soldier',
 'You Were Never Really Here',
 'The Master',
 'Spider-Man: Into the Spider-Verse',
 'Shame',
 'Toy Story 3',
 'Fo
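The stray 'In Hollywood' entry in the lists above comes from the ellipsis in "Once Upon A Time... In Hollywood": the pattern matches any ". " followed by text, not only the one after a rank. A small sketch of the effect on strings copied from the scraped output; the stricter pattern is an assumption of mine, not something the notebook used:

```python
import re

def find_movies(movies, expr):
    # Same helper as defined earlier in the notebook.
    res = []
    for title in movies:
        f = re.search(expr, title)
        if f:
            res.append(f.group(1))
    return res

# Elements copied from the GamesRadar output above.
scraped = [
    '<h2 id="100-once-upon-a-time-x2026-in-hollywood">100. Once Upon A time… In Hollywood</h2>',
    '<strong>Once Upon A Time... In Hollywood<br>\nYear:</strong>',
    '<strong>Director:</strong>',
    '<strong>99. A Girl Walks Home Alone At Night<br>\nYear:</strong>',
]

loose = find_movies(scraped, r'\. (.*?)<')        # pattern used in the notebook
strict = find_movies(scraped, r'>\d+\. (.*?)<')   # hypothetical: require a rank right after a tag

print(loose)
print(strict)
```

Requiring digits before the dot would drop the duplicate without any manual cleanup, at the cost of missing entries whose rank is formatted differently.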

<h2> AV Club </h2>

In [395]:
link = "https://film.avclub.com/the-100-best-movies-of-the-2010s-1839846306"
tag = '.bTZVlP em'
movies = parser(link, tag)
movies

['<em>Miss Bala</em>',
 '<em>The Immigrant</em>',
 '<em>\xa0</em>',
 '<em>The Comedy</em>',
 '<em>Uncle Boonmee Who Can Recall His Past Lives</em>',
 '<em>Bridesmaids</em>',
 '<em> </em>',
 '<em>Cameraperson</em>',
 '<em> </em>',
 '<em>The Turin Horse</em>',
 '<em> </em>',
 '<em>Creed</em>',
 '<em>Uncut Gems</em>',
 '<em>The Arbor</em>',
 '<em>High Life</em>',
 '<em> </em>',
 '<em>Carol</em>',
 '<em> </em>',
 '<em>Drug War</em>',
 '<em> </em>',
 '<em>Weekend</em>',
 '<em>The Lost City Of Z</em>',
 '<em> </em>',
 '<em>Happy Hour</em>',
 '<em> </em>',
 '<em>Tangerine</em>',
 '<em>Gone Girl</em>',
 '<em> </em>',
 '<em>You Were Never Really Here</em>',
 '<em>Spring Breakers</em>',
 '<em>Hereditary</em>',
 '<em>Hell Or High Water</em>',
 '<em>Shoplifters</em>',
 '<em>Paddington 2</em>',
 '<em> </em>',
 '<em>Zero Dark Thirty</em>',
 '<em>\xa0</em>',
 '<em>Mudbound</em>',
 '<em>Minding The Gap</em>',
 '<em> </em>',
 '<em>Stranger By The Lake</em>',
 '<em>Computer Chess</em>',
 '<em> </em>',
 

In [396]:
remove_html(movies)

['Miss Bala',
 'The Immigrant',
 'The Comedy',
 'Uncle Boonmee Who Can Recall His Past Lives',
 'Bridesmaids',
 'Cameraperson',
 'The Turin Horse',
 'Creed',
 'Uncut Gems',
 'The Arbor',
 'High Life',
 'Carol',
 'Drug War',
 'Weekend',
 'The Lost City Of Z',
 'Happy Hour',
 'Tangerine',
 'Gone Girl',
 'You Were Never Really Here',
 'Spring Breakers',
 'Hereditary',
 'Hell Or High Water',
 'Shoplifters',
 'Paddington 2',
 'Zero Dark Thirty',
 'Mudbound',
 'Minding The Gap',
 'Stranger By The Lake',
 'Computer Chess',
 'Mustang',
 'Tabu',
 'The Witch',
 'American Honey',
 '12 Years A Slave',
 'Force Majeure',
 'Right Now, Wrong Then',
 'Spider-Man: Into The Spiderverse',
 'Eighth Grade',
 'The Loneliest Planet',
 'Inside Out',
 'The Favourite',
 'Support The Girls',
 'A Ghost Story',
 'Drive',
 'Inception',
 'Her Smell',
 'La La Land',
 'The Lobster',
 'Parasite',
 'Her',
 'Amour',
 'Call Me By Your Name',
 'Paterson',
 'It Follows',
 'Arrival',
 'Moonrise Kingdom',
 'Green Room',
 'Marr

<h2> Esquire </h2>

In [397]:
link = "https://www.esquire.com/entertainment/movies/g29892894/best-movies-of-the-2010s/"
tag = '.listicle-slide-hed-text'
movies = parser(link, tag)
movies

['<span class="listicle-slide-hed-text">10. Paddington 2 (2017)</span>',
 '<span class="listicle-slide-hed-text">9. Moonlight (2016)</span>',
 '<span class="listicle-slide-hed-text">8. The Handmaiden (2016)</span>',
 '<span class="listicle-slide-hed-text">7. The Lobster (2015)</span>',
 '<span class="listicle-slide-hed-text">6. Once Upon a Time...in Hollywood (2019)</span>',
 '<span class="listicle-slide-hed-text">5. Whiplash (2014)</span>',
 '<span class="listicle-slide-hed-text">4. Call Me By Your Name (2017)</span>',
 '<span class="listicle-slide-hed-text">3. Mad Max: Fury Road (2015)</span>',
 '<span class="listicle-slide-hed-text">2. Before Midnight (2013)</span>',
 '<span class="listicle-slide-hed-text">1. The Social Network (2010)</span>']

In [399]:
remove_non_chars(remove_html(movies))

[' Paddington  ',
 ' Moonlight ',
 ' The Handmaiden ',
 ' The Lobster ',
 ' Once Upon a Timein Hollywood ',
 ' Whiplash ',
 ' Call Me By Your Name ',
 ' Mad Max: Fury Road ',
 ' Before Midnight ',
 ' The Social Network ']
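In the output above, "Paddington 2" has lost its "2" and "Once Upon a Time...in Hollywood" has lost its dots, because `remove_non_chars` drops every character that is not a letter, space or colon. A gentler alternative, sketched here as an assumption rather than what was actually run, removes only the leading rank and the trailing year:

```python
import re

RANK_PREFIX = re.compile(r'^\s*\d+\.\s*')       # leading "10. "
YEAR_SUFFIX = re.compile(r'\s*\(\d{4}\)\s*$')   # trailing "(2017)"

def clean_title(raw):
    """Strip a leading rank and a trailing year, keeping digits inside the title."""
    title = RANK_PREFIX.sub('', raw)
    title = YEAR_SUFFIX.sub('', title)
    return title.strip()

# Text of two Esquire headings from the scrape above, plus a title that starts with a number.
samples = [
    '10. Paddington 2 (2017)',
    '6. Once Upon a Time...in Hollywood (2019)',
    '12 Years a Slave',
]
cleaned = [clean_title(s) for s in samples]
print(cleaned)
```

Titles such as "12 Years a Slave" survive because the rank pattern needs a dot after the digits.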

<h2> New Yorker </h2>

In [400]:
link = "https://www.newyorker.com/culture/decade-in-review/the-twenty-seven-best-movies-of-the-decade"
tag = 'strong'
movies = parser(link, tag)
remove_html(movies)

['The Wolf of Wall Street',
 'Madeline’s Madeline',
 'Get Out',
 'An Elephant Sitting Still',
 'Did You Wonder Who Fired the Gun?',
 'The Future',
 'Margaret',
 'The Grand Budapest Hotel',
 'Somewhere',
 'Li’l Quinquin',
 'Film Socialisme',
 'An Oversimplification of Her Beauty',
 'Holy Motors',
 'Coma',
 'Red Hook Summer',
 'Zama',
 'Moonlight',
 'The Mule',
 'It Felt Like Love',
 'The Last of the Unjust',
 'In Jackson Heights',
 'A Ghost Story',
 'A Screaming Man',
 'A Quiet Passion',
 'Taxi',
 'Let the Sunshine In',
 'Infinite Football']

<h2> Stacker </h2>

This website has 100 pages; however, upon scraping I found that I didn't have to iterate through all of them, which is great.

In [401]:
link = "https://thestacker.com/stories/3678/100-best-movies-last-decade-according-critics#1"
tag = '.views-field-field-slide-caption .field-content'
movies = parser(link, tag)

In [404]:
remove_non_chars(remove_html(movies))[1:]

[' Manuscripts Dont Burn ',
 ' The Arbor ',
 ' A Film Unfinished ',
 ' Selma ',
 ' Blue Is the Warmest Color ',
 ' The Kings Speech ',
 ' Inside Job ',
 ' Western ',
 ' Zama ',
 ' Elle ',
 ' Long Days Journey Into Night ',
 ' Columbus ',
 ' Uncertain ',
 ' National Gallery ',
 ' Phoenix ',
 ' Two Days One Night ',
 ' The Overnighters ',
 ' Eighth Grade ',
 ' The Artist ',
 ' The Farewell ',
 ' The Tale of The Princess Kaguya ',
 ' Gangs of Wasseypur ',
 ' Dead Souls ',
 ' The Fits ',
 ' Hard to Be a God ',
 ' Big Men ',
 ' My Perestroika ',
 ' Winters Bone ',
 ' American Hustle ',
 ' I Called Him Morgan ',
 ' The Tale ',
 ' Paterson ',
 ' Paths of the Soul ',
 ' This Is Not a Film ',
 ' Phantom Thread ',
 ' Burning ',
 ' Cold War ',
 ' The Favourite ',
 ' Gett: The Trial of Viviane Amsalem ',
 ' A Prophet ',
 ' Her ',
 ' Minding the Gap ',
 ' Mad Max: Fury Road ',
 ' The Act of Killing ',
 ' Its Such a Beautiful Day ',
 ' For Sama ',
 ' Gavagai ',
 ' A Bread Factory Part One  Part Two 

<h2> Vanity Fair (Richard Lawson) </h2>

In [405]:
link = "https://www.vanityfair.com/hollywood/2019/11/best-movies-decade-2010s-lawson"
tag = "h2 em"
movies = remove_html(parser(link, tag))

In [406]:
movies

['Princess Cyd',
 'Weekend',
 'Force Majeure',
 'Eden',
 'Get Out',
 'Parasite',
 'Melancholia',
 'Dawson City: Frozen Time',
 'Phantom Thread',
 'Mad Max: Fury Road']

<h2> Uproxx </h2>

In [407]:
link = "https://uproxx.com/movies/uproxx-best-movies-of-2010s-decade/"
tag = "strong em"
movies = remove_html(parser(link, tag))

In [409]:
remove_non_chars(movies)

['Inside Llewyn Davis ',
 'Get Out ',
 'The Social Network  ',
 ' Years a Slave ',
 'Mad Max: Fury Road ',
 'OJ: Made in America ',
 'Roma ',
 'John Wick ',
 'Once Upon a Time in Hollywood ',
 'Moonlight ',
 'Mission Impossible: Fallout ',
 'Selma ',
 'Wonder Woman ',
 'Star Wars: The Last Jedi ',
 'Ex Machina ',
 'Black Panther ',
 'La La Land ',
 'Lady Bird ',
 'Sorry to Bother You ',
 'Snowpiercer ',
 'Drive ']

<h2> The Playlist </h2>

This website proved quite hard to parse.

In [410]:
URL = "https://theplaylist.net/best-100-films-decade-2010s-20191202/"
#page 1 has no page number in its URL; pages 2-10 do
URLs = [URL + "#cb-content"]
for i in range(2,11):
    URLs.append(URL + str(i) + "/#cb-content")

In [411]:
URLs
URLs[-1] == "https://theplaylist.net/best-100-films-decade-2010s-20191202/10/#cb-content"

True

In [437]:
movies = []
tag = "b:nth-child(2) , b:nth-child(1)"
for url in URLs:
    for item in parser(url,tag):
        movies.append(item)
movies

['<b>Greta Gerwigâ\x80\x99s</b>',
 '<b>Little Women</b>',
 '<b>100. â\x80\x9cMadelineâ\x80\x99s Madelineâ\x80\x9d </b>',
 '<b>99. â\x80\x9cPrivate Lifeâ\x80\x9d </b>',
 '<b>98.</b>',
 '<b>Martha Marcy May Marlene</b>',
 '<b>97</b>',
 '<b>Meekâ\x80\x99s Cutoff</b>',
 '<b>96. </b>',
 '<b>What We Do In The Shadows</b>',
 '<b>95. </b>',
 '<b>Mother!</b>',
 '<b>The Wailing</b>',
 '<b>â\x80\x9cMargaretâ\x80\x9d </b>',
 '<b>Inceptionâ\x80\x9d </b>',
 '<b>Zero Dark Thirty</b>',
 '<b>Shame</b>',
 '<b>The Hunt</b>',
 '<b>Blumhouse</b>',
 '<b>Ida</b>',
 '<b>The Lighthouse</b>',
 '<b>Spotlight</b>',
 '<b>85.</b>',
 '<b>â\x80\x9cNoâ\x80\x9d (2012)<br>\n</b>',
 '<b>â\x80\x9cGirlhoodâ\x80\x9d </b>',
 '<b>Leviathanâ\x80\x9d </b>',
 '<b>â\x80\x9cStranger by the Lakeâ\x80\x9d<br>\n</b>',
 '<b>â\x80\x9cBoyhoodâ\x80\x9d </b>',
 '<b>Force Majeure</b>',
 '<b>â\x80\x9cEdenâ\x80\x9d<br>\n</b>',
 '<b>â\x80\x9cThe Duke of Burgundyâ\x80\x9d </b>',
 '<b>The Immigrant</b>',
 '<b>â\x80\x9cThe Social Networkâ\x80\x9
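The `â\x80\x99` sequences in the output above look like UTF-8 bytes that were decoded as Latin-1 somewhere along the way (the right single quote U+2019 is `E2 80 99` in UTF-8). Assuming that is what happened, the text can be repaired after the fact by reversing the wrong decode. `repair_mojibake` is a hypothetical helper, not part of the notebook:

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the text unchanged if it does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# One element copied from the scraped output above.
garbled = '<b>Meekâ\x80\x99s Cutoff</b>'
repaired = repair_mojibake(garbled)
print(repaired)

# Text that is already correct cannot be encoded as Latin-1 and passes through untouched.
untouched = repair_mojibake('Once Upon A time… In Hollywood')
```

A cleaner fix would be to hand the raw response bytes to the parser instead of the already-decoded text, but that depends on how the site declares its encoding, which I have not verified.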

In [468]:
movies = []
url = "https://theplaylist.net/best-100-films-decade-2010s-20191202/"
tag = "p"
movies = parser(url, tag)
movies
find_movies(movies, r'\d+.(.*?)\(\d+\)')

[' â\x80\x9cMadelineâ\x80\x99s Madelineâ\x80\x9d </b>',
 ' â\x80\x9cPrivate Lifeâ\x80\x9d </b>',
 '</b> â\x80\x9c<b>Martha Marcy May Marlene</b>â\x80\x9d ',
 '/b>. â\x80\x9c<b>Meekâ\x80\x99s Cutoff</b>â\x80\x9d ',
 ' </b>â\x80\x9c<b>What We Do In The Shadows</b>â\x80\x9d ',
 ' </b>â\x80\x9c<b>Mother!</b>â\x80\x9d ',
 '</strong> â\x80\x9c<b>The Wailing</b>â\x80\x9d ',
 '</strong> <b>â\x80\x9cMargaretâ\x80\x9d </b>',
 ' </strong>â\x80\x9c<b>Inceptionâ\x80\x9d </b>',
 '</strong> â\x80\x9c<b>Zero Dark Thirty</b>â\x80\x9d ']

After some trial and error, I decided to just extract entire paragraphs and then use regex to find the titles. This approach proved successful.

In [492]:
movies = []
tag = "p"
for url in URLs:
    page_list = parser(url,tag)
    movies.append(find_movies(page_list, r'\d+.(.*?)\(\d+\)'))


In [493]:
flat_sol = [item for sublist in movies for item in sublist]
len(flat_sol)

97

In [497]:
remove_non_chars(flat_sol)

[' Madelines Madeline b',
 ' Private Life b',
 'b bMartha Marcy May Marleneb ',
 'b bMeeks Cutoffb ',
 ' bbWhat We Do In The Shadowsb ',
 ' bbMotherb ',
 'strong bThe Wailingb ',
 'strong bMargaret b',
 ' strongbInception b',
 'strong bZero Dark Thirtyb ',
 'strong bShameb ',
 'strong bThe Huntb ',
 'strong bIdab ',
 ' strongbThe Lighthouseb ',
 ' strongbSpotlightb ',
 'b bNo ',
 'strong bGirlhood b',
 'strong bLeviathan b',
 ' strongbBoyhood b',
 ' bForce Majeureb ',
 ' strongbThe Duke of Burgundy b',
 ' strongbThe Immigrantb ',
 'strong bThe Social Network b',
 'strong bThe Wolf of Wall Streetb ',
 'strong bZama b',
 'strong bWeekendb ',
 ' White Material ',
 'strong bShopliftersb ',
 ' bThe Souvenirb ',
 'strong bSuspiriab ',
 'strong bTangerine b',
 'b bUpstream Colorb ',
 ' OJ: Made in America ',
 ' bstrongBlue Valentinestrong ',
 'b bPortrait Of A Lady On Fireb ',
 ' Tabu b',
 ' bbUncle Boonmee Who Can Recall His Past Livesb ',
 'b bA Ghost Storyb ',
 ' bCall Me By Your Nameb ',
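The leftover `b` and `strong` fragments above appear because the regex runs on raw HTML, so pieces of tags end up inside the captured group and `remove_non_chars` then keeps their letters. One way around this, sketched under the assumption that each entry looks like `rank. "Title" (year)` once tags are removed, is to strip the tags first. The sample paragraphs are hypothetical markup modelled on the fragments shown above, not the site's real HTML:

```python
import re

TAG = re.compile(r'<[^>]+>')
# rank, optional dot, optional opening quote, lazy title, optional closing quote, four-digit year
ENTRY = re.compile(r'\d+\.?\s*[“"]?(.*?)[”"]?\s*\(\d{4}\)')

def extract_title(paragraph_html):
    """Drop tags, then pull the title out of a 'rank. "Title" (year)' paragraph."""
    text = TAG.sub('', paragraph_html)
    match = ENTRY.search(text)
    return match.group(1).strip() if match else None

# Hypothetical paragraphs for illustration only.
paragraphs = [
    '<p><b>98.</b> “<b>Martha Marcy May Marlene</b>” (2011)</p>',
    '<p><b>97</b>. “<b>Meek’s Cutoff</b>” (2010)</p>',
    '<p>A paragraph of review text with no ranked entry.</p>',
]
titles = [t for t in map(extract_title, paragraphs) if t]
print(titles)
```

A regex-based tag stripper is fragile on malformed markup; the notebook's own `remove_html` would be the more robust choice for the first step.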


In [479]:
movies = []
tag = "p"
for url in URLs:
    page_list = parser(url,tag)
    found = find_movies(page_list, r'\d+.(.*?)\(\d+\)')
    print(len(found))
    movies.append(found)

10
9
9
10
10
10
10
9
10
10
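The per-page counts printed above can also be checked programmatically rather than by eye. A minimal sketch, assuming every page should hold exactly 10 entries:

```python
# Counts copied from the output above, one per page.
counts = [10, 9, 9, 10, 10, 10, 10, 9, 10, 10]
EXPECTED_PER_PAGE = 10  # assumption: 100 films spread evenly over 10 pages

# Pages are numbered from 1, matching the URLs.
short_pages = [page for page, n in enumerate(counts, start=1) if n < EXPECTED_PER_PAGE]
missing_total = sum(EXPECTED_PER_PAGE - n for n in counts)

print(short_pages, missing_total)
```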


The 3 missing movies are from pages 2, 3 and 8! I examined those pages manually and found that each had one movie missing a date, so the regex wasn't picking it up. To fix this, I'll just add them manually:

"82. Stranger by The Lake"

"79. Eden"

"22. Stories We Tell"

<h2> Rotten Tomatoes </h2>

In [483]:
link = "https://editorial.rottentomatoes.com/guide/the-200-best-movies-of-the-2010s/"
tag = '.article_movie_title a'
movies = parser(link, tag)

In [484]:
remove_html(movies)

['12 Years a Slave',
 '20 Feet From Stardom',
 '45 Years',
 'All Is Lost',
 'Amazing Grace',
 'American Hustle',
 'Amy',
 'Anomalisa',
 'Ant-Man and the Wasp',
 'Apollo 11',
 'Argo',
 'Arrival',
 'Ash Is Purest White',
 'The Artist',
 'Avengers: Endgame',
 'Avengers: Infinity War',
 'The Babadook',
 'Baby Driver',
 'A Beautiful Day in the Neighborhood',
 'Before Midnight',
 'The Big Sick',
 'Birdman',
 'Birds of Passage (Pájaros de verano)',
 'Black Panther',
 'Blackfish',
 'BlacKkKlansman',
 'Blade Runner 2049',
 'Booksmart',
 'Boyhood',
 'BPM (Beats Per Minute) (120 battements par minute)',
 'Bridge of Spies',
 'Brooklyn',
 'Bumblebee',
 'Burning (Beoning)',
 'Call Me by Your Name',
 'Cameraperson',
 'Can You Ever Forgive Me?',
 'Captain America: Civil War',
 'Captain Phillips',
 'Carol',
 'Coco',
 'Crazy Rich Asians',
 'Creed',
 'Dallas Buyers Club',
 'Dawn Of The Planet Of The Apes',
 'The Death of Stalin',
 'The Disaster Artist',
 'Doctor Strange',
 'Dolemite Is My Name',
 "Don't 

<h2>SlashFilm</h2>

In [200]:
link = "https://www.slashfilm.com/jacob-halls-top-10-movies-of-the-decade/"
tag = ".s1"
movies = remove_html(parser(link, tag))

In [201]:
movies

['10. The Cabin in the Woods',
 '9. The Grand Budapest Hotel',
 '8. Green Room',
 '7. Arrival',
 '6. Star Wars: The Last Jedi',
 '5. Spider-Man: Into the Spider-Verse',
 '4. Parasite',
 '3. Get Out',
 '2. Mad Max: Fury Road',
 '1. Inside Llewyn Davis']

<h2>GQ</h2>

In [202]:
link = "https://www.gq.com/story/the-best-movies-of-the-2010s"
tag = "h2"
movies = remove_html(parser(link, tag))

In [203]:
movies

['The Social Network',
 'The Handmaiden',
 'Phantom Thread',
 'Bridesmaids',
 'Arrival',
 'Lady Bird',
 'Spider-Man: Into the Spider-Verse',
 'Moonlight',
 'Call Me By Your Name',
 'Parasite',
 'Uncut Gems',
 'First Reformed',
 'Inside Llewyn Davis',
 'Widows',
 'The Favourite',
 'Toni Erdmann',
 'Mad Max: Fury Road',
 'Minding the Gap',
 'Black Panther',
 'Shirkers',
 'The Wolf of Wall Street',
 'Inception',
 'Hereditary',
 'Coco',
 'Get Out']

<h2> Hollywood Reporter </h2>

In [499]:
link = 'https://www.hollywoodreporter.com/lists/10-best-films-decade-1260056/item/1-carlos-2010-10-best-films-decade-1260057'
tag = '.list-item__title'
movies = remove_html(parser(link, tag))

In [500]:
remove_non_chars(movies)

[' Carlos ',
 ' The Social Network ',
 ' Inside Llewyn Davis ',
 ' Only Lovers Left Alive ',
 ' The Handmaiden ',
 ' Leviathan ',
 ' Once Upon a Time in Hollywood ',
 ' Brooklyn  ',
 ' Mad Max: Fury Road ',
 ' The Gatekeepers ']

<h2> Redbook </h2>

In [501]:
link = 'https://www.redbookmag.com/life/g30084079/best-movies-of-the-decade/'
tag = '.slideshow-slide-hed'
movies = remove_html(parser(link, tag))

In [502]:
for idx, movie in enumerate(movies):
    movies[idx] = movie.strip()
remove_non_chars(movies)

['Get Out ',
 'Frozen ',
 'The Avengers ',
 'Mad Max: Fury Road ',
 'Call Me By Your Name ',
 'Bridesmaids ',
 'Spotlight ',
 'Lady Bird ',
 'Moonlight ',
 'Guardians of the Galaxy ',
 'The Social Network ',
 'La La Land ',
 'Roma ',
 'Wonder Woman ',
 'Inception ',
 'The Wolf Of Wall Street ',
 'Black Panther ',
 'The Favourite ',
 'Star Wars: The Force Awakens ',
 'Boyhood ']

<h2> Vanity Fair (K. Austin Collins) </h2>

In [503]:
link = "https://www.vanityfair.com/hollywood/2019/11/best-movies-decade-2010s-collins"
tag = "h2 em"
movies = remove_html(parser(link, tag))

In [504]:
movies

['The Act of Killing: Director’s Cut',
 'Carol',
 'The Day He Arrives',
 'Did You Wonder Who Fired the Gun?',
 'Drug War',
 'Field Niggas',
 'Frances Ha',
 'The Future',
 'Gone Girl',
 'Happy Hour',
 'Heaven Knows What',
 'In Jackson Heights',
 'Inside Llewyn Davis',
 'It Felt Like Love',
 'Leviathan',
 'Like Someone in Love',
 'Mad Max: Fury Road',
 'Margaret',
 'The Master',
 'The Missing Picture',
 'Moonlight',
 'Sunset Song',
 'Timbuktu',
 'A Touch of Sin',
 'The Tree of Life',
 'Universal Soldier: Day of Reckoning',
 'Unstoppable',
 'Upstream Color',
 'The Wind Rises',
 'The Wolf of Wall Street']

<h2> Time </h2>

In [238]:
link = "https://time.com/5725149/best-movies-2010s-decade/"
tag = "h2 em"
movies = remove_html(parser(link, tag))

In [240]:
movies[:-2]

['Somewhere',
 'Cave of Forgotten Dreams',
 'Melancholia',
 'Before Midnight',
 'Phoenix',
 'John Wick',
 'Selma',
 'Moonlight',
 'The Lost City of Z',
 'Roma']

<h2> Entertainment.ie </h2>

In [507]:
link = "https://entertainment.ie/cinema/movie-news/20-best-movies-of-the-decade-429849/"
tag = "p strong"
movies = remove_html(parser(link, tag))
remove_non_chars(movies)

[' Hell Or High Water  ',
 ' The Guard  ',
 ' The Cabin In The Woods  ',
 ' Mission: Impossible  Fallout  ',
 ' Whiplash  ',
 ' Avengers: Endgame  ',
 ' The Grand Budapest Hotel  ',
 ' John Wick  ',
 ' Shame  ',
 ' Spotlight     Ireland',
 ' SpiderMan: Into The SpiderVerse  ',
 ' Inception  ',
 'Lady Bird  ',
 ' Ad Astra  ',
 ' Moonlight  ',
 'The Death of Stalin  ',
 ' The Social Network  ',
 ' Mad Max: Fury Road  ',
 ' Get Out  ',
 ' Boyhood  ']

<h2> CNET </h2>

In [509]:
link = "https://www.cnet.com/news/the-30-best-films-of-the-decade-ranked/"
tag = "h2"
remove_non_chars(remove_html(parser(link, tag)))

['  Mad Max: Fury Road  ',
 '  SpiderMan: Into the SpiderVerse  ',
 '  Boyhood  ',
 '  Get Out  ',
 '  Lady Bird  ',
 '  The Favourite  ',
 '  Roma  ',
 '  Black Panther  ',
 '  Ex Machina  ',
 '  The Master  ',
 '  Her  ',
 '  Moonlight  ',
 '  The Social Network  ',
 '  Drive  ',
 '  The Shape of Water  ',
 '  Avengers: Infinity War  ',
 '  Inception  ',
 '  Birdman  ',
 '  Spotlight  ',
 '  Toy Story   ',
 '  Hereditary  ',
 '  What We Do In The Shadows  ',
 '   Years a Slave  ',
 '  Whiplash  ',
 '  Annihilation  ',
 '  Call Me By Your Name  ',
 '  The Witch  ',
 '  Hunt for the Wilderpeople  ',
 '  Star Wars: The Force Awakens  ',
 '  Wonder Woman  ',
 'Discuss: The  best films of the decade ranked']

<h2> Looper </h2>

In [510]:
link = "https://www.looper.com/178003/the-best-movies-of-the-last-decade/"
tag = "h2"
remove_non_chars(remove_html(parser(link, tag)))

['Inception ',
 'Toy Story  ',
 'The Social Network ',
 'Skyfall ',
 'Gravity ',
 'Her ',
 'Inside Llewyn Davis ',
 'Selma ',
 'The Grand Budapest Hotel ',
 'Boyhood ',
 'Inside Out ',
 'Mad Max: Fury Road ',
 'Midnight Special ',
 'Moonlight ',
 'Lady Bird ',
 'Get Out ',
 'The Shape of Water ',
 'Black Panther ',
 'A Quiet Place ',
 'SpiderMan: Into the SpiderVerse ',
 'BlacKkKlansman ',
 'Us ',
 'Avengers: Endgame ',
 'Parasite ']

<h2>Columbus Underground</h2>

In [512]:
link = "https://www.columbusunderground.com/25-best-movies-of-the-decade-hm1"
tag = "h2"
movies = remove_non_chars(remove_html(parser(link, tag)))
movies[18:]

[' Mad Max: Fury Road ',
 ' Toy Story  ',
 '  Years a Slave ',
 ' Take Shelter ',
 ' The Tree of Life ',
 ' The Master ',
 ' Selma ',
 ' Moonlight ',
 ' The Act of Killing ',
 ' Cave of Forgotten Dreams ',
 ' Drive ',
 ' The Revenant ',
 ' Boyhood ',
 ' Roma ',
 ' Toy Story  ',
 ' The Witch ',
 ' You Were Never Really Here ',
 ' Get Out ',
 ' Parasite ',
 ' The Irishman ',
 ' Django Unchained ',
 ' Dunkirk ',
 ' Black Panther ',
 ' The Babadook ',
 ' Young Adult ']

<h2> NME </h2>

In [513]:
link = "https://www.nme.com/features/films-of-the-decade-nme-movies-2583834"
tag = ".tdb-sml-current-item-title"
remove_non_chars(remove_html(parser(link, tag)))

['KickAss ',
 'Rogue One: A Star Wars Story ',
 'Insidious ',
 'The Babadook ',
 'Guardians of the Galaxy ',
 'Hereditary ',
 'Joker ',
 'Baby Driver ',
 'A Quiet Place ',
 'Whiplash ',
 'Lady Bird ',
 'Deadpool ',
 'The Grand Budapest Hotel ',
 'Inception ',
 'Django Unchained ',
 'Moonlight ',
 'Avengers: Endgame ',
 'Mad Max: Fury Road ',
 'Call Me By Your Name ',
 'Get Out ']

<h2> Town and Country Mag </h2>

In [514]:
link = "https://www.townandcountrymag.com/leisure/arts-and-culture/g30199550/best-movies-of-2010s/"
tag = ".slideshow-slide-hed"
remove_non_chars(remove_html(parser(link, tag)))

['        The Social Network                    ',
 '        Margaret                    ',
 '        Bridesmaids                    ',
 '        The Wolf of Wall Street                    ',
 '        The Kings Speech                    ',
 '         Years a Slave                    ',
 '        The Grand Budapest Hotel                    ',
 '        Boyhood                    ',
 '        Carol                    ',
 '        Moonlight                    ',
 '        La La Land                    ',
 '        Get out                    ',
 '        Lady Bird                    ',
 '        Roma                    ',
 '        The Favourite                    ']

<h2>Spin</h2>

In [515]:
links = [
    "https://www.spin.com/featured/30-best-movies-2010s",
    "https://www.spin.com/featured/30-best-movies-2010s/2/",
    "https://www.spin.com/featured/30-best-movies-2010s/3/",
    "https://www.spin.com/featured/30-best-movies-2010s/4/",
    "https://www.spin.com/featured/30-best-movies-2010s/5/"
]
tag = "h3"
movies = []
for link in links:
    movies.append(remove_non_chars(remove_html(parser(link, tag))))
movies

[['Winters Bone ',
  'Inception ',
  'The Social Network ',
  'Bridesmaids ',
  'A Separation ',
  'The Master '],
 ['Holy Motors ',
  'The Queen of Versailles ',
  'Spring Breakers ',
  'Frances Ha ',
  'Her ',
  'Boyhood '],
 ['John Wick',
  'It Follows ',
  'Mad Max: Fury Road ',
  'Magic Mike XXL ',
  'The Witch ',
  'Hunt for the Wilderpeople'],
 ['Moonlight ',
  'Popstar: Never Stop Never Stopping ',
  'OJ: Made in America ',
  'Toni Erdmann ',
  'Personal Shopper ',
  'Call Me by Your Name '],
 ['Get Out ',
  'First Reformed ',
  'Minding the Gap ',
  'Annihilation ',
  'Burning ',
  'High Life ']]

<h2> IMDB </h2>

In [516]:
link = "https://www.imdb.com/list/ls021078225/"
tag = ".lister-item-header a"
remove_html(parser(link, tag))[0:10]

['Prisoners',
 'The Grand Budapest Hotel',
 'Interstellar',
 'The Wolf of Wall Street',
 'Mad Max: Fury Road',
 'Your Name.',
 'Manchester by the Sea',
 'Inception',
 'Toy Story 3',
 'Paterson']

<h2> Irish Times </h2>

In [519]:
link = "https://www.irishtimes.com/culture/film/the-25-best-films-of-the-2010s-none-of-them-won-best-picture-oscar-1.4116777"
tag = "strong"
remove_non_chars(remove_html(parser(link, tag)))[:-1]

[' Arrival',
 ' Upstream Colour',
 ' Holy Motors',
 ' Blue Is the Warmest Colour',
 ' You Were Never Really Here',
 ' Melancholia',
 ' Monos',
 ' Uncle Boonmee Who Can Recall His Past Lives',
 ' Lady Bird',
 ' Get Out',
 ' Climax',
 ' Ida',
 ' Parasite',
 ' Madelines Madeline',
 ' Birds of Passage',
 ' The Turin Horse',
 ' Tangerine',
 ' The Act of Killing',
 ' Beanpole',
 ' The Master',
 ' The Tribe',
 ' The Duke of Burgundy',
 ' Son of Saul',
 ' Under the Skin',
 ' Loveless']

<h2><u>Manually Parsed</u></h2>

<h2> Reddit </h2>

In [520]:
link = "https://www.reddit.com/r/movies/comments/cksn97/best_films_of_the_decade/"
tag = "div ol"
remove_html(parser(link, tag))

['The Tree of LifeMoonlightBlue is the Warmest Colour12 Years a SlaveIdaThe RevenantHerUnder the SkinThe Social NetworkShame',
 'Mad Max: Fury RoadThe WitchEx MachinaSpider-Man: Into The Spider-VerseIt FollowsManchester By The SeaHerIT: Chapter 1Your NameDunkirk',
 'The Grand Budapest HotelThe Tale of the Princess KaguyaBefore MidnightA SeparationThe ArtistThe Red TurtleArrivalHerEx Machina(Tie) The Shape of Water/Moonrise Kingdom',
 'BirdmanThe MasterMad Max: Fury RoadInterstellarThe Social NetworkEx MachinaLa La LandNightcrawlerWhiplashThe Grand Budapest Hotel',
 'The Turin HorseThe Tree of LifeTwin Peaks: The ReturnThe Look of SilenceMelancholiaSilenceThe MasterUnder the SkinWormwoodPaterson']
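The titles above run together because the selector returns each `<ol>` as a single element, and taking its text concatenates the list items. Collecting the `<li>` elements individually keeps the titles separate. A stdlib-only sketch on hypothetical markup (the real Reddit HTML may differ):

```python
from html.parser import HTMLParser

class ListItemCollector(HTMLParser):
    """Collect the text of each top-level <li> as its own entry."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == 'li' and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append(''.join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

# Hypothetical markup for illustration only.
sample_html = ('<div><ol><li><p>The Tree of Life</p></li>'
               '<li><p>Moonlight</p></li><li><p>Ida</p></li></ol></div>')
collector = ListItemCollector()
collector.feed(sample_html)
collector.close()
titles = collector.items
print(titles)
```

With the notebook's own helpers, a selector such as `div ol li` passed to `parser` would likely have the same effect, though I have not run it against the page.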

<h2> Paste Magazine </h2>

In [521]:
link = "https://www.pastemagazine.com/articles/2019/11/best-movies-2010s-decade.html"
tag = ".big"
parser(link, tag)

[]

<h2>Independent</h2>

In [522]:
link = "https://www.independent.co.uk/arts-entertainment/films/features/best-films-decade-2010s-paddington-moonlight-carol-ranked-a9213966.html"
tag = "span"
parser(link, tag)

#manually parsed

[]

<h2> Indiewire </h2>

In [525]:
link = "https://www.indiewire.com/gallery/best-movies-of-2010s-decade/"
tag = ".c-gallery-vertical-slide__title"
movies = parser(link, tag)
movies

[]

<h2><u> Result </u></h2>

The result is a database of 1357 non-unique titles scraped from 35 different sources. Over the course of this notebook I:

1. Efficiently scraped 35 websites
2. Used regex for a variety of purposes
3. Wrote a variety of helper functions that further streamlined my work
4. Finally, created a clean, near-perfect database of the top movies of the decade for future analysis

In [548]:
import pandas as pd
movieDB = pd.read_csv("MovieDB.csv")
sources = pd.read_csv("Sources.csv")
#movieDB = movieDB.dropna(how='all', axis='columns')
#sources = sources.dropna(how='all', axis='columns')

In [549]:
movieDB

Unnamed: 0,Title,Website,Rank
0,Melancholia,Vulture,1
1,Mad Max: Fury Road,Vulture,2
2,The Tree of Life,Vulture,3
3,The Rider,Vulture,4
4,A Separation,Vulture,5
5,Moonlight,Vulture,6
6,The Fits,Vulture,7
7,Margaret,Vulture,8
8,Spider-Man: Into the Spider-Verse,Vulture,9
9,The Florida Project,Vulture,10
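Because the database holds non-unique titles, a natural first step toward a combined list is counting how many sources each title appears on. A rough sketch using only the standard library; the rows are a handful of (title, website) pairs taken from outputs earlier in this notebook, and the `normalize` helper is my own assumption about how to reconcile spacing and capitalisation differences:

```python
from collections import Counter

# (title, website) pairs as they appear in the scraped outputs above.
rows = [
    ('Moonlight', 'Vulture'),
    ('Mad Max: Fury Road', 'Vulture'),
    (' Moonlight ', 'Esquire'),
    (' Mad Max: Fury Road ', 'Esquire'),
    ('Moonlight', 'GQ'),
]

def normalize(title):
    """Collapse whitespace and ignore case so variants of a title compare equal."""
    return ' '.join(title.split()).casefold()

# A set guards against the same title being counted twice for one website.
seen = {(normalize(title), site) for title, site in rows}
list_counts = Counter(title for title, _ in seen)
ranking = list_counts.most_common()
print(ranking)
```

This ignores the `Rank` column entirely; weighting by rank would be a separate design decision.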


In [550]:
sources

Unnamed: 0,Website,URL
0,Vulture,https://www.vulture.com/2019/12/every-movie-of...
1,Insider,https://www.insider.com/best-films-of-the-deca...
2,No Film School,https://nofilmschool.com/best-movies-of-the-de...
3,GamesRadar,https://www.gamesradar.com/decade-best-movies-...
4,AV Club,https://film.avclub.com/the-100-best-movies-of...
5,Esquire,https://www.esquire.com/entertainment/movies/g...
6,New Yorker,https://www.newyorker.com/culture/decade-in-re...
7,The Stacker,https://thestacker.com/stories/3678/100-best-m...
8,Vanity Fair - Richard Lawson,https://www.vanityfair.com/hollywood/2019/11/b...
9,Uproxx,https://uproxx.com/movies/uproxx-best-movies-o...
