# LUI Final Project - Web Scraping IMDb
**Course:** BI Master 2019.2 <br>
**Students:** Manoela Lacombe, Marcelo Bittencurt, Ivan Madeira de Oliveira <br>

---

## 1. Introduction

---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1. Problem description

<h4>You were given a web scraper that extracts data from an IMDb page. Using the data obtained by the scraper, organize the extracted movies into tables.</h4>
<ol>
    <li>Organize the movies by genre;</li>
    <li>Organize the movies by year;</li>
    <li>The work may be done in groups of 5 people and will be explained in the next class.</li>
</ol>

<p><strong>Bonus: fetch the 100 movies and series shown on the page.</strong></p>

In [1]:
import lxml
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from requests import get

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2. Code provided with the problem

In [2]:
url = "https://www.imdb.com/search/title?count=100&title_type=feature,tv_series&ref_=nv_wl_img_2"

In [3]:
class IMDB(object):
    """docstring for IMDB"""
    def __init__(self, url):
        super(IMDB, self).__init__()
        page = get(url)

        self.soup = BeautifulSoup(page.content, 'lxml')

    def articleTitle(self):
        return self.soup.find("h1", class_="header").text.replace("\n","")

    def bodyContent(self):
        content = self.soup.find(id="main")
        return content.find_all("div", class_="lister-item mode-advanced")

    def movieData(self):
        movieFrame = self.bodyContent()
        movieTitle = []
        movieDate = []
        movieRunTime = []
        movieGenre = []
        movieRating = []
        movieScore = []
        movieDescription = []
        movieDirector = []
        movieStars = []
        movieVotes = []
        movieGross = []
        for movie in movieFrame:
            movieFirstLine = movie.find("h3", class_="lister-item-header")
            movieTitle.append(movieFirstLine.find("a").text)
            movieDate.append(re.sub(r"[()]","", movieFirstLine.find_all("span")[-1].text))
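            # find() returns None when the element is missing, so .text raises AttributeError.
            try:
                movieRunTime.append(movie.find("span", class_="runtime").text[:-4])
            except AttributeError:
                movieRunTime.append(np.nan)
            movieGenre.append(movie.find("span", class_="genre").text.rstrip().replace("\n","").split(","))
            try:
                movieRating.append(movie.find("strong").text)
            except AttributeError:
                movieRating.append(np.nan)
            # Only spans whose class attribute is exactly "metascore unfavorable" match here,
            # so titles with any other Metascore rating get NaN.
            try:
                movieScore.append(movie.find("span", class_="metascore unfavorable").text.rstrip())
            except AttributeError:
                movieScore.append(np.nan)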
            movieDescription.append(movie.find_all("p", class_="text-muted")[-1].text.lstrip())
            movieCast = movie.find("p", class_="")

            try:
                casts = movieCast.text.replace("\n","").split('|')
                casts = [x.strip() for x in casts]
                casts = [casts[i].replace(j, "") for i,j in enumerate(["Director:", "Stars:"])]
                movieDirector.append(casts[0])
                movieStars.append([x.strip() for x in casts[1].split(",")])
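            except IndexError:
                # No "|" separator: the entry lists only stars, so there is no director.
                # Note that the "Stars:" prefix is not stripped in this branch.
                casts = movieCast.text.replace("\n","").strip()
                movieDirector.append(np.nan)
                movieStars.append([x.strip() for x in casts.split(",")])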

            movieNumbers = movie.find_all("span", attrs={"name": "nv"})

            if len(movieNumbers) == 2:
                movieVotes.append(movieNumbers[0].text)
                movieGross.append(movieNumbers[1].text)
            elif len(movieNumbers) == 1:
                movieVotes.append(movieNumbers[0].text)
                movieGross.append(np.nan)
            else:
                movieVotes.append(np.nan)
                movieGross.append(np.nan)

        movieData = [movieTitle, movieDate, movieRunTime, movieGenre, movieRating, movieScore, movieDescription,
                     movieDirector, movieStars, movieVotes, movieGross]
        return movieData
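
The `Score` list ends up mostly empty because the selector `class_="metascore unfavorable"` matches only that exact class string. The sketch below uses a hand-written HTML fragment (the markup, including the `favorable` and `mixed` class names, is an assumption modeled on the selector above, not copied from IMDb) to show how matching on the single class `metascore` would pick up every bucket:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div><span class="metascore favorable">81</span></div>
<div><span class="metascore mixed">55</span></div>
<div><span class="metascore unfavorable">39</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string is compared against the whole class attribute.
exact = [s.text for s in soup.find_all("span", class_="metascore unfavorable")]
# A single class name matches any element that carries that class.
any_bucket = [s.text for s in soup.find_all("span", class_="metascore")]

print(exact)
print(any_bucket)
```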

---

## 2. Data processing

---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1. Scraping the data

In [4]:
imdb_scrapper = IMDB(url)
data = imdb_scrapper.movieData()

---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2. Check that the data was scraped

---

In [5]:
for i, col in enumerate(data):
    first_row_value = col[0]
    print(f'Column {i+1} has {len(col)} values. First value: {first_row_value}')

Column 1 has 100 values. First value: The Crown
Column 2 has 100 values. First value: 2016– 
Column 3 has 100 values. First value: 58
Column 4 has 100 values. First value: ['Biography', ' Drama', ' History']
Column 5 has 100 values. First value: 8.7
Column 6 has 100 values. First value: nan
Column 7 has 100 values. First value: Follows the political rivalries and romance of Queen Elizabeth II's reign and the events that shaped the second half of the twentieth century.
Column 8 has 100 values. First value: nan
Column 9 has 100 values. First value: ['Stars:Claire Foy', 'Olivia Colman', 'Imelda Staunton', 'Matt Smith']
Column 10 has 100 values. First value: 130,250
Column 11 has 100 values. First value: nan
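
The first value of column 9 above still carries the `Stars:` label, which happens for titles that list no director. A minimal sketch of one way to clean this after scraping, assuming a hypothetical helper named `strip_stars_label` that is not part of the provided code:

```python
def strip_stars_label(stars, label="Stars:"):
    """Return a copy of a cast list with a leading label removed from each name."""
    cleaned = []
    for name in stars:
        if name.startswith(label):
            name = name[len(label):]
        cleaned.append(name.strip())
    return cleaned

# Sample taken from the printed first value of column 9.
sample = ['Stars:Claire Foy', 'Olivia Colman', 'Imelda Staunton', 'Matt Smith']
cleaned_sample = strip_stars_label(sample)
print(cleaned_sample)
```

Applied to the whole column, this would be `[strip_stars_label(s) for s in data[8]]`.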


---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3. Name the columns

---

In [6]:
data2 = {}
data2['Title'] = data[0]
data2['Year'] = data[1]
data2['Runtime'] = data[2] 
data2['Genre'] = data[3]
data2['Rating'] = data[4]
data2['Score'] = data[5]
data2['Description'] = data[6]
data2['Director'] = data[7]
data2['Stars'] = data[8]
data2['Votes'] = data[9]
data2['Gross'] = data[10]
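
The same mapping can be written more compactly by pairing a list of names with the scraped lists. A sketch, using a short stand-in for `data` (the variable names `column_names`, `sample_data`, and `named` are illustrative):

```python
column_names = ['Title', 'Year', 'Runtime']

# Stand-in for the first three lists returned by movieData().
sample_data = [['The Crown', 'Alguém Avisa?'], ['2016– ', '2020'], ['58', '102']]

# zip pairs each name with the list in the same position.
named = dict(zip(column_names, sample_data))
print(named['Year'])
```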

---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4. Handle the year column, which has a different format for movies (yyyy) and series (yyyy–yyyy)
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The difference in format is also used to create a new field/column that indicates whether the title is a movie or a series

---

In [7]:
import re

def parse_years(years_list):
    title_types = []
    year_begins = []
    year_ends = []
    for raw_years in years_list:
        # Strip everything except digits and the en dash (–)
        clean_years = re.sub(r'[^0-9–]+', '', raw_years)
        # If it has an en dash (–), it is a TV series
        if '–' in clean_years:
            title_type = 'TV Series'
            splitted_years = clean_years.split('–')
            year_begin = int(splitted_years[0])
            year_end = splitted_years[1]
            if year_end == '':
                year_end = 'ongoing'
        # If it does not have an en dash (–), it is a feature film
        else:
            title_type = 'Feature Film'
            year_begin = int(clean_years)
            year_end = ''
        title_types.append(title_type)
        year_begins.append(year_begin)     
        year_ends.append(year_end)
    return [title_types, year_begins, year_ends]

# a, b, c = parse_years(data[1])
# pd.DataFrame(parse_years(data[1])).T.head(8)

types, y_begin, y_end = parse_years(data2['Year'])

data2['Title Type'] = types
data2['Year (Begin)'] = y_begin
data2['Year (End)'] = y_end
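
The rule can be checked on a few sample strings. The sketch below is a simplified, standalone variant of the logic above, assuming a hypothetical single-value helper named `classify_year`; the sample inputs mirror formats seen in the scraped data, including the `I 2019` form:

```python
import re

def classify_year(raw):
    """Return (title_type, year_begin, year_end) for one raw year string."""
    # Keep only digits and the en dash that appears in the scraped ranges.
    clean = re.sub(r'[^0-9–]+', '', raw)
    if '–' in clean:
        begin, end = clean.split('–', 1)
        # An empty end year means the series has not finished.
        return ('TV Series', int(begin), end or 'ongoing')
    return ('Feature Film', int(clean), '')

for raw in ['2016– ', '2005–2020', '2020', 'I 2019']:
    print(repr(raw), '->', classify_year(raw))
```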

---

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5. Convert the data into a pandas DataFrame

---

In [8]:
df = pd.DataFrame(data2)
df.head(5)

Unnamed: 0,Title,Year,Runtime,Genre,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
0,The Crown,2016–,58,"[Biography, Drama, History]",8.7,,Follows the political rivalries and romance of...,,"[Stars:Claire Foy, Olivia Colman, Imelda Staun...",130250,,TV Series,2016,ongoing
1,O Mandaloriano,2019–,40,"[Action, Adventure, Sci-Fi]",8.7,,The travels of a lone bounty hunter in the out...,,"[Stars:Pedro Pascal, Carl Weathers, Gina Caran...",222268,,TV Series,2019,ongoing
2,Alguém Avisa?,2020,102,"[Comedy, Romance]",6.9,,A holiday romantic comedy that captures the ra...,Clea DuVall,"[Kristen Stewart, Mackenzie Davis, Mary Steenb...",8366,,Feature Film,2020,
3,Era uma vez um sonho,2020,116,[Drama],6.6,39.0,An urgent phone call pulls a Yale Law student ...,Ron Howard,"[Amy Adams, Glenn Close, Gabriel Basso, Haley ...",9126,,Feature Film,2020,
4,Sobrenatural,2005–2020,44,"[Drama, Fantasy, Horror]",8.4,,Two brothers follow their father's footsteps a...,,"[Stars:Jared Padalecki, Jensen Ackles, Jim Bea...",386427,,TV Series,2005,2020


In [9]:
df.tail(5)

Unnamed: 0,Title,Year,Runtime,Genre,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
95,Entre Facas e Segredos,2019,130,"[Comedy, Crime, Drama]",7.9,,A detective investigates the death of a patria...,Rian Johnson,"[Daniel Craig, Chris Evans, Ana de Armas, Jami...",421300,$165.36M,Feature Film,2019,
96,Doctor Who,2005–,45,"[Adventure, Drama, Family]",8.6,,The further adventures in time and space of th...,,"[Stars:Jodie Whittaker, Peter Capaldi, Pearl M...",197325,,TV Series,2005,ongoing
97,Ted Lasso,2020–,30,"[Comedy, Drama, Sport]",8.7,,Follows US American Football coach Ted Lasso h...,,"[Stars:Jason Sudeikis, Hannah Waddingham, Jere...",22567,,TV Series,2020,ongoing
98,O Poderoso Chefão,1972,175,"[Crime, Drama]",9.2,,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,"[Marlon Brando, Al Pacino, James Caan, Diane K...",1596220,$134.97M,Feature Film,1972,
99,Fúria Incontrolável,2020,90,"[Action, Thriller]",6.0,,After a confrontation with an unstable man at ...,Derrick Borte,"[Russell Crowe, Caren Pistorius, Gabriel Batem...",23897,,Feature Film,2020,


---

### &nbsp; &nbsp; &nbsp; 2.6 Sort the titles by year, from oldest to most recent (ascending order)

---

In [10]:
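# Note: 'Year' holds the raw scraped strings, so this sort is lexicographic;
# values such as 'I 2018–' are placed after '2021'.
df.sort_values(by=['Year'], ascending=True)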

Unnamed: 0,Title,Year,Runtime,Genre,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
98,O Poderoso Chefão,1972,175,"[Crime, Drama]",9.2,,The aging patriarch of an organized crime dyna...,Francis Ford Coppola,"[Marlon Brando, Al Pacino, James Caan, Diane K...",1596220,$134.97M,Feature Film,1972,
54,Antes Só do que Mal Acompanhado,1987,93,"[Comedy, Drama]",7.6,,A man must struggle to travel home for Thanksg...,John Hughes,"[Steve Martin, John Candy, Laila Robins, Micha...",121923,$49.53M,Feature Film,1987,
31,Férias Frustradas de Natal,1989,97,[Comedy],7.6,,The Griswold family's plans for a big family C...,Jeremiah S. Chechik,"[Chevy Chase, Beverly D'Angelo, Juliette Lewis...",150128,$71.32M,Feature Film,1989,
34,Esqueceram de Mim,1990,103,"[Comedy, Family]",7.6,,An eight-year-old troublemaker must protect hi...,Chris Columbus,"[Macaulay Culkin, Joe Pesci, Daniel Stern, Joh...",460291,$285.76M,Feature Film,1990,
55,Um Maluco no Pedaço,1990–1996,22,[Comedy],7.9,,"A streetwise, poor young man from Philadelphia...",,"[Stars:Will Smith, James Avery, Alfonso Ribeir...",119689,,TV Series,1990,1996
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44,Saved by the Bell,2020–,,[Comedy],4.9,,A follow-up series to the 1990s sitcom. A grou...,,"[Stars:Haskiri Velazquez, Mitchell Hoog, Josie...",924,,TV Series,2020,ongoing
5,Mundo em Caos,2021,,"[Adventure, Sci-Fi]",,,A dystopian world where there are no women and...,Doug Liman,"[Mads Mikkelsen, Tom Holland, Daisy Ridley, Ra...",,,Feature Film,2021,
60,Titãs,I 2018–,45,"[Action, Adventure, Crime]",7.7,,A team of young superheroes combat evil and ot...,,"[Stars:Brenton Thwaites, Teagan Croft, Anna Di...",63395,,TV Series,2018,ongoing
59,Mosul,I 2019,86,"[Action, Drama, War]",7.3,,A police unit from Mosul fight to liberate the...,Matthew Michael Carnahan,"[Waleed Elgadi, Hayat Kamille, Thaer Al-Shayei...",11486,,Feature Film,2019,
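
The last two rows above (`I 2018–` and `I 2019`) come after 2021 because `Year` holds text, so the sort compares characters rather than numbers. A small sketch with a toy DataFrame (the values are copied from the output above; the frame itself is illustrative) contrasts sorting on the text column with sorting on the numeric `Year (Begin)` column:

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['Mundo em Caos', 'Titãs', 'Mosul', 'O Poderoso Chefão'],
    'Year': ['2021', 'I 2018–', 'I 2019', '1972'],
    'Year (Begin)': [2021, 2018, 2019, 1972],
})

# Text sort: strings starting with 'I' are ordered after strings starting with a digit.
by_text = toy.sort_values(by='Year', ascending=True)['Title'].tolist()
# Numeric sort: ordered by the integer start year.
by_number = toy.sort_values(by='Year (Begin)', ascending=True)['Title'].tolist()

print(by_text)
print(by_number)
```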


---

### &nbsp; &nbsp; &nbsp; 2.7 Sort the titles by year, from most recent to oldest (descending order)

---

In [11]:
df.sort_values(by=['Year (Begin)'], ascending=False)

Unnamed: 0,Title,Year,Runtime,Genre,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
5,Mundo em Caos,2021,,"[Adventure, Sci-Fi]",,,A dystopian world where there are no women and...,Doug Liman,"[Mads Mikkelsen, Tom Holland, Daisy Ridley, Ra...",,,Feature Film,2021,
99,Fúria Incontrolável,2020,90,"[Action, Thriller]",6.0,,After a confrontation with an unstable man at ...,Derrick Borte,"[Russell Crowe, Caren Pistorius, Gabriel Batem...",23897,,Feature Film,2020,
73,O Diabo de Cada Dia,2020,138,"[Crime, Drama, Thriller]",7.1,,Sinister characters converge around a young ma...,Antonio Campos,"[Donald Ray Pollock, Bill Skarsgård, Tom Holla...",77652,,Feature Film,2020,
27,Os Novos Mutantes,2020,94,"[Action, Horror, Sci-Fi]",5.3,,"Five young mutants, just discovering their abi...",Josh Boone,"[Maisie Williams, Anya Taylor-Joy, Charlie Hea...",30489,,Feature Film,2020,
80,Emma.,2020,124,"[Comedy, Drama, Romance]",6.7,,"In 1800s England, a well meaning but selfish y...",Autumn de Wilde,"[Anya Taylor-Joy, Johnny Flynn, Mia Goth, Angu...",24685,,Feature Film,2020,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55,Um Maluco no Pedaço,1990–1996,22,[Comedy],7.9,,"A streetwise, poor young man from Philadelphia...",,"[Stars:Will Smith, James Avery, Alfonso Ribeir...",119689,,TV Series,1990,1996
34,Esqueceram de Mim,1990,103,"[Comedy, Family]",7.6,,An eight-year-old troublemaker must protect hi...,Chris Columbus,"[Macaulay Culkin, Joe Pesci, Daniel Stern, Joh...",460291,$285.76M,Feature Film,1990,
31,Férias Frustradas de Natal,1989,97,[Comedy],7.6,,The Griswold family's plans for a big family C...,Jeremiah S. Chechik,"[Chevy Chase, Beverly D'Angelo, Juliette Lewis...",150128,$71.32M,Feature Film,1989,
54,Antes Só do que Mal Acompanhado,1987,93,"[Comedy, Drama]",7.6,,A man must struggle to travel home for Thanksg...,John Hughes,"[Steve Martin, John Candy, Laila Robins, Micha...",121923,$49.53M,Feature Film,1987,


### &nbsp; &nbsp; &nbsp; 2.8 Filter the dataset by the chosen movie genre

In [12]:
# One-off export of the DataFrame to Excel (kept commented out):
# xlwriter = pd.ExcelWriter('IMDB_df.xlsx')
# df.to_excel(xlwriter, sheet_name='IMDB Movies', index=False)
# xlwriter.close()

### &nbsp; &nbsp; &nbsp; Note: I exported the dataset to Excel so that I could apply Text to Columns to the Genre column.
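
The same split can also be done without leaving pandas. A minimal sketch, assuming a `Genre` column that holds lists of strings as produced by the scraper (the toy rows are illustrative, and the `Genre_1` to `Genre_3` names mirror the columns of the Excel file loaded next):

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['The Crown', 'Alguém Avisa?', 'Era uma vez um sonho'],
    'Genre': [['Biography', ' Drama', ' History'], ['Comedy', ' Romance'], ['Drama']],
})

# Expand each list into its own columns; shorter lists are padded with missing values.
genres = pd.DataFrame(toy['Genre'].tolist(), index=toy.index)
# Trim the leading spaces left over from splitting on ",".
genres = genres.apply(lambda col: col.str.strip())
genres.columns = [f'Genre_{i + 1}' for i in range(genres.shape[1])]

toy_split = toy.drop(columns='Genre').join(genres)
print(toy_split)
```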

In [14]:
df_movies = pd.read_excel(r'C:\\Users\\olivi\\Desktop\\7 - Construção do Portifólio\\Projetos do BI Master\\LUI - Localização e Uso da Informação\\IMDB_df.xlsx')
df_movies

Unnamed: 0,Title,Year,Runtime,Genre_1,Genre_2,Genre_3,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
0,The Boys,2019–,60,Action,Comedy,Crime,8.7,-,A group of vigilantes sets out to take down co...,-,"['Stars:Karl Urban', 'Jack Quaid', 'Antony Sta...",197179,-,TV Series,2019,ongoing
1,O Halloween do Hubie,2020,102,Comedy,Fantasy,Mystery,5.2,-,Despite his devotion to his hometown of Salem ...,Steven Brill,"['Adam Sandler', 'Kevin James', 'Julie Bowen',...",19686,-,Feature Film,2020,-
2,Emily em Paris,2020–,30,Comedy,Drama,Romance,7.3,-,A young American woman from the Midwest is hir...,-,"['Stars:Lily Collins', 'Philippine Leroy-Beaul...",16953,-,TV Series,2020,ongoing
3,The Walking Dead,2010–,44,Drama,Horror,Thriller,8.2,-,Sheriff Deputy Rick Grimes wakes up from a com...,-,"['Stars:Andrew Lincoln', 'Norman Reedus', 'Mel...",836518,-,TV Series,2010,ongoing
4,Ratched,2020–,-,Crime,Drama,Mystery,7.4,-,"In 1947, Mildred Ratched begins working as a n...",-,"['Stars:Sarah Paulson', 'Finn Wittrock', 'Cynt...",22668,-,TV Series,2020,ongoing
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,Família Soprano,1999–2007,55,Crime,Drama,-,9.2,-,New Jersey mob boss Tony Soprano deals with pe...,-,"['Stars:James Gandolfini', 'Lorraine Bracco', ...",289859,-,TV Series,1999,2007
96,News of the World,2020,-,Action,Adventure,Drama,-,-,"A Civil War veteran agrees to deliver a girl, ...",Paul Greengrass,"['Tom Hanks', 'Elizabeth Marvel', 'Ray McKinno...",-,-,Feature Film,2020,-
97,Ava,IV 2020,96,Action,Crime,Drama,5.3,39,Ava is a deadly assassin who works for a black...,Tate Taylor,"['Jessica Chastain', 'John Malkovich', 'Common...",14662,-,Feature Film,2020,-
98,O Homem Invisível,I 2020,124,Horror,Mystery,Sci-Fi,7.1,-,When Cecilia's abusive ex takes his own life a...,Leigh Whannell,"['Elisabeth Moss', 'Oliver Jackson-Cohen', 'Ha...",140141,$64.91M,Feature Film,2020,-


In [15]:
genero = input('Enter the desired movie genre: ')
print('Variable genero: ' + genero)

Variable genero: Comedy


In [19]:
df_movies.loc[df_movies['Genre_1'] == genero]

Unnamed: 0,Title,Year,Runtime,Genre_1,Genre_2,Genre_3,Rating,Score,Description,Director,Stars,Votes,Gross,Title Type,Year (Begin),Year (End)
1,O Halloween do Hubie,2020,102,Comedy,Fantasy,Mystery,5.2,-,Despite his devotion to his hometown of Salem ...,Steven Brill,"['Adam Sandler', 'Kevin James', 'Julie Bowen',...",19686,-,Feature Film,2020,-
2,Emily em Paris,2020–,30,Comedy,Drama,Romance,7.3,-,A young American woman from the Midwest is hir...,-,"['Stars:Lily Collins', 'Philippine Leroy-Beaul...",16953,-,TV Series,2020,ongoing
6,Schitt's Creek,2015–2020,22,Comedy,-,-,8.5,-,When rich video-store magnate Johnny Rose and ...,-,"['Stars:Eugene Levy', ""Catherine O'Hara"", 'Dan...",47830,-,TV Series,2015,2020
15,Abracadabra,1993,96,Comedy,Family,Fantasy,6.9,-,"A curious youngster moves to Salem, where he s...",Kenny Ortega,"['Bette Midler', 'Sarah Jessica Parker', 'Kath...",94948,$39.51M,Feature Film,1993,-
21,American Pie Apresenta: Meninas ao Ataque,2020,-,Comedy,-,-,3.7,-,"It's Senior year at East Great Falls. Annie, K...",Mike Elliott,"['Madison Pettis', 'Lizze Broadway', 'Natasha ...",2214,-,Feature Film,2020,-
27,Ted Lasso,2020–,30,Comedy,Drama,Sport,8.7,-,Follows US American Football coach Ted Lasso h...,-,"['Stars:Jason Sudeikis', 'Hannah Waddingham', ...",15322,-,TV Series,2020,ongoing
32,Borat: Subsequent Moviefilm,2020,95,Comedy,-,-,-,-,Follow-up film to the 2006 comedy centering on...,Jason Woliner,"['Sacha Baron Cohen', 'Irina Novak', 'Luenell'...",-,-,Feature Film,2020,-
34,Vida de Escritório,2005–2013,22,Comedy,-,-,8.9,-,A mockumentary on a group of typical office wo...,-,"['Stars:Steve Carell', 'Jenna Fischer', 'John ...",385023,-,TV Series,2005,2013
44,Vampiros X the Bronx,2020,85,Comedy,Horror,-,5.4,-,A group of young friends from the Bronx fight ...,Osmany Rodriguez,"['Jaden Michael', 'Gerald Jones III', 'Gregory...",4002,-,Feature Film,2020,-
49,Friends,1994–2004,22,Comedy,Romance,-,8.9,-,Follows the personal and professional lives of...,-,"['Stars:Jennifer Aniston', 'Courteney Cox', 'L...",792703,-,TV Series,1994,2004
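
The filter above only looks at `Genre_1`, so a title whose chosen genre appears in second or third position (such as The Boys, listed as Action, Comedy, Crime) is not returned. A sketch of a broader filter, assuming the three `Genre_*` columns shown above and using a small illustrative frame named `toy_movies` in place of `df_movies`:

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'Title': ['The Boys', 'O Halloween do Hubie', 'The Walking Dead'],
    'Genre_1': ['Action', 'Comedy', 'Drama'],
    'Genre_2': ['Comedy', 'Fantasy', 'Horror'],
    'Genre_3': ['Crime', 'Mystery', 'Thriller'],
})

genero = 'Comedy'
genre_cols = ['Genre_1', 'Genre_2', 'Genre_3']

# True for rows where any of the genre columns equals the chosen genre.
mask = toy_movies[genre_cols].eq(genero).any(axis=1)
matches = toy_movies.loc[mask, 'Title'].tolist()
print(matches)
```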


