# Get actors, rating and movie title based on titleId

We want to retrieve relevant information from the IMDb API based on the titleId retrieved from the reviews data. In addition to this, the objective is to connect the titleIds and reviews, with corresponding data from the api into a single dataframe. 

Link to the API: https://developer.imdb.com/documentation/api-documentation/?ref_=/documentation/_PAGE_BODY

In [1]:
pip install imdbpy

Collecting imdbpy
  Downloading IMDbPY-2022.7.9-py3-none-any.whl (1.2 kB)
Collecting cinemagoer
  Downloading cinemagoer-2022.2.11-py3-none-any.whl (301 kB)
     -------------------------------------- 301.4/301.4 kB 6.2 MB/s eta 0:00:00
Installing collected packages: cinemagoer, imdbpy
Successfully installed cinemagoer-2022.2.11 imdbpy-2022.7.9
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install import-ipynb

Collecting import-ipynb
  Downloading import_ipynb-0.1.4-py3-none-any.whl (4.1 kB)
Installing collected packages: import-ipynb
Successfully installed import-ipynb-0.1.4
Note: you may need to restart the kernel to use updated packages.


In [3]:
import imdb
import pandas as pd
import numpy as np
import re


db = imdb.IMDb()

#print(dir(moviesDB)) #Get available keys

def getMovie(id):
    movie = db.get_movie(id)
    return movie

def getActors(movie):
    try:
        top_actors_list = movie['cast'][:10]
        actors = re.findall('_([A-Za-z ]+)_', str(top_actors_list))
        return list(actors)
    except KeyError:
        print("Oops!  Movie has no or few actors...")
        return False

In [127]:
reviewsDf = pd.read_pickle("reviewsDf.pkl")
reviewsDf

Unnamed: 0,reviews,titleId
0,"[""I went and saw this movie last night after b...",0406816
1,"[""My boyfriend and I went to watch The Guardia...",0406816
2,['My yardstick for measuring a movie\'s watch-...,0406816
3,"[""How many movies are there that you can think...",0406816
4,"[""This movie was sadly under-promoted but prov...",0406816
...,...,...
24995,"[""CyberTracker is set in Los Angeles sometime ...",0109515
24996,"[""Eric Phillips (Don Wilson) is a secret servi...",0109515
24997,['Plot Synopsis: Los Angeles in the future. Cr...,0109515
24998,"['Oh, dear! This has to be one of the worst fi...",0109515


In [143]:
titleId_column = reviewsDf["titleId"].to_numpy()
listOfMovieIds = np.unique(titleId_column)
listOfMovieIds = list(listOfMovieIds)

In [170]:
df = pd.DataFrame()
for titleId in listOfMovieIds: 
    print(listOfMovieIds.index(titleId), "of", len(listOfMovieIds))
    movie = getMovie(titleId)
    title = movie['title']
    hasActors = getActors(movie)
    if(hasActors):
        new_row = pd.Series({'TitleId':titleId, 'MovieTitle':title, 'Actors': hasActors})
        df = pd.concat([df, new_row.to_frame().T], ignore_index=True)
        
    else:
        print("removing title: ", title)
        cleaned_movieId_list = listOfMovieIds.remove(titleId)
    


0 of 3581
Oops!  Movie has no or few actors...
removing title:  Rough Sea at Dover
1 of 3580
2 of 3580
3 of 3580
4 of 3580
5 of 3580
6 of 3580
7 of 3580
8 of 3580
9 of 3580
10 of 3580
11 of 3580
12 of 3580
13 of 3580
14 of 3580
15 of 3580
16 of 3580
17 of 3580
18 of 3580
19 of 3580
20 of 3580
21 of 3580
22 of 3580
23 of 3580
24 of 3580
25 of 3580
26 of 3580
27 of 3580
28 of 3580
29 of 3580
30 of 3580
Oops!  Movie has no or few actors...
removing title:  The Frogs Who Wanted a King
31 of 3579
32 of 3579
33 of 3579
34 of 3579
35 of 3579
36 of 3579
37 of 3579
38 of 3579
39 of 3579
40 of 3579
41 of 3579
42 of 3579
Oops!  Movie has no or few actors...
removing title:  Felix in Hollywood
43 of 3578
44 of 3578
45 of 3578
46 of 3578
47 of 3578
48 of 3578
49 of 3578
50 of 3578
51 of 3578
52 of 3578
53 of 3578
54 of 3578
55 of 3578
56 of 3578
57 of 3578
58 of 3578
59 of 3578
60 of 3578
61 of 3578
62 of 3578
63 of 3578
64 of 3578
65 of 3578
66 of 3578
67 of 3578
68 of 3578
69 of 3578
70 of 3578
7

673 of 3578
674 of 3578
675 of 3578
676 of 3578
677 of 3578
678 of 3578
679 of 3578
680 of 3578
681 of 3578
682 of 3578
683 of 3578
684 of 3578
685 of 3578
686 of 3578
687 of 3578
688 of 3578
689 of 3578
690 of 3578
691 of 3578
692 of 3578
693 of 3578
694 of 3578
695 of 3578
696 of 3578
697 of 3578
698 of 3578
699 of 3578
700 of 3578
701 of 3578
702 of 3578
703 of 3578
704 of 3578
705 of 3578
706 of 3578
707 of 3578
708 of 3578
709 of 3578
710 of 3578
711 of 3578
712 of 3578
713 of 3578
714 of 3578
715 of 3578
716 of 3578
717 of 3578
718 of 3578
719 of 3578
720 of 3578
721 of 3578
722 of 3578
723 of 3578
724 of 3578
725 of 3578
726 of 3578
727 of 3578
728 of 3578
729 of 3578
730 of 3578
731 of 3578
732 of 3578
733 of 3578
734 of 3578
735 of 3578
736 of 3578
737 of 3578
738 of 3578
739 of 3578
740 of 3578
741 of 3578
742 of 3578
743 of 3578
744 of 3578
745 of 3578
746 of 3578
747 of 3578
748 of 3578
749 of 3578
750 of 3578
751 of 3578
752 of 3578
753 of 3578
754 of 3578
755 of 3578
756 

2022-11-03 18:00:35,043 CRITICAL [imdbpy] C:\Users\Maria\anaconda3\lib\site-packages\imdb\_exceptions.py:32: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0088512/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out')},); kwds: {}
Traceback (most recent call last):
  File "C:\Users\Maria\anaconda3\lib\site-packages\imdb\parser\http\__init__.py", line 221, in retrieve_unicode
    response = uopener.open(url)
  File "C:\Users\Maria\anaconda3\lib\urllib\request.py", line 517, in open
    response = self._open(req, data)
  File "C:\Users\Maria\anaconda3\lib\urllib\request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Users\Maria\anaconda3\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\Maria\anaconda3\lib\urllib\request.py", line 1389, in https_open
    return

IMDbDataAccessError: {'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0088512/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out')}

In [171]:
df

Unnamed: 0,TitleId,MovieTitle,Actors
0,0000399,Jack and the Beanstalk,[Thomas White]
1,0000430,A Chess Dispute,[Alfred Collins]
2,0000653,A Calamitous Elopement,"[Harry Solter, Linda Arvidson, Charles Inslee,..."
3,0002816,The Drummer of the 8th,"[Cyril Gardner, Mildred Harris, Frank Borzage]"
4,0003662,The Battle of Elderbush Gulch,"[Mae Marsh, Leslie Loveridge, Alfred Paget, Ro..."
...,...,...,...
971,0088249,Terror in the Aisles,"[Donald Pleasence, Nancy Allen, Fred Asparagus..."
972,0088263,Michael Jackson: Thriller,"[Michael Jackson, Ola Ray, Brandon Scott Mille..."
973,0088323,The NeverEnding Story,"[Barret Oliver, Gerald McRaney, Chris Eastman,..."
974,0088381,Nausicaä of the Valley of the Wind,"[Sumi Shimamoto, Mahito Tsujimura, Minoru Yada..."


In [172]:
df.to_pickle("moviesDf.pkl")