# HW4 : A visit to the movie zoo!

![](https://vignette.wikia.nocookie.net/bojackhorseman/images/f/f2/HSACWDTK%3FDTKT%3FLFO%21%21.png/revision/latest?cb=20150720050503)

## Task
In this homework, your task is to create THREE non-typical charts about anything related to your favorite **movie star!**
This means you CANNOT use the Big 4 chart types (Pie, Bar, Line, and Scatter) or their close variants (e.g. Area).

You are free to use any other chart type, whether or not it was covered in class!
The lecture on Visit To The Zoo is a good place to start to get ideas on what kinds of charts exist.

For the data, you are free to use any data source you deem fit.
For charting, we will NOT be constraining the technology you use. 
You are free to produce the charts in any way you would like.

You will be judged on:
* Creativity
* Presentation Quality
* Data Quality (Did your visualization reveal something interesting?)

For extra credit, you can make a fully interactive visualization.

## Ideas for Data Collection

Here, we show an example of how to collect data about Arnold Schwarzenegger!
Do note that this is just an example of the kind of data you can collect.
You are **NOT** constrained
* To the same movie star (you can pick your own!)
* To the same *kind* of data
* To the same data sources
* or to anything else!

This assignment gives you the power to do what you like!

In [32]:
from imdbpie import ImdbFacade
from IPython.display import display, HTML
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls

#### Get data for Arnold Schwarzenegger

In [None]:
# Get an instance of IMDb class
imdb = ImdbFacade()

# Search for Arnold Schwarzenegger
people = imdb.search_for_name('Arnold Schwarzenegger')
print(people)

In [None]:
# Fetch information about him
arnold = imdb.get_name(people[0].imdb_id)

# What information do I have about him?
print('\n'.join([x for x in dir(arnold) if not x.startswith('__')]))

In [None]:
# How many movies does he have?
print(len(arnold.filmography))

In [None]:
# Let's fetch some more information about a movie
movie = imdb.get_title(arnold.filmography[-1])

In [None]:
# What information can I get about this movie?
print('\n'.join([x for x in dir(movie) if not x.startswith('__')]))

In [None]:
print(movie.imdb_id)

In [None]:
html = """
    <div style="background-color:#FFDDDD">
    <h2> Warning! </h2>
    <p> This code below is meant to be an example of what you can do. <br>
        It is not guaranteed to work always, and will need to be tweaked!
    </p>
    </div>
"""
display(HTML(html))

#### Box office numbers

In [None]:
# Let's experiment with Terminator
imdb_id = 'tt0088247'

# Fetch the box office numbers
base = 'https://www.boxofficemojo.com'
url = base + '/title/' + imdb_id  + '/?ref_=bo_tt_gr_1'
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')

table = soup('th', text=re.compile(r'Release Group'))[0].parent.parent
group = table.findAll('tr', recursive=False)[1].find('a').get('href')
url = base + group

# Get total earnings domestic and international
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
earnings = soup('h2', text=re.compile(r'Rollout'))[0].parent.parent.findAll('div')
domestic = earnings[1].find('span', {'class': 'money'}).get_text()
domestic_url = earnings[1].find('a').get('href')
international = earnings[2].find('span', {'class': 'money'}).get_text()

# Get weekly domestic earnings
url = base + domestic_url
url = url[:url.rfind('/')] + '/weekly/'
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
table = soup.find('div', {'class':'a-section imdb-scroll-table-inner'}).findAll('tr')
weekly = []
for tr in table[1:]:
    date = tr.findAll('td')[0].get_text()
    earning = tr.findAll('td')[2].get_text()
    weekly.append((date, earning))

# Print the values we've just got!
print("Total Domestic Earnings: %s" % domestic)
print("Total International Earnings: %s" % international)
print("Weekly Domestic Earnings:")
for date, earning in weekly:
    print("\t%s \t: %s" % (date, earning))
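The scraped earnings are strings such as `$46,630,690`, so they need converting before they can be charted. A minimal sketch of that step, assuming the same `(date, earning)` tuple shape as `weekly`; the helper name `parse_money` and the sample rows are hypothetical, not values scraped from the site:

```python
import re

def parse_money(text):
    """Convert a string such as '$46,630,690' to an int; return None when no digits are present."""
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None

# Hypothetical sample in the same (date, earning) shape as `weekly`
sample_weekly = [('Week 1', '$5,672,031'), ('Week 2', '$4,219,000'), ('Week 3', '–')]

parsed = [(date, parse_money(earning)) for date, earning in sample_weekly]
total = sum(amount for _, amount in parsed if amount is not None)

print(parsed)
print(total)
```

Returning `None` for placeholder cells keeps missing values distinguishable from a true zero.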

In [None]:
# Search for Johnny Depp
people = imdb.search_for_name('Johnny Depp')
print(people)

In [None]:
depp = imdb.get_name(people[0].imdb_id)

# What information do I have about him?
print('\n'.join([x for x in dir(depp) if not x.startswith('__')]))

In [None]:
print(len(depp.filmography))

In [None]:
movie = imdb.get_title(depp.filmography[-1])

In [None]:
dic = {}
dic['integer'] = imdb.get_title(depp.filmography[-1]).genres
dic

In [None]:
data = []
for title_id in depp.filmography:
    try:
        movie = imdb.get_title(title_id)
        d = (movie.imdb_id, movie.title, movie.type, movie.year, str(movie.genres)
            , movie.rating, movie.rating_count, movie.release_date)
        data.append(d)
    except Exception:
        # Skip titles that fail to load instead of aborting the whole loop
        continue

In [None]:
films_df = pd.DataFrame(data, columns = ['imdb_id', 'title', 'type', 'year', 'genres','rating', 'rating_count', 'release_date'])

In [None]:
#films_df.to_csv('johnny_depp_csv')

In [None]:
# NOTE: `genres` is not defined in any cell shown here; it must be a dict created beforehand
films = pd.DataFrame(list(genres.values()))
films['keys'] = pd.Series(list(genres.keys()))

In [None]:
#films.to_csv(path_or_buf  = 'johnny_depp_films')

In [None]:
imdb.get_title('tt2224471')

In [None]:
print('\n'.join([x for x in dir(movie) if not x.startswith('__')]))

In [None]:
print(movie.imdb_id)

In [2]:
#Plotly account info
username = 'lmd003' # your username
api_key = 'YOUR_API_KEY' # your api key - go to profile > settings > regenerate key
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)

In [3]:
actor_df = pd.read_csv('johnny_depp_csv')
actor_df = actor_df.drop('Unnamed: 0', axis = 1)
actor_df.head()

Unnamed: 0,imdb_id,title,type,year,genres,rating,rating_count,release_date
0,tt3715848,"Fortunately, the Milk",movie,,"('animation',)",,0,
1,tt4123432,Fantastic Beasts and Where to Find Them 3,movie,2021.0,"('adventure', 'family', 'fantasy')",,0,2021-11-12
2,tt9179096,Minamata,movie,2020.0,"('drama',)",,0,2020-11-24
3,tt6149154,Waiting for the Barbarians,movie,2019.0,"('drama',)",6.3,254,2019-09-06
4,tt2677722,City of Lies,movie,2018.0,"('biography', 'crime', 'drama', 'mystery', 'th...",6.3,3548,2018-12-08


In [4]:
just_movies = actor_df[actor_df.type == 'movie']

In [5]:
cats = {}
for i in just_movies['genres'].apply(eval).apply(list):
    for j in i:
        if j in cats:
            cats[j] += 1
        else:
            # Caution: starting at 0 leaves every count one short; start at 1 for exact counts
            cats[j] = 0

In [6]:
cats

{'animation': 4,
 'adventure': 25,
 'family': 11,
 'fantasy': 28,
 'drama': 59,
 'biography': 18,
 'crime': 19,
 'mystery': 15,
 'thriller': 18,
 'comedy': 39,
 'romance': 19,
 'action': 18,
 'horror': 12,
 'musical': 6,
 'sci-fi': 2,
 'war': 4,
 'western': 4,
 'history': 4,
 'music': 10,
 'documentary': 39}
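The same tally can be sketched with `collections.Counter`, which counts the first occurrence of each genre as 1, and with `ast.literal_eval`, which parses the stored tuple strings without executing arbitrary code. The sample strings below are hypothetical rows shaped like the `genres` column:

```python
import ast
from collections import Counter

# Hypothetical rows shaped like the `genres` column after reading the CSV
genre_strings = [
    "('drama',)",
    "('adventure', 'family', 'fantasy')",
    "('comedy', 'drama')",
]

genre_counts = Counter()
for raw in genre_strings:
    # literal_eval turns the string "('drama',)" back into the tuple ('drama',)
    genre_counts.update(ast.literal_eval(raw))

print(genre_counts.most_common(3))
```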

In [7]:
rats = {}
# Copy so the column assignment below modifies a new frame, not a view of just_movies
a = just_movies.dropna(subset = ['rating']).copy()
a['genres'] = a['genres'].apply(eval).apply(list)
for i in range(len(a)):
    for j in a['genres'].iloc[i]:
        if j in rats:
            rats[j] += a['rating'].iloc[i]
        else:
            rats[j] = a['rating'].iloc[i]



In [8]:
b = pd.DataFrame([rats, cats])
c = pd.DataFrame(dict(round(b.loc[0]/b.loc[1], 2)), index = range(len (b))).T.reset_index()
c.columns = ['cat', 'AvgRatg', 'trash']
c = c.drop('trash', axis = 1)
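Note that dividing the rating sums by `cats` uses the count of all movies in a genre, including ones with no rating yet. A minimal sketch of an alternative that divides by the number of rated films only; the `(genres, rating)` pairs are hypothetical, with `None` standing in for a missing rating:

```python
from collections import defaultdict

# Hypothetical (genres, rating) pairs; None marks a film with no rating yet
films = [
    (('drama',), 6.3),
    (('comedy', 'drama'), 6.7),
    (('adventure', 'family', 'fantasy'), None),
    (('adventure', 'fantasy'), 8.0),
]

totals = defaultdict(float)  # sum of ratings per genre
rated = defaultdict(int)     # number of rated films per genre
for genres, rating in films:
    if rating is None:
        continue  # unrated films contribute to neither the sum nor the count
    for genre in genres:
        totals[genre] += rating
        rated[genre] += 1

avg_rating = {genre: round(totals[genre] / rated[genre], 2) for genre in totals}
print(avg_rating)
```

Genres that only appear in unrated films are left out of the result rather than reported as zero.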

In [9]:
df =pd.DataFrame(cats, index = range(len(cats))).T[[0]].reset_index()
df.columns = ['type','number']
df = df.merge(c, left_on = 'type', right_on = 'cat').drop('cat', axis =1)
df = df.sort_values('number', ascending=False)
df.head()

Unnamed: 0,type,number,AvgRatg
4,drama,59,5.59
19,documentary,39,7.13
9,comedy,39,5.75
3,fantasy,28,6.19
1,adventure,25,5.86


In [40]:
## figure out how to add rating
fig = px.line_polar(df[0:7], r='number', theta='type', line_close=True, template="plotly_dark",
                   title = "Top 7 Genres of movies with Johnny Depp"
                   , text = 'number')
fig.update_traces(fill='toself')
fig.show()

In [41]:
#py.plot(fig, filename = 'top_7_categories_dark', auto_open=True)
tls.get_embed('https://plot.ly/~lmd003/3/')

'<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plot.ly/~lmd003/3.embed" height="525" width="100%"></iframe>'

In [12]:
wind = px.data.wind()
fig = px.bar_polar(wind, r="frequency", theta="direction", color="strength", template="plotly_dark",
            color_discrete_sequence= px.colors.sequential.Plasma[-2::-1])
fig.show()

In [43]:
## figure out how to add rating as strength column (maybe avg?)
fig = px.bar_polar(df[0:7].sort_values('AvgRatg', ascending = False), r='number', theta='type', color = 'AvgRatg',
                   title = "Top 7 Genres of movies with Johnny Depp", template="plotly_dark",
                   # AvgRatg is numeric, so the bars are colored on a continuous scale
                   color_continuous_scale= px.colors.colorbrewer.RdBu[::-1]
                   )
fig.show()

In [14]:
#py.plot(fig, filename = 'top_7_movies_with_AvgRtng', auto_open=True)

'https://plot.ly/~lmd003/3/'

In [15]:
just_movies[just_movies['title'].str.contains('Pirates')]

Unnamed: 0,imdb_id,title,type,year,genres,rating,rating_count,release_date
12,tt1790809,Pirates of the Caribbean: Dead Men Tell No Tales,movie,2017.0,"('action', 'adventure', 'fantasy')",6.6,238145,2017-05-11
34,tt1298650,Pirates of the Caribbean: On Stranger Tides,movie,2011.0,"('action', 'adventure', 'fantasy')",6.6,456261,2011-05-07
42,tt0449088,Pirates of the Caribbean: At World's End,movie,2007.0,"('action', 'adventure', 'fantasy')",7.1,566639,2007-05-19
43,tt0383574,Pirates of the Caribbean: Dead Man's Chest,movie,2006.0,"('action', 'adventure', 'fantasy')",7.3,623741,2006-06-24
53,tt0325980,Pirates of the Caribbean: The Curse of the Bla...,movie,2003.0,"('action', 'adventure', 'fantasy')",8.0,977720,2003-06-28
504,tt0325980,Pirates of the Caribbean: The Curse of the Bla...,movie,2003.0,"('action', 'adventure', 'fantasy')",8.0,977720,2003-06-28


In [16]:
p5 = pd.read_csv('POTC_5.csv').dropna(how = 'all').drop('Release Date', axis = 1)
p5['Year'] = 2017
p5['Movie'] = '2017 - Dead Men Tell No Tales'
p4 = pd.read_csv('POTC_4.csv').dropna(how = 'all').drop('Release Date', axis = 1)
p4['Year'] = 2011
p4['Movie'] = '2011 - On Stranger Tides'
p3 = pd.read_csv('POTC_3.csv').dropna(how = 'all').drop('Release Date', axis = 1)
p3['Year'] = 2007
p3['Movie'] = '2007 - At Worlds End'
p2 = pd.read_csv('POTC_2.csv').dropna(how = 'all').drop('Release Date', axis = 1)
p2['Year'] = 2006
p2['Movie'] = '2006 - Dead Mans Chest'
p1 = pd.read_csv('POTC_1.csv').dropna(how = 'all').drop('Release Date', axis = 1)
p1['Year'] = 2003
p1['Movie'] = '2003 - Curse of the black Pearl'
p1.head()

Unnamed: 0,Market,Opening,Gross,Year,Movie
0,Domestic,"$46,630,690","$305,413,918",2003,2003 - Curse of the black Pearl
1,Austria,–,"$5,323,779",2003,2003 - Curse of the black Pearl
2,Bulgaria,–,"$119,897",2003,2003 - Curse of the black Pearl
3,Czech Republic,"$176,482","$1,166,187",2003,2003 - Curse of the black Pearl
4,Egypt,–,"$216,536",2003,2003 - Curse of the black Pearl


In [17]:
m1 = p5.merge(p4, how = 'outer', left_on = 'Market', right_on = 'Market')
m2 = m1.merge(p3, how = 'outer', left_on = 'Market', right_on = 'Market')
m3 = m2.merge(p2, how = 'outer', left_on = 'Market', right_on = 'Market')
m4 = m3.merge(p1, how = 'outer', left_on = 'Market', right_on = 'Market')
m4.head()

Unnamed: 0,Market,Opening_x,Gross_x,Year_x,Movie_x,Opening_y,Gross_y,Year_y,Movie_y,Opening_x.1,...,Year_x.1,Movie_x.1,Opening_y.1,Gross_y.1,Year_y.1,Movie_y.1,Opening,Gross,Year,Movie
0,Domestic,"$62,983,253","$172,558,876",2017.0,2017 - Dead Men Tell No Tales,"$90,151,958","$237,710,309",2011.0,2011 - On Stranger Tides,"$114,732,820",...,2007.0,2007 - At Worlds End,"$135,634,554","$422,614,379",2006.0,2006 - Dead Mans Chest,"$46,630,690","$305,413,918",2003.0,2003 - Curse of the black Pearl
1,Austria,"$633,316","$3,717,328",2017.0,2017 - Dead Men Tell No Tales,"$2,208,651","$8,019,203",2011.0,2011 - On Stranger Tides,"$1,999,236",...,2007.0,2007 - At Worlds End,–,"$8,381,848",2006.0,2006 - Dead Mans Chest,–,"$5,323,779",2003.0,2003 - Curse of the black Pearl
2,Belgium,"$1,099,321","$5,287,437",2017.0,2017 - Dead Men Tell No Tales,"$2,281,293","$8,462,456",2011.0,2011 - On Stranger Tides,"$2,838,144",...,2007.0,2007 - At Worlds End,–,"$9,810,197",2006.0,2006 - Dead Mans Chest,,,,
3,Bulgaria,"$323,780","$1,143,629",2017.0,2017 - Dead Men Tell No Tales,"$438,734","$1,394,133",2011.0,2011 - On Stranger Tides,"$219,183",...,2007.0,2007 - At Worlds End,"$164,477","$498,300",2006.0,2006 - Dead Mans Chest,–,"$119,897",2003.0,2003 - Curse of the black Pearl
4,Czech Republic,"$1,025,058","$3,616,902",2017.0,2017 - Dead Men Tell No Tales,"$1,100,843","$3,876,389",2011.0,2011 - On Stranger Tides,"$490,836",...,2007.0,2007 - At Worlds End,"$629,176","$2,731,967",2006.0,2006 - Dead Mans Chest,"$176,482","$1,166,187",2003.0,2003 - Curse of the black Pearl
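Chaining merges reuses the `_x`/`_y` suffixes, which makes the wide table hard to read and may be rejected by recent pandas versions because of the duplicate column names. One possible alternative, sketched here on a few gross values taken from the tables above (already converted to numbers), is to keep the data in long format and pivot it:

```python
import pandas as pd

# A few rows in long format: one row per (market, release year)
long_df = pd.DataFrame({
    'Market': ['Austria', 'Austria', 'Belgium'],
    'Year': [2003, 2006, 2006],
    'Gross': [5323779, 8381848, 9810197],
})

# One row per market, one column per release year; missing releases become NaN
wide = long_df.pivot(index='Market', columns='Year', values='Gross')
print(wide)
```

Each column is then named after its release year, so no suffix bookkeeping is needed.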


In [44]:
codes = pd.read_csv('countries_codes_and_coordinates.csv')
codes = codes[['Country', 'Alpha-3 code']]
codes.columns = ['Country', 'iso_alpha']
codes['iso_alpha'] = codes['iso_alpha'].str.replace('"', '').str.replace(" ", '')
codes[codes['Country'].str.contains('ana')]

Unnamed: 0,Country,iso_alpha
28,Botswana,BWA
39,Canada,CAN
76,French Guiana,GUF
83,Ghana,GHA
94,Guyana,GUY
168,Northern Mariana Islands,MNP
174,Panama,PAN


In [48]:
countries = pd.concat([p1, p2, p3, p4,p5]).sort_values(["Market", "Year"]).reset_index(drop = True)
# regex=False treats '$' as a literal dollar sign rather than a regex end-of-string anchor
countries['Opening'] = pd.to_numeric(countries['Opening'].str.replace(',', '', regex=False).str.replace('$', '', regex=False), errors='coerce').fillna(0)
countries['Gross'] = pd.to_numeric(countries['Gross'].str.replace(',', '', regex=False).str.replace('$', '', regex=False), errors='coerce').fillna(0)
countries.Market = countries.Market.str.replace("Domestic", "United States")
countries.Market = countries.Market.str.replace("Russia/CIS", "Russia")
countries.Market = countries.Market.str.replace("Serbia and Montenegro", "Serbia")
countries.Market = countries.Market.str.replace("Syria", "Syrian Arab Republic")
countries = countries.merge(codes, left_on = 'Market', right_on ='Country',how = "left" )
# Add Canada rows that reuse the Domestic opening and gross figures.
# DataFrame.append was removed in pandas 2.0, so the rows are added with pd.concat.
canada = pd.DataFrame([
    {"Market": 'Canada', 'Opening': 46630690.0, 'Gross': 305413918.0,'Year':2003,'Movie': '2003 - Curse of the black Pearl', 'Country':'Canada', 'iso_alpha':'CAN'},
    {"Market": 'Canada', 'Opening': 135634554.0, 'Gross': 422614379.0,'Year':2006,'Movie': '2006 - Dead Mans Chest', 'Country':'Canada', 'iso_alpha':'CAN'},
    {"Market": 'Canada', 'Opening': 114732820.0, 'Gross': 309420425.0,'Year':2007,'Movie': '2007 - At Worlds End', 'Country':'Canada', 'iso_alpha':'CAN'},
    {"Market": 'Canada', 'Opening': 90151958.0, 'Gross': 237710309.0,'Year':2011,'Movie': '2011 - On Stranger Tides', 'Country':'Canada', 'iso_alpha':'CAN'},
    {"Market": 'Canada', 'Opening': 62983253.0, 'Gross': 172558876.0,'Year':2017,'Movie': '2017 - Dead Men Tell No Tales', 'Country':'Canada', 'iso_alpha':'CAN'},
])
countries = pd.concat([countries, canada], ignore_index=True)
#countries[countries['iso_alpha'] == 'RUS']
countries[countries['Market'].str.contains('Can')]




Unnamed: 0,Market,Opening,Gross,Year,Movie,Country,iso_alpha
263,Canada,46630690.0,305413918.0,2003,2003 - Curse of the black Pearl,Canada,CAN
264,Canada,135634554.0,422614379.0,2006,2006 - Dead Mans Chest,Canada,CAN
265,Canada,114732820.0,309420425.0,2007,2007 - At Worlds End,Canada,CAN
266,Canada,90151958.0,237710309.0,2011,2011 - On Stranger Tides,Canada,CAN
267,Canada,62983253.0,172558876.0,2017,2017 - Dead Men Tell No Tales,Canada,CAN


In [49]:
fig = px.choropleth(countries, locations="iso_alpha", color="Gross", hover_name="Market", animation_frame="Movie",
                   title = "Johnny Depp's Pirates of the Caribbean Movies - Gross Revenue by Country")
fig.show()

In [50]:
#py.plot(fig, filename = 'POTC_Gross_by_Country', auto_open=True)

'https://plot.ly/~lmd003/5/'

In [22]:
display(actor_df[actor_df['year'] > 1980])
actor_df = actor_df[actor_df['year'] > 1980]

Unnamed: 0,imdb_id,title,type,year,genres,rating,rating_count,release_date
1,tt4123432,Fantastic Beasts and Where to Find Them 3,movie,2021.0,"('adventure', 'family', 'fantasy')",,0,2021-11-12
2,tt9179096,Minamata,movie,2020.0,"('drama',)",,0,2020-11-24
3,tt6149154,Waiting for the Barbarians,movie,2019.0,"('drama',)",6.3,254,2019-09-06
4,tt2677722,City of Lies,movie,2018.0,"('biography', 'crime', 'drama', 'mystery', 'th...",6.3,3548,2018-12-08
5,tt4123430,Fantastic Beasts: The Crimes of Grindelwald,movie,2018.0,"('adventure', 'family', 'fantasy')",6.6,186066,2018-11-08
6,tt6865690,The Professor,movie,2018.0,"('comedy', 'drama')",6.7,13667,2018-10-05
7,tt1273221,London Fields,movie,2018.0,"('crime', 'mystery', 'thriller')",4.3,3950,2018-09-20
8,tt8925680,Dior: Sauvage - Legend of the Magic Hour,video,2018.0,"('short',)",7.4,173,2018-06-04
9,tt2296777,Sherlock Gnomes,movie,2018.0,"('animation', 'adventure', 'comedy', 'family',...",5.1,8746,2018-03-15
10,tt3402236,Murder on the Orient Express,movie,2017.0,"('crime', 'drama', 'mystery')",6.5,186700,2017-11-03


In [23]:
fig = px.scatter(actor_df, x="year", y="rating", size="rating_count", color="type",
           log_x=True,hover_name="title", size_max=60,
                title = "Timeline of Depp's Filmography by Rating")
fig.show()

In [24]:
#py.plot(fig, filename = 'Filmogrophy_timeline_by_rating', auto_open=True)

'https://plot.ly/~lmd003/7/'

In [25]:
fig = px.density_heatmap(actor_df, x="year", y="rating", marginal_x="rug", marginal_y="histogram"
                        ,hover_name="title", title = "Depp's Filmography Ratings by Year")
fig.show()

In [28]:
#py.plot(fig, filename = 'Filmogrophy_Ratings_by_Year', auto_open=True)

## Other resources!

This assignment doesn't have a restriction on where you can look for data.
Further, we don't mind how you collect the data, or what data you collect.

Here are some additional resources for this example; you can look for similar ones for your own movie star!
* Arnold Schwarzenegger Kill Count: https://www.youtube.com/watch?v=OE6jpTaOYMU
* Arnold Schwarzenegger Top Quotes: https://www.youtube.com/watch?v=pDxn0Xfqkgw

You could think about the IMDb network as a graph, with different actors connected through the movies they share.
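
As a rough illustration of that graph idea, the sketch below links two actors whenever they appear in the same title. The cast lists are hypothetical placeholders; real ones would come from whichever data source you choose:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cast lists (title -> actors)
casts = {
    'Movie A': ['Actor 1', 'Actor 2', 'Actor 3'],
    'Movie B': ['Actor 2', 'Actor 4'],
}

# Undirected graph: an edge joins two actors who share at least one title
graph = defaultdict(set)
for title, actors in casts.items():
    for a, b in combinations(actors, 2):
        graph[a].add(b)
        graph[b].add(a)

for actor in sorted(graph):
    print(actor, '->', sorted(graph[actor]))
```

An adjacency structure like this can feed a network or chord diagram, neither of which is one of the Big 4 chart types.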

Some other useful libraries/ databases:
* IMDbPY
* http://www.omdbapi.com/