# HW4: A visit to the movie zoo!

![](https://vignette.wikia.nocookie.net/bojackhorseman/images/f/f2/HSACWDTK%3FDTKT%3FLFO%21%21.png/revision/latest?cb=20150720050503)

## Task
In this homework, your task is to create THREE non-typical charts about anything related to your favorite **movie star!**
This means you CANNOT use the Big 4 chart types (Pie, Bar, Line, and Scatter) or their close variants (e.g. Area).

You are free to use any other chart type, whether or not it was covered in class!
The lecture on Visit To The Zoo is a good place to start to get ideas on what kinds of charts exist.

For the data, you are free to use any data source you deem fit.
For charting, we will NOT be constraining the technology you use. 
You are free to produce the charts in any way you would like.

You will be judged on
* Creativity
* Presentation Quality
* Data Quality (Did your visualization reveal something interesting?)

For extra credit, you can make a fully interactive visualization.

## Ideas for Data Collection

Here, we show an example of how to collect data about Arnold Schwarzenegger! (Some of the code cells below use Robert Downey Jr. instead.)
Do note that this is just an example of the kind of data you can collect.
You are **NOT** constrained
* To the same movie star (you can pick your own!)
* To the same *kind* of data
* To the same data sources
* or to anything else!

This assignment gives you the power to do what you like!

In [171]:
# %pip install imdbpie
from imdbpie import ImdbFacade
from IPython.display import display, HTML
from bs4 import BeautifulSoup
import urllib.request
import re
import requests
import numpy as np  # used later for np.mean
import pandas as pd
import pycountry
import pycountry_convert as pc
from selenium import webdriver

#### Get data for actor

In [None]:
# Get an instance of IMDb class
imdb = ImdbFacade()

# Search for actor
people = imdb.search_for_name('robert downey jr')
print(people)

In [None]:
# Fetch information about him
robert = imdb.get_name(people[0].imdb_id)

# What information do I have about him?
print('\n'.join([x for x in dir(robert) if not x.startswith('__')]))

In [None]:
films = robert.filmography
# [imdb.get_title(x).title for x in films]

In [None]:
# How many movies does he have?
print(len(robert.filmography))

In [None]:
# Let's fetch some more information about a movie
movie = imdb.get_title(films[2])

In [None]:
# What information can I get about this movie?
print('\n'.join([x for x in dir(movie) if not x.startswith('__')]))
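The `dir()` trick used in the cells above can be wrapped in a small helper so it is easy to reuse on any object. This is a minimal sketch: `public_attrs` and the `Demo` class are hypothetical names introduced here for illustration, and the helper hides every name that starts with an underscore (slightly stricter than the `'__'` filter above).

```python
def public_attrs(obj):
    """Return the names of attributes that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


class Demo:
    """Stand-in object; any Python object can be inspected the same way."""
    title = 'example'
    _hidden = 1


# Print one attribute name per line, as in the cells above
print('\n'.join(public_attrs(Demo())))
```

The same call should work on the objects returned by the IMDb client, for example `public_attrs(movie)`.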

In [None]:
movie.title

In [None]:
print(movie.writers)

In [None]:
print(movie.imdb_id)

In [None]:
html = """
    <div style="background-color:#FFDDDD">
    <h2> Warning! </h2>
    <p> This code below is meant to be an example of what you can do. <br>
        It is not guaranteed to work always, and will need to be tweaked!
    </p>
    </div>
"""
display(HTML(html))

#### Box office numbers

In [None]:
# Let's experiment with Terminator
imdb_id = 'tt0088247'

# Fetch the box office numbers
base = 'https://www.boxofficemojo.com'
url = base + '/title/' + imdb_id
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')

table = soup('th', text=re.compile(r'Release Group'))[0].parent.parent
group = table.findAll('tr', recursive=False)[1].find('a').get('href')
url = base + group

# Get total earnings domestic and international
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
earnings = soup('h2', text=re.compile(r'Rollout'))[0].parent.parent.findAll('div')
domestic = earnings[1].find('span', {'class': 'money'}).get_text()
domestic_url = earnings[1].find('a').get('href')
international = earnings[2].find('span', {'class': 'money'}).get_text()

# Get weekly domestic earnings
url = base + domestic_url
url = url[:url.rfind('/')] + '/weekly/'
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
table = soup.find('div', {'class':'a-section imdb-scroll-table-inner'}).findAll('tr')
weekly = []
for tr in table[1:]:
    date = tr.findAll('td')[0].get_text()
    earning = tr.findAll('td')[2].get_text()
    weekly.append((date, earning))

# Print the values we've just got!
print("Total Domestic Earnings: %s" % domestic)
print("Total International Earnings: %s" % international)
print("Weekly Domestic Earnings:")
for date, earning in weekly:
    print("\t%s \t: %s" % (date, earning))
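The weekly-earnings loop above boils down to reading the cells of each table row and keeping two columns. The sketch below shows that same idea on an inline HTML snippet, using only the standard library's `html.parser` instead of BeautifulSoup. The `TableRows` class and the sample table are assumptions made for illustration; the figures in the table are made up and are not real box office numbers.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up table used only to exercise the parser; these are not real earnings
sample_html = """
<table>
  <tr><th>Date</th><th>Rank</th><th>Weekly</th></tr>
  <tr><td>Week 1</td><td>1</td><td>$100</td></tr>
  <tr><td>Week 2</td><td>2</td><td>$50</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
parser.close()

# Skip the header row and keep the first and third columns (date, earning)
weekly = [(row[0], row[2]) for row in parser.rows[1:]]
for date, earning in weekly:
    print("\t%s \t: %s" % (date, earning))
```

Testing the parsing step on a saved snippet like this makes it easier to notice when the live page layout changes.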

## Other resources!

This assignment doesn't have a restriction on where you can look for data.
Further, we don't mind how you collect the data, or what data you collect.

Here are some additional resources for this example, and you can customize it for your own!
* Arnold Schwarzenegger Kill Count: https://www.youtube.com/watch?v=OE6jpTaOYMU
* Arnold Schwarzenegger Top Quotes: https://www.youtube.com/watch?v=pDxn0Xfqkgw

You could think about the IMDb network as a graph, with different actors connected through the movies they share.
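One possible way to build such a graph is to start from a mapping of movie to cast and count how often each pair of actors appears together. The sketch below uses only the standard library; the `casts` dictionary holds placeholder names, and `costar_edges` is a hypothetical helper, not part of any IMDb library.

```python
from collections import defaultdict
from itertools import combinations

# Placeholder data: movie title -> list of cast members
casts = {
    'Movie 1': ['Actor A', 'Actor B', 'Actor C'],
    'Movie 2': ['Actor A', 'Actor C'],
    'Movie 3': ['Actor B', 'Actor D'],
}


def costar_edges(casts):
    """Return {(actor_1, actor_2): number of shared movies} for every pair."""
    weights = defaultdict(int)
    for cast in casts.values():
        # Sort so each pair is always stored in the same order
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return dict(weights)


edges = costar_edges(casts)
for (a, b), shared in sorted(edges.items()):
    print(f'{a} -- {b}: {shared} shared movie(s)')
```

The resulting edge weights can be handed to whichever graph or charting tool you prefer.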

Some other useful libraries/databases:
* IMDbPY
* http://www.omdbapi.com/

Other chart ideas:
* Words in one movie (word cloud)
* Movies released by date (calendar)
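For the word-cloud idea, the data behind the chart is just a table of word counts. A minimal sketch of that counting step, assuming a plain-text script or list of quotes is already available, could look like this. The stopword list and the sample sentence are placeholders chosen for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'is', 'i', 'in', 'it'}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase words in text, ignoring the given stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# Placeholder sentence, not a real quote
sample = "The robot said it is a robot, and the robot left."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Most word-cloud tools accept a word-to-count mapping like `freqs` directly.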


In [None]:
# Get quotes from https://www.brainyquote.com/authors/robert-downey-jr-quotes
url = 'https://www.brainyquote.com/authors/robert-downey-jr-quotes'
resp = requests.get(url)
if resp.ok:
    text = resp.text
    # Name the parser explicitly, as in the box office cell above
    soup = BeautifulSoup(text, 'lxml')

In [None]:
quoteList = soup.find('div', attrs={'id':'quotesList', 'class':'new-msnry-grid bqcpx'}).find_all('a')

In [None]:
quotes = [i.text for i in quoteList]
quotes = [i for i in quotes if len(i)>0 and i[:1]!='\n']
pd.DataFrame({'quotes': quotes}).to_csv('quotes.csv')

In [None]:
marvel_url = 'https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films'
marvel = requests.get(marvel_url)
if marvel.ok:
    content = marvel.content
    text = marvel.text
    soup = BeautifulSoup(text, 'lxml')

In [None]:
movies = [x.text[:-1] for x in soup.find('table', {'class':'wikitable plainrowheaders'}).find_all('th')]

In [None]:
movies = [x for x in movies[6:] if('Phase' not in x)]

In [None]:
movies[5] = 'The Avengers'
movies

In [None]:
# films
url = 'https://pro.imdb.com/name/nm0000375/?ref_=instant_nm_1&q=robert%20downet'
resp = requests.get(url)
if resp.ok:
    content = resp.content
    text = resp.text
    soup = BeautifulSoup(text, 'lxml')

In [None]:
# films year and gross
films = pd.read_html(content)[5]
# films = films[films.Notes != 'Uncredited']#films.iloc[:, :2]
pattern = r'^.+\((\d{4})\)'
year = r'\(\d{4}\)'

titles = films.iloc[:,0]
title_year = titles.apply(lambda x: re.match(pattern, x)[0])
# title_year
titles = title_year.apply(lambda x: re.split(year, x)[0][:-1])
years = title_year.apply(lambda x: int(re.findall(r'\d{4}', x)[0]))
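The title and year extraction above can also be done in one step with named groups. This is a sketch under the assumption that each cell starts with text of the form `Title (YYYY)`; the real cells scraped from the page may contain extra text, so the pattern may need adjusting. `split_title_year` is a hypothetical helper name.

```python
import re

# Assumed cell format: "<title> (<4-digit year>)" at the start of the string
TITLE_YEAR = re.compile(r'^(?P<title>.+?)\s*\((?P<year>\d{4})\)')


def split_title_year(cell):
    """Return (title, year) for a cell like 'Title (2012)', or None if no match."""
    m = TITLE_YEAR.match(cell)
    if m is None:
        return None
    return m.group('title'), int(m.group('year'))


print(split_title_year('The Avengers (2012)'))
```

Returning `None` for cells that do not match makes bad rows easy to filter out before building the `title` and `year` columns.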

In [None]:
films['title'] = titles
films['year'] = years

In [None]:
films = films[films['Gross (Worldwide)'].notnull()]
films

In [None]:
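def currency(amount):
    """Convert a '$...B' or '$...MM' string to an integer dollar amount."""
    billion = 10**9
    million = 10**6
    amount = amount[1:]  # drop the leading '$'
    if amount.endswith('B'):
        return int(float(amount[:-1]) * billion)
    elif amount.endswith('MM'):
        return int(float(amount[:-2]) * million)
    return None  # unrecognized suffix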
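A slightly more defensive variant of this conversion is sketched below. It assumes the same two suffixes as the function above (`B` for billions, `MM` for millions), uses `Decimal` to avoid floating-point rounding, and returns `None` for anything it does not recognize, including missing values. `parse_gross` is a hypothetical name, and the amounts in the example are illustrative rather than real grosses.

```python
from decimal import Decimal, InvalidOperation

# Suffixes assumed from the conversion function above
SUFFIXES = {'B': 10**9, 'MM': 10**6}


def parse_gross(amount):
    """Convert strings like '$1.2B' or '$585.4MM' to an integer dollar amount."""
    if not isinstance(amount, str):
        return None  # e.g. NaN for a missing value
    text = amount.strip().lstrip('$')
    # Try the longer suffix first
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            try:
                return int(Decimal(text[:-len(suffix)]) * SUFFIXES[suffix])
            except InvalidOperation:
                return None  # the numeric part was not a valid number
    return None


# Illustrative inputs only
for raw in ['$1.2B', '$585.4MM', 'n/a']:
    print(raw, '->', parse_gross(raw))
```

Because it tolerates missing values, a function like this could be applied before filtering out rows with no gross.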

In [None]:
films['gross'] = films['Gross (Worldwide)'].apply(currency)

In [None]:
films.head(15)

In [None]:
films['is_marvel'] = [1 if(x in movies) else 0 for x in films.title.values]

In [None]:
films.head(30)

In [None]:
films[['year', 'title', 'gross','is_marvel']].to_csv('movie_gross.csv')

In [114]:
box_office_person = pd.read_csv('box_office_person.csv').iloc[1:, :2]
box_office_person.head()

Unnamed: 0,Area,Total Gross
1,US & Canada,"$5,893,164,206 (60)"
2,China,"$1,741,738,739 (9)"
3,United Kingdom,"$722,401,438 (28)"
4,South Korea,"$583,913,939 (18)"
5,Brazil,"$448,587,322 (19)"


In [115]:
# Keep the part before ' (', then strip '$' and thousands separators
box_office_person['gross'] = box_office_person['Total Gross'].apply(lambda x: re.sub(r'[$,]', '', x.split(' (')[0]))
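The `Total Gross` strings shown above have the shape `$5,893,164,206 (60)`. A small sketch of a parser for that shape follows; it returns both the dollar amount and the number in parentheses (the source does not say what that number counts). `parse_total_gross` is a hypothetical helper, and it raises an error on unexpected input so format changes are noticed early.

```python
import re

# "$<digits with commas> (<integer>)", as seen in the table above
TOTAL_GROSS = re.compile(r'^\$([\d,]+)\s*\((\d+)\)$')


def parse_total_gross(cell):
    """Return (gross_in_dollars, number_in_parentheses) as integers."""
    m = TOTAL_GROSS.match(cell.strip())
    if m is None:
        raise ValueError(f'unexpected Total Gross format: {cell!r}')
    return int(m.group(1).replace(',', '')), int(m.group(2))


print(parse_total_gross('$5,893,164,206 (60)'))
```

Storing the gross as an integer rather than a string also makes later sorting and summing straightforward.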

In [116]:
box_office_person = box_office_person.rename({'Area': 'Country'}, axis=1)

In [117]:
# box_office_person.iloc[0,0] = 'North America'

In [118]:
box_office_person

Unnamed: 0,Country,Total Gross,gross
1,US & Canada,"$5,893,164,206 (60)",5893164206
2,China,"$1,741,738,739 (9)",1741738739
3,United Kingdom,"$722,401,438 (28)",722401438
4,South Korea,"$583,913,939 (18)",583913939
5,Brazil,"$448,587,322 (19)",448587322
...,...,...,...
82,Jamaica,"$297,261 (1)",297261
83,Ghana,"$126,054 (5)",126054
84,Cyprus,"$125,668 (2)",125668
85,Mongolia,"$100,575 (1)",100575


In [119]:
countries = [c.name for c in list(pycountry.countries)]

In [120]:
# box_office_person = box_office_person.append(pd.DataFrame({'Country': 'Canada', 'Total Gross': '$0', 'gross': 0}, index=[0]))

In [136]:
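def country_code(country):
    """Return the continent code for a country name, or the name itself if the lookup fails."""
    try:
        code = pc.country_name_to_country_alpha2(country, cn_name_format="default")
        return pc.country_alpha2_to_continent_code(code)
    except Exception:
        # Unrecognized names (e.g. combined regions) are fixed by hand below
        return country
box_office_person['continent'] = box_office_person.Country.apply(country_code)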

In [144]:
continent_codes = ['AS', 'EU', 'SA', 'NA', 'OC', 'AF']

In [148]:
box_office_person[box_office_person.continent.apply(lambda x: x not in continent_codes)]
mapping = [
    [1,'NA'],
    [9, 'AS'],
#     [57, 'EU'],
    [62, 'EU'],
    [63, 'SA'],
    [67, 'AF'],
    [70, 'AF'],
    [72, 'EU']
]

In [149]:
for i in mapping:
    row = i[0] - 1
    code = i[1]
    box_office_person.iloc[row, 3] = code
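The manual fixes above are keyed by row position, which breaks if the rows are ever reordered. One alternative, sketched here with plain dictionaries, is to key the overrides by the country label instead. The `continent_for` helper and the toy lookup are assumptions for illustration; in the notebook the lookup would be the `pycountry_convert` calls.

```python
CONTINENT_CODES = {'AS', 'EU', 'SA', 'NA', 'OC', 'AF'}

# Manual overrides keyed by label rather than by row number
OVERRIDES = {'US & Canada': 'NA'}


def continent_for(country, lookup, overrides=OVERRIDES):
    """Return a continent code for country, or None if it cannot be resolved.

    lookup is any callable mapping a country name to a continent code and
    raising KeyError for unknown names.
    """
    if country in overrides:
        return overrides[country]
    try:
        code = lookup(country)
    except KeyError:
        return None
    return code if code in CONTINENT_CODES else None


# Toy lookup standing in for the library call
toy_lookup = {'China': 'AS', 'Brazil': 'SA'}.__getitem__

for name in ['US & Canada', 'China', 'Atlantis']:
    print(name, '->', continent_for(name, toy_lookup))
```

Rows that come back as `None` can then be listed and either added to `OVERRIDES` or dropped.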

In [153]:
box_office_person = box_office_person[box_office_person.continent.apply(lambda x: x in continent_codes)]

In [154]:
box_office_person.to_csv('box_office.csv')

In [159]:
# network map data prep
also_viewed_url = 'https://pro.imdb.com/name/nm0000375/people_also_viewed?ref_=nm_pav_see_more&subpage=boxoffice'
also = requests.get(also_viewed_url)
if also.ok:
    content = also.content
    text = also.text
    soup = BeautifulSoup(text, 'lxml')

In [170]:
soup

(Output truncated. The returned page is titled "IMDbPro Official Site | Start Your Free Trial".)

In [176]:
# driver = webdriver.Safari()

In [261]:
network = pd.read_csv('network.csv', header=None)

In [262]:
network

Unnamed: 0,0,1,2
0,,142,The Avengers
1,Chris Evans,,
2,Actor,,
3,,282,The Avengers
4,,,
...,...,...,...
394,Actor,,
395,,391,Casino Royale
396,,,
397,Daniel Craig,,


In [263]:
pd.notnull(network.iloc[:,0][0])

False

In [264]:
actors = [x for x in network.iloc[:,0].values if(pd.notnull(x))]
actors = [actors[x] for x in range(len(actors)) if(x%2==0)]
# actors

In [265]:
starmeter = [x for x in network.iloc[:,1].values if (pd.notnull(x))]

In [266]:
known_for = [x for x in network.iloc[:,2].values if(pd.notnull(x))]

In [267]:
network = pd.DataFrame({'name': actors, 'starmeter': starmeter, 'known_for': known_for})
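The cells above rebuild one record per actor from three flattened columns, relying on the first column alternating between a name and a role. A minimal sketch of the same step in plain Python, with a length check so a misaligned scrape fails loudly, is shown below. `build_records` is a hypothetical helper; the two sample rows are taken from the output shown in this notebook.

```python
def build_records(col0, starmeter, known_for):
    """Combine flattened columns into one dict per actor.

    col0 is assumed to alternate name, role, name, role, ...
    """
    names = col0[::2]  # keep every other entry, i.e. the names
    if not (len(names) == len(starmeter) == len(known_for)):
        raise ValueError('columns are misaligned: '
                         f'{len(names)} names, {len(starmeter)} starmeter values, '
                         f'{len(known_for)} known_for values')
    return [
        {'name': n, 'starmeter': s, 'known_for': k}
        for n, s, k in zip(names, starmeter, known_for)
    ]


records = build_records(
    ['Chris Evans', 'Actor', 'Chris Hemsworth', 'Actor'],
    [142, 282],
    ['The Avengers', 'The Avengers'],
)
print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame`.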

In [268]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
network = pd.concat([network, pd.DataFrame({'name':'Robert Downey Jr.', 'starmeter':0, 'known_for': 'The Avengers'}, index=[0])]).reset_index(drop=True)

In [352]:
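# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
top = network.iloc[:10, :].copy()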

In [353]:
top.head()

Unnamed: 0,name,starmeter,known_for
0,Chris Evans,142,The Avengers
1,Chris Hemsworth,282,The Avengers
2,Mark Ruffalo,784,The Kids Are All Right
3,Scarlett Johansson,81,Her
4,Tom Holland,162,Spider-Man: Homecoming


In [465]:
top['relationship'] = [x**2 for x in range(len(top),0 ,-1)]



In [466]:
top['x'] = [x*7 for x in range(len(top))]



In [467]:
top['y'] = 0
top['Circley'] = 0



In [491]:
base = pd.DataFrame({
    'name': 'Robert Downey Jr',
    'starmeter': 0,
    'known_for': 'The Avengers',
    'relationship': top['relationship'].values,
    'x': int(np.mean(top['x'].values)),
    'y': 5,
    'Circley': 5
    },
    index = range(len(top))
)
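The cells above place the co-stars along a horizontal line with the central actor above its midpoint. As an alternative that is not used in this notebook, the co-stars could be spread evenly on a circle around the central node. The sketch below computes such coordinates with the standard library; `circle_layout` and its parameters are hypothetical names.

```python
import math


def circle_layout(n, radius=1.0, center=(0.0, 0.0)):
    """Return n (x, y) points evenly spaced on a circle around center."""
    cx, cy = center
    return [
        (cx + radius * math.cos(2 * math.pi * i / n),
         cy + radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


# Ten surrounding nodes, with the central node at the origin
positions = circle_layout(10, radius=5.0)
for x, y in positions:
    print(round(x, 2), round(y, 2))
```

The resulting coordinates could be written to the `x` and `y` columns in place of the linear values.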

In [492]:
network_df = top.copy()#pd.DataFrame(columns=['name', 'starmeter','known_for','x','y','Circley'])

In [493]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
network_df = pd.concat([network_df, base]).reset_index(drop=True)

In [509]:
size = list(range(len(top), 0, -1))
# The appended central-node rows all get size 1
size.extend([1] * len(top))

In [510]:
size

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [511]:
network_df['Marvel'] = 1

In [512]:
network_df['size'] = size

In [513]:
network_df.to_csv('also.csv')