### IMDB
#### In this assignment, you will work on movie data from IMDB.
- The data includes movies and ratings from the IMDB website
- Data File(s): imdb.xlsx

#### Data file contains 3 sheets:
- “imdb”: contains records of movies and ratings scraped from the IMDB website
- “countries”: contains the country (of origin) names
- “directors”: contains the director names

In [1]:
""" Q1: 
Load and read the 'imdb.xlsx' file. Read the 'imdb' sheet into a DataFrame, df.
"""

import pandas as pd

# your code here
xls = pd.ExcelFile('imdb.xlsx')
df = xls.parse('imdb')

In [2]:
""" Q2: 
Store the dimensions of the DataFrame in a variable called 'shape' and print it.
"""

# your code here
shape = df.shape
print(shape)

In [3]:
""" Q3: 
Store the column titles and the types of data in variables named 'columns' and 'dtypes', then print them.
"""

# your code here
columns = df.columns
dtypes = df.dtypes
print(columns)
print(dtypes)

In [4]:
""" Q4: 
Examine the first 10 rows of data; store them in a variable called first10
"""

# your code here
first10=df.head(10)

In [5]:
""" Q5: 
Examine the first 5 rows of data; store them in a variable called first5
"""

# your code here
first5=df.head()

In [6]:
""" Q6: 
Import the "directors" and "countries" sheets into their own DataFrames, df_directors and df_countries.
"""

# your code here
df_directors = xls.parse('directors')
df_countries = xls.parse('countries')

In [7]:
""" Q7: 
Check the "directors" sheet
1. Count how many records there are based on the "id" column. (To get the number of records per "id", 
   use the value_counts method.) Store the result in a variable named count.
2. Remove the duplicates from the directors dataframe and store the result in a variable called df_directors_clean.
"""

# your code here
count=df_directors['id'].value_counts()
df_directors_clean=df_directors.drop_duplicates()
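
As a minimal sketch of what the two calls in Q7 do, the toy frame below (hypothetical data, not the contents of the "directors" sheet) contains one fully duplicated row. `value_counts` reports how often each id occurs, and `drop_duplicates` keeps the first copy of each identical row.

```python
import pandas as pd

# Hypothetical toy data standing in for the "directors" sheet.
toy_directors = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "director_name": ["A", "B", "B", "C"],
})

# One entry per distinct id, sorted from most to least frequent.
toy_count = toy_directors["id"].value_counts()

# Keeps the first occurrence of each fully identical row.
toy_clean = toy_directors.drop_duplicates()

print(toy_count)
print(toy_clean.shape)
```

Any id whose count is greater than 1 in `toy_count` marks a record that appears more than once.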

In [8]:
""" Q8: 
Join three DataFrames: df, df_directors, and df_countries with an inner join.
Store the joined DataFrames in df.
"""

# your code here
temp=pd.merge(df,df_countries,left_on='country_id',right_on='id',how='inner')
df=pd.merge(temp,df_directors,left_on='director_id',right_on='id',how='inner')

# After the join, the resulting DataFrame should have 12 columns.
df.shape

(178, 12)
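
The sketch below illustrates, on hypothetical toy tables, how duplicate keys on the right side of an inner join multiply rows. The column names mirror those used in Q8, but the data is invented. Whether this affects the real join depends on the contents of the sheets.

```python
import pandas as pd

# Hypothetical toy tables; only the column names come from the assignment.
toy_movies = pd.DataFrame({
    "movie_title": ["M1", "M2", "M3"],
    "director_id": [1, 2, 2],
})
toy_directors = pd.DataFrame({
    "id": [1, 2, 2],                 # id 2 appears twice
    "director_name": ["A", "B", "B"],
})

# Joining against the raw table repeats every movie whose director id is duplicated.
raw_join = pd.merge(toy_movies, toy_directors,
                    left_on="director_id", right_on="id", how="inner")

# Joining against the de-duplicated table keeps one row per movie.
clean_join = pd.merge(toy_movies, toy_directors.drop_duplicates(),
                      left_on="director_id", right_on="id", how="inner")

print(len(raw_join), len(clean_join))
```

Comparing the row count before and after a join is a quick way to check that no rows were unintentionally duplicated or dropped.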

In [9]:
""" Q9: 
Save the first ten rows of movie titles in a variable called first10, then print it
"""

# your code here
first10=df['movie_title'][:10]
print(first10)

0    The Shawshank RedemptionÊ
1              The Green MileÊ
2               The GodfatherÊ
3      The Godfather: Part IIÊ
4              Apocalypse NowÊ
5             The Dark KnightÊ
6                   InceptionÊ
7                InterstellarÊ
8                     MementoÊ
9                The PrestigeÊ
Name: movie_title, dtype: object


In [10]:
""" Q10: 
There's an extra character at the end of each movie title. 
Remove it from the data using str.replace.
And print the first ten rows of movie titles again. 
"""

# your code here
last_char = df['movie_title'].iloc[0][-1]  # the stray trailing character
df['movie_title'] = df['movie_title'].str.replace(last_char, '', regex=False)
print(df['movie_title'][:10])

0    The Shawshank Redemption
1              The Green Mile
2               The Godfather
3      The Godfather: Part II
4              Apocalypse Now
5             The Dark Knight
6                   Inception
7                Interstellar
8                     Memento
9                The Prestige
Name: movie_title, dtype: object
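
A minimal sketch of two ways to drop a trailing character from every string in a Series. The toy titles and the choice of a non-breaking space as the stray character are assumptions for illustration; the character in the real file may differ.

```python
import pandas as pd

# Hypothetical titles with a trailing non-breaking space (assumed stray character).
toy_titles = pd.Series(["Inception\xa0", "Memento\xa0"])

# Literal replacement: removes the character wherever it occurs in the string.
replaced = toy_titles.str.replace("\xa0", "", regex=False)

# Slicing: drops only the final character of each string.
sliced = toy_titles.str[:-1]

print(replaced.tolist())
print(sliced.tolist())
```

Passing `regex=False` treats the pattern as a literal string, which avoids surprises if the stray character happens to be a regular-expression metacharacter.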


In [11]:
""" Q11:
Who is the director with the most movies? First get the number of movies per "director_name", then save the director's name
and count as a series of length 1 called "director_with_most"
"""

# your code here
director_with_most=df['director_name'].value_counts()[:1]

In [12]:
"""Q12:
Save all of this director's movies and their ratings in a variable called all_movies_ratings, then print this variable.
(The director with the most movies you got from the last question.)
"""

# your code here
all_movies_ratings = df.loc[df['director_name'] == director_with_most.index[0], ['movie_title', 'imdb_score']]
print(all_movies_ratings)

In [13]:
"""Q13:
Recommend a **random** movie that has a rating of over 8.3. 
What is the title and imdb_score of your recommendation?

Name your variables as follows:
-----------------------------------------------------------------------------
  goodmovie       <- Those movies with a rating over 8.3
  rand_int        <- The random integer index location of your recommendation
  rand_goodmovie  <- The random recommendation
"""

import random
random.seed(0)

# your code here
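goodmovie = df.loc[df['imdb_score'] > 8.3]           # movies with a rating over 8.3
rand_int = random.randint(0, len(goodmovie) - 1)     # random positional index into goodmovie
rand_goodmovie = goodmovie.iloc[rand_int:rand_int + 1]
print(rand_goodmovie[['movie_title', 'imdb_score']])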
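
As an alternative sketch, pandas can draw the random row directly with `DataFrame.sample`. The toy data below is hypothetical, and this variant does not produce the `rand_int` variable that Q13 asks for, so it is shown only for comparison.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the assignment.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "imdb_score": [8.1, 8.4, 8.9, 8.3],
})

# Keep only rows strictly above the threshold.
toy_good = toy[toy["imdb_score"] > 8.3]

# Draw one row at random; random_state makes the draw repeatable.
toy_pick = toy_good.sample(n=1, random_state=0)

print(toy_pick)
```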

In [14]:
""" Q14: 
Get the summary statistics for imdb_score and gross, then use the describe() function to summarize this visually. Save the
result in a variable called score_gross_description and print it.
"""

# your code here
score_gross_description = df[['imdb_score', 'gross']].describe()
score_gross_description

       imdb_score  gross
count  178.0       178.0
mean   8.294382    103040200.0
std    0.26696     124254900.0
min    8.0         8060.0
25%    8.1         13185100.0
50%    8.2         51943710.0
75%    8.475       152243600.0
max    9.3         623279500.0


In [15]:
"""Q15:
What is the average rating of director Christopher Nolan's movies? Save this value in a variable called nolan_mean and
print it.
"""

# your code here
nolan_mean=df.loc[df['director_name']=='Christopher Nolan']['imdb_score'].mean()
nolan_mean

8.6

In [16]:
"""Q16: 
Create a series called 'directors' that contains each director's name and his or her average rating.
Use your new series to find the average rating for Steven Spielberg.
Print out the type of your variable, then the contents
"""

# your code here
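directors = df.groupby('director_name')['imdb_score'].mean()
spielberg_mean = directors['Steven Spielberg']  # variable name chosen here for illustration
print(type(directors))
print(directors)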
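
A minimal sketch of the group-then-average pattern on hypothetical toy data: selecting a single column after `groupby` and calling `mean` yields a Series indexed by the group key, so an individual director can be looked up by name.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the assignment.
toy = pd.DataFrame({
    "director_name": ["X", "X", "Y"],
    "imdb_score": [8.0, 9.0, 8.5],
})

# One average per director, indexed by director_name.
toy_means = toy.groupby("director_name")["imdb_score"].mean()

print(type(toy_means))
print(toy_means["X"])
```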

In [17]:
"""Q17:
Select the non-USA (country_id=1) movies or movies made before 1960 by Hayao Miyazaki (director_id=46).
What are the years returned? Save them in a series called 'miyazaki', then print it
"""

# your code here
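non_usa = df['country'] != 'USA'
before_1960 = df['title_year'] < 1960
by_miyazaki = df['director_name'] == 'Hayao Miyazaki'
# (non-USA or pre-1960) movies directed by Hayao Miyazaki; keep only the years
miyazaki = df.loc[(non_usa | before_1960) & by_miyazaki, 'title_year']
print(miyazaki)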
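
The parentheses around the "or" part of the Q17 condition matter, because `&` binds more tightly than `|` in Python. The sketch below uses hypothetical toy data to show that the grouped and ungrouped expressions can select different rows.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the assignment.
toy = pd.DataFrame({
    "country": ["USA", "Japan", "USA", "Japan"],
    "title_year": [1950, 2001, 1999, 2005],
    "director_name": ["Other", "Hayao Miyazaki", "Hayao Miyazaki", "Other"],
})

non_usa = toy["country"] != "USA"
before_1960 = toy["title_year"] < 1960
by_miyazaki = toy["director_name"] == "Hayao Miyazaki"

# Explicit grouping: (non-USA or pre-1960) and directed by Miyazaki.
grouped = (non_usa | before_1960) & by_miyazaki

# Without parentheses this parses as non_usa | (before_1960 & by_miyazaki).
ungrouped = non_usa | before_1960 & by_miyazaki

print(grouped.sum(), ungrouped.sum())
```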

In [18]:
"""Q18: 
Create a Pivot Table that shows the median rating for each director, grouped by their respective countries. Name your variable
'pivot_agg'
"""

# your code here
pivot_agg=pd.pivot_table(df,index=['country','director_name'],values=['imdb_score'],aggfunc={'imdb_score':'median'})
pivot_agg

                                 imdb_score
country   director_name
Argentina Juan Jose Campanella         8.20
Australia George Miller                8.10
Brazil    Fernando Meirelles           8.70
Brazil    Jose Padilha                 8.10
Canada    Denis Villeneuve             8.20
...       ...                           ...
USA       Tony Scott                   8.00
USA       Victor Fleming               8.15
USA       Wes Anderson                 8.10
USA       Woody Allen                  8.10

In [19]:
"""Q19:
ARE YOU NOT ENTERTAINED? How long did the Gladiator aim to keep your attention? Save the series with this information 
in a variable called gladiator_duration, then print it
"""

# your code here
gladiator_duration = df.loc[df['movie_title'] == 'Gladiator', 'duration']
print(gladiator_duration)