# Part I - (Dataset Exploration Title)
## by (your name here)

## Introduction
> Introduce the dataset

>**Rubric Tip**: Your code should not generate any errors, and should use functions, loops where possible to reduce repetitive code. Prefer to use functions to reuse code statements.

> **Rubric Tip**: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.

>**Rubric Tip**: Markup cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.  



## Preliminary Wrangling


In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import ast
import json

%matplotlib inline

> Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.


In [2]:
new_movie_credits = pd.read_csv('tmdb_5000_credits.csv')
new_movie_summary = pd.read_csv('tmdb_5000_movies.csv')



In [3]:
# merge tables on key 'id'
movie_df = pd.merge(new_movie_summary, new_movie_credits, left_on = 'id', right_on = 'movie_id')
movie_df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,spoken_languages,status,tagline,title_x,vote_average,vote_count,movie_id,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
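
Both files are expected to hold one row per film, so the merge should be one-to-one, and the shared `title` column shows up twice as `title_x` and `title_y`. The following is a minimal sketch on two made-up frames (the names `summary`, `credits`, and `merged` and all values are illustrative assumptions, not the real CSVs) of how the merge could be validated and the duplicate columns tidied.

```python
import pandas as pd

# Toy stand-ins for the two TMDB tables (values are made up for illustration).
summary = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "crew": ["[]", "[]"]})

# validate="one_to_one" raises pandas.errors.MergeError if either key repeats.
merged = pd.merge(summary, credits, left_on="id", right_on="movie_id",
                  validate="one_to_one")

# Both frames have a 'title' column, so pandas suffixes them as title_x / title_y.
# Keep one copy and drop the redundant key column.
merged = (merged.drop(columns=["title_y", "movie_id"])
                .rename(columns={"title_x": "title"}))
```

Whether `title_x` and `title_y` always agree in the real data is something to check before dropping one of them.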


In [None]:
def extract_first_by_job(obj, job):
    """Return a one-element list with the first crew member holding `job`.

    `obj` is the raw string from the `crew` column. The result is an empty
    list when nobody in the crew has that job.
    """
    for member in ast.literal_eval(obj):
        if member['job'] == job:
            return [member['name']]
    return []

# Thin wrappers, one per crew role, so each can be passed to Series.apply.
def extract_director(obj):
    return extract_first_by_job(obj, "Director")

def extract_Executive_Producer(obj):
    return extract_first_by_job(obj, "Executive Producer")

def extract_Producer(obj):
    return extract_first_by_job(obj, "Producer")

def extract_Writer(obj):
    return extract_first_by_job(obj, "Writer")

def extract_Screenplay(obj):
    return extract_first_by_job(obj, "Screenplay")

In [None]:
movie_df['Director'] = movie_df.crew.apply(extract_director)
movie_df['Executive Producer'] = movie_df.crew.apply(extract_Executive_Producer)
movie_df['Producer'] = movie_df.crew.apply(extract_Producer)
movie_df['Writer'] = movie_df.crew.apply(extract_Writer)
movie_df['Screenplay'] = movie_df.crew.apply(extract_Screenplay)
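
Each extractor returns a list holding at most one name, so the new columns contain lists rather than plain strings. If a scalar is easier to group or plot by, one option is the `.str[0]` accessor. This is a sketch on a toy Series, and `directors` and `director_name` are hypothetical names.

```python
import pandas as pd

# Toy column shaped like the extractor output:
# a one-element list when the job was found, an empty list otherwise.
directors = pd.Series([["James Cameron"], [], ["Gore Verbinski"]])

# .str[0] takes the first element of each list and yields NaN for empty lists.
director_name = directors.str[0]
```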

In [5]:
# Parse the columns that hold JSON strings into Python lists.
# Each record has an id and a name; keep only the name because
# the id is not needed for this analysis.
# 'crew' is deliberately left out: it is handled separately above
# because the job field is needed in addition to the name.
json_columns = ['cast', 'genres', 'keywords', 'production_countries',
                'production_companies', 'spoken_languages']

for c in json_columns:
    movie_df[c] = movie_df[c].apply(json.loads)
    movie_df[c] = movie_df[c].apply(lambda row: [x["name"] for x in row])
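
One practical caveat with the conversion above: `json.loads` raises a `TypeError` if the cell is run a second time, because the values are then already lists. A possible guard is sketched below on a toy frame. The helper name `names_from_json` and the sample data are assumptions for illustration.

```python
import json
import pandas as pd

def names_from_json(value):
    """Return the 'name' of every record in a JSON-encoded list.

    Accepts either the raw JSON string or an already-converted list of
    names, so re-running the cell leaves converted values untouched.
    """
    if isinstance(value, list):
        return value  # already converted on an earlier run
    return [record["name"] for record in json.loads(value)]

# Toy column in the same shape as the 'genres' column.
toy = pd.DataFrame({
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
               '[]'],
})
toy["genres"] = toy["genres"].apply(names_from_json)
toy["genres"] = toy["genres"].apply(names_from_json)  # second run is a no-op
```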

### What is the objective of my project?  What questions am I trying to answer with my dataset?

> Using data that I downloaded from Kaggle, I want to learn which characteristics are associated with a high-grossing movie.

> My project plan is as follows:

> <b>Data Cleaning</b>
> <li> Use the gathering, assessing and cleaning framework to first clean the data.  I will use some exploratory analysis as well to identify potential issues in the data.</li>

> <b> Data Exploration </b>
> <li> Review the cleaned data and look for insights that would help me answer my questions above.  Here I will use a mix of univariate, bivariate and multivariate analysis to look for patterns in the data.</li>

> <b> Data Explanation </b>
> <li> In Part II of this project, I will present my analysis and findings in a slide show for the reader </li>


### What is the structure of your dataset?

> Let us look at the setup of the two tables in the dataset. The first table gives various details for each movie. The second table gives further detail on the cast and crew.


> To understand the structure, I will look at the shape, the first few rows, whether there are any null values (acceptable as long as they are not in a field I am interested in), and whether the column types are what I expect.

> From the output below, the movie summary dataset has various details for each movie. It comprises 4803 rows and 20 columns. The columns that contain nulls are acceptable because I do not need them. The key columns that I am interested in appear to have the correct type. However, one issue is that genres, keywords, production companies, production countries, and spoken languages are all stored as JSON strings. I would like to split these out.

In [None]:
new_movie_summary.head(2)

In [None]:
new_movie_summary.isna().sum()

In [None]:
new_movie_summary.info()

In [None]:
new_movie_summary.describe()

> Now let me look at the movie credits dataset to understand its structure. Again, the main issue I see is that the cast and crew columns are stored as JSON strings. I would like to split this data out so that I can do deeper analysis later.

In [None]:
new_movie_credits.head(2)

In [None]:
new_movie_credits.shape

In [None]:
new_movie_credits.isna().sum()

In [None]:
new_movie_credits.info()

In [None]:
new_movie_credits.describe()
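
The same checks are repeated for both tables, so a small helper could collect them in one place. This is a minimal sketch on a toy frame. `summarize_structure` is a hypothetical name, and the columns it reports are one possible choice.

```python
import pandas as pd

def summarize_structure(df):
    """Return one row per column with its dtype, null count, and null share."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isna().sum(),
        "pct_null": (df.isna().mean() * 100).round(1),
    })

# Toy frame standing in for one of the two tables.
toy = pd.DataFrame({
    "budget": [237000000, 300000000],
    "homepage": ["http://www.avatarmovie.com/", None],
})
overview = summarize_structure(toy)
```

On the real data, the helper could be called once for `new_movie_summary` and once for `new_movie_credits`, alongside `.shape` and `.head(2)`.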


### What is/are the main feature(s) of interest in your dataset?

> The main features that would help me understand what drives the highest-grossing movies are: revenue, budget, genre, popularity, vote average, production companies, and the countries where films were made.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> The main columns of interest that would help me understand highest grossing movies are:

Movie Summary Dataset

<li>Budget</li>
<li>Genres</li>
<li>Keywords</li>
<li>Popularity</li>
<li>Production Companies</li>
<li>Production Countries</li>
<li>Revenue</li>
<li>Vote Average</li>
<li>Vote Count</li>

Movie Credits Dataset
<li>Actor</li>
<li>Director</li>


> Before moving further, I would like to split out certain columns into their own tables so that I can analyse them later. These are:

Movie Summary Dataset
<li>Genres</li>
<li>Keywords</li>
<li>Production Companies</li>
<li>Production Countries</li>

Movie Credits Dataset
<li>Cast</li>
<li>Crew</li>


In [None]:
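keywords_df = new_movie_summary[['id', 'keywords']]

In [None]:
keywords_df.head()

In [None]:
keywords_df.keywords[0]


In [None]:
# The column already holds valid, double-quoted JSON, so it can be parsed directly.
# Swapping ' for " would break parsing for any keyword that contains an apostrophe.
var = json.loads(keywords_df.keywords[0])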

In [None]:
pd.DataFrame(var)
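
The single-row experiment above could be extended to the whole column, giving a long table with one row per movie and record. This is a sketch under assumptions: `explode_json_column` is a hypothetical helper, the toy rows are made up, and the column is assumed to hold JSON strings as in `new_movie_summary`.

```python
import json
import pandas as pd

def explode_json_column(df, id_col, json_col):
    """Return a long table with one row per (id, JSON record) pair."""
    parsed = df[[id_col, json_col]].copy()
    parsed[json_col] = parsed[json_col].apply(json.loads)
    # explode() gives one row per list element; empty lists become NaN rows.
    long = parsed.explode(json_col).dropna(subset=[json_col])
    # Turn each record's keys into columns, prefixed with the source column name.
    records = pd.json_normalize(long[json_col].tolist()).add_prefix(f"{json_col}_")
    records.insert(0, id_col, long[id_col].to_numpy())
    return records

# Toy rows in the same shape as the 'keywords' column (values are illustrative).
toy = pd.DataFrame({
    "id": [19995, 285, 1],
    "keywords": ['[{"id": 1463, "name": "culture clash"}]',
                 '[{"id": 270, "name": "ocean"}, {"id": 2, "name": "example"}]',
                 '[]'],
})
keyword_table = explode_json_column(toy, "id", "keywords")
```

The same call could be repeated for genres, production companies, and production countries, with the movie `id` kept as the key for joining back later.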

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.


> **Rubric Tip**: The project (Part I alone) should have at least 15 visualizations distributed over univariate, bivariate, and multivariate plots to explore many relationships in the data set.  Use reasoning to justify the flow of the exploration.



>**Rubric Tip**: Use the "Question-Visualization-Observations" framework  throughout the exploration. This framework involves **asking a question from the data, creating a visualization to find answers, and then recording observations after each visualisation.** 




>**Rubric Tip**: Visualizations should depict the data appropriately so that the plots are easily interpretable. You should choose an appropriate plot type, data encodings, and formatting as needed. The formatting may include setting/adding the title, labels, legend, and comments. Also, do not overplot or incorrectly plot ordinal data.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
>You can write a summary of the main findings and reflect on the steps taken during the data exploration.



> Remove all Tips mentioned above, before you convert this notebook to PDF/HTML


> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML or PDF` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!

