## Fundamentals of Visualizations Final Project ##


### Imports ###

The libraries we will use for this project include:

Pandas - This package will clean, wrangle, and subset our data for our visualizations

Altair - This package will create the visualizations for the project

In [2]:
# Importing Libraries
import pandas as pd
import altair as alt

### Data source ###
The data is sourced from the vega-datasets repository. The movies dataset contains 3,201 movies and was collected in 2010. It combines data from multiple sources, including Rotten Tomatoes, The Numbers, and IMDB.

In [3]:
# Loading data into Jupyter Notebook
movies = pd.read_json("https://cdn.jsdelivr.net/npm/vega-datasets@2.5.2/data/movies.json")
movies.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,US DVD Sales,Production Budget,Release Date,MPAA Rating,Running Time min,Distributor,Source,Major Genre,Creative Type,Director,Rotten Tomatoes Rating,IMDB Rating,IMDB Votes
0,The Land Girls,146083.0,146083.0,,8000000.0,Jun 12 1998,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,Aug 07 1998,R,,Strand,,Drama,,,,6.9,207.0
2,I Married a Strange Person,203134.0,203134.0,,250000.0,Aug 28 1998,,,Lionsgate,,Comedy,,,,6.8,865.0
3,Let's Talk About Sex,373615.0,373615.0,,300000.0,Sep 11 1998,,,Fine Line,,Comedy,,,13.0,,
4,Slam,1009819.0,1087521.0,,1000000.0,Oct 09 1998,R,,Trimark,Original Screenplay,Drama,Contemporary Fiction,,62.0,3.4,165.0


### Data Filtering ###
Limit the columns to the most relevant information that can be graphed, keeping a handful of columns of each data type so they can be interpreted further for deeper insights. Any row containing a NaN value is removed, as are movies with an MPAA Rating of Not Rated.

In [4]:
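# Filtering/Wrangling Dataframe
# Keep only the columns used in the visualizations
movies_df = movies[['Title', 'Worldwide Gross', 'Production Budget', 'MPAA Rating', 'Running Time min',
                   'Distributor', 'Major Genre', 'Creative Type', 'IMDB Rating', 'IMDB Votes']]
# Drop every row that has a missing value in any of the selected columns
movies_df = movies_df.dropna()
# Exclude movies without an MPAA rating; .copy() avoids a SettingWithCopyWarning below
movies_df = movies_df.loc[movies_df['MPAA Rating'] != 'Not Rated'].copy()
# Make sure every title is stored as a string
movies_df['Title'] = movies_df['Title'].astype(str)
movies_df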

Unnamed: 0,Title,Worldwide Gross,Production Budget,MPAA Rating,Running Time min,Distributor,Major Genre,Creative Type,IMDB Rating,IMDB Votes
134,Broken Arrow,148345997.0,65000000.0,R,108.0,20th Century Fox,Action,Contemporary Fiction,5.8,33584.0
138,Brazil,9929135.0,15000000.0,R,136.0,Universal,Black Comedy,Fantasy,8.0,76635.0
164,The Cable Guy,102825796.0,47000000.0,PG-13,95.0,Sony Pictures,Comedy,Contemporary Fiction,5.8,51109.0
168,Chain Reaction,60209334.0,55000000.0,PG-13,106.0,20th Century Fox,Action,Contemporary Fiction,5.2,15817.0
218,City Hall,20278055.0,40000000.0,R,111.0,Sony Pictures,Drama,Contemporary Fiction,6.1,9908.0
...,...,...,...,...,...,...,...,...,...,...
3194,Zoolander,60780981.0,28000000.0,PG-13,89.0,Paramount Pictures,Comedy,Contemporary Fiction,6.4,69296.0
3195,Zombieland,98690286.0,23600000.0,R,87.0,Sony Pictures,Comedy,Fantasy,7.8,81629.0
3196,Zack and Miri Make a Porno,36851125.0,24000000.0,R,101.0,Weinstein Co.,Comedy,Contemporary Fiction,7.0,55687.0
3199,The Legend of Zorro,141475336.0,80000000.0,PG,129.0,Sony Pictures,Adventure,Historical Fiction,5.7,21161.0
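
Because `dropna()` removes a row when any selected column is missing, it can help to check which columns account for most of the loss before filtering. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are hypothetical and only stand in for the selected movie columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the selected movie columns.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "MPAA Rating": ["R", None, "PG", "Not Rated"],
    "Running Time min": [108.0, None, None, 95.0],
    "IMDB Rating": [5.8, 6.9, None, 7.1],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Apply the same two filters as the notebook: drop NaN rows, then 'Not Rated'.
kept = toy.dropna()
kept = kept.loc[kept["MPAA Rating"] != "Not Rated"]

print(missing_per_column)
print(f"{len(toy) - len(kept)} of {len(toy)} rows removed")
```

Running the same two checks on `movies` would show how many of the original rows each column costs.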


### Description of the dataset ###

The filtered data has 1,128 rows and 10 columns. From our columns, we have:

Title - Title of the movie (Categorical)

Worldwide Gross - Worldwide Box Office earnings (Continuous)

Production Budget - Cost of producing the movie (Continuous)

MPAA Rating - Rating of the movie on the MPAA scale, listed here from least to most restrictive (Ordinal). Movies marked Not Rated were removed during filtering:

* G – General Audiences
* PG – Parental Guidance Suggested
* PG-13 – Parents Strongly Cautioned
* R – Restricted
* NC-17 – Adults Only

Running Time min - Movie's duration in minutes (Continuous)

Distributor - The company that distributed the movie (Categorical)

Major Genre - Main genre of movie from the following genres (Categorical):

* Action
* Black Comedy
* Comedy
* Drama
* Adventure
* Romantic Comedy
* Horror
* Thriller/Suspense
* Documentary
* Musical
* Western
* Concert/Performance

Creative Type - Creative level based on the following creative types (Categorical):

* Contemporary Fiction
* Fantasy
* Historical Fiction
* Science Fiction
* Dramatization
* Kids Fiction
* Factual
* Super Hero

IMDB Rating - Average user rating on a scale of 1 to 10 (Continuous)

IMDB Votes - Number of votes from reviewers on IMDB (Continuous)
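
Since MPAA Rating is ordinal, one way to make its order explicit in pandas is an ordered categorical. This is a sketch only: the order G < PG < PG-13 < R < NC-17 is taken from the list above, and the `toy` frame and `RATING_ORDER` name are hypothetical.

```python
import pandas as pd

# Rating order from least to most restrictive, as listed above.
RATING_ORDER = ["G", "PG", "PG-13", "R", "NC-17"]

# Hypothetical sample of ratings, not taken from the dataset.
toy = pd.DataFrame({"MPAA Rating": ["R", "G", "PG-13", "PG", "R"]})

# Convert to an ordered categorical so sorting and comparisons follow the scale.
toy["MPAA Rating"] = pd.Categorical(
    toy["MPAA Rating"], categories=RATING_ORDER, ordered=True
)

sorted_ratings = toy.sort_values("MPAA Rating")["MPAA Rating"].tolist()
print(sorted_ratings)
print(toy["MPAA Rating"].max())
```

The same list could also be passed as the `sort` argument of an Altair encoding channel if a chart should follow the ordinal order rather than alphabetical order.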

### Goal 1 ###

Given a scatter plot of IMDB Rating against Running Time, describe any trends among Creative Types when comparing IMDB Rating and Running Time, and identify the IMDB Rating bin with the highest count when filtered by Creative Type.

**Why is a task pursued? (goal)**
I want to **explore** whether there are any patterns within Creative Types based on each movie's average IMDB Rating and Running Time, and **identify** the number of movies in each IMDB Rating bin by Creative Type.

**How is a task conducted? (means)**
The task will be conducted using a scatter plot of IMDB Rating vs Running Time that is **connected** to a histogram showing the count of movies within each IMDB Rating bin.

**What does a task seek to learn about the data? (characteristics)**
This task will **indicate** trends of IMDB Rating vs Running Time for each Creative Type and the IMDB Rating bins with the highest counts for each Creative Type.

**Where does the task operate? (target data)**
The task operates by **comparing** the trends formed by each Creative Type across IMDB Rating and Running Time, and by **comparing** the IMDB Rating bins within each Creative Type.

**When is the task performed? (workflow)**
The task is performed when the user clicks on one of the scatter points, which highlights that point's Creative Type and shows both the trend for that Creative Type and the distribution of its movie counts.

**Who is executing the task? (roles)**
The task will be executed by the user. The user will need to use the graph to analyze whether there is an upward, downward, or neutral trend for each Creative Type and where the majority of the movies' IMDB Ratings are binned. This lets them identify whether there is a correlation between a movie's rating and its duration, and which IMDB Rating bins have the highest counts for each Creative Type.

In [51]:
# Linked views
# Creating a selection: 
selection = alt.selection(type="multi", fields=["Creative Type"])

# Create a container for our two different views
base = alt.Chart(movies_df).properties(width=250, height=250)

# Create our scatterplot
scatterplot = base.mark_circle().encode(
    x = 'IMDB Rating',
    y = 'Running Time min',
    color = alt.condition(selection, "Creative Type", alt.value('lightgray'))
).add_selection(selection)

# Create a histogram
hist = base.mark_bar().encode(
    x = alt.X("IMDB Rating", bin=alt.Bin(maxbins=20)), 
    y = "count()"
).transform_filter(selection)

# Connect our charts using the pipe operation
scatterplot | hist
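
The "bin with the highest count per Creative Type" that users read off the histogram can also be checked numerically. The sketch below uses a hypothetical `toy` frame and fixed one-point bins. Those bin edges are an assumption and need not match the ones Altair picks for `maxbins=20`.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "Creative Type": ["Fantasy", "Fantasy", "Fantasy", "Factual", "Factual"],
    "IMDB Rating": [6.2, 6.8, 7.5, 7.1, 7.9],
})

# Left-closed bins one rating point wide: [1, 2), [2, 3), ..., [10, 11).
bins = list(range(1, 12))
toy["Rating Bin"] = pd.cut(toy["IMDB Rating"], bins=bins, right=False)

# Count movies per (Creative Type, bin), keeping only bins that occur.
counts = (
    toy.groupby(["Creative Type", "Rating Bin"], observed=True)
    .size()
    .reset_index(name="count")
)

# For each Creative Type, keep the row of its fullest bin.
top_bins = counts.loc[counts.groupby("Creative Type")["count"].idxmax()]
print(top_bins)
```

Applied to `movies_df`, this would give a table to compare against what users report from the linked histogram.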

### Goal 2 ###

Identify the patterns presented by the scatter plot of Production Budget vs IMDB Rating for each Major Genre.

**Why is a task pursued? (goal)**
The task is to **confirm** whether a higher Production Budget affects IMDB Rating within each Major Genre, and to identify whether genres are clustered around certain regions.

**How is a task conducted? (means)**
The task will be conducted using a scatter plot of Production Budget vs IMDB Rating that applies a **filtering** interaction when the user selects a Major Genre.

**What does a task seek to learn about the data? (characteristics)**
This task will **indicate** trends of Production Budget vs IMDB Rating for each Major Genre and show how each Major Genre clusters, to identify whether genre impacts Production Budget or IMDB Rating.

**Where does the task operate? (target data)**
The task operates by **comparing** each scatter point's IMDB Rating to its Production Budget after filtering by Major Genre, which showcases any upward, downward, or neutral trends.

**When is the task performed? (workflow)**
The task is performed when the user clicks on one of the Major Genres in the legend, which highlights the movies of the selected Major Genre.

**Who is executing the task? (roles)**
The task will be executed by the user. The user will need to use the graph to look for patterns by selecting each of the Major Genre categories, thus uncovering any trends between Production Budget and IMDB Rating when filtered by Major Genre.

In [73]:
# Bind our selection to the legend
selection = alt.selection(type='multi', fields=['Major Genre'], bind='legend')

alt.Chart(movies_df).mark_circle().encode(
    x = "Production Budget",
    y = "IMDB Rating",
    #color = alt.condition(selection, 'Major Genre', alt.value('lightgray')),
    color=alt.Color('Major Genre', scale=alt.Scale(scheme='dark2')),
    tooltip=["Title", "IMDB Rating"],
    opacity=alt.condition(selection,alt.value(1),alt.value(.05))
).add_selection(selection)
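
The upward, downward, or neutral trends that users judge by eye can be cross-checked with a per-genre correlation. This is a minimal sketch on made-up numbers: the `toy` frame and the helper `budget_rating_corr` are hypothetical, and Pearson correlation is just one possible summary.

```python
import pandas as pd

# Hypothetical toy data; budgets and ratings are illustrative only.
toy = pd.DataFrame({
    "Major Genre": ["Action"] * 3 + ["Drama"] * 3,
    "Production Budget": [10e6, 50e6, 90e6, 5e6, 20e6, 40e6],
    "IMDB Rating": [5.0, 6.0, 7.0, 8.0, 7.0, 6.5],
})


def budget_rating_corr(group: pd.DataFrame) -> float:
    """Pearson correlation between budget and rating within one genre."""
    return group["Production Budget"].corr(group["IMDB Rating"])


# One correlation value per genre; positive means rating rises with budget.
corr_by_genre = pd.Series(
    {genre: budget_rating_corr(group) for genre, group in toy.groupby("Major Genre")}
)
print(corr_by_genre)
```

Genres with few movies will give unstable values, so the numbers are best read alongside the scatter plot rather than instead of it.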

### Goal 3 ###

Identify the Worldwide Gross for each MPAA Rating, which can be subdivided to understand the Worldwide Gross for each Creative Type.

**Why is a task pursued? (goal)** The task is to **play around** with the bar chart and compare the changes in the sum of Worldwide Gross across MPAA Ratings, and across Creative Types within each MPAA Rating.

**How is a task conducted? (means)** The task is conducted using a **sorted** bar chart of the sum of Worldwide Gross for each MPAA Rating that is **connected** to a bar chart of the Worldwide Gross for each Creative Type within the selected MPAA Rating.

**What does a task seek to learn about the data? (characteristics)** This task will **identify** the highest Worldwide Gross among the MPAA Ratings and the trends of Worldwide Gross across Creative Types within each MPAA Rating.

**Where does the task operate? (target data)** The task operates by **comparing** the Worldwide Gross of each MPAA Rating and, within each MPAA Rating, by **comparing** the Worldwide Gross of each Creative Type.

**When is the task performed? (workflow)** The task is performed whenever the user clicks on one of the bars in the bar chart showing the Worldwide Gross for each MPAA Rating.

**Who is executing the task? (roles)** The task is executed by the user. The user will need to first observe which MPAA Rating has the highest Worldwide Gross, then, within each MPAA Rating, observe which Creative Type has the highest Worldwide Gross. They should also observe how the amounts for each Creative Type change as they move between the ordinal MPAA Ratings.

In [36]:
# Let's implement filtering using dynamic queries. 
selection = alt.selection(type="multi", fields=["MPAA Rating"])

# Create a container for our two different views
base = alt.Chart(movies_df).properties(width=500, height=250)

# Let's specify our overview chart
overview = alt.Chart(movies_df).mark_bar().encode(
    # Sort the ratings by total Worldwide Gross (the y value), highest first
    x = alt.X("MPAA Rating", sort="-y"),
    y = "sum(Worldwide Gross)",
    color=alt.condition(selection, alt.value("orange"), alt.value("lightgrey"))
).add_selection(selection).properties(height=250, width=250)

# Create a detail chart
detail = base.mark_bar().encode(
    y = "sum(Worldwide Gross)", 
    x = alt.X("Creative Type", sort = movies_df['Creative Type'].unique())
).transform_filter(selection).properties(height=250, width=250)

overview | detail
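
The two linked bar charts correspond to two aggregations: total Worldwide Gross per MPAA Rating, and the same total broken down by Creative Type. A minimal pandas sketch of both follows, on a hypothetical `toy` frame whose amounts are made up.

```python
import pandas as pd

# Hypothetical toy data; gross amounts are illustrative only.
toy = pd.DataFrame({
    "MPAA Rating": ["G", "G", "PG-13", "PG-13", "R"],
    "Creative Type": [
        "Kids Fiction", "Kids Fiction", "Super Hero",
        "Contemporary Fiction", "Contemporary Fiction",
    ],
    "Worldwide Gross": [100.0, 50.0, 300.0, 200.0, 80.0],
})

# Overview: total gross per rating, highest first (what the overview chart shows).
gross_by_rating = (
    toy.groupby("MPAA Rating")["Worldwide Gross"].sum().sort_values(ascending=False)
)

# Detail: total gross per Creative Type within each rating; 0 marks absent pairs.
gross_table = toy.pivot_table(
    index="Creative Type",
    columns="MPAA Rating",
    values="Worldwide Gross",
    aggfunc="sum",
    fill_value=0,
)

print(gross_by_rating)
print(gross_table)
```

Zero cells in the pivot table are the "missing Creative Type" cases that users can otherwise only spot by clicking through each bar.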

### Evaluation ###

The 3 goals outlined above were designed to outline movies' performance (Worldwide Gross and IMDB Rating) based on movies' features. The core question for my project: are there patterns or trends within movies' features that suggest similarities between movies?

For the procedure, I decided to structure a qualitative evaluation using semi-structured interviews. The user will be introduced to each of the visualizations in the order of Goal 1, 2, then 3. For each visualization, I will ask the following question:
1. Are there any noticeable patterns or trends that suggest similarities in movies? If so, please describe all the patterns you notice that you think are significant.

I will keep track of all observations the user makes and will count the number of insights for each user.

I recruited 3 people to participate in this project.

User A is a friend of mine, who is currently a professional Data Scientist.

User B is my younger sibling, who is currently in the 11th grade. 

User C is my cousin, who has a master's degree in Public Health.

**User A:**

Task 1 - they were able to identify a wide variety of insights. They first mentioned that the initial plot did not show many trends, but then identified the negative correlation between Running Time and IMDB Rating in Dramatization. They also mentioned the positive correlation between Running Time and IMDB Rating in Historical Fiction. They made 6 different insights in total and did not use the histogram. None of the insights were deep (they were initial insights).

Task 2 - the user was initially unable to get the filtering to work (they thought it was similar to the first task, where they would click on a point instead of the legend). This might be because they interacted with Task 1 first, then Task 2. The only observation they made from this task was the high number of Action and Adventure films. The user was more interested in seeing which movies the scatter points represented and how each ranked against other movies.

Task 3 - the user was able to draw an insight on how the Creative Type changed as the MPAA Rating became more restrictive, focusing on how the Worldwide Gross was mostly Kids Fiction for the G and PG ratings and Contemporary Fiction for PG-13 and R. They then tried to use Task 2 to identify which movie was NC-17.

**User B:**

Task 1 - they were able to make 5 different insights, all concerning the positive/negative correlations found by filtering the data. One of their insights pertained to the Science Fiction Creative Type, which they claimed, using the histogram, had only 7 movies.

Task 2 - they were able to identify the high number of Action and Adventure films and suggested a small but existing positive correlation between IMDB Rating and Production Budget. They made 4 insights for this task.

Task 3 - the user was able to identify that the PG-13 Rating has no Factual Creative Type despite having the highest Worldwide Gross. They also identified that only the PG Rating has all 8 Creative Types of movies.

**User C:**

Task 1 - they were able to identify 2 deeper-level insights. The first was that Kids Fiction movies did not exceed 110 minutes, which they suggested might reflect producers accounting for kids' attention spans. They also observed a normal distribution for Contemporary Fiction. In total, they made 8 different observations.

Task 2 - they were able to identify the outliers in Production Budget for the Adventure movies. They also commented that Thriller/Suspense rated higher than Horror. They also identified the difference in clusters between Black Comedy and Comedy, citing the large gap in Production Budget. They made 10+ observations.

Task 3 - the user drew similar conclusions to User B, citing the absence of the Factual Creative Type in PG-13 despite it having the highest Worldwide Gross. They also drew insights on the missing Creative Types for each rating, such as the missing Science Fiction and Super Hero Creative Types in the G rating, reasoning that those types typically include some fighting or damage that would place them at a higher MPAA Rating.

### Conclusion ###

I think the graphs and data I selected worked well with the interactions of the visualizations. From this experiment, I noticed that interaction seems to be very important to users' insights. Being able to narrow the scope of the graph enabled users to draw insights easily. This was especially clear when User A was unable to get the graph to filter after clicking repeatedly on a scatter point and tried to create a general insight from the unfiltered visualization before I stepped in to assist. The elements that worked well for this experiment were the linked visualizations for Tasks 1 and 3. However, as User B pointed out, the lack of recent data made some of the insights outdated, since there have been multiple record-breaking movies since 2010 that could have skewed the data. For the future, I will try to incorporate more data, including more recent movies, and use imputation methods for NaN values rather than removing entire rows that contain NaN. Another refinement concerns Task 3. I tried to create a stacked bar chart where the color was determined by the Title or Distributor of the movie, but was unable to produce the output without it looking too cluttered (it looked like a rainbow on a bar).
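
As one possible shape for the imputation idea mentioned above, the sketch below fills numeric gaps with the column median and labels categorical gaps explicitly instead of dropping rows. The `toy` frame, the choice of median, and the `"Unknown"` label are all assumptions for illustration. Other strategies may suit the movies data better.

```python
import pandas as pd

# Hypothetical toy frame with gaps; values are illustrative only.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Running Time min": [100.0, None, 120.0, 90.0],
    "Creative Type": ["Fantasy", None, "Factual", "Fantasy"],
})

imputed = toy.copy()

# Numeric column: replace missing values with the column median.
median_runtime = imputed["Running Time min"].median()
imputed["Running Time min"] = imputed["Running Time min"].fillna(median_runtime)

# Categorical column: keep the row and mark the gap with an explicit label.
imputed["Creative Type"] = imputed["Creative Type"].fillna("Unknown")

print(imputed)
```

Unlike `dropna()`, this keeps all four rows, at the cost of introducing estimated values that should be disclosed alongside any chart built from them.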