# The NarWolves: DC vs Marvel Data Analysis

## Load Libraries

In [None]:
# Import libraries
import pandas as pd 
import numpy as np 

## Importing Data

The dataset was obtained from **Kaggle**. It was first cleaned and processed in this [notebook](https://github.com/nguyenjenny/spark_shared_repo/blob/main/group_02/CleanData.ipynb).

In [None]:
path = "https://raw.githubusercontent.com/nguyenjenny/spark_shared_repo/main/group_02/Marvel_DC_imdb_cleaned.csv"
hero = pd.read_csv(path)
hero

## Getting to Know our Data: Aggregates and Basic Stats

### Basic statistics

We can use the `.describe()` function to get basic information about our numerical data.
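
As a quick illustration of what `.describe()` returns, here is a minimal sketch on a small made-up DataFrame. The column names mirror the ones used in this notebook, but the values and the name `toy` are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C", "Film D"],
    "IMDB_Score": [6.0, 7.0, 8.0, 9.0],
    "RunTime": [90, 100, 110, 120],
})

# By default, .describe() summarizes only the numerical columns
summary = toy.describe()
print(summary)
```

The text column `"Movie"` is left out of the summary, and each numerical column gets a count, mean, standard deviation, min, quartiles, and max.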

In [None]:
# Use `.describe()` to get information on the hero DataFrame



In [None]:
hero.describe()

### Mean IMDB_Score

We can use the `.mean()` function to get the average `"IMDB_Score"`.

First, we need to select the column of the DataFrame that we want.

We can use `.loc[:, "column_name"]` or the shorthand `["column_name"]`.
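
To see that both ways of selecting a column give the same result, here is a minimal sketch on made-up data. The DataFrame `toy` and its values are hypothetical; only the column name `"IMDB_Score"` comes from this notebook.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C"],
    "IMDB_Score": [6.0, 7.0, 8.0],
})

# Label-based selection: all rows, one named column
by_loc = toy.loc[:, "IMDB_Score"]

# Shorthand selection of the same column
by_bracket = toy["IMDB_Score"]

# Both selections return the same Series
same = by_loc.equals(by_bracket)

# Once a column is selected, an aggregate can be chained onto it
avg = by_bracket.mean()
print(same, avg)
```

Note that `.iloc` would not work with a column name, because it only accepts integer positions.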

In [None]:
# Use loc to select the `"IMDB_Score"` column
hero.loc[:, "IMDB_Score"]

In [None]:
# Use the shorthand version ["column"] to get the "IMDB_Score"




In [None]:
# Now that we know how to select a specific column we can add the `.mean()` to it





#### The mean IMDB score for both Marvel and DC movies is ______________. 

### Lowest and Highest IMDB_Score

We can use the `.min()` function to get information about the lowest `"IMDB_Score"` and `.max()` to get the highest `"IMDB_Score"` for comic/superhero movies. 
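
A minimal sketch of `.min()` and `.max()` on a made-up column of scores (the values and the name `scores` are hypothetical). It also shows `.agg()`, which can compute several aggregates in one call.

```python
import pandas as pd

# Hypothetical scores for illustration only
scores = pd.Series([6.8, 4.5, 9.0, 7.3], name="IMDB_Score")

lowest = scores.min()
highest = scores.max()

# .agg() accepts a list of aggregate names and returns one value per name
score_range = scores.agg(["min", "max"])
print(lowest, highest)
print(score_range)
```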



In [None]:
# Get the min/lowest IMDB_Score (add `.min()` to the selection below)
hero["IMDB_Score"]

In [None]:
# Get the max/highest IMDB_Score (add `.max()` to the selection below)
hero["IMDB_Score"]


#### The IMDB scores for both Marvel and DC movies range from ______________ to ____________. 

### Standard Deviation of IMDB_Scores

Standard deviation represents the spread or variability of the data.

↑ stdev = ↑ variability 

We can use the `.std()` function to get the standard deviation of the `"IMDB_Score"`.
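
One detail worth knowing: by default, pandas' `.std()` computes the *sample* standard deviation (it divides by n - 1, i.e. `ddof=1`). The sketch below uses made-up numbers (not real IMDB scores) to show the difference from the population version with `ddof=0`.

```python
import pandas as pd

# Hypothetical values for illustration only; their mean is 5.0
scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default: sample standard deviation (divides by n - 1)
sample_std = scores.std()

# Population standard deviation (divides by n)
population_std = scores.std(ddof=0)

print(sample_std, population_std)
```

For a dataset of this notebook's size the two values will be close, so the default is fine for our purposes.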

In [None]:
hero["IMDB_Score"]

#### The standard deviation of the IMDB scores for both Marvel and DC movies is ___________. 

### Mean Run Time of Movies and TV Series

In [None]:
# Calculate mean run time of the movies and tv series in this data using the `"RunTime"` column



#### The mean run time is ___________. 

### Finding the highest grossing movies in the US using `.sort_values()`

We can sort our data set by a column by using the function called `.sort_values()`.

We simply pass the name of the column we want to sort by as a parameter.

- We can also set the parameter `ascending=True` or `ascending=False` if we want to sort by increasing or decreasing values.
    - `ascending=True`: Values are sorted from smallest to largest
    - `ascending=False`: Values are sorted from largest to smallest

The format for `.sort_values()` looks like this:

```
    hero.sort_values("column_name", ascending=False)
```
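
Here is a minimal sketch of `.sort_values()` on made-up data. The DataFrame `toy`, the film names, and the gross figures are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; gross values are invented
toy = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C"],
    "USA_Gross": [150.0, 620.5, 330.2],
})

# Sort from largest to smallest gross
ranked = toy.sort_values("USA_Gross", ascending=False)

# The first row of the sorted result is the highest grossing film
top_movie = ranked.iloc[0]["Movie"]
print(ranked)
print(top_movie)
```

`.sort_values()` returns a new, sorted DataFrame and leaves the original one unchanged, so the result needs to be stored in a variable if we want to reuse it.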

In [None]:
# Sort the hero DataFrame by `"USA_Gross"` and make `ascending=False`
hero.sort_values("column_name_here", ascending=False)


#### The highest grossing film in the US was _______________. It made $ ________ million in the US alone. 

### Finding the highest rated movies using `.sort_values()`


In [None]:
# Sort the hero DataFrame by `"IMDB_Score"` and make `ascending=False`
hero.sort_values("column_name_here", ascending=False)



#### The highest rated film according to IMDB was _______________. It was rated ___________ by ___________ people. The lowest rated film was __________ with a rating of _______.

## Comparing DC vs Marvel cinematic universes

The battle of the cinematic universes begins! Who will reign supreme?

### Figuring out the number of Marvel and DC films using `.value_counts()`

The first order of business is figuring out how many of our datapoints come from Marvel and how many from DC. Are there more Marvel or DC films?

To do this we use the `.value_counts()` function. Here we apply it to a single column of the DataFrame, and it counts how many times each value appears in that column.

For example:

```
    hero["column_name"].value_counts()
```
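
A minimal sketch of `.value_counts()` on a made-up `"Category"` column (the DataFrame `toy` and its rows are hypothetical). It also shows the optional `normalize=True` parameter, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Category": ["Marvel", "DC", "Marvel", "Marvel", "DC"],
})

# Number of rows for each distinct value
counts = toy["Category"].value_counts()

# Share of rows for each distinct value
shares = toy["Category"].value_counts(normalize=True)

print(counts)
print(shares)
```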

In [None]:
# Figure out how many datapoints are from Marvel or the DC cinematic universe.  Use the column name "Category"
hero["Category"].value_counts()



#### Our dataset has ______ films from the DC universe and ________ films from the Marvel universe.

### Figuring out if Marvel or DC films are more highly rated using `.groupby()`

Which movies are more highly rated on IMDB? DC or Marvel?

To answer this question, we need to use the `.groupby()` method. We can select one or multiple columns to group the data by, and then we can aggregate that data to find out information about mean, min, max, standard deviation, sum, etc.

The format that `.groupby()` uses is as follows:

```
hero.groupby("column_name").aggregate_function()
```

Where `"column_name"` is the name of the column and `.aggregate_function()` is a specific aggregate function like `.mean()`, `.sum()`, `.max()`, or `.min()`.
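
The sketch below applies this format to a small made-up DataFrame (the name `toy` and all values are hypothetical). It passes `numeric_only=True` to `.mean()`, on the assumption that the data contains text columns: recent pandas versions raise an error when asked to average text, and this parameter restricts the aggregate to numerical columns.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C", "Film D"],
    "Category": ["Marvel", "DC", "Marvel", "DC"],
    "IMDB_Score": [7.0, 6.0, 9.0, 8.0],
})

# One row per group; only numerical columns are averaged
means = toy.groupby("Category").mean(numeric_only=True)
print(means)
```

The grouping column becomes the index of the result, so a single value can be looked up with `means.loc["Marvel", "IMDB_Score"]`.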

In [None]:
# Figure out if DC or Marvel films have a greater average IMDB_Score, Metascore, Votes, USA_Gross.
# Remember, the column that tells us if the film is DC or Marvel is called `"Category"`

# numeric_only=True averages only the numerical columns
hero.groupby("Column_Name").mean(numeric_only=True)

In [None]:
hero.groupby("Category").mean(numeric_only=True)

#### ___ has a higher mean IMDB Score than _______.  And, ____ has a higher mean Metacritic score than _______. In general, _______ films tend to gross more in the US than _____ films. 

### Grouping by both `"Category"` (i.e., DC vs Marvel) and `"Type"` (i.e., Movie vs Series)

Our dataset also includes information on whether the film is a stand-alone movie (only one iteration of it) or a series (multiple episodes).

To get a breakdown of our data by both `"Category"` and `"Type"`, we need to pass two columns into `.groupby()`.

The format should be as follows: 

```
hero.groupby(["column_name_1", "column_name_2"]).aggregate_function()
```
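
Here is a minimal sketch of grouping by two columns at once, using made-up data (the DataFrame `toy` and its values are hypothetical). As before, `numeric_only=True` is included on the assumption that text columns may be present.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Category": ["Marvel", "Marvel", "Marvel", "DC", "DC", "DC"],
    "Type": ["Movie", "Movie", "Series", "Movie", "Series", "Series"],
    "IMDB_Score": [8.0, 7.0, 9.0, 6.0, 7.0, 8.0],
})

# One row per (Category, Type) combination
means = toy.groupby(["Category", "Type"]).mean(numeric_only=True)
print(means)

# Rows are addressed with a tuple, one entry per grouping column
marvel_movies = means.loc[("Marvel", "Movie"), "IMDB_Score"]
dc_series = means.loc[("DC", "Series"), "IMDB_Score"]
```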


In [None]:
# Calculate the mean for the numerical data by grouping by "Category" and "Type"
hero.groupby(["column_name_1", "column_name_2"]).mean(numeric_only=True)

#### When comparing stand-alone movies, ______  had a higher IMDB score than ______.  When comparing series, ______ had a higher IMDB rating than ______.

### Everyone knows that there was a resurgence in superhero movies in the mid to late 2000s. What if we only look at data after 2005 using `.query()`?

We can use `.query()` to apply a condition to the data. The condition is written as a string and is usually based on a column of your DataFrame.


Some examples of queries include the following:


- Only include films from before the year 1995: `hero.query("Year_Start < 1995")`
- Only include films that were directed by Tim Burton: `hero.query("Director == 'TimBurton'")`
- Only include films that have an IMDB score greater than or equal to 8 **and** were made after 2010: `hero.query("IMDB_Score >= 8 & Year_Start > 2010")`
- Only include films that have an age rating of "R" **or** "M": `hero.query("Rating == 'R' | Rating == 'M'")`
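
The query patterns above can be tried out on a small made-up DataFrame. In this sketch the name `toy`, the film names, and all values are hypothetical; only the column names come from this notebook. It also shows that a Python variable can be referenced inside the query string with `@`.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Movie": ["Film A", "Film B", "Film C", "Film D"],
    "Year_Start": [1989, 2008, 2012, 2019],
    "IMDB_Score": [7.5, 7.9, 8.0, 8.4],
    "Rating": ["R", "PG-13", "PG-13", "M"],
})

# Single condition on one column
recent = toy.query("Year_Start > 2005")

# Two conditions that must both hold
high_and_new = toy.query("IMDB_Score >= 8 & Year_Start > 2010")

# Either of two conditions
r_or_m = toy.query("Rating == 'R' | Rating == 'M'")

# A Python variable can be used in the condition with the @ prefix
cutoff = 2005
recent_again = toy.query("Year_Start > @cutoff")

print(recent)
```

`.query()` returns a new DataFrame with only the matching rows, so the filtered result can be stored and grouped just like the original.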




In [None]:
# Query the hero data for films where the Year_Start is > 2005

hero_2005 = hero.query("Year_Start > 1900") # Replace the year 
hero_2005

In [None]:
# Calculate the mean for the data only including films after 2005 by grouping by "Category" and "Type"
hero_2005.groupby(["column_name_1", "column_name_2"]).mean(numeric_only=True)


## Graphing Data