# Week 5 - Mini-Project

## <font color='#1A9FFF'>Case study: IMDB Movie Dataset</font>

The data in this dataset comes from IMDb's Top 100 Movies of All Time list. The movies recorded were released in this decade (between 2010 and 2016). The dataset contains 118 movies, and each movie entry has 54 attributes, most of which relate to ratings. The goal of this dataset is to help us understand what makes highly rated movies successful.

**Data dictionary**

|# | Attribute   |   Description                                                                     |
|:-:|:------------|:----------------------------------------------------------------------------------|
|0 |Title        | Name of the movies with the year of release in the brackets.                      |
|1 |Rating       | Total average rating of the movie on IMDb.                                        |
|2 |TotalVotes   | Total number of votes given to the movie.                                         |
|3 |Genre1       | Genre attributed to the movie. A lot of movies have just one or two genres.       |
|4 |Genre2       | Second genre attributed to a movie.                                               |
|5 |Genre3       | Third genre attributed to a movie.                                                |
|6 |MetaCritic   | Metacritic score.                                                                 |
|7 |Budget       | Budget of the movie as listed on IMDb. Some entries are wrong or in a different currency <br> (GBP, EUR, etc.) and need to be cleaned before analysis.                                      |
|8 |Runtime      | Duration of the movie. (min)                                                      |
|9 |Cvotes10     | Number of votes rating the movie 10 stars.                                        |
|10|Cvotes09     | Number of votes rating the movie 9 stars.                                         |
|11|Cvotes08     | Number of votes rating the movie 8 stars.                                         |
|12|Cvotes07     | Number of votes rating the movie 7 stars.                                         |
|13|Cvotes06     | Number of votes rating the movie 6 stars.                                         |
|14|Cvotes05     | Number of votes rating the movie 5 stars.                                         |
|15|Cvotes04     | Number of votes rating the movie 4 stars.                                         |
|16|Cvotes03     | Number of votes rating the movie 3 stars.                                         | 
|17|Cvotes02     | Number of votes rating the movie 2 stars.                                         | 
|18|Cvotes01     | Number of votes rating the movie 1 star.                                          |
|19|CvotesMale   | Total number of male voters.                                                      |   
|20|CvotesFemale | Total number of female voters.                                                    |
|21|CvotesU18    | Number of votes by people aged under 18.                                          |
|22|CvotesU18M   | Number of votes by male voters aged under 18.                                     |
|23|CvotesU18F   | Number of votes by female voters aged under 18.                                   |
|24|Cvotes1829   | Number of votes by voters between ages 18 and 29, inclusive of both years.        |
|25|Cvotes1829M  | Number of votes by male voters between ages 18 and 29, inclusive of both years.   |
|26|Cvotes1829F  | Number of votes by female voters between ages 18 and 29, inclusive of both years. |
|27|Cvotes3044   | Number of votes by voters between ages 30 and 44, inclusive of both years.        |
|28|Cvotes3044M  | Number of votes by male voters between ages 30 and 44, inclusive of both years.   |
|29|Cvotes3044F  | Number of votes by female voters between ages 30 and 44, inclusive of both years. |
|30|Cvotes45A    | Number of votes by voters aged 45 and above.                                      |
|31|Cvotes45AM   | Number of votes by male voters aged 45 and above.                                 | 
|32|Cvotes45AF   | Number of votes by female voters aged 45 and above.                               |  
|33|Cvotes1000   | Total count of votes by top 1000 users of IMDb.                                   |
|34|CvotesUS     | Total count of votes by US based users.                                           | 
|35|CvotesnUS    | Total count of votes by viewers based outside US.                                 |
|36|VotesM       | Average rating by male users.                                                     |
|37|VotesF       | Average rating by female users.                                                   | 
|38|VotesU18     | Average rating by users under 18 years of age.                                    |
|39|VotesU18M    | Average rating by male users under 18 years of age.                               | 
|40|VotesU18F    | Average rating by female users under 18 years of age.                             |
|41|Votes1829    | Average rating by users between ages 18 and 29.                                   |
|42|Votes1829M   | Average rating by male users between ages 18 and 29.                              | 
|43|Votes1829F   | Average rating by female users between ages 18 and 29.                            |
|44|Votes3044    | Average rating by users between ages 30 and 44.                                   |
|45|Votes3044M   | Average rating by male users between ages 30 and 44.                              | 
|46|Votes3044F   | Average rating by female users between ages 30 and 44.                            |
|47|Votes45A     | Average rating by users age 45 and above.                                         | 
|48|Votes45AM    | Average rating by male users age 45 and above.                                    | 
|49|Votes45AF    | Average rating by female users age 45 and above.                                  |
|50|VotesIMDB    | Average rating of IMDb staff.                                                     | 
|51|Votes1000    | Average rating of IMDb's top 1000 users.                                          |
|52|VotesUS      | Average rating of U.S. based users.                                               |
|53|VotesnUS     | Average rating of users based outside the U.S.                                    |


**The project aims to answer:**

* What are the characteristics of top movies?
* Are there any trends among top movies, for example in genre, budget, or targeted age group or gender?

## <font color='#1A9FFF'>Sample Solution</font>

#### Import Libraries and Dataset

In [None]:
import numpy  as np
import pandas as pd              # import pandas for data wrangling
import matplotlib.pyplot as plt  # import matplotlib for plotting
%matplotlib inline

data = pd.read_csv('IMDB.csv',encoding = "latin")   # read data

###  <font color='#1A9FFF'>1. Brief Summary of Dataset</font>
After reading in the data, we first do some simple exploration: checking the available columns, the data structure, and the summary statistics.

#### Check the size of the dataset

In [None]:
# Check the size of the IMDB dataset
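One way to check the size is the `shape` attribute, which returns a `(rows, columns)` tuple. The sketch below uses a small made-up frame named `toy` as a stand-in for `data`; the titles and values are illustrative only and are not taken from `IMDB.csv`.

```python
import pandas as pd

# Made-up stand-in for the real `data` frame (illustrative values only)
toy = pd.DataFrame({
    "Title": ["Movie A (2010)", "Movie B (2012)", "Movie C (2015)"],
    "Rating": [8.1, 7.4, 8.6],
    "Budget": [160000000, 804, 25000000],
})

# shape is a (number_of_rows, number_of_columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} movies, {n_cols} attributes")
```

On the real dataset the equivalent call would be `data.shape`.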


As we can see, there are `118` movies recorded in this dataset, and each movie has `54` attributes.

#### Check the data structure 

In [None]:
# Check the data structure (use .info())
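A minimal sketch of the structure check, again on a made-up `toy` frame rather than the real data. `info()` prints column names, non-null counts, and dtypes; counting missing values per column with `isna().sum()` is an optional extra that the exercise does not ask for.

```python
import pandas as pd

# Made-up stand-in for `data`; one MetaCritic score is deliberately missing
toy = pd.DataFrame({
    "Title": ["Movie A (2010)", "Movie B (2012)", "Movie C (2015)"],
    "Rating": [8.1, 7.4, 8.6],
    "MetaCritic": [82.0, None, 90.0],
})

# Prints column names, non-null counts, and dtypes
toy.info()

# Optional: number of missing values in each column
missing = toy.isna().sum()
print(missing)
```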


#### Check main attributes

In [None]:
# Check the main attributes of the first 10 movies
# the main attributes are "Title"(0), "Rating"(1), "TotalVotes"(2), "Genre1"(3), "MetaCritic"(6), "Budget"(7), "Runtime"(8)
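The main attributes can be selected either by name or by the positions listed in the data dictionary. The sketch below assumes the column order of the dictionary and uses a made-up two-row frame, so `head(10)` simply returns every row here.

```python
import pandas as pd

# Made-up stand-in for the first nine columns of `data`, in dictionary order
toy = pd.DataFrame({
    "Title": ["Movie A (2010)", "Movie B (2012)"],
    "Rating": [8.1, 7.4],
    "TotalVotes": [900000, 300000],
    "Genre1": ["Action", "Drama"],
    "Genre2": ["Adventure", None],
    "Genre3": ["Sci-Fi", None],
    "MetaCritic": [82, 75],
    "Budget": [160000000, 25000000],
    "Runtime": [148, 95],
})

main_cols = ["Title", "Rating", "TotalVotes", "Genre1",
             "MetaCritic", "Budget", "Runtime"]

# Select by column name ...
main = toy[main_cols].head(10)

# ... or by the positions given in the data dictionary
main_by_pos = toy.iloc[:, [0, 1, 2, 3, 6, 7, 8]].head(10)

print(main)
```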


#### Brief Summary

In [None]:
# Briefly summarize the statistical distribution of main attributes (use .describe())
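`describe()` reports count, mean, standard deviation, minimum, quartiles, and maximum for numeric columns. The sketch below runs it on a made-up `toy` frame; the value 804 is planted to mimic the suspicious minimum budget that the next section investigates.

```python
import pandas as pd

# Made-up stand-in for the numeric main attributes of `data`
toy = pd.DataFrame({
    "Rating": [8.1, 7.4, 8.6],
    "Budget": [160000000, 804, 25000000],
    "Runtime": [148, 95, 120],
})

# One column of statistics per numeric attribute
summary = toy[["Rating", "Budget", "Runtime"]].describe()
print(summary)

# Individual statistics can be looked up by row label
min_budget = summary.loc["min", "Budget"]
```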


###  <font color='#1A9FFF'>2. Check the Data</font>
Check the data to see whether the dataset contains any erroneous values.

#### Unique Budget

In [None]:
# As we can see from the Brief Summary, the minimum budget in the movie list is 804, which is unusual. 
# We therefore check the budget first. Let's take a look at the unique budget values. (use .unique())
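A small sketch of inspecting the distinct budget values. Sorting them (an optional step, done here with `np.sort`) puts suspiciously small entries first. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Budget column of `data`
toy = pd.DataFrame({"Budget": [160000000, 804, 25000000, 804]})

# unique() returns the distinct values; sorting makes outliers easy to spot
unique_budgets = np.sort(toy["Budget"].unique())
print(unique_budgets)
```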


In [None]:
# Check the movies with budget lower than 10,000


In [None]:
# Remove the movies with budget lower than 10,000
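The check and the removal can both be expressed with a boolean mask. The sketch below works on a made-up `toy` frame; on the real data the same pattern would apply to `data`, with the result assigned back to `data` if later cells should see the cleaned version.

```python
import pandas as pd

# Made-up stand-in for `data` with two implausibly small budgets
toy = pd.DataFrame({
    "Title": ["Movie A (2010)", "Movie B (2012)", "Movie C (2015)", "Movie D (2016)"],
    "Budget": [160000000, 804, 25000000, 5000],
})

# Rows to inspect: budget below 10,000
low_budget = toy[toy["Budget"] < 10_000]
print(low_budget)

# Drop exactly those rows and renumber the index
cleaned = toy.drop(low_budget.index).reset_index(drop=True)
print(cleaned)
```

Dropping by index, rather than keeping rows with `Budget >= 10_000`, leaves any rows with a missing budget in place; whether that is desirable depends on the analysis.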


###  <font color='#1A9FFF'>3. Data Observation</font>

#### Genre Summary

In [None]:
# find number of movies in each genre
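The exercise does not say whether "genre" means only `Genre1` or all three genre columns, so the sketch below shows both readings on a made-up frame. `value_counts()` ignores missing entries by default.

```python
import pandas as pd

# Made-up stand-in for the three genre columns of `data`
toy = pd.DataFrame({
    "Genre1": ["Action", "Drama", "Action", "Comedy"],
    "Genre2": ["Adventure", "Romance", "Drama", None],
    "Genre3": ["Sci-Fi", None, None, None],
})

# Reading 1: count movies by their first genre only
by_primary = toy["Genre1"].value_counts()
print(by_primary)

# Reading 2: count every genre tag across all three columns
all_genres = (
    toy[["Genre1", "Genre2", "Genre3"]]
    .melt(value_name="Genre")["Genre"]
    .value_counts()
)
print(all_genres)
```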


#### Rating Distribution

In [None]:
# plot histogram showing rating distribution
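A minimal histogram sketch using matplotlib's `Axes.hist`. The ratings and the bin edges below are made up for illustration; on the real data the input would be `data['Rating']` and the bins would be chosen to suit its range.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up ratings standing in for data['Rating']
ratings = pd.Series([7.2, 7.5, 7.8, 8.0, 8.1, 8.4, 8.8], name="Rating")

fig, ax = plt.subplots()

# hist returns the count per bin and the bin edges
counts, edges, _ = ax.hist(ratings, bins=[7.0, 7.5, 8.0, 8.5, 9.0],
                           edgecolor="black")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of movies")
ax.set_title("Rating distribution")
```

The budget and runtime histograms follow the same pattern with a different column and axis label.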


#### Budget Distribution

In [None]:
# plot histogram showing budget distribution


#### Runtime Distribution

In [None]:
# plot histogram showing runtime distribution


###  <font color='#1A9FFF'>4. Exploratory Data Analysis</font>

#### Quantify Rating

Separate the ratings into 3 groups: HighRate, MidRate, and LowRate.

In [None]:
# define rating ranges
rating_cut = pd.cut(data['Rating'], [7.0, 7.66, 8.33, 9.0])

# group by the predefined rating ranges (HighRate, MidRate, LowRate), and check the size of each group
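A sketch of the grouping step on made-up ratings. It reuses the bin edges from the cell above; passing `labels=` to `pd.cut` is an addition here, made so that the groups carry the names LowRate, MidRate, and HighRate instead of interval notation.

```python
import pandas as pd

# Made-up ratings standing in for data['Rating']
toy = pd.DataFrame({"Rating": [7.2, 7.5, 7.9, 8.1, 8.3, 8.6]})

# Same bin edges as above; labels are an illustrative addition
rating_cut = pd.cut(toy["Rating"], [7.0, 7.66, 8.33, 9.0],
                    labels=["LowRate", "MidRate", "HighRate"])

# Number of movies in each rating group
group_sizes = toy.groupby(rating_cut, observed=False).size()
print(group_sizes)
```

By default `pd.cut` builds intervals that are open on the left, so a rating of exactly 7.0 would fall outside every bin unless `include_lowest=True` is passed.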


#### Rating - Budget relationship exploration

In [None]:
# average budget in each of the 3 rating groups
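Once the rating groups exist, the average budget per group is a `groupby` followed by `mean()`. The sketch below uses made-up ratings and budgets, and the group labels are an illustrative addition.

```python
import pandas as pd

# Made-up stand-in for the Rating and Budget columns of `data`
toy = pd.DataFrame({
    "Rating": [7.2, 7.5, 7.9, 8.1, 8.6],
    "Budget": [20000000, 40000000, 100000000, 60000000, 150000000],
})

rating_cut = pd.cut(toy["Rating"], [7.0, 7.66, 8.33, 9.0],
                    labels=["LowRate", "MidRate", "HighRate"])

# Mean budget within each rating group
avg_budget = toy.groupby(rating_cut, observed=False)["Budget"].mean()
print(avg_budget)
```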


In [None]:
# define budget ranges
budget_cut = pd.cut(data['Budget'], list(np.linspace(10739, 250000000, num=6)))

# average rating in each budget range
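A sketch of the reverse view: the average rating within each budget range, on made-up data. It reuses the `np.linspace` edges from the cell above. `include_lowest=True` is an assumption added here so that a budget exactly equal to the lowest edge (10739) is not left out of the first bin.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Budget and Rating columns of `data`
toy = pd.DataFrame({
    "Budget": [10739, 40000000, 120000000, 240000000],
    "Rating": [7.4, 7.8, 8.2, 8.6],
})

# Five equal-width budget ranges between the two end points
edges = np.linspace(10739, 250000000, num=6)
budget_cut = pd.cut(toy["Budget"], edges, include_lowest=True)

# Mean rating per budget range; empty ranges show up as NaN
avg_rating = toy.groupby(budget_cut, observed=False)["Rating"].mean()
print(avg_rating)
```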


#### Rating - Genre relationship exploration

In [None]:
# find the average rating in each genre
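A minimal sketch that averages the rating per genre, assuming "genre" means the `Genre1` column. The data is made up; sorting the result is optional and only makes the ranking easier to read.

```python
import pandas as pd

# Made-up stand-in for the Genre1 and Rating columns of `data`
toy = pd.DataFrame({
    "Genre1": ["Action", "Drama", "Action", "Comedy"],
    "Rating": [8.0, 8.4, 7.6, 7.5],
})

# Mean rating per first genre, highest first
avg_rating = (
    toy.groupby("Genre1")["Rating"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_rating)
```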


In [None]:
# check the average number of votes for each genre
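The same pattern applies to vote counts. The sketch below assumes `TotalVotes` is stored as a number and, as before, groups by `Genre1`; the values are made up.

```python
import pandas as pd

# Made-up stand-in for the Genre1 and TotalVotes columns of `data`
toy = pd.DataFrame({
    "Genre1": ["Action", "Drama", "Action", "Comedy"],
    "TotalVotes": [900000, 300000, 500000, 100000],
})

# Mean number of votes per first genre
avg_votes = toy.groupby("Genre1")["TotalVotes"].mean()
print(avg_votes)
```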


#### Rating - Gender relationship exploration

In [None]:
# average rating in each of the 3 rating groups (separate Male and Female)
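A sketch comparing male and female average ratings within each rating group, using the `VotesM` and `VotesF` columns from the data dictionary. The values and the group labels are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Rating, VotesM, and VotesF columns of `data`
toy = pd.DataFrame({
    "Rating": [7.2, 7.5, 7.9, 8.1, 8.6],
    "VotesM": [7.3, 7.5, 8.0, 8.2, 8.7],
    "VotesF": [7.1, 7.3, 7.8, 8.0, 8.3],
})

rating_cut = pd.cut(toy["Rating"], [7.0, 7.66, 8.33, 9.0],
                    labels=["LowRate", "MidRate", "HighRate"])

# Mean male and female rating within each rating group
by_group = toy.groupby(rating_cut, observed=False)[["VotesM", "VotesF"]].mean()
print(by_group)
```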


#### Genre - Gender relationship exploration

In [None]:
# explore the average rating in each genre group (separate Male and Female)
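The genre view can be sketched the same way. The column `M_minus_F` below is a hypothetical helper, not part of the dataset, added to make the gap between male and female ratings explicit. All values are made up.

```python
import pandas as pd

# Made-up stand-in for the Genre1, VotesM, and VotesF columns of `data`
toy = pd.DataFrame({
    "Genre1": ["Action", "Drama", "Action", "Drama"],
    "VotesM": [8.2, 8.0, 7.8, 8.4],
    "VotesF": [7.8, 8.2, 7.6, 8.6],
})

# Mean male and female rating per first genre
by_genre = toy.groupby("Genre1")[["VotesM", "VotesF"]].mean()

# Hypothetical helper column: positive means men rated the genre higher
by_genre["M_minus_F"] = by_genre["VotesM"] - by_genre["VotesF"]
print(by_genre)
```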


### Actionable Insights
Record any insights from the exploratory data analysis that could help improve a movie's rating.
1. 
2. 
3. 
