# Exploratory Data Analysis


## Dataset
#### [Movies Dataset](https://grouplens.org/datasets/movielens/)
- movies.csv: The main Movies file. Contains information on 45,000 movies featured in the Full MovieLens dataset. Features include:

    - budget - The budget with which the movie was made.
    - genres - The genres of the movie, e.g. Action, Comedy, Thriller.
    - homepage - A link to the homepage of the movie.
    - id - This is in fact the movie_id, as in the first dataset.
    - keywords - The keywords or tags related to the movie.
    - original_language - The language in which the movie was made.
    - original_title - The title of the movie before translation or adaptation.
    - overview - A brief description of the movie.
    - popularity - A numeric quantity specifying the movie popularity.
    - production_companies - The production house of the movie.
    - production_countries - The country in which it was produced.
    - revenue - The worldwide revenue generated by the movie.
    - release_date - The date on which it was released.
    - runtime - The running time of the movie in minutes.
    - status - "Released" or "Rumored".
    - tagline - Movie's tagline.
    - title - Title of the movie.
    - vote_average - The average rating the movie received.
    - vote_count - The number of votes the movie received.
    
    
- credits.csv:
    - movie_id - Unique identifier for each movie.
    - title - Name of the movie.
    - crew - Behind-the-scenes crew.
    - cast - Actors.



In [28]:
  
import pandas as pd
import numpy as np




## Import Data

In [29]:

movies = pd.read_csv('../data/raw/tmdb_5000_movies.csv')
credit = pd.read_csv('../data/raw/tmdb_5000_credits.csv')



In [30]:
movies = movies.merge(credit, on = 'title')
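
Merging on `title` pairs every movie row with every credit row that has the same title, so titles that occur more than once are multiplied. The row counts of 4 for a few titles later in this notebook are consistent with that. Below is a minimal sketch of a key-based merge, assuming `id` in the movies file and `movie_id` in the credits file identify the same film. The toy frames `movies_demo` and `credits_demo` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are illustrative only.
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
})
credits_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["[]", "[]", "[]"],
})

# Merging on title: each "Batman" row matches both credit rows (2 x 2 = 4).
by_title = movies_demo.merge(credits_demo, on="title")

# Merging on the identifier keeps one row per movie; validate raises
# MergeError if either key column contains duplicates.
by_id = movies_demo.merge(
    credits_demo.drop(columns="title"),
    left_on="id",
    right_on="movie_id",
    validate="one_to_one",
)

print(len(by_title), len(by_id))
```

The `validate` argument is optional, but it turns a silent row explosion into an explicit error.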

#### Movie Dataset Analysis

In [31]:
movies.head(2)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [32]:
initial_movie_shape = movies.shape
initial_movie_shape

(4809, 23)

#### Missing values

In [33]:
# Percentage of missing values per column, rounded to the nearest whole percent
round(movies.isnull().sum() / len(movies) * 100)

budget                   0.0
genres                   0.0
homepage                64.0
id                       0.0
keywords                 0.0
original_language        0.0
original_title           0.0
overview                 0.0
popularity               0.0
production_companies     0.0
production_countries     0.0
release_date             0.0
revenue                  0.0
runtime                  0.0
spoken_languages         0.0
status                   0.0
tagline                 18.0
title                    0.0
vote_average             0.0
vote_count               0.0
movie_id                 0.0
cast                     0.0
crew                     0.0
dtype: float64

#### Dropped columns

Many columns are unnecessary, redundant, or have many missing values, so we drop them.

In [34]:
cols_to_drop = [
    'homepage', 'tagline', 'spoken_languages', 'status',
    'production_countries', 'production_companies', 'original_title',
    'original_language', 'budget', 'id', 'release_date', 'revenue',
    'overview', 'runtime',
]
movies.drop(columns=cols_to_drop, inplace=True)


In [35]:
movies.columns

Index(['genres', 'keywords', 'popularity', 'title', 'vote_average',
       'vote_count', 'movie_id', 'cast', 'crew'],
      dtype='object')

#### Genre Analysis

Each entry in the genres column is a string that encodes a list of dictionaries. We parse it and keep only the genre names to use as tags.

In [36]:
import ast


def convert(obj):
    # Parse the string-encoded list of dicts and keep each 'name' value.
    return [i['name'] for i in ast.literal_eval(obj)]


In [37]:
movies['genres'] = movies['genres'].apply(convert)
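
To illustrate what the conversion does, here is a small self-contained sketch. The sample string is a made-up value in the same format as the genres column, and `names_from_json_like` is a hypothetical name that mirrors `convert`.

```python
import ast


def names_from_json_like(text):
    """Parse a string-encoded list of dicts and return the 'name' values."""
    return [item["name"] for item in ast.literal_eval(text)]


# Illustrative input in the same shape as one cell of the genres column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(names_from_json_like(sample))
print(names_from_json_like("[]"))  # an empty list stays empty
```

`ast.literal_eval` works here because the entries contain only strings and numbers. Values such as `null` or `true` would need `json.loads` instead.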


#### Drop movies with missing genres

In [38]:
movies.drop(movies[movies['genres'].str.len() == 0].index, inplace=True)

In [39]:
all_genres = sum(movies.genres, [])
print(f'We have {len(set(all_genres))} unique genres in our dataset.')


We have 20 unique genres in our dataset.


In [72]:
import nltk
all_genres = nltk.FreqDist(all_genres)
all_genres_df = pd.DataFrame(
    {'Genre': list(all_genres.keys()), 'Count': list(all_genres.values())})
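
The same frequency table can also be built with pandas alone. The sketch below is an alternative to `nltk.FreqDist`, not what this notebook runs, and it uses a made-up `demo` frame rather than the real data.

```python
import pandas as pd

# Illustrative frame with a list-valued genres column.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Comedy"], ["Drama"], ["Thriller", "Drama"]],
})

# explode() gives each list element its own row; value_counts() tallies them.
genre_counts = (
    demo["genres"]
    .explode()
    .value_counts()
    .rename_axis("Genre")
    .reset_index(name="Count")
)
print(genre_counts)
```

This avoids `sum(list_of_lists, [])`, which copies the growing list on every step and slows down as the number of rows increases.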


#### Genres Bar Graph

In [41]:
import plotly.express as px

fig = px.bar(all_genres_df, x='Genre', y='Count', title='# of Genres' )
fig.show()

It looks like the top 3 genres in our movie dataset are Drama, Comedy and Thriller. 

#### Keywords

The keywords column has the same issue as the genres column, so we apply the same conversion.

In [42]:
movies['keywords'] = movies['keywords'].apply(convert)


#### Vote Average

In [43]:
outOfRangeVotes = movies[(movies.vote_average < 0) |
                         (movies.vote_average > 10)]
outOfRangeVotes


Unnamed: 0,genres,keywords,popularity,title,vote_average,vote_count,movie_id,cast,crew


The vote_average column should contain values in the range 0-10. No rows fall outside this range.

#### Cast

A movie can have hundreds of cast members, so we keep only the first 10.

In [44]:
def convertCast(obj):
    # Keep the names of the first 10 cast members listed.
    return [i['name'] for i in ast.literal_eval(obj)[:10]]


In [45]:
movies['cast'] = movies['cast'].apply(convertCast)

In [88]:
all_actors = sum(movies.cast, [])
print(f'We have {len(set(all_actors))} unique actors in our dataset.')

We have 18987 unique actors in our dataset.


In [95]:
import nltk
all_actors = nltk.FreqDist(all_actors)
all_actors_df = pd.DataFrame(
    {'Actors': list(all_actors), 'Count': list(all_actors.values())})
top_10_act = all_actors_df.sort_values(by='Count', ascending=False).head(10)
top_10_act


Unnamed: 0,Actors,Count
76,Dustin Hoffman,58
1677,Judd Hirsch,55
37,Laurence Fishburne,46
1013,Kelly Reilly,42
728,Melanie Lynskey,41
766,Sharon Stone,37
807,Simon Callow,37
273,Michael Peña,37
363,Ryan Gosling,37
323,John C. McGinley,36


In [97]:

fig = px.bar(top_10_act, x='Actors', y='Count',
             title='Most Popular Actors')
fig.show()


#### Director

In [46]:
def get_Director(obj):
    L = []
    for i in ast.literal_eval(obj):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L


In [47]:
movies['crew'] = movies['crew'].apply(get_Director)


In [48]:
movies['crew']

0           [James Cameron]
1          [Gore Verbinski]
2              [Sam Mendes]
3       [Christopher Nolan]
4          [Andrew Stanton]
               ...         
4803     [Neill Dela Llana]
4804     [Robert Rodriguez]
4805         [Edward Burns]
4806          [Scott Smith]
4808     [Brian Herzlinger]
Name: crew, Length: 4781, dtype: object

In [69]:
all_directors = sum(movies.crew, [])
print(f'We have {len(set(all_directors))} unique directors in our dataset.')


We have 2338 unique directors in our dataset.


In [98]:
all_directors = nltk.FreqDist(all_directors)
all_directors_df = pd.DataFrame(
    {'Directors': list(all_directors), 'Count': list(all_directors.values())})
top_10_dir = all_directors_df.sort_values(by='Count', ascending=False).head(10)


In [99]:

fig = px.bar(top_10_dir, x='Directors', y='Count', title='Most Popular Directors')
fig.show()


#### Top 10 Highest Rated Movies

In [49]:
movies.groupby('title')['vote_average'].mean().sort_values(ascending=False).head(10)

title
Little Big Top              10.0
Stiff Upper Lips            10.0
Me You and Five Bucks       10.0
Dancer, Texas Pop. 81       10.0
One Man's Hero               9.3
There Goes My Baby           8.5
The Shawshank Redemption     8.5
The Godfather                8.4
The Prisoner of Zenda        8.4
Counting                     8.3
Name: vote_average, dtype: float64
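
A mean of 10.0 often comes from a title with very few votes, so a ranking by `vote_average` alone can be misleading. One common remedy is to require a minimum number of votes first. The sketch below uses made-up rows, and the cutoff `MIN_VOTES` is an assumption chosen only for illustration.

```python
import pandas as pd

# Illustrative rows; titles and numbers are invented.
demo = pd.DataFrame({
    "title": ["Obscure Film", "Well-Known Film", "Popular Film"],
    "vote_average": [10.0, 8.5, 8.4],
    "vote_count": [1, 8000, 6000],
})

MIN_VOTES = 100  # assumed cutoff, not a value taken from this notebook

# Keep only titles with enough votes, then rank by average rating.
top_rated = (
    demo[demo["vote_count"] >= MIN_VOTES]
    .sort_values("vote_average", ascending=False)
    .head(10)
)
print(top_rated[["title", "vote_average", "vote_count"]])
```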

#### Movies with the most Ratings

In [50]:
movies.groupby('title')['vote_count'].count().sort_values(ascending=False).head(10)

title
Out of the Blue                          4
The Host                                 4
Batman                                   4
#Horror                                  1
Spun                                     1
Spy Kids 3-D: Game Over                  1
Spy Kids 2: The Island of Lost Dreams    1
Spy Kids                                 1
Spy Hard                                 1
Spy Game                                 1
Name: vote_count, dtype: int64
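
Note that `count()` returns the number of rows per title, not the number of votes, so the table above mostly reflects titles that occur more than once in the merged frame. Below is a hedged sketch of ranking by the votes themselves. It uses made-up rows and takes the largest `vote_count` per title so that duplicated rows are not double counted.

```python
import pandas as pd

# Illustrative rows; "Batman" appears twice to mimic a duplicated title.
demo = pd.DataFrame({
    "title": ["Batman", "Batman", "Avatar", "Spy Kids"],
    "vote_count": [2000, 2000, 11800, 700],
})

# Number of rows per title (what count() measures).
rows_per_title = demo.groupby("title")["vote_count"].count()

# Largest vote_count per title, sorted from most to least voted.
most_voted = (
    demo.groupby("title")["vote_count"]
    .max()
    .sort_values(ascending=False)
    .head(10)
)
print(most_voted)
```

Distinct films that share a title are still grouped under one label here. Grouping on `movie_id` instead would keep them apart.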

In [51]:
print(f'Our initial movie and credit dataset had {initial_movie_shape} rows and columns')
print(f'Our final movie and credit dataset has {movies.shape} rows and columns')

Our initial movie and credit dataset had (4809, 23) rows and columns
Our final movie and credit dataset has (4781, 9) rows and columns


In [52]:
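# index=False keeps the row index out of the file, so reloading it does not add an "Unnamed: 0" column
movies.to_csv('movies.csv', index=False)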