# Importing Libraries

In [2]:
import pandas as pd
import numpy as np

In [3]:
credits_df = pd.read_csv('../csv/tmdb_5000_credits.csv')
movies_df = pd.read_csv('../csv/tmdb_5000_movies.csv')

# Data Exploration

Explore the data we are dealing with:
- Understanding the shape, columns, and rows in the data.
- Type of data.
- Look for any missing values.
- Summarize the differences and similarities between the datasets.

In [4]:
print(f'The shape of the movies file is: {movies_df.shape}')
print(f'The shape of the credits file is: {credits_df.shape}')

# Both datasets contain the same number of rows (4803) but a different number of columns (20 vs. 4)

The shape of the movies file is: (4803, 20)
The shape of the credits file is: (4803, 4)


In [16]:
print(f'The columns of the movies dataset are: {movies_df.columns}\n')

print(f'The columns of the credits dataset are: {credits_df.columns}')

# The movies dataset has columns that could be useful as inputs to the machine learning model, for example vote_count and keywords.

The columns of the movies dataset are: Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

The columns of the credits dataset are: Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


In [34]:
print(f'The movies dataset types are:\n{movies_df.dtypes}\n')

print(f'The credits dataset types are:\n{credits_df.dtypes}')

The movies dataset types are:
budget                    int64
genres                   object
homepage                 object
id                        int64
keywords                 object
original_language        object
original_title           object
overview                 object
popularity              float64
production_companies     object
production_countries     object
release_date             object
revenue                   int64
runtime                 float64
spoken_languages         object
status                   object
tagline                  object
title                    object
vote_average            float64
vote_count                int64
dtype: object

The credits dataset types are:
movie_id     int64
title       object
cast        object
crew        object
dtype: object


## Data Types

**Movies:**
- Seven columns are numeric: four `int64` (budget, id, revenue, vote_count) and three `float64` (popularity, runtime, vote_average).
- The remaining 13 columns are of type `object`.

**Credits:**
- Only one integer column (`movie_id`).
- The other three columns are of type `object`.
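
The split between numeric and non-numeric columns summarized above can also be computed directly. The following is a minimal sketch on a toy frame, not on the real files: `toy_df` and its second row are made up for illustration, and only the column names mirror the movies dataset.

```python
import pandas as pd

# Toy stand-in for movies_df; the second row is invented for illustration.
toy_df = pd.DataFrame({
    "budget": [237000000, 1000],       # integer column
    "popularity": [150.437577, 1.5],   # float column
    "title": ["Avatar", "Example"],    # text column
})

# Column names whose dtype is numeric (ints and floats).
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Everything else (here: the text column).
other_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(f"Numeric columns: {numeric_cols}")
print(f"Non-numeric columns: {other_cols}")
```

Applied to `movies_df`, the same two calls should reproduce the counts listed above, assuming the dtypes shown in the output are unchanged.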

## Missing Values

Count the missing values in each column of both datasets. Knowing where the gaps are up front helps us avoid errors later on.

In [18]:
movies_df.isna().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                  31
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

In [19]:
credits_df.isna().sum()

# The credits file has no missing values

movie_id    0
title       0
cast        0
crew        0
dtype: int64
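
Raw counts are easier to judge when they are shown next to the share of rows they represent. The sketch below shows one way to do that on a small made-up frame. `toy_df`, `missing_share`, and `report` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Made-up frame with gaps in two columns; stands in for movies_df.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", None, None, "y"],
    "overview": ["o1", "o2", None, "o4"],
})

# Number of missing values per column (same call as in the cells above).
missing_counts = toy_df.isna().sum()

# Fraction of rows missing per column: the mean of a boolean mask.
missing_share = toy_df.isna().mean()

# Keep only the columns that have gaps, largest share first.
report = missing_share[missing_share > 0].sort_values(ascending=False)
print(report)
```

On the real data, the same report would put `homepage` first, since it has by far the most missing entries (3091 of 4803 rows).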

## ID Column

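Note: The **id** column of the movies dataset and the **movie_id** column of the credits dataset can be used later in the project to match the two datasets. Their first five values are identical, as shown below.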

In [26]:
movies_df['id'][:5]

0     19995
1       285
2    206647
3     49026
4     49529
Name: id, dtype: int64

In [25]:
credits_df['movie_id'][:5]

0     19995
1       285
2    206647
3     49026
4     49529
Name: movie_id, dtype: int64
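
As a rough sketch of how the matching could work later, the example below joins two toy frames on these key columns. This is an assumption about the later step, not the notebook's actual code. The IDs are the first three shown above, while the placeholder titles and cast values are invented.

```python
import pandas as pd

# Toy stand-ins for the two datasets; only the IDs come from the output above.
toy_movies = pd.DataFrame({
    "id": [19995, 285, 206647],
    "title": ["Avatar", "placeholder_b", "placeholder_c"],
})
toy_credits = pd.DataFrame({
    "movie_id": [19995, 285, 206647],
    "cast": ["cast_a", "cast_b", "cast_c"],
})

# Check that both key columns contain the same set of IDs before joining.
same_ids = set(toy_movies["id"]) == set(toy_credits["movie_id"])

# Inner join on the differently named key columns; validate raises an error
# if either side contains duplicate keys.
merged = toy_movies.merge(
    toy_credits,
    left_on="id",
    right_on="movie_id",
    how="inner",
    validate="one_to_one",
)
print(merged)
```

The `validate="one_to_one"` argument is a defensive choice: it makes the merge fail loudly if an ID appears more than once in either frame, instead of silently duplicating rows.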

## Side-by-Side Comparison

Look at the first row of each dataset to see the similarities and differences between them. This helps us understand how the data is organized.

In [28]:
movies_df.loc[0]

budget                                                          237000000
genres                  [{"id": 28, "name": "Action"}, {"id": 12, "nam...
homepage                                      http://www.avatarmovie.com/
id                                                                  19995
keywords                [{"id": 1463, "name": "culture clash"}, {"id":...
original_language                                                      en
original_title                                                     Avatar
overview                In the 22nd century, a paraplegic Marine is di...
popularity                                                     150.437577
production_companies    [{"name": "Ingenious Film Partners", "id": 289...
production_countries    [{"iso_3166_1": "US", "name": "United States o...
release_date                                                   2009-12-10
revenue                                                        2787965087
runtime                               

In [29]:
credits_df.loc[0]

movie_id                                                19995
title                                                  Avatar
cast        [{"cast_id": 242, "character": "Jake Sully", "...
crew        [{"credit_id": "52fe48009251416c750aca23", "de...
Name: 0, dtype: object
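
Several `object` columns in both rows (for example genres, keywords, cast, and crew) appear to hold JSON-encoded lists of dictionaries stored as strings. Assuming they are valid JSON, a sketch like the following could pull out the `name` fields. The sample string and the helper `extract_names` are hypothetical: the first entry mirrors the visible output, and the second is invented.

```python
import json

# Sample string shaped like a genres cell; the second entry is made up.
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 0, "name": "Example"}]'


def extract_names(raw):
    """Return the 'name' field of each entry in a JSON-encoded list."""
    return [entry["name"] for entry in json.loads(raw)]


genre_names = extract_names(sample_genres)
print(genre_names)
```

On the real data, this helper could be applied per column, for example with `movies_df['genres'].apply(extract_names)`, provided every cell parses as JSON.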