# IMDB Movies exploratory data analysis

## Overview
This notebook presents an exploratory data analysis of the features of IMDB movies released from 2009-01-01 to 2019-10-31. The goal is to compute descriptive statistics for each field and to look for relationships between a movie's popularity and its other features.

## Dataset
We crawled a list of IMDB movies from the last 10 years (2009-01-01 to 2019-10-31) and used the IMDbPY API to fetch each movie's features, including its title, release year, genre, plot summaries, number of votes, and rating.

Note:

1. Here, rating is the average rating of each movie, and number of votes is the number of user ratings each movie received. We use the number of votes to measure each movie’s popularity.

2. The keywords of each movie in the original dataset were generated by users; they differ from the keywords we will later generate through topic modeling.

## Items to explore
1. Find the most popular keywords
2. Understand the relationship between a movie's number of votes and its other features
   * votes distribution by rating
   * votes distribution by keyword
   * votes distribution by genre

## Motivation
This will give us insight into people’s preferences across movie genres, keywords, and similar features. These insights may be useful to critics.

In [None]:
import sys
import os
import json
import numpy as np
import pandas as pd

# seaborn.apionly was removed in seaborn 0.9; import the package directly
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
## load data


In [None]:
df = pd.read_csv("../dataset/movie_info.csv")
# split the pipe-separated strings into lists of genres / keywords
df['genre'] = [genre.split("|") for genre in df['genre']]
df['key words'] = [kw.split("|") for kw in df['key words']]

In [None]:
df.head(10)


## Basic EDA
### Key Words

In [None]:
import itertools
keywords = list(itertools.chain(*df['key words'].values))
pd.DataFrame(keywords, columns=['key words'])\
    .groupby('key words')['key words']\
    .count()\
    .sort_values(ascending=False)[:40][::-1]\
    .plot(
        kind='barh',
        figsize=(16, 12),
        fontsize=12,
        title='Key Words Occurrence',
)

### Votes Analysis

In [None]:
def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res
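
For comparison, pandas 0.25 and later ship a built-in `DataFrame.explode` that covers the single-column case used in this notebook. The sketch below runs it on a small made-up frame (the titles and vote counts are illustrative, not taken from the dataset). One difference from the helper above: the built-in method turns an empty list into a `NaN` row, whereas the helper fills it with `fill_value`.

```python
import pandas as pd

# Toy frame standing in for the movie table; all values are made up.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "number of votes": [120, 45, 800],
    "genre": [["Drama", "Comedy"], ["Horror"], []],
})

# One output row per list element; the empty list for "C" becomes a NaN row.
toy_genre = toy.explode("genre").reset_index(drop=True)
print(toy_genre)
```

If the empty-string behaviour of the helper is wanted, chaining `.fillna('')` after `explode` should give a comparable result for the exploded column.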

In [None]:
df_keywords = explode(df, ['key words'])


In [None]:
high_vol_key_words = df_keywords.groupby('key words')['key words']\
    .count()\
    .sort_values(ascending=False)[:30]\
    .index.values
high_vol_key_words_filter = df_keywords['key words'].isin(high_vol_key_words)

data = df_keywords[high_vol_key_words_filter.values]
data = data[data['number of votes']<=1e3]
f, ax = plt.subplots(figsize=(16, 12))
ax.set_title('Votes vs. Key Words', fontsize=12)
sns.stripplot(x= 'number of votes', y='key words', data=data, ax=ax, jitter=1, marker='.', size=4)

Based on the plot above, among movies with at most 1,000 votes, those with the keyword "christmas" tend to receive a relatively higher number of votes.
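
A strip plot is hard to read precisely, so a per-keyword summary can help back up a visual impression like this one. The sketch below is one possible check, run on made-up rows shaped like `df_keywords` (one row per movie and keyword pair); the numbers are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up rows in the shape of `df_keywords`; values are for illustration only.
toy_keywords = pd.DataFrame({
    "key words": ["christmas", "christmas", "murder", "murder", "murder"],
    "number of votes": [400, 600, 50, 150, 900],
})

# Count and median of votes per keyword, highest median first.
summary = (
    toy_keywords.groupby("key words")["number of votes"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The median is used here rather than the mean because vote counts are typically skewed by a few very popular movies; the same pattern should apply to the `genre` column.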

In [None]:
df_genre = explode(df, ['genre'])

In [None]:
high_vol_genre = df_genre.groupby('genre')['genre']\
    .count()\
    .sort_values(ascending=False)[:30]\
    .index.values
high_vol_genre_filter = df_genre['genre'].isin(high_vol_genre)

data = df_genre[high_vol_genre_filter.values]
data = data[data['number of votes']<=1e3]
f, ax = plt.subplots(figsize=(16, 12))
ax.set_title('Votes vs. genre', fontsize=12)
sns.stripplot(x= 'number of votes', y='genre', data=data, ax=ax, jitter=1, marker='.', size=4)

Based on the plot above, among movies with at most 1,000 votes, drama movies tend to receive a relatively higher number of votes.


In [None]:
data = df[df['number of votes'] <= 1e3]
f, ax = plt.subplots(figsize=(32, 8))
ax.set_title('number of votes vs. rating', fontsize=12)
sns.stripplot(x='rating', y='number of votes', data=data, ax=ax, jitter=0.2)

Based on the plot above, among movies with at most 1,000 votes, those rated between 5.5 and 7.5 tend to receive a relatively higher number of votes.
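
One way to quantify this observation is to bin the ratings and compare vote counts per bin. The sketch below uses `pd.cut` on a made-up frame; the bin edges (5.5 and 7.5) mirror the range mentioned above, and the band labels are hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up ratings and vote counts; not taken from the dataset.
toy = pd.DataFrame({
    "rating": [3.2, 5.6, 6.1, 6.9, 7.4, 8.8],
    "number of votes": [20, 300, 450, 700, 520, 60],
})

# Right-closed bins: (0, 5.5], (5.5, 7.5], (7.5, 10].
toy["rating band"] = pd.cut(
    toy["rating"],
    bins=[0, 5.5, 7.5, 10],
    labels=["low", "middle", "high"],  # hypothetical labels
)

# Count and median of votes within each rating band.
band_summary = (
    toy.groupby("rating band", observed=True)["number of votes"]
    .agg(["count", "median"])
)
print(band_summary)
```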
