
<h1 id="Project:-TMDb-Data-Analysis">Project: TMDb Data Analysis<a class="anchor-link" href="#Project:-TMDb-Data-Analysis">¶</a></h1><h2 id="Table-of-Contents">Table of Contents<a class="anchor-link" href="#Table-of-Contents">¶</a></h2>
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>




<p><a id="intro"></a></p>
<h2 id="Introduction">Introduction<a class="anchor-link" href="#Introduction">¶</a></h2><blockquote><p>In this analysis, we look at information about roughly 10,000 movies from The Movie Database (TMDb). We examine which genres were most popular over the years and explore the relationship between a film's popularity and its vote average score.</p>
</blockquote>
<p>Dataset analyzed: TMDb Data</p>
<p>Questions to explore: Which genres were most popular throughout the years? Is there a correlation between popularity and vote average score of a film?</p>


In [None]:

# Import the packages used in this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook. See:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline




<p><a id="wrangling"></a></p>
<h2 id="Data-Wrangling">Data Wrangling<a class="anchor-link" href="#Data-Wrangling">¶</a></h2><blockquote><p>In this section, we load the data, check it for cleanliness, and then trim and clean the dataset for analysis.</p>
</blockquote>
<h3 id="General-Properties">General Properties<a class="anchor-link" href="#General-Properties">¶</a></h3>


In [None]:

# load the TMDb dataset and preview the first rows
df = pd.read_csv('tmdb_movies.csv', sep=',')
df.head()



In [None]:

df.info()



In [None]:

#drop columns not needed
df.drop(['imdb_id', 'id', 'budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', 'production_companies', 'release_date'], axis=1, inplace=True)



In [None]:

#check that this is correct
df.head()



In [None]:

df.info()



In [None]:

#drop all 'missing values' rows
df.dropna(inplace=True)
df.info()
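
<p>Before dropping rows, it can help to see how many values are missing in each column and how many rows the drop removes. The following is a minimal sketch on a made-up toy table (the values are not from the TMDb file); the same calls should apply to <code>df</code>.</p>

```python
import pandas as pd

# Toy stand-in for the TMDb table; the values are made up for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "cast": ["x|y", None, "z", "w"],
    "director": ["d1", "d2", None, "d4"],
    "genres": ["Action|Drama", "Comedy", None, "Drama"],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop incomplete rows and record how many rows were lost.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

<p>Reporting the number of dropped rows makes it easier to judge how much the cleaning step may affect later results.</p>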



In [None]:

df.describe()



In [None]:

#all unique values in genres
df.genres.unique()



In [None]:

# split the pipe-separated genres into one row per genre, indexed by original_title
new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()



In [None]:

# Remove the secondary index level created by stack().
# original_title becomes a regular column (it cannot stay as the index because its values repeat).
new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)



In [None]:

#combine the new_df with the original df

genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)



In [None]:

#drop the old genres column and check

genres_df.drop(['genres'], axis=1, inplace=True)

genres_df.head(5)



In [None]:

genres_df.plot(x='release_year',y='popularity', kind='scatter' );
plt.title('Popularity by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Popularity')



In [None]:

genres_df.plot(x='release_year',y='vote_average', kind='scatter');
plt.title('Vote Average by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Vote Average')



In [None]:

#check datatype
type(genres_df['popularity'][0])



In [None]:

# convert popularity from float to int (astype(int) drops the fractional part)
genres_df['popularity'] = genres_df['popularity'].astype(int)

#check datatype
type(genres_df['popularity'][0])



In [None]:

# convert vote_average from float to int (astype(int) drops the fractional part)
genres_df['vote_average'] = genres_df['vote_average'].astype(int)

#check datatype
type(genres_df['vote_average'][0])
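
<p>Note that <code>astype(int)</code> truncates rather than rounds, so every score below 1 becomes 0. The sketch below uses made-up scores (not values from the dataset) to compare truncation with rounding first; which one is appropriate depends on the analysis.</p>

```python
import pandas as pd

# Made-up scores, not taken from the dataset.
scores = pd.Series([0.4, 0.9, 1.6, 6.9, 7.2])

# astype(int) drops the fractional part.
truncated = scores.astype(int)

# Rounding to the nearest integer first gives a different result.
rounded = scores.round().astype(int)

print(pd.DataFrame({"original": scores, "truncated": truncated, "rounded": rounded}))
```

<p>Keeping the original float column is another option, since the groupby means and the correlation used later do not require integers.</p>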



In [None]:

#find the mean popularity score of each genre with groupby
#(select the numeric column first; the frame also holds text columns)
genres_df.groupby('mgenres')['popularity'].mean()



In [None]:

#find the 25%, 50%, 75%, and max popularity values with Pandas describe
genres_df.describe().popularity



In [None]:

#mean popularity per genre, sorted ascending, keeping the five highest
topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)



In [None]:

topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top Genres over Time')
plt.xlabel('Genres')
plt.ylabel('Popularity')



In [None]:

#mean popularity for each vote average value, sorted ascending
rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df



In [None]:

genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')



In [None]:

np.corrcoef(genres_df.vote_average, genres_df.popularity)




<p><a id="eda"></a></p>
<h2 id="Exploratory-Data-Analysis">Exploratory Data Analysis<a class="anchor-link" href="#Exploratory-Data-Analysis">¶</a></h2><blockquote><p>With the data trimmed and cleaned, this section computes statistics and creates visualizations to address the research questions posed in the Introduction.</p>
</blockquote>
<h3 id="Research-Question-1--What-genres-were-most-popular-over-the-years?">Research Question 1- What genres were most popular over the years?<a class="anchor-link" href="#Research-Question-1--What-genres-were-most-popular-over-the-years?">¶</a></h3>



<h4 id="1.-Separate-the-genres-from-the-genres-column-into-a-new-column,-mgenres">1. Separate the genres from the genres column into a new column, mgenres<a class="anchor-link" href="#1.-Separate-the-genres-from-the-genres-column-into-a-new-column,-mgenres">¶</a></h4><p>Create a new DataFrame from the genres series, indexed by original_title, splitting the pipe-separated genres:</p>
<pre><code>new_df = pd.DataFrame(df.genres.str.split('|').tolist(), index=df.original_title).stack()  </code></pre>



<p>Next, remove the secondary index level. To do this, turn original_title into a regular column (it cannot serve as the index because its values repeat):</p>
<pre><code>new_df = new_df.reset_index([0, 'original_title'])
new_df.columns = ['original_title', 'mgenres']
new_df.head(5)</code></pre>



<p>Combine new_df with the original df:</p>
<pre><code>genres_df= pd.merge(df, new_df, on='original_title')
genres_df.head(5)</code></pre>



<p>Drop the old genres column and check</p>
<pre><code>genres_df.drop(['genres'], axis=1, inplace=True)

genres_df.head(5)</code></pre>
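
<p>One caveat with the steps above: merging on <code>original_title</code> can multiply rows when two different films share a title, because every row of one film is paired with every genre of the other. As a possible alternative (a sketch on made-up data, not the method used in this notebook), <code>DataFrame.explode</code> produces one row per film-genre pair without a merge.</p>

```python
import pandas as pd

# Toy frame with two different films sharing one title (made-up data).
toy = pd.DataFrame({
    "original_title": ["Hamlet", "Hamlet", "Up"],
    "release_year": [1990, 2000, 2009],
    "genres": ["Drama", "Drama|History", "Animation|Comedy|Family"],
})

# Split the pipe-separated string into a list, then give each genre its own row.
genres_long = (
    toy.assign(mgenres=toy["genres"].str.split("|"))
       .explode("mgenres")
       .drop(columns="genres")
       .reset_index(drop=True)
)

# For comparison: the split/stack/merge route, keyed on the title only.
stacked = pd.DataFrame(
    toy["genres"].str.split("|").tolist(), index=toy["original_title"]
).stack().dropna()
stacked = stacked.reset_index(level=1, drop=True).rename("mgenres").reset_index()
merged = toy.merge(stacked, on="original_title")

print(len(genres_long), "rows with explode")
print(len(merged), "rows with merge on title")
```

<p>In this toy case the merge yields extra rows for the duplicated title, while <code>explode</code> keeps exactly one row per film-genre pair.</p>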



<p>Find the mean popularity score of each genre with groupby, selecting the numeric column first because the frame also holds text columns:</p>
<pre><code>genres_df.groupby('mgenres')['popularity'].mean()</code></pre>



<p>Compute the mean popularity per genre and keep the five highest:</p>
<pre><code>topfilms_df = genres_df.groupby('mgenres')['popularity'].mean().sort_values().tail(5)</code></pre>
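
<p>The mean above pools all release years together. To look at the question year by year, one option is to group by both year and genre and then pick the genre with the highest mean in each year. This is a minimal sketch on made-up rows; it assumes a long-format frame with <code>release_year</code>, <code>mgenres</code>, and <code>popularity</code> columns like <code>genres_df</code>.</p>

```python
import pandas as pd

# Made-up long-format rows (one row per film-genre pair).
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2001, 2001, 2001],
    "mgenres": ["Action", "Drama", "Action", "Drama", "Comedy", "Drama"],
    "popularity": [2.0, 1.0, 4.0, 3.0, 0.5, 5.0],
})

# Mean popularity for every (year, genre) pair, as a flat DataFrame.
yearly = toy.groupby(["release_year", "mgenres"], as_index=False)["popularity"].mean()

# Row label of the highest mean within each year, then select those rows.
best_rows = yearly.groupby("release_year")["popularity"].idxmax()
top_per_year = yearly.loc[best_rows].set_index("release_year")["mgenres"]

print(top_per_year)
```

<p>The result is one genre per year, which can then be tabulated or plotted to see how the leading genre changes over time.</p>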



<p>Plot the top 5 genres by mean popularity score:</p>


In [None]:

topfilms_df.plot(kind= 'bar', color='#3caea3')
plt.title('Top Genres over Time')
plt.xlabel('Genres')
plt.ylabel('Popularity')




<h3 id="Research-Question-2--Does-the-Popularity-of-a-movie-correlate-with-the-Vote-Score-Average?">Research Question 2- Does the Popularity of a movie correlate with the Vote Score Average?<a class="anchor-link" href="#Research-Question-2--Does-the-Popularity-of-a-movie-correlate-with-the-Vote-Score-Average?">¶</a></h3>



<p>Compute the mean popularity for each vote average value, sorted in ascending order:</p>
<pre><code>rated_df = genres_df.groupby('vote_average')['popularity'].mean().sort_values()
rated_df</code></pre>



<p>Plot the relationship between Vote Average and Popularity</p>
<pre><code>genres_df.plot(x='vote_average',y='popularity', kind='scatter');</code></pre>


In [None]:

genres_df.plot(x='vote_average',y='popularity', kind='scatter');
plt.title('Vote Average Correlation with Popularity')
plt.xlabel('Vote Average')
plt.ylabel('Popularity')




<p>Compute the correlation coefficient between vote average and popularity:</p>


In [None]:

np.corrcoef(genres_df.vote_average, genres_df.popularity)
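
<p>Because <code>genres_df</code> holds one row per film-genre pair, films with several genres are counted several times in the coefficient above. The sketch below (made-up data) keeps one row per title before correlating and also computes a rank-based (Spearman) coefficient as Pearson on ranks. Deduplicating on the title alone is an assumption for illustration; films that share a title would need an extra key such as the release year.</p>

```python
import pandas as pd

# Made-up long-format rows: film "A" appears twice because it has two genres.
toy = pd.DataFrame({
    "original_title": ["A", "A", "B", "C", "D"],
    "mgenres": ["Action", "Drama", "Drama", "Comedy", "Action"],
    "vote_average": [7.5, 7.5, 6.0, 5.0, 8.0],
    "popularity": [3.0, 3.0, 1.0, 0.5, 9.0],
})

# Keep one row per film so multi-genre films are not counted several times.
films = toy.drop_duplicates(subset="original_title")

# Pearson correlation on the film-level rows.
pearson = films["vote_average"].corr(films["popularity"])

# Spearman correlation is Pearson applied to ranks; it is less sensitive to outliers.
spearman = films["vote_average"].rank().corr(films["popularity"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

<p>Comparing the two coefficients gives a rough sense of whether a few very popular films drive the linear correlation.</p>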




<p><a id="conclusions"></a></p>
<h2 id="Conclusions">Conclusions<a class="anchor-link" href="#Conclusions">¶</a></h2>



<p>Some limitations of the data mean these findings are not conclusive:</p>
<pre><code>1) Missing data
    All rows with missing values in cast, director, or genres were dropped
2) Vote counts were mostly low, which likely skewed the results
    Older titles had far fewer votes, since IMDb was not as widely used (or did not yet exist)
    Titles from more recent years had much more data</code></pre>
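
<p>One way to reduce the effect of low vote counts is to keep only films above a minimum number of votes before computing averages. The sketch below is illustrative only: the data is made up, <code>vote_count</code> is assumed to be the column holding the number of votes, and <code>MIN_VOTES</code> is a hypothetical cut-off.</p>

```python
import pandas as pd

# Made-up rows; `vote_count` is assumed to hold the number of votes per film.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "vote_count": [12, 450, 30, 2100],
    "vote_average": [9.0, 6.8, 3.5, 7.4],
})

MIN_VOTES = 50  # hypothetical cut-off, chosen only for this example

# Keep films whose average is based on at least MIN_VOTES votes.
well_rated = toy[toy["vote_count"] >= MIN_VOTES]

# Share of films that survive the filter, to show how much data is lost.
share_kept = len(well_rated) / len(toy)

print(well_rated)
print(f"kept {share_kept:.0%} of films")
```

<p>Any threshold trades reliability of the averages against sample size, so the share of films kept is worth reporting alongside the results.</p>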



<p>The top 5 genres by mean popularity over the years are:</p>
<pre><code>1) Adventure
2) Science Fiction
3) Fantasy
4) Action
5) Animation</code></pre>



<p>There was a weak positive correlation between a film's popularity and its vote average score.</p>
